# **Data Wrangling and Exploratory Data Analysis**

This Jupyter Notebook explores structured data stored in TSV (tab-separated values) and JSON (JavaScript Object Notation) files. It checks file sizes and line counts, previews raw entries, and loads the data into pandas DataFrames for easier access. The aim is to demonstrate basic file operations, data parsing, and preliminary data exploration on file formats that are common in data science and analytics.

***

## **1. Import Libraries**

In [1]:
import os
import json
import pandas as pd

***

## **2. Handling TSV Data**

### 2.1 ~ File Information

Display the size of the TSV file that contains restaurant data.

In [2]:
restaurants_tsv_file = os.path.join("../data/", "restaurants.tsv")
print(f"File size: {os.path.getsize(restaurants_tsv_file) / 1e6} MB")

File size: 12.773724 MB
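The raw division above prints six decimal places, which is more precision than a quick size check usually needs. A minimal sketch of a rounded formatter follows. `format_size` is a hypothetical helper, not part of the notebook, and it keeps the decimal convention used above (1 MB = 10^6 bytes).

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count as decimal megabytes with two decimal places."""
    return f"{num_bytes / 1e6:.2f} MB"


# 12773724 bytes is the size implied by the output above (12.773724 MB).
size_label = format_size(12_773_724)
print(f"File size: {size_label}")
```

In the notebook, the same helper could be called as `format_size(os.path.getsize(restaurants_tsv_file))`.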


### 2.2 ~ Count Lines

Count and print the number of lines in the file, including the header line.

In [3]:
with open(restaurants_tsv_file, "r") as file:
    # Iterate lazily so the whole file is never held in memory at once.
    line_count = sum(1 for _ in file)
print(f"Number of lines: {line_count}")

Number of lines: 53974
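The line count is not quite the record count, because the first line of the file is a header. The sketch below shows the adjustment on a small in-memory sample (the sample content is hypothetical). It assumes exactly one header line and no fields with embedded newlines; if either assumption fails, the subtraction is only an approximation.

```python
import io

# Hypothetical stand-in for the TSV file: one header line, two records.
sample = io.StringIO("business_id\tbusiness_name\n835\tKam Po Kitchen\n905\tCafe\n")

line_count = sum(1 for _ in sample)
# Subtract the header line; clamp at zero so an empty file gives 0.
record_count = max(line_count - 1, 0)
print(f"Lines: {line_count}, records: {record_count}")
```

Under the same assumptions, the 53974 lines reported above would correspond to 53973 records.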


### 2.3 ~ Display Sample Data

Print the first 10 lines of the file to preview the data.

In [4]:
with open(restaurants_tsv_file, "r") as file:
    print("First 10 lines of the TSV file:")
    for i in range(10):
        print(file.readline().strip())

First 10 lines of the TSV file:
business_id	business_name	business_address	business_city	business_state	business_postal_code	business_latitude	business_longitude	business_location	business_phone_number	inspection_id	inspection_date	inspection_score	inspection_type	violation_id	violation_description	risk_category	Neighborhoods	SF Find Neighborhoods	Current Police Districts	Current Supervisor Districts	Analysis Neighborhoods
0	835	Kam Po Kitchen	801 Broadway St	San Francisco	CA	94133	37.797223	-122.410513	POINT (-122.410513 37.797223)		835_20180917	09/17/2018 12:00:00 AM	88.0	Routine - Unscheduled	835_20180917_103139	Improper food storage	Low Risk	107.0	107.0	6.0	3.0	6.0
1	905	Working Girls' Cafe'	0259 Kearny St	San Francisco	CA	94108	37.790477	-122.404033	POINT (-122.404033 37.790477)		905_20190415	04/15/2019 12:00:00 AM	87.0	Routine - Unscheduled	905_20190415_103114	High risk vermin infestation	High Risk	19.0	19.0	6.0	3.0	8.0
2	1203	TAWAN'S THAI FOOD	4403 GEARY Blvd	San Francisco	CA	94
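In the preview above, the data rows appear to carry one more leading field (a row index) than the header names account for. One possible cause is that the header line starts with an empty field and `strip()` removes that leading tab along with the trailing newline. The sketch below is an alternative preview that removes only the newline and stops cleanly if the file has fewer lines than requested. `preview_lines` is a hypothetical helper and the sample text is made up for illustration.

```python
import io
from itertools import islice


def preview_lines(handle, n=10):
    """Return up to n lines from a file-like object, keeping leading tabs."""
    # rstrip("\n") drops only the line ending, so an empty first field survives.
    return [line.rstrip("\n") for line in islice(handle, n)]


# Hypothetical sample whose header begins with an empty (unnamed) field.
sample = io.StringIO("\tbusiness_id\tbusiness_name\n0\t835\tKam Po Kitchen\n")
lines = preview_lines(sample, n=10)
for line in lines:
    print(repr(line))
```

Printing with `repr` makes tabs visible as `\t`, which helps when checking column alignment by eye.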

### 2.4 ~ Load Data into DataFrame

Load the restaurant data into a pandas DataFrame and display the first few rows.

In [5]:
restaurants_df = pd.read_csv(restaurants_tsv_file, delimiter='\t')
print(restaurants_df.head())

   Unnamed: 0  business_id         business_name    business_address  \
0           0          835        Kam Po Kitchen     801 Broadway St   
1           1          905  Working Girls' Cafe'      0259 Kearny St   
2           2         1203     TAWAN'S THAI FOOD     4403 GEARY Blvd   
3           3         1345           Cordon Bleu  1574 California St   
4           4         1352           LA TORTILLA     495 Castro St B   

   business_city business_state business_postal_code  business_latitude  \
0  San Francisco             CA                94133          37.797223   
1  San Francisco             CA                94108          37.790477   
2  San Francisco             CA                94118          37.780834   
3  San Francisco             CA                94109          37.790683   
4  San Francisco             CA                94114          37.760954   

   business_longitude              business_location  ...  inspection_score  \
0         -122.410513  POINT (-122.41
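The `Unnamed: 0` column in the output above suggests that the first field of the header is empty and holds a saved row index. A minimal sketch, assuming that reading is right, is to pass `index_col=0` so pandas uses that field as the index, and to parse the date column with an explicit format. The sample below is a hypothetical excerpt using values from the preview; it does not read the real file.

```python
import io

import pandas as pd

# Hypothetical excerpt; the header starts with an empty field, which is
# what produces an "Unnamed: 0" column when the file is read naively.
sample_tsv = (
    "\tbusiness_id\tbusiness_name\tinspection_date\n"
    "0\t835\tKam Po Kitchen\t09/17/2018 12:00:00 AM\n"
    "1\t905\tWorking Girls' Cafe'\t04/15/2019 12:00:00 AM\n"
)

# index_col=0 turns the unnamed first field into the row index.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", index_col=0)

# Parse the 12-hour timestamps explicitly instead of relying on inference.
sample_df["inspection_date"] = pd.to_datetime(
    sample_df["inspection_date"], format="%m/%d/%Y %I:%M:%S %p"
)

print(sample_df.shape)
print(sample_df.dtypes)
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the column count and types match expectations before any analysis.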

***

## **3. Handling JSON Data**

### 3.1 ~ File Information

Display the size of the JSON file containing COVID-19 confirmed cases.

In [6]:
covid_file = os.path.join("../data/", "confirmed-cases.json")
print(f"File size: {os.path.getsize(covid_file) / 1e6} MB")

File size: 0.116367 MB


### 3.2 ~ Count Lines

Count and print the number of lines in the file.

In [7]:
with open(covid_file, "r") as file:
    # Iterate lazily so the whole file is never held in memory at once.
    line_count = sum(1 for _ in file)
print(f"Number of lines: {line_count}")

Number of lines: 1110
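For JSON, the line count mostly reflects how the file was formatted on disk, not how many records it holds. The sketch below illustrates this with a small hypothetical document serialized two ways. Counting the entries in the parsed structure is the more direct measure; the `data` key here mirrors the layout of the COVID file, but the values are made up.

```python
import json

# Hypothetical document with three records under a "data" key.
document = {"data": [[1], [2], [3]]}

compact_text = json.dumps(document)           # everything on one line
pretty_text = json.dumps(document, indent=2)  # one token per line


def count_lines(text: str) -> int:
    """Count the lines in a string."""
    return len(text.splitlines())


compact_lines = count_lines(compact_text)
pretty_lines = count_lines(pretty_text)

# The record count is the same however the text was laid out.
compact_records = len(json.loads(compact_text)["data"])
pretty_records = len(json.loads(pretty_text)["data"])
print(compact_lines, pretty_lines, compact_records, pretty_records)
```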


### 3.3 ~ Load and Explore JSON Data

Load the JSON data and explore its structure.

In [8]:
with open(covid_file, "rb") as file:
    covid_json = json.load(file)
    print("Data type of the loaded JSON:", type(covid_json))
    print("Keys in the top-level JSON object:", covid_json.keys())
    print("Keys under the 'meta' key:", covid_json['meta'].keys())
    print("Further nested keys under 'meta'->'view':", covid_json['meta']['view'].keys())
    print("Description of the data:", covid_json['meta']['view']['description'])

Data type of the loaded JSON: <class 'dict'>
Keys in the top-level JSON object: dict_keys(['meta', 'data'])
Keys under the 'meta' key: dict_keys(['view'])
Further nested keys under 'meta'->'view': dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])
Description of the data: Counts of confirmed COVID-19 cases among Berkeley residents by date.
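Printing keys one level at a time works, but it gets repetitive for deeply nested documents. A minimal sketch of a recursive summary is shown below. `describe_structure` is a hypothetical helper, and the sample dictionary only imitates the `meta`/`view`/`data` layout seen above.

```python
def describe_structure(obj, depth=0, max_depth=2):
    """Return indented 'key: type' lines for nested dicts, down to max_depth."""
    lines = []
    indent = "  " * depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(describe_structure(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Summarize lists by length and first element type instead of expanding.
        lines.append(f"{indent}[{len(obj)} items, first is {type(obj[0]).__name__}]")
    return lines


# Hypothetical sample imitating the layout of the loaded JSON.
sample_json = {
    "meta": {"view": {"description": "example", "columns": [{"name": "sid"}]}},
    "data": [["row-a", 1]],
}

structure = describe_structure(sample_json)
print("\n".join(structure))
```

In the notebook, calling `describe_structure(covid_json)` would give a compact overview without printing the records themselves.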


### 3.4 ~ Create DataFrame from JSON

Build a pandas DataFrame from the rows under the `data` key, taking the column names from the metadata under `meta` -> `view` -> `columns`. Display the last few rows.

In [9]:
covid_df = pd.DataFrame(
    covid_json['data'],
    columns=[col['name'] for col in covid_json['meta']['view']['columns']]
)
print(covid_df.tail())

                    sid                                    id  position  \
699  row-49b6_x8zv.gyum  00000000-0000-0000-A18C-9174A6D05774         0   
700  row-gs55-p5em.y4v9  00000000-0000-0000-F41D-5724AEABB4D6         0   
701  row-3pyj.tf95-qu67  00000000-0000-0000-BEE3-B0188D2518BD         0   
702  row-cgnd.8syv.jvjn  00000000-0000-0000-C318-63CF75F7F740         0   
703  row-qywv_24x6-237y  00000000-0000-0000-FE92-9789FED3AA20         0   

     created_at created_meta  updated_at updated_meta meta  \
699  1643733903         None  1643733903         None  { }   
700  1643733903         None  1643733903         None  { }   
701  1643733903         None  1643733903         None  { }   
702  1643733903         None  1643733903         None  { }   
703  1643733903         None  1643733903         None  { }   

                    Date New Cases Cumulative Cases  
699  2022-01-27T00:00:00       106            10694  
700  2022-01-28T00:00:00       223            10917  
701  2022-01-2
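Most of the columns above are export metadata, and values loaded from JSON may arrive as strings. The sketch below shows one way to keep only the analytical columns and convert them to proper types. It assumes the count columns are stored as strings, which has not been verified for the real file. The two rows reuse values visible in the output above, and the `sid` values are placeholders.

```python
import pandas as pd

# Hypothetical miniature of the loaded JSON, with a reduced column list.
sample_json = {
    "meta": {"view": {"columns": [
        {"name": "sid"}, {"name": "Date"},
        {"name": "New Cases"}, {"name": "Cumulative Cases"},
    ]}},
    "data": [
        ["row-a", "2022-01-27T00:00:00", "106", "10694"],
        ["row-b", "2022-01-28T00:00:00", "223", "10917"],
    ],
}

columns = [col["name"] for col in sample_json["meta"]["view"]["columns"]]
df = pd.DataFrame(sample_json["data"], columns=columns)

# Convert ISO timestamps and string counts to datetime and numeric types.
df["Date"] = pd.to_datetime(df["Date"])
for name in ["New Cases", "Cumulative Cases"]:
    df[name] = pd.to_numeric(df[name])

# Keep only the columns that carry the actual measurements.
tidy = df[["Date", "New Cases", "Cumulative Cases"]]

# Sanity check: the day-over-day change in the cumulative total
# should equal the new cases reported for that day.
daily_change = tidy["Cumulative Cases"].diff().iloc[1]
print(tidy.dtypes)
print(daily_change)
```

The final check is a cheap consistency test: if the difference in cumulative cases ever disagrees with the reported new cases, the data may contain gaps or corrections.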

### 3.5 ~ Check Column Data Type

Print the data type of the `sid` column in the COVID DataFrame.

In [10]:
print("Data type of 'sid' column in the COVID DataFrame:", covid_df["sid"].dtype)

Data type of 'sid' column in the COVID DataFrame: object
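Checking one column at a time scales poorly. A short sketch for flagging every non-numeric column at once is shown below, using `pandas.api.types.is_numeric_dtype`. The DataFrame is a hypothetical stand-in, and whether the real count columns are stored as strings is an assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in: an identifier, an integer column, and counts as strings.
sample_df = pd.DataFrame(
    {"sid": ["row-a", "row-b"], "position": [0, 0], "New Cases": ["106", "223"]}
)

# Collect the names of columns that pandas does not treat as numeric.
non_numeric = [name for name in sample_df.columns if not is_numeric_dtype(sample_df[name])]
print(non_numeric)
```

Columns that appear in this list but are expected to hold numbers are candidates for conversion with `pd.to_numeric`.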
