# Working with different file formats

As a data professional, you will deal with many types of data, so it's important to learn
how to handle the file formats they come in.

# Working with JSON files

JavaScript Object Notation (JSON) has become one of the most widely used formats for exchanging
information across the internet.

### What the JSON format looks like
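The cell below is a minimal sketch of the syntax, using a hypothetical record (the names and values are invented for illustration and are not part of this notebook's data). It builds a Python dictionary and prints it as indented JSON so that each JSON value type is visible: string, number, boolean, null, array, and nested object.

```python
import json

# Hypothetical record, invented purely to show each JSON value type
record = {
    "name": "Example Person",             # string
    "age": 42,                            # number
    "active": True,                       # boolean, written as true in JSON
    "manager": None,                      # written as null in JSON
    "skills": ["python", "sql"],          # array
    "address": {"city": "Example City"},  # nested object
}

# indent=4 pretty-prints the JSON text with one key per line
json_text = json.dumps(record, indent=4)
print(json_text)
```

In the printed text, keys are always double-quoted strings, and Python's True and None appear as true and null.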


## Writing JSON files

In [1]:
import json

In [2]:
data = {
    "president": {
        "name" : "Zaphod Beedlebrox",
        "species": "Betelgeusian"
    }
}

In [3]:
with open("data_file.json", "w") as write_file:
    json.dump(data, write_file)

Note that dump() takes two positional arguments:
    1. the data object to serialize (for example, a dictionary or a list), and
    2. the file-like object to which the JSON text will be written.
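As a small sketch of the "file-like object" idea, the cell below passes an in-memory io.StringIO buffer to json.dump() instead of a real file, and compares the result with json.dumps(), which returns the same JSON text as a string. The buffer and the indent=4 setting are illustrative choices, not something the cells above rely on.

```python
import io
import json

data = {
    "president": {
        "name": "Zaphod Beedlebrox",
        "species": "Betelgeusian"
    }
}

# Any object with a write() method works as the second argument
buffer = io.StringIO()
json.dump(data, buffer, indent=4)
written_text = buffer.getvalue()

# dumps() (with an "s") skips the file and returns the text directly
as_string = json.dumps(data, indent=4)

print(written_text)
print(written_text == as_string)
```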

## Reading JSON files

In [4]:
with open("data_file.json", "r") as read_file:
    data = json.load(read_file)

In [5]:
type(data)

dict

In [6]:
data

{'president': {'name': 'Zaphod Beedlebrox', 'species': 'Betelgeusian'}}

## Reading JSON as DataFrame in Pandas

In [7]:
import pandas as pd
from io import StringIO
jsonStr = '''{"Index0":{"Courses": "Pandas","Discount": "1200"},
              "Index1":{"Courses": "Hadoop","Discount":"1400"},
              "Index2":{"Courses": "Spark","Discount":"1900"}
              }'''
# Convert JSON to a DataFrame using read_json()
df2 = pd.read_json(StringIO(jsonStr), orient='index')
print(df2)

       Courses  Discount
Index0  Pandas      1200
Index1  Hadoop      1400
Index2   Spark      1900


### Convert a dict to a DataFrame

In [8]:
import pandas as pd

df3 = pd.DataFrame.from_dict(data, orient = 'columns')

In [9]:
df3

                 president
name     Zaphod Beedlebrox
species       Betelgeusian
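With orient='columns', the outer key (president) becomes the column and the inner keys become the index. As a small sketch of the alternative, orient='index' turns each outer key into a row instead. The dictionary is repeated so the cell runs on its own, and the name df_rows is just an illustrative choice.

```python
import pandas as pd

data = {
    "president": {
        "name": "Zaphod Beedlebrox",
        "species": "Betelgeusian"
    }
}

# Each outer key becomes a row label; the inner keys become the columns
df_rows = pd.DataFrame.from_dict(data, orient="index")
print(df_rows)
```

This layout is usually the more convenient one when every outer key describes one record.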


## Working with CSV files

A comma-separated values (CSV) file is a plain text file that uses a specific structure to arrange tabular data.
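To make that structure concrete, here is a minimal sketch that parses CSV text with Python's built-in csv module. The two-row sample is modeled on the hrdata.csv rows used in this notebook; the exact contents of the real file are an assumption.

```python
import csv
from io import StringIO

# Sample text modeled on hrdata.csv (assumed layout): a header line, then one record per line
csv_text = """Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.0,10
John Cleese,06/01/15,65000.0,8
"""

# DictReader uses the header line as the keys for every following row
reader = csv.DictReader(StringIO(csv_text))
rows = list(reader)

for row in rows:
    print(row["Name"], row["Hire Date"], row["Salary"])

# Every value comes back as a string, including the numeric-looking ones
print(type(rows[0]["Salary"]))
```

Because the csv module leaves every field as text, type conversion is a separate step, which pandas handles for us during read_csv().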

In [10]:
df = pd.read_csv('hrdata.csv',index_col = 'Name')

In [11]:
df.head()

                Hire Date   Salary  Sick Days remaining
Name
Graham Chapman   03/15/14  50000.0                   10
John Cleese      06/01/15  65000.0                    8
Eric Idle        05/12/14  45000.0                   10
Terry Jones      11/01/13  70000.0                    3
Terry Gilliam    08/12/14  48000.0                    7


In [12]:
# date_format replaces the deprecated date_parser argument (pandas 2.0 and later)
df = pd.read_csv('hrdata.csv', index_col='Name', parse_dates=['Hire Date'], date_format='%m/%d/%y')
df


                 Hire Date   Salary  Sick Days remaining
Name
Graham Chapman  2014-03-15  50000.0                   10
John Cleese     2015-06-01  65000.0                    8
Eric Idle       2014-05-12  45000.0                   10
Terry Jones     2013-11-01  70000.0                    3
Terry Gilliam   2014-08-12  48000.0                    7
Michael Palin   2013-05-23  66000.0                    8
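An alternative sketch, which should behave the same across pandas versions, is to read the column as plain text first and convert it afterwards with pd.to_datetime(). The in-memory sample stands in for hrdata.csv and its layout is an assumption based on the rows shown in this notebook.

```python
from io import StringIO

import pandas as pd

# Sample text standing in for hrdata.csv (assumed layout)
csv_text = """Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.0,10
John Cleese,06/01/15,65000.0,8
"""

df = pd.read_csv(StringIO(csv_text), index_col="Name")

# Convert the text column to datetimes, stating the month/day/two-digit-year format explicitly
df["Hire Date"] = pd.to_datetime(df["Hire Date"], format="%m/%d/%y")

print(df.dtypes)
print(df)
```

Stating the format explicitly avoids ambiguity between month-first and day-first dates.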


## Working with Excel Files
To work with Excel files in Python, we can use the openpyxl package.

In [13]:
pip install openpyxl

Note: you may need to restart the kernel to use updated packages.





In [14]:
# A bare shell command is a SyntaxError in a Python cell; use the %pip magic instead
%pip install --upgrade pip

In [None]:
from openpyxl import Workbook     # Workbook is a class in openpyxl

workbook = Workbook()             # create an instance of the Workbook class
sheet = workbook.active           # active worksheet (usually the first sheet) in the workbook

sheet['A1'] = "hello"
sheet['B1'] = "World"
sheet['C1'] = "Your Brand new data engineer has arrived"

workbook.save(filename = "hello_world.xlsx")

In [None]:
# Reading an Excel file

from openpyxl import load_workbook          # load_workbook opens and reads existing Excel .xlsx files
workbook = load_workbook(filename="sample-xlsx-file.xlsx")
workbook.sheetnames

['Sheet 1']

In [None]:
sheet = workbook.active

In [None]:
sheet

In [None]:
sheet.title

In [None]:
sheet['A1']

In [None]:
sheet['A4'].value

In [None]:
sheet.cell(row=5,column=4)

In [None]:
sheet.cell(row=5,column=4).value

In [None]:
sheet.cell(row=3 , column=3 ).value

In [None]:
sheet["A1:C2"]

In [None]:
for row in sheet:
    print(row)

In [None]:
for row in sheet.iter_rows(values_only=True):
    print(row)
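Since the cells above depend on sample-xlsx-file.xlsx, here is a self-contained sketch of the same access patterns that builds a tiny workbook in memory instead. The sheet contents and the use of a BytesIO buffer in place of a file on disk are illustrative assumptions.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Build a small workbook; append() adds one row at a time
workbook = Workbook()
sheet = workbook.active
sheet.append(["Name", "Sick Days remaining"])
sheet.append(["Graham Chapman", 10])
sheet.append(["John Cleese", 8])

# Save to an in-memory buffer and load it back, as we would with a file
buffer = BytesIO()
workbook.save(buffer)
buffer.seek(0)
loaded_sheet = load_workbook(buffer).active

# Two ways to reach the same cell: by address, or by 1-based row and column
by_address = loaded_sheet["A2"].value
by_position = loaded_sheet.cell(row=2, column=1).value
print(by_address, by_position)

# values_only=True yields plain tuples of values instead of Cell objects
rows = list(loaded_sheet.iter_rows(values_only=True))
for row in rows:
    print(row)
```

Note that row and column numbers in cell() start at 1, not 0.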

# That's all, bye bye

# Read Excel file as DataFrame using Pandas

In [None]:
import pandas as pd
excel_df = pd.read_excel('sample-xlsx-file.xlsx')

In [None]:
excel_df

In [None]:
# To take this further, write the DataFrame back out to a new Excel file

excel_df.to_excel('sample-xlsx-file-modified.xlsx')

# to_excel() is a pandas method that writes a DataFrame to an Excel file.


# Working with AVRO files

.avro files differ from the other file types covered so far.

Why they are different:
an .avro file stores the schema of its data in the file header.

Explanation:
text formats such as CSV store no type information, so every value, whether a date or an integer,
starts out as a string. The types then have to be inferred when the file is loaded or converted
explicitly, as we did earlier with the Hire Date column. JSON and XLSX carry only limited type
information (JSON, for example, has no date type).

This is where .avro comes to the rescue: the schema declares the type of every field, such as int, boolean, or string, along with logical types such as date, so we don't have to convert anything explicitly after reading the file.
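Avro schemas are themselves written in JSON. The cell below is a sketch of what a schema for the HR records from the CSV section might look like; the record name and field names are hypothetical, and the cell only builds and prints the schema with the standard json module rather than writing an actual .avro file.

```python
import json

# Hypothetical Avro schema for the HR records; all names here are illustrative
schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        # Avro stores a date as an int (days since 1970-01-01) tagged with a logical type
        {"name": "hire_date", "type": {"type": "int", "logicalType": "date"}},
        {"name": "salary", "type": "double"},
        {"name": "sick_days_remaining", "type": "int"},
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

A reader that finds this schema in the file header knows each field's type without any guessing or manual conversion.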

In [None]:
pip install avro-python3