# Accessing Different File Types with Pandas

In [None]:
#@title ### Run the following cell to download the necessary files for this lesson { display-mode: "form" } 
#@markdown Don't worry about what's in this collapsed cell

!pip install -q PyYaml
print('Downloading employees.json...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-prod-307050600709/lesson_files/3d1ad53c-8b55-41f0-8772-08c39b437cfb/employees.json -q -O employees.json
print('Downloading employees_2.json...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-prod-307050600709/lesson_files/3d1ad53c-8b55-41f0-8772-08c39b437cfb/employees_2.json -q -O employees_2.json
print('Downloading Salaries.csv...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-prod-307050600709/lesson_files/3d1ad53c-8b55-41f0-8772-08c39b437cfb/Salaries.csv -q -O Salaries.csv
print('Downloading animals.yaml...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-prod-307050600709/lesson_files/3d1ad53c-8b55-41f0-8772-08c39b437cfb/animals.yaml -q -O animals.yaml


## Learning Objectives

- Understand how to read different file types into Pandas
- Understand how to write different file types from Pandas

## Introduction

Pandas offers a variety of methods to read and write data in different file formats. In this lesson, we will explore how to read data from, and write data to, several of them.

As you progress through this notebook, you will see that the syntax for reading and writing different file types is very similar. This is because Pandas follows a consistent naming pattern: `read_` functions (such as `pd.read_csv`) read data into a DataFrame, and `to_` methods (such as `df.to_csv`) write a DataFrame out. The only difference is the file type written after `read_` or `to_`.

In this notebook, we will see how to read and write data from the following file types:

- CSV
- JSON
- YAML

## Reading Data

As mentioned, you can use the `read_` methods to read data from different file types. The only difference is the file type written after the `read_` method.

For example, to read data from a CSV file, you can use the `read_csv` method. To read data from an Excel file, you can use the `read_excel` method. To read data from an XML file, you can use the `read_xml` method. And so on.
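The naming pattern can be tried out without any files on disk. The following is a minimal sketch, assuming a small made-up DataFrame: it writes the data to CSV text with `to_csv` (the matching `to_` method, covered later in this lesson) and reads it back with `read_csv` through an in-memory buffer.

```python
from io import StringIO

import pandas as pd

# A small illustrative DataFrame (made-up data, not from the lesson files)
df = pd.DataFrame({"name": ["Fifi", "Spot"], "age": [3, 5]})

# With no path given, to_csv returns the CSV content as a string
csv_text = df.to_csv(index=False)

# read_csv accepts file-like objects as well as paths
restored = pd.read_csv(StringIO(csv_text))

print(restored)
```

The same pairing applies to the other formats: swap `csv` for the format name on both sides.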

### CSV

To read data from a CSV file, you can use the `read_csv` function. It takes the path to the CSV file as an argument and returns a Pandas DataFrame. Let's see how to read data from a CSV file.


In [1]:
import pandas as pd

df_csv = pd.read_csv('Salaries.csv')

You now have a DataFrame with the data from the CSV file. You can use the `head` method to view the first five rows of the DataFrame.

In [3]:
display(df_csv.head())

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


Note that there is a column called `Id` in the data that numbers the rows just like the index given by pandas (only starting from 1 instead of 0). In these cases, it might be helpful to set the `Id` column as the index of the DataFrame. You can do this by setting the `index_col` argument to the name of the column you want to use as the index. Let's see how to do this.

In [12]:
df_csv = pd.read_csv('Salaries.csv', index_col="Id")
display(df_csv.head())

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,
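One practical benefit of using `Id` as the index is that rows can then be looked up by their `Id` label. The sketch below illustrates this with a three-row excerpt of the data above, held in an in-memory string (an assumption made so the example runs without `Salaries.csv`).

```python
from io import StringIO

import pandas as pd

# A three-row excerpt of the salaries data, kept in memory for illustration
csv_text = """Id,EmployeeName,BasePay
1,NATHANIEL FORD,167411.18
2,GARY JIMENEZ,155966.02
3,ALBERT PARDINI,212739.13
"""

# index_col="Id" makes the Id column the row index
df = pd.read_csv(StringIO(csv_text), index_col="Id")

# Rows can now be selected by their Id label with .loc
name = df.loc[2, "EmployeeName"]
print(name)
```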


Because the result is a DataFrame, you can use all the methods that you have learned so far to manipulate the data. For example, you can use the `describe` method to get a summary of the numeric columns.

In [13]:
df_csv.describe()

Unnamed: 0,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Status
count,676.0,676.0,676.0,0.0,676.0,676.0,676.0,0.0,0.0
mean,149900.019867,30577.231657,23448.147426,,203925.39895,203925.39895,2011.0,,
std,41837.73621,35124.001536,30720.629073,,31798.561906,31798.561906,0.0,,
min,25400.0,0.0,0.0,,180312.67,180312.67,2011.0,,
25%,117268.875,0.0,7006.3175,,185695.47,185695.47,2011.0,,
50%,144042.16,16664.63,16491.805,,194842.065,194842.065,2011.0,,
75%,184727.1325,57995.9625,27305.23,,209450.04,209450.04,2011.0,,
max,294580.02,245131.88,400184.25,,567595.43,567595.43,2011.0,,


### JSON

You can read JSON files into Pandas using the `read_json` method and get a Pandas DataFrame. Let's see how to read data from a JSON file.

In [19]:
df_json = pd.read_json('employees.json')

Again, you now have a DataFrame with the data from the JSON file. Let's take a look at the first rows of the DataFrame.

In [21]:
df_json.head()

Unnamed: 0,userId,jobTitle,firstName,lastName,employeeCode,region,phoneNumber
0,28e0a8ff-3a16-46be-be7e-94dc7f67a7d1,Developer,Albin,Bailey,E1,CA,123456
1,ebeb806d-82ec-4c0e-9a91-328fddec94a3,Developer,Carlos,Diaz,E2,CA,1111111
2,facbd40a-53cd-4357-9dac-e85d8bcc0636,Program Directory,Eugene,Faraday,E3,CA,2222222


In this case, the JSON file has a tidy structure, but that is not always the case. Sometimes, the JSON file has a nested structure. The `employees_2.json` file is structured differently: `read_json` still works, but the resulting DataFrame is not what we want.

In [24]:
df_json_2 = pd.read_json('employees_2.json')
display(df_json_2.head())

Unnamed: 0,Data,Number of Employees,Employees
0,Employees,3,{'userId': '28e0a8ff-3a16-46be-be7e-94dc7f67a7...
1,Employees,3,{'userId': 'ebeb806d-82ec-4c0e-9a91-328fddec94...
2,Employees,3,{'userId': 'facbd40a-53cd-4357-9dac-e85d8bcc06...


Notice that the third column doesn't hold regular values, but rather a dictionary in each row. This is because the JSON file has a nested structure. The `read_json` function can read nested JSON files, but it doesn't flatten them: each nested object ends up as a dictionary inside a single cell. In these cases, you can use the `json_normalize` function to flatten the nested records into a DataFrame with one column per field. Let's check how to do this.

In [28]:
employees_df = pd.json_normalize(df_json_2['Employees'])
display(employees_df.head())

Unnamed: 0,userId,jobTitle,firstName,lastName,employeeCode,region,phoneNumber
0,28e0a8ff-3a16-46be-be7e-94dc7f67a7d1,Developer,Albin,Bailey,E1,CA,123456
1,ebeb806d-82ec-4c0e-9a91-328fddec94a3,Developer,Carlos,Diaz,E2,CA,1111111
2,facbd40a-53cd-4357-9dac-e85d8bcc0636,Program Directory,Eugene,Faraday,E3,CA,2222222


The `json_normalize` function can take a list of dictionaries as an argument. In this case, `df_json_2['Employees']` holds one dictionary per employee. The function returns a flat DataFrame in which each key of those dictionaries becomes a column.
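When the dictionaries themselves contain further dictionaries, `json_normalize` flattens those too and joins the key names with a dot. The following is a minimal sketch; the `contact` field is hypothetical and is not part of `employees_2.json`.

```python
import pandas as pd

# Hypothetical nested payload: each employee has a nested "contact" object
payload = {
    "Data": "Employees",
    "Employees": [
        {"firstName": "Albin", "contact": {"region": "CA", "phone": "123456"}},
        {"firstName": "Carlos", "contact": {"region": "CA", "phone": "1111111"}},
    ],
}

# Flatten the list of employee dictionaries; nested keys become
# dotted column names such as "contact.region"
flat = pd.json_normalize(payload["Employees"])

print(flat.columns.tolist())
print(flat)
```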

### YAML

Pandas doesn't have a function to read data from a YAML file. However, you can use the `safe_load` function from the `yaml` package (PyYAML) to parse the file into Python objects, and then build a DataFrame from them. Let's see how to read data from a YAML file.

In [34]:
import yaml

with open('animals.yaml', 'r') as f:
    animals = yaml.safe_load(f)

print(animals)

{'Animals': [{'name': 'Fifi', 'species': 'Koala'}, {'name': 'Spot', 'species': 'Rabbit'}, {'name': 'Fluffy', 'species': 'Wombat'}]}


Note that the `animals` variable is a dictionary whose only key is `"Animals"` and whose value is a list of dictionaries. Each dictionary in the list represents an animal with its name and species. Let's create a DataFrame with the data.

In [35]:
df_yaml = pd.DataFrame(animals['Animals'])
display(df_yaml.head())

Unnamed: 0,name,species
0,Fifi,Koala
1,Spot,Rabbit
2,Fluffy,Wombat


## Writing Data

You can use the `to_` methods to write data to different file types. The only difference is the file type written after `to_`.

For example, to write data to a CSV file, you can use the `to_csv` method. Let's see how to write data to CSV, JSON, and YAML files.

### CSV

Let's create a sample DataFrame with some data and write it to a CSV file.

In [39]:
sample_df = pd.DataFrame({'column_1': [1, 2, 3, 4, 5], 'column_2': [6, 7, 8, 9, 10], 'column_3': [11, 12, 13, 14, 15]})
display(sample_df.head())

Unnamed: 0,column_1,column_2,column_3
0,1,6,11
1,2,7,12
2,3,8,13
3,4,9,14
4,5,10,15


Now, we can use the `to_csv` method to write the data to a CSV file. The `to_csv` method takes in the path to the CSV file as an argument.

In [40]:
sample_df.to_csv('sample.csv', index=False) # index=False is used to avoid saving the index column. Try removing it and see what happens.
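The effect of `index=False` can be inspected without writing a file, because `to_csv` returns the CSV text as a string when no path is given. This sketch uses a shortened version of the sample data and also shows what happens when a CSV saved with the index is read back.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"column_1": [1, 2], "column_2": [6, 7]})

# Default: the index is written as an extra, unnamed first column
with_index = df.to_csv()

# index=False: only the data columns are written
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header line including the index column
print(without_index.splitlines()[0])  # header line with data columns only

# Reading the first version back turns the saved index into a regular column
reread = pd.read_csv(StringIO(with_index))
print(reread.columns.tolist())
```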

### JSON

You can use the `to_json` method to write data to a JSON file. The `to_json` method takes in the path to the JSON file as an argument. Let's try it out with the sample data we created.

In [43]:
sample_df.to_json('sample.json')
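By default, `to_json` groups the values by column. If a list with one object per row is preferred, the `orient` argument can be set to `"records"`. The sketch below compares the two layouts on a shortened version of the sample data, parsing the returned strings with the standard `json` module so the structure is easy to see.

```python
import json

import pandas as pd

df = pd.DataFrame({"column_1": [1, 2], "column_2": [6, 7]})

# Default layout: {column -> {index -> value}}
by_column = json.loads(df.to_json())

# orient="records": a list with one dictionary per row
by_row = json.loads(df.to_json(orient="records"))

print(by_column)
print(by_row)
```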

### YAML

Just as Pandas doesn't have a way to read YAML files, it doesn't have a way to write them. However, you can use the `safe_dump` function from the `yaml` package to write data to a YAML file. Let's see how to write data to a YAML file.

In [44]:
# First, let's convert the DataFrame to a dictionary

sample_dict = sample_df.to_dict()

# Then, we can write the dictionary to a YAML file

with open('sample.yaml', 'w') as f:
    yaml.safe_dump(sample_dict, f)
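If the YAML output should mirror the one-entry-per-row layout of `animals.yaml`, one option is to convert the DataFrame with `orient="records"` before dumping. The following is a sketch under that assumption; it works on a string rather than a file so that the round trip can be checked directly.

```python
import pandas as pd
import yaml

df = pd.DataFrame({"name": ["Fifi", "Spot"], "species": ["Koala", "Rabbit"]})

# One dictionary per row, e.g. {"name": "Fifi", "species": "Koala"}
records = df.to_dict(orient="records")

# With no stream given, safe_dump returns the YAML document as a string
yaml_text = yaml.safe_dump({"Animals": records})
print(yaml_text)

# Reading it back follows the same steps used for animals.yaml
restored = pd.DataFrame(yaml.safe_load(yaml_text)["Animals"])
print(restored)
```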
    

## Summary

In this notebook, we saw that reading and writing data in different file types follows a very similar pattern. You can check the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) to see all the methods available to read and write data from different file types.

Alternatively, you can write `pd.read_` or `df.to_` in a cell and press `Tab` (or the autocomplete button in your IDE) to see all the methods available to read and write data from different file types.

<p align=center> <img src="images/pandas_files.png" width="500"> <img src="images/pandas_to.png" width="500"></p>

