# Pandas Input/Output

## Objectives

- Explore different methods for reading and writing data using Pandas.
- Understand how to handle various data formats like CSV, JSON, ARFF, and Excel.
- Learn techniques for preprocessing and normalizing data from different sources.

## Background

This notebook provides a comprehensive guide on managing data input and output with Pandas, showcasing the versatility of handling various data formats and the importance of preprocessing for data analysis.

## Datasets Used

- **Automobile Dataset**: From the UCI Machine Learning Repository, illustrating CSV file handling.
- **Higher Education Students Performance Evaluation Dataset**: Demonstrating semicolon-separated values and preprocessing steps.
- **Student Academics Performance Dataset**: An example of ARFF file processing and data decoding.
- **Countries Dataset**: Showcases JSON file manipulation and normalization.
- **Immunotherapy Dataset**: Used for Excel file reading and descriptive statistics.

## Some Dataset Repositories

In [1]:
import pandas as pd

In [2]:
# Controlling the number of columns a DataFrame shows
pd.set_option('display.max_columns', 8)

**UCI Machine Learning Repository**

The UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/index.php] is a set of datasets, domain theories, and data generators mainly used by the machine learning community. It was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine.

For each data set, the UCI Machine Learning Repository has information like:
- Source
- Data Set Information
- Attribute Information
- Relevant Papers

**Data.World**

Data.world [https://data.world/] describes itself as an enterprise data catalog. We can access many datasets there, in a variety of formats.

## Reading data from a `.csv` file

Thanks to their simplicity, Comma-Separated Values (CSV) files are among the most widely used formats for storing and sharing data.

`read_csv()`: reads data from a CSV file and creates a DataFrame object.
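As a minimal offline sketch of what `read_csv()` does with a headerless file, the snippet below parses a tiny in-memory CSV. The sample rows and the name `df_demo` are made up for illustration; only `header`, `names`, and `na_values` are the parameters of interest.

```python
from io import StringIO

import pandas as pd

# A tiny headerless CSV in which "?" marks a missing value (made-up data).
raw_csv = "3,?,alfa-romero\n2,164,audi\n"

df_demo = pd.read_csv(
    StringIO(raw_csv),
    header=None,                                       # the file has no header row
    names=["symboling", "normalized_losses", "make"],  # so we supply the names
    na_values="?",                                     # treat "?" as NaN
)
print(df_demo)
```

Because `"?"` is converted to `NaN`, the `normalized_losses` column is parsed as a numeric column rather than as text.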

We will use the Automobile dataset [https://archive.ics.uci.edu/ml/datasets/automobile] from the UCI Machine Learning Repository [https://archive-beta.ics.uci.edu/]. 

Defining the headers

In [3]:
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

In [4]:
df_auto = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
print(df_auto.shape)
df_auto.head()

(205, 26)


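|   | symboling | normalized_losses | make | fuel_type | ... | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | ... | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 3 | NaN | alfa-romero | gas | ... | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 1 | NaN | alfa-romero | gas | ... | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | gas | ... | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | gas | ... | 5500.0 | 18 | 22 | 17450.0 |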


Saving a local copy

In [5]:
df_auto.to_csv('auto.csv')

Reading the local copy

In [6]:
df2_auto = pd.read_csv('auto.csv')
print(df2_auto.shape)
df2_auto.head()

(205, 27)


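|   | Unnamed: 0 | symboling | normalized_losses | make | ... | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | NaN | alfa-romero | ... | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 1 | 3 | NaN | alfa-romero | ... | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 2 | 1 | NaN | alfa-romero | ... | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 3 | 2 | 164.0 | audi | ... | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 4 | 2 | 164.0 | audi | ... | 5500.0 | 18 | 22 | 17450.0 |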


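The two DataFrames are slightly different: the second one has an extra `Unnamed: 0` column, so its shape is (205, 27) instead of (205, 26).

The `Unnamed: 0` column is the index, which `to_csv()` writes to the file by default. How can we avoid it?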

Saving the dataframe to a csv file without the index column

In [7]:
df_auto.to_csv('auto.csv',index=False)

Reading auto.csv

In [8]:
df2_auto = pd.read_csv('auto.csv')
df2_auto.head()

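|   | symboling | normalized_losses | make | fuel_type | ... | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | ... | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 3 | NaN | alfa-romero | gas | ... | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 1 | NaN | alfa-romero | gas | ... | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | gas | ... | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | gas | ... | 5500.0 | 18 | 22 | 17450.0 |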
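The index round trip can be reproduced without touching the disk. The sketch below uses a small made-up DataFrame (`df_small` is a hypothetical name) and also shows `index_col=0`, an alternative fix on the reading side when the file was already saved with its index.

```python
from io import StringIO

import pandas as pd

df_small = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# With no path argument, to_csv() returns the CSV text as a string.
with_index = df_small.to_csv()                # index written as an unnamed first column
without_index = df_small.to_csv(index=False)  # index left out

reread_extra = pd.read_csv(StringIO(with_index))               # gains "Unnamed: 0"
reread_fixed = pd.read_csv(StringIO(with_index), index_col=0)  # first column becomes the index
reread_clean = pd.read_csv(StringIO(without_index))

print(list(reread_extra.columns))
print(list(reread_fixed.columns))
print(list(reread_clean.columns))
```

Writing with `index=False` is usually the simpler choice when the index carries no information, as is the case for the default integer index here.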


Describing the numerical variables of the dataset

In [9]:
df_auto.describe()

|   | symboling | normalized_losses | wheel_base | length | ... | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 164.0 | 205.0 | 205.0 | ... | 203.0 | 205.0 | 205.0 | 201.0 |
| mean | 0.834146 | 122.0 | 98.756585 | 174.049268 | ... | 5125.369458 | 25.219512 | 30.75122 | 13207.129353 |
| std | 1.245307 | 35.442168 | 6.021776 | 12.337289 | ... | 479.33456 | 6.542142 | 6.886443 | 7947.066342 |
| min | -2.0 | 65.0 | 86.6 | 141.1 | ... | 4150.0 | 13.0 | 16.0 | 5118.0 |
| 25% | 0.0 | 94.0 | 94.5 | 166.3 | ... | 4800.0 | 19.0 | 25.0 | 7775.0 |
| 50% | 1.0 | 115.0 | 97.0 | 173.2 | ... | 5200.0 | 24.0 | 30.0 | 10295.0 |
| 75% | 2.0 | 150.0 | 102.4 | 183.1 | ... | 5500.0 | 30.0 | 34.0 | 16500.0 |
| max | 3.0 | 256.0 | 120.9 | 208.1 | ... | 6600.0 | 49.0 | 54.0 | 45400.0 |


But what if the file does not use commas? It might use semicolons, tabs, slashes, or any other separator. Let's see an example.

We will use the Higher Education Students Performance Evaluation Dataset [https://archive.ics.uci.edu/ml/datasets/Higher+Education+Students+Performance+Evaluation+Dataset] from the UCI Machine Learning Repository [https://archive-beta.ics.uci.edu/]. 

In [10]:
df_hst = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00623/DATA.csv",
                  na_values="?" )
print(df_hst.shape)
df_hst.head()

(145, 1)


|   | STUDENT ID;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23;24;25;26;27;28;29;30;COURSE ID;GRADE |
|---|---|
| 0 | STUDENT1;2;2;3;3;1;2;2;1;1;1;1;2;3;1;2;5;3;2;2... |
| 1 | STUDENT2;2;2;3;3;1;2;2;1;1;1;2;3;2;1;2;1;2;2;2... |
| 2 | STUDENT3;2;2;2;3;2;2;2;2;4;2;2;2;2;1;2;1;2;1;2... |
| 3 | STUDENT4;1;1;1;3;1;2;1;2;1;2;1;2;5;1;2;1;3;1;2... |
| 4 | STUDENT5;2;2;1;3;2;2;1;3;1;4;3;3;2;1;2;4;2;1;1... |


As you can see, all the information is in one column. The separator in this file is `;` instead of `,`, so we have to specify it with the `sep` parameter.

In [11]:
df_hst = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00623/DATA.csv",
                  sep=';', na_values="?" )
print(df_hst.shape)
df_hst.head()

(145, 33)


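|   | STUDENT ID | 1 | 2 | 3 | ... | 29 | 30 | COURSE ID | GRADE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | STUDENT1 | 2 | 2 | 3 | ... | 1 | 1 | 1 | 1 |
| 1 | STUDENT2 | 2 | 2 | 3 | ... | 2 | 3 | 1 | 1 |
| 2 | STUDENT3 | 2 | 2 | 2 | ... | 2 | 2 | 1 | 1 |
| 3 | STUDENT4 | 1 | 1 | 1 | ... | 3 | 2 | 1 | 1 |
| 4 | STUDENT5 | 2 | 2 | 1 | ... | 2 | 2 | 1 | 1 |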
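The effect of `sep` can be checked on a few in-memory lines. This is a small sketch with made-up rows that mimic the semicolon-separated layout above; the tab-separated variant is an added illustration, not part of the dataset.

```python
from io import StringIO

import pandas as pd

# Made-up rows in the same semicolon-separated layout.
raw = "STUDENT ID;1;2;GRADE\nSTUDENT1;2;2;1\nSTUDENT2;2;3;1\n"

one_column = pd.read_csv(StringIO(raw))         # default sep="," finds no commas
separated = pd.read_csv(StringIO(raw), sep=";")  # explicit separator

# The same idea applies to tab-separated text.
tab_raw = raw.replace(";", "\t")
tabbed = pd.read_csv(StringIO(tab_raw), sep="\t")

print(one_column.shape, separated.shape, tabbed.shape)
```

A shape with a single column, as in `one_column`, is usually the first hint that the separator was not recognized.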


Describing the numerical variables of the dataset

In [12]:
df_hst.describe()

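|   | 1 | 2 | 3 | 4 | ... | 29 | 30 | COURSE ID | GRADE |
|---|---|---|---|---|---|---|---|---|---|
| count | 145.0 | 145.0 | 145.0 | 145.0 | ... | 145.0 | 145.0 | 145.0 | 145.0 |
| mean | 1.62069 | 1.6 | 1.944828 | 3.572414 | ... | 3.124138 | 2.724138 | 4.131034 | 3.227586 |
| std | 0.613154 | 0.491596 | 0.537216 | 0.80575 | ... | 1.301083 | 0.916536 | 3.260145 | 2.197678 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 1.0 | 1.0 | 2.0 | 3.0 | ... | 2.0 | 2.0 | 1.0 | 1.0 |
| 50% | 2.0 | 2.0 | 2.0 | 3.0 | ... | 3.0 | 3.0 | 3.0 | 3.0 |
| 75% | 2.0 | 2.0 | 2.0 | 4.0 | ... | 4.0 | 3.0 | 7.0 | 5.0 |
| max | 3.0 | 2.0 | 3.0 | 5.0 | ... | 5.0 | 4.0 | 9.0 | 7.0 |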


## Reading data from an `.arff` file

An `arff` (Attribute-Relation File Format) file is an ASCII text file. The Machine Learning Project at the Computer Science Department of The University of Waikato developed it.

`arff` files have two distinct sections. The first section is the header information, and the second has the data.
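To make the two sections concrete, here is a minimal sketch that parses a hand-written ARFF text with `scipy.io.arff.loadarff`. The relation `toy` and its two attributes are invented for illustration and are unrelated to the dataset used in this notebook.

```python
from io import StringIO

from scipy.io.arff import loadarff

# Header section: @relation and @attribute lines. Data section: rows after @data.
arff_text = """@relation toy
@attribute height numeric
@attribute grade {Good,Pass,Fail}
@data
1.5,Good
1.7,Pass
"""

toy_data, toy_meta = loadarff(StringIO(arff_text))

print(toy_meta.names())   # attribute names from the header
print(toy_meta.types())   # attribute types from the header
print(toy_data["grade"])  # nominal values come back as bytes
```

Note that the nominal column is returned as byte strings, which is why a decoding step is needed before the values look like ordinary text.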

We will use the Student Academics Performance Data Set [https://archive.ics.uci.edu/ml/datasets/Student+Academics+Performance] from the UCI Machine Learning Repository [https://archive-beta.ics.uci.edu/].

For reading an arff file, we need to import:

In [13]:
# For reading an arff file, we need to import
from io import StringIO
import urllib.request
from scipy.io.arff import loadarff

In [14]:
stAcademic_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00467/Sapfile1.arff"
resp = urllib.request.urlopen(stAcademic_url)

In [15]:
data, meta = loadarff(StringIO(resp.read().decode('utf-8')))

`data` contains the data and `meta` contains the metadata

In [16]:
# data contains the data and meta contains the metadata
meta

Dataset: Sapfile1
	ge's type is nominal, range is ('M', 'F')
	cst's type is nominal, range is ('G', 'ST', 'SC', 'OBC', 'MOBC')
	tnp's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	twp's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	iap's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	esp's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	arr's type is nominal, range is ('Y', 'N')
	ms's type is nominal, range is ('Married', 'Unmarried')
	ls's type is nominal, range is ('T', 'V')
	as's type is nominal, range is ('Free', 'Paid')
	fmi's type is nominal, range is ('Vh', 'High', 'Am', 'Medium', 'Low')
	fs's type is nominal, range is ('Large', 'Average', 'Small')
	fq's type is nominal, range is ('Il', 'Um', '10', '12', 'Degree', 'Pg')
	mq's type is nominal, range is ('Il', 'Um', '10', '12', 'Degree', 'Pg')
	fo's type is nominal, range is ('Service', 'Business', 'Retired', 'Farmer', 'Others')
	mo's type is nominal, ran

In [17]:
# meta.names() is the public way to get the attribute names
columns_name = meta.names()
df_st = pd.DataFrame(data, columns=columns_name)
df_st.head()

|   | ge | cst | tnp | twp | ... | ss | me | tt | atd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b'F' | b'G' | b'Good' | b'Good' | ... | b'Govt' | b'Asm' | b'Small' | b'Good' |
| 1 | b'M' | b'OBC' | b'Vg' | b'Vg' | ... | b'Govt' | b'Asm' | b'Average' | b'Average' |
| 2 | b'F' | b'OBC' | b'Good' | b'Good' | ... | b'Govt' | b'Asm' | b'Large' | b'Good' |
| 3 | b'M' | b'MOBC' | b'Pass' | b'Good' | ... | b'Govt' | b'Asm' | b'Average' | b'Average' |
| 4 | b'M' | b'G' | b'Good' | b'Good' | ... | b'Private' | b'Asm' | b'Small' | b'Good' |


`loadarff` returns nominal (string) values as byte strings, so the DataFrame shows `b'F'` instead of `F`; a value such as `2` would likewise appear as `b'2'`. To solve this problem, we go over the object columns and decode them.

Decoding the object columns

In [18]:
# decoding the object columns
for c in df_st.columns:
    if df_st[c].dtype == 'object':
        df_st[c] = df_st[c].str.decode('UTF-8')
df_st.head()

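|   | ge | cst | tnp | twp | ... | ss | me | tt | atd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | F | G | Good | Good | ... | Govt | Asm | Small | Good |
| 1 | M | OBC | Vg | Vg | ... | Govt | Asm | Average | Average |
| 2 | F | OBC | Good | Good | ... | Govt | Asm | Large | Good |
| 3 | M | MOBC | Pass | Good | ... | Govt | Asm | Average | Average |
| 4 | M | G | Good | Good | ... | Private | Asm | Small | Good |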
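The decoding step can also be written without an explicit `dtype` check by selecting the object columns first. This is a sketch on a made-up frame (`df_bytes` is a hypothetical name), assuming every object column holds UTF-8 byte strings.

```python
import pandas as pd

# Made-up frame with one bytes column and one numeric column.
df_bytes = pd.DataFrame({"ge": [b"F", b"M"], "score": [1.0, 2.0]})

# Pick the object columns, then decode each one from bytes to str.
obj_cols = df_bytes.select_dtypes(include="object").columns
df_bytes[obj_cols] = df_bytes[obj_cols].apply(lambda s: s.str.decode("utf-8"))

print(df_bytes)
```

The numeric column is left untouched because it is not selected, so only the byte-string columns are rewritten.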


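Describing the variables of the dataset. All columns are categorical, so `describe()` reports `count`, `unique`, `top`, and `freq` instead of numerical statistics.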

In [19]:
df_st.describe()

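|   | ge | cst | tnp | twp | ... | ss | me | tt | atd |
|---|---|---|---|---|---|---|---|---|---|
| count | 131 | 131 | 131 | 131 | ... | 131 | 131 | 131 | 131 |
| unique | 2 | 5 | 4 | 4 | ... | 2 | 4 | 3 | 3 |
| top | M | OBC | Good | Good | ... | Govt | Eng | Small | Good |
| freq | 72 | 57 | 59 | 65 | ... | 91 | 62 | 78 | 56 |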
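The difference between this summary and the earlier numerical ones comes from the column types. The sketch below, on a made-up frame with one text column and one numeric column, shows both behaviors of `describe()`.

```python
import pandas as pd

df_mixed = pd.DataFrame({
    "grade": ["Good", "Pass", "Good"],
    "score": [3.0, 1.0, 2.0],
})

# On a mixed frame, describe() summarizes only the numeric columns by default.
numeric_summary = df_mixed.describe()

# On a frame with no numeric columns, it falls back to count/unique/top/freq.
categorical_summary = df_mixed[["grade"]].describe()

print(numeric_summary)
print(categorical_summary)
```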


## Reading data from a `.json` file

JavaScript Object Notation (JSON) is another common way of sharing data, especially on the web.

`read_json()`: reads data from a JSON file and creates a DataFrame object.

We will use `countries.json` [https://data.world/dr5hn/country-state-city/workspace/file?filename=countries.json] for the following example.

In [20]:
df_countries = pd.read_json('https://query.data.world/s/6wc2blqdaxd6s2k3xmzmfx3zb72v7l')
print(df_countries.shape)
df_countries.head()

(250, 23)


|   | id | name | iso3 | iso2 | ... | latitude | longitude | emoji | emojiU |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Afghanistan | AFG | AF | ... | 33.0 | 65.0 | 🇦🇫 | U+1F1E6 U+1F1EB |
| 1 | 2 | Aland Islands | ALA | AX | ... | 60.116667 | 19.9 | 🇦🇽 | U+1F1E6 U+1F1FD |
| 2 | 3 | Albania | ALB | AL | ... | 41.0 | 20.0 | 🇦🇱 | U+1F1E6 U+1F1F1 |
| 3 | 4 | Algeria | DZA | DZ | ... | 28.0 | 3.0 | 🇩🇿 | U+1F1E9 U+1F1FF |
| 4 | 5 | American Samoa | ASM | AS | ... | -14.333333 | -170.0 | 🇦🇸 | U+1F1E6 U+1F1F8 |


Sometimes the JSON is nested, and to see it as a flat table we need to normalize it. pandas has a handy function for this called `json_normalize`.
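As a minimal sketch of normalization, the records below nest the coordinates under a `geo` key. This nesting is made up for illustration (the countries file above is already flat), and `pd.json_normalize` turns each nested key into its own column.

```python
import pandas as pd

# Made-up nested records; "geo" is a hypothetical nested object.
records = [
    {"name": "Albania", "iso2": "AL", "geo": {"latitude": 41.0, "longitude": 20.0}},
    {"name": "Algeria", "iso2": "DZ", "geo": {"latitude": 28.0, "longitude": 3.0}},
]

flat = pd.json_normalize(records)                  # nested keys joined with "."
flat_underscore = pd.json_normalize(records, sep="_")  # custom joiner

print(list(flat.columns))
print(list(flat_underscore.columns))
```

The `sep` argument only changes how nested key names are joined, which can be convenient when dots in column names get in the way of attribute access.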

Describing the numerical variables of the dataset

In [21]:
df_countries.describe()

|   | id | numeric_code | region_id | subregion_id | latitude | longitude |
|---|---|---|---|---|---|---|
| count | 250.0 | 250.0 | 248.0 | 247.0 | 250.0 | 250.0 |
| mean | 125.5 | 435.804 | 2.729839 | 10.744939 | 16.402597 | 13.52387 |
| std | 72.312977 | 254.38354 | 1.348112 | 6.056956 | 26.757204 | 73.45152 |
| min | 1.0 | 4.0 | 1.0 | 1.0 | -74.65 | -176.2 |
| 25% | 63.25 | 219.0 | 2.0 | 6.0 | 1.0 | -49.75 |
| 50% | 125.5 | 436.0 | 3.0 | 11.0 | 16.083333 | 17.0 |
| 75% | 187.75 | 653.5 | 4.0 | 16.0 | 39.0 | 48.75 |
| max | 250.0 | 926.0 | 6.0 | 22.0 | 78.0 | 178.0 |


Saving data to a json file

In [22]:
df_countries.to_json('countries.json')
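`to_json()` supports several layouts through its `orient` parameter. The sketch below compares the default column-oriented layout with `orient="records"` on a small made-up frame; calling `to_json()` without a path returns the JSON text instead of writing a file.

```python
import json

import pandas as pd

df_small = pd.DataFrame({"name": ["Albania", "Algeria"], "latitude": [41.0, 28.0]})

as_columns = df_small.to_json()                  # default: {column: {index: value}}
as_records = df_small.to_json(orient="records")  # list with one object per row

print(as_columns)
print(as_records)

# Parse the text back to check the structure of each layout.
parsed_columns = json.loads(as_columns)
parsed_records = json.loads(as_records)
```

The records layout is often the easier one to consume from other tools, since each row is a self-contained object.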

## Reading data from a `.xlsx` file

Microsoft Excel is another way to store and share datasets. 

`read_excel()`: reads data from xlsx files and creates a DataFrame object.

Let's use the Immunotherapy Dataset [https://archive.ics.uci.edu/ml/datasets/Immunotherapy+Dataset] from the UCI Machine Learning Repository [https://archive-beta.ics.uci.edu/].

In [23]:
# pandas relies on the openpyxl package to read .xlsx files, so it must be installed
import openpyxl

In [24]:
df_immuno = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00428/Immunotherapy.xlsx')
print(df_immuno.shape)
df_immuno.head()

(90, 8)


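|   | sex | age | Time | Number_of_Warts | Type | Area | induration_diameter | Result_of_Treatment |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22 | 2.25 | 14 | 3 | 51 | 50 | 1 |
| 1 | 1 | 15 | 3.0 | 2 | 3 | 900 | 70 | 1 |
| 2 | 1 | 16 | 10.5 | 2 | 1 | 100 | 25 | 1 |
| 3 | 1 | 27 | 4.5 | 9 | 3 | 80 | 30 | 1 |
| 4 | 1 | 20 | 8.0 | 6 | 1 | 45 | 8 | 1 |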
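Writing works the same way in the other direction with `to_excel()`. The following sketch round-trips a small made-up frame through an in-memory `.xlsx` buffer, assuming the `openpyxl` package is installed; the column names merely echo the dataset above.

```python
from io import BytesIO

import pandas as pd

df_demo = pd.DataFrame({"sex": [1, 2], "age": [22, 15], "Time": [2.25, 3.0]})

# Write the workbook into memory instead of a file on disk.
buffer = BytesIO()
df_demo.to_excel(buffer, index=False, engine="openpyxl")
buffer.seek(0)  # rewind so read_excel starts at the beginning

df_back = pd.read_excel(buffer, engine="openpyxl")
print(df_back)
```

As with `to_csv()`, passing `index=False` keeps the index out of the sheet, so no extra unnamed column appears when reading it back.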


Describing the numerical variables of the dataset

In [25]:
df_immuno.describe()

|   | sex | age | Time | Number_of_Warts | Type | Area | induration_diameter | Result_of_Treatment |
|---|---|---|---|---|---|---|---|---|
| count | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 |
| mean | 1.544444 | 31.044444 | 7.230556 | 6.144444 | 1.711111 | 95.7 | 14.333333 | 0.788889 |
| std | 0.500811 | 12.235435 | 3.098166 | 4.212238 | 0.824409 | 136.614643 | 17.217707 | 0.410383 |
| min | 1.0 | 15.0 | 1.0 | 1.0 | 1.0 | 6.0 | 2.0 | 0.0 |
| 25% | 1.0 | 20.25 | 5.0 | 2.0 | 1.0 | 35.5 | 5.0 | 1.0 |
| 50% | 2.0 | 28.5 | 7.75 | 6.0 | 1.0 | 53.0 | 7.0 | 1.0 |
| 75% | 2.0 | 41.75 | 9.9375 | 8.75 | 2.0 | 80.75 | 9.0 | 1.0 |
| max | 2.0 | 56.0 | 12.0 | 19.0 | 3.0 | 900.0 | 70.0 | 1.0 |


## Conclusions

Key Takeaways:
- Pandas supports several data formats, enabling efficient data loading, manipulation, and saving.
- Some preprocessing steps are crucial for preparing datasets for analysis, such as setting column headers, handling missing values, and decoding byte strings.
- The ability to handle different separators and the normalization of JSON data illustrate Pandas' flexibility in data preprocessing.
- Descriptive statistics functions provide quick insights into numerical variables, aiding in the initial analysis phase.

## References

- VanderPlas, J. (2017). Python Data Science Handbook: Essential Tools for Working with Data. O’Reilly Media, Inc. Chapter 3.