# Pandas Input/Output

In [1]:
# !pip install numpy
# !pip install pandas

In [2]:
#import numpy as np
import pandas as pd
import urllib

In [3]:
# Controlling the number of columns a DataFrame shows
pd.set_option('display.max_columns', 6)
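
`pd.set_option()` changes the display setting for the whole session. As a minimal sketch, the same limit can be applied temporarily with `pd.option_context()`; the DataFrame `wide` below is a made-up example and is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical one-row DataFrame with ten columns, used only for illustration
wide = pd.DataFrame([range(10)], columns=list("abcdefghij"))

# Inside the block at most 6 columns are shown; the rest collapse into "..."
with pd.option_context("display.max_columns", 6):
    truncated = repr(wide)

print(truncated)
```

The previous display setting is restored when the `with` block ends.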

**UCI Machine Learning Repository**

The UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/index.php] is a set of datasets, domain theories, and data generators mainly used by the machine learning community. It was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine.

For each data set, the UCI Machine Learning Repository has information like:
- Source
- Data Set Information
- Attribute Information
- Relevant Papers

**Data.World**

Data.world [https://data.world/] is a data catalog platform that also hosts a wide variety of community-contributed datasets, which we can access for free.


## Reading data from a `.csv` file

Because of their simplicity, comma-separated values (CSV) files are among the most widely used formats for storing and sharing data.

`read_csv()` reads data from a CSV file and returns a DataFrame object.
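
The sketch below shows the three `read_csv()` parameters used in this section (`header`, `names`, and `na_values`) on a tiny in-memory file. The three sample rows and the name `df_demo` are made up for illustration.

```python
from io import StringIO
import pandas as pd

# Hypothetical sample with no header line; "?" marks a missing value
raw = StringIO("audi,4,13950\nbmw,?,16430\nvolvo,4,?\n")

df_demo = pd.read_csv(
    raw,
    header=None,                           # the file has no header row
    names=["make", "num_doors", "price"],  # so we supply the column names
    na_values="?",                         # treat "?" as NaN
)
print(df_demo.dtypes)
print(df_demo.isna().sum())
```

Because the `?` entries become `NaN`, the two numeric columns are read as floats rather than as text.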

We will use the Automobile dataset [https://archive.ics.uci.edu/ml/datasets/automobile] from the UCI Machine Learning Repository [https://archive-beta.ics.uci.edu/]. 

In [4]:
# Defining the headers
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

In [5]:
df_auto = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
print(df_auto.shape)
df_auto.head()

(205, 26)


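|   | symboling | normalized_losses | make | ... | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | ... | 21 | 27 | 13495.0 |
| 1 | 3 | NaN | alfa-romero | ... | 21 | 27 | 16500.0 |
| 2 | 1 | NaN | alfa-romero | ... | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | ... | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | ... | 18 | 22 | 17450.0 |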


In [6]:
# Saving a local copy
df_auto.to_csv('auto.csv')

In [7]:
# Reading the local copy
df2_auto = pd.read_csv('auto.csv')
print(df2_auto.shape)
df2_auto.head()

(205, 27)


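|   | Unnamed: 0 | symboling | normalized_losses | ... | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | NaN | ... | 21 | 27 | 13495.0 |
| 1 | 1 | 3 | NaN | ... | 21 | 27 | 16500.0 |
| 2 | 2 | 1 | NaN | ... | 19 | 26 | 16500.0 |
| 3 | 3 | 2 | 164.0 | ... | 24 | 30 | 13950.0 |
| 4 | 4 | 2 | 164.0 | ... | 18 | 22 | 17450.0 |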


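The two DataFrames are slightly different: the local copy has 27 columns instead of 26.

The extra `Unnamed: 0` column is the index of the original DataFrame, which `to_csv()` writes to the file by default. To avoid it, we pass `index=False` when saving.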

In [8]:
# Save the dataframe to a csv file without the index column
df_auto.to_csv('auto.csv',index=False)

In [9]:
# Reading auto.csv 
df2_auto = pd.read_csv('auto.csv')
df2_auto.head()

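|   | symboling | normalized_losses | make | ... | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | ... | 21 | 27 | 13495.0 |
| 1 | 3 | NaN | alfa-romero | ... | 21 | 27 | 16500.0 |
| 2 | 1 | NaN | alfa-romero | ... | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | ... | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | ... | 18 | 22 | 17450.0 |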
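
If the index is worth keeping in the file, an alternative is to write it as usual and tell `read_csv()` which column holds it with `index_col=0`. The following is a minimal sketch on a made-up two-row DataFrame (`df_small` is not part of the notebook's data).

```python
from io import StringIO
import pandas as pd

# Hypothetical two-row DataFrame used only for illustration
df_small = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950.0, 16430.0]})

# Without a path, to_csv() returns the CSV text; the index becomes an unnamed first column
csv_with_index = df_small.to_csv()

# Default read: the saved index shows up as an extra "Unnamed: 0" column
plain = pd.read_csv(StringIO(csv_with_index))

# index_col=0 uses the first column as the index instead
restored = pd.read_csv(StringIO(csv_with_index), index_col=0)

print(plain.columns.tolist())     # ['Unnamed: 0', 'make', 'price']
print(restored.columns.tolist())  # ['make', 'price']
```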


## Reading data from an `.arff` file

An `arff` (Attribute-Relation File Format) file is an ASCII text file. The Machine Learning Project at the Computer Science Department of The University of Waikato developed it.

`arff` files have two distinct sections. The first section is the header information, and the second has the data.
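
To make the two sections concrete, here is a minimal sketch that parses a tiny hand-written ARFF text with `loadarff()`. The relation `toy` and its two attributes are made up for illustration. The header lines start with `@relation` and `@attribute`, and the rows follow `@data`.

```python
from io import StringIO
from scipy.io.arff import loadarff

# Hypothetical ARFF content: header section first, then the data section
arff_text = """@relation toy
@attribute width numeric
@attribute color {red,green}
@data
5.0,red
3.2,green
"""

toy_data, toy_meta = loadarff(StringIO(arff_text))

print(toy_meta.names())   # attribute names taken from the header
print(toy_meta.types())   # attribute types taken from the header
print(toy_data["color"])  # nominal values come back as byte strings
```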

We will use the Student Academics Performance Data Set [https://archive.ics.uci.edu/ml/datasets/Student+Academics+Performance] from the UCI Machine Learning Repository [https://archive-beta.ics.uci.edu/].

In [10]:
# For reading an arff file, we need to import
from io import StringIO
import urllib.request
from scipy.io.arff import loadarff

In [11]:
stAcademic_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00467/Sapfile1.arff"
resp = urllib.request.urlopen(stAcademic_url)

In [12]:
data, meta = loadarff(StringIO(resp.read().decode('utf-8')))

In [13]:
# data contains the data and meta contains the metadata
meta

Dataset: Sapfile1
	ge's type is nominal, range is ('M', 'F')
	cst's type is nominal, range is ('G', 'ST', 'SC', 'OBC', 'MOBC')
	tnp's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	twp's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	iap's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	esp's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	arr's type is nominal, range is ('Y', 'N')
	ms's type is nominal, range is ('Married', 'Unmarried')
	ls's type is nominal, range is ('T', 'V')
	as's type is nominal, range is ('Free', 'Paid')
	fmi's type is nominal, range is ('Vh', 'High', 'Am', 'Medium', 'Low')
	fs's type is nominal, range is ('Large', 'Average', 'Small')
	fq's type is nominal, range is ('Il', 'Um', '10', '12', 'Degree', 'Pg')
	mq's type is nominal, range is ('Il', 'Um', '10', '12', 'Degree', 'Pg')
	fo's type is nominal, range is ('Service', 'Business', 'Retired', 'Farmer', 'Others')
	mo's type is nominal, ran

In [14]:
# meta.names() is the public way to get the attribute names
columns_name = meta.names()
df_st = pd.DataFrame(data, columns=columns_name)
df_st.head()

|   | ge | cst | tnp | ... | me | tt | atd |
|---|---|---|---|---|---|---|---|
| 0 | b'F' | b'G' | b'Good' | ... | b'Asm' | b'Small' | b'Good' |
| 1 | b'M' | b'OBC' | b'Vg' | ... | b'Asm' | b'Average' | b'Average' |
| 2 | b'F' | b'OBC' | b'Good' | ... | b'Asm' | b'Large' | b'Good' |
| 3 | b'M' | b'MOBC' | b'Pass' | ... | b'Asm' | b'Average' | b'Average' |
| 4 | b'M' | b'G' | b'Good' | ... | b'Asm' | b'Small' | b'Good' |


`loadarff()` returns nominal attributes as byte strings, so the columns are stored as objects; for instance, instead of `F` we have `b'F'`, and instead of `2` we would have `b'2'`. To solve this problem, we go over the object columns and decode them into regular strings.

In [15]:
# decoding the object columns
for c in df_st.columns:
    if df_st[c].dtype == 'object':
        df_st[c] = df_st[c].str.decode('UTF-8')
df_st.head()

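|   | ge | cst | tnp | ... | me | tt | atd |
|---|---|---|---|---|---|---|---|
| 0 | F | G | Good | ... | Asm | Small | Good |
| 1 | M | OBC | Vg | ... | Asm | Average | Average |
| 2 | F | OBC | Good | ... | Asm | Large | Good |
| 3 | M | MOBC | Pass | ... | Asm | Average | Average |
| 4 | M | G | Good | ... | Asm | Small | Good |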
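
The same decoding step can be sketched on a small made-up DataFrame, using `select_dtypes()` to pick the object columns so that numeric columns are left alone. The names `df_bytes` and `score` are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame mixing byte-string columns with a numeric one
df_bytes = pd.DataFrame({
    "ge": [b"F", b"M"],
    "tnp": [b"Good", b"Vg"],
    "score": [1.5, 2.0],
})

# Decode only the object (byte-string) columns; "score" is not touched
for col in df_bytes.select_dtypes(include="object").columns:
    df_bytes[col] = df_bytes[col].str.decode("utf-8")

print(df_bytes)
```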


## Reading data from a `.json` file

We will use `countries.json` [https://data.world/dr5hn/country-state-city/workspace/file?filename=countries.json] for the following example.

In [16]:
df_countries = pd.read_json('https://query.data.world/s/6wc2blqdaxd6s2k3xmzmfx3zb72v7l')
print(df_countries.shape)
df_countries.head()

(250, 20)


|   | id | name | iso3 | ... | longitude | emoji | emojiU |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Afghanistan | AFG | ... | 65.0 | 🇦🇫 | U+1F1E6 U+1F1EB |
| 1 | 2 | Aland Islands | ALA | ... | 19.9 | 🇦🇽 | U+1F1E6 U+1F1FD |
| 2 | 3 | Albania | ALB | ... | 20.0 | 🇦🇱 | U+1F1E6 U+1F1F1 |
| 3 | 4 | Algeria | DZA | ... | 3.0 | 🇩🇿 | U+1F1E9 U+1F1FF |
| 4 | 5 | American Samoa | ASM | ... | -170.0 | 🇦🇸 | U+1F1E6 U+1F1F8 |
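
`read_json()` also accepts a file-like object, which makes it easy to try on a small sample. The sketch below assumes a list-of-records layout with only two fields per country, a simplification of the real file, and round-trips it through `to_json()`.

```python
from io import StringIO
import pandas as pd

# Simplified two-record sample in list-of-records layout
json_text = '[{"name": "Albania", "iso3": "ALB"}, {"name": "Algeria", "iso3": "DZA"}]'

df_json = pd.read_json(StringIO(json_text))

# orient="records" writes one JSON object per row, matching the input layout
records_text = df_json.to_json(orient="records")
round_trip = pd.read_json(StringIO(records_text))

print(round_trip)
```

The `orient` argument controls how rows and columns are laid out in the JSON text; the default for a DataFrame is `"columns"`.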




## Writing data to an Excel file

In [18]:
# The 'xlsxwriter' engine used below must be installed first
# !pip install xlsxwriter

In [19]:
import xlsxwriter

In [None]:
# Writing the student data to two sheets, without and with the index column;
# the context manager saves the workbook when the block ends
with pd.ExcelWriter('students.xlsx', engine='xlsxwriter') as writer:
    df_st.to_excel(writer, sheet_name='WithoutIndex', index=False)
    df_st.to_excel(writer, sheet_name='WithIndex')

In [None]:
# Saving the countries DataFrame to a local JSON file
df_countries.to_json('countries.json')

In [None]:
# Reading the local copy
dfj = pd.read_json('countries.json')
dfj.head(2)


Reference:
- VanderPlas, J. (2017) Python Data Science Handbook: Essential Tools for Working with Data. USA: O’Reilly Media, Inc. chapter 3