# Lesson 5 Topic 1: 
# <span style=color:blue>How to read data from different text based (and non-text based) sources</span>

## Libraries to be installed for this Lesson
Because this lesson deals with reading various file formats, we need a few additional libraries and software platforms.

Run the following commands first to install the required libraries (`tabula-py` needs a Java runtime, hence the JDK):

!apt-get update<br>
!apt-get install -y default-jdk<br> 
!pip install tabula-py xlrd lxml

Uncomment the lines in the next cell and execute them before proceeding.

In [None]:
#!apt-get update
#!apt-get install -y default-jdk
#!pip install tabula-py xlrd lxml 

In [2]:
import numpy as np
import pandas as pd

### Exercise 1: Read data from a CSV

In [3]:
df1 = pd.read_csv("CSV_EX_1.csv")

In [4]:
df1

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


### Exercise 2: Read data from a CSV where headers are missing 

In [5]:
df2 = pd.read_csv("CSV_EX_2.csv")
df2

Unnamed: 0,2,1500,Good,300000
0,3,1300,Fair,240000
1,3,1900,Very good,450000
2,3,1850,Bad,280000
3,2,1640,Good,310000


In [6]:
df2 = pd.read_csv("CSV_EX_2.csv",header=None)
df2

Unnamed: 0,0,1,2,3
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [7]:
df2 = pd.read_csv("CSV_EX_2.csv",header=None, names=['Bedroom','Sq.ft','Locality','Price($)'])
df2

Unnamed: 0,Bedroom,Sq.ft,Locality,Price($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000
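
The three calls above can be compared side by side without the exercise file. The sketch below is only an illustration: the string `raw` is a hypothetical two-row stand-in for `CSV_EX_2.csv`, read through `io.StringIO` instead of from disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for CSV_EX_2.csv: data rows only, no header line
raw = "2,1500,Good,300000\n3,1300,Fair,240000\n"

# Default behaviour: the first data row is consumed as the header
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data and numbers the columns 0..3
df_no_header = pd.read_csv(io.StringIO(raw), header=None)

# names= supplies the labels; when names is given and header is not,
# pandas treats the file as having no header row
df_named = pd.read_csv(
    io.StringIO(raw), names=["Bedroom", "Sq.ft", "Locality", "Price($)"]
)

print(len(df_default), len(df_no_header), list(df_named.columns))
```

Note that the default call silently loses one row of data, which is easy to miss on a large file.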


### Exercise 3: Read data from a CSV where the delimiter/separator is not a comma

In [8]:
df3 = pd.read_csv("CSV_EX_3.csv")
df3

Unnamed: 0,Bedroom; Sq. foot; Locality; Price ($)
0,2; 1500; Good; 300000
1,3; 1300; Fair; 240000
2,3; 1900; Very good; 450000
3,3; 1850; Bad; 280000
4,2; 1640; Good; 310000


In [9]:
df3 = pd.read_csv("CSV_EX_3.csv",sep=';')
df3

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000
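
If the file has a space after each semicolon, as the unparsed output above suggests, `sep=';'` alone may leave that space attached to the column labels and string values. The sketch below assumes such a layout (the string `raw` is a hypothetical stand-in for `CSV_EX_3.csv`) and uses the `skipinitialspace` option of `read_csv` to drop it.

```python
import io
import pandas as pd

# Hypothetical stand-in for CSV_EX_3.csv: semicolon followed by a space
raw = (
    "Bedroom; Sq. foot; Locality; Price ($)\n"
    "2; 1500; Good; 300000\n"
    "3; 1900; Very good; 450000\n"
)

# sep=';' splits the fields but keeps the space that follows each semicolon
df_plain = pd.read_csv(io.StringIO(raw), sep=";")

# skipinitialspace=True discards whitespace right after the delimiter
df_clean = pd.read_csv(io.StringIO(raw), sep=";", skipinitialspace=True)

print(list(df_plain.columns))
print(list(df_clean.columns))
```

Spaces inside a value, such as the one in `Very good`, are left untouched.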


### Exercise 4: How to override the given headers with your own

In [10]:
df4 = pd.read_csv("CSV_EX_1.csv",names=['A','B','C','D'])
df4

Unnamed: 0,A,B,C,D
0,Bedroom,Sq. foot,Locality,Price ($)
1,2,1500,Good,300000
2,3,1300,Fair,240000
3,3,1900,Very good,450000
4,3,1850,Bad,280000
5,2,1640,Good,310000


In [11]:
df4 = pd.read_csv("CSV_EX_1.csv",header=0,names=['A','B','C','D'])
df4

Unnamed: 0,A,B,C,D
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000
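
The difference between the two calls comes down to whether pandas is told that the first line is a header. A minimal sketch, using a hypothetical in-memory stand-in for `CSV_EX_1.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in for CSV_EX_1.csv: header line plus two data rows
raw = (
    "Bedroom,Sq. foot,Locality,Price ($)\n"
    "2,1500,Good,300000\n"
    "3,1300,Fair,240000\n"
)

# names alone: the original header line is kept as the first data row
df_extra_row = pd.read_csv(io.StringIO(raw), names=list("ABCD"))

# header=0 marks line 0 as the header, so names replaces it
df_replaced = pd.read_csv(io.StringIO(raw), header=0, names=list("ABCD"))

print(len(df_extra_row), len(df_replaced))
```

In the first DataFrame every column is read as text because the old header strings are mixed in with the numbers.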


### Exercise 5: Skip initial rows 

In [12]:
df5 = pd.read_csv("CSV_EX_skiprows.csv")
df5

Unnamed: 0,Filetype: CSV,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,Info about some houses,,
1,Bedroom,Sq. foot,Locality,Price ($)
2,2,1500,Good,300000
3,3,1300,Fair,240000
4,3,1900,Very good,450000
5,3,1850,Bad,280000
6,2,1640,Good,310000


In [13]:
df5 = pd.read_csv("CSV_EX_skiprows.csv",skiprows=2)
df5

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


### Exercise 6: Skip footers

In [14]:
df6 = pd.read_csv("CSV_EX_skipfooter.csv")
df6

Unnamed: 0,Filetype: CSV,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,Info about some houses,,
1,Bedroom,Sq. foot,Locality,Price ($)
2,2,1500,Good,300000
3,3,1300,Fair,240000
4,3,1900,Very good,450000
5,3,1850,Bad,280000
6,2,1640,Good,310000
7,,This is the end of file,,


In [15]:
df6 = pd.read_csv("CSV_EX_skipfooter.csv",skiprows=2,skipfooter=1,engine='python')
df6

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000
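
The same combination of `skiprows` and `skipfooter` can be tried on a small in-memory file. The string below is a hypothetical, shortened stand-in for `CSV_EX_skipfooter.csv`. The `engine='python'` argument is included because `skipfooter` is not supported by the default C parser.

```python
import io
import pandas as pd

# Hypothetical stand-in: two preamble lines, a header, data, one footer line
raw = (
    "Filetype: CSV,,,\n"
    ",Info about some houses,,\n"
    "Bedroom,Sq. foot,Locality,Price ($)\n"
    "2,1500,Good,300000\n"
    "3,1300,Fair,240000\n"
    ",This is the end of file,,"
)

# Skip the two preamble lines and the single footer line
df = pd.read_csv(io.StringIO(raw), skiprows=2, skipfooter=1, engine="python")

print(df)
```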


### Exercise 7: Read only first _n_ rows (especially useful for large files)

In [16]:
df7 = pd.read_csv("CSV_EX_1.csv",nrows=2)
df7

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000


### Exercise 8: How to combine `skiprows` and `nrows` to read data in small chunks  

In [17]:
# List where DataFrames will be stored
list_of_dataframe = []
# Number of rows to be read in one chunk
rows_in_a_chunk = 10
# Number of chunks to be read (this many separate DataFrames will be produced)
num_chunks = 5
# Dummy DataFrame to get the column names
df_dummy = pd.read_csv("Boston_housing.csv",nrows=2)
colnames = df_dummy.columns
# Loop over the CSV file to read only specified number of rows at a time
# Note how the iterator variable i is set up inside the range
for i in range(0,num_chunks*rows_in_a_chunk,rows_in_a_chunk):
    df = pd.read_csv("Boston_housing.csv",header=0,skiprows=i,nrows=rows_in_a_chunk,names=colnames)
    list_of_dataframe.append(df)

In [18]:
list_of_dataframe[0]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
6,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
7,0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
8,0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5
9,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9


In [19]:
list_of_dataframe[1]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
0,0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0
1,0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
2,0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.5,15.71,21.7
3,0.62976,0.0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21.0,396.9,8.26,20.4
4,0.63796,0.0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2
5,0.62739,0.0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9
6,1.05393,0.0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1
7,0.7842,0.0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21.0,386.75,14.67,17.5
8,0.80271,0.0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21.0,288.99,11.69,20.2
9,0.7258,0.0,8.14,0,0.538,5.727,69.5,3.7965,4,307,21.0,390.95,11.28,18.2
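
As an alternative to the manual loop, `read_csv` also accepts a `chunksize` argument and then returns an iterator of DataFrames. This is a different technique from the `skiprows`/`nrows` loop above, shown here only for comparison. The data is a hypothetical 25-row stand-in, not the Boston housing file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV: 25 data rows under a header line
raw = "x,y\n" + "".join(f"{i},{i * i}\n" for i in range(25))

# chunksize makes read_csv yield DataFrames of at most 10 rows each
chunks = list(pd.read_csv(io.StringIO(raw), chunksize=10))
print([len(c) for c in chunks])

# The skiprows/nrows loop from this exercise, limited to two chunks:
# header=0 consumes the first unskipped line and names replaces it
manual = [
    pd.read_csv(io.StringIO(raw), header=0, skiprows=i, nrows=10, names=["x", "y"])
    for i in (0, 10)
]
print(manual[1]["x"].tolist() == chunks[1]["x"].tolist())
```

One visible difference: the chunks from `chunksize` keep a running index (10 to 19 for the second chunk), while each DataFrame from the manual loop starts again at 0.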


### Exercise 9: Setting the option `skip_blank_lines`

In [20]:
df9 = pd.read_csv("CSV_EX_blankline.csv")
df9

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [21]:
df9 = pd.read_csv("CSV_EX_blankline.csv",skip_blank_lines=False)
df9

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2.0,1500.0,Good,300000.0
1,3.0,1300.0,Fair,240000.0
2,,,,
3,3.0,1900.0,Very good,450000.0
4,3.0,1850.0,Bad,280000.0
5,,,,
6,2.0,1640.0,Good,310000.0
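
The change from integers to floats in the second output is a side effect of the blank rows: they are read as `NaN`, and a default integer column cannot hold `NaN`. A small sketch with a hypothetical two-column stand-in for `CSV_EX_blankline.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in: one blank line between the two data rows
raw = "Bedroom,Price ($)\n2,300000\n\n3,240000\n"

# Default (skip_blank_lines=True): the blank line is ignored
df_skip = pd.read_csv(io.StringIO(raw))

# skip_blank_lines=False: the blank line becomes a row of NaN values
df_keep = pd.read_csv(io.StringIO(raw), skip_blank_lines=False)

print(df_skip.dtypes["Bedroom"], df_keep.dtypes["Bedroom"])
print(df_keep["Bedroom"].isna().tolist())
```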


### Exercise 10: Read CSV from inside a compressed (.zip/.gz/.bz2/.xz) file

In [22]:
df10 = pd.read_csv('CSV_EX_1.zip')
df10

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000
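
By default `read_csv` infers the compression from the file extension, which is why no extra argument was needed above. The sketch below writes a small gzip file to a temporary directory and reads it back. The file name `houses.csv.gz` and the data are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data, standing in for the contents of CSV_EX_1.csv
df_out = pd.DataFrame({"Bedroom": [2, 3], "Price ($)": [300000, 240000]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "houses.csv.gz")
    # to_csv infers gzip compression from the .gz extension
    df_out.to_csv(path, index=False)
    # read_csv makes the same inference when reading
    df_in = pd.read_csv(path)
    # The compression can also be stated explicitly
    df_explicit = pd.read_csv(path, compression="gzip")

print(df_in.equals(df_out))
```

Stating `compression` explicitly can help when the extension does not match the actual format.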


### Exercise 11: Reading from an Excel file - how to use `sheet_name`

In [23]:
df11_1 = pd.read_excel("Housing_data.xlsx",sheet_name='Data_Tab_1')
df11_2 = pd.read_excel("Housing_data.xlsx",sheet_name='Data_Tab_2')
df11_3 = pd.read_excel("Housing_data.xlsx",sheet_name='Data_Tab_3')

In [24]:
df11_1.shape

(9, 14)

In [25]:
df11_2.shape

(4, 14)

In [26]:
df11_3.shape

(16, 14)

### Exercise 12: If `sheet_name` is set to `None`, a dictionary of DataFrames keyed by sheet name is returned (an ordered dictionary in the pandas version used here)

In [27]:
dict_df = pd.read_excel("Housing_data.xlsx",sheet_name=None)

In [28]:
dict_df.keys()

odict_keys(['Data_Tab_1', 'Data_Tab_2', 'Data_Tab_3'])

### Exercise 13: A general delimited text file can be read the same way as a CSV

In [29]:
df13 = pd.read_table("Table_EX_1.txt")
df13

Unnamed: 0,"Bedroom, Sq. foot, Locality, Price ($)"
0,"2, 1500, Good, 300000"
1,"3, 1300, Fair, 240000"
2,"3, 1900, Very good, 450000"
3,"3, 1850, Bad, 280000"
4,"2, 1640, Good, 310000"


In [30]:
df13 = pd.read_table("Table_EX_1.txt",sep=',')
df13

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [31]:
df13 = pd.read_table("Table_tab_separated.txt")
df13

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000
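
The last call works without `sep` because `read_table` uses a tab as its default separator, whereas `read_csv` uses a comma. A minimal sketch with a hypothetical tab-separated string in place of `Table_tab_separated.txt`:

```python
import io
import pandas as pd

# Hypothetical stand-in for a tab-separated text file
raw = "Bedroom\tSq. foot\tLocality\n2\t1500\tGood\n3\t1900\tVery good\n"

# read_table splits on tabs by default
df_tab = pd.read_table(io.StringIO(raw))

# read_csv gives the same result once the separator is spelled out
df_csv = pd.read_csv(io.StringIO(raw), sep="\t")

print(df_tab.equals(df_csv))
```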


### Exercise 14: Read HTML tables directly from a URL

In [32]:
url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
list_of_df = pd.read_html(url)

In [33]:
df14 = list_of_df[0]
df14.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,"December 15, 2017","February 21, 2018"
1,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,"October 13, 2017","February 21, 2018"
2,Fayette County Bank,Saint Elmo,IL,1802,"United Fidelity Bank, fsb","May 26, 2017","July 26, 2017"
3,"Guaranty Bank, (d/b/a BestBank in Georgia & Mi...",Milwaukee,WI,30003,First-Citizens Bank & Trust Company,"May 5, 2017","March 22, 2018"
4,First NBC Bank,New Orleans,LA,58302,Whitney Bank,"April 28, 2017","December 5, 2017"


### Exercise 15: `read_html` usually returns more than one table, and further wrangling is needed to get the desired data

In [34]:
list_of_df = pd.read_html("https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table",header=0)

In [35]:
len(list_of_df)

6

In [36]:
for t in list_of_df:
    print(t.shape)

(1, 1)
(87, 6)
(10, 8)
(0, 2)
(1, 1)
(4, 2)


In [37]:
df15=list_of_df[1]
df15.head()

Unnamed: 0,Rank,NOC,Gold,Silver,Bronze,Total
0,1,United States (USA),46,37,38,121.0
1,2,Great Britain (GBR),27,23,17,67.0
2,3,China (CHN),26,18,26,70.0
3,4,Russia (RUS),19,17,20,56.0
4,5,Germany (GER),17,10,15,42.0
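
Picking the table by position, as in `list_of_df[1]`, depends on the page layout at the time of download. One possible alternative is to select by content. The sketch below does not call `read_html`. Instead, `tables` is a hand-built, hypothetical stand-in for the list it returns, and the column names used for matching are taken from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html
tables = [
    pd.DataFrame({"note": ["navigation box"]}),
    pd.DataFrame(
        {
            "Rank": [1, 2, 3],
            "NOC": ["United States (USA)", "Great Britain (GBR)", "China (CHN)"],
            "Gold": [46, 27, 26],
        }
    ),
    pd.DataFrame({"key": ["a", "b"], "value": [1, 2]}),
]

# Option 1: keep the tables that contain all the expected columns
wanted = {"Rank", "NOC", "Gold"}
candidates = [t for t in tables if wanted.issubset(t.columns)]
medal_table = candidates[0]

# Option 2: take the table with the most rows
largest = max(tables, key=len)

print(medal_table.shape, largest.shape)
```

Matching on column names is usually more robust than matching on size, since a page can contain several long tables.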


### Exercise 16: Read in a JSON file

In [38]:
df16 = pd.read_json("movies.json")

In [39]:
df16.head()

Unnamed: 0,cast,genres,title,year
0,[],[],After Dark in Central Park,1900
1,[],[],Boarding School Girls' Pajama Parade,1900
2,[],[],Buffalo Bill's Wild West Parad,1900
3,[],[],Caught,1900
4,[],[],Clowns Spinning Hats,1900


In [40]:
df16[df16['title']=="The Avengers"]['cast']

13519                           [Adele Mara, John Carroll]
23778    [Ralph Fiennes, Uma Thurman, Sean Connery, Jim...
27195    [Robert Downey, Jr., Chris Evans, Mark Ruffalo...
Name: cast, dtype: object

In [41]:
cast_of_avengers=df16[(df16['title']=="The Avengers") & (df16['year']==2012)]['cast']

In [42]:
print(list(cast_of_avengers))

[['Robert Downey, Jr.', 'Chris Evans', 'Mark Ruffalo', 'Chris Hemsworth', 'Scarlett Johansson', 'Jeremy Renner', 'Tom Hiddleston', 'Clark Gregg', 'Cobie Smulders', 'Stellan Skarsgård', 'Samuel L. Jackson']]
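
The filtering steps above can be reproduced on a few records. The JSON string below is a hypothetical, heavily shortened stand-in for `movies.json` with the same four fields. It is read through `io.StringIO` so that no file is needed.

```python
import io
import pandas as pd

# Hypothetical, abbreviated records in the same shape as movies.json
raw = """[
  {"title": "The Avengers", "year": 1998, "cast": ["Ralph Fiennes", "Uma Thurman"], "genres": []},
  {"title": "The Avengers", "year": 2012, "cast": ["Robert Downey, Jr.", "Chris Evans"], "genres": []},
  {"title": "Caught", "year": 1900, "cast": [], "genres": []}
]"""

df = pd.read_json(io.StringIO(raw))

# The title alone is ambiguous: two films share it
same_title = df[df["title"] == "The Avengers"]

# Combining two boolean masks with & narrows it down to one row
mask = (df["title"] == "The Avengers") & (df["year"] == 2012)
cast_2012 = df.loc[mask, "cast"].iloc[0]

print(len(same_title), cast_2012)
```

Each mask needs its own parentheses because `&` binds more tightly than `==` in Python.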


### Exercise 17: Read Stata file (.dta)

In [44]:
df17 = pd.read_stata("wu-data.dta")

In [45]:
df17.head()

Unnamed: 0,id,year,province,totalpop,totalso2,reg_GDP,time,treatment,provincearea,group,SO2PC,SO2PGDP,GDPPC,GDPPC2,pop_density
0,Beijing,1991,Beijing,1094.0,210000,598.900024,1.0,0,16800,1,191.956131,191.956131,0.547441,0.299691,0.065119
1,Beijing,1992,Beijing,1102.0,200000,709.099976,2.0,0,16800,1,181.488205,181.488205,0.643466,0.414049,0.065595
2,Beijing,1993,Beijing,1112.0,203736,863.530029,3.0,0,16800,1,183.21582,183.21582,0.776556,0.603039,0.06619
3,Beijing,1994,Beijing,1125.0,175616,1084.030029,4.0,0,16800,1,156.103104,156.103104,0.963582,0.928491,0.066964
4,Beijing,1995,Beijing,1251.0,214899,1394.890015,5.0,0,16800,1,171.781769,171.781769,1.11502,1.24327,0.074464


### Exercise 18: Read tabular data from PDF file

In [46]:
from tabula import read_pdf

In [47]:
df18_1 = read_pdf('Housing_data.pdf',pages=[1],pandas_options={'header':None})

In [48]:
df18_1

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311
1,0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311
2,0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311
3,0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311


In [49]:
df18_2 = read_pdf('Housing_data.pdf',pages=[2],pandas_options={'header':None})

In [50]:
df18_2

Unnamed: 0,0,1,2,3
0,15.2,386.71,17.1,18.9
1,15.2,392.52,20.45,15.0
2,15.2,396.9,13.27,18.9
3,15.2,390.5,15.71,21.7


In [51]:
df18=pd.concat([df18_1,df18_2],axis=1)

In [52]:
df18

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,0.1,1.1,2.1,3.1
0,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
1,0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0
2,0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
3,0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.5,15.71,21.7


#### With PDF extraction, headers are often difficult to extract automatically. You have to pass the list of headers as the `names` entry of the `pandas_options` argument of the `read_pdf` function.

In [53]:
names=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','PRICE']

In [54]:
df18_1 = read_pdf('Housing_data.pdf',pages=[1],pandas_options={'header':None,'names':names[:10]})
df18_2 = read_pdf('Housing_data.pdf',pages=[2],pandas_options={'header':None,'names':names[10:]})
df18=pd.concat([df18_1,df18_2],axis=1)

In [55]:
df18

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
0,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
1,0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0
2,0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
3,0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.5,15.71,21.7
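
The effect of naming the columns before joining the two pages can be seen without a PDF. In the sketch below, `left` and `right` are hand-built, hypothetical stand-ins for the tables extracted from pages 1 and 2, each holding only the first row shown above.

```python
import pandas as pd

names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
         'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'PRICE']

# Hypothetical stand-ins for the page-level tables (columns numbered 0..9 and 0..3)
left = pd.DataFrame([[0.17004, 12.5, 7.87, 0, 0.524, 6.004, 85.9, 6.5921, 5, 311]])
right = pd.DataFrame([[15.2, 386.71, 17.1, 18.9]])

# Joining side by side without names leaves duplicated labels 0..3
raw_join = pd.concat([left, right], axis=1)
print(raw_join.columns.is_unique)

# Assign the matching slice of names to each part, then join
left.columns = names[:10]
right.columns = names[10:]
df = pd.concat([left, right], axis=1)
print(list(df.columns) == names)
```

With duplicated labels, a selection such as `raw_join[0]` returns two columns instead of one, which is one reason to name the columns before concatenating.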
