# Importing Data Using Pandas - Lab

## Introduction

In this lab, you'll practice loading files that contain summary rows or metadata alongside the data itself. If you find that easy, the optional "level up" content covers loading data from a corrupted CSV file!

## Objectives
You will be able to:

- Use pandas to import data from a CSV file and an Excel spreadsheet

## Loading Files with Summary or Metadata

Load either of the files `'Zipcode_Demos.csv'` or `'Zipcode_Demos.xlsx'`. What's going on with this dataset? Clean it up into a useable format and describe the nuances of how the data is currently formatted.

All data files are stored in a folder titled `'Data'`.

In [1]:
# Import pandas using the standard alias
import pandas as pd

In [2]:
df = pd.read_csv('Data/Zipcode_Demos.csv')

In [3]:
# Import the file and print the first 5 rows
df.head(5)

Unnamed: 0,0,Average Statistics,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46
0,1,,0.0,,,,,,,,...,,,,,,,,,,
1,2,JURISDICTION NAME,10005.8,,,,,,,,...,,,,,,,,,,
2,3,COUNT PARTICIPANTS,9.4,,,,,,,,...,,,,,,,,,,
3,4,COUNT FEMALE,4.8,,,,,,,,...,,,,,,,,,,
4,5,PERCENT FEMALE,0.404,,,,,,,,...,,,,,,,,,,


In [4]:
# Print the last 5 rows of df
df.tail(5)

Unnamed: 0,0,Average Statistics,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46
52,53,10006,6,2,0.33,4,0.67,0,0,6,...,6,100,0,0,6,1,0,0,6,100
53,54,10007,1,0,0.0,1,1.0,0,0,1,...,1,100,1,1,0,0,0,0,1,100
54,55,10009,2,0,0.0,2,1.0,0,0,2,...,2,100,0,0,2,1,0,0,2,100
55,56,10010,0,0,0.0,0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,57,10011,3,2,0.67,1,0.33,0,0,3,...,3,100,0,0,3,1,0,0,3,100


In [5]:
# What is going on with this data set? Anything unusual?

In [6]:
# Inspect column dtypes and non-null counts before cleaning up the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 47 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   0                   57 non-null     int64 
 1   Average Statistics  56 non-null     object
 2   Unnamed: 2          57 non-null     object
 3   Unnamed: 3          11 non-null     object
 4   Unnamed: 4          11 non-null     object
 5   Unnamed: 5          11 non-null     object
 6   Unnamed: 6          11 non-null     object
 7   Unnamed: 7          11 non-null     object
 8   Unnamed: 8          11 non-null     object
 9   Unnamed: 9          11 non-null     object
 10  Unnamed: 10         11 non-null     object
 11  Unnamed: 11         11 non-null     object
 12  Unnamed: 12         11 non-null     object
 13  Unnamed: 13         11 non-null     object
 14  Unnamed: 14         11 non-null     object
 15  Unnamed: 15         11 non-null     object
 16  Unnamed: 16         11 non-n

In [7]:
df.shape

(57, 47)

In [8]:
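# Re-read the file, treating row 47 (0-indexed) as the header; the summary rows above it are discarded
df = pd.read_csv('Data/Zipcode_Demos.csv', header=47)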

In [9]:
# Drop the first three columns by position
df = df.drop(df.columns[[0, 1, 2]], axis=1)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 44 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   COUNT FEMALE                         10 non-null     int64  
 1   PERCENT FEMALE                       10 non-null     float64
 2   COUNT MALE                           10 non-null     int64  
 3   PERCENT MALE                         10 non-null     float64
 4   COUNT GENDER UNKNOWN                 10 non-null     int64  
 5   PERCENT GENDER UNKNOWN               10 non-null     int64  
 6   COUNT GENDER TOTAL                   10 non-null     int64  
 7   PERCENT GENDER TOTAL                 10 non-null     int64  
 8   COUNT PACIFIC ISLANDER               10 non-null     int64  
 9   PERCENT PACIFIC ISLANDER             10 non-null     int64  
 10  COUNT HISPANIC LATINO                10 non-null     int64  
 11  PERCENT HISPANIC LATINO            

In [11]:
df

Unnamed: 0,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,COUNT PACIFIC ISLANDER,PERCENT PACIFIC ISLANDER,...,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
0,22,0.5,22,0.5,0,0,44,100,0,0,...,44,100,20,0.45,24,0.55,0,0,44,100
1,19,0.54,16,0.46,0,0,35,100,0,0,...,35,100,2,0.06,33,0.94,0,0,35,100
2,1,1.0,0,0.0,0,0,1,100,0,0,...,1,100,0,0.0,1,1.0,0,0,1,100
3,0,0.0,0,0.0,0,0,0,0,0,0,...,0,0,0,0.0,0,0.0,0,0,0,0
4,2,1.0,0,0.0,0,0,2,100,0,0,...,2,100,0,0.0,2,1.0,0,0,2,100
5,2,0.33,4,0.67,0,0,6,100,0,0,...,6,100,0,0.0,6,1.0,0,0,6,100
6,0,0.0,1,1.0,0,0,1,100,0,0,...,1,100,1,1.0,0,0.0,0,0,1,100
7,0,0.0,2,1.0,0,0,2,100,0,0,...,2,100,0,0.0,2,1.0,0,0,2,100
8,0,0.0,0,0.0,0,0,0,0,0,0,...,0,0,0,0.0,0,0.0,0,0,0,0
9,2,0.67,1,0.33,0,0,3,100,0,0,...,3,100,0,0.0,3,1.0,0,0,3,100
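
The `header` argument used above can be tried out on a small in-memory file. The following is a minimal sketch, assuming a made-up miniature of the layout seen in `Zipcode_Demos.csv` (a summary block padded with empty cells, followed by the detail table). The zip codes, the row positions, and the `summary` and `detail` names are illustrative only.

```python
import io

import pandas as pd

# Made-up miniature of a file with a summary block above the real table.
raw = """Average Statistics,,
COUNT PARTICIPANTS,9.4,
COUNT FEMALE,4.8,
,,
JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE
10001,44,22
10002,35,19
"""

# Summary block: skip the title line, read the two label/value lines,
# and keep only the first two columns (the third is just padding).
summary = pd.read_csv(io.StringIO(raw), skiprows=1, nrows=2, header=None).iloc[:, :2]
summary.columns = ["statistic", "value"]

# Detail table: line 4 (0-indexed) holds the real column names,
# so everything above it is discarded.
detail = pd.read_csv(io.StringIO(raw), header=4)

print(summary)
print(detail)
```

Reading the same source twice with different arguments keeps the summary and the detail table in separate, cleanly typed DataFrames.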


## Level Up (Optional) - Loading Corrupt CSV files

Occasionally, you'll encounter badly formatted data. One example is a CSV file whose string fields contain commas. By convention, such fields should be wrapped in quotes so that the commas inside a field can be told apart from the commas that separate fields. For example, we could have a table like this:

`ReviewerID,Rating,N_reviews,Review,VenueID
123456,4,137,This restaurant was pretty good, we had a great time.,98765`

As a CSV, it should be saved like this (to avoid confusion with the commas in the Review text):
`"ReviewerID","Rating","N_reviews","Review","VenueID"
"123456","4","137","This restaurant was pretty good, we had a great time.","98765"`

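The quoting convention described above can be checked with a short round trip. This is a minimal sketch that writes the example row with Python's standard `csv` module (using `csv.QUOTE_ALL` to quote every field) and reads it back with pandas; the `buffer` and `reviews` names are illustrative, and an in-memory buffer stands in for a real file.

```python
import csv
import io

import pandas as pd

header = ["ReviewerID", "Rating", "N_reviews", "Review", "VenueID"]
row = [123456, 4, 137, "This restaurant was pretty good, we had a great time.", 98765]

# Write the table with every field wrapped in quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(header)
writer.writerow(row)

# Read it back: the comma inside the quoted review does not split the field.
buffer.seek(0)
reviews = pd.read_csv(buffer)
print(reviews.shape)
print(reviews.loc[0, "Review"])
```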
Attempt to import the corrupt file, or at least a small preview of it. It is appropriately titled `'Yelp_Reviews_Corrupt.csv'`. Investigate how skipping rows lets you get past the parsing error, and comment on what you think is going on.

In [12]:
# Hint: Here's a useful programming pattern to use
try:
    # Do something
    pass
except Exception as e:
    # Handle your exception e
    print(e)
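
The pattern above can be applied to `pd.read_csv` directly. The following is a minimal sketch, assuming a made-up three-column file in which one record contains an unquoted comma; it catches `pd.errors.ParserError` and then retries with `on_bad_lines='skip'`. The data and the `raw` and `reviews` names are hypothetical, and an in-memory buffer stands in for the Yelp file.

```python
import io

import pandas as pd

# Made-up file: the second record has an unquoted comma in its text field,
# so it appears to have 4 fields instead of 3.
raw = """id,stars,text
1,5,Great food
2,1,Terrible, dry corn bread
3,4,Would come back
"""

try:
    reviews = pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError as e:
    # Report the problem, then retry while skipping lines with too many fields.
    print(f"Parsing failed: {e}")
    reviews = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

print(reviews)
```

Catching the specific parser error, rather than every `Exception`, keeps unrelated problems such as a missing file visible.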

In [13]:

# Skip lines that have more fields than the header instead of raising a parser error
df = pd.read_csv('Data/Yelp_Reviews_Corrupt.csv', on_bad_lines='skip')
df

Unnamed: 0.1,Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,1,pomGBqfbxcqPv14c3XH-ZQ,0,2012-11-13,0,dDl8zu1vWPdKGihJrwQbpw,5,I love this place! My fiance And I go here atl...,0,msQe1u7Z_XuqjGoqhB0J5g
1,2,jtQARsP6P-LbkyjbO1qNGg,1,2014-10-23,1,LZp4UX5zK3e-c5ZGSeo3kA,1,Terrible. Dry corn bread. Rib tips were all fa...,3,msQe1u7Z_XuqjGoqhB0J5g
2,4,Ums3gaP2qM3W1XcA5r6SsQ,0,2014-09-05,0,jsDu6QEJHbwP2Blom1PLCA,5,Delicious healthy food. The steak is amazing. ...,0,msQe1u7Z_XuqjGoqhB0J5g
3,5,vgfcTvK81oD4r50NMjU2Ag,0,2011-02-25,0,pfavA0hr3nyqO61oupj-lA,1,This place sucks. The customer service is horr...,2,msQe1u7Z_XuqjGoqhB0J5g
4,10,yFumR3CWzpfvTH2FCthvVw,0,2016-06-15,0,STiFMww2z31siPY7BWNC2g,5,I have been an Emerald Club member for a numbe...,0,TlvV-xJhmh7LCwJYXkV-cg
...,...,...,...,...,...,...,...,...,...,...
4030,This was disappointing.,,,,,,,,,
4031,First off,it was really awkward sitting on the benches ...,as people walked past us while to wait for ou...,,,,,,,
4032,Second,when we were seated,it was so loud. It felt like we were in a hig...,,,,,,,
4033,Finally - Food was mediocre. I was extremely d...,but it wasn't flavourful.,,,,,,,,


In [14]:
df.head(50)

Unnamed: 0.1,Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,1,pomGBqfbxcqPv14c3XH-ZQ,0,2012-11-13,0.0,dDl8zu1vWPdKGihJrwQbpw,5.0,I love this place! My fiance And I go here atl...,0.0,msQe1u7Z_XuqjGoqhB0J5g
1,2,jtQARsP6P-LbkyjbO1qNGg,1,2014-10-23,1.0,LZp4UX5zK3e-c5ZGSeo3kA,1.0,Terrible. Dry corn bread. Rib tips were all fa...,3.0,msQe1u7Z_XuqjGoqhB0J5g
2,4,Ums3gaP2qM3W1XcA5r6SsQ,0,2014-09-05,0.0,jsDu6QEJHbwP2Blom1PLCA,5.0,Delicious healthy food. The steak is amazing. ...,0.0,msQe1u7Z_XuqjGoqhB0J5g
3,5,vgfcTvK81oD4r50NMjU2Ag,0,2011-02-25,0.0,pfavA0hr3nyqO61oupj-lA,1.0,This place sucks. The customer service is horr...,2.0,msQe1u7Z_XuqjGoqhB0J5g
4,10,yFumR3CWzpfvTH2FCthvVw,0,2016-06-15,0.0,STiFMww2z31siPY7BWNC2g,5.0,I have been an Emerald Club member for a numbe...,0.0,TlvV-xJhmh7LCwJYXkV-cg
5,11,UBv8heCQR0RPnUQG0zkXIQ,0,2016-09-23,0.0,HkYqGb0Gplmmk-xlHTRBoA,1.0,The score should be negative. Its HORRIBLE. Th...,0.0,NhOc64RsrTT1Dls50yYW8g
6,12,hdgYnadxg0GANhWOJabr2g,0,2014-08-23,0.0,RgqWdZA4xR023iP3T6jVfA,5.0,I went there twice and I am pretty happy with ...,0.0,NhOc64RsrTT1Dls50yYW8g
7,19,gZGsReG0VeX4uKViHTB9EQ,0,2017-08-16,0.0,51RHs_V_fjuistnuKxNpEg,5.0,Finally! After trying many Mexican restaurants...,0.0,5ngpW5tf3ep680eG1HxHzA
8,25,f-v1fvtnbdw_QQRsCnwH-g,0,2017-11-18,0.0,alI_kRKyEHfdHibYGgtJbw,1.0,I have to write a review on the Fractured Prun...,0.0,Fc_nb6N6Sdurqb-rwsY1Bw
9,26,yz66FIUPDKGhILDWzRLeKg,0,2017-11-18,0.0,85DRIjwPJOTb4q0qOlBstw,1.0,I wish i could tell you all about the food but...,1.0,Fc_nb6N6Sdurqb-rwsY1Bw


In [15]:
df.tail(50)

Unnamed: 0.1,Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
3985,Some of the desserts are really over priced fo...,,,,,,,,,
3986,Today I ordered an Early Grey. The earl grey t...,,,,,,,,,
3987,Will not be going back to this place ever .......,,,,,,,,,
3988,Tip* go to the Starbucks upstairs in the indig...,late,cappuccino or tea the way it should be done.,4,kBNFdviedCPFWyR-wVaAzw,,,,,
3989,2128,fB4cb6uvz-QngtYP0fbfAw,0,2012-07-17,0,20LissUeP84XzaBBYTmbZA,1,I didn't eat here because I was sitting at the...,,
3990,The Korean BBQ restaurant is next door and kit...,,,,,,,,,
3991,Service was very poor,2,ggUIxd2V8ryLPYUYJUBEaA,,,,,,,
3992,2934,21sGRVR7HEs_t6PdB9tGMw,0,2013-07-12,2,6U0BqygunHHoZzoS4D5mhA,5,I finally found a pizza that ill never get tir...,2,mNZUtwyCu4q0fsWvwPqceA
3993,1465,YJ05ntGlszxACOD5zn1YjA,0,2015-10-17,0,6IsC3q3ZULUWcbme0SdZYQ,4,Had a cheese burger and fries was really expen...,0,f9NVh1UWkW6DTnckfOuwEA
3994,3506,CE05_kJvl0xRloi8fW6R3w,0,2015-06-25,0,_Rzm6xCJb01K7OrALNjSvg,5,They do a great job of letting you know what's...,0,dN9nRZmDCk0wk1Uya5kCRQ


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4035 entries, 0 to 4034
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   4035 non-null   object
 1   business_id  3293 non-null   object
 2   cool         3097 non-null   object
 3   date         2495 non-null   object
 4   funny        2284 non-null   object
 5   review_id    2194 non-null   object
 6   stars        2127 non-null   object
 7   text         2072 non-null   object
 8   useful       1640 non-null   object
 9   user_id      1525 non-null   object
dtypes: object(10)
memory usage: 315.4+ KB


In [17]:
df2 = df[df.text.notnull()]
df2.head(50)

Unnamed: 0.1,Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,1,pomGBqfbxcqPv14c3XH-ZQ,0,2012-11-13,0,dDl8zu1vWPdKGihJrwQbpw,5,I love this place! My fiance And I go here atl...,0.0,msQe1u7Z_XuqjGoqhB0J5g
1,2,jtQARsP6P-LbkyjbO1qNGg,1,2014-10-23,1,LZp4UX5zK3e-c5ZGSeo3kA,1,Terrible. Dry corn bread. Rib tips were all fa...,3.0,msQe1u7Z_XuqjGoqhB0J5g
2,4,Ums3gaP2qM3W1XcA5r6SsQ,0,2014-09-05,0,jsDu6QEJHbwP2Blom1PLCA,5,Delicious healthy food. The steak is amazing. ...,0.0,msQe1u7Z_XuqjGoqhB0J5g
3,5,vgfcTvK81oD4r50NMjU2Ag,0,2011-02-25,0,pfavA0hr3nyqO61oupj-lA,1,This place sucks. The customer service is horr...,2.0,msQe1u7Z_XuqjGoqhB0J5g
4,10,yFumR3CWzpfvTH2FCthvVw,0,2016-06-15,0,STiFMww2z31siPY7BWNC2g,5,I have been an Emerald Club member for a numbe...,0.0,TlvV-xJhmh7LCwJYXkV-cg
5,11,UBv8heCQR0RPnUQG0zkXIQ,0,2016-09-23,0,HkYqGb0Gplmmk-xlHTRBoA,1,The score should be negative. Its HORRIBLE. Th...,0.0,NhOc64RsrTT1Dls50yYW8g
6,12,hdgYnadxg0GANhWOJabr2g,0,2014-08-23,0,RgqWdZA4xR023iP3T6jVfA,5,I went there twice and I am pretty happy with ...,0.0,NhOc64RsrTT1Dls50yYW8g
7,19,gZGsReG0VeX4uKViHTB9EQ,0,2017-08-16,0,51RHs_V_fjuistnuKxNpEg,5,Finally! After trying many Mexican restaurants...,0.0,5ngpW5tf3ep680eG1HxHzA
8,25,f-v1fvtnbdw_QQRsCnwH-g,0,2017-11-18,0,alI_kRKyEHfdHibYGgtJbw,1,I have to write a review on the Fractured Prun...,0.0,Fc_nb6N6Sdurqb-rwsY1Bw
9,26,yz66FIUPDKGhILDWzRLeKg,0,2017-11-18,0,85DRIjwPJOTb4q0qOlBstw,1,I wish i could tell you all about the food but...,1.0,Fc_nb6N6Sdurqb-rwsY1Bw
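
A non-null `text` value does not by itself guarantee an intact row, because fragments of a broken review can land in any column. One possible stricter check, sketched here on a made-up frame, is to keep only rows whose `stars` value parses as a number. The 1 to 5 range is an assumption about the rating scale, and the `sample`, `valid`, and `clean` names are hypothetical.

```python
import pandas as pd

# Made-up frame mixing intact rows with fragments of broken ones.
# The last row has a non-null text value but is not a real review record.
sample = pd.DataFrame({
    "stars": ["5", "1", None, "ggUIxd2V8ryLPYUYJUBEaA"],
    "text": ["I love this place!", "Terrible.", None, "2"],
})

# Values that cannot be parsed as numbers become NaN.
stars = pd.to_numeric(sample["stars"], errors="coerce")

# Keep rows whose rating falls in the assumed 1 to 5 range (NaN never matches).
valid = stars.between(1, 5)
clean = sample[valid].copy()
clean["stars"] = stars[valid].astype(int)

print(clean)
```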


In [18]:
# Count the lines in the raw file (header included)
with open('Data/Yelp_Reviews_Corrupt.csv') as f:
    row_count = len(f.readlines())

# Load the file again, skipping lines with too many fields
df = pd.read_csv('Data/Yelp_Reviews_Corrupt.csv', on_bad_lines='skip')

# Difference between the raw line count and the number of parsed rows
row_count - len(df)

2074
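
To see how lines are malformed, rather than only how many were dropped, one option is to count the fields on each raw line. This is a minimal sketch on a made-up file, using the standard `csv` module and `collections.Counter`; the data and the `field_counts` and `parsed` names are hypothetical. It also shows that `on_bad_lines='skip'` only drops lines with too many fields, while lines with too few fields are kept and padded with `NaN`.

```python
import csv
import io
from collections import Counter

import pandas as pd

# Made-up file: one record has an unquoted comma (4 fields) and one review
# is broken across two lines by an unquoted newline (3 fields, then 1).
raw = """id,stars,text
1,5,Great food
2,1,Terrible, dry corn bread
3,4,Would come
back again
"""

# Count how many fields each data line has.
reader = csv.reader(io.StringIO(raw))
header = next(reader)
field_counts = Counter(len(row) for row in reader)
print(f"expected {len(header)} fields; observed counts: {dict(field_counts)}")

# The 4-field line is skipped; the 1-field line survives with missing values.
parsed = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(parsed)
```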

In [27]:
# Read the raw lines of the file into a list for manual inspection
with open('Data/Yelp_Reviews_Corrupt.csv') as f:
    text = f.readlines()





In [25]:
# Print the raw lines, then load the file while skipping malformed lines
with open('Data/Yelp_Reviews_Corrupt.csv') as f:
    for line in f:
        print(line)
df = pd.read_csv('Data/Yelp_Reviews_Corrupt.csv', on_bad_lines='skip')

## Summary

Congratulations, you have now practiced your pandas data-importing skills!
