# Intro to Pandas
by Ryan Orsinger

## Module 5: Working With Files
- More on using `pd.read_csv`
    - HTTP Requests
    - Working with files that use delimiters/separators other than commas
    - Setting the index column
- Writing data with `to_csv`
- Reading JSON
- Reading from Excel files
- Writing to Excel files

In [3]:
import pandas as pd

In [3]:
# read_csv can read from hosted CSV files.
# Pandas sends the http request!
url = "https://gist.githubusercontent.com/ryanorsinger/cc276eea59e8295204d1f581c8da509f/raw/2388559aef7a0700eb31e7604351364b16e99653/mall_customers.csv"
pd.read_csv(url).head()

Unnamed: 0,customer_id,gender,age,annual_income,spending_score
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [4]:
# To set the index column, use the index_col argument
# If you notice a column that makes sense to use as the index, you'll need to specify it explicitly
pd.read_csv(url, index_col="customer_id").head()

Unnamed: 0_level_0,gender,age,annual_income,spending_score
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
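
The effect of `index_col` can be sketched without a network connection. The inline text below is a small, made-up stand-in for the hosted CSV, and `io.StringIO` is only used so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk or at a URL
csv_text = """customer_id,gender,age
1,Male,19
2,Male,21
3,Female,20
"""

# Without index_col, pandas generates a 0-based integer index
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the named column becomes the row labels
by_customer = pd.read_csv(io.StringIO(csv_text), index_col="customer_id")

print(default_index.index.tolist())   # [0, 1, 2]
print(by_customer.index.tolist())     # [1, 2, 3]
print(by_customer.loc[2, "gender"])   # Male
```

Once a column is the index, `.loc` looks rows up by that column's values rather than by position.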


In [15]:
# The ! operator inside Jupyter Notebooks or IPython sends a command to the system shell
# On Windows without the Linux Subsystem enabled, use !dir *.csv; on macOS or Linux, use !ls *.csv
!dir "../datasets/*.csv"

 Volume in drive C is Windows
 Volume Serial Number is 3E94-F918

 Directory of C:\Users\rioca\_anaconda_courses\datasets

12/27/2023  02:50 PM                72 2020_sales.csv
12/27/2023  02:50 PM                72 2021_sales.csv
12/27/2023  02:50 PM                72 2022_sales.csv
12/22/2023  05:28 PM            28,747 movie_genres.csv
12/25/2023  04:39 PM            13,312 mpg.csv
12/25/2023  12:10 PM            15,240 penguins.csv
12/27/2023  05:08 PM               781 quotes.csv
12/25/2023  04:39 PM             7,943 tips.csv
               8 File(s)         66,239 bytes
               0 Dir(s)  781,990,973,440 bytes free


In [21]:
!dir "../datasets/*sales*.csv" 

 Volume in drive C is Windows
 Volume Serial Number is 3E94-F918

 Directory of C:\Users\rioca\_anaconda_courses\intro_to_pandas

12/27/2023  08:09 PM                72 2020_sales.csv
12/27/2023  08:09 PM                72 2021_sales.csv
12/27/2023  08:09 PM                72 2022_sales.csv
               3 File(s)            216 bytes
               0 Dir(s)  781,983,408,128 bytes free


In [33]:
sales_files = !dir "../datasets/*sales*.csv"

# Keep only the file name (the last whitespace-separated token) from each line of the dir output that ends with .csv
sales_files = [line.split()[-1] for line in sales_files if line.endswith(".csv")]

sales_files 

['2020_sales.csv', '2021_sales.csv', '2022_sales.csv']
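
Parsing the output of `dir` depends on the operating system. As a possible alternative, the standard library's `pathlib` can match file names by pattern on any platform. This is a minimal sketch: the temporary directory and the file contents are invented so that the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a throwaway directory with a few small CSV files.
# The directory and its contents are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    for year, units in [(2020, 20), (2021, 35), (2022, 40)]:
        (data_dir / f"{year}_sales.csv").write_text(
            f"year,items,units\n{year},trucks,{units}\n"
        )
    (data_dir / "tips.csv").write_text("total_bill,tip\n10.0,2.0\n")

    # glob matches file names by pattern; sorted() makes the order predictable
    sales_paths = sorted(data_dir.glob("*sales*.csv"))
    sales_names = [path.name for path in sales_paths]

    # Read every matching file and stack the rows into one DataFrame
    combined = pd.concat(
        [pd.read_csv(path) for path in sales_paths], ignore_index=True
    )

print(sales_names)
print(combined)
```

Because `glob` returns full paths, there is no need to rebuild the path with an f-string before calling `pd.read_csv`.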

In [35]:
# Programmatically Reading Multiple Files 
sales_data = []
for file in sales_files:
    df = pd.read_csv(f"../datasets/{file}")
    sales_data.append(df)
    
sales_df = pd.concat(sales_data, ignore_index=True)
sales_df

Unnamed: 0,year,items,units
0,2020,trucks,20
1,2020,sedans,15
2,2020,compact vehicles,14
3,2021,trucks,35
4,2021,sedans,30
5,2021,compact vehicles,17
6,2022,trucks,40
7,2022,sedans,31
8,2022,compact vehicles,35


In [36]:
# It's common in the field to combine many different data sources into a single dataframe for cleaning/analysis
# By default, to_csv writes the index values as their own column in the output file
sales_df.to_csv("../datasets/all_sales.csv")

In [37]:
!dir "../datasets/*.csv"

 Volume in drive C is Windows
 Volume Serial Number is 3E94-F918

 Directory of C:\Users\rioca\_anaconda_courses\datasets

12/27/2023  02:50 PM                72 2020_sales.csv
12/27/2023  02:50 PM                72 2021_sales.csv
12/27/2023  02:50 PM                72 2022_sales.csv
12/27/2023  11:50 PM               211 all_sales.csv
12/22/2023  05:28 PM            28,747 movie_genres.csv
12/25/2023  04:39 PM            13,312 mpg.csv
12/25/2023  12:10 PM            15,240 penguins.csv
12/27/2023  05:08 PM               781 quotes.csv
12/25/2023  04:39 PM             7,943 tips.csv
               9 File(s)         66,450 bytes
               0 Dir(s)  781,938,380,800 bytes free


In [38]:
# Notice how the index that to_csv wrote comes back as an extra unnamed column
pd.read_csv("../datasets/all_sales.csv").head()

Unnamed: 0.1,Unnamed: 0,year,items,units
0,0,2020,trucks,20
1,1,2020,sedans,15
2,2,2020,compact vehicles,14
3,3,2021,trucks,35
4,4,2021,sedans,30


In [39]:
# Let's see an example where we avoid this complication by paying more attention to the index
# The index argument on to_csv takes a boolean and defaults to True
sales_df.to_csv("../datasets/all_sales_clean.csv", index=False)

In [40]:
# Notice that the index is regenerated and is appropriate
pd.read_csv("../datasets/all_sales_clean.csv")

Unnamed: 0,year,items,units
0,2020,trucks,20
1,2020,sedans,15
2,2020,compact vehicles,14
3,2021,trucks,35
4,2021,sedans,30
5,2021,compact vehicles,17
6,2022,trucks,40
7,2022,sedans,31
8,2022,compact vehicles,35


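If you set a meaningful, named column as the index instead of relying only on the autogenerated index, the index is written under its own column name and you avoid the stray `Unnamed: 0` column.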
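
The round trip can be sketched in memory. Calling `to_csv` without a path returns the CSV text as a string, which keeps this example from touching the file system. The small `sales` DataFrame is made up for illustration.

```python
import io

import pandas as pd

sales = pd.DataFrame({"year": [2020, 2021], "units": [20, 35]})

# Default: the unnamed index is written as a column with an empty header
with_index = sales.to_csv()
reread_default = pd.read_csv(io.StringIO(with_index))
print(reread_default.columns.tolist())   # ['Unnamed: 0', 'year', 'units']

# Option 1: leave the index out when writing
without_index = sales.to_csv(index=False)
reread_clean = pd.read_csv(io.StringIO(without_index))
print(reread_clean.columns.tolist())     # ['year', 'units']

# Option 2: use a meaningful, named index and restore it with index_col
named = sales.set_index("year")
reread_named = pd.read_csv(io.StringIO(named.to_csv()), index_col="year")
print(reread_named.index.name)           # year
```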

### Note on Separator Characters, called Delimiters
- CSV files use commas to separate values
- You may encounter files that use a delimiter other than a comma
- Tab-separated files are common in log files and spreadsheet exports
- Sometimes, you may encounter a file extension of .tsv for tab-separated values
- You may encounter delimiters other than commas or tabs in plain text files.
- Use `pd.read_csv` for them (unless the file is JSON), and pass the appropriate character with the `sep` argument
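
As a minimal sketch of the `sep` argument, the example below reads the same made-up rows from pipe-delimited and semicolon-delimited text. The data is hypothetical and held in memory with `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited and semicolon-delimited text
pipe_text = "species|island|year\nAdelie|Torgersen|2007\nGentoo|Biscoe|2008\n"
semi_text = "species;island;year\nAdelie;Torgersen;2007\nGentoo;Biscoe;2008\n"

pipe_df = pd.read_csv(io.StringIO(pipe_text), sep="|")
semi_df = pd.read_csv(io.StringIO(semi_text), sep=";")

# Both parse to the same three-column DataFrame
print(pipe_df.equals(semi_df))   # True

# Forgetting sep leaves every line crammed into a single column
wrong = pd.read_csv(io.StringIO(pipe_text))
print(wrong.shape)               # (2, 1)
```

A DataFrame with one wide column is usually the first sign that the delimiter was not the one pandas expected.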

In [41]:
# The "\t" character is how we specify a tab character
pd.read_csv("../misc/penguins_with_tabs.tsv", sep="\t").head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


In [44]:
# The read_json function can read JSON files from the file system or from URLs.
# This is particularly helpful when consuming data from a RESTful API that returns JSON.
# The request below returned a 403 error when this notebook was run, so it is left commented out.
# curie_quotes = pd.read_json("https://aphorisms.glitch.me/api/example")
# curie_quotes
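
Since the hosted example is not reachable, here is a small offline sketch of what `pd.read_json` does with a JSON array of objects. The JSON text is placeholder data, not the content of the API above, and `io.StringIO` stands in for a file or URL.

```python
import io

import pandas as pd

# Placeholder JSON: a list of records, the shape many REST APIs return
json_text = """
[
    {"quote": "first placeholder quote", "author": "Author A"},
    {"quote": "second placeholder quote", "author": "Author B"}
]
"""

# Each object becomes a row; the keys become the column names
quotes_demo = pd.read_json(io.StringIO(json_text))

print(quotes_demo.columns.tolist())   # ['quote', 'author']
print(quotes_demo.shape)              # (2, 2)
```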

## Example of using `read_clipboard`

|     model |             displ | year |  cyl | trans |        drv |  cty |  hwy |   fl | drv   | class   |
| --------: | ----------------: | ---: | ---: | ----: | ---------: | ---: | ---: | ---: | ----: | ------- |
|      audi |                a4 |  2.0 | 2008 |     4 |   auto(av) |    f |   21 |   30 |     p | compact |
|     dodge | dakota pickup 4wd |  3.9 | 1999 |     6 | manual(m5) |    4 |   14 |   17 |     r | pickup  |
|    toyota |       4runner 4wd |  4.7 | 2008 |     8 |   auto(l5) |    4 |   14 |   17 |     r | suv     |
|     dodge |       caravan 2wd |  3.8 | 2008 |     6 |   auto(l6) |    f |   16 |   23 |     r | minivan |
| chevrolet |            malibu |  3.6 | 2008 |     6 |   auto(s6) |    f |   17 |   26 |     r | midsize |


In [4]:
# Highlight and copy the table above 
# Then run this cell
df = pd.read_clipboard()
df

Unnamed: 0,model,displ,year,cyl,trans,drv,cty,hwy,fl,drv .1,class
0,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
1,dodge,dakota pickup 4wd,3.9,1999,6,manual(m5),4,14,17,r,pickup
2,toyota,4runner 4wd,4.7,2008,8,auto(l5),4,14,17,r,suv
3,dodge,caravan 2wd,3.8,2008,6,auto(l6),f,16,23,r,minivan
4,chevrolet,malibu,3.6,2008,6,auto(s6),f,17,26,r,midsize


In [5]:
# Writing a DataFrame in memory to an Excel file
# index=False leaves the autogenerated index out of the spreadsheet
df.to_excel("../misc/mpg.xlsx", index=False)

In [6]:
# Reading an excel file (simple version)
mpg = pd.read_excel("../misc/mpg.xlsx")

In [7]:
mpg

Unnamed: 0,model,displ,year,cyl,trans,drv,cty,hwy,fl,drv .1,class
0,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
1,dodge,dakota pickup 4wd,3.9,1999,6,manual(m5),4,14,17,r,pickup
2,toyota,4runner 4wd,4.7,2008,8,auto(l5),4,14,17,r,suv
3,dodge,caravan 2wd,3.8,2008,6,auto(l6),f,16,23,r,minivan
4,chevrolet,malibu,3.6,2008,6,auto(s6),f,17,26,r,midsize


In [8]:
# Reading a specific sheet from an excel file
pd.read_excel("../misc/example_spreadsheet.xlsx", sheet_name="grocery_list")

Unnamed: 0,item,price,quantity
0,cat foot,3.99,2
1,toilet paper,7.99,1
2,beans,0.99,2
3,corn,0.75,4


In [9]:
# Notice how there are some extra rows above the actual table
pd.read_excel("../misc/example_spreadsheet.xlsx", sheet_name="pet_info")

Unnamed: 0,Fancy Company Reports,Unnamed: 1,Unnamed: 2
0,Quality service for all your pet needs,,
1,Company Motto: when your companion is in need,,
2,,,
3,Pet Name,Species,weight
4,Fluffy,cat,8
5,Max,dog,18
6,Gus,iguana,12


In [10]:
# Sometimes, you may need to open the spreadsheet to identify how many rows to skip
pd.read_excel("../misc/example_spreadsheet.xlsx", sheet_name="pet_info", skiprows=4)

Unnamed: 0,Pet Name,Species,weight
0,Fluffy,cat,8
1,Max,dog,18
2,Gus,iguana,12
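
`pd.read_csv` accepts a `skiprows` argument as well, so the same idea can be sketched without an Excel file. The report text below is a shortened, made-up version of the sheet above, with three rows of decoration before the real header.

```python
import io

import pandas as pd

# Hypothetical report: three non-data rows, then the real header and rows
report_text = """Fancy Company Reports,,
Quality service for all your pet needs,,
,,
Pet Name,Species,weight
Fluffy,cat,8
Max,dog,18
"""

# Read as-is: the title row is mistaken for the header
raw = pd.read_csv(io.StringIO(report_text))
print(raw.columns.tolist())    # ['Fancy Company Reports', 'Unnamed: 1', 'Unnamed: 2']

# Skip the first three rows so the real header is used
pets = pd.read_csv(io.StringIO(report_text), skiprows=3)
print(pets.columns.tolist())   # ['Pet Name', 'Species', 'weight']
```

The number passed to `skiprows` is the count of rows that come before the header row, which is why it helps to look at the file first.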


## Additional Resources
- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
- https://pandas.pydata.org/docs/reference/api/pandas.read_clipboard.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_clipboard.html
- https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
- Other formats https://pandas.pydata.org/docs/user_guide/io.html
    - SQL
    - XML
    - STATA
    - SAS
    - SPSS

# Exercises

- Use `pd.read_json` to read the Dolly Parton quotes into a DataFrame named `dolly`.
  Dolly Parton quotes: https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/4c0eef2e4cbce5e47b674e8d1d5bad34f0c7b757/dolly.json

- Read the Bob Ross quotes into a DataFrame named `bob`.
  Bob Ross quotes in JSON: https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/bob_ross.json

- Make a dictionary using the keys "quote" and "author" and provide a quote of your choice. Be sure to wrap the dictionary in square brackets. Use `pd.DataFrame` to turn this list containing the single dictionary into a one row DataFrame. Name your new DataFrame `my_quote`.

    - Next, use `pd.concat` to combine all three DataFrames together in a new variable named `quotes`.

    - Use `to_csv` to write the quotes DataFrame to disk, providing the file name `quotes.csv`.

- Read this `drinks JSON` into a DataFrame called `drinks`
  https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/drinks.json

- Now, read in the beverage cost CSV into a DataFrame called `drink_costs`
  https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/drink_cost.csv

    - Combine these DataFrames, and overwrite the DataFrame called `drinks` using `pd.concat`

    - Finally, write your `drinks` DataFrame to disk using `.to_excel`. Name the file `drinks.xlsx`.


In [11]:
# Use pd.read_json to read the Dolly Parton quotes into a DataFrame named dolly. 
# https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/4c0eef2e4cbce5e47b674e8d1d5bad34f0c7b757/dolly.json
dolly = pd.read_json("https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/4c0eef2e4cbce5e47b674e8d1d5bad34f0c7b757/dolly.json")
dolly

Unnamed: 0,quote,author
0,"We cannot direct the wind, but we can adjust t...",Dolly Parton
1,Find out who you are and do it on purpose,Dolly Parton
2,"If you don't like the road you're walking, sta...",Dolly Parton
3,You'll never do a whole lot unless you're brav...,Dolly Parton
4,I'm not going to limit myself just because peo...,Dolly Parton
5,I think everybody has the right to be who they...,Dolly Parton


In [13]:
# Read the Bob Ross quotes into a DataFrame named bob.
# https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/bob_ross.json
bob = pd.read_json("https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/bob_ross.json")
bob

Unnamed: 0,quote,author
0,"We don't make mistakes, just happy little acci...",Bob Ross
1,"Talent is a pursued interest. In other words, ...",Bob Ross
2,"Anything that you try and you don't succeed, i...",Bob Ross


In [15]:
# Make a dictionary using the keys "quote" and "author" and provide a quote of your choice.
# Be sure to wrap the dictionary in square brackets. Use pd.DataFrame to turn this list containing the single dictionary into a one row DataFrame. 
# Name your new DataFrame my_quote.
my_dict = {
        "quote": "asta la vista",
        "author": "Terminator"
}
my_dict

{'quote': 'asta la vista', 'author': 'Terminator'}

In [17]:
# Wrapping the dictionary in square brackets makes a list containing one dictionary.
# pd.DataFrame treats each dictionary in a list as one row, with the keys as column names.
# Passing the bare dictionary of scalar values instead raises a ValueError,
# because pandas cannot tell how many rows to create.
my_quote = pd.DataFrame([my_dict])
my_quote

Unnamed: 0,quote,author
0,asta la vista,Terminator
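
The reason for the square brackets can be sketched with a placeholder record. The names below (`record`, `one_row`, `by_columns`) are hypothetical and used only for this illustration.

```python
import pandas as pd

record = {"quote": "placeholder quote", "author": "placeholder author"}

# A list of dictionaries: each dictionary becomes one row
one_row = pd.DataFrame([record])
print(one_row.shape)   # (1, 2)

# A bare dictionary of scalar values is rejected
scalar_dict_failed = False
try:
    pd.DataFrame(record)
except ValueError as error:
    scalar_dict_failed = True
    print(f"ValueError: {error}")

# A dictionary whose values are lists works, because each list is a column
by_columns = pd.DataFrame(
    {"quote": ["placeholder quote"], "author": ["placeholder author"]}
)
print(one_row.equals(by_columns))   # True
```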


In [19]:
# use pd.concat to combine all three DataFrames together in a new variable named quotes
quotes = pd.concat([dolly, bob, my_quote], ignore_index=True)
quotes

Unnamed: 0,quote,author
0,"We cannot direct the wind, but we can adjust t...",Dolly Parton
1,Find out who you are and do it on purpose,Dolly Parton
2,"If you don't like the road you're walking, sta...",Dolly Parton
3,You'll never do a whole lot unless you're brav...,Dolly Parton
4,I'm not going to limit myself just because peo...,Dolly Parton
5,I think everybody has the right to be who they...,Dolly Parton
6,"We don't make mistakes, just happy little acci...",Bob Ross
7,"Talent is a pursued interest. In other words, ...",Bob Ross
8,"Anything that you try and you don't succeed, i...",Bob Ross
9,asta la vista,Terminator


In [23]:
# Use to_csv to write the quotes DataFrame to disk, providing the file name quotes.csv
quotes.to_csv("../datasets/quotes.csv", index=False)
# Read the file back to confirm what was written
pd.read_csv("../datasets/quotes.csv")

Unnamed: 0,quote,author
0,"We cannot direct the wind, but we can adjust t...",Dolly Parton
1,Find out who you are and do it on purpose,Dolly Parton
2,"If you don't like the road you're walking, sta...",Dolly Parton
3,You'll never do a whole lot unless you're brav...,Dolly Parton
4,I'm not going to limit myself just because peo...,Dolly Parton
5,I think everybody has the right to be who they...,Dolly Parton
6,"We don't make mistakes, just happy little acci...",Bob Ross
7,"Talent is a pursued interest. In other words, ...",Bob Ross
8,"Anything that you try and you don't succeed, i...",Bob Ross
9,asta la vista,Terminator


In [24]:
# Read this drinks JSON into a DataFrame called drinks
# https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/drinks.json
drinks = pd.read_json("https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/drinks.json")
drinks

Unnamed: 0,type,calories,number_consumed,caffeinated
0,water,0,7,0
1,orange juice,220,4,0
2,gatorade,140,1,0
3,cappuccino,350,2,1
4,hot tea,5,3,1


In [25]:
# read in the beverage cost CSV into a DataFrame called drink_costs
# https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/drink_cost.csv
drink_costs = pd.read_csv("https://gist.githubusercontent.com/ryanorsinger/ad042d8ee4340ae7026e215bc6b69665/raw/b0c1c816d87e4d3db34e52d35e376394f689911e/drink_cost.csv")
drink_costs

Unnamed: 0,price,tax
0,0.0,0.0
1,2.99,0.06
2,1.99,0.07
3,3.99,0.07
4,0.99,0.07


In [26]:
# Combine these DataFrames, and overwrite the DataFrame called drinks using pd.concat
# axis=1 places the cost columns beside the drink columns, matching rows by index.
# The default (axis=0) would stack the rows and fill the non-shared columns with NaN.
drinks = pd.concat([drinks, drink_costs], axis=1)
drinks

Unnamed: 0,type,calories,number_consumed,caffeinated,price,tax
0,water,0,7,0,0.0,0.0
1,orange juice,220,4,0,2.99,0.06
2,gatorade,140,1,0,1.99,0.07
3,cappuccino,350,2,1,3.99,0.07
4,hot tea,5,3,1,0.99,0.07


In [27]:
# write your drinks DataFrame to disk using .to_excel. Name the file drinks.xlsx
# index=False keeps the autogenerated index out of the spreadsheet
drinks.to_excel("../datasets/drinks.xlsx", index=False)
pd.read_excel("../datasets/drinks.xlsx")

Unnamed: 0,type,calories,number_consumed,caffeinated,price,tax
0,water,0,7,0,0.0,0.0
1,orange juice,220,4,0,2.99,0.06
2,gatorade,140,1,0,1.99,0.07
3,cappuccino,350,2,1,3.99,0.07
4,hot tea,5,3,1,0.99,0.07
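
The difference between stacking rows and placing columns side by side with `pd.concat` can be sketched on two tiny DataFrames. The names `drinks_demo` and `costs_demo` are hypothetical, and the sketch assumes that row `i` of one frame describes the same drink as row `i` of the other.

```python
import pandas as pd

drinks_demo = pd.DataFrame({"type": ["water", "hot tea"], "calories": [0, 5]})
costs_demo = pd.DataFrame({"price": [0.0, 0.99], "tax": [0.0, 0.07]})

# axis=0 (the default) stacks rows; columns missing from a frame become NaN
stacked = pd.concat([drinks_demo, costs_demo])
print(stacked.shape)                       # (4, 4)
print(int(stacked.isna().sum().sum()))     # 8

# axis=1 aligns rows by index and places the columns side by side
side_by_side = pd.concat([drinks_demo, costs_demo], axis=1)
print(side_by_side.shape)                  # (2, 4)
print(int(side_by_side.isna().sum().sum()))  # 0
```

When two frames hold different columns for the same rows, `axis=1` is usually the form that avoids the block of missing values.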
