In [1]:
%pip install matplotlib numpy openpyxl pandas



In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Pandas DataFrames

## What is a DataFrame?

A DataFrame, simply put, is a **table** of data.  It contains multiple rows, and every row holds the same labelled collection of fields, each field with its own data type.  The rows are labelled too; the row labels are called the "index". For example, a DataFrame might look like this:

| (index) | Name | Age | Height | LikesIceCream |
| :---: | :--: | :--: | :--: | :--: |
| 0     | "Nick" | 22 | 3.4 | True |
| 1     | "Jenn" | 55 | 1.2 | True |
| 2     | "Joe"  | 25 | 2.2 | True |

Because each row contains the same fields, a DataFrame can also be thought of as a collection of same-length columns!

**Pandas** is a Python package that provides a DataFrame class.  Using either the **DataFrame** class constructor or one of Pandas' many `read_*()` functions, you can make your own DataFrame from a variety of sources.
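To make the "collection of same-length columns" idea concrete, here is a minimal sketch that rebuilds the example table and pulls out a single column. The variable names (`example`, `ages`) are illustrative only, and the constructor call used here is explained in the next section.

```python
import pandas as pd

# Same values as the example table above.
example = pd.DataFrame({
    "Name": ["Nick", "Jenn", "Joe"],
    "Age": [22, 55, 25],
    "Height": [3.4, 1.2, 2.2],
    "LikesIceCream": [True, True, True],
})

ages = example["Age"]         # a single column, returned as a pandas Series
print(type(ages).__name__)    # the class name of a column
print(example.shape)          # (number of rows, number of columns)
print(example.dtypes)         # each column has its own data type
print(list(example.index))    # the default row labels (the "index")
```

Every column has the same number of entries as the DataFrame has rows, which is what makes the table rectangular.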

## Making DataFrames Directly

### Examples of Different Ways

#### From a List of Dicts

A dict is a collection of named values.  If you have a list of dicts that share (mostly) the same keys, the DataFrame constructor can convert it to a DataFrame, with one row per dict and one column per key:

In [6]:
friends = [
    {'Name': "Nick", "Age": 31, "Height": 2.9, "Weight": 20},
    {'Name': "Jenn", "Age": 55, "Height": 1.2},
    {"Name": "Joe", "Height": 1.2, "Age": 25, },
]
pd.DataFrame(friends)

| (index) | Name | Age | Height | Weight |
| :---: | :--: | :--: | :--: | :--: |
| 0     | Nick | 31 | 2.9 | 20.0 |
| 1     | Jenn | 55 | 1.2 | NaN |
| 2     | Joe  | 25 | 1.2 | NaN |
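In the output above, only Nick has a `Weight` value. The sketch below, using a smaller hypothetical `records` list, shows how Pandas handles a key that is missing from some dicts: the gap is filled with `NaN`, and the column is stored as floating point so it can hold that missing marker.

```python
import pandas as pd

records = [
    {"Name": "Nick", "Weight": 20},
    {"Name": "Jenn"},               # no "Weight" key in this dict
]
df_missing = pd.DataFrame(records)

missing_mask = df_missing["Weight"].isna()   # True where a value is missing
print(missing_mask.tolist())
print(df_missing["Weight"].dtype)            # the integer 20 is stored as a float
```

This is also why the weight is displayed as `20.0` rather than `20`.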


#### From a Dict of Lists

In [7]:
df = pd.DataFrame({
    'Name': ['Nick', 'Jenn', 'Joe'], 
    'Age': [31, 55, 25], 
    'Height': [2.9, 1.2, 1.2],
})
df

| (index) | Name | Age | Height |
| :---: | :--: | :--: | :--: |
| 0     | Nick | 31 | 2.9 |
| 1     | Jenn | 55 | 1.2 |
| 2     | Joe  | 25 | 1.2 |


#### From a List of Lists

If you have a collection of same-length sequences, you essentially have a rectangular data structure already!  All that's needed is to add some column labels.

In [8]:
friends = [
    ['Nick', 31, 2.9],
    ['Jenn', 55, 1.2],
    ['Joe',  25, 1.2],
]
pd.DataFrame(friends, columns=["Name", "Age", "Height"])

| (index) | Name | Age | Height |
| :---: | :--: | :--: | :--: |
| 0     | Nick | 31 | 2.9 |
| 1     | Jenn | 55 | 1.2 |
| 2     | Joe  | 25 | 1.2 |
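As a small sketch of what the `columns` argument is doing, the example below builds a DataFrame from a list of lists without it and then attaches the labels afterwards. The names `rows`, `unlabelled`, and `default_labels` are illustrative.

```python
import pandas as pd

rows = [
    ["Nick", 31, 2.9],
    ["Jenn", 55, 1.2],
]

# Without column labels, Pandas numbers the columns itself.
unlabelled = pd.DataFrame(rows)
default_labels = list(unlabelled.columns)
print(default_labels)

# Labels can also be assigned after the DataFrame has been created.
unlabelled.columns = ["Name", "Age", "Height"]
print(list(unlabelled.columns))
```

Passing `columns=` to the constructor, as in the cell above, does both steps at once.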


#### From an Empty DataFrame

If you prefer, you can also add columns one at a time, starting with an empty DataFrame:

In [9]:
df = pd.DataFrame()
df['Name'] = ['Nick', 'Jenn', 'Joe']
df['Age'] = [31, 55, 25]
df['Height'] = [2.9, 1.2, 1.2]
df

| (index) | Name | Age | Height |
| :---: | :--: | :--: | :--: |
| 0     | Nick | 31 | 2.9 |
| 1     | Jenn | 55 | 1.2 |
| 2     | Joe  | 25 | 1.2 |
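One thing to watch for when adding columns one at a time: once the first column has fixed the number of rows, every later column must have the same length. The sketch below (with the illustrative names `df_cols` and `message`) shows what happens when a list is too short.

```python
import pandas as pd

df_cols = pd.DataFrame()
df_cols["Name"] = ["Nick", "Jenn", "Joe"]   # the DataFrame now has 3 rows

message = None
try:
    df_cols["Age"] = [31, 55]               # only 2 values for 3 rows
except ValueError as err:
    message = str(err)
    print("Could not add column:", message)
```

Pandas raises a `ValueError` instead of guessing which row should be left empty.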


**Exercises**: Making DataFrames from Scratch

Please use Pandas to recreate the table here as a DataFrame using one of the approaches detailed above:

| Year | Product | Cost |
| :--: | :----:  | :--: |
| 2015 | Apples  | 0.35 |
| 2016 | Apples  | 0.45 |
| 2015 | Bananas | 0.75 |
| 2016 | Bananas | 1.10 |

*(3-minute Discussion)*: Which approach did you choose?  What did you like about it?

### Reading Data from Files into a DataFrame


| File Format | File Extension | `read_xxx()` Function | DataFrame Write Method |
| :--:  | :--: | :--: | :--: |
| Comma-Separated Values      | .csv           | `pd.read_csv()` | `df.to_csv()` |
| Tab-Separated Values        | .tsv, .tabular, .csv | `pd.read_csv(sep='\t')`, `pd.read_table()` | `df.to_csv(sep='\t')` |
| Excel Spreadsheet           |  .xls | `pd.read_excel()`                    | `df.to_excel()`  |
| Excel Spreadsheet (2007+)   | .xlsx | `pd.read_excel(engine='openpyxl')`   | `df.to_excel(engine='openpyxl')` |
| JSON                        | .json | `pd.read_json()`                     | `df.to_json()` |
| Tables in a Web Page (HTML) | .html | `pd.read_html()[0]`                  | `df.to_html()` |
| HDF5 | .hdf5, .h5 | `pd.read_hdf()` |  `df.to_hdf()` |
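As a rough sketch of how the read and write functions pair up, the example below writes a small DataFrame as comma-separated and tab-separated text and reads both back. To keep it self-contained it uses in-memory text (`io.StringIO`) instead of files on disk; the names `friends_df`, `csv_text`, and `tsv_text` are illustrative.

```python
import io
import pandas as pd

friends_df = pd.DataFrame({
    "Name": ["Nick", "Jenn", "Joe"],
    "Age": [31, 55, 25],
    "Height": [2.9, 1.2, 1.2],
})

# With no file path, to_csv() returns the text it would have written.
csv_text = friends_df.to_csv(index=False)             # comma-separated
tsv_text = friends_df.to_csv(index=False, sep="\t")   # tab-separated

# read_csv() accepts any file-like object, not just a path.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(csv_text)
print(from_csv.equals(from_tsv))   # do both round trips give the same table?
```

The only difference between the two formats is the separator character, which is why the same `to_csv()` and `read_csv()` pair handles both.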

In [37]:
import pandas as pd

### Understanding Different File Formats



**Exercises**: "Roundtripping" write-read

Run the code below to download the Titanic passengers dataset.  The exercises that follow save it in different file formats and read it back.

*Note*: Yep, that's right, you can supply a web URL and pandas reads it like a normal file!

In [38]:
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)
df[:5]

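| (index) | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
| :---: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |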


Now run the code below to save the DataFrame to a comma-separated values (CSV) file using the `DataFrame.to_csv()` method, then use a text editor to examine the file that was saved on the computer.  How is the file structured?

In [47]:
df.to_csv("titanic2001.csv", index=False)

Now read the file back into Python using the `pd.read_csv()` function.
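The `index=False` argument in the `to_csv()` call above matters for the round trip. The sketch below, which uses a small stand-in DataFrame (`small`) and in-memory text rather than the Titanic file, compares what `pd.read_csv()` gives back with and without it.

```python
import io
import pandas as pd

small = pd.DataFrame({"Name": ["Nick", "Jenn"], "Age": [31, 55]})

# Default: the row labels are written out as an extra, unnamed first column.
with_index = pd.read_csv(io.StringIO(small.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(small.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

When the row labels are written out, the header of that column is empty, so Pandas names it `Unnamed: 0` on the way back in.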

**JSON**

Save the dataframe to a JSON file using the `df.to_json()` method.

Read the JSON file into Pandas again, using the `pd.read_json()` function.

Open the JSON file in a text editor.  What does it look like?  In what ways is it different from the CSV file?
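For a feel of the JSON layout before opening the full file, here is a minimal sketch on a two-row stand-in DataFrame (`small`), again using in-memory text instead of a file. The `orient="records"` call is included only as a contrast; it is an optional argument of `to_json()`, not something the exercise requires.

```python
import io
import pandas as pd

small = pd.DataFrame({"Name": ["Nick", "Jenn"], "Age": [31, 55]})

# With no file path, to_json() returns the JSON text.
json_text = small.to_json()
print(json_text)       # default layout: one entry per column, keyed by row label

records_text = small.to_json(orient="records")
print(records_text)    # alternative layout: one entry per row

# read_json() accepts a file-like object as well as a path.
restored = pd.read_json(io.StringIO(json_text))
print(restored)
```

Unlike CSV, the column names are repeated inside the data itself, either once per column or once per row depending on the layout.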

**HTML**

Save the dataframe to an HTML file, using the `df.to_html()` method.

Read the HTML file into Pandas again, using the `pd.read_html()` function.  

*Note*: Because an HTML file can contain multiple tables, you'll get a `list` of dataframes instead of just one.  Just add `[0]` onto the end of the line to get the first dataframe, and it will look the same as before.  `pd.read_html()` also needs an HTML parser package, such as `lxml`, to be installed.

Try opening the file with a text editor, then with a web browser (e.g. Chrome, Firefox, etc).  What does the file look like in each case?

**Excel**

*Note*: Pandas relies on an extra package to read and write XLSX files, so you may need to install it for this to work:

`%pip install openpyxl`

Save the dataframe to an Excel file, using the `df.to_excel(engine='openpyxl')` method.

Read the Excel file into Pandas again, using the `pd.read_excel(engine='openpyxl')` function.

Open the file in a text editor.  What does it look like?  Does it even open?

Open it in your spreadsheet program.  What does it look like?