# Introduction to pandas

numpy is great for homogeneous numerical data.

pandas is designed for manipulating tabular data that mixes data types (much like an Excel spreadsheet).

This Jupyter notebook provides only an overview of basic pandas functionality.  For more detailed information, here are a few good resources:
	
- “Python Data Science Handbook: Essential Tools for Working with Data” by Jake VanderPlas
- “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython 2nd Edition” by Wes McKinney (the creator of pandas)
- https://realpython.com/python-data-cleaning-numpy-pandas/
		
---

### Outline:
- Series vs. DataFrame

- Working with Series
	- creating a Series
	- indices
	- filtering
	
- Working with DataFrames
	- creating a DataFrame
	- indices
	- filtering
	
- Importing data from .csv

- Exporting to .csv

---

- Examples (in a separate Jupyter notebook):
    - In-class Exercise
    - Daily Show Guests
    - Scraping HTML Data
    
---     

In [None]:
import pandas as pd
import numpy as np

## Series

A pandas "Series" is a one-dimensional object, like an array, in which each value is paired with an index label.

In [None]:
mySeries1 = pd.Series([8, 3 , -6, 7])
mySeries1

- The left column shows the "indices".  By default, these will run from 0 to (number of entries - 1). 

- The right column shows the "values".


In [None]:
# We can extract just the values:
mySeries1.values

In [None]:
# We can also look at the indices:
mySeries1.index

- This is like range(0,len(mySeries1))

One useful pandas feature is that we can define custom indices:

In [None]:
mySeries2 = pd.Series([8, 3, -6, 7], index=['c', 'a', 'b', 'xyz'])
mySeries2

Take a look at the 3rd row:

In [None]:
# We can use the index name:
mySeries2['b']

In [None]:
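# This returns the same value, but uses the row position.
# (Plain mySeries2[2] is deprecated in recent pandas versions; iloc is explicit about position.)
mySeries2.iloc[2]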

In [None]:
# Filter for values > 1:
mySeries2[mySeries2 > 1]
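A filter can also combine several conditions. The following is a minimal sketch that reuses the values from mySeries2; the variable names (s, between, extremes) are only illustrative.

```python
import pandas as pd

# Same values and index labels as mySeries2 above.
s = pd.Series([8, 3, -6, 7], index=['c', 'a', 'b', 'xyz'])

# Each comparison produces a boolean Series.
# Combine them with & (and) or | (or), wrapping each comparison in parentheses.
between = s[(s > 1) & (s < 8)]      # values strictly between 1 and 8
extremes = s[(s < 0) | (s >= 8)]    # negative values, or values of at least 8

print(between)
print(extremes)
```

Note that Python's "and" / "or" keywords don't work here, because they can't be applied element by element.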

In [None]:
# We can create a Series from a python dictionary:
myDict = {'HW1': 90, 'Exam 1': 77, 'Project': 88, 'HW2': 66}

mySeries3 = pd.Series(myDict)
mySeries3

- Note that current versions of pandas keep the dictionary's insertion order (older versions of pandas sorted by key/index instead)

In [None]:
# We can provide an explicit ordering of indices:
assignments = ['HW1', 'HW2', 'HW3', 'Exam 1', 'Project']
mySeries4 = pd.Series(myDict, index=assignments)
mySeries4

- Note that index 'HW3' doesn't appear in myDict.  "NaN" stands for "Not a Number"; it represents a null/missing value.

In [None]:
pd.isnull(mySeries4)

In [None]:
pd.notnull(mySeries4)
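Once we know where the missing values are, we usually want to either drop them or fill them in. Here is a small sketch of both options, using the same gradebook data as above; the fill value of 0 is just an assumption for illustration.

```python
import pandas as pd

# Same data as mySeries4: 'HW3' has no entry in the dictionary, so it becomes NaN.
scores = pd.Series({'HW1': 90, 'Exam 1': 77, 'Project': 88, 'HW2': 66},
                   index=['HW1', 'HW2', 'HW3', 'Exam 1', 'Project'])

missing = scores[scores.isnull()]   # keep only the missing entries
present = scores.dropna()           # drop the missing entries
filled = scores.fillna(0)           # replace missing entries with 0

print(missing)
print(present)
print(filled)
```

None of these calls modify scores itself; each one returns a new Series.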

#### We can give a Series a name:

In [None]:
# Here's the original:
mySeries1

In [None]:
# Now we've given the series a name:
mySeries1.name = 'my numbers'
mySeries1

#### We can also change the indices:

In [None]:
mySeries1.index = ['a', 'x', 'b', 'z']
mySeries1

### Series Indexing

In [None]:
mySeries5 = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
mySeries5

In [None]:
# We can use the row positions:
mySeries5[1:2]

In [None]:
mySeries5[1:3] 

- Note that [1:3] only returns the rows at positions 1 and 2 (the end position, 3, is excluded)

In [None]:
# We can also use the index labels:
mySeries5['b':'c']

- **Note that 'c' is included in the output when we use index labels.**
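To make the difference explicit, we can spell out which kind of slicing we mean with iloc (positions) and loc (labels). This is a short sketch using the same data as mySeries5.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_position = s.iloc[1:3]    # positions 1 and 2; the end position is excluded
by_label = s.loc['b':'c']    # labels 'b' through 'c'; the end label is included

print(by_position)
print(by_label)
```

Both slices return the rows labeled 'b' and 'c', but the code now states which rule applies.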

## DataFrame

A pandas "DataFrame" represents a table of data.

Unlike a numpy ndarray, each column in a pandas DataFrame can contain a different type of data.

*This example comes from Wes McKinney's book.*

In [None]:
# Suppose we already have some data in the form of a dictionary:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 
		'year': [2000, 2001, 2002, 2001, 2002, 2003], 
		'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [None]:
# Convert this to a pandas DataFrame:
frame1 = pd.DataFrame(data)
frame1
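To see that each column really does carry its own data type, we can inspect the dtypes attribute. This sketch rebuilds the same table so that it runs on its own.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

print(frame.shape)     # (number of rows, number of columns)
print(frame.dtypes)    # one data type per column

# Use frame['pop'] rather than frame.pop: "pop" is also the name of a DataFrame method.
print(frame['pop'].dtype)
```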

In [None]:
# Look at the first 5 rows:
frame1.head()

In [None]:
# Look at the last 5 rows:
frame1.tail()

In [None]:
# We can specify the order in which columns are displayed:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

In [None]:
# Let's create another dataframe.
# We've added a new column (debt).
# We've also specified the index names.
frame2 = pd.DataFrame(data, columns = ['year', 'state', 'pop', 'debt'], 
			index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2

In [None]:
# Assign a scalar value to all rows in a given column:
frame2['debt'] = 16.5
frame2

In [None]:
# Assign the values of a column via a list or array:
frame2['debt'] = np.arange(6)
frame2

In [None]:
# Neither of the following will work, because the array length doesn't match the number of rows in frame2.
# Each assignment raises a ValueError (so the second line is never reached):
frame2['debt'] = np.arange(7)
frame2['debt'] = np.arange(3)

In [None]:
# However, if we assign a pandas Series to a DataFrame column, pandas will fill in the gaps with NaN:
val = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val
frame2

In [None]:
# Add a new column:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

In [None]:
# Remove a column:
del frame2['eastern']
frame2
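The del statement changes the DataFrame in place. If we would rather keep the original untouched, drop() returns a new DataFrame instead. A minimal sketch, using a shortened version of the table above:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

# drop() leaves frame unchanged and returns a copy without the column.
trimmed = frame.drop(columns=['eastern'])

print(frame.columns)
print(trimmed.columns)
```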

### DataFrame Indexing

In [None]:
# Get a list of all columns:
frame2.columns

In [None]:
# Retrieving a specific column:
frame2['year']

In [None]:
# Retrieving a specific row:
# a) by row index name, using "loc"
frame2.loc['one']

In [None]:
frame2.loc['one':'four']			# Note that 'four' is included

In [None]:
frame2.loc[['one', 'four']]

In [None]:
# b) by integer row position, using "iloc"
frame2.iloc[0]

In [None]:
frame2.iloc[0:3]			# Note that 'four' is NOT included

In [None]:
frame2.iloc[[0, 3]]

In [None]:
# Select a subset of rows and columns:
frame2.loc['one', ['year', 'pop']]

In [None]:
frame2.loc['one', 'year':'pop']

In [None]:
frame2.loc['one':'three', 'year':'pop']
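Rows of a DataFrame can also be filtered with a boolean condition, just like a Series. The sketch below rebuilds the state/year/pop table with the same row labels; the variable names are only illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five', 'six'])

# Keep the rows where a condition on one column is True:
recent = frame[frame['year'] > 2001]

# Combine conditions with & (and) or | (or); each condition needs parentheses:
ohio_recent = frame[(frame['state'] == 'Ohio') & (frame['year'] > 2000)]

# Filter rows and select columns at the same time with loc:
nevada_pop = frame.loc[frame['state'] == 'Nevada', ['year', 'pop']]

print(recent)
print(ohio_recent)
print(nevada_pop)
```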

#### Caution:  Integer indices can be ambiguous:

In [None]:
ser = pd.Series(np.arange(3.))
ser

In [None]:
ser[-1]        # This raises a KeyError: -1 is treated as a label, and no row has that label

In [None]:
ser.iloc[-1]      # Using iloc removes the ambiguity
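The ambiguity is easiest to see when the integer labels don't match the row positions. In this sketch the labels 10, 20, 30 are made up for illustration.

```python
import pandas as pd

s = pd.Series(['x', 'y', 'z'], index=[10, 20, 30])

first_by_label = s.loc[10]       # label 10 is the first row
first_by_position = s.iloc[0]    # position 0 is also the first row
last_by_position = s.iloc[-1]    # negative positions count from the end

print(first_by_label, first_by_position, last_by_position)

# There is no row labeled 0, even though there is a row at position 0:
print(0 in s.index)
```

Using loc for labels and iloc for positions means we never have to guess which one pandas will choose.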

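#### It's safer to assign non-integer indices, so that labels and positions can't be confused:

In [None]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2.iloc[-1]      # Plain ser2[-1] is deprecated in recent pandas versions; iloc is explicit about position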

### Sorting Series

In [None]:
mySeries = pd.Series([3, 9, -5, 2])
mySeries

In [None]:
# sort_values() goes in ascending order by default:
mySeries.sort_values()

In [None]:
# We can force descending order:
mySeries.sort_values(ascending=False)

### Sorting DataFrames

In [None]:
myFrame = pd.DataFrame([[3, 1, 7], [2, 9, 4]], index=['a', 'b'], columns=['x', 'y', 'z'])
myFrame

In [None]:
# Sort by column 'x', in ascending order (default):
myFrame.sort_values(by=['x'])

In [None]:
# Sort by column 'x', in descending order:
myFrame.sort_values(by=['x'], ascending=False)
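We can also sort by more than one column, or by the index instead of the values. The table below is a small made-up example, not part of the data used elsewhere in this notebook.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada', 'Ohio', 'Nevada'],
                      'year': [2001, 2002, 2000, 2001]},
                     index=['d', 'b', 'a', 'c'])

# Sort by state (ascending), then by year (descending) within each state:
by_state_year = frame.sort_values(by=['state', 'year'], ascending=[True, False])

# Sort by the row labels instead of by a column:
by_index = frame.sort_index()

print(by_state_year)
print(by_index)
```

Both methods return a new DataFrame; frame keeps its original row order.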

## Importing Data from .csv

#### First, suppose we have a .csv file, named "example_with_header.csv".

In [None]:
!cat example_with_header.csv     # For Mac/Linux
# !type example_with_header.csv    # For Windows

In [None]:
# Option 1:  Use "read_csv()"
df = pd.read_csv('example_with_header.csv')
df

In [None]:
# Option 2:  Use "read_table()" and specify the delimiter:
pd.read_table('example_with_header.csv', sep=",")

#### Now, suppose our file doesn't have a header row.

In [None]:
!cat example_without_header.csv     # For Mac/Linux
# !type example_without_header.csv    # For Windows

In [None]:
# Option 1:  Use "read_csv()" and let pandas assign column names:
pd.read_csv('example_without_header.csv', header=None)

In [None]:
# Option 2: Use "read_csv()" and explicitly assign column names:
pd.read_csv('example_without_header.csv', names=['a', 'b', 'c', 'd', 'message'])

In [None]:
# Option 3:  Use "read_table()" and assign column names:
pd.read_table('example_without_header.csv', sep=",", names=['a', 'b', 'c', 'd', 'message'])
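If the .csv files aren't available, the same options can be tried on text held in memory. In this sketch the file contents are hypothetical (we don't show the real contents of example_without_header.csv here), and io.StringIO stands in for the file.

```python
import io
import pandas as pd

# Hypothetical contents of a headerless .csv file.
csv_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# Without header=None, pandas treats the first data row as the header:
wrong = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data and numbers the columns 0, 1, 2, ...
default_names = pd.read_csv(io.StringIO(csv_text), header=None)

# names=... keeps every row as data and uses our own column names:
custom_names = pd.read_csv(io.StringIO(csv_text), names=['a', 'b', 'c', 'd', 'message'])

print(wrong)
print(default_names)
print(custom_names)
```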

#### Now, suppose we want the last column to actually be our index

In [None]:
df2 = pd.read_csv('example_without_header.csv', names=['a', 'b', 'c', 'd', 'message'], index_col='message')
df2

In [None]:
df2.loc['world']

## Exporting DataFrame to .csv

In [None]:
# Here's our original DataFrame:
df

In [None]:
# By default, "to_csv()" also includes the index:
df.to_csv('out1.csv')

In [None]:
!cat out1.csv
#!type out1.csv

In [None]:
# Now, if we re-read this .csv, we get an extra column:
pd.read_csv('out1.csv')

In [None]:
# Instead, we can specify that we don't want to export the index:
df.to_csv('out2.csv', index=False)

In [None]:
!cat out2.csv
#!type out2.csv

In [None]:
pd.read_csv('out2.csv')
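If we do want to keep the index in the file, another option is to tell read_csv() which column holds it when we read the file back. This sketch writes to an in-memory buffer rather than a file, and the small DataFrame is made up for illustration.

```python
import io
import pandas as pd

df_small = pd.DataFrame({'a': [1, 5], 'message': ['hello', 'world']})

# Write the DataFrame, including its index, to an in-memory text buffer.
buffer = io.StringIO()
df_small.to_csv(buffer)
csv_text = buffer.getvalue()

# Reading it back naively turns the index into an extra column:
extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas that the first column is the index:
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(extra.columns)
print(restored)
```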