# 3.2: Intro to Pandas

## Importing the Pandas Library

* If you want to use `pandas`, make sure to include the following import statement at the top of your code file:

In [5]:
import pandas as pd

## About `pandas`

* The name is derived from "panel data"
* A library (like `numpy`) for data science
* Built on top of `numpy`
* One of the major shortcomings of `numpy` is the lack of label-based indexing; `pandas` fixes that issue
* `pandas` also has great functionality for indexing, cleaning, stats, etc.
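To make the label-based indexing point concrete, here is a minimal sketch contrasting a plain `numpy` array with a `pandas` Series. The temperature values and day labels are made-up sample data, not part of these notes.

```python
import numpy as np
import pandas as pd

# A plain NumPy array only knows integer positions.
temps_arr = np.array([71.5, 64.0, 80.2])
print(temps_arr[1])  # look up by position

# A pandas Series attaches a label to each value.
# The day names are hypothetical sample labels.
temps_ser = pd.Series([71.5, 64.0, 80.2], index=['Mon', 'Tue', 'Wed'])
print(temps_ser['Tue'])  # same value, looked up by label
```

Both lookups return the same number; the Series version simply lets us ask for it by name.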


There are two major objects in `pandas`:

1. `pd.Series()`: a `pandas` object for 1D data
2. `pd.DataFrame()`: a `pandas` object for 2D data

## `pd.Series()` Objects

### Creating and Managing a `Series` object

First let's create some sample data to use:

In [6]:
populations = [219190, 744955, 147599, 2010]
cities = ['Spokane', 'Seattle', 'Bellevue', 'Leavenworth']

By convention, we use `ser` in the variable name of a `Series` object and `df` in the variable name of a `DataFrame`.

We can create a Series object using the Series constructor:

In [7]:
pop_ser = pd.Series(populations, index=cities)
print(pop_ser)

Spokane        219190
Seattle        744955
Bellevue       147599
Leavenworth      2010
dtype: int64


Now we can index into `pop_ser` using a string key!

In [8]:
print(pop_ser['Spokane'])

219190


## Indexing/Slicing a Pandas Series

In [9]:
pop_ser.name = "Population"
print(pop_ser)

Spokane        219190
Seattle        744955
Bellevue       147599
Leavenworth      2010
Name: Population, dtype: int64


We can use a label to index into the Series and get a single value (as we saw using `print(pop_ser['Spokane'])`). We can also pass in a list of keys to get back multiple rows as a new Pandas Series object. This is called **multiple indexing**.

In [10]:
print(pop_ser[["Seattle", "Leavenworth"]])

Seattle        744955
Leavenworth      2010
Name: Population, dtype: int64


We can still slice into our Series object using the slicing operator `:`. However, with label-based indexing THE STOP IS INCLUSIVE (which is not how slicing normally works in Python).

In [11]:
print(pop_ser['Seattle':'Leavenworth'])

Seattle        744955
Bellevue       147599
Leavenworth      2010
Name: Population, dtype: int64


We can still use position-based indexing with integers, like we do with normal lists and strings, by using `iloc`. Multiple indexing works the same way here as well. Slicing also works, but now we are back to normal Python behavior where the stop is NOT INCLUDED.

In [17]:
print(pop_ser.iloc[1])
print('\n',pop_ser.iloc[[1,3]])
print('\n', pop_ser.iloc[1:3])

744955

 Seattle        744955
Leavenworth      2010
Name: Population, dtype: int64

 Seattle     744955
Bellevue    147599
Name: Population, dtype: int64
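The difference between the two slicing rules is easy to miss, so here is a small side-by-side sketch. It rebuilds the same Series as above and uses the explicit `.loc` accessor for label-based slicing, which behaves like the plain `[]` label slice shown earlier.

```python
import pandas as pd

pop_ser = pd.Series([219190, 744955, 147599, 2010],
                    index=['Spokane', 'Seattle', 'Bellevue', 'Leavenworth'],
                    name='Population')

# Label-based slice: the stop label is included.
by_label = pop_ser.loc['Seattle':'Leavenworth']

# Position-based slice: the stop position is excluded.
by_position = pop_ser.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```

Writing `.loc` and `.iloc` explicitly makes it clear to the reader which rule applies.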


## Stats Functions with Pandas

You don't have to write your own functions to compute the mean and median. `pandas` can compute these for you with a one-line method call!

In [21]:
print(pop_ser.mean())
print(pop_ser.std()) # this is a sample standard deviation (default ddof=1)
print(pop_ser.std(ddof=0)) # this is a population standard deviation

278438.5
323872.25871578854
280481.60362891894
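As a sanity check on which value is which, the sketch below compares `pandas` against Python's built-in `statistics` module, where `stdev` is the sample version and `pstdev` is the population version. It also shows `median()`, which is mentioned above but not demonstrated.

```python
import statistics
import pandas as pd

populations = [219190, 744955, 147599, 2010]
pop_ser = pd.Series(populations)

sample_std = pop_ser.std()            # default ddof=1 divides by n - 1
population_std = pop_ser.std(ddof=0)  # ddof=0 divides by n

print(sample_std, statistics.stdev(populations))
print(population_std, statistics.pstdev(populations))
print(pop_ser.median())
```

The sample standard deviation is always the larger of the two because it divides by a smaller number.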


## Adding New Data to a Series Object

Pandas Series objects are mutable, so we can add new rows (key/value pairs) whenever we want.

In [22]:
pop_ser['Pullman'] = 34019
print(pop_ser)

Spokane        219190
Seattle        744955
Bellevue       147599
Leavenworth      2010
Pullman         34019
Name: Population, dtype: int64
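Assignment works well for one new row at a time. For several rows at once, one option (a sketch, not the only approach) is to build a second Series and combine the two with `pd.concat`. The example starts from a shortened Series so that every value comes from the sample data above.

```python
import pandas as pd

pop_ser = pd.Series([219190, 744955], index=['Spokane', 'Seattle'],
                    name='Population')

# One new label: assignment enlarges the Series in place.
pop_ser['Pullman'] = 34019

# Several new labels: build a second Series and concatenate.
more_ser = pd.Series([147599, 2010], index=['Bellevue', 'Leavenworth'],
                     name='Population')
combined_ser = pd.concat([pop_ser, more_ser])
print(combined_ser)
```

Note that `pd.concat` returns a new Series and leaves `pop_ser` unchanged.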


## Creating Empty Series Objects

You can create empty Pandas Series Objects using the constructor method:

In [23]:
pop_ser2 = pd.Series()

  pop_ser2 = pd.Series()


Older versions of `pandas` emit a warning here (the output above is the tail of it) because we should specify the type of data this Series object will hold using `dtype`:

In [27]:
pop_ser2 = pd.Series(dtype=int)
print(pop_ser2)

Series([], dtype: int64)
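A quick way to confirm that a Series really is empty is sketched below. It passes the dtype as the string `'int64'` so the result does not depend on the platform's default integer size.

```python
import pandas as pd

empty_ser = pd.Series(dtype='int64')

print(empty_ser.empty)  # True when there are no rows
print(len(empty_ser))
print(empty_ser.dtype)
```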


## Pandas `DataFrame` Objects

DataFrame objects are used to store 2D data in `pandas`.

We can create a DataFrame object in many ways, for example:
* From a 2D list

In [32]:
twod_list = [[3,'a',4.5],[7,'b',10.99],[-10,'c',-1.5]]
header = ['col1', 'col2', 'col3']

# Creating a DataFrame from a 2D list
df = pd.DataFrame(twod_list, columns=header)
print(df)

   col1 col2   col3
0     3    a   4.50
1     7    b  10.99
2   -10    c  -1.50


Some notes on DataFrame terminology:

* column labels are called **columns**
* row labels are called **index**

In [35]:
row_index = ['row1','row2','row3']
df = pd.DataFrame(twod_list, columns=header, index=row_index)
print(df)

      col1 col2   col3
row1     3    a   4.50
row2     7    b  10.99
row3   -10    c  -1.50
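Another common way to build a DataFrame is from a dictionary that maps each column label to a list of values. The sketch below reuses the same sample values and then inspects the `columns` and `index` attributes that the terminology above refers to.

```python
import pandas as pd

# Each key becomes a column label; each list becomes that column's values.
data = {'col1': [3, 7, -10],
        'col2': ['a', 'b', 'c'],
        'col3': [4.5, 10.99, -1.5]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])

print(list(df.columns))  # column labels
print(list(df.index))    # row labels
print(df.shape)          # (number of rows, number of columns)
```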


## Example: Information About Cities

Later in this example we will need to read data from `regions.csv`.

To start, let's set what our column headers will be. We will need a city, the city's population, and a classification based on population (namely small, medium, or large).

In [47]:
header = ['City', 'Population', 'Class']
pop_data = [['Spokane',219190,'Medium'],
           ['Seattle',744966,'Large'],
           ['Bellevue',147599,'Medium'],
           ['Leavenworth',2010,'Small']]

pop_df = pd.DataFrame(pop_data, columns=header)
print(pop_df)

          City  Population   Class
0      Spokane      219190  Medium
1      Seattle      744966   Large
2     Bellevue      147599  Medium
3  Leavenworth        2010   Small


Notice that the keys for the rows are plain integers. We can make the City column the key for the rows using the `set_index` function.

In [48]:
pop_df = pop_df.set_index('City')
print(pop_df)

             Population   Class
City                           
Spokane          219190  Medium
Seattle          744966   Large
Bellevue         147599  Medium
Leavenworth        2010   Small
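Two details about `set_index` are worth seeing in code: it returns a new DataFrame rather than changing the original, and `reset_index` undoes it. A minimal sketch using two of the rows above:

```python
import pandas as pd

pop_df = pd.DataFrame([['Spokane', 219190, 'Medium'],
                       ['Seattle', 744966, 'Large']],
                      columns=['City', 'Population', 'Class'])

indexed_df = pop_df.set_index('City')   # returns a new DataFrame
print(list(pop_df.columns))             # original still has a City column
print(list(indexed_df.columns))         # City is now the index, not a column

restored_df = indexed_df.reset_index()  # moves City back to a column
print(list(restored_df.columns))
```

This is why the cell above assigns the result back to `pop_df`.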


Now let's grab a column by its label, which will return a Pandas Series object.

In [49]:
pop_ser = pop_df['Population'] # grabs the population column
print(pop_ser)

City
Spokane        219190
Seattle        744966
Bellevue       147599
Leavenworth      2010
Name: Population, dtype: int64


We can also use `iloc`, but this will index by *row* and not by *column*.

In [51]:
pop_ser = pop_df.iloc[0]
print(pop_ser)

Population    219190
Class         Medium
Name: Spokane, dtype: object


In [55]:
pop_spokane = pop_df.iloc[0,0]
print(pop_spokane)
pop_spokane_2 = pop_df.loc['Spokane','Population']
print(pop_spokane_2)

219190
219190
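The two accessors in the cell above follow the same pattern as with a Series: `iloc` takes integer positions and `loc` takes labels, each in `[row, column]` order. The sketch below rebuilds a city-indexed DataFrame and shows a few more combinations.

```python
import pandas as pd

pop_df = pd.DataFrame([['Spokane', 219190, 'Medium'],
                       ['Seattle', 744966, 'Large'],
                       ['Leavenworth', 2010, 'Small']],
                      columns=['City', 'Population', 'Class']).set_index('City')

row_by_label = pop_df.loc['Seattle']   # one row as a Series, by label
row_by_position = pop_df.iloc[1]       # the same row, by position

# A list of row labels plus one column label gives a Series of values.
two_pops = pop_df.loc[['Spokane', 'Leavenworth'], 'Population']

print(row_by_label)
print(two_pops)
```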


You can select columns by passing their names in square brackets. Selecting more than one column (with a list of names) returns another DataFrame.

In [57]:
# do this later
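A possible sketch of column selection, using a small version of the city data: one column name gives a Series, a list of names gives a DataFrame, and `.loc[:, start:stop]` slices a range of columns by label (stop included).

```python
import pandas as pd

pop_df = pd.DataFrame([['Spokane', 219190, 'Medium'],
                       ['Seattle', 744966, 'Large']],
                      columns=['City', 'Population', 'Class']).set_index('City')

one_col = pop_df['Population']                    # Series
two_cols = pop_df[['Population', 'Class']]        # DataFrame
one_col_df = pop_df[['Population']]               # DataFrame with one column
col_slice = pop_df.loc[:, 'Population':'Class']   # label slice over columns

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(type(one_col_df).__name__)
```

The double brackets matter: the outer pair is the indexing operator and the inner pair is a Python list of column names.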

## Data Cleaning with Pandas

We can read a CSV file into a DataFrame using the `pd.read_csv()` function. It's a one-liner!

In [59]:
regions_df = pd.read_csv('regions.csv')
print(regions_df)

          City Region
0      Spokane      E
1      Seattle      W
2     Bellevue      W
3  Leavenworth      C
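If `regions.csv` is not on hand, the same call can be tried on in-memory text. The sketch below assumes the file's contents match the output printed above and wraps them in `io.StringIO`, which `pd.read_csv()` accepts in place of a filename. It also shows the `index_col` parameter, which sets the row labels while reading.

```python
import io
import pandas as pd

# Assumed contents of regions.csv, reconstructed from the printed output.
csv_text = """City,Region
Spokane,E
Seattle,W
Bellevue,W
Leavenworth,C
"""

regions_df = pd.read_csv(io.StringIO(csv_text))
regions_by_city = pd.read_csv(io.StringIO(csv_text), index_col='City')

print(regions_df.shape)
print(regions_by_city.loc['Seattle', 'Region'])
```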


## Joining Pandas DataFrame Objects

We can produce a third DataFrame by doing a join operation on two other DataFrames.

In [63]:
merged_df = pop_df.merge(regions_df, on='City') # by default the merge() function does an inner join
print(merged_df)

          City  Population   Class Region
0      Spokane      219190  Medium      E
1      Seattle      744966   Large      W
2     Bellevue      147599  Medium      W
3  Leavenworth        2010   Small      C
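In the example above every city appears in both DataFrames, so the join type makes no difference. The sketch below deliberately drops one city from each side (an illustrative setup, not the data from this lesson) to show how the `how` parameter changes the result.

```python
import pandas as pd

# Bellevue is missing from regions_df; Leavenworth is missing from pop_df.
pop_df = pd.DataFrame({'City': ['Spokane', 'Seattle', 'Bellevue'],
                       'Population': [219190, 744966, 147599]})
regions_df = pd.DataFrame({'City': ['Spokane', 'Seattle', 'Leavenworth'],
                           'Region': ['E', 'W', 'C']})

inner_df = pop_df.merge(regions_df, on='City')               # cities in both
left_df = pop_df.merge(regions_df, on='City', how='left')    # every city in pop_df
outer_df = pop_df.merge(regions_df, on='City', how='outer')  # every city from either

print(len(inner_df), len(left_df), len(outer_df))
print(left_df)
```

Rows without a match on the other side get `NaN` in the columns that could not be filled.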


We can write DataFrame objects to files in a one-liner which is super convenient :)

In [64]:
merged_df.to_csv('merged.csv')
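By default `to_csv` also writes the row index as an unnamed first column. The sketch below shows the difference `index=False` makes. It calls `to_csv()` without a filename so the CSV text comes back as a string instead of going to disk, and it uses a two-row stand-in for `merged_df`.

```python
import io
import pandas as pd

merged_df = pd.DataFrame({'City': ['Spokane', 'Seattle'],
                          'Population': [219190, 744966],
                          'Region': ['E', 'W']})

with_index = merged_df.to_csv()                # index written as first column
without_index = merged_df.to_csv(index=False)  # only the real columns

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the index-free text back gives the original columns again.
round_trip_df = pd.read_csv(io.StringIO(without_index))
print(list(round_trip_df.columns))
```

Passing `index=False` is usually what you want when the index is just the default 0, 1, 2, ... numbering.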