# Introduction to pandas

## What is it?
* [pandas](http://pandas.pydata.org/) is an open source [Python](http://www.python.org/) library for data analysis. 
* Python has always been great for prepping and manipulating data, but historically it has not been great for analysis - you'd usually end up using [R](http://www.r-project.org/) or loading the data into a database and using SQL (or worse, Excel). 
* pandas makes Python great for analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
# Switch to truncated view if too many columns
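# Use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 50)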
%matplotlib inline

# Data Structures
pandas introduces two new data structures to Python - [Series](http://pandas.pydata.org/pandas-docs/dev/dsintro.html#series) and [DataFrame](http://pandas.pydata.org/pandas-docs/dev/dsintro.html#dataframe), both of which are built on top of [NumPy](http://www.numpy.org/) (this means it's fast).

## Series

A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.

In [None]:
# create a Series with an arbitrary list
s = pd.Series([7, 'Chennai', 3.14, -1789710578, 'Happy Eating!'])
s

0                7
1          Chennai
2             3.14
3      -1789710578
4    Happy Eating!
dtype: object

Let's compare a pandas Series with a Python list. A list only supports positional access, so finding an item means searching for its value:

In [None]:
# python list
list_var = ['a', 'b','c','d']
print(list_var)
print(list_var.index('c'))

['a', 'b', 'c', 'd']
2
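To round out the comparison, here is a minimal sketch of the Series side. The labels `'w'` through `'z'` are made up for illustration; the point is only that a Series can be addressed by label as well as by position.

```python
import pandas as pd

# A plain list only knows positions; finding a value means searching for it
list_var = ['a', 'b', 'c', 'd']
position = list_var.index('c')          # position of the value 'c'

# A Series carries its own labels (hypothetical ones here),
# so items can be fetched by label or by position
series_var = pd.Series(list_var, index=['w', 'x', 'y', 'z'])
by_label = series_var['y']              # label-based lookup
by_position = series_var.iloc[2]        # position-based lookup

print(position, by_label, by_position)
```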


Alternatively, you can specify an index to use when creating the Series.

In [None]:
s = pd.Series([7, 'Chennai', 3.14, -1789710578, 'Happy Eating!'], index=['A', 'Z', 'C', 'Y', 'E'])
s

A                7
Z          Chennai
C             3.14
Y      -1789710578
E    Happy Eating!
dtype: object

The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index.

In [None]:
pythonDictionary = {'Chennai': 400, 'Hyderabad': 100, 'Pune': 150, 'Bangalore': 700, 'Guruguon': 450, 'Kolkata': None}
cities = pd.Series(pythonDictionary)
cities

Bangalore    700.0
Chennai      400.0
Guruguon     450.0
Hyderabad    100.0
Kolkata        NaN
Pune         150.0
dtype: float64

You can use the index to select specific items from the Series ...

In [None]:
cities['Chennai']

400.0

In [None]:
cities[['Chennai', 'Hyderabad', 'Bangalore']]

Chennai      400.0
Hyderabad    100.0
Bangalore    700.0
dtype: float64

Or you can use boolean indexing for selection.

In [None]:
cities[cities < 500]

Chennai      400.0
Guruguon     450.0
Hyderabad    100.0
Pune         150.0
dtype: float64

That last one might be a little weird, so let's make it more clear - `cities < 500` returns a Series of True/False values, which we then pass to our Series `cities`, returning only the items whose mask value is True.

In [None]:
less_than_500 = cities < 500
print(less_than_500)
print('\n')
print(cities[less_than_500])

Bangalore    False
Chennai       True
Guruguon      True
Hyderabad     True
Kolkata      False
Pune          True
dtype: bool


Chennai      400.0
Guruguon     450.0
Hyderabad    100.0
Pune         150.0
dtype: float64
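Boolean masks can also be combined. The sketch below rebuilds the `cities` Series so that it runs on its own, and the 120 and 500 thresholds are arbitrary. Note that pandas uses `&` and `|` rather than `and` and `or`, and that each condition needs its own parentheses.

```python
import numpy as np
import pandas as pd

cities = pd.Series({'Chennai': 400, 'Hyderabad': 100, 'Pune': 150,
                    'Bangalore': 700, 'Guruguon': 450, 'Kolkata': np.nan})

# Each comparison yields a boolean Series; & combines them element-wise
mask = (cities > 120) & (cities < 500)
mid_range = cities[mask]

# NaN compares as False, so Kolkata never passes the mask
print(mid_range)
```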


You can also change the values in a Series on the fly.

In [None]:
# changing based on the index
print('Old value:', cities['Chennai'])
cities['Chennai'] = 500
print('New value:', cities['Chennai'])

Old value: 400.0
New value: 500.0


In [None]:
cities

Bangalore    700.0
Chennai      500.0
Guruguon     450.0
Hyderabad    100.0
Kolkata        NaN
Pune         150.0
dtype: float64

In [None]:
# changing values using boolean logic
print(cities[cities < 500])
print('\n')
cities[cities < 500] = 250

print(cities[cities < 500])

Guruguon     450.0
Hyderabad    100.0
Pune         150.0
dtype: float64


Guruguon     250.0
Hyderabad    250.0
Pune         250.0
dtype: float64


What if you aren't sure whether a label is in the Series?  You can check using idiomatic Python.

In [None]:
print('Chennai' in cities)
print('Delhi' in cities)

True
False
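One detail worth spelling out: `in` tests the index labels of a Series, not its values. A small sketch with a cut-down `cities` Series (rebuilt here so the snippet is self-contained):

```python
import pandas as pd

cities = pd.Series({'Chennai': 500, 'Bangalore': 700})

label_present = 'Chennai' in cities      # 'Chennai' is an index label
value_as_label = 500 in cities           # 500 is a value, not a label
value_present = 500 in cities.values     # search the underlying values instead

print(label_present, value_as_label, value_present)
```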


Mathematical operations can be done using scalars and functions.

In [None]:
# divide city values by 3
cities/3

Bangalore    233.333333
Chennai      166.666667
Guruguon      83.333333
Hyderabad     83.333333
Kolkata             NaN
Pune          83.333333
dtype: float64

In [None]:
# square city values
np.square(cities)

Bangalore    490000.0
Chennai      250000.0
Guruguon      62500.0
Hyderabad     62500.0
Kolkata           NaN
Pune          62500.0
dtype: float64

You can add two Series together, which returns a union of the two Series with the addition occurring on the shared index values.  Values on either Series that did not have a shared index will produce a NULL/NaN (not a number).

In [None]:
print(cities[['Chennai', 'Hyderabad', 'Bangalore']])
print('\n')
print(cities[['Chennai', 'Kolkata']])
print('\n')
print(cities[['Chennai', 'Hyderabad', 'Bangalore']] + cities[['Chennai', 'Kolkata']])

Chennai      500.0
Hyderabad    250.0
Bangalore    700.0
dtype: float64


Chennai    500.0
Kolkata      NaN
dtype: float64


Bangalore       NaN
Chennai      1000.0
Hyderabad       NaN
Kolkata         NaN
dtype: float64


Notice that because Bangalore and Hyderabad were not found in both Series, they were returned with NULL/NaN values. Kolkata is NaN as well: it appears only in the second Series, and its value there is already NaN.
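If you would rather treat a label that is missing from one side as zero, the `add` method accepts a `fill_value`. This is a sketch on two small Series built to mirror the ones above, not a change to the tutorial's data.

```python
import numpy as np
import pandas as pd

a = pd.Series({'Chennai': 500.0, 'Hyderabad': 250.0, 'Bangalore': 700.0})
b = pd.Series({'Chennai': 500.0, 'Kolkata': np.nan})

plain = a + b                      # labels missing from either side become NaN
filled = a.add(b, fill_value=0)    # a label missing from one side counts as 0

# Kolkata stays NaN: it is absent from `a` and NaN in `b`,
# so there is no real value on either side to add
print(filled)
```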

NULL checking can be performed with `isnull` and `notnull`.

In [None]:
# returns a boolean series indicating which values aren't NULL
print(cities)
print('\n')
cities.notnull()

Bangalore    700.0
Chennai      500.0
Guruguon     250.0
Hyderabad    250.0
Kolkata        NaN
Pune         250.0
dtype: float64




Bangalore     True
Chennai       True
Guruguon      True
Hyderabad     True
Kolkata      False
Pune          True
dtype: bool

In [None]:
# use boolean logic to grab the NULL cities
print(cities.isnull())
print('\n')
print(cities[cities.isnull()])

Bangalore    False
Chennai      False
Guruguon     False
Hyderabad    False
Kolkata       True
Pune         False
dtype: bool


Kolkata   NaN
dtype: float64


## DataFrame

A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can also think of a DataFrame as a group of Series objects that share an index, with each Series forming one named column.

For the rest of the tutorial, we'll be primarily working with DataFrames.
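The "group of Series" view can be made concrete with a short sketch. The `wins` and `losses` Series below are illustrative values, not a dataset used elsewhere in this tutorial.

```python
import pandas as pd

# Two Series that share the same index labels
wins = pd.Series({'CSK': 11, 'RR': 15, 'MI': 6})
losses = pd.Series({'CSK': 5, 'RR': 1, 'MI': 10})

# Each Series becomes one named column of the DataFrame
table = pd.DataFrame({'wins': wins, 'losses': losses})

# Selecting a single column hands back a Series again
one_column = table['wins']
print(table)
```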

### Reading Data

To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor.

Using the `columns` parameter allows us to tell the constructor how we'd like the columns ordered. Without it, recent versions of pandas keep the dictionary's insertion order, while older versions sorted the columns alphabetically (neither applies when reading from a file - more on that next).

In [None]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['CSK', 'CSK', 'CSK', 'RR', 'RR', 'MI', 'MI', 'MI'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
football

Unnamed: 0,year,team,wins,losses
0,2010,CSK,11,5
1,2011,CSK,8,8
2,2012,CSK,10,6
3,2011,RR,15,1
4,2012,RR,11,5
5,2010,MI,6,10
6,2011,MI,10,6
7,2012,MI,4,12


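Often you won't have all the data ready in a single structure. In that case you can assemble a DataFrame from separate pieces: a list of column labels, an index, and an array of values.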

In [None]:
teams = ['CSK', 'RR', 'MI', 'KXIP']
teams

['CSK', 'RR', 'MI', 'KXIP']

In [172]:
d = pd.date_range('20201010', periods=10)
d

DatetimeIndex(['2020-10-10', '2020-10-11', '2020-10-12', '2020-10-13',
               '2020-10-14', '2020-10-15', '2020-10-16', '2020-10-17',
               '2020-10-18', '2020-10-19'],
              dtype='datetime64[ns]', freq='D')

In [None]:
np_array = np.random.randn(10,4)
np_array

array([[-0.18898057, -0.04828275,  0.30777981, -1.17861804],
       [-1.01667186, -0.48541507,  0.831175  ,  1.32751593],
       [ 0.9207213 , -1.2763351 ,  0.13588732, -1.57556638],
       [-1.52172563, -0.33805323, -0.46428492,  1.06036997],
       [ 1.42493914, -0.5498024 , -1.13414734, -0.86278818],
       [-0.50168515,  1.04318919, -0.32047437,  1.10332782],
       [-1.34498286, -0.81400423,  1.48915985,  0.95819327],
       [ 0.6047341 , -0.51422553,  0.36934254, -0.68676277],
       [ 0.49442337,  1.1070468 , -0.00808864, -1.30213163],
       [ 1.2357625 , -0.23521166,  0.72227834, -0.74910691]])

In [173]:
simple_df = pd.DataFrame(np_array, index=d, columns=teams)
simple_df

Unnamed: 0,CSK,RR,MI,KXIP
2020-10-10,-0.188981,-0.048283,0.30778,-1.178618
2020-10-11,-1.016672,-0.485415,0.831175,1.327516
2020-10-12,0.920721,-1.276335,0.135887,-1.575566
2020-10-13,-1.521726,-0.338053,-0.464285,1.06037
2020-10-14,1.424939,-0.549802,-1.134147,-0.862788
2020-10-15,-0.501685,1.043189,-0.320474,1.103328
2020-10-16,-1.344983,-0.814004,1.48916,0.958193
2020-10-17,0.604734,-0.514226,0.369343,-0.686763
2020-10-18,0.494423,1.107047,-0.008089,-1.302132
2020-10-19,1.235762,-0.235212,0.722278,-0.749107


A pandas Series can also be converted to a DataFrame.

In [181]:
df_cities = cities.to_frame()
df_cities

Unnamed: 0,0
Bangalore,700.0
Chennai,500.0
Guruguon,250.0
Hyderabad,250.0
Kolkata,
Pune,250.0


In [183]:
df_cities.columns = ['2020']
# store the 2021 figures as numbers rather than strings so they can be used in arithmetic
df_cities["2021"] = [800, 600, 300, 400, 100, 300]
df_cities

Unnamed: 0,2020,2021
Bangalore,700.0,800
Chennai,500.0,600
Guruguon,250.0,300
Hyderabad,250.0,400
Kolkata,,100
Pune,250.0,300
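The column name can also be set during the conversion itself through the `name` argument of `to_frame`. A minimal sketch on a cut-down Series:

```python
import pandas as pd

cities = pd.Series({'Chennai': 500.0, 'Bangalore': 700.0})

# name= sets the label of the single resulting column
df_named = cities.to_frame(name='2020')
print(df_named)
```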


Much more often, you'll have a dataset you want to read into a DataFrame. Let's go through several common ways of doing so.

### CSV

Reading a CSV is as simple as calling the *read_csv* function. By default, the *read_csv* function expects the column separator to be a comma, but you can change that using the `sep` parameter.

In [None]:
from_csv = pd.read_csv('https://github.com/luciasantamaria/pandas-tutorial/raw/master/data/mariano-rivera.csv')
from_csv

Unnamed: 0,Year,Age,Tm,Lg,W,L,W-L%,ERA,G,GS,GF,CG,SHO,SV,IP,H,R,ER,HR,BB,IBB,SO,HBP,BK,WP,BF,ERA+,WHIP,H/9,HR/9,BB/9,SO/9,SO/BB,Awards
0,1995,25,NYY,AL,5,3,0.625,5.51,19,10,2,0,0,0,67.0,71,43,41,11,30,0,51,2,1,0,301,84,1.507,9.5,1.5,4.0,6.9,1.7,
1,1996,26,NYY,AL,8,3,0.727,2.09,61,0,14,0,0,5,107.2,73,25,25,1,34,3,130,2,0,1,425,240,0.994,6.1,0.1,2.8,10.9,3.82,CYA-3MVP-12
2,1997,27,NYY,AL,6,4,0.6,1.88,66,0,56,0,0,43,71.2,65,17,15,5,20,6,68,0,0,2,301,239,1.186,8.2,0.6,2.5,8.5,3.4,ASMVP-25
3,1998,28,NYY,AL,3,0,1.0,1.91,54,0,49,0,0,36,61.1,48,13,13,3,17,1,36,1,0,0,246,233,1.06,7.0,0.4,2.5,5.3,2.12,
4,1999,29,NYY,AL,4,3,0.571,1.83,66,0,63,0,0,45,69.0,43,15,14,2,18,3,52,3,1,2,268,257,0.884,5.6,0.3,2.3,6.8,2.89,ASCYA-3MVP-14
5,2000,30,NYY,AL,7,4,0.636,2.85,66,0,61,0,0,36,75.2,58,26,24,4,25,3,58,0,0,2,311,170,1.097,6.9,0.5,3.0,6.9,2.32,AS
6,2001,31,NYY,AL,4,6,0.4,2.34,71,0,66,0,0,50,80.2,61,24,21,5,12,2,83,1,0,1,310,192,0.905,6.8,0.6,1.3,9.3,6.92,ASMVP-11
7,2002,32,NYY,AL,1,4,0.2,2.74,45,0,37,0,0,28,46.0,35,16,14,3,11,2,41,2,1,1,187,163,1.0,6.8,0.6,2.2,8.0,3.73,AS
8,2003,33,NYY,AL,5,2,0.714,1.66,64,0,57,0,0,40,70.2,61,15,13,3,10,1,63,4,0,0,277,267,1.005,7.8,0.4,1.3,8.0,6.3,MVP-27
9,2004,34,NYY,AL,4,2,0.667,1.94,74,0,69,0,0,53,78.2,65,17,17,3,20,3,66,5,0,0,316,232,1.081,7.4,0.3,2.3,7.6,3.3,ASCYA-3MVP-9
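To illustrate the `sep` parameter without depending on a remote file, the sketch below parses a small pipe-delimited string. The sample text is made up and only loosely follows the columns above.

```python
import io
import pandas as pd

# Made-up sample: same idea as a CSV, but with '|' as the separator
raw = "Year|Tm|W|L\n1995|NYY|5|3\n1996|NYY|8|3\n"

# io.StringIO lets read_csv treat the string like an open file
piped = pd.read_csv(io.StringIO(raw), sep='|')
print(piped)
```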


There are many Python adapters for loading files from Google Drive, AWS S3 (the boto3 library), Azure Blob Storage, and so on. `read_csv` can also read directly from a URL:

In [None]:
url = 'https://github.com/luciasantamaria/pandas-tutorial/blob/master/data/city-of-chicago-salaries.csv?raw=true'

salary_df = pd.read_csv(url)
salary_df.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$85512.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$75372.00
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$80916.00
3,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$99648.00
4,"ABBATACOLA, ROBERT J",ELECTRICAL MECHANIC,AVIATION,$89440.00


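Our file had headers, which the function inferred upon reading in the file. To supply our own column names we can pass a list through `names`. Be careful with `header=None`, though: it tells pandas that the file has no header line, so on a file that does have one, the original header is read as a row of data (row 0 in the output below). To replace an existing header, pass `header=0` along with `names` instead.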

In [187]:
cols = ['Full Name', 'Title', 'Department', 'Salary']
url = 'https://github.com/luciasantamaria/pandas-tutorial/blob/master/data/city-of-chicago-salaries.csv?raw=true'
salary_df_no_header = pd.read_csv(url, sep=',', header=None, names=cols)
salary_df_no_header.head()

Unnamed: 0,Full Name,Title,Department,Salary
0,Name,Position Title,Department,Employee Annual Salary
1,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$85512.00
2,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$75372.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$80916.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$99648.00
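The difference between `header=None` and `header=0` is easier to see side by side. This sketch uses a tiny made-up CSV string and hypothetical column names rather than the salary file.

```python
import io
import pandas as pd

# Made-up two-row file that does include a header line
raw = "Name,Department\nA,POLICE\nB,AVIATION\n"
cols = ['Full Name', 'Dept']   # hypothetical replacement names

# header=None: the file's own header line is kept as a data row
as_data = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# header=0: the first line is recognised as a header and replaced by `names`
replaced = pd.read_csv(io.StringIO(raw), header=0, names=cols)

print(len(as_data), len(replaced))
```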


pandas' various *reader* functions have many parameters allowing you to do things like skipping lines of the file, parsing dates, or specifying how to handle NA/NULL data points.
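As a rough sketch of those three options together, the example below uses `skiprows`, `parse_dates`, and `na_values` on a made-up string. The leading comment line and the `missing` marker are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Made-up file: one junk line, then a header, then data with a custom NA marker
raw = ("# exported by a hypothetical tool\n"
       "date,co\n"
       "2004-03-10,2.6\n"
       "2004-03-11,missing\n")

readings = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,              # skip the junk line before the header
    parse_dates=['date'],    # convert this column to datetimes
    na_values=['missing'],   # treat this marker as NaN
)
print(readings.dtypes)
```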

There's also a set of *writer* functions for writing to a variety of formats (CSVs, HTML tables, JSON).  They function exactly as you'd expect and are typically called `to_format`:

```python
my_dataframe.to_csv('path_to_file.csv')
```
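A couple of the writers can be tried without touching the filesystem, since they return a string when no path is given. The small DataFrame here is illustrative only.

```python
import io
import json
import pandas as pd

results = pd.DataFrame({'team': ['CSK', 'RR'], 'wins': [11, 15]})

# With no path argument, to_csv and to_json return the text instead of writing a file
csv_text = results.to_csv(index=False)
json_text = results.to_json(orient='records')

# Reading the CSV text back recovers the same table
round_trip = pd.read_csv(io.StringIO(csv_text))
records = json.loads(json_text)
print(csv_text)
```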

[Take a look at the IO documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) to familiarize yourself with file reading/writing functionality.

### PostgreSQL Database

pandas also has some support for reading/writing DataFrames directly from/to a database [[docs](http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries)].  You'll typically just need to pass a connection object or SQLAlchemy engine to the `pd.read_sql` function or the `DataFrame.to_sql` method.

Note that `to_sql` executes as a series of INSERT INTO statements and thus trades speed for simplicity. If you're writing a large DataFrame to a database, it might be quicker to write the DataFrame to CSV and load that directly using the database's bulk import facility (for example, PostgreSQL's `COPY` command).
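Before setting up a real server, the round trip can be sketched with Python's built-in `sqlite3` module. This uses an in-memory SQLite database as a stand-in for PostgreSQL, and the `results` table name and its contents are made up for illustration.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database: a stand-in, not the PostgreSQL setup used below
conn = sqlite3.connect(':memory:')

results = pd.DataFrame({'team': ['CSK', 'RR', 'MI'], 'wins': [11, 15, 6]})

# Write the DataFrame to a table, leaving out the row index
results.to_sql('results', conn, index=False)

# Read a query result straight back into a DataFrame
top = pd.read_sql(
    'SELECT team, wins FROM results WHERE wins > 10 ORDER BY wins DESC', conn)
conn.close()
print(top)
```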

First, let's set up a PostgreSQL database server.

In [None]:
# Install postgresql server
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start

# Setup a password `postgres` for username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

# Setup a database with name `demo` to be used
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS demo;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE demo;'

 * Starting PostgreSQL 10 database server
   ...done.
ALTER ROLE
ERROR:  database "demo" is being accessed by other users
DETAIL:  There is 1 other session using the database.
ERROR:  database "demo" already exists


Next, set the environment variables that hold the connection details.

In [None]:
# Connection settings for the `demo` database
%env DEMO_DATABASE_NAME=demo
%env DEMO_DATABASE_HOST=localhost
%env DEMO_DATABASE_PORT=5432
%env DEMO_DATABASE_USER=postgres
%env DEMO_DATABASE_PASS=postgres

env: DEMO_DATABASE_NAME=demo
env: DEMO_DATABASE_HOST=localhost
env: DEMO_DATABASE_PORT=5432
env: DEMO_DATABASE_USER=postgres
env: DEMO_DATABASE_PASS=postgres


Create a table in the database from an open source SQL script.

In [None]:
!curl -s -OL https://github.com/tensorflow/io/raw/master/docs/tutorials/postgresql/AirQualityUCI.sql

!PGPASSWORD=$DEMO_DATABASE_PASS psql -q -h $DEMO_DATABASE_HOST -p $DEMO_DATABASE_PORT -U $DEMO_DATABASE_USER -d $DEMO_DATABASE_NAME -f AirQualityUCI.sql

psql:AirQualityUCI.sql:17: ERROR:  relation "airqualityuci" already exists


Now the database can be queried.

In [None]:
from sqlalchemy import create_engine

In [None]:
import os
endpoint="postgresql://{}:{}@{}:{}/{}".format(
    os.environ['DEMO_DATABASE_USER'],
    os.environ['DEMO_DATABASE_PASS'],
    os.environ['DEMO_DATABASE_HOST'],
    os.environ['DEMO_DATABASE_PORT'],
    os.environ['DEMO_DATABASE_NAME'],
)
engine = create_engine(endpoint)

In [188]:
aqi = pd.read_sql_query('select * from AirQualityUCI limit 10', con=engine)

In [189]:
aqi.head()

Unnamed: 0,date,time,co,pt08s1,nmhc,c6h6,pt08s2,nox,pt08s3,no2,pt08s4,pt08s5,t,rh,ah
0,2004-03-10,18:00:00,2.6,1360,150.0,11.9,1046,166.0,1056,113.0,1692,1268,13.6,48.9,0.7578
1,2004-03-10,19:00:00,2.0,1292,112.0,9.4,955,103.0,1174,92.0,1559,972,13.3,47.7,0.7255
2,2004-03-10,20:00:00,2.2,1402,88.0,9.0,939,131.0,1140,114.0,1555,1074,11.9,54.0,0.7502
3,2004-03-10,21:00:00,2.2,1376,80.0,9.2,948,172.0,1092,122.0,1584,1203,11.0,60.0,0.7867
4,2004-03-10,22:00:00,1.6,1272,51.0,6.5,836,131.0,1205,116.0,1490,1110,11.2,59.6,0.7888


### Clipboard

Just as the results of a query can be read directly into a DataFrame, so can data copied to the clipboard, using `read_clipboard`.

This works just as well with any type of delimited data you've copied to your clipboard. The function does a good job of inferring the delimiter, but you can also use the `sep` parameter to be explicit.

In [None]:
hank = pd.read_clipboard()
hank.head()