*Disclaimer: This notebook was made for the Principles of Data Science course (DATA100).*

_This is not for distribution outside of this class._

# Pandas

Pandas is an easy-to-use, fast, flexible, and powerful open-source Python library for working with “relational” or “labeled” data. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python by offering data structures and operations for manipulating tables. Like NumPy, Pandas operations are typically faster than equivalent plain-Python code, and it is a third-party module, which means it is not part of Python's standard library.

Pandas is a very big topic, and it has more features than can be tackled in this module. You can study further by reading the official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html.

---
### Installation
In your Terminal (macOS) or Anaconda Prompt (Windows), make sure your desired environment is activated before doing this step:
```
    conda install pandas
```

### Usage

To use `pandas` in our code, we have to **import** it first.

In [1]:
import numpy as np
import pandas as pd

The alias `pd` is a common naming convention for pandas.

Note that Jupyter has the following features: tab-completion (`<TAB>`) and `?`. You may use these to explore and access the documentation for anything that you are looking for.

```
Example:

ipython
In [1]: pd.<TAB>
or

ipython
In [2]: pd.__version__?
```

---
## Data Structures

Pandas has two main data structures: `Series` and `DataFrame`. These two give us many features that make it easy to do data analysis.

### 01 - Series

A Pandas Series is a one-dimensional array-like object that can hold any data type. Like a NumPy array it stores a sequence of values, but it also pairs each value with a label, called its index.

**Creating Series**

You can pass a list, a NumPy array, or a dictionary to create a Series. Here are some examples.

In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

a   -1.764387
b   -2.128205
c    0.883364
d    0.942170
e    0.921279
dtype: float64


In [3]:
print(type(s))
print(s.name)  

<class 'pandas.core.series.Series'>
None


You can access the values by indexing.

In [4]:
s['a']

-1.764386861270621

If you don't pass an argument for index, it uses the integers 0 to n-1 by default.

In [5]:
s = pd.Series([8,7,6,5], name='test_data')
print('Name: ',s.name)
print('Data:\n',s)
print('Type of Object: ',type(s))
print('Type of elements:',type(s.values))

Name:  test_data
Data:
 0    8
1    7
2    6
3    5
Name: test_data, dtype: int64
Type of Object:  <class 'pandas.core.series.Series'>
Type of elements: <class 'numpy.ndarray'>


We can also create a Series from a dictionary.

In [6]:
libs_dict = {'Library1': 'Numpy', 
               'Library2': 'Pandas', 
               'Library3': 'Matplotlib'}
s = pd.Series(libs_dict)
print(s)
print(type(s))

Library1         Numpy
Library2        Pandas
Library3    Matplotlib
dtype: object
<class 'pandas.core.series.Series'>


In [7]:
s['Library2']

'Pandas'

As you can see, when creating a Pandas Series, only the data (first argument) is mandatory here. The rest are optional; you can opt not to input the name or index.
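
As a compact recap of the three input types, here is a minimal sketch. The variable names (`from_list`, `from_array`, `from_dict`) and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list, with an explicit index and a name (both optional).
from_list = pd.Series([8, 7, 6], index=['x', 'y', 'z'], name='scores')

# From a NumPy array; the index defaults to 0, 1, 2.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a dict; the keys become the index.
from_dict = pd.Series({'a': 1, 'b': 2})

print(from_list.index.tolist())   # ['x', 'y', 'z']
print(from_array.index.tolist())  # [0, 1, 2]
print(from_dict['b'])             # 2
```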

---
**Editing Series**

Editing a Series is very similar to how you do it with `dict`s.

In [8]:
libs_dict = {'Library1': 'Numpy', 'Library2': 'Pandas', 'Library3': 'Matplotlib'}
s = pd.Series(libs_dict)
print(s)

s['Library2'] = 'Pandas 2.0'
s['Library4'] = 'GeoPandas'
print(s)

s.pop('Library4')
print(s)

Library1         Numpy
Library2        Pandas
Library3    Matplotlib
dtype: object
Library1         Numpy
Library2    Pandas 2.0
Library3    Matplotlib
Library4     GeoPandas
dtype: object
Library1         Numpy
Library2    Pandas 2.0
Library3    Matplotlib
dtype: object
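
Note that `pop` changes the Series in place. If you would rather keep the original, `drop` returns a new Series instead. The sketch below contrasts the two on a small illustrative Series (the data is an assumption, not part of the cells above).

```python
import pandas as pd

s_demo = pd.Series({'Library1': 'Numpy', 'Library2': 'Pandas'})

# drop returns a new Series and leaves the original untouched.
trimmed = s_demo.drop('Library2')
print(len(s_demo), len(trimmed))  # 2 1

# pop removes the entry in place and returns its value.
removed = s_demo.pop('Library2')
print(removed, len(s_demo))  # Pandas 1
```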


### 02 - DataFrame

A DataFrame is like a spreadsheet or an SQL table. It is basically a 2-dimensional labelled data structure with columns of potentially different data types. To put it simply, `DataFrame` is a multi-column `Series` object. It is generally the most commonly used `pandas` object, and like `Series`, `DataFrame` accepts many different kinds of input:

 - Dict of 1D ndarrays, lists, dicts, or Series
 - 2-D numpy.ndarray
 - Structured or record ndarray
 - A `Series`
 - Another `DataFrame`
  
Here are some examples of creating DataFrames using different data types and structures as inputs.

**Creating DataFrame using dict**

In [9]:
data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(data)
print('Dataframe:\n',df)
print('Type of Object:',type(df))
print('Type of elements:',type(df.values))

Dataframe:
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
Type of Object: <class 'pandas.core.frame.DataFrame'>
Type of elements: <class 'numpy.ndarray'>


Just like Series, you can access index, values and also columns.

In [10]:
print('Index: ',df.index)
print('Columns: ',df.columns)
print('Values of Column one: ',df['one'].values)
print('Values of Column two: ',df['two'].values)

Index:  Index(['a', 'b', 'c', 'd'], dtype='object')
Columns:  Index(['one', 'two'], dtype='object')
Values of Column one:  [ 1.  2.  3. nan]
Values of Column two:  [1. 2. 3. 4.]


**Creating DataFrame using list of dicts**

In [11]:
df2 = pd.DataFrame([{'a': 1, 'b': 2, 'c':3, 'd':None}, 
                    {'a': 2, 'b': 2, 'c': 3, 'd': 4}],
                   index=['one', 'two'])
print('Dataframe: \n',df2)

# Of course, you can also transpose the result:
print('Transposed Dataframe: \n',df2.T)

Dataframe: 
      a  b  c    d
one  1  2  3  NaN
two  2  2  3  4.0
Transposed Dataframe: 
    one  two
a  1.0  2.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
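
The list of accepted inputs also includes a 2-D `numpy.ndarray`. The following is a minimal sketch of that case. The array contents and the labels (`r1`, `r2`, `a`, `b`, `c`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# Rows and columns can be labelled explicitly.
df_arr = pd.DataFrame(arr, index=['r1', 'r2'], columns=['a', 'b', 'c'])
print(df_arr.shape)           # (2, 3)
print(df_arr.loc['r2', 'b'])  # 5

# Without labels, both axes default to integer positions.
df_default = pd.DataFrame(arr)
print(df_default.columns.tolist())  # [0, 1, 2]
```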


**Editing DataFrame**

Assigning a column that doesn’t exist will create a new column.

In [12]:
df = pd.DataFrame(data)

df['three'] = None
print('Added third column: \n',df)

# The del keyword can be used to delete columns:
del df['three']
print('\nDeleted third column: \n',df)
# You can also use df.drop(). We shall see that later

df.loc['a','one'] = 9000
print('\nEdited a value: \n',df)

Added third column: 
    one  two three
a  1.0  1.0  None
b  2.0  2.0  None
c  3.0  3.0  None
d  NaN  4.0  None

Deleted third column: 
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Edited a value: 
       one  two
a  9000.0  1.0
b     2.0  2.0
c     3.0  3.0
d     NaN  4.0


#### 02-01 Reindexing

In `pandas`, we can do reindexing, which means to create a new object with the data conformed to a new index.

In [13]:
data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [14]:
# Reindex in descending order.
df.reindex(['d','c','b','a'])

Unnamed: 0,one,two
d,,4.0
c,3.0,3.0
b,2.0,2.0
a,1.0,1.0


If you reindex with labels that are not in the DataFrame, the result contains new rows for those labels, filled with `NaN`.

In [15]:
df.reindex(['a','b','c','d','e'])

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0
e,,


Reindexing can also fill the newly introduced labels with a value of your choice through `fill_value`. In the output below, pay attention to column `one`, row `d`.

In [16]:
df.reindex(['a','b','c','d','e'], fill_value=0)
# Guess why the df['one']['d'] was not filled with 0 ?

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0
e,0.0,0.0
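
One way to check your answer to the question in the cell above is the sketch below. It uses a small illustrative frame (`df_demo` is an assumption, not the `df` from the cells) and contrasts `fill_value` with `fillna`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'one': [1.0, np.nan]}, index=['a', 'b'])

# fill_value only fills labels that reindex introduces ('c' here);
# the NaN already stored at row 'b' is left alone.
reindexed = df_demo.reindex(['a', 'b', 'c'], fill_value=0)
print(reindexed['one'].tolist())  # [1.0, nan, 0.0]

# fillna replaces the missing values that are already in the data.
filled = reindexed.fillna(0)
print(filled['one'].tolist())  # [1.0, 0.0, 0.0]
```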


#### 02-02 Dropping entries

We can drop the rows/entries by using the `drop` function.

In [17]:
data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [18]:
df.drop(['c', 'a'])

Unnamed: 0,one,two
b,2.0,2.0
d,,4.0
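
`drop` can remove columns as well as rows. Here is a minimal sketch using the `columns=` keyword on an illustrative frame (`df_demo` is an assumption). Like the row version above, it returns a new DataFrame rather than modifying the original.

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]},
                       index=['a', 'b'])

# Drop a column by name; columns= avoids having to pass axis=1.
without_two = df_demo.drop(columns=['two'])
print(without_two.columns.tolist())  # ['one']

# The original DataFrame still has both columns.
print(df_demo.columns.tolist())  # ['one', 'two']
```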


#### 02-03 Indexing, Filtering, and Sorting 

**Indexing**

Series indexing works similarly to NumPy array indexing, except you can use the Series’s index values instead of only integers.

To select only column `one` for rows `a` and `d`, use the following.

In [19]:
print("Dataframe: \n",df)
# Selecting only column `one` for rows 'a' and 'd'
df['one'][['a', 'd']]

Dataframe: 
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


a    1.0
d    NaN
Name: one, dtype: float64

In [20]:
# Slicing df from row 'b' to row 'd' for column `one`
df['one']['b':'d']

b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In the above cell, you will notice that slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive.

For label-based indexing on a DataFrame, there is a special indexer, `loc`, which lets us select a subset of the rows and columns with `numpy`-like notation plus axis labels. It is a less verbose alternative to reindexing, and it is what we typically use.

In [21]:
df.loc[['a','c'],['one']]

Unnamed: 0,one
a,1.0
c,3.0


**Filtering**

We can also filter by having a condition inside `loc`.

In [22]:
df.loc[df.one > 1]

Unnamed: 0,one,two
b,2.0,2.0
c,3.0,3.0
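
Conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because of operator precedence. A minimal sketch on an illustrative frame (`df_demo` is an assumption):

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [3.0, 2.0, 1.0]},
                       index=['a', 'b', 'c'])

# Rows where both conditions hold.
both = df_demo.loc[(df_demo['one'] > 1) & (df_demo['two'] > 1)]
print(both.index.tolist())  # ['b']

# Rows where at least one condition holds.
either = df_demo.loc[(df_demo['one'] > 2) | (df_demo['two'] > 2)]
print(either.index.tolist())  # ['a', 'c']
```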


There are many ways to select and rearrange the data contained in a pandas object. Some indexing options are listed in the table below:

|Indexing Type| Description|
|:---|:---|
|df[val] | Select a single column or sequence of columns from the DataFrame. Special-case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion).|
|df.loc[val] | Select a single row or subset of rows from the DataFrame by label.|
|df.loc[:, val] | Select a single column or subset of columns by label.|
|df.loc[val1, val2] | Select both rows and columns by label.|
|df.iloc[where] | Select rows (and optionally columns) by integer location.|
|reindex method | Conform one or more axes to new indexes.|
|xs method | Select a single row or column as a Series by label.|
|df.at[row, col], df.iat[i, j] | Get or set a single value by row and column label (`at`) or by integer location (`iat`).|

**Sorting**

We can sort items by doing the following.

In [23]:
dt = pd.Series(np.random.randint(3, 10, size=7), 
               index=['g','c','a','b','e','d','f'])
print('Original Data: \n', dt, end="\n\n")
print('Sorted by Index: \n',dt.sort_index())

Original Data: 
 g    7
c    7
a    8
b    7
e    9
d    8
f    5
dtype: int64

Sorted by Index: 
 a    8
b    7
c    7
d    8
e    9
f    5
g    7
dtype: int64
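
Besides sorting by index, you can sort by the values themselves with `sort_values`, and reverse either ordering with `ascending=False`. A minimal sketch on illustrative data (`dt_demo` and `df_demo` are assumptions):

```python
import pandas as pd

dt_demo = pd.Series([7, 5, 9], index=['g', 'c', 'a'])

# Sort by value, smallest first.
print(dt_demo.sort_values().tolist())  # [5, 7, 9]

# Sort by value, largest first; the index labels travel with the values.
print(dt_demo.sort_values(ascending=False).index.tolist())  # ['a', 'g', 'c']

# For a DataFrame, name the column to sort by.
df_demo = pd.DataFrame({'one': [3, 1, 2]}, index=['x', 'y', 'z'])
print(df_demo.sort_values(by='one').index.tolist())  # ['y', 'z', 'x']
```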


#### 02-04 Data alignment and arithmetic

Arithmetic between `DataFrame` objects automatically aligns on both the columns and the index (row labels). The resulting object has the union of the column and row labels, with `NaN` wherever a label is missing from either operand.

In [24]:
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

print('df1:')
display(df1)
print('df2:')
display(df2)
print('Sum:')
display(df1.add(df2))

df1:


Unnamed: 0,A,B,C,D
0,-0.092902,-0.614456,2.762824,-1.351655
1,-0.262685,0.270409,-1.39136,0.655182
2,-0.983761,-0.364976,-0.893155,0.168804
3,-1.017767,-1.445179,-0.717208,0.787632
4,0.050705,-0.878196,0.100162,0.038959
5,-1.104643,2.589244,-0.389075,-1.5657
6,1.849123,0.330865,-0.454649,-0.68523
7,1.216587,0.470229,-0.413304,-1.515006
8,-1.504535,0.928231,1.770481,-0.30686
9,-0.372782,-0.584428,-0.182191,0.921259


df2:


Unnamed: 0,A,B,C
0,1.70112,-0.106179,-1.074017
1,0.066296,-0.593019,0.336862
2,-0.014362,-0.117732,0.099709
3,0.337932,0.018704,-0.192461
4,-0.560795,0.585062,-0.623387
5,-0.10817,1.375148,-0.66527
6,-0.082105,-0.150797,1.383563


Sum:


Unnamed: 0,A,B,C,D
0,1.608218,-0.720635,1.688807,
1,-0.196389,-0.322611,-1.054498,
2,-0.998123,-0.482709,-0.793446,
3,-0.679835,-1.426474,-0.909669,
4,-0.51009,-0.293134,-0.523225,
5,-1.212812,3.964392,-1.054345,
6,1.767017,0.180068,0.928914,
7,,,,
8,,,,
9,,,,


Note that in arithmetic operations between differently-indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:

In [25]:
df1.add(df2, fill_value=0)

Unnamed: 0,A,B,C,D
0,1.608218,-0.720635,1.688807,-1.351655
1,-0.196389,-0.322611,-1.054498,0.655182
2,-0.998123,-0.482709,-0.793446,0.168804
3,-0.679835,-1.426474,-0.909669,0.787632
4,-0.51009,-0.293134,-0.523225,0.038959
5,-1.212812,3.964392,-1.054345,-1.5657
6,1.767017,0.180068,0.928914,-0.68523
7,1.216587,0.470229,-0.413304,-1.515006
8,-1.504535,0.928231,1.770481,-0.30686
9,-0.372782,-0.584428,-0.182191,0.921259
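
Because the frames above hold random numbers, the effect of `fill_value` can be hard to verify by eye. The sketch below repeats the idea on small fixed frames. The names `left` and `right` and their values are illustrative assumptions.

```python
import pandas as pd

left = pd.DataFrame({'A': [1.0, 2.0], 'B': [10.0, 20.0]}, index=[0, 1])
right = pd.DataFrame({'A': [100.0]}, index=[0])

# Without fill_value, any label missing from one operand gives NaN.
plain = left.add(right)
print(plain.loc[0, 'A'])               # 101.0
print(int(plain.isna().sum().sum()))   # 3

# With fill_value=0, the missing operand is treated as 0.
filled = left.add(right, fill_value=0)
print(filled['A'].tolist())  # [101.0, 2.0]
print(filled['B'].tolist())  # [10.0, 20.0]
```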


Similarly, you can perform other arithmetic operations such as subtraction, multiplication, and division.

When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting (just like in numpy) row-wise.

In [26]:
print("Dataframe:")
display(df1)
print("Operand (0th row):")
display(df1.loc[0])
print('Subtraction:')
display(df1.sub(df1.loc[0]))

Dataframe:


Unnamed: 0,A,B,C,D
0,-0.092902,-0.614456,2.762824,-1.351655
1,-0.262685,0.270409,-1.39136,0.655182
2,-0.983761,-0.364976,-0.893155,0.168804
3,-1.017767,-1.445179,-0.717208,0.787632
4,0.050705,-0.878196,0.100162,0.038959
5,-1.104643,2.589244,-0.389075,-1.5657
6,1.849123,0.330865,-0.454649,-0.68523
7,1.216587,0.470229,-0.413304,-1.515006
8,-1.504535,0.928231,1.770481,-0.30686
9,-0.372782,-0.584428,-0.182191,0.921259


Operand (0th row):


A   -0.092902
B   -0.614456
C    2.762824
D   -1.351655
Name: 0, dtype: float64

Subtraction:


Unnamed: 0,A,B,C,D
0,0.0,0.0,0.0,0.0
1,-0.169783,0.884865,-4.154185,2.006838
2,-0.890859,0.249479,-3.65598,1.520459
3,-0.924865,-0.830723,-3.480033,2.139287
4,0.143607,-0.26374,-2.662662,1.390615
5,-1.011741,3.2037,-3.151899,-0.214045
6,1.942025,0.945321,-3.217474,0.666425
7,1.309489,1.084684,-3.176129,-0.163351
8,-1.411633,1.542687,-0.992344,1.044796
9,-0.27988,0.030028,-2.945016,2.272914


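When working with time series data, the DataFrame index often holds dates. Broadcasting a Series against such a frame still matches on the columns by default; to broadcast column-wise (matching on the row index instead), pass `axis=0` to the arithmetic method, as in `df1.sub(df1['A'], axis=0)`. The cell below gives `df1` a date index with `set_index`, which returns a new DataFrame: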

In [27]:
ind1 = pd.date_range('06/1/2017', periods=10)
df1.set_index(ind1)

Unnamed: 0,A,B,C,D
2017-06-01,-0.092902,-0.614456,2.762824,-1.351655
2017-06-02,-0.262685,0.270409,-1.39136,0.655182
2017-06-03,-0.983761,-0.364976,-0.893155,0.168804
2017-06-04,-1.017767,-1.445179,-0.717208,0.787632
2017-06-05,0.050705,-0.878196,0.100162,0.038959
2017-06-06,-1.104643,2.589244,-0.389075,-1.5657
2017-06-07,1.849123,0.330865,-0.454649,-0.68523
2017-06-08,1.216587,0.470229,-0.413304,-1.515006
2017-06-09,-1.504535,0.928231,1.770481,-0.30686
2017-06-10,-0.372782,-0.584428,-0.182191,0.921259
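
A minimal sketch of column-wise broadcasting on a date-indexed frame follows. The frame `ts` and its values are illustrative assumptions. Passing `axis=0` makes pandas match the Series against the row index instead of the columns.

```python
import pandas as pd

dates = pd.date_range('2017-06-01', periods=3)
ts = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]},
                  index=dates)

# Subtract column 'A' from every column, aligning on the date index.
diff = ts.sub(ts['A'], axis=0)
print(diff['A'].tolist())  # [0.0, 0.0, 0.0]
print(diff['B'].tolist())  # [9.0, 18.0, 27.0]
```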


#### 02-05 Using NumPy functions on DataFrame

Element-wise NumPy functions like log, exp, sqrt, ... and various other NumPy functions can be used on DataFrame.

In [28]:
np.abs(df1)

Unnamed: 0,A,B,C,D
0,0.092902,0.614456,2.762824,1.351655
1,0.262685,0.270409,1.39136,0.655182
2,0.983761,0.364976,0.893155,0.168804
3,1.017767,1.445179,0.717208,0.787632
4,0.050705,0.878196,0.100162,0.038959
5,1.104643,2.589244,0.389075,1.5657
6,1.849123,0.330865,0.454649,0.68523
7,1.216587,0.470229,0.413304,1.515006
8,1.504535,0.928231,1.770481,0.30686
9,0.372782,0.584428,0.182191,0.921259


In [29]:
# Convert to numpy array
np.asarray(df1)

array([[-0.09290196, -0.61445577,  2.76282444, -1.3516554 ],
       [-0.26268495,  0.2704089 , -1.39136044,  0.65518247],
       [-0.98376131, -0.36497643, -0.8931551 ,  0.1688038 ],
       [-1.01776704, -1.44517853, -0.71720828,  0.78763182],
       [ 0.05070495, -0.87819604,  0.10016247,  0.03895943],
       [-1.10464256,  2.58924411, -0.38907491, -1.56570025],
       [ 1.84912273,  0.33086547, -0.4546492 , -0.6852304 ],
       [ 1.21658745,  0.47022852, -0.4133044 , -1.51500606],
       [-1.5045347 ,  0.92823085,  1.77048066, -0.30685961],
       [-0.37278153, -0.58442801, -0.18219116,  0.92125897]])

Sometimes we have to apply custom logic when transforming our DataFrame. We can use the `apply` method for this. It calls a function on each column (the default) or each row, passing it in as a one-dimensional Series.

In [30]:
def fn(x):
    """
    Get max and min of the columns
    """
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

df1.apply(fn)

Unnamed: 0,A,B,C,D
min,-1.504535,-1.445179,-1.39136,-1.5657
max,1.849123,2.589244,2.762824,0.921259
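
To run the function on each row instead of each column, pass `axis=1`. A minimal sketch on an illustrative frame (`df_demo` and `value_range` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 5], 'B': [4, 2]}, index=['r1', 'r2'])

def value_range(x):
    """Difference between the largest and smallest value in x."""
    return x.max() - x.min()

# Default (axis=0): the function receives each column in turn.
per_column = df_demo.apply(value_range)
print(per_column.tolist())  # [4, 2]

# axis=1: the function receives each row in turn.
per_row = df_demo.apply(value_range, axis=1)
print(per_row.tolist())  # [3, 3]
```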


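Element-wise mapping can be done too, using `applymap`. In pandas 2.1 and later, `applymap` is deprecated and the same operation is available as `DataFrame.map`.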

In [31]:
fmt = lambda x: "{:.3f}".format(x)
df1.applymap(fmt)

Unnamed: 0,A,B,C,D
0,-0.093,-0.614,2.763,-1.352
1,-0.263,0.27,-1.391,0.655
2,-0.984,-0.365,-0.893,0.169
3,-1.018,-1.445,-0.717,0.788
4,0.051,-0.878,0.1,0.039
5,-1.105,2.589,-0.389,-1.566
6,1.849,0.331,-0.455,-0.685
7,1.217,0.47,-0.413,-1.515
8,-1.505,0.928,1.77,-0.307
9,-0.373,-0.584,-0.182,0.921


## Exploratory Data Analysis using Pandas

In this section, we'll demonstrate how `pandas` is used for analyzing real-world datasets.

**Loading data**

The first thing that we have to do is acquire the data and load it into our system. For the following examples, we will use the very popular Titanic dataset. We can load the data by using one of the following commands in Pandas.

|Function| Description|
|:---|:---|
|read_csv | Load delimited data from a file, URL, or file-like object. Uses comma as the default delimiter.|
|read_table | Load delimited data from a file, URL, or file-like object. Uses tab ('\\t') as the default delimiter.|
|read_fwf | Read data in fixed-width column format (that is, no delimiters)|
|read_clipboard | Version of read_table that reads data from the clipboard. Useful for converting tables from web pages.|
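
To try `read_csv` without a file on disk, you can pass it a file-like object. The sketch below uses `io.StringIO` with made-up contents (the names and ages are illustrative) and shows the `sep` parameter for tab-delimited text.

```python
import io
import pandas as pd

# Comma-delimited text; the second row has no age.
csv_text = "name,age\nAda,36\nLin,\n"
people = pd.read_csv(io.StringIO(csv_text))
print(people.shape)                  # (2, 2)
print(people['age'].isnull().sum())  # 1

# Tab-delimited text read with read_csv by overriding the delimiter.
tsv_text = "name\tage\nAda\t36\n"
people_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(people_tsv.columns.tolist())   # ['name', 'age']
```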

`train.csv` in our 'data' folder contains the "train set" of the Titanic dataset. We'll explore what "train set" means in future sessions (or you could read up on your own), but what's important to know is that this is simply a subset of the original dataset. We'll be loading this file for the following examples, using `read_csv` since our dataset is in CSV format.

In [32]:
df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Notice that the path is `data/train.csv`, not just `train.csv`. The path is relative to your current directory: pandas looks inside the 'data' folder there and opens the `train.csv` file.

We used the `head` method to just display the first 5 (default) entries.

> To know more about what the column names mean, see this link: https://www.kaggle.com/c/titanic/data

Let's check the dimensions and data types in our dataset.

In [33]:
df.shape

(891, 12)

In [34]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


We can use the following snippet to check the number of null values in our dataset.

In [35]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

As we can see, there are null values in `Age`, `Cabin` and `Embarked` columns.
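
Once you know where the nulls are, two common follow-ups are dropping incomplete rows with `dropna` and filling gaps with `fillna`. The sketch below uses a tiny made-up frame (`toy` is an assumption, not the Titanic data), and which approach is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                    'Embarked': ['S', None, 'C']})

# Count nulls per column, as in the cell above.
print(toy.isnull().sum().tolist())  # [1, 1]

# Keep only the rows that have no missing values.
complete = toy.dropna()
print(complete.index.tolist())  # [0, 2]

# Replace missing ages with the mean of the known ages.
age_filled = toy['Age'].fillna(toy['Age'].mean())
print(age_filled.tolist())  # [22.0, 24.0, 26.0]
```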

What if we want to select only the `Age` column?

In [36]:
df['Age'].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

We can get the average age. By default, `mean` skips missing values. Note that this method applies to numeric data types. The different types of data (in general, not Python) will be further explained in future modules.

In [37]:
df['Age'].mean()

29.69911764705882
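
Since `Age` contains nulls, it helps to know how `mean` treats them. A minimal sketch on a made-up Series (`ages` is an assumption):

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])

# By default, missing values are skipped.
print(ages.mean())  # 30.0

# With skipna=False, any missing value makes the result NaN.
print(ages.mean(skipna=False))  # nan

# count() reports how many non-null values went into the mean.
print(ages.count())  # 2
```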

What about categorical data types?

In [38]:
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

We can also filter the dataset to only those who survived.

In [39]:
df_survived = df[df.Survived==1]
df_survived.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [40]:
df_survived['Age'].mean()

28.343689655172415

In [41]:
df_survived['Sex'].value_counts()

female    233
male      109
Name: Sex, dtype: int64
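
Raw counts can be turned into proportions with `value_counts(normalize=True)`, and the mean of a 0/1 column gives a rate. The sketch below applies both to a tiny made-up frame (`toy` is an assumption, not the Titanic data).

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['female', 'male', 'female', 'male'],
                    'Survived': [1, 0, 1, 1]})

# Filter to survivors, then get each group's share of them.
survivors = toy[toy['Survived'] == 1]
shares = survivors['Sex'].value_counts(normalize=True)
print(round(shares['female'], 3))  # 0.667

# The mean of a 0/1 column is the fraction of ones.
rate = toy['Survived'].mean()
print(rate)  # 0.75
```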