# Pandas

In this section, we will learn about the Pandas Python library and its main data structure: the DataFrame. DataFrames are similar to Excel spreadsheets in that they make it easy to manipulate tabular data. However, one major advantage of DataFrames over Excel is that complex aggregation and plotting are easier. Doing the same things in Excel would require learning VBA, a less common language with fewer learning resources than Python.

## Checklist:

    1) Series
    2) DataFrames
    3) Manipulating DataFrames
    4) Missing Data / NaN values
    5) Group By
    6) Combining DataFrames (Concat, Merge, Join)
    7) Built in Methods
    8) Applying Self Defined Functions
___

# 0) Pip Install Pandas

Similar to when we worked with Numpy, we need to install the Pandas library if you don't already have it in your kernel/environment.

In [1]:
pip install Pandas








# 1) Series

A Series is similar to a Numpy array (it's actually built on Numpy arrays), but instead of just an array of numbers, it lets you associate each value with an index label.

In [2]:
# first we will import both numpy and pandas
import numpy as np
import pandas as pd

Let's create our first Series data structure

In [3]:
labels = ['a','b','c']
numbers = [10,20,30]

Using a list as the input (the default index is integers starting at 0)

In [4]:
pd.Series(numbers)

0    10
1    20
2    30
dtype: int64

Adding labels to it

In [5]:
pd.Series(numbers,labels)

a    10
b    20
c    30
dtype: int64

Using numpy array as the input

In [6]:
pd.Series(np.array(numbers),labels)

a    10
b    20
c    30
dtype: int32

Using a dictionary as the input

In [7]:
my_dict = {"a": 10, "b":20, "c":30}

pd.Series(my_dict)

a    10
b    20
c    30
dtype: int64

The index is used to help identify and label your data so it is easier to retrieve (similar to the concept of a dictionary)

In [8]:
series = pd.Series([27,18,28,5],index = ['USA', 'Germany','Italy', 'Japan'])  
series

USA        27
Germany    18
Italy      28
Japan       5
dtype: int64

With a Series, we can easily get the value for a desired index

In [9]:
series['Japan']

5

We can also do math on the Series

In [10]:
series*2

USA        54
Germany    36
Italy      56
Japan      10
dtype: int64
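Math between two Series also uses the index. The sketch below is a minimal illustration with made-up numbers (the `second` Series is hypothetical): values are matched by label rather than by position, and a label that appears in only one Series produces NaN.

```python
import pandas as pd

first = pd.Series([27, 18, 28, 5], index=["USA", "Germany", "Italy", "Japan"])

# Hypothetical second Series: different order, and no entry for Germany
second = pd.Series([1, 2, 3], index=["Japan", "USA", "Italy"])

# Values are added label by label, not position by position
total = first + second

print(total)
```

Because Germany has no partner in `second`, its result is NaN, which also turns the result into floating point numbers.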

# 2) DataFrames

DataFrames are arguably the most important concept in the Pandas library. A DataFrame is made up of Series that share the same index. It's similar to how you would lay out data in a table where the index is the time step or the category of data.

We can make a dataframe based on a nested list, and manually set the index and column names

In [11]:
# named "values" rather than "input" so we don't shadow Python's built-in input()
values = [[1,2,3],
          [4,5,6],
          [7,8,9]]

df = pd.DataFrame(values,index=["A","B","C"], columns=["X","Y","Z"])

df

Unnamed: 0,X,Y,Z
A,1,2,3
B,4,5,6
C,7,8,9


We can also define a DataFrame using a dictionary

In [12]:
input2 = {"X":[1,4,7],"Y":[2,5,8],"Z":[3,6,9]}

df2 = pd.DataFrame(input2, index=["A","B","C"])

df2

Unnamed: 0,X,Y,Z
A,1,2,3
B,4,5,6
C,7,8,9


# 3) Manipulating DataFrames
Because DataFrames are just a collection of Series, we can easily select part of the DataFrame based on certain criteria

## Selecting part of the DataFrame based on index/column names

In [13]:
df['X']

A    1
B    4
C    7
Name: X, dtype: int64

We can also select multiple columns

In [14]:
df[['X','Y']]

Unnamed: 0,X,Y
A,1,2
B,4,5
C,7,8


We can see that if we take only one column, the data structure is a Series

In [15]:
type(df['X'])

pandas.core.series.Series

## Selecting using .loc[] and .iloc[]

.loc[] and .iloc[] are two different ways we can extract part of a DataFrame. Both are indexers, so they use square brackets rather than parentheses.

* .loc[] is the way to select rows of data based on index labels (previously we selected based on column name only)
* .iloc[] is like the traditional method of selecting a value based on numerical position (coordinate style)
    * Ex: (0,0) (0,1) (2,3)
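One difference worth keeping in mind is how slices behave. The following is a small sketch on the same 3x3 layout used in this section: a label slice includes both endpoints, while a position slice excludes the stop value, like a normal Python list slice.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=["A", "B", "C"], columns=["X", "Y", "Z"])

# Label slices include BOTH endpoints: rows A..B and columns X..Y
by_label = df.loc["A":"B", "X":"Y"]

# Position slices exclude the stop: rows 0,1 and columns 0,1
by_position = df.iloc[0:2, 0:2]

# A single value: row label and column label in one lookup
single = df.loc["B", "Z"]

print(by_label)
print(by_position)
print(single)
```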

Selecting the 'A' row

In [16]:
df.loc['A']

X    1
Y    2
Z    3
Name: A, dtype: int64

In [17]:
df.iloc[0]

X    1
Y    2
Z    3
Name: A, dtype: int64

Selecting a single value

In [18]:
df.loc['A','X'] # row label and column label in a single lookup

1

In [19]:
df.iloc[0,0]

1

Selecting the 'A' and 'B' rows

In [20]:
df.loc[['A','B']]

Unnamed: 0,X,Y,Z
A,1,2,3
B,4,5,6


In [21]:
df.iloc[0:2]

Unnamed: 0,X,Y,Z
A,1,2,3
B,4,5,6


Selecting rows A and B, and columns X and Y

In [22]:
df.loc[['A','B'],['X','Y']]

Unnamed: 0,X,Y
A,1,2
B,4,5


In [23]:
df.iloc[0:2, 0:2]

Unnamed: 0,X,Y
A,1,2
B,4,5


## Selecting based on criteria
We can also filter a DataFrame based on criteria applied to a column or row. Let's make a larger DataFrame to see it in action. We will use the numpy random module to build the DataFrame faster.

In [24]:
from numpy.random import randn
np.random.seed(200)

df3 = pd.DataFrame(randn(6,3),index=["A","B","C","D","E","F"], columns=["X","Y","Z"])

df3

Unnamed: 0,X,Y,Z
A,-1.450948,1.910953,0.711879
B,-0.247738,0.361466,-0.03295
C,-0.221347,0.477257,-0.691939
D,0.792006,0.073249,1.303286
E,0.213481,1.017349,1.911712
F,-0.529672,1.842135,-1.057235


Grabbing only the values where the X column is greater than 0

In [25]:
df3[df3['X'] > 0]

Unnamed: 0,X,Y,Z
D,0.792006,0.073249,1.303286
E,0.213481,1.017349,1.911712


Grabbing only the X and Z columns for rows where the X column is greater than 0

In [26]:
df3[df3['X'] > 0][['X','Z']]

Unnamed: 0,X,Z
D,0.792006,1.303286
E,0.213481,1.911712


Grabbing rows where X and Z are less than 0 (You must use & or | for logical comparisons. You can't use AND/OR spelled out)

In [28]:
df3[(df3['X'] < 0) and (df3['Z'] < 0)]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [29]:
df3[(df3['X'] < 0) & (df3['Z'] < 0)]

Unnamed: 0,X,Y,Z
B,-0.247738,0.361466,-0.03295
C,-0.221347,0.477257,-0.691939
F,-0.529672,1.842135,-1.057235


Grabbing rows where any column is < 0

In [30]:
df3[(df3['X'] < 0) | (df3['Y'] < 0) | (df3['Z'] < 0)]

Unnamed: 0,X,Y,Z
A,-1.450948,1.910953,0.711879
B,-0.247738,0.361466,-0.03295
C,-0.221347,0.477257,-0.691939
F,-0.529672,1.842135,-1.057235
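To make the filtering mechanics explicit, here is a minimal sketch with a small hand-written DataFrame (the numbers are made up for illustration). A comparison produces a boolean Series, and passing that Series inside the square brackets keeps only the rows marked True. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `<` in Python.

```python
import pandas as pd

# Hypothetical values chosen so the results are easy to check by eye
df = pd.DataFrame({"X": [-1.5, 0.8, 0.2, -0.5],
                   "Z": [0.7, 1.3, -1.9, -1.1]},
                  index=["A", "B", "C", "D"])

mask = df["X"] < 0                                   # boolean Series, one entry per row

both_negative = df[(df["X"] < 0) & (df["Z"] < 0)]    # element-wise AND
either_negative = df[(df["X"] < 0) | (df["Z"] < 0)]  # element-wise OR
x_not_negative = df[~mask]                           # ~ inverts the mask

print(mask)
print(both_negative)
print(either_negative)
print(x_not_negative)
```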


## Adding and Removing Columns

We can also create new columns based on values of another column

In [31]:
df["sum"] = df["X"] + df["Y"] + df["Z"]

df

Unnamed: 0,X,Y,Z,sum
A,1,2,3,6
B,4,5,6,15
C,7,8,9,24


We can also drop a column of a DataFrame using .drop()

In [32]:
df.drop('sum',axis=1)

Unnamed: 0,X,Y,Z
A,1,2,3
B,4,5,6
C,7,8,9


Notice that if we call df again, the column is still there. This is because .drop() returns a new DataFrame and leaves the original unchanged, so we need to store the result.

In [33]:
df

Unnamed: 0,X,Y,Z,sum
A,1,2,3,6
B,4,5,6,15
C,7,8,9,24


In [34]:
df = df.drop('sum',axis=1) # store the result; running this block again raises a KeyError because 'sum' is already gone
df

Unnamed: 0,X,Y,Z
A,1,2,3
B,4,5,6
C,7,8,9


Alternatively we could also specify to drop inplace

In [35]:
df["sum"] = df["X"] + df["Y"] + df["Z"]

print(df)

df.drop('sum', axis=1, inplace=True)

df

   X  Y  Z  sum
A  1  2  3    6
B  4  5  6   15
C  7  8  9   24


Unnamed: 0,X,Y,Z
A,1,2,3
B,4,5,6
C,7,8,9
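The difference between the returned copy and the original can be checked directly. This is a small sketch on a two-column DataFrame; it also uses the `columns=` keyword of `.drop()`, which is another way of writing `axis=1`.

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 4, 7], "Y": [2, 5, 8]}, index=["A", "B", "C"])
df["sum"] = df["X"] + df["Y"]

# drop() returns a new DataFrame; df itself keeps the "sum" column
trimmed = df.drop(columns="sum")   # same effect as df.drop("sum", axis=1)

print(list(df.columns))
print(list(trimmed.columns))
```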


# 4) Missing Data / NaN values
It's common for the data we receive as engineers to be incomplete or to have missing values. Therefore, we need an easy way to deal with values that are missing or incorrect.

In [36]:
df4 = pd.DataFrame(randn(5,4),index=["A","B","C","D","E"], columns=["W","X","Y","Z"])

# Add some NaN values
df4.iloc[0,0] = np.nan  # np.nan is the supported spelling; the np.NAN alias was removed in NumPy 2.0
df4.iloc[2,2] = np.nan
df4.iloc[3,0] = np.nan

df4

Unnamed: 0,W,X,Y,Z
A,,0.237631,-1.154182,1.214984
B,-1.293759,0.822723,-0.332151,-1.281429
C,0.218538,2.083474,,0.26746
D,,-0.653608,0.16228,2.21338
E,-0.665534,-1.009003,2.348051,0.604376


We can drop all rows that have NaN

In [37]:
df4.dropna()

Unnamed: 0,W,X,Y,Z
B,-1.293759,0.822723,-0.332151,-1.281429
E,-0.665534,-1.009003,2.348051,0.604376


We can drop all columns that have NaN

In [38]:
df4.dropna(axis=1)

Unnamed: 0,X,Z
A,0.237631,1.214984
B,0.822723,-1.281429
C,2.083474,0.26746
D,-0.653608,2.21338
E,-1.009003,0.604376


We can replace all the NaN values

In [39]:
df4.fillna("ERROR")

Unnamed: 0,W,X,Y,Z
A,ERROR,0.237631,-1.15418,1.214984
B,-1.29376,0.822723,-0.332151,-1.281429
C,0.218538,2.083474,ERROR,0.26746
D,ERROR,-0.653608,0.16228,2.21338
E,-0.665534,-1.009003,2.34805,0.604376


In [40]:
df4.fillna(-1)

Unnamed: 0,W,X,Y,Z
A,-1.0,0.237631,-1.154182,1.214984
B,-1.293759,0.822723,-0.332151,-1.281429
C,0.218538,2.083474,-1.0,0.26746
D,-1.0,-0.653608,0.16228,2.21338
E,-0.665534,-1.009003,2.348051,0.604376
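Before dropping or filling, it often helps to count how many values are missing, and a fixed value such as -1 is not the only fill option. The sketch below uses a small made-up DataFrame and fills each column with its own mean; whether that is appropriate depends on the data, so treat it as one possible choice rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical values so the column means are easy to verify
df = pd.DataFrame({"W": [np.nan, 2.0, 4.0],
                   "X": [1.0, np.nan, 3.0]},
                  index=["A", "B", "C"])

# isna() marks missing entries; sum() counts them per column
missing_per_column = df.isna().sum()

# df.mean() skips NaN by default; fillna matches the means to columns by name
filled = df.fillna(df.mean())

print(missing_per_column)
print(filled)
```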


# 5) Group By
A very useful feature of DataFrames is the ability to group rows of data together. This will become extremely important once we learn plotting.

In [41]:
data = {'Status':['Senior','Senior','Junior','Freshmen','Freshmen','Sophomore','Senior','Junior'],
        'Name':['Crystal','Sam','Ben','Long','Katie','Sarah','Jessica','Ryan'],
        'Credits':[12,16,18,15,14,17,21,9]}

df5 = pd.DataFrame(data)

df5

Unnamed: 0,Status,Name,Credits
0,Senior,Crystal,12
1,Senior,Sam,16
2,Junior,Ben,18
3,Freshmen,Long,15
4,Freshmen,Katie,14
5,Sophomore,Sarah,17
6,Senior,Jessica,21
7,Junior,Ryan,9


We can use the groupby method to group the data based on the status. On its own, this only returns a GroupBy object, so there is nothing to see yet. We have to aggregate the data to see results.

In [42]:
df5.groupby("Status")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000216DA63F0A0>

We can average the credits taken by students of each status

In [43]:
df5.groupby("Status").mean(numeric_only=True) # Name is text, so average only the numeric columns

Status,Credits
Freshmen,14.5
Junior,13.5
Senior,16.333333
Sophomore,17.0


There are also other common methods as well (median, max, min, std)

In [44]:
df5.groupby("Status").max() # finds the maximum value within each group

Status,Name,Credits
Freshmen,Long,15
Junior,Ryan,18
Senior,Sam,21
Sophomore,Sarah,17


In [45]:
df5.groupby("Status").std(numeric_only=True) # finds the std of the numeric columns within each group

Status,Credits
Freshmen,0.707107
Junior,6.363961
Senior,4.50925
Sophomore,


In [46]:
df5.groupby("Status").count() # finds the number of students within each group

Status,Name,Credits
Freshmen,2,2
Junior,2,2
Senior,3,3
Sophomore,1,1


In [47]:
df5.groupby("Status").describe() # finds common statistics for each group

Status,Credits count,Credits mean,Credits std,Credits min,Credits 25%,Credits 50%,Credits 75%,Credits max
Freshmen,2.0,14.5,0.707107,14.0,14.25,14.5,14.75,15.0
Junior,2.0,13.5,6.363961,9.0,11.25,13.5,15.75,18.0
Senior,3.0,16.333333,4.50925,12.0,14.0,16.0,18.5,21.0
Sophomore,1.0,17.0,,17.0,17.0,17.0,17.0,17.0
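When only a few statistics are needed, they can be requested together. The following sketch selects the Credits column from the grouped data and passes a list of aggregation names to `.agg()`; the `summary` variable name is just for illustration.

```python
import pandas as pd

data = {'Status': ['Senior', 'Senior', 'Junior', 'Freshmen', 'Freshmen', 'Sophomore', 'Senior', 'Junior'],
        'Name': ['Crystal', 'Sam', 'Ben', 'Long', 'Katie', 'Sarah', 'Jessica', 'Ryan'],
        'Credits': [12, 16, 18, 15, 14, 17, 21, 9]}
df5 = pd.DataFrame(data)

# Pick the numeric column first, then ask for several aggregations at once
summary = df5.groupby("Status")["Credits"].agg(["mean", "max", "count"])

print(summary)
```

Selecting the column before aggregating also avoids any problem with the text column Name.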


# 6) Combining DataFrames (Concat and Merge)
Often we import different sets of data individually and then combine them. Therefore, it's important to learn the different ways we can combine DataFrames
* Concat - Combines DataFrames on top of each other (stacking if they have the same column names)
* Merge - Combines DataFrames based on a column you specify

## Concat

In [48]:
top1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']},
                    index=['T0', 'T1', 'T2']) 

bottom1 = pd.DataFrame({'A': ['C0', 'C1', 'C2'],
                    'B': ['D0', 'D1', 'D2']},
                    index=['B0', 'B1', 'B2'])

In [49]:
top1

Unnamed: 0,A,B
T0,A0,B0
T1,A1,B1
T2,A2,B2


In [50]:
bottom1

Unnamed: 0,A,B
B0,C0,D0
B1,C1,D1
B2,C2,D2


In [51]:
pd.concat([top1,bottom1])

Unnamed: 0,A,B
T0,A0,B0
T1,A1,B1
T2,A2,B2
B0,C0,D0
B1,C1,D1
B2,C2,D2


We can also combine left to right if they have the same index names

In [52]:
top = pd.DataFrame({'A_T': ['A0', 'A1', 'A2'],
                    'B_T': ['B0', 'B1', 'B2']},
                    index=['0', '1', '2']) # notice that the index are the same

bottom = pd.DataFrame({'A_B': ['C0', 'C1', 'C2'],
                    'B_B': ['D0', 'D1', 'D2']},
                    index=['0', '1', '2']) # notice that the index are the same

pd.concat([top,bottom],axis=1)

Unnamed: 0,A_T,B_T,A_B,B_B
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
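When the pieces being stacked use the default integer index, the stacked result repeats index values. A minimal sketch, assuming two small DataFrames with made-up contents, shows the `ignore_index=True` option of `pd.concat`, which renumbers the rows from 0.

```python
import pandas as pd

# Both DataFrames get the default index 0, 1
top = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
bottom = pd.DataFrame({"A": ["C0", "C1"], "B": ["D0", "D1"]})

stacked = pd.concat([top, bottom])                        # keeps the original index values
renumbered = pd.concat([top, bottom], ignore_index=True)  # builds a fresh 0..n-1 index

print(stacked)
print(renumbered)
```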


## Merge
Merge allows us to combine DataFrames based on a specific column

In [53]:
left = pd.DataFrame({'Capacity': [10,20,30,40],
                    'Energy': [20,40,60,80],
                    "Label":['Cell1', 'Cell2', 'Cell3','Cell4']}) 

right = pd.DataFrame({'Cost': [25,68,100,120],
                    'Weight': [1.5, 2.6, 3.2,3.7],
                    "Label":['Cell1', 'Cell2', 'Cell3','Cell5']}) 

Left Join (all values from the LEFT table are included)

In [54]:
pd.merge(left, right, how='left', on='Label')

Unnamed: 0,Capacity,Energy,Label,Cost,Weight
0,10,20,Cell1,25.0,1.5
1,20,40,Cell2,68.0,2.6
2,30,60,Cell3,100.0,3.2
3,40,80,Cell4,,


Right Join (all values from the RIGHT table are included)

In [55]:
pd.merge(left, right, how='right', on='Label')

Unnamed: 0,Capacity,Energy,Label,Cost,Weight
0,10.0,20.0,Cell1,25,1.5
1,20.0,40.0,Cell2,68,2.6
2,30.0,60.0,Cell3,100,3.2
3,,,Cell5,120,3.7


Inner Join (shows only values that exist in BOTH tables)

In [56]:
pd.merge(left, right, how='inner', on='Label')

Unnamed: 0,Capacity,Energy,Label,Cost,Weight
0,10,20,Cell1,25,1.5
1,20,40,Cell2,68,2.6
2,30,60,Cell3,100,3.2


Outer Join (all values are included, regardless of whether they exist in the other table)

In [57]:
pd.merge(left,right, how='outer', on='Label')

Unnamed: 0,Capacity,Energy,Label,Cost,Weight
0,10.0,20.0,Cell1,25.0,1.5
1,20.0,40.0,Cell2,68.0,2.6
2,30.0,60.0,Cell3,100.0,3.2
3,40.0,80.0,Cell4,,
4,,,Cell5,120.0,3.7
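To see which table each row of a join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses trimmed-down versions of the two tables (fewer columns than above, for brevity).

```python
import pandas as pd

left = pd.DataFrame({"Capacity": [10, 20, 30, 40],
                     "Label": ["Cell1", "Cell2", "Cell3", "Cell4"]})
right = pd.DataFrame({"Cost": [25, 68, 100, 120],
                      "Label": ["Cell1", "Cell2", "Cell3", "Cell5"]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = pd.merge(left, right, how="outer", on="Label", indicator=True)

# Look up the origin of each label as plain text
origin = merged.set_index("Label")["_merge"].astype(str)

print(merged)
```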


# 7) Built in Methods and Applying Functions
Similar to other data structures we've seen, DataFrame has built in methods

In [58]:
df6 = pd.merge(left,right, how='outer', on='Label')

df6

Unnamed: 0,Capacity,Energy,Label,Cost,Weight
0,10.0,20.0,Cell1,25.0,1.5
1,20.0,40.0,Cell2,68.0,2.6
2,30.0,60.0,Cell3,100.0,3.2
3,40.0,80.0,Cell4,,
4,,,Cell5,120.0,3.7


## Sort DataFrame

Sort DataFrame based on WEIGHT from low to high (ascending)

In [59]:
df6.sort_values(by='Weight') #inplace=False by default

Unnamed: 0,Capacity,Energy,Label,Cost,Weight
0,10.0,20.0,Cell1,25.0,1.5
1,20.0,40.0,Cell2,68.0,2.6
2,30.0,60.0,Cell3,100.0,3.2
4,,,Cell5,120.0,3.7
3,40.0,80.0,Cell4,,


Sort DataFrame based on COST from high to low (descending)

In [60]:
df6.sort_values(by='Cost', ascending=False) #inplace=False by default

Unnamed: 0,Capacity,Energy,Label,Cost,Weight
4,,,Cell5,120.0,3.7
2,30.0,60.0,Cell3,100.0,3.2
1,20.0,40.0,Cell2,68.0,2.6
0,10.0,20.0,Cell1,25.0,1.5
3,40.0,80.0,Cell4,,
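In both sorts above, the row with a missing value ends up last. That is the default behavior of `sort_values`, and it can be changed with the `na_position` parameter. A small sketch with made-up costs:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing cost
df = pd.DataFrame({"Label": ["Cell1", "Cell2", "Cell3"],
                   "Cost": [68.0, np.nan, 25.0]})

default_order = df.sort_values(by="Cost")                   # NaN rows go last
nan_first = df.sort_values(by="Cost", na_position="first")  # NaN rows go first

print(default_order)
print(nan_first)
```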


## Set Index
To use .loc with meaningful labels, we can set one of the columns as the index

In [61]:
df6.set_index("Label")

Label,Capacity,Energy,Cost,Weight
Cell1,10.0,20.0,25.0,1.5
Cell2,20.0,40.0,68.0,2.6
Cell3,30.0,60.0,100.0,3.2
Cell4,40.0,80.0,,
Cell5,,,120.0,3.7


Notice that if we print df6, it does not have the index
* We would either need to store df6.set_index("Label") as a new DataFrame or use inplace=True
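A minimal sketch of the first option, storing the result under a new name (the names `indexed` and `restored` are only for illustration). Once the labels are the index, `.loc` can select by them, and `reset_index()` turns the index back into a regular column.

```python
import pandas as pd

df = pd.DataFrame({"Label": ["Cell1", "Cell2", "Cell3"],
                   "Capacity": [10, 20, 30]})

indexed = df.set_index("Label")               # df itself is left unchanged
capacity = indexed.loc["Cell2", "Capacity"]   # select by row label and column name
restored = indexed.reset_index()              # move the index back into a column

print(indexed)
print(capacity)
```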

## Renaming Columns
To rename columns, pass a dictionary that maps the old column names to the new ones

In [62]:
df6.rename(columns={"Capacity": "Capacity_new",
                    "Energy": "Energy_new",
                    "Cost": "Cost_new",
                    "Weight": "Weight_new"})

Unnamed: 0,Capacity_new,Energy_new,Label,Cost_new,Weight_new
0,10.0,20.0,Cell1,25.0,1.5
1,20.0,40.0,Cell2,68.0,2.6
2,30.0,60.0,Cell3,100.0,3.2
3,40.0,80.0,Cell4,,
4,,,Cell5,120.0,3.7


We can also get a list of the column names

In [63]:
df6.columns

Index(['Capacity', 'Energy', 'Label', 'Cost', 'Weight'], dtype='object')

## Viewing certain number of rows

Viewing the first n rows

In [64]:
df5.head(2)

Unnamed: 0,Status,Name,Credits
0,Senior,Crystal,12
1,Senior,Sam,16


Viewing the last n rows

In [65]:
df5.tail(3)

Unnamed: 0,Status,Name,Credits
5,Sophomore,Sarah,17
6,Senior,Jessica,21
7,Junior,Ryan,9


## Seeing the data types for each column

In [66]:
df6.dtypes

Capacity    float64
Energy      float64
Label        object
Cost        float64
Weight      float64
dtype: object

## Getting the size or shape of the DataFrame
* shape = number of rows and columns
* size = number of entries / data points

In [67]:
df5.size # shows number of entries

24

In [68]:
df5.shape # shows the dimensions of the DataFrame

(8, 3)

# 8) Applying Self Defined Functions
apply is a built-in method of Series and DataFrame that allows us to apply a self-defined function to the data

In [69]:
left # we can reuse a previous DataFrame

Unnamed: 0,Capacity,Energy,Label
0,10,20,Cell1
1,20,40,Cell2
2,30,60,Cell3
3,40,80,Cell4


Here we will define a function that returns 2 times the inputted value

In [70]:
def times2(x):
    return x*2

Now we can apply the function to every value in a column of the DataFrame

In [71]:
left['Capacity'].apply(times2)

0    20
1    40
2    60
3    80
Name: Capacity, dtype: int64
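apply can also work on a whole DataFrame one row at a time by passing `axis=1`, in which case the function receives each row as a Series. The sketch below uses a hypothetical helper, `energy_per_capacity`, on the Capacity and Energy columns, and compares it with the equivalent column arithmetic.

```python
import pandas as pd

left = pd.DataFrame({"Capacity": [10, 20, 30, 40],
                     "Energy": [20, 40, 60, 80]})

def energy_per_capacity(row):
    # row is a Series holding one row; its labels are the column names
    return row["Energy"] / row["Capacity"]

# axis=1 calls the function once per row
ratio = left.apply(energy_per_capacity, axis=1)

# The same result using plain column arithmetic
vectorized = left["Energy"] / left["Capacity"]

print(ratio)
```

For simple formulas like this one, the column arithmetic is usually shorter and faster; row-wise apply is mainly useful when the logic does not reduce to column operations.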

## Group By and Apply
We can also combine group by and apply to do operations that require us to do some computation only within groups

One example is that we might want to subtract the group's minimum value from every number in the group, so we see the deviation from the minimum instead of the total number

Take our previous example with Students and Course Credits

In [72]:
df5

Unnamed: 0,Status,Name,Credits
0,Senior,Crystal,12
1,Senior,Sam,16
2,Junior,Ben,18
3,Freshmen,Long,15
4,Freshmen,Katie,14
5,Sophomore,Sarah,17
6,Senior,Jessica,21
7,Junior,Ryan,9


Now we will define a function that takes in a DataFrame and subtracts the minimum Credits value of that DataFrame from each row

In [73]:
def subtract_min(df):
    min_credits = df['Credits'].min()  # smallest Credits value within this group
    print("Min in group:", min_credits)
    return df['Credits'] - min_credits

In [74]:
df5[['Status','Credits']].groupby('Status').apply(subtract_min)

Min in group: 14
Min in group: 9
Min in group: 12
Min in group: 17


Status      
Freshmen   3    1
           4    0
Junior     2    9
           7    0
Senior     0    0
           1    4
           6    9
Sophomore  5    0
Name: Credits, dtype: int64

(BONUS) We can do the same thing using a lambda expression

In [75]:
df5.groupby('Status').apply(lambda df: df['Credits'] - df['Credits'].min())

Status      
Freshmen   3    1
           4    0
Junior     2    9
           7    0
Senior     0    0
           1    4
           6    9
Sophomore  5    0
Name: Credits, dtype: int64
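Another way to get the same numbers, sketched here as an alternative rather than a replacement for the approach above, is `transform`. It returns one value per original row, so the result lines up with the original DataFrame and can be stored as a new column (the column name `AboveMin` is made up for this example).

```python
import pandas as pd

data = {'Status': ['Senior', 'Senior', 'Junior', 'Freshmen', 'Freshmen', 'Sophomore', 'Senior', 'Junior'],
        'Name': ['Crystal', 'Sam', 'Ben', 'Long', 'Katie', 'Sarah', 'Jessica', 'Ryan'],
        'Credits': [12, 16, 18, 15, 14, 17, 21, 9]}
df5 = pd.DataFrame(data)

# For every row, look up the minimum Credits of that row's Status group
group_min = df5.groupby("Status")["Credits"].transform("min")

# Same row order as df5, so it can be assigned straight back
df5["AboveMin"] = df5["Credits"] - group_min

print(df5)
```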