# 1. Pandas - Concepts and Dataframe Operations

## Learning Objectives

- Understand the nature of Pandas series and dataframes
- Know how to index and slice Pandas dataframe columns to return dataframes and series


## Pandas

- Pandas and Numpy are probably the two most important libraries for data science in Python
- The main Pandas object is a _dataframe_, which stores data in rows and columns. There are also series objects, which we cover only briefly here
- The best way to think of Pandas is to consider it as a very powerful version of Excel, and a dataframe is like a spreadsheet
- Each column of a dataframe object is a Pandas _series_ object
- A Pandas series object is built from a Numpy array (series is like an array but with labelled index)
- We import using the following syntax by convention:

In [1]:
import pandas as pd
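
To make the relationship between Numpy arrays, series and dataframes more concrete, here is a minimal sketch. The names `values`, `labels`, `pay` and `toy_df`, and the numbers in them, are made up for illustration and are not part of the salaries dataset used below.

```python
import numpy as np
import pandas as pd

# Hypothetical values and labels, for illustration only
values = np.array([10.0, 20.0, 30.0])
labels = ["AA", "AB", "AC"]

# A series wraps an array of values and attaches a label to each one
pay = pd.Series(values, index=labels, name="pay")

# A dataframe holds several columns that share the same index
toy_df = pd.DataFrame({"pay": values, "bonus": [1.0, 2.0, 3.0]}, index=labels)

print(pay["AB"])            # look up a value by its label
print(type(toy_df["pay"]))  # a single column of the dataframe is a series
print(pay.to_numpy())       # the values as a Numpy array
```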

In [2]:
# Do not worry about this, here we are just reading in a dataset to demonstrate operations
sf_sal = pd.read_csv('https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/Salaries.csv')
sf_sal

,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.00,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.60,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.00,56120.71,198306.90,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.60,9737.00,182234.59,,326373.19,326373.19,2011,,San Francisco,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
671,672,CHERYL ADAMS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.89,,180394.07,180394.07,2011,,San Francisco,
672,673,LOUISE SIMPSON,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.80,,180393.98,180393.98,2011,,San Francisco,
673,674,BLAKE LOEBS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.19,0.00,3537.75,,180393.94,180393.94,2011,,San Francisco,
674,675,ELIZABETH AGUILAR-TARCHI,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.17,0.00,3537.11,,180393.28,180393.28,2011,,San Francisco,


The next cell is not important. It simply creates a list of all two-letter combinations, which we will use as an index to show you how to set the index of a Pandas dataframe.

In [3]:
# Again, do not worry about this cell; it just creates a list of 2-letter alphabetical codes

import string
alphabet = string.ascii_uppercase

index_list = []

for first in alphabet:
    for second in alphabet:
        index_list.append(first + second)

print(index_list)

['AA', 'AB', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AJ', 'AK', 'AL', 'AM', 'AN', 'AO', 'AP', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AV', 'AW', 'AX', 'AY', 'AZ', 'BA', 'BB', 'BC', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BK', 'BL', 'BM', 'BN', 'BO', 'BP', 'BQ', 'BR', 'BS', 'BT', 'BU', 'BV', 'BW', 'BX', 'BY', 'BZ', 'CA', 'CB', 'CC', 'CD', 'CE', 'CF', 'CG', 'CH', 'CI', 'CJ', 'CK', 'CL', 'CM', 'CN', 'CO', 'CP', 'CQ', 'CR', 'CS', 'CT', 'CU', 'CV', 'CW', 'CX', 'CY', 'CZ', 'DA', 'DB', 'DC', 'DD', 'DE', 'DF', 'DG', 'DH', 'DI', 'DJ', 'DK', 'DL', 'DM', 'DN', 'DO', 'DP', 'DQ', 'DR', 'DS', 'DT', 'DU', 'DV', 'DW', 'DX', 'DY', 'DZ', 'EA', 'EB', 'EC', 'ED', 'EE', 'EF', 'EG', 'EH', 'EI', 'EJ', 'EK', 'EL', 'EM', 'EN', 'EO', 'EP', 'EQ', 'ER', 'ES', 'ET', 'EU', 'EV', 'EW', 'EX', 'EY', 'EZ', 'FA', 'FB', 'FC', 'FD', 'FE', 'FF', 'FG', 'FH', 'FI', 'FJ', 'FK', 'FL', 'FM', 'FN', 'FO', 'FP', 'FQ', 'FR', 'FS', 'FT', 'FU', 'FV', 'FW', 'FX', 'FY', 'FZ', 'GA', 'GB', 'GC', 'GD', 'GE', 'GF', 'GG', 'GH', 'GI', 'GJ', 'GK

So, as mentioned, we are going to use these letter codes as the index of the salaries dataframe, `sf_sal`:

In [4]:
sf_sal["Id"] = index_list
sf_sal.set_index("Id", inplace=True)
sf_sal
sf_sal.to_csv('data.csv')
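
The same two steps (overwrite a column, then promote it to the index) can be sketched on a tiny dataframe. The dataframe `toy_df` and its contents are hypothetical. This version reassigns the result of `set_index` instead of passing `inplace=True`, which is a stylistic choice and leaves the original dataframe untouched.

```python
import pandas as pd

# Hypothetical three-row dataframe, for illustration only
toy_df = pd.DataFrame(
    {"Id": [1, 2, 3], "EmployeeName": ["A ONE", "B TWO", "C THREE"]}
)

# Overwrite the Id column; the list must have one entry per row
toy_df["Id"] = ["AA", "AB", "AC"]

# set_index returns a new dataframe whose index is the old Id column
indexed = toy_df.set_index("Id")

print(indexed.index.tolist())    # the letter codes are now the row labels
print(indexed.columns.tolist())  # Id is no longer a regular column
```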

In [43]:
# Here we are just reading in a dataset to demonstrate operations.
sf_sal = pd.read_csv('https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/Salaries.csv', index_col='Id')
# Notice that we set the index to be the 'Id' column. Try to remove that argument and see what happens!
sf_sal

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.00,400184.25,,567595.43,567595.43,2011,,San Francisco,
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.60,,335279.91,335279.91,2011,,San Francisco,
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.00,56120.71,198306.90,,332343.61,332343.61,2011,,San Francisco,
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.60,9737.00,182234.59,,326373.19,326373.19,2011,,San Francisco,
...,...,...,...,...,...,...,...,...,...,...,...,...
ZV,CHERYL ADAMS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.89,,180394.07,180394.07,2011,,San Francisco,
ZW,LOUISE SIMPSON,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.80,,180393.98,180393.98,2011,,San Francisco,
ZX,BLAKE LOEBS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.19,0.00,3537.75,,180393.94,180393.94,2011,,San Francisco,
ZY,ELIZABETH AGUILAR-TARCHI,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.17,0.00,3537.11,,180393.28,180393.28,2011,,San Francisco,


## Selection and Indexing

- We use square brackets to index, primarily by columns
- We can pass in a list of columns to select
- __Single bracket__ notation returns a __Pandas series__
- __Double bracket__ returns a __Pandas dataframe__.

> A Pandas series is a single column of the Pandas dataframe

Although a Pandas series is always a single column, a single column can also be held as a one-column Pandas dataframe

For example, observe the type we get by using single brackets:

In [40]:
type(sf_sal["BasePay"])

pandas.core.series.Series

And see now what happens if we use double brackets:

In [41]:
type(sf_sal[["BasePay"]])

pandas.core.frame.DataFrame

A Pandas series shares many methods with a Pandas dataframe. However, when you display a Pandas series, the output also shows the name, length and dtype of the column you are retrieving:

In [42]:
sf_sal["BasePay"]

Id
AA    167411.18
AB    155966.02
AC    212739.13
AD     77916.00
AE    134401.60
        ...    
ZV    176856.18
ZW    176856.18
ZX    176856.19
ZY    176856.17
ZZ    130457.76
Name: BasePay, Length: 676, dtype: float64

But that summary line does not appear when you display a Pandas dataframe:

In [43]:
sf_sal[["BasePay"]]

Id,BasePay
AA,167411.18
AB,155966.02
AC,212739.13
AD,77916.00
AE,134401.60
...,...
ZV,176856.18
ZW,176856.18
ZX,176856.19
ZY,176856.17
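
The name, length and dtype printed under a series can also be read directly. Below is a minimal sketch on a hypothetical two-row series called `base_pay`; it also uses `to_frame()`, which is another way (besides double brackets) to turn a series into a one-column dataframe.

```python
import pandas as pd

# Hypothetical two-row series, for illustration only
base_pay = pd.Series([167411.18, 155966.02], index=["AA", "AB"], name="BasePay")

print(base_pay.name)   # BasePay
print(len(base_pay))   # 2
print(base_pay.dtype)  # float64

# Convert the series into a one-column dataframe
as_frame = base_pay.to_frame()
print(type(as_frame).__name__)    # DataFrame
print(as_frame.columns.tolist())  # ['BasePay']
```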


When you use double brackets, you can select multiple columns in the same call:

In [44]:
# selecting multiple columns requires double brackets

sf_sal[["BasePay","TotalPay"]]

Id,BasePay,TotalPay
AA,167411.18,567595.43
AB,155966.02,538909.28
AC,212739.13,335279.91
AD,77916.00,332343.61
AE,134401.60,326373.19
...,...,...
ZV,176856.18,180394.07
ZW,176856.18,180393.98
ZX,176856.19,180393.94
ZY,176856.17,180393.28
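
It may help to see that "double brackets" are simply a Python list placed inside the indexing brackets. The sketch below uses a hypothetical dataframe `toy_df` and stores the list of column names in a variable first.

```python
import pandas as pd

# Hypothetical two-row dataframe, for illustration only
toy_df = pd.DataFrame(
    {
        "BasePay": [100.0, 200.0],
        "OvertimePay": [5.0, 0.0],
        "TotalPay": [105.0, 200.0],
    },
    index=["AA", "AB"],
)

# The inner brackets are just a list of column names
pay_columns = ["BasePay", "TotalPay"]
subset = toy_df[pay_columns]   # same as toy_df[["BasePay", "TotalPay"]]
single = toy_df["BasePay"]     # a plain string returns a series

print(type(subset).__name__, subset.shape)  # DataFrame (2, 2)
print(type(single).__name__, single.shape)  # Series (2,)
```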


You will often want a quick overview of a whole dataframe, and computing its statistics by hand would be very tedious. Luckily, Pandas has two methods for this: `info()` and `describe()`.

## `info()` and `describe()`

`info()` and `describe()` give us a brief summary of the dataframe we are handling. Let's see their outputs and their differences:

In [45]:
sf_sal.info()

<class 'pandas.core.frame.DataFrame'>
Index: 676 entries, AA to ZZ
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   EmployeeName      676 non-null    object 
 1   JobTitle          676 non-null    object 
 2   BasePay           676 non-null    float64
 3   OvertimePay       676 non-null    float64
 4   OtherPay          676 non-null    float64
 5   Benefits          0 non-null      float64
 6   TotalPay          676 non-null    float64
 7   TotalPayBenefits  676 non-null    float64
 8   Year              676 non-null    int64  
 9   Notes             0 non-null      float64
 10  Agency            676 non-null    object 
 11  Status            0 non-null      float64
dtypes: float64(8), int64(1), object(3)
memory usage: 84.8+ KB


In [46]:
sf_sal.describe()

,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Status
count,676.0,676.0,676.0,0.0,676.0,676.0,676.0,0.0,0.0
mean,149900.019867,30577.231657,23448.147426,,203925.39895,203925.39895,2011.0,,
std,41837.73621,35124.001536,30720.629073,,31798.561906,31798.561906,0.0,,
min,25400.0,0.0,0.0,,180312.67,180312.67,2011.0,,
25%,117268.875,0.0,7006.3175,,185695.47,185695.47,2011.0,,
50%,144042.16,16664.63,16491.805,,194842.065,194842.065,2011.0,,
75%,184727.1325,57995.9625,27305.23,,209450.04,209450.04,2011.0,,
max,294580.02,245131.88,400184.25,,567595.43,567595.43,2011.0,,


Notice that `info()` lists each column with its number of non-null values and its dtype. On the other hand, `describe()` returns summary statistics for the columns, such as the mean, standard deviation, minimum and maximum.

But wait, the output of `describe()` lists fewer columns than the dataframe actually has. Where did they go? On a closer look, the non-numeric (categorical) columns are missing: by default, `describe()` summarises only the numeric columns.

Take a look at this [page](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) to know how to include those columns and run the `describe` method again to see the output.
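
As a hint, here is a minimal sketch on a hypothetical dataframe `toy_df` that has one text column and one numeric column. It uses the `include` parameter of `describe()` with the value `"all"`, which is one of the options listed on that page; try it on `sf_sal` yourself to see the real output.

```python
import pandas as pd

# Hypothetical dataframe with one text column and one numeric column
toy_df = pd.DataFrame(
    {
        "JobTitle": ["CAPTAIN", "CAPTAIN", "MECHANIC"],
        "BasePay": [100.0, 200.0, 300.0],
    }
)

# By default only the numeric column is summarised
default_summary = toy_df.describe()
print(default_summary.columns.tolist())

# include="all" summarises every column; statistics that do not
# apply to a column (e.g. the mean of a text column) appear as NaN
full_summary = toy_df.describe(include="all")
print(full_summary)
```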

## Lambda/Anonymous Expressions with .apply() method

If you want to apply the same function to all the values in a column, __avoid iterating through the whole dataframe!__ Pandas offers a method that is more concise and usually faster than looping over the rows by hand: `apply()`. The `apply()` method takes in a function and returns a new series or dataframe with the function applied to each member of the original.

Here, knowing how to use `lambda` functions comes in very handy. You can pass a lambda function as the argument to the method. Remember their syntax:

- `lambda input : output` (in terms of input)
- e.g. to square --> `lambda x : x**2`
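
As a quick refresher, the sketch below shows the squaring lambda on its own and then passed to `apply()` on a small, hypothetical series called `numbers`.

```python
import pandas as pd

# A lambda is a function written inline: input on the left, output on the right
square = lambda x: x**2
print(square(3))  # 9

# Passing the lambda to apply() calls it once for every value in the series
numbers = pd.Series([1, 2, 3])  # hypothetical data, for illustration only
squared = numbers.apply(lambda x: x**2)
print(squared.tolist())  # [1, 4, 9]
```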

For example, if you want to obtain a column with the surnames of the employees based on the column with the names, you can:
- First extract the column you want to use (`sf_sal["EmployeeName"]`)
- Implement the `apply` method on it
- Use a `lambda` function that returns the second element after splitting the name
    - For example, if the name is "Ivan Ying", split will return the following list `["Ivan", "Ying"]`
    - So, by taking the second element, we obtain "Ying"
    - Thus, the whole operation would look like `"Ivan Ying".split()[1]`
- Assign the output to a column in the dataframe `sf_sal["Surname"]`

In [47]:
sf_sal["Surname"] = sf_sal["EmployeeName"].apply(lambda x: x.split()[1])
sf_sal.head()

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,Surname
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,FORD
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,JIMENEZ
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,,PARDINI
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,,CHONG
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,GARDNER


We encourage you to use lambda functions. But if you are still struggling with them, you can also use regular functions as an input for `apply`.

For example, in the next cell, we perform the same operation using a regular function:

In [48]:
def find_surname(x):
    return x.split()[1]

sf_sal["Surname"] = sf_sal["EmployeeName"].apply(find_surname)

# check head to see if it worked
sf_sal.head()

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,Surname
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,FORD
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,JIMENEZ
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,,PARDINI
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,,CHONG
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,GARDNER
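
One caveat: `x.split()[1]` takes the second word, so it returns a middle name for three-word names and raises an `IndexError` for single-word names. A hedged variation is to take the last word with `[-1]` instead. The sketch below uses hypothetical names that are not taken from the dataset, and also shows the `.str` accessor, a built-in Pandas alternative to `apply()` for string columns.

```python
import pandas as pd

# Hypothetical names, for illustration only
names = pd.Series(["NATHANIEL FORD", "MARY ANN SMITH", "CHER"], name="EmployeeName")

# Take the last word rather than the second one
last_word = names.apply(lambda x: x.split()[-1])
print(last_word.tolist())  # ['FORD', 'SMITH', 'CHER']

# The same result using the .str accessor instead of apply()
last_word_str = names.str.split().str[-1]
print(last_word_str.tolist())  # ['FORD', 'SMITH', 'CHER']
```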


You can also create columns based on conditions. For example, let's say that we want to flag all the rows that have the word "POLICE" in their `JobTitle`.

You can:
1. Create a function that returns `True` if the word "POLICE" is in the string
2. Use that function as an argument in `apply()`
3. Use it on the `JobTitle` column
4. Assign the output to a new column

In [49]:
# define function to find 'police' first

def find_police(x):
    return "POLICE" in x


# use apply to search for it in JobTitle
sf_sal["isPolice"] = sf_sal["JobTitle"].apply(find_police)

sf_sal[["JobTitle", "isPolice"]]

Id,JobTitle,isPolice
AA,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,False
AB,CAPTAIN III (POLICE DEPARTMENT),True
AC,CAPTAIN III (POLICE DEPARTMENT),True
AD,WIRE ROPE CABLE MAINTENANCE MECHANIC,False
AE,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",False
...,...,...
ZV,"HEAD ATTORNEY, CIVIL AND CRIMINAL",False
ZW,"HEAD ATTORNEY, CIVIL AND CRIMINAL",False
ZX,"HEAD ATTORNEY, CIVIL AND CRIMINAL",False
ZY,"HEAD ATTORNEY, CIVIL AND CRIMINAL",False


And now you can count how many rows have "POLICE" in their job title:

In [50]:
# summing a boolean column counts the True values (True counts as 1, False as 0)
sf_sal["isPolice"].sum()

139
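
The flag-then-count pattern can be sketched on a small, hypothetical series of job titles. The second half uses `.str.contains()`, a built-in Pandas string method, as an alternative to `apply()`; `regex=False` makes it search for the literal text.

```python
import pandas as pd

# Hypothetical job titles, for illustration only
titles = pd.Series(
    ["CAPTAIN III (POLICE DEPARTMENT)", "MECHANIC", "POLICE OFFICER"],
    name="JobTitle",
)

# Flag each row, then count the True values by summing
is_police = titles.apply(lambda x: "POLICE" in x)
print(is_police.tolist())  # [True, False, True]
print(is_police.sum())     # 2

# Same flags using the .str accessor
is_police_str = titles.str.contains("POLICE", regex=False)
print(is_police_str.sum())  # 2
```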

### Try it out

Using what you have learnt about `apply`, how many employees have a surname starting with the letter "A"?

## Key Takeaways

- Pandas and Numpy are two of the most important libraries for Python. They are often used for data engineering and data science tasks.
- The main Pandas object is called a _dataframe_. It stores data in rows and columns, similar to how a relational database stores data.
- A _series_ is a column within a dataframe. It is built from a Numpy array and has a labelled index.
- We can use square brackets to index columns. A single bracket `[]` returns a series, while a double bracket `[[]]` returns a dataframe.
- To obtain a brief summary of a dataframe, we can use the `.info()` and `.describe()` methods.
- Using the `.apply()` method, we can apply a function to every value in a column without writing an explicit loop.

## Further reading
- More details on pandas operations are available in pandas documentation: https://pandas.pydata.org/docs/