# 1. Introduction to Pandas

## Learning objectives
- Understand the nature of Pandas Series and DataFrames.
- Know how to index and slice DataFrame columns to return DataFrames and Series.
- Know how to make new columns from existing columns.
- Understand the difference between None and np.nan.
- Know how to drop rows and columns with .drop().
- Know how to use .loc[] and .iloc[].
- Understand conditional selection and boolean masking.
- Know how to use .set_index() and .reset_index().

## Pandas

- Pandas, along with NumPy, is probably the most important library for data science in Python.
- The main Pandas object is the DataFrame. There are also Series, which we cover only briefly here.
- A good way to think of Pandas is as a very powerful version of Excel, where a DataFrame is like a spreadsheet.
- DataFrames have rows and columns.
- Each column of a DataFrame object is a Pandas Series object.
- A Pandas Series object is built on top of a NumPy array (a Series is like an array, but with a labelled index).
- By convention, we import Pandas using the following syntax:

In [1]:
import pandas as pd
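
To make the relationship between Series, DataFrames and NumPy arrays concrete, here is a minimal sketch on a made-up three-row dataset. The labels and values are illustrative assumptions, not taken from the salaries file.

```python
import numpy as np
import pandas as pd

# A Series wraps a NumPy array and attaches a label to each value.
pay = pd.Series(
    np.array([100.0, 250.0, 175.0]),
    index=["AA", "AB", "AC"],  # hypothetical row labels
    name="BasePay",
)

# Values can be looked up by label rather than by position.
print(pay["AB"])  # 250.0

# The underlying data is still available as a NumPy array.
print(type(pay.to_numpy()))

# A DataFrame is a collection of Series that share the same index.
toy = pd.DataFrame({"BasePay": pay, "Year": [2011, 2011, 2011]})
print(type(toy["BasePay"]))
```

Each column pulled out of `toy` with single brackets comes back as a Series carrying the same row labels.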

In [2]:
# do not worry about this, here we are just reading in a dataset to demonstrate operations
sf_sal = pd.read_csv('https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/Salaries.csv')
sf_sal

,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.00,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.60,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.00,56120.71,198306.90,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.60,9737.00,182234.59,,326373.19,326373.19,2011,,San Francisco,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
671,672,CHERYL ADAMS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.89,,180394.07,180394.07,2011,,San Francisco,
672,673,LOUISE SIMPSON,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.80,,180393.98,180393.98,2011,,San Francisco,
673,674,BLAKE LOEBS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.19,0.00,3537.75,,180393.94,180393.94,2011,,San Francisco,
674,675,ELIZABETH AGUILAR-TARCHI,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.17,0.00,3537.11,,180393.28,180393.28,2011,,San Francisco,


The next cell is not important; it simply creates a list of every two-letter combination. We will use that list to show you how to set the index of a Pandas DataFrame.

In [3]:
# again do not worry about this cell, it is just creating a list of 2-letter alphabetical codes

import string

alphabet = string.ascii_uppercase

# build every two-letter code: 'AA', 'AB', ..., 'ZZ'
index_list = []

for first in alphabet:
    for second in alphabet:
        index_list.append(first + second)

print(index_list)

['AA', 'AB', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AJ', 'AK', 'AL', 'AM', 'AN', 'AO', 'AP', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AV', 'AW', 'AX', 'AY', 'AZ', 'BA', 'BB', 'BC', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BK', 'BL', 'BM', 'BN', 'BO', 'BP', 'BQ', 'BR', 'BS', 'BT', 'BU', 'BV', 'BW', 'BX', 'BY', 'BZ', 'CA', 'CB', 'CC', 'CD', 'CE', 'CF', 'CG', 'CH', 'CI', 'CJ', 'CK', 'CL', 'CM', 'CN', 'CO', 'CP', 'CQ', 'CR', 'CS', 'CT', 'CU', 'CV', 'CW', 'CX', 'CY', 'CZ', 'DA', 'DB', 'DC', 'DD', 'DE', 'DF', 'DG', 'DH', 'DI', 'DJ', 'DK', 'DL', 'DM', 'DN', 'DO', 'DP', 'DQ', 'DR', 'DS', 'DT', 'DU', 'DV', 'DW', 'DX', 'DY', 'DZ', 'EA', 'EB', 'EC', 'ED', 'EE', 'EF', 'EG', 'EH', 'EI', 'EJ', 'EK', 'EL', 'EM', 'EN', 'EO', 'EP', 'EQ', 'ER', 'ES', 'ET', 'EU', 'EV', 'EW', 'EX', 'EY', 'EZ', 'FA', 'FB', 'FC', 'FD', 'FE', 'FF', 'FG', 'FH', 'FI', 'FJ', 'FK', 'FL', 'FM', 'FN', 'FO', 'FP', 'FQ', 'FR', 'FS', 'FT', 'FU', 'FV', 'FW', 'FX', 'FY', 'FZ', 'GA', 'GB', 'GC', 'GD', 'GE', 'GF', 'GG', 'GH', 'GI', 'GJ', 'GK

So, as mentioned, we are going to use these letters as the indices for the salaries dataset

In [4]:
sf_sal["Id"] = index_list
sf_sal.set_index("Id", inplace=True)
sf_sal
sf_sal.to_csv('data.csv')

In [43]:
# Here we are just reading in a dataset to demonstrate operations.
sf_sal = pd.read_csv('https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/Salaries.csv', index_col='Id')
# Notice that we set the index to be the 'Id' column. Try to remove that argument and see what happens!
sf_sal

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.00,400184.25,,567595.43,567595.43,2011,,San Francisco,
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.60,,335279.91,335279.91,2011,,San Francisco,
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.00,56120.71,198306.90,,332343.61,332343.61,2011,,San Francisco,
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.60,9737.00,182234.59,,326373.19,326373.19,2011,,San Francisco,
...,...,...,...,...,...,...,...,...,...,...,...,...
ZV,CHERYL ADAMS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.89,,180394.07,180394.07,2011,,San Francisco,
ZW,LOUISE SIMPSON,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.80,,180393.98,180393.98,2011,,San Francisco,
ZX,BLAKE LOEBS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.19,0.00,3537.75,,180393.94,180393.94,2011,,San Francisco,
ZY,ELIZABETH AGUILAR-TARCHI,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.17,0.00,3537.11,,180393.28,180393.28,2011,,San Francisco,


## Selection and Indexing

- We use square brackets to index, primarily by columns.
- We can pass in a list of columns to select.
- __Single bracket__ notation returns a __Pandas Series__.
- __Double bracket__ notation returns a __Pandas DataFrame__.

> <font size=+1> A Pandas Series is a single column of the Pandas DataFrame </font>

Although a Pandas Series is always a single column, a single column can also be returned as a one-column Pandas DataFrame.

For example, observe the type we get by using single brackets:

In [40]:
type(sf_sal["BasePay"])

pandas.core.series.Series

And see now what happens if we use double brackets:

In [41]:
type(sf_sal[["BasePay"]])

pandas.core.frame.DataFrame

A Pandas Series shares many methods with a Pandas DataFrame. However, when you display a Pandas Series, it also shows the name, length and dtype of the column:

In [42]:
sf_sal["BasePay"]

Id
AA    167411.18
AB    155966.02
AC    212739.13
AD     77916.00
AE    134401.60
        ...    
ZV    176856.18
ZW    176856.18
ZX    176856.19
ZY    176856.17
ZZ    130457.76
Name: BasePay, Length: 676, dtype: float64

But that footer does not appear when you display a Pandas DataFrame:

In [43]:
sf_sal[["BasePay"]]

Id,BasePay
AA,167411.18
AB,155966.02
AC,212739.13
AD,77916.00
AE,134401.60
...,...
ZV,176856.18
ZW,176856.18
ZX,176856.19
ZY,176856.17


When you use double brackets, you can select multiple columns in the same call:

In [44]:
# selecting multiple columns requires double brackets

sf_sal[["BasePay","TotalPay"]]

Id,BasePay,TotalPay
AA,167411.18,567595.43
AB,155966.02,538909.28
AC,212739.13,335279.91
AD,77916.00,332343.61
AE,134401.60,326373.19
...,...,...
ZV,176856.18,180394.07
ZW,176856.18,180393.98
ZX,176856.19,180393.94
ZY,176856.17,180393.28


You will often want a quick overview of a whole DataFrame, and computing its statistics manually would be very tedious. Luckily, Pandas provides two methods for this: `info()` and `describe()`.

## `info()` and `describe()`

`info()` and `describe()` give us a brief summary of the DataFrame we are handling. Let's look at their outputs and how they differ:

In [45]:
sf_sal.info()

<class 'pandas.core.frame.DataFrame'>
Index: 676 entries, AA to ZZ
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   EmployeeName      676 non-null    object 
 1   JobTitle          676 non-null    object 
 2   BasePay           676 non-null    float64
 3   OvertimePay       676 non-null    float64
 4   OtherPay          676 non-null    float64
 5   Benefits          0 non-null      float64
 6   TotalPay          676 non-null    float64
 7   TotalPayBenefits  676 non-null    float64
 8   Year              676 non-null    int64  
 9   Notes             0 non-null      float64
 10  Agency            676 non-null    object 
 11  Status            0 non-null      float64
dtypes: float64(8), int64(1), object(3)
memory usage: 84.8+ KB


In [46]:
sf_sal.describe()

,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Status
count,676.0,676.0,676.0,0.0,676.0,676.0,676.0,0.0,0.0
mean,149900.019867,30577.231657,23448.147426,,203925.39895,203925.39895,2011.0,,
std,41837.73621,35124.001536,30720.629073,,31798.561906,31798.561906,0.0,,
min,25400.0,0.0,0.0,,180312.67,180312.67,2011.0,,
25%,117268.875,0.0,7006.3175,,185695.47,185695.47,2011.0,,
50%,144042.16,16664.63,16491.805,,194842.065,194842.065,2011.0,,
75%,184727.1325,57995.9625,27305.23,,209450.04,209450.04,2011.0,,
max,294580.02,245131.88,400184.25,,567595.43,567595.43,2011.0,,


Notice that `info()` lists the columns with their data types and number of non-null values. `describe()`, on the other hand, returns summary statistics for the columns, such as the mean, standard deviation and minimum value.

But wait, the output of `describe()` lists fewer columns than the DataFrame actually has. Where did they go? On a closer look, the non-numeric columns (`EmployeeName`, `JobTitle` and `Agency`) are missing, because `describe()` only summarises numeric columns by default.

Take a look at this [page](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) to find out how to include those columns, then run the `describe` method again to see the output.
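
As a hint, the sketch below shows the default behaviour next to the `include="all"` option on a small made-up DataFrame. The column values are assumptions for illustration; try the same idea on `sf_sal` yourself.

```python
import pandas as pd

# Toy data with one text column and one numeric column.
toy = pd.DataFrame({
    "JobTitle": ["CAPTAIN", "CAPTAIN", "MECHANIC"],
    "BasePay": [100.0, 250.0, 175.0],
})

# Default: only the numeric column is summarised.
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())

# include="all" summarises every column; text columns get
# count / unique / top / freq instead of mean / std / quantiles.
full_summary = toy.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column (for example the mean of a text column) are shown as `NaN` in the combined summary.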

## Lambda/Anonymous Expressions with .apply() method

If you want to apply the same function to all the values in a column, __avoid writing an explicit loop over the whole DataFrame!__ Pandas offers a method that is more concise and usually faster than looping over rows by hand: `apply()`. The `apply()` method takes in a function and returns a new Series or DataFrame with the function applied to the elements of the Series (or to the rows or columns of the DataFrame) it is called on.

Here, knowing how to use `lambda` functions comes in very handy. You can pass a lambda function as the argument to the method. Remember the syntax:

- `lambda input : output` (in terms of input)
- e.g. to square --> `lambda x : x**2`

For example, if you want to obtain a column with the surnames of the employees based on the column with the names, you can:
- First extract the column you want to use (`sf_sal["EmployeeName"]`)
- Apply the `apply` method to it
- Use a `lambda` function that returns the second element after splitting the name
    - For example, if the name is "Ivan Ying", split will return the following list `["Ivan", "Ying"]`
    - So, by taking the second element, we obtain "Ying"
    - Thus, the whole operation would look like `"Ivan Ying".split()[1]`
- Assign the output to a column in the DataFrame `sf_sal["Surname"]`

In [47]:
sf_sal["Surname"] = sf_sal["EmployeeName"].apply(lambda x: x.split()[1])
sf_sal.head()

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,Surname
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,FORD
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,JIMENEZ
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,,PARDINI
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,,CHONG
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,GARDNER


We encourage you to use lambda functions. But if you are still struggling with them, you can also pass a regular function to `apply`.

For example, in the next cell, we perform the same operation using a regular function:

In [48]:
def find_surname(x):
    return x.split()[1]

sf_sal["Surname"] = sf_sal["EmployeeName"].apply(find_surname)

# check head to see if it worked
sf_sal.head()

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,Surname
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,FORD
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,JIMENEZ
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,,PARDINI
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,,CHONG
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,GARDNER
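
One caveat worth sketching: `x.split()[1]` returns the second word, which is only the surname when a name has exactly two words. The example below uses one name from the dataset and one made-up three-word name ("MARY ANN SMITH" is an assumption for illustration) to compare it with taking the last word instead. It also shows the `.str` accessor as an alternative to `apply` for string operations.

```python
import pandas as pd

names = pd.Series(["NATHANIEL FORD", "MARY ANN SMITH"], name="EmployeeName")

# Second word: fine for two-word names, wrong for the three-word one.
second_word = names.apply(lambda x: x.split()[1])
print(second_word.tolist())  # ['FORD', 'ANN']

# Last word: a simple heuristic that handles middle names.
last_word = names.apply(lambda x: x.split()[-1])
print(last_word.tolist())  # ['FORD', 'SMITH']

# The same operation written with the vectorised .str accessor.
last_word_str = names.str.split().str[-1]
print(last_word_str.tolist())  # ['FORD', 'SMITH']
```

Taking the last word is still only a heuristic; suffixes such as "JR" would need extra handling.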


You can also create other columns based on different conditions. For example, let's say that we want to flag all the rows that have the word "POLICE" in their JobTitle.

You can:
1. Create a function that returns `True` if the word "POLICE" is in the string
2. Use that function as an argument in `apply()`
3. Use it on the "JobTitle" column
4. Assign the output to a new column

In [49]:
# define function to find 'police' first

def find_police(x):
    return "POLICE" in x


# use apply to search for it in JobTitle
sf_sal["isPolice"] = sf_sal["JobTitle"].apply(find_police)

sf_sal[["JobTitle", "isPolice"]]

Id,JobTitle,isPolice
AA,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,False
AB,CAPTAIN III (POLICE DEPARTMENT),True
AC,CAPTAIN III (POLICE DEPARTMENT),True
AD,WIRE ROPE CABLE MAINTENANCE MECHANIC,False
AE,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",False
...,...,...
ZV,"HEAD ATTORNEY, CIVIL AND CRIMINAL",False
ZW,"HEAD ATTORNEY, CIVIL AND CRIMINAL",False
ZX,"HEAD ATTORNEY, CIVIL AND CRIMINAL",False
ZY,"HEAD ATTORNEY, CIVIL AND CRIMINAL",False


Because `True` counts as 1 and `False` as 0, summing the column tells you how many police employees there are in the DataFrame:

In [50]:
# sum
sf_sal["isPolice"].sum()

139
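
For simple substring checks like this one, Pandas also provides `Series.str.contains`, which avoids writing a helper function. The sketch below uses a few job titles, one of which ("POLICE OFFICER") is made up for illustration.

```python
import pandas as pd

titles = pd.Series([
    "CAPTAIN III (POLICE DEPARTMENT)",
    "WIRE ROPE CABLE MAINTENANCE MECHANIC",
    "POLICE OFFICER",  # hypothetical title
])

# regex=False treats "POLICE" as a plain substring, not a pattern.
is_police = titles.str.contains("POLICE", regex=False)
print(is_police.tolist())  # [True, False, True]

# Summing a boolean Series counts the True values.
police_count = int(is_police.sum())
print(police_count)  # 2
```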

### Try it out

Using what you have learnt about `apply`, how many employees have a surname starting with the letter "A"?

## NoneType


- We must distinguish between `None`, `NaN` and `0` here.
- `None` has the data type `NoneType`. It is a value in its own right, and we can use it as a placeholder before adding values.
- `0` is an integer, and therefore a value: it shows that we have a response in the cell, but the response is 0.
- `NaN` stands for "Not a Number" and denotes a MISSING value, although it has the type `float`.
- We can see this clearly when we check the type of each:

In [51]:
type(None)

NoneType

In [52]:
type(0)

int

In [53]:
import numpy as np

print(type(np.nan))
# any arithmetic involving NaN gives NaN
print(np.nan + float("inf"))

<class 'float'>
nan
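
A few practical consequences of these differences are sketched below on made-up values. The main points are that `NaN` is not equal to itself, so missing values should be detected with `pd.isna` (or `.isna()`), and that Pandas converts `None` to `NaN` when it appears in a numeric column.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself.
print(np.nan == np.nan)  # False

# pd.isna recognises both None and NaN as missing, but not 0.
print(pd.isna(None), pd.isna(np.nan), pd.isna(0))  # True True False

# In a numeric Series, None is stored as NaN and the dtype stays float.
numeric = pd.Series([1.0, None, 3.0])
print(numeric.dtype)  # float64
print(numeric.isna().tolist())  # [False, True, False]
```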

## Adding and Removing Columns

So far, we have created new columns based on the values of other columns. But we can also create columns with any values we want. For example, let's say that we want to create an empty column (with `None` values).

You can simply index the DataFrame with the name of the new column and assign a value. Every row in the column will get the same value. Careful here! If a column with that name already exists, the whole column will be overwritten.

In [54]:
# can create new column with single value inc. integer/string/NoneType

sf_sal["new col"] = None

sf_sal.head() # shows the first 5 entries (the head) of the dataframe

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,Surname,isPolice,new col
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,FORD,False,
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,JIMENEZ,True,
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,,PARDINI,True,
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,,CHONG,False,
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,GARDNER,False,
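
The same assignment syntax also builds a column from arithmetic on existing columns. The sketch below uses a made-up two-row DataFrame, and the column name `BaseAndOvertime` is hypothetical. It also shows the overwrite behaviour described above.

```python
import pandas as pd

toy = pd.DataFrame(
    {"BasePay": [100.0, 250.0], "OvertimePay": [10.0, 0.0]},
    index=["AA", "AB"],
)

# Arithmetic between columns is applied row by row.
toy["BaseAndOvertime"] = toy["BasePay"] + toy["OvertimePay"]
print(toy["BaseAndOvertime"].tolist())  # [110.0, 250.0]

# Assigning a single value fills the whole column...
toy["Year"] = 2011
# ...and assigning to an existing name silently replaces it.
toy["Year"] = 2012
print(toy["Year"].tolist())  # [2012, 2012]
```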


If you want to remove rows or columns, you can use the `drop` method. This method accepts the labels of the rows or columns you want to remove. You can pass a single label, or a list of labels. For example, to remove the row corresponding to index "AB":


In [55]:
sf_sal.drop("AB").head(5)

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,Surname,isPolice,new col
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,FORD,False,
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,,PARDINI,True,
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,,CHONG,False,
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,GARDNER,False,
AF,DAVID SULLIVAN,ASSISTANT DEPUTY CHIEF II,118602.0,8601.0,189082.74,,316285.74,316285.74,2011,,San Francisco,,SULLIVAN,False,


Or if you want to remove "AB", "AC", and "AD"

In [56]:
sf_sal.drop(["AB", "AC", "AD"]).head()

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,Surname,isPolice,new col
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,FORD,False,
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,GARDNER,False,
AF,DAVID SULLIVAN,ASSISTANT DEPUTY CHIEF II,118602.0,8601.0,189082.74,,316285.74,316285.74,2011,,San Francisco,,SULLIVAN,False,
AG,ALSON LEE,"BATTALION CHIEF, (FIRE DEPARTMENT)",92492.01,89062.9,134426.14,,315981.05,315981.05,2011,,San Francisco,,LEE,False,
AH,DAVID KUSHNER,DEPUTY DIRECTOR OF INVESTMENTS,256576.96,0.0,51322.5,,307899.46,307899.46,2011,,San Francisco,,KUSHNER,False,


By default, `drop` removes rows whose labels match the argument you pass. So, if you pass the name of a column without any additional arguments, it will raise an error (unless the DataFrame also has a row with that label, in which case that row is dropped).

So, if you want to drop a column, you need to set the `axis` argument to `1`.

Let's say that we want to remove the "Status" column. Without the `axis` argument, the call fails:

In [57]:
sf_sal.drop("Status")

KeyError: "['Status'] not found in axis"

Observe the error. Pandas is complaining because there is no row labelled "Status". If we specify the axis on which to look for "Status", the call succeeds:

In [None]:
sf_sal.drop("Status", axis=1).head(3)

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco
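
As an alternative to `axis`, `drop` also accepts the keyword arguments `index` and `columns`, which some readers find easier to remember. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

toy = pd.DataFrame(
    {"BasePay": [100.0, 250.0, 175.0], "Status": [None, None, None]},
    index=["AA", "AB", "AC"],
)

# Equivalent to toy.drop("Status", axis=1)
without_status = toy.drop(columns="Status")
print(without_status.columns.tolist())  # ['BasePay']

# Equivalent to toy.drop("AB")
without_ab = toy.drop(index="AB")
print(without_ab.index.tolist())  # ['AA', 'AC']

# Neither call modified the original DataFrame.
print(toy.shape)  # (3, 2)
```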


Notice that we dropped "AB", "AC", and "AD" in an earlier example, but they are still there!

By default, the `drop` method does not work "in place". That means it returns a new DataFrame and leaves the original DataFrame unchanged.

If you want to change the original DataFrame, you can set the `inplace` argument to `True`.

<font size=+0.75> Don't run this cell yet!</font>

In [None]:
sf_sal.drop("Status", axis=1, inplace=True)

If we run this cell, we will change the original state of the DataFrame, and the only way to get the column back is to load the data again. A better idea is to create a copy of the DataFrame and apply the changes to the copy.

However, we can't simply write `sf_sal_copy = sf_sal`, because both names would then refer to the same DataFrame, and any in-place operation performed through one name would affect the other as well.

Check the following [page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) to learn more about the `drop` method.

## `copy()`

In many programming languages, assigning a value to a variable copies that value into a new space in memory. Python works differently: assignment only gives an existing object another name. Remember that assigning a list to a new variable makes the new variable point to the same space in memory as the original variable.

DataFrames behave the same way. It is quite common to create new DataFrames based on existing ones, and if the new variable is just another name for the same DataFrame, changing it in place will also change the original.

To avoid this, you can use the `copy()` method, which creates an independent copy of the original DataFrame.

In [None]:
sf_copy = sf_sal.copy()

Now, we can remove the "Status" column with no consequences for `sf_sal`:

In [None]:
sf_copy.drop("Status", axis=1, inplace=True)
sf_copy.head()

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco
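
The difference between plain assignment and `copy()` can be checked on a toy DataFrame. The names `original`, `alias` and `independent` below are made up for this sketch.

```python
import pandas as pd

original = pd.DataFrame(
    {"BasePay": [100.0, 250.0], "Status": [None, None]},
    index=["AA", "AB"],
)

alias = original               # a second name for the same object
independent = original.copy()  # a separate DataFrame with the same data

# An in-place change made through the alias...
alias.drop("Status", axis=1, inplace=True)

# ...is visible through the original name, because both are one object.
print(alias is original)                 # True
print("Status" in original.columns)      # False

# The copy is unaffected.
print("Status" in independent.columns)   # True
```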


## `.loc` and `.iloc`

We saw that we can index a DataFrame by column name to obtain parts of that DataFrame. However, another common way to get values from a DataFrame is to use the `.loc` and `.iloc` indexers.

`.loc` selects rows or columns based on their labels. For example, row "AB" can be accessed using `loc["AB"]`. However, if you want to get an entire column, for example "JobTitle", you can't use `loc["JobTitle"]`, because a single label is looked up in the rows. Instead, you can use either of the following:
- `sf_sal["JobTitle"]`
- `sf_sal.loc[:, "JobTitle"]`, where the colon (`:`) indicates ALL rows

In [None]:
# use .loc[] to select rows by row names and columns by column names
print(sf_sal.loc["AB"])
sf_sal.loc[:, "JobTitle"].head(3)

EmployeeName                           GARY JIMENEZ
JobTitle            CAPTAIN III (POLICE DEPARTMENT)
BasePay                                   155966.02
OvertimePay                               245131.88
OtherPay                                  137811.38
Benefits                                        NaN
TotalPay                                  538909.28
TotalPayBenefits                          538909.28
Year                                           2011
Notes                                           NaN
Agency                                San Francisco
Status                                          NaN
Name: AB, dtype: object


Id
AA    GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY
AB                   CAPTAIN III (POLICE DEPARTMENT)
AC                   CAPTAIN III (POLICE DEPARTMENT)
Name: JobTitle, dtype: object

So, you can combine them using `["row name", "column name"]`. For example, let's say that you want to see the JobTitle of the sample corresponding to "FG". Easy:

In [None]:
sf_sal.loc["FG", "JobTitle"]

'FORENSIC TOXICOLOGIST'

If you don't care about the index labels, you can use `iloc`, which performs the same kind of selection but uses integer positions (starting at 0) instead of labels:

In [None]:
print(sf_sal.iloc[0])
print(sf_sal.iloc[0, 1])

EmployeeName                                        NATHANIEL FORD
JobTitle            GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY
BasePay                                                  167411.18
OvertimePay                                                    0.0
OtherPay                                                 400184.25
Benefits                                                       NaN
TotalPay                                                 567595.43
TotalPayBenefits                                         567595.43
Year                                                          2011
Notes                                                          NaN
Agency                                               San Francisco
Status                                                         NaN
Name: AA, dtype: object
GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY


You can also obtain subsets using `loc` and `iloc`

In [None]:
# for subset of rows and columns, use lists of each within .loc[]
# can index out of order

sf_sal.loc[["AC", "CV", "BK"],["BasePay", "Surname", "Year"]]

Id,BasePay,Surname,Year
AC,212739.13,PARDINI,2011
CV,233357.28,MOYER,2011
BK,245124.44,CURRIN,2011
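
Both indexers also accept slices, with one difference that is easy to trip over: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position, like ordinary Python slicing. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame(
    {"BasePay": [100.0, 250.0, 175.0, 90.0], "Year": [2011] * 4},
    index=["AA", "AB", "AC", "AD"],
)

# .loc slices by label and INCLUDES the end label "AC".
by_label = toy.loc["AA":"AC", "BasePay"]
print(by_label.index.tolist())  # ['AA', 'AB', 'AC']

# .iloc slices by position and EXCLUDES position 2.
by_position = toy.iloc[0:2, 0]
print(by_position.index.tolist())  # ['AA', 'AB']
```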


### Comparing Dates in ISO format

ISO 8601 is an internationally recognised standard for representing dates and times. It makes them unambiguous and, when a UTC offset is included, allows exact times to be compared across time zones. Components are ordered from the largest unit to the smallest: dates are written as year-month-day and times as hours:minutes:seconds. So "September 17th 2021" would be written as 2021-09-17, or without the "-" delimiter as 20210917. With the time added, the format is `2021-09-17T18:16:11`, which represents 17th September 2021 at 16 minutes and 11 seconds past 6 pm. In practice, a space is often used in place of the `T`, as in the examples in this section.

The next table shows how each value range should be represented:


| Format according to ISO 8601 | Value ranges |
|---|---|
| Year (Y) | YYYY, four-digit, abbreviated to two-digit |
| Month (M) | MM, 01 to 12 |
| Month name (B) | Jan to Dec |
| Week (W) | WW, 01 to 53 |
| Day (D) | D, day of the week, 1 to 7 |
| day (d) | d, day of the month, 1 to 31 |
| Hour (h) | hh, 00 to 23, 24:00:00 as the end time |
| Minute (m) | mm, 00 to 59 |
| Second (s) | ss, 00 to 59 |
| Decimal fraction (f) | Fractions of seconds, any degree of accuracy |

A full list of format codes to convert strings can be found at the bottom of the `datetime` documentation [here](https://docs.python.org/3/library/datetime.html).


If we compare dates written as strings in a non-ISO format, we don't get the expected result, because Python compares strings character by character:

In [None]:
later_date_non_ISO = "February 17th 2021"
earlier_date_non_ISO = "January 16th 2000"

print(later_date_non_ISO < earlier_date_non_ISO)
print(later_date_non_ISO > earlier_date_non_ISO)

True
False

With ISO-formatted strings, the same kind of comparison gives the expected result, because the largest unit (the year) comes first:


In [12]:
later_date_ISO_format = "2021-02-17"
earlier_date_ISO_format = "2021-02-16"

print(later_date_ISO_format < earlier_date_ISO_format)
print(later_date_ISO_format > earlier_date_ISO_format)

False
True


And comparing times on the same day only a hundredth of a second apart:

In [13]:
later_date_ISO_format = "2021-02-16 11:01:11.10"
earlier_date_ISO_format = "2021-02-16 11:01:11.09"
print(later_date_ISO_format < earlier_date_ISO_format)
print(later_date_ISO_format > earlier_date_ISO_format)

False
True


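You can also use the `datetime` module from Python's standard library to convert a date string to a `datetime` object with the `strptime` method. `datetime` objects are compared by their actual date and time, regardless of how the original string was written.

The `strptime` method takes two arguments:
- the string to convert to a `datetime` object
- the format code describing how the string is laid out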



In [14]:
from datetime import datetime

non_ISO_datestring_one = "17 February, 1999"
non_ISO_datestring_two = "17/02/1999 18:40:02"
print(type(non_ISO_datestring_one))

datestring_one_ISO_format = datetime.strptime(non_ISO_datestring_one, "%d %B, %Y")
datestring_two_ISO_format = datetime.strptime(non_ISO_datestring_two, "%d/%m/%Y %H:%M:%S")
print(type(datestring_one_ISO_format))

print(datestring_one_ISO_format)
print(datestring_two_ISO_format)

print(datestring_one_ISO_format < datestring_two_ISO_format)


<class 'str'>
<class 'datetime.datetime'>
1999-02-17 00:00:00
1999-02-17 18:40:02
True
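
When a string is already in ISO format, no format code is needed: `datetime.fromisoformat` parses it directly. The sketch below uses two made-up date strings for illustration.

```python
from datetime import datetime

# ISO strings can be parsed without specifying a format code.
earlier = datetime.fromisoformat("2021-02-16T11:01:11")
later = datetime.fromisoformat("2021-02-17")  # time defaults to midnight

print(earlier < later)    # True

# isoformat() converts a datetime object back to an ISO 8601 string.
print(later.isoformat())  # 2021-02-17T00:00:00
```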


When cleaning data with Pandas, you might find that the dates in your DataFrame are in a non-ISO format. Let's look at how to solve this problem by converting the dates with the Pandas `to_datetime` function.

In [70]:
# Creating our dataframe

data = {"Date" : ["17 February, 2021", "21 September, 2000", "18 August, 1956"],
        "DOB" : ["17/02/2021", "21/09/2000", "18/08/1956"],
        "Flight_Departure" : ["17/02/2021 18:02:01", "10/02/2010 19:01:11", "01/07/1998 11:02:56"]}

df = pd.DataFrame(data)
df



,Date,DOB,Flight_Departure
0,"17 February, 2021",17/02/2021,17/02/2021 18:02:01
1,"21 September, 2000",21/09/2000,10/02/2010 19:01:11
2,"18 August, 1956",18/08/1956,01/07/1998 11:02:56


We can use the `format` argument to specify the format code as before:

In [71]:
# Convert the dates to ISO format

df["Date"] = pd.to_datetime(df["Date"], format="%d %B, %Y")
df["DOB"] = pd.to_datetime(df["DOB"], format="%d/%m/%Y")
df["Flight_Departure"] = pd.to_datetime(df["Flight_Departure"], format="%d/%m/%Y %H:%M:%S")
df

Unnamed: 0,Date,DOB,Flight_Departure
0,2021-02-17,2021-02-17,2021-02-17 18:02:01
1,2000-09-21,2000-09-21,2010-02-10 19:01:11
2,1956-08-18,1956-08-18,1998-07-01 11:02:56
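
Real datasets often contain entries that do not follow the expected format. The sketch below assumes a made-up column, `raw_dates`, containing one bad value. Passing `errors="coerce"` to `pd.to_datetime` replaces unparseable entries with `NaT` (the missing value for dates) instead of raising an error.

```python
import pandas as pd

# Hypothetical column with one entry that does not follow the day/month/year format
raw_dates = pd.Series(["17/02/2021", "21/09/2000", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that could not be parsed
```

Counting the `NaT` values afterwards is a quick way to check how many entries need a closer look.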


## Conditional Selection

What if you wanted to obtain only the rows that meet certain requirements? For example, we want to see how many people have a TotalPay greater than 300000.

As in NumPy, Pandas applies a comparison operator to every element of the column (or DataFrame) and returns a boolean for each one.

In [65]:
# boolean condition like this returns boolean applied to each value in the column (like NumPy)

sf_sal["TotalPay"] > 300000

Id
AA     True
AB     True
AC     True
AD     True
AE     True
      ...  
ZV    False
ZW    False
ZX    False
ZY    False
ZZ    False
Name: TotalPay, Length: 676, dtype: bool

The output of the comparison is called a __mask__.

You can use a mask to filter a DataFrame by passing it inside the square brackets, as if you were indexing. For example, if we assign the mask to a variable and then index the original DataFrame with it, only the rows where the mask is `True` are returned:

In [66]:
# can index DataFrame using this boolean to return only rows where this is true (called boolean masking)

mask = sf_sal["TotalPay"] > 300000

sf_sal_mask = sf_sal[mask]
sf_sal_mask

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,
AF,DAVID SULLIVAN,ASSISTANT DEPUTY CHIEF II,118602.0,8601.0,189082.74,,316285.74,316285.74,2011,,San Francisco,
AG,ALSON LEE,"BATTALION CHIEF, (FIRE DEPARTMENT)",92492.01,89062.9,134426.14,,315981.05,315981.05,2011,,San Francisco,
AH,DAVID KUSHNER,DEPUTY DIRECTOR OF INVESTMENTS,256576.96,0.0,51322.5,,307899.46,307899.46,2011,,San Francisco,
AI,MICHAEL MORRIS,"BATTALION CHIEF, (FIRE DEPARTMENT)",176932.64,86362.68,40132.23,,303427.55,303427.55,2011,,San Francisco,
AJ,JOANNE HAYES-WHITE,"CHIEF OF DEPARTMENT, (FIRE DEPARTMENT)",285262.0,0.0,17115.73,,302377.73,302377.73,2011,,San Francisco,


Observe that every row in the filtered DataFrame has a TotalPay greater than 300000.
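
To count how many rows satisfy a condition, you can sum the mask directly, because `True` counts as 1 and `False` as 0. The sketch below uses a small made-up DataFrame, `toy`, as a stand-in for `sf_sal`.

```python
import pandas as pd

# Small hypothetical stand-in for sf_sal
toy = pd.DataFrame(
    {
        "EmployeeName": ["A", "B", "C", "D"],
        "TotalPay": [567595.43, 335279.91, 180394.07, 302377.73],
    }
)

mask = toy["TotalPay"] > 300000

# Summing a boolean mask counts the rows where the condition is True
print(mask.sum())      # 3
print(len(toy[mask]))  # 3, the same count obtained from the filtered DataFrame
```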

You can also write the mask within the square brackets to save lines of code!

In [67]:
sf_sal[sf_sal["TotalPay"] > 300000]

Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,
AF,DAVID SULLIVAN,ASSISTANT DEPUTY CHIEF II,118602.0,8601.0,189082.74,,316285.74,316285.74,2011,,San Francisco,
AG,ALSON LEE,"BATTALION CHIEF, (FIRE DEPARTMENT)",92492.01,89062.9,134426.14,,315981.05,315981.05,2011,,San Francisco,
AH,DAVID KUSHNER,DEPUTY DIRECTOR OF INVESTMENTS,256576.96,0.0,51322.5,,307899.46,307899.46,2011,,San Francisco,
AI,MICHAEL MORRIS,"BATTALION CHIEF, (FIRE DEPARTMENT)",176932.64,86362.68,40132.23,,303427.55,303427.55,2011,,San Francisco,
AJ,JOANNE HAYES-WHITE,"CHIEF OF DEPARTMENT, (FIRE DEPARTMENT)",285262.0,0.0,17115.73,,302377.73,302377.73,2011,,San Francisco,
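
Masks can also be combined. The sketch below uses a made-up DataFrame, `toy`, rather than `sf_sal`, and shows the element-wise operators `&` (and), `|` (or) and `~` (not). When writing the comparisons inline, wrap each one in parentheses, for example `toy[(toy["TotalPay"] > 300000) & (toy["JobTitle"] == "CAPTAIN")]`.

```python
import pandas as pd

# Small hypothetical DataFrame for illustration
toy = pd.DataFrame(
    {
        "EmployeeName": ["ANN LEE", "BOB RAY", "CAT DOE", "DAN FOX"],
        "JobTitle": ["CAPTAIN", "CAPTAIN", "ANALYST", "ANALYST"],
        "TotalPay": [320000.0, 150000.0, 310000.0, 90000.0],
    }
)

high_pay = toy["TotalPay"] > 300000
is_captain = toy["JobTitle"] == "CAPTAIN"

both = toy[high_pay & is_captain]    # rows where both conditions hold
either = toy[high_pay | is_captain]  # rows where at least one condition holds
not_high = toy[~high_pay]            # rows where the first condition is False

print(list(both["EmployeeName"]))      # ['ANN LEE']
print(list(either["EmployeeName"]))    # ['ANN LEE', 'BOB RAY', 'CAT DOE']
print(list(not_high["EmployeeName"]))  # ['BOB RAY', 'DAN FOX']
```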


### Try it out
- Use what you have just learned about conditional selection to find the following:   

1. A DataFrame containing the Name, Total Pay (with Benefits), Job Title and Base Pay of the employee named 'Albert Pardini'.

2. The highest paid person in terms of Base Pay. (Hint: look up the `.max()` method, for example `sf_sal["BasePay"].max()`.)

## Set and Reset Index

In [105]:
# use .reset_index() to revert to the default numerical index; it returns a new DataFrame unless you specify inplace=True

sf_sal.reset_index()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,Surname,isPolice
0,AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.00,400184.25,,567595.43,567595.43,2011,,San Francisco,,FORD,False
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,JIMENEZ,True
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.60,,335279.91,335279.91,2011,,San Francisco,,PARDINI,True
3,AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.00,56120.71,198306.90,,332343.61,332343.61,2011,,San Francisco,,CHONG,False
4,AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.60,9737.00,182234.59,,326373.19,326373.19,2011,,San Francisco,,GARDNER,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
671,ZV,CHERYL ADAMS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.89,,180394.07,180394.07,2011,,San Francisco,,ADAMS,False
672,ZW,LOUISE SIMPSON,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.80,,180393.98,180393.98,2011,,San Francisco,,SIMPSON,False
673,ZX,BLAKE LOEBS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.19,0.00,3537.75,,180393.94,180393.94,2011,,San Francisco,,LOEBS,False
674,ZY,ELIZABETH AGUILAR-TARCHI,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.17,0.00,3537.11,,180393.28,180393.28,2011,,San Francisco,,AGUILAR-TARCHI,False


In [106]:
sf_sal.reset_index(inplace=True)

In [107]:
sf_sal

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,Surname,isPolice
0,AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.00,400184.25,,567595.43,567595.43,2011,,San Francisco,,FORD,False
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,JIMENEZ,True
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.60,,335279.91,335279.91,2011,,San Francisco,,PARDINI,True
3,AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.00,56120.71,198306.90,,332343.61,332343.61,2011,,San Francisco,,CHONG,False
4,AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.60,9737.00,182234.59,,326373.19,326373.19,2011,,San Francisco,,GARDNER,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
671,ZV,CHERYL ADAMS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.89,,180394.07,180394.07,2011,,San Francisco,,ADAMS,False
672,ZW,LOUISE SIMPSON,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.18,0.00,3537.80,,180393.98,180393.98,2011,,San Francisco,,SIMPSON,False
673,ZX,BLAKE LOEBS,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.19,0.00,3537.75,,180393.94,180393.94,2011,,San Francisco,,LOEBS,False
674,ZY,ELIZABETH AGUILAR-TARCHI,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.17,0.00,3537.11,,180393.28,180393.28,2011,,San Francisco,,AGUILAR-TARCHI,False


In [None]:
# use .set_index() to make a column the index; it returns a new DataFrame unless you specify inplace=True

sf_sal.set_index("Id")
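
Both methods return a new DataFrame by default and leave the original untouched. The sketch below uses a small made-up DataFrame, `toy`, to show a round trip: `set_index` moves a column into the index, and `reset_index` moves it back.

```python
import pandas as pd

# Small hypothetical DataFrame for illustration
toy = pd.DataFrame(
    {"Id": ["AA", "AB", "AC"], "BasePay": [167411.18, 155966.02, 212739.13]}
)

indexed = toy.set_index("Id")        # new DataFrame; toy itself is unchanged
print(list(toy.columns))             # ['Id', 'BasePay']
print(list(indexed.columns))         # ['BasePay']
print(indexed.loc["AB", "BasePay"])  # 155966.02, looked up by the new index label

restored = indexed.reset_index()     # Id becomes an ordinary column again
print(list(restored.columns))        # ['Id', 'BasePay']
```

To keep the result, assign it to a variable as above, or pass `inplace=True` as in the earlier `reset_index` cell.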

## Summary
You should now understand:
- The nature of Pandas Series and DataFrames.
- Conditional selection and boolean masking.
- The difference between None and np.nan.

You should now know how to manipulate Pandas DataFrames, including:
- How to index and slice to return DataFrames and Series.
- How to use .loc[] and .iloc[].
- How to make new columns from existing columns.
- How to drop rows and columns with .drop().
- How to use .set_index() and .reset_index().

## Further reading
- More details on pandas operations are available in pandas documentation: https://pandas.pydata.org/docs/