# Intro to Pandas

`pip install pandas`

- Pandas (the name derives from "panel data") is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.
- It is used for data manipulation, analysis, and visualization.
- It is widely used in data science (DS) and machine learning (ML) applications for handling **structured** data.
- It has 2 main components:
    - Series (1-dimensional)
    - DataFrames (2-dimensional)

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Purpose_of_Pandas.png)

### Features of Pandas
![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Features_of_Pandas.png)




The two main data structures in Pandas are:
![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Data_Structures.png)

In [252]:
import pandas as pd

In [253]:
pd.__version__

'2.0.2'

## Pandas Series

In [254]:
my_list = [1,2,3,5,6,7,8,9,3,34,5]

type(my_list)

list

In [255]:
my_list

[1, 2, 3, 5, 6, 7, 8, 9, 3, 34, 5]

In [256]:
my_ser = pd.Series(my_list)
my_ser

0      1
1      2
2      3
3      5
4      6
5      7
6      8
7      9
8      3
9     34
10     5
dtype: int64

In [257]:
my_ser[5]

7

In [258]:
my_ser[3:9]

3    5
4    6
5    7
6    8
7    9
8    3
dtype: int64

In [259]:
my_ser[3:9].reset_index(drop=True)

0    5
1    6
2    7
3    8
4    9
5    3
dtype: int64

In [260]:
my_ser2 = pd.Series([11,12,13], index=['x','y','z'], name='mycol')
my_ser2

x    11
y    12
z    13
Name: mycol, dtype: int64

In [261]:
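my_ser2.iloc[2]  # explicit positional access; my_ser2[2] on a label index is deprecated in newer pandas versions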

13

In [262]:
my_ser2['y':]

y    12
z    13
Name: mycol, dtype: int64
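
The two slices above behave differently at the end point. As a minimal sketch (rebuilding the same three-element Series under the name `s`, which is only an illustrative name), label-based slices include the end label, while position-based slices exclude the end position, like a Python list:

```python
import pandas as pd

s = pd.Series([11, 12, 13], index=['x', 'y', 'z'], name='mycol')

# Label-based slice: the end label 'y' is included
by_label = s.loc['x':'y']

# Position-based slice: the end position 1 is excluded
by_position = s.iloc[0:1]

print(by_label.tolist())     # [11, 12]
print(by_position.tolist())  # [11]
print(s.iloc[2])             # 13
```

Using `.loc[]` and `.iloc[]` explicitly makes it clear to the reader which of the two rules applies.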

In [263]:
import numpy as np
my_ser3 = pd.Series(np.arange(2,30))
my_ser3

0      2
1      3
2      4
3      5
4      6
5      7
6      8
7      9
8     10
9     11
10    12
11    13
12    14
13    15
14    16
15    17
16    18
17    19
18    20
19    21
20    22
21    23
22    24
23    25
24    26
25    27
26    28
27    29
dtype: int64

In [264]:
# convert to numpy array
my_ser3.values

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [265]:
my_ser3.index.values

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27])

In [266]:
my_ser3.where(my_ser3 % 2 == 0).dropna()

0      2.0
2      4.0
4      6.0
6      8.0
8     10.0
10    12.0
12    14.0
14    16.0
16    18.0
18    20.0
20    22.0
22    24.0
24    26.0
26    28.0
dtype: float64
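
Note that the output above is `float64` even though the input was integer: `where()` fills non-matching rows with `NaN`, which forces a float dtype before `dropna()` removes them. A possible alternative, sketched here with the illustrative name `s`, is to select with a boolean mask, which keeps the integer dtype:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(2, 30))

# where() keeps the original shape and inserts NaN for non-matching rows,
# so the values become floats before dropna() removes the NaNs
via_where = s.where(s % 2 == 0).dropna()

# A boolean mask selects the matching rows directly; no NaN is introduced
via_mask = s[s % 2 == 0]

print(via_where.dtype)                              # float64
print(pd.api.types.is_integer_dtype(via_mask))      # True
print(via_mask.tolist() == via_where.astype(int).tolist())  # True
```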

## Pandas Dataframe

**Anatomy of a DataFrame**
 
![df](https://static.packt-cdn.com/products/9781839213106/graphics/Images/B15597_01_01.png)

In [267]:
data = {
    'name': ['Mark', 'Mike', 'Tammy', 'Becky'],
    'age': [55,43,27,35],
    'score':[99,78,83,87]
}


In [268]:
# convert dict into a dataframe
df = pd.DataFrame(data)
df

|   | name | age | score |
|---|---|---|---|
| 0 | Mark | 55 | 99 |
| 1 | Mike | 43 | 78 |
| 2 | Tammy | 27 | 83 |
| 3 | Becky | 35 | 87 |


In [269]:
print(df) # print removes formatting

    name  age  score
0   Mark   55     99
1   Mike   43     78
2  Tammy   27     83
3  Becky   35     87


In [270]:
type(df['name'])

pandas.core.series.Series

In [271]:
type(df)

pandas.core.frame.DataFrame

### Slicing Data Using `loc[]` and `iloc[]`

#### Using `loc[]`

In [272]:
# get the first row of the dataframe
df.loc[0]

name     Mark
age        55
score      99
Name: 0, dtype: object

In [273]:
# specific list of rows - select row 0 and 2
df.loc[[0,2]]

|   | name | age | score |
|---|---|---|---|
| 0 | Mark | 55 | 99 |
| 2 | Tammy | 27 | 83 |


In [274]:
df.loc[[0,2]].ndim

2

In [275]:
df.ndim

2

In [276]:
df.shape

(4, 3)

In [277]:
df.dtypes

name     object
age       int64
score     int64
dtype: object

In [278]:
#get the num of rows or length of the dataframe
len(df)

4

In [279]:
# using : selects the full range
# loc[row range, column names]
df.loc[:,['age']] # the list ['age'] returns a DataFrame; df.loc[:, 'age'] (no inner brackets) would return a Series

|   | age |
|---|---|
| 0 | 55 |
| 1 | 43 |
| 2 | 27 |
| 3 | 35 |
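
To make the Series-versus-DataFrame distinction concrete, here is a small sketch that rebuilds the same four-row DataFrame and compares the two selections. The variable names `as_series` and `as_frame` are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Mike', 'Tammy', 'Becky'],
    'age': [55, 43, 27, 35],
    'score': [99, 78, 83, 87],
})

as_series = df.loc[:, 'age']    # single label -> Series (1-dimensional)
as_frame = df.loc[:, ['age']]   # list of labels -> DataFrame (2-dimensional)

print(type(as_series).__name__, as_series.shape)  # Series (4,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (4, 1)
```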


In [280]:
df.loc[:,['age', 'score']]

|   | age | score |
|---|---|---|
| 0 | 55 | 99 |
| 1 | 43 | 78 |
| 2 | 27 | 83 |
| 3 | 35 | 87 |


Using `loc[]` as a filter

In [281]:
df.loc[df['age']>40]

|   | name | age | score |
|---|---|---|---|
| 0 | Mark | 55 | 99 |
| 1 | Mike | 43 | 78 |


Similar to SQL:
```SQL
SELECT *
FROM DF
WHERE age > 40
```
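
The SQL analogy extends to multiple conditions and a column list. The following is a sketch, assuming the same four-row DataFrame, of what roughly corresponds to `SELECT name, score FROM df WHERE age > 30 AND score > 80`. The thresholds are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Mike', 'Tammy', 'Becky'],
    'age': [55, 43, 27, 35],
    'score': [99, 78, 83, 87],
})

# Wrap each condition in parentheses; use & for AND and | for OR
mask = (df['age'] > 30) & (df['score'] > 80)

# First argument filters rows, second argument selects columns
result = df.loc[mask, ['name', 'score']]
print(result['name'].tolist())  # ['Mark', 'Becky']
```

The parentheses matter because `&` binds more tightly than `>` in Python.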

#### Using `iloc[]`

In [282]:
# select the first 2 rows with first 2 cols
# iloc[row range, col range]
df.iloc[:2, :2]

|   | name | age |
|---|---|---|
| 0 | Mark | 55 |
| 1 | Mike | 43 |


In [283]:
df.iloc[-1,-1]

87

In [284]:
# select all rows and last 2 columns
df.iloc[:,1:]

|   | age | score |
|---|---|---|
| 0 | 55 | 99 |
| 1 | 43 | 78 |
| 2 | 27 | 83 |
| 3 | 35 | 87 |


In [285]:
df.iloc[0,0]

'Mark'

#### `iat[]`

Similar to `iloc[]`, but it accesses a single data element only.

In [286]:
df.iat[2,2]

83
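
Pandas also provides `at[]`, the label-based counterpart of `iat[]`. As a brief sketch on the same four-row DataFrame, both calls below reach the same cell, one by position and one by row label and column name:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Mike', 'Tammy', 'Becky'],
    'age': [55, 43, 27, 35],
    'score': [99, 78, 83, 87],
})

by_position = df.iat[2, 2]     # row position 2, column position 2
by_label = df.at[2, 'score']   # row label 2, column label 'score'

print(by_position, by_label)   # 83 83
```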

**Summary**
- Use `loc[]` when:
    - you need to access the data by label
    - the column positions might change, and you want to be sure you are accessing the correct column by name
    - the column names are stable
    - you want your code to be more readable

- Use `iloc[]` when:
    - you need to access the data by position
    - the structure of the DataFrame and the order of its columns are stable
    - speed matters at the margin: `iloc[]` can be slightly faster than `loc[]` because it indexes by position rather than looking up labels, though the difference is usually small

`loc[]` vs `iloc[]`
- `iloc[]` is useful when you don't necessarily know the column names in advance
- `loc[]` is more advantageous when you can't rely on the columns being in a particular order
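
The difference between label and position only becomes visible once the index no longer matches the row order. A minimal sketch, assuming a two-column version of the DataFrame and an illustrative variable `by_age`:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Mike', 'Tammy', 'Becky'],
    'age': [55, 43, 27, 35],
})

# Sorting reorders the rows, but each row keeps its original index label
by_age = df.sort_values('age')
print(by_age.index.tolist())    # [2, 3, 1, 0]

# iloc: first row by position (the youngest person after sorting)
print(by_age.iloc[0, 0])        # Tammy

# loc: the row whose label is 0, wherever it now sits
print(by_age.loc[0, 'name'])    # Mark
```

Calling `reset_index(drop=True)` after sorting would make labels and positions coincide again.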


#### Creating a DataFrame from Lists

In [287]:
data = [
    ['Mark', 'Mike', 'Tammy', 'Becky'],
    [55,43,27,35],
    [99,78,83,87]
]
# lists bundled inside a list
type(data)

list

In [288]:
# convert the original data to a DataFrame
df = pd.DataFrame(data)
df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Mark | Mike | Tammy | Becky |
| 1 | 55 | 43 | 27 | 35 |
| 2 | 99 | 78 | 83 | 87 |


By default, we get a numeric index and numeric column labels, and each inner list becomes a row. We can transpose the DataFrame and assign column names for readability.

In [289]:
#transpose
df = df.T

In [290]:
#rename columns
df.columns = ['Names', 'Age', 'Score']
df

|   | Names | Age | Score |
|---|---|---|---|
| 0 | Mark | 55 | 99 |
| 1 | Mike | 43 | 78 |
| 2 | Tammy | 27 | 83 |
| 3 | Becky | 35 | 87 |


In [291]:
pd.DataFrame(data, index=['Name', 'Age', 'Score']).T

|   | Name | Age | Score |
|---|---|---|---|
| 0 | Mark | 55 | 99 |
| 1 | Mike | 43 | 78 |
| 2 | Tammy | 27 | 83 |
| 3 | Becky | 35 | 87 |
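
One side effect of the transpose approach is that every column ends up with the generic `object` dtype, because each original column mixed strings and integers. A possible alternative, sketched below, is to pair the lists up with Python's built-in `zip()` so that each column is built from values of one type. The names `from_zip` and `transposed` are illustrative:

```python
import pandas as pd

data = [
    ['Mark', 'Mike', 'Tammy', 'Becky'],
    [55, 43, 27, 35],
    [99, 78, 83, 87],
]

# Transpose approach from above: columns come out as object dtype
transposed = pd.DataFrame(data, index=['Name', 'Age', 'Score']).T

# zip(*data) yields one tuple per person: ('Mark', 55, 99), ...
from_zip = pd.DataFrame(list(zip(*data)), columns=['Name', 'Age', 'Score'])

print(transposed['Age'].dtype)                        # object
print(pd.api.types.is_integer_dtype(from_zip['Age'])) # True
```

With integer columns, numeric operations such as `from_zip['Age'].mean()` work without an extra type conversion.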


## Pandas Iteration Methods

In [292]:
data = {
    'name': ['Mark', 'Mike', 'Tammy', 'Becky', 'John'],
    'age': [55,43,27,35, 46],
    'score':[99,78,83,87, 79],
    'city': ['New York', 'Nashville', 'San Diego', 'Atlanta', 'Boston']
}

df = pd.DataFrame(data)
df

|   | name | age | score | city |
|---|---|---|---|---|
| 0 | Mark | 55 | 99 | New York |
| 1 | Mike | 43 | 78 | Nashville |
| 2 | Tammy | 27 | 83 | San Diego |
| 3 | Becky | 35 | 87 | Atlanta |
| 4 | John | 46 | 79 | Boston |


#### Method 1 - using `df.index`

In [293]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [294]:
print(df.loc[0,'name'], 'is', df.loc[0,'age'], 'years old')

Mark is 55 years old


In [295]:
#using df.index

for i in df.index:
    print(df.loc[i,'name'], 'is', df.loc[i,'age'], 'years old')

Mark is 55 years old
Mike is 43 years old
Tammy is 27 years old
Becky is 35 years old
John is 46 years old


#### Method 2 - Using `len(df)`

In [296]:
for i in range(len(df)):
    print(df.loc[i,'name'], 'is', df.loc[i,'age'], 'years old')

Mark is 55 years old
Mike is 43 years old
Tammy is 27 years old
Becky is 35 years old
John is 46 years old


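This could be useful if you want to iterate over a custom range of rows. Note that `df.loc[i, ...]` works here only because the index is the default `RangeIndex`, so labels and positions coincide.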

In [297]:
for i in range(2,len(df)):
    print(df.loc[i,'name'], 'is', df.loc[i,'age'], 'years old')

Tammy is 27 years old
Becky is 35 years old
John is 46 years old


In [298]:
for i in range(2,len(df)):
    new_age = df.loc[i,'age'] + 10
    print(df.loc[i,'name'], 'will be',new_age , 'in 10 years')

Tammy will be 37 in 10 years
Becky will be 45 in 10 years
John will be 56 in 10 years


#### Method 3 - Using `iterrows()`

In [299]:
for i, row in df.iterrows():
    print(f"Person {i}: {row['name']} - location: {row['city']}")

Person 0: Mark - location: New York
Person 1: Mike - location: Nashville
Person 2: Tammy - location: San Diego
Person 3: Becky - location: Atlanta
Person 4: John - location: Boston
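
A related option is `itertuples()`, which yields one named tuple per row instead of a Series and is generally lighter than `iterrows()`. The sketch below rebuilds the relevant columns and collects the same messages in a list called `lines` (an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Mike', 'Tammy', 'Becky', 'John'],
    'city': ['New York', 'Nashville', 'San Diego', 'Atlanta', 'Boston'],
})

lines = []
for row in df.itertuples():
    # row.Index is the index label; columns are available as attributes
    lines.append(f"Person {row.Index}: {row.name} - location: {row.city}")

print(lines[0])  # Person 0: Mark - location: New York
```

Attribute access only works for column names that are valid Python identifiers, which is an assumption that holds for this DataFrame.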


Update the dataframe values

In [301]:
df['age'] = df['age'] + 5
df

|   | name | age | score | city |
|---|---|---|---|---|
| 0 | Mark | 60 | 99 | New York |
| 1 | Mike | 48 | 78 | Nashville |
| 2 | Tammy | 32 | 83 | San Diego |
| 3 | Becky | 40 | 87 | Atlanta |
| 4 | John | 51 | 79 | Boston |


In [302]:
df['age'] = 50  # assigning a scalar overwrites every value in the column with 50

> Note: when modifying a dataframe, it's recommended to make a copy of the original so you can refer back to it.

In [303]:
df_org = df.copy()
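
The reason `copy()` matters is that plain assignment does not duplicate the data; it only creates a second name for the same object. A minimal sketch with a small made-up DataFrame and the illustrative names `alias` and `backup`:

```python
import pandas as pd

df = pd.DataFrame({'age': [55, 43, 27]})

alias = df          # same object under a second name
backup = df.copy()  # independent copy of the data

df['age'] = 50      # modify the original

print(alias is df)               # True
print(alias['age'].tolist())     # [50, 50, 50]
print(backup['age'].tolist())    # [55, 43, 27]
```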

### Using `apply()` and `map()` Functions

In [304]:
df['age'] = df.apply(lambda row: row['age'] - 5, axis=1)
df

|   | name | age | score | city |
|---|---|---|---|---|
| 0 | Mark | 45 | 99 | New York |
| 1 | Mike | 45 | 78 | Nashville |
| 2 | Tammy | 45 | 83 | San Diego |
| 3 | Becky | 45 | 87 | Atlanta |
| 4 | John | 45 | 79 | Boston |


`apply()` is useful when you need to run a more complex function over a column.

In [305]:
df['discount'] = df['age'].apply(lambda x: 'yes' if x >50 else 'no')
df

|   | name | age | score | city | discount |
|---|---|---|---|---|---|
| 0 | Mark | 45 | 99 | New York | no |
| 1 | Mike | 45 | 78 | Nashville | no |
| 2 | Tammy | 45 | 83 | San Diego | no |
| 3 | Becky | 45 | 87 | Atlanta | no |
| 4 | John | 45 | 79 | Boston | no |


**Exercise** Build a category for ages: over 50: Senior, over 40: Mid-Age, over 20: Junior.

In [306]:
# build a function for it

def age_categorization(row):
    if row['age'] > 50:
        return 'Senior'
    elif row['age'] > 40:
        return 'Mid-Age'
    else:
        return 'Junior'


In [307]:
df['age_category'] = df.apply(age_categorization, axis=1)
df

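|   | name | age | score | city | discount | age_category |
|---|---|---|---|---|---|---|
| 0 | Mark | 45 | 99 | New York | no | Mid-Age |
| 1 | Mike | 45 | 78 | Nashville | no | Mid-Age |
| 2 | Tammy | 45 | 83 | San Diego | no | Mid-Age |
| 3 | Becky | 45 | 87 | Atlanta | no | Mid-Age |
| 4 | John | 45 | 79 | Boston | no | Mid-Age |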


In [308]:
#drop a column
df.drop(columns='age_category', inplace=True)
df

|   | name | age | score | city | discount |
|---|---|---|---|---|---|
| 0 | Mark | 45 | 99 | New York | no |
| 1 | Mike | 45 | 78 | Nashville | no |
| 2 | Tammy | 45 | 83 | San Diego | no |
| 3 | Becky | 45 | 87 | Atlanta | no |
| 4 | John | 45 | 79 | Boston | no |


In [309]:
df.insert(2, 'age_category', df.apply(age_categorization, axis=1))
df

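|   | name | age | age_category | score | city | discount |
|---|---|---|---|---|---|---|
| 0 | Mark | 45 | Mid-Age | 99 | New York | no |
| 1 | Mike | 45 | Mid-Age | 78 | Nashville | no |
| 2 | Tammy | 45 | Mid-Age | 83 | San Diego | no |
| 3 | Becky | 45 | Mid-Age | 87 | Atlanta | no |
| 4 | John | 45 | Mid-Age | 79 | Boston | no |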


In [310]:
data = {'measure1':[4,5,7,8],
        'measure2':[8,4,6,2]}

df_measure = pd.DataFrame(data)
df_measure

|   | measure1 | measure2 |
|---|---|---|
| 0 | 4 | 8 |
| 1 | 5 | 4 |
| 2 | 7 | 6 |
| 3 | 8 | 2 |


The current values are in meters. We want to convert them to centimeters.

In [311]:
df_measure_cm = df_measure.apply(lambda x : x * 100, axis=0)
df_measure_cm

|   | measure1 | measure2 |
|---|---|---|
| 0 | 400 | 800 |
| 1 | 500 | 400 |
| 2 | 700 | 600 |
| 3 | 800 | 200 |
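
Because multiplication is already defined element-wise on a DataFrame, the same conversion can be written without `apply()`. A short sketch, which also uses `add_suffix()` to label the unit in the column names:

```python
import pandas as pd

df_measure = pd.DataFrame({'measure1': [4, 5, 7, 8],
                           'measure2': [8, 4, 6, 2]})

# Multiply every cell by 100 (meters -> centimeters),
# then append '_cm' to each column name
df_measure_cm = (df_measure * 100).add_suffix('_cm')

print(df_measure_cm.columns.tolist())        # ['measure1_cm', 'measure2_cm']
print(df_measure_cm['measure1_cm'].tolist()) # [400, 500, 700, 800]
```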


In [312]:
df_measure_cm.rename(columns={'measure1': 'measure1_cm',
                   'measure2': 'measure2_cm'}, inplace=True)  # inplace=True modifies df_measure_cm directly; without it, rename() returns a new DataFrame

df_measure_cm

|   | measure1_cm | measure2_cm |
|---|---|---|
| 0 | 400 | 800 |
| 1 | 500 | 400 |
| 2 | 700 | 600 |
| 3 | 800 | 200 |


### Using `map()`

In [313]:
df

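|   | name | age | age_category | score | city | discount |
|---|---|---|---|---|---|---|
| 0 | Mark | 45 | Mid-Age | 99 | New York | no |
| 1 | Mike | 45 | Mid-Age | 78 | Nashville | no |
| 2 | Tammy | 45 | Mid-Age | 83 | San Diego | no |
| 3 | Becky | 45 | Mid-Age | 87 | Atlanta | no |
| 4 | John | 45 | Mid-Age | 79 | Boston | no |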


**Exercise** Create a column that gives us the region of the city

In SQL:
```sql
CASE WHEN city = 'New York' THEN 'North'
     WHEN city = 'Nashville' THEN 'South'
     WHEN city = 'San Diego' THEN 'West'
     WHEN city = 'Atlanta' THEN 'South'
     WHEN city = 'Boston' THEN 'North'

END AS region
```

In [314]:
df['region'] = df['city'].map({
                            'New York' : 'North',
                            'Nashville' : 'South',
                            'San Diego' : 'West',
                            'Atlanta' : 'South',
                            'Boston' : 'North'
                            })

df

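|   | name | age | age_category | score | city | discount | region |
|---|---|---|---|---|---|---|---|
| 0 | Mark | 45 | Mid-Age | 99 | New York | no | North |
| 1 | Mike | 45 | Mid-Age | 78 | Nashville | no | South |
| 2 | Tammy | 45 | Mid-Age | 83 | San Diego | no | West |
| 3 | Becky | 45 | Mid-Age | 87 | Atlanta | no | South |
| 4 | John | 45 | Mid-Age | 79 | Boston | no | North |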


In [315]:
df_2 = df.copy()

In [316]:
df_2['city'] = df_2['city'].map({
                            'New York' : 'New Jersey',
                            'Nashville' : 'Knoxville',
                            'San Diego' : 'San Diego'
                            })

df_2

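|   | name | age | age_category | score | city | discount | region |
|---|---|---|---|---|---|---|---|
| 0 | Mark | 45 | Mid-Age | 99 | New Jersey | no | North |
| 1 | Mike | 45 | Mid-Age | 78 | Knoxville | no | South |
| 2 | Tammy | 45 | Mid-Age | 83 | San Diego | no | West |
| 3 | Becky | 45 | Mid-Age | 87 | NaN | no | South |
| 4 | John | 45 | Mid-Age | 79 | NaN | no | North |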
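
As the output shows, `map()` with a dictionary turns every value that is missing from the dictionary into `NaN` (here Atlanta and Boston). Two possible ways to keep unmapped values, sketched with a standalone Series and an illustrative dictionary named `renames`:

```python
import pandas as pd

city = pd.Series(['New York', 'Nashville', 'San Diego', 'Atlanta', 'Boston'])
renames = {'New York': 'New Jersey', 'Nashville': 'Knoxville'}

# map(): values without a dictionary entry become NaN
mapped = city.map(renames)
print(mapped.isna().tolist())   # [False, False, True, True, True]

# Option 1: fall back to the original value where the mapping produced NaN
with_fallback = city.map(renames).fillna(city)

# Option 2: replace() only touches values that appear in the dictionary
replaced = city.replace(renames)

print(with_fallback.tolist())
# ['New Jersey', 'Knoxville', 'San Diego', 'Atlanta', 'Boston']
print(replaced.tolist() == with_fallback.tolist())  # True
```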


In [317]:
df_2 = df.copy()

In [318]:
df_2['city'] = df_2['city'].str.replace('New York', 'New Jersey')
df_2

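|   | name | age | age_category | score | city | discount | region |
|---|---|---|---|---|---|---|---|
| 0 | Mark | 45 | Mid-Age | 99 | New Jersey | no | North |
| 1 | Mike | 45 | Mid-Age | 78 | Nashville | no | South |
| 2 | Tammy | 45 | Mid-Age | 83 | San Diego | no | West |
| 3 | Becky | 45 | Mid-Age | 87 | Atlanta | no | South |
| 4 | John | 45 | Mid-Age | 79 | Boston | no | North |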


### Sorting Data

In [319]:
df.sort_values(by='age', inplace=True, ignore_index=True)
df

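|   | name | age | age_category | score | city | discount | region |
|---|---|---|---|---|---|---|---|
| 0 | Mark | 45 | Mid-Age | 99 | New York | no | North |
| 1 | Mike | 45 | Mid-Age | 78 | Nashville | no | South |
| 2 | Tammy | 45 | Mid-Age | 83 | San Diego | no | West |
| 3 | Becky | 45 | Mid-Age | 87 | Atlanta | no | South |
| 4 | John | 45 | Mid-Age | 79 | Boston | no | North |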


In [320]:
df.sort_values(by=['age', 'score'], inplace=True)
df

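|   | name | age | age_category | score | city | discount | region |
|---|---|---|---|---|---|---|---|
| 1 | Mike | 45 | Mid-Age | 78 | Nashville | no | South |
| 4 | John | 45 | Mid-Age | 79 | Boston | no | North |
| 2 | Tammy | 45 | Mid-Age | 83 | San Diego | no | West |
| 3 | Becky | 45 | Mid-Age | 87 | Atlanta | no | South |
| 0 | Mark | 45 | Mid-Age | 99 | New York | no | North |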


In [321]:
# descending sort
df.sort_values(by=['age', 'score'], ascending=False, inplace=True)
df

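|   | name | age | age_category | score | city | discount | region |
|---|---|---|---|---|---|---|---|
| 0 | Mark | 45 | Mid-Age | 99 | New York | no | North |
| 3 | Becky | 45 | Mid-Age | 87 | Atlanta | no | South |
| 2 | Tammy | 45 | Mid-Age | 83 | San Diego | no | West |
| 4 | John | 45 | Mid-Age | 79 | Boston | no | North |
| 1 | Mike | 45 | Mid-Age | 78 | Nashville | no | South |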


In [322]:
# custom sort: age descending, score ascending
# column names are case-sensitive, so use 'age' and 'score' (not 'Age' and 'Score')
df2 = df.sort_values(by=['age', 'score'], ascending=[False, True]).reset_index(drop=True)
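
Since every age in the DataFrame above is 45, a mixed-direction sort is easier to see on data with ties and differences. The sketch below uses hypothetical ages (not the tutorial's values) and the lowercase column names that actually exist in the DataFrame:

```python
import pandas as pd

# Hypothetical data: two people aged 55 and two aged 43 to create ties
df = pd.DataFrame({
    'name': ['Mark', 'Mike', 'Tammy', 'Becky', 'John'],
    'age': [55, 43, 27, 43, 55],
    'score': [99, 78, 83, 87, 79],
})

# age descending first; within equal ages, score ascending
df_sorted = (
    df.sort_values(by=['age', 'score'], ascending=[False, True])
      .reset_index(drop=True)
)

print(df_sorted['name'].tolist())  # ['John', 'Mark', 'Mike', 'Becky', 'Tammy']
print(df['name'].tolist())         # original order is unchanged without inplace=True
```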

#### Looking for Nulls

In [None]:
data = {
    'name': ['Mark', 'Mike', 'Tammy', 'Becky', 'John'],
    'age': [55,43,np.nan,35, np.nan], #make sure to import numpy for nulls
    'score':[99,78,np.nan,87, 79],
    'city': ['New York', 'Nashville', 'San Diego', 'Atlanta', 'Boston']
}

df = pd.DataFrame(data)
df

|   | name | age | score | city |
|---|---|---|---|---|
| 0 | Mark | 55.0 | 99.0 | New York |
| 1 | Mike | 43.0 | 78.0 | Nashville |
| 2 | Tammy | NaN | NaN | San Diego |
| 3 | Becky | 35.0 | 87.0 | Atlanta |
| 4 | John | NaN | 79.0 | Boston |


In [None]:
# to get a report of number of nulls 
df.isna().sum()

name     0
age      2
score    1
city     0
dtype: int64

The results above give the list of columns and their null/missing-value counts

In [251]:
# to get the rows with nulls
df[df.isna().any(axis=1)]

|   | name | age | score | city |
|---|---|---|---|---|
| 2 | Tammy | NaN | NaN | San Diego |
| 4 | John | NaN | 79.0 | Boston |


> Addressing nulls is a very important step in DS and ML. Without it, the results may be unreliable or the ML model may not run properly.
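
Two common ways to address nulls are dropping the affected rows and filling the gaps. The sketch below shows both on the DataFrame from this section; filling with the column mean is an assumption chosen for illustration, and the right strategy depends on the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Mike', 'Tammy', 'Becky', 'John'],
    'age': [55, 43, np.nan, 35, np.nan],
    'score': [99, 78, np.nan, 87, 79],
    'city': ['New York', 'Nashville', 'San Diego', 'Atlanta', 'Boston'],
})

# Option 1: drop every row that contains at least one null
dropped = df.dropna()
print(dropped.index.tolist())   # [0, 1, 3]

# Option 2: fill nulls per column, here with each column's mean
filled = df.fillna({'age': df['age'].mean(), 'score': df['score'].mean()})
print(filled.isna().sum().sum())  # 0
print(filled.loc[2, 'score'])     # 85.75
```

Both calls return new DataFrames, so the original `df` still contains its nulls.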