# Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1.0 Get to know Series and DataFrame

## 1.1 What is pandas.Series?
<b><font color="orange" size=5>★</font> New Function:</b> pandas.Series()

A pandas.Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

Here's a simple example to illustrate a Series:

In [2]:
data = [1, 3, 5, 7, 9]
series = pd.Series(data)

series

0    1
1    3
2    5
3    7
4    9
dtype: int64

In this output, the left column (0, 1, ..., 4) represents <b><font color="#AA0000">the indices</font></b>, and the right column (1, 3, ..., 9) represents <b><font color="#AA0000">the values</font></b>.
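
The index does not have to be 0, 1, 2, and so on. As a small sketch (the labels 'a' to 'e' below are made up for illustration), we can pass our own labels through the `index` argument and then look values up by label:

```python
import pandas as pd

# Hypothetical labels 'a'..'e' replace the default 0..4 index.
labeled = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])

print(labeled.index.tolist())   # the indices
print(labeled.values.tolist())  # the values
print(labeled['c'])             # look up one value by its label
```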

## 1.2 What is pandas.DataFrame?
<b><font color="orange" size=5>★</font> New Function:</b> pandas.DataFrame()

A pandas.DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It's basically a table with rows and columns. Columns can be of different types, and it's the most commonly used pandas object.

Sometimes a DataFrame may look like a Series when it has only 1 column.

Here's a basic example of a DataFrame:

In [3]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [24, 27, 22],
        'City': ['New York', 'Boston', 'Los Angeles']}
df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,City
0,Alice,24,New York
1,Bob,27,Boston
2,Charlie,22,Los Angeles


In this DataFrame, <b>Name</b>, <b>Age</b>, and <b>City</b> are <b><font color="#AA0000">the column headers</font></b>, and the rows are indexed with numbers starting from 0.

Both Series and DataFrame are central to data analysis tasks using Pandas. They provide a vast array of functions and methods to efficiently work with structured data.

## 1.3 Extract a subset from DataFrame

### Preparation - Create a DataFrame

In [4]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [24, 27, 22],
        'City': ['New York', 'Boston', 'Los Angeles']}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City
0,Alice,24,New York
1,Bob,27,Boston
2,Charlie,22,Los Angeles


### 1.3.1 Extract ONE column from a DataFrame as a Series
We can use the column name as the index to extract a column, as a Series object, from a DataFrame object.

In [5]:
series = df['City']
series

0       New York
1         Boston
2    Los Angeles
Name: City, dtype: object

### 1.3.2 Extract multiple columns from a DataFrame as a DataFrame
We can use a list that contains column names as the index to extract a subset from a DataFrame object.<br>
The subset will have the same number of rows as the source DataFrame.

In [6]:
sub_df = df[['Name', 'Age']]
sub_df

Unnamed: 0,Name,Age
0,Alice,24
1,Bob,27
2,Charlie,22


### 1.3.3 Extract ONE column from a DataFrame as a DataFrame
When the list contains only one column name, it will extract a subset with only 1 column.<br>
It will look very much like a Series object.<br>
Pay extra attention to differentiate them.

In [7]:
sub_df = df[['City']]
sub_df

Unnamed: 0,City
0,New York
1,Boston
2,Los Angeles
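
One way to tell the two apart is to check the type and the shape. The sketch below rebuilds a small DataFrame (same names and cities as above) so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'City': ['New York', 'Boston', 'Los Angeles']})

as_series = df['City']    # single brackets -> Series
as_frame = df[['City']]   # double brackets -> one-column DataFrame

# A Series has a 1-value shape, a DataFrame has a 2-value shape.
print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```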


### 1.3.4 Drop ONE column from a DataFrame
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.drop()

We can use DataFrame.drop() function and specify the column name to drop specific columns.

The input can be a string to denote the column name.<br>
We also need to set axis=1 (if axis=0, it will be dropping rows instead).

In [8]:
df_dropped = df.drop('City', axis=1)
df_dropped

Unnamed: 0,Name,Age
0,Alice,24
1,Bob,27
2,Charlie,22
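
As a side note, DataFrame.drop() also accepts a `columns` keyword, which some readers find easier to remember than axis=1. The sketch below compares both spellings and also shows that the source DataFrame is not modified:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [24, 27, 22],
                   'City': ['New York', 'Boston', 'Los Angeles']})

dropped_axis = df.drop('City', axis=1)   # the form used above
dropped_kw = df.drop(columns='City')     # equivalent keyword form

print(dropped_axis.equals(dropped_kw))   # both give the same result
print(list(df.columns))                  # df still has all its columns
```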


### 1.3.5 Drop multiple columns from a DataFrame
The input can also be a list that contains multiple column names.

In [9]:
df_dropped = df.drop(['Age', 'City'], axis=1)
df_dropped

Unnamed: 0,Name
0,Alice
1,Bob
2,Charlie


### 1.3.6 Extract rows from a DataFrame
We can use a range index (start:stop) to extract rows from a DataFrame. It works like slicing a list: the start position is included and the stop position is excluded.

In [10]:
sub_df = df[1:3]
sub_df

Unnamed: 0,Name,Age,City
1,Bob,27,Boston
2,Charlie,22,Los Angeles


We can't just use a single value as the index to slice a DataFrame.<br>
When there is no ":" in the index, a single value will be interpreted as a column name, and pandas will try to extract a column (raising a KeyError if no such column exists).<br>

We still need to set a range index even if we're just getting 1 row.

In [11]:
sub_df = df[1:2]
sub_df

Unnamed: 0,Name,Age,City
1,Bob,27,Boston
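
To make the difference concrete, here is a small sketch (it rebuilds the same DataFrame so that it runs on its own). A single value is treated as a column name and fails, while a range index returns a one-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [24, 27, 22],
                   'City': ['New York', 'Boston', 'Los Angeles']})

try:
    df[1]                  # looks for a COLUMN named 1
    outcome = 'no error'
except KeyError:
    outcome = 'KeyError'   # there is no such column
print(outcome)

one_row = df[1:2]          # a range index selects rows
print(one_row.shape)
```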


### 1.3.7 Extract columns and rows at the same time

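We can use DataFrame.loc[..., ...] to get a subset.<br>
We need to set 2 indices, separated by a comma.<br>

The 1st index selects rows by their index labels. Unlike a list slice, a label slice includes both the start and the stop label.<br>
The 2nd index is the list of column names to slice columns.<br>
In the example here, 1:3 returns only rows 1 and 2 because the DataFrame has no row labeled 3.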

In [12]:
sub_df = df.loc[1:3, ['Age', 'City']]
sub_df

Unnamed: 0,Age,City
1,27,Boston
2,22,Los Angeles


We can also use DataFrame.iloc[..., ...] to get a subset.<br>
We need to set 2 indices, separated by a comma.<br>

The 1st index is the range index to slice rows by position.<br>
The 2nd index is the range index to slice columns by position.<br>

The difference is that iloc uses integer positions, instead of labels and column names, for both rows and columns, and the stop position of a slice is excluded, as in list slicing.

In [13]:
sub_df = df.iloc[1:3, 1:3]
sub_df

Unnamed: 0,Age,City
1,27,Boston
2,22,Los Angeles
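
The two accessors only give the same rows above because the default labels happen to match the positions. A minimal sketch with made-up row labels (10, 20, 30 are assumptions for illustration) shows how they differ:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [24, 27, 22],
                   'City': ['New York', 'Boston', 'Los Angeles']},
                  index=[10, 20, 30])   # hypothetical row labels

# loc works on labels: rows labeled 10 and 20, both included.
by_label = df.loc[10:20, ['Age', 'City']]

# iloc works on positions: row position 0 only, the stop is excluded.
by_position = df.iloc[0:1, 1:3]

print(by_label)
print(by_position)
```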


## 1.4 How to read a data file as DataFrame?

### 1.4.1 Read CSV File with Headers
<b><font color="orange" size=5>★</font> New Function:</b> pandas.read_csv()

We can use <b><font color=#AA0000>pd.read_csv()</font></b> function to open a "csv" file and load it as <b><font color=blue>DataFrame</font></b> object.<br>
Pandas automatically uses the first row as column headers.

In [14]:
df = pd.read_csv('Abalone.csv')
df

Unnamed: 0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
0,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7.0
1,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9.0
2,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10.0
3,I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7.0
4,I,0.425,0.300,0.095,0.3515,0.1410,0.0775,0.1200,8.0
...,...,...,...,...,...,...,...,...,...
4171,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11.0
4172,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10.0
4173,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9.0
4174,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10.0


As you can see above, the first data row (M, 0.455, 0.365, ...) was taken as the column headers, because this Abalone.csv file does not have headers. The data starts from the 1st row.<br>
Hence, we need to "tell" the function <b>NOT</b> to take the 1st row as the header.<br>

### 1.4.2 Read CSV File without Headers
If the "csv" file doesn't contain headers, we can specify this to Pandas by setting <b>header=None</b>, and it will use default integer indices for column names.

In [15]:
df = pd.read_csv('Abalone.csv', header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15.0
1,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7.0
2,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9.0
3,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10.0
4,I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7.0
...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11.0
4173,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10.0
4174,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9.0
4175,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10.0
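
If we already know what the columns mean, read_csv() also lets us supply our own names through the `names` parameter. The sketch below reads a tiny in-memory sample instead of the real file, and the column names 'sex', 'length' and 'diameter' are assumptions made up for illustration:

```python
import io
import pandas as pd

# A tiny headerless sample, standing in for a file on disk.
csv_text = "M,0.455,0.365\nM,0.350,0.265\nF,0.530,0.420\n"

# Hypothetical column names; header=None keeps the 1st row as data.
df_named = pd.read_csv(io.StringIO(csv_text),
                       header=None,
                       names=['sex', 'length', 'diameter'])
print(df_named)
```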


### 1.4.3 Read CSV File with an Index Column
If your CSV file has an index column (a column that should be used as row labels), you can specify this column with the <b>index_col</b> parameter.

In [16]:
df = pd.read_csv('airports.csv', index_col=0)
df

airport_id,city,state,name
10165,Adak Island,AK,Adak
10299,Anchorage,AK,Ted Stevens Anchorage International
10304,Aniak,AK,Aniak Airport
10754,Barrow,AK,Wiley Post/Will Rogers Memorial
10551,Bethel,AK,Bethel Airport
...,...,...,...
11233,Cheyenne,WY,Cheyenne Regional/Jerry Olson Field
11097,Cody,WY,Yellowstone Regional
11865,Gillette,WY,Gillette Campbell County
12441,Jackson,WY,Jackson Hole


In this case, <b>airport_id</b> is used as the row labels, instead of the default indices starting from 0.

### 1.4.4 Read XLSX File
<b><font color="orange" size=5>★</font> New Function:</b> pandas.read_excel()

Reading an Excel file is similar, but we use <b>pd.read_excel()</b> instead. Again, Pandas will use the first row for column headers by default.

If the Excel file has more than one sheet, we can choose which sheet to read with the <b>sheet_name</b> parameter. By default, the first sheet is read.

In [17]:
df = pd.read_excel('Generalization.xlsx', sheet_name='Sheet1')
df

Unnamed: 0,Sample,Address
0,1,"Blk296C ,CHOA CHU KANG AVENUE 2 ,#10-1451 ,SIN..."
1,2,"Blk486 ,PASIR RIS DRIVE 4 ,#10-1908 ,SINGAPORE..."
2,3,"Blk155 ,YISHUN STREET 11 ,#10-2241 ,SINGAPORE ..."
3,4,"Blk323 ,CHOA CHU KANG AVENUE 3 ,#10-2776 ,SING..."
4,5,"Blk208 ,CLEMENTI AVENUE 6 ,#10-2875 ,SINGAPORE..."
...,...,...
195,196,"Blk546 ,BEDOK NORTH STREET 3 ,#09-3890 ,SINGAP..."
196,197,"Blk75 ,WHAMPOA DRIVE ,#09-3908 ,SINGAPORE 320075"
197,198,"Blk159 ,WOODLANDS STREET 13 ,#09-4054 ,SINGAPO..."
198,199,"Blk588 ,YIO CHU KANG ROAD ,#09-4254 ,SINGAPORE..."


## 1.5 Convert to/from DataFrame
We can use the pd.DataFrame() function to create a DataFrame object without importing it from a file.<br>

The first input is always the data. There are several different data types that we can use to input the data.<br>

We can also pass a list to the "columns" argument, and the values in the list will be used as the column names.<br>
The number of values in the "columns" input should be the same as the number of columns we are making.

### 1.5.1 Creating DataFrame from a 1D list
When we use a 1D list as the input, it will create a DataFrame with only 1 column.

In [18]:
data = [1, 2, 3, 4, 5]

df = pd.DataFrame(data, columns=['Numbers'])
print(df)

   Numbers
0        1
1        2
2        3
3        4
4        5


### 1.5.2 Create DataFrame from a 2D list
We can use a nested list (i.e. lists in a list, or a 2D list) as the input.<br>

Each sub-list denotes a row.<br>
All the "sub-lists" in the list should have equal length, which is the number of columns.<br>

In [19]:
data = [['Alex', 10],
        ['Bob', 12],
        ['Clarke', 13]]

df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


### 1.5.3 Create DataFrame from a dictionary
We can use a dictionary as the input.<br>

Each key has a list as its value.<br>
All lists should have equal length.<br>
Each key denotes the column name.<br>
Each list will become a column.<br>

When we use a dictionary, we don't need to give an input to "columns" argument.

In [20]:
data = {'Name': ['Tom', 'Jerry', 'Mickey'],
        'Age': [20, 21, 19]}

df = pd.DataFrame(data)
print(df)

     Name  Age
0     Tom   20
1   Jerry   21
2  Mickey   19
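
A related input, shown here only as a sketch, is a list of dictionaries, where each dictionary is one row and its keys are the column names. This can be handy when the data arrives record by record:

```python
import pandas as pd

# Each dictionary describes one row.
records = [{'Name': 'Tom', 'Age': 20},
           {'Name': 'Jerry', 'Age': 21},
           {'Name': 'Mickey', 'Age': 19}]

df_records = pd.DataFrame(records)
print(df_records)
```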


### 1.5.4 Convert Series to a list
<b><font color="orange" size=5>★</font> New Method:</b> pandas.Series.to_list()

We can use Series.to_list() function to convert a Series object to a list.

In [21]:
data = {'Name': ['Tom', 'Jerry', 'Mickey'],
        'Age': [20, 21, 19]}

df = pd.DataFrame(data)

df['Name'].to_list()

['Tom', 'Jerry', 'Mickey']

### 1.5.5 Convert DataFrame to a numpy.array
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.to_numpy()

We can use DataFrame.to_numpy() function to convert a DataFrame object to a numpy.array.

In [22]:
data = {'Name': ['Tom', 'Jerry', 'Mickey'],
        'Age': [20, 21, 19]}

df = pd.DataFrame(data)

df.to_numpy()

array([['Tom', 20],
       ['Jerry', 21],
       ['Mickey', 19]], dtype=object)
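
The dtype=object in the output appears because the array mixes text and numbers. As a small sketch, selecting only the numeric column before converting gives a numeric array instead:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jerry', 'Mickey'],
                   'Age': [20, 21, 19]})

mixed = df.to_numpy()           # text + numbers -> object array
ages = df[['Age']].to_numpy()   # numbers only -> integer array

print(mixed.dtype, mixed.shape)
print(ages.dtype, ages.shape)
```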

# 2.0 Exploratory Data Analysis

## 2.1 Basic Data Exploration

### Preparation - Import data

In [23]:
df = pd.read_csv('Car Sales.csv')
df

Unnamed: 0,Manufacturer,Model,Sales_in_thousands,__year_resale_value,Vehicle_type,Price_in_thousands,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Latest_Launch,Power_perf_factor
0,Acura,Integra,16.919,16.360,Passenger,21.50,1.8,140.0,101.2,67.3,172.4,2.639,13.2,28.0,2/2/2012,58.280150
1,Acura,TL,39.384,19.875,Passenger,28.40,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,6/3/2011,91.370778
2,Acura,CL,14.114,18.225,Passenger,,3.2,225.0,106.9,70.6,192.0,3.470,17.2,26.0,1/4/2012,
3,Acura,RL,8.588,29.725,Passenger,42.00,3.5,210.0,114.6,71.4,196.6,3.850,18.0,22.0,3/10/2011,91.389779
4,Audi,A4,20.397,22.255,Passenger,23.99,1.8,150.0,102.6,68.2,178.0,2.998,16.4,27.0,10/8/2011,62.777639
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,Volvo,V40,3.545,,Passenger,24.40,1.9,160.0,100.5,67.6,176.6,3.042,15.8,25.0,9/21/2011,66.498812
153,Volvo,S70,15.245,,Passenger,27.50,2.4,168.0,104.9,69.3,185.9,3.208,17.9,25.0,11/24/2012,70.654495
154,Volvo,V70,17.531,,Passenger,28.80,2.4,168.0,104.9,69.3,186.2,3.259,17.9,25.0,6/25/2011,71.155978
155,Volvo,C70,3.493,,Passenger,45.50,2.3,236.0,104.9,71.5,185.7,3.601,18.5,23.0,4/26/2011,101.623357


### 2.1.1 Display the first/last few rows
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.head()<br>
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.tail()

We can use DataFrame.head() function to display the first few rows.
By default, it will show the first 5 rows.

In [24]:
df.head()

Unnamed: 0,Manufacturer,Model,Sales_in_thousands,__year_resale_value,Vehicle_type,Price_in_thousands,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Latest_Launch,Power_perf_factor
0,Acura,Integra,16.919,16.36,Passenger,21.5,1.8,140.0,101.2,67.3,172.4,2.639,13.2,28.0,2/2/2012,58.28015
1,Acura,TL,39.384,19.875,Passenger,28.4,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,6/3/2011,91.370778
2,Acura,CL,14.114,18.225,Passenger,,3.2,225.0,106.9,70.6,192.0,3.47,17.2,26.0,1/4/2012,
3,Acura,RL,8.588,29.725,Passenger,42.0,3.5,210.0,114.6,71.4,196.6,3.85,18.0,22.0,3/10/2011,91.389779
4,Audi,A4,20.397,22.255,Passenger,23.99,1.8,150.0,102.6,68.2,178.0,2.998,16.4,27.0,10/8/2011,62.777639


We can give an integer input to DataFrame.head() to indicate the number of rows to show.

In [25]:
df.head(7)

Unnamed: 0,Manufacturer,Model,Sales_in_thousands,__year_resale_value,Vehicle_type,Price_in_thousands,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Latest_Launch,Power_perf_factor
0,Acura,Integra,16.919,16.36,Passenger,21.5,1.8,140.0,101.2,67.3,172.4,2.639,13.2,28.0,2/2/2012,58.28015
1,Acura,TL,39.384,19.875,Passenger,28.4,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,6/3/2011,91.370778
2,Acura,CL,14.114,18.225,Passenger,,3.2,225.0,106.9,70.6,192.0,3.47,17.2,26.0,1/4/2012,
3,Acura,RL,8.588,29.725,Passenger,42.0,3.5,210.0,114.6,71.4,196.6,3.85,18.0,22.0,3/10/2011,91.389779
4,Audi,A4,20.397,22.255,Passenger,23.99,1.8,150.0,102.6,68.2,178.0,2.998,16.4,27.0,10/8/2011,62.777639
5,Audi,A6,18.78,23.555,Passenger,33.95,2.8,200.0,108.7,76.1,192.0,3.561,18.5,22.0,8/9/2011,84.565105
6,Audi,A8,1.38,39.0,Passenger,62.0,4.2,310.0,113.0,74.0,198.2,3.902,23.7,21.0,2/27/2012,134.656858


Similarly, we can use DataFrame.tail() to show the last few rows. It works pretty much the same as DataFrame.head().

In [26]:
df.tail()

Unnamed: 0,Manufacturer,Model,Sales_in_thousands,__year_resale_value,Vehicle_type,Price_in_thousands,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Latest_Launch,Power_perf_factor
152,Volvo,V40,3.545,,Passenger,24.4,1.9,160.0,100.5,67.6,176.6,3.042,15.8,25.0,9/21/2011,66.498812
153,Volvo,S70,15.245,,Passenger,27.5,2.4,168.0,104.9,69.3,185.9,3.208,17.9,25.0,11/24/2012,70.654495
154,Volvo,V70,17.531,,Passenger,28.8,2.4,168.0,104.9,69.3,186.2,3.259,17.9,25.0,6/25/2011,71.155978
155,Volvo,C70,3.493,,Passenger,45.5,2.3,236.0,104.9,71.5,185.7,3.601,18.5,23.0,4/26/2011,101.623357
156,Volvo,S80,18.969,,Passenger,36.0,2.9,201.0,109.9,72.1,189.8,3.6,21.1,24.0,11/14/2011,85.735655


### 2.1.2 Display the shape of the DataFrame
<b><font color="orange" size=5>★</font> New Attribute:</b> pandas.DataFrame.shape

We can call DataFrame.shape attribute to get the number of rows and columns in the DataFrame.<br>
Take note that DataFrame.shape is an attribute, not a function. So, it is not callable, i.e. do NOT write DataFrame.shape().

In [27]:
df.shape

(157, 16)

DataFrame.shape is a tuple object.<br>
The first value is the number of rows and the second value is the number of columns.<br>
We can use indices to get the values one by one, or unpack them into multiple variables at once.

In [28]:
# Method 1
n_rows = df.shape[0]
n_columns = df.shape[1]
print('Method 1:', n_rows, n_columns)

# Method 2
n_rows, n_columns = df.shape
print('Method 2:', n_rows, n_columns)

Method 1: 157 16
Method 2: 157 16


### 2.1.3 Display the data type of each column
<b><font color="orange" size=5>★</font> New Attribute:</b> pandas.DataFrame.dtypes

We can use DataFrame.dtypes attribute to get the data type of each column.<br>
Take note that DataFrame.dtypes is an attribute so it is not callable.

In [29]:
df.dtypes

Manufacturer            object
Model                   object
Sales_in_thousands     float64
__year_resale_value    float64
Vehicle_type            object
Price_in_thousands     float64
Engine_size            float64
Horsepower             float64
Wheelbase              float64
Width                  float64
Length                 float64
Curb_weight            float64
Fuel_capacity          float64
Fuel_efficiency        float64
Latest_Launch           object
Power_perf_factor      float64
dtype: object

"object" is the dtype pandas uses for columns that hold general Python objects. In practice, it usually means the column contains strings (text).

### 2.1.4 Display a statistical summary of numerical columns
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.describe()

We can use DataFrame.describe() function to display a statistical summary.<br>
Take note that columns in "object" type will not be shown.

In [30]:
df.describe()

Unnamed: 0,Sales_in_thousands,__year_resale_value,Price_in_thousands,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Power_perf_factor
count,157.0,121.0,155.0,156.0,156.0,156.0,156.0,156.0,155.0,156.0,154.0,155.0
mean,52.998076,18.072975,27.390755,3.060897,185.948718,107.487179,71.15,187.34359,3.378026,17.951923,23.844156,77.043591
std,68.029422,11.453384,14.351653,1.044653,56.700321,7.641303,3.451872,13.431754,0.630502,3.887921,4.282706,25.142664
min,0.11,5.16,9.235,1.0,55.0,92.6,62.6,149.4,1.895,10.3,15.0,23.276272
25%,14.114,11.26,18.0175,2.3,149.5,103.0,68.4,177.575,2.971,15.8,21.0,60.407707
50%,29.45,14.18,22.799,3.0,177.5,107.0,70.55,187.9,3.342,17.2,24.0,72.030917
75%,67.956,19.875,31.9475,3.575,215.0,112.2,73.425,196.125,3.7995,19.575,26.0,89.414878
max,540.561,67.55,85.5,8.0,450.0,138.7,79.9,224.5,5.572,32.0,45.0,188.144323
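
Text columns can still be summarized, just with different statistics. As a sketch on a tiny made-up sample (not the real Car Sales data), calling describe() on a text Series reports the count, the number of unique values, the most frequent value and its frequency:

```python
import pandas as pd

# A tiny made-up sample, standing in for the 'Manufacturer' column.
sample = pd.DataFrame({'Manufacturer': ['Acura', 'Acura', 'Audi'],
                       'Horsepower': [140.0, 225.0, 150.0]})

summary = sample['Manufacturer'].describe()
print(summary)
```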


### 2.1.5 Count the unique values in each column
<b><font color="orange" size=5>★</font> New Method:</b> pandas.Series.nunique()

We can count the unique values in a single column with a Series method.<br>
First, we extract one column as a Series object.

In [31]:
series = df['Manufacturer']
series

0      Acura
1      Acura
2      Acura
3      Acura
4       Audi
       ...  
152    Volvo
153    Volvo
154    Volvo
155    Volvo
156    Volvo
Name: Manufacturer, Length: 157, dtype: object

We can use Series.nunique() function to see the number of unique values in the Series.

In [32]:
series.nunique()

30

Hence, by using a for loop, we can see the number of unique values in each column.

In [33]:
for column_name in list(df):
    series = df[column_name]
    print(column_name, series.nunique())

Manufacturer 30
Model 156
Sales_in_thousands 157
__year_resale_value 117
Vehicle_type 2
Price_in_thousands 152
Engine_size 31
Horsepower 66
Wheelbase 88
Width 78
Length 127
Curb_weight 147
Fuel_capacity 55
Fuel_efficiency 20
Latest_Launch 130
Power_perf_factor 154
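
For reference, DataFrame also has a nunique() method that returns the same counts for all columns in one call. A minimal sketch on a tiny made-up sample (not the real Car Sales data):

```python
import pandas as pd

sample = pd.DataFrame({'Manufacturer': ['Acura', 'Acura', 'Audi'],
                       'Vehicle_type': ['Passenger', 'Passenger', 'Passenger'],
                       'Horsepower': [140.0, 225.0, 150.0]})

# One unique-value count per column; missing values are not counted by default.
counts = sample.nunique()
print(counts)
```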


## 2.2 Handling Missing Values

### Preparation - Create a DataFrame

In [34]:
data = {'Name': ['Alice', 'Bob', 'Charlie', np.nan],
        'Age': [24, np.nan, 22, 27],
        'Salary': [70000, 55000, None, 80000]}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Salary
0,Alice,24.0,70000.0
1,Bob,,55000.0
2,Charlie,22.0,
3,,27.0,80000.0


We have 1 missing value in each column.

### 2.2.1 Check for missing values
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.isnull()

We can use DataFrame.isnull() function to examine if each value is considered as a missing value.

In [35]:
df.isnull()

Unnamed: 0,Name,Age,Salary
0,False,False,False
1,False,True,False
2,False,False,True
3,True,False,False


<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.sum()

We can use DataFrame.sum() function to determine the sum of each column.<br>
When the column is in boolean type (True or False), it will count the number of "True" in the column.

Hence, we can use DataFrame.isnull().sum() to display the number of missing values in each column.

In [36]:
df.isnull().sum()

Name      1
Age       1
Salary    1
dtype: int64
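
In the same spirit, taking the mean of the boolean DataFrame gives the share of missing values per column, since True counts as 1 and False as 0. A minimal sketch that rebuilds the DataFrame so it runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', np.nan],
                   'Age': [24, np.nan, 22, 27],
                   'Salary': [70000, 55000, None, 80000]})

# Fraction of missing values in each column (1 out of 4 rows here).
missing_share = df.isnull().mean()
print(missing_share)
```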

### 2.2.2 Drop rows with any missing values
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.dropna()

We can use DataFrame.dropna() to drop a row that contains any missing value.

In [37]:
df_dropped = df.dropna()
df_dropped

Unnamed: 0,Name,Age,Salary
0,Alice,24.0,70000.0


We have only 1 row left as that is the only complete row.

### 2.2.3 Drop columns with any missing values
We can specify axis=1 in DataFrame.dropna() to drop a column, instead of a row, that contains any missing value.

In [38]:
df_dropped = df.dropna(axis=1)
df_dropped

0
1
2
3


We have no column left as each column has 1 missing value.

### 2.2.4 Drop rows with all values missing
We can specify how='all' in DataFrame.dropna().<br>
In this case, a row will be dropped only if the row has all values missing.

In [39]:
df_dropped = df.dropna(how='all')
df_dropped

Unnamed: 0,Name,Age,Salary
0,Alice,24.0,70000.0
1,Bob,,55000.0
2,Charlie,22.0,
3,,27.0,80000.0
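
Sometimes only certain columns matter. As a sketch, the `subset` parameter of DataFrame.dropna() limits the check to the listed columns, so a row is dropped only if it has a missing value there:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', np.nan],
                   'Age': [24, np.nan, 22, 27],
                   'Salary': [70000, 55000, None, 80000]})

# Only a missing 'Name' causes a row to be dropped.
df_named_only = df.dropna(subset=['Name'])
print(df_named_only)
```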


### 2.2.5 Fill Missing Values with a Specific Value
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.fillna()

We can use DataFrame.fillna() to fill the missing values with a specific input value.

In [40]:
df_filled = df.fillna(0)
df_filled

Unnamed: 0,Name,Age,Salary
0,Alice,24.0,70000.0
1,Bob,0.0,55000.0
2,Charlie,22.0,0.0
3,0,27.0,80000.0


If we want to fill different values for different columns, we can input a dictionary instead.

In [41]:
df_filled = df.fillna(value={'Name': 'Unknown', 'Age': 30, 'Salary': 60000})
df_filled

Unnamed: 0,Name,Age,Salary
0,Alice,24.0,70000.0
1,Bob,30.0,55000.0
2,Charlie,22.0,60000.0
3,Unknown,27.0,80000.0


### 2.2.6 Fill Missing Values by a method
We can use several different methods to fill missing values.

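When we forward-fill with DataFrame.ffill(), each missing value is filled with the last valid value before it in the same column.<br>
However, it will not fill a missing value in the 1st row, as there is nothing before it.<br>
The older spelling, DataFrame.fillna(method='ffill'), gives the same result but is deprecated since pandas 2.1.

In [42]:
df_ffill = df.ffill()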
print(df_ffill)

      Name   Age   Salary
0    Alice  24.0  70000.0
1      Bob  24.0  55000.0
2  Charlie  22.0  55000.0
3  Charlie  27.0  80000.0


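When we backward-fill with DataFrame.bfill(), each missing value is filled with the next valid value after it in the same column.<br>
However, it will not fill a missing value in the last row, as there is nothing after it.<br>
The older spelling, DataFrame.fillna(method='bfill'), gives the same result but is deprecated since pandas 2.1.

In [43]:
df_bfill = df.bfill()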
print(df_bfill)

      Name   Age   Salary
0    Alice  24.0  70000.0
1      Bob  22.0  55000.0
2  Charlie  22.0  80000.0
3      NaN  27.0  80000.0


### 2.2.7 Fill Missing Values by mean/median and mode
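For numeric columns, we can fill by the mean or median.<br>
For categorical columns, we can fill by the mode (the most frequent value).

This is a common practice to fill missing values.

Firstly, we create a dictionary that contains the value to use for each column.<br>
In this example, every name is different, so the mode is not meaningful for "Name" and we use the placeholder 'Unknown' instead.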

In [44]:
values_for_fillna = {'Name': 'Unknown', 'Age': df['Age'].mean(), 'Salary': df['Salary'].mean()}
values_for_fillna

{'Name': 'Unknown', 'Age': 24.333333333333332, 'Salary': 68333.33333333333}

Then, we use the dictionary to fill the missing values.

In [45]:
df_filled = df.fillna(value=values_for_fillna)
df_filled

Unnamed: 0,Name,Age,Salary
0,Alice,24.0,70000.0
1,Bob,24.333333,55000.0
2,Charlie,22.0,68333.333333
3,Unknown,27.0,80000.0
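
For a column where the mode does make sense, a sketch could look like the following. The "Department" column is made up for illustration; Series.mode() returns a Series (there can be ties), so we take its first value:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a repeating categorical column.
df_dept = pd.DataFrame({'Department': ['HR', 'IT', 'HR', np.nan],
                        'Age': [24, np.nan, 22, 27]})

fill_values = {'Department': df_dept['Department'].mode()[0],  # most frequent value
               'Age': df_dept['Age'].median()}                 # middle value

df_dept_filled = df_dept.fillna(value=fill_values)
print(df_dept_filled)
```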


## 2.3 Data Conversion

### Preparation - Create a DataFrame

In [46]:
data = {'ProductID': [101, 102, 103, 104],
        'Price': [19.99, 25.50, 8.99, '12.34'],
        'Quantity': ['10', '15', '20', '25']}

df = pd.DataFrame(data)
df

Unnamed: 0,ProductID,Price,Quantity
0,101,19.99,10
1,102,25.5,15
2,103,8.99,20
3,104,12.34,25


Although "Price" and "Quantity" look like numeric columns, they are not, because we entered some of their values as strings.<br>
We can check the data types with the DataFrame.dtypes attribute.

In [47]:
df.dtypes

ProductID     int64
Price        object
Quantity     object
dtype: object

In the pandas library, the "object" type usually means strings (text). Here it also covers the "Price" column, which mixes floats and a string.

### 2.3.1 Data Conversion
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.copy()<br>
<b><font color="orange" size=5>★</font> New Method:</b> pandas.Series.astype()

We can use Series.astype() function to convert a Series into a specific data type.<br>
We need to set the target data type as the input, for example, str, int or float.

Here, we convert column by column, as each column needs a different target type.

In [48]:
# We can create a copy of df, so the raw df will not be changed
df_copy = df.copy()

df_copy['ProductID'] = df_copy['ProductID'].astype(str)
df_copy['Price'] = df_copy['Price'].astype(float)
df_copy['Quantity'] = df_copy['Quantity'].astype(int)

df_copy.dtypes

ProductID     object
Price        float64
Quantity       int32
dtype: object

Now, all columns are converted into the correct type. (Whether int becomes int32 or int64 depends on the platform and library versions.)
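
Two related tools are sketched below. DataFrame.astype() also accepts a dictionary that maps column names to types, which converts several columns in one call. And when a column contains text that cannot be parsed as a number (the 'n/a' entry below is a made-up example), pd.to_numeric() with errors='coerce' turns such entries into missing values instead of raising an error:

```python
import pandas as pd

df = pd.DataFrame({'ProductID': [101, 102],
                   'Price': [19.99, '12.34'],
                   'Quantity': ['10', '15']})

# One call, one target type per column.
converted = df.astype({'ProductID': str, 'Price': float, 'Quantity': int})
print(converted.dtypes)

# Hypothetical messy column: 'n/a' cannot be parsed as a number.
messy = pd.Series(['10', '15', 'n/a'])
numbers = pd.to_numeric(messy, errors='coerce')
print(numbers)
```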

## 2.4 Rename Columns

### Preparation - Create a DataFrame

In [49]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}

df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,1,4,7
1,2,5,8
2,3,6,9


### 2.4.1 Rename specific columns
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.rename()

We can use DataFrame.rename() function to rename some specific columns.<br>
It takes a dictionary as the input. The keys are the original names and the corresponding values are the new names to use.

In [50]:
df_renamed = df.rename(columns={'A': 'Alpha', 'B': 'Beta'})
df_renamed

Unnamed: 0,Alpha,Beta,C
0,1,4,7
1,2,5,8
2,3,6,9
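
Besides a dictionary, the `columns` argument of DataFrame.rename() also accepts a function that is applied to every column name. A small sketch using the built-in str.lower:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# str.lower is called on each column name.
df_lower = df.rename(columns=str.lower)

print(list(df_lower.columns))
print(list(df.columns))   # the source DataFrame keeps its names
```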


### 2.4.2 Rename all columns
<b><font color="orange" size=5>★</font> New Attribute:</b> pandas.DataFrame.columns

We can also overwrite DataFrame.columns with a list.<br>
The list should have equal length as the number of columns in the DataFrame.

In [51]:
df_renamed = df.copy()
df_renamed.columns = ['X', 'Y', 'Z']
df_renamed

Unnamed: 0,X,Y,Z
0,1,4,7
1,2,5,8
2,3,6,9


## 2.5 Filter data in a DataFrame

### Preparation - Create a DataFrame

In [52]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 22],
        'Salary': [70000, 80000, 90000, 60000, 75000],
        'Department': ['HR', 'IT', 'Finance', 'Marketing', 'HR']}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Salary,Department
0,Alice,25,70000,HR
1,Bob,30,80000,IT
2,Charlie,35,90000,Finance
3,David,40,60000,Marketing
4,Eva,22,75000,HR


### 2.5.1 Create a boolean Series by comparison operator

In [53]:
df['Department'] == 'HR'

0     True
1    False
2    False
3    False
4     True
Name: Department, dtype: bool

A comparison operator like this creates a Series of boolean type (True or False).

### 2.5.2 Filter DataFrame by one condition
We can use a boolean Series as the index to slice a DataFrame.<br>
The resulting DataFrame will keep the rows that correspond to the "True" value.<br>
Take note that, the boolean Series must have the same length as the number of rows in the DataFrame.

In [54]:
mask = df['Department'] == 'HR'
df_filtered = df[mask]
df_filtered

Unnamed: 0,Name,Age,Salary,Department
0,Alice,25,70000,HR
4,Eva,22,75000,HR


We can use the "~" operator to "flip" the boolean values in a Series.<br>
If we apply it to the slicing criterion, the resulting DataFrame will keep the rows that correspond to the "False" value.

In [55]:
mask = df['Age'] > 30
df_filtered = df[~mask]
df_filtered

Unnamed: 0,Name,Age,Salary,Department
0,Alice,25,70000,HR
1,Bob,30,80000,IT
4,Eva,22,75000,HR


### 2.5.3 Filter DataFrame by multiple conditions
We can use the "&" operator to join multiple boolean Series by an "AND" condition.<br>
In the resulting boolean Series, each value will be "True" only if all the corresponding values in the joined Series are "True".

Take note that the joined Series need to have equal length.

In [56]:
mask1 = df['Age'] <= 30
mask2 = df['Salary'] >= 75000
mask = mask1 & mask2

df_filtered = df[mask]
df_filtered

Unnamed: 0,Name,Age,Salary,Department
1,Bob,30,80000,IT
4,Eva,22,75000,HR


We can use the "|" operator to join multiple boolean Series by an "OR" condition.<br>
In the resulting boolean Series, each value will be "True" if at least one of the corresponding values in the joined Series is "True".

In [57]:
mask1 = df['Age'] <= 30
mask2 = df['Salary'] >= 75000
mask = mask1 | mask2

df_filtered = df[mask]
df_filtered

Unnamed: 0,Name,Age,Salary,Department
0,Alice,25,70000,HR
1,Bob,30,80000,IT
2,Charlie,35,90000,Finance
4,Eva,22,75000,HR
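
The conditions can also be written in a single line, as sketched below. Each comparison then needs its own parentheses, because "&" and "|" bind more tightly than "<=" or ">=" in Python. DataFrame.query() is an alternative that takes the condition as a string:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
                   'Age': [25, 30, 35, 40, 22],
                   'Salary': [70000, 80000, 90000, 60000, 75000],
                   'Department': ['HR', 'IT', 'Finance', 'Marketing', 'HR']})

# Parentheses around each comparison are required.
df_and = df[(df['Age'] <= 30) & (df['Salary'] >= 75000)]

# The same filter, written as a query string.
df_query = df.query('Age <= 30 and Salary >= 75000')

print(df_and)
```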


## 2.6 Sort values in DataFrame

### Preparation - Create a DataFrame

In [58]:
data = {'Name': ['Alice', 'Charlie', 'Bob', 'Eva', 'David'],
        'Age': [25, 35, 30, 22, 40],
        'Department': ['HR', 'Marketing', 'Sales', 'HR', 'Sales'],
        'Salary': [70000, 90000, 80000, 60000, 75000]}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Department,Salary
0,Alice,25,HR,70000
1,Charlie,35,Marketing,90000
2,Bob,30,Sales,80000
3,Eva,22,HR,60000
4,David,40,Sales,75000


### 2.6.1 Sort by a single column
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.sort_values()

We can use DataFrame.sort_values() method to sort values in a DataFrame.<br>
We can input the specific column name to the "by" argument, which will be the column we use to sort.<br>
By default, the sorting will be done in ascending order.

In [59]:
sorted_df = df.sort_values(by='Age')
sorted_df

Unnamed: 0,Name,Age,Department,Salary
3,Eva,22,HR,60000
0,Alice,25,HR,70000
2,Bob,30,Sales,80000
1,Charlie,35,Marketing,90000
4,David,40,Sales,75000


Take note that each row keeps its original index label, so the indices are now out of order.<br>
If we want to reset the index, we can use DataFrame.reset_index() method.

### 2.6.2 Sort by a column and reset index
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.reset_index()

In [60]:
sorted_df = df.sort_values(by='Age').reset_index()
sorted_df

Unnamed: 0,index,Name,Age,Department,Salary
0,3,Eva,22,HR,60000
1,0,Alice,25,HR,70000
2,2,Bob,30,Sales,80000
3,1,Charlie,35,Marketing,90000
4,4,David,40,Sales,75000


The indices are reset to 0 to 4 in sequence. The previous indices are converted into a new column, called "index".<br>
If we do not want to keep the previous indices, we can set drop=True in DataFrame.reset_index() method.

In [61]:
sorted_df = df.sort_values(by='Age').reset_index(drop=True)
sorted_df

Unnamed: 0,Name,Age,Department,Salary
0,Eva,22,HR,60000
1,Alice,25,HR,70000
2,Bob,30,Sales,80000
3,Charlie,35,Marketing,90000
4,David,40,Sales,75000
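
As a shortcut, DataFrame.sort_values() has an `ignore_index` parameter that renumbers the rows during the sort. The sketch below checks that it gives the same result as chaining reset_index(drop=True):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Charlie', 'Bob', 'Eva', 'David'],
                   'Age': [25, 35, 30, 22, 40],
                   'Department': ['HR', 'Marketing', 'Sales', 'HR', 'Sales'],
                   'Salary': [70000, 90000, 80000, 60000, 75000]})

chained = df.sort_values(by='Age').reset_index(drop=True)
shortcut = df.sort_values(by='Age', ignore_index=True)

print(shortcut)
print(chained.equals(shortcut))
```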


### 2.6.3 Sort by a single column in descending order
We can set ascending=False in DataFrame.sort_values() method so the DataFrame will be sorted by the specific column in descending order.

In [62]:
sorted_df = df.sort_values(by='Salary', ascending=False)
sorted_df

Unnamed: 0,Name,Age,Department,Salary
1,Charlie,35,Marketing,90000
2,Bob,30,Sales,80000
4,David,40,Sales,75000
0,Alice,25,HR,70000
3,Eva,22,HR,60000


### 2.6.4 Sort by multiple columns
We can input a list of column names to the "by" argument.<br>
The DataFrame will be sorted by these columns in sequence.<br>

In [63]:
sorted_df = df.sort_values(by=['Department', 'Salary'])
sorted_df

Unnamed: 0,Name,Age,Department,Salary
3,Eva,22,HR,60000
0,Alice,25,HR,70000
1,Charlie,35,Marketing,90000
4,David,40,Sales,75000
2,Bob,30,Sales,80000


Now, the DataFrame is sorted first by "Department" in ascending order (A to Z), and then by "Salary" in ascending order within each department.<br>
As before, we can set ascending=False to reverse the order for all of the columns.

What if we want to sort by multiple columns, with some in ascending order and others in descending order?

### 2.6.5 Sort by multiple columns in different sort orders
We can do that by passing a list of booleans to the "ascending" argument. Each boolean determines whether the column at the same position in "by" is sorted in ascending (True) or descending (False) order.

In [64]:
sorted_df = df.sort_values(by=['Department', 'Salary'], ascending=[True, False])
sorted_df

Unnamed: 0,Name,Age,Department,Salary
0,Alice,25,HR,70000
3,Eva,22,HR,60000
1,Charlie,35,Marketing,90000
2,Bob,30,Sales,80000
4,David,40,Sales,75000
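
Because each boolean is paired with the column at the same position, the two lists need the same length. The sketch below checks this on the same sample data, rebuilt here from the output tables (an assumption); the variable name length_error is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the output tables in this section (assumed)
df = pd.DataFrame({
    'Name': ['Alice', 'Charlie', 'Bob', 'Eva', 'David'],
    'Age': [25, 35, 30, 22, 40],
    'Department': ['HR', 'Marketing', 'Sales', 'HR', 'Sales'],
    'Salary': [70000, 90000, 80000, 60000, 75000],
})

# Department A to Z, and the highest Salary first within each department
mixed = df.sort_values(by=['Department', 'Salary'], ascending=[True, False])
print(mixed['Name'].tolist())

# A list of booleans that is shorter than "by" is rejected by pandas
try:
    df.sort_values(by=['Department', 'Salary'], ascending=[True])
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```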


## 2.7 Aggregate a DataFrame

### Preparation - Create a DataFrame
In terms of data manipulation, aggregation means calculating a summary figure for a group of values, such as the sum or the average.

In [65]:
data = {'Employee': ['Anna', 'Emma', 'Ethan', 'Gary', 'John', 'Lila', 'Will'],
        'Department': ['HR', 'Sales', 'HR', 'Sales', 'HR', 'Sales', 'HR'],
        'Seniority': ['Junior', 'Junior',' Senior', 'Senior', 'Senior', 'Junior', 'Junior'],
        'Age': [29, 28, 35, 32, 33, 24, 26],
        'Salary': [70000, 60000, 80000, 73000, 78000, 55000, 58000]}

df = pd.DataFrame(data)
df

Unnamed: 0,Employee,Department,Seniority,Age,Salary
0,Anna,HR,Junior,29,70000
1,Emma,Sales,Junior,28,60000
2,Ethan,HR,Senior,35,80000
3,Gary,Sales,Senior,32,73000
4,John,HR,Senior,33,78000
5,Lila,Sales,Junior,24,55000
6,Will,HR,Junior,26,58000


### 2.7.1 Aggregate the entire DataFrame
There are a few methods that we can use to compute a summary figure from a Series/DataFrame.<br>

Some common examples are listed below:
    - DataFrame.mean()
    - DataFrame.sum()
    - DataFrame.min()
    - DataFrame.max()
    - ...

Those methods can be applied to a Series object, too.

However, note that DataFrame.mean() can only average numeric columns. In recent pandas versions, calling it on a DataFrame that contains text columns raises an error unless numeric_only=True is set.<br>
Hence, we often need to select a numeric subset of the DataFrame first, before we can apply the method.

In [66]:
# Determine the mean of "Age" and "Salary"
df[['Age', 'Salary']].mean()

Age          29.571429
Salary    67714.285714
dtype: float64

In [67]:
# Determine the sum of "Age" and "Salary"
df[['Age', 'Salary']].sum()

Age          207
Salary    474000
dtype: int64
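
Selecting the numeric columns by name is one option. The sketch below compares it with two alternatives, the numeric_only argument and DataFrame.select_dtypes(), on a trimmed copy of the data above (the "Seniority" column is left out because it does not affect the result).

```python
import pandas as pd

# Trimmed copy of the employee data used in this section
df = pd.DataFrame({
    'Employee': ['Anna', 'Emma', 'Ethan', 'Gary', 'John', 'Lila', 'Will'],
    'Department': ['HR', 'Sales', 'HR', 'Sales', 'HR', 'Sales', 'HR'],
    'Age': [29, 28, 35, 32, 33, 24, 26],
    'Salary': [70000, 60000, 80000, 73000, 78000, 55000, 58000],
})

# Option 1: pick the numeric columns by name
by_subset = df[['Age', 'Salary']].mean()

# Option 2: ask mean() to skip non-numeric columns
by_flag = df.mean(numeric_only=True)

# Option 3: select every numeric column by dtype, then aggregate
by_dtype = df.select_dtypes(include='number').mean()

print(by_flag)
```

All three should return the same two means; the first is the most explicit about which columns are included.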

### 2.7.2 Aggregate by groups
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.groupby()

We can use the DataFrame.groupby() method to split the DataFrame into groups.<br>
Then we can compute the same summary figures for each group.

In [68]:
# Determine the mean of "Age" and "Salary" per "Department"
sub_df = df[['Department', 'Age', 'Salary']]
sub_df.groupby('Department').mean()

Unnamed: 0_level_0,Age,Salary
Department,Unnamed: 1_level_1,Unnamed: 2_level_1
HR,30.75,71500.0
Sales,28.0,62666.666667
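
Instead of building a subset first, the columns to aggregate can also be selected after grouping. The sketch below shows this on a trimmed copy of the data above, together with as_index=False, which keeps "Department" as an ordinary column instead of the index.

```python
import pandas as pd

# Trimmed copy of the employee data used in this section
df = pd.DataFrame({
    'Employee': ['Anna', 'Emma', 'Ethan', 'Gary', 'John', 'Lila', 'Will'],
    'Department': ['HR', 'Sales', 'HR', 'Sales', 'HR', 'Sales', 'HR'],
    'Age': [29, 28, 35, 32, 33, 24, 26],
    'Salary': [70000, 60000, 80000, 73000, 78000, 55000, 58000],
})

# Group first, then choose which columns to average
by_dept = df.groupby('Department')[['Age', 'Salary']].mean()
print(by_dept)

# Keep "Department" as a regular column rather than the index
flat = df.groupby('Department', as_index=False)[['Age', 'Salary']].mean()
print(flat)
```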


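We can also pass a list of columns as the groupby keys. This creates one group per unique combination.<br>
Note that the output below shows two "Senior" rows under "HR": in the data above, Ethan's seniority was entered as ' Senior' with a leading space, so pandas treats it as a different value from 'Senior'.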

In [69]:
# Determine the mean of "Age" and "Salary" per "Department" and "Seniority"
sub_df = df[['Department', 'Seniority', 'Age', 'Salary']]
sub_df.groupby(['Department', 'Seniority']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Salary
Department,Seniority,Unnamed: 2_level_1,Unnamed: 3_level_1
HR,Senior,35.0,80000.0
HR,Junior,27.5,64000.0
HR,Senior,33.0,78000.0
Sales,Junior,26.0,57500.0
Sales,Senior,32.0,73000.0
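
A minimal sketch of cleaning the labels before grouping, assuming the ' Senior' entry with a leading space in the data above is unintentional. It uses Series.str.strip() to remove surrounding whitespace, after which the two "HR"/"Senior" employees fall into one group.

```python
import pandas as pd

# Employee data copied from above, including the ' Senior' entry
df = pd.DataFrame({
    'Employee': ['Anna', 'Emma', 'Ethan', 'Gary', 'John', 'Lila', 'Will'],
    'Department': ['HR', 'Sales', 'HR', 'Sales', 'HR', 'Sales', 'HR'],
    'Seniority': ['Junior', 'Junior', ' Senior', 'Senior', 'Senior', 'Junior', 'Junior'],
    'Age': [29, 28, 35, 32, 33, 24, 26],
    'Salary': [70000, 60000, 80000, 73000, 78000, 55000, 58000],
})

# The raw labels produce one more group than intended
raw_groups = df.groupby(['Department', 'Seniority']).ngroups
print(raw_groups)

# Strip surrounding whitespace, then group again
clean = df.assign(Seniority=df['Seniority'].str.strip())
result = clean.groupby(['Department', 'Seniority'])[['Age', 'Salary']].mean()
print(result)
```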


### 2.7.3 Aggregate columns in different ways by groups
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.agg()

We can use the agg() method to apply different aggregation methods to different columns. It is available on a DataFrame and, as used here, on the result of groupby().<br>

Maybe we want to calculate several summary figures from the same column.<br>
In this case, we can pass a list as the input. The list should contain strings that name the aggregation methods.

In [70]:
# Determine the mean and the sum of "Age" and of "Salary" per "Department"
sub_df = df[['Department', 'Age', 'Salary']]
sub_df.groupby('Department').agg(['mean', 'sum'])

Unnamed: 0_level_0,Age,Age,Salary,Salary
Unnamed: 0_level_1,mean,sum,mean,sum
Department,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
HR,30.75,123,71500.0,286000
Sales,28.0,84,62666.666667,188000


Maybe we want to compute the sum for one column and the average for another.<br>
In this case, we can pass a dictionary as the input.<br>
The keys are the column names and the values are the aggregation methods.

In [71]:
# Determine the mean of "Age" and the sum of "Salary" per "Department"
sub_df = df[['Department', 'Age', 'Salary']]
sub_df.groupby('Department').agg({'Age': 'mean', 'Salary': 'sum'})

Unnamed: 0_level_0,Age,Salary
Department,Unnamed: 1_level_1,Unnamed: 2_level_1
HR,30.75,286000
Sales,28.0,188000


Even when we are using a dictionary, we can set some of its values to a list so that a column is aggregated in several different ways.

In [72]:
# Determine the mean of "Age" and the mean and the sum of "Salary" per "Department"
sub_df = df[['Department', 'Age', 'Salary']]
sub_df.groupby('Department').agg({'Age': 'mean', 'Salary': ['mean', 'sum']})

Unnamed: 0_level_0,Age,Salary,Salary
Unnamed: 0_level_1,mean,mean,sum
Department,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
HR,30.75,71500.0,286000
Sales,28.0,62666.666667,188000
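
When the two-level column header is inconvenient, pandas also supports "named aggregation", where each keyword argument names an output column and maps it to a (column, method) pair. The sketch below uses a trimmed copy of the data above; the output names avg_age and total_salary are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Trimmed copy of the employee data used in this section
df = pd.DataFrame({
    'Employee': ['Anna', 'Emma', 'Ethan', 'Gary', 'John', 'Lila', 'Will'],
    'Department': ['HR', 'Sales', 'HR', 'Sales', 'HR', 'Sales', 'HR'],
    'Age': [29, 28, 35, 32, 33, 24, 26],
    'Salary': [70000, 60000, 80000, 73000, 78000, 55000, 58000],
})

# Each keyword is an output column: name=(source column, aggregation)
summary = df.groupby('Department').agg(
    avg_age=('Age', 'mean'),
    total_salary=('Salary', 'sum'),
)
print(summary)
```

The result has a single header row, which can be easier to work with in later steps.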


## 2.8 Merge and concatenate DataFrames

### Preparation - Create multiple DataFrames

In [73]:
data1 = {'ID': [1, 2, 3, 4],
         'Name': ['Alice', 'Bob', 'Charlie', 'David']}
df1 = pd.DataFrame(data1)
df1

Unnamed: 0,ID,Name
0,1,Alice
1,2,Bob
2,3,Charlie
3,4,David


In [74]:
data2 = {'ID': [5, 6],
         'Name': ['Eva', 'Frank']}
df2 = pd.DataFrame(data2)
df2

Unnamed: 0,ID,Name
0,5,Eva
1,6,Frank


In [75]:
data3 = {'ID': [4, 5, 6, 7],
         'Salary': [70000, 80000, 90000, 60000]}
df3 = pd.DataFrame(data3)
df3

Unnamed: 0,ID,Salary
0,4,70000
1,5,80000
2,6,90000
3,7,60000


### 2.8.1 Concatenate DataFrames
<b><font color="orange" size=5>★</font> New Function:</b> pandas.concat()

We can use the pandas.concat() function to join multiple DataFrames vertically.<br>
The pandas.concat() function takes a list of DataFrames as the input, so it can join more than 2 DataFrames at once.

In [76]:
concatenated_df = pd.concat([df1, df2])
concatenated_df

Unnamed: 0,ID,Name
0,1,Alice
1,2,Bob
2,3,Charlie
3,4,David
0,5,Eva
1,6,Frank


Note that the rows keep the index labels they had in the separate DataFrames, so the labels 0 and 1 appear twice.

If we want to renumber the rows, we can set ignore_index=True in the pandas.concat() function.

In [77]:
df4 = pd.concat([df1, df2], ignore_index=True)
df4

Unnamed: 0,ID,Name
0,1,Alice
1,2,Bob
2,3,Charlie
3,4,David
4,5,Eva
5,6,Frank


We can join multiple DataFrames horizontally by setting axis=1 in the pandas.concat() function.

In [78]:
concatenated_df = pd.concat([df4, df3], axis=1)
concatenated_df

Unnamed: 0,ID,Name,ID.1,Salary
0,1,Alice,4.0,70000.0
1,2,Bob,5.0,80000.0
2,3,Charlie,6.0,90000.0
3,4,David,7.0,60000.0
4,5,Eva,,
5,6,Frank,,
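
The missing values above come from how pandas.concat() pairs rows when axis=1: it matches index labels, not the values in the "ID" column. A minimal sketch that makes this visible, assuming df4 and df3 are rebuilt as shown earlier, is to move "ID" into the index before concatenating:

```python
import pandas as pd

# df4 and df3 rebuilt from the tables above
df4 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank']})
df3 = pd.DataFrame({'ID': [4, 5, 6, 7],
                    'Salary': [70000, 80000, 90000, 60000]})

# With "ID" as the index, concat pairs the rows by ID
aligned = pd.concat([df4.set_index('ID'), df3.set_index('ID')], axis=1)
print(aligned)
```

Rows whose ID appears in only one table are kept and filled with missing values.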


Notice that the rows are paired by index rather than by ID, so the ID in the salary table does not match the ID in the name table.<br>
To align the rows by ID, we can use the pandas.merge() function.

### 2.8.2 Merge DataFrames
<b><font color="orange" size=5>★</font> New Function:</b> pandas.merge()

We can use the pandas.merge() function to merge DataFrames on a specific key.<br>
That means the rows are joined when they have the same value in the "key" column.

Note that, unlike pandas.concat(), which can concatenate multiple DataFrames at once, pandas.merge() only processes 2 DataFrames at a time.<br>
Hence, it takes the 2 DataFrames as separate inputs, instead of one list.<br>
They are called the "left" table and the "right" table.

By default, pandas.merge() uses the columns that have the same name in both DataFrames as the key. But it is better to specify the key explicitly with the "on" argument.

In [79]:
merged_df = pd.merge(df4, df3, on='ID')
merged_df

Unnamed: 0,ID,Name,Salary
0,4,David,70000
1,5,Eva,80000
2,6,Frank,90000


Notice that only the IDs that appear in both DataFrames are kept.
This operation is called an "inner join".

If we want to keep all the IDs in the 1st DataFrame (also called the "left" table), we can set how='left'.<br>
If a row in the "left" table has no match, it is kept and the columns from the "right" table are filled with missing values.
This operation is called a "left join".

In [80]:
merged_df = pd.merge(df4, df3, on='ID', how='left')
merged_df

Unnamed: 0,ID,Name,Salary
0,1,Alice,
1,2,Bob,
2,3,Charlie,
3,4,David,70000.0
4,5,Eva,80000.0
5,6,Frank,90000.0


Likewise, we can do a "right join" with how='right', which keeps all the IDs in the "right" table.

In [81]:
merged_df = pd.merge(df4, df3, on='ID', how='right')
merged_df

Unnamed: 0,ID,Name,Salary
0,4,David,70000
1,5,Eva,80000
2,6,Frank,90000
3,7,,60000


If we do not want to drop any row from either table, we can do an "outer join" with how='outer'.

In [82]:
merged_df = pd.merge(df4, df3, on='ID', how='outer')
merged_df

Unnamed: 0,ID,Name,Salary
0,1,Alice,
1,2,Bob,
2,3,Charlie,
3,4,David,70000.0
4,5,Eva,80000.0
5,6,Frank,90000.0
6,7,,60000.0
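
To see which table each row of an outer join came from, pandas.merge() accepts indicator=True, which adds a "_merge" column. The sketch below assumes df4 and df3 are rebuilt as shown earlier; the variable names source and unmatched_left are illustrative.

```python
import pandas as pd

# df4 and df3 rebuilt from the tables above
df4 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank']})
df3 = pd.DataFrame({'ID': [4, 5, 6, 7],
                    'Salary': [70000, 80000, 90000, 60000]})

# "_merge" is 'left_only', 'right_only' or 'both' for each row
merged = pd.merge(df4, df3, on='ID', how='outer', indicator=True)
print(merged)

# Map each ID to its origin, then list the IDs without a salary record
source = dict(zip(merged['ID'], merged['_merge'].astype(str)))
unmatched_left = sorted(merged.loc[merged['_merge'] == 'left_only', 'ID'].tolist())
print(unmatched_left)
```

This can be a quick way to check for unmatched keys before choosing between an inner, left, right, or outer join.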
