# Import Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1.0 Get to know Series and DataFrame

## 1.1 What is pandas.Series?
<b><font color="orange" size=5>★</font> New Function:</b> pandas.Series()

A pandas.Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

Here's a simple example to illustrate a Series:

In [None]:
data = [1, 3, 5, 7, 9]
series = pd.Series(data)

series

In this output, the left column (0, 1, ..., 4) represents <b><font color="#AA0000">the indices</font></b>, and the right column (1, 3, ..., 9) represents <b><font color="#AA0000">the values</font></b>.
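
As a small sketch of the same idea, the index does not have to be the default integers. The labels 'a' to 'e' below are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Same values as above, but with hypothetical string labels as the index
labeled = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])

index_labels = list(labeled.index)   # the labels (left column of the output)
values = labeled.to_list()           # the values (right column of the output)
value_at_c = labeled['c']            # look up one value by its label

print(index_labels)
print(values)
print(value_at_c)
```

Looking up `labeled['c']` works much like a dictionary lookup, which is why the index is described as a set of labels rather than just row numbers.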

## 1.2 What is pandas.DataFrame?
<b><font color="orange" size=5>★</font> New Function:</b> pandas.DataFrame()

A pandas.DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It's basically a table with rows and columns. Columns can be of different types, and it's the most commonly used pandas object.

Sometimes a DataFrame may look like a Series when it has only 1 column.

Here's a basic example of a DataFrame:

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [24, 27, 22],
        'City': ['New York', 'Boston', 'Los Angeles']}
df = pd.DataFrame(data)

df

In this DataFrame, <b>Name</b>, <b>Age</b>, and <b>City</b> are <b><font color="#AA0000">the column headers</font></b>, and the rows are indexed with numbers starting from 0.

Both Series and DataFrame are central to data analysis tasks using Pandas. They provide a vast array of functions and methods to efficiently work with structured data.

## 1.3 Extract a subset from DataFrame

### Preparation - Create a DataFrame

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [24, 27, 22],
        'City': ['New York', 'Boston', 'Los Angeles']}

df = pd.DataFrame(data)
df

### 1.3.1 Extract ONE column from a DataFrame as a Series
We can use the column name as the index to extract a column, as a Series object, from a DataFrame object.

In [None]:
series = df['City']
series

### 1.3.2 Extract multiple columns from a DataFrame as a DataFrame
We can use a list that contains column names as the index to extract a subset from a DataFrame object.<br>
The subset will have the same number of rows as the source DataFrame.

In [None]:
sub_df = df[['Name', 'Age']]
sub_df

### 1.3.3 Extract ONE column from a DataFrame as a DataFrame
When the list contains only one column name, it will extract a subset with only 1 column.<br>
It will look very much like a Series object.<br>
Pay extra attention to differentiate them.

In [None]:
sub_df = df[['City']]
sub_df
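
One way to tell the two apart is to check the type and the shape. The sketch below rebuilds the same small DataFrame so it can run on its own; the variable names `as_series` and `as_frame` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [24, 27, 22],
                   'City': ['New York', 'Boston', 'Los Angeles']})

as_series = df['City']     # single brackets -> Series
as_frame = df[['City']]    # double brackets -> DataFrame with one column

# A Series has a one-value shape; a DataFrame always has (rows, columns)
print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```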

### 1.3.4 Drop ONE column from a DataFrame
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.drop()

We can use DataFrame.drop() function and specify the column name to drop specific columns.

The input can be a string to denote the column name.<br>
We also need to set axis=1 (if axis=0, it will be dropping rows instead).

In [None]:
df_dropped = df.drop('City', axis=1)
df_dropped

### 1.3.5 Drop multiple columns from a DataFrame
The input can also be a list that contains multiple column names.

In [None]:
df_dropped = df.drop(['Age', 'City'], axis=1)
df_dropped

### 1.3.6 Extract rows from a DataFrame
We can use a slice (start:stop) to extract rows from a DataFrame. It works much like slicing a list: the start position is included and the stop position is excluded.

In [None]:
sub_df = df[1:3]
sub_df

We can't just use a single value as the index to slice a DataFrame.<br>
When there is no ":" in the index, a single value will be interpreted as a column name, and pandas will assume that you're trying to extract a column.<br>

We still need to set a range index even if we're just getting 1 row.

In [None]:
sub_df = df[1:2]
sub_df

### 1.3.7 Extract columns and rows at the same time

We can use DataFrame.loc[..., ...] to get a subset.<br>
We need to set 2 indices, separated by a comma.<br>

The 1st index selects rows by their index labels. Unlike list slicing, a label slice such as 1:3 includes the end label if it exists.<br>
The 2nd index is the list of column names to slice columns.

In [None]:
sub_df = df.loc[1:3, ['Age', 'City']]
sub_df

We can also use DataFrame.iloc[..., ...] to get a subset.<br>
We need to set 2 indices, separated by a comma.<br>

The 1st index is the range of integer positions to slice rows.<br>
The 2nd index is the range of integer positions to slice columns.<br>

The difference is that .iloc uses integer positions, instead of labels and column names, for both rows and columns, and the end position of each slice is excluded.

In [None]:
sub_df = df.iloc[1:3, 1:3]
sub_df
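
The difference in how the two indexers treat the end of a slice is easy to miss. The sketch below uses a made-up four-row DataFrame (the 'David' row is an assumption added for illustration) and applies the same 1:2 slice through both.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [24, 27, 22, 31]})

by_label = df.loc[1:2, ['Name']]     # labels 1 and 2: the end label is included
by_position = df.iloc[1:2, [0]]      # position 1 only: the end position is excluded

print(len(by_label))
print(len(by_position))
```

With the default index the labels and the positions are the same numbers, so the only visible difference is whether the end of the slice is kept.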

## 1.4 How to read a data file as DataFrame?

### 1.4.1 Read CSV File with Headers
<b><font color="orange" size=5>★</font> New Function:</b> pandas.read_csv()

We can use <b><font color=#AA0000>pd.read_csv()</font></b> function to open a "csv" file and load it as <b><font color=blue>DataFrame</font></b> object.<br>
Pandas automatically uses the first row as column headers.

In [None]:
df = pd.read_csv('Abalone.csv')
df

As you can see above, this Abalone.csv file does not have headers. The data starts from the 1st row.<br>
Hence, we need to "tell" the function <b>NOT</b> to take the 1st row as the header.<br>

### 1.4.2 Read CSV File without Headers
If the "csv" file doesn't contain headers, we can specify this to Pandas by setting <b>header=None</b>, and it will use default integer indices for column names.

In [None]:
df = pd.read_csv('Abalone.csv', header=None)
df
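
To see the effect without needing the data file, here is a minimal sketch that reads a made-up two-row CSV from a string. The sample values and the column names 'Sex', 'Length' and 'Rings' are assumptions for illustration, not the real Abalone columns. The `names` parameter of pd.read_csv() lets us supply our own column names.

```python
import io
import pandas as pd

# Hypothetical CSV content with no header row
csv_text = "M,0.455,15\nF,0.530,9\n"

# Default: the first data row is consumed as the header
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is data, columns are named 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# header=None plus names: every row is data, with our own column names
named = pd.read_csv(io.StringIO(csv_text), header=None,
                    names=['Sex', 'Length', 'Rings'])

print(with_default.shape, no_header.shape)
print(list(named.columns))
```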

### 1.4.3 Read CSV File with an Index Column
If your CSV file has an index column (a column that should be used as row labels), you can specify this column with the <b>index_col</b> parameter.

In [None]:
df = pd.read_csv('airports.csv', index_col=0)
df

In this case, <b>airport_id</b> is used as the row labels, instead of the default indices starting from 0.

### 1.4.4 Read XLSX File
<b><font color="orange" size=5>★</font> New Function:</b> pandas.read_excel()

Reading an Excel file is similar, but we use <b>pd.read_excel()</b> instead. Again, Pandas will use the first row for column headers by default.

If the Excel file has more than one sheet, we need to specify which sheet to read through the <b>sheet_name</b> parameter.

In [None]:
df = pd.read_excel('Generalization.xlsx', sheet_name='Sheet1')
df

### <font color=darkred><b>Exercise 1</b></font>
Extract Row 100-199, Column 4-6 from the Abalone dataset

In [None]:
# Re-initialize df just in case they are changed
# Do NOT change this cell

df = pd.read_csv('Abalone.csv', header=None)

In [None]:
# Write your code for Exercise 1 here

df.loc[_, _]

## 1.5 Convert to/from DataFrame
We can use pd.DataFrame() function to create a DataFrame object without importing it from a file.<br>

The first input is always the data. There are several different data types that we can use to input the data.<br>

We can also set an input to "columns" argument. It should take a list and it will use the values in the list as the column names.<br>
The number of values in the "columns" input should be the same as the number of columns we are making.

### 1.5.1 Creating DataFrame from a 1D list
When we use a 1D list as the input, it will create a DataFrame with only 1 column.

In [None]:
data = [1, 2, 3, 4, 5]

df = pd.DataFrame(data, columns=['Numbers'])
df

### 1.5.2 Create DataFrame from a 2D list
We can use a nested list (i.e. lists in a list, or a 2D list) as the input.<br>

Each sub-list denotes a row.<br>
All the "sub-lists" in the list should have equal length, which is the number of columns.<br>

In [None]:
data = [['Alex', 10],
        ['Bob', 12],
        ['Clarke', 13]]

df = pd.DataFrame(data, columns=['Name', 'Age'])
df

### 1.5.3 Create DataFrame from a dictionary
We can use a dictionary as the input.<br>

Each key has a list as its value.<br>
All lists should have equal length.<br>
Each key denotes the column name.<br>
Each list will become a column.<br>

When we use a dictionary, we don't need to give an input to "columns" argument.

In [None]:
data = {'Name': ['Tom', 'Jerry', 'Mickey'],
        'Age': [20, 21, 19]}

df = pd.DataFrame(data)
df

### <font color=darkred><b>Exercise 2</b></font>
Create a DataFrame that looks like this:

|  | ID | Name |
|----------|-----------|-------------|
| 0 | 0001 | Adam |
| 1 | 0002 | Bruce |
| 2 | 0003 | Charles |
| 3 | 0004 | David |

In [None]:
# Write your code for Exercise 2 here

# Note: ID has leading zeros, so it must be stored as a string
data = {_}

df = pd.DataFrame(data)
df

### 1.5.4 Convert a Series to a list
<b><font color="orange" size=5>★</font> New Method:</b> pandas.Series.to_list()

We can use Series.to_list() function to convert a Series object to a list.

In [None]:
data = {'Name': ['Tom', 'Jerry', 'Mickey'],
        'Age': [20, 21, 19]}

df = pd.DataFrame(data)

df['Name'].to_list()

### 1.5.5 Convert a DataFrame to a list
We can use list(df) function to convert a DataFrame object to a list.

In [None]:
list(df)

Actually, we can't. It will only return the column names as a list. But we can make good use of it sometimes.
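
If we do want the cell values as plain Python lists, one possible sketch is shown below. It relies on DataFrame.values and DataFrame.to_dict(), which are not otherwise covered in this notebook, so treat it as one option rather than the only way.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jerry', 'Mickey'],
                   'Age': [20, 21, 19]})

column_names = list(df)                   # only the column names
rows = df.values.tolist()                 # one inner list per row
records = df.to_dict(orient='records')    # one dictionary per row

print(column_names)
print(rows)
print(records)
```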

### 1.5.6 Convert a DataFrame to a numpy.array
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.to_numpy()

We can use DataFrame.to_numpy() function to convert a DataFrame object to a numpy.array.

In [None]:
data = {'Name': ['Tom', 'Jerry', 'Mickey'],
        'Age': [20, 21, 19]}

df = pd.DataFrame(data)

df.to_numpy()

# 2.0 Exploratory Data Analysis

## 2.1 Basic Data Exploration

### Preparation - Import data

In [None]:
df = pd.read_csv('Car Sales.csv')
df

### 2.1.1 Display the first/last few rows
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.head()<br>
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.tail()

We can use DataFrame.head() function to display the first few rows.
By default, it will show the first 5 rows.

In [None]:
df.head()

We can give an integer input to DataFrame.head() to indicate the number of rows to show.

In [None]:
df.head(7)

Similarly, we can use DataFrame.tail() to show the last few rows. It works pretty much the same as DataFrame.head().

In [None]:
df.tail()

### 2.1.2 Display the shape of the DataFrame
<b><font color="orange" size=5>★</font> New Attribute:</b> pandas.DataFrame.shape

We can call DataFrame.shape attribute to get the number of rows and columns in the DataFrame.<br>
Take note that DataFrame.shape is an attribute, not a function. So, it is not callable, i.e. do NOT write DataFrame.shape().

In [None]:
df.shape

DataFrame.shape is a tuple object.<br>
The first value is the number of rows and the second value is the number of columns.<br>
We can use indices to call the values, or use multiple variables at once to parse the values.

In [None]:
# Method 1
n_rows = df.shape[0]
n_columns = df.shape[1]
print('Method 1:', n_rows, n_columns)

In [None]:
# Method 2
n_rows, n_columns = df.shape
print('Method 2:', n_rows, n_columns)

### 2.1.3 Display the data type of each column
<b><font color="orange" size=5>★</font> New Attribute:</b> pandas.DataFrame.dtypes

We can use DataFrame.dtypes attribute to get the data type of each column.<br>
Take note that DataFrame.dtypes is an attribute so it is not callable.

In [None]:
df.dtypes

"object" is the pandas type for generic Python objects. In practice, it usually means that the column holds strings.

### 2.1.4 Display a statistical summary of numerical columns
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.describe()

We can use DataFrame.describe() function to display a statistical summary.<br>
Take note that columns in "object" type will not be shown.

In [None]:
df.describe()

### 2.1.5 Count the unique values in each column
<b><font color="orange" size=5>★</font> New Method:</b> pandas.Series.nunique()

We can count the unique values in a single column, which is a Series object.

In [None]:
series = df['Manufacturer']
series

We can use Series.nunique() function to see the number of unique values in the Series.

In [None]:
series.nunique()

A DataFrame object also has a DataFrame.nunique() method, which returns the count for every column at once. Doing it column by column is still good practice with loops, which is what the next exercise asks for.

### <font color=darkred><b>Exercise 3</b></font>
Examine the number of unique values in each column.<br>
<i>Hint: Use for loop.</i>

In [None]:
# Re-initialize df just in case they are changed
# Do NOT change this cell

df = pd.read_csv('Car Sales.csv')

In [None]:
# Write your code for Exercise 3 here


## 2.2 Handling Missing Values

### Preparation - Create a DataFrame

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie', np.nan],
        'Age': [24, np.nan, 22, 27],
        'Salary': [70000, 55000, None, 80000]}

df = pd.DataFrame(data)
df

We have 1 missing value in each column.

### 2.2.1 Check for missing values
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.isnull()

We can use DataFrame.isnull() function to examine if each value is considered as a missing value.

In [None]:
df.isnull()

<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.sum()

We can use DataFrame.sum() function to determine the sum of each column.<br>
When the column is in boolean type (True or False), it will count the number of "True" in the column.

Hence, we can use DataFrame.isnull().sum() to display the number of missing values in each column.

In [None]:
df.isnull().sum()

### 2.2.2 Drop rows with any missing values
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.dropna()

We can use DataFrame.dropna() to drop a row that contains any missing value.

In [None]:
df_dropped = df.dropna()
df_dropped

We have only 1 row left as that is the only complete row.

### 2.2.3 Drop columns with any missing values
We can specify axis=1 in DataFrame.dropna() to drop a column, instead of a row, that contains any missing value.

In [None]:
df_dropped = df.dropna(axis=1)
df_dropped

We have no column left as each column has 1 missing value.

### 2.2.4 Drop rows with all values missing
We can specify how='all' in DataFrame.dropna().<br>
In this case, a row will be dropped only if the row has all values missing.

In [None]:
df_dropped = df.dropna(how='all')
df_dropped

There is no row dropped because none of them has all values missing.

### 2.2.5 Fill Missing Values with a Specific Value
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.fillna()

We can use DataFrame.fillna() to fill the missing values by a specific input value.

In [None]:
df_filled = df.fillna(0)
df_filled

If we want to fill different values for different columns, we can input a dictionary instead.

In [None]:
df_filled = df.fillna(value={'Name': 'Unknown', 'Age': 30, 'Salary': 60000})
df_filled

### 2.2.6 Fill Missing Values by a method
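Besides a fixed value, we can fill each missing value from its neighbouring rows.<br>
Older code often writes df.fillna(method='ffill') or df.fillna(method='bfill'). Recent pandas releases deprecate the "method" argument in favour of the dedicated methods below.

When we use DataFrame.ffill() (forward fill), it will take the last valid value before each missing value to fill it.<br>
However, it will not fill the missing values in the 1st row, because nothing comes before them.

In [None]:
df_ffill = df.ffill()
df_ffill

When we use DataFrame.bfill() (backward fill), it will take the next valid value after each missing value to fill it.<br>
However, it will not fill the missing values in the last row, because nothing comes after them.

In [None]:
df_bfill = df.bfill()
df_bfill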

### <font color=darkred><b>Exercise 4</b></font>
Fill Missing Values by mean/median and mode<br>
For numeric columns, we can fill by the mean or median.<br>
For categorical columns, we can fill by the mode.

This is a common practice to fill missing values.

In [None]:
# Re-initialize df just in case they are changed
# Do NOT change this cell

data = {'Name': ['Alice', 'Bob', 'Charlie', np.nan],
        'Age': [24, np.nan, 22, 27],
        'Salary': [70000, 55000, None, 80000]}

df = pd.DataFrame(data)

In [None]:
# Write your code for Exercise 4 here

values_for_fillna = {_}
df_filled = df.fillna(value=values_for_fillna)
df_filled

## 2.3 Data Conversion

### Preparation - Create a DataFrame

In [None]:
data = {'ProductID': [101, 102, 103, 104],
        'Price': [19.99, 25.50, 8.99, '12.34'],
        'Quantity': ['10', '15', '20', '25']}

df = pd.DataFrame(data)
df

Although "Price" and "Quantity" look like numeric columns, we know they are not, because we set some of their values as strings.<br>
We can check the data types by DataFrame.dtypes attribute.

In [None]:
df.dtypes

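In pandas, the "object" type usually means the column holds strings (text), although it can hold any Python object, including a mix of numbers and strings as in "Price".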

### 2.3.1 Data Conversion
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.copy()<br>
<b><font color="orange" size=5>★</font> New Method:</b> pandas.Series.astype()

We can use Series.astype() function to convert a Series into a specific data type.<br>
We need to set the target data type as the input, for example, str, int or float.

We have to set the data conversion column by column, unless we are converting all columns into one type.

In [None]:
# We can create a copy of df, so the raw df will not be changed
df_copy = df.copy()

df_copy['ProductID'] = df_copy['ProductID'].astype(str)
df_copy['Price'] = df_copy['Price'].astype(float)
df_copy['Quantity'] = df_copy['Quantity'].astype(int)

df_copy.dtypes

Now, all columns are converted into the correct type.
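
Series.astype() raises an error as soon as one value cannot be converted. For messier data, a possible alternative is pandas.to_numeric() with errors='coerce', sketched below on a made-up column that contains the text 'N/A' (an assumption for illustration, not part of the DataFrame above).

```python
import pandas as pd

# Hypothetical column with one entry that is not a number
prices = pd.Series(['19.99', '25.50', 'N/A', '12.34'])

# prices.astype(float) would raise a ValueError because of 'N/A'.
# errors='coerce' turns entries that cannot be parsed into NaN instead.
clean = pd.to_numeric(prices, errors='coerce')

print(clean.dtype)
print(clean.isnull().sum())
```

The resulting missing values can then be handled with the techniques from section 2.2.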

## 2.4 Rename Columns

### Preparation - Create a DataFrame

In [None]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}

df = pd.DataFrame(data)
df

### 2.4.1 Rename specific columns
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.rename()

We can use DataFrame.rename() function to rename some specific columns.<br>
It takes a dictionary as the input. The keys are the original names and the corresponding values are the new names to use.

In [None]:
df_renamed = df.rename(columns={'A': 'Alpha', 'B': 'Beta'})
df_renamed

### 2.4.2 Rename all columns
<b><font color="orange" size=5>★</font> New Attribute:</b> pandas.DataFrame.columns

We can also overwrite DataFrame.columns by a list.<br>
The list should have equal length as the number of columns in the DataFrame.

In [None]:
df_renamed = df.copy()
df_renamed.columns = ['X', 'Y', 'Z']
df_renamed

## 2.5 Filter data in a DataFrame

### Preparation - Create a DataFrame

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 22],
        'Salary': [70000, 80000, 90000, 60000, 75000],
        'Department': ['HR', 'IT', 'Finance', 'Marketing', 'HR']}

df = pd.DataFrame(data)
df

### 2.5.1 Create a boolean Series by comparison operator

In [None]:
df['Department'] == 'HR'

This kind of comparison creates a Series in boolean type (True or False), with one value per row.

### 2.5.2 Filter DataFrame by one condition
We can use a boolean Series as the index to slice a DataFrame.<br>
The resulting DataFrame will keep the rows that correspond to the "True" value.<br>
Take note that, the boolean Series must have the same length as the number of rows in the DataFrame.

In [None]:
mask = df['Department'] == 'HR'
df_filtered = df[mask]
df_filtered

We can use "~" operator to "flip" the boolean value in a Series.<br>
If we apply it to the slicing criterion, the resulting DataFrame will keep the rows that correspond to the "False" value.<br>

In [None]:
mask = df['Department'] == 'HR'
df_filtered = df[~mask]
df_filtered

### 2.5.3 Filter DataFrame by multiple conditions
We can use "&" operator to join multiple boolean Series by "AND" condition.<br>
In the resulting boolean Series, each value will be "True" only if all the corresponding values in the joined Series are "True".

Take note that the joined Series need to have equal length.

In [None]:
mask1 = df['Age'] <= 30
mask2 = df['Salary'] >= 75000
mask = mask1 & mask2

df_filtered = df[mask]
df_filtered

We can use "|" operator to join multiple boolean Series by "OR" condition.<br>
In the resulting boolean Series, each value will be "True" if at least one of the corresponding values in the joined Series is "True".

In [None]:
mask1 = df['Age'] <= 30
mask2 = df['Salary'] >= 75000
mask = mask1 | mask2

df_filtered = df[mask]
df_filtered

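Please take note that, if you are writing multiple conditions in one line, you need to use "()" to enclose each condition.<br>
The "&" and "|" operators bind more tightly than comparison operators such as "<=" and ">=", so without the parentheses Python groups the expression differently and there will be an error.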

In [None]:
mask = (df['Age'] <= 30) | (df['Salary'] >= 75000)

df_filtered = df[mask]
df_filtered
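
The sketch below shows what happens when the parentheses are left out. It uses a trimmed copy of the DataFrame above so it can run on its own; the `failed` flag is only there to record the outcome.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40, 22],
                   'Salary': [70000, 80000, 90000, 60000, 75000]})

try:
    # Without parentheses, Python evaluates 30 | df['Salary'] first,
    # then chains the two comparisons, which needs a single True/False
    mask = df['Age'] <= 30 | df['Salary'] >= 75000
    failed = False
except ValueError:
    failed = True

print(failed)

# With parentheses, each comparison is evaluated before "|"
good_mask = (df['Age'] <= 30) | (df['Salary'] >= 75000)
print(good_mask.sum())
```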

### 2.5.4 Filter DataFrame by a value range
Based on what we have learned, we can simply write it this way.

In [None]:
mask = (df['Salary'] >= 60000) & (df['Salary'] <= 75000)

df_filtered = df[mask]
df_filtered

<b><font color="orange" size=5>★</font> New Method:</b> pandas.Series.between()

Instead, we can use pandas.Series.between() method to achieve the same result. Both boundary values are included by default.

In [None]:
mask = df['Salary'].between(60000, 75000)

df_filtered = df[mask]
df_filtered
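
If the boundary values should not be included, the `inclusive` argument can change that. This is a minimal sketch that assumes pandas 1.3 or newer, where `inclusive` accepts 'both', 'neither', 'left' or 'right'.

```python
import pandas as pd

salary = pd.Series([70000, 80000, 90000, 60000, 75000])

both = salary.between(60000, 75000)                           # default: both ends included
neither = salary.between(60000, 75000, inclusive='neither')   # both ends excluded

print(both.sum())
print(neither.sum())
```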

pandas.Series.between() method works not only on a numeric series, but also a string series as string can be sorted and ranked too.

In [None]:
mask = df['Name'].between('B', 'D')

df_filtered = df[mask]
df_filtered

When we sort the text, we will have: Alice < B < Bob < Charlie < D < David < Eva.

So, 'Bob' and 'Charlie' fall between 'B' and 'D'.

### 2.5.5 Filter DataFrame by a set of values
Assuming that we want to extract employees from HR or Marketing department, based on what we have learned, we can simply write it this way.

In [None]:
mask = (df['Department'] == 'HR') | (df['Department'] == 'Marketing')

df_filtered = df[mask]
df_filtered

However, we can imagine that, if we want to extract employees from 10 departments, we will need to write 10 joint conditions, which will be long and inefficient.

<b><font color="orange" size=5>★</font> New Method:</b> pandas.Series.isin()

Instead, we can use pandas.Series.isin() method to simplify the command. For each row, it will return True if the value is one of the items in the list.

In [None]:
mask = df['Department'].isin(['HR', 'Marketing'])

df_filtered = df[mask]
df_filtered
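
Combining Series.isin() with the "~" operator gives the opposite filter, i.e. everyone who is not in the listed departments. The sketch below uses a trimmed copy of the DataFrame above so it can run on its own.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
                   'Department': ['HR', 'IT', 'Finance', 'Marketing', 'HR']})

# True for rows whose department is HR or Marketing
mask = df['Department'].isin(['HR', 'Marketing'])

# "~" flips the mask, so we keep everyone else
others = df[~mask]

print(others['Name'].to_list())
```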

## 2.6 Sort values in DataFrame

### Preparation - Create a DataFrame

In [None]:
data = {'Name': ['Alice', 'Charlie', 'Bob', 'Eva', 'David'],
        'Age': [25, 35, 30, 22, 40],
        'Department': ['HR', 'Marketing', 'Sales', 'HR', 'Sales'],
        'Salary': [70000, 90000, 80000, 60000, 75000]}

df = pd.DataFrame(data)
df

### 2.6.1 Sort by a single column
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.sort_values()

We can use DataFrame.sort_values() method to sort values in a DataFrame.<br>
We can input the specific column name to the "by" argument, which will be the column we use to sort.<br>
By default, the sorting will be done in ascending order.

In [None]:
sorted_df = df.sort_values(by='Age')
sorted_df

Take note that the original index labels stay attached to their rows, so they now appear out of order.<br>
If we want to reset the index, we can use DataFrame.reset_index() method.

### 2.6.2 Sort by a column and reset index
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.reset_index()

In [None]:
sorted_df = df.sort_values(by='Age').reset_index()
sorted_df

The indices are reset to 0 to 4 in sequence. The previous indices are converted into a new column, called "index".<br>
If we do not want to keep the previous indices, we can set drop=True in DataFrame.reset_index() method.

In [None]:
sorted_df = df.sort_values(by='Age').reset_index(drop=True)
sorted_df

### 2.6.3 Sort by a single column in descending order
We can set ascending=False in DataFrame.sort_values() method so the DataFrame will be sorted by the specific column in descending order.

In [None]:
sorted_df = df.sort_values(by='Salary', ascending=False)
sorted_df

### 2.6.4 Sort by multiple columns
We can input a list of column names to the "by" argument.<br>
The DataFrame will be sorted by these columns in sequence.<br>

In [None]:
sorted_df = df.sort_values(by=['Department', 'Salary'])
sorted_df

Now, the DataFrame is first sorted by "Department" in ascending order (A to Z), and then by "Salary" in ascending order within each department.<br>
As before, we can set ascending=False to reverse that.

What if we want to sort by multiple columns concurrently, but some in ascending order while others in descending order?

### 2.6.5 Sort by multiple columns in different sorting methods
We can do that by inputting a boolean list to "ascending" argument. Each boolean value will determine whether the corresponding column should be sorted in ascending or descending order.

In [None]:
sorted_df = df.sort_values(by=['Department', 'Salary'], ascending=[True, False])
sorted_df

## 2.7 Aggregate a DataFrame

### Preparation - Create a DataFrame
In terms of data manipulation, aggregation means that we are calculating something out of a group, such as the sum, the average, etc.

In [None]:
data = {'Employee': ['Anna', 'Emma', 'Ethan', 'Gary', 'John', 'Lila', 'Will'],
        'Department': ['HR', 'Sales', 'HR', 'Sales', 'HR', 'Sales', 'HR'],
        'Seniority': ['Junior', 'Junior', 'Senior', 'Senior', 'Senior', 'Junior', 'Junior'],
        'Age': [29, 28, 35, 32, 33, 24, 26],
        'Salary': [70000, 60000, 80000, 73000, 78000, 55000, 58000]}

df = pd.DataFrame(data)
df

### 2.7.1 Aggregate the entire DataFrame
There are a few methods that we can use to determine a numeric figure out of a Series/DataFrame.<br>

There are a few common examples below:

- DataFrame.mean()
- DataFrame.sum()
- DataFrame.min()
- DataFrame.max()
- ...

Those methods can be applied to a Series object, too.

However, take note that, when we are trying to use DataFrame.mean(), we need to make sure all columns in the DataFrame are numeric.<br>
Hence, sometimes we need to get a subset of the DataFrame first, before we can apply the method.

In [None]:
# Determine the mean of "Age" and "Salary"
df[['Age', 'Salary']].mean()

In [None]:
# Determine the sum of "Age" and "Salary"
df[['Age', 'Salary']].sum()
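
As an alternative to taking a subset first, these methods accept a numeric_only argument. The sketch below uses a three-row extract of the DataFrame above so it can run on its own.

```python
import pandas as pd

df = pd.DataFrame({'Employee': ['Anna', 'Emma', 'Ethan'],
                   'Age': [29, 28, 35],
                   'Salary': [70000, 60000, 80000]})

# numeric_only=True skips the text column instead of requiring a subset first
means = df.mean(numeric_only=True)

print(means)
```

The result is a Series whose index holds the names of the numeric columns.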

### 2.7.2 Aggregate by groups
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.groupby()

We can use DataFrame.groupby() method to split the DataFrame into groups.<br>
Then we can compute those numeric figures per group.

In [None]:
# Determine the mean of "Age" and "Salary" per "Department"
sub_df = df[['Department', 'Age', 'Salary']]
sub_df.groupby('Department').mean()

We can also use a list of multiple columns as the groupby factors. It will create a group per each unique combination.

In [None]:
# Determine the mean of "Age" and "Salary" per "Department" and "Seniority"
sub_df = df[['Department', 'Seniority', 'Age', 'Salary']]
sub_df.groupby(['Department', 'Seniority']).mean()

### 2.7.3 Aggregate columns in different ways by groups
<b><font color="orange" size=5>★</font> New Method:</b> pandas.DataFrame.agg()

We can use DataFrame.agg() method to apply different aggregation methods to different columns.<br>

Maybe we want to calculate a few numeric figures out of the same column.<br>
In this case, we can set a list as the input. The list should contain strings to denote the aggregation methods.

In [None]:
# Determine the mean and the sum of "Age" and of "Salary" per "Department"
sub_df = df[['Department', 'Age', 'Salary']]
sub_df.groupby('Department').agg(['mean', 'sum'])

Maybe we want to compute the sum for a column and the average for the other column.<br>
In this case, we can set a dictionary as the input.<br>
The keys refer to the column names and the values refer to the aggregation method.

In [None]:
# Determine the mean of "Age" and the sum of "Salary" per "Department"
sub_df = df[['Department', 'Age', 'Salary']]
sub_df.groupby('Department').agg({'Age': 'mean', 'Salary': 'sum'})

Even when we are using a dictionary, we can set some values to a list so a column will be aggregated in a few different ways.

In [None]:
# Determine the mean of "Age" and the mean and the sum of "Salary" per "Department"
sub_df = df[['Department', 'Age', 'Salary']]
sub_df.groupby('Department').agg({'Age': 'mean', 'Salary': ['mean', 'sum']})
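
The list and dictionary forms produce nested column headers. One possible way to get flat, self-describing column names is "named aggregation", sketched below. The output names `avg_age` and `total_salary` are made up for illustration, and the data is a four-row extract of the DataFrame above.

```python
import pandas as pd

df = pd.DataFrame({'Department': ['HR', 'Sales', 'HR', 'Sales'],
                   'Age': [29, 28, 35, 32],
                   'Salary': [70000, 60000, 80000, 73000]})

# Each keyword is the new column name; each tuple is (source column, method)
summary = df.groupby('Department').agg(
    avg_age=('Age', 'mean'),
    total_salary=('Salary', 'sum'),
)

print(summary)
```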

### <font color=darkred><b>Exercise 5</b></font>
Find the min age, the max age, the median salary and the standard deviation of salary per Department.

In [None]:
# Re-initialize df just in case they are changed
# Do NOT change this cell

data = {'Employee': ['Anna', 'Emma', 'Ethan', 'Gary', 'John', 'Lila', 'Will'],
        'Department': ['HR', 'Sales', 'HR', 'Sales', 'HR', 'Sales', 'HR'],
        'Seniority': ['Junior', 'Junior', 'Senior', 'Senior', 'Senior', 'Junior', 'Junior'],
        'Age': [29, 28, 35, 32, 33, 24, 26],
        'Salary': [70000, 60000, 80000, 73000, 78000, 55000, 58000]}

df = pd.DataFrame(data)

In [None]:
# Write your code for Exercise 5 here

sub_df = df[['Department', 'Age', 'Salary']]
sub_df.groupby('Department').agg(_)

## 2.8 Merge and concatenate DataFrame

### Preparation - Create multiple DataFrames

In [None]:
data1 = {'ID': [1, 2, 3, 4],
         'Name': ['Alice', 'Bob', 'Charlie', 'David']}
df1 = pd.DataFrame(data1)
df1

In [None]:
data2 = {'ID': [5, 6],
         'Name': ['Eva', 'Frank']}
df2 = pd.DataFrame(data2)
df2

In [None]:
data3 = {'ID': [4, 5, 6, 7],
         'Salary': [70000, 80000, 90000, 60000]}
df3 = pd.DataFrame(data3)
df3

### 2.8.1 Concatenate DataFrames
<b><font color="orange" size=5>★</font> New Function:</b> pandas.concat()

We can use pandas.concat() function to join multiple DataFrames vertically.<br>
pandas.concat() function takes a list of DataFrames as the input. It can join more than 2 DataFrames at once.

In [None]:
concatenated_df = pd.concat([df1, df2])
concatenated_df

Take note that, the indices remain the same as how they appear in the separate DataFrames.

If we want to reset it, we can set ignore_index=True in pandas.concat() function.

In [None]:
df4 = pd.concat([df1, df2], ignore_index=True)
df4

We can also join multiple DataFrames horizontally by setting axis=1 in pandas.concat() function.

In [None]:
concatenated_df = pd.concat([df4, df3], axis=1)
concatenated_df

We may notice that the rows are simply placed side by side according to their index, so the ID in the salary table does not match the ID in the name table.<br>
In order to align them, we need to use pandas.merge() function.

### 2.8.2 Merge DataFrames
<b><font color="orange" size=5>★</font> New Function:</b> pandas.merge()

We can use pandas.merge() function to merge DataFrames by a specific key.<br>
That means, the rows are joined when they have the same value in the "key" column.

Take note that, unlike pandas.concat() that can concatenate multiple DataFrames at once, pandas.merge() only processes 2 DataFrames at one time.<br>
Hence, it takes the DataFrames separately as 2 inputs, instead of as one list.<br>
They are called the "left" table and the "right" table.

By default, pandas.merge() function uses the columns that have the same name in both DataFrames as the key. But it would be better to specify them by the "on" argument.

In [None]:
merged_df = pd.merge(df4, df3, on='ID')
merged_df

We may notice that only the IDs that appear in both DataFrames are kept.<br>
This operation is called "inner join".

If we want to keep all the IDs in the 1st DataFrame (also called the "left" table), we can set how='left'.<br>
If there is no match, the rows in the "left" table will be kept, with missing values in the columns that come from the "right" table.<br>
This operation is called "left join".

In [None]:
merged_df = pd.merge(df4, df3, on='ID', how='left')
merged_df

Likewise, we can do a "right join".

In [None]:
merged_df = pd.merge(df4, df3, on='ID', how='right')
merged_df

If we do not want to drop any row, we can do an "outer join".

In [None]:
merged_df = pd.merge(df4, df3, on='ID', how='outer')
merged_df
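
To check which table each row of an outer join came from, pandas.merge() accepts indicator=True, which adds a "_merge" column. The sketch below uses two small made-up tables (`names` and `salaries` are hypothetical, not the DataFrames above) so it can run on its own.

```python
import pandas as pd

names = pd.DataFrame({'ID': [3, 4, 5],
                      'Name': ['Charlie', 'David', 'Eva']})
salaries = pd.DataFrame({'ID': [4, 5, 7],
                         'Salary': [70000, 80000, 60000]})

# indicator=True adds a "_merge" column: left_only, right_only or both
merged = pd.merge(names, salaries, on='ID', how='outer', indicator=True)

print(merged[['ID', '_merge']])
```

Rows marked "left_only" or "right_only" are the ones that an inner join would have dropped.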

### <font color=darkred><b>Exercise 6</b></font>
Use airports.csv and flights.csv<br>
Find the top 5 cities with the most inbound flights and the top 3 states with the most outbound flights.

In [None]:
# Initialize the variable(s) just in case they are changed
# Do NOT change this cell

airport_df = pd.read_csv('airports.csv')
flight_df = pd.read_csv('flights.csv')

<i>Hint: When the left key and the right key are not the same, we need to indicate them separately.<br>
We will need to set "left_on" and "right_on" instead of the "on" argument.</i>

In [None]:
merged_df = pd.merge(_, _, left_on=_, right_on=_)
merged_df

In [None]:
merged_df['city'].value_counts().head(5)

In [None]:
merged_df['state'].value_counts().head(3)