# Question 3: Python Programming [10 marks] 

Study “Open Data on COVID-19 in Malaysia” by the Ministry of Health (MOH), Malaysia via https://github.com/MoH-Malaysia/covid19-public. Use only datasets from the categories “Cases and Testing”, “Healthcare”, “Deaths”, and “Static
data” for this assignment.

Answer the following questions and present your findings using the “Streamlit” package. Upload it to Heroku.com. Each analysis must have at least one chart and a short paragraph explaining your findings.

<span style='color:green'>Data taken from MoH Malaysia as of [`11-09-2021`](https://github.com/MoH-Malaysia/covid19-public/commit/a9d2a11512d0943db02140a03486f6862df87107)</span>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, roc_auc_score

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from imblearn.over_sampling import SMOTE
from collections import Counter

## 1.0 Simple Exploratory Data Analysis

In [None]:
# Loading the datasets

# cases and testing
cases_malaysia = pd.read_csv('dataset/cases_and_testing/cases_malaysia.csv')
cases_state = pd.read_csv('dataset/cases_and_testing/cases_state.csv')
clusters = pd.read_csv('dataset/cases_and_testing/clusters.csv')
tests_malaysia = pd.read_csv('dataset/cases_and_testing/tests_malaysia.csv')
tests_state = pd.read_csv('dataset/cases_and_testing/tests_state.csv')

# deaths
deaths_malaysia = pd.read_csv('dataset/deaths/deaths_malaysia.csv')
deaths_state = pd.read_csv('dataset/deaths/deaths_state.csv')

# healthcare
hospital = pd.read_csv('dataset/healthcare/hospital.csv')
icu = pd.read_csv('dataset/healthcare/icu.csv')
pkrc = pd.read_csv('dataset/healthcare/pkrc.csv')

# static data
population = pd.read_csv('dataset/static_data/population.csv')

The datasets do not all cover the same period: their start dates (date of first record) and end dates (date of last record) differ.

| Dataset         | Start date | End date   | Number of rows |
| --------------- | ---------- | ---------- | -------------- |
| cases_malaysia  | 2020-01-25 | 2021-09-11 | 596            |
| cases_state     | 2020-01-25 | 2021-09-11 | 9536           |
| clusters        | 2020-03-01 | 2021-09-09 | 5022           |
| tests_malaysia  | 2020-01-24 | 2021-09-08 | 594            |
| tests_state     | 2021-07-01 | 2021-09-08 | 1120           |
| deaths_malaysia | 2020-03-17 | 2021-09-11 | 544            |
| deaths_state    | 2020-03-17 | 2021-09-11 | 8704           |
| hospital        | 2020-03-24 | 2021-09-11 | 8179           |
| icu             | 2020-03-24 | 2021-09-11 | 8179           |
| pkrc            | 2020-03-28 | 2021-09-11 | 6185           |
| population      | -          | -          | 17             |
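A table like the one above can be produced programmatically. The following is a minimal sketch on made-up toy frames; `toy_datasets` and `summarise_date_range` are hypothetical names, and the real notebook would pass the loaded dataframes instead.

```python
import pandas as pd

# Toy stand-ins for two of the loaded datasets (values are made up for illustration)
toy_datasets = {
    "cases_toy": pd.DataFrame({"date": ["2020-01-25", "2020-01-26", "2020-01-27"],
                               "cases_new": [4, 0, 1]}),
    "tests_toy": pd.DataFrame({"date": ["2020-01-24", "2020-01-25"],
                               "pcr": [2, 5]}),
}

def summarise_date_range(datasets):
    """Return one row per dataset with its first date, last date, and row count."""
    rows = []
    for name, df in datasets.items():
        # ISO yyyy-mm-dd strings sort chronologically, so min/max work on the raw strings
        rows.append({"dataset": name,
                     "start_date": df["date"].min(),
                     "end_date": df["date"].max(),
                     "rows": len(df)})
    return pd.DataFrame(rows)

summary = summarise_date_range(toy_datasets)
print(summary)
```

Datasets without a `date` column, such as `population`, would need to be skipped or handled separately.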

Some helper functions for exploratory data analysis.

In [None]:
# add a 'month' column by using 'date'
def add_month(df):
    df['month'] = df['date'].str[:-3]
    return df

# group data by months and count
def count_by_months(df):
    count_df = df.groupby(['month']).count()
    count_df.reset_index(inplace = True)
    return count_df

# group data by months and get mean
def avg_by_months(df):
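    # average only the numeric columns; text columns such as 'date' cannot be averaged
    avg_df = df.groupby(['month']).mean(numeric_only=True)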
    avg_df.reset_index(inplace = True)
    return avg_df
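When reading the monthly charts, it helps to remember that `count()` returns the number of non-missing records in each month, not the total of the values. The sketch below, on a made-up toy frame, contrasts count, sum, and mean for the same grouping.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "date": ["2021-07-30", "2021-07-31", "2021-08-01"],
    "cases_new": [10, 20, 30],
})

# same slicing as add_month: 'yyyy-mm-dd' -> 'yyyy-mm'
toy["month"] = toy["date"].str[:-3]

monthly_count = toy.groupby("month")["cases_new"].count()  # number of records per month
monthly_sum = toy.groupby("month")["cases_new"].sum()      # total cases per month
monthly_mean = toy.groupby("month")["cases_new"].mean()    # average daily cases per month

print(monthly_count, monthly_sum, monthly_mean, sep="\n\n")
```

Which aggregate is appropriate depends on the question: counts show data coverage, while sums or means show the size of the outbreak.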

### 1.1 Cases and Testing

1. `date`: yyyy-mm-dd format; data correct as of 1200hrs on that date
2. `state`: name of state (present in state file, but not country file)
3. `cases_new`: cases reported in the 24h since the last report (except for 16th March 2020, for which the data is cumulative)
4. `cluster_x`: cases attributable to clusters under category `x`; possible values for `x` are import, religious, community, highRisk, education, detentionCentre, and workplace; the difference between `cases_new` and the sum of cases attributable to clusters is the number of sporadic cases.
5. `rtk-ag`: number of tests done using Antigen Rapid Test Kits (RTK-Ag)
6. `pcr`: number of tests done using Real-time Reverse Transcription Polymerase Chain Reaction (RT-PCR) technology

#### 1.1.1 cases_malaysia.csv

Daily recorded COVID-19 cases at country level.

In [None]:
cases_malaysia

In [None]:
cases_malaysia.info()

In [None]:
cases_malaysia.describe()

In [None]:
plt.figure(figsize=(40,40))

cols = ["cases_new", "cases_import", "cases_recovered", "cluster_import", "cluster_religious", "cluster_community", "cluster_highRisk", "cluster_education", "cluster_detentionCentre", "cluster_workplace"]

for i in range(len(cols)):
    plt.subplot(len(cols),1,1+i)
    sns.lineplot(x='date', y=cols[i], data=cases_malaysia)

In [None]:
cases_malaysia_months = add_month(cases_malaysia)
cases_malaysia_months_count = count_by_months(cases_malaysia_months)
#cases_malaysia_months_count

In [None]:
plt.figure(figsize=(40,40))

for i in range(len(cols)):
    plt.subplot(len(cols),1,1+i)
    sns.barplot(x='month', y=cols[i], data=cases_malaysia_months_count, palette='rocket')

#### 1.1.2 cases_state.csv

Daily recorded COVID-19 cases at state level.

In [None]:
cases_state.info()

In [None]:
cases_state

#### 1.1.3 clusters.csv

Exhaustive list of announced clusters with relevant epidemiological datapoints.

In [None]:
clusters.info()

In [None]:
clusters

In [None]:
clusters_unique_state = clusters['state'].unique()
print(clusters_unique_state)
print(len(clusters_unique_state))

clusters_unique_district = clusters['district'].unique()
print(clusters_unique_district)
print(len(clusters_unique_district))

In [None]:
clusters_unique_category = clusters['category'].unique()
print(clusters_unique_category)
print(len(clusters_unique_category))

clusters_unique_status = clusters['status'].unique()
print(clusters_unique_status)
print(len(clusters_unique_status))

#### 1.1.4 tests_malaysia.csv

Daily tests (note: not necessarily unique individuals) by type at country level.

1. `date`: yyyy-mm-dd format; data correct as of 1200hrs on that date
2. `rtk-ag`: number of tests done using Antigen Rapid Test Kits (RTK-Ag)
3. `pcr`: number of tests done using Real-time Reverse Transcription Polymerase Chain Reaction (RT-PCR) technology

In [None]:
tests_malaysia.info()

In [None]:
tests_malaysia

In [None]:
tests_malaysia_months = add_month(tests_malaysia)
tests_malaysia_months_count = count_by_months(tests_malaysia_months)
#tests_malaysia_months_count

In [None]:
plt.figure(figsize=(40,10))

cols = ["rtk-ag", "pcr"]

for i in range(len(cols)):
    plt.subplot(len(cols),1,1+i)
    sns.barplot(x='month', y=cols[i], data=tests_malaysia_months_count, palette='rocket')

#### 1.1.5 tests_state.csv

Daily tests (note: not necessarily unique individuals) by type at state level.

1. `date`: yyyy-mm-dd format; data correct as of 1200hrs on that date
2. `state`: name of state (present in state file, but not country file)
3. `rtk-ag`: number of tests done using Antigen Rapid Test Kits (RTK-Ag)
4. `pcr`: number of tests done using Real-time Reverse Transcription Polymerase Chain Reaction (RT-PCR) technology

In [None]:
tests_state.info()

In [None]:
tests_state

### 1.2 Deaths

1. `date`: yyyy-mm-dd format; data correct as of 1200hrs on that date
2. `state`: name of state (present in state file, but not country file)
3. `deaths_new`: deaths due to COVID-19 reported in the 24h since the last report

#### 1.2.1 deaths_malaysia.csv

Daily deaths due to COVID-19 at country level.

In [None]:
deaths_malaysia.info()

In [None]:
deaths_malaysia

#### 1.2.2 deaths_state.csv

Daily deaths due to COVID-19 at state level.

In [None]:
deaths_state.info()

In [None]:
deaths_state

### 1.3 Healthcare

#### 1.3.1 hospital.csv

Flow of patients to/out of hospitals, with capacity and utilisation.

1. `date`: yyyy-mm-dd format; data correct as of 2359hrs on that date
2. `state`: name of state, with similar qualification on exhaustiveness of date-state combos as PKRC data
3. `beds`: total hospital beds (with related medical infrastructure)
4. `beds_covid`: total beds dedicated for COVID-19
5. `beds_noncrit`: total hospital beds for non-critical care
6. `admitted_x`: number of individuals in category `x` admitted to hospitals, where `x` can be suspected/probable, COVID-19 positive, or non-COVID
7. `discharged_x`: number of individuals in category `x` discharged from hospitals
8. `hosp_x`: total number of individuals in category `x` in hospitals; this is a stock variable altered by flows from admissions and discharges


In [None]:
hospital.info()

In [None]:
hospital

#### 1.3.2 icu.csv

Capacity and utilisation of intensive care unit (ICU) beds.

1. `date`: yyyy-mm-dd format; data correct as of 2359hrs on that date
2. `state`: name of state, with similar qualification on exhaustiveness of date-state combos as PKRC data
3. `beds_icu`: total gazetted ICU beds
4. `beds_icu_rep`: total beds aside from (3) which are temporarily or permanently designated to be under the care of Anaesthesiology & Critical Care departments
5. `beds_icu_total`: total critical care beds available (with related medical infrastructure)
6. `beds_icu_covid`: total critical care beds dedicated for COVID-19
7. `vent`: total available ventilators
8. `vent_port`: total available portable ventilators
9. `icu_x`: total number of individuals in category `x` under intensive care, where `x` can be  suspected/probable, COVID-19 positive, or non-COVID; this is a stock variable
10. `vent_x`: total number of individuals in category `x` on mechanical ventilation, where `x` can be suspected/probable, COVID-19 positive, or non-COVID; this is a stock variable

In [None]:
icu.info()

In [None]:
icu

#### 1.3.3 pkrc.csv

Flow of patients to/out of Covid-19 Quarantine and Treatment Centres (PKRC), with capacity and utilisation.

1. `date`: yyyy-mm-dd format; data correct as of 2359hrs on that date
2. `state`: name of state; note that (unlike with other datasets), it is not necessary that there be an observation for every state on every date. For instance, there are no PKRCs in W.P. Kuala Lumpur and W.P. Putrajaya.
3. `beds`: total PKRC beds (with related medical infrastructure)
4. `admitted_x`: number of individuals in category `x` admitted to PKRCs, where `x` can be suspected/probable, COVID-19 positive, or non-COVID
5. `discharged_x`: number of individuals in category `x` discharged from PKRCs
6. `pkrc_x`: total number of individuals in category `x` in PKRCs; this is a stock variable altered by flows from admissions and discharges


In [None]:
pkrc.info()

In [None]:
pkrc

### 1.4 Static Data

#### 1.4.1 population.csv

Total, adult (≥18), and elderly (≥60) population at state level.

In [None]:
population.info()

In [None]:
population

In [None]:
state_population = population.drop(0) # drop Malaysia population
state_population.reset_index(inplace=True, drop=True)

plt.figure(figsize=(30,30))
plt.subplot(311)
sns.barplot(x='state', y='pop', data=state_population, palette='mako')
plt.subplot(312)
sns.barplot(x='state', y='pop_18', data=state_population, palette='mako')
plt.subplot(313)
sns.barplot(x='state', y='pop_60', data=state_population, palette='mako')

### 1.5 Outliers Detection

In [None]:
def check_outlier(df):
    # describe() keeps only the numeric columns
    numeric_columns = df.describe().columns
    num_of_columns = len(numeric_columns)

    # create one figure, with one boxplot row per numeric column
    plt.figure(figsize=(20, 5 * num_of_columns))
    for count, column in enumerate(numeric_columns):
        plt.subplot(num_of_columns, 1, count + 1)
        plot_boxplot(df[column], xlabel=column)
    plt.tight_layout()
    plt.show()

def plot_boxplot(series, title='', xlabel=''):
    bp = sns.boxplot(x=series)
    bp.set(title=title,
           xlabel=xlabel)
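Boxplots show outliers visually; a numeric count can complement them. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule in seaborn's boxplot, to a made-up toy frame. The helper name `count_iqr_outliers` is hypothetical and not part of the notebook.

```python
import pandas as pd

def count_iqr_outliers(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for every numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # comparisons align the per-column bounds with the dataframe columns
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum()

# Toy data for illustration only
toy = pd.DataFrame({"cases_new": [1, 2, 3, 4, 100], "pcr": [5, 6, 7, 8, 9]})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

For epidemic data, a flagged value is not necessarily an error: a genuine surge in cases also falls outside the whiskers, so flagged points should be inspected before being removed.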

In [None]:
check_outlier(cases_malaysia)

In [None]:
check_outlier(cases_state)

In [None]:
check_outlier(clusters)

In [None]:
check_outlier(tests_malaysia)

In [None]:
check_outlier(tests_state)

### 1.6 Identifying Missing Values

In [None]:
def get_nan(df):
    missing_value = df[df.isna().values.any(axis=1)]
    rows = missing_value.shape[0]
    print(rows, "rows with missing values")
    
    if rows > 0:
        print(df.isna().sum(), "\n")

In [None]:
get_nan(cases_malaysia)
get_nan(cases_state)
get_nan(clusters)
get_nan(tests_malaysia)
get_nan(tests_state)

In [None]:
get_nan(deaths_malaysia)
get_nan(deaths_state)

In [None]:
get_nan(hospital)
get_nan(icu)
get_nan(pkrc)

In [None]:
get_nan(population)

**CSV files with missing values:**

1. `cases_malaysia` - 342 out of 596 rows
2. `cases_state` - 176 out of 9536 rows 

## 2.0 Data Transformation

In [None]:
# list of unique states
states = tests_state['state'].unique()

print(len(states))
for s in states:
    print(s)

In [None]:
# plot a time series scatterplot according to column
# suitable for: cases_state, tests_state, deaths_state

def plot_scatterplot(df, column, state_list):
    for state in state_list:
        temp_df = df[df['state'] == state].copy()
        temp_df.reset_index(inplace=True, drop=True)

        ax = sns.scatterplot(x='date', y=column, data=temp_df)
        plt.title(state)
        plt.show()

In [None]:
plot_scatterplot(cases_state, 'cases_new', states)

### 2.1 Merging

Merging `cases_state`, `tests_state`, and `deaths_state`.

However, these 3 datasets have different start and end dates.

| Dataset         | Start date | End date   | Number of rows |
| --------------- | ---------- | ---------- | -------------- |
| cases_state     | 2020-01-25 | 2021-09-11 | 9536           |
| tests_state     | 2021-07-01 | 2021-09-08 | 1120           |
| deaths_state    | 2020-03-17 | 2021-09-11 | 8704           |

Hence, we selected the timeframe `2021-07-01` to `2021-09-08` (70 days), which is covered by all 3 datasets. First, we performed an inner merge of `tests_state` with `cases_state`. Then we merged the resulting dataframe with `deaths_state`.

In [None]:
tests_state

In [None]:
# merge tests_state and cases_state together
df_inner_join = (pd.merge(tests_state, cases_state, on='date', how='inner', indicator=True)).set_index('date')
df_cases_state = df_inner_join[df_inner_join['state_x'] == df_inner_join['state_y']].copy()
df_cases_state.drop(['state_y', '_merge'], axis=1, inplace=True)

# merge deaths_state
df_inner_join_2 = (pd.merge(df_cases_state, deaths_state, on='date', how='inner', indicator=True)).set_index('date')
df_cases_tests_deaths = df_inner_join_2[df_inner_join_2['state_x'] == df_inner_join_2['state']].copy()
df_cases_tests_deaths.drop(['state', '_merge'], axis=1, inplace=True)
df_cases_tests_deaths.rename(columns={'state_x': 'state'}, inplace=True)

# reset index
df_cases_tests_deaths.reset_index(inplace = True)

# rearranging the columns
cols = df_cases_tests_deaths.columns.tolist()
df_cases_tests_deaths = df_cases_tests_deaths[cols[:2]+cols[4:7]+cols[2:4]+cols[7:]]
df_cases_tests_deaths
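An alternative way to express the same join is to merge on both `date` and `state` at once, which avoids building every state-to-state combination per date and then filtering. The following is a minimal sketch on made-up toy frames (`tests_toy`, `cases_toy`, `deaths_toy` are hypothetical), not a replacement for the cell above.

```python
import pandas as pd

# Toy data for illustration only
tests_toy = pd.DataFrame({"date": ["2021-07-01", "2021-07-01", "2021-07-02"],
                          "state": ["Johor", "Pahang", "Johor"],
                          "pcr": [100, 50, 120]})
cases_toy = pd.DataFrame({"date": ["2021-06-30", "2021-07-01", "2021-07-01", "2021-07-02"],
                          "state": ["Johor", "Johor", "Pahang", "Johor"],
                          "cases_new": [5, 7, 3, 9]})
deaths_toy = pd.DataFrame({"date": ["2021-07-01", "2021-07-01", "2021-07-02"],
                           "state": ["Johor", "Pahang", "Johor"],
                           "deaths_new": [1, 0, 2]})

# An inner merge on both keys keeps only the (date, state) pairs present in all three frames
merged = (tests_toy
          .merge(cases_toy, on=["date", "state"], how="inner")
          .merge(deaths_toy, on=["date", "state"], how="inner"))
print(merged)
```

The row from `2021-06-30` is dropped because it has no match in the other two frames, which mirrors how the real merge restricts the data to the common timeframe.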

In [None]:
df_cases_tests_deaths.info()

Checking if there is any missing value.

In [None]:
get_nan(df_cases_tests_deaths)

Checking for outliers.

In [None]:
check_outlier(df_cases_tests_deaths)

The columns in `df_cases_tests_deaths`:

1. `date`: yyyy-mm-dd format; data correct as of 1200hrs on that date
2. `state`: name of state (present in state file, but not country file) 
3. `cases_import`: imported cases reported in the 24h since the last report
4. `cases_new`: cases reported in the 24h since the last report (except for 16th March 2020, for which the data is cumulative)
5. `cases_recovered`: recovered cases reported in the 24h since the last report
6. `rtk-ag`: number of tests done using Antigen Rapid Test Kits (RTK-Ag)
7. `pcr`: number of tests done using Real-time Reverse Transcription Polymerase Chain Reaction (RT-PCR) technology
8. `deaths_new`: deaths due to COVID-19 reported in the 24h since the last report
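9. `deaths_new_dod`: deaths due to COVID-19 based on date of death
10. `deaths_bid`: deaths due to COVID-19 which were brought-in dead, based on date reported to public
11. `deaths_bid_dod`: deaths due to COVID-19 which were brought-in dead, based on date of death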

### 2.2 Binning

Binning is applied to the `cases_new` column of the cleaned data above. This is essential for building a classification model later.

In [None]:
sns.displot(df_cases_tests_deaths, x='cases_new')
plt.show()

In [None]:
bins = np.linspace(min(df_cases_tests_deaths['cases_new']), max(df_cases_tests_deaths['cases_new']), 4)
bins

In [None]:
# pd.cut assigns labels from the lowest bin to the highest, so list them in ascending order
group_names = ['Low', 'Moderate', 'High']
df_cases_tests_deaths['cases_new_binned'] = pd.cut(df_cases_tests_deaths['cases_new'], bins, labels=group_names, include_lowest=True)

df_cases_tests_deaths.head()

In [None]:
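# value_counts() sorts by frequency, so reorder the counts to match group_names
binned_counts = df_cases_tests_deaths['cases_new_binned'].value_counts().reindex(group_names)
plt.bar(group_names, binned_counts)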
plt.title('Daily Cases Distribution')
plt.xlabel('Daily New Cases')
plt.ylabel('Frequency')
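The order of the labels matters in equal-width binning: `pd.cut` assigns the first label to the lowest interval and the last label to the highest. The sketch below checks this on a small made-up series.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
values = pd.Series([0, 10, 50, 90, 100])

# 4 evenly spaced edges between min and max give 3 equal-width bins
edges = np.linspace(values.min(), values.max(), 4)

# labels are listed from the lowest bin upwards
labels = ["Low", "Moderate", "High"]

binned = pd.cut(values, edges, labels=labels, include_lowest=True)
print(pd.DataFrame({"value": values, "bin": binned}))
```

Because daily case counts tend to be right-skewed, equal-width bins usually put most rows into the lowest bin; quantile-based binning with `pd.qcut` is one alternative if balanced classes are preferred.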

## 3.0 Correlation Check

This section looks for features that are strong indicators of daily cases in each state.

### 3.1 Correlation within a State

#### 3.1.1 Heatmaps

In [None]:
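# correlate numeric columns only ('date', 'state' and the binned label are not numeric)
corr_msia = df_cases_tests_deaths.select_dtypes(include='number').corr()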
plt.figure(figsize=(8, 8))
sns.heatmap(corr_msia,square=True,annot=True)
plt.title('Malaysia')

In [None]:
# plot heatmap of all states
for state in states:
    df_state = df_cases_tests_deaths[df_cases_tests_deaths['state'] == state].copy()
    corr_state = df_state.select_dtypes(include='number').corr()
    
    plt.figure(figsize=(8, 8))
    sns.heatmap(corr_state,square=True,annot=True)
    plt.title(state)
    plt.show()

Upon checking the heatmap of each state, we noticed that **Perlis** and **W.P. Putrajaya** have interesting patterns.

In [None]:
df_perlis = df_cases_tests_deaths[df_cases_tests_deaths['state'] == 'Perlis'].copy()
df_perlis.describe()

In [None]:
df_putrajaya = df_cases_tests_deaths[df_cases_tests_deaths['state'] == 'W.P. Putrajaya'].copy()
df_putrajaya.describe()

According to our data, **Perlis** has 0 `cases_import` in total and **W.P. Putrajaya** has 0 `deaths_bid` and `deaths_bid_dod`.
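One likely explanation for the unusual heatmaps is that a column which never changes has zero variance, so its Pearson correlation with any other column is undefined and shows up as `NaN`. The sketch below reproduces this on a made-up toy frame.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "cases_new": [10, 12, 9, 15],
    "cases_import": [0, 0, 0, 0],   # constant column, like a state with no imported cases
    "pcr": [100, 130, 90, 160],
})

corr = toy.corr()
print(corr)
```

The `cases_import` entries come out as `NaN`, while the other pairs still get an ordinary coefficient.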

#### 3.1.2 Feature Importance

We calculate feature importance scores here; they inform the regression and classification models built later.

In [None]:
df_pahang = df_cases_tests_deaths[df_cases_tests_deaths['state'] == 'Pahang'].copy()
df_pahang.drop(['date', 'state'], axis=1, inplace=True)
df_pahang.reset_index(inplace=True, drop=True)
df_pahang

In [None]:
X = df_pahang.loc[:, df_pahang.columns.difference(['cases_new', 'cases_new_binned'])]
y = df_pahang.loc[:, df_pahang.columns == 'cases_new']

In [None]:
# define the model
model = LinearRegression()
# fit the model
model.fit(X, y)
# get importance
importance = model.coef_[0]

In [None]:
for i, col in enumerate(X.columns):
    print(col, importance[i])

In [None]:
from sklearn.tree import DecisionTreeRegressor

# define the model
model = DecisionTreeRegressor()
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_

for i, col in enumerate(X.columns):
    print(col, importance[i])

In [None]:

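# decision tree for feature importance on a classification problem
# (synthetic demo data, kept in separate variables so the Pahang X and y are not overwritten)
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define the model
model = DecisionTreeClassifier()
# fit the model
model.fit(X_demo, y_demo)
# get importance
importance = model.feature_importances_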

In [None]:
# separates the date column into year, month, day and drops the column

def separate_df_date(df):
    df_temp = df.copy()

    # split 'yyyy-mm-dd' into its three components
    date = df_temp['date'].str.split('-', n=2, expand=True)
    date.columns = ['year', 'month', 'day']

    df_temp = df_temp.drop(['date'], axis=1)

    # put the new date columns in front of the remaining columns
    df_temp = pd.concat([date, df_temp], axis=1)

    return df_temp

### 3.2 Correlation between States

In [None]:
# selects a target state and compare with other states

def check_correlation(df, column, target_state, plot=False):
    
    target_df = df[df['state'] == target_state].copy()
    target_df.reset_index(inplace=True, drop=True)
    
    state_list = df['state'].unique()
    for state in state_list:
        if state == target_state:
            continue
        
        temp_df = df[df['state'] == state].copy()
        temp_df.reset_index(inplace=True, drop=True)
        
        print(target_state, '&', state, target_df[column].corr(temp_df[column]))
        
        if plot: # plots scatterplot
            plt.scatter(target_df[column], temp_df[column])
            plt.xlabel(target_state)
            plt.ylabel(state)
            plt.show()

In [None]:
df_cases_tests_deaths

In [None]:
numeric_columns = df_cases_tests_deaths.describe().columns

for col in numeric_columns:
    print('=============', col, '=============')
    check_correlation(df_cases_tests_deaths, col, 'Pahang')
    print()

In [None]:
for col in numeric_columns:
    print('=============', col, '=============')
    check_correlation(df_cases_tests_deaths, col, 'Johor')
    print()
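The printed pairs can be hard to compare by eye. One possible way to rank them is to reshape the data so that each state becomes a column, then sort the correlations with the target state. This is a minimal sketch on made-up toy data; `rank_states_by_correlation` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy data for illustration only: three states observed on the same four dates
toy = pd.DataFrame({
    "date":  ["d1", "d2", "d3", "d4"] * 3,
    "state": ["Pahang"] * 4 + ["Johor"] * 4 + ["Kedah"] * 4,
    "cases_new": [1, 2, 3, 4,   2, 4, 6, 9,   4, 3, 2, 1],
})

def rank_states_by_correlation(df, column, target_state):
    """Correlate every state's series with the target state's series, strongest first."""
    # one row per date, one column per state; rows are aligned by date, not by position
    wide = df.pivot(index="date", columns="state", values=column)
    corr_with_target = wide.corr()[target_state].drop(target_state)
    return corr_with_target.sort_values(ascending=False)

ranking = rank_states_by_correlation(toy, "cases_new", "Pahang")
print(ranking)
```

Aligning on the date index has the side benefit that a state with a missing date does not shift the comparison out of step.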

## 3.0 Building Prediction Models

We select data from four states (Pahang, Kedah, Johor, and Selangor) to build the regression and classification models.

In [None]:
df_predict = df_cases_tests_deaths.copy()

df_predict = df_predict.loc[(df_predict['state']=='Pahang') | (df_predict['state'] == 'Kedah') |
                            (df_predict['state'] == 'Johor') | (df_predict['state'] == 'Selangor'), :]

df_predict.drop(['date', 'state'], axis=1, inplace=True)
df_predict

### 3.1 Regression Models

Regression models that will be used:
1. Linear Regression
2. Decision Tree Regressor
3. Random Forest Regressor

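Evaluation metrics that will be used:
1. R-squared (R²): at most 1, and higher is better; it can be negative when the model fits worse than always predicting the mean
2. Mean Absolute Error (MAE)
3. Mean Squared Error (MSE) / Root Mean Squared Error (RMSE)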
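To make the metrics concrete, the sketch below computes each of them from its definition with NumPy on made-up numbers; `y_true` and `y_hat` are illustrative values, not results from the models in this notebook.

```python
import numpy as np

# Toy values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))   # average absolute error
mse = np.mean(residuals ** 2)      # average squared error
rmse = np.sqrt(mse)                # same units as the target

ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

MAE and RMSE are expressed in daily cases, which makes them easier to interpret than MSE; RMSE penalises large misses more heavily than MAE.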

In [None]:
X = df_predict.loc[:, df_predict.columns.difference(['cases_new', 'cases_new_binned'])]
y = df_predict.loc[:, df_predict.columns == 'cases_new']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#### 3.1.1 Linear Regression

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [None]:
reg = LinearRegression()

reg = reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
y_pred

In [None]:
r2_score(y_test, y_pred)

In [None]:
mean_absolute_error(y_test, y_pred)

#### 3.1.2 Decision Tree Regressor

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

In [None]:
# the default criterion is mean squared error; the name 'mse' was removed in newer scikit-learn releases
reg = DecisionTreeRegressor(splitter='random')

In [None]:
reg = reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
y_pred

In [None]:
r2_score(y_test, y_pred)

In [None]:
mean_absolute_error(y_test, y_pred)

#### 3.1.3 Random Forest Regressor
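This section has no code yet. The following is only a sketch of how a random forest regressor could be fitted and scored with the same two metrics, run on synthetic data from `make_regression`. The variable names (`X_demo`, `rf_reg`, and so on) and the hyperparameters are assumptions; in the notebook the inputs would be the `X_train`, `X_test`, `y_train`, and `y_test` splits of `df_predict`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# An ensemble of decision trees; random_state fixes the bootstrap samples for repeatability
rf_reg = RandomForestRegressor(n_estimators=100, random_state=0)
rf_reg.fit(Xd_train, yd_train)

yd_pred = rf_reg.predict(Xd_test)

rf_r2 = r2_score(yd_test, yd_pred)
rf_mae = mean_absolute_error(yd_test, yd_pred)
print(f"R2={rf_r2:.3f}  MAE={rf_mae:.3f}")
```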

### 3.2 Classification Models

Classification models that will be used:
1. K-Nearest Neighbors Classifier
2. Naive Bayes Classifier
3. Decision Tree Classifier
4. Random Forest Classifier

Performance evaluation metrics that will be used:
1. Confusion Matrix
2. ROC Curve

We binned the daily new cases into 3 categories.

In [None]:
print(df_predict['cases_new_binned'].value_counts())

plt.bar(df_predict['cases_new_binned'].value_counts().index, 
        df_predict['cases_new_binned'].value_counts())

plt.title('Daily Cases Distribution')
plt.xlabel('Daily New Cases')
plt.ylabel('Frequency')

However, the result shows that the data is highly imbalanced.

In [None]:
# encoding
encode = {"High": 0, "Moderate": 1, "Low": 2}

df_predict_2 = df_predict.copy()
df_predict_2['cases_new_binned'] = df_predict_2['cases_new_binned'].map(encode)

In [None]:
X = df_predict_2.loc[:, df_predict_2.columns.difference(['cases_new', 'cases_new_binned'])]
y = df_predict_2[['cases_new_binned']]

print(X.shape)
print(y.shape)

Applying **SMOTE** to balance the data.

Reference: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification

In [None]:
print("--Before--")
counter = Counter(y['cases_new_binned'])
print(counter)

oversample = SMOTE()
X, y = oversample.fit_resample(X, y)

print("--After--")
counter = Counter(y['cases_new_binned'])
print(counter)

print(X.shape)
print(y.shape)

In [None]:
y.cases_new_binned.value_counts()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#### 3.2.1 K-Nearest Neighbors (KNN) Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [None]:
clf_KNN = KNeighborsClassifier(n_neighbors=3)

clf_KNN = clf_KNN.fit(X_train, np.ravel(y_train))

y_pred = clf_KNN.predict(X_test)

print("Accuracy on training set: {:.5f}".format(clf_KNN.score(X_train, y_train)))
print("Accuracy on test set:     {:.5f}".format(clf_KNN.score(X_test, y_test)))

In [None]:
fig,ax = plt.subplots(figsize = (5,5))

ax.set_title("K-Nearest Neighbors Confusion Matrix")

cm = confusion_matrix(y_test, y_pred)
display = ConfusionMatrixDisplay(confusion_matrix=cm)
display.plot(cmap = 'Greens', 
             xticks_rotation ='vertical', 
             ax=ax)

#### 3.2.2 Naïve Bayes Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [None]:
clf_NB = GaussianNB()

clf_NB = clf_NB.fit(X_train, np.ravel(y_train))

y_pred = clf_NB.predict(X_test)

print("Accuracy on training set: {:.5f}".format(clf_NB.score(X_train, y_train)))
print("Accuracy on test set:     {:.5f}".format(clf_NB.score(X_test, y_test)))

In [None]:
fig,ax = plt.subplots(figsize = (5,5))

ax.set_title("Naive Bayes Classifier Confusion Matrix")

cm = confusion_matrix(y_test, y_pred)
display = ConfusionMatrixDisplay(confusion_matrix=cm)
display.plot(cmap = 'Greens', 
             xticks_rotation ='vertical', 
             ax=ax)

#### 3.2.3 Decision Tree Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [None]:
clf_DT = DecisionTreeClassifier(criterion='entropy', max_depth=7, splitter='best')

clf_DT = clf_DT.fit(X_train, y_train)

y_pred = clf_DT.predict(X_test)

print("Accuracy on training set: {:.5f}".format(clf_DT.score(X_train, y_train)))
print("Accuracy on test set:     {:.5f}".format(clf_DT.score(X_test, y_test)))

In [None]:
fig,ax = plt.subplots(figsize = (5,5))

ax.set_title("Decision Tree Classifier Confusion Matrix")

cm = confusion_matrix(y_test, y_pred)
display = ConfusionMatrixDisplay(confusion_matrix=cm)
display.plot(cmap = 'Greens', 
             xticks_rotation ='vertical', 
             ax=ax)

#### 3.2.4 Random Forest Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
clf_RF = RandomForestClassifier(max_depth=10, random_state=0)


clf_RF = clf_RF.fit(X_train, np.ravel(y_train))

y_pred = clf_RF.predict(X_test)

print("Accuracy on training set: {:.5f}".format(clf_RF.score(X_train, y_train)))
print("Accuracy on test set:     {:.5f}".format(clf_RF.score(X_test, y_test)))

In [None]:
fig,ax = plt.subplots(figsize = (5,5))

ax.set_title("Random Forest Classifier Confusion Matrix")

cm = confusion_matrix(y_test, y_pred)
display = ConfusionMatrixDisplay(confusion_matrix=cm)
display.plot(cmap = 'Greens', 
             xticks_rotation ='vertical', 
             ax=ax)
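Per-class precision, recall, and F1 can be read directly off a confusion matrix. The sketch below does this with NumPy on a made-up 3-class matrix, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Toy 3-class confusion matrix for illustration only
# rows = true class, columns = predicted class
cm_toy = np.array([[8, 1, 1],
                   [2, 6, 2],
                   [0, 3, 7]])

true_pos = np.diag(cm_toy)                # correctly classified samples per class

precision = true_pos / cm_toy.sum(axis=0)  # correct / everything predicted as that class
recall = true_pos / cm_toy.sum(axis=1)     # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm_toy.sum()

print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
print("accuracy: ", accuracy)
```

With imbalanced classes, these per-class figures say more than overall accuracy, since a model can score well overall while missing most of a rare class.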

In [None]:
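# column 1 of predict_proba is the probability of class 1 ("Moderate"),
# so the curves below are one-vs-rest ROC curves for that class only
prob_KNN = clf_KNN.predict_proba(X_test)[:,1]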
prob_NB = clf_NB.predict_proba(X_test)[:,1]
prob_DT = clf_DT.predict_proba(X_test)[:,1]
prob_RF = clf_RF.predict_proba(X_test)[:,1]

proba = [prob_KNN, prob_NB, prob_DT, prob_RF]
color = ['orange', 'blue', 'purple', 'red']
label = ['KNN', 'Naive Bayes', 'Decision Tree', 'Random Forest']


plt.figure(figsize=(10, 8))
for i in range(4):
    fpr, tpr, thresholds = roc_curve(y_test, proba[i], pos_label=1)
    plt.plot(fpr, tpr, color=color[i], label=label[i]) 
    
plt.plot([0, 1], [0, 1], color='green', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
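With three classes, a single ROC curve describes only one class against the rest. One way to summarise all classes is the one-vs-rest AUC per class and its macro average. The sketch below runs on synthetic data from `make_classification`; all names in it (`Xc`, `clf_demo`, `per_class_auc`, and so on) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in data for illustration only
Xc, yc = make_classification(n_samples=300, n_features=6, n_informative=4,
                             n_redundant=0, n_classes=3, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.3, random_state=0, stratify=yc)

clf_demo = RandomForestClassifier(random_state=0).fit(Xc_train, yc_train)
proba_demo = clf_demo.predict_proba(Xc_test)   # one probability column per class

# One-vs-rest AUC: treat each class in turn as "positive" and the others as "negative"
per_class_auc = {}
for k, cls in enumerate(clf_demo.classes_):
    per_class_auc[int(cls)] = roc_auc_score(yc_test == cls, proba_demo[:, k])

# Macro average over the classes, computed by scikit-learn directly
macro_auc = roc_auc_score(yc_test, proba_demo, multi_class="ovr")

print(per_class_auc)
print("macro one-vs-rest AUC:", macro_auc)
```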


## Questions

**(i) Discuss the exploratory data analysis steps you have conducted including detection of outliers and missing values?**

```
ok
```

**(ii) What are the states that exhibit strong correlation with (i) Pahang, and (ii) Johor?**

```
multi-dimension data
(matplotlib)
```

**(iii) What are the strong features/indicators to daily cases for (i) Pahang, (ii) Kedah,
(iii) Johor, and (iv) Selangor? [Note: you must use at least 2 methods to justify
your findings]**


```
one more method

feature importance
```

**(iv) Comparing regression and classification models, what model performs well in
predicting the daily cases for (i) Pahang, (ii) Kedah, (iii) Johor, and (iv) Selangor?
Requirements:**
1. Use TWO(2) regression models and TWO(2) classification models
2. Use appropriate evaluation metrics.


## References

https://machinelearningmastery.com/calculate-feature-importance-with-python/

https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e