# 1 Setup

This section imports a few common modules, ensures that Matplotlib plots figures inline, and prepares a function to save the figures.

It also checks that Python 3.5 or later is installed.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

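# Scikit-Learn ≥0.20 is required
import sklearn
from packaging import version  # assumes the packaging library is installed (Matplotlib depends on it)
# Parse the versions first: comparing raw strings would rank "0.9" above "0.20"
assert version.parse(sklearn.__version__) >= version.parse("0.20")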

# Importing useful libraries
import numpy as np
import os
from typing import Dict, List, Tuple

# The typing module has no Array type; NumPy's ndarray is used for array annotations
Array = np.ndarray

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images")
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    """Save the current Matplotlib figure to IMAGES_PATH as <fig_id>.<fig_extension>."""
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# 2 Binary Classification

Read the contents of training.csv and testing.csv. Both files are assumed to be in the root directory of this project.

In [None]:
import pandas as pd

testing_data = pd.read_csv('testing.csv')
training_data = pd.read_csv('training.csv')

## 2.1 Displaying the First Lines and Visualising the Dataset

### 2.1.1 Displaying the Training Dataset

In [None]:
# Displaying the first 5 rows of the dataset
training_data.head()

In [None]:
training_data.describe()

The cell below plots a histogram of every numeric attribute in the training dataset, using 50 bins per attribute, and saves the figure to the images folder.

In [None]:
# Plotting the different attributes/columns in the dataset

%matplotlib inline
import matplotlib.pyplot as plt
training_data.hist(bins=50, figsize=(20,15))
save_fig("attribute_histogram_plots_training")
plt.show()

Most of the attribute histograms are roughly bell-shaped. However, each attribute is skewed either to the left or to the right.
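
To put a number on the skew seen in the histograms, pandas can compute the sample skewness of each numeric column. The sketch below runs on a small synthetic DataFrame (the column names and values are made up for illustration and are not taken from training.csv); in the notebook, training_data would take the place of `demo`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical columns, illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=5, size=2000),
    "right_skewed": rng.exponential(scale=10, size=2000),
})

# Skewness near 0 means roughly symmetric; a positive value means a longer
# right tail, a negative value a longer left tail
skewness = demo.skew(numeric_only=True)
print(skewness)
```

A column with a skewness close to zero matches the bell shape, while clearly positive or negative values confirm a right or left skew.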

### 2.1.2 Displaying the Testing Dataset

In [None]:
# Displaying the first 5 rows of the dataset
testing_data.head()

In [None]:
testing_data.describe()

The cell below plots a histogram of every numeric attribute in the testing dataset, using 50 bins per attribute, and saves the figure to the images folder.

In [None]:
# Plotting the different attributes/columns in the dataset

%matplotlib inline
import matplotlib.pyplot as plt
testing_data.hist(bins=50, figsize=(20,15))
save_fig("attribute_histogram_plots_testing")
plt.show()

## 2.2 Simplifying the Classification Task

### 2.2.1 Removing Columns

To simplify the classification task, the following cells delete all columns whose names begin with pred_minus_obs. This leaves 9 columns (features) in both the training and test sets.

In [None]:
# Defining a function that drops, in place, all columns whose names start with a given string
def drop_cols(dataframe, start_string):
    # Collect the matching column names first, then drop them in a single call
    cols = [col for col in dataframe.columns if col.startswith(start_string)]
    dataframe.drop(columns=cols, inplace=True)

In [None]:
# Drop the columns from the two datasets 
drop_cols(training_data, 'pred_minus_obs')
drop_cols(testing_data, 'pred_minus_obs')
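
A quick check can confirm that no matching columns remain after the drop. The sketch below uses a toy DataFrame with made-up column names and a non-mutating variant (the helper name `without_prefix` is hypothetical), which returns a new DataFrame instead of modifying its argument in place.

```python
import pandas as pd

def without_prefix(dataframe: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Return a copy of dataframe without the columns whose names start with prefix."""
    keep = [col for col in dataframe.columns if not col.startswith(prefix)]
    return dataframe[keep].copy()

# Toy data with hypothetical column names (illustration only)
toy = pd.DataFrame({
    "class": ["s", "d"],
    "b1": [39, 84],
    "pred_minus_obs_H_b1": [-1.5, 2.3],
})

trimmed = without_prefix(toy, "pred_minus_obs")
print(list(trimmed.columns))

# The original DataFrame is left untouched
print(list(toy.columns))
```

Returning a copy makes the cell safe to re-run, whereas an in-place drop changes the loaded data for every later cell.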

### 2.2.2 Counting Instances

The following code counts the number of instances for each class label to determine whether the dataset is balanced. The output of value_counts() gives the number of non-null instances per class.

The training data has many instances of the s and d classes but significantly fewer instances of the o and h classes.

The testing data is more balanced across s, d and h. However, it also has few instances of o.

In [None]:
testing_data["class"].value_counts()

In [None]:
training_data["class"].value_counts()
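
Raw counts are easier to compare between two datasets of different sizes when they are expressed as proportions. The sketch below uses a made-up label column (not the counts from training.csv) to show value_counts() with normalize=True next to the absolute counts.

```python
import pandas as pd

# Made-up labels for illustration; not the counts from training.csv
labels = pd.Series(["s", "s", "s", "d", "d", "d", "h", "o"], name="class")

counts = labels.value_counts()                     # absolute number of instances per class
proportions = labels.value_counts(normalize=True)  # share of each class, sums to 1

print(pd.DataFrame({"count": counts, "proportion": proportions}))
```

In the notebook, the same two calls on training_data["class"] and testing_data["class"] would show the class shares side by side.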

### 2.2.3 Feature Scaling

The following code applies feature scaling to the datasets before classification. The StandardScaler class from the sklearn.preprocessing package is used for this purpose. It standardises each feature to zero mean and unit variance; it does not change the shape of a feature's distribution.

In [None]:
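from typing import Tuple

import numpy as np
from sklearn.preprocessing import StandardScaler

def scale(training: pd.DataFrame, testing: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
    # Fit on the training features only, then apply the same transformation to both sets
    scaler = StandardScaler().fit(training)
    training_scaled = scaler.transform(training)
    testing_scaled = scaler.transform(testing)
    return (training_scaled, testing_scaled)

# StandardScaler needs numeric input, so the "class" label column is left out
training_scaled, testing_scaled = scale(training_data.drop(columns="class"),
                                        testing_data.drop(columns="class"))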
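
For each feature, StandardScaler computes z = (x - mean) / std, where the mean and the population standard deviation come from the data passed to fit. The sketch below checks this by hand on made-up numbers (the arrays are illustrative only) and shows that the test set is transformed with the statistics learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices (rows = instances, columns = features)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test = np.array([[2.0, 30.0]])

scaler = StandardScaler().fit(X_train)  # learns per-column mean and std from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same result computed by hand: z = (x - mean) / std, with the population std (ddof=0)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
manual = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_scaled.std(axis=0))   # approximately 1 for every column
print(np.allclose(X_test_scaled, manual))
```

Fitting on the training set alone keeps information from the test set out of the preprocessing step.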

To-do:
1. Separate the data into features (x) and class labels (y)
2. Drop the o and h labels
3. Scale the features
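
The to-do steps above could look roughly like the following. This is only a sketch on made-up data: the feature column names b1 and b2 are hypothetical, and keeping s and d as the two classes is inferred from the note about dropping o and h.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for training.csv (hypothetical feature names b1, b2)
train = pd.DataFrame({
    "class": ["s", "d", "o", "h", "s", "d"],
    "b1": [39.0, 84.0, 53.0, 59.0, 41.0, 80.0],
    "b2": [36.0, 30.0, 25.0, 26.0, 35.0, 31.0],
})

# Keep only the two classes of interest, dropping the o and h rows
# (strip guards against stray whitespace around the labels)
binary = train[train["class"].str.strip().isin(["s", "d"])]

# Separate the features (X) from the class labels (y)
X = binary.drop(columns="class")
y = binary["class"].str.strip()

# Scale the numeric features only
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.shape, sorted(y.unique()))
```

For the real data, the scaler would be fitted on the training features and then reused to transform the testing features.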