# Assignment 3
## Jupyter Notebook - Verschuur L. 1811053, Kolenbrander M. 1653415

Assignment 3 of Advances in Datamining has two main tasks:
- Master classification algorithms
- Master visualisation and dimensionality reduction algorithms

Our submission consists of two notebooks and a Python module that the notebooks import:
- `"AIDM Assignment 3 - Visualization.ipynb"`
- `"AIDM Assignment 3 - Classification.ipynb"`
- `"utility_functions.py"`

This file, `"AIDM Assignment 3 - Classification.ipynb"`, focuses on the classification part of the assignment: predicting a target attribute from a carefully selected set of input attributes. Three classification techniques are used:
- a `Random Forest` classifier
- a `Support Vector Classification` (SVC) classifier
- an `XGBoost` classifier

The first two classification algorithms are implemented using the [scikit-learn library](https://scikit-learn.org/stable/). The last is implemented using the [XGBoost library](https://xgboost.readthedocs.io/en/stable/python/python_intro.html), which supports scikit-learn-style and NumPy input.

For this assignment, the [*Rain in Australia*](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) dataset is used. For more detail, please see the **Data fetching & Data pre-processing** heading. All data is first processed in a Pandas dataframe structure.

One note beforehand concerns the target attribute, **RainTomorrow**. This attribute is binary, so each prediction is either *Rain is predicted* or *No rain is predicted*. In the original data, roughly `78%` of the entries are *No rain is predicted*.
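To put later accuracy scores in perspective, it helps to know what a trivial classifier would achieve on such an imbalanced target. The sketch below uses made-up labels (100 toy entries, 78 of them *No rain*), not the notebook's data, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels mirroring the roughly 78% / 22% split (assumed, not the real data)
toy_labels = [False] * 78 + [True] * 22
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Any classifier on this target should be judged against this baseline rather than against 50%.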

In [2]:
from utility_functions import *

## Parameters

# Convert numeric values into ranged representations
convert_to_range = True
# True: represent ranges as categorical labels (12, b=5 -> 10-14)
# False: round to the nearest multiple of the base (12, b=5 -> 10)
range_categorical = False
# Convert categorical values (not from ranged values) into one hot representations:
# {'smoking': ['sometimes', 'regularly', 'sometimes', 'never']} -> 
# {'smoking_sometimes': [1, 0, 1, 0], 'smoking_regularly': [0, 1, 0, 0], 'smoking_never': [0, 0, 0, 1]}
convert_categorical_to_one_hot = True

## Data fetching & Data pre-processing

The [*Rain in Australia*](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) dataset is a historical dataset used for rain prediction. It contains `145460` entries spread across `23` columns (attributes). The data was gathered across Australia and runs from `2007` through `2017`.

The data is made up of the following attributes:
- **Date** (format yyyy-mm-dd) \[data_type: string\] -> requires converting to \[data_type: datetime64\]
- **Location** \[data_type: string\] (categorical) -> requires converting to One Hot representation
- **MinTemp** \[data_type: float64\]
- **MaxTemp** \[data_type: float64\]
- **Rainfall** \[data_type: float64\]
- **Evaporation** \[data_type: float64\]
- **Sunshine** \[data_type: float64\]
- **WindGustDir** \[data_type: string\] (categorical) -> requires converting to One Hot representation
- **WindDir9am** \[data_type: string\] (categorical) -> requires converting to One Hot representation
- **WindDir3pm** \[data_type: string\] (categorical) -> requires converting to One Hot representation
- **WindGustSpeed** \[data_type: float64\]
- **WindSpeed9am** \[data_type: float64\]
- **WindSpeed3pm** \[data_type: float64\]
- **Humidity9am** \[data_type: float64\]
- **Humidity3pm** \[data_type: float64\]
- **Pressure9am** \[data_type: float64\]
- **Pressure3pm** \[data_type: float64\]
- **Cloud9am** \[data_type: int64\] (categorical)
- **Cloud3pm** \[data_type: int64\] (categorical)
- **Temp9am** \[data_type: float64\]
- **Temp3pm** \[data_type: float64\]
- **RainToday** \[data_type: string\] (binary category) -> requires converting to \[data_type: bool\]
- **RainTomorrow** *TARGET* \[data_type: string\] (binary category) -> requires converting to \[data_type: bool\]

### Data preparation
All float values are reduced in their resolution by applying a *nearest range* to them. 

**Example:**

With `range=5`, the following array:

`[12, 18, 21, 24, 25]`

Would be converted to:

`[10, 20, 20, 25, 25]`
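The helpers that perform this conversion live in `utility_functions.py`, which is not shown here. The sketch below is an assumption of how `round_to_base` and `floor_range` could behave, based only on the example above and the comments in the parameter cell; the actual implementations may differ.

```python
import math


def round_to_base(value, base):
    """Round value to the nearest multiple of base (assumed behaviour)."""
    return int(base * round(value / base))


def floor_range(value, base):
    """Label the range that contains value, e.g. 12 with base 5 -> '10-14' (assumed behaviour)."""
    lower = int(math.floor(value / base) * base)
    return f"{lower}-{lower + base - 1}"


values = [12, 18, 21, 24, 25]
rounded = [round_to_base(v, 5) for v in values]
labels = [floor_range(v, 5) for v in values]
print(rounded)
print(labels)
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a value that lies exactly between two multiples of the base can go either way.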


Because of limitations in the libraries used, all categorical (string) attributes are converted into a *one-hot* representation.
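As a small illustration of the one-hot conversion, the sketch below applies `pd.get_dummies` to a toy `smoking` column (made-up data taken from the comment in the parameter cell, not from the weather dataset). Passing `dtype=int` is a choice made here to get `0`/`1` values regardless of the pandas version.

```python
import pandas as pd

# Toy categorical column (illustrative only)
toy = pd.DataFrame({"smoking": ["sometimes", "regularly", "sometimes", "never"]})

# One column per category; each row has a 1 in the column of its category
one_hot = pd.get_dummies(toy["smoking"], prefix="smoking", dtype=int)

# Append the one-hot columns to the original frame
toy_encoded = pd.concat([toy, one_hot], axis=1)
print(toy_encoded)
```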

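In this implementation, entries with missing data are dropped. This is a considerable reduction: roughly 56,000 of the `145460` entries remain (see the output of the pre-processing cell).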

### Basic attribute selection
For all attributes measured at multiple points in time, only the latest measurement (the 3pm attributes) is used, as it is closest to the target event (RainTomorrow).

Lastly, attributes with a lot of entropy, such as the exact date and the id of an entry, are left out. The date is, however, reduced to its month to allow for possible seasonality detection.

### Advanced attribute selection
The exact choice of attributes was partially determined by domain knowledge on [weather forecasting](https://study.com/academy/lesson/what-is-air-pressure-definition-types-causes-effects.html#:~:text=Although%20it%20may%20seem%20like,pressure%2C%20temperature%20and%20air%20density.) rather than by the data itself. However, our primary selection is based on groupings/relations found during the visualisation stages. Please refer to the **observation and discussion** sections within the visualisation notebook.

In [117]:
import pandas as pd
import numpy as np

file_path = "weatherAUS.csv"

# Fetching CSV and converting to data frame
data_file = pd.read_csv(file_path, header=0)

# Drop all entries with nan values
data_file = data_file.dropna().reset_index(drop=True)

# Convert interval and ratio variables into ranges
if convert_to_range:
    if range_categorical:
        r_func = floor_range
    else:
        r_func = round_to_base
    
    # Base (range width) used for each numeric attribute
    range_bases = {
        "MinTemp": 2, "MaxTemp": 2, "Rainfall": 2, "Evaporation": 5, "Sunshine": 1,
        "WindGustSpeed": 5, "WindSpeed9am": 5, "WindSpeed3pm": 5,
        "Humidity9am": 5, "Humidity3pm": 5, "Pressure9am": 3, "Pressure3pm": 3,
        "Cloud9am": 3, "Cloud3pm": 3, "Temp9am": 3, "Temp3pm": 3,
    }

    # Insert each ranged column directly before its source column
    for column, base in range_bases.items():
        data_file.insert(data_file.columns.get_loc(column), f"ranged_{column}", [r_func(value, base) for value in data_file[column]])

    
# Convert "boolean" variables into true boolean variables
data_file["RainToday"] = data_file["RainToday"] == "Yes"
data_file["RainTomorrow"] = data_file["RainTomorrow"] == "Yes"
# Parse the dates; astype("datetime64") without a unit is rejected by recent pandas versions
data_file["Date"] = pd.to_datetime(data_file["Date"])
data_file.insert(data_file.columns.get_loc("Date"), "Month", data_file["Date"].dt.month)
    
# Convert categorical variables into a one-hot representation
if convert_categorical_to_one_hot:
    data_file = pd.concat([data_file, pd.get_dummies(data_file["WindGustDir"], prefix="WindGustDir")], axis=1)
    data_file = pd.concat([data_file, pd.get_dummies(data_file["WindDir9am"], prefix="WindDir9am")], axis=1)
    data_file = pd.concat([data_file, pd.get_dummies(data_file["WindDir3pm"], prefix="WindDir3pm")], axis=1)
    data_file = pd.concat([data_file, pd.get_dummies(data_file["Location"], prefix="Location")], axis=1)
    
data_file

Unnamed: 0,Month,Date,Location,ranged_MinTemp,MinTemp,ranged_MaxTemp,MaxTemp,ranged_Rainfall,Rainfall,ranged_Evaporation,...,Location_PerthAirport,Location_Portland,Location_Sale,Location_Sydney,Location_SydneyAirport,Location_Townsville,Location_WaggaWagga,Location_Watsonia,Location_Williamtown,Location_Woomera
0,1,2009-01-01,Cobar,18,17.9,36,35.2,0,0.0,10,...,0,0,0,0,0,0,0,0,0,0
1,1,2009-01-02,Cobar,18,18.4,28,28.9,0,0.0,15,...,0,0,0,0,0,0,0,0,0,0
2,1,2009-01-04,Cobar,20,19.4,38,37.6,0,0.0,10,...,0,0,0,0,0,0,0,0,0,0
3,1,2009-01-05,Cobar,22,21.9,38,38.4,0,0.0,10,...,0,0,0,0,0,0,0,0,0,0
4,1,2009-01-06,Cobar,24,24.2,40,41.0,0,0.0,10,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56415,6,2017-06-20,Darwin,20,19.3,34,33.4,0,0.0,5,...,0,0,0,0,0,0,0,0,0,0
56416,6,2017-06-21,Darwin,22,21.2,32,32.6,0,0.0,10,...,0,0,0,0,0,0,0,0,0,0
56417,6,2017-06-22,Darwin,20,20.7,32,32.8,0,0.0,5,...,0,0,0,0,0,0,0,0,0,0
56418,6,2017-06-23,Darwin,20,19.5,32,31.8,0,0.0,5,...,0,0,0,0,0,0,0,0,0,0


## Classification

### Test & Train set generation
**Applied algorithms for this test and training set**
- SKLearn Random Forest Classifier
- SKLearn Support Vector Classification Classifier
- XGBoost Classifier

In [118]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Decision Variables
X = data_file.loc[:,["RainToday", "ranged_Evaporation", "ranged_Sunshine", "ranged_WindGustSpeed", "ranged_Humidity3pm", "ranged_Pressure3pm", "ranged_Cloud3pm", "ranged_Temp3pm", *fetch_columns_on_name_list(data_file, ["WindDir9am", "WindDir3pm"])]].values
# Target Variable
y = data_file.loc[:,"RainTomorrow"].values

# Split datasets into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)

sc = StandardScaler()

# Scale data for classifiers: fit the scaler on the training set only,
# then apply the same transformation to the test set
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
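A common convention for this step is to learn the scaling statistics from the training split only and reuse them for the test split, so that no information from the test set leaks into the pipeline. The sketch below illustrates this on synthetic data; the `stratify` and `random_state` arguments are additions made here to keep the class balance and make the split reproducible, and are not settings taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 entries, 4 features, roughly 22% positive labels
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(200, 4))
y_toy = rng.random(200) < 0.22

# Stratified, reproducible split (both arguments are assumptions for this sketch)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training data
X_te_scaled = scaler.transform(X_te)      # reuse the training mean/std

print("train feature means:", X_tr_scaled.mean(axis=0).round(3))
print("test feature means: ", X_te_scaled.mean(axis=0).round(3))
```

The test means are close to, but not exactly, zero, because the test split is scaled with statistics learned from the training split.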

### Random Forest Classification training & testing

In [133]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=150, n_jobs=-1)
rf_classifier.fit(X_train_scaled, Y_train)

RandomForestClassifier(n_estimators=150, n_jobs=-1)

#### Random Forest Classification report

In [134]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

rf_y_prediction = rf_classifier.predict(X_test_scaled)

print("# # # Classification report # # #")
print(classification_report(Y_test, rf_y_prediction))

print("# # # Confusion matrix # # #")
print(confusion_matrix(Y_test, rf_y_prediction), "\n")

print("# # # Accuracy score # # #")
print(accuracy_score(Y_test, rf_y_prediction), "\n")

# # # Classification report # # #
              precision    recall  f1-score   support

       False       0.88      0.94      0.91      8843
        True       0.72      0.53      0.61      2441

    accuracy                           0.85     11284
   macro avg       0.80      0.74      0.76     11284
weighted avg       0.85      0.85      0.85     11284

# # # Confusion matrix # # #
[[8350  493]
 [1146 1295]] 

# # # Accuracy score # # #
0.8547500886210564 



#### Random Forest observations and discussions
The Random Forest classifier reaches a fair overall accuracy of `85%`, but this should be viewed critically. Most predictions were made for *No rain*, which is also the predominant (`78%`) expected outcome. It therefore made mostly **false negative** mistakes, where it predicted *No rain* even though rain was expected (refer to the confusion matrix).

Overall, the Random Forest method is most effective at predicting *No rain*, but also shows a fair ability for *expected rain* predictions (precision of `88%` vs `72%`).
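The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the Random Forest matrix reported above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, in the order `False`, `True`).

```python
# Random Forest confusion matrix from the report above
tn, fp = 8350, 493    # actual False: predicted False / predicted True
fn, tp = 1146, 1295   # actual True:  predicted False / predicted True

precision_rain = tp / (tp + fp)        # share of rain predictions that were right
recall_rain = tp / (tp + fn)           # share of actual rain days that were found
precision_no_rain = tn / (tn + fn)
recall_no_rain = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"rain:    precision={precision_rain:.2f} recall={recall_rain:.2f}")
print(f"no rain: precision={precision_no_rain:.2f} recall={recall_no_rain:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The recall for rain shows that only about half of the actual rain days are detected, which the overall accuracy hides.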

The Random Forest algorithm has an almost instant training and prediction runtime on this dataset.

Because of the fair size of the dataset and number of attributes, the number of estimators has been increased to `150`.

### Support Vector Classification training & testing

In [121]:
from sklearn.svm import SVC

SV_classifier = SVC()
SV_classifier.fit(X_train_scaled, Y_train)

SVC()

#### Support Vector Classification report

In [122]:
SV_y_prediction = SV_classifier.predict(X_test_scaled)

print("# # # Classification report # # #")
print(classification_report(Y_test, SV_y_prediction))

print("# # # Confusion matrix # # #")
print(confusion_matrix(Y_test, SV_y_prediction), "\n")

print("# # # Accuracy score # # #")
print(accuracy_score(Y_test, SV_y_prediction), "\n")

# # # Classification report # # #
              precision    recall  f1-score   support

       False       0.87      0.95      0.91      8843
        True       0.74      0.49      0.59      2441

    accuracy                           0.85     11284
   macro avg       0.81      0.72      0.75     11284
weighted avg       0.84      0.85      0.84     11284

# # # Confusion matrix # # #
[[8436  407]
 [1255 1186]] 

# # # Accuracy score # # #
0.8527118043247075 



#### Support Vector Classification observations and discussions
The Support Vector classifier reaches a fair overall accuracy of `85%`, but this should be viewed critically. Most predictions were made for *No rain*, which is also the predominant (`78%`) expected outcome. It therefore made mostly **false negative** mistakes, where it predicted *No rain* even though rain was expected (refer to the confusion matrix).

Overall, the Support Vector method is most effective at predicting *No rain*, but also shows a fair ability for *expected rain* predictions (precision of `87%` vs `74%`).

The Support Vector algorithm has a long runtime in both training and prediction, with each step requiring several seconds to complete.
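The runtime observations in these discussions are informal. One possible way to measure them is a small timing helper like the one sketched below; `timed` is a hypothetical name, and the workload is a stand-in rather than one of the classifiers.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook this could be, for example,
# timed(SV_classifier.fit, X_train_scaled, Y_train)
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```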

Tuning the parameters of the support vector algorithm either did not change the results much or made them worse. The support vector classifier is therefore left at its default settings.

### XGBoost Classification training & testing

In [135]:
import xgboost as xgb

XGB_classifier = xgb.XGBClassifier(base_score=0.6, n_estimators=150)
XGB_classifier.fit(X_train_scaled, Y_train)





XGBClassifier(base_score=0.6, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=150, n_jobs=16,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

#### XGBoost Classification report

In [136]:
XGB_y_prediction = XGB_classifier.predict(X_test_scaled)

print("# # # Classification report # # #")
print(classification_report(Y_test, XGB_y_prediction))

print("# # # Confusion matrix # # #")
print(confusion_matrix(Y_test, XGB_y_prediction), "\n")

print("# # # Accuracy score # # #")
print(accuracy_score(Y_test, XGB_y_prediction), "\n")

# # # Classification report # # #
              precision    recall  f1-score   support

       False       0.88      0.94      0.91      8843
        True       0.71      0.55      0.62      2441

    accuracy                           0.85     11284
   macro avg       0.80      0.75      0.77     11284
weighted avg       0.85      0.85      0.85     11284

# # # Confusion matrix # # #
[[8293  550]
 [1089 1352]] 

# # # Accuracy score # # #
0.8547500886210564 



#### XGBoost Classification observations and discussions
The XGBoost classifier reaches a fair overall accuracy of `85%`, but this should be viewed critically. Most predictions were made for *No rain*, which is also the predominant (`78%`) expected outcome. It therefore made mostly **false negative** mistakes, where it predicted *No rain* even though rain was expected (refer to the confusion matrix), although notably fewer than the SV and RF classifiers. In return, it makes more **false positive** mistakes than the others.

Overall, the XGBoost method is most effective at predicting *No rain*, but also shows a fair ability for *expected rain* predictions (precision of `88%` vs `71%`).

The XGBoost algorithm has a long training runtime, requiring several seconds to complete; prediction, however, is almost instant.

Because of the fair size of the dataset and number of attributes, the number of estimators has been increased to `150`. This also seems to be the only parameter that slightly increases accuracy. Otherwise, the classifier is left at default settings for this dataset.

### Closing Discussion and Conclusion
Overall, the three classification algorithms all seem to have roughly the same *accuracy score*. The **random forest** and **support vector** classifiers perform almost equally in terms of **false positives/negatives**; however, **random forest** is notably faster than **support vector**.

**XGBoost** seems to make a trade-off: it produces fewer **false negative** classifications, but in turn more **false positives** than the other two classification techniques.
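The trade-off can be read off by placing the three reported confusion matrices side by side. The sketch below does this with the values from the reports above, again assuming scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
# Confusion matrices as reported above: [[TN, FP], [FN, TP]]
matrices = {
    "Random Forest": [[8350, 493], [1146, 1295]],
    "Support Vector": [[8436, 407], [1255, 1186]],
    "XGBoost": [[8293, 550], [1089, 1352]],
}

errors = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    errors[name] = {
        "false_positives": fp,
        "false_negatives": fn,
        "recall_rain": tp / (tp + fn),
    }
    print(f"{name:15s} FP={fp:4d} FN={fn:4d} recall(rain)={tp / (tp + fn):.2f}")
```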

Looking back at the results of each classifier, they all have fairly similar results. However, one classifier is notably preferable for this dataset and the current settings: **random forest**. This classifier outperforms the other two classifiers by delivering equal results, but with notably shorter runtimes.

#### Overfit
Looking at the false negatives, all three classifiers seem to show signs of overfitting towards *no rain* results. This is not entirely surprising, as most entries are *no rain*. However, changing parameters that reduce overfitting, such as the tree depth for XGBoost and Random Forest, does not seem to improve results. Since most entries expect a *no rain* result, the argument could be made that there is no overfitting at all; after all, the classifiers still reach a precision of over `70%` on the positive cases.
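One way to examine this question more directly is to compare accuracy on the training set with accuracy on the test set: a large gap points to overfitting, while a bias towards the majority class on both sets points to class imbalance instead. The sketch below illustrates the check on synthetic data generated with `make_classification`; the data and all settings are assumptions for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 78% negative, 22% positive)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

# A large difference between these two scores suggests overfitting
train_accuracy = model.score(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(f"train accuracy={train_accuracy:.3f}, test accuracy={test_accuracy:.3f}")
```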

#### Hyper parameter tuning
The tuning of parameters was mostly performed manually, in a trial-and-error fashion. A possibly better solution would have been to apply a GRG-Nonlinear optimization search to the parameters of the classifiers, but due to the long runtime of the classifiers, a single run of this optimization would take many hours, if not days. Instead, we focused mostly on tuning the attributes by pre-processing them, and on selecting them by first looking at (overlapping) groups in the visualisation notebook.
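As an illustration of an automated alternative with a bounded runtime, the sketch below uses scikit-learn's `RandomizedSearchCV`, which samples a fixed number of parameter combinations. This is a swapped-in technique, not the GRG-Nonlinear method mentioned above, and the synthetic data, parameter grid, and `f1` scoring are all assumptions chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in data (not the weather dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Candidate values to sample from (illustrative choices)
param_distributions = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled combinations bounds the runtime
    cv=3,              # 3-fold cross-validation per combination
    scoring="f1",      # F1 on the positive class instead of plain accuracy
    random_state=0,
)
search.fit(X_syn, y_syn)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy keeps the search from simply favouring the majority class.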