# Homework 2: Classification Competition

#### COSC 410: Spring 2024, Colgate University

See HW2.pdf for more details. **Due Feb 26**

In [8]:
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


### ML Task Description

The `Lab3_train.csv` file contains 10 years' worth of daily weather observations from locations
across Australia, one row per day. The `RainTomorrow` column holds a binary label for each observation: `1` if it rained
on the following day and `0` if it did not. Your goal is to create an ML model that, when given a
new weather observation, can predict whether it will rain on the day after
the observation. In other words, can you use machine learning to predict whether it will rain tomorrow
based on the weather today?

### Open Ended Questions

Answer the following questions (referencing your code in this notebook when appropriate).

1. Describe the data preprocessing steps your pipeline performs.

- Filling missing values: missing values in numeric columns are replaced with the column mean, and missing values in categorical columns are replaced with the label "unknown".
- Encoding categorical variables: text values in categorical columns are converted to numbers by assigning a unique integer to each distinct value.
- Feature engineering: a new feature, `TempRange`, is computed as the difference between `MaxTemp` and `MinTemp`.
- Feature scaling: every feature is standardized by subtracting its mean and dividing by its standard deviation. The target variable `RainTomorrow` is left unscaled.

2. What different models did you try to improve your performance? How did they perform relative to each other?

I started with a Decision Tree Classifier and then switched to a Random Forest Classifier to improve performance. The Random Forest performed better: it had a higher accuracy score and a lower log loss than the Decision Tree.

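A comparison like the one described above could be run along these lines. This is a minimal sketch, not the code used for the submission: it assumes synthetic data from `make_classification` as a stand-in for the weather features, default settings for the Decision Tree, and a hypothetical 75/25 validation split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in for the weather data (an assumption).
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=8,
    weights=[0.78, 0.22],
    random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=125, random_state=42),
}

# Fit each model on the same split and score it on the held-out part.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "accuracy": accuracy_score(y_val, model.predict(X_val)),
        "log_loss": log_loss(y_val, model.predict_proba(X_val)),
    }
    print(name, results[name])
```

Scoring both models on the same held-out split keeps the comparison fair. Log loss uses the predicted probabilities, so it also penalizes confident wrong predictions that accuracy alone would hide.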

3. What different hyperparameters did you attempt to optimize for the model you submitted? What values did you choose in the end?

For my Random Forest model, I tuned several hyperparameters to improve performance and address class imbalance. I used 125 trees for accuracy and robustness and the `"entropy"` split criterion. To control complexity and improve generalization, I limited tree depth to 13 and set the minimum number of samples to 5 per leaf node and 14 per split. I also set the class weights to `{0: 1, 1: 2}`, which gives more weight to the minority class, and used a fixed random state of 42 for reproducible results.
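One way to search over hyperparameters like these is a cross-validated grid search. The answer above does not say how the final values were chosen, so the sketch below is only an illustration: the grid, the `f1` scoring choice, the 3-fold split, the reduced tree count, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced stand-in for the weather data (an assumption).
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=6,
    weights=[0.78, 0.22],
    random_state=42,
)

# A small hypothetical grid around the values reported above.
param_grid = {
    "max_depth": [8, 13],
    "min_samples_leaf": [1, 5],
    "class_weight": [None, {0: 1, 1: 2}],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=42),
    param_grid=param_grid,
    scoring="f1",  # F1 on the positive class suits an imbalanced label
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Scoring on F1 rather than accuracy keeps the search from favoring models that simply predict the majority class.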

4. Did model selection or hyperparameter optimization make the most difference for improving model performance?

I found that hyperparameter optimization within the Random Forest model had a more substantial impact. Initially, I didn't notice a significant difference between the Decision Tree and Random Forest models in terms of performance. But after fine-tuning the hyperparameters of the Random Forest model, such as the number of estimators, maximum depth, minimum samples for leaf nodes and splits, and class weights, the improvements became more pronounced. Optimizing the hyperparameters allowed me to tailor the Random Forest model more closely to the specifics of the dataset, leading to better handling of its imbalanced nature and enhancing the overall predictive performance. 

### Preprocessing

Your initial task is to preprocess this dataset. This includes resolving missing features, encoding nominal features, and appropriately scaling all features. You'll implement the function `preprocess`. Blocks below point out some useful tricks for approaching this.

In [9]:
df = pd.read_csv('Lab3_train.csv')
df.head(10)
# df.describe()

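| Unnamed: 0 | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sydney | 18.6 | 25.3 | 0.0 | 4.2 | 4.5 |  |  | SSE | ENE | ... | 24.0 | 70.0 | 59.0 | 1021.7 | 1019.4 | 7.0 | 6.0 | 21.2 | 23.8 | 0 |
| 1 | MountGambier | 4.0 | 12.4 | 4.0 | 0.4 | 2.5 | WSW | 46.0 | SW | WSW | ... | 19.0 | 77.0 | 67.0 | 1011.0 | 1013.3 | 6.0 | 7.0 | 10.1 | 11.5 | 1 |
| 2 | Wollongong |  | 18.6 |  |  |  | SSE | 24.0 |  | SW | ... | 7.0 |  | 87.0 | 1020.0 | 1018.5 |  | 8.0 |  | 16.5 | 1 |
| 3 | Ballarat | 4.9 | 11.2 | 0.4 |  |  | SSE | 26.0 | SSW | SSE | ... | 15.0 | 100.0 | 96.0 | 1029.3 | 1028.4 | 8.0 | 8.0 | 7.9 | 10.2 | 0 |
| 4 | Albury | 6.2 | 10.0 | 21.4 |  |  | NW | 57.0 | NW | NW | ... | 19.0 | 82.0 | 91.0 | 1009.2 | 1008.7 | 8.0 | 8.0 | 8.5 | 9.1 | 1 |
| 5 | Sydney | 10.1 | 20.7 | 0.0 | 4.0 | 8.5 | W | 44.0 | W | WSW | ... | 22.0 | 61.0 | 37.0 | 1018.7 | 1014.7 | 4.0 | 3.0 | 11.3 | 19.6 | 0 |
| 6 | Hobart | 8.8 | 16.6 | 1.6 | 5.2 | 12.1 | SW | 65.0 | W | SW | ... | 28.0 | 54.0 | 34.0 | 1013.1 | 1017.8 |  |  | 11.5 | 15.5 | 0 |
| 7 | Perth | 16.1 | 30.8 | 0.0 | 11.0 | 13.1 | SW | 39.0 | E | SSW | ... | 22.0 | 44.0 | 30.0 | 1021.4 | 1017.7 | 0.0 | 0.0 | 21.8 | 29.2 | 0 |
| 8 | SalmonGums | 2.9 | 23.5 | 0.0 |  |  | SSE | 41.0 | S | SSE | ... | 15.0 | 43.0 | 25.0 |  |  |  |  | 16.3 | 21.8 | 0 |
| 9 | MountGinini | 4.0 | 13.2 | 15.0 |  |  |  |  | ENE |  | ... |  | 97.0 |  |  |  |  |  | 8.1 |  | 0 |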


In [10]:

def scale(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize every feature column: x' = (x - mean) / sd

    Args:
        df (pd.DataFrame): Dataframe to scale
    Returns:
        pd.DataFrame having standardized features

    Note: Only apply after steps 1 and 2"""

    # We don't want to scale our prediction target
    features = [col for col in df.columns if col != "RainTomorrow"]

    # Per-column mean and (sample) standard deviation
    means = df[features].mean()
    sds = df[features].std()

    # Standardize all feature columns at once; pandas aligns on column names
    df[features] = (df[features] - means) / sds
    return df
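As a quick check of the standardization formula, the sketch below applies it to a small made-up frame and compares the result with scikit-learn's `StandardScaler`. The toy values are assumptions for illustration. The point it shows is that pandas' `std()` uses the sample standard deviation (`ddof=1`) while `StandardScaler` uses the population standard deviation (`ddof=0`), so the two results differ by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature values, made up for illustration.
toy = pd.DataFrame(
    {
        "MinTemp": [4.0, 6.0, 8.0, 10.0],
        "MaxTemp": [12.0, 18.0, 20.0, 30.0],
    }
)

# Same formula as `scale`: pandas' std() is the sample standard deviation.
manual = (toy - toy.mean()) / toy.std()

# StandardScaler divides by the population standard deviation instead.
sk = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# The two results differ only by the factor sqrt(n / (n - 1)).
n = len(toy)
ratio = np.sqrt(n / (n - 1))
print(np.allclose(sk, manual * ratio))  # True
```

For a dataset with thousands of rows the factor is very close to 1, so either convention gives nearly the same features.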

In [11]:
def preprocess(filename: str) -> pd.DataFrame:
    """Preprocess your data

    Args:
        filename (str): Name of the csv file containing the data

    Returns:
        pd.DataFrame: Dataframe with relevant preprocessing applied

    """
    df = pd.read_csv(filename)

    # Fill missing numeric values with the column mean
    numeric_cols = df.select_dtypes(include=["number"]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

    # Fill missing categorical values with the label "unknown"
    string_cols = df.select_dtypes(include=["object"]).columns
    df[string_cols] = df[string_cols].fillna("unknown")

    # Encode each categorical column as integer codes
    df[string_cols] = df[string_cols].apply(lambda x: pd.factorize(x)[0])

    # Feature engineering: daily temperature range
    df["TempRange"] = df["MaxTemp"] - df["MinTemp"]

    # Standardize all features except the label
    df = scale(df)
    return df

In [12]:
data = preprocess('Lab3_train.csv')
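One caveat: because `preprocess` computes its fill values, integer codes, and scaling statistics from whichever file it is given, the training and test files can end up encoded and scaled differently. A common alternative is to learn those statistics from the training data only and reuse them on the test data. The sketch below illustrates the idea on made-up rows. The helper names `fit_stats`, `transform`, and `_encode`, and the choice of `-1` for unseen categories, are hypothetical and not part of the assignment.

```python
import pandas as pd

LABEL = "RainTomorrow"


def _encode(df, numeric, fill, codes):
    """Fill missing values and map categories using training-derived values."""
    out = df.copy()
    out[numeric] = out[numeric].fillna(fill)
    for col, mapping in codes.items():
        # Categories never seen in training get the code -1.
        out[col] = out[col].fillna("unknown").map(mapping).fillna(-1).astype(int)
    return out


def fit_stats(train: pd.DataFrame) -> dict:
    """Learn fill values, category codes, and scaling stats from training data."""
    features = train.drop(columns=[LABEL], errors="ignore")
    numeric = features.select_dtypes(include="number").columns
    categorical = features.columns.difference(numeric)
    fill = features[numeric].mean()
    codes = {
        col: {v: i for i, v in enumerate(features[col].fillna("unknown").unique())}
        for col in categorical
    }
    encoded = _encode(features, numeric, fill, codes)
    return {
        "numeric": numeric,
        "fill": fill,
        "codes": codes,
        "mean": encoded.mean(),
        "std": encoded.std().replace(0, 1),  # avoid dividing by zero
    }


def transform(df: pd.DataFrame, stats: dict) -> pd.DataFrame:
    """Apply the training-derived statistics to any split (train or test)."""
    features = df.drop(columns=[LABEL], errors="ignore")
    out = _encode(features, stats["numeric"], stats["fill"], stats["codes"])
    out = (out - stats["mean"]) / stats["std"]
    if LABEL in df:
        out[LABEL] = df[LABEL]
    return out


# Made-up rows for illustration.
train = pd.DataFrame(
    {
        "MinTemp": [10.0, 12.0, None, 14.0],
        "WindGustDir": ["N", "S", None, "N"],
        "RainTomorrow": [0, 1, 0, 1],
    }
)
test = pd.DataFrame(
    {
        "MinTemp": [12.0, None],
        "WindGustDir": ["S", "W"],  # "W" never appears in training
        "RainTomorrow": [1, 0],
    }
)

stats = fit_stats(train)
train_ready = transform(train, stats)
test_ready = transform(test, stats)
print(test_ready)
```

With this split, a test row with `WindGustDir == "S"` receives the same code as a training row with that value, and the missing test `MinTemp` is filled with the training mean.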

## Train a Classifier

In [17]:

def fit_predict(train_fname: str, test_fname: str) -> np.ndarray:
    """Fit a Random Forest model and return its predictions on test data

    Args:
        train_fname (str): Name of the training file
        test_fname (str): Name of the testing file
    Returns:
        np.ndarray: Predictions of the model on test data

    Note:
        Make sure you preprocess both your train and test data!"""

    train_data = preprocess(train_fname)
    test_data = preprocess(test_fname)
    X_train = train_data.drop(columns=["WindDir9am", "RainTomorrow"])
    y_train = train_data["RainTomorrow"]
    X_test = test_data.drop(columns=["WindDir9am", "RainTomorrow"])

    # Configure the Random Forest Classifier
    classifier = RandomForestClassifier(
        n_estimators=125,
        criterion="entropy",
        max_depth=13,
        min_samples_leaf=5,
        min_samples_split=14,
        class_weight={
            0: 1,
            1: 2,
        },
        random_state=42,
    )

    classifier.fit(X_train, y_train)
    Y_pred = classifier.predict(X_test)

    return Y_pred

In [18]:
def score(test_fname: str, Y_pred: np.ndarray) -> tuple[float, float, float]:
    """Return precision, recall, and F1 of the predictions against the test labels"""
    test = preprocess(test_fname)
    # True labels from the test file
    Y = test["RainTomorrow"]

    precision = metrics.precision_score(Y, Y_pred)
    recall = metrics.recall_score(Y, Y_pred)
    f1 = metrics.f1_score(Y, Y_pred)

    return precision, recall, f1

[0 1 0 ... 0 0 0]
(0.48342187213775417, 0.5225742574257426, 0.502236178513655)
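The three numbers in the tuple above are precision, recall, and F1, in the order returned by `score`. As a small sanity check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the first two values:

```python
# Precision and recall as reported in the output above.
precision = 0.48342187213775417
recall = 0.5225742574257426

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5022
```

The result agrees with the third reported value, which confirms the tuple is internally consistent.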
