# Lab 3: Nearest Neighbor and Classification Competition

#### COSC 410: Spring 2024, Colgate University

The goal of this notebook is to strengthen your understanding of $k$ Nearest Neighbor classification and to expand your familiarity with a standard machine learning pipeline. First, you'll implement a version of $k$-NN classification. Then, you'll explore a weather prediction dataset. Finally, you'll implement a function for fitting a classification model of your choosing on this dataset.

Here are some learning objectives for this lab: 

1. Implement $k$-NN classification
2. Explore a dataset and standardize the values
3. Generate some plots to investigate some initial hypotheses
4. Apply any classification model from sklearn that we've discussed in class (i.e., `LogisticRegression`, `SVC`, or `KNeighborsClassifier`)
5. Gain experience with leaderboards


| Part | Description                       | Code? | Response? | 
| ---- | --------------------------------- | ----- | --------- |
| 1    | Implement $k$-NN                  | Yes   |    No     |
| 2    | Preprocess data and explore       | Yes   |    Yes    |
| 3    | Apply a classifier to data        | Yes   |    No     | 

In [44]:
import pandas as pd
import sklearn
from sklearn import metrics
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
import random
import seaborn as sns

## Part 1: $k$-NN

In [45]:
def euclidean_distance(point: np.array, data: np.array) -> np.array:
    """ Calculates the euclidean distance for a point against all the data.

    Args:
        point (np.array): A single sample from test data
        data (np.array): All of the train data to calculate the distance for
    Returns:
        np.array: Distances between the point and all the data
    """
    # Broadcasting subtracts `point` from every row; the norm is taken per row
    distance = np.linalg.norm(data - point, axis=1)
    return distance

# print(X_val[0])
# print()
# print(X_train[:5])
# print()
# print(euclidean_distance(X_val[0], X_train[:5]))

# """
# [-1.25391319  1.31053702]

# [[ 0.16116456 -0.94258788]
#  [-1.40470908 -1.33032162]
#  [-0.82928994 -1.27174336]
#  [-1.10981406 -1.61297867]
#  [ 1.68625714  0.82660659]]

# [2.66064219 2.64516044 2.61695946 2.92706484 2.97972989]
# """

In [46]:
def mode(labels: list): 
    """ Returns the mode value from a list 

    Args:
        labels (list): A list of labels (e.g., ['cat', 'dog', 'cat'], [1, 0, 0], etc.)
    Returns:
        The mode (e.g., 'cat', 0, etc.)
    """ 
    labels_list = list(labels)
    # Ties are broken arbitrarily (by set iteration order)
    return max(set(labels_list), key=labels_list.count)

print(mode(['cat','dog','dog']))

dog


In [47]:
class KNN():
    """ K Nearest Neighbors Classifier 

    Attributes:

        k (int): How many neighbors to consider (default: 5)
        dist_func (Callable): Distance function (default: euclidean_distance)
        X (np.array): Input training data
        Y (np.array): Output (e.g., gold prediction) training data
    """
    def __init__(self, k:int=5, dist_func=euclidean_distance):
        self.k = k
        self.dist_func = dist_func

    def fit(self, X, Y):
        """ Adds X and Y from train to class """ 
        self.X = X
        self.Y = Y

    def predict(self, X_test: np.array) -> np.array:
        """ Makes prediction based on k closest neighbors

        Args:
            X_test (np.array): Input test data

        Returns:
            np.array: Predictions
        """ 
        predictions = []

        for point in X_test:
            distances = self.dist_func(point, self.X)
            k_indices = np.argsort(distances)[:self.k]
            k_labels = self.Y[k_indices]
            prediction = mode(k_labels)
            predictions.append(prediction)
    
        return np.array(predictions)

In [48]:
random_seed = 2323

np.random.seed(random_seed)
random.seed(random_seed)

classifier = KNN()

# Create data
X, Y = make_classification(
    n_features=2, n_redundant=0, n_informative=1, n_clusters_per_class=1
)

# Split off a valid set
n = X.shape[0]
X_train = X[0:int(n*0.8)]
Y_train = Y[0:int(n*0.8)]
X_val = X[int(n*0.8):]
Y_val = Y[int(n*0.8):]
# Y_val is [1 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 1 0 1]


# Fit the classifier 
classifier.fit(X_train, Y_train)

# Get predictions
Y_pred = classifier.predict(X_val)
# Y_pred should be [1 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 0 1 0 0]

# Get metrics
precision = metrics.precision_score(Y_val, Y_pred)
recall = metrics.recall_score(Y_val, Y_pred)
f1 = metrics.f1_score(Y_val, Y_pred)
print(f"precision: {precision} recall: {recall} f1: {f1}")
# 1.0, 0.85714 0.923077

precision: 1.0 recall: 0.8571428571428571 f1: 0.923076923076923


## Part 2: Explore Dataset and Preprocessing

### ML Task Description

The `Lab3_train.csv` file contains 10 years' worth of daily weather observations from locations
across Australia, one row per day. It contains a binary label column (`RainTomorrow`) for each observation: `1` if it rained
on the following day and `0` if it did not. Your goal is to create an ML model that, when given a
new weather observation, can predict whether it will rain on the day after
the observation. In other words, can you use machine learning to predict if it will rain tomorrow
based on the weather today?

### Preprocessing

Your initial task is to preprocess this dataset. This includes resolving missing features, encoding nominal features, and appropriately scaling all features. You'll implement the function `preprocess`. Blocks below point out some useful tricks for approaching this.

In [49]:
df = pd.read_csv('Lab3_train.csv')

In [50]:
# Let's consider the first couple samples of our data
df.head(10)
# plt.figure(figsize=(8, 8))
# X1, Y1 = make_classification(
# n_features=2, n_redundant=0, n_informative=1, n_clusters_per_class=1
# )
# clusters = plt.scatter(X1[:, 0], X1[:, 1], marker="o", c=Y1, s=25,edgecolor="k")
# plt.show()

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainTomorrow
0,Sydney,18.6,25.3,0.0,4.2,4.5,,,SSE,ENE,...,24.0,70.0,59.0,1021.7,1019.4,7.0,6.0,21.2,23.8,0
1,MountGambier,4.0,12.4,4.0,0.4,2.5,WSW,46.0,SW,WSW,...,19.0,77.0,67.0,1011.0,1013.3,6.0,7.0,10.1,11.5,1
2,Wollongong,,18.6,,,,SSE,24.0,,SW,...,7.0,,87.0,1020.0,1018.5,,8.0,,16.5,1
3,Ballarat,4.9,11.2,0.4,,,SSE,26.0,SSW,SSE,...,15.0,100.0,96.0,1029.3,1028.4,8.0,8.0,7.9,10.2,0
4,Albury,6.2,10.0,21.4,,,NW,57.0,NW,NW,...,19.0,82.0,91.0,1009.2,1008.7,8.0,8.0,8.5,9.1,1
5,Sydney,10.1,20.7,0.0,4.0,8.5,W,44.0,W,WSW,...,22.0,61.0,37.0,1018.7,1014.7,4.0,3.0,11.3,19.6,0
6,Hobart,8.8,16.6,1.6,5.2,12.1,SW,65.0,W,SW,...,28.0,54.0,34.0,1013.1,1017.8,,,11.5,15.5,0
7,Perth,16.1,30.8,0.0,11.0,13.1,SW,39.0,E,SSW,...,22.0,44.0,30.0,1021.4,1017.7,0.0,0.0,21.8,29.2,0
8,SalmonGums,2.9,23.5,0.0,,,SSE,41.0,S,SSE,...,15.0,43.0,25.0,,,,,16.3,21.8,0
9,MountGinini,4.0,13.2,15.0,,,,,ENE,,...,,97.0,,,,,,8.1,,0


I hope you notice 3 things: 

1. There are some missing values. We see them as NaN in our table. For example, the location Wollongong has a NaN for MinTemp (maybe it's always getting colder there, maybe our measuring device messed up)
2. Some of our columns aren't numbers (e.g., Location, WindDir9am)
3. The values our columns take can be quite different magnitudes (e.g., Pressure9am has values like 1021.7 and Cloud9am has values like 7.0). We can see that with describe below also:

In [51]:
df.describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainTomorrow
count,90602.0,90815.0,90098.0,51978.0,47558.0,85061.0,90132.0,89326.0,89849.0,88714.0,81976.0,82001.0,56664.0,54415.0,90413.0,89277.0,91003.0
mean,12.192105,23.234429,2.352018,5.464587,7.611376,39.989008,13.992666,18.606845,68.869125,51.542755,1017.631846,1015.234076,4.439327,4.515759,16.990347,21.692859,0.22474
std,6.41145,7.136557,8.497127,4.1651,3.781962,13.608768,8.878223,8.811973,19.077023,20.86718,7.105664,7.036018,2.885399,2.715726,6.505233,6.960476,0.417413
min,-8.2,-4.1,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,982.2,980.2,0.0,0.0,-7.0,-5.1,0.0
25%,7.6,17.9,0.0,2.6,4.8,31.0,7.0,13.0,57.0,37.0,1012.9,1010.4,1.0,2.0,12.3,16.6,0.0
50%,12.0,22.7,0.0,4.8,8.4,39.0,13.0,19.0,70.0,52.0,1017.6,1015.2,5.0,5.0,16.7,21.1,0.0
75%,16.9,28.3,0.8,7.4,10.6,48.0,19.0,24.0,83.0,66.0,1022.4,1020.0,7.0,7.0,21.6,26.4,0.0
max,33.9,47.3,367.6,145.0,14.3,135.0,87.0,87.0,100.0,100.0,1041.0,1039.6,9.0,9.0,40.2,46.7,1.0


**Why do these three things pose a challenge for us?**

1) Missing values break our calculations: any arithmetic involving NaN yields NaN, and most sklearn estimators raise an error when given NaN inputs.
2) KNN relies on calculating distances between data points to determine similarity. Euclidean distance is only defined for numeric values, so string columns such as Location must either be encoded as numbers or compared with a different distance metric.
3) Because KNN relies on distances, features with larger magnitudes can dominate the distance calculation. KNN then effectively treats those features as more important, which can bias its predictions.
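
To make the third point concrete, here is a small sketch with two features, `Pressure9am` and `Cloud9am`. The rows are made up for illustration, and the means and standard deviations are rounded from the `describe` output above. It only aims to show that the nearest neighbor can change once the features are standardized.

```python
import numpy as np

# Columns: [Pressure9am, Cloud9am]. These rows are invented for illustration.
query = np.array([1017.0, 1.0])
neighbors = np.array([
    [1017.5, 8.0],   # A: almost the same pressure, very different cloud cover
    [1027.0, 1.5],   # B: different pressure, almost the same cloud cover
])

# Raw Euclidean distances: the pressure gap of B outweighs the cloud gap of A
raw = np.linalg.norm(neighbors - query, axis=1)

# Standardize with approximate statistics rounded from df.describe()
mean = np.array([1017.63, 4.44])
std = np.array([7.11, 2.89])
scaled = np.linalg.norm((neighbors - mean) / std - (query - mean) / std, axis=1)

print("raw distances:   ", raw)      # A is closer than B
print("scaled distances:", scaled)   # B is closer than A
```

On the raw values, A looks closer even though its cloud cover differs across nearly the whole 0 to 9 range. After standardizing, that difference counts for more than two standard deviations and B becomes the nearer point.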

How do we address these challenges? Here are three tips:

1. You should replace missing data points with sensible values of the correct type (e.g., float, string). Consider the pandas `fillna` function [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
2. Consider the pandas `factorize` function [here](https://pandas.pydata.org/docs/reference/api/pandas.factorize.html)
3. Consider the function below
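
As a quick illustration of the first two tips, the sketch below applies `fillna` and `pd.factorize` to a tiny made-up frame (the values and the `"unknown"` placeholder are assumptions for the example, not a required choice).

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the weather data (values are made up)
toy = pd.DataFrame({
    "MinTemp": [18.6, np.nan, 4.9],
    "WindGustDir": ["SSE", None, "SSE"],
})

# Tip 1: fill numeric gaps with the column mean, string gaps with a placeholder
toy["MinTemp"] = toy["MinTemp"].fillna(toy["MinTemp"].mean())
toy["WindGustDir"] = toy["WindGustDir"].fillna("unknown")

# Tip 2: factorize maps each distinct value to an integer code,
# numbered in order of first appearance
codes, uniques = pd.factorize(toy["WindGustDir"])
toy["WindGustDir"] = codes

print(toy)
print(list(uniques))
```

Note that the integer codes impose an arbitrary ordering on the categories, which is worth keeping in mind for distance-based models.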

In [52]:
def scale(df: pd.DataFrame) -> pd.DataFrame:
    """ x' = (x - mean)/sd 
    Args:
        df (pd.DataFrame): Dataframe to scale 
    Returns:
        pd.DataFrame having standardized features
        
    Note: Only apply after steps 1 and 2"""
    
    nonLabel = list(filter(lambda x: x != 'RainTomorrow', df.columns))
    
    # We don't want to scale our prediction 
    subset = df[nonLabel]
    # Mapping each feature to its mean and sd
    means = dict(subset.mean())
    sds = dict(subset.std())

    # Loop through and do the math
    for col in means:
        df[col] = (df[col] - means[col])/sds[col]
    return df

In [53]:
def preprocess(filename: str) -> pd.DataFrame: 
    """ Preprocess your data 

    Args:
        filename (str): Name of the csv file containing the data

    Returns: 
        pd.DataFrame: Dataframe with relevant preprocessing applied
    """
    df = pd.read_csv(filename)
    # Fill missing numeric values with the column mean
    numeric_cols = df.select_dtypes(include=['number']).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

    # Fill missing strings with a placeholder category
    string_cols = df.select_dtypes(include=['object']).columns
    df[string_cols] = df[string_cols].fillna("unknown")

    # Encode each string column as integer codes, then standardize the features
    df[string_cols] = df[string_cols].apply(lambda x: pd.factorize(x)[0])
    df = scale(df)
    return df

In [61]:
data = preprocess('Lab3_train.csv')
data.head(100)

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainTomorrow
0,-1.676600,1.001655e+00,0.289735,-2.781883e-01,-0.401738,-1.138030e+00,-1.572782,0.000000,-1.609540,-1.629264,...,0.617744,5.965891e-02,0.361948,0.603222,0.623739,1.124667e+00,7.067875e-01,0.649226,0.305642,0
1,-1.606909,-1.280555e+00,-1.519730,1.949175e-01,-1.608935,-1.869558e+00,-1.367910,0.456867,-1.402995,-1.424372,...,0.045033,4.289414e-01,0.750240,-0.983364,-0.289578,6.854595e-01,1.182982e+00,-1.062651,-1.478479,1
2,-1.537217,-5.553451e-16,-0.650065,-5.252529e-17,0.000000,6.497276e-16,-1.163039,-1.215250,-1.196449,-1.219480,...,-1.329474,7.496885e-16,1.720969,0.351148,0.488987,-3.900947e-16,1.659176e+00,0.000000,-0.753227,1
3,-1.467526,-1.139871e+00,-1.688052,-2.308777e-01,0.000000,6.497276e-16,-1.163039,-1.063239,-0.989904,-1.014587,...,-0.413136,1.642298e+00,2.157797,1.730144,1.971256,1.563875e+00,1.659176e+00,-1.401942,-1.667045,0
4,-1.397834,-9.366604e-01,-1.856374,2.252927e+00,0.000000,6.497276e-16,-0.958167,1.292926,-0.783358,-0.809695,...,0.045033,6.927146e-01,1.915115,-1.250267,-0.978309,1.563875e+00,1.659176e+00,-1.309408,-1.826600,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1.111058,1.611286e+00,1.636313,-2.781883e-01,0.106555,1.019979e+00,0.271063,0.228851,-0.989904,1.239225,...,1.304998,4.289414e-01,-0.948536,-0.805429,-1.128033,-1.071371e+00,-1.674184e+00,1.636254,1.669116,0
96,-1.049377,1.361181e+00,0.766648,-2.781883e-01,0.000000,6.497276e-16,0.271063,0.228851,0.249369,1.649009,...,0.159575,3.234321e-01,0.313412,0.440115,0.503960,-3.900947e-16,4.229456e-16,0.942250,0.726288,0
97,-0.282771,-8.272394e-01,-0.762280,-2.781883e-01,-1.481862,-1.284335e+00,1.500293,-0.379191,1.282097,0.624549,...,-0.871305,8.509785e-01,-0.026343,-0.256797,-0.469247,-3.900947e-16,4.229456e-16,-0.923850,-0.869267,1
98,-0.213080,5.014445e-01,2.085172,-2.781883e-01,0.000000,6.497276e-16,-0.958167,-0.531202,0.662460,1.239225,...,-1.100390,-2.208791e+00,-2.016338,-0.538527,-0.858529,-3.900947e-16,4.229456e-16,1.296963,2.292833,0


### Exploration

Now, you should explore the dataset and answer the following question (drawing on skills from the first two labs; hint hint, plot some stuff!):

1. What are some features that are important for this predictive task?
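
One possible starting point, sketched under assumptions: rank the numeric features by the absolute value of their Pearson correlation with `RainTomorrow`. The helper name `rank_features` is hypothetical, and the demo frame is synthetic (the `Noise` column is invented), so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

def rank_features(df: pd.DataFrame, label: str = "RainTomorrow") -> pd.Series:
    """Rank numeric features by absolute Pearson correlation with the label."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[label].drop(label)
    return corr.abs().sort_values(ascending=False)

# Synthetic stand-in for the preprocessed data: one informative feature, one noise feature
rng = np.random.default_rng(0)
humidity = rng.normal(size=500)
demo = pd.DataFrame({
    "Humidity3pm": humidity,
    "Noise": rng.normal(size=500),
    "RainTomorrow": (humidity + 0.5 * rng.normal(size=500) > 0).astype(int),
})

print(rank_features(demo))
```

On the lab data this could be called as `rank_features(preprocess('Lab3_train.csv'))`. Correlation only captures linear relationships, so plots (for example, per-class histograms) are still worth making.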

## Part 3: Train a Classifier

Now that you've explored your data, your final task is to fit at least one of the classifiers from the class so far (i.e., `LogisticRegression`, `SVC`, or `KNeighborsClassifier`) on this task. Report its precision, recall, and F$_1$ score using the code snippet at the bottom.

In [57]:
from sklearn.linear_model import LogisticRegression


def fit_predict(train_fname: str, test_fname: str) -> np.array: 
    """ Fit a logistic regression model and return its predictions on test data 

    Args:
        train_fname (str): Name of the training file 
        test_fname (str): Name of the testing file
    Returns:
        np.array: Predictions of the model on test data

    Note: 
        Make sure you preprocess both your train and test data!"""

    train_data = preprocess(train_fname)
    test_data = preprocess(test_fname)

    X_train = train_data.drop(columns=['RainTomorrow'])  
    y_train = train_data['RainTomorrow']

    X_test = test_data.drop(columns=['RainTomorrow']) 

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)

    # Make predictions
    y_pred = classifier.predict(X_test)
    
    
    return y_pred
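
One thing to be aware of: `fit_predict` preprocesses each file separately, so the test file is standardized with its own means and standard deviations, and `pd.factorize` may assign different integer codes to the same category in each file. A possible alternative, sketched here with hypothetical helpers `fit_scaler` and `apply_scaler` (they are not part of the lab code), is to compute the statistics on the training data and reuse them on the test data.

```python
import pandas as pd

def fit_scaler(train: pd.DataFrame, label: str = "RainTomorrow") -> tuple[pd.Series, pd.Series]:
    """Compute per-feature mean and standard deviation from the training data only."""
    features = train.drop(columns=[label])
    return features.mean(), features.std()

def apply_scaler(df: pd.DataFrame, means: pd.Series, sds: pd.Series) -> pd.DataFrame:
    """Standardize df with previously computed statistics; other columns are left as-is."""
    out = df.copy()
    out[means.index] = (out[means.index] - means) / sds
    return out

# Tiny made-up frames to show the idea
train = pd.DataFrame({"MaxTemp": [10.0, 20.0, 30.0], "RainTomorrow": [0, 1, 0]})
test = pd.DataFrame({"MaxTemp": [20.0, 40.0], "RainTomorrow": [1, 0]})

means, sds = fit_scaler(train)               # mean 20, sd 10, from train only
test_scaled = apply_scaler(test, means, sds)
print(test_scaled)
```

With this approach a test value of 20 maps to 0 because it equals the training mean, regardless of what else is in the test file.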




The following function can be used to get the precision, recall, and F$_1$ score from our model.

In [58]:
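def score(test_fname: str, Y_pred: np.array) -> tuple[float, float, float]:
    """ Returns (precision, recall, f1) for predictions against the labels in test_fname """
    test = preprocess(test_fname)
    # scale() leaves the label column untouched, so it still holds the gold 0/1 labels
    Y = test['RainTomorrow']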

    precision = metrics.precision_score(Y, Y_pred)
    recall = metrics.recall_score(Y, Y_pred)
    f1 = metrics.f1_score(Y, Y_pred)

    return precision, recall, f1
    
Y_pred = fit_predict('Lab3_train.csv', 'Lab3_valid.csv')
print(score('Lab3_valid.csv', Y_pred))

(0.7130031856356791, 0.4875247524752475, 0.5790897330353992)


Our precision is 71.3%, recall is 48.8%, and F$_1$ is 57.9%.
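
As a sanity check, F$_1$ is the harmonic mean of precision and recall, so the third number can be recomputed from the first two. The values below are copied from the output above.

```python
precision = 0.7130031856356791
recall = 0.4875247524752475

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5791
```

The low recall relative to precision means the model misses about half of the rainy days, even though most of its rain predictions are correct.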