**CS 4774: Machine Learning Final Project - KMeans Approach**

Author: Donovan Ray (DonovanRay26)

## Data Preprocessing

We'll utilize pandas, numpy, and sklearn to preprocess our data, imputing numerical features and applying one-hot encoding to categorical features. 

In [31]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# training data:
train_raw = pd.read_csv('data/train.csv')
test_raw = pd.read_csv('data/test.csv')

print("Train data shape: ", train_raw.shape)
print("Test data shape: ", test_raw.shape)
print(train_raw.head())

# get features and targets:
X_train = train_raw.drop(['SalePrice', 'Id'], axis=1)  # I think that only Id needs to be dropped before PCA
y_train = train_raw["SalePrice"]
X_test = test_raw.copy()  # can just copy as test.csv doesn't have the target

# separate numerical and categorical features:
numFeatures = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
catFeatures = X_train.select_dtypes(include=['object']).columns.tolist()

# utilize pipelines for preprocessing:

numPipeline = Pipeline([('imputer', SimpleImputer(strategy='median')),
                        ('scaler', StandardScaler())])

catPipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

# combine workflows:
preprocessor = ColumnTransformer([('numerical', numPipeline, numFeatures),
                                  ('categorical', catPipeline, catFeatures)])

# now, fit and transform data:

# use preprocessor to process train and test data:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# convert to pd dataframes:

# need to concatenate processed numerical and categorical features:
numFeature_names = numFeatures
catFeature_names = preprocessor.named_transformers_['categorical'].named_steps['onehot'].get_feature_names_out(catFeatures)

# concatenate
totalFeatures = np.concatenate((numFeature_names, catFeature_names))

# convert to dataframes:
X_train_processed = pd.DataFrame(X_train_processed, columns=totalFeatures)
X_test_processed = pd.DataFrame(X_test_processed, columns=totalFeatures)

print("Processed Train dataset: ", X_train_processed.shape)
print("Processed Test dataset: ", X_test_processed.shape)
print(X_train_processed.head())

# write out preprocessed data:
X_train_processed.to_csv('data/train_processed.csv', index=False)
X_test_processed.to_csv('data/test_processed.csv', index=False)

Train data shape:  (1460, 81)
Test data shape:  (1459, 80)
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold 

## Implementation of K Nearest Neighbors

Will use Euclidian and Manhattan distance, as well as weighted and unweighted KNN

In [6]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

class KNN:
    def __init__(self, k, distance, weighted=False):
        self.k = k
        self.distance = distance
        self.weighted = weighted
        self.X_train = None
        self.y_train = None
        self.X_test = None
        self.pca = None
        
    # fit
    def fit(self, X, y, pca_components=None):
        # fit with PCA (if chosen):
        if pca_components is not None:
            self.pca = PCA(n_components=pca_components)
            self.X_train = self.pca.fit_transform(X)
            
        # normal fit:
        else:
            self.X_train = X if isinstance(X, list) else X
        self.y_train = y if isinstance(y, list) else y
        
        # convert to numpy arrays:
        self.X_train = np.array(self.X_train)
        self.y_train = np.array(self.y_train)
        
    # helper method to calculate distances:
    def calculateDistance(self, p1, p2):
        p1 = np.array(p1, dtype=float)
        p2 = np.array(p2, dtype=float)
        if self.distance.lower() == 'manhattan':
            return np.sum(np.abs(p1 - p2))
        elif self.distance.lower() == 'euclidian':
            return np.sqrt(np.sum(p1 - p2) ** 2) 
        elif self.distance.lower() == 'chebyshev':
            return np.max(np.abs(p1 - p2))
        elif self.distance.lower() == 'cosine similarity':
            norm_p1 = np.linalg.norm(p1)
            norm_p2 = np.linalg.norm(p2)
            
            if norm_p1 == 0 or norm_p2 == 0:
                return 1.0
            else:
                cosSim = np.dot(p1, p2) / (norm_p1 * norm_p2)
                return 1 - cosSim
        
        else:
            return None  # invalid metric
         
    # predict
    def predict(self, X):
        if self.pca:
            X = self.pca.transform(X)
        else:
            X = X.values if isinstance(X, pd.DataFrame) else X
            
        predictions = []
        
        for x in X:
            # compute distance:
            distances = [self.calculateDistance(x, x_train) for x_train in self.X_train]
            
            # get knns
            knn_indices = np.argsort(distances)[:self.k]
            knn_distances = [distances[i] for i in knn_indices]
            knn_prices = [self.y_train[i] for i in knn_indices]
            
            # if weighted:
            if self.weighted:
                weights = 1 / np.array(knn_distances) + 1e-8  # avoid div by 0
                prediction = np.average(knn_prices, weights=weights)
            else:
                # calculate prediction using mean price of knns:
                prediction = np.mean(knn_prices)
            
            # append prediction:
            predictions.append(prediction)
            
        return np.array(predictions)
    
    # get accuracy metrics in format [RMSE, MAE, R2]:
    def measure_accuracy(self, X, y):
        y_pred = self.predict(X)
        RMSE = np.sqrt(mean_squared_error(y, y_pred))
        MAE = mean_absolute_error(y, y_pred)
        R2 = r2_score(y, y_pred)
        return [RMSE, MAE, R2]


## Application of Model:

Utilize an 80/20 test-train split on training data and calculate error to find best configuration for optimal KNN model.

In [15]:
from sklearn.model_selection import train_test_split

# load data:
X = pd.read_csv('data/train_processed.csv')
y = pd.read_csv('data/train.csv')['SalePrice']
competition_test = pd.read_csv('data/test_processed.csv')
raw_competition_test = pd.read_csv('data/test.csv')  # don't use processed file for the Id's

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bestRMSE_stats = [None, None, None, None]  # format: RMSE, k, pca, weighted?

# find the best choice of k:

"""
for k in [2, 3, 4, 5, 6, 7, 8, 9, 10]:
    for pca in [2, 3, 5, 10, 15, 25, 30]:
        for weighted in [True, False]:
            knn = KNN(k=k, distance='cosine similarity', weighted=weighted)
            knn.fit(X_train, y_train, pca_components=pca)
            [RMSE, MAE, R2] = knn.measure_accuracy(X_test, y_test)
            print(f'k = {k}, pca = {pca}, weighted: {weighted}  |  RMSE = {RMSE:.2f} | MAE = {MAE:.2f} | R2 = {R2:.2f}')
            
            if bestRMSE_stats[0] is None:
                bestRMSE_stats = [RMSE, MAE, R2, k, pca, weighted]
            elif bestRMSE_stats[0] > RMSE:
                bestRMSE_stats = [RMSE, MAE, R2, k, pca, weighted]

print(f'Best Configuration: k: {bestRMSE_stats[3]} | pca: {bestRMSE_stats[4]} | weighted: {bestRMSE_stats[5]} | RMSE: {bestRMSE_stats[0]:.2f} | MAE: {bestRMSE_stats[1]:.2f} | R2: {bestRMSE_stats[2]:.2f}')
"""
# make prediction of test data using optimal model:

# optimal Manhattan configuration: k = 3 | 2 PCA components | weighted
# optimal Chebyshev configuration: k = 4 | 5 PCA components | weighted

knn_manhattan = KNN(k=3, distance='manhattan', weighted=True)
knn_manhattan.fit(X_train, y_train, pca_components=2)
knn_manhattan_pred = knn_manhattan.predict(competition_test)

knn_chebyshev = KNN(k=4, distance='chebyshev', weighted=True)
knn_chebyshev.fit(X_train, y_train, pca_components=5)
knn_chebyshev_pred = knn_chebyshev.predict(competition_test)

# write out predictions:
manhattanDF = pd.DataFrame({'Id': raw_competition_test['Id'], 
                            'SalePrice': knn_manhattan_pred})
chebyshevDF = pd.DataFrame({'Id': raw_competition_test['Id'],
                            'SalePrice': knn_chebyshev_pred})
manhattanDF.to_csv('data/manhattan_prediction.csv', index=False)
chebyshevDF.to_csv('data/chebyshev_prediction.csv', index=False)
print("Complete.")

Complete.


## Current Results:

Euclidian Distance: k = 9, 3 PCA components, not weighted, RMSE = 57148.57, MAE = 38546.71, R2: 0.58

Manhattan Distance: k = 3, 2 PCA components, weighted, RMSE = 31922.75, MAE = 22905.35, R2: 0.87

Chebyshev Distance: k = 4, 5 PCA components, weighted, RMSE = 31039.14, MAE = 19930.17, R2: 0.87

Cosine Similarity: k = 10, 5 PCA components, weighted, RMSE = 36058.40, MAE = 21516.81, R2: 0.83

It would seem that choice of distance metric plays a large role in the effectivity of the model. Euclidian distance yields the worst overall results, Cosine Similarity comes in second with fairly strong results, and Manhattan and Chebyshev are tied for first in terms of R2 score, but Chebyshev has a slightly lower RMSE and MAE.

I will submit both a Chebyshev and Manhattan model to the competition with the specified configurations and compare results.

The Chebyshev distance KNN model received a score of 19423.31579, and the Manhattan distance KNN model received a score of 22953.36611. Thus, the KNN model that leveraged the Chebyshev distance metric performed better, and ranked 3500/6880 in the Kaggle competition. 