# K-Nearest Neighbors Regressor (KNN)

In this notebook, I use the K-Nearest Neighbors (KNN) algorithm to predict house prices for the Kaggle Housing Prices competition. KNN is a straightforward, instance-based learning method that makes predictions based on the similarity between data points.

## How it works in this project:
For each house in the test set, the algorithm finds the 'k' most similar houses in the training set, where similarity is measured by the distance between their feature values (such as size, location, and number of rooms).
The predicted price is the average price of these 'k' nearest neighbors.
The value of 'k' is a key hyperparameter and is chosen by comparing cross-validated model performance.
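
To make the averaging step concrete, here is a minimal from-scratch sketch of KNN regression. It is only an illustration, not the code used later in this notebook: the helper `knn_predict`, the toy feature values, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Predict a target as the mean of the k nearest training targets."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(row, query), target)
                 for row, target in zip(train_X, train_y)]
    # keep the k closest points (sorted by distance only)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(target for _, target in nearest) / k

# toy, already-scaled features: [living area, number of rooms] (made-up values)
train_X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [3.0, 3.0]]
train_y = [100_000, 110_000, 200_000, 210_000, 400_000]

prediction = knn_predict(train_X, train_y, query=[1.0, 0.95], k=2)
print(prediction)  # 205000.0, the mean of the two closest prices
```

With `k=2` the two closest houses are the ones priced 200,000 and 210,000, so the prediction is their mean. A larger 'k' would pull in more distant houses and smooth the estimate.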

## Why KNN?
KNN does not make strong assumptions about the underlying data distribution, making it flexible for various types of data.
It is easy to implement and interpret, providing a good baseline for regression tasks like house price prediction.

## Important notes for this notebook
Since KNN relies on distance calculations, all numerical features are standardized so that no single feature dominates the distance simply because of its units.
The model’s performance is evaluated using cross-validation to select the optimal value of 'k'.
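
One possible way to combine scaling with a cross-validated search over 'k' is sketched below. It runs on synthetic stand-in data rather than the Kaggle set, and the candidate values of 'k', the 5-fold split, and the RMSE scoring are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data: two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(500, 4000, 200),   # e.g. area in square feet
                     rng.integers(1, 8, 200)])      # e.g. number of rooms
y = 50 * X[:, 0] + 10_000 * X[:, 1] + rng.normal(0, 5_000, 200)

# the scaler sits inside the pipeline, so each CV fold is scaled
# using statistics from its own training split only
model = Pipeline([('scaler', StandardScaler()),
                  ('knn', KNeighborsRegressor())])

# hypothetical candidate values for k
param_grid = {'knn__n_neighbors': [1, 3, 5, 7, 9, 11, 15]}

search = GridSearchCV(model, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error')
search.fit(X, y)

best_k = search.best_params_['knn__n_neighbors']
print("best k:", best_k)
print("cross-validated RMSE:", -search.best_score_)
```

Keeping the scaler inside the pipeline means the validation fold never influences the scaling statistics, which gives a more honest estimate of how each 'k' would perform on unseen houses.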



# Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# load training and test data:
train_raw = pd.read_csv('data/train.csv')
test_raw = pd.read_csv('data/test.csv')

print("Train data shape: ", train_raw.shape)
print("Test data shape: ", test_raw.shape)
print(train_raw.head())

# get features and targets:
X_train = train_raw.drop(['SalePrice', 'Id'], axis=1)  # drop the target and the Id column (an identifier, not a feature)
y_train = train_raw["SalePrice"]
X_test = test_raw.drop('Id', axis=1)  # test.csv has no target, so only Id is dropped

# separate numerical and categorical features:
numFeatures = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
catFeatures = X_train.select_dtypes(include=['object']).columns.tolist()

# utilize pipelines for preprocessing:

numPipeline = Pipeline([('imputer', SimpleImputer(strategy='median')),
                        ('scaler', StandardScaler())])

catPipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

# combine workflows:
preprocessor = ColumnTransformer([('numerical', numPipeline, numFeatures),
                                  ('categorical', catPipeline, catFeatures)])

# fit the preprocessor on the training data only, then apply the same
# fitted transformation to the test data:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# convert to pd dataframes:

# need to concatenate processed numerical and categorical features:
numFeature_names = numFeatures
catFeature_names = preprocessor.named_transformers_['categorical'].named_steps['onehot'].get_feature_names_out(catFeatures)

# concatenate
totalFeatures = np.concatenate((numFeature_names, catFeature_names))

# convert to dataframes:
X_train_processed = pd.DataFrame(X_train_processed, columns=totalFeatures)
X_test_processed = pd.DataFrame(X_test_processed, columns=totalFeatures)

print("Processed Train dataset: ", X_train_processed.shape)
print("Processed Test dataset: ", X_test_processed.shape)
print(X_train_processed.head())

# write out preprocessed data:
X_train_processed.to_csv('data/train_processed.csv', index=False)
X_test_processed.to_csv('data/test_processed.csv', index=False)
