## KNN Regression Model

**1st KNN Model**
The input configuration uses all the numeric variables as features and leaves out only the target variable (the MSOA income estimate).

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
import joblib

In [10]:
# Loading the preprocessed dataset
data_path = '../../data/processed/lsoa_census_standardized.csv'
data = pd.read_csv(data_path)

# Selecting only numeric columns for features
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Ensuring that the target column is excluded from features
X = data[numerical_cols].drop(columns=['net_income_after_housing_costs_()'])  # Excluding the target column
y = data['net_income_after_housing_costs_()']

# Identifying the non-numeric (object) columns
categorical_cols = data.select_dtypes(include=['object']).columns

# Safeguard: X was built from numeric columns only, so this drop is normally a no-op
X = X.drop(columns=categorical_cols, errors='ignore')

# Splitting data into training and testing sets (70/30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training the KNN Regressor
knn = KNeighborsRegressor(n_neighbors=5)  # Adjust n_neighbors based on optimization
knn.fit(X_train, y_train)

# Making Predictions
y_pred = knn.predict(X_test)

# Save Actual and Predicted Values
evaluation_data = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})

evaluation_data_path = '../../data/processed/evaluation_data.csv'
evaluation_data.to_csv(evaluation_data_path, index=False)
print(f"Evaluation data saved to {evaluation_data_path}")

# Saving the trained KNN model
knn_model_path = '../../model/knn_model.pkl'  # Adjust the path as needed
joblib.dump(knn, knn_model_path)
print(f"Model saved to {knn_model_path}")


Evaluation data saved to ../../data/processed/evaluation_data.csv
Model saved to ../../model/knn_model.pkl
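
The cell above imports `mean_absolute_error`, `mean_squared_error` and `r2_score` but only saves the actual and predicted values. A minimal sketch of how those metrics could be computed from a frame with the same `Actual` and `Predicted` columns follows. The helper name `regression_metrics` and the toy numbers are illustrative assumptions, not results from this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(evaluation_data: pd.DataFrame) -> dict:
    """Return MAE, RMSE and R^2 for a frame with 'Actual' and 'Predicted' columns."""
    actual = evaluation_data["Actual"]
    predicted = evaluation_data["Predicted"]
    mse = mean_squared_error(actual, predicted)
    return {
        "MAE": float(mean_absolute_error(actual, predicted)),
        "RMSE": float(np.sqrt(mse)),  # square root of the MSE, in the target's units
        "R2": float(r2_score(actual, predicted)),
    }


# Made-up values, only to show the call; these are not real model outputs.
toy = pd.DataFrame({
    "Actual": [1.0, 2.0, 3.0, 4.0],
    "Predicted": [1.0, 2.0, 3.0, 6.0],
})
toy_metrics = regression_metrics(toy)
print(toy_metrics)
```

In the notebook, the same helper could be called on `evaluation_data` directly, or on the CSV reloaded with `pd.read_csv(evaluation_data_path)`.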


**2nd KNN Model**
This model uses the dataset after 40+ variables were removed due to low feature importance. I want to see how it performs and whether the reduced feature set improves the model.
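
`KNeighborsRegressor` does not expose a built-in feature importance attribute, so the importances have to come from a model-agnostic method. The notebook does not say which one was used here. As one possible approach, the sketch below uses scikit-learn's `permutation_importance` on synthetic stand-in data; the column names `signal` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): 'signal' drives the target, 'noise' does not.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
knn_demo = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)

# Shuffle each column in turn on the held-out rows and record the mean drop in R^2.
result = permutation_importance(knn_demo, X_te, y_te, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Columns whose shuffling barely changes the score are the candidates for removal.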

In [14]:
# Loading the preprocessed dataset
data_path = '../../data/processed/knn_many_removed.csv'
data = pd.read_csv(data_path)

# Selecting only numeric columns for features
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Ensuring that the target column is excluded from features
X = data[numerical_cols].drop(columns=['net_income_after_housing_costs_()'])  # Excluding the target column
y = data['net_income_after_housing_costs_()']

# Identifying the non-numeric (object) columns
categorical_cols = data.select_dtypes(include=['object']).columns

# Safeguard: X was built from numeric columns only, so this drop is normally a no-op
X = X.drop(columns=categorical_cols, errors='ignore')

# Splitting data into training and testing sets (70/30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training the KNN Regressor
knn = KNeighborsRegressor(n_neighbors=5)  # Adjust n_neighbors based on optimization
knn.fit(X_train, y_train)

# Making Predictions
y_pred = knn.predict(X_test)

# Save Actual and Predicted Values
evaluation_data_2 = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})

evaluation_data_path_2 = '../../data/processed/evaluation_data_2.csv'
evaluation_data_2.to_csv(evaluation_data_path_2, index=False)
print(f"Evaluation data saved to {evaluation_data_path_2}")

Evaluation data saved to ../../data/processed/evaluation_data_2.csv
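
Each training cell fixes `n_neighbors=5` with a comment that it should be adjusted based on optimization. A minimal sketch of one way to do that, using `GridSearchCV` with 5-fold cross-validation, is shown below. The candidate values, the scoring choice and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for X_train / y_train (assumption, not the census data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=200)

# Candidate neighbourhood sizes to compare (illustrative choice).
param_grid = {"n_neighbors": [3, 5, 7, 9, 11, 15]}

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,                                # 5-fold cross-validation
    scoring="neg_mean_absolute_error",   # scikit-learn maximises, hence the negated MAE
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(best_k, -search.best_score_)
```

In the notebook, the search would be fitted on `X_train` and `y_train` only, so that the test set stays untouched until the final evaluation.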


**3rd KNN Model - Feature Importance After Removing the Confidence Intervals**

In [16]:
# Loading the preprocessed dataset
data_path = '../../data/processed/knn_confidence_interval.csv'
data = pd.read_csv(data_path)

# Selecting only numeric columns for features
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Ensuring that the target column is excluded from features
X = data[numerical_cols].drop(columns=['net_income_after_housing_costs_()'])  # Excluding the target column
y = data['net_income_after_housing_costs_()']

# Identifying the non-numeric (object) columns
categorical_cols = data.select_dtypes(include=['object']).columns

# Safeguard: X was built from numeric columns only, so this drop is normally a no-op
X = X.drop(columns=categorical_cols, errors='ignore')

# Splitting data into training and testing sets (70/30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training the KNN Regressor
knn = KNeighborsRegressor(n_neighbors=5)  # Adjust n_neighbors based on optimization
knn.fit(X_train, y_train)

# Making Predictions
y_pred = knn.predict(X_test)

# Save Actual and Predicted Values
evaluation_data_3 = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})

evaluation_data_path_3 = '../../data/processed/evaluation_data_3.csv'
evaluation_data_3.to_csv(evaluation_data_path_3, index=False)
print(f"Evaluation data saved to {evaluation_data_path_3}")

Evaluation data saved to ../../data/processed/evaluation_data_3.csv


**4th KNN Model - Feature Importance After PCA**
Here most of the crime variables were combined, as were most of the education, race, income support and benefits variables, among others. I noticed that the top two feature importances were closer to each other than in any of the earlier models. PCA looked beneficial for combining variables within the same category. I want to compare this model's evaluation metrics with those of the others, because those models performed badly compared to the first model (whose input configuration excluded only the target variable).
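
Combining the variables of one category with PCA can be sketched as follows. The column names (`crime_a`, `crime_pc1` and so on) and the synthetic data are hypothetical, and the sketch assumes the inputs are already standardized, as the earlier file name suggests. It keeps only the first principal component per category, which may differ from what was done to produce `knn_pca.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in (assumption): three correlated "crime" columns and one unrelated column.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "crime_a": base + rng.normal(scale=0.1, size=200),
    "crime_b": base + rng.normal(scale=0.1, size=200),
    "crime_c": base + rng.normal(scale=0.1, size=200),
    "other": rng.normal(size=200),
})

crime_cols = ["crime_a", "crime_b", "crime_c"]

# Replace the category's columns with their first principal component.
pca = PCA(n_components=1)
demo["crime_pc1"] = pca.fit_transform(demo[crime_cols])[:, 0]
reduced = demo.drop(columns=crime_cols)

# Share of the category's variance retained by the single component.
explained = pca.explained_variance_ratio_[0]
print(reduced.columns.tolist(), round(float(explained), 3))
```

If the PCA is fitted before the train/test split, the components also see the test rows; fitting it on the training rows and then transforming the test rows avoids that.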

In [7]:
# Loading the preprocessed dataset
data_path = '../../data/processed/knn_pca.csv'
data = pd.read_csv(data_path)

# Selecting only numeric columns for features
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Ensuring that the target column is excluded from features
X = data[numerical_cols].drop(columns=['net_income_after_housing_costs_()'])  # Excluding the target column
y = data['net_income_after_housing_costs_()']

# Identifying the non-numeric (object) columns
categorical_cols = data.select_dtypes(include=['object']).columns

# Safeguard: X was built from numeric columns only, so this drop is normally a no-op
X = X.drop(columns=categorical_cols, errors='ignore')

# Splitting data into training and testing sets (70/30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training the KNN Regressor
knn = KNeighborsRegressor(n_neighbors=5)  # Adjust n_neighbors based on optimization
knn.fit(X_train, y_train)

# Making Predictions
y_pred = knn.predict(X_test)

# Save Actual and Predicted Values
evaluation_data_4 = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})

evaluation_data_path_4 = '../../data/processed/evaluation_data_4.csv'
evaluation_data_4.to_csv(evaluation_data_path_4, index=False)
print(f"Evaluation data saved to {evaluation_data_path_4}")

Evaluation data saved to ../../data/processed/evaluation_data_4.csv
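
The four model cells repeat the same steps and differ only in the input file and the output path. As a possible refactor, the shared steps could live in one function. The sketch below uses a hypothetical helper named `fit_and_predict` and runs it on synthetic data; the target column name, split and `n_neighbors` are taken from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

TARGET = "net_income_after_housing_costs_()"


def fit_and_predict(data: pd.DataFrame, target: str = TARGET, n_neighbors: int = 5) -> pd.DataFrame:
    """Fit a KNN regressor on the numeric columns and return actual vs predicted test values."""
    # Same dtype filter as the notebook cells; object columns are excluded here.
    numeric = data.select_dtypes(include=["float64", "int64"])
    X = numeric.drop(columns=[target])
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    knn = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_train, y_train)
    return pd.DataFrame({"Actual": y_test, "Predicted": knn.predict(X_test)})


# Synthetic stand-in for one of the processed CSV files (assumption).
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "feature_1": rng.normal(size=200),
    "feature_2": rng.normal(size=200),
    "area_name": ["area"] * 200,  # non-numeric column, ignored by the dtype filter
})
demo[TARGET] = 2.0 * demo["feature_1"] + rng.normal(scale=0.1, size=200)

evaluation_demo = fit_and_predict(demo)
print(evaluation_demo.head())
```

With the real data, each model would then reduce to `fit_and_predict(pd.read_csv(data_path))` followed by the existing `to_csv` call.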
