## ML Final Exam
## Question 3: Deep Learning Based Regression Task (20 Marks):
The "House_Price_Prediction" dataset is designed to represent properties with features like size,
number of bedrooms, number of bathrooms, age, and proximity to the city center. The target variable
is the property price. Given the dataset, perform the following tasks to predict the price of properties
based on various features.


In [None]:
# %pip install openpyxl

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import keras
import warnings

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, r2_score, mean_absolute_error

from keras.models import Sequential
from keras.layers import Dense, Dropout, InputLayer
from keras.regularizers import l2
from keras import regularizers
from keras.callbacks import EarlyStopping, ModelCheckpoint, Callback

warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline

### Load the data

In [None]:
df_ML_q3 = pd.read_excel('House_Price_Prediciton_Data.xlsx')
df_ML_q3.head()

### a) Data Preprocessing (4 marks)


In [None]:
# checking the data type
df_ML_q3.dtypes

In [None]:
df_ML_q3.shape

In [None]:
# checking missing values
df_ML_q3.isnull().sum()

In [None]:
# Verify the nature of missing values in 'Age_of_Property', 'Proximity_to_City_Center', and 'Property_Price'
missing_Age_of_Property = df_ML_q3['Age_of_Property'].isnull()
missing_Proximity_to_City_Center = df_ML_q3['Proximity_to_City_Center'].isnull()
missing_Property_Price = df_ML_q3['Property_Price'].isnull()

# Check whether the three columns are missing in exactly the same rows (identical null masks)
all_missing_values_together = (
    missing_Age_of_Property.equals(missing_Proximity_to_City_Center) and
    missing_Age_of_Property.equals(missing_Property_Price)
)

# Print the result
print(f"All rows with missing 'Age_of_Property' also have missing values in 'Proximity_to_City_Center' and 'Property_Price': {all_missing_values_together}")

In [None]:
# Check for irregular values (0) in the column 'Square_Feet'
filtered_rows = df_ML_q3[df_ML_q3['Square_Feet'] == 0]

# Display the filtered rows
filtered_rows

In [None]:
# Drop the row with Square_Feet == 0 found above (positional index 2902)
df_ML_q3 = df_ML_q3.drop(df_ML_q3.index[2902])

df_ML_q3.describe()

Handling missing values is an important step in preparing data for a machine learning model. Whether and how to fill them depends on the nature of the data and on the algorithm used.

Strategies to consider:

Filling Missing Values:

Missing values can be filled with the mean, median, or mode of the respective column. This is a common approach and works best when the values are missing at random. For 'Age_of_Property' and 'Proximity_to_City_Center', the mean or median of the column is a reasonable choice. The same could be done for 'Property_Price', but it is the target variable, so filling it means training on estimated labels; dropping those rows is the more cautious option.

Imputation using Machine Learning:

A model can predict missing values from the other features in the dataset. For example, a regression model could predict 'Age_of_Property' from 'Square_Feet', 'Bedrooms', and 'Bathrooms'. Similar models could be trained for 'Proximity_to_City_Center' and 'Property_Price'.

Handling Missing Values during Training:

Some algorithms, such as certain tree-based implementations, can handle missing values during training without explicit filling. A dense neural network cannot: a NaN in the input propagates through the layers and turns the loss into NaN, so the values must be imputed or the rows removed before training.

Removing Rows:

If the missing values are a small percentage of the data and removing those rows does not significantly affect how representative the dataset is, the rows can simply be dropped.

The best approach depends on the characteristics of the data, the distribution of the missing values, and the requirements of the model. It is worth trying different strategies and comparing their effect on model performance.

For deep learning models, the features should be normalized or standardized and the missing values handled before training. The dataset should also be split into training, validation, and test sets so that performance can be evaluated reliably.
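
The filling and row-removal strategies above can be sketched on a small made-up frame. The column names mirror the dataset, but the values and the variable names `toy`, `filled`, and `dropped` are invented for illustration; this is a minimal sketch, not the imputation used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented for illustration only
toy = pd.DataFrame({
    'Square_Feet': [1200, 1500, 900, 2000],
    'Age_of_Property': [10.0, np.nan, 35.0, 5.0],
    'Property_Price': [250000.0, 310000.0, np.nan, 420000.0],
})

# Strategy 1: fill a feature column with its median
filled = toy.copy()
age_median = filled['Age_of_Property'].median()
filled['Age_of_Property'] = filled['Age_of_Property'].fillna(age_median)

# Strategy 2: drop rows whose target is missing, since an imputed
# target would be an estimate rather than an observed label
dropped = toy.dropna(subset=['Property_Price'])

print(filled)
print(dropped)
```

Median filling keeps every row at the cost of flattening the column's distribution slightly, while dropping rows keeps only observed values at the cost of a smaller dataset.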

In [None]:
# Drop rows with negative 'Property_Price'.
# Note: NaN >= 0 evaluates to False, so rows with a missing price are dropped here as well.
df_ML_q3 = df_ML_q3[df_ML_q3['Property_Price'] >= 0]

# Reset the index after dropping rows
df_ML_q3.reset_index(drop=True, inplace=True)

print(df_ML_q3.describe())

### Imputing the missing values with IterativeImputer(LinearRegression)

In [None]:
# Select columns for imputation
cols_to_impute = ['Square_Feet', 'Bedrooms', 'Bathrooms', 'Age_of_Property', 'Proximity_to_City_Center', 'Property_Price']
imputation_df = df_ML_q3[cols_to_impute]

# Split data into training and testing sets
train_data, test_data = train_test_split(imputation_df, test_size=0.2, random_state=42)

# Impute missing values using an iterative imputer with linear regression
imputer = IterativeImputer(estimator=LinearRegression(), random_state=42, max_iter=100)
imputed_train_data = imputer.fit_transform(train_data)
imputed_test_data = imputer.transform(test_data)

# Convert imputed data to DataFrames, keeping the original row labels so that
# the .loc assignments below align each imputed row with its source row
imputed_train_df = pd.DataFrame(imputed_train_data, columns=cols_to_impute, index=train_data.index)
imputed_test_df = pd.DataFrame(imputed_test_data, columns=cols_to_impute, index=test_data.index)

# Update original DataFrame with imputed values
df_ML_q3.loc[train_data.index, cols_to_impute] = imputed_train_df
df_ML_q3.loc[test_data.index, cols_to_impute] = imputed_test_df

# Create a new DataFrame with imputed values
df_ML_q3_LiRegression_imputed = df_ML_q3.copy()
df_ML_q3_LiRegression_imputed[cols_to_impute] = imputer.transform(df_ML_q3_LiRegression_imputed[cols_to_impute])

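# Print the shape to check for dropped rows, then display head and description
# (only the last expression of a cell is shown automatically, so head() needs display())
print(df_ML_q3_LiRegression_imputed.shape)
display(df_ML_q3_LiRegression_imputed.head(5))
df_ML_q3_LiRegression_imputed.describe()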
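
One detail worth checking when imputed arrays are written back with `.loc`: pandas aligns a DataFrame on its index labels during assignment, so a frame built from a NumPy array (default labels 0, 1, 2, ...) only lands on the intended rows if it carries the labels of the rows it came from. A minimal sketch on made-up data follows; `demo`, `subset`, and the values are hypothetical, and the `+ 1` merely stands in for an imputer's output.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [10.0, 20.0, 30.0, 40.0]})
subset = demo.loc[[3, 0]]            # shuffled subset, like a train split
values = subset.to_numpy() + 1       # stand-in for imputer output: rows 3 and 0

# Default RangeIndex (0, 1): labels do not match the subset's labels (3, 0)
misaligned = pd.DataFrame(values, columns=['a'])
bad = demo.copy()
bad.loc[subset.index, ['a']] = misaligned

# Same values, carrying the labels of the rows they came from
aligned = pd.DataFrame(values, columns=['a'], index=subset.index)
good = demo.copy()
good.loc[subset.index, ['a']] = aligned

print(bad)
print(good)
```

In `bad`, the assignment looks up labels 3 and 0 in a frame that only has labels 0 and 1, so one row receives another row's value and the other becomes NaN. In `good`, each value returns to its source row.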


In [None]:
# Show distribution after imputation


# Determine the threshold for 'Age of Property'
age_threshold = df_ML_q3['Age_of_Property'].quantile(0.95) # or 0.99 if you want to be more conservative
# Filter out the outliers
df_ML_q3_filtered = df_ML_q3[df_ML_q3['Age_of_Property'] <= age_threshold]

# Determine the threshold for 'Proximity to City Center'
proximity_threshold = df_ML_q3['Proximity_to_City_Center'].quantile(0.95) # or 0.99

# Filter out the outliers
df_ML_q3_filtered = df_ML_q3_filtered[df_ML_q3_filtered['Proximity_to_City_Center'] <= proximity_threshold]

plt.figure(figsize=(12, 6))

# Plot for 'Age_of_Property' after filtering out outliers
ax1 = plt.subplot(2, 2, 1)
sns.histplot(df_ML_q3_filtered['Age_of_Property'], kde=True, color='blue', bins=30)
plt.title('Age of property')
plt.xlabel('Age of property')
plt.ylabel('Frequency')
# Format the x-axis tick labels to show the actual values
ax1.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
# Rotate the x-axis labels
ax1.tick_params(axis='x', rotation=45)

# Plot for 'Proximity_to_City_Center' after filtering out outliers
ax2 = plt.subplot(2, 2, 2)
sns.histplot(df_ML_q3_filtered['Proximity_to_City_Center'], kde=True, color='blue', bins=30)
plt.title('Proximity to city center')
plt.xlabel('Proximity to city center')
plt.ylabel('Frequency')
# Format the x-axis tick labels to show the actual values
ax2.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
# Rotate the x-axis labels
ax2.tick_params(axis='x', rotation=45)


plt.tight_layout(pad=2.0)
plt.show()


plt.figure(figsize=(10, 5))

# Plot for 'Property_Price' after imputation (df_ML_q3 was updated in place above)
ax = plt.subplot(2,2,3)
sns.histplot(df_ML_q3['Property_Price'], kde=True, color='blue', bins=30)
plt.title('Property price')
plt.xlabel('Property price')
plt.ylabel('Frequency')
# Format the x-axis tick labels to show the actual values
ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
# Rotate the x-axis labels
ax.tick_params(axis='x', rotation=45)
plt.tight_layout(pad=2.0)
plt.show()

In [None]:
# Check to ensure that Missing values are imputed.
df_ML_q3_LiRegression_imputed.isnull().sum()

In [None]:
# Plot - Price distribution by the number of bedroom and bathroom


plt.figure(figsize=(5, 3))
ax = sns.boxplot(x='Bedrooms', y='Property_Price', data=df_ML_q3)
plt.title('Property Price Distribution by Number of Bedrooms')
# Format the y-axis label to show actual prices
ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
plt.show()

plt.figure(figsize=(5, 3))
ax = sns.boxplot(x='Bathrooms', y='Property_Price', data=df_ML_q3)
plt.title('Property Price Distribution by Number of Bathrooms')
# Format the axis label
ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
plt.show()

### Defining X and y

In [None]:
# separating the independent X and dependent variables y

# storing all the independent variables as X
X = df_ML_q3_LiRegression_imputed.drop('Property_Price', axis=1)

# storing the dependent variable as y
y = df_ML_q3_LiRegression_imputed['Property_Price']

### Preparing X

In [None]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit and transform X (note: fitting before the train/test split lets test-set statistics influence the scaling)
X = scaler.fit_transform(X)

In [None]:
print(X.shape)
print(y.shape)

### Create Training and Validation Sets

In [None]:
# split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# shapes of the training and test sets (the test set doubles as the validation set during training)
(X_train.shape, y_train.shape), (X_test.shape, y_test.shape)
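
A sketch of an alternative ordering, using synthetic data: split into training, validation, and test sets first, then fit the scaler on the training rows only and reuse it for the other two. The arrays `X_demo` and `y_demo`, the coefficients, and the 60/20/20 proportions are assumptions made for illustration, not the notebook's data or the required split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (invented for illustration)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 5))
y_demo = X_demo @ np.array([3.0, 1.0, 0.5, -2.0, -1.0]) + rng.normal(size=1000)

# 20% test, then 25% of the remainder as validation -> 60/20/20 overall
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

# Fit the scaler on the training rows only, then apply it everywhere
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_val_s = scaler_demo.transform(X_val)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.shape, X_val_s.shape, X_te_s.shape)
```

With this layout the validation set can drive early stopping and model selection, while the test set is touched only once for the final metrics.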

In [None]:
print(type(X_train))
X_train

In [None]:
print(type(X_test))
X_test

## b) Model Building (10 Marks)
## Best model (presented first and discussed further in the report)

In [None]:
# Define a custom callback to print validation loss
class CustomCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        val_loss = logs.get('val_loss')
        print(f'Epoch {epoch + 1}: val_loss = {val_loss:.5f}')

# Define early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Create the Sequential model
model = Sequential()

# Input Layer
model.add(Dense(64, activation='relu', input_shape=(X_train.shape[1],)))

# Hidden Layers
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))

# Output Layer
model.add(Dense(1, activation='linear'))

# Compiling the model (defining loss function, optimizer)
model.compile(optimizer='adam', loss='mean_squared_error')

# Summary of the model
model.summary()

# Define custom callback to print validation loss
custom_callback = CustomCallback()

# Train the model with early stopping and custom callback
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping, custom_callback])

# Evaluate the model on the test set
test_loss = model.evaluate(X_test, y_test)

# Predict on the test set
y_pred = model.predict(X_test)

test_mae = mean_absolute_error(y_test, y_pred)

# Calculate R^2 score
r2 = r2_score(y_test, y_pred)

# Report metrics on the test set
print(f"\nTest Mean Squared Error: {test_loss}")
print(f"Test Mean Absolute Error: {test_mae}")
print(f"R^2 Score on Test Set: {r2}")

# Visualization of true vs predicted values
plt.scatter(y_test, y_pred, alpha=0.5, color='blue', label='Predicted')
plt.scatter(y_test, y_test, alpha=0.5, color='red', label='True Value')
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.axis('square')
plt.xlim([0, plt.xlim()[1]])
plt.ylim([0, plt.ylim()[1]])
# Identity line (perfect predictions), drawn across the full axis range
# rather than a fixed [-100, 5000] segment
upper = max(plt.xlim()[1], plt.ylim()[1])
plt.plot([0, upper], [0, upper], color='green')
plt.legend()
plt.show()

The model performs well on the test set:

Test Mean Squared Error (MSE): the average squared difference between the predicted and the true values. The test MSE is 20,786,412, which corresponds to a root mean squared error of about 4,559 in the units of 'Property_Price'. Whether that is small depends on the scale of the prices, which is why R^2 is reported alongside it.

R^2 Score on Test Set: the proportion of the variance in the target that the model explains. A score of 0.997 is very close to 1, so the predictions track the true values closely.

In summary, the low MSE and the high R^2 score indicate that the model fits the test set very well. Because the same test set was also passed as validation_data for early stopping, these figures are likely somewhat optimistic as an estimate of performance on truly unseen data.
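
To make the metric definitions concrete, the following sketch computes MSE, RMSE, MAE, and R^2 by hand on four made-up price pairs and also converts the reported test MSE into an RMSE. The arrays `y_true_demo` and `y_hat_demo` are invented for illustration; only the figure 20,786,412 comes from the results above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted prices (illustration only)
y_true_demo = np.array([200_000.0, 350_000.0, 500_000.0, 650_000.0])
y_hat_demo = np.array([205_000.0, 340_000.0, 510_000.0, 645_000.0])

errors = y_true_demo - y_hat_demo
mse_demo = np.mean(errors ** 2)              # average squared error
rmse_demo = np.sqrt(mse_demo)                # back in price units
mae_demo = np.mean(np.abs(errors))           # average absolute error
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot              # share of variance explained

# RMSE implied by the test MSE reported above
reported_rmse = np.sqrt(20_786_412)

print(mse_demo, rmse_demo, mae_demo, r2_demo, reported_rmse)
```

Reporting RMSE or MAE next to MSE makes the error easier to read, because both are expressed in the same units as the price.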

## Alternative Deep Learning Model 1

In [None]:
# Define a custom callback to print validation loss
class CustomCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        val_loss = logs.get('val_loss')
        print(f'Epoch {epoch + 1}: val_loss = {val_loss:.5f}')

# Using X_train dimensions
input_dim = X_train.shape[1]

# Build the model
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=input_dim, kernel_regularizer=l2(0.01)))
model.add(Dropout(0.5))  # Add dropout layer
model.add(Dense(128, activation='relu', kernel_regularizer=l2(0.01)))
model.add(Dropout(0.5))  # Add dropout layer
model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
model.add(Dropout(0.5))  # Add dropout layer
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mse', metrics=['mse'])

# Define early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Define model checkpoint callback to save the best model
model_checkpoint = ModelCheckpoint(filepath='best_model.h5', save_best_only=True, monitor='val_loss', mode='min', verbose=1)

# Define custom callback to print validation loss
custom_callback = CustomCallback()

# Train the model with early stopping and custom callback
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping, model_checkpoint, custom_callback])


print('\nBest config Summary')
model.summary()

# Calculate R^2 score
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"\n R^2 Score: {r2}")

# Evaluate the model loss and metrics on the test set
test_results = model.evaluate(X_test, y_test)

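# Extract individual metrics: the loss is the MSE plus the L2 penalties,
# the second entry is the plain MSE from metrics=['mse']
test_loss = test_results[0]
test_mse = test_results[1]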

print(f"\nTest Loss: {test_loss:.2f}")
print(f"Test MSE: {test_mse:.2f}")

# Create variables for visualization
actual_values = y_test

# Flatten predictions to match the shape of actual_values
predicted_values = y_pred.flatten()  

# Plotting actual vs. predicted values
plt.scatter(y_test, y_pred, alpha=0.5, color='blue', label='Predicted')
plt.scatter(y_test, y_test, alpha=0.5, color='red', label='True Value')
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.axis('square')
plt.xlim([0, plt.xlim()[1]])
plt.ylim([0, plt.ylim()[1]])
# Identity line (perfect predictions), drawn across the full axis range
# rather than a fixed [-100, 5000] segment
upper = max(plt.xlim()[1], plt.ylim()[1])
plt.plot([0, upper], [0, upper], color='green')
plt.legend()
plt.show()

This model has three hidden dense layers with a dropout layer after each, plus a single-unit output layer. It has a total of 17,025 parameters.

Training stops after 3 consecutive epochs without improvement in validation loss (patience=3), or after 50 epochs at most. The metrics may change from run to run.

R^2 Score: approximately 0.93 on the test set. The model explains a high proportion of the variance in the target variable, although less than the best model above (0.997).

Test Metrics: the test loss is 559,731,008 and the test MSE is 487,240,384. Both come from model.evaluate: the loss is the MSE plus the L2 regularization penalties, which is why it is the larger figure, and the second value is the plain MSE (the model is compiled with metrics=['mse'], so no MAE is computed here). An MSE of 487,240,384 corresponds to a root mean squared error of roughly 22,000 in the units of 'Property_Price'.

Overall, the model learns the main patterns in the data but is clearly less accurate than the best model. The heavy regularization (dropout of 0.5 after every hidden layer plus L2 penalties) may contribute to this. An R^2 of 0.9865 was also recorded for this architecture, presumably in a different run, which is consistent with the run-to-run variation noted above. Whether this level of performance is acceptable depends on the requirements of the application.
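
The parameter count quoted above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add no parameters. The sketch below assumes 5 input features, which is the value consistent with the reported total; `dense_params` and `layer_sizes` are hypothetical names used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# Input width followed by the units of each Dense layer (5 inputs assumed)
layer_sizes = [5, 64, 128, 64, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)
print(total_params)
```

The per-layer counts are 384, 8,320, 8,256, and 65, so almost all of the capacity sits in the two middle layers.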

## Alternative Deep Learning Model 2

In [None]:
# Create a Sequential model
model = Sequential()

# Input Layer
model.add(Dense(64, activation='relu', input_shape=(X_train.shape[1],)))

# Hidden Layers with Dropout for Regularization
model.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.5))  # Add dropout for regularization
model.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.5))  # Add dropout for regularization

# Output Layer
model.add(Dense(1, activation='linear'))

# Compiling the model with a custom learning rate and mean absolute error loss
custom_optimizer = keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=custom_optimizer, loss='mean_absolute_error', metrics=['mse'])

# Display the model summary
model.summary()

# Define a callback to save the best model (lowest val_loss) in the native Keras format
checkpoint_filepath = 'best_model.keras'
model_checkpoint = ModelCheckpoint(filepath=checkpoint_filepath, save_best_only=True, monitor='val_loss', mode='min', verbose=1)

# Train the model with the defined callback
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test), callbacks=[model_checkpoint])

# Load the best model written by the checkpoint callback.
# (Calling model.save() on the same path here would overwrite it with the last-epoch weights.)
best_model = keras.models.load_model(checkpoint_filepath)

# Print details of the best model
print("Details of the Best Model:")
best_model.summary()

# Prediction
y_pred = best_model.predict(X_test)

# Calculate R^2 score
score = r2_score(y_test, y_pred)
print(f"R^2 Score: {score}")

# Evaluate the best model on the test set (loss = MAE plus L2 penalties, metric = MSE)
test_results = best_model.evaluate(X_test, y_test)
test_loss = test_results[0]
test_mse = test_results[1]

print(f"\nTest Loss: {test_loss:.2f}")
print(f"Test MSE: {test_mse:.2f}")

# Visualization of true vs predicted values
plt.scatter(y_test, y_pred, alpha=0.5, color='blue', label='Predicted')
plt.scatter(y_test, y_test, alpha=0.5, color='red', label='True Value')
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.axis('square')
plt.xlim([0, plt.xlim()[1]])
plt.ylim([0, plt.ylim()[1]])
# Identity line (perfect predictions), drawn across the full axis range
# rather than a fixed [-100, 5000] segment
upper = max(plt.xlim()[1], plt.ylim()[1])
plt.plot([0, upper], [0, upper], color='green')
plt.legend()
plt.show()

In [None]:
# Plot - True vs Predicted Values Distribution KDE


fig, ax = plt.subplots(figsize=(10,6))

# Plotting the KDE for true values
sns.kdeplot(y_test, color='blue', label='True Values', fill=True, ax=ax)

# Plotting the KDE for predicted values
sns.kdeplot(y_pred.flatten(), color='red', label='Predicted Values', fill=True, ax=ax)
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('True vs Predicted Values Distribution - KDE')

# Set x-axis limits to avoid negative values
ax.set_xlim(left=0)

# Format the x-axis tick labels to show the actual values
ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))

# Rotate the x-axis labels
ax.tick_params(axis='x', rotation=45)
ax.legend()
plt.show()