**<center>Milestone 2: Model Development</center>**

The objective of Milestone 2 is to develop a machine learning model that accurately predicts house prices from features such as location, size, number of bedrooms, and market conditions.

**What is model selection?**
Model selection is the process of choosing the most appropriate machine learning algorithm(s) that can best learn the patterns from the data and make accurate predictions on unseen (test) data. Different algorithms work better for different types of data:

1. Linear models work well when the relationship between features and target is mostly linear.

2. Tree-based models perform better when the data has non-linear relationships or complex interactions between variables.

3. Regularized models help avoid overfitting, especially with many features.

Choosing a model that matches the structure of the data improves prediction accuracy and generalization to new data. Interpretability is a separate trade-off: linear models are usually easier to explain than tree ensembles.
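As a rough illustration of the first two points, the sketch below fits a linear model and a random forest to two synthetic targets, one linear and one quadratic in a single feature. The data, seeds, and variable names are assumptions made for this example only; it does not use the housing dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))  # one synthetic feature

# Two synthetic targets built from the same feature (illustrative only)
targets = {
    "linear": 2.0 * X[:, 0] + rng.normal(0, 0.3, 400),
    "quadratic": X[:, 0] ** 2 + rng.normal(0, 0.3, 400),
}

scores = {}
for name, y in targets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
        model.fit(X_tr, y_tr)
        # Store the test-set R2 under (target name, model class name)
        scores[(name, type(model).__name__)] = r2_score(y_te, model.predict(X_te))

for key, value in scores.items():
    print(key, round(value, 3))
```

On data like this, the linear model should score well on the linear target and poorly on the quadratic one, while the forest should handle both.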



In [1]:
# Import libraries, models, and evaluation tools
import numpy as np  # Used for the square root in the RMSE calculation
import pandas as pd  # Used to build the results comparison table
from sklearn.linear_model import LinearRegression  # Baseline model
from sklearn.ensemble import RandomForestRegressor  # Ensemble of decision trees
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score  # Evaluation metrics

In [2]:
# Step 1: Define and train the Linear Regression model
# X_train, X_test, y_train, and y_test are not created in this section;
# they are expected to come from an earlier train/test split.
linear_model = LinearRegression()  # Create the baseline model
linear_model.fit(X_train, y_train)  # Fit the model to the training data
linear_pred = linear_model.predict(X_test)  # Predict house prices on the test data

NameError: name 'X_train' is not defined
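The `NameError` above appears because `X_train`, `X_test`, `y_train`, and `y_test` have not been created in the current session. A minimal sketch of how they could be produced with scikit-learn's `train_test_split` follows. The synthetic features (`size_sqft`, `bedrooms`), the price formula, and the 80/20 split are hypothetical placeholders, not the project's actual data or settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the prepared housing data
rng = np.random.default_rng(42)
n_rows = 200
size_sqft = rng.uniform(500, 3500, n_rows)
bedrooms = rng.integers(1, 6, n_rows)  # integers from 1 to 5

X = np.column_stack([size_sqft, bedrooms])  # feature matrix
y = 150 * size_sqft + 10_000 * bedrooms + rng.normal(0, 20_000, n_rows)  # made-up prices

# Hold out 20% of the rows for testing; the fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With the real dataset, `X` and `y` would be the prepared feature table and the price column.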

In [None]:
# Step 2: Define and train the Random Forest model
random_forest = RandomForestRegressor(random_state=42)  # Fixed random seed for reproducibility
random_forest.fit(X_train, y_train)  # Fit the model to the training data
rf_pred = random_forest.predict(X_test)  # Predict house prices on test data

In [None]:
# Step 3: Define an evaluation function
def evaluate_model(y_true, y_pred):
    """
    Calculate and return three evaluation metrics:
    - MAE: Mean Absolute Error
    - RMSE: Root Mean Squared Error
    - R2 Score: proportion of the variance in the target explained by the model
    """
    mae = mean_absolute_error(y_true, y_pred)  # Average absolute difference between actual and predicted values
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # Square root of the mean squared error, in the target's units
    r2 = r2_score(y_true, y_pred)  # 1.0 is a perfect fit; 0.0 matches always predicting the mean
    return round(mae, 2), round(rmse, 2), round(r2, 4)  # Rounded for readability
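To make the three metrics concrete, here is a small worked example that computes them by hand in plain Python. The four actual and predicted prices are made-up numbers chosen for easy arithmetic. The formulas follow the standard definitions of MAE, RMSE, and R², which the scikit-learn functions above also implement.

```python
import math

y_true = [100.0, 200.0, 300.0, 400.0]  # made-up actual prices
y_pred = [110.0, 190.0, 310.0, 380.0]  # made-up predicted prices

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]  # -10, 10, -10, 20

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the mean of the squared errors
ss_res = sum(e * e for e in errors)  # residual sum of squares
rmse = math.sqrt(ss_res / n)

# R2: 1 minus the ratio of residual to total sum of squares
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(round(mae, 2), round(rmse, 2), round(r2, 4))  # 12.5 13.23 0.986
```

RMSE exceeds MAE here because squaring gives the single error of 20 more weight than the three errors of 10.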

In [None]:
# Step 4: Evaluate both models
lr_mae, lr_rmse, lr_r2 = evaluate_model(y_test, linear_pred)
rf_mae, rf_rmse, rf_r2 = evaluate_model(y_test, rf_pred)

In [None]:
# Step 5: Store results in a comparison table
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest'],  # Model names
    'MAE': [lr_mae, rf_mae],  # Mean Absolute Error for each model
    'RMSE': [lr_rmse, rf_rmse],  # Root Mean Squared Error for each model
    'R2 Score': [lr_r2, rf_r2]  # R² score for each model
})

In [None]:
# Display the comparison
results

In [None]:
# Visualize the comparison
import matplotlib.pyplot as plt  # Plotting library

# Set figure size
plt.figure(figsize=(8, 5))

# Create bar chart for RMSE
plt.bar(results['Model'], results['RMSE'], color=['pink', 'black'])
plt.title('Model Comparison: RMSE') #Adds a title to the chart for clarity.
plt.ylabel('Root Mean Squared Error (RMSE) - Lower is Better')
plt.grid(True)
plt.show()

**Model Evaluation Summary**

After training and evaluating both models, we observed the following:

Linear Regression performed reasonably well, but it assumes a linear relationship between the features and the price, which is not always realistic for housing data.
Random Forest delivered significantly better performance with a lower RMSE and a higher R² score.
This suggests that Random Forest was able to capture more complex patterns in the data, such as interactions between location, size, and number of bedrooms.

Therefore, we selected Random Forest as our final model.
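A single train/test split can favor one model by chance, so one way to check that a ranking like this holds is k-fold cross-validation. The sketch below uses scikit-learn's `cross_val_score` on synthetic data whose target is a pure interaction between two features. The data, the 5-fold setting, and the variable names are illustrative assumptions, and the printed numbers are not results from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))  # two synthetic features
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.05, 300)  # target depends only on their product

cv_rmse = {}
for model in (LinearRegression(), RandomForestRegressor(random_state=42)):
    # scikit-learn maximizes scores, so RMSE is reported as a negative value
    neg_rmse = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[type(model).__name__] = -neg_rmse  # flip the sign back to a positive RMSE

for name, fold_rmse in cv_rmse.items():
    print(f"{name}: mean RMSE {fold_rmse.mean():.3f} (std {fold_rmse.std():.3f})")
```

If the forest's mean RMSE stays lower across folds on the real data as well, that supports the selection more strongly than one split does.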