## Week 7 - House Price Prediction Model

In this assignment, you will build a regression model that predicts house prices. Your task is to analyze a dataset containing various features of houses together with their corresponding prices.

You will then train a model that estimates house prices from those features and evaluate how accurate its estimates are.


In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression


%matplotlib inline


**Question 1:** Load the house price prediction dataset into a variable called `house_price_df`. Next, write a function called `check_data` to check if the data has been loaded successfully.

**Question 1.1:** Explore the data to get an idea of its features and properties.

In [2]:
# Load the house price dataset
house_price_df = pd.read_csv('house_price_prediction.csv')

# Write a function called `check_data` to check that data loading was successful
def check_data():
    # The load counts as successful when the dataframe contains at least one row
    return not house_price_df.empty

# Call the check_data function
result = check_data()
print(result)



True


In [3]:

# Explore the data to have an idea of its features and properties
# Display the first few rows of the dataframe
house_price_df.head()

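|   | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |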


In [4]:
# Get information about the dataset
print("Data Information")
house_price_df.info()


Data Information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB
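
The `info()` output above reports 545 non-null entries in every column, so imputation is not strictly needed for this dataset. As a minimal sketch of how the same check could be done programmatically, the snippet below counts missing values and lists the non-numeric columns on a small made-up frame. The name `toy_df` and its values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for house_price_df (values are made up)
toy_df = pd.DataFrame({
    "price": [100, 200, 150],
    "area": [50.0, None, 70.0],
    "mainroad": ["yes", "no", "yes"],
})

# Count missing values per column
missing_counts = toy_df.isna().sum()

# Columns that are not numeric and would need encoding before modelling
categorical_columns = toy_df.select_dtypes(exclude="number").columns.tolist()

print(missing_counts.to_dict())
print(categorical_columns)
```

Running the same two calls on `house_price_df` would be one way to confirm which columns need encoding and whether any need imputation.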


In [5]:
# Get summary statistics
house_price_df.describe()

|   | price | area | bedrooms | bathrooms | stories | parking |
|---|---|---|---|---|---|---|
| count | 545.0 | 545.0 | 545.0 | 545.0 | 545.0 | 545.0 |
| mean | 4766729.0 | 5150.541284 | 2.965138 | 1.286239 | 1.805505 | 0.693578 |
| std | 1870440.0 | 2170.141023 | 0.738064 | 0.50247 | 0.867492 | 0.861586 |
| min | 1750000.0 | 1650.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 3430000.0 | 3600.0 | 2.0 | 1.0 | 1.0 | 0.0 |
| 50% | 4340000.0 | 4600.0 | 3.0 | 1.0 | 2.0 | 0.0 |
| 75% | 5740000.0 | 6360.0 | 3.0 | 2.0 | 2.0 | 1.0 |
| max | 13300000.0 | 16200.0 | 6.0 | 4.0 | 4.0 | 3.0 |


**Question 2:** Preprocess the data by handling missing values, converting categorical variables (like `mainroad`, `guestroom`, `basement`, `hotwaterheating`, `airconditioning`, and `prefarea`), and scaling numerical features (if needed).

**Note**: assign your final preprocessed dataset to a variable called `processed_house_price_df`. Failure to do this might result in you not getting a score for this question.


In [6]:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Copy the original dataframe to avoid modifying the original data
processed_house_price_df = house_price_df.copy()

# One-hot encode categorical variables using get_dummies
processed_house_price_df = pd.get_dummies(processed_house_price_df, drop_first=True)

# Define numerical columns
numerical_columns = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']

# Separate target variable 'price' from features
X = processed_house_price_df.drop('price', axis=1)
y = processed_house_price_df['price']

# Handle missing values using SimpleImputer
imputer = SimpleImputer(strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Scale numerical features using StandardScaler
scaler = StandardScaler()
X[numerical_columns] = scaler.fit_transform(X[numerical_columns])

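# Store the imputed and scaled features together with the target as the final dataset
processed_house_price_df = X.assign(price=y)

# Display the first few rows of the preprocessed features
X.head()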


|   | area | bedrooms | bathrooms | stories | parking | mainroad_yes | guestroom_yes | basement_yes | hotwaterheating_yes | airconditioning_yes | prefarea_yes | furnishingstatus_semi-furnished | furnishingstatus_unfurnished |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.046726 | 1.403419 | 1.421812 | 1.378217 | 1.517692 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.75701 | 1.403419 | 5.405809 | 2.532024 | 2.679409 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2.218232 | 0.047278 | 1.421812 | 0.22441 | 1.517692 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 1.083624 | 1.403419 | 1.421812 | 0.22441 | 2.679409 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.046726 | 1.403419 | -0.570187 | 0.22441 | 1.517692 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
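
The preprocessing cell above fits the imputer and the scaler on all rows before any split, so statistics from rows that later serve as test data influence the transformation. One common alternative is to wrap preprocessing and model in a scikit-learn `Pipeline`, so that each step is re-fit on the training portion of every fold. The following is a minimal sketch on synthetic data. The column names mimic the dataset, but `toy_df`, `toy_price`, and all values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({
    "area": rng.uniform(1500, 16000, n),
    "bedrooms": rng.integers(1, 7, n),
    "mainroad": rng.choice(["yes", "no"], n),
})
toy_price = 500 * toy_df["area"] + 200_000 * toy_df["bedrooms"] + rng.normal(0, 1e5, n)

numeric_cols = ["area", "bedrooms"]
categorical_cols = ["mainroad"]

# Impute and scale numeric columns, one-hot encode categorical columns
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])

# Preprocessing and regression are fit together inside each training fold
toy_model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
toy_scores = cross_val_score(toy_model, toy_df, toy_price, cv=cv, scoring="r2")
print("Mean R^2 across folds:", toy_scores.mean())
```

For plain linear regression, feature scaling does not change the predictions, so the benefit of this design is mainly that the same pipeline stays leak-free if a scale-sensitive model is swapped in later.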


**Question 3:** Split your processed dataset into training and testing sets using `k-fold cross-validation`. You can use the **X** and **y** variables to store your split dataset.

**Question 3.1:** Apply k-fold cross-validation by using scikit-learn's `cross_val_score` function. Set the value of _k=5_.

**Question 3.2:** Train an ML model using `LinearRegression` to predict house prices. 

**Note**: Assign your model to a variable called `house_price_model`. Failure to do this might result in you not getting a score for this question.

In [7]:
# Step 1: Split the processed dataset into features (X) and target (y)
X = processed_house_price_df.drop('price', axis=1)
y = processed_house_price_df['price']

# Step 2: Hold out 20% of the rows as a test set with train_test_split
# (the k-fold cross-validation itself is applied in Step 3.1 below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3.1: Apply k-fold cross-validation with k=5 using cross_val_score
cross_val_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

# Display the cross-validation scores
print("Cross-Validation Scores:", cross_val_scores)

# Step 3.2: Train an ML model using LinearRegression
house_price_model = LinearRegression()

# Train the model on the training data using the fit method
house_price_model.fit(X_train, y_train)


Cross-Validation Scores: [ -2.08761653  -5.15625641 -16.34488122 -20.80699862  -5.16406595]
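
One plausible explanation for the strongly negative scores above is row order. The rows shown by `head()` are in descending price order, which suggests the file is sorted by price. When `cv=5` is passed as an integer for a regressor, `cross_val_score` uses `KFold` without shuffling, so each test fold would cover a narrow, contiguous price band that the training folds never saw. The sketch below reproduces this effect on synthetic data sorted by the target, then compares it with a shuffled `KFold`. All names and values are illustrative assumptions, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: one feature with a linear relationship plus noise
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
target = 3.0 * x + rng.normal(0, 5.0, n)

# Sort rows by descending target, mimicking a file sorted by price
order = np.argsort(-target)
X_sorted = x[order].reshape(-1, 1)
y_sorted = target[order]

# cv=5 uses contiguous, unshuffled folds
unshuffled = cross_val_score(LinearRegression(), X_sorted, y_sorted, cv=5, scoring="r2")

# Shuffling assigns rows to folds at random
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled = cross_val_score(LinearRegression(), X_sorted, y_sorted, cv=shuffled_cv, scoring="r2")

print("Unshuffled mean R^2:", unshuffled.mean())
print("Shuffled mean R^2:", shuffled.mean())
```

If the same cause applies here, passing a shuffled `KFold` object as `cv` to the call on `X` and `y` would be expected to give scores closer to the held-out R-squared from `train_test_split`, which shuffles by default.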


**Question 4:** Predict using the developed model and evaluate the model. Evaluate this model using MAE, MSE, RMSE, and R-squared.

**Note**: Assign your prediction to a variable called `prediction`. Failure to do this might result in you not getting a score for this question.

In [8]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, root_mean_squared_error

# Step 4: Predict using the developed model
prediction = house_price_model.predict(X_test)

# Step 4.1: Evaluate the model using different metrics
mae = mean_absolute_error(y_test, prediction)
mse = mean_squared_error(y_test, prediction)
rmse = root_mean_squared_error(y_test, prediction)  # Requires scikit-learn 1.4 or later
r2 = r2_score(y_test, prediction)

# Display the evaluation metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)


Mean Absolute Error (MAE): 970043.4039201637
Mean Squared Error (MSE): 1754318687330.6643
Root Mean Squared Error (RMSE): 1324506.9600914388
R-squared (R²): 0.6529242642153184
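
To make the four metrics above concrete, the following sketch computes them by hand with NumPy on a tiny made-up example and compares the results with scikit-learn. The arrays `y_true` and `y_pred` are illustrative assumptions and are unrelated to the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Tiny made-up example (not taken from the dataset)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred

# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(errors))

# MSE: mean of the squared errors
mse_manual = np.mean(errors ** 2)

# RMSE: square root of the MSE, back in the units of the target
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE :", mae_manual, "sklearn:", mean_absolute_error(y_true, y_pred))
print("MSE :", mse_manual, "sklearn:", mean_squared_error(y_true, y_pred))
print("RMSE:", rmse_manual)
print("R^2 :", r2_manual, "sklearn:", r2_score(y_true, y_pred))
```

Because R-squared compares the model against always predicting the mean of `y_true`, it becomes negative whenever the model does worse than that baseline.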


<!-- BEGIN QUESTION -->

**Question 5:** What insight can you derive from this data?

a. Model Performance:

The Linear Regression model achieved an R-squared value of approximately 0.65 on the held-out test set. This means that around 65% of the variance in the test-set house prices is explained by the features included in the model.

b. Evaluation Metrics:

The Mean Absolute Error (MAE) is approximately 970,043, which is the average absolute difference between predicted and actual prices, in the same units as `price`. For comparison, the mean price in the dataset is about 4.77 million, so the typical error is roughly 20% of the mean price.

The Mean Squared Error (MSE) is approximately 1,754,318,687,330.66, which is the average of the squared differences between predicted and actual prices. Because the errors are squared, this metric weights large errors more heavily, and its units are the square of the price units.

The Root Mean Squared Error (RMSE) is approximately 1,324,506.96. As the square root of the MSE, it is expressed in the same units as the target variable. It is noticeably larger than the MAE, which suggests that some predictions miss by considerably more than the typical error.

<!-- END QUESTION -->


