# Housing Price Prediction


## Task 1: Loading the Data

Your first task is to load the dataset into a Pandas DataFrame. The dataset is in CSV format and contains information about housing prices and their associated features. Use Python's data-handling libraries to load the data and inspect the first few rows to ensure the data was loaded correctly. This task introduces you to working with external data files and familiarizing yourself with their structure.

In [None]:
import pandas as pd

# Load the dataset
file_path = "housing_prices_dataset.csv"  # Update with your dataset path
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
print(data.head())

# Display basic information about the dataset
# (info() prints its report directly and returns None, so it is not wrapped in print)
print("\nBasic Information about the dataset:")
data.info()
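
As a quick sanity check on the loading step, the sketch below reads a tiny CSV from an in-memory string instead of a file. The column names `Area`, `Bedrooms`, and `Location` and all values are made up for illustration; only `Price` is named in this notebook, so the real file may look different.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the real file's columns may differ
csv_text = """Area,Bedrooms,Location,Price
1200,3,Suburb,250000
800,2,Downtown,310000
1500,4,Rural,180000
"""

# read_csv accepts any file-like object, so StringIO stands in for a file path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())
print(sample.dtypes)  # whole numbers are parsed as integers, text as object or a string dtype
```

If the shape or dtypes printed for the real dataset look unexpected (for example, a numeric column parsed as text), that is usually worth resolving before moving on.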


## Task 2: Splitting the Data into Train and Test Sets

Your second task is to split the dataset into training and testing sets. The training set will be used to build your machine learning model, and the testing set will evaluate its performance on unseen data. Use an 80/20 split for this task. Ensure that the target variable (Price) is separated from the features, as we will be predicting this variable.

In [None]:
from sklearn.model_selection import train_test_split

# Separate features and target variable
X = data.drop(columns=['Price'])  # Features
y = data['Price']  # Target variable

# Split the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print("Training set size (features):", X_train.shape)
print("Training set size (target):", y_train.shape)
print("Testing set size (features):", X_test.shape)
print("Testing set size (target):", y_test.shape)
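
To make the 80/20 split concrete, here is a minimal sketch on synthetic data. The features `Area` and `Bedrooms` and the price formula are assumptions for illustration only. It also shows that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, two made-up features and a Price target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Area": rng.integers(500, 3000, size=100),
    "Bedrooms": rng.integers(1, 6, size=100),
})
toy["Price"] = toy["Area"] * 150 + rng.normal(0, 10_000, size=100)

X_toy = toy.drop(columns=["Price"])
y_toy = toy["Price"]

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)          # 80 training rows and 20 testing rows
print(X_te.index.equals(X_te2.index))  # the same seed gives the same split
```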


## Task 3a: Viewing Summary Statistics
The first step in checking the quality of your training data is to view its summary statistics. This helps you understand the range, central tendency (mean/median), and variability (standard deviation) of each numerical feature. It also gives clues about potential data issues, such as extreme values.

In [None]:
# Display summary statistics of numerical features in the training dataset
print("Summary statistics of numerical features in the training data:")
numerical_features = X_train.select_dtypes(include=['float64', 'int64']).columns
print(X_train[numerical_features].describe())


## Task 3b: Checking for Missing Values
Identify which columns in the training dataset contain missing values and how many. This step is essential because missing values can disrupt model training and need to be addressed.

In [None]:
# Check for missing values in the training dataset
print("\nMissing values in the training dataset:")
missing_values = X_train.isnull().sum()
missing_columns = missing_values[missing_values > 0]
print(missing_columns)
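
Raw counts can be hard to judge without the dataset size, so it may help to report the share of missing rows as well. The sketch below does this on a made-up frame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical training features with a few gaps
toy_train = pd.DataFrame({
    "Area": [1200.0, np.nan, 1500.0, 900.0],
    "Bedrooms": [3, 2, 4, 2],
    "Location": ["Suburb", None, "Rural", None],
})

missing_counts = toy_train.isnull().sum()
missing_share = toy_train.isnull().mean() * 100  # percentage of rows missing per column

# Keep only columns with at least one gap, worst first
report = pd.DataFrame({"count": missing_counts, "percent": missing_share})
report = report[report["count"] > 0].sort_values("percent", ascending=False)
print(report)
```

A column missing in half of the rows may call for a different treatment than one missing in a handful of rows.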


## Task 3c: Checking for Duplicate Values
Detect duplicate rows in the training dataset. Duplicates can distort the model by over-representing certain patterns, leading to bias.

In [None]:
# Check for duplicate rows in the training dataset
duplicates = X_train.duplicated().sum()
print(f"\nNumber of duplicate rows in the training dataset: {duplicates}")
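
Two rows with identical features are not necessarily the same record, since two similar houses can sell for different prices. The sketch below, on made-up data, distinguishes feature-only duplicates from exact duplicates that also share the target, and drops the latter from features and target together. Treating only exact matches as removable is an assumption here, not a rule stated in this notebook.

```python
import pandas as pd

# Hypothetical features and prices with a non-default index, as after a split
X_toy = pd.DataFrame({
    "Area": [1200, 1200, 1500, 1200],
    "Bedrooms": [3, 3, 4, 3],
}, index=[10, 11, 12, 13])
y_toy = pd.Series([250000, 250000, 180000, 265000], index=X_toy.index, name="Price")

# Rows 11 and 13 repeat the features of row 10
feature_dups = X_toy.duplicated().sum()

# Only row 11 also repeats the price, so it is the only exact duplicate record
full_dups_mask = X_toy.join(y_toy).duplicated()

# Drop exact duplicates from features and target together to keep them aligned
X_clean = X_toy[~full_dups_mask]
y_clean = y_toy[~full_dups_mask]

print(feature_dups, full_dups_mask.sum(), X_clean.shape)
```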


## Task 3d: Checking for Outliers Using the Tukey Outlier Rule

Identify outliers in the numerical features using Tukey's rule, which defines an outlier as a value beyond 1.5 times the interquartile range (IQR) above the 75th percentile or below the 25th percentile.

In [None]:
# Identify outliers using Tukey's rule
def tukey_outliers(column):
    q1 = column.quantile(0.25)  # 25th percentile
    q3 = column.quantile(0.75)  # 75th percentile
    iqr = q3 - q1  # Interquartile range
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = (column < lower_bound) | (column > upper_bound)
    return outliers.sum()

print("\nOutliers in each numerical feature using Tukey's rule:")
for col in numerical_features:
    num_outliers = tukey_outliers(X_train[col])
    print(f"{col}: {num_outliers} outliers")
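
A small worked example can make the bounds easier to follow. The six values below are made up; the quartiles rely on pandas' default linear interpolation, so other quantile conventions could give slightly different bounds.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 100])

q1 = values.quantile(0.25)    # 2.25 with pandas' default linear interpolation
q3 = values.quantile(0.75)    # 4.75
iqr = q3 - q1                 # 2.5
lower_bound = q1 - 1.5 * iqr  # -1.5
upper_bound = q3 + 1.5 * iqr  # 8.5

# Only 100 falls outside the interval [-1.5, 8.5]
flagged = values[(values < lower_bound) | (values > upper_bound)]
print(flagged.tolist())       # [100]
```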


## Task 4: Understanding the Data - Feature Value Distributions
In this task, you will explore how each feature in the training data is distributed. Feature distributions reveal patterns, skewness, and irregularities such as heavy tails. Use histograms for the numerical features and count plots for the categorical ones.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

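# Visualize distributions of numerical features
# The grid is sized from the number of features, so it also works with more than nine columns
print("Visualizing distributions of numerical features:")
n_cols = 3
n_rows = max(1, -(-len(numerical_features) // n_cols))  # ceiling division
plt.figure(figsize=(15, 4 * n_rows))
for i, col in enumerate(numerical_features, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.histplot(X_train[col], kde=True, bins=30)
    plt.title(f"Distribution of {col}")
plt.tight_layout()
plt.show()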

# Visualize distributions of categorical features
# The grid is sized from the number of features, so it also works with more than four columns
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns
print("\nVisualizing distributions of categorical features:")
n_cols = 2
n_rows = max(1, -(-len(categorical_features) // n_cols))  # ceiling division
plt.figure(figsize=(15, 4 * n_rows))
for i, col in enumerate(categorical_features, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.countplot(x=X_train[col], order=X_train[col].value_counts().index)
    plt.title(f"Distribution of {col}")
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
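
Histograms show skewness visually; a numeric summary can complement them when there are many features. The sketch below uses pandas' `skew()` on two made-up columns: values near 0 suggest a roughly symmetric distribution, and positive values suggest a longer right tail.

```python
import pandas as pd

# Hypothetical columns: one symmetric, one with a single large value on the right
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_tailed": [1, 1, 1, 2, 2, 3, 10],
})

# DataFrame.skew() returns one sample-skewness value per numeric column
skewness = toy.skew()
print(skewness)
```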


## Task 5: Framing Price Prediction as a Regression Problem
In this task, you will confirm that predicting housing prices is a regression problem. Regression models predict a continuous numeric target, so check that the target variable (Price) is numeric and takes a wide range of values rather than a handful of discrete classes. This check lays the theoretical foundation for model selection.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Check the data type and distribution of the target variable
print("Target variable type:", y_train.dtype)
print("\nSummary statistics of the target variable:")
print(y_train.describe())

# Visualize the distribution of the target variable
plt.figure(figsize=(10, 6))
sns.histplot(y_train, kde=True, bins=30, color='blue')
plt.title("Distribution of Target Variable: Price")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()
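
The check can also be written as a rough heuristic, sketched below. The helper `looks_continuous` and its 5% threshold are hypothetical choices for illustration, not an established rule; a numeric target with few distinct values (for example, a 1 to 5 rating) would still need human judgment.

```python
import pandas as pd

def looks_continuous(target: pd.Series, min_unique_ratio: float = 0.05) -> bool:
    """Heuristic: numeric dtype and many distinct values relative to the sample size."""
    if not pd.api.types.is_numeric_dtype(target):
        return False
    return target.nunique() / len(target) >= min_unique_ratio

# Made-up examples: numeric prices versus text labels
prices = pd.Series([250000.0, 310500.0, 180250.0, 275000.0, 199900.0], name="Price")
labels = pd.Series(["cheap", "expensive", "cheap", "cheap", "expensive"], name="PriceBand")

print(looks_continuous(prices))  # numeric with all-distinct values
print(looks_continuous(labels))  # text labels point to classification instead
```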

## Task 6: Figuring Out What Metric to Choose - Pros and Cons of Each
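In this task, you will compare common metrics for evaluating regression models and decide which to use for this problem. Mean Absolute Error (MAE) is expressed in the same units as Price and weights every error in proportion to its size, so it is less sensitive to outliers. Mean Squared Error (MSE) squares each error, which penalizes large mistakes more heavily but reports the result in squared units; its square root (RMSE) returns to the units of Price. R-squared (R²) is unitless and measures the proportion of the target's variance that the model explains, but it does not show how large the errors are in price terms.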

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

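# Dummy predictions to demonstrate metrics (replace with actual model predictions later)
dummy_prediction = y_train.mean()  # Using mean of the target as a baseline
# np.full always builds a float array here; np.full_like would inherit y_train's dtype
# and truncate the mean if Price is stored as integers
y_pred_dummy = np.full(len(y_train), dummy_prediction)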

# Calculate metrics
mae = mean_absolute_error(y_train, y_pred_dummy)
mse = mean_squared_error(y_train, y_pred_dummy)
rmse = np.sqrt(mse)
r2 = r2_score(y_train, y_pred_dummy)

# Display results
print("Evaluation Metrics for Dummy Predictions:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R²): {r2:.2f}")
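
To see what each metric measures, the sketch below computes them by hand with NumPy on four made-up values and compares the results with scikit-learn. The numbers are illustrative only; note how the single error of 40 accounts for most of the MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_true - y_hat                         # [-10, 10, -30, 40]
mae_manual = np.mean(np.abs(errors))            # 22.5
mse_manual = np.mean(errors ** 2)               # 675.0
rmse_manual = np.sqrt(mse_manual)               # about 25.98, in the units of y
ss_res = np.sum(errors ** 2)                    # 2700
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 50000
r2_manual = 1 - ss_res / ss_tot                 # 0.946

# Each manual value should agree with scikit-learn's implementation
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```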

## Task 7: Building a Baseline No-Machine Learning Solution
In this task, you will create a simple baseline that predicts housing prices without machine learning. The baseline predicts a single summary statistic of the training target (Price), such as its mean or median, for every house. This gives a reference point: a machine learning model is only useful if it beats this baseline.

`Note: Evaluate the baseline with whichever metric you found most suitable in Task 6.`

In [None]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Baseline solution: Predicting the mean price
baseline_prediction = y_train.mean()

# Generate baseline predictions (same value for all)
# np.full keeps the mean as a float; np.full_like would truncate it if Price is an integer column
y_pred_baseline = np.full(len(y_test), baseline_prediction)

# Evaluate baseline performance
mae_baseline = mean_absolute_error(y_test, y_pred_baseline)
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
rmse_baseline = np.sqrt(mse_baseline)
r2_baseline = r2_score(y_test, y_pred_baseline)

# Display results
print("Baseline Model Performance:")
print(f"Mean Absolute Error (MAE): {mae_baseline:.2f}")
print(f"Mean Squared Error (MSE): {mse_baseline:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_baseline:.2f}")
print(f"R-squared (R²): {r2_baseline:.2f}")
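
The median is the other baseline statistic mentioned above. As a sketch, scikit-learn's `DummyRegressor` can produce either baseline through its `strategy` parameter. The prices below are made up, with one very expensive house included to show how the mean and median can diverge.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical prices; the 900 entry mimics one very expensive house
y_tr = np.array([100.0, 120.0, 130.0, 150.0, 900.0])
y_te = np.array([110.0, 125.0, 140.0])
X_tr = np.zeros((len(y_tr), 1))  # DummyRegressor ignores the feature values
X_te = np.zeros((len(y_te), 1))

mae_by_strategy = {}
for strategy in ("mean", "median"):
    baseline = DummyRegressor(strategy=strategy).fit(X_tr, y_tr)
    preds = baseline.predict(X_te)
    mae_by_strategy[strategy] = mean_absolute_error(y_te, preds)
    print(strategy, preds[0], mae_by_strategy[strategy])
```

On skewed data like this, the median baseline tends to give a lower MAE, so it may be the fairer reference point when Price has a long right tail.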


## Task 8: Building a Basic Linear Regression Model
In this task, you will implement a basic linear regression model. Preprocess the data first (one-hot encode the categorical features and standardize the numerical features) so that it is compatible with the linear regression algorithm. This task introduces foundational concepts in model building and data transformation.



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Separate categorical and numerical features
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns
numerical_features = X_train.select_dtypes(include=['float64', 'int64']).columns

# Preprocessing for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Define the linear regression pipeline
linear_regression_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

# Fit the model
linear_regression_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = linear_regression_pipeline.predict(X_test)

# Evaluate the model
import numpy as np  # needed for np.sqrt below if this cell is run on its own
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Display results
print("Linear Regression Model Performance:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R²): {r2:.2f}")
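
`LinearRegression` cannot be fitted on data containing missing values, so if Task 3b reported any, the pipeline needs an imputation step. The sketch below is one possible arrangement using scikit-learn's `SimpleImputer`; the toy columns `Area` and `Location`, and the choice of median and most-frequent imputation, are assumptions for illustration rather than the notebook's prescribed method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with gaps in both column types
X_toy = pd.DataFrame({
    "Area": [1200.0, np.nan, 1500.0, 900.0, 2000.0, 1100.0],
    "Location": pd.Series(
        ["Suburb", "Downtown", np.nan, "Suburb", "Rural", "Downtown"], dtype=object
    ),
})
y_toy = pd.Series([250000.0, 310000.0, 180000.0, 210000.0, 260000.0, 330000.0])

# Impute before scaling or encoding so later steps never see NaN
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["Area"]),
    ("cat", categorical_steps, ["Location"]),
])
toy_model = Pipeline([("preprocessor", toy_preprocessor), ("model", LinearRegression())])

toy_model.fit(X_toy, y_toy)

# A new row with a missing area and an unseen location still gets a prediction
X_new = pd.DataFrame({"Area": [np.nan], "Location": pd.Series(["Harbor"], dtype=object)})
toy_pred = toy_model.predict(X_new)
print(toy_pred)
```

Because the imputers sit inside the pipeline, their fill values are learned from the training data only and then reused on the test data.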


## Task 9: Evaluating Performance of Model Using Chosen Metric and Comparing with No-Machine Learning Solution

In this task, you will evaluate the linear regression model built in Task 8 and compare it against the no-machine-learning baseline from Task 7 using the evaluation metrics (MAE, MSE, RMSE, R²). The comparison shows whether the model provides a meaningful improvement over a simple statistical approach.

In [None]:
# Evaluate baseline performance (from Task 7)
# np.full keeps the mean as a float; np.full_like would truncate it if Price is an integer column
y_pred_baseline = np.full(len(y_test), y_train.mean())  # Baseline: Predict the mean price

# Calculate metrics for the baseline
mae_baseline = mean_absolute_error(y_test, y_pred_baseline)
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
rmse_baseline = np.sqrt(mse_baseline)
r2_baseline = r2_score(y_test, y_pred_baseline)

# Evaluate linear regression model performance
mae_lr = mean_absolute_error(y_test, y_pred)
mse_lr = mean_squared_error(y_test, y_pred)
rmse_lr = np.sqrt(mse_lr)
r2_lr = r2_score(y_test, y_pred)

# Display comparison between baseline and linear regression model
print("Performance Comparison (Baseline vs Linear Regression):")
print("\nBaseline Model:")
print(f"Mean Absolute Error (MAE): {mae_baseline:.2f}")
print(f"Mean Squared Error (MSE): {mse_baseline:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_baseline:.2f}")
print(f"R-squared (R²): {r2_baseline:.2f}")

print("\nLinear Regression Model:")
print(f"Mean Absolute Error (MAE): {mae_lr:.2f}")
print(f"Mean Squared Error (MSE): {mse_lr:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_lr:.2f}")
print(f"R-squared (R²): {r2_lr:.2f}")

# Visual comparison (optional)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.bar(['Baseline', 'Linear Regression'], [mae_baseline, mae_lr], color=['orange', 'blue'])
plt.ylabel("Mean Absolute Error (MAE)")
plt.title("Comparison of MAE between Baseline and Linear Regression")
plt.show()
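
One way to summarize the comparison in a single number is the percentage reduction in error relative to the baseline. The helper `relative_improvement` below is a hypothetical name, and the MAE values passed to it are made up for illustration.

```python
def relative_improvement(baseline_error: float, model_error: float) -> float:
    """Percentage reduction in error relative to the baseline (positive means the model is better)."""
    if baseline_error == 0:
        raise ValueError("baseline_error must be non-zero")
    return 100 * (baseline_error - model_error) / baseline_error

# Made-up MAE values for illustration only
print(relative_improvement(50_000, 35_000))  # 30.0: the model cuts the error by 30%
print(relative_improvement(50_000, 55_000))  # -10.0: the model is worse than the baseline
```

This applies to error metrics where lower is better (MAE, MSE, RMSE); R² is already expressed relative to a mean-predicting baseline, so it is read directly.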
