# Housing Price Prediction


## 1. Module 1

### Task 1-1 : Load the data

- **Description**: Load the data from file `train.csv` and assign it to variable `train_df`
- **Code Instruction**: 
    1. Import the dataset from the path `train.csv` and assign it to `train_df`
    2. From the column names, get all columns but the last one and assign them to `feats` as a list
    3. Get the last column name and assign it to `label`

In [16]:
import pandas as pd

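# Load the dataset
file_path = "train.csv"  # Update with your dataset path
train_df = pd.read_csv(file_path)
data = train_df  # alias used by the later cells
feats = list(train_df.columns[:-1])  # every column name except the last
label = train_df.columns[-1]  # name of the last column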


### Task 1-2: Splitting the Data into Train and Test Sets

- **Description**: Create a train and test dataset from the dataset
- **Code Instruction**: 
    1. Load features in `X` and labels into `y`
    2. Split dataset into train and test using `train_test_split`

In [None]:
from sklearn.model_selection import train_test_split

# Separate features and target variable
X = data.drop(columns=['Price'])  # Features
y = data['Price']  # Target variable

# Split the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print("Training set size (features):", X_train.shape)
print("Training set size (target):", y_train.shape)
print("Testing set size (features):", X_test.shape)
print("Testing set size (target):", y_test.shape)


### Task 1-3: Viewing Summary Statistics

- **Description**: The first step in checking the quality of your training data is to view its summary statistics
- **Code Instruction**:
    1. See top 5 rows of data
    2. See summary statistics of numeric features

In [None]:
X_train.head(5)

In [None]:
# Display summary statistics of numerical features in the training dataset
print("Summary statistics of numerical features in the training data:")
numerical_features = X_train.select_dtypes(include=['float64', 'int64']).columns
print(X_train[numerical_features].describe())


### Task 1-4: Checking for Missing Values
- **Description**: In this task, check for missing values in the training data
- **Code Instruction**: Calculate the number of missing values in each column and store it in `missing_values`

In [None]:
# Check for missing values in the training dataset
print("\nMissing values in the training dataset:")
missing_values = X_train.isnull().sum()
missing_columns = missing_values[missing_values > 0]
print(missing_columns)
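
As a small illustration of the check above, the following sketch applies the same `isnull().sum()` pattern to a toy frame and also reports the share of missing values per column, which is often easier to judge than a raw count. The frame `toy` and its columns `Area` and `City` are made up for illustration and are not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "Area": [100.0, np.nan, 300.0, 200.0],
    "City": ["A", "B", "A", "B"],
})

toy_missing = toy.isnull().sum()        # count of missing entries per column
toy_share = toy.isnull().mean() * 100   # percentage of missing entries per column

print(toy_missing[toy_missing > 0])     # only columns that have missing values
print(toy_share.round(1))
```

In the notebook itself, the same two lines could be run on `X_train` instead of `toy`.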


### Task 1-5: Checking for Duplicate Values

- **Description**: Detect duplicate rows in the training dataset
- **Code Instruction**: Store number of duplicate rows in `duplicates`

In [None]:
# Check for duplicate rows in the training dataset
duplicates = X_train.duplicated().sum()
print(f"\nNumber of duplicate rows in the training dataset: {duplicates}")


### Task 1-6: Checking for Outliers Using the Tukey Outlier Rule

- **Description**: Identify outliers in the numerical features using Tukey's rule, which defines an outlier as a value beyond 1.5 times the interquartile range (IQR) above the 75th percentile or below the 25th percentile.
- **Code Instruction**:
    1. Complete the function so that it returns the number of outliers in a field
    2. Store the number of outliers for each field in `num_outliers`

In [None]:
# Identify outliers using Tukey's rule
def tukey_outliers(column):
    q1 = column.quantile(0.25)  # 25th percentile
    q3 = column.quantile(0.75)  # 75th percentile
    iqr = q3 - q1  # Interquartile range
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = (column < lower_bound) | (column > upper_bound)
    return outliers.sum()

print("\nOutliers in each numerical feature using Tukey's rule:")
for col in numerical_features:
    num_outliers = tukey_outliers(X_train[col])
    print(f"{col}: {num_outliers} outliers")
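
To make the rule concrete, here is a minimal worked example on five made-up values that are easy to check by hand. The series `sample` is hypothetical and only meant to show how the two fences are derived.

```python
import pandas as pd

# Hypothetical values chosen so the fences are easy to verify by hand
sample = pd.Series([1, 2, 3, 4, 100])

q1 = sample.quantile(0.25)      # 25th percentile: 2.0
q3 = sample.quantile(0.75)      # 75th percentile: 4.0
iqr = q3 - q1                   # interquartile range: 2.0
lower_fence = q1 - 1.5 * iqr    # 2.0 - 3.0 = -1.0
upper_fence = q3 + 1.5 * iqr    # 4.0 + 3.0 = 7.0

# A value is flagged when it falls outside [lower_fence, upper_fence]
flagged = sample[(sample < lower_fence) | (sample > upper_fence)]
print(flagged.tolist())         # [100]
```

Only the value 100 lies beyond the upper fence of 7.0, so it is the single outlier in this sample.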


## 2. Modelling

### Task 2-1: Understanding the Data - Feature Value Distributions

- **Description**: In this task, students will explore the distributions of features in the dataset. Understanding feature distributions helps identify patterns, skewness, and irregularities (e.g., heavy tails). Use visualization techniques to analyze both numerical and categorical features.
- **Code Instruction**:
    1. Visualize distributions of numerical features using histograms
    2. Visualize distributions of categorical features using bar plots

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize distributions of numerical features
print("Visualizing distributions of numerical features:")
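n_cols = 3
n_rows = max(1, -(-len(numerical_features) // n_cols))  # ceiling division so every feature gets a panel
plt.figure(figsize=(15, 5 * n_rows))
for i, col in enumerate(numerical_features, 1):
    plt.subplot(n_rows, n_cols, i)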
    sns.histplot(X_train[col], kde=True, bins=30)
    plt.title(f"Distribution of {col}")
plt.tight_layout()
plt.show()

# Visualize distributions of categorical features
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns
print("\nVisualizing distributions of categorical features:")
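n_cols = 2
n_rows = max(1, -(-len(categorical_features) // n_cols))  # ceiling division so every feature gets a panel
plt.figure(figsize=(15, 4 * n_rows))
for i, col in enumerate(categorical_features, 1):
    plt.subplot(n_rows, n_cols, i)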
    sns.countplot(x=X_train[col], order=X_train[col].value_counts().index)
    plt.title(f"Distribution of {col}")
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
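
Skewness can also be checked numerically alongside the plots. The sketch below uses the pandas `Series.skew()` method on two made-up samples. A clearly positive value suggests a long right tail, and a value near zero suggests a roughly symmetric distribution. Both series are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical samples, not taken from train.csv
right_skewed = pd.Series([1, 1, 2, 2, 3, 50])   # one very large value stretches the right tail
symmetric = pd.Series([1, 2, 3, 4, 5])          # evenly spread around the middle value

skew_right = right_skewed.skew()
skew_sym = symmetric.skew()

print(f"Right-skewed sample: {skew_right:.2f}")
print(f"Symmetric sample: {skew_sym:.2f}")
```

In the notebook, `X_train[numerical_features].skew()` would give one such value per numerical feature.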


### Task 2-2: Building a Baseline No-Machine Learning Solution

- **Description**: In this task, students will create a simple baseline solution to predict housing prices without using machine learning. The baseline will use a basic statistical approach, such as predicting the mean of the target variable (Price)
- **Code Instruction**:
    1. Predict using mean of train labels
    2. Calculate performance using mse, mae, rmse, and r2

`Ask ChatGPT: Please explain mse, mae, rmse, and r2 and their difference with examples`

In [None]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Baseline solution: Predicting the mean price
baseline_prediction = y_train.mean()

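# Generate baseline predictions (same value for all). np.full keeps the float mean,
# whereas np.full_like would cast it to y_test's dtype and truncate it for integer prices
y_pred_baseline = np.full(len(y_test), baseline_prediction)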

# Evaluate baseline performance
mae_baseline = mean_absolute_error(y_test, y_pred_baseline)
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
rmse_baseline = np.sqrt(mse_baseline)
r2_baseline = r2_score(y_test, y_pred_baseline)

# Display results
print("Baseline Model Performance:")
print(f"Mean Absolute Error (MAE): {mae_baseline:.2f}")
print(f"Mean Squared Error (MSE): {mse_baseline:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_baseline:.2f}")
print(f"R-squared (R²): {r2_baseline:.2f}")
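
For reference, the four metrics can be computed by hand from the prediction errors. The sketch below uses three made-up target values and predictions so the arithmetic is easy to follow. All names ending in `_toy` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np

# Hypothetical targets and predictions, chosen for easy arithmetic
y_true_toy = np.array([3.0, 5.0, 7.0])
y_hat_toy = np.array([2.0, 5.0, 9.0])

errors = y_true_toy - y_hat_toy                  # [1, 0, -2]
mae_toy = np.mean(np.abs(errors))                # (1 + 0 + 2) / 3 = 1.0
mse_toy = np.mean(errors ** 2)                   # (1 + 0 + 4) / 3 ≈ 1.667
rmse_toy = np.sqrt(mse_toy)                      # same units as the target

ss_res = np.sum(errors ** 2)                             # residual sum of squares: 5
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)   # total sum of squares: 8
r2_toy = 1 - ss_res / ss_tot                             # 1 - 5/8 = 0.375

# Predicting the mean of the same data gives R² = 0 by construction
y_mean_toy = np.full(len(y_true_toy), y_true_toy.mean())
r2_mean_toy = 1 - np.sum((y_true_toy - y_mean_toy) ** 2) / ss_tot

print(mae_toy, round(mse_toy, 3), round(rmse_toy, 3), r2_toy, r2_mean_toy)
```

MSE and RMSE penalize the error of 2 more heavily than MAE does, because errors are squared before averaging. R² measures improvement over always predicting the mean, which is why a mean-based baseline is expected to score close to zero on held-out data.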


### Task 2-3: Building a Basic Linear Regression Model

- **Description**: In this task, students will implement a basic linear regression model. They will preprocess the data (e.g., encode categorical features, normalize numerical features) to ensure compatibility with the linear regression algorithm. This task introduces foundational concepts in model building and data transformation.
- **Code Instruction**:
    1. Use `Pipeline` from sklearn to chain the transformations
    2. Use mean imputation to fill missing values and standard scaling for numerical features
    3. Use mode imputation to fill missing values and a One Hot Encoder for categorical features
    4. Use a Column Transformer to transform subsets of columns
    5. Fit the default Linear Regression model
    6. Evaluate the performance of the baseline LR model on mse, mae, rmse, and r2

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Separate categorical and numerical features
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns
numerical_features = X_train.select_dtypes(include=['float64', 'int64']).columns

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # Fill missing values with the mean
    ("scaler", StandardScaler())                 # Scale the features
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # Fill missing values with the most frequent value
    ("onehot", OneHotEncoder(handle_unknown="ignore"))     # One-hot encode the categorical features
])


# Preprocessing for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the linear regression pipeline
linear_regression_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

# Fit the model
linear_regression_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = linear_regression_pipeline.predict(X_test)

# Evaluate the model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Display results
print("Linear Regression Model Performance:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R²): {r2:.2f}")
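
To see what the preprocessing step produces before the model is fitted, the sketch below runs the same kind of `ColumnTransformer` on a four-row toy frame. The frame `toy_frame`, its columns `Area` and `City`, and the name `toy_preprocessor` are assumptions made for illustration. `sparse_threshold=0` is set here only so that the result is always a dense array that prints readably.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column
toy_frame = pd.DataFrame({
    "Area": [100.0, np.nan, 300.0, 200.0],
    "City": pd.Series(["A", "B", "A", np.nan], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Numerical column: fill with the mean (200), then standardize
        ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                          ("scaler", StandardScaler())]), ["Area"]),
        # Categorical column: fill with the mode ("A"), then one-hot encode
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["City"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_matrix = toy_preprocessor.fit_transform(toy_frame)
print(toy_matrix.shape)      # one scaled column plus one column per category
print(toy_matrix.round(2))
```

The missing `Area` is replaced by the column mean, so it becomes 0 after scaling, and the missing `City` is replaced by the most frequent category before encoding.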


### Task 2-4: Comparing LR with No-Machine Learning Solution

- **Description**: Compare performance of linear regression model and no-ML baseline model
- **Code Instruction**: 
    1. Evaluate the performance of the baseline model on mse, mae, rmse, and r2
    2. Evaluate the performance of the LR model on mse, mae, rmse, and r2
    3. See which one is better!


In [None]:
# Evaluate baseline performance (from Task 2-2)
# np.full keeps the float mean; np.full_like would truncate it if y_test holds integers
y_pred_baseline = np.full(len(y_test), y_train.mean())  # Baseline: Predict the mean price

# Calculate metrics for the baseline
mae_baseline = mean_absolute_error(y_test, y_pred_baseline)
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
rmse_baseline = np.sqrt(mse_baseline)
r2_baseline = r2_score(y_test, y_pred_baseline)

# Evaluate linear regression model performance
mae_lr = mean_absolute_error(y_test, y_pred)
mse_lr = mean_squared_error(y_test, y_pred)
rmse_lr = np.sqrt(mse_lr)
r2_lr = r2_score(y_test, y_pred)

# Display comparison between baseline and linear regression model
print("Performance Comparison (Baseline vs Linear Regression):")
print("\nBaseline Model:")
print(f"Mean Absolute Error (MAE): {mae_baseline:.2f}")
print(f"Mean Squared Error (MSE): {mse_baseline:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_baseline:.2f}")
print(f"R-squared (R²): {r2_baseline:.2f}")

print("\nLinear Regression Model:")
print(f"Mean Absolute Error (MAE): {mae_lr:.2f}")
print(f"Mean Squared Error (MSE): {mse_lr:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_lr:.2f}")
print(f"R-squared (R²): {r2_lr:.2f}")

# Visual comparison (optional)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.bar(['Baseline', 'Linear Regression'], [mae_baseline, mae_lr], color=['orange', 'blue'])
plt.ylabel("Mean Absolute Error (MAE)")
plt.title("Comparison of MAE between Baseline and Linear Regression")
plt.show()
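
A side-by-side table is one possible alternative to the repeated print statements. The sketch below collects the four metrics into a small DataFrame. The helper `regression_metrics` and the numbers in `y_true_demo` and `demo_predictions` are hypothetical and are not results from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_hat):
    """Return MAE, MSE, RMSE and R² for one set of predictions (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_hat)
    return {
        "MAE": mean_absolute_error(y_true, y_hat),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_true, y_hat),
    }


# Made-up targets and predictions, for illustration only
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
demo_predictions = {
    "Baseline": np.full(len(y_true_demo), y_true_demo.mean()),
    "Linear Regression": np.array([12.0, 18.0, 33.0, 39.0]),
}

# One row per model, one column per metric
comparison = pd.DataFrame(
    {name: regression_metrics(y_true_demo, y_hat) for name, y_hat in demo_predictions.items()}
).T
print(comparison.round(2))
```

In the notebook, the same helper could be called with `y_test` together with `y_pred_baseline` and `y_pred`. Lower MAE, MSE, and RMSE and a higher R² indicate the better model.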
