# Table of Contents

1. General flow of events
2. Best Models (1 and 2)
3. Kaggle submissions summary

## 1. General flow of events

## Introduction
This Jupyter Notebook predicts house prices from the features in the [Kaggle Ames Housing dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). It fits linear regression models and uses feature engineering to improve prediction accuracy, comparing the effect of different preprocessing methods along the way.

## Data Exploration
In this section, we will load the dataset and perform initial data exploration to understand the data we're working with. This includes generating summary statistics, identifying the types of variables available, and assessing data quality issues like missing values.
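
One way to assess missing values is a per-column report of counts and percentages. The sketch below is illustrative only: `missing_value_report` is a hypothetical helper, and `toy_df` is a made-up stand-in for `train.csv` so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(1),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only the columns that actually contain gaps
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)

# Toy frame standing in for train.csv (values are invented for illustration)
toy_df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "GrLivArea": [1710, 1262, 1786, 1717],
})
report = missing_value_report(toy_df)
print(report)
```

Calling the same helper on `train_df` would list the columns that need imputation before modeling.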

### Loading Data and Initial Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD

In [None]:
# Load the dataset
train_df = pd.read_csv('train.csv')

# Display the first few rows of the dataset
print(train_df.head())

# Generate summary statistics to understand the data's scale and variance
print(train_df.describe())

# Check data types and missing values
print(train_df.info())

### Visual Analysis of Key Features

We visualize the distribution of each key feature and its relationship with `SalePrice`. The five features below were identified as the most impactful for house pricing; the histograms reveal skewness, and the scatter plots show how strongly each feature relates to `SalePrice`.
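
The plots can be complemented with a numeric ranking of how strongly each feature correlates with the target. This is a minimal sketch on synthetic data: the `Noise` column and every coefficient are assumptions chosen for illustration, not values from the Ames dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the training data (all values are invented)
qual = rng.integers(1, 11, size=n)
area = rng.normal(1500, 400, size=n)
noise_col = rng.normal(0, 1, size=n)
price = 20000 * qual + 100 * area + rng.normal(0, 10000, size=n)
toy_df = pd.DataFrame({
    "OverallQual": qual,
    "GrLivArea": area,
    "Noise": noise_col,
    "SalePrice": price,
})

# Pearson correlation of every numeric feature with the target,
# ordered by absolute strength
corr_with_target = (
    toy_df.select_dtypes(include=[np.number])
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

Pearson correlation only measures linear association, so the scatter plots remain useful for spotting curved relationships and outliers.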

In [None]:
# Define key features for focused analysis
important_features = ['OverallQual', 'GrLivArea', 'TotalBsmtSF', 'GarageCars', 'YearBuilt']
target = 'SalePrice'

# Plotting distributions of important features
for feature in important_features:
    plt.figure(figsize=(10, 6))
    sns.histplot(train_df[feature], kde=True, bins=30, color='skyblue')
    plt.title(f'Distribution of {feature} - Understanding Skewness', fontsize=15)
    plt.xlabel(feature, fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.grid(True)
    plt.show()

# Exploring relationships with the SalePrice
for feature in important_features:
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=train_df[feature], y=train_df[target], alpha=0.6, edgecolor=None, color='purple')
    plt.title(f'Impact of {feature} on SalePrice', fontsize=15)
    plt.xlabel(feature, fontsize=12)
    plt.ylabel('SalePrice', fontsize=12)
    plt.grid(True)
    plt.show()

In [None]:
# Identifying categorical and numerical variables
categorical_vars = train_df.select_dtypes(include=['object']).columns
numerical_vars = train_df.select_dtypes(include=[np.number]).columns

print("Categorical variables:", categorical_vars)
print("Numerical variables:", numerical_vars)

## Data Preprocessing

This section details the preprocessing steps applied to the data, including handling missing values, transforming skewed features, and encoding categorical variables. These steps are crucial for preparing the data for effective modeling.
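
As an alternative to imputing and encoding the full dataframe up front, the same steps can be wrapped in a scikit-learn `Pipeline` so that imputers and encoders are fitted on the training split only. The sketch below is not the notebook's method; it uses a tiny invented dataframe (`toy_df`) and only standard scikit-learn classes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for train.csv (values are invented)
toy_df = pd.DataFrame({
    "GrLivArea": [1710, 1262, np.nan, 1717, 2198, 1362, 1694, 2090],
    "Neighborhood": pd.Series(
        ["CollgCr", "Veenker", "CollgCr", np.nan,
         "NoRidge", "Mitchel", "Somerst", "NWAmes"],
        dtype=object,
    ),
    "SalePrice": [208500, 181500, 223500, 140000,
                  250000, 143000, 307000, 200000],
})
X = toy_df.drop(columns="SalePrice")
y = toy_df["SalePrice"]

numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()

# Median imputation for numbers; mode imputation plus one-hot for categories
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model.fit(X_train, y_train)          # statistics are learned from X_train only
predictions = model.predict(X_test)  # the same fitted steps are reused here
print(predictions)
```

Fitting inside the pipeline keeps medians, modes, and category lists from leaking out of the held-out rows, and `handle_unknown="ignore"` lets the encoder cope with categories that appear only at prediction time.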


In [None]:
# Handling missing values
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

# Impute numerical columns
train_df[numerical_vars] = num_imputer.fit_transform(train_df[numerical_vars])

# Impute categorical columns correctly by ensuring 2D input
for col in categorical_vars:
    # Using .ravel() to convert the 2D output to 1D
    train_df[col] = cat_imputer.fit_transform(train_df[[col]]).ravel()

# OneHotEncoding with updated method to avoid deprecation warning and error
encoder = OneHotEncoder(sparse_output=False)  # Ensures output is not sparse
encoded_vars = encoder.fit_transform(train_df[categorical_vars])
encoded_columns = encoder.get_feature_names_out(categorical_vars)  # Use the new method for getting feature names

# Adding encoded variables back to the dataframe
encoded_df = pd.DataFrame(encoded_vars, columns=encoded_columns, index=train_df.index)
train_df = pd.concat([train_df.drop(categorical_vars, axis=1), encoded_df], axis=1)

# Display the first few rows to verify changes
print(train_df.head())

In [None]:
# Example of a combined feature: total square footage across both floors and the basement
train_df['TotalSF'] = train_df['1stFlrSF'] + train_df['2ndFlrSF'] + train_df['TotalBsmtSF']

# Example of an interaction term: construction year multiplied by overall quality
train_df['YearBuilt*OverallQual'] = train_df['YearBuilt'] * train_df['OverallQual']

In [None]:
X = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE:', rmse)

## 2. Best Models

### a) Best Model

### Preprocessing Steps
- **Log transformation**: Applied `log1p` to reduce skewness in 'GrLivArea', 'LotArea', '1stFlrSF', and 'TotalBsmtSF', and to the target 'SalePrice'.
- **Feature Engineering**: Created 'HouseAge' ('YrSold' minus 'YearBuilt') and 'TotalRooms' ('FullBath' plus 'TotRmsAbvGrd') to capture combined effects of related features.
- **Removal of Outliers**: Dropped rows whose 'HouseAge' or 'TotalRooms' z-score exceeds 2.6555 in absolute value.
- **PCA**: Reduced the one-hot-encoded categorical features to 88 principal components with `PCA`, with the aim of preserving 95% of the variance while removing excess noise.
- **RobustScaler**: Scaled the numerical features with `RobustScaler`, which performed better than `StandardScaler` here.
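
The z-score filter in the list above can be sketched with a boolean mask. This is a minimal illustration, not the notebook's exact code: `drop_zscore_outliers` is a hypothetical helper and `toy_df` holds invented values.

```python
import numpy as np
import pandas as pd

def drop_zscore_outliers(df, columns, threshold):
    """Drop rows where any listed column lies more than `threshold`
    population standard deviations from that column's mean."""
    values = df[columns].astype(float)
    # ddof=0 matches the default of scipy.stats.zscore
    z_scores = (values - values.mean()) / values.std(ddof=0)
    # NaN comparisons evaluate to False, so rows with missing values
    # in these columns are dropped as well
    keep_mask = (z_scores.abs() <= threshold).all(axis=1)
    return df.loc[keep_mask].reset_index(drop=True)

# Invented data: one house is far older than the rest
toy_df = pd.DataFrame({
    "HouseAge": [5, 7, 6, 8, 5, 6, 7, 6, 5, 120],
    "TotalRooms": [7, 8, 7, 9, 8, 7, 8, 9, 7, 8],
})
cleaned = drop_zscore_outliers(toy_df, ["HouseAge", "TotalRooms"], threshold=2.6555)
print(len(toy_df), "->", len(cleaned))
```

A mask avoids the duplicate row positions that `np.where` returns when a row exceeds the threshold in more than one column.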

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Handling Skewness in Numeric Features for both train and test data
skewed_features = ['GrLivArea', 'LotArea', '1stFlrSF', 'TotalBsmtSF']
for feature in skewed_features:
    train_df[feature] = np.log1p(train_df[feature])
    test_df[feature] = np.log1p(test_df[feature])
    
# Advanced Feature Engineering
train_df['HouseAge'] = train_df['YrSold'] - train_df['YearBuilt']
train_df['TotalRooms'] = train_df['FullBath'] + train_df['TotRmsAbvGrd']
test_df['HouseAge'] = test_df['YrSold'] - test_df['YearBuilt']
test_df['TotalRooms'] = test_df['FullBath'] + test_df['TotRmsAbvGrd']

# Outlier removal: drop rows where HouseAge or TotalRooms has |z-score| > 2.6555
z_scores = np.abs(stats.zscore(train_df[['HouseAge', 'TotalRooms']]))
outlier_rows = np.unique(np.where(z_scores > 2.6555)[0])  # row positions, de-duplicated
train_df = train_df.drop(index=outlier_rows).reset_index(drop=True)
X = train_df.drop(['SalePrice', 'Id'], axis=1)
y = np.log1p(train_df['SalePrice'])

numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')  # fill_value only applies to strategy='constant'

X_num = pd.DataFrame(num_imputer.fit_transform(X[numerical_features]), columns=numerical_features)
X_cat = pd.DataFrame(cat_imputer.fit_transform(X[categorical_features]), columns=categorical_features)

onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
X_cat_onehot = onehot_encoder.fit_transform(X_cat)

# Applying RobustScaler
scaler = RobustScaler()
X_num_scaled = scaler.fit_transform(X_num)

# Applying PCA
pca = PCA(n_components=88)
X_cat_reduced = pca.fit_transform(X_cat_onehot)
X_processed = np.hstack((X_num_scaled, X_cat_reduced))

X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=110)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# y is log1p(SalePrice), so this RMSE is measured on the log scale
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f'RMSE: {rmse}, R^2: {r2}')

### b) Second-Best Model

### Preprocessing Steps
- **Log transformation**: Applied `log1p` to reduce skewness in 'GrLivArea', 'LotArea', '1stFlrSF', and 'TotalBsmtSF', and to the target 'SalePrice'.
- **Feature Engineering**: Created 'HouseAge' ('YrSold' minus 'YearBuilt') and 'TotalRooms' ('FullBath' plus 'TotRmsAbvGrd') to capture combined effects of related features.
- **Removal of Outliers**: Dropped rows whose 'HouseAge' or 'TotalRooms' z-score exceeds 2.65 in absolute value.
- **TruncatedSVD**: Reduced the one-hot-encoded categorical features to 95 components with `TruncatedSVD`, with the aim of preserving most of the variance while removing excess noise.
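
Note that `n_components=95` in the cell below fixes the number of components at 95; it does not by itself guarantee that 95% of the variance is kept. One way to check how many components a variance target needs is to inspect the cumulative `explained_variance_ratio_`. The sketch below runs on a synthetic one-hot matrix whose sizes are assumptions, not the Ames data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic one-hot matrix: 300 rows, 5 categorical variables, 8 levels each
n_rows, n_vars, n_levels = 300, 5, 8
codes = rng.integers(0, n_levels, size=(n_rows, n_vars))
onehot = np.zeros((n_rows, n_vars * n_levels))
for j in range(n_vars):
    onehot[np.arange(n_rows), j * n_levels + codes[:, j]] = 1.0

# Fit with nearly all components, then accumulate the explained variance
svd = TruncatedSVD(n_components=onehot.shape[1] - 1, random_state=0)
svd.fit(onehot)
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed, cumulative[n_needed - 1])
```

`TruncatedSVD` only accepts an integer `n_components`, whereas scikit-learn's `PCA` also accepts a float such as `0.95` (with `svd_solver='full'`) to select the component count from a variance target directly.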

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# Load datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Handling Skewness in Numeric Features for both train and test data
skewed_features = ['GrLivArea', 'LotArea', '1stFlrSF', 'TotalBsmtSF']
for feature in skewed_features:
    train_df[feature] = np.log1p(train_df[feature])
    test_df[feature] = np.log1p(test_df[feature])  # Apply the same transformation to the test set
    
# Advanced Feature Engineering
train_df['HouseAge'] = train_df['YrSold'] - train_df['YearBuilt']
train_df['TotalRooms'] = train_df['FullBath'] + train_df['TotRmsAbvGrd']

test_df['HouseAge'] = test_df['YrSold'] - test_df['YearBuilt']
test_df['TotalRooms'] = test_df['FullBath'] + test_df['TotRmsAbvGrd']

# Outlier removal: drop rows where HouseAge or TotalRooms has |z-score| > 2.65
z_scores = np.abs(stats.zscore(train_df[['HouseAge', 'TotalRooms']]))
outlier_rows = np.unique(np.where(z_scores > 2.65)[0])  # row positions, de-duplicated
train_df = train_df.drop(index=outlier_rows).reset_index(drop=True)

X = train_df.drop(['SalePrice', 'Id'], axis=1)
y = np.log1p(train_df['SalePrice'])

numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')  # fill_value only applies to strategy='constant'

X_num = pd.DataFrame(num_imputer.fit_transform(X[numerical_features]), columns=numerical_features)
X_cat = pd.DataFrame(cat_imputer.fit_transform(X[categorical_features]), columns=categorical_features)

onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
X_cat_onehot = onehot_encoder.fit_transform(X_cat)

# Reduce the one-hot-encoded categorical features to 95 components
svd = TruncatedSVD(n_components=95)
X_cat_reduced = svd.fit_transform(X_cat_onehot)
X_processed = np.hstack((X_num, X_cat_reduced))
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=110)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# y is log1p(SalePrice), so this RMSE is measured on the log scale
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f'RMSE: {rmse}, R^2: {r2}')

## 3. Kaggle Submissions Summary

| Submission ID | Features Used                                       | Preprocessing Steps                                  | Kaggle Score |
|---------------|-----------------------------------------------------|------------------------------------------------------|--------------|
| 1             | Basic features                                      | None                                                 | 0.288        |
| 2             | Basic features                                      | Introduction of RobustScaler                         | 0.187        |
| 3             | Basic features + `HouseAge`                         | Log transformation on skewed features                | 0.172        |
| 4-27          | Increasing feature complexity                       | Iterative feature engineering and outlier management | Varied       |
| 28            | Full feature set without PCA                        | Inclusion of RobustScaler                            | 0.162        |
| 29            | Full feature set with PCA                           | PCA with 88 components applied                       | 0.159        |
| 30            | All features + polynomial features of selected vars | Full preprocessing including PCA and RobustScaler    | 0.1256       |
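
For context on the scores above: this competition evaluates submissions by the root-mean-squared error between the logarithm of the predicted price and the logarithm of the observed price, so lower is better. The sketch below assumes the table reports that metric; `rmsle` is a hypothetical helper and the prices are invented. It also shows why predictions from a model trained on `log1p(SalePrice)` need `np.expm1` before submission.

```python
import numpy as np

def rmsle(actual_prices, predicted_prices):
    """RMSE between the logs of predicted and observed prices."""
    log_actual = np.log(np.asarray(actual_prices, dtype=float))
    log_pred = np.log(np.asarray(predicted_prices, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_actual) ** 2)))

# Invented observed prices
actual = np.array([200000.0, 150000.0, 320000.0])

# A model trained on log1p(SalePrice) predicts on the log scale; here the
# predictions are simulated as +10%, -5%, and +2% errors in dollar terms
log_scale_predictions = np.log1p(actual * np.array([1.10, 0.95, 1.02]))

# np.expm1 inverts log1p and maps the predictions back to dollars
predicted = np.expm1(log_scale_predictions)

score = rmsle(actual, predicted)
print(round(score, 4))
```

Because the metric works on logs, a 10% error on a cheap house counts the same as a 10% error on an expensive one.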
