<a href="https://colab.research.google.com/github/SomTu/RAD-2025/blob/main/code/01RAD_Ex12_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle house data set



## Downloading the Kaggle house rent dataset

The dataset we will use comes from Kaggle:

- *House Rent Prediction Dataset*  
  https://www.kaggle.com/datasets/iamsouravbanerjee/house-rent-prediction-dataset/data

To download through the Kaggle API inside this notebook you need a Kaggle
API token (see *Account → API → Create New Token* on Kaggle), configured
either through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment
variables or by placing `kaggle.json` in the standard location. The cell
below uses `kagglehub` instead, which does not need an API key for public
datasets such as this one.
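
If you do go the API-token route, a quick check like the following can tell you whether credentials are visible to the notebook. This is only a sketch: the helper name `kaggle_credentials_available` is made up for illustration, and it assumes the default token location `~/.kaggle/kaggle.json`.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper: checks the two environment variables mentioned
    above, then the assumed default token file ~/.kaggle/kaggle.json.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    has_file = (home / ".kaggle" / "kaggle.json").is_file()
    return has_env or has_file


print("Kaggle credentials found:", kaggle_credentials_available())
```

The `env` and `home` arguments exist only so the check can be exercised without touching your real configuration.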

Example of Auto NB - let's beat it

https://www.kaggle.com/code/sahityasetu/boosting-algorithms-for-machine-learning


In [None]:

# Download the Kaggle house rent dataset using kagglehub (no API key needed for public data)
try:
    import kagglehub  # lightweight helper for Kaggle datasets
except ImportError:  # pragma: no cover
    %pip install -q kagglehub
    import kagglehub

# Download latest version of the dataset; this returns a local directory path
path = kagglehub.dataset_download("iamsouravbanerjee/house-rent-prediction-dataset")
print("Path to dataset files:", path)


In [None]:

from pathlib import Path
import pandas as pd

# `path` is a directory returned by kagglehub; locate the CSV inside it
dataset_dir = Path(path)
candidates = list(dataset_dir.rglob("House_Rent_Dataset.csv"))
if not candidates:
    raise FileNotFoundError(f"House_Rent_Dataset.csv not found under {dataset_dir}")

csv_path = candidates[0]
print("Loading data from:", csv_path)
house = pd.read_csv(csv_path)
print("Shape:", house.shape)
house_orig = house.copy()  # real copy; plain assignment would alias `house` and pick up later in-place edits
house.head()


In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor



In [None]:
print(list(house.columns))
print(house.describe(include='all'), "\n\n")
print(house.isna().sum())
sns.pairplot(data=house)

for col in house.columns:
    print(f"\nColumn: {col}")
    print(house[col].unique())


The columns are ['Posted On', 'BHK', 'Rent', 'Size', 'Floor', 'Area Type', 'Area Locality', 'City', 'Furnishing Status', 'Tenant Preferred', 'Bathroom', 'Point of Contact']

'Posted On' is not expected to have any effect on Rent, so it will not be used in the model.

'Area Locality' has about half as many unique values as there are samples. Author sees no sensible way to include it in the model.

Initial feature choice is ['BHK', 'Size', 'Floor', 'City', 'Furnishing Status', 'Bathroom']
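
One way to back up a decision like the one about high-cardinality columns is to compare the number of distinct values per column with the number of rows. The sketch below uses a tiny made-up frame (`toy`) rather than the real data, and the 0.6 cut-off is an arbitrary assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `house` frame
toy = pd.DataFrame({
    "City": ["Mumbai", "Delhi", "Mumbai", "Delhi", "Mumbai", "Delhi"],
    "Area Locality": ["Bandra", "Saket", "Andheri", "Rohini", "Powai", "Saket"],
    "BHK": [2, 3, 1, 2, 2, 3],
})

# Share of distinct values per column: values near 1 mean that almost
# every row would become its own category after one-hot encoding
cardinality = toy.nunique() / len(toy)
print(cardinality.sort_values(ascending=False))

# Columns above the (arbitrary) cut-off are poor candidates for dummy coding
too_many_levels = cardinality[cardinality > 0.6].index.tolist()
print("High-cardinality columns:", too_many_levels)
```

On the real data the same two lines can be run on `house` directly.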

In [None]:


def add_floor_columns(
    df: pd.DataFrame,
    source_col: str,
    current_col: str = "current_floor",
    max_col: str = "max_floor",
    inplace: bool = True,
):
    """
    Parses floor information from a column and adds two new columns:
    current floor and max floor.

    Parameters
    ----------
    df : pd.DataFrame
        Input dataframe
    source_col : str
        Column containing floor strings (e.g. '3 out of 10')
    current_col : str, optional
        Name of output column for current floor
    max_col : str, optional
        Name of output column for max floor
    inplace : bool, optional
        If True, modifies df in place. If False, returns a copy.

    Returns
    -------
    pd.DataFrame
        DataFrame with added columns
    """

    if not inplace:
        df = df.copy()

    floor_map = {
        "Ground": 0,
        "Upper Basement": -1,
        "Lower Basement": -2,
    }

    pattern = re.compile(r"(.+?)\s+out of\s+(\d+)$")

    def parse_value(val):
        if not isinstance(val, str):
            return (np.nan, np.nan)

        val = val.strip()
        match = pattern.match(val)

        if not match:
            return (np.nan, np.nan)

        raw_floor, max_floor = match.groups()
        max_floor = int(max_floor)

        raw_floor = raw_floor.strip()

        # Numeric floor
        if raw_floor.isdigit():
            return (int(raw_floor), max_floor)

        # Named floor
        if raw_floor in floor_map:
            return (floor_map[raw_floor], max_floor)

        return (np.nan, np.nan)

    df[[current_col, max_col]] = (
        df[source_col]
        .apply(parse_value)
        .apply(pd.Series)
    )

    return df

house = add_floor_columns(house, source_col='Floor')
house['floor_ratio'] = house['current_floor'] / house['max_floor']
#price_skew = house["Rent"].skew()
#print(f"Rent skewness: {price_skew:.2f} (right-skew suggests log-transforming the target)")
for column in house.select_dtypes(include='number').columns:
    print(f"Skewness in {column} is {house[column].skew():.2f}.")
house.head()


In [None]:
# transforming Rent
house['log_Rent'] = np.log(house['Rent'])


house['Furnishing_Status'] = house['Furnishing Status']
house['Area_Type'] = house['Area Type']

# Rent distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(house["Rent"], bins=40, kde=True, color="steelblue", ax=axes[0])
axes[0].set_title("Rent distribution (linear scale)")
axes[0].set_xlabel("Rent")
axes[0].set_ylabel("Count")

sns.histplot(np.log1p(house["Rent"]), bins=40, kde=True, color="darkorange", ax=axes[1])
axes[1].set_title("Rent distribution (log1p scale)")
axes[1].set_xlabel("log(1 + Rent)")
axes[1].set_ylabel("Count")

for column in house.select_dtypes(include='number').columns:
    print(f"Skewness in {column} is {house[column].skew():.2f}.")

Log-transforming `Rent` does not remove all the skewness: only the target was transformed, so the other numeric columns are as skewed as before.
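
To see in isolation what a log transform does to skewness, here is a small sketch on synthetic log-normal data. The series `rent_like` is simulated, not the rent data, and its parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated right-skewed variable (log-normal), standing in for a rent-like target
rent_like = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))

raw_skew = rent_like.skew()
log_skew = np.log(rent_like).skew()

print(f"Skewness before log: {raw_skew:.2f}")
print(f"Skewness after log:  {log_skew:.2f}")
```

A log-normal variable becomes normal after taking logs, so its skewness drops to roughly zero; real data will rarely behave that cleanly, and columns that are left untransformed keep their original skewness.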

In [None]:
sns.pairplot(data=house)




---

After log-transforming the data, there are some noticeable linear trends.




In [None]:
house.head()

---

An initial model on the untransformed `Rent`, which gives a very bad result:



In [None]:
model = smf.ols("Rent ~ BHK + Size + current_floor + max_floor + C(Furnishing_Status) + Bathroom", data=house).fit()
print(model.summary())

Since BHK is the number of bedrooms, halls and kitchens, it is very likely to be collinear with Size, so we will keep only Size. Similarly, 'max_floor' and 'current_floor' are related, since 'current_floor' never exceeds 'max_floor'. For this reason, we will not include 'max_floor' as a separate predictor; the next model uses 'floor_ratio' instead.
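
The suspicion of multicollinearity can be checked with variance inflation factors, $\text{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors. The sketch below computes this with plain NumPy on simulated data; the variables `size`, `bhk` and `bathroom` are made up, and the dependence of `bhk` on `size` is an assumption built into the simulation, not a property measured on the rent data.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features).

    Each column is regressed on all the others plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(0)
size = rng.normal(1000, 300, 500)
bhk = size / 400 + rng.normal(0, 0.3, 500)  # simulated: strongly tied to size
bathroom = rng.normal(2, 0.5, 500)          # simulated: independent of both

vifs = vif(np.column_stack([size, bhk, bathroom]))
for name, value in zip(["size", "bhk", "bathroom"], vifs):
    print(f"VIF {name}: {value:.2f}")
```

The notebook already imports `variance_inflation_factor` from statsmodels, which takes a design matrix and a column index; when using it, the design matrix should include a constant column so that the auxiliary regressions have an intercept, as in the sketch above.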

At the same time, we will model the log-transformed response, `log_Rent`, instead of `Rent`.

In [None]:
model = smf.ols("log_Rent ~ Size + floor_ratio + Bathroom + C(City) + C(Furnishing_Status) + C(Area_Type) + Size:C(City) + C(Furnishing_Status):C(City) + Size:Bathroom", data=house).fit()
print(model.summary())

In [None]:
house.columns


## Questions for a linear regression analysis of house rent

When building a linear regression model for rent, it is useful to think
in terms of a workflow:

1. **Understand the data**
   - What is the response variable (e.g. `Rent`)?  
     What are the main predictor types (numeric, categorical, locations,
     amenities)?
   - Are there obvious data quality issues (missing values, impossible
     values, outliers)?

2. **Preprocessing and feature engineering**
   - How should categorical variables (e.g. city, furnishing status,
     point of contact) be encoded for a linear model (one-hot encoding,
     target encoding, etc.)?
   - Which numeric variables might benefit from scaling (standardization
     or robust scaling), and why can this matter for regularized
     regression?
   - Are there interactions that are conceptually meaningful
     (e.g. `BHK × Size`, `City × DistanceFromMainArea`)?
   - Can we create more interpretable features (e.g. rent per square
     foot, distance to city centre bins)?

3. **Transformations of response and regressors**
   - Is the distribution of `Rent` highly skewed or heavy-tailed? Would a
     log transformation (modeling $\log(\text{Rent})$) stabilize
     variance and make residuals closer to normal?
   - Do some predictors show non-linear relationships with rent? Would
     polynomial terms, splines, or monotone transforms (log, square
     root) be appropriate?
   - Are there predictors that should be centered or standardized before
     creating interaction or polynomial terms?

4. **Model specification and selection**
   - Start with a simple baseline: which variables should be included in
     a first OLS model, and how do residual plots look?
   - How to compare alternative specifications
     (different sets of features, transformed vs untransformed variables)
     using cross-validation or a validation set?
   - When is it useful to move from plain OLS to regularized models such
     as ridge or lasso (e.g. many correlated predictors, high variance)?

5. **Model evaluation and diagnostics**
   - How to check linear model assumptions: residual vs fitted plots,
     QQ plots, heteroscedasticity, influential observations?
   - Which error metrics are most relevant here
     (RMSE, MAE, MAPE)?  How do training and test errors compare
     (overfitting vs underfitting)?
   - Are there systematic groups of houses (by city, BHK, furnishing)
     for which the model performs much worse, suggesting missing
     structure or interactions?
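
The error metrics named in step 5 can be sketched as follows. Since the notebook fits models to `log_Rent`, predictions are first mapped back to the rent scale with `exp` before the errors are computed. The three observations and predictions are invented numbers used only to show the arithmetic.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))


# Invented example values: true rents and predictions made on the log scale
y_true = np.array([10000.0, 20000.0, 40000.0])
log_pred = np.log(np.array([11000.0, 18000.0, 50000.0]))

# Back-transform to the original rent scale before scoring
y_pred = np.exp(log_pred)

print(f"RMSE: {rmse(y_true, y_pred):.2f}")
print(f"MAE:  {mae(y_true, y_pred):.2f}")
print(f"MAPE: {mape(y_true, y_pred):.2f}%")
```

RMSE is dominated by the single large miss on the most expensive listing, while MAPE weights all three listings by their relative error, which is one reason the choice of metric matters for a skewed target like rent.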



### More detailed questions to explore

- **Preprocessing**
  - How should missing values be handled for each variable (impute,
    drop, or create explicit missing indicators)?
  - Do we need to cap or Winsorize extreme values of `Rent` or `Size`
    before fitting a linear model?
  - Are there rare categories (e.g. cities or furnishing statuses with
    very few observations) that should be grouped together?

- **Transformations and linearity**
  - Plot `Rent` (or $\log(\text{Rent})$) against key predictors:
    `Size`, `BHK`, `Bathroom`, `City`, etc.  Do the relationships look
    approximately linear after transformation?
  - Would modeling $\log(\text{Rent})$ make residuals more symmetric and
    reduce heteroscedasticity?

- **Multicollinearity and regularization**
  - Are some predictors strongly correlated (e.g. `Size` and `BHK`)?  How
    do VIFs and condition numbers look for the chosen design matrix?
  - How do ridge and lasso behave in this dataset in terms of coefficient
    shrinkage and variable selection?
  - Which predictors consistently get selected by lasso across
    cross-validation folds?

- **Model selection and validation**
  - How does test error change when we:
    1. Add more predictors,
    2. Add interaction terms,
    3. Add polynomial terms,
    4. Switch from OLS to ridge/lasso?
  - How to choose the final model: by minimum cross-validated RMSE,
    parsimony (fewest predictors), or domain interpretability?

Use these questions as a checklist to design your own modeling pipeline
for the house rent dataset using linear regression and its regularized
variants.
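
As a starting point for the model-selection questions above, the following sketch compares OLS with ridge regression by k-fold cross-validated RMSE. It uses a closed-form ridge solution in NumPy and simulated data with two nearly collinear predictors; the helper names `ridge_fit` and `cv_rmse`, the penalty values, and the data-generating process are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge fit on centred data; alpha=0 reduces to OLS.

    The intercept is recovered from the means and is not penalised.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef


def cv_rmse(X, y, alpha, k=5, seed=0):
    """Mean RMSE over k shuffled folds for a given ridge penalty."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, intercept = ridge_fit(X[train], y[train], alpha)
        pred = X[test] @ coef + intercept
        scores.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return float(np.mean(scores))


# Simulated data: the second column is almost a copy of the first
rng = np.random.default_rng(1)
n = 200
size = rng.normal(0, 1, n)
X = np.column_stack([size, size + rng.normal(0, 0.05, n), rng.normal(0, 1, n)])
y = 1.0 * size + 0.5 * X[:, 2] + rng.normal(0, 0.5, n)

results = {alpha: cv_rmse(X, y, alpha) for alpha in (0.0, 1.0, 10.0)}
for alpha, score in results.items():
    print(f"alpha={alpha:>4}: CV RMSE = {score:.3f}")
```

The same folds are used for every penalty because the seed is fixed, so the scores are directly comparable. On the rent data the predictors would need to be standardized first, since the ridge penalty depends on the scale of each column.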




---


There are no NaNs. There seem to be no immediate outliers.