<a href="https://colab.research.google.com/github/BrystofKlazek/RAD/blob/main/code/01RAD_Ex12_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle house data set



## Downloading the Kaggle house rent dataset

The dataset we will use comes from Kaggle:

- *House Rent Prediction Dataset*  
  https://www.kaggle.com/datasets/iamsouravbanerjee/house-rent-prediction-dataset/data

The cell below uses `kagglehub`, which downloads public datasets without
credentials. If you prefer the official Kaggle API instead, you need an
API token (see *Account → API → Create New Token* on Kaggle), provided
either through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment
variables or as a `kaggle.json` file in the standard location.
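
If you take the API-token route, a quick check that the credentials are visible to the notebook might look like the sketch below. The helper name `kaggle_credentials_available` is hypothetical, and the file location assumes the default `~/.kaggle/kaggle.json`; a custom configuration directory would need a different path.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if Kaggle API credentials appear to be configured.

    Hypothetical helper: checks the two environment variables and the
    default ~/.kaggle/kaggle.json file (assumed location).
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    has_file = (home / ".kaggle" / "kaggle.json").is_file()
    return has_env or has_file


print("Kaggle credentials found:", kaggle_credentials_available())
```

The `env` and `home` arguments exist only so the check can be exercised without touching the real environment.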

Example of Auto NB - let's beat it:

https://www.kaggle.com/code/sahityasetu/boosting-algorithms-for-machine-learning


In [None]:

# Download the Kaggle house rent dataset using kagglehub (no API key needed for public data)
try:
    import kagglehub  # lightweight helper for Kaggle datasets
except ImportError:  # pragma: no cover
    %pip install -q kagglehub
    import kagglehub

# Download latest version of the dataset; this returns a local directory path
path = kagglehub.dataset_download("iamsouravbanerjee/house-rent-prediction-dataset")
print("Path to dataset files:", path)


In [None]:

from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from patsy import dmatrices


# `path` is a directory returned by kagglehub; locate the CSV inside it
dataset_dir = Path(path)
candidates = list(dataset_dir.rglob("House_Rent_Dataset.csv"))
if not candidates:
    raise FileNotFoundError(f"House_Rent_Dataset.csv not found under {dataset_dir}")

csv_path = candidates[0]
print("Loading data from:", csv_path)
house = pd.read_csv(csv_path)
print("Shape:", house.shape)
house.head()



## Questions for a linear regression analysis of house rent

When building a linear regression model for rent, it is useful to think
in terms of a workflow:

1. **Understand the data**
   - What is the response variable (e.g. `Rent`)?  
     What are the main predictor types (numeric, categorical, locations,
     amenities)?
   - Are there obvious data quality issues (missing values, impossible
     values, outliers)?

2. **Preprocessing and feature engineering**
   - How should categorical variables (e.g. city, furnishing status,
     point of contact) be encoded for a linear model (one-hot encoding,
     target encoding, etc.)?
   - Which numeric variables might benefit from scaling (standardization
     or robust scaling), and why can this matter for regularized
     regression?
   - Are there interactions that are conceptually meaningful
     (e.g. `BHK × Size`, `City × DistanceFromMainArea`)?
   - Can we create more interpretable features (e.g. rent per square
     foot, distance to city centre bins)?

3. **Transformations of response and regressors**
   - Is the distribution of `Rent` highly skewed or heavy-tailed? Would a
     log transformation (modeling $\log(\text{Rent})$) stabilize
     variance and make residuals closer to normal?
   - Do some predictors show non-linear relationships with rent? Would
     polynomial terms, splines, or monotone transforms (log, square
     root) be appropriate?
   - Are there predictors that should be centered or standardized before
     creating interaction or polynomial terms?

4. **Model specification and selection**
   - Start with a simple baseline: which variables should be included in
     a first OLS model, and how do residual plots look?
   - How to compare alternative specifications
     (different sets of features, transformed vs untransformed variables)
     using cross-validation or a validation set?
   - When is it useful to move from plain OLS to regularized models such
     as ridge or lasso (e.g. many correlated predictors, high variance)?

5. **Model evaluation and diagnostics**
   - How to check linear model assumptions: residual vs fitted plots,
     QQ plots, heteroscedasticity, influential observations?
   - Which error metrics are most relevant here
     (RMSE, MAE, MAPE)?  How do training and test errors compare
     (overfitting vs underfitting)?
   - Are there systematic groups of houses (by city, BHK, furnishing)
     for which the model performs much worse, suggesting missing
     structure or interactions?
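
The workflow above can be sketched end to end on a small synthetic table. The column names mirror the dataset, but the data, the generating coefficients, and the 80/20 split are all assumptions made for illustration; this is a minimal baseline, not a recommended final pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the rent data (NOT the Kaggle file).
toy = pd.DataFrame({
    "Size": rng.uniform(300, 2500, n),
    "BHK": rng.integers(1, 5, n),
    "City": rng.choice(["A", "B", "C"], n),
})
city_effect = toy["City"].map({"A": 0.0, "B": 0.4, "C": 0.9})
toy["Rent"] = np.exp(8.5 + 0.0006 * toy["Size"] + 0.15 * toy["BHK"]
                     + city_effect + rng.normal(0, 0.3, n))

# Step 3: model log(Rent). Step 2: one-hot encode City
# (drop_first avoids collinearity with the intercept).
y = np.log(toy["Rent"]).to_numpy()
X = pd.get_dummies(toy[["Size", "BHK", "City"]], columns=["City"],
                   drop_first=True, dtype=float)
X.insert(0, "Intercept", 1.0)

# Step 4: simple random 80/20 split and an OLS baseline via least squares.
idx = rng.permutation(n)
train, test = idx[:400], idx[400:]
Xtr, Xte = X.to_numpy()[train], X.to_numpy()[test]
beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Step 5: compare train and test error on the log scale,
# plus MAE back on the original rent scale.
rmse_train = rmse(y[train], Xtr @ beta)
rmse_test = rmse(y[test], Xte @ beta)
mae_rent_test = float(np.mean(np.abs(toy["Rent"].to_numpy()[test]
                                     - np.exp(Xte @ beta))))
print(f"RMSE(log) train={rmse_train:.3f} test={rmse_test:.3f}")
print(f"MAE on rent scale (test): {mae_rent_test:.0f}")
```

Note that exponentiating a log-scale prediction estimates the conditional median of rent, not the mean, so original-scale errors should be read with that in mind.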



### More detailed questions to explore

- **Preprocessing**
  - How should missing values be handled for each variable (impute,
    drop, or create explicit missing indicators)?
  - Do we need to cap or Winsorize extreme values of `Rent` or `Size`
    before fitting a linear model?
  - Are there rare categories (e.g. cities or furnishing statuses with
    very few observations) that should be grouped together?

- **Transformations and linearity**
  - Plot `Rent` (or $\log(\text{Rent})$) against key predictors:
    `Size`, `BHK`, `Bathroom`, `City`, etc.  Do the relationships look
    approximately linear after transformation?
  - Would modeling $\log(\text{Rent})$ make residuals more symmetric and
    reduce heteroscedasticity?

- **Multicollinearity and regularization**
  - Are some predictors strongly correlated (e.g. `Size` and `BHK`)?  How
    do VIFs and condition numbers look for the chosen design matrix?
  - How do ridge and lasso behave in this dataset in terms of coefficient
    shrinkage and variable selection?
  - Which predictors consistently get selected by lasso across
    cross-validation folds?

- **Model selection and validation**
  - How does test error change when we:
    1. Add more predictors,
    2. Add interaction terms,
    3. Add polynomial terms,
    4. Switch from OLS to ridge/lasso?
  - How to choose the final model: by minimum cross-validated RMSE,
    parsimony (fewest predictors), or domain interpretability?
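
For the multicollinearity question, variance inflation factors can be computed directly from auxiliary regressions, $\mathrm{VIF}_j = 1 / (1 - R_j^2)$. The sketch below does this with plain NumPy on synthetic data in which `Size` is driven by `BHK`; the numbers are invented for illustration. statsmodels offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.

```python
import numpy as np


def vif(X):
    """VIF for each column of X (X must not contain an intercept column).

    Each column is regressed on all the others plus a constant;
    VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


rng = np.random.default_rng(1)
n = 400
bhk = rng.integers(1, 5, n).astype(float)
size = 400 * bhk + rng.normal(0, 150, n)     # Size strongly tied to BHK (assumed)
bath = rng.integers(1, 4, n).astype(float)   # generated independently
X = np.column_stack([size, bhk, bath])

vifs = vif(X)
# Condition number of the standardized design (scale-free).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cond = float(np.linalg.cond(Z))

for name, v in zip(["Size", "BHK", "Bathroom"], vifs):
    print(f"VIF {name}: {v:.2f}")
print(f"Condition number (standardized): {cond:.2f}")
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign that coefficient estimates for those predictors are unstable.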

Use these questions as a checklist to design your own modeling pipeline
for the house rent dataset using linear regression and its regularized
variants.
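
One way to make the model-selection question concrete is to compare cross-validated RMSE across penalty strengths. The sketch below uses a closed-form ridge estimator and a hand-rolled k-fold loop in NumPy, with `lam = 0` corresponding to plain OLS; the data, the penalty grid, and the function names `ridge_fit` and `cv_rmse` are assumptions for illustration. Lasso has no closed form and is not covered here.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge on standardized X; the intercept is not penalized.

    Returns a prediction function. lam = 0 gives ordinary least squares.
    """
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sd
    y_mean = y.mean()
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ (y - y_mean))
    return lambda Xnew: y_mean + ((Xnew - mu) / sd) @ w


def cv_rmse(X, y, lam, k=5, seed=0):
    """Mean validation RMSE over k random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        predict = ridge_fit(X[trn], y[trn], lam)  # scaling fitted on training folds only
        errs.append(np.sqrt(np.mean((y[val] - predict(X[val])) ** 2)))
    return float(np.mean(errs))


# Synthetic data: 30 noisy copies of 3 underlying factors (correlated predictors).
rng = np.random.default_rng(2)
n, p = 120, 30
base = rng.normal(size=(n, 3))
X = base[:, rng.integers(0, 3, p)] + 0.3 * rng.normal(size=(n, p))
y = base @ np.array([1.0, -0.5, 0.8]) + rng.normal(0, 1.0, n)

lams = [0.0, 1.0, 10.0, 100.0]
scores = {lam: cv_rmse(X, y, lam) for lam in lams}
best = min(scores, key=scores.get)
for lam, s in scores.items():
    print(f"lambda={lam:>6}: CV RMSE={s:.3f}")
print("Selected lambda:", best)
```

Standardizing inside `ridge_fit` keeps the scaling statistics from leaking validation data into training, which matters because the ridge penalty is scale-dependent.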


In [None]:
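df = house.copy()

# Only the last expression in a cell is displayed, so print each check explicitly
print("Shape:", df.shape)
print("Missing values per column:", df.isna().sum(), sep="\n")
print("Duplicated rows:", df.duplicated().sum())
print("Dtypes:", df.dtypes, sep="\n")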


In [None]:
print(df.describe(percentiles=[.01,.05,.1,.25,.5,.75,.9,.95,.99]))
print(df["Rent"].max(), df["Size"].min(), df["Size"].max())

df["log_rent"] = np.log(df["Rent"])


In [None]:
for col in ['Floor', 'Area Type', 'Area Locality', 'City', 'Furnishing Status', 'Tenant Preferred']:
    print(f"Unique values for '{col}':")
    print(df[col].unique())
    print("\n")
    print(f"Column '{col}': {df[col].nunique()} unique values")

In [None]:
import re

def parse_floor(floor_str):
    floor_str = str(floor_str).lower().strip()
    current_floor = None
    total_floors = None

    # Handle 'X out of Y' format
    match_out_of = re.match(r'(.+) out of (\d+)', floor_str)
    if match_out_of:
        current_part = match_out_of.group(1).strip()
        total_floors = int(match_out_of.group(2))
    else:
        current_part = floor_str

    if current_part == 'ground':
        current_floor = 0
    elif current_part == 'upper basement':
        current_floor = -1
    elif current_part == 'lower basement':
        current_floor = -2
    else:
        try:
            current_floor = int(current_part)
        except ValueError:
            pass

    # If total_floors wasn't extracted from 'out of Y' and current_floor is known and non-negative,
    # assume total_floors is the same as current_floor (e.g., "3" means 3 out of 3).
    if total_floors is None and current_floor is not None and current_floor >= 0:
        total_floors = current_floor

    if current_floor is None or total_floors is None:
        raise ValueError(f"Could not parse floor string: '{floor_str}'")

    return current_floor, total_floors

df[['current_floor', 'total_floors']] = df['Floor'].apply(lambda x: pd.Series(parse_floor(x)))

print("Original 'Floor' column unique values:")
print(df['Floor'].unique())
print("\n")

print("New 'current_floor' unique values:")
print(df['current_floor'].unique())
print("\n")

print("New 'total_floors' unique values:")
print(df['total_floors'].unique())
print("\n")

display(df[['Floor', 'current_floor', 'total_floors']].head())



In [None]:
min_n = 20  # tune this
counts = df["Area Locality"].value_counts()
keep = counts[counts >= min_n].index

df["AreaLocality_grp"] = np.where(df["Area Locality"].isin(keep),
                                  df["Area Locality"],
                                  "Other")

print("Unique values for AreaLocality_grp:")
print(df["AreaLocality_grp"].unique())
print("\n")
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Column 'AreaLocality_grp': {df['AreaLocality_grp'].nunique()} unique values")



In [None]:
sns.histplot(df["Rent"], bins=60)
plt.yscale("log")
plt.title("Rent (log y-scale)")
plt.show()

sns.histplot(df["log_rent"], bins=60)
plt.title("log(Rent)")
plt.show()

sns.scatterplot(data=df, x="Size", y="log_rent", hue="City", alpha=0.4, linewidth=0)
plt.title("log(Rent) vs Size")
plt.show()

sns.boxplot(data=df, x="City", y="log_rent")
plt.xticks(rotation=30, ha="right")
plt.title("log_rent by City")
plt.show()


In [None]:
formula_loc_grp = """
log_rent ~ Size + BHK + Bathroom
+ C(City) + C(Q("Furnishing Status")) + C(Q("Tenant Preferred")) + C(Q("Area Type"))
+ C(AreaLocality_grp)
"""
m_loc_grp = smf.ols(formula_loc_grp, data=df).fit(cov_type="HC3")
print(m_loc_grp.summary())

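# Variant: let the Size slope differ by Area Type.
# Area Type has no main effect here, so all area types share one intercept.
formula_area_interact = """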
log_rent ~ Size:C(Q("Area Type")) + BHK + Bathroom
+ C(City) + C(Q("Furnishing Status")) + C(Q("Tenant Preferred"))
+ C(AreaLocality_grp)
"""
m_area = smf.ols(formula_area_interact, data=df).fit(cov_type="HC3")
print(m_area.summary())
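
Because both models use `log_rent` as the response, a coefficient $\beta$ corresponds to a multiplicative effect on rent of $e^{\beta}$, i.e. a change of about $100\,(e^{\beta} - 1)$ percent. The sketch below applies that conversion to made-up coefficients; the values and labels are hypothetical and are not taken from the fitted models.

```python
import numpy as np


def pct_effect(beta):
    """Percent change in Rent implied by a coefficient in a log(Rent) model."""
    return 100.0 * (np.exp(beta) - 1.0)


# Hypothetical coefficients, NOT estimates from m_loc_grp or m_area.
example_coefs = {
    "C(City)[T.SomeCity]": 0.25,          # dummy variable: effect vs baseline city
    "Size (per 100 sq ft)": 100 * 0.0005,  # per-unit slope scaled to 100 sq ft
}
effects = {name: float(pct_effect(b)) for name, b in example_coefs.items()}
for name, eff in effects.items():
    print(f"{name}: {eff:+.1f}% rent")
```

With the fitted models, the same function can be applied to `m_loc_grp.params` to read the whole coefficient table on a percent scale.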
