**STEP 1: Load & basic cleaning**

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv("/content/train (1).csv")

# Drop Id
df.drop(columns=['Id'], inplace=True)

**Fill NaNs in numerical columns: median**

In [None]:
# Numerical columns → median
num_cols = df.select_dtypes(include='number').columns
for col in num_cols:
    # Assign the result back instead of calling fillna(inplace=True) on df[col]:
    # the chained in-place form acts on a copy and stops working in pandas 3.0
    df[col] = df[col].fillna(df[col].median())


**Fill NaNs in categorical columns: mode**

In [None]:
# Categorical columns → mode
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    # Assign the result back instead of using a chained fillna(inplace=True)
    df[col] = df[col].fillna(df[col].mode()[0])


**STEP 2: Outlier handling**

In [None]:
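num_cols = df.select_dtypes(include='number').columns
num_cols = num_cols.drop('SalePrice')  # leave the target untouched

# Cap every numeric feature at the 1.5 * IQR fences instead of dropping rows.
# Caveat: when Q1 == Q3 (IQR is 0), both fences equal Q1, so the column
# becomes constant after clipping.
for col in num_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    df[col] = df[col].clip(lower, upper)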

**STEP 3: Special handling for LotArea**

In [None]:
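# Clip LotArea to its 1st-99th percentile range, then log-transform it.
# Note: LotArea was already IQR-clipped in Step 2, so these percentiles are
# computed on the clipped values and this extra clip may change very little.
lower = df['LotArea'].quantile(0.01)
upper = df['LotArea'].quantile(0.99)

df['LotArea'] = df['LotArea'].clip(lower, upper)
df['LotArea'] = np.log1p(df['LotArea'])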

**STEP 4: Target transform**

In [None]:
X = df.drop('SalePrice', axis=1)
y = np.log1p(df['SalePrice'])

**STEP 5: Drop high-cardinality categorical columns (LR decision)**

In [None]:
obj_cols = X.select_dtypes(include='object')
high_card_cols = obj_cols.columns[obj_cols.nunique() > 10]

X = X.drop(columns=high_card_cols)

**STEP 6: Identify categorical columns for OHE**

In [None]:
ohe_cols = X.select_dtypes(include='object').columns

**STEP 7: Train-test split**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

**STEP 8: Scaling + One-Hot Encoding**

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

num_cols = X_train.select_dtypes(include='number').columns

num_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

cat_pipeline = OneHotEncoder(
    drop='first',
    handle_unknown='ignore',
    sparse_output=False
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, num_cols),
        ('cat', cat_pipeline, ohe_cols)
    ]
)

**STEP 9: Apply preprocessing**

In [None]:
X_train_final = preprocessor.fit_transform(X_train)
X_test_final  = preprocessor.transform(X_test)

print(X_train_final.shape)
print(X_test_final.shape)

(1168, 189)
(292, 189)




**STEP 10: Train Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train_final, y_train)

y_train_pred = lr.predict(X_train_final)
y_test_pred  = lr.predict(X_test_final)

rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test  = np.sqrt(mean_squared_error(y_test, y_test_pred))

r2_train = r2_score(y_train, y_train_pred)
r2_test  = r2_score(y_test, y_test_pred)

print(f"Train RMSE (log): {rmse_train:.4f}")
print(f"Test RMSE  (log): {rmse_test:.4f}")
print(f"Train R²: {r2_train:.4f}")
print(f"Test  R²: {r2_test:.4f}")

Train RMSE (log): 0.0976
Test RMSE  (log): 0.1527
Train R²: 0.9375
Test  R²: 0.8751


House Price Prediction using Linear Regression
📌 Project Overview

This project focuses on predicting house prices using Linear Regression.
The main goal was to understand the complete end-to-end machine learning workflow, from raw data analysis (EDA) through building and evaluating a regression model.

Instead of jumping directly to advanced models, I intentionally built a strong baseline model by carefully handling data quality issues like missing values, outliers, skewness, and categorical variables.

📂 Dataset Information

Dataset Name: House Prices – Advanced Regression Techniques

Source: Kaggle

Link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

Dataset Size:

Rows: 1460

Columns: 81 (including target)

The target variable is:

SalePrice → Price at which the house was sold

🛠️ Tools & Libraries Used

Python

NumPy

Pandas

Scikit-learn

🔎 Step-by-Step Workflow
1️⃣ Data Loading & Basic Cleaning

Loaded the dataset using Pandas

Dropped the Id column (not useful for prediction)

2️⃣ Handling Missing Values (Manual)

To keep things simple and transparent:

Numerical columns → filled missing values using median

Categorical columns → filled missing values using mode

This removed all NaN values from the dataset.
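
As a rough illustration of the median/mode fill described above, here is a minimal sketch on a made-up two-column frame. The column values are invented and the frame name `toy` is hypothetical; it builds one mapping of fill values and applies it in a single `fillna` call, which avoids the chained in-place pattern.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "GarageType": ["Attchd", None, "Attchd", "Detchd"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# One {column: fill value} mapping: median for numbers, mode for categories.
fill_values = {col: toy[col].median() for col in num_cols}
fill_values.update({col: toy[col].mode()[0] for col in cat_cols})

# A single non-chained call that returns a new, fully filled frame.
toy = toy.fillna(fill_values)
print(toy)
```

Here the missing LotFrontage becomes 70.0 (the median of 60, 70, 80) and the missing GarageType becomes "Attchd" (the most frequent value).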

3️⃣ Outlier Handling

Used the IQR (Interquartile Range) method to cap outliers in numerical features

Instead of removing rows, values were clipped to reduce extreme influence

Special handling:

LotArea had heavy right skew

Applied 1st–99th percentile clipping

Followed by log transformation
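
The IQR capping can be sketched on a small made-up series. The helper name `clip_iqr` is hypothetical (it is not part of the notebook), and the second series is only there to show a side effect worth knowing about: when the first and third quartiles coincide, every value is clipped to the same number.

```python
import pandas as pd

def clip_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - k * iqr, q3 + k * iqr)

# Made-up values with one extreme point.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = clip_iqr(values)
print(clipped.tolist())  # Q1 = 2, Q3 = 4, so the upper fence is 7

# A mostly-zero feature: Q1 == Q3 == 0, so the fences collapse to 0.
mostly_zero = pd.Series([0.0, 0.0, 0.0, 0.0, 5.0])
collapsed = clip_iqr(mostly_zero)
print(collapsed.tolist())
```

For sparse features like the second one, it may be worth skipping the clip or checking `nunique()` afterwards, since a constant column carries no information for the model.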

4️⃣ Target Variable Transformation

SalePrice was heavily right-skewed

Applied log1p(SalePrice) to:

Reduce skewness

Improve linear regression performance

Stabilize variance
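
Because the model is trained on log1p(SalePrice), its predictions are on the log scale and need to be mapped back with the inverse transform before they can be read as prices. A minimal sketch with made-up prices (not taken from the dataset):

```python
import numpy as np

# Made-up sale prices in dollars.
prices = np.array([100_000.0, 200_000.0, 755_000.0])

# Forward transform used for the target.
y_log = np.log1p(prices)

# Inverse transform: expm1 undoes log1p, returning values in dollars.
recovered = np.expm1(y_log)

print(y_log.round(3))
print(recovered.round(0))
```

The same `np.expm1` call would apply to model predictions made on the log scale.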

5️⃣ Handling Categorical Features

Identified categorical columns based on data type

High-cardinality categorical columns (more than 10 unique values) were dropped for Linear Regression to:

Avoid feature explosion

Reduce multicollinearity

Keep the model interpretable

(These columns can be reintroduced later for Ridge, Lasso, or tree-based models.)
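
The cardinality filter can be sketched as follows on a made-up frame. The column values are invented; the threshold of 10 matches the rule described above.

```python
import pandas as pd

# Toy frame: one high-cardinality and one low-cardinality categorical column.
toy = pd.DataFrame({
    "Neighborhood": [f"N{i}" for i in range(12)],  # 12 distinct values
    "Street": ["Pave", "Grvl"] * 6,                # 2 distinct values
    "LotArea": range(12),
})

# Count distinct values per non-numeric column.
counts = toy.select_dtypes(exclude="number").nunique()

# Columns above the threshold are dropped; the rest go on to one-hot encoding.
high_card = counts[counts > 10].index
reduced = toy.drop(columns=high_card)

print(list(high_card))
print(list(reduced.columns))
```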

6️⃣ Feature Encoding

Applied One-Hot Encoding to remaining categorical columns

Used drop='first' to avoid the dummy variable trap

Handled unseen categories safely

7️⃣ Feature Scaling

Numerical features were scaled using StandardScaler

Scaling was applied only to numeric columns (not categorical dummies)
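
A minimal sketch of what the scaler does, on made-up numbers: the mean and standard deviation are learned from the training rows only and then reused unchanged for the test rows. The array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train_vals = np.array([[1.0], [2.0], [3.0]])
test_vals = np.array([[4.0]])

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(train_vals)

train_scaled = scaler.transform(train_vals)
# The test row is shifted and scaled with the training statistics.
test_scaled = scaler.transform(test_vals)

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean 0, while the test value lands outside the training range, as it should.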

8️⃣ Train-Test Split

Split the data into:

80% Training

20% Testing

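Used a fixed random_state (42) for reproducibility. In the notebook the split is done before the scaler and encoder are fitted, so both are fit on the training data only; the median/mode fills and the clipping bounds from the earlier steps, however, were computed on the full dataset.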

9️⃣ Model Building

Built a Linear Regression model

Trained the model using the processed training data

🔟 Model Evaluation
Evaluation Metrics Used:

R² Score

RMSE (Root Mean Squared Error)

Final Results:
Train R² : 0.9375
Test  R² : 0.8751

Train RMSE (log): 0.0976
Test  RMSE (log) : 0.1527

Interpretation:

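The model generalizes reasonably well to unseen data: test R² is 0.8751 and test RMSE is 0.1527 on the log scale

The train–test gap (R² of 0.9375 vs 0.8751) indicates some overfitting, which is plausible for an unregularized model with 189 encoded features and 1,168 training rows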

Overall, this is a solid baseline Linear Regression model
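
To make the log-scale RMSE easier to read, it can be turned into an approximate multiplicative error on the price. This is only a rough reading, since it treats log1p as a plain log (reasonable for prices in the tens of thousands and above), and the predicted value below is made up.

```python
import numpy as np

# Test RMSE on log1p(SalePrice), taken from the results above.
rmse_log_test = 0.1527

# exp(RMSE) is the typical factor by which a prediction is off.
typical_factor = np.exp(rmse_log_test)
print(f"typical multiplicative error: x{typical_factor:.3f}")

# Mapping a made-up log-scale prediction back to dollars.
pred_log = np.log1p(180_000.0)
pred_price = np.expm1(pred_log)
print(f"predicted price: {pred_price:,.0f}")
```

A factor of roughly 1.16 to 1.17 means a typical test prediction is off by about 16 to 17 percent of the sale price.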