Task 1 – Data Preprocessing for Machine Learning
Objective

The goal of preprocessing is to clean and prepare the dataset so that machine learning models can learn from it effectively. Raw data often contains unnecessary columns, inconsistent formats, and features whose values need scaling.

In [31]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Loaded Dataset

Dataset: House Price India.csv

Shape: 14,620 rows × 23 columns

In [32]:
# Load dataset
df = pd.read_csv("House Price India.csv")
df.shape

(14620, 23)
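
Before dropping or encoding anything, it can help to see each column's dtype, missing-value count, and number of distinct values. The sketch below runs on a small hypothetical frame, not the real dataset: the values are invented, and "lot area" is only an assumed column name.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are invented)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "lot area": [5000.0, None, 7200.0],
    "Price": [1400000, 838000, 805000],
})

# One row per column: dtype, number of missing values, number of distinct values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),
})
print(summary)
```

A column whose unique count equals the row count (like id here) is typically an identifier, and a non-zero missing count indicates that imputation or row removal may be needed before scaling.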

Removed Unnecessary Columns

Dropped id (a unique identifier with no predictive value) and Date (treated here as not useful for predicting price). This reduces the dataset from 23 to 21 columns.


In [33]:
# Drop unnecessary columns
df_processed = df.drop(columns=["id", "Date"])
df_processed.shape

(14620, 21)

Categorical Encoding

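Postal Code is a numeric label rather than a quantity, so it was converted to a categorical type.

Applied One-Hot Encoding, which created one indicator column per postal code. Because drop_first=True is set, the first postal code gets no column of its own and serves as the baseline.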


In [34]:
# Treat Postal Code as categorical and one-hot encode
df_processed["Postal Code"] = df_processed["Postal Code"].astype("category")
df_processed = pd.get_dummies(df_processed, columns=["Postal Code"], drop_first=True)
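
To make the effect of drop_first=True concrete, here is a minimal sketch on a toy frame. The postal codes and prices are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data; postal codes and prices are invented for illustration
toy = pd.DataFrame({
    "Postal Code": [122003, 122004, 122005, 122003],
    "Price": [100, 200, 300, 150],
})
toy["Postal Code"] = toy["Postal Code"].astype("category")

# drop_first=True omits the first category (122003), which becomes the baseline:
# a row with all indicator columns off belongs to that postal code
encoded = pd.get_dummies(toy, columns=["Postal Code"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Three distinct postal codes yield two indicator columns. In general, k categories produce k - 1 columns with this setting.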

Scaling Features

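Used StandardScaler to standardize every feature column to zero mean and unit variance, so that values are on a comparable scale. This applies to all columns of X, including the one-hot postal code indicators.

This prevents large-value columns (like lot area) from dominating small-value ones (like number of bathrooms) in scale-sensitive models.

Note that the scaler here is fit on the full dataset before the train-test split, so test-set statistics influence the scaling. Fitting it on the training split only would avoid this leakage.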

In [35]:
# Separate features and target
X = df_processed.drop(columns=["Price"])
y = df_processed["Price"]

In [36]:
# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
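
As an illustration of what the scaler computes, the sketch below standardizes a small made-up matrix and checks the result against the formula z = (x - mean) / std applied by hand. The names X_toy and toy_scaler are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: column 0 mimics a large-valued feature, column 1 a small-valued one
X_toy = np.array([[5000.0, 1.0], [7000.0, 2.0], [9000.0, 3.0]])

toy_scaler = StandardScaler()
X_toy_scaled = toy_scaler.fit_transform(X_toy)

# Same computation by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_scaled, manual))
print(X_toy_scaled.mean(axis=0), X_toy_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.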

Train-Test Split

Split the dataset into 80% training (11,696 rows) and 20% testing (2,924 rows), using a fixed random_state so the split is reproducible.

Why This Matters

Clean, scaled data keeps features with large numeric ranges from dominating scale-sensitive models, and a held-out test set gives an estimate of how the model performs on data it has not seen.

In [37]:
# Train-test split (80-20)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

In [38]:
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)

Training data shape: (11696, 88)
Testing data shape: (2924, 88)
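
The cells above fit the scaler on all rows before splitting. A common alternative, sketched here on synthetic data rather than the house price dataset, is to split first and fit the scaler on the training rows only, so that no test-set statistics enter the preprocessing. The names X_demo, y_demo, and demo_scaler are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 50 rows, 3 features (not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(50, 3))
y_demo = rng.normal(size=50)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training columns have mean exactly 0, while the test columns are only approximately centered. That is expected, since the test rows are scaled with statistics they did not contribute to.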
