## 1. Creating Income Categories

In [1]:
import pandas as pd                     # Import Pandas for data handling
import numpy as np                      # Import NumPy for numerical operations

# Load the dataset
data = pd.read_csv("housing.csv")       # Read CSV file and store it as a DataFrame

# Create income categories
data["income_cat"] = pd.cut(            # Create a new column by binning median_income
    data["median_income"],              # Column to be divided into categories
    bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],  # Income bin edges; np.inf leaves the last bin open-ended
    labels=[1, 2, 3, 4, 5]               # Assign category labels to each range
)
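
To see how `pd.cut` assigns the labels, here is a small sketch on a few made-up income values (the `sample_income` series is hypothetical, not taken from the dataset). Bins are right-inclusive by default, so a value of exactly 1.5 falls in category 1 rather than category 2.

```python
import numpy as np
import pandas as pd

# Hypothetical income values chosen to sit on and around the bin edges
sample_income = pd.Series([0.5, 1.5, 1.6, 3.0, 4.6, 7.2])

sample_cat = pd.cut(
    sample_income,
    bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],  # intervals are (0, 1.5], (1.5, 3.0], ...
    labels=[1, 2, 3, 4, 5],
)

print(sample_cat.tolist())  # [1, 1, 2, 2, 4, 5]
```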


## 2. Stratified Shuffle Split in Scikit-Learn

##### Scikit-learn provides a built-in way to perform stratified sampling using `StratifiedShuffleSplit`.

###### Here’s how you can use it:

In [2]:
from sklearn.model_selection import StratifiedShuffleSplit   # Import class for stratified splitting

# Assume income_cat is already created from median_income
split = StratifiedShuffleSplit(
    n_splits=1,          # Number of train-test splits to generate
    test_size=0.2,       # 20% of data will be used as test set
    random_state=42      # Fix randomness for reproducibility
)

for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]   # Select training data using stratified indices
    strat_test_set = data.loc[test_index]     # Select test data using stratified indices
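
One way to check that a split is actually stratified is to compare the category proportions in the test set with those in the full dataset. The sketch below uses a small synthetic DataFrame (`demo`, filled with random incomes) instead of `housing.csv`, so every name and number in it is illustrative only. It selects rows with `iloc` because `split` yields positional indices; `loc` gives the same result only when the DataFrame has the default 0..n-1 index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (assumption: uniformly random incomes)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_income": rng.uniform(0.5, 10.0, size=1000)})
demo["income_cat"] = pd.cut(
    demo["median_income"],
    bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(demo, demo["income_cat"]))
demo_train, demo_test = demo.iloc[train_idx], demo.iloc[test_idx]

# Share of each income category in the full data vs. the test set
comparison = pd.DataFrame({
    "overall": demo["income_cat"].value_counts(normalize=True),
    "test": demo_test["income_cat"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

If the split is stratified, the two columns should be nearly identical for every category.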


## 3. Let's Remove the Income Category Column

In [3]:
# Remove the income_cat column from both sets
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

In [4]:
strat_train_set
df = strat_train_set.copy()  # Copy the training set into df; all further work is done on df, our training dataset
df

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 12655 | -121.46 | 38.52 | 29 | 3873 | 797.0 | 2237 | 706 | 2.1736 | 72100 | INLAND |
| 15502 | -117.23 | 33.09 | 7 | 5320 | 855.0 | 2015 | 768 | 6.3373 | 279600 | NEAR OCEAN |
| 2908 | -119.04 | 35.37 | 44 | 1618 | 310.0 | 667 | 300 | 2.8750 | 82700 | INLAND |
| 14053 | -117.13 | 32.75 | 24 | 1877 | 519.0 | 898 | 483 | 2.2264 | 112500 | NEAR OCEAN |
| 20496 | -118.70 | 34.28 | 27 | 3536 | 646.0 | 1837 | 580 | 4.4964 | 238300 | <1H OCEAN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15174 | -117.07 | 33.03 | 14 | 6665 | 1231.0 | 2026 | 1001 | 5.0900 | 268500 | <1H OCEAN |
| 12661 | -121.42 | 38.51 | 15 | 7901 | 1422.0 | 4769 | 1418 | 2.8139 | 90400 | INLAND |
| 19263 | -122.72 | 38.44 | 48 | 707 | 166.0 | 458 | 172 | 3.1797 | 140400 | <1H OCEAN |
| 19140 | -122.70 | 38.31 | 14 | 3155 | 580.0 | 1208 | 501 | 4.1964 | 258100 | <1H OCEAN |


In [5]:
strat_test_set  # Leave this set untouched from here on; it is used only for final testing

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 5241 | -118.39 | 34.12 | 29 | 6447 | 1012.0 | 2184 | 960 | 8.2816 | 500001 | <1H OCEAN |
| 17352 | -120.42 | 34.89 | 24 | 2020 | 307.0 | 855 | 283 | 5.0099 | 162500 | <1H OCEAN |
| 3505 | -118.45 | 34.25 | 36 | 1453 | 270.0 | 808 | 275 | 4.3839 | 204600 | <1H OCEAN |
| 7777 | -118.10 | 33.91 | 35 | 1653 | 325.0 | 1072 | 301 | 3.2708 | 159700 | <1H OCEAN |
| 14155 | -117.07 | 32.77 | 38 | 3779 | 614.0 | 1495 | 614 | 4.3529 | 184000 | NEAR OCEAN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12182 | -117.29 | 33.72 | 19 | 2248 | 427.0 | 1207 | 368 | 2.8170 | 110000 | <1H OCEAN |
| 7275 | -118.24 | 33.99 | 33 | 885 | 294.0 | 1270 | 282 | 2.1615 | 118800 | <1H OCEAN |
| 17223 | -119.72 | 34.44 | 43 | 1781 | 342.0 | 663 | 358 | 4.7000 | 293800 | <1H OCEAN |
| 10786 | -117.91 | 33.63 | 30 | 2071 | 412.0 | 1081 | 412 | 4.9125 | 335700 | <1H OCEAN |


# Further Preprocessing & Handling Missing Data

Before feeding your data into a machine learning algorithm, you need to clean and
prepare it.

## 1. Prepare Data for Training

It’s best to write transformation functions instead of applying them manually. This ensures:
- Reproducibility on any dataset
- Reusability across projects
- Compatibility with live systems
- Easier experimentation
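
As a minimal sketch of that idea, the income-category step from earlier can be wrapped in a function so that the exact same transformation can be rerun on any DataFrame. The function name `add_income_cat` and the toy data are hypothetical, introduced here only for illustration.

```python
import numpy as np
import pandas as pd

def add_income_cat(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with an `income_cat` column derived from `median_income`."""
    result = frame.copy()  # leave the caller's DataFrame untouched
    result["income_cat"] = pd.cut(
        result["median_income"],
        bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
        labels=[1, 2, 3, 4, 5],
    )
    return result

# Toy input (made-up values) to show the function works on any dataset with this column
toy = pd.DataFrame({"median_income": [1.2, 3.8, 8.0]})
toy_with_cat = add_income_cat(toy)
print(toy_with_cat)
```

Because the function returns a new DataFrame instead of modifying its input, it can be applied to the training set, the test set, or fresh data without side effects.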
  
Start by creating a clean copy and separating the predictors and labels:

In [6]:
housing = df.drop("median_house_value", axis=1)   # Predictors: a copy of df without the median_house_value column
housing_labels = df["median_house_value"].copy()  # Labels: a copy of the median_house_value column

In [7]:
housing

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 12655 | -121.46 | 38.52 | 29 | 3873 | 797.0 | 2237 | 706 | 2.1736 | INLAND |
| 15502 | -117.23 | 33.09 | 7 | 5320 | 855.0 | 2015 | 768 | 6.3373 | NEAR OCEAN |
| 2908 | -119.04 | 35.37 | 44 | 1618 | 310.0 | 667 | 300 | 2.8750 | INLAND |
| 14053 | -117.13 | 32.75 | 24 | 1877 | 519.0 | 898 | 483 | 2.2264 | NEAR OCEAN |
| 20496 | -118.70 | 34.28 | 27 | 3536 | 646.0 | 1837 | 580 | 4.4964 | <1H OCEAN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15174 | -117.07 | 33.03 | 14 | 6665 | 1231.0 | 2026 | 1001 | 5.0900 | <1H OCEAN |
| 12661 | -121.42 | 38.51 | 15 | 7901 | 1422.0 | 4769 | 1418 | 2.8139 | INLAND |
| 19263 | -122.72 | 38.44 | 48 | 707 | 166.0 | 458 | 172 | 3.1797 | <1H OCEAN |
| 19140 | -122.70 | 38.31 | 14 | 3155 | 580.0 | 1208 | 501 | 4.1964 | <1H OCEAN |


In [8]:
housing_labels

12655     72100
15502    279600
2908      82700
14053    112500
20496    238300
          ...  
15174    268500
12661     90400
19263    140400
19140    258100
19773     62700
Name: median_house_value, Length: 16512, dtype: int64

## 2. Handling Missing Data

Some features, like `total_bedrooms`, contain missing values. You can:

1. Drop rows with missing values
2. Drop the entire column
3. <b>Impute missing values (recommended)</b>
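
The three options can be sketched with plain pandas on a toy frame. The values below are made up, and the variable names `option1`, `option2`, and `option3` are hypothetical labels for the three results.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry in total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

option1 = toy.dropna(subset=["total_bedrooms"])   # 1. drop rows with missing values
option2 = toy.drop("total_bedrooms", axis=1)      # 2. drop the entire column

median = toy["total_bedrooms"].median()           # 3. impute, here with the column median
option3 = toy.fillna({"total_bedrooms": median})

print(option3)
```

Note that with the pandas-only version of option 3 you have to store `median` yourself in order to reuse it on the test set or on new data.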

We’ll go with option 3, using SimpleImputer from Scikit-Learn, which applies the same
handling consistently across all datasets (train, test, new data):

In [9]:
from sklearn.impute import SimpleImputer        # Import class to handle missing values

imputer = SimpleImputer(strategy="median")      # Create imputer that replaces NaN with median

housing_num = housing.select_dtypes(
    include=[np.number]                         # Select only numeric columns, because ocean_proximity holds categorical data
)

imputer.fit(housing_num)                        # Learn median values from numeric data


SimpleImputer(strategy='median')


> <b>This computes the median for each numerical column and stores it in `imputer.statistics_`:</b>

In [10]:
imputer.statistics_  # Median of each column; these values will replace that column's null/NaN entries

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])

<b>Now apply the learned medians to transform the data:</b>

In [11]:
X = imputer.transform(housing_num)  # Fill each column's missing values with its learned median
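
By default, `transform` returns a plain NumPy array, so the column names and the row index are lost. A common follow-up is to wrap the result back into a DataFrame. The sketch below shows this on a toy frame; the names `toy_num`, `toy_imputer`, and `toy_tr` are hypothetical, and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame (made-up values) with a non-default index
toy_num = pd.DataFrame(
    {"total_rooms": [880.0, 7099.0, 1467.0],
     "total_bedrooms": [129.0, np.nan, 190.0]},
    index=[10, 20, 30],
)

toy_imputer = SimpleImputer(strategy="median")
X_toy = toy_imputer.fit_transform(toy_num)  # NumPy array, no labels

# Restore the original column names and index
toy_tr = pd.DataFrame(X_toy, columns=toy_num.columns, index=toy_num.index)
print(toy_tr)
```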

- Other available strategies:
   - "mean" – replaces with mean value
   - "most_frequent" – for the most common value (can handle categorical)
   - "constant" – fill with a fixed value using fill_value=...
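
A brief sketch of two of these strategies on toy data follows. The frames `toy_cat` and `toy_num` are made up; the categorical column is created with `dtype=object` as a precaution so that the missing entry is a plain `np.nan`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# "most_frequent" on a categorical column
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", np.nan, "INLAND", "NEAR OCEAN"]},
    dtype=object,
)
cat_imputer = SimpleImputer(strategy="most_frequent")
filled_cat = cat_imputer.fit_transform(toy_cat)
print(filled_cat[:, 0].tolist())  # ['INLAND', 'INLAND', 'INLAND', 'NEAR OCEAN']

# "constant" on a numeric column, filling with 0.0
toy_num = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0]})
const_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled_const = const_imputer.fit_transform(toy_num)
print(filled_const[:, 0].tolist())  # [129.0, 0.0, 190.0]
```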