## 1. Creating Income Categories

In [1]:
import pandas as pd                     # Import Pandas for data handling
import numpy as np                      # Import NumPy for numerical operations

# Load the dataset
data = pd.read_csv("housing.csv")       # Read CSV file and store it as a DataFrame

# Create income categories
data["income_cat"] = pd.cut(            # Create a new column by binning median_income
    data["median_income"],              # Column to be divided into categories
    bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],  # Bin edges; np.inf leaves the last bin open-ended
    labels=[1, 2, 3, 4, 5]                 # Category label assigned to each bin
)
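
To see how `pd.cut` assigns values to bins, here is a minimal sketch on a few made-up income values (`toy_income` is illustrative and not taken from housing.csv). By default the bins are closed on the right, so a value of exactly 3.0 lands in category 2 rather than 3.

```python
import numpy as np
import pandas as pd

# Hypothetical median_income values, chosen to land in different bins
toy_income = pd.Series([0.5, 2.0, 3.0, 5.0, 15.0])

toy_cat = pd.cut(
    toy_income,
    bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],  # Same bin edges as in the cell above
    labels=[1, 2, 3, 4, 5],
)

print(toy_cat.tolist())  # [1, 2, 2, 4, 5]
```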


## 2. Stratified Shuffle Split in Scikit-Learn

##### Scikit-learn provides a built-in way to perform stratified sampling using `StratifiedShuffleSplit`.

###### Here’s how you can use it:

In [2]:
from sklearn.model_selection import StratifiedShuffleSplit   # Import class for stratified splitting

# Assume income_cat is already created from median_income
split = StratifiedShuffleSplit(
    n_splits=1,          # Number of train-test splits to generate
    test_size=0.2,       # 20% of data will be used as test set
    random_state=42      # Fix randomness for reproducibility
)

# split() yields integer positions, so select rows by position with .iloc
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.iloc[train_index]   # Select training data using stratified indices
    strat_test_set = data.iloc[test_index]     # Select test data using stratified indices
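
One way to check that the split is really stratified is to compare category proportions in the full data and in the test set. The sketch below does this on a synthetic DataFrame (`toy`, with a made-up `cat` column), not on the housing data. It selects rows with `.iloc` because `split()` yields integer positions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data: 1000 rows in three strata (60% / 30% / 10%)
toy = pd.DataFrame({
    "value": np.arange(1000),
    "cat": [1] * 600 + [2] * 300 + [3] * 100,
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["cat"]))

toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# Share of each category overall and in the test set
overall = toy["cat"].value_counts(normalize=True).sort_index()
in_test = toy_test["cat"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "test": in_test})
print(comparison)
```

The two columns should be nearly identical. A purely random split gives no such guarantee, especially for small categories.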


## 3. Removing the Income Category Column

In [3]:
# Remove the income category column from both sets; it was only needed for the split
for sett in (strat_train_set, strat_test_set):
    sett.drop("income_cat", axis=1, inplace=True)

In [4]:
# Make a copy of the training set and store it in df; from here on, all work is done on df, which is our training data
df = strat_train_set.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16512 entries, 12655 to 19773
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           16512 non-null  float64
 1   latitude            16512 non-null  float64
 2   housing_median_age  16512 non-null  int64  
 3   total_rooms         16512 non-null  int64  
 4   total_bedrooms      16354 non-null  float64
 5   population          16512 non-null  int64  
 6   households          16512 non-null  int64  
 7   median_income       16512 non-null  float64
 8   median_house_value  16512 non-null  int64  
 9   ocean_proximity     16512 non-null  object 
dtypes: float64(4), int64(5), object(1)
memory usage: 1.4+ MB


In [5]:
df = df.drop(["ocean_proximity"], axis=1)   # Drop the only non-numeric column so that df holds numerical attributes only

In [6]:
df

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 12655 | -121.46 | 38.52 | 29 | 3873 | 797.0 | 2237 | 706 | 2.1736 | 72100 |
| 15502 | -117.23 | 33.09 | 7 | 5320 | 855.0 | 2015 | 768 | 6.3373 | 279600 |
| 2908 | -119.04 | 35.37 | 44 | 1618 | 310.0 | 667 | 300 | 2.8750 | 82700 |
| 14053 | -117.13 | 32.75 | 24 | 1877 | 519.0 | 898 | 483 | 2.2264 | 112500 |
| 20496 | -118.70 | 34.28 | 27 | 3536 | 646.0 | 1837 | 580 | 4.4964 | 238300 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15174 | -117.07 | 33.03 | 14 | 6665 | 1231.0 | 2026 | 1001 | 5.0900 | 268500 |
| 12661 | -121.42 | 38.51 | 15 | 7901 | 1422.0 | 4769 | 1418 | 2.8139 | 90400 |
| 19263 | -122.72 | 38.44 | 48 | 707 | 166.0 | 458 | 172 | 3.1797 | 140400 |
| 19140 | -122.70 | 38.31 | 14 | 3155 | 580.0 | 1208 | 501 | 4.1964 | 258100 |


# Transformation Pipelines

> By building a pipeline, the steps we previously did by hand (handling missing values with the imputer, then scaling the features) can easily be done in one go.

- As datasets grow more complex, data preprocessing often involves multiple steps,
such as imputing missing values, scaling features, and encoding categorical variables.
These steps must be applied in the correct order and consistently across
training, validation, test, and future production data.
- To streamline this process, Scikit-Learn provides the `Pipeline` class, a
utility for chaining data transformations.

# 1. Building a Numerical Pipeline

<b>A typical pipeline for numerical attributes might include:</b>

1. <b>Imputation</b> of missing values (e.g., with the median).
2. <b>Feature scaling</b> (e.g., with standardization).

In [7]:
from sklearn.pipeline import Pipeline              # Import Pipeline to chain preprocessing steps
from sklearn.impute import SimpleImputer            # Import imputer to handle missing values
from sklearn.preprocessing import StandardScaler    # Import scaler to standardize numeric data

num_pipeline = Pipeline([                           # Create a pipeline for numerical features
    ("impute", SimpleImputer(strategy="median")),   # Step 1: Replace missing values with median
    ("standardize", StandardScaler()),              # Step 2: Scale features to mean=0, std=1
])


<b>How it works:</b>
- The pipeline takes a list of steps as (name, transformer) pairs.
- Names must be unique and must not contain double underscores (`__`), because Scikit-Learn uses `__` to separate a step name from a parameter name.
- All intermediate steps must be transformers, i.e., they must implement `fit()` and `transform()` (or `fit_transform()`).
- The final step can be either a transformer or a predictor.
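
The naming rules matter because step names are how you reach into a pipeline. As a minimal sketch (the pipeline `demo_pipeline` is rebuilt here so the snippet runs on its own), a step can be looked up by name, and one of its parameters can be changed with the `<step name>__<parameter>` syntax:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

# Look up a step by the name it was given
imputer = demo_pipeline.named_steps["impute"]
print(imputer.strategy)  # median

# "<step name>__<parameter>" addresses a parameter of one specific step
demo_pipeline.set_params(impute__strategy="mean")
print(demo_pipeline.named_steps["impute"].strategy)  # mean
```

A step name containing `__` would make this syntax ambiguous.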

<b>OR</b>

#### Using make_pipeline
- If you don’t want to name the steps manually, you can use `make_pipeline()` instead.
- It names the steps automatically, using the class names in lowercase.
- If the same class appears multiple times, a number is appended (e.g., `standardscaler-1`).
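
As a minimal sketch, the numerical pipeline from this section could be written with `make_pipeline()` like this (the variable name `auto_pipeline` is only illustrative):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as the named pipeline, but without explicit names
auto_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Inspect the names that were generated automatically
step_names = [name for name, _ in auto_pipeline.steps]
print(step_names)  # ['simpleimputer', 'standardscaler']
```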

# 2. Applying the Pipeline
- Call `fit_transform()` to apply all transformations in sequence:

In [8]:
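mypipeline = num_pipeline.fit_transform(df)   # Fit every step on df and return the transformed data as a NumPy array

In [9]:
# Wrap the transformed array in a DataFrame, restoring the original column names and index
mypipeline = pd.DataFrame(mypipeline, columns=df.columns, index=df.index)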

In [10]:
mypipeline

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 12655 | -0.941350 | 1.347438 | 0.027564 | 0.584777 | 0.640371 | 0.732602 | 0.556286 | -0.893647 | -1.166015 |
| 15502 | 1.171782 | -1.192440 | -1.722018 | 1.261467 | 0.781561 | 0.533612 | 0.721318 | 1.292168 | 0.627451 |
| 2908 | 0.267581 | -0.125972 | 1.220460 | -0.469773 | -0.545138 | -0.674675 | -0.524407 | -0.525434 | -1.074397 |
| 14053 | 1.221738 | -1.351474 | -0.370069 | -0.348652 | -0.036367 | -0.467617 | -0.037297 | -0.865929 | -0.816829 |
| 20496 | 0.437431 | -0.635818 | -0.131489 | 0.427179 | 0.272790 | 0.374060 | 0.220898 | 0.325752 | 0.270486 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15174 | 1.251711 | -1.220505 | -1.165333 | 1.890456 | 1.696862 | 0.543471 | 1.341519 | 0.637374 | 0.531511 |
| 12661 | -0.921368 | 1.342761 | -1.085806 | 2.468471 | 2.161816 | 3.002174 | 2.451492 | -0.557509 | -1.007844 |
| 19263 | -1.570794 | 1.310018 | 1.538566 | -0.895802 | -0.895679 | -0.862013 | -0.865118 | -0.365475 | -0.575684 |
| 19140 | -1.560803 | 1.249211 | -1.165333 | 0.249005 | 0.112126 | -0.189747 | 0.010616 | 0.168261 | 0.441622 |
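
A fitted pipeline should be reused on new data with `transform()` only, so that the median, mean, and standard deviation learned from the training set are applied unchanged. The sketch below shows this on a tiny made-up column (`toy_train` and `toy_new` are hypothetical and not part of the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

toy_train = pd.DataFrame({"rooms": [1.0, 2.0, np.nan, 4.0]})
toy_new = pd.DataFrame({"rooms": [np.nan, 3.0]})

train_scaled = toy_pipeline.fit_transform(toy_train)  # Learns the median, mean, and std from toy_train
new_scaled = toy_pipeline.transform(toy_new)          # Reuses them; nothing is re-fitted

# The median learned from the training column (the NaN is ignored)
learned_median = toy_pipeline.named_steps["impute"].statistics_[0]
print(learned_median)  # 2.0
print(new_scaled.ravel())
```

The missing value in `toy_new` is filled with the training median (2.0), not with a statistic computed from `toy_new` itself.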


## 3. Pipeline as a Transformer or Predictor
- If the last step is a transformer, the pipeline behaves like a transformer
and exposes `fit_transform()` and `transform()`.
- If the last step is a predictor (e.g., a model), the pipeline behaves like an
estimator and exposes `fit()` and `predict()`.
- This flexibility makes `Pipeline` the standard way to combine data preprocessing
and modeling in Scikit-Learn projects.
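
As a hedged illustration of the predictor case, the sketch below appends a model to the preprocessing steps. `LinearRegression` and the synthetic data are assumptions made for this example; the notebook itself does not specify a model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data following y = 3 * x + 1, with one missing feature value
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y = np.array([4.0, 7.0, 10.0, 13.0, 16.0])

model_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # Fills the NaN with the median of x
    ("standardize", StandardScaler()),
    ("regress", LinearRegression()),                # Final step is a predictor
])

model_pipeline.fit(X, y)                            # Fits all three steps in order
pred = model_pipeline.predict(np.array([[6.0]]))    # Preprocesses, then predicts
print(pred)
```

Because the last step is a predictor, the pipeline exposes `predict()`; calling it runs the new input through the fitted imputer and scaler before the model sees it.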