# **Lab Experiment Name: Data Preprocessing Techniques**

### **OBJECTIVES:**
*   To transform raw data into a form that can be processed more easily and effectively in data mining, machine learning, and other data science tasks.

## **Step 1: Import Necessary Libraries**

*   pandas (pd): Used for handling and analyzing structured data in DataFrames.
*   numpy (np): Provides support for numerical computations.
*   sklearn.preprocessing: Includes methods for data preprocessing:
    *   LabelEncoder: Converts categorical labels into integer codes.
    *   OneHotEncoder: Converts categorical variables into one-hot encoded format.
    *   MinMaxScaler: Scales numerical data to a specific range (0 to 1 by default).
    *   StandardScaler: Standardizes data by removing the mean and scaling to unit variance.
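
As a quick illustration of what these four classes do, the sketch below applies each one to a tiny hand-made sample. The zoning codes and lot areas are invented for illustration and are not read from the lab's CSV file.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler, StandardScaler

# Toy data, invented for illustration only
zoning = np.array(["RL", "RM", "RL", "FV"])
lot_area = np.array([[8450.0], [9600.0], [11250.0]])

# LabelEncoder: one integer per class, with classes sorted alphabetically
label_codes = LabelEncoder().fit_transform(zoning)

# OneHotEncoder: one 0/1 column per class (expects 2-D input, returns a sparse matrix)
one_hot = OneHotEncoder().fit_transform(zoning.reshape(-1, 1)).toarray()

# MinMaxScaler: (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1
min_max = MinMaxScaler().fit_transform(lot_area)

# StandardScaler: (x - mean) / std, so the result has mean 0 and unit variance
standardized = StandardScaler().fit_transform(lot_area)

print(label_codes)
print(one_hot)
print(min_max.ravel())
print(standardized.ravel())
```

Label encoding yields a single integer column, while one-hot encoding yields one column per category, which avoids implying an order among the categories.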

## **Step 2: Load the CSV File and Display the First 5 Rows**









In [10]:
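# Core libraries for data handling and numerical work
import pandas as pd
import numpy as np
# Encoders and scalers used in the later preprocessing steps
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler, StandardScaler

# Load the raw dataset and preview the first five rows
df = pd.read_csv('/content/house_price_predic.csv')
print("\nFirst 5 rows:\n", df.head())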





First 5 rows:
    Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2

### **Check for Missing Values**

In [12]:
# Count missing values per column before handling them
print("\nMissing Values before handling:\n", df.isnull().sum())




Missing Values before handling:
 Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64
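
Because the printed summary is truncated for wide DataFrames, it can help to filter it down to the columns that actually contain missing values. The following is a minimal sketch on a small made-up frame; the column names mimic the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; not the lab's CSV
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
})

missing = toy.isnull().sum()
# Keep only columns with at least one missing value, largest count first
missing_only = missing[missing > 0].sort_values(ascending=False)
print(missing_only)
```

The same two lines can be applied to `df` to see every affected column at once.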


### **Step 1: Fill Missing Values for Numerical Columns**

*   df.select_dtypes(include=[np.number]): Selects only numerical columns from the DataFrame.
*   .columns: Retrieves the column names.
*   df[col].fillna(df[col].mean()): Replaces missing values (NaN) in each numerical column with the column's mean.
*   This ensures that numerical columns do not have missing values, preventing errors in mathematical computations.
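
A small worked example may make the mean fill concrete. The column below is invented for illustration; only the `fillna`/`mean` pattern comes from the lab.

```python
import numpy as np
import pandas as pd

# Toy column, invented for illustration
lot_frontage = pd.Series([60.0, np.nan, 80.0, 70.0], name="LotFrontage")

# mean() skips NaN by default: (60 + 80 + 70) / 3 = 70
col_mean = lot_frontage.mean()
filled = lot_frontage.fillna(col_mean)
print(filled.tolist())  # [60.0, 70.0, 80.0, 70.0]
```

The mean is computed from the observed values only, so every missing entry in a column receives the same value.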

### **Step 2: Fill Missing Values for Categorical Columns**


*   df.select_dtypes(include=["object"]): Selects categorical (string/text) columns.
*   df[col].mode()[0]: Returns the most frequent value (the mode) in the column; fillna() then replaces missing values with it.
*   This keeps every categorical column filled with a value that already occurs in it, so no artificial category or number is introduced.
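
The mode fill can be sketched the same way on a made-up categorical column (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy column, invented for illustration
garage_type = pd.Series(["Attchd", np.nan, "Detchd", "Attchd", np.nan], name="GarageType")

# mode() returns a Series because there can be ties; [0] takes the first entry
most_frequent = garage_type.mode()[0]
filled = garage_type.fillna(most_frequent)
print(filled.tolist())  # ['Attchd', 'Attchd', 'Detchd', 'Attchd', 'Attchd']
```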









In [13]:

# Fill missing values with mean (for numerical columns)
for col in df.select_dtypes(include=[np.number]).columns:
    df[col] = df[col].fillna(df[col].mean())

# Fill missing values with mode (for categorical columns)
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print("\nMissing Values after handling:\n", df.isnull().sum())


Missing Values after handling:
 Id               0
MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 81, dtype: int64


## **Handling Categorical Data Using One-Hot Encoding & Label Encoding**

### **Step 1: Initialize a Dictionary to Store Label Encoders**
This dictionary stores the fitted LabelEncoder object for each categorical column, keyed by column name.
Keeping them makes it possible to apply the same mapping to new data later, or to reverse the encoding with inverse_transform().
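
To illustrate why keeping the fitted encoders is useful, the sketch below builds such a dictionary on a small made-up frame and then reverses the encoding for one column. The frame and its values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame, invented for illustration
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR1"],
})

label_encoders = {}
for col in ["Street", "LotShape"]:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    label_encoders[col] = le  # keep the fitted encoder for later reuse

# Classes are sorted, so Grvl -> 0 and Pave -> 1
print(toy["Street"].tolist())  # [1, 0, 1]

# Recover the original strings from the integer codes
restored = label_encoders["Street"].inverse_transform(toy["Street"])
print(restored.tolist())  # ['Pave', 'Grvl', 'Pave']
```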

### **Step 2: Apply Label Encoding to Categorical Columns**

In [14]:
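# Label-encode every categorical column
# (only LabelEncoder is applied here; OneHotEncoder is imported but not used)
label_encoders = {}
for col in df.select_dtypes(include=["object"]).columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])  # replace strings with integer codes
    label_encoders[col] = le             # keep the fitted encoder for later reuse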

print("\nCategorical Data after Label Encoding:\n", df.head())



Categorical Data after Label Encoding:
    Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  \
0   1          60         3         65.0     8450       1      0         3   
1   2          20         3         80.0     9600       1      0         3   
2   3          60         3         68.0    11250       1      0         0   
3   4          70         3         60.0     9550       1      0         0   
4   5          60         3         84.0    14260       1      0         0   

   LandContour  Utilities  ...  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  \
0            3          0  ...         0       2      2            2        0   
1            3          0  ...         0       2      2            2        0   
2            3          0  ...         0       2      2            2        0   
3            3          0  ...         0       2      2            2        0   
4            3          0  ...         0       2      2            2        0   

   

In [15]:
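# Normalization using Min-Max Scaling
scaler = MinMaxScaler()
# After label encoding every column is numeric, so all of them
# (including Id and SalePrice) are rescaled to the range [0, 1]
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = scaler.fit_transform(df[num_cols])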

print("\nDataset after Min-Max Normalization:\n", df.head())



Dataset after Min-Max Normalization:
          Id  MSSubClass  MSZoning  LotFrontage   LotArea  Street  Alley  \
0  0.000000    0.235294      0.75     0.150685  0.033420     1.0    0.0   
1  0.000685    0.000000      0.75     0.202055  0.038795     1.0    0.0   
2  0.001371    0.235294      0.75     0.160959  0.046507     1.0    0.0   
3  0.002056    0.294118      0.75     0.133562  0.038561     1.0    0.0   
4  0.002742    0.235294      0.75     0.215753  0.060576     1.0    0.0   

   LotShape  LandContour  Utilities  ...  PoolArea  PoolQC     Fence  \
0       1.0          1.0        0.0  ...       0.0     1.0  0.666667   
1       1.0          1.0        0.0  ...       0.0     1.0  0.666667   
2       0.0          1.0        0.0  ...       0.0     1.0  0.666667   
3       0.0          1.0        0.0  ...       0.0     1.0  0.666667   
4       0.0          1.0        0.0  ...       0.0     1.0  0.666667   

   MiscFeature  MiscVal    MoSold  YrSold  SaleType  SaleCondition  SalePrice
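
Min-Max scaling maps each column value x to (x - min) / (max - min), using that column's own minimum and maximum. The sketch below checks this formula against MinMaxScaler on made-up values. It also shows StandardScaler, which is imported earlier in this lab but not applied, as an alternative that centers on the mean instead. The input values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column, invented for illustration
x = np.array([[10.0], [20.0], [40.0]])

# Manual Min-Max formula, applied column-wise
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
scaled = MinMaxScaler().fit_transform(x)
print(np.allclose(manual, scaled))  # True

# StandardScaler uses (x - mean) / std with the population standard deviation
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_scaled = StandardScaler().fit_transform(x)
print(np.allclose(z_manual, z_scaled))  # True
```

Min-Max scaling keeps values in a fixed range but is sensitive to extreme values, since a single outlier sets the minimum or maximum for the whole column.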

In [16]:
# Save processed dataset
df.to_csv("processed_train.csv", index=False)

print("\nData Preprocessing Completed and Saved Successfully!")


Data Preprocessing Completed and Saved Successfully!
