<a href="https://colab.research.google.com/github/SujeetSaxena/AI-ML/blob/main/Preprocessing_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Step 2: Create a sample dataset
data = {
    'Age': [25, 30, 35, np.nan, 40, 45, np.nan, 50],
    'Salary': [50000, 60000, np.nan, 80000, 90000, np.nan, 120000, 150000],
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
}

df = pd.DataFrame(data)

# Display original dataset
print("Original Dataset:")
print(df)

# Step 3: Handle Missing Values
df['Age'].fillna(df['Age'].mean(), inplace=True)  # Replace missing Age with mean
df['Salary'].fillna(df['Salary'].median(), inplace=True)  # Replace missing Salary with median

print("\nDataset after handling missing values:")
print(df)

# Step 4: Encode Categorical Data (One-Hot Encoding)
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid dummy variable trap
encoded_categories = encoder.fit_transform(df[['Category']])

# Convert encoded data into a DataFrame
category_df = pd.DataFrame(encoded_categories, columns=['Category_B'])

# Concatenate the encoded category back to the original dataframe (dropping original Category column)
df = pd.concat([df.drop(columns=['Category']), category_df], axis=1)

print("\nDataset after encoding categorical data:")
print(df)

# Step 5: Feature Scaling (Standardization)
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

print("\nDataset after feature scaling:")
print(df)

# Step 6: Train-Test Split (80% Training, 20% Testing)
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

print("\nTraining Data:")
print(train_data)
print("\nTesting Data:")
print(test_data)


Original Dataset:
    Age    Salary Category
0  25.0   50000.0        A
1  30.0   60000.0        B
2  35.0       NaN        A
3   NaN   80000.0        B
4  40.0   90000.0        A
5  45.0       NaN        B
6   NaN  120000.0        A
7  50.0  150000.0        B

Dataset after handling missing values:
    Age    Salary Category
0  25.0   50000.0        A
1  30.0   60000.0        B
2  35.0   85000.0        A
3  37.5   80000.0        B
4  40.0   90000.0        A
5  45.0   85000.0        B
6  37.5  120000.0        A
7  50.0  150000.0        B

Dataset after encoding categorical data:
    Age    Salary  Category_B
0  25.0   50000.0         0.0
1  30.0   60000.0         1.0
2  35.0   85000.0         0.0
3  37.5   80000.0         1.0
4  40.0   90000.0         0.0
5  45.0   85000.0         1.0
6  37.5  120000.0         0.0
7  50.0  150000.0         1.0

Dataset after feature scaling:
        Age    Salary  Category_B
0 -1.690309 -1.337987         0.0
1 -1.014185 -1.003490         1.0
2 -0.33806

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Age'].fillna(df['Age'].mean(), inplace=True)  # Replace missing Age with mean
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Salary'].fillna(df['Salary'].median(), inplace=True)  # Replace missing Salary with median
