# What Is Data Scaling in ML, and Why Use It?


Data scaling transforms features onto a comparable scale so that features with large numeric ranges do not dominate learning. It matters most for distance-based algorithms such as KNN and SVM and for models trained with gradient descent. Without scaling, these models may converge slowly or produce less accurate results.
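To make the "dominating feature" idea concrete, here is a minimal sketch with a few made-up cars (the values are hypothetical, not taken from the dataset used later). It compares Euclidean distances, the quantity KNN relies on, before and after rescaling each column to [0, 1].

```python
import numpy as np

# Hypothetical cars: columns are [year, km_driven]
cars = np.array([
    [2007.0,  70_000.0],   # query car
    [2019.0,  70_400.0],   # A: 12 years newer, 400 km more
    [2007.0,  71_000.0],   # B: same year, 1,000 km more
    [2013.0, 150_000.0],   # C: sets the upper end of the km range
])

def distances_to_query(X):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(X[1:] - X[0], axis=1)

raw_dist = distances_to_query(cars)

# Min-max scale each column to [0, 1]
col_min = cars.min(axis=0)
col_max = cars.max(axis=0)
scaled = (cars - col_min) / (col_max - col_min)
scaled_dist = distances_to_query(scaled)

labels = ["A", "B", "C"]
print("Nearest on raw features:   ", labels[raw_dist.argmin()])
print("Nearest on scaled features:", labels[scaled_dist.argmin()])
```

On the raw values, car A looks nearest because a 400 km gap is numerically smaller than a 1,000 km gap and the 12-year difference barely registers. After scaling, car B becomes the nearest neighbor.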

![image-5.png](attachment:image-5.png)







## Normalization in Scikit-learn (Min-Max Scaling)
Min-max scaling rescales each feature so that its values lie in the range [0, 1].

![image-7.png](attachment:image-7.png)



![image-8.png](attachment:image-8.png)


![image-9.png](attachment:image-9.png)
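Min-max scaling maps each value with $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, computed per column. The sketch below applies that formula by hand to a small illustrative array (the numbers are assumptions chosen for easy arithmetic) and checks that `MinMaxScaler` gives the same result.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values only: columns are [year, km_driven]
X = np.array([
    [2000.0,  10_000.0],
    [2010.0,  50_000.0],
    [2020.0, 110_000.0],
])

# Manual min-max scaling, column by column
x_min = X.min(axis=0)
x_max = X.max(axis=0)
manual = (X - x_min) / (x_max - x_min)

# The same transformation with scikit-learn
sk_scaled = MinMaxScaler().fit_transform(X)

print(manual)
print("Matches MinMaxScaler:", np.allclose(manual, sk_scaled))
```

The smallest value in each column maps to 0, the largest to 1, and everything else falls proportionally in between (for example, 2010 maps to 0.5 and 50,000 km maps to 0.4).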


# Load Dataset

In [56]:
import pandas as pd

# Load the dataset and drop the 'name' column, which is not used as a feature
df = pd.read_csv("Car Price Prediction.csv")
df.drop(columns=['name'], inplace=True)
df

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,2014,450000,141000,Diesel,Individual,Manual,Second Owner
...,...,...,...,...,...,...,...
4335,2014,409999,80000,Diesel,Individual,Manual,Second Owner
4336,2014,409999,80000,Diesel,Individual,Manual,Second Owner
4337,2009,110000,83000,Petrol,Individual,Manual,Second Owner
4338,2016,865000,90000,Diesel,Individual,Manual,First Owner


# Data Preprocessing Step

In [57]:
df.isnull().sum()

year             0
selling_price    0
km_driven        0
fuel             0
seller_type      0
transmission     0
owner            0
dtype: int64

# Encoding Categorical Columns

In [58]:
from sklearn.preprocessing import OrdinalEncoder

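# Select categorical columns to encode
categorical_cols = ['fuel', 'seller_type', 'transmission', 'owner']

# Create OrdinalEncoder instance
encoder = OrdinalEncoder()

# Apply encoding: each category is replaced by an integer code assigned in sorted order.
# Note that this imposes a numeric order on categories that may have no natural ranking.
df[categorical_cols] = encoder.fit_transform(df[categorical_cols]).astype(int)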

df

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,2007,60000,70000,4,1,1,0
1,2007,135000,50000,4,1,1,0
2,2012,600000,100000,1,1,1,0
3,2017,250000,46000,4,1,1,0
4,2014,450000,141000,1,1,1,2
...,...,...,...,...,...,...,...
4335,2014,409999,80000,1,1,1,2
4336,2014,409999,80000,1,1,1,2
4337,2009,110000,83000,4,1,1,2
4338,2016,865000,90000,1,1,1,0
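To see which integer stands for which category, the fitted encoder exposes a `categories_` attribute. The sketch below uses a small stand-in DataFrame (the rows are assumptions for illustration, not the real dataset) and builds a readable mapping from it.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in data with a single categorical column
sample = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG", "Petrol"]})

enc = OrdinalEncoder()
codes = enc.fit_transform(sample[["fuel"]]).astype(int).ravel()

# categories_ holds one array per encoded column, in code order
mapping = {category: code for code, category in enumerate(enc.categories_[0])}

print(mapping)  # {'CNG': 0, 'Diesel': 1, 'Petrol': 2}
print(codes)
```

The codes follow sorted category order, so they carry no meaning beyond identity. For nominal columns fed to a linear model, one-hot encoding is a common alternative.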


# Scaling: Normalization (MinMaxScaler Class)

In [59]:
from sklearn.preprocessing import MinMaxScaler


# Select numerical columns to normalize
numerical_cols = ['year', 'km_driven']

# Initialize scaler
scaler = MinMaxScaler()

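# Fit-transform on the full data (numerical columns only).
# Caveat: fitting before the train/test split lets test-set min/max values influence the scaling.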
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,0.535714,60000,0.086783,4,1,1,0
1,0.535714,135000,0.061988,4,1,1,0
2,0.714286,600000,0.123976,1,1,1,0
3,0.892857,250000,0.057028,4,1,1,0
4,0.785714,450000,0.174807,1,1,1,2
...,...,...,...,...,...,...,...
4335,0.785714,409999,0.099181,1,1,1,2
4336,0.785714,409999,0.099181,1,1,1,2
4337,0.607143,110000,0.102900,4,1,1,2
4338,0.857143,865000,0.111579,1,1,1,0


# Train Test Split

In [60]:
from sklearn.model_selection import train_test_split

X = df.drop('selling_price', axis=1)  # Features
y = df['selling_price']              # Target

# Train-Test Split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check shape
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (3472, 6)
Test shape: (868, 6)
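In this notebook the scaler is fitted on the full dataset before the split, so the test rows contribute to the learned minimum and maximum. A common alternative is to split first, fit the scaler on the training rows only, and reuse it on the test rows. The sketch below shows that ordering on a small made-up DataFrame (column names mirror the notebook; the values are assumptions).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-in for the car dataset
toy = pd.DataFrame({
    "year": [2007.0, 2012.0, 2017.0, 2014.0, 2009.0,
             2016.0, 2010.0, 2019.0, 2005.0, 2013.0],
    "km_driven": [70_000.0, 100_000.0, 46_000.0, 141_000.0, 83_000.0,
                  90_000.0, 120_000.0, 15_000.0, 160_000.0, 60_000.0],
    "selling_price": [60_000, 600_000, 250_000, 450_000, 110_000,
                      865_000, 200_000, 900_000, 50_000, 400_000],
})

X = toy.drop(columns="selling_price")
y = toy["selling_price"]

# Split first, so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

numerical_cols = ["year", "km_driven"]
scaler = MinMaxScaler()

# Learn min/max from the training rows only, then reuse them on the test rows
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

print(X_train[numerical_cols].agg(["min", "max"]))
print(X_test[numerical_cols].agg(["min", "max"]))
```

With this ordering, scaled test values can fall slightly outside [0, 1] when a test row exceeds the training range. That is expected and reflects what the model would see on new data.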


# Linear Regression

In [61]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Train Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# R² Score
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.2f}")


R² Score: 0.39


# Random Forest Regression

In [62]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Initialize the model
model = RandomForestRegressor(random_state=42)

# Fit the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# R² Score
r2 = r2_score(y_test, y_pred)
print(f"R² Score (Random Forest Regressor): {r2:.2f}")


R² Score (Random Forest Regressor): 0.50
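A note on reading these scores: ordinary least squares with an intercept gives the same predictions whether or not the features are min-max scaled, and tree-based models split on thresholds, so they are generally insensitive to this kind of rescaling as well. The scores above therefore say little about the benefit of scaling itself. The sketch below checks the linear-regression case on synthetic data (the data-generating formula is an assumption made up for this example).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: columns are [year, km_driven]; the price formula is invented
rng = np.random.default_rng(0)
year = rng.integers(1995, 2021, size=200)
km = rng.uniform(1_000, 300_000, size=200)
X = np.column_stack([year, km]).astype(float)
y = 30_000 * (X[:, 0] - 1995) - 1.5 * X[:, 1] + rng.normal(0, 50_000, size=200)

# Simple positional split: first 160 rows train, last 40 rows test
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Model on raw features
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Model on min-max-scaled features (scaler fitted on the training rows)
scaler = MinMaxScaler().fit(X_train)
scaled_pred = (
    LinearRegression()
    .fit(scaler.transform(X_train), y_train)
    .predict(scaler.transform(X_test))
)

r2_raw = r2_score(y_test, raw_pred)
r2_scaled = r2_score(y_test, scaled_pred)
print(f"R² raw:    {r2_raw:.6f}")
print(f"R² scaled: {r2_scaled:.6f}")
```

The two R² values agree up to floating-point error. To observe a real effect of scaling, a distance-based model such as `KNeighborsRegressor` would be a more telling comparison.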
