<h1 align  = center> <font color = '#FF5500'>Data Preprocessing Using scikit-learn</font></h1>

#### What is preprocessing?

Preprocessing is the initial stage of data analysis where we clean, transform, and prepare the data for machine learning algorithms. It involves handling missing values, scaling numerical data, encoding categorical data, and splitting the data into training and testing sets.

#### Why is preprocessing important?

- Improves the accuracy of machine learning models by making the data more consistent and reliable.
- Reduces the risk of overfitting to noise, outliers, and inconsistencies in the data.
- Puts the data into the complete, numerical form that most scikit-learn estimators require.

#### Common preprocessing techniques in scikit-learn

- Handling missing values: Imputation techniques such as mean, median, or mode can be used to fill in missing values.
- Scaling numerical data: Standardization rescales each feature to zero mean and unit variance, while normalization (min-max scaling) maps each feature to a fixed range, such as between 0 and 1.
- Encoding categorical data: One-hot encoding or label encoding can be used to convert categorical data into numerical format.
- Splitting the data into training and testing sets: A random or stratified split creates separate datasets for training and for testing the machine learning model.
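
To make the difference between standardization and normalization concrete, here is a minimal sketch. The mileage values are made up for illustration and are not taken from the dataset used later.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical mileage values, made up for illustration.
mileage = np.array([[0.0], [120.0], [240.0], [360.0]])

# Standardization: subtract the column mean, divide by the column standard deviation.
standardized = StandardScaler().fit_transform(mileage)

# Normalization (min-max): map the column minimum to 0 and the maximum to 1.
normalized = MinMaxScaler().fit_transform(mileage)

print(standardized.ravel())
print(normalized.ravel())
```

Standardized values are centered on zero and are not bounded, whereas min-max values always fall inside the chosen range.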



<h2 align = center><font color = '#5AAB30'>Importing Necessary Libraries</font></h2>

In [164]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.linear_model import LinearRegression



<h2 align = center><font color = '#AA0022'>Importing Dataset</font></h2>

In [165]:
data = pd.read_csv('car.csv')

<h2 align = center> <font color = '#76FDBF'>Data Preprocessing</font></h2>

<h2> <font color = '#3311FF'>Handling Null Values </font></h2>

### Top 10 Rows Of Dataset

In [166]:
data.head(10)

| | Brand | Price | Body | Mileage | EngineV | Engine Type | Registration | Year | Model |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 4200.0 | sedan | 277 | 2.0 | Petrol | yes | 1991 | 320 |
| 1 | Mercedes-Benz | 7900.0 | van | 427 | 2.9 | Diesel | yes | 1999 | Sprinter 212 |
| 2 | Mercedes-Benz | 13300.0 | sedan | 358 | 5.0 | Gas | yes | 2003 | S 500 |
| 3 | Audi | 23000.0 | crossover | 240 | 4.2 | Petrol | yes | 2007 | Q7 |
| 4 | Toyota | 18300.0 | crossover | 120 | 2.0 | Petrol | yes | 2011 | Rav 4 |
| 5 | Mercedes-Benz | 199999.0 | crossover | 0 | 5.5 | Petrol | yes | 2016 | GLS 63 |
| 6 | BMW | 6100.0 | sedan | 438 | 2.0 | Gas | yes | 1997 | 320 |
| 7 | Audi | 14200.0 | vagon | 200 | 2.7 | Diesel | yes | 2006 | A6 |
| 8 | Renault | 10799.0 | vagon | 193 | 1.5 | Diesel | yes | 2012 | Megane |
| 9 | Volkswagen | 1400.0 | other | 212 | 1.8 | Gas | no | 1999 | Golf IV |


### Dropping Model Column

In [167]:


data.drop('Model', axis=1, inplace=True)

#### Checking Null Values

In [168]:
data.isnull().sum()

Brand             0
Price           172
Body              0
Mileage           0
EngineV         150
Engine Type       0
Registration      0
Year              0
dtype: int64

### Filling null values in 'Price' column with mean of column 'Price'

In [169]:
data.fillna({'Price':data['Price'].mean()},inplace=True)


### Filling null values in 'EngineV' column with mean of column 'EngineV'

In [170]:
data.fillna({'EngineV' : data['EngineV'].mean()} , inplace = True)
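
The same mean imputation can also be done with scikit-learn's `SimpleImputer`, which stores the fitted means so they can be reused on new data. The sketch below is an illustration on a made-up miniature table (the `toy` frame and its values are assumptions, not rows from `car.csv`).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature of the car data; values are made up.
toy = pd.DataFrame({
    "Price": [4200.0, np.nan, 13300.0, 23000.0],
    "EngineV": [2.0, 2.9, np.nan, 4.2],
})

# strategy="mean" replaces each NaN with the mean of its own column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)    # per-column means learned during fit
print(filled.isnull().sum())  # no missing values remain
```

Switching to `strategy="median"` is one option when a column contains extreme values, such as the 199999.0 listing visible in the first ten rows, because the median is less sensitive to outliers than the mean.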

### Checking Dataset columns datatypes

In [171]:
data.dtypes

Brand            object
Price           float64
Body             object
Mileage           int64
EngineV         float64
Engine Type      object
Registration     object
Year              int64
dtype: object

#### As we can see, several columns are categorical (dtype `object`), so we need to encode them numerically before we can use them to train a model.

<h2> <font color = '#7ABB3C'>Encoding Categorical Data </font></h2>

### Encoding 'Brand' Column 

Since Brand is nominal data (its categories have no natural order), we use one-hot encoding.

In [172]:

data = pd.get_dummies(data=data, columns=['Brand'])
data


| | Price | Body | Mileage | EngineV | Engine Type | Registration | Year | Brand_Audi | Brand_BMW | Brand_Mercedes-Benz | Brand_Mitsubishi | Brand_Renault | Brand_Toyota | Brand_Volkswagen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4200.0 | sedan | 277 | 2.000000 | Petrol | yes | 1991 | False | True | False | False | False | False | False |
| 1 | 7900.0 | van | 427 | 2.900000 | Diesel | yes | 1999 | False | False | True | False | False | False | False |
| 2 | 13300.0 | sedan | 358 | 5.000000 | Gas | yes | 2003 | False | False | True | False | False | False | False |
| 3 | 23000.0 | crossover | 240 | 4.200000 | Petrol | yes | 2007 | True | False | False | False | False | False | False |
| 4 | 18300.0 | crossover | 120 | 2.000000 | Petrol | yes | 2011 | False | False | False | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4340 | 125000.0 | sedan | 9 | 3.000000 | Diesel | yes | 2014 | False | False | True | False | False | False | False |
| 4341 | 6500.0 | sedan | 1 | 3.500000 | Petrol | yes | 1999 | False | True | False | False | False | False | False |
| 4342 | 8000.0 | sedan | 194 | 2.000000 | Petrol | yes | 1985 | False | True | False | False | False | False | False |
| 4343 | 14200.0 | sedan | 31 | 2.790734 | Petrol | yes | 2014 | False | False | False | False | False | True | False |



### Encoding 'Body' Column


Since Body is nominal data, we use one-hot encoding.

In [173]:
data = pd.get_dummies(data=data , columns=['Body'])


### Encoding 'Engine Type' Column


Since Engine Type is nominal data, we use one-hot encoding.

In [174]:
data = pd.get_dummies(data=data , columns=['Engine Type'])

### Encoding 'Registration' Column


Since Registration is nominal data, we use one-hot encoding.

In [175]:
data = pd.get_dummies(data=data, columns=['Registration'])

<h2> <font color = "#44AAFF"> Scaling Data </font></h2>

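The numerical features are on very different scales (in the rows shown earlier, Mileage runs into the hundreds and Year into the thousands, while EngineV stays in single digits), so it is wise to standardize them. Price is the target, so it is separated from the features and left unscaled.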

### Standard Scaler

In [176]:
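scaler = StandardScaler()

# Separate the features (X) from the target (y); Price itself is not scaled
X = data.drop('Price', axis = 1)
y = data['Price']

# Standardize every feature column to zero mean and unit variance
X_scaled = scaler.fit_transform(X)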
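
Note that `fit_transform(X)` standardizes every column of `X`, including the one-hot columns. If only the numerical columns should be scaled, one possible approach is `ColumnTransformer`. The sketch below is an illustration on a made-up miniature of the encoded table (the `toy` frame and the `numeric_cols` list are assumptions).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of the encoded feature table; values are made up.
toy = pd.DataFrame({
    "Mileage": [277, 427, 358, 240],
    "EngineV": [2.0, 2.9, 5.0, 4.2],
    "Brand_BMW": [True, False, False, False],
    "Brand_Audi": [False, False, False, True],
})

numeric_cols = ["Mileage", "EngineV"]

# Scale only the numeric columns; pass the one-hot columns through unchanged.
preprocess = ColumnTransformer(
    [("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
scaled = preprocess.fit_transform(toy)

# Output column order: the scaled numeric columns first, then the passthrough columns.
print(scaled)
```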



<h2> <font color = "#EE20AA">Splitting Dataset into Training and Testing Data </font></h2>


In [177]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 40)
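
Because the split here happens after `scaler.fit_transform(X)`, the scaler's mean and standard deviation were computed from rows that end up in the test set. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates this on synthetic data (`X_toy`, `y_toy`, and the other names are assumptions, not the notebook's variables).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target; values are random.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=100.0, scale=20.0, size=(50, 3))
y_toy = rng.normal(size=50)

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=40
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0))  # approximately zero
print(X_te_scaled.mean(axis=0))  # close to zero, but not exactly
```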


<h2> Model Training </h2>

In [178]:
model = LinearRegression()
model.fit(X_train, y_train)

## Model Evaluation

y_pred = model.predict(X_test)
# For regressors, score() returns the coefficient of determination (R^2)
model.score(X_test, y_test)



0.4297593829199904
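
The value returned by `model.score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand and compares it with `r2_score`, and also reports the mean absolute error, which is expressed in the same units as the target. The `y_true` and `y_hat` arrays are made up for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true prices and predictions, made up for illustration.
y_true = np.array([4200.0, 7900.0, 13300.0, 23000.0])
y_hat = np.array([5000.0, 9000.0, 12000.0, 20000.0])

# R^2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))   # the two values agree
print(mean_absolute_error(y_true, y_hat))   # average absolute error in price units
```

In the notebook itself, the same metrics could be applied to `y_test` and the `y_pred` array that is already computed.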

<h2><font color = '#11FFAA'> Cross-Validation Tests </font></h2>



#### What is cross-validation?

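Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The data is split into k folds; the model is trained on k-1 folds and scored on the remaining one, and this is repeated until every fold has served as the test set once. Averaging the k scores gives a more reliable estimate of how well the model generalizes than a single train/test split, which makes overfitting easier to detect.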

In [179]:
scores = cross_val_score(model, X_scaled, y, cv=10)
print(scores)
print(f"Model Mean Score :{scores.mean()}")



[0.39558411 0.41565442 0.50304259 0.37590593 0.44291376 0.35244407
 0.45240615 0.40035656 0.48950824 0.3288681 ]
Model Mean Score :0.41566839364499797
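
Passing `X_scaled` to `cross_val_score` means the scaler was fitted on all rows before the folds were formed. One way to keep preprocessing inside each fold is to wrap the scaler and the model in a pipeline. The sketch below is an illustration on synthetic data (`X_toy`, `y_toy`, the coefficients, and the shuffled `KFold` are assumptions, not the notebook's setup).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: a linear signal plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# The pipeline refits the scaler on the training folds of every split.
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# Shuffling guards against any ordering present in the rows.
cv = KFold(n_splits=10, shuffle=True, random_state=40)
fold_scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv)

print(fold_scores.mean(), fold_scores.std())
```

Reporting the standard deviation alongside the mean shows how much the score varies from fold to fold.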


<h2 align = center>Conclusion</h2>


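Preprocessing is what made this dataset usable for the model: `LinearRegression` cannot be fitted on missing values or string-valued columns. After mean imputation, one-hot encoding, and standardization, the model reaches an R² of about 0.43 on the test set and a mean R² of about 0.42 across the 10 cross-validation folds.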