# Data Preprocessing

## Steps
##### 1. Separate features and labels from the dataset
##### 2. Handle missing values
##### 3. Encode categorical data, both independent and dependent variables
##### 4. Split the dataset into train and test sets
##### 5. Apply feature scaling, fitted on the training set only
Feature scaling should be fitted on the training set only, and only after splitting. The test set has to stay realistic: its statistics must not leak into the model during training. The test set is then transformed with the mean and standard deviation learned from the training set.

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Importing the dataset

In [2]:
df = pd.read_csv("../Datasets/data_preprocessing.csv")
df.head()

print(df.describe())

             Age        Salary
count   9.000000      9.000000
mean   38.777778  63777.777778
std     7.693793  12265.579662
min    27.000000  48000.000000
25%    35.000000  54000.000000
50%    38.000000  61000.000000
75%    44.000000  72000.000000
max    50.000000  83000.000000


## Separate the features from the dataset as x

In [3]:
# .values is an attribute (not a method) that returns the selected columns as a NumPy array
x = df.iloc[:, :-1].values
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Separate the labels from the dataset as y

In [4]:
y = df.iloc[:,-1].values
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
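The two `iloc` slices above can be tried on a small stand-in frame. This is only a sketch: the column names `Country` and `Purchased` are assumed for illustration and are not taken from the original CSV.

```python
import pandas as pd

# Hypothetical miniature dataset; the column names are assumed for illustration.
demo_df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

demo_x = demo_df.iloc[:, :-1].values  # all rows, every column except the last
demo_y = demo_df.iloc[:, -1].values   # all rows, only the last column

print(demo_x.shape)   # features keep two dimensions: (rows, columns)
print(len(demo_y))    # labels are one-dimensional
```

Because the feature columns mix strings and numbers, the resulting feature array has `dtype=object`, which is also why the arrays printed in this notebook show `dtype=object`.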


## Taking care of missing values using sklearn.impute

#### The SimpleImputer class in sklearn.impute fills missing values in a dataset using a specified strategy:

- Mean (mean): replaces missing values with the column's mean.
- Median (median): replaces missing values with the column's median.
- Most frequent (most_frequent): replaces missing values with the column's most common value.
- Constant (constant): replaces missing values with a specified constant value (fill_value).

### With strategy='mean', fit() computes the mean of each column.
### transform() fills missing values using the computed means.

In [5]:
# Fill missing values with SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Use only the numerical columns (Age and Salary); column 0 holds strings

# fit computes the mean of each selected column
imputer.fit(x[:, 1:])

# transform replaces the missing values and returns the updated matrix
x[:, 1:] = imputer.transform(x[:, 1:])
print(x)

# fit_transform does both steps in a single call


[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
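To compare the strategies listed earlier, here is a minimal sketch on a toy array. The numbers are illustrative, not the notebook's dataset. After fitting, `statistics_` holds the fill value computed for each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric columns (age, salary) with one gap each; values are illustrative.
demo = np.array([[44.0, 72000.0],
                 [27.0, np.nan],
                 [np.nan, 54000.0],
                 [38.0, 61000.0]])

stats = {}
for strategy in ("mean", "median", "most_frequent"):
    demo_imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = demo_imputer.fit_transform(demo)  # fit and transform in one call
    stats[strategy] = demo_imputer.statistics_  # per-column fill values
    print(strategy, stats[strategy])

# The constant strategy needs no statistics; it just writes fill_value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
filled_constant = constant_imputer.fit_transform(demo)
print(filled_constant)
```

The median is usually less sensitive to outliers than the mean, which is one reason to prefer it for skewed columns such as salaries.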


## Encoding categorical data

Many machine learning models work with numerical data only, so categorical data (such as colors, cities, or labels) must be converted into numbers before training. Encoding categorical data matters for several reasons:

- Models understand categorical information: most implementations, including scikit-learn's linear regression and decision trees, cannot process text labels directly.
- Improves model performance: a suitable encoding can improve accuracy and make patterns easier to recognize.
- Avoids false ordering: if nominal categories are encoded as ordered integers, some models will incorrectly assume an order among them.
- Handles missing or unseen categories: proper encoding ensures that missing or new values do not break the model.
### Types of categorical encoding

### One-Hot Encoding (OHE)
Converts each category into its own binary (0/1) column.
Ideal for nominal (unordered) categorical data, e.g. [Red, Blue, Green, Blue].
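As a small sketch of the color example, the snippet below one-hot encodes the four values. It also sets `handle_unknown="ignore"`, one possible way to deal with unseen categories. The unseen value "Yellow" is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of nominal data, shaped (n_samples, 1) as the encoder expects.
colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
color_encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = color_encoder.fit_transform(colors).toarray()

print(color_encoder.categories_[0])  # learned categories, in sorted order
print(one_hot)

# "Yellow" was not present during fit, so every column is 0 for it.
unseen = color_encoder.transform(np.array([["Yellow"]])).toarray()
print(unseen)
```

The columns follow the sorted category order (Blue, Green, Red), not the order in which the values first appear.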

### Ordinal Encoding (For Ordered Categories)
If data has a natural order (e.g., small, medium, large), ordinal encoding is used.
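A minimal sketch of ordinal encoding for the small/medium/large example. Passing `categories` explicitly is what fixes the order; without it the encoder would sort the labels alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# The explicit list defines the order: small -> 0, medium -> 1, large -> 2.
size_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded_sizes = size_encoder.fit_transform(sizes)

print(encoded_sizes.ravel())
```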

### Encoding the independent variables (the features, although not all features are truly independent)
#### Using ColumnTransformer from sklearn.compose


### Explanation of each ColumnTransformer argument:

#### transformers
A list of tuples, where each tuple contains:

- Name (str): a unique name for the transformer.
- Transformer (estimator, "drop", or "passthrough"): the transformation to apply, e.g. StandardScaler() or OneHotEncoder().
- Columns (list, slice, str, or array-like): the columns to apply the transformer to.

#### remainder ("drop", "passthrough", or estimator)
How to handle columns that are not listed in transformers:

- "drop" (default): discards the remaining columns.
- "passthrough": keeps the remaining columns unchanged.
- estimator: applies that estimator to the remaining columns.

#### n_jobs (int, default=None)
The number of jobs to run in parallel. None means 1 unless a joblib parallel backend context is active.

#### verbose (bool, default=False)
If True, the time elapsed while fitting each transformer is printed.

#### verbose_feature_names_out (bool, default=True)
If True, output feature names are prefixed with the name of the transformer that produced them.

In [6]:
from sklearn.compose import ColumnTransformer
# Applies different transformations to different columns

from sklearn.preprocessing import OneHotEncoder
# Converts categorical variables into one-hot (binary) columns

# transformers: [(name, transformer, [column indices])]
# remainder="passthrough": keep the columns not listed above (Age, Salary) unchanged
ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])], remainder="passthrough")

x = np.array(ct.fit_transform(x))

print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


## Encoding dependent variable

We use LabelEncoder for the dependent variable (the labels).

In [7]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)

In [8]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
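A small sketch of how the mapping can be inspected and reversed. After fitting, `classes_` lists the original labels in the order of their integer codes, and `inverse_transform` maps codes back to labels. The short label array is illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

demo_labels = np.array(["No", "Yes", "No", "No", "Yes"])

demo_le = LabelEncoder()
demo_encoded = demo_le.fit_transform(demo_labels)

print(demo_le.classes_)   # position in this array is the integer code
print(demo_encoded)

# Recover the original string labels from the integer codes.
demo_decoded = demo_le.inverse_transform(demo_encoded)
print(demo_decoded)
```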


## Splitting Dataset into train & test sets

#### using sklearn.model_selection for splitting

In [9]:
from sklearn.model_selection import train_test_split

# Returns four arrays: x_train, x_test, y_train, y_test

# Arguments: (features, labels, test_size, train_size, random_state)
# random_state=1 fixes the shuffle seed, so the same split is produced on every run

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, train_size=0.8, random_state=1)

In [10]:
x_train

array([[0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [11]:
x_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [12]:
y_train

array([0, 1, 0, 0, 1, 1, 0, 1])

In [13]:
y_test

array([0, 1])
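The role of `random_state` can be checked with a short sketch on placeholder data: two calls with the same seed should return identical splits, and `test_size=0.2` on 10 rows leaves 8 rows for training and 2 for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus 10 labels.
demo_features = np.arange(20).reshape(10, 2)
demo_targets = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

first = train_test_split(demo_features, demo_targets, test_size=0.2, random_state=1)
second = train_test_split(demo_features, demo_targets, test_size=0.2, random_state=1)

# Same seed, same shuffle: every returned array matches its counterpart.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
print(len(first[0]), len(first[1]))  # training rows, test rows
```

Specifying `train_size` as well is optional; when only `test_size` is given, the training set receives the remaining rows.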

## Feature scaling

Puts all features on the same scale so that features with large numeric ranges do not dominate features with small ranges.

In [16]:
# Import StandardScaler from sklearn.preprocessing

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Scale only the numerical features (columns 3 onward), not the one-hot columns
# Apply standardization

# fit computes the mean and standard deviation of each column
# transform applies z = (x_i - mean) / std

# fit_transform does both
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])

### For the test set, only call transform (reuse the mean and standard deviation learned from the training set)

In [17]:
x_test[:,3:] = sc.transform(x_test[:,3:])

In [18]:
print("x_train: " , x_train)

x_train:  [[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [19]:
print("x_test: " , x_test)

x_test:  [[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
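To make the standardization formula concrete, the sketch below redoes the computation by hand on the Age and Salary values printed earlier for the train and test sets, then compares it with StandardScaler. It relies on the fact that StandardScaler uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary columns copied from the x_train / x_test outputs above.
train_numeric = np.array([[38.77777777777778, 52000.0],
                          [40.0, 63777.77777777778],
                          [44.0, 72000.0],
                          [38.0, 61000.0],
                          [27.0, 48000.0],
                          [48.0, 79000.0],
                          [50.0, 83000.0],
                          [35.0, 58000.0]])
test_numeric = np.array([[30.0, 54000.0],
                         [37.0, 67000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_numeric)  # learns mean_ and scale_
test_scaled = scaler.transform(test_numeric)        # reuses the training statistics

# Manual z-scores: subtract the training mean, divide by the training std.
manual_train = (train_numeric - train_numeric.mean(axis=0)) / train_numeric.std(axis=0)
manual_test = (test_numeric - scaler.mean_) / scaler.scale_

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
print(test_scaled)
```

The scaled test values are not centered on zero. That is expected: they are measured against the training set's mean and spread, which is exactly what keeps test information out of the fitting step.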
