# Data Preprocessing Tools

## Importing the libraries

**NumPy**

NumPy is a powerful Python library for numerical operations and array manipulation.

**Matplotlib**

`matplotlib.pyplot` is a module in the Matplotlib library used for creating visualizations in Python.

**Pandas**

Pandas is a versatile Python library for data manipulation and analysis, especially with tabular data structures like DataFrames.



In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

Divide the dataset into independent and dependent variables.

Independent variables - the features (input variables); here, every column except the last.

Dependent variable - the target to predict; here, the last column.

In [None]:
dataset = pd.read_csv('/content/Data.csv')
# Features: every column except the last
X = dataset.iloc[:, :-1].values
# Target: the last column
y = dataset.iloc[:, -1].values

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

**Why handle missing data?**

Missing values can cause errors or degrade results when training a machine learning model; many scikit-learn estimators reject input that contains NaN.

Ways to handle missing data:
1. Drop the entire observation. This works when the dataset is large and only a few values are missing.

2. Replace the missing value with the average of the other values in that column.

Because our dataset is small, we use the second approach.
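The two options can be compared side by side with pandas. This is only a minimal sketch: the values are copied from the printed `X` above, but the column names `Country`, `Age`, and `Salary` are assumptions, since the notebook never shows the CSV header.

```python
import numpy as np
import pandas as pd

# Column names are assumed for illustration; the values come from the printed X
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: fill each missing value with the mean of its numeric column
filled = df.fillna(df.mean(numeric_only=True))

print(len(df), len(dropped))   # number of rows before and after dropping
print(filled.loc[[4, 6]])      # the two rows that had missing values
```

With only ten rows, dropping discards two of them, which is why mean imputation is the more economical choice here.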


In [None]:
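from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Columns 1 and 2 are the numeric ones; column 0 holds text
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)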

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data

### Encoding the Independent Variable

**Why do we need one-hot encoding?**

Most machine learning algorithms require numerical input. Categorical variables, such as the country names in the first column, must therefore be converted into numbers before an algorithm can process them.

**When to use LabelEncoder and OneHotEncoder?**

The choice of encoder depends on the nature of the categorical variable.
- Use integer (ordinal) encoding when the categories have an inherent order or hierarchy, like low, medium, high. Note that scikit-learn's LabelEncoder is intended for the target variable and numbers the classes in sorted order, not by their meaning; for ordered features, OrdinalEncoder lets you state the category order explicitly.
- Use OneHotEncoder when the categorical variable has no intrinsic order or hierarchy. It creates one binary column per category, where a value of 1 or 0 indicates the presence or absence of that category in the row.
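The difference between the two encodings can be seen in a small sketch. The `sizes` array is a made-up example of an ordered feature (it is not part of the dataset), while the country names are the ones that appear in `X`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordered categories (hypothetical feature): state the order explicitly
sizes = np.array([["low"], ["high"], ["medium"]])
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
size_codes = ordinal.fit_transform(sizes)

# Unordered categories: one binary column per category
countries = np.array([["France"], ["Spain"], ["Germany"]])
onehot = OneHotEncoder()
country_columns = onehot.fit_transform(countries).toarray()

print(size_codes.ravel())      # low -> 0, medium -> 1, high -> 2
print(onehot.categories_[0])   # column order: categories sorted alphabetically
print(country_columns)
```

The one-hot columns follow the sorted category order (France, Germany, Spain), which matches the column layout of the encoded `X` in this notebook.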



In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (the country) and pass the other columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [None]:
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

### Encoding the Dependent Variable

In [None]:
from sklearn.preprocessing import LabelEncoder
# Encode the target labels as integers: 'No' -> 0, 'Yes' -> 1
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])



Splitting the dataset:

Training set - used to train the machine learning model on existing observations.

Test set - used to evaluate the model's performance on observations it has not seen.



## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train

array([[1.0, 0.0, 0.0, 35.0, 58000.0],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0]], dtype=object)

In [None]:
y_train

array([1, 0, 1, 0, 1, 1, 0, 0])

In [None]:
x_test

array([[0.0, 1.0, 0.0, 50.0, 83000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0]], dtype=object)

In [None]:
y_test

array([0, 1])

**FAQ:**

**Should feature scaling be applied before or after splitting the dataset, and why?**

Apply feature scaling after splitting the dataset.

Scaling after the train-test split prevents data leakage. If the scaler is fitted before the split, the test rows contribute to the computed mean and standard deviation, so information from the test set influences how the training data is transformed and the evaluation becomes overly optimistic. Fitting the scaler on the training data only, and then applying that same fitted scaler to the test set, keeps the test set a fair stand-in for unseen data.
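To make the leakage concrete, here is a minimal sketch that uses the fourth column of the `x_train` and `x_test` outputs shown above. The variable names `train_vals`, `test_vals`, `leaky`, and `clean` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fourth-column values copied from the x_train / x_test output above
train_vals = np.array([[35.0], [44.0], [48.0], [30.0],
                       [37.0], [40.0], [38.0], [38.77777777777778]])
test_vals = np.array([[50.0], [27.0]])

# Leaky: the scaler sees the test rows when computing its mean and std
leaky = StandardScaler().fit(np.vstack([train_vals, test_vals]))

# Correct: the statistics come from the training rows only
clean = StandardScaler().fit(train_vals)
test_scaled = clean.transform(test_vals)

print(leaky.mean_[0], clean.mean_[0])  # the two means differ
print(test_scaled.ravel())
```

On ten rows the two means differ only slightly, but the principle is the same at any size: whatever the scaler learns should come from the training rows alone.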

**Why Feature Scaling?**

Feature scaling is important in machine learning to ensure that all features have a similar scale or range. This is crucial because many machine learning algorithms perform better or converge faster when features are on a similar scale.

Imagine you have one feature that ranges from 0 to 1000 and another that ranges from 0 to 1. An algorithm that relies on distances or gradient descent may effectively give more weight to the feature with the larger range, simply because its numbers are bigger, leading to biased results. Feature scaling levels the playing field by bringing all features to a similar scale, so each one can contribute to the model in proportion to its actual information.
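The effect on a distance-based method can be sketched with three made-up points. The samples `a`, `b`, and `c` are hypothetical; only the two feature ranges (0 to 1000 and 0 to 1) come from the text above.

```python
import numpy as np

# Hypothetical samples: [feature on a 0-1000 scale, feature on a 0-1 scale]
a = np.array([100.0, 0.1])
b = np.array([110.0, 0.9])   # close to a on feature 1, far on feature 2
c = np.array([300.0, 0.1])   # far from a on feature 1, identical on feature 2

def euclidean(p, q):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(p - q))

# Unscaled: the 0-1000 feature dominates the distance
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Divide by each feature's range so both features span 0 to 1
ranges = np.array([1000.0, 1.0])
scaled_ab = euclidean(a / ranges, b / ranges)
scaled_ac = euclidean(a / ranges, c / ranges)

print(raw_ab < raw_ac)        # before scaling, b looks like the nearer neighbour
print(scaled_ab < scaled_ac)  # after scaling, the ordering flips
```

Before scaling, the second feature has almost no say in which point counts as closest; after scaling, both features contribute on equal terms.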

**Types of Feature Scaling**

The two main types of feature scaling are Min-Max scaling and Standardization (Z-score normalization).

- Min-Max scaling (Normalization): This method scales the data to a fixed range, usually between 0 and 1. It is calculated as (X - X_min) / (X_max - X_min). Use Min-Max scaling when you need the data to lie within a specific range or when the distribution of the data is not Gaussian.

- Standardization (Z-score normalization): This method scales the data so that it has a mean of 0 and a standard deviation of 1. It is calculated as (X - mean) / standard deviation. Use Standardization when the data is roughly Gaussian; the result is centered on 0 and is not bounded to a fixed range.
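Both formulas can be checked by hand against scikit-learn's scalers. This is a small sketch on an arbitrary sample column; the array `x` is illustrative, not taken from the split data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One illustrative feature column
x = np.array([[27.0], [30.0], [38.0], [50.0]])

# Min-Max scaling: (X - X_min) / (X_max - X_min)
minmax_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardization: (X - mean) / standard deviation
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses as well
zscore_manual = (x - x.mean(axis=0)) / x.std(axis=0)

minmax_sklearn = MinMaxScaler().fit_transform(x)
zscore_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(minmax_manual, minmax_sklearn))
print(np.allclose(zscore_manual, zscore_sklearn))
```

Min-Max output always spans exactly 0 to 1 on the data it was fitted on, while standardized output has mean 0 and unit standard deviation but no fixed bounds.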



## Feature Scaling

In [None]:
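from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit on the training set only, scaling just the numeric columns (index 3 onward);
# the one-hot columns stay as 0/1
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
# Reuse the training-set mean and standard deviation for the test set
x_test[:, 3:] = sc.transform(x_test[:, 3:])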


In [None]:
x_train

array([[1.0, 0.0, 0.0, -0.7529426005471072, -0.6260377781240918],
       [1.0, 0.0, 0.0, 1.008453807952985, 1.0130429500553495],
       [1.0, 0.0, 0.0, 1.7912966561752484, 1.8325833141450703],
       [0.0, 1.0, 0.0, -1.7314961608249362, -1.0943465576039322],
       [1.0, 0.0, 0.0, -0.3615211764359756, 0.42765697570554906],
       [0.0, 1.0, 0.0, 0.22561095973072184, 0.05040823668012247],
       [0.0, 0.0, 1.0, -0.16581046438040975, -0.27480619351421154],
       [0.0, 0.0, 1.0, -0.013591021670525094, -1.3285009473438525]],
      dtype=object)

In [None]:
x_test

array([[0.0, 1.0, 0.0, 2.1827180802863797, 2.3008920936249107],
       [0.0, 0.0, 1.0, -2.3186282969916334, -1.7968097268236927]],
      dtype=object)

##### Reference links for more information
- [Scikit-learn : Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)
- [Numpy](https://numpy.org/doc/)
- [Pandas](https://pandas.pydata.org/docs/)
- [Matplotlib](https://matplotlib.org/stable/index.html)