<a href="https://colab.research.google.com/github/NeveChrono/ML_CodeDump/blob/main/DataProcessingTools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data Preprocessing Tools


# Import Libraries
* NumPy - work with arrays
* Pandas - import the dataset and build the matrix of features
* Matplotlib - visualization
* Scikit-learn - a widely used machine learning library








In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the Dataset

In [7]:
# Load the dataset into a variable as a pandas DataFrame

dataset = pd.read_csv('Data.csv')

''' Any training setup works with two things: the features and the
dependent variable, which is the output we want to predict. Data.csv
records whether a car was purchased by a person of a certain country, age, and salary.
So,
 Features = Country, Age, and Salary
 Dependent Variable = Purchased

 We will predict whether a person will purchase or not. By convention, features are written in the first columns
 and the dependent variable is written in the last column.
'''
# Extract the features and the dependent variable:

# iloc selects data by integer position and takes two parts: [row range, column range]
# X takes all rows and every column except the last, because the upper bound -1 is excluded
# Y takes all rows and only the last column
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,-1].values

# .values converts the selected DataFrame slice into a NumPy array


# Check the Extracted Values

In [8]:
print("Independent Variable or Features")
print(X)

print("Dependent Variable")
print(Y)

Independent Variable or Features
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
Dependent Variable
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
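The positional slicing used above can be tried on a small frame. This is a minimal sketch, assuming a hypothetical three-row DataFrame laid out like Data.csv; the names `df`, `features`, and `target` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row frame mirroring the column layout of Data.csv
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values  # all rows, every column except the last
target = df.iloc[:, -1].values     # all rows, last column only

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
```

The `:-1` slice stops before the last column, which is why the dependent variable never leaks into the feature matrix.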


# Taking Care of Missing Data


Missing data can cause errors when training a model. One option is to drop every row that has a missing value, but this only works well for a large dataset with few missing values.


Another option is to replace each missing value with the mean of its column, using scikit-learn's SimpleImputer.
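The two options can be compared side by side with plain pandas. This is a rough sketch on hypothetical values (not the notebook's dataset), using `dropna` and `fillna` as stand-ins for the two strategies; the notebook itself uses SimpleImputer for the second one.

```python
import numpy as np
import pandas as pd

# Hypothetical values; NaN marks a missing reading
df = pd.DataFrame({
    "Age": [40.0, np.nan, 50.0],
    "Salary": [np.nan, 52000.0, 83000.0],
})

dropped = df.dropna()          # option 1: discard every incomplete row
filled = df.fillna(df.mean())  # option 2: replace NaN with the column mean

print(len(dropped))  # only one complete row survives
print(filled)
```

On a frame this small, dropping rows throws away two thirds of the data, which illustrates why it only suits large datasets with few gaps.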

In [10]:
from sklearn.impute import SimpleImputer
# Create an instance of the SimpleImputer class.
# It finds the empty values (np.nan) and replaces them with the mean of their column.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

'''
To achieve this we use two methods:
1. fit() - computes the mean of each column, ignoring the missing values
2. transform() - replaces the missing values with that mean

fit() takes the data in which it will look for empty numeric values,
so we need to say which columns of the dataset to use.
These are columns 2 and 3 (indices 1 and 2), which are age and salary.

String and categorical data must be converted to numbers before a mean
can be imputed for them.

Common techniques include:

    - Label Encoding: Converts each category to a unique integer.
    - Ordinal Encoding: Assigns numerical values based on the order of
                        categories.
    - One-Hot Encoding: Converts each categorical value into a binary
                        vector with a length equal to the number of unique categories.
'''

imputer.fit(X[:,1:3])

'''
Now transform the dataset by replacing the missing age and salary
values with the column means.
'''
X[:,1:3]=imputer.transform(X[:,1:3])





In [11]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
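To see what `fit()` learns before `transform()` applies it, the fitted imputer's `statistics_` attribute can be inspected. This is a small sketch on hypothetical age and salary values, not the notebook's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age/salary columns with one gap in each
values = np.array([
    [40.0, np.nan],
    [np.nan, 52000.0],
    [50.0, 83000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(values)                    # learns one mean per column
print(imputer.statistics_)             # the per-column means found by fit()

completed = imputer.transform(values)  # fills each NaN with its column mean
print(completed)
```

Keeping `fit()` and `transform()` separate means the same learned means can later be applied to other data.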


# Encoding Categorical Data


As discussed above, the fit function used here only accepts numerical data, so we need a way to convert the category values to numbers.

One approach would be to assign each category an integer (0, 1, 2, ...), but a model can interpret these as an order, which can distort the output.

A better approach is to split the column into several sub-columns, one for each value of the category.

Each value of the category is then assigned a binary vector such as <001> or <010>.

This works for any number of categorical values and is called One-Hot Encoding.
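The difference between integer codes and one-hot vectors can be shown with NumPy alone. This is an illustrative sketch on a hypothetical list of countries, and it is not how the notebook performs the encoding (it uses scikit-learn's OneHotEncoder).

```python
import numpy as np

# Hypothetical category column
countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted categories and the integer code of each entry
categories, codes = np.unique(countries, return_inverse=True)
print(codes)  # integer codes, which wrongly imply France < Germany < Spain

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]
print(one_hot)
```

Each row of `one_hot` contains a single 1, so no category is numerically larger than another.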

In [14]:
# We import two classes for this: ColumnTransformer and OneHotEncoder

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Create the transformer object

'''
We pass two arguments to the ColumnTransformer class: transformers and remainder.

- transformers tells it which transform to apply to which columns. Each entry gives
a name ('encoder'), the object that does the encoding, and the indices or names of the columns to encode

- remainder='passthrough' ensures that all the columns not affected by the transform are kept
instead of being dropped

   One-Hot Encoding: Converts each categorical value into a binary vector, with 1 in the column for its category and 0 (false) in the others.
   You can select columns by index, but it is clearer to build a list of the columns you want to transform.

'''
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')

'''
In the previous case we called fit and transform separately to fill in the data.
Here we use the built-in fit_transform method of the ColumnTransformer class, which does both in one call.
The catch is that it does not necessarily return a NumPy array, which is what we work with,
so we convert the result into one.
'''

X = np.array(ct.fit_transform(X))
print(X)


[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
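The order of the new one-hot columns follows the sorted category names, which can be checked on the fitted transformer. This is a minimal sketch on a hypothetical three-row array; `rows` and `encoded` are illustrative names.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: a country column followed by an age column
rows = np.array(
    [["France", 44.0], ["Spain", 27.0], ["Germany", 30.0]],
    dtype=object,
)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = np.array(ct.fit_transform(rows))

# The fitted encoder is stored under the name given in `transformers`
print(ct.named_transformers_["encoder"].categories_)
print(encoded)
```

The encoded columns come first and the passthrough columns follow, which matches the layout of the printed matrix above.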


To encode the output (the dependent variable), we will use LabelEncoder.

In [15]:
from sklearn.preprocessing import LabelEncoder

# create an Instance of the class

le = LabelEncoder()

# fit_transform here already returns a NumPy array, so no conversion is needed
Y = le.fit_transform(Y)
print(Y)

[0 1 0 0 1 1 0 1 0 1]
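The mapping between the original labels and the integers can be read back from the fitted encoder. This is a short sketch on hypothetical labels, using the `classes_` attribute and the `inverse_transform` method of LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical dependent-variable values
labels = ["No", "Yes", "No", "Yes"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)  # sorted labels; the position of each label is its code
print(encoded)

decoded = le.inverse_transform(encoded)  # maps the integers back to labels
print(decoded)
```

Because the classes are sorted, 'No' maps to 0 and 'Yes' maps to 1, as in the output above.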


# Split the Data into Training and Test Sets