# Data Preprocessing Tools

## Importing the libraries

In [1]:
# Libraries used for data preprocessing
import numpy as np                  # NumPy, aliased as np
import matplotlib.pyplot as pyplot  # pyplot module from Matplotlib
import pandas as pd                 # pandas, aliased as pd

## Importing the dataset$^{[1]}$

---



In [2]:
dataset = pd.read_csv("Data.csv")  # read Data.csv into a pandas DataFrame
X = dataset.iloc[:, :-1].values    # matrix of features: every column except the last [2]
y = dataset.iloc[:, -1].values     # dependent variable vector: the last column

print(dataset)


   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [3]:
print(X) #features

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(y)  # dependent variable

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [5]:
from sklearn.impute import SimpleImputer
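imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # replace NaN with the column mean
imputer.fit(X[:, 1:3])                    # learn the means of the Age and Salary columns
X[:, 1:3] = imputer.transform(X[:, 1:3])  # fill the missing entries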

In [6]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
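
As a rough cross-check of what `strategy='mean'` does, the sketch below recomputes the two fill values with plain NumPy rather than scikit-learn. The names `numeric`, `column_means`, and `filled` are hypothetical; the numbers are the Age and Salary columns printed above.

```python
import numpy as np

# Age and Salary columns from the dataset (NaN marks a missing entry)
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# Column means computed while ignoring the NaN entries
column_means = np.nanmean(numeric, axis=0)

# Replace each NaN with the mean of its own column
filled = np.where(np.isnan(numeric), column_means, numeric)

print(column_means)          # mean Age and mean Salary
print(filled[4], filled[6])  # the two rows that had missing values
```

The filled rows should agree with rows 4 and 6 of the imputed `X` shown above.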


## Encoding categorical data

### Encoding the Independent Variable

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (Country) and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encode', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [8]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
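
The three new columns follow the sorted category names (France, Germany, Spain), which is why France rows start with `1.0 0.0 0.0`. The sketch below builds the same encoding with plain NumPy to make that mapping visible; it is an illustration with hypothetical variable names, not the scikit-learn implementation.

```python
import numpy as np

# Country column from the dataset
countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany',
                      'France', 'Spain', 'France', 'Germany', 'France'])

# np.unique returns the distinct values in sorted order
categories = np.unique(countries)

# Compare every row against every category: one column per category,
# with 1.0 where the row matches and 0.0 elsewhere
one_hot = (countries[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot[:3])
```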


### Encoding the Dependent Variable

In [9]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)  # 'No' -> 0, 'Yes' -> 1

In [10]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
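
Label encoding assigns each distinct label an integer according to its sorted position. A minimal NumPy sketch of that mapping, with hypothetical variable names and without scikit-learn:

```python
import numpy as np

# Purchased column from the dataset
purchased = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

# classes holds the sorted distinct labels; codes holds, for each row,
# the index of its label within classes
classes, codes = np.unique(purchased, return_inverse=True)

print(classes)
print(codes)
```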


## Splitting the dataset into the Training set and Test set

In [11]:
from sklearn.model_selection import train_test_split
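# Hold out 20% of the rows (2 of 10) for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)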

In [12]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [15]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [19]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [20]:
print(y_test)

[0 1]


## Feature Scaling

In [23]:
from sklearn.preprocessing import StandardScaler
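sc = StandardScaler()
# Scale only the numeric columns (Age, Salary); the one-hot columns stay as 0/1
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # fit on the training set only
X_test[:, 3:] = sc.transform(X_test[:, 3:])        # reuse the training mean and standard deviation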

In [24]:
print(X_train)

[[0.0 0.0 1.0 -0.1915918438457856 -1.0781259408412427]
 [0.0 1.0 0.0 -0.014117293757057902 -0.07013167641635401]
 [1.0 0.0 0.0 0.5667085065333239 0.6335624327104546]
 [0.0 0.0 1.0 -0.3045301939022488 -0.30786617274297895]
 [0.0 0.0 1.0 -1.901801144700799 -1.4204636155515822]
 [1.0 0.0 0.0 1.1475343068237056 1.2326533634535488]
 [0.0 1.0 0.0 1.4379472069688966 1.5749910381638883]
 [1.0 0.0 0.0 -0.7401495441200352 -0.5646194287757336]]


In [25]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
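
Standardization replaces each value x with (x - mean) / std, where the mean and standard deviation are taken from the training set only and then reused for the test set. The NumPy sketch below reproduces that arithmetic for the Age and Salary columns of the training and test rows printed in the split section. The variable names are hypothetical, and the sketch assumes the population standard deviation (NumPy's default, `ddof=0`), which agrees with the scaled training values printed above.

```python
import numpy as np

# Age and Salary columns of X_train and X_test before scaling
train = np.array([
    [38.77777777777778, 52000.0],
    [40.0, 63777.77777777778],
    [44.0, 72000.0],
    [38.0, 61000.0],
    [27.0, 48000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [35.0, 58000.0],
])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled[0])
print(test_scaled)
```

Computing the statistics on the training rows alone keeps information about the test rows out of the fitted scaler.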


In [None]:
# End of the data preprocessing toolkit

## Guide Notes

[1] : Independent variables (features) are the inputs we expect to influence the dependent variable. A dependent variable is what happens as a result of the independent variables. In a dataset like this one, the features usually come first and the dependent variable is usually the last column.

We store the features in X and the dependent variable vector in y.

The matrix of features is every column except the last one, which is the dependent variable column.


[2] : Let's break down the statement (X = dataset.iloc[:,:-1]) :

dataset: This refers to a pandas DataFrame, a two-dimensional tabular data structure used to store and manipulate data in a way similar to a spreadsheet.

iloc: This is an indexer provided by pandas for selecting data from a DataFrame by integer position. It is used with square brackets rather than called like a method.

[:,:-1]: This is the indexing operation applied through iloc. It specifies which rows and columns to select from the DataFrame. The : before the comma selects all rows. The :-1 after the comma specifies the columns to select.

:-1: This slice selects all columns except the last one. In Python, negative indices count from the end: -1 refers to the last element, -2 to the second last, and so on. So :-1 selects every column from the beginning up to (but not including) the last column.

In summary, dataset.iloc[:,:-1] returns a new DataFrame that contains all the rows from dataset but excludes the last column. In the cell above, .values then converts that DataFrame into a NumPy array, which is what X holds. The assumption here is that the last column contains the target variable (the variable we want to predict or analyze), so X stores the features, or input variables, for a machine learning model or any other data analysis task.
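
The slicing described in note [2] can be tried on a small DataFrame. The sketch below uses the first three rows of the dataset, typed in by hand rather than read from Data.csv; the name `features` is hypothetical.

```python
import pandas as pd

# First three rows of the dataset, entered directly for illustration
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0],
    'Salary': [72000.0, 48000.0, 54000.0],
    'Purchased': ['No', 'Yes', 'No'],
})

features = dataset.iloc[:, :-1]  # all rows, every column except the last (still a DataFrame)
X = features.values              # the same data as a NumPy array
y = dataset.iloc[:, -1].values   # all rows, last column only

print(type(features).__name__, features.columns.tolist())
print(X.shape)
print(y)
```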



A clear explanation of the core OOP (object-oriented programming) concepts:

A class is the model, or a blueprint, of something we want to build. For example, if we make a house construction plan that gathers the instructions on how to build a house, then this construction plan is the class.

An object is an instance of the class. So if we take that same example of the house construction plan, then an object is simply a house. A house (the object) that was built by following the instructions of the construction plan (the class).
And therefore there can be many objects of the same class, because we can build many houses from the construction plan.

A method is a tool we can use on the object to complete a specific action. In the same example, a tool could open the main door of the house when a guest is coming. A method can also be seen as a function that is applied to the object: it takes some inputs (declared in the class definition) and returns some output.
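
The house analogy can be written out as a short Python sketch. The class name `House`, its attributes, and the method `open_main_door` are hypothetical names chosen for illustration.

```python
class House:
    """The construction plan: describes what every house has and can do."""

    def __init__(self, address):
        self.address = address
        self.door_open = False  # every new house starts with its door closed

    def open_main_door(self, guest):
        """A method: acts on one particular house and returns a result."""
        self.door_open = True
        return f"Main door of {self.address} opened for {guest}"


# Two objects built from the same class
first_house = House("1 Example Street")
second_house = House("2 Example Street")

# Calling the method on one object leaves the other untouched
message = first_house.open_main_door("a guest")
print(message)
print(second_house.door_open)
```

The notebook follows the same pattern: `SimpleImputer`, `LabelEncoder`, and `StandardScaler` are classes, `imputer`, `le`, and `sc` are objects, and `fit`, `transform`, and `fit_transform` are methods.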