# Data Preprocessing Tools

## Importing the libraries

In [None]:
import numpy as np 
# Array creation: NumPy provides functions for creating arrays of different shapes and sizes, 
# including arrays filled with zeros, ones, constant values, and arrays created from existing data.

# Array operations: NumPy supports a variety of array operations, such as element-wise addition, 
# subtraction, multiplication, and division, as well as more complex operations 
# such as matrix multiplication and element-wise functions like exponential and logarithm.

import matplotlib.pyplot as plt 
# pyplot is a module in the matplotlib library, which provides a convenient interface for plotting 
# data in Python. pyplot provides functions for creating a variety of different types of plots, including line plots, scatter plots, bar plots, histograms, and more. Additionally, pyplot includes a number of functions for customizing plots, 
# such as adding titles, labels, and legends, adjusting axis limits and scales, and controlling the appearance of lines, markers, and other elements.

import pandas as pd 
# pandas is a library in Python used for data analysis and manipulation.

# Data import and export: pandas provides functions for reading and writing data from a variety of different file formats, including CSV, Excel, JSON, and SQL databases.

# Data cleaning and transformation: pandas provides a rich set of functions for cleaning and transforming data, such as removing duplicates, handling missing values, and transforming data using operations such as grouping and pivoting.

# Data aggregation and summarization: pandas provides functions for aggregating and summarizing data, such as computing means, medians, and other summary statistics.

# Data visualization: pandas integrates well with the matplotlib library, providing functions for creating a variety of different types of plots and charts, such as line plots, bar plots, histograms, and scatter plots.

## Importing the dataset

In [None]:
dataset = pd.read_csv('Data.csv')
# pd.read_csv is a function in the pandas library that reads data from a CSV (Comma-Separated Values) file and returns a DataFrame. 
# A DataFrame is a two-dimensional data structure that can store heterogeneous data and is similar to a spreadsheet or a SQL table.
dataset.head(10)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
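
The blank fields in rows 4 and 6 above are how missing values appear in the CSV file. As a minimal sketch of what `pd.read_csv` does with them, the block below reads a hypothetical three-row sample (not the real `Data.csv`) from an in-memory string; the names `csv_text` and `sample` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same layout as Data.csv (not the real file).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Germany,40,,Yes
Spain,,52000,No
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN, which turns the integer-looking columns into floats.
print(sample.dtypes)
print(sample.isna().sum())
```

Counting missing values per column with `isna().sum()` is a quick way to see which columns will need attention later.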


Split the data into the independent features (every column except the last) and the dependent variable (the last column).

In [None]:
X = dataset.iloc[:, :-1].values # Independent features
Y = dataset.iloc[:, -1].values # Dependent variable

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(Y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
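
The `iloc` slices used above can be checked on a tiny frame. This is a sketch with a hypothetical two-row DataFrame (`df`, `features`, and `target` are illustrative names), showing the shapes and the dtype that result when text and numeric columns are mixed.

```python
import pandas as pd

# Hypothetical two-row frame with the same columns as the dataset.
df = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

features = df.iloc[:, :-1].values  # all rows, every column except the last
target = df.iloc[:, -1].values     # all rows, last column only

print(features.shape)  # (2, 3)
print(target.shape)    # (2,)
print(features.dtype)  # object, because text and numbers share one array
```

The `object` dtype is why the printed `X` shows quoted strings next to floats in the same rows.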


## Taking care of missing data

1) One method is to delete every row that contains a missing value. This is simple, but it discards data, which matters in a dataset this small.

2) Another method, used below, is to replace each missing value with the mean of its column using scikit-learn's SimpleImputer.

3) Scikit-learn is a free, open-source machine learning library for Python. It is used for data analysis and data mining.

4) Scikit-learn provides algorithms for supervised and unsupervised learning, including classification, regression, clustering, and dimensionality reduction.

5) It also includes tools for feature selection, feature extraction, model selection, and evaluation.
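
The row-deletion approach from item 1 can be sketched with pandas' `dropna`. The frame below is a hypothetical four-row example, not the notebook's dataset, and is only meant to show how many rows are lost.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age and one missing Salary.
df = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0, 40.0],
    "Salary": [72000.0, 52000.0, 54000.0, np.nan],
})

# Drop every row that contains at least one NaN.
complete_rows = df.dropna()

print(len(df), "->", len(complete_rows))  # 4 -> 2
```

Half of the rows disappear here, which is why imputation is often preferred when the dataset is small.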

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean') # In this step the object imputer is still not connected to the matrix of features.

# The code imports the SimpleImputer class from the scikit-learn library's impute module. The imputer object is then created with the SimpleImputer constructor and is initialized with two parameters:
# missing_values: This parameter specifies which values to replace with the imputed value. In this case, the value np.nan is passed, which represents missing or undefined values in a NumPy array.

# strategy: This parameter specifies the strategy to use for imputing missing values. In this case, the value 'mean' is passed, which indicates that the mean value of each column should be used to fill in missing values in that column.
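# The imputer object can then be used on data by calling its fit and transform methods, or fit_transform, which does both in one call.

# Columns 1 and 2 (Age and Salary) are the numeric columns; the slice 1:3 selects exactly these two.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])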

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
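
The imputed Age of 38.777... can be verified by hand. The sketch below copies the Age column from the output above into a standalone array (`ages` and `manual_mean` are illustrative names) and compares the imputer's result with `np.nanmean`, which averages while ignoring NaN.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age column from the dataset, with the one missing value as NaN.
ages = np.array([[44.0], [27.0], [30.0], [38.0], [40.0],
                 [35.0], [np.nan], [48.0], [50.0], [37.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(ages)

# Mean of the nine known ages: 349 / 9.
manual_mean = np.nanmean(ages)

print(manual_mean)
print(imputer.statistics_)  # per-column value learned during fit
print(filled[6, 0])         # the row that was NaN
```

The learned value is stored in `statistics_`, so the same mean can later be applied to new data with `transform`.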


## Encoding categorical data

In machine learning, encoding refers to the process of converting categorical data (such as strings or labels) into numerical data that can be processed by a model. This is necessary because many machine learning algorithms require that the inputs be represented as numbers, rather than text or other data types. There are several common encoding techniques, including one-hot encoding, label encoding, and ordinal encoding, each of which is suited to different types of data and use cases.

### Encoding the Independent Variable

In this case, the list of transformers has only one element: the tuple ('encoder', OneHotEncoder(), [0]). The name of the transformer is 'encoder', and it's an instance of the OneHotEncoder class. The list of column indices is [0], which means that the OneHotEncoder will be applied only to the first column of the data.

The argument remainder = 'passthrough' means that the remaining columns that are not specified in the list of transformers will be passed through the ColumnTransformer unchanged. This argument is required here: the default is remainder = 'drop', which would discard the Age and Salary columns from the output.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Create an instance of the ColumnTransformer class
ct = ColumnTransformer(transformers = [('encoder',OneHotEncoder(), [0])] , remainder = 'passthrough') # for 1 column
# For more than one column, list every index to encode (hypothetical here, since only column 0 is categorical):
# ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0, 1])], remainder = 'passthrough')

# Apply the one-hot encoding to the first column of X
X = np.array(ct.fit_transform(X))

The transformed data is kept as a NumPy array because machine learning algorithms in scikit-learn and other libraries expect array-like input. Note that np.array does not turn a sparse matrix into a dense one; if a transformer returns a sparse matrix, call its toarray method instead.

In [None]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
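
To see what the `remainder` argument changes, the sketch below runs the same encoder twice on three rows taken from the dataset: once with the default (`remainder='drop'`) and once with `remainder='passthrough'`. The names `rows`, `dropped`, and `kept` are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows from the dataset: Country, Age, Salary.
rows = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

# Default remainder ('drop'): only the encoded Country columns survive.
dropped = ColumnTransformer([("encoder", OneHotEncoder(), [0])]).fit_transform(rows)

# remainder='passthrough': Age and Salary are appended after the encoded columns.
kept = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])], remainder="passthrough"
).fit_transform(rows)

print(dropped.shape)  # (3, 3)
print(kept.shape)     # (3, 5)
```

The encoded columns come first and follow the sorted category order (France, Germany, Spain), which matches the layout of the printed `X` above.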


### Encoding the Dependent Variable

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

Y = le.fit_transform(Y)

In [None]:
print(Y)

[0 1 0 0 1 1 0 1 0 1]


LabelEncoder is used to convert categorical variables with a limited number of categories into numerical values. It assigns each category a unique integer value, and converts the categorical variable into an integer representation. For example, if a categorical variable has three categories (A, B, C), the LabelEncoder would convert the categories into the integers 0, 1, and 2, respectively.

OneHotEncoder, on the other hand, converts the categorical variable into a binary representation known as one-hot encoding. In one-hot encoding, each category is represented by a binary vector with as many elements as there are categories. The position of the 1 in the binary vector represents the category, and all other elements are 0. For example, if a categorical variable has three categories (A, B, C), the OneHotEncoder would convert the categories into the binary vectors [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively.

In general, OneHotEncoder is preferred for input features whose categories have no natural order (such as Country), because integer codes would suggest an ordering that does not exist. Its cost is one new column per category, which can become unwieldy when there are many categories. LabelEncoder is intended for the target variable, as with Purchased here. For an input feature whose categories do have a natural order, scikit-learn provides OrdinalEncoder.
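
The encoders discussed above can be compared side by side. This is a small sketch on made-up arrays (`labels`, `sizes`, and `countries` are hypothetical); it also includes `OrdinalEncoder` as one way to encode an input feature with an explicit category order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# LabelEncoder: for a target column; classes are sorted, then numbered.
labels = np.array(["No", "Yes", "No"])
le = LabelEncoder()
encoded_labels = le.fit_transform(labels)
print(le.classes_)     # ['No' 'Yes']
print(encoded_labels)  # [0 1 0]

# OrdinalEncoder: for an input feature with a natural order, given explicitly.
sizes = np.array([["small"], ["large"], ["medium"]])
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded_sizes = ordinal.fit_transform(sizes).ravel()
print(encoded_sizes)   # [0. 2. 1.]

# OneHotEncoder: for an input feature with no order; one column per category.
countries = np.array([["France"], ["Spain"], ["Germany"]])
one_hot = OneHotEncoder().fit_transform(countries).toarray()
print(one_hot)
```

`LabelEncoder.inverse_transform` maps the integers back to the original labels, which is useful when reporting predictions.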

## Splitting the dataset into the Training set and Test set

Steps that learn parameters from the data, such as feature selection and feature scaling, should be done after train-test splitting to avoid leaking information from the test set into the training pipeline.

In [None]:
# Split the dataset into training and test sets with scikit-learn's train_test_split.
# test_size = 0.2 holds out 20% of the rows for testing; random_state = 1 makes the shuffle reproducible.
from sklearn.model_selection import train_test_split # returns X_train, X_test, Y_train, Y_test, in that order
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)


In [None]:
print(X_train) # 8 of the 10 rows, chosen at random

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(X_test) # the remaining 2 rows

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(Y_train)

[0 1 0 0 1 1 0 1]


In [None]:
print(Y_test)

[0 1]


X_train, X_test, Y_train, and Y_test are the four arrays returned by train_test_split:

    X_train is a set of training data for input features, used to train a machine learning model.
    X_test is a set of test data for input features, used to evaluate the performance of a trained model.
    Y_train is a set of training data for the target variable(s), corresponding to the X_train input features.
    Y_test is a set of test data for the target variable(s), corresponding to the X_test input features.

These datasets are usually split from a larger dataset to create a training set and a test set, so that the model can be trained on the training set and evaluated on the test set to estimate its generalization performance.

***The train set is a dataset used to train a machine learning model.*** The model learns the relationship between the input features (represented by X_train) and the target variable(s) (represented by Y_train) in the train set.

The test set is a dataset used to evaluate the performance of a trained machine learning model. The model is applied to the input features in the test set (represented by X_test), and its predictions for the target variable(s) are compared to the actual target values in the test set (represented by Y_test). The performance of the model is then evaluated based on the accuracy of its predictions on the test set.
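
The behaviour described above can be checked on synthetic data. The sketch below uses a made-up 10-row array (`features`, `target`, and the shortened output names are illustrative) to confirm the 8/2 split, that a fixed `random_state` reproduces the same split, and that each feature row stays paired with its own target value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: row i holds the values [2*i, 2*i + 1].
features = np.arange(20).reshape(10, 2)
target = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# The same random_state gives the same split on every run.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    features, target, test_size=0.2, random_state=1
)
print(np.array_equal(X_te, X_te2))  # True

# Rows and targets are shuffled together: recover each row's index and compare.
original_rows = X_te[:, 0] // 2
print(np.array_equal(y_te, target[original_rows]))  # True
```

Without `random_state`, each run would produce a different split, which makes results harder to compare between runs.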

## Feature Scaling

Feature scaling can be applied to both independent and dependent features, although it is more commonly applied to independent features. The goal of feature scaling is to bring the features onto a comparable scale, so that no feature has an undue influence on the model merely because its values are numerically larger (here, Salary is in the tens of thousands while Age is below 100). This helps to ensure that the model is not biased towards any particular feature.

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
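# Columns 0-2 are the dummy variables; only Age and Salary (columns 3 onward) are scaled.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:]) # learn mean and standard deviation from the training set only
X_test[:, 3:] = sc.transform(X_test[:, 3:]) # reuse the training-set statistics on the test set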

In [None]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


Do we need to apply standardization to dummy variables during feature scaling?

No, standardization (subtracting the mean and dividing by the standard deviation) is not typically applied to dummy variables in feature scaling.

Dummy variables are binary variables that are used to represent the presence or absence of a certain category or attribute. In machine learning, they are often used to represent categorical features that can take on multiple values. Since dummy variables only take on values of 0 or 1, they are already on a scale comparable to the standardized features, and standardizing them would make them harder to interpret, because a value would no longer show directly which category a row belongs to.

Standardization typically produces values roughly between -3 and 3, so there is no need to apply feature scaling to dummy variables, which are already 0 or 1.
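
As a sketch of the points above, the block below standardizes only the numeric columns of a small hypothetical array (one dummy column followed by Age and Salary; `train`, `test`, and `scaler` are illustrative names). It checks that the scaled training columns have mean 0 and standard deviation 1, that the dummy column is untouched, and that the test row is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: one dummy column, then Age and Salary.
train = np.array([
    [1.0, 44.0, 72000.0],
    [0.0, 27.0, 48000.0],
    [0.0, 30.0, 54000.0],
    [1.0, 38.0, 61000.0],
])
test = np.array([[0.0, 35.0, 58000.0]])

scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Fit on the training rows only, then reuse the same statistics for the test row.
train_scaled[:, 1:] = scaler.fit_transform(train[:, 1:])
test_scaled[:, 1:] = scaler.transform(test[:, 1:])

print(train_scaled[:, 1:].mean(axis=0))  # approximately [0, 0]
print(train_scaled[:, 1:].std(axis=0))   # approximately [1, 1]
print(train_scaled[:, 0])                # dummy column unchanged: [1. 0. 0. 1.]

# The test value equals (x - training mean) / training standard deviation.
expected_age = (35.0 - train[:, 1].mean()) / train[:, 1].std()
print(np.isclose(test_scaled[0, 1], expected_age))  # True
```

Calling `fit_transform` on the test set instead would compute a separate mean and standard deviation from the test rows, so the two sets would no longer be on the same scale.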