# Data Preprocessing using Machine Learning & Artificial Intelligence

## Importing Libraries


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

- Next, we load the data set.
- We use the pandas iloc indexer, short for "integer location", to select rows and columns by position.
- Slicing separates the feature columns (everything except the last column) from the last column, which records whether the patient needed to go to hospital.

In [None]:
data = pd.read_csv('Covid_Data_new.csv')
a = data.iloc[:, :-1].values
b = data.iloc[:, -1].values

In [None]:
print(a)

In [None]:
print(b)
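To illustrate the iloc slicing used above, here is a minimal sketch on a small made-up frame. The column names and values are assumptions for illustration only and are not taken from Covid_Data_new.csv.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: column names and values are made up.
demo = pd.DataFrame({
    "age": [34, 51, 27],
    "gender": ["F", "M", "F"],
    "hospitalized": ["No", "Yes", "No"],
})

features = demo.iloc[:, :-1].values  # every row, all columns except the last
target = demo.iloc[:, -1].values     # every row, only the last column

print(features.shape)  # (3, 2)
print(target)
```

The first slice position selects rows and the second selects columns, so `:-1` means "up to but not including the last column" and `-1` means "the last column".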

- As we can see, a few rows have missing values, so we fill them in using the SimpleImputer transformer from sklearn.impute.
- Because the array mixes numeric and string columns, the mean strategy can only be applied to numeric columns, so we apply the imputer twice: once to column 0 and once to column 4.

In [None]:
from sklearn.impute import SimpleImputer
# Replace NaN entries with the mean of the column
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

# Impute the first numeric column (index 0)
imputer.fit(a[:, 0:1])
a[:, 0:1] = imputer.transform(a[:, 0:1])

In [None]:
print(a)

In [None]:
# Refit on the second numeric column (index 4) so it gets its own mean
imputer.fit(a[:, 4:5])
a[:, 4:5] = imputer.transform(a[:, 4:5])

In [None]:
print(a)
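The mean imputation above can be checked on a tiny example. This is a sketch under the assumption of a made-up mixed array with one numeric and one string column; the values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical mixed array: column 0 is numeric with one gap, column 1 is text.
demo = np.array([[30.0, "F"], [np.nan, "M"], [50.0, "F"]], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# The 0:1 slice keeps the column two-dimensional, which the imputer expects.
demo[:, 0:1] = imputer.fit_transform(demo[:, 0:1])

print(demo[1, 0])  # 40.0, the mean of 30 and 50
```

Only the sliced column is passed to the imputer, so the string column is left untouched.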

- Machine learning models work on numbers, so categorical text columns have to be encoded before they can be used.
- With one-hot encoding, we transform one column into multiple columns (C1, C2, C3, etc.) so that the model does not read a false ordering into the categories.
- We need to encode both the dependent variable (the one we are going to predict) and the categorical independent variables.
- OneHotEncoder creates a new binary feature for each possible category and assigns a value of 1 to the feature that corresponds to each sample's original category.
- ColumnTransformer lets us apply different preprocessing to different columns of heterogeneous data and helps avoid data leakage.
- By specifying remainder='passthrough', all remaining columns that were not specified in transformers, but are present in the data passed to fit, are passed through automatically. This subset of columns is concatenated with the output of the transformers.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [None]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
a = np.array(ct.fit_transform(a))

In [None]:
print(a)
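The effect of the column transformer can be seen on a small made-up array. This is a sketch only: the three rows and their values are assumptions, with the categorical column at index 1 as in the cell above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: numeric, categorical (index 1), numeric.
demo = np.array([[30, "F", 1], [45, "M", 0], [50, "F", 1]], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
)
encoded = np.array(ct.fit_transform(demo))

print(encoded.shape)  # (3, 4)
print(encoded)
```

The one-hot columns come first in the output and the passed-through columns follow, so the column positions shift after encoding. This matters for any later step that selects columns by index.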

- Label encoding is a simple and effective way to convert categorical variables into numerical form. By using the LabelEncoder class from scikit-learn, you can easily encode your categorical data and prepare it for further analysis or input into machine learning algorithms.


In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
b = le.fit_transform(b)

In [None]:
print(b)
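A minimal sketch of what LabelEncoder does, using made-up "No"/"Yes" labels (an assumption, since the actual labels in b are not shown here):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "Yes"]  # hypothetical target values

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)  # ['No' 'Yes']
print(encoded)      # [0 1 0 1]
print(le.inverse_transform(encoded))  # maps the integers back to the labels
```

Classes are numbered in sorted order, and inverse_transform recovers the original labels when predictions need to be reported in readable form.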

- We want the machine learning model to learn from what has already happened and to apply what it has learned automatically.
- We want to predict which patients went to hospital, check whether our predictions match what actually happened, and measure the accuracy.
- The train_test_split() method is used to split our data into train and test sets. First, we need to divide our data into features (a) and labels (b). The data gets divided into a_train, a_test, b_train, and b_test. The a_train and b_train sets are used for training and fitting the model, while a_test and b_test are held back to evaluate it.
- "random_state" is a parameter in train_test_split that controls the random number generator used to shuffle the data before splitting it. Setting it to a fixed value ensures that the same randomization is used each time you run the code, resulting in the same splits of the data.

In [None]:
from sklearn.model_selection import train_test_split
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size = 0.2, random_state = 42)

In [None]:
print(a_train)

In [None]:
print(a_test)

In [None]:
print(b_train)

In [None]:
print(b_test)
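The split size and the role of random_state can be checked with a short sketch on made-up arrays. The data here is an assumption: ten rows of consecutive integers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features each
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Repeating the call with the same seed reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)      # (8, 2) (2, 2)
print(np.array_equal(X_test, X_test2))  # True
```

With test_size = 0.2, two of the ten rows go to the test set and the remaining eight are used for training.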

- Feature scaling is a technique for bringing the independent features in the data onto a comparable scale.
- It is performed during data preprocessing to handle features with widely varying magnitudes, values, or units.
- Standardization is a very effective form of feature scaling: it rescales each feature so that its distribution has a mean of 0 and a variance of 1.

$$X_{\text{new}} = \frac{X_i - X_{\text{mean}}}{\text{Standard Deviation}}$$


In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
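# Fit on the training columns only, then reuse the same mean and standard deviation for the test set
a_train[:, 6:] = sc.fit_transform(a_train[:, 6:])
a_test[:, 6:] = sc.transform(a_test[:, 6:])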
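As a quick check of the standardization formula, the sketch below compares StandardScaler with a manual calculation. The single-feature arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # hypothetical training feature
test = np.array([[4.0]])                 # hypothetical test feature

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean and std from train only
test_scaled = sc.transform(test)        # reuses the training mean and std

# Manual version of the formula: (x - mean) / standard deviation
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_scaled, manual))  # True
print(train_scaled.mean(), train_scaled.std())
```

The scaler is fitted on the training data and only applied to the test data, so no information from the test set influences the scaling.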