# DATA PREPROCESSING

#### Data preprocessing is a data mining technique that transforms raw data into an understandable format. Raw (real-world) data is often incomplete or inconsistent, and passing it to a model as-is can cause errors or poor results. That is why we need to preprocess data before sending it through a model.

## Steps in Data Preprocessing
The steps covered here are:
1. Import libraries
2. Import dataset
3. Finding missing values
4. Encoding categorical data
5. Data splitting
6. Feature Scaling

# 1. Importing Libraries

#### The main libraries are Pandas and NumPy.
#### Pandas: used for data manipulation and data analysis.
#### NumPy: a fundamental package for scientific computing with Python.

Matplotlib and Seaborn are used for visualization.
Scikit-learn provides the data preprocessing techniques and algorithms.

In [1]:
# main libraries
import pandas as pd
import numpy as np

In [2]:
# visual libraries
import matplotlib.pyplot as plt
import seaborn as sns

# 2. Importing Dataset
#### We need to import the dataset for our machine learning project. Before importing a dataset, we need to set the directory that contains it as the working directory.
### Extracting Dependent and Independent Variables
#### In machine learning it is important to distinguish the matrix of features (the independent variables) from the dependent variable in the dataset.

In [3]:
# Read the data in the CSV file using pandas
data = pd.read_csv("Data_Items_sell.csv")
type(data)

pandas.core.frame.DataFrame

In [4]:
data.shape

(10, 4)

In [5]:
data

Unnamed: 0,Merchants,Age,Amount,Purchased
0,Bharadwaj,44.0,72000.0,No
1,Rahul,27.0,48000.0,Yes
2,Sanath,30.0,54000.0,No
3,Rahul,38.0,61000.0,No
4,Sanath,40.0,,Yes
5,Bharadwaj,35.0,58000.0,Yes
6,Rahul,,52000.0,No
7,Bharadwaj,48.0,79000.0,Yes
8,Sanath,50.0,83000.0,No
9,Bharadwaj,37.0,67000.0,Yes


In [6]:
data.head()

Unnamed: 0,Merchants,Age,Amount,Purchased
0,Bharadwaj,44.0,72000.0,No
1,Rahul,27.0,48000.0,Yes
2,Sanath,30.0,54000.0,No
3,Rahul,38.0,61000.0,No
4,Sanath,40.0,,Yes


In [7]:
data.tail()

Unnamed: 0,Merchants,Age,Amount,Purchased
5,Bharadwaj,35.0,58000.0,Yes
6,Rahul,,52000.0,No
7,Bharadwaj,48.0,79000.0,Yes
8,Sanath,50.0,83000.0,No
9,Bharadwaj,37.0,67000.0,Yes


# 3. Finding missing values
#### If our dataset contains missing data, it may create a serious problem for our machine learning model. Hence it is necessary to handle the missing values present in the dataset.
## Ways to Handle Missing Data
### 1. By deleting the particular rows:
#### This is the simplest way to deal with null values: we delete the specific rows or columns that contain them. The drawback is that deleting data loses information, which matters in a small dataset like this one.
### 2. By calculating the mean:
#### We calculate the mean of each column that contains missing values and put it in place of the missing values. This works for numeric features such as Age and Amount.
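The two approaches can be compared side by side with plain pandas. This is only a minimal sketch: the DataFrame `df` is rebuilt by hand from the Age and Amount values displayed earlier, and the names `dropped` and `filled` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Numeric columns copied from the dataset shown above (illustrative rebuild)
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Amount": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Option 1: delete every row that contains at least one missing value
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its own column
filled = df.fillna(df.mean())

print(dropped.shape)        # rows left after deletion
print(filled.loc[[4, 6]])   # the two rows that originally had a NaN
```

Deleting rows shrinks an already small dataset, which is one reason the notebook continues with mean imputation.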

In [8]:

from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

In [9]:
data.isnull()

Unnamed: 0,Merchants,Age,Amount,Purchased
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,True,False
5,False,False,False,False
6,False,True,False,False
7,False,False,False,False
8,False,False,False,False
9,False,False,False,False


In [10]:
data.isnull().any()

Merchants    False
Age           True
Amount        True
Purchased    False
dtype: bool

In [11]:
data.isnull().any().sum()

2
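The value 2 above is the number of columns that contain at least one missing value, not the number of missing cells. A per-column count is often more informative. The sketch below assumes a hand-built copy of the table shown earlier (`df` and the result names are illustrative).

```python
import numpy as np
import pandas as pd

# Hand-built copy of the dataset displayed above (illustrative)
df = pd.DataFrame({
    "Merchants": ["Bharadwaj", "Rahul", "Sanath", "Rahul", "Sanath",
                  "Bharadwaj", "Rahul", "Bharadwaj", "Sanath", "Bharadwaj"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Amount": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Number of missing cells in each column
missing_per_column = df.isnull().sum()

# Number of columns that have at least one missing cell
columns_with_missing = int(df.isnull().any().sum())

print(missing_per_column)
print(columns_with_missing)
```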

In [12]:
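# Matrix of features: every column except the last (Merchants, Age, Amount)
X = data.iloc[ : , :-1].values
# Dependent variable: the last column (Purchased)
Y = data.iloc[ : , -1].values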

In [13]:
X,Y

(array([['Bharadwaj', 44.0, 72000.0],
        ['Rahul', 27.0, 48000.0],
        ['Sanath', 30.0, 54000.0],
        ['Rahul', 38.0, 61000.0],
        ['Sanath', 40.0, nan],
        ['Bharadwaj', 35.0, 58000.0],
        ['Rahul', nan, 52000.0],
        ['Bharadwaj', 48.0, 79000.0],
        ['Sanath', 50.0, 83000.0],
        ['Bharadwaj', 37.0, 67000.0]], dtype=object),
 array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
       dtype=object))

In [14]:
# Fit the imputer on the numeric columns (Age, Amount),
# then replace each NaN with the mean of its column
imp_mean.fit(X[ : , 1:3])
X[ : , 1:3] = imp_mean.transform(X[ : , 1:3])
X

array([['Bharadwaj', 44.0, 72000.0],
       ['Rahul', 27.0, 48000.0],
       ['Sanath', 30.0, 54000.0],
       ['Rahul', 38.0, 61000.0],
       ['Sanath', 40.0, 63777.77777777778],
       ['Bharadwaj', 35.0, 58000.0],
       ['Rahul', 38.77777777777778, 52000.0],
       ['Bharadwaj', 48.0, 79000.0],
       ['Sanath', 50.0, 83000.0],
       ['Bharadwaj', 37.0, 67000.0]], dtype=object)

# 4. Encoding Categorical Data
#### Categorical data is data that takes one of a limited set of categories. In our dataset, Merchants is a categorical variable.
#### Machine learning models work on mathematics and numbers, so a categorical variable stored as text may cause trouble while building the model.
#### So, it is necessary to encode these categorical variables into numbers.

In [15]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [16]:
labelencoder_X = LabelEncoder()
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0])

In [17]:
X

array([[0, 44.0, 72000.0],
       [1, 27.0, 48000.0],
       [2, 30.0, 54000.0],
       [1, 38.0, 61000.0],
       [2, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [1, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [2, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
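Label encoding maps the merchants to 0, 1 and 2, which a model may read as an ordering (Sanath > Rahul > Bharadwaj) that does not exist. One-hot encoding is a common alternative that gives each category its own 0/1 column. The sketch below uses `OneHotEncoder`, which the notebook imports but does not use. The four-row `merchants` array is an illustrative sample, not the full column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative sample of the Merchants column, shaped as (n_samples, 1)
merchants = np.array(["Bharadwaj", "Rahul", "Sanath", "Rahul"]).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array for display
one_hot = encoder.fit_transform(merchants).toarray()

print(encoder.categories_[0])  # categories, in the column order of one_hot
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to its merchant.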

# 5. Data splitting
#### In data preprocessing we divide our dataset into a training set and a test set.
#### A model can be trained very well and reach a very high training accuracy, yet perform worse when it is given new data. So we should not aim only for high training accuracy; we hold out a test set to check how well the model performs on data it has not seen.
### 1. Training Set:
#### A subset of the dataset used to train the machine learning model. The outputs for these rows are already known to the model.
### 2. Test Set:
#### A subset of the dataset used to test the machine learning model. The model predicts the output for these rows, and the predictions are compared with the known values.


In [18]:
# sklearn libraries
from sklearn.model_selection import train_test_split

In [19]:
# Hold out 20% of the rows as the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state = 42)

In [20]:
X_train

array([[0, 35.0, 58000.0],
       [0, 44.0, 72000.0],
       [0, 48.0, 79000.0],
       [2, 30.0, 54000.0],
       [0, 37.0, 67000.0],
       [2, 40.0, 63777.77777777778],
       [1, 38.0, 61000.0],
       [1, 38.77777777777778, 52000.0]], dtype=object)
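To see what `test_size=0.2` means for a 10-row dataset, the sketch below splits stand-in data in the same way and checks the sizes of the two subsets. The `rows` array simply holds row numbers, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 rows identified by their row number
rows = np.arange(10).reshape(-1, 1)
labels = np.array(["No", "Yes", "No", "No", "Yes",
                   "Yes", "No", "Yes", "No", "Yes"])

rows_train, rows_test, labels_train, labels_test = train_test_split(
    rows, labels, test_size=0.2, random_state=42
)

# 80% of the rows go to training, 20% to testing
print(rows_train.shape, rows_test.shape)

# No row appears in both subsets
overlap = set(rows_train.ravel()) & set(rows_test.ravel())
print(overlap)
```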

# 6. Feature Scaling
#### Feature scaling is a technique to standardize the independent variables of the dataset to a common range.
#### In feature scaling, we put our variables on the same scale so that no variable dominates the others.

In [21]:
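from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Learn the mean and standard deviation from the training set only
X_train = sc_X.fit_transform(X_train)
# Reuse the training statistics on the test set; refitting here would scale it differently
X_test = sc_X.transform(X_test)
X_train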

array([[-0.90453403, -0.7529426 , -0.62603778],
       [-0.90453403,  1.00845381,  1.01304295],
       [-0.90453403,  1.79129666,  1.83258331],
       [ 1.50755672, -1.73149616, -1.09434656],
       [-0.90453403, -0.36152118,  0.42765698],
       [ 1.50755672,  0.22561096,  0.05040824],
       [ 0.30151134, -0.16581046, -0.27480619],
       [ 0.30151134, -0.01359102, -1.32850095]])
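`StandardScaler` standardizes each column as z = (x - mean) / std, using the population standard deviation. The sketch below repeats that calculation by hand with NumPy for the Age column only. The training values are taken from `X_train` above and the test values from the two rows of `X` that are not in `X_train`. The variable names are illustrative.

```python
import numpy as np

# Age values of the 8 training rows (349 / 9 is the imputed mean age)
age_train = np.array([35.0, 44.0, 48.0, 30.0, 37.0, 40.0, 38.0, 349 / 9])
# Age values of the 2 held-out rows
age_test = np.array([50.0, 27.0])

# Statistics are computed from the training data only
mean = age_train.mean()
std = age_train.std()  # population standard deviation (ddof=0)

age_train_scaled = (age_train - mean) / std
# The test values are scaled with the training mean and std
age_test_scaled = (age_test - mean) / std

print(age_train_scaled)
print(age_test_scaled)
```

The scaled training values have mean 0 and standard deviation 1, and they agree with the second column of the scaled `X_train` printed above.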