# What is Data Preprocessing?

 Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.

 Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors.

 Data preprocessing is a proven method of resolving such issues.

# Steps in Data Preprocessing

 Step 1 : Import the libraries

 Step 2 : Import the data-set

 Step 3 : Check out the missing values

 Step 4 : See the Categorical Values

 Step 5 : Splitting the data-set into Training and Test Set

 Step 6 : Feature Scaling

# STEP 1 : Import the Libraries

In [1]:
# step 1 importing modules 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings 
warnings.filterwarnings('ignore')

# STEP 2 : Import the Data Set

In [None]:
# step 2  importing the data set 

dataset= pd.read_csv('WorldCupMatches.csv')

In [None]:
dataset

In [None]:
dataset.shape

In [None]:
dataset.head()

In [None]:
dataset.columns

In [None]:
dataset.index

In [None]:
dataset.info()

# STEP 3 : Check out the Missing Values

 The concept of missing values is important to understand in order to successfully manage data.

 If the researcher does not handle the missing values properly, they may end up drawing inaccurate inferences about the data.

 With improper handling, the results obtained can differ from those that would have been obtained had the missing values been present.

 There are two common ways to handle missing values. First, count the nulls in each column:

In [None]:
dataset.isnull().sum()

In [None]:
dataset.shape

# Method 1 : Deleting Rows or Columns

 This method is commonly used to handle null values.

 Here, we either delete a particular row if it has a null value for a particular feature, or delete a particular column if more than 75% of its values are missing.

 This method is advised only when there are enough samples in the data set.
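
 The cell below calls dropna on rows only. As a minimal sketch of the column rule described above, the following toy example drops columns that are more than 75% empty before dropping the remaining incomplete rows. The frame and its column names (year, home_goals, notes) are hypothetical and are not taken from WorldCupMatches.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration
df = pd.DataFrame({
    "year": [1930, 1934, np.nan, 1950, 1954],
    "home_goals": [4, np.nan, 3, 2, 1],
    "notes": [np.nan, np.nan, np.nan, np.nan, "replayed"],
})

# Fraction of missing values in each column
missing_ratio = df.isnull().mean()

# Drop the columns where more than 75% of the values are missing
sparse_cols = missing_ratio[missing_ratio > 0.75].index
df_reduced = df.drop(columns=sparse_cols)

# Then drop every remaining row that still contains a null
df_clean = df_reduced.dropna()

print(missing_ratio)
print(df_clean.shape)
```

 Dropping the sparse column first keeps more rows than calling dropna directly, because a mostly empty column would otherwise remove almost every row.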

In [None]:
dataset.dropna(inplace=True)

In [None]:
dataset.isnull().sum()

In [None]:
dataset.shape

# Method 2 : Replacing with the Mean, Median or Mode

 This strategy can be applied to a feature that holds numeric data, such as the Year column or the home team goals column.

 We can calculate the mean, median or mode of the feature and replace the missing values with it.

 This is an approximation, so it can distort the data set; for example, filling every gap with the mean shrinks the variance of the feature.

 But the loss of data can be avoided with this method, which often yields better results than removing rows and columns.

 Replacing missing values with one of the above three statistics is a statistical approach to handling them.

 If the statistic is computed on the full data set before it is split into training and test sets, information from the test set reaches the model; this is called leaking the data while training.
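
 One way to avoid leaking data while imputing is to compute the replacement statistic on the training rows only and reuse it on the test rows. The sketch below assumes scikit-learn's SimpleImputer and a hypothetical single-column array; it is an illustration rather than part of the World Cup workflow in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical numeric feature with two gaps
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan], [7.0], [8.0]])

# Split first, so the test rows cannot influence the imputation statistic
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# strategy can also be "median" or "most_frequent" (the mode)
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # learns the mean from training rows only
X_test_filled = imputer.transform(X_test)        # reuses that training mean

print(imputer.statistics_)
print(X_test_filled)
```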

 Another way is to approximate a missing value from its neighbouring values (interpolation). This works better if the data is roughly linear.
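
 Filling a gap from its neighbours can be sketched with pandas' interpolate method. The series below is hypothetical and deliberately linear, which is the case where this approach works best.

```python
import numpy as np
import pandas as pd

# Hypothetical, evenly spaced series with gaps
s = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Linear interpolation fills each gap on the straight line between its known neighbours
filled = s.interpolate(method="linear")

print(filled.tolist())
```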

In [None]:
# Method 2: replace the NaN values with the mean, median or mode

dataset['Year'].mean()

In [None]:
dataset['Year'].fillna(dataset['Year'].mean()).tail()

In [None]:
dataset['Year'].tail()

# STEP 4 : See the Categorical Values

 Machine learning models are based on mathematical equations, so you can intuitively understand that keeping categorical data in those equations would cause problems: we only want numbers in the equations.

In [None]:
import pandas as pd

In [None]:
# importing Data csv file

dataset = pd.read_csv('Data.csv')

In [None]:
dataset

In [None]:
dataset['Purchased'] = dataset['Purchased'].map({'Yes': 1, 'No': 0})

In [None]:
dataset

In [None]:
dataset.tail()

In [None]:
X = dataset.iloc[:, :-1].values
X

In [None]:
dataset.isnull().sum()

In [None]:
dataset.dropna(inplace=True)

In [None]:
dataset.isnull().sum()

# Working on Categorical Values

 So, we need to encode the categorical variables.

 To convert a categorical variable into numerical data we can use the LabelEncoder class from the sklearn.preprocessing module.
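
 As a small illustration of what the encoding produces, the sketch below label-encodes a hypothetical list of countries and then one-hot encodes the same values. Integer codes imply an order between categories (here Spain > Germany > France), which is the usual reason for following up with one-hot columns for features that have no natural order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical country values for illustration
countries = pd.Series(["France", "Spain", "Germany", "Spain", "France"])

# Label encoding: one integer per category, assigned in sorted order
encoder = LabelEncoder()
codes = encoder.fit_transform(countries)
print(list(encoder.classes_))
print(codes.tolist())

# One-hot encoding: one 0/1 column per category, with no implied order
dummies = pd.get_dummies(countries, dtype=int)
print(dummies)
```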

In [None]:
# working on categorical Data

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [None]:
X =dataset.iloc[:,:-1].values
X

In [None]:
X[: ,0]

In [None]:
X[:,0] =label_encoder.fit_transform(X[:,0])
X[:,0]

In [None]:
import numpy as np

In [None]:
X=np.array(X,dtype='int32')
X

In [None]:
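from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22; ColumnTransformer selects column 0 instead
onehotencoder = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')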

In [None]:
X

In [None]:
x = onehotencoder.fit_transform(X)
print(x)

In [None]:
dummy =pd.get_dummies(dataset['Country'])
dummy

In [None]:
dataset = pd.concat([dataset,dummy],axis=1)
dataset

In [None]:
dataset

In [None]:
dataset = dataset.drop(['Country'],axis=1)

In [None]:
dataset

In [None]:
dataset.dropna(inplace=True)

In [None]:
dataset

# STEP 5 : Splitting the Data Set into Training and Test Set

In [None]:
y = dataset['Purchased'].values  # target column, selected by name rather than by position
y

In [None]:
# train_test_split used to live in sklearn.cross_validation, which was removed in scikit-learn 0.20

from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2)
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test

In [None]:
X

# STEP 6 : Feature Scaling


# What is Feature Scaling?


 Feature scaling is the method to limit the range of variables so that they can be compared on common grounds.

 Suppose we have the data set loaded above from Data.csv, with Country, Age, Salary and Purchased columns.

 Look at the Age and Salary columns. You can easily notice that Salary and Age are not on the same scale, and this can cause issues in your machine learning model.

 This is because many machine learning models, such as k-nearest neighbours and k-means, are based on Euclidean distance.

 Let’s say we take two rows and compare their Age and Salary values.

 One can easily compute the distance and see that the Salary column dominates it, which we do not want.
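
 The computation can be sketched with two hypothetical (Age, Salary) rows. The numbers are made up for illustration and are not read from Data.csv.

```python
import numpy as np

# Two hypothetical rows: (age, salary)
a = np.array([27.0, 48000.0])
b = np.array([48.0, 79000.0])

diff = a - b
squared = diff ** 2

# Euclidean distance between the two rows
distance = np.sqrt(squared.sum())

# Share of the squared distance contributed by each feature
contribution = squared / squared.sum()

print(distance)
print(contribution)
```

 The age difference of 21 years barely changes the result, because the squared salary difference is several orders of magnitude larger.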


# Feature Scaling or Standardization:

 It is a step of data preprocessing that is applied to the independent variables, or features, of the data.

 It basically helps to normalise the data within a particular range.

 Sometimes, it also helps in speeding up the calculations in an algorithm.


from sklearn.preprocessing import StandardScaler

 Formula used in the backend: standardisation replaces each value by its z-score, z = (x - mean) / standard deviation, computed separately for each feature.
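
 To make the z-score concrete, the sketch below computes it by hand on a hypothetical Age/Salary array and compares the result with StandardScaler. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows
X = np.array([
    [27.0, 48000.0],
    [35.0, 54000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
])

# z = (x - mean) / std, computed column by column
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
z_manual = (X - mean) / std

z_sklearn = StandardScaler().fit_transform(X)

print(z_manual)
print(np.allclose(z_manual, z_sklearn))
```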

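 The fit(X, y=None) method learns the scaling parameters (the mean and standard deviation of each feature); transform then applies them, and fit_transform does both in one call.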
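
 A common way to use fit and transform separately is to fit the scaler on the training split and apply the same parameters to the test split. The sketch below assumes randomly generated Age and Salary values, which are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 20 rows of (age, salary)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(20, 60, size=20),
    rng.integers(30000, 90000, size=20),
]).astype(float)

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                        # learn mean and std from the training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training mean and std

print(scaler.mean_, scaler.scale_)
```

 The scaled test rows will not have exactly zero mean and unit variance, which is expected: they are scaled with the training statistics.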

In [None]:
import pandas as pd 
from sklearn.preprocessing import StandardScaler 
# Read Data from CSV 
data = pd.read_csv('Data.csv') 
print(data.head() )
print('----------------')
print(data.isnull().sum())
print('----------------')
data.dropna(inplace =True)
print(data.head())
print('-------------------')
print(data.isnull().sum())
print('-------------------')
data['Purchased'] = data['Purchased'].map({'No': 0, 'Yes': 1})
print(data.head() )
print('----------------')
dummy =pd.get_dummies(data['Country'])
print(dummy)
data = pd.concat([data,dummy],axis=1)
print('------------------')
print(data.head() )
print('----------------')
data.drop(columns='Country',inplace =True)
print(data.head() )
print('----------------')

In [None]:
import warnings 
warnings.filterwarnings('ignore')
# Initialise the Scaler 
print("---------------------")
scaler = StandardScaler() 
# To scale data 
data1=scaler.fit_transform(data) 
print(data1)

In [None]:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler =MinMaxScaler()
data2 =min_max_scaler.fit_transform(data)
print(data2)
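
 MinMaxScaler rescales each feature to the range 0 to 1 by default using (x - min) / (max - min). As a minimal sketch on a hypothetical Age/Salary array, the manual formula and the scaler give the same result:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical (age, salary) rows
X = np.array([
    [27.0, 48000.0],
    [35.0, 54000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
])

# (x - min) / (max - min), computed column by column
x_min = X.min(axis=0)
x_max = X.max(axis=0)
manual = (X - x_min) / (x_max - x_min)

scaled = MinMaxScaler().fit_transform(X)

print(scaled)
print(np.allclose(manual, scaled))
```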