<h1 style=text-align:center;color:brown;font:bold> Data Preprocessing </h1>

## Divided into 6 parts
* Common data preparation tasks
* Data cleaning
* Feature selection
* Data transformation
* Feature engineering
* Dimensionality reduction

<h1 style=text-align:center;color:blue;font:bold> Data Preparation </h1>

### Data Preparation - Without data leakage
Data preparation means transforming raw data into a form that is more suitable for modeling.

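To avoid data leakage, split the dataset first, before applying any data preparation technique.
Each preparation step is then fitted on the training set only, so that no statistic (for example, a feature's minimum or maximum) is computed from test rows.
The steps are as follows:
* Split the data
* Fit the data preparation on the training set
* Apply the fitted data preparation to both the training and test sets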
* Evaluate models
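
To make the leak concrete, here is a minimal sketch (the dataset parameters and the variable names `leaky_scaler`, `clean_scaler`, and `n_changed` are chosen here for illustration). It fits one `MinMaxScaler` on all rows and one on the training rows only, then counts the features whose learned minimum or maximum differs, i.e. features whose scaling was influenced by test rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Illustrative dataset; the seed is an arbitrary choice
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# Leaky: min/max are computed from all rows, including the future test rows
leaky_scaler = MinMaxScaler().fit(X)

# Leak-free: min/max are computed from the training rows only
clean_scaler = MinMaxScaler().fit(X_train)

# Count features whose learned range changes when test rows are included
range_differs = ((leaky_scaler.data_min_ != clean_scaler.data_min_) |
                 (leaky_scaler.data_max_ != clean_scaler.data_max_))
n_changed = int(np.sum(range_differs))
print("Features whose range depends on test rows:", n_changed)

# With the leak-free scaler, scaled test values may fall outside [0, 1];
# that is expected, because the test rows were never seen during fitting
X_test_scaled = clean_scaler.transform(X_test)
print("Scaled test range:", X_test_scaled.min(), X_test_scaled.max())
```

Any feature counted in `n_changed` had its scaling range set, at least in part, by a row that should have stayed unseen until evaluation.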

#### Data preparation with train and test sets

In [3]:
# Prepare a Classification dataset
from sklearn.datasets import make_classification

# define dataset 
X, y = make_classification(n_samples = 1000, n_features=20, n_informative=15, n_redundant=5,random_state=12)

# summarize the dataset
print(X.shape, y.shape)

(1000, 20) (1000,)


#### Train-test evaluation with Naive data preparation

In [4]:
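# Normalize the whole dataset with min-max scaling
# (naive: the scaler is fitted before the split, so test rows leak into it)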
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)


In [5]:
# split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=12)

In [6]:
# fit the model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

In [7]:
# evaluate the model
y_hat = model.predict(X_test)

# evaluate predictions
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_hat)
print("Accuracy is %s" %(round(accuracy*100,2)))

Accuracy is 78.18


#### Train-test evaluation with Correct Data Preparation

In [8]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# create dataset
X, y = make_classification(n_samples = 1000, n_features = 20, n_informative=15, n_redundant=5, random_state=1)

# Divide data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 1)

# Data normalization: fit the scaler on the training set only, then apply it to both sets
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# model building
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluating model
y_hat = model.predict(X_test)
Accuracy = accuracy_score(y_test, y_hat)
print("Accuracy is %s"%(round(Accuracy*100,2)))

Accuracy is 87.88


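> The two scores differ considerably (78.18 vs 87.88), but the runs use different `random_state` values for both the dataset and the split, so the gap cannot be attributed to data leakage alone. With min-max scaling on a dataset like this, the effect of leakage is often small; the correct procedure matters because it gives an unbiased estimate.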
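
For a like-for-like comparison, both approaches can be run on the same dataset and the same split. The sketch below assumes a shared seed (`SEED`, a name introduced here for illustration) and reuses the same estimator settings as the cells above; the resulting numbers depend on the seed and are not taken from the original runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

SEED = 1  # hypothetical choice; any fixed value works if both runs share it

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=SEED)

# Naive: scale all rows first, then split
X_all_scaled = MinMaxScaler().fit_transform(X)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_all_scaled, y, test_size=0.33, random_state=SEED)
naive_model = LogisticRegression().fit(Xn_train, yn_train)
naive_acc = accuracy_score(yn_test, naive_model.predict(Xn_test))

# Correct: split first (same seed, so the same rows land in each set),
# then fit the scaler on the training rows only
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y, test_size=0.33, random_state=SEED)
scaler = MinMaxScaler().fit(Xc_train)
correct_model = LogisticRegression().fit(scaler.transform(Xc_train), yc_train)
correct_acc = accuracy_score(yc_test,
                             correct_model.predict(scaler.transform(Xc_test)))

print("Naive accuracy:   %.2f" % (naive_acc * 100))
print("Correct accuracy: %.2f" % (correct_acc * 100))
```

Because only the order of scaling and splitting differs, any remaining gap between the two printed values can be attributed to the leak.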

### Data preparation with K-Fold cross validation

#### Cross validation with naive data preparation

In [9]:
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score

# making a dataset
X, y = make_classification(n_samples=1000, n_features = 20, n_informative=15, n_redundant=5, random_state=123)

# data normalization (naive: the scaler is fitted on all rows, before any folds are made)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# define the model
model = LogisticRegression()

# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model using cross validation score
scores = cross_val_score(model, X_scaled, y, scoring="accuracy", cv=cv, n_jobs=-1)

# report performance
print("Accuracy is mean : %s, std : %s"%(round(mean(scores)*100,2),round(std(scores)*100,2)))

Accuracy is mean : 80.47, std : 3.84
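
The leak in the naive version is that every validation fold was already seen by the scaler. As a rough sketch of what a leak-free evaluation has to do inside each fold, the loop below refits the scaler on the training part of every fold by hand. It uses a single, non-repeated `StratifiedKFold` for brevity, which is a simplification of the repeated procedure above; `fold_scores` is a name introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=123)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
fold_scores = []

for train_idx, val_idx in cv.split(X, y):
    # Fit the scaler on this fold's training rows only
    scaler = MinMaxScaler().fit(X[train_idx])

    # Train on the scaled training rows
    model = LogisticRegression()
    model.fit(scaler.transform(X[train_idx]), y[train_idx])

    # Score on validation rows, scaled with the training-fold statistics
    fold_scores.append(model.score(scaler.transform(X[val_idx]), y[val_idx]))

print("Accuracy is mean : %.2f, std : %.2f"
      % (np.mean(fold_scores) * 100, np.std(fold_scores) * 100))
```

Writing this loop by hand is error-prone, which is why wrapping the scaler and the model in a `Pipeline` is usually the more convenient way to get the same per-fold behavior.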


#### Cross validation with correct data preparation

In [11]:
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1234)

# split the data; the test set is held out and is not used during cross-validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

# define scaling
scaler = MinMaxScaler()

# define model
model = LogisticRegression()

# define pipeline steps
steps = []
steps.append(("min_max_scaler",scaler))
steps.append(("logistic_model",model))
pipeline = Pipeline(steps = steps)

# define crossvalidation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1234)

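# Evaluate model with cross validation
# (the pipeline refits the scaler on each training fold; with no `scoring` given,
# the classifier's default score, accuracy, is used)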
scores = cross_val_score(pipeline, X_train, y_train, cv = cv, n_jobs=-1)

# Printing accuracies
print("Accuracy : Mean is %s, Standard deviation is %s"%(round(mean(scores),2),round(std(scores),2)))

Accuracy : Mean is 0.85, Standard deviation is 0.04
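
The cell above splits off `X_test` and `y_test` but never uses them. One possible final step, sketched here under the same dataset and split settings, is to fit the pipeline on the full training set and score it once on the held-out rows; `test_accuracy` is a name introduced for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Same dataset and split settings as the cross-validation cell
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1234)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123)

pipeline = Pipeline(steps=[
    ("min_max_scaler", MinMaxScaler()),
    ("logistic_model", LogisticRegression()),
])

# fit() learns the scaling range and the model from the training rows only
pipeline.fit(X_train, y_train)

# score() scales the test rows with the training-set range, then predicts
test_accuracy = pipeline.score(X_test, y_test)
print("Held-out accuracy : %.2f" % test_accuracy)
```

Keeping the test set untouched until this point means the held-out score is an independent check on the cross-validation estimate.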
