<h1 style=text-align:center> Data Preparation </h1>

## Divided into 6 parts
* Common Data Preparation Tasks
* Data Cleaning
* Feature Selection
* Data Transformation
* Feature Engineering
* Dimensionality Reduction

### Data Preparation - Without data leakage
Data preparation is the process of transforming raw data into a form that is more suitable for modeling.

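To avoid data leakage, split the dataset first, before applying any data preparation technique.
Then fit each technique on the training set only, so that no information from the test set (for example, its minimum and maximum values) influences training.
The steps are as follows:
* Split the data into train and test sets
* Fit the data preparation on the training set
* Apply the data preparation to the train and test sets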
* Evaluate models
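
The four steps above can be sketched with scikit-learn's `Pipeline`, which fits the scaler on the training data and reuses the fitted scaler at prediction time. This is a minimal illustration, not the only way to do it; the dataset sizes and the seed are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data; sizes and seed are illustrative assumptions.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)

# 1. Split the data first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# 2. Fit the data preparation (and the model) on the training set only.
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# 3. predict() applies the already-fitted scaler to the test set.
y_hat = pipeline.predict(X_test)

# 4. Evaluate the model.
accuracy = accuracy_score(y_test, y_hat)
print(f"Accuracy is {accuracy * 100:.2f}")
```

Bundling the scaler and the model in one object makes it hard to fit the scaler on test rows by accident.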

#### Data preparation with train and test sets

In [28]:
# Prepare a Classification dataset
from sklearn.datasets import make_classification

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=12)

# summarize the dataset
print(X.shape, y.shape)

(1000, 20) (1000,)


#### Train-test evaluation with Naive data preparation

In [29]:
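# Normalize the data set to the [0, 1] range
# (naive: the scaler is fit on all rows before the split, so test data leaks into it)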
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)


In [30]:
# split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=12)

In [31]:
# fit the model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

In [32]:
# evaluate the model
y_hat = model.predict(X_test)

# evaluate predictions
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_hat)
print("Accuracy is %s" %(round(accuracy*100,2)))

Accuracy is 78.18
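
To make the leak concrete, the sketch below compares the per-feature range a `MinMaxScaler` learns from the full dataset with the range it learns from the training split alone. It reuses the dataset settings from the cells above. The variable names (`scaler_all`, `scaler_train`, `n_affected`) are hypothetical, and the count printed depends on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=12)

# Naive: the scaler sees every row, including the future test rows.
scaler_all = MinMaxScaler().fit(X)
# Correct: the scaler sees the training rows only.
scaler_train = MinMaxScaler().fit(X_train)

# A feature is affected when its learned min or max came from a test row.
min_from_test = scaler_all.data_min_ < scaler_train.data_min_
max_from_test = scaler_all.data_max_ > scaler_train.data_max_
n_affected = int(np.sum(min_from_test | max_from_test))
print(f"Features whose range was set by test rows: {n_affected} of {X.shape[1]}")
```

Any feature counted here was scaled using knowledge of the test set, which is exactly the information the model should not have during training.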


#### Train-test evaluation with Correct Data Preparation

In [33]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# create dataset
X, y = make_classification(n_samples = 1000, n_features = 20, n_informative=15, n_redundant=5, random_state=1)

# Divide data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 1)

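# Data normalization: fit the scaler on the training set only,
# then apply the same fitted scaler to both train and test sets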
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# model building
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluating model
y_hat = model.predict(X_test)
accuracy = accuracy_score(y_test, y_hat)
print("Accuracy is %s" % (round(accuracy * 100, 2)))

Accuracy is 87.88


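> The naive run scores 78.18% and the correct run scores 87.88%. However, the two runs use different `random_state` values (12 and 1), so they differ in dataset and split as well as in preparation order, and the gap cannot be attributed to data leakage alone. A fair comparison uses the same seed for both runs.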
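
To isolate the effect of the preparation order, both variants can be run on one dataset and one split. The following is a minimal sketch under that assumption; `SEED`, `naive_accuracy`, and `correct_accuracy` are hypothetical names introduced for illustration, and no particular result is implied.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

SEED = 1  # hypothetical choice; any fixed value works as long as both runs share it

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=SEED)


def naive_accuracy(X, y):
    """Scale all rows first, then split (test statistics leak into the scaler)."""
    X_scaled = MinMaxScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.33, random_state=SEED)
    model = LogisticRegression().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


def correct_accuracy(X, y):
    """Split first, then fit the scaler on the training rows only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=SEED)
    scaler = MinMaxScaler().fit(X_train)
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)
    return accuracy_score(y_test, model.predict(scaler.transform(X_test)))


acc_naive = naive_accuracy(X, y)
acc_correct = correct_accuracy(X, y)
print(f"Naive:   {acc_naive * 100:.2f}")
print(f"Correct: {acc_correct * 100:.2f}")
```

On a small, well-behaved synthetic dataset the two numbers may turn out close. The naive estimate is still the less trustworthy one, because it was computed with information from the test set.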