## Applied ML Process: Basic Steps
 Step 1: Define Problem.

 Step 2: Prepare Data.

 Step 3: Evaluate Models.

 Step 4: Finalize Model.

### Data Prep:

Data preparation is the transformation of raw data into a form that is more suitable for modeling.

Its goal is to discover how to best expose the underlying
structure of the problem to the learning algorithms.

### Why Data Prep:

* Machine learning algorithms require data to be numbers.
* Some machine learning algorithms impose requirements on the data.
* Statistical noise and errors in the data may need to be corrected.
* Complex nonlinear relationships may be teased out of the data.
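
As a minimal sketch of the first point, assuming a small made-up categorical column (the `colors` array is hypothetical, not part of this notebook's dataset), scikit-learn's `OneHotEncoder` can turn string labels into numeric columns:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data: a single categorical column with three distinct values
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# one-hot encoding creates one 0/1 column per distinct category
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # the categories found in the column
print(encoded.shape)           # rows x number of distinct categories
```

Each row ends up with exactly one column set to 1, which is a form most learning algorithms can consume directly.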

## 1. Defining the problem:

 Gather data from the problem domain.

 Discuss the project with subject matter experts.

 Select those variables to be used as inputs and outputs for a predictive model.

 Review the data that has been collected.

 Summarize the collected data using statistical methods.

 Visualize the collected data using plots and charts.
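
The "summarize" step above can be sketched with pandas. This is only an illustration: it assumes pandas is installed, reuses the synthetic `make_classification` settings from later in this notebook, and the column names `x0`, `x1`, ... are made up for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification

# synthetic data standing in for data gathered from the problem domain
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5, random_state=7)

# wrap the arrays in a DataFrame with illustrative column names
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[1])])
df["target"] = y

# per-column summary statistics (count, mean, std, min, quartiles, max)
summary = df.describe()
print(summary.loc[["mean", "std", "min", "max"]].round(2))

# class balance of the output variable
class_counts = df["target"].value_counts()
print(class_counts)
```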

## 2. Data Prep:



 **Data Cleaning:** Identifying and correcting mistakes or errors in the data.

 **Feature Selection:** Identifying those input variables that are most relevant to the task.

 **Data Transforms:** Changing the scale or distribution of variables.

 **Feature Engineering:** Deriving new variables from available data.

 **Dimensionality Reduction:** Creating compact projections of the data.
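
One possible scikit-learn counterpart for each of these five categories is sketched below. The specific classes chosen (`SimpleImputer`, `SelectKBest`, `MinMaxScaler`, `PolynomialFeatures`, `PCA`) and all parameter values are assumptions made for illustration, not the only options. Each transform is fit on all rows here purely to show the API; when the goal is to estimate model performance, transforms should be fit on the training data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=5, n_redundant=2, random_state=7)

# Data cleaning: introduce one missing value, then fill it with the column mean
X_dirty = X.copy()
X_dirty[0, 0] = np.nan
X_clean = SimpleImputer(strategy="mean").fit_transform(X_dirty)

# Feature selection: keep the 5 features with the highest ANOVA F-score
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X_clean, y)

# Data transform: rescale every feature to the range [0, 1]
X_scaled = MinMaxScaler().fit_transform(X_selected)

# Feature engineering: add squared and pairwise interaction terms
X_engineered = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)

# Dimensionality reduction: project onto the first 3 principal components
X_reduced = PCA(n_components=3).fit_transform(X_engineered)

print(X_clean.shape, X_selected.shape, X_engineered.shape, X_reduced.shape)
```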

# Data Preparation Without Data Leakage

> Data Leakage: Data leakage refers to a problem where information about the
holdout dataset, such as a test or validation dataset, is made available to the model in the
training dataset.

It's like your future self (test data) coming back to advise you on a decision you are making now (model prediction).

> How does data leakage happen?

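Data leakage occurs when data preparation techniques (such as normalization or standardization) are fit on the entire dataset before it is split. The statistics they learn, for example each column's minimum and maximum or its mean and standard deviation, then include the test rows, so the training data carries information about the holdout set.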
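
To make the mechanism concrete, the sketch below compares a `MinMaxScaler` fit on all rows with one fit on the training rows only. It assumes the same synthetic dataset and split settings used later in this notebook; the names `scaler_all`, `scaler_train`, and `differs` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33, random_state=1)

# scaler fit on every row (leaky) versus on the training rows only
scaler_all = MinMaxScaler().fit(X)
scaler_train = MinMaxScaler().fit(X_train)

# a feature "differs" when its min or max over all rows comes from a test row
differs = (scaler_all.data_min_ != scaler_train.data_min_) | \
          (scaler_all.data_max_ != scaler_train.data_max_)
print("features whose range is set by test rows:",
      int(differs.sum()), "of", X.shape[1])
```

Any feature counted here would be scaled differently under the two approaches, which is exactly the information about the test set that reaches the training data.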

> Solution:

The entire data preparation pipeline must be fit only on the training dataset to avoid data leakage. This includes data transforms, but also other techniques such as feature selection, dimensionality reduction, feature engineering and more.

Steps: 
1. Split Data.
2. Fit Data Preparation on Training Dataset.
3. Apply Data Preparation to Train and Test Datasets.
4. Evaluate Models.



### Data Preparation With Train and Test Sets

We will evaluate a **logistic regression** model using train and test sets on a synthetic
binary classification dataset where the input variables have been normalized.

In [4]:
#libraries required for this notebook
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# use make_classification to create a test classification dataset
# with 1000 rows and 20 numerical input features
from sklearn.datasets import make_classification

#define dataset
X , y = make_classification(n_samples=1000, n_features=20, 
                            n_informative=15, n_redundant=5, random_state=7)

#summarize the ds
print(X.shape, y.shape)

(1000, 20) (1000,)


### Train-Test Evaluation With Naive (**incorrect**) Data Preparation

In [7]:
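#normalising the ds (the scaler is fit on ALL rows: this is the leak)
scaler = MinMaxScaler()

# store the result under a new name so the raw X stays available for later sections
X_scaled = scaler.fit_transform(X)

In [9]:
#split into train 67% and test 33%

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
                                                   test_size = 0.33, random_state=1) 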

In [10]:
#fit the model - logistic regression
model = LogisticRegression()
model.fit(X_train,y_train)

LogisticRegression()

In [19]:
#evaluate the model
yhat = model.predict(X_test)
#evaluate predictions
acc = accuracy_score(y_test,yhat)
print("Accuracy of the model (incorrect data prep) %.3f" %(acc*100))

Accuracy of the model (incorrect data prep) 84.848


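This estimate of model accuracy is unreliable: the scaler was fit on all rows before the split, so the minimum and maximum values of the test rows leaked into the training data.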

### Train-Test Evaluation With Correct Data Preparation

In [20]:
#split the dataset first
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size = 0.33, random_state=1) 

In [21]:
#apply the necessary transformations separately

# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)

In [22]:
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 85.455


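This is the correct procedure: because the scaler is fit on the training set only, no information from the test set reaches the model, and the estimate can be trusted. Here the naive estimate (84.848) came out slightly lower than the correct one (85.455); leakage does not always inflate the score, but it always makes the estimate unreliable.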

### Data Preparation With k-fold Cross Validation

Cross validation: https://www.datamuni.com/@abhishek/basics-of-cross-validation

With k-fold cross-validation, the data preparation must be fit on the training folds only
and then applied to both the training folds and the held-out fold, once per split.
The evaluation procedure therefore changes from simply and incorrectly evaluating just the model
to correctly evaluating the entire pipeline of data preparation and model together as a single
atomic unit. This can be achieved using the **Pipeline** class. This class takes a list of steps
that define the pipeline. Each step in the list is a tuple with two elements. The first element is
the name of the step (a string) and the second is the configured object of the step, such as a
transform or a model. The model is only supported as the final step, although we can have as
many transforms as we like before it in the sequence.
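
As a minimal sketch of a pipeline with more than one transform, the example below chains a scaler, a PCA projection, and the model. The PCA step and its `n_components=10` value are hypothetical additions for illustration only; the dataset and split settings are assumed to match the rest of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33, random_state=1)

# each step is a (name, object) tuple; the model comes last
pipeline = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("pca", PCA(n_components=10)),  # hypothetical extra transform
    ("model", LogisticRegression()),
])

# fitting the pipeline fits every transform on the training rows only
pipeline.fit(X_train, y_train)

# the fitted scaler's minimums match the training data, not the full dataset
fitted_scaler = pipeline.named_steps["scaler"]
print(np.allclose(fitted_scaler.data_min_, X_train.min(axis=0)))

# score() applies the already-fitted transforms to the test rows, then predicts
test_accuracy = pipeline.score(X_test, y_test)
print("Accuracy: %.3f" % (test_accuracy * 100))
```

Because the transforms live inside the pipeline, there is no separate step where they could accidentally be fit on the test rows.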

In [23]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [26]:
#define the pipeline

steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)

steps

[('scaler', MinMaxScaler()), ('model', LogisticRegression())]

In [27]:
#define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

#evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, 
                         scoring='accuracy', cv=cv, n_jobs=-1)

In [28]:
#report the performance
print("Accuracy of k-fold CV: %.3f (%.3f)" %(mean(scores)*100, std(scores)*100))

Accuracy of k-fold CV: 85.433 (3.471)
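
For comparison, the naive version of cross-validation (scaling the whole dataset once, then cross-validating only the model) can be sketched next to the pipeline version. This is an illustrative sketch that assumes the same dataset and evaluation settings as above; the names `naive_scores` and `pipeline_scores` are made up for the example, and the exact numbers may vary with the scikit-learn version.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5, random_state=7)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# naive (incorrect): the scaler sees every row, including each held-out fold
X_leaky = MinMaxScaler().fit_transform(X)
naive_scores = cross_val_score(LogisticRegression(), X_leaky, y,
                               scoring="accuracy", cv=cv)

# correct: the scaler is re-fit on the training folds inside every split
pipeline = Pipeline(steps=[("scaler", MinMaxScaler()),
                           ("model", LogisticRegression())])
pipeline_scores = cross_val_score(pipeline, X, y,
                                  scoring="accuracy", cv=cv)

print("Naive CV:    %.3f (%.3f)" % (mean(naive_scores) * 100, std(naive_scores) * 100))
print("Pipeline CV: %.3f (%.3f)" % (mean(pipeline_scores) * 100, std(pipeline_scores) * 100))
```

The two estimates are often close on a dataset like this one, but only the pipeline version keeps the held-out fold out of the data preparation.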


#### API


* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html


* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html