# **Scikit-learn - Topic 2 - Split your data, fit a model, predict and save the model**


## Objectives

Learn and implement the basic workflow for splitting data, fitting a model, running predictions and saving the model.

We will cover how to:

* Split data
* Fit a model
* Run predictions with the fitted model
* Save the model for later use



---

# Change working directory

* We assume you store the notebooks in a subfolder. When you run a notebook in the editor, you therefore need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\ML_practice\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\ML_practice'

## Splitting Data

In supervised learning, we split the data before fitting a model. In conventional ML, as in Scikit-learn, we split the data into Train and Test sets.

The validation set is part of the Train set. When you use a Scikit-learn function for hyperparameter optimisation, the validation set is taken from the Train set automatically through cross-validation. Therefore, we will split the data into Train and Test sets only.
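
As a rough sketch of what "taken automatically" means, the example below runs GridSearchCV on the Train set only. It assumes iris is loaded from scikit-learn (rather than seaborn) so it runs offline, and the parameter grid is hypothetical. train_test_split() and DecisionTreeClassifier() are covered later in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load iris from scikit-learn so the sketch is self-contained
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# cv=5 splits the Train set into 5 folds; each fold takes a turn
# as the validation set while the other 4 are used for fitting
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=101),
    param_grid={"max_depth": [2, 3, 4]},  # hypothetical grid, illustration only
    cv=5,
)
search.fit(X_train, y_train)  # the Test set is never used here

print(search.best_params_)
```

The Test set stays untouched until the final evaluation, which is why a separate validation split is not created by hand.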

In [4]:
#import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

In [5]:
#load iris dataset
df = sns.load_dataset('iris')
df.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


How do you know which variables are features and which one is the target variable?

It depends on the context of your ML project. You need to know, or investigate, the objective of the project to determine the features and the target.

We will select species as our target variable.

In [6]:
df['species'].value_counts()

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

### Split the data

We use train_test_split() to split the data. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). The parameters we will use are:

* The first two are the features and target, respectively. In this case, for the features, you drop species, and for the target, you subset species.
* test_size: the proportion of the data to include in the test set. We set it to 0.2.
* random_state: According to the documentation, it controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. It can be any non-negative integer. We suggest keeping the same random_state value across your project.

random_state is a critical parameter in ML, and we will use it in other use cases. It essentially gives REPRODUCIBILITY to your project: the result you get here, right now, is the same result another person will get elsewhere at another time.

In [7]:
#Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['species'],axis=1),
                                                    df['species'],
                                                    test_size=0.2,
                                                    random_state=101)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

* Train set: (120, 4) (120,) 
* Test set: (30, 4) (30,)
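
To see the effect of random_state in isolation, here is a minimal sketch on toy data (the array is made up for illustration). Two calls with the same random_state return the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # toy data, for illustration only

# Same random_state on both calls, so the shuffling is identical
train_a, test_a = train_test_split(data, test_size=0.2, random_state=101)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=101)

same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print(same_split)  # True
```

If you leave random_state unset, each call may shuffle the data differently, so the two splits are not guaranteed to match.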


We will examine X_train

* These will be the features used to train the model
* Note the features are numbers. Scikit-learn uses numbers to fit models. That is why we have to encode categorical data
* In this dataset, we don't need any data cleaning or categorical encoding.

In [8]:
X_train.head()

,sepal_length,sepal_width,petal_length,petal_width
104,6.5,3.0,5.8,2.2
89,5.5,2.5,4.0,1.3
116,6.5,3.0,5.5,1.8
82,5.8,2.7,3.9,1.2
112,6.8,3.0,5.5,2.1
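
The features above are already numeric. For contrast, here is a small sketch of what encoding could look like when a feature is categorical. The colour column is invented for illustration (iris has no such feature), and one-hot encoding with pandas is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical data: "colour" is an invented categorical feature
df_example = pd.DataFrame({
    "petal_length": [1.4, 4.0, 5.8],
    "colour": ["white", "purple", "purple"],
})

# One-hot encoding: one 0/1 column per category
encoded = pd.get_dummies(df_example, columns=["colour"], dtype=int)
print(encoded)
```

After encoding, every column is numeric, so the DataFrame can be used to fit a model.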


Inspecting y_train, we can see that the values are categories.

When the ML task is classification, Scikit-learn accepts either numbers or categories for the target variable.

In [9]:
y_train

104     virginica
89     versicolor
116     virginica
82     versicolor
112     virginica
          ...    
63     versicolor
70     versicolor
81     versicolor
11         setosa
95     versicolor
Name: species, Length: 120, dtype: object

In [10]:
type(y_train)

pandas.core.series.Series

### Fitting the model

We will use a decision tree algorithm to demonstrate the basic workflow for fitting a model.

We will use DecisionTreeClassifier(), the documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

We create a Python object/variable called model and instantiate DecisionTreeClassifier(). A common convention is to name this object model.

Note: we create the model and fit it directly. We can do that since the data doesn't require a pre-processing step, like data cleaning or categorical encoding, before the fitting.

* *Generally we would fit the model as part of a pipeline*

In [11]:
#create model object with decision tree algorithm
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

In [12]:
#fit the model to the training data
model.fit(X_train, y_train)
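
For comparison, here is a minimal sketch of fitting the same algorithm as part of a pipeline. It assumes iris is loaded from scikit-learn so the example is self-contained, and StandardScaler is only a placeholder pre-processing step (a decision tree does not need scaling).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: load iris from scikit-learn so the sketch is self-contained
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),  # placeholder pre-processing step
    ("model", DecisionTreeClassifier(random_state=101)),
])

# fit() runs the pre-processing step and then fits the model
pipeline.fit(X_train, y_train)

# predict() applies the same pre-processing before predicting
predictions = pipeline.predict(X_test)
print(predictions[:5])
```

The pipeline is used like a single model: one call to fit() and one call to predict().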


### Run predictions

We use .predict() and pass the test set features (X_test)

This returns an array with one predicted class per row

In [13]:
model.predict(X_test)

array(['setosa', 'setosa', 'setosa', 'versicolor', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'virginica', 'setosa',
       'virginica', 'setosa', 'setosa', 'virginica', 'virginica',
       'versicolor', 'versicolor', 'versicolor', 'setosa', 'virginica',
       'versicolor', 'setosa', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'virginica', 'setosa', 'setosa'],
      dtype=object)

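You can predict the probability (between 0.0 and 1.0) of each class for a given observation using .predict_proba(). A decision tree with default settings is typically grown until its leaves are pure, so the probabilities here are either 0 or 1.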

In [14]:
model.predict_proba(X_test)

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])

#### Predict on real-time data

In an application, you will likely create an interface to collect the data, or get the data from somewhere else, for example from an API.

In this case, we will manually create a DataFrame that contains the features. We call it X_live. It will have one row only (a set of rows would mean running predictions in a batch; in our case, it is a single prediction).

In theory, you can set any value for each variable, but in practice the values will follow the actual data distribution.

In [15]:
#Create a DataFrame for live data
X_live = pd.DataFrame(data={'sepal_length':6.0,
                            'sepal_width':3.9,
                            'petal_length':2.5,
                            'petal_width':0.9},
                      index=[0] # the DataFrame needs an index (either number or category), we just passed the number 0
                      )
X_live

,sepal_length,sepal_width,petal_length,petal_width
0,6.0,3.9,2.5,0.9


In [16]:
#predict from X_live
model.predict(X_live)

array(['versicolor'], dtype=object)

In [17]:

model.predict_proba(X_live)

array([[0., 1., 0.]])
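
Running predictions in a batch works the same way with more than one row. The sketch below is self-contained: it assumes iris is loaded from scikit-learn with the columns renamed to match this notebook, and the two rows in X_batch are hypothetical values.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: load iris from scikit-learn and rename columns to match the notebook
iris = load_iris(as_frame=True)
X = iris.data.copy()
X.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = iris.target.map(dict(enumerate(iris.target_names)))  # 0/1/2 -> species names

model = DecisionTreeClassifier(random_state=101).fit(X, y)

# Hypothetical batch of two observations, one row each
X_batch = pd.DataFrame(data={"sepal_length": [5.0, 6.7],
                             "sepal_width": [3.4, 3.0],
                             "petal_length": [1.5, 5.2],
                             "petal_width": [0.2, 2.3]})

# One prediction per row of X_batch
batch_predictions = model.predict(X_batch)
print(batch_predictions)
```

The column names and their order in the batch should match the features used to fit the model.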

In [18]:
df['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)
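
The columns of .predict_proba() follow the order of the fitted model's classes_ attribute, which is sorted. df['species'].unique() lists the values in order of appearance, which happens to coincide here. A small sketch of labelling the probabilities with classes_, assuming iris is loaded from scikit-learn so the example runs on its own:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: load iris from scikit-learn so the sketch is self-contained
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target.map(dict(enumerate(iris.target_names)))  # 0/1/2 -> species names

model = DecisionTreeClassifier(random_state=101).fit(X, y)

# classes_ holds the class labels in the same order as the predict_proba columns
proba = pd.DataFrame(model.predict_proba(X.head(3)), columns=model.classes_)
print(proba)
```

Reading the class order from model.classes_ avoids relying on the order in which the categories appear in the data.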

### Saving the model