<a href="https://colab.research.google.com/github/Martinmbiro/Pytorch-classification-basics/blob/main/01%20Data%20preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data preparation**
> For this exercise, I'll use [`scikit-learn`](https://scikit-learn.org/stable/index.html)'s famous [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) to train a neural network for multi-class classification

### Loading the data
> ✋ **Info**
+ The dataset contains 150 records with four features (`sepal length`, `sepal width`, `petal length`, `petal width`) that describe three species of Iris: `setosa`, `versicolor`, and `virginica`
+ Calling [`load_iris(return_X_y=False, as_frame=False)`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris) from [`sklearn.datasets`](https://scikit-learn.org/stable/api/sklearn.datasets.html) returns a [`sklearn.utils.Bunch`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html#sklearn.utils.Bunch) object, which extends the regular Python [`dictionary`](https://www.w3schools.com/python/python_dictionaries.asp) so that keys can also be read as attributes
+ If `as_frame=True`, `data` will be a `pandas` [`DataFrame`](https://pandas.pydata.org/docs/reference/frame.html#dataframe), otherwise, a `NumPy` [`ndarray`](https://numpy.org/doc/stable/reference/arrays.ndarray.html#the-n-dimensional-array-ndarray). For the sake of practice, I'll be working with an `ndarray`
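
To make the return types above concrete, here is a minimal sketch of the two call styles. It assumes the default `as_frame=False`; the variable names (`same_object`, `X`, `y`) are only illustrative.

```python
from sklearn import datasets

# Default call: a Bunch, readable by key or by attribute
iris = datasets.load_iris()
same_object = iris["data"] is iris.data  # both spellings reach the same array

# return_X_y=True skips the Bunch and returns a (data, target) tuple
X, y = datasets.load_iris(return_X_y=True)

print(type(iris).__name__, same_object)
print(X.shape, y.shape)
```

The tuple form is convenient when only the arrays are needed; the `Bunch` form keeps the metadata (`feature_names`, `target_names`, `DESCR`) alongside them.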

In [None]:
# imports
import numpy as np
from sklearn import datasets

# load iris dataset
iris = datasets.load_iris()

In [None]:
# return bunch keys
print(f'keys:\n{iris.keys()}\n')
# print column names
print(f'feature_names:\n{iris.feature_names}\n')
# target names
print(f'target_names:\n{iris.target_names}')

keys:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

feature_names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

target_names:
['setosa' 'versicolor' 'virginica']


In [None]:
# check the shape
iris.data.shape, iris.target.shape

((150, 4), (150,))

### Pre-process the data
> Here, I'll build a scikit-learn [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#pipeline) to preprocess the data. One of its steps is feature scaling, a standard practice in machine learning that puts all features on a comparable range

> 🔔 **Info**
+ [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#simpleimputer) - Replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value
+ [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#minmaxscaler) - Transform features by scaling each feature to a given range, by default `(0, 1)`
+ [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#make-pipeline) - Construct a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#pipeline) for preprocessing the data from given _estimators_
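
As a rough illustration of what the two transformers above compute, the sketch below reproduces median imputation and min-max scaling by hand in NumPy. The toy array is made up for the example, and this is not how scikit-learn implements the steps internally, only the arithmetic they correspond to.

```python
import numpy as np

# Toy data (hypothetical values) with one missing entry in the second column
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [5.0, 50.0],
])

# Median imputation: replace each NaN with the median of its column
col_median = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), col_median, X)

# Min-max scaling: (x - min) / (max - min), computed per column
col_min = X_imputed.min(axis=0)
col_max = X_imputed.max(axis=0)
X_scaled = (X_imputed - col_min) / (col_max - col_min)

print(X_scaled)
```

After scaling, the smallest value in each column maps to 0 and the largest to 1, which is the default `(0, 1)` range mentioned above.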

In [None]:
# imports
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

In [None]:
# make preprocessing pipeline with imputer & MinMaxScaler
preprocessor = make_pipeline(
  SimpleImputer(strategy='median'), # handle missing values, if any
  MinMaxScaler() # scale data on a scale of (0,1)
)

In [None]:
# apply the pipeline to the data and view the first 5 rows:
preprocessor.fit_transform(iris.data)[:5]

array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667]])
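
One possible sanity check on the transformed data, sketched under the assumption that the pipeline is built exactly as above: every column should span roughly 0 to 1, and `make_pipeline` should have named each step after its lowercased class name. The names `scaled` and `step_names` are only illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
preprocessor = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler())
scaled = preprocessor.fit_transform(iris.data)

# Step names are generated automatically from the class names
step_names = list(preprocessor.named_steps)
print(step_names)

# Per-column minimum and maximum after scaling
print(scaled.min(axis=0), scaled.max(axis=0))
```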

> ▶️ **Up Next**

> Loading data and model building