# A Quick Machine Learning Modelling Tutorial with Python and Scikit-Learn

#### Resources
- https://github.com/mrdbourke/zero-to-mastery-ml/blob/81492352d12d7a52caef57bba7744cbdc34af33f/section-2-data-science-and-ml-tools/introduction-to-scikit-learn.ipynb

## Overview
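[Scikit-Learn](https://scikit-learn.org) (`sklearn`) is an open-source Python machine learning library built on NumPy, SciPy, and Matplotlib.
- It provides many utilities for common ML tasks, such as splitting data, preprocessing features, fitting models, and evaluating them.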

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


### End-to-end Scikit-learn Workflow
> **note**: this notebook is focused on supervised learning

 1. Prepare data (cleaning, split into features & labels, split into training and testing, etc.)
 2. Choose the right model for the problem (e.g. a regressor such as linear regression, or a classifier such as a random forest)
 3. Fit the model to the data and use it to make predictions
 4. Evaluate the model (and iterate!)
 5. Prepare for deployment & sharing
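
The five steps above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the data is synthetic (generated with `make_classification` rather than loaded from a real file), the choice of `RandomForestClassifier` is arbitrary, and step 5 is reduced to serializing the model in memory with `pickle`.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Prepare data: synthetic features X and labels y (stand-ins for a real dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Choose a model (an arbitrary choice for this sketch)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# 3. Fit the model to the training data and make predictions
clf.fit(X_train, y_train)
y_preds = clf.predict(X_test)

# 4. Evaluate: for classifiers, score() returns the mean accuracy
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")

# 5. Prepare for sharing: serialize the fitted model (in memory here)
saved = pickle.dumps(clf)
loaded = pickle.loads(saved)
```

The reloaded model should make the same predictions as the original, which is a quick sanity check before sharing it.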

## 1. Prepare Data

The main data transformation actions you'll have to take are:
- splitting data columns into features & labels (conventionally named `X` and `y`)
- splitting the data records into test, validation, and training subsets
- filling (aka imputing) or dropping missing values
- converting non-numerical data into a numerical format (**feature-encoding**)
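
Filling or dropping missing values is not demonstrated elsewhere in these notes, so here is a small sketch of both options. The DataFrame and its column names (`age`, `chol`) are made up for illustration, and mean imputation via `SimpleImputer` is just one of several possible strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A tiny, hypothetical dataset with one missing value per column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "chol": [200.0, 240.0, np.nan, 180.0],
})

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: fill (impute) each missing value with its column mean
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(filled)
```

In a real workflow the imputer would usually be fitted on the training subset only and then applied to the test subset, so that no information from the test data leaks into training.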

In [3]:
# Load the data directly from a URL (GitHub requires the raw file form).
# Source: https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/data/heart-disease.csv
heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")

# Split the columns into features (X) and labels (y)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the records into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
heart_disease.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape

((303, 14), (242, 13), (61, 13), (242,), (61,))
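
The split above yields only training and test sets. If a validation set is also wanted, one common approach is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split and uses small synthetic arrays in place of the heart disease data; the intermediate names `X_temp` and `y_temp` are introduced here purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 records, 2 features, binary labels
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: hold out 20% of all records as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80% is 20% of the total, used for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Passing `random_state` makes the split reproducible from run to run.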

### 1.1 Encoding Values

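Scikit-Learn models expect their input data in a numerical format. We can **encode** non-numerical (categorical) data using preprocessing techniques such as one-hot encoding (`OneHotEncoder`), which replaces a categorical column with one binary column per category: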

| Color |
| --- |
| Red |
| Red |
| Yellow |
| Green |
| Yellow |

becomes

| Red | Yellow | Green |
| --- | --- | --- |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
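
The table transformation above can be reproduced with `OneHotEncoder`. This is a minimal sketch: the `colors` DataFrame simply mirrors the example table, and the `categories` argument is passed only to get the same column order as the table (by default the encoder sorts string categories alphabetically).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The single categorical column from the example table
colors = pd.DataFrame({"Color": ["Red", "Red", "Yellow", "Green", "Yellow"]})

# Fix the category order so the output columns match the table
encoder = OneHotEncoder(categories=[["Red", "Yellow", "Green"]])

# fit_transform returns a sparse matrix by default; convert it to a dense array
encoded = encoder.fit_transform(colors).toarray()

# Label the columns with the categories the encoder used
encoded_df = pd.DataFrame(encoded, columns=encoder.categories_[0]).astype(int)
print(encoded_df)
```

Each row contains exactly one `1`, in the column of the category that row originally held.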


In [14]:
car_sales = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended.csv")
car_sales.head()

# Split into features (X) and labels (y)
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

# The train/test split is done after encoding, on the transformed features

# Define the categorical features to transform
categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

# Create a ColumnTransformer that one-hot encodes the categorical columns
# and passes the remaining columns through unchanged
transformer = ColumnTransformer([("one hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [15]:
# Create a model using the encoded features
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)  # for regressors, score() returns R^2 (coefficient of determination)

0.3235867221569877
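
For a regressor, `score()` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$, where 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. The sketch below checks this by hand. It uses synthetic data (not the car sales dataset), so the printed value will differ from the score above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: y depends linearly on two of three features, plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# What score() reports for a regressor
score = model.score(X_test, y_test)

# The same quantity computed by hand from the predictions
y_preds = model.predict(X_test)
ss_res = np.sum((y_test - y_preds) ** 2)        # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

print(score, manual_r2, r2_score(y_test, y_preds))
```

All three printed values should agree, which confirms that `score()` and `r2_score` measure the same thing.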