#### 3. Choosing the right estimator/algorithm for your problem

Some things to note:

- Sklearn refers to machine learning models and algorithms as estimators.
- Classification problem - predicting a category (heart disease or not)
    - Sometimes you'll see clf (short for classifier) used as the variable name for a classification estimator
- Regression problem - predicting a number (selling price of a car)
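
To make the two problem types above concrete, here is a minimal sketch of the shared estimator interface (`fit()` then `predict()`). The synthetic data from `make_classification` and `make_regression`, and the choice of `LogisticRegression` and `Ridge`, are assumptions for illustration only and are not the datasets or models used later in these notes.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge

# Synthetic data stands in for a real dataset (assumption for illustration)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Classification estimator: predicts a category (here 0 or 1)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_clf, y_clf)
class_preds = clf.predict(X_clf[:5])

# Regression estimator: predicts a number
reg = Ridge()
reg.fit(X_reg, y_reg)
number_preds = reg.predict(X_reg[:5])

print(class_preds)   # class labels
print(number_preds)  # floating point values
```

Both estimators are used the same way; only the kind of value that `predict()` returns differs.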
  
If you're working on a machine learning problem with Sklearn and aren't sure which model to use, refer to the Sklearn machine learning map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html



##### Picking a machine learning model for a regression problem

Let's use the California Housing dataset - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

## 1-Regression

In [2]:
# Standard imports
# %matplotlib inline # No longer required in newer versions of Jupyter (2022+)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [40]:
# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

# Does not return the data directly - it has other info too, so we need to extract the dataset
housing;

In [41]:
# Extract the data (data and column names are stored separately - combine them into a DataFrame)
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df.head(2)

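|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |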


In [43]:
# Add the target (it's not included by default - it comes separately)
housing_df["target"] = housing["target"]
housing_df.head(2)

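|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |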
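
The dataset loaders return a dictionary-like `Bunch` object, which is why the features and target had to be stitched together by hand above. As a rough sketch of an alternative, the loaders also accept `as_frame=True`, which exposes a ready-made DataFrame under `.frame`. The example below uses `load_diabetes` as a stand-in (an assumption, chosen only because it ships with Sklearn and needs no download); `fetch_california_housing` accepts the same argument.

```python
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): bundled with Sklearn, so no download is needed
bunch = load_diabetes(as_frame=True)

# The Bunch behaves like a dictionary
print(list(bunch.keys()))

# .frame holds the feature columns plus a "target" column in one DataFrame
df = bunch.frame
print(df.shape)
print(df.columns.tolist())
```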


Let's make the model

In [None]:
# Import algorithm/estimator
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"] # median house price in $100,000s

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
model = Ridge()
model.fit(X_train, y_train);

In [None]:
# Check the score of the model (on the test set)
model.score(X_test, y_test) # coefficient of determination (R^2), the default score for regressors

What if Ridge didn't work or the score didn't fit our needs?

Well, we could always try a different model...

How about we try an ensemble model (an ensemble is a combination of smaller models that tries to make better predictions than a single model)?

Sklearn's ensemble models can be found here: https://scikit-learn.org/stable/modules/ensemble.html
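
Trying a different model is mostly a matter of swapping the estimator class, so the comparison can be written as a loop. The sketch below is a minimal illustration on synthetic data from `make_friedman1` (an assumption, used so the block runs without any download); the `candidates` and `results` dictionaries are hypothetical names, not part of Sklearn.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression data (assumption for illustration)
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical dictionary of models to try on the same split
candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training set and score it on the test set (R^2)
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    results[name] = estimator.score(X_test, y_test)

for name, score in results.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Using the same train/test split for every candidate keeps the scores comparable.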

In [None]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create random forest model - the only change is here
model = RandomForestRegressor()
model.fit(X_train, y_train);

In [None]:
# Check the score of the model (on the test set)
model.score(X_test, y_test) # seems like a better score

## 2-Classification

To decide on a classification model we can use the Sklearn machine learning map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [11]:
# import the data
heart_disease = pd.read_csv("data/heart-disease.csv")

# Change the index to start at 1 (optional)
index = range(1, len(heart_disease) + 1)
heart_disease.index = index

# View top rows
heart_disease.head(1); 

In [18]:
heart_disease["target"].value_counts();

Consulting the map, it says to try LinearSVC.

In [16]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Setup random seed - to get same randomness
np.random.seed(42)

# Make the data (X - features, y - labels)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data (test and train)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate LinearSVC and fit the model
clf = LinearSVC(max_iter=100)
clf.fit(X_train, y_train)

# Evaluate the LinearSVC
clf.score(X_test, y_test)

0.8688524590163934

Using the RandomForest classifier

In [19]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Evaluate the Random Forest Classifier
clf.score(X_test, y_test)

0.8524590163934426

Note:
1. If you have structured data, use ensemble methods
2. If you have unstructured data, use deep learning or transfer learning

### 3-Fit the model/algorithm on our data and use it to make predictions

##### 3.1 Fitting the model to the data

Different names for features and labels:

    X = features, feature variables, data
    y = labels, targets, target variables

In [21]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Make the model object and then fit it

In [24]:
# Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)

# Fit the model to the data (training the machine learning model)
clf.fit(X_train, y_train);

In [25]:
# Evaluate the Random Forest Classifier (use the patterns the model has learned)
clf.score(X_test, y_test)

0.8688524590163934

##### 3.2 Make predictions using a machine learning model

2 ways to make predictions:

    `predict()`
    `predict_proba()`

In [31]:
# Use a trained model to make predictions: predict() takes features (preferably test data) and outputs predicted labels
clf.predict(X_test);

In [32]:
# actual labels of the test data
pd.array(y_test);

In [33]:
# Compare predictions to truth labels to evaluate the model - method 1
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

0.8688524590163934

In [34]:
# method - 2 (use this)
clf.score(X_test, y_test)

0.8688524590163934

In [35]:
# method - 3
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8688524590163934

Make predictions with predict_proba() - use this if someone asks you "what's the probability your model is assigning to each prediction?"

In [37]:
# predict_proba() returns the probability of each class label (first 5 rows)
clf.predict_proba(X_test[0:5])
# columns: [P(class 0), P(class 1)]

array([[0.87, 0.13],
       [0.4 , 0.6 ],
       [0.4 , 0.6 ],
       [0.92, 0.08],
       [0.21, 0.79]])

In [38]:
clf.predict(X_test[0:5])

array([0, 1, 1, 0, 1], dtype=int64)

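predict() returns the class with the highest predicted probability, so for a two-class problem like this one the default threshold is effectively 0.5.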
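
The link between `predict()` and `predict_proba()` can be checked directly: taking the most probable column of `predict_proba()` and mapping it through `clf.classes_` should reproduce `predict()`. The sketch below uses synthetic data from `make_classification` (an assumption, so it runs without the heart disease CSV), and the 0.7 cut-off is a hypothetical value chosen only to show how a stricter threshold could be applied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)  # shape (n_samples, n_classes), rows sum to 1
labels = clf.predict(X_test)

# Columns of predict_proba follow the order of clf.classes_
labels_from_probs = clf.classes_[np.argmax(probs, axis=1)]
print(np.array_equal(labels, labels_from_probs))

# Hypothetical stricter threshold: only predict class 1 when P(class 1) >= 0.7
threshold = 0.7
positive_col = list(clf.classes_).index(1)
strict_labels = (probs[:, positive_col] >= threshold).astype(int)
print(labels.sum(), strict_labels.sum())
```

Raising the threshold can only reduce the number of samples labelled as class 1, which is useful when false positives are costly.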

predict() can also be used for regression models.

In [44]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create the data - make sure to run cells 2 and 3 of the regression section first to import the dataset
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create model instance
model = RandomForestRegressor()
# Fit the model to the data
model.fit(X_train, y_train)

# Make predictions
y_preds = model.predict(X_test)

In [45]:
y_preds[0:5]

array([0.49384  , 0.75494  , 4.9285964, 2.54316  , 2.33176  ])

In [46]:
np.array(y_test[0:5])

array([0.477  , 0.458  , 5.00001, 2.186  , 2.78   ])

In [47]:
# Compare the predictions to the truth - method 1
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

0.32659871732073664

In [51]:
from sklearn.metrics import r2_score
r2_score(y_test, y_preds)

0.8065734772187598

In [49]:
# method - 2 (use this)
model.score(X_test, y_test)

0.8065734772187598
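
To see what these two regression metrics compute, here is a minimal sketch that reproduces them by hand. It uses only the five predictions and true values printed above, so the numbers it produces will differ from the full test-set results. The formulas are the standard ones: mean absolute error is the average of |y - prediction|, and R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# First five true values and predictions, copied from the outputs above
y_true = np.array([0.477, 0.458, 5.00001, 2.186, 2.78])
y_pred = np.array([0.49384, 0.75494, 4.9285964, 2.54316, 2.33176])

# Mean absolute error: average size of the prediction errors
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual values should match Sklearn's metric functions
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

For a regressor, `model.score(X_test, y_test)` returns the same R^2 value as `r2_score(y_test, model.predict(X_test))`, which is why the two outputs above match.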