- Notebook Author: [Trenton McKinney][1]
- Course: **[DataCamp: Supervised Learning with scikit-learn][2]**
 - This [notebook][3] was created as a reproducible reference.
 - The material is from the course
 - I completed the exercises
 - If you find the content beneficial, consider a [DataCamp Subscription][4].
 - I added a function (**`create_dir_save_file`**) to automatically download and save the required data (`data/2020-10-14_supervised_learning_sklearn`) and image (`Images/2020-10-14_supervised_learning_sklearn`) files.

  [1]: https://trenton3983.github.io/
  [2]: https://learn.datacamp.com/courses/supervised-learning-with-scikit-learn
  [3]: https://github.com/trenton3983/DataCamp/blob/master/2020-10-14_supervised_learning_sklearn.ipynb
  [4]: https://www.datacamp.com/pricing

#### Course Description

Machine learning is the field that teaches machines and computers to learn from existing data to make predictions on new data: Will a tumor be benign or malignant? Which of your customers will take their business elsewhere? Is a particular email spam? In this course, you'll learn how to use Python to perform supervised learning, an essential component of machine learning. You'll learn how to build predictive models, tune their parameters, and determine how well they will perform with unseen data—all while using real world datasets. You'll be using scikit-learn, one of the most popular and user-friendly machine learning libraries for Python.

#### Imports

In [None]:
import pandas as pd
import numpy as np
from pprint import pprint as pp
from itertools import combinations
import requests
from pathlib import Path
from sklearn.datasets import load_iris, load_digits
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

#### Configuration Options

In [None]:
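# use the fully qualified option names; the short forms can be ambiguous in newer pandas versions
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)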
pd.set_option('display.expand_frame_repr', True)
# plt.style.use('ggplot')
plt.rcParams["patch.force_edgecolor"] = True

#### Functions

In [None]:
def create_dir_save_file(dir_path: Path, url: str):
    """
    Check if the path exists and create it if it does not.
    Check if the file exists and download it if it does not.
    """
    if not dir_path.parents[0].exists():
        dir_path.parents[0].mkdir(parents=True)
        print(f'Directory Created: {dir_path.parents[0]}')
    else:
        print('Directory Exists')
        
    if not dir_path.exists():
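        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()  # stop here on an HTTP error instead of saving an error page
        # use a context manager so the file handle is closed after writing
        with open(dir_path, 'wb') as f:
            f.write(r.content)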
        print(f'File Created: {dir_path.name}')
    else:
        print('File Exists')

In [None]:
data_dir = Path('data/2020-10-14_supervised_learning_sklearn')
images_dir = Path('Images/2020-10-14_supervised_learning_sklearn')

#### Datasets

In [None]:
file_mpg = 'https://assets.datacamp.com/production/repositories/628/datasets/3781d588cf7b04b1e376c7e9dda489b3e6c7465b/auto.csv'
file_housing = 'https://assets.datacamp.com/production/repositories/628/datasets/021d4b9e98d0f9941e7bfc932a5787b362fafe3b/boston.csv'
file_diabetes = 'https://assets.datacamp.com/production/repositories/628/datasets/444cdbf175d5fbf564b564bd36ac21740627a834/diabetes.csv'
file_gapminder = 'https://assets.datacamp.com/production/repositories/628/datasets/a7e65287ebb197b1267b5042955f27502ec65f31/gm_2008_region.csv'
file_voting = 'https://assets.datacamp.com/production/repositories/628/datasets/35a8c54b79d559145bbeb5582de7a6169c703136/house-votes-84.csv'
file_wwine = 'https://assets.datacamp.com/production/repositories/628/datasets/2d9076606fb074c66420a36e06d7c7bc605459d4/white-wine.csv'
file_rwine = 'https://assets.datacamp.com/production/repositories/628/datasets/013936d2700e2d00207ec42100d448c23692eb6f/winequality-red.csv'

In [None]:
datasets = [file_mpg, file_housing, file_diabetes, file_gapminder, file_voting, file_wwine, file_rwine]
data_paths = list()

for data in datasets:
    file_name = data.split('/')[-1].replace('?raw=true', '')
    data_path = data_dir / file_name
    create_dir_save_file(data_path, data)
    data_paths.append(data_path)

#### DataFrames

# Classification

In this chapter, you will be introduced to classification problems and learn how to solve them using supervised learning techniques. And you'll apply what you learn to a political dataset, where you classify the party affiliation of United States congressmen based on their voting records.

## Supervised learning

**What is machine learning?**
- The art and science of giving computers the ability to learn to make decisions from data without being explicitly programmed.
- Examples:
  - Your computer can learn to predict whether an email is spam or not spam, given its content and sender.
  - Your computer can learn to cluster, say, Wikipedia entries, into different categories based on the words they contain.
    - It could then assign any new Wikipedia article to one of the existing clusters.
- Note that, in the first example, we are trying to predict a particular class label, that is, spam or not spam.
- In the second example, there is no such label.
- When there are labels present, we call it supervised learning.
- When there are no labels present, we call it unsupervised learning.

**Unsupervised learning**
- In essence, unsupervised learning is the machine learning task of uncovering hidden patterns and structures from unlabeled data.
- Example:
  - A business may wish to group its customers into distinct categories based on their purchasing behavior without knowing in advance what these categories might be.
    - This is known as clustering, one branch of unsupervised learning.

**Reinforcement learning**
- Machines or software agents interact with an environment.
  - Reinforcement learning agents are able to automatically figure out how to optimize their behavior given a system of rewards and punishments.
- Reinforcement learning draws inspiration from behavioral psychology and has applications in many fields, such as economics, genetics, and game playing.
- In 2016, reinforcement learning was used to train Google DeepMind's AlphaGo, which was the first computer program to beat the world champion in Go.

**Supervised learning**
- In supervised learning, we have several data points or samples, described using predictor variables or features and a target variable.
- Our data is commonly represented in a table structure such as the one below, in which there is a row for each data point and a column for each feature.
```python
|    |                        Predictor Variables                      | Target    |
|    |   sepal_length |   sepal_width |   petal_length |   petal_width | species   |
|---:|---------------:|--------------:|---------------:|--------------:|:----------|
|  0 |            5.1 |           3.5 |            1.4 |           0.2 | setosa    |
|  1 |            4.9 |           3   |            1.4 |           0.2 | setosa    |
|  2 |            4.7 |           3.2 |            1.3 |           0.2 | setosa    |
|  3 |            4.6 |           3.1 |            1.5 |           0.2 | setosa    |
|  4 |            5   |           3.6 |            1.4 |           0.2 | setosa    |
```
- Here, we see the iris dataset: each row represents measurements of a different flower and each column is a particular kind of measurement, like the width and length of a certain part of the flower.
- The aim of supervised learning is to build a model that is able to predict the target variable, here, the particular species of a flower, given the predictor variables, the physical measurements.
- _If the target variable consists of categories_, like `'click'` or `'no click'`, `'spam'` or `'not spam'`, or different species of flowers, we call the learning task, **classification**.
- Alternatively, _if the target is a continuously varying variable_, such as the price of a house, it is a **regression** task.
- This chapter will focus on classification, the following, on regression.
- The goal of supervised learning is frequently either to automate a time-consuming or expensive manual task, such as a doctor's diagnosis, or to make predictions about the future, say whether or not a customer will click on an ad.
- For supervised learning, you need labeled data and there are many ways to go get it: you can get historical data, which already has labels that you are interested in; you can perform experiments to get labeled data, such as A/B-testing to see how many clicks you get; or you can also use crowd-sourced labeling data, like reCAPTCHA does for text recognition.
- In any case, the goal is to learn from data for which the right output is known, so that we can make predictions on new data for which we don't know the output.
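
As a rough illustration of the classification versus regression distinction described above, scikit-learn's `type_of_target` helper reports how it interprets a target array. The label lists below are made up for illustration and are not part of the course data.

```python
from sklearn.utils.multiclass import type_of_target

# hypothetical targets, invented for this sketch
spam_labels = ['spam', 'not spam', 'spam', 'not spam']            # two categories
species_labels = ['setosa', 'versicolor', 'virginica', 'setosa']  # three categories
house_prices = [250000.50, 310000.75, 199999.99]                  # continuously varying

spam_kind = type_of_target(spam_labels)
species_kind = type_of_target(species_labels)
price_kind = type_of_target(house_prices)

# categorical targets point to classification, continuous targets to regression
print(f'spam labels:    {spam_kind}')
print(f'species labels: {species_kind}')
print(f'house prices:   {price_kind}')
```

This is only a heuristic check: a numeric column of whole numbers can still be a regression target, so the final decision depends on what the values mean.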

**Supervised learning in Python**
- There are many ways to perform supervised learning in Python.
- In this course, we will use `scikit-learn`, or `sklearn`, one of the most popular and user-friendly machine learning libraries for Python.
- It also integrates very well with the `SciPy` stack, including libraries such as `NumPy`.
- There are a number of other ML libraries out there, such as `TensorFlow` and `keras`, which are well worth checking out, once you get the basics down.

**Naming conventions**
- A note on naming conventions: out in the wild, you will find that what we call a feature, others may call a predictor variable, or independent variable, and what we call a target variable, others may call dependent variable, or response variable.

### Which of these is a classification problem?

Once you decide to leverage supervised machine learning to solve a new problem, you need to identify whether your problem is better suited to classification or regression. This exercise will help you develop your intuition for distinguishing between the two.

Provided below are 4 example applications of machine learning. Which of them is a supervised classification problem?

**Answer the question**

- **Using labeled financial data to predict whether the value of a stock will go up or go down next week.**
  - Exactly! In this example, there are two discrete, qualitative outcomes: the stock market going up, and the stock market going down. This can be represented using a binary variable, and is an application perfectly suited for classification.
- ~~Using labeled housing price data to predict the price of a new house based on various features.~~
  - Incorrect. The price of a house is a quantitative variable. This is not a classification problem.
- ~~Using unlabeled data to cluster the students of an online education company into different categories based on their learning styles.~~
  - Incorrect. When using unlabeled data, we enter the territory of unsupervised learning.
- ~~Using labeled financial data to predict what the value of a stock will be next week.~~
  - Incorrect. The value of a stock is a quantitative value. This is not a classification problem.

## Exploratory data analysis

- Samples are in rows
- Features are in columns

In [None]:
iris = load_iris()
print(f'Type: {type(iris)}')
print(f'Keys: {iris.keys()}')
print(f'Data Type: {type(iris.data)}\nTarget Type: {type(iris.target)}')
print(f'Data Shape: {iris.data.shape}')
print(f'Target Names: {iris.target_names}')

X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
df['label'] = y
species_map = dict(zip(range(3), iris.target_names))
df['species'] = df.label.map(species_map)
df = df.reindex(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'species', 'label'], axis=1)
display(df.head())

# pd.plotting.scatter_matrix(df, c=y, figsize=(12, 10))
sns.pairplot(df.iloc[:, :5], hue='species', corner=True)
plt.show()

### Numerical EDA

In this chapter, you'll be working with a dataset obtained from the [UCI Machine Learning Repository][1] consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!

Before thinking about what supervised learning models you can apply to this, however, you need to perform Exploratory data analysis (EDA) in order to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of [Statistical Thinking in Python (Part 1)][2].

Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called `df`. Use pandas' `.head()`, `.info()`, and `.describe()` methods in the IPython Shell to explore the DataFrame, and select the statement below that is **not** true.

  [1]: https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records
  [2]: https://trenton3983.github.io/certificates/2019-11-21_DataCamp_statistical_thinking_in_python_I.pdf

In [None]:
cols = ['party', 'infants', 'water', 'budget', 'physician', 'salvador', 'religious', 'satellite', 'aid',
        'missile', 'immigration', 'synfuels', 'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
votes = pd.read_csv(data_paths[4], header=None, names=cols)
votes.iloc[:, 1:] = votes.iloc[:, 1:].apply(lambda col: col.map({'?': None, 'n': 0, 'y': 1}))
votes.head()

**Possible Answers**

- The DataFrame has a total of `435` rows and `17` columns.
- Except for `'party'`, all of the columns are of type `int64`.
- The first two rows of the DataFrame consist of votes made by Republicans and the next three rows consist of votes made by Democrats.
- ~~There are 17 predictor variables, or features, in this DataFrame.~~
  - The number of columns in the DataFrame is not equal to the number of features. One of the columns - `'party'` is the target variable.
- The target variable in this DataFrame is `'party'`.
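
The checks behind the statements above can be sketched on a small stand-in frame. The `toy` DataFrame below is invented for illustration (it is not the voting data); the same calls apply to the real `df`.

```python
import pandas as pd

# hypothetical miniature of the voting data: one target column plus three vote columns
toy = pd.DataFrame({
    'party': ['republican', 'republican', 'democrat', 'democrat', 'democrat'],
    'infants': [0, 0, 1, 0, 1],
    'water': [1, 1, 1, 1, 1],
    'budget': [0, 0, 1, 1, 1],
})

print(toy.shape)       # (rows, columns)
print(toy.head())      # first rows, including the party of each record
toy.info()             # column names, non-null counts and dtypes
print(toy.describe())  # summary statistics for the numeric columns

# the number of features is the number of columns minus the target column
n_features = toy.drop(columns='party').shape[1]
print(f'Features: {n_features}, Columns: {toy.shape[1]}')
```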

### Visual EDA

The Numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the `scatter_matrix()` function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as [seaborn.countplot][1].

Given on the right is a `countplot` of the `'education'` bill, generated from the following code:

```python
plt.figure()
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
```

In `sns.countplot()`, we specify the x-axis data to be `'education'`, and hue to be `'party'`. Recall that `'party'` is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the `'education'` bill, with each party colored differently. We manually specified the color to be `'RdBu'`, as the Republican party has been traditionally associated with red, and the Democratic party with blue.

It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S. politics may be able to predict this without machine learning, but probably not instantaneously, and certainly not if we are dealing with hundreds of samples!

In the IPython Shell, explore the voting behavior further by generating countplots for the `'satellite'` and `'missile'` bills, and answer the following question: Of these two bills, which do Democrats vote resoundingly in favor of, compared to Republicans? Be sure to begin your plotting statements for each figure with `plt.figure()` so that a new figure will be set up. Otherwise, your plots will be overlaid onto the same figure.


  [1]: http://seaborn.pydata.org/generated/seaborn.countplot.html

In [None]:
# in order to use catplot, the dataframe needs to be in a tidy format
vl = votes.set_index('party').stack().reset_index().rename(columns={'level_1': 'cat', 0: 'vote'})

g = sns.catplot(data=vl, x='vote', col='cat', col_wrap=6, hue='party', kind='count', height=3, palette='RdBu')

**Possible Answers**

- ~~`'satellite'`.~~
- ~~`'missile'`.~~
- **Both `'satellite'` and `'missile'`.**
- ~~Neither `'satellite'` nor `'missile'`.~~

## The classification challenge

- We have a set of labeled data and we want to build a classifier that takes unlabeled data as input and outputs a label.
- How do we construct this classifier?
- We first need to choose a type of classifier, and it needs to learn from the already labeled data.
- For this reason, we call the already labeled data, the training data.

**k-Nearest Neighbors (KNN)**
- We'll choose a simple algorithm called K-nearest neighbors.
- The basic idea of KNN is to predict the label of any data point by looking at the K (for example, 3) closest labeled data points and getting them to vote on what label the unlabeled point should have.
- ![knn][knn]
  - In this image, there's an example of KNN in two dimensions: how do you classify the data point in the middle?
- ![knn3][knn3]
  - If `k=3`, you would classify it as red
- ![knn5][knn5]
  - If `k=5`, you would classify it as green
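
The voting idea above can be sketched from scratch with NumPy. This is a minimal illustration, not how `scikit-learn` implements KNN; the function name `knn_predict` and the points are made up so that the answer flips between `k=3` and `k=5`, as in the figures.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote of its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # count the labels among those neighbors and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# hypothetical 2D data: two red points very close to the origin, three green points farther out
X_train = np.array([[0.1, 0.0], [0.0, 0.1], [0.5, 0.0], [0.0, 0.6], [0.7, 0.0]])
y_train = np.array(['red', 'red', 'green', 'green', 'green'])
x_new = np.array([0.0, 0.0])

pred_k3 = knn_predict(X_train, y_train, x_new, k=3)
pred_k5 = knn_predict(X_train, y_train, x_new, k=5)
print(f'k=3: {pred_k3}\nk=5: {pred_k5}')
```

With `k=3` the two nearby red points outvote one green point; with `k=5` all three green points are included and win the vote.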

**KNN: Intuition**
- To get a bit of intuition for KNN, let's check out a scatter plot of two dimensions of the iris dataset, petal length and petal width.
- ![iris_petal][iris_petal]
- The following holds for higher dimensions; however, we'll show the 2D case for illustrative purposes.
- What the KNN algorithm essentially does is create a set of decision boundaries, which we visualize for the 2D case here.
- ![ip_db][iris_petal_db]
- Any new data point will have a species prediction based on the boundary.

**`scikit-learn` `fit` and `predict`**
- All machine learning models in `scikit-learn` are implemented as Python classes.
- These classes serve two purposes:
  - They implement the algorithms for learning a model and predicting.
  - They store all the information that is learned from the data.
- Training a model on the data is also called fitting the model to the data.
  - In `scikit-learn` we use the `.fit()` method to do this.
  - The `.predict()` method is used to predict the label of an unlabeled data point.


  [knn]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2020-10-14_supervised_learning_sklearn/knn.JPG
  [knn3]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2020-10-14_supervised_learning_sklearn/knn3.JPG
  [knn5]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2020-10-14_supervised_learning_sklearn/knn5.JPG
  [iris_petal]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2020-10-14_supervised_learning_sklearn/iris_petal.JPG
  [iris_petal_db]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2020-10-14_supervised_learning_sklearn/iris_petal_db.JPG

#### Code to create boundary plot in the previous block

- See [How to extract only the boundary values from k-nearest neighbors predict][1] to see a method to extract the boundary values.


  [1]: https://stackoverflow.com/questions/64398946

In [None]:
# instantiate model
knn = KNeighborsClassifier(n_neighbors=6)

# fit on 'petal length (cm)' and 'petal width (cm)'
knn.fit(df.iloc[:, 2:4], df.label)

h = .02  # step size in the mesh

# create colormap for the contour plot
cmap_light = ListedColormap(list(sns.color_palette('pastel', n_colors=3)))

# Plot the decision boundary.
# For that, we will assign a color to each point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = df['petal length (cm)'].min() - 1, df['petal length (cm)'].max() + 1
y_min, y_max = df['petal width (cm)'].min() - 1, df['petal width (cm)'].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# create plot
fig, ax = plt.subplots()

# add data points
sns.scatterplot(data=df, x='petal length (cm)', y='petal width (cm)', hue='species', ax=ax, edgecolor='k')

# add decision boundary contour map
ax.contourf(xx, yy, Z, cmap=cmap_light, alpha=0.4)

# legend
lgd = plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# plt.show()
plt.close()

#### Using `scikit-learn` to fit a classifier

- `from sklearn.neighbors import KNeighborsClassifier`
- The API requires data as a `pandas.DataFrame` or as a `numpy.array`
- The API features must take on continuous values, such as the price of a house, as opposed to categories, such as `'male'` or `'female'`.
- There should be no missing values in the data.
- All datasets we'll work with satisfy these properties.
- Dealing with categorical features and missing data will be discussed later in the course.
- The API requires that the features are in an array, where each column is a feature, and each row, a different observation or data point.
- There must be a label for each observation.
- Check out what's returned when the classifier is fit
  - It returns the classifier itself, and modifies it, to fit it to the data.
- Now that the classifier is fit, use it to predict on some unlabeled data.
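
A short sketch of the point about the return value of `.fit()`: the call returns the same classifier object it was called on, which is why fit and predict can be chained. The variable names `fitted` and `chained` are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=6)
fitted = knn.fit(X, y)

# .fit() modifies the classifier in place and returns that same object
print(f'Same object: {fitted is knn}')

# information learned from the data is stored on the object, e.g. the class labels seen
print(f'Classes: {knn.classes_}')

# because .fit() returns the classifier, the calls can be chained
chained = KNeighborsClassifier(n_neighbors=6).fit(X, y).predict(X[:3])
print(f'Chained predictions: {chained}')
```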

In [None]:
# new data
X_new = np.array([[5.6, 2.8, 3.9, 1.1], [5.7, 2.6, 3.8, 1.3], [4.7, 3.2, 1.3, 0.2]])

fig, ax = plt.subplots()
sns.scatterplot(data=df, x='petal length (cm)', y='petal width (cm)', hue='species', ax=ax, edgecolor='k')
sns.scatterplot(x=X_new[:, 2], y=X_new[:, 3], ax=ax, color='magenta', label='uncategorized', s=70)
plt.show()

In [None]:
# instantiate the model, and set the number of neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# fit the model to the training set, the labeled data
knn.fit(df.iloc[:, :4], df.label)

# predict the label of the new data
pred = knn.predict(X_new)
species_pred = list(map(species_map.get, pred))
print(f'Predicted Label: {pred}\nSpecies: {species_pred}')

### k-Nearest Neighbors: Fit

Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame `df`.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array `X` and response variable `y`: This is in accordance with the common scikit-learn practice.

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the `n_neighbors` parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called `df`.

**Instructions**

- Import `KNeighborsClassifier` from `sklearn.neighbors`.
- Create arrays `X` and `y` for the features and the target variable. Here this has been done for you. Note the use of `.drop()` to drop the target variable `'party'` from the feature array `X` as well as the use of the `.values` attribute to ensure `X` and `y` are NumPy arrays. Without using `.values`, `X` and `y` are a DataFrame and Series respectively; the scikit-learn API will accept them in this form also as long as they are of the right shape.
- Instantiate a `KNeighborsClassifier` called `knn` with `6` neighbors by specifying the `n_neighbors` parameter.
- Fit the classifier to the data using the `.fit()` method.

In [None]:
v_na = votes.dropna().reset_index(drop=True)
v_na.head()

In [None]:
# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(v_na.iloc[:, 1:], v_na.party)

**Now that your k-NN classifier with 6 neighbors has been fit to the data, it can be used to predict the labels of new data points.**

### k-Nearest Neighbors: Predict

Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the `.predict()` method on the `X` that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.

In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as `X_new`. You will use your classifier to predict the label for this new data point, as well as on the training data `X` that the model has already seen. Using `.predict()` on `X_new` will generate 1 prediction, while using it on `X` will generate 435 predictions: 1 for each sample.

The DataFrame has been pre-loaded as `df`. This time, you will create the feature array `X` and target variable array `y` yourself.

**Instructions**

- Create arrays for the features and the target variable from `df`. As a reminder, the target variable is `'party'`.
- Instantiate a `KNeighborsClassifier` with `6` neighbors.
- Fit the classifier to the data.
- Predict the labels of the training data, `X`.
- Predict the label of the new data point `X_new`.

In [None]:
X_new = np.array([[0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897, 0.42310646, 0.9807642 , 0.68482974,
                   0.4809319 , 0.39211752, 0.34317802, 0.72904971, 0.43857224, 0.0596779 , 0.39804426, 0.73799541]])

In [None]:
# Create arrays for the features and the response variable
y = v_na.party
X = v_na.iloc[:, 1:]

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print(f'Prediction: {new_prediction}')

**Did your model predict 'democrat' or 'republican'? How sure can you be of its predictions? In other words, how can you measure its performance? This is what you will learn in the next video.**

## Measuring model performance

- Now that we know how to fit a classifier and use it to predict the labels of previously unseen data, we need to figure out how to measure its performance. We need a metric.
- In classification problems, accuracy is a commonly-used metric.
- The accuracy of a classifier is defined as the number of correct predictions divided by the total number of data points.
- This raises the question though: which data do we use to compute accuracy?
- What we're really interested in is how well our model will perform on new data; samples that the algorithm has never seen before.
- You could compute the accuracy on the data you used to fit the classifier.
- However, as this data was used to train it, the classifier's performance will not be indicative of how well it can generalize to unseen data.
- For this reason, it is common practice to split the data into two sets, a training and test set.
- The classifier is trained or fit on the training set.
- Then predictions are made on the labeled test set, and compared with the known labels.
- The accuracy of the predictions is then computed.
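
The definition of accuracy above can be written out directly and compared with `sklearn.metrics.accuracy_score`. The label arrays here are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# hypothetical known labels and predictions for five test samples
y_true = np.array(['democrat', 'republican', 'democrat', 'democrat', 'republican'])
y_pred = np.array(['democrat', 'republican', 'republican', 'democrat', 'republican'])

# number of correct predictions divided by the total number of data points
manual_accuracy = np.mean(y_true == y_pred)

# the same quantity computed by scikit-learn
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(f'Manual: {manual_accuracy:0.2f}\nsklearn: {sklearn_accuracy:0.2f}')
```

Four of the five predictions match, so both values are 0.8. A classifier's `.score()` method reports this same metric for the data passed to it.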

**Train Test Split**
- [`sklearn.model_selection.train_test_split`][1]
  - `random_state` sets a seed for the random number generator that splits the data into train and test, which allows for reproducing the exact split of the data.
  - returns four arrays: train data, test data, training labels and test labels.
  - the default split is 75%/25% (training/test), which is a good rule of thumb; the test proportion is specified by `test_size`.
  - it is also best practice to perform the split so that the split reflects the labels on the data.
    - That is, you want the labels to be distributed in train and test sets as they are in the original dataset, as is achieved by setting `stratify=y`, where `y` is the array or dataframe of labels.
  - See below that the accuracy of the model is approximately 96%, which is pretty good for an out-of-the-box model.
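
A small sketch of what `stratify` and the default `test_size` do, using made-up imbalanced labels (80 samples of class `0` and 20 of class `1`) rather than course data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 100 samples, 20% of which belong to class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(f'Train size: {len(X_train)}, share of class 1: {y_train.mean():0.2f}')
print(f'Test size:  {len(X_test)}, share of class 1: {y_test.mean():0.2f}')

# leaving test_size unset falls back to the default 75%/25% split
default_train, default_test = train_test_split(X, random_state=42)
print(f'Default split sizes: {len(default_train)} / {len(default_test)}')
```

Without `stratify`, the share of class `1` in the test set would vary with the random seed, which matters most for small or imbalanced datasets.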
  
**Model complexity and over / underfitting**
- Recall that we recently discussed the concept of a decision boundary.
  - ![neighbors][2]
  - We visualized a decision boundary for several, increasing values of `K` in a KNN model.
  - As `K` increases, the decision boundary gets smoother and less curvy.
  - Therefore, we consider it to be a less complex model than those with a lower `K`.
  - Generally, complex models run the risk of being sensitive to noise in the specific data that you have, rather than reflecting general trends in the data.
    - This is known as overfitting.
  - If you increase `K` even more, and make the model even simpler, then the model will perform less well on both test and training sets, as indicated in the following schematic figure, known as a model complexity curve.
  - ![neighbors][3]
    - We can see there is a sweet spot in the middle that gives us the best performance on the test set
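
One way to trace such a model complexity curve is to fit a classifier for a range of `K` values and record the accuracy on both sets. This is a sketch on the Iris data with an arbitrary range of 1 to 20 neighbors; the exact numbers depend on the split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# range of K values to try (an arbitrary choice for this sketch)
neighbors = np.arange(1, 21)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # accuracy on the data the model was fit to, and on the held-out data
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

for k, train_acc, test_acc in zip(neighbors, train_accuracy, test_accuracy):
    print(f'k={k:2d}  train={train_acc:0.3f}  test={test_acc:0.3f}')
```

Plotting `train_accuracy` and `test_accuracy` against `neighbors` gives the curve. With `k=1` each training point is its own nearest neighbor, so the training accuracy is at its maximum while the test accuracy shows how well that complex model generalizes.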


  [1]: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
  [2]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2020-10-14_supervised_learning_sklearn/neighbors.JPG
  [3]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2020-10-14_supervised_learning_sklearn/neighbors2.JPG

In [None]:
df.head()

In [None]:
# from sklearn.model_selection import train_test_split

# split the data
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :4], df.species, test_size=0.3, random_state=21, stratify=df.species)

# instantiate the classifier
knn = KNeighborsClassifier(n_neighbors=8)

# fit it to the training data
knn.fit(X_train, y_train)

# make predictions on the test data
y_pred = knn.predict(X_test)

# check the accuracy using the score method of the model
score = knn.score(X_test, y_test)

# print the predictions and score
print(f'Test set score: {score:0.3f}\nTest set predictions:\n{y_pred}')

### The digits recognition dataset
Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with the MNIST digits recognition dataset, which has 10 classes, the digits 0 through 9! A reduced version of the [MNIST][1] dataset is one of scikit-learn's included datasets, and that is the one we will use in this exercise.

Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type `Bunch`, which are dictionary-like objects. Helpfully for the MNIST dataset, scikit-learn provides an `'images'` key in addition to the `'data'` and `'target'` keys that you have seen with the Iris data. Because it is a 2D array of the images corresponding to each sample, this `'images'` key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see [Chapter 2 of DataCamp's course on Data Visualization with Python][2]). On the other hand, the `'data'` key contains the feature array - that is, the images as a flattened array of 64 pixels.

Notice that you can access the keys of these `Bunch` objects in two different ways: By using the `.` notation, as in `digits.images`, or the `[]` notation, as in `digits['images']`.
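As a small sketch of the two access styles and of how `'images'` relates to `'data'`, the snippet below loads the bundled digits data and compares the arrays. The variable names `same_object` and `flattened` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Attribute access and key access return the same underlying array
same_object = digits.images is digits['images']
print(same_object)

# 'images' holds one 8x8 grid per sample; 'data' holds the same pixels as 64 values
print(digits.images.shape, digits.data.shape)

# Flattening each image row by row reproduces the feature array
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```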

For more on the MNIST data, check out this [exercise][3] in Part 1 of DataCamp's Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model.

**Instructions**

- Import `datasets` from `sklearn` and `matplotlib.pyplot` as `plt`.
- Load the digits dataset using the `.load_digits()` method on `datasets`.
- Print the keys and `DESCR` of digits.
- Print the shape of `images` and `data` keys using the `.` notation.
- Display the 1011th image using `plt.imshow()`. This has been done for you, so hit 'Submit Answer' to see which handwritten digit this happens to be!


  [1]: http://yann.lecun.com/exdb/mnist/
  [2]: https://trenton3983.github.io/files/projects/2020-03-05_intro_to_data_visualization_in_python/2020-03-05_intro_to_data_visualization_in_python.html
  [3]: https://campus.datacamp.com/courses/introduction-to-importing-data-in-python/introduction-and-flat-files-1?ex=10

In [None]:
# Load the digits dataset: digits
digits = load_digits()

# Print the keys and DESCR of the dataset
print(digits.keys())
print(digits.DESCR)

# Print the shape of the images and data keys
print(digits.images.shape)
print(digits.data.shape)

# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

**It looks like the image in question corresponds to the digit '5'. Now, can you build a classifier that can make this prediction not only for this image, but for all the other ones in the dataset? You'll do so in the next exercise!**

### Train/Test Split + Fit/Predict/Accuracy

Now that you have learned about the importance of splitting your data into training and test sets, it's time to practice doing this on the digits dataset! After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the `.score()` method.

**Instructions**

- Import `KNeighborsClassifier` from `sklearn.neighbors` and `train_test_split` from `sklearn.model_selection`.
- Create an array for the features using `digits.data` and an array for the target using `digits.target`.
- Create stratified training and test sets using `0.2` for the size of the test set. Use a random state of `42`. Stratify the split according to the labels so that they are distributed in the training and test sets as they are in the original dataset.
- Create a k-NN classifier with `7` neighbors and fit it to the training data.
- Compute and print the accuracy of the classifier's predictions using the `.score()` method.

In [None]:
# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# predict
pred = knn.predict(X_test)

result = list(zip(pred, y_test))
not_correct = [v for v in result if v[0] != v[1]]
num_correct = len(result) - len(not_correct)

# Print the accuracy
score = knn.score(X_test, y_test)

print(f'Incorrect Result: {not_correct}\nNumber Correct: {num_correct}\nScore: {score:0.2f}')

**Incredibly, this out of the box k-NN classifier with 7 neighbors has learned from the training data and predicted the labels of the images in the test set with 98% accuracy, and it did so in less than a second! This is one illustration of how incredibly useful machine learning techniques can be.**

### Overfitting and underfitting

Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

The training and testing sets are available to you in the workspace as `X_train`, `X_test`, `y_train`, `y_test`. In addition, `KNeighborsClassifier` has been imported from `sklearn.neighbors`.

**Instructions**

- Inside the for loop:
  - Setup a k-NN classifier with the number of neighbors equal to `k`.
  - Fit the classifier with `k` neighbors to the training data.
  - Compute accuracy scores on the training set and test set separately using the `.score()` method and assign the results to the `train_accuracy` and `test_accuracy` arrays respectively.

In [None]:
# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('KNN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()

**It looks like the test accuracy is highest when using 3 and 5 neighbors. Using 8 neighbors or more seems to result in a simple model that underfits the data. Now that you've grasped the fundamentals of classification, you will learn about regression in the next chapter!**
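One way to build intuition for why very small `k` leans toward overfitting: with `k=1`, every training point is its own nearest neighbor, so training accuracy is perfect regardless of how noisy the labels are. The sketch below uses a hand-written k-NN on synthetic data (not the digits data); `knn_predict` is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict labels by majority vote among the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k closest training points for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    # Majority vote per row (ties resolve to the smallest label)
    return np.array([np.bincount(row).argmax() for row in votes])

rng = np.random.default_rng(0)
# Two overlapping 2-D blobs (synthetic, illustrative only)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Accuracy on the training data itself for a small and a larger k
train_acc_k1 = (knn_predict(X, y, X, k=1) == y).mean()
train_acc_k15 = (knn_predict(X, y, X, k=15) == y).mean()
print(train_acc_k1, train_acc_k15)
```

A perfect training score at `k=1` says nothing about test performance, which is why the curve above compares both.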

# Regression

In the previous chapter, you used image and political datasets to predict binary and multiclass outcomes. But what if your problem requires a continuous outcome? Regression is best suited to solving such problems. You will learn about fundamental concepts in regression and apply them to predict the life expectancy in a given country using Gapminder data.

## Introduction to regression

- In regression tasks, the target value is a continuously varying variable, such as a country's GDP or the price of a house.
- The first regression task will be using the Boston housing dataset.
- The data can be loaded from a CSV or scikit-learn's built-in datasets.
- `'CRIM'` is per capita crime rate
- `'NOX'` is nitric oxides concentration
- `'RM'` is average number of rooms per dwelling
- The target variable, `'MEDV'`, is the median value of owner occupied homes in thousands of dollars

**Creating feature and target arrays**
- Recall that scikit-learn wants `features` and `target` values in distinct arrays, `X` and `y`.
- Using the [`.values`][values] attribute returns the `NumPy` arrays.
  - `pandas` documentation recommends using [`.to_numpy`][tonumpy]

**Predicting house value from a single feature**
- As a first task, let's try to predict the price from a single feature: the average number of rooms
- The average number of rooms, `'RM'`, is the sixth column (index `5`)
- To reshape the arrays, use the `.reshape` method to keep the first dimension, but add another dimension of size one to `X`.
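scikit-learn estimators expect the feature array to be two-dimensional, with shape `(n_samples, n_features)`. A minimal sketch of the reshape described above, using made-up room counts:

```python
import numpy as np

rooms = np.array([6.5, 5.9, 7.1, 6.0])   # one feature, four samples (made-up values)
print(rooms.shape)

# -1 lets NumPy infer the number of rows; 1 adds a single feature column
rooms_2d = rooms.reshape(-1, 1)
print(rooms_2d.shape)
```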

**Fitting a regression model**
- Instantiate [`sklearn.linear_model.LinearRegression`][linreg]
- Fit the model by passing in the data and target
- Check out the regressor's predictions over the range of the data with [`np.linspace`][linsp] between the `min` and `max` values of `X_rooms`.


  [values]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html
  [tonumpy]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy
  [linreg]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
  [linsp]: https://numpy.org/doc/stable/reference/generated/numpy.linspace.html

In [None]:
boston = pd.read_csv(data_paths[1])
display(boston.head())

# creating features and target arrays
X = boston.drop('MEDV', axis=1).to_numpy()
y = boston.MEDV.to_numpy()

# predict from a single feature
X_rooms = X[:, 5]

# check variable type
print(f'X_rooms type: {type(X_rooms)}, shape: {X_rooms.shape}\ny type: {type(y)}, shape: {y.shape}')

# reshape
X_rooms = X_rooms.reshape(-1, 1)
y = y.reshape(-1, 1)
print(f'X_rooms shape: {X_rooms.shape}\ny shape: {y.shape}')

# instantiate model
reg = LinearRegression()

# fit a linear model
reg.fit(X_rooms, y)

# data range variable
pred_space = np.linspace(X_rooms.min(), X_rooms.max()).reshape(-1, 1)

# plot house value as a function of rooms
sns.scatterplot(data=boston, x='RM', y='MEDV', label='Data')
plt.plot(pred_space, reg.predict(pred_space), color='k', lw=3, label='Regression')
plt.legend(loc='lower right')
plt.xlabel('Number of Rooms')
plt.ylabel('Value of house /1000 ($)')
plt.show()

### Which of the following is a regression problem?

Andy introduced regression to you using the Boston housing dataset. But regression models can be used in a variety of contexts to solve a variety of different problems.

Given below are four example applications of machine learning. Your job is to pick the one that is best framed as a regression problem.

**Answer the question**

- ~~An e-commerce company using labeled customer data to predict whether or not a customer will purchase a particular item.~~
- ~~A healthcare company using data about cancer tumors (such as their geometric measurements) to predict whether a new tumor is benign or malignant.~~
- ~~A restaurant using review data to ascribe positive or negative sentiment to a given review.~~
- **A bike share company using time and weather data to predict the number of bikes being rented at any given hour.**
  - The target variable here - the number of bike rentals at any given hour - is quantitative, so this is best framed as a regression problem.

### Importing data for supervised learning

In this chapter, you will work with [Gapminder][gm] data that we have consolidated into one CSV file available in the workspace as `'gapminder.csv'`. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country's GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.

Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: `'fertility'`, which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.

Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's `.reshape()` method. Don't worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.

**Instructions**

- Import `numpy` and `pandas` as their standard aliases.
- Read the file `'gapminder.csv'` into a DataFrame `df` using the `read_csv()` function.
- Create array `X` for the `'fertility'` feature and array `y` for the `'life'` target variable.
- Reshape the arrays by using the `.reshape()` method and passing in `-1` and `1`.


  [gm]: https://www.gapminder.org/data/

In [None]:
# Read the CSV file into a DataFrame: df
df = pd.read_csv(data_paths[3])

# Create arrays for features and target variable
y = df.life.values
X = df.fertility.values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1, 1)
X = X.reshape(-1, 1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))

### Exploring the Gapminder data

As always, it is important to explore your data before building models. On the right, we have constructed a heatmap showing the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as `df` and is available for exploration in the IPython Shell. Cells that are in green show positive correlation, while cells that are in red show negative correlation. Take a moment to explore this: Which features are positively correlated with `life`, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as `.info()`, `.describe()`, `.head()`.

In case you are curious, the heatmap was generated using [Seaborn's heatmap function][sbhm] and the following line of code, where `df.corr()` computes the pairwise correlation between columns:

`sns.heatmap(df.corr(), square=True, cmap='RdYlGn')`

Once you have a feel for the data, consider the statements below and select the one that is not true. After this, Hugo will explain the mechanics of linear regression in the next video and you will be on your way building regression models!

**Instructions**

- The DataFrame has `139` samples (or rows) and `9` columns.
- `life` and `fertility` are negatively correlated.
- The mean of `life` is `69.602878`.
- ~~`fertility` is of type `int64`.~~
- `GDP` and `life` are positively correlated.


  [sbhm]: http://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
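# restrict to numeric columns; recent pandas versions raise an error if non-numeric columns are passed to .corr()
sns.heatmap(df.select_dtypes('number').corr(), square=True, cmap='RdYlGn')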

## The basics of linear regression

- How does linear regression work?

**Regression mechanics**
- We want to fit a line to the data, and a line in two dimensions is always of the form $y=a*x+b$, where $y$ is the _target_, $x$ is the single feature, and $a$ and $b$ are the parameters of the model that we want to learn.
- The question of fitting is reduced to: how do we choose $a$ and $b$?
- A common method is to define an error function for any given line, and then choose the line that minimizes the error function.
  - Such an error function is also called a loss or a cost function.
  
**The loss function**
- What will our loss function be?
- We want the line to be as close to the actual data points as possible.
  - For this reason, we wish to minimize the vertical distance between the fit, and the data.
- For each data point, calculate the vertical distance between it and the line.
  - This distance is called a residual.
- We could try to minimize the sum of the residuals, but then a large positive residual would cancel out a large negative residual.
  - For this reason, **we minimize the sum of the squares of the residuals**.
  - This will be the loss function, and using this loss function is commonly called _ordinary least squares (OLS)_.
  - ![][loss]
  - Note this is the same as minimizing the mean squared error of the predictions on the training set.
    - See the statistics curriculum for more detail.
- When `.fit` is called on a linear regression model in scikit-learn, it performs this OLS, under the hood.
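The loss described above can be sketched by hand for a single feature. The data values are made up, and the closed-form expressions for $a$ and $b$ are the standard least-squares estimates rather than anything specific to scikit-learn's internals.

```python
import numpy as np

# Synthetic data roughly following y = 2x + 1 (values are illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form OLS estimates for a single feature
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# Residuals are the vertical distances between the data and the line
residuals = y - (a * x + b)
loss = np.sum(residuals ** 2)   # sum of squared residuals

# A line with a slightly different slope has a larger loss
other_loss = np.sum((y - ((a + 0.1) * x + b)) ** 2)
print(a, b, loss, other_loss)
```

Note that the plain sum of the residuals is (numerically) zero for the fitted line, which is why the squares are summed instead.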

**Linear regression in higher dimensions**
- When we have two features and one target, the model is of the form $y=a_{1}x_{1}+a_{2}x_{2}+b$, so fitting a linear regression model means specifying three variables: $a_{1}$, $a_{2}$, and $b$.
- In higher dimensions, with more than two features, the model is of the form $y=a_{1}x_{1}+a_{2}x_{2}+a_{3}x_{3}+...+a_{n}x_{n}+b$, so fitting a linear regression model means specifying a coefficient, $a_{i}$, for each feature, as well as the variable $b$.
- The scikit-learn API works exactly the same in this case: pass two arrays to the `.fit` method, one containing the _features_ and the other the _target_ variable.
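As a rough sketch of the multi-feature case, the snippet below recovers known coefficients from noiseless synthetic data with NumPy's least-squares solver. The names `true_coefs` and `true_intercept` and their values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))             # 100 samples, 3 features (synthetic)
true_coefs = np.array([1.5, -2.0, 0.5])   # illustrative a_1, a_2, a_3
true_intercept = 4.0                      # illustrative b
y = X @ true_coefs + true_intercept       # noiseless target for clarity

# Append a column of ones so the intercept b is learned alongside the a_i
X_design = np.column_stack([X, np.ones(len(X))])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)

coefs, intercept = params[:-1], params[-1]
print(coefs, intercept)
```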

**Linear regression on all _Boston Housing_ features**
- The default scoring method for linear regression is called $R^2$.
  - This metric quantifies the amount of variance in the target variable that is predicted from the feature variables.
    - See the scikit-learn documentation, and the DataCamp statistics curriculum for more details.
  - To compute $R^2$, apply the [.score][score] method to the model, and pass it two arguments, the features and target data.
- Generally, linear regression is not used out of the box like this; you will most likely wish to use _regularization_, which we'll see soon, and which places further constraints on the model coefficients.
- Learning about linear regression and how to use it in scikit-learn, is an essential first step toward using regularized linear models.
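For reference, $R^2$ can be computed by hand as $1 - SS_{res}/SS_{tot}$. The helper below is a hypothetical sketch of that definition on made-up numbers, not scikit-learn's own code.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y_true, y_true)                    # predictions match exactly
mean_only = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])    # always predicting the mean
print(perfect, mean_only)
```

Perfect predictions give a score of 1, while a model that always predicts the mean of the target gives 0; a model worse than that yields a negative score.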

  
  [loss]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2020-10-14_supervised_learning_sklearn/loss.JPG
  [score]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score

In [None]:
boston.head()

In [None]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(boston.drop('MEDV', axis=1), boston.MEDV, test_size=0.3, random_state=42)

# instantiate the regressor
reg_all = LinearRegression()

# fit on the training set
reg_all.fit(X_train, y_train)

# predict on the test set
y_pred = reg_all.predict(X_test)

# score the model
score = reg_all.score(X_test, y_test)

print(f'Model Score: {score:0.3f}')

### Fit & predict for regression

Now, you will fit a linear regression and predict life expectancy using just one feature. You saw Andy do this earlier using the `'RM'` feature of the Boston housing dataset. In this exercise, you will use the `'fertility'` feature of the Gapminder dataset. Since the goal is to predict life expectancy, the target variable here is `'life'`. The array for the target variable has been pre-loaded as `y` and the array for `'fertility'` has been pre-loaded as `X_fertility`.

A scatter plot with `'fertility'` on the x-axis and `'life'` on the y-axis has been generated. As you can see, there is a strongly negative correlation, so a linear regression should be able to capture this trend. Your job is to fit a linear regression and then predict the life expectancy, overlaying these predicted values on the plot to generate a regression line. You will also compute and print the $R^2$ score using scikit-learn's `.score()` method.

**Instructions**

- Import `LinearRegression` from `sklearn.linear_model`.
- Create a `LinearRegression` regressor called `reg`.
- Set up the prediction space to range from the minimum to the maximum of `X_fertility`. This has been done for you.
- Fit the regressor to the data (`X_fertility` and `y`) and compute its predictions using the `.predict()` method and the `prediction_space` array.
- Compute and print the $R^2$ score using the `.score()` method.
- Overlay the plot with your linear regression line. This has been done for you, so hit 'Submit Answer' to see the result!

In [None]:
df.head()

In [None]:
X_fertility = df.fertility.to_numpy().reshape(-1, 1)
y = df.life.to_numpy().reshape(-1, 1)

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(df.fertility.min(), df.fertility.max()).reshape(-1, 1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2 
score = reg.score(X_fertility, y)
print(f'Score: {score}')

# Plot regression line
sns.scatterplot(data=df, x='fertility', y='life')
plt.xlabel('Fertility')
plt.ylabel('Life Expectancy')
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()

**Notice how the line captures the underlying trend in the data. And the performance is quite decent for this basic regression model with only one feature.**

### Train/test split for regression

As you learned in Chapter 1, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over **all** features. In addition to computing the $R^2$ score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array `X` and target variable array `y` have been pre-loaded for you from the DataFrame `df`.

**Instructions**

- Import `LinearRegression` from `sklearn.linear_model`, [`mean_squared_error`][mse] from `sklearn.metrics`, and `train_test_split` from `sklearn.model_selection`.
- Using `X` and `y`, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of `42`.
- Create a linear regression regressor called `reg_all`, fit it to the training set, and evaluate it on the test set.
- Compute and print the $R^2$ score using the `.score()` method on the test set.
- Compute and print the RMSE. To do this, first compute the Mean Squared Error using the `mean_squared_error()` function with the arguments `y_test` and `y_pred`, and then take its square root using `np.sqrt()`.

  [mse]: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
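The RMSE calculation in the last instruction can be sketched directly with NumPy. The arrays below are made-up values standing in for `y_test` and `y_pred`.

```python
import numpy as np

y_test = np.array([70.0, 65.0, 80.0, 75.0])   # made-up life expectancies
y_pred = np.array([72.0, 64.0, 77.0, 75.0])   # made-up predictions

mse = np.mean((y_test - y_pred) ** 2)   # mean squared error
rmse = np.sqrt(mse)                     # back in the units of the target (years)
print(mse, rmse)
```

Because the square root undoes the squaring, RMSE is expressed in the same units as the target, which makes it easier to interpret than MSE.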

In [None]:
X = df.drop(['life', 'Region'], axis=1).to_numpy()
y = df.life.to_numpy()

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print(f"R^2: {reg_all.score(X_test, y_test):0.3f}")
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error: {rmse:0.3f}")

**Using all features has improved the model score. This makes sense, as the model has more information to learn from. However, there is one potential pitfall to this process. Can you spot it? You'll learn about this, as well as how to better validate your models, in the next section.**

## Cross-validation

- You're now also becoming more acquainted with train test split, and computing model performance metrics on the test set.
- Can you spot a potential pitfall of this process?
  - If you're computing $R^2$ on your test set, the $R^2$ returned, is dependent on the way the data is split.
  - The data points in the test set may have some peculiarities that mean the $R^2$ computed on it, is not representative of the model's ability to generalize to unseen data.
- To combat this dependence on what is essentially an arbitrary split, we use a technique called _cross-validation_.
- ![][cv]
  - Begin by splitting the dataset into five groups, or folds.
  - Hold out the first fold as a test set, fit the model on the remaining 4 folds, predict on the test set, and compute the metric of interest.
  - Next, hold out the second fold as the test set, fit on the remaining data, predict on the test set, and compute the metric of interest.
  - Then, similarly, with the third, fourth and fifth fold.
  - As a result, there are five values of $R^2$ from which statistics of interest can be computed, such as mean, median, and 95% confidence interval.
- As the dataset is split into 5 folds, this process is called _5-fold cross validation_.
  - 10 folds would be _10-fold cross validation_.
- Generally, if _k_ folds are used, it is called _k-fold cross validation_ or _k-fold CV_.
- The trade-off is, more folds are computationally more expensive, because there is more fitting and predicting.
- This method avoids the problem of the metric of choice being dependent on the train test split.
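The fold-by-fold procedure above can be sketched by hand. This is an illustrative version on synthetic data with hypothetical helpers `fit_ols` and `r_squared`; it uses contiguous, unshuffled folds and is not how scikit-learn implements it internally.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the parameter vector."""
    X_design = np.column_stack([X, np.ones(len(X))])
    params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return params

def r_squared(X, y, params):
    """R^2 of the fitted parameters on (X, y)."""
    pred = np.column_stack([X, np.ones(len(X))]) @ params
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                  # synthetic features
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, 100)    # synthetic target

k = 5
folds = np.array_split(np.arange(len(X)), k)   # k contiguous index blocks

scores = []
for i, test_idx in enumerate(folds):
    # Train on every fold except the held-out one
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    params = fit_ols(X[train_idx], y[train_idx])
    scores.append(r_squared(X[test_idx], y[test_idx], params))

print(np.round(scores, 3), np.round(np.mean(scores), 3))
```

Every sample is used for testing exactly once and for training `k - 1` times, which is where the extra computational cost comes from.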


**Cross-validation in scikit-learn**
- [`sklearn.model_selection.cross_val_score`][cvs]
- This returns an array of _cross-validation_ scores, which are assigned to `cv_results`
- The length of the array is the number of folds specified by the `cv` parameter.
- The reported score is $R^2$, the default score for linear regression
- We can also compute the `mean`



  [cv]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2020-10-14_supervised_learning_sklearn/crossvalidation.JPG
  [cvs]: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [None]:
# instantiate the model
reg = LinearRegression()

# call cross_val_score
cv_results = cross_val_score(reg, boston.drop('MEDV', axis=1), boston.MEDV, cv=5)

print(f'Scores: {np.round(cv_results, 3)}')
print(f'Scores mean: {np.round(np.mean(cv_results), 3)}')

### 5-fold cross-validation

Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.

In this exercise, you will practice 5-fold cross validation on the Gapminder data. By default, scikit-learn's `cross_val_score()` function uses $R^2$ as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.

The DataFrame has been loaded as `df` and split into the feature/target variable arrays `X` and `y`. The modules `pandas` and `numpy` have been imported as `pd` and `np`, respectively.

**Instructions**

- Import `LinearRegression` from `sklearn.linear_model` and `cross_val_score` from `sklearn.model_selection`.
- Create a linear regression regressor called `reg`.
- Use the `cross_val_score()` function to perform 5-fold cross-validation on `X` and `y`.
- Compute and print the average cross-validation score. You can use NumPy's `mean()` function to compute the average.

In [None]:
# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(f'Scores: {np.round(cv_scores, 3)}')

print(f'Scores mean: {np.round(np.mean(cv_scores), 3)}')

**Now that you have cross-validated your model, you can more confidently evaluate its predictions.**

### K-Fold CV comparison

Cross validation is essential but do not forget that the more folds you use, the more computationally expensive cross-validation becomes. In this exercise, you will explore this for yourself. Your job is to perform 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.

In the IPython Shell, you can use `%timeit` to see how long 3-fold CV takes compared to 10-fold CV by executing the following with `cv=3` and `cv=10`:

`%timeit cross_val_score(reg, X, y, cv = ____)`

`pandas` and `numpy` are available in the workspace as `pd` and `np`. The DataFrame has been loaded as `df` and the feature/target variable arrays `X` and `y` have been created.

**Instructions**

- Import `LinearRegression` from `sklearn.linear_model` and `cross_val_score` from `sklearn.model_selection`.
- Create a linear regression regressor called `reg`.
- Perform 3-fold CV and then 10-fold CV. Compare the resulting mean scores.

In [None]:
# Create a linear regression object: reg
reg = LinearRegression()

# Perform 3-fold CV
cvscores_3 = cross_val_score(reg, X, y, cv=3)
print(f'cv=3 scores mean: {np.round(np.mean(cvscores_3), 3)}')

# Perform 10-fold CV
cvscores_10 = cross_val_score(reg, X, y, cv=10)
print(f'cv=10 scores mean: {np.round(np.mean(cvscores_10), 3)}')

In [None]:
cv3 = %timeit -n10 -r3 -q -o cross_val_score(reg, X, y, cv=3)
cv10 = %timeit -n10 -r3 -q -o cross_val_score(reg, X, y, cv=10)

print(f'cv=3 time: {cv3}\ncv=10 time: {cv10}')

## Regularized regression

**Why regularize?**
- Recall that linear regression minimizes a loss function to choose a coefficient, $a_{i}$, for each feature variable.
- If we allow these coefficients, or parameters, to be very large, we can get _overfitting_.
- It isn't easy to see in two dimensions, but when there are many features, that is, when the data sit in a high-dimensional space, large coefficients make it easy to predict nearly anything.
- For this reason, it's common practice to alter the loss function so that it penalizes large coefficients.
  - This is called _**Regularization**_.

**Ridge regression**
- The first type of regularized regression that we'll look at is called _ridge regression_, in which our loss function is the standard _OLS_ loss function plus the squared value of each coefficient, multiplied by some constant, $\alpha$.
  - $\text{Loss function}=\text{OLS loss function}+\alpha*\sum_{i=1}^n a_{i}^2$
  - Thus, when minimizing the loss function to fit to our data, models are penalized for coefficients with a large magnitude: large positive and large negative coefficients.
  - Note, $\alpha$ is a parameter we need to choose in order to fit and predict.
  - Essentially, we can select the $\alpha$ for which our model performs best.
  - Picking $\alpha$ for ridge regression is similar to picking `k` in `KNN`.
- This is called _hyperparameter tuning_, and we'll see much more of this in section 3.
- This $\alpha$, which you may also see called $\lambda$ in the wild, can be thought of as a parameter that controls the model complexity.
- Notice when $\alpha = 0$, we get back $\text{OLS}$, which can lead to _overfitting_.
  - Large coefficients, in this case, are not penalized, and the _overfitting_ problem is not accounted for.
- A very high $\alpha$ means large coefficients are significantly penalized, which can lead to a model that's too simple, and end up _underfitting_ the data.
- The method of performing _ridge regression_ with scikit-learn, mirrors the other models we have seen.
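To make the effect of $\alpha$ concrete, the sketch below minimizes the ridge loss above in closed form on synthetic data. It assumes no intercept term for brevity, and `ridge_fit` is a hypothetical helper rather than scikit-learn's solver.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept): (X^T X + alpha I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                                        # synthetic features
y = X @ np.array([4.0, -3.0, 0.0, 1.0]) + rng.normal(0, 1.0, 50)    # synthetic target

# Plain OLS solution for comparison with alpha = 0
ols_coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

alphas = [0.0, 1.0, 10.0, 100.0]
norms = []
for alpha in alphas:
    coefs = ridge_fit(X, y, alpha)
    norms.append(np.linalg.norm(coefs))
    # The overall size of the coefficient vector shrinks as alpha grows
    print(alpha, np.round(coefs, 3), np.round(norms[-1], 3))
```

With `alpha=0` the result matches ordinary least squares, and larger values pull the coefficients toward zero without making them exactly zero.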

**Ridge regression in scikit-learn**
- [`sklearn.linear_model.Ridge`][ridge]
- Set $\alpha$ with the `alpha` parameter.
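- Setting the `normalize` parameter to `True` ensures all the variables are on the same scale, which will be covered later in more depth.
  - The `normalize` parameter was deprecated in scikit-learn 1.0 and removed in 1.2; with newer versions, scale the features before fitting instead.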
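For scikit-learn versions where `normalize` is unavailable (1.2 and later), one common alternative is to standardize the features in a pipeline before the ridge step. This is a sketch on synthetic data; `StandardScaler` rescales differently from the old `normalize=True` option, so the fitted coefficients should not be expected to match the original results exactly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 2.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(0, 0.5, 200)

# Standardize the features, then fit ridge regression on the scaled values
model = make_pipeline(StandardScaler(), Ridge(alpha=0.1))
model.fit(X, y)

print(round(model.score(X, y), 3))
print(model.named_steps['ridge'].coef_)
```

Putting the scaler inside the pipeline means it is fitted on the training data only whenever the pipeline is used with `train_test_split` or `cross_val_score`.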

**Lasso regression**
- There is another type of _regularized regression_ called _lasso regression_, in which our loss function is the standard _OLS_ loss function, plus the absolute value of each coefficient, multiplied by some constant, $\alpha$.
  - $\text{Loss function}=\text{OLS loss function}+\alpha*\sum_{i=1}^n |a_{i}|$

**Lasso regression in scikit-learn**
- [`sklearn.linear_model.Lasso`][lasso]
- Lasso regression in scikit-learn, mirrors ridge regression.

**Lasso regression for feature selection**
- One of the useful aspects of _lasso regression_ is it can be used to select important features of a dataset.
  - This is because it tends to reduce the coefficients of less important features to be exactly zero.
  - The features whose coefficients are not shrunk to zero, are 'selected' by the `LASSO` algorithm.
- Plotting the coefficients as a function of feature name yields the graph below, and you can see directly that the most important predictor for our target variable, `housing price`, is the number of rooms, `'RM'`.
- This is not surprising, and is a great sanity check.
- This type of feature selection is very important for machine learning in an industry or business setting, because it allows you, as the Data Scientist, to communicate important results to non-technical colleagues.
- The power of reporting important features from a linear model, cannot be overestimated.
- It is also valuable in research science, in order to identify which factors are important predictors for various physical phenomena.
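To see why lasso can set coefficients to exactly zero while ridge only shrinks them, consider the special case of orthonormal features ($X^TX = I$). Under the loss definitions above, ridge gives $a_i = \beta_i / (1 + \alpha)$ and lasso gives $a_i = \text{sign}(\beta_i)\max(|\beta_i| - \alpha/2, 0)$, where $\beta_i$ is the OLS coefficient. The sketch below applies both to made-up coefficients; the function names are hypothetical, and scikit-learn's `Lasso` scales its objective differently, so its `alpha` is not directly comparable.

```python
import numpy as np

def ridge_orthonormal(beta_ols, alpha):
    """Ridge solution when features are orthonormal: uniform shrinkage."""
    return beta_ols / (1.0 + alpha)

def lasso_orthonormal(beta_ols, alpha):
    """Lasso solution when features are orthonormal: soft-thresholding at alpha / 2."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha / 2.0, 0.0)

beta_ols = np.array([5.0, -3.0, 0.4, -0.2])   # illustrative OLS coefficients
alpha = 1.0

ridge_coefs = ridge_orthonormal(beta_ols, alpha)
lasso_coefs = lasso_orthonormal(beta_ols, alpha)
selected = np.flatnonzero(lasso_coefs)        # indices of the 'selected' features

print(ridge_coefs)   # every coefficient shrinks, none is exactly zero
print(lasso_coefs)   # small coefficients are set to exactly zero
print(selected)
```

The two small coefficients fall below the threshold and are dropped, which is the behavior that makes lasso usable for feature selection.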

  [ridge]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
  [lasso]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

#### Ridge Regression

In [None]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(boston.drop('MEDV', axis=1), boston.MEDV, test_size=0.3, random_state=42)

# instantiate the model
# note: `normalize` was removed in scikit-learn 1.2; this cell requires an older version
ridge = Ridge(alpha=0.1, normalize=True)

# fit the model
ridge.fit(X_train, y_train)

# predict on the test data
ridge_pred = ridge.predict(X_test)

# get the score
rs = ridge.score(X_test, y_test)

print(f'Ridge Score: {round(rs, 4)}')

#### Lasso Regression

In [None]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(boston.drop('MEDV', axis=1), boston.MEDV, test_size=0.3, random_state=42)

# instantiate the regressor
# note: `normalize` was removed in scikit-learn 1.2; this cell requires an older version
lasso = Lasso(alpha=0.1, normalize=True)

# fit the model
lasso.fit(X_train, y_train)

# predict on the test data
lasso_pred = lasso.predict(X_test)

# get the score
ls = lasso.score(X_test, y_test)

print(f'Lasso Score: {round(ls, 4)}')

#### Lasso Regression for Feature Selection

In [None]:
# store the feature names
names = boston.drop('MEDV', axis=1).columns

# instantiate the regressor
lasso = Lasso(alpha=0.1)

# extract and store the coef attribute
lasso_coef = lasso.fit(boston.drop('MEDV', axis=1), boston.MEDV).coef_

plt.plot(range(len(names)), lasso_coef)
plt.xticks(range(len(names)), names, rotation=60)
plt.ylabel('Coefficients')
plt.grid()
plt.show()

### Questions

# Fine-tuning your model

Having trained your model, your next task is to evaluate its performance. In this chapter, you will learn about some of the other metrics available in scikit-learn that will allow you to assess your model's performance in a more nuanced manner. Next, learn to optimize your classification and regression models using hyperparameter tuning.

## How good is your model

### Questions

## Logistic regression and the ROC curve

### Questions

## Area under the ROC curve

### Questions

## Hyperparameter tuning

### Questions

## Hold-out set for final evaluation

### Questions

# Preprocessing and pipelines

This chapter introduces pipelines, and how scikit-learn allows for transformers and estimators to be chained together and used as a single unit. Preprocessing techniques will be introduced as a way to enhance model performance, and pipelines will tie together concepts from previous chapters.
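
As a rough preview of what chaining looks like, the sketch below builds a `Pipeline` from two transformers and one estimator and uses it as a single unit. The choice of steps (`SimpleImputer`, `StandardScaler`, `KNeighborsClassifier`), the iris data, and the artificially removed values are all assumptions made for illustration, not the course's exercise code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# blank out roughly 5% of the values to simulate missing data (illustration only)
rng = np.random.default_rng(42)
X = X.copy()
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# transformers first, estimator last; the whole chain behaves like one model
pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # fill missing values
    ('scaler', StandardScaler()),                 # center and scale
    ('knn', KNeighborsClassifier(n_neighbors=6)), # final estimator
])

pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f'Pipeline accuracy: {round(accuracy, 4)}')
```

Calling `fit` on the pipeline fits each transformer on the training data in order, and `score` applies the same fitted transformations to the test data before evaluating the classifier.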

## Preprocessing data

### Questions

## Handling missing data

### Questions

## Centering and scaling



### Questions

# Certificate

![](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/jpd_dir/file.jpg)