- Notebook Author: [Trenton McKinney][1]
- Course: **[DataCamp: Supervised Learning with scikit-learn][2]**
 - This [notebook][3] was created as a reproducible reference.
 - The material is from the course
 - I completed the exercises
 - If you find the content beneficial, consider a [DataCamp Subscription][4].
 - I added a function (**`create_dir_save_file`**) to automatically download and save the required data (`data/2020-10-14_supervised_learning_sklearn`) and image (`Images/2020-10-14_supervised_learning_sklearn`) files.

  [1]: https://trenton3983.github.io/
  [2]: https://learn.datacamp.com/courses/supervised-learning-with-scikit-learn
  [3]: https://github.com/trenton3983/DataCamp/blob/master/2020-10-14_supervised_learning_sklearn.ipynb
  [4]: https://www.datacamp.com/pricing

#### Course Description

Machine learning is the field that teaches machines and computers to learn from existing data to make predictions on new data: Will a tumor be benign or malignant? Which of your customers will take their business elsewhere? Is a particular email spam? In this course, you'll learn how to use Python to perform supervised learning, an essential component of machine learning. You'll learn how to build predictive models, tune their parameters, and determine how well they will perform with unseen data—all while using real world datasets. You'll be using scikit-learn, one of the most popular and user-friendly machine learning libraries for Python.

#### Imports

In [None]:
import pandas as pd
import numpy as np
from pprint import pprint as pp
from itertools import combinations
import requests
from pathlib import Path
from sklearn.datasets import load_iris
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

#### Configuration Options

In [None]:
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
# plt.style.use('ggplot')
plt.rcParams["patch.force_edgecolor"] = True

#### Functions

In [None]:
def create_dir_save_file(dir_path: Path, url: str):
    """
    Check if the path exists and create it if it does not.
    Check if the file exists and download it if it does not.
    """
    if not dir_path.parents[0].exists():
        dir_path.parents[0].mkdir(parents=True)
        print(f'Directory Created: {dir_path.parents[0]}')
    else:
        print('Directory Exists')
        
    if not dir_path.exists():
        r = requests.get(url, allow_redirects=True)
        # Use a context manager so the file handle is closed after writing
        with open(dir_path, 'wb') as f:
            f.write(r.content)
        print(f'File Created: {dir_path.name}')
    else:
        print('File Exists')

In [None]:
data_dir = Path('data/2020-10-14_supervised_learning_sklearn')
images_dir = Path('Images/2020-10-14_supervised_learning_sklearn')

#### Datasets

In [None]:
file_mpg = 'https://assets.datacamp.com/production/repositories/628/datasets/3781d588cf7b04b1e376c7e9dda489b3e6c7465b/auto.csv'
file_housing = 'https://assets.datacamp.com/production/repositories/628/datasets/021d4b9e98d0f9941e7bfc932a5787b362fafe3b/boston.csv'
file_diabetes = 'https://assets.datacamp.com/production/repositories/628/datasets/444cdbf175d5fbf564b564bd36ac21740627a834/diabetes.csv'
file_gapminder = 'https://assets.datacamp.com/production/repositories/628/datasets/a7e65287ebb197b1267b5042955f27502ec65f31/gm_2008_region.csv'
file_voting = 'https://assets.datacamp.com/production/repositories/628/datasets/35a8c54b79d559145bbeb5582de7a6169c703136/house-votes-84.csv'
file_wwine = 'https://assets.datacamp.com/production/repositories/628/datasets/2d9076606fb074c66420a36e06d7c7bc605459d4/white-wine.csv'
file_rwine = 'https://assets.datacamp.com/production/repositories/628/datasets/013936d2700e2d00207ec42100d448c23692eb6f/winequality-red.csv'

In [None]:
datasets = [file_mpg, file_housing, file_diabetes, file_gapminder, file_voting, file_wwine, file_rwine]
data_paths = list()

for data in datasets:
    file_name = data.split('/')[-1].replace('?raw=true', '')
    data_path = data_dir / file_name
    create_dir_save_file(data_path, data)
    data_paths.append(data_path)

#### DataFrames

# Classification

In this chapter, you will be introduced to classification problems and learn how to solve them using supervised learning techniques. And you'll apply what you learn to a political dataset, where you classify the party affiliation of United States congressmen based on their voting records.

## Supervised learning

**What is machine learning?**
- The art and science of giving computers the ability to learn to make decisions from data without being explicitly programmed.
- Examples:
  - Your computer can learn to predict whether an email is spam or not spam, given its content and sender.
  - Your computer can learn to cluster, say, Wikipedia entries, into different categories based on the words they contain.
    - It could then assign any new Wikipedia article to one of the existing clusters.
- Note that, in the first example, we are trying to predict a particular class label, that is, spam or not spam.
- In the second example, there is no such label.
- When labels are present, we call it supervised learning.
- When no labels are present, we call it unsupervised learning.

**Unsupervised learning**
- In essence, unsupervised learning is the machine learning task of uncovering hidden patterns and structures from unlabeled data.
- Example:
  - A business may wish to group its customers into distinct categories based on their purchasing behavior, without knowing in advance what these categories might be.
    - This is known as clustering, one branch of unsupervised learning.
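
The customer-grouping example can be sketched as follows. This is a minimal illustration, not course material: the purchasing data is synthetic, the feature meanings (orders per year, average order value) are assumptions, and `KMeans` is just one possible clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical purchasing data: [orders per year, average order value]
low_spenders = rng.normal(loc=[5, 20], scale=1.0, size=(50, 2))
high_spenders = rng.normal(loc=[30, 200], scale=1.0, size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

# No labels are passed to the model: it only sees the features
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(customers)

print(cluster_ids[:5], cluster_ids[-5:])
```

The cluster ids are arbitrary integers; which group ends up as `0` or `1` carries no meaning, which is one practical difference from supervised labels.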

**Reinforcement learning**
- Machines or software agents interact with an environment.
  - Reinforcement learning agents are able to automatically figure out how to optimize their behavior given a system of rewards and punishments.
- Reinforcement learning draws inspiration from behavioral psychology and has applications in many fields, such as economics, genetics, and game playing.
- In 2016, reinforcement learning was used to train Google DeepMind's AlphaGo, which was the first computer program to beat the world champion in Go.

**Supervised learning**
- In supervised learning, we have several data points or samples, described using predictor variables or features and a target variable.
- Our data is commonly represented in a table structure such as the one below, in which there is a row for each data point and a column for each feature.
```text
|    |                        Predictor Variables                      | Target    |
|    |   sepal_length |   sepal_width |   petal_length |   petal_width | species   |
|---:|---------------:|--------------:|---------------:|--------------:|:----------|
|  0 |            5.1 |           3.5 |            1.4 |           0.2 | setosa    |
|  1 |            4.9 |           3   |            1.4 |           0.2 | setosa    |
|  2 |            4.7 |           3.2 |            1.3 |           0.2 | setosa    |
|  3 |            4.6 |           3.1 |            1.5 |           0.2 | setosa    |
|  4 |            5   |           3.6 |            1.4 |           0.2 | setosa    |
```
- Here, we see the iris dataset: each row represents measurements of a different flower and each column is a particular kind of measurement, like the width and length of a certain part of the flower.
- The aim of supervised learning is to build a model that is able to predict the target variable, here, the particular species of a flower, given the predictor variables, the physical measurements.
- _If the target variable consists of categories_, like `'click'` or `'no click'`, `'spam'` or `'not spam'`, or different species of flowers, we call the learning task, **classification**.
- Alternatively, _if the target is a continuously varying variable_, such as the price of a house, it is a **regression** task.
- This chapter will focus on classification, the following, on regression.
- The goal of supervised learning is frequently to either automate a time-consuming or expensive manual task, such as a doctor's diagnosis, or to make predictions about the future, say whether or not a customer will click on an ad.
- For supervised learning, you need labeled data and there are many ways to go get it: you can get historical data, which already has labels that you are interested in; you can perform experiments to get labeled data, such as A/B-testing to see how many clicks you get; or you can also use crowd-sourced labeling data, like reCAPTCHA does for text recognition.
- In any case, the goal is to learn from data for which the right output is known, so that we can make predictions on new data for which we don't know the output.
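
To make the classification/regression distinction concrete, here is a small sketch on the iris data. Treating petal width as a regression target is an assumption made purely for illustration, and the k-NN models are simply a convenient choice of estimator.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

iris = load_iris()
X_all = iris.data

# Classification: the target is a category (species, encoded as 0, 1, 2)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_all, iris.target)
species_pred = clf.predict(X_all[:1])
print(iris.target_names[species_pred[0]])

# Regression: the target is continuous (petal width in cm),
# predicted here from the other three measurements
X_reg, y_reg = X_all[:, :3], X_all[:, 3]
reg = KNeighborsRegressor(n_neighbors=5).fit(X_reg, y_reg)
width_pred = reg.predict(X_reg[:1])
print(width_pred[0])
```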

**Supervised learning in Python**
- There are many ways to perform supervised learning in Python.
- In this course, we will use `scikit-learn`, or `sklearn`, one of the most popular and user-friendly machine learning libraries for Python.
- It also integrates very well with the `SciPy` stack, including libraries such as `NumPy`.
- There are a number of other ML libraries out there, such as `TensorFlow` and `keras`, which are well worth checking out, once you get the basics down.

**Naming conventions**
- A note on naming conventions: out in the wild, you will find that what we call a feature, others may call a predictor variable, or independent variable, and what we call a target variable, others may call dependent variable, or response variable.

### Which of these is a classification problem?

Once you decide to leverage supervised machine learning to solve a new problem, you need to identify whether your problem is better suited to classification or regression. This exercise will help you develop your intuition for distinguishing between the two.

Provided below are 4 example applications of machine learning. Which of them is a supervised classification problem?

**Answer the question**

- **Using labeled financial data to predict whether the value of a stock will go up or go down next week.**
  - Exactly! In this example, there are two discrete, qualitative outcomes: the stock market going up, and the stock market going down. This can be represented using a binary variable, and is an application perfectly suited for classification.
- ~~Using labeled housing price data to predict the price of a new house based on various features.~~
  - Incorrect. The price of a house is a quantitative variable. This is not a classification problem.
- ~~Using unlabeled data to cluster the students of an online education company into different categories based on their learning styles.~~
  - Incorrect. When using unlabeled data, we enter the territory of unsupervised learning.
- ~~Using labeled financial data to predict what the value of a stock will be next week.~~
  - Incorrect. The value of a stock is a quantitative value. This is not a classification problem.

## Exploratory data analysis

- Samples are in rows
- Features are in columns

In [None]:
iris = load_iris()
print(f'Type: {type(iris)}')
print(f'Keys: {iris.keys()}')
print(f'Data Type: {type(iris.data)}\nTarget Type: {type(iris.target)}')
print(f'Data Shape: {iris.data.shape}')
print(f'Target Names: {iris.target_names}')

X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = y
df.species = df.species.map(dict(zip(range(3), iris.target_names)))
display(df.head())

# pd.plotting.scatter_matrix(df, c=y, figsize=(12, 10))
sns.pairplot(df, hue='species', corner=True)
plt.show()

### Numerical EDA

In this chapter, you'll be working with a dataset obtained from the [UCI Machine Learning Repository][1] consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!

Before thinking about what supervised learning models you can apply to this, however, you need to perform Exploratory data analysis (EDA) in order to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of [Statistical Thinking in Python (Part 1)][2].

Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called `df`. Use pandas' `.head()`, `.info()`, and `.describe()` methods in the IPython Shell to explore the DataFrame, and select the statement below that is **not** true.

  [1]: https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records
  [2]: https://trenton3983.github.io/certificates/2019-11-21_DataCamp_statistical_thinking_in_python_I.pdf

In [None]:
cols = ['party', 'infants', 'water', 'budget', 'physician', 'salvador', 'religious', 'satellite', 'aid',
        'missile', 'immigration', 'synfuels', 'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
votes = pd.read_csv(data_paths[4], header=None, names=cols)
votes.iloc[:, 1:] = votes.iloc[:, 1:].apply(lambda col: col.map({'?': None, 'n': 0, 'y': 1}))
votes.head()

**Possible Answers**

- The DataFrame has a total of `435` rows and `17` columns.
- Except for `'party'`, all of the columns are of type `int64`.
- The first two rows of the DataFrame consist of votes made by Republicans and the next three rows consist of votes made by Democrats.
- ~~There are 17 predictor variables, or features, in this DataFrame.~~
  - The number of columns in the DataFrame is not equal to the number of features. One of the columns - `'party'` is the target variable.
- The target variable in this DataFrame is `'party'`.
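
As a rough sketch of the numerical EDA described above, the snippet below runs `.head()`, `.info()`, and `.describe()` on a tiny made-up stand-in for the voting data. The `toy_votes` DataFrame and its values are hypothetical; the real dataset has 435 rows and 17 columns.

```python
import pandas as pd

# Hypothetical miniature of the voting data: 1 = yes, 0 = no
toy_votes = pd.DataFrame({
    'party': ['republican', 'republican', 'democrat', 'democrat', 'democrat'],
    'education': [1, 1, 0, 0, 0],
    'satellite': [0, 0, 1, 1, 1],
})

print(toy_votes.head())      # first rows
toy_votes.info()             # column names, dtypes, non-null counts
print(toy_votes.describe())  # summary statistics for numeric columns

# The number of features excludes the target column 'party'
n_features = toy_votes.drop(columns='party').shape[1]
print(n_features)
```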

### Visual EDA

The Numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the `scatter_matrix()` function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as [seaborn.countplot][1].

Given on the right is a `countplot` of the `'education'` bill, generated from the following code:

```python
plt.figure()
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
```

In `sns.countplot()`, we specify the x-axis data to be `'education'`, and hue to be `'party'`. Recall that `'party'` is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the `'education'` bill, with each party colored differently. We manually specified the color to be `'RdBu'`, as the Republican party has been traditionally associated with red, and the Democratic party with blue.

It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S. politics may be able to predict this without machine learning, but probably not instantaneously, and certainly not if we are dealing with hundreds of samples!

In the IPython Shell, explore the voting behavior further by generating countplots for the `'satellite'` and `'missile'` bills, and answer the following question: For which of these two bills do Democrats vote resoundingly in favor, compared to Republicans? Be sure to begin your plotting statements for each figure with `plt.figure()` so that a new figure will be set up. Otherwise, your plots will be overlaid onto the same figure.


  [1]: http://seaborn.pydata.org/generated/seaborn.countplot.html

In [None]:
# in order to use catplot, the dataframe needs to be in a tidy format
vl = votes.set_index('party').stack().reset_index().rename(columns={'level_1': 'cat', 0: 'vote'})

g = sns.catplot(data=vl, x='vote', col='cat', col_wrap=6, hue='party', kind='count', height=3, palette='RdBu')

**Possible Answers**

- ~~`'satellite'`.~~
- ~~`'missile'`.~~
- **Both `'satellite'` and `'missile'`.**
- ~~Neither `'satellite'` nor `'missile'`.~~
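
The counts behind a countplot can also be read numerically. The sketch below uses `pd.crosstab` on a small made-up table; the `toy_votes` values are hypothetical and only illustrate how to compare the two parties on one bill.

```python
import pandas as pd

# Hypothetical votes on a single bill: 1 = yes, 0 = no
toy_votes = pd.DataFrame({
    'party': ['democrat'] * 4 + ['republican'] * 4,
    'satellite': [1, 1, 1, 0, 0, 0, 0, 1],
})

# Rows are parties, columns are vote values, cells are counts
counts = pd.crosstab(toy_votes['party'], toy_votes['satellite'])
print(counts)

# Share of 'yes' votes per party (the mean of a 0/1 column)
share_yes = toy_votes.groupby('party')['satellite'].mean()
print(share_yes)
```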

## The classification challenge

In [None]:
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

X_new = np.array([[5.6, 2.8, 3.9, 1.1], [5.7, 2.6, 3.8, 1.3], [4.7, 3.2, 1.3, 0.2]])

pred = knn.predict(X_new)
pred

### k-Nearest Neighbors: Fit

Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame `df`.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array `X` and response variable `y`: This is in accordance with the common scikit-learn practice.

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the `n_neighbors` parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called `df`.

**Instructions**

- Import `KNeighborsClassifier` from `sklearn.neighbors`.
- Create arrays `X` and `y` for the features and the target variable. Here this has been done for you. Note the use of `.drop()` to drop the target variable `'party'` from the feature array `X` as well as the use of the `.values` attribute to ensure `X` and `y` are NumPy arrays. Without using `.values`, `X` and `y` are a DataFrame and Series respectively; the scikit-learn API will accept them in this form also as long as they are of the right shape.
- Instantiate a `KNeighborsClassifier` called `knn` with `6` neighbors by specifying the `n_neighbors` parameter.
- Fit the classifier to the data using the `.fit()` method.

In [None]:
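# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the response variable
# (df is the preprocessed voting DataFrame supplied by the course)
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)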
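
The data format described above can be checked with a small sketch. The `toy` DataFrame is synthetic and its column names are hypothetical; the point is only that the features form a 2D structure of shape `(n_samples, n_features)`, the target is 1D, and scikit-learn accepts either pandas objects or NumPy arrays.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)

# Hypothetical binary voting records: 20 samples, 4 bills
toy = pd.DataFrame(rng.integers(0, 2, size=(20, 4)),
                   columns=['bill_a', 'bill_b', 'bill_c', 'bill_d'])
toy['party'] = np.where(toy['bill_a'] == 1, 'democrat', 'republican')

# pandas objects
X_df = toy.drop('party', axis=1)
y_ser = toy['party']

# NumPy arrays via .values
X_arr = X_df.values
y_arr = y_ser.values
print(X_arr.shape, y_arr.shape)

# Both forms can be passed to .fit()
knn_from_pandas = KNeighborsClassifier(n_neighbors=6).fit(X_df, y_ser)
knn_from_numpy = KNeighborsClassifier(n_neighbors=6).fit(X_arr, y_arr)
```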

### k-Nearest Neighbors: Predict

Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the `.predict()` method on the `X` that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.

In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as `X_new`. You will use your classifier to predict the label for this new data point, as well as on the training data `X` that the model has already seen. Using `.predict()` on `X_new` will generate 1 prediction, while using it on `X` will generate 435 predictions: 1 for each sample.

The DataFrame has been pre-loaded as `df`. This time, you will create the feature array `X` and target variable array `y` yourself.

**Instructions**

- Create arrays for the features and the target variable from `df`. As a reminder, the target variable is `'party'`.
- Instantiate a `KNeighborsClassifier` with `6` neighbors.
- Fit the classifier to the data.
- Predict the labels of the training data, `X`.
- Predict the label of the new data point `X_new`.

In [None]:
# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
# (X_new is the unlabeled data point supplied by the course)
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))
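
The difference between predicting on the training data and on a single new point shows up in the output shapes. The sketch below uses synthetic records; `X_toy`, `y_toy`, and `X_new_toy` are hypothetical stand-ins for the course's `X`, `y`, and `X_new`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# 30 hypothetical voting records on 16 bills (1 = yes, 0 = no)
X_toy = rng.integers(0, 2, size=(30, 16))
y_toy = np.where(X_toy[:, 0] == 1, 'democrat', 'republican')

# One unlabeled record; note that it must still be 2D: (1, n_features)
X_new_toy = rng.integers(0, 2, size=(1, 16))

knn_toy = KNeighborsClassifier(n_neighbors=6).fit(X_toy, y_toy)

train_pred = knn_toy.predict(X_toy)    # one prediction per training sample
new_pred = knn_toy.predict(X_new_toy)  # a single prediction

print(train_pred.shape, new_pred.shape)
```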

## Measuring model performance

### The digits recognition dataset
Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with a digits recognition dataset in the style of [MNIST][1], which has 10 classes, the digits 0 through 9! scikit-learn ships a small digits dataset (per its documentation, a copy of the test set of the UCI Optical Recognition of Handwritten Digits data, often described as a reduced version of MNIST), and that is the one we will use in this exercise.

Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type `Bunch`, which are dictionary-like objects. Helpfully for the digits dataset, scikit-learn provides an `'images'` key in addition to the `'data'` and `'target'` keys that you have seen with the Iris data. Because it is an array of the 2D images corresponding to each sample, this `'images'` key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see [Chapter 2 of DataCamp's course on Data Visualization with Python][2]). On the other hand, the `'data'` key contains the feature array, that is, the images as a flattened array of 64 pixels.

Notice that you can access the keys of these `Bunch` objects in two different ways: By using the `.` notation, as in `digits.images`, or the `[]` notation, as in `digits['images']`.
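
A short sketch of the two access styles, and of how the `'images'` and `'data'` keys relate. The variable names `same_object` and `flattened` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Attribute access and key access return the same array
same_object = digits.images is digits['images']
print(same_object)

# 'images' holds one 8x8 array per sample, 'data' holds 64 values per sample
print(digits.images.shape, digits.data.shape)

# Flattening each 8x8 image reproduces the rows of the feature array
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```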

For more on the MNIST data, check out this [exercise][3] in Part 1 of DataCamp's Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model.

**Instructions**

- Import `datasets` from `sklearn` and `matplotlib.pyplot` as `plt`.
- Load the digits dataset using the `.load_digits()` method on `datasets`.
- Print the keys and `DESCR` of digits.
- Print the shape of `images` and `data` keys using the `.` notation.
- Display the 1011th image using `plt.imshow()`. This has been done for you, so hit 'Submit Answer' to see which handwritten digit this happens to be!


  [1]: http://yann.lecun.com/exdb/mnist/
  [2]: https://trenton3983.github.io/files/projects/2020-03-05_intro_to_data_visualization_in_python/2020-03-05_intro_to_data_visualization_in_python.html
  [3]: https://campus.datacamp.com/courses/introduction-to-importing-data-in-python/introduction-and-flat-files-1?ex=10

In [None]:
# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset: digits
digits = datasets.load_digits()

# Print the keys and DESCR of the dataset
print(digits.keys())
print(digits.DESCR)

# Print the shape of the images and data keys
print(digits.images.shape)
print(digits.data.shape)

# Display the image at index 1010 (the 1011th image)
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

### Train/Test Split + Fit/Predict/Accuracy

Now that you have learned about the importance of splitting your data into training and test sets, it's time to practice doing this on the digits dataset! After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the `.score()` method.

**Instructions**

- Import `KNeighborsClassifier` from `sklearn.neighbors` and `train_test_split` from `sklearn.model_selection`.
- Create an array for the features using `digits.data` and an array for the target using `digits.target`.
- Create stratified training and test sets using `0.2` for the size of the test set. Use a random state of `42`. Stratify the split according to the labels so that they are distributed in the training and test sets as they are in the original dataset.
- Create a k-NN classifier with `7` neighbors and fit it to the training data.
- Compute and print the accuracy of the classifier's predictions using the `.score()` method.

In [None]:
# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
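
What stratification does can be checked by comparing class proportions before and after the split. This is an illustrative sketch on the same digits data; the names `full_share`, `train_share`, and `test_share` are introduced here and are not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_d, y_d = digits.data, digits.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.2, random_state=42, stratify=y_d
)

# Fraction of samples belonging to each of the 10 digit classes
full_share = np.bincount(y_d) / len(y_d)
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

# With stratify=y_d, the three rows should be nearly identical
print(np.round(full_share, 3))
print(np.round(train_share, 3))
print(np.round(test_share, 3))
```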

### Overfitting and underfitting

Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

The training and testing sets are available to you in the workspace as `X_train`, `X_test`, `y_train`, `y_test`. In addition, `KNeighborsClassifier` has been imported from `sklearn.neighbors`.

**Instructions**

- Inside the for loop:
  - Set up a k-NN classifier with the number of neighbors equal to `k`.
  - Fit the classifier with `k` neighbors to the training data.
  - Compute accuracy scores on the training set and test set separately using the `.score()` method and assign the results to the `train_accuracy` and `test_accuracy` arrays respectively.

In [None]:
# Set up arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Set up a k-NN classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)

    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)
# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
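
One way to read such a curve is to look at the gap between training and testing accuracy. The sketch below compares only `k=1` and `k=8` on the digits data; it is an illustration under the same split settings as the earlier exercise, and the dictionary names are introduced here. With `k=1`, each training point is typically its own nearest neighbor, so training accuracy is expected to be at or near 1.0 and says little about generalization.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42,
    stratify=digits.target
)

train_acc = {}
test_acc = {}
for k in (1, 8):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc[k] = model.score(X_tr, y_tr)
    test_acc[k] = model.score(X_te, y_te)
    # A large positive gap suggests the model is fitting noise in the training set
    print(k, train_acc[k], test_acc[k], train_acc[k] - test_acc[k])
```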

# Regression

In the previous chapter, you used image and political datasets to predict binary and multiclass outcomes. But what if your problem requires a continuous outcome? Regression is best suited to solving such problems. You will learn about fundamental concepts in regression and apply them to predict the life expectancy in a given country using Gapminder data.

## Introduction to regression

### Questions

### Questions

### Questions

## The basics of linear regression

### Questions

### Questions

## Cross-validation

### Questions

### Questions

## Regularized regression

### Questions

### Questions

# Fine-tuning your model

Having trained your model, your next task is to evaluate its performance. In this chapter, you will learn about some of the other metrics available in scikit-learn that will allow you to assess your model's performance in a more nuanced manner. Next, learn to optimize your classification and regression models using hyperparameter tuning.

## How good is your model

### Questions

## Logistic regression and the ROC curve

### Questions

### Questions

### Questions

## Area under the ROC curve

### Questions

## Hyperparameter tuning

### Questions

### Questions

## Hold-out set for final evaluation

### Questions

### Questions

### Questions

# Preprocessing and pipelines

This chapter introduces pipelines, and how scikit-learn allows for transformers and estimators to be chained together and used as a single unit. Preprocessing techniques will be introduced as a way to enhance model performance, and pipelines will tie together concepts from previous chapters.

## Preprocessing data

### Questions

### Questions

### Questions

## Handling missing data

### Questions

### Questions

### Questions

## Centering and scaling



### Questions

### Questions

### Questions

### Questions

# Certificate

![](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/jpd_dir/file.jpg)