

#  <span style="color:#0b186c;">Introduction to Machine Learning</span>

---



“Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.” – IBM 2020

<br></br>

## <span style="color:#0b186c;">Table of Contents:</span>
* [Supervised vs Unsupervised Learning](#first-bullet)
* [Dataset Information](#second-bullet)
* [Supervised Learning](#third-bullet)
* [Unsupervised Learning](#fourth-bullet)
* [Conclusion](#fifth-bullet)

#  <span style="color:#0b186c;">Supervised vs Unsupervised Learning</span><a class="anchor" id="first-bullet"></a>

---
## <span style="color:#0b186c;">Required Imports:</span>

<div class="alert alert-warning">

<b>Note:</b> If you have not previously installed these `packages`, you can use the cell below to perform the required `pip` installs.

</div>

In [None]:
# In case you still need to perform some pip installs:
! pip install --user pandas -q
! pip install --user numpy -q
! pip install --user scikit-learn -q
! pip install --user matplotlib -q
! pip install --user seaborn -q

In [None]:
# Dataframe and array libraries
import pandas as pd
import numpy as np

# Libraries for visualizing data
import matplotlib.pyplot as plt
import seaborn as sns

# Retrieves the dataset from Scikit-learn
from sklearn.datasets import load_iris

# Required for performing standardization
from sklearn.preprocessing import StandardScaler

# Required for training and validating a model
from sklearn.model_selection import train_test_split

# Required for instantiating and running a Decision Tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Classification metrics and confusion matrix
# (plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay

# Required for instantiating and running a KMeans clustering model
from sklearn.cluster import KMeans

# Filters out warning messages
import warnings
warnings.filterwarnings('ignore')

#  <span style="color:#0b186c;">Dataset Information</span><a class="anchor" id="second-bullet"></a>

---

We will be using a dataset containing 3 species of the Iris genus, namely Iris Setosa, Iris Versicolor and Iris Virginica, found in the Gaspé Peninsula. For the purposes of an integral study, the collected Iris samples were "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus." The dataset contains 150 rows of data, 50 for each species of Iris flower. Each column name identifies the feature of the flower that was measured and recorded: sepal length, sepal width, petal length and petal width, all in centimeters.

Our target dataset can be found in the Scikit-learn library, so we will be importing it directly from the library and storing it into a Pandas dataframe.

https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

In [None]:
# Import the iris dataset
iris = load_iris(as_frame=True)

# Place the dataset into a dataframe
df = iris.frame 

# View the first 5 records in the dataset
df.head()

<div class="alert alert-info">
   
We can use the `.info()` method for our dataframe to view a concise summary of the information contained within. This includes the number of observations, columns and data types, and any missing values.
    
 </div>

In [None]:
df.info()

<div class="alert alert-info">
   
For numerical features in the dataframe, we can use the `.describe()` method to view relevant statistical information about each of the features. Understanding these values can assist in identifying the presence of outliers.
    
 </div>
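One common way to turn the quartiles reported by `.describe()` into an outlier check is the interquartile range (IQR) rule. The sketch below is only an illustration: the 1.5 × IQR threshold is a widely used convention rather than something this notebook prescribes, and the names `features` and `outlier_mask` are introduced here for the example.

```python
from sklearn.datasets import load_iris

# Load only the four measurement columns (no target)
features = load_iris(as_frame=True).data

# Quartiles and interquartile range for every feature
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles (assumed convention)
outlier_mask = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)

# Number of flagged observations per feature
print(outlier_mask.sum())
```

A feature with a non-zero count is worth inspecting visually, for example with a box plot, before deciding whether the flagged values are genuine measurements.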

In [None]:
df.describe()

<div class="alert alert-info">
   
Additionally, we can use the `pairplot()` function from the `seaborn` library to visualize a scatter matrix of the independent variables. We can color-code the plotted points based on the `target` feature to identify any discernible patterns in the measurement values.
    
 </div>

In [None]:
# Set the figure size
sns.set(rc={'figure.figsize':(12,8),'ytick.labelsize':(12)})

# Create a pairplot
sns.pairplot(df, hue = "target", palette = "Set2")

<div class="alert alert-info">
   
Our dataset contains an equal number of observations for each of the Iris species. We can visualize the distribution of the target variable with a pie chart:
    
</div>

In [None]:
# Create a pie chart for the target variable
df.target.value_counts().plot(kind='pie', figsize=(8, 8), fontsize=10, autopct='%1.0f%%')
plt.title("Target Variable Distribution", fontsize = 20)
plt.show()

<div class="alert alert-info">
   
Lastly, we can use the `.corr()` method on our dataframe to identify linear relationships between the independent variables and the dependent variable. It also helps identify collinearity that may exist amongst the independent variables. The correlation matrix can be enhanced by using the `heatmap()` function from the `seaborn` library, which scales the color intensity with the strength of the linear relationship.
    
</div>
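By default, `.corr()` computes the Pearson correlation coefficient, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. As a small sanity check, the sketch below computes it by hand for one pair of columns and compares the result with pandas. The choice of petal length and petal width, and the names `df_demo`, `r_manual` and `r_pandas`, are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris

df_demo = load_iris(as_frame=True).frame
x = df_demo["petal length (cm)"]
y = df_demo["petal width (cm)"]

# Pearson correlation computed directly from its definition
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by pandas (Pearson is the default method)
r_pandas = x.corr(y)

print(round(r_manual, 4), round(r_pandas, 4))
```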

In [None]:
# Set the figure size
sns.set(rc={'figure.figsize':(12,8),'ytick.labelsize':(12)})

# Use the corr method to create the correlation matrix
correlation_matrix = df.corr().round(2)

# Create a heatmap shaded by the strength of the linear relationship
sns.heatmap(data = correlation_matrix, annot = True, cmap = "Blues")
plt.title("Variable Correlation Heatmap\n", fontsize = 20)
plt.show()

#  <span style="color:#0b186c;">Supervised Learning</span><a class="anchor" id="third-bullet"></a>

---

In Supervised Learning, the algorithms are provided with a combination of independent variables (X) and a labeled dependent variable (y). The algorithm learns to map inputs to the desired output from the input-output pairs it sees during training. Supervised Learning can be divided into two subcategories, Regression and Classification.

- Regression models predict a continuous, numerical output value.
- Classification models predict a discrete, categorical output value.

## <span style="color:#0b186c;">Classification of Iris Flowers</span>

Since our target variable is a discrete, categorical label for the different species of Iris flowers, we will be building a classification model. One of the simplest forms of a classification model is the `Decision Tree`. First, we have to partition our dataset into a training and test set.

<div class="alert alert-info">
    
**Note:** It is imperative that the subsets are representative of the whole dataset. 
A convenient way to accomplish this is the built-in function `train_test_split()`, which shuffles the data before splitting by default.

</div>

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
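Shuffling alone does not guarantee identical class proportions in both subsets. As a hedged illustration, the sketch below passes the `stratify` parameter of `train_test_split()` so that each species keeps the same share in the training and test sets. The `random_state=42` seed and the `_demo`, `_tr`, `_te` variable names are arbitrary choices for this example and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo preserves the class balance in both subsets;
# random_state=42 is an arbitrary seed used only for reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(np.bincount(y_tr))  # observations per class in the training set
print(np.bincount(y_te))  # observations per class in the test set
```

With 150 balanced observations and an 80/20 split, stratification yields 40 training and 10 test observations per species.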

In [None]:
# Split the independent (X) and dependent (y) variables
X = df.iloc[:, :-1]
y = df.iloc[:, -1].values

# Split the data into an 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Output the shape of the training set
X_train.shape

Using **standardization**, we can rescale our features so that each one has a mean of 0 and a standard deviation of 1. This puts all features on a common scale without changing the shape of their distributions, so that no feature dominates the modeling process simply because of its units.

The `StandardScaler()` from `scikit-learn` standardizes each feature independently by subtracting its mean and dividing by its standard deviation. First, the scaler has to be fit on the training data to learn the relevant statistics. Using the `.fit_transform()` method, we can fit and simultaneously transform the training data in a single line of code. The test data is then transformed using the `.transform()` method, which reuses the statistics learned from the training data so that no information from the test set leaks into the preprocessing.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
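Standardization applies $z = \frac{x - \mu}{\sigma}$ to each feature, where $\mu$ and $\sigma$ are that feature's mean and standard deviation. The sketch below reproduces this by hand with NumPy and checks it against `StandardScaler`. For simplicity it uses the full Iris feature matrix rather than the training split, and `X_demo`, `X_manual` and `X_scaled` are names introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_demo, _ = load_iris(return_X_y=True)

# Manual standardization: subtract the column mean, divide by the column std
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X_demo - mean) / std

# The same transformation with scikit-learn
X_scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```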

In [None]:
# Instantiate the standard scaler
sc = StandardScaler()

# Fit the scaler on the training set and transform the training set
X_train = sc.fit_transform(X_train)

# Transform the test set using the scaler fit on the training set
X_test = sc.transform(X_test)

Decision Trees are one of the most popular types of Classification algorithms due to their flexibility in handling different data types in the input variables and, in some implementations, missing values. The Decision Tree creates a flowchart-like tree structure, where each internal node denotes a test on an independent variable. For each test, branches are created based on the outcome. This process continues until a terminal node is reached, which holds the predicted class label for the dependent variable. The model can be loaded directly from `scikit-learn`.


https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
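To make the flowchart structure described above concrete, the sketch below prints the learned tests as plain text using `export_text` from `sklearn.tree`. It is an illustration only: it fits a separate, shallow tree on the full, unscaled dataset, and `max_depth=2`, `random_state=0` and the names `iris_demo`, `clf_demo` and `rules` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_demo = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an arbitrary choice)
clf_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
clf_demo.fit(iris_demo.data, iris_demo.target)

# Each indented line is a test on one feature; "class:" lines are terminal nodes
rules = export_text(clf_demo, feature_names=list(iris_demo.feature_names))
print(rules)
```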

In [None]:
# Instantiate the classifier
classifier = DecisionTreeClassifier()

# Fit the model on the training data
classifier.fit(X_train, y_train)

# Plot the tree structure (only the first two levels are drawn)
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(classifier, 
                   max_depth = 2,
                   feature_names = list(X.columns),
                   class_names = True,
                   filled = True)

Once the model has been trained, we can use the test data to evaluate it and see how well it **generalizes** by comparing its predictions with the true labels in `y_test`. This is where we can identify potential underfitting or overfitting of our model.
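A simple, informal way to look for overfitting is to compare accuracy on the training data with accuracy on the held-out test data: a large gap suggests the model has memorized the training set. The sketch below is self-contained and uses its own split. The seeds and the names `clf_demo`, `train_acc` and `test_acc` are assumptions for this example, not part of the notebook's pipeline.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# An unconstrained tree can fit the training data very closely
clf_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, clf_demo.predict(X_tr))
test_acc = accuracy_score(y_te, clf_demo.predict(X_te))

print(f"Training accuracy = {train_acc:.2%}")
print(f"Test accuracy     = {test_acc:.2%}")
```

If the training accuracy is much higher than the test accuracy, limiting the tree with parameters such as `max_depth` is one option to consider.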

In [None]:
# Make predictions based on the X values in the test set
y_pred = classifier.predict(X_test)

# Calculate the accuracy score of the test set
score = round((accuracy_score(y_test, y_pred) * 100), 2)

# Change the rc parameters to adjust the figure size
plt.rcParams['figure.figsize'] = [10, 10]

# Plot the confusion matrix for the predictions
# (ConfusionMatrixDisplay.from_estimator replaces plot_confusion_matrix, removed in scikit-learn 1.2)
disp = ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test, cmap = plt.cm.Blues)
disp.ax_.set_title("Confusion Matrix")
plt.grid(False)
plt.show()

# Print the accuracy score on the test data
print(f"Accuracy = {score}%")

#  <span style="color:#0b186c;">Unsupervised Learning</span><a class="anchor" id="fourth-bullet"></a>

---

In Unsupervised Learning, the algorithms are not provided with an expected output in the form of a dependent variable. The algorithm extrapolates patterns from the input variables and draws its own conclusions about the unlabeled data. Unsupervised Learning can primarily be grouped into two subcategories, Clustering and Dimensionality Reduction.

- Clustering models group data points based on similarity and separate groups by dissimilarity.
- Dimensionality Reduction models transform data from the original dimensional space into a smaller dimension, while still capturing meaningful variance in the data.
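This notebook focuses on clustering, but a brief sketch of dimensionality reduction may help make the second bullet concrete. The example below uses Principal Component Analysis (PCA), one common technique chosen here purely for illustration, to project the four Iris measurements onto two components. The names `X_std`, `pca` and `X_2d` are introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo, _ = load_iris(return_X_y=True)

# Standardize first so that every feature contributes on the same scale
X_std = StandardScaler().fit_transform(X_demo)

# Project the 4 original features onto 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```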

## <span style="color:#0b186c;">Clustering of Unlabeled Flowers</span>

The independent variables from the iris dataset can be isolated without the target label to perform clustering. This provides a good opportunity to compare the outcomes of clustering with the known labels in the dataset. One of the simplest clustering models is `KMeans`. First, let's review the independent variables stored in `X`:

In [None]:
# Review the input variables without a target label
X

Clustering algorithms work by defining clusters, or groups of data points, such that the total intra-cluster variation is minimized. The KMeans algorithm accomplishes this by minimizing the within-cluster sum of squares, that is, the sum of squared distances between each observation and its cluster centroid. The number of clusters, and therefore the number of centroids, has to be chosen in advance. A simple heuristic for choosing it is to plot the inertia, which represents the within-cluster sum of squares, against the number of clusters and identify the bend in the `elbow curve`, beyond which adding more clusters reduces the inertia only slightly.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
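To see what the inertia measures, the sketch below recomputes the within-cluster sum of squares by hand from a fitted model's centroids and labels and compares it with the `inertia_` attribute. The settings `n_init=10` and `random_state=0`, and the names `km_demo`, `assigned_centroids` and `wcss`, are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_demo, _ = load_iris(return_X_y=True)

# Fit KMeans with 3 clusters (seed and n_init chosen only for reproducibility)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to every observation
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]

# Sum of squared distances between each observation and its centroid
wcss = ((X_demo - assigned_centroids) ** 2).sum()

print(wcss, km_demo.inertia_)
print(np.isclose(wcss, km_demo.inertia_))
```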

In [None]:
# Instantiate the KMeans algorithm and fit it with 1 to 10 clusters
km = [KMeans(n_clusters=i).fit(X) for i in range(1, 11)]

# Retrieve the within-cluster sum of squares (inertia) for each number of clusters
# (.score(X) would return the negative of this value)
scores = [model.inertia_ for model in km]

# Plot the number of clusters against their respective within-cluster sum of squares
fig, ax = plt.subplots(figsize=(10,6))
ax.plot(range(1, 11), scores)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Curve')
plt.show()

In [None]:
# Instantiate the clustering algorithm with optimal number of clusters
km = KMeans(n_clusters=3)

# Fit on the input data
km.fit(X)

# Predict the assigned cluster for each observation
# (for the data used in fitting, this matches km.labels_)
cluster_labels = km.predict(X)

# View the output labels
print(km.labels_)

<img src = "./Media/iris_pairplot.PNG" align = "right" width = "52%">

Our original data had 3 different species of Iris, each with distinctive features. We can compare the outcome of our clustering against the `pairplot()` of the original data. As you can see, the Setosa species is starkly different from the other 2 species. There is a less discernible boundary between the Virginica and Versicolor species, which is reflected in the overlap of data points.

In the `elbow curve` above, we saw that the largest drop in within-cluster variation occurred when moving to 2 clusters. After the 3rd cluster, adding more clusters made very little difference. This highlights the difference between working with labeled data in Supervised Learning and unlabeled data in Unsupervised Learning: without labels, the number of groups has to be inferred from the data itself.
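Because the true species are known for this dataset, the cluster assignments can also be compared with them numerically. The sketch below builds a contingency table with `pandas.crosstab` and computes the adjusted Rand index from `sklearn.metrics`, a score that does not depend on how the clusters happen to be numbered. It refits its own model, and `n_init=10`, `random_state=0` and the names `km_demo`, `table` and `ari` are assumptions made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris_demo = load_iris()
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris_demo.data)

# Map the numeric targets to species names for a readable table
species = iris_demo.target_names[iris_demo.target]

# Rows: true species, columns: assigned cluster
table = pd.crosstab(species, km_demo.labels_,
                    rownames=["species"], colnames=["cluster"])
print(table)

# Adjusted Rand index: 1.0 means perfect agreement, values near 0 mean chance level
ari = adjusted_rand_score(iris_demo.target, km_demo.labels_)
print(f"Adjusted Rand index = {ari:.2f}")
```

Cluster numbers are arbitrary, so the table should be read row by row: a species that falls almost entirely into one column has been recovered well.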

In [None]:
# Set the figure size
sns.set(rc={'figure.figsize':(12,8),'ytick.labelsize':(12)})

# Add the cluster labels as a new column (assign returns a new dataframe,
# which avoids modifying a slice of the original df)
X = X.assign(clusters = km.labels_)

# Create a pairplot with the color-coded clusters
sns.pairplot(X, hue = "clusters", palette = "Set2")

In [None]:
# Create a pie chart for the cluster variable
X.clusters.value_counts().plot(kind='pie', figsize=(8, 8), fontsize=10, autopct='%1.0f%%')
plt.title("Target Cluster Distribution", fontsize = 20)
plt.show()

#  <span style="color:#0b186c;">Conclusion</span><a class="anchor" id="fifth-bullet"></a>

---
