<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
<img src="support_files/images/cropped-SummerWorkshop_Header.png">  
</div>

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
<center><h1>Hands-On scikit-learn Tutorial: Getting Started with Machine Learning</h1><br>
    <img src="https://www.analyticsvidhya.com/wp-content/uploads/2015/01/scikit-learn-logo.png">
</center>

</div>

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
<h2>Introduction</h2>

<p>Scikit-learn, also known as sklearn, is a widely used open-source machine learning library for Python. It provides a robust set of tools for many machine learning tasks, including supervised and unsupervised learning, which are the central focus of this tutorial. Scikit-learn is built on top of other scientific computing libraries like NumPy and SciPy, making it a powerful and efficient choice for data scientists and machine learning practitioners.

<p>The primary goal of scikit-learn is to simplify the process of applying machine learning algorithms to real-world data. It offers a user-friendly interface for working with data, creating machine learning models, and evaluating their performance. The library supports a broad range of algorithms, from traditional statistical methods to cutting-edge machine learning techniques, making it suitable for both beginners and experienced researchers.

<p>The purpose of this tutorial is to provide a high-level overview of supervised and unsupervised learning and describe how to utilize scikit-learn to build machine learning models tailored for these specific tasks. By the end of this tutorial, you will have a solid understanding of the key concepts, methods, and techniques required to effectively leverage scikit-learn for creating powerful and accurate machine learning models in both supervised and unsupervised settings. 
</div>


<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
<h2>Supervised Learning</h2>

<p>Supervised learning is a machine learning approach where an algorithm learns from labeled training data to make accurate predictions or decisions. In this setting, a dataset consists of input-output pairs, where each input is associated with a corresponding desired output. The goal is to learn a model (or function) that maps inputs to outputs (or targets) from the training set so that the model can then accurately predict outputs for new, unseen inputs.


<p>Based on this description there are 3 main ingredients to supervised learning: (1) inputs, (2) targets, and (3) model. Let's consider an example of learning a model that predicts house prices.
</div>

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

1. **Inputs**

First, we are given a set of _inputs_ which we denote as $X=\{x^{(1)},\ldots, x^{(n)}\}$. An input $x^{(i)}$ is a vector where each entry is a feature that describes that data point. Using the housing price prediction example, some relevant features for a house are the number of bedrooms and the square footage. If we have a new house with 4 bedrooms and 2,500 square feet, then the feature vector for this house is $x^{(i)}=(4, 2500)$.


2. **Targets**

Second, we are given a set of _targets_ denoted as $y=\{y^{(1)},\ldots, y^{(n)}\}$ such that each target $y^{(i)}$ is the value that the model is trying to predict or estimate based on the input $x^{(i)}$. Targets represent the values we are aiming to understand, predict, or classify using the trained machine learning model. Using the housing price prediction example, the target would be the actual price of the house corresponding to the input data. For example, suppose that our new house with 4 bedrooms and 2,500 square feet is sold for $\$500,000$; then the target would be $y^{(i)}=500,000$.

3. **Model**

Lastly, our objective is to train a _model_ on a dataset that contains examples of inputs along with their corresponding targets. The model learns from these examples by identifying underlying relationships and adjusting its internal parameters to optimize its performance. In this example, our goal is to learn some function $f$ such that $f(x^{(i)})\approx y^{(i)}$, then use this function to predict the value of a house that is about to be listed. 

</div>
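
<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

The three ingredients can be sketched in a few lines. The house data below are made up for illustration (they are not part of any real dataset), and `LinearRegression` is just one possible choice of model $f$:
</div>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy inputs: each row of X is one house (bedrooms, square feet).
X = np.array([
    [2, 1000],
    [3, 1800],
    [4, 2500],
    [5, 3200],
])

# Hypothetical targets: one sale price in dollars per row of X.
y = np.array([250_000, 400_000, 500_000, 650_000])

# The model f is learned from the (input, target) pairs.
f = LinearRegression().fit(X, y)

# Predict the price of a house that is about to be listed.
new_house = np.array([[3, 2000]])
predicted_price = f.predict(new_house)
print(X.shape, y.shape, predicted_price.shape)
```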

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<h3>Let's Explore a Real Dataset!</h3>

<p>The California Housing Prices dataset is a well-known and frequently used dataset in the field of machine learning and housing market analysis. It provides information about housing characteristics and pricing across various regions in California, making it valuable for regression and predictive modeling tasks. The California Housing Prices dataset contains features that describe different aspects of housing neighborhoods in California, such as median income, average housing occupancy, median house age, and more. The target variable is the median house value for California districts. Researchers and data analysts commonly use the California Housing Prices dataset to explore and analyze the relationships between housing attributes and prices.

</div>

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

**Key Information**:
    
(https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset)

Number of examples: 20,640<br>
Target: Median house value (in units of $100,000)<br>
Number of Features: 8 numerical features<br>
Feature Information:

- MedInc:        median income in block group
- HouseAge:      median house age in block group
- AveRooms:      average number of rooms per household
- AveBedrms:     average number of bedrooms per household
- Population:    block group population
- AveOccup:      average number of household members
- Latitude:      block group latitude
- Longitude:     block group longitude
</div>

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
<h4>Let's load the data!</h4>
</div>


In [None]:
from sklearn.datasets import fetch_california_housing

housing_dataset = fetch_california_housing(as_frame=True)
housing_dataset.frame.head()

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<h4>Now let's visualize the data!</h4>

</div>

In [None]:
housing_dataset.frame.hist(figsize=(10, 7), bins=30, edgecolor="black");

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<h4>Are there Patterns in this Data?</h4>
</div>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.scatterplot(
    data=housing_dataset.frame,
    x="Longitude",
    y="Latitude",
    size="MedHouseVal",
    hue="MedHouseVal",
    palette="viridis",
    alpha=0.5,
    ax=ax,
)
ax.set_aspect('equal')
plt.legend(title="MedHouseVal", bbox_to_anchor=(1.05, 0.95), loc="upper left")
_ = plt.title("Median house value per district,\ndepending on spatial location")

<div class="exercise" style="background: #DFF0D8; border-radius: 3px; padding: 10px; color: #000;">

<h4>Exercise: Is any single feature predictive of the target MedHouseVal?</h4>

Try selecting different features to plot here and see which are most predictive of house value.
    
</div>

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<h4>Simple Model: Linear Regression</h4>

<p>Linear regression is a method to find a straight line that best fits data points. It helps predict a target value based on one or more input variables. The goal is to minimize the difference between predicted and actual values. This technique is used for tasks like predicting prices, understanding relationships, and identifying trends in data. We're going to start by splitting the data into train and test sets. The train set is the portion of the dataset used to teach a model by showing it input data and expected outcomes. The test set is a separate part of the data used to evaluate the model's performance by comparing its predictions with the actual outcomes it hasn't seen before.

<p><center><img src="https://i0.wp.com/thaddeus-segura.com/wp-content/uploads/2020/09/3.1.1.1.1-Linear-Regression.png?resize=1024%2C498&ssl=1"></center>
</div>

In [None]:
from sklearn.model_selection import train_test_split

X = housing_dataset.data
y = housing_dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print("Number of training examples:", len(X_train))
print("Number of testing examples:", len(X_test))
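
<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

`train_test_split` shuffles the data before splitting, so each run produces a different split unless `random_state` is fixed. A minimal sketch of a repeatable split, using random stand-in arrays (`X_demo` and `y_demo` are hypothetical, not the housing data):
</div>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 examples with 8 features each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = rng.normal(size=100)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

X_train_a, X_test_a = split_a[0], split_a[1]
print(len(X_train_a), len(X_test_a))
print(np.array_equal(X_test_a, split_b[1]))
```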

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<h4>1. Choose a class of model</h4>

In Scikit-Learn, every class of model is represented by a Python class.
So, for example, if we would like to compute a simple `LinearRegression` model, we can import the linear regression class:
</div>

In [None]:
from sklearn.linear_model import LinearRegression


<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

Note that other more general linear regression models exist as well; you can read more about them in the [`sklearn.linear_model` module documentation](http://Scikit-Learn.org/stable/modules/linear_model.html).
</div>

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<h4>2. Choose model hyperparameters</h4>

An important point is that *a class of model is not the same as an instance of a model*.

Once we have decided on our model class, there are still some options open to us.
Depending on the model class we are working with, we might need to answer one or more questions like the following:

- Would we like to fit for the offset (i.e., *y*-intercept)?
- Would we like the model to be normalized?
- Would we like to preprocess our features to add model flexibility?
- What degree of regularization would we like to use in our model?
- How many model components would we like to use?

These are examples of the important choices that must be made *once the model class is selected*.
These choices are often represented as *hyperparameters*, or parameters that must be set before the model is fit to data.
In Scikit-Learn, hyperparameters are chosen by passing values at model instantiation.
We will explore how you can quantitatively choose hyperparameters in [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb).

For our linear regression example, we can instantiate the `LinearRegression` class and specify that we would like to fit the intercept using the `fit_intercept` hyperparameter:
</div>

In [None]:
model = LinearRegression(fit_intercept=True)
model

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

Keep in mind that when the model is instantiated, the only action is storing the hyperparameter values.
In particular, we have not yet applied the model to any data: the Scikit-Learn API makes very clear the distinction between *choice of model* and *application of model to data*.
</div>
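
<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

One way to see this distinction is to inspect a freshly created model. The sketch below uses a separate instance (`demo_model`, a name chosen here for illustration) so that it does not interfere with `model` above:
</div>

```python
from sklearn.linear_model import LinearRegression

demo_model = LinearRegression(fit_intercept=False)

# Hyperparameters are available immediately after instantiation.
params = demo_model.get_params()
print(params["fit_intercept"])

# Learned parameters such as coef_ do not exist until fit() is called.
print(hasattr(demo_model, "coef_"))
```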

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<h4>4. Fit the model to the data</h4>

Now it is time to apply our model to the data.
This can be done with the `fit` method of the model. 
    
First, let's try fitting just against median income per district:
</div>

In [None]:
features_to_fit = ['MedInc']
model.fit(X_train[features_to_fit], y_train)
print(f"Model fit coefficient: {model.coef_[0]}  intercept: {model.intercept_}")

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

This fit command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit process have trailing underscores; for example in this linear model, the `coef_` and `intercept_` attributes printed above determine the line of best fit.
<br><br>
Let's see how this line matches the data we fit the model against:
</div>

In [None]:
import numpy as np
fig, ax = plt.subplots()

# Scatter plot of training data
x_train = X_train['MedInc']
ax.scatter(x_train, y_train, alpha=0.1)

# Line showing best fit
xlim = np.array([x_train.min(), x_train.max()])
ylim = xlim * model.coef_[0] + model.intercept_
ax.plot(xlim, ylim, color='red');
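
<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

The red line is simply `x * coef_ + intercept_`, which is also what `predict()` computes for a linear model. A small sketch on synthetic data (the line $y = 2x + 1$ plus noise is an assumption made for this example) confirms that the two agree:
</div>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy data scattered around the line y = 2x + 1.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 2.0 * x_demo[:, 0] + 1.0 + rng.normal(scale=0.1, size=50)

line_model = LinearRegression().fit(x_demo, y_demo)

# Reproduce predict() by hand from the learned attributes.
manual = x_demo[:, 0] * line_model.coef_[0] + line_model.intercept_
print(np.allclose(manual, line_model.predict(x_demo)))
```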

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<h4>5. Predict target values for unknown data</h4>

<p>Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data that was not part of the training set. In Scikit-Learn, this can be done using the <code>predict()</code> method:
</div>

In [None]:
# how accurate is this model?
import sklearn.metrics
y_predicted = model.predict(X_test[features_to_fit])
# mean_squared_error expects (y_true, y_pred) in that order
print(f"mean squared error: {sklearn.metrics.mean_squared_error(y_test, y_predicted)}")
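
<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

Mean squared error is the average of the squared differences between predictions and targets. The sketch below computes it by hand on three made-up values and compares the result with `mean_squared_error`. Taking the square root (often called RMSE) expresses the error in the same units as the target:
</div>

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, in the dataset's units of $100,000.
y_true_demo = np.array([1.0, 2.0, 3.0])
y_pred_demo = np.array([1.5, 2.0, 2.0])

# Mean of the squared differences between prediction and target.
mse_manual = np.mean((y_pred_demo - y_true_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

# The square root puts the error back in the units of the target.
rmse = np.sqrt(mse_manual)
print(mse_manual, mse_sklearn, rmse)
```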

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
What if instead we fit a hyperplane against all 8 features; do we get better prediction performance?
</div>

In [None]:
# Fit the model again, this time using all 8 input features
model.fit(X_train, y_train)
print(f"Model fit coefficients: \n{model.coef_}  \nintercept: {model.intercept_}")

In [None]:
y_predicted = model.predict(X_test)
print(f"mean squared error: {sklearn.metrics.mean_squared_error(y_test, y_predicted)}")

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
Another way we can explore the performance of the model is to plot target versus predicted house values:
</div>

In [None]:
fig, ax = plt.subplots()
ax.scatter(y_test, y_predicted, alpha=0.2)
ax.set_aspect('equal')
ax.plot([0, 5], [0, 5], color=(0.3, 0.3, 0.3), linewidth=1)
ax.set_ylim(-0.5, 5.5)
ax.set_xlim(-0.5, 5.5)
ax.set_xlabel('Target house value')
ax.set_ylabel('Predicted house value')

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
The plot above gives us some clues about when we can expect this model to fail and how we might design a better model.<br><br>
    
One question that frequently comes up regards the uncertainty in such internal model parameters.
In general, Scikit-Learn does not provide tools to draw conclusions from internal model parameters themselves: interpreting model parameters is much more a *statistical modeling* question than a *machine learning* question.
Machine learning instead focuses on what the model *predicts*.

If you would like to dive into the meaning of fit parameters within the model, other tools are available, including the <a href="http://statsmodels.sourceforge.net/">`statsmodels` Python package</a>.
</div>

<div class="exercise" style="background: #DFF0D8; border-radius: 3px; padding: 10px; color: #000;">

<b>Exercise: See if you can achieve higher regression accuracy by using a more complex model.</b>

* Select a model from https://scikit-learn.org/stable/supervised_learning.html
* Create an instance of the model. What hyperparameters do you need to think about here?
* Fit the model against X_train, y_train
* Use the model to generate a prediction from X_test and compare to y_test
* Does your model perform better or worse than the linear regression? Why?
</div>

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
<h2>Unsupervised Learning</h2>

<p>In <i>supervised</i> learning, we begin with a set of known inputs and outputs (targets), then generate a model that is capable of mapping from inputs to outputs. <i>Unsupervised</i> learning is similar, but instead we begin with only the inputs, and the job of the model is to come up with a set of meaningful outputs that best summarize the inputs using less data. 

<h3>Data dimensionality</h3>

<p>The California housing dataset introduced above contains 20,640 block groups (districts) with 8 different numerical features per block group.
We can think of each block group as a vector of length 8 that specifies a point inside an 8-dimensional space.
Note that this is different from `X_train.ndim`! The training data are stored in a 2D table (rows and columns), but the *dimensionality* of this dataset is 8, the number of features in each row.
    
<p>8-dimensional spaces are difficult to visualize and reason about. Fortunately, the points in this dataset occupy a fairly restricted subset of this 8-D space. It may be possible, then, to describe the most noteworthy features of the dataset using fewer dimensions, if we are willing to discard aspects of the data that seem less noteworthy. 

<p>The goal of unsupervised learning is to automatically discover such simpler descriptions. It should be clear that there is usually no correct way to do this--how we reduce the dimensionality of our data, and which aspects of the data to keep or discard, is a matter of art. Treat the results of unsupervised learning as a hypothesis to be tested!

<p>Further reading: <a href="https://scikit-learn.org/stable/unsupervised_learning.html">https://scikit-learn.org/stable/unsupervised_learning.html</a>

</div>
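
<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

The difference between the number of array axes and the dimensionality of the data can be checked directly. The array below is a small hypothetical stand-in for `X_train`:
</div>

```python
import numpy as np

# Hypothetical stand-in for X_train: 5 examples with 8 features each.
table = np.zeros((5, 8))

print(table.ndim)      # number of array axes (rows and columns)
print(table.shape[1])  # number of features per example
```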

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<h3>Let's start with a fake dataset</h3>

<p>This dataset will consist of 1000 samples, each described by 20 features (a 1000x20 array). 

<p>We could imagine this is, for example, a description of 1000 neurons, where each neuron is described by features like "average firing rate", "width of cell body", "number of input synapses", etc. 
</div>

In [None]:
# Ask sklearn to make a fake dataset for us:
import sklearn.datasets
data, target = sklearn.datasets.make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=4, 
    n_clusters_per_class=1,
    class_sep=1.9,
    n_informative=10,
    random_state=4,
)
print("Data shape:", data.shape)

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<p>Each neuron is described by a vector of length 20, and thus the space of all possible descriptions is 20-dimensional. We suspect that the neurons do not uniformly fill this space, that there is some structure to the data -- for example, some features may be correlated with each other, or perhaps the neurons form distinct clusters within this space. In other words, some significant part of this space is <i>empty</i>, and therefore we don't need the full 20 degrees of freedom to adequately summarize our data.

<p>To begin to visualize the data, let's just pick pairs of features and plot these against each other:
</div>

In [None]:
# First, let's just look at the relationship between pairs of features (since flat screens are good at visualizing 2 dimensions at a time).
# This is a kind of very simple dimensionality reduction in which we pick two dimensions to keep, and discard the rest.
import matplotlib.pyplot as plt
n_rows, n_cols = 5, 5
fig, ax = plt.subplots(n_rows, n_cols, figsize=(10, 10))
color = plt.cm.Set1(target)
for row in range(n_rows):
    ax[row, 0].set_ylabel(f"feature {row+1}")
    for col in range(n_cols):
        ax[-1, col].set_xlabel(f"feature {col+1}")
        if row == col:
            ax[row, col].hist(data[:, row], bins=30)
        else:
            # x-axis is the column's feature, y-axis is the row's feature (matches the axis labels)
            ax[row, col].scatter(data[:, col], data[:, row], s=2,
                # color=color,
            )


<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<p>When we look at 2 dimensions at a time, we see some structure: correlations between features tell us that the features are not totally independent of one another.
    
<p>But maybe we could see more structure if we were able to combine more than 2 dimensions at a time? 

<p>Here we'll use scikit-learn to do a principal component analysis (PCA) to determine such an optimal combination of features:
</div>

In [None]:
# Scikit-learn follows the same pattern here that we used for supervised learning:

# Create a model instance
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

# Fit the model to our data
pca.fit(data)

# Transform the data
pca_reduced = pca.transform(data)

# The result has 1000 rows just like the input data, but now only 2 features per row.
pca_reduced.shape

In [None]:
fig, ax = plt.subplots()
ax.scatter(pca_reduced[:, 0], pca_reduced[:, 1], 
    # color=color
)
ax.set_xlabel('PCA component 1')
ax.set_ylabel('PCA component 2');
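
<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

One way to judge how much of the original variation survives a projection to 2 components is the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below regenerates the same synthetic dataset so that it runs on its own; `demo_data` and `demo_pca` are names chosen for this example:
</div>

```python
import sklearn.datasets
from sklearn.decomposition import PCA

# Same synthetic dataset as above, regenerated so this cell runs on its own.
demo_data, _ = sklearn.datasets.make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=4,
    n_clusters_per_class=1,
    class_sep=1.9,
    n_informative=10,
    random_state=4,
)

demo_pca = PCA(n_components=2).fit(demo_data)

# Fraction of the total variance captured by each retained component.
ratios = demo_pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

A sum well below 1 means that the 2D picture leaves out a substantial part of the variation in the data.
</div>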

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

<p>Scikit-learn has many models that provide different approaches to dimensionality reduction. 

<p>A more powerful (and less interpretable) method is UMAP, which is actually not included in sklearn, but nonetheless implements a compatible API following the same pattern:
</div>

In [None]:
from umap import UMAP

# Create a model instance
umap = UMAP(
    n_components=2,
    n_neighbors=20,
    min_dist=0.4,
)

# Fit the model to data
umap.fit(data)

# Transform the data
umap_reduced = umap.transform(data)
umap_reduced.shape

In [None]:
fig, ax = plt.subplots()
ax.scatter(umap_reduced[:, 0], umap_reduced[:, 1], 
    # color=color
)
ax.set_xlabel('UMAP Feature 1')
ax.set_ylabel('UMAP Feature 2');

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
<p>At this point it looks clear that there is an even simpler way to describe our dataset:

<p>Rather than representing each neuron as a 2D vector, we could instead assign each neuron a single integer label that identifies which of the 4 apparent clusters it belongs to.

<p>To do this, we use another unsupervised learning method called clustering.

</div>

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
<h3>Clustering</h3>

The dimensionality reduction methods we looked at above were able to reduce our 20-dimensional dataset down to 2D, and plotting these reduced vectors revealed some hidden structure in the data. Now we will use K-means clustering to further reduce the 2D data down to a 1D label.
</div>

In [None]:
from sklearn.cluster import KMeans

# Create a model instance
kmeans = KMeans(n_clusters=4, n_init='auto', random_state=42)

# fit the model to our UMAP result
kmeans.fit(umap_reduced)

<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">
<h4>Visualize the Clustering </h4>
</div>

In [None]:
# Coordinates of cluster centers
centers = kmeans.cluster_centers_

# Cluster assignments for each data point
labels = kmeans.labels_

# Plot clusters
plt.scatter(umap_reduced[:, 0], umap_reduced[:, 1], c=labels, s=50, cmap='viridis', alpha=0.8)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X')
plt.title("K-Means Clustering Results")
plt.xlabel("UMAP Feature 1")
plt.ylabel("UMAP Feature 2")
plt.show()
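
<div class="default" style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; color: #000;">

A fitted `KMeans` model can also assign labels to points it has not seen, using `predict()`. The sketch below uses synthetic 2D blobs from `make_blobs` as a stand-in for `umap_reduced` (an assumption made so the example does not depend on UMAP), and sets `n_init=10` explicitly:
</div>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2D points in 4 groups, standing in for umap_reduced.
points, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=0)

demo_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(points)

# Each point receives one integer label in the range 0..3.
print(np.unique(demo_kmeans.labels_))

# predict() assigns a new point to the nearest cluster center.
new_point = demo_kmeans.cluster_centers_[:1] + 0.01
print(demo_kmeans.predict(new_point))
```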

<div class="exercise" style="background: #DFF0D8; border-radius: 3px; padding: 10px; color: #000;">

<b>Exercise: Play with the UMAP inputs and parameters to see ways you might end up with a different result.</b>

See: https://umap-learn.readthedocs.io/en/latest/parameters.html
    
* How is the result affected by having less data to fit (try 300 or 30 samples)?
* Try low (2) and high (200) values of n_neighbors
* Try low (0.05) and high (1.0) values of min_dist
</div>