<b><font size=20, color='#A020F0'>Scikit-learn</font></b>

Hannah Zanowski<br>
12/8/25<br>

#### <span style="color:green">Learning Goals</span>
By the end of this notebook you will
1. Become familiar with the various modules available in scikit-learn for doing machine learning
2. Practice selecting, building, and executing a simple ML model

#### Resources
[scikit-learn website](https://scikit-learn.org/stable/index.html)<br>
[scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)<br>
[scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/): MOOC = massive open online course. Go learn ML with scikit-learn here!

#### Acknowledgements
Most of today's lecture was adapted from the scikit-learn user guide and the early parts of the scikit-learn MOOC.

# A little about Scikit-learn

Scikit-learn is a Python package for machine learning. It has a wide variety of built-in "estimators" (ML algorithms and models) that you can use on an equally wide variety of problems. In addition to these estimators, scikit-learn includes tools for model fitting, data preprocessing, model selection, model evaluation, pipelines, etc. A key goal of scikit-learn is to provide a set of tools to make ML more straightforward and accessible to everyone.

## Why machine learning?
The entire point of machine learning is to build predictive models that we can use to tell us something about the world around us. It is a powerful tool for doing just that, especially when you have immensely complex or large data that are difficult to parse with standard methods of statistical inquiry. However, ML is not some sort of magical, silver bullet that will solve all your problems and answer all your burning scientific questions. At the end of the day, it's only as smart as the person who employs it, so if you're not thoughtful or careful about its application and interpretation and when it is appropriate or not to use it, then that's on you. Garbage in, garbage out.

My goal today though is to help get you started using scikit-learn in some very simple contexts, so you can get more comfortable with a few aspects of ML.

## Some terminology
There is a hot mess of machine learning jargon that you'll need to learn if you plan to use it in your day-to-day research activities. I cannot cover it all here (I also don't know most of it), but below are a few things we'll need for today:

### Training vs. Testing
<b><font color='darkmagenta'>Training data:</font></b> The data used to develop the ML model. It's what the model uses to 'learn' patterns/relationships.

<b><font color='darkmagenta'>Test data:</font></b> The part of the data **NOT** included in the training data that is used to test and validate your model.

<b><font color='darkmagenta'>Generalizing:</font></b> Applying the model to new data that it has never 'seen' before. Is your model able to accurately predict results with the new data? Or does it only work well on your training data?

### Overfitting vs. Underfitting
<b><font color='darkmagenta'>Overfitting:</font></b> Too many parameters are used to fit the data, so both the general patterns AND the noise are fit. The model is too complex for the data and will not generalize well. This usually happens when there is not enough data and too much noise.

<b><font color='darkmagenta'>Underfitting:</font></b> Not enough parameters are used to fit the data. General patterns are not fully represented by the model because it is not complex enough. The model may not generalize well.
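
To make the two definitions above concrete, here is a minimal sketch that compares training and test accuracy for decision trees of increasing complexity. The data are synthetic and deliberately noisy (generated with `make_classification`, an assumption made only for illustration), and the decision tree is just a convenient model whose complexity is easy to dial up or down.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (for illustration only)
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in [1, 3, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, test={scores[depth][1]:.2f}")
```

A large gap between the training and test scores is the usual sign of overfitting, while low scores on both suggest underfitting.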

### Supervised vs. Unsupervised Learning
<b><font color='darkmagenta'>Supervised:</font></b> Data are labeled or associated with a target class/value. Given some values X, how do we predict Y? (e.g., regression, classification)

<b><font color='darkmagenta'>Unsupervised:</font></b> Data are not labeled. Instead, given some values X, what are the underlying patterns/structure in the data that generalize? (e.g., [clustering](https://scikit-learn.org/stable/modules/clustering.html), [dimensionality reduction](https://scikit-learn.org/stable/modules/decomposition.html))
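
In scikit-learn the difference shows up directly in how you call `.fit`. The sketch below is only an illustration: it uses the iris dataset that ships with scikit-learn (not the penguin data used later), and `LogisticRegression` and `KMeans` are simply one example estimator of each kind.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the estimator sees both the features X and the labels y
clf = LogisticRegression(max_iter=1000).fit(X, y)
predicted_labels = clf.predict(X[:5])

# Unsupervised: the estimator only ever sees X
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_[:5]  # arbitrary cluster numbers, not species labels

print(predicted_labels, cluster_ids)
```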

# Getting Started
In the examples that follow, we're going to use the seaborn penguins dataset to do supervised learning.

First let's import some of the packages we'll need:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn

# Making a predictive model
Some general steps you might take to make a predictive model using ML are as follows:

0. Figure out what your research question is
1. Look at your data
2. Decide if you actually need ML to answer your question
3. Assuming yes, figure out what model to use
4. Separate/prepare the data
5. Build the model
6. Apply model to new (or test) data
7. Assess the model--(how well does it generalize?)

For this example, our "research" question is "can we use penguin physiological characteristics to predict penguin species?"

### Looking at the data
Let's read in and look at our penguin data by making a pairplot to summarize the information that we have and look for patterns. 

In [None]:
df=sns.load_dataset("penguins")
df.head()

In [None]:
#Make a pairplot colored by species
sns.pairplot(df,hue='species',palette=['teal','darkmagenta','darkgoldenrod'])

---

#### <font color='blue'>Questions for the class</font>
1. Which variables might be reasonable predictors of penguin species?
2. Do you think we need ML to predict penguin species from physiological traits?

---

### Choosing the model
Ok, so you've decided you're going to use ML to predict penguin species from some of their traits. What model should you use to do that? First you need to think about what type of question you are asking. In our case, we want to predict a categorical variable (penguin species), from a few numerical variables (the penguin traits). This is a classification problem, because the target to be predicted is discrete (in a regression problem the target to be predicted is a continuous variable). We also want our model to be able to take multiple inputs (e.g., flipper length and bill length) to predict multiple penguin species (three in this case). This means our problem lends itself well to a [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) or a [decision tree](https://scikit-learn.org/stable/modules/tree.html#decision-trees) model, either of which will allow us to classify our penguins according to a set of rules the model will create based on features it finds in the physiological traits.

### Processing the data
Before we make a model from our data, we need to reformat it a bit. That means we need to split the data into training and test subsets, and we also need to separate the target variable (penguin species) from the data that will be used to build the model. Scikit-learn has a nice function for doing the split, called [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#train-test-split).

In [None]:
from sklearn.model_selection import train_test_split

df=df.dropna() #drop nans for LogisticRegression

#For keeping only the columns of the two variables we'll use as predictors of species
traits_columns = ["flipper_length_mm", "bill_length_mm"]
#For separating the target variable from the predictors
species_column = "species"

#Make dataframes with only the information above
traits, species= df[traits_columns], df[species_column]

#Use train_test_split to split the two dataframes into training and testing
#default is to use 25% of data for testing
#random_state is set to an integer here so that the results are the same each time this block is run
traits_train, traits_test, species_train, species_test = train_test_split(traits, species, random_state=0)
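
It is worth checking what a split actually produced. The sketch below uses a small hypothetical stand-in table (the column names `feature_a`, `feature_b`, and `label` are made up) so that it runs on its own. It shows the default 75/25 split and the optional `stratify` argument, which keeps class proportions similar in both subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, two numeric features, three labels
rng = np.random.default_rng(0)
features = pd.DataFrame({"feature_a": rng.normal(size=100),
                         "feature_b": rng.normal(size=100)})
labels = pd.Series(np.repeat(["a", "b", "c"], [50, 30, 20]), name="label")

# Default split: 75% of rows for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
print(X_train.shape, X_test.shape)

# stratify keeps the class proportions roughly the same in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    features, labels, random_state=0, stratify=labels)
print(y_test_s.value_counts())
```

With the penguin data, the same checks would be applied to `traits_train`, `traits_test`, `species_train`, and `species_test`.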

### Doing the learning
It's time to fit a model to our training data. As mentioned in the beginning, scikit-learn has tons of built-in ML algorithms. Here we're using sklearn's [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#logisticregression) class. We apply logistic regression to the data by using the [.fit](https://scikit-learn.org/stable/glossary.html#term-fit) method. The LogisticRegression class will find the best _lines_ that split the data according to the classes (here, the penguin species):

In [None]:
from sklearn.linear_model import LogisticRegression

#Set up the LogisticRegression class
penguin_log=LogisticRegression(max_iter=200) #max_iter is the max number of iterations for the solver; allows us
#to avoid convergence warnings when doing cross-validation later
#Apply it to the data
penguin_log.fit(traits_train, species_train)

### Visualizing your model
So, what did we actually _do_? The nice thing about logistic regression (and other linear models) is that they are simple enough that sklearn has some handy tools that you can use to visualize how the training data were categorized, such as [DecisionBoundaryDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html#decisionboundarydisplay):

In [None]:
from sklearn.inspection import DecisionBoundaryDisplay
palette=['teal','darkmagenta','darkgoldenrod']
cmap=mpl.colors.ListedColormap(palette)
dbd=DecisionBoundaryDisplay.from_estimator(penguin_log,traits_train,response_method="predict",eps=2.5,cmap=cmap,alpha=0.3)
#the 'predict' response_method shows the prediction space for the samples; other options are predict_proba or decision_function
sns.scatterplot(data=df,x=traits_columns[0],y=traits_columns[1],hue=species_column,palette=palette)
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.title("Decision boundary using Logistic Regression");

You can actually use your model to predict penguin species using the [.predict()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) method:

In [None]:
penguin_log.predict(traits_test)
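
Classifiers such as `LogisticRegression` can also report how confident each prediction is through `.predict_proba()`. The sketch below is self-contained, so it uses the iris dataset bundled with scikit-learn as a stand-in for the penguin data. It shows that each row of probabilities sums to 1 and that the most probable class matches what `.predict()` returns.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = model.predict_proba(X_test)  # shape: (n_samples, n_classes)
# Each row sums to 1; the columns follow the order of model.classes_
most_likely = model.classes_[np.argmax(proba, axis=1)]
print(proba[:3].round(3))
print(most_likely[:3], model.predict(X_test[:3]))
```

With the penguin model, the equivalent call would be `penguin_log.predict_proba(traits_test)`.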

### Assessing the model
Above you can see how your model decided to classify the penguin species. Now we need to apply it to the unseen, or test data and figure out if it generalizes reasonably well. The simplest way to do this in a classification problem is with the accuracy score (a value that falls between 0 and 1 representing the fraction of correctly classified data; values closer to 1 mean higher accuracy):

In [None]:
#Use the model on the test data and determine the accuracy
#Use the .score method, which is available on most estimators
#the score method uses a default metric that depends on the type of estimator:
#classifiers return the mean accuracy, while regressors return r^2 instead
test_score=penguin_log.score(traits_test, species_test)
test_score

## <p style="border-width:3px; border-style:solid; border-color:black; background-color:lightyellow; padding: 1em;"><font color='red'><u>CAUTION</u></font><font size=3><br>There are MANY ways to assess a model beyond a simple accuracy/error metric, and you should _always_ explore a range of options when assessing your model. You can read more about sklearn's scoring metrics and methods for cross-validation [here](https://scikit-learn.org/stable/model_selection.html#model-selection-and-evaluation).<br><br><b>Some other questions you should be asking as part of this process:</b> What are the sizes of my test and training datasets? If the test set is small, are scoring metrics likely to be a true estimate of the testing error (this will impact how generalizable the model is)? Are my data independent and identically distributed? Is my model over- or underfitting the data? Are there tunable parameters in my model that might impact the results?</font></p>
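
As one example of looking beyond a single accuracy number, the sketch below builds a confusion matrix and a per-class report with `sklearn.metrics`. It is self-contained and uses the bundled iris dataset as a stand-in. With the penguin model you would pass `species_test` and `penguin_log.predict(traits_test)` instead.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Precision, recall, and F1 score for each class
print(classification_report(y_test, y_pred))
# The diagonal holds the correct predictions, so trace / total = accuracy
print(cm.trace() / cm.sum(), accuracy_score(y_test, y_pred))
```

The off-diagonal entries show which classes get confused with each other, something a single accuracy value hides.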

### Cross-validation
In the example above, even though we used a random number generator to choose our test data, what if we just got lucky and picked out the specific data that were easiest to classify? How can we be sure that our accuracy metrics are actually legitimate? That's where cross-validation comes in. Cross-validation allows you to test the robustness of your model by splitting your data into training and test subsets multiple times, fitting the model to each new training subset, and evaluating the model on the corresponding test subset. Below is an example of cross-validation on our logistic regression model using the [ShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#shufflesplit) method and [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#cross-val-score) to compute the accuracy scores:

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

cv=ShuffleSplit(n_splits=5, test_size=0.25, random_state=0) #set how many splits you want
cv_scores=cross_val_score(penguin_log, traits, species,cv=cv) #this returns a NumPy array with one accuracy score per split
cv_scores=pd.DataFrame(cv_scores,columns=['Accuracy Score']) #turn it into a dataframe to make it look nice
cv_scores
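
To see what `cross_val_score` does under the hood, the sketch below writes the same loop by hand: split, fit a fresh copy of the model, score on the held-out part, repeat. It is self-contained and uses the bundled iris dataset as a stand-in for the penguin data. The mean and standard deviation at the end are one common way to summarize the scores.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

manual_scores = []
for train_idx, test_idx in cv.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for every split
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

auto_scores = cross_val_score(model, X, y, cv=cv)
print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
print(f"mean={auto_scores.mean():.3f}, std={auto_scores.std():.3f}")
```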

### Hyperparameters
Some models have tunable parameters (e.g., the degree of a polynomial, the number of neighbors used in a nearest neighbors calculation, etc). These are often referred to as <b><font color='darkmagenta'>hyperparameters</font></b>. To get a sense for how the choice of hyperparameters can impact a model, we'll use a [decision tree](https://en.wikipedia.org/wiki/Decision_tree_learning) instead. This model has a hyperparameter, `max_depth`, that impacts the number of decision splits the tree can have. We'll use sklearn's [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#decisiontreeclassifier) for this example. 

In [None]:
from sklearn.tree import DecisionTreeClassifier

#Set up the decision tree class
penguin_tree=DecisionTreeClassifier(max_depth=2) #max_depth is the maximum number of levels with decision splits allowed in the tree; 
#this is a tunable parameter!
#Apply it to the data
penguin_tree.fit(traits_train, species_train)

Apply the model to the test data and quickly score it:

In [None]:
test_score=penguin_tree.score(traits_test, species_test)
test_score

Visualize how the model split the species based on bill length and flipper length:

In [None]:
from sklearn.inspection import DecisionBoundaryDisplay
palette=['teal','darkmagenta','darkgoldenrod']
cmap=mpl.colors.ListedColormap(palette)

dbd=DecisionBoundaryDisplay.from_estimator(penguin_tree,traits_train,response_method="predict",eps=2.5,cmap=cmap,alpha=0.3)
sns.scatterplot(data=df,x=traits_columns[0],y=traits_columns[1],hue=species_column,palette=palette)
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.title("Decision boundary using a Decision Tree");

We can also visualize the tree itself. With `max_depth` set to 2, there are 2 levels where decision splits are made:

In [None]:
from sklearn.tree import plot_tree
fig,ax=plt.subplots(figsize=(10,8))
plot_tree(penguin_tree, feature_names=traits_columns,class_names=penguin_tree.classes_.tolist(),impurity=False,ax=ax);
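
If a plain-text version of the tree is more convenient than a figure, `sklearn.tree.export_text` prints the same decision rules. The sketch below is self-contained and fits a depth-2 tree to the bundled iris dataset as a stand-in. With the penguin model you would pass `penguin_tree` and `feature_names=traits_columns`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# One line per split or leaf; the indentation shows the depth in the tree
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```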

So what happens when we change the max depth of our tree? Let's try it for a `max_depth` of 1 and 5:

In [None]:
#Set up the decision tree class
penguin_tree1=DecisionTreeClassifier(max_depth=1) #max_depth is the maximum number of levels with decision splits allowed in the tree; this is a tunable parameter!
#Apply it to the data
penguin_tree1.fit(traits_train, species_train)
#Quick accuracy score
test_score1=penguin_tree1.score(traits_test, species_test)

#Set up the decision tree class
penguin_tree5=DecisionTreeClassifier(max_depth=5) #max_depth is the maximum number of levels with decision splits allowed in the tree; this is a tunable parameter!
#Apply it to the data
penguin_tree5.fit(traits_train, species_train)
#Quick accuracy score
test_score5=penguin_tree5.score(traits_test, species_test)

print('Max Depth 1 score:', test_score1)
print('Max Depth 5 score:', test_score5)

Plot the decision boundaries for each:

In [None]:
palette=['teal','darkmagenta','darkgoldenrod']
cmap=mpl.colors.ListedColormap(palette)

fig,ax=plt.subplots(1,2,figsize=(10,5))
dbd=DecisionBoundaryDisplay.from_estimator(penguin_tree1,traits_train,response_method="predict",eps=2.5,cmap=cmap,alpha=0.3,ax=ax[0])
sns.scatterplot(data=df,x=traits_columns[0],y=traits_columns[1],hue=species_column,palette=palette,ax=ax[0],legend=False)
ax[0].set_title("Max depth 1");

dbd=DecisionBoundaryDisplay.from_estimator(penguin_tree5,traits_train,response_method="predict",eps=2.5,cmap=cmap,alpha=0.3,ax=ax[1])
sns.scatterplot(data=df,x=traits_columns[0],y=traits_columns[1],hue=species_column,palette=palette,ax=ax[1])
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
ax[1].set_title("Max depth 5");

---

#### <font color='blue'>Question for the class</font>
Do you think either of the models above is over- or underfit? If so, why?

---
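
Rather than trying hyperparameter values one at a time, you can let cross-validation compare them for you. The sketch below uses `GridSearchCV` to score several `max_depth` values on the training data only, then checks the best one against the held-out test data. It is self-contained and uses the bundled iris dataset as a stand-in for the penguin data. The list of depths is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate values to compare (an arbitrary choice for this sketch)
param_grid = {"max_depth": [1, 2, 3, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)  # cross-validates every candidate on the training data

for depth, score in zip(param_grid["max_depth"], search.cv_results_["mean_test_score"]):
    print(f"max_depth={depth}: mean CV accuracy={score:.3f}")
print("Best max_depth:", search.best_params_["max_depth"])
print("Held-out test accuracy:", search.score(X_test, y_test))
```

Keeping the test data out of the search means the final score is still an honest estimate of how well the chosen model generalizes.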