# Weekly Tutorial Problems: Week 1
Jérôme Dockès, Nikhil Bhagwat, Jacob Sanz-Robinson.

Welcome to the first set of tutorial problems of the ABCD ReproNim Machine Learning course! They are largely based on Jérôme and Nikhil's exercises for the MAIN educational course: https://github.com/neurodatascience/main-2021-ml-parts-1-2

Let's get started.

# Lecture 1: Core Concepts in Machine Learning 1

## Question 1: Fitting and Prediction

We have a chunk of code below. It imports the necessary libraries, generates some random data, and creates an instance of a linear least squares model with l2 (ridge) regularization. It fits the model to the data, obtains predictions, and calculates the Mean Squared Error of the predictions that were made.

**Your task:**
* The code runs, but there is a methodological issue in the way we are fitting data. What is this issue in the code below?

In [None]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=80, n_features=600, noise=10, random_state=0) #Generates a random linear combination of random features, with noise.

model = Ridge(alpha=1e-8)
model.fit(X, y)
predictions = model.predict(X)
mse = mean_squared_error(y, predictions)

print(f"\nMean Squared Error: {mse}")
print("MSE is 0 up to machine precision:", np.allclose(mse, 0))

Now let's compare training and testing performance. We generate some new data, and split it into a train and test set.

**Your task:**
* Fit the model on training data only, get predictions for the test data, and compute the prediction error. Is it much larger than the error on the training data?

In [None]:
X, y = make_regression(n_samples=160, n_features=600, noise=10, random_state=0)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

#Your code here

In [None]:
################Solution################
model = Ridge(alpha=1e-8)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f"\nOn a separate test set...\nOut-of-sample Mean Squared Error: {mse}")

## Question 2: Cross-Validation

As in the previous question, we import the libraries we need, generate some random data, and create an instance of a linear least squares model with l2 regularization.

In [None]:
from sklearn.datasets import make_regression
from sklearn import model_selection
from sklearn.linear_model import Ridge

X, y = make_regression(noise=10) #Generates a random linear combination of random features, with noise.
model = Ridge()

**Your task (should be ~2 lines of code):**
* Using an appropriate function from scikit-learn, compute cross-validation scores for a ridge regression on this dataset.
* What cross-validation strategy are you using? What do the scores represent -- what performance metric is used?
* What is a good choice for k?
* Once you are satisfied with the scores, fit the model to the data.

In [None]:
################Solution################
from sklearn.datasets import make_regression
from sklearn import model_selection
from sklearn.linear_model import Ridge

X, y = make_regression(noise=10) #Generates a random linear combination of random features, with noise.
model = Ridge()

scores = model_selection.cross_validate(model, X, y, scoring="neg_mean_squared_error", cv=model_selection.KFold(5))
print(scores)
model.fit(X, y)

## Question 3: Hyperparameters and Grid Search

The chunk of code below imports the necessary libraries and generates some random data for us to classify.

**Your task:**
* Modify the "model" variable. It should use the GridSearchCV to run a Logistic Regression, doing a grid search over at least 3 different values of C (I suggest they each vary by an order of magnitude), and "l1" and "l2" penalty types.

In [None]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, GridSearchCV
from matplotlib import pyplot as plt

X, y = make_classification() #Generate random data
model = #Your line of code goes here!
scores = cross_validate(model, X, y)
print(scores)

In [None]:
################Solution################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, GridSearchCV
from matplotlib import pyplot as plt

X, y = make_classification()
model = GridSearchCV(LogisticRegression(solver="liblinear"), {"C": [0.01, 0.1, 1.0], "penalty": ["l1", "l2"]})
scores = cross_validate(model, X, y)
print(scores)

# Creating a figure to visualize the grid of hyperparameters
fig, ax = plt.subplots(figsize=(4, 4))
a, b = np.mgrid[:3, :2]
plt.scatter(a.ravel(), b.ravel())
ax.set_xticks([0, 1, 2])
ax.set_xticklabels([0.01, 0.1, 1.0])
ax.set_yticks([0, 1])
ax.set_yticklabels(["l1", "l2"])
ax.set_xlabel("Hyperparam: C")
ax.set_ylabel("Hyperparam: 'penalty'")
ax.set_ylim(-1, 2)
plt.show()

## Question 4: Prediction and Performance Scores

In this question we're going to build a model that can recognize handwritten digits.

Firstly, below are all the libraries we are going to use for this question. I suggest you Google them, or have a glance at their documentation if you don't already recognize them! Popular libraries often have very informative documentation, complete with examples.
For example: https://scikit-learn.org/stable/user_guide.html

In [None]:
from sklearn.datasets import fetch_openml
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score

We're going to download the dataset we'll be using. It's the MNIST dataset - a set of 70,000 images of digits handwritten by high school students and US Census Bureau employees. It's a commonly used benchmark for Machine Learning algorithms, and a part of many Machine Learning courses, earning it its infamous nickname of the 'Hello World' of Machine Learning.

In [None]:
mnist = fetch_openml('mnist_784', version=1, as_frame= False) #~130MB, might take a little time to download!
mnist.keys()

Let's look at the data dimensions, what the first digit looks like, and what its label is:

In [None]:
X, y = mnist["data"], mnist["target"]
print(X.shape) 

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = mpl.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()

print(y[0])

Now we're going to separate the data to use the first 60,000 images as our training set, and the final 10,000 for our test set.

We're going to try to quickly make a binary classification model. It will be able to tell the difference between the number 5 and all the other digits. A 5-detector.

Our classifier will be an instance of a SGDClassifier model. I've chosen this classifier because it can handle large datasets efficiently. The default setting we are using is a linear SVM (you will see what this is next week). It is sped up by using a variation of Gradient Descent - Stochastic Gradient Descent (deals with training instances independently, one at a time).

In [None]:
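import numpy as np #Needed for the integer cast below

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
#Casting labels from strings to integers (both splits, so that the comparisons with 5 below work)
y_train = y_train.astype(np.int8)
y_test = y_test.astype(np.int8)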

#Here we are binarizing our labels. All labels that are 5 are converted to True, and the rest to False. 
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

sgd_clf = SGDClassifier(random_state=42) #42 is arbitrarily chosen. From documentation: "Pass an int for reproducible output across multiple function calls"

**Your task (should only be ~3 lines of code):**
* Fit the sgd_clf model to the training data, and make a prediction using it. Was the prediction it made right?
* Run cross validation using 3 folds on the model to determine its accuracy. What accuracies are you getting for the 3 folds?

In [None]:
#Your code here

In [None]:
################Solution################
sgd_clf.fit(X_train, y_train_5)
print(sgd_clf.predict([some_digit])) #Printed explicitly: a notebook cell only displays its last expression
print(cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy"))

If you did everything right, you probably got something amazing, like 95% accuracy on all folds.
Too good to be true? Yup...only about 10% of the images are 5s, so if you always guess that an image is NOT a 5, you will be right about 90% of the time...Not as impressive now.
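
The 90% figure can be checked without downloading MNIST. The following is a minimal sketch on synthetic labels (the arrays `X_fake` and `y_fake` and the exact 10% positive rate are assumptions chosen to mimic the 5-vs-rest imbalance), using scikit-learn's `DummyClassifier` to always predict the majority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the binarized MNIST labels: exactly 10% positives.
rng = np.random.default_rng(0)
y_fake = np.zeros(1000, dtype=bool)
y_fake[:100] = True
rng.shuffle(y_fake)

# A dummy classifier ignores the features, so any array of the right length works.
X_fake = np.zeros((1000, 1))

# Always predict the most frequent class (here: "not a 5").
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_fake, y_fake, cv=3, scoring="accuracy")
print(baseline_scores)  # each fold is close to 0.90
```

Any real classifier should be compared against this kind of baseline before its accuracy is called impressive.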

**Your task (should only be ~2 lines of code):**

Use the appropriate imported libraries to:
* Obtain the predictions made on each fold of the training data.
* Build a confusion matrix for these predictions.

In [None]:
#Your code here

In [None]:
################Solution################
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred)

Remember that each row in a confusion matrix represents an actual class, while each column represents a predicted class. So the first row of this matrix considers non-5 images (the negative class). The second row considers the images of 5s (the positive class).
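
To make the row/column layout concrete, here is a small sketch on hand-made labels (the lists `y_true` and `y_pred` are hypothetical and unrelated to MNIST):

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows are actual classes, columns are predicted classes:
# [[TN FP]
#  [FN TP]]

# For a binary problem, ravel() unpacks the four counts in that order.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 1 1 2
```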

**Your task:**
In your confusion matrix, how many did you get for each of the following?
* true negatives: 
* false positives: 
* false negatives: 
* true positives:

* Use sklearn to find the precision, recall, and f1_score of these obtained predictions.

In [None]:
################Solution################
print(precision_score(y_train_5, y_train_pred))
print(recall_score(y_train_5, y_train_pred))
print(f1_score(y_train_5, y_train_pred))

# Lecture 2: Core Concepts in Machine Learning 2

## Question 1: Dataset Transformation
**Your task:**
Use an sklearn function seen in the recorded lectures to modify the variable X_scaled. X_scaled should standardize the feature variable X by removing the mean and scaling to unit variance.

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.asarray([[0, 1, -10], [0, -1, 0], [1, 0, 10], [1, 0, 0]])
print(f"X:\n{X}\n")

X_scaled = #Your code goes here!

print(f"X scaled:\n{X_scaled}\n")
print(f"mean: {X_scaled.mean(axis=0)}\nstd: {X_scaled.std(axis=0)}")

In [None]:
################Solution################

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.asarray([[0, 1, -10], [0, -1, 0], [1, 0, 10], [1, 0, 0]])
print(f"X:\n{X}\n")

X_scaled = StandardScaler().fit_transform(X)

print(f"X scaled:\n{X_scaled}\n")
print(f"mean: {X_scaled.mean(axis=0)}\nstd: {X_scaled.std(axis=0)}")

## Question 2: Feature Selection

As in the previous questions, we import relevant libraries and generate random data for regression. We then use the SelectKBest class from sklearn to perform feature selection (select features according to the k highest scores using f_regression - a univariate F-test that scores the linear relationship between each feature and the target), and perform cross validation using the Ridge linear model.
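
To see the mechanics of `SelectKBest` in isolation, here is a small sketch on synthetic data (the sizes and `k=5` are arbitrary choices for illustration, not the values used in the exercise). Each feature is scored separately with `f_regression`, and only the `k` highest-scoring columns are kept:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Small synthetic problem: 50 samples, 20 features, of which 3 are informative.
X_demo, y_demo = make_regression(
    n_samples=50, n_features=20, n_informative=3, noise=1.0, random_state=0
)

selector = SelectKBest(f_regression, k=5)
X_demo_selected = selector.fit_transform(X_demo, y_demo)

print(X_demo.shape, "->", X_demo_selected.shape)  # (50, 20) -> (50, 5)
# Column indices of the features that were kept.
print(selector.get_support(indices=True))
```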

**Your task:**
What is the methodological issue with the chunk of code?

In [None]:
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from matplotlib import pyplot as plt

X, y = make_regression(noise=10, n_features=5000, random_state=0)

X_reduced = SelectKBest(f_regression).fit_transform(X, y)
scores = cross_validate(Ridge(), X_reduced, y)["test_score"]
print("feature selection in 'preprocessing':", scores)

So...now let's fix it. To do this we are going to create a pipeline. We will make sure to fit the whole pipeline to the training set.

**Your task:**
* Modify the 'model' variable. Use the appropriate sklearn method to make a pipeline out of the functions from the previous chunk of code.
* How do your results vary? Why do you think this is?

In [None]:
model = #Your code here!!!
scores_pipe = cross_validate(model, X, y)["test_score"]
print("feature selection on train set:", scores_pipe)

# Plotting our results!
plt.boxplot(
    [scores_pipe, scores],
    vert=False,
    labels=[
        "feature selection on train set",
        "feature selection on whole data",
    ],
)
plt.gca().set_xlabel("R² score")
plt.tight_layout()
plt.show()

In [None]:
################Solution################
model = make_pipeline(SelectKBest(f_regression), Ridge())
scores_pipe = cross_validate(model, X, y)["test_score"]
print("feature selection on train set:", scores_pipe)

# Plotting our results!
plt.boxplot(
    [scores_pipe, scores],
    vert=False,
    labels=[
        "feature selection on train set",
        "feature selection on whole data",
    ],
)
plt.gca().set_xlabel("R² score")
plt.tight_layout()
plt.show()

## Question 3: Dimensionality Reduction

The sklearn library includes a dataset featuring breast cancer data. We fit a `StandardScaler` to the data, then use it to transform the data, standardizing the features.

**Your task:**
* Use the sklearn PCA model to only keep the first two principal components of the data.
* Fit the PCA model to the breast cancer data, and transform the data onto the first two principal components.
* Compare original and reduced shape of the data...Is it behaving how you would expect?
* Bonus: Create a plot to visualize the first vs second principal components (or any other method you would like to compare them)! Does it seem like the classes are easily separable?

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)

#Your code here

In [None]:
################Solution################
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("Original shape: %s" % str(X_scaled.shape))
print("Reduced shape: %s" % str(X_pca.shape))

print()
print(pca.components_.shape)
print(pca.components_)

# plot first vs second principal component, color by class
plt.figure(figsize=(8, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cancer.target, s=60)
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")

## Question 4: Big Picture - Scanning site prediction

We have some fMRI time series, which we use to compute a connectivity matrix for each participant. We use the connectivity matrix values as our input features to predict which site each participant belongs to.
We classify participants using a logistic regression. However, we make several additions.

### Pipeline
We use scikit-learn's `sklearn.pipeline.Pipeline`, which enables chaining several transformations into a single scikit-learn estimator (an object with a `fit` method). This avoids dealing with the connectivity feature extraction separately and ensures everything is fitted on the training data only -- which is crucial here because we will add scaling and a dimensionality reduction step with Principal Component Analysis.

### Scaling
We add scaling of the input features using scikit-learn's StandardScaler, which removes the mean and scales the features to unit variance. This helps the logistic regression solver converge faster and often improves performance.
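
As a quick check of what "removes the mean and scales to unit variance" means, the sketch below compares `StandardScaler` with the same computation written out in NumPy. The matrix `X_toy` is a hypothetical example, not part of the exercise data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 samples, 2 features on very different scales.
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# Manual standardization: subtract the column mean, then divide by the column
# standard deviation (population std, ddof=0, which StandardScaler also uses).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X_toy)
print(np.allclose(X_manual, X_sklearn))  # True
```

Inside a pipeline, the mean and standard deviation are estimated on the training split only and then reused to transform the test split.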

### Dimensionality Reduction
We also consider a pipeline that reduces the dimension of input features with PCA, and compare it to the baseline logistic regression. One advantage is that the pipeline that uses PCA can be fitted much faster.

### Cross-validation
Here, we will use scikit-learn's `cross_validate` to perform K-Fold cross-validation and get a better estimate of our model's generalization performance. This allows comparing logistic regression with and without PCA, as well as a naive baseline.

Moreover, instead of the plain `LogisticRegression`, we use scikit-learn's `LogisticRegressionCV`, which automatically performs a nested cross-validation loop on the training data to select the best hyperparameter.
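
The nesting of the two loops can be sketched on synthetic data. This is an illustration under assumptions, not the exercise pipeline: the two-class dataset from `make_classification`, the variable names, and the settings `Cs=3`, `cv=3`, `cv=5` and `max_iter=1000` are all chosen here for the example (the site prediction task itself has more than two classes).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the connectivity features.
X_syn, y_syn = make_classification(
    n_samples=150, n_features=30, n_informative=5, random_state=0
)

# Inner loop: LogisticRegressionCV picks C among 3 candidates with 3-fold CV.
inner = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=3, cv=3, max_iter=1000),
)

# Outer loop: cross_validate refits the whole pipeline on each training split,
# so the inner selection of C never sees the outer test fold.
outer = cross_validate(inner, X_syn, y_syn, cv=5, return_estimator=True)
print(outer["test_score"])

# Regularization strength selected inside each outer fold. The exact shape of
# C_ may depend on the scikit-learn version.
for fitted in outer["estimator"]:
    print(fitted[-1].C_)
```

The selected `C` can differ from one outer fold to the next, which is expected: hyperparameter selection is part of the fitting procedure being evaluated.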

**We therefore obtain a typical supervised learning experiment, with learning pipelines that involve chained transformations, hyperparameter selection, a cross-validation, and comparison of several models and a baseline.**

# Exercises

* Read, understand and run the code in the cell below. `load_timeseries_and_site` loads the data and returns `X` (a list with one time series array per participant) and `y` (an array of integer site labels). `prepare_pipelines` returns a dictionary whose values are scikit-learn estimators and whose keys are names for each estimator. All estimators are instances of scikit-learn's `Pipeline`, and the first step is always connectivity feature extraction with nilearn's `ConnectivityMeasure`.


* At the moment `prepare_pipelines` only returns 2 estimators: the logistic regression and a dummy estimator. Add a third estimator in the returned dictionary, which contains a dimensionality reduction step: a PCA with 20 components. To do so, add a `sklearn.decomposition.PCA` as the second step of the pipeline. Note 20 is an arbitrary choice; how could we set the number of components in a principled way? What is the largest number of components we could ask for?


* There are 111 regions in the atlas we use to compute region-region connectivity matrices: the output of the `ConnectivityMeasure` has 111 * (111 - 1) / 2 = 6105 columns. If the dataset has 100 participants, what is the size of the coefficients of the logistic regression? Of the selected (20 first) principal components? Of the output of the PCA transformation (i.e. the compressed design matrix)?


* Here we are storing data and model coefficients in arrays of 64-bit floating-point values, meaning each number takes 64 bits = 8 bytes of memory. Approximately how much memory is used by the design matrix X? By the dimensionality-reduced data (i.e. the kept left singular vectors of X)? By the principal components (the kept right singular vectors of X)?


* As you can see, in this script we do not specify explicitly the metric functions that are used to evaluate models, but rely on scikit-learn's defaults instead. What metric is used in order to select the best hyperparameter? What metric is used to compute scores in `cross_validate`? Are these defaults appropriate for our particular situation?


* We do not specify the cross-validation strategy either. Which cross-validation procedure is used in `cross_validate`, and by the `LogisticRegressionCV`? Are these choices appropriate?
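
The count of 6105 columns quoted in the exercises above follows from keeping only the entries strictly below the diagonal of a symmetric 111 x 111 matrix. A minimal NumPy sketch of that count (it illustrates the arithmetic only and makes no claim about the order in which nilearn stores the vectorized entries):

```python
import numpy as np

n_regions = 111

# A correlation matrix is symmetric with ones on the diagonal, so only the
# entries strictly below the diagonal carry distinct information.
rows, cols = np.tril_indices(n_regions, k=-1)
n_features = len(rows)

print(n_features)  # 6105
print(n_features == n_regions * (n_regions - 1) // 2)  # True
```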

## Additional exercises (optional)

* Try replacing the default metrics with other scoring functions from scikit-learn or functions that you write yourself. Does the relative performance of the models change?

* Specify the cross-validation strategy explicitly, possibly choosing a different one than the default.

* Add another estimator to the options returned by `prepare_pipelines`, that uses univariate feature selection instead of PCA.

* What other approach could we use to obtain connectivity features of a lower dimension?

In [None]:
from nilearn import datasets
from nilearn.connectome import ConnectivityMeasure

from sklearn.base import clone
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.dummy import DummyClassifier

from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
def load_timeseries_and_site(n_subjects=100):
    """Load ABIDE timeseries and participants' site.
    Returns X, a list with one array of shape (n_samples, n_rois) per
    participant, and y, an array of length n_participants containing integers
    representing the site each participant belongs to.
    """
    data = datasets.fetch_abide_pcp(
        n_subjects=n_subjects, derivatives=["rois_ho"], quality_checked=False
    )
    X = data["rois_ho"]
    y = LabelEncoder().fit_transform(data["phenotypic"]["SITE_ID"])
    return X, y

def prepare_pipelines():
    """Prepare scikit-learn pipelines for fmri classification with connectivity.
    Returns a dictionary where each value is a scikit-learn estimator (a
    `Pipeline`) and the corresponding key is a descriptive string for that
    estimator.
    As an exercise you need to add a pipeline that performs dimensionality
    reduction with PCA.
    """
    connectivity = ConnectivityMeasure(
        kind="correlation", vectorize=True, discard_diagonal=True
    )
    scaling = StandardScaler()
    logreg = LogisticRegressionCV(solver="liblinear", cv=3, Cs=3)
    logistic_reg = make_pipeline(
        clone(connectivity), clone(scaling), clone(logreg)
    )
    # make_pipeline is a convenient way to create a Pipeline by passing the
    # steps as arguments. clone creates a copy of the input estimator, to avoid
    # sharing the state of an estimator across pipelines.
    dummy = make_pipeline(clone(connectivity), DummyClassifier())
    # TODO: add a pipeline with a PCA dimensionality reduction step to this
    # dictionary. You will need to import `sklearn.decomposition.PCA`.
    return {
        "Logistic no PCA": logistic_reg,
        "Dummy": dummy,
    }

def compute_cv_scores(models, X, y):
    """Compute cross-validation scores for all models
    `models` is a dictionary like the one returned by `prepare_pipelines`, ie
    of the form `{"model_name": estimator}`, where `estimator` is a
    scikit-learn estimator.
    `X` and `y` are the design matrix and the outputs to predict.
    Returns a `pd.DataFrame` with one row for each model and cross-validation
    fold. Columns include `test_score` and `fit_time`.
    """
    all_scores = []
    for model_name, model in models.items():
        print(f"Computing scores for model: '{model_name}'")
        model_scores = pd.DataFrame(cross_validate(model, X, y))
        model_scores["model"] = model_name
        all_scores.append(model_scores)
    all_scores = pd.concat(all_scores)
    return all_scores

In [None]:
X, y = load_timeseries_and_site()
models = prepare_pipelines()
all_scores = compute_cv_scores(models, X, y)
print(all_scores.groupby("model").mean())
sns.stripplot(data=all_scores, x="test_score", y="model")
plt.tight_layout()
plt.show()

In [None]:
########### Solutions ##############
# ## Exercises
#
# Read, understand and run this script. `load_timeseries_and_site` loads the
# data and returns `X` and `y`. `prepare_pipelines` returns a
# dictionary whose values are scikit-learn estimators and whose keys are names
# for each estimator. All estimators are instances of scikit-learn's
# `Pipeline`, and the first step is always connectivity feature extraction with
# nilearn's `ConnectivityMeasure`.
#
# At the moment `prepare_pipelines` only returns 2 estimators: the logistic
# regression and a dummy estimator. Add a third estimator in the returned
# dictionary, which contains a dimensionality reduction step: a PCA with 20
# components. To do so, add a `sklearn.decomposition.PCA` as the second step of
# the pipeline. Note 20 is an arbitrary choice; how could we set the number of
# components in a principled way? What is the largest number of components we
# could ask for?
# Answer: include it in grid search, 80 (rank of X_train)
#
# There are 111 regions in the atlas we use to compute region-region
# connectivity matrices: the output of the `ConnectivityMeasure` has
# 111 * (111 - 1) / 2 = 6105 columns. If the dataset has 100 participants, What
# is the size of the coefficients of the logistic regression? of the selected
# (20 first) principal components? of the output of the PCA transformation (ie
# the compressed design matrix)?
# Answer: 6105 coefficients + intercept; principal components: 20 x 6105;
# compressed X: 100 x 20.
#
# Here we are storing data and model coefficients in arrays of 64-bit
# floating-point values, meaning each number takes 64 bits = 8 bytes of memory.
# Approximately how much memory is used by the design matrix X? by the
# dimensionality-reduced data (ie the kept left singular vectors of X)? by the
# principal components (the kept right singular vectors of X)?
# Answer: X: 4,884,000 B (6105*100*8), compressed X: 16,000 B (100*20*8), V: 976,800 B (20*6105*8)
# (plus roughly 96 bytes of overhead for each array object)
#
# As you can see, in this script we do not specify explicitly the metric
# functions that are used to evaluate models, but rely on scikit-learn's
# defaults instead. What metric is used in order to select the best
# hyperparameter? What metric is used to compute scores in `cross_validate`?
# Are these defaults appropriate for our particular situation?
# Answer: sklearn.metrics.accuracy_score for both, yes
#
# We do not specify the cross-validation strategy either. Which
# cross-validation procedure is used in `cross_validate`, and by the
# `LogisticRegressionCV`? Are these choices appropriate?
#
# ## Additional exercises (optional)
#
# Try replacing the default metrics with other scoring functions from
# scikit-learn or functions that you write yourself. Does the relative
# performance of the models change?
#
# Specify the cross-validation strategy explicitly, possibly choosing a
# different one than the default.
#
# Add another estimator to the options returned by `prepare_pipelines`, that
# uses univariate feature selection instead of PCA.
#
# What other approach could we use to obtain connectivity features of a lower
# dimension?
# Answer: use an atlas with fewer regions


from nilearn import datasets
from nilearn.connectome import ConnectivityMeasure

from sklearn.base import clone
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.dummy import DummyClassifier

from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns


def load_timeseries_and_site(n_subjects=100):
    """Load ABIDE timeseries and participants' site.
    Returns X, a list with one array of shape (n_samples, n_rois) per
    participant, and y, an array of length n_participants containing integers
    representing the site each participant belongs to.
    """
    data = datasets.fetch_abide_pcp(n_subjects=n_subjects, derivatives=["rois_ho"], quality_checked=False)
    X = data["rois_ho"]
    print(data)
    y = LabelEncoder().fit_transform(data["phenotypic"]["SITE_ID"])
    return X, y


def prepare_pipelines():
    """Prepare scikit-learn pipelines for fmri classification with connectivity.
    Returns a dictionary where each value is a scikit-learn estimator (a
    `Pipeline`) and the corresponding key is a descriptive string for that
    estimator.
    As an exercise you need to add a pipeline that performs dimensionality
    reduction with PCA.
    """
    connectivity = ConnectivityMeasure(
        kind="correlation", vectorize=True, discard_diagonal=True
    )
    scaling = StandardScaler()
    logreg = LogisticRegressionCV(solver="liblinear", cv=3, Cs=3)
    # For a faster run without hyperparameter selection, a fixed-C model can be
    # swapped in instead: logreg = LogisticRegression(C=10)
    logistic_reg = make_pipeline(
        clone(connectivity), clone(scaling), clone(logreg)
    )
    # make_pipeline is a convenient way to create a Pipeline by passing the
    # steps as arguments. clone creates a copy of the input estimator, to avoid
    # sharing the state of an estimator across pipelines.
    pca_logistic_reg = make_pipeline(
        clone(connectivity),
        clone(scaling),
        PCA(n_components=20),
        clone(logreg),
    )
    kbest_logistic_reg = make_pipeline(
        clone(connectivity),
        clone(scaling),
        SelectKBest(f_classif, k=300),
        clone(logreg),
    )
    dummy = make_pipeline(clone(connectivity), DummyClassifier())
    # The PCA pipeline (and a univariate feature selection pipeline) are
    # included in the returned dictionary below.
    return {
        "Logistic no PCA": logistic_reg,
        "Logistic with PCA": pca_logistic_reg,
        "Logistic with feature selection": kbest_logistic_reg,
        "Dummy": dummy,
    }


def compute_cv_scores(models, X, y):
    """Compute cross-validation scores for all models
    `models` is a dictionary like the one returned by `prepare_pipelines`, ie
    of the form `{"model_name": estimator}`, where `estimator` is a
    scikit-learn estimator.
    `X` and `y` are the design matrix and the outputs to predict.
    Returns a `pd.DataFrame` with one row for each model and cross-validation
    fold. Columns include `test_score` and `fit_time`.
    """
    all_scores = []
    for model_name, model in models.items():
        print(f"Computing scores for model: '{model_name}'")
        model_scores = pd.DataFrame(cross_validate(model, X, y, return_train_score=True))
        model_scores["model"] = model_name
        all_scores.append(model_scores)
    all_scores = pd.concat(all_scores)
    return all_scores


X, y = load_timeseries_and_site()
models = prepare_pipelines()
all_scores = compute_cv_scores(models, X, y)
print(all_scores.groupby("model").mean())
sns.stripplot(data=all_scores, x="train_score", y="model")
plt.tight_layout()
plt.show()