<a href="https://colab.research.google.com/github/AGeographer/cand3_2025_python/blob/main/CAnD3_ML_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1><img src="https://www.mcgill.ca/cand3/files/cand3/cand3_logo_final_fullname.png" width="150">Introduction to Machine Learning with <img src="https://s3.dualstack.us-east-2.amazonaws.com/pythondotorg-assets/media/community/logos/python-logo-only.png" width="30"> <code>Python</code> for social scientific research </h1>
<h3>Instructor: Dr. Tim Elrick, GIC, McGill (<a href="mailto:tim.elrick@mcgill.ca">tim.elrick@mcgill.ca</a>)</h3>

## Preparation

First, we prepare our notebook by loading relevant modules.

In [56]:
# numpy for working with data
import numpy as np

# pandas for data sets
import pandas as pd

# scikit-learn for machine learning
# for preprocessing (normalizing data)
from sklearn.preprocessing import StandardScaler

# for selecting models

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV

# for model fitting
from sklearn.neighbors import KNeighborsClassifier

Now, we load the data set that we want to work on. Here, we use a data set on restaurants in New York City that contains *Michelin guide*-ratings (they are coming to Montreal this year!).

In [None]:
# Load data set
data = pd.read_csv("https://gattonweb.uky.edu/sheather/book/docs/datasets/MichelinNY.csv",
                   encoding ='latin-1')

# Have a look at the data
data.describe(include= 'all')

We want to use the supervised machine learning method *k-nearest neighbours* to find out whether we can predict if a restaurant is mentioned in the Michelin guide or not.

In [58]:
# We separate the predicted variable y and the predictor variables X (capital X because it is a matrix of several variables)

y = data['InMichelin']

X = data.drop(columns = ['InMichelin', 'Restaurant Name'])

## Cross-Validation

Next, we'll split our sample into two disjoint sets: a **training set** with 80% of our observations, and a **testing set** (or *hold-out sample*) with the remaining 20%, which will not be involved in the training or validation process. We'll also standardize our features.

Then, we'll initialize a KNN classifier and use (stratified) $k$-fold cross-validation to evaluate this basic KNN model.

In [59]:
# Perform train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify = y,
                                                    test_size = 0.2,
                                                    random_state = 905)
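
To see what `stratify` does, here is a minimal sketch on synthetic data. The arrays `toy_X` and `toy_y` are made-up stand-ins, not the restaurant data; only the sample size of 164 is taken from this notebook. With stratification, the share of positive cases should be nearly the same in the full sample, the training set, and the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 164 rows, 4 features, binary outcome
rng = np.random.default_rng(905)
toy_X = rng.normal(size=(164, 4))
toy_y = (rng.random(164) < 0.45).astype(int)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y,
    stratify=toy_y,      # keep the class proportions in both parts
    test_size=0.2,
    random_state=905)

# Share of positive cases in each part
share_all = toy_y.mean()
share_train = toy_y_train.mean()
share_test = toy_y_test.mean()

print("Rows (train, test):", len(toy_y_train), len(toy_y_test))
print("Share of 1s (all, train, test):",
      round(share_all, 3), round(share_train, 3), round(share_test, 3))
```

Without `stratify`, a small test set can end up with a noticeably different class balance purely by chance.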

In [60]:
# Standardizing our data
scaler = StandardScaler()

# we use .fit_transform() on the training data (fitting means that we calculate average and standard deviation)
X_train = scaler.fit_transform(X_train)

# we use only .transform() on the test data, so it is scaled with the mean and standard deviation of the training data (if information from the test data flowed into the preprocessing or training, that would be called data leakage)
X_test = scaler.transform(X_test)
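
The difference between `.fit_transform()` and `.transform()` can be checked by hand. The sketch below uses tiny made-up arrays (`toy_train`, `toy_test` are illustrative, not part of the data set) and shows that the test row is scaled with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up data (assumption): three training rows, one test row
toy_train = np.array([[10.0, 1.0],
                      [20.0, 3.0],
                      [30.0, 5.0]])
toy_test = np.array([[40.0, 7.0]])

toy_scaler = StandardScaler()
toy_train_std = toy_scaler.fit_transform(toy_train)  # learns mean and sd, then scales
toy_test_std = toy_scaler.transform(toy_test)        # reuses the training mean and sd

print("Training means:", toy_scaler.mean_)
print("Training standard deviations:", toy_scaler.scale_)

# The same calculation by hand, using training statistics only
manual_test_std = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)
print("Scaled test row:", toy_test_std)
```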

In [None]:
# Initializing the KNN classifier and fitting the model. Here, we arbitrarily choose k=5

model = KNeighborsClassifier(n_neighbors = 5)

model.fit(X_train, y_train)
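
What the classifier does when it predicts can be sketched by hand: it measures the distance from a new observation to every training observation, takes the $k$ closest ones, and lets them vote. The points below are made up for illustration, and the hand-written version assumes Euclidean distance and unweighted votes, which are the defaults of `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training points (assumption): class 0 near (0, 0), class 1 near (1, 1)
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9], [1.2, 1.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1, 1])
new_point = np.array([0.3, 0.3])

k = 5

# 1. Euclidean distance from the new point to every training point
distances = np.sqrt(((toy_X - new_point) ** 2).sum(axis=1))

# 2. Indices of the k closest training points
nearest = np.argsort(distances)[:k]

# 3. Majority vote among their labels
votes = np.bincount(toy_y[nearest])
manual_prediction = votes.argmax()

# The same prediction with scikit-learn
toy_model = KNeighborsClassifier(n_neighbors=k).fit(toy_X, toy_y)
sklearn_prediction = toy_model.predict(new_point.reshape(1, -1))[0]

print("Votes per class:", votes)
print("Manual:", manual_prediction, "| scikit-learn:", sklearn_prediction)
```

Because distances drive the vote, features on larger scales would dominate, which is why we standardized the data first.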

In [34]:
# Now, let's set up the k-fold cross-validation. Common choices are 5 to 10 folds;
# for smaller data sets you usually choose more folds than for larger data sets.

skfold = StratifiedKFold(n_splits = 10,
                         shuffle = True,
                         random_state = 905)

In [None]:
# Cross-validation scores for each fold (accuracy)

scores = cross_val_score(model, X_train, y_train,
                         cv = skfold)

scores
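
As a rough sketch of what `cross_val_score` does behind the scenes: for each fold it fits a fresh copy of the model on the other folds and scores it on the fold left out. The example uses synthetic data from `make_classification` (an assumption, so that it runs on its own), and the names prefixed with `toy_` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the standardized training data (assumption)
toy_X, toy_y = make_classification(n_samples=130, n_features=4, random_state=905)

toy_model = KNeighborsClassifier(n_neighbors=5)
toy_skfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=905)

manual_scores = []
for fit_idx, val_idx in toy_skfold.split(toy_X, toy_y):
    fold_model = clone(toy_model)                      # fresh, unfitted copy
    fold_model.fit(toy_X[fit_idx], toy_y[fit_idx])     # fit on 9 of the 10 folds
    manual_scores.append(fold_model.score(toy_X[val_idx], toy_y[val_idx]))  # accuracy on the held-out fold
manual_scores = np.array(manual_scores)

# The one-line equivalent
library_scores = cross_val_score(toy_model, toy_X, toy_y, cv=toy_skfold)

print("Manual: ", manual_scores.round(3))
print("Library:", library_scores.round(3))
```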

Each score is the accuracy on one held-out fold. If we get a high mean score with little variation across the folds, the model performs well.

In [None]:
# So, the average of all folds is

scores.mean()

Now, let's compare this to the accuracy of our fitted model on the **test data**. If the value is similar to the *cross-validation* mean, it's a good sign. If the test score is much lower than the cross-validation score, your model might be overfitted.

In [None]:
# Measure of predictive performance (accuracy of our model)

model.score(X_test, y_test)

## Hyperparameter Optimization

Next, we'll use `GridSearchCV` to select the optimal value of $k$ by running a grid search over possible hyperparameter values (**Rule of thumb:** odd numbers between 1 and $\sqrt{n}$, where $n$ is the sample size; here $\sqrt{164} \approx 13$).

In [None]:
# Creating a dictionary (in ML called a parameter grid) that maps the hyperparameter name to an array of potential values (odd numbers from 1 to 13):

k_grid = {'n_neighbors': np.arange(start = 1, stop = 15, step = 2) }

print(k_grid)
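
The rule of thumb can also be computed instead of typed in. This is a small sketch assuming the sample size of 164 quoted above and rounding $\sqrt{n}$ to the nearest integer; `toy_k_grid` is an illustrative name.

```python
import numpy as np

n_samples = 164                              # sample size quoted in the text
k_max = int(round(np.sqrt(n_samples)))       # sqrt(164) is about 12.8, rounded to 13

# Odd values only: with two classes, an odd k cannot produce a tied vote
odd_ks = np.arange(1, k_max + 1, 2)
toy_k_grid = {"n_neighbors": odd_ks}

print(k_max)
print(toy_k_grid)
```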

In [None]:
# Setting up a grid search to home in on the best value of k:

grid = GridSearchCV(KNeighborsClassifier(), param_grid = k_grid, cv = skfold)

print(grid)

In [None]:
# Now we fit the grid models

grid.fit(X_train, y_train)
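
One possible refinement, not used in this notebook: because the scaler above was fitted on the whole training set before cross-validation, each validation fold has already contributed to the mean and standard deviation. Wrapping the scaler and the classifier in a `Pipeline` lets the scaler be re-fitted inside every fold. The sketch below uses synthetic data (an assumption), and the step names `"scaler"` and `"knn"` are arbitrary labels chosen here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled stand-in for the training data (assumption)
toy_X, toy_y = make_classification(n_samples=130, n_features=4, random_state=905)

# Scaling and KNN as one estimator; the scaler is fitted per fold
toy_pipeline = Pipeline([("scaler", StandardScaler()),
                         ("knn", KNeighborsClassifier())])

# Hyperparameters of a pipeline step are addressed as <step name>__<parameter>
toy_grid = GridSearchCV(toy_pipeline,
                        param_grid={"knn__n_neighbors": np.arange(1, 15, 2)},
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=905))
toy_grid.fit(toy_X, toy_y)

print(toy_grid.best_params_)
print(round(toy_grid.best_score_, 3))
```

On a data set this small the difference is usually minor, but the pipeline keeps the validation folds strictly unseen during preprocessing.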


In [None]:
# Extract best score and hyperparameter value:

print("Best Mean Cross-Validation Score: {:.3f}".format(grid.best_score_))

print("Best Parameters (Value of k): {}".format(grid.best_params_))

print("Test Set Score: {:.3f}".format(grid.score(X_test, y_test)))

In [None]:
# We can look at the results as a data frame

pd.DataFrame(grid.cv_results_).head()
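
The full results table has many columns. A minimal sketch of a more readable view, again on synthetic data (an assumption) so that it runs on its own: keep only the value of $k$, the mean and standard deviation of the fold scores, and the rank, then sort by rank.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the standardized training data (assumption)
toy_X, toy_y = make_classification(n_samples=130, n_features=4, random_state=905)

toy_grid = GridSearchCV(KNeighborsClassifier(),
                        param_grid={"n_neighbors": np.arange(1, 15, 2)},
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=905))
toy_grid.fit(toy_X, toy_y)

# One row per candidate k, best candidate first
results = pd.DataFrame(toy_grid.cv_results_)
summary = (results[["param_n_neighbors", "mean_test_score",
                    "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))

print(summary)
```

Note that `mean_test_score` here refers to the validation folds of the cross-validation, not to the hold-out test set.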