# Love Thy Neighbour

In this Mission you'll learn all about distances between data points and how we can draw insights from each point's neighbours. You will also learn two new models:

 - KNN for Classification
 - K-Means for Clustering

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## The Dataset

The provided dataset contains information about **penguins**:
 - `species`: (string) the penguin's species
 - `island`: (string) the penguin's origin island
 - `culmen_length_mm`: (number) culmen's length in millimeters
 - `culmen_depth_mm`: (number) culmen's depth in millimeters
 - `flipper_length_mm`: (number) flipper's length in millimeters
 - `body_mass_g`: (number) penguin's mass in grams
 - `is_male`: (integer) whether the penguin is male (1) or female (0)
 
### Load the data

In [2]:
train_df = pd.read_csv('penguins_train.csv')

## Tasks

Always start by taking a look at the dataset, getting familiar with the variables, and checking for any data issues before diving into the problem.
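A first look at a dataframe could go along these lines. This is only a sketch on a tiny made-up dataframe (`toy_df` and its values are assumptions for illustration, not rows from `penguins_train.csv`); the same calls apply to `train_df`.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real training data.
toy_df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "culmen_length_mm": [39.1, np.nan, 49.5],
    "body_mass_g": [3750.0, 5000.0, 3800.0],
})

print(toy_df.head())      # first rows
toy_df.info()             # column types and non-null counts
print(toy_df.describe())  # summary statistics of the numerical columns

# Number of missing values per column.
missing = toy_df.isna().sum()
print(missing)
```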

## 1. Let's get different distances between all our points.
#### 1.1 Create a new dataframe with only the numerical columns

#### 1.2. Create a new dataframe called `all_comb` with all the combinations of pairs of points, *i.e.,* a dataframe with the following columns:
```
['index_x', 'culmen_length_mm_x', 'culmen_depth_mm_x',
'flipper_length_mm_x', 'body_mass_g_x', 'index_y', 'culmen_length_mm_y',
'culmen_depth_mm_y', 'flipper_length_mm_y', 'body_mass_g_y']
```

**Question**: How many rows do you expect the dataframe to have?

*Hint:* Check `pd.merge` with parameter `how = 'cross'` and use `.reset_index()` to keep the original indexes of the points.
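To see how the hint behaves before applying it to the penguins, here is a small sketch on a made-up dataframe with only two of the features (`toy_num`, `toy_points` and `toy_comb` are hypothetical names). It assumes pandas 1.2 or later, where `how='cross'` is available.

```python
import pandas as pd

# Made-up numerical data with only two of the features.
toy_num = pd.DataFrame({
    "culmen_length_mm": [39.1, 46.5, 50.0],
    "body_mass_g": [3750.0, 4400.0, 5700.0],
})

# reset_index() turns the original index into a regular column named "index",
# so each side of the merge keeps it as index_x / index_y.
toy_points = toy_num.reset_index()

# how="cross" pairs every row with every row, including itself.
toy_comb = pd.merge(toy_points, toy_points, how="cross")

print(toy_comb.columns.tolist())
print(len(toy_comb))
```

Because both sides share every column name, pandas appends its default suffixes `_x` and `_y` to all of them.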

#### 1.3. Build a function that receives a row of the dataframe `all_comb` and outputs the two points you want to measure distance between. That is:
 - point A should have info about `'culmen_length_mm_x', 'culmen_depth_mm_x','flipper_length_mm_x', 'body_mass_g_x'`
 - point B should have info about `'culmen_length_mm_y', 'culmen_depth_mm_y','flipper_length_mm_y', 'body_mass_g_y'`
 
*Hint:* Use `.to_numpy()` to turn a series into a numpy array.
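One possible shape for such a function is sketched below. The names `FEATURES`, `split_points` and `example_row` are assumptions made for illustration, and the row values are invented.

```python
import pandas as pd

FEATURES = ["culmen_length_mm", "culmen_depth_mm",
            "flipper_length_mm", "body_mass_g"]

def split_points(row):
    """Return points A (the _x columns) and B (the _y columns) as numpy arrays."""
    point_a = row[[f"{col}_x" for col in FEATURES]].to_numpy(dtype=float)
    point_b = row[[f"{col}_y" for col in FEATURES]].to_numpy(dtype=float)
    return point_a, point_b

# A single made-up row with the same layout as a row of `all_comb`.
example_row = pd.Series({
    "index_x": 0, "culmen_length_mm_x": 39.1, "culmen_depth_mm_x": 18.7,
    "flipper_length_mm_x": 181.0, "body_mass_g_x": 3750.0,
    "index_y": 1, "culmen_length_mm_y": 46.5, "culmen_depth_mm_y": 14.8,
    "flipper_length_mm_y": 217.0, "body_mass_g_y": 5200.0,
})

point_a, point_b = split_points(example_row)
print(point_a, point_b)
```

Selecting the columns by name keeps `index_x` and `index_y` out of the points, so they never leak into a distance.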

#### 1.4. Build 3 functions corresponding to 3 different distances: each should receive a row of `all_comb` and output the corresponding distance between the two points in that row.

#### 1.5. Create 3 new columns with the different distances between the points by `apply`ing your functions to the `all_comb` rows.
Note: It can take some time to run - you're doing a lot of calculations!
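The exercise does not say which three distances to use. The sketch below assumes Euclidean, Manhattan and Chebyshev, and runs on a two-row made-up dataframe with only two features. The helper `_points`, the column lists and `pairs` are hypothetical names.

```python
import numpy as np
import pandas as pd

X_COLS = ["culmen_length_mm_x", "body_mass_g_x"]
Y_COLS = ["culmen_length_mm_y", "body_mass_g_y"]

def _points(row):
    # Split one row into the two points being compared.
    return row[X_COLS].to_numpy(dtype=float), row[Y_COLS].to_numpy(dtype=float)

def euclidean(row):
    a, b = _points(row)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(row):
    a, b = _points(row)
    return float(np.sum(np.abs(a - b)))

def chebyshev(row):
    a, b = _points(row)
    return float(np.max(np.abs(a - b)))

# Two made-up pairs: identical points, then points differing by (3, 4).
pairs = pd.DataFrame({
    "culmen_length_mm_x": [40.0, 40.0],
    "body_mass_g_x": [3000.0, 3000.0],
    "culmen_length_mm_y": [40.0, 43.0],
    "body_mass_g_y": [3000.0, 3004.0],
})

# axis=1 passes each row (as a Series) to the function.
pairs["euclidean"] = pairs.apply(euclidean, axis=1)
pairs["manhattan"] = pairs.apply(manhattan, axis=1)
pairs["chebyshev"] = pairs.apply(chebyshev, axis=1)
print(pairs[["euclidean", "manhattan", "chebyshev"]])
```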

## 2. Getting a point's neighbors 
#### 2.1 Get the 5 closest points to the penguin with original `index_x = 10` (using a distance of your choice).
*Note:* the closest point will always be the point itself with a distance of 0.
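Picking the nearest rows for one point can be sketched as follows. The dataframe `toy_dist` and its distances are made up; with the real data you would filter `all_comb` instead.

```python
import pandas as pd

# Made-up distances from point 10 to a handful of other points.
toy_dist = pd.DataFrame({
    "index_x": [10] * 7,
    "index_y": [10, 3, 7, 21, 4, 15, 8],
    "euclidean": [0.0, 12.5, 3.2, 40.1, 7.7, 1.1, 25.0],
})

# Keep only the pairs that start at point 10, then take the 5 smallest distances.
neighbours = toy_dist[toy_dist["index_x"] == 10].nsmallest(5, "euclidean")
neighbour_ids = neighbours["index_y"].tolist()
print(neighbour_ids)
```

Assuming the indexes were kept with `.reset_index()`, a list such as `neighbour_ids` can be passed to `train_df.loc[...]` to look the same penguins up in the original dataframe.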

#### 2.2. Using the indexes from the obtained points, check the original dataframe (with categorical features) to see if there is a pattern.

#### 2.3. If you didn't know the species and gender of your penguin, based on its neighbours what would be your guess?

#### 2.4. 🤔  Look at the numerical variables' scale. What feature do you think is impacting the distance the most? Should this be the case? What can we do to prevent this?

#### (OPTIONAL) 2.5. Repeat the same exercise with different distances of your choice and compare the results.

## 3. Let's use Sklearn to predict a penguin's gender with the KNN model.
**Steps** 
 1. Import [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) model from Sklearn. 
 
 2. Create a dataframe `X` with the numerical features and a series `y` with the target (`is_male`)
 3. Instantiate the model `knn = KNeighborsClassifier()`
 4. Fit it on your data.
 5. Check what parameters `n_neighbors` and `metric` are in the documentation. What values are being used by default?
 6. Import classification metrics of your choice from sklearn.
 7. Use your trained model to make predictions for your dataset.
 8. Measure your model's performance with your classification metrics. 
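The steps above can be sketched on synthetic data as follows. `X_toy` and `y_toy` are randomly generated stand-ins (an assumption for illustration, not the penguin data), and accuracy is just one possible choice of metric.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the penguin features and the `is_male` target.
rng = np.random.default_rng(0)
n = 100
is_male = rng.integers(0, 2, size=n)
X_toy = pd.DataFrame({
    "flipper_length_mm": 190 + 10 * is_male + rng.normal(0, 5, n),
    "body_mass_g": 3500 + 700 * is_male + rng.normal(0, 300, n),
})
y_toy = pd.Series(is_male, name="is_male")

knn = KNeighborsClassifier()
knn.fit(X_toy, y_toy)

# get_params() shows the values the model is actually using.
params = knn.get_params()
print(params["n_neighbors"], params["metric"])

# Predict on the same data the model was fitted on, as in step 7.
y_pred = knn.predict(X_toy)
train_accuracy = accuracy_score(y_toy, y_pred)
print(train_accuracy)
```

Keep in mind that this score is measured on the data the model was fitted on, so it tends to be optimistic.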

## 4. Comparing different models

**Steps:**
 1. Load the test data below
 
 2. Create variables `X_test` and `y_test`, similarly as you did before
 3. Define different `KNeighborsClassifier` models by choosing different values for `n_neighbors` and `metric` (see available distances [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics)). You should use at least 4 models: 5 and 10 neighbors with Euclidean and Manhattan distances. (Feel free to test as many as you want)
 4. Use the provided function `get_metrics` to compare the different models.
 5. Which parameters yield better performance?
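A comparison loop might look like the sketch below. Since `get_metrics` is not shown here, it uses `accuracy_score` as a stand-in, and it uses synthetic data split with `train_test_split` in place of the provided train and test files. Both are assumptions for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the train and test sets.
X_all, y_all = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=0)

results = {}
for k, metric in product([5, 10], ["euclidean", "manhattan"]):
    model = KNeighborsClassifier(n_neighbors=k, metric=metric)
    model.fit(X_tr, y_tr)
    # Score on the held-out data, not on the training data.
    results[(k, metric)] = accuracy_score(y_te, model.predict(X_te))

for (k, metric), acc in results.items():
    print(f"n_neighbors={k:>2} metric={metric:<9} accuracy={acc:.3f}")
```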

## 5. The importance of scaling

**Steps:**
 1. Import [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from Sklearn
 
 2. Instantiate (`scaler = StandardScaler()`) and fit your scaler on your training data
 3. Create two new datasets `X_train_scaled` and `X_test_scaled` by using your scaler to `transform` your training and test data
**🚨 You should ONLY fit your scaler on the training data. Test data is unseen data that we know nothing about at training time, so it must be scaled using the statistics learned from the training data.**
 4. Re-run your `get_metrics` function with the same models as before, but using the scaled versions of your data: `X_train_scaled` and `X_test_scaled`.
 5. Contemplate the importance of scaling!
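The fit-on-train, transform-both pattern can be sketched on a few made-up rows (the `*_toy` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [culmen_length_mm, body_mass_g].
X_train_toy = np.array([[39.0, 3500.0], [45.0, 4500.0], [51.0, 5500.0]])
X_test_toy = np.array([[42.0, 4000.0]])

scaler = StandardScaler()
scaler.fit(X_train_toy)  # learns the mean and standard deviation from the training data only

X_train_toy_scaled = scaler.transform(X_train_toy)
X_test_toy_scaled = scaler.transform(X_test_toy)  # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(X_test_toy_scaled)
```

After scaling, both features of the test row end up at the same value even though their raw units differ by orders of magnitude, which is why no single feature dominates the distance any more.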

## 6. Clustering with KMeans

**Steps:**
 1. Import [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) from Sklearn.
 2. Create a **for** loop that, for each value of **k** between 1 and 10:
  1. Defines `model = KMeans(n_clusters=k)`
  2. Fits model on `X_train_scaled`
  3. Saves model's inertia (`model.inertia_`) in a list
 
 3. Plot the values of inertia against values of k 
 4. Identify the elbow in the curve
 5. Define and fit a KMeans with `n_clusters = 3` on `X_train_scaled`
 6. Use your model to make predictions on `X_train_scaled` and store them in a variable named `clusters_3`
 7. Run the command `plt.scatter(X.culmen_depth_mm,X.culmen_length_mm, c = clusters_3)`, which plots the original data's `culmen_depth_mm` against `culmen_length_mm`, with each point colored according to the cluster it belongs to.
 8. What do you think the clusters could represent? Check your original dataframe's categorical features. 
 9. Use `pd.crosstab` to compare `clusters_3` with the `island` and `species` features. Which variable do the clusters seem to be representing?
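The inertia loop and the cross-tabulation can be sketched on synthetic blobs as follows. `X_blobs` and `group` come from `make_blobs` and stand in for the scaled penguin features and a categorical column. `n_init` and `random_state` are set explicitly only to keep the run reproducible.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups.
X_blobs, group = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Fit one model per k and record its inertia (within-cluster sum of squares).
ks = range(1, 11)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X_blobs)
    inertias.append(model.inertia_)

# Fit the chosen model and compare its clusters with the known groups.
kmeans_3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs)
clusters_toy = kmeans_3.predict(X_blobs)
table = pd.crosstab(clusters_toy, group, rownames=["cluster"], colnames=["group"])
print(table)
```

Plotting `inertias` against `list(ks)` with `plt.plot` gives the elbow curve. Cluster labels are arbitrary, so in the crosstab look for rows dominated by a single column rather than for matching numbers.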

**(OPTIONAL)** Repeat steps 5 to 8 with `n_clusters = 2` and compare the predictions with the `is_male` variable.

**⚠️ Remember:** Clustering is a type of Unsupervised Learning - there is **no target**. It is up to us to interpret what each cluster can represent and analyse each cluster in order to draw conclusions. 