# A quick hands-on tutorial in Supervised ML

In this tutorial we are going to use an already cleaned dataset of [Nomadlist Cities](https://nomadlist.com/) to predict the continent on which each city is located.

![alt text](https://source.unsplash.com/1JNk998-g70/800x600)

We are going to encode the target labels (world regions), scale our features and try out different algorithms including Logistic Regression, Random Forest and XGBoost.

We will also try to predict the `nomad_score`, a continuous variable. This is a different kind of problem: a regression problem. We will need slightly different tooling for model fitting and for evaluation.

The tutorial will mostly rely on the scikit-learn (sklearn) ML library.
You will see that the syntax and logic of sklearn are also used in other, newer libraries like XGBoost.

In [None]:
# Import standard Libraries
import pandas as pd
import seaborn as sns
import altair as alt


sns.set(rc={'figure.figsize':(10,10)})

## Loading and selecting the data

In [None]:
# Load data
data = pd.read_csv('https://github.com/CALDISS-AAU/sdsphd20/raw/master/datasets/cities_sds_phd.csv')

In [None]:
data.info()

In [None]:
# Select the (independent) features that we are going to use to train the model
X = data.loc[:,'cost_nomad':'weed']

In [None]:
# Define the dependent variable / target to predict (world region)
y = data.region

## Transforming, preprocessing and splitting

In [None]:
# Load and instantiate a LabelEncoder that will turn our text labels (regions) into indices
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [None]:
# Transform labels into indices by passing y to the encoder
y_enc = encoder.fit_transform(y)
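
To make the encoding step concrete, here is a minimal sketch on a few made-up labels (the `regions` list and the `toy_encoder` name are illustrative and not part of the dataset). It shows that the encoder stores the classes in sorted order and that `inverse_transform` recovers the original text.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (illustrative, not taken from the dataset)
regions = ["Europe", "Asia", "Europe", "Africa"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(regions)

print(list(toy_encoder.classes_))                    # classes, stored in sorted order
print(list(codes))                                   # integer index of each label
print(list(toy_encoder.inverse_transform(codes)))    # back to the text labels
```

The index assigned to a label is simply its position in `classes_`, which is why we can translate predictions back to region names later on.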

In [None]:
# Load and instantiate a StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
# Apply the scaler to our X-features
X_scaled = scaler.fit_transform(X)
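
If you are wondering what the scaler actually does, the following small sketch (with a made-up array called `toy`) shows that each column ends up with mean 0 and standard deviation 1, regardless of its original scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (illustrative values)
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

toy_scaled = StandardScaler().fit_transform(toy)

print(toy_scaled.mean(axis=0))  # column means after scaling
print(toy_scaled.std(axis=0))   # column standard deviations after scaling
```

After scaling, both columns hold the same values, so a feature measured in large units no longer dominates one measured in small units.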

In [None]:
# Split the data using the train_test_split function. We keep 20% of the data for testing and use 80% to train the model
# Random state defined with an arbitrary number for reproducibility

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_enc, test_size = 0.2, random_state = 42)

## Training and evaluating various models

In [None]:
# Import modules that we are going to use for all models

# Import K-fold crossvalidation
from sklearn.model_selection import cross_val_score

# Import Classification Report for later evaluation of performance
from sklearn.metrics import classification_report

### LogisticRegression (let's call it that for now without going into details)

In [None]:
# Import and instantiate the model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=2000)

# K-fold cross-validation (splitting the 80% into 5 chunks, using 4 to train and 1 to evaluate)
scores = cross_val_score(model, X_train, y_train, cv = 5)
print(scores)

# Model training
model.fit(X_train, y_train)

# Model performance on the test-set
print(model.score(X_test, y_test))
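
What `cross_val_score` does under the hood can be sketched by hand. The example below uses synthetic data from `make_classification` (an assumption for illustration, not the cities data) and loops over the folds manually. For a classifier and an integer `cv`, sklearn uses stratified folds, so the manual loop uses `StratifiedKFold` as well.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=4, random_state=0)
clf = LogisticRegression(max_iter=2000)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(clf)                            # fresh, unfitted copy per fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

auto_scores = cross_val_score(clf, X_demo, y_demo, cv=5)

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

Both lines should show the same five accuracies: one per held-out chunk.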

The overall accuracy is about 65%, which is not too impressive. In a multiclass setting that number is also somewhat hard to interpret, and that is where it is useful to look at other evaluation statistics.

In [None]:
# Performance evaluation using the classification_report

target_names = encoder.inverse_transform(sorted(set(y_test))) # get real region names back (sorted so the names line up with the label order)

y_pred = model.predict(X_test) # predict from the testset

print(classification_report(y_test, y_pred, target_names = target_names)) #Print out the report

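Logistic regression is not doing too well. It is particularly bad at predicting African cities.

Here the recall score (the share of cities from a region that the model actually finds) is perhaps more interesting than the precision score (the share of cities assigned to a region that really belong there).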
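
The difference between the two scores is easiest to see on a tiny made-up example (the labels below are illustrative, with class `0` playing the role of a rare region). The model only labels one observation as the rare class and gets it right, so precision looks perfect while recall reveals that most rare cases were missed.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels: 0 = rare class, 1 = common class (illustrative only)
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)
precision_rare = precision_score(y_true_demo, y_pred_demo, pos_label=0)
recall_rare = recall_score(y_true_demo, y_pred_demo, pos_label=0)

print(accuracy_demo)    # share of all predictions that are correct
print(precision_rare)   # of the predicted rare cases, how many are right
print(recall_rare)      # of the true rare cases, how many were found
```

Accuracy is 0.8 here even though two of the three rare cases were missed, which is exactly why a single accuracy number can be misleading for small classes.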

#### Let's inspect the performance visually

In [None]:
!pip uninstall -qq -y mlxtend

In [None]:
# For that we need to install an updated version of the MLxtend library (it will make plotting of the confusion matrix easy)
!pip install -qq -U mlxtend

In [None]:
# Import the confusion matrix plotter module
from mlxtend.plotting import plot_confusion_matrix

# We will also import sklearn's confusion_matrix function that will make it easy to produce a confusion matrix
# It's actually just a cross-tab of predicted vs. real values
from sklearn.metrics import confusion_matrix
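
To back up the cross-tab remark, here is a small sketch with made-up labels (`actual` and `predicted` are illustrative) showing that `confusion_matrix` and `pd.crosstab` give the same counts, with true classes in the rows and predicted classes in the columns.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy true and predicted class indices (illustrative only)
actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(actual, predicted)
ct = pd.crosstab(pd.Series(actual, name="actual"),
                 pd.Series(predicted, name="predicted"))

print(cm)
print(ct)
```

The two only match this neatly when every class shows up in both the true and the predicted labels, as it does here; `confusion_matrix` always returns a square matrix, while a cross-tab drops classes that never occur.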

In [None]:
# calculate the confusion matrix
confmatrix = confusion_matrix(y_test,y_pred) 

# Let's plot
plot_confusion_matrix(conf_mat=confmatrix,
                                colorbar=True,
                                show_absolute=True,
                                show_normed=True,
                                hide_spines = True,
                                class_names=target_names)

As you can see, the model struggled a lot with African cities and places in Oceania. That is probably because there are not many of them in the data, which makes it hard for the model to learn about their characteristics.

Some cities from the Americas have been placed in Europe (probably places like Boston, or cities in Latin America that are similar to Southern Europe). It's an interesting exercise to explore misplaced observations...

### Random Forest
Now we can try out a more complex (and hopefully more powerful) model.
The process is exactly the same, so there are not too many comments in the code.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

scores = cross_val_score(model, X_train, y_train, cv = 5)
print(scores)

model.fit(X_train, y_train)
print(model.score(X_test, y_test))

The test score is well within the range of values produced in the cross-validation.
Overall performance goes up (as expected).

In [None]:
# Performance evaluation using the classification_report

target_names = encoder.inverse_transform(sorted(set(y_test))) # get real region names back (sorted so the names line up with the label order)

y_pred = model.predict(X_test) # predict from the testset

print(classification_report(y_test, y_pred, target_names=target_names)) #Print out the report

In [None]:
# calculate the confusion matrix
confmatrix = confusion_matrix(y_test,y_pred) 

# Let's plot
plot_confusion_matrix(conf_mat=confmatrix,
                                colorbar=True,
                                show_absolute=True,
                                show_normed=True,
                                class_names = target_names)

While the model is better at classifying the larger groups, performance stays the same for Oceania and goes down for Africa.

### XGBoost
Finally, XGBoost (again we will use standard settings, i.e. no hyperparameter tuning).

In [None]:
import xgboost as xgb

model = xgb.XGBClassifier()

scores = cross_val_score(model, X_train, y_train, cv = 5)
print(scores)

model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Overall performance is even higher than with Random Forest. But let's see how the algorithm deals with our problematic small classes.

In [None]:
# Performance evaluation using the classification_report

target_names = encoder.inverse_transform(sorted(set(y_test))) # get real region names back (sorted so the names line up with the label order)

y_pred = model.predict(X_test) # predict from the testset

print(classification_report(y_test, y_pred, target_names=target_names)) #Print out the report

In [None]:
# calculate the confusion matrix
confmatrix = confusion_matrix(y_test,y_pred) 

# Let's plot
plot_confusion_matrix(conf_mat=confmatrix,
                                colorbar=True,
                                show_absolute=True,
                                show_normed=True,
                                class_names = target_names)

Overall, it seems XGBoost wins this time.

This notebook is only a quick example of the mechanics of various algorithms on small data.
In real-world situations we would need to spend much more time tuning the models. Also: more complex models do not always perform better...
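
As a rough idea of what tuning could look like, here is a minimal sketch using sklearn's `GridSearchCV` on synthetic data. The parameter grid, the data from `make_classification` and the choice of a random forest are all assumptions made for illustration; they are not tuned settings for the cities data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately tiny grid; real searches usually cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)   # best combination found on the grid
print(search.best_score_)    # mean cross-validated accuracy of that combination
```

Each combination in the grid is evaluated with cross-validation on the training data, so the test set stays untouched until the very end.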

## Predicting the Nomad Score

So far we have considered a classification problem: the model had to pick one of the 5 options. The outcome variable was a class. Let's shift gears and look at a different type of problem: a prediction where the outcome is a continuous variable. This is our "typical" regression problem.

In the following we are going to predict the nomad score. The inputs to the model will be the same ones that we already used for predicting the region. We are only going to change the dependent variable.

In [None]:
# picking a different outcome variable

y_reg = data.nomad_score

In [None]:
# We need to create new train / test splits here, as the nomad_score was not part of the previous split.

X_train, X_test, y_train, y_test, data_train, data_test = train_test_split(X_scaled, y_reg, data, test_size = 0.2, random_state = 42)

As you can see, I also put the overall dataframe into the split as a third component. This is only for some interactive visuals down the line. But yeah, you can do that too... :-) Sometimes it's also handy for carrying along indices that let you get back to data that would otherwise be inaccessible.
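
Here is a small sketch of why that works, using a made-up dataframe (`frame`, `features` and `target` are illustrative names). `train_test_split` applies the same random row selection to every array you pass in, so the pieces stay aligned.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe with ten rows (illustrative only)
frame = pd.DataFrame({"place": list("abcdefghij"), "score": np.arange(10.0)})
features = frame[["score"]].to_numpy() * 2   # stand-in for a scaled feature matrix
target = frame["score"]

f_train, f_test, t_train, t_test, frame_train, frame_test = train_test_split(
    features, target, frame, test_size=0.2, random_state=42
)

print(frame_test)          # the original rows that ended up in the test set
print(t_test.to_numpy())   # the matching target values
print(f_test[:, 0])        # the matching feature values
```

Because the rows line up, predictions made on the test features can be attached straight to the test dataframe as a new column.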

In [None]:
# Import and instantiate the baseline model
from sklearn.linear_model import LinearRegression
model = LinearRegression()

# Model training
model.fit(X_train, y_train)

# Model performance on the test set / This score is not accuracy but R^2
print(model.score(X_test, y_test))
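
For regressors, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on a few made-up numbers and compares the result with sklearn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy true values and predictions (illustrative only)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)          # unexplained variation
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true_demo, y_pred_demo))
```

A value of 1 means perfect predictions, 0 means the model is no better than always predicting the mean, and negative values are possible for models that do worse than that.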

In [None]:
# We can also inspect our results visually
y_pred = model.predict(X_test)

sns.scatterplot(x=y_test, y=y_pred)  # recent seaborn versions require x and y as keyword arguments

In [None]:
data_test.info()

In [None]:
data_test['nomad_score_pred'] = y_pred

alt.Chart(data_test).mark_circle(size=60).encode(
    x='nomad_score',
    y='nomad_score_pred',
    color=alt.Color('region', scale=alt.Scale(scheme='category10')),
    tooltip=['region','weed','place']
).interactive()

Let's try a different model class.

In [None]:
# Instantiate an XGBoost regressor (xgboost was already imported above as xgb)

model = xgb.XGBRegressor()

# Model training
model.fit(X_train, y_train)

# Model performance on the test set / This score is not accuracy but R^2
print(model.score(X_test, y_test))

In [None]:
# We can also inspect our results visually
y_pred = model.predict(X_test)

sns.scatterplot(x=y_test, y=y_pred)  # recent seaborn versions require x and y as keyword arguments

In [None]:
data_test['nomad_score_pred'] = y_pred

alt.Chart(data_test).mark_circle(size=60).encode(
    x='nomad_score',
    y='nomad_score_pred',
    color=alt.Color('region', scale=alt.Scale(scheme='category10')),
    tooltip=['region','weed','place']
).interactive()

# Your turn


In the repo, you will find a dataset describing employee turnover in a company.

https://raw.githubusercontent.com/CALDISS-AAU/sdsphd20/master/datasets/turnover.csv

The dataset contains data collected in an employee survey and enriched with HR data.

The variable `churn` tells us if the employee left the company in the past 3 months. The other variables are collected

## Classification

Try to predict `churn` using a classification pipeline (perhaps add some simple exploration of the data first).

## Regression
Try to predict the number of weekly average hours worked.

**Before** working with the data, you should use `pd.get_dummies` to get dummies for categorical variables.
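
If you have not used `pd.get_dummies` before, here is a minimal sketch on a made-up dataframe. The column names `department`, `salary` and `hours` are hypothetical and may not match the turnover data, so check the real column names first.

```python
import pandas as pd

# Toy HR-style data (hypothetical columns, illustrative only)
toy_hr = pd.DataFrame({
    "department": ["sales", "it", "sales"],
    "salary": ["low", "high", "medium"],
    "hours": [40, 45, 38],
})

# Turn each categorical column into one indicator column per category
toy_dummies = pd.get_dummies(toy_hr, columns=["department", "salary"])

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Numeric columns are passed through unchanged, while every category becomes its own column named `<column>_<category>`.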

