# Workshop 1: Welcome to Machine Learning! #

- **When**: Wednesday Week 2, 18:00 - 19:30 
- **Where**: AT 5.04
- **Contact**: hello@edinburghai.org
- **Credits**: This notebook is created by EdinburghAI for use in its workshops. If you plan to use it, please credit us. 
- **P.S.**: All data is FAKE!

## Today
- Use **linear regression**📈 and **decision trees**🌲 to learn linear relationships and to classify
- Learn about fully-connected **neural networks** 🧠 using Python 🐍 and [PyTorch](https://pytorch.org/).
- Train your first neural net with **gradient descent** 🧑‍🎓

Let's get started! 🚀

# Instructions

This is a Jupyter notebook. It contains cells. There are 2 kinds of cells - markdown and Python. Markdown cells are like this one, and are just there to give you information. Python cells run code. You can run a cell with `Ctrl + Enter` or run a cell and move to the next one with `Shift + Enter`. Try running the cell below.

In [None]:
print('Ctrl + Enter runs this cell!')
output = 'The last line of a cell is printed by default'
output

There are points to stop and think, indicated by **Think🤔**. Please stop, think, maybe write an answer, and discuss with those around you. There are places to write code, indicated by `...` in a Python cell. You should fill these in or nothing will work! If you have any questions, just ask one of the EdinburghAI people :)

Good luck!

# What is Machine Learning? 🤖

Suppose you're in charge of the Google internship hiring team. You've been tasked with creating an automated system that decides who to give internships to. You have access to the applicants' grades and their CVs. Think - how would you do it? Maybe you could write a function that assigns some score to their average grade, and adds on some extra points if they were part of programming club. But what number? And how much should you add on?
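
To make the hand-written approach concrete, here is a minimal sketch of such a scoring function. The function name, the `0.5` multiplier and the `10` bonus points are all made up for illustration, which is exactly the problem: there's no principled reason for picking them.

```python
def hand_written_score(average_grade, in_programming_club):
    """Score a candidate using hand-picked (arbitrary!) numbers."""
    score = 0.5 * average_grade   # why 0.5? no good reason
    if in_programming_club:
        score += 10               # why 10? also no good reason
    return score

club_member_score = hand_written_score(72, True)
non_member_score = hand_written_score(80, False)
print(club_member_score, non_member_score)
```

Change either number and a different candidate comes out on top.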

Machine learning provides a way for the machine **to find this function by itself from data**. The machine chooses the function by analysing previous successful and unsuccessful intern hires, and deciding what was most important in those decisions.

**Think🤔**: Is this a good system for deciding intern hires?

This is all super high-level and intuitive, so let's get building to see it in action. There are loads of ways machines can learn from data. First up, we're going to cover two methods called Linear Regression and Decision Trees.

# Linear Regression #

Linear regression is a fancy way of saying that you want to draw a straight line. Remember in school science class when you plotted your data and drew your line of best fit? The machine can draw this straight line for you. If you want to see how, you can google 'Ordinary Least Squares Regression' - but we'll skip the details here. 

**Think🤔**: How would you design an algorithm that draws a straight line through some points?

This isn't very advanced, but it is Machine Learning. Understanding what's going on here is crucial to understanding what's actually happening with more advanced models. You give the machine some data points and a rough idea what the function should look like, and the machine decides on the detail. This is fundamentally the same as any ML technique. Let's see it in action.
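
If you're curious what 'fitting' a line involves, here is a minimal sketch of the least-squares formulas for a single input. The numbers are made up for illustration, and this is just one way to compute the line, not necessarily how `sklearn` does it internally.

```python
import numpy as np

# Made-up points that lie roughly on y = 2x + 1 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Ordinary least squares with one input:
# slope = sum of (x - mean x) * (y - mean y), divided by sum of (x - mean x) squared
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# The best-fit line always passes through the point (mean x, mean y)
intercept = y.mean() - slope * x.mean()

print(f'y = {slope:.2f}x + {intercept:.2f}')
```

The machine picks the slope and intercept that make the squared vertical distances between the points and the line as small as possible.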

In [None]:
# First we load some (FAKE) data in from a csv file using a package called pandas
import pandas as pd

# Load the data from the csv file and display the first few rows
bigcheese_data = pd.read_csv('./data/bigcheese.csv')
bigcheese_data.head()

In [None]:
# Next we can plot the data using matplotlib
import matplotlib.pyplot as plt
plt.scatter(data=bigcheese_data, x='Units of alcohol per week', y='Big cheese attendances per year')
plt.xlabel('Units of alcohol per week')
plt.ylabel('Big cheese attendances per year')
plt.title('Big cheese attendances per year vs units of alcohol per week')
plt.show()

Now we can fit a linear regression model to the data. We can use the `LinearRegression` class from the [sklearn library](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

In [None]:
from sklearn.linear_model import LinearRegression

# We initialise a linear regression model
linear_model = LinearRegression()

# We can 'fit' the model to the data. This means that we are finding the best line that cuts through the middle of the data.
linear_model.fit(bigcheese_data[['Units of alcohol per week']], bigcheese_data['Big cheese attendances per year'])

# And finally we can plot the data and the line that the model has found
plt.scatter(data=bigcheese_data, x='Units of alcohol per week', y='Big cheese attendances per year')
plt.plot(bigcheese_data['Units of alcohol per week'], linear_model.predict(bigcheese_data[['Units of alcohol per week']]), color='red')
plt.xlabel('Units of alcohol per week')
plt.ylabel('Big cheese attendances per year')
plt.title('Big cheese attendances per year vs units of alcohol per week')
plt.show()

In [None]:
# We can look at our straight line equation using the coef_[0] and intercept_ attributes of linear_model. Fill in the blanks below!
m = ...
c = ...
print(f'y = {m}x + {c}')

**Think and Discuss:** Why do you think `linear_model.coef_` is a list of numbers rather than a single number?

*Hint: Imagine you also had information on students' average bedtimes on a Saturday night and wanted to use this in your model.*
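
To explore the hint, here is a small sketch with two inputs. The data is made up (built from `y = 3a + 2b + 1` with no noise), and `X_toy`, `y_toy` and `toy_model` are names invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with two input columns, a and b (made-up numbers)
X_toy = np.array([[1, 0], [0, 1], [1, 1], [2, 1], [1, 3]])
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 1] + 1

toy_model = LinearRegression().fit(X_toy, y_toy)
print(toy_model.coef_)       # one slope per input column
print(toy_model.intercept_)  # a single number
```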

Now let's predict how many big cheeses per year someone attends from their alcohol consumption using `linear_model.predict()`.
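
One thing that often trips people up: `predict` expects a table of inputs (rows and columns), even for a single value, and it returns an array of predictions. Here is a sketch on made-up data, where the `hours` and `score` columns are invented for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data that follows score = 2 * hours + 10 exactly
toy = pd.DataFrame({'hours': [1, 2, 3, 4], 'score': [12, 14, 16, 18]})
toy_model = LinearRegression().fit(toy[['hours']], toy['score'])

# A one-row table with the same column name as the training data
new_input = pd.DataFrame({'hours': [5]})
prediction = toy_model.predict(new_input)

print(prediction)     # an array with one entry
print(prediction[0])  # the number itself
```

Indexing with `[0]` pulls the single number out of the array.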

In [None]:
input_alcohol_per_week = 10

# We can use our model to make predictions. Fill in the blanks below!
prediction_big_cheeses_per_year = ...  

print(f'Predicted big cheeses per year: {round(prediction_big_cheeses_per_year, 2)}')

Try messing around with the prediction and answer the following questions with those around you.

What happens if you input 0 units per week? What happens if you input 50 units per week? 

**Think🤔**: Do these make sense? How many big cheeses are there per year? How might you correct your model to be more realistic?

## Extension 💯

Perhaps you think this is boring 😴 because we can only generate straight lines. To see why linear regression is actually more powerful than you think, try the next exercise.

In [None]:
# Let's get some new data. Fill in the blanks below!
new_data = ...

# And plot it with our linear model fitted to the new data
linear_model.fit(...)

plt.scatter(data=new_data, x='Units of alcohol per week', y='Big cheese attendances per year')
plt.plot(new_data['Units of alcohol per week'], linear_model.predict(new_data[['Units of alcohol per week']]), color='red')
plt.xlabel('Units of alcohol per week')
plt.ylabel('Big cheese attendances per year')
plt.title('Big cheese attendances per year vs units of alcohol per week')
plt.show()

This doesn't look as good as last time. **Think🤔**: How could you quantify this? How would you measure how 'good' the line is? 

[*Hint*](https://en.wikipedia.org/wiki/Root_mean_square_deviation)
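
As a sketch of the idea behind the hint, the root mean squared error (RMSE) can be computed in a couple of lines. The true values and predictions below are made up.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # made-up real values
y_pred = np.array([2.0, 5.0, 8.0, 9.0])  # made-up predictions

# Square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```

A smaller RMSE means the line sits closer to the points on average.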

Maybe you can spot that this looks more like a quadratic relationship. Do you think you'd be able to use the same linear model to fit this relationship? 

*Hint: The answer is yes, you can...*

Try it below!

In [None]:
import numpy as np

# We can add a quadratic term to our model by creating a new column in our data that is the square of the 'Units of alcohol per week' column
new_data['Units of alcohol per week squared'] = ...

# And then we can fit a new model to the data with the quadratic term included
linear_model.fit(new_data[['Units of alcohol per week', 'Units of alcohol per week squared']], new_data['Big cheese attendances per year'])

# And plot the data with the new model
# Don't worry too much about the code below, it's just to make the plot look nicer
plt.scatter(data=new_data, x='Units of alcohol per week', y='Big cheese attendances per year')
x_smooth = np.linspace(new_data['Units of alcohol per week'].min(), new_data['Units of alcohol per week'].max(), 100)
x_smooth_squared = x_smooth ** 2
x_smooth_data = np.column_stack([x_smooth, x_smooth_squared])
y_smooth = linear_model.predict(x_smooth_data)
plt.plot(x_smooth, y_smooth, color='red', label='Quadratic fit')
plt.xlabel('Units of alcohol per week')
plt.ylabel('Big cheese attendances per year')
plt.title('Big cheese attendances per year vs units of alcohol per week')
plt.show()

That looks a bit better! 

Can you see that by adding terms like `x**2` or `x**3`, using several different inputs, and even including interactions between them (`x*z`), we can actually do a lot with linear regression?
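
Here is a sketch of that idea with two inputs and an interaction term. The data is randomly generated from a formula chosen for this example (`y = 2x² - 3xz + 4`), and all the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data from y = 2*x**2 - 3*x*z + 4 (no noise)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.uniform(0, 5, 50), 'z': rng.uniform(0, 5, 50)})
toy['y'] = 2 * toy['x'] ** 2 - 3 * toy['x'] * toy['z'] + 4

# Extra columns: a squared term and an interaction term
toy['x squared'] = toy['x'] ** 2
toy['x times z'] = toy['x'] * toy['z']

features = ['x', 'z', 'x squared', 'x times z']
toy_model = LinearRegression().fit(toy[features], toy['y'])
print(dict(zip(features, toy_model.coef_.round(2))))
print(round(toy_model.intercept_, 2))
```

The model is still 'linear' because it only ever adds up columns multiplied by numbers. The curviness lives in the columns we hand it.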

# Decision Trees #

We're going to take the Google internship hiring example for this one.

If you program a system by hand, you might make rules like: 'The grade must be above 70%'. Or 'If their grade is high enough, they must also have been part of programming club'. You could implement this as a bunch of if statements. But why 70%? And should it be programming club, or should you look at if they've done previous internships? This is where Decision Trees come in. They make these decisions for you.

At a high level, a decision tree looks at the data and finds the splits that most neatly divide the people into those who got an internship and those who didn't. If you want to understand more, [here's a 4 minute video explaining more.](https://www.youtube.com/watch?v=JcI5E2Ng6r4)
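
To give a feel for what 'neatly divide' means, here is a rough sketch that tries every possible grade threshold and scores each one by how mixed the two resulting groups are. The helper functions `gini` and `best_threshold` and the data are made up for this example. `sklearn`'s real implementation is more elaborate.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a group of 0/1 labels: 0 means everyone has the same label."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Try each midpoint between sorted values; return (threshold, weighted impurity)."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best = (None, float('inf'))
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # no threshold fits between two equal values
        threshold = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

grades = np.array([55, 62, 68, 71, 80, 90])  # made-up average grades
hired = np.array([0, 0, 0, 1, 1, 1])         # 1 = hired, 0 = not hired
threshold, impurity = best_threshold(grades, hired)
print(threshold, impurity)
```

A deeper tree repeats the same search inside each of the two groups.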

Let's load some data and have a go.

In [None]:
# Load some data
data = ...

# You can inspect the data by calling the head() method on the data
...

Before running our model, let's introduce the idea of *training* and *testing*. Your model is like a student. You can give it exercises to practice, but you also want to know how good it is. So, you can give it a test with similar questions to what it's previously seen, but not exactly the same because you don't want it just to memorise.

To do this, we first split our data into train and test. We then train our model using the train data, and test our model using the unseen test data. We can then report how good our model was on the test data.

Let's split our data using `sklearn`'s `train_test_split`, with a ratio of 20% testing data. This number is arbitrary, but generally we test using between 10 and 40% of the data depending on how much data is available and other factors.

We'll also introduce the convention of using `X` as input to the model, and `y` as the output.
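
Here is a sketch of what a split looks like on made-up data. `test_size` sets the fraction held back for testing, and `random_state` makes the shuffle repeatable. The arrays and variable names are invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 rows, 2 input columns
y_toy = np.array([0, 1] * 5)          # 10 labels

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # 8 rows for training, 2 for testing
```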

**Think🤔**: What is the type of `X` and `y`? How big are they?

In [None]:
from sklearn.model_selection import train_test_split

X = data[['Average grade']]
y = data['Hired']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(...)

In [None]:
from sklearn.tree import DecisionTreeClassifier

# We can create a decision tree model
decision_tree = DecisionTreeClassifier(max_depth=1)

# And fit it to the training data. Fill in the blanks below!
...

# We can visualise the decision tree using the plot_tree function
from sklearn.tree import plot_tree
plt.figure(figsize=(20, 15))
plot_tree(decision_tree, filled=True, feature_names=['Average grade'], class_names=['Not', 'Hired'])
plt.show()

**Think🤔**: How do you interpret the information above? What does the `Average grade <= ...` mean? And `samples=...`? And `class = Not`? You can ignore the `gini` values (google 'gini impurity' if you're interested).

You can read `sklearn`'s guide to understanding the decision tree structure [here.](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py)

Try setting the `max_depth` parameter to 2 instead of 1 and re-run. What changed? Do you think this is reasonable?

How good is our tree? Let's measure its accuracy on the test data!

In [None]:
# We can use our model to make predictions on the test data. Use the predict method on the model to fill in the blank.
predictions = ...

# We can calculate the accuracy of our model
from sklearn.metrics import accuracy_score
accuracy = ...
print(f'Accuracy: {accuracy}')

Wow! Over 99% accuracy! 

**Think🤔**: Is this actually impressive?

*Hint: What would happen if the model always guessed 'No Hire'? Could you write some code below to test what accuracy this would give? Do you even need to, given your decision tree above?*

Let's use a different metric called *recall* which measures how many of the hires it actually detects. A recall of 1 means it successfully detected all hires, and 0 means none.
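
As a sketch of how the two metrics differ, here they are computed by hand on made-up labels and compared with `sklearn`'s versions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])  # 3 real hires out of 10
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # model only spots 1 of them

# Recall: of the real hires, what fraction did the model detect?
true_positives = np.sum((y_true == 1) & (y_pred == 1))
false_negatives = np.sum((y_true == 1) & (y_pred == 0))
manual_recall = true_positives / (true_positives + false_negatives)

# Accuracy: what fraction of all predictions were correct?
manual_accuracy = np.mean(y_true == y_pred)

print(manual_accuracy, accuracy_score(y_true, y_pred))
print(manual_recall, recall_score(y_true, y_pred))
```

The accuracy looks respectable even though the model misses two of the three hires.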


In [None]:
from sklearn.metrics import recall_score

# calculate recall
recall = ...
print(f'Recall: {recall}')

Oh dear! Perhaps this isn't surprising given our tree above.

Much of the problem here is a lack of data. We only have one column! Let's load some new data and try again.

In [None]:
data = pd.read_csv('./data/googleinternship_big.csv')
data.head()

It's always a good idea to try to understand the data better first. Let's look at some job hiring stats.

In [None]:
# Number of rows, number of hires, and job offer rate
total_rows = ...
total_hires = ...
job_offer_rate = ...

total_rows, total_hires, f'{job_offer_rate}%'

Now try to build a model yourself that does better. We're going to use both accuracy and recall as the main metrics for grading your model here (**Think🤔**: Why can't we just use recall?).

You can adjust the `max_depth` of your tree. This is what we call a *hyperparameter* of a model. It is not a parameter because it is not something the machine learns itself. Instead, it is something that you, as the machine learning engineer, decide on to guide the machine learning algorithm.
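
One common way to choose a hyperparameter is simply to try a few values and compare. Here is a sketch on synthetic data generated by `sklearn`'s `make_classification` (not the internship data). The depths tried and the variable names are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data: roughly 90% of labels are 0 and 10% are 1
X_toy, y_toy = make_classification(n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

results = {}
for depth in [1, 3, 5]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    preds = tree.predict(X_te)
    results[depth] = (accuracy_score(y_te, preds), recall_score(y_te, preds))
    print(f'max_depth={depth}: accuracy={results[depth][0]:.2f}, recall={results[depth][1]:.2f}')
```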

In [None]:
# Fill in the blanks to create our input X and output y.
# Hint: you can use the .drop('column', axis=1) method on a Pandas dataframe to remove columns from the data rather than selecting all the columns.
X = ...
y = ...

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = ...

# We can create a decision tree model
decision_tree = ...

# And fit it to the data
...

# We can visualise the decision tree using the export_text function this time to see which features are being used
from sklearn.tree import export_text
print(export_text(decision_tree, feature_names=list(X.columns)))

# Make predictions and calculate recall and accuracy
predictions = ...
recall = ...
accuracy = ...
print(f'Recall: {recall}')
print(f'Accuracy: {accuracy}')

Hopefully that looks a bit better.

**Think🤔**: What were the most important features in deciding whether to hire or not? Do you like this machine learning system? Why, or why not? How would you change it to be 'better'? What does 'better' mean to you?

These sorts of questions are what you need to be asking yourself every time you're building a model, and it only gets harder when the models get more complicated.

Decision trees are incredibly powerful. What you've seen here is the most basic version. A single tree can be made deeper (a larger `max_depth`), and many trees can be combined into an *ensemble*. If you train lots of trees on random resamples of the data and let each tree 'vote' on the overall outcome, this is called *bagging*. If each tree also only considers a random subset of the features at each split, you get a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). If you instead add trees one at a time, with each new tree trying to correct the mistakes of the ones before it, this is called *boosting*. One of the most powerful ML techniques that isn't a neural network is [XGBoost](https://xgboost.readthedocs.io/en/stable/), which is just a bunch of fancy boosted decision trees, where each one tries to make up for the others' weaknesses. [Here's a 4 minute video that explains how it works.](https://www.youtube.com/watch?v=TyvYZ26alZs) In many cases, XGBoost works better than neural networks, especially when you have tabular data.
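
If you'd like to try combining trees, here is a minimal sketch on synthetic data. It uses `sklearn`'s own `RandomForestClassifier` and `GradientBoostingClassifier` (swapped in for XGBoost here so there is nothing extra to install). The data and settings are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, not the internship data
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Many trees trained independently, each 'voting' on the answer
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
# Trees added one at a time, each correcting the previous ones' mistakes
boosted = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

forest_accuracy = forest.score(X_te, y_te)
boosted_accuracy = boosted.score(X_te, y_te)
print(f'Random forest accuracy: {forest_accuracy:.2f}')
print(f'Gradient boosting accuracy: {boosted_accuracy:.2f}')
```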



## Well Done!

That concludes our introduction to ML! Hope you had fun! Next up, we're going to look at neural networks, which are the foundation of recent advances in AI.