Exercise 5 - Logistic Regression
===

Simple logistic regression predicts binary (yes/no) events. For example, we may want to predict if someone will arrive at work on time, or if a person shopping will buy a product. 

This exercise will demonstrate simple logistic regression: predicting an outcome from only one feature.
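
To make the idea concrete, here is a minimal sketch of the logistic (sigmoid) function, which logistic regression uses to turn any number into a probability between 0 and 1. The `intercept` and `slope` values are made up for illustration; they are not estimated from the football data used later.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration
intercept <- -4
slope <- 2

# A few example feature values (for instance, average goals per match)
x <- c(0, 1, 2, 3)

# Probability of a "yes" outcome for each feature value
probability <- logistic(intercept + slope * x)
print(round(probability, 3))
```

Base R already provides this function as `plogis`, so `plogis(intercept + slope * x)` should give the same values.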

#### Run the code below to prepare the necessary libraries for this exercise.

In [None]:
# Run this!

suppressMessages(install.packages("tidyverse"))
suppressMessages(library("tidyverse"))
suppressMessages(library("glmnet"))

Step 1
---

We want to place a bet on the outcome of the next football (soccer) match. It is the final of a competition, so there will not be a draw. We have historical data about our favourite team playing in matches such as this. 

Complete the exercise below to see the structure of this data.

### In the cell below replace:
#### 1. `<dataPreviewStr>` with `str`
#### 2. `<dataPreviewHead>` with `head`
#### then __run the code__.

In [None]:
team_stats <- read.delim("Data/football data.txt")

###
# REPLACE <dataPreviewStr> WITH str AND <dataPreviewHead> WITH head
###
<dataPreviewStr>(team_stats)
<dataPreviewHead>(team_stats)
###

summary(team_stats$average_goals_per_match)

The `team_stats` data shows the average goals per match of our team for the season in the first column, and whether the team won the competition in the second column. The `won_competition` variable is a binary outcome, where 1 represents a win, and 0 represents a loss.

Step 2
---

Let's graph the data so we have a better idea of what's going on. 

Complete the exercise below to make a scatter plot of `team_stats`. Replace the x variable with the name of the feature we want to plot on the x-axis.

#### In the cell below replace `<xData>` with `average_goals_per_match`

In [None]:
team_stats  %>% 
###
# REPLACE <xData> WITH average_goals_per_match
###
ggplot(aes(x = <xData>, y = as.factor(won_competition), colour = as.factor(won_competition))) +
###

geom_jitter() +
ggtitle("Game statistics for favourite football team") +
xlab("Average number of goals scored per match") +
ylab("Competition win") +
# Align title to centre
theme(plot.title = element_text(hjust = 0.5), legend.position = "none")

In the plot above, we have used ggplot2's `geom_jitter` function, which adds a small amount of random variation to the location of each point. Since we have binary outcomes in this dataset, using this function allows us to handle overplotting.

> If you want to test this for yourself, change the `geom_jitter` call in the code block above to `geom_point`; it is harder to decipher which points overlap using the latter function.
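
The effect of jittering can also be checked numerically. The sketch below uses base R's `jitter` function as a stand-in for `geom_jitter` (the two are separate functions that apply the same idea), with a small made-up dataset rather than the real `team_stats`.

```r
# Hypothetical binary outcomes and a feature with repeated values
outcome <- c(0, 0, 0, 1, 1, 1, 1, 1)
goals <- c(1.0, 1.0, 1.5, 2.0, 2.0, 2.0, 2.5, 2.5)

# Points with identical coordinates are drawn exactly on top of each other
overlapping <- sum(duplicated(data.frame(goals, outcome)))

# Fix the random seed so the jittered positions are reproducible
set.seed(1)

# Add a small amount of uniform random noise to both coordinates
jittered <- data.frame(
  goals = jitter(goals, amount = 0.05),
  outcome = jitter(outcome, amount = 0.05)
)
still_overlapping <- sum(duplicated(jittered))

print(c(before = overlapping, after = still_overlapping))
```

Jittering only changes where points are drawn; the underlying data and any model fitted to it stay the same.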

We can see that, in general, when our team has a high average number of goals per match (x-axis), they tend to win the competition.

Step 3
---

How can we predict whether the team will win this season? Let's apply AI to this problem by making a logistic regression model using this data and then graphing it.

We will use the function `glm`, which fits generalized linear models. To specify that we want a logistic regression model, we set the `family` argument to `binomial` with a logit link.

We'll use the standard R format for the formula, which is `labels ~ features` (if you see a `.` this means it will select all features in the dataset).
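
As a rough illustration of the formula syntax, the sketch below fits two models to a made-up data frame. The names `toy`, `label`, `feature_a` and `feature_b` are hypothetical and the data is random noise, so only the structure of the formulas matters here.

```r
# Hypothetical data frame with one label column and two feature columns
set.seed(42)
toy <- data.frame(
  label = rbinom(50, size = 1, prob = 0.5),
  feature_a = rnorm(50),
  feature_b = rnorm(50)
)

# Name the single feature explicitly: labels ~ features
single <- glm(label ~ feature_a, family = binomial(link = "logit"), data = toy)

# `.` expands to every column in `data` other than the label
all_features <- glm(label ~ ., family = binomial(link = "logit"), data = toy)

print(names(coef(single)))        # intercept plus feature_a
print(names(coef(all_features)))  # intercept plus feature_a and feature_b
```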

### In the cell below replace:
#### 1. `<formula>` with `won_competition ~ average_goals_per_match`
#### 2. `<dataset>` with `team_stats`
#### then __run the code__.

In [None]:
###
# REPLACE <formula> WITH won_competition ~ average_goals_per_match AND <dataset> WITH team_stats
###
glm_team <- glm(formula = <formula>, family = binomial(link = "logit"), 
                data = <dataset>)
###
summary(glm_team)

# And we'll quickly print out some predictions to make sure it's working
head(predict(glm_team, newdata = team_stats, type = "response"))

Alright, that's the model done. Now run the code below to graph it.

In [None]:
# Run this!

# Plot using ggplot2
team_stats %>%
ggplot(aes(x = average_goals_per_match, y = won_competition)) +
geom_point(aes(colour = as.factor(won_competition)), alpha = 0.5, size = 3) +
geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"), 
            colour = "black") +
ggtitle("Binomial logistic regression model for football team competition win") +
xlab("Average number of goals scored per match") +
ylab("Competition win") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none") + 
scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1), labels = c("0", "", "", "", "1"))

We now have a binomial logistic regression model fitted to our data. The black line represents our model.
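
A curve like this is the logistic function applied to `intercept + slope * x`, using the coefficients estimated by `glm`. The sketch below checks this on simulated data: `toy_stats` and `toy_model` are hypothetical stand-ins for `team_stats` and `glm_team`, and the numbers used to simulate the data are assumptions, not values from the football dataset.

```r
# Simulated stand-in for team_stats (not the real football data)
set.seed(123)
toy_stats <- data.frame(average_goals_per_match = runif(100, min = 0, max = 3))
toy_stats$won_competition <- rbinom(
  100, size = 1,
  prob = plogis(-3 + 2 * toy_stats$average_goals_per_match)
)

toy_model <- glm(won_competition ~ average_goals_per_match,
                 family = binomial(link = "logit"), data = toy_stats)

# Rebuild the fitted curve by hand from the estimated coefficients
b <- coef(toy_model)
manual <- plogis(b[["(Intercept)"]] +
                 b[["average_goals_per_match"]] * toy_stats$average_goals_per_match)

# Compare with the probabilities returned by predict()
from_predict <- predict(toy_model, newdata = toy_stats, type = "response")
print(all.equal(unname(from_predict), manual))
```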

Step 4
------

We can read the model above like so:
* Take the average number of goals per match for the current year. Let's say it is 2.5.
* Find 2.5 on the x-axis. 
* What value (on the y axis) does the line have at x = 2.5?
* If this value is above 0.5, then the model predicts that our team will win this year. If it is less than 0.5, it predicts that our team will lose.

Because this line is just a mathematical function (equation) we don't have to do this visually.
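
For example, the 0.5 threshold corresponds to a single point on the x-axis: the probability is exactly 0.5 where `intercept + slope * x` equals zero, that is, at `x = -intercept / slope`. The sketch below works this out on simulated data. As before, `toy_stats` and `toy_model` are hypothetical stand-ins for the real data and model, so the numbers it prints say nothing about our team.

```r
# Simulated stand-in for team_stats (not the real football data)
set.seed(123)
toy_stats <- data.frame(average_goals_per_match = runif(100, min = 0, max = 3))
toy_stats$won_competition <- rbinom(
  100, size = 1,
  prob = plogis(-3 + 2 * toy_stats$average_goals_per_match)
)
toy_model <- glm(won_competition ~ average_goals_per_match,
                 family = binomial(link = "logit"), data = toy_stats)

# The x value at which the predicted probability is exactly 0.5
b <- coef(toy_model)
boundary <- -b[["(Intercept)"]] / b[["average_goals_per_match"]]
p_at_boundary <- predict(
  toy_model,
  newdata = data.frame(average_goals_per_match = boundary),
  type = "response"
)

# Apply the 0.5 rule to a season average of 2.5 goals per match
p <- predict(
  toy_model,
  newdata = data.frame(average_goals_per_match = 2.5),
  type = "response"
)
predicted_class <- ifelse(p > 0.5, "win", "lose")

print(boundary)
print(unname(predicted_class))
```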

In the exercise below, choose the average number of goals per match you want to evaluate.

The code will calculate the probability that our team will win the competition given your chosen average.

#### Replace `<numberOfGoals>` with a number between 0 and 3, then run the code.

In [None]:
###
# REPLACE <numberOfGoals> WITH A NUMBER BETWEEN 0 AND 3
###
goals <- <numberOfGoals>
###

# Create data frame for input to predict function
mean_goals <- data.frame(average_goals_per_match = c(goals))

# Run predict function based on input to goals
mean_goals$prediction <- predict(object = glm_team, newdata = mean_goals, type = "response")

# View result
mean_goals

# Print out the result to screen
paste0("The probability of our team winning this year is ", round(mean_goals$prediction * 100, digits = 4), "%")

Now let's plot our chosen number of goals in the context of our model using ggplot2:

In [None]:
# Run this!

team_stats %>% 
ggplot(aes(x = average_goals_per_match, y = won_competition)) +
geom_point(aes(colour = as.factor(won_competition)), alpha = 0.5, size = 3) +
geom_point(data = mean_goals, aes(x = average_goals_per_match, y = prediction), size = 5, colour = "black",
           shape = "cross") +
geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"), 
            colour = "black") +
ggtitle("Binomial logistic regression model for football team competition win") +
xlab("Average number of goals scored per match") +
ylab("Competition win") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
geom_hline(yintercept = mean_goals$prediction, linetype = "dotted") +
geom_vline(xintercept = mean_goals$average_goals_per_match, linetype = "dotted")

Conclusion
-----

Well done! We have estimated the probability that our team will win this year's competition.

You can go back to the course now and click __'Next Step'__.