Exercise 2 - Simple Linear Regression
===

In Exercise 1, we used R within Jupyter Notebooks to load information about chocolate bars, and stored it in a variable named `choc_data`. We checked the structure of `choc_data`, and explored some of the variables we have about chocolate bars using graphs.

In this exercise, we want to know how to make our chocolate-bar customers happier. To do this, we need to know whether chocolate bar _features_ can predict customer happiness. For example, customers may be happier when chocolate bars are bigger, or when they contain more cocoa.

We have data on customer happiness when eating chocolate bars with different features. Let's explore the relationship between customer happiness and the different features we have available.

Step 1
---

First, we need to load the required libraries and data we will use in this exercise.

Below, we'll also use the functions `str`, `head`, and `tail` to inspect the structure of `choc_data`.

**In the cell below replace:**

**1. `<structureFunction>` with `str`**

**2. `<headFunction>` with `head`**

**3. `<tailFunction>` with `tail`**

**then __run the code__.**

In [None]:
# Load `ggplot2` library for graphing capabilities
library(ggplot2)

# Load the chocolate data and save it to the variable name `choc_data`
choc_data <- read.delim("Data/chocolate data.txt")

###
# REPLACE <structureFunction> <headFunction> <tailFunction> WITH str, head, and tail
###

# Check the structure of `choc_data` using `str(choc_data)`
<structureFunction>(choc_data)

# Inspect the start of the data by typing `head(choc_data)`
<headFunction>(choc_data)

# Inspect the end of the data by typing `tail(choc_data)`
<tailFunction>(choc_data)

Our object `choc_data` contains 100 different chocolate bar observations for 5 variables: weight, cocoa percent, sugar percent, milk percent, and customer happiness.
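
As a quick cross-check of that description, base R's `dim` and `names` report the number of rows and columns and the column names. The sketch below uses a small made-up data frame (`toy_choc` and its values are hypothetical) so that it runs without the data file; with the real data you would pass `choc_data` instead.

```r
# Hypothetical stand-in for `choc_data`: same column names, made-up values
toy_choc <- data.frame(
  weight             = c(185, 247, 133),
  cocoa_percent      = c(65, 44, 33),
  sugar_percent      = c(11, 34, 21),
  milk_percent       = c(24, 22, 47),
  customer_happiness = c(47, 55, 35)
)

# Number of rows (observations) and columns (variables)
dim(toy_choc)

# Column (variable) names
names(toy_choc)
```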

Step 2
---


We want to know which chocolate bar features make customers happy.

The example below shows a linear regression between __cocoa percentage__ and __customer happiness__. 

**Run the code below to visualise this. You do not need to edit the code block below, just run it.**

In [None]:
# Run this box

# DO NOT EDIT THIS CODE

# Create our own function to fit a linear regression model, then graph the result
# `x` and `y` are column names passed as strings, e.g. "customer_happiness"
lin_reg_choc <- function(x, y, my_data){
    
    # Extract the two columns by name
    x_arg <- my_data[[x]]
    y_arg <- my_data[[y]]
    
    # Perform linear regression using the `lm` (stands for linear models) function
    lm_choc <- lm(formula = y_arg ~ x_arg)
    
    # Create scatter plot of `my_data` together with the linear model
    # `.data[[x]]` looks up a column by its string name (replaces the deprecated `aes_string`)
    ggplot(data = my_data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_point() +
    # Add line based on linear model
    geom_abline(intercept = lm_choc$coefficients[1], 
                slope = lm_choc$coefficients[2],
                colour = "red") +
    # x-axis label is fixed: every call in this exercise passes customer happiness as `x`
    xlab("Customer happiness") +
    # y-axis label; use `gsub` function to replace underscores with spaces in the column name
    ylab(gsub("_", " ", y)) +
    # graph title
    ggtitle(paste("Customer satisfaction with chocolate bars given", gsub("_", " ", y))) +
    # centre the title
    theme(plot.title = element_text(hjust = 0.5))

}

# This performs the linear regression steps listed above
lin_reg_choc(x = "customer_happiness", y = "cocoa_percent", my_data = choc_data)

In the scatter plot above, each point represents an observation for a single chocolate bar.

It seems that __a higher percentage of cocoa increases customer happiness__. We think this because our linear model (red line) slopes upwards: bars with a higher cocoa percentage (y-axis) tend to have higher customer happiness (x-axis).
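
The direction of the red line can also be read off numerically: the second coefficient returned by `lm` is the slope, and a positive value means the line rises from left to right. A minimal sketch, assuming simulated data (the data frame `sim` and its values are invented for illustration and are not the chocolate data):

```r
set.seed(1)  # make the simulated noise reproducible

# Simulated stand-in data: cocoa_percent rises with customer_happiness by construction
sim <- data.frame(customer_happiness = seq(10, 100, by = 10))
sim$cocoa_percent <- 20 + 0.5 * sim$customer_happiness + rnorm(10, sd = 2)

# Same orientation as the plot above: y = cocoa_percent, x = customer_happiness
fit <- lm(cocoa_percent ~ customer_happiness, data = sim)

# Named vector holding the intercept and the slope
coef(fit)

# TRUE when the fitted line slopes upwards
coef(fit)[["customer_happiness"]] > 0
```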

Step 3
---

**In the cells below:**

**1. Replace the text `<addFeatureHere>` with __`"weight"`__ (including the quotation marks) to see if heavier chocolate bars make people happier.**

**2. Also try the variables `"sugar_percent"` and `"milk_percent"` to see if these improve customers' experiences.**

**Remember to run each box when you are ready.**

In [None]:
###
# CHANGE <addFeatureHere> TO "weight" IN THE LINE BELOW (INCLUDING THE QUOTATION MARKS)
###
lin_reg_choc(x = "customer_happiness", y = <addFeatureHere>, my_data = choc_data)
###

In [None]:
###
# CHANGE <addFeatureHere> TO "sugar_percent" IN THE LINE BELOW (INCLUDING THE QUOTATION MARKS)
###
lin_reg_choc(x = "customer_happiness", y = <addFeatureHere>, my_data = choc_data)
###

In [None]:
###
# CHANGE <addFeatureHere> TO "milk_percent" IN THE LINE BELOW (INCLUDING THE QUOTATION MARKS)
###
lin_reg_choc(x = "customer_happiness", y = <addFeatureHere>, my_data = choc_data)
###

It looks like heavier chocolate bars make customers happier, whereas larger amounts of sugar or milk don't seem to make customers happier. 

We can draw this conclusion based on the slope of our linear regression models (red line): 

* Our linear regression model for "weight vs. customer happiness" reveals that as chocolate bar weight increases, customer happiness also increases;
* Our linear regression models for "sugar percent vs. customer happiness" and "milk percent vs. customer happiness" reveal that as the percentage of sugar or milk increases, customer happiness decreases.
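
To compare several features at once, one option is to fit one model per feature and collect the signs of the slopes. The sketch below assumes simulated data that reuses column names from `choc_data`; the data frame `sim_choc` and the helper `slope_for` are hypothetical and not part of the exercise.

```r
set.seed(42)  # make the simulated noise reproducible

# Simulated stand-in data: weight rises and sugar_percent falls with happiness
n <- 50
happiness <- runif(n, 0, 100)
sim_choc <- data.frame(
  customer_happiness = happiness,
  weight             = 100 + 1.0 * happiness + rnorm(n, sd = 5),
  sugar_percent      = 40 - 0.2 * happiness + rnorm(n, sd = 2)
)

# Hypothetical helper: slope of the model `feature ~ customer_happiness`
slope_for <- function(feature, data) {
  fit <- lm(reformulate("customer_happiness", response = feature), data = data)
  coef(fit)[["customer_happiness"]]
}

# One slope per feature, named by feature
slopes <- sapply(c("weight", "sugar_percent"), slope_for, data = sim_choc)

# +1 for an upward-sloping line, -1 for a downward-sloping line
sign(slopes)
```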

> *N.B. It is possible to perform linear regression directly with `ggplot2` using the following function and arguments: `stat_smooth(method = "lm")`. However, we want to show you how to create linear models without depending on `ggplot2`.*
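
For reference, here is a minimal sketch of the `stat_smooth` approach mentioned in the note. It assumes simulated data (the data frame `sim` is invented for illustration); `formula = y ~ x` and `se = FALSE` are optional arguments chosen here to state the model explicitly and hide the confidence band.

```r
library(ggplot2)

set.seed(7)  # make the simulated noise reproducible

# Simulated stand-in data, not the chocolate data
sim <- data.frame(customer_happiness = runif(30, 0, 100))
sim$weight <- 100 + sim$customer_happiness + rnorm(30, sd = 10)

# Scatter plot with a least-squares line fitted and drawn by ggplot2 itself
p <- ggplot(sim, aes(x = customer_happiness, y = weight)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "red")

p
```

This draws the fitted line but does not return the model object, which is one reason to call `lm` yourself when you need the coefficients.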

Conclusion
---
Well done! You have run simple linear regressions which suggest that heavier chocolate bars, and bars with a higher percentage of cocoa, make customers happier.

You can now go back to the course and click __'Next Step'__ to move on to using linear regression with multiple features.