# 2B: Predicting Births

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

<img src="https://i.postimg.cc/jTgVzWLd/X2-Maternity-Ward.jpg" alt="Several newborn babies sleeping in a maternity ward" width="70%">

## Staffing the Maternity Ward

Hospitals use complex staffing models to make sure they have the appropriate number of doctors, nurses, and support staff on call for mothers going into labor on any particular day. If hospitals were able to predict the exact number of births at their facilities on future days, they could improve efficiency and reduce instances of staff shortage. Can we use data to predict the number of births on a certain day, at a typical hospital in the US?

### Motivating Question: How many babies will be born tomorrow?

### The Dataset
##### Description
The dataset shows the number of births per day at a typical, large hospital in the US for every day from 2000 to 2014. Note: Due to privacy law, data from a specific hospital is unavailable. Instead, we took data on the total number of births in the US on these dates and applied a formula to project how many births occurred at a typical, large hospital on each date. **The notebook can be completed as if this data is from a specific hospital.**

##### Variables
- `year`: Year (2000-2014)
- `month`: Month, where 1 = January and 12 = December
- `date_of_month`: Day number of month
- `day_of_week`: Day of week, where 1 = Monday and 7 = Sunday
- `births`: Number of births


##### Data Sources 
 - Full birth data (total births across the US) was compiled by FiveThirtyEight, originally from the Social Security Administration (https://github.com/fivethirtyeight/data/tree/master/births)
 - CourseKata applied a formula to the total births to project the number of births at a typical, large hospital on each of the same dates. The result is the dataset used in this notebook.


### 1.0 - Exploring the `baby_days` Dataset

**1.1 -** Run the following code block to download the dataset. The dataset shows the number of births per day at a typical, large hospital in the US.

In [None]:
# Download the dataset
baby_days <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQFPYUVf36moVxfg7Amdds86UlNTxo7ISj9h1LAhfac-09J3K9HiHPNUsTP8vy8VSa5npeqhBT8SY_a/pub?output=csv")
head(baby_days)

**1.2 -** What years are represented in this dataset? How many total days are represented?

**1.3 -** (If applicable) Find the number of people born at this hospital on the day that **YOU** were born. Or, look up a date you are interested in between 2000-2014.

In [None]:
# Try modifying the code below to match your birthday
filter(baby_days, year == 2002 & month == 8 & date_of_month == 18)

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

**1.4 -** Create three separate visualizations that show the number of births by:

 - `year`
 - `month`
 - `day_of_week`

Comment on any trends you notice.

 </div>
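
As a starting point for 1.4, here is a minimal sketch of one grouped visualization plus the matching group means. It uses a hypothetical data frame, `toy_days`, filled with made-up values (not taken from `baby_days`), and base R graphics so that it runs without extra packages. The plotting functions loaded by `coursekata` are an alternative.

```r
# Hypothetical toy data (made-up values, NOT from baby_days)
toy_days <- data.frame(
  day_of_week = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7),
  births      = c(52, 58, 57, 56, 55, 38, 35, 50, 60, 59, 58, 53, 40, 37)
)

# One boxplot per day of the week (base R graphics)
boxplot(births ~ day_of_week, data = toy_days,
        xlab = "day_of_week (1 = Monday, 7 = Sunday)", ylab = "births")

# Mean births for each day of the week, to read alongside the plot
day_means <- tapply(toy_days$births, toy_days$day_of_week, mean)
day_means
```

Swapping `toy_days` for `baby_days`, and `day_of_week` for `year` or `month`, would give the other two views.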

### 2.0 - Making predictions with the empty model

**2.1 -** If you were asked to predict the number of births on a randomly selected day from this dataset, what number of births would you predict? 

What if we used the empty model to predict the number of births? What is good about that prediction?

**2.2 -** Fit an empty model for predicting the number of births per day and store it as `empty_model`. Then write your fitted model in GLM notation, replacing $b_0$ with the estimate from your model.

Here is some markdown to get you started:

$Y_i = b_0 + e_i$
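
One way an empty model can be fit in R is with `lm()` and a formula that has no explanatory variable. The sketch below assumes a hypothetical vector of made-up values in `toy_days` (not from `baby_days`) and an illustrative model name, `toy_empty_model`.

```r
# Hypothetical toy data (made-up values, NOT from baby_days)
toy_days <- data.frame(births = c(52, 58, 57, 56, 55, 38, 35))

# Empty model: no explanatory variable, only an intercept b0
toy_empty_model <- lm(births ~ NULL, data = toy_days)

# The single estimate b0 equals the mean of the outcome
b0 <- unname(coef(toy_empty_model))
b0
mean(toy_days$births)
```

The same pattern should carry over to `baby_days`, with the result stored as `empty_model`.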

In [None]:
# Empty model


**2.3 -** The code block below draws a random day from the dataset and shows the number of births on that day. Imagine you made a prediction for the number of births on that day using your empty model. Spoiler alert -- your prediction will probably be wrong.

What kind of error would this prediction make? What is the practical consequence of making this type of error?

In [None]:
# Number of births on a random day
set.seed(123) # set seed to have consistent results
random_day <- sample(baby_days, 1) # select random row from dataset 

random_day

**2.4 -** Find all the predictions and store them back into the data frame. Do the same thing with the residuals. 

(You might want to just call these new variables `predictions` and `residuals`.)
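
A minimal sketch of this step, assuming a hypothetical data frame `toy_days` with made-up values and an empty model fit with `lm()`: base R's `predict()` and `resid()` return one value per row, which can be assigned to new columns.

```r
# Hypothetical toy data (made-up values, NOT from baby_days)
toy_days <- data.frame(births = c(52, 58, 57, 56, 55, 38, 35))
toy_empty_model <- lm(births ~ NULL, data = toy_days)

# Store the model's prediction and residual for every row
toy_days$predictions <- predict(toy_empty_model)
toy_days$residuals   <- resid(toy_empty_model)

# Each observed value is its prediction plus its residual
head(toy_days)
```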

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

**2.5 -** If we sum up all the residuals, what would you expect to get? 

Now try it. Why do you get this result? Does this result justify using the mean to predict the number of births on a random day? Why or why not?

</div>

**2.6 -** Although the residuals may be balanced, they may also be large. Find the Sum of Squares (SS). Is it easy to interpret what this number means? Why or why not?

**2.7 -** A more interpretable measure of the "typical" error amount is the standard deviation. Calculate and interpret the standard deviation of births.
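
The link between the two measures is $s = \sqrt{SS / (n - 1)}$. The sketch below works through that arithmetic on a hypothetical vector of made-up values (`toy_births`, not from `baby_days`) and compares the result with R's built-in `sd()`.

```r
# Hypothetical toy data (made-up values, NOT from baby_days)
toy_births <- c(52, 58, 57, 56, 55, 38, 35)

# Residuals from the empty model are deviations from the mean
toy_residuals <- toy_births - mean(toy_births)

# Sum of Squares: add up the squared residuals
ss <- sum(toy_residuals^2)

# Variance divides SS by n - 1; standard deviation is its square root
n <- length(toy_births)
std_dev <- sqrt(ss / (n - 1))

std_dev
sd(toy_births)  # built-in function, should match std_dev
```

Because the standard deviation is on the same scale as the outcome (births per day), it is usually easier to interpret than SS.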

### 3.0 - Making Conditional Predictions

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>


**3.1 -** Which dates had the largest residuals (either positive or negative)? What do you notice about these dates? Why is our model so wrong on these days?

</div>
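
One possible way to find the rows with the largest residuals is to sort by the absolute value of the residual. This sketch uses a hypothetical data frame `toy_days` with made-up dates and values (not from `baby_days`).

```r
# Hypothetical toy data (made-up values, NOT from baby_days)
toy_days <- data.frame(
  month         = c(3, 3, 3, 3, 3),
  date_of_month = c(1, 2, 3, 4, 5),
  births        = c(58, 55, 40, 30, 45)
)

# Residuals from the empty model (observed minus mean)
toy_days$residuals <- toy_days$births - mean(toy_days$births)

# Sort rows from largest to smallest residual, ignoring sign
by_size <- toy_days[order(abs(toy_days$residuals), decreasing = TRUE), ]
head(by_size, 3)
```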

**3.2 -** Earlier, we noticed that the number of births on weekends tended to differ from the number on weekdays. Let's explore this further. Make a new variable that says whether the day is a weekend or not.

**3.3 -** Now visualize and find the mean number of births per day, for weekends and for weekdays. What do you notice?
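
A minimal sketch of steps 3.2 and 3.3 together, assuming the `day_of_week` coding described above (6 = Saturday, 7 = Sunday) and a hypothetical data frame `toy_days` with made-up values. The column name `weekend` is an illustrative choice.

```r
# Hypothetical toy data (made-up values, NOT from baby_days)
toy_days <- data.frame(
  day_of_week = c(1, 2, 3, 4, 5, 6, 7),
  births      = c(52, 58, 57, 56, 55, 38, 35)
)

# TRUE for Saturday (6) and Sunday (7), FALSE otherwise
toy_days$weekend <- toy_days$day_of_week %in% c(6, 7)

# Mean births for weekdays (FALSE) and weekends (TRUE)
weekend_means <- tapply(toy_days$births, toy_days$weekend, mean)
weekend_means

# Side-by-side boxplots of the two groups
boxplot(births ~ weekend, data = toy_days)
```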

**3.4 -** Run the following code, which creates a new version of the dataset (`Dec_12_2014`) that includes all the dates up to (but not including) Saturday, December 13th, 2014.

In [None]:
Dec_12_2014 <- baby_days[1:5460,]
head(Dec_12_2014)

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

#### Class Competition

**3.5 -** Using only the data in `Dec_12_2014`, predict the number of births on Saturday, December 13th, 2014. This is the day after the last date in `Dec_12_2014`. Note: Your prediction **must** be based on the mean of some subset (or the full set) of data points in `Dec_12_2014`. Whoever has the smallest error in magnitude wins the class competition! No cheating.

</div>

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

**3.6 -** The instructor will now reveal the true number of births on Saturday, December 13th, 2014. Reflect on the error from your prediction. What direction was the error? How large was the error in magnitude? Why do you think this occurred?

</div>