## 5A: Which YouTubers make a living wage?

<img src="https://i.postimg.cc/ty1GkxB8/Skew-the-Script-Logo.png" title="Skew the script logo" width=200 align = "right">

Made in collaboration with [Skew the Script](https://skewthescript.org/).

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

# Adjust scientific notation
options(scipen = 10)

# Pull in data from csv
YouTubers <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRc-xoYTbcG_kCs0oNcR0uDgXn0MWroDw9e-EqLJeFDsRtNqXZJqt7A43IvhrSyjR8uEozsKjIy9EH3/pub?gid=615413346&single=true&output=csv", header=TRUE)
YouTubers$Gender <- factor(YouTubers$Gender)

# Data frame with the outlier removed (keeps videos with at most 2,500,000 views)
YouTubers_no_outlier <- filter(YouTubers, Views <= 2500000)

### <center>Students used to say: "I'm gonna drop out and become a..."</center>

<img src="https://i.postimg.cc/dwybGc5C/9-B-YT-Income-Mus-Ath-Act.png" title="entertainment figures and their 2020 income" width=500>

### <center>Now, students say: "I'm gonna drop out and become a..."</center>

<img src="https://i.postimg.cc/pxRNz6gc/9-B-YT-Income-Influencers.png" title="social media figures and their 2020 income" width=500>

### The Exception or The Rule?

For today, let's think about YouTube content creators.

According to [PolicyAdvice](https://policyadvice.net/insurance/insights/average-american-income/), the average US income is around \$60,000 to \$70,000 per year. According to the [Department of Health and Human Services](https://aspe.hhs.gov/2021-poverty-guidelines), the individual poverty threshold is \$12,880.

**What do you think?** What kind of money do people make when YouTubing is their full-time career? (Do you think most YouTube content creators are above the poverty threshold? Above the average income?)

## 1.0: Getting Data									

**1.1, Discussion:** If we wanted to get some data on YouTube earnings, we would find that YouTube doesn't release data on individual earners. Why do you think that is?


**1.2, Discussion:** We could send out a survey and ask YouTubers what their annual income is. Do you think this would be a good method of collecting data? Why or why not?		

Here is a data set called `YouTubers` that was collected on the first **35 YouTube videos** (out of hundreds) that come up with the search term: "*How much I make on YouTube*"

<img src="https://i.postimg.cc/XNVRxJxq/9-B-YT-Income-Thumbnails.png" title="Top 30 YouTube videos about income from YouTube" width=500>

The data frame contains the following variables:
- `ChannelName` The name of the YouTube channel
- `VideoLink` The link to the channel
- `Income` The annual income reported in the video
- `Gender` The gender of the channel host
- `Subscribers` The number of subscribers the channel has (at time of data collection, May 2021)
- `Views` The number of views the video has (at time of data collection, May 2021)
- `ThumbsUp` The number of thumbs up on the video (at time of data collection, May 2021)	
- `ThumbsDown` The number of thumbs down on the video (at time of data collection, May 2021)
- `Theme` The general theme of the videos on the channel
- `ThemeBusiness` Whether the channel theme has something to do with business or finance
- `Vlogs` Whether the channel is a vlogging channel or not	
- `Comments` The number of comments on the video (at time of data collection, May 2021)	
- `UploadMonth` The month the video was uploaded
- `UploadYear` The year the video was uploaded
- `LengthMins` The length of the video in minutes
- `Verified` Whether or not the channel is verified (with an official check mark near the channel name)	

The income made by these creators was recorded from screenshots of channel revenue pages in the videos:
<img src="https://i.postimg.cc/gGT6WLZ9/9-B-YT-Income-Channel-Analytics.png" title="Thumbnail of a YouTuber's income analytics page" width=400>

**1.3, Discussion:** Take a peek at the data frame `YouTubers` and take a look at some of the annual incomes. Do you think these numbers are more trustworthy than if you had asked for income on a survey? Do these numbers tell you how much these YouTubers earned in 2020?

In [None]:
str(YouTubers)
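
If `str()` feels crowded, a narrower peek may help with 1.3. The sketch below assumes the setup cell at the top has been run (so `YouTubers` exists) and that the column names match the variable list above.

```r
library(coursekata)  # no effect if the setup cell already loaded it

# First six rows, limited to a few columns so the incomes are easy to scan
income_peek <- head(select(YouTubers, ChannelName, Income, Subscribers, Views))
income_peek

# The five channels reporting the highest income
head(arrange(YouTubers, desc(Income)), 5)
```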

## 2.0: Explore Variation in YouTuber Income

**2.1:** Let's explore variation in `Income`. Run the code below. What do you notice about how `Income` is distributed in this sample? 

In [None]:
# Histogram of income (IncomeK, labeled in $1000s on the x-axis) with a boxplot layered on top
gf_histogram(~IncomeK, data = YouTubers) %>%
    gf_boxplot() %>%
    gf_labs(x = "Income (in $1000s)")

**2.2, Discussion:** Are these YouTubers making a "living wage"? Does this data mean that most YouTubers make a living wage?
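
One way to put numbers behind 2.2 is to compare the reported incomes with the poverty threshold quoted earlier. This is only a sketch: it assumes the setup cell has been run, that `Income` is numeric, and that it is recorded in dollars (not thousands). The name `poverty_threshold` is introduced here for illustration.

```r
library(coursekata)  # no effect if the setup cell already loaded it

# Min, quartiles, max, mean, and SD of reported income
favstats(~Income, data = YouTubers)

# Individual poverty threshold quoted earlier in this notebook (dollars per year)
poverty_threshold <- 12880

# Proportion of the sampled YouTubers reporting income above that threshold
prop_above_poverty <- mean(YouTubers$Income > poverty_threshold, na.rm = TRUE)
prop_above_poverty
```

Whatever proportion comes out describes only these sampled videos, which is the point of the second question in 2.2.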

**2.3:** What might explain variation in these YouTubers' income? Come up with a hypothesis and write it as a word equation.

**2.4:** Make a visualization to explore your hypothesis. What are your initial impressions? (Try chaining on `gf_model()`.)
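
As a sketch of what 2.4 could look like, suppose the hypothesis is *Income = Views + other stuff* (the same example used later in this notebook). The code assumes the setup cell has been run; `Views_model` and `views_plot` are names chosen here for illustration, and `YouTubers_no_outlier` can be swapped in for `YouTubers` to see the plot without the outlier.

```r
library(coursekata)  # no effect if the setup cell already loaded it

# Fit the model for the example hypothesis
Views_model <- lm(Income ~ Views, data = YouTubers)

# Scatterplot of income against views, with the fitted model chained on
views_plot <- gf_point(Income ~ Views, data = YouTubers) %>%
    gf_model(Views_model)
views_plot
```

For a categorical explanatory variable such as `Gender`, `gf_jitter()` in place of `gf_point()` is usually easier to read.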

## 3.0: Model Variation

**3.1:** What is the best fitting model for your hypothesis? Answer this question by updating the GLM notation in this cell.

$Income_i = b_0 + b_1X_i + e_i$
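
The values of $b_0$ and $b_1$ come from fitting the model. A minimal sketch, assuming the setup cell has been run and using `Views` as the example explanatory variable:

```r
library(coursekata)  # no effect if the setup cell already loaded it

# Fit the example model: Income explained by Views
Views_model <- lm(Income ~ Views, data = YouTubers)

# Parameter estimates: "(Intercept)" is b0 and "Views" is b1
model_estimates <- coef(Views_model)
model_estimates
```

Replace `Views` with your own explanatory variable, then copy the two estimates into the GLM notation.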

**3.2:** How much error did we reduce by adding an explanatory variable to our model? (e.g., in our example, `Views`)
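
The proportional reduction in error (PRE) can be worked out by hand from the sums of squares of the two models. The sketch below assumes the setup cell has been run and that `Income` and `Views` have no missing values, so both models are fit to the same rows. The names `ss_total`, `ss_error`, and `pre_by_hand` are introduced here for illustration.

```r
library(coursekata)  # no effect if the setup cell already loaded it

# Empty model (mean only) and the example model with Views
empty_model <- lm(Income ~ NULL, data = YouTubers)
Views_model <- lm(Income ~ Views, data = YouTubers)

# Sum of squared residuals left over under each model
ss_total <- sum(resid(empty_model)^2)
ss_error <- sum(resid(Views_model)^2)

# PRE: share of the empty model's error removed by adding Views
pre_by_hand <- (ss_total - ss_error) / ss_total
pre_by_hand

# CourseKata's shortcut, for comparison
pre(Views_model)
```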

## 4.0: Could there be no effect in the true DGP?

Adding an explanatory variable (e.g., `Gender`, `Verified`, or `Subscribers`) reduces error by some amount. But that doesn't mean the more complex model is the right model of the DGP.

Even a DGP where $\beta_1=0$ (the empty model) **could** produce samples of data with non-zero $b_1$s that explain some error! But we can use `shuffle()` to simulate a DGP where $\beta_1=0$ and take a look at what kind of samples get generated.
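
To see what a single shuffled sample looks like, here is a sketch that assumes the setup cell has been run and uses `Views` as the example explanatory variable. `ShuffledIncome` and `shuffled` are names made up for this illustration. The shuffled incomes are stored in a column first so that the points and the line are drawn from the same shuffle.

```r
library(coursekata)  # no effect if the setup cell already loaded it

# Randomly re-pair the incomes with the rows, breaking any real link to Views
shuffled <- mutate(YouTubers, ShuffledIncome = shuffle(Income))

# Plot the shuffled data with its best-fitting line
gf_point(ShuffledIncome ~ Views, data = shuffled) %>%
    gf_lm()

# The slope from this one shuffle is usually not exactly 0
b1(ShuffledIncome ~ Views, data = shuffled)
```

Running the cell again gives a different shuffle and a different slope.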

Today we'll look at $b_1$s and $F$s generated from the empty model. (Even-numbered groups: go to 4.1; odd-numbered groups: go to 4.2)

#### 4.1: Explore shuffled $b_1$s

- Is our sample $b_1$ an "unlikely" sample from `shuffle()`?
- What do you think our p-value will be?
- What do we think about the empty model of the DGP?
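
A sketch of one way to approach 4.1, assuming the setup cell has been run and using `Views` as the example explanatory variable. The names `sample_b1`, `sdob1`, and `p_b1` are chosen here for illustration.

```r
library(coursekata)  # no effect if the setup cell already loaded it

# The b1 from our actual sample
sample_b1 <- b1(Income ~ Views, data = YouTubers)

# 1000 b1s from shuffled data, simulating a DGP where beta_1 = 0
sdob1 <- do(1000) * b1(shuffle(Income) ~ Views, data = YouTubers)

# Sampling distribution of the shuffled b1s
gf_histogram(~b1, data = sdob1)

# Proportion of shuffled b1s at least as far from 0 as the sample b1
p_b1 <- mean(abs(sdob1$b1) >= abs(sample_b1))
p_b1
```

Because the shuffles are random, the proportion changes a little from run to run.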

#### 4.2: Explore shuffled $F$s

- Is our sample F an "unlikely" sample from `shuffle()`?
- What do you think our p-value will be?
- What do we think about the empty model of the DGP?
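
A sketch of one way to approach 4.2, assuming the setup cell has been run and using `Views` as the example explanatory variable. The names `sample_f`, `sdof`, and `p_f` are chosen here for illustration.

```r
library(coursekata)  # no effect if the setup cell already loaded it

# The F ratio from our actual sample
sample_f <- f(Income ~ Views, data = YouTubers)

# 1000 Fs from shuffled data, simulating a DGP where beta_1 = 0
sdof <- do(1000) * f(shuffle(Income) ~ Views, data = YouTubers)

# Sampling distribution of the shuffled Fs
gf_histogram(~f, data = sdof)

# Proportion of shuffled Fs at least as large as the sample F
p_f <- mean(sdof$f >= sample_f)
p_f
```

Only the upper tail is counted here because F cannot be negative, so "more extreme" always means larger.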

#### 4.3: What does the ANOVA table say?

The ANOVA table does not use simulation (you'll learn about the mathematical models it uses on page 10.4).

- According to the ANOVA table, what is the probability of getting a $b_1$ or F more extreme than our sample from the empty model of the DGP (the p-value)?
- Is this an "unlikely" sample to be generated from the empty model of the DGP? 
- So what do we think about the empty model of the DGP?
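
The ANOVA table for the example model can be produced with `supernova()`. As before, this sketch assumes the setup cell has been run and uses `Views` as the example explanatory variable. `views_anova` is a name chosen here for illustration.

```r
library(coursekata)  # no effect if the setup cell already loaded it

# ANOVA table comparing the Views model with the empty model
views_anova <- supernova(lm(Income ~ Views, data = YouTubers))
views_anova
```

The `F`, `PRE`, and `p` columns in the Model row are the ones to compare with the shuffled results from 4.1 and 4.2.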

#### 4.4: What's in common?

Across all these approaches (the sampling distribution of F and $b_1$; using `shuffle()` or the ANOVA table), what do they say about the empty model of the DGP?

What does this mean for your hypothesis?

## 5.0: Generalizing These Results

**5.1:** Do you think your conclusion is true for most YouTubers? Why or why not?

**5.2:** Based on our class's analyses today, would you have any advice for people who want to become content creators/influencers?