<div class="alert alert-block alert-danger">

# 1C: TV Teens (COMPLETE)

*This notebook is intended for students who have completed:*
 
**All of Chapter 1**

</div>

<div class="alert alert-block alert-warning">

#### Summary of Notebook:

In this lesson, students will explore a dataset that contains information on TV teen dramas. They will try to answer questions such as: Are the actors that play these characters actually of high school age? When is the actor's age really different from the character's age?

#### Includes:

- Effectively visualizing both quantitative and categorical data
- Investigating missing values and handling them thoughtfully (i.e., think through alternatives to simply dropping missing cases)
- Using visualizations and summary statistics to make meaningful inferences about datasets

</div>

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# loads data
tv_teens <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQIO_RIn3K6izwdQhvWzo84ALKQ1MP_ZTZ9EzGqqecvejQSpouysYAxDecGBIPdwKs31OoZZmQTR3au/pub?gid=395733557&single=true&output=csv")

tv_teens <- tv_teens %>% 
  mutate(any_love = factor(ifelse(is.na(num_love_interest),"none","1+")),
         character_gender = factor(character_gender))

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 55-65 Mins

</div>

<img src="https://i.postimg.cc/HTGdRdfd/teen-actors.jpg" alt="so-called teenagers from popular shows" width="50%">

## Are those really high schoolers? 

We have seen a renaissance of teen dramas: Euphoria, Sex Education, Riverdale, Elite, Yellowjackets, etc. These popular shows depict high schoolers doing very "adult" things (as the saying goes: "sex, drugs, and rock & roll"... minus the "rock & roll"). 

Today we'll ask: Are the actors that play these characters actually of high school age? When is the actor's age really different from the character's age?

### Motivating Question: How old are the actors who play famous teenage TV characters?

### The Dataset

#### Description

The data frame `tv_teens` shows the ages of TV characters and the ages of the actors who play them, **during the first season** that the show was on the air. The dataset is tidy, with one row per TV character.

#### Variables

- `name`: Name of actor
- `title`: Title of show
- `character_name`: Name of character
- `character_age`: Character's age 
- `character_year`: Character's school grade level (if provided)
- `character_gender`: Gender of character (M - Male, F - Female, NB - Non-binary, TM - Trans Man, TW - Trans Woman)
- `love_interest`: Love interests of character on the show
- `num_love_interest`: Number of love interests 
- `any_love`: "1+" if character has any love interests, "none" if `num_love_interest` is missing (i.e., no love interests were recorded)
- `release_year`: Year when show was released
- `release_date`: Date when show was released
- `actor_age`: Actor age at time of release

##### Data Source 

 - [Amber Thomas (Data World), CC-BY-SA](https://data.world/amberthomas/age-of-characters-and-actors-in-teen-tv-shows/activity)


<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  15-20 mins

</div>

### 1.0 - Exploring the TV Teens Dataset

**1.1 -** As usual, let's start by just exploring the data. Run some code to take a look at the dataset.

In [None]:
# Sample Response
head(tv_teens)

**1.2 -** In this dataset, each row represents a character. How many total characters are included in the dataset?

In [None]:
# Sample Response 1
str(tv_teens)

# Sample Response 2
# Note: dim() is not in the book, but it also works here (the first element is the number of rows)
dim(tv_teens)[1]

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

There are 222 characters in the dataset.

</div>

**1.3 -** The `title` column lists the shows that are included in the dataset. What shows and how many characters from each show are included in `tv_teens`?

In [None]:
# Sample Response
tally(~ title, data = tv_teens)

# Sample Response
tally(tv_teens$title)

# Sample Response
gf_bar(title ~ 1, data = tv_teens)

**1.4 -** Visualize the distribution of the genders of characters in this dataset. Describe the distribution.

In [None]:
# Sample Response
gf_bar(~character_gender, data = tv_teens)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

The majority of characters are identified as female or male (with slightly more female than male). A few characters are identified as non-binary and a few are identified as transgender men. There are far fewer characters with these identities than female and male characters.<br><br>


<b>Instructor Note:</b> 

It's worth noting that there are various ways that gender is encoded into datasets, and they have tradeoffs. For example, here, the dataset author decided to separate cis-gender men from trans men. The benefit is that the data can now be analyzed with respect to the individual portrayals of these two groups in TV shows. The detriment is that, by encoding cis-gender men as simply "Male," the dataset implies that cis-gender men are the "default" male group. A more just way to encode these groups might be "CM" (for cis-gender male) and "TM" (for transgender male).

</div>

**1.5 -** Visualize the distribution of character ages in this dataset. Describe the distribution.

In [None]:
# Sample Response
gf_histogram(~character_age, data = tv_teens)

# Sample Response
gf_dhistogram(~character_age, data = tv_teens)

# Sample Response
gf_histogram(~character_age, data = tv_teens) %>%
    gf_boxplot(fill = "white", width = 3)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

The ages vary between 10 and 25, with a majority between the ages of 16 and 17. The shape of the distribution is fairly symmetric.

</div>

**1.6 -** In the prior question, you may have seen a warning message about missing values when you produced your graph. Are there missing values in the `character_age` column? If so, how many?

In [None]:
# Use tally() to see the number of missing values

# Sample Response
tally(~ character_age, data = tv_teens)
tally(tv_teens$character_age)

favstats(~ character_age, data = tv_teens)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

We have 100 missing values for the `character_age` column.

</div>
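
As a cross-check on the count above, base R can also count missing values directly. The sketch below uses a small made-up vector (`ages` is hypothetical, not the real column); with the real data, the same calls would presumably be applied to `tv_teens$character_age`.

```r
# Hypothetical ages with some missing values (not the real data)
ages <- c(16, NA, 17, 15, NA, 18)

# is.na() flags each missing value; sum() counts the TRUEs
n_missing <- sum(is.na(ages))
n_missing

# The mean of the logical vector gives the proportion that is missing
prop_missing <- mean(is.na(ages))
prop_missing
```

Reporting the proportion alongside the count makes it easier to judge how much data would be lost if the missing rows were dropped.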

<div class="alert alert-block alert-success">

### 2.0 - Approximate Time:  20-25 mins

</div>

### 2.0 - Handling Missing Values

Our key variable of interest (`character_age`) has a lot of missing values. How should we handle this? Let's explore!

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>


<br>

**2.1 -** One solution would be to discard all rows that have a missing `character_age`. Is this an ideal solution? Why or why not?

</div>

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

No. With 100 missing values out of a 222-row dataset, we'd be losing almost half our data! This could lead to unrepresentative analyses later down the line.<br><br>

<b>Instructor Note:</b> 

The most common way that students handle missing data, unless told otherwise, is to delete any dataset rows with missing values and move on. This can lead to biased analyses. Instead, it's best to first encourage students to explore patterns. Ask, "Are there any other values in the dataset that seem to explain why the data might be missing for a particular variable?" This often leads to more productive methods for handling missing data. There are various technical methods for imputing missing data, but we'll focus on cases where thoughtful reasoning leads to good solutions for missing data. Such an example is shown in the following exercises of this notebook.

</div>
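
One way to look for the patterns mentioned in the note above is to cross-tabulate whether a value is missing against whether another variable is recorded. This is only a sketch on made-up data: `toy` and its values are hypothetical, and it assumes blank cells arrive as empty strings (as `read.csv` typically reads them for text columns), so the check covers both `""` and `NA`.

```r
# Hypothetical toy data (not the real tv_teens values)
toy <- data.frame(
  character_age  = c(16, NA, NA, 17, NA, 15),
  character_year = c("", "hs junior", "hs senior", "hs senior", "", "")
)

# A school year counts as "known" if it is neither NA nor an empty string
year_known <- !is.na(toy$character_year) & toy$character_year != ""

# Cross-tabulate: is the age missing? vs. is the school year recorded?
missing_pattern <- table(
  age_missing = is.na(toy$character_age),
  year_known  = year_known
)
missing_pattern
```

If most rows with a missing age have a known school year, that second variable is a candidate for estimating the missing ages rather than dropping the rows.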

**2.2 -** Display the head of the dataset again, but this time include 20 rows. Is there anything else in the dataset that could give us information about character ages?

In [None]:
# Sample Response
head(tv_teens, 20)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

`character_year` is present in cases where `character_age` is missing. We can use years in school to provide an estimate for the missing character ages. 
 
</div>

**2.3 -** Use the information you identified above to fill in reasonable estimates for the missing values in `character_age`. Include this in a new column called `character_age_filled`. 

In [None]:
# if you wish to provide some scaffolding for your students 
# you may consider starting with this

# Show counts for character_year by character_age
tally(character_year ~ character_age, data = tv_teens)

# Let's start with a new column (copy of current character_age column)
# This way we start with the ages of the characters that already have ages
tv_teens$character_age_filled <- tv_teens$character_age

# Here is some code to put 18 for characters that are considered "college freshman"
## 'college freshman' = 18 years old
tv_teens$character_age_filled[tv_teens$character_year == "college freshman"] <- 18 


In [None]:
# Full Sample Response

# Create a new column (copy of current character_age column)
tv_teens$character_age_filled <- tv_teens$character_age

# Designations:
## 'college freshman' = 18 years old
tv_teens$character_age_filled[tv_teens$character_year == "college freshman"] <- 18 

## 'hs senior' = 17 years old
tv_teens$character_age_filled[tv_teens$character_year == "hs senior"] <- 17 

## 'hs junior' = 16 years old
tv_teens$character_age_filled[tv_teens$character_year == "hs junior"] <- 16 

## 'hs sophomore' = 15 years old
tv_teens$character_age_filled[tv_teens$character_year == "hs sophomore"] <- 15 

## 'hs freshman' = 14 years old
tv_teens$character_age_filled[tv_teens$character_year == "hs freshman"] <- 14

## '7th grade' = 12 years old
tv_teens$character_age_filled[tv_teens$character_year == "7th grade"] <- 12

## '6th grade' = 11 years old
tv_teens$character_age_filled[tv_teens$character_year == "6th grade"] <- 11

# Check whether any missing values remain
sum(is.na(tv_teens$character_age_filled))
tally(character_year ~ character_age_filled, data = tv_teens)

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

**2.4 -** Could your method for filling in missing values of `character_age` introduce any bias into your analyses? If so, in what way(s)?

</div>

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

In our sample response, we assumed typical ages for each level of school, opting for the lower age of each grade. For example, high school seniors are typically 17-18 years old, and we filled in 17. Writers of the shows may have intended some of these high school seniors to be 18, or even a higher age. So, our method for filling in missing values could lead to underestimates of character ages.  

</div>
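
To get a feel for how much this choice could matter, one option is to fill the missing ages twice, once with the lower age for each grade and once with the upper age, and compare the results. The sketch below runs on made-up data: `toy`, `lower_age`, `upper_age`, and `fill_ages()` are all hypothetical names introduced for illustration, and the "upper" ages simply assume each grade can be one year older.

```r
# Hypothetical toy data (not the real tv_teens values)
toy <- data.frame(
  character_age  = c(16, NA, NA, 15, NA),
  character_year = c("", "hs senior", "hs junior", "hs sophomore", "hs freshman"),
  actor_age      = c(22, 25, 19, 18, 20)
)

# Lower ages match the sample response; upper ages assume one year older
lower_age <- c("hs freshman" = 14, "hs sophomore" = 15,
               "hs junior" = 16, "hs senior" = 17)
upper_age <- lower_age + 1

# Fill only the missing ages, keeping recorded ages as they are
fill_ages <- function(age, year, lookup) {
  estimate <- unname(lookup[year])  # NA when the year is not in the lookup
  ifelse(is.na(age), estimate, age)
}

filled_low  <- fill_ages(toy$character_age, toy$character_year, lower_age)
filled_high <- fill_ages(toy$character_age, toy$character_year, upper_age)

# Compare the average actor-minus-character difference under each choice
mean_diff_low  <- mean(toy$actor_age - filled_low)
mean_diff_high <- mean(toy$actor_age - filled_high)
c(low = mean_diff_low, high = mean_diff_high)
```

In this toy example, filling with the lower ages gives a larger average age difference than filling with the upper ages, which illustrates the direction of the possible bias: underestimating character ages inflates the apparent gap between actors and characters.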

<div class="alert alert-block alert-success">

### 3.0 - Approximate Time:  20-25 mins

</div>

### 3.0 - Analyzing Age

**3.1 -** Visualize the character ages and visualize the actor ages. What pattern do you notice?

In [None]:
# Sample Response
gf_histogram(~character_age_filled, data = tv_teens) %>%
    gf_histogram(~actor_age, data = tv_teens, fill = "orchid4")

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

Overall, the actor age distribution tends to have larger values than the character age distribution.

</div>

**3.2 -** Create a new variable in the dataset, which shows the difference between actors' ages and the ages of the characters they play (actor - character). Call this variable `age_diff`.

In [None]:
# Sample Response
tv_teens$age_diff <- tv_teens$actor_age - tv_teens$character_age_filled

# Sample Response
tv_teens <- tv_teens %>%
    mutate(age_diff = actor_age - character_age_filled)

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

**3.3 -** Visualize the distribution of `age_diff`. Describe the distribution. Does it provide evidence that actors tend to be older than the characters they play? Why or why not?

</div>

In [None]:
# Sample Response
gf_histogram(~age_diff, data = tv_teens)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

The majority of actors seem to be older than their characters - the vast majority of age differences are above 0.

<b>Instructor Note:</b> 

Make a big deal about interpreting what "above 0" means in this context. As an extension question, it may be helpful to ask students what extra information this visualization provides, compared to the visualization they made for question 3.1.

</div>

**3.4 -** Calculate the following statistics. What does each of these statistics suggest about the relationship between actors' ages and the ages of their characters?

- a. The average age of the actors versus the average age of the characters
- b. The proportion of actors whose ages are higher than their characters' ages
- c. The `favstats` of age difference

In [None]:
# Average actor age versus average character age
## Sample Response
print("average actor_age")
mean(~ actor_age, data = tv_teens)
print("average character_age_filled")
mean(~ character_age_filled, data = tv_teens)

## Sample Response
favstats(~ actor_age, data = tv_teens)
favstats(~ character_age_filled, data = tv_teens)

# Proportion with higher ages
## Sample Response
tally(~ actor_age > character_age_filled, data = tv_teens, format = "proportion")
tally(~ age_diff > 0, data = tv_teens, format = "proportion")
## Sample Response
gf_props(~ age_diff > 0, data = tv_teens)

# favstats of age_diff
favstats(tv_teens$age_diff)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

The vast majority (95%) of actors have higher ages than their characters. In fact, the average difference in age is 5.8, meaning that actors are on average 5.8 years older than their characters. When talking about teenagers, 5.8 is no small amount! The maximum age difference is 16 years - Wow! The minimum is -2, which means at least one actor was younger than the character they played.

</div>
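
The statistics in this response can also be reproduced with base R, which may help show where each number comes from. This is a sketch on a made-up vector (the values in `age_diff` below are hypothetical, not the real column), and it assumes missing differences should simply be ignored.

```r
# Hypothetical age differences (actor age minus character age)
age_diff <- c(5, 8, -2, 0, 3, NA, 12)

# Proportion of known differences above 0 (actor older than character).
# A difference of exactly 0 does not count as "older".
prop_older <- mean(age_diff > 0, na.rm = TRUE)

# Average, minimum, and maximum difference, ignoring missing values
mean_diff  <- mean(age_diff, na.rm = TRUE)
range_diff <- range(age_diff, na.rm = TRUE)

prop_older
mean_diff
range_diff
```

Because `age_diff > 0` is a vector of `TRUE`/`FALSE` values, taking its mean gives the proportion of `TRUE`s, which is the same idea behind tallying the comparison with `format = "proportion"`.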

**3.5 -** Are there any characters played by younger actors? If so, who are these characters/actors?

In [None]:
# Sample Response
arrange(tv_teens, age_diff)

# Sample Response
filter(tv_teens, age_diff < 0)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

There are several actors who play characters older than they are: Sasha Pieterse, Peyton Kennedy, Noah Schnapp, and Allegra Acosta. (The characters are, respectively, Alison DiLaurentis, Kate Messner, Will Byers, and Molly Hayes Hernandez.)

</div>
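
When the full rows are too wide to read comfortably, it can help to keep only the identifying columns and sort by the difference. The sketch below does this in base R on made-up data: `toy` and its actor and character names are hypothetical placeholders, not rows from `tv_teens`.

```r
# Hypothetical toy data (placeholder names, not the real tv_teens rows)
toy <- data.frame(
  name           = c("Actor A", "Actor B", "Actor C", "Actor D"),
  character_name = c("Character A", "Character B", "Character C", "Character D"),
  age_diff       = c(6, -2, 0, -1)
)

# Keep rows where the actor is younger than the character,
# guarding against missing differences, and keep only three columns
younger <- toy[!is.na(toy$age_diff) & toy$age_diff < 0,
               c("name", "character_name", "age_diff")]

# Sort so the largest negative difference comes first
younger <- younger[order(younger$age_diff), ]
younger
```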