# 3C: Traits of Fictional Characters 

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# This code will make sure the middle rows/columns don't get cut out (ellipsized) when you 
# print out a really large data frame (note: you can adjust the values for max rows/cols)
options(repr.matrix.max.rows=900, repr.matrix.max.cols=100)

# Load the data frame
characters <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQk_n4m-VBCD7CtcpB1kOsiNDLrPmEOEtlOoaKwDhogE_YeGEW5PYTaOtZaqypEgHRFGWsZ0pdYvt_A/pub?gid=0&single=true&output=csv")

## Favorite Fictional Characters

<img src="https://i.postimg.cc/s34kFsZr/xcd-03-B-fictional-chars.png" alt="A collage of the faces of various fictional characters" width = 30%>

Many popular fictional universes have their own unique cast of characters, with a wide range of personality types. Some characters are known for being quite likeable, while others are known for being unlikeable, or even despised.

Who are some of your favorite fictional characters (i.e., from any of your favorite books, movies, TV shows, or games)? What is it about those characters that you like?

Who are some unlikeable characters? What makes them unlikeable?

Today we'll explore and explain the variation in character likeability.

### Motivating Question: What traits make a character more likeable?

### The Dataset

**Description:** The `characters` data frame contains characters from various fictional universes. More than [3 million volunteers from the internet](https://openpsychometrics.org/tests/characters/) rated these characters on various traits using a sliding scale. For example, the image below shows the character Mushu (from Disney's Mulan) being rated on a scale from 0 (rude) to 100 (respectful).

<img src="https://i.postimg.cc/tXVg4SjZ/rating-characters.png" alt="example of how people rated a character with a slider" width = 40%>

##### Variable Descriptions

- `char_id` The character ID.
- `char_name` The character's name.	
- `uni_id` The universe ID for the book, game, movie, or TV show.
- `uni_name` The universe name of the book, game, movie, or TV show.
- `gender` The gender of the character (M=Male, F=Female, NB=NonBinary).
- `abstract` The average rating of how abstract (vs concrete) the character is on a scale of 0-100 (0-concrete, 100-abstract).
- `agreeable` The average rating of how agreeable (vs stubborn) the character is on a scale of 0-100 (0-stubborn, 100-agreeable).	
- `anxious` The average rating of how anxious (vs calm) the character is on a scale of 0-100 (0-calm, 100-anxious).
- `attractive` The average rating of how attractive (vs repulsive) the character is on a scale of 0-100 (0-repulsive, 100-attractive).	
- `beautiful` The average rating of how beautiful (vs ugly) the character is on a scale of 0-100 (0-ugly, 100-beautiful).	
- `chaotic` The average rating of how chaotic (vs orderly) the character is on a scale of 0-100 (0-orderly, 100-chaotic).
- `chill` The average rating of how chill (vs offended) the character is on a scale of 0-100 (0-offended, 100-chill).	
- `cool` The average rating of how cool (vs dorky) the character is on a scale of 0-100 (0-dorky, 100-cool).	
- `decisive` The average rating of how decisive (vs hesitant) the character is on a scale of 0-100 (0-hesitant, 100-decisive).	
- `emotional` The average rating of how emotional (vs unemotional) the character is on a scale of 0-100 (0-unemotional, 100-emotional).	
- `extrovert` The average rating of how extroverted (vs introverted) the character is on a scale of 0-100 (0-introvert, 100-extrovert).	
- `feminine` The average rating of how feminine (vs masculine) the character is on a scale of 0-100 (0-masculine, 100-feminine).	
- `future_focused` The average rating of how future-focused (vs present-focused) the character is on a scale of 0-100 (0-present-focused, 100-future-focused).	
- `loveable` The average rating of how loveable (vs punchable) the character is on a scale of 0-100 (0-punchable, 100-loveable).
- `messy` The average rating of how messy (vs neat) the character is on a scale of 0-100 (0-neat, 100-messy).		
- `moody` The average rating of how moody (vs stable) the character is on a scale of 0-100 (0-stable, 100-moody).		
- `open_minded` The average rating of how open-minded (vs close-minded) the character is on a scale of 0-100 (0-close-minded, 100-open-minded).
- `reasoned` The average rating of how reasoned (vs instinctual) the character is on a scale of 0-100 (0-instinctual, 100-reasoned).
- `respectful` The average rating of how respectful (vs rude) the character is on a scale of 0-100 (0-rude, 100-respectful).
- `self_assured` The average rating of how self-assured (vs self-conscious) the character is on a scale of 0-100 (0-self-conscious, 100-self-assured).
- `self_disciplined` The average rating of how self-disciplined (vs disorganized) the character is on a scale of 0-100 (0-disorganized, 100-self-disciplined).	
- `tall` The average rating of how tall (vs short) the character is on a scale of 0-100 (0-short, 100-tall).	
- `trusting` The average rating of how trusting (vs suspicious) the character is on a scale of 0-100 (0-suspicious, 100-trusting).


##### Data Source: 

Originally collected at [Open Psychometrics](https://openpsychometrics.org/tests/characters/) and made available by Tanya Shapiro as a [Tidy Tuesday data set](https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-08-16).

### 1.0 - Explore the Data Frame

#### Free Play

**1.1:** Let's just start by getting familiar with the data we will be working with. Run the code below to get a glimpse at the data frame. Then, take a minute to freely explore the data. Look at anything interesting to you, or that you are curious about, or anything you think we might want to know about the data frame, the cases, or the variables before we start modeling anything.

In [None]:
# Check out the data
head(characters)

**1.2:** What questions do you have about this data set? 

#### Measuring Likeability

**1.3--Discussion:** We are interested in modeling what predicts how likeable a character is, but there isn't a variable called "likeable." Which variables in the data frame might be a good measure for this trait?

**1.4:** One variable we might consider as a measure of a character's likeability is the variable `loveable`. Let's pursue that as our outcome variable together first, then afterwards, you can pursue some models using any of the other variables you are interested in as well.

So, before we develop any complex models, let's start by getting a little bit of information about `loveable`.

Take a look at the visualization and empty model for `loveable` below. Describe the distribution and interpret the empty model.
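As a reminder of what the empty model estimates, here is a minimal sketch using a small made-up data frame (`toy` and its values are hypothetical, not taken from `characters`). It illustrates that the single parameter estimate of the empty model, $b_0$, is the mean of the outcome variable.

```r
# Hypothetical ratings for six made-up characters (NOT from `characters`)
toy <- data.frame(loveable = c(55, 40, 62, 48, 70, 58))

# The empty model has one parameter estimate, b0
toy_empty <- lm(loveable ~ NULL, data = toy)
b0 <- unname(coef(toy_empty)[1])

# b0 should match the mean of the outcome variable
b0
mean(toy$loveable)
```

The same comparison can be made in the notebook by checking the empty model output against `mean(characters$loveable)`.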

In [None]:
# Run this code and interpret the output

empty_model <- lm(loveable ~ NULL, data = characters)
empty_model

gf_histogram(~loveable, data = characters) %>%
    gf_model(empty_model) %>%
    gf_boxplot(width = 5)

**1.5--Discussion:** Take a look at the 50 most/least `loveable` characters. Does anything stand out about the two groups of characters? Do the characters in each group have anything in common? As you look over the data, try to come up with a few theories about what might explain variation in `loveable`.

In [None]:
# The 50 most loveable characters
head(arrange(characters, desc(loveable)), 50)

# The 50 least loveable characters
head(arrange(characters, loveable), 50)

### 2.0 - Explaining Variation in `loveable`

**2.1:** We are going to compare a few models to see which one(s) might do a better job helping us predict (i.e., explain variation in) `loveable`. 

We have developed a few theories and put them into word equations below:

> 1. **loveable = tall + other stuff**
> 2. **loveable = moody + other stuff**
> 3. **loveable = open_minded + other stuff**

For each word equation above, indicate whether you predict there will be:

- a positive relationship (as x goes up, y (loveable) goes up)
- a negative relationship (as x goes up, y (loveable) goes down)
- some other relationship
- no relationship

#### Explore the Distribution

**2.2:** We've set you up with some code to look at these theories in a visualization. Describe what kind of patterns you see. Does one model appear to explain more variation than the others?

In [None]:
# loveable = tall + other stuff
gf_point(loveable ~ tall, data = characters)

# loveable = moody + other stuff
gf_point(loveable ~ moody, data = characters)

# loveable = open_minded + other stuff
gf_point(loveable ~ open_minded, data = characters)

#### Fit and Interpret the Models

**2.3:** We've also set you up with some code to fit these models. Put them into the GLM notation we have started below, and interpret the parameter estimates. Then, engage in the discussion questions below.

**GLM Notation:**

> - $loveable_i = b_0 + b_1(tall_i) + e_i$
> - $loveable_i = b_0 + b_1(moody_i) + e_i$
> - $loveable_i = b_0 + b_1(open\_minded_i) + e_i$

***Discussion Questions:***

Compare the $b_1$ estimates of the three models. Why are some negative? Which model has the smallest (or largest) $b_1$ (in absolute value)? What does this suggest?

Also, compare the $b_0$ estimates of the three models. Are they similar? Different? Why is that?
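One way to make these comparisons easier is to line up the estimates side by side. The sketch below does this for a small made-up data frame (`toy` and all of its values are hypothetical); the same pattern could be applied to `characters`. Keep in mind that $b_0$ is the predicted outcome when the explanatory variable equals 0, so it can differ from model to model, while the sign of $b_1$ shows the direction of the relationship.

```r
# Hypothetical ratings for six made-up characters (NOT from `characters`)
toy <- data.frame(
  tall        = c(20, 35, 50, 60, 75, 90),
  moody       = c(80, 65, 30, 70, 25, 45),
  open_minded = c(50, 35, 70, 45, 85, 60),
  loveable    = c(55, 40, 62, 48, 70, 58)
)

predictors <- c("tall", "moody", "open_minded")

# Fit loveable ~ <predictor> for each predictor and collect b0 and b1
estimates <- sapply(predictors, function(x) {
  model <- lm(reformulate(x, response = "loveable"), data = toy)
  unname(coef(model))
})
rownames(estimates) <- c("b0", "b1")

# One column per model, one row per parameter estimate
estimates
```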

In [None]:
# loveable = tall + other stuff
tall_model <- lm(loveable ~ tall, data = characters)
tall_model

# loveable = moody + other stuff
moody_model <- lm(loveable ~ moody, data = characters)
moody_model

# loveable = open_minded + other stuff
open_minded_model <- lm(loveable ~ open_minded, data = characters)
open_minded_model

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**2.4:**  Add the models to the visualizations and connect the parameter estimates to the model in the graph.

</div>
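To see how the two parameter estimates map onto the line, here is a minimal base R sketch on made-up data (`toy` and its values are hypothetical). It draws the fitted line directly from $b_0$ and $b_1$: $b_0$ is the height of the line where the explanatory variable is 0, and $b_1$ is how much the line rises (or falls) for each one-point increase.

```r
# Hypothetical ratings for six made-up characters (NOT from `characters`)
toy <- data.frame(
  tall     = c(20, 35, 50, 60, 75, 90),
  loveable = c(55, 40, 62, 48, 70, 58)
)

toy_model <- lm(loveable ~ tall, data = toy)
b0 <- unname(coef(toy_model)["(Intercept)"])
b1 <- unname(coef(toy_model)["tall"])

# Scatterplot with the x-axis starting at 0 so the intercept is visible
plot(loveable ~ tall, data = toy, xlim = c(0, 100), ylim = c(0, 100))
abline(a = b0, b = b1)    # the model: a line with intercept b0 and slope b1
points(0, b0, pch = 19)   # b0: height of the line where tall = 0
```

In the notebook itself, chaining `gf_model()` onto a `gf_point()` plot (the same way the histogram cell uses `gf_model(empty_model)`) should overlay the corresponding model on the scatterplot.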

#### Make Predictions with the Models

**2.5:** Make some predictions with the models. For example, what does the model predict for a character who is rated as very tall (e.g., a rating of 75)? How about a character who is rated not very tall (e.g., a rating of 25)?
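A minimal sketch of two equivalent ways to get a prediction, using made-up data (`toy` and its values are hypothetical): plugging a value into $b_0 + b_1(X_i)$ by hand, or passing new values to base R's `predict()`.

```r
# Hypothetical ratings for six made-up characters (NOT from `characters`)
toy <- data.frame(
  tall     = c(20, 35, 50, 60, 75, 90),
  loveable = c(55, 40, 62, 48, 70, 58)
)
toy_model <- lm(loveable ~ tall, data = toy)

# Option 1: plug the values into b0 + b1 * X by hand
b <- unname(coef(toy_model))
by_hand <- b[1] + b[2] * c(25, 75)

# Option 2: let predict() do the arithmetic
new_chars <- data.frame(tall = c(25, 75))
from_predict <- unname(predict(toy_model, newdata = new_chars))

by_hand
from_predict
```

With the notebook's models, `toy_model` would be replaced by `tall_model` (or another saved model), and the column name in `new_chars` must match the model's explanatory variable.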

**2.6:** Take a look at the character selected below (or filter for another character). How far off is the model prediction for that character?
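How far off a prediction is for one case is that case's residual: the observed value minus the predicted value. The sketch below shows the calculation on made-up data (`toy`, its character names, and its values are all hypothetical).

```r
# Hypothetical ratings for six made-up characters (NOT from `characters`)
toy <- data.frame(
  char_name = paste("Character", c("A", "B", "C", "D", "E", "F")),
  tall      = c(20, 35, 50, 60, 75, 90),
  loveable  = c(55, 40, 62, 48, 70, 58)
)
toy_model <- lm(loveable ~ tall, data = toy)

# Select one character's row, then compare observed and predicted values
one_char  <- toy[toy$char_name == "Character C", ]
predicted <- unname(predict(toy_model, newdata = one_char))
residual  <- one_char$loveable - predicted

predicted
residual   # positive: more loveable than predicted; negative: less
```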

In [None]:
# Select the row for a particular character (e.g., Harry Potter)
characters[characters$char_name == "Harry Potter", ]

#### Evaluate the Models

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**2.7:** How much variation in `loveable` does each model explain? Are any of the models much better than the empty model? And, if so, which model explains the most variation in `loveable`? Use statistics to support your answer.

</div>
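One statistic that can support an answer here is PRE (proportional reduction in error): the share of the empty model's error that a more complex model eliminates. The sketch below computes it by hand on made-up data (`toy` and its values are hypothetical) so that each step is visible.

```r
# Hypothetical ratings for six made-up characters (NOT from `characters`)
toy <- data.frame(
  tall     = c(20, 35, 50, 60, 75, 90),
  loveable = c(55, 40, 62, 48, 70, 58)
)

toy_empty <- lm(loveable ~ NULL, data = toy)
toy_model <- lm(loveable ~ tall, data = toy)

ss_total <- sum(resid(toy_empty)^2)   # error left over from the empty model
ss_error <- sum(resid(toy_model)^2)   # error left over from the tall model
ss_model <- ss_total - ss_error       # error reduced by adding tall

pre <- ss_model / ss_total
pre

# For a model like this one, PRE equals R-squared
summary(toy_model)$r.squared
```

The `supernova()` function, loaded with `coursekata`, reports these sums of squares and PRE in a single table, which is likely the quicker route for the three saved models.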

In [None]:
# Here are the saved model names to get you started

# loveable = tall + other stuff
# tall_model

# loveable = moody + other stuff
# moody_model

# loveable = open_minded + other stuff
# open_minded_model

### 3.0 - Make Your Own Models

#### Explore the Distribution

**3.1:** Take a look through the data and select a character trait that you are interested in as an outcome variable. 

Create a visualization to explore the distribution, and fit the empty model.


**3.2:** Come up with two different theories about the data generating process (DGP) for that variable, and write them as word equations (i.e., pick two explanatory variables).


Make some predictions about what you might expect to find, then create visualizations to explore your hypotheses. Describe what you see. Does one model appear to explain more variation than the other?

#### Fit and Interpret the Models

**3.3:** Fit your models, put them into GLM notation ($Y_i = b_0 + b_1(X_i) + e_i$), and interpret the parameter estimates.
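As a worked illustration with made-up numbers (these estimates are hypothetical, not output from `characters`): suppose `lm()` reported an intercept of 40 and a slope of 0.3. The fitted model, and its prediction for a case with $X_i = 50$, would then be:

$$Y_i = 40 + 0.3(X_i) + e_i$$

$$\hat{Y}_i = 40 + 0.3(50) = 55$$

Here $b_0 = 40$ is the predicted outcome when $X_i = 0$, and $b_1 = 0.3$ is the predicted change in the outcome for each one-point increase in $X_i$.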

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**3.4:** Add the models to your visualizations, and connect the parameter estimates to the models in the graphs.

</div>

#### Make Predictions with the Models

**3.5:** Make some predictions with the models. For example, what does the model predict for a character who is low in that trait versus a character who is high in that trait?

#### Evaluate the Models

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**3.6:** How much variation in your outcome variable does each model explain? Are any of the models much better than the empty model? And, if so, which model explains the most variation in your outcome variable? Use statistics to support your answer.

</div>

### 4.0 - BONUS: Explore A Universe

**4.1:** Try filtering the data for a particular fictional universe that you are interested in (preferably one with at least 10 characters).
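A minimal sketch of the filtering step, using base R indexing on made-up data (`toy`, the universe names, and all values are hypothetical). The same pattern should work on `characters` with a real value of `uni_name`.

```r
# Hypothetical data: eight made-up characters in two made-up universes
toy <- data.frame(
  char_name = paste("Character", 1:8),
  uni_name  = c(rep("Universe A", 5), rep("Universe B", 3)),
  tall      = c(20, 35, 50, 60, 75, 90, 40, 65),
  loveable  = c(55, 40, 62, 48, 70, 58, 66, 45)
)

# How many characters does each universe have?
table(toy$uni_name)

# Keep only the rows for one universe and save them in a new data frame
universe_a <- toy[toy$uni_name == "Universe A", ]
nrow(universe_a)
```

Saving the subset under its own name leaves the full data frame untouched, so both remain available for comparison.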

In [None]:
## Just in case: Check which universes have the most/fewest characters
# sort(tally(~uni_name, data = characters))


**4.2:** How do the trends for that universe compare to the trends you found when looking at all the universes? Are the trends for those characters similar to the trends for the broader set of characters, or are they quite different? Why do you think that is?
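
One way to compare trends is to fit the same model to the full data and to the single-universe subset, then put the estimates side by side. This sketch uses made-up data (`toy`, the universe names, and all values are hypothetical).

```r
# Hypothetical data: eight made-up characters in two made-up universes
toy <- data.frame(
  uni_name = c(rep("Universe A", 5), rep("Universe B", 3)),
  tall     = c(20, 35, 50, 60, 75, 90, 40, 65),
  loveable = c(55, 40, 62, 48, 70, 58, 66, 45)
)
universe_a <- toy[toy$uni_name == "Universe A", ]

# Same model, two data sets
all_model <- lm(loveable ~ tall, data = toy)
uni_model <- lm(loveable ~ tall, data = universe_a)

# Rows: which data the model was fit to; columns: b0 and b1
comparison <- rbind(all = coef(all_model), universe_a = coef(uni_model))
colnames(comparison) <- c("b0", "b1")
comparison
```

Estimates from a single universe rest on far fewer characters, so they can shift a lot from one universe to the next even when the overall trend is similar.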