In [1]:
options(digits=15)

In [2]:
library(dplyr)
library(tidyr)
library(readr)
set.seed(123)


In [3]:
movies <- read.csv("imdb_movies.csv")
print(nrow(movies))

[1] 10178


# What Kind of Genres are There?

In [4]:
# Keep only movies with a single genre, i.e. drop any row whose genre contains a comma.
# For example, a movie with genre "Horror, Thriller" is excluded;
# we keep only movies whose genre is exactly "Horror" or exactly "Thriller".

filtered_df <- subset(movies, !grepl(",", genre))
# filtered_df

unique(filtered_df$genre)

# but we subtract one, because there is an empty genre, ''.

print(paste('The number of unique genres is' , length(unique(filtered_df$genre))-1))

[1] "The number of unique genres is 18"


Notice that one of the genres is the empty string "". We need to exclude it.

After excluding it, we find 18 unique genres.

However, there's a catch: this approach only finds genres that appear on their own.

We later found another genre, "TV Movie", which **only appears in combination with other genres**. No movie has "TV Movie" as its only genre, which is why we did not discover it here.
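
One way to catch genres such as "TV Movie" that never occur on their own is to split every genre string on commas and collect the distinct tokens. The sketch below runs on a small made-up vector (`genre_strings` is hypothetical, not the IMDB data) and assumes tokens are separated by a comma followed by ordinary whitespace.

```r
# Toy genre strings standing in for movies$genre (hypothetical values)
genre_strings <- c("Horror, Thriller", "Action", "Drama, TV Movie", "", "Action, Drama")

# Split on commas, flatten, trim surrounding whitespace, drop empty strings
tokens <- trimws(unlist(strsplit(genre_strings, ",", fixed = TRUE)))
all_genres <- unique(tokens[tokens != ""])

# Genres that occur as a stand-alone value
single_genres <- unique(genre_strings[!grepl(",", genre_strings) & genre_strings != ""])

# Genres that only ever appear in combination with others
combined_only <- setdiff(all_genres, single_genres)

print(all_genres)
print(combined_only)
```

In this toy input, "TV Movie" ends up in `combined_only` because it never appears as a stand-alone value.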

In [5]:
# Some movies have no genre.
# First, we exclude those.
movies <- movies %>% filter(genre!='')

In [6]:
print(nrow(movies))
# Number of rows went from 10,178 to 10,093.

[1] 10093


In [7]:
head(movies)

Unnamed: 0_level_0,names,date_x,score,genre,overview,crew,orig_title,status,orig_lang,budget_x,revenue,country
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>
1,Creed III,03/02/2023,73,"Drama, Action","After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend and former boxing prodigy, Damien Anderson, resurfaces after serving a long sentence in prison, he is eager to prove that he deserves his shot in the ring. The face-off between former friends is more than just a fight. To settle the score, Adonis must put his future on the line to battle Damien — a fighter who has nothing to lose.","Michael B. Jordan, Adonis Creed, Tessa Thompson, Bianca Taylor, Jonathan Majors, Damien Anderson, Wood Harris, Tony 'Little Duke' Evers, Phylicia Rashād, Mary Anne Creed, Mila Davis-Kent, Amara Creed, Florian Munteanu, Viktor Drago, José Benavidez Jr., Felix Chavez, Selenis Leyva, Laura Chavez",Creed III,Released,English,75000000.0,271616668.0,AU
2,Avatar: The Way of Water,12/15/2022,78,"Science Fiction, Adventure, Action","Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), the trouble that follows them, the lengths they go to keep each other safe, the battles they fight to stay alive, and the tragedies they endure.","Sam Worthington, Jake Sully, Zoe Saldaña, Neytiri, Sigourney Weaver, Kiri / Dr. Grace Augustine, Stephen Lang, Colonel Miles Quaritch, Kate Winslet, Ronal, Cliff Curtis, Tonowari, Joel David Moore, Norm Spellman, CCH Pounder, Mo'at, Edie Falco, General Frances Ardmore",Avatar: The Way of Water,Released,English,460000000.0,2316794914.0,AU
3,The Super Mario Bros. Movie,04/05/2023,76,"Animation, Adventure, Family, Fantasy, Comedy","While working underground to fix a water main, Brooklyn plumbers—and brothers—Mario and Luigi are transported down a mysterious pipe and wander into a magical new world. But when the brothers are separated, Mario embarks on an epic quest to find Luigi.","Chris Pratt, Mario (voice), Anya Taylor-Joy, Princess Peach (voice), Charlie Day, Luigi (voice), Jack Black, Bowser (voice), Keegan-Michael Key, Toad (voice), Seth Rogen, Donkey Kong (voice), Fred Armisen, Cranky Kong (voice), Kevin Michael Richardson, Kamek (voice), Sebastian Maniscalco, Spike (voice)",The Super Mario Bros. Movie,Released,English,100000000.0,724459031.0,AU
4,Mummies,01/05/2023,70,"Animation, Comedy, Family, Adventure, Fantasy","Through a series of unfortunate events, three mummies end up in present-day London and embark on a wacky and hilarious journey in search of an old ring belonging to the Royal Family, stolen by ambitious archaeologist Lord Carnaby.","Óscar Barberán, Thut (voice), Ana Esther Alborg, Nefer (voice), Luis Pérez Reina, Carnaby (voice), María Luisa Solá, Madre (voice), Jaume Solà, Sekhem (voice), José Luis Mediavilla, Ed (voice), José Javier Serrano Rodríguez, Danny (voice), Aleix Estadella, Dennis (voice), María Moscardó, Usi (voice)",Momias,Released,"Spanish, Castilian",12300000.0,34200000.0,AU
5,Supercell,03/17/2023,61,Action,"Good-hearted teenager William always lived in hope of following in his late father’s footsteps and becoming a storm chaser. His father’s legacy has now been turned into a storm-chasing tourist business, managed by the greedy and reckless Zane Rogers, who is now using William as the main attraction to lead a group of unsuspecting adventurers deep into the eye of the most dangerous supercell ever seen.","Skeet Ulrich, Roy Cameron, Anne Heche, Dr Quinn Brody, Daniel Diemer, William Brody, Jordan Kristine Seamón, Harper Hunter, Alec Baldwin, Zane Rogers, Richard Gunn, Bill Brody, Praya Lundberg, Amy, Johnny Wactor, Martin, Anjul Nigam, Ramesh",Supercell,Released,English,77000000.0,340941958.6,US
6,Cocaine Bear,02/23/2023,66,"Thriller, Comedy, Crime","Inspired by a true story, an oddball group of cops, criminals, tourists and teens converge in a Georgia forest where a 500-pound black bear goes on a murderous rampage after unintentionally ingesting cocaine.","Keri Russell, Sari, Alden Ehrenreich, Eddie, O'Shea Jackson Jr., Daveed, Ray Liotta, Syd, Kristofer Hivju, Olaf (Kristoffer), Margo Martindale, Ranger Liz, Christian Convery, Henry, Isiah Whitlock Jr., Bob, Jesse Tyler Ferguson, Peter",Cocaine Bear,Released,English,35000000.0,80000000.0,AU


Next, a movie may have multiple genres. For example, the first movie, Creed III, is listed as **both** Drama and Action. To give each movie a single genre, we randomly pick one of its listed genres.

In [8]:
movies$genre <- sapply(strsplit(movies$genre, ','), function(x) sample(x, 1))

nrow(movies)

In [9]:
sort(unique(movies$genre))

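As you can see, some genres appear more than once. Splitting on "," alone leaves the whitespace after each comma attached, so a value like " Action" is treated as distinct from "Action".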
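
A small illustration of the effect on a single hypothetical string. It assumes the separator is a comma followed by a plain space; other whitespace characters, such as a non-breaking space, would not be removed by `trimws()` with its default settings.

```r
# Hypothetical genre string with a comma-plus-space separator
x <- "Drama, Action"

# Splitting on "," keeps the leading space on the second piece
parts <- strsplit(x, ",", fixed = TRUE)[[1]]
print(parts)

# The untrimmed piece does not compare equal to the clean genre name
print(parts == "Action")

# Trimming restores the clean names
clean_parts <- trimws(parts)
print(clean_parts)
```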

In [10]:
sort(unique(filtered_df$genre))

In [11]:
unique_genres <- unique(filtered_df$genre)
unique_genres <- unique_genres[unique_genres != ""]

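# Map every value containing a known genre name to that clean name,
# which collapses variants such as " Action" into "Action".
# fixed = TRUE matches the name literally rather than as a regular expression.
for (item in unique_genres) {
  movies$genre[grepl(item, movies$genre, fixed = TRUE)] <- item
}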


# keyword <- "Action"
# replacement <- "Action"

# movies$genre[grepl(keyword, movies$genre)] <- replacement



In [12]:
unique(movies$genre)

Great! Now each movie has exactly one genre. Notice that we still have 10,093 rows.

In [13]:
group_by(movies, genre) |>
  summarize(count = n())

genre,count
<chr>,<int>
Action,939
Adventure,578
Animation,467
Comedy,1285
Crime,454
Documentary,180
Drama,1800
Family,409
Fantasy,407
History,166


We need to manually address "TV Movie", as it was not included in our unique_genres list.

In [14]:
movies$genre[grepl("TV Movie", movies$genre)] <- "TV Movie"


In [15]:
genre_count <- group_by(movies, genre) %>%
  summarize(count = n())

genre_count <- genre_count[order(genre_count$count), ]
genre_count

genre,count
<chr>,<int>
TV Movie,60
Western,69
War,86
Music,93
History,166
Documentary,180
Mystery,264
Fantasy,407
Science Fiction,407
Family,409


# Say we demand an accuracy of ±4 percentage points, 19 times out of 20.
(This corresponds to a margin of error (moe) of 0.04 and a 95% confidence level.)
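
For reference, the cells that follow appear to apply the usual sample-size formula for estimating a proportion under simple random sampling, followed by a finite population correction. The symbols are chosen here for illustration: $z = 1.96$ is the 95% normal quantile, $e$ the margin of error, $p$ a planning value for the proportion, and $N$ the population size.

$$
n_0 = \frac{z^2 \, p(1-p)}{e^2}, \qquad n^* = \frac{n_0}{1 + n_0 / N}
$$

Since $p(1-p)$ is largest at $p = 0.5$, using $p = 0.5$ gives the most conservative (largest) sample size.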

In [16]:
nrow(movies)

In [17]:
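# Planning value for the proportion. Note that this uses score >= 65,
# whereas the estimates further down count only score > 65.
p <- mean(movies$score >= 65)
p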

In [18]:
s_2 = p*(1 - p )
s_2

In [19]:
moe <- 0.04
moe

In [20]:
# To be conservative, set p = 0.5 (i.e. s_2 = 0.25), which maximizes p * (1 - p).

n = 1.96^2 * s_2 / moe^2
n

In [21]:
n_star = n / (1 + n/nrow(movies))

n_star

In [22]:
n_star = ceiling(n_star)
n_star
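
The steps in the last few cells can be wrapped in one helper. This is a minimal sketch: `srs_sample_size` is a hypothetical name, and the call below uses the conservative planning value p = 0.5 rather than the value estimated from the data, so its result differs from the sample size used in the rest of the notebook.

```r
# Sample size for estimating a proportion under SRS,
# with a finite population correction (hypothetical helper).
srs_sample_size <- function(p, moe, N, z = 1.96) {
  n0 <- z^2 * p * (1 - p) / moe^2   # size ignoring the finite population
  ceiling(n0 / (1 + n0 / N))        # corrected size, rounded up
}

# Conservative choice p = 0.5 with the population size reported above
n_conservative <- srs_sample_size(p = 0.5, moe = 0.04, N = 10093)
print(n_conservative)  # 567
```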

# SRS

In [23]:
# Get a simple random sample (SRS) of size 563
srs <- sample_n(movies, size = n_star)

#### SRS Calculation for Average Score of Movie (Parameter 1)

In [24]:
# Calculate the average score and its standard error
average_score_srs <- mean(srs$score)
std_error_srs <- sd(srs$score) / sqrt(nrow(srs)) * sqrt((1-nrow(srs)/nrow(movies)))


# Print the average score and its standard error
print(paste("The average score of movies from a SRS of size 563 is " , average_score_srs))
print(paste("The standard error of movies score from a SRS of size 563 is " , std_error_srs))

# 95% Confidence Interval Calculation
lowerBound_srs <- average_score_srs - 1.96 * std_error_srs
upperBound_srs <- average_score_srs + 1.96 * std_error_srs
print(paste("Our 95% confidence interval for mean score of a SRS sample of size 563 has a lower bound of ", lowerBound_srs, " and an upper bound of ", upperBound_srs))

[1] "The average score of movies from a SRS of size 563 is  64.2291296625222"
[1] "The standard error of movies score from a SRS of size 563 is  0.519996819207333"
[1] "Our 95% confidence interval for mean score of a SRS sample of size 563 has a lower bound of  63.2099358968758  and an upper bound of  65.2483234281686"
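
The same calculation can be written as a reusable function. This is a sketch under the same assumptions as the cell above (normal approximation, finite population correction); `srs_mean_ci` and `toy_scores` are hypothetical names and values, not part of the analysis.

```r
# Mean, standard error and normal-approximation CI for an SRS
# drawn without replacement from a population of size N.
srs_mean_ci <- function(y, N, z = 1.96) {
  n <- length(y)
  est <- mean(y)
  se <- sd(y) / sqrt(n) * sqrt(1 - n / N)  # includes the finite population correction
  c(estimate = est, se = se, lower = est - z * se, upper = est + z * se)
}

# Made-up scores, only to show the call
toy_scores <- c(60, 70, 65, 80, 55)
res <- srs_mean_ci(toy_scores, N = 100)
print(res)
```

When the sample is the whole population (n = N), the correction factor drives the standard error to zero, as expected.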


#### SRS Calculation for Proportion of Movies with Score above 65 (Parameter 2)

In [25]:
above65_SRS <- srs %>% filter(score > 65)

proportionAbove65_SRS <- nrow(above65_SRS) / nrow(srs)
SE_proportion_srs <- sqrt(proportionAbove65_SRS * (1-proportionAbove65_SRS) / nrow(srs)) * sqrt((1-nrow(srs)/nrow(movies)))

# Print the sample proportion and its standard error
print(paste("The sample proportion of movies with a score over 65 from a SRS of size 563 is " , proportionAbove65_SRS))
print(paste("The standard error of the sample proportion of movies with a score over 65 from a SRS of size 563 is " , SE_proportion_srs))


# 95% Confidence Interval Calculation (bounds rounded to 4 decimal places for printing)
lowerBound_srs_proportion <- proportionAbove65_SRS - 1.96 * SE_proportion_srs
upperBound_srs_proportion <- proportionAbove65_SRS + 1.96 * SE_proportion_srs
print(paste("Our 95% confidence interval for sample proportion of movies with a score over 65 from a SRS sample of size 563 has a lower bound of", round(lowerBound_srs_proportion, 4), "and an upper bound of", round(upperBound_srs_proportion, 4)))

[1] "The sample proportion of movies with a score over 65 from a SRS of size 563 is  0.49911190053286"
[1] "The standard error of the sample proportion of movies with a score over 65 from a SRS of size 563 is  0.0204762977235707"
[1] "Our 95% confidence interval for sample proportion of movies with a score over 65 from a SRS sample of size 563 has a lower bound of 0.459 and an upper bound of 0.5392"


# Stratified Sampling

In [26]:
unique(movies$genre)

length(unique(movies$genre))

For stratified sampling, we assume that the guessed variance and the cost of sampling **are equal across all strata**.
Under that assumption, optimal allocation reduces to proportional allocation, where the sample size for each stratum is nh = n × Nh / N.
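
Rounding each nh independently does not guarantee that the stratum sizes add up to n. The sketch below shows one possible way to force the total, using a largest-remainder rule; this is an alternative technique for illustration, not what the notebook does, and `proportional_allocation` and the toy sizes are hypothetical.

```r
# Proportional allocation whose stratum sizes sum exactly to n
# (largest-remainder rounding; hypothetical helper).
proportional_allocation <- function(Nh, n) {
  exact <- n * Nh / sum(Nh)      # unrounded allocation
  nh <- floor(exact)             # start from the rounded-down sizes
  leftover <- n - sum(nh)        # units still to hand out
  if (leftover > 0) {
    # Give one extra unit to the strata with the largest fractional parts
    top <- order(exact - nh, decreasing = TRUE)[seq_len(leftover)]
    nh[top] <- nh[top] + 1
  }
  nh
}

# Made-up stratum sizes, only to show the call
Nh_toy <- c(60, 69, 86, 93, 166)
nh_toy <- proportional_allocation(Nh_toy, n = 26)
print(nh_toy)
print(sum(nh_toy))
```

A very small stratum can still receive zero units under this rule, so a minimum per-stratum size may be worth enforcing separately.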

In [27]:
names(genre_count)[names(genre_count) == "count"] <- "Nh"


genre_count <- genre_count %>% mutate(nh = round(n_star * Nh / nrow(movies)))
head(genre_count)

genre,Nh,nh
<chr>,<int>,<dbl>
TV Movie,60,3
Western,69,4
War,86,5
Music,93,5
History,166,9
Documentary,180,10


In [28]:
stratified_data <- data.frame()


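# Draw nh movies without replacement from each genre (stratum).
# The loop index is h, so the sample-size variable n is not overwritten.
for (h in seq_len(nrow(genre_count))) {
    filteredData <- filter(movies, genre == genre_count$genre[h])
    sampledData <- filteredData[sample(nrow(filteredData), genre_count$nh[h]), ]
    stratified_data <- rbind(stratified_data, sampledData)
}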

head(stratified_data)

Unnamed: 0_level_0,names,date_x,score,genre,overview,crew,orig_title,status,orig_lang,budget_x,revenue,country
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>
30,The Death of the Incredible Hulk,02/18/1990,50,TV Movie,"During the critical experiment that would rid David Banner of the Hulk,a spy sabotages the laboratory. Banner falls in love with the spy, Jasmin, who performs missions only because her sister is being held hostage by Jasmin's superiors. Banner and Jasmin try to escape from the enemy agents to rebuild their lives together, but the Hulk is never far from them.","Bill Bixby, Dr. David Bruce Banner, Lou Ferrigno, The Hulk, Elizabeth Gracen, Jasmin, Philip Sterling, Dr Ronald Pratt, Barbara Tarbuck, Amy Pratt, Anna Katarina, Bella / Voshenko, John Novak, Zed, Andreas Katsulas, Kasha, Chilton Crane, Betty",The Death of the Incredible Hulk,Released,English,139200000,398272687.8,US
19,Bad Sister,08/24/2015,68,TV Movie,"As a top student at St. Adeline's Catholic Boarding School, Zoe senses that something is not quite right about the school's new nun-- a sense proven to be true when it is revealed the ""good' nun is an imposter with a fatal attraction to Zoe's brother.","Alyshia Ochse, Laura, Devon Werkheiser, Jason, Ryan Whitney Newman, Zoe, Helen Eigenberg, Sister Rebecca, Robert Leeshock, David, Lise Simms, Cheryl, Sloane Avery, Sara, Hugh Holub, Father Macey, Josh Plasse, Chris",Bad Sister,Released,English,77400000,428263992.8,CA
36,Teen Titans: Trouble in Tokyo,09/15/2006,77,TV Movie,"America's coolest heroes, the Teen Titans, go to Tokyo to track down the mysterious Japanese criminal Brushogun.","Greg Cipes, Beast Boy (voice), Scott Menville, Robin / Japanese Boy (voice), Khary Payton, Cyborg (voice), Tara Strong, Raven / Computer (voice), Hynden Walch, Starfire / Mecha-Boi (voice), Robert Ito, Mayor / Bookseller (voice), Janice Kawaye, Nya-Nya / Timoko (voice), Yuri Lowenthal, Scarface / Japanese Biker (voice), Cary-Hiroyuki Tagawa, Brushogun (voice)",Teen Titans: Trouble in Tokyo,Released,English,151000000,867111926.4,US
1,The Last Manhunt,01/01/2023,56,Western,"In 1909, Willie Boy and his love Carlota go on the run after he accidentally shoots her father in a confrontation gone terribly wrong. With President Taft coming to the area, the local sheriff leads two Native American trackers seeking justice for their “murdered” tribal leader.","Martin Sensmeier, Willie Boy, Mainei Kinimaka, Carlotta, Jason Momoa, Big Jim, Zahn McClarnon, William Johnson, Lily Gladstone, Maria, Raoul Max Trujillo, Hyde, Brandon Oakes, Segundo, Christian Camargo, Sheriff Wilson, Wade Williams, Reche",The Last Manhunt,Released,English,80300000,321306715.6,AU
6,Desperate Riders,02/25/2022,61,Western,"After Kansas Red rescues young Billy from a card-game shootout, the boy asks Red for help protecting his family from the outlaw Thorn, who’s just kidnapped Billy’s mother, Carol. As Red and Billy ride off to rescue Carol, they run into beautiful, tough-as-nails Leslie, who’s managed to escape Thorn’s men. The three race to stop Thorn’s wedding to Carol with guns a-blazing - but does she want to be rescued?","Drew Waters, Kansas Red, Trace Adkins, Thorn, Tom Berenger, Doc Tillman, Vanessa Evigan, Leslie, Sam Ashby, Billy, Victoria Pratt, Carol, Cowboy Troy, Finnegan, Rob Mayes, Deputy Harris, Peter Sherayko, Linstrom",Desperate Riders,Released,English,88600000,899820065.0,US
45,Mackenna's Gold,05/09/1969,66,Western,"A bandit kidnaps a Marshal who has seen a map showing a gold vein on Indian lands, but other groups are looking for it too, while the Apache try to keep the secret location undisturbed.","Gregory Peck, Marshal MacKenna, Omar Sharif, Colorado, Camilla Sparv, Inga Bergmann, Julie Newmar, Hesh-Ke, Telly Savalas, Sergeant Tibbs, Keenan Wynn, Sanchez, Ted Cassidy, Hachita, Lee J. Cobb, The Editor, Raymond Massey, The Preacher",Mackenna's Gold,Released,English,7000000,277238035.8,US


# Stratified Sampling (Parameter 1: Mean Score of Movie)

In [29]:
# Calculate the average score and standard deviation for each stratum.
# Nh is attached by joining on genre rather than by row position,
# so each stratum gets its own population size even when nh values tie.
stratified_stats_y <- stratified_data %>%
  group_by(genre) %>%
  summarise(ysh = mean(score),
            sd_score = sd(score),
            nh = n()) %>%
  left_join(select(genre_count, genre, Nh), by = "genre") %>%
  arrange(Nh)


stratified_stats_y

genre,ysh,sd_score,nh,Nh
<chr>,<dbl>,<dbl>,<int>,<int>
TV Movie,65.0,13.74772708486752,3,60
Western,61.75,4.3493294502333,4,69
War,64.8,6.97853852894716,5,86
Music,70.0,2.54950975679639,5,93
History,70.6666666666667,6.36396103067893,9,166
Documentary,64.1,24.50600833355862,10,180
Mystery,62.4666666666667,8.95119039702595,15,264
Fantasy,67.5652173913043,9.71795941429401,23,407
Science Fiction,62.5652173913044,10.38716903840556,23,407
Family,66.0434782608696,6.66386634203492,23,409


In [30]:
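# Weight each stratum by its population share Nh / N.
# With proportional allocation this is close to nh / n, but the rounded nh need not sum to n.
Wh <- stratified_stats_y$Nh / nrow(movies)
ystr <- sum(Wh * stratified_stats_y$ysh)
ystr
se_ystr <- sqrt(sum(Wh^2 * (1-stratified_stats_y$nh/stratified_stats_y$Nh) * ((stratified_stats_y$sd_score)^2/stratified_stats_y$nh)))
se_ystr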
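
For reference, the standard stratified estimator of the mean and its standard error, written in the notation of the table above ($\bar{y}_h$ is `ysh`, $s_h$ is `sd_score`, $N = \sum_h N_h$):

$$
\bar{y}_{str} = \sum_h \frac{N_h}{N}\,\bar{y}_h, \qquad
SE(\bar{y}_{str}) = \sqrt{\sum_h \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}}
$$

Under exact proportional allocation, $n_h / n = N_h / N$, so weighting by the sample shares gives approximately the same result. The two differ only through the rounding of $n_h$.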

# Stratified Sampling (Parameter 2: Proportion of Movies above Score of 65)

In [31]:
# Proportion of sampled movies with score > 65 in each stratum.
# Computed per genre and joined to Nh by genre, so values cannot be misaligned by row order.
stratified_stats_p <- stratified_data %>%
  group_by(genre) %>%
  summarise(psh = mean(score > 65),
            nh = n()) %>%
  left_join(select(genre_count, genre, Nh), by = "genre") %>%
  arrange(Nh) %>%
  select(genre, nh, Nh, psh)

stratified_stats_p

genre,nh,Nh,psh
<chr>,<int>,<int>,<dbl>
TV Movie,3,60,0.666666666666667
Western,4,69,0.25
War,5,86,0.2
Music,5,93,1.0
History,9,166,0.888888888888889
Documentary,10,180,0.6
Mystery,15,264,0.333333333333333
Fantasy,23,407,0.608695652173913
Science Fiction,23,407,0.304347826086957
Family,23,409,0.478260869565217

In [32]:
# Weight each stratum by its population share Nh / N
Wh <- stratified_stats_p$Nh / nrow(movies)
pstr <- sum(Wh * stratified_stats_p$psh)
pstr
se_pstr <- sqrt(sum(Wh^2 * (1-stratified_stats_p$nh/stratified_stats_p$Nh) * stratified_stats_p$psh*(1-stratified_stats_p$psh)/stratified_stats_p$nh))
se_pstr
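
The notebook stops at the point estimate and standard error. A 95% confidence interval for the stratified proportion could be formed the same way as for the SRS, as in the sketch below. It runs on a made-up two-stratum table (`strata` and its values are hypothetical), weights strata by their population share Nh / N, and uses the same p(1 - p) / nh variance form as the cell above.

```r
# Hypothetical stratum summary: population size, sample size, sample proportion
strata <- data.frame(
  Nh  = c(60, 140),
  nh  = c(6, 14),
  psh = c(0.5, 0.25)
)

# Population-share weights
Wh <- strata$Nh / sum(strata$Nh)

# Stratified estimate of the proportion
p_str <- sum(Wh * strata$psh)

# Standard error with a per-stratum finite population correction
se_str <- sqrt(sum(Wh^2 * (1 - strata$nh / strata$Nh) *
                     strata$psh * (1 - strata$psh) / strata$nh))

# Normal-approximation 95% confidence interval
ci <- p_str + c(-1, 1) * 1.96 * se_str
print(p_str)
print(ci)
```

With strata as small as 3 to 5 sampled movies, the normal approximation is rough, so an interval like this is best read as indicative.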