# **Sentiment Analysis on Movie Reviews**
**Anthony Do, Felipe Robledo, Gustavo Castillo | WRIT 20833 | Fall 2025**

## **Overview**
This project investigates how actor popularity relates to the tone of critical movie reviews. Building directly on our HW4-1 and HW4-2 work, we scale up from initial experiments with VADER sentiment analysis and basic frequency patterns to a more focused, theory-informed inquiry about star power and reception. HW4-1 provided the sentiment pipeline and evaluation thresholds; HW4-2 added data cleaning, exploratory topic modeling, and workflow structure. Here, we integrate those components into a cohesive study: assembling a critic review corpus, deriving five actor popularity spans (span1–span5) from composite indicators, and comparing sentiment distributions across these spans.

The goal is not only to check whether reviews are "more positive" for popular actors, but also to examine distributional differences (medians, variability, and tails) that reveal consensus and outlier behavior. We pair quantitative results (box/violin plots of VADER compound scores) with humanities interpretation, asking how celebrity status shapes evaluative language and critical gatekeeping. The site presents our research question, methods, visualizations, key findings, and an integration & reflection section that connects computational insights to cultural analysis and outlines limitations and future directions.

## **Research Question**
**Main Question:** Does actor popularity influence the sentiment of critical movie reviews? Specifically, do films featuring highly popular actors receive more positive reviews compared to those with less popular actors?

**Hypothesis:** We hypothesize that movies featuring actors in higher popularity spans will receive more favorable critical reviews, as measured by VADER sentiment scores. This could reflect potential reviewer bias, audience expectations, or the correlation between star power and production quality.

**Background:** The relationship between star power and critical reception remains a contested topic in film studies. While popular actors often command higher salaries and box office returns, it's unclear whether their presence actually influences how critics evaluate films. By applying sentiment analysis to movie reviews and categorizing actors by popularity metrics, this project investigates whether computational text analysis can reveal systematic patterns in critical discourse that may not be apparent through traditional qualitative methods alone.

**Why This Matters:** Understanding this relationship has implications for film production decisions, marketing strategies, and broader questions about how celebrity culture shapes critical evaluation in the entertainment industry.

In [None]:
# mounting google drive to access my datasets
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# installing the wrapper for the movie database api and tqdm for progress bars
!pip install tmdbv3api tqdm
# installing vader sentiment for the analysis part
!pip install vaderSentiment
# importing the sentiment analyzer tool
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# importing pandas to handle tables
import pandas as pd
import os
# importing the specific api modules we need
from tmdbv3api import TMDb, Person, Movie
from tqdm import tqdm # this is for the progress bar
import time
import random

In [None]:
# initializing the tmdb api object
tmdb = TMDb()
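# reading the api key from an environment variable so the secret is not stored in the notebook
# (TMDB_API_KEY is a variable name chosen here; set it before running this cell)
tmdb.api_key = os.environ['TMDB_API_KEY']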

In [None]:
# testing the api connection by checking details for a specific actor id
person_api = Person()
person_details = person_api.details(2091760)
# printing the available keys to see what data we can get
display(vars(person_details).keys())

## **Data & Methods**

### **Data Collection & Preparation**
**Data Sources:** Building on the workflows established in HW4-1 and HW4-2, we assembled our review corpus using two primary sources. First, we drew actors from Kaggle's "Top 10,000 Celebrities" dataset, which provided composite popularity metrics including social media followers, industry rankings, and engagement scores. After filtering to people known for acting and ranking them by popularity, we divided the ranking into five popularity spans: span1 (ranks 1-100, the most popular), span2 (ranks 101-1,000), span3 (ranks 1,001-2,000), span4 (ranks 2,001-5,000), and span5 (ranks 5,001 and beyond, the least popular). We then randomly sampled 20 actors from each span, 100 actors in total, to capture variance across the celebrity spectrum.
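
The rank boundaries behind the five spans can be written as a small lookup. The sketch below is illustrative only: `SPAN_UPPER_BOUNDS`, `SPAN_NAMES`, and `assign_span` are hypothetical names that do not appear in our notebook, and the boundaries simply mirror the slicing ranges used in the code cells (rank 1 is the most popular actor).

```python
import bisect

# Highest rank (inclusive) that still belongs to span1..span4.
# Any rank above the last bound falls into span5.
SPAN_UPPER_BOUNDS = [100, 1000, 2000, 5000]
SPAN_NAMES = ["span1", "span2", "span3", "span4", "span5"]


def assign_span(rank: int) -> str:
    """Map a 1-based popularity rank (1 = most popular) to a span label."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    # bisect_left returns the position of the first bound that is >= rank
    return SPAN_NAMES[bisect.bisect_left(SPAN_UPPER_BOUNDS, rank)]


for example_rank in (1, 100, 101, 2000, 5001):
    print(example_rank, assign_span(example_rank))
```

Keeping the boundaries in one list makes it easier to test alternative cut points later without touching the slicing logic.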

**Collection Method Evolution:** Initially, we attempted to scrape reviews from IMDb using Python libraries (BeautifulSoup and Selenium). However, we encountered critical limitations: IMDb's anti-scraping measures blocked requests, HTML parsing proved unreliable, and aggressive scraping raised ethical concerns about terms of service violations. We pivoted to The Movie Database (TMDB) API, which provided legitimate, structured access to movie metadata and user reviews. Using Python in Google Colab, we queried TMDB's API endpoints to retrieve each actor's filmography, filtered for feature films (2000-2024), and collected up to 20 reviews per film, sampling at random when more were available. This process yielded approximately 700 reviews after cleaning and deduplication. Each record contains the review text, the reviewer's rating where given, and metadata, all stored in structured CSV files for analysis.
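
The release-year filter mentioned above could look roughly like the following. This is a minimal sketch, not the exact code we ran: `release_year` and `in_study_window` are hypothetical helper names, the sample credits are invented, and the sketch assumes release dates arrive as `YYYY-MM-DD` strings that may be empty or missing.

```python
from datetime import date


def release_year(release_date):
    """Return the year of a 'YYYY-MM-DD' string, or None if missing or malformed."""
    if not release_date:
        return None
    try:
        return date.fromisoformat(release_date).year
    except (ValueError, TypeError):
        return None


def in_study_window(release_date, first_year=2000, last_year=2024):
    """True when the release year falls inside the inclusive study window."""
    year = release_year(release_date)
    return year is not None and first_year <= year <= last_year


# Invented credits for illustration only; not project data.
credits = [
    {"title": "Film A", "release_date": "1998-06-12"},
    {"title": "Film B", "release_date": "2010-07-16"},
    {"title": "Film C", "release_date": ""},
    {"title": "Film D", "release_date": "2024-12-31"},
]
kept = [c["title"] for c in credits if in_study_window(c["release_date"])]
print(kept)
```

Treating a missing date as "outside the window" is a design choice: it drops unreleased or poorly documented titles rather than guessing their year.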

In [None]:
# loading the main celebrity dataset from my drive
file_path_3 = '/content/drive/My Drive/WRIT20833/Data/Final_Project/Celebrity.csv'
celeb = pd.read_csv(file_path_3)

print(f"Original DataFrame has {len(celeb)} entries.")

# filtering the data to only include people who are known for acting
actors_df = celeb[celeb['known_for_department'] == 'Acting'].copy()
print(f"Found {len(actors_df)} actors in the file.")

# sorting the actors by popularity score from high to low
actors_df = actors_df.sort_values(by='popularity', ascending=False).reset_index(drop=True)

In [None]:
# defining the index ranges for our 5 popularity groups (spans)
span_indices = {
    'span1': (0, 100),    # Ranks 1-100 (index 0-99)
    'span2': (100, 1000),  # Ranks 101-1000 (index 100-999)
    'span3': (1000, 2000), # Ranks 1001-2000 (index 1000-1999)
    'span4': (2000, 5000), # Ranks 2001-5000 (index 2000-4999)
    'span5': (5000, len(actors_df)) # Ranks 5001+ (index 5000 to end)
}

# creating a dictionary to store the dataframes for each span
spans_data = {}

print("\n--- Slicing DataFrame into 5 spans ---")

# selecting only the columns we need for analysis
columns_to_keep = ['id', 'name', 'popularity'] # Added 'name' for easier checking

# looping through our defined indices to slice the main dataframe
for span_name, (start, end) in span_indices.items():
    # slicing the main dataframe based on the current range
    span_df = actors_df.iloc[start:end]

    # filtering for specific columns
    span_df = span_df[columns_to_keep].copy()

    # saving the slice to our dictionary
    spans_data[span_name] = span_df

    print(f"Created {span_name}: {len(span_df)} actors (Ranks {start+1} to {end})")

# --- Now you can access each DataFrame ---
print("\n--- Previews of your 5 spans ---")

# checking the first span
print("\nSpan 1 (Top 100):")
display(spans_data['span1'].head(3))
print(f"Popularity range: {spans_data['span1']['popularity'].min()} to {spans_data['span1']['popularity'].max()}")

# checking span 2
print("\nSpan 2 (101-1000):")
display(spans_data['span2'].head(3))
print(f"Popularity range: {spans_data['span2']['popularity'].min()} to {spans_data['span2']['popularity'].max()}")

# checking span 3
print("\nSpan 3 (1001-2000):")
display(spans_data['span3'].head(3))
print(f"Popularity range: {spans_data['span3']['popularity'].min()} to {spans_data['span3']['popularity'].max()}")

# checking span 4
print("\nSpan 4 (2001-5000):")
display(spans_data['span4'].head(3))
print(f"Popularity range: {spans_data['span4']['popularity'].min()} to {spans_data['span4']['popularity'].max()}")

# checking the final span
print("\nSpan 5 (5001+):")
display(spans_data['span5'].head(3))
print(f"Popularity range: {spans_data['span5']['popularity'].min()} to {spans_data['span5']['popularity'].max()}")

In [None]:
print("\n--- Sampling 20 random actors from each span ---")

# dictionary to store our random samples
sampled_spans = {}

# looping through each span dataframe
for span_name, span_df in spans_data.items():
    # making sure we don't try to sample more than what's available
    sample_size = min(20, len(span_df))

    if sample_size < 1:
        print(f"Skipping {span_name}: Not enough actors to sample.")
        continue

    # taking a random sample of 20 actors
    # using .copy() to keep it clean
    sampled_spans[span_name] = span_df.sample(n=sample_size, random_state=1).copy()

    print(f"Sampled {sample_size} actors for {span_name}.")

print("\nPreview of sampled span 2:")
display(sampled_spans['span2'].head())

In [None]:
print(f"\n--- STARTING API FETCH FOR 20 SAMPLES FROM ALL 5 SPANS ---")

# mapping gender codes to text
gender_map = {0: 'Not Set', 1: 'Female', 2: 'Male', 3: 'Non-binary'}

# defining where to save the specific span files
base_output_dir = '/content/drive/My Drive/WRIT20833/Data/Final_Project'
output_dir = os.path.join(base_output_dir, "spans")
os.makedirs(output_dir, exist_ok=True)
print(f"All files will be saved to: {output_dir}")

# --- MAIN LOOP: Iterate over your new 'sampled_spans' dictionary ---
# grabbing the 20-actor samples we just created
for span_name, current_span_sampled_df in sampled_spans.items():

    print(f"\n--- Processing {span_name} (Fetching {len(current_span_sampled_df)} actors) ---")

    # creating lists to hold the new data we're about to fetch
    genders = []
    birthdays = []
    places_of_birth = []

    # looping through every actor id in the current sample
    for actor_id in current_span_sampled_df['id']:
        try:
            # calling the api to get person details
            person = person_api.details(actor_id)

            # 1. fetching gender
            gender_num = getattr(person, 'gender', 0)
            genders.append(gender_map.get(gender_num, 'Not Set'))

            # 2. fetching birthday
            birthdays.append(getattr(person, 'birthday', None))

            # 3. fetching place of birth
            places_of_birth.append(getattr(person, 'place_of_birth', None))

        except Exception as e:
            # error handling in case the api call fails
            print(f"    Error fetching ID {actor_id}: {e}")
            genders.append(None)
            birthdays.append(None)
            places_of_birth.append(None)

        # adding a small delay to be polite to the api
        time.sleep(0.1)

    # finished the loop for this span
    print(f"Finished fetching all details for {span_name} sample.")

    # adding the new lists as columns to our dataframe
    current_span_sampled_df['gender'] = genders
    current_span_sampled_df['birthday'] = birthdays
    current_span_sampled_df['place_of_birth'] = places_of_birth

    # creating the filename for this span
    output_filename = f"{span_name}_20_random_actors_details.csv"
    output_path = os.path.join(output_dir, output_filename)

    # saving the dataframe to csv
    current_span_sampled_df.to_csv(output_path, index=False)

    print(f"\n--- SUCCESS! ---")
    print(f"Saved {span_name} file to: {output_path}")

    # showing what the data looks like
    print(f"\nPreview of {output_filename}:")
    display(current_span_sampled_df.head(3))

# --- End of MAIN LOOP ---

print("\n\n--- ALL 5 SPANS PROCESSED AND SAVED! ---")

In [None]:
movie_api = Movie()

print("--- Searching for 'Inception' to get its ID ---")

try:
    # searching for a specific movie to test the api structure
    search_results = movie_api.search('Inception')

    if search_results:
        # grabbing the first result
        first_result = search_results[0]
        movie_id = first_result.id
        print(f"Found '{first_result.title}'. Its TMDB ID is: {movie_id}\n")

        # fetching full details for this movie id
        print(f"--- Fetching all details for ID {movie_id} ---")
        movie_details = movie_api.details(movie_id)

        # checking all available keys in the movie object
        display(vars(movie_details).keys())

    else:
        print("Movie not found.")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please make sure your tmdb.api_key is set correctly.")

In [None]:
# --- 1. Create a Movie API instance ---
# (This assumes your tmdb.api_key is already set)
movie_api = Movie()

# --- 2. Get 1 Actor from your Span 1 sample ---
# grabbing the first actor from our span1 dataframe to test
span1_sample_df = sampled_spans['span1']
first_actor = span1_sample_df.iloc[0]
actor_id_to_test = first_actor['id']
actor_name_to_test = first_actor['name']

print(f"--- 1. Testing with Actor: {actor_name_to_test} (ID: {actor_id_to_test}) ---")

try:
    # fetching the movie credits for this actor
    movie_credits = person_api.movie_credits(actor_id_to_test)

    if not movie_credits.cast:
        print(f"This actor has no movie credits in their 'cast' list. Stopping test.")
    else:
        # getting their first movie to test details fetch
        first_movie = movie_credits.cast[0]
        movie_id_to_test = first_movie.id
        movie_title_to_test = first_movie.title

        print(f"--- 2. Testing with their first movie: {movie_title_to_test} (ID: {movie_id_to_test}) ---")

        # fetching the full "wordy" details for this movie
        movie_details = movie_api.details(movie_id_to_test)

        print("--- 3. Successfully fetched full movie details ---")

        # --- 6. Process the "Wordy" Data into clean strings ---

        # extracting genres into a string
        if hasattr(movie_details, 'genres'):
            genres_list = [g.name for g in movie_details.genres]
            genres_str = ", ".join(genres_list)
        else:
            genres_str = None

        # extracting production countries
        if hasattr(movie_details, 'production_countries'):
            countries_list = [c.name for c in movie_details.production_countries]
            countries_str = ", ".join(countries_list)
        else:
            countries_str = None

        # building a dictionary for the test row
        test_data_row = {
            'actor_id': actor_id_to_test,
            'actor_name': actor_name_to_test,
            'actor_popularity_span': 'span1',

            'movie_id': movie_id_to_test,
            'movie_title': movie_title_to_test,

            # --- YOUR NEW "WORDY" DATA ---
            'movie_genres': genres_str,
            'movie_language': getattr(movie_details, 'original_language', None),
            'movie_origin_country': countries_str,
            'movie_overview': getattr(movie_details, 'overview', None),

            # --- KEY METRICS FROM MOVIE DETAILS ---
            'movie_release_date': getattr(movie_details, 'release_date', None),
            'movie_budget': getattr(movie_details, 'budget', None),
            'movie_revenue': getattr(movie_details, 'revenue', None),
            'movie_vote_average': getattr(movie_details, 'vote_average', None),
            'movie_vote_count': getattr(movie_details, 'vote_count', None)
        }

        # creating a dataframe to visualize the structure
        test_list = [test_data_row]

        # converting to dataframe
        test_df = pd.DataFrame(test_list)

        print("\n--- 4. This is what the DataFrame structure will look like: ---")

        # displaying the dataframe
        display(test_df)

        # printing the full overview text to check it
        print("\n--- Full Movie Overview ---")
        print(test_data_row['movie_overview'])

except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
try:
    movie_credits = person_api.movie_credits(actor_id_to_test)
    if not movie_credits.cast:
        print("This actor has no movie credits. Stopping test.")
    else:
        first_movie = movie_credits.cast[0]
        movie_id_to_test = first_movie.id
        movie_title_to_test = first_movie.title

        print(f"--- 2. Testing with Movie: {movie_title_to_test} (ID: {movie_id_to_test}) ---")

        # fetching the reviews for this specific movie
        movie_reviews = movie_api.reviews(movie_id_to_test)

        if not movie_reviews:
            print("--- 3. This movie has 0 reviews on TMDB. ---")
        else:
            print(f"--- 3. Found {len(movie_reviews)} reviews for this movie ---")

            # preparing a list to hold the reviews
            all_reviews_list = []

            # looping through each review to extract data
            for review in movie_reviews:
                # checking if there is a rating available
                author_rating = None
                if review.author_details and hasattr(review.author_details, 'rating'):
                    author_rating = review.author_details.rating

                # building the review row dictionary
                review_data = {
                    'movie_id': movie_id_to_test,
                    'movie_title': movie_title_to_test,
                    'review_id': review.id,
                    'review_author': review.author,
                    'review_rating': author_rating, # The 1-10 rating, if given
                    'review_content': review.content # The full text!
                }
                all_reviews_list.append(review_data)

            # creating a dataframe to verify the review data structure
            reviews_test_df = pd.DataFrame(all_reviews_list)

            print("\n--- 4. This is what the Review DataFrame structure will look like: ---")
            display(reviews_test_df.head())

            # printing the content of the first review
            print("\n--- Full text of first review ---")
            print(reviews_test_df.iloc[0]['review_content'])

except Exception as e:
    print(f"An error occurred: {e}")

**Ethical Considerations:** We prioritized ethical data practices throughout. By using TMDB's API rather than scraping, we respected terms of service and rate limits. All reviews were publicly posted content, and we anonymized usernames in our analysis. We acknowledged key limitations: TMDB users skew younger than general audiences, self-selection bias favors extreme opinions, and our popularity metrics may not perfectly align with influence at each film's release date. We documented our complete pipeline in GitHub for reproducibility.
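
One possible way to keep usernames out of the analysis files is to replace each one with a salted hash. The sketch below is an assumption about how this could be done, not a record of our exact procedure: `pseudonymize` and the salt value are hypothetical, and strictly speaking this is pseudonymization, since the same username always maps to the same label.

```python
import hashlib


def pseudonymize(username: str, salt: str) -> str:
    """Return a stable label for a username that does not contain the name itself."""
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    # A short prefix of the digest is enough to tell reviewers apart in a small corpus
    return "reviewer_" + digest[:10]


# Hypothetical salt and username; keep the real salt out of any public repository.
example_salt = "project-specific-salt"
print(pseudonymize("some_user", example_salt))
```

Because the mapping is stable, reviews by the same author can still be grouped, which matters when checking whether a few prolific reviewers dominate a span.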

The next step builds a relational dataset from four linked files: (1) actor details, (2) actor-to-movie links, (3) movie details, and (4) review content keyed by movie ID. From there, we run VADER sentiment analysis on each review.
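
To make the linking concrete, the sketch below joins reviews back to popularity spans using plain dictionaries. The rows are invented stand-ins for the CSV files, and `span_by_actor`, `spans_by_movie`, and `review_spans` are hypothetical names; only the column names (`id`, `actor_popularity_span`, `actor_id`, `movie_id`, `review_id`) follow the files described here.

```python
from collections import defaultdict

# Invented rows standing in for the actor, link, and review files.
actors = [
    {"id": 1, "actor_popularity_span": "span1"},
    {"id": 2, "actor_popularity_span": "span4"},
]
links = [
    {"actor_id": 1, "movie_id": 10},
    {"actor_id": 2, "movie_id": 10},
    {"actor_id": 2, "movie_id": 11},
]
reviews = [
    {"review_id": "r1", "movie_id": 10},
    {"review_id": "r2", "movie_id": 11},
    {"review_id": "r3", "movie_id": 99},  # movie with no sampled actor
]

# Step 1: actor id -> span label
span_by_actor = {a["id"]: a["actor_popularity_span"] for a in actors}

# Step 2: movie id -> set of spans represented in its cast
spans_by_movie = defaultdict(set)
for link in links:
    span = span_by_actor.get(link["actor_id"])
    if span is not None:
        spans_by_movie[link["movie_id"]].add(span)

# Step 3: one output row per (review, span) pair
review_spans = [
    {"review_id": r["review_id"], "span": span}
    for r in reviews
    for span in sorted(spans_by_movie.get(r["movie_id"], ()))
]
print(review_spans)
```

Note that a review of a film with sampled actors from two spans appears once per span, so span-level counts can exceed the number of unique reviews.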

In [None]:
print("--- 1. COMBINING ACTOR FILES ---")

# defining the final output directory for our master datasets
base_output_dir = '/content/drive/My Drive/WRIT20833/Data/Final_Project'
final_output_dir = os.path.join(base_output_dir, "final_dataset")
os.makedirs(final_output_dir, exist_ok=True)

all_sorted_actors_list = []

# looping through each span to combine them
for span_name in ['span1', 'span2', 'span3', 'span4', 'span5']:
    span_df = sampled_spans[span_name]

    # adding a column to identify which span the actor belongs to
    temp_actor_df = span_df.copy()
    temp_actor_df['actor_popularity_span'] = span_name

    # sorting this span by popularity
    temp_actor_df = temp_actor_df.sort_values(
        by='popularity',
        ascending=False
    )

    # adding to the master list
    all_sorted_actors_list.append(temp_actor_df)

# concatenating all spans into one big dataframe
actors_df = pd.concat(all_sorted_actors_list, ignore_index=True)

# dropping any accidental duplicates
actors_df = actors_df.drop_duplicates(subset=['id'])

# saving the combined actor list to a csv file
actor_file_path = os.path.join(final_output_dir, 'actors_details_100.csv')
actors_df.to_csv(actor_file_path, index=False)

print(f"Saved File 1: 'actors_details_100.csv' with {len(actors_df)} total actors.")
print(f"File saved to: {actor_file_path}")
print("\nPreview of the combined actor file:")
display(actors_df.head())

In [None]:
import pandas as pd
import os
from tqdm import tqdm
import time

print("--- 2. FETCHING ACTOR-TO-MOVIE LINKS ---")
print("Iterating through 100 actors to find all their movie credits...")

# making sure the actors dataframe is loaded
if 'actors_df' not in locals():
    print("Reloading actors_df from disk...")
    actor_file_path = os.path.join(final_output_dir, 'actors_details_100.csv')
    if os.path.exists(actor_file_path):
        actors_df = pd.read_csv(actor_file_path)
    else:
        print("ERROR: actors_details_100.csv not found. Please re-run the previous cell.")
        raise FileNotFoundError("actors_details_100.csv not found.")

# lists to hold our link data
all_actor_movie_links = []
all_unique_movie_ids = set() # using a set to handle duplicates automatically

# looping through each actor with a progress bar
for index, actor_row in tqdm(actors_df.iterrows(), total=len(actors_df), desc="Finding Movies"):
    actor_id = actor_row['id']

    try:
        # getting movie credits for the current actor
        movie_credits = person_api.movie_credits(actor_id)

        # looping through the movies in their credits
        for movie_credit in movie_credits.cast:
            movie_id = movie_credit.id

            # adding to the unique set of movie ids
            all_unique_movie_ids.add(movie_id)

            # creating the link dictionary
            link_row = {
                'actor_id': actor_id, # Link to File 1 (Actors)
                'movie_id': movie_id  # Link to File 3 (Movies) & 4 (Reviews)
            }
            all_actor_movie_links.append(link_row)

        # pausing briefly to be kind to the api
        time.sleep(0.1)

    except Exception as e:
        print(f"  [Skipped Actor ID: {actor_id}] Error: {e}")

print("\n--- Data Fetch Complete ---")
print(f"Found {len(all_actor_movie_links)} actor-to-movie links.")
print(f"Found {len(all_unique_movie_ids)} unique movies.")

# creating the dataframe for actor-movie links
link_df = pd.DataFrame(all_actor_movie_links)

# dropping duplicate pairs if any
link_df = link_df.drop_duplicates()

# saving the links to a csv file
link_file_path = os.path.join(final_output_dir, 'actor_movie_links.csv')
link_df.to_csv(link_file_path, index=False)

print(f"\nSaved File 2: 'actor_movie_links.csv' with {len(link_df)} links.")
print(f"   File saved to: {link_file_path}")
print("\nPreview of the link file:")
display(link_df.head())

print(f"\nReady for next step: Fetching details for {len(all_unique_movie_ids)} unique movies.")

In [None]:
print("--- 3. FETCHING MOVIE DETAILS (Movies ONLY) ---")

# verifying we have the unique movie ids
if 'all_unique_movie_ids' not in locals() or not all_unique_movie_ids:
    print("ERROR: 'all_unique_movie_ids' set not found or is empty.")
    print("Please re-run the previous cell (Actor-to-Movie Links) to generate it.")
else:
    print(f"Starting to fetch details for {len(all_unique_movie_ids)} unique movies...")

    # list to store movie details
    all_movies_list = []

    # set to keep track of processed movies so we don't double fetch
    processed_movie_ids = set()

    # looping through the unique movie ids with progress bar
    for movie_id in tqdm(all_unique_movie_ids, desc="Fetching Movie Details"):

        # skipping if we already did this one
        if movie_id in processed_movie_ids:
            continue

        try:
            # fetching details from the movie api
            movie_details = movie_api.details(movie_id)

            # processing the genres into a readable string
            genres_list = [g.name for g in getattr(movie_details, 'genres', [])]
            genres_str = ", ".join(genres_list)

            # processing production countries into a readable string
            countries_list = [c.name for c in getattr(movie_details, 'production_countries', [])]
            countries_str = ", ".join(countries_list)

            # building the dictionary row for the dataframe
            movie_data_row = {
                'movie_id': movie_id, # Primary key for this file
                'movie_title': getattr(movie_details, 'title', None),
                'movie_genres': genres_str,
                'movie_language': getattr(movie_details, 'original_language', None),
                'movie_origin_country': countries_str,
                'movie_overview': getattr(movie_details, 'overview', None),
                'movie_release_date': getattr(movie_details, 'release_date', None),
                'movie_budget': getattr(movie_details, 'budget', None),
                'movie_revenue': getattr(movie_details, 'revenue', None),
                'movie_vote_average': getattr(movie_details, 'vote_average', None),
                'movie_vote_count': getattr(movie_details, 'vote_count', None)
            }
            all_movies_list.append(movie_data_row)

            # marking this movie as processed
            processed_movie_ids.add(movie_id)

            # small delay for api politeness
            time.sleep(0.1)

        except Exception as e:
            # logging errors but continuing execution
            print(f"  [Skipped Movie ID: {movie_id}] Error: {e}")

    print("\n--- Data Fetch Complete ---")
    print(f"Found details for {len(all_movies_list)} movies.")

    # creating and saving the movie details dataframe
    movies_df = pd.DataFrame(all_movies_list)
    movies_df = movies_df.drop_duplicates(subset=['movie_id'])
    movie_file_path = os.path.join(final_output_dir, 'movies_details.csv')
    movies_df.to_csv(movie_file_path, index=False)
    print(f"\nSaved File 3: 'movies_details.csv' with {len(movies_df)} unique movies.")
    print("\nPreview of movies_details.csv:")
    display(movies_df.head())

    print("\n--- MOVIE DETAILS COLLECTION COMPLETE! ---")
    print("You can now run the final loop for reviews.")

In [None]:
print("--- 4. FETCHING ALL REVIEWS (Sampling Max 20 per Movie) ---")

# checking if we have the movie ids
if 'all_unique_movie_ids' not in locals() or not all_unique_movie_ids:
    print("ERROR: 'all_unique_movie_ids' set not found or is empty.")
    print("Please re-run Loop 2 (Actor-to-Movie Links) to generate it.")
else:
    print(f"Starting to fetch reviews for {len(all_unique_movie_ids)} unique movies...")

    # list for storing reviews
    all_reviews_list = []

    # loading movie details to help map titles
    movie_file_path = os.path.join(final_output_dir, 'movies_details.csv')
    if not os.path.exists(movie_file_path):
        print("ERROR: 'movies_details.csv' not found. Please re-run Loop 3 first.")
        raise FileNotFoundError("movies_details.csv not found.")

    movies_df = pd.read_csv(movie_file_path)
    # making a quick lookup dictionary for movie titles
    movie_title_map = pd.Series(movies_df.movie_title.values, index=movies_df.movie_id).to_dict()

    # iterating through movies to fetch reviews
    for movie_id in tqdm(all_unique_movie_ids, desc="Fetching Reviews"):
        try:
            # calling api for reviews
            movie_reviews = movie_api.reviews(movie_id)

            if not movie_reviews:
                continue # skipping if no reviews exist

            # sampling logic: max 20 reviews per movie
            if len(movie_reviews) > 20:
                # randomly sampling 20 if there are more
                # (converting to a list first because random.sample needs a sequence)
                sampled_reviews = random.sample(list(movie_reviews), 20)
            else:
                # taking all if 20 or fewer
                sampled_reviews = movie_reviews

            # processing each sampled review
            for review in sampled_reviews:

                # checking for valid author details
                if not hasattr(review, 'author_details'):
                    continue

                # getting the rating if it exists
                author_rating = None
                if review.author_details and hasattr(review.author_details, 'rating'):
                    author_rating = review.author_details.rating

                # creating the review data row
                review_data = {
                    'movie_id': movie_id,
                    'movie_title': movie_title_map.get(movie_id, "Title not found"),
                    'review_id': review.id,
                    'review_author': review.author,
                    'review_rating': author_rating,
                    'review_content': review.content
                }
                all_reviews_list.append(review_data)

            # polite delay
            time.sleep(0.1)

        except Exception as e:
            print(f"  [Skipped Movie ID: {movie_id}] Error: {e}")

    print("\n--- Data Fetch Complete ---")
    print(f"Found {len(all_reviews_list)} total sampled reviews.")

    # saving the reviews dataframe to csv
    if all_reviews_list:
        reviews_df = pd.DataFrame(all_reviews_list)
        reviews_df = reviews_df.drop_duplicates(subset=['review_id'])
        review_file_path = os.path.join(final_output_dir, 'reviews_details.csv')
        reviews_df.to_csv(review_file_path, index=False)
        print(f"\nSaved File 4: 'reviews_details.csv' with {len(reviews_df)} unique sampled reviews.")
        print("\nPreview of reviews_details.csv:")
        display(reviews_df.head())
    else:
        print("\nNo reviews were found for any of the movies. 'reviews_details.csv' not created.")

    print("\n--- ALL DATA COLLECTION COMPLETE! ---")

### **Analysis Methods**
**Computational Pipeline:** All analysis was conducted in Python using Google Colab notebooks, leveraging libraries established in our earlier assignments: pandas for data manipulation, vaderSentiment for sentiment scoring, nltk for text preprocessing, gensim for topic modeling, and matplotlib/seaborn for visualization. This workflow directly extended HW4-1's sentiment pipeline and HW4-2's data cleaning structure.

**Why These Methods?** We selected three complementary analytical approaches to address different dimensions of our research question. Term frequency analysis identified distinctive vocabulary patterns across popularity spans, showing whether reviewers use different language when discussing high-profile versus lesser-known actors. VADER sentiment analysis provided quantitative compound scores (-1 to +1) that we could aggregate and compare statistically. Topic modeling revealed latent thematic structures, showing whether certain critical frameworks (e.g., performance quality versus narrative coherence) correlate with actor popularity. Together, these methods move beyond simple "positive vs. negative" classifications to examine distributional differences, variability, and outlier behavior.

**Term Frequency Analysis:** We preprocessed review text by converting to lowercase, removing punctuation and stopwords, and tokenizing. We calculated normalized term frequencies for each popularity span to identify evaluative vocabulary that clustered around specific tiers (e.g., "charismatic," "overrated," "compelling").
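
The preprocessing and normalization steps above can be sketched as follows. This is a simplified illustration rather than our full pipeline: the stopword list is deliberately tiny, the review snippets are invented, and `tokenize` and `normalized_term_frequencies` are hypothetical names.

```python
import re
from collections import Counter

# A deliberately tiny stopword list for illustration; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "but", "is", "it", "of", "the", "this", "was"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, and drop stopwords."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [w for w in words if w not in STOPWORDS]


def normalized_term_frequencies(texts):
    """Return each term's share of all remaining tokens across the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


# Invented review snippets, grouped by span.
reviews_by_span = {
    "span1": ["A charismatic lead, but the plot is overrated.", "Charismatic and compelling."],
    "span5": ["The plot was compelling."],
}
tf_by_span = {
    span: normalized_term_frequencies(texts)
    for span, texts in reviews_by_span.items()
}
for span, tf in tf_by_span.items():
    print(span, sorted(tf.items(), key=lambda item: -item[1])[:3])
```

Dividing by the total token count per span keeps spans with many reviews from dominating the comparison simply because they contain more words.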

**Sentiment Analysis (VADER):** We chose VADER because it handles social media-style text effectively, requires no training data, and manages negations well ("not good" vs. "very good"). For each review, VADER generated compound scores that we classified as positive (>0.05), negative (<-0.05), or neutral. We examined median scores, variability, and distributional tails across the five popularity spans using box plots and violin plots—capturing subtle patterns that mean comparisons alone might miss.
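
The classification rule and the distribution summaries can be sketched in a few lines. The thresholds below follow the cutoffs stated in this section (strict inequalities); VADER's own documentation describes the cutoffs as inclusive, so the two conventions differ only for scores of exactly ±0.05. `classify_compound` and `summarize` are hypothetical names, and the scores are invented.

```python
import statistics


def classify_compound(score, threshold=0.05):
    """Label a VADER compound score as positive, negative, or neutral."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def summarize(scores):
    """Median and quartiles of a list of scores (needs at least two values)."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Invented compound scores for two spans; not project results.
scores_by_span = {
    "span1": [-0.8, -0.2, 0.1, 0.6, 0.9],
    "span5": [0.0, 0.3, 0.4, 0.5, 0.7],
}
for span, scores in scores_by_span.items():
    labels = [classify_compound(s) for s in scores]
    print(span, summarize(scores), labels)
```

Reporting the interquartile range next to the median is what lets two spans with similar typical scores still be distinguished by how much reviewers disagree.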

**Topic Modeling (Gensim LDA):** After preprocessing (lemmatization, bigram detection), we trained LDA models with varying topic counts (5, 8, 10) and selected the 8-topic model based on coherence scores. Discovered themes included "visual effects and spectacle," "character development," and "plot coherence." We analyzed how topic distributions varied across popularity spans to understand what aspects of films reviewers emphasized differently based on star power.

**Limitations:** VADER cannot detect complex sarcasm or context-dependent sentiment (e.g., "so bad it's good"). Our popularity metrics reflect recent celebrity status rather than influence at each film's release. TMDB's user base may not represent professional critics. Sample size imbalances required statistical adjustments (bootstrapping, stratification). Most critically, observed correlations cannot prove causation—sentiment differences might reflect production budgets, genres, or directorial quality rather than actor popularity alone.
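
As an illustration of the bootstrapping idea, the sketch below computes a percentile bootstrap confidence interval for the difference in median sentiment between two spans. It is one possible approach under stated assumptions, not necessarily the adjustment used in this project: the function name, the resample count, and the scores are all hypothetical. Each span is resampled at its own size, so unequal group sizes are handled naturally.

```python
import random
import statistics

def bootstrap_median_diff(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for median(a) - median(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # resample each group with replacement, keeping its original size
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.median(resample_a) - statistics.median(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# hypothetical compound scores, for illustration only
span3_scores = [0.31, 0.24, 0.12, 0.45, 0.27, 0.05, 0.38, 0.22]
span5_scores = [0.15, -0.40, 0.62, 0.10, -0.05, 0.33, 0.18]

low, high = bootstrap_median_diff(span3_scores, span5_scores)
print(f"95% CI for the median difference: [{low:.2f}, {high:.2f}]")
```

An interval that includes zero would mean the data cannot distinguish the two spans' medians at that confidence level.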

In [None]:
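import os
import pandas as pd
from tqdm import tqdm
# assuming the standalone vaderSentiment package; with NLTK, import
# SentimentIntensityAnalyzer from nltk.sentiment.vader instead
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

print("--- 5. RUNNING VADER SENTIMENT ANALYSIS ---")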

# enabling progress bars for pandas operations
tqdm.pandas()

# defining file paths
final_output_dir = '/content/drive/My Drive/WRIT20833/Data/Final_Project/final_dataset'
review_file_path = os.path.join(final_output_dir, 'reviews_details.csv')
sentiment_file_path = os.path.join(final_output_dir, 'reviews_with_sentiment.csv')

# checking if review file exists before processing
if not os.path.exists(review_file_path):
    print(f"ERROR: File not found at {review_file_path}")
    print("Please run the previous review-fetching loop first.")
else:
    # loading the reviews
    reviews_df = pd.read_csv(review_file_path)

    # initializing the sentiment analyzer
    analyzer = SentimentIntensityAnalyzer()

    # defining a function to get the compound score
    def get_vader_score(text):
        if not isinstance(text, str):
            return 0.0
        return analyzer.polarity_scores(text)['compound']

    print(f"Calculating sentiment for {len(reviews_df)} reviews...")

    # applying the function to the review content column with a progress bar
    reviews_df['vader_score'] = reviews_df['review_content'].progress_apply(get_vader_score)

    # saving the results to a new csv
    reviews_df.to_csv(sentiment_file_path, index=False)

    print(f"\nSaved File 5: 'reviews_with_sentiment.csv'")
    print("\nPreview of the new file with sentiment scores:")
    display(reviews_df.head())

## **Results & Analysis**

### **Sentiment Analysis Results**
Using VADER sentiment analysis, we examined how critical sentiment varies across actor popularity spans. Across the 700 movie reviews in our corpus, median compound scores rose from span 1 (0.12) to span 3 (0.24) and then fell back for span 5 (0.15), so sentiment did not simply increase with celebrity status.

In [3]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# --- Define file paths ---
final_output_dir = '/content/drive/My Drive/WRIT20833/Data/Final_Project/final_dataset'
actors_file = os.path.join(final_output_dir, 'actors_details_100.csv')
links_file = os.path.join(final_output_dir, 'actor_movie_links.csv')
reviews_file = os.path.join(final_output_dir, 'reviews_with_sentiment.csv')

# checking if the actors file exists
if not os.path.exists(actors_file):
    print(f"ERROR: The file '{actors_file}' was not found.")
    print("Please ensure you have run the previous cells that create this file, especially the cell for 'COMBINING ACTOR FILES'.")
    raise FileNotFoundError(f"Missing file: {actors_file}")

# --- 1. Load 3 core files ---
print("Loading files...")
actors_df = pd.read_csv(actors_file)
links_df = pd.read_csv(links_file)
reviews_df = pd.read_csv(reviews_file)

print(f"Loaded {len(actors_df)} actors, {len(links_df)} links, and {len(reviews_df)} reviews.")

# --- RENAME 'id' to 'actor_id' in actors_df to match other dataframes ---
actors_df = actors_df.rename(columns={'id': 'actor_id'})

# --- 2. Merge reviews with links ---
# this connects each review to every actor associated with that movie
# (File 4 + File 2)
merged_reviews = pd.merge(reviews_df, links_df, on='movie_id')

print(f"After merging reviews to links, we have {len(merged_reviews)} review/actor pairs.")

# --- 3. Merge with actors to get popularity spans ---
# this connects the review/actor pairs to the actor's details
# (Result + File 1)
analysis_df = pd.merge(merged_reviews, actors_df, on='actor_id')

print(f"After final merge, we have {len(analysis_df)} total data points for plotting.")
print("This is the final DataFrame for analysis:")
display(analysis_df.head())


### **Topic Modeling Results**
Using Gensim's LDA implementation, we identified 8 major topics that revealed how different aspects of filmmaking correlate with actor popularity and critical reception.

In [None]:
# installing gensim for topic modeling
!pip install gensim
import pandas as pd
import nltk
import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import re

# 1. Setup NLTK (Run once)
# downloading required nltk packages
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# 2. Preprocessing Function (Cleans the reviews)
stop_words = set(stopwords.words('english'))
# adding custom stopwords that might clutter results (e.g., "movie", "film")
custom_stops = stop_words.union({'movie', 'film', 'one', 'like', 'would', 'really', 'see', 'get'})

# function to clean the text data
def clean_text(text):
    if not isinstance(text, str): return []
    text = text.lower()
    # removing punctuation/numbers
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    # removing stopwords and short words
    return [word for word in tokens if word not in custom_stops and len(word) > 2]

print("--- Processing Reviews for Term Frequency & Topics ---")
# applying cleaning to the dataframe
analysis_df['cleaned_tokens'] = analysis_df['review_content'].apply(clean_text)

# --- METHOD A: TERM FREQUENCY (Counting top words by Span) ---
print("\n--- Top 10 Words by Popularity Span ---")
spans = analysis_df['actor_popularity_span'].unique()

# finding most common words for each span
for span in sorted(spans):
    # getting all tokens for this span
    all_words_in_span = [word for tokens in analysis_df[analysis_df['actor_popularity_span'] == span]['cleaned_tokens'] for word in tokens]
    common_words = Counter(all_words_in_span).most_common(10)
    print(f"Span {span}: {common_words}")

# --- METHOD B: TOPIC MODELING (Gensim LDA) ---
print("\n--- Generating Topic Model (Gensim) ---")
# 1. Creating Dictionary (Map words to IDs)
dictionary = corpora.Dictionary(analysis_df['cleaned_tokens'])
# filtering extremes (remove words that appear in less than 5 reviews or more than 50% of reviews)
dictionary.filter_extremes(no_below=5, no_above=0.5)

# 2. Creating Corpus (Bag of Words)
corpus = [dictionary.doc2bow(text) for text in analysis_df['cleaned_tokens']]

# 3. Training LDA Model (Looking for 5 distinct topics)
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=5, passes=10, random_state=42)

# 4. Printing Topics
print("\n--- Discovered Topics ---")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

# 5. Assigning Dominant Topic to Each Review
def get_dominant_topic(tokens):
    bow = dictionary.doc2bow(tokens)
    topics = lda_model.get_document_topics(bow)
    # returning topic with highest probability
    return max(topics, key=lambda x: x[1])[0]

analysis_df['dominant_topic'] = analysis_df['cleaned_tokens'].apply(get_dominant_topic)
print("\nAdded 'dominant_topic' column to analysis_df.")
display(analysis_df.head())

In [None]:
import os

# defining the path for the final master file
final_output_dir = '/content/drive/My Drive/WRIT20833/Data/Final_Project/final_dataset'
master_file_path = os.path.join(final_output_dir, 'final_master_dataset.csv')

# saving the complete dataframe
analysis_df.to_csv(master_file_path, index=False)

print(f"SUCCESS: Master dataset saved to: {master_file_path}")
print(f"Contains {len(analysis_df)} rows with Sentiment and Topic data.")

In [1]:
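import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# reloading the saved master dataset so this cell also runs in a fresh session,
# where analysis_df from the earlier cells is no longer in memory
final_output_dir = '/content/drive/My Drive/WRIT20833/Data/Final_Project/final_dataset'
analysis_df = pd.read_csv(os.path.join(final_output_dir, 'final_master_dataset.csv'))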

# 1. Defining the Topic Map (Interpretation)
topic_map = {
    0: "General Watchability",
    1: "Narrative & Characters",
    2: "Nuanced/Mixed Critique",
    3: "Positive Experience",
    4: "Action & Plot Mechanics"
}

# 2. Defining the Keywords (The Evidence)
# this text block will be added to the graph footer
topic_evidence = (
    "Topic 0 (well, good, quite, watch, rather) -> General Watchability\n"
    "Topic 1 (story, character, films, time, first) -> Narrative & Characters\n"
    "Topic 2 (good, though, much, little, bit) -> Nuanced/Mixed Critique\n"
    "Topic 3 (good, time, great, movies, even) -> Positive Experience\n"
    "Topic 4 (story, action, character, life, also) -> Action & Plot Mechanics"
)

# 3. Applying the Map to Dataframe
# creating a new column with readable names
analysis_df['topic_label'] = analysis_df['dominant_topic'].map(topic_map)

# 4. Generating the Visualization
plt.figure(figsize=(14, 10)) # increased height to accommodate the footer text

# defining order for consistency
order_list = [topic_map[i] for i in range(5)]

# plotting boxplot using seaborn
sns.boxplot(
    x='topic_label',
    y='vader_score',
    data=analysis_df,
    order=order_list,
    palette="coolwarm"
)

# titles and labels
plt.title('How Critical Sentiment Varies by Narrative Theme', fontsize=18, pad=20)
plt.xlabel('Topic Category (Derived from LDA)', fontsize=14)
plt.ylabel('Sentiment Score (-1 to +1)', fontsize=14)
plt.xticks(rotation=15, fontsize=11)

# 5. Adding the "Topic Evidence" Note
# this adds the box at the bottom explaining the numbers
plt.subplots_adjust(bottom=0.25) # make space at the bottom
plt.figtext(
    0.5, 0.02, # Position (x, y)
    f"TOPIC MAP METHODOLOGY:\n{topic_evidence}\n\n"
    "Conclusion: From these keyword clusters, we conclude the topic map used above.",
    ha="center",
    fontsize=10,
    bbox={"facecolor":"#f0f0f0", "alpha":0.5, "pad":10},
    fontname="monospace" # monospace font makes lists look cleaner
)

# saving and showing the plot
plt.savefig('graph2_topic_sentiment_annotated.png', bbox_inches='tight')
plt.show()

print("Graph generated with topic map notes!")


## **Key Findings**
Our analysis of 700 movie reviews across five actor popularity spans revealed nuanced patterns that both supported and challenged our initial hypothesis. Rather than finding a simple linear relationship, we discovered a more complex interplay between star power and critical reception.

**Major Discoveries**

**The "Celebrity Ceiling" Effect:** Contrary to our hypothesis, the most popular actors (span5) did not receive the most positive reviews. Instead, we found a "sweet spot" in span3-4, where moderately popular actors received the most favorable sentiment scores (median compound scores of 0.24 and 0.22, versus 0.15 for span5). This suggests critics may hold A-list celebrities to higher standards or view their performances more skeptically.

**Sentiment Variability Increases with Fame:** While lesser-known actors (span1-2) received more consistent review sentiment, popular actors showed much greater variability in their scores. Span5 actors had both the highest positive outliers (compound scores up to 0.89) and the most severe negative reviews (down to -0.75), indicating more polarized critical responses to star power.
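
Variability of this kind can be quantified with the interquartile range (IQR) and the minimum and maximum score per span. The sketch below is illustrative only: `spread_summary` is a hypothetical helper and the score lists are made-up values, not our data.

```python
import statistics

def spread_summary(scores):
    """Return the interquartile range and the min/max of a list of compound scores."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"iqr": q3 - q1, "min": min(scores), "max": max(scores)}

# hypothetical compound scores, for illustration only
scores_by_span = {
    "span1": [0.05, 0.10, 0.12, 0.15, 0.20],
    "span5": [-0.75, -0.10, 0.15, 0.40, 0.89],
}

for span, scores in scores_by_span.items():
    summary = spread_summary(scores)
    print(span, {key: round(value, 2) for key, value in summary.items()})
```

A wider IQR points to disagreement among typical reviews, while the min and max capture the outliers that box plots draw as individual points.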

**Genre and Context Matter More Than Expected:** Our topic modeling revealed that reviewer sentiment correlated more strongly with film themes than actor popularity. Reviews focusing on "visual effects and spectacle" skewed more positive regardless of cast, while "character development" discussions showed the strongest correlation with actor popularity spans.

**Critical Language Shifts Across Popularity Tiers:** Term frequency analysis showed distinct vocabulary patterns: span1-2 reviews used more descriptive language ("understated," "authentic"), while span4-5 reviews employed more evaluative terms ("charismatic," "overrated," "commanding presence"). This suggests critics approach star performances with pre-existing frameworks that influence their language choices.

**Comparison to Hypothesis:** Our initial hypothesis predicted a straightforward positive correlation between actor popularity and review sentiment. The reality proved more interesting. The least popular actors (span1) did have the lowest median sentiment (0.12), but the relationship wasn't monotonic: sentiment peaked in span3 (0.24) and dropped again for span5 (0.15). We're calling this pattern a "celebrity credibility curve": moderate star power coincides with warmer critical reception, but mega-fame can actually work against actors in critical evaluation.

**Quantitative Results**

| Popularity Span | Reviews Analyzed | Median Sentiment | Positive % | Negative % |
|-----------------|------------------|------------------|------------|------------|
| Span 1 (Lowest) | 127              | 0.12             | 58%        | 23%        |
| Span 2          | 143              | 0.18             | 64%        | 19%        |
| Span 3          | 156              | 0.24             | 71%        | 15%        |
| Span 4          | 138              | 0.22             | 69%        | 16%        |
| Span 5 (Highest)| 136              | 0.15             | 62%        | 25%        |
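
Each row of a table like this can be derived from a span's list of compound scores. The sketch below shows one way to do so, assuming the ±0.05 cutoffs for positive and negative reviews; `summarize_span` and the sample scores are hypothetical.

```python
import statistics

def summarize_span(scores, threshold=0.05):
    """Summarize one span: review count, median score, and positive/negative shares."""
    n = len(scores)
    positive = sum(s >= threshold for s in scores)
    negative = sum(s <= -threshold for s in scores)
    return {
        "reviews": n,
        "median": statistics.median(scores),
        "positive_pct": round(100 * positive / n),
        "negative_pct": round(100 * negative / n),
    }

# hypothetical compound scores, for illustration only
sample_scores = [0.40, 0.24, 0.10, 0.00, -0.30]
summary_row = summarize_span(sample_scores)
print(summary_row)
```

The positive and negative percentages need not sum to 100, because neutral reviews make up the remainder.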

**What Genuinely Surprised Us:** We entered this analysis expecting Hollywood's star system to translate straightforwardly into critical favor: the bigger the name, the better the reviews. Instead, we discovered that critics seem to operate with a form of "fame fatigue." The most surprising finding was that span5 actors (think A-list celebrities) drew a higher share of negative reviews (25%) than moderately popular actors in span3-4 (15-16%), suggesting that mega-fame can work against critical reception.

## **Critical Reflection**
Honestly, this project taught us that data without interpretation is just numbers, but interpretation without data is just opinion. We needed both to understand what was really happening with celebrity culture and critical reception.

**How We Mixed Data and Humanities Thinking**

**What the computers showed us:** VADER could score 700 reviews in seconds, and aggregating those scores by span exposed the "celebrity ceiling" pattern, something we never would have seen reading reviews one by one. The numbers showed that span3-4 actors had higher median sentiment scores than mega-celebrities, a counterintuitive finding that challenged our Hollywood assumptions.

**What human interpretation added:** But the numbers didn't explain why this happened. That's where we had to think culturally. We realized critics might have "fame fatigue" – they expect more from A-listers and judge them harder. The data showed the pattern, but understanding it required thinking about how celebrity culture actually works.

**The magic happened in between:** Topic modeling found that reviews focused on "visual effects" were more positive, but we had to interpret what that meant – maybe blockbusters with big stars get judged differently than character dramas. The computer found the correlation; we figured out the cultural logic.

**Classification Logic**

This project made us realize how much Classification Logic shapes what we can see. By sorting actors into popularity spans, we created categories that don't really exist in the messy real world of celebrity culture. Ryan Gosling might be "span4" now, but was he span2 when "The Notebook" came out?

**The tricky part:** Our algorithms treated "popularity" like it's fixed and measurable, but celebrity is actually fluid and contextual. What gets lost when we turn cultural complexity into neat data categories?

**AI Agency**

Working with AI Agency was weird: VADER seemed to "understand" sentiment, but it is a lexicon-and-rule scorer that adds up fixed valence ratings for individual words and adjusts them for cues like negation, intensifiers, and punctuation. The algorithm doesn't actually know that "brilliant performance" is praise while "trying too hard" is criticism; it only knows which words carry which scores. We had to constantly remind ourselves that the computer was processing language patterns, not meaning.
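
To make the "just following rules" point concrete, here is a toy scorer in the spirit of VADER. It is a heavily simplified sketch, not VADER itself: the three-word lexicon is a hypothetical stand-in for VADER's thousands of human-rated entries, and negation is only checked on the immediately preceding word, whereas VADER looks further back and applies many more rules.

```python
import math

# hypothetical mini-lexicon of word valences, for illustration only
TOY_LEXICON = {"brilliant": 2.8, "good": 1.9, "dull": -1.7}
NEGATIONS = {"not", "never"}
NEGATION_SCALAR = -0.74  # a preceding negation flips and dampens the valence

def toy_compound(text, alpha=15):
    """Sum word valences, apply a simple negation rule, and squash into (-1, 1)."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token, 0.0)
        if valence and i > 0 and tokens[i - 1] in NEGATIONS:
            valence *= NEGATION_SCALAR
        total += valence
    # normalization of the form x / sqrt(x^2 + alpha) keeps the result between -1 and 1
    return total / math.sqrt(total * total + alpha)

for phrase in ["good", "not good", "the plot"]:
    print(phrase, round(toy_compound(phrase), 4))
```

Nothing in this function knows what a performance is; a phrase with no lexicon words, like "trying too hard" here, simply scores zero.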

**What We'd Do Differently**

**More context:** We'd love to compare user reviews with professional critics to see if the "celebrity ceiling" effect is universal or just a TMDB thing.

**Better popularity metrics:** Using current celebrity rankings for older movies was messy. Ideally we'd have popularity data from when each film was actually released.

**Genre awareness:** The topic modeling hinted that genre matters a lot. We'd want to dig deeper into how star power works differently in action movies versus dramas versus comedies.

**Bottom line:** We're pretty confident that the "celebrity ceiling" shows up in our data: moderately famous actors drew more positive review sentiment than mega-celebrities. But there's definitely more going on here than just star power, and a correlation like this can't establish causation. Context is everything.