# 🎬🤖 Tom Cruise Filmography Sentiment Analysis: Web Scraping, NLP, and Machine Learning Classification Model 

This project uses Python web scraping techniques to collect movie reviews from the IMDb website. It then applies text pre-processing and NLP methods to train and evaluate a machine learning model, with the goal of performing a comprehensive sentiment analysis. The model classifies user reviews as positive, neutral, or negative, allowing us to extract key insights into the public's reception of an actor's filmography.

# 1. Introduction 📘

This project is a comprehensive analysis of Tom Cruise's films, focusing on using data to understand audience sentiment. The core objective is to collect thousands of movie reviews from IMDb and apply machine learning techniques to determine if a review is positive, neutral, or negative.

**The project is structured into three main phases:**

*Data Acquisition:* We began by developing a Python-based web scraping tool to systematically extract movie titles, ratings, and user reviews directly from the IMDb website.

*Data Pre-processing:* Next, we cleaned and prepared the unstructured text data. This involved critical steps like tokenization, removing stopwords, and lemmatization to transform the raw reviews into a format a machine learning model can understand.

*Model Training & Analysis:* Finally, we used the cleaned data to train and evaluate two machine learning models, Logistic Regression and Naive Bayes, to classify sentiment. The ultimate goal is to find a model that can reliably predict sentiment and to gain insights into audience opinions.

# 2. Libraries 🐍


In [5]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import nltk

**Essential Libraries**

These are the libraries used at different stages of the project.

*requests:* Used to make HTTP requests and fetch the HTML of IMDb pages. It is the foundation of the web scraping step.

*beautifulsoup4:* Parses HTML and extracts data from it, making it easy to navigate a page's structure. It is essential for finding movie titles and reviews.

*pandas:* The standard library for data manipulation and analysis. We used it to structure the scraped data into a DataFrame and, later, for pre-processing.

*nltk:* A widely used library for Natural Language Processing (NLP). We used it to clean the review text by removing stopwords and performing lemmatization.

*scikit-learn:* The most popular machine learning library in Python. We used it in all modeling steps, from vectorization (TfidfVectorizer) to training (LogisticRegression, MultinomialNB) and evaluation (accuracy_score, classification_report).

*imbalanced-learn:* An extension of scikit-learn for handling imbalanced datasets. We used it with SMOTE to try to address the shortage of negative reviews.

In [None]:
pip install requests beautifulsoup4 pandas nltk scikit-learn imbalanced-learn

# 3. The Sentiment Analysis Project

The beginning of the project itself.

# Part I - Web Scraping

The first crucial step of this project was gathering the data. To perform a comprehensive sentiment analysis on Tom Cruise's filmography, we developed a custom web scraper using Python.

To start the web scraping process, we use the function below to get the full filmography of the actor Tom Cruise.

In [1]:
import requests
from bs4 import BeautifulSoup
import time
# Add any other libraries you are using here, like pandas

def get_filmography(actor_id):
    """Scrapes an actor's filmography and returns a list of movie IDs."""
    url = f"https://www.imdb.com/name/{actor_id}/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.5' # Added to force the English language
    }
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    movie_list = []
    
    # Finds all movie links with the correct class
    movie_links = soup.find_all('a', {'class': 'ipc-metadata-list-summary-item__t'})

    for link in movie_links:
        href = link.get('href')
        if href and '/title/' in href:
            movie_id = href.split('/')[2]
            movie_title = link.get_text(strip=True) # Extracts the text (title) from the link
            movie_list.append({'id': movie_id, 'title': movie_title})
            
    # Builds a dict keyed by movie ID so each ID appears only once
    unique_movies = {movie['id']: movie for movie in movie_list}.values()
            
    return list(unique_movies)

This tool navigates the IMDb website to systematically collect essential movie details and user reviews. The collected dataset serves as the foundation for the Natural Language Processing (NLP) and Machine Learning steps that follow, allowing us to build a model that classifies sentiment.

In [None]:
get_filmography("nm0000129")
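
The ID extraction in `get_filmography` relies on `href.split('/')[2]`, which only works for relative links of the form `/title/tt.../`. As a minimal sketch, a regular expression can pull the ID out of relative and absolute links alike. The helper name `extract_title_id` and the pattern are illustrative assumptions, not part of the original scraper.

```python
import re

# Matches IMDb title IDs such as "tt9603208" anywhere in a link path.
TITLE_ID_PATTERN = re.compile(r"/title/(tt\d+)")

def extract_title_id(href):
    """Return the IMDb title ID found in href, or None if there is none."""
    if not href:
        return None
    match = TITLE_ID_PATTERN.search(href)
    return match.group(1) if match else None

print(extract_title_id("/title/tt9603208/?ref_=nm_flmg_t_1"))
print(extract_title_id("https://www.imdb.com/title/tt5617712/"))
print(extract_title_id("/name/nm0000129/"))
```

Links that do not point to a title return `None`, so the caller can skip them without a separate `'/title/' in href` check.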

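Next, we use the `scrape_reviews` function to collect the review text for each movie. It reads only the reviews present in the HTML of the movie's reviews page, so it does not page through every review.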

In [2]:
def scrape_reviews(movie_id, movie_title):
    """
    Scrapes movie reviews and returns the data, including the title and body of the comment.
    Adjusted with the selectors we found.
    """
    reviews_url = f"https://www.imdb.com/title/{movie_id}/reviews"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    try:
        response = requests.get(reviews_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        review_containers = soup.find_all('article', {'class': 'user-review-item'})

        review_data = []
        for container in review_containers:
            # Rating: text content
            rating_el = container.select_one('.ipc-rating-star--rating')
            rating_text = rating_el.get_text(strip=True) if rating_el else None

            # Title: remove any <svg> inside, then get inner text
            title_el = container.select_one('.ipc-title__text')
            if title_el:
                for svg in title_el.find_all('svg'):
                    svg.decompose()
                title_text = title_el.get_text(strip=True)
            else:
                title_text = None

            # Review body: innerHTML (remove <br> first)
            review_el = container.select_one('.ipc-html-content-inner-div')
            if review_el:
                for br in review_el.find_all("br"):
                    br.decompose()
                review_inner_html = "".join(str(child) for child in review_el.contents)
            else:
                review_inner_html = None

            if review_inner_html:
                review_data.append({
                    'movie_id': movie_id,
                    'movie_title': movie_title,
                    'review_body': review_inner_html,   # innerHTML
                    'rating': rating_text,              # plain text
                    'title': title_text                 # plain text, svg removed
                })

        print(review_data)
        return review_data
    except Exception as e:
        print(f"Error scraping movie reviews for {movie_id}: {e}")
        return []

*Orchestrating the Scraping Process*

This code block represents the core of our scraping tool. It acts as the central control for the entire data acquisition process. By first calling the get_filmography function, the script identifies all of Tom Cruise's movies. After successfully locating the movie list, it systematically loops through each one, using the scrape_reviews function to collect user reviews. The time.sleep function is used here to ensure the process is polite and doesn't overload the IMDb servers, which is a key part of responsible web scraping. This phase efficiently gathers and organizes the raw data, setting the stage for the next analytical steps.

In [3]:
# --- main logic ---
if __name__ == "__main__":
    actor_id = "nm0000129" # Tom Cruise's ID
    print("Scraping Tom Cruise's filmography...")

    movies = get_filmography(actor_id)

    if movies:
        print(f"Found {len(movies)} movies. Starting review scraping...")

        all_reviews = []
        for movie in movies:
            movie_id = movie['id']
            movie_title = movie['title']
            print(f"Scraping reviews for movie: {movie_title} ({movie_id})")

            reviews = scrape_reviews(movie_id, movie_title)
            if reviews:
                all_reviews.extend(reviews)

            time.sleep(2) # Wait for 2 seconds to avoid overloading the server

        print("\nScraping complete!")
        print(f"Total reviews collected: {len(all_reviews)}")

Scraping Tom Cruise's filmography...
Found 30 movies. Starting review scraping...
Scraping reviews for movie: Untitled Alejandro G. Iñárritu Film (tt31450459)
[]
Scraping reviews for movie: Untitled Tom Cruise/SpaceX Project (tt15073568)
[]
Scraping reviews for movie: The Gauntlet (tt32491325)
[]
Scraping reviews for movie: Live Die Repeat and Repeat (tt5617712)
[]
Scraping reviews for movie: Broadsword (tt34715843)
[]
Scraping reviews for movie: Mission: Impossible - The Final Reckoning (tt9603208)
[{'movie_id': 'tt9603208', 'movie_title': 'Mission: Impossible - The Final Reckoning', 'review_body': "I'm going to sound negative because to be honest i expected a lot more from this movie.First of all the dialogue was unusually poor, i know that's not what the movie is for but it's noticeable and it broke the immersion for me. The beginning felt like an introduction, which might be necessary for first time mission impossible viewers, though quite repetitive for those who have seen Dead Re

In [6]:
# Converts the list of dictionaries to a pandas DataFrame.
df = pd.DataFrame(all_reviews)

In [7]:
# Saves the DataFrame to a CSV file
# The 'index=False' parameter prevents pandas from adding an index column
# The 'encoding' parameter helps handle special characters
df.to_csv('tom_cruise_reviews.csv', index=False, encoding='utf-8')

print("Data successfully saved to tom_cruise_reviews.csv")

Data successfully saved to tom_cruise_reviews.csv


# Part II - Data Pre-processing #

This phase is a crucial step in preparing the raw, unstructured text from our web scraping efforts for machine learning. The goal of data pre-processing is to transform the collected IMDb reviews into a clean, standardized format that our model can interpret. This involves several key steps of Natural Language Processing (NLP), including **tokenization** to break down sentences into individual words, removing common and irrelevant words known as **stopwords**, and normalizing the text by converting it all to lowercase. Lastly, **lemmatization** is applied to reduce words to their base form (e.g., "running" to "run"), ensuring consistency. This meticulous cleaning process is fundamental to ensuring the accuracy and effectiveness of our sentiment analysis model.

In [8]:
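import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
# Download the NLTK resources used below (only needed once per environment)
nltk.download('stopwords')
nltk.download('wordnet')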

In [9]:
# Load the CSV file you saved
df = pd.read_csv('tom_cruise_reviews.csv')

# Display the first few rows to verify
print(df.head())

    movie_id                                movie_title  \
0  tt9603208  Mission: Impossible - The Final Reckoning   
1  tt9603208  Mission: Impossible - The Final Reckoning   
2  tt9603208  Mission: Impossible - The Final Reckoning   
3  tt9603208  Mission: Impossible - The Final Reckoning   
4  tt9603208  Mission: Impossible - The Final Reckoning   

                                         review_body  rating  \
0  I'm going to sound negative because to be hone...     6.0   
1  It should be titled "Missing" Impossible. Ever...     6.0   
2  Mission Impossible: The Final Reckoning serves...     8.0   
3  Mission: Impossible - The Final Reckoning is a...     6.0   
4  Lamest movie in the series, if not ever! I unf...     NaN   

                                               title  
0          Great action, lacked proper story telling  
1    How can a 3-hour movie have so many plot holes?  
2               A goodbye that doesn't feel like one  
3  A couple of amazing scenes, but so mu

# Pre-processing Functions #

This code will prepare your raw text data by cleaning and standardizing it for the machine learning model.

In [11]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# ----------------------------------------------

def preprocess_text(text):
    """
    Cleans and prepares a text string for machine learning.
    Steps include: lowercasing, removing special characters, removing stopwords, and lemmatization.
    """
    # Convert text to lowercase
    text = text.lower()
    
    # Remove punctuation, numbers, and special characters
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Initialize the lemmatizer and the set of English stopwords
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    
    # Remove stopwords and apply lemmatization
    # Lemmatization reduces words to their base form (e.g., 'running' -> 'run')
    words = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]
    
    # Join the words back into a single string
    processed_text = ' '.join(words)
    
    return processed_text
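
Tokens such as 'aboutbesides' and 'absencemonica' in the vocabulary printed later suggest that some words get glued together during cleaning. Two likely causes are `<br>` tags being dropped without leaving a space, and punctuation being deleted rather than replaced by a space. The sketch below shows one way to avoid both, using only the standard library. It assumes the review body still contains its raw HTML, including `<br>` tags, and the helper name `clean_review_html` is hypothetical. Its output could then go through the stopword and lemmatization steps of `preprocess_text`.

```python
import html
import re

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")
NON_LETTERS = re.compile(r"[^a-z\s]")

def clean_review_html(raw):
    """Turn a scraped review body into lowercase, letters-only text."""
    if not isinstance(raw, str):       # covers None and NaN (float) cells
        return ""
    text = BR_TAG.sub(" ", raw)        # line breaks become spaces, not nothing
    text = ANY_TAG.sub(" ", text)      # drop any remaining tags
    text = html.unescape(text)         # decode entities such as &#39;
    text = text.lower()
    text = NON_LETTERS.sub(" ", text)  # punctuation and digits become spaces
    return " ".join(text.split())      # collapse repeated whitespace

print(clean_review_html("Great cast.<br/><br/>Besides, it&#39;s fun!"))
```

Replacing characters with a space instead of an empty string keeps "cast." and "Besides" as two separate tokens.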

# The Importance of Pre-processing for Machine Learning #

**Tokenization** is the foundational step of all text pre-processing. It's the process of breaking down a large block of text into smaller, meaningful units called tokens. In most cases, these tokens are individual words.

    Why it's important: A machine learning model cannot directly process a full sentence or paragraph. By tokenizing the text, we create a list of words, which serves as the fundamental input for all subsequent steps. It's the first step in turning a sentence into data.

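**Stopwords** are very common words in a language, such as "the," "a," "is," "and," and "in." While grammatically necessary for humans, they usually do not carry significant meaning or sentiment on their own. Removing them shrinks the vocabulary and lets the model focus on the words that do carry sentiment. One caveat: NLTK's English stopword list also contains negations such as "not" and "no", so removing stopwords can erase the difference between "good" and "not good".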

**Lemmatization** is the process of reducing a word to its base or root form, known as its lemma. For example, the words "running," "ran," and "runs" all have the same lemma: "run."

    Why it's important: Without lemmatization, the model would treat "runs" and "ran" as two completely different features. By reducing them to a single base form, we consolidate different word forms into one. This prevents data from being diluted, improves the model's ability to learn patterns, and reduces the overall dimensionality of the dataset, leading to better and more generalized performance.
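The three steps described above can be traced on a single sentence. This is a toy sketch in plain Python: the stopword set and the lemma lookup table are tiny, hand-written stand-ins (assumptions for illustration), whereas the notebook uses NLTK's stopword list and `WordNetLemmatizer`.

```python
# Hand-written stand-ins for NLTK's stopword list and lemmatizer.
TOY_STOPWORDS = {"the", "a", "is", "and", "in"}
TOY_LEMMAS = {"running": "run", "ran": "run", "runs": "run"}

def toy_pipeline(sentence):
    """Tokenize, drop stopwords, then map each word to its lemma."""
    tokens = sentence.lower().split()                      # tokenization
    kept = [t for t in tokens if t not in TOY_STOPWORDS]   # stopword removal
    return [TOY_LEMMAS.get(t, t) for t in kept]            # lemmatization

print(toy_pipeline("The hero runs and the villain ran"))
```

After the pipeline, "runs" and "ran" end up as the same feature, which is the consolidation effect described above.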

*Applying the Pre-processing*

This is the step that connects your data collection efforts with the cleaning stage. The goal is to apply the pre-processing function we've already created to the reviews column of the DataFrame.

Let's see how it went:

In [13]:
# Applies the pre-processing function to the text column
df['processed_review'] = df['review_body'].apply(preprocess_text)

# Shows the result
print(df[['review_body', 'processed_review']].head())

                                         review_body  \
0  I'm going to sound negative because to be hone...   
1  It should be titled "Missing" Impossible. Ever...   
2  Mission Impossible: The Final Reckoning serves...   
3  Mission: Impossible - The Final Reckoning is a...   
4  Lamest movie in the series, if not ever! I unf...   

                                    processed_review  
0  im going sound negative honest expected lot mo...  
1  titled missing impossible everything love mi f...  
2  mission impossible final reckoning serf grande...  
3  mission impossible final reckoning bit disappo...  
4  lamest movie series ever unfortunately spent m...  


# Part III - Vectorization #
For this stage, I chose **TF-IDF** because it down-weights words that appear in almost every review and gives more weight to words that are distinctive for a particular review.

**What Is TF-IDF?**

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical technique that reflects how important a word is to a document within a collection of documents. It is made up of two parts:

Term Frequency (TF): This is simply a count of how many times a word appears in a single review. If the word "amazing" appears 5 times in a comment, its TF will be high for that comment.

Inverse Document Frequency (IDF): This measures how "rare" a word is across your entire dataset. Very common words like "movie" or "film" will have a low IDF, while rarer, sentiment-expressing words like "breathtaking" or "disappointing" will have a high IDF.

The multiplication of these two values (TF * IDF) results in a high weight for words that are important in a specific comment but not very common across all comments.
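To make the TF * IDF product concrete, here is a small worked sketch on a hypothetical three-review corpus, using the textbook formula `idf = ln(N / df)`. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF and L2-normalizes each row by default, so its numbers will differ from these, but the ranking intuition is the same.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of already pre-processed reviews.
docs = [
    "amazing movie amazing stunts",
    "boring movie",
    "movie with breathtaking stunts",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many reviews does each word appear?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: raw count times ln(N / document frequency)."""
    tf = tokens.count(term)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

for term in ("movie", "stunts", "amazing"):
    print(term, round(tf_idf(term, tokenized[0]), 3))
```

In the first review, "movie" gets a weight of zero because it appears in every review, while "amazing" gets the highest weight because it is frequent in that review and absent from the others.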

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
# 1. Create the vectorizer
# max_features limits the number of words (features) in your model, which is a good practice
vectorizer = TfidfVectorizer(max_features=5000)

# 2. 'Fit' and 'Transform' the data
# The 'fit_transform' method does both things at once
# It learns the vocabulary (all the words) and then transforms the text into a TF-IDF matrix
X = vectorizer.fit_transform(df['processed_review'])

In [16]:
# The result 'X' is a sparse matrix, which is more efficient
# To see the shape, you can print:
print(X.shape)
# The shape will be (number of reviews, number of words)

(407, 5000)


In [17]:
# Optional: To see the words that the model learned
feature_names = vectorizer.get_feature_names_out()
print("The first 20 words in the vocabulary are:")
print(feature_names[:20])

The first 20 words in the vocabulary are:
['aaron' 'ability' 'able' 'ably' 'aboard' 'abolished' 'abomination'
 'abound' 'aboutbesides' 'abraham' 'abrams' 'abramsin' 'abroad' 'absence'
 'absencemonica' 'absolute' 'absolutely' 'absurdity' 'academy' 'accent']


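The list above is the start of the vocabulary that the TF-IDF vectorizer built from the pre-processed text. `get_feature_names_out()` returns the terms in alphabetical order, matching the column order of the feature matrix.

The words contain no punctuation, numbers, or stopwords, which confirms that the main cleaning steps worked. The list also reveals a flaw: tokens such as 'aboutbesides', 'abramsin', and 'absencemonica' are two words glued together. This most likely happens because `<br>` tags and punctuation are deleted without leaving a space behind, so the last word of one sentence merges with the first word of the next.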

Your text is now ready for the most exciting part: training the machine learning model for sentiment analysis.

# Part IV - Sentiment Analysis #

**Defining the Sentiment Classes**

The DataFrame contains ratings from 1 to 10. We'll decide how to group these ratings into sentiment categories. The most common way is to divide them into three classes: Positive, Neutral, and Negative.

Here is a suggested classification rule, which is widely used in similar projects:

Ratings 1-4: Negative Sentiment

Ratings 5-7: Neutral Sentiment

Ratings 8-10: Positive Sentiment

In [18]:
# Define the function to classify the rating into a sentiment
# Caveat: a missing rating (NaN) fails both comparisons, so it falls through to 'Negative'
def classify_sentiment(rating):
    if rating >= 8:
        return 'Positive'
    elif rating >= 5:
        return 'Neutral'
    else:
        return 'Negative'

# Apply the function to create the new 'sentiment' column
df['sentiment'] = df['rating'].apply(classify_sentiment)

# Display the result to verify
print(df[['rating', 'sentiment']].head(10))

   rating sentiment
0     6.0   Neutral
1     6.0   Neutral
2     8.0  Positive
3     6.0   Neutral
4     NaN  Negative
5     9.0  Positive
6     8.0  Positive
7     6.0   Neutral
8     4.0  Negative
9     6.0   Neutral
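
Row 4 in the output above has no rating (`NaN`) yet is labeled Negative. Any comparison with `NaN` evaluates to `False`, so the function falls through to the `else` branch. A minimal sketch of a variant that leaves such reviews unlabeled instead is shown below; the name `classify_sentiment_safe` is hypothetical and the thresholds are the same as in the original function.

```python
import math

def classify_sentiment_safe(rating):
    """Map a 1-10 rating to a sentiment label; return None if it is missing."""
    if rating is None or (isinstance(rating, float) and math.isnan(rating)):
        return None
    if rating >= 8:
        return 'Positive'
    if rating >= 5:
        return 'Neutral'
    return 'Negative'

for value in (9.0, 6.0, 4.0, float('nan')):
    print(value, classify_sentiment_safe(value))
```

In the notebook, one option would be to drop unrated reviews first with `df.dropna(subset=['rating'])`, and to do so before vectorizing so that the rows of `X` and `y` stay aligned.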


*Data Separation*

Now that you have your sentiment classes, you need to split your dataset into training and testing data.

This step is fundamental. The model can only be trained with the training set. It is then evaluated on the testing set to ensure that it has not just "memorized" the data, but has learned to generalize to new comments.

We will need the X matrix (your vectorized reviews) and the new y column (the sentiments).

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Pre-processing and vectorization have already been run, so the X matrix
# (the vectorized reviews) and the 'sentiment' column of df are available.

# Let's define the y (target) variable; X (features) already exists
y = df['sentiment']

# Split the data into 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
# Display the shape of the matrix to confirm
print(f"Shape of the X matrix: {X.shape}") # X.shape[0] is the number of rows
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Shape of the X matrix: (407, 5000)
Training set size: 325 samples
Testing set size: 82 samples
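
One caveat with the flow above: the vectorizer was fitted on all 407 reviews before the split, so the IDF weights were computed with the test reviews included. A common alternative is to split the raw text first, fit the vectorizer on the training part only, and use `stratify` so that both parts keep the same class proportions. The sketch below illustrates this on a hypothetical toy corpus standing in for `df['processed_review']` and `df['sentiment']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the labels alternate Positive / Negative.
toy_texts = [
    "amazing stunt great action",
    "boring plot weak dialogue",
    "great cast amazing pace",
    "weak story boring ending",
    "brilliant action great fun",
    "terrible script boring film",
    "amazing film brilliant cast",
    "weak acting terrible pace",
    "great story brilliant ending",
    "terrible plot weak film",
]
toy_labels = ["Positive", "Negative"] * 5

# Split the raw text first, keeping class proportions equal in both parts.
train_texts, test_texts, y_train_toy, y_test_toy = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Learn the vocabulary and IDF weights from the training text only.
toy_vectorizer = TfidfVectorizer()
X_train_toy = toy_vectorizer.fit_transform(train_texts)
X_test_toy = toy_vectorizer.transform(test_texts)  # reuse the fitted vocabulary

print(X_train_toy.shape, X_test_toy.shape)
```

Both matrices share the same columns because the test text is transformed with the vocabulary learned from the training text.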


# Part V - Training the Classification Model #

This is where the magic happens. Let's train a machine learning model that will learn to classify the sentiment of a review.

For sentiment analysis, we'll use a supervised classification algorithm. A good choice to start with is Logistic Regression (LogisticRegression), as it's simple, efficient, and works very well with text data.

In [21]:
from sklearn.linear_model import LogisticRegression

In [22]:
# 1. Create a model instance (our "classifier")
model = LogisticRegression(max_iter=1000)
# The max_iter=1000 parameter ensures the model has enough time to converge,
# which is a good practice when working with larger datasets.

# 2. Train the model with the training data
model.fit(X_train, y_train)

print("The model has been successfully trained!")

The model has been successfully trained!


After training, the model is ready to make predictions on the test set (X_test).

In [23]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# The y_pred variable now contains the model's predictions
# (e.g., ['Positive', 'Negative', 'Neutral', ...])

In [24]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1. Calculate the accuracy (the share of predictions that were correct)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}\n")

# 2. Generate the complete classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

# 3. Generate the confusion matrix to see the errors in detail
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

Model accuracy: 0.65

Classification Report:
               precision    recall  f1-score   support

    Negative       0.00      0.00      0.00        13
     Neutral       0.62      0.87      0.72        39
    Positive       0.70      0.63      0.67        30

    accuracy                           0.65        82
   macro avg       0.44      0.50      0.46        82
weighted avg       0.55      0.65      0.59        82

Confusion Matrix:
 [[ 0 10  3]
 [ 0 34  5]
 [ 0 11 19]]




**Classification Report Analysis**

The classification report tells a detailed story about your model's performance on a class-by-class basis.

Positive (0.70 Precision): Of all the times the model predicted "Positive," it was correct 70% of the time (19 of 27 predictions). This shows a reasonable ability to identify positive reviews.

Neutral (0.62 Precision): When the model predicted "Neutral," it was correct 62% of the time (34 of 55 predictions). Its recall is high at 0.87, partly because the model labels most reviews as Neutral.

Negative (0.00 Precision): This is the most critical issue. The confusion matrix shows that the model never predicted "Negative" at all: its first column is all zeros. With no Negative predictions, precision is undefined, and scikit-learn reports it as 0.00. The recall of 0.00 means that none of the 13 actually negative reviews were identified.

The model sends every negative review to another class: 10 of the 13 were labeled Neutral and 3 Positive. This is a classic sign of class imbalance, where the model has too few negative examples to learn from.

In short, Logistic Regression does a reasonable job with the Neutral and Positive comments, but fails completely to identify the Negative ones. This usually happens when there are too few samples for one of the classes, which may be the case here.
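
The per-class figures in the report can be recomputed by hand from the confusion matrix printed above, which is a useful check when reading such reports. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, both in alphabetical order. The helper name `per_class_metrics` is an assumption for illustration.

```python
# Confusion matrix from the Logistic Regression run above.
# Rows = true class, columns = predicted class.
labels = ["Negative", "Neutral", "Positive"]
conf = [
    [0, 10, 3],
    [0, 34, 5],
    [0, 11, 19],
]

def per_class_metrics(matrix, index):
    """Return (precision, recall) for one class; None when undefined."""
    true_positives = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    precision = true_positives / predicted_total if predicted_total else None
    recall = true_positives / actual_total if actual_total else None
    return precision, recall

for i, label in enumerate(labels):
    print(label, per_class_metrics(conf, i))

correct = sum(conf[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in conf)
print("accuracy", round(correct / total, 2))
```

For the Negative class the column sum is zero, so precision has no defined value; the report shows 0.00 in that case.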

Now that we have this detailed analysis, we can try the **Naive Bayes model**, to see if it handles this difficulty better.

**Naive Bayes Model**

The Naive Bayes model is a very popular choice for text classification tasks like this, and for good reason. Despite its "naive" assumption, it often performs surprisingly well, especially for this type of problem.

Here’s why it’s a good fit for your sentiment analysis model:

*Simplicity and Speed:* Naive Bayes is an extremely fast and easy-to-implement algorithm. The training process is based on simple probability calculations, which means it requires less computational power and time to train compared to more complex models. This makes it a perfect baseline to establish the performance of your model.

*Handles High-Dimensional Data:* When you vectorized your reviews with TF-IDF, you created a matrix with potentially thousands of columns (words). This high-dimensionality can be a challenge for some algorithms, but Naive Bayes handles it with ease. It's designed to work efficiently with a large number of features, which is exactly what you have.

*It's Probabilistic:* The model works by calculating the probability of a given word belonging to a certain class (e.g., "amazing" appearing in a positive review). The final prediction is based on which class has the highest probability. This probabilistic nature is a natural fit for sentiment analysis, where you're trying to determine the likelihood of a comment being positive, negative, or neutral.

*Effective with a Bag-of-Words Approach:* Naive Bayes performs very well with the Bag-of-Words approach (which is what TF-IDF essentially represents). It assumes that the presence of a word in a review is independent of the presence of other words. While this is an oversimplification, it works exceptionally well in practice for text classification because the position of a word often doesn't matter as much as its presence.

In summary, the Naive Bayes model offers a fast, scalable, and effective option for this project. It's a good baseline because it produces a solid result quickly, which can then be compared against other, more complex models.

In [25]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assuming you already have the datasets
# X_train, X_test, y_train, y_test

# 1. Create a Naive Bayes model instance
naive_bayes_model = MultinomialNB()

# 2. Train the model with the training data
naive_bayes_model.fit(X_train, y_train)

print("The Naive Bayes model has been successfully trained!")

The Naive Bayes model has been successfully trained!


In [26]:
# 3. Make predictions on the test set
y_pred_nb = naive_bayes_model.predict(X_test)

# 4. Evaluate the model
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"\nNaive Bayes model accuracy: {accuracy_nb:.2f}\n")

report_nb = classification_report(y_test, y_pred_nb)
print("Classification Report:\n", report_nb)

conf_matrix_nb = confusion_matrix(y_test, y_pred_nb)
print("Confusion Matrix:\n", conf_matrix_nb)


Naive Bayes model accuracy: 0.61

Classification Report:
               precision    recall  f1-score   support

    Negative       0.00      0.00      0.00        13
     Neutral       0.57      0.92      0.71        39
    Positive       0.74      0.47      0.57        30

    accuracy                           0.61        82
   macro avg       0.44      0.46      0.43        82
weighted avg       0.54      0.61      0.54        82

Confusion Matrix:
 [[ 0 11  2]
 [ 0 36  3]
 [ 0 16 14]]




Naive Bayes shows the same failure: its accuracy is slightly lower (0.61 versus 0.65), and it also never predicts the Negative class.

**What happened:** The most likely root cause of the problem is **class imbalance**. The model is struggling because the number of negative comments is much smaller than the number of neutral and positive comments. It's like trying to teach a student about a subject with only a handful of examples.
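
The size of the imbalance can be read from the `support` column of the reports above, which gives the number of test reviews per class. The sketch below turns those figures into shares. The training-set counts are not shown in the notebook; there, `y_train.value_counts()` would give the equivalent numbers.

```python
from collections import Counter

# Test-set support taken from the classification reports above.
test_labels = ["Negative"] * 13 + ["Neutral"] * 39 + ["Positive"] * 30

counts = Counter(test_labels)
shares = {label: count / len(test_labels) for label, count in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} reviews ({share:.1%})")
```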

We can try SMOTE (oversampling) as a solution.
SMOTE (Synthetic Minority Over-sampling Technique) is one of the most widely used techniques for addressing class imbalance.

**What SMOTE Does**

SMOTE doesn't just duplicate existing negative reviews. It works on the feature vectors rather than on the text: for a negative review, it picks one of that review's nearest negative neighbors in TF-IDF space and creates a new, "synthetic" vector somewhere on the line between the two. This increases the number of examples in the "Negative" class without simply copying and pasting the data.

This gives both Logistic Regression and Naive Bayes more examples from which to learn what distinguishes a negative comment, and the hope is that the resulting model will recognize all three classes.
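
The core of SMOTE is a linear interpolation between a minority-class sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step on two hypothetical three-feature vectors; the real `SMOTE` class in imbalanced-learn also performs the nearest-neighbor search (5 neighbors by default) and repeats the step until the classes are balanced.

```python
import random

def interpolate(sample, neighbor, rng):
    """Create a synthetic point on the segment between sample and neighbor."""
    lam = rng.random()  # random position between 0 (sample) and 1 (neighbor)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)

# Two hypothetical TF-IDF vectors of negative reviews.
review_a = [0.0, 0.4, 0.0]
review_b = [0.2, 0.0, 0.6]

synthetic = interpolate(review_a, review_b, rng)
print(synthetic)
```

Every coordinate of the synthetic vector lies between the corresponding coordinates of the two originals, which is why SMOTE can only produce examples that resemble the ones it already has.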

**Implementation**
SMOTE is applied only to the training set. The test set must remain imbalanced, as it needs to reflect the real-world data distribution.

Run the code below after the data splitting step, and then run the model training again on the resampled data.

In [27]:
from imblearn.over_sampling import SMOTE

# Create a SMOTE instance
# random_state ensures that the results are reproducible
smote = SMOTE(random_state=42)

# Apply SMOTE to generate new samples in the training data only
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Check the new size of the sets
# The label must match the value produced by classify_sentiment ('Negative')
print("Original size of the 'Negative' class in the training set:", (y_train == 'Negative').sum())
print("New size of the 'Negative' class after SMOTE:", (y_train_resampled == 'Negative').sum())


*Let's test Logistic Regression again, this time trained on the SMOTE-balanced training set.*

In [28]:
# Let's test logistic regression again
model = LogisticRegression(max_iter=1000)
# The max_iter=1000 parameter ensures the model has enough time to converge,
# which is a good practice when working with larger datasets.

# Train the model using the SMOTE-balanced training data
model.fit(X_train_resampled, y_train_resampled)

print("The model has been successfully trained!")

The model has been successfully trained!


In [29]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [30]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1. Calculate the accuracy (the share of predictions that were correct)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}\n")

# 2. Generate the complete classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

# 3. Generate the confusion matrix to see the errors in detail
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

Model accuracy: 0.63

Classification Report:
               precision    recall  f1-score   support

    Negative       0.33      0.08      0.12        13
     Neutral       0.65      0.77      0.71        39
    Positive       0.64      0.70      0.67        30

    accuracy                           0.63        82
   macro avg       0.54      0.52      0.50        82
weighted avg       0.60      0.63      0.60        82

Confusion Matrix:
 [[ 1  8  4]
 [ 1 30  8]
 [ 1  8 21]]


# Results #

After implementing SMOTE oversampling, the model is still barely able to identify negative reviews: it caught only 1 of the 13 in the test set (recall 0.08), and overall accuracy slipped from 0.65 to 0.63. This is a crucial and often frustrating lesson in data science. Even powerful techniques can't solve all problems.

**The Challenge of Real-World Data**

The reason the model is still struggling, despite using a widely used technique like SMOTE, is a classic problem with real-world data: the quality and nature of the initial dataset.

SMOTE works by creating synthetic examples based on the data you provide. However, if the original sample of negative reviews is too small and lacks diversity, SMOTE has little to work with. It can only generate new data that's similar to the data it's given. Since the negative reviews are so few and far between, the algorithm doesn't have a good enough "map" to create new, useful examples.

This highlights a fundamental truth in machine learning: your model is only as good as the data you train it on. Oversampling can't fix a fundamentally flawed or heavily skewed dataset.

# Conclusion #

**The Core Problem: Class Imbalance**

Our analysis of the model's performance reveals that the primary issue was not a failure of the algorithm, but a fundamental problem with the data itself: class imbalance.

The model struggled to classify negative reviews because the number of negative examples was far too small compared to the positive and neutral reviews. In a machine learning context, this creates a strong bias. The model, in its attempt to be as accurate as possible, simply learned the "safest" strategy: to ignore the minority "negative" class and focus on correctly predicting the far more common "positive" and "neutral" reviews.

**The Final Conclusion**

The logical conclusion of this analysis is that, based on the 407 reviews we scraped from IMDb, negative reviews for Tom Cruise's films are clearly in the minority: only 13 of the 82 reviews in the test set carried a Negative label. It seems the audience consensus leans from neutral to positive.

Perhaps if we had chosen a different actor, we would have found more negative reviews for our model to learn from. But for now, our data suggests a single, undeniable conclusion: 
**Tom Cruise simply doesn't make bad movies.**

That's it.