# Tutorial: Predicting the Decade of a Movie from Its Title

### Goal: Use machine learning (NLP) to predict the decade a movie was released in (e.g., 1950s, 1980s, 2000s) based only on the words in the title.

#### 1. Turn messy text into usable ML features with TF-IDF
#### 2. Train/test split and evaluate a model
#### 3. Interpret what words are predictive of different decades
#### 4. Build a simple function to make predictions on new titles

## Part 1. Setup & Feature Engineering

In [13]:
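import pandas as pd

# Load the dataset
df = pd.read_csv("movie_titles.csv", encoding="latin1", on_bad_lines="skip")
# The file has no header row, so pandas uses the first record as column names
# (visible in the printed Index below). Passing header=None would keep that row.
print(df.columns)
# Rename the columns
df.columns = ["ID", "Year", "Title"]
df.head()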

Index(['1', '2003', 'Dinosaur Planet'], dtype='object')


|   | ID | Year | Title |
|---|----|------|-------|
| 0 | 2 | 2004.0 | Isle of Man TT 2004 Review |
| 1 | 3 | 1997.0 | Character |
| 2 | 4 | 1994.0 | Paula Abdul's Get Up & Dance |
| 3 | 5 | 2004.0 | The Rise and Fall of ECW |
| 4 | 6 | 1997.0 | Sick |


In [None]:
# Drop the unnecessary ID column
df = df.drop(columns=["ID"])


In [18]:
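# Handle missing values: count them per column, then impute
# missing years with the most common year (the mode).
# Only Year is filled; a missing Title should not be replaced by a year.
print(df.isnull().sum())
df['Year'] = df['Year'].fillna(df['Year'].mode()[0])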

In [None]:
# create the target variable 'Decade'
df['Year'] = df['Year'].astype('Int64')
df['Decade'] = ((df['Year'] // 10) * 10).astype('int').astype('str') + 's'
df.head()

|   | Year | Title | Decade |
|---|------|-------|--------|
| 0 | 2004 | Isle of Man TT 2004 Review | 2000s |
| 1 | 1997 | Character | 1990s |
| 2 | 1994 | Paula Abdul's Get Up & Dance | 1990s |
| 3 | 2004 | The Rise and Fall of ECW | 2000s |
| 4 | 1997 | Sick | 1990s |
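
A quick sanity check of the decade arithmetic, run on a few hand-picked years rather than the real dataset. The `to_decade` helper is a hypothetical name used only for this illustration; it wraps the same integer-division step as the cell above.

```python
import pandas as pd

def to_decade(years: pd.Series) -> pd.Series:
    """Map release years to decade labels such as '1990s' (illustrative helper)."""
    years = years.astype("Int64")
    # Integer-divide by 10 to drop the last digit, then multiply back up
    return ((years // 10) * 10).astype("int").astype("str") + "s"

# Hand-picked years, not taken from movie_titles.csv
sample_years = pd.Series([1997.0, 2004.0, 1910.0, 1999.0])
sample_labels = to_decade(sample_years)
print(sample_labels.tolist())
```

Years at either end of a decade (1990 and 1999, for example) land in the same bucket, which is the behavior the target variable relies on.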


## Part 2. Text Preprocessing

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Features (X) = movie titles
X = df['Title']

# Target (y) = decade
y = df['Decade']

# Split into train/test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert titles to numeric features with TF-IDF so that ML models can use them.
# TF-IDF = Term Frequency-Inverse Document Frequency. It scores how important a word is to one document relative to the whole collection.
vectorizer = TfidfVectorizer(stop_words='english')

# fit: learns the vocabulary (and IDF weights) from the training titles only
# transform: converts each title into a numeric vector using that vocabulary
# X_train_tfidf is a sparse matrix of shape [num_titles, num_words]
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Each row = one title, each column = one word from the training vocabulary, each cell = how important that word is in that title.

## Part 3. Train a Classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

# Predict
y_pred = model.predict(X_test_tfidf)

# Evaluate
print(classification_report(y_test, y_pred))

# As covered earlier, here is how to read the classification report:
# precision: of the titles predicted as a given decade, what fraction were correct? High precision = few false positives.
# recall: of the titles that actually belong to a given decade, what fraction did the model catch? High recall = few false negatives.
# f1-score: harmonic mean of precision and recall (balances the two).
# support: number of test samples in that class (how many titles from that decade were in the test set).


# Accuracy: 0.41 (41%): the model is correct about 4 times out of 10. That is only slightly better than always guessing the largest class (2000s: 1354 of 3487 test titles, about 39%), so it is still a weak classifier.
# Macro avg (0.18 f1-score): every decade counts equally, so the near-zero scores on the small, older decades pull it down.
# Weighted avg (0.35 f1-score): each decade is weighted by its support, so it is dominated by the bigger classes (1990s and 2000s).

              precision    recall  f1-score   support

       1910s       1.00      0.33      0.50         3
       1920s       0.00      0.00      0.00        17
       1930s       0.50      0.02      0.04        44
       1940s       0.00      0.00      0.00        75
       1950s       0.00      0.00      0.00       100
       1960s       0.55      0.09      0.16       196
       1970s       0.44      0.04      0.08       266
       1980s       0.32      0.09      0.14       411
       1990s       0.37      0.35      0.36      1021
       2000s       0.43      0.75      0.55      1354

    accuracy                           0.41      3487
   macro avg       0.36      0.17      0.18      3487
weighted avg       0.39      0.41      0.35      3487
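
The summary rows can be re-derived by hand from the per-class rows. The sketch below copies the f1-score and support columns from the report above and recomputes the macro average, the weighted average, and the accuracy of a "always predict the largest class" baseline. The variable names are illustrative.

```python
# (f1-score, support) per decade, copied from the classification report
report_rows = {
    "1910s": (0.50, 3),
    "1920s": (0.00, 17),
    "1930s": (0.04, 44),
    "1940s": (0.00, 75),
    "1950s": (0.00, 100),
    "1960s": (0.16, 196),
    "1970s": (0.08, 266),
    "1980s": (0.14, 411),
    "1990s": (0.36, 1021),
    "2000s": (0.55, 1354),
}

total_support = sum(support for _, support in report_rows.values())

# Macro average: plain mean, every decade counts the same
macro_f1 = sum(f1 for f1, _ in report_rows.values()) / len(report_rows)

# Weighted average: each decade weighted by how many test titles it has
weighted_f1 = sum(f1 * support for f1, support in report_rows.values()) / total_support

# Accuracy of always predicting the most common decade in the test set
majority_baseline = max(support for _, support in report_rows.values()) / total_support

print(f"macro f1:          {macro_f1:.2f}")
print(f"weighted f1:       {weighted_f1:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

The gap between the two averages comes entirely from class imbalance: the 1990s and 2000s together make up roughly two thirds of the test set.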





## Part 4. Interpret the Model

In [27]:
import numpy as np

# Get the top words that push predictions toward each decade
feature_names = vectorizer.get_feature_names_out()

# model.coef_ has one row of word weights per class, in the same order as model.classes_
for decade, coef in zip(model.classes_, model.coef_):
    top10 = np.argsort(coef)[-10:]  # indices of the 10 largest weights, strongest last
    print(f"Top words for {decade}: {feature_names[top10]}")

Top words for 1890s: ['bonus' 'material' 'love' 'series' 'vol' 'live' 'season' 'brothers'
 'films' 'lumiere']
Top words for 1900s: ['love' 'series' 'vol' 'live' 'season' 'years' 'griffith' 'discovery'
 '1913' '1909']
Top words for 1910s: ['cook' 'felix' 'slapstick' 'intolerance' 'cabiria' 'comedies' 'essanay'
 'mutuals' 'vol' 'chaplin']
Top words for 1920s: ['faust' 'sparrows' 'cocoanuts' 'piccadilly' 'gold' 'storm' 'rush' 'ages'
 'silent' 'sheik']
Top words for 1930s: ['hardy' 'rogers' 'trail' 'angel' 'frankenstein' 'little' 'collection'
 'feathers' 'fields' 'stooges']
Top words for 1940s: ['rebecca' 'chan' 'val' 'lewton' 'stooges' 'sierra' 'bataan' 'fighting'
 'sherlock' 'holmes']
Top words for 1950s: ['beat' 'kiss' 'bonanza' 'earth' 'ikiru' 'outer' 'costello' 'abbott'
 'river' 'lucy']
Top words for 1960s: ['andy' 'avengers' 'dyke' 'dr' 'doctor' 'griffith' 'flintstones' 'zone'
 'vol' 'twilight']
Top words for 1970s: ['female' 'fingers' 'happening' 'cheerleaders' 'season' 'sanford'
 '

## Part 5. Try It Yourself

In [28]:

def predict_decade(title):
    vec = vectorizer.transform([title])
    return model.predict(vec)[0]

print(predict_decade("The Return of the King"))
print(predict_decade("Cowboys of the Dusty Trail"))


# predict_decade:
# 1. takes a raw title string,
# 2. converts it into numbers using the TF-IDF vocabulary learned in training,
# 3. asks the trained model which decade the title most likely belongs to,
# 4. returns that decade as a single label (e.g. "1990s").

1990s
2000s
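
Given the modest accuracy, a single label can hide how unsure the model is. One possible extension, sketched here on a tiny made-up training set so the block runs on its own, is to return the top few decades with their probabilities via `predict_proba`. The helper `top_decades`, the toy titles, and their decade labels are all assumptions for illustration; the probabilities it prints say nothing about the real model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented titles and labels, only so the example is self-contained
toy_titles = [
    "Silent Gold Rush", "Silent Storm",
    "Twilight Doctor", "Doctor Avengers",
    "Space Season One", "Space Live Concert",
]
toy_decades = ["1920s", "1920s", "1960s", "1960s", "2000s", "2000s"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(toy_titles), toy_decades)

def top_decades(title, vectorizer, model, k=2):
    """Return the k most probable decades for a title as (label, probability) pairs."""
    probs = model.predict_proba(vectorizer.transform([title]))[0]
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(model.classes_[i], float(probs[i])) for i in order]

ranked = top_decades("Silent Space", toy_vectorizer, toy_model)
print(ranked)
```

With the real `vectorizer` and `model` from this tutorial, the same helper could be called as `top_decades("The Return of the King", vectorizer, model)`.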
