# Letterboxd Movie Recommender — Notebook

This notebook is a **complete, runnable starter** for the Letterboxd recommender project. It follows the architecture: Data load → Cleaning → TF-IDF (content) → LDA → Sentiment → Collaborative (SVD/item-item) → Hybrid recommendation. 

Place your CSV files in `letterboxd-movie-ratings-data/` with filenames:
- `movie_data.csv`
- `ratings_export.csv`
- `users_export.csv`

Run the cells in order. The notebook contains fallbacks so it runs even if some optional packages are missing.
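
Before running the load cells, it can help to confirm that the three files are where the notebook expects them. The following is a minimal sketch, assuming the directory and filenames listed above; `missing_inputs` is a hypothetical helper name and is not part of the original notebook.

```python
from pathlib import Path

# Directory and filenames as listed above
DATA_DIR = Path("letterboxd-movie-ratings-data")
EXPECTED_FILES = ["movie_data.csv", "ratings_export.csv", "users_export.csv"]

def missing_inputs(data_dir=DATA_DIR, expected=EXPECTED_FILES):
    """Return the expected filenames that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]

missing = missing_inputs()
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files found.")
```

If anything is reported missing, fix the folder layout before continuing, since the later cells assume all three files load.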

## 1) Setup — Install dependencies (run once)

Run the following cell to install the required packages. The `pip` commands are active as written, so skip this cell if your environment (for example Colab) already provides the packages.
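
To decide whether the install cell is needed at all, one option is to check which modules can already be found. This is a sketch only: `find_missing` is a hypothetical helper, and the list uses import names (which differ from pip names for scikit-learn).

```python
import importlib.util

# Import names, not pip names: scikit-learn is imported as "sklearn"
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "nltk", "gensim", "textblob",
                    "wordcloud", "matplotlib", "seaborn", "tqdm", "tmdbv3api"]

def find_missing(modules):
    """Return the top-level module names that cannot be located."""
    return [m for m in modules if importlib.util.find_spec(m) is None]

missing = find_missing(REQUIRED_MODULES)
print("Missing packages:", missing if missing else "none")
```

An empty result suggests the install cell can be skipped.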


In [52]:
!pip install pandas numpy scikit-learn nltk gensim textblob wordcloud matplotlib seaborn tqdm
!pip install tmdbv3api  

print("Skip the installs if your environment already has these packages.")

Skip the installs if your environment already has these packages.


## 2) Imports

In [53]:
!pip install gensim TextBlob



In [None]:
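import re
import nltk
import pandas as pd

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Needs the NLTK corpora from the download cell; on a fresh machine run that cell first.
STOPWORDS = set(stopwords.words('english'))
LEMM = WordNetLemmatizer()

# Optional dependency flag (assumed here to refer to scikit-surprise, imported as `surprise`)
try:
    import surprise
    HAVE_SURPRISE = True
except ImportError:
    HAVE_SURPRISE = False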


In [55]:
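import nltk

# One-time NLTK resource downloads; packages already present are reported as up-to-date
nltk.download('punkt')      # tokenizer models for word_tokenize
nltk.download('stopwords')  # English stopword list used for STOPWORDS
nltk.download('omw-1.4')    # multilingual WordNet data used by WordNetLemmatizer
nltk.download('punkt_tab')  # tokenizer tables required by recent NLTK versions
nltk.download('wordnet')    # WordNet database for lemmatization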

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 3) Load CSV files

Adjust the path if needed.
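
If a file has encoding problems or malformed rows, a more defensive reader can be swapped in for plain `pd.read_csv`. This is only a sketch: `safe_read_csv` is a hypothetical helper, and both the encoding order and the choice to skip bad lines are assumptions, not something the original notebook specifies.

```python
import pandas as pd

def safe_read_csv(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn; skip malformed rows instead of failing."""
    last_error = None
    for enc in encodings:
        try:
            # on_bad_lines="skip" drops rows with too many fields (pandas >= 1.3)
            return pd.read_csv(path, encoding=enc, on_bad_lines="skip")
        except UnicodeDecodeError as err:
            last_error = err
    raise last_error
```

Skipping bad lines silently drops data, so it is worth comparing row counts against the raw file when this path is taken.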

In [95]:
# Forward slashes work on Windows, macOS, and Linux
movie_path = "letterboxd-movie-ratings-data/movie_data.csv"
ratings_path = "letterboxd-movie-ratings-data/ratings_export.csv"
users_path = "letterboxd-movie-ratings-data/users_export.csv"

In [96]:
# Load the three CSV files with pandas
movie_dt = pd.read_csv(movie_path)
ratings_dt = pd.read_csv(ratings_path)
users_dt = pd.read_csv(users_path)

In [121]:
print("movie_data.columns:",
      movie_dt.columns,
      "\nratings_data.columns:",
      ratings_dt.columns,
      "\nusers_data.columns:",
      users_dt.columns)

movie_data.columns: Index(['_id', 'genres', 'image_url', 'imdb_id', 'imdb_link', 'movie_id',
       'movie_title', 'original_language', 'overview', 'popularity',
       'production_countries', 'release_date', 'runtime', 'spoken_languages',
       'tmdb_id', 'tmdb_link', 'vote_average', 'vote_count', 'year_released'],
      dtype='object') 
ratings_data.columns: Index(['_id', 'movie_id', 'rating_val', 'user_id'], dtype='object') 
users_data.columns: Index(['_id', 'display_name', 'num_ratings_pages', 'num_reviews', 'username'], dtype='object')


In [98]:
movie_data = movie_dt[['movie_id','movie_title', 'genres', 'overview',
                         'popularity', 'vote_average', 'vote_count', 'year_released']]

ratings_data = ratings_dt[['movie_id', 'rating_val', 'user_id']]

users_data = users_dt[['num_ratings_pages', 'num_reviews', 'username']]

In [122]:
print("movie_data.columns:",
      movie_data.columns,
      "\nratings_data.columns:",
      ratings_data.columns,
      "\nusers_data.columns:",
      users_data.columns)

movie_data.columns: Index(['movie_id', 'movie_title', 'genres', 'overview', 'popularity',
       'vote_average', 'vote_count', 'year_released'],
      dtype='object') 
ratings_data.columns: Index(['movie_id', 'rating_val', 'user_id'], dtype='object') 
users_data.columns: Index(['num_ratings_pages', 'num_reviews', 'username'], dtype='object')


## 4) Standardize & merge data
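We rename the source columns to canonical names (for example, `movie_title` becomes `movie_name` and `rating_val` becomes `rating`), then merge ratings, users, and movie metadata into a single dataframe with one row per rating.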
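
To illustrate the mapping idea, here is a small sketch of candidate-based column inference on a toy dataframe. The alternative names (`film_id`, `title`, `name`) and the toy row are invented for illustration; the real files use the column names printed above.

```python
import pandas as pd

def infer_column(df, candidates):
    """Return the first candidate name present in df.columns, else None."""
    for c in candidates:
        if c in df.columns:
            return c
    return None

# Toy frame with non-canonical names (hypothetical, for illustration only)
toy = pd.DataFrame({"film_id": ["feast-2014"], "title": ["Feast"]})

id_col = infer_column(toy, ["movie_id", "film_id"])
title_col = infer_column(toy, ["movie_title", "title", "name"])

# Rename whatever was found to the canonical names used downstream
canonical = toy.rename(columns={id_col: "movie_id", title_col: "movie_name"})
print(list(canonical.columns))
```

Listing several candidates per column is what lets the same function cope with exports that name their columns differently.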

In [123]:
movie_data.head(2)


Unnamed: 0,movie_id,movie_title,genres,overview,popularity,vote_average,vote_count,year_released
0,football-freaks,Football Freaks,"[""Music"",""Animation""]","Football crazy, football mad. Don’t watch this...",0.6,0.0,0.0,1971.0
1,aftermath-1960,Aftermath,[],Aftermath was the pilot for an unsold TV serie...,0.6,8.0,1.0,1960.0


In [124]:
ratings_data.head(2)

Unnamed: 0,movie_id,rating_val,user_id
0,feast-2014,7,deathproof
1,loving-2016,7,deathproof


In [125]:
users_data.head(2)

Unnamed: 0,num_ratings_pages,num_reviews,username
0,32.0,1650,deathproof
1,52.0,1915,5fc57c5d6758f6963451a07f


In [None]:
def infer_column(df, candidates):
    for c in candidates:
        if c in df.columns:
            return c
    return None

def standardize_load(movie_df, ratings_df, users_df):
    movies = movie_df.copy()
    ratings = ratings_df.copy()
    users = users_df.copy()

    # === Infer columns based on actual dataset ===
    movie_id_col = infer_column(movies, ['movie_id'])
    title_col = infer_column(movies, ['movie_title'])
    desc_col = infer_column(movies, ['overview'])
    year_col = infer_column(movies, ['year_released'])

    rated_movie_col = infer_column(ratings, ['movie_id'])
    user_id_col = infer_column(ratings, ['user_id'])   # username in ratings
    rating_col = infer_column(ratings, ['rating_val'])

    username_col = infer_column(users, ['username'])

    # === Rename columns to standard names ===
    # 'desc' must match the column name selected in the merge below
    movies = movies.rename(columns={
        movie_id_col: 'movie_id',
        title_col: 'movie_name',
        desc_col: 'desc',
        year_col: 'year'
    })

    ratings = ratings.rename(columns={
        rated_movie_col: 'movie_id',
        user_id_col: 'username', 
        rating_col: 'rating'
    })

    users = users.rename(columns={
        username_col: 'username'
    })

    # === Clean movie_id to match ratings ===
    movies['movie_id'] = movies['movie_id'].str.lower().str.strip()
    ratings['movie_id'] = ratings['movie_id'].str.lower().str.strip()

    # === Merge datasets ===
    merged = (
        ratings.merge(users[['username']], on='username', how='left')
               .merge(
                   movies[['movie_id', 'movie_name', 'genres','desc', 'year']].drop_duplicates('movie_id'),
                   on='movie_id', how='left'
               )
    )

    merged['rating'] = pd.to_numeric(merged['rating'], errors='coerce')
    merged = merged.dropna(subset=['rating'])

    return movies, ratings, users, merged


In [127]:
# === Run the function ===
movies, ratings, users, merged = standardize_load(movie_data, ratings_data, users_data)
print('Merged shape:', merged.shape)

Merged shape: (11078167, 7)
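
Because both merges are left joins, ratings whose `movie_id` has no metadata stay in `merged` with empty movie columns. One way to count them is pandas' `indicator=True` option, sketched here on toy data (the rows are invented for illustration, not taken from the dataset).

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "movie_id": ["feast-2014", "loving-2016", "unknown-film"],
    "username": ["deathproof", "deathproof", "deathproof"],
    "rating": [7, 7, 5],
})
movies_toy = pd.DataFrame({
    "movie_id": ["feast-2014", "loving-2016"],
    "movie_name": ["Feast", "Loving"],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
checked = ratings_toy.merge(movies_toy, on="movie_id", how="left", indicator=True)
unmatched = (checked["_merge"] == "left_only").sum()
print(f"{unmatched} of {len(checked)} ratings have no movie metadata")
```

The same check could be run on the real `ratings` and `movies` frames to see how many ratings lack a title or description.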


In [128]:
merged.head(2)

Unnamed: 0,movie_id,rating,username,movie_name,genres,desc,year
0,feast-2014,7,deathproof,Feast,"[""Animation"",""Comedy"",""Drama"",""Family""]",This Oscar-winning animated short film tells t...,2014.0
1,loving-2016,7,deathproof,Loving,"[""Romance"",""Drama""]","The story of Richard and Mildred Loving, an in...",2016.0


In [129]:
merged = merged[['movie_id','username','movie_name', 'genres','rating', 'desc', 'year']]

In [130]:
print("merged",
      merged.columns,
      )

merged Index(['movie_id', 'username', 'movie_name', 'genres', 'rating', 'desc',
       'year'],
      dtype='object')


In [131]:
merged.head(2)

Unnamed: 0,movie_id,username,movie_name,genres,rating,desc,year
0,feast-2014,deathproof,Feast,"[""Animation"",""Comedy"",""Drama"",""Family""]",7,This Oscar-winning animated short film tells t...,2014.0
1,loving-2016,deathproof,Loving,"[""Romance"",""Drama""]",7,"The story of Richard and Mildred Loving, an in...",2016.0


## 5) Text cleaning — prepare `clean_text`
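
As a rough illustration of what the cleaning cell below does at each stage, here is a dependency-free sketch. It is a simplification under stated assumptions: it uses a tiny stand-in stopword set and a whitespace split instead of NLTK, and it omits lemmatization. `clean_text_sketch` and `MINI_STOPWORDS` are hypothetical names.

```python
import re

# Tiny stand-in stopword list; the notebook uses NLTK's full English list
MINI_STOPWORDS = {"the", "this", "and", "for", "was"}

def clean_text_sketch(text):
    """Lowercase, strip punctuation, drop stopwords and tokens of <= 2 chars."""
    t = str(text).lower()
    t = re.sub(r"[^a-z0-9\s]", " ", t)   # non-alphanumerics become spaces
    tokens = t.split()                   # whitespace split instead of nltk.word_tokenize
    tokens = [w for w in tokens if w not in MINI_STOPWORDS and len(w) > 2]
    return " ".join(tokens)              # lemmatization step omitted here

print(clean_text_sketch("Football crazy, football mad. Don't watch this!"))
# prints: football crazy football mad don watch
```

Note how the apostrophe splits "Don't" into "don" and "t"; the length filter removes "t", while "don" survives here only because the stand-in stopword set is so small.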

In [132]:
def clean_text(text):
    if pd.isna(text): 
        return ''
    t = str(text).lower()
    t = re.sub(r'[^a-z0-9\s]', ' ', t)
    tokens = nltk.word_tokenize(t)
    tokens = [w for w in tokens if w not in STOPWORDS and len(w) > 2]
    tokens = [LEMM.lemmatize(w) for w in tokens]
    return ' '.join(tokens)

# Prepare list of columns to combine
cols_for_text = ['movie_id','movie_name', 'desc', 'genres', 'overview']
cols_for_text = [c for c in cols_for_text if c in movies.columns]

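# Convert 'genres' list strings (e.g. '["Music","Animation"]') to plain text.
# ast.literal_eval parses the list literal without executing arbitrary code, unlike eval.
import ast
if 'genres' in cols_for_text:
    movies['genres'] = movies['genres'].fillna('[]').apply(
        lambda x: ' '.join(ast.literal_eval(x)) if str(x).startswith('[') else str(x)
    )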

# Combine columns into single text
movies['text_combined'] = movies[cols_for_text].fillna('').astype(str).agg(' '.join, axis=1)
movies['description'] = movies['text_combined'].apply(clean_text)

# Preview
display(movies[['movie_id','movie_name','genres','description']].head())

Unnamed: 0,movie_id,movie_name,genres,description
0,football-freaks,Football Freaks,Music Animation,football freak football freak football crazy f...
1,aftermath-1960,Aftermath,,aftermath 1960 aftermath aftermath pilot unsol...
2,where-chimneys-are-seen,Where Chimneys Are Seen,Drama,chimney seen chimney seen gosho celebrated fil...
3,the-musicians-daughter,The Musician's Daughter,Drama,musician daughter musician daughter carl wagne...
4,50-years-of-fabulous,50 Years of Fabulous,Documentary,year fabulous year fabulous year fabulous reco...


In [135]:
movies.head(2)

Unnamed: 0,movie_id,movie_name,genres,desc,popularity,vote_average,vote_count,year,text_combined,description
0,football-freaks,Football Freaks,Music Animation,"Football crazy, football mad. Don’t watch this...",0.6,0.0,0.0,1971.0,football-freaks Football Freaks Football crazy...,football freak football freak football crazy f...
1,aftermath-1960,Aftermath,,Aftermath was the pilot for an unsold TV serie...,0.6,8.0,1.0,1960.0,aftermath-1960 Aftermath Aftermath was the pil...,aftermath 1960 aftermath aftermath pilot unsol...


In [136]:
movies= movies[['movie_id','movie_name','genres', 'description',
                'year', 'popularity', 'vote_average', 'vote_count']]

In [138]:
movies.head(2)

Unnamed: 0,movie_id,movie_name,genres,description,year,popularity,vote_average,vote_count
0,football-freaks,Football Freaks,Music Animation,football freak football freak football crazy f...,1971.0,0.6,0.0,0.0
1,aftermath-1960,Aftermath,,aftermath 1960 aftermath aftermath pilot unsol...,1960.0,0.6,8.0,1.0


In [146]:
print("merged:",
      merged.columns,
      "\nmovies:",
      movies.columns,
      "\nratings:",
      ratings.columns,
      "\nusers:",
      users.columns)

merged: Index(['movie_id', 'username', 'movie_name', 'genres', 'rating', 'description',
       'year'],
      dtype='object') 
movies: Index(['movie_id', 'movie_name', 'genres', 'description', 'year', 'popularity',
       'vote_average', 'vote_count'],
      dtype='object') 
ratings: Index(['movie_id', 'rating', 'username'], dtype='object') 
users: Index(['num_ratings_pages', 'num_reviews', 'username'], dtype='object')


In [147]:
merged.to_csv('merged_movies_ratings_users.csv', index=False)
movies.to_csv("clean_movies.csv", index=False)
ratings.to_csv('clean_ratings.csv', index=False)
users.to_csv('clean_users.csv', index=False)

In [None]:
# Column names available across `merged` and `movies` after cleaning:
# 'movie_id', 'username', 'movie_name', 'genres', 'rating', 'description', 'year', 'popularity', 'vote_average', 'vote_count'