Movie Recommendation Notebook
-----------------------------
This notebook lets you input a movie preference prompt (e.g., "I want to watch a sci-fi romance with strong female leads")
and returns a top-5 list of recommended movies based on dataset embeddings and similarity search.

# 1. Install and import necessary libraries

In [6]:
%pip install pandas scikit-learn sentence-transformers faiss-cpu wget
import os
import zipfile

import faiss
import pandas as pd
import wget
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler




In [7]:
# 2. Download the dataset from the web
# Download the zip only if neither the archive nor the extracted folder exists
if not os.path.exists('ml-1m') and not os.path.exists('ml-1m.zip'):
    wget.download('https://files.grouplens.org/datasets/movielens/ml-1m.zip', 'ml-1m.zip')
# Extract whenever the folder is missing, even if the zip was already on disk
if not os.path.exists('ml-1m'):
    with zipfile.ZipFile('ml-1m.zip', 'r') as zip_ref:
        zip_ref.extractall('ml-1m')
else:
    print("Dataset already downloaded and extracted.")
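
The extraction step above can be sketched as a small reusable helper. This is a minimal sketch using only the standard library; `ensure_extracted` and the throwaway demo archive are hypothetical names introduced here, not part of the notebook. It also illustrates why the files later appear under `ml-1m/ml-1m/`: the archive carries its own top-level `ml-1m/` folder, and it is extracted into a directory that is also named `ml-1m`.

```python
import os
import tempfile
import zipfile


def ensure_extracted(zip_path, target_dir):
    """Extract zip_path into target_dir unless target_dir already exists.

    Returns True if extraction ran, False if it was skipped.
    """
    if os.path.isdir(target_dir):
        return False
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(target_dir)
    return True


# Demo with a throwaway archive that mimics the ml-1m layout
# (a top-level 'ml-1m/' folder inside the zip). No network access needed.
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, 'demo.zip')
with zipfile.ZipFile(demo_zip, 'w') as zf:
    zf.writestr('ml-1m/movies.dat',
                "1::Toy Story (1995)::Animation|Children's|Comedy\n")

demo_dir = os.path.join(workdir, 'ml-1m')
first_run = ensure_extracted(demo_zip, demo_dir)    # extracts
second_run = ensure_extracted(demo_zip, demo_dir)   # skipped, folder exists
nested_path = os.path.join(demo_dir, 'ml-1m', 'movies.dat')
print(first_run, second_run, os.path.exists(nested_path))
```

Checking for the folder rather than the zip makes the cell safe to re-run after an interrupted download-and-extract.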

In [8]:
# 2. Load Dataset
# Recommended datasets:
# - MovieLens 1M (https://grouplens.org/datasets/movielens/1m/): ~1M ratings, ~3.9K movies, easy to start with
# - MovieLens 20M (https://grouplens.org/datasets/movielens/20m/): richer, but heavier to compute on
# - IMDb Titles and Ratings (https://datasets.imdbws.com/): richer metadata
# - TMDb API: additional metadata and posters



movies = pd.read_csv('ml-1m/ml-1m/movies.dat', sep='::', engine='python',
                     names=['MovieID','Title','Genres'], encoding='latin1')
ratings = pd.read_csv('ml-1m/ml-1m/ratings.dat', sep='::', engine='python',
                      names=['UserID','MovieID','Rating','Timestamp'], encoding='latin1')
users = pd.read_csv('ml-1m/ml-1m/users.dat', sep='::', engine='python',
                    names=['UserID','Gender','Age','Occupation','Zip-code'], encoding='latin1')

# Quick sanity check
print(movies.head())
print(ratings.head())
print(users.head())

   MovieID                               Title                        Genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy
   UserID  MovieID  Rating  Timestamp
0       1     1193       5  978300760
1       1      661       3  978302109
2       1      914       3  978301968
3       1     3408       4  978300275
4       1     2355       5  978824291
   UserID Gender  Age  Occupation Zip-code
0       1      F    1          10    48067
1       2      M   56          16    70072
2       3      M   25          15    55117
3       4      M   45           7    02460
4       5      M   25          20    55455
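
A note on the `Zip-code` column: the leading zero in `02460` survives above, which suggests pandas read the column as text (presumably because some entries in the full file are not purely numeric; that is an assumption, not something checked here). The sketch below uses two sample rows in the same layout to show what happens when every value is all digits, and how an explicit `dtype` keeps the column as strings regardless.

```python
import io

import pandas as pd

# Two sample rows in the users.dat layout
sample = "1::F::1::10::48067\n4::M::45::7::02460\n"
cols = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

# Without a dtype hint, an all-digit column is inferred as numeric,
# so the leading zero is lost
inferred = pd.read_csv(io.StringIO(sample), sep='::', engine='python', names=cols)

# With an explicit string dtype the values are kept exactly as written
typed = pd.read_csv(io.StringIO(sample), sep='::', engine='python',
                    names=cols, dtype={'Zip-code': str})

print(inferred['Zip-code'].tolist())
print(typed['Zip-code'].tolist())
```

`engine='python'` is required in both calls because a multi-character separator such as `::` is not supported by the default C parser.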


In [9]:
# 3. Preprocess Data
# Extract the release year from the trailing "(YYYY)" and strip it from the title
movies['Year'] = movies['Title'].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(float)
movies['CleanTitle'] = movies['Title'].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True).str.strip()
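
To see what the year extraction does on individual titles, here is a minimal sketch with the standard `re` module. `split_title` is a hypothetical helper, and the example titles are illustrative (the last one is made up to show the no-year case). Anchoring the pattern at the end of the string keeps earlier parenthesized text, such as an alternate title, inside the cleaned title.

```python
import re

# A four-digit year in parentheses at the very end of the title
YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")


def split_title(title):
    """Return (clean_title, year) where year is an int, or None if absent."""
    match = YEAR_SUFFIX.search(title)
    if match is None:
        return title.strip(), None
    return title[:match.start()].strip(), int(match.group(1))


examples = [
    "Toy Story (1995)",
    "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)",
    "Untitled Project",  # hypothetical title with no year
]
parsed = [split_title(t) for t in examples]
for clean, year in parsed:
    print(repr(clean), year)
```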

In [10]:
# 4. Feature Construction: average rating and genre one-hot
avg_ratings = ratings.groupby('MovieID')['Rating'].mean().reset_index().rename(columns={'Rating':'AvgRating'})
movies = movies.merge(avg_ratings, on='MovieID', how='left')
movies['AvgRating'] = movies['AvgRating'].fillna(movies['AvgRating'].mean())

genres_expanded = movies['Genres'].str.get_dummies(sep='|')
movies = pd.concat([movies, genres_expanded], axis=1)
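
The two feature steps in this cell can be checked on a toy frame. The sketch below uses made-up ratings for three movies; it fills the missing average with the column mean by assigning the result back to the column, then expands the pipe-separated genres into one indicator column per genre.

```python
import pandas as pd

toy = pd.DataFrame({
    'MovieID': [1, 2, 3],
    'Genres': ["Animation|Children's|Comedy", 'Comedy|Romance', 'Comedy'],
    'AvgRating': [4.0, None, 3.0],  # made-up values; movie 2 has no ratings
})

# Assign the filled column back rather than mutating a column selection in place
toy['AvgRating'] = toy['AvgRating'].fillna(toy['AvgRating'].mean())

# One indicator column per genre found in the pipe-separated strings
genre_flags = toy['Genres'].str.get_dummies(sep='|')
toy = pd.concat([toy, genre_flags], axis=1)

print(toy)
```

Note that the genre indicators are built here but the embeddings later in the notebook use only the raw title and genre text.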



In [11]:
# 5. Text embeddings: MovieLens 1M has no plot descriptions, so embed title + genres
model = SentenceTransformer('all-MiniLM-L6-v2')
movies['Text'] = movies['CleanTitle'] + ' | ' + movies['Genres']
embeddings = model.encode(movies['Text'].tolist(), show_progress_bar=True)

Batches: 100%|██████████| 122/122 [00:08<00:00, 14.99it/s]


In [12]:
# 6. Build FAISS index
d = embeddings.shape[1]
index = faiss.IndexFlatIP(d)       # inner product for cosine similarity after normalization
faiss.normalize_L2(embeddings)
index.add(embeddings)
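
Why normalization makes an inner-product index behave like cosine similarity can be verified with plain NumPy. The sketch below uses small random vectors as stand-ins for the movie and prompt embeddings and a brute-force matrix product in place of the FAISS index; `l2_normalize` is a hypothetical helper that returns a copy, whereas `faiss.normalize_L2` modifies its argument in place.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 4)).astype('float32')  # stand-in for movie embeddings
query = rng.normal(size=(1, 4)).astype('float32')    # stand-in for a prompt embedding


def l2_normalize(x):
    """Return a copy of x with each row scaled to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


unit_vectors = l2_normalize(vectors)
unit_query = l2_normalize(query)

# Inner product of unit vectors: what a flat inner-product search scores by
scores = (unit_query @ unit_vectors.T)[0]

# Reference cosine similarity computed from the raw, unnormalized vectors
cosine = (vectors @ query[0]) / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query[0])
)

top_k = np.argsort(-scores)[:3]  # indices of the 3 most similar rows
print(np.allclose(scores, cosine, atol=1e-5), top_k)
```

The same normalization has to be applied to the query at search time, otherwise the scores are no longer cosine similarities.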

In [13]:
# 7. Recommendation function

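def recommend_movies(prompt, k=5):
    """Return the k movies whose title + genre embedding is closest to the prompt."""
    # Embed the prompt and normalize it the same way as the indexed vectors
    q_emb = model.encode([prompt])
    faiss.normalize_L2(q_emb)
    # Search; with normalized vectors the scores are cosine similarities
    scores, indices = index.search(q_emb, k)
    recs = movies.iloc[indices[0]][['CleanTitle','Genres','AvgRating','Year']]
    return recs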

# 8. User Interaction
if __name__ == '__main__':
    user_prompt = input("Describe what type of movie you want: ")
    recommendations = recommend_movies(user_prompt, k=5)
    print("Top 5 recommendations based on your prompt:\n")
    print(recommendations.to_string(index=False))

Top 5 recommendations based on your prompt:

     CleanTitle        Genres  AvgRating   Year
    Funny Bones        Comedy   3.410256 1995.0
     Funny Farm        Comedy   2.950000 1988.0
       Vacation        Comedy   3.841530 1983.0
Jimmy Hollywood        Comedy   2.458333 1994.0
    Scary Movie Comedy|Horror   3.011419 2000.0
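
The table above does not show how close each match was. One possible extension, sketched below under stated assumptions, attaches the similarity score to each row and drops the `-1` placeholders that FAISS returns when fewer than `k` neighbors are available. `format_results` and the three-row `catalog` are hypothetical, and the `distances`/`indices` arrays are hand-written stand-ins for the output of `index.search`, not real search results.

```python
import numpy as np
import pandas as pd

# Tiny made-up catalog standing in for the movies DataFrame
catalog = pd.DataFrame({
    'CleanTitle': ['A', 'B', 'C'],
    'Genres': ['Comedy', 'Drama', 'Horror'],
})


def format_results(catalog, distances, indices):
    """Turn one row of search output into a DataFrame with a Score column."""
    ids = np.asarray(indices[0])
    scores = np.asarray(distances[0])
    valid = ids >= 0  # drop the -1 padding for missing neighbors
    out = catalog.iloc[ids[valid]].copy()
    out['Score'] = scores[valid]
    return out.reset_index(drop=True)


# Stand-in for a k=4 search on a 3-item index: the last slot is padding
# (the score in the padded slot is a placeholder value)
distances = np.array([[0.9, 0.5, 0.1, -3.4e38]], dtype='float32')
indices = np.array([[2, 0, 1, -1]])

results = format_results(catalog, distances, indices)
print(results.to_string(index=False))
```

Showing the score makes it easier to tell a strong match from a list that was merely the least bad available.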
