# **Movie Recommendation**

## 1. Data Loading

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Download data (zip file) and unzip
!wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip ml-latest-small.zip

--2026-01-23 10:32:34--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.96.204
Connecting to files.grouplens.org (files.grouplens.org)|128.101.96.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2026-01-23 10:32:35 (1.65 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [3]:
# ratings.csv: userId, movieId, rating, timestamp
ratings = pd.read_csv('ml-latest-small/ratings.csv')

# movies.csv: movieId, title, genres
movies = pd.read_csv('ml-latest-small/movies.csv')

In [4]:
# Merge both tables based on movieId
df = pd.merge(ratings, movies, on='movieId')

In [5]:
print(df.head())
print(f"\nTotal Ratings: {df.shape[0]}")
print(f"Unique Movies: {df['title'].nunique()}")
print(f"Unique Users: {df['userId'].nunique()}")

   userId  movieId  rating  timestamp                        title  \
0       1        1     4.0  964982703             Toy Story (1995)   
1       1        3     4.0  964981247      Grumpier Old Men (1995)   
2       1        6     4.0  964982224                  Heat (1995)   
3       1       47     5.0  964983815  Seven (a.k.a. Se7en) (1995)   
4       1       50     5.0  964982931   Usual Suspects, The (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                               Comedy|Romance  
2                        Action|Crime|Thriller  
3                             Mystery|Thriller  
4                       Crime|Mystery|Thriller  

Total Ratings: 100836
Unique Movies: 9719
Unique Users: 610
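The merge in cell [4] is an inner join on `movieId`, so every rating row picks up the title and genres of its movie, and movies without any rating drop out. A minimal sketch on made-up rows; the toy values and the `validate` check are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv (values are made up)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 4.0, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 3, 6],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)", "Heat (1995)"],
})

# validate="many_to_one" raises if a movieId appears more than once in toy_movies
toy_df = pd.merge(toy_ratings, toy_movies, on="movieId", validate="many_to_one")

print(toy_df)
print(f"Unique Movies: {toy_df['title'].nunique()}")
```

In this toy data, movieId 6 has no ratings, so it does not appear in the merged table.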


## 2. Data Preprocessing

The first recommender is content-based: for now, it uses only each movie's genres to decide which movies are similar.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# 1. Use the genres as tags for training
# The data uses "|" as the separator; replace it with a space.
# regex=False treats "|" as a literal character rather than regex alternation
movies['genres_clean'] = movies['genres'].str.replace('|', ' ', regex=False)

# 2. Vectorize the genres (turn them into numbers)
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['genres_clean'])

# 3. Calculate cosine similarity to tell how similar Movie A is to Movie B (0 to 1)
# TF-IDF rows are L2-normalized by default, so linear_kernel (a plain dot product) gives the cosine
# This creates a large n_movies x n_movies matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
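To see why `linear_kernel` can stand in for cosine similarity here, the following sketch runs the same two steps on three made-up genre strings. The toy strings and the `toy_` names are assumptions for illustration only:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_genres = [
    "Adventure Animation Children",  # movie 0
    "Adventure Animation Children",  # movie 1, same genres as movie 0
    "Crime Thriller",                # movie 2, no genre in common
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_genres)

# TfidfVectorizer L2-normalizes each row by default (norm="l2"),
# so the plain dot product equals the cosine similarity
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)

print(np.round(toy_sim, 2))
print(np.allclose(toy_sim, cosine_similarity(toy_tfidf)))
```

Movies with identical genre strings score 1, and movies that share no genre score 0, which is all a genre-only model can distinguish.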

In [7]:
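# Helper functions for easier recommendation calls
# Map each title to its row index in movies. Titles can occur more than once,
# so keep only the first occurrence to make every lookup return a single index
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]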

def get_content_recommendations(title, cosine_sim=cosine_sim):
  if title not in indices:
    return "Movie not found!"

  # Get index of movie
  idx = indices[title]

  # Get similarity score of all movies against this movie
  sim_scores = list(enumerate(cosine_sim[idx]))

  # Sort based on highest similarity
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

  # Keep the 10 most similar movies, dropping the query movie itself
  # (it is not guaranteed to be first when other movies tie at the same score)
  sim_scores = [s for s in sim_scores if s[0] != idx][:10]

  # Get the movie indices
  movie_indices = [i[0] for i in sim_scores]

  return movies['title'].iloc[movie_indices]
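The lookup `indices[title]` in the helper above assumes each title maps to exactly one row. The sketch below shows what a duplicated title does to that lookup, and one possible guard. The toy catalogue is made up for illustration:

```python
import pandas as pd

# Toy catalogue with one title listed twice (made-up rows)
toy_movies = pd.DataFrame({
    "title": ["Emma (1996)", "Heat (1995)", "Emma (1996)"],
})

toy_indices = pd.Series(toy_movies.index, index=toy_movies["title"])

# A duplicated label returns a Series of positions, not a single integer
ambiguous = toy_indices["Emma (1996)"]

# Keeping only the first occurrence of each title makes every lookup scalar
unique_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]
first_position = unique_indices["Emma (1996)"]

print(type(ambiguous).__name__, len(ambiguous))
print(int(first_position))
```

Note that `drop_duplicates()` on such a Series compares the values (the row indices), which are already unique, so it does not remove repeated titles.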

In [8]:
print("Recommendations for 'Toy Story (1995)':")
print(get_content_recommendations('Toy Story (1995)'))

Recommendations for 'Toy Story (1995)':
1706                                          Antz (1998)
2355                                   Toy Story 2 (1999)
2809       Adventures of Rocky and Bullwinkle, The (2000)
3000                     Emperor's New Groove, The (2000)
3568                                Monsters, Inc. (2001)
6194                                     Wild, The (2006)
6486                               Shrek the Third (2007)
6948                       Tale of Despereaux, The (2008)
7760    Asterix and the Vikings (Astérix et les Viking...
8219                                         Turbo (2013)
Name: title, dtype: object


The results do indeed recommend movies with the same tags as Toy Story (Adventure|Animation|Children|Comedy|Fantasy).

However, it also recommends films with lower ratings (Turbo, The Wild), because genre overlap says nothing about how well a movie was received. The system needs a model that takes into account what other users with similar preferences watched, so that movies are matched not only on genre but also on the viewing behavior of other users.
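One way to check a claim like this is to compare the mean rating and the number of ratings per title. A minimal sketch on made-up ratings; the titles, values, and the `min_ratings` threshold are hypothetical:

```python
import pandas as pd

# Toy ratings (made-up values) with the same columns as the merged df
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B", "Movie C"],
    "rating": [4.5, 4.0, 5.0, 2.0, 2.5, 5.0],
})

# Mean rating and number of ratings per title
toy_stats = toy_df.groupby("title")["rating"].agg(["mean", "count"])

# Hypothetical threshold: ignore titles with fewer than 2 ratings,
# since a mean over a single rating is not very informative
min_ratings = 2
toy_reliable = toy_stats[toy_stats["count"] >= min_ratings]

print(toy_reliable.sort_values("mean", ascending=False))
```

On the real data, the same `groupby` on `df` could be joined against the recommended titles to see how they were rated.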

## 3. New Model Training

The model will use SVD (Singular Value Decomposition), which is often used for recommendation systems.

In [9]:
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# 1. Movie-user matrix: one row per title, one column per user; unrated entries are filled with 0
movie_user_matrix = df.pivot_table(index='title', columns='userId', values='rating').fillna(0)
print(f"Current matrix Shape: {movie_user_matrix.shape}")

Current matrix Shape: (9719, 610)


In [10]:
# 2. Compress the data
# Reduce the 610 user columns to 20 latent components ("concepts")
SVD = TruncatedSVD(n_components=20, random_state=42)
matrix_reduced = SVD.fit_transform(movie_user_matrix)

print(f"Reduced matrix Shape: {matrix_reduced.shape}")

Reduced matrix Shape: (9719, 20)


In [11]:
# 3. Calculate cosine similarity
# cosine_similarity is used instead of linear_kernel because it L2-normalizes the rows itself
corr_matrix = cosine_similarity(matrix_reduced)
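The pivot, SVD, and similarity steps can be followed end to end on a tiny example. This is a sketch on made-up ratings, with `n_components=2` chosen only because the toy matrix has rank 2; it is not the notebook's configuration:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings: users 1 and 2 rate movies A and B, users 3 and 4 rate movie C
toy_df = pd.DataFrame({
    "title":  ["A", "A", "B", "B", "C", "C"],
    "userId": [1, 2, 1, 2, 3, 4],
    "rating": [5.0, 4.0, 5.0, 4.0, 5.0, 4.0],
})

# One row per movie, one column per user, 0 where a user gave no rating
toy_matrix = toy_df.pivot_table(index="title", columns="userId", values="rating").fillna(0)

# Two components are enough here because the toy matrix has rank 2
toy_svd = TruncatedSVD(n_components=2, random_state=42)
toy_reduced = toy_svd.fit_transform(toy_matrix.to_numpy())

# Pairwise cosine similarity between the reduced movie vectors
toy_corr = cosine_similarity(toy_reduced)
print(pd.DataFrame(toy_corr, index=toy_matrix.index, columns=toy_matrix.index).round(2))
```

Movies A and B are rated by the same users and come out as near-identical, while C, rated by a disjoint set of users, is close to orthogonal to both.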

In [12]:
def get_collaborative_recommendations(movie_title):
  if movie_title not in movie_user_matrix.index:
    return "Movie not found!"

  # 1. Find index
  movie_idx = movie_user_matrix.index.get_loc(movie_title)

  # 2. Similarity scores
  corr_scores = corr_matrix[movie_idx]

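  # 3. Sort indices based on score, highest first
  sorted_indices = corr_scores.argsort()[::-1]

  # 4. Get top 10, excluding the movie itself (it can tie with other movies at 1.0,
  #    so it is filtered by index rather than assumed to be in first place)
  top_10_indices = sorted_indices[sorted_indices != movie_idx][:10]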

  return movie_user_matrix.index[top_10_indices]
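The ranking step in the function above relies on `argsort`. The sketch below isolates that step on a made-up similarity row where another movie ties with the query movie, and filters the query out by index instead of by position. The scores and the value of `k` are hypothetical:

```python
import numpy as np

# Made-up similarity row for movie 2: movie 0 ties with the movie itself at 1.0
toy_scores = np.array([1.0, 0.3, 1.0, 0.8, 0.1])
movie_idx = 2
k = 3

# Indices ordered from highest to lowest score
order = toy_scores.argsort()[::-1]

# Filter by index instead of assuming the movie itself is ranked first
top_k = order[order != movie_idx][:k]

print(top_k)
```

With a plain `order[1:k+1]` slice, the result would depend on how the tie between movie 0 and movie 2 happens to be ordered.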

In [13]:
print("People who watched 'Toy Story (1995)' also watched:")
print(get_collaborative_recommendations('Toy Story (1995)'))

People who watched 'Toy Story (1995)' also watched:
Index(['Willy Wonka & the Chocolate Factory (1971)',
       'Back to the Future (1985)', 'Home Alone (1990)',
       'Star Wars: Episode IV - A New Hope (1977)', 'Groundhog Day (1993)',
       'Independence Day (a.k.a. ID4) (1996)', 'Jurassic Park (1993)',
       'Babe (1995)', 'Princess Bride, The (1987)',
       'Star Wars: Episode VI - Return of the Jedi (1983)'],
      dtype='object', name='title')


## 4. Build and Deployment

In [14]:
# Save the files
import pickle

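# 1. Save the movie titles in the same row order as matrix_reduced
#    (movie_user_matrix.index), so row positions line up in the app
movie_list = pd.DataFrame({'title': movie_user_matrix.index})
with open('movie_list.pkl', 'wb') as f:
  pickle.dump(movie_list, f)

# 2. Save the SVD-reduced movie matrix (one 20-dimensional row per movie)
with open('user_behavior.pkl', 'wb') as f:
  pickle.dump(matrix_reduced, f)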

print("Files successfully exported.")

Files successfully exported.
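Before building the app on top of these files, it can be worth checking that an exported array survives the round trip. A minimal sketch, using a random stand-in for `matrix_reduced` and a temporary directory (both are assumptions for illustration):

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for matrix_reduced (random values, same 20-column layout)
toy_reduced = np.random.default_rng(0).normal(size=(5, 20))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "user_behavior.pkl")

    # The context manager closes the file even if pickling fails
    with open(path, "wb") as f:
        pickle.dump(toy_reduced, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(toy_reduced, restored))
```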


In [15]:
%%writefile app.py
# Python app (the %%writefile cell magic must be the first line of the cell)
import pickle
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# 1. Load the data
with open('movie_list.pkl', 'rb') as f:
  movies = pickle.load(f)
with open('user_behavior.pkl', 'rb') as f:
  matrix_reduced = pickle.load(f)

# 2. Compute similarity directly
# We can do this because the matrix is small
similarity = cosine_similarity(matrix_reduced)

# 3. Helper function for getting recommendations
def recommend(movie_title):
  if movie_title not in movies['title'].values:
    return []

  # Index
  idx = movies[movies['title'] == movie_title].index[0]

  # Scores
  scores = list(enumerate(similarity[idx]))

  # Sort
  sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

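  # Top 10 selection, skipping the selected movie itself
  top_indices = [i for i, _ in sorted_scores if i != idx][:10]

  # top_indices is a plain list, so it is passed to iloc directly
  return movies['title'].iloc[top_indices].tolist()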

Writing app.py


In [16]:
# 4. App UI
!pip install -q streamlit

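import streamlit as st
# recommend() and the pickled movie list are defined in app.py, not in the notebook
from app import movies, recommend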

st.title('🎬 Movie Recommender')
st.write("Select a movie you love, and we will find its neighbors!")

# Dropdown box to select movies
selected_movie = st.selectbox(
    'Type or select movie from the dropdown',
    movies['title'].values
)

# Button
if st.button('Show Recommendations'):
  recommendations = recommend(selected_movie)
  st.subheader(f"Because you liked '{selected_movie}':")
  for i, movie in enumerate(recommendations):
    st.write(f"{i+1}. {movie}")


2026-01-23 10:33:05.437 
  command:

    streamlit run /usr/local/lib/python3.12/dist-packages/colab_kernel_launcher.py [ARGUMENTS]
2026-01-23 10:33:05.461 Session state does not function when running a script without `streamlit run`
