<a href="https://colab.research.google.com/github/Melckykaisha/Movie_recommender_system/blob/main/Movie_Recommender_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Load the Movies Dataset

In [4]:
import pandas as pd

# Load movies (u.item is pipe-delimited)
movies = pd.read_csv(
    "ml-100k/u.item",
    sep="|",
    encoding="latin-1",   # prevents encoding issues
    header=None
)

# Assign column names manually (the file has no header row)
movies.columns = [
    "movie_id", "title", "release_date", "video_release_date", "IMDb_URL",
    "unknown", "Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery",
    "Romance", "Sci-Fi", "Thriller", "War", "Western"
]

print(movies.head())

   movie_id              title release_date  video_release_date  \
0         1   Toy Story (1995)  01-Jan-1995                 NaN   
1         2   GoldenEye (1995)  01-Jan-1995                 NaN   
2         3  Four Rooms (1995)  01-Jan-1995                 NaN   
3         4  Get Shorty (1995)  01-Jan-1995                 NaN   
4         5     Copycat (1995)  01-Jan-1995                 NaN   

                                            IMDb_URL  unknown  Action  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...        0       0   
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...        0       1   
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...        0       0   
3  http://us.imdb.com/M/title-exact?Get%20Shorty%...        0       1   
4  http://us.imdb.com/M/title-exact?Copycat%20(1995)        0       0   

   Adventure  Animation  Children  ...  Fantasy  Film-Noir  Horror  Musical  \
0          0          1         1  ...        0          0       0        0   


2. Combine Genres into a String

This gives each movie a single text feature that TF-IDF can vectorize.

In [5]:
# Select genre columns
genre_cols = ["Action","Adventure","Animation","Children","Comedy","Crime",
              "Documentary","Drama","Fantasy","Film-Noir","Horror","Musical",
              "Mystery","Romance","Sci-Fi","Thriller","War","Western"]

# Combine genres into a single string
def get_genres(row):
    genres = [g for g in genre_cols if row[g] == 1]
    return " ".join(genres)

movies["genre_str"] = movies.apply(get_genres, axis=1)
print(movies[["title", "genre_str"]].head(10))


                                               title  \
0                                   Toy Story (1995)   
1                                   GoldenEye (1995)   
2                                  Four Rooms (1995)   
3                                  Get Shorty (1995)   
4                                     Copycat (1995)   
5  Shanghai Triad (Yao a yao yao dao waipo qiao) ...   
6                              Twelve Monkeys (1995)   
7                                        Babe (1995)   
8                            Dead Man Walking (1995)   
9                                 Richard III (1995)   

                   genre_str  
0  Animation Children Comedy  
1  Action Adventure Thriller  
2                   Thriller  
3        Action Comedy Drama  
4       Crime Drama Thriller  
5                      Drama  
6               Drama Sci-Fi  
7      Children Comedy Drama  
8                      Drama  
9                  Drama War  
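
One edge case worth checking: the `unknown` flag is not part of `genre_cols`, so a movie whose only flag is `unknown` ends up with an empty `genre_str`. The sketch below reproduces this on made-up rows; the `toy` frame, its titles, and `toy_genre_cols` are illustrative assumptions, not data from u.item.

```python
import pandas as pd

# Illustrative rows only: "Movie B" carries just the "unknown" flag.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "unknown": [0, 1],
    "Action": [1, 0],
    "Comedy": [1, 0],
})
toy_genre_cols = ["Action", "Comedy"]  # "unknown" is left out, as in genre_cols

# Same join logic as get_genres, applied to the toy frame
toy["genre_str"] = toy.apply(
    lambda row: " ".join(g for g in toy_genre_cols if row[g] == 1), axis=1
)

# Rows that ended up with no genre text at all
empty_rows = toy[toy["genre_str"] == ""]
print(empty_rows[["title", "genre_str"]])
```

An empty string produces an all-zero TF-IDF row, so such a movie would have zero similarity to every other title.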


3. Create TF-IDF Vectors

Now we treat the genre strings as a text corpus, with one document per movie.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize genre strings
# Note: the default tokenizer splits hyphenated genres such as "Sci-Fi" into two tokens
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(movies["genre_str"])

print("TF-IDF shape:", tfidf_matrix.shape)


TF-IDF shape: (1682, 20)
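
The matrix has 20 columns although there are only 18 genre names. The likely cause is that the default `token_pattern` of `TfidfVectorizer` breaks "Sci-Fi" and "Film-Noir" into two tokens each. The sketch below compares the default with one possible alternative, a pattern that treats every run of non-whitespace characters as a single token. The `sample` strings and the alternative pattern are assumptions for illustration, not part of the notebook above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few genre strings in the same format as genre_str (illustrative sample)
sample = ["Drama Sci-Fi", "Crime Film-Noir", "Animation Children Comedy"]

# Default tokenizer: hyphens act as separators
default_vec = TfidfVectorizer(stop_words="english")
default_vec.fit(sample)
default_terms = sorted(default_vec.vocabulary_)

# Assumed alternative: any run of non-whitespace characters is one token
whole_vec = TfidfVectorizer(token_pattern=r"[^\s]+")
whole_vec.fit(sample)
whole_terms = sorted(whole_vec.vocabulary_)

print("default:", default_terms)
print("whole-genre tokens:", whole_terms)
```

With whole-genre tokens, "Sci-Fi" no longer shares a partial token with anything else, and each column corresponds to exactly one genre.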


4. Compute Similarity

In [8]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix.
# TF-IDF rows are L2-normalized, so a plain dot product (linear_kernel) equals cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

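# Map each title to its row position. u.item may contain duplicate titles,
# so keep the first occurrence to make sure a lookup returns a single index.
# (Series.drop_duplicates compares values, not index labels, so it would not help here.)
indices = pd.Series(movies.index, index=movies["title"])
indices = indices[~indices.index.duplicated(keep="first")]

def recommend(title, n=5):
    # Get index of the movie
    idx = indices[title]

    # Pair every movie index with its similarity to this movie,
    # dropping the movie itself explicitly: with ties at 1.0 it is not always ranked first
    sim_scores = [(i, score) for i, score in enumerate(cosine_sim[idx]) if i != idx]

    # Sort movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the indices of the top n movies
    movie_indices = [i for i, _ in sim_scores[:n]]

    return movies["title"].iloc[movie_indices]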

# Example
print(recommend("Toy Story (1995)", n=5))

421    Aladdin and the King of Thieves (1996)
101                    Aristocats, The (1970)
403                          Pinocchio (1940)
624            Sword in the Stone, The (1963)
945             Fox and the Hound, The (1981)
Name: title, dtype: object
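
Using `linear_kernel` for cosine similarity relies on `TfidfVectorizer` normalizing each row to unit length, which is its default (`norm="l2"`). A small sanity check on made-up genre strings (the `docs` list is illustrative) could look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Illustrative genre strings in the same format as genre_str
docs = [
    "Animation Children Comedy",
    "Action Adventure Thriller",
    "Children Comedy Drama",
]
X = TfidfVectorizer().fit_transform(docs)

# Each non-empty row should have unit L2 norm under the default settings
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

# With unit-length rows, the dot product and cosine similarity coincide
same = np.allclose(linear_kernel(X, X), cosine_similarity(X, X))
print(row_norms, same)
```

If the vectorizer were created with `norm=None`, the two results would generally differ and `cosine_similarity` would be the safer choice.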


Note:
👉 The recommender is currently genre-based only. From here we can polish it, e.g., by adding descriptions or tags, or by using embeddings instead of genres alone. It can also be extended with more of the dataset (u.data + u.genre + tags) for richer recommendations.
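
As a rough sketch of the tag idea, extra text can be appended to the genre string before vectorizing. Everything in the example is hypothetical: the `tags` column, the three rows, and the name `text_feature` are invented for illustration and are not columns loaded earlier in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical data: "tags" is an invented column used only for this sketch
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genre_str": ["Drama", "Comedy", "Drama"],
    "tags": ["space robots", "space robots", "courtroom"],
})

# Combine genres and tags into one text feature per movie
toy["text_feature"] = toy["genre_str"] + " " + toy["tags"]

# Same pipeline as above, applied to the richer text
X = TfidfVectorizer().fit_transform(toy["text_feature"])
sim = linear_kernel(X, X)

# Movie A now sits closer to Movie B (shared tags) than to Movie C (shared genre only)
print(sim[0, 1], sim[0, 2])
```

How much weight tags should carry relative to genres is a design choice; repeating the genre string or vectorizing the two fields separately are possible ways to tune it.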