# Movie recommender system

The project takes a dataset of movies and vectorizes each movie's tags to produce recommendations. If a movie is not in the dataset, an API is called to retrieve information about it. The new entry is then vectorized and added to the dataset, so the data keeps growing with use.

In [47]:
import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import config
import requests
from fuzzywuzzy import process

# Initialize stemmer and vectorizer
ps = PorterStemmer()
cv = CountVectorizer(max_features=10000, stop_words='english', ngram_range=(1, 3)) 




Setting ngram_range=(1, 3) makes the vectorizer count unigrams, bigrams, and trigrams.
The multi-word features capture some of the relationships between adjacent words that single-word counts miss.
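
To make the effect of the n-gram range concrete, here is a minimal pure-Python sketch of the word n-grams that a range of 1 to 3 produces for one short string. The helper `word_ngrams` is hypothetical and only illustrates the idea; scikit-learn's `CountVectorizer` additionally applies its own tokenization, stop-word removal, and the `max_features` cap, none of which are modelled here.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all contiguous word n-grams of length min_n..max_n (hypothetical helper)."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


features = word_ngrams("space crew explores wormhole")
print(features)
# 4 unigrams + 3 bigrams + 2 trigrams = 9 features
```

A four-word string already yields nine features, which is why capping the vocabulary with `max_features` matters once bigrams and trigrams are enabled.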

In [48]:
# Load data
data = pd.read_csv("dataframe.csv")
data['tags'] = data['tags'].astype(str)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4817 entries, 0 to 4816
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4817 non-null   int64 
 1   title     4817 non-null   object
 2   tags      4817 non-null   object
dtypes: int64(1), object(2)
memory usage: 113.0+ KB


In [49]:
data.sample(5)

      movie_id  title           tags
1340     11001  Blue Streak     mile logan is a jewel thief who just hit the b...
725      10067  The Shaggy Dog  the tale of a workahol dad-turned-dog who find...
3938       549  Basquiat        director julian schnabel illustr the portrait ...
2140      1710  Copycat         an agoraphob psychologist and a femal detect m...
3439     62255  Tracker         an ex-boer war guerrilla in new zealand is sen...


In [50]:
API_KEY = '23fd07ad'

# stem function
def stem(text):
    return " ".join(ps.stem(word) for word in text.split())

# Applying stemming to the tags
data['tags'] = data['tags'].apply(stem)

# Vectorizing 
vectors = cv.fit_transform(data['tags']).toarray()

print(vectors.shape)
print(np.isnan(vectors).any())  # check whether any NaN values are present

similarity = cosine_similarity(vectors)

(4817, 10000)
False
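
As a rough illustration of what the similarity matrix holds, the sketch below computes cosine similarity between bag-of-words counts in plain Python. The three tag strings are invented for the example, and the `cosine` helper is a hypothetical stand-in for a single entry of what `cosine_similarity` returns, not its implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters (hypothetical helper)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty tag string is similar to nothing
    return dot / (norm_a * norm_b)


# Invented, already-stemmed tag strings
space_a = Counter("astronaut travel space wormhole".split())
space_b = Counter("astronaut strand space planet".split())
heist = Counter("jewel thief steal diamond".split())

sim_related = cosine(space_a, space_b)
sim_unrelated = cosine(space_a, heist)
print(sim_related, sim_unrelated)  # prints 0.5 0.0
```

Two movies that share half of their tag words score 0.5, while movies with no words in common score 0, so sorting one row of the matrix in descending order ranks the most similar movies first.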



In [51]:
def recommend(movie):
    if movie not in data['title'].values:
        print("Movie not found.")
        return

    movie_index = data[data['title'] == movie].index[0]
    distances = similarity[movie_index]
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]

    if not movies_list:
        print("No recommendations found.")
        return

    for i in movies_list:
        try:
            print(data.iloc[i[0]].title)
        except IndexError:
            print("Index out of bounds while recommending.")
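
The slice `[1:6]` assumes the queried movie always sorts into first place. If another movie has identical tags (similarity 1.0), the query itself can end up in the results instead. One possible alternative, sketched here in plain Python with a hypothetical helper `top_k_similar`, is to exclude the query by index before ranking.

```python
def top_k_similar(scores, query_index, k=5):
    """Indices of the k highest scores, excluding the query item itself (hypothetical helper)."""
    ranked = sorted(
        (i for i in range(len(scores)) if i != query_index),
        key=lambda i: scores[i],
        reverse=True,
    )
    return ranked[:k]


# Invented similarity row for item 3; item 1 has identical tags (score 1.0)
row = [0.40, 1.00, 0.10, 1.00, 0.75]
top = top_k_similar(row, query_index=3, k=3)
print(top)  # prints [1, 4, 0]
```

With this row, slicing off the first sorted entry would drop item 1 and keep item 3, the query itself, whereas filtering by index keeps the duplicate and removes the query.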

If the movie is not present in the dataset, its details are fetched from an API. The new entry is included in the vectorization, and recommendations are then generated from the updated data.

In [52]:
def get_movie_data(title):
    """Fetch a movie from the API, add it to the dataset, and recommend from it."""
    # Module-level state that this function rebinds
    global data, vectors, similarity

    BASE_URL = 'http://www.omdbapi.com/'
    # A timeout keeps the notebook from hanging if the API is unreachable
    response = requests.get(BASE_URL, params={'t': title, 'apikey': API_KEY}, timeout=10)
    movie_data = response.json()

    if movie_data.get('Response') != 'True':
        print(f"Movie not found: {movie_data.get('Error')}")
        return

    new_df = pd.DataFrame([{
        # Drop the two-character prefix and store an int to match the movie_id column
        'movie_id': int(movie_data['imdbID'][2:]),
        'title': movie_data['Title'],
        'tags': movie_data['Plot'] + ' ' + movie_data['Genre'] + ' ' + movie_data['Director'] + ' ' + movie_data['Actors']
    }])
    new_df['tags'] = new_df['tags'].apply(stem)

    # Append the new row to the existing DataFrame
    data = pd.concat([data, new_df], ignore_index=True)

    # Re-fit the CountVectorizer and update vectors and similarity
    vectors = cv.fit_transform(data['tags']).toarray()
    similarity = cosine_similarity(vectors)

    recommend(new_df['title'].values[0])
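
The tag string is built by concatenating the Plot, Genre, Director, and Actors fields, which fails if a key is absent. The sketch below shows one defensive way to assemble it. The helper `build_tags` and the sample response are hypothetical, and treating the string "N/A" as a missing-value placeholder is an assumption about the API's responses, not something shown in this notebook.

```python
def build_tags(movie_data, fields=("Plot", "Genre", "Director", "Actors")):
    """Join the available text fields into one tag string (hypothetical helper).

    Missing keys and the placeholder "N/A" are skipped; the placeholder
    value is an assumption about how the API marks unavailable data.
    """
    parts = []
    for field in fields:
        value = movie_data.get(field, "")
        if value and value != "N/A":
            parts.append(value)
    return " ".join(parts)


# Invented response, shaped like the fields used above
sample = {
    "Title": "Example Movie",
    "Plot": "An astronaut searches for his father.",
    "Genre": "Adventure, Drama",
    "Director": "N/A",
}
tags = build_tags(sample)
print(tags)  # prints: An astronaut searches for his father. Adventure, Drama
```

Skipping placeholders also keeps a literal "N/A" token out of the vocabulary, where it would otherwise make unrelated movies look slightly similar.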

In [53]:
def find_closest_match(input_title, titles):
    matches = process.extract(input_title, titles, limit=1)
    if matches:
        return matches[0][0]
    return None
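
The function above always returns the top match, however weak it is, which is why an unrelated title can be offered in the "Did you mean" prompt. One option is to require a minimum similarity before suggesting anything. The sketch below swaps in the standard library's `difflib` rather than fuzzywuzzy, so its scores are not comparable with fuzzywuzzy's. The helper `closest_title` and the cutoff of 0.6 are illustrative assumptions.

```python
import difflib


def closest_title(query, titles, cutoff=0.6):
    """Return the best case-insensitive match above cutoff, else None (hypothetical helper)."""
    # Map lowercased titles back to their stored spelling
    lowered = {t.lower(): t for t in titles}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


titles = ["The Matrix", "Interstellar", "I Married a Strange Person!"]
print(closest_title("the matrx", titles))  # prints The Matrix
print(closest_title("Ad Astra", titles))   # prints None
```

Returning `None` for weak matches lets the caller skip the confirmation prompt and go straight to the API lookup.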


In [54]:
def give_recommendation(movie):
    if movie in data['title'].values:
        recommend(movie)
    else:
        closest_match = find_closest_match(movie, data['title'].values)
        if closest_match:
            confirm = input(f"Did you mean '{closest_match}'? (yes/no): ").strip().lower()
            if confirm == 'yes' or confirm == 'y':
                print("You should try watching these movies if you liked", movie)
                recommend(closest_match)
            else:
                print("Data not found in my existing database, fetching it from the internet")
                get_movie_data(movie)
        else:
            print("Data not found in my existing database, fetching it from the internet")
            get_movie_data(movie)

    # Save updated data to CSV
    data.to_csv("dataframe.csv", index=False)
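
Note that the exact check `movie in data['title'].values` is case-sensitive. When the input is lowercased before the call, a title stored as "Interstellar" is not matched exactly and the lookup falls through to the fuzzy prompt. A case-insensitive exact lookup could look like the following sketch, where `find_exact_title` is a hypothetical helper operating on a plain list of titles.

```python
def find_exact_title(query, titles):
    """Return the stored title matching query case-insensitively, else None (hypothetical helper)."""
    lookup = {title.casefold(): title for title in titles}
    return lookup.get(query.strip().casefold())


titles = ["Interstellar", "The Matrix", "Blue Streak"]
print(find_exact_title("interstellar", titles))  # prints Interstellar
print(find_exact_title("Ad Astra", titles))      # prints None
```

The returned value is the title as stored in the dataset, so it can be passed directly to `recommend`.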

In [55]:
movie = input("Enter the name of a movie: ").lower()
give_recommendation(movie)

Enter the name of a movie: Interstellar
Did you mean 'Interstellar'? (yes/no): yes
You should try watching these movies if you liked interstellar
Silent Running
The Martian
Space Cowboys
Solaris
Space Pirate Captain Harlock


In [56]:
movie = input("Enter the name of a movie: ").lower()
give_recommendation(movie)

Enter the name of a movie: The matrix
Did you mean 'The Matrix'? (yes/no): yes
You should try watching these movies if you liked the matrix
The Matrix Revolutions
The Matrix Reloaded
Terminator 3: Rise of the Machines
Hackers
The Terminator


#### Let's try a movie that is not in the dataset

For the input Ad Astra, the recommendations include titles such as Spaceman, which is also related to space.

In [57]:
movie = input("Enter the name of a movie: ").lower()
give_recommendation(movie)

Enter the name of a movie: Ad Astra
Did you mean 'I Married a Strange Person!'? (yes/no): no
Data not found in my existing database, fetching it from the internet
Spaceman
1982
I Origins
I Love Your Work
Boogeyman


