**Movie Recommendation System**

Installing Gradio Component


In [None]:
!pip install gradio


Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import difflib
import re
import gradio as gr
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Importing Data from CSV Files

In [None]:
df = pd.read_csv('https://arjun33388.github.io/My-Resume/Movies_YBI.csv')
ff = pd.read_csv('https://arjun33388.github.io/My-Resume/bollywood_full.csv')
mf = pd.read_csv('https://arjun33388.github.io/My-Resume/Netflix_Data.csv')
nf = pd.read_csv('https://arjun33388.github.io/My-Resume/TeluguMovies_dataset.csv')
af = pd.read_csv('https://arjun33388.github.io/My-Resume/Movies_1.csv')
bf = pd.read_csv('https://arjun33388.github.io/My-Resume/Movies_2.csv')
mmf = pd.merge(nf,mf,how='outer')
nnf = pd.merge(df,ff,how='outer')
aaf = pd.merge(af,bf,how='outer')
df = pd.merge(nnf,mmf,how='outer')
df = pd.merge(df,aaf,how='outer')
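As a small illustration of what the outer merges above do: when `pd.merge` is called without `on`, it joins on every column name the two frames share, and `how='outer'` keeps rows from both sides. The two toy frames below are made up for this sketch and are not part of the datasets.

```python
import pandas as pd

# Hypothetical toy frames that share the columns Title and Genre.
left = pd.DataFrame({"Title": ["A", "B"], "Genre": ["Drama", "Comedy"]})
right = pd.DataFrame({"Title": ["B", "C"], "Genre": ["Comedy", "Action"]})

# No `on` argument: pandas joins on all shared columns (Title and Genre here).
# how="outer" keeps rows that appear in either frame.
merged = pd.merge(left, right, how="outer")

print(merged)
```

Row "B" is identical in both frames, so it appears once in the result. Rows that differ in any shared column stay separate, which is one reason a later de-duplication step on the title is still needed.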

In [None]:
df.tail()

Unnamed: 0,Title,Description,Genre
76599,Akhada,A Young ambitious wrestler Karan falls prey to...,"Drama, Sport"
76600,Akhada,A Young ambitious wrestler Karan falls prey to...,"Drama, Sport"
76601,Haseena,"Three boys swooned by the beauty of Haseena, w...",Comedy
76602,Haseena,"Three boys swooned by the beauty of Haseena, w...",Comedy
76603,Hero of Nation Chandra Shekhar Azad,Chandrashekhar Azad has been a leader in the f...,Biography


Removing Duplicate Movie Titles from the Datasets


In [None]:
# Flag titles that repeat an earlier title (ignoring case) and keep only the first occurrence.
df['Title_Lower'] = df['Title'].str.lower()
# .copy() makes the filtered frame independent, which avoids the SettingWithCopyWarning.
df = df[~df.duplicated(subset='Title_Lower', keep='first')].copy()
df = df.drop(columns=['Title_Lower'])


In [None]:
df.tail(3)

Unnamed: 0,Title,Description,Genre
76598,Akhada,A Young ambitious wrestler Karan falls prey to...,"Drama, Sport"
76601,Haseena,"Three boys swooned by the beauty of Haseena, w...",Comedy
76603,Hero of Nation Chandra Shekhar Azad,Chandrashekhar Azad has been a leader in the f...,Biography
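
The case-insensitive de-duplication can be tried on a toy frame. This is a minimal sketch; the rows below are invented for the example (only the titles echo the output above), and `toy`, `is_duplicate` and `deduped` are illustrative names.

```python
import pandas as pd

# Hypothetical rows: the first two titles differ only in letter case.
toy = pd.DataFrame({
    "Title": ["Akhada", "AKHADA", "Haseena"],
    "Genre": ["Drama", "Drama", "Comedy"],
})

# Compare lowercased titles; keep="first" marks every later repeat as a duplicate.
is_duplicate = toy["Title"].str.lower().duplicated(keep="first")

# Keep only the rows that are not flagged.
deduped = toy[~is_duplicate]

print(deduped["Title"].tolist())
```

The surviving rows keep their original index labels (0 and 2 here), which is why the index has gaps after this step.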


Resetting the Index to Consecutive Numbers

In [None]:
df.reset_index(drop=True, inplace=True)

In [None]:
df.tail(3)

Unnamed: 0,Title,Description,Genre
20778,Akhada,A Young ambitious wrestler Karan falls prey to...,"Drama, Sport"
20779,Haseena,"Three boys swooned by the beauty of Haseena, w...",Comedy
20780,Hero of Nation Chandra Shekhar Azad,Chandrashekhar Azad has been a leader in the f...,Biography


Preprocessing Data from the Datasets

This block preprocesses the data.
First, the Genre text is repeated three times so that genre terms weigh more heavily in the recommendations.
Then English stop words (such as "and", "or" and "is") are removed, punctuation is stripped, the text is lowercased, and the results are collected into an array.


In [None]:
df_features = df[['Genre', 'Description']]
# Repeat Genre three times, space-separated, so genre terms outweigh description terms.
# (Genre*3 on its own would fuse the copies into tokens such as "SportDrama".)
X = (df_features['Genre'] + ' ').str.repeat(3) + df_features['Description']
# Build the stop-word set once instead of once per row.
stop_words = set(stopwords.words('english'))
X_processed = []
for item in X:
    if isinstance(item, str):
        words = item.split()
        filtered_words = [word for word in words if word.lower() not in stop_words]
        filtered_text = " ".join(filtered_words)
        # Keep only letters, digits and whitespace, then collapse repeated spaces.
        filtered_text = ''.join(c for c in filtered_text if c.isalnum() or c.isspace())
        filtered_text = ' '.join(filtered_text.split()).lower()
        X_processed.append(filtered_text)
    else:
        # A missing Genre or Description yields NaN; treat it as an empty document.
        X_processed.append('')
X = np.array(X_processed)
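
One detail worth checking when a field is weighted by repetition is how the copies are joined. A minimal sketch with made-up strings (not taken from the dataset): multiplying a string repeats it with no separator, so adjacent copies fuse into a single token, whereas appending a space before repeating keeps each copy as a separate token.

```python
# Hypothetical genre and description strings for illustration only.
genre = "Drama, Sport"
description = "A young wrestler"

# Repetition without a separator: the end of one copy runs into the start of the next.
fused = genre * 3 + " " + description

# Repetition with a trailing space: every copy stays a separate set of tokens.
spaced = (genre + " ") * 3 + description

print(fused.split())
print(spaced.split())
```

In the fused version the word "Sport" survives as its own token only once, so the intended extra weight on the genre is mostly lost.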

Implementing Cosine Similarity Algorithm

In [None]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(X)
Similarity_Score = cosine_similarity(X)
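
To make the step above concrete, here is a minimal sketch of TF-IDF vectorisation followed by cosine similarity on three invented documents. The `docs` list is an assumption for illustration; the calls are the same scikit-learn functions used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical preprocessed documents (genre words followed by description words).
docs = [
    "drama sport young wrestler",
    "drama sport boxing champion",
    "comedy three boys",
]

# Each document becomes a sparse TF-IDF vector over the shared vocabulary.
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity: entry [i, j] compares document i with document j.
scores = cosine_similarity(tfidf_matrix)

print(scores.shape)
print(scores[0, 1] > scores[0, 2])
```

Documents 0 and 1 share the words "drama" and "sport", so they score higher with each other than with document 2, which shares no words with either. Note that the result is a dense n x n matrix, so with roughly 20,000 titles it holds several hundred million entries and needs a few gigabytes of memory.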

Recommendation List Output

In [None]:
def get_input(input_text):
    # Echo the submitted title back to the output box.
    return input_text

def movie_result(Favourite_Movie_Name):
    All_Movies_Titles_List = [str(title) for title in df['Title'].tolist()]

    # Fuzzy-match the user's input against the known titles.
    Movie_Recommendation = difflib.get_close_matches(Favourite_Movie_Name, All_Movies_Titles_List)
    if not Movie_Recommendation:
        # get_close_matches returns an empty list when nothing is similar enough.
        return "No close match found. Please check the title and try again."

    Close_Match = Movie_Recommendation[0]

    # Rows of Similarity_Score are positional, so use the title's position in the list.
    Index_of_Close_Match_Movie = All_Movies_Titles_List.index(Close_Match)

    Recommendation_Score = list(enumerate(Similarity_Score[Index_of_Close_Match_Movie]))

    Sorted_Similar_Movies = sorted(Recommendation_Score, key=lambda x: x[1], reverse=True)

    # Keep the 20 highest-scoring titles (the matched movie itself usually ranks first).
    movie_result_list = [str(df['Title'].iloc[index]) for index, _ in Sorted_Similar_Movies[:20]]

    return "\n".join(movie_result_list)

# Custom styling, passed to gr.Blocks once at construction time.
CUSTOM_CSS = """
    .gradio {
        background-color: pink;
    }
    .gradio-container {
        background-image: url('https://lh3.googleusercontent.com/pw/AP1GczO4k8qFrOiePuEk_3zQ8mZxONk6byBkRITC56XlKTwWPgkHcQv68drvcbkTWGu0fRV3He1gwF1LPkyVsLDM5XKczqCGvpbKVOBjYwc5-KF8hzWY61prV4zSICTBlU17csDNO_qRJXTh37Zw8_39LfFgfA=w1280-h723-s-no-gm?authuser=0');
        background-repeat: no-repeat;
        background-size: cover;
        padding: 20px;
    }
    .gradio-container h1 {
        font-family: sans-serif;
        color: #333;
        text-align: center;
    }
    .gradio-textbox {
        border: 2px solid #ccc;
        border-radius: 5px;
        padding: 10px;
    }
"""

with gr.Blocks(css=CUSTOM_CSS, title="SmartFlick Finder :)") as demo:
    # gr.Blocks takes a title but no description, so both are rendered as Markdown.
    gr.Markdown("# SmartFlick Finder :)")
    gr.Markdown("This is a Movie Recommendation System powered by Artificial Intelligence and Machine Learning Algorithms")
    with gr.Row():
        input_component = gr.Textbox(label="Enter your Favourite Movie")
        input_button = gr.Button("Submit Movie Title")
        process_button = gr.Button("Process Recommendations")
        result_component = gr.Textbox(label="Movie Recommendations for you!! :)")
    # Echo the typed title into the result box.
    input_button.click(
        fn=get_input,
        inputs=input_component,
        outputs=result_component
    )

    # Replace the result box content with the recommendation list.
    process_button.click(
        fn=movie_result,
        inputs=input_component,
        outputs=result_component
    )

demo.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://a923317d647519f0d0.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




**Explanation**

In this system, we use cosine similarity to recommend movies based on a given favourite movie.

Using the Genre and Description columns, we compute the similarity between the given favourite movie and every movie in the dataset.
The difflib library handles user input that is misspelled or incomplete by picking the closest title available in the dataset.
The Genre and Description text of each movie is converted into TF-IDF vectors, and a cosine similarity score is calculated between the matched movie and every other movie.
These scores are then sorted, and the 20 highest-scoring titles are displayed as the recommended movies.
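
The lookup described above can be sketched end to end on toy data. This is an illustrative sketch, not the notebook's own function: `recommend`, `titles` and the numbers in `similarity` are assumptions made up for the example, and unlike the notebook it leaves the matched movie out of its own results.

```python
import difflib
import numpy as np

# Titles borrowed from the output above; the similarity values are invented.
titles = ["Akhada", "Haseena", "Hero of Nation Chandra Shekhar Azad"]
similarity = np.array([
    [1.0, 0.1, 0.4],
    [0.1, 1.0, 0.2],
    [0.4, 0.2, 1.0],
])

def recommend(query, top_n=2):
    """Return up to top_n titles most similar to the closest match for query."""
    # Fuzzy-match the query; an empty list means nothing was similar enough.
    matches = difflib.get_close_matches(query, titles, n=1)
    if not matches:
        return []
    idx = titles.index(matches[0])
    # argsort is ascending, so reverse it to rank from most to least similar.
    order = np.argsort(similarity[idx])[::-1]
    # Skip the matched movie itself, then keep the top_n remaining titles.
    return [titles[i] for i in order if i != idx][:top_n]

print(recommend("Akhda"))
```

The misspelled query "Akhda" resolves to "Akhada", and the remaining titles come back ordered by their similarity to it. Whether to exclude the matched movie from its own recommendations is a design choice; the notebook keeps it in the list.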