## Welcome to the coding section for the "Movie Recommender System (From scratch to deployment)". 

#### If you have not read the project flow for this project, I encourage you to go through this [link](https://medium.com/@dozzynorm/movie-recommender-system-from-scratch-to-deployment-e808b5daeb94) for an initial read. 
#### This project is beginner friendly: I reference plenty of resources and explain how the code works.
#### So, without delay, let's dive in.

__Let's start by fetching the dataset. Go to the top-right corner of your screen and click the arrow button to the right of "Save Version". Open the data list and you will find the millions-of-movies dataset in your input. That is the data we are going to work with.__ \
\
__After opening the notebook you will find a code block already in it. It contains the imports commonly used in ML projects that work with DataFrames, along with code that prints the *path* of our dataset. We just have to run this cell.__


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/millions-of-movies/movies.csv


# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 10px"> Fetching Dataset </p>

In [2]:
# Important Imports
import sklearn as sk
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [3]:
data = pd.read_csv("/kaggle/input/millions-of-movies/movies.csv")

# check that the data was loaded into the variable `data`
data.head(2)

Unnamed: 0,id,title,genres,original_language,overview,popularity,production_companies,release_date,budget,revenue,runtime,status,tagline,vote_average,vote_count,credits,keywords,poster_path,backdrop_path,recommendations
0,315162,Puss in Boots: The Last Wish,Animation-Action-Adventure-Comedy-Family-Fantasy,en,Puss in Boots discovers that his passion for a...,10011.23,Universal Pictures-DreamWorks Animation,2022-12-07,90000000.0,297504470.0,103.0,Released,Say hola to his little friends.,8.611,2369.0,Antonio Banderas-Salma Hayek-Harvey Guillén-Wa...,fairy tale-talking dog-spin off-aftercreditsst...,/kuf6dutpsT0vSVehic3EZIqkOBt.jpg,/r9PkFnRUIthgBp2JZZzD380MWZy.jpg,830784-826173-417859-76600-877269-5334-46632-6...
1,536554,M3GAN,Science Fiction-Horror-Comedy,en,A brilliant toy company roboticist uses artifi...,7352.073,Universal Pictures-Blumhouse Productions-Atomi...,2022-12-28,12000000.0,101000000.0,102.0,Released,Friendship has evolved.,7.127,402.0,Allison Williams-Violet McGraw-Jenna Davis-Ami...,evil doll-aunt niece relationship-orphan-car a...,/7CNCv9uhqdwK7Fv4bR4nmDysnd9.jpg,/5kAGbi9MFAobQTVfK4kWPnIfnP0.jpg,615777-674324-661374-48209-593643-696157-67671...


# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 10px"> Pre-processing data </p>

# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 5px"> Analyzing data </p>

__The first step of pre-processing is to analyze the data. How do we analyze the data?__ 
1. find how many movies there are in our dataset
2. find how many features are given for each movie
3. find the names and data types of those features
4. check if there are missing values in the dataset
5. check if there are duplicate movies in the dataset

These are some of the questions to ask while analyzing the data.
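
The five questions above can also be answered in one pass. The sketch below is only an illustration: the `summarize` helper and the toy table are made up for this example and are not part of the notebook, which answers the same questions cell by cell.

```python
import pandas as pd

# Toy stand-in for the movies table (hypothetical values, not from the dataset)
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "title": ["A", "B", "B", None],
        "genres": ["Action-Comedy", None, None, "Drama"],
        "vote_count": [500.0, 20.0, 20.0, 900.0],
    }
)


def summarize(frame: pd.DataFrame) -> dict:
    """Collect the answers to the five analysis questions in one dictionary."""
    n_rows, n_cols = frame.shape
    return {
        "n_movies": n_rows,                              # question 1
        "n_features": n_cols,                            # question 2
        "dtypes": frame.dtypes.astype(str).to_dict(),    # question 3
        "missing": frame.isnull().sum().to_dict(),       # question 4
        "n_duplicates": int(frame.duplicated().sum()),   # question 5
    }


summary = summarize(toy)
print(summary)
```

On the real data you would call `summarize(data)` instead of `summarize(toy)`.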

In [4]:
#check the shape of data
print(data.shape)

(728651, 20)


__Observation__ \
We have about __728K movies__ in the dataset and __20 features__ for each movie.

In [5]:
# check columns of data and its data-types present in dataframe
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 728651 entries, 0 to 728650
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    728651 non-null  int64  
 1   title                 728647 non-null  object 
 2   genres                514726 non-null  object 
 3   original_language     728651 non-null  object 
 4   overview              608581 non-null  object 
 5   popularity            728651 non-null  float64
 6   production_companies  338007 non-null  object 
 7   release_date          674787 non-null  object 
 8   budget                728651 non-null  float64
 9   revenue               728651 non-null  float64
 10  runtime               692976 non-null  float64
 11  status                728651 non-null  object 
 12  tagline               108684 non-null  object 
 13  vote_average          728651 non-null  float64
 14  vote_count            728651 non-null  float64
 15  

In [6]:
# check null value in dataset
data.isnull().sum()

id                           0
title                        4
genres                  213925
original_language            0
overview                120070
popularity                   0
production_companies    390644
release_date             53864
budget                       0
revenue                      0
runtime                  35675
status                       0
tagline                 619967
vote_average                 0
vote_count                   0
credits                 226825
keywords                517461
poster_path             189291
backdrop_path           505751
recommendations         693700
dtype: int64

In [7]:
# check for duplicate values
data.duplicated().sum()

86

__Observation__ 
1. We can see that there are lots of __missing values__ in our dataset.
2. We can also see that there are __86 duplicate__ rows, meaning rows that are identical in every column.

***
# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 5px"> Cleaning data </p>

#### __The second step of pre-processing is to clean the data. How do we clean the data?__ 
__1. Drop the columns that are not needed in our recommendation system:__ \
Our recommendation system follows a content-based model, which means it finds movies similar to the ones you have watched. To do that, we need to know the features of each movie. The dataset has 20 features, and not all of them are important. The important features are those that best describe a movie. In our case these are __MovieId, Title, Genre, Overview, Credit, Keyword, Vote_count__.
These are the base features for our model. \
__2. Drop duplicates across the whole dataset, then drop duplicates that share a title and release date:__ \
Dropping duplicates keeps the same movie from being counted twice and prevents duplicate movies from causing problems later on. Dropping rows with the same title and release date removes repeated entries of the same movie. \
__3. Drop all movies whose vote_count < 350:__ \
This filter sharply reduces the number of movies in our dataset: from roughly 728K rows we end up with about 7.5K after all cleaning steps. A smaller dataset keeps the model simple and fast, and the resulting files are easier to transfer to other platforms. \
__4. Delete movies with no Genre & Overview:__ \
Movies with neither a Genre nor an Overview are of no use to us, as these are the most important features for our prediction model. \
__5. Remove the "-" sign from the contents of Genre, Keywords & Credits__ 

***
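
The five cleaning steps above can be sketched end to end on a toy table. This is a minimal illustration under assumptions: the `clean` helper, its `keep` and `min_votes` parameters, and the toy rows are invented for the example, and only the genres column is cleaned in step 5. The notebook performs the same steps one cell at a time.

```python
import pandas as pd

# Toy stand-in for the movies table (hypothetical values, not from the dataset)
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "title": ["A", "A", "B", "C", "D"],
        "release_date": ["2020-01-01", "2020-01-01", "2021-05-05", "2019-03-03", "2018-08-08"],
        "genres": ["Action-Comedy", "Action-Comedy", "Drama", None, "Horror-Thriller"],
        "overview": ["text a", "text a again", "text b", None, "text d"],
        "vote_count": [500.0, 480.0, 10.0, 900.0, 350.0],
        "budget": [1.0, 1.0, 2.0, 3.0, 4.0],
    }
)


def clean(frame, keep, min_votes=350):
    out = frame[keep].drop_duplicates()                            # steps 1 and 2 (whole rows)
    out = out.drop_duplicates(subset=["title", "release_date"])    # step 2 (title + date)
    out = out[out["vote_count"] >= min_votes]                      # step 3
    out = out.fillna({"genres": "", "overview": ""})               # empty string marks "missing"
    out = out[~((out["genres"] == "") & (out["overview"] == ""))]  # step 4
    out = out.assign(genres=out["genres"].str.replace("-", " ", regex=False))  # step 5
    return out.reset_index(drop=True)


cleaned = clean(
    toy, keep=["id", "title", "release_date", "genres", "overview", "vote_count"]
)
print(cleaned[["title", "genres"]])
```

In this toy run, the second "A" is removed as a title and date duplicate, "B" falls below the vote threshold, and "C" has neither genres nor overview, so only "A" and "D" remain.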

In [8]:
# dropping unnecessary columns
df = data.drop(["production_companies", "popularity", "budget", "revenue", "status", "recommendations", "runtime", "vote_average", "backdrop_path", "tagline"], axis=1)

In [9]:
# dropping duplicate rows
df.drop_duplicates(inplace=True)

In [10]:
# checking duplicates in title
df.title.duplicated().sum()

86912

In [11]:
# check if duplicates titles have same release date 
df[["title", "release_date"]].duplicated().sum()

2339

In [12]:
# get rid of duplicates with the same title and release date
df.drop_duplicates(subset=["title","release_date"], inplace=True)

In [13]:
# keep only movies with vote_count of at least 350 and reset the index
df = df[df.vote_count >= 350].reset_index()

In [14]:
df.isnull().sum()

index                  0
id                     0
title                  0
genres                 0
original_language      0
overview               1
release_date           0
vote_count             0
credits                8
keywords             237
poster_path            0
dtype: int64

In [15]:
# replace all remaining null values (overview, credits, keywords) with an empty string
df.fillna("", inplace = True)

In [16]:
# delete movies that have neither genres nor overview
index = df[(df.genres == "") & (df.overview == "")].index
df.drop(index, inplace=True)

In [17]:
# replace "-" with " " in genres and keywords; for credits, join each name into one token and keep the first five names
df.genres = df.genres.apply(lambda x: " ".join(x.split("-")))
df.keywords = df.keywords.apply(lambda x: " ".join(x.split("-")))
df.credits = df.credits.apply(lambda x: " ".join(x.replace(" ", "").split("-")[:5]))

To predict similar movies using natural language processing techniques, we will use text-based data as input for our machine learning model. To make this easier, we will create a new column called "Tags" that combines all the important text features: __Overview, Genres, Credits, Keywords, and Original Language__. The model will compare movies based on this single column.

In [18]:
# making tags for prediction
df["tags"] = df.overview + " " + df.genres + " " + df.credits + " " + df.keywords + " " + df.original_language

In [19]:
# making a new DataFrame with only the important features
new_df = df[["id", "title", "tags", 'poster_path']]

In [20]:
# lowercase the tags for consistent processing
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
new_df = new_df.assign(tags=new_df.tags.str.lower())


In [21]:
new_df.tags[0]

'puss in boots discovers that his passion for adventure has taken its toll: he has burned through eight of his nine lives leaving him with only one life left. puss sets out on an epic journey to find the mythical last wish and restore his nine lives. animation action adventure comedy family fantasy antoniobanderas salmahayek harveyguillén wagnermoura florencepugh fairy tale talking dog spin off aftercreditsstinger talking cat fear of death en'

In [22]:
new_df.head(1)

Unnamed: 0,id,title,tags,poster_path
0,315162,Puss in Boots: The Last Wish,puss in boots discovers that his passion for a...,/kuf6dutpsT0vSVehic3EZIqkOBt.jpg


# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 10px"> Vectorizer </p>

***
# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 5px"> Stemming </p>

[nltk.stem.porter doc link](https://www.nltk.org/api/nltk.stem.porter.html)
<div class="alert alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;  Stemming is the process of reducing words to their base or root form, typically by stripping suffixes and other inflections to obtain the stem (for example, the word actors turns into actor). 
The goal of stemming is to reduce words to their core form, so that words with the same root are recognized as the same word regardless of their grammatical form. This is useful in text-based natural language processing tasks such as text classification, information retrieval, and machine translation. 
In short, it is used to neutralize the grammar of the tags.
</div>

For stemming we use __nltk__ library and use __PorterStemmer__ for stemming.
***
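
To make the idea concrete before reaching for nltk, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which applies many more rules. The names `toy_stem` and `stem_text` and the suffix list are assumptions made only for this illustration.

```python
def toy_stem(word: str) -> str:
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words such as "is" survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_text(text: str) -> str:
    """Lowercase a text, stem every word, and join the stems back together."""
    return " ".join(toy_stem(word) for word in text.lower().split())


print(stem_text("Actors acting acted"))
```

All three words collapse onto stems that share the root "act", which is the effect we want from stemming: different grammatical forms of a word end up as the same token.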

In [23]:
#import nltk library and porter module
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [24]:
# stem every word in a text and join the stems back into one string
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    
    return " ".join(y)    

In [25]:
# applying stem function in tags
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
new_df = new_df.assign(tags=new_df["tags"].apply(stem))


***
# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 5px"> Text Vectorization </p>

[CountVectorizer docs link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
<div class="alert alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp; Text vectorization is the process of converting text data into numerical vectors so that they can be used as input for machine learning models.  
</div>

We will use CountVectorizer for this task. It converts our text data into numbers and can also drop stop words (such as "and" and "the"). We will use it to build a "__bag-of-words__" from the 5000 most frequent words, which is also the start of our __Model Building__.

***
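
The bag-of-words idea can be sketched in plain Python. This is only an approximation of what CountVectorizer does (its tokenization and stop-word list differ). `STOP_WORDS`, `build_vocabulary`, and `vectorize` are hypothetical names for this example.

```python
from collections import Counter

STOP_WORDS = {"a", "the", "and", "in", "of"}  # tiny illustrative list


def build_vocabulary(docs, max_features):
    """Keep the max_features most frequent non-stop words, sorted alphabetically."""
    counts = Counter()
    for doc in docs:
        counts.update(w for w in doc.lower().split() if w not in STOP_WORDS)
    # Most frequent first; ties are broken alphabetically
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    return sorted(word for word, _ in top)


def vectorize(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["the cat and the dog", "a dog in the park", "cat cat bird"]
vocab = build_vocabulary(docs, max_features=3)
toy_vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(toy_vectors)
```

Each document becomes a list of counts with one position per vocabulary word. With `max_features=5000`, each movie in the notebook becomes a vector of 5000 counts in the same way.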

In [26]:
# importing countvectorizer
from sklearn.feature_extraction.text import CountVectorizer
# setting for 5000 most repeated words, and exclude stop words
cv = CountVectorizer(stop_words="english",max_features=5000)

In [27]:
# fitting tags in count vectorizer
vectors = cv.fit_transform(new_df["tags"]).toarray() #change it into array to use

In [28]:
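# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
list(cv.get_feature_names_out()[80:85])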



['activist', 'actor', 'actress', 'actual', 'ad']

***
# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 10px"> Model Building </p>

[Cosine_similarity link](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)

<div class="alert alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp; Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them; here, each vector is the bag-of-words representation of one movie's tags. \
This step calculates the cosine similarity between all pairs of vectors in the dataset. In general the value ranges from -1 to 1, but because word counts are never negative, our values fall between 0 and 1, where 1 means the tags point in exactly the same direction and 0 means they share no words.
</div>
***
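
For two vectors u and v, cosine similarity is the dot product divided by the product of their lengths: cos θ = (u · v) / (‖u‖ ‖v‖). The sketch below computes it by hand for tiny count vectors. The `cosine` helper and the example vectors are illustrative assumptions; the notebook relies on scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


a = [1, 1, 0]  # e.g. counts for three vocabulary words
b = [2, 2, 0]  # same words in the same proportion, just repeated more often
c = [0, 0, 3]  # shares no words with a

print(cosine(a, b))
print(cosine(a, c))
```

Vectors `a` and `b` point in the same direction, so their similarity is 1 (up to floating-point rounding) even though their lengths differ, while `a` and `c` share no words and score 0.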

In [29]:
# Similarity vector with cosine 
from sklearn.metrics.pairwise import cosine_similarity
# calculating similarity of each movie with all movies
similarity = cosine_similarity(vectors)

In [30]:
# similarity of each movie with all the movies
similarity.shape

(7525, 7525)

***
# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 10px"> Testing Model </p>

To test our model, we will write a function that takes the name of a movie and recommends the 5 most similar movies.
***

In [31]:
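# find a movie by title and print the 5 most similar movies
def recommend(movies):
    # index of the requested title (raises IndexError if the title is not in new_df)
    movie_index = new_df[new_df.title == movies].index[0]
    distances = similarity[movie_index]
    # sort by similarity, highest first; the first entry is normally the movie itself, so skip it
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]

    for i in movies_list:
        print(new_df.iloc[i[0]].title)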

    

In [32]:
# checking similar movies test 1
recommend("Batman")

The Dark Knight
Batman: Mask of the Phantasm
Batman & Robin
Batman Returns
Dick Tracy


In [33]:
# test 2
recommend("Black Adam")

Zack Snyder's Justice League
X-Men: Apocalypse
Justice League
The Wolverine
Justice League vs. Teen Titans


__Observation__ \
The function returns 5 similar movies for each query, which suggests that our model building is successful.
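
As a possible refinement, the lookup can be made a little more defensive. The sketch below uses plain lists in place of the DataFrame and the similarity array. `top_k_similar` and the toy data are hypothetical and only illustrate two ideas: returning an empty list for an unknown title, and excluding the query movie by position instead of assuming it is ranked first.

```python
def top_k_similar(titles, similarity, query, k=5):
    """Return up to k titles most similar to `query`, or [] if it is unknown."""
    if query not in titles:
        return []
    row = titles.index(query)
    # Rank every position by its similarity to the query, highest first
    ranked = sorted(range(len(titles)), key=lambda j: similarity[row][j], reverse=True)
    # Drop the query itself explicitly, then keep the first k
    return [titles[j] for j in ranked if j != row][:k]


toy_titles = ["A", "B", "C", "D"]
toy_similarity = [
    [1.0, 0.9, 0.1, 0.5],
    [0.9, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]

print(top_k_similar(toy_titles, toy_similarity, "A", k=2))
print(top_k_similar(toy_titles, toy_similarity, "Unknown"))
```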

***
# <p style="background-color:#C71A27;font-family:verdana;color:white;text-align:center;letter-spacing:0.5px;font-size:100%;padding: 10px"> Deployment </p>

[Pickle Library Link](https://docs.python.org/3/library/pickle.html)

<div class="alert alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp; Pickle is a Python library that allows you to serialize and deserialize Python objects, such as lists, dictionaries, and custom classes, to and from bytes, so that they can be saved to a file or sent over a network.
The library provides two main functions: pickle.dump() and pickle.load(). The pickle.dump() function serializes an object and writes it to a file-like object, while pickle.load() deserializes an object from a file-like object.
</div>
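
A quick round trip shows the two functions working together. This sketch writes a small, made-up `payload` dictionary to a temporary file and reads it back; the file name `toy.pkl` is arbitrary. Note that unpickling can execute arbitrary code, so only load pickle files you trust.

```python
import os
import pickle
import tempfile

# Small stand-in for the objects we want to ship (hypothetical values)
payload = {"titles": ["A", "B"], "similarity": [[1.0, 0.5], [0.5, 1.0]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    # Serialize the object to disk; the with-block closes the file for us
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    # Deserialize it again from the same file
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```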


In [34]:
# import Pickle
import pickle 

In [35]:
# save the movies DataFrame as a portable pickle file
with open("movies.pkl", "wb") as f:
    pickle.dump(new_df, f)

In [36]:
# save the similarity matrix as a portable pickle file
with open("similarity.pkl", "wb") as f:
    pickle.dump(similarity, f)

First download the files movies.pkl and similarity.pkl (the latter is large; you can make it smaller by reducing the number of movies).

To download them, click the arrow in the top-right corner, go to Output, open /kaggle/working, and download the files.

### Now we will deploy this model in Streamlit
The code below is from the app.py file, which deploys your model in Streamlit. All you have to do is understand the code and deploy it. Nothing here is new; it reuses everything we have already done.

In [37]:
'''
# first import streamlit and pickle 
import streamlit as st
import pickle

# extract the new_df dataframe from movies.pkl
movies_list = pickle.load(open("movies.pkl", "rb"))
# extract the titles of movies
movies_list_title = movies_list["title"].values

# extract the similarity which contain our cosine similarity values
similarity = pickle.load(open("similarity.pkl", "rb"))


# make a recommend function which will take movie title and return 5 similar movies with their posters
def recommend(movie):
    movie_index = movies_list[movies_list["title"] == movie].index[0]
    distances = similarity[movie_index]
    sorted_movie_list = sorted(list(enumerate(distances)), reverse=True,
                               key=lambda x:x[1])[1:6]

    recommended_movies = []
    recommended_posters = []
    for i in sorted_movie_list:
        # use one positional lookup for both fields so they match the similarity row order
        movie_row = movies_list.iloc[i[0]]
        recommended_movies.append(movie_row.title)
        recommended_posters.append("https://image.tmdb.org/t/p/original" + movie_row.poster_path)

    return recommended_movies,  recommended_posters



# Create a title for your Streamlit page
st.title("Project Movie Recommender System")

# Create an input box for the movie name 
selected_movie_name = st.selectbox(
    "What is the movie name?",
    movies_list_title
)

# create a recommend button with function of displaying recommended movies and movie posters
if st.button("Recommend"):
    recommendation, movie_posters = recommend(selected_movie_name)

    col1, col2, col3, col4, col5 = st.columns(5)

    with col1:
        st.write(recommendation[0])
        st.image(movie_posters[0])
    with col2:
        st.write(recommendation[1])
        st.image(movie_posters[1])
    with col3:
        st.write(recommendation[2])
        st.image(movie_posters[2])
    with col4:
        st.write(recommendation[3])
        st.image(movie_posters[3])
    with col5:
        st.write(recommendation[4])
        st.image(movie_posters[4])

'''


After writing this code, go to your terminal and run:

__PS E:\Machine learning bootcamp\Movies-Recomender-System> streamlit run app.py__

This will open a browser tab on localhost showing the page that you built in the app.py file.





<div class="alert alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp; Thank you for reading! I hope you keep making projects you like!
</div>