# 🚢 🛳**Recommendations are generated from the similarity of product content, measured on the product descriptions. The descriptions of all items are filtered against the description of an item the user has already accessed, and a limited number of the closest matches are recommended as the next item. Filtering on product descriptions follows the steps below:**

* *Texts are represented mathematically (text vectorization).*
* *The similarity or distance between the resulting vectors is measured.*

![](https://editor.analyticsvidhya.com/uploads/62904R0.PNG)

# 📒📔📕***Text vectorization***
* ***Count vector***: *The aim is to make text usable in mathematical operations. Because the data consists of text, it must first be converted to numbers; a count vector does this by recording how many times each word occurs in each document.*
* ***TF-IDF matrix***: *Word frequencies are normalized both within each document and across the whole corpus. The word vectors are standardized by taking all documents of the document-term matrix into account. In the count vector method, very frequent words can create bias; TF-IDF reduces this bias by weighting each word by its frequency in the specific document and by its rarity across all documents.*
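The difference between the two representations can be seen on a tiny example. The sketch below uses a made-up three-sentence corpus (an assumption for illustration, not data from the movies dataset) and compares the raw count and the TF-IDF weight of a word that occurs in every document ("the") with one that occurs in only one document ("toy").

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the movies dataset.
docs = [
    "the toy cowboy meets the toy spaceman",
    "the board game traps the players",
    "the spaceman leaves the planet",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)    # raw term counts per document

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)   # TF-IDF weights per document

# Both vectorizers tokenize the same way by default, so they share
# the same word-to-column mapping.
the_col = count_vec.vocabulary_["the"]
toy_col = count_vec.vocabulary_["toy"]

# In the first document both words occur twice, so the counts are equal...
print("counts :", counts[0, the_col], counts[0, toy_col])
# ...but TF-IDF gives the rarer word "toy" a larger weight than "the".
print("tf-idf :", weights[0, the_col], weights[0, toy_col])
```

In the count matrix the two words look equally important for the first sentence, while in the TF-IDF matrix the word that appears in every document is down-weighted.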

# ***🔟Measuring similarity***
* ***Cosine similarity***
* ***Euclidean distance***
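As a minimal sketch of how the two measures behave, the example below compares three hand-made vectors (assumed values, chosen only for illustration). Cosine similarity depends on the angle between vectors and ignores their length, whereas Euclidean distance also reacts to length.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Hypothetical vectors: b points in the same direction as a but is twice as long,
# c is orthogonal to a.
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])
c = np.array([[0.0, 0.0, 3.0]])

cos_ab = cosine_similarity(a, b)[0, 0]      # same direction, so close to 1
cos_ac = cosine_similarity(a, c)[0, 0]      # orthogonal, so close to 0
dist_ab = euclidean_distances(a, b)[0, 0]   # non-zero although the direction is identical

print(cos_ab, cos_ac, dist_ab)
```

For text vectors, where a long and a short description of the same topic should still match, this length-independence is one reason cosine similarity is a common choice.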

***A recommendation system will be built from the "overview" and "title" variables of the movies in the dataset, using the content-based filtering method.***

***The "overview" variable supplies the words that characterize each movie. The "title" variable identifies the product to be recommended.***

# **🟡1.Import Libraries**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/the-movies-dataset/ratings.csv
/kaggle/input/the-movies-dataset/links_small.csv
/kaggle/input/the-movies-dataset/credits.csv
/kaggle/input/the-movies-dataset/keywords.csv
/kaggle/input/the-movies-dataset/movies_metadata.csv
/kaggle/input/the-movies-dataset/ratings_small.csv
/kaggle/input/the-movies-dataset/links.csv


In [2]:
pd.set_option("display.max_columns", None)
pd.set_option("display.width",500)

# 🟣**2.Getting to know the dataset**

In [3]:
df = pd.read_csv('/kaggle/input/the-movies-dataset/movies_metadata.csv', low_memory=False)
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [4]:
def check_dt(dataframe):
    print("SHAPE".center(70, "-"))
    print(dataframe.shape)
    print("TYPE".center(70, "-"))
    print(dataframe.dtypes)
    print("INFO".center(70, "-"))
    dataframe.info()  # info() prints its own report and returns None
    print("NA".center(70, "-"))
    print(dataframe.isnull().sum())
    print("DESCRIBE".center(70, "-"))
    print(dataframe.describe().T)
    print("NUNIQUE".center(70, "-"))
    print(dataframe.nunique())
check_dt(df)

--------------------------------SHAPE---------------------------------
(45466, 24)
---------------------------------TYPE---------------------------------
adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object
---------------------------------INFO---

# 🟢**3.Generating the TF-IDF matrix**

In [5]:
tfidf = TfidfVectorizer(stop_words="english")

***Rows whose "overview" is missing (NaN) are identified, and the missing values are replaced with an empty string ("").***

In [6]:
df[df["overview"].isnull()]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
32,False,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 12, ...",,78802,tt0114952,fr,"Guillaumet, les ailes du courage",,0.745542,/k6ODtR38dKEfuzSGjggr8KDyAF4.jpg,"[{'name': 'Iwerks Entertainment', 'id': 70801}]","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",1996-09-18,0.0,50.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Wings of Courage,False,6.8,4.0
300,False,,22000000,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",,161495,tt0114296,sv,Roommates,,3.395867,/hvHNlMvWS2GBt7RR971bJ3k4bJc.jpg,"[{'name': 'Hollywood Pictures', 'id': 915}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-03-01,12400000.0,108.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Roommates,False,6.4,7.0
634,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,287305,tt0117312,de,Peanuts – Die Bank zahlt alles,,0.066123,/wpk30SvRHmjC2plgKHZXxG0FlKd.jpg,"[{'name': 'Westdeutscher Rundfunk (WDR)', 'id'...","[{'iso_3166_1': 'DE', 'name': 'Germany'}]",1996-03-21,0.0,,[],Released,,Peanuts – Die Bank zahlt alles,False,4.0,1.0
635,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,339428,tt0116485,de,Happy Weekend,,0.002229,,"[{'name': 'Senator Film Produktion', 'id': 191}]","[{'iso_3166_1': 'DE', 'name': 'Germany'}]",1996-03-14,65335.0,,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,,Happy Weekend,False,0.0,0.0
641,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,10801,tt0117788,de,Das Superweib,,0.821299,/AbhMKCh3fV5PY2B9uSPF1DWEvq2.jpg,"[{'name': 'Constantin Film', 'id': 47}]","[{'iso_3166_1': 'DE', 'name': 'Germany'}]",1996-03-06,0.0,86.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,,The Superwife,False,5.3,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45342,False,,0,"[{'id': 18, 'name': 'Drama'}]",,199887,tt1771636,en,Over/Under,,0.704642,/1xLaIBGGPE4APtBJdfeuyOWICZ0.jpg,"[{'name': 'Fox Television Studios', 'id': 6529...","[{'iso_3166_1': 'US', 'name': 'United States o...",2013-01-04,0.0,87.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Over/Under,False,4.0,2.0
45377,False,,0,"[{'id': 12, 'name': 'Adventure'}]",,317389,tt0070695,es,Simbad e il califfo di Bagdad,,0.006352,/izk7KbT6LZO9baEhCkOZYMgj60w.jpg,"[{'name': 'Roas Produzioni', 'id': 21137}, {'n...","[{'iso_3166_1': 'IT', 'name': 'Italy'}]",1973-07-22,0.0,,"[{'iso_639_1': 'it', 'name': 'Italiano'}]",Released,,Simbad e il califfo di Bagdad,False,0.0,0.0
45398,False,,1254040,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",http://lmtr.fi/,468707,tt5742932,fi,Lauri Mäntyvaaran tuuheet ripset,,0.347806,/rKOpJuwb7pTqYVShHM2tl25VxyF.jpg,"[{'name': 'Elokuvayhtiö Oy Aamu', 'id': 84883}]","[{'iso_3166_1': 'FI', 'name': 'Finland'}]",2017-07-28,0.0,90.0,"[{'iso_639_1': 'fi', 'name': 'suomi'}]",Released,,Thick Lashes of Lauri Mäntyvaara,False,8.0,1.0
45399,False,,750000,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",,280422,tt3805180,ru,Все и сразу,,0.201582,/hNsmPpl3zLG36jr4EIEd5P8I4pa.jpg,"[{'name': 'Кинокомпания «Lunapark»', 'id': 420...","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",2014-06-05,3.0,0.0,"[{'iso_639_1': 'ru', 'name': 'Pусский'}]",Released,,All at Once,False,6.0,4.0


In [7]:
df["overview"] = df["overview"].fillna("")

***The TF-IDF matrix is created.***

In [8]:
tfidf_matris = tfidf.fit_transform(df["overview"])

# 🟠**4.Computing the similarity scores**

In [9]:
cosine_sim = cosine_similarity(tfidf_matris, tfidf_matris)
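With 45466 movies, the full similarity matrix has 45466 × 45466 entries, which is roughly 16.5 GB as dense float64 values. When only one movie is queried, a lighter alternative is to compute just that movie's row. The sketch below shows the idea on a small made-up list of overviews (an assumption for illustration) and checks that the single row matches the corresponding row of the full matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical overviews standing in for df["overview"].
overviews = [
    "A cowboy toy is jealous of a new spaceman toy",
    "A spaceman crashes on a distant planet",
    "Two families fight over a wedding",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)

query_row = 0
# Similarity of one movie to all movies: a vector of length n, not an n x n matrix.
row_scores = cosine_similarity(matrix[query_row], matrix).ravel()

# For comparison only; on the real dataset this is the expensive step.
full = cosine_similarity(matrix, matrix)
print(np.allclose(row_scores, full[query_row]))
```

On the real data, the same call would take `tfidf_matris[mov_index]` and `tfidf_matris` as its two arguments.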

In [10]:
df.index

RangeIndex(start=0, stop=45466, step=1)

***Movie titles are converted to a pandas Series that maps each title to its row index. Some titles occur more than once, so the duplicates are then removed, keeping the last occurrence, so that each title points to a single row in the recommendation system.***

In [11]:
index = pd.Series(df.index, index=df["title"])

In [12]:
index

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64

In [13]:
index = index[~ index.index.duplicated(keep='last')]
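The effect of `duplicated(keep='last')` is easiest to see on a small example. The sketch below uses three made-up titles (assumed for illustration), one of which is repeated: before de-duplication the lookup returns two rows for the repeated title, afterwards only the last one.

```python
import pandas as pd

# Hypothetical titles; "Roommates" appears at rows 0 and 2.
titles = pd.Series(["Roommates", "Heat", "Roommates"])
lookup = pd.Series(titles.index, index=titles)

# A repeated title returns a Series with several row numbers.
print(lookup["Roommates"])

# Keep only the last occurrence of each title.
deduped = lookup[~lookup.index.duplicated(keep="last")]

# Now the lookup returns a single row number.
print(deduped["Roommates"])
```

Keeping the last occurrence is a choice, not a requirement; `keep='first'` would point each title at its earliest row instead.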

In [14]:
mov_index = index["Roommates"]

In [15]:
cosine_sim[mov_index]

array([0., 0., 0., ..., 0., 0., 0.])

***The row index of the selected title is looked up in the index variable and stored in the "mov_index" variable. The "sim_score" variable holds the similarity of that movie to every movie in the dataset.***

In [16]:
sim_score = pd.DataFrame(data=cosine_sim[mov_index], columns=["Score"])

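***The scores in sim_score are sorted from largest to smallest, and the indexes at positions 1 to 10 are assigned to the "mov_indexes" variable; position 0 is skipped because it is expected to be the selected movie itself. Finally, these indexes are mapped back to movie titles through the "title" variable.***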

In [17]:
mov_indexes = sim_score.sort_values("Score", ascending=False)[1:11].index

In [18]:
mov_indexes

Int64Index([30281, 30305, 30306, 30307, 30308, 30309, 30310, 30311, 30312, 30313], dtype='int64')

In [19]:
df["title"].iloc[mov_indexes]

30281                           Jab Tak Hai Jaan
30305                     Paranormal Whacktivity
30306    Comedy Central Roast of Pamela Anderson
30307                             The Golden Bat
30308                                   Festival
30309                              The Prospects
30310                    The Colour Out of Space
30311                                    Foxtrot
30312                           Ghost Graduation
30313                                 Duck Amuck
Name: title, dtype: object
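The printed row of `cosine_sim[mov_index]` shows only zeros and the ten returned indexes are almost consecutive, which is what one would expect if the selected overview were empty (a "Roommates" entry with a missing overview appears in the NaN listing above). That reading is an inference from the outputs, not something the notebook states. The sketch below wraps the steps into one function and guards against this case; the function name `recommend`, its parameters, and the small data frame are hypothetical and exist only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(title, frame, top_n=10):
    """Return up to top_n titles whose overviews are most similar to `title`.

    Assumes `frame` has "title" and "overview" columns and a default RangeIndex.
    """
    overviews = frame["overview"].fillna("")
    matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)

    # Map each title to a single row, keeping the last occurrence of duplicates.
    lookup = pd.Series(frame.index, index=frame["title"])
    lookup = lookup[~lookup.index.duplicated(keep="last")]
    row = lookup[title]

    # Similarity of the selected movie to every movie.
    scores = pd.Series(cosine_similarity(matrix[row], matrix).ravel(), index=frame.index)
    scores = scores.drop(row)      # exclude the selected movie explicitly
    scores = scores[scores > 0]    # drop movies that share no terms with it
    best = scores.sort_values(ascending=False).head(top_n)
    return frame.loc[best.index, "title"]


# Hypothetical data; the last movie has no overview.
toy = pd.DataFrame({
    "title": ["Space Toys", "Planet Crash", "Wedding Feud", "Silent"],
    "overview": [
        "A toy spaceman and a toy cowboy become friends",
        "A spaceman crashes on a distant planet",
        "Two families feud at a wedding",
        None,
    ],
})

print(list(recommend("Space Toys", toy)))
print(list(recommend("Silent", toy)))
```

Dropping the selected row by label, rather than slicing off the first sorted position, avoids relying on the movie ranking first against itself, and filtering out zero scores means a movie without an overview yields an empty result instead of ten arbitrary titles.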