# Project Title: Movie Recommendation System with Content-Based Filtering

## Problem Statement:
Develop a movie recommendation system that takes a user's input movie and provides the top 5 movie recommendations based on content similarity. The system will be presented as a web application using the Streamlit library in Python.

## Project Overview:
The goal of this project is to create a user-friendly movie recommendation system that assists users in discovering movies similar to their input preferences. Given a movie name as input, the system will employ content-based filtering to identify and recommend movies that share similar content characteristics.

## Key Components:

1. Data Collection: Gather a dataset of movies with attributes like titles, genres, and overviews.

2. Data Preprocessing: Clean and preprocess the data, handling missing values and ensuring data consistency.

3. Feature Extraction: Use CountVectorizer to transform each movie's combined tags (overview, genres, keywords, cast, and director) into numerical features that can be used for content comparison.

4. Cosine Similarity: Calculate the cosine similarity between the features of the input movie and all other movies to identify the most similar ones.

5. Recommendation Generation: Rank and select the top 5 movies with the highest cosine similarity as recommendations for the user.

6. User Interface: Develop a web-based user interface using Streamlit that takes a user's input movie and displays the recommended movies along with their posters.

## Project Deliverables:
 • A functional recommendation system that provides accurate movie suggestions based on content.
 • A Streamlit web application that allows users to interact with the recommendation system.
 • Display of movie posters and names for both the input movie and recommended movies.
 • Clean, well-organized code and appropriate documentation.

## Evaluation:
The success of this project can be measured based on the accuracy and relevance of the recommended movies. You can assess the system's performance by obtaining user feedback and evaluating how often users find the recommendations appealing and in line with their preferences.

## Enhancements:
To further improve the project, you might consider adding more advanced recommendation techniques like collaborative filtering, incorporating user feedback for personalized recommendations, and expanding the dataset for a wider range of movie suggestions.

Note: While your current approach focuses on content-based recommendations, other factors like user preferences, ratings, and external data sources can be integrated to create a more comprehensive recommendation system in the future.

In [1]:
import numpy as np # linear algebra
import pandas as pd
movies = pd.read_csv('datasets/tmdb_5000_movies.csv')
credits = pd.read_csv('datasets/tmdb_5000_credits.csv') 

In [2]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [3]:
movies.shape

(4803, 20)

In [4]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [5]:
movies = movies.merge(credits,on='title')

In [6]:
movies.head()
# budget
# homepage
# id
# original_language
# original_title
# popularity
# production_companies
# production_countries
# release_date (not sure)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [7]:
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [8]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [9]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4809 non-null   int64 
 1   title     4809 non-null   object
 2   overview  4806 non-null   object
 3   genres    4809 non-null   object
 4   keywords  4809 non-null   object
 5   cast      4809 non-null   object
 6   crew      4809 non-null   object
dtypes: int64(1), object(6)
memory usage: 300.6+ KB


In [10]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [11]:
movies.dropna(inplace=True)

In [12]:
movies.duplicated().sum()

0

In [13]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [14]:
import ast  # parses the stringified list of dicts into real Python objects

In [15]:
import ast
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [16]:
def convert(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name']) 
    return L 
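
The `convert` helper above can also be written as a list comprehension. The sketch below applies that idea to a shortened version of the genre string from In [13]. The name `extract_names` is hypothetical, and the `json.loads` variant assumes the column holds valid JSON, which is true of the sample value shown but is not checked for the whole dataset.

```python
import ast
import json

# Shortened sample of the genres value shown in In [13]
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def extract_names(text):
    """Return the 'name' field of every dict in a stringified list of dicts."""
    return [item["name"] for item in ast.literal_eval(text)]

names = extract_names(raw)

# Alternative parser, assuming the string is valid JSON
names_json = [item["name"] for item in json.loads(raw)]

print(names)
print(names_json)
```

Both parsers give the same list for this sample; `ast.literal_eval` is what the notebook uses throughout.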

In [17]:
movies['genres'] = movies['genres'].apply(convert)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [18]:
movies['keywords'] = movies['keywords'].apply(convert)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [19]:
movies['cast'][0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [20]:
def convert3(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter != 3:
            L.append(i['name'])
            counter+=1
        else:
            break
    return L 

In [21]:
movies['cast'] = movies['cast'].apply(convert3)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [22]:
# movies['cast'] = movies['cast'].apply(lambda x:x[0:3])

In [23]:
def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L 

In [24]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [25]:
#movies['overview'] = movies['overview'].apply(lambda x:x.split())
movies.sample(5)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
932,752,V for Vendetta,In a world in which Great Britain has become a...,"[Action, Thriller, Fantasy]","[detective, vatican, fascism, satanism, fascis...","[Natalie Portman, Hugo Weaving, Stephen Rea]",[James McTeigue]
1342,11001,Blue Streak,Miles Logan is a jewel thief who just hit the ...,"[Action, Comedy, Crime]","[robbery, diamant, police operation, police ev...","[Martin Lawrence, Luke Wilson, Dave Chappelle]",[Les Mayfield]
2919,907,Doctor Zhivago,Doctor Zhivago is the filmed adapation of the ...,"[Drama, Romance, War]","[love triangle, nurse, suicide attempt, loss o...","[Omar Sharif, Julie Christie, Geraldine Chaplin]",[David Lean]
199,22,Pirates of the Caribbean: The Curse of the Bla...,"Jack Sparrow, a freewheeling 17th-century pira...","[Adventure, Fantasy, Action]","[exotic island, blacksmith, east india trading...","[Johnny Depp, Geoffrey Rush, Orlando Bloom]",[Gore Verbinski]
3281,43923,It's Kind of a Funny Story,A clinically depressed teenager gets a new sta...,"[Comedy, Drama]","[suicide, depression, independent film, coming...","[Keir Gilchrist, Emma Roberts, Zach Galifianakis]","[Ryan Fleck, Anna Boden]"


In [26]:
def collapse(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1

In [27]:
movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)
movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)
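
A likely motivation for removing the spaces (the notebook does not state one) is that a whitespace tokenizer would otherwise split multi-word names into separate tokens, so two different people who share a first name would look partly alike. A small illustration with plain `str.split`, using two names that appear in the data above:

```python
def collapse(names):
    """Remove spaces inside each name so it survives tokenization as one token."""
    return [name.replace(" ", "") for name in names]

cast = ["Sam Worthington"]
crew = ["Sam Mendes"]

# Without collapsing, both lists contribute the token "Sam"
shared_before = set(" ".join(cast).split()) & set(" ".join(crew).split())

# After collapsing, each person is a single distinct token
shared_after = set(collapse(cast)) & set(collapse(crew))

print(shared_before)
print(shared_after)
```

Here `shared_before` contains `Sam`, while `shared_after` is empty.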

In [28]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


In [29]:
movies['overview'][0]  # overview text of the first row

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [30]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())  # split the string into a list of words so it can be concatenated with the other list columns

In [31]:
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [32]:
new = movies.drop(columns=['overview','genres','keywords','cast','crew'])
#new.head()

In [33]:
# join each list of tokens into a single space-separated string
new['tags'] = new['tags'].apply(lambda x: " ".join(x))
new.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [34]:
new['tags'][0]  #gives all tags of the first movie

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [35]:
# convert all tags to lowercase
new['tags'] = new['tags'].apply(lambda x:x.lower())
new.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


The CountVectorizer is a feature extraction technique used in natural language processing and text analysis. It's a part of the scikit-learn library in Python. Here's a concise overview of its use:

1. Text to Matrix: Converts a collection of text documents into a matrix of token counts (word frequencies).
2. Feature Extraction: Transforms text data into numerical format usable by machine learning models.
3. Bag of Words: Creates a "bag of words" representation where each document is represented by the count of its words.
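
To make the bag-of-words idea concrete, here is a hand-rolled sketch in plain Python. It is an illustration of the counting step only, not scikit-learn's implementation, and the three toy documents are made up.

```python
from collections import Counter

# Made-up toy "tags" for three movies
docs = [
    "action adventure space alien",
    "action crime drama",
    "space alien space war",
]

# Vocabulary: every distinct token, in sorted order
vocab = sorted({token for doc in docs for token in doc.split()})

def to_counts(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# One row per document, one column per vocabulary word
matrix = [to_counts(doc) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

The real `CountVectorizer` does more than this sketch: with the settings used in this notebook it also lowercases the text, drops English stop words, and keeps only the `max_features` most frequent words.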

In [36]:
# Next, turn each movie's tags into a numeric vector (text vectorisation).
# Technique: bag of words.
# Conceptually: tag1 + tag2 + tag3 + ... = one large text; keep its 5000 most frequent words.
# Each movie then becomes a vector of counts of those words, e.g.
# m1 -> 5 3 4 2 3 1 0 0 0 ... for words such as action, future, drama.
# The result is a (number of movies, 5000) matrix: one 5000-dimensional vector per movie.
# The five vectors closest to a movie's vector give its recommendations.
# Stop words such as and, or, for, to, is are not considered;
# vectorisation is performed on the remaining words.

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')  # keep only the 5000 most frequent words, ignoring English stop words

In [38]:
# fit_transform returns a sparse matrix because most entries are 0; .toarray() converts it to a dense NumPy array
vector = cv.fit_transform(new['tags']).toarray()
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [39]:
# cv.get_feature_names_out()  # get_feature_names() was removed in newer scikit-learn releases

In [40]:
vector.shape

(4806, 5000)

The nltk (Natural Language Toolkit) is a popular Python library used for natural language processing (NLP) tasks. It provides tools and resources for working with human language data, making it easier to analyze, preprocess, and manipulate text data.

1. Text Processing: nltk offers functions for tokenization (breaking text into words or sentences), stemming (reducing words to their base or root form), and lemmatization (reducing words to their dictionary form).

2. Stopwords Removal: It provides a list of common stopwords (such as "the," "and," "is") that are often removed from text data during preprocessing to focus on more meaningful words.

3. Part-of-Speech Tagging: nltk can tag words in a text with their corresponding part-of-speech (noun, verb, adjective, etc.), which is useful for understanding the grammatical structure of a sentence.

In [41]:
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [42]:
def stem(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)   # join the stemmed words back into a single string
    

In [43]:
['dancing', 'danced', 'dances']
ps.stem('dancing')

'danc'

In [44]:
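# Note: `vector` was built in In [38], before stemming. For the stemmed tags to
# affect the similarity scores, re-run cv.fit_transform(new['tags']) after this cell.
new['tags']=new['tags'].apply(stem)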
new['tags'][0]

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'

In [45]:
# Now compute the cosine similarity between every pair of movie vectors.
# It is based on the angle between the vectors:
# the smaller the angle (cosine distance), the more similar the movies.

In [46]:
from sklearn.metrics.pairwise import cosine_similarity

In [47]:
cosine_similarity(vector).shape  # one similarity score for every pair of movies

(4806, 4806)
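
For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths: cos(a, b) = (a · b) / (‖a‖ ‖b‖). The sketch below computes it by hand in plain Python on three made-up count vectors. It is an illustration only; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_a * norm_b)

# Made-up count vectors over a 4-word vocabulary
m1 = [1, 1, 1, 0]
m2 = [1, 1, 0, 1]
m3 = [0, 0, 0, 2]

print(cosine(m1, m1))  # a vector compared with itself: 1 up to floating-point rounding
print(cosine(m1, m2))  # two of three words shared
print(cosine(m1, m3))  # no words in common
```

Because count vectors have no negative entries, the score always lies between 0 (no shared words) and 1 (same direction).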

In [48]:
similarity = cosine_similarity(vector)

In [49]:
similarity  # 2-D array: similarity[i][j] is the similarity between movie i and movie j

array([[1.        , 0.08964215, 0.06071767, ..., 0.02519763, 0.0277885 ,
        0.        ],
       [0.08964215, 1.        , 0.06350006, ..., 0.02635231, 0.        ,
        0.        ],
       [0.06071767, 0.06350006, 1.        , ..., 0.02677398, 0.        ,
        0.        ],
       ...,
       [0.02519763, 0.02635231, 0.02677398, ..., 1.        , 0.07352146,
        0.04774099],
       [0.0277885 , 0.        , 0.        , ..., 0.07352146, 1.        ,
        0.05264981],
       [0.        , 0.        , 0.        , ..., 0.04774099, 0.05264981,
        1.        ]])

In [50]:
similarity[0]

array([1.        , 0.08964215, 0.06071767, ..., 0.02519763, 0.0277885 ,
       0.        ])

In [51]:
sorted(similarity[0], reverse=True)

[1.0000000000000002,
 0.26089696604360174,
 0.2581988897471611,
 0.25110592822973776,
 0.24944382578492943,
 0.24846467329894412,
 0.24397501823713333,
 0.243599382882345,
 0.24147264420814757,
 0.23904572186687872,
 0.22677868380553634,
 0.2238868314198225,
 0.22230800575069137,
 0.22190115272469205,
 0.22131333406899523,
 0.2173253797873328,
 0.21653278478430665,
 0.2164218276749025,
 0.2162249910469341,
 0.2162249910469341,
 0.21380899352993948,
 0.21251185925162067,
 0.21251185925162067,
 0.21251185925162067,
 0.2114722130550724,
 0.21147221305507238,
 0.20597146021777488,
 0.20498001542269692,
 0.20283702113484398,
 0.2,
 0.1980534816610477,
 0.19518001458970666,
 0.19518001458970666,
 0.19518001458970663,
 0.19518001458970663,
 0.19477964490741226,
 0.19451950503185494,
 0.1938916835823703,
 0.1938916835823703,
 0.1932024558334913,
 0.19194297398747862,
 0.19166296949998196,
 0.18869127060994534,
 0.18832944617230335,
 0.18752289237539818,
 0.18752289237539818,
 0.187082869338697

In [52]:
new[new['title'] == 'The Lego Movie'] 

Unnamed: 0,movie_id,title,tags
744,137106,The Lego Movie,"an ordinari lego mini-figure, mistakenli thoug..."


In [53]:
# fetching the indexes
new[new['title'] == 'The Lego Movie'].index[0]

744

In [54]:
# We use enumerate to pair each similarity score with its movie's position, so the position is not lost when the scores are sorted.

In [55]:
# Without a key, sorted() orders the (index, score) tuples by index, not by score.
sorted(list(enumerate(similarity[0])),reverse=True)

[(4805, 0.0),
 (4804, 0.027788500718836418),
 (4803, 0.02519763153394848),
 (4802, 0.05345224838248488),
 (4801, 0.02492223931396134),
 (4800, 0.0),
 (4799, 0.05884898863364997),
 (4798, 0.02129588549999799),
 (4797, 0.0),
 (4796, 0.0),
 (4795, 0.0),
 (4794, 0.0),
 (4793, 0.059761430466719674),
 (4792, 0.0),
 (4791, 0.0),
 (4790, 0.03194382824999699),
 (4789, 0.023002185311411804),
 (4788, 0.0),
 (4787, 0.02258769757263128),
 (4786, 0.0),
 (4785, 0.0),
 (4784, 0.04517539514526256),
 (4783, 0.0),
 (4782, 0.027066598098038335),
 (4781, 0.06388765649999398),
 (4780, 0.023002185311411804),
 (4779, 0.0),
 (4778, 0.0),
 (4777, 0.05345224838248488),
 (4776, 0.0),
 (4775, 0.03253000243161777),
 (4774, 0.0),
 (4773, 0.029880715233359837),
 (4772, 0.026398183867422733),
 (4771, 0.042257712736425826),
 (4770, 0.0),
 (4769, 0.0),
 (4768, 0.0),
 (4767, 0.03877833671647406),
 (4766, 0.018227065414412227),
 (4765, 0.0),
 (4764, 0.0),
 (4763, 0.0),
 (4762, 0.0),
 (4761, 0.03194382824999699),
 (4760, 0

In [56]:
sorted(list(enumerate(similarity[0])),reverse=True,key = lambda x: x[1])[1:6]

[(539, 0.26089696604360174),
 (1194, 0.2581988897471611),
 (260, 0.25110592822973776),
 (1216, 0.24944382578492943),
 (507, 0.24846467329894412)]
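
The difference between In [55] and In [56] can be reproduced on a short list. The scores below are made up; the sketch only shows how `key=` changes what `sorted` compares.

```python
# Made-up similarities of movie 0 to movies 0..4
scores = [1.0, 0.10, 0.45, 0.0, 0.30]

# Without a key, tuples are compared by their first element (the position)
by_position = sorted(enumerate(scores), reverse=True)

# With key=..., the second element (the score) decides the order
by_score = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Skip the first entry (the movie itself, score 1.0) and keep the next two
top2 = by_score[1:3]

print(by_position[0])
print(top2)
```

Skipping the first entry assumes the query movie ranks first, which holds when no other movie has an identical vector.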

In [57]:
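def recommend(movie):
    # Row position of the movie. similarity is indexed by position, and the
    # index labels have gaps after dropna(), so .index[0] could point at the wrong row.
    index = np.flatnonzero((new['title'] == movie).to_numpy())[0]
    # Pair each score with its position, then sort by score, highest first
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])
    for i in distances[1:6]:  # skip the first entry, which is the movie itself
        print(new.iloc[i[0]].title)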
        

In [58]:
new.iloc[538]

movie_id                                                 2026
title                                                 Hostage
tags        when a mafia account is taken hostag on hi bea...
Name: 538, dtype: object

In [59]:
recommend('Gandhi')

Gandhi, My Father
The Wind That Shakes the Barley
A Passage to India
Guiana 1838
Ramanujan


The pickle module provides a way to encode and decode Python objects in a binary format. It's commonly used for tasks like:

1. Data Persistence: You can use pickle to save Python objects to a file and later load them back into memory. This is useful for caching, storing configurations, or saving complex data structures.

2. Object Communication: When you need to transmit Python objects between different processes or machines, you can pickle the object on one side and unpickle it on the other side.

In [60]:
import bz2  # the standard-library bz2 module is sufficient here; the third-party bz2file package is not required
import pickle

In [64]:
import bz2
import pickle

def compressed_pickle(title, data):
    with bz2.BZ2File(title + '.pbz2', 'w') as f:
        pickle.dump(data, f)


In [67]:
compressed_pickle('movie_dict', new.to_dict())
compressed_pickle('similarity', similarity)
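
The files written above have to be read back somewhere, for example in the Streamlit app. The notebook does not show that step, so the loader below is a sketch: `decompress_pickle` is a hypothetical name, and the round trip uses a small stand-in dictionary in a temporary directory rather than the real `movie_dict.pbz2`.

```python
import bz2
import os
import pickle
import tempfile

def compressed_pickle(title, data):
    """Write data to '<title>.pbz2' as a bz2-compressed pickle."""
    with bz2.BZ2File(title + '.pbz2', 'w') as f:
        pickle.dump(data, f)

def decompress_pickle(path):
    """Hypothetical counterpart: load an object written by compressed_pickle."""
    with bz2.BZ2File(path, 'rb') as f:
        return pickle.load(f)

# Round trip on a small stand-in object, written to a temporary directory
sample = {"title": {0: "Avatar", 1: "Spectre"}}
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "movie_dict_demo")
    compressed_pickle(base, sample)
    restored = decompress_pickle(base + ".pbz2")

print(restored == sample)
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.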

In [None]:
# pickle.dump(new,open('movie_list.pkl','wb'))
# pickle.dump(similarity,open('similarity.sav','wb'))

In [None]:
# pickle.dump(new.to_dict(),open('movie_dict.sav','wb'))