# Introduction to the TMDB Movie Dataset

The TMDB Movie Dataset is a collection of movie metadata split across two files,
"tmdb_5000_movies.csv" and "tmdb_5000_credits.csv". Together they describe thousands of movies,
including details about their production, cast, crew, and user ratings.

"tmdb_5000_movies.csv" contains essential information about each movie,
including its title, genre, release date, budget, revenue, and user ratings. 
With this dataset, you can explore the financial performance, popularity, and critical 
reception of movies from various genres and time periods.

"tmdb_5000_credits.csv" complements the movie dataset by providing detailed information about
the cast and crew involved in each film. It includes data on actors, directors, producers, and more.
This dataset enables you to analyze the contributions of different individuals to the film
industry and study the collaborations between artists.

Together, these datasets serve as a valuable resource for data analysis and
research in the fields of film studies, data science, and entertainment industry analysis.
Whether you're interested in understanding movie trends, exploring the influence of cast and
crew on a film's success, or conducting data-driven investigations into the world of cinema,
the TMDB Movie Dataset offers a rich and diverse source of information.

Explore the dataset, extract meaningful insights, and uncover hidden patterns in the 
fascinating realm of movies with the TMDB Movie Dataset.








# Import Libraries 

In [1]:
import pandas as pd 
import numpy as np  


# Dataset

In [2]:
movies=pd.read_csv('tmdb_5000_movies.csv')
credits=pd.read_csv('tmdb_5000_credits.csv')

In [3]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [4]:
credits.head(2)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [5]:
# Merge credits and movies on the title column, which is present in both DataFrames.

In [6]:
# Assign the merged DataFrame back to the variable movies.

In [7]:
movies= movies.merge(credits,on="title")
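
Merging on `title` can produce extra rows when different films share a title, which may be why the shape reported below is (4809, 23). The following is a minimal sketch on toy data, not the notebook's method: it assumes that `id` in the movies file and `movie_id` in the credits file identify the same film (the sample rows above are consistent with this, since Avatar is 19995 in both). The names `movies_toy`, `credits_toy`, `by_title`, and `by_id` are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical data): two different films share the title "Batman".
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["Avatar", "Batman", "Batman"]})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Avatar", "Batman", "Batman"],
    "cast": ["a", "b", "c"],
})

# Merging on title pairs every "Batman" with every other "Batman": 1 + 2*2 = 5 rows.
by_title = movies_toy.merge(credits_toy, on="title")

# Merging on the numeric key keeps one row per film; validate= raises an error
# if the key is not unique on either side.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
)
print(len(by_title), len(by_id))
```

Whether the real files behave like this toy example would need to be checked, for instance by comparing the row count before and after the merge.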

In [8]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')

In [9]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [10]:
movies.shape

(4809, 23)

# Choose the important columns

In [11]:
# genres
# id
# keywords
# title
# overview
# cast
# crew

In [12]:
# Check the columns in the DataFrame before trimming it.

In [13]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

In [14]:
movies = movies.drop(
    ['budget', 'homepage', 'id', 'original_language', 'original_title',
     'popularity', 'production_companies', 'production_countries',
     'release_date', 'revenue', 'runtime', 'spoken_languages', 'status',
     'tagline', 'vote_average', 'vote_count'],
    axis=1,
)
movies.head(2)

Unnamed: 0,genres,keywords,overview,title,movie_id,cast,crew
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...",Avatar,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [15]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   genres    4809 non-null   object
 1   keywords  4809 non-null   object
 2   overview  4806 non-null   object
 3   title     4809 non-null   object
 4   movie_id  4809 non-null   int64 
 5   cast      4809 non-null   object
 6   crew      4809 non-null   object
dtypes: int64(1), object(6)
memory usage: 300.6+ KB


In [16]:
# Check for null values.

In [17]:
movies.isnull().sum()

genres      0
keywords    0
overview    3
title       0
movie_id    0
cast        0
crew        0
dtype: int64

In [18]:
movies.dropna(inplace=True)

In [19]:
movies.isnull().sum()

genres      0
keywords    0
overview    0
title       0
movie_id    0
cast        0
crew        0
dtype: int64

In [20]:
# Check for duplicate rows.

In [21]:
movies.duplicated().sum()

0

In [22]:
# The task now is to merge the genres, keywords, cast, and crew columns with overview.
# First we clean up each of these columns.

In [23]:
# Change the format of the genres column into a list of names, e.g. ['Action', 'Adventure', 'Fantasy']

In [24]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [25]:
# Now let's convert this string into a list.

In [26]:
def convert(obj):
    li=[]
    for i in obj:
        li.append(i['name'])
        return li

In [27]:
convert('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

TypeError: string indices must be integers, not 'str'

In [None]:
# As you can see, the loop fails because each value in the genres column is a string, not a list.

In [None]:
# Here we convert the string into a list using the ast module below.

In [28]:
import ast
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [None]:
# Now we use ast.literal_eval inside the loop.

In [30]:
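def convert(obj):
    li=[]
    for i in ast.literal_eval(obj):   # parse the string into a list of dicts
        li.append(i['name'])
        # Note: this return sits inside the loop, so only the first name is
        # returned, and an empty list falls through and returns None.
        return li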
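
In `convert` above, `return li` is indented inside the `for` loop, so the function returns after the first item. That matches the single-element lists such as `[Action]` in the outputs that follow. A minimal sketch of a variant that collects every name is shown here; `convert_all` is a hypothetical name, and the notebook's recorded outputs were not produced with it.

```python
import ast

def convert_all(obj):
    """Return every 'name' value from a stringified list of dicts."""
    # ast.literal_eval safely parses the string into a Python list of dicts.
    return [item["name"] for item in ast.literal_eval(obj)]

genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
print(convert_all(genres))  # ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
print(convert_all("[]"))    # []
```

Because an empty input yields an empty list rather than `None`, later steps that concatenate or join lists would not need a special case for missing values.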

In [None]:
# Now we use apply to convert each string in the column to a list.

In [31]:
movies['genres']=movies['genres'].apply(convert)

In [32]:
movies.head(2)

Unnamed: 0,genres,keywords,overview,title,movie_id,cast,crew
0,[Action],"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...",Avatar,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,[Adventure],"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [None]:
# Now we do the same for the keywords column.

In [33]:
movies['keywords']=movies['keywords'].apply(convert)

In [34]:
movies.head(2)

Unnamed: 0,genres,keywords,overview,title,movie_id,cast,crew
0,[Action],[culture clash],"In the 22nd century, a paraplegic Marine is di...",Avatar,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,[Adventure],[ocean],"Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [None]:
# Now we work on the cast column. We need only three actor names for the recommendation, so we run the loop with a counter.

In [35]:
def convert3(obj):
    li=[]
    counter=0
    for i in ast.literal_eval(obj):
        if counter != 3:
            li.append(i["name"])
            counter+=1
        else:
            break
    return li

In [None]:
# Now we use apply so that only the first three cast members are kept.

In [36]:
movies["cast"]=movies["cast"].apply(convert3)

In [37]:
movies.head()

Unnamed: 0,genres,keywords,overview,title,movie_id,cast,crew
0,[Action],[culture clash],"In the 22nd century, a paraplegic Marine is di...",Avatar,19995,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,[Adventure],[ocean],"Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End,285,"[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,[Action],[spy],A cryptic message from Bond’s past sends him o...,Spectre,206647,"[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,[Action],[dc comics],Following the death of District Attorney Harve...,The Dark Knight Rises,49026,"[Christian Bale, Michael Caine, Gary Oldman]","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,[Action],[based on novel],"John Carter is a war-weary, former military ca...",John Carter,49529,"[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [None]:
# Now we work on the crew column.
# We need only the director's name from the list of dictionaries:
# loop over the entries, take the one where "job" is "Director", and break once it is found.
# Parse the string with ast.literal_eval(obj),
# then use apply on the column.

In [38]:
def fetch_director(obj):
    li=[]
    for i in ast.literal_eval(obj):
        if i["job"]=="Director":
            li.append(i["name"])
            break
    return li        

In [39]:
movies["crew"]=movies["crew"].apply(fetch_director)

In [40]:
movies

Unnamed: 0,genres,keywords,overview,title,movie_id,cast,crew
0,[Action],[culture clash],"In the 22nd century, a paraplegic Marine is di...",Avatar,19995,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,[Adventure],[ocean],"Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End,285,"[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,[Action],[spy],A cryptic message from Bond’s past sends him o...,Spectre,206647,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,[Action],[dc comics],Following the death of District Attorney Harve...,The Dark Knight Rises,49026,"[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,[Action],[based on novel],"John Carter is a war-weary, former military ca...",John Carter,49529,"[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]
...,...,...,...,...,...,...,...
4804,[Action],[united states–mexico barrier],El Mariachi just wants to play his guitar and ...,El Mariachi,9367,"[Carlos Gallardo, Jaime de Hoyos, Peter Marqua...",[Robert Rodriguez]
4805,[Comedy],,A newlywed couple's honeymoon is upended by th...,Newlyweds,72766,"[Edward Burns, Kerry Bishé, Marsha Dietlein]",[Edward Burns]
4806,[Comedy],[date],"""Signed, Sealed, Delivered"" introduces a dedic...","Signed, Sealed, Delivered",231617,"[Eric Mabius, Kristin Booth, Crystal Lowe]",[Scott Smith]
4807,,,When ambitious New York attorney Sam is sent t...,Shanghai Calling,126186,"[Daniel Henney, Eliza Coupe, Bill Paxton]",[Daniel Hsia]


In [None]:
# The overview column is a string, so we will convert it into a list of words.

In [41]:
movies["overview"][0]   # next we apply a lambda function to split this string into a list

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [42]:
movies["overview"]=movies["overview"].apply(lambda x:x.split())

In [43]:
movies.head(2)

Unnamed: 0,genres,keywords,overview,title,movie_id,cast,crew
0,[Action],[culture clash],"[In, the, 22nd, century,, a, paraplegic, Marin...",Avatar,19995,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,[Adventure],[ocean],"[Captain, Barbossa,, long, believed, to, be, d...",Pirates of the Caribbean: At World's End,285,"[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


In [None]:
# Here we remove the spaces inside names so that each name becomes a single token for the vectorizer.

In [44]:
# Remove spaces from the elements of the "genres", "keywords", "cast", and "crew" columns
movies["genres"] = movies["genres"].apply(lambda x: [i.replace(" ", "") for i in x] if x is not None else x)
movies["keywords"] = movies["keywords"].apply(lambda x: [i.replace(" ", "") for i in x] if x is not None else x)
movies["cast"] = movies["cast"].apply(lambda x: [i.replace(" ", "") for i in x] if x is not None else x)
movies["crew"] = movies["crew"].apply(lambda x: [i.replace(" ", "") for i in x] if x is not None else x)
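
The notebook does not spell out why the spaces are removed. A likely motivation, offered here as an assumption, is that a later whitespace-based tokenizer would otherwise split "Sam Worthington" and "Sam Mendes" into tokens that share the word "Sam". The toy sketch below illustrates the effect with plain Python sets; the variable names are hypothetical.

```python
names = ["Sam Worthington", "Sam Mendes"]

# Splitting on whitespace makes both people share the token "Sam".
tokens_with_spaces = [set(n.split()) for n in names]
shared_before = tokens_with_spaces[0] & tokens_with_spaces[1]

# Collapsing each name into one token keeps the two people distinct.
collapsed = [n.replace(" ", "") for n in names]
tokens_collapsed = [set(n.split()) for n in collapsed]
shared_after = tokens_collapsed[0] & tokens_collapsed[1]

print(shared_before)  # {'Sam'}
print(shared_after)   # set()
```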

In [45]:
movies.head()

Unnamed: 0,genres,keywords,overview,title,movie_id,cast,crew
0,[Action],[cultureclash],"[In, the, 22nd, century,, a, paraplegic, Marin...",Avatar,19995,"[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,[Adventure],[ocean],"[Captain, Barbossa,, long, believed, to, be, d...",Pirates of the Caribbean: At World's End,285,"[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,[Action],[spy],"[A, cryptic, message, from, Bond’s, past, send...",Spectre,206647,"[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,[Action],[dccomics],"[Following, the, death, of, District, Attorney...",The Dark Knight Rises,49026,"[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,[Action],[basedonnovel],"[John, Carter, is, a, war-weary,, former, mili...",John Carter,49529,"[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


In [None]:
# After this cleanup, we concatenate overview, genres, keywords, cast, and crew into a new column called tags.

In [46]:
movies["tags"] = movies["overview"] + movies["genres"] + movies["keywords"] + movies["cast"] + movies["crew"]

In [47]:
movies.head()

Unnamed: 0,genres,keywords,overview,title,movie_id,cast,crew,tags
0,[Action],[cultureclash],"[In, the, 22nd, century,, a, paraplegic, Marin...",Avatar,19995,"[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,[Adventure],[ocean],"[Captain, Barbossa,, long, believed, to, be, d...",Pirates of the Caribbean: At World's End,285,"[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,[Action],[spy],"[A, cryptic, message, from, Bond’s, past, send...",Spectre,206647,"[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,[Action],[dccomics],"[Following, the, death, of, District, Attorney...",The Dark Knight Rises,49026,"[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,[Action],[basedonnovel],"[John, Carter, is, a, war-weary,, former, mili...",John Carter,49529,"[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


In [None]:
# Now we create a new DataFrame with the columns movie_id, title, and tags.

In [49]:
new_data=movies[["movie_id","title","tags"]]
new_data

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."
...,...,...,...
4804,9367,El Mariachi,"[El, Mariachi, just, wants, to, play, his, gui..."
4805,72766,Newlyweds,
4806,231617,"Signed, Sealed, Delivered","[""Signed,, Sealed,, Delivered"", introduces, a,..."
4807,126186,Shanghai Calling,


In [None]:
# Now we convert the list in each row of tags into a string by joining with a lambda.

In [50]:
new_data["tags"]=new_data["tags"].apply(lambda x:" ".join(x))


TypeError: can only join an iterable

In [51]:
# Join the list in each row of "tags" into a single string; non-list values (NaN) are left as they are
new_data["tags"] = new_data["tags"].apply(lambda x: " ".join(x) if isinstance(x, list) else x)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_data["tags"] = new_data["tags"].apply(lambda x: " ".join(x) if isinstance(x, list) else x)
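
The SettingWithCopyWarning above appears because `new_data` was created by selecting columns from `movies`, so pandas cannot tell whether the assignment is meant to affect the original frame. One common way to avoid it is to take an explicit copy when creating the subset. The sketch below uses toy data and hypothetical names (`movies_toy`, `new_toy`); it is an illustration rather than the notebook's own code.

```python
import pandas as pd

# Toy stand-in for the notebook's movies frame (hypothetical data).
movies_toy = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["A", "B"],
    "tags": [["in", "the"], ["captain", "barbossa"]],
    "extra": [0, 0],
})

# .copy() makes new_toy an independent DataFrame, so assigning to one of its
# columns is unambiguous and leaves movies_toy untouched.
new_toy = movies_toy[["movie_id", "title", "tags"]].copy()
new_toy["tags"] = new_toy["tags"].apply(lambda x: " ".join(x) if isinstance(x, list) else x)

print(new_toy["tags"].tolist())
```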


In [52]:
new_data.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [None]:
# Now the tags column holds the concatenated string.

In [53]:
new_data["tags"][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action cultureclash SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [None]:
# Now we convert this string to lowercase so that the same word in different cases is counted as one token.

In [54]:
# Convert string values in the "tags" column to lowercase; non-string values (NaN) are left as they are
new_data['tags'] = new_data['tags'].apply(lambda x: x.lower() if isinstance(x, str) else x)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_data['tags'] = new_data['tags'].apply(lambda x: x.lower() if isinstance(x, str) else x)


In [55]:
new_data.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [56]:
import nltk


In [None]:
# We will use stemming (PorterStemmer) to reduce different forms of a word to one common stem.

In [57]:
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

In [None]:
# We define a function that stems every word.

In [58]:
def stem(text):
    st=[]
    for i in text.split():     #breaking the words 
        st.append(ps.stem(i))  # will stem the words 
    return " ".join(st)   

In [59]:

def stem(text):
    # Check if the text is not NaN (a float)
    if isinstance(text, str):
        st = []
        for i in text.split():  # breaking the words
            st.append(ps.stem(i))  # will stem the words
        return " ".join(st)
    else:
        # Handle non-string (float or NaN) values
        return text

# Apply the stem function to your DataFrame column
new_data['text'] = new_data['text'].apply(stem)


KeyError: 'text'

In [65]:
new_data["tags"].apply(stem)

0       in the 22nd century, a parapleg marin is dispa...
1       captain barbossa, long believ to be dead, ha c...
2       a cryptic messag from bond’ past send him on a...
3       follow the death of district attorney harvey d...
4       john carter is a war-weary, former militari ca...
                              ...                        
4804    el mariachi just want to play hi guitar and ca...
4805                                                  NaN
4806    "signed, sealed, delivered" introduc a dedic q...
4807                                                  NaN
4808    ever sinc the second grade when he first saw h...
Name: tags, Length: 4806, dtype: object

In [66]:
new_data["tags"]=new_data["tags"].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_data["tags"]=new_data["tags"].apply(stem)


In [None]:
# With stemming complete, we continue with CountVectorizer.

In [None]:
# Now we vectorize the text using scikit-learn's CountVectorizer,
# with the parameters max_features and stop_words.

In [None]:
# Then we convert the sparse result into a dense array with toarray().

In [67]:
# Fill NaN values in the "tags" column with an empty string
new_data["tags"].fillna('', inplace=True)

# Now, you can apply the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words="english")
vectors = cv.fit_transform(new_data["tags"]).toarray()


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_data["tags"].fillna('', inplace=True)
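
To make the vectorization step more concrete, here is a simplified pure-Python stand-in for what a count vectorizer does: build a vocabulary, then count how often each vocabulary word occurs in each document. It is a sketch and not scikit-learn's implementation; it skips lowercasing, stop-word removal, and the `max_features` cut-off, and the names `toy_docs`, `toy_vocab`, and `toy_vectors` are hypothetical.

```python
from collections import Counter

toy_docs = ["action space marine", "space adventure", "action action hero"]

# Vocabulary: every distinct word, in sorted order (one column per word).
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})

# One row per document; each entry counts how often that word occurs.
toy_vectors = []
for doc in toy_docs:
    counts = Counter(doc.split())
    toy_vectors.append([counts[word] for word in toy_vocab])

print(toy_vocab)    # ['action', 'adventure', 'hero', 'marine', 'space']
print(toy_vectors)  # [[1, 0, 0, 1, 1], [0, 1, 0, 0, 1], [2, 0, 1, 0, 0]]
```

With a vocabulary of 5000 words and short tag strings, most entries in each row are 0, which is why the arrays printed further down look almost empty.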


In [None]:
# Check its shape: we have 4806 movies, each transformed into a vector of 5000 features.

In [68]:
cv.fit_transform(new_data["tags"]).toarray().shape

(4806, 5000)

In [None]:
# Now we store the result as vectors.

In [69]:
vectors=cv.fit_transform(new_data["tags"]).toarray()

In [70]:
vectors     # every movie is now represented as a vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [None]:
# Now let's check the first movie, Avatar.

In [71]:
vectors[0]   # a single movie contains only a few of the 5000 vocabulary words, so most entries are 0

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [None]:
# The cv object used above has a method called get_feature_names().

In [72]:
cv.get_feature_names()   # not available in this scikit-learn version (see the error); the next cell uses the replacement method

AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'

In [73]:
# Assuming you have already fit_transformed your data with the CountVectorizer (cv)
# Use the get_feature_names_out() method to get feature names
feature_names = cv.get_feature_names_out()

# Now, you can access the feature names
print(feature_names)


['000' '007' '10' ... 'zone' 'zoo' 'zooeydeschanel']


In [None]:
# Note: stemming was applied earlier (the nltk cells after new_data.head()),
# so different forms of a word map to the same feature here.

In [None]:
# Now we compute the cosine similarity, which is based on the angle between two vectors.
# For these count vectors the value ranges from 0 to 1:
# 1 means the most similar and 0 means no similarity.
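
As a small worked example of the measure, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with the standard library only; the function name `cosine` and the toy vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

a = [1, 0, 0, 1, 1]
b = [0, 1, 0, 0, 1]
c = [0, 0, 1, 0, 0]
print(cosine(a, a))  # approximately 1: a vector compared with itself
print(cosine(a, b))  # one shared word out of several: 1 / (sqrt(3) * sqrt(2))
print(cosine(a, c))  # no shared words: 0.0
```

Because word counts are never negative, the dot product cannot be negative either, which is why the scores for these vectors stay between 0 and 1.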

In [74]:
from sklearn.metrics.pairwise import cosine_similarity


In [75]:
similarity=cosine_similarity(vectors)

In [76]:
similarity

array([[1.        , 0.        , 0.04622502, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.03370999, 0.        ,
        0.03580574],
       [0.04622502, 0.        , 1.        , ..., 0.02956562, 0.        ,
        0.        ],
       ...,
       [0.        , 0.03370999, 0.02956562, ..., 1.        , 0.        ,
        0.04828045],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.03580574, 0.        , ..., 0.04828045, 0.        ,
        1.        ]])

In [78]:
similarity.shape

(4806, 4806)

In [79]:
similarity[0] # similarity of the first movie with every movie; the first entry (itself) is 1

array([1.        , 0.        , 0.04622502, ..., 0.        , 0.        ,
       0.        ])

In [80]:
similarity[1]  # similarity of the second movie with every movie; the second entry (itself) is 1

array([0.        , 1.        , 0.        , ..., 0.03370999, 0.        ,
       0.03580574])

In [None]:
# Now we create a function that takes a movie title and suggests five movies according to similarity.

In [None]:
def recommend(movie):
    pass  # placeholder; the body is worked out in the cells that follow

In [None]:
# First we fetch the index of the movie.

In [81]:
new_data["title"]=="Avatar"

0        True
1       False
2       False
3       False
4       False
        ...  
4804    False
4805    False
4806    False
4807    False
4808    False
Name: title, Length: 4806, dtype: bool

In [82]:
new_data[new_data["title"]=="Avatar"]

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."


In [83]:
new_data[new_data["title"]=="Avatar"].index[0] # the index of Avatar is 0

0

In [84]:
new_data[new_data["title"]=="Batman Begins"].index[0] # the index of Batman Begins is 119

119

In [None]:
# We need to sort the scores, but there is a problem.

In [85]:
sorted(similarity[0],reverse=True) # the highest scores belong to the most similar movies
# Sorting the scores loses track of which movie each score belongs to,
# so we need to keep each index position attached to its score while sorting.
# For that we will call the enumerate function.

[1.0000000000000002,
 0.24595492912420727,
 0.23333333333333336,
 0.19611613513818404,
 0.18490006540840975,
 0.18257418583505536,
 0.1767766952966369,
 0.16412198797244368,
 0.16169041669088866,
 0.158113883008419,
 0.15499685165842572,
 0.15430334996209194,
 0.15342910298305393,
 0.15075567228888181,
 0.14744195615489716,
 0.14506471329641488,
 0.14433756729740646,
 0.1439780709302267,
 0.14142135623730953,
 0.14142135623730953,
 0.14142135623730953,
 0.1405456737852613,
 0.13867504905630731,
 0.13608276348795437,
 0.13608276348795437,
 0.1336306209562122,
 0.1336306209562122,
 0.13130643285972257,
 0.1307440900921227,
 0.1307440900921227,
 0.12909944487358055,
 0.12909944487358055,
 0.12909944487358055,
 0.1270001270001905,
 0.1270001270001905,
 0.12598815766974242,
 0.12598815766974242,
 0.125,
 0.125,
 0.125,
 0.125,
 0.12274328238644315,
 0.12171612389003691,
 0.12171612389003691,
 0.12171612389003691,
 0.12171612389003691,
 0.12171612389003691,
 0.12126781251816648,
 0.121267812
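
The idea of keeping each position attached to its score can be sketched on a short list. `enumerate` turns the scores into `(position, score)` pairs, and `sorted` with a `key` orders the pairs by score while the positions travel along. The values in `scores` are hypothetical.

```python
scores = [1.0, 0.05, 0.31, 0.0, 0.27]  # hypothetical similarity row for movie 0

# enumerate pairs each score with its position before sorting.
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Skip the first entry (the movie itself) and keep the next three positions.
top_positions = [pos for pos, _ in ranked[1:4]]
print(ranked)         # [(0, 1.0), (2, 0.31), (4, 0.27), (1, 0.05), (3, 0.0)]
print(top_positions)  # [2, 4, 1]
```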

In [90]:
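def recommend(movie):
    # .index[0] is the row label of the first title match in new_data
    index = new_data[new_data['title'] == movie].index[0]
    # Pair each score with its position, then sort by score, highest first.
    # similarity is indexed by position, so this assumes label == position,
    # which holds only for rows above the first one removed by dropna().
    distances = sorted(enumerate(similarity[index]),reverse=True,key = lambda x: x[1])
    for i in distances[1:6]:   # skip the top entry (normally the movie itself)
        print(new_data.iloc[i[0]].title)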
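
`recommend` looks a movie up by its DataFrame index label but then uses that label as a row position in `similarity`. Since `dropna` removed three rows, labels and positions can differ for movies further down the table. One way around this, sketched below with toy data, is to work with a plain list of titles in the same row order as the similarity matrix. The function `recommend_titles` and its parameters are hypothetical, and it returns the titles instead of printing them.

```python
def recommend_titles(movie, titles, sim, k=5):
    """Return up to k titles most similar to `movie`.

    titles: list of titles in the same row order as `sim`
    sim:    square list of lists (or array) of pairwise similarities
    """
    if movie not in titles:
        return []  # unknown title: nothing to recommend
    pos = titles.index(movie)  # row position of the first match, not an index label
    ranked = sorted(enumerate(sim[pos]), key=lambda pair: pair[1], reverse=True)
    # Drop the query movie itself, then keep the k best matches.
    return [titles[i] for i, _ in ranked if i != pos][:k]

toy_titles = ["A", "B", "C", "D"]
toy_sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
]
print(recommend_titles("A", toy_titles, toy_sim, k=2))  # ['C', 'B']
print(recommend_titles("Z", toy_titles, toy_sim))       # []
```

In the notebook this could be called as `recommend_titles("Avatar", new_data["title"].tolist(), similarity)`, assuming the title list and the similarity matrix were built from the same rows in the same order.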
        

In [91]:
recommend('Gandhi')

Gandhi, My Father
Guiana 1838
Ramanujan
Barry Lyndon
Partition


In [93]:
recommend("Avatar")

Tears of the Sun
Apollo 18
Bucky Larson: Born to Be a Star
Risen
Attack the Block


In [94]:
recommend("Batman Begins")

Synecdoche, New York
Copying Beethoven
The Animal
10th & Wolf
The Shipping News


# Conclusion

The TMDB Movie Dataset provides a comprehensive and valuable resource for exploring
the vast and dynamic world of cinema. With "tmdb_5000_movies.csv" and "tmdb_5000_credits.csv" 
at your disposal, you can delve into a wide range of movie-related analyses and investigations.

In this notebook we merged the two files, kept the seven columns that describe a movie's
content (genres, keywords, overview, title, movie_id, cast, and crew), and combined them into
a single "tags" string per movie. After lowercasing and stemming, we vectorized the tags with
CountVectorizer and compared every pair of movies by cosine similarity.

Moreover, "tmdb_5000_credits.csv" supplied the individuals behind each film. We kept the
first three cast members and the director, so that movies sharing actors or a director
score as more similar, and the recommend function prints the five most similar titles
for a given movie.

As we conclude our exploration of the TMDB Movie Dataset, 
we're reminded of the limitless possibilities it offers. From predicting a film's 
success based on its attributes to analyzing the evolution of movie genres over time, 
this dataset empowers researchers, data scientists, and film enthusiasts alike
to embark on exciting journeys of discovery.

Whether you're passionate about movies or simply eager to harness the power of data for 
insightful analysis, the TMDB Movie Dataset stands as a valuable companion on your
quest for knowledge in the realm of cinema.

So, let the credits roll and the analysis begin, as we continue to uncover the stories,
trends, and magic that make the world of movies so captivating and endlessly fascinating.





