In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
credits=pd.read_csv('../dataset/tmdb_5000_credits.csv')
movies=pd.read_csv('../dataset/tmdb_5000_movies.csv')

# Data Cleaning: 
1. Credits Dataset:
   - The **cast** and **crew** features are each stored as a JSON-like string holding a list of dictionaries (key-value pairs) that describe the movie's characters and crew members, respectively. Because each movie lists many characters, I keep only the top 5 cast members and, from the crew, only the director's name, since those are the names audiences recognize most readily. (Tidiness)
2. Movies Dataset:
   - **Homepage**: 64% of the values are missing (Completeness)
   - **Tagline**: 17.6% of the values are missing (Completeness)
   - **Genres, Keywords, Production_companies, Production_countries** are also stored as lists of key-value pairs (Tidiness)

In [3]:
print(f'The shape of the credits dataset is {credits.shape}')
print(f'The shape of the movies dataset is {movies.shape}')

The shape of the credits dataset is (4803, 4)
The shape of the movies dataset is (4803, 20)


In [4]:
credits.sample(5)

Unnamed: 0,movie_id,title,cast,crew
4783,226458,Backmask,"[{""cast_id"": 3, ""character"": ""Father Conway"", ...","[{""credit_id"": ""554c8351c3a3685e500041f5"", ""de..."
119,272,Batman Begins,"[{""cast_id"": 13, ""character"": ""Bruce Wayne / B...","[{""credit_id"": ""52fe4230c3a36847f800ac6d"", ""de..."
1787,5902,A Bridge Too Far,"[{""cast_id"": 4, ""character"": ""Lt. Gen. Fredric...","[{""credit_id"": ""52fe442bc3a36847f8086685"", ""de..."
866,70074,Bullet to the Head,"[{""cast_id"": 4, ""character"": ""James Bonomo"", ""...","[{""credit_id"": ""52fe47e9c3a368484e0e0019"", ""de..."
2831,13812,Quarantine,"[{""cast_id"": 1, ""character"": ""Angela Vidal"", ""...","[{""credit_id"": ""5834decbc3a36829d900d7a5"", ""de..."


In [5]:
movies.sample(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
2405,17000000,"[{""id"": 27, ""name"": ""Horror""}, {""id"": 53, ""nam...",,3597,"[{""id"": 1328, ""name"": ""secret""}, {""id"": 1936, ...",en,I Know What You Did Last Summer,As they celebrate their high school graduation...,27.654472,"[{""name"": ""Columbia Pictures Corporation"", ""id...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1997-10-17,125586134,100.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"If you're going to bury the truth, make sure i...",I Know What You Did Last Summer,5.6,687
2781,14350531,"[{""id"": 28, ""name"": ""Action""}, {""id"": 35, ""nam...",http://attacktheblock.com/,59678,"[{""id"": 542, ""name"": ""street gang""}, {""id"": 24...",en,Attack the Block,A teen gang in a grim South London housing est...,30.952377,"[{""name"": ""UK Film Council"", ""id"": 2452}, {""na...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""}]",2011-05-12,3964682,88.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Inner City vs. Outer Space,Attack the Block,6.3,733
760,60000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 80, ""nam...",,9932,"[{""id"": 378, ""name"": ""prison""}, {""id"": 1321, ""...",en,Analyze That,The mafia's Paul Vitti is back in prison and w...,21.08952,"[{""name"": ""Village Roadshow Pictures"", ""id"": 7...","[{""iso_3166_1"": ""AU"", ""name"": ""Australia""}, {""...",2002-12-06,55003135,96.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Back in therapy,Analyze That,5.7,380


In [6]:
credits.isnull().mean()*100

movie_id    0.0
title       0.0
cast        0.0
crew        0.0
dtype: float64

In [7]:
movies.isnull().mean()*100

budget                   0.000000
genres                   0.000000
homepage                64.355611
id                       0.000000
keywords                 0.000000
original_language        0.000000
original_title           0.000000
overview                 0.062461
popularity               0.000000
production_companies     0.000000
production_countries     0.000000
release_date             0.020820
revenue                  0.000000
runtime                  0.041641
spoken_languages         0.000000
status                   0.000000
tagline                 17.572351
title                    0.000000
vote_average             0.000000
vote_count               0.000000
dtype: float64

In [8]:
credits.duplicated().sum()

0

In [9]:
movies.duplicated().sum()

0

In [10]:
movies=movies.merge(credits,on="title")
movies.shape

(4809, 23)
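
The merged frame has 4809 rows, six more than either input (4803), even though neither input contained fully duplicated rows. One likely explanation, not verified here, is that a few titles occur more than once, so joining on `title` pairs every repeated title with every matching row. The sketch below uses made-up toy frames (all values are invented for illustration) to show the effect, along with one possible alternative: joining on the movie identifier, assuming `id` and `movie_id` refer to the same key as the `head(1)` output suggests.

```python
import pandas as pd

# Toy stand-ins for the two datasets; every value here is invented.
toy_movies = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]}
)
toy_credits = pd.DataFrame(
    {"movie_id": [1, 2, 3],
     "title": ["Batman", "Batman", "Avatar"],
     "cast": ["a", "b", "c"]}
)

# Joining on a non-unique key multiplies rows:
# 2 x 2 "Batman" pairs + 1 "Avatar" row = 5 rows.
by_title = toy_movies.merge(toy_credits, on="title")
print(by_title.shape)  # (5, 4)

# Joining on the identifier keeps one row per movie;
# validate="one_to_one" raises if either key column has repeats.
by_id = toy_movies.merge(
    toy_credits,
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
    suffixes=("", "_credits"),
)
print(by_id.shape)  # (3, 5)
```

Whether the identifier join is appropriate for the real data depends on `id` and `movie_id` being unique in each file, which the `validate` argument would check.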

In [11]:
pd.set_option('display.max_columns', None)
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


Since I am building a content-based recommendation system, not all 23 columns are going to be useful. My goal is to extract just the keywords that can help me recommend movies to a particular user. The columns that I will be using are: **genres, id, keywords, overview, title, cast, crew**

In [12]:
movies=movies[['genres','id','keywords','overview','cast','crew','title']]

In [13]:
movies.sample(2)

Unnamed: 0,genres,id,keywords,overview,cast,crew,title
3478,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 37, ""nam...",80304,"[{""id"": 534, ""name"": ""mexico""}, {""id"": 10291, ...",Scheming of a way to save their father's ranch...,"[{""cast_id"": 1001, ""character"": ""Armando Alvar...","[{""credit_id"": ""52fe47b59251416c910732ff"", ""de...",Casa De Mi Padre
3423,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 18, ""name...",650,"[{""id"": 542, ""name"": ""street gang""}, {""id"": 57...",Boyz n the Hood is the popular and successful ...,"[{""cast_id"": 14, ""character"": ""Jason 'Furious'...","[{""credit_id"": ""52fe4264c3a36847f801ae93"", ""de...",Boyz n the Hood


In [14]:
movies.isnull().mean()*100

genres      0.000000
id          0.000000
keywords    0.000000
overview    0.062383
cast        0.000000
crew        0.000000
title       0.000000
dtype: float64

In [15]:
# Reassign instead of using inplace=True, since movies is a column subset of the merged frame
movies=movies.dropna()

# Data Preprocessing:

In [16]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [17]:
import ast

def extractName(raw):
    # Parse the stringified list of dicts and collect every "name" value.
    names=[]
    for entry in ast.literal_eval(raw):
        if "name" in entry:
            names.append(entry["name"])
    return names
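
The column values look like JSON, so `json.loads` may be a workable alternative to `ast.literal_eval`. This is a sketch on shortened sample strings, not a change to the notebook's approach, and the helper name `extract_names_json` is hypothetical. One difference worth knowing: `ast.literal_eval` only understands Python literals, so it raises on JSON tokens such as `null`, while `json.loads` maps them to `None`.

```python
import ast
import json

def extract_names_json(raw):
    """Return the "name" value of every dict in a JSON-encoded list."""
    return [entry["name"] for entry in json.loads(raw) if "name" in entry]

sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(extract_names_json(sample))  # ['Action', 'Adventure']

# A JSON null is valid input for json.loads but is not a Python literal.
with_null = '[{"id": null, "name": "Fantasy"}]'
print(extract_names_json(with_null))  # ['Fantasy']

try:
    ast.literal_eval(with_null)
except ValueError as exc:
    print("literal_eval failed:", type(exc).__name__)
```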

In [18]:
movies.genres=movies.genres.apply(extractName)

In [19]:
movies.iloc[0].genres

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

In [20]:
movies.iloc[0].keywords

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [21]:
movies.keywords=movies.keywords.apply(extractName)

In [22]:
def actor_name(raw):
    # Keep at most the first five cast names and stop scanning after that.
    names=[]
    for entry in ast.literal_eval(raw):
        if len(names)==5:
            break
        if "name" in entry:
            names.append(entry["name"])
    return names

In [23]:
movies.cast=movies.cast.apply(actor_name)

In [24]:
movies.iloc[0].cast

['Sam Worthington',
 'Zoe Saldana',
 'Sigourney Weaver',
 'Stephen Lang',
 'Michelle Rodriguez']

In [25]:
def director(raw):
    # Collect the name of every crew member whose job is "Director".
    names=[]
    for entry in ast.literal_eval(raw):
        if entry.get('job')=='Director':
            names.append(entry['name'])
    return names

In [26]:
movies.crew=movies.crew.apply(director)

In [27]:
movies.iloc[0].crew

['James Cameron']

In [28]:
movies.head()

Unnamed: 0,genres,id,keywords,overview,cast,crew,title
0,"[Action, Adventure, Fantasy, Science Fiction]",19995,"[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron],Avatar
1,"[Adventure, Fantasy, Action]",285,"[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski],Pirates of the Caribbean: At World's End
2,"[Action, Adventure, Crime]",206647,"[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes],Spectre
3,"[Action, Crime, Drama, Thriller]",49026,"[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...,"[Christian Bale, Michael Caine, Gary Oldman, A...",[Christopher Nolan],The Dark Knight Rises
4,"[Action, Adventure, Science Fiction]",49529,"[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...",[Andrew Stanton],John Carter


In [29]:
movies.overview=movies.overview.apply(lambda x:x.split())

In [30]:
def remove_space(items):
    # Collapse each multi-word entry into a single token, e.g. "Sam Worthington" -> "SamWorthington".
    return [item.replace(" ","") for item in items]

In [31]:
movies.cast=movies.cast.apply(remove_space)
movies.crew=movies.crew.apply(remove_space)
movies.genres=movies.genres.apply(remove_space)
movies.keywords=movies.keywords.apply(remove_space)

In [32]:
movies.head(2)

Unnamed: 0,genres,id,keywords,overview,cast,crew,title
0,"[Action, Adventure, Fantasy, ScienceFiction]",19995,"[cultureclash, future, spacewar, spacecolony, ...","[In, the, 22nd, century,, a, paraplegic, Marin...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron],Avatar
1,"[Adventure, Fantasy, Action]",285,"[ocean, drugabuse, exoticisland, eastindiatrad...","[Captain, Barbossa,, long, believed, to, be, d...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski],Pirates of the Caribbean: At World's End
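
The notebook does not state why the spaces are stripped. A common motivation, assumed here, is that a later whitespace tokenizer would otherwise split "Sam Worthington" and "Sam Mendes" into tokens that share "Sam", creating a spurious match between unrelated people. The helper `tokens` below is hypothetical and only illustrates that effect.

```python
def tokens(entries, strip_spaces):
    """Whitespace-tokenize a list of names, optionally collapsing each name first."""
    if strip_spaces:
        entries = [entry.replace(" ", "") for entry in entries]
    return set(" ".join(entries).split())

movie_a = ["Sam Worthington"]
movie_b = ["Sam Mendes"]

# With spaces kept, the two movies appear to share a token.
print(tokens(movie_a, False) & tokens(movie_b, False))  # {'Sam'}

# With spaces removed, each person is a single distinct token.
print(tokens(movie_a, True) & tokens(movie_b, True))  # set()
```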


# EDA:

In [33]:
from wordcloud import WordCloud
wc=WordCloud(width=500, height=500, min_font_size=10, background_color='black')

In [36]:
print(movies['genres'].dtype)

object


In [38]:
genres_string = movies.genres.apply(lambda x: " ".join(x))

In [39]:
genres_string

0       Action Adventure Fantasy ScienceFiction
1                      Adventure Fantasy Action
2                        Action Adventure Crime
3                   Action Crime Drama Thriller
4               Action Adventure ScienceFiction
                         ...                   
4804                      Action Crime Thriller
4805                             Comedy Romance
4806               Comedy Drama Romance TVMovie
4807                                           
4808                                Documentary
Name: genres, Length: 4806, dtype: object
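
A word cloud is typically drawn from one long string or from a token-frequency table. As a sketch of that next step, the rows can be concatenated and counted with the standard library. The rows below are copied from the first lines of the output above rather than the full dataset, and the variable names are illustrative.

```python
from collections import Counter

# A few rows copied from the genres output shown above.
rows = [
    "Action Adventure Fantasy ScienceFiction",
    "Adventure Fantasy Action",
    "Action Adventure Crime",
    "Action Crime Drama Thriller",
]

# One space-separated string, the kind of input a word cloud usually takes.
all_genres = " ".join(rows)

# Token frequencies: higher counts would be drawn larger in the cloud.
genre_counts = Counter(all_genres.split())
print(genre_counts.most_common(2))  # [('Action', 4), ('Adventure', 3)]
```

With the `wc` object created earlier, passing a string like `all_genres` to `wc.generate(...)` would be the usual way to render the cloud.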