# Preprocessing

Imports

In [1]:
import numpy as np
import pandas as pd
import ast

Loading the data

In [2]:
credits_path, movies_path = 'files/credits.csv', 'files/movies.csv'
df_credits = pd.read_csv(credits_path)
df_movies = pd.read_csv(movies_path)

Configure pandas to display every row and column of a DataFrame (useful when inspecting large DataFrames).

In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Columns before the merge

In [4]:
print("Number of columns:", len(df_movies.columns), "\nColumns:", list(df_movies.columns))

Number of columns: 20 
Columns: ['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']


Merge the two DataFrames using 'title' as the key

In [5]:
df_movies = df_movies.merge(df_credits, on='title')
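Merging on `title` only gives one row per film if every title is unique. The recommendations printed at the end of this notebook list "Batman" twice, which suggests that repeated titles do exist in the data. The following is a minimal sketch with made-up rows, not the notebook's method; it assumes that `id` in the movies file and `movie_id` in the credits file hold the same identifier, which the notebook does not confirm.

```python
import pandas as pd

# Toy data (hypothetical rows): two different films share the title "Batman".
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on the title pairs every "Batman" with every "Batman": 2 x 2 + 1 = 5 rows.
by_title = movies.merge(credits, on="title")

# Joining on the identifier keeps one row per film (assumes id == movie_id).
by_id = movies.merge(credits, left_on="id", right_on="movie_id",
                     suffixes=("", "_credits"))

print(len(by_title), len(by_id))
```

If the identifiers do match, joining on them avoids the cross-paired rows; otherwise the duplicates can be removed after the merge.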

Columns after the merge

In [6]:
print("Number of columns:", len(df_movies.columns), "\nColumns:", list(df_movies.columns))

Number of columns: 23 
Columns: ['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count', 'movie_id', 'cast', 'crew']


Select only the columns needed for processing

In [7]:
columns = ['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']
df_movies = df_movies[columns]

In [8]:
print("Number of columns:", len(df_movies.columns), "\nColumns:", list(df_movies.columns))

Number of columns: 7 
Columns: ['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']


Drop rows that contain null values

In [9]:
if df_movies.isnull().values.any():
    df_movies.dropna(axis=0, how='any', inplace=True)
    print('All null values dropped')
else:
    print('No null values found')

All null values dropped
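It can help to see which columns hold the nulls and how many rows are lost before dropping them. This is a small sketch on a made-up DataFrame (the `toy` frame and its values are hypothetical); it also shows that `dropna` keeps the original index labels, so the labels have gaps afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing overview.
toy = pd.DataFrame({"title": ["A", "B", "C"], "overview": ["text", np.nan, "more"]})

null_counts = toy.isnull().sum()          # nulls per column
cleaned = toy.dropna(axis=0, how="any")   # drop every row that has at least one null
rows_dropped = len(toy) - len(cleaned)

print(null_counts)
print("rows dropped:", rows_dropped)
print("remaining index labels:", list(cleaned.index))
```

Calling `reset_index(drop=True)` on the cleaned frame is one way to make labels and row positions agree again.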


Drop duplicated rows, if any

In [10]:
if df_movies.duplicated().any():
    df_movies.drop_duplicates(inplace=True)
    print('All duplicated rows dropped')
else:
    print('No duplicated rows found.')

No duplicated rows found.


Helper functions for formatting the data

In [11]:
def convert(obj):
    """
    Return the 'name' of every dictionary in a stringified list of dictionaries
    """
    return [i['name'] for i in ast.literal_eval(obj)]

def convert3(obj):
    """
    Return the 'name' of the first 3 dictionaries in a stringified list of dictionaries
    """
    return [i['name'] for i in ast.literal_eval(obj)[:3]]

def fetch_director(obj):
    """
    Return the names of the crew members whose job is 'Director'
    """
    return [i['name'] for i in ast.literal_eval(obj) if i['job'] == 'Director']
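To make the parsing step concrete, here is a sketch on a sample string shaped like the `genres` preview in this notebook. The cells appear to hold JSON text, so `json.loads` is shown next to `ast.literal_eval` as a possible alternative; the `raw_with_null` string is a hypothetical case, not taken from the data.

```python
import ast
import json

# Sample shaped like the 'genres' column preview.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

names_ast = [item["name"] for item in ast.literal_eval(raw)]
names_json = [item["name"] for item in json.loads(raw)]

# JSON literals such as null are not Python literals, so literal_eval rejects them.
raw_with_null = '[{"id": 1, "name": "Extra", "order": null}]'  # hypothetical input
try:
    ast.literal_eval(raw_with_null)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

names_null = [item["name"] for item in json.loads(raw_with_null)]
print(names_ast, names_json, literal_eval_ok, names_null)
```

Both parsers agree on the sample; they differ only when the text contains JSON-specific literals such as `null`, `true` or `false`.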

Columns before applying the functions

In [12]:
df_movies[['genres', 'keywords', 'cast', 'crew', 'overview']].head()

Unnamed: 0,genres,keywords,cast,crew,overview
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","In the 22nd century, a paraplegic Marine is di..."
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...","Captain Barbossa, long believed to be dead, ha..."
2,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",A cryptic message from Bond’s past sends him o...
3,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",Following the death of District Attorney Harve...
4,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...","John Carter is a war-weary, former military ca..."


Apply the functions to format the data

In [13]:
# keep only the genre names
df_movies['genres'] = df_movies['genres'].apply(convert)
# keep only the keyword(s)
df_movies['keywords'] = df_movies['keywords'].apply(convert)
# keep only the first 3 cast names
df_movies['cast'] = df_movies['cast'].apply(convert3)
# keep only the directors' names
df_movies['crew'] = df_movies['crew'].apply(fetch_director)
# split the movie description into words (makes NLP tasks easier)
df_movies['overview'] = df_movies['overview'].apply(lambda x: x.split())

Columns after applying the functions

In [14]:
df_movies[['genres', 'keywords', 'cast', 'crew', 'overview']].head()

Unnamed: 0,genres,keywords,cast,crew,overview
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan],"[Following, the, death, of, District, Attorney..."
4,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton],"[John, Carter, is, a, war-weary,, former, mili..."


Remove the spaces from multi-word names in specific columns

In [15]:
df_movies['genres'] = df_movies['genres'].apply(lambda x: [i.replace(" ", "") for i in x])
df_movies['crew'] = df_movies['crew'].apply(lambda x: [i.replace(" ", "") for i in x])
df_movies['cast'] = df_movies['cast'].apply(lambda x: [i.replace(" ", "") for i in x])
df_movies['keywords'] = df_movies['keywords'].apply(lambda x: [i.replace(" ", "") for i in x])
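A likely reason for removing the spaces is that a word-level tokenizer would otherwise split a name into separate tokens, so two different people named "Sam" would share a token. The sketch below illustrates this with a regular expression that mirrors `CountVectorizer`'s default token pattern; the `tokens` helper is hypothetical and only for illustration.

```python
import re
from collections import Counter

# Same shape as CountVectorizer's default token pattern: words of 2+ characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokens(text):
    """Lowercase the text and split it into word tokens (illustrative helper)."""
    return TOKEN.findall(text.lower())

with_spaces = Counter(tokens("Sam Worthington Sam Mendes"))
without_spaces = Counter(tokens("SamWorthington SamMendes"))

print(with_spaces)     # 'sam' is counted twice, once per person
print(without_spaces)  # each full name is its own token
```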

Create a column that combines the information from the other columns

In [16]:
df_movies['tags'] = df_movies['overview'] + df_movies['genres'] + df_movies['keywords'] + df_movies['cast'] + df_movies['crew']
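Because each cell in these columns holds a Python list, `+` between two columns concatenates the lists row by row. A small sketch on made-up rows:

```python
import pandas as pd

# Hypothetical toy rows; every cell is a list, as in the notebook's columns.
toy = pd.DataFrame({
    "overview": [["a", "marine"], ["captain"]],
    "genres": [["Action"], ["Adventure", "Fantasy"]],
})

# Element-wise list concatenation: row i gets overview[i] + genres[i].
toy["tags"] = toy["overview"] + toy["genres"]
print(toy["tags"].tolist())
```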

Columns after removing the spaces and adding the new column

In [17]:
df_movies[['genres', 'keywords', 'cast', 'crew', 'overview', 'tags']].head()

Unnamed: 0,genres,keywords,cast,crew,overview,tags
0,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin...","[In, the, 22nd, century,, a, paraplegic, Marin..."
1,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d...","[Captain, Barbossa,, long, believed, to, be, d..."
2,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send...","[A, cryptic, message, from, Bond’s, past, send..."
3,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Following, the, death, of, District, Attorney...","[Following, the, death, of, District, Attorney..."
4,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili...","[John, Carter, is, a, war-weary,, former, mili..."


Create the final DataFrame with only the columns the system needs

In [18]:
cols = ['movie_id', 'title', 'tags']
# .copy() makes new_df independent of df_movies, so later assignments do not trigger SettingWithCopyWarning
new_df = df_movies[cols].copy()

Format the contents of the tags column

In [19]:
# join each list of tokens into one space-separated string (this drops the list brackets)
new_df['tags'] = new_df['tags'].apply(lambda x: ' '.join(x))
# convert every letter to lowercase
new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())


Resulting DataFrame

In [20]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


# NLP

Imports

In [21]:
# tokenization and vocabulary construction
from sklearn.feature_extraction.text import CountVectorizer
import nltk
# stemming (roughly, strips suffixes from words)
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity

Instantiate the CountVectorizer text vectorizer

In [22]:
cv = CountVectorizer(max_features=5000, stop_words='english')
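The idea behind this configuration can be sketched in plain Python: tokenize, drop stop words, keep the `max_features` most frequent terms, and count them per document. This is a simplified illustration, not scikit-learn's implementation; `bag_of_words`, the sample documents and the tiny stop-word set are all hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters

def bag_of_words(docs, max_features, stop_words=frozenset()):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [
        [t for t in TOKEN.findall(doc.lower()) if t not in stop_words]
        for doc in docs
    ]
    # Corpus-wide term frequencies decide which terms are kept.
    totals = Counter(t for doc in tokenized for t in doc)
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = [
    "space war in the space future",
    "the space war on mars",
    "a pirate at sea",
]
vocab, vectors = bag_of_words(docs, max_features=2,
                              stop_words={"in", "the", "on", "at"})
print(vocab)
print(vectors)
```

With `max_features=2` only the two most frequent terms survive, so the third document becomes an all-zero vector; the same thing can happen to rare-vocabulary movies when the limit is 5000.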

Build the vocabulary and the word-count matrix from the 'tags' column

In [23]:
vectors = cv.fit_transform(new_df['tags']).toarray()

Function to apply stemming to a text

In [24]:
ps = PorterStemmer()

def stem(text):
    return " ".join([ps.stem(i) for i in text.split()])

Column before stemming

In [25]:
new_df['tags'].head()

0    in the 22nd century, a paraplegic marine is di...
1    captain barbossa, long believed to be dead, ha...
2    a cryptic message from bond’s past sends him o...
3    following the death of district attorney harve...
4    john carter is a war-weary, former military ca...
Name: tags, dtype: object

In [26]:
# apply stemming to every word in the 'tags' column;
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
new_df = new_df.assign(tags=new_df['tags'].apply(stem))


Column after stemming

In [27]:
new_df['tags'].head()

0    in the 22nd century, a parapleg marin is dispa...
1    captain barbossa, long believ to be dead, ha c...
2    a cryptic messag from bond’ past send him on a...
3    follow the death of district attorney harvey d...
4    john carter is a war-weary, former militari ca...
Name: tags, dtype: object

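In [28]:
# Build a matrix with the cosine similarity of every pair of vectors.
# Note: `vectors` was computed in cell 23, before stemming, so the stemmed tags
# are not reflected here unless the vectorizer is re-fit after cell 26.
similarity = cosine_similarity(vectors)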
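For reference, the cosine similarity of two count vectors is their dot product divided by the product of their lengths. The NumPy sketch below computes the full pairwise matrix for a made-up count matrix; `cosine_similarity_matrix` is a hypothetical helper written for illustration, not the scikit-learn function.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by zero
    unit = vectors / norms   # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors are cosines

# Hypothetical counts: rows 0 and 1 point in the same direction, row 2 is orthogonal.
counts = np.array([[2, 1, 0], [4, 2, 0], [0, 0, 3]])
sim = cosine_similarity_matrix(counts)
print(np.round(sim, 3))
```

Rows 0 and 1 differ only in scale, so their similarity is 1; this is why cosine similarity suits word counts, where longer descriptions should not count as more similar by themselves.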

# Recommender

Function that finds, sorts and prints the recommended movies based on the similarity between tags

In [29]:
def recommend(movie):
    # positional lookup: rows were dropped earlier, so index labels may not match row positions in `similarity`
    movie_index = np.flatnonzero(new_df['title'].to_numpy() == movie)[0]
    distances = similarity[movie_index]
    # sort by similarity (highest first) and skip the first entry, which is normally the movie itself
    movies_list = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:6]

    header = f"{'='*3} Top 5 recommendations for '{movie}' {'='*3}"
    print(header)
    for rank, (idx, _) in enumerate(movies_list, start=1):
        print(f"{rank}. ", new_df.iloc[idx]['title'])
    print("=" * len(header), "\n")
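`similarity` is a NumPy array, so it is indexed by row position, whereas a pandas lookup such as `.index[0]` returns an index label. The two only coincide while no rows have been dropped. A small sketch with made-up rows shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; dropping row "B" leaves labels 0, 2, 3 at positions 0, 1, 2.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tags": ["x", None, "y", "z"],
}).dropna()

label = toy[toy["title"] == "D"].index[0]                       # index label
position = np.flatnonzero(toy["title"].to_numpy() == "D")[0]    # row position

print(label, position)
print(toy.iloc[position]["title"])
```

Here a 3 x 3 similarity matrix indexed with the label would raise an `IndexError`, and for other rows it could silently return the wrong movie. Resetting the index after dropping rows is another way to keep the two in step.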

## Test

Test the system with the first 5 movies in the DataFrame

In [30]:
movies = new_df['title'][:5]

In [31]:
# a plain loop avoids building (and echoing) a list of None return values
for movie in movies:
    recommend(movie)

=== Top 5 recommendations for 'Avatar' ===
1.  Titan A.E.
2.  Independence Day
3.  Small Soldiers
4.  Aliens vs Predator: Requiem
5.  Krull

=== Top 5 recommendations for 'Pirates of the Caribbean: At World's End' ===
1.  Pirates of the Caribbean: Dead Man's Chest
2.  Pirates of the Caribbean: The Curse of the Black Pearl
3.  Pirates of the Caribbean: On Stranger Tides
4.  20,000 Leagues Under the Sea
5.  Puss in Boots

=== Top 5 recommendations for 'Spectre' ===
1.  Quantum of Solace
2.  Never Say Never Again
3.  Skyfall
4.  From Russia with Love
5.  Thunderball

=== Top 5 recommendations for 'The Dark Knight Rises' ===
1.  The Dark Knight
2.  Batman Begins
3.  Batman
4.  Batman Returns
5.  Batman

=== Top 5 recommendations for 'John Carter' ===
1.  Star Trek: Insurrection
2.  Mission to Mars
3.  Captain America: The First Avenger
4.  Escape from Planet Earth
5.  Ghosts of Mars