# Movie Data Analysis

In this notebook we analyze a movie dataset. The data include genres, keywords, cast, crew, release date, popularity, revenue, vote average, and vote count.

## Importing Libraries

First, we import the libraries needed for the analysis.


In [131]:
import pandas as pd
import numpy as np
import ast
import os
pd.set_option('display.max_columns', None)


## Loading the Data

We load the two datasets, `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`.


In [132]:
movies = pd.read_csv('./dataSets/tmdb_5000_movies.csv')
credits = pd.read_csv('./dataSets/tmdb_5000_credits.csv', encoding='utf-8')

In [133]:
print(movies.shape)
movies.head(1)

(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [134]:
print(credits.shape)
credits.head(1)

(4803, 4)


Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


## Merging the Data

We merge the two datasets into one on the `title` column, then drop every row that has a null value in any column.


In [135]:
movies = movies.merge(credits, on='title')
movies.dropna(inplace=True)
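
Merging on `title` works cleanly only if titles are unique: pandas pairs every left row with every right row that shares the key, so a repeated title multiplies rows. Likewise, `dropna()` without arguments discards a row when any column is null, including columns that are not used later. The sketch below illustrates both effects on made-up toy frames (`toy_movies` and `toy_credits` are hypothetical, not the TMDB data) and shows, as one possible alternative rather than the notebook's method, a merge on the identifier columns and a `dropna` restricted to a subset.

```python
import pandas as pd

# Hypothetical toy data: two different films share the title "B".
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "B"],
    "homepage": ["http://a.example", None, None],
    "overview": ["first", "second", "third"],
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["A", "B", "B"],
    "cast": ["x", "y", "z"],
})

# Merging on a non-unique key pairs every matching row: 1 + 2*2 = 5 rows.
by_title = toy_movies.merge(toy_credits, on="title")

# Merging on the identifier keeps one row per film.
by_id = toy_movies.merge(
    toy_credits.drop(columns="title"), left_on="id", right_on="movie_id"
)

# dropna() with no arguments drops a row if ANY column is null;
# subset= restricts the check to the listed columns.
strict = by_id.dropna()
lenient = by_id.dropna(subset=["overview"])

print(len(by_title), len(by_id), len(strict), len(lenient))
```

Whether these alternatives suit the real data depends on how many titles repeat and which columns the analysis needs, which is worth checking with `movies['title'].duplicated().sum()` and `movies.isna().sum()` before deciding.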

In [136]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


## Selecting Columns

We keep only the columns relevant to our analysis.


In [137]:

movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew', 
                 'release_date', 'popularity', 'revenue', 'vote_average', 'vote_count']]

movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,release_date,popularity,revenue,vote_average,vote_count
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",2009-12-10,150.437577,2787965087,7.2,11800


## Converting the Data

We define a function that converts the `genres` and `keywords` columns from strings into lists of names. Input that cannot be parsed yields an empty list.


In [138]:
import ast

def convert(text):
    try:
        return [i['name'] for i in ast.literal_eval(text)]
    except (ValueError, SyntaxError, TypeError):
        return []

movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)

movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,release_date,popularity,revenue,vote_average,vote_count
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",2009-12-10,150.437577,2787965087,7.2,11800
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",2007-05-19,139.082615,961000000,6.9,4500
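
As a quick check of how `convert` behaves on good and bad input, the sketch below repeats its definition and calls it on three hand-written values: a well-formed string, a truncated string, and a missing value. The sample inputs are illustrative assumptions, not rows from the dataset.

```python
import ast

def convert(text):
    # Same logic as the notebook's convert: parse, then keep each 'name'.
    try:
        return [i['name'] for i in ast.literal_eval(text)]
    except (ValueError, SyntaxError, TypeError):
        return []

valid = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
truncated = '[{"id": 28, "name": "Act'   # cut off mid-string: SyntaxError
missing = float("nan")                    # not a string: ValueError

results = [convert(valid), convert(truncated), convert(missing)]
print(results)
```

Because failures collapse silently to `[]`, counting empty lists after the conversion is a simple way to see how many rows could not be parsed.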


### Evaluating Literals in Python

The following code uses Python's `ast` module to evaluate a string that represents a list of dictionaries. `ast.literal_eval` converts the string into a Python data structure (here, a list of dictionaries) and accepts only literals, so it does not execute arbitrary code.


In [139]:
import ast
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]
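
The strings shown above also happen to be valid JSON, so `json.loads` would parse them too. The sketch below compares the two parsers on a hand-written sample and probes two edge cases; `is_literal` is a hypothetical helper introduced only for this illustration.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# For plain strings and numbers, both parsers give the same structure.
same = ast.literal_eval(raw) == json.loads(raw)

def is_literal(source):
    # Hypothetical helper: True if ast.literal_eval accepts the source.
    try:
        ast.literal_eval(source)
        return True
    except (ValueError, SyntaxError):
        return False

# A function call is not a literal, so it is rejected instead of executed.
call_ok = is_literal('len([1, 2, 3])')

# JSON's null is not a Python literal; json.loads maps it to None.
null_ok = is_literal('[{"id": 1, "name": null}]')
json_null = json.loads('[{"id": 1, "name": null}]')

print(same, call_ok, null_ok, json_null)
```

If a column might contain JSON-only tokens such as `null`, `true`, or `false`, `json.loads` is the safer choice; for the sample shown here either parser works.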

In [140]:
def convert3(text):
    # Keep only the first three cast names; unparseable input yields []
    try:
        return [i['name'] for i in ast.literal_eval(text)[:3]]
    except (ValueError, SyntaxError, TypeError):
        return []


movies['cast'] = movies['cast'].apply(convert3)


movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,release_date,popularity,revenue,vote_average,vote_count
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",2009-12-10,150.437577,2787965087,7.2,11800
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",2007-05-19,139.082615,961000000,6.9,4500


In [141]:
def fetch_director(text):
    # Collect the name of every crew member whose job is 'Director'
    try:
        return [i['name'] for i in ast.literal_eval(text) if i['job'] == 'Director']
    except (ValueError, SyntaxError, TypeError):
        return []

# Apply fetch_director to the 'crew' column
movies['crew'] = movies['crew'].apply(fetch_director)

# Show a random sample of the DataFrame to verify
movies.sample(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,release_date,popularity,revenue,vote_average,vote_count
566,920,Cars,"Lightning McQueen, a hotshot rookie race car d...","[Animation, Adventure, Comedy, Family]","[car race, car journey, village and town, auto...","[Owen Wilson, Paul Newman, Bonnie Hunt]","[John Lasseter, Joe Ranft]",2006-06-08,82.643036,461983149,6.6,3877
1368,350,The Devil Wears Prada,The Devil Wears Prada is about a young journal...,"[Comedy, Drama, Romance]","[paris, journalist, journalism, world of fasio...","[Meryl Streep, Anne Hathaway, Emily Blunt]",[David Frankel],2006-06-30,83.893257,326551094,7.0,3088


In [142]:
movies['overview'] = movies['overview'].apply(lambda x: x.split())
movies.sample(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,release_date,popularity,revenue,vote_average,vote_count
1452,13056,Punisher: War Zone,"[Waging, his, one-man, war, on, the, world, of...","[Action, Crime]","[broken neck, fbi agent, wall safe, trashed ho...","[Ray Stevenson, Dominic West, Julie Benz]",[Lexi Alexander],2008-12-05,17.112498,10089373,5.6,294
1171,1213,The Talented Mr. Ripley,"[Tom, Ripley, is, a, calculating, young, man, ...","[Thriller, Crime, Drama]","[venice, italy, gay, new york, lovesickness, d...","[Matt Damon, Gwyneth Paltrow, Jude Law]",[Anthony Minghella],1999-12-25,31.385257,128798265,7.0,767


In [147]:
def collapse(L):
    # Remove the spaces inside each item so that it becomes a single token
    return [i.replace(" ", "") for i in L]

# Apply collapse to the relevant columns
movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)
movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)

# Show the first rows to verify
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,release_date,popularity,revenue,vote_average,vote_count
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],2009-12-10,150.437577,2787965087,7.2,11800
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],2007-05-19,139.082615,961000000,6.9,4500
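
The notebook does not state why spaces are removed. A likely reason, assumed here, is that the names are later compared as whitespace-separated tokens: without collapsing, two different people who share a first name would share a token. The sketch below shows the effect on two made-up single-person lists.

```python
def collapse(names):
    # Same logic as the notebook's collapse: drop spaces inside each item.
    return [n.replace(" ", "") for n in names]

# Hypothetical example lists: two different people with the same first name.
people_a = ["Sam Worthington"]
people_b = ["Sam Mendes"]

# Without collapsing, splitting on whitespace yields a shared token.
shared_raw = set(" ".join(people_a).split()) & set(" ".join(people_b).split())

# After collapsing, each full name is one token and nothing overlaps.
shared_collapsed = set(collapse(people_a)) & set(collapse(people_b))

print(shared_raw, shared_collapsed)
```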


In [149]:
# Build the tags column by concatenating the selected columns.
# 'overview' is a list of tokens at this point, so it is joined with spaces;
# astype(str) would embed the list's repr ("['In', 'the', ...]") instead.
movies['tags'] = (
    movies['overview'].apply(' '.join) + ' ' +
    movies['genres'].apply(' '.join) + ' ' +
    movies['keywords'].apply(' '.join) + ' ' +
    movies['cast'].apply(' '.join) + ' ' +
    movies['crew'].apply(' '.join) + ' ' +
    movies['popularity'].astype(str) + ' ' +
    movies['revenue'].astype(str) + ' ' +
    movies['vote_average'].astype(str) + ' ' +
    movies['vote_count'].astype(str)
)

# Show the first rows to verify
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,release_date,popularity,revenue,vote_average,vote_count,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],2009-12-10,150.437577,2787965087,7.2,11800,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],2007-05-19,139.082615,961000000,6.9,4500,"Captain Barbossa, long believed to be dead, ha..."


In [150]:
# Keep the identifying columns plus tags; the source columns are now folded into tags
new = movies.drop(columns=['overview', 'genres', 'keywords', 'cast', 'crew', 'popularity', 'revenue', 'vote_average', 'vote_count'])
new.head(2)

Unnamed: 0,movie_id,title,release_date,tags
0,19995,Avatar,2009-12-10,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,2007-05-19,"Captain Barbossa, long believed to be dead, ha..."


In [152]:
# tags is already a plain string here; the guard leaves strings untouched
# and joins only values that are still lists
new['tags'] = new['tags'].apply(lambda x: " ".join(x) if isinstance(x, list) else x)

new.head(2)

Unnamed: 0,movie_id,title,release_date,tags
0,19995,Avatar,2009-12-10,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,2007-05-19,"Captain Barbossa, long believed to be dead, ha..."

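
When a column holds lists, `astype(str)` and `' '.join` produce very different strings, and calling `" ".join` on text that is already a string inserts a space between every character. The sketch below reproduces both effects on a one-row toy frame; the frame `df` and its values are assumptions chosen only to mirror the column shapes used above.

```python
import pandas as pd

# Hypothetical one-row frame with list-valued columns.
df = pd.DataFrame({
    "overview": [["In", "the", "22nd"]],
    "genres": [["Action", "ScienceFiction"]],
})

# astype(str) stores the list's repr, brackets and quotes included.
as_repr = df["overview"].astype(str) + " " + df["genres"].apply(" ".join)

# Joining the tokens gives plain text.
as_text = df["overview"].apply(" ".join) + " " + df["genres"].apply(" ".join)

# Joining a string iterates over its characters.
spaced = " ".join("In the")

print(as_repr[0])
print(as_text[0])
print(spaced)
```

Printing a full `tags` value, for example with `new['tags'].iloc[0]`, rather than relying on the truncated table view makes this kind of problem easier to spot.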