<a href="https://colab.research.google.com/github/AndresMontesDeOca/TextMining/blob/main/Text_Mining_FINAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>

#### Universidad Austral<br>
#### Maestría en Minería de Datos y Gestión del Conocimiento<br>
#### Text Mining<br>
#### CGC: Clasificador de Géneros Cinematográficos<br>


#### Team members:<br>
Alejandra Reyes<br>
Andres Montes de Oca<br>
Rafael Gimenez<br>
Soledad Ríos<br>
Tomas Sauro

</center>

## Introduction
Title: Clasificador de Géneros Cinematográficos (CGC), a film genre classifier

Note: use more than one model, with an odd number of models so that a majority vote can decide the final prediction.

Problem: A service that surveys the content available on streaming platforms frequently finds titles that have no genre information, although their synopsis is available.

Service and company facing the problem: [Content Pulse | BB Media](https://bb.vision/content-pulse-en/)

Objective: Train a model that, given the synopsis of a film, returns the genres of that content.

Roadmap:

1. Data access
2. Data tabulation
3. Data cleaning
4. Model exploration
5. Testing
6. Analysis of results
7. Conclusions

Possible setbacks:

1. Missing, incomplete, or non-standardized data.
2. A poor split of the sub-datasets used for testing and validation (for example, ending up with synopses that are rich in description for testing but not for validation).

## Libraries

In [1]:
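# Check and install
import importlib.util
import subprocess
import sys

def instalar_librerias(packages):
    """Install each pip package whose import module is not already available."""
    for pip_name, module_name in packages.items():
        if importlib.util.find_spec(module_name) is None:
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', pip_name])

# Third-party libraries to check, as {pip name: import name}
# (json and datetime belong to the standard library and need no installation)
listado_librerias = {'requests': 'requests', 'gdown': 'gdown', 'numpy': 'numpy', 'pandas': 'pandas', 'seaborn': 'seaborn', 'matplotlib': 'matplotlib', 'tabulate': 'tabulate', 'scikit-learn': 'sklearn'}

# Check and install the libraries
instalar_librerias(listado_librerias)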

In [2]:
# Imports
import requests
import json
import datetime
import gdown

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate

from sklearn.model_selection import train_test_split

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

## 0. Data access (example with a few variables using the TMDB API)


The data come from TMDB, an open-access collaborative database, and are retrieved through its [API](https://developer.themoviedb.org/docs).

In [3]:
# Credential
api_key = 'bf0a945ba271caf72a6a1b1f53a1084d'

# Base URL
base_url = 'https://api.themoviedb.org/3/'

# Endpoints for the lists of movies and TV series
movie_endpoint = 'discover/movie'
tv_endpoint = 'discover/tv'

# Parameters
params = {
    'api_key': api_key,
    'sort_by': 'popularity.desc'  # Most popular first: well-known titles make it easier to check that the retrieved data are correct
}

# Function to get cast names
def get_cast_names(content_type, content_id):
    credits_endpoint = f'{content_type}/{content_id}/credits'
    credits_params = {
        'api_key': api_key
    }
    credits_response = requests.get(base_url + credits_endpoint, params=credits_params)
    credits_data = credits_response.json()
    # The cast is ordered by importance, so keep the first 3 entries (the leads)
    cast_names = [actor['name'] for actor in credits_data.get('cast', [])[:3]]
    return ', '.join(cast_names)

# Function to get production company names
def get_production_companies(content_type, content_id):
    details_endpoint = f'{content_type}/{content_id}'
    details_params = {
        'api_key': api_key
    }
    details_response = requests.get(base_url + details_endpoint, params=details_params)
    details_data = details_response.json()
    production_companies = ', '.join([company['name'] for company in details_data.get('production_companies', [])])
    return production_companies

# Send a GET request to the API and fetch the movie data
movie_response = requests.get(base_url + movie_endpoint, params=params)
movie_data = movie_response.json()

# Send a GET request to the API and fetch the TV series data
tv_response = requests.get(base_url + tv_endpoint, params=params)
tv_data = tv_response.json()

# Process the movie and TV series results
results_list = []

for content_type, content_data in [('movie', movie_data), ('tv', tv_data)]:
    for result in content_data['results']:
        content_id = result['id']
        title = result['title'] if content_type == 'movie' else result['name']
        release_date = result['release_date'] if content_type == 'movie' else result['first_air_date']
        overview = result['overview']
        poster_path = result['poster_path']
        popularity = result['popularity']
        vote_count = result['vote_count']
        vote_average = result['vote_average']

        # Get the cast names with the helper function
        cast_names = get_cast_names(content_type, content_id)

        # Get the production company names with the helper function
        production_companies = get_production_companies(content_type, content_id)

        results_list.append({
            'type': content_type.capitalize(),
            'id': content_id,
            'title': title,
            'release': release_date,
            'synopsis': overview,
            'cast': cast_names,
            'productions': production_companies,
            'popularity': popularity,
            'votes': vote_count,
            'score': vote_average,
            'imagen': f'https://image.tmdb.org/t/p/w500{poster_path}'
        })

# DataFrame
tmdb_base_20 = pd.DataFrame(results_list)
tmdb_base_20.head()

Unnamed: 0,type,id,title,release,synopsis,cast,productions,popularity,votes,score,imagen
0,Movie,1008042,Talk to Me,2023-07-26,When a group of friends discover how to conjur...,"Sophie Wilde, Alexandra Jensen, Joe Bird","Causeway Films, Metrol Technology, Screen Aust...",3538.457,613,7.3,https://image.tmdb.org/t/p/w500/kdPMUMJzyYAc4r...
1,Movie,385687,Fast X,2023-05-17,Over many missions and against impossible odds...,"Vin Diesel, Michelle Rodriguez, Tyrese Gibson","Universal Pictures, Original Film, One Race, P...",2774.925,3736,7.3,https://image.tmdb.org/t/p/w500/fiVW06jE7z9YnO...
2,Movie,346698,Barbie,2023-07-19,Barbie and Ken are having the time of their li...,"Margot Robbie, Ryan Gosling, America Ferrera","LuckyChap Entertainment, Heyday Films, NB/GG P...",2820.205,4668,7.3,https://image.tmdb.org/t/p/w500/iuFNMS8U5cb6xf...
3,Movie,615656,Meg 2: The Trench,2023-08-02,An exploratory dive into the deepest depths of...,"Jason Statham, Wu Jing, Shuya Sophia Cai","Apelles Entertainment, Warner Bros. Pictures, ...",2429.447,1790,7.0,https://image.tmdb.org/t/p/w500/4m1Au3YkjqsxF8...
4,Movie,968051,The Nun II,2023-09-06,"In 1956 France, a priest is violently murdered...","Taissa Farmiga, Jonas Bloquet, Storm Reid","New Line Cinema, Atomic Monster, The Safran Co...",1849.236,211,6.6,https://image.tmdb.org/t/p/w500/5gzzkR7y3hnY8A...
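
The example above does not collect genres, which are the target of this project. As a hedged sketch: TMDB discover results carry a `genre_ids` list, and the `genre/movie/list` and `genre/tv/list` endpoints return the id-to-name mapping. The helper names `build_genre_lookup` and `genre_names`, and the sample payloads, are illustrative assumptions and are not part of the original notebook.

```python
def build_genre_lookup(genre_list_payload):
    """Turn a genre-list payload ({'genres': [{'id': ..., 'name': ...}]}) into {id: name}."""
    return {g['id']: g['name'] for g in genre_list_payload.get('genres', [])}

def genre_names(result, lookup):
    """Map a result's genre_ids to a comma-separated string of names, skipping unknown ids."""
    return ','.join(lookup[i] for i in result.get('genre_ids', []) if i in lookup)

# Hypothetical payloads shaped like the API responses (no network call is made here)
sample_genre_payload = {'genres': [{'id': 27, 'name': 'Horror'}, {'id': 53, 'name': 'Thriller'}]}
sample_result = {'id': 1008042, 'title': 'Talk to Me', 'genre_ids': [27, 53, 999]}

lookup = build_genre_lookup(sample_genre_payload)
print(genre_names(sample_result, lookup))
```

In the notebook, the payload could be fetched with `requests.get(base_url + 'genre/movie/list', params={'api_key': api_key}).json()`, and the resulting string stored under a `genres` key in `results_list`.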


## 1. Data access (pre-existing file with public data previously downloaded from TMDB)

In [4]:
# URL of the Excel file
url = "https://onedrive.live.com/download?resid=B5CCCD69939F6AA3%21985&authkey=!AMd0773xIAnUL_I&em=x&app=Excel"

# Download the file into the Colab environment
output = "/content/file.xlsx"  # Path where the file will be saved in Colab
gdown.download(url, output, quiet=False)

Downloading...
From: https://onedrive.live.com/download?resid=B5CCCD69939F6AA3%21985&authkey=!AMd0773xIAnUL_I&em=x&app=Excel
To: /content/file.xlsx
100%|██████████| 184M/184M [00:02<00:00, 64.7MB/s]


'/content/file.xlsx'

In [5]:
# DataFrame
contents = pd.read_excel(output)
contents.head()

Unnamed: 0,id,type,title,year,duration,synopsis,genres,cast,directors,url
0,511620,Movie,Between the Shades,2017.0,82.0,Fifty conversations exploring the many differe...,,,Jill Salvino,https://www.themoviedb.org/movie/511620/
1,363857,Movie,Tentang Bulan,2006.0,,"Friendship between five companions sekampong ,...","Comedy,Drama,Family","Erin Malek,Aedy Ashraf,Nik Adruce,Fatin Afeefa...",Ahmad Idham,https://www.themoviedb.org/movie/363857/
2,459281,Movie,Tamayo Portraits,2016.0,28.0,Omnibus film about a museum,,,,https://www.themoviedb.org/movie/459281/
3,461235,Movie,The Liberator,2017.0,89.0,Martial arts fuelled adventure as Ben Silver a...,Action,"Ben Lettieri,Keith Chanter,Daniel Jordan,Jessi...",Ben Lettieri,https://www.themoviedb.org/movie/461235/
4,410323,Movie,Neruppu Da,2017.0,128.0,Guru and his friends' ambition is to become fi...,"Romance,Action","Vikram Prabhu,Nikki Galrani,Nagineedu Vellanki...",B. Ashok Kumar,https://www.themoviedb.org/movie/410323/


## 2. Initial exploration



In [6]:
print('contents dataframe:', contents.shape)

contents dataframe: (698681, 10)


In [7]:
print(contents.dtypes)

id             int64
type          object
title         object
year         float64
duration     float64
synopsis      object
genres        object
cast          object
directors     object
url           object
dtype: object


In [8]:
# Check whether the id field contains unique values
id_is_unique = contents['id'].is_unique
if id_is_unique:
    print("The 'id' field has unique values.")
else:
    print("The 'id' field has repeated values.")

The 'id' field has repeated values.


In [9]:
# Get a repeated id to inspect the records that contain it

# Build a frequency table by 'id'
id_frequency_table = contents['id'].value_counts().reset_index()
id_frequency_table.columns = ['id', 'Frequency']

# Filter the frequency table to find the first id with a frequency greater than one
first_duplicated_id = id_frequency_table[id_frequency_table['Frequency'] > 1]['id'].iloc[0]

# Show the records that contain the first repeated id
contents[contents['id'] == first_duplicated_id]

Unnamed: 0,id,type,title,year,duration,synopsis,genres,cast,directors,url
326600,120393,Series,Nice to Meet You! IZ*ONE’s First Steps in Japan,,,IZ*ONE's first japanese talk show,,,,https://www.themoviedb.org/tv/120393/
386614,120393,Movie,To Get to Heaven First You Have to Die,2006.0,95.0,Twenty-year-old Kamal has been married for a f...,Drama,"Khurshed Golibekov,Dinara Drukarova,Maruf Pulo...",Jamshed Usmonov,https://www.themoviedb.org/movie/120393/


In [10]:
# Check whether the id field contains unique values within each content type
content_types = ['Series', 'Movie']

for content_type in content_types:
    # Filter the DataFrame by the 'type' field for the current content type
    contents_by_type = contents[contents['type'] == content_type]

    # Check whether the 'id' field in the filtered DataFrame contains unique values
    id_is_unique = contents_by_type['id'].nunique() == len(contents_by_type['id'])

    if id_is_unique:
        print(f"The 'id' field has unique values for content type '{content_type}'.")
    else:
        print(f"The 'id' field has repeated values for content type '{content_type}'.")

The 'id' field has unique values for content type 'Series'.
The 'id' field has unique values for content type 'Movie'.
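
The two checks above suggest that `id` is only unique in combination with `type`. A minimal sketch of testing that composite key in a single step, using a toy DataFrame (the `toy` data is made up to mirror the repeated id shown earlier):

```python
import pandas as pd

# Toy data reproducing the situation above: id 120393 appears once per content type
toy = pd.DataFrame({
    'id': [120393, 120393, 511620],
    'type': ['Series', 'Movie', 'Movie'],
})

# 'id' alone repeats, but the ('type', 'id') pair does not
id_alone_unique = toy['id'].is_unique
pair_unique = not toy.duplicated(subset=['type', 'id']).any()
print(id_alone_unique, pair_unique)
```

Applied to `contents`, the same `duplicated(subset=['type', 'id'])` call would confirm the pair as a record key without looping over content types.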


#### **Variables**

- **id**: Unique identifier (within each content type) for each record
- **type**: Content type (Movie/Series)
- **title**: Title of the content
- **year**: Release year of the content
- **duration**: Duration of the content, in minutes
- **synopsis**: Synopsis of the content
- **genres**: Genres of the content (separated by "," when there is more than one)
- **cast**: List of actors (separated by ",")
- **directors**: Director or directors of the content (separated by ",")
- **url**: Web address of the content on TMDB
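
Because `genres` can hold several comma-separated values and the stated objective is to return the genres of a title, one possible encoding for a multi-label target is a binary indicator matrix. The sketch below uses scikit-learn's `MultiLabelBinarizer` on three toy values; it is an illustration under that assumption, not a step taken in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy values in the same format as the genres column
genres = pd.Series(['Comedy,Drama,Family', 'Action', 'Romance,Action'])

# Split each comma-separated string into a list of genre names
genre_lists = genres.str.split(',').tolist()

# One binary column per genre; classes_ holds the column order (sorted alphabetically)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genre_lists)
print(list(mlb.classes_))
print(Y)
```

Each row of `Y` marks which genres apply to that record, which is the target shape expected by multi-label wrappers such as `MultiOutputClassifier`.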

## 3. Exploration and transformation

In [11]:
# Check
contents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698681 entries, 0 to 698680
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   id         698681 non-null  int64  
 1   type       698681 non-null  object 
 2   title      698681 non-null  object 
 3   year       626862 non-null  float64
 4   duration   587950 non-null  float64
 5   synopsis   698680 non-null  object 
 6   genres     442585 non-null  object 
 7   cast       459860 non-null  object 
 8   directors  522131 non-null  object 
 9   url        698681 non-null  object 
dtypes: float64(2), int64(1), object(7)
memory usage: 53.3+ MB


In [12]:
# Specify the categorical variables
var_categoricas = [col for col in contents.columns if col in ['type', 'genres']]

# Convert the corresponding variables to categorical
contents[var_categoricas] = contents[var_categoricas].astype('category')

# Check
contents.dtypes

id              int64
type         category
title          object
year          float64
duration      float64
synopsis       object
genres       category
cast           object
directors      object
url            object
dtype: object

In [13]:
contents.isna().sum()

id                0
type              0
title             0
year          71819
duration     110731
synopsis          1
genres       256096
cast         238821
directors    176550
url               0
dtype: int64

In [14]:
# Drop the records with a null value in the synopsis field
contents.dropna(subset=['synopsis'], inplace=True)
contents.isna().sum()

id                0
type              0
title             0
year          71819
duration     110730
synopsis          0
genres       256095
cast         238820
directors    176549
url               0
dtype: int64

In [15]:
# Compute the percentage of null values for each column
porcentaje_nulos = (contents.isnull().sum() / len(contents)) * 100
print("Percentage of null values per column:")
print(porcentaje_nulos)

Percentage of null values per column:
id            0.000000
type          0.000000
title         0.000000
year         10.279241
duration     15.848457
synopsis      0.000000
genres       36.654119
cast         34.181600
directors    25.268936
url           0.000000
dtype: float64


In [16]:
contents['genres'].value_counts()

Documentary                                   90619
Drama                                         71198
Comedy                                        37317
Animation                                     22470
Music                                         14436
                                              ...  
Documentary,Comedy,History,Fantasy                1
Documentary,Comedy,Fantasy,Adventure,Music        1
Documentary,Comedy,Family,News                    1
Documentary,Comedy,Drama,Science Fiction          1
Western,War,Thriller                              1
Name: genres, Length: 9798, dtype: int64

In [17]:
# Create a column with the main genre (the first one listed)
contents['principal_genre'] = contents['genres'].str.split(',').str[0]
contents.head()

Unnamed: 0,id,type,title,year,duration,synopsis,genres,cast,directors,url,principal_genre
0,511620,Movie,Between the Shades,2017.0,82.0,Fifty conversations exploring the many differe...,,,Jill Salvino,https://www.themoviedb.org/movie/511620/,
1,363857,Movie,Tentang Bulan,2006.0,,"Friendship between five companions sekampong ,...","Comedy,Drama,Family","Erin Malek,Aedy Ashraf,Nik Adruce,Fatin Afeefa...",Ahmad Idham,https://www.themoviedb.org/movie/363857/,Comedy
2,459281,Movie,Tamayo Portraits,2016.0,28.0,Omnibus film about a museum,,,,https://www.themoviedb.org/movie/459281/,
3,461235,Movie,The Liberator,2017.0,89.0,Martial arts fuelled adventure as Ben Silver a...,Action,"Ben Lettieri,Keith Chanter,Daniel Jordan,Jessi...",Ben Lettieri,https://www.themoviedb.org/movie/461235/,Action
4,410323,Movie,Neruppu Da,2017.0,128.0,Guru and his friends' ambition is to become fi...,"Romance,Action","Vikram Prabhu,Nikki Galrani,Nagineedu Vellanki...",B. Ashok Kumar,https://www.themoviedb.org/movie/410323/,Romance


In [18]:
contents['principal_genre'].value_counts()

Drama                 107831
Documentary           104241
Comedy                 62512
Animation              34829
Horror                 20994
Music                  19310
Action                 16386
Romance                11772
Crime                   9657
Thriller                9471
Family                  5901
Science Fiction         5462
Adventure               4580
Fantasy                 4487
Western                 4445
Mystery                 4232
Reality                 4221
TV Movie                2986
History                 2480
War                     1922
Action & Adventure      1348
News                     827
Talk                     798
Sci-Fi & Fantasy         723
Soap                     575
Kids                     451
War & Politics           144
Name: principal_genre, dtype: int64

In [19]:
# Create the 'synopsis_length' column with the length of each synopsis in characters
contents['synopsis_length'] = contents['synopsis'].apply(lambda x: len(str(x)) if pd.notnull(x) else 0)
contents['synopsis_length'].describe()

count    698680.000000
mean        281.319452
std         220.742708
min           1.000000
25%         119.000000
50%         219.000000
75%         390.000000
max        9732.000000
Name: synopsis_length, dtype: float64

In [20]:
# Create the 'synopsis_length_interval' column with the length interval of each synopsis

# Define the interval edges
bins = [1, 100, 200, 300, float('inf')]

# Define the label for each interval
labels = ['Muy corta', 'Corta', 'Moderada', 'Larga']

# Assign an interval to each synopsis length
contents['synopsis_length_interval'] = pd.cut(contents['synopsis_length'], bins=bins, labels=labels, include_lowest=True)
contents.head()


Unnamed: 0,id,type,title,year,duration,synopsis,genres,cast,directors,url,principal_genre,synopsis_length,synopsis_length_interval
0,511620,Movie,Between the Shades,2017.0,82.0,Fifty conversations exploring the many differe...,,,Jill Salvino,https://www.themoviedb.org/movie/511620/,,172,Corta
1,363857,Movie,Tentang Bulan,2006.0,,"Friendship between five companions sekampong ,...","Comedy,Drama,Family","Erin Malek,Aedy Ashraf,Nik Adruce,Fatin Afeefa...",Ahmad Idham,https://www.themoviedb.org/movie/363857/,Comedy,143,Corta
2,459281,Movie,Tamayo Portraits,2016.0,28.0,Omnibus film about a museum,,,,https://www.themoviedb.org/movie/459281/,,27,Muy corta
3,461235,Movie,The Liberator,2017.0,89.0,Martial arts fuelled adventure as Ben Silver a...,Action,"Ben Lettieri,Keith Chanter,Daniel Jordan,Jessi...",Ben Lettieri,https://www.themoviedb.org/movie/461235/,Action,215,Moderada
4,410323,Movie,Neruppu Da,2017.0,128.0,Guru and his friends' ambition is to become fi...,"Romance,Action","Vikram Prabhu,Nikki Galrani,Nagineedu Vellanki...",B. Ashok Kumar,https://www.themoviedb.org/movie/410323/,Romance,127,Corta
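
To make the bin edges concrete, the following sketch applies the same `bins` and `labels` to a few toy lengths chosen to sit on and just past each edge. With the default right-closed intervals and `include_lowest=True`, the intervals are [1, 100], (100, 200], (200, 300] and (300, inf).

```python
import pandas as pd

bins = [1, 100, 200, 300, float('inf')]
labels = ['Muy corta', 'Corta', 'Moderada', 'Larga']

# Lengths chosen to sit on and just past each bin edge
lengths = pd.Series([1, 100, 101, 200, 201, 300, 301])
intervals = pd.cut(lengths, bins=bins, labels=labels, include_lowest=True)
print(intervals.tolist())
```

A length of 0 would fall outside the first interval and become missing, which does not occur here because the shortest synopsis has length 1.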


## 4. Splitting the dataframes

In [21]:
# Drop the columns that will not be used
contents = contents.drop(['id', 'title', 'genres', 'url'], axis=1)

In [22]:
# Records without genres (the problem to solve)
contents_target = contents[contents['principal_genre'].isnull()]

# Records with genres, used for modeling
contents = contents.drop(contents_target.index)

In [23]:
# Contingency table (counts) between principal_genre and synopsis_length_interval in contents
pd.crosstab(contents['principal_genre'], contents['synopsis_length_interval'])

synopsis_length_interval,Muy corta,Corta,Moderada,Larga
principal_genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Action,2481,4842,3026,6037
Action & Adventure,131,283,253,681
Adventure,821,1387,873,1499
Animation,9314,10204,6069,9242
Comedy,12119,18854,11751,19788
Crime,1781,3040,1880,2956
Documentary,15359,24033,19128,45721
Drama,16405,32215,22020,37191
Family,879,1639,1114,2269
Fantasy,891,1314,843,1439


In [24]:
# Split the DataFrame into 70% train and 30% test, stratified by "principal_genre" and "synopsis_length_interval"
contents_train, contents_test = train_test_split(contents, test_size=0.3, stratify=contents[['principal_genre', 'synopsis_length_interval']], random_state=2023)

In [25]:
# Contingency table (row %) between principal_genre and synopsis_length_interval in contents_train
pd.crosstab(contents_train['principal_genre'], contents_train['synopsis_length_interval'], normalize='index') * 100

synopsis_length_interval,Muy corta,Corta,Moderada,Larga
principal_genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Action,15.143854,29.546643,18.465562,36.843941
Action & Adventure,9.745763,20.974576,18.75,50.529661
Adventure,17.935122,30.286962,19.058016,32.7199
Animation,26.743232,29.298605,17.424118,26.534044
Comedy,19.385726,30.160653,18.798419,31.655202
Crime,18.446746,31.47929,19.467456,30.606509
Documentary,14.733654,23.054996,18.350258,43.861092
Drama,15.214223,29.874672,20.420763,34.490342
Family,14.891041,27.772397,18.886199,38.450363
Fantasy,19.866285,29.290035,18.783827,32.059854


In [26]:
# Contingency table (row %) between principal_genre and synopsis_length_interval in contents_test
pd.crosstab(contents_test['principal_genre'], contents_test['synopsis_length_interval'], normalize='index') * 100

synopsis_length_interval,Muy corta,Corta,Moderada,Larga
principal_genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Action,15.134255,29.55655,18.470301,36.838893
Action & Adventure,9.653465,21.039604,18.811881,50.49505
Adventure,17.90393,30.276565,19.068413,32.751092
Animation,26.739401,29.294669,17.427505,26.538425
Comedy,19.388898,30.160508,18.796992,31.653602
Crime,18.432862,31.480842,19.468416,30.617881
Documentary,14.735226,23.055769,18.348683,43.860322
Drama,15.212217,29.877276,20.421033,34.489474
Family,14.906832,27.780915,18.859401,38.452851
Fantasy,19.836553,29.271917,18.796434,32.095097
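
The train and test tables above can also be compared numerically instead of by eye. The sketch below repeats the two-column stratified split on synthetic data and reports the largest absolute difference between the two row-percentage tables. The synthetic `toy` frame and the helper `row_percentages` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled data (column names follow the notebook)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'principal_genre': rng.choice(['Drama', 'Comedy', 'Action'], size=3000, p=[0.5, 0.3, 0.2]),
    'synopsis_length_interval': rng.choice(['Muy corta', 'Corta', 'Moderada', 'Larga'], size=3000),
})

train, test = train_test_split(
    toy, test_size=0.3, random_state=2023,
    stratify=toy[['principal_genre', 'synopsis_length_interval']],
)

def row_percentages(df):
    """Row-normalized contingency table, in percent."""
    return pd.crosstab(df['principal_genre'], df['synopsis_length_interval'], normalize='index') * 100

# Largest absolute gap, in percentage points, between the train and test tables
max_gap = (row_percentages(train) - row_percentages(test)).abs().to_numpy().max()
print(f'Largest train/test gap: {max_gap:.2f} percentage points')
```

A small gap indicates that both subsets keep the joint distribution of genre and synopsis length, which addresses the second setback listed in the introduction.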


## 5. Baseline Model
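
This section is empty in the notebook. As one possible reference point, and only as an assumption about what a baseline could look like here, a majority-class predictor sets the floor that any trained model should beat. In the labeled data, Drama is the most frequent main genre with 107,831 of 442,585 records, so always predicting Drama would be right about 24.4% of the time on that set. The sketch uses scikit-learn's `DummyClassifier` on toy labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels with Drama as the most frequent class (illustrative, not the real data)
y_train = np.array(['Drama'] * 5 + ['Comedy'] * 3 + ['Action'] * 2)
y_test = np.array(['Drama', 'Comedy', 'Drama', 'Action'])

# DummyClassifier ignores the features, so a placeholder column is enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent training label
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f'Baseline accuracy: {baseline_accuracy:.2f}')
```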

## 6. Andy's Model

In [62]:
# Sample the dataset
data = contents[['synopsis', 'principal_genre']].sample(10000, random_state=45)
data.head()

Unnamed: 0,synopsis,principal_genre
130614,"Slammin’ Sammy Menacker is killed in the ring,...",Action
268585,"While staying in the forest, a group of colleg...",Horror
225763,Stand up comedian Javier Guzman talks you thro...,Comedy
438292,Anna is working at a parisian advertising agen...,Comedy
358613,It started as fantasy...and ended as intimate ...,Drama


In [63]:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import SVC

# Text preprocessing
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['synopsis'])
y = data['principal_genre']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate models (instantiated but not trained in this cell)
model1 = RandomForestClassifier()
model2 = GradientBoostingClassifier()
model3 = SVC()

# Train the classifier: MultiOutputClassifier requires a base estimator,
# so logistic regression is wrapped here
classifier = MultiOutputClassifier(LogisticRegression())
classifier.fit(X_train, y_train.to_frame())

# Prediction (flattened from shape (n, 1) to (n,) to match y_test)
y_pred = classifier.predict(X_test).ravel()

# Model evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy: {accuracy:.2f}')

Model accuracy: 0.47
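
The introduction asks for an odd number of models combined by majority vote, and the cell above instantiates three candidates. A minimal sketch of that idea, assuming scikit-learn's `VotingClassifier` with hard voting, is shown below. Wrapping the vectorizer in a `Pipeline` also means TF-IDF is fitted on the training texts only. The toy corpus, the labels, and the reduced `n_estimators` values are made up for illustration.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny illustrative corpus (made up for this sketch)
texts = [
    'a detective hunts a killer in the city', 'a killer stalks the detective at night',
    'two friends fall in love in paris', 'a love story between two strangers',
    'a detective and a killer play a deadly game', 'strangers fall in love on a train',
]
labels = ['Crime', 'Crime', 'Romance', 'Romance', 'Crime', 'Romance']

# Hard voting: each model casts one vote and the most voted label wins
voting_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('vote', VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
            ('gb', GradientBoostingClassifier(n_estimators=20, random_state=0)),
            ('svc', SVC(kernel='linear')),
        ],
        voting='hard',
    )),
])

voting_pipeline.fit(texts, labels)
predictions = voting_pipeline.predict(['the detective finds the killer', 'they fall in love'])
print(predictions)
```

An odd number of voters rules out ties when only two labels compete. With many genres a three-way split is still possible, in which case `VotingClassifier` picks the tied label that comes first in ascending sort order.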
