# Artificial Intelligence for Sciences and Engineering
## Project: Movie Recommendation System
### Group members
* Fonsy Johan Mercado Agudelo, CC 1020472932, Electrical Engineering
* Orlando José Salazar Polo, CC 1152714311, Electrical Engineering
* Angie Dayana Rincón Mandón, CC 1091681348, Electrical Engineering

### Data exploration


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import linear_model
from wordcloud import WordCloud

### The Movies Dataset
This project uses [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset), available on Kaggle. Its files contain metadata for 45,000 movies listed in the full MovieLens dataset, along with 26 million ratings from 270,000 users for those movies. Ratings are on a scale of 1 to 5 and were obtained from the official GroupLens website.

The dataset totals 900 MB and includes the following .csv files:

- **movies_metadata.csv**: The main movie metadata file. It contains information on 45,000 movies, including posters, backdrops, budget, revenue, release dates, languages, production countries, and companies.

- **keywords.csv**: Contains the plot keywords for each movie as a stringified JSON object.

- **credits.csv**: Contains cast and crew information for each movie as a stringified JSON object.

- **links.csv**: Contains the TMDB and IMDB IDs of the movies featured in the MovieLens dataset.

- **links_small.csv**: Contains the TMDB and IMDB IDs of a subset of 9,000 movies from the full dataset.

- **ratings_small.csv**: A subset of 100,000 ratings from 700 users on 9,000 movies.
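
The counts quoted for a ratings file can be checked with a few pandas aggregations. The following is only a minimal sketch on a tiny in-memory sample: the column names `userId`, `movieId`, and `rating` are assumed to match the file's header, and the rows are made up for illustration.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for ratings_small.csv (assumed column names).
sample_csv = io.StringIO(
    "userId,movieId,rating\n"
    "1,31,2.5\n"
    "1,1029,3.0\n"
    "2,31,4.0\n"
    "3,1061,5.0\n"
)
ratings = pd.read_csv(sample_csv)

# Number of ratings, distinct users, distinct movies, and the rating range.
summary = {
    "n_ratings": len(ratings),
    "n_users": ratings["userId"].nunique(),
    "n_movies": ratings["movieId"].nunique(),
    "min_rating": ratings["rating"].min(),
    "max_rating": ratings["rating"].max(),
}
print(summary)
```

Replacing the in-memory sample with the path to the real file should give the figures to compare against the description above.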

In [3]:
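# Assumes the Kaggle archive was extracted to ./the-movies-dataset/
credits = pd.read_csv('the-movies-dataset/credits.csv')
keywords = pd.read_csv('the-movies-dataset/keywords.csv')

# Columns that are not used in the rest of the analysis
unused_columns = ['belongs_to_collection', 'homepage', 'imdb_id', 'poster_path',
                  'status', 'title', 'video']
movies = pd.read_csv('the-movies-dataset/movies_metadata.csv', low_memory=False)
movies = movies.drop(columns=unused_columns)

# Non-numeric ids become NaN so that those rows can be dropped
movies['id'] = pd.to_numeric(movies['id'], errors='coerce')
# Note: this drops every row with a missing value in any column, not only 'id'
movies.dropna(inplace=True)
movies['id'] = movies['id'].astype('int64')

# Inner joins: keep only movies that have both keywords and credits
df = movies.merge(keywords, on='id').merge(credits, on='id')

# Fill remaining gaps, if any, with neutral defaults
df['original_language'] = df['original_language'].fillna('')
df['runtime'] = df['runtime'].fillna(0)
df['tagline'] = df['tagline'].fillna('')

df.dropna(inplace=True)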
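
The cleaning and merging steps in this cell can be illustrated on toy data. This is a sketch under stated assumptions, not the notebook's actual data: the frames `movies_toy` and `keywords_toy` and all of their rows are made up, and only the `id` column is cleaned here.

```python
import pandas as pd

# Toy stand-ins for movies_metadata.csv and keywords.csv (made-up rows).
movies_toy = pd.DataFrame({
    "id": ["862", "not-an-id", "8844"],  # one malformed id, stored as text
    "original_title": ["Toy Story", "Broken row", "Jumanji"],
})
keywords_toy = pd.DataFrame({
    "id": [862, 8844, 999],
    "keywords": ["[]", "[]", "[]"],
})

# Non-numeric ids become NaN, the affected rows are dropped,
# and the remaining ids are cast to int64.
movies_toy["id"] = pd.to_numeric(movies_toy["id"], errors="coerce")
movies_toy = movies_toy.dropna(subset=["id"]).astype({"id": "int64"})

# merge defaults to an inner join: only ids present in both frames survive.
merged = movies_toy.merge(keywords_toy, on="id")
print(merged)
```

The malformed row and the keyword entry with no matching movie both disappear from `merged`, which is the same effect the inner joins have on the full dataset.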

In [8]:
df.sample(2)

| | adult | budget | genres | id | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | tagline | vote_average | vote_count | keywords | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15112 | False | 1000 | [{'id': 18, 'name': 'Drama'}, {'id': 14, 'name... | 286805 | en | Dark Dungeons | College freshmen Debbie and Marcie are excited... | 0.055526 | [] | [{'iso_3166_1': 'US', 'name': 'United States o... | 2014-08-15 | 0.0 | 40.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Role Playing Games - Not on YOUR life! | 1.0 | 1.0 | [{'id': 9755, 'name': 'parody'}, {'id': 10183,... | [{'cast_id': 0, 'character': 'Debbie', 'credit... | [{'credit_id': '585599f09251416fa1043320', 'de... |
| 14359 | False | 1300000 | [{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam... | 53459 | en | F | A group of teachers must defend themselves fro... | 2.299942 | [{'name': 'Gatlin Pictures', 'id': 4800}, {'na... | [{'iso_3166_1': 'GB', 'name': 'United Kingdom'}] | 2010-09-07 | 0.0 | 79.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Welcome to the school of hard knocks | 5.8 | 32.0 | [{'id': 6270, 'name': 'high school'}, {'id': 1... | [{'cast_id': 1, 'character': 'Robert Anderson'... | [{'credit_id': '5324302e92514113b9000475', 'de... |


In [9]:
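import ast

def get_text(text, obj='name'):
    """Join the `obj` field of every dict in a stringified list of dicts."""
    # literal_eval only parses Python literals; eval would execute arbitrary code
    items = ast.literal_eval(text)
    # A single item yields just its value; an empty list yields ''
    return ', '.join(item[obj] for item in items)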
    
df['genres'] = df['genres'].apply(get_text)
df['production_companies'] = df['production_companies'].apply(get_text)
df['production_countries'] = df['production_countries'].apply(get_text)
df['crew'] = df['crew'].apply(get_text)
df['spoken_languages'] = df['spoken_languages'].apply(get_text)
df['keywords'] = df['keywords'].apply(get_text)

df['characters'] = df['cast'].apply(get_text, obj='character')
df['actors'] = df['cast'].apply(get_text)

df['release_date'] = pd.to_datetime(df['release_date'])
df['budget'] = df['budget'].astype('float64')
df['popularity'] = df['popularity'].astype('float64')

df = df.drop(columns='cast')
# Keep only the first row for each original_title, then renumber the index
df = df.drop_duplicates(subset='original_title').reset_index(drop=True)
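
Columns such as `genres` and `keywords` arrive as stringified lists of dictionaries, and this cell flattens them into comma-separated text. One way to parse such a cell without executing its contents is `ast.literal_eval`. The sketch below uses a hypothetical helper, `join_field`, and made-up cell values to show the single-item, multi-item, and empty cases.

```python
import ast

def join_field(raw, key="name"):
    """Hypothetical helper: parse a stringified list of dicts and join one field."""
    # literal_eval accepts Python literals only, so it cannot run arbitrary code.
    records = ast.literal_eval(raw)
    return ", ".join(str(item[key]) for item in records)

# Made-up cell values in the same shape as the `genres` column.
one = join_field("[{'id': 18, 'name': 'Drama'}]")
many = join_field("[{'id': 18, 'name': 'Drama'}, {'id': 14, 'name': 'Fantasy'}]")
empty = join_field("[]")

print(repr(one), repr(many), repr(empty))
```

An empty list maps to an empty string, so movies with no entries in a column still produce a valid text value rather than a missing one.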

In [13]:
df.head()

| | adult | budget | genres | id | original_language | original_title | overview | popularity | production_companies | production_countries | ... | revenue | runtime | spoken_languages | tagline | vote_average | vote_count | keywords | crew | characters | actors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 65000000.0 | Adventure, Fantasy, Family | 8844 | en | Jumanji | When siblings Judy and Peter discover an encha... | 17.015539 | TriStar Pictures, Teitler Film, Interscope Com... | United States of America | ... | 262797249.0 | 104.0 | English, Français | Roll the dice and unleash the excitement! | 6.9 | 2413.0 | board game, disappearance, based on children's... | Larry J. Franco, Jonathan Hensleigh, James Hor... | Alan Parrish, Samuel Alan Parrish / Van Pelt, ... | Robin Williams, Jonathan Hyde, Kirsten Dunst, ... |
| 1 | False | 0.0 | Romance, Comedy | 15602 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | 11.7129 | Warner Bros., Lancaster Gate | United States of America | ... | 0.0 | 101.0 | English | Still Yelling. Still Fighting. Still Ready for... | 6.5 | 92.0 | fishing, best friend, duringcreditsstinger, ol... | Howard Deutch, Mark Steven Johnson, Mark Steve... | Max Goldman, John Gustafson, Ariel Gustafson, ... | Walter Matthau, Jack Lemmon, Ann-Margret, Soph... |
| 2 | False | 16000000.0 | Comedy, Drama, Romance | 31357 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | 3.859495 | Twentieth Century Fox Film Corporation | United States of America | ... | 81452156.0 | 127.0 | English | Friends are the people who let you be yourself... | 6.1 | 34.0 | based on novel, interracial relationship, sing... | Forest Whitaker, Ronald Bass, Ronald Bass, Ezr... | Savannah 'Vannah' Jackson, Bernadine 'Bernie' ... | Whitney Houston, Angela Bassett, Loretta Devin... |
| 3 | False | 0.0 | Comedy | 11862 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | 8.387519 | Sandollar Productions, Touchstone Pictures | United States of America | ... | 76578911.0 | 106.0 | English | Just When His World Is Back To Normal... He's ... | 5.7 | 173.0 | baby, midlife crisis, confidence, aging, daugh... | Alan Silvestri, Elliot Davis, Nancy Meyers, Na... | George Banks, Nina Banks, Franck Eggelhoffer, ... | Steve Martin, Diane Keaton, Martin Short, Kimb... |
| 4 | False | 60000000.0 | Action, Crime, Drama, Thriller | 949 | en | Heat | Obsessive master thief, Neil McCauley leads a ... | 17.924927 | Regency Enterprises, Forward Pass, Warner Bros. | United States of America | ... | 187436818.0 | 170.0 | English, Español | A Los Angeles Crime Saga | 7.7 | 1886.0 | robbery, detective, bank, obsession, chase, sh... | Michael Mann, Michael Mann, Art Linson, Michae... | Lt. Vincent Hanna, Neil McCauley, Chris Shiher... | Al Pacino, Robert De Niro, Val Kilmer, Jon Voi... |
