# Lab 2
## Universidad del Valle de Guatemala <br> Facultad de Ingeniería
#### Departamento de Ciencias de la Computación <br> Deep Learning and Intelligent Systems - Section 20
#### Group 12  
Cristian Laynez, Jeyner Arango

### Network Objective

### Network Implementation

In [63]:
# Packages used
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import layers, regularizers

# Load the dataset
dataset = pd.read_csv('movie_statistic_dataset.csv')

# Drop rows with missing values in the target column
dataset = dataset.dropna(subset=['Worldwide gross $'])

# Use 'movie_title' as the index
dataset = dataset.set_index('movie_title')

# runtime_minutes: cast to integer
dataset['runtime_minutes'] = dataset['runtime_minutes'].astype(int)

# director_name: mark missing directors ('-') as NaN (Not a Number)
dataset['director_name'] = dataset['director_name'].replace('-', np.nan)

# production_date: YYYY-MM-DD
# Because the field is a date, split it into separate year, month and day
# features so the network can pick up temporal patterns more easily.
production_date = pd.to_datetime(dataset['production_date'])
dataset['production_year'] = production_date.dt.year
dataset['production_month'] = production_date.dt.month
dataset['production_day'] = production_date.dt.day

# Drop the original 'production_date' column
dataset = dataset.drop(columns=['production_date'])

# genres: several genres separated by commas
# One-hot encode them into one binary column per genre;
# the placeholder '\N' (no genre listed) becomes an all-zero row.
dataset['genres'] = dataset['genres'].replace(r'\N', '')
genres_list = dataset['genres'].str.get_dummies(sep=',')
dataset = pd.concat([dataset, genres_list], axis=1)

# Drop the original 'genres' column
dataset = dataset.drop(columns=['genres'])

# director_professions: several professions separated by commas
# Mark missing professions ('-') as NaN (Not a Number)
dataset['director_professions'] = dataset['director_professions'].replace('-', np.nan)
# As with genres, one-hot encode the professions into one binary column
# per profession.
professions_list = dataset['director_professions'].str.get_dummies(sep=',')
dataset = pd.concat([dataset, professions_list], axis=1)

# Drop the original 'director_professions' column
dataset = dataset.drop(columns=['director_professions'])

# director_birthYear: missing values appear as '-' or '\N'
# Encode missing birth years with the sentinel -1 (not NaN) so the
# column can be cast to int.
dataset['director_birthYear'] = (
    dataset['director_birthYear']
    .replace(r'\N', '-1')
    .replace('-', -1)
    .astype(int)
)

# director_deathYear: '-' when missing and 'alive' when the director has not died
# Encode both cases with the sentinel -1 (not NaN), then cast to int.
dataset['director_deathYear'] = (
    dataset['director_deathYear']
    .replace('alive', -1)
    .replace('-', -1)
    .astype(int)
)

# director_name
# Drop the director's name for now
dataset.drop(columns=['director_name'], inplace=True)
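
The multi-label one-hot encoding applied above to `genres` and `director_professions` can be checked on a small example. The sketch below uses a made-up `toy` frame (not rows from the real dataset) to show what `str.get_dummies(sep=',')` produces, including the all-zero row that results once the `\N` placeholder is replaced by an empty string.

```python
import pandas as pd

# Hypothetical toy data: four titles with comma-separated genres.
toy = pd.DataFrame(
    {"genres": ["Action,Adventure", "Drama", r"\N", "Action,Drama"]},
    index=["A", "B", "C", "D"],
)

# Same two steps as in the notebook: blank out the placeholder, then encode.
toy["genres"] = toy["genres"].replace(r"\N", "")
genre_flags = toy["genres"].str.get_dummies(sep=",")

# One binary column per distinct genre, in alphabetical order.
print(genre_flags)
```

Each row can set several flags at once, which is why this is preferable to a single categorical code for fields that hold more than one label.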

In [64]:
dataset.columns

Index(['runtime_minutes', 'director_birthYear', 'director_deathYear',
       'movie_averageRating', 'movie_numerOfVotes', 'approval_Index',
       'Production budget $', 'Domestic gross $', 'Worldwide gross $',
       'production_year', 'production_month', 'production_day', 'Action',
       'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary',
       'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror', 'Music',
       'Musical', 'Mystery', 'News', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
       'War', 'Western', 'actor', 'actress', 'animation_department',
       'art_department', 'art_director', 'assistant_director',
       'camera_department', 'casting_department', 'casting_director',
       'cinematographer', 'composer', 'costume_designer', 'director', 'editor',
       'editorial_department', 'executive', 'location_management',
       'make_up_department', 'miscellaneous', 'music_artist',
       'music_department', 'producer', 'production_designer',
   

In [65]:
dataset.dtypes.values

array([dtype('int64'), dtype('int64'), dtype('int64'), dtype('float64'),
       dtype('float64'), dtype('float64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),


In [66]:
dataset.iloc[0,:]

runtime_minutes                 192.0
director_birthYear             1954.0
director_deathYear               -1.0
movie_averageRating               7.8
movie_numerOfVotes           277543.0
                               ...   
special_effects                   0.0
stunts                            0.0
transportation_department         0.0
visual_effects                    0.0
writer                            1.0
Name: Avatar: The Way of Water, Length: 67, dtype: float64

In [67]:
# Extract the features and the target
y = dataset['Worldwide gross $']
X = dataset.drop(columns=['Worldwide gross $'])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify the numeric columns to standardize (non-numeric ones are excluded)
numeric_columns = X_train.select_dtypes(include=['float64', 'int64','int32']).columns

# Standardize the numeric input features
scaler = StandardScaler()
X_train[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])
X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])
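
The cell above fits the scaler on the training split only and reuses those statistics for the test split. A minimal sketch of why that matters, using synthetic values (the `toy_` names and the numbers are assumptions for illustration, not data from the movie dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two of the movie features (values are made up).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'runtime_minutes': rng.integers(80, 200, size=50),
    'Production budget $': rng.integers(50_000, 400_000_000, size=50),
})
toy_train, toy_test = train_test_split(toy_X, test_size=0.2, random_state=42)

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from train only
test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
print(test_scaled.mean(axis=0))   # generally not exactly 0
```

Calling `fit_transform` on the test split instead would leak test-set statistics into preprocessing, so the test metrics would no longer reflect unseen data.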

#### Preparar modelos

In [68]:
# 1. Neural network with sigmoid activation and L1 regularization:
def model1(p_X_train, p_y_train, p_X_test, p_y_test) -> None:
    model_1 = tf.keras.Sequential([
        layers.Dense(64, activation='sigmoid', input_shape=(p_X_train.shape[1],)),
        layers.Dense(32, activation='sigmoid'),
        layers.Dense(16, activation='sigmoid'),
        layers.Dense(1)
    ])

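    # NOTE: the intent is L1 regularization on the hidden layers, but this call
    # appends a new 16-unit sigmoid layer (with L1) after Dense(1), so the model
    # ends in 16 sigmoid outputs in (0, 1) and the layers above carry no L1 penalty.
    model_1.add(layers.Dense(16, activation='sigmoid', kernel_regularizer=regularizers.l1(0.01)))

    # Compile the model
    model_1.compile(optimizer='adam', loss='mse', metrics=['mae'])

    # Train the model (the test split is also used as validation data)
    model_1.fit(p_X_train, p_y_train, epochs=20, batch_size=30, validation_data=(p_X_test, p_y_test))

    # Evaluate on the test split
    loss, mae = model_1.evaluate(p_X_test, p_y_test)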

    print(f'MAE: {mae:.2f}\nMSE: {loss:.2f}')

# 2. Neural network with ReLU activation and Dropout regularization:
def model2(p_X_train, p_y_train, p_X_test, p_y_test) -> None:
    model_2 = tf.keras.Sequential([
        layers.Dense(128, activation='relu', input_shape=(p_X_train.shape[1],)),
        layers.Dropout(0.2),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(1)
    ])

    # Compile the model
    model_2.compile(optimizer='adam', loss='mse', metrics=['mae'])

    # Train the model (the test split is also used as validation data)
    model_2.fit(p_X_train, p_y_train, epochs=20, batch_size=30, validation_data=(p_X_test, p_y_test))

    # Evaluate on the test split
    loss, mae = model_2.evaluate(p_X_test, p_y_test)

    print(f'MAE: {mae:.2f}\nMSE: {loss:.2f}')

# 3. Neural network with hyperbolic tangent (tanh) activation and L2 regularization:
def model3(p_X_train, p_y_train, p_X_test, p_y_test) -> None:
    model_3 = tf.keras.Sequential([
        layers.Dense(256, activation='tanh', kernel_regularizer=regularizers.l2(0.01), input_shape=(p_X_train.shape[1],)),
        layers.Dense(128, activation='tanh', kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(64, activation='tanh', kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(32, activation='tanh', kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(1)
    ])

    # Compile the model
    model_3.compile(optimizer='adam', loss='mse', metrics=['mae'])

    # Train the model (the test split is also used as validation data)
    model_3.fit(p_X_train, p_y_train, epochs=20, batch_size=30, validation_data=(p_X_test, p_y_test))

    # Evaluate on the test split
    loss, mae = model_3.evaluate(p_X_test, p_y_test)

    print(f'MAE: {mae:.2f}\nMSE: {loss:.2f}')
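
The comment in `model1` describes L1 regularization on all hidden layers, yet the `model_1.add(...)` call places an extra sigmoid layer after the output layer. The sketch below shows one way the stated intent could be expressed; it is not the code that produced the results reported in this notebook. The helper name `build_model1_l1`, the `l1_factor` parameter and the feature count of 66 are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers


def build_model1_l1(n_features: int, l1_factor: float = 0.01) -> tf.keras.Model:
    """Sigmoid network with an L1 penalty on every hidden layer (sketch)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        layers.Dense(64, activation='sigmoid',
                     kernel_regularizer=regularizers.l1(l1_factor)),
        layers.Dense(32, activation='sigmoid',
                     kernel_regularizer=regularizers.l1(l1_factor)),
        layers.Dense(16, activation='sigmoid',
                     kernel_regularizer=regularizers.l1(l1_factor)),
        layers.Dense(1),  # single linear output unit for regression
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model


# Hypothetical feature count: 67 columns minus the target.
model = build_model1_l1(n_features=66)
print(model.output_shape)
```

Keeping `Dense(1)` as the last layer leaves the output unbounded, which a regression target measured in dollars needs.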

In [69]:
model1(X_train, y_train, X_test, y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
MAE: 127529592.00
MSE: 61200135756972032.00


In [70]:

model2(X_train, y_train, X_test, y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
MAE: 80738824.00
MSE: 25883218829901824.00


In [71]:
model3(X_train, y_train, X_test, y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
MAE: 127529512.00
MSE: 61200105692200960.00


#### Next, a few variables are removed and the models are run again

As can be seen, the target being analyzed is worldwide gross revenue. Some variables contribute little to predicting that target.

In [72]:
# Work on a copy so the original dataset stays intact
data_frame = dataset.copy()
data_frame

| `movie_title` | `runtime_minutes` | `director_birthYear` | `director_deathYear` | `movie_averageRating` | `movie_numerOfVotes` | `approval_Index` | `Production budget $` | `Domestic gross $` | `Worldwide gross $` | `production_year` | ... | `production_designer` | `production_manager` | `script_department` | `sound_department` | `soundtrack` | `special_effects` | `stunts` | `transportation_department` | `visual_effects` | `writer` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Avatar: The Way of Water | 192 | 1954 | -1 | 7.8 | 277543.0 | 7.061101 | 460000000 | 667830256 | 2265935552 | 2022 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Avengers: Endgame | 181 | -1 | -1 | 8.4 | 1143642.0 | 8.489533 | 400000000 | 858373000 | 2794731755 | 2019 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pirates of the Caribbean: On Stranger Tides | 137 | 1960 | -1 | 6.6 | 533763.0 | 6.272064 | 379000000 | 241071802 | 1045713802 | 2011 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Avengers: Age of Ultron | 141 | 1964 | -1 | 7.3 | 870573.0 | 7.214013 | 365000000 | 459005868 | 1395316979 | 2015 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Avengers: Infinity War | 149 | -1 | -1 | 8.4 | 1091968.0 | 8.460958 | 300000000 | 678815482 | 2048359754 | 2018 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Super Size Me | 100 | 1970 | -1 | 7.2 | 110078.0 | 6.017902 | 65000 | 11529368 | 22233808 | 2004 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| The Brothers McMullen | 98 | 1968 | -1 | 6.6 | 7986.0 | 4.231464 | 50000 | 10426506 | 10426506 | 1995 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Gabriela | 93 | 1973 | -1 | 4.9 | 1593.0 | 2.526405 | 50000 | 2335352 | 2335352 | 2001 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Tiny Furniture | 98 | 1986 | -1 | 6.2 | 14595.0 | 4.242085 | 50000 | 391674 | 424149 | 2010 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


In [73]:
# Remove variables that are irrelevant or carry too little information

def delete_elements(elements: list) -> None:
    """Drop the given columns from data_frame in place."""
    for e in elements:
        data_frame.pop(e)

# These are not considered relevant to the objective
no_relevant = ['director_birthYear', 'director_deathYear', 'approval_Index', 'miscellaneous']
delete_elements(no_relevant)

# Remove variables with low variability:
# these director-profession flags contain only 0 and 1
repeated = [
    'production_designer', 'production_manager', 'script_department', 'sound_department', 'soundtrack',
    'special_effects', 'stunts', 'transportation_department', 'visual_effects', 'writer',
    'costume_designer', 'director', 'editor', 'editorial_department', 'executive', 'location_management',
    'make_up_department', 'music_artist', 'music_department', 'producer'
]
delete_elements(repeated)
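
The removal above is motivated by low variability, but every one-hot column holds only 0 and 1, so that property alone does not separate useful flags from useless ones. One way to support the choice is to measure how often each flag is set. The sketch below does this on made-up data; `toy_flags` and its values are assumptions, not figures from the dataset.

```python
import pandas as pd

# Hypothetical binary flags for six movies.
toy_flags = pd.DataFrame({
    'writer':   [1, 0, 1, 1, 0, 1],
    'stunts':   [0, 0, 0, 0, 0, 0],
    'producer': [1, 1, 0, 1, 1, 1],
})

# Share of rows where each flag is 1, and number of distinct values per column.
summary = pd.DataFrame({
    'share_of_ones': toy_flags.mean(),
    'n_unique': toy_flags.nunique(),
})

# Columns with a single distinct value carry no information for the model.
constant_columns = summary.index[summary['n_unique'] == 1].tolist()

print(summary)
print(constant_columns)
```

Columns that are constant, or set in only a handful of rows, are the natural candidates for removal under this criterion.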

In [74]:
data_frame

| `movie_title` | `runtime_minutes` | `movie_averageRating` | `movie_numerOfVotes` | `Production budget $` | `Domestic gross $` | `Worldwide gross $` | `production_year` | `production_month` | `production_day` | `Action` | ... | `actress` | `animation_department` | `art_department` | `art_director` | `assistant_director` | `camera_department` | `casting_department` | `casting_director` | `cinematographer` | `composer` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Avatar: The Way of Water | 192 | 7.8 | 277543.0 | 460000000 | 667830256 | 2265935552 | 2022 | 12 | 9 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Avengers: Endgame | 181 | 8.4 | 1143642.0 | 400000000 | 858373000 | 2794731755 | 2019 | 4 | 23 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pirates of the Caribbean: On Stranger Tides | 137 | 6.6 | 533763.0 | 379000000 | 241071802 | 1045713802 | 2011 | 5 | 20 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Avengers: Age of Ultron | 141 | 7.3 | 870573.0 | 365000000 | 459005868 | 1395316979 | 2015 | 4 | 22 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Avengers: Infinity War | 149 | 8.4 | 1091968.0 | 300000000 | 678815482 | 2048359754 | 2018 | 4 | 25 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Super Size Me | 100 | 7.2 | 110078.0 | 65000 | 11529368 | 22233808 | 2004 | 5 | 7 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| The Brothers McMullen | 98 | 6.6 | 7986.0 | 50000 | 10426506 | 10426506 | 1995 | 8 | 9 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gabriela | 93 | 4.9 | 1593.0 | 50000 | 2335352 | 2335352 | 2001 | 3 | 16 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tiny Furniture | 98 | 6.2 | 14595.0 | 50000 | 391674 | 424149 | 2010 | 11 | 12 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


The number of variables went from 67 to 43.

In [75]:
# Repeat the same split-and-scale process on the reduced data
y = data_frame['Worldwide gross $']
X = data_frame.drop(columns=['Worldwide gross $'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_columns = X_train.select_dtypes(include=['float64', 'int64','int32']).columns

scaler = StandardScaler()
X_train[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])
X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])

In [76]:
model1(X_train, y_train, X_test, y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
MAE: 127529592.00
MSE: 61200135756972032.00


In [77]:
model2(X_train, y_train, X_test, y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
MAE: 88822440.00
MSE: 32053559973380096.00


In [78]:
model3(X_train, y_train, X_test, y_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
MAE: 127529512.00
MSE: 61200105692200960.00


### Composition and Results Obtained

#### Results of the first models, using all variables:

| Model | MAE (USD) | MSE (USD²) |
| --- | --- | --- |
| 1 | 127529592.00 | 61200135756972032.00 |
| 2 | 80738824.00 | 25883218829901824.00 |
| 3 | 127529512.00 | 61200105692200960.00 |

#### Results of the models with variables removed:

| Model | MAE (USD) | MSE (USD²) |
| --- | --- | --- |
| 1 | 127529592.00 | 61200135756972032.00 |
| 2 | 88822440.00 | 32053559973380096.00 |
| 3 | 127529512.00 | 61200105692200960.00 |
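
Because MSE is expressed in squared dollars, the values in the tables are hard to read directly. Taking the square root (RMSE) brings them back to dollars. The short sketch below only reuses the MSE figures reported above; the dictionary names are assumptions for illustration.

```python
import math

# MSE values copied from the two results tables above.
mse_all_variables = {
    1: 61200135756972032.00,
    2: 25883218829901824.00,
    3: 61200105692200960.00,
}
mse_reduced_variables = {
    1: 61200135756972032.00,
    2: 32053559973380096.00,
    3: 61200105692200960.00,
}

# RMSE = sqrt(MSE), expressed in the same unit as the target (USD).
rmse_all = {m: math.sqrt(v) for m, v in mse_all_variables.items()}
rmse_reduced = {m: math.sqrt(v) for m, v in mse_reduced_variables.items()}

for m in (1, 2, 3):
    print(f'Model {m}: RMSE all = {rmse_all[m] / 1e6:.1f}M USD, '
          f'reduced = {rmse_reduced[m] / 1e6:.1f}M USD')
```

On this scale, models 1 and 3 sit near 247 million USD, while model 2 is near 161 million USD with all variables and near 179 million USD with the reduced set.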

### Conceptual and performance differences in the composition and results of each neural network

As can be seen, the first models analyzed, which use all variables, have very high MAE and MSE values. This indicates that the networks have serious difficulty fitting the data and making accurate predictions.

In both scenarios, models 1 and 3 each reproduced exactly the same results, and the two differ from each other only marginally. The only model that changed is model 2: after the variables were removed, its MAE rose from 80,738,824 to 88,822,440 and its MSE from about 2.59e16 to about 3.21e16, so in this run the reduced feature set performed noticeably worse than the full one.
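
One possible explanation for models 1 and 3 landing on nearly the same error, offered as a hypothesis rather than something the notebook verifies: `model1` ends in a sigmoid layer, so its predictions lie in (0, 1), and `model3` uses bounded tanh activations with an unscaled target, so its output may stay very small compared with revenues in the hundreds of millions. In both cases the MAE would be close to the mean of the target. The sketch below illustrates that effect with synthetic revenues (made-up values, not the movie data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, made-up revenues in dollars (log-normal, all far above 1).
y_true = rng.lognormal(mean=17.5, sigma=1.5, size=1_000)

# A sigmoid output can only produce values strictly between 0 and 1.
y_pred_bounded = 1.0 / (1.0 + np.exp(-rng.normal(size=1_000)))

# With predictions this small, the error is essentially the target itself.
mae = np.mean(np.abs(y_true - y_pred_bounded))

print(f'mean(y): {y_true.mean():.2f}')
print(f'MAE:     {mae:.2f}')
```

If this is what happens here, rescaling the target (for example with a log transform or standardization) before training would be worth trying, since model 2's unbounded ReLU activations are the only ones that can reach the target's scale as the networks are currently written.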

Variable selection therefore did influence the performance of the neural networks, although only model 2 was sensitive to it.

### Optimal Neural Network

Model 2 shows a significantly lower MAE and MSE than models 1 and 3 in both scenarios, so model 2 appears to be the most promising option in terms of evaluation metrics, with its best result obtained using all variables.