## Worksheet 02

Team members:
- Andrea Ximena Ramírez Recinos, 21874
- Adrián Ricardo Flores Trujillo, 21500
- Emily Elvia Melissa Pérez Alarcón, 21385


In [12]:
# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
from matplotlib import pyplot as plt

# Clustering
import sklearn.cluster as cluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
import pyclustertend

# Data preprocessing
import sklearn.preprocessing

# Metrics for clustering evaluation
from sklearn.metrics import silhouette_score

In [13]:
# Read the data from the CSV file into a DataFrame and handle encoding issues if any.
df = pd.read_csv('Data/movies.csv', encoding='unicode_escape')

# Convert the cast-count columns to numeric, replacing non-convertible values with NaN.
cast_cols = ['castMenAmount', 'castWomenAmount']
df[cast_cols] = df[cast_cols].apply(pd.to_numeric, errors='coerce')

# Replace implausibly high counts (above 1000) with NaN, as they are likely data-entry errors.
df[cast_cols] = df[cast_cols].mask(df[cast_cols] > 1000)

# Display the DataFrame containing movie data.
df

Unnamed: 0,id,budget,genres,homePage,productionCompany,productionCompanyCountry,productionCountry,revenue,runtime,video,...,popularity,releaseDate,voteAvg,voteCount,genresAmount,productionCoAmount,productionCountriesAmount,actorsAmount,castWomenAmount,castMenAmount
0,5,4000000,Crime|Comedy,https://www.miramax.com/movie/four-rooms/,Miramax|A Band Apart,US|US,United States of America,4257354.0,98,False,...,20.880,1995-12-09,5.7,2077,2,2,1,25,15.0,9.0
1,6,21000000,Action|Thriller|Crime,,Universal Pictures|Largo Entertainment|JVC,US|US|JP,Japan|United States of America,12136938.0,110,False,...,9.596,1993-10-15,6.5,223,3,3,2,15,3.0,9.0
2,11,11000000,Adventure|Action|Science Fiction,http://www.starwars.com/films/star-wars-episod...,Lucasfilm|20th Century Fox,US|US,United States of America,775398007.0,121,,...,100.003,1977-05-25,8.2,16598,3,2,1,105,5.0,62.0
3,12,94000000,Animation|Family,http://movies.disney.com/finding-nemo,Pixar,US,United States of America,940335536.0,100,,...,134.435,2003-05-30,7.8,15928,2,1,1,24,5.0,18.0
4,13,55000000,Comedy|Drama|Romance,,Paramount|The Steve Tisch Company,US|,United States of America,677387716.0,142,False,...,58.751,1994-07-06,8.5,22045,3,2,1,76,18.0,48.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,920081,0,Action|Horror,,,,,0.0,100,False,...,16.662,2021-11-26,6.8,108,2,1,1,10,2.0,4.0
9996,920143,0,Comedy,,Caracol Televisión|Dago García Producciones,CO|CO,Colombia,0.0,97,False,...,491.706,2021-12-25,1.5,2,1,2,1,8,1.0,1.0
9997,922017,0,Comedy,,,,Nigeria,0.0,112,False,...,565.658,2021-12-17,6.1,30,1,1,17,1,0.0,
9998,922162,0,,https://www.netflix.com/title/81425229,,,United States of America,0.0,59,False,...,9.664,2021-12-17,6.0,1,1,0,0,0,,


1. Preprocess the dataset, and explain which variables do not contribute information to the formation of groups and why. Describe which variables you will use to compute the groups.

In [14]:
# Keep only numeric columns: categorical and text variables cannot be used directly in distance-based clustering.
df = df.select_dtypes(include=[np.number])
# Drop 'id' (an arbitrary identifier that says nothing about a movie) and discard rows with missing values.
df = df.drop(columns='id').dropna()
df

Unnamed: 0,budget,revenue,runtime,popularity,voteAvg,voteCount,genresAmount,productionCoAmount,productionCountriesAmount,actorsAmount,castWomenAmount,castMenAmount
0,4000000,4257354.0,98,20.880,5.7,2077,2,2,1,25,15.0,9.0
1,21000000,12136938.0,110,9.596,6.5,223,3,3,2,15,3.0,9.0
2,11000000,775398007.0,121,100.003,8.2,16598,3,2,1,105,5.0,62.0
3,94000000,940335536.0,100,134.435,7.8,15928,2,1,1,24,5.0,18.0
4,55000000,677387716.0,142,58.751,8.5,22045,3,2,1,76,18.0,48.0
...,...,...,...,...,...,...,...,...,...,...,...,...
9991,0,0.0,0,28.548,2.0,1,1,1,1,2,1.0,0.0
9992,0,0.0,77,153.156,7.5,22,2,3,2,9,4.0,2.0
9995,0,0.0,100,16.662,6.8,108,2,1,1,10,2.0,4.0
9996,0,0.0,97,491.706,1.5,2,1,2,1,8,1.0,1.0


In [15]:
# Work on a random sample of 1000 rows, since the full dataset might be too large for the available computational resources.
# No random_state is set, so the sample (and every result below) changes between runs.
dfArray = np.array(df.sample(1000))

# Data normalization
# Scale the data to have zero mean and unit variance along each feature.
# This step is crucial for clustering algorithms as it ensures that each feature contributes equally to the distance calculations.
df_scale = sklearn.preprocessing.scale(dfArray)
df_scale

array([[-0.47939399, -0.41376027,  0.02857942, ..., -0.82729495,
        -0.84060439, -0.71671768],
       [-0.27713576, -0.398636  ,  1.15773891, ...,  0.42283594,
         0.90800534,  1.09729821],
       [-0.53122392, -0.41376027, -0.59440513, ..., -0.91063701,
        -0.84060439, -0.80741847],
       ...,
       [ 1.41239812,  1.28183543, -0.36078593, ..., -0.03554539,
        -0.04578178,  0.37169186],
       [-0.03883967, -0.41376027,  1.04092931, ..., -0.41058466,
        -0.84060439, -0.98882006],
       [-0.47908822, -0.41213124, -0.32184939, ..., -0.91063701,
        -0.68163987, -0.89811927]])
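
As a sanity check on the scaling step, the following minimal sketch reproduces column-wise standardization with plain NumPy on a small made-up array (the values are illustrative, not the sampled rows). `sklearn.preprocessing.scale` divides by the population standard deviation (`ddof=0`), which is also the default of `np.std`.

```python
import numpy as np

# Hypothetical toy matrix: 4 rows, 2 features on very different scales
# (think of a budget column next to a runtime column).
X = np.array([
    [1_000_000.0, 90.0],
    [5_000_000.0, 100.0],
    [20_000_000.0, 110.0],
    [94_000_000.0, 140.0],
])

# Standardize each column: subtract its mean, divide by its population std.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

After this transformation both columns have comparable spread, so neither dominates the Euclidean distances used by the clustering algorithms.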

2. Analyze the clustering tendency using the Hopkins statistic and the VAT (Visual Assessment of cluster Tendency). Discuss your results and impressions.

In [16]:
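# Calculate the Hopkins statistic to assess the clustering tendency of the data,
# using every row of the scaled sample as the sampling size.
hopkinsValue = pyclustertend.hopkins(df_scale, len(df_scale))

# pyclustertend reports values near 0 for clusterable data and values near 0.5 for uniformly random data,
# so a value below 0.5 is read here as evidence of clustering tendency.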
if hopkinsValue < 0.5:
    print(f"The Hopkins Value is {hopkinsValue:.3f}, indicating a strong tendency for clustering.")
    print("Clustering is worthwhile.")
else:
    print(f"The Hopkins Value is {hopkinsValue:.3f}, indicating little clustering tendency.")
    print("Clustering may not be meaningful for this dataset.")

The Hopkins Value is 0.096, indicating a strong tendency for clustering.
Clustering is worthwhile.
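
The question above also asks for a VAT, for which no cell is shown. The following is a minimal sketch of the VAT reordering idea, run on synthetic data standing in for `df_scale`. The helper name `vat_order` is hypothetical (it is not a `pyclustertend` function); it orders the points with a Prim-style nearest-neighbor walk and returns the reordered distance matrix.

```python
import numpy as np

def vat_order(X):
    """Return the VAT visiting order and the reordered distance matrix."""
    # Pairwise Euclidean distance matrix (n x n).
    diff = X[:, None, :] - X[None, :, :]
    R = np.sqrt((diff ** 2).sum(axis=-1))
    n = R.shape[0]

    # Start at one endpoint of the largest dissimilarity.
    first = int(np.unravel_index(np.argmax(R), R.shape)[0])
    order = [first]
    unvisited = np.ones(n, dtype=bool)
    unvisited[first] = False

    # min_dist[j] = distance from point j to its nearest visited point.
    min_dist = R[first].copy()
    for _ in range(n - 1):
        # Next point: the unvisited one closest to the visited set.
        nxt = int(np.argmin(np.where(unvisited, min_dist, np.inf)))
        order.append(nxt)
        unvisited[nxt] = False
        min_dist = np.minimum(min_dist, R[nxt])

    order = np.array(order)
    return order, R[np.ix_(order, order)]

# Synthetic stand-in for df_scale: two shuffled, well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
perm = rng.permutation(len(X))
X, blob = X[perm], (perm >= 20).astype(int)

order, R_vat = vat_order(X)
print(blob[order])  # blob membership of each point, in VAT order
```

Rendering `R_vat` as an image, for example with `plt.imshow(R_vat, cmap="gray")`, gives the VAT picture: dark square blocks along the diagonal suggest groups of mutually close points.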


3. Determine the most suitable number of groups for the data you are working with. Make an elbow plot and explain the reasoning behind the number of clusters you will work with.
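
One possible way to approach the elbow method is sketched below. It uses synthetic blobs as a stand-in for `df_scale` (an assumption, so the example is self-contained) and records the k-means inertia, the within-cluster sum of squares, for a range of candidate values of k.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled movie data (assumption: 3 true groups).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.6,
    random_state=0,
)

ks = list(range(1, 9))
inertias = []
for k in ks:
    # inertia_ is the sum of squared distances of samples to their closest centroid.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `ks`, for example with `plt.plot(ks, inertias, marker="o")`, produces the elbow chart; the value of k where the curve stops dropping sharply is the usual candidate.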

4. Use the k-means and hierarchical clustering algorithms to form groups. Compare the results produced by each.
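
A minimal sketch of such a comparison follows, again on synthetic blobs rather than the movie sample. The choice of `k = 3` and of Ward linkage are assumptions made for illustration. Because cluster label numbers are arbitrary, the two partitions are compared with a contingency table and the adjusted Rand index rather than by direct label equality.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the scaled movie data.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.6,
    random_state=0,
)

k = 3  # assumed number of clusters, e.g. taken from an elbow plot
kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)

# Rows: k-means clusters; columns: hierarchical clusters.
table = pd.crosstab(
    kmeans_labels, hier_labels, rownames=["kmeans"], colnames=["hierarchical"]
)
agreement = adjusted_rand_score(kmeans_labels, hier_labels)

print(table)
print(f"Adjusted Rand index: {agreement:.3f}")
```

An adjusted Rand index close to 1 means both algorithms found essentially the same partition; values near 0 mean the agreement is no better than chance.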

5. Assess the quality of the clustering produced by each algorithm using the silhouette method. Discuss the results.
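
The silhouette comparison could look roughly like the sketch below, which scores both algorithms on the same synthetic stand-in data. The silhouette coefficient ranges from -1 to 1, and higher values indicate tighter, better-separated clusters. The value of `k` and the linkage are assumptions.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled movie data.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.6,
    random_state=0,
)

k = 3  # assumed number of clusters
labels_by_method = {
    "k-means": KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X),
    "hierarchical (Ward)": AgglomerativeClustering(
        n_clusters=k, linkage="ward"
    ).fit_predict(X),
}

# Mean silhouette coefficient over all samples, one score per algorithm.
scores = {
    name: silhouette_score(X, labels) for name, labels in labels_by_method.items()
}
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```

On real data the two scores usually differ, and the same loop can be repeated for several values of `k` to check whether the elbow choice also maximizes the silhouette.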

6. Interpret the groups based on your knowledge of the data. Remember to examine the measures of central tendency of the continuous variables and the frequency tables of the categorical variables belonging to each group. Identify interesting findings arising from the groupings and describe what they could be useful for.
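
A sketch of how per-group profiles might be computed with pandas is shown below. The table is made up for illustration: the `cluster` labels are invented, and a single `genre` column is assumed instead of the pipe-separated `genres` field. In the notebook, the categorical columns would have to be kept in a copy of the original DataFrame, since the preprocessing cell drops them.

```python
import pandas as pd

# Hypothetical mini-table with an invented cluster assignment.
movies = pd.DataFrame({
    "budget": [4_000_000, 94_000_000, 55_000_000, 0, 21_000_000, 11_000_000],
    "runtime": [98, 100, 142, 97, 110, 121],
    "genre": ["Comedy", "Animation", "Drama", "Comedy", "Action", "Action"],
    "cluster": [0, 1, 1, 0, 0, 0],
})

# Central tendency of the continuous variables within each cluster.
central = movies.groupby("cluster")[["budget", "runtime"]].agg(["mean", "median"])

# Frequency table of a categorical variable within each cluster.
genre_freq = pd.crosstab(movies["cluster"], movies["genre"])

print(central)
print(genre_freq)
```

Comparing the rows of `central` and `genre_freq` across clusters is what turns the numeric labels into a description such as "high-budget, long-runtime films".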

Link to the discussion of results: https://docs.google.com/document/d/14eCCvQ_C5yHVzzyvL_xo4VWLY_o9-9R3Mdc7et8q8JA/edit?usp=sharing