# 1. Data Processing - MovieLens

**Objective**: Load and clean the MovieLens data and generate features for analysis.

**Dataset**: MovieLens 100K/1M
- 100,000 ratings (1-5) from 943 users on 1,682 movies (100K version)
- Like/Dislike: a rating counts as a like at a threshold of ≥ 4 stars

**Author**: Prof. Rensso Mora Colque

---

## Contents
1. Importing libraries
2. Loading the data (100K or 1M)
3. Initial exploration
4. Creating the target variable (like)
5. User features (genre and decade preferences, statistics, entropy), including the movie year and decade
6. Latent embeddings (SVD)
7. Final dataset assembly
8. Dataset export

## 1. Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler, LabelEncoder
from scipy.stats import entropy
import warnings
import os
warnings.filterwarnings('ignore')

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## 2. Data Loading

### 💡 Note: This notebook works with both datasets

To switch between **MovieLens 100K** and **MovieLens 1M**, change only the `DATASET` variable.

In [2]:
# ============================================
# CONFIGURATION: switch between 100K and 1M
# ============================================
DATASET = '100k'  # Options: '100k' or '1m'

if DATASET == '100k':
    # MovieLens 100K (943 users, 1682 movies, 100K ratings)
    ratings_cols = ['user_id', 'item_id', 'rating', 'timestamp']
    ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')

    movie_cols = ['item_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
    genre_cols = ['unknown', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy',
                  'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
                  'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
    movies = pd.read_csv('ml-100k/u.item', sep='|', names=movie_cols + genre_cols,
                         encoding='latin-1', usecols=range(24))

    user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
    users = pd.read_csv('ml-100k/u.user', sep='|', names=user_cols, encoding='latin-1')

elif DATASET == '1m':
    # MovieLens 1M (6040 users, 3706 movies, 1M ratings)
    ratings = pd.read_csv('ml-1m/ratings.dat', sep='::', names=['user_id', 'item_id', 'rating', 'timestamp'],
                          engine='python', encoding='latin-1')

    movies = pd.read_csv('ml-1m/movies.dat', sep='::', names=['item_id', 'title', 'genres'],
                         engine='python', encoding='latin-1')

    # Expand the genres (stored as "Action|Adventure|Sci-Fi") into one 0/1 column per genre
    genre_cols = ['Action', 'Adventure', 'Animation', 'Children', 'Comedy',
                  'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
                  'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

    for genre in genre_cols:
        movies[genre] = movies['genres'].str.contains(genre, case=False, na=False).astype(int)

    # Extract the year from the title (format: "Movie Title (YYYY)")
    movies['release_date'] = movies['title'].str.extract(r'\((\d{4})\)')[0]

    users = pd.read_csv('ml-1m/users.dat', sep='::',
                        names=['user_id', 'gender', 'age', 'occupation', 'zip_code'],
                        engine='python', encoding='latin-1')

else:
    # Fail early instead of hitting a NameError on `ratings` further down
    raise ValueError(f"Unknown DATASET {DATASET!r}; expected '100k' or '1m'")

print(f"📊 Selected dataset: MovieLens {DATASET.upper()}")
print(f"Ratings: {ratings.shape}")
print(f"Movies: {movies.shape}")
print(f"Users: {users.shape}")

📊 Selected dataset: MovieLens 100K
Ratings: (100000, 4)
Movies: (1682, 24)
Users: (943, 5)
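
The 1M branch of the loading cell builds the genre indicators with one `str.contains` call per genre. As a minimal sketch of an alternative, assuming pipe-separated genre strings like the ones described in that cell's comment, `Series.str.get_dummies` splits each string once and returns one 0/1 column per distinct label. The `toy` frame and its rows are invented for illustration and are not read from the dataset files.

```python
import pandas as pd

# Hypothetical rows in the "Genre|Genre|Genre" layout; not real file contents.
toy = pd.DataFrame({
    'item_id': [1, 2, 3],
    'genres': ["Animation|Children's|Comedy", 'Action|Sci-Fi', 'Drama'],
})

# One indicator column per distinct label found in the strings.
indicators = toy['genres'].str.get_dummies(sep='|')
toy = toy.join(indicators)

print(toy.drop(columns='genres'))
```

One difference to keep in mind: the column names come from the labels exactly as they appear in the data, whereas the loading cell fixes the column names in advance and matches them as case-insensitive substrings.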


## 3. Initial Exploration

In [3]:
print("="*60)
print("GENERAL INFORMATION")
print("="*60)
print(f"\nTotal ratings: {len(ratings):,}")
print(f"Total users: {ratings['user_id'].nunique()}")
print(f"Total movies: {ratings['item_id'].nunique()}")
print(f"\nRating range: {ratings['rating'].min()} - {ratings['rating'].max()}")
print("\nRating distribution:")
print(ratings['rating'].value_counts().sort_index())

# Show the first rows
ratings.head()

GENERAL INFORMATION

Total ratings: 100,000
Total users: 943
Total movies: 1682

Rating range: 1 - 5

Rating distribution:
rating
1     6110
2    11370
3    27145
4    34174
5    21201
Name: count, dtype: int64


   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596
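
The counts printed in this section are enough to work out how sparse the user-movie matrix is, which matters later when unrated entries are filled with zeros. The following is a small arithmetic sketch that uses only the numbers from the output above; the helper names `matrix_density` and `mean_rating` are made up for this example.

```python
# Counts copied from the exploration output (MovieLens 100K).
n_ratings = 100_000
n_users = 943
n_movies = 1_682
rating_counts = {1: 6110, 2: 11370, 3: 27145, 4: 34174, 5: 21201}

def matrix_density(n_ratings, n_users, n_movies):
    """Fraction of user-movie pairs that have an observed rating."""
    return n_ratings / (n_users * n_movies)

def mean_rating(counts):
    """Mean rating implied by a {rating: count} histogram."""
    total = sum(counts.values())
    return sum(rating * count for rating, count in counts.items()) / total

density = matrix_density(n_ratings, n_users, n_movies)
print(f"Density: {density:.2%}")
print(f"Ratings per user: {n_ratings / n_users:.1f}")
print(f"Mean rating: {mean_rating(rating_counts):.2f}")
```

With these counts, only about 6.3% of the 943 × 1,682 cells hold a rating, so roughly 94% of the user-movie matrix is empty.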


## 4. Creating the Target Variable (Like)

In [4]:
# Create the target variable: like (rating >= 4) vs. dislike (rating < 4)
ratings['like'] = (ratings['rating'] >= 4).astype(int)

print("="*60)
print("LIKE/DISLIKE DISTRIBUTION (threshold ≥ 4)")
print("="*60)
print(ratings['like'].value_counts())
print(f"\nLike percentage: {ratings['like'].mean()*100:.2f}%")

LIKE/DISLIKE DISTRIBUTION (threshold ≥ 4)
like
1    55375
0    44625
Name: count, dtype: int64

Like percentage: 55.38%
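
The class balance depends directly on where the threshold sits. As a quick sketch, the rating histogram from Section 3 can be reused to see what share of ratings would count as likes under other cut-offs; `like_share` is a hypothetical helper, and the notebook itself only uses the threshold of 4.

```python
# Rating histogram copied from the exploration output (MovieLens 100K).
rating_counts = {1: 6110, 2: 11370, 3: 27145, 4: 34174, 5: 21201}

def like_share(counts, threshold):
    """Share of ratings at or above `threshold`."""
    total = sum(counts.values())
    likes = sum(count for rating, count in counts.items() if rating >= threshold)
    return likes / total

for threshold in (3, 4, 5):
    print(f"threshold >= {threshold}: {like_share(rating_counts, threshold):.2%} likes")
```

For a threshold of 4 this reproduces the 55,375 likes reported above (34,174 + 21,201), which is the most balanced split of the three.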


## 5. User Features

In [5]:
# Extract each movie's release year and decade
movies['year'] = pd.to_datetime(movies['release_date'], errors='coerce').dt.year
# Movies without a parseable release date are assigned to the 1990s
movies['decade'] = (movies['year'] // 10 * 10).fillna(1990).astype(int)

# Join ratings with movie and user metadata
data = ratings.merge(movies, on='item_id').merge(users, on='user_id')

print("="*60)
print("COMBINED DATA")
print("="*60)
print(f"Shape: {data.shape}")
print(f"\nColumns: {list(data.columns)}")
print("\nNull values:")
null_counts = data.isnull().sum()
print(null_counts[null_counts > 0])

COMBINED DATA
Shape: (100000, 34)

Columns: ['user_id', 'item_id', 'rating', 'timestamp', 'like', 'title', 'release_date', 'video_release_date', 'imdb_url', 'unknown', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western', 'year', 'decade', 'age', 'gender', 'occupation', 'zip_code']

Null values:
release_date               9
video_release_date    100000
imdb_url                  13
year                       9
dtype: int64
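
The nine ratings with a missing `release_date` end up in the 1990s bucket because of the `fillna(1990)` call. The sketch below shows the same year-to-decade arithmetic on toy dates and, as one possible alternative rather than the notebook's approach, keeps unknown decades as missing values with pandas' nullable `Int64` dtype. The date strings and the explicit `format` argument are assumptions chosen to resemble the day-month-year style of the 100K file.

```python
import pandas as pd

# Toy release dates; the last entry is missing on purpose.
release_dates = pd.Series(['01-Jan-1995', '14-Feb-1986', None])

# Unparseable or missing dates become NaT, so the year becomes NaN.
year = pd.to_datetime(release_dates, format='%d-%b-%Y', errors='coerce').dt.year

# Integer-divide by 10 and multiply back to get the decade.
# The nullable integer dtype keeps the missing decade as <NA> instead of imputing 1990.
decade = (year // 10 * 10).astype('Int64')

print(decade.tolist())
```

Whether to impute or keep the gap is a modeling choice: imputing keeps every downstream column dense, while `<NA>` makes the missing information visible.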


### 5.1 Genre Preferences

In [6]:
# Compute each user's genre preferences: for every genre, the share of the
# user's total rating points that went to movies tagged with that genre
genre_columns = ['Action', 'Adventure', 'Animation', 'Children', 'Comedy',
                 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
                 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
                 'Thriller', 'War', 'Western']

user_genre_prefs = []
for user_id in data['user_id'].unique():
    user_data = data[data['user_id'] == user_id]
    # Weight each genre indicator by the rating, sum over the user's movies,
    # and normalize by the user's rating total
    genre_prefs = (user_data[genre_columns].T * user_data['rating'].values).T.sum() / user_data['rating'].sum()
    user_genre_prefs.append(genre_prefs.values)

user_genre_df = pd.DataFrame(user_genre_prefs,
                              columns=[f'user_pref_{g}' for g in genre_columns],
                              index=data['user_id'].unique())
user_genre_df.index.name = 'user_id'
user_genre_df = user_genre_df.reset_index()

print("="*60)
print("GENRE PREFERENCES PER USER")
print("="*60)
print(user_genre_df.head())
print(f"\nShape: {user_genre_df.shape}")

GENRE PREFERENCES PER USER
   user_id  user_pref_Action  user_pref_Adventure  user_pref_Animation  \
0      196          0.035461             0.070922             0.000000   
1      186          0.340764             0.143312             0.057325   
2       22          0.580420             0.282051             0.000000   
3      244          0.169160             0.078251             0.042578   
4      166          0.422535             0.000000             0.000000   

   user_pref_Children  user_pref_Comedy  user_pref_Crime  \
0            0.056738          0.808511         0.000000   
1            0.108280          0.124204         0.117834   
2            0.023310          0.501166         0.074592   
3            0.037975          0.416571         0.035673   
4            0.098592          0.267606         0.014085   

   user_pref_Documentary  user_pref_Drama  user_pref_Fantasy  \
0               0.028369         0.319149           0.028369   
1               0.000000      
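
The per-user loop in this cell filters the full table once for every user. A minimal vectorized sketch of the same rating-weighted formula, run on an invented four-row frame with two genres, could look like this; the variable names are illustrative and the toy numbers are not from MovieLens.

```python
import pandas as pd

genres = ['Action', 'Comedy']
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'rating':  [5, 3, 4, 2],
    'Action':  [1, 0, 1, 1],
    'Comedy':  [0, 1, 0, 1],
})

# Multiply every genre indicator by the rating of its row.
weighted = toy[genres].mul(toy['rating'], axis=0)

# Sum the weighted indicators per user and divide by the user's rating total.
rating_totals = toy.groupby('user_id')['rating'].sum()
prefs = weighted.groupby(toy['user_id']).sum().div(rating_totals, axis=0)
prefs = prefs.add_prefix('user_pref_').reset_index()

print(prefs)
```

Because a movie can carry several genres, the preferences of one user do not have to sum to 1: user 2 here gets 1.0 for Action and about 0.33 for Comedy. The result is ordered by `user_id` rather than by first appearance, which makes no difference once it is merged on `user_id`.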

### 5.2 Decade Preferences

In [7]:
# Compute each user's mean rating per release decade;
# 0 marks decades from which the user rated no movies
user_decade_prefs = data.groupby(['user_id', 'decade'])['rating'].mean().unstack(fill_value=0)
user_decade_prefs.columns = [f'user_decade_{int(d)}' for d in user_decade_prefs.columns]
user_decade_prefs = user_decade_prefs.reset_index()

print("="*60)
print("DECADE PREFERENCES PER USER")
print("="*60)
print(user_decade_prefs.head())
print(f"\nShape: {user_decade_prefs.shape}")

DECADE PREFERENCES PER USER
   user_id  user_decade_1920  user_decade_1930  user_decade_1940  \
0        1               0.0               3.5               4.0   
1        2               0.0               0.0               0.0   
2        3               0.0               0.0               0.0   
3        4               0.0               0.0               0.0   
4        5               0.0               4.0               3.0   

   user_decade_1950  user_decade_1960  user_decade_1970  user_decade_1980  \
0          4.000000          3.000000          3.944444          3.931818   
1          0.000000          0.000000          5.000000          0.000000   
2          0.000000          0.000000          0.000000          0.000000   
3          0.000000          0.000000          4.500000          3.000000   
4          2.833333          3.333333          2.950000          3.250000   

   user_decade_1990  
0          3.528497  
1          3.666667  
2          2.796296  
3   
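
In the table above, a value of 0.0 means "no ratings for this decade", yet ratings only range from 1 to 5, so a model may read 0.0 as a strong dislike. As a hedged alternative, not what the notebook does, the sketch below leaves the gaps as NaN and then fills them with each user's overall mean rating. The toy frame and the names `means`, `user_mean` and `filled` are assumptions made for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'decade':  [1980, 1990, 1990],
    'rating':  [4, 2, 5],
})

# Without fill_value, decades a user never rated stay NaN.
means = toy.groupby(['user_id', 'decade'])['rating'].mean().unstack()

# Neutral fallback: the user's own mean rating over all decades.
user_mean = toy.groupby('user_id')['rating'].mean()
filled = means.apply(lambda column: column.fillna(user_mean))
filled.columns = [f'user_decade_{int(d)}' for d in filled.columns]

print(filled.reset_index())
```

Here user 2 never rated a 1980s movie, so that cell receives the user's mean of 5.0 instead of 0.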

### 5.3 User Statistics

In [8]:
# Compute per-user rating statistics
user_stats = data.groupby('user_id').agg({
    'rating': ['mean', 'std', 'count']
}).reset_index()
user_stats.columns = ['user_id', 'user_rating_mean', 'user_rating_std', 'user_n_votes']
# The std is undefined for a user with a single rating; treat it as 0
user_stats['user_rating_std'] = user_stats['user_rating_std'].fillna(0)

# Compute genre diversity (Shannon entropy of the user's genre distribution)
def calculate_genre_entropy(user_id):
    user_movies = data[data['user_id'] == user_id][genre_columns]
    genre_counts = user_movies.sum()
    if genre_counts.sum() == 0:
        return 0
    genre_probs = genre_counts / genre_counts.sum()
    return entropy(genre_probs[genre_probs > 0])

user_stats['user_genre_diversity'] = user_stats['user_id'].apply(calculate_genre_entropy)

print("="*60)
print("USER STATISTICAL FEATURES")
print("="*60)
print(user_stats.head(10))
print(f"\nShape: {user_stats.shape}")

USER STATISTICAL FEATURES
   user_id  user_rating_mean  user_rating_std  user_n_votes  \
0        1          3.610294         1.263585           272   
1        2          3.709677         1.030472            62   
2        3          2.796296         1.219026            54   
3        4          4.333333         0.916831            24   
4        5          2.874286         1.362963           175   
5        6          3.635071         1.039461           211   
6        7          3.965261         1.064480           403   
7        8          3.796610         1.242629            59   
8        9          4.272727         0.935125            22   
9       10          4.206522         0.582777           184   

   user_genre_diversity  
0              2.463542  
1              2.277391  
2              2.367946  
3              2.348953  
4              2.452145  
5              2.425902  
6              2.605526  
7              2.252600  
8              2.172565  
9           
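
`scipy.stats.entropy` uses the natural logarithm by default, so `user_genre_diversity` is measured in nats. With 18 genre columns the largest possible value is ln 18 ≈ 2.89, which puts the values of roughly 2.2 to 2.6 shown above fairly close to the maximum. A small sketch of a normalized variant follows; `normalized_entropy` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from scipy.stats import entropy

def normalized_entropy(counts):
    """Shannon entropy of `counts` divided by its maximum, ln(number of categories)."""
    counts = np.asarray(counts, dtype=float)
    if len(counts) < 2 or counts.sum() == 0:
        return 0.0
    # entropy() normalizes the counts to probabilities internally.
    return float(entropy(counts) / np.log(len(counts)))

print(normalized_entropy(np.ones(18)))   # ratings spread evenly over 18 genres
print(normalized_entropy([10, 0, 0]))    # every rating in a single genre
print(normalized_entropy([8, 1, 1]))     # mostly one genre
```

The normalized score lies between 0 (a single genre) and 1 (a perfectly even spread), which makes users comparable even if the number of genre columns changes between datasets.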

## 6. Latent Embeddings (SVD)

In [9]:
# Build the user-movie matrix (unrated entries are filled with 0)
user_item_matrix = data.pivot_table(index='user_id', columns='item_id', values='rating', fill_value=0)

# Apply SVD for USERS
n_components = 20
svd_user = TruncatedSVD(n_components=n_components, random_state=42)
user_embeddings = svd_user.fit_transform(user_item_matrix)

user_embedding_cols = [f'user_embed_{i}' for i in range(n_components)]
user_embeddings_df = pd.DataFrame(
    user_embeddings,
    columns=user_embedding_cols,
    index=user_item_matrix.index
).reset_index()

print("="*60)
print("LATENT USER EMBEDDINGS (SVD)")
print("="*60)
print(f"Components: {n_components}")
print(f"Explained variance: {svd_user.explained_variance_ratio_.sum()*100:.2f}%")
print(f"\nEmbeddings shape: {user_embeddings_df.shape}")

LATENT USER EMBEDDINGS (SVD)
Components: 20
Explained variance: 41.21%

Embeddings shape: (943, 21)


In [10]:
# Apply SVD for MOVIES (the rows of the transposed matrix are movies)
svd_item = TruncatedSVD(n_components=n_components, random_state=42)
item_embeddings = svd_item.fit_transform(user_item_matrix.T)

item_embedding_cols = [f'item_embed_{i}' for i in range(n_components)]
item_embeddings_df = pd.DataFrame(
    item_embeddings,
    columns=item_embedding_cols,
    index=user_item_matrix.columns
).reset_index()
item_embeddings_df = item_embeddings_df.rename(columns={'index': 'item_id'})

print("="*60)
print("LATENT MOVIE EMBEDDINGS (SVD)")
print("="*60)
print(f"Components: {n_components}")
print(f"Explained variance: {svd_item.explained_variance_ratio_.sum()*100:.2f}%")
print(f"Embeddings shape: {item_embeddings_df.shape}")

LATENT MOVIE EMBEDDINGS (SVD)
Components: 20
Explained variance: 46.20%
Embeddings shape: (1682, 21)
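
The two cells above fit separate SVD models on a dense pivot table, one on the matrix and one on its transpose. The sketch below illustrates a related setup under stated assumptions: it builds the matrix in sparse form, which `TruncatedSVD` accepts directly, and reads movie factors from `components_` of a single fit so that users and movies share one latent space. This is one possible variant, not the notebook's method, and the toy ids and ratings are invented.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy (user, movie, rating) triples; each pair appears once.
user_ids = np.array([1, 1, 2, 2, 3, 3, 4, 5, 5])
item_ids = np.array([10, 20, 10, 30, 20, 40, 30, 10, 40])
ratings = np.array([5, 3, 4, 2, 1, 5, 4, 3, 2], dtype=float)

# Map raw ids to consecutive row and column positions.
users, row_idx = np.unique(user_ids, return_inverse=True)
items, col_idx = np.unique(item_ids, return_inverse=True)
matrix = csr_matrix((ratings, (row_idx, col_idx)), shape=(len(users), len(items)))

svd = TruncatedSVD(n_components=2, random_state=42)
user_embeddings = svd.fit_transform(matrix)  # one row per user
item_embeddings = svd.components_.T          # one row per movie, same latent axes

print(user_embeddings.shape, item_embeddings.shape)
```

The sparse layout stores only the observed ratings, which becomes more relevant for the 1M dataset. Note that `components_` holds unscaled right singular vectors, while `fit_transform` returns user factors already scaled by the singular values, so the two sets of vectors differ in scale.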


## 7. Final Dataset Assembly

In [11]:
# Join all the features
data_final = data.merge(user_genre_df, on='user_id', how='left')
data_final = data_final.merge(user_decade_prefs, on='user_id', how='left')
data_final = data_final.merge(user_stats, on='user_id', how='left')
data_final = data_final.merge(user_embeddings_df, on='user_id', how='left')
data_final = data_final.merge(item_embeddings_df, on='item_id', how='left')

print("="*60)
print("FINAL ASSEMBLED DATASET")
print("="*60)
print(f"Shape: {data_final.shape}")
print(f"\nTotal columns: {len(data_final.columns)}")
print(f"\nFirst columns: {list(data_final.columns[:10])}")

FINAL ASSEMBLED DATASET
Shape: (100000, 104)

Total columns: 104

First columns: ['user_id', 'item_id', 'rating', 'timestamp', 'like', 'title', 'release_date', 'video_release_date', 'imdb_url', 'unknown']
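
Each merge in this cell attaches a table with one row per user or per movie, so the row count should stay at 100,000. A minimal sketch of a guard for that assumption uses the `validate` argument of `DataFrame.merge`; the wrapper `left_merge_checked` and the toy frames are hypothetical.

```python
import pandas as pd

def left_merge_checked(left, right, key):
    """Left-merge `right` onto `left`, failing loudly if rows would be duplicated."""
    # 'many_to_one' raises pandas.errors.MergeError when `key` repeats in `right`.
    merged = left.merge(right, on=key, how='left', validate='many_to_one')
    assert len(merged) == len(left), "row count changed during merge"
    return merged

toy_ratings = pd.DataFrame({'user_id': [1, 1, 2], 'item_id': [10, 20, 10], 'rating': [5, 3, 4]})
toy_user_stats = pd.DataFrame({'user_id': [1, 2], 'user_rating_mean': [4.0, 4.0]})

merged = left_merge_checked(toy_ratings, toy_user_stats, 'user_id')
print(merged.shape)
```

The column count can be checked the same way by hand: the 34 columns of the combined data plus 18 genre preferences, 8 decade columns, 4 user statistics and 20 + 20 embedding dimensions are consistent with the 104 columns reported above.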


## 8. Dataset Export

In [12]:
# Create the folder for processed data
os.makedirs('data_processed', exist_ok=True)

# 1. Full dataset
data_final.to_csv('data_processed/data_final.csv', index=False)
print("✓ data_final.csv exported")

# 2. User features (for clustering)
user_features = user_genre_df.merge(user_embeddings_df, on='user_id')
user_features = user_features.merge(user_stats, on='user_id')
user_features.to_csv('data_processed/user_features.csv', index=False)
print("✓ user_features.csv exported")

# 3. Movie features
movie_features = movies[['item_id', 'title', 'decade'] + genre_columns].merge(
    item_embeddings_df, on='item_id'
)
movie_features.to_csv('data_processed/movie_features.csv', index=False)
print("✓ movie_features.csv exported")

print("\n" + "="*60)
print("✅ PROCESSING COMPLETE")
print("="*60)
print("\nFiles exported to: data_processed/")
print(f"  - data_final.csv ({data_final.shape[0]:,} rows, {data_final.shape[1]} columns)")
print(f"  - user_features.csv ({user_features.shape[0]:,} users)")
print(f"  - movie_features.csv ({movie_features.shape[0]:,} movies)")

✓ data_final.csv exported
✓ user_features.csv exported
✓ movie_features.csv exported

✅ PROCESSING COMPLETE

Files exported to: data_processed/
  - data_final.csv (100,000 rows, 104 columns)
  - user_features.csv (943 users)
  - movie_features.csv (1,682 movies)
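
Before another notebook consumes the exported files, it can be worth confirming that they reload with the same shape and values. The sketch below performs such a round trip on a small invented frame in a temporary directory, so it does not touch `data_processed/`; the column names only mimic the exported ones.

```python
import os
import tempfile

import pandas as pd

toy_features = pd.DataFrame({
    'user_id': [1, 2],
    'user_embed_0': [0.123456789, -1.5],
    'user_n_votes': [272, 62],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'user_features.csv')
    # index=False matches the export cell and avoids an extra unnamed column on reload.
    toy_features.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if shapes, dtypes or values differ beyond float tolerance.
pd.testing.assert_frame_equal(toy_features, reloaded)
print(reloaded.dtypes)
```

Writing with `index=False` is what keeps the reloaded frame free of an `Unnamed: 0` column.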
