# Data Cleaning and Transformation: Uniformity and Consistency

This notebook ensures that every movie and series has either valid comments or a clear instruction for the case where none exist. It removes missing values and makes the information uniform.


In [82]:
import pandas as pd
import numpy as np

In [83]:
# Table of movie reviews
df = pd.read_csv("C:\\Users\\Usuario\\Desktop\\Netflix\\Raw_data\\df_reseñas.csv",index_col=0)
df_clustered = pd.read_csv("C:\\Users\\Usuario\\Desktop\\Netflix\\Raw_data\\df_clustered.csv",index_col=0)

#### Overview of the table with all reviews

In [84]:
df

Unnamed: 0,title,title_comment,comment
0,The Healing Powers of Dude,Great Family Show with Awesome Messages!,This is a wonderful family show! It tackles gr...
1,The Healing Powers of Dude,"Ignore the ""controversies"" and just enjoy the ...",I have a huge crush on Larisa Oleynik since ba...
2,The Healing Powers of Dude,Cute and lovable,"I'm 34 and do not have anxiety disorder, nor d..."
3,The Healing Powers of Dude,Fun show but inaccurate,So I do love the show but it is not accurate f...
4,The Healing Powers of Dude,I like the show but its inaccurate,Cute show but I don't like that Dude is an emo...
...,...,...,...
584,Chicken Soup for the Soul's Being Dad,,
585,H,,
586,Manu,,
587,Mama Drama,,


In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 91508 entries, 0 to 588
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          91508 non-null  object
 1   title_comment  90918 non-null  object
 2   comment        90919 non-null  object
dtypes: object(3)
memory usage: 2.8+ MB
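
The gap between the 91508 entries and the non-null counts above is the number of missing values per column (590 for `title_comment` and 589 for `comment`). A minimal sketch of reading those counts directly with `isna().sum()`, run on a small made-up frame (the `toy` data is hypothetical and is not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the three columns of the reviews table
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "title_comment": ["Nice", "Fine", np.nan],
    "comment": ["Good show", "Comentario no disponible", np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Note that the placeholder string `"Comentario no disponible"` is an ordinary value, so it is not counted as missing here.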



### Drop rows whose comment is unavailable when the title has at least one other comment
Comment = *`"Comentario no disponible"`* marks reviews that contain **spoilers** and were therefore not scraped.

In [87]:
df[df["comment"] == "Comentario no disponible"].head()

Unnamed: 0,title,title_comment,comment
7,The Healing Powers of Dude,Cute series for the entire family,Comentario no disponible
10,The Healing Powers of Dude,Perfect family show,Comentario no disponible
29,Brave Blue World,"A good film, some celebrities",Comentario no disponible
41,Brave Blue World,Greenwashing cringe-worthy docu,Comentario no disponible
50,DNA,Big moral dilemma,Comentario no disponible


In [88]:
# Find the titles that have at least one valid comment
titles_with_valid_comments = df[df["comment"] != "Comentario no disponible"]["title"].unique()

# Drop rows whose comment is "Comentario no disponible" if the title has at least one valid comment
df = df[~((df["title"].isin(titles_with_valid_comments)) & (df["comment"] == "Comentario no disponible"))]
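
To make the effect of this filter easier to see, here is a minimal sketch on hypothetical toy data (the frame `toy` and the name `PLACEHOLDER` are illustrative, not taken from the notebook). Title `A` has one real comment, so its placeholder row is dropped. Title `B` has only the placeholder, so its row is kept.

```python
import pandas as pd

PLACEHOLDER = "Comentario no disponible"

# Hypothetical toy data: A has a real comment, B has only the placeholder
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "title_comment": ["Nice", "Spoiler review", "Only review"],
    "comment": ["Good show", PLACEHOLDER, PLACEHOLDER],
})

# Titles with at least one comment that is not the placeholder
titles_with_valid = toy.loc[toy["comment"] != PLACEHOLDER, "title"].unique()

# Drop placeholder rows only for those titles; .copy() gives an independent frame
is_placeholder = toy["comment"] == PLACEHOLDER
filtered = toy[~(toy["title"].isin(titles_with_valid) & is_placeholder)].copy()
print(filtered)
```

A comparison such as `NaN != PLACEHOLDER` evaluates to `True`, so a title whose only comment is missing also counts as "valid" under this test. Since rows with missing comments are not placeholder rows, the filter leaves them in place either way.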

In [89]:
# Comments that are identical to their review titles
df[df["comment"] == df["title_comment"]].info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 8416 to 4181
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          12 non-null     object
 1   title_comment  12 non-null     object
 2   comment        12 non-null     object
dtypes: object(3)
memory usage: 384.0+ bytes


In [90]:
# In these cases the comment simply repeats the review title
df[df["comment"] == df["title_comment"]]

Unnamed: 0,title,title_comment,comment
8416,Word of Honor,This is a great movie I've ever seen Beautiful...,This is a great movie I've ever seen Beautiful...
417,Secret Superstar,"Basically qualified inspirational films, but t...","Basically qualified inspirational films, but t..."
4765,Chasing Coral,Beautiful underwater footage (10/10); Anti hum...,Beautiful underwater footage (10/10); Anti hum...
491,The Rise of Phoenixes,"Sooooo amazing, I mean I feel so satisfied whe...","Sooooo amazing, I mean I feel so satisfied whe..."
501,The Rise of Phoenixes,Chen Kun's face looks so gorgeous on the scree...,Chen Kun's face looks so gorgeous on the scree...
5324,Toilet: A Love Story,Although the performance is exaggerated and th...,Although the performance is exaggerated and th...
6187,Kevin Hart: Zero F**ks Given,Not so great like his other sketch but still e...,Not so great like his other sketch but still e...
297,Mosul,"A very beautiful movie, but it cannot reach th...","A very beautiful movie, but it cannot reach th..."
3428,Typewriter,Indian version of stranger things with black m...,Indian version of stranger things with black m...
3453,Blood Money,Indian version of stranger things with black m...,Indian version of stranger things with black m...


### Replace the comments of titles that have no reviews with a clear instruction so the LLM knows how to handle the title
title_comment = *`NaN`* marks little-known titles for which no review was found.

In [91]:
df

Unnamed: 0,title,title_comment,comment
0,The Healing Powers of Dude,Great Family Show with Awesome Messages!,This is a wonderful family show! It tackles gr...
1,The Healing Powers of Dude,"Ignore the ""controversies"" and just enjoy the ...",I have a huge crush on Larisa Oleynik since ba...
2,The Healing Powers of Dude,Cute and lovable,"I'm 34 and do not have anxiety disorder, nor d..."
3,The Healing Powers of Dude,Fun show but inaccurate,So I do love the show but it is not accurate f...
4,The Healing Powers of Dude,I like the show but its inaccurate,Cute show but I don't like that Dude is an emo...
...,...,...,...
584,Chicken Soup for the Soul's Being Dad,,
585,H,,
586,Manu,,
587,Mama Drama,,


In [92]:
df["title_comment"] = df["title_comment"].fillna("Este título no tiene reseñas")

# Instructions for the LLM
df["comment"] = df["comment"].fillna(
    "Este título no tiene reseñas disponibles. La siguiente información ha sido generada por IA:\n\n"
    "1. Opinión positiva: Posible razón por la que este título podría gustar a la audiencia.\n"
    "2. Opinión negativa: Posible punto débil o algo que podría no gustar de esta película.\n"
    "3. Mini resumen: Basado en el título, una breve sinopsis de 2 o 3 líneas sobre de qué podría tratarse la película.\n\n"
    "Este contenido no refleja opiniones reales y solo es una interpretación automática del título."
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["title_comment"] = df["title_comment"].fillna("Este título no tiene reseñas")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["comment"] = df["comment"].fillna(
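
The warning above most likely appears because `df` was last assigned from a boolean-mask selection, so pandas cannot tell whether the column assignment targets the original frame or a copy. One common way to avoid it is to take an explicit `.copy()` right after filtering and only then assign. A minimal sketch on hypothetical toy data (the frame and the shortened fill text are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one title that has no reviews
toy = pd.DataFrame({
    "title": ["A", "B"],
    "title_comment": ["Nice", np.nan],
    "comment": ["Good show", np.nan],
})

# Explicit copy after filtering, so later assignments act on an independent frame
subset = toy[toy["title"].isin(["A", "B"])].copy()

subset["title_comment"] = subset["title_comment"].fillna("Este título no tiene reseñas")
# Shortened stand-in for the full instruction text used in the notebook
subset["comment"] = subset["comment"].fillna("Este título no tiene reseñas disponibles.")

print(subset)
```

Because `subset` is a copy, the original `toy` frame keeps its missing values.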


In [93]:
df

Unnamed: 0,title,title_comment,comment
0,The Healing Powers of Dude,Great Family Show with Awesome Messages!,This is a wonderful family show! It tackles gr...
1,The Healing Powers of Dude,"Ignore the ""controversies"" and just enjoy the ...",I have a huge crush on Larisa Oleynik since ba...
2,The Healing Powers of Dude,Cute and lovable,"I'm 34 and do not have anxiety disorder, nor d..."
3,The Healing Powers of Dude,Fun show but inaccurate,So I do love the show but it is not accurate f...
4,The Healing Powers of Dude,I like the show but its inaccurate,Cute show but I don't like that Dude is an emo...
...,...,...,...
584,Chicken Soup for the Soul's Being Dad,Este título no tiene reseñas,Este título no tiene reseñas disponibles. La s...
585,H,Este título no tiene reseñas,Este título no tiene reseñas disponibles. La s...
586,Manu,Este título no tiene reseñas,Este título no tiene reseñas disponibles. La s...
587,Mama Drama,Este título no tiene reseñas,Este título no tiene reseñas disponibles. La s...


In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 76452 entries, 0 to 588
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          76452 non-null  object
 1   title_comment  76452 non-null  object
 2   comment        76452 non-null  object
dtypes: object(3)
memory usage: 2.3+ MB


#### Export the cleaned and transformed table to CSV

In [98]:
df.to_csv("C:\\Users\\Usuario\\Desktop\\Netflix\\Raw_data\\tablon.csv")
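
A quick way to check an export like this is to read the file back and compare it with the frame in memory. The sketch below does that on hypothetical toy data written to a temporary directory. The file name `tablon_toy.csv` and the explicit `encoding="utf-8"` are assumptions made for illustration and are not taken from the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data, including accented characters and an embedded newline
toy = pd.DataFrame(
    {
        "title": ["A", "B"],
        "title_comment": ["Nice", "Este título no tiene reseñas"],
        "comment": ["Good show", "Línea 1\nLínea 2"],
    },
    index=[0, 584],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tablon_toy.csv"  # hypothetical file name
    toy.to_csv(path, encoding="utf-8")
    # index_col=0 restores the saved index instead of adding an "Unnamed: 0" column
    reloaded = pd.read_csv(path, index_col=0, encoding="utf-8")

print(reloaded)
```

Reading with `index_col=0` matches how the raw files are loaded at the top of the notebook.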