<a href="https://colab.research.google.com/github/fralfaro/MAT281/blob/main/docs/labs/lab_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# MAT281 - Lab No. 03





**Objective**: Apply advanced pandas data manipulation and analysis techniques to a real dataset of Netflix content, reinforcing good practices and efficient methods without resorting to `groupby`, `merge`, `pivot`, or `join`.



**Dataset**:

We will work with the file `netflix_titles.csv`, which contains information on the titles available on Netflix up to 2021.

| Variable       | Type      | Description                                                                  |
|----------------|-----------|------------------------------------------------------------------------------|
| show_id        | string    | Unique identifier of the title in the Netflix catalog.                       |
| type           | string    | Type of content: 'Movie' or 'TV Show'.                                       |
| title          | string    | Title of the content.                                                        |
| director       | string    | Name of the director (may be null).                                          |
| cast           | string    | List of main actors (may be null).                                           |
| country        | string    | Country or countries where the content was produced.                         |
| date_added     | date      | Date on which the title was added to the Netflix catalog.                    |
| release_year   | integer   | Original release year of the title.                                          |
| rating         | string    | Age rating (for example: 'PG-13', 'TV-MA').                                  |
| duration       | string    | Length of the content (minutes for movies, number of seasons for series).    |
| listed_in      | string    | Categories or genres under which the content is classified.                  |
| description    | string    | Brief synopsis of the content.                                               |




In [1]:
import pandas as pd

# Load the data
df = pd.read_csv('https://raw.githubusercontent.com/fralfaro/MAT281/main/docs/labs/data/netflix_titles.csv')
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...



### Part 1: Cleaning and preparation

1. Review and describe the dataset:

   * How many rows and columns does it have?
   * What data types are present?
   * How many null values are there per column?

2. Convert the `date_added` column to a date type.

3. Create auxiliary columns with `assign`:

   * Year (`year_added`)
   * Month (`month_added`)



Number of rows and columns:

In [2]:
print(f'Rows: {df.shape[0]}\nColumns: {df.shape[1]}')

Rows: 8807
Columns: 12


Data types present:

In [3]:
print(df.dtypes)

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


Null values per column:

In [4]:
print(df.isnull().sum())

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
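
Raw counts are easier to compare when expressed as a share of rows. The following is a minimal sketch on a small made-up frame (the `toy` data and the `null_share` name are illustrative assumptions, not part of the lab dataset):

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up for illustration).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["PG", "R", None, "TV-MA"],
})

# isna() yields a boolean frame; the mean of each column is its fraction of missing values.
null_share = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(null_share)
```

Applied to `df`, the same expression would rank the columns by percentage of missing values.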


Convert the `date_added` column to a date type:

In [16]:
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


Create auxiliary columns with `assign`:

In [18]:
df = df.assign(
    year_added = df["date_added"].dt.year.astype("Int64"),
    month_added = df["date_added"].dt.month.astype("Int64")
)
print(df[['date_added', 'year_added', 'month_added']].head())

  date_added  year_added  month_added
0 2021-09-25        2021            9
1 2021-09-24        2021            9
2 2021-09-24        2021            9
3 2021-09-24        2021            9
4 2021-09-24        2021            9
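
With `errors='coerce'`, any string that cannot be parsed becomes `NaT` silently, which is why the year and month columns are cast to the nullable `Int64` dtype. The sketch below illustrates this on made-up strings; the `.str.strip()` step is an assumption added as a precaution against stray whitespace, not something the lab's cell does:

```python
import pandas as pd

# Made-up inputs: two valid dates (one with a leading space), one invalid string, one missing value.
raw = pd.Series(["September 25, 2021", " August 4, 2017", "not a date", None])

# Strip whitespace first (assumed precaution), then coerce unparseable values to NaT.
parsed = pd.to_datetime(raw.str.strip(), errors="coerce")

# .dt.year is float when NaT is present; Int64 keeps integers and represents gaps as <NA>.
year = parsed.dt.year.astype("Int64")

print(parsed)
print(year)
```

Comparing `parsed.isna().sum()` with the null count of the raw column is a quick way to see how many values were lost to coercion.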


### Part 2: Advanced pandas techniques

4. Use `.loc` to select movies (`type == 'Movie'`) that were added after 2018.

In [20]:
movies_after_2018 = df.loc[(df['type'] == 'Movie') & (df['year_added'] > 2018)]
display(movies_after_2018.head())

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2021,9
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,2021-09-24,2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...,2021,9
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...",2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s...",2021,9
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,2021-09-24,2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...,2021,9
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...","Germany, Czech Republic",2021-09-23,2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...,2021,9
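
Because `year_added` is a nullable `Int64` column, the comparison `> 2018` produces `<NA>` for rows without a date. A minimal sketch, on made-up data, of making that handling explicit before passing the mask to `.loc` (the `toy` frame and variable names are illustrative):

```python
import pandas as pd

# Toy frame: the last movie has no year_added, mimicking a row whose date failed to parse.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "year_added": pd.array([2021, 2020, 2017, None], dtype="Int64"),
})

is_movie = toy["type"] == "Movie"
# Comparing a nullable integer yields <NA> for missing years; treat those as False explicitly.
added_after_2018 = (toy["year_added"] > 2018).fillna(False)

recent_movies = toy.loc[is_movie & added_after_2018, ["type", "year_added"]]
print(recent_movies)
```

Naming each condition separately also keeps longer filters readable.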


5. Use `str.contains()` and `str.extract()`:

   * Filter titles that contain the word 'love' (case-insensitive).
   * Extract the duration in minutes for movies from the `duration` column.


In [23]:
titles_with_love = df[df['title'].str.contains('love', case=False, na=False)]
display(titles_with_love.head())

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added
25,s26,TV Show,Love on the Spectrum,,Brooke Satchwell,Australia,2021-09-21,2021,TV-14,2 Seasons,"Docuseries, International TV Shows, Reality TV",Finding love can be hard for anyone. For young...,2021,9
158,s159,Movie,Love Don't Cost a Thing,Troy Byer,"Nick Cannon, Christina Milian, Kenan Thompson,...",United States,2021-09-01,2003,PG-13,101 min,"Comedies, Romantic Movies",A nerdy teen tries to make himself cool by ass...,2021,9
159,s160,Movie,Love in a Puff,Pang Ho-cheung,"Miriam Chin Wah Yeung, Shawn Yue, Singh Hartih...",Hong Kong,2021-09-01,2010,TV-MA,103 min,"Comedies, Dramas, International Movies",When the Hong Kong government enacts a ban on ...,2021,9
206,s207,Movie,"LSD: Love, Sex Aur Dhokha",Dibakar Banerjee,"Nushrat Bharucha, Anshuman Jha, Neha Chauhan, ...",India,2021-08-27,2010,TV-MA,112 min,"Dramas, Independent Movies, International Movies",This provocative drama examines how the voyeur...,2021,8
227,s228,Movie,Really Love,Angel Kristi Williams,"Kofi Siriboe, Yootha Wong-Loi-Sing, Michael Ea...",United States,2021-08-25,2020,TV-MA,95 min,"Dramas, Independent Movies, Romantic Movies",A rising Black painter tries to break into a c...,2021,8


In [30]:
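# Keep only the movies (copy to avoid modifying a view of df)
movies_df = df[df['type'] == 'Movie'].copy()

# Extract the duration in minutes; expand=False returns a Series, and rows without "<digits> min" become NaN
movies_df['duration_minutes'] = movies_df['duration'].str.extract(r'(\d+)\smin', expand=False).astype(float)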

display(movies_df[['title','duration']].head())

Unnamed: 0,title,duration
0,Dick Johnson Is Dead,90 min
6,My Little Pony: A New Generation,91 min
7,Sankofa,125 min
9,The Starling,104 min
12,Je Suis Karl,127 min


6. Apply `explode()` to the `listed_in` column to obtain one row per genre.

In [31]:
# Split the 'listed_in' column into lists, then expand each genre into its own row
df_exploded=df.assign(listed_in=df['listed_in'].str.split(', ')).explode('listed_in')
display(df_exploded.head())

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2021,9
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,International TV Shows,"After crossing paths at a party, a Cape Town t...",2021,9
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,TV Dramas,"After crossing paths at a party, a Cape Town t...",2021,9
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,TV Mysteries,"After crossing paths at a party, a Cape Town t...",2021,9
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,1 Season,Crime TV Shows,To protect his family from a powerful drug lor...,2021,9
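
As the output shows, `explode()` repeats the original index label for every genre of a title. A minimal sketch on two made-up rows, including an optional `reset_index` for cases where unique row labels are needed:

```python
import pandas as pd

# Two illustrative rows; the genre strings are shortened for the example.
toy = pd.DataFrame({
    "title": ["Blood & Water", "Dick Johnson Is Dead"],
    "listed_in": ["International TV Shows, TV Dramas", "Documentaries"],
})

# Split on ", " to get lists, then emit one row per list element.
exploded = (
    toy.assign(listed_in=toy["listed_in"].str.split(", "))
       .explode("listed_in")
)

# The index labels repeat per source row; reset_index(drop=True) makes them unique again.
exploded_reset = exploded.reset_index(drop=True)

print(exploded)
print(exploded_reset.index.tolist())
```

Keeping the repeated index is useful when the exploded rows later need to be traced back to their source title.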


7. Obtain the top 10 most frequent genres using `value_counts()`.

In [33]:
df_exploded['listed_in'].value_counts().head(10)

listed_in,count
International Movies,2752
Dramas,2427
Comedies,1674
International TV Shows,1351
Documentaries,869
Action & Adventure,859
TV Dramas,763
Independent Movies,756
Children & Family Movies,641
Romantic Movies,616


8. Use `where()` and `mask()` to flag movies longer than 120 minutes as long content in a new column.

In [36]:
movies_df['is_long_movie'] = movies_df['duration_minutes'].where(
    movies_df['duration_minutes'] > 120,  # where() keeps the value when the duration exceeds 120 minutes
    other='película corta'  # label ("short movie") used where the condition is False
).mask(
    movies_df['duration_minutes'] > 120,  # mask() replaces the value when the duration exceeds 120 minutes
    other='película larga'  # label ("long movie") used where the condition is True
)

display(movies_df[['title', 'duration_minutes', 'is_long_movie']].head())

Unnamed: 0,title,duration_minutes,is_long_movie
0,Dick Johnson Is Dead,90.0,película corta
6,My Little Pony: A New Generation,91.0,película corta
7,Sankofa,125.0,película larga
9,The Starling,104.0,película corta
12,Je Suis Karl,127.0,película larga
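
`where()` and `mask()` are mirror images: `where` keeps values where the condition is True and replaces the rest, while `mask` replaces values where the condition is True. A small sketch on made-up durations (the sentinel `-1` and the English labels are illustrative assumptions):

```python
import pandas as pd

minutes = pd.Series([90.0, 125.0, 120.0, None])
is_long = minutes > 120  # a missing duration compares as False

# where: keep the value where the condition is True, otherwise use `other`.
kept_long = minutes.where(is_long, other=-1)

# mask: replace the value where the condition is True, otherwise keep it.
hidden_long = minutes.mask(is_long, other=-1)

# A single where() on a constant Series is enough to build a two-way label.
labels = pd.Series("long", index=minutes.index).where(is_long, "short")

print(kept_long.tolist())
print(hidden_long.tolist())
print(labels.tolist())
```

Note that a movie of exactly 120 minutes, or one with a missing duration, ends up labelled as short under this condition.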


9. Use `.loc` to filter movies that meet all of the following:

   * More than 100 minutes long.
   * Rating equal to `'R'`.
   * Country equal to `'United States'`.

In [37]:
filtered_movies = movies_df.loc[
    (movies_df['duration_minutes'] > 100) &
    (movies_df['rating'] == 'R') &
    (movies_df['country'] == 'United States')
]

display(filtered_movies.head())

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,duration_minutes,is_long_movie
48,s49,Movie,Training Day,Antoine Fuqua,"Denzel Washington, Ethan Hawke, Scott Glenn, T...",United States,2021-09-16,2001,R,122 min,"Dramas, Thrillers",A rookie cop with one day to prove himself to ...,2021,9,122.0,película larga
81,s82,Movie,Kate,Cedric Nicolas-Troyan,"Mary Elizabeth Winstead, Jun Kunimura, Woody H...",United States,2021-09-10,2021,R,106 min,Action & Adventure,"Slipped a fatal poison on her final job, a rut...",2021,9,106.0,película corta
131,s132,Movie,Blade Runner: The Final Cut,Ridley Scott,"Harrison Ford, Rutger Hauer, Sean Young, Edwar...",United States,2021-09-01,1982,R,117 min,"Action & Adventure, Classic Movies, Cult Movies","In a smog-choked dystopian Los Angeles, blade ...",2021,9,117.0,película corta
139,s140,Movie,Do the Right Thing,Spike Lee,"Danny Aiello, Ossie Davis, Ruby Dee, Richard E...",United States,2021-09-01,1989,R,120 min,"Classic Movies, Comedies, Dramas","On a sweltering day in Brooklyn, simmering rac...",2021,9,120.0,película corta
144,s145,Movie,House Party,Reginald Hudlin,"Christopher Reid, Christopher Martin, Robin Ha...",United States,2021-09-01,1990,R,104 min,"Comedies, Cult Movies","Grounded by his strict father, Kid risks life ...",2021,9,104.0,película corta


10. Use `.style` to visually format the top 10 longest movies.

In [39]:
# Sort the movies by duration in descending order and keep the first 10
top_10_longest_movies = movies_df.sort_values(by='duration_minutes', ascending=False).head(10)

# Apply a color gradient with style to highlight differences in duration
styled_top_10 = top_10_longest_movies.style.background_gradient(subset=['duration_minutes'], cmap='Blues')
display(styled_top_10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,duration_minutes,is_long_movie
4253,s4254,Movie,Black Mirror: Bandersnatch,,"Fionn Whitehead, Will Poulter, Craig Parkinson, Alice Lowe, Asim Chaudhry",United States,2018-12-28 00:00:00,2018,TV-MA,312 min,"Dramas, International Movies, Sci-Fi & Fantasy","In 1984, a young programmer begins to question reality as he adapts a dark fantasy novel into a video game. A mind-bending tale with multiple endings.",2018,12,312.0,película larga
717,s718,Movie,Headspace: Unwind Your Mind,,"Andy Puddicombe, Evelyn Lewis Prieto, Ginger Daniels, Darren Pettie, Simon Prebble, Rhiannon Mcgavin, Kate Seftel",,2021-06-15 00:00:00,2021,TV-G,273 min,Documentaries,"Do you want to relax, meditate or sleep deeply? Personalize the experience according to your mood or mindset with this Headspace interactive special.",2021,6,273.0,película larga
2491,s2492,Movie,The School of Mischief,Houssam El-Din Mustafa,"Suhair El-Babili, Adel Emam, Saeed Saleh, Younes Shalabi, Hadi El-Gayyar, Ahmad Zaki, Hassan Moustafa",Egypt,2020-05-21 00:00:00,1973,TV-14,253 min,"Comedies, Dramas, International Movies",A high school teacher volunteers to transform five notorious misfits into model students — and has unintended results.,2020,5,253.0,película larga
2487,s2488,Movie,No Longer kids,Samir Al Asfory,"Said Saleh, Hassan Moustafa, Ahmed Zaki, Younes Shalabi, Nadia Shukri, Karima Mokhtar",Egypt,2020-05-21 00:00:00,1979,TV-14,237 min,"Comedies, Dramas, International Movies","Hoping to prevent their father from skipping town with his mistress, four rowdy siblings resort to absurd measures to stop him.",2020,5,237.0,película larga
2484,s2485,Movie,Lock Your Girls In,Fouad El-Mohandes,"Fouad El-Mohandes, Sanaa Younes, Sherihan, Ahmed Rateb, Ijlal Zaki, Zakariya Mowafi",,2020-05-21 00:00:00,1982,TV-PG,233 min,"Comedies, International Movies, Romantic Movies",A widower believes he must marry off his three problematic daughters before he can pursue his real goal of marrying his secret love.,2020,5,233.0,película larga
2488,s2489,Movie,Raya and Sakina,Hussein Kamal,"Suhair El-Babili, Shadia, Abdel Moneim Madbouly, Ahmed Bedir",,2020-05-21 00:00:00,1984,TV-14,230 min,"Comedies, Dramas, International Movies","When robberies and murders targeting women sweep early 20th-century Egypt, the hunt for suspects leads to two shadowy sisters. Based on a true story.",2020,5,230.0,película larga
166,s167,Movie,Once Upon a Time in America,Sergio Leone,"Robert De Niro, James Woods, Elizabeth McGovern, Treat Williams, Tuesday Weld, Burt Young, Joe Pesci, Danny Aiello, William Forsythe, James Hayden","Italy, United States",2021-09-01 00:00:00,1984,R,229 min,"Classic Movies, Dramas",Director Sergio Leone's sprawling crime epic follows a group of Jewish mobsters who rise in the ranks of organized crime in 1920s New York City.,2021,9,229.0,película larga
7932,s7933,Movie,Sangam,Raj Kapoor,"Raj Kapoor, Vyjayanthimala, Rajendra Kumar, Lalita Pawar, Achala Sachdev, Hari Shivdasani, Raj Mehra, Iftekhar",India,2019-12-31 00:00:00,1964,TV-14,228 min,"Classic Movies, Dramas, International Movies","Returning home from war after being assumed dead, a pilot weds the woman he has long loved, unaware that she had been planning to marry his best friend.",2019,12,228.0,película larga
1019,s1020,Movie,Lagaan,Ashutosh Gowariker,"Aamir Khan, Gracy Singh, Rachel Shelley, Paul Blackthorne, Kulbhushan Kharbanda, Raghuvir Yadav, Yashpal Sharma, Rajendranath Zutshi, Rajesh Vivek, Aditya Lakhia","India, United Kingdom",2021-04-17 00:00:00,2001,PG,224 min,"Dramas, International Movies, Music & Musicals","In 1890s India, an arrogant British commander challenges the harshly taxed residents of Champaner to a high-stakes cricket match.",2021,4,224.0,película larga
4573,s4574,Movie,Jodhaa Akbar,Ashutosh Gowariker,"Hrithik Roshan, Aishwarya Rai Bachchan, Sonu Sood, Poonam Sinha, Suhasini Mulay, Ila Arun, Raza Murad, Kulbhushan Kharbanda, Abeer Abrar",India,2018-10-01 00:00:00,2008,TV-14,214 min,"Action & Adventure, Dramas, International Movies","In 16th-century India, what begins as a strategic alliance between a Mughal emperor and a Hindu princess becomes a genuine opportunity for true love.",2018,10,214.0,película larga


### Challenge question

11. What are the most frequent combinations of genre and rating in the dataset?
    (Hint: use `value_counts` with `subset=["genre", "rating"]` after applying `explode()`; in this dataset the genre column is `listed_in`.)

In [41]:
# Count how often each (genre, rating) combination occurs
genre_rating_counts = df_exploded.value_counts(subset=['listed_in', 'rating'])
display(genre_rating_counts.head(10))

listed_in,rating,count
International Movies,TV-MA,1130
International Movies,TV-14,1065
Dramas,TV-MA,830
International TV Shows,TV-MA,714
Dramas,TV-14,693
International TV Shows,TV-14,472
Comedies,TV-14,465
TV Dramas,TV-MA,434
Comedies,TV-MA,431
Dramas,R,375
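
`DataFrame.value_counts` with `subset` counts distinct combinations of the listed columns and, by default, skips rows where any of them is missing. A minimal sketch on made-up data, also showing `normalize=True` to get shares instead of counts:

```python
import pandas as pd

# Toy exploded frame; the last row has no rating and is dropped from the counts by default.
toy = pd.DataFrame({
    "listed_in": ["Dramas", "Dramas", "Comedies", "Dramas", "Comedies"],
    "rating": ["TV-MA", "TV-MA", "TV-14", "R", None],
})

# Count each (genre, rating) pair, sorted from most to least frequent.
pair_counts = toy.value_counts(subset=["listed_in", "rating"])

# normalize=True returns each pair's share of the counted rows.
pair_share = toy.value_counts(subset=["listed_in", "rating"], normalize=True)

print(pair_counts)
print(pair_share)
```

The result is a Series indexed by a two-level `MultiIndex`, so a single pair can be looked up with a tuple such as `pair_counts[("Dramas", "TV-MA")]`.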


### Bonus: Duplicate analysis and cleaning

12. Are there movies with the same name (`title`) but a different release year (`release_year`)?

In [43]:
# Keep one row per (title, release_year) pair, then count how many distinct years each title has.
# drop_duplicates + value_counts is used instead of groupby, in line with the lab's constraints.
title_release_year_counts = df[['title', 'release_year']].drop_duplicates()['title'].value_counts()

# Keep the titles that have more than one distinct release year.
titles_with_multiple_release_years = title_release_year_counts[title_release_year_counts > 1]

if not titles_with_multiple_release_years.empty:
    print("Titles with the same name but a different release year:")
    display(titles_with_multiple_release_years)
else:
    print("No titles share a name with a different release year.")

No titles share a name with a different release year.
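
Since the lab asks to avoid `groupby`, one way to list the offending rows themselves is to combine `drop_duplicates` with `duplicated(keep=False)`. The sketch below uses made-up titles (the `toy` frame and the `remade` name are illustrative assumptions):

```python
import pandas as pd

# Toy data: "Alpha" appears with two different years, "Beta" is an exact duplicate.
toy = pd.DataFrame({
    "title": ["Alpha", "Alpha", "Beta", "Beta", "Gamma"],
    "release_year": [1999, 2015, 2001, 2001, 2020],
})

# One row per distinct (title, release_year) pair; exact duplicates collapse here.
pairs = toy[["title", "release_year"]].drop_duplicates()

# A title still appearing more than once must have more than one release year.
remade = pairs[pairs["title"].duplicated(keep=False)]
print(remade)
```

On the lab's `df`, an empty `remade` frame would correspond to the message printed above.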


13. How many unique titles are there in total in the `title` column?

In [44]:
unique_titles_count = df['title'].nunique()
print(f"Total number of unique titles: {unique_titles_count}")

Total number of unique titles: 8807
