# Data Visualization
- Basic concepts
- Using Plotly
- Basic charts
- Correlation charts
- Heatmaps

# Basic Principles (Theory)
- Decide what question the charts should answer
- Provide context
- Keep a visual hierarchy
- Use a color code
- Focus on key areas
- Simple charts do more than complex ones
- Enable comparisons so differences are easy to see

## Visualization Guidelines
- Be honest
- Lend a helping hand
- Delight users
- Give clarity of focus
- Embrace scale
- Provide structure

## Color Code
- Use color to create associations: profit, loss, environment, country, etc.
- Use different saturations for continuous data
- Use contrasting colors for comparisons
- Use color to emphasize important information
- Pick colors that are easy to tell apart
- Use few colors to avoid visual overload (7 at most)
- Keep accessibility in mind
- Remember that well-chosen color makes any visualization faster and easier to read

# Basic Charts
- Bar
- Line
- Pie
- Scatter plot (x, y)

## Importing Libraries and Loading the CSV

In [38]:
import pandas as pd
import numpy as np

In [39]:
df = pd.read_csv("../../../Archivos-Analisis/netflix_titles2.csv")

In [40]:
df.sample(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration_num,duration_unit
4649,s4650,Movie,The Resistance Banker,Joram Lürsen,"Barry Atsma, Jacob Derwig, Pierre Bokma, Jaap ...",Netherlands,"September 11, 2018",2018,TV-MA,123 min,"Dramas, International Movies","Risking his family and future, a banker in occ...",123.0,min
6484,s6485,Movie,Chupan Chupai,Mohsin Ali,"Ahsan Khan, Neelam Muneer, Faizan Khawaja, Ali...",Pakistan,"January 15, 2019",2017,TV-14,118 min,"Comedies, International Movies","The lives of five hapless, petty criminals car...",118.0,min
6962,s6963,Movie,Hero,Corey Yuen,"Takeshi Kaneshiro, Yuen Biao, Valerie Chow, Je...","Hong Kong, China","August 1, 2018",1997,TV-MA,89 min,"Action & Adventure, International Movies",A pugilist from Shantung struggles to rise to ...,89.0,min
6302,s6303,Movie,Big Bear,Joey Kern,"Joey Kern, Adam Brody, Zachary Knighton, Tyler...",United States,"February 28, 2018",2017,TV-MA,87 min,Comedies,The alcohol-fueled high jinks of a bachelor pa...,87.0,min
5117,s5118,Movie,Russell Howard: Recalibrate,Peter Orton,Russell Howard,United Kingdom,"December 19, 2017",2017,TV-MA,69 min,Stand-Up Comedy,Self-deprecating comic Russell Howard plows ah...,69.0,min
993,s994,TV Show,Shadow and Bone,,"Jessie Mei Li, Archie Renaux, Ben Barnes, Fred...",United States,"April 23, 2021",2021,TV-14,1 Season,"TV Action & Adventure, TV Dramas, TV Sci-Fi & ...",Dark forces conspire against orphan mapmaker A...,1.0,season
5482,s5483,Movie,Kabhi Haan Kabhi Naa,Kundan Shah,"Shah Rukh Khan, Suchitra Krishnamoorthi, Deepa...",India,"May 15, 2017",1994,TV-14,151 min,"Comedies, Dramas, International Movies",A dreamer falls for a girl who is in love with...,151.0,min
4149,s4150,TV Show,Paprika,,"Kaycie Chase, David Gasman, Tom Morton, Lee De...",,"January 31, 2019",2018,TV-Y,1 Season,Kids' TV,Stan and Olivia – the amazingly different Papr...,1.0,season
3165,s3166,TV Show,Astronomy Club: The Sketch Show,,"Shawtane Bowen, Jonathan Braylock, Ray Cordova...",United States,"December 6, 2019",2019,TV-MA,1 Season,TV Comedies,With unique individual perspectives that conve...,1.0,season
7283,s7284,Movie,Legend of the Naga Pearls,Yang Lei,"Darren Wang, Zhang Tianai, Sheng Guansen, Simo...",China,"March 30, 2018",2017,TV-MA,108 min,"Action & Adventure, International Movies, Sci-...",A petty thief teams up with two bickering acco...,108.0,min


## Importing the Plotly Library

In [41]:
import plotly.express as px

In [42]:
df_movies_year = df.groupby('release_year').size().rename('movies').reset_index()
df_movies_year
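
The `groupby(...).size().rename(...).reset_index()` chain produces one row per year with a count column. A minimal sketch on a hypothetical toy frame (`toy` is not the Netflix data) shows the pattern next to an equivalent `value_counts` formulation:

```python
import pandas as pd

# Hypothetical toy frame standing in for the Netflix catalog.
toy = pd.DataFrame({"release_year": [2019, 2020, 2020, 2021, 2021, 2021]})

# Same pattern as in the notebook: one row per year with the number of titles.
by_groupby = toy.groupby("release_year").size().rename("movies").reset_index()

# Equivalent result via value_counts, re-sorted by year instead of by count.
by_value_counts = (
    toy["release_year"].value_counts().sort_index()
    .rename_axis("release_year").reset_index(name="movies")
)
print(by_groupby)
```

Both frames hold the same counts; the `groupby` form extends more naturally to several grouping columns.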

Unnamed: 0,release_year,movies
0,1925,1
1,1942,2
2,1943,3
3,1944,3
4,1945,4
...,...,...
69,2017,1032
70,2018,1147
71,2019,1030
72,2020,953


## Bar Chart

In [43]:
# TV shows and movies per year
fig = px.bar(df_movies_year, x='release_year', y='movies', title='TV Shows and Movies per Year')
fig.show()

In [44]:
# TV shows and movies per year, compared by type
# First build a DF grouped by year (release_year) and type (type)
df_release_type_year = df.groupby(['release_year', 'type']).size().rename('movies').reset_index()
df_release_type_year

Unnamed: 0,release_year,type,movies
0,1925,TV Show,1
1,1942,Movie,2
2,1943,Movie,3
3,1944,Movie,3
4,1945,Movie,3
...,...,...,...
114,2019,TV Show,397
115,2020,Movie,517
116,2020,TV Show,436
117,2021,Movie,277


In [45]:
# Plot it (bars are stacked by default)
fig = px.bar(df_release_type_year, x='release_year', y='movies', title='TV Shows and Movies per Year', color='type')
fig.show()

In [46]:
# barmode='group' places the bars side by side instead of stacking them
fig = px.bar(df_release_type_year, x='release_year', y='movies', title='TV Shows and Movies per Year', color='type', barmode='group')
fig.show()


In [47]:
# Zoom in from 2000 to the present: drop every row released before 2000
index_pre_2000 = df_release_type_year[(df_release_type_year['release_year'] < 2000)].index
df_release_type_year.drop(index_pre_2000, inplace=True)
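
Dropping rows in place changes the frame for every later cell. A non-mutating alternative is a boolean filter that returns a new frame; the sketch below uses a hypothetical toy frame `counts` and is only one possible way to do it:

```python
import pandas as pd

# Hypothetical toy counts per year and type.
counts = pd.DataFrame({
    "release_year": [1998, 1999, 2000, 2001],
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "movies": [3, 5, 2, 7],
})

# Keep rows from 2000 onward in a new frame; `counts` itself is untouched.
counts_2000 = counts[counts["release_year"] >= 2000].reset_index(drop=True)
```

With this approach the full-range frame stays available for charts that need all years.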

In [48]:
df_release_type_year.sample(3)

Unnamed: 0,release_year,type,movies
89,2007,Movie,74
78,2001,TV Show,5
97,2011,Movie,145


In [49]:
fig = px.bar(df_release_type_year, x='release_year', y='movies', title='TV Shows and Movies per Year (2000 - 2021)', color='type', barmode='group', text='movies')
fig.show()

## Line Chart

In [50]:
fig = px.line(df_movies_year, x='release_year', y='movies', title='TV Shows and Movies per Year')
fig.show()

In [51]:
fig = px.line(df_release_type_year, x='release_year', y='movies', title='TV Shows and Movies per Year (2000 - 2021)', color='type')
fig.show()

In [52]:
# TV shows and movies per year, broken down by rating
df_release_rating = df.groupby(['release_year', 'rating']).size().rename('movies').reset_index()
df_release_rating

Unnamed: 0,release_year,rating,movies
0,1925,TV-14,1
1,1942,TV-14,2
2,1943,TV-PG,3
3,1944,TV-14,2
4,1944,TV-PG,1
...,...,...,...
435,2021,TV-G,21
436,2021,TV-MA,270
437,2021,TV-PG,45
438,2021,TV-Y,26


In [53]:
fig = px.line(df_release_rating, x='release_year', y='movies', color='rating', title='TV Shows and Movies per Year and Rating')
fig.show()

In [54]:
# Keep only rows from 2000 onward
index_pre_2000 = df_release_rating[(df_release_rating['release_year'] < 2000)].index
df_release_rating.drop(index_pre_2000, inplace=True)

In [55]:
# Rating cleanup: find rows where a duration ended up in the rating column
ix66 = df_release_rating[df_release_rating['rating'] == '66 min'].index
ix74 = df_release_rating[df_release_rating['rating'] == '74 min'].index
ix84 = df_release_rating[df_release_rating['rating'] == '84 min'].index

In [56]:
# Drop those rows
df_release_rating.drop(ix66, inplace=True)
df_release_rating.drop(ix74, inplace=True)
df_release_rating.drop(ix84, inplace=True)
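
The three malformed ratings can also be caught with a single pattern instead of one lookup per value. The following is a sketch on a hypothetical toy frame (`toy_ratings`), assuming every bad value has the form "digits, space, min":

```python
import pandas as pd

# Hypothetical toy frame with two duration strings stored as ratings.
toy_ratings = pd.DataFrame({
    "release_year": [2010, 2010, 2015, 2017],
    "rating": ["TV-MA", "74 min", "PG", "66 min"],
    "movies": [12, 1, 4, 1],
})

# True where the "rating" looks like a duration; na=False treats missing ratings as valid rows.
is_duration = toy_ratings["rating"].str.fullmatch(r"\d+ min", na=False)

# Keep only the rows with a real rating.
clean_ratings = toy_ratings[~is_duration]
```

This avoids hard-coding the specific minute values, at the cost of relying on the assumed pattern.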

In [57]:
df_release_rating.head(6)

Unnamed: 0,release_year,rating,movies
217,2000,G,2
218,2000,PG,4
219,2000,PG-13,10
220,2000,R,8
221,2000,TV-14,5
222,2000,TV-MA,1


In [58]:
df_release_rating.groupby('rating').count()

rating,release_year,movies
G,13,13
NC-17,3,3
NR,12,12
PG,22,22
PG-13,22,22
R,22,22
TV-14,22,22
TV-G,15,15
TV-MA,22,22
TV-PG,22,22


In [59]:
fig = px.bar(df_release_rating, x='release_year', y='movies', color='rating', title='TV Shows and Movies per Year and Rating')
fig.show()

Going back to the theory for a moment, the question this chart answers is: how many TV shows and movies are there per rating, per year?

## Pie Chart
- Which kinds of titles (by rating) are included, in total?

In [60]:
# Rating cleanup: in these three rows the duration was stored in the rating column
df.at[5541, 'duration_unit'] = 'min'
df.at[5794, 'duration_unit'] = 'min'
df.at[5813, 'duration_unit'] = 'min'

df.at[5541, 'duration_num'] = 74
df.at[5794, 'duration_num'] = 84
df.at[5813, 'duration_num'] = 66

df.at[5541, 'duration'] = '74 min'
df.at[5794, 'duration'] = '84 min'
df.at[5813, 'duration'] = '66 min'

# Assumption: the missing rating was 'G'
df.at[5541, 'rating'] = 'G'
df.at[5794, 'rating'] = 'G'
df.at[5813, 'rating'] = 'G'
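
The same repair can be written without hard-coded row labels. This is a sketch on a hypothetical toy frame, assuming the affected rows have an empty `duration` and a rating of the form "digits, space, min"; the fallback rating 'G' is the notebook's own assumption, not a known fact about those titles:

```python
import pandas as pd

# Hypothetical toy frame: row 1 has its duration stored in the rating column.
toy = pd.DataFrame({
    "rating": ["TV-MA", "74 min", "PG"],
    "duration": ["90 min", None, "101 min"],
    "duration_num": [90.0, None, 101.0],
    "duration_unit": ["min", None, "min"],
})

# Rows where the duration text ended up in the rating column.
misplaced = toy["rating"].str.fullmatch(r"\d+ min", na=False)

# Move the text to `duration` and rebuild the derived columns from it.
toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]
toy.loc[misplaced, "duration_num"] = (
    toy.loc[misplaced, "duration"].str.extract(r"(\d+)", expand=False).astype(float)
)
toy.loc[misplaced, "duration_unit"] = "min"

# Assumption carried over from the notebook: the real rating is unknown, so 'G' is used.
toy.loc[misplaced, "rating"] = "G"
```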


In [61]:
# Number of titles per rating
df_release_rating_total = df.groupby(['rating']).size().rename('movies').reset_index()
df_release_rating_total

Unnamed: 0,rating,movies
0,G,44
1,NC-17,3
2,NR,80
3,PG,287
4,PG-13,490
5,R,799
6,TV-14,2160
7,TV-G,220
8,TV-MA,3207
9,TV-PG,863


In [62]:
fig = px.pie(df_release_rating_total, names='rating', values='movies', title='TV Shows and Movies by Rating')
fig.show()

## Correlation Chart
- Relationship between variables
- Question: is the length of a title related to its release year?

In [63]:
# Add a new column with the number of words in the title
df['num_words_title'] = df['title'].str.split().str.len() # Split each title into words and count them
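
To see what the word count does, here is a small standalone sketch on three titles taken from the first rows of the dataset:

```python
import pandas as pd

titles = pd.Series(["Dick Johnson Is Dead", "Blood & Water", "Ganglands"])

# split() with no argument splits on any run of whitespace;
# .str.len() then returns the length of each resulting list.
num_words = titles.str.split().str.len()
print(num_words.tolist())  # [4, 3, 1]
```

Note that a standalone symbol such as "&" counts as a word under this definition.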

In [64]:
df.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration_num,duration_unit,num_words_title
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",90.0,min,4
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2.0,season,3
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,1.0,season,1
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",1.0,season,3
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,2.0,season,2


In [65]:
# Check that the new column is numeric: numeric columns appear with summary statistics in describe()
df.describe()

Unnamed: 0,release_year,duration_num,num_words_title
count,8807.0,8807.0,8807.0
mean,2014.180198,69.84853,3.110821
std,8.819312,50.806431,1.862154
min,1925.0,1.0,1.0
25%,2013.0,2.0,2.0
50%,2017.0,88.0,3.0
75%,2019.0,106.0,4.0
max,2021.0,312.0,17.0


In [66]:
# Look at the two columns used in the correlation analysis
df[['release_year', 'num_words_title']].sample(5)

Unnamed: 0,release_year,num_words_title
4767,2018,2
7502,2012,3
7269,2019,2
5121,2017,2
8033,2016,3


In [67]:
# Correlation matrix
# Compute the correlation between the numeric variables of the DF.
# numeric_only=True skips text columns such as show_id, which otherwise raise a ValueError in recent pandas versions
df.corr(numeric_only=True)

# Note: duration_num holds both seasons (TV shows) and minutes (movies), so it has to be split by type for the analysis

In [68]:
# This indicates that title length has stayed about the same over time,
# while duration has been getting shorter.
# Keep in mind, though, that this mixes TV shows and movies
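
Because `duration_num` is measured in minutes for movies and in seasons for TV shows, a correlation computed over both at once is hard to interpret. One way to separate them is sketched below; the frame `toy` and its values are invented for illustration and are not results from the Netflix data:

```python
import pandas as pd

# Invented toy values: durations fall over time for movies and rise for TV shows.
toy = pd.DataFrame({
    "type": ["Movie"] * 4 + ["TV Show"] * 4,
    "release_year": [2000, 2005, 2010, 2015, 2000, 2005, 2010, 2015],
    "duration_num": [120.0, 110.0, 100.0, 90.0, 1.0, 2.0, 3.0, 4.0],
})

# Correlate release_year with duration_num separately for each type,
# so minutes (movies) and seasons (TV shows) are never mixed.
corr_by_type = {
    kind: group["release_year"].corr(group["duration_num"])
    for kind, group in toy.groupby("type")
}
print(corr_by_type)
```

In this toy case the two types have opposite trends, which a single pooled coefficient would hide.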

In [69]:
# Correlation heatmap (numeric columns only)
fig = px.imshow(df.corr(numeric_only=True), text_auto=True)
fig.show()

In [70]:
df_movies = df[df['type'] == 'Movie']

In [71]:
# Same heatmap restricted to movies, so duration_num is always in minutes
fig = px.imshow(df_movies.corr(numeric_only=True), text_auto=True)
fig.show()

## Statistical Charts

In [72]:
fig = px.histogram(df_movies, x='num_words_title', title='Distribution of Title Length', nbins=20)
fig.show()

In [73]:
fig = px.histogram(df_movies, x='duration_num', title='Distribution of Movie Duration', nbins=20)
fig.show()

In [74]:
# Cumulative distribution (ECDF) of title length
fig = px.ecdf(df, x='num_words_title')
fig.show()

In [75]:
# Cumulative distribution (ECDF) of movie duration
fig = px.ecdf(df_movies, x='duration_num')
fig.show()
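
An empirical cumulative distribution (ECDF) shows, for each value, the fraction of observations that are less than or equal to it. A minimal NumPy sketch of that computation, using a hypothetical sample `values`:

```python
import numpy as np

# Hypothetical sample of title lengths (in words).
values = np.array([1, 2, 2, 3, 5])

# Sort the sample; the curve steps up by 1/n at each sorted observation.
x = np.sort(values)
y = np.arange(1, len(x) + 1) / len(x)

# Fraction of observations less than or equal to 2.
share_le_2 = np.mean(values <= 2)
print(share_le_2)  # 0.6
```

Plotting `y` against `x` as a step line gives the same shape that `px.ecdf` draws for the full column.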

## BoxPlot

In [76]:
fig = px.box(df_movies, y='num_words_title', title='Box Plot of Title Length')
fig.show()

## Violin

In [77]:
fig = px.violin(df_movies, y='num_words_title', title='Violin Plot of Title Length')
fig.show()

## Strip

In [78]:
# Distribution of movie duration by rating
fig = px.strip(df_movies, x='duration_num', y='rating')
fig.show()

In [79]:
# Distribution of movie duration by release year
fig = px.strip(df_movies, x='duration_num', y='release_year')
fig.show()

In [80]:
# Zoom in on movies released after 1990
fig = px.strip(df_movies[df_movies['release_year'] > 1990], x='duration_num', y='release_year', color='release_year')
fig.show()

## Multi-panel Charts

In [85]:
# Facet charts
fig = px.strip(df_movies[df_movies['release_year'] > 2015], x='duration_num', y='num_words_title', facet_col='release_year')
fig.show()

In [87]:
fig = px.strip(df[df['release_year'] > 2015], x='duration_num', y='num_words_title', color='type', facet_col='release_year')
fig.show()

## Scatter Plot

In [88]:
fig = px.scatter(df_release_rating, x='release_year', y='movies')
fig.show()

In [90]:
# Number of words in the title per year
fig = px.scatter(df, x='release_year', y='num_words_title', facet_col='type')
fig.show()

## Heatmap

In [91]:
px.density_heatmap(df, x='release_year', y='type')

In [92]:
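# Create a new DF that will hold only titles from 2000 onward.
# copy() is needed: plain assignment would make df_2000 another name for df, and the in-place drop below would then alter df too
df_2000 = df.copy()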

In [94]:
# Find the rows released before 2000 and drop them
index_2000 = df_2000[(df_2000['release_year'] < 2000)].index
df_2000.drop(index_2000, inplace=True)
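
Assigning a DataFrame to a new name does not copy it, which matters whenever `drop(..., inplace=True)` follows. The sketch below, on a hypothetical toy frame, contrasts an alias with an explicit copy:

```python
import pandas as pd

original = pd.DataFrame({"release_year": [1995, 2005, 2015]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

# Dropping in place through the alias also changes `original`.
alias.drop(alias[alias["release_year"] < 2000].index, inplace=True)

print(len(original), len(independent))  # 2 3
```

Only the copy keeps all three rows, so working on a copy is the safer choice when the full data is still needed later.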

In [95]:
px.density_heatmap(df_2000, x='release_year', y='type')

In [97]:
px.density_heatmap(df_2000, x='release_year', y='type', facet_row='type')
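
When no `z` column is given, `px.density_heatmap` colors each cell by the number of rows that fall into it. Assuming each year lands in its own bin, the same numbers can be inspected as a table with `pd.crosstab`; the frame `toy` below is hypothetical:

```python
import pandas as pd

# Hypothetical toy catalog.
toy = pd.DataFrame({
    "release_year": [2019, 2019, 2020, 2020, 2020],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
})

# Rows are types, columns are years, cells are the number of titles.
cell_counts = pd.crosstab(toy["type"], toy["release_year"])
print(cell_counts)
```

Checking the table first can help confirm that the heatmap's color scale reflects the counts you expect.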

## Contour Plot

In [98]:
fig = px.density_contour(df[df['release_year'] > 2000], x='release_year', y='type')
fig.show()

In [101]:
fig = px.density_contour(df_movies[df_movies['release_year'] > 2000], x='duration_num', y='release_year')
fig.show()
