# Netflix Original Films & IMDB Scores


In [7]:
import pandas as pd
import numpy as np


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

In [8]:
df = pd.read_csv('db/NetflixOriginals.csv')
df

Unnamed: 0,Title,Genre,Premiere,Runtime,IMDB Score,Language
0,Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese
1,Dark Forces,Thriller,"August 21, 2020",81,2.6,Spanish
2,The App,Science fiction/Drama,"December 26, 2019",79,2.6,Italian
3,The Open House,Horror thriller,"January 19, 2018",94,3.2,English
4,Kaali Khuhi,Mystery,"October 30, 2020",90,3.4,Hindi
...,...,...,...,...,...,...
579,Taylor Swift: Reputation Stadium Tour,Concert Film,"December 31, 2018",125,8.4,English
580,Winter on Fire: Ukraine's Fight for Freedom,Documentary,"October 9, 2015",91,8.4,English/Ukranian/Russian
581,Springsteen on Broadway,One-man show,"December 16, 2018",153,8.5,English
582,Emicida: AmarElo - It's All For Yesterday,Documentary,"December 8, 2020",89,8.6,Portuguese


The dataset contains 584 films and 6 columns.

In [10]:
df.shape

(584, 6)

It is rare to find data this clean, but having no missing values gives us a very good starting point.

In [11]:
df.isnull().sum()

Title         0
Genre         0
Premiere      0
Runtime       0
IMDB Score    0
Language      0
dtype: int64
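
When a dataset does have gaps, the same check can be turned into a small report. A minimal sketch, assuming a hypothetical `sample` frame with deliberately missing values (the Netflix data above has none):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberate gaps; not part of the original dataset
sample = pd.DataFrame({
    'Title': ['Film A', 'Film B', None],
    'Runtime': [58, np.nan, 90],
    'Language': ['English', 'Hindi', 'Spanish'],
})

# Absolute count and percentage of missing values per column
report = pd.DataFrame({
    'missing': sample.isnull().sum(),
    'missing_pct': (sample.isnull().mean() * 100).round(2),
})

# Keep only the columns that actually have gaps
report = report[report['missing'] > 0]
print(report)
```

Columns without missing values drop out of the report, so an empty result means the data is complete.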

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Title       584 non-null    object 
 1   Genre       584 non-null    object 
 2   Premiere    584 non-null    object 
 3   Runtime     584 non-null    int64  
 4   IMDB Score  584 non-null    float64
 5   Language    584 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB


The `Premiere` column must be converted to a datetime, because it is currently stored as an object (string).

In [30]:
df['date'] = pd.to_datetime(df['Premiere'])
df['date']

0     2019-08-05
1     2020-08-21
2     2019-12-26
3     2018-01-19
4     2020-10-30
         ...    
579   2018-12-31
580   2015-10-09
581   2018-12-16
582   2020-12-08
583   2020-10-04
Name: date, Length: 584, dtype: datetime64[ns]
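
The conversion above lets pandas infer the date format. A minimal sketch of a stricter variant, assuming every value follows the `Month day, year` pattern seen in the sample rows and an English locale; the `sample` frame and the explicit `format` string are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Hypothetical sample mirroring the 'Premiere' strings shown above,
# plus one deliberately malformed value
sample = pd.DataFrame({'Premiere': ['August 5, 2019', 'October 9, 2015', 'not a date']})

# An explicit format avoids ambiguous inference; errors='coerce' turns
# unparseable values into NaT instead of raising an exception
sample['date'] = pd.to_datetime(sample['Premiere'], format='%B %d, %Y', errors='coerce')

# Count the rows that failed to parse
n_bad = sample['date'].isna().sum()
print(sample)
print('Unparseable rows:', n_bad)
```

Any row that deviates from the assumed pattern comes out as `NaT`, so the count makes such rows visible instead of letting them pass silently.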

To simplify the analysis, we split the full date into separate columns.

In [31]:
df['year'] = df['date'].dt.year
df['year_month'] = df['date'].dt.strftime('%Y-%m')
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6
df.head(1)

Unnamed: 0,Title,Genre,Premiere,Runtime,IMDB Score,Language,date,year_month,year,month,day_of_week
0,Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese,2019-08-05,2019-08,2019,8,0
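
The numeric `day_of_week` column uses Monday=0 through Sunday=6, which is easy to misread in later plots. A minimal sketch of adding readable labels next to it, assuming a few dates taken from the rows shown earlier; the `parts` frame and the `day_name` and `month_name` columns are illustrative additions:

```python
import pandas as pd

# Three premiere dates taken from the rows shown earlier
dates = pd.Series(pd.to_datetime(['2019-08-05', '2020-08-21', '2018-12-31']))

parts = pd.DataFrame({
    'date': dates,
    'day_of_week': dates.dt.dayofweek,    # Monday=0 ... Sunday=6
    'day_name': dates.dt.day_name(),      # readable weekday label
    'month_name': dates.dt.month_name(),  # readable month label
})
print(parts)
```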


# Analysis

## Genre
There are 115 unique genres in the dataset.

In [53]:
df['Genre'].nunique()

115

We normalize the genre counts into proportions. This matters because it makes it easy to see which genre predominates.

In [42]:
df['Genre'].value_counts(normalize=True)

Documentary            0.272260
Drama                  0.131849
Comedy                 0.083904
Romantic comedy        0.066781
Thriller               0.056507
                         ...   
Biographical/Comedy    0.001712
Action-adventure       0.001712
Adventure-romance      0.001712
Sports film            0.001712
Action/Comedy          0.001712
Name: Genre, Length: 115, dtype: float64
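
To make explicit what `normalize=True` does, the sketch below compares it with dividing the raw counts by their total. It assumes a tiny hypothetical `genres` series rather than the real column:

```python
import pandas as pd

# Hypothetical genre column, small enough to check by hand
genres = pd.Series(['Documentary', 'Documentary', 'Drama', 'Comedy'], name='Genre')

counts = genres.value_counts()                # absolute frequencies
shares = genres.value_counts(normalize=True)  # relative frequencies (sum to 1)
manual = counts / counts.sum()                # the same proportions computed by hand

# Side-by-side view with the shares expressed as percentages
summary = pd.DataFrame({'count': counts, 'share_pct': (shares * 100).round(1)})
print(summary)
```

Both routes give the same proportions; `normalize=True` is simply the shorter form.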

In [78]:
print(df['Genre'].value_counts()[:10])
print('-'*50)
print('Top 5 genres account for: ', (df['Genre'].value_counts()[:5].sum() * 100) / df['Genre'].value_counts().sum(), '%')

Documentary        159
Drama               77
Comedy              49
Romantic comedy     39
Thriller            33
Comedy-drama        14
Crime drama         11
Biopic               9
Horror               9
Action               7
Name: Genre, dtype: int64
--------------------------------------------------
Top 5 genres account for:  61.13013698630137 %


- 27.2% of the films are documentaries, followed by drama with 13.2%.
- The remaining films are spread across many different genres, most of which account for roughly 1% of the films or less.
- Strikingly, the top 5 genres alone cover 61.13% of the films. They are Documentary, Drama, Comedy, Romantic comedy, and Thriller.

The top 10 genres together account for 69.69% of the films (407 of 584).
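
The shares discussed above can be computed for any cutoff with a small helper. This is a sketch: `top_n_share` is a hypothetical function name, and the `genres` series is rebuilt from the top 10 counts printed earlier, with the remaining films given unique placeholder labels that stand in for the long tail of rare genres.

```python
import pandas as pd


def top_n_share(series: pd.Series, n: int) -> float:
    """Percentage of rows covered by the n most frequent values."""
    counts = series.value_counts()
    return counts.head(n).sum() * 100 / counts.sum()


# Top 10 genre counts as printed in the notebook output
top10 = {
    'Documentary': 159, 'Drama': 77, 'Comedy': 49, 'Romantic comedy': 39,
    'Thriller': 33, 'Comedy-drama': 14, 'Crime drama': 11, 'Biopic': 9,
    'Horror': 9, 'Action': 7,
}

# Rebuild a genre column of 584 rows; the 'other_i' labels are placeholders
# (an assumption), not real genre names from the dataset
labels = [genre for genre, count in top10.items() for _ in range(count)]
labels += [f'other_{i}' for i in range(584 - len(labels))]
genres = pd.Series(labels, name='Genre')

print(round(top_n_share(genres, 5), 2))   # share of the top 5 genres
print(round(top_n_share(genres, 10), 2))  # share of the top 10 genres
```

On the real data the call would be `top_n_share(df['Genre'], 5)`.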
