<a href="https://colab.research.google.com/github/LRLeite/Data-Analytics/blob/main/IMDb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Avaliações de filmes e séries, segundo o IMDb**

## **Informações sobre os dados:**
https://www.imdb.com/interfaces/



---



---



In [1]:
import pandas as pd
import gzip
import numpy as np

## **Importar os dados**

In [2]:
#Salvar o link dos dados das informações básicas num objeto
url_basics = 'https://datasets.imdbws.com/title.basics.tsv.gz'

#Importar os dados com os dados das informações básicas
basics = pd.read_csv(url_basics, compression = 'gzip', sep = '\t', usecols = ['tconst','titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'genres'])
basics.head()

Unnamed: 0,tconst,titleType,originalTitle,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,1893,\N,1,"Comedy,Short"


In [3]:
#Salvar o link dos dados das avaliações num objeto
url_ratings = 'https://datasets.imdbws.com/title.ratings.tsv.gz'

#Importar os dados com os dados das avaliações
ratings = pd.read_csv(url_ratings, compression = 'gzip', sep = '\t')
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1923
1,tt0000002,5.8,259
2,tt0000003,6.5,1737
3,tt0000004,5.6,174
4,tt0000005,6.2,2550


## **Realizar join das tabelas para visualizar as avaliações de cada título, apenas com os registros de filmes, filmes de TV e séries de TV**

In [4]:
#Verificar os tipos de títulos
basics.titleType.unique()

array(['short', 'movie', 'tvSeries', 'tvShort', 'tvMovie', 'tvEpisode',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame', 'tvPilot'],
      dtype=object)

In [5]:
#Join
avaliation = pd.merge(basics[['tconst', 'titleType', 'originalTitle', 'startYear', 'runtimeMinutes']].loc[basics.titleType.isin(['movie', 'tvMovie', 'tvSeries'])], ratings[['tconst','averageRating', 'numVotes']], on = 'tconst', how = 'left')
avaliation.head()

Unnamed: 0,tconst,titleType,originalTitle,startYear,runtimeMinutes,averageRating,numVotes
0,tt0000009,movie,Miss Jerry,1894,45,5.3,200.0
1,tt0000502,movie,Bohemios,1905,100,4.2,14.0
2,tt0000574,movie,The Story of the Kelly Gang,1906,70,6.0,796.0
3,tt0000591,movie,L'enfant prodigue,1907,90,5.1,20.0
4,tt0000615,movie,Robbery Under Arms,1907,\N,4.3,23.0


In [6]:
avaliation.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000557 entries, 0 to 1000556
Data columns (total 7 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   tconst          1000557 non-null  object 
 1   titleType       1000557 non-null  object 
 2   originalTitle   1000557 non-null  object 
 3   startYear       1000557 non-null  object 
 4   runtimeMinutes  1000557 non-null  object 
 5   averageRating   415863 non-null   float64
 6   numVotes        415863 non-null   float64
dtypes: float64(2), object(5)
memory usage: 61.1+ MB




Conforme informado no site do [IMDb](https:www.imdb.com/interfaces/), "Um '\N' é usado para indicar que um determinado campo está ausente ou nulo para esse título/nome". Além disso, as colunas 'startYear', 'endYear', 'runtimeMinutes' são do tipo 'object', o que impossibilita análises númericas que serão realizadas na sequência. Dessa forma, para converter as colunas para tipo númerico na sequência, é necessário substituir os valores '\N' por nan



In [7]:
#Substituir os valores "\\N", para converter as colunas para tipo númerico na sequência
cols = ['startYear', 'endYear', 'runtimeMinutes']
avaliation[cols] = basics[cols].replace({'\\N': np.nan})

In [8]:
#Converter colunas para tipo inteiro
avaliation[['startYear', 'endYear', 'runtimeMinutes']] = avaliation[['startYear', 'endYear', 'runtimeMinutes']].astype('float') #é necessário converter primeiro para float para depois passar para int
avaliation[['startYear', 'endYear', 'runtimeMinutes', 'numVotes']] = avaliation[['startYear', 'endYear', 'runtimeMinutes', 'numVotes']].astype('Int64')

In [9]:
avaliation.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000557 entries, 0 to 1000556
Data columns (total 8 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   tconst          1000557 non-null  object 
 1   titleType       1000557 non-null  object 
 2   originalTitle   1000557 non-null  object 
 3   startYear       993573 non-null   Int64  
 4   runtimeMinutes  539821 non-null   Int64  
 5   averageRating   415863 non-null   float64
 6   numVotes        415863 non-null   Int64  
 7   endYear         21164 non-null    Int64  
dtypes: Int64(4), float64(1), object(3)
memory usage: 72.5+ MB




## **Análises**



## **Quantidade de títulos por categoria**

In [10]:
basics[['originalTitle', 'titleType']].groupby('titleType').count()

Unnamed: 0_level_0,originalTitle
titleType,Unnamed: 1_level_1
movie,627705
short,901062
tvEpisode,7091349
tvMiniSeries,46065
tvMovie,138357
tvPilot,2
tvSeries,234495
tvShort,10746
tvSpecial,39207
video,267519


## **TOP 10 Filmes com as melhores avaliações**
(25000 votos é o valor mínimo [necessário](https://en.wikipedia.org/wiki/IMDb) para estar listado entre os melhores no IMDb) 

In [11]:
#Filmes
avaliation[['originalTitle', 'averageRating', 'numVotes']].loc[(avaliation.titleType == 'movie') & (avaliation.numVotes >= 25000 )].sort_values(by = 'averageRating', ascending = False).head(10)

Unnamed: 0,originalTitle,averageRating,numVotes
89162,The Shawshank Redemption,9.3,2663380
172352,Hababam Sinifi,9.2,40898
52849,The Godfather,9.2,1845722
698160,CM101MMXI Fundamentals,9.1,46399
122215,The Lord of the Rings: The Return of the King,9.0,1836283
55372,The Godfather Part II,9.0,1264251
36841,12 Angry Men,9.0,786522
525770,Kantara,9.0,68030
278166,The Dark Knight,9.0,2636338
86675,Schindler's List,9.0,1348725


In [12]:
#Filmes de TV
avaliation[['originalTitle', 'averageRating', 'numVotes']].loc[(avaliation.titleType == 'tvMovie') & (avaliation.numVotes >= 25000 )].sort_values(by = 'averageRating', ascending = False).head(10)

Unnamed: 0,originalTitle,averageRating,numVotes
44724,A Charlie Brown Christmas,8.3,37889
45804,How the Grinch Stole Christmas!,8.3,51853
426479,Temple Grandin,8.2,30196
44295,Rudolph the Red-Nosed Reindeer,8.0,31977
564368,The Normal Heart,7.9,36199
51431,Duel,7.6,71471
371402,You Don't Know Jack,7.6,29036
296864,24: Redemption,7.4,28435
517434,The Sunset Limited,7.3,30312
525483,Werewolf by Night,7.2,49158


## **TOP 10 Filmes com as piores avaliações**


In [13]:
#Filmes
avaliation[['originalTitle', 'averageRating', 'numVotes']].loc[(avaliation.titleType == 'movie') & (avaliation.numVotes >= 25000 )].sort_values(by = 'averageRating').head(10)

Unnamed: 0,originalTitle,averageRating,numVotes
910482,Cumali Ceber: Allah Seni Alsin,1.0,39060
859172,Reis,1.0,73623
934998,Sadak 2,1.1,95949
861281,Smolensk,1.2,39718
183705,Superbabies: Baby Geniuses 2,1.5,31136
503507,Elk*rtuk,1.5,38093
46065,Manos: The Hands of Fate,1.6,36513
567934,Justin Bieber: Never Say Never,1.6,76282
402338,Disaster Movie,1.9,91841
918312,Race 3,1.9,46866


In [14]:
#Filmes de TV
avaliation[['originalTitle', 'averageRating', 'numVotes']].loc[(avaliation.titleType == 'tvMovie') & (avaliation.numVotes >= 25000 )].sort_values(by = 'averageRating').head(10)

Unnamed: 0,originalTitle,averageRating,numVotes
704096,Sharknado,3.3,50169
296527,High School Musical 2,5.1,61302
340212,Camp Rock,5.1,34300
280562,High School Musical,5.5,89357
604724,The Wizard of Lies,6.8,26039
98650,Gia,6.9,46247
525483,Werewolf by Night,7.2,49158
517434,The Sunset Limited,7.3,30312
296864,24: Redemption,7.4,28435
51431,Duel,7.6,71471


## **TOP 10 Séries de TV com as melhores avaliações**
(25000 votos é o valor mínimo [necessário](https://en.wikipedia.org/wiki/IMDb) para estar listado entre os melhores no IMDb) 

In [15]:
avaliation[['originalTitle', 'averageRating', 'numVotes']].loc[(avaliation.titleType == 'tvSeries') & (avaliation.numVotes >= 25000 )].sort_values(by = 'averageRating', ascending = False).head(10)

Unnamed: 0,originalTitle,averageRating,numVotes
306841,Breaking Bad,9.5,1865404
926418,The Heroes,9.4,165942
987241,The Chosen,9.4,28536
257669,Avatar: The Last Airbender,9.3,313541
412088,Scam 1992: The Harshad Mehta Story,9.3,144059
203683,The Wire,9.3,338884
490338,Aspirants,9.2,297114
109240,The Sopranos,9.2,396831
310568,Game of Thrones,9.2,2084536
777107,The Filthy Frank Show,9.2,33076


## **TOP 10 Séries de TV com as piores avaliações**

In [16]:
avaliation[['originalTitle', 'averageRating', 'numVotes']].loc[(avaliation.titleType == 'tvSeries') & (avaliation.numVotes >= 25000 )].sort_values(by = 'averageRating').head(10)

Unnamed: 0,originalTitle,averageRating,numVotes
353060,Keeping Up with the Kardashians,2.8,30177
962149,Batwoman,3.4,44208
992812,Resident Evil,4.0,40336
493732,Raketsonyeondan,4.1,27653
362964,Tandav,4.6,60404
774835,Inhumans,4.9,27019
287206,Hannah Montana,5.2,40742
92159,7th Heaven,5.2,25802
352709,She-Hulk: Attorney at Law,5.2,159710
951579,Another Life,5.2,36076


## **Filmes com maior duração**

In [17]:
#Filmes
avaliation[['originalTitle', 'runtimeMinutes']].loc[avaliation.titleType == 'movie'].sort_values(by = 'runtimeMinutes', ascending = False).head(10)

Unnamed: 0,originalTitle,runtimeMinutes
448072,Anote Konote,8400
271898,Isto É São Paulo,5220
774732,New Year,4800
441446,Cupid and Valentine's Day,3600
306337,La cattiva stella,2925
225185,The Silent Cross,2800
446662,Louloute,2002
445189,Kto prikhodit v zimniy vecher...,2001
447836,Island 18,2001
448584,Lunch Always!,2001


In [18]:
#Filmes de TV
avaliation[['originalTitle', 'runtimeMinutes']].loc[avaliation.titleType == 'tvMovie'].sort_values(by = 'runtimeMinutes', ascending = False).head(10)

Unnamed: 0,originalTitle,runtimeMinutes
235170,An American Family Revisited: The Louds 10 Yea...,3122
780671,Franklin and Friends: Deep Sea Voyage,2150
448341,Snow 2: Brain Freeze,2000
454123,The World's Most Luxurious Prison,2000
404225,For Peter Sake,1664
94559,The White House,1620
414010,Die ganz begreifliche Angst vor Schlägen,1370
73952,Deadly Deception,1335
478212,The Horror Show,1248
106227,Lord of Misrule,1140
