<a href="https://colab.research.google.com/github/DataRecouver/Codes-DataScience-Python/blob/main/Desafio_2_IMDb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building the IMDb dataset

Using the web scraping techniques presented in class, build a dataset with information on the feature films released in the first half of 2021. The data is available on the IMDb website and can be obtained through the site's advanced search.

Remember to select "Feature film" in the "Title Type" field of the search settings page.

The required fields are:
- Title
- Release year
- Certificate (content rating)
- Runtime
- Genres
- Rating
- Metascore
- Summary
- Number of votes
- Gross

You must submit the scraping code and the resulting dataset as a CSV file.

**The activity must be done in teams. Only one team member needs to submit, listing the other members in the notebook.**

Group 4

**....:::: Members ::::....**
- Ane Caroline Teixeira
- Danilo Lima Souza
- Gabriel Borges Calheiros
- Izadora de Oliveira Machado Paim
- Laianne Protasio
- Guilherme Cruz

In [None]:
import numpy as np
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
from google.colab import files

In [None]:
imdb_page = requests.get("https://www.imdb.com/search/title/?title_type=feature&release_date=2021-01-01,2021-06-30&count=250&start=0&ref_=adv_nxt")

imdb_page.status_code

200
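The request above packs every search filter into one long URL string. As a minimal sketch, the same URL can be assembled from a dictionary with the standard library's `urllib.parse.urlencode`, which makes the page offset easier to change later. The helper name `build_search_url` is hypothetical and is not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"

def build_search_url(start=0, count=250):
    """Return the advanced-search URL for feature films released in the first half of 2021."""
    params = {
        "title_type": "feature",
        "release_date": "2021-01-01,2021-06-30",
        "count": count,
        "start": start,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the comma in the date range unescaped, as in the URL used above.
    return BASE_URL + "?" + urlencode(params, safe=",")

print(build_search_url())
```

Passing the result to `requests.get` and checking `status_code` works exactly as in the cell above.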

In [None]:
imdb_soup = BeautifulSoup(imdb_page.content, 'html.parser')

In [None]:
movies_list = imdb_soup.find_all('div', class_='lister-item mode-advanced')
movies_list

In [None]:
print(type(movies_list))
print(len(movies_list))

<class 'bs4.element.ResultSet'>
250


In [None]:
movie_name = movies_list[0].h3.a.text
movie_name

'Nobody'

In [None]:
movie_release = movies_list[0].h3.find('span', class_='lister-item-year text-muted unbold').text
movie_release

'(I) (2021)'
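The release text can carry a disambiguation marker such as `(I)` before the year. A small sketch of pulling out only the four-digit year with a regular expression; `parse_year` is a hypothetical helper, not something the notebook defines.

```python
import re

def parse_year(release_text):
    """Extract the four-digit year from text such as '(I) (2021)'; return None if absent."""
    match = re.search(r"\((\d{4})\)", release_text)
    return int(match.group(1)) if match else None

print(parse_year("(I) (2021)"))
print(parse_year("(2021)"))
```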

In [None]:
movie_censorship = movies_list[0].p.find('span', class_='certificate').text
movie_censorship

'R'

In [None]:
movie_duration = movies_list[0].p.find('span', class_='runtime').text
movie_duration

'92 min'
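The runtime arrives as text with a unit suffix. A possible way to turn it into an integer number of minutes, assuming the value always has the form `'<number> min'`; `parse_runtime` is a hypothetical helper.

```python
def parse_runtime(runtime_text):
    """Convert a runtime string such as '92 min' to an integer number of minutes."""
    # Take the first whitespace-separated token and drop any thousands separator.
    digits = runtime_text.split()[0].replace(",", "")
    return int(digits)

print(parse_runtime("92 min"))
```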

In [None]:
movie_category = movies_list[0].p.find('span', class_='genre').text
movie_category

'\nAction, Crime, Drama            '
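The genre text comes with a leading newline and trailing padding. A short sketch of normalising it into a clean list of genre names; `parse_genres` is a hypothetical helper.

```python
def parse_genres(genre_text):
    """Split a raw genre string into a list of trimmed genre names."""
    return [part.strip() for part in genre_text.split(",") if part.strip()]

print(parse_genres("\nAction, Crime, Drama            "))
```

Joining the list back with `", ".join(...)` gives a single tidy string if one column per film is preferred.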

In [None]:
movie_rating = movies_list[0].strong.text
rating_to_float = float(movie_rating)

rating_to_float

7.4

In [None]:
movie_metascore = movies_list[0].find('span', class_='metascore favorable').text
metascore_to_int = int(movie_metascore)

metascore_to_int

64

In [None]:
movie_summary = movies_list[0].find_all(class_='text-muted')[2].text
movie_summary

'\nA docile family man slowly reveals his true character after his house gets burgled by two petty thieves, which, coincidentally, leads him into a bloody war with a Russian crime boss.'

In [None]:
movie_votes = movies_list[0].find_all('span', attrs={'name': 'nv'})[0]['data-value']

votes_to_int = int(movie_votes)
votes_to_int

235800

In [None]:
movie_gross = movies_list[0].find_all('span', attrs={'name': 'nv'})[1]['data-value']

gross_replace = movie_gross.replace(',','')
gross_to_float = float(gross_replace)

gross_to_float

27268035.0
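The `data-value` attribute read above is a digit string with thousands separators, while the element's visible text is an abbreviated amount (the dataset built later in this notebook stores `$27.27M` for the same film). The sketch below handles both forms. It assumes `M` (millions) is the only suffix that occurs, which the source does not confirm; `parse_gross` is a hypothetical helper.

```python
def parse_gross(gross_text):
    """Convert '$27.27M' or '27,268,035' to a float amount in dollars."""
    value = gross_text.strip().lstrip("$")
    if value.endswith("M"):
        # Abbreviated display text: the number is expressed in millions (assumed).
        return float(value[:-1]) * 1_000_000
    # Plain digit string with thousands separators.
    return float(value.replace(",", ""))

print(parse_gross("27,268,035"))
print(parse_gross("$27.27M"))
```

The abbreviated form loses precision, so reading `data-value` is the better source when exact figures matter.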

# Generalize the code

In [None]:
num = 0
lista = []

# Request one results page per offset until the offset passes the number of results.
while num < 4557:
  imdb_page = requests.get("https://www.imdb.com/search/title/?title_type=feature&release_date=2021-01-01,2021-06-30&count=250&start={}&ref_=adv_nxt".format(num))

  num += 250

  imdb_soup = BeautifulSoup(imdb_page.content, 'html.parser')
  movies_list = imdb_soup.find_all('div', class_='lister-item mode-advanced')

  for movie in movies_list:
    # Votes and gross share the same tag; a film may have both, only votes, or neither.
    votes_and_grosses = [i.text for i in movie.find_all('span', attrs={'name': 'nv'})][:2]
    # Pad with NaN so every row ends with exactly two values (votes, gross).
    votes_and_grosses += [np.nan] * (2 - len(votes_and_grosses))

    m = [
      movie.h3.find('a'),
      movie.h3.find('span', class_='lister-item-year'),
      movie.p.find('span', class_='certificate'),
      movie.p.find('span', class_='runtime'),
      movie.p.find('span', class_='genre'),
      movie.find('strong'),
      movie.find('span', class_='metascore'),
      movie.find_all(class_='text-muted')[2],
    ]

    # Missing elements come back as None; store NaN instead of their text.
    row = [np.nan if item is None else item.text for item in m]
    lista.append(row + votes_and_grosses)

  # Pause between requests to avoid hammering the server.
  time.sleep(1)
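The loop above relies on two small pieces of arithmetic: the sequence of `start` offsets, one per page of 250 results, and rows of equal length even when a film has no gross figure or no votes. A sketch of both ideas in isolation; `page_starts` and `pad_row` are hypothetical helpers, and the total of 4557 is simply the bound used in the loop.

```python
def page_starts(total, page_size=250, first=0):
    """Return the `start` offset of every results page needed to cover `total` items."""
    return range(first, first + total, page_size)

def pad_row(values, width, fill=None):
    """Return `values` truncated or padded with `fill` to exactly `width` items."""
    values = list(values)[:width]
    return values + [fill] * (width - len(values))

starts = list(page_starts(4557))
print(len(starts), starts[0], starts[-1])

# A film with a vote count but no gross figure still yields two fields.
print(pad_row(["235,870"], 2))
```

Keeping the row width fixed means the column names passed to `pd.DataFrame` always line up with the scraped values.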


In [None]:
movie_df = pd.DataFrame(lista, columns=['titles', 'releases', 'censorships', 'durations', 'categories', 'ratings', 'metascores', 'summaries', 'votes', 'grosses'])
movie_df

Unnamed: 0,titles,releases,censorships,durations,categories,ratings,metascores,summaries,votes,grosses
0,Nobody,(I) (2021),R,92 min,"\nAction, Crime, Drama",7.4,64,\nA docile family man slowly reveals his true ...,235870,$27.27M
1,The Worst Person in the World,(2021),R,128 min,"\nComedy, Drama, Romance",7.8,90,\nThe chronicles of four years in the life of ...,54574,
2,Wrath of Man,(2021),R,119 min,"\nAction, Crime, Thriller",7.1,57,"\nThe plot follows H, a cold and mysterious ch...",169132,
3,CODA,(2021),PG-13,111 min,"\nComedy, Drama, Music",8.0,74,\nAs a CODA (Child of Deaf Adults) Ruby is the...,119624,
4,Pleasure,(2021),Not Rated,109 min,\nDrama,6.3,75,\nBella Cherry arrives in Los Angeles with dre...,13393,
...,...,...,...,...,...,...,...,...,...,...
4554,Kikoriki And Friends. Vol.2,(2021),,,,,,\nAdd a Plot\n,,
4555,Fennu de huangniu,(2021),,78 min,"\nAction, Crime",,,\nBatu and his wife Tana work hard for their l...,,
4556,Luban four Heroes,(2021),,,\nAction,6.7,,\nAdd a Plot\n,40,
4557,Dimensions 2,(2021),,,\nAdventure,6.4,,\nJack gets taken by a guard and the guard try...,7,


In [None]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4559 entries, 0 to 4558
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   titles       4559 non-null   object
 1   releases     4559 non-null   object
 2   censorships  477 non-null    object
 3   durations    3370 non-null   object
 4   categories   4303 non-null   object
 5   ratings      2472 non-null   object
 6   metascores   240 non-null    object
 7   summaries    4559 non-null   object
 8   votes        2473 non-null   object
 9   grosses      12 non-null     object
dtypes: object(10)
memory usage: 356.3+ KB


In [None]:
movie_df.isna().sum()

titles            0
releases          0
censorships    4082
durations      1189
categories      256
ratings        2087
metascores     4319
summaries         0
votes          2086
grosses        4547
dtype: int64

In [None]:
movie_df.dtypes

titles         object
releases       object
censorships    object
durations      object
categories     object
ratings        object
metascores     object
summaries      object
votes          object
grosses        object
dtype: object
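Every column is still `object` because the scraper stored each element's text. Numeric fields therefore need parsing before they can be used in calculations. As a sketch for the vote counts, assuming the scraped text uses a comma as a thousands separator (for example `235,870`): the separator is removed, not swapped for a decimal point. `parse_votes` is a hypothetical helper.

```python
import math

def parse_votes(votes_text):
    """Convert vote text such as '235,870' to an int; missing values become NaN."""
    if votes_text is None or (isinstance(votes_text, float) and math.isnan(votes_text)):
        return math.nan
    # The comma is a thousands separator, so it is dropped entirely.
    return int(votes_text.replace(",", ""))

print(parse_votes("235,870"))
print(parse_votes("40"))
```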

In [None]:
# Convert METASCORES to FLOAT
movie_df["metascores"] = movie_df["metascores"].apply(np.single)
movie_df.dtypes

titles          object
releases        object
censorships     object
durations       object
categories      object
ratings         object
metascores     float32
summaries       object
votes           object
grosses         object
dtype: object

In [None]:
# Convert RATINGS to FLOAT
movie_df["ratings"] = movie_df["ratings"].apply(np.single)
movie_df.dtypes

titles          object
releases        object
censorships     object
durations       object
categories      object
ratings        float32
metascores     float32
summaries       object
votes           object
grosses         object
dtype: object

In [None]:
# Remove newlines and surrounding whitespace from CATEGORIES
movie_df["categories"] = movie_df["categories"].str.replace('\n', '').str.strip()
movie_df.dtypes

titles          object
releases        object
censorships     object
durations       object
categories      object
ratings        float32
metascores     float32
summaries       object
votes           object
grosses         object
dtype: object

In [None]:
# Replace Summaries
movie_df["summaries"] = movie_df["summaries"].str.replace('\n', '')
movie_df.dtypes

titles          object
releases        object
censorships     object
durations       object
categories      object
ratings        float32
metascores     float32
summaries       object
votes           object
grosses         object
dtype: object

In [None]:
# Clean VOTES and convert to FLOAT
# The comma is a thousands separator, so it is removed (not turned into a decimal point).
movie_df["votes"] = movie_df["votes"].str.replace(',', '', regex=False).apply(np.single)
movie_df.dtypes

titles          object
releases        object
censorships     object
durations       object
categories      object
ratings        float32
metascores     float32
summaries       object
votes          float32
grosses         object
dtype: object

In [None]:
# Convert None to NaN
movie_df["grosses"] = movie_df["grosses"].fillna(np.nan)
movie_df.dtypes

titles          object
releases        object
censorships     object
durations       object
categories      object
ratings        float32
metascores     float32
summaries       object
votes          float32
grosses         object
dtype: object

In [None]:
movie_df

Unnamed: 0,titles,releases,censorships,durations,categories,ratings,metascores,summaries,votes,grosses
0,Nobody,(I) (2021),R,92 min,"Action, Crime, Drama",7.4,64.0,A docile family man slowly reveals his true ch...,235870.0,$27.27M
1,The Worst Person in the World,(2021),R,128 min,"Comedy, Drama, Romance",7.8,90.0,The chronicles of four years in the life of Ju...,54574.0,
2,Wrath of Man,(2021),R,119 min,"Action, Crime, Thriller",7.1,57.0,"The plot follows H, a cold and mysterious char...",169132.0,
3,CODA,(2021),PG-13,111 min,"Comedy, Drama, Music",8.0,74.0,As a CODA (Child of Deaf Adults) Ruby is the o...,119624.0,
4,Pleasure,(2021),Not Rated,109 min,Drama,6.3,75.0,Bella Cherry arrives in Los Angeles with dream...,13393.0,
...,...,...,...,...,...,...,...,...,...,...
4554,Kikoriki And Friends. Vol.2,(2021),,,,,,Add a Plot,,
4555,Fennu de huangniu,(2021),,78 min,"Action, Crime",,,Batu and his wife Tana work hard for their liv...,,
4556,Luban four Heroes,(2021),,,Action,6.7,,Add a Plot,40.0,
4557,Dimensions 2,(2021),,,Adventure,6.4,,Jack gets taken by a guard and the guard try's...,7.0,


In [None]:
movie_df.to_csv('movies_dataframe.csv', index=False)
files.download('movies_dataframe.csv')

In [None]:
movie_df.to_excel('movies_dataframe.xlsx', index=False)
files.download('movies_dataframe.xlsx')
