**Insper**  
**Text Analysis of Unstructured and Web Sources**

# **IMDB: <br/><SUBTITLE>**

**Beatriz de Jesus**  
**Luciano Felix**  
**Rodrigo Villela**


## **1. Introduction**

> Lorem ipsum dolor sit amet rebum blandit sit diam molestie ea. Erat elitr dolor magna erat velit sadipscing est. Sadipscing accusam elitr vel sea dolore at sed sed voluptua vero at no ut.

Tation dolor suscipit lorem ipsum duo gubergren accusam lorem feugiat diam voluptua gubergren erat dolore diam qui. Eum voluptua consequat ipsum ipsum magna sadipscing sed sit aliquyam diam lorem accumsan consetetur justo aliquyam. Accusam molestie duo nonumy no clita eirmod elitr. Quis volutpat magna lorem vel molestie takimata vero justo eos ipsum tempor ipsum at consetetur exerci eirmod lorem aliquyam. Takimata sit ut aliquyam. Sed tempor consequat kasd molestie ea amet gubergren stet et amet amet stet invidunt euismod in dolore vero. Duis invidunt dolores vel. Vero et feugiat sea eirmod accusam. Sadipscing molestie voluptua rebum magna in dolore sit elitr ea in commodo. Lorem ad at minim et stet at velit clita ea eirmod clita et amet labore. Dolor at eos sit vero dolor sed tempor. Nonumy invidunt eirmod. Aliquam amet veniam et sadipscing takimata sit amet justo nisl at amet at aliquyam clita eirmod. In hendrerit autem blandit nonumy. Consequat labore rebum ipsum tempor eirmod placerat nonumy et te et dolore. Autem nonumy magna nonumy eirmod erat labore et ipsum et autem accusam consetetur. Duo erat minim.


In [29]:
from dataclasses import dataclass
from typing import List, Dict
from urllib.parse import urlparse
from pathlib import PurePath

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd


BASE_REQUESTS_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
}


## 2. Scraping the data

### 2.1. Building the film index

In [30]:
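def get_top_250_ids(headers: Dict[str, str] = BASE_REQUESTS_HEADERS) -> List[str]:
    """Return the IMDb IDs (e.g. "tt0111161") listed on the Top 250 chart."""
    url = "https://www.imdb.com/chart/top"
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early on an HTTP error page
    # Name the parser explicitly; bs4 otherwise guesses and emits a warning.
    soup = bs(response.text, "html.parser")

    # Each chart row links to /title/<id>/; the ID is the last path segment.
    return [
        PurePath(urlparse(node.attrs["href"]).path).name
        for node in soup.select('table > tbody > tr > td.titleColumn > a')
    ]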

top_250_ids = get_top_250_ids()
top_250_ids[:5]

['tt0111161', 'tt0068646', 'tt0468569', 'tt0071562', 'tt0050083']
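
The list comprehension above reduces each link's `href` to its last path segment. A minimal sketch of that step in isolation, assuming an `href` shaped like the ones on the chart page; `extract_title_id` is a hypothetical helper and the sample value is illustrative, not captured from IMDb:

```python
from pathlib import PurePath
from urllib.parse import urlparse


def extract_title_id(href: str) -> str:
    """Return the last path segment of a title link (hypothetical helper)."""
    # urlparse separates the path from the query string and fragment,
    # so anything after "?" never reaches the result.
    path = urlparse(href).path
    # PurePath ignores the trailing slash; .name is the final path component.
    return PurePath(path).name


# Illustrative href; the query string is an assumption about the link format.
sample_href = "/title/tt0111161/?ref_=chart_sample"
print(extract_title_id(sample_href))
```

Parsing the URL instead of splitting the string by hand keeps the extraction stable whether or not the link carries a query string or a trailing slash.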

### 2.2. Scraping each individual film

In [31]:
@dataclass
class Title:
    title: str
    score: float
    plot: str


def fetch_title(title_id: str, headers: Dict[str, str] = BASE_REQUESTS_HEADERS) -> Title:
    """Download one title page and extract its name, rating, and plot summary."""
    url = f"https://www.imdb.com/title/{title_id}/"
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early on an HTTP error page
    soup = bs(response.text, "html.parser")

    title_node = soup.find("h1")
    # next_element is the first node inside the rating container, which holds the score.
    score_node = soup.select_one('[data-testid="hero-rating-bar__aggregate-rating__score"]').next_element
    plot_node = soup.select_one('[data-testid="plot-xl"]')

    return Title(
        title=title_node.text,
        score=float(score_node.text),
        plot=plot_node.text,
    )


In [32]:
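titles = []
for index, title_id in enumerate(top_250_ids, 1):
    # "\r" rewrites the same line, giving a simple in-place progress counter.
    print(f"\r{index} of {len(top_250_ids)}", end="")

    titles.append(fetch_title(title_id))

print("\rComplete!")

# A list of dataclass instances becomes one DataFrame column per field.
df = pd.DataFrame(titles)
df.head(5)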

1 of 250

KeyboardInterrupt: 
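
The run above was interrupted after the first title, and because `titles` only lives in memory, an interrupted run has to start again from scratch. One way to make the loop resumable is to checkpoint each result to disk. The sketch below is an illustration under assumptions, not part of the original notebook: `scrape_with_checkpoint`, the JSON checkpoint file, and the `delay` parameter are hypothetical names, and `fetch` stands in for a function such as `fetch_title` wrapped to return a plain dict.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, Iterable


def scrape_with_checkpoint(
    ids: Iterable[str],
    fetch: Callable[[str], dict],
    checkpoint: Path,
    delay: float = 1.0,
) -> Dict[str, dict]:
    """Fetch every id once, saving progress after each successful request."""
    # Reload earlier results so a restarted run skips what is already done.
    results: Dict[str, dict] = {}
    if checkpoint.exists():
        results = json.loads(checkpoint.read_text(encoding="utf-8"))

    for title_id in ids:
        if title_id in results:
            continue
        results[title_id] = fetch(title_id)
        # Persist immediately: an interruption loses at most one request.
        checkpoint.write_text(json.dumps(results), encoding="utf-8")
        # Pause between requests to avoid hammering the server.
        time.sleep(delay)

    return results
```

With the notebook's dataclass, `fetch` could be `lambda i: dataclasses.asdict(fetch_title(i))`, and `pd.DataFrame(list(results.values()))` would rebuild the table from the checkpointed records.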

## 3. Cleaning the data

In [39]:
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from gensim.parsing.preprocessing import preprocess_string, remove_stopwords

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [49]:
# Tokenize, remove stop words, and stem each plot summary with gensim's default filters.
df["keywords"] = df["plot"].map(preprocess_string)

df.head(5)

| Unnamed: 0 | title | score | plot | keywords |
|---|---|---|---|---|
| 0 | Um Sonho de Liberdade | 9.3 | Over the course of several years, two convicts... | [cours, year, convict, form, friendship, seek,... |
| 1 | O Poderoso Chefão | 9.2 | Don Vito Corleone, head of a mafia family, dec... | [vito, corleon, head, mafia, famili, decid, ha... |
| 2 | Batman: O Cavaleiro das Trevas | 9.0 | When the menace known as the Joker wreaks havo... | [menac, known, joker, wreak, havoc, chao, peop... |
| 3 | O Poderoso Chefão II | 9.0 | The early life and career of Vito Corleone in ... | [earli, life, career, vito, corleon, new, york... |
| 4 | 12 Homens e uma Sentença | 9.0 | The jury in a New York City murder trial is fr... | [juri, new, york, citi, murder, trial, frustra... |
| 5 | A Lista de Schindler | 9.0 | In German-occupied Poland during World War II,... | [german, occupi, poland, world, war, industria... |
| 6 | O Senhor dos Anéis: O Retorno do Rei | 9.0 | Gandalf and Aragorn lead the World of Men agai... | [gandalf, aragorn, lead, world, men, sauron, a... |
| 7 | Pulp Fiction: Tempo de Violência | 8.9 | The lives of two mob hitmen, a boxer, a gangst... | [live, mob, hitmen, boxer, gangster, wife, pai... |
| 8 | O Senhor dos Anéis: A Sociedade do Anel | 8.8 | A meek Hobbit from the Shire and eight compani... | [meek, hobbit, shire, companion, set, journei,... |
| 9 | Três Homens em Conflito | 8.8 | A bounty hunting scam joins two men in an unea... | [bounti, hunt, scam, join, men, uneasi, allian... |
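
The stems in the `keywords` column (`cours`, `famili`, `journei`) come from the chain of filters that `preprocess_string` applies by default, which ends in a stemmer. To make the earlier steps visible, the sketch below is a simplified standard-library stand-in, not gensim's implementation: it lowercases, strips punctuation and digits, and drops stop words and short tokens, but does no stemming. `simple_keywords`, `SAMPLE_STOPWORDS`, and the minimum length of 3 are assumptions chosen for illustration.

```python
import re
from typing import List

# Tiny illustrative stop-word list; real libraries ship far longer ones.
SAMPLE_STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "two"}


def simple_keywords(text: str, min_len: int = 3) -> List[str]:
    """Simplified stand-in for a text-cleaning filter chain (no stemming)."""
    # Lowercase, then replace every run of non-ASCII-letter characters with a space.
    letters_only = re.sub(r"[^a-z]+", " ", text.lower())
    return [
        token
        for token in letters_only.split()
        if token not in SAMPLE_STOPWORDS and len(token) >= min_len
    ]


# Visible prefix of one plot summary from the table above.
print(simple_keywords("The jury in a New York City murder trial"))
```

Comparing the unstemmed tokens with the stored ones (`jury` against `juri`, `city` against `citi`) shows what the stemming step contributes on top of the filtering.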


### 3.1. Saving the dataset locally

In [52]:
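# index=False keeps the row index out of the file, so reloading does not add an "Unnamed: 0" column.
df.to_csv("../data/imdb_top_250.csv", index=False)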
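
One caveat when saving: CSV has no list type, so a column of token lists like `keywords` is written as text and comes back as strings when the file is reloaded. A minimal sketch of the round trip using only the standard library, where the in-memory buffer and the single sample row are stand-ins for the real file:

```python
import ast
import csv
import io

# One sample row shaped like the DataFrame: a title plus a list of tokens.
rows = [{"title": "12 Homens e uma Sentença", "keywords": ["juri", "new", "york"]}]

# Write to an in-memory buffer instead of ../data/imdb_top_250.csv.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "keywords"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the list is now its string representation.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["keywords"]

# ast.literal_eval safely parses the Python literal back into a list.
restored = ast.literal_eval(raw)
print(type(raw).__name__, "->", type(restored).__name__)
```

With pandas, the same conversion can be applied at load time through the `converters` argument of `pd.read_csv`, for example `converters={"keywords": ast.literal_eval}`.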