# Data Scraping and Analysis: Movie Box Office Grosses and Budgets
 
We will work with data on film budgets and financial performance from [The Numbers](https://www.the-numbers.com/movie/budgets/all).

In [1]:
!pip install fake_useragent

Collecting fake_useragent
  Obtaining dependency information for fake_useragent from https://files.pythonhosted.org/packages/33/c9/ff44922639b8827dbc86d463d870dabfc19d1567d8a6427dcb2289d83fd8/fake_useragent-1.4.0-py3-none-any.whl.metadata
  Downloading fake_useragent-1.4.0-py3-none-any.whl.metadata (13 kB)
Downloading fake_useragent-1.4.0-py3-none-any.whl (15 kB)
Installing collected packages: fake_useragent
Successfully installed fake_useragent-1.4.0


In [2]:
# import the required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from fake_useragent import UserAgent
from tqdm import tqdm
import matplotlib
%matplotlib inline 

In [3]:
import warnings
warnings.filterwarnings("ignore")

### Scraping the data

In [9]:
req = requests.get('https://www.the-numbers.com/movie/budgets/all')
print(req)

<Response [403]>


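This is the first time the server has refused our request: it responds with 403 instead of the page. There is a way around this, though. The `fake_useragent` library supplies a browser-like `User-Agent` string that we can send in the request headers.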

In [10]:
ua = UserAgent()
headers = {'User-Agent': ua.chrome}
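The `headers` dictionary built above only replaces the `User-Agent` value that the client sends by default. As a minimal sketch of the same idea without third-party packages, the snippet below attaches a fixed browser-like string to a standard-library request object and inspects it without sending anything. The names `FALLBACK_UA` and `build_request` are hypothetical, and the User-Agent string is only an illustrative value, not one required by the site.

```python
from urllib.request import Request

URL = "https://www.the-numbers.com/movie/budgets/all"

# Illustrative browser-like value (an assumption, any realistic string would do)
FALLBACK_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=FALLBACK_UA):
    """Create a request object that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request(URL)
# urllib stores header names capitalized as 'User-agent'
print(request.get_header("User-agent"))
```

Nothing is sent over the network here; the point is only that the header travels with the request, which is what `requests.get(..., headers=headers)` does in the next cell.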

In [11]:
req = requests.get('https://www.the-numbers.com/movie/budgets/all', headers=headers)
print(req)

<Response [200]>


In [12]:
soup = BeautifulSoup(req.text, 'html')

In [13]:
df = pd.read_html(str(soup.find('table')))[0]
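To make it clearer what happens when an HTML table becomes tabular data, here is a rough standard-library sketch that collects cell text row by row. It is not how `pd.read_html` is implemented; the class name `TableRows` and the sample markup are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


SAMPLE_HTML = """
<table>
  <tr><th>Movie</th><th>Production Budget</th></tr>
  <tr><td>Avengers: Endgame</td><td>$400,000,000</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
print(parser.rows)
```

Note that the budget stays a string such as `$400,000,000` at this stage, which is why the numeric columns need an explicit conversion later in the notebook.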

In [14]:
help(pd.read_html)

Help on function read_html in module pandas.io.html:

read_html(io: 'FilePath | ReadBuffer[str]', *, match: 'str | Pattern' = '.+', flavor: 'str | None' = None, header: 'int | Sequence[int] | None' = None, index_col: 'int | Sequence[int] | None' = None, skiprows: 'int | Sequence[int] | slice | None' = None, attrs: 'dict[str, str] | None' = None, parse_dates: 'bool' = False, thousands: 'str | None' = ',', encoding: 'str | None' = None, decimal: 'str' = '.', converters: 'dict | None' = None, na_values: 'Iterable[object] | None' = None, keep_default_na: 'bool' = True, displayed_only: 'bool' = True, extract_links: "Literal[None, 'header', 'footer', 'body', 'all']" = None, dtype_backend: 'DtypeBackend | lib.NoDefault' = <no_default>) -> 'list[DataFrame]'
    Read HTML tables into a ``list`` of ``DataFrame`` objects.
    
    Parameters
    ----------
    io : str, path object, or file-like object
        String, path object (implementing ``os.PathLike[str]``), or file-like
        object im

In [20]:
df

Unnamed: 0,Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
1,"Dec 9, 2022",Avatar: The Way of Water,"$460,000,000","$684,075,767","$2,319,591,720"
2,"Apr 23, 2019",Avengers: Endgame,"$400,000,000","$858,373,000","$2,788,912,285"
3,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$379,000,000","$241,071,802","$1,045,713,802"
4,"Apr 22, 2015",Avengers: Age of Ultron,"$365,000,000","$459,005,868","$1,395,316,979"
5,"May 17, 2023",Fast X,"$340,000,000","$146,126,015","$714,581,860"
...,...,...,...,...,...
6466,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
6467,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
6468,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
6469,"Aug 5, 2005",My Date With Drew,"$1,100","$181,041","$181,041"


Now let's scrape every page of the table at the link above.

In [16]:
main = pd.DataFrame()

for i in tqdm(range(1, 65)):
    req = requests.get(f'https://www.the-numbers.com/movie/budgets/all/{i}01', headers=headers)
    soup = BeautifulSoup(req.text, 'html')
    table = soup.find('table')
    df1 = pd.read_html(str(table))[0]
    main = pd.concat([main, df1])

100%|██████████████████████████████████████████████████████████████████████████████████| 64/64 [02:04<00:00,  1.94s/it]
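The loop above relies on the pagination scheme visible in its URLs: each page holds 100 rows and is addressed by the number of its first row (`/101`, `/201`, and so on), while the first page has no suffix. A small sketch of that arithmetic follows; the helper name `page_urls` is hypothetical, and the page size of 100 is inferred from the URLs used in this notebook.

```python
BASE_URL = "https://www.the-numbers.com/movie/budgets/all"


def page_urls(total_rows, page_size=100):
    """Return one URL per page, addressed by the number of its first row."""
    urls = []
    for start in range(1, total_rows + 1, page_size):
        # The first page is the bare URL; later pages append the start row
        urls.append(BASE_URL if start == 1 else f"{BASE_URL}/{start}")
    return urls


urls = page_urls(6470)
print(len(urls), urls[1], urls[-1])
```

For 6470 rows this gives 65 pages: the first one plus the 64 fetched in the loop. When the pages are read this way, a common alternative to concatenating inside the loop is to append each page's DataFrame to a list and call `pd.concat` once at the end.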


In [17]:
df = pd.concat([df, main])

In [18]:
df = df.set_index('Unnamed: 0')

In [19]:
df.shape

(6470, 5)

In [22]:
df.sample(5)

Unnamed: 0,Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
6289,"Jan 18, 1967",Per un pugno di dollari,"$200,000","$3,500,000","$3,528,283"
878,"Sep 20, 2002",Ballistic: Ecks vs. Sever,"$70,000,000","$14,294,842","$14,294,842"
4685,"May 8, 2009",Julia,"$6,000,000","$65,108","$1,365,108"
5572,"Nov 20, 2009",The Missing Person,"$2,000,000","$17,896","$17,896"
4611,"Jul 25, 1980",Caddyshack,"$6,000,000","$39,846,344","$39,849,764"


### EDA

* How many rows and columns does the dataset contain?
* Are there any NaN values?
* Are there any duplicate rows?
* What are the data types of the columns?

In [24]:
df.shape

(6470, 5)

In [25]:
# your code
df.isna().values.any()

False

In [26]:
df.duplicated().values.any()

False

In [None]:
# Both checks return False, so the data has neither missing values nor duplicate rows


In [27]:
?df.dropna

In [28]:
?df.drop_duplicates

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6470 entries, 1 to 6470
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Release Date       6470 non-null   object
 1   Movie              6470 non-null   object
 2   Production Budget  6470 non-null   object
 3   Domestic Gross     6470 non-null   object
 4   Worldwide Gross    6470 non-null   object
dtypes: object(5)
memory usage: 303.3+ KB


### Converting data types

Convert the columns to the appropriate data types.

In [30]:
my_cols = list(df.columns)
my_cols

['Release Date',
 'Movie',
 'Production Budget',
 'Domestic Gross',
 'Worldwide Gross']

In [32]:
cols = ['Production Budget', 'Domestic Gross', 'Worldwide Gross']

for col in cols:
    df[col] = df[col].str.replace('$', '', regex=False)  # '$' is a literal character, not a regex anchor
    df[col] = df[col].str.replace(',', '', regex=False)
    df[col] = pd.to_numeric(df[col])

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6470 entries, 1 to 6470
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Release Date       6470 non-null   object
 1   Movie              6470 non-null   object
 2   Production Budget  6470 non-null   int64 
 3   Domestic Gross     6470 non-null   int64 
 4   Worldwide Gross    6470 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 303.3+ KB
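After the conversion, `Release Date` is still stored as `object`. As a hedged sketch of what the cleanup amounts to for a single value, the helpers below parse one money string and one date string in plain Python. The names `parse_money` and `parse_release_date` are hypothetical, and the date format is assumed from values such as `Dec 9, 2022` shown in the table output.

```python
from datetime import date, datetime


def parse_money(text):
    """Convert a string such as '$1,400' to an integer number of dollars."""
    return int(text.replace("$", "").replace(",", ""))


def parse_release_date(text):
    """Parse 'Dec 9, 2022' into a date; return None for values like 'Unknown'."""
    try:
        return datetime.strptime(text, "%b %d, %Y").date()
    except ValueError:
        return None


print(parse_money("$460,000,000"))
print(parse_release_date("Dec 9, 2022"))
print(parse_release_date("Unknown"))
```

At the column level, `pd.to_datetime(df['Release Date'], errors='coerce')` should play the same role, turning unparseable entries into `NaT` instead of raising an error.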


### Описательная статистика

* What is the average budget of the films in the dataset?
* What is the average worldwide gross?
* What are the minimum worldwide and domestic grosses?
* What are the highest budget and the highest worldwide gross?
* How much did the films with the lowest and the highest budgets earn?

In [37]:
# your code
pd.options.display.float_format = '{:,.2f}'.format

df.describe()

Unnamed: 0,Production Budget,Domestic Gross,Worldwide Gross
count,6470.0,6470.0,6470.0
mean,32574090.87,41942440.98,93286774.93
std,43949080.38,71834702.38,185769917.22
min,86.0,0.0,0.0
25%,5000000.0,1082370.5,3730082.75
50%,17000000.0,16108537.5,27148444.5
75%,40000000.0,51622412.0,96935655.0
max,460000000.0,936662225.0,2923706026.0
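The summary above gives the extreme values but not the films they belong to. The sketch below shows the lookup on four rows copied from this notebook's outputs, using plain Python tuples; the variable names are hypothetical.

```python
# (movie, production budget, worldwide gross) for a few rows shown in this notebook
films = [
    ("Avatar: The Way of Water", 460_000_000, 2_319_591_720),
    ("Avatar", 237_000_000, 2_923_706_026),
    ("My Date With Drew", 1_100, 181_041),
    ("Neeras", 86, 0),
]

# Pick whole rows by comparing on a single field
lowest_budget = min(films, key=lambda film: film[1])
highest_budget = max(films, key=lambda film: film[1])
top_gross = max(films, key=lambda film: film[2])

print(lowest_budget)
print(highest_budget)
print(top_gross)
```

On the full DataFrame, the same lookup can be written as `df.loc[df['Production Budget'].idxmin()]` and `df.loc[df['Production Budget'].idxmax()]`, assuming the index labels are unique.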


In [38]:
df[df['Production Budget'] == 86.0]

Unnamed: 0,Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
6470,"Mar 2, 2021",Neeras,86,0,0


In [44]:
df.sort_values(by=['Production Budget'], ascending = True).iloc[0:3]

Unnamed: 0,Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
6470,"Mar 2, 2021",Neeras,86,0,0
6469,"Aug 5, 2005",My Date With Drew,1100,181041,181041
6468,"Sep 29, 2015",A Plague So Pleasant,1400,0,0


In [53]:
# df.sort_values(by=['Production Budget'], ascending = False).iloc[0:10]
# df.sort_values(by=['Domestic Gross'], ascending = False).iloc[0:10]
df.sort_values(by=['Worldwide Gross'], ascending = False).iloc[0:10]

Unnamed: 0,Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
32,"Dec 17, 2009",Avatar,237000000,785221649,2923706026
2,"Apr 23, 2019",Avengers: Endgame,400000000,858373000,2788912285
1,"Dec 9, 2022",Avatar: The Way of Water,460000000,684075767,2319591720
57,"Dec 18, 1997",Titanic,200000000,674460013,2223048786
6,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,306000000,936662225,2064615817
7,"Apr 25, 2018",Avengers: Infinity War,300000000,678815482,2048359754
55,"Dec 14, 2021",Spider-Man: No Way Home,200000000,814115070,1907836254
45,"Jun 9, 2015",Jurassic World,215000000,652306625,1669963641
18,"Jul 11, 2019",The Lion King,260000000,543638043,1646106779
37,"Apr 25, 2012",The Avengers,225000000,623357910,1515100211


In [49]:
df[(df['Worldwide Gross'] == 0) & ((df['Release Date'] != 'Unknown') & (~df['Release Date'].str.contains('2024')))]

Unnamed: 0,Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
256,"Dec 13, 2019",6 Underground,150000000,0,0
374,"Nov 9, 2018",Outlaw King,120000000,0,0
375,"Dec 2, 2022",Emancipation,120000000,0,0
387,"Mar 6, 2019",Triple Frontier,115000000,0,0
500,"Jun 12, 2020",Artemis Fowl,100000000,0,0
...,...,...,...,...,...
6445,"Nov 25, 2011",The Ridges,17300,0,0
6459,"May 19, 2015",Family Motocross,10000,0,0
6465,"Mar 1, 2022",Red 11,7000,0,0
6468,"Sep 29, 2015",A Plague So Pleasant,1400,0,0


### Films that lost money

* What percentage of films had production costs that exceeded their worldwide gross?
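The calculation behind this question is a count of loss-making rows divided by the total number of rows. Here is a minimal sketch on four rows copied from the outputs above, written without pandas so the arithmetic is easy to check by hand; the helper name `share_of_losses` is hypothetical.

```python
# (movie, production budget, worldwide gross) taken from rows shown earlier
films = [
    ("Avatar: The Way of Water", 460_000_000, 2_319_591_720),
    ("Ballistic: Ecks vs. Sever", 70_000_000, 14_294_842),
    ("Julia", 6_000_000, 1_365_108),
    ("Caddyshack", 6_000_000, 39_849_764),
]


def share_of_losses(rows):
    """Percentage of rows whose worldwide gross is below the budget."""
    losses = sum(1 for _, budget, gross in rows if gross < budget)
    return losses / len(rows) * 100


print(share_of_losses(films))
```

Two of these four films earned less than they cost, so the sketch reports 50 percent for this tiny sample. In pandas, the mean of the boolean mask gives the same share: `(df['Worldwide Gross'] < df['Production Budget']).mean() * 100`.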

In [57]:
# shape is a tuple, so index it with [0]; multiply by 100 to get a percentage
df[df['Worldwide Gross'] < df['Production Budget']].shape[0] / df.shape[0] * 100