# Web scraping

Web scraping is a technique for collecting data from the internet by automatically downloading content from websites. Dedicated programs or scripts go through web pages (their source code) and extract specific information such as text, images, tabular data, or other elements.

What web scraping can be used for:
- Market analysis: companies can collect data on competitors' product prices, customer reviews, or market trends.
- Scientific research: researchers can use web scraping to gather research data from various online sources.
- Social media monitoring: marketers can track activity on social platforms and analyze trends and user reactions.
- Building personalized websites: pages can be generated automatically from data collected from other sources.
- Generating reports and analyses: companies and data analysts can use web scraping to automatically produce reports and analyses based on data from the web.

What are the most common data sources for people working in data science?
- databases,
- CSV, JSON, HDF5, and PKL files,
- web scraping,
- APIs.

To better understand how web scraping works, we first need to learn the basics of how websites work.

Every time we open a web page, our browser sends a **request** to a server, which returns a **response**. If everything went well, the body of that response is the HTML of the page we are interested in. The HTML may reference other resources on the page (images, sounds, JavaScript files), for which the browser makes separate requests.

When we use a web browser, it is hard to overload a website. When running Python code, however, we can generate even thousands of requests per second, which can get us blocked or even overwhelm the server.
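
One simple way to stay polite is to pause between consecutive requests. The sketch below is only an illustration and not part of the original material: `fetch_politely`, `delay_seconds`, and `fake_fetch` are hypothetical names, and `fetch` stands for any function that downloads a single URL (for example `requests.get`).

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Stand-in for a real download function, so the sketch runs offline.
def fake_fetch(url):
    return f"response for {url}"


print(fetch_politely(["page-1", "page-2"], fake_fetch, delay_seconds=0.1))
```

A suitable delay depends on the site; one second is an arbitrary assumption here.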

The library we will use to make requests is simply called `requests`:
[Requests PyPi](https://pypi.org/project/requests/)

In [2]:
import requests

### GET request

`GET` requests are used to read data.

### Response

The result of a request is an object of type `Response`.

In [6]:
response = requests.get('https://api.github.com/')

In [7]:
response.status_code

200

One important attribute of `Response` is `status_code`, which tells us the outcome of the request. A `200 OK` response means the request succeeded and we received a result.

The meaning of other codes can be looked up here:
[Other response codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)

What happens when we send a request to an invalid address?

In [8]:
response = requests.get('https://api.github.com/invalid')
response.status_code

404
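
The two codes seen so far, 200 and 404, can also be looked up without leaving Python. This small sketch uses the standard-library `http.HTTPStatus` enum; the helper name `describe_status` is made up for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the numeric code together with its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


for code in (200, 404):
    print(describe_status(code))
# 200 OK
# 404 Not Found
```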

To inspect the body of the response, we use the `content` attribute, which holds the raw bytes:

In [32]:
response = requests.get('https://api.github.com')
response.content

b'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","label_sea
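
Note the `b'...'` prefix: `content` is a `bytes` object, not a string. The sketch below shows the two conversion steps by hand on a shortened copy of that output (only the first key is kept, which is an assumption made to keep the example small). `requests` also offers `response.text` and `response.json()` as shortcuts for these steps.

```python
import json

# A shortened copy of the bytes shown above (only the first key is kept).
raw = b'{"current_user_url":"https://api.github.com/user"}'

text = raw.decode("utf-8")  # bytes -> str
data = json.loads(text)     # str -> dict

print(type(raw).__name__, type(text).__name__, type(data).__name__)
print(data["current_user_url"])
```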

Besides `content`, we can also inspect the response headers using `headers`:

In [10]:
response.headers

{'Server': 'GitHub.com', 'Date': 'Wed, 13 Sep 2023 16:56:28 GMT', 'Content-Type': 'application/json; charset=utf-8', 'X-GitHub-Media-Type': 'github.v3; format=json', 'x-github-api-version-selected': '2022-11-28', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Vary': 'Accept-Encoding, Accept, X-Requested-With', 'Content-Encoding': 'gzip', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '48

The headers include information such as `Content-Type`, which tells us the type of the response:

In [11]:
response.headers['Content-Type']

'application/json; charset=utf-8'
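
A `Content-Type` value like this one combines a media type with optional parameters such as the character set. As a minimal sketch, the standard-library `email.message.Message` class can split it apart; using it here as a header parser is a convenience chosen for illustration, not something `requests` requires.

```python
from email.message import Message

header_value = "application/json; charset=utf-8"

# Message is used here only as a convenient parser for MIME-style headers.
message = Message()
message["Content-Type"] = header_value

print(message.get_content_type())     # application/json
print(message.get_content_charset())  # utf-8
```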

### Querying an API

`GET` requests can be parameterized with a query string.

The request below asks the GitHub API for repositories written in Python.

[GitHub API documentation](https://docs.github.com/en/rest)

In [12]:
response = requests.get(
    'https://api.github.com/search/repositories',
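    # Caution: a literal '+' inside params is percent-encoded as %2B, so it does
    # not separate search terms; a space ('requests language:python') would.
    params={'q': 'requests+language:python'},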
)

json_response = response.json()
repository = json_response['items'][0]

repository

{'id': 33210074,
 'node_id': 'MDEwOlJlcG9zaXRvcnkzMzIxMDA3NA==',
 'name': 'secrules-language-evaluation',
 'full_name': 'SpiderLabs/secrules-language-evaluation',
 'private': False,
 'owner': {'login': 'SpiderLabs',
  'id': 508521,
  'node_id': 'MDEyOk9yZ2FuaXphdGlvbjUwODUyMQ==',
  'avatar_url': 'https://avatars.githubusercontent.com/u/508521?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/SpiderLabs',
  'html_url': 'https://github.com/SpiderLabs',
  'followers_url': 'https://api.github.com/users/SpiderLabs/followers',
  'following_url': 'https://api.github.com/users/SpiderLabs/following{/other_user}',
  'gists_url': 'https://api.github.com/users/SpiderLabs/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/SpiderLabs/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/SpiderLabs/subscriptions',
  'organizations_url': 'https://api.github.com/users/SpiderLabs/orgs',
  'repos_url': 'https://api.github.com/users/SpiderLabs/repos',


### Saving HTML to a file

To avoid overusing a site's resources and getting blocked, it is a good idea to save HTML responses to a file and work on them offline:

In [20]:
res = requests.get('https://api.github.com', timeout=1)

In [17]:
def save_html(html_content, file_path):
    with open(file_path, 'wb') as f:
        f.write(html_content)


save_html(res.content, 'my_file')

In [18]:
def open_html(file_path):
    # Read as text with an explicit encoding instead of the platform default.
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()


html = open_html('my_file')
html

'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","label_sear
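
Saving and loading can be combined into a small cache, so the network is only touched when no saved copy exists. This is a sketch under stated assumptions: `load_or_fetch` and `fake_download` are hypothetical names, and the download is faked so the example runs offline.

```python
import tempfile
from pathlib import Path


def load_or_fetch(file_path, fetch):
    """Return cached bytes from file_path, calling fetch() only on a cache miss."""
    path = Path(file_path)
    if path.exists():
        return path.read_bytes()
    content = fetch()
    path.write_bytes(content)
    return content


# Demo with a stand-in download function, so the sketch runs offline.
calls = []


def fake_download():
    calls.append("downloaded")
    return b"<html><body>cached page</body></html>"


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "page.html"
    first = load_or_fetch(cache_file, fake_download)   # downloads and saves
    second = load_or_fetch(cache_file, fake_download)  # reads the saved file

print(first == second, len(calls))
```

With `requests`, the `fetch` argument could be something like `lambda: requests.get(url, timeout=1).content`.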

## BeautifulSoup

Using `requests` we can get the HTML of a given web page. However, the data we receive is usually complex and extensive, and reading information out of it by hand can be very difficult. To make this easier we can use the `beautifulsoup4` package, which lets us parse HTML and XML files.
[Beautifulsoup PyPi](https://pypi.org/project/beautifulsoup4/)

In [31]:
import requests

response = requests.get('https://dataquestio.github.io/web-scraping-pages/simple.html')

In [32]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')

In [33]:
response.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [30]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>



So that we do not have to search the page manually, `bs4` provides methods for finding elements by HTML tag, CSS class, and id:

In [37]:
soup.find_all('p')  # find all elements with the <p> tag

[<p>Here is some simple content for this page.</p>]

In [40]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

In [41]:
response = requests.get('https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html')
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>



Finding elements with the `<p>` tag and the id `first`:

In [44]:
soup.find_all(name='p', id="first")[0].get_text()

'\n                First paragraph.\n            '

Finding elements with the `<p>` tag and the class `outer-text`:

In [50]:
soup.find_all(name='p', class_='outer-text')[1].get_text()

'\n\n                Second outer paragraph.\n            \n'
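
To make clear what a call like `find_all(name='p', class_='outer-text')` does conceptually, here is a rough standard-library sketch. It is not how BeautifulSoup is implemented; `ParagraphCollector` is a hypothetical class, and the HTML is a whitespace-free copy of the example page above.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element that carries a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may list several classes separated by spaces.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and self.wanted_class in classes:
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        # Text of nested tags such as <b> is collected as well.
        if self.inside:
            self.texts[-1] += data


page_html = (
    '<div><p class="inner-text first-item" id="first">First paragraph.</p>'
    '<p class="inner-text">Second paragraph.</p></div>'
    '<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>'
    '<p class="outer-text"><b>Second outer paragraph.</b></p>'
)

collector = ParagraphCollector("outer-text")
collector.feed(page_html)
print(collector.texts)
# ['First outer paragraph.', 'Second outer paragraph.']
```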

```css
p {
    color: red
}

.first-item {
    color: blue
}

#first {
    color: green
}
```

In [72]:
soup.select("p.first-item")[1].get_text()

'\n\n                First outer paragraph.\n            \n'

## Exercise 1

### Scraping a weather website

We will try to scrape the weather forecast for the next 7 days in San Francisco.
We will use this page: [Forecast weather gov](https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168)

Looking at its source, we can see that we are interested in the part tagged with the id `seven-day-forecast`. The individual forecasts are marked with the class `tombstone-container`.

In [76]:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168')
soup = BeautifulSoup(page.content, 'html.parser')

seven_day = soup.find(id="seven-day-forecast-container")
forecast_items = seven_day.find_all(class_="tombstone-container")
# forecast_items = seven_day.select(".tombstone-container")
# result = soup.select('#seven-day-forecast-container .tombstone-container')  # the space means "descendant of"
print(forecast_items[1].prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Increasing clouds, with a low around 57. West southwest wind 10 to 15 mph, with gusts as high as 18 mph. " class="forecast-icon" src="newimages/medium/nbkn.png" title="Tonight: Increasing clouds, with a low around 57. West southwest wind 10 to 15 mph, with gusts as high as 18 mph. "/>
 </p>
 <p class="short-desc">
  Increasing
  <br/>
  Clouds
 </p>
 <p class="temp temp-low">
  Low: 57 °F
 </p>
</div>



In [86]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday']
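
Some names come out glued together ('ThursdayNight') because the two words are separated only by a `<br/>` tag, which contributes no text. BeautifulSoup's `get_text()` accepts a separator argument that can avoid this at the source. As an alternative, the sketch below repairs such strings afterwards with a regular expression; `split_glued_words` is a hypothetical helper, and the rule it applies (a lowercase letter followed directly by an uppercase one) is an assumption that fits this data.

```python
import re

periods = ["Today", "Tonight", "Thursday", "ThursdayNight", "FridayNight"]


def split_glued_words(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)


print([split_glued_words(p) for p in periods])
# ['Today', 'Tonight', 'Thursday', 'Thursday Night', 'Friday Night']
```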

Using different classes, we can extract the individual pieces of information:

In [95]:
short_desc = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
desc = [img["title"] for img in seven_day.select(".tombstone-container img")]
print(short_desc)
print(temps)
print(desc)

['Sunny', 'IncreasingClouds', 'Sunny', 'IncreasingClouds', 'Mostly Sunny', 'Mostly Cloudy', 'Mostly Sunny', 'Partly Cloudy', 'Mostly Sunny']
['High: 72 °F', 'Low: 57 °F', 'High: 70 °F', 'Low: 59 °F', 'High: 70 °F', 'Low: 59 °F', 'High: 69 °F', 'Low: 59 °F', 'High: 69 °F']
['Today: Sunny, with a high near 72. Light south southwest wind becoming west southwest 11 to 16 mph in the afternoon. Winds could gust as high as 21 mph. ', 'Tonight: Increasing clouds, with a low around 57. West southwest wind 10 to 15 mph, with gusts as high as 18 mph. ', 'Thursday: Sunny, with a high near 70. West southwest wind 9 to 16 mph, with gusts as high as 20 mph. ', 'Thursday Night: Increasing clouds, with a low around 59. West southwest wind 9 to 16 mph, with gusts as high as 21 mph. ', 'Friday: Mostly sunny, with a high near 70. Southwest wind 6 to 10 mph. ', 'Friday Night: Mostly cloudy, with a low around 59.', 'Saturday: Mostly sunny, with a high near 69.', 'Saturday Night: Partly cloudy, with a low ar

We can load the results into pandas and build a DataFrame from them:

In [96]:
import pandas as pd

weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_desc,
    "temps": temps,
    "desc": desc
})

weather

| | period | short_desc | temps | desc |
|---|---|---|---|---|
| 0 | Today | Sunny | High: 72 °F | Today: Sunny, with a high near 72. Light south... |
| 1 | Tonight | IncreasingClouds | Low: 57 °F | Tonight: Increasing clouds, with a low around ... |
| 2 | Thursday | Sunny | High: 70 °F | Thursday: Sunny, with a high near 70. West sou... |
| 3 | ThursdayNight | IncreasingClouds | Low: 59 °F | Thursday Night: Increasing clouds, with a low ... |
| 4 | Friday | Mostly Sunny | High: 70 °F | Friday: Mostly sunny, with a high near 70. Sou... |
| 5 | FridayNight | Mostly Cloudy | Low: 59 °F | Friday Night: Mostly cloudy, with a low around... |
| 6 | Saturday | Mostly Sunny | High: 69 °F | Saturday: Mostly sunny, with a high near 69. |
| 7 | SaturdayNight | Partly Cloudy | Low: 59 °F | Saturday Night: Partly cloudy, with a low arou... |
| 8 | Sunday | Mostly Sunny | High: 69 °F | Sunday: Mostly sunny, with a high near 69. |
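
Building a DataFrame from a dictionary of lists only works when all lists have the same length. If one forecast tile lacked an element, the lists would no longer line up. The sketch below is a hypothetical pre-check (the name `check_equal_lengths` is made up) that could be run before constructing the DataFrame; it uses shortened sample lists.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the given lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Shortened sample data in the same shape as the scraped lists.
periods = ["Today", "Tonight", "Thursday"]
temps = ["High: 72 °F", "Low: 57 °F", "High: 70 °F"]

print(check_equal_lengths(period=periods, temps=temps))
# {'period': 3, 'temps': 3}
```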


Next we extract the numeric temperature using regular expressions:

More about regular expressions: [Regex](https://regexone.com/)

In [104]:
temp_nums = weather.loc[:, "temps"].str.extract(r"(\d+)", expand=False)
# \d matches a single digit 0-9, and + means one or more such digits in a row
weather["temp_num"] = temp_nums.astype('int')
weather

| | period | short_desc | temps | desc | temp_num |
|---|---|---|---|---|---|
| 0 | Today | Sunny | High: 72 °F | Today: Sunny, with a high near 72. Light south... | 72 |
| 1 | Tonight | IncreasingClouds | Low: 57 °F | Tonight: Increasing clouds, with a low around ... | 57 |
| 2 | Thursday | Sunny | High: 70 °F | Thursday: Sunny, with a high near 70. West sou... | 70 |
| 3 | ThursdayNight | IncreasingClouds | Low: 59 °F | Thursday Night: Increasing clouds, with a low ... | 59 |
| 4 | Friday | Mostly Sunny | High: 70 °F | Friday: Mostly sunny, with a high near 70. Sou... | 70 |
| 5 | FridayNight | Mostly Cloudy | Low: 59 °F | Friday Night: Mostly cloudy, with a low around... | 59 |
| 6 | Saturday | Mostly Sunny | High: 69 °F | Saturday: Mostly sunny, with a high near 69. | 69 |
| 7 | SaturdayNight | Partly Cloudy | Low: 59 °F | Saturday Night: Partly cloudy, with a low arou... | 59 |
| 8 | Sunday | Mostly Sunny | High: 69 °F | Sunday: Mostly sunny, with a high near 69. | 69 |
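
The same pattern can be tried on plain strings with the standard-library `re` module, which makes it easier to see what `(\d+)` captures. This is a minimal sketch on a few of the values above; `extract_number` is a hypothetical helper.

```python
import re

temps = ["High: 72 °F", "Low: 57 °F", "High: 70 °F"]


def extract_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"(\d+)", text)
    return int(match.group(1)) if match else None


print([extract_number(t) for t in temps])
# [72, 57, 70]
```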


To find only the nighttime forecasts, we can use the `temps` column and select the rows that contain the word "Low".

In [106]:
is_night = weather.loc[:, "temps"].str.contains("Low")
weather["is_night"] = is_night
weather

| | period | short_desc | temps | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 0 | Today | Sunny | High: 72 °F | Today: Sunny, with a high near 72. Light south... | 72 | False |
| 1 | Tonight | IncreasingClouds | Low: 57 °F | Tonight: Increasing clouds, with a low around ... | 57 | True |
| 2 | Thursday | Sunny | High: 70 °F | Thursday: Sunny, with a high near 70. West sou... | 70 | False |
| 3 | ThursdayNight | IncreasingClouds | Low: 59 °F | Thursday Night: Increasing clouds, with a low ... | 59 | True |
| 4 | Friday | Mostly Sunny | High: 70 °F | Friday: Mostly sunny, with a high near 70. Sou... | 70 | False |
| 5 | FridayNight | Mostly Cloudy | Low: 59 °F | Friday Night: Mostly cloudy, with a low around... | 59 | True |
| 6 | Saturday | Mostly Sunny | High: 69 °F | Saturday: Mostly sunny, with a high near 69. | 69 | False |
| 7 | SaturdayNight | Partly Cloudy | Low: 59 °F | Saturday Night: Partly cloudy, with a low arou... | 59 | True |
| 8 | Sunday | Mostly Sunny | High: 69 °F | Sunday: Mostly sunny, with a high near 69. | 69 | False |


In [107]:
weather[is_night]

| | period | short_desc | temps | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 1 | Tonight | IncreasingClouds | Low: 57 °F | Tonight: Increasing clouds, with a low around ... | 57 | True |
| 3 | ThursdayNight | IncreasingClouds | Low: 59 °F | Thursday Night: Increasing clouds, with a low ... | 59 | True |
| 5 | FridayNight | Mostly Cloudy | Low: 59 °F | Friday Night: Mostly cloudy, with a low around... | 59 | True |
| 7 | SaturdayNight | Partly Cloudy | Low: 59 °F | Saturday Night: Partly cloudy, with a low arou... | 59 | True |


In [67]:
weather[~is_night]

| | period | short_desc | temps | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 0 | LaborDay | BecomingSunny | High: 70 °F | Labor Day: Partly sunny, then gradually becomi... | 70 | False |
| 2 | Tuesday | Sunny | High: 70 °F | Tuesday: Sunny, with a high near 70. Southwest... | 70 | False |
| 4 | Wednesday | Mostly Sunny | High: 71 °F | Wednesday: Mostly sunny, with a high near 71. ... | 71 | False |
| 6 | Thursday | Mostly Sunny | High: 71 °F | Thursday: Mostly sunny, with a high near 71. | 71 | False |
| 8 | Friday | Mostly Sunny | High: 71 °F | Friday: Mostly sunny, with a high near 71. | 71 | False |
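
Filtering with `weather[is_night]` and `weather[~is_night]` is boolean masking: one True/False value per row decides whether the row is kept. The plain-Python sketch below mimics the idea on shortened sample lists; it illustrates the concept and is not how pandas implements it.

```python
periods = ["Today", "Tonight", "Thursday", "ThursdayNight"]
temps = ["High: 72 °F", "Low: 57 °F", "High: 70 °F", "Low: 59 °F"]

# One boolean per row, like the is_night Series.
is_night = ["Low" in t for t in temps]

# Keep the rows where the mask is True (like weather[is_night]) ...
night_periods = [p for p, night in zip(periods, is_night) if night]
# ... or where it is False (like weather[~is_night]).
day_periods = [p for p, night in zip(periods, is_night) if not night]

print(is_night)       # [False, True, False, True]
print(night_periods)  # ['Tonight', 'ThursdayNight']
print(day_periods)    # ['Today', 'Thursday']
```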


## Exercise 2

Another web scraping example is scraping a site that maintains a ranking of media objectivity in the United States. The page with this data is [AllSides Media Bias Ratings](https://www.allsides.com/media-bias)

Try to scrape the table from this page. You can use the code below as a guide:

In [49]:
import requests

url = 'https://www.allsides.com/media-bias/media-bias-ratings'
r = requests.get(url)

print(r.content[:100])

b'<!DOCTYPE html>\n<html  lang="en" dir="ltr" prefix="og: http://ogp.me/ns# content: http://purl.org/rs'


In [73]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html.parser')

In [74]:
rows = soup.select('tbody tr')

In [75]:
row = rows[0]

name = row.select_one('.views-field').text.strip()

print(name)

ABC News (Online)


In [76]:
allsides_page = row.select_one('.views-field a')['href']
allsides_page = 'https://www.allsides.com' + allsides_page

print(allsides_page)

https://www.allsides.com/news-source/abc-news-media-bias


In [78]:
agree = row.select_one('.agree').text
agree = int(agree)

disagree = row.select_one('.disagree').text
disagree = int(disagree)

agree_ratio = agree / disagree

print(f"Agree: {agree}, Disagree: {disagree}, Ratio {agree_ratio:.2f}")

Agree: 35455, Disagree: 17959, Ratio 1.97
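
The division above fails with `ZeroDivisionError` if a source has no disagree votes. A minimal guard might look like the sketch below; `safe_ratio` is a hypothetical helper, and returning `None` in that case is just one possible choice that the calling code would then have to handle.

```python
def safe_ratio(agree, disagree):
    """Return agree / disagree, or None when there are no disagree votes."""
    if disagree == 0:
        return None
    return agree / disagree


print(f"{safe_ratio(35455, 17959):.2f}")  # 1.97
print(safe_ratio(10, 0))                  # None
```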


In [80]:
def get_agreeance_text(ratio):
    if ratio > 3:
        return "absolutely agrees"
    elif 2 < ratio <= 3:
        return "strongly agrees"
    elif 1.5 < ratio <= 2:
        return "agrees"
    elif 1 < ratio <= 1.5:
        return "somewhat agrees"
    elif ratio == 1:
        return "neutral"
    elif 0.67 < ratio < 1:
        return "somewhat disagrees"
    elif 0.5 < ratio <= 0.67:
        return "disagrees"
    elif 0.33 < ratio <= 0.5:
        return "strongly disagrees"
    elif ratio <= 0.33:
        return "absolutely disagrees"
    else:
        return None


print(get_agreeance_text(2.5))

strongly agrees
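
The same mapping can also be written as a lookup in a sorted threshold table, which keeps the boundaries in one place. This is an alternative sketch, not the tutorial's method: `agreeance_from_table`, `THRESHOLDS`, and `LABELS` are hypothetical names, and the function assumes `ratio` is an ordinary number (it has no `None` fallback).

```python
from bisect import bisect_left

# Upper bounds of the intervals and the label for each interval, lowest first.
THRESHOLDS = [0.33, 0.5, 0.67, 1, 1.5, 2, 3]
LABELS = [
    "absolutely disagrees", "strongly disagrees", "disagrees",
    "somewhat disagrees", "somewhat agrees", "agrees",
    "strongly agrees", "absolutely agrees",
]


def agreeance_from_table(ratio):
    """Map a ratio to a label using a sorted threshold table."""
    if ratio == 1:
        return "neutral"
    # bisect_left counts the thresholds strictly below ratio, which yields
    # intervals of the form (lower, upper], as in the if/elif chain.
    return LABELS[bisect_left(THRESHOLDS, ratio)]


print(agreeance_from_table(2.5))  # strongly agrees
```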


In [81]:
data = []

for row in rows:
    d = dict()

    d['name'] = row.select_one('.source-title').text.strip()
    d['allsides_page'] = 'https://www.allsides.com' + row.select_one('.source-title a')['href']
    d['bias'] = row.select_one('.views-field-field-bias-image a')['href'].split('/')[-1]
    d['agree'] = int(row.select_one('.agree').text)
    d['disagree'] = int(row.select_one('.disagree').text)
    d['agree_ratio'] = d['agree'] / d['disagree']
    d['agreeance_text'] = get_agreeance_text(d['agree_ratio'])

    data.append(d)

In [84]:
pages = [
    'https://www.allsides.com/media-bias/media-bias-ratings',
]

In [85]:
from time import sleep

data = []

for page in pages:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    rows = soup.select('tbody tr')

    for row in rows:
        d = dict()

        d['name'] = row.select_one('.source-title').text.strip()
        d['allsides_page'] = 'https://www.allsides.com' + row.select_one('.source-title a')['href']
        d['bias'] = row.select_one('.views-field-field-bias-image a')['href'].split('/')[-1]
        d['agree'] = int(row.select_one('.agree').text)
        d['disagree'] = int(row.select_one('.disagree').text)
        d['agree_ratio'] = d['agree'] / d['disagree']
        d['agreeance_text'] = get_agreeance_text(d['agree_ratio'])

        data.append(d)

    sleep(10)  # pause between pages so we do not overload the server

In [86]:
data

[{'name': 'ABC News (Online)',
  'allsides_page': 'https://www.allsides.com/news-source/abc-news-media-bias',
  'bias': 'left-center',
  'agree': 35455,
  'disagree': 17959,
  'agree_ratio': 1.9742190545130576,
  'agreeance_text': 'agrees'},
 {'name': 'AlterNet',
  'allsides_page': 'https://www.allsides.com/news-source/alternet-media-bias',
  'bias': 'left',
  'agree': 13706,
  'disagree': 2968,
  'agree_ratio': 4.617924528301887,
  'agreeance_text': 'absolutely agrees'},
 {'name': 'Associated Press',
  'allsides_page': 'https://www.allsides.com/news-source/associated-press-media-bias',
  'bias': 'center',
  'agree': 26761,
  'disagree': 20602,
  'agree_ratio': 1.2989515581011553,
  'agreeance_text': 'somewhat agrees'},
 {'name': 'Axios',
  'allsides_page': 'https://www.allsides.com/news-source/axios',
  'bias': 'center',
  'agree': 6092,
  'disagree': 6462,
  'agree_ratio': 0.9427421850820179,
  'agreeance_text': 'somewhat disagrees'},
 {'name': 'BBC News',
  'allsides_page': 'https:/

In [88]:
import pandas as pd

df = pd.DataFrame(data)

In [89]:
df['total_votes'] = df['agree'] + df['disagree']
df.sort_values('total_votes', ascending=False, inplace=True)

df.head(10)

| | name | allsides_page | bias | agree | disagree | agree_ratio | agreeance_text | total_votes |
|---|---|---|---|---|---|---|---|---|
| 44 | TheBlaze.com | https://www.allsides.com/news-source/theblaze-... | right | 99685 | 80239 | 1.242351 | somewhat agrees | 179924 |
| 10 | CNN (Online News) | https://www.allsides.com/news-source/cnn-media... | left | 50943 | 47804 | 1.065664 | somewhat agrees | 98747 |
| 16 | Fox News (Online News) | https://www.allsides.com/news-source/fox-news-... | right | 41844 | 47667 | 0.87784 | somewhat disagrees | 89511 |
| 24 | New York Times (News) | https://www.allsides.com/news-source/new-york-... | left-center | 28942 | 38784 | 0.746236 | somewhat disagrees | 67726 |
| 27 | NPR (Online News) | https://www.allsides.com/news-source/npr-media... | center | 31634 | 30014 | 1.053975 | somewhat agrees | 61648 |
| 18 | HuffPost | https://www.allsides.com/news-source/huffpost-... | left | 35462 | 22240 | 1.594514 | agrees | 57702 |
| 4 | BBC News | https://www.allsides.com/news-source/bbc-news-... | center | 29300 | 25126 | 1.166123 | somewhat agrees | 54426 |
| 29 | Politico | https://www.allsides.com/news-source/politico-... | left-center | 23363 | 30184 | 0.774019 | somewhat disagrees | 53547 |
| 0 | ABC News (Online) | https://www.allsides.com/news-source/abc-news-... | left-center | 35455 | 17959 | 1.974219 | agrees | 53414 |
| 6 | Breitbart News | https://www.allsides.com/news-source/breitbart | right | 39306 | 11241 | 3.496664 | absolutely agrees | 50547 |


## Exercise 3

Scrape data from the [IMDb](https://www.imdb.com/search/title/?title_type=feature&sort=num_votes,desc) page listing the top movies sorted by number of votes. Build a DataFrame with the top 100 movies, their ratings, description, runtime, directors, votes, and actors. Find out:
- how many movies are rated above 8 stars,
- which director has the most movies in the TOP 100,
- who the TOP 5 most frequently appearing actors are,
- how many movies have more than 1_500_000 votes,
- what the average runtime of the TOP 100 movies is, and which movie is the longest and which is the shortest.