# Web scraping

## Task 1. 

### Part 1

We will parse the page with the latest news on [habr.com/ru/all/](https://habr.com/ru/all/).

Collect only the articles that contain at least one of the required keywords. The keywords are defined in a variable at the top of the code, for example:

`KEYWORDS = ['python', 'парсинг']`

The search covers all preview information, that is, everything available directly on the listing page.

The result should be a dataframe with the columns `<date> - <title> - <link>`.
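The keyword check can be sketched on its own, assuming the preview has already been reduced to plain text. The helper name `contains_keyword` and the sample preview strings are illustrative and not part of the task statement.

```python
KEYWORDS = ['python', 'парсинг']


def contains_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in text, ignoring case."""
    # casefold() is a stricter variant of lower() and also works for Cyrillic.
    haystack = text.casefold()
    return any(word.casefold() in haystack for word in keywords)


# Hypothetical preview texts, used only to illustrate the check.
previews = [
    'Метаморфозы Go: сможет ли язык одолеть Python',
    'Парсинг сайтов без боли',
    'Как мы переезжали в новый офис',
]
matches = [p for p in previews if contains_keyword(p)]
print(matches)
```

This is plain substring matching, so a keyword also matches inside longer words such as "pythonic". Running it on visible text rather than on raw markup avoids accidental hits in URLs and attribute values.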

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
res = requests.get('https://habr.com/ru/all/', timeout=10).text

In [3]:
KEYWORDS = ['python', 'парсинг']

In [4]:
soup = BeautifulSoup(res, 'html.parser')

In [5]:
articles = soup.find_all('article', class_='tm-articles-list__item')
articles

[<article class="tm-articles-list__item" data-navigatable="" id="656925" tabindex="0"><div class="tm-article-snippet"><div class="tm-article-snippet__meta-container"><div class="tm-article-snippet__meta"><span class="tm-user-info tm-article-snippet__author"><a class="tm-user-info__userpic" href="/ru/users/nawa_itools/" title="nawa_itools"><div class="tm-entity-image"><svg class="tm-svg-img tm-image-placeholder tm-image-placeholder_blue" height="24" width="24"><!-- --> <use xlink:href="/img/megazord-v25.4b679db1.svg#placeholder-user"></use></svg></div></a> <span class="tm-user-info__user"><a class="tm-user-info__username" href="/ru/users/nawa_itools/">
       nawa_itools
     </a> </span></span> <span class="tm-article-snippet__datetime-published"><time datetime="2022-03-22T14:17:57.000Z" title="2022-03-22, 17:17">сегодня в 17:17</time></span></div> <!-- --></div> <h2 class="tm-article-snippet__title tm-article-snippet__title_h2"><a class="tm-article-snippet__title-link" data-article-li

In [6]:
rows = []
for article in articles:
    # Skip list items that carry no article id.
    if not article.attrs.get('id'):
        continue
    # Search the whole preview markup, lower-cased, for any keyword.
    preview = str(article).lower()
    if any(word in preview for word in KEYWORDS):
        rows.append({
            'date': pd.to_datetime(article.time.get('title')),
            'header': article.h2.string,
            'link': 'https://habr.com' + article.h2.a.get('href'),
        })
# Build the frame once instead of concatenating inside the loop.
articles_data = pd.DataFrame(rows, columns=['date', 'header', 'link'])
articles_data

|  | date | header | link |
|---|---|---|---|
| 0 | 2022-03-22 17:17:00 | Метаморфозы Go: сможет ли язык одолеть Python ... | https://habr.com/ru/company/instinctools/blog/... |


### Part 2

Extend the script so that it searches the full text of each article, not just its preview.

To do this, fetch each article's page and search the text inside it.

Build the final dataframe with the columns `<date> - <header> - <link> - <article_text>`.
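One way to structure this step is to separate downloading from filtering, so that the filter can be tried without network access. In the sketch below, `collect_matching`, the `fetch_text` callable, and the sample pages are all hypothetical. In a real script `fetch_text` would wrap `requests.get` and BeautifulSoup.

```python
from urllib.parse import urljoin

BASE_URL = 'https://habr.com'
KEYWORDS = ['python', 'парсинг']


def collect_matching(previews, fetch_text, keywords=KEYWORDS):
    """Return rows for the articles whose full text mentions a keyword.

    previews   -- iterable of dicts with 'date', 'header' and 'href' keys
    fetch_text -- callable mapping an absolute URL to the article text
    """
    rows = []
    for preview in previews:
        # Resolve the relative href from the listing against the site root.
        link = urljoin(BASE_URL, preview['href'])
        # Collapse newlines and repeated spaces into single spaces.
        text = ' '.join(fetch_text(link).split())
        if any(word.lower() in text.lower() for word in keywords):
            rows.append({'date': preview['date'], 'header': preview['header'],
                         'link': link, 'article_text': text})
    return rows


# Stand-in for the network: made-up pages keyed by URL.
fake_pages = {
    'https://habr.com/ru/post/1/': 'Intro\n\nWe rewrote the service in Python.',
    'https://habr.com/ru/post/2/': 'Office news\nNothing technical here.',
}
previews = [
    {'date': '2022-03-22', 'header': 'Rewrite', 'href': '/ru/post/1/'},
    {'date': '2022-03-22', 'header': 'Office', 'href': '/ru/post/2/'},
]
rows = collect_matching(previews, fake_pages.__getitem__)
print(rows)
```

The returned list of dicts can be passed directly to `pd.DataFrame(rows)` to obtain the four-column frame.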

In [7]:
articles = soup.find_all('article', class_='tm-articles-list__item')

In [8]:
rows = []

In [9]:
for article in articles:
    # Skip list items that carry no article id, as in Part 1.
    if not article.attrs.get('id'):
        continue
    link = 'https://habr.com' + article.h2.a.get('href')
    res_article = requests.get(link, timeout=10).text
    soup_article = BeautifulSoup(res_article, 'html.parser')
    body = soup_article.find('div', class_='tm-article-body')
    # Skip pages without the expected body container instead of raising AttributeError.
    if body is None:
        continue
    article_text = body.text.strip().replace('\n', ' ')
    # Keep the article if any keyword occurs in the lower-cased full text.
    if any(word in article_text.lower() for word in KEYWORDS):
        rows.append({
            'date': pd.to_datetime(article.time.get('title')),
            'header': article.h2.string,
            'link': link,
            'article_text': article_text,
        })
articles_improved_data = pd.DataFrame(rows, columns=['date', 'header', 'link', 'article_text'])
articles_improved_data

|  | date | header | link | article_text |
|---|---|---|---|---|
| 0 | 2022-03-22 17:17:00 | Метаморфозы Go: сможет ли язык одолеть Python ... | https://habr.com/ru/company/instinctools/blog/... | Мы продолжаем публиковать выдержки из дискусси... |


## Task 2.

Write a script that checks a list of e-mail addresses for leaks using [Avast Hack Check](https://www.avast.com/hackcheck/).
The list of addresses is defined in a variable at the top of the code:
`EMAIL = ["xxx@x.ru", "yyy@y.com"]`

The result should be a dataframe with the columns `<mail> - <leak date> - <leak source> - <leak description>`.
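The flattening step can be sketched independently of the HTTP call. The response shape used here is an assumption: a `breaches` mapping keyed by breach id, and a `summary` mapping from each address to a list of breach ids. The helper name `flatten_breaches` and the sample response are hypothetical.

```python
from datetime import datetime


def flatten_breaches(result, emails):
    """Turn the nested response into one flat row per (email, breach) pair."""
    rows = []
    for email in emails:
        # Addresses missing from the summary are treated as having no known leaks.
        breach_ids = result['summary'].get(email, {}).get('breaches', [])
        for breach_id in breach_ids:
            breach = result['breaches'][str(breach_id)]
            # Assumes timestamps of the form 2020-01-03T00:00:00Z.
            published = datetime.strptime(breach['publishDate'], '%Y-%m-%dT%H:%M:%SZ')
            rows.append({'email': email, 'date': published.date(),
                         'source': breach['site'],
                         'description': breach['description']})
    return rows


# Made-up response in the assumed shape; not real breach data.
sample = {
    'breaches': {
        '1': {'breachId': 1, 'site': 'example.com',
              'description': 'Sample leak.',
              'publishDate': '2020-01-03T00:00:00Z'},
    },
    'summary': {'xxx@x.ru': {'breaches': [1]}},
}
rows = flatten_breaches(sample, ['xxx@x.ru', 'yyy@y.com'])
print(rows)
```

Each row already has the four required fields, so `pd.DataFrame(rows)` yields the target frame.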

In [10]:
import json

In [11]:
EMAIL = ["xxx@x.ru", "yyy@y.com"]

In [12]:
url_post = 'https://identityprotection.avast.com/v1/web/query/site-breaches/unauthorized-data'
# Content-Length is omitted: requests computes it from the actual payload.
headers = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8,ru;q=0.7",
    "Connection": "keep-alive",
    "Content-Type": "application/json;charset=UTF-8",
    "DNT": "1",
    "Host": "identityprotection.avast.com",
    "Origin": "https://www.avast.com",
    "Referer": "https://www.avast.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Vaar-Header-App-Build-Version": "1.0.0",
    "Vaar-Header-App-Product": "hackcheck-web-avast",
    "Vaar-Header-App-Product-Name": "hackcheck-web-avast",
    "Vaar-Version": "0"
}
# The request body is a JSON object holding the list of addresses to check.
payload = json.dumps({"emailAddresses": EMAIL})
req = requests.post(url_post, data=payload, headers=headers, timeout=10)

In [13]:
result = req.json()
result

{'breaches': {'17110': {'breachId': 17110,
   'site': 'azcentral.com',
   'recordsCount': 705538,
   'description': 'At an unconfirmed date, online Arizona newspaper "AZ Central" was allegedly breached. The stolen data contains passwords and email addresses. This breach is being publicly shared on the internet.',
   'publishDate': '2020-01-03T00:00:00Z',
   'statistics': {'usernames': 0, 'passwords': 702971, 'emails': 705538}},
  '37177': {'breachId': 37177,
   'site': 'forums.vkmonline.com',
   'recordsCount': 825654,
   'description': 'At an unconfirmed date, the Russian-language music forum VKM was allegedly breached. The stolen data contains passwords, IPs, email addresses, usernames and additional personal information. This breach is being privately shared on the internet.',
   'publishDate': '2021-02-11T00:00:00Z',
   'statistics': {'usernames': 825566, 'passwords': 825654, 'emails': 824936}},
  '41': {'breachId': 41,
   'site': 'dropbox.com',
   'recordsCount': 68591031,
   'des

In [14]:
rows = []
for email in EMAIL:
    # result['summary'][email]['breaches'] holds the ids of the breaches that include this address.
    for breach_id in result['summary'][email]['breaches']:
        breach = result['breaches'][str(breach_id)]
        rows.append({
            'email': email,
            # publishDate is an ISO 8601 timestamp, so no explicit format is needed.
            'date': pd.to_datetime(breach['publishDate']).date(),
            'source': breach['site'],
            'description': breach['description'],
        })
# Build the frame once; this also gives the rows a unique 0..n-1 index.
leak = pd.DataFrame(rows, columns=['email', 'date', 'source', 'description'])

In [15]:
leak

|  | email | date | source | description |
|---|---|---|---|---|
| 0 | xxx@x.ru | 2016-10-21 | adobe.com | In October of 2013, criminals penetrated Adobe... |
| 1 | xxx@x.ru | 2016-10-29 | vk.com | Popular Russian social networking platform VKo... |
| 2 | xxx@x.ru | 2016-10-23 | imesh.com | In June 2016, a cache of over 51 million user ... |
| 3 | xxx@x.ru | 2017-01-31 | cdprojektred.com | In March 2016, CDProjektRed.com.com's forum da... |
| 4 | xxx@x.ru | 2017-02-14 | cfire.mail.ru | In July and August of 2016, two criminals carr... |
| 5 | xxx@x.ru | 2017-02-14 | parapa.mail.ru | In July and August 2016, two criminals execute... |
| 6 | yyy@y.com | 2016-10-21 | linkedin.com | In 2012, online professional networking platfo... |
| 7 | yyy@y.com | 2016-10-21 | adobe.com | In October of 2013, criminals penetrated Adobe... |
| 8 | yyy@y.com | 2016-10-23 | imesh.com | In June 2016, a cache of over 51 million user ... |
| 9 | yyy@y.com | 2016-10-24 | dropbox.com | Cloud storage company Dropbox suffered a major... |
