# Homework 2

### Description:
- parse https://rusdisinfo.voxukraine.org/
- write a function to parse news pages
- wrap it in a Flask app, so that when you pass the URL, you receive a list of news back
- wrap it in a Docker container

### Step 1. Import libraries

In [39]:
import time
import requests
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup, SoupStrainer

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

### Step 2. Website analysis and variables declaration

For this assignment we are going to parse data from **'https://rusdisinfo.voxukraine.org'**. [2]

![](images/main-page.png)

Note that clicking one of these hashtags opens the corresponding subgroup (narrative). We are going to parse all of them by collecting their URL suffixes and using them to form the request URLs. The link to each group looks like `<url>/narratives/<hashtag>`.
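
As a minimal sketch of how such a link is put together, the standard-library `urllib.parse` helpers can encode a narrative title into a URL suffix and decode it back. The helper names `narrative_url` and `narrative_title` are hypothetical, and `quote` may escape a slightly different set of characters than the site does (for example, it escapes apostrophes), so the scraped `href` values remain the safer input for real requests.

```python
from urllib.parse import quote, unquote

BASE_URL = 'https://rusdisinfo.voxukraine.org'
NARRATIVES_PREFIX = '/narratives/'


def narrative_url(title):
    """Build '<url>/narratives/<percent-encoded title>' (hypothetical helper)."""
    return BASE_URL + NARRATIVES_PREFIX + quote(title)


def narrative_title(path):
    """Recover the readable title from a '/narratives/<encoded title>' path."""
    return unquote(path[len(NARRATIVES_PREFIX):])


print(narrative_url('Nord Stream-2'))
print(narrative_title('/narratives/Ukraine%20is%20a%20failed%20state'))
```

Decoding is mostly useful for producing readable labels; the encoded suffix can be appended to the base URL unchanged.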

Also, we should take into account that this website is dynamic.

**Dynamic websites** produce their content in response to user actions. For example, when a page finishes loading only after the user scrolls down or moves the mouse over the screen, some client-side scripting must be behind it. [1]

In [40]:
url = 'https://rusdisinfo.voxukraine.org'
narratives = '/narratives'
href = 'href'

sub_headers = []
h2s = []

fake_news = set()
true_news = set()

Here we check whether a link is a valid sub-header link before adding it to our list.

In [41]:
def is_valid_subheader(link):
    return link.has_attr(href) \
           and link[href].startswith(narratives) \
           and len(link[href]) != len(narratives)
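
To make the three conditions concrete, here is a small sketch that applies the same logic to plain `href` strings instead of `BeautifulSoup` tags. The function name `is_valid_narrative_href` and the sample values are made up for illustration.

```python
NARRATIVES_PATH = '/narratives'


def is_valid_narrative_href(href_value):
    """Apply the same three checks to a plain href string (or None)."""
    return (
        href_value is not None                          # the link has an href at all
        and href_value.startswith(NARRATIVES_PATH)      # it points into /narratives
        and len(href_value) != len(NARRATIVES_PATH)     # it is not the bare /narratives link
    )


samples = [None, '/about', '/narratives', '/narratives/Nord%20Stream-2']
for sample in samples:
    print(repr(sample), '->', is_valid_narrative_href(sample))
```

Only the last sample passes. A link such as `/narratives/` with nothing after the trailing slash would also pass, exactly as it does with the original check.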

Now it is time to collect all sub-headers. We use `BeautifulSoup` for now, because we only need to parse HTML and this part of the page has no dynamic content. [5]

In [42]:
response = requests.get(url, timeout=10)  # do not hang forever on a stalled connection

if response.status_code == 200:
    # parse the HTML content of the page and get <a>...</a>
    soup = BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a'))

    for link in soup:
        if is_valid_subheader(link):
            sub_headers.append((link['href'], link.text[2:]))
else:
    raise Exception('Failed to fetch the page')

In [43]:
result = '\n'.join([item[0] for item in sub_headers])
print(f'Here are the sub-headers we are going to parse:\n{result}')

Here are the sub-headers we are going to parse:
/narratives/Maidan%20in%202014%20was%20a%20coup%20d'%C3%A9tat
/narratives/Nazis%20and%20ultra-nationalists%20in%20Ukraine
/narratives/Nord%20Stream-2
/narratives/Russia%20needs%20to%20protect%20itself
/narratives/Ukraine%20is%20an%20illiberal%20state
/narratives/Ukraine%20is%20a%20failed%20state
/narratives/The%20West%20controls%20Ukraine%20and%20uses%20it%20to%20its%20advantage
/narratives/The%20West%20is%20not%20interested%20in%20dealing%20with%20Ukraine,%20moreover,%20solving%20its%20problems.
/narratives/Russians%20are%20discriminated%20in%20Ukraine
/narratives/Russia%20is%20not%20an%20agressor%20towards%20Ukraine
/narratives/Ukraine%20is%20conducting%20an%20aggressive%20policy
/narratives/Schismatic%20Ukrainian%20Church
/narratives/Russia%20is%20not%20involved%20in%20MH17%20crash;%20Ukraine%20shot%20down%20the%20plane


In [44]:
def get_driver():
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-setuid-sandbox')
    return webdriver.Chrome(options=chrome_options)

### Step 3. Collect fake and true news



If we inspect the structure of a narrative page, we will see that it looks like this:
```text
<h3>Fake news title</h3>
<h4>True information</h4>
```
So we are going to store the titles in our fake news dataset and the accompanying text in our true news dataset.
As the parser we take `selenium`, because it lets us parse dynamic pages. [4, 5]
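
To illustrate the title/debunking split on its own, the sketch below separates `<h3>` and `<h4>` text from a static snippet. It uses the standard-library `html.parser` rather than Selenium, so it only covers markup that is already present in the HTML; the class name `NarrativeParser` and the sample markup are invented for this example.

```python
from html.parser import HTMLParser


class NarrativeParser(HTMLParser):
    """Collect <h3> text as fake titles and <h4> text as debunkings."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.fakes = []
        self.debunkings = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h3', 'h4'):
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self.current_tag == 'h3':
            self.fakes.append(text)
        elif self.current_tag == 'h4':
            self.debunkings.append(text)


sample_html = '''
<h3>Fake news title A</h3><h4>True information A</h4>
<h3>Fake news title B</h3><h4>True information B</h4>
'''

parser = NarrativeParser()
parser.feed(sample_html)
print(parser.fakes)
print(parser.debunkings)
```

On the real site the debunking text appears only after a click, which is why the notebook drives a browser instead.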

In [45]:
driver = get_driver()

for sub_header, label in sub_headers:
    to_parse = url + sub_header  # '<url>/narratives/<encoded narrative title>'
    driver.get(to_parse)
    titles = driver.find_elements(By.XPATH, '//*[@class="Narrative_fakeLink___YbTe"]')
    print(f'Number of titles: {len(titles)}')

    for element in titles:
        # Open dynamic link
        element.click()

        # Uncomment the next line if a time delay is needed after the click
        # time.sleep(1)

        fake_news.add((element.text, label))

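        # Note: an XPath starting with '//' searches the whole page, not only
        # inside `element`; the set removes the repeated matches this produces
        texts = element.find_elements(By.XPATH, '//*[@class="Narrative_Debunking__gRBl1"]')
        for text in texts:
            true_news.add((text.text, label))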

Number of titles: 6
Number of titles: 8
Number of titles: 5
Number of titles: 8
Number of titles: 3
Number of titles: 3
Number of titles: 11
Number of titles: 4
Number of titles: 4
Number of titles: 4
Number of titles: 9
Number of titles: 3
Number of titles: 1


### Step 4. Save collected data to our dataset

In [48]:
fake_news_result, fake_news_labels = zip(*fake_news)
fakes = pd.DataFrame({
    'text': fake_news_result,
    'label': fake_news_labels,
    'is_fake': 1
})

true_news_result, true_news_labels = zip(*true_news)
trues = pd.DataFrame({
    'text': true_news_result,
    'label': true_news_labels,
    'is_fake': 0
})

df = pd.concat([fakes, trues], ignore_index=True)
df.to_csv('data/data.csv')
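
The `zip(*...)` calls above transpose a set of `(text, label)` pairs into two parallel tuples. As a small sketch of that step, the hypothetical helper `split_pairs` below does the same and also handles an empty collection, which would otherwise raise a `ValueError` when unpacked into two names. The sample pairs are made up.

```python
def split_pairs(pairs):
    """Transpose an iterable of (text, label) pairs into two tuples."""
    pairs = list(pairs)
    if not pairs:
        # zip(*[]) yields nothing, so unpacking into two names would raise ValueError
        return (), ()
    texts, labels = zip(*pairs)
    return texts, labels


sample = [('Fake title A', 'Nord Stream-2'), ('Fake title B', 'Nord Stream-2')]
texts, labels = split_pairs(sample)
print(texts)
print(labels)
```

Such a guard matters mainly when a scraping run returns nothing, for example after the site's class names change.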

### Step 5. Close and quit the driver to free up system resources

In [47]:
driver.close()
driver.quit()
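
If the scraping loop raises an exception, the cell above never runs and the browser process stays alive. One possible way around this, sketched here with a hypothetical `managed_driver` context manager and a stand-in `FakeDriver` class so that it runs without a browser, is to call `quit()` in a `finally` block.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory` and always quit it on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium driver, used only for this demonstration."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        raise RuntimeError('simulated scraping failure')
except RuntimeError:
    pass

print(fake.quit_called)  # True: quit() ran despite the error
```

In the notebook this would read `with managed_driver(get_driver) as driver:` around the loop from Step 3.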

## References

[1] - https://medium.com/swlh/scraping-a-dynamic-web-page-its-selenium-da161999c975
[2] - https://rusdisinfo.voxukraine.org/
[3] - https://pandas.pydata.org/
[4] - https://blog.jovian.com/web-scraping-using-selenium-2a3ffa1f03f4
[5] - https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25