# WEB SCRAPING

In the USA, media outlets with different views and political leanings are widely represented, and they report the news from such varied perspectives that any reader can choose a favorite outlet according to his or her own views. I decided to benefit from this variety and gather headlines from across the political spectrum:

1. CNN - left;

2. CBS - lean left;

3. New York Times - lean left;

4. Real Clear Politics - center;

5. The Epoch Times - lean right;

6. National Review - right.

<img src="https://www.allsides.com/sites/default/files/AllSidesMediaBiasChart-Version4.1.jpg" width="500"> 

I expected that gathering the dataset this way would provide balanced data in which all views are equally represented. However, while performing the web scraping I faced the following obstacles:

- some outlets mention Russia in their regular reports on the worldwide coronavirus situation, so they yield more articles mentioning Russia than outlets without such reports;

- different outlets seem to use different search methods on their websites, which results in articles with varying levels of relevance to the topic;

- the size of an outlet certainly plays a significant role: for instance, CNN publishes more articles overall than Real Clear Politics.

Nevertheless, I used the Selenium library to perform the web scraping and gathered a total of 11 941 headlines of articles published from 01.01.2020 to 20.04.2021 (the date when I started this project).
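The date range above can also be enforced explicitly instead of only checking for the year in the date string. The following is a minimal sketch, not the code used to build the dataset. It assumes the scraped date text looks like `APR 20, 2021` (the style seen in the CBS output) and that month names are parsed in an English locale; the helper name `in_project_window` and the constant names are hypothetical.

```python
from datetime import date, datetime

# Project window taken from the text: 01.01.2020 to 20.04.2021, inclusive.
START_DATE = date(2020, 1, 1)
END_DATE = date(2021, 4, 20)

# Assumed date format, e.g. 'APR 20, 2021'; other sites would need other formats.
DATE_FORMAT = '%b %d, %Y'


def in_project_window(date_text, start=START_DATE, end=END_DATE):
    """Return True if the scraped date string falls inside [start, end].

    Hypothetical helper: strings that do not match DATE_FORMAT count as out of range.
    """
    try:
        published = datetime.strptime(date_text.strip(), DATE_FORMAT).date()
    except ValueError:
        return False
    return start <= published <= end


# Sample strings: three in the CBS output style, one unparseable value.
samples = ['APR 20, 2021', 'MAY 20, 2021', 'APR 14, 2021', 'unknown']
kept = [s for s in samples if in_project_window(s)]
print(kept)
```

A check like this could replace the `'2021' in date or '2020' in date` condition in the loops below, at the cost of having to know each website's date format.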

In [10]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

import dateutil.parser
import time
import csv
from datetime import datetime
import io

## CNN

In [6]:
driver = webdriver.Chrome(r'C:\Users\Samsung\Downloads\chromedriver_win32_90\chromedriver.exe')
driver.get('https://edition.cnn.com/')

In [8]:
search_box = driver.find_element(By.CSS_SELECTOR, 'input[id="footer-search-bar"]')

# Click the search box, type the query and press Enter.
ActionChains(driver).move_to_element(search_box).click()\
    .send_keys('russia', Keys.ENTER).perform()

In [9]:
for page in range(10):

    # From the second iteration on, move to the next page of search results.
    if page > 0:
        next_button = driver.find_element(By.CSS_SELECTOR, 'div.pagination-arrow.pagination-arrow-right.cnnSearchPageLink.text-active')
        time.sleep(5)
        next_button.click()
        time.sleep(8)

    top_titles = driver.find_elements(By.CSS_SELECTOR, 'div[class="cnn-search__result cnn-search__result--article"]')

    with open('Russia_headlines_CNN_test.txt', 'a', encoding='utf-8') as rus_headlines:
        for title in top_titles:
            headline = title.find_element(By.CSS_SELECTOR, 'h3[class="cnn-search__result-headline"]').text
            # Read the date from this result, not from the first result on the page.
            date = title.find_element(By.CSS_SELECTOR, 'div[class = "cnn-search__result-publish-date"]').text

            if '2021' in date or '2020' in date:
                rus_headlines.write(headline)
                rus_headlines.write('\n')
                print(headline)

    print('...')


Biden weathers his first foreign crisis after months focusing on domestic troubles
Taiwan blames China for slowing down its access to Covid-19 vaccines. The reality is more complicated
The week in 13 headlines
How a once-bipartisan commission to investigate the Capitol Riot fell apart
Elon Musk says Tesla is considering a plant in Russia
As the US and Russia spar over the Arctic, Putin creates new facts on the ground
Trump administration secretly obtained CNN reporter's phone and email records
Past, present and future: The evolution of China's incredible high-speed rail network
US destroyer backs up Biden's tough words in South China Sea
5 things to know for May 20: Capitol riot, Covid-19, Gaza, policing, South China Sea
...
Blinken and Lavrov hold first high-level meeting of Biden's presidency as US-Russia tensions simmer
Colonial Pipeline CEO admits to authorizing $4.4 million ransomware payment
Lawmakers unveil legislation to give 'Havana syndrome' victims better medical care
Barack

## New York Times

In [11]:
driver = webdriver.Chrome(r'C:\Users\Samsung\Downloads\chromedriver_win32_90\chromedriver.exe')
driver.get('https://www.nytimes.com/search?dropmab=false&endDate=20210420&query=russia&sort=newest&startDate=20200101')

In [13]:
from selenium.webdriver.support.ui import Select
select = Select(driver.find_element(By.CSS_SELECTOR, 'select[class = "css-v7it2b"]'))

select.select_by_value("newest")

In [14]:
load_more_button = driver.find_element(By.CSS_SELECTOR, 'button[data-testid = "search-show-more-button"]')

# Each click on 'Show more' loads another batch of results onto the same page.
for _ in range(10):
    ActionChains(driver).move_to_element(load_more_button).click(load_more_button).perform()
    time.sleep(8)
    

In [15]:
top_titles = driver.find_elements(By.CSS_SELECTOR, 'li[class="css-1l4w6pd"]')

In [16]:
with open('Russia_headlines_NT_test.txt', 'w', encoding='utf-8') as rus_headlines:
    for title in top_titles:
        headline = title.find_element(By.CSS_SELECTOR, 'h4[class="css-2fgx4k"]').text
                           
        rus_headlines.write(headline)
        rus_headlines.write('\n')
        print(headline)


General Warns of Challenges to Tracking Terrorist Threats in Afghanistan After U.S. Exits
George Floyd, Johnson & Johnson, Philip Roth: Your Tuesday Evening Briefing
Coal Is Set to Roar Back, and So Are Its Climate Risks
Are American Values Ruining European Football?
How the Artists Behind ‘Shtisel’ Brought Akiva’s Journey to Life
Your Wednesday Briefing
‘We Know How to Defend Our Interests’: Putin’s Emerging Hard Line
Weekly News Quiz for Students: Vaccine Pause, Police Shooting, Softball First
A Global Tipping Point for Reining In Tech Has Arrived
Lesson of the Day: ‘How Working From Home Changed Wardrobes Around the World’
‘A Threat From the Russian State’: Ukrainians Alarmed as Troops Mass on Their Doorstep
Your Tuesday Briefing
Abandoning Afghanistan Is a Historic Mistake
Richard Rush, Who Directed ‘The Stunt Man,’ Dies at 91
Chauvin, Vaccines, Indianapolis: Your Monday Evening Briefing
Your Tuesday Briefing
The Science of Climate Change Explained: Facts, Evidence and Proof
Alekse

## CBS News

In [33]:
driver = webdriver.Chrome(r'C:\Users\Samsung\Downloads\chromedriver_win32_90\chromedriver.exe')
driver.get('https://www.cbsnews.com/#search-form')

In [37]:
# Raw string: the backslash that escapes ':' in the CSS id is passed on as-is, without an invalid-escape warning.
search_icon = driver.find_element(By.CSS_SELECTOR, r'#site-header\:48 > div > nav > ul > li.site-nav__item.site-nav__item--level-1.site-nav__item--search > a > svg > use')
time.sleep(3)
ActionChains(driver).move_to_element(search_icon).click(search_icon).perform()


In [38]:
search_box = driver.find_element(By.CSS_SELECTOR, 'input[class="search-field"]')

# Click the search box, type the query and press Enter.
ActionChains(driver).move_to_element(search_box).click()\
    .send_keys('russia', Keys.ENTER).perform()

In [44]:
top_titles = driver.find_elements(By.CSS_SELECTOR, 'article[class="item item--type-article"]')


with open('Russia_headlines_CBS_test.txt', 'w', encoding='utf-8') as rus_headlines:
    for title in top_titles:
        headline = title.find_element(By.CSS_SELECTOR, 'div[class="item__title-wrapper"]').text
                           
        rus_headlines.write(headline)
        rus_headlines.write('\n')
        print(headline)
    

"Piles of problems," but Russia calls meeting with Blinken "positive"
MAY 20, 2021
Is "a stable and more predictable relationship with Russia" possible?
MAY 19, 2021
U.S. behind Russia, China in sending vaccines to nations in need
MAY 19, 2021
Russia clears actress for flight to space station
MAY 14, 2021
Russia's space agency chief declares Venus a "Russian planet"
MAY 6, 2021
Russia restricts airspace near Ukraine amid wargames in the Black Sea
APR 22, 2021
Russia says huge military exercises near Ukraine to wind down
APR 22, 2021
U.S. ambassador to Russia returning to Washington for "consultations"
APR 20, 2021
Russia shuts down Alexey Navalny's anti-corruption foundation
APR 27, 2021
Macron says international community must draw "clear red lines" with Russia
APR 18, 2021
Russia warns U.S. to stay away from Ukraine for its "own good"
APR 14, 2021
Biden announces sweeping new Russia sanctions
APR 16, 2021
Russia announces expulsion of 10 U.S. diplomats and ban some U.S. officials
APR

## Real Clear Politics


In [45]:
driver = webdriver.Chrome(r'C:\Users\Samsung\Downloads\chromedriver_win32_90\chromedriver.exe')
driver.get('https://www.realclearpolitics.com/search/?q=russia#')

In [46]:
for page in range(10):

    # From the second iteration on, click 'Next page'.
    if page > 0:
        next_button = driver.find_element(By.CSS_SELECTOR, '#pagination-controls > li.next')
        next_button.click()
        time.sleep(8)

    with open('Russia_headlines_RCP_test.txt', 'a', encoding='utf-8') as rus_headlines:
        # Each results page shows 25 articles.
        for i in range(1, 26):
            xpath = '//*[@id="results"]/div[' + str(i) + ']/p[1]/strong/a' # simplified
            xpath_date = '//*[@id="results"]/div[' + str(i) + ']/p[1]/span' # simplified

            headline = driver.find_element(By.XPATH, xpath).text
            date = driver.find_element(By.XPATH, xpath_date).text

            if '2021' in date or '2020' in date:
                rus_headlines.write(headline)
                rus_headlines.write('\n')
                print(headline)

    print('...')


Tucker Carlson: Jan 6th Commission Is The New "Russia Russia Russia"
Peddlers of Russia Collusion Won't Take Truth for an Answer
Bret Baier: Is The Biden Administration Acknowledging They're Soft On Russia?
The Russia Collusion Smear Returns
Biden DOJ Hires Full-On Russia Collusion Hoaxer
China and Russia's Dangerous Convergence
Glenn Greenwald: People Are Afraid To Speak Out Against New Cold War With Russia
For What Should We Fight Russia or China?
Biden Administration Can't Escape the Russia Problem
Jake Sullivan On New Russia Sanctions: Relations Not As Bad As During "Evil Empire" Summits
A List of Official Russia Claims That Proved To Be Bogus
Blinken: U.S. Has "Real Concerns About Russia's Actions" Near The Ukraine Border
China and Russia Are Winning the New Space Race
China & Russia Are Winning the New Space Race
China and Russia Are Winning the New Space Race
Blowups With China & Russia Mark Biden's First 60 Days
Peddlers of Russiagate Won't Take Truth for an Answer
Kilimnik Spe

## The Epoch Times

In [48]:
driver = webdriver.Chrome(r'C:\Users\Samsung\Downloads\chromedriver_win32_90\chromedriver.exe')
driver.get('https://www.theepochtimes.com/search/?q=russia')

In [52]:
wait = WebDriverWait(driver, 10)

for page in range(2):

    # From the second iteration on, wait until the pagination link is clickable, then click it.
    if page > 0:
        load_more_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="main"]/div/div[3]/a[5]')))
        load_more_button.click()

    with open('Russia_headlines_ET_test.txt', 'a', encoding='utf-8') as rus_headlines:
        # Read the first 30 results by position.
        for i in range(1, 31):
            xpath = '//*[@id="eetsearch-result"]/li[' + str(i) + ']/div[2]/div[1]/a' # simplified
            xpath_date = '//*[@id="eetsearch-result"]/li[' + str(i) + ']/div[2]/div[3]/span[1]' # simplified

            headline = driver.find_element(By.XPATH, xpath).text
            date = driver.find_element(By.XPATH, xpath_date).text

            if '2021' in date or '2020' in date:
                rus_headlines.write(headline)
                rus_headlines.write('\n')
                print(headline)

    print('...')


Recurrents: The Story of Faust Through History
Why Did I Renounce the Chinese Communist Party Membership?
Afghanistan Peace Process Will Depend Upon Whether Regional Powers Cooperate or Compete: Experts
Schiff: If Senate Blocks Jan. 6 Commission Bill, Dems Will Get Answers ‘One Way or the Other’
The New York Times Exposes and Cuts Ties With Fusion GPS
Sinead O’Connor, Felt-Truths, and Media Propaganda
What Role Does China Play in the Israeli-Palestinian Conflict?
‘Positive’ Decryption Tool Given to Irish Health Service After Ransom Attack
Deep Dive (May 20): Colonial Pipeline CEO Explains Why He Paid the Ransom: ‘For the Country’
Lithuanian Parliament Latest to Call China’s Treatment of Uyghurs ‘Genocide’
Lawmakers Demand Reinstatement of Space Force Officer Who Was Removed for Denouncing Critical Race Theory
Australia Joins Space Race With New Defence Division
Biden Admin Waives Sanctions on Russia’s Nord Stream 2 Company and CEO
NTD Evening News Full Broadcast (May 19)
Colonial Pipel

## National Review

In [58]:
driver = webdriver.Chrome(r'C:\Users\Samsung\Downloads\chromedriver_win32_90\chromedriver.exe')
driver.get('https://www.nationalreview.com/?s=russia&sp%5Bforce%5D=1&search-date=custom&search-date-from=01%2F01%2F2020&search-date-to=04%2F20%2F2021&orderby=date&order=DESC')

In [64]:

for page in range(2):

    # From the second iteration on, click the pagination link.
    if page > 0:
        next_button = driver.find_element(By.CSS_SELECTOR, '#wp_page_numbers > ul > li:nth-child(9) > a')
        next_button.click()
        time.sleep(8)

    with open('Russia_headlines_NR_test.txt', 'a', encoding='utf-8') as rus_headlines:
        # Read the first 10 articles on the page by position.
        for i in range(1, 11):
            xpath = '//*[@id="main"]/div/div/div/article[' + str(i) + ']/div/h4/a' # simplified

            headline = driver.find_element(By.XPATH, xpath).text

            rus_headlines.write(headline)
            rus_headlines.write('\n')
            print(headline)

    print('...')


The Media Eat Crow on COVID Lab-Leak Theory
For Your Listening Pleasure
A Dissent on Pipeline Politics
What Does Vladimir Putin Have on Joe Biden?
Biden Balks on Russia
The Conventional Wisdom on UFOs Is Shifting
Joe Biden Wimps Out on Russia’s Nord Stream 2 Pipeline
American Muscle Cars in Russia
Associated Press, Hamas Propagandists
Why Is Hungary Abandoning Hong Kong?
...
The Center Is Borrowing from the Edges
Hockey Hero
Tom Stoppard, a Conservative Genius for Our Time
Will America Rule the Waves?: Inside the New Issue of NR
House GOP Votes Stefanik into Cheney’s Former Leadership Role
The Colonial Pipeline Hack: A New Era of Cyberwarfare
Democrats Finally Find Something They Don’t Want to Blame on Putin
Colonial Pipeline Paid Hackers $5 Million Ransom: Report
Biden’s Throwback Presidency: A Return to Dukakis
How Edward Said Reoriented the West
...


**Technical details:**

- all the websites are very different, so I wrote 'personalized' code for each one;

- there were 2 ways of scrolling back through the articles to the 1st of January 2020: 'Show more', which loads and shows more articles after each click, as on the New York Times website, and 'Next page', which loads a limited number of articles each time; for instance, the Real Clear Politics website shows 25 articles per page;

- there were 2 ways of setting date limits: either I could set specific date limits, as on the National Review website, or I had to set the number of 'Show more' or 'Next page' clicks manually and check when the results reached the 1st of January 2020, as on the Epoch Times website;

- in this Notebook I reduced the number of iterations on each website to check that the code works, so the outputs above show dozens of headlines rather than hundreds.
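
The manual control of 'Next page' clicks described above could, in principle, be replaced by a loop that stops on its own once the results go past the start date. The sketch below is not the method used in this Notebook. It assumes the search results are sorted newest first, and the names `collect_until`, `read_page` and `go_next` are hypothetical. Reading a page and moving to the next one are passed in as plain functions, so the Selenium calls for a particular website could be plugged in; here two fake pages stand in for a website.

```python
from datetime import date


def collect_until(read_page, go_next, start_date, max_pages=1000):
    """Collect headlines page by page until an article older than start_date appears.

    read_page() -> list of (headline, published_date) tuples for the current page.
    go_next()   -> moves to the next page; returns False when there is none.
    Assumes results are sorted newest first (an assumption, not guaranteed by any site).
    """
    headlines = []
    for _ in range(max_pages):
        for headline, published in read_page():
            if published < start_date:
                # Reached articles from before the window: stop paging.
                return headlines
            headlines.append(headline)
        if not go_next():
            break
    return headlines


# Stand-in for a website: two fake result pages, newest first.
fake_pages = [
    [('Headline A', date(2021, 4, 19)), ('Headline B', date(2020, 6, 1))],
    [('Headline C', date(2020, 1, 2)), ('Headline D', date(2019, 12, 30))],
]
position = {'page': 0}


def read_fake_page():
    return fake_pages[position['page']]


def go_to_next_fake_page():
    if position['page'] + 1 >= len(fake_pages):
        return False
    position['page'] += 1
    return True


result = collect_until(read_fake_page, go_to_next_fake_page, date(2020, 1, 1))
print(result)
```

For a 'Show more' website, where each click adds results to the same page, `read_page` would have to return only the newly loaded items to avoid collecting the same headline twice.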