## Data Collection for Bing
### Scraper that fetches the top 10 articles from Bing News for the given keywords.

#### This notebook uses newspaper3k and beautifulsoup to extract the following features:
* Title of the article
* URL of the article
* Published Date
* Publisher
* Description shown in the Result Page
* Full text of the article

By default, data is collected for **30 keywords**. The keywords must be stored in a file named 'Capstone Keywords.csv' in the same directory as this notebook, with one keyword per row in the first column. The file name can be changed in **Code Cell 3**.

In [1]:
# Install the required libraries using pip. The same can be done from the conda prompt.
# Uncomment the following lines to install the packages.

# !pip install beautifulsoup4
# !pip install newspaper3k
# !pip install requests tqdm

In [2]:
# Import the necessary libraries for the notebook.
import requests
from bs4 import BeautifulSoup
import csv
from newspaper import Article
from datetime import datetime, timedelta
from tqdm import tqdm

In [3]:
# Initialize the list of keywords to analyze
keyword_list = []

# CSV file to access the keywords to analyze
csv_for_keywords = 'Capstone Keywords.csv'

# Access the keywords from the CSV file and create a list with it.
with open(csv_for_keywords, newline='') as keywords_initial:
    reader = csv.reader(keywords_initial)
    for row in reader:
        keyword_list.append(row[0])
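
The loop above assumes every row of the CSV has at least one cell; a blank line in the file would make `row[0]` raise an `IndexError`. A minimal sketch of a more forgiving loader follows. The function name `load_keywords` and the UTF-8 encoding are assumptions made for illustration and are not part of the original notebook.

```python
import csv

def load_keywords(path):
    """Return the first-column value of every non-empty row in a CSV file."""
    keywords = []
    # Assumption: the file is UTF-8 encoded; adjust if it was exported differently.
    with open(path, newline='', encoding='utf-8') as handle:
        for row in csv.reader(handle):
            # csv.reader yields an empty list for blank lines; skip those
            # and rows whose first cell holds only whitespace.
            if row and row[0].strip():
                keywords.append(row[0].strip())
    return keywords
```

Calling `load_keywords(csv_for_keywords)` would then produce the same kind of list as `keyword_list`, minus blank entries.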

In [4]:
# Replace spaces and hyphens with '+' so each keyword can be placed in the query string.
keyword_for_query = []
for keyword in keyword_list:
    add_plus = keyword.replace(" ", "+")
    final_query = add_plus.replace("-", "+")
    keyword_for_query.append(final_query)
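
Replacing spaces and hyphens by hand works for plain keywords, but a keyword containing a character such as `&` or `#` would change the meaning of the query string. One possible alternative, sketched here, is the standard library's `urllib.parse.quote_plus`. The helper name `build_bing_news_url` is hypothetical; the URL pattern is the one used in the main cell of this notebook.

```python
from urllib.parse import quote_plus

def build_bing_news_url(keyword):
    """Build the Bing News search URL for one keyword (hypothetical helper)."""
    # Hyphens become spaces first, mirroring the cleaning step above;
    # quote_plus then turns spaces into '+' and percent-encodes
    # reserved characters such as '&'.
    query = quote_plus(keyword.replace("-", " "))
    return "https://www.bing.com/news/search?q=" + query + "&cc=US"
```

For example, `build_bing_news_url("AT&T self-driving cars")` keeps the ampersand inside the search term instead of letting it start a new URL parameter.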

In [5]:
# Create a new CSV file to store the results
file_write = open('raw_bing.csv', 'w', newline='', encoding='UTF8')
writer = csv.writer(file_write)

### Main Section
The following code cell extracts the top 10 results from Bing News for each keyword and writes them to a CSV file.
#### After running the cell, there should be a CSV file named "raw_bing.csv" in the working directory.


In [6]:
# Header for the CSV File.
header_csv = ['keyword', 'title', 'url', 'published_date', 'publisher', 'description', 'entire_doc']

# Write the header into the newly created CSV file.
writer.writerow(header_csv)

# Main part of the scraper
try:
    # Fetch data for each keyword.
    for keyword in tqdm(range(len(keyword_for_query)), desc = "Fetching Data from Bing"):
        # Make the request to the given url.
        page = requests.get('https://www.bing.com/news/search?q='+ keyword_for_query[keyword] +'&cc=US')
        
        # Create a BS4 object and find all the required values.
        soup = BeautifulSoup(page.content, 'html.parser')
        result_title = soup.find_all('div', class_='t_t')
        result_desc = soup.find_all('div', class_='snippet')
        result_link = soup.find_all('a', class_='title')
        result_source = soup.find_all('div', class_='source')
        
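        # Change max_results to tweak how many results are extracted (10 by default).
        # The count is capped by the shortest result list so a sparse page cannot raise an IndexError.
        max_results = 10
        available = min(len(result_title), len(result_desc), len(result_link), len(result_source))
        for i in range(min(max_results, available)):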
            # Insert the keyword into the list.
            temp_list_results = []
            temp_list_results.append(keyword_list[keyword])
            if temp_list_results[0] not in keyword_list:
                continue
            
            # Get the title and link of the article.
            temp_list_results.append(result_title[i].get_text())
            temp_list_results.append(result_link[i]['href'])
            if temp_list_results[2][:4] != "http":
                continue
            # Convert the relative age shown by Bing (minutes, hours, or days) into a timestamp.
            date_ = result_source[i].findChildren("span")[2].get_text()
            minute = 0
            hour = 0
            day = 0
            if date_[-1] == "m":
                minute = int(date_[:-1])
            elif date_[-1] == "h":
                hour = int(date_[:-1])
            elif date_[-1] == "d":
                day = int(date_[:-1])
            else:
                pass
            date_published = datetime.today() - timedelta(days = day, hours=hour, minutes=minute)
            date_published = date_published.strftime('%a, %d %b %Y %X')
            temp_list_results.append(date_published)

            # Get the publisher, the description, and the full article text.
            temp_list_results.append(result_source[i].findChild().get_text())
            temp_list_results.append(result_desc[i].get_text())
            try:
                # Access the text from the URL.
                article = Article(temp_list_results[2])
                article.download()
                article.parse()
                temp_list_results.append(article.text)
            except Exception:
                # If the download or parse fails, store an empty string for the article text.
                temp_list_results.append('')
            # Insert the list we created into the CSV file.
            writer.writerow(temp_list_results)
# Report a fatal error instead of hiding it.
except Exception as error:
    print(f"Scraping stopped early: {error!r}")
# Always close the file so that buffered rows are written to disk.
finally:
    file_write.close()

Fetching Data from Bing:  93%|█████████████████████████████████████████████████████▏   | 28/30 [04:33<00:19,  9.76s/it]
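
The date handling in the main cell reads the last character of the label (`m`, `h`, or `d`) and converts the rest with `int`, which raises an error if the remainder is not a plain number. A minimal sketch of the same conversion as a standalone function is shown below. The name `parse_relative_age` and the `now` parameter are hypothetical, and the label formats are an assumption based on the branches in the cell above. Unrecognised labels fall back to the current time, as the original code does.

```python
import re
from datetime import datetime, timedelta

# Map the unit suffix to the matching timedelta keyword argument.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}

def parse_relative_age(label, now=None):
    """Convert a label such as '5m', '3h' or '2d' into a datetime."""
    now = now or datetime.today()
    match = re.fullmatch(r"(\d+)\s*([mhd])", label.strip())
    if match is None:
        # Unknown format: apply no offset, as in the original else branch.
        return now
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(**{_UNITS[unit]: amount})
```

The result can be formatted exactly as before with `.strftime('%a, %d %b %Y %X')`. Passing `now` explicitly makes the function easy to check against known inputs.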
