## Data Collection for Google
### Scraper that fetches the top 10 articles from Google News for the given keywords

#### This notebook uses GNews and Newspaper3k to extract the following features:
* Title of the article
* URL of the article
* Published Date
* Publisher
* Description shown on the results page
* Full text of the article (plain text extracted by Newspaper3k)

By default, data is collected for **30 keywords**. The keywords must be stored in a CSV file named 'Capstone Keywords.csv', located in the same directory as this notebook. The file name can be changed in **Code Cell 3**.
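
Code Cell 3 reads only the first column of each row, so the keyword file is expected to hold one keyword per line. The following is a minimal sketch of that layout together with a slightly more defensive reader that skips blank lines. It assumes the file has no header row; the helper name `load_keywords` and the sample keywords are hypothetical and not part of the notebook.

```python
import csv
import os
import tempfile


def load_keywords(path):
    """Return the first-column value of every non-empty row in a CSV file."""
    keywords = []
    with open(path, newline='', encoding='utf-8') as handle:
        for row in csv.reader(handle):
            # A blank line yields an empty row; skip it instead of indexing it.
            if row and row[0].strip():
                keywords.append(row[0].strip())
    return keywords


# Hypothetical sample file: one keyword per line, no header row.
sample_path = os.path.join(tempfile.mkdtemp(), 'Capstone Keywords.csv')
with open(sample_path, 'w', newline='', encoding='utf-8') as handle:
    handle.write('electric vehicles\n\nsolar energy\n')

sample_keywords = load_keywords(sample_path)
print(sample_keywords)
```

The blank-line check matters because indexing `row[0]` on an empty row raises an `IndexError`.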

In [1]:
#Install the required packages.
# Uncomment the lines to install the packages.

#To get the news from Google
# !pip install gnews

#To scrape the links we obtain from gnews
# !pip install newspaper3k

In [2]:
# Import the required libraries
import csv
from gnews import GNews
# import pandas as pd
from newspaper import Article
from tqdm import tqdm

In [3]:
# Initialize the list of keywords to analyze
keyword_list = []

# CSV file to access the keywords to analyze
csv_for_keywords = 'Capstone Keywords.csv'

# Access the keywords from the CSV file and create a list with it.
with open(csv_for_keywords, newline='') as keywords_initial:
    reader = csv.reader(keywords_initial)
    for row in reader:
        keyword_list.append(row[0])

In [4]:
# Initialize the parameters for the Google News handler.
chosen_country = 'United States'
chosen_language = 'english'
results_to_return = 10
chosen_period = '3d'

In [5]:
# Set the parameters of the GNews object.
news_handler = GNews()
news_handler.country = chosen_country
news_handler.language = chosen_language
news_handler.max_results = results_to_return
news_handler.period = chosen_period

In [6]:
# Create a new CSV file to store the results
file_write = open('raw_google.csv', 'w', newline='', encoding='UTF8')
writer = csv.writer(file_write)

### Main Section
The following code cell extracts the top 10 results from Google News for each keyword and writes them to a CSV file.
#### After running the cell, there should be a CSV file named "raw_google.csv" in the working directory.
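
Each CSV row is built from one result dictionary returned by GNews plus the article text. As a rough sketch, the row layout and the URL check can be written as two small functions that run without network access. The function names `build_row` and `has_http_url` and the sample values are hypothetical; only the dictionary keys are taken from the main cell.

```python
def build_row(keyword, news, article_text=''):
    """Flatten one result dictionary into the seven-column CSV row."""
    return [
        keyword,
        news['title'],
        news['url'],
        news['published date'],
        news['publisher']['title'],
        news['description'],
        article_text,
    ]


def has_http_url(row):
    """Return True when the URL column (index 2) starts with 'http'."""
    return row[2].startswith('http')


# Made-up result that uses the same keys the main cell reads.
fake_news = {
    'title': 'Example headline',
    'url': 'https://example.com/story',
    'published date': 'Mon, 01 Jan 2024 00:00:00 GMT',
    'publisher': {'title': 'Example Times'},
    'description': 'Example description',
}
row = build_row('example keyword', fake_news, 'Body text')
print(len(row), has_http_url(row))
```

Keeping the row construction separate from the download step makes the column order easy to compare against the CSV header.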


In [7]:
# Header for the CSV file
header_csv = ['keyword', 'title', 'url', 'published_date', 'publisher', 'description', 'entire_doc']
writer.writerow(header_csv)

# try/finally closes the CSV file even if a request fails partway through.
try:
    # Go through each of the keywords
    for keyword in tqdm(keyword_list, desc="Fetching Data from Google"):
        # List of result dictionaries for this keyword
        news_return = news_handler.get_news(keyword)

        # For each result, collect the required fields in a list.
        for news in news_return:
            temp_list_results = [
                keyword,
                news['title'],
                news['url'],
                news['published date'],
                news['publisher']['title'],
                news['description'],
            ]
            try:
                # Download the page and extract the article text from the URL.
                article = Article(news['url'])
                article.download()
                article.parse()
                temp_list_results.append(article.text)
            except Exception:
                # Keep the row with an empty article body if the download or parse fails.
                temp_list_results.append('')
            # Write the row to the CSV file only if the URL looks like a web link.
            if news['url'].startswith('http'):
                writer.writerow(temp_list_results)
finally:
    # Close the CSV file
    file_write.close()

Fetching Data from Google: 100%|███████████████████████████████████████████████████████| 30/30 [03:52<00:00,  7.76s/it]
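
One way to sanity-check the output is to count the rows stored for each keyword. The sketch below does this with the standard `csv` module; the helper name `rows_per_keyword` is hypothetical, and the demo file with made-up rows stands in for "raw_google.csv" so the example runs on its own.

```python
import csv
import os
import tempfile
from collections import Counter


def rows_per_keyword(path):
    """Count how many data rows the CSV at `path` holds for each keyword."""
    counts = Counter()
    with open(path, newline='', encoding='UTF8') as handle:
        # DictReader uses the header row, so columns are addressed by name.
        for record in csv.DictReader(handle):
            counts[record['keyword']] += 1
    return counts


# Stand-in file with made-up rows, using the same header as the notebook.
demo_path = os.path.join(tempfile.mkdtemp(), 'raw_google_demo.csv')
with open(demo_path, 'w', newline='', encoding='UTF8') as handle:
    demo_writer = csv.writer(handle)
    demo_writer.writerow(['keyword', 'title', 'url', 'published_date',
                          'publisher', 'description', 'entire_doc'])
    demo_writer.writerow(['alpha', 't1', 'https://example.com/1', '', 'p', 'd',
                          'line one\nline two'])
    demo_writer.writerow(['alpha', 't2', 'https://example.com/2', '', 'p', 'd', ''])
    demo_writer.writerow(['beta', 't3', 'https://example.com/3', '', 'p', 'd', 'text'])

demo_counts = rows_per_keyword(demo_path)
print(dict(demo_counts))
```

On the real output, `rows_per_keyword('raw_google.csv')` should report no more than 10 rows per keyword, since `results_to_return` is set to 10. Opening the file with `newline=''` lets the reader handle article text that contains line breaks.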
