<a href="https://colab.research.google.com/github/PurpleDin0/news-scraping-exercise/blob/master/Execution_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MST 698S - Data Science Tools And Techniques 
# Bad Ozone Grasshoppers - News Scraper Exercise

**Summary:** This notebook installs the Python libraries and operating system (OS) packages needed to run a Python-based news scraper that targets the Reuters news website. It then walks the user through saving the scraped data to their Google Drive and opening the data with pandas.

**Usage Details:** This notebook is designed to be run in the [Google Colab environment](https://colab.research.google.com/). However, it should work in ***most*** Linux-based Jupyter Notebook or JupyterLab environments. The main purpose of the notebook is to install the relevant Python libraries, execute the web scraping code, and save the output to a cloud repository.  
***CAUTION:*** If you run this notebook on a Windows-based system, you will need to install Git and update the default file paths to match Windows path formatting.
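
One way to reduce the path editing mentioned in the caution is to build paths with `pathlib` instead of hard-coding `/content/...` strings. The sketch below is only an illustration: using the home directory as the base is an assumption, not something the notebook does, and the cells that follow still use the Colab paths.

```python
from pathlib import Path

# Assumption: the repo is cloned under the user's home directory.
# In Colab the notebook uses /content instead.
base_dir = Path.home()

# The / operator joins path segments with the separator of the current OS.
repo_dir = base_dir / "news-scraping-exercise"
json_path = repo_dir / "news_dump_object.json"

print(json_path)
```

With paths built this way, only `base_dir` needs to change when moving between Colab, Linux, and Windows.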

## Initialize the Environment 
1. Clone the GitHub repo [located here](https://github.com/PurpleDin0/news-scraping-exercise).  
2. Install all required dependencies.  This is best done by listing all dependencies in a `requirements.txt` file in the GitHub repo and running `pip install` against that file (see below for example code).
```
!pip install -r requirements.txt
```
3. Install the webdriver for Selenium.  This is needed because Google Colab notebook instances do not start with any web browsers installed.  Selenium uses a webdriver for Chrome, Firefox, or Internet Explorer to drive the browser.  Luckily, we can install programs using shell commands ("!") or magics ("%") (see below for example code or [read info here](https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com)).
```python
!apt-get update 
!apt install chromium-chromedriver
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
```
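
Once the webdriver is installed, a quick way to confirm that Selenium can reach it is to start a headless Chrome session by hand. The following is a minimal sketch, not the code that `news_reuters.py` uses (that module is not shown here). The helper names `headless_chrome_arguments` and `make_headless_driver` are hypothetical, and the flags are the ones commonly needed in a container with no display.

```python
def headless_chrome_arguments():
    """Flags commonly needed to run Chrome with no display inside a container."""
    return ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]


def make_headless_driver():
    """Start a headless Chrome session.

    Assumes selenium is installed and chromedriver is on the PATH.
    """
    # Imported here so the helper above can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in headless_chrome_arguments():
        options.add_argument(flag)
    return webdriver.Chrome(options=options)


# Example usage (uncomment after the install steps have finished):
# driver = make_headless_driver()
# driver.get("https://www.reuters.com")
# print(driver.title)
# driver.quit()
```

If the session starts and prints a page title, the browser and driver are installed correctly.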



In [1]:
### 1. Clone the GitHub repo ###
# Navigate the working directory in colab to "/content" 
%cd /content/
# clone the relevant github repo
!git clone https://github.com/PurpleDin0/news-scraping-exercise.git
# Navigate to the newly created repo folder
%cd /content/news-scraping-exercise

### 2. Install the required dependencies ###
# install all required python libraries
!pip install -r requirements.txt
# Once installed you may need to restart the runtime (Colab will tell you if a restart is required)

/content
Cloning into 'news-scraping-exercise'...
remote: Enumerating objects: 82, done.
remote: Counting objects: 100% (82/82), done.
remote: Compressing objects: 100% (73/73), done.
remote: Total 82 (delta 43), reused 22 (delta 8), pack-reused 0
Unpacking objects: 100% (82/82), done.
/content/news-scraping-exercise
Collecting selenium==3.141.0
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
     |████████████████████████████████| 911kB 2.7MB/s 
Collecting beautifulsoup4==4.8.0
  Downloading https://files.pythonhosted.org/packages/1a/b7/34eec2fe5a49718944e215fde81288eec1fa04638aa3fb57c1c6cd0f98c3/beautifulsoup4-4.8.0-py3-none-any.whl (97kB)
     |████████████████████████████████| 102kB 7.9MB/s 
Collecting lxml==4.5.0
  Downloading https://files.pythonhosted.org/packages/dd/ba/a0e6866057fc0bbd17192925c1d63a3b85cf522965de9bc0236
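
After the install (and any runtime restart), it can be useful to confirm that the installed versions match the pins. The sketch below is one possible check using only the standard library. The helper names `parse_pins` and `find_mismatches` are hypothetical, the sample text lists only the three pins visible in the output above (the real `requirements.txt` may contain more), and the parser handles only simple `name==version` lines.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed or None)} for each unsatisfied pin."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Assumption: these three pins are taken from the pip output, not from the file.
sample = "selenium==3.141.0\nbeautifulsoup4==4.8.0\nlxml==4.5.0\n"
pins = parse_pins(sample)
print(find_mismatches(pins))
```

An empty dictionary means every pinned package is installed at the pinned version. In the notebook, the text could instead be read from the repo's `requirements.txt`.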

In [2]:
### 3. Install the webdriver for Selenium ###
# Install the Chromium webdriver so Selenium can work
!apt-get update # refresh the apt package index so that apt install can find the package
!apt install chromium-chromedriver
#!cp /usr/lib/chromium-browser/chromedriver /usr/bin #If running on a local machine you might need this line
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
#from selenium import webdriver #might not need this line


Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease [3,626 B]

## Run the Reuters news scraper

In [8]:
# Import the news_reuters module and execute its main() function using the Chrome browser agent we just installed
%cd /content/news-scraping-exercise/
import news_reuters
updated_news_object = news_reuters.main(browser_agent="Chrome")

/content/news-scraping-exercise
|        Bad Ozone Grasshoppers        |

Executing using Chrome webdriver
Chrome Driver not found...installing locally...
56 articles loaded from news_dump_object.json
Getting Reuters articles...
19 new articles scraped
Saving news object...


## Export the reuters news dump
Mount your Google Drive and save `news_dump_object.json` to it.  
This allows you to access the scraped data at a later time.  
Additionally, you can load this saved file prior to executing `news_reuters.py` so that new articles are added to the existing JSON file.



In [4]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# Export the news_dump_object.json file to your Google Drive. *CHANGE THE PATH* to where you want to save the file
!cp /content/news-scraping-exercise/news_dump_object.json '/content/drive/My Drive/Colab Notebooks/Coursework/698S/news-scraping-exercise'
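
The reverse direction, restoring a saved file before running the scraper so that new articles are added to it, can be sketched in Python as well. This is only an illustration: `copy_if_present` is a hypothetical helper, the Drive path is a placeholder to replace with your own, and Drive must already be mounted.

```python
import shutil
from pathlib import Path


def copy_if_present(src_file, dst_dir):
    """Copy src_file into dst_dir if it exists; return True when a copy was made."""
    src = Path(src_file)
    if not src.is_file():
        return False
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    shutil.copy2(src, dst / src.name)
    return True


# Hypothetical Drive location; change it to wherever you saved the file.
drive_copy = "/content/drive/My Drive/Colab Notebooks/news_dump_object.json"
restored = copy_if_present(drive_copy, "/content/news-scraping-exercise")
print("restored saved dump" if restored else "no saved dump found")
```

Unlike a bare `cp`, this does nothing when the saved file is missing, so the cell is safe to run on a first session.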

## View the Reuters news dump

In [6]:
%cd /content/news-scraping-exercise/
import os
import pandas as pd
json_path = os.path.join(os.getcwd(), 'news_dump_object.json')
df = pd.read_json(json_path)
df

/content/news-scraping-exercise


| | date | time | source | Title | Text | url |
|---|---|---|---|---|---|---|
| 0 | 2020-05-19 | 6:18 PM | www.reuters.com | Dutch schools, cafes and museums to reopen in ... | AMSTERDAM (Reuters) - The Netherlands will pre... | https://www.reuters.com/article/us-health-coro... |
| 1 | 2020-05-19 | 6:16 PM | www.reuters.com | Egypt says proposed standby funding from IMF a... | CAIRO (Reuters) - Egypt’s talks with the IMF a... | https://www.reuters.com/article/us-egypt-imf/e... |
| 2 | 2020-05-19 | 4:21 PM | www.reuters.com | Santander agrees to $550 million U.S. settleme... | WASHINGTON (Reuters) - Santander Consumer USA ... | https://www.reuters.com/article/us-usa-autos-l... |
| 3 | 2020-05-19 | 8:33 AM | www.reuters.com | Pier 1 seeks to wind down operations as pandem... | (Reuters) - Home decor and furniture retailer ... | https://www.reuters.com/article/us-pier-1-impo... |
| 4 | 2020-05-19 | 6:14 PM | www.reuters.com | Abu Dhabi's Etihad makes first known flight to... | DUBAI/TEL AVIV (Reuters) - An Etihad Airways p... | https://www.reuters.com/article/us-israel-emir... |
| 5 | 2020-05-19 | 6:11 PM | www.reuters.com | Investors seen queuing up for new U.S. 20-year... | NEW YORK (Reuters) - Investors are likely to s... | https://www.reuters.com/article/us-usa-bonds-2... |
| 6 | 2020-05-19 | 6:06 PM | www.reuters.com | Fed's Rosengren says U.S. unemployment rate co... | (Reuters) - Businesses will face weak demand a... | https://www.reuters.com/article/us-usa-fed-ros... |
| 7 | 2020-05-19 | 6:06 PM | www.reuters.com | USDA sets coronavirus aid payments for corn, s... | CHICAGO (Reuters) - U.S. farmers that grow cro... | https://www.reuters.com/article/us-health-coro... |
| 8 | 2020-05-19 | 6:01 PM | www.reuters.com | Goldman Sachs cuts Brazil 2020 GDP forecast to... | BRASILIA (Reuters) - Economists at Goldman Sac... | https://www.reuters.com/article/us-latam-econo... |
| 9 | 2020-05-19 | 1:37 PM | www.reuters.com | WHO chief says he will keep leading virus resp... | GENEVA (Reuters) - The World Health Organizati... | https://www.reuters.com/article/us-health-coro... |
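
The `date` and `time` columns are stored separately, so sorting articles chronologically requires combining them first. Here is a minimal sketch using only the standard library, assuming both values are strings in the formats displayed above. `parse_timestamp` is a hypothetical helper, and the three sample rows are copied from the table.

```python
from datetime import datetime


def parse_timestamp(date_str, time_str):
    """Combine e.g. '2020-05-19' and '6:18 PM' into a single datetime."""
    return datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %I:%M %p")


# Sample (date, time) pairs taken from rows 0, 3, and 9 of the table.
rows = [
    ("2020-05-19", "6:18 PM"),
    ("2020-05-19", "8:33 AM"),
    ("2020-05-19", "1:37 PM"),
]

# Sort the articles from earliest to latest.
stamps = sorted(parse_timestamp(d, t) for d, t in rows)
for stamp in stamps:
    print(stamp.isoformat())
```

Note that `pd.read_json` may already have converted the `date` column to timestamps. In that case, format the values back to `YYYY-MM-DD` strings before passing them to a helper like this one.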
