<a href="https://colab.research.google.com/github/PurpleDin0/news-scraping-exercise/blob/master/MST698S_news_scraping_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MST 698S - Data Science Tools And Techniques 
# Bad Ozone Grasshoppers - News Scraper Exercise

**Summary:** This notebook installs the Python libraries and operating system (OS) packages needed to run a Python-based news scraper that targets the Reuters news website. It then walks the user through saving the scraped data to Google Drive and opening the data with pandas.  

**Usage Details:** This notebook is designed to run in the [Google Colab environment](https://colab.research.google.com/). However, it should work in ***most*** Linux-based Jupyter Notebook or JupyterLab environments. Its main purpose is to install the relevant Python libraries, execute the web scraping code, and save the output to a cloud repository.  
***CAUTION:*** If you run this notebook on a Windows-based system, you will need to install Git and update the default file paths to match Windows formatting.
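
As a rough illustration of the file path caution above, the hard-coded `/content/...` paths used later in this notebook could be built with `pathlib`, which picks the right separator for the host OS. This is a minimal sketch: the `repo_dir` and `dump_path` helpers and their `base` argument are hypothetical and not part of the repo.

```python
from pathlib import Path

REPO_NAME = "news-scraping-exercise"   # folder created by `git clone`
DUMP_NAME = "news_dump_object.json"    # file saved by the scraper

def repo_dir(base=None):
    """Return the repo folder under `base` (hypothetical helper).

    Defaults to the current working directory, so the same code can be
    used in Colab (/content) and in a local checkout on any OS.
    """
    base = Path(base) if base is not None else Path.cwd()
    return base / REPO_NAME

def dump_path(base=None):
    """Return the path of the JSON dump inside the repo folder."""
    return repo_dir(base) / DUMP_NAME

print(dump_path("/content"))
```

On Windows the same call would print the path with backslashes, which is the point of letting `pathlib` do the joining.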

## Initialize the Environment 
1. Clone the GitHub repo [located here](https://github.com/PurpleDin0/news-scraping-exercise).  
2. Install all required dependencies. This is best done by listing all dependencies in a `requirements.txt` file in the GitHub repo and running `pip install` against that file (see below for example code).
```
!pip install -r requirements.txt
```
3. Install the webdriver for Selenium. This is needed because Google Colab notebook instances do not start with any web browsers installed. Selenium uses a webdriver for Chrome, Firefox, or Internet Explorer to drive many of its functions. Luckily, we can install programs using shell commands (`!`) or magics (`%`) (see below for example code or [read info here](https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com)).
```python
!apt-get update 
!apt install chromium-chromedriver
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
```
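
Before launching the scraper, it can help to confirm that the driver binary actually exists. The sketch below is an illustrative check, not part of the repo: `find_chromedriver` is a hypothetical helper, and the candidate locations are the ones mentioned in this notebook (`/usr/lib/chromium-browser/chromedriver` and `/usr/bin`). Other Colab images or local machines may install the driver elsewhere.

```python
import os
import shutil

# Locations mentioned in this notebook; other systems may differ (assumption).
DEFAULT_CANDIDATES = (
    "/usr/lib/chromium-browser/chromedriver",
    "/usr/bin/chromedriver",
)

def find_chromedriver(candidates=DEFAULT_CANDIDATES):
    """Return the first usable chromedriver path, or None if none is found.

    Checks the shell PATH first, then falls back to the explicit candidates.
    """
    on_path = shutil.which("chromedriver")
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None

driver_path = find_chromedriver()
if driver_path is None:
    print("chromedriver not found; rerun the apt install step")
else:
    print(f"chromedriver found at {driver_path}")
```

With Selenium 3.141.0 (the version pinned in `requirements.txt`), the returned path could be passed as `webdriver.Chrome(executable_path=driver_path, options=...)`. Whether `news_reuters.py` creates its driver this way is not shown here, so treat that as an assumption.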



In [1]:
### 1. Clone the GitHub repo ###
# Navigate the working directory in colab to "/content" 
%cd /content/
# clone the relevant github repo
!git clone https://github.com/PurpleDin0/news-scraping-exercise.git
# Navigate to the newly created repo folder
%cd /content/news-scraping-exercise

### 2. Install the required dependencies ###
# install all required python libraries
!pip install -r requirements.txt
# Once installed you may need to restart the runtime (Colab will tell you if a restart is required)

/content
Cloning into 'news-scraping-exercise'...
remote: Enumerating objects: 51, done.
remote: Counting objects: 100% (51/51), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 51 (delta 24), reused 13 (delta 4), pack-reused 0
Unpacking objects: 100% (51/51), done.
/content/news-scraping-exercise
Collecting selenium==3.141.0
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
     |████████████████████████████████| 911kB 2.8MB/s 
Collecting beautifulsoup4==4.8.0
  Downloading https://files.pythonhosted.org/packages/1a/b7/34eec2fe5a49718944e215fde81288eec1fa04638aa3fb57c1c6cd0f98c3/beautifulsoup4-4.8.0-py3-none-any.whl (97kB)
     |████████████████████████████████| 102kB 9.1MB/s 
Collecting lxml==4.5.0
  Downloading https://files.pythonhosted.org/packages/dd/ba/a0e6866057fc0bbd17192925c1d63a3b85cf522965de9bc0236

In [2]:
### 3. Install the webdriver for Selenium ###
# Install the Chromium webdriver so Selenium can work
!apt-get update # refreshes the Ubuntu package index so apt install can find the package
!apt install chromium-chromedriver
#!cp /usr/lib/chromium-browser/chromedriver /usr/bin #If running on a local machine you might need this line
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
#from selenium import webdriver #might not need this line


Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ Packages [92.1 kB]
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:7 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:9 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:10 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:13 http://ppa.launchpad.net/marutter/c2d4u3.5/ubuntu bionic InRelease [15.4 kB]
Get:14 http://security.ubuntu.com/ubuntu bioni

## Run the Reuters news scraper

In [3]:
# Import the news_reuters.py module and execute it using the Chrome browser agent we just installed
%cd /content/news-scraping-exercise/
import news_reuters
news_reuters.main(browser_agent="Chrome")

/content/news-scraping-exercise
Executing using Chrome webdriver
Getting Reuters articles...
Unable to decode...skipping article...
Saving news object...




## Export the Reuters news dump
Mount your Google Drive and save `news_dump_object.json` to it.  
This allows you to access the scraped data at a later time.  
Additionally, you could load this saved file before executing `news_reuters.py` so that only new articles are added to the JSON file.
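
The incremental idea above could be sketched as follows. This is only an illustration: it assumes the dump is a JSON list of article records that each carry a `url` key, which may not match the layout `news_reuters.py` actually writes. Both `merge_articles` and `update_dump` are hypothetical helpers, and the records are made up.

```python
import json
from pathlib import Path

def merge_articles(saved, fresh, key="url"):
    """Return `saved` plus the items of `fresh` whose `key` is not already present."""
    seen = {article[key] for article in saved}
    merged = list(saved)
    for article in fresh:
        if article[key] not in seen:
            merged.append(article)
            seen.add(article[key])
    return merged

def update_dump(dump_file, fresh):
    """Merge `fresh` into the JSON list stored at `dump_file` and rewrite it."""
    dump_file = Path(dump_file)
    if dump_file.exists():
        saved = json.loads(dump_file.read_text(encoding="utf-8"))
    else:
        saved = []  # first run: nothing saved yet
    merged = merge_articles(saved, fresh)
    dump_file.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged

# Tiny in-memory demonstration with made-up records
old = [{"url": "https://example.com/a", "Title": "A"}]
new = [{"url": "https://example.com/a", "Title": "A"},
       {"url": "https://example.com/b", "Title": "B"}]
print(len(merge_articles(old, new)))
```

Keying on `url` keeps the first copy of each article, so rerunning the scraper on the same day would not duplicate rows.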



In [4]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
# Export the news dump to your Google Drive (CHANGE THE PATH to where you want to save the file)
!cp /content/news-scraping-exercise/news_dump_object.json '/content/drive/My Drive/Colab Notebooks/Coursework/698S/news-scraping-exercise'
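
Note that `cp` does not create missing destination folders. A possible Python alternative, using only the standard library, creates the folder first. `export_dump` is a hypothetical helper, and the Drive path in the commented call is the example path from the cell above, which should be changed to match your own Drive layout.

```python
import shutil
from pathlib import Path

def export_dump(src, dest_dir):
    """Copy `src` into `dest_dir`, creating the folder if needed; return the new path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir))

# Example call (paths taken from the cell above; adjust to your Drive layout):
# export_dump(
#     "/content/news-scraping-exercise/news_dump_object.json",
#     "/content/drive/My Drive/Colab Notebooks/Coursework/698S/news-scraping-exercise",
# )
```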

## View the Reuters news dump

In [6]:
%cd /content/news-scraping-exercise/
import os
import pandas as pd
json_path = os.path.join(os.getcwd(), 'news_dump_object.json')
df = pd.read_json(json_path)
df

/content/news-scraping-exercise


| | date | time | source | Title | Text | url |
|---|---|---|---|---|---|---|
| 0 | 2020-05-18 | 8:37 PM | www.reuters.com | U.S. Senator Rubio chosen as acting Intelligen... | WASHINGTON (Reuters) - U.S. Senator Marco Rubi... | https://www.reuters.com/article/us-usa-intelli... |
| 1 | 2020-05-18 | 8:27 PM | www.reuters.com | Supplier restarts flow of critical truck parts... | TOLEDO, Ohio (Reuters) - Dana Inc (DAN.N), a k... | https://www.reuters.com/article/us-health-coro... |
| 2 | 2020-05-18 | 11:13 AM | www.reuters.com | S&P 500 closes at 10-week high on vaccine hope... | NEW YORK (Reuters) - U.S. stocks jumped on Mon... | https://www.reuters.com/article/us-usa-stocks/... |
| 3 | 2020-05-18 | 8:47 PM | www.reuters.com | Baidu forecasts current-quarter revenue above ... | (Reuters) - Chinese search engine giant Baidu ... | https://www.reuters.com/article/us-baidu-resul... |
| 4 | 2020-05-18 | 9:00 PM | www.reuters.com | Fannie Mae, Freddie Mac to hire financial advi... | (Reuters) - Mortgage companies Fannie Mae and ... | https://www.reuters.com/article/us-usa-housing... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 2020-05-18 | 5:34 AM | www.reuters.com | Ryanair shares surge on reduced cash burn, opt... | DUBLIN (Reuters) - Ryanair shares surged 15% o... | https://www.reuters.com/article/us-ryanair-res... |
| 96 | 2020-05-18 | 4:45 PM | www.reuters.com | PM hopes Slovaks can "live freely again" soon ... | BRATISLAVA (Reuters) - Slovakia’s prime minist... | https://www.reuters.com/article/us-health-coro... |
| 97 | 2020-05-18 | 4:20 PM | www.reuters.com | Bank of England not ruling out negative rates ... | LONDON (Reuters) - Bank of England official Si... | https://www.reuters.com/article/us-health-coro... |
| 98 | 2020-05-18 | 7:34 AM | www.reuters.com | European shares surge as recovery hopes boost ... | (Reuters) - European shares enjoyed their best... | https://www.reuters.com/article/us-europe-stoc... |
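
The `date` and `time` columns in the preview are separate strings, so sorting articles chronologically needs them combined first. The sketch below does this with the standard library on three records copied from the preview above. The `published_at` helper is hypothetical, and the format string assumes every record follows the `YYYY-MM-DD` and `H:MM AM/PM` patterns shown.

```python
from datetime import datetime

def published_at(record):
    """Combine a record's `date` and `time` strings into one datetime."""
    stamp = f"{record['date']} {record['time']}"
    return datetime.strptime(stamp, "%Y-%m-%d %I:%M %p")

# Three records copied from the preview (only the fields needed here)
records = [
    {"date": "2020-05-18", "time": "8:37 PM",
     "Title": "U.S. Senator Rubio chosen as acting Intelligen..."},
    {"date": "2020-05-18", "time": "11:13 AM",
     "Title": "S&P 500 closes at 10-week high on vaccine hope..."},
    {"date": "2020-05-18", "time": "5:34 AM",
     "Title": "Ryanair shares surge on reduced cash burn, opt..."},
]

# Print the articles from earliest to latest
for record in sorted(records, key=published_at):
    print(published_at(record).isoformat(), record["Title"])
```

The same parsing could be applied to the full DataFrame, for example by building a combined string column and passing it to `pd.to_datetime`, though that is left as an exercise here.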
