# Ethical Data Collection for Financial News 

<br>
<figure>
  <img src="./img/icon-web-scraping.png" />
      <figcaption> <center>From: <a href='https://arbisoft.com/services/data-services/web-scraping-services/'> arbisoft </a>  </center></figcaption>

</figure>


## Introduction: 

According to the **<a href="https://www.kaggle.com/surveys/2017">2017 Kaggle survey</a>**, data availability and quality remain among the top barriers for professionals in the field.
<br>
<figure>
  <img src="./img/Screenshot0.png" />
  <br>    
  <figcaption> <center>Kaggle survey results(2017)</center></figcaption>
</figure>

<br><br>
This notebook deals with **the availability problem**, and walks through the steps of scraping web pages and collecting the data in a responsible and efficient manner.

## Table of Contents:

1. [Ethical Scraping:](#1)

1. [Efficient Scraping:](#2)

1. [Pre-Code Analysis:](#3)
    1. [Examining the Source](#3.1)
    1. [Examining the HTML](#3.2)
1. [Code:](#4)
    1. [Environment and tools](#4.1)
    1. [Imports](#4.2)
    1. [Making a request to a single page](#4.3)
    1. [Code Structure](#4.4)
    1. [Getting the details of a single Article](#4.5)
    1. [Getting the details of a single Page: (list of Articles)](#4.6)
    1. [Saving to CSV](#4.7)
    1. [Looping over the Pages of the Category: (the General function)](#4.8)

1. [Checking the resulting dataset](#5)
1. [Future Improvements](#6)
1. [Up next: Starting our NLP pipeline for this dataset](#7)
1. [Resources](#8)


## <a id="1"> Ethical Scraping guidelines:</a> 
Scraping the web is a fairly easy and very powerful method, so it must be used responsibly. You can find more on this in the Resources section.

- **Transparency:** identify yourself and your purpose, or provide a contact in case the page owner wants to reach you; this can easily be done in the User-Agent header.

- **Ownership:** the content you are collecting is not your own; always cite the source or the original author.

- **Overuse:** request data at a reasonable rate to avoid stressing or crashing the server; you can set a sleep time between requests.
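The three guidelines above can be sketched with the standard library alone. This is a minimal illustration, not a complete policy: the `USER_AGENT` string, the contact address, the helper names `build_robot_rules` and `polite_pause`, and the sample `robots.txt` content are all hypothetical, and checking `robots.txt` is an extra courtesy that the rest of this notebook does not implement.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identity: replace with your own project name and contact address.
USER_AGENT = "news-tutorial-scraper/0.1 (contact: you@example.com)"
HEADERS = {"User-Agent": USER_AGENT}


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_pause(last_request_time, min_delay=10.0):
    """Sleep just long enough to keep `min_delay` seconds between requests."""
    elapsed = time.monotonic() - last_request_time
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)
    return time.monotonic()


# Made-up robots.txt content, only to show how the rules are read.
sample_robots = """User-agent: *
Disallow: /private/
"""
rules = build_robot_rules(sample_robots)
print(rules.can_fetch(USER_AGENT, "https://example.com/news/page"))     # allowed path
print(rules.can_fetch(USER_AGENT, "https://example.com/private/page"))  # disallowed path
```

`HEADERS` is meant to be passed to the HTTP client, and `polite_pause` to be called between two consecutive requests with the time returned by the previous call.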

## <a id="2">Efficient Scraping guidelines:</a> 
- **Identify what you're looking for:** web pages can be overloaded with information or laid out in a less than optimal way, so always pin down the precise path to the data you are collecting.

- **DRY (Don't Repeat Yourself):** scraping is a repetitive process, so structure the code accordingly; for example, assign each task to its own function.

- **Update the scraper's code regularly**, as websites change their HTML layouts quite often.

- **Make as many checks as you can** (if statements, assertions, exceptions) to keep track of your scraper's failures when you are scraping a large number of pages.
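As one possible way to keep track of failures, the sketch below wraps a per-item parser so that a broken item is logged and skipped instead of stopping the whole run. The helper name `parse_all` and the toy items are assumptions made for this illustration; they are not part of the scraper built later in this notebook.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")


def parse_all(items, parse_one):
    """Apply `parse_one` to every item, logging failures instead of stopping."""
    results, failures = [], []
    for index, item in enumerate(items):
        try:
            parsed = parse_one(item)
        except (AttributeError, KeyError, IndexError, ValueError) as error:
            logger.warning("item %d skipped: %r", index, error)
            failures.append(index)
            continue
        results.append(parsed)
    return results, failures


# Toy data standing in for scraped items: the second one lacks a title.
items = [{"title": "Gold rises"}, {}, {"title": "Oil slips"}]
results, failures = parse_all(items, lambda item: item["title"].strip())
print(results)   # ['Gold rises', 'Oil slips']
print(failures)  # [1]
```

Keeping the list of failed indexes makes it easy to see, after a long run, how many items were lost and where.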

## <a id="3">Pre-Code Analysis:</a> 
### <a id="3.1">Examining the source:</a>
In this tutorial we will be collecting financial news from the **<a href="https://www.investing.com/news">Investing.com news pages</a>**.
> **OBJECTIVE:**
> As a simple demo, we want to collect the following information for each article:
> - the title of the article
> - the source and time of the article
> - the first paragraph
>
> In addition, we want to do this for all the pages in the category.

<br>
<figure>
<img src = "./img/Screenshot1.png"/> 
<br> <img src = "./img/Screenshot2.png"/> 
<br>
<figcaption> <center> Starting page (Simple view)</center></figcaption> 
</figure>

### <a id="3.2">Examining the HTML:</a>
<br>
<figure>

<img src = "./img/screen4.png"> </img><br>
<img src = "./img/screen6.png"> </img>
<br>
<figcaption> <center> Starting page (HTML Inspection)</center></figcaption> 
</figure>

**So,** after inspecting the HTML source, our access path is the following:

        |<div class="largeTitle">                       | The articles list
        |---<article class="articleItem" >              |--- Single article
        |--- ---<a href="..." class="title">            |--- --- The article title
        |--- ---<span class="articleDetails">  <======> |--- --- Source and time of the article
        |--- ---<p>                            <======> |--- --- The first paragraph of the article
        |---<article class="articleItem" >              |--- Next single article
        |   .......                                     |    ........
        |   .......                                     |    ........
        |<a href="..." class="pagination">              | Page numbers
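To make this access path concrete, here is a small sketch that walks the same structure with Beautiful Soup. It assumes the `bs4` package is installed, and the HTML string is made up to mimic the outline above; the markup of the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only mimics the structure sketched above.
sample_html = """
<div class="largeTitle">
  <article class="articleItem">
    <a href="/news/commodities-news/example-1" class="title">Example title</a>
    <span class="articleDetails"><span>By Example Source</span><span class="date"> - 1 hour ago</span></span>
    <p>Example first paragraph.</p>
  </article>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
article_list = soup.find("div", {"class": "largeTitle"})

rows = []
for article in article_list.find_all("article", {"class": "articleItem"}):
    title_tag = article.find("a", {"class": "title"})
    details = article.find("span", {"class": "articleDetails"})
    # details.span is the first <span> nested inside the details block
    rows.append((title_tag.text, title_tag["href"], details.span.text, article.find("p").text))

print(rows)
```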



## <a id="4">  The Code, Finally! : </a>
### <a id="4.1">Environment and tools: </a>
- My default environment setup is **Python 3.5** kernel in a **Jupyter Notebook**
- We will be using <a href="http://docs.python-requests.org/en/latest/">Requests</a> and <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup</a> as our scraping tools:
   - **Requests:** "allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3."
   - **Beautiful Soup:** "is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

### <a id="4.2">Imports</a>

In [3]:
import time
import csv
import os.path
import numpy as np 
import pandas as pd 
import requests 
from bs4 import BeautifulSoup 

### <a id="4.3"> Making a request to a single page:</a>
Our first function makes the HTTP request to a given URL and either returns the page response or raises an error, depending on the status of the response.
<br>

In [4]:
def request_with_check(url):
    
    # Identify the scraper and its purpose in the User-Agent header (transparency)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36 , For: a Tutorial kernel By: elamraoui sohayb'}
    page_response = requests.get(url, headers=headers, timeout=60)
    
    # Any status code of 300 or above is treated as a failure
    if page_response.status_code > 299:
        raise AssertionError("page content not found, status: %s" % page_response.status_code)
    
    return page_response

#### Testing:

In [5]:
page_test = request_with_check('https://www.investing.com/news/commodities-news')
page_test.text



### <a id="4.4">Code Structure: </a>
Now that we can make requests to the site's pages, we should extract our data from the page code we are getting:
  - **FOR each** Page **IN** the News Category, **DO:**
    - **FOR each** Article **IN** the page, **DO:**
       - **Get_Details:** title, link, date, source, first paragraph
    - **Write the Details of the page to a CSV file**
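The same structure can be sketched as a runnable skeleton. The three helper functions here are stubs with hypothetical names that return made-up data; they only stand in for the real request, extraction, and CSV steps, so that the shape of the two nested loops is visible.

```python
def fetch_articles(page_number):
    """Stub standing in for the request + HTML parsing of one page."""
    return [{"title": "article %d-%d" % (page_number, i)} for i in range(2)]


def extract_details(article):
    """Stub standing in for the per-article extraction."""
    return {"title": article["title"].upper()}


def save_page(rows, storage):
    """Stub standing in for the CSV writer: here rows go into a list."""
    storage.extend(rows)


storage = []
for page_number in range(1, 3):                  # FOR each page in the category
    page_rows = []
    for article in fetch_articles(page_number):  # FOR each article in the page
        page_rows.append(extract_details(article))
    save_page(page_rows, storage)                # write the details of the page
print(len(storage))  # 4
```

Saving once per page, rather than once at the very end, means that a failure on a later page does not lose the pages already collected.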

### <a id="4.5"> Getting the details of a single Article: </a>
>Our first block of operations takes a single article item from the list and extracts its relevant information as return values:

In [6]:
def get_details(single_article):
    
    # The title is in <a></a> with the 'class' attribute set to: title
    title = single_article.find('a',{'class':'title'})

    # A safeguard against some empty articles in the deeper pages of the site
    if title is None:
        #print('Empty Article')
        return None
    
    # The link to an article is the href attribute
    link = title['href']
    
    # A safeguard against embedded advertisement articles: each substring gets its own test
    # (category_name is the global variable set before the scraper is run)
    if '/news/' not in link or category_name not in link:
        #print('Ad Article found')
        return None       
        
    title = title.text
    
    # The first paragraph is in <p></p>
    first_p = single_article.find('p').text
    
    # The source is in <span></span>, with class == articleDetails
    source_tag = single_article.find_all('span',{'class':'articleDetails'})
    source = str(source_tag[0].span.text)
    
    # The date is also in <span></span>, with class == date
    date = single_article.find('span',{'class':'date'}).text
    
    return title, link, first_p, source, date  
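A note on filtering links by more than one substring: each substring needs its own `in` test, because Python evaluates `'a' and 'b'` to a single string before `in` is applied. The short illustration below uses made-up values for `link` and `category`.

```python
link = "/analysis/commodities-news/some-article"   # made-up link without '/news/'
category = "commodities-news"

# `'/news/' and category` evaluates to `category` alone (the last truthy operand),
# so this expression never looks for '/news/' at all.
combined = ("/news/" and category) not in link

# Testing each substring separately checks both conditions.
explicit = "/news/" not in link or category not in link

print(("/news/" and category) == category)  # True
print(combined, explicit)                   # False True
```

With the combined form this link would be accepted as a news article; with the explicit form it is rejected.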

### <a id="4.6"> Getting the details of a single Page: (list of Articles) </a>
>In the second block, we request the page URL, find the list of articles to iterate over, and call the first function (above) at each iteration, appending the results to a list of dictionaries:

In [7]:
def single_page(Url_page,page_id = 1):

    news_list = []

    # Making the HTTP request
    page = request_with_check(Url_page)
    
    # Calling the html.parser to start extracting our data
    html_soup = BeautifulSoup(page.text, 'html.parser')
    
    # The container of the articles list
    articles = html_soup.find('div',{'class':'largeTitle'})
    
    # The list of single articles
    articleItems = articles.find_all('article' ,{'class':'articleItem'})

    # Looping over each single article; get_details is called once per article
    for article in articleItems:
        details = get_details(article)
        if details is None:
            continue
        
        title, link, first_p, source_tag, date = details
        news_list.append({'id_page':page_id,
                          'title':title,   
                          'date':date,
                          'link': link,
                          'source':source_tag,
                          'first_p':first_p})

    return news_list

### <a id="4.7"> Saving to CSV:</a>

> The third block saves the list of news dictionaries (returned by the second function) to a CSV file. We check whether the file already exists: if it does, we append to it; if not, we create it as a new file.

In [8]:
def dict_to_csv (filename,news_dict):
    
    # Setting the CSV column headers from the keys of the first dictionary
    fields = list(news_dict[0].keys())
    
    # Checking if the file already exists: if it does we append to it, if not we create it
    has_header = False
    if os.path.isfile(filename):
        with open(filename, 'r', encoding='utf-8') as csvfile:
            sniffer = csv.Sniffer()
            has_header = sniffer.has_header(csvfile.read(2048))
    
    # newline='' lets the csv module control the line endings
    with open(filename, 'a', errors='ignore', encoding='utf-8', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fields)
        if not has_header:
            writer.writeheader()  
        for item in news_dict:
            writer.writerow(item)
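The append-or-create logic can also be tried in isolation. The sketch below is a simplified variant, not the function above: it decides whether to write the header from the file's existence and size rather than from `csv.Sniffer`, and the name `append_rows` and the demo rows are assumptions made for this example.

```python
import csv
import os
import tempfile


def append_rows(filename, rows):
    """Append dict rows to a CSV file, writing the header only for a new or empty file."""
    if not rows:
        return
    needs_header = not os.path.isfile(filename) or os.path.getsize(filename) == 0
    with open(filename, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=list(rows[0].keys()))
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)


# Two successive "pages" written to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
append_rows(path, [{"id_page": 1, "title": "first"}])
append_rows(path, [{"id_page": 2, "title": "second"}])

with open(path, newline="", encoding="utf-8") as csvfile:
    lines = csvfile.read().splitlines()
print(lines)  # ['id_page,title', '1,first', '2,second']
```

The header appears once even though the file was opened twice, which is the behaviour we want when saving page after page.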

### <a id="4.8"> Looping over the Pages of the Category: (the General function) </a>

> Finally, the general function, where we iterate over the pages and apply the blocks defined above to each page of news.
<br><br>
> **Note:** at each iteration, i.e. after parsing a new page, we force a 10-second pause of the scraper, so as not to overuse our access to the site.

In [9]:
def parsing_category_pages(category_name,base_url,number_pages):
    start_time = time.time()
    
    # Getting the start page
    page = request_with_check(base_url)

    # Calling the HTML parser
    html_soup = BeautifulSoup(page.text, 'html.parser')
    
    # Finding the last page number
    last_page = int(html_soup.find_all(class_='pagination')[-1].text)

    if number_pages > last_page:
        number_pages = last_page

    # Looping over the specified number of pages
    # (range() excludes its upper bound, so pages 1 to number_pages-1 are parsed)
    for p in range(1,number_pages,1):
        category_page = base_url+'/'+str(p)
        print('Parsing: ',category_page)
        page_news = single_page(category_page,p)
        
        # Saving to a CSV
        dict_to_csv(category_name+'.csv',page_news)
        
        # Pausing between requests
        time.sleep(10)
    
    print("--- %s seconds ---" % (time.time() - start_time))
    return True

#### Testing:

In [11]:
URL = 'https://www.investing.com/news/'
category_name = 'commodities-news'
base_url = URL+category_name
parsing_category_pages ('commodities-news',base_url,number_pages=5)

Parsing:  https://www.investing.com/news/commodities-news/1
Parsing:  https://www.investing.com/news/commodities-news/2
Parsing:  https://www.investing.com/news/commodities-news/3
Parsing:  https://www.investing.com/news/commodities-news/4
--- 57.46652936935425 seconds ---


True

## <a id="5"> Checking the resulting dataset: </a>

In [12]:
data = pd.read_csv('commodities-news.csv')
data.head(100)

| | id_page | title | date | link | source | first_p |
|---|---|---|---|---|---|---|
| 0 | 1 | Democrats Regaining House Seen Raising Odds of... | - 1 hour ago | /news/commodities-news/democrats-regaining-hou... | By Bloomberg | (Bloomberg) -- If the Democrats take over the ... |
| 1 | 1 | Gold Prices Advance as Dollar Softens | - 5 hours ago | /news/commodities-news/gold-prices-advance-as-... | By Investing.com | Investing.com - Gold prices advanced on Thursd... |
| 2 | 1 | Oil slips on signs of rising supplies, economi... | - 5 hours ago | /news/commodities-news/oil-prices-fall-on-sign... | By Reuters | BEIJING (Reuters) - Oil prices fell on Thursd... |
| 3 | 1 | Oil Prices Slip Amidst U.S. Stock Build | - 6 hours ago | /news/commodities-news/oil-prices-slip-amidst-... | By Investing.com | Investing.com - Oil prices edged down on Thurs... |
| 4 | 1 | Trump says oil supply elsewhere sufficient to ... | - 12 hours ago | /news/commodities-news/trump-says-oil-supply-e... | By Reuters | WASHINGTON (Reuters) - U.S. President Donald ... |
| 5 | 1 | Oil Down 11% In October, Biggest Loss in Over ... | - 15 hours ago | /news/commodities-news/oil-selloff-pauses-on-w... | By Investing.com | Investing.com - Oil prices in October posted t... |
| 6 | 1 | Gold Limps to a 2% October Gain, Still Largest... | - 15 hours ago | /news/commodities-news/gold-end-october-up-2-b... | By Investing.com | Investing.com - More profit-taking on gold’s r... |
| 7 | 1 | U.S. Passes Russia -- Briefly -- to Become Top... | - 17 hours ago | /news/commodities-news/us-passes-russia--brief... | By Bloomberg | (Bloomberg) -- The U.S. surpassed Russia in Au... |
| 8 | 1 | OPEC oil output rises to highest since 2016 de... | - 20 hours ago | /news/commodities-news/opec-oil-output-rises-t... | By Reuters | By Alex Lawler LONDON (Reuters) - OPEC has bo... |
| 9 | 1 | U.S. Crude Oil Inventories Rose by 3.22M Barre... | - 20 hours ago | /news/commodities-news/us-crude-oil-inventorie... | By Investing.com | Investing.com - U.S. crude oil inventories ros... |
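Beyond eyeballing the first rows, a few quick checks can confirm that the file is usable. The sketch below runs them on a tiny made-up DataFrame with the same kind of columns, so it does not depend on the scraped file; the values and the choice of `link` as the duplicate key are assumptions for this illustration.

```python
import pandas as pd

# Toy rows shaped like the scraped output (values are made up).
data = pd.DataFrame({
    "id_page": [1, 1, 2],
    "title": ["Gold rises", "Oil slips", "Oil slips"],
    "link": ["/news/commodities-news/a", "/news/commodities-news/b", "/news/commodities-news/b"],
    "source": ["By Reuters", "By Bloomberg", "By Bloomberg"],
})

print(data.shape)                                   # rows and columns collected
print(data.isna().sum())                            # missing values per column
duplicates = data.duplicated(subset="link").sum()   # articles scraped more than once
print(duplicates)
unique_articles = data.drop_duplicates(subset="link")
print(unique_articles["source"].value_counts())     # articles per source
```

Duplicates are plausible in practice because new articles can push older ones onto the next page while the scraper is pausing between requests.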


## <a id="6"> Future Improvements: </a>
- Adding `start_page` and `stop_page` parameters,
- Adding a resume block for scraping part by part,
- Automating the scraping of all news categories,
- Automating the scraping to access the full text of the articles.
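As a rough idea of what the first improvement could look like, the sketch below generates the page URLs for an inclusive page range, reusing the `base_url + '/' + page number` pattern from the general function. The name `page_urls` and its parameters are hypothetical; nothing here is implemented in the notebook yet.

```python
def page_urls(base_url, start_page=1, stop_page=1):
    """Yield (page_number, url) pairs for pages start_page..stop_page inclusive."""
    if start_page < 1 or stop_page < start_page:
        raise ValueError("expected 1 <= start_page <= stop_page")
    for page_number in range(start_page, stop_page + 1):
        yield page_number, base_url + "/" + str(page_number)


for page_number, url in page_urls("https://www.investing.com/news/commodities-news", 3, 5):
    print(page_number, url)
```

A resume block could then read the last `id_page` already saved in the CSV and pass the next page number as `start_page`.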

## <a id="7">Up next: Starting our NLP pipeline for this dataset</a>

###                             (Coming Soon .............. )

## <a id="8"> Resources: </a>
- **More on ethical scraping:**
   - Legality and Ethics of Web Scraping, by *Vlad Krotov* and *Leiser Silva* <a href="https://www.researchgate.net/publication/324907302_Legality_and_Ethics_of_Web_Scraping">[ResearchGate]</a>
   - Legality, ethics, web scraping, by *Sudarshan Shidore* <a href="https://www.linkedin.com/pulse/legality-ethics-web-scraping-sudarshan-shidore/">[LinkedIn]</a>
   - Ethics in Web Scraping, by *James Densmore* <a href="https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01">[TDS]</a>
- **More on efficient production programming:**
   - How to write a production-level code in Data Science?, by *Venkatesh Pappakrishnan* <a href='https://towardsdatascience.com/how-to-write-a-production-level-code-in-data-science-5d87bd75ced'>[TDS]</a>
   - Writing clean, testable, high quality code in Python, by *Noah Gift* <a href='https://www.ibm.com/developerworks/aix/library/au-cleancode/index.html'>[IBM]</a>
        

>### <p>&copy;</p>

**By:** <a href='https://www.linkedin.com/in/sohayb-elamraoui/'>Elamraoui Sohayb</a>, **Supervision of:** <a href='https://www.linkedin.com/in/ACoAAARR-RkBQxLhbsUsrqHkxCRa8KwwtZnP0mA/'>Sadiq Abdelalim,Phd</a>. Master 'Big Data & Cloud Computing', Ibn Tofail University