# Websites

* Forbes.com
* CNBC.com
* Cointelegraph.com
* Bitcoin.com
* Coindesk.com

There are many sources of cryptocurrency news. However, meaningful data must come from reputable sources, so the five websites listed above were selected. Cointelegraph, Bitcoin.com, and Coindesk were the highest-rated cryptocurrency-specific news sources. Forbes and CNBC, by contrast, are reputable general news sources that are not specific to the cryptocurrency world. They are included to give a well-rounded data set.

### SSL verification is turned off, so Python will throw a bunch of warnings - turn off those warnings

In [1]:
import warnings
warnings.filterwarnings("ignore")
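
The cell above silences every warning. As a narrower alternative, here is a minimal sketch that ignores only warnings whose message starts with "Unverified HTTPS request", which is how the warning raised for `verify = False` requests begins. The helper name `silence_unverified_https_warnings` and the sample messages are made up for illustration, and the warnings are simulated with `warnings.warn` so the example runs without a network call.

```python
import warnings


def silence_unverified_https_warnings():
    # Hypothetical helper: ignore only warnings whose message starts with
    # this text (the message argument is a regex matched at the start).
    warnings.filterwarnings("ignore", message="Unverified HTTPS request")


# Simulate two warnings and record which ones still get through.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_unverified_https_warnings()
    warnings.warn("Unverified HTTPS request is being made to host 'example.com'.")
    warnings.warn("something unrelated")

remaining = [str(w.message) for w in caught]
print(remaining)
```

Only the unrelated warning is recorded, so other problems stay visible while the SSL noise is hidden.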


### Create a class to help scrape websites

In [2]:
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests
import re
from datetime import datetime


class parse_web_page:
    #constructor
    def __init__(self,url,website):
        try:
            #Make a request to the url and read in the webpage, set SSL verification to False (not transmitting sensitive info)
            self.html = requests.get(url, verify = False).text
            #Make soup!
            self.soup = BeautifulSoup(self.html, 'html.parser')
            #Extract all text from the website
            self.text = self.soup.get_text()
            #Set flags depending on which website is being processed
            if website == 'forbes':
                self.flag = 1
            if website == 'cnbc':
                self.flag = 2
            if website == 'cointelegraph':
                self.flag = 3
            if website == 'bitcoin':
                self.flag = 4
            if website == 'coindesk':
                self.flag = 5
        except Exception:
            print('Please enter a valid URL')
            
    def page_numbers(self):
        #ONLY FOR FORBES SITES - Forbes splits its articles across multiple pages
        #Extract how many pages there are for an article
        try:
            if self.flag == 1:
                #Example of Match Pattern: 'Page 1 / 3'
                #Search for pattern and return last value as total pages for the article
                total_pages = re.findall(r'Page\s[0-9]\s/\s([0-9])',self.text)[0]
                return int(total_pages)
            else:
                return None
        except Exception:
            return None
        
    def get_headline(self):
        #Extract the headline of the article
        #Headlines for Coindesk articles are under the <title> tag
        if self.flag == 5:
            headline = self.soup.title.text
        #Headlines for all other websites are under the <h1> tag
        else:
            headline = self.soup.h1.text
        return headline
    
    def date_and_time(self):
        #Forbes date and time
        if self.flag == 1:
            date = re.search(r'[A-Za-z]{3}\s[0-9]+,\s[0-9]{4}\s@\s[0-9]+:[0-9]{2}\s[A-Z]{2}',self.text)
            date_time_object = date.group().split('@')
            date_time = datetime.strptime(''.join(date_time_object).replace(',',''), '%b %d %Y  %I:%M %p')
            return date_time
        #CNBC date and time
        if self.flag == 2:
            date_time_object = self.soup.time.text.split()
            order = [6,5,-1,1,2]
            date_time_list = [date_time_object[i] for i in order]
            date_time = datetime.strptime(' '.join(date_time_list),'%b %d %Y %I:%M %p')
            return date_time
        #Cointelegraph date and time
        if self.flag == 3:
            tag = self.soup.find('div', class_ = 'date')
            date_time_object = tag['datetime']
            date_time = datetime.strptime(date_time_object, '%Y-%m-%d %H:%M:%S')
            return date_time
        #Bitcoin and Coindesk date and time
        if self.flag in (4, 5):
            tag = self.soup.time
            date_time_object = tag['datetime'].replace('T',' ').split('+')[0]
            date_time = datetime.strptime(date_time_object,'%Y-%m-%d %H:%M:%S')
            return date_time

    def raw_text(self):
        #Extract all <p> tags from articles - actual text of each website
        p_tags = self.soup.find_all('p')
        #create a list of all found <p> tags
        text = [item.text for item in p_tags]
        #Join all <p> tags to reconstruct the paragraphs of the articles
        return ' '.join(text)
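
To make the parsing in `date_and_time` and `page_numbers` easier to follow, here is a small sketch that applies the same patterns to plain strings instead of live pages. The sample strings `forbes_text` and `iso_stamp` are invented for illustration and are not taken from any scraped article.

```python
import re
from datetime import datetime

# Hypothetical snippets, made up to resemble the formats the class expects
forbes_text = "Jan 4, 2018 @ 09:15 AM 2,345 views Page 1 / 3"
iso_stamp = "2018-01-04T09:15:00+00:00"

# Forbes style: 'Jan 4, 2018 @ 09:15 AM'
match = re.search(
    r'[A-Za-z]{3}\s[0-9]+,\s[0-9]{4}\s@\s[0-9]+:[0-9]{2}\s[A-Z]{2}', forbes_text)
date_part, time_part = match.group().split('@')
cleaned = date_part.replace(',', '').strip() + ' ' + time_part.strip()
forbes_dt = datetime.strptime(cleaned, '%b %d %Y %I:%M %p')

# ISO style (the <time datetime="..."> attribute): drop the 'T' and the UTC offset
iso_dt = datetime.strptime(
    iso_stamp.replace('T', ' ').split('+')[0], '%Y-%m-%d %H:%M:%S')

# Forbes page counter: capture the total from 'Page 1 / 3'
total_pages = int(re.findall(r'Page\s[0-9]\s/\s([0-9])', forbes_text)[0])

print(forbes_dt, iso_dt, total_pages)
```

Note that the page pattern only captures a single digit, so it assumes articles never run past nine pages.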

### Create a function that will read in a list of URLs, retrieve each one, scrape the HTML and write the appropriate data to a CSV file

In [25]:
import csv
from tqdm import tqdm

def scrape_articles(urls, coin, website = None, path = 'C://Users//simskel//Desktop//scraped_data.csv'):
    
    
    #Nested helper ONLY FOR FORBES SITES - Used to extract all other pages in the forbes articles and concat them all
    def get_other_pages(url,num_of_pages,page_1,website):
        lst = [page_1]
        #Pages 2..num_of_pages (page 1 has already been scraped)
        list_of_pages = range(2, num_of_pages+1)
        url_of_pages = [''.join([url,str(page)]) for page in list_of_pages]
        for url in url_of_pages:
            web_page = parse_web_page(url,website)
            text = web_page.raw_text()
            lst.append(text)
        return ''.join(lst)

    
    with open(path, 'w', newline='', encoding = 'utf-8') as file:
        #Create a csv writer object
        csv_writer = csv.writer(file)
        #Write the headers to the csv file
        csv_writer.writerow(['Date','Coin','website','Headline','Text','Link'])
        
        for url in tqdm(urls):
            try:
                #If not given explicitly, extract the website from the url address
                #Used for all except Bitcoin and Cointelegraph, which pass website explicitly
                #A separate name is used so one url's site never carries over to the next url
                site = website if website is not None else url.split('.')[1]
                #Create web page object from class above
                page = parse_web_page(url,site)
                #Get paragraphs of the article
                text = page.raw_text()
                #Get date and time of the article
                date_time = page.date_and_time()
                #Get the headline of the article
                headline = page.get_headline()
                #Get total number of pages in the article - FORBES ONLY
                num_of_pages = page.page_numbers()
                
                #Get other Forbes pages and concat them
                if num_of_pages is not None:
                    text = get_other_pages(url,num_of_pages,text,site)
                #Write values to csv file
                csv_writer.writerow([date_time, coin, site, headline, text, url])
            except Exception:
                #Catch Exception rather than everything so Ctrl+C still stops the loop
                print('Could Not Process: \n{}'.format(url))
        print('Done Writing File')
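
When `website` is not passed, `scrape_articles` derives the site name with `url.split('.')[1]`. The sketch below shows what that expression returns for one of the CNBC links from this notebook and for a URL without a `www.` prefix. The helper name `site_from_url` and the second URL are hypothetical. This may be why the site name is passed explicitly for some sources, though the notebook does not say so.

```python
def site_from_url(url):
    # Same expression used inside scrape_articles
    return url.split('.')[1]


cnbc_url = 'https://www.cnbc.com/2017/09/11/ripple-ceo-brad-garlinghouse-on-bitcoin-and-xrp.html'
no_www_url = 'https://example.com/news/some-article'  # hypothetical URL

print(site_from_url(cnbc_url))    # cnbc
print(site_from_url(no_www_url))  # com/news/some-article
```

The second result is not a site name at all, so any URL list without the `www.` prefix needs the `website` argument.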

From each website listed above, a list of URLs to articles of interest was compiled. These URLs are now fed through the "scrape_articles" function, which scrapes each article and writes the resulting data to a CSV file.

### Create a list of the Forbes.com links

In [1]:
with open('Ripple_Forbes.txt') as fh:
    file = fh.readlines()
    urls_Forbes = [link.strip() for link in file]

### Create a list of the CNBC.com links

In [5]:
with open('Ripple_CNBC.txt') as fh:
    file = fh.readlines()
    urls_CNBC = [link.strip() for link in file]

### Create a list of the Coindesk.com links

In [6]:
with open('Ripple_coindesk.txt') as fh:
    file = fh.readlines()
    urls_Coindesk = [link.strip() for link in file]

### Create a list of the Cointelegraph.com links

In [7]:
with open('Ripple_cointelegraph.txt') as fh:
    file = fh.readlines()
    urls_Cointelegraph = [link.strip() for link in file]

### Create a list of the Bitcoin.com links

In [8]:
with open('Ripple_bitcoin.txt') as fh:
    file = fh.readlines()
    urls_Bitcoin = [link.strip() for link in file]
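
The five cells above repeat the same three lines. One possible way to factor them out is sketched here. The helper `load_urls` is not part of the original notebook, and it additionally skips blank lines, which the cells above do not. The demonstration writes a temporary file with made-up links so it runs without the `Ripple_*.txt` files.

```python
import os
import tempfile


def load_urls(path):
    """Return the non-empty, stripped lines of a text file as a list of URLs."""
    with open(path, encoding='utf-8') as fh:
        return [line.strip() for line in fh if line.strip()]


# Demonstration on a throwaway file (the links are placeholders)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample_links.txt')
    with open(sample_path, 'w', encoding='utf-8') as fh:
        fh.write('https://www.example.com/a\n\nhttps://www.example.com/b\n')
    sample_urls = load_urls(sample_path)

print(sample_urls)
```

With such a helper, each list could be built in one line, for example `urls_Forbes = load_urls('Ripple_Forbes.txt')`.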

# Create CSV of all Forbes.com articles

In [26]:
scrape_articles(urls_Forbes, 'Ripple', path = 'C:\\Users\\simskel\\Desktop\\Springboard-Data-Science-Immersive\\Capstone 1 Project\\Data\\Ripple_Forbes.csv')

100%|██████████████████████████████████████████| 44/44 [00:51<00:00,  1.16s/it]


Done Writing File


# Create CSV of all CNBC.com articles

In [12]:
scrape_articles(urls_CNBC, 'Ripple', path = 'C:\\Users\\simskel\\Desktop\\Springboard-Data-Science-Immersive\\Capstone 1 Project\\Data\\Ripple_CNBC.csv')

 35%|██████████████▉                            | 9/26 [00:10<00:20,  1.21s/it]

Could Not Process: 
https://www.cnbc.com/2017/07/21/ripples-xrp-digital-currency-rose-3977-percent-in-the-first-half-of-2017.html


 62%|█████████████████████████▊                | 16/26 [00:16<00:10,  1.06s/it]

Could Not Process: 
https://www.cnbc.com/2017/09/11/ripple-ceo-brad-garlinghouse-on-bitcoin-and-xrp.html


100%|██████████████████████████████████████████| 26/26 [00:25<00:00,  1.02it/s]


Done Writing File
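
Two CNBC articles could not be processed, and the output only shows their URLs, not the reason. A minimal sketch of one way to keep the failures together with their error messages follows. `process_all` and `fake_process` are hypothetical names, and the stand-in function replaces the real scraping so the example runs offline.

```python
def process_all(urls, process):
    """Apply `process` to each URL; return (results, failures)."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(process(url))
        except Exception as err:
            # Keep the URL and a short description of what went wrong
            failures.append((url, '{}: {}'.format(type(err).__name__, err)))
    return results, failures


def fake_process(url):
    # Stand-in for the real scraping step
    if 'bad' in url:
        raise ValueError('no date found')
    return url.upper()


results, failures = process_all(['a', 'bad-b', 'c'], fake_process)
print(results)
print(failures)
```

The `failures` list can then be inspected or retried after the run instead of being read off the progress output.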


# Create CSV of all Coindesk.com articles

In [13]:
scrape_articles(urls_Coindesk, 'Ripple', path = 'C:\\Users\\simskel\\Desktop\\Springboard-Data-Science-Immersive\\Capstone 1 Project\\Data\\Ripple_coindesk.csv')

100%|██████████████████████████████████████████| 47/47 [00:27<00:00,  1.73it/s]


Done Writing File


# Create CSV of all Cointelegraph.com articles

In [14]:
scrape_articles(urls_Cointelegraph, 'Ripple', website = 'cointelegraph', path = 'C:\\Users\\simskel\\Desktop\\Springboard-Data-Science-Immersive\\Capstone 1 Project\\Data\\Ripple_cointelegraph.csv')

100%|██████████████████████████████████████████| 88/88 [02:11<00:00,  1.50s/it]


Done Writing File


# Create CSV of all Bitcoin.com articles

In [15]:
scrape_articles(urls_Bitcoin, 'Ripple', website = 'bitcoin', path = 'C:\\Users\\simskel\\Desktop\\Springboard-Data-Science-Immersive\\Capstone 1 Project\\Data\\Ripple_bitcoin.csv')

100%|██████████████████████████████████████████| 16/16 [00:15<00:00,  1.03it/s]


Done Writing File
