## Web Scrapping News Articles
ref: https://towardsdatascience.com/the-easy-way-to-web-scrape-articles-online-d28947fc5979

In [None]:
#pip install newspaper3k

In [2]:
import newspaper
from newspaper import Article

In [5]:
import newspaper
from newspaper import Article
#The Basics of downloading the article to memory
article = Article("https://www.theglobeandmail.com/world/us-politics/article-us-election-live-updates-biden-takes-narrow-lead-in-georgia-trump/")
article.download()
article.parse()
article.nlp()
# To print out the full text
print(article.text)
# To print out a summary of the text
# This works, because newspaper3k has built in NLP tools
print(article.summary)

The latest:

Updated vote tally margins as counting continues in these swing states: Arizona: Joe Biden leads by 39,769 votes | Nevada: Biden leads by 20,137 votes | Pennsylvania: Biden leads by 14,716 votes | Georgia: Biden leads by 4,235 votes | North Carolina: Donald Trump leads by 76,737 votes

Joe Biden leads by 39,769 votes | Biden leads by 20,137 votes | Biden leads by 14,716 votes | Biden leads by 4,235 votes | Donald Trump leads by 76,737 votes Joe Biden is expected to give a prime-time speech to Americans tonight

Philadelphia police probe alleged plot to attack vote counting venue

Live U.S. election results map: Watch Donald Trump and Joe Biden’s presidential battle, state by state

5:33 p.m. ET

Biden, Harris to speak to Americans in prime-time TV address

Democratic presidential nominee Joe Biden is expected to deliver remarks Friday along with running mate Kamala Harris.

Biden has scheduled a prime-time address on the presidential contest as votes continue to be counted

In [6]:
# To print out the list of authors
print(article.authors)
# To print out the list of keywords
print(article.keywords)
#Other functions to gather the other useful bits of meta data in an article
article.title # Gives the title
article.publish_date #Gives the date the article was published
article.top_image # Gives the link to the main image of the article
article.images # Provides a set of image links

['Follow Us On Twitter']
['speech', 'holds', 'biden', 'votes', 'pennsylvania', 'donald', 'president', 'updates', 'live', 'states', 'narrow', 'lead', 'trump', 'tonight', 'key', 'state', 'ballots', 'leads', 'election']


{'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7',
 'https://www.theglobeandmail.com/resizer/e6j__inVZZfxIS1U60UqkdthR2g=/1200x0/filters:quality(80)/cloudfront-us-east-1.images.arcpublishing.com/tgam/2ABR5T6L2JKJTEH7LGQVVBNHUY.JPG'}

## Advanced: Downloading Multiple Articles from a Single Site
Now, lets try downloading multiple articles from the same site and load of all them into a pandas dataframe.

In [12]:
import newspaper
from newspaper import Article
from newspaper import Source
import pandas as pd
# Let's say we wanted to download articles from theglobeandmail (which is a web site that discusses vid
scinews = newspaper.build("https://www.sciencenews.org/", memoize_articles = False)
# I set memoize_articles to False, because I don't want it to cache and save the articles to memo
# Fresh run, everytime we run execute this script essentially
final_df = pd.DataFrame()
for each_article in scinews.articles:
    each_article.download()
    each_article.parse()
    each_article.nlp()
    
    
    
#Creating a temp df and saving the final output to a .csv file    
    temp_df = pd.DataFrame(columns = ['Title', 'Authors', 'Text',
'Summary', 'published_date', 'Source'])
temp_df['Authors'] = each_article.authors
temp_df['Title'] = each_article.title
temp_df['Text'] = each_article.text
temp_df['Summary'] = each_article.summary
temp_df['published_date'] = each_article.publish_date
temp_df['Source'] = each_article.source_url
news_df = news_df.append(temp_df, ignore_index = True)
# From here you can export this Pandas DataFrame to a csv file
news_df.to_csv('my_scraped_articles.csv')

In [13]:
#Check the newly created news df
news_df.head()

Unnamed: 0,Title,Authors,Text,Summary,published_date,Source
0,What I actually learned about my family after ...,Tina Hesman Saey,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections
1,What I actually learned about my family after ...,Laura Sanders,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections
2,What I actually learned about my family after ...,Christie Wilcox,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections
3,What I actually learned about my family after ...,Carolyn Gramling,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections
4,What I actually learned about my family after ...,Bruce Bower,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections


In [15]:
#congrats, you just scrapped 10 articles
news_df.shape

(10, 6)

## Multi-threading Web Scraping
If you just ran the previous block of code, you would've noticed that it was time consuming. Multi-threading technologies help us to hasten the process.


In [16]:
import newspaper
from newspaper import Article
from newspaper import Source
from newspaper import news_pool
import pandas as pd

In [17]:
# The various News Sources we will like to web scrape from
sci_news = newspaper.build('https://www.sciencenews.org/', memoize_articles=False)
bbc = newspaper.build("https://www.bbc.com/news", memoize_articles=False)
# Place the sources in a list
papers = [sci_news, bbc]

In [18]:
# Essentially you will be downloading 4 articles parallely per source.
# Since we have two sources, that means 8 articles are downloaded at any one time.

# Greatly speeding up the processes.
# Once downloaded it will be stored in memory to be used in the for loop below
# to extract the bits of data we want.
news_pool.set(papers, threads_per_source=4)
news_pool.join()

In [24]:
# Create our final dataframe
final_df = pd.DataFrame()
# Create a download limit per sources
# NOTE: You may not want to use a limit
limit = 100
# temporary lists to store each element we want to extract
for source in papers:
    list_title = []
    list_text = []
    list_source =[]
    count = 0
    
    for article_extract in source.articles:
        article_extract.parse()
         # Lets have a limit, so it doesnt take too long when you're
    # running the code. NOTE: You may not want to use a limit
# Appending the elements we want to extract
        if count > limit:
            break 
            list_title.append(article_extract.title)
            list_text.append(article_extract.text)
            list_source.append(article_extract.source_url)
            count +=1 # Update count
            temp_df = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
# Append to the final DataFrame
        final_df = final_df.append(temp_df, ignore_index = True)
# From here you can export this to csv file
        final_df.to_csv('my_scraped_articles.csv')

In [25]:
final_df.head()

Unnamed: 0,Title,Authors,Text,Summary,published_date,Source
0,What I actually learned about my family after ...,Tina Hesman Saey,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections
1,What I actually learned about my family after ...,Laura Sanders,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections
2,What I actually learned about my family after ...,Christie Wilcox,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections
3,What I actually learned about my family after ...,Carolyn Gramling,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections
4,What I actually learned about my family after ...,Bruce Bower,Commercials abound for DNA testing services th...,"I sent my DNA to Living DNA, Family Tree DNA, ...",2018-06-13 18:41:40-04:00,https://www.sciencenews.org/collections


In [26]:
#we managed to scrape 3430 articles from 2 different news sites
final_df.shape

(3430, 6)