# Scrape and Summarize News Articles using Python

author: Ijeoma Odoko

categories: [python, nlp, scrape_website, jupyter]


## About 

This project scrapes and summarizes news articles using the nltk and newspaper3k Python libraries.

We will download and summarize the articles listed under the news headlines for a particular ticker symbol, in this case 'NDX' (the NASDAQ-100 index), from the [Nasdaq website](https://www.nasdaq.com/).

To go to the URL for a specific ticker, replace the **ticker value** in the URL below with your ticker of choice.

http://www.nasdaq.com/symbol/**ndx**/news-headlines
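
The substitution described above can be sketched as a small helper. This is only an illustration: `headlines_url` is a hypothetical name, not part of any library, and it assumes the URL pattern shown above and a lowercase ticker, as in the `ndx` example.

```python
def headlines_url(ticker: str) -> str:
    """Build the news-headlines URL for a ticker, using the pattern shown above."""
    # Assumption: the ticker goes into the URL in lowercase, as in the 'ndx' example.
    return f"http://www.nasdaq.com/symbol/{ticker.strip().lower()}/news-headlines"


print(headlines_url("NDX"))
# http://www.nasdaq.com/symbol/ndx/news-headlines
```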


## Install Required Libraries

In [1]:
# install required libraries 

!pip install nltk
!pip install newspaper3k


In [2]:
import nltk
import newspaper
import datetime
import pandas as pd


## Load the newspaper articles

In [3]:
# build a news source from the NDX headlines page
ndx_news = newspaper.build('https://www.nasdaq.com/market-activity/index/ndx/news-headlines')

# print the URL of every article found
for article in ndx_news.articles:
    print(article.url)

https://www.nasdaq.com/articles/nasdaqs-johan-toll-wins-fintech-person-year-2018-05-16
https://www.nasdaq.com/articles/adena-friedman%3A-esg-ai-cryptocurrency-in-focus-at-davos-2019-01-21
https://www.nasdaq.com/article/invesco-qqq-celebrates-20-years-of-curating-innovation-cm1112647?utm_source=Nasdaq_LINKEDIN_COMPANY_2185074219&utm_medium=QQQ%20anniversary%20__%20%C2%A0
https://www.nasdaq.com/solutions/midpoint-extended-life-order-m-elo
https://www.nasdaq.com/articles/weekly-preview%3A-stocks-to-watch-amzn-intc-msft-nflx-tsla-2020-10-18
https://www.nasdaq.com/articles/3-high-yield-stocks-at-rock-bottom-prices-2020-10-16
https://www.nasdaq.com/articles/making-a-portfolio-election-proof-2020-10-12
https://www.nasdaq.com/articles/getting-in-the-game-with-a-high-flying-video-game-etf-2020-10-16
https://www.nasdaq.com/articles/canaccord-predicts-over-100-rally-for-these-3-strong-buy-stocks-2020-10-16
https://www.nasdaq.com/videos/tradetalks%3A-what-trends-will-dominate-the-next-decade
https

In [4]:
# extract source categories
for category in ndx_news.category_urls():
  print(category)

https://www.nasdaq.com/mediakit
https://www.nasdaq.com/market-activity/index/ndx/news-headlines
https://www.nasdaq.com/solutions
https://www.nasdaq.com
https://www.nasdaq.com/videos
http://ir.nasdaq.com
https://www.nasdaq.com/marketsite
https://www.nasdaq.com/public-policy
https://www.nasdaq.com/ESG-Guide
https://www.nasdaq.com/GlobalIndexes
https://www.nasdaq.com/trust-center
https://portfolio.nasdaq.com
https://www.nasdaq.com/symbol
https://www.nasdaq.com/TotalMarkets


In [5]:
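ndx_article = ndx_news.articles[0]  # first article found on the page
ndx_article.download()              # fetch the HTML
ndx_article.parse()                 # extract title, authors, text, and publish date
nltk.download('punkt')              # tokenizer data required by nlp()
ndx_article.nlp()                   # compute the summary and keywords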

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [6]:
ndx_article.summary

'Johan Toll, head of Blockchain Product Management within Nasdaq’s Market Technology business, has been awarded the FinTech Person of the Year from the Financial Technology Forum (FTF) News Technology Innovation Awards for his expertise and developments on blockchain integration with the Nasdaq Financial Framework, as well as other areas of Nasdaq.\nThe framework consists of a single operational core that ties together the Nasdaq’s technology portfolio that has proven business functionality across the trade lifecycle.\nSince joining Nasdaq in 2007, Johan has focused on many areas of the trade lifecycle.\nWe sat down with Johan to learn more about the blockchain initiatives at Nasdaq and where he sees the technology transforming in the future.\nTo learn more about Nasdaq Market Technology initiatives with blockchain, please read: Blockchain Forges AheadTo learn more about the Nasdaq Financial Framework, please visit that section of our website.'

In [7]:
ndx_article.keywords

['wins',
 'business',
 'framework',
 'blockchain',
 'nasdaqs',
 'trade',
 'world',
 'johan',
 'markets',
 'technology',
 'nasdaq',
 'financial',
 'person',
 'toll',
 'fintech']

In [8]:
ndx_article.publish_date

datetime.datetime(2018, 5, 16, 0, 0)

In [9]:
ndx_article.authors

[]

In [10]:
ndx_article.title

"Nasdaq's Johan Toll Wins FinTech Person of the Year"

## Create Dataframe for the articles 



In [11]:
# one list per dataframe column
url = []
keywords = []
title = []
published = []
summary = []
authors = []
videos = []


# download, parse, and summarize each article, collecting one value per column
for article in ndx_news.articles:
    url.append(article.url)
    article.download()
    article.parse()
    article.nlp()
    keywords.append(article.keywords)
    title.append(article.title)
    published.append(article.publish_date)
    summary.append(article.summary)
    authors.append(article.authors)
    videos.append(article.movies)
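
If any step raises an exception for a single article, the loop above stops before the remaining articles are processed. One possible variation, sketched below, skips failing articles and keeps one record per successful article. The names `collect_records` and `StubArticle` are hypothetical; the stub only stands in for a newspaper article so the sketch can run without network access, and catching every `Exception` is an assumption made for brevity.

```python
def collect_records(articles):
    """Return (records, failed_urls) for an iterable of newspaper-style articles."""
    records, failed = [], []
    for article in articles:
        try:
            article.download()
            article.parse()
            article.nlp()
        except Exception:  # assumption: any failure means "skip this article"
            failed.append(article.url)
            continue
        records.append({
            "url": article.url,
            "title": article.title,
            "published": article.publish_date,
            "authors": article.authors,
            "summary": article.summary,
            "keywords": article.keywords,
            "videos": article.movies,
        })
    return records, failed


class StubArticle:
    """Hypothetical stand-in for a newspaper article, used only to run the sketch offline."""

    def __init__(self, url, broken=False):
        self.url, self.broken = url, broken
        self.title = self.summary = ""
        self.publish_date = None
        self.authors, self.keywords, self.movies = [], [], []

    def download(self):
        if self.broken:
            raise RuntimeError("download failed")

    def parse(self):
        pass

    def nlp(self):
        pass


records, failed = collect_records([
    StubArticle("https://example.com/a"),
    StubArticle("https://example.com/b", broken=True),
])
print(len(records), failed)
```

With the real source, the call would be `collect_records(ndx_news.articles)`, and the resulting list of dictionaries can be passed directly to `pd.DataFrame`.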
   

 

In [12]:
type(url)

list

In [13]:
len(url)

62

In [14]:
# convert the lists to dataframe columns
data = [url, title, published, authors, summary, keywords, videos] # create list of lists

df = pd.DataFrame(data).transpose()  #transpose to get 7 columns instead of 62

df.columns=['url', 'title', 'published', 'authors', 'summary', 'keywords', 'videos']  # name the columns

df

| | url | title | published | authors | summary | keywords | videos |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://www.nasdaq.com/articles/nasdaqs-johan-... | Nasdaq's Johan Toll Wins FinTech Person of the... | 2018-05-16 | [] | Johan Toll, head of Blockchain Product Managem... | [wins, business, framework, blockchain, nasdaq... | [] |
| 1 | https://www.nasdaq.com/articles/adena-friedman... | Adena Friedman: ESG, AI, Cryptocurrency In Foc... | 2019-01-21 | [] | The world economy is growing more slowly than ... | [focus, economic, cryptocurrency, companies, d... | [] |
| 2 | https://www.nasdaq.com/article/invesco-qqq-cel... | Invesco QQQ Celebrates 20 Years of Curating In... | NaT | [] | Invesco recognizes leading ETF’s long-standing... | [funds, invesco, celebrates, rank, growth, 31,... | [] |
| 3 | https://www.nasdaq.com/solutions/midpoint-exte... | Midpoint Extended Life Order (M-ELO) | NaT | [] | See M-ELO in action and learn moreNasdaq's EVP... | [services, life, nasdaqs, morenasdaqs, extende... | [] |
| 4 | https://www.nasdaq.com/articles/weekly-preview... | Weekly Preview: Stocks To Watch (AMZN, INTC, M... | 2020-10-18 | [] | We will also get results from Microsoft (MSFT)... | [tesla, street, preview, watch, revenue, billi... | [] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 57 | https://www.nasdaq.com/articles/what-taiwan-se... | What Taiwan Semiconductor Earnings Indicate Ab... | 2020-10-16 | [Sejuti Banerjea] | Taiwan Semiconductor Manufacturing Company TSM... | [growth, tech, players, zacks, taiwan, expecte... | [] |
| 58 | https://www.nasdaq.com/articles/enbridge-enb-o... | Enbridge (ENB) Outpaces Stock Market Gains: Wh... | 2020-10-16 | [Zacks Equity Research] | In the latest trading session, Enbridge (ENB) ... | [gains, rank, enbridge, zacks, changes, stocks... | [] |
| 59 | https://www.nasdaq.com/articles/dominion-energ... | Dominion Energy (D) Outpaces Stock Market Gain... | 2020-10-16 | [Zacks Equity Research] | Dominion Energy (D) closed at $81.41 in the la... | [gains, rank, latest, zacks, d, estimate, chan... | [] |
| 60 | https://www.nasdaq.com/articles/united-parcel-... | United Parcel Service (UPS) Outpaces Stock Mar... | 2020-10-16 | [Zacks Equity Research] | In the latest trading session, United Parcel S... | [gains, rank, zacks, united, parcel, estimate,... | [] |
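
Because `data` is a list of 7 lists, `pd.DataFrame(data)` starts out with 7 rows and one column per article, which is why the transpose is needed. The same reshaping can be sketched in plain Python with `zip`, which groups the i-th element of every list into one row. The values below are short placeholders, not the scraped data.

```python
# Toy stand-ins for three of the column lists (placeholders, not the scraped data).
url = ["u0", "u1", "u2", "u3"]
title = ["t0", "t1", "t2", "t3"]
published = ["2018-05-16", "2019-01-21", None, None]

data = [url, title, published]  # 3 lists of 4 values: one list per column
rows = list(zip(*data))         # transposed: one tuple per article

print(rows[0])                  # ('u0', 't0', '2018-05-16')
print(len(rows), len(rows[0]))  # 4 3
```

A list of row tuples like `rows` can also be passed to `pd.DataFrame` together with `columns=[...]`, which avoids the explicit transpose.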


## Inspect the data 

In [15]:
print(df.loc[0,'summary'])

Johan Toll, head of Blockchain Product Management within Nasdaq’s Market Technology business, has been awarded the FinTech Person of the Year from the Financial Technology Forum (FTF) News Technology Innovation Awards for his expertise and developments on blockchain integration with the Nasdaq Financial Framework, as well as other areas of Nasdaq.
The framework consists of a single operational core that ties together the Nasdaq’s technology portfolio that has proven business functionality across the trade lifecycle.
Since joining Nasdaq in 2007, Johan has focused on many areas of the trade lifecycle.
We sat down with Johan to learn more about the blockchain initiatives at Nasdaq and where he sees the technology transforming in the future.
To learn more about Nasdaq Market Technology initiatives with blockchain, please read: Blockchain Forges AheadTo learn more about the Nasdaq Financial Framework, please visit that section of our website.


In [16]:
for i in df.columns: 
  print(df.loc[3, i])

https://www.nasdaq.com/solutions/midpoint-extended-life-order-m-elo
Midpoint Extended Life Order (M-ELO)
NaT
[]
See M-ELO in action and learn moreNasdaq's EVP of North American Market Services, Tal Cohen, is looking to engage the industry on key public policy, market structure initiatives and Nasdaq's commitment to providing clients and the investing public with an efficient and effective marketplace.
['services', 'life', 'nasdaqs', 'morenasdaqs', 'extended', 'midpoint', 'structure', 'order', 'tal', 'melo', 'providing', 'market', 'policy', 'north', 'public']
[]


In [17]:
# check the dataframe info

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   url        62 non-null     object        
 1   title      62 non-null     object        
 2   published  28 non-null     datetime64[ns]
 3   authors    62 non-null     object        
 4   summary    62 non-null     object        
 5   keywords   62 non-null     object        
 6   videos     62 non-null     object        
dtypes: datetime64[ns](1), object(6)
memory usage: 3.5+ KB


In [18]:
df['published'].unique()

array(['2018-05-16T00:00:00.000000000', '2019-01-21T00:00:00.000000000',
                                 'NaT', '2020-10-18T00:00:00.000000000',
       '2020-10-16T00:00:00.000000000', '2020-10-12T00:00:00.000000000',
       '2020-10-09T00:00:00.000000000', '2020-10-13T00:00:00.000000000',
       '2020-10-14T00:00:00.000000000', '2020-10-17T00:00:00.000000000'],
      dtype='datetime64[ns]')

In [19]:
# check url for rows with no dates downloaded into dataframe
df_null = df[df['published'].isnull()]

df_null['url']

2     https://www.nasdaq.com/article/invesco-qqq-cel...
3     https://www.nasdaq.com/solutions/midpoint-exte...
9     https://www.nasdaq.com/videos/tradetalks%3A-wh...
10    https://www.nasdaq.com/videos/can-stock-pickin...
11    https://www.nasdaq.com/videos/tradetalks%3A-th...
12    https://www.nasdaq.com/videos/tradetalks%3A-ho...
13    https://www.nasdaq.com/videos/tradetalks%3A-tr...
14    https://www.nasdaq.com/videos/tradetalks%3A-ex...
15    https://www.nasdaq.com/videos/tradetalks%3A-wh...
16    https://www.nasdaq.com/videos/tradetalks%3A-op...
17    https://www.nasdaq.com/videos/tradetalks%3A-th...
18    https://www.nasdaq.com/videos/investing-strate...
19    https://www.nasdaq.com/videos/investing-strate...
20    https://www.nasdaq.com/videos/investing-strate...
21    https://www.nasdaq.com/videos/investing-strate...
22    https://www.nasdaq.com/videos/investing-strate...
23    https://www.nasdaq.com/videos/investing-strate...
24    https://www.nasdaq.com/videos/investing-st
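
Most of the URLs without a publish date shown above point to `/videos/` or `/solutions/` pages rather than news articles. A minimal sketch for grouping URLs by their first path segment follows; `site_section` is a hypothetical helper name, and the sample URLs are copied from the outputs earlier in this post.

```python
from collections import Counter
from urllib.parse import urlparse


def site_section(url: str) -> str:
    """Return the first path segment of a URL, e.g. 'videos' or 'articles'."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else ""


# URLs copied from the outputs earlier in this post.
sample = [
    "https://www.nasdaq.com/articles/making-a-portfolio-election-proof-2020-10-12",
    "https://www.nasdaq.com/solutions/midpoint-extended-life-order-m-elo",
    "https://www.nasdaq.com/videos/tradetalks%3A-what-trends-will-dominate-the-next-decade",
]

# Count how many URLs fall into each section of the site.
print(Counter(site_section(u) for u in sample))
```

On the dataframe, `df['url'].map(site_section)` would give a column that can be used to keep only the `articles` rows.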

## Download Dataframe to a CSV file

In [22]:
# keep only rows with a publish date, sorted with the most recent at the top
df_dates = df.dropna(subset=['published']).sort_values(by='published', ascending=False)
df_dates

| | url | title | published | authors | summary | keywords | videos |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | https://www.nasdaq.com/articles/weekly-preview... | Weekly Preview: Stocks To Watch (AMZN, INTC, M... | 2020-10-18 | [] | We will also get results from Microsoft (MSFT)... | [tesla, street, preview, watch, revenue, billi... | [] |
| 55 | https://www.nasdaq.com/articles/gol-american-a... | GOL, American Airlines' Codeshare Pact to Boos... | 2020-10-17 | [Zacks Equity Research] | In a customer-friendly move, American Airlines... | [rank, connectivity, group, deal, zacks, codes... | [] |
| 61 | https://www.nasdaq.com/articles/anaplan-plan-o... | Anaplan (PLAN) Outpaces Stock Market Gains: Wh... | 2020-10-16 | [Zacks Equity Research] | In the latest trading session, Anaplan (PLAN) ... | [gains, rank, plan, latest, zacks, estimate, s... | [] |
| 57 | https://www.nasdaq.com/articles/what-taiwan-se... | What Taiwan Semiconductor Earnings Indicate Ab... | 2020-10-16 | [Sejuti Banerjea] | Taiwan Semiconductor Manufacturing Company TSM... | [growth, tech, players, zacks, taiwan, expecte... | [] |
| 52 | https://www.nasdaq.com/articles/3-biotech-stoc... | 3 Biotech Stocks to Buy Right Now | 2020-10-16 | [Chris Tyler] | Still, if there’s one sector with stocks to bu... | [right, buy, shares, stocks, nasdaq, stock, bi... | [] |
| 53 | https://www.nasdaq.com/articles/buy-the-next-d... | Buy the Next Dip in Red-Hot Nvidia Stock | 2020-10-16 | [Nicolas Chahine] | InvestorPlace - Stock Market News, Stock Advic... | [buy, nvda, future, dip, redhot, shares, amd, ... | [] |
| 45 | https://www.nasdaq.com/articles/stay-at-home-s... | 'Stay at Home' Stocks Can Still Thrive Even Wh... | 2020-10-16 | [] | In most cases, there will be a correction to t... | [likely, growth, demand, continue, disease, re... | [] |
| 54 | https://www.nasdaq.com/articles/get-ready-to-b... | Get Ready To Bull Trade Novavax Stock For A Se... | 2020-10-16 | [Chris Tyler] | InvestorPlace - Stock Market News, Stock Advic... | [novavax, ready, vaccine, trade, second, stock... | [] |
| 56 | https://www.nasdaq.com/articles/this-decades-s... | This Decade’s Stock Leaders | 2020-10-16 | [Jeff Remsburg] | InvestorPlace - Stock Market News, Stock Advic... | [investor, small, decades, leaders, stocks, su... | [] |
| 58 | https://www.nasdaq.com/articles/enbridge-enb-o... | Enbridge (ENB) Outpaces Stock Market Gains: Wh... | 2020-10-16 | [Zacks Equity Research] | In the latest trading session, Enbridge (ENB) ... | [gains, rank, enbridge, zacks, changes, stocks... | [] |


In [24]:
# save to a CSV file, then download it (files.download works only in Google Colab)
from google.colab import files
df_dates.to_csv('NDX_news_headlines_accessed_2020-10-18.csv') 
files.download('NDX_news_headlines_accessed_2020-10-18.csv')
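
The file name above hard-codes the ticker and the access date. A small sketch for generating the same naming scheme is shown below; `csv_filename` is a hypothetical helper, and the pattern is simply inferred from the file name used in this post.

```python
import datetime


def csv_filename(ticker: str, accessed: datetime.date) -> str:
    """Build a file name like the one above from a ticker and an access date."""
    return f"{ticker.upper()}_news_headlines_accessed_{accessed.isoformat()}.csv"


print(csv_filename("ndx", datetime.date(2020, 10, 18)))
# NDX_news_headlines_accessed_2020-10-18.csv
```

Passing `datetime.date.today()` stamps the file with the current date. Outside Colab, `df_dates.to_csv(...)` on its own writes the file to the working directory.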


## References

1. [NDX News Headlines](https://www.nasdaq.com/market-activity/index/ndx/news-headlines) accessed 18-October-2020.
2. [Quickstart Guide - Newspaper](https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#extracting-articles) accessed 18-October-2020.
3. [Scraping articles about stocks](http://theautomatic.net/2017/08/24/scraping-articles-about-stocks/) accessed 18-October-2020.
4. [Scrape and Summarize News Articles by Computer Science](https://www.youtube.com/watch?v=YzMA2O_v5co&t=451s) accessed 18-October-2020.
