# "Scrape and Summarize News Articles using Python"

> "Original post created 18-October-2020. Updated 23-October-2020."

- toc: false
- branch: master
- badges: true
- comments: true
- author: Ijeoma Odoko
- categories: [python, nlp, scrape_website]

## About 

This project scrapes and summarizes news articles using the nltk and newspaper3k Python libraries.

We will try to download a summary of every article listed under the news headlines for a particular ticker symbol, in this case 'NDX' (the NASDAQ-100 index), from the [Nasdaq website](https://www.nasdaq.com/).

To get the URL for a different ticker, replace the **ticker value** in the address below with your ticker of choice.

http://www.nasdaq.com/symbol/**ndx**/news-headlines
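
The substitution can be sketched as a small helper. This assumes the URL pattern shown above still resolves; the name `headlines_url` is hypothetical and not part of any library.

```python
def headlines_url(ticker: str) -> str:
    """Build the news-headlines URL for a ticker, following the pattern above."""
    # Assumption: the site expects the ticker in lower case, as in the 'ndx' example.
    return f"http://www.nasdaq.com/symbol/{ticker.strip().lower()}/news-headlines"


print(headlines_url("NDX"))
# http://www.nasdaq.com/symbol/ndx/news-headlines
```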


## Install Required Libraries

In [None]:
# install required libraries 

!pip install nltk
!pip install newspaper3k

Collecting newspaper3k
  Downloading https://files.pythonhosted.org/packages/d7/b9/51afecb35bb61b188a4b44868001de348a0e8134b4dfa00ffc191567c4b9/newspaper3k-0.2.8-py3-none-any.whl (211kB)
Collecting tinysegmenter==0.3
  Downloading https://files.pythonhosted.org/packages/17/82/86982e4b6d16e4febc79c2a1d68ee3b707e8a020c5d2bc4af8052d0f136a/tinysegmenter-0.3.tar.gz
Collecting tldextract>=2.0.1
  Downloading https://files.pythonhosted.org/packages/5d/c3/f4e90ae5b7dd02257c3b9dfb6747aba0b8a9c788f5401700fda055e814fc/tldextract-3.0.1-py2.py3-none-any.whl (86kB)
Collecting feedfinder2>=0.0.4
  Downloading https://files.pythonhosted.org/packages/35/82/1251fefec3bb4b03fd966c7e7f7a41c9fc2bb00d823a34c13f847fd61406/feedfinder2-0.0.4.tar.gz
Collecting jieba3k>=0.35.1
  Downloading https://files.pythonhosted.org/packages/a9/cb/2c8332bcdc14d33b0bedd18ae0a4981a06

In [None]:
import nltk
import newspaper
import datetime
import pandas as pd


## Load the newspaper articles

In [None]:
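# build a news source from the NDX news-headlines page
ndx_news = newspaper.build('https://www.nasdaq.com/market-activity/index/ndx/news-headlines')

# print the url of every article found on the page
for article in ndx_news.articles:
    print(article.url)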

http://ir.nasdaq.com/Income-Statement-Trend-Summary-and-GAAP-to-Non-GAAP-Reconciliation
http://ir.nasdaq.com/events/event-details/barclays-global-financial-services-conference-2020
http://ir.nasdaq.com/events/event-details/ubs-virtual-financial-services-conference-2020
https://www.nasdaq.com/videos/tradetalks%3A-what-trends-will-dominate-the-next-decade
https://www.nasdaq.com/videos/can-stock-picking-ever-be-fun-again
https://www.nasdaq.com/videos/tradetalks%3A-how-will-2020-impact-investment-portfolios
https://www.nasdaq.com/videos/tradetalks%3A-the-toys-for-tots-campaign-for-holiday-2020
https://www.nasdaq.com/videos/tradetalks%3A-the-launch-of-jaaa-and-democratizing-access-to-clos
https://www.nasdaq.com/videos/tradetalks%3A-what-paypals-cryptocurrency-means-for-growth-in-the-space
https://www.nasdaq.com/videos/tradetalks%3A-how-do-franchise-businesses-properly-open-while-protecting-staff-and-customers
https://www.nasdaq.com/videos/tradetalks%3A-volatility-around-elections-is-normal
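
The list above mixes investor-relations pages and videos with news articles. In the later cells, many of those pages come back without a publish date, and the investor-relations pages come back without a summary. One possible way to keep only article pages is to filter on the URL path before downloading. This is a sketch, not part of the original notebook: the helper `is_news_article` is hypothetical, and the assumption that news stories live under `/articles/` or `/article/` is inferred from the URLs printed in this post.

```python
from urllib.parse import urlparse


def is_news_article(url: str) -> bool:
    """Return True when the first path segment looks like a news-article section."""
    # Assumption: article pages live under /articles/ or /article/ on www.nasdaq.com.
    parsed = urlparse(url)
    segments = [part for part in parsed.path.split("/") if part]
    return (
        parsed.netloc == "www.nasdaq.com"
        and bool(segments)
        and segments[0] in {"articles", "article"}
    )


sample_urls = [
    "http://ir.nasdaq.com/events/event-details/ubs-virtual-financial-services-conference-2020",
    "https://www.nasdaq.com/videos/can-stock-picking-ever-be-fun-again",
    "https://www.nasdaq.com/articles/example-story",  # hypothetical URL for illustration
]

# keep only the URLs that look like news articles
print([url for url in sample_urls if is_news_article(url)])
```

With the notebook's objects, the same filter would read `[a for a in ndx_news.articles if is_news_article(a.url)]`.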


In [None]:
# extract source categories
for category in ndx_news.category_urls():
  print(category)

https://www.nasdaq.com/market-activity/index/ndx/news-headlines
https://www.nasdaq.com/mediakit
https://www.nasdaq.com/GlobalIndexes
https://www.nasdaq.com/TotalMarkets
https://www.nasdaq.com/marketsite
http://ir.nasdaq.com
https://www.nasdaq.com/public-policy
https://www.nasdaq.com/videos
https://www.nasdaq.com
https://www.nasdaq.com/symbol
https://www.nasdaq.com/solutions
https://portfolio.nasdaq.com
https://www.nasdaq.com/trust-center
https://www.nasdaq.com/ESG-Guide


In [None]:
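# take the first article, then download and parse it
ndx_article = ndx_news.articles[0]
ndx_article.download()
ndx_article.parse()
# nlp() relies on the nltk punkt tokenizer, so download it first
nltk.download('punkt')
ndx_article.nlp()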

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
ndx_article.summary

''

In [None]:
ndx_article.keywords

['nongaap',
 'trend',
 'income',
 'gaap',
 'summary',
 'reconciliation',
 'statement']

In [None]:
ndx_article.publish_date

In [None]:
ndx_article.authors

[]

In [None]:
ndx_article.title

'Income Statement Trend Summary and GAAP to Non-GAAP Reconciliation'

## Create Dataframe for the articles 



In [None]:
# one list per dataframe column
url = []
keywords = []
title = []
published = []
summary = []
authors = []
videos = []


for article in ndx_news.articles:
    url.append(article.url)
    # download and parse the page, then run nlp() to get the summary and keywords
    article.download()
    article.parse()
    article.nlp()
    keywords.append(article.keywords)
    title.append(article.title)
    published.append(article.publish_date)
    summary.append(article.summary)
    authors.append(article.authors)
    videos.append(article.movies)
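
The loop above stops at the first article that fails to download or parse. A more forgiving variant, sketched here under stated assumptions, skips failing articles and records why they failed. The names `collect_records` and `FakeArticle` are hypothetical. `FakeArticle` only stands in for a newspaper article so the sketch runs without network access, and catching a broad `Exception` is a simplification.

```python
def collect_records(articles):
    """Download, parse and summarize each article; skip the ones that fail."""
    records, failed = [], []
    for article in articles:
        try:
            article.download()
            article.parse()
            article.nlp()
        except Exception as exc:  # broad on purpose: network and parsing errors vary
            failed.append((article.url, str(exc)))
            continue
        records.append({
            "url": article.url,
            "title": article.title,
            "published": article.publish_date,
            "authors": article.authors,
            "summary": article.summary,
            "keywords": article.keywords,
            "videos": article.movies,
        })
    return records, failed


class FakeArticle:
    """Stand-in exposing the attributes used above (illustration only)."""

    def __init__(self, url, fail=False):
        self.url, self._fail = url, fail
        self.title = self.summary = ""
        self.publish_date = None
        self.authors, self.keywords, self.movies = [], [], []

    def download(self):
        if self._fail:
            raise RuntimeError("download failed")

    def parse(self):
        self.title = "Example title"

    def nlp(self):
        self.summary = "Example summary."
        self.keywords = ["example"]


records, failed = collect_records([FakeArticle("a"), FakeArticle("b", fail=True)])
print(len(records), failed)
```

In the notebook this would be called as `collect_records(ndx_news.articles)`. Passing the resulting list of dicts to `pd.DataFrame` gives the seven named columns directly.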
   

 

In [None]:
type(url)

list

In [None]:
len(url)

64

In [None]:
# convert the lists to dataframe columns
data = [url, title, published, authors, summary, keywords, videos] # create list of lists

df = pd.DataFrame(data).transpose()  # transpose to get 7 columns and one row per article, instead of 7 rows and one column per article

df.columns=['url', 'title', 'published', 'authors', 'summary', 'keywords', 'videos']  # name the columns

df

| | url | title | published | authors | summary | keywords | videos |
|---|---|---|---|---|---|---|---|
| 0 | http://ir.nasdaq.com/Income-Statement-Trend-Su... | Income Statement Trend Summary and GAAP to Non... | NaT | [] | | [nongaap, trend, income, gaap, summary, reconc... | [] |
| 1 | http://ir.nasdaq.com/events/event-details/barc... | Barclays Global Financial Services Conference ... | NaT | [] | | [financial, services, barclays, 2020, global, ... | [] |
| 2 | http://ir.nasdaq.com/events/event-details/ubs-... | UBS Virtual Financial Services Conference 2020 | NaT | [] | | [financial, services, ubs, virtual, 2020, conf... | [] |
| 3 | https://www.nasdaq.com/videos/tradetalks%3A-wh... | #TradeTalks: What Trends Will Dominate the Nex... | NaT | [] | Caleb Silver from Investopedia returns to #Tra... | [investopedia, silver, decade, tradetalks, ret... | [] |
| 4 | https://www.nasdaq.com/videos/can-stock-pickin... | Can Stock Picking Ever be Fun Again? | NaT | [] | Michael Holland, Holland & Co. founder, talks ... | [picking, surveillance, strategy, stock, bloom... | [] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 59 | https://www.nasdaq.com/articles/enbridge-enb-s... | Enbridge (ENB) Stock Sinks As Market Gains: Wh... | 2020-10-23 | [Zacks Equity Research] | In the latest trading session, Enbridge (ENB) ... | [sinks, enbridge, ratio, enb, stock, estimate,... | [] |
| 60 | https://www.nasdaq.com/articles/nasdaqs-johan-... | Nasdaq's Johan Toll Wins FinTech Person of the... | 2018-05-16 | [] | Johan Toll, head of Blockchain Product Managem... | [markets, johan, financial, nasdaq, toll, trad... | [] |
| 61 | https://www.nasdaq.com/articles/adena-friedman... | Adena Friedman: ESG, AI, Cryptocurrency In Foc... | 2019-01-21 | [] | The world economy is growing more slowly than ... | [focus, markets, public, economic, nasdaq, com... | [] |
| 62 | https://www.nasdaq.com/article/invesco-qqq-cel... | Invesco QQQ Celebrates 20 Years of Curating In... | NaT | [] | Invesco recognizes leading ETF’s long-standing... | [funds, dec, growth, 20, invesco, 2018, rank, ... | [] |
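
The transpose in the cell above is needed because `data` is a list of seven lists, which pandas reads as seven rows. The same reshaping can be sketched in plain Python with `zip`, using toy values rather than the scraped data:

```python
# Toy stand-ins for the scraped lists (three articles, two fields).
url = ["u1", "u2", "u3"]
title = ["t1", "t2", "t3"]

data = [url, title]        # one inner list per field: 2 rows x 3 columns
rows = list(zip(*data))    # transposed: one tuple per article, 3 rows x 2 columns

print(rows)
# [('u1', 't1'), ('u2', 't2'), ('u3', 't3')]
```

Passing a dict such as `{'url': url, 'title': title}` to `pd.DataFrame` produces the same one-row-per-article layout without a transpose.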


## Inspect the data 

In [None]:
print(df.loc[0,'summary'])




In [None]:
for i in df.columns: 
  print(df.loc[3, i])

https://www.nasdaq.com/videos/tradetalks%3A-what-trends-will-dominate-the-next-decade
#TradeTalks: What Trends Will Dominate the Next Decade?
NaT
[]
Caleb Silver from Investopedia returns to #TradeTalks!
He and Jill Malandrino go over what trends will dominate in the next decade.
['investopedia', 'silver', 'decade', 'tradetalks', 'returns', 'dominate', 'trends', 'malandrino', 'jill', 'caleb']
[]


In [None]:
# check the dataframe info

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   url        64 non-null     object        
 1   title      64 non-null     object        
 2   published  32 non-null     datetime64[ns]
 3   authors    64 non-null     object        
 4   summary    64 non-null     object        
 5   keywords   64 non-null     object        
 6   videos     64 non-null     object        
dtypes: datetime64[ns](1), object(6)
memory usage: 3.6+ KB


In [None]:
df['published'].unique()

array([                          'NaT', '2020-10-23T00:00:00.000000000',
       '2020-10-19T00:00:00.000000000', '2020-10-09T00:00:00.000000000',
       '2020-10-16T00:00:00.000000000', '2020-10-22T00:00:00.000000000',
       '2020-10-21T00:00:00.000000000', '2020-10-20T00:00:00.000000000',
       '2018-05-16T00:00:00.000000000', '2019-01-21T00:00:00.000000000'],
      dtype='datetime64[ns]')

In [None]:
# check url for rows with no dates downloaded into dataframe
df_null = df[df['published'].isnull()]

df_null['url']

0     http://ir.nasdaq.com/Income-Statement-Trend-Su...
1     http://ir.nasdaq.com/events/event-details/barc...
2     http://ir.nasdaq.com/events/event-details/ubs-...
3     https://www.nasdaq.com/videos/tradetalks%3A-wh...
4     https://www.nasdaq.com/videos/can-stock-pickin...
5     https://www.nasdaq.com/videos/tradetalks%3A-ho...
6     https://www.nasdaq.com/videos/tradetalks%3A-th...
7     https://www.nasdaq.com/videos/tradetalks%3A-th...
8     https://www.nasdaq.com/videos/tradetalks%3A-wh...
9     https://www.nasdaq.com/videos/tradetalks%3A-ho...
10    https://www.nasdaq.com/videos/tradetalks%3A-vo...
11    https://www.nasdaq.com/videos/tradetalks%3A-fi...
12    https://www.nasdaq.com/videos/investing-strate...
13    https://www.nasdaq.com/videos/investing-strate...
14    https://www.nasdaq.com/videos/investing-strate...
15    https://www.nasdaq.com/videos/investing-strate...
16    https://www.nasdaq.com/videos/investing-strate...
17    https://www.nasdaq.com/videos/investing-st

## Download Dataframe to a CSV file

In [None]:
# get df with dates, sorted with the most recent at the top:
# drop rows that did not load a published date, then sort by date published
df_dates = df.dropna(subset=['published']).sort_values(by='published', ascending=False)
df_dates

| | url | title | published | authors | summary | keywords | videos |
|---|---|---|---|---|---|---|---|
| 29 | https://www.nasdaq.com/articles/iphone-12-coul... | iPhone 12 Could Be a Tailwind for Apple’s Acce... | 2020-10-23 | [Marty Shtrubel] | Apple’s (AAPL) accessories business might make... | [aapl, tailwind, 12, opinions, feature, access... | [] |
| 56 | https://www.nasdaq.com/articles/dominion-energ... | Dominion Energy (D) Stock Sinks As Market Gain... | 2020-10-23 | [Zacks Equity Research] | Dominion Energy (D) closed at $81.14 in the la... | [sinks, energy, d, stock, ratio, gains, estima... | [] |
| 53 | https://www.nasdaq.com/articles/target-tgt-out... | Target (TGT) Outpaces Stock Market Gains: What... | 2020-10-23 | [Zacks Equity Research] | On that day, TGT is projected to report earnin... | [tgt, ratio, stock, estimate, gains, stocks, s... | [] |
| 52 | https://www.nasdaq.com/articles/bp-bp-stock-si... | BP (BP) Stock Sinks As Market Gains: What You ... | 2020-10-23 | [Zacks Equity Research] | In the latest trading session, BP (BP) closed ... | [sinks, reflect, revisions, stock, estimate, g... | [] |
| 47 | https://www.nasdaq.com/articles/wall-st-week-a... | Wall St Week Ahead-More U.S. companies offer e... | 2020-10-23 | [David Randall] | By David RandallNEW YORK, Oct 23 (Reuters) - W... | [sp, investors, far, week, offer, pandemic, re... | [] |
| 30 | https://www.nasdaq.com/articles/covid-19-is-dr... | COVID-19 Is Driving a Surge in E-Commerce Ad S... | 2020-10-23 | [Evan Niu] | That's also causing a related surge in ad spen... | [covid19, surge, fool, ad, revenue, ecommerce,... | [] |
| 42 | https://www.nasdaq.com/articles/is-the-market-... | Is the Market Reaction to Intel's (INTC) Earni... | 2020-10-23 | [] | It is not unusual for a stock to fall followin... | [stock, good, growth, revenue, intc, billion, ... | [] |
| 55 | https://www.nasdaq.com/articles/hanesbrands-hb... | HanesBrands (HBI) Stock Sinks As Market Gains:... | 2020-10-23 | [Zacks Equity Research] | In the latest trading session, HanesBrands (HB... | [sinks, trading, hbi, stock, stocks, gains, ra... | [] |
| 54 | https://www.nasdaq.com/articles/united-parcel-... | United Parcel Service (UPS) Stock Sinks As Mar... | 2020-10-23 | [Zacks Equity Research] | United Parcel Service (UPS) closed at $171.90 ... | [sinks, service, transportation, ratio, gained... | [] |
| 57 | https://www.nasdaq.com/articles/noble-midstrea... | Noble Midstream Partners (NBLX) Stock Sinks As... | 2020-10-23 | [Zacks Equity Research] | Noble Midstream Partners (NBLX) closed at $9.0... | [sinks, trading, revisions, stock, estimate, g... | [] |


In [None]:
# convert to a csv file and download it
# note: google.colab is only available when the notebook runs in Google Colab
from google.colab import files
df_dates.to_csv('NDX_news_headlines_accessed_2020-10-23.csv', index=False)
files.download('NDX_news_headlines_accessed_2020-10-23.csv')


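
The file name in the cell above hard-codes the access date. A small helper could derive it from the current date instead. This is a sketch: the helper name `csv_filename` is hypothetical, and it simply mirrors the naming pattern used in this post.

```python
import datetime


def csv_filename(ticker, accessed=None):
    """Return a file name such as 'NDX_news_headlines_accessed_2020-10-23.csv'."""
    # Default to today's date when no access date is given.
    accessed = accessed or datetime.date.today()
    return f"{ticker.upper()}_news_headlines_accessed_{accessed.isoformat()}.csv"


print(csv_filename("ndx", datetime.date(2020, 10, 23)))
# NDX_news_headlines_accessed_2020-10-23.csv
```

In the notebook this would be used as `df_dates.to_csv(csv_filename('NDX'), index=False)`.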

References: 

1. [NDX News Headlines](https://www.nasdaq.com/market-activity/index/ndx/news-headlines) accessed 23-October-2020.
2. [Quickstart Guide - Newspaper](https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#extracting-articles) accessed 18-October-2020.
3. [Scraping articles about stocks](http://theautomatic.net/2017/08/24/scraping-articles-about-stocks/) accessed 18-October-2020.
4. [Scrape and Summarize News Articles by Computer Science](https://www.youtube.com/watch?v=YzMA2O_v5co&t=451s) accessed 18-October-2020.
