# Keyword Extraction using News Articles

#newsapi #KeywordExtraction #ElectricVehicles #CO2Emissions

The goal of this document is to extract keywords 
that are associated with the environmental downsides 
of an electric vehicle that are covered in mainstream
media sources. The API used here is for newsapi.org.

This jupyter notebook includes:
- Data gathering from newsapi.org
- Data cleaning and feature extraction on the newsapi.org dataset

# First, use API to gather news data

In [1]:
from newsapi import NewsApiClient
import datetime as dt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

my_api_key = "fea5c56a315742a8bcc16a4c19a1b31e"

In [2]:
newsapi = NewsApiClient(api_key = my_api_key)
data = newsapi.get_everything(q = "(electric vehicles OR EV OR EVs) AND (carbon footprint OR CO2 Emissions)", language = 'en', page_size = 100)

In [3]:
data.keys()

dict_keys(['status', 'totalResults', 'articles'])

## The structure of the pulled dataset is as follows:

The _data_ object is a dictionary with keys {'status', 'totalResults', 'articles'}, as seen below:

In [4]:
print(type(data), data.keys())

<class 'dict'> dict_keys(['status', 'totalResults', 'articles'])


For example, it can be checked using the 'status' key that my search went as wished, and the number of relevant search results is given by:

In [5]:
print(data['status'] == 'ok', data['totalResults'])

True 1874


But since my account isn't a developer account, this dataset is going to contain just 100 news articles:

In [6]:
print(len(data['articles']))

100


As for all these datasets, it will be helpful to gather the text data into a data frame 
to prep for some data cleaning and later on, clustering. As can be seen below, the articles
are stored in a list of dictionaries:

In [7]:
print(type(data['articles']), type(data['articles'][0]), len(data['articles']))

<class 'list'> <class 'dict'> 100


Below are some good information about the storage structure for each article:

In [8]:
print(data['articles'][0])
print(data['articles'][0].keys())

{'source': {'id': 'techcrunch', 'name': 'TechCrunch'}, 'author': 'Walter Thompson', 'title': 'EV charging solutions will become an asset, not a liability, to the grid', 'description': 'Although wireless charging is still relatively new to the market, the benefits are beginning to become glaringly self-evident.', 'url': 'http://techcrunch.com/2021/08/31/ev-charging-solutions-will-become-an-asset-not-a-liability-to-the-grid/', 'urlToImage': 'https://techcrunch.com/wp-content/uploads/2021/08/GettyImages-1219328171.jpg?w=600', 'publishedAt': '2021-08-31T14:30:08Z', 'content': 'President Joe Biden’s plan for electric vehicles (EVs) to comprise roughly half of U.S. sales by 2030 is a clear indication that the U.S. is making strides in decarbonizing its transportation systems… [+8883 chars]'}
dict_keys(['source', 'author', 'title', 'description', 'url', 'urlToImage', 'publishedAt', 'content'])


It can be seen above that each article has attributes:
{'source', 'author', 'title', 'description', 'url', 
'urlToImage', 'publishedAt', 'content'}

- These variables will be relevant and kept: source, title, publishedAt, content
- _source_ could reveal media bias
- _title_ is important content
- _publishedAt_ is important because EV production technologies are evolving fast, an article from too long ago is not applicable for today
- _content_ must be kept for analysis

# Below is Data Cleaning and Feature Extraction

1. The articles will be put into a dataframe
2. The contents will be converted to bags-of-words (BOW) for further text analysis

The _pandas_ library has an amazing function already implemented to convert a dictionary into a data frame. The fucntion is called:

> pandas.DataFrame.from_dict()

The data can be a list of dictionaries, just like our case. See the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html#pandas.DataFrame.from_dict). 

In [9]:
raw_dataset = data['articles']
DF = pd.DataFrame.from_dict(raw_dataset)
DF.to_csv('rawNewsData.csv', encoding='utf-8')

In [10]:
print(DF.info())
print(DF.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
source         100 non-null object
author         72 non-null object
title          100 non-null object
description    100 non-null object
url            100 non-null object
urlToImage     100 non-null object
publishedAt    100 non-null object
content        100 non-null object
dtypes: object(8)
memory usage: 6.4+ KB
None
Index(['source', 'author', 'title', 'description', 'url', 'urlToImage',
       'publishedAt', 'content'],
      dtype='object')


First tasks on this data frame:
1. _author, description, url, urlToImage_ are all irrelevant features, so they will be deleted. The relevant DF method is:
    > pandas.DataFrame.drop()
2. For _source_, only 'name' will be kept
3. Each feature has datatype 'object' which needs correction.
    - source -- string
    - title -- string
    - description -- string
    - publishedAt -- int, only year
    - content -- string

In [11]:
### Task 1
# documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
drop_labels = ['author', 'url', 'urlToImage']
DF = DF.drop(labels = drop_labels, axis = 1)

In [12]:
## Task 2
for i in range(0,100):
    DF['source'][i] = DF['source'][i]['name']
print(DF['source'])

0       TechCrunch
1       TechCrunch
2        Wiley.com
3          Reuters
4          Reuters
          ...     
95    The Guardian
96       Zacks.com
97         Reuters
98         Reuters
99         Reuters
Name: source, Length: 100, dtype: object


In [13]:
## Task 3
for label in DF.columns:
    print(DF[label][0])
    print(type(DF[label][0]))
    
# so only publishedAt needs correction, this will be made into int year
for i in range(0,100):
    DF['publishedAt'][i] = int(DF['publishedAt'][i][0:4]) # Python strings can be sliced like arrays

TechCrunch
<class 'str'>
EV charging solutions will become an asset, not a liability, to the grid
<class 'str'>
Although wireless charging is still relatively new to the market, the benefits are beginning to become glaringly self-evident.
<class 'str'>
2021-08-31T14:30:08Z
<class 'str'>
President Joe Biden’s plan for electric vehicles (EVs) to comprise roughly half of U.S. sales by 2030 is a clear indication that the U.S. is making strides in decarbonizing its transportation systems… [+8883 chars]
<class 'str'>


In [14]:
# it turns out that all these articles were published in 2021, so this feature is no longer useful. 
DF = DF.drop(labels = ['publishedAt'], axis = 1)

# here is what the dataset looks like now:
DF

Unnamed: 0,source,title,description,content
0,TechCrunch,"EV charging solutions will become an asset, no...",Although wireless charging is still relatively...,President Joe Biden’s plan for electric vehicl...
1,TechCrunch,Hyundai Motor Group unveils its hydrogen strat...,Hyundai Motor Group is backing hydrogen as a t...,Hyundai Motor Group is backing hydrogen as a t...
2,Wiley.com,Electric Vehicles produce substantial toxicity...,Electric vehicles (EVs) coupled with low-carbo...,Introduction\r\nOur global society is dependen...
3,Reuters,Zurich Insurance sets climate steps to curb C0...,"Zurich Insurance Group <a href=""https://www.re...",CEO Mario Greco of Swiss Zurich Insurance addr...
4,Reuters,UPDATE 1-Zurich Insurance sets climate steps t...,Zurich Insurance Group unveiled new climate me...,By Reuters Staff\r\nFILE PHOTO: CEO Mario Grec...
...,...,...,...,...
95,The Guardian,Is deep-sea mining a cure for the climate cris...,Trillions of metallic nodules on the sea floor...,In a display cabinet in the recently opened Ou...
96,Zacks.com,"Auto Stock Roundup: AAP & XPEV's Q2 Results, F...",While Advance Auto Parts (AAP) and XPeng (XPEV...,"August\r\n30, 2021\r\n5 min read\r\nThis story..."
97,Reuters,Illinois Senate passes bill to save nuclear pl...,The Illinois Senate passed a bill early on Wed...,The company and law firm names shown above are...
98,Reuters,Illinois Senate passes bill to save nuclear pl...,The Illinois Senate passed a bill early on Wed...,(Reuters) - The Illinois Senate passed a bill ...


# Text Cleaning

The only acceptable column is _source_ at the moment. For the features _title, description, content_, it is necessary to do the following data cleaning:

1. convert all text to lower case for consistency
2. remove parenthesized contents, e.g. (EVs), (Reuters), (Adds Exelon comment)
    - [Stackoverflow regex solution](https://stackoverflow.com/questions/640001/how-can-i-remove-text-within-parentheses-with-a-regex)
3. remove all link references
4. remove all punctuation
5. remove stop words

In [15]:
import re
import nltk # nlp module
from nltk.corpus import stopwords # for stopwords later
import string # to call string methods

In [16]:
punc = string.punctuation # string of punctuation characters

from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words("english")) # for removing stop words 

for j in ['title', 'description', 'content']: # iterated over these labels
    for i in range(0,100): # iterated over all 100 collected articles
        # 1. convert all text to lower case for consistency
        DF[j][i] = DF[j][i].lower()
        
        # 2. remove parenthesized contents
        # parenthesized contents are represented in re package
        # as \([^)]*\), see reference
        re.sub(r'\([^)]*\)', '', DF[j][i])
               
        # 3. remove all link references
        re.sub("https?:\/\/.*[\r\n]*", "", DF[j][i])
        
        # 4. replace all punctuation with an empty space
        for char in DF[j][i]:
            if char in punc:
                DF[j][i] = DF[j][i].replace(char, " ")
        
        # 5. remove stop words
        # reference for the line of code below: (I did take the time to understand this one-liner)
        # https://towardsdatascience.com/a-guide-to-cleaning-text-in-python-943356ac86ca
        DF[j][i] = " ".join([word for word in DF[j][i].split() if word not in stop_words])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ercongluo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
DF # see what the dataframe looks like now. looking good!

Unnamed: 0,source,title,description,content
0,TechCrunch,ev charging solutions become asset liability grid,although wireless charging still relatively ne...,president joe biden’s plan electric vehicles e...
1,TechCrunch,hyundai motor group unveils hydrogen strategy ...,hyundai motor group backing hydrogen top energ...,hyundai motor group backing hydrogen top energ...
2,Wiley.com,electric vehicles produce substantial toxicity...,electric vehicles evs coupled low carbon elect...,introduction global society dependent road tra...
3,Reuters,zurich insurance sets climate steps curb c02 e...,zurich insurance group href https www reuters ...,ceo mario greco swiss zurich insurance address...
4,Reuters,update 1 zurich insurance sets climate steps c...,zurich insurance group unveiled new climate me...,reuters staff file photo ceo mario greco swiss...
...,...,...,...,...
95,The Guardian,deep sea mining cure climate crisis curse,trillions metallic nodules sea floor could hel...,display cabinet recently opened broken planet ...
96,Zacks.com,auto stock roundup aap xpev q2 results f outpu...,advance auto parts aap xpeng xpev deliver impr...,august 30 2021 5 min read story originally app...
97,Reuters,illinois senate passes bill save nuclear plant...,illinois senate passed bill early wednesday ai...,company law firm names shown generated automat...
98,Reuters,illinois senate passes bill save nuclear plant...,illinois senate passed bill early wednesday ai...,reuters illinois senate passed bill early wedn...


## The dataframe now is already pretty good data, so it will be saved as is for later use as a csv file. 

In [18]:
DF.to_csv('NewsData_cleanedDF.csv', encoding='utf-8')

# Convert the Article Contents into bag-of-word (BOW) Representations using CountVectorizer()

Key function: 

> sklearn.feature_extraction.text.CountVectorizer()

Documentation: [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). 

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = DF['content'].tolist() # convert the contents into a list of strings, each string is an article

# it is discovered that there are too many numeric characters in the dataset and they are of no use for now
# so the following chunk of code takes them out using regex
for article in corpus:
    article = re.sub("^\d+\s|\s\d+\s|\s\d+$", " ", article)

    
stop_words = stop_words.union({'107', # tried regex to eliminate numeric characters but failed 
                             '10950', # so here's a brute-force solution
                             '12',
                             '1305',
                             '13052',
                             '1363',
                             '13th',
                             '1424',
                             '1472',
                             '1480',
                             '14932',
                             '1526',
                             '15961',
                             '16017',
                             '1655',
                             '1671',
                             '16th',
                             '17',
                             '1745',
                             '179',
                             '1816',
                             '1837',
                             '1846',
                             '1885',
                             '1888',
                             '1957',
                             '196',
                             '1970',
                             '1973',
                             '1988',
                             '20',
                             '2015',
                             '2016',
                             '2019',
                             '2023',
                             '2050',
                             '2078',
                             '2105',
                             '21202',
                             '2179',
                             '2201',
                             '2220',
                             '2294',
                             '2303',
                             '2310',
                             '2383',
                             '24',
                             '2416',
                             '2465',
                             '25',
                             '2500',
                             '2507',
                             '2633',
                             '2736',
                             '2787',
                             '28',
                             '2840',
                             '2844',
                             '2873',
                             '2894',
                             '29',
                             '300',
                             '3069',
                             '3108',
                             '3130',
                             '3164',
                             '3179',
                             '3277',
                             '3278',
                             '333',
                             '3500',
                             '3567',
                             '3653',
                             '3701',
                             '3821',
                             '3879',
                             '4047',
                             '4084',
                             '4247',
                             '43',
                             '4423',
                             '4534',
                             '4568',
                             '45710',
                             '4589',
                             '4762',
                             '4837',
                             '4940',
                             '4954',
                             '50',
                             '500',
                             '5075',
                             '5210',
                             '5301',
                             '5574',
                             '5633',
                             '5637',
                             '5682',
                             '5722',
                             '5889',
                             '5932',
                             '5974',
                             '5996',
                             '6052',
                             '6128',
                             '66',
                             '6694',
                             '6784',
                             '6827',
                             '70',
                             '72',
                             '7402',
                             '7459',
                             '7517',
                             '8288',
                             '83',
                             '8309',
                             '8399',
                             '8466',
                             '852',
                             '882',
                             '8883',
                             '8901'})
MyCV = CountVectorizer(input = corpus, encoding='utf-8', decode_error = 'ignore',
                      stop_words = stop_words
                       #, max_features = 200
                      )

In [20]:
X = MyCV.fit_transform(corpus)

In [21]:
colNames = MyCV.get_feature_names()

In [22]:
BOW_DF = pd.DataFrame(X.toarray(), columns = colNames)

# Now Write the result to .csv File

In [23]:
BOW_DF.to_csv("newsData_BOW.csv")