
# BlogMe: Sentiment and Keyword Analysis

## 1. Project introduction and background
BlogMe, a famous blogging business, has a dataset of news articles that needs
further analysis.

First, they would like keywords extracted from the article headlines. Second,
they need to determine the sentiment of the news articles. The data is in an
Excel sheet, and they would like to see a dashboard outlining sentiment, top articles, etc.




## 2. Data Exploration and Manipulation
### 2.1 Importing the libraries, visualizing the data and understanding the variables

In [None]:
# Import libraries
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
# Read the xlsx file
data = pd.read_excel("articles.xlsx")

In [None]:
# Summary of the data
data.describe()

Unnamed: 0,article_id,top_article,engagement_reaction_count,engagement_comment_count,engagement_share_count,engagement_comment_plugin_count
count,10437.0,10435.0,10319.0,10319.0,10319.0,10319.0
mean,5218.0,0.122089,381.39529,124.032949,196.236263,0.011629
std,3013.046714,0.327404,4433.344792,965.351188,1020.680229,0.268276
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,2609.0,0.0,0.0,0.0,1.0,0.0
50%,5218.0,0.0,1.0,0.0,8.0,0.0
75%,7827.0,0.0,43.0,12.0,47.5,0.0
max,10436.0,1.0,354132.0,48490.0,39422.0,15.0


In [None]:
data.head(5)

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,top_article,engagement_reaction_count,engagement_comment_count,engagement_share_count,engagement_comment_plugin_count
0,0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,0.0,0.0,0.0,2528.0,0.0
1,1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,0.0,6.0,10.0,2.0,0.0
2,2,the-irish-times,The Irish Times,Deirdre McQuillan,"Louise Kennedy AW2019: Long coats, sparkling t...",Autumn-winter collection features designer’s g...,https://www.irishtimes.com/\t\t\t\t\t\t\t/life...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T14:40:00Z,Louise Kennedy is showing off her autumn-winte...,1.0,,,,
3,3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",0.0,0.0,0.0,7.0,0.0
4,4,bbc-news,BBC News,BBC News,UK government lawyer says proroguing parliamen...,"The UK government's lawyer, David Johnston arg...",https://www.bbc.co.uk/news/av/uk-scotland-4956...,https://ichef.bbci.co.uk/news/1024/branded_new...,2019-09-03T14:39:21Z,,0.0,0.0,0.0,0.0,0.0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10437 entries, 0 to 10436
Data columns (total 15 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   article_id                       10437 non-null  int64  
 1   source_id                        10437 non-null  object 
 2   source_name                      10437 non-null  object 
 3   author                           9417 non-null   object 
 4   title                            10435 non-null  object 
 5   description                      10413 non-null  object 
 6   url                              10436 non-null  object 
 7   url_to_image                     9781 non-null   object 
 8   published_at                     10436 non-null  object 
 9   content                          9145 non-null   object 
 10  top_article                      10435 non-null  float64
 11  engagement_reaction_count        10319 non-null  float64
 12  engagement_comment
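
The non-null counts reported by `data.info()` show that several columns, including `title`, contain gaps. A minimal sketch of counting missing values per column follows. The `toy` frame and its values are invented for illustration and are not taken from `articles.xlsx`.

```python
import pandas as pd

# Toy stand-in for the articles data; all values are invented for illustration.
toy = pd.DataFrame({
    "title": ["Headline A", None, "Headline C"],
    "author": [None, None, "Jane Doe"],
    "engagement_reaction_count": [10.0, None, 3.0],
})

# isna() marks missing cells; sum() then counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data, the same call would be `data.isna().sum()`. Knowing which columns have gaps matters later, because a missing title cannot be searched for keywords or scored for sentiment.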

In [None]:
# Drop unneeded column
data = data.drop('engagement_comment_plugin_count', axis=1)

In [None]:
# Counting the number of articles per source
data.groupby(['source_name'])['article_id'].count()

source_name
460.0                         1
ABC News                   1139
Al Jazeera English          499
BBC News                   1242
Business Insider           1048
CBS News                    952
CNN                        1132
ESPN                         82
Newsweek                    539
Reuters                    1252
The Irish Times            1232
The New York Times          986
The Wall Street Journal     333
Name: article_id, dtype: int64
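
The counts above include one row whose `source_name` is `460.0`, which looks more like a misplaced numeric value than a publisher. The sketch below shows one possible way to isolate such rows, assuming that a source name which parses as a number is suspect. The `toy` frame is invented for illustration, and whether to drop or repair the row is a judgment call the original analysis does not make.

```python
import pandas as pd

# Toy stand-in; the values are invented for illustration.
toy = pd.DataFrame({
    "source_name": ["Reuters", "460.0", "BBC News", "CNN"],
    "article_id": [0, 1, 2, 3],
})

# to_numeric with errors="coerce" yields NaN for anything that is not a number,
# so notna() is True only for names that parse as numbers.
looks_numeric = pd.to_numeric(toy["source_name"], errors="coerce").notna()

suspect_rows = toy[looks_numeric]   # rows to inspect by hand
clean = toy[~looks_numeric]         # rows with plausible publisher names
print(suspect_rows)
```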

In [None]:
# Calculating the total reaction count per source
data.groupby(['source_name'])['engagement_reaction_count'].sum()

source_name
460.0                            0.0
ABC News                    343779.0
Al Jazeera English          140410.0
BBC News                    545396.0
Business Insider            216545.0
CBS News                    459741.0
CNN                        1218206.0
ESPN                             0.0
Newsweek                     93167.0
Reuters                      16963.0
The Irish Times              26838.0
The New York Times          790449.0
The Wall Street Journal      84124.0
Name: engagement_reaction_count, dtype: float64
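
The two per-source summaries above can also be produced in a single pass with named aggregation. This is a sketch on invented toy data. The output column names `article_count` and `total_reactions` are hypothetical labels chosen here, not names from the original notebook.

```python
import pandas as pd

# Toy stand-in; the values are invented for illustration.
toy = pd.DataFrame({
    "source_name": ["CNN", "CNN", "Reuters", "ESPN"],
    "article_id": [1, 2, 3, 4],
    "engagement_reaction_count": [100.0, 50.0, 7.0, None],
})

# One groupby, two aggregations: rows per source and summed reactions per source.
summary = toy.groupby("source_name").agg(
    article_count=("article_id", "count"),
    total_reactions=("engagement_reaction_count", "sum"),
)
print(summary)
```

Note that `sum` skips missing values, so a total of `0.0` can mean either that a source received no reactions or that no reaction counts were recorded for it.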

### 2.2 Sentiment Analysis



In [None]:
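# Create a keyword flag: 1 if the keyword appears in the title, otherwise 0
def keywordflag(keyword):
    keyword_flag = []
    for n in range(len(data)):
        heading = data['title'][n]
        try:
            flag = 1 if keyword in heading else 0
        except TypeError:
            # Missing titles are NaN (a float), so the `in` test raises TypeError
            flag = 0
        keyword_flag.append(flag)
    return keyword_flag

# Store the result under a new name so the function is not overwritten
keyword_flags = keywordflag('murder')

data['keywordflag'] = pd.Series(keyword_flags)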
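
The row-by-row loop can also be expressed with pandas string methods. The sketch below is an alternative, not the notebook's own method. The helper name `keyword_flag`, its `case` parameter, and the toy titles are all hypothetical.

```python
import pandas as pd

def keyword_flag(titles: pd.Series, keyword: str, case: bool = True) -> pd.Series:
    """Return 1 where `keyword` occurs in the title, else 0 (missing titles give 0)."""
    # regex=False treats the keyword as plain text; na=False maps missing titles to False.
    return titles.str.contains(keyword, case=case, na=False, regex=False).astype(int)

# Toy titles, invented for illustration; None stands in for a missing title.
toy_titles = pd.Series(["Murder trial begins", None, "A murder in Paris", "Weather update"])

case_sensitive = keyword_flag(toy_titles, "murder").tolist()
case_insensitive = keyword_flag(toy_titles, "murder", case=False).tolist()
print(case_sensitive)    # [0, 0, 1, 0]
print(case_insensitive)  # [1, 0, 1, 0]
```

Like the loop, the default is case-sensitive, so a headline starting with "Murder" is not flagged unless `case=False` is passed.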
    

In [None]:
# SentimentIntensityAnalyzer
sent_int = SentimentIntensityAnalyzer()

text = data['title'][15]
sent = sent_int.polarity_scores(text)

neg = sent['neg']
neu = sent['neu']
pos = sent['pos']

In [None]:
# Use a for loop to extract the sentiment of each title

title_neg_sent = []
title_neu_sent = []
title_pos_sent = []

# Create the analyzer once and reuse it for every title
sent_int = SentimentIntensityAnalyzer()

for x in range(0, len(data)):
    try:
        # Index with the loop variable so each row gets its own scores
        text = data['title'][x]
        sent = sent_int.polarity_scores(text)
        title_neg_sent.append(sent['neg'])
        title_neu_sent.append(sent['neu'])
        title_pos_sent.append(sent['pos'])
    except Exception:
        # Missing (NaN) titles cannot be scored; record zeros instead
        title_neg_sent.append(0)
        title_neu_sent.append(0)
        title_pos_sent.append(0)

In [None]:
title_neg_sent = pd.Series(title_neg_sent)
title_neu_sent = pd.Series(title_neu_sent)
title_pos_sent = pd.Series(title_pos_sent)
data['title_neg_sentiment'] = title_neg_sent
data['title_neu_sentiment'] = title_neu_sent
data['title_pos_sentiment'] = title_pos_sent
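
The per-title scoring step can be wrapped in a small helper that checks for missing titles explicitly instead of relying on an exception. This is a sketch under stated assumptions: `score_titles` and `toy_scorer` are hypothetical names, and `toy_scorer` is a crude word-count stand-in used only so the example runs without VADER installed. In the notebook, the scorer would be `SentimentIntensityAnalyzer().polarity_scores`.

```python
from typing import Callable, Dict, Iterable, List

SCORE_KEYS = ("neg", "neu", "pos")

def score_titles(titles: Iterable, scorer: Callable[[str], Dict[str, float]]) -> List[Dict[str, float]]:
    """Score each title with `scorer`; missing or empty titles get zeros."""
    rows = []
    for title in titles:
        if isinstance(title, str) and title.strip():
            scores = scorer(title)
            rows.append({key: scores[key] for key in SCORE_KEYS})
        else:
            # None, NaN, and blank strings cannot be scored.
            rows.append({key: 0.0 for key in SCORE_KEYS})
    return rows

def toy_scorer(text: str) -> Dict[str, float]:
    """Illustrative stand-in for a sentiment scorer; this is NOT VADER."""
    words = text.lower().split()
    pos = sum(word in {"good", "great"} for word in words)
    neg = sum(word in {"bad", "murder"} for word in words)
    total = len(words)
    return {"neg": neg / total, "neu": (total - pos - neg) / total, "pos": pos / total}

results = score_titles(["good day", None, "bad bad"], toy_scorer)
print(results)
```

Each title is looked up and scored individually, so rows with different headlines receive different scores. The resulting list of dictionaries can be passed to `pd.DataFrame` to obtain one column per score.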

In [None]:
data.head(5)

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,top_article,engagement_reaction_count,engagement_comment_count,engagement_share_count,keywordflag,title_neg_sentiment,title_neu_sentiment,title_pos_sentiment
0,0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,0.0,0.0,0.0,2528.0,0,0.54,0.088,0.372
1,1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,0.0,6.0,10.0,2.0,0,0.54,0.088,0.372
2,2,the-irish-times,The Irish Times,Deirdre McQuillan,"Louise Kennedy AW2019: Long coats, sparkling t...",Autumn-winter collection features designer’s g...,https://www.irishtimes.com/\t\t\t\t\t\t\t/life...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T14:40:00Z,Louise Kennedy is showing off her autumn-winte...,1.0,,,,0,0.54,0.088,0.372
3,3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",0.0,0.0,0.0,7.0,0,0.54,0.088,0.372
4,4,bbc-news,BBC News,BBC News,UK government lawyer says proroguing parliamen...,"The UK government's lawyer, David Johnston arg...",https://www.bbc.co.uk/news/av/uk-scotland-4956...,https://ichef.bbci.co.uk/news/1024/branded_new...,2019-09-03T14:39:21Z,,0.0,0.0,0.0,0.0,0,0.54,0.088,0.372
