In [1]:
import pandas as pd

In [2]:
# Load the news dataset into a DataFrame
news_df = pd.read_csv("../data/News.csv")

In [3]:
news_df.head()

|   | IDLink | Title | Headline | Source | Topic | PublishDate | SentimentTitle | SentimentHeadline | Facebook | GooglePlus | LinkedIn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99248.0 | Obama Lays Wreath at Arlington National Cemetery | Obama Lays Wreath at Arlington National Cemete... | USA TODAY | obama | 4/2/2002 0:00 | 0.0 | -0.0533 | -1 | -1 | -1 |
| 1 | 10423.0 | A Look at the Health of the Chinese Economy | Tim Haywood, investment director business-unit... | Bloomberg | economy | 9/20/2008 0:00 | 0.208333 | -0.156386 | -1 | -1 | -1 |
| 2 | 18828.0 | Nouriel Roubini: Global Economy Not Back to 2008 | Nouriel Roubini, NYU professor and chairman at... | Bloomberg | economy | 1/28/2012 0:00 | -0.42521 | 0.139754 | -1 | -1 | -1 |
| 3 | 27788.0 | Finland GDP Expands In Q4 | Finland's economy expanded marginally in the t... | RTT News | economy | 3/1/2015 0:06 | 0.0 | 0.026064 | -1 | -1 | -1 |
| 4 | 27789.0 | Tourism, govt spending buoys Thai economy in J... | Tourism and public spending continued to boost... | The Nation - Thailand&#39;s English news | economy | 3/1/2015 0:11 | 0.0 | 0.141084 | -1 | -1 | -1 |


In [4]:
news_df.shape

(93239, 11)

### About the data

The dataset contains a large number of news article headlines, each paired with a
sentiment score and with social feedback from multiple platforms. It covers
93,239 news items on four topics: Economy, Microsoft, Obama and Palestine (UCI
Machine Learning Repository, n.d.).

The attributes present in the dataset are:
- **IDLink (numeric):** Unique identifier of news items
- **Title (string):** Title of the news item according to the official media sources
- **Headline (string):** Headline of the news item according to the official media sources
- **Source (string):** Original news outlet that published the news item
- **Topic (string):** Query topic used to obtain the items in the official media sources
- **PublishDate (timestamp):** Date and time of the news items' publication
- **SentimentTitle (numeric):** Sentiment score of the text in the news items' title
- **SentimentHeadline (numeric):** Sentiment score of the text in the news items' headline
- **Facebook (numeric):** Final value of the news items' popularity according to the social media 
source Facebook
- **GooglePlus (numeric):** Final value of the news items' popularity according to the social media 
source Google+
- **LinkedIn (numeric):** Final value of the news items' popularity according to the social media 
source LinkedIn

This project uses only the Title and SentimentTitle attributes. News related to Microsoft is removed because it is tech-centric and largely irrelevant in the context of Nepal.

In [5]:
# Data with positive sentiment
news_df[news_df['SentimentTitle'] > 0.05].shape

(25755, 11)

In [6]:
# Data with neutral or negative sentiment (score below 0.05)
news_df[news_df['SentimentTitle'] < 0.05].shape

(67482, 11)

Negative news (counting neutral news as negative) outnumbers positive news by roughly 2.6 to 1.
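
The two filters above do not quite partition the data: 25,755 + 67,482 = 93,237, two fewer than the 93,239 rows reported by `shape`. Rows scoring exactly 0.05, or rows with a missing score, would fall outside both filters. The following is a minimal sketch of how such a check could look, run on a small made-up frame (`toy_df`, `THRESHOLD` and the scores are illustrative assumptions, not values from News.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for news_df; the scores are invented for illustration.
toy_df = pd.DataFrame(
    {"SentimentTitle": [0.20, 0.05, -0.10, 0.00, np.nan, 0.30]}
)

THRESHOLD = 0.05
scores = toy_df["SentimentTitle"]

n_positive = int((scores > THRESHOLD).sum())       # strictly above the threshold
n_below = int((scores < THRESHOLD).sum())          # strictly below the threshold
n_at_threshold = int((scores == THRESHOLD).sum())  # exactly on the threshold
n_missing = int(scores.isna().sum())               # no score at all

# The four groups together account for every row.
assert n_positive + n_below + n_at_threshold + n_missing == len(toy_df)

print(n_positive, n_below, n_at_threshold, n_missing)
print(f"below/positive ratio: {n_below / n_positive:.2f}")
```

Running the same four counts on `news_df['SentimentTitle']` would show which of the two cases explains the gap.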

### Data Preprocessing

In [7]:
#Dropping news related to microsoft
news_df = news_df[news_df['Topic'] != "microsoft"]

In [8]:
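# Removing the irrelevant columns
# .copy() gives an independent frame, so adding a column later
# does not trigger pandas' SettingWithCopyWarning
news_df = news_df[['Title', 'SentimentTitle']].copy()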

In [9]:
news_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71381 entries, 0 to 93237
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Title           71381 non-null  object 
 1   SentimentTitle  71381 non-null  float64
dtypes: float64(1), object(1)
memory usage: 1.6+ MB


In [10]:
# Sentiment scores above 0.05 are generally considered positive.
# Since we are only interested in filtering good (positive) news,
# we label scores above 0.05 as positive (1) and all other scores as negative (0).
def is_positive(sentiment_score):
    if sentiment_score > 0.05:
        return 1
    else:
        return 0

In [11]:
news_df['Is_SentimentTitle_Positive'] = news_df['SentimentTitle'].apply(is_positive)
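
The same labels can also be produced without calling a Python function once per row. A sketch of a vectorized equivalent, shown on a small made-up Series (`scores` and `labels` are illustrative names, not part of the notebook). A comparison against a missing value evaluates to False, so a missing score is labelled 0 here, which matches what `is_positive` returns for it:

```python
import numpy as np
import pandas as pd

# Illustrative scores, not taken from the dataset.
scores = pd.Series([0.208333, 0.0, -0.42521, 0.05, np.nan])

# Element-wise comparison, then cast True/False to 1/0.
labels = (scores > 0.05).astype(int)

print(labels.tolist())
```

On the real frame this would read `(news_df['SentimentTitle'] > 0.05).astype(int)`.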

In [12]:
# Removing SentimentTitle column
news_df = news_df[['Title','Is_SentimentTitle_Positive']]

In [13]:
news_df.head()

|   | Title | Is_SentimentTitle_Positive |
|---|---|---|
| 0 | Obama Lays Wreath at Arlington National Cemetery | 0 |
| 1 | A Look at the Health of the Chinese Economy | 1 |
| 2 | Nouriel Roubini: Global Economy Not Back to 2008 | 0 |
| 3 | Finland GDP Expands In Q4 | 0 |
| 4 | Tourism, govt spending buoys Thai economy in J... | 0 |


### Text Preprocessing