# Sentiment Analysis of Financial News Using NLTK

The goal is to predict the sentiment of financial news headlines and descriptions using NLTK.

# About Dataset

This dataset contains 3 CSV files. The shape of each file is given as (rows, columns):

cnbc headlines   (3080, 3)

guardian headlines   (17800, 2)

reuters headlines   (32770, 3)


# Columns Provided in the Dataset

cnbc headline
1. time
2. headlines
3. Description

guardian headline
1. time
2. headline

reuters headline
1. time
2. headline
3. description


# What is NLTK?

The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, with a focus on statistical natural language processing (NLP).

It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.


https://medium.com/@ODSC/intro-to-language-processing-with-the-nltk-59aa26b9d056



# What is sentiment analysis?

Sentiment analysis is the process of detecting positive or negative sentiment in text. It’s often used by businesses to detect sentiment in social data, gauge brand reputation, and understand customers.



https://monkeylearn.com/sentiment-analysis/
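To make the idea concrete, here is a toy lexicon-based scorer. It is only a sketch: the word list `TOY_LEXICON` and the function `toy_sentiment` are made up for illustration and are far simpler than the VADER analyzer used later in this notebook.

```python
import string

# Hypothetical mini-lexicon for illustration only; real lexicons such as
# VADER's contain thousands of scored entries.
TOY_LEXICON = {"gain": 1.0, "surge": 1.5, "profit": 1.0,
               "loss": -1.0, "plunge": -1.5, "fraud": -2.0}

def toy_sentiment(text):
    """Sum the lexicon scores of the words in `text` and map the total to a label."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    score = sum(TOY_LEXICON.get(word, 0.0) for word in cleaned.split())
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(toy_sentiment("Shares surge as profit beats forecasts"))
print(toy_sentiment("Bank shares plunge after fraud probe"))
print(toy_sentiment("Central bank meets on Thursday"))
```

Words that are not in the lexicon contribute nothing, so a text without any known sentiment word ends up Neutral.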

In [None]:
# Import all the required libraries 
import pandas as pd
import numpy as np
import re
import string
import nltk

#import stopwords and text processing libraries
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('omw-1.4')

In [None]:
#import machine learning libraries
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.preprocessing import LabelEncoder
import sklearn.metrics as metrics

# Basic EDA on cnbc_headlines dataset

In [None]:
# Read csv file of cnbc headlines using pandas
cnbc = pd.read_csv('../input/financial-news-headlines/cnbc_headlines.csv')

In [None]:
cnbc

In [None]:
# check the shape of cnbc headline dataset
cnbc.shape

In [None]:
# Check all the columns in the cnbc headline dataset
cnbc.columns

In [None]:
# Check the dtype and non-null count of each column
cnbc.info()

In [None]:
# Check for missing values in all the columns of the cnbc headlines dataset
cnbc.isnull().sum()

There are 280 missing values in the headlines, description and time columns.

In [None]:
# drop rows with NaN values in the cnbc headlines dataset
cnbc.dropna(axis = 0, how = 'any', inplace = True)

In [None]:
cnbc

In [None]:
cnbc.isnull().sum()

In [None]:
# drop the duplicate rows in the dataset, keeping the first occurrence
cnbc.drop_duplicates(subset = ['Headlines', 'Description'], inplace = True, keep = 'first')
cnbc.reset_index(drop = True, inplace = True)
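The `keep = 'first'` behaviour used above can be pictured in plain Python. This is only a sketch of the semantics, not how pandas implements it; the helper `drop_duplicates_keep_first` and the sample rows are made up for illustration.

```python
def drop_duplicates_keep_first(rows, subset):
    """Return rows whose `subset` key values have not been seen before, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Made-up sample rows, not taken from the dataset.
sample = [
    {"Headlines": "Oil rises", "Description": "Crude gains 2%", "Time": "Mon"},
    {"Headlines": "Oil rises", "Description": "Crude gains 2%", "Time": "Tue"},
    {"Headlines": "Oil rises", "Description": "Brent hits new high", "Time": "Wed"},
]
deduped = drop_duplicates_keep_first(sample, ["Headlines", "Description"])
print([row["Time"] for row in deduped])  # ['Mon', 'Wed']
```

A row counts as a duplicate only when every column in the subset matches, which is why the third row survives even though its headline repeats.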

In [None]:
# check the shape of cnbc headline dataset
cnbc.shape

In [None]:
cnbc

# Basic EDA on Guardian headlines dataset

In [None]:
# Read csv file of Guardian headlines using pandas
gaurdian = pd.read_csv('../input/financial-news-headlines/guardian_headlines.csv')

In [None]:
gaurdian

In [None]:
# check the shape of the Guardian headlines dataset
gaurdian.shape

In [None]:
# check the columns of the Guardian headlines dataset
gaurdian.columns

In [None]:
# Check the dtype and non-null count of each column
gaurdian.info()

In [None]:
# check null values in the Guardian headlines dataset
gaurdian.isnull().sum()

In [None]:
# drop duplicate rows in headlines, keeping the first occurrence
gaurdian.drop_duplicates(subset = ['Headlines'], keep = 'first', inplace = True)
gaurdian.reset_index(drop = True, inplace = True)

In [None]:
gaurdian

# Basic EDA on reuters headlines

In [None]:
# Read csv file of reuters headlines using pandas
reuters = pd.read_csv('../input/financial-news-headlines/reuters_headlines.csv')

In [None]:
reuters

In [None]:
#check the shape of reuters headlines dataset
reuters.shape

In [None]:
#check the columns of reuters headline dataset
reuters.columns

In [None]:
# Check the dtype and non-null count of each column
reuters.info()

In [None]:
# Check for missing values in all the columns of the reuters headlines dataset
reuters.isnull().sum()

In [None]:
# drop the duplicate rows in the reuters headlines dataset, keeping the first occurrence
reuters.drop_duplicates(subset = ['Headlines', 'Description'], keep = 'first', inplace = True)
reuters.reset_index(drop = True, inplace = True)
reuters

# Helper functions that we will need later

Preprocessing

1. **Lowercase** - Convert the text to lower case so that the same word written in different cases (for example "Bank" and "bank") is treated as one token.

2. **Remove punctuation** - Punctuation does not add value to the data, and when it is attached to a word it makes that word look different from its other occurrences, so we strip it.

3. **Remove stopwords** - Stopwords such as I, he, she, and, but, was, were, being and have carry little meaning on their own. Removing them reduces the number of features in our data. They are removed after tokenizing the text.

4. **Stemming** - A technique that reduces a word to its stem by stripping suffixes. The stemmed word might not be a dictionary word, i.e. it will not necessarily carry meaning.

5. **Lemmatizing** - Reduces a word to its dictionary form, called the lemma. It is applied to nouns by default. It is more accurate than stemming because it uses a more informed analysis to group words with similar meanings based on context, but it is more complex and takes more time. It is used where we need to retain contextual information.


https://youtu.be/lMQzEk5vht4

https://www.pluralsight.com/guides/importance-of-text-pre-processing

In [None]:
# create a function for preprocessing
def preprocessing_text(text):
    # convert all to lowercase
    text = text.lower()
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # tokenize, then remove stopwords
    stop_word = set(stopwords.words('english'))
    text_tokens = word_tokenize(text)
    filtered_words = [word for word in text_tokens if word not in stop_word]
    # stemming
    ps = PorterStemmer()
    stemmed_words = [ps.stem(w) for w in filtered_words]
    # lemmatizing (tokens are treated as adjectives, pos = 'a')
    lemmatizer = WordNetLemmatizer()
    lemma_words = [lemmatizer.lemmatize(w, pos = 'a') for w in stemmed_words]
    return " ".join(lemma_words)

In [None]:
preprocessing_text('TikTok, considers,London and other locations')

SENTIMENT ANALYSIS

https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664

In [None]:
# import sentiment intensity analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# create sentiment intensity analyzer object
analyzer = SentimentIntensityAnalyzer()


In [None]:
# function to label a compound score as positive, negative or neutral
def get_analysis(score):
    if score < 0.0:
        return 'Negative'
    elif score == 0.0:
        return 'Neutral'
    else:
        return 'Positive'
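The function above calls a score Neutral only when it is exactly 0.0. The VADER authors suggest a small band around zero instead (compound score of at least 0.05 for positive and at most -0.05 for negative). Below is a sketch of that variant; the name `get_analysis_thresholded` and the `threshold` parameter are hypothetical and are not used elsewhere in this notebook.

```python
def get_analysis_thresholded(score, threshold=0.05):
    """Label a compound score, treating values within +/- threshold as Neutral."""
    if score >= threshold:
        return 'Positive'
    if score <= -threshold:
        return 'Negative'
    return 'Neutral'

# Made-up compound scores for illustration
for s in (0.6, 0.02, -0.03, -0.4):
    print(s, get_analysis_thresholded(s))
```

With this variant, weakly scored texts fall into the Neutral class, so the class counts would differ from the ones reported in this notebook.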

# Now working with the descriptions in the datasets

In [None]:
# concatenate cnbc headlines dataset and reuters headline dataset
new_data = pd.concat([cnbc, reuters], axis = 0)

In [None]:
#check the shape of this new dataset
new_data.shape

In [None]:
#make a copy of new dataset 
new_data_cp = new_data.copy()

In [None]:
# apply preprocessing to the description of new dataset
new_data['Description'] = new_data['Description'].apply(preprocessing_text)


In [None]:
# compute the compound polarity score of each description and store it in a new column
ds_score = []
for i in new_data['Description'].values:
    ds_score.append(analyzer.polarity_scores(i)['compound'])
new_data['ds_score'] = ds_score
new_data

In [None]:
# apply the function that decides the sentiment to the polarity score column
new_data['ds_score'] = new_data['ds_score'].apply(get_analysis)
new_data

In [None]:
# plot a count plot on description score column
sns.countplot(x = 'ds_score', data = new_data)

The description column contains approximately 14,000 positive, 12,000 negative and 8,000 neutral statements.

In [None]:
# pie chart on description score column
fig = px.pie(new_data, names = 'ds_score', title = 'Pie chart of different Sentiments')
fig.show()

In the dataset, the descriptions are 42.6% positive, 34.5% negative and 22.9% neutral.

In [None]:
new_data['Description']

In [None]:
le = LabelEncoder()
new_data['ds_score'] = le.fit_transform(new_data['ds_score'])
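`LabelEncoder` numbers the classes in sorted order, so here Negative becomes 0, Neutral becomes 1 and Positive becomes 2. The rows and columns of the confusion matrices printed later follow the same order. A plain-Python sketch of that mapping (the helper `fit_label_mapping` is made up for illustration and is not the scikit-learn implementation):

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

mapping = fit_label_mapping(['Positive', 'Negative', 'Neutral', 'Positive'])
print(mapping)  # {'Negative': 0, 'Neutral': 1, 'Positive': 2}
```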

# Modelling on description 

In [None]:
# split the dataset  into test and train 
# 90% train , 10% test and random state 212
X_train_ds, X_test_ds, y_train_ds, y_test_ds = train_test_split(new_data['Description'], new_data['ds_score'], random_state = 212, train_size = 0.90, test_size = 0.10)

LINEAR SUPPORT VECTOR MACHINE


In [None]:
%%time
# pipeline creation
# 1. TfidfVectorizer
# 2. LinearSVC model
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('linearsvc', LinearSVC())])

# Fit the pipeline to the train data
linear_svc_model_ds = pipe.fit(X_train_ds, y_train_ds)
# predict on test dataset
pred = linear_svc_model_ds.predict(X_test_ds)
print("MODEL - LINEAR SVC")
# print accuracy score
print("accuracy score: {}%".format(round(accuracy_score(y_test_ds, pred)*100,2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_ds, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_ds, pred)))
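Every pipeline in this notebook turns the text into TF-IDF weights before the classifier sees it. The sketch below works through the weighting by hand, using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalisation, which are scikit-learn's defaults. It assumes plain whitespace tokenisation rather than scikit-learn's regex tokenizer, and the function `tfidf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using smoothed IDF and L2 normalisation."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        raw = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors

# Made-up documents: "stock" appears everywhere, "rise" only once.
docs = ["stock price rise", "stock price fall", "stock market rally"]
for vec in tfidf(docs):
    print({term: round(w, 3) for term, w in vec.items()})
```

A term that appears in every document ("stock") gets the lowest weight, while a term unique to one document ("rise") gets the highest.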

LOGISTIC REGRESSION


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Logistic Regression
pipe = Pipeline([('countvectorizer', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', LogisticRegression())])

# fit the pipeline to the train data
log_model_ds = pipe.fit(X_train_ds, y_train_ds)

# predict on test dataset
pred = log_model_ds.predict(X_test_ds)
print("MODEL - Logistic Regression")
# print accuracy score
print("accuracy score: {}%".format(round(accuracy_score(y_test_ds, pred)*100,2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_ds, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_ds, pred)))

MULTINOMIAL NAIVE BAYES


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. MultinomialNB
pipe = Pipeline([('countvectorizer', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', MultinomialNB())])

# fit the pipeline to the train data
multinomialnb_model_ds = pipe.fit(X_train_ds, y_train_ds)

# predict on test dataset
pred = multinomialnb_model_ds.predict(X_test_ds)
print("MODEL - Multinomial Naive Bayes")
# print accuracy score
print("accuracy score: {}%".format(round(accuracy_score(y_test_ds, pred)*100,2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_ds, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_ds, pred)))

BERNOULLI NAIVE BAYES


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. BernoulliNB

pipe = Pipeline([('countvectorizer', CountVectorizer()), ('tfidf', TfidfTransformer()), ('bernoullinb', BernoulliNB())])
# Fit the pipeline to the train data
bernoullinb_model_ds = pipe.fit(X_train_ds, y_train_ds)
# predict on test dataset
pred = bernoullinb_model_ds.predict(X_test_ds)

print("MODEL - Bernoulli Naive Bayes")
# print accuracy score
print("accuracy score: {}%".format(round(accuracy_score(y_test_ds, pred)*100,2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_ds, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_ds, pred)))



GRADIENT BOOSTING CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. GradientBoostingClassifier
pipe = Pipeline([('countvectorizer', CountVectorizer()), ('tfidf', TfidfTransformer()), ('gbclassifier', GradientBoostingClassifier())])
# Fit the pipeline to the train data
gb_model_ds = pipe.fit(X_train_ds, y_train_ds)
# predict on test dataset
pred = gb_model_ds.predict(X_test_ds)
print("MODEL - Gradient Boosting Classifier")
# print accuracy score
print("accuracy score: {}%".format(round(accuracy_score(y_test_ds, pred)*100,2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_ds, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_ds, pred)))


XGBOOST CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. XGBClassifier
pipe = Pipeline([('countvectorizer', CountVectorizer()), ('tfidf', TfidfTransformer()), ('xgb', XGBClassifier(learning_rate = 0.01, n_estimators = 10, max_depth = 5, random_state = 2020))])

# Fit the pipeline to the train data
xgb_model_ds = pipe.fit(X_train_ds, y_train_ds)

# predict on test data
pred = xgb_model_ds.predict(X_test_ds)
print("MODEL - XGBoost Classifier")
# print accuracy score
print("accuracy score: {}%".format(round(accuracy_score(y_test_ds, pred)*100,2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_ds, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_ds, pred)))

DECISION TREE CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Decision tree classifier
pipe = Pipeline([('countvectorizer', CountVectorizer()), ('tfidf', TfidfTransformer()), ('dtree', DecisionTreeClassifier())])

# Fit the pipeline to the train data
dtree_model_ds = pipe.fit(X_train_ds, y_train_ds)

# predict on test data
pred = dtree_model_ds.predict(X_test_ds)
print("MODEL - Decision Tree Classifier")
# print accuracy score
print("accuracy score: {}%".format(round(accuracy_score(y_test_ds, pred)*100,2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_ds, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_ds, pred)))


K-NEAREST NEIGHBOUR CLASSIFIER MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. KNN classifier
pipe = Pipeline([('countvectorizer', CountVectorizer()), ('tfidf', TfidfTransformer()), ('knn', KNeighborsClassifier())])

# Fit the pipeline to the train data
knn_model_ds = pipe.fit(X_train_ds, y_train_ds)

# predict on test data
pred = knn_model_ds.predict(X_test_ds)
print("MODEL - KNN")
# print accuracy score
print("accuracy score: {}%".format(round(accuracy_score(y_test_ds, pred)*100,2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_ds, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_ds, pred)))


In [None]:
# helper function for comparing model metrics
def compare_models(models, names, X_train_ds, X_test_ds, y_train_ds, y_test_ds):
    # print the classification report of every fitted model on the test set
    # (classification_report is already imported above)
    for (model, name) in zip(models, names):
        print(name)
        # then predict on the test set
        y_pred = model.predict(X_test_ds)
        res = classification_report(y_test_ds, y_pred)
        print("Classification Report\n", res)
        print("-----------------------------------------------------------------------------------------------------------------")
    

In [None]:
# list of model objects
models = [linear_svc_model_ds, log_model_ds, multinomialnb_model_ds, bernoullinb_model_ds, gb_model_ds, xgb_model_ds, dtree_model_ds, knn_model_ds]
# list of model names
names = ['linearSVC', 'Logistic', 'MultinomialNB', 'BernoulliNB', 'gradientBoost', 'XGB', 'decisionTree', 'KNN']
# print the comparison of models
compare_models(models, names, X_train_ds, X_test_ds, y_train_ds, y_test_ds)
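To read the reports above it helps to see how a confusion matrix and the accuracy are derived from the labels. The sketch below does the counting by hand, with rows as true classes and columns as predicted classes. The helpers `confusion_counts` and `accuracy` and the label lists are made up for illustration and are not the scikit-learn functions.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Return matrix[i][j] = number of samples with true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correct predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Made-up labels, assuming 0 = Negative, 1 = Neutral, 2 = Positive
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]
m = confusion_counts(y_true, y_pred, 3)
print(m)            # [[1, 1, 0], [0, 2, 0], [1, 0, 3]]
print(accuracy(m))  # 0.75
```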

# Working with test dataset

In [None]:
# Perform the prediction on the test dataset
y_pred = linear_svc_model_ds.predict(X_test_ds)
y_pred

In [None]:
y_pred = le.inverse_transform(y_pred)

In [None]:
# creating a dataframe of predicted results 
predictions = pd.DataFrame(y_pred)

In [None]:
predictions

In [None]:
predictions.head()

# Now working with headlines + description

In [None]:
new_data

In [None]:
# merge headlines and description of the new dataset into a column named info
new_data['info'] = new_data['Headlines'] + ' ' + new_data['Description']
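When two text columns are joined, an explicit separator keeps the last word of the headline from fusing with the first word of the description. A minimal sketch with made-up strings:

```python
headline = "Oil prices rise"              # made-up example strings
description = "Crude gains on supply fears"

fused = headline + description
spaced = headline + " " + description

print(fused.split())   # 'rise' and 'Crude' end up as the single token 'riseCrude'
print(spaced.split())
```

A fused token such as `riseCrude` matches neither word in a sentiment lexicon or in a vectorizer's vocabulary, so the information in both words is lost.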

In [None]:
new_data['info']

In [None]:
# keep only the info and time columns; drop all remaining columns
new_data.drop(['Description', 'Headlines', 'ds_score'], axis = 1, inplace = True)

In [None]:
new_data

In [None]:
# apply preprocessing on info column
new_data['info'] = new_data['info'].apply(preprocessing_text)

In [None]:
# compute the compound polarity score of each info value and store it in a new column
info_score = []
for i in new_data['info'].values:
    info_score.append(analyzer.polarity_scores(i)['compound'])
new_data['info_score'] = info_score
new_data

In [None]:
# apply the function that decides the sentiment to the polarity score column
new_data['info_score'] = new_data['info_score'].apply(get_analysis)

In [None]:
new_data

In [None]:
# perform count plot on info_score column
sns.countplot(x = 'info_score', data = new_data)

The info column contains approximately 15,500 positive, 13,000 negative and 6,500 neutral statements.

In [None]:
# perform pie chart on info_score column
fig = px.pie(new_data, names = 'info_score', title = 'Pie chart of different Sentiments')
fig.show()

In the dataset, the info column is 44.5% positive, 37.2% negative and 18.3% neutral.

# Modeling on headlines + description

In [None]:
new_data['info_score'] = le.fit_transform(new_data['info_score'])

In [None]:
new_data

In [None]:
# split the dataset  into test and train 
# 90% train , 10% test and random state 212
X_train_info, X_test_info, y_train_info, y_test_info = train_test_split(new_data['info'], new_data['info_score'], test_size = 0.10, random_state = 212)

LINEAR SUPPORT VECTOR MACHINE


In [None]:

%%time
# pipeline creation
# 1. TfidfVectorizer
# 2. LinearSVC model
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('linearsvc', LinearSVC())])

# Fit the pipeline to the train data
linearsvc_model_info = pipe.fit(X_train_info, y_train_info)
# predict on test dataset
pred = linearsvc_model_info.predict(X_test_info)

print("MODEL - LinearSVC")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_info, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_info, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_info, pred)))

LOGISTIC REGRESSION


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Logistic Regression
pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('logistic', LogisticRegression())])

# Fit the pipeline to the train data
lr_model_info = pipe.fit(X_train_info, y_train_info)
# predict on test data
pred = lr_model_info.predict(X_test_info)

print("MODEL - Logistic Regression")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_info, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_info, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_info, pred)))
 

MULTINOMIAL NAIVE BAYES


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. MultinomialNB
pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('mnb', MultinomialNB())])

# Fit the pipeline to the train data
mnb_model_info = pipe.fit(X_train_info, y_train_info)
# predict on test data
pred = mnb_model_info.predict(X_test_info)

print("MODEL - Multinomial Naive Bayes")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_info, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_info, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_info, pred)))
 

BERNOULLI NAIVE BAYES


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. BernoulliNB
pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('bernoulliNB', BernoulliNB())])

# Fit the pipeline to the train data
bnb_model_info = pipe.fit(X_train_info, y_train_info)
# predict on test data
pred = bnb_model_info.predict(X_test_info)

print("MODEL - BernoulliNB")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_info, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_info, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_info, pred)))
 

GRADIENT BOOSTING CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. GradientBoostingClassifier
pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('gb', GradientBoostingClassifier())])

# Fit the pipeline to the train data
gb_model_info = pipe.fit(X_train_info, y_train_info)
# predict on test data
pred = gb_model_info.predict(X_test_info)

print("MODEL - Gradient Boosting Classifier")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_info, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_info, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_info, pred)))

XGBOOST CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. XGBClassifier
pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('xgb', XGBClassifier())])

# Fit the pipeline to the train data
xgb_model_info = pipe.fit(X_train_info, y_train_info)
# predict on test data
pred = xgb_model_info.predict(X_test_info)

print("MODEL - XGBClassifier")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_info, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_info, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_info, pred)))


DECISION TREE CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Decision tree classifier
pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('DecisionTree', DecisionTreeClassifier())])

# Fit the pipeline to the train data
dtree_model_info = pipe.fit(X_train_info, y_train_info)
# predict on test data
pred = dtree_model_info.predict(X_test_info)

print("MODEL - Decision Tree")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_info, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_info, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_info, pred)))


K-NEAREST NEIGHBOUR CLASSIFIER MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. KNN classifier
pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('knn', KNeighborsClassifier())])

# Fit the pipeline to the train data
knn_model_info = pipe.fit(X_train_info, y_train_info)
# predict on test data
pred = knn_model_info.predict(X_test_info)

print("MODEL - K Nearest Neighbors")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_info, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_info, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_info, pred)))


In [None]:
# helper function for comparing model metrics
def compare_models(models, names, X_train_info, X_test_info, y_train_info, y_test_info):
    # print the classification report of every fitted model on the test set
    # (classification_report is already imported above)
    for (model, name) in zip(models, names):
        print(name)
        # then predict on the test set
        y_pred = model.predict(X_test_info)
        res = classification_report(y_test_info, y_pred)
        print("Classification Report\n", res)
        print("-----------------------------------------------------------------------------------------------------------------")
    
   
    

In [None]:
# list of model objects
models = [linearsvc_model_info, lr_model_info, mnb_model_info, bnb_model_info, gb_model_info, xgb_model_info, dtree_model_info, knn_model_info]
# list of model names
names = ['linearSVC', 'Logistic', 'MultinomialNB', 'BernoulliNB', 'gradientBoost', 'XGB', 'decisionTree', 'KNN']
# print the comparison of models
compare_models(models, names, X_train_info, X_test_info, y_train_info, y_test_info)
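The classification reports list precision, recall and F1 for each class. The sketch below computes the three numbers by hand for one class at a time. The helper `per_class_scores` and the label lists are made up for illustration and are not the scikit-learn implementation.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels, assuming 0 = Negative, 1 = Neutral, 2 = Positive
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]
for label in (0, 1, 2):
    print(label, per_class_scores(y_true, y_pred, label))
```

Precision answers "of the rows predicted as this class, how many were right", recall answers "of the rows that truly belong to this class, how many were found", and F1 is their harmonic mean.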


# Working with test data

In [None]:
# Perform the prediction on the test dataset
y_pred = xgb_model_info.predict(X_test_info)
y_pred

In [None]:
y_pred = le.inverse_transform(y_pred)
# creating a dataframe of predicted results 
predictions = pd.DataFrame(y_pred)

In [None]:
predictions

# Now working on headlines

In [None]:
# drop the Description column from the copy of the dataset made earlier
new_data_cp.drop(['Description'], axis = 1, inplace = True)

In [None]:
# rename the date column in the Guardian headlines dataset to time
gaurdian.rename(columns = {'date':'time'}, inplace = True)

In [None]:
new_data_cp

In [None]:
gaurdian

In [None]:
# concatenate the Guardian headlines dataset and the copy of the dataset to get all headlines together
headlines = pd.concat([new_data_cp, gaurdian], axis = 0)

In [None]:
headlines

In [None]:
# check the shape of all headlines dataset
headlines.shape

In [None]:
# apply preprocessing to the Headlines column of the combined dataset
headlines['Headlines'] = headlines['Headlines'].apply(preprocessing_text)

In [None]:
# compute the compound polarity score of each headline and store it in a new column
headlines_score = []
for i in headlines['Headlines'].values:
    headlines_score.append(analyzer.polarity_scores(i)['compound'])
headlines['headlines_score'] = headlines_score
headlines


In [None]:
# apply the function that decides the sentiment to the polarity score column
headlines['headlines_score'] = headlines['headlines_score'].apply(get_analysis)

In [None]:
#perform countplot on headline score column
sns.countplot(x = 'headlines_score', data = headlines)


The headlines contain approximately 14,000 positive, 16,000 negative and 24,000 neutral statements.

In [None]:
# plot a pie chart of the headline score column
fig = px.pie(headlines, names = 'headlines_score', title = 'Pie chart of different Sentiments')
fig.show()


In the dataset, the headlines are 24.8% positive, 30.3% negative and 44.9% neutral.

# Modeling on headlines

In [None]:
headlines['headlines_score'] = le.fit_transform(headlines['headlines_score'])

In [None]:
headlines

In [None]:
# split the dataset  into test and train 
# 90% train , 10% test and random state 212
X_train_headlines, X_test_headlines, y_train_headlines, y_test_headlines = train_test_split(headlines['Headlines'],headlines['headlines_score'], test_size = 0.10, random_state = 212)


LINEAR SUPPORT VECTOR MACHINE

In [None]:
%%time
# pipeline creation
# 1. TfidfVectorizer
# 2. LinearSVC model

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('linearSVC', LinearSVC())])

# Fit the pipeline to the train data
linearsvc_model_headlines = pipe.fit(X_train_headlines, y_train_headlines)
# predict on test dataset
pred = linearsvc_model_headlines.predict(X_test_headlines)
print("MODEL - Linear SVC")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_headlines, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_headlines, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_headlines, pred)))


LOGISTIC REGRESSION

In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Logistic Regression
pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('logistic', LogisticRegression())])

# Fit the pipeline to the train data
lr_model_headlines = pipe.fit(X_train_headlines, y_train_headlines)
# predict on test dataset
pred = lr_model_headlines.predict(X_test_headlines)
print("MODEL - Logistic Regression")
# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_headlines, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_headlines, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_headlines, pred)))


MULTINOMIAL NAIVE BAYES


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. MultinomialNB
pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('multinomialNB', MultinomialNB())])

# Fit the pipeline to the train data
mnb_model_headlines = pipe.fit(X_train_headlines, y_train_headlines)
# predict on test dataset
pred = mnb_model_headlines.predict(X_test_headlines)
print("MODEL - Multinomial Naive Bayes")

# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_headlines, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_headlines, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_headlines, pred)))


BERNOULLI NAIVE BAYES


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. BernoulliNB

pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('BernoulliNaiveBayes', BernoulliNB())])

# Fit the pipeline to the train data
bnb_model_headlines = pipe.fit(X_train_headlines, y_train_headlines)
# predict on test dataset
pred = bnb_model_headlines.predict(X_test_headlines)
print("MODEL - BernoulliNB")

# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_headlines, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_headlines, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_headlines, pred)))



GRADIENT BOOSTING CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. GradientBoostingClassifier

pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('GradientBoost', GradientBoostingClassifier())])

# Fit the pipeline to the train data
gb_model_headlines = pipe.fit(X_train_headlines, y_train_headlines)
# predict on test dataset
pred = gb_model_headlines.predict(X_test_headlines)
print("MODEL - GradientBoostingClassifier")

# print accuracy
print("Accuracy: {}%".format(round(accuracy_score(y_test_headlines, pred)*100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_headlines, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_headlines, pred)))


XGBOOST CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. XGBClassifier

pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('XGB', XGBClassifier())])

# Fit the pipeline to the data
xgb_model_headlines = pipe.fit(X_train_headlines, y_train_headlines)
# predict on test dataset
pred = xgb_model_headlines.predict(X_test_headlines)
print("MODEL - XGBoost Classifier")

# print accuracy as a percentage, rounded to 2 decimal places
print("Accuracy: {}%".format(round(accuracy_score(y_test_headlines, pred) * 100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_headlines, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_headlines, pred)))


DECISION TREE CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. DecisionTreeClassifier

pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('DecisionTree', DecisionTreeClassifier())])

# Fit the pipeline to the data
dtree_model_headlines = pipe.fit(X_train_headlines, y_train_headlines)
# predict on test dataset
pred = dtree_model_headlines.predict(X_test_headlines)
print("MODEL - Decision Tree")

# print accuracy as a percentage, rounded to 2 decimal places
print("Accuracy: {}%".format(round(accuracy_score(y_test_headlines, pred) * 100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_headlines, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_headlines, pred)))


K-NEAREST NEIGHBOUR CLASSIFIER MODEL


In [None]:
%%time
# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. KNeighborsClassifier

pipe = Pipeline([('countvect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('KNN', KNeighborsClassifier())])

# Fit the pipeline to the data
knn_model_headlines = pipe.fit(X_train_headlines, y_train_headlines)
# predict on test dataset
pred = knn_model_headlines.predict(X_test_headlines)
print("MODEL - KNN")

# print accuracy as a percentage, rounded to 2 decimal places
print("Accuracy: {}%".format(round(accuracy_score(y_test_headlines, pred) * 100, 2)))
# print confusion matrix
print("Confusion Matrix:\n{}".format(confusion_matrix(y_test_headlines, pred)))
# print classification report
print("Classification Report:\n{}".format(classification_report(y_test_headlines, pred)))
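
The model cells above repeat the same three steps for every classifier: build the pipeline, fit it, and score it. As a rough sketch, that repetition could be folded into one helper. The name `fit_and_score` and the toy headlines below are assumptions made for illustration; they are not part of the notebook or the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(classifier, X_train, y_train, X_test, y_test):
    """Fit CountVectorizer -> TfidfTransformer -> classifier.

    Returns the fitted pipeline and its test accuracy in percent.
    (Hypothetical helper, not defined anywhere in the notebook.)
    """
    pipe = Pipeline([
        ("countvect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", classifier),
    ])
    pipe.fit(X_train, y_train)
    accuracy = round(accuracy_score(y_test, pipe.predict(X_test)) * 100, 2)
    return pipe, accuracy


# Toy data, invented for illustration only (1 = positive, 0 = negative)
toy_train = [
    "profits rise sharply",
    "shares surge on strong earnings",
    "company reports heavy losses",
    "stocks crash amid fraud probe",
]
toy_y = [1, 1, 0, 0]
toy_test = ["earnings rise", "losses amid fraud"]
toy_test_y = [1, 0]

classifiers = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    _, scores[name] = fit_and_score(clf, toy_train, toy_y, toy_test, toy_test_y)
    print("MODEL - {}: accuracy {}%".format(name, scores[name]))
```

With the real data, the same loop would take `X_train_headlines`, `y_train_headlines`, `X_test_headlines` and `y_test_headlines` in place of the toy lists.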



In [None]:
# helper function for comparing model metrics
def compare_models(models, names, X_train_headlines, X_test_headlines, y_train_headlines, y_test_headlines):
    # The models are already fitted, so the training arguments are not used here;
    # they are kept so that existing calls keep working.

    # print the classification report of each model in turn
    for (model, name) in zip(models, names):
        print(name)
        # predict on the test set
        y_pred = model.predict(X_test_headlines)
        res = classification_report(y_test_headlines, y_pred)
        print("Classification Report\n", res)
        print("-----------------------------------------------------------------------------------------------------------------")

In [None]:
# list of model objects
models = [linearsvc_model_headlines, lr_model_headlines, mnb_model_headlines, bnb_model_headlines, gb_model_headlines, xgb_model_headlines, dtree_model_headlines, knn_model_headlines]
# list of model names
names = ['linearSVC', 'Logistic', 'MultinomialNB', 'BernoulliNB', 'gradientBoost', 'XGB', 'decisionTree', 'KNN']
# print the comparison of models
compare_models(models, names, X_train_headlines, X_test_headlines, y_train_headlines, y_test_headlines)
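
The helper above prints one long report per model, which can be hard to scan across eight models. A possible alternative, sketched below, collects a few headline metrics into a single DataFrame. The function name `summarize_models`, the choice of accuracy and macro-averaged F1, and the toy data are all assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def summarize_models(models, names, X_test, y_test):
    """Return one row per fitted model with accuracy and macro-averaged F1.

    (Hypothetical helper, not defined anywhere in the notebook.)
    """
    rows = []
    for model, name in zip(models, names):
        y_pred = model.predict(X_test)
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_test, y_pred),
            "macro_f1": f1_score(y_test, y_pred, average="macro"),
        })
    # Sort so that the most accurate model comes first
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False).reset_index(drop=True)


def make_pipe(classifier):
    """Build the same three-step pipeline used throughout the notebook."""
    return Pipeline([
        ("countvect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", classifier),
    ])


# Toy data, invented for illustration only (1 = positive, 0 = negative)
toy_train = [
    "profits rise sharply",
    "shares surge on strong earnings",
    "company reports heavy losses",
    "stocks crash amid fraud probe",
]
toy_y = [1, 1, 0, 0]
toy_test = ["earnings rise", "losses amid fraud"]
toy_test_y = [1, 0]

toy_models = [
    make_pipe(MultinomialNB()).fit(toy_train, toy_y),
    make_pipe(LogisticRegression()).fit(toy_train, toy_y),
]
summary = summarize_models(toy_models, ["MultinomialNB", "Logistic"], toy_test, toy_test_y)
print(summary)
```

With the notebook's objects, the call would be `summarize_models(models, names, X_test_headlines, y_test_headlines)`.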


# Predicting on the test data

In [None]:
# Perform the prediction on the test dataset
y_pred = linearsvc_model_headlines.predict(X_test_headlines)

In [None]:
# convert the encoded predictions back to the original sentiment labels
y_pred = le.inverse_transform(y_pred)
# create a dataframe of the predicted results
predictions = pd.DataFrame(y_pred)

In [None]:
predictions
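
The cells above decode the predictions with `le.inverse_transform`. Assuming `le` is a fitted scikit-learn `LabelEncoder` (its definition is not shown in this part of the notebook), the round trip between text labels and integer codes can be sketched as follows. The label names used here are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the real class names come from the notebook's own data
toy_sentiments = ["positive", "negative", "neutral", "positive"]

toy_le = LabelEncoder()
# fit_transform assigns integer codes to the classes in sorted order
encoded = toy_le.fit_transform(toy_sentiments)
# inverse_transform maps the integer codes back to the original strings
decoded = toy_le.inverse_transform(encoded)

comparison = pd.DataFrame({"encoded": encoded, "decoded": decoded})
print(comparison)
```

Because the classes are sorted alphabetically, `negative` is coded as 0, `neutral` as 1 and `positive` as 2 in this toy example.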

# Prediction

You can check the model on real-time news headlines.

Here I have used two financial news headlines and predicted their sentiment.

You can try more.

In [None]:
sent1 = ['GST officers detect Rs 4,000 crore of ITC fraud in April-June']
y_predict = linearsvc_model_headlines.predict(sent1)
y_predict = le.inverse_transform(y_predict)
y_predict

In [None]:
sent2 = ["Finance Ministry releases Rs 9,871 crore to 17 states as grant"]
y_predict = linearsvc_model_headlines.predict(sent2)
y_predict = le.inverse_transform(y_predict)
y_predict
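
Raw strings can be passed straight to `predict` because the vectorizer is part of the pipeline. To try several headlines at once, the two steps above could be wrapped in a small helper, sketched here. The name `predict_sentiment` and the toy training data are assumptions for illustration. If the training headlines were cleaned before fitting, new headlines would most likely need the same cleaning first.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC


def predict_sentiment(model, label_encoder, headlines):
    """Map each raw headline to its decoded sentiment label.

    (Hypothetical helper, not defined anywhere in the notebook.)
    """
    encoded = model.predict(headlines)
    labels = label_encoder.inverse_transform(encoded)
    return {headline: str(label) for headline, label in zip(headlines, labels)}


# Toy data, invented for illustration only
toy_train = [
    "profits rise sharply",
    "shares surge on strong earnings",
    "company reports heavy losses",
    "stocks crash amid fraud probe",
]
toy_sentiments = ["positive", "positive", "negative", "negative"]

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_sentiments)

toy_model = Pipeline([
    ("countvect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("linearSVC", LinearSVC()),
]).fit(toy_train, toy_y)

new_headlines = ["profits rise on strong earnings", "fraud probe deepens losses"]
result = predict_sentiment(toy_model, toy_le, new_headlines)
for headline, sentiment in result.items():
    print("{} -> {}".format(headline, sentiment))
```

With the notebook's objects, the call would be `predict_sentiment(linearsvc_model_headlines, le, sent1 + sent2)`.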

# Conclusion

We learned about NLTK and sentiment analysis in this assignment.

We conclude that, using NLTK, it is easy to classify financial news, and that the more we improve the training data, the more accurate the predictions become.


# Congratulations on completing the assignment.


You have learned a lot while doing this assignment.