<a href="https://colab.research.google.com/github/HafidGalih/Sentiment_Analysis/blob/comparison/Technical_Test_Data_Scientist.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1) Load Requirements

Because the data consists of English-language text, a library with Natural Language Processing (NLP) capability is needed. Here I use the NLTK library for text preprocessing and build my own pipelines to compare multiple algorithms.

In [1]:
# Install Libraries

!pip install nltk



In [2]:
# Import Libraries
import pandas as pd
import pickle
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

#Download additional modules
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# 2) Data Preparation

## Data Exploration

In [3]:
# Load dataset

df_labeled = pd.read_csv('https://raw.githubusercontent.com/HafidGalih/Sentiment_Analysis/main/financial_news_data.csv',
                         encoding="ISO-8859-1")
df_unlabeled = pd.read_csv('https://raw.githubusercontent.com/HafidGalih/Sentiment_Analysis/main/data_for_test_the_model.csv',
                           encoding="ISO-8859-1")


In [4]:
df_labeled.head(5)

Unnamed: 0,sentiment,news_headline
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


In [5]:
df_labeled.groupby('sentiment').count()

Unnamed: 0_level_0,news_headline
sentiment,Unnamed: 1_level_1
negative,603
neutral,2878
positive,1362


The dataset is imbalanced and has 3 classes. Balancing the data should be considered, along with algorithms that support multiclass classification.
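
One common way to account for the imbalance is to weight each class inversely to its frequency. The sketch below applies the formula n_samples / (n_classes * class_count) to the counts shown above, which is the heuristic scikit-learn uses when a classifier is given class_weight="balanced". The variable names are illustrative and are not part of the original notebook.

```python
# Class counts taken from the groupby output above
counts = {"negative": 603, "neutral": 2878, "positive": 1362}

n_samples = sum(counts.values())
n_classes = len(counts)

# Rarer classes get proportionally larger weights
class_weight = {label: n_samples / (n_classes * count)
                for label, count in counts.items()}

for label, weight in class_weight.items():
    print(f"{label}: {weight:.2f}")
```

LinearSVC, LogisticRegression and RandomForestClassifier accept class_weight="balanced" (or a dict like this one). MultinomialNB has no class_weight parameter, so it would need a different approach such as resampling.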

In [6]:
df_unlabeled

Unnamed: 0,number,news_headline
0,1,The 2015 target for net sales has been set at ...
1,2,It holds 38 percent of Outokumpu 's shares and...
2,3,"As a result of these transactions , the aggreg..."


## Data Pre-Processing

Data pre-processing may improve the performance of an ML model. The techniques used here are as follows:
1. Convert words to lower case.
2. Remove stopwords from each news headline, then apply stemming to merge word variants with similar meaning.
3. Split the data into training and testing datasets.
4. Transform words into a vector space model (VSM) to extract the features.
5. Select the best features to be used.

In [7]:
# 1) Convert words to lower case

def ChangeLower(msg):
    # converting messages to lowercase
    msg = msg.lower()
    return msg

df_labeled['news_headline']=df_labeled['news_headline'].apply(ChangeLower)
df_labeled['news_headline'].head(5)

0    according to gran , the company has no plans t...
1    technopolis plans to develop in stages an area...
2    the international electronic industry company ...
3    with the new production plant the company woul...
4    according to the company 's updated strategy f...
Name: news_headline, dtype: object

In [8]:
# 2) Removing irrelevant words and Stemming
# Use a separate name so the imported nltk `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
# Keep letters only, drop stopwords, then stem each remaining token
df_labeled['stemmed'] = df_labeled['news_headline'].apply(
    lambda x: " ".join(stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", x).split()
                       if i not in stop_words))
df_labeled['stemmed'].head(5)

0    accord gran compani plan move product russia a...
1    technopoli plan develop stage area less squar ...
2    intern electron industri compani elcoteq laid ...
3    new product plant compani would increas capac ...
4    accord compani updat strategi year baswar targ...
Name: stemmed, dtype: object
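
Steps 1 and 2 can also be bundled into a single function, so that exactly the same cleaning can be applied to any new headlines before they are passed to a fitted model. This is a minimal sketch: the `preprocess` helper and the tiny `demo_stop_words` set are hypothetical and only keep the example self-contained. In the notebook, the NLTK English stopword set would be passed instead.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text, stop_words):
    """Lowercase, keep letters only, drop stopwords, then stem each token."""
    tokens = re.sub("[^a-zA-Z]", " ", text.lower()).split()
    return " ".join(stemmer.stem(tok) for tok in tokens if tok not in stop_words)

# Tiny stand-in stopword set for illustration only (assumption);
# the notebook would use set(stopwords.words('english')) instead.
demo_stop_words = {"to", "the", "has", "no"}

cleaned = preprocess("According to Gran , the company has no plans", demo_stop_words)
print(cleaned)
```

Applying such a function to df_unlabeled['news_headline'] with the full stopword set would give the unlabeled headlines the same treatment as the training text.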

In [9]:
# 3) Split data for training and validation

X = df_labeled['stemmed']
Y = df_labeled['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, 
                                                    random_state=1, stratify=Y)

When splitting the dataset, stratify keeps the ratio between class labels similar in the training and test sets, which is useful for an imbalanced dataset such as this one. random_state is set to 1 so that the split is repeatable, while the test_size value may be changed to check which gives a better result.
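
The effect of stratify can be checked by counting the labels on each side of the split. The sketch below uses hypothetical labels (not the real dataset) with the same split parameters as above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels, for illustration only
labels = ["negative"] * 100 + ["neutral"] * 500 + ["positive"] * 200
texts = [f"headline {i}" for i in range(len(labels))]

_, _, y_tr, y_te = train_test_split(texts, labels, test_size=0.2,
                                    random_state=1, stratify=labels)

# With stratify, both sides keep the 1:5:2 class ratio
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```

Without stratify, the class counts in a 20% test set would vary from one random split to another, which matters most for the smallest class.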

In [10]:
X_train.head(5)

2832                    beef import fell slightli mn kilo
1448    upm kymmen one world lead print paper produc p...
746               govern profession approach assess offer
24      net sale increas eur eur pretax profit rose eu...
3739       rubin say expect capman announc addit transact
Name: stemmed, dtype: object

In [11]:
X_test.head(5)

4792          build home improv trade sale decreas eur mn
2937    work netapp includ strateg reposit brand categ...
2674    restructur measur affect product packag print ...
1602    tekla implement renew softwar version introduc...
4820    besid depositor prefer finland senior debt dep...
Name: stemmed, dtype: object

In [12]:
# 4) Initialize Vectorization

vectorizer = TfidfVectorizer(stop_words="english")

In [13]:
# 5) Feature Selection

features = SelectKBest(chi2, k=2000)

In [14]:
# Create pipeline for easier processing

NB_pipeline = Pipeline([('vect', vectorizer),
                     ('chi', features),
                     ('nb_clf', MultinomialNB())])

RF_pipeline = Pipeline([('vect', vectorizer),
                     ('chi',  features),
                     ('rf_clf', RandomForestClassifier())])

SVM_pipeline = Pipeline([('vect', vectorizer),
                     ('chi',  features),
                     ('svm_clf', LinearSVC())])

LogR_pipeline = Pipeline([('vect', vectorizer),
                     ('chi',  features),
                     ('logr_clf', LogisticRegression())])

Each pipeline has 3 steps:
1. Vectorization, which converts the text data into TF-IDF vectors using TfidfVectorizer.
2. Feature selection using SelectKBest, which keeps only the k best features according to the chi2 test. The k value used here was chosen arbitrarily.
3. The classifier algorithm to be trained.
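
The four pipelines above share one vectorizer object and one feature selector object. Because every pipeline is fitted on the same training data, this gives the same result here, but each fit overwrites the shared state. One possible alternative, sketched below, is a small factory that builds the three steps afresh for each classifier. The `build_pipeline` helper and the step name 'clf' are hypothetical and not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def build_pipeline(classifier, k=2000):
    """Return a pipeline with its own vectorizer and selector (hypothetical helper)."""
    return Pipeline([
        ("vect", TfidfVectorizer(stop_words="english")),  # 1. text -> TF-IDF vectors
        ("chi", SelectKBest(chi2, k=k)),                  # 2. keep the k best features
        ("clf", classifier),                              # 3. classifier to train
    ])

nb_pipe = build_pipeline(MultinomialNB())
logr_pipe = build_pipeline(LogisticRegression())

# Each pipeline holds a separate vectorizer instance
print(nb_pipe.named_steps["vect"] is logr_pipe.named_steps["vect"])
```

Separate instances also make it safe to fit the pipelines on different data or to tune k per classifier.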

# 3) Classifier Modeling

## Training

Because the dataset is already labeled and the task is classification, supervised learning algorithms are appropriate.
These are a few of the algorithms for supervised text classification covered in the existing literature: https://www.ijikm.org/Volume13/IJIKMv13p117-135Thangaraj3803.pdf

1. Naive Bayes (NB_model)
2. Random Forest (RF_model)
3. Support Vector Machine (SVM_model)
4. Logistic Regression (LogR_model)

In [15]:
# training model
NB_model = NB_pipeline.fit(X_train, y_train)
RF_model = RF_pipeline.fit(X_train, y_train)
SVM_model = SVM_pipeline.fit(X_train, y_train)
LogR_model = LogR_pipeline.fit(X_train, y_train)

In [16]:
# Training Results NB
print(classification_report(y_train, NB_model.predict(X_train)))
print(confusion_matrix(y_train, NB_model.predict(X_train)))

              precision    recall  f1-score   support

    negative       0.98      0.20      0.33       482
     neutral       0.73      0.98      0.84      2302
    positive       0.75      0.48      0.59      1090

    accuracy                           0.74      3874
   macro avg       0.82      0.55      0.59      3874
weighted avg       0.77      0.74      0.71      3874

[[  95  257  130]
 [   1 2258   43]
 [   1  561  528]]


In [17]:
# Training Results RF
print(classification_report(y_train, RF_model.predict(X_train)))
print(confusion_matrix(y_train, RF_model.predict(X_train)))

              precision    recall  f1-score   support

    negative       0.99      1.00      0.99       482
     neutral       1.00      1.00      1.00      2302
    positive       1.00      1.00      1.00      1090

    accuracy                           1.00      3874
   macro avg       1.00      1.00      1.00      3874
weighted avg       1.00      1.00      1.00      3874

[[ 480    0    2]
 [   3 2299    0]
 [   2    3 1085]]


In [18]:
# Training Results SVM
print(classification_report(y_train, SVM_model.predict(X_train)))
print(confusion_matrix(y_train, SVM_model.predict(X_train)))

              precision    recall  f1-score   support

    negative       0.90      0.85      0.87       482
     neutral       0.93      0.97      0.95      2302
    positive       0.93      0.85      0.89      1090

    accuracy                           0.92      3874
   macro avg       0.92      0.89      0.90      3874
weighted avg       0.92      0.92      0.92      3874

[[ 409   43   30]
 [  18 2244   40]
 [  28  132  930]]


In [19]:
# Training Results LogR
print(classification_report(y_train, LogR_model.predict(X_train)))
print(confusion_matrix(y_train, LogR_model.predict(X_train)))

              precision    recall  f1-score   support

    negative       0.85      0.53      0.65       482
     neutral       0.80      0.97      0.88      2302
    positive       0.89      0.64      0.74      1090

    accuracy                           0.82      3874
   macro avg       0.85      0.71      0.76      3874
weighted avg       0.83      0.82      0.81      3874

[[ 256  179   47]
 [  17 2244   41]
 [  27  368  695]]


## Validation

Labeled data that was not used in training serves as the validation set, to evaluate model performance on out-of-sample data.

In [20]:
# Validation Results NB

print(classification_report(y_test, NB_model.predict(X_test)))
print(confusion_matrix(y_test, NB_model.predict(X_test)))

              precision    recall  f1-score   support

    negative       0.88      0.12      0.20       121
     neutral       0.69      0.96      0.80       576
    positive       0.60      0.33      0.42       272

    accuracy                           0.68       969
   macro avg       0.72      0.47      0.48       969
weighted avg       0.69      0.68      0.62       969

[[ 14  69  38]
 [  2 553  21]
 [  0 183  89]]


In [21]:
# Validation Results RF

print(classification_report(y_test, RF_model.predict(X_test)))
print(confusion_matrix(y_test, RF_model.predict(X_test)))

              precision    recall  f1-score   support

    negative       0.74      0.37      0.49       121
     neutral       0.74      0.93      0.82       576
    positive       0.68      0.47      0.56       272

    accuracy                           0.73       969
   macro avg       0.72      0.59      0.62       969
weighted avg       0.72      0.73      0.71       969

[[ 45  52  24]
 [  5 534  37]
 [ 11 133 128]]


In [22]:
# Validation Results SVM

print(classification_report(y_test, SVM_model.predict(X_test)))
print(confusion_matrix(y_test, SVM_model.predict(X_test)))

              precision    recall  f1-score   support

    negative       0.70      0.59      0.64       121
     neutral       0.76      0.85      0.80       576
    positive       0.64      0.51      0.57       272

    accuracy                           0.72       969
   macro avg       0.70      0.65      0.67       969
weighted avg       0.72      0.72      0.72       969

[[ 71  41   9]
 [ 15 492  69]
 [ 16 117 139]]


In [23]:
# Validation Results LogR

print(classification_report(y_test, LogR_model.predict(X_test)))
print(confusion_matrix(y_test, LogR_model.predict(X_test)))

              precision    recall  f1-score   support

    negative       0.73      0.42      0.53       121
     neutral       0.73      0.92      0.81       576
    positive       0.66      0.44      0.53       272

    accuracy                           0.72       969
   macro avg       0.71      0.59      0.63       969
weighted avg       0.71      0.72      0.70       969

[[ 51  53  17]
 [  5 528  43]
 [ 14 139 119]]


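Comparing the training and validation results for each model, RF shows the highest accuracy in this case: almost 100% on the training dataset and 73% on the validation dataset, using the model's default parameters. The gap between the two figures indicates that RF overfits the training data, and its validation lead over SVM and LogR (both 72%) is small.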
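
Because the classes are imbalanced, it can help to look at macro-averaged F1 alongside accuracy. The sketch below recomputes both from the validation confusion matrices printed above (rows are true labels, columns are predictions). The helper functions are illustrative and not part of the original notebook.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def macro_f1(cm):
    """Unweighted mean of per-class F1; rows are true labels, columns are predictions."""
    scores = []
    for i in range(len(cm)):
        true_count = sum(cm[i])                 # samples whose true class is i
        pred_count = sum(row[i] for row in cm)  # samples predicted as class i
        scores.append(2 * cm[i][i] / (true_count + pred_count))
    return sum(scores) / len(scores)

# Validation confusion matrices copied from the outputs above
validation_cm = {
    "NB":   [[14, 69, 38], [2, 553, 21], [0, 183, 89]],
    "RF":   [[45, 52, 24], [5, 534, 37], [11, 133, 128]],
    "SVM":  [[71, 41, 9], [15, 492, 69], [16, 117, 139]],
    "LogR": [[51, 53, 17], [5, 528, 43], [14, 139, 119]],
}

for name, cm in validation_cm.items():
    print(f"{name}: accuracy={accuracy(cm):.2f}, macro F1={macro_f1(cm):.2f}")
```

On these numbers RF leads on accuracy while SVM leads on macro F1, so the choice of best model depends on which metric matters more for the task.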

In [24]:
# Save the best model into a file for later use

with open('RandomForest.pickle', 'wb') as f:
    pickle.dump(RF_model, f)
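
A saved pipeline can be read back with pickle.load and used directly for prediction. The round trip below is a minimal sketch on hypothetical toy data with a temporary file, not the notebook's RandomForest.pickle.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real training set
texts = ["profit rose", "sales increased", "profit fell", "loss widened"]
labels = ["positive", "positive", "negative", "negative"]
model = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())]).fit(texts, labels)

path = os.path.join(tempfile.mkdtemp(), "model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Reload, as a later session would, and check the predictions are unchanged
with open(path, "rb") as f:
    restored = pickle.load(f)

same = list(restored.predict(texts)) == list(model.predict(texts))
print(same)
```

Pickled models should be loaded with the same scikit-learn version that created them, and only from trusted files, since unpickling can execute arbitrary code.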

# 4) Testing Model

In this part, the trained models are tested with data that is not labeled. Each model predicts which class label the text belongs to.

In [25]:
print(NB_model.predict(df_unlabeled['news_headline']))

['neutral' 'neutral' 'neutral']


In [26]:
print(RF_model.predict(df_unlabeled['news_headline']))

['neutral' 'neutral' 'neutral']


In [27]:
print(SVM_model.predict(df_unlabeled['news_headline']))

['neutral' 'neutral' 'neutral']


In [28]:
print(LogR_model.predict(df_unlabeled['news_headline']))

['neutral' 'neutral' 'neutral']


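All four models predict 'neutral' for all three unlabeled headlines. This agreement suggests that 'neutral' is a likely label for each of them, although neutral is also the majority class, and the headlines were passed to the models without the lowercasing and stemming applied to the training data.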
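
To put a number on how confident a model is, the class probabilities can be inspected instead of only the predicted label. LogisticRegression, MultinomialNB and RandomForestClassifier expose predict_proba, while LinearSVC does not. The sketch below uses a hypothetical toy pipeline and toy headlines, not the fitted models above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy headlines, already lowercased, for illustration only
texts = ["profit rose sharply", "sales increased", "profit fell", "loss widened",
         "company holds meeting", "board meeting scheduled"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = Pipeline([("vect", TfidfVectorizer()),
                  ("clf", LogisticRegression())]).fit(texts, labels)

# One row per headline, one column per class (ordered as in model.classes_)
probs = model.predict_proba(["profit increased", "meeting scheduled"])
for row in probs:
    print(dict(zip(model.classes_, row.round(2))))
```

A probability close to 1 for one class indicates a confident prediction, while values near 1/3 for every class mean the model is close to guessing.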