
# 1) Load Requirements

Because the data consists of text written in a human language (English), a library with Natural Language Processing (NLP) capability is required. Here I use TextBlob, a library for processing textual data. Its user-friendly interface provides access to basic NLP tasks such as sentiment analysis, word extraction, and parsing, which makes it suitable for simple NLP tasks and for beginners.

In [1]:
# Install Modules

!pip install -U textblob
!python -m textblob.download_corpora
!git clone https://github.com/HafidGalih/Sentiment_Analysis.git

Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
     |████████████████████████████████| 636 kB 8.3 MB/s
Installing collected packages: textblob
  Attempting uninstall: textblob
    Found existing installation: textblob 0.15.3
    Uninstalling textblob-0.15.3:
      Successfully uninstalled textblob-0.15.3
Successfully installed textblob-0.17.1
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading packag

In [2]:
# Import Libraries

from textblob import TextBlob
import pandas as pd

# 2) Data Preparation

## Data Exploration

In [3]:
# Load dataset

df_labeled = pd.read_csv('https://raw.githubusercontent.com/HafidGalih/Sentiment_Analysis/main/financial_news_data.csv',
                         encoding="ISO-8859-1")
df_unlabeled = pd.read_csv('https://raw.githubusercontent.com/HafidGalih/Sentiment_Analysis/main/data_for_test_the_model.csv',
                           encoding="ISO-8859-1")


In [4]:
df_labeled.head(5)

Unnamed: 0,sentiment,news_headline
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


In [5]:
df_labeled.groupby('sentiment').count()

sentiment,news_headline
negative,603
neutral,2878
positive,1362


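The dataset is imbalanced and has three classes: neutral headlines (2,878) far outnumber positive (1,362) and negative (603) ones. Balancing the data should therefore be considered, and the chosen algorithm must support multiclass classification.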
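
To make the imbalance concrete, the short sketch below computes each class's share and the accuracy of a trivial classifier that always predicts the most frequent class. The counts are copied from the `groupby` output above; the variable names are illustrative only. Any accuracy reported later is easier to interpret next to this baseline, keeping in mind that the shares in a random validation split can differ slightly from those of the full dataset.

```python
# Class counts copied from the groupby output above.
class_counts = {"negative": 603, "neutral": 2878, "positive": 1362}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# A classifier that always predicts the most frequent class
# reaches an accuracy equal to that class's share.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"{label:>8}: {share:.1%}")
print(f"majority baseline ({majority_label}): {majority_baseline:.1%}")
```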

In [6]:
df_unlabeled

Unnamed: 0,number,news_headline
0,1,The 2015 target for net sales has been set at ...
1,2,It holds 38 percent of Outokumpu 's shares and...
2,3,"As a result of these transactions , the aggreg..."


## Data Pre-Processing

Little pre-processing is done at this stage, because TextBlob classifiers accept their input as a list of (text, label) tuples, which already matches the structure of the provided dataset.

In [7]:
# Split the labeled data into training (80%) and validation (20%) sets
from sklearn.model_selection import train_test_split

df_train, df_validation = train_test_split(df_labeled, test_size=0.2,
                                           random_state=1)

# Convert to lists of (text, label) tuples, the format expected by TextBlob classifiers
dataset_train = list(zip(df_train['news_headline'], df_train['sentiment']))
dataset_validation = list(zip(df_validation['news_headline'], df_validation['sentiment']))
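
Because the classes are imbalanced, a purely random split can leave the validation set with a somewhat different class mix than the training set. One common remedy is a stratified split. In scikit-learn this is usually done by passing `stratify=df_labeled['sentiment']` to `train_test_split`. The sketch below shows the idea in plain Python on made-up data; `stratified_split` and the toy headlines are hypothetical and not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def stratified_split(pairs, test_size=0.2, seed=1):
    """Split (text, label) pairs so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for text, label in pairs:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Take the same fraction of every class for validation.
        n_validation = round(len(items) * test_size)
        validation.extend(items[:n_validation])
        train.extend(items[n_validation:])

    # Shuffle again so the classes are not grouped together.
    rng.shuffle(train)
    rng.shuffle(validation)
    return train, validation

# Toy data (hypothetical), imbalanced in a similar way to the real dataset.
toy = (
    [(f"headline {i}", "neutral") for i in range(60)]
    + [(f"headline {i}", "positive") for i in range(60, 90)]
    + [(f"headline {i}", "negative") for i in range(90, 100)]
)

toy_train, toy_validation = stratified_split(toy)
print(len(toy_train), len(toy_validation))
print(Counter(label for _, label in toy_validation))
```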

In [8]:
dataset_train

[('The handset also features a Media Bar for quick access to favorite media and applications , including music , photos , YouTube or Ovi Share .',
  'neutral'),
 ('wins 98 % acceptance 23 December 2009 - Finnish industrial machinery company Metso Oyj ( HEL : MEO1V ) said today it will complete its takeover offer for textile company Tamfelt Oyj Abp ( HEL : TAFKS ) , after acquiring 98 % of its shares and votes .',
  'neutral'),
 ("With CapMan as a partner , we will be able to further develop our business and continue to focus on providing quality restaurant services for our customers , '' says Christopher Wynne , CEO of Papa John 's Russia .",
  'positive'),
 ('Fortum had intended to spend as much as ( EURO ) 2.7 bn to become the sole owner of TGK-10 .',
  'neutral'),
 ('561,470 new shares under 2003 option rights plan Packaging company Huhtamaki Oyj reported on Monday that a total of 561,470 new shares of the company have been issued based on share subscriptions under its 2003 option r

In [9]:
dataset_validation

[("Last year 's third quarter result had been burdened by costs stemming from restructuring in the US .",
  'negative'),
 ('The company had hoped the new plant would be on stream by the end of 2008 .',
  'neutral'),
 ('The alliance aims to tap pocketable mobile computers , netbooks , tablets , mediaphones , connected TVs and in-vehicle infotainment systems .',
  'neutral'),
 ('Pretax profit decreased to EUR 33.8 mn from EUR 40.8 mn in the fourth quarter of 2005 .',
  'negative'),
 ('Revenues at the same time grew 14 percent to 43 million euros .',
  'positive'),
 ('Turun kaupunkin , Finland based company has awarded contract to Lemminkainen Talotekniikka Oy for electrical installation work .',
  'positive'),
 ('Finnish silicon wafer technology company Okmetic Oyj ( OMX Helsinki : OKM1V ) reported on Thursday ( 7 August ) an operating profit of EUR5 .3 m for the period January-June 2008 , up from EUR3 .3 m in the corresponding period in 2007 .',
  'positive'),
 ('Production capacity wil

# 3) Classifier Modeling

## Training

Naive Bayes classifiers are among the most suitable algorithms for text classification: the existing literature shows them to be appropriate for classifying articles by content and for sentiment and emotion analysis. [https://www.ijikm.org/Volume13/IJIKMv13p117-135Thangaraj3803.pdf]
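
To illustrate the idea behind Naive Bayes, here is a toy sketch in plain Python. It is a simplified multinomial variant with whitespace tokenization and add-one smoothing, written for this explanation only. It is not the code TextBlob runs: TextBlob's `NaiveBayesClassifier` builds on NLTK's implementation and uses its own feature extractor. The function names and training sentences are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(pairs):
    """Count label frequencies and per-label word frequencies."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in pairs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log posterior under add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior of the class
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen during training
            count = word_counts[label][word]
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training sentences, for illustration only.
toy_train = [
    ("profit rose sharply", "positive"),
    ("sales grew and profit rose", "positive"),
    ("profit fell sharply", "negative"),
    ("loss widened as sales fell", "negative"),
    ("the company is based in helsinki", "neutral"),
    ("the plant is located in finland", "neutral"),
]
model = train_naive_bayes(toy_train)
print(classify("sales rose", model))  # positive
print(classify("sales fell", model))  # negative
```

The classifier multiplies the class prior by the per-word likelihoods (summed here as logarithms) and picks the class with the largest result. The "naive" part is the assumption that words are independent of each other given the class.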

In [10]:
from textblob.classifiers import NaiveBayesClassifier

NB_model = NaiveBayesClassifier(dataset_train)

In [11]:
# Training Results
print(NB_model.accuracy(dataset_train))

0.9003613835828601


## Validation

The labeled data that was held out from training serves as the validation dataset, so the model's predictions can be compared with the actual labels.

In [12]:
# Validation Results

print(NB_model.accuracy(dataset_validation))

0.7048503611971104
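
With imbalanced classes, a single accuracy figure can hide weak performance on the smaller classes, so per-class precision and recall are a useful complement. The helper below is a minimal sketch in plain Python; `per_class_report` is a hypothetical name and the label lists are invented for illustration, not results from the model above. With the notebook's objects, the inputs could presumably be built as `[label for _, label in dataset_validation]` and `[NB_model.classify(text) for text, _ in dataset_validation]`.

```python
from collections import Counter

def per_class_report(true_labels, predicted_labels):
    """Compute precision and recall per label from two parallel label lists."""
    pair_counts = Counter(zip(true_labels, predicted_labels))
    labels = sorted(set(true_labels) | set(predicted_labels))
    report = {}
    for label in labels:
        true_positive = pair_counts[(label, label)]
        predicted_total = sum(n for (t, p), n in pair_counts.items() if p == label)
        actual_total = sum(n for (t, p), n in pair_counts.items() if t == label)
        # Guard against division by zero when a label is never predicted or never occurs.
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        report[label] = {"precision": precision, "recall": recall}
    return report

# Hypothetical labels for illustration only; not results from the model above.
y_true = ["neutral", "neutral", "neutral", "positive", "positive", "negative"]
y_pred = ["neutral", "neutral", "positive", "positive", "neutral", "neutral"]

for label, scores in per_class_report(y_true, y_pred).items():
    print(label, scores)
```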


# 4) Testing Model

In this part, the trained model is applied to data that is not labeled: for each headline, the model predicts which class the text belongs to.

In [13]:
# Classify each unlabeled headline individually
for headline in df_unlabeled['news_headline']:
  print(NB_model.classify(headline))

neutral
neutral
neutral
