## Sentiment Analysis

### Tools used:
Numpy, Pandas, Pytorch, NLTK, Scapy

In [1]:
import numpy as np
import pandas as pd

### Loading Training Data and its Analysis

In [2]:
train_data = pd.read_csv('sentiment analysis_train.csv', encoding='Windows-1252')

In [3]:
train_data.Sentiment.value_counts()

neutral     2626
positive    1538
negative     701
Name: Sentiment, dtype: int64

In [4]:
train_data.isnull().values.any()

False

In [5]:
train_data.head()

Unnamed: 0,Sentence,Sentiment
0,The GeoSolutions technology will leverage Bene...,positive
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
2,"For the last quarter of 2010 , Componenta 's n...",positive
3,According to the Finnish-Russian Chamber of Co...,neutral
4,The Swedish buyout firm has sold its remaining...,neutral


In [6]:
train_data.Sentence[0], train_data.Sentence[1]

("The GeoSolutions technology will leverage Benefon 's GPS solutions by providing Location Based Search Technology , a Communities Platform , location relevant multimedia content and a new and powerful commercial model .",
 '$ESI on lows, down $1.50 to $2.50 BK a real possibility')

We need to remove punctuation marks like apostrophe, comma, fullstop, dollar, decimals. These are unnecessary elements of the sentence which we want to analyse for sentiments

## Idea
### Preprocessing Data
Tokenization, Removing Stopwords, Lemmatization using Scapy/NLTK
### Training 
Can we use LSTM? <br>
Or shall we be happy with Neural Networks?

In [7]:
type(train_data.Sentence[0])

str

### Tokenisation and Removing Stopwords

In [8]:
import nltk, re

nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english') #They are all in lowercase

def tokenise(text):
    text = re.sub(r'[^A-Za-z]', ' ', text)
    words = [word.lower() for word in text.split() if (word.lower() not in stop_words) and (len(word) >= 2)]
    return ' '.join(words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sabya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
tokenise(train_data.Sentence[0])

'geosolutions technology leverage benefon gps solutions providing location based search technology communities platform location relevant multimedia content new powerful commercial model'

In [10]:
print(train_data.Sentence[56])
print(tokenise(train_data.Sentence[56]))

Thus , SysOpen Digia has , in accordance with Chapter 14 Section 21 of the Finnish Companies Act 29.9.1978 - 734 , obtained title to all the shares of Sentera that are to be redeemed .
thus sysopen digia accordance chapter section finnish companies act obtained title shares sentera redeemed


In [11]:
print(train_data.Sentence[786])
print(tokenise(train_data.Sentence[786]))

We have made long-term investments in developing the system 's implementation model .
made long term investments developing system implementation model


### Lemmatization

In [12]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sabya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\sabya\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [14]:
# !pip install spacy

In [15]:
# !python -m spacy download en_core_web_sm

In [16]:
import spacy

# Initialize spacy ‘en’ model, keeping only component needed for lemmatization and creating an engine:
nlp = spacy.load('en_core_web_sm', disable=['parser','ner'])

In [17]:
def lemmatize(text):
    words = [word.lemma_ for word in nlp(text)]
    return ' '.join(words)

In [18]:
print(train_data.Sentence[89])
print('--------------')
print(tokenise(train_data.Sentence[89]))
print('--------------')
print(lemmatize(tokenise(train_data.Sentence[89])))

In September 2010 , the Finnish group agreed to buy Danish company Rose Poultry A-S for up to EUR23 .9 m in a combination of cash and stock .
--------------
september finnish group agreed buy danish company rose poultry eur combination cash stock
--------------
september finnish group agree buy danish company rise poultry eur combination cash stock


### Preprocessing

In [19]:
inputs = train_data.Sentence
inputs.shape

(4865,)

In [20]:
def preprocess(text):
    return lemmatize(tokenise(text))

In [21]:
inputs = inputs.apply(preprocess)

In [22]:
inputs.shape, inputs[0]

((4865,),
 'geosolution technology leverage benefon gps solution provide location base search technology community platform location relevant multimedia content new powerful commercial model')

In [23]:
outputs = train_data.Sentiment

In [24]:
outputs.shape, outputs.head()

((4865,),
 0    positive
 1    negative
 2    positive
 3     neutral
 4     neutral
 Name: Sentiment, dtype: object)

In [25]:
def Categorise(text):
    if text == 'positive':
        return 1
    elif text == 'negative':
        return 0
    elif text == 'neutral':
        return 0.5

In [26]:
outputs = outputs.apply(Categorise)

In [27]:
outputs.shape, outputs.head()

((4865,),
 0    1.0
 1    0.0
 2    1.0
 3    0.5
 4    0.5
 Name: Sentiment, dtype: float64)

### Feature Extraction

<ul>
    <li>Bag of Words</li>
    <li>TF-IDF</li>
    <li>Word Embeddings</li>
</ul>    

### Bag of Words (Using CountVectorizer)

In [28]:
#CountVectorizer works on the concept of Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer

CountVectorizer()

In [29]:
vectorizer.fit(inputs)

CountVectorizer()

In [30]:
inputs_sparsed = vectorizer.transform(inputs).toarray()

In [31]:
inputs_sparsed.shape

(4865, 7873)

### Training

### Idea

We first split the data in a stratified manner into training and validation data.<br>
If neural networks do not produce good enough accuracy or f1_score, we change the model to LSTM? <br>
But wait! Our inputs are of variable size. How do we deal with this? Also the inputs are words and not numbers. How to deal with this?<br>
We build a neural network model for predicting:
<ul>
    <li>a value between 0 and 1, this may tell us how much positive and negative are the reviews</li>
    <li>output layer consists of three units, 0, 1 and 2, each for positive, negative and neutral</li>
<ul> 

In [32]:
import torch
import torch.nn as nn
import torch.functional as F

In [33]:
#Splitting into training and validation datasets
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(inputs_sparsed, outputs, stratify = outputs, train_size=0.8)

In [34]:
print(y_train.value_counts())
print(y_valid.value_counts())

0.5    2101
1.0    1230
0.0     561
Name: Sentiment, dtype: int64
0.5    525
1.0    308
0.0    140
Name: Sentiment, dtype: int64


In [35]:
X_train = torch.from_numpy(X_train)
X_train = X_train.to(dtype=torch.float32)

y_train = torch.from_numpy(y_train.values)
y_train = y_train.reshape(y_train.shape[0],1)
y_train = y_train.to(dtype=torch.float32)

X_valid = torch.from_numpy(X_valid)
X_valid = X_valid.to(dtype=torch.float32)

y_valid = torch.from_numpy(y_valid.values)
y_valid = y_valid.reshape(y_valid.shape[0],1)
y_valid = y_valid.to(dtype=torch.float32)
X_train.dtype, y_train.dtype, X_valid.dtype, y_valid.dtype

(torch.float32, torch.float32, torch.float32, torch.float32)

In [36]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

(torch.Size([3892, 7873]),
 torch.Size([3892, 1]),
 torch.Size([973, 7873]),
 torch.Size([973, 1]))

### Neural Networks as a Regressor

In [37]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        self.fc1 = nn.Linear(7873,1000)
        self.bn1 = nn.BatchNorm1d(1000)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(1000,500)
        self.bn2 = nn.BatchNorm1d(500)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(500,250)
        self.bn3 = nn.BatchNorm1d(250)
        self.relu3 = nn.ReLU()
        self.fc4 = nn.Linear(250,100)
        self.bn4 = nn.BatchNorm1d(100)
        self.relu4 = nn.ReLU()
        self.fc5 = nn.Linear(100,1)  
        self.sigmoid = nn.Sigmoid()
        
    def forward(self,x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        x = self.bn3(x)
        x = self.relu3(x)
        x = self.fc4(x)
        x = self.bn4(x)
        x = self.relu4(x)
        x = self.fc5(x)
        x = self.sigmoid(x)
        return x        

In [38]:
import math

def train(model, optimizer, criterion, X_train, y_train, batch_size, display_step=None, printing = False):
    
    i = 0
    running_loss = 0
    num_batches = math.ceil(len(X_train)/batch_size)
    input_batches = [X_train[i*batch_size: (i+1)*batch_size] for i in range(num_batches)]
    label_batches = [y_train[i*batch_size: (i+1)*batch_size] for i in range(num_batches)]
    for i in range(num_batches):
        optimizer.zero_grad()
        outputs = model(input_batches[i])
        loss = criterion(outputs, label_batches[i])
        loss.backward()
        optimizer.step()
        i+=len(input_batches[i])
        running_loss += loss.item()
        if(display_step):
            if(i%display_step == 0):
                print(f'After {i} mini-batches:')
                print(f'Training Loss: {running_loss/display_step}')
                running_loss = 0
                print('-------------------------')

In [39]:
import torch.optim as optim

model = Net()
optimizer = optim.SGD(model.parameters(), lr = 0.001, momentum = 0.8)
criterion = nn.MSELoss()

### Neural Networks as a Regression problem using Bag of Words Feature Extraction method

In [40]:
train(model, optimizer, criterion, X_train, y_train, 10, 20)

After 20 mini-batches:
Training Loss: 0.060621725581586364
-------------------------
After 40 mini-batches:
Training Loss: 0.12229592334479093
-------------------------
After 60 mini-batches:
Training Loss: 0.1189728682860732
-------------------------
After 80 mini-batches:
Training Loss: 0.11414233837276697
-------------------------
After 100 mini-batches:
Training Loss: 0.11845557112246752
-------------------------
After 120 mini-batches:
Training Loss: 0.09787253756076097
-------------------------
After 140 mini-batches:
Training Loss: 0.10866744369268418
-------------------------
After 160 mini-batches:
Training Loss: 0.11141766365617514
-------------------------
After 180 mini-batches:
Training Loss: 0.09674658626317978
-------------------------
After 200 mini-batches:
Training Loss: 0.09608124028891325
-------------------------
After 220 mini-batches:
Training Loss: 0.1088142329826951
-------------------------
After 240 mini-batches:
Training Loss: 0.10208424441516399
-----------

In [41]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score

def val_predict(model, X_valid, y_valid):
    val_preds = model(X_valid)
    
    # Modify values < 0.4 to 0
    val_preds[val_preds < 0.4] = 0

    # Modify values between 0.4 and 0.6 to 0.5
    val_preds[(val_preds >= 0.4) & (val_preds < 0.6)] = 0.5

    # Modify values > 0.6 to 1
    val_preds[val_preds >= 0.6] = 1
    
    y_valid_np = y_valid.detach().numpy()
    val_preds_np = val_preds.detach().numpy()

    # Convert the class labels to integer values
    le = LabelEncoder()
    y_valid_int = le.fit_transform(y_valid)
    val_preds_int = le.transform(val_preds_np)
    
    print('Classes:',le.classes_)
    print('F1 score:',f1_score(y_valid_int, val_preds_int, average=None))
    print('F1 score(micro)):',f1_score(y_valid_int, val_preds_int, average='micro'))
    print('F1 score(macro):',f1_score(y_valid_int, val_preds_int, average='macro'))
    print('F1 score(Weighted):',f1_score(y_valid_int, val_preds_int, average='weighted'))

In [42]:
#Saving this model
torch.save(model, 'bow_fnn_1.pt')

Observed performance is not good even on increasing the complexity of neural network. Moreover, training loss shows weird trends, it doesn't reduce even after training on multiple epochs.<br>

We shall henceforth try something else
<ul>
    <li> We were trying to represent the sentiment of the statement as a number between 0 to 1. 0 being very negative and 1 being very positive. We classify numbers below 0.4 to be negative, above 0.6 to be positive and those in between to be neutral.</li>
    <li> But clearly our above method is not yielding very good results. We now solve the problem as a ternary classifier.</li>
    <li> We can also try using other methods of feature extraction like TF-IDF, Hashing or Word Embeddings</li>
    <li> Even if all of these fails, we would have nothing other than LSTM to try </li>
</ul>
    

### Neural Networks as a Classifier

In [43]:
class Net2(nn.Module):
    def __init__(self):
        super(Net2, self).__init__()
        
        self.fc1 = nn.Linear(7873,1000)
        self.bn1 = nn.BatchNorm1d(1000)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(1000,500)
        self.bn2 = nn.BatchNorm1d(500)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(500,250)
        self.bn3 = nn.BatchNorm1d(250)
        self.relu3 = nn.ReLU()
        self.fc4 = nn.Linear(250,100)
        self.bn4 = nn.BatchNorm1d(100)
        self.relu4 = nn.ReLU()
        self.fc5 = nn.Linear(100,3)
        
    def forward(self,x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        x = self.bn3(x)
        x = self.relu3(x)
        x = self.fc4(x)
        x = self.bn4(x)
        x = self.relu4(x)
        x = self.fc5(x)
        return x        

In [44]:
model = Net2()
optimizer = optim.SGD(model.parameters(), lr = 0.001, momentum = 0.8)
criterion = nn.CrossEntropyLoss()

In [45]:
y_train

tensor([[1.0000],
        [0.5000],
        [1.0000],
        ...,
        [0.5000],
        [1.0000],
        [0.5000]])

In [47]:
y_train.unique(return_counts = True)
le = LabelEncoder()
y_train_transformed = torch.from_numpy(le.fit_transform(y_train))

  y = column_or_1d(y, warn=True)


### Neural Networks as a Classifier using Bag of Words Feature Extraction method

In [48]:
train(model, optimizer, criterion, X_train, y_train_transformed, 10, 20)

After 20 mini-batches:
Training Loss: 0.6086402773857117
-------------------------
After 40 mini-batches:
Training Loss: 1.017254838347435
-------------------------
After 60 mini-batches:
Training Loss: 0.9572258085012436
-------------------------
After 80 mini-batches:
Training Loss: 0.949922752380371
-------------------------
After 100 mini-batches:
Training Loss: 0.94137644469738
-------------------------
After 120 mini-batches:
Training Loss: 0.8775322496891022
-------------------------
After 140 mini-batches:
Training Loss: 0.9181284129619598
-------------------------
After 160 mini-batches:
Training Loss: 0.8976418614387512
-------------------------
After 180 mini-batches:
Training Loss: 0.8163404822349548
-------------------------
After 200 mini-batches:
Training Loss: 0.8204005360603333
-------------------------
After 220 mini-batches:
Training Loss: 0.8780878186225891
-------------------------
After 240 mini-batches:
Training Loss: 0.8438405826687813
-------------------------


In [49]:
for i in range(10):
    print(f'Epoch {i+1}')
    train(model, optimizer, criterion, X_train, y_train_transformed, 10, 380)

Epoch 1
After 380 mini-batches:
Training Loss: 0.4130635777978521
-------------------------
Epoch 2
After 380 mini-batches:
Training Loss: 0.20388914019261536
-------------------------
Epoch 3
After 380 mini-batches:
Training Loss: 0.1106453762512262
-------------------------
Epoch 4
After 380 mini-batches:
Training Loss: 0.08415311756543815
-------------------------
Epoch 5
After 380 mini-batches:
Training Loss: 0.06897903263666912
-------------------------
Epoch 6
After 380 mini-batches:
Training Loss: 0.05736556642284421
-------------------------
Epoch 7
After 380 mini-batches:
Training Loss: 0.04543262492730527
-------------------------
Epoch 8
After 380 mini-batches:
Training Loss: 0.03205742715355499
-------------------------
Epoch 9
After 380 mini-batches:
Training Loss: 0.017497727293888793
-------------------------
Epoch 10
After 380 mini-batches:
Training Loss: 0.011439415106954249
-------------------------


In [50]:
val_preds = model(X_valid)
val_preds.shape

torch.Size([973, 3])

In [51]:
val_preds[:5]

tensor([[-0.7945, -1.1400,  1.4530],
        [-1.0883, -2.1475,  2.6864],
        [-0.3323,  1.4375, -1.1679],
        [ 0.8504, -3.6752,  2.3532],
        [-2.7431,  5.8775, -2.7935]], grad_fn=<SliceBackward0>)

In [52]:
def analyse_output(preds):
    return preds.argmax(axis=1)

In [53]:
val_preds_labels = analyse_output(val_preds)
val_preds_labels.shape

torch.Size([973])

In [54]:
y_valid_transformed = torch.from_numpy(le.transform(y_valid))
y_valid_transformed.shape

  y = column_or_1d(y, warn=True)


torch.Size([973])

In [55]:
def print_f1_values(le, y_valid_int, val_preds_int):
    print('Classes:',le.classes_)
    print('F1 score:',f1_score(y_valid_int, val_preds_int, average=None))
    print('F1 score(micro)):',f1_score(y_valid_int, val_preds_int, average='micro'))
    print('F1 score(macro):',f1_score(y_valid_int, val_preds_int, average='macro'))
    print('F1 score(Weighted):',f1_score(y_valid_int, val_preds_int, average='weighted'))

In [56]:
torch.save(model, 'bow_fnn_3.pt')

In [57]:
# Even poorer of very less improvement
print_f1_values(le, y_valid_transformed, val_preds_labels)

Classes: [0.  0.5 1. ]
F1 score: [0.3099631  0.73364055 0.70847458]
F1 score(micro)): 0.6670092497430626
F1 score(macro): 0.5840260762991916
F1 score(Weighted): 0.6647135598791837


Now that neither works good, we seek a change in vectorizer method of generating feature matrix. We now use the Term frequency Inverse document frequency method of feature extraction

### Using the TF-IDF Vectorizer

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(inputs)
inputs_sparsed = vectorizer.transform(inputs).toarray()

In [60]:
inputs_sparsed.shape

(4865, 7873)

In [61]:
X_train, X_valid, y_train, y_valid = train_test_split(inputs_sparsed, outputs, stratify=outputs, train_size=0.7)

In [62]:
X_train = torch.from_numpy(X_train)
X_train = X_train.to(dtype=torch.float32)

y_train = torch.from_numpy(y_train.values)
y_train = y_train.reshape(y_train.shape[0],1)
y_train = y_train.to(dtype=torch.float32)

X_valid = torch.from_numpy(X_valid)
X_valid = X_valid.to(dtype=torch.float32)

y_valid = torch.from_numpy(y_valid.values)
y_valid = y_valid.reshape(y_valid.shape[0],1)
y_valid = y_valid.to(dtype=torch.float32)
X_train.dtype, y_train.dtype, X_valid.dtype, y_valid.dtype

(torch.float32, torch.float32, torch.float32, torch.float32)

In [63]:
X_train.shape, X_valid.shape

(torch.Size([3405, 7873]), torch.Size([1460, 7873]))

### Neural Networks as a Regression problem using TFIDF Feature Extraction method

In [64]:
model2 = Net()
optimizer = optim.SGD(model2.parameters(), lr = 0.001, momentum = 0.8)
criterion = nn.MSELoss()

for i in range(10):
    print(f'Epoch {i+1}')
    train(model2, optimizer, criterion, X_train, y_train, 10, 170)
    print('----------------------------------')

Epoch 1
After 170 mini-batches:
Training Loss: 0.11021400428212741
-------------------------
After 340 mini-batches:
Training Loss: 0.09975243503885234
-------------------------
----------------------------------
Epoch 2
After 170 mini-batches:
Training Loss: 0.033547146607409505
-------------------------
After 340 mini-batches:
Training Loss: 0.029262912602109066
-------------------------
----------------------------------
Epoch 3
After 170 mini-batches:
Training Loss: 0.016892900469932047
-------------------------
After 340 mini-batches:
Training Loss: 0.01584038977230461
-------------------------
----------------------------------
Epoch 4
After 170 mini-batches:
Training Loss: 0.011218894563396187
-------------------------
After 340 mini-batches:
Training Loss: 0.010898101038936361
-------------------------
----------------------------------
Epoch 5
After 170 mini-batches:
Training Loss: 0.008340882783865227
-------------------------
After 340 mini-batches:
Training Loss: 0.00833732

In [65]:
val_predict(model2, X_valid, y_valid)

Classes: [0.  0.5 1. ]
F1 score: [0.25365854 0.44552846 0.5578125 ]
F1 score(micro)): 0.4678082191780822
F1 score(macro): 0.4189998306233062
F1 score(Weighted): 0.45346170578572226


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


I was happy to see training loss following a very good trend. But the f1_score comes out to be worser than earlier ones

In [66]:
val_predict(model2, X_train, y_train)

Classes: [0.  0.5 1. ]
F1 score: [0.68735084 0.64295875 0.77489967]
F1 score(micro)): 0.7042584434654919
F1 score(macro): 0.7017364183989279
F1 score(Weighted): 0.6910541809319669


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


In [67]:
torch.save(model2, 'tfidf_fnn_1.pt')

### Neural Networks as a Classifier using TFIDF Feature Extraction method

In [68]:
model3 = Net2()
optimizer = optim.SGD(model3.parameters(), lr = 0.001, momentum = 0.8)
criterion = nn.CrossEntropyLoss()

y_train_transformed = torch.from_numpy(le.transform(y_train))

for i in range(10):
    print(f'Epoch {i+1}')
    train(model3, optimizer, criterion, X_train, y_train_transformed, 10, 170)
    print('----------------------------------')

  y = column_or_1d(y, warn=True)


Epoch 1
After 170 mini-batches:
Training Loss: 0.9183718632249271
-------------------------
After 340 mini-batches:
Training Loss: 0.8397009803968317
-------------------------
----------------------------------
Epoch 2
After 170 mini-batches:
Training Loss: 0.39327743421582617
-------------------------
After 340 mini-batches:
Training Loss: 0.2698017226860804
-------------------------
----------------------------------
Epoch 3
After 170 mini-batches:
Training Loss: 0.12348547141779871
-------------------------
After 340 mini-batches:
Training Loss: 0.09238244681893026
-------------------------
----------------------------------
Epoch 4
After 170 mini-batches:
Training Loss: 0.046765909212477065
-------------------------
After 340 mini-batches:
Training Loss: 0.04883600134840783
-------------------------
----------------------------------
Epoch 5
After 170 mini-batches:
Training Loss: 0.026856013453182052
-------------------------
After 340 mini-batches:
Training Loss: 0.027080983268644

In [69]:
val_preds_labels3 = analyse_output(model(X_valid))
y_valid_transformed3 = torch.from_numpy(le.transform(y_valid))
print_f1_values(le, y_valid_transformed3, val_preds_labels3)

Classes: [0.  0.5 1. ]
F1 score: [0.66981132 0.89170361 0.93347874]
F1 score(micro)): 0.8726027397260274
F1 score(macro): 0.8316645552132801
F1 score(Weighted): 0.8730068476138289


### Observations

To perform good on training data, you need to prefer TFIDF over Bag of Words<br>
And for better predictions of output you need to use Neural networks as a classifier rather than a regressor

In [75]:
torch.save(model3, 'tfidf_fnn_3.pt')

In [71]:
test_data = pd.read_csv('sentiment analysis_test.csv', encoding='Windows-1252')

In [72]:
# Preprocessing the test data
test_inputs = test_data.Sentence
test_inputs = test_inputs.apply(preprocess)

In [73]:
# Feature extraction using the same vectorizer which we used to fit the training data for tfidf method
test_inputs_sparsed = vectorizer.transform(test_inputs).toarray()

In [74]:
test_inputs_sparsed.shape

(977, 7873)

In [76]:
# Converting test inputs to a form that could be used for predictions
X_test = torch.from_numpy(test_inputs_sparsed)
X_test = X_test.to(dtype=torch.float32)

In [77]:
X_test.shape

torch.Size([977, 7873])

In [78]:
# Predicting outputs for the test data
test_preds = analyse_output(model(X_test))

In [79]:
test_preds

tensor([0, 2, 2, 1, 1, 2, 2, 1, 1, 0, 1, 1, 1, 2, 0, 0, 1, 2, 1, 1, 1, 1, 1, 2,
        2, 0, 1, 1, 0, 1, 1, 0, 2, 1, 2, 1, 1, 2, 0, 2, 1, 1, 1, 1, 2, 1, 1, 2,
        1, 2, 1, 1, 2, 0, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
        2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 0, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 1, 2, 1, 2, 2, 1, 2, 0, 2, 2, 1, 1, 0,
        1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1,
        1, 0, 2, 2, 1, 2, 2, 0, 1, 1, 0, 0, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 0,
        1, 2, 2, 1, 2, 2, 0, 2, 2, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1, 2, 0, 1,
        2, 0, 2, 0, 0, 2, 1, 0, 1, 2, 0, 0, 0, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2,
        1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 0, 2, 1, 2, 0, 1, 1, 2, 0, 1, 1, 0,
        2, 2, 0, 0, 2, 1, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 0, 2, 1, 1,
        1, 0, 1, 1, 1, 2, 2, 0, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1,
        0, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2,

In [84]:
new_column = pd.DataFrame(test_preds)
new_column.describe()

Unnamed: 0,0
count,977.0
mean,1.138178
std,0.654701
min,0.0
25%,1.0
50%,1.0
75%,2.0
max,2.0


In [85]:
test_data['Sentiment'] = new_column

In [86]:
test_data.head()

Unnamed: 0,Sentence,Sentiment
0,Operating loss totaled EUR 25mn compared to a ...,0
1,Renewed AB InBev Bid for SABMiller Ups Stake i...,2
2,Rautaruukki Corporation Stock exchange release...,2
3,Etteplan targets to employ at least 20 people ...,1
4,Thanks to its extensive industry and operation...,1


In [88]:
test_data = test_data.replace(to_replace = 0, value = 'negative')
test_data = test_data.replace(to_replace = 1, value = 'neutral')
test_data = test_data.replace(to_replace = 2, value = 'positive')

In [90]:
test_data.Sentiment.value_counts()

neutral     540
positive    286
negative    151
Name: Sentiment, dtype: int64

In [91]:
test_data.to_csv('Sentiment Predictions.csv')