# Sentiment Analysis Using Twitter Data

## Libraries

In [118]:
# General imports
import pickle
import numpy as np
import time
import matplotlib.pyplot as plt

# Warnings imports
import warnings
warnings.filterwarnings("ignore")

# Deep learning imports
import tensorflow as tf
from keras import models
from keras import layers
from keras import regularizers

## Helper function(s)

In [119]:
# Printed barrier function
def barrier():
    print("\n <<<","-"*50,">>> \n")

In [120]:
barrier()


 <<< -------------------------------------------------- >>> 



## Data imports

- Here, we import data from the Part 1 and Part 2-1 notebooks.

In [121]:
with open("models/X_train.pickle", "rb") as file:
    X_train = pickle.load(file)

with open("models/X_test.pickle", "rb") as file:
    X_test = pickle.load(file)

with open("models/y_train.pickle", "rb") as file:
    y_train = pickle.load(file)

with open("models/y_test.pickle", "rb") as file:
    y_test = pickle.load(file)
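
The repeated open/load/close pattern could be folded into a small helper. The sketch below is only an illustration: `save_pickle`, `load_pickle` and the throwaway `demo.pickle` file are hypothetical names, not part of the project.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize obj to path, closing the file even if an error occurs."""
    with open(path, "wb") as file:
        pickle.dump(obj, file)


def load_pickle(path):
    """Load and return the object stored at path."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Round-trip demonstration with a throwaway file (not the project's data)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pickle")
    save_pickle({"labels": [0, 1, 1]}, demo_path)
    restored = load_pickle(demo_path)

print(restored)
```

With such a helper, each data import would reduce to a single call such as `load_pickle("models/X_train.pickle")`.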

In [122]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1280000, 500000)
(320000, 500000)
(1280000,)
(320000,)


## Deep learning model

### Convert from sparse matrix to sparse tensor

- TensorFlow's sparse operations expect the indices of a sparse tensor to be in row-major order, so we re-order the indices after building the tensor.

In [124]:
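# Convert scipy sparse matrix to sparse tensor in tensorflow
def convert(X):
    coo = X.tocoo()
    # One (row, col) pair per non-zero entry; np.mat is no longer available in NumPy 2.0
    indices = np.column_stack((coo.row, coo.col)).astype(np.int64)
    # Re-order the indices into row-major order
    return tf.sparse.reorder(tf.SparseTensor(indices, coo.data, coo.shape))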
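
To make the re-ordering step concrete, here is a minimal pure-Python sketch of what "row-major order" means for coordinate-format entries. It does not call TensorFlow; `to_row_major` and the toy values are hypothetical and only illustrate the ordering.

```python
def to_row_major(rows, cols, values):
    """Sort COO triples by (row, col), i.e. into row-major order."""
    triples = sorted(zip(rows, cols, values), key=lambda t: (t[0], t[1]))
    indices = [[r, c] for r, c, _ in triples]
    ordered_values = [v for _, _, v in triples]
    return indices, ordered_values


# A toy 3x4 matrix with three non-zero entries listed out of order
rows = [2, 0, 1]
cols = [1, 3, 0]
values = [0.5, 0.2, 0.9]

indices, ordered_values = to_row_major(rows, cols, values)
print(indices)         # [[0, 3], [1, 0], [2, 1]]
print(ordered_values)  # [0.2, 0.9, 0.5]
```

Each value stays attached to its (row, col) pair; only the order in which the entries are listed changes.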

### Training and evaluating functions

- For the sake of memory efficiency, we will train the models in batches of 100,000 rows, converting one batch at a time to a sparse tensor.

In [137]:
# Train a model for a single pass over X_train, one batch at a time
def batch_func(model):
    bs = 100000
    outputs = {"accuracy": [], "loss": []}

    for start in range(0, X_train.shape[0], bs):
        end = start + bs  # slicing past the last row is safe
        x = convert(X_train[start:end])
        y = y_train[start:end]
        o = model.train_on_batch(x, y, return_dict=True)
        outputs["loss"].append(o["loss"])
        outputs["accuracy"].append(o["accuracy"])

    # Report the metrics from the final batch only
    return model, {"loss": outputs["loss"][-1], "accuracy": outputs["accuracy"][-1]}
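
The batch boundaries used by the training loop can be checked in isolation. The sketch below assumes the training set size printed earlier (1,280,000 rows) and the batch size of 100,000; `batch_slices` is a hypothetical helper, not part of the notebook.

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, end) pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


slices = list(batch_slices(1_280_000, 100_000))
print(len(slices))   # 13
print(slices[-1])    # (1200000, 1280000)
```

The final batch holds only 80,000 rows, and since the training function returns the metrics of its last batch, the reported training accuracy describes those rows rather than the whole training set.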

- We will evaluate models using training accuracy (as reported on the final training batch) and testing accuracy.

In [126]:
# Function to evaluate models
def evaluate(name, model):
    print(f"EVALUATION OF {name} MODEL.")
    start = time.time()
    model, metrics = batch_func(model)
    print(f"The training run time was {time.time() - start} seconds.\n\n")
    print("Training metrics:\n", metrics, "\n\n")
    print("Testing metrics:\n", model.evaluate(convert(X_test), y_test, return_dict=True))
    barrier()

### Baseline model
- We will use a base model with two densely connected hidden layers of 64 units each, followed by a single sigmoid output unit.
- The input_shape for the first layer equals the number of features produced by our vectorizer, which is 500,000.

In [127]:
# Create base model
def base():
    base_model = models.Sequential()
    base_model.add(layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)))
    base_model.add(layers.Dense(64, activation="relu"))
    base_model.add(layers.Dense(1, activation="sigmoid"))
    base_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return base_model
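
As a rough check on the size of this network, the parameter count of each dense layer is its number of inputs times its number of units, plus one bias per unit. The sketch below works through that arithmetic for the 500,000-feature input; `dense_params` is a hypothetical helper and nothing here calls Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 500_000  # number of vectorizer features, per the text
layer_sizes = [64, 64, 1]

total = 0
n_inputs = n_features
for n_units in layer_sizes:
    total += dense_params(n_inputs, n_units)
    n_inputs = n_units

print(total)  # 32004289
```

Almost all of the parameters sit in the first layer, because it connects every one of the 500,000 input features to each of its 64 units.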

In [128]:
base_model = base()
evaluate("BASE", base_model)

EVALUATION OF BASE MODEL.
The training run time was 15.164917707443237 seconds.


Training metrics:
 {'accuracy': 0.8520749807357788, 'loss': 0.3611500859260559} 


Testing metrics:
 {'loss': 0.4081416130065918, 'accuracy': 0.8183968663215637}

 <<< -------------------------------------------------- >>> 



### Potential of overfitting
- The base model produced decent results: 85% training accuracy and 82% test accuracy.
- However, the gap between the two suggests a potential for overfitting. To explore this, we will try 3 avenues:
    - Reduce the network's size
    - Add regularization
    - Add dropout layers
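
One simple way to put a number on the overfitting concern is the difference between training and test accuracy. The sketch below uses the base model's printed accuracies rounded to four decimals; `accuracy_gap` is a hypothetical helper, and the result is only a rough indicator because the training figure comes from the final batch alone.

```python
def accuracy_gap(train_accuracy, test_accuracy):
    """Positive values mean the model does better on data it has seen."""
    return train_accuracy - test_accuracy


# Base model accuracies as printed above, rounded to four decimals
base_gap = accuracy_gap(0.8521, 0.8184)
print(round(base_gap, 4))  # 0.0337
```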

#### Reducing the network's size

In [131]:
reduced_model = models.Sequential()
reduced_model.add(layers.Dense(32, activation="relu", input_shape=(X_train.shape[1],)))
reduced_model.add(layers.Dense(1, activation="sigmoid"))
display(reduced_model.summary())
reduced_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Model: "sequential_17"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_46 (Dense)            (None, 32)                16000032  
                                                                 
 dense_47 (Dense)            (None, 1)                 33        
                                                                 
Total params: 16,000,065
Trainable params: 16,000,065
Non-trainable params: 0
_________________________________________________________________


None

In [132]:
evaluate("REDUCED", reduced_model)

EVALUATION OF REDUCED MODEL.
The training run time was 11.529033899307251 seconds.


Training metrics:
 {'accuracy': 0.7823125123977661, 'loss': 0.6719483137130737} 


Testing metrics:
 {'loss': 0.6695919036865234, 'accuracy': 0.7834843993186951}

 <<< -------------------------------------------------- >>> 



**Note:**
- Reducing the network size did not improve the training or test accuracy over the base model.
- Larger models have more capacity to fit the input data, so a smaller network can underfit.
- We can consider other reduction steps, such as keeping the original number of layers while reducing only the number of hidden nodes per layer.

#### Adding regularization

In [133]:
reg_model = models.Sequential()
reg_model.add(layers.Dense(64, kernel_regularizer=regularizers.l2(0.001), activation="relu",
                           input_shape=(X_train.shape[1],)))
reg_model.add(layers.Dense(64, kernel_regularizer=regularizers.l2(0.001), activation="relu"))
reg_model.add(layers.Dense(1, activation="sigmoid"))
display(reg_model.summary())
reg_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Model: "sequential_18"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_48 (Dense)            (None, 64)                32000064  
                                                                 
 dense_49 (Dense)            (None, 64)                4160      
                                                                 
 dense_50 (Dense)            (None, 1)                 65        
                                                                 
Total params: 32,004,289
Trainable params: 32,004,289
Non-trainable params: 0
_________________________________________________________________


None

In [134]:
evaluate("REGULARIZED", reg_model)

EVALUATION OF REGULARIZED MODEL.
The training run time was 15.106367588043213 seconds.


Training metrics:
 {'accuracy': 0.768875002861023, 'loss': 0.7476487159729004} 


Testing metrics:
 {'loss': 0.741715133190155, 'accuracy': 0.7692124843597412}

 <<< -------------------------------------------------- >>> 



**Note:**
- Adding regularization did not improve the training or test accuracy over the base model.
- However, this is not entirely surprising, nor entirely negative. Regularization tends to reduce training accuracy, and the drop in test accuracy may be due to the coefficient used (0.001).
- Ideally, cross-validation would determine the best regularization coefficient. We will keep that in mind for now.
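
To illustrate how the coefficient enters the picture, an L2 regularizer adds the coefficient times the sum of squared weights to the loss. The sketch below is a minimal pure-Python version of that penalty; the weights and the cross-entropy value are made up for illustration and `l2_penalty` is a hypothetical helper.

```python
def l2_penalty(weights, coefficient):
    """Return coefficient * sum of squared weights."""
    return coefficient * sum(w * w for w in weights)


toy_weights = [1.0, -2.0, 3.0]   # made-up weights for illustration
data_loss = 0.40                 # made-up cross-entropy value

for coefficient in (0.0001, 0.001, 0.01):
    total_loss = data_loss + l2_penalty(toy_weights, coefficient)
    print(coefficient, round(total_loss, 4))
```

Because the penalty is part of the loss being minimized, a larger coefficient pushes the weights harder towards zero. This may also partly explain why the regularized model reports a higher loss than the base model, although we have not verified that here.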

#### Adding dropout layers

In [141]:
drop_model = models.Sequential()
drop_model.add(layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)))
drop_model.add(layers.Dropout(0.5))
drop_model.add(layers.Dense(64, activation="relu"))
drop_model.add(layers.Dense(1, activation="sigmoid"))
display(drop_model.summary())
drop_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Model: "sequential_22"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_60 (Dense)            (None, 64)                32000064  
                                                                 
 dropout_4 (Dropout)         (None, 64)                0         
                                                                 
 dense_61 (Dense)            (None, 64)                4160      
                                                                 
 dense_62 (Dense)            (None, 1)                 65        
                                                                 
Total params: 32,004,289
Trainable params: 32,004,289
Non-trainable params: 0
_________________________________________________________________


None

In [142]:
evaluate("DROPOUT_LAYER", drop_model)

EVALUATION OF DROPOUT_LAYER MODEL.
The training run time was 27.643298387527466 seconds.


Training metrics:
 {'loss': 0.6679771542549133, 'accuracy': 0.773562490940094} 


Testing metrics:
 {'loss': 0.6606889367103577, 'accuracy': 0.7842468619346619}

 <<< -------------------------------------------------- >>> 



**Note:**
- Adding dropout layers did not improve the training or test accuracy over the base model.
- Dropout is a regularization technique. It makes the model more robust by randomly ignoring a fraction of a layer's nodes during training.
- However, when the network is small relative to the data set (as in our case), dropout can actually worsen performance and may be unnecessary.
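
The idea behind dropout can be sketched in a few lines of plain Python. This is a simplified illustration of the commonly used "inverted dropout" scheme, not the Keras implementation; `dropout`, the seed and the toy activations are all hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [0.2, 1.5, 0.7, 0.9]
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

Rescaling the surviving activations keeps their expected sum unchanged, so the layer can be used without dropout at test time.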

## Saving models

In [143]:
with open("models/base_dl_model.pickle", "wb") as file:
    pickle.dump(base_model, file)

with open("models/reduced_dl_model.pickle", "wb") as file:
    pickle.dump(reduced_model, file)

with open("models/reg_dl_model.pickle", "wb") as file:
    pickle.dump(reg_model, file)

with open("models/drop_dl_model.pickle", "wb") as file:
    pickle.dump(drop_model, file)

INFO:tensorflow:Assets written to: ram://80f27634-d6f5-464f-b87b-4912cd158457/assets
INFO:tensorflow:Assets written to: ram://656e0a37-37ec-4927-8e0b-dd92d35a46ba/assets
INFO:tensorflow:Assets written to: ram://bf8aba9a-1f62-49b0-9d88-fa7e1115c607/assets
INFO:tensorflow:Assets written to: ram://830ae7c5-80dd-47cc-a860-2397d3686d4d/assets


## Final conclusions and potential next steps
- The base model is the most accurate, with 85% training accuracy and 82% testing accuracy.
- This is comparable to, but slightly lower than, what we achieved with the Logistic Regression model in the previous notebook (87% training accuracy and 83% test accuracy).
- We will proceed with the Logistic Regression model, although there is further scope to explore more deep learning models (RNNs, CNNs, LSTMs, etc.) as well as to cross-validate the non-network models.
- Part 1, Part 2-1 and Part 2-2 are all contained within the same project folder.
- See https://github.com/Daolaiya/Data-Science-Portfolio/tree/main/Project%203