# **TWITTER SENTIMENT CLASSIFICATION** ☁

---


## **1. PROBLEM**

This notebook classifies tweets as positive, negative, neutral, or irrelevant using NLP techniques.

## **2. DATA**

Our data comes from Kaggle. It has 4 columns, in this order: id, entity, sentiment, and tweet content.

## **3. EVALUATION**

The model is evaluated on the accuracy of its predictions, with a target score of **98% accuracy**.

## **4. DATA DICTIONARY**

Our training and test data have 4 and 3 columns respectively.

**Training Data:**

**The training data contains the following columns:**

1. **Id:** An integer identifier for the tweet.
2. **Entity:** The entity the tweet is about (for example, Borderlands or HomeDepot).
3. **Sentiment:** The label classifying the tweet as positive, negative, irrelevant, or neutral.
4. **Tweet:** The text of the tweet.

**Test Data:**

**The test data contains the following columns:**

1. **Id:** An integer identifier for the tweet.
2. **Entity:** The entity the tweet is about.
3. **Tweet:** The text of the tweet.


## **5. MODEL BUILDING/EXPERIMENTATION**

Let us try to:
1. Set up our worktools
2. Import our data
3. Explore our data
4. Build a model
5. Fit data to the model
6. Validate the model
7. Evaluate our model
8. Tune our model
9. Make predictions!

### **1. Set up our worktools**
We're gonna import:

1. TensorFlow
2. Matplotlib
3. scikit-learn
4. Keras
5. pandas
6. NumPy



In [2]:
# Import necessary worktools

import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

print(tf.__version__)

2.12.0


### **2. Importing our data**

We are gonna import the training data for now.

In [3]:
# Import our data from Google Drive, supplying the column names
df = pd.read_csv("drive/MyDrive/Twitter data/twitter_training.csv", names=['id', 'entity', 'sentiment', 'tweet'])

### **3. Exploratory Data Analysis (EDA)**

Now, we are going to explore our data to see what we are working with and become one with the data.

In [4]:
df.head()

| | id | entity | sentiment | tweet |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         74682 non-null  int64 
 1   entity     74682 non-null  object
 2   sentiment  74682 non-null  object
 3   tweet      73996 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.3+ MB


In [6]:
df.describe()

| | id |
|---|---|
| count | 74682.0 |
| mean | 6432.586165 |
| std | 3740.42787 |
| min | 1.0 |
| 25% | 3195.0 |
| 50% | 6422.0 |
| 75% | 9601.0 |
| max | 13200.0 |


In [7]:
df.isna().sum()

id             0
entity         0
sentiment      0
tweet        686
dtype: int64

In [8]:
df_tmp = df.copy()

In [9]:
df_tmp.head()

| | id | entity | sentiment | tweet |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |


In [10]:
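# Fill missing values in the tweet column by forward-filling from the previous row
# (df.ffill() is equivalent to fillna(method='ffill'), which newer pandas versions deprecate)
df_tmp = df_tmp.ffill()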

In [11]:
df_tmp.head()

| | id | entity | sentiment | tweet |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |


In [12]:
df_tmp.isna().sum()

id           0
entity       0
sentiment    0
tweet        0
dtype: int64
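To make the effect of forward-filling concrete, here is a small sketch on a made-up DataFrame (the rows are illustrative, not taken from the dataset). It suggests that a forward-filled row keeps its own sentiment label but borrows the text of the row above it; dropping the incomplete rows is shown as one common alternative.

```python
import pandas as pd

# Toy frame with one missing tweet (illustrative values only)
toy = pd.DataFrame({
    "sentiment": ["Positive", "Negative", "Neutral"],
    "tweet": ["love this game", None, "patch notes are out"],
})

# Forward fill: the gap takes the text of the row above it
filled = toy.ffill()

# Alternative: drop rows whose tweet is missing
dropped = toy.dropna(subset=["tweet"])

print(filled)
print(len(dropped))
```

With 686 missing tweets out of 74682 rows, either choice affects under 1% of the data, so the decision is mostly about whether borrowed text under a different label is acceptable.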

In [13]:
df_tmp = df_tmp.sample(frac=1, random_state=42)

df_tmp.head()

| | id | entity | sentiment | tweet |
|---|---|---|---|---|
| 34877 | 6789 | Fortnite | Irrelevant | He said told u I'm getting in that box of a br... |
| 21704 | 4115 | CS-GO | Positive | Yo this looks LIT! CS: GO / Overwatch combo |
| 47008 | 5665 | HomeDepot | Negative | @HomeDepot attention executive administrators.... |
| 7969 | 9369 | Overwatch | Irrelevant | Guy has notified me and says that my name has ... |
| 454 | 2476 | Borderlands | Positive | F Loving the new DLC!!!. RhandlerR RhandlerR R... |


In [14]:
# Splitting our data and getting it ready for fitting and model building

from sklearn.model_selection import train_test_split

train_tweets, val_tweets, train_labels, val_labels = train_test_split(df_tmp['tweet'].to_list(),
                                                                      df_tmp['sentiment'].to_numpy(),
                                                                      test_size=0.2,
                                                                      random_state=42)

In [15]:
# Check the lengths of the split data
len(train_tweets), len(train_labels), len(val_tweets), len(val_labels)

(59745, 59745, 14937, 14937)
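The split sizes above can be reproduced by hand. The sketch below assumes the rounding rule scikit-learn applies when `test_size` is a float: the held-out size is rounded up and the training set takes the remainder.

```python
import math

n_samples = 74682   # rows reported by df.info()
test_size = 0.2     # fraction passed to train_test_split

# With a float test_size, the held-out set size is rounded up
n_val = math.ceil(test_size * n_samples)
# The training set takes whatever is left
n_train = n_samples - n_val

print(n_train, n_val)
```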

In [16]:
# View the first 10 training tweets

train_tweets[:10]

["Currently otp with Verizon. We pay y'all too much for my shit not to be working without WiFi",
 '@Ubisoft @GhostRecon tech update I people supposed might help people with fake internet has completely locked me out of your website everytime I use to log in it says connection lost but I have a ok connection... THIS<unk> WHY IT NEEDS TO BE OFFLINE',
 'RhandlerR RhandlerR Mid-range specs, REALLY crappy camera, huge bezels, a REALLY weak camera, and no 5G support. All for a ridiculously high price. I am genuinely sad to be disappointed so much with the Surface Duo.  Another example of Microsoft failing in the phone market. pic.twitter.com/LxsnoOHZxr',
 'seem, healer skills had to always come stream in handy.',
 'said.',
 "I can't wait, man, I just hope it's a good cod",
 '@CSGO your "official servers" are a happy zone for spam.... @valvesoftware pls fix.',
 'I just bought a house in the new town with @ Lowes in 6 min. However, my 1st and 2nd shopping experience there showed me that it is 

In [17]:
val_tweets[:10]

['. seriously, your FIFA points system should be banned or. Daylight robbery.',
 'Last night, the Overwatch team took off on your @WHSEsports2 ; valiant effort, again but we came up short, losing 2 - 3 1. Great game, well played, Westfield. Up until next, our Overwatch team ( eventual 2 - 1 ) team will definitely take off on the Carmel next Thursday.',
 'This is giving me the vibes.',
 'Xbox is the Fast & Furious of entertainment',
 'Happy 44th Anniversary to Hip Car Hop! If even you love music sounds like we do head over anytime to google and take a lesson free from @FABNEWYORK on knowing how exciting to become a DJ...',
 'wow I beat Battlefield 4 Call of Duty Advanced Warfare and Ghost Recon Future Soldier @ Xbox @ PlayStation @ Aviation vision @ EA @ Ubisoft',
 'pickupblocks.com / video / ZD0jbb... Woman gets $70 million from Johnson & Johnson Suit',
 'Yet another reason to hate the Atlanta Falcons and Arthur Godfrey.',
 'What',
 "<unk> going gonna make small tweet about how Among U

In [18]:
train_labels[:10]

array(['Negative', 'Negative', 'Negative', 'Neutral', 'Neutral',
       'Positive', 'Negative', 'Positive', 'Neutral', 'Neutral'],
      dtype=object)

In [19]:
val_labels[:10]

array(['Negative', 'Irrelevant', 'Irrelevant', 'Positive', 'Positive',
       'Neutral', 'Neutral', 'Negative', 'Positive', 'Positive'],
      dtype=object)

### **4. Build our model**

In [20]:
# Extract labels ("target" columns) and encode them into integers
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_labels)
val_labels_encoded = label_encoder.transform(val_labels)
#test_labels_encoded = label_encoder.transform(test_df["target"].to_numpy())

# Check what training labels look like
train_labels_encoded

array([1, 1, 1, ..., 1, 1, 1])
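As a rough illustration of where these integers come from, the plain-Python sketch below mimics the way `LabelEncoder` numbers classes in sorted order. The `all_classes` list is an assumption based on the four sentiment values described earlier.

```python
# train_labels[:10] as printed earlier in the notebook
labels = ["Negative", "Negative", "Negative", "Neutral", "Neutral",
          "Positive", "Negative", "Positive", "Neutral", "Neutral"]
all_classes = ["Positive", "Negative", "Irrelevant", "Neutral"]

# Integers follow the sorted order of the class names
classes = sorted(all_classes)
class_to_int = {name: i for i, name in enumerate(classes)}

encoded = [class_to_int[name] for name in labels]
print(class_to_int)
print(encoded)
```

Under this ordering, the leading `1, 1, 1` in the output above corresponds to three `Negative` tweets.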

In [21]:
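# One hot encode labels
from sklearn.preprocessing import OneHotEncoder
# `sparse_output` is the current name of the `sparse` argument, which was deprecated in scikit-learn 1.2
one_hot_encoder = OneHotEncoder(sparse_output=False)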
train_labels_one_hot = one_hot_encoder.fit_transform(train_labels.reshape(-1, 1))
val_labels_one_hot = one_hot_encoder.transform(val_labels.reshape(-1, 1))
#test_labels_one_hot = one_hot_encoder.transform(test_df["target"].to_numpy().reshape(-1, 1))

# Check what training labels look like
train_labels_one_hot



array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       ...,
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.]])
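The rows above can be read as "a 1 in the column of the tweet's class". A minimal NumPy sketch of the same idea, assuming the alphabetical class order Irrelevant=0, Negative=1, Neutral=2, Positive=3:

```python
import numpy as np

# Integer codes under the assumed class order (illustrative values)
encoded = np.array([1, 1, 2, 3, 0])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(4)[encoded]
print(one_hot)
```

Under that assumption, `[0., 1., 0., 0.]` is a `Negative` tweet.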

In [22]:
# Building the baseline model with Naive Bayes
# We will build a pipeline to turn words into numbers, and model our text data

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_base = Pipeline([
    ('tfidf', TfidfVectorizer()), # Turns texts into numbers
    ('clf', MultinomialNB()) # Models the data
    ])

# Fit the pipeline to the training data
model_base.fit(train_tweets, train_labels_encoded)
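To see the pipeline at a small scale, here is a sketch on a made-up four-tweet corpus (the texts and labels are illustrative only). It shows the vectorizer learning a vocabulary and the classifier predicting from it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = positive, 0 = negative (illustrative only)
toy_tweets = ["love this game", "great fun love it",
              "hate this bug", "terrible awful hate"]
toy_labels = [1, 1, 0, 0]

toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),  # text -> TF-IDF weighted word features
    ("clf", MultinomialNB()),      # Naive Bayes over those features
])
toy_model.fit(toy_tweets, toy_labels)

# The fitted vocabulary is what turns words into column indices
vocab_size = len(toy_model.named_steps["tfidf"].vocabulary_)
preds = toy_model.predict(["love it", "awful bug"])
print(vocab_size, preds)
```

Words that appear only in positive training tweets pull a new tweet toward the positive class, and likewise for negative ones, which is why this works reasonably well as a baseline.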

In [23]:
# Score the baseline model and store its predictions for later evaluation
baseline_score = model_base.score(val_tweets, val_labels_encoded)
baseline_preds = model_base.predict(val_tweets)

def check_model_score_predict(X_val, y_val):

  """
  Prints the accuracy of the baseline model and returns its first 20 predictions.
  Parameters:
  X_val (list of str): The validation tweets.
  y_val (array of int): The encoded validation labels.
  """
  # Score the model on the data passed in and store the result in a variable
  baseline_score = model_base.score(X_val, y_val)

  # Print the result
  print(f'Our baseline model score accuracy is: {baseline_score*100:.2f}%')

  # Make some predictions
  baseline_preds = model_base.predict(X_val)
  return baseline_preds[:20]

In [24]:
check_model_score_predict(X_val=val_tweets,
                  y_val=val_labels_encoded)

Our baseline model score accuracy is: 72.30%


array([1, 3, 3, 2, 3, 2, 2, 1, 1, 3, 2, 2, 1, 2, 1, 1, 0, 3, 1, 1])
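The integer predictions are easier to judge once decoded. The sketch below maps the first ten predictions back to class names and compares them with `val_labels[:10]` printed earlier, assuming the alphabetical class order that `LabelEncoder` produces. In the notebook itself, `label_encoder.inverse_transform` would do the decoding.

```python
# Class order assumed to be alphabetical, as LabelEncoder produces
classes = ["Irrelevant", "Negative", "Neutral", "Positive"]

preds = [1, 3, 3, 2, 3, 2, 2, 1, 1, 3]  # first 10 predictions printed above
truth = ["Negative", "Irrelevant", "Irrelevant", "Positive", "Positive",
         "Neutral", "Neutral", "Negative", "Positive", "Positive"]  # val_labels[:10]

decoded = [classes[i] for i in preds]
n_correct = sum(p == t for p, t in zip(decoded, truth))
print(decoded)
print(f"{n_correct} of {len(truth)} correct")
```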

In [25]:
# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a multi-class classification model.

  Args:
  -----
  y_true = true labels in the form of a 1D array
  y_pred = predicted labels in the form of a 1D array

  Returns a dictionary of accuracy, precision, recall, f1-score.
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results

In [26]:
calculate_results(y_true=val_labels_encoded,
                  y_pred=baseline_preds)

{'accuracy': 72.30367543683471,
 'precision': 0.7654978292452599,
 'recall': 0.723036754368347,
 'f1': 0.7123007247914539}
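Notice that the weighted recall (0.7230) matches the accuracy (72.30%). This is expected with `average="weighted"`: weighting each class's recall by its share of the samples adds up to the overall fraction of correct predictions. A plain-Python sketch on made-up labels:

```python
from collections import Counter

# Made-up labels for illustration
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 3]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 3]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

support = Counter(y_true)  # number of true samples per class
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class

# Recall per class, averaged with weights proportional to class support
weighted_recall = sum(
    (support[c] / len(y_true)) * (hits[c] / support[c]) for c in support
)
print(accuracy, weighted_recall)
```

Precision and F1 do not share this property, so they are the more informative numbers to compare against accuracy.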

### **Build a deep learning model**

In [27]:
import datetime

def create_tensorboard_callback(dir_name, experiment_name):
  """
  Creates a TensorBoard callback instance to store log files.

  Stores log files with the filepath:
    "dir_name/experiment_name/current_datetime/"

  Args:
    dir_name: target directory to store TensorBoard log files
    experiment_name: name of experiment directory (e.g. efficientnet_model_1)
  """
  log_dir = dir_name + "/" + experiment_name + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=log_dir
  )
  print(f"Saving TensorBoard log files to: {log_dir}")
  return tensorboard_callback

In [28]:
# Import Keras and set the directory where TensorBoard logs will be saved
from tensorflow import keras
from tensorflow.keras import layers

SAVE_DIR = "model_logs"

In [29]:
tf.random.set_seed(42)
from tensorflow.keras.layers import TextVectorization

text_vectorizer = TextVectorization(max_tokens=10000,
                                    output_mode="int",
                                    output_sequence_length=15)

# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_tweets)
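For intuition, here is a simplified plain-Python approximation of what an integer-mode text vectorizer with a fixed output length does. It is a sketch, not TensorFlow's implementation: the helper names are hypothetical, and it assumes the layer's default behaviour of lowercasing, stripping punctuation, splitting on whitespace, reserving index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and strip punctuation, similar to the layer's default standardization
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def vectorize(text, vocab, seq_len):
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:seq_len]                       # truncate long tweets
    return ids + [0] * (seq_len - len(ids))   # pad short tweets with zeros

toy_tweets = ["I love this game!", "This game is broken..."]
vocab = build_vocab(toy_tweets, max_tokens=10000)
vec = vectorize("Love this new game", vocab, seq_len=15)
print(vec)
```

Every tweet ends up as exactly 15 integers, so words beyond the fifteenth are discarded and shorter tweets are zero-padded.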

In [30]:
# Set random seed and create embedding layer (new embedding layer for each model)
tf.random.set_seed(42)
from tensorflow.keras import layers

token_embed = layers.Embedding(input_dim=10000,
                               output_dim=128,
                               embeddings_initializer="uniform",
                               input_length=15,
                               name="embedding_5")

In [31]:
# Turn our data into TensorFlow Datasets
train_dataset = tf.data.Dataset.from_tensor_slices((train_tweets, train_labels_one_hot))
valid_dataset = tf.data.Dataset.from_tensor_slices((val_tweets, val_labels_one_hot))
#test_dataset = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels_one_hot))

train_dataset

<_TensorSliceDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(4,), dtype=tf.float64, name=None))>

In [32]:
# Take the TensorSliceDataset's and turn them into prefetched batches
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
valid_dataset = valid_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
#test_dataset = test_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

train_dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None, 4), dtype=tf.float64, name=None))>
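The `None` in the batched shapes reflects that batch size is not fixed: with a batch size of 32 and the final partial batch kept (the default for `.batch`), the last batch is usually smaller. A quick arithmetic sketch using the split sizes from earlier (the helper name is hypothetical):

```python
import math

def batch_sizes(n_samples, batch_size):
    """Return (number of batches, size of the last batch) when the final partial batch is kept."""
    n_batches = math.ceil(n_samples / batch_size)
    last = n_samples - (n_batches - 1) * batch_size
    return n_batches, last

print(batch_sizes(59745, 32))  # training split
print(batch_sizes(14937, 32))  # validation split
```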

In [33]:
# Create 1-dimensional convolutional layer to model sequences
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype=tf.string)
text_vectors = text_vectorizer(inputs) # vectorize text inputs
token_embeddings = token_embed(text_vectors) # create embedding
x = layers.Conv1D(200, kernel_size=5, padding="same", activation="relu")(token_embeddings)
x = layers.GlobalAveragePooling1D()(x) # condense the output of our feature vector
outputs = layers.Dense(4, activation="softmax")(x)
model_1 = tf.keras.Model(inputs, outputs)

# Compile
model_1.compile(loss="categorical_crossentropy", # if your labels are integer form (not one hot) use sparse_categorical_crossentropy
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# Get a summary of our 1D convolution model
model_1.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 15)               0         
 torization)                                                     
                                                                 
 embedding_5 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 conv1d (Conv1D)             (None, 15, 200)           128200    
                                                                 
 global_average_pooling1d (G  (None, 200)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 4)                 804   
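The parameter counts in the summary can be checked by hand from the layer settings used above. A short arithmetic sketch:

```python
vocab_size, embed_dim = 10000, 128   # Embedding(input_dim, output_dim)
filters, kernel_size = 200, 5        # Conv1D settings
n_classes = 4                        # Dense output units

embedding_params = vocab_size * embed_dim
conv_params = kernel_size * embed_dim * filters + filters  # weights + biases
dense_params = filters * n_classes + n_classes             # weights + biases

print(embedding_params, conv_params, dense_params)
print(embedding_params + conv_params + dense_params)
```

The embedding table accounts for the large majority of the parameters; the text vectorization and pooling layers have none.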

In [34]:
# Fit the model
model_1_history = model_1.fit(train_dataset,
                              epochs=15,
                              validation_data=valid_dataset,
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "Conv1D")])

Saving TensorBoard log files to: model_logs/Conv1D/20230803-030705
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [35]:
# Build another model using the Universal Sentence Encoder (USE)

import tensorflow_hub as hub

sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[], # shape of input data
                                        dtype=tf.string, # data type of input
                                        trainable=False, # freeze all the pretrained weights
                                        name="USE")

In [36]:
# Create model using Sequential API
from tensorflow import keras
from tensorflow.keras import layers

SAVE_DIR = "model_logs"

model = tf.keras.Sequential([
    sentence_encoder_layer, # take in sentences and then encode them into an embedding
    layers.Dense(64, activation='relu'),
    layers.Dense(4, activation='softmax') # Output layer: softmax gives one probability distribution over the 4 classes
], name="model_USE")


# Compile the model

model.compile(loss='categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

# Fit the model

model_history = model.fit(train_dataset,
                          epochs=20,
                          validation_data=valid_dataset,
                          callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                 "tf_hub_sentence_encoder")])

Saving TensorBoard log files to: model_logs/tf_hub_sentence_encoder/20230803-031215
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
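Because each tweet carries exactly one of four labels and the loss is categorical cross-entropy over one-hot targets, the choice of output activation matters. The plain-Python sketch below, with made-up scores, compares the two common options: softmax outputs form a single probability distribution, while sigmoid scores each class independently.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    return [1 / (1 + math.exp(-z)) for z in logits]

logits = [2.0, 0.5, -1.0, 0.0]  # made-up scores for the 4 classes
print(sum(softmax(logits)))
print(sum(sigmoid(logits)))
```

Softmax is the usual pairing with categorical cross-entropy for single-label problems; sigmoid outputs are generally paired with binary cross-entropy for multi-label problems.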


In [37]:
# Build another model using the Universal Sentence Encoder (USE), this time with trainable weights

import tensorflow_hub as hub

sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[], # shape of input data
                                        dtype=tf.string, # data type of input
                                        trainable=True, # unfreeze the pretrained weights so they are fine-tuned
                                        name="USE")

In [38]:
# Create model using Sequential API
from tensorflow import keras
from tensorflow.keras import layers

SAVE_DIR = "model_logs"

model2 = tf.keras.Sequential([
    sentence_encoder_layer, # take in sentences and then encode them into an embedding
    layers.Dense(64, activation='relu'),
    layers.Dense(4, activation='softmax') # Output layer: softmax gives one probability distribution over the 4 classes
], name="model2_USE")


# Compile the model

model2.compile(loss='categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

# Fit the model

model2_history = model2.fit(train_dataset,
                          epochs=5,
                          validation_data=valid_dataset,
                          callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                 "tf_hub_sentence_2encoder")])

Saving TensorBoard log files to: model_logs/tf_hub_sentence_2encoder/20230803-032439
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [39]:
model2.save("Twitter_Sentimental_Analysis.h5")