# Twitter Sentiment Analysis using the NLTK library

In this notebook we use NLP (Natural Language Processing) techniques to analyze the sentiment of tweets (short texts posted on Twitter). We build and train several machine learning models commonly used for this type of problem and compare how well they generalize.

First, we import the packages needed to tackle the problem.

In [54]:
import pandas as pd, keras, pickle, warnings
from sklearn.model_selection import GridSearchCV, train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from collections import defaultdict
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
from models import CleanData

warnings.filterwarnings("ignore")

## Data Cleaning & Feature Engineering

In [4]:
positive = pd.read_csv("data/tweets_pos_clean.csv")
negative = pd.read_csv("data/tweets_neg_clean.csv") # Read the data

In [5]:
positive["Target"] = [0 for i in positive["Tweets"]] # Label every positive tweet with 0 and every negative tweet with 1.
negative["Target"] = [1 for i in negative["Tweets"]] # The data comes split into two files by class, so the label column is not part of the original files.

In [6]:
print("Positive Tweets:", len(positive))
print("Negative Tweets:", len(negative))

Positive Tweets: 55056
Negative Tweets: 120948
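The two counts above show that the classes are imbalanced. As a quick reference point, and assuming nothing beyond the two totals printed above, the share of each class can be computed as in the sketch below. The majority share is the accuracy that a trivial model would reach by always predicting the larger class. The variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Counts printed by the cell above
n_positive = 55056
n_negative = 120948
total = n_positive + n_negative

# Share of each class in the raw data
share_positive = n_positive / total
share_negative = n_negative / total

# Accuracy of a trivial model that always predicts the larger class
majority_baseline = max(share_positive, share_negative)

print(f"Positive share: {share_positive:.3f}")
print(f"Negative share: {share_negative:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

These shares describe the raw files only; rows dropped during cleaning may shift them slightly.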


In [7]:
df = positive.merge(negative, how="outer") # Merge the two labeled DataFrames into one
df

Unnamed: 0,Tweets,Target
0,Se imaginan a los chicos agradeciendo por el p...,0
1,"Eclesiastes4:9-12 ♡ Siempre, promesa :) https...",0
2,"@pedroj_ramirez Qué saborío, PJ. ya no compart...",0
3,Buenos dias para todos. Feliz inicio de semana...,0
4,"@pepedom @bquintero Gracias! No es así, deja c...",0
...,...,...
175999,Pero... Dime que no te perderé del todo :( ❤💛💚,1
176000,Yo creo que a Colocolo le hacía falta un parti...,1
176001,@seru15 son para niño :( quisiera quedarmelos.,1
176002,Diganle al sonidero que ya le baje a su desmad...,1


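We apply the cleaning functions to the raw data to obtain processed text, which should improve the accuracy of the models. The cell is commented out because its result was already saved to `data/data_cleaned.csv`, which the next cell loads.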

In [8]:
# df["Tweets"] = df["Tweets"].apply(CleanData().remove_links) 
# df["Tweets"] = df["Tweets"].apply(CleanData().clean_emojis)
# df["Tweets"] = df["Tweets"].apply(CleanData().remove_stopwords)
# df["Tweets"] = df["Tweets"].apply(CleanData().signs_tweets)
# df["Tweets"] = df["Tweets"].apply(CleanData().remove_doubles)
# df["Tweets"] = df["Tweets"].apply(CleanData().clean_laughs)
# df["Tweets"] = df["Tweets"].apply(CleanData().remove_mentions_hashtags_retweets)

# df.dropna(axis=0).to_csv("data/data_cleaned.csv", index=False)
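The `CleanData` class comes from the local `models` module, which is not shown in this notebook. As a rough idea of what two of the cleaning steps might look like, here is a minimal sketch using regular expressions. The function names `replace_links`, `remove_mentions` and `clean` are hypothetical stand-ins, not the actual `CleanData` methods; the `{link}` placeholder is taken from the cleaned output shown further down.

```python
import re

# Hypothetical stand-ins: the real CleanData methods are not shown in this notebook
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def replace_links(text: str) -> str:
    """Replace every URL with the placeholder token {link}."""
    return URL_PATTERN.sub("{link}", text)

def remove_mentions(text: str) -> str:
    """Drop @user mentions."""
    return MENTION_PATTERN.sub("", text)

def clean(text: str) -> str:
    """Apply both steps, lowercase the text, and normalize whitespace."""
    text = replace_links(text)
    text = remove_mentions(text)
    return " ".join(text.lower().split())

print(clean("@user Buenos dias para todos https://t.co/abc123"))
```

Functions like these can be applied to a column with `df["Tweets"].apply(clean)`, in the same way as the commented-out calls above.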

In [9]:
df = pd.read_csv("data/data_cleaned.csv").dropna()
print(df.size)
df.head()

351950


Unnamed: 0,Tweets,Target
0,se imaginan chicos agradeciendo premio cara or...,0
1,eclesiastes siempre promesa {link},0
2,edroj_ramirez qué saborío pj compartes gintoni...,0
3,buenos dias todos feliz inicio semana {link},0
4,epedom quintero gracias no así deja claro aqu...,0


In [10]:
df.Tweets[100300]

'ayloficial subire foto ustedes ig vengan villahermosa tomemos  ¡ya vengaan'

## Modeling

Vectorize the processed tweets as counts of unigrams and bigrams

In [11]:
vectorizer = CountVectorizer(ngram_range=(1,2))

Build pipelines that first vectorize the text and then pass the result to the model

In [12]:
#####################################################################################################################################

logistic_pipe = Pipeline([("vect", vectorizer), ("cls", LogisticRegression())]) # Logistic Regression

# max_df is written as the float 1.0 (a proportion of documents); the integer 1 would mean "at most one document",
# which conflicts with every min_df value below and makes those candidates fail to fit.
# Note: the "newton-cg" solver supports only the "l2" penalty, so the "l1" candidates fail to fit and get a NaN score.
logistic_params = {"vect__max_df": (0.5, 1.0), "vect__min_df": (10, 20, 50), "cls__penalty": ["l1","l2"],
"cls__C": [0.1, 1.0], "cls__solver" : ["newton-cg"]}

log_reg = GridSearchCV(logistic_pipe, logistic_params, cv=3, scoring="accuracy")

#####################################################################################################################################

svc_pipe = Pipeline([("vect", vectorizer), ("cls", LinearSVC())]) # Linear Support Vector Machine

# Note: LinearSVC does not support penalty="l1" together with loss="hinge", so those candidates fail to fit as well.
svc_params = {"cls__C": [0.001, 0.1, 1, 10, 100], "cls__loss": ["hinge", "squared_hinge"], "cls__penalty" : ["l1", "l2"]}

svc = GridSearchCV(svc_pipe, svc_params, cv=3, scoring="accuracy")

#####################################################################################################################################

xgb = Pipeline([("vect", vectorizer), ("cls", XGBClassifier())]) # XGB Classifier (default hyperparameters, no grid search)

#####################################################################################################################################
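The parameter grids above use the `<step name>__<parameter>` naming convention to reach inside a pipeline. The sketch below shows the same mechanism on a tiny toy dataset (invented for illustration), with a smaller grid so that it runs in a moment. It is only meant to illustrate how `GridSearchCV` and `Pipeline` interact, not to reproduce the searches above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration (0 = positive, 1 = negative, as in this notebook)
texts = ["feliz dia", "muy feliz", "gracias feliz", "que bueno",
         "triste dia", "muy triste", "que mal", "triste mal"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("vect", CountVectorizer()), ("cls", LogisticRegression())])

# "<step name>__<parameter>" routes each value to the matching pipeline step
param_grid = {"vect__ngram_range": [(1, 1), (1, 2)], "cls__C": [0.1, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

n_candidates = len(search.cv_results_["params"])  # 2 x 2 combinations
print(n_candidates)
print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training part of every fold, so the validation part never leaks into the vocabulary.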

In [13]:
X_train, X_test, y_train, y_test = train_test_split(df["Tweets"], df["Target"], test_size=0.20, random_state=24)

In [14]:
# log_reg.fit(X_train, y_train)

In [15]:
# with open("data/models/logistic_regression.h5", "wb") as save:
#     pickle.dump(log_reg, save)

In [16]:
with open("data/models/logistic_regression.h5", "rb") as load:
    log_reg = pickle.load(load)
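The save and load cells above rely on `pickle` preserving the whole fitted object, vectorizer included. A minimal round trip, sketched here on toy data and with a hypothetical temporary file name, shows that the restored pipeline gives the same predictions as the original.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration
texts = ["feliz dia", "muy feliz", "triste dia", "muy triste"]
labels = [0, 0, 1, 1]

model = Pipeline([("vect", CountVectorizer()), ("cls", LogisticRegression())])
model.fit(texts, labels)

# Save to a temporary file (hypothetical name), then load it back
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline reproduces the original predictions
same = list(model.predict(texts)) == list(restored.predict(texts))
print(same)
```

The file extension has no effect on `pickle`; the `.h5` names used in this notebook hold pickled objects rather than HDF5 data.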

In [17]:
log_reg_predictions = log_reg.predict(X_test)

log_reg_accuraccy = accuracy_score(y_test, log_reg_predictions) # accuracy_score expects (y_true, y_pred)

print(log_reg_accuraccy)

0.8110811194771985


In [18]:
# svc.fit(X_train, y_train)

In [19]:
# with open("data/models/LinearSVC.h5", "wb") as save:
#     pickle.dump(svc, save)

In [20]:
with open("data/models/LinearSVC.h5", "rb") as load:
    svc = pickle.load(load)

In [21]:
svc_predictions = svc.predict(X_test)

svc_accuraccy = accuracy_score(y_test, svc_predictions) # accuracy_score expects (y_true, y_pred)

print(svc_accuraccy)

0.8182412274470805


In [22]:
# xgb.fit(X_train, y_train)

In [23]:
# with open("data/models/XGBClassifier.h5", "wb") as save:
#     pickle.dump(xgb, save)

In [24]:
with open("data/models/XGBClassifier.h5", "rb") as load:
    xgb = pickle.load(load)

In [25]:
xgb_predictions = xgb.predict(X_test)

xgb_accuraccy = accuracy_score(y_test, xgb_predictions) # accuracy_score expects (y_true, y_pred)

print(xgb_accuraccy)

0.7914192356868873


### Deep Learning model

Converting the strings into integers using Tokenizer

In [26]:
max_vocab = 20000000
tokenizer = Tokenizer(num_words=max_vocab)
tokenizer.fit_on_texts(X_train)

Checking the word index to find the vocabulary size of the dataset

In [27]:
wordidx = tokenizer.word_index

print(f"The size of dataset vocab is: {len(wordidx)}")

The size of dataset vocab is: 128936


Converting train and test sentences into sequences

In [28]:
train_seq = tokenizer.texts_to_sequences(X_train)
test_seq = tokenizer.texts_to_sequences(X_test)
print(f"Train sequence: {train_seq[0]}")
print(f"Test sequence: {test_seq[0]}")

Train sequence: [50, 67, 72, 36, 16, 1]
Test sequence: [675, 21]
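To make the two tokenization steps concrete, here is a simplified pure-Python sketch of the idea: words are ranked by frequency on the training texts, the most frequent word gets index 1, and words unseen during fitting are skipped when converting new texts. This is a stand-in for illustration, not the Keras implementation; it ignores punctuation filtering and the `num_words` limit, and the helper names `build_word_index` and `texts_to_sequences` are defined here only for the example.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count and keeps first-seen order for ties
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Map each known word to its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Toy texts, invented for illustration
train_texts = ["buenos dias todos", "feliz dias"]
test_texts = ["feliz semana todos"]

word_index = build_word_index(train_texts)
train_sequences = texts_to_sequences(train_texts, word_index)
test_sequences = texts_to_sequences(test_texts, word_index)

print(word_index)       # {'dias': 1, 'buenos': 2, 'todos': 3, 'feliz': 4}
print(train_sequences)  # [[2, 1, 3], [4, 1]]
print(test_sequences)   # [[4, 3]]  ("semana" was not seen in training)
```

Index 0 is never assigned to a word, which leaves it free to serve as the padding value in the next step.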


Padding the sequences to a common length, because the network expects every input to have the same shape

In [29]:
# Padding Train
pad_train = pad_sequences(train_seq)

print(f"The len of train sequence is: {pad_train.shape[1]}")


# Padding test
pad_test = pad_sequences(test_seq, maxlen=pad_train.shape[1])

print(f"The len of test sequence is: {pad_test.shape[1]}")

The len of train sequence is: 2093
The len of test sequence is: 2093
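By default `pad_sequences` pads with zeros on the left and, when `maxlen` is given, truncates longer sequences from the left as well. The pure-Python sketch below mirrors that default behavior on short lists so the effect is easy to inspect. The helper `pad_left` is hypothetical and returns plain lists rather than a NumPy array.

```python
def pad_left(sequences, maxlen=None, value=0):
    """Left-pad with `value`; longer sequences keep only their last `maxlen` items."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:] if maxlen > 0 else []
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

# Training sequences set the common length
train_seq = [[50, 67, 72, 36, 16, 1], [675, 21]]
pad_train = pad_left(train_seq)

# Test sequences are padded or truncated to the training length
pad_test = pad_left([[3, 4, 5, 6, 7, 8, 9, 10]], maxlen=len(pad_train[0]))

print(pad_train)  # [[50, 67, 72, 36, 16, 1], [0, 0, 0, 0, 675, 21]]
print(pad_test)   # [[5, 6, 7, 8, 9, 10]]
```

Since the common length defaults to the longest training sequence, a single unusually long tweet sets the width of the whole matrix (2093 here); passing an explicit `maxlen` is a common way to cap it.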


Building the neural network

In [30]:
# input_len = keras.layers.Input(shape=(pad_train.shape[1], ))

# x = keras.layers.Embedding(len(wordidx) + 1, 20)(input_len) # len(wordidx) + 1 because the indexing starts from 1, not from 0

# x = keras.layers.LSTM(25, return_sequences=True)(x)

# x = keras.layers.GlobalMaxPool1D()(x)

# x = keras.layers.Dense(32, activation="relu")(x)

# x = keras.layers.Dense(1, activation="sigmoid")(x)

# neural_network_model = keras.Model(input_len, x)

Compiling the model

In [31]:
# neural_network_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# earlystop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
# mcheckpoint = keras.callbacks.ModelCheckpoint("data/models/neural_network.h5")

Training the model

In [32]:
# history = neural_network_model.fit(pad_train, y_train, validation_data=(pad_test, y_test), epochs=10, callbacks=[earlystop, mcheckpoint])

In [None]:
# neural_network_model.save("data/models/neural_network.h5")

In [35]:
neural_network_model = keras.models.load_model("data/models/neural_network.h5")

In [36]:
neural_network_model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 2093)]            0         
                                                                 
 embedding (Embedding)       (None, 2093, 20)          2578740   
                                                                 
 lstm (LSTM)                 (None, 2093, 25)          4600      
                                                                 
 global_max_pooling1d (Globa  (None, 25)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 32)                832       
                                                                 
 dense_1 (Dense)             (None, 1)                 33        
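The parameter counts in the summary can be checked by hand from the layer sizes. The sketch below recomputes them, assuming the standard formulas for embedding, LSTM and dense layers and taking the vocabulary size from the `len(wordidx)` value printed earlier. The variable names are illustrative.

```python
vocab_size = 128936   # len(wordidx) printed earlier
embedding_dim = 20
lstm_units = 25
dense_units = 32

# Embedding: one vector per word index, plus index 0 used for padding
embedding_params = (vocab_size + 1) * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense layers: one weight per input-output pair, plus one bias per unit
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params + output_params

print(embedding_params)  # 2578740
print(lstm_params)       # 4600
print(dense_params)      # 832
print(output_params)     # 33
print(total_params)      # 2584205
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of the model.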
                                                             

In [38]:
neural_network_accuracy = neural_network_model.evaluate(pad_test, y_test)



## Final Results

In [43]:
results = pd.DataFrame({"Model" : ["Logistic Regression", "Linear Support Vector Machine", "XGBoost", "Neural Network"],
                        "Accuracy" : [log_reg_accuraccy, svc_accuraccy, xgb_accuraccy, neural_network_accuracy[1]]})

results.sort_values(by="Accuracy", ascending=False)

Unnamed: 0,Model,Accuracy
1,Linear Support Vector Machine,0.818241
3,Neural Network,0.813837
0,Logistic Regression,0.811081
2,XGBoost,0.791419


Comparing the scores to the third decimal place, the Linear Support Vector Machine is the model that generalizes best, with an accuracy of 0.818241, followed by the Neural Network and Logistic Regression with accuracies of 0.813837 and 0.811081, respectively. XGBoost trails at 0.791419. The differences between the top three scores are very narrow, so although the Linear Support Vector Machine ranks first here, the ordering could change with different input data, and the gaps between these models would likely remain small.