# Natural Language Processing Project
----------------------------

# Sepideh Khalafi
------------------------------

## About Dataset

This is a labeled sentiment analysis dataset of tweets on different topics from Twitter ([Kaggle](https://www.kaggle.com/datasets/tariqsays/sentiment-dataset-with-1-million-tweets), 2022). Each text is labeled with one of four categories.

---------------------------------

## The Problem
This task is supervised sentiment analysis on tweets. We will build an NLP pipeline to classify the sentiment of a given text. The target language is **English**. We need to extract the English tweets and their labels to build a classification model that predicts the sentiment of a given tweet.


----------------
## Importing Libraries

* *pandas* is for table processing
* *nltk* and *re* have been used for processing the text
* *numpy* is used for handling array-form data
* *sklearn.model_selection* is to split the data
* *imblearn.under_sampling* is for resampling the data
* *sklearn.preprocessing* is used for encoding target labels.
* *sklearn.feature_extraction* has been used for vectorization
* *sklearn.naive_bayes* and *sklearn.ensemble* are used for classification.
* *gensim* is used to download the pretrained model for word2vec

In [1]:
import pandas
import nltk
import re
import numpy

import sklearn.model_selection
import sklearn.metrics
import sklearn.naive_bayes
import sklearn.ensemble
import sklearn.preprocessing
import imblearn.under_sampling

import sklearn.feature_extraction.text
import gensim.downloader

## Data Acquisition and Preprocessing

In [2]:
tweets = pandas.read_csv("../datasets/my_datasets/dataset.csv")

print(tweets.shape)
tweets.head()

(937854, 3)


| | Text | Language | Label |
|---|---|---|---|
| 0 | @Charlie_Corley @Kristine1G @amyklobuchar @Sty... | en | litigious |
| 1 | #BadBunny: Como dos gotas de agua: Joven se di... | es | negative |
| 2 | https://t.co/YJNiO0p1JV Flagstar Bank disclose... | en | litigious |
| 3 | Rwanda is set to host the headquarters of Unit... | en | positive |
| 4 | OOPS. I typed her name incorrectly (today’s br... | en | litigious |


This is a large dataset of 937854 texts from Twitter with only 3 columns: *Text*, which we will work on; *Language*, indicating the language of the text; and *Label*, indicating the sentiment, which is our target.

The target is of four different classes:
- negative
- positive
- litigious
- uncertainty

Using *groupby* on the *Language* column and then *counting*, we can find out how many tweets there are for each language. The result is sorted alphabetically by default.

In [3]:
tweets.groupby(by = "Language").count()

| Language | Text | Label |
|---|---|---|
| `[Photo(previewUrl='https://pbs.twimg.com/media/EPYG2rKVAAA1e_O?format=jpg&name=small', fullUrl='https://pbs.twimg.com/media/EPYG2rKVAAA1e_O?format=jpg&name=large')]` | 1 | 1 |
| `[Photo(previewUrl='https://pbs.twimg.com/media/FWV94O7UEAAjMOi?format=jpg&name=small', fullUrl='https://pbs.twimg.com/media/FWV94O7UEAAjMOi?format=jpg&name=large')]` | 1 | 1 |
| `[Photo(previewUrl='https://pbs.twimg.com/media/FWVRZ6bWYAAM8-3?format=jpg&name=small', fullUrl='https://pbs.twimg.com/media/FWVRZ6bWYAAM8-3?format=jpg&name=large')]` | 1 | 1 |
| `[Photo(previewUrl='https://pbs.twimg.com/media/FWVhrnXX0AE0h9Z?format=jpg&name=small', fullUrl='https://pbs.twimg.com/media/FWVhrnXX0AE0h9Z?format=jpg&name=large')]` | 1 | 1 |
| `[Photo(previewUrl='https://pbs.twimg.com/media/FWWBzWTXkAAyZqm?format=jpg&name=small', fullUrl='https://pbs.twimg.com/media/FWWBzWTXkAAyZqm?format=jpg&name=large')]` | 1 | 1 |
| ... | ... | ... |
| und | 2702 | 2702 |
| ur | 42 | 42 |
| vi | 32 | 32 |
| zh | 248 | 248 |


There are 72 different values in the *Language* column, and some of them are not valid languages. We only need the *English* tweets, so we count how many there are.

In [4]:
tweets.loc[tweets["Language"] == "en"].count()

Text        871310
Language    871310
Label       871310
dtype: int64

Out of all 937854 tweets, 871310 are in *English*. After selecting the English tweets we no longer need the *Language* column, so we can drop it.

In [5]:
tweets_en = tweets.loc[tweets["Language"] == "en"]
tweets_en = tweets_en.drop(["Language"], axis = 1)
tweets_en.shape

(871310, 2)
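The language check in the cells above can also be sketched with `nunique` and `value_counts`, which report the number of distinct values and the most frequent ones directly. This is only an illustration on a made-up frame: `toy_tweets` and its rows are assumptions, not data from the notebook.

```python
import pandas

# Toy stand-in for the notebook's `tweets` frame; every row here is invented.
toy_tweets = pandas.DataFrame({
    "Text": ["good day", "buen dia", "bad news", "???", "fine"],
    "Language": ["en", "es", "en", "[Photo(...)]", "en"],
    "Label": ["positive", "positive", "negative", "uncertainty", "positive"],
})

# Number of distinct values in the Language column (valid codes and junk alike).
n_languages = toy_tweets["Language"].nunique()

# Tweets per language value, most frequent first.
language_counts = toy_tweets["Language"].value_counts()

# Keep only the English rows with a boolean mask.
english_only = toy_tweets[toy_tweets["Language"] == "en"]

print(n_languages)
print(language_counts.head())
print(len(english_only))
```

Sorting by frequency instead of alphabetically tends to push malformed values such as the `Photo(...)` entries to the bottom of the listing.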

Since this is a large dataset and the available hardware is not enough to process all of it, we take a sample of 5000. The data is textual and the task is time-consuming (even at this size, training the Gradient Boosting model takes almost an hour).

We then define the *X* and *Y* which are both only one column of the dataset.

In [6]:
tweets_en = tweets_en.sample(n = 5000)

tweets_x = tweets_en["Text"]
tweets_y = tweets_en["Label"]

train_x, test_x, train_y, test_y = sklearn.model_selection.train_test_split(tweets_x, tweets_y)

print("shape of train: ", train_x.shape)
print("shape of test: ", test_x.shape)

shape of train:  (3750,)
shape of test:  (1250,)
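The sample above is drawn without a fixed seed, so every run of the notebook works on different tweets and the reported numbers will shift slightly. A minimal sketch of a repeatable sample follows, assuming a toy frame (`toy`) and an arbitrary seed of 42. `train_test_split` accepts a `random_state` argument as well, along with `stratify` to keep class proportions equal in both splits.

```python
import pandas

# Toy frame standing in for `tweets_en`; the rows are invented for illustration.
toy = pandas.DataFrame({
    "Text": [f"tweet {i}" for i in range(20)],
    "Label": ["positive", "negative"] * 10,
})

# A fixed random_state (42 is an arbitrary choice) makes the sample repeatable.
sample_a = toy.sample(n=8, random_state=42)
sample_b = toy.sample(n=8, random_state=42)

# Both draws select exactly the same rows.
print(sample_a.index.tolist() == sample_b.index.tolist())
```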


In [7]:
tweets_x.isna().sum()

0

There are no *Null* values in the dataset.

Let's look at the classes of the target label and how they are distributed.

In [8]:
train_y.value_counts()

negative       1063
positive       1060
uncertainty     865
litigious       762
Name: Label, dtype: int64

The proportion of the target label is not too imbalanced. But it would be better to have a balanced one. I took a large enough sample from the original dataset, so that I can do undersampling here with enough data points to avoid dealing with artificially generated data.

The undersampler expects a 2D feature array, and passing the 1D text column raised an error, so I reshaped the train set into a single column.

In [9]:
under_sampling = imblearn.under_sampling.RandomUnderSampler()
under_x, under_y = under_sampling.fit_resample(train_x.values.reshape(-1,1), train_y)

print(f"the shape of train_x after undersampling is: {under_x.shape}")
print(f"the shape of train_y after undersampling is: {under_y.shape}")

the shape of train_x after undersampling is: (3048, 1)
the shape of train_y after undersampling is: (3048,)
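Random undersampling keeps as many rows per class as the smallest class has, which is consistent with the shape printed above: the smallest class (litigious, 762 rows) times four classes gives 3048. The same idea can be sketched with pandas alone. This is a swapped-in technique for illustration, not the `RandomUnderSampler` call used in the notebook, and `toy_train` is a made-up frame.

```python
import pandas

# Made-up, imbalanced training frame: 5, 4, 3 and 2 rows per class.
toy_train = pandas.DataFrame({
    "Text": [f"tweet {i}" for i in range(14)],
    "Label": (["negative"] * 5 + ["positive"] * 4
              + ["uncertainty"] * 3 + ["litigious"] * 2),
})

# Size of the rarest class; every class is cut down to this size.
smallest = toy_train["Label"].value_counts().min()

# Draw `smallest` rows at random from each class (seed 0 is arbitrary).
balanced = toy_train.groupby("Label").sample(n=smallest, random_state=0)

print(balanced["Label"].value_counts())
```

Because the text column stays a pandas Series here, no reshaping to 2D and back is needed in this variant.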


The undersampled data is a *numpy.ndarray*, and our preprocessing function is applied through a pandas Series. So we flatten the array back to 1D and convert it to a Series object.

In [10]:
train_x = pandas.Series(under_x.reshape(-1))
print(f"type of the train set is: {type(train_x)}")
train_x.shape

type of the train set is: <class 'pandas.core.series.Series'>


(3048,)

Here we define a function to clean and process the text for vectorization.

* First, we lowercase all the letters.
* Using the *sub* function of the *re* library, I remove numbers and digits, since they do not express sentiment.
* Then, using the *RegexpTokenizer* class of the *NLTK* library, we find all the tokens. The special characters and spaces that I have defined are eliminated.
* Next, we drop the stopwords and reduce each remaining word to its stem, meaning its affixes (mainly suffixes) are stripped.
* Finally, we join the stemmed words back into a single text and return it.

I defined a small list of **stopwords**. The list that ships with *nltk* contains many words that can carry sentiment, such as negated verbs. So I decided to define a shorter one containing only pronouns and similar function words, without verbs, based on nltk's list.

In [11]:
stopwords = ['i',
             'me',
             'my',
             'myself',
             'we',
             'our',
             'ours',
             'ourselves',
             'you',
             'your',
             'yours',
             'yourself',
             'yourselves',
             'he',
             'him',
             'his',
             'himself',
             'she',
             'her',
             'hers',
             'herself',
             'it',
             'its',
             'itself',
             'they',
             'them',
             'their',
             'theirs',
             'themselves',
             'what',
             'which',
             'who',
             'whom',
             'this',
             'that',
             'these',
             'those'
            ]


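def tweet_preprocess(text):
    # Lowercase so that "Good" and "good" become the same token.
    text = text.lower()
    # Remove runs of digits; numbers are assumed not to carry sentiment.
    text = re.sub(r"\d+", "", text)
    # A token is a run of word characters plus any trailing characters
    # that are not in the excluded set (spaces, digits and the listed symbols).
    tokenizer = nltk.RegexpTokenizer(r"\w+[^@*!#$.%&? (0-9)]*")
    words = tokenizer.tokenize(text)
    snowball = nltk.stem.snowball.SnowballStemmer(language = "english")
    # Drop the custom stopwords, then stem what is left.
    stemmed_words = [snowball.stem(w) for w in words if w not in stopwords]
    processed_text = " ".join(stemmed_words)
    return processed_text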

In [12]:
train_x = train_x.apply(tweet_preprocess)
train_x.shape

(3048,)

Since we would have to do this part for the test set anyway, we can do it here.

In [13]:
test_x = test_x.apply(tweet_preprocess)
test_x.shape

(1250,)
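To see what the tokenizer pattern in `tweet_preprocess` keeps and drops, the first steps can be sketched with the standard `re` module only. This assumes that `re.findall` behaves like `RegexpTokenizer` for this pattern, which should hold because the pattern has no groups or anchors. Stemming is left out, and `SMALL_STOPWORDS` is a tiny illustrative subset of the notebook's list.

```python
import re

# Same pattern as in tweet_preprocess.
TOKEN_PATTERN = r"\w+[^@*!#$.%&? (0-9)]*"

# Tiny illustrative subset of the notebook's stopword list.
SMALL_STOPWORDS = {"i", "this", "it"}

def rough_tokens(text):
    """Lowercase, drop digits, tokenize and remove stopwords (no stemming)."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    tokens = re.findall(TOKEN_PATTERN, text)
    return [t for t in tokens if t not in SMALL_STOPWORDS]

print(rough_tokens("I love this!! @user #great day 2022"))
# ['love', 'user', 'great', 'day']
print(rough_tokens("Hello, world"))
# ['hello,', 'world']
```

The second call shows a side effect of the pattern: the comma is not in the excluded set, so it stays attached to `hello`. Adding more punctuation to the excluded set would be one way to tighten the cleaning.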

## Feature Engineering, and Text Representation



We need to represent our text with vectors to be able to process it by the model.

### TF-IDF Vectorizer

TF-IDF returns a sparse matrix, so we convert it to a dense array.

In [14]:
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer()
vectorize_x = vectorizer.fit_transform(train_x).toarray()
print(vectorize_x.shape)

(3048, 13438)
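To make the numbers inside that matrix less opaque, here is a hand-rolled sketch of the weighting on three tiny documents. It assumes the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation that, as far as I know, are the `TfidfVectorizer` defaults. The documents and all names are made up.

```python
import math
from collections import Counter

docs = ["good good day", "bad day", "good news"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Term count times idf for every vocabulary term, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Each row has one entry per vocabulary term, which is why the real matrix has 13438 columns: one per distinct token in the training tweets.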


### Word2Vec Vectors

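I chose a pretrained model that is loaded through *gensim*'s downloader: `glove-wiki-gigaword-300` ([hugging face](https://huggingface.co/fse/glove-wiki-gigaword-300), 2022). These are 300-dimensional GloVe vectors trained on Wikipedia and Gigaword text. Note that the description "Pre-trained glove vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased." belongs to the separate `glove-twitter` models and not to the model loaded here. The vectors are referred to as word2vec throughout this notebook, although strictly they are GloVe embeddings used through gensim's word-vector interface.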

In [15]:
word2vec_pretrained = gensim.downloader.load("glove-wiki-gigaword-300")

Word embedding models like this one work on individual words, not on sentences or phrases. We need to look up each word and average the word vectors for each text. I defined a function for this purpose. (We have already processed the text. This is just for vectorizing.)

First we create a tweet-aware tokenizer (*TweetTokenizer*) to split the training texts inside our function. It is defined outside the function so that it is built only once.

In the function:
- First we tokenize the text.
- Then we create a zero vector of size 300, matching the dimensionality of the pretrained vectors. In the for loop we add each word's vector, as given by the word2vec object, to this vector.
- We keep a counter of the words found in the model's vocabulary, and at the end we divide the summed vector by it to get the average.
- The averaged values can be negative and the Multinomial Naive Bayes classifier cannot deal with negative features, so I added a constant of 10 (large enough) to make every value positive.
- The function returns a vector holding the average of the word vectors in the text.

In [16]:
tokenizer = nltk.tokenize.TweetTokenizer()

def w2v_vectorizer(text):
    words = tokenizer.tokenize(text)
    vector = numpy.zeros(300)
    word_numbers = 0
    for w in words:
        if w in word2vec_pretrained:
            vector += word2vec_pretrained[w]
            word_numbers += 1
    if word_numbers > 0:
        vector /= word_numbers
        vector += 10
    return vector

In [17]:
w2v = train_x.apply(w2v_vectorizer)
print(f"the shape of the train set after applying word2vec vectorization: {w2v.shape}")
print(f"the length of each data point: {len(w2v[0])}")

the shape of the train set after applying word2vec vectorization: (3048,)
the length of each data point: 300
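The averaging step is easier to follow on a toy example. The sketch below uses a hypothetical dictionary of 3-dimensional embeddings (`toy_embeddings`) in place of the 300-dimensional pretrained model, and mirrors the notebook's logic, including the shift of 10 that is applied only when at least one word is found.

```python
import numpy

# Hypothetical 3-dimensional embeddings; the real model has 300 dimensions.
toy_embeddings = {
    "good": numpy.array([1.0, 0.0, -1.0]),
    "day": numpy.array([0.0, 2.0, 1.0]),
}

def average_vector(tokens, embeddings, dim=3, shift=10.0):
    """Mean of the known word vectors, shifted to be positive."""
    total = numpy.zeros(dim)
    found = 0
    for token in tokens:
        if token in embeddings:
            total += embeddings[token]
            found += 1
    if found == 0:
        return total  # no known words: stays all zeros, as in the notebook
    return total / found + shift

print(average_vector(["good", "day", "unknownword"], toy_embeddings))
print(average_vector(["day", "good"], toy_embeddings))
print(average_vector(["unknownword"], toy_embeddings))
```

The first two calls return the same vector: averaging discards word order and ignores unknown words, which may partly explain the weaker results reported for this representation later on.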


The outcome is a Series object with only one feature, and this feature is an array of size 300. sklearn models cannot deal with it, and it generates an error when this dataset is used. So I have to put every value of the arrays in its own feature of a dataframe. I defined a zero matrix with as many rows as the training dataset and 300 columns. While iterating over the rows of the dataset, I added each array to the matching row of the matrix. In the end I converted the matrix to a dataframe.

This could not be done by simply converting the output to a dataframe in the first place. Each value had to be extracted from the data points.

In [18]:
matrix = numpy.zeros((len(w2v.axes[0]), 300))

for i in range(len(w2v.axes[0])):
    matrix[i] = matrix[i] + w2v[i]
  

In [19]:
w2v_train_x = pandas.DataFrame(matrix)

In [20]:
w2v_train_x.shape

(3048, 300)
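The row-by-row copy above can probably be written in one call with `numpy.vstack`, which stacks a sequence of equal-length 1D arrays into a 2D matrix. A minimal sketch on toy vectors of length 4 follows; for the notebook's Series, passing `w2v.tolist()` should work the same way, though that is untested here.

```python
import numpy

# Three toy "document vectors" of length 4, standing in for the 300-dimensional ones.
vectors = [numpy.full(4, 1.0), numpy.full(4, 2.0), numpy.full(4, 3.0)]

# One row per document, one column per vector component.
stacked = numpy.vstack(vectors)

print(stacked.shape)
```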

## Modeling

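For modeling we try a simple classifier, Naive Bayes, and an ensemble classifier, Gradient Boosting, each with both vectorizations, to see which one performs better. A transformer from a simple transformer library was also attempted, but it had to be dropped, as described in its own section.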


### Naive Bayes and TF-IDF

The Multinomial Naive Bayes classifier does not have many hyperparameters. I tried a few values for the *alpha* (smoothing) parameter.

Here we pass the dataset vectorized by TF-IDF.

In [21]:
naive_parameters = {"alpha" : [1.7, 1.8, 1.9, 2.0, 2.5, 3.0, 3.5, 4.0]}
naive_classifier = sklearn.naive_bayes.MultinomialNB()
naive_classifier_model_tf = sklearn.model_selection.GridSearchCV(naive_classifier, 
                                                              naive_parameters, 
                                                              scoring = "accuracy",
                                                              cv = 5,
                                                              n_jobs = -1
                                                             )
naive_classifier_model_tf.fit(vectorize_x, under_y)

print("TF-IDF vectorizer:")
print(f"best accuracy of 'naive bayes' model with Tf_IDf is {naive_classifier_model_tf.best_score_}")
print(f"best parameters for 'naive bayes' model with Tf_IDf is{naive_classifier_model_tf.best_params_}")

TF-IDF vectorizer:
best accuracy of 'naive bayes' model with Tf_IDf is 0.86024011413497
best parameters for 'naive bayes' model with Tf_IDf is{'alpha': 4.0}


Raising *alpha* showed no significant improvement.
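For context, *alpha* is the additive smoothing term in the Naive Bayes likelihood estimate, P(word | class) = (count + alpha) / (class total + alpha * vocabulary size). The sketch below evaluates that formula for a word never seen in a class. The counts are invented for illustration, and with TF-IDF input the "counts" are fractional weights rather than integers.

```python
def smoothed_likelihood(word_count, class_total, vocab_size, alpha):
    """Additive (Lidstone) smoothing estimate of P(word | class)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Illustrative numbers: an unseen word, a class with 100 total feature
# counts, and a vocabulary of 50 terms.
for alpha in (0.2, 1.0, 4.0):
    print(alpha, round(smoothed_likelihood(0, 100, 50, alpha), 4))
```

Larger values of alpha move every likelihood toward a uniform distribution, so unseen words are penalised less but genuinely informative words also count for less.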

### Naive Bayes and word2vec

Here I do the same, only this time the vectorizer is the function I defined using the pretrained word2vec model.

In [22]:
naive_parameters = {"alpha" : [0.2, 0.3, 0.4, 0.5]}
naive_classifier = sklearn.naive_bayes.MultinomialNB()
naive_classifier_model_w2v = sklearn.model_selection.GridSearchCV(naive_classifier, 
                                                                  naive_parameters, 
                                                                  scoring = "accuracy",
                                                                  cv = 5,
                                                                  n_jobs = -1
                                                                 )


naive_classifier_model_w2v.fit(w2v_train_x, under_y)

print("word2vec vectorizer:")
print(f"best accuracy of 'naive bayes' model with word2vec is {naive_classifier_model_w2v.best_score_}")
print(f"best parameters for 'naive bayes' model with word2vec is{naive_classifier_model_w2v.best_params_}")

word2vec vectorizer:
best accuracy of 'naive bayes' model with word2vec is 0.4849056502193868
best parameters for 'naive bayes' model with word2vec is{'alpha': 0.2}


### Gradient Boosting and TF-IDF

For this classifier I tried to find the best values for these parameters: *learning_rate*, which scales the contribution of each tree; *criterion*, which defines the function used to measure the quality of a split; and *max_depth*, which limits the maximum depth of each tree. (It takes almost an hour to run and get results.)

In [23]:
gboosting_parameters = {"learning_rate" : [0.1, 0.2, 0.3, 0.4, 0.5],
                        "criterion" : ["friedman_mse", "squared_error"],
                        "max_depth" : (2, 5)
                       }
gboosting = sklearn.ensemble.GradientBoostingClassifier()
gboosting_classifier_tf = sklearn.model_selection.GridSearchCV(gboosting, 
                                                            gboosting_parameters,
                                                            scoring = "accuracy",
                                                            cv = 5,
                                                            n_jobs = -1
                                                           )

gboosting_classifier_tf.fit(vectorize_x, under_y)

print("TF-IDF vectorizer:")
print(f"best accuracy of 'gradient boosting' model with Tf_IDf is {gboosting_classifier_tf.best_score_}")
print(f"best parameters for 'gradient boosting' model with Tf_IDf is{gboosting_classifier_tf.best_params_}")

TF-IDF vectorizer:
best accuracy of 'gradient boosting' model with Tf_IDf is 0.9609588414223801
best parameters for 'gradient boosting' model with Tf_IDf is{'criterion': 'friedman_mse', 'learning_rate': 0.5, 'max_depth': 5}
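The long running time follows from the size of the search. A short sketch counting the model fits implied by the grid above, using only the standard library:

```python
from itertools import product

# Same grid as in the notebook cell.
grid = {
    "learning_rate": [0.1, 0.2, 0.3, 0.4, 0.5],
    "criterion": ["friedman_mse", "squared_error"],
    "max_depth": (2, 5),
}
cv_folds = 5

# Every combination of one value per parameter.
combinations = list(product(*grid.values()))

# Each combination is trained once per cross-validation fold.
total_fits = len(combinations) * cv_folds

print(len(combinations), total_fits)
```

That is 20 parameter combinations and 100 Gradient Boosting fits on a 13438-column dense matrix, plus one final refit on the full training set if the default refit behaviour is kept. Trimming the grid or the number of folds would be the most direct way to shorten the run.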


### Gradient Boosting and Word2vec

In [24]:
gboosting_parameters = {"learning_rate" : [0.1, 0.2, 0.3, 0.4, 0.5],
                        "criterion" : ["friedman_mse", "squared_error"],
                        "max_depth" : (2, 5)
                       }
gboosting = sklearn.ensemble.GradientBoostingClassifier()
gboosting_classifier_w2v = sklearn.model_selection.GridSearchCV(gboosting, 
                                                            gboosting_parameters,
                                                            scoring = "accuracy",
                                                            cv = 5,
                                                            n_jobs = -1
                                                           )

gboosting_classifier_w2v.fit(w2v_train_x, under_y)

print("Word2vec vectorizer:")
print(f"best accuracy of 'gradient boosting' model with word2vec is {gboosting_classifier_w2v.best_score_}")
print(f"best parameters for 'gradient boosting' model with word2vec is{gboosting_classifier_w2v.best_params_}")

Word2vec vectorizer:
best accuracy of 'gradient boosting' model with word2vec is 0.6407580284799053
best parameters for 'gradient boosting' model with word2vec is{'criterion': 'friedman_mse', 'learning_rate': 0.5, 'max_depth': 5}


### Simple transformer

Simple transformers take a *dataframe* of raw text as input rather than vectorized features. Using the *train_test_split* function on the training set, I defined a new train and validation set to check the performance of our model. After setting the column names and passing the x and y to the dataframe, I checked for null values to see if everything was done right.

The transformer arguments I defined are the number of epochs and an option to overwrite the output in the directory used.

In the final model, after trying out a few different parameters, I chose the *roberta* model and passed its name. I specified the number of target classes, and I needed to specify that CUDA should not be used.

Unfortunately, the model did not work on more than **35** data points. Due to this problem I had to eliminate this part of the model training.

I also had to eliminate the part where we encoded the target label, since we do not use this model any more.

## Evaluation

The best performance is for Gradient Boosting using TF-IDF vectorizer.

Now we have to test it on our unseen dataset to evaluate this model.

In [25]:
test_vector = vectorizer.transform(test_x).toarray()
test_vector.shape

(1250, 13438)

In [26]:
predicted_y = gboosting_classifier_tf.predict(test_vector)
predicted_y.shape

(1250,)

In [27]:
accuracy = sklearn.metrics.accuracy_score(test_y, predicted_y)
accuracy

0.9536

The accuracy score is high for the test set too. More than 95% of the labels were predicted right.

In [28]:
precision = sklearn.metrics.precision_score(test_y, predicted_y, average = None)
precision

array([0.95849057, 0.94827586, 0.97667638, 0.92857143])

The precision score for a class is the fraction of tweets predicted as that class that truly belong to it, so it drops when tweets from other classes are wrongly assigned to it. It is high for every class in this model.

In [29]:
recall = sklearn.metrics.recall_score(test_y, predicted_y, average = None)
recall

array([0.96212121, 0.94827586, 0.95988539, 0.94463668])

The recall score for a class is the fraction of tweets truly belonging to that class that the model found. The high recall for each class indicates that few tweets of any class were missed.
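The two metrics above can be sketched by hand on a few made-up labels, which makes the per-class arrays easier to read. If I read the scikit-learn convention correctly, with `average = None` the scores come back in sorted label order, which here would be litigious, negative, positive, uncertainty. The helper name and the labels below are illustrative only.

```python
def per_class_precision_recall(true_labels, predicted_labels):
    """Return {class: (precision, recall)} computed from two label lists."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    scores = {}
    for c in classes:
        # Tweets of class c that were also predicted as c.
        tp = sum(t == c and p == c for t, p in zip(true_labels, predicted_labels))
        predicted_c = sum(p == c for p in predicted_labels)
        actual_c = sum(t == c for t in true_labels)
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        scores[c] = (precision, recall)
    return scores

y_true = ["positive", "positive", "negative", "negative", "litigious"]
y_pred = ["positive", "negative", "negative", "negative", "litigious"]
scores = per_class_precision_recall(y_true, y_pred)

for label, (precision, recall) in scores.items():
    print(label, round(precision, 3), round(recall, 3))
```

In this toy case one positive tweet is predicted as negative, which lowers the recall of the positive class and the precision of the negative class while leaving litigious untouched.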

## Conclusion

This pipeline is a supervised sentiment analysis on a sample dataset of 5000 data points. The original dataset was about a million data points containing different tweets in different languages. The modeling is done only on *English* language. 

To find the best classification model, I used two different vectorization methods and two different models.

The models performed better with TF-IDF vectorization. Since the word2vec vectorizer basically works on words and not on texts, my function averages the word vectors of the given text, and this results in poor performance because the sentiment of individual words is blurred and cannot be kept.

This model works with high accuracy, but in a real business setting it would need to be close to perfect. It could be monitored on a regular basis and improved.
- We could improve the stopword list, and we could deal with noisy words that are not in the dictionary but express emotion, for example "looooove".
- We could define an emoji corpus, since emojis can affect the sentiment analysis.
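As a rough illustration of the first idea, elongated words can be shortened with a backreference pattern before tokenizing. This is only a sketch of one possible approach, and the function name is made up. It collapses runs of three or more identical characters to two, which leaves ordinary double letters such as "good" alone.

```python
import re

def squeeze_repeats(text):
    """Collapse any character repeated three or more times down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(squeeze_repeats("i looooove this sooo much"))
# i loove this soo much
```

The result ("loove") is still not a dictionary word, so a further step such as a spelling lookup would be needed to map it to "love"; at least all elongated variants now share a single token.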

## References

* Analytics Vidhya website (2022) 'Beginners Guide to Regular Expressions in Natural Language Processing'. Available at: https://www.analyticsvidhya.com/blog/2021/03/beginners-guide-to-regular-expressions-in-natural-language-processing/ (Accessed: 01.12.2022)
* Hugging Face website (2022) 'Transformers'. Available at: https://huggingface.co/docs/transformers/index (Accessed: 20.11.2022)
* imbalanced learn website (2022) 'RandomUnderSampler'. Available at: https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html#imblearn.under_sampling.RandomUnderSampler (Accessed: 12.12.2022)
* matplotlib (2022) 'Pyplot tutorial'. Available at: https://matplotlib.org/stable/tutorials/introductory/pyplot.html (Accessed: 20.11.2022)
* MUHAMMAD TARIQ (kaggle website) (2022) 'Sentiment Dataset with 1 Million Tweets' Available at: https://www.kaggle.com/datasets/tariqsays/sentiment-dataset-with-1-million-tweets (Accessed: 02.11.2022)
* Pandas website (2022) 'Pandas User Guide'. Available at: https://pandas.pydata.org/docs/user_guide/index.html (Accessed: 11.12.2022)
* scikit learn website (2022) 'sklearn.model_selection'. Available at: https://scikit-learn.org/stable/model_selection.html (Accessed: 17.11.2022)
* Stack Overflow website (2022) Available at: https://stackoverflow.com/ (Accessed: 16.12.2022)
