<a href="https://colab.research.google.com/github/Sayed-Ali-Raza-Naqvi/CodexCue_Sentiment-Analysis_Project/blob/main/CodexCue_Sentiment_Analysis_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Twitter Sentiment Analysis**

---

The dataset used is 'Twitter Sentiment Analysis' from Kaggle.

Importing all the necessary libraries and modules.

In [None]:
import pandas as pd
import re
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

Reading the datasets.

In [None]:
training_data = pd.read_csv('/content/twitter_training.csv', names=['id', 'entity', 'sentiment', 'content'])
validation_data = pd.read_csv('/content/twitter_validation.csv', names=['id', 'entity', 'sentiment', 'content'])

In [None]:
training_data.sample(5)

Unnamed: 0,id,entity,sentiment,content
45706,11845,Verizon,Neutral,Video : Just Cavs'featuring Kevin Little Love ...
1579,2676,Borderlands,Positive,"Hee, yasss! < 333"
17365,9777,PlayStation5(PS5),Positive,I like competitiveness better
61982,5026,GrandTheftAuto(GTA),Neutral,This is the Grand Theft Auto GTA That that cam...
69636,3932,Cyberpunk2077,Negative,There's only two outcomes for Cyberpunk 2077.....


In [None]:
validation_data.sample(5)

Unnamed: 0,id,entity,sentiment,content
918,930,AssassinsCreed,Positive,😅 I love you guys. 😘
766,4275,CS-GO,Neutral,Don't jump me... @CSGO #CSGO\n\n🔗 medal.tv/cli...
966,4313,CS-GO,Neutral,The Russians from cs:go are starting to invade...
727,8943,Nvidia,Neutral,"Half are political accounts, most of them I ha..."
836,10054,PlayerUnknownsBattlegrounds(PUBG),Negative,India Bans 118 Chinese apps including PUBG #PU...


Checking the shape of the datasets.

In [None]:
print(f'Training data shape: {training_data.shape}')
print(f'Validation data shape: {validation_data.shape}')

Training data shape: (74682, 4)
Validation data shape: (1000, 4)


Counting the null values in each column.


In [None]:
training_data.isnull().sum()

id             0
entity         0
sentiment      0
content      686
dtype: int64

In [None]:
validation_data.isnull().sum()

id           0
entity       0
sentiment    0
content      0
dtype: int64

Removing the rows whose content is null.

In [None]:
training_data = training_data.dropna(subset=['content'])

In [None]:
training_data.isnull().sum()

id           0
entity       0
sentiment    0
content      0
dtype: int64

In [None]:
training_data.shape

(73996, 4)
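A minimal sketch of what dropna with a subset does, on a small made-up frame (the rows below are hypothetical, not taken from the dataset). Only rows with a missing content value are dropped, and the surviving rows keep their original index labels.

```python
import pandas as pd

# Hypothetical toy frame: two of the four rows have no tweet text.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'sentiment': ['Positive', 'Neutral', 'Negative', 'Positive'],
    'content': ['great game', None, 'so many bugs', float('nan')],
})

missing_before = int(toy['content'].isnull().sum())

# Drop only the rows where 'content' is missing; other columns are not checked.
cleaned = toy.dropna(subset=['content'])

print(f'Missing content values: {missing_before}')
print(f'Rows before: {len(toy)}, rows after: {len(cleaned)}')
print(f'Remaining index labels: {cleaned.index.tolist()}')
```

The same bookkeeping holds for the real data: 74682 rows minus 686 null tweets leaves the 73996 rows shown above.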

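Encoding the sentiment labels as integers. The codes are arbitrary identifiers rather than a true ordinal scale, and the model below is trained on the original string labels in the sentiment column.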

In [None]:
ordinal_mapping = {'Neutral': 0, 'Positive': 1, 'Negative': 2, 'Irrelevant': 3}
training_data['sentiment_encoded'] = training_data['sentiment'].map(ordinal_mapping)
validation_data['sentiment_encoded'] = validation_data['sentiment'].map(ordinal_mapping)
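A small sketch of how the same mapping could be inverted to turn integer codes back into label names, and how unexpected labels could be caught before encoding. The helper names encode, decode and inverse_mapping are illustrative and not part of the notebook.

```python
# Same mapping as in the cell above.
ordinal_mapping = {'Neutral': 0, 'Positive': 1, 'Negative': 2, 'Irrelevant': 3}

# Reverse lookup: integer code -> label name.
inverse_mapping = {code: label for label, code in ordinal_mapping.items()}


def encode(labels):
    """Map label names to integer codes, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(ordinal_mapping))
    if unknown:
        raise ValueError(f'Unmapped labels: {unknown}')
    return [ordinal_mapping[label] for label in labels]


def decode(codes):
    """Map integer codes back to label names."""
    return [inverse_mapping[code] for code in codes]


codes = encode(['Neutral', 'Positive', 'Irrelevant'])
print(codes)
print(decode(codes))
```

With pandas, Series.map silently produces NaN for a label that is missing from the dictionary, so an explicit check like the one above can be useful.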

In [None]:
training_data.sample(5)

Unnamed: 0,id,entity,sentiment,content,sentiment_encoded
8089,9389,Overwatch,Neutral,anna's ult group is clc. anna is a regular acc...,0
9864,12899,Xbox(Xseries),Positive,Well fuck that's only 2 months before Xbox Ser...,1
40629,1372,Battlefield,Irrelevant,. New Video..... Flying Is Dangerous (Battlefi...,3
9899,12904,Xbox(Xseries),Positive,Boy am I chomping at a bit for this @xbox show...,1
26212,900,AssassinsCreed,Positive,Can't it wait,1


In [None]:
validation_data.sample(5)

Unnamed: 0,id,entity,sentiment,content,sentiment_encoded
838,9788,PlayStation5(PS5),Positive,Very much looking forward to getting new info ...,1
80,8121,Microsoft,Neutral,“the Free Software movement is dead. Linux doe...,0
955,4158,CS-GO,Positive,bhopping in csgo is so cozy,1
828,1857,CallOfDutyBlackopsColdWar,Negative,Score Streaks are the worst thing to happen in...,2
543,706,ApexLegends,Negative,I’m not gonna spend any more money on #apexleg...,2


Downloading NLTK resources.

stopwords: Downloads a list of common stopwords in various languages.

punkt: Downloads the Punkt tokenizer models for tokenization.

wordnet: Downloads the WordNet lexical database for tasks such as lemmatization.

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Getting English stop words.

Stop words are very common words (such as “the,” “a,” “an,” or “in”) that carry little sentiment on their own, so they are filtered out of the tweets before vectorization.

In [None]:
english_stopwords = set(stopwords.words('english'))

Instantiating WordNetLemmatizer object for lemmatization.

Lemmatization is the process of reducing words to their base or dictionary form.

In [None]:
lemmatizer = WordNetLemmatizer()

Function to preprocess the tweet column.

In [None]:
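def preprocess_content(text):
    # Lowercase the tweet and replace every non-letter character with a space.
    cleaned_text = re.sub('[^a-zA-Z]', ' ', text.lower())
    tokens = nltk.word_tokenize(cleaned_text)
    # Filter against the precomputed english_stopwords set; calling
    # stopwords.words('english') for every token rebuilds the list each time.
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in english_stopwords]
    processed_text = ' '.join(lemmatized_tokens)

    return processed_text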
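To see what the regular-expression step does on its own, here is a sketch that applies it to two tweets from the samples above. It assumes a plain whitespace split in place of nltk.word_tokenize so that it runs without any NLTK downloads. The helper name clean_text is illustrative.

```python
import re


def clean_text(text):
    # Lowercase, then replace every character that is not a-z with a space.
    return re.sub('[^a-zA-Z]', ' ', text.lower())


# Two tweets taken from the sample rows shown earlier.
tokens_a = clean_text('Hee, yasss! < 333').split()
tokens_b = clean_text("Don't jump me... @CSGO #CSGO").split()

print(tokens_a)  # ['hee', 'yasss']
print(tokens_b)  # ['don', 't', 'jump', 'me', 'csgo', 'csgo']
```

Digits, punctuation, emoji, and the @ and # symbols are all removed, and contractions such as "Don't" are split into two fragments.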

Applying the function to the content column and saving the result in a new column.

In [None]:
training_data['processed_content'] = training_data['content'].apply(preprocess_content)

In [None]:
validation_data['processed_content'] = validation_data['content'].apply(preprocess_content)

In [None]:
training_data.head()

Unnamed: 0,id,entity,sentiment,content,sentiment_encoded,processed_content
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...,1,im getting borderland murder
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...,1,coming border kill
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...,1,im getting borderland kill
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...,1,im coming borderland murder
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...,1,im getting borderland murder


In [None]:
validation_data.head()

Unnamed: 0,id,entity,sentiment,content,sentiment_encoded,processed_content
0,3364,Facebook,Irrelevant,I mentioned on Facebook that I was struggling ...,3,mentioned facebook struggling motivation go ru...
1,352,Amazon,Neutral,BBC News - Amazon boss Jeff Bezos rejects clai...,0,bbc news amazon bos jeff bezos reject claim co...
2,8312,Microsoft,Negative,@Microsoft Why do I pay for WORD when it funct...,2,microsoft pay word function poorly samsungus c...
3,4371,CS-GO,Negative,"CSGO matchmaking is so full of closet hacking,...",2,csgo matchmaking full closet hacking truly awf...
4,4433,Google,Neutral,Now the President is slapping Americans in the...,0,president slapping american face really commit...


Splitting the training data into 80% for training and 20% for testing.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(training_data['processed_content'], training_data['sentiment'],
                                                    test_size=0.2, random_state=42)

In [None]:
X_train.shape, X_test.shape

((59196,), (14800,))
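The split sizes above can be reproduced by hand. This is a small sketch, assuming train_test_split rounds the test count up and gives the remainder to the training set, which agrees with the shapes printed above.

```python
import math

n_samples = 73996   # rows left after dropping null tweets
test_size = 0.2

# Assumed rounding rule: test count is rounded up, training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 59196 14800
```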

Using a Pipeline to chain TF-IDF vectorization and model training.

TfidfVectorizer converts text data into a matrix of TF-IDF features.

Using Logistic Regression as the classifier.

In [None]:
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])
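To make the TF-IDF step less of a black box, here is a hand-rolled sketch in plain Python on three processed tweets from the table above. It assumes the scikit-learn defaults as I understand them: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. It is an illustration, not a replacement for TfidfVectorizer, which also applies its own tokenization (for example, it ignores single-character tokens by default).

```python
import math
from collections import Counter

# Three processed tweets from the training_data.head() output above.
docs = [
    'im getting borderland murder',
    'coming border kill',
    'im getting borderland kill',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many tweets each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one tweet."""
    counts = Counter(tokens)
    weights = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

Terms that occur in only one tweet ("coming", "border", "murder") receive a larger weight than terms shared by several tweets ("im", "kill").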

Fitting the pipeline on the training split.

In [None]:
model.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Predicting on the test split and computing the accuracy score.

In [None]:
prediction = model.predict(X_test)

In [None]:
accuracy = accuracy_score(Y_test, prediction)
print(f'Accuracy Score: {accuracy}')

Accuracy Score: 0.7772972972972974


In [None]:
print(classification_report(Y_test, prediction))

              precision    recall  f1-score   support

  Irrelevant       0.83      0.66      0.74      2696
    Negative       0.75      0.86      0.80      4380
     Neutral       0.80      0.73      0.76      3605
    Positive       0.77      0.81      0.79      4119

    accuracy                           0.78     14800
   macro avg       0.79      0.76      0.77     14800
weighted avg       0.78      0.78      0.78     14800
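The two average rows can be checked by hand from the per-class rows. The sketch below recomputes them from the rounded values printed above. Because the report itself averages the unrounded scores, the last digit of a recomputed value can differ slightly from the table.

```python
# Per-class scores copied from the report above: (precision, recall, f1, support).
report = {
    'Irrelevant': (0.83, 0.66, 0.74, 2696),
    'Negative':   (0.75, 0.86, 0.80, 4380),
    'Neutral':    (0.80, 0.73, 0.76, 3605),
    'Positive':   (0.77, 0.81, 0.79, 4119),
}
total_support = sum(row[3] for row in report.values())


def macro_avg(col):
    # Unweighted mean: every class counts equally.
    return sum(row[col] for row in report.values()) / len(report)


def weighted_avg(col):
    # Mean weighted by the number of test samples in each class.
    return sum(row[col] * row[3] for row in report.values()) / total_support


macro = [macro_avg(col) for col in range(3)]
weighted = [weighted_avg(col) for col in range(3)]

print(total_support)  # 14800
print('macro   :', [f'{value:.4f}' for value in macro])
print('weighted:', [f'{value:.4f}' for value in weighted])
```

The macro average treats the smaller Irrelevant class the same as the larger Negative class, while the weighted average follows the class sizes, which is why the two rows differ.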



Importing pickle to save the model for future predictions and for deployment on the website.

In [None]:
import pickle

In [None]:
filename = 'twitter_sentiment_analysis_model.pkl'
# Use a context manager so the file handle is closed after writing.
with open(filename, 'wb') as f:
    pickle.dump(model, f)
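A sketch of the round trip: saving a fitted pipeline and loading it back for prediction. It uses a tiny stand-in pipeline trained on made-up texts and a temporary directory so that it runs on its own. The toy data and variable names are assumptions, not part of the notebook.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the trained model.
toy_texts = ['love game', 'great update', 'hate bug', 'terrible server']
toy_labels = ['Positive', 'Positive', 'Negative', 'Negative']
toy_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
]).fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'twitter_sentiment_analysis_model.pkl')

    # Save the whole pipeline (vectorizer and classifier together).
    with open(path, 'wb') as f:
        pickle.dump(toy_model, f)

    # Load it back, as a deployment script would.
    with open(path, 'rb') as f:
        loaded_model = pickle.load(f)

original_predictions = list(toy_model.predict(toy_texts))
loaded_predictions = list(loaded_model.predict(toy_texts))
print(loaded_predictions == original_predictions)
```

Because the pipeline was trained on processed text, new tweets would need to pass through preprocess_content before being given to the loaded model. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.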