## Twitter Sentiment Analysis

This notebook uses Natural Language Processing (NLP) techniques to classify the sentiment expressed in tweets. Each tweet in the dataset is labeled Positive, Negative, Neutral, or Irrelevant, and one TF-IDF + LinearSVC model is trained per source (the game, brand, or company the tweet is about).

The goal is to gauge public opinion on specific topics, events, or brands from the tweets written about them.

In [9]:
# import libraries

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import joblib
import os


In [10]:
# download necessary NLTK data

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [11]:
# load the datasets (the CSV files have no header row, so column names are supplied)

columns = ['Serial_number', 'Source', 'Sentiments', 'Text']

train_data = pd.read_csv("D:/sentiment_analysis/data/twitter_training.csv", names=columns)

valid_data = pd.read_csv("D:/sentiment_analysis/data/twitter_validation.csv", names=columns)

In [12]:
print(f"Validation data shape: {valid_data.shape}")
print(valid_data.head())

Validation data shape: (1000, 4)
   Serial_number     Source  Sentiments  \
0           3364   Facebook  Irrelevant   
1            352     Amazon     Neutral   
2           8312  Microsoft    Negative   
3           4371      CS-GO    Negative   
4           4433     Google     Neutral   

                                                Text  
0  I mentioned on Facebook that I was struggling ...  
1  BBC News - Amazon boss Jeff Bezos rejects clai...  
2  @Microsoft Why do I pay for WORD when it funct...  
3  CSGO matchmaking is so full of closet hacking,...  
4  Now the President is slapping Americans in the...  


In [13]:
print(f"Training data shape: {train_data.shape}")
print(train_data.head())

Training data shape: (74682, 4)
   Serial_number       Source Sentiments  \
0           2401  Borderlands   Positive   
1           2401  Borderlands   Positive   
2           2401  Borderlands   Positive   
3           2401  Borderlands   Positive   
4           2401  Borderlands   Positive   

                                                Text  
0  im getting on borderlands and i will murder yo...  
1  I am coming to the borders and I will kill you...  
2  im getting on borderlands and i will kill you ...  
3  im coming on borderlands and i will murder you...  
4  im getting on borderlands 2 and i will murder ...  
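
The first rows above share a `Serial_number` and carry near-identical text, and `preprocess_data` wraps its input in `str()`, which would turn a missing tweet into the literal string `nan`. A quick check for missing and duplicated rows can be sketched as follows. The toy frame is made up for illustration; in the notebook the same calls would run on `train_data`, and the cleanup step is an assumption rather than something the notebook does.

```python
import pandas as pd

# Toy frame for illustration only; in the notebook this would be train_data.
demo = pd.DataFrame({
    "Sentiments": ["Positive", "Negative", "Positive"],
    "Text": ["great game", None, "great game"],
})

# Count tweets with no text and rows that repeat the same label and text.
missing = int(demo["Text"].isna().sum())
duplicates = int(demo.duplicated(subset=["Sentiments", "Text"]).sum())
print(f"missing texts: {missing}, duplicate rows: {duplicates}")

# One possible cleanup (an assumption, not what the notebook does).
cleaned = demo.dropna(subset=["Text"]).drop_duplicates(subset=["Sentiments", "Text"])
print(cleaned.shape)
```

Whether dropping such rows helps depends on the data; the counts alone are useful context for reading the later scores.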


In [14]:
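# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_data(text):
    """Lowercase, strip non-letters, tokenize, and drop English stop words."""
    text = str(text).lower()
    # Keep only letters and whitespace (removes digits, punctuation, emoji)
    text = re.sub(r'[^a-zA-Z\s]','',text)
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return ' '.join(tokens)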
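
To make the effect of these cleaning steps concrete, here is a dependency-free sketch of the same idea. It is not the notebook's function: `demo_preprocess` and `DEMO_STOP_WORDS` are hypothetical names, a whitespace split stands in for `word_tokenize`, and the tiny stop-word set stands in for NLTK's English list.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration;
# the notebook itself uses NLTK's English list.
DEMO_STOP_WORDS = {"i", "the", "to", "is", "a", "and", "on", "it", "for"}

def demo_preprocess(text):
    """Lowercase, keep letters and whitespace only, drop demo stop words."""
    text = str(text).lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    tokens = text.split()  # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)

print(demo_preprocess("I love Borderlands 2! Can't wait :)"))
# prints: love borderlands cant wait
```

Note that the regex removes digits and apostrophes, so "Borderlands 2" loses its number and "Can't" collapses to "cant".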

In [15]:
train_data['Processed_text'] = train_data['Text'].apply(preprocess_data)
print(train_data)

       Serial_number       Source Sentiments  \
0               2401  Borderlands   Positive   
1               2401  Borderlands   Positive   
2               2401  Borderlands   Positive   
3               2401  Borderlands   Positive   
4               2401  Borderlands   Positive   
...              ...          ...        ...   
74677           9200       Nvidia   Positive   
74678           9200       Nvidia   Positive   
74679           9200       Nvidia   Positive   
74680           9200       Nvidia   Positive   
74681           9200       Nvidia   Positive   

                                                    Text  \
0      im getting on borderlands and i will murder yo...   
1      I am coming to the borders and I will kill you...   
2      im getting on borderlands and i will kill you ...   
3      im coming on borderlands and i will murder you...   
4      im getting on borderlands 2 and i will murder ...   
...                                                  ...   
746

In [17]:
valid_data['Processed_text'] = valid_data['Text'].apply(preprocess_data)
print(valid_data.head())

   Serial_number     Source  Sentiments  \
0           3364   Facebook  Irrelevant   
1            352     Amazon     Neutral   
2           8312  Microsoft    Negative   
3           4371      CS-GO    Negative   
4           4433     Google     Neutral   

                                                Text  \
0  I mentioned on Facebook that I was struggling ...   
1  BBC News - Amazon boss Jeff Bezos rejects clai...   
2  @Microsoft Why do I pay for WORD when it funct...   
3  CSGO matchmaking is so full of closet hacking,...   
4  Now the President is slapping Americans in the...   

                                      Processed_text  
0  mentioned facebook struggling motivation go ru...  
1  bbc news amazon boss jeff bezos rejects claims...  
2  microsoft pay word functions poorly samsungus ...  
3  csgo matchmaking full closet hacking truly awf...  
4  president slapping americans face really commi...  


In [19]:
def train_source_model(source_data):
    X = source_data['Processed_text']
    y = source_data['Sentiments']
    
    tfidf_vectorizer = TfidfVectorizer(max_features=5000)
    X = tfidf_vectorizer.fit_transform(X)
    
    model = LinearSVC()
    model.fit(X, y)
    
    return model, tfidf_vectorizer

# Train models for each source
sources = train_data['Source'].unique()

os.makedirs('models', exist_ok=True)

for source in sources:
    print(f"Training model for source: {source}")
    source_data = train_data[train_data['Source'] == source]
    model, vectorizer = train_source_model(source_data)
    
    # Save the model and vectorizer
    joblib.dump(model, f'models/{source}_model.joblib')
    joblib.dump(vectorizer, f'models/{source}_vectorizer.joblib')

print("Training completed. Models saved in 'models' directory.")



Training model for source: Borderlands
Training model for source: CallOfDutyBlackopsColdWar
Training model for source: Amazon
Training model for source: Overwatch
Training model for source: Xbox(Xseries)
Training model for source: NBA2K
Training model for source: Dota2
Training model for source: PlayStation5(PS5)
Training model for source: WorldOfCraft
Training model for source: CS-GO
Training model for source: Google
Training model for source: AssassinsCreed
Training model for source: ApexLegends
Training model for source: LeagueOfLegends
Training model for source: Fortnite
Training model for source: Microsoft
Training model for source: Hearthstone
Training model for source: Battlefield
Training model for source: PlayerUnknownsBattlegrounds(PUBG)
Training model for source: Verizon
Training model for source: HomeDepot
Training model for source: FIFA
Training model for source: RedDeadRedemption(RDR)
Training model for source: CallOfDuty
Training model for source: TomClancysRainbowSix
Tr
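
The training cell saves two files per source, a model and its vectorizer, which must always be loaded as a matching pair. One alternative is to bundle both steps in a scikit-learn pipeline so there is a single object to fit, save, and load. The sketch below assumes this variation; `train_source_pipeline` is a hypothetical helper and the toy texts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_source_pipeline(texts, labels, max_features=5000):
    """Fit TF-IDF and LinearSVC as one estimator (hypothetical helper)."""
    pipeline = make_pipeline(
        TfidfVectorizer(max_features=max_features),
        LinearSVC(),
    )
    pipeline.fit(texts, labels)
    return pipeline

# Toy data for illustration only; not taken from the dataset.
toy_texts = ["love game", "great fun game", "hate lag", "awful broken update"]
toy_labels = ["Positive", "Positive", "Negative", "Negative"]

pipe = train_source_pipeline(toy_texts, toy_labels)
print(pipe.predict(["love great fun"]))
```

With this layout, a single `joblib.dump(pipe, ...)` per source would store both steps, and `pipe.predict` accepts preprocessed strings directly.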

In [20]:
def predict_sentiment(text, source):
    model = joblib.load(f'models/{source}_model.joblib')
    vectorizer = joblib.load(f'models/{source}_vectorizer.joblib')
    processed_text = preprocess_data(text)
    vectorized_text = vectorizer.transform([processed_text])
    prediction = model.predict(vectorized_text)[0]
    return prediction

# Evaluate on validation set
val_predictions = []
for _, row in valid_data.iterrows():
    pred = predict_sentiment(row['Text'], row['Source'])
    val_predictions.append(pred)

# Print classification report
print(classification_report(valid_data['Sentiments'], val_predictions))


              precision    recall  f1-score   support

  Irrelevant       0.99      0.98      0.99       172
    Negative       0.98      0.98      0.98       266
     Neutral       0.99      0.99      0.99       285
    Positive       0.98      0.98      0.98       277

    accuracy                           0.99      1000
   macro avg       0.99      0.99      0.99      1000
weighted avg       0.99      0.99      0.99      1000
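
The evaluation loop calls `predict_sentiment` once per row, and each call reloads the model and vectorizer from disk. A minimal sketch of caching those loads with `functools.lru_cache` is shown below. `load_artifacts` and `predict_sentiment_cached` are hypothetical names, the file layout is the one used by the training cell, and the function expects text that has already been preprocessed.

```python
from functools import lru_cache

import joblib

@lru_cache(maxsize=None)
def load_artifacts(source, model_dir="models"):
    """Load and memoize the (model, vectorizer) pair for one source."""
    model = joblib.load(f"{model_dir}/{source}_model.joblib")
    vectorizer = joblib.load(f"{model_dir}/{source}_vectorizer.joblib")
    return model, vectorizer

def predict_sentiment_cached(processed_text, source, model_dir="models"):
    """Predict from already preprocessed text using cached artifacts."""
    model, vectorizer = load_artifacts(source, model_dir)
    return model.predict(vectorizer.transform([processed_text]))[0]
```

In the notebook this could be called with `row['Processed_text']` and `row['Source']`, since `valid_data` already has a `Processed_text` column. Each source's files would then be read from disk at most once per session.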



In [21]:
from sklearn.metrics import accuracy_score

print(accuracy_score(valid_data['Sentiments'], val_predictions))

0.986
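
An accuracy this close to 1.0 is worth a sanity check: if validation tweets also appear verbatim in the training file, the score partly measures memorization. The notebook does not show whether that is the case here. The helper below (`overlap_fraction`, a hypothetical name) is a minimal sketch of how one might measure it, shown on made-up lists.

```python
def overlap_fraction(train_texts, valid_texts):
    """Fraction of validation texts that also appear verbatim in training."""
    seen = {str(t).strip().lower() for t in train_texts}
    valid = [str(t).strip().lower() for t in valid_texts]
    if not valid:
        return 0.0
    return sum(t in seen for t in valid) / len(valid)

# Toy lists for illustration only; not taken from the dataset.
toy_train = ["good game", "bad update", "ok"]
toy_valid = ["Good game", "new tweet"]
print(overlap_fraction(toy_train, toy_valid))
# prints: 0.5
```

In the notebook, the call would be `overlap_fraction(train_data['Text'], valid_data['Text'])`.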


In [22]:
sample_texts = [
    ("I love playing Borderlands! Can't wait to kill some skags!", "Borderlands"),
    ("This new graphics card is amazing!", "Nvidia"),
    ("Facebook's new privacy policy is concerning.", "Facebook"),
    ("The latest Windows update broke my computer.", "Microsoft")
]

for text, source in sample_texts:
    try:
        sentiment = predict_sentiment(text, source)
        print(f"Text: '{text}'")
        print(f"Source: {source}")
        print(f"Predicted sentiment: {sentiment}\n")
    except FileNotFoundError as e:
        print(e)
        print(f"Text: '{text}'")
        print(f"Source: {source}")
        print("Predicted sentiment: Unable to predict (model not found)\n")


Text: 'I love playing Borderlands! Can't wait to kill some skags!'
Source: Borderlands
Predicted sentiment: Positive

Text: 'This new graphics card is amazing!'
Source: Nvidia
Predicted sentiment: Positive

Text: 'Facebook's new privacy policy is concerning.'
Source: Facebook
Predicted sentiment: Neutral

Text: 'The latest Windows update broke my computer.'
Source: Microsoft
Predicted sentiment: Negative

