TWITTER SENTIMENT ANALYSIS

In [20]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

import spacy

EDA

In [21]:
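# The file has no header row; the second column (named 'country' here) holds the entity a tweet is about, e.g. Borderlands or Nvidia
columns = ['id', 'country', 'Label', 'Text']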
df = pd.read_csv(r"C:\Users\pavan\Downloads\twitter_training.csv", names=columns)


In [22]:
df

Unnamed: 0,id,country,Label,Text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
...,...,...,...,...
74677,9200,Nvidia,Positive,Just realized that the Windows partition of my...
74678,9200,Nvidia,Positive,Just realized that my Mac window partition is ...
74679,9200,Nvidia,Positive,Just realized the windows partition of my Mac ...
74680,9200,Nvidia,Positive,Just realized between the windows partition of...


In [23]:
print(df.shape)

(74682, 4)


In [24]:
df.head(5)

Unnamed: 0,id,country,Label,Text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       74682 non-null  int64 
 1   country  74682 non-null  object
 2   Label    74682 non-null  object
 3   Text     73996 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.3+ MB


In [26]:
df['Label'].value_counts()

Label
Negative      22542
Positive      20832
Neutral       18318
Irrelevant    12990
Name: count, dtype: int64

In [27]:
for i in range(5):
    print(f"{i+1}: {df['Text'][i]} -> {df['Label'][i]}")

1: im getting on borderlands and i will murder you all , -> Positive
2: I am coming to the borders and I will kill you all, -> Positive
3: im getting on borderlands and i will kill you all, -> Positive
4: im coming on borderlands and i will murder you all, -> Positive
5: im getting on borderlands 2 and i will murder you me all, -> Positive


Preprocessing

Drop Null Values

In [28]:
df.dropna(inplace=True)
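
The df.info() output above reports 73996 non-null Text values out of 74682 rows, so this step removes 686 rows. As a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset), the missing values can be counted first and the drop restricted to the Text column with the subset parameter:

```python
import pandas as pd

# Toy frame standing in for the tweet data (values are illustrative only)
toy = pd.DataFrame({
    "Label": ["Positive", "Negative", "Neutral"],
    "Text": ["great game", None, "patch notes are out"],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only rows whose Text is missing, then rebuild a contiguous index
cleaned = toy.dropna(subset=["Text"]).reset_index(drop=True)
print(cleaned.shape)
```

Resetting the index is optional; it only keeps positional lookups such as df['Text'][i] free of gaps after rows are removed.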

Preprocess Function

In [29]:
# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 

In [30]:
# use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

Apply Preprocess function on dataframe

In [31]:
df['Preprocessed Text'] = df['Text'].apply(preprocess)

In [32]:
df

Unnamed: 0,id,country,Label,Text,Preprocessed Text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...,m get borderland murder
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...,come border kill
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...,m get borderland kill
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...,m come borderland murder
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...,m get borderland 2 murder
...,...,...,...,...,...
74677,9200,Nvidia,Positive,Just realized that the Windows partition of my...,realize Windows partition Mac like 6 year Nvid...
74678,9200,Nvidia,Positive,Just realized that my Mac window partition is ...,realize Mac window partition 6 year Nvidia dri...
74679,9200,Nvidia,Positive,Just realized the windows partition of my Mac ...,realize window partition Mac 6 year Nvidia dri...
74680,9200,Nvidia,Positive,Just realized between the windows partition of...,realize window partition Mac like 6 year Nvidi...


Encoding target column

In [33]:
le_model = LabelEncoder()
df['Label'] = le_model.fit_transform(df['Label'])

In [34]:
df.head(5)

Unnamed: 0,id,country,Label,Text,Preprocessed Text
0,2401,Borderlands,3,im getting on borderlands and i will murder yo...,m get borderland murder
1,2401,Borderlands,3,I am coming to the borders and I will kill you...,come border kill
2,2401,Borderlands,3,im getting on borderlands and i will kill you ...,m get borderland kill
3,2401,Borderlands,3,im coming on borderlands and i will murder you...,m come borderland murder
4,2401,Borderlands,3,im getting on borderlands 2 and i will murder ...,m get borderland 2 murder
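
LabelEncoder assigns integer codes in the sorted order of the class names, which is why Positive appears as 3 in the table above. A small sketch with the four label names from this dataset (the variable names are illustrative) makes the mapping explicit:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Positive", "Neutral", "Negative", "Irrelevant", "Positive"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the label names in sorted order; the position is the code
print(list(encoder.classes_))
print(list(codes))

# inverse_transform turns codes back into label names
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

With this ordering, Irrelevant is 0, Negative is 1, Neutral is 2, and Positive is 3.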


Split data into train and test

In [35]:
# Split the data into training and testing sets without stratification
X_train, X_test, y_train, y_test = train_test_split(df['Preprocessed Text'], df['Label'], 
                                                    test_size=0.2, random_state=42)

In [36]:
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)

Shape of X_train:  (59196,)
Shape of X_test:  (14800,)
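
The split above is purely random. Because the label counts are uneven (22542 Negative against 12990 Irrelevant), one option is to pass the stratify argument so that both partitions keep the same class ratio. The sketch below uses synthetic labels, not the tweet data, to show the effect:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels (illustrative only)
labels = [0] * 10 + [1] * 30 + [2] * 20 + [3] * 40
texts = [f"tweet {i}" for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Each class keeps its 10/30/20/40 share in both partitions
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```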


Machine Learning Model

Naive Bayes Model

In [37]:
# Create a pipeline
clf = Pipeline([
    ('vectorizer_tri_grams', TfidfVectorizer(ngram_range=(1, 3))),
    ('naive_bayes', MultinomialNB())
])

In [38]:

# Train the classifier
clf.fit(X_train, y_train)


In [39]:
# Get prediction
y_pred = clf.predict(X_test)

In [40]:
# Print classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.62      0.77      2696
           1       0.78      0.96      0.86      4380
           2       0.93      0.83      0.88      3605
           3       0.84      0.91      0.88      4119

    accuracy                           0.85     14800
   macro avg       0.89      0.83      0.85     14800
weighted avg       0.87      0.85      0.85     14800



Random Forest

In [41]:
# TF-IDF with default settings (unigrams only) followed by a random forest
clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('random_forest', RandomForestClassifier())
])

In [42]:
clf.fit(X_train, y_train)

In [43]:
# Get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

In [44]:
# Print Accuracy
print(accuracy_score(y_test, y_pred))

0.909527027027027


In [45]:
# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.84      0.90      2696
           1       0.92      0.93      0.92      4380
           2       0.94      0.89      0.91      3605
           3       0.85      0.95      0.90      4119

    accuracy                           0.91     14800
   macro avg       0.92      0.90      0.91     14800
weighted avg       0.91      0.91      0.91     14800
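
Both model sections assign their pipeline to the same name, clf, so the Naive Bayes pipeline is no longer available once the Random Forest one is created. One possible arrangement, sketched here on a tiny made-up corpus with hypothetical step names, keeps the pipelines in a dictionary and scores them in a loop:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny toy corpus (illustrative only; not the tweet data)
train_texts = ["love great game", "awesome fun match",
               "hate terrible bug", "awful broken patch"]
train_labels = [1, 1, 0, 0]
test_texts = ["great fun game", "terrible broken bug"]
test_labels = [1, 0]

pipelines = {
    "naive_bayes": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
        ("model", MultinomialNB()),
    ]),
    "random_forest": Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("model", RandomForestClassifier(random_state=42)),
    ]),
}

# Fit each pipeline and record its accuracy under its own name
scores = {}
for name, pipeline in pipelines.items():
    pipeline.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, pipeline.predict(test_texts))

print(scores)
```

Setting random_state on the forest is an assumption added here for repeatability; the notebook's own Random Forest cell does not set it.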



Test Model

In [46]:
test_df = pd.read_csv(r"C:\Users\pavan\Downloads\twitter_validation.csv",names=columns)
test_df.head()

Unnamed: 0,id,country,Label,Text
0,3364,Facebook,Irrelevant,I mentioned on Facebook that I was struggling ...
1,352,Amazon,Neutral,BBC News - Amazon boss Jeff Bezos rejects clai...
2,8312,Microsoft,Negative,@Microsoft Why do I pay for WORD when it funct...
3,4371,CS-GO,Negative,"CSGO matchmaking is so full of closet hacking,..."
4,4433,Google,Neutral,Now the President is slapping Americans in the...


In [47]:
test_text = test_df['Text'][10]
print(f"{test_text} ===> {test_df['Label'][10]}")

The professional dota 2 scene is fucking exploding and I completely welcome it.

Get the garbage out. ===> Positive


Apply preprocess

In [48]:
test_text_processed = [preprocess(test_text)]
test_text_processed

['professional dota 2 scene fucking explode completely welcome \n\n garbage']

Get Prediction

In [49]:
predicted_label = clf.predict(test_text_processed)

Output

In [50]:
# Map the encoded prediction back to its class name with the fitted encoder
predicted_name = le_model.inverse_transform(predicted_label)[0]

print(f"True Label: {test_df['Label'][10]}")
print(f"Predict Label: {predicted_name}")
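
LabelEncoder sorts class names, so a hand-written list of names has to follow that exact order to decode predictions correctly. Using the fitted encoder's inverse_transform avoids maintaining such a list. The sketch below shows the idea on a two-sentence toy corpus; predict_label is a hypothetical helper, not part of the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy training data (illustrative only)
train_texts = ["love great game", "hate terrible bug"]
train_labels = ["Positive", "Negative"]

encoder = LabelEncoder()
y = encoder.fit_transform(train_labels)

model = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
model.fit(train_texts, y)


def predict_label(text):
    """Hypothetical helper: predict one text and return the label name."""
    code = model.predict([text])
    return encoder.inverse_transform(code)[0]


result = predict_label("great game")
print(result)
```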


This notebook performs sentiment analysis on Twitter data using machine learning techniques. Here is a summary of the process:

Importing Libraries: NumPy, pandas, scikit-learn, and spaCy are imported.

Exploratory Data Analysis (EDA): The dataset "twitter_training.csv" is loaded into a DataFrame named df. EDA involves checking the shape of the DataFrame (74682 rows, 4 columns), displaying the top 5 rows, and examining the distribution of the four labels (Negative, Positive, Neutral, Irrelevant).

Preprocessing:

Rows with null values are dropped from the DataFrame.
A preprocess function is defined using spaCy; it tokenizes the text, removes stop words and punctuation, and lemmatizes the remaining tokens.
The preprocess function is applied to the 'Text' column, and the preprocessed text is stored in a new column named 'Preprocessed Text'.

Encoding Target Column: The labels in the 'Label' column are encoded using LabelEncoder to convert them into numerical values.

Splitting Data: The preprocessed text and encoded labels are split into training and testing sets (59196 and 14800 samples) using the train_test_split function from scikit-learn.

Model Building and Evaluation:

Two models are trained and evaluated: Naive Bayes and Random Forest.
For each model, a pipeline is created consisting of a TF-IDF vectorizer and the respective classifier. The Naive Bayes pipeline uses 1- to 3-grams; the Random Forest pipeline uses the vectorizer's default unigrams.
A classification report is generated for each model. On the test split, Naive Bayes reaches an accuracy of 0.85 and Random Forest about 0.91.

Testing the Model:

A test DataFrame named test_df is loaded with new data from the "twitter_validation.csv" file.
The preprocess function is applied to a sample text from that file.
The trained Random Forest pipeline is used to predict the sentiment label for the sample text.
The true label and predicted label for the sample text are displayed.

Overall, the notebook walks through a complete sentiment analysis workflow, from data preprocessing to model building, evaluation, and testing, and can serve as a foundation for further analysis or application development.