IMPORT THE NECESSARY LIBRARIES

This block imports the required libraries. It uses pandas to load and handle the data, TfidfVectorizer to turn text into numeric features, and scikit-learn to train and evaluate the models.

In [2]:
import pandas as pd 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

LOAD DATASET

- Here we load the Sentiment140 dataset from a CSV file (distributed as a zipped CSV, which can also be downloaded from Kaggle).

- We keep only the polarity and tweet text columns, rename them for clarity, and print the first 100 rows to check the data.

In [9]:
data = pd.read_csv(r"C:\ohk\training.1600000.processed.noemoticon.csv", encoding="latin-1", header=None)

In [10]:
data = data[[0,5]]
data.columns = ["polarity","text"]
print(data.head(100)) 

    polarity                                               text
0          0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1          0  is upset that he can't update his Facebook by ...
2          0  @Kenichan I dived many times for the ball. Man...
3          0    my whole body feels itchy and like its on fire 
4          0  @nationwideclass no, it's not behaving at all....
..       ...                                                ...
95         0  Strider is a sick little puppy  http://apps.fa...
96         0  so rylee,grace...wana go steve's party or not?...
97         0  hey, I actually won one of my bracket pools! T...
98         0  @stark YOU don't follow me, either  and i work...
99         0  A bad nite for the favorite teams: Astros and ...

[100 rows x 2 columns]
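
As an alternative sketch, the column selection and renaming can happen inside read_csv itself through the names and usecols parameters. The six column names below follow the commonly described Sentiment140 layout and are an assumption, and the two in-memory rows are made up so the example runs without the real file.

```python
import io

import pandas as pd

# Two made-up rows in the Sentiment140 layout (hypothetical sample, not real data).
sample_csv = (
    '"0","1","Mon Apr 06 2009","NO_QUERY","user_a","feeling sad today"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","user_b","what a great morning"\n'
)

# Assumed column order: polarity, id, date, query, user, text.
all_columns = ["polarity", "id", "date", "query", "user", "text"]

sample = pd.read_csv(
    io.StringIO(sample_csv),
    header=None,                   # the file has no header row
    names=all_columns,             # name every column up front
    usecols=["polarity", "text"],  # load only the columns the model needs
)
print(sample)
```

Loading only two of the six columns can also reduce memory use on the full 1.6 million rows.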


KEEP ONLY POSITIVE AND NEGATIVE SENTIMENTS 

- Here we remove neutral tweets (polarity 2) and map the labels so that 0 stays negative and 4 becomes 1 for positive.

- Then we print how many positive and negative tweets are left in the data.

In [13]:
# Take an explicit copy so the column assignment below does not act on a slice
data = data[data.polarity != 2].copy()

data["polarity"] = data["polarity"].map({0: 0, 4: 1})
print(data["polarity"].value_counts()) 

polarity
0    800000
1    800000
Name: count, dtype: int64
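
The filter matters because Series.map turns every value missing from the dictionary into NaN. Here is a small sketch on a hypothetical four-value series, showing the result with and without the filter:

```python
import pandas as pd

labels = pd.Series([0, 4, 2, 4])  # hypothetical raw polarity values

# Without filtering, the unmapped value 2 becomes NaN.
unfiltered = labels.map({0: 0, 4: 1})
print(unfiltered.isna().sum())

# Filtering first keeps the mapped labels free of missing values.
filtered = labels[labels != 2].map({0: 0, 4: 1})
print(filtered.tolist())
```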


CLEAN THE TWEETS 

- Here we define a simple function that converts all text to lowercase for consistency and apply it to every tweet in the dataset.

- Then we show the original and cleaned versions of the first few tweets.

In [14]:
def clean_text(text):
    return text.lower()
data["clean_text"] = data["text"].apply(clean_text)
print(data[["text","clean_text"]].head())


                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   
3    my whole body feels itchy and like its on fire    
4  @nationwideclass no, it's not behaving at all....   

                                          clean_text  
0  @switchfoot http://twitpic.com/2y1zl - awww, t...  
1  is upset that he can't update his facebook by ...  
2  @kenichan i dived many times for the ball. man...  
3    my whole body feels itchy and like its on fire   
4  @nationwideclass no, it's not behaving at all....  
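
The cleaned tweets above still contain user mentions and URLs. If those should go too, the cleaning function could be extended roughly as follows. This is an optional sketch and not part of the notebook's pipeline; the function name clean_tweet and the regular expressions are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; real tweets may need more careful rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Lowercase a tweet, drop URLs and @mentions, and collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_tweet("@Kenichan I dived many times for the ball. See http://example.com/x"))
```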


TRAIN TEST SPLIT

- This code splits the clean_text and polarity columns into training and test sets using an 80/20 split.

- random_state=42 makes the split reproducible.

In [15]:
x_train, x_test, y_train, y_test = train_test_split(data["clean_text"], data["polarity"], random_state=42, test_size=0.2)
 
print("Train Size:",len(x_train))
print("Test_size:",len(x_test)) 


Train Size: 1280000
Test_size: 320000
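
To see what random_state buys us, the sketch below splits a hypothetical ten-item toy list twice with the same seed and checks that both splits are identical. It also passes stratify, which the notebook does not use, to keep the class ratio equal in both parts; treat that as an optional extra.

```python
from sklearn.model_selection import train_test_split

texts = [f"tweet {i}" for i in range(10)]  # hypothetical toy data
labels = [0, 1] * 5

# Same seed, same inputs: the two splits should match exactly.
first = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)
second = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)

x_tr, x_te, y_tr, y_te = first
print(len(x_tr), len(x_te))
print(first == second)
print(sorted(y_te))  # stratify keeps one example of each class in the test part
```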


PERFORM VECTORIZATION

- This code creates a TF-IDF vectorizer that converts text into numerical features using unigrams and bigrams, limited to the 5000 most frequent features.

- It fits and transforms the training data, transforms the test data with the same fitted vocabulary, and prints the shapes of the resulting TF-IDF matrices.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
x_train_tfidf = vectorizer.fit_transform(x_train)
x_test_tfidf  = vectorizer.transform(x_test)

print("TF-IDF shape (train):", x_train_tfidf.shape)
print("TF-IDF shape (test):", x_test_tfidf.shape)

TF-IDF shape (train): (1280000, 5000)
TF-IDF shape (test): (320000, 5000)


TRAIN BERNOULLI NAIVE BAYES MODEL

- Here we train a Bernoulli Naive Bayes classifier on the TF-IDF features from the training data.

- It predicts sentiment for the test data, then prints the accuracy and a detailed classification report.

In [26]:
bnb = BernoulliNB()
bnb.fit(x_train_tfidf,y_train)
bnb_pred = bnb.predict(x_test_tfidf)
print("Bernoulli Naive Bayes Accuracy:", accuracy_score(y_test, bnb_pred))
print("\nBernoulli classification report:\n", classification_report(y_test, bnb_pred))

Bernoulli Naive Bayes Accuracy: 0.766478125

Bernoulli classification report:
               precision    recall  f1-score   support

           0       0.77      0.75      0.76    159494
           1       0.76      0.78      0.77    160506

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000
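
One detail worth knowing: BernoulliNB binarizes its input by default (binarize=0.0), so it only looks at whether a term occurs, not at its TF-IDF weight. The sketch below, on a hypothetical four-document corpus, checks that training on TF-IDF features and on plain presence/absence features gives the same probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["love this song", "hate this rain", "love love the sun", "hate the traffic"]
y = [1, 0, 1, 0]  # hypothetical labels

# Build both feature sets over the same vocabulary so the columns line up.
tfidf = TfidfVectorizer().fit(docs)
binary = CountVectorizer(binary=True, vocabulary=tfidf.vocabulary_).fit(docs)

x_tfidf = tfidf.transform(docs)
x_binary = binary.transform(docs)

proba_tfidf = BernoulliNB().fit(x_tfidf, y).predict_proba(x_tfidf)
proba_binary = BernoulliNB().fit(x_binary, y).predict_proba(x_binary)

# Any positive TF-IDF weight is thresholded to 1, so both models agree.
same = np.allclose(proba_tfidf, proba_binary)
print(same)
```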



TRAIN SUPPORT VECTOR MACHINE (SVM) MODEL

- This code trains a linear support vector machine (LinearSVC) with a maximum of 1000 iterations on the TF-IDF features.

- It predicts labels for the test data, then prints the accuracy and a detailed classification report showing how well the SVM performed.

In [27]:
svm = LinearSVC(max_iter=1000)
svm.fit(x_train_tfidf,y_train)
svm_pred = svm.predict(x_test_tfidf)
# scikit-learn metrics expect the true labels first, then the predictions
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
print("\nClassifier Report:\n", classification_report(y_test, svm_pred))

SVM Accuracy: 0.79528125

Classifier Report:
               precision    recall  f1-score   support

           0       0.80      0.78      0.79    159494
           1       0.79      0.81      0.80    160506

    accuracy                           0.80    320000
   macro avg       0.80      0.80      0.80    320000
weighted avg       0.80      0.80      0.80    320000
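
The scikit-learn metric functions take the true labels as the first argument and the predictions as the second. Accuracy is the same either way, but precision and recall trade places when the two are swapped. A small sketch with hypothetical labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 0, 1, 1]  # hypothetical ground truth
y_pred = [0, 1, 1, 1, 1]  # hypothetical predictions

# Accuracy is symmetric in its arguments.
acc_right = accuracy_score(y_true, y_pred)
acc_swapped = accuracy_score(y_pred, y_true)

# Precision and recall are not: swapping the arguments swaps the two metrics.
prec_right = precision_score(y_true, y_pred)
rec_right = recall_score(y_true, y_pred)
prec_swapped = precision_score(y_pred, y_true)
rec_swapped = recall_score(y_pred, y_true)

print(acc_right, acc_swapped)
print(prec_right, rec_right)
print(prec_swapped, rec_swapped)
```

The same holds for classification_report, whose support column counts the labels passed as the first argument.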



TRAIN LOGISTIC REGRESSION MODEL

- This code trains a logistic regression model with up to 100 iterations on the TF-IDF features.

- It predicts sentiment labels for the test data and prints the accuracy and a detailed classification report for model evaluation.

In [28]:
logpreg = LogisticRegression(max_iter=100)
logpreg.fit(x_train_tfidf,y_train)

logpreg_pred = logpreg.predict(x_test_tfidf)

# scikit-learn metrics expect the true labels first, then the predictions
print("Logistic Regression Accuracy:", accuracy_score(y_test, logpreg_pred))
print("\nClassification Report:\n", classification_report(y_test, logpreg_pred))

Logistic Regression Accuracy: 0.79539375

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.78      0.79    159494
           1       0.79      0.81      0.80    160506

    accuracy                           0.80    320000
   macro avg       0.80      0.80      0.80    320000
weighted avg       0.80      0.80      0.80    320000
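
The three training cells repeat the same fit, predict, score pattern, so they could also be written as one loop. The sketch below shows the idea on a hypothetical toy dataset; the sentences, labels, and variable names are made up for illustration, and the scores say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the tweet columns.
train_texts = ["love it", "so happy", "great day", "hate it", "so sad", "awful day"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["happy day", "sad day"]
test_labels = [1, 0]

vec = TfidfVectorizer()
toy_train = vec.fit_transform(train_texts)
toy_test = vec.transform(test_texts)

models = {
    "BernoulliNB": BernoulliNB(),
    "LinearSVC": LinearSVC(max_iter=1000),
    "LogisticRegression": LogisticRegression(max_iter=100),
}

scores = {}
for name, model in models.items():
    model.fit(toy_train, train_labels)
    # True labels first, predictions second.
    scores[name] = accuracy_score(test_labels, model.predict(toy_test))
    print(name, scores[name])
```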



MAKE PREDICTIONS ON SAMPLE TWEETS

- This code takes three sample tweets and transforms them into TF-IDF features using the same fitted vectorizer.

- It then predicts their sentiment with the trained BernoulliNB, SVM, and logistic regression models and prints the result for each classifier.

- In the output, 1 stands for positive and 0 for negative.

In [30]:
sample_tweets = ["I love this!", "I hate that!", "It was okay, not great."]
sample_vec = vectorizer.transform(sample_tweets)

print("\nSample Predictions:")
print("BernoulliNB:", bnb.predict(sample_vec))
print("SVM:", svm.predict(sample_vec))
print("Logistic Regression:", logpreg.predict(sample_vec))


Sample Predictions:
BernoulliNB: [1 0 1]
SVM: [1 0 1]
Logistic Regression: [1 0 1]
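
All three models label the lukewarm third tweet as positive, because a binary classifier has to pick one of the two classes. For logistic regression, predict_proba shows how confident each call is. The sketch below uses a hypothetical toy model trained on six made-up sentences so it runs on its own; its probabilities are illustrative and will not match those of the full model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, not the Sentiment140 tweets.
toy_texts = [
    "i love this movie", "what a great day", "so happy right now",
    "i hate this weather", "what an awful day", "so sad right now",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=100))
toy_model.fit(toy_texts, toy_labels)

label_names = {0: "negative", 1: "positive"}
samples = ["I love this!", "I hate that!", "It was okay, not great."]

# Columns follow toy_model.classes_, here [0, 1].
probs = toy_model.predict_proba(samples)
for tweet, (p_neg, p_pos) in zip(samples, probs):
    predicted = label_names[int(p_pos >= 0.5)]
    print(f"{tweet!r}: {predicted} (P(positive) = {p_pos:.2f})")
```

A probability close to 0.5 is a hint that the tweet does not clearly belong to either class.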
