# **Twitter Sentiment Analysis using Machine Learning**

**Author : Ramandeep Kaur**

**Description :**

The objective of this project is to perform sentiment analysis on Twitter data by classifying tweets as positive or negative. Because tweets are informal, short, and noisy, Natural Language Processing (NLP) techniques are combined with Machine Learning models to classify sentiment.

In [8]:
# installing kaggle library
! pip install kaggle



Uploading the kaggle.json file

In [10]:
# configuring the path of the kaggle.json file
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Importing Twitter Sentiment Dataset

In [11]:
# API to fetch the dataset from Kaggle
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to /content
  0% 0.00/80.9M [00:00<?, ?B/s]
100% 80.9M/80.9M [00:00<00:00, 1.12GB/s]


In [12]:
# extracting the compressed dataset

from zipfile import ZipFile
dataset = '/content/sentiment140.zip'

with ZipFile(dataset, 'r') as zip_file:
  zip_file.extractall()
  print('The dataset is extracted')

The dataset is extracted


# **1. Importing Required Libraries**

In [34]:
import re

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

In [32]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [16]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

# **2. Load the Dataset**

In [17]:
# loading the data from csv file to pandas dataframe
twitter_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1')

# **3. Data Preprocessing**

In [18]:
# checking the number of rows and columns
twitter_data.shape

(1599999, 6)

In [19]:
# printing the first 5 rows of the dataframe
twitter_data.head()

| | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 1 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 2 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 3 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
| 4 | 0 | 1467811372 | Mon Apr 06 22:20:00 PDT 2009 | NO_QUERY | joy_wolf | @Kwesidei not the whole crew |


In [20]:
# the CSV has no header row, so the first read used the first tweet as column names
# (hence 1,599,999 rows); read the file again with explicit column names
column_names = ['target','ids','date','flag','user','text']
twitter_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv',names=column_names, encoding='ISO-8859-1')

In [21]:
# checking the number of rows and columns
twitter_data.shape

(1600000, 6)

The dataset contains 1,600,000 tweets, each labeled for sentiment.
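
The row count changes between the two reads because `read_csv` treats the first line as a header unless told otherwise. The following is a minimal sketch of that behaviour on an in-memory CSV; the three sample rows and the shortened column list are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-line CSV without a header row, mimicking the dataset layout.
raw_csv = "0,101,first tweet\n0,102,second tweet\n4,103,third tweet\n"

# Default behaviour: the first line is consumed as column names.
df_default = pd.read_csv(io.StringIO(raw_csv))

# Passing names= keeps every line as data.
df_named = pd.read_csv(io.StringIO(raw_csv), names=["target", "ids", "text"])

print(df_default.shape)  # one data row fewer than the file has lines
print(df_named.shape)    # every line kept as a data row
```

Passing `header=None` without `names` would also keep all rows, at the cost of integer column labels.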

In [22]:
# printing the first 5 rows of the dataframe
twitter_data.head()

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [23]:
# counting the number of missing values in the dataset
twitter_data.isnull().sum()

| | 0 |
|---|---|
| target | 0 |
| ids | 0 |
| date | 0 |
| flag | 0 |
| user | 0 |
| text | 0 |


In [24]:
# checking the distribution of target column
# (0 = negative, 4 = positive)
twitter_data['target'].value_counts()

| target | count |
|---|---|
| 0 | 800000 |
| 4 | 800000 |


Relabel the positive class from 4 to 1 so that the target is binary (0 = negative, 1 = positive)

In [25]:
twitter_data.replace({'target':{4:1}}, inplace=True)

In [26]:
# checking the distribution of target column
# (0 = negative, 1 = positive)
twitter_data['target'].value_counts()

| target | count |
|---|---|
| 0 | 800000 |
| 1 | 800000 |


# **4. Text Preprocessing**
Preprocessing is essential to clean noisy Twitter data and improve model performance.

**Preprocessing Steps:**

* Convert text to lowercase

* Remove URLs and user mentions

* Strip the "#" symbol from hashtags (the hashtag word itself is kept)

* Remove special characters and numbers

* Remove stopwords

* Apply lemmatization

In [35]:
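stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


def clean_text(text):
    text = text.lower()
    # drop URLs; note that "www\S+" also matches inside words such as "awww,"
    text = re.sub(r"http\S+|www\S+", "", text)
    # drop user mentions
    text = re.sub(r"@\w+", "", text)
    # drop the hashtag symbol but keep the word
    text = re.sub(r"#", "", text)
    # keep only lowercase letters and whitespace (removes digits, punctuation, apostrophes)
    text = re.sub(r"[^a-z\s]", "", text)
    words = text.split()
    # remove stopwords, then lemmatize what is left
    words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return " ".join(words)

twitter_data['clean_text'] = twitter_data['text'].apply(clean_text)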
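
To make the effect of each step concrete, here is a small trace of the same regular expressions on a single tweet. It is only a sketch: the stopword set is a hand-picked subset chosen for illustration, lemmatization is left out so the example runs without NLTK, and the sample tweet is invented.

```python
import re

# Hypothetical subset of the NLTK English stopword list, for illustration only.
SAMPLE_STOP_WORDS = {"i", "is", "it", "its", "not", "no", "that", "this", "a", "at", "all", "the"}


def clean_without_lemmatizer(text, stop_words):
    """Apply the same regex steps as clean_text, minus lemmatization."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)            # user mentions
    text = re.sub(r"#", "", text)               # hashtag symbol only
    text = re.sub(r"[^a-z\s]", "", text)        # digits, punctuation, apostrophes
    return " ".join(w for w in text.split() if w not in stop_words)


tweet = "@nationwideclass no, it's not behaving at all. I can't see #fail 2day"

with_negation_removed = clean_without_lemmatizer(tweet, SAMPLE_STOP_WORDS)
# Variant (an assumption, not the notebook's choice): keep the negation words.
with_negation_kept = clean_without_lemmatizer(tweet, SAMPLE_STOP_WORDS - {"no", "not"})

print(with_negation_removed)  # behaving cant see fail day
print(with_negation_kept)     # no not behaving cant see fail day
```

Two side effects are visible. Stripping apostrophes turns "can't" into "cant", which is no longer a stopword and therefore survives. The standard stopword list contains "no" and "not", so explicit negation is dropped. Whether keeping negation words would improve the classifiers is not tested in this notebook.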

In [39]:
twitter_data.head()

| | target | ids | date | flag | user | text | clean_text |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... | thats bummer shoulda got david carr third day |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... | upset cant update facebook texting might cry r... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... | dived many time ball managed save rest go bound |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire | whole body feel itchy like fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... | behaving im mad cant see |
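
In the first row, the tweet begins with "Awww," but the cleaned text starts with "thats". The likely cause is the URL pattern: `www\S+` is not anchored to a word boundary, so it matches the "www," inside "awww,", leaving a lone "a" that the stopword filter then removes. The sketch below reproduces this and compares it with a bounded pattern. The alternative pattern and the sample sentence are assumptions for illustration, not part of the notebook.

```python
import re

ORIGINAL_URL_PATTERN = r"http\S+|www\S+"
# Hypothetical alternative: only treat "www." as a URL when it starts a word.
BOUNDED_URL_PATTERN = r"http\S+|\bwww\.\S+"


def strip_urls(text, pattern):
    """Lowercase the text and remove whatever the URL pattern matches."""
    return re.sub(pattern, "", text.lower())


tweet = "Awww, that's a bummer. See www.example.com or http://twitpic.com/2y1zl"

original_result = strip_urls(tweet, ORIGINAL_URL_PATTERN)
bounded_result = strip_urls(tweet, BOUNDED_URL_PATTERN)

print(repr(original_result))  # "awww," has been reduced to "a"
print(repr(bounded_result))   # "awww," is kept, both URLs are removed
```

Switching patterns would change the cleaned text for any tweet containing "www" inside a word, so the figures reported later would need to be regenerated.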


In [40]:
print(twitter_data['clean_text'])

0              thats bummer shoulda got david carr third day
1          upset cant update facebook texting might cry r...
2            dived many time ball managed save rest go bound
3                            whole body feel itchy like fire
4                                   behaving im mad cant see
                                 ...                        
1599995                        woke school best feeling ever
1599996               thewdbcom cool hear old walt interview
1599997                       ready mojo makeover ask detail
1599998    happy th birthday boo alll time tupac amaru sh...
1599999                                 happy charitytuesday
Name: clean_text, Length: 1600000, dtype: object


In [41]:
print(twitter_data['target'])

0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: target, Length: 1600000, dtype: int64


# **5. Feature Extraction using TF-IDF**

TF-IDF (Term Frequency–Inverse Document Frequency) converts text into numerical vectors while reducing the importance of commonly occurring words.
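
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand for a three-document toy corpus. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is the default scheme in scikit-learn's `TfidfVectorizer` as far as I am aware. The corpus is made up.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned tweets (made up for illustration).
docs = [
    "good morning happy day",
    "good night sad day",
    "good game",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


first_doc = tfidf(tokenized[0])
for term, weight in sorted(first_doc.items(), key=lambda item: -item[1]):
    print(f"{term:8s} {weight:.3f}")
```

"good" occurs in every document and receives the lowest weight in the first tweet, while "morning" and "happy", which occur only there, receive the highest.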

In [42]:
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    max_features=15000,
    min_df=5
)

In [45]:
X = tfidf.fit_transform(twitter_data['clean_text'])
y = twitter_data['target']

# **6. Train-Test Split**

The dataset is split into training and testing sets to evaluate model performance.

In [46]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
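
Note that the vectorizer above is fitted on all 1.6 million tweets before the split, so its vocabulary and IDF weights also reflect the test tweets. A common alternative is to split the raw text first and fit TF-IDF on the training portion only. The sketch below shows that ordering on toy data. The texts and labels are made up, and the effect on the accuracy reported in this notebook has not been measured.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for twitter_data['clean_text'] and twitter_data['target'].
toy_texts = [
    "love sunny day", "great game tonight", "happy birthday friend", "best feeling ever",
    "hate rainy day", "lost game tonight", "sad lonely night", "worst feeling ever",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first ...
text_train, text_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# ... then learn the vocabulary and IDF weights from the training text only.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_X_train = toy_vectorizer.fit_transform(text_train)
toy_X_test = toy_vectorizer.transform(text_test)  # reuses the fitted vocabulary

print(toy_X_train.shape, toy_X_test.shape)
```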

In [47]:
print(X.shape, X_train.shape, X_test.shape)

(1600000, 15000) (1280000, 15000) (320000, 15000)


In [49]:
print(y.shape, y_train.shape, y_test.shape)

(1600000,) (1280000,) (320000,)


# **7. Model 1: Logistic Regression**
Logistic Regression is widely used for text classification tasks and performs well with TF-IDF features.

In [50]:
lr_model = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',
    solver='liblinear'
)

In [51]:
lr_model.fit(X_train, y_train)

**Hyperparameter Tuning for Logistic Regression**

GridSearchCV is used to tune the inverse regularization strength (C) with 5-fold cross-validation, selecting the value that gives the best F1-score.

In [52]:
param_grid = {
    'C': [0.01, 0.1, 1, 10]
}


grid = GridSearchCV(
    LogisticRegression(
        max_iter=1000,
        class_weight='balanced',
        solver='liblinear'
    ),
    param_grid,
    cv=5,
    scoring='f1'
)


grid.fit(X_train, y_train)


best_lr_model = grid.best_estimator_
print('Best Logistic Regression Parameters:', grid.best_params_)

Best Logistic Regression Parameters: {'C': 1}


# **8. Model 2: Linear Support Vector Machine (SVM)**

Linear SVM is highly effective for high-dimensional and sparse text data such as TF-IDF vectors.

In [54]:
svm_model = LinearSVC(
    C=1.0,
    class_weight='balanced',
    max_iter=2000,
    tol=1e-3
)


svm_model.fit(X_train, y_train)

# **9. Model Evaluation**

**Logistic Regression Evaluation**

In [55]:
y_test_pred_lr = best_lr_model.predict(X_test)


print('Logistic Regression Accuracy:', accuracy_score(y_test, y_test_pred_lr))
print('\nClassification Report:\n')
print(classification_report(y_test, y_test_pred_lr))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test, y_test_pred_lr))

Logistic Regression Accuracy: 0.7882625

Classification Report:

              precision    recall  f1-score   support

           0       0.80      0.77      0.78    160000
           1       0.78      0.81      0.79    160000

    accuracy                           0.79    320000
   macro avg       0.79      0.79      0.79    320000
weighted avg       0.79      0.79      0.79    320000

Confusion Matrix:

[[122530  37470]
 [ 30286 129714]]
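
The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does this for the positive class, assuming scikit-learn's layout of rows as actual classes and columns as predicted classes.

```python
# Counts copied from the Logistic Regression confusion matrix above
# (rows = actual class, columns = predicted class).
tn, fp = 122530, 37470
fn, tp = 30286, 129714

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of the tweets predicted positive, how many are positive
recall_pos = tp / (tp + fn)      # of the positive tweets, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision_pos:.2f}")
print(f"recall    = {recall_pos:.2f}")
print(f"f1-score  = {f1_pos:.2f}")
```

The results agree with the reported accuracy of 0.7882625 and with the row for class 1 in the report.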


**Linear SVM Evaluation**

In [56]:
y_test_pred_svm = svm_model.predict(X_test)


print('Linear SVM Accuracy:', accuracy_score(y_test, y_test_pred_svm))
print('\nClassification Report:\n')
print(classification_report(y_test, y_test_pred_svm))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test, y_test_pred_svm))

Linear SVM Accuracy: 0.7872

Classification Report:

              precision    recall  f1-score   support

           0       0.80      0.76      0.78    160000
           1       0.77      0.81      0.79    160000

    accuracy                           0.79    320000
   macro avg       0.79      0.79      0.79    320000
weighted avg       0.79      0.79      0.79    320000

Confusion Matrix:

[[121592  38408]
 [ 29688 130312]]


# **10. Model Comparison**

In [57]:
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Linear SVM'],
    'Accuracy': [
        accuracy_score(y_test, y_test_pred_lr),
        accuracy_score(y_test, y_test_pred_svm)
    ]
})


results

| | Model | Accuracy |
|---|---|---|
| 0 | Logistic Regression | 0.788262 |
| 1 | Linear SVM | 0.7872 |

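
To put the gap between the two models in perspective, it can be expressed as a number of test tweets. The counts below are read from the two confusion matrices in the evaluation section. Whether a difference this small would persist under a different random split is not tested in this notebook.

```python
# Correct predictions = diagonal of each confusion matrix reported earlier.
n_test = 320000
lr_correct = 122530 + 129714
svm_correct = 121592 + 130312

gap_in_tweets = lr_correct - svm_correct
gap_in_accuracy = gap_in_tweets / n_test

print(f"Logistic Regression correct: {lr_correct}")
print(f"Linear SVM correct:          {svm_correct}")
print(f"Difference: {gap_in_tweets} tweets ({gap_in_accuracy:.4%} of the test set)")
```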

In [67]:
# Logistic Regression is kept as the final model: its test accuracy is marginally higher
# (0.7883 vs 0.7872) and its coefficients are easy to interpret
final_model = best_lr_model

# **11. Saving the trained model**

In [59]:
import pickle

In [60]:
filename = 'trained_model.sav'
# use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(final_model, f)


Using the saved model for future predictions

In [61]:
# loading the saved model
with open('/content/trained_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

In [64]:
X_new = X_test[200]
print('Actual Outcome : ',y_test.iloc[200])

prediction = loaded_model.predict(X_new)
print('Predicted Outcome : ',prediction)

if prediction[0] == 0:
  print('The tweet is Negative')
else:
  print('The tweet is Positive')

Actual Outcome :  0
Predicted Outcome :  [0]
The tweet is Negative


In [65]:
X_new = X_test[3]
print('Actual Outcome : ',y_test.iloc[3])

prediction = loaded_model.predict(X_new)
print('Predicted Outcome : ',prediction)

if prediction[0] == 0:
  print('The tweet is Negative')
else:
  print('The tweet is Positive')

Actual Outcome :  1
Predicted Outcome :  [1]
The tweet is Positive
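
Both checks above reuse rows of `X_test`, which are already TF-IDF vectors. Scoring a brand-new tweet also requires the fitted vectorizer, which the pickle file above does not contain. One way to keep the two together is a scikit-learn `Pipeline`. The sketch below uses toy data and default-like settings, so it is an illustration under those assumptions and not the notebook's trained model.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data standing in for the cleaned tweets and their labels.
toy_texts = ["love sunny day", "happy birthday friend", "best feeling ever",
             "hate rainy day", "sad lonely night", "worst feeling ever"]
toy_labels = [1, 1, 1, 0, 0, 0]

# Bundle the vectorizer and the classifier so they are fitted and saved together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(toy_texts, toy_labels)

# Round-trip through pickle in memory; writing to a file opened in 'wb' mode works the same way.
restored = pickle.loads(pickle.dumps(pipeline))

# Raw (cleaned) text goes in directly; the pipeline applies TF-IDF before predicting.
new_tweets = ["happy sunny day", "sad rainy night"]
predictions = restored.predict(new_tweets)
print(predictions)
```

New tweets should still pass through the same cleaning function before being handed to the pipeline, so that they match the text the vectorizer was fitted on.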


# **12. Conclusion**

This project built a Twitter sentiment classifier using Natural Language Processing (NLP) and Machine Learning techniques. Tweets were cleaned and converted into numerical features using TF-IDF vectorization (unigrams and bigrams, limited to 15,000 features). Two classification models, Logistic Regression and Linear Support Vector Machine (SVM), were trained and evaluated on a held-out 20% test set.

Logistic Regression was selected as the final model because it achieved marginally higher test accuracy (78.83% versus 78.72% for Linear SVM). The two models are otherwise nearly indistinguishable on precision, recall, and F1-score, so interpretability is the main practical reason for the choice; training time was not measured in this notebook.

The final Logistic Regression model reached an accuracy of 78.83% on the 320,000 held-out tweets. Precision, recall, and F1-score are close for both classes (between 0.77 and 0.81), although the confusion matrix shows somewhat more false positives (37,470) than false negatives (30,286), so the model leans mildly toward predicting positive sentiment. Overall, Logistic Regression is a reasonable baseline for Twitter sentiment classification.

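For a given tweet, the trained model predicts whether the sentiment is **Positive** or **Negative**, and it is correct for roughly four out of five test tweets. The saved model expects TF-IDF features, so a new tweet must be cleaned and transformed with the same fitted vectorizer before prediction.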
