In [51]:
# installing kaggle library
!pip install kaggle



### Uploading the kaggle.json file

In [52]:
# copy kaggle.json into ~/.kaggle and restrict its permissions so the Kaggle CLI accepts it
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

### Importing Twitter Sentiment dataset

In [53]:
# download the Sentiment140 dataset with the Kaggle CLI
!kaggle datasets download kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)


In [54]:
# extracting the compressed dataset

from zipfile import ZipFile
dataset = 'sentiment140.zip'

with ZipFile(dataset, 'r') as zip_file:  # avoid shadowing the built-in zip()
    zip_file.extractall()
    print("The dataset is extracted")
    

The dataset is extracted


### Importing the dependencies


In [55]:
import numpy as np  # NumPy: For numerical operations and working with arrays
import pandas as pd  # Pandas: For handling and analyzing structured data (like DataFrames)
import re  # re: Python's regular expressions module for pattern matching and text cleaning
from nltk.corpus import stopwords  # NLTK stopwords: Common words (like 'the', 'and') to filter out from text
from nltk.stem.porter import PorterStemmer  # PorterStemmer: Reduces words to their root form (e.g., "running" → "run")
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF Vectorizer: Converts text to numerical features based on term frequency and importance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


🛑 What are Stop Words in NLTK?

Stop words are common words in a language that are often filtered out during natural language processing (NLP) tasks because they are considered to carry little meaningful information.


💡 Why remove stop words?

- To reduce noise in text data.
- To focus on more meaningful words (nouns, verbs, etc.) for tasks like:
  - Text classification
  - Sentiment analysis
  - Topic modeling
  - Search engines
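
As a minimal sketch of the idea, the snippet below filters stop words out of one sentence. To stay runnable without any download, it uses a small hand-picked subset of the NLTK English list; `SAMPLE_STOP_WORDS` and `remove_stop_words` are illustrative names, not part of NLTK. In this notebook the full list comes from `stopwords.words('english')`.

```python
# Illustrative subset of NLTK's English stop words (assumption: hand-picked
# for this sketch; the notebook uses stopwords.words('english') instead).
SAMPLE_STOP_WORDS = {"i", "am", "a", "about", "and", "is", "it", "my", "not", "of"}

def remove_stop_words(sentence, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the sentence, split on whitespace, and drop stop words."""
    return [word for word in sentence.lower().split() if word not in stop_words]

filtered = remove_stop_words("I am not happy about my flight")
print(filtered)  # ['happy', 'flight']
```

Note that 'not' is on the NLTK list, so the negation disappears along with the filler words. That trade-off may be worth keeping in mind for sentiment analysis.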



In [56]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aakashkhanal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [57]:
# Printing the stopwords in English

print(stopwords.words('english'))



['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

## Data Processing



In [58]:
# loading the data from the csv file into a pandas DataFrame
twitter_data = pd.read_csv('training.1600000.csv', encoding='ISO-8859-1')
twitter_data.shape

(1599999, 6)

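`encoding='ISO-8859-1'` is needed because the file contains bytes that are not valid UTF-8, so pandas' default UTF-8 decoding can fail with a `UnicodeDecodeError`. ISO-8859-1 maps every byte to a character, so the read always succeeds.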
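
The following sketch illustrates the kind of failure involved, using a made-up string rather than the dataset itself: a character stored as a single ISO-8859-1 byte cannot be decoded as UTF-8.

```python
# 'é' is a single byte (0xE9) in ISO-8859-1, but that byte alone is not valid UTF-8.
raw = "café".encode("ISO-8859-1")

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the encoding the bytes were written in recovers the text.
decoded = raw.decode("ISO-8859-1")
print(utf8_failed, decoded)  # True café
```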

In [59]:
twitter_data.head()


Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [60]:
# the file has no header row, so the first tweet was read as the column names above;
# read the dataset again with explicit column names

column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
twitter_data = pd.read_csv('training.1600000.csv', names=column_names, encoding='ISO-8859-1')

In [61]:
twitter_data.shape

(1600000, 6)
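
The row count changes from 1,599,999 to 1,600,000 because passing `names` stops pandas from consuming the first line as a header. A small sketch of that behavior on made-up in-memory data (the sample rows and the two column names are assumptions for illustration):

```python
import io
import pandas as pd

# Three data rows, no header row (made-up sample data).
csv_text = "0,first tweet\n0,second tweet\n4,third tweet\n"

# Default: the first line is consumed as the header, so one row is lost.
without_names = pd.read_csv(io.StringIO(csv_text))

# With names=..., every line is treated as data.
with_names = pd.read_csv(io.StringIO(csv_text), names=["target", "text"])

print(without_names.shape)  # (2, 2)
print(with_names.shape)     # (3, 2)
```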

In [62]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [63]:
# checking missing values in the dataset
twitter_data.isnull().sum()


target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

⚠️ Why Class Imbalance is a Problem:
The model may become biased toward the majority class and reach high accuracy simply by predicting the dominant label.

It may fail to learn the minority class well, which is often the one we care about (e.g., detecting spam, fraud, or negative sentiment).
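
To make the point concrete, here is a small sketch with made-up labels (not the Twitter data): a "model" that always predicts the majority class scores 95% accuracy while never detecting the minority class.

```python
# Made-up labels: 95 examples of class 0 and 5 examples of class 1.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of positions where prediction and label agree.
n = len(y_true)
accuracy = sum(1 for i in range(n) if y_true[i] == y_pred[i]) / n

# Recall on the minority class: fraction of true 1s that were predicted as 1.
minority_hits = sum(1 for i in range(n) if y_true[i] == 1 and y_pred[i] == 1)
minority_recall = minority_hits / y_true.count(1)

print(accuracy, minority_recall)  # 0.95 0.0
```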

In [64]:
# checking the distribution of target column
twitter_data['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

Our data is balanced: each of the two classes has 800,000 tweets.

## Convert the target label 4 (positive) to 1

In [65]:
twitter_data.replace({'target':{4:1}}, inplace=True)


In [66]:
twitter_data['target'].value_counts()

target
0    800000
1    800000
Name: count, dtype: int64