In [51]:
# installing kaggle library
!pip install kaggle



### Uploading the kaggle.json file

In [52]:
# configuring the path of kaggle.json file
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

### Importing Twitter Sentiment dataset

In [53]:
# fetching the dataset from Kaggle through the Kaggle API
!kaggle datasets download kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)


In [54]:
# extracting the compressed dataset

from zipfile import ZipFile
dataset = 'sentiment140.zip'

with ZipFile(dataset, 'r') as archive:  # named 'archive' to avoid shadowing the built-in zip()
    archive.extractall()
    print("The dataset is extracted")
    

The dataset is extracted


### Importing the dependencies


In [1]:
import numpy as np  # NumPy: For numerical operations and working with arrays
import pandas as pd  # Pandas: For handling and analyzing structured data (like DataFrames)
import re  # re: Python's regular expressions module for pattern matching and text cleaning
from nltk.corpus import stopwords  # NLTK stopwords: Common words (like 'the', 'and') to filter out from text
from nltk.stem.porter import PorterStemmer  # PorterStemmer: Reduces words to their root form (e.g., "running" → "run")
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF Vectorizer: Converts text to numerical features based on term frequency and importance
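from sklearn.model_selection import train_test_split  # train_test_split: Splits data into training and test sets
from sklearn.linear_model import LogisticRegression  # LogisticRegression: Linear classifier used for the sentiment model
from sklearn.metrics import accuracy_score  # accuracy_score: Fraction of predictions that match the true labels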


🛑 What are Stop Words in NLTK?

Stop words are common words in a language that are often filtered out during natural language processing (NLP) tasks because they are considered to carry little meaningful information.


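💡 Why remove stop words?

- To reduce noise in text data.
- To focus on more meaningful words (nouns, verbs, etc.) for tasks like text classification, sentiment analysis, topic modeling, and search engines.

⚠️ The NLTK English list also contains negations such as "no", "nor", and "not", so removing stop words can discard cues that matter for sentiment.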



In [2]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aakashkhanal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Printing the stopwords in English

print(stopwords.words('english'))



['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

## Data Processing



In [4]:
# loading the data from the csv file into a pandas DataFrame
twitter_data = pd.read_csv('training.1600000.csv', encoding = 'ISO-8859-1')
twitter_data.shape

(1599999, 6)

Passing encoding='ISO-8859-1' lets pandas read special characters in the file that can fail to decode under the default UTF-8 encoding.
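As a small illustration of why the encoding argument matters, the sketch below decodes the same bytes both ways. The byte string is a made-up example, not data taken from the Sentiment140 file.

```python
# A byte string containing 0xE9, which is "é" in ISO-8859-1 (Latin-1)
raw = b"caf\xe9 tweet"

# ISO-8859-1 maps every single byte to a character, so decoding never fails
latin1_text = raw.decode("ISO-8859-1")
print(latin1_text)

# In UTF-8, 0xE9 announces a multi-byte sequence; followed by a space it is invalid
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("UTF-8 decoding failed:", err.reason)
```

Because ISO-8859-1 accepts any byte, a successful read does not prove that it is the file's true encoding; it only guarantees that loading will not raise a decoding error.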

In [5]:
twitter_data.head()


Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [6]:
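# the CSV has no header row, so the first tweet was read as column labels above
# (which is why the shape showed 1599999 rows); name the columns and read the dataset again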

column_names = ['target', 'id', 'date', 'flag','user', 'text']
twitter_data = pd.read_csv('training.1600000.csv', names= column_names, encoding = 'ISO-8859-1')

In [7]:
twitter_data.shape

(1600000, 6)
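The one-row difference between the two shapes comes from how `read_csv` treats a file without a header line. A minimal sketch with an in-memory CSV (the two rows and the shortened column list are invented for illustration):

```python
import io
import pandas as pd

# Two data rows and no header line, mimicking the layout of the dataset
csv_text = "0,101,NO_QUERY,alice,first tweet\n4,102,NO_QUERY,bob,second tweet\n"

# Without names=, pandas promotes the first data row to column labels
without_names = pd.read_csv(io.StringIO(csv_text))

# With names=, every line is kept as data
with_names = pd.read_csv(
    io.StringIO(csv_text),
    names=["target", "id", "flag", "user", "text"],
)

print(without_names.shape, with_names.shape)
```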

In [8]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [9]:
# checking missing values in the dataset
twitter_data.isnull().sum()


target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

⚠️ Why Class Imbalance is a Problem:

- The model may become biased toward the majority class and reach high accuracy just by predicting the dominant label.
- It may fail to learn the minority class well, which is often the one we care about (e.g., detecting spam, fraud, or negative sentiment).
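To make the accuracy trap concrete, here is a small sketch with hypothetical labels (95 negative, 5 positive) and a "model" that always predicts the majority class. The numbers are invented for illustration only.

```python
# Hypothetical imbalanced labels: 95 negative (0) and 5 positive (1)
labels = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class
predictions = [0] * len(labels)

# Accuracy: share of predictions that match the labels
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: share of true positives that were found
minority_recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / labels.count(1)

print(accuracy, minority_recall)
```

The accuracy looks strong while the minority class is never detected, which is why the class distribution is checked before training.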

In [10]:
# checking the distribution of target column
twitter_data['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

The two classes are perfectly balanced (800,000 tweets each), so class imbalance is not a concern for this dataset.

## Convert the target "4" to "1".

In [11]:
twitter_data.replace({'target':{4:1}}, inplace=True)


In [12]:
twitter_data['target'].value_counts()

target
0    800000
1    800000
Name: count, dtype: int64

0 --> Negative Tweet

1 --> Positive Tweet

### Stemming 

In NLTK (Natural Language Toolkit), stemming is the process of reducing a word to its base or root form by stripping suffixes. For example, "running" and "runs" are both reduced to the stem "run". Irregular forms such as "ran" are left unchanged, because a stemmer applies suffix rules rather than a dictionary lookup.

NLTK provides several stemmers. The most commonly used one is the PorterStemmer.

### 🔑 Summary: Why Stemming Is Important

- Reduces word variations to a common base (e.g., "running", "runs" → "run")

- Decreases dimensionality of text data, making models faster and simpler

- Improves search recall by matching similar word forms

- Enhances model generalization by treating related words as one

- Boosts efficiency in NLP tasks like classification and information retrieval

⚠️ While powerful, stemming can be imprecise. For more accuracy, consider lemmatization.

### 🔄 Difference: Stemming vs Lemmatization

Stemming: crude chopping (e.g., "studies" → "studi")

Lemmatization: uses vocabulary & grammar (e.g., "studies" → "study")

Use lemmatization (with WordNetLemmatizer) when you need proper words as base forms
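To see how the Porter stemmer behaves in practice, the sketch below stems a few sample words (the word list is illustrative). Lemmatization is not run here because WordNetLemmatizer needs an extra NLTK data download.

```python
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Sample words chosen for illustration
words = ["running", "runs", "ran", "studies"]
stems = {word: port_stem.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Regular suffixes are stripped ("running" and "runs" both become "run", and "studies" becomes the non-word "studi"), while the irregular form "ran" passes through unchanged.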

In [13]:
port_stem = PorterStemmer()

In [17]:
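# Build the stop-word set once; calling stopwords.words('english') inside the
# loop reloads the list for every word, which is very slow on 1.6M tweets.
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, then lowercase and split into tokens
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower().split()
    # drop stop words and stem the remaining tokens
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    return ' '.join(stemmed_content)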
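The cleaning step in this function can be examined on its own. The sketch below applies the same regular expression to a shortened version of the first tweet in the dataset. The second variant, which strips URLs and @mentions first, is an optional variation that the notebook does not use.

```python
import re

tweet = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."

# Same cleaning as in stemming(): keep letters only, lowercase, split
tokens = re.sub('[^a-zA-Z]', ' ', tweet).lower().split()
print(tokens)

# Optional variation (not in the notebook): drop URLs and @mentions first
without_links = re.sub(r'http\S+|@\w+', ' ', tweet)
tokens_no_links = re.sub('[^a-zA-Z]', ' ', without_links).lower().split()
print(tokens_no_links)
```

With the notebook's approach, the handle and URL fragments ("switchfoot", "http", "twitpic", "com", "zl") survive as tokens, which matches the stemmed output shown for this tweet. Whether removing them helps the classifier would need to be tested.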

In [18]:
twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming)

In [19]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


In [20]:
print(twitter_data['stemmed_content'])

0          switchfoot http twitpic com zl awww bummer sho...
1          upset updat facebook text might cri result sch...
2          kenichan dive mani time ball manag save rest g...
3                            whole bodi feel itchi like fire
4                              nationwideclass behav mad see
                                 ...                        
1599995                           woke school best feel ever
1599996    thewdb com cool hear old walt interview http b...
1599997                         readi mojo makeov ask detail
1599998    happi th birthday boo alll time tupac amaru sh...
1599999    happi charitytuesday thenspcc sparkschar speak...
Name: stemmed_content, Length: 1600000, dtype: object


In [21]:
print(twitter_data['target'])

0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: target, Length: 1600000, dtype: int64


In [22]:
# separating the data and label
X = twitter_data['stemmed_content'].values
Y = twitter_data['target'].values

In [23]:
print(X)

['switchfoot http twitpic com zl awww bummer shoulda got david carr third day'
 'upset updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguph h']


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)


The data is split into a training set (80%) and a test set (20%), where:

- stratify=Y keeps the class proportions the same in both sets

- random_state=2 makes the split reproducible
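The effect of both arguments can be checked on a small synthetic array. The data below (100 samples, 50 per class) is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with a perfectly balanced binary label
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Stratification keeps the 50/50 class ratio in both subsets
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())

# The same random_state reproduces exactly the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```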

In [25]:
print(X.shape, X_train.shape, X_test.shape)

(1600000,) (1280000,) (320000,)


In [26]:
print(X_train)

['watch saw iv drink lil wine' 'hatermagazin'
 'even though favourit drink think vodka coke wipe mind time think im gonna find new drink'
 ... 'eager monday afternoon'
 'hope everyon mother great day wait hear guy store tomorrow'
 'love wake folger bad voic deeper']


In [27]:
print(X_test)

['mmangen fine much time chat twitter hubbi back summer amp tend domin free time'
 'ah may show w ruth kim amp geoffrey sanhueza'
 'ishatara mayb bay area thang dammit' ...
 'destini nevertheless hooray member wonder safe trip' 'feel well'
 'supersandro thank']


In [28]:
# converting the textual data to numerical data

vectorizer = TfidfVectorizer()

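# learn the vocabulary and IDF weights from the training set only,
# then apply the same mapping to the test set so no test information leaks into training
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)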

### 🧠 What is TfidfVectorizer()?
It's a tool from scikit-learn (sklearn.feature_extraction.text) that transforms a collection of text documents into a matrix of TF-IDF features.

📌 TF-IDF: What does it mean?

- TF (Term Frequency): How often a word appears in a document.

- IDF (Inverse Document Frequency): How rare the word is across all documents.

So, TF-IDF gives higher weight to words that are frequent in a document but not common across all documents.

### ✅ Why TfidfVectorizer() is needed:

- Machine learning models can't process raw text; they need numbers.

- TfidfVectorizer() converts text into a numeric format using TF-IDF scores.

- It creates a sparse matrix with one row per document and one column per word (feature), which can be used to train a model (e.g., a classifier).
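A toy corpus makes the weighting visible. The documents below are invented. With its default settings, scikit-learn computes a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and then L2-normalizes each row, so the exact values differ from the textbook formula.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented, already-stemmed documents
train_docs = ["good day", "good movi", "bad day"]
test_docs = ["good unseen"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)   # learns vocabulary and IDF
test_matrix = vec.transform(test_docs)         # reuses them unchanged

vocab = vec.vocabulary_                        # word -> column index
row = train_matrix[1].toarray()[0]             # weights for "good movi"

# "movi" occurs in one document, "good" in two, so "movi" gets the larger weight
rare_beats_common = row[vocab["movi"]] > row[vocab["good"]]
print(rare_beats_common)

# Words never seen during fit ("unseen") are ignored at transform time
print(test_matrix.shape, test_matrix.nnz)
```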

In [31]:
print(X_train)

  (0, 443066)	0.4484755317023172
  (0, 235045)	0.41996827700291095
  (0, 109306)	0.3753708587402299
  (0, 185193)	0.5277679060576009
  (0, 354543)	0.3588091611460021
  (0, 436713)	0.27259876264838384
  (1, 160636)	1.0
  (2, 288470)	0.16786949597862733
  (2, 132311)	0.2028971570399794
  (2, 150715)	0.18803850583207948
  (2, 178061)	0.1619010109445149
  (2, 409143)	0.15169282335109835
  (2, 266729)	0.24123230668976975
  (2, 443430)	0.3348599670252845
  (2, 77929)	0.31284080750346344
  (2, 433560)	0.3296595898028565
  (2, 406399)	0.32105459490875526
  (2, 129411)	0.29074192727957143
  (2, 407301)	0.18709338684973031
  (2, 124484)	0.1892155960801415
  (2, 109306)	0.4591176413728317
  (3, 172421)	0.37464146922154384
  (3, 411528)	0.27089772444087873
  (3, 388626)	0.3940776331458846
  (3, 56476)	0.5200465453608686
  :	:
  (1279996, 390130)	0.22064742191076112
  (1279996, 434014)	0.2718945052332447
  (1279996, 318303)	0.21254698865277746
  (1279996, 237899)	0.2236567560099234
  (1279996, 2910



### Meaning of output:

(0, 443066)    0.4484755317023172

- Row 0 corresponds to document 0 (i.e., first sample in X_train)

- Column 443066 corresponds to a specific word (feature)

- Value 0.448... is the TF-IDF score for that word in that document

## Training the Machine Learning Model

### Logistic Regression

In [33]:
model = LogisticRegression(max_iter=1000)

### ✅ How to choose the best max_iter in LogisticRegression:

- Start with the default (100).

- If you get a ConvergenceWarning, the solver stopped before it converged → increase max_iter (e.g., 200, 300, 500...).

- Stop increasing once the warning disappears and model performance (e.g., accuracy) stops changing.

- Use cross-validation to confirm that the converged model generalizes. Overfitting is governed mainly by the regularization strength C, not by max_iter.
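The procedure above can be sketched on synthetic data. The helper `fit_and_report` and the random dataset are hypothetical and serve only to show how to check for a ConvergenceWarning and read the iteration count from `n_iter_`.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 samples, 5 features, label depends on the first two
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)


def fit_and_report(max_iter):
    """Fit a model and return (iterations used, whether the solver warned)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf = LogisticRegression(max_iter=max_iter).fit(X_demo, y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(clf.n_iter_[0]), warned


for max_iter in (1, 1000):
    n_iter, warned = fit_and_report(max_iter)
    print(f"max_iter={max_iter}: iterations used={n_iter}, ConvergenceWarning={warned}")
```

When the iterations used stay well below max_iter and no warning is raised, raising max_iter further changes nothing.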

In [34]:
model.fit(X_train, Y_train)

### Accuracy Score

In [36]:
# accuracy score on the training data.
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [37]:
print('Accuracy score on the training data:', training_data_accuracy)

Accuracy score on the training data: 0.81018125


In [38]:
# accuracy score on the testing data.
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score on the test data:', test_data_accuracy)


Accuracy score on the test data: 0.777996875


- If test accuracy ≈ training accuracy → ✅ Model is generalizing well.

- If test accuracy ≪ training accuracy → ⚠️ Model may be overfitting.
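Applying this rule to the scores reported above is simple arithmetic. The test-set size of 320,000 is taken from the shape printed after the split.

```python
training_accuracy = 0.81018125
test_accuracy = 0.777996875
test_size = 320000

# Difference between training and test accuracy
gap = training_accuracy - test_accuracy

# Number of test tweets classified correctly and incorrectly
correct = round(test_accuracy * test_size)
misclassified = test_size - correct

print(f"gap: {gap:.4f}")
print(f"correct: {correct}, misclassified: {misclassified}")
```

The gap is a little over three percentage points, so the two scores are close but not identical.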

## Saving the trained model

In [39]:
import pickle

In [40]:
filename = 'trained_model.sav'
with open(filename, 'wb') as f:  # the with-block closes the file after writing
    pickle.dump(model, f)

### Using the saved model for future predictions

In [41]:
# loading the saved model
with open('trained_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

In [45]:
X_new = X_test[200]
print(Y_test[200])

prediction = loaded_model.predict(X_new)
print(prediction)

if (prediction[0]==0):
    print('Negative Tweet')

else:
    print('Positive Tweet')

1
[1]
Positive Tweet


In [46]:
X_new = X_test[4]
print(Y_test[4])

prediction = loaded_model.predict(X_new)
print(prediction)

if (prediction[0]==0):
    print('Negative Tweet')

else:
    print('Positive Tweet')

0
[1]
Positive Tweet
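The predictions above use rows of X_test, which are already vectorized. To classify new raw text in a later session, the fitted vectorizer is needed as well, because the saved model only understands the TF-IDF columns it was trained on. The sketch below is one possible approach and not part of the original notebook: the tiny training set, the file name `demo_bundle.sav`, and the dictionary layout are all assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set of already-stemmed texts
texts = ["love great day", "happi fun time", "hate bad day", "sad upset cri"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

# Save the vectorizer and the model together in one file
path = os.path.join(tempfile.mkdtemp(), "demo_bundle.sav")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vec, "model": clf}, f)

# Later: load both and transform new text with the SAME vectorizer
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_text = ["love fun"]  # would go through the same stemming step first
features = bundle["vectorizer"].transform(new_text)
prediction = bundle["model"].predict(features)
print(prediction)
```

New text should pass through the same `stemming` function before vectorizing, so that its tokens match the vocabulary learned during training.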
