<a href="https://colab.research.google.com/github/Ajayj025/Twitter_Sentimental_analysis_using_ML.ipynb/blob/main/Twitter_Sentimental_analysis_using_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install kaggle
!pip install nltk
!pip install scikit-learn



Using the Kaggle API to access the dataset

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Downloading and extracting the dataset

In [None]:
!kaggle datasets download -d kazanova/sentiment140

from zipfile import ZipFile
with ZipFile('/content/sentiment140.zip', 'r') as zip_ref:
    zip_ref.extractall()
print("Dataset extracted.")

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)
Dataset extracted.


Data Pre-Processing

In [None]:
import pandas as pd

column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
df = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', names=column_names, encoding='ISO-8859-1')

# Sentiment140 labels positive tweets as 4; remap to 1 so the target is binary (0 = negative, 1 = positive)
df['target'] = df['target'].replace({4: 1})

# Confirm
print(df['target'].value_counts())  # should show only 0 and 1


target
0    800000
1    800000
Name: count, dtype: int64


In [None]:
print(df.shape)

(1600000, 6)


In [None]:
df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


Clean the text data

In [None]:
import re

def clean_text(text):
    # Remove URLs; note that the unanchored www\S+ also matches inside words such as "Awww,"
    text = re.sub(r"http\S+|www\S+", '', text)
    text = re.sub(r'@\w+|#', '', text)  # remove @mentions and the '#' symbol (hashtag words are kept)
    text = re.sub(r"[^\w\s]", '', text)  # remove punctuation
    text = re.sub(r"\d+", "", text)      # remove numbers
    text = text.lower()
    return text

df['cleaned_text'] = df['text'].apply(clean_text)
df[['text', 'cleaned_text']].head()

Unnamed: 0,text,cleaned_text
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",a thats a bummer you shoulda got david car...
1,is upset that he can't update his Facebook by ...,is upset that he cant update his facebook by t...
2,@Kenichan I dived many times for the ball. Man...,i dived many times for the ball managed to sa...
3,my whole body feels itchy and like its on fire,my whole body feels itchy and like its on fire
4,"@nationwideclass no, it's not behaving at all....",no its not behaving at all im mad why am i he...
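
The first row above shows "Awww," reduced to "a", because the `www\S+` pattern also matches inside ordinary words. A minimal sketch of a stricter variant follows. The function name `clean_tweet` and the sample string are assumptions for illustration, not part of the original notebook. It anchors the `www` pattern at a word boundary and collapses the extra whitespace left behind by removals.

```python
import re

def clean_tweet(text: str) -> str:
    """Hypothetical variant of clean_text with an anchored URL pattern."""
    text = re.sub(r"http\S+|\bwww\.\S+", "", text)  # drop URLs only
    text = re.sub(r"@\w+", "", text)                # drop @mentions entirely
    text = text.replace("#", "")                    # keep hashtag words, drop '#'
    text = re.sub(r"[^\w\s]", "", text)             # drop punctuation
    text = re.sub(r"\d+", "", text)                 # drop digits
    text = text.lower()
    return " ".join(text.split())                   # collapse runs of whitespace

# Made-up sample tweet for illustration
sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. #sad 2day"
print(clean_tweet(sample))
```

Whether keeping words like "awww" helps the classifier is an empirical question; the sketch only shows how to avoid the accidental match.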


Removing stop words and lemmatizing

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    words = text.split()
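    # lemmatize() defaults to pos='n', so only noun forms are reduced ('times' -> 'time'; 'dived' is unchanged)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]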
    return ' '.join(words)

df['processed_text'] = df['cleaned_text'].apply(lemmatize_text)
df[['cleaned_text', 'processed_text']].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Unnamed: 0,cleaned_text,processed_text
0,a thats a bummer you shoulda got david car...,thats bummer shoulda got david carr third day
1,is upset that he cant update his facebook by t...,upset cant update facebook texting might cry r...
2,i dived many times for the ball managed to sa...,dived many time ball managed save rest go bound
3,my whole body feels itchy and like its on fire,whole body feel itchy like fire
4,no its not behaving at all im mad why am i he...,behaving im mad cant see
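
Row 4 above loses both "no" and "not", since the NLTK English stop-word list includes negations. For sentiment analysis this can flip the meaning of a tweet. The sketch below uses a tiny stand-in stop list (an assumption, so the example runs without NLTK downloads) to show one way of keeping negations; in the notebook, `stop_words` would be the NLTK set.

```python
# Tiny stand-in stop list for illustration only; in the notebook this would be
# set(stopwords.words('english')) from NLTK.
stop_words = {"i", "am", "is", "its", "not", "no", "at", "all", "why"}

# Hypothetical adjustment: keep negation words because they carry sentiment.
negations = {"no", "not", "nor"}
sentiment_stop_words = stop_words - negations

def remove_stop_words(text, stops):
    """Drop every whitespace-separated token that appears in `stops`."""
    return " ".join(w for w in text.split() if w not in stops)

text = "no its not behaving at all"
print(remove_stop_words(text, stop_words))            # negations removed
print(remove_stop_words(text, sentiment_stop_words))  # negations kept
```

Combined with bigrams in the vectorizer, kept negations allow features such as "not behaving" to form.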


Split the Data into Training and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

X = df['processed_text'].values
Y = df['target'].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

print(X_train.shape, X_test.shape)

(1280000,) (320000,)


TF-IDF Vectorization with N-grams and Filtering

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2),min_df=5, max_df=0.8, sublinear_tf=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [None]:
print(X_train_vec.shape, X_test_vec.shape)

(1280000, 233991) (320000, 233991)


In [None]:
print(X_train_vec)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 12293252 stored elements and shape (1280000, 233991)>
  Coords	Values
  (0, 216309)	0.2238173399992751
  (0, 166786)	0.24189531839296083
  (0, 100172)	0.3580219252588351
  (0, 48226)	0.27241773305819855
  (0, 111990)	0.28454273943144365
  (0, 223085)	0.30424804982058085
  (0, 216601)	0.4892099976538259
  (0, 48285)	0.5254935440301554
  (1, 95594)	1.0
  (2, 48226)	0.2696244971064601
  (2, 95594)	0.07581452610129889
  (2, 53330)	0.12624056180847001
  (2, 197007)	0.12093802799529277
  (2, 58257)	0.18752033466982618
  (2, 195726)	0.18118948512856362
  (2, 213025)	0.213193172801957
  (2, 32732)	0.20285206382113716
  (2, 223232)	0.2372670812416978
  (2, 126092)	0.15700253278360307
  (2, 198748)	0.09852510146739404
  (2, 75356)	0.12131938480429205
  (2, 60360)	0.13278343025230235
  (2, 135587)	0.10828516945058939
  (2, 53710)	0.1812372214192709
  (2, 48322)	0.291137940877246
  :	:
  (1279998, 40856)	0.11395565970872376
  (1279998, 

In [None]:
print(X_test_vec)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 3020042 stored elements and shape (320000, 233991)>
  Coords	Values
  (0, 5473)	0.11573769944528027
  (0, 12595)	0.10642619810386852
  (0, 13486)	0.2717861335579994
  (0, 29586)	0.1820349936664754
  (0, 46308)	0.2824042052763589
  (0, 60740)	0.16756826050163037
  (0, 60771)	0.3222869619841347
  (0, 64825)	0.15580664733523175
  (0, 65004)	0.2371275212451504
  (0, 85032)	0.14644804945454343
  (0, 85131)	0.266366921729513
  (0, 93188)	0.18442265368808505
  (0, 93192)	0.30003414068026985
  (0, 131464)	0.11649851254024852
  (0, 132058)	0.2212853051318641
  (0, 187262)	0.14414339675022841
  (0, 187272)	0.2801608485396649
  (0, 192243)	0.27128537839637396
  (0, 198748)	0.1750200052784939
  (0, 198892)	0.2816331638528801
  (0, 207689)	0.11952579660155019
  (1, 5473)	0.22307338151998496
  (1, 68612)	0.6021981810408035
  (1, 103910)	0.4282447751737432
  (1, 123111)	0.2956933417795338
  :	:
  (319995, 95238)	0.28380724954705533
  (3199

Training the Model

In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Train the SVM model
svm_model = LinearSVC()
svm_model.fit(X_train_vec, Y_train)

# Predictions
svm_train_pred = svm_model.predict(X_train_vec)
svm_test_pred = svm_model.predict(X_test_vec)

# Accuracy scores
svm_train_accuracy = accuracy_score(Y_train, svm_train_pred)
svm_test_accuracy = accuracy_score(Y_test, svm_test_pred)

print("SVM Training Accuracy:", svm_train_accuracy)
print("SVM Testing Accuracy:", svm_test_accuracy)


SVM Training Accuracy: 0.85749296875
SVM Testing Accuracy: 0.786940625
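
Accuracy alone does not show how errors split between the two classes. The sketch below uses made-up labels (an assumption for illustration; in the notebook, `Y_test` and `svm_test_pred` would take their place) to compute a confusion matrix and per-class precision and recall.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration only; in the notebook these would be
# Y_test and svm_test_pred.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```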


In [None]:
import pickle

In [None]:
filename = 'trained_model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(svm_model, f)

In [None]:
# Load the saved model (the fitted vectorizer is not stored in this file)
with open('/content/trained_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

In [None]:
# The input must be preprocessed like the training data (cleaned, stop words removed, lemmatized)
X_new = ["love sucky day get unexpected text put smile face."]
X_new_vec = vectorizer.transform(X_new)
prediction = loaded_model.predict(X_new_vec)
print(prediction)

if prediction[0] == 0:
    print('Negative Tweet')
else:
    print('Positive Tweet')

[1]
Positive Tweet
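
The prediction above relies on the in-memory `vectorizer`, which is not saved alongside the model. One possible approach, sketched here on made-up data, is to bundle the vectorizer and classifier in a scikit-learn `Pipeline` and pickle that single object. The toy texts, labels, and default hyperparameters are assumptions for illustration, not the notebook's settings.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up, already-preprocessed training data for illustration only
texts = ["love great day", "happy smile love", "hate awful day", "sad awful cry"]
labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
pipeline.fit(texts, labels)

# Serialize and restore the whole pipeline as one object
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored object accepts raw (preprocessed) strings directly
print(restored.predict(["love happy smile"]))
```

With this layout, a fresh session only needs the one pickled file to go from text to prediction. As with any pickle, only load files from a trusted source.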
