# Data

## About Dataset
### Context
- This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated with a polarity label (0 = negative, 4 = positive) and can be used to detect sentiment.

### Content
Each record has the following six fields:

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive). Only 0 and 4 occur in the training file used here.

- ids: the id of the tweet (2087)

- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

- flag: the query (lyx). If there is no query, this value is NO_QUERY.

- user: the user that tweeted (robotickilldozr)

- text: the text of the tweet (Lyx is cool)
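
The field layout above can be sketched with a tiny in-memory sample. This is only an illustration: the first row is copied from the dataset preview in this notebook, the second row reuses the example values from the field list with an assumed polarity of 4, and the unquoted CSV layout is a simplification of the real file.

```python
import io
import pandas as pd

# Column names as assigned later in this notebook.
COLUMNS = ['target', 'id', 'date', 'flag', 'user', 'text']

# Two sample rows without a header row. The second row is assembled from the
# example values in the field list; its target of 4 is an assumption.
sample_csv = io.StringIO(
    '0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,'
    'my whole body feels itchy and like its on fire\n'
    '4,2087,Sat May 16 23:58:44 UTC 2009,lyx,robotickilldozr,Lyx is cool\n'
)

# header=None tells pandas the file has no header; names= supplies the columns.
sample = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(sample[['target', 'flag', 'user', 'text']])
```

Passing `names=` at read time is an alternative to assigning `ds.columns` afterwards; both give the same column labels.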

# Import Libraries

In [None]:
from zipfile import ZipFile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer


from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.linear_model import LogisticRegression 

from sklearn.metrics import accuracy_score, classification_report

# Data Loading

In [3]:
# dataset = r'D:\projects\z datasets\Sentiment140_dataset.zip'

# with ZipFile(dataset, 'r') as ds:
#     ds.extractall()

In [45]:
# The CSV has no header row, so header=None keeps the first tweet as data
ds = pd.read_csv(r'D:\projects\z datasets\Sentiment140_dataset\train.csv', encoding='ISO-8859-1', header=None)

In [20]:
ds.head()

Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [46]:
ds.columns = ['target', 'id', 'date', 'flag', 'user', 'text']

In [22]:
ds.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [26]:
ds.shape

(1600000, 6)

# Data Processing

In [23]:

nltk.data.path.append(r'C:\Users\E_Hom\AppData\Roaming\nltk_data')


In [55]:

stop = stopwords.words('english')
print(stop)


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [36]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   id      1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [47]:
ds['target'].value_counts()

0    800000
4    800000
Name: target, dtype: int64

In [48]:
# Relabel the positive class from 4 to 1; negative stays 0
ds['target'] = ds['target'].map({0: 0, 4: 1})


In [49]:
ds['target'].value_counts()

0    800000
1    800000
Name: target, dtype: int64

## Stemming
- Reduce each word to its root (stem) form, e.g. "feels" becomes "feel".
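
As a quick illustration of what stemming does, the sketch below runs NLTK's `PorterStemmer` on a few words taken from the tweets shown earlier. The word list is chosen here for demonstration only.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the sample tweets in this notebook.
words = ['feels', 'body', 'itchy', 'behaving', 'update', 'times']

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f'{word} -> {stem}')
```

Note that stems are not always dictionary words ("body" becomes "bodi"); they only need to be consistent so that variants of a word share one feature.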

In [50]:
port_stem = PorterStemmer()

In [58]:
def stemming(content):
    # Replace every non-letter character with a space - re.sub(pattern, replacement, string)
    stemmed = re.sub('[^a-zA-Z]', ' ', content)
    # Lowercase and split into individual words
    stemmed = stemmed.lower().split()
    # Drop stop words and stem the remaining words
    stemmed = [port_stem.stem(word) for word in stemmed if word not in stop]
    return ' '.join(stemmed)

In [59]:
ds['stemmed'] = ds['text'].apply(stemming)

In [60]:
ds.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


- The id, date, and flag columns are not needed for modeling.

## Split Data

In [61]:
X = ds['stemmed'].values
Y = ds['target'].values

In [67]:
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.33, stratify=Y, random_state=2)
# stratify=Y --> keep the same proportion of each class in the train and test sets
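
The effect of `stratify` is easiest to see on imbalanced labels. The following is a minimal sketch on toy data (80 negatives, 20 positives, chosen here for illustration), not on the tweet dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 negatives and 20 positives (illustrative only).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2
)

# With stratify=y, both splits keep the original 80/20 class ratio.
print('train positive share:', y_train.mean())
print('test positive share: ', y_test.mean())
```

In this notebook the classes are already balanced (800,000 each), so stratification mainly guarantees that the split stays exactly balanced.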

## Embeddings

- The stemmed text is converted to numeric features with TF-IDF (Term Frequency-Inverse Document Frequency), which weights each word by its importance in a document relative to the whole corpus.

- TF (Term Frequency): Measures how frequently a word appears in a document.
- IDF (Inverse Document Frequency): Measures how rare a word is across all documents; words that appear in many documents get a lower weight.
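
To make the two definitions concrete, here is a small pure-Python sketch of a TF-IDF computation. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is to my understanding what scikit-learn's `TfidfVectorizer` does with its default settings. The three documents are illustrative: the first is a stemmed tweet from this notebook, the other two are made up.

```python
import math
from collections import Counter

# Illustrative corpus of already-stemmed documents.
docs = [
    'whole bodi feel itchi like fire',
    'feel like fire',
    'upset updat facebook',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # Raw term counts times IDF, then scaled to unit L2 length.
    term_freq = Counter(tokens)
    raw = {term: count * idf(term) for term, count in term_freq.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f'{term:6s} {weight:.3f}')
```

Terms that occur in only one document ("whole", "bodi", "itchi") end up with a higher weight than terms shared with another document ("feel", "like", "fire").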

In [69]:
# TF-IDF vectorizer (sparse bag-of-words features)
vectorizer = TfidfVectorizer()

In [70]:
# Learn the vocabulary and IDF weights from the training data only
xtrain = vectorizer.fit_transform(xtrain)


In [71]:
# Reuse the fitted vocabulary; the test data is transformed, not refitted
xtest = vectorizer.transform(xtest)

In [72]:
type(xtrain)

scipy.sparse._csr.csr_matrix

In [79]:
print(xtrain)

  (0, 151639)	0.20202437270061568
  (0, 311409)	0.230962413835814
  (0, 21009)	0.28224566955220287
  (0, 294272)	0.5341820321529278
  (0, 274722)	0.2662602640801861
  (0, 343784)	0.2555311890386471
  (0, 118972)	0.3105447944905826
  (0, 30411)	0.30638170350870314
  (0, 54215)	0.46295887543539493
  (1, 111022)	0.26223528912169286
  (1, 118543)	0.28258013821949673
  (1, 73042)	0.2179809833794416
  (1, 278660)	0.2694220942226651
  (1, 303294)	0.2079598520803409
  (1, 152672)	0.27746938769951157
  (1, 213979)	0.2283558727205571
  (1, 134224)	0.15986484870694564
  (1, 18442)	0.2582070219479215
  (1, 215535)	0.16497562095045878
  (1, 397062)	0.20996187691075113
  (1, 11250)	0.24827653171056654
  (1, 281524)	0.3716818609082496
  (1, 226527)	0.44224083442565515
  (2, 162423)	0.3198183545772798
  (2, 70038)	0.15766957660919292
  :	:
  (1071997, 149343)	0.4646059862092223
  (1071997, 336662)	0.40067256731296536
  (1071997, 379730)	0.40422324309327595
  (1071997, 132588)	0.3659741349382083
  (107

## Train Logistic Regression Model

In [80]:
# Raise max_iter above the default of 100 to give the solver room to converge
lr = LogisticRegression(max_iter=1000)

In [82]:
lr.fit(xtrain, ytrain)

## Evaluate

In [83]:
train_pred = lr.predict(xtrain)
train_acc = accuracy_score(ytrain, train_pred)

In [84]:
train_acc


0.8108386194029851

In [85]:
pred = lr.predict(xtest)
acc  = accuracy_score(ytest, pred)

In [86]:
acc

0.7770719696969697
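
Accuracy is a single number; `classification_report`, which is imported at the top of this notebook but not used, breaks the result down per class. The sketch below uses hypothetical labels and predictions purely to show the calls; the numbers are not results from this model.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Fraction of predictions that match the true labels (5 of 8 here).
toy_acc = accuracy_score(y_true, y_pred)
print(toy_acc)

# Per-class precision, recall, and F1; target_names are labels chosen here.
print(classification_report(y_true, y_pred, target_names=['negative', 'positive']))
```

In the notebook itself the same call would take `ytest` and `pred` as arguments.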