## Imports

In [59]:
import pandas as pd
import string
import re
import contractions
from textblob import TextBlob
from spellchecker import SpellChecker

In [3]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

## Load and Explore the Data

In [4]:
data = pd.read_csv(r"C:\Users\soitb\OneDrive\Desktop\Datasets\twitter_sentiments\training.1600000.processed.noemoticon.csv",  encoding="latin-1")

In [5]:
data.head(3)

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire


In [6]:
data.columns

Index(['0', '1467810369', 'Mon Apr 06 22:19:45 PDT 2009', 'NO_QUERY',
       '_TheSpecialOne_',
       '@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D'],
      dtype='object')

The file has no header row, so pandas used the first tweet as the column names. We reload it with header=None and pass our own list of column names

In [7]:
data = pd.read_csv(r"C:\Users\soitb\OneDrive\Desktop\Datasets\twitter_sentiments\training.1600000.processed.noemoticon.csv",  encoding="latin-1", header=None, names=["target","ids","date","flag","user","text"])

In [8]:
data.head(3)

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...


In [9]:
data.shape

(1600000, 6)

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


The target column has two distinct values: 0 for negative sentiment and 4 for positive sentiment.

The value counts for the two classes (0 and 4) are equal, so the data is balanced and needs no further balancing

In [11]:
data["target"].value_counts()

0    800000
4    800000
Name: target, dtype: int64

In [12]:
data.dtypes

target     int64
ids        int64
date      object
flag      object
user      object
text      object
dtype: object

There are no null values in the dataset (so no need to handle null values)

In [13]:
data.isnull().any()

target    False
ids       False
date      False
flag      False
user      False
text      False
dtype: bool

## Data Cleaning and Preprocessing

Selecting the two useful columns: target and text

In [14]:
data.columns

Index(['target', 'ids', 'date', 'flag', 'user', 'text'], dtype='object')

In [15]:
data = data[['target','text']].copy()  # copy so later assignments do not act on a view

Recoding the positive label from 4 to 1

In [16]:
data.loc[data['target'] == 4, 'target'] = 1  # .loc avoids chained assignment

For now we take the first 20,000 tweets of each class (2.5% of each class, 40,000 tweets in total) to train our model (will further see how the change in data size affects the model accuracy)

In [17]:
data_positive = data[data['target']==1].iloc[:20000]
data_negative = data[data['target']==0].iloc[:20000]

In [18]:
print(data_positive.shape)
print(data_negative.shape)

(20000, 2)
(20000, 2)
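
Taking the first rows with iloc keeps whatever order the file happens to have. A possible alternative, sketched here and not used in this notebook, is to draw a random sample of equal size from each class. The helper name sample_balanced, the seed and the toy frame are made up for illustration, and the sketch assumes a pandas version that supports groupby(...).sample (1.1 or later).

```python
import pandas as pd

def sample_balanced(df, n_per_class, seed=42):
    """Draw n_per_class random rows from each target class, then shuffle."""
    sampled = df.groupby('target').sample(n=n_per_class, random_state=seed)
    # Shuffle so the classes are interleaved instead of grouped together.
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data standing in for the real tweets (illustrative only).
toy = pd.DataFrame({
    'target': [0] * 5 + [1] * 5,
    'text': [f'tweet {i}' for i in range(10)],
})
subset = sample_balanced(toy, n_per_class=3)
print(subset['target'].value_counts())
```

Changing n_per_class is then enough to rerun the experiment with a different data size.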


In [19]:
data_positive.head(2)

Unnamed: 0,target,text
800000,1,I LOVE @Health4UandPets u guys r the best!!
800001,1,im meeting up with one of my besties tonight! ...


Concatenating the two samples to combine both classes in one dataset

In [20]:
data = pd.concat([data_positive,data_negative],axis=0,ignore_index=True)
print(data.shape)

(40000, 2)


In [21]:
data.head(2)

Unnamed: 0,target,text
0,1,I LOVE @Health4UandPets u guys r the best!!
1,1,im meeting up with one of my besties tonight! ...


In [22]:
n = 6
print(data['text'][n])
print(len(data['text'][n]))
print(data['target'][n])

@r_keith_hill Thans for your response. Ihad already find this answer 
69
1


### Data Cleaning

Before cleaning the text I am making a copy of the data frame so the two can be compared later

In [23]:
data_c = data.copy(deep=True)

In [24]:
id(data)

2508571611728

In [25]:
id(data_c)

2508321438416

Looking at some of the text, it contains:
- web addresses (www. or https)
- hashtags such as #GoWithTheFlow
- numerics
- alphanumerics
- upper case and lower case
- punctuation
- user tags such as @switchfoot
- elongated words like 'haaappppyyyyy', 'soooooo much', 'Thanksss', 'nooooo'
- emoticons
- shorthand like u for you and r for are

Some preprocessing steps for the text:
- lower casing
- punctuation removal
- stopword removal, except for negations such as 'not', 'shouldn't', 'didn't', as these words change the sentiment of a text

Converting everything to lower case so that the same word written in upper and lower case is treated as one word

In [26]:
data_c['text'] = data_c['text'].apply(lambda x: x.lower())

In [27]:
data_c.head(5)

Unnamed: 0,target,text
0,1,i love @health4uandpets u guys r the best!!
1,1,im meeting up with one of my besties tonight! ...
2,1,"@darealsunisakim thanks for the twitter add, s..."
3,1,being sick can be really cheap when it hurts t...
4,1,@lovesbrooklyn2 he has that effect on everyone


Removing Hashtags

In [28]:
def remove_hashtags(text):
    return re.sub(r'#\S+', '', text)  # raw string avoids invalid-escape warnings

In [29]:
data_c['text'] = data_c['text'].apply(lambda x: remove_hashtags(x))

In [30]:
data_c.head(5)

Unnamed: 0,target,text
0,1,i love @health4uandpets u guys r the best!!
1,1,im meeting up with one of my besties tonight! ...
2,1,"@darealsunisakim thanks for the twitter add, s..."
3,1,being sick can be really cheap when it hurts t...
4,1,@lovesbrooklyn2 he has that effect on everyone


Removing usernames like @twitter

In [31]:
def remove_username(text):
    # drop the mention and one trailing space; also matches a mention at the end of the text
    return re.sub(r'@\w+\s?', '', text)

In [32]:
remove_username('I LOVE @Health4UandPets u guys r the')

'I LOVE u guys r the'

In [33]:
data_c['text'] = data_c['text'].apply(lambda x: remove_username(x))

In [34]:
data.head(2)

Unnamed: 0,target,text
0,1,I LOVE @Health4UandPets u guys r the best!!
1,1,im meeting up with one of my besties tonight! ...


In [35]:
data_c.head(2)

Unnamed: 0,target,text
0,1,i love u guys r the best!!
1,1,im meeting up with one of my besties tonight! ...


Removing web addresses

In [36]:
def remove_urls(text):
    return re.sub(r'(www\.\S+)|(https?://\S+)', '', text)

In [37]:
data_c['text'] = data_c['text'].apply(lambda x: remove_urls(x))

In [38]:
data_c.head(5)

Unnamed: 0,target,text
0,1,i love u guys r the best!!
1,1,im meeting up with one of my besties tonight! ...
2,1,"thanks for the twitter add, sunisa! i got to m..."
3,1,being sick can be really cheap when it hurts t...
4,1,he has that effect on everyone


Expanding contractions with the contractions library (i'm --> i am, "wont" --> "will not", "dont" --> "do not", etc.)

In [39]:
data_c['text'] = data_c['text'].apply(lambda x: contractions.fix(x))

In [40]:
data_c.head(5)

Unnamed: 0,target,text
0,1,i love you guys r the best!!
1,1,i am meeting up with one of my besties tonight...
2,1,"thanks for the twitter add, sunisa! i got to m..."
3,1,being sick can be really cheap when it hurts t...
4,1,he has that effect on everyone


Punctuation removal

Remember that removing all punctuation also removes emoticons such as :) :-) :( so once the punctuation is gone we can't use emoticons as sentiment signals.

I still have to find a way to solve this problem.
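
One possible way around this, sketched here as an assumption and not what this notebook does, is to replace common emoticons with word tokens before the punctuation is stripped. The emoticon list and the token names emopos and emoneg are made up for illustration.

```python
import re
import string

# Hypothetical mapping; extend it with whatever emoticons matter for the data.
EMOTICON_TOKENS = {
    ':)': 'emopos',
    ':-)': 'emopos',
    ';)': 'emopos',
    ':(': 'emoneg',
    ':-(': 'emoneg',
}

# Longest emoticons first so ':-)' is tried before ':)'.
_emoticon_re = re.compile(
    '|'.join(re.escape(e) for e in sorted(EMOTICON_TOKENS, key=len, reverse=True))
)

def replace_emoticons(text):
    # Pad with spaces so the token never fuses with a neighbouring word.
    return _emoticon_re.sub(lambda m: ' ' + EMOTICON_TOKENS[m.group(0)] + ' ', text)

def strip_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

cleaned = strip_punct(replace_emoticons('great day :) but my phone died :('))
print(cleaned.split())
```

The tokens contain only letters, so they would survive the number removal step, and the padding spaces collapse in the extra-space step.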

In [41]:
def remove_punct(text):
    english_punctuations = string.punctuation
    translator = str.maketrans('','', english_punctuations)
    return text.translate(translator)

In [42]:
data_c['text'] = data_c['text'].apply(lambda x: remove_punct(x))

In [43]:
data_c.head(5)

Unnamed: 0,target,text
0,1,i love you guys r the best
1,1,i am meeting up with one of my besties tonight...
2,1,thanks for the twitter add sunisa i got to mee...
3,1,being sick can be really cheap when it hurts t...
4,1,he has that effect on everyone


Removing numbers (as numbers don't seem to add any sentiment to a sentence)

In [44]:
def remove_numbers(text):
    return re.sub('[0-9]+', '', text)

In [45]:
data_c['text'] = data_c['text'].apply(lambda x: remove_numbers(x))

In [46]:
data_c.head(5)

Unnamed: 0,target,text
0,1,i love you guys r the best
1,1,i am meeting up with one of my besties tonight...
2,1,thanks for the twitter add sunisa i got to mee...
3,1,being sick can be really cheap when it hurts t...
4,1,he has that effect on everyone


Removing Extra Spaces from text

In [47]:
def remove_extra_spaces(text):
    return re.sub(r'\s\s+', ' ', text)

In [48]:
data_c['text'] = data_c['text'].apply(lambda x: remove_extra_spaces(x))
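
The regex steps so far can also be chained in one function, which makes the order of the steps explicit. This is only a sketch under a few assumptions: the name clean_tweet is made up, the contractions step is left out so the block needs only the standard library, and the example URL is a placeholder.

```python
import re
import string

_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tweet(text):
    """Apply the cleaning steps above in a fixed order."""
    text = text.lower()
    text = re.sub(r'(www\.\S+)|(https?://\S+)', '', text)  # web addresses
    text = re.sub(r'@\w+', '', text)                       # user mentions
    text = re.sub(r'#\S+', '', text)                       # hashtags
    text = text.translate(_PUNCT_TABLE)                    # punctuation
    text = re.sub(r'[0-9]+', '', text)                     # numbers
    return re.sub(r'\s+', ' ', text).strip()               # extra spaces

result = clean_tweet('I LOVE @Health4UandPets u guys r the best!! http://example.com #1')
print(result)
```

Here URLs are removed first so that a '#' fragment inside a link is not treated as a hashtag. It could then be applied in one pass with data_c['text'].apply(clean_tweet).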

Spelling correction

In English, correctly spelled words almost never repeat a letter more than twice in a row, so runs of three or more identical letters are replaced with exactly two using a regex

In [49]:
def remove_repeated_char(text,n):
    n = str(n)
    pattern = r'([a-z])\1{' + n + r',}'
    out_pattern = r'\1\1'
    return re.sub(pattern, out_pattern, text)

In [50]:
data_c['text'] = data_c['text'].apply(lambda x: remove_repeated_char(x,2))

In [51]:
remove_repeated_char('haaappy',2)

'haappy'

Correcting spellings with TextBlob or pyspellchecker

TextBlob and pyspellchecker behave almost the same, but TextBlob changes 'guys' to 'guns' and 'checker' to 'checked', while pyspellchecker leaves both words unchanged. I would go with pyspellchecker as it seems more accurate, though both of them convert 'plz' to 'ply'.

Note: pyspellchecker returns None for unknown words, which has to be handled before the words are joined back together.
Both spell checkers also take a very long time on this dataset, which is why I am skipping spell checking for now
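
Both problems above, the None results and the run time, could be reduced by correcting each distinct word only once and falling back to the original word. The sketch below is not what this notebook does: make_cached_corrector is a made-up helper, and the toy dictionary stands in for a real corrector such as SpellChecker().correction.

```python
from functools import lru_cache

def make_cached_corrector(correct_word):
    """Wrap a word-level corrector so each distinct word is corrected once."""
    @lru_cache(maxsize=None)
    def cached(word):
        fixed = correct_word(word)
        # Keep the original word when the corrector has no suggestion (None).
        return fixed if fixed else word

    def correct_text(text):
        return ' '.join(cached(w) for w in text.split())

    return correct_text

# Toy stand-in corrector (illustrative only): a lookup that returns None for unknown words.
toy_fixes = {'thans': 'thanks'}
correct_text = make_cached_corrector(toy_fixes.get)
print(correct_text('thans for your response plz'))
```

Tweets share most of their vocabulary, so the cache means the slow corrector runs once per distinct word, not once per occurrence.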

In [None]:
# def spell_correct_tb(x):
#     b = TextBlob(x)
#     return str(b.correct())

In [None]:
# def spell_correct_py(x):
#     spell = SpellChecker()
#     # correction() returns None for unknown words, so fall back to the original word
#     c = [spell.correction(y) or y for y in x.split()]
#     return ' '.join(c)

Removing single characters from text

In [52]:
def remove_single(text):
    return re.sub(r'\b[a-z]\b\s?', '', text)  # also catches a single letter at the end of the text

In [53]:
data_c['text'] = data_c['text'].apply(lambda x: remove_single(x))

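Stop Words handling

I am excluding 'no' and 'not' from the stopword list, so they stay in the text, because they change the sentiment of a sentence.
Since contractions were already expanded (for example "don't" became "do not"), forms like "won't", "wouldn't", "doesn't" and "don't" need no separate handling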
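
A small sketch of why the negation words matter. The stopword set here is a tiny made-up stand-in for NLTK's English list, and the function takes the set as a parameter only to make the comparison easy.

```python
# Tiny stand-in stopword set (illustrative only); the notebook uses NLTK's English list.
toy_stopwords = {'i', 'am', 'the', 'with', 'not', 'no'}

def remove_stopwords(text, stop_words):
    return ' '.join(word for word in text.split() if word not in stop_words)

text = 'i am not happy with the service'
with_negation_removed = remove_stopwords(text, toy_stopwords)
with_negation_kept = remove_stopwords(text, toy_stopwords - {'not', 'no'})

print(with_negation_removed)  # happy service
print(with_negation_kept)     # not happy service
```

With the full list the negative tweet ends up looking positive, which is the reason for keeping 'not' and 'no'.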

In [54]:
stopwords_list = set(stopwords.words('english'))
# keep the negation words in the text by dropping them from the stopword set
stop_words = stopwords_list - {'not', 'no'}
#print(stop_words)

In [55]:
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

In [56]:
data_c['text'] = data_c['text'].apply(lambda text: remove_stopwords(text))

In [57]:
data_c['text'][0]

'love guys best'

In [58]:
data['text'][0]

'I LOVE @Health4UandPets u guys r the best!! '

Lemmatization to convert words to their roots

I will not use lemmatization here (though I will see how it affects the output in future), as pointed out in the paper "The Role of Pre-processing in Twitter Sentiment Analysis" by Yanwei Bao, Changqin Quan, Lijuan Wang and Fuji Ren (https://kd.nsfc.gov.cn/paperDownload/1000013535999.pdf): "Our experiments results on Stanford Twitter Sentiment Dataset show that URLs features reservation, negation transformation and repeated letters normalization have a positive impact on classification accuracy while stemming and lemmatization have a negative impact."

In [None]:
# lemmatizer = WordNetLemmatizer()
# def lemmatize_text(data):
#     # data is a list of tokens; return the lemmatized tokens
#     return [lemmatizer.lemmatize(word) for word in data]

## Preparation for Modeling

### Tokenization

NLTK provides nltk.tokenize.TweetTokenizer, which is built for social media text and keeps emoticons as tokens (only if they are still in the text, that is, if we did not remove them during punctuation cleaning)
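
A minimal sketch of how the tokenizer could be used on a raw tweet. The example tweet and the choice of options are assumptions, not settings this notebook has decided on: strip_handles drops @mentions and reduce_len shortens runs of a repeated character to three.

```python
from nltk.tokenize import TweetTokenizer

# preserve_case=False lowercases ordinary words but leaves emoticons as they are.
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

tokens = tokenizer.tokenize('@someuser I am sooooo happy today :) #GoWithTheFlow')
print(tokens)
```

If it were applied to the raw text, before the punctuation step, the emoticon would survive as its own token.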

Useful Links: https://berkeley-stat159-f17.github.io/stat159-f17/lectures/11-strings/11-nltk..html#First-pass