## Data Preprocessing

Data Source: https://www.kaggle.com/code/paoloripamonti/twitter-sentiment-analysis/data


In [5]:
import pandas as pd
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
# The CSV has no header row, so the column names are supplied explicitly.
train = pd.read_csv(
    "./datasets/training.1600000.processed.noemoticon.csv",
    encoding=DATASET_ENCODING, names=DATASET_COLUMNS,
)
train = train[["target", "ids", "text"]]
train.head(10)


Unnamed: 0,target,ids,text
0,0,1467810369,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,is upset that he can't update his Facebook by ...
2,0,1467810917,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,my whole body feels itchy and like its on fire
4,0,1467811193,"@nationwideclass no, it's not behaving at all...."
5,0,1467811372,@Kwesidei not the whole crew
6,0,1467811592,Need a hug
7,0,1467811594,@LOLTrish hey long time no see! Yes.. Rains a...
8,0,1467811795,@Tatiana_K nope they didn't have it
9,0,1467812025,@twittera que me muera ?


## 1 Basic Feature Engineering

### 1.1 Word Count

Negative tweets often contain more words than positive ones, so the word count may be a useful feature.

In [6]:
# Count the space-separated tokens in each tweet.
train['word_count'] = train['text'].apply(lambda x: len(str(x).split(" ")))
train[['text', 'word_count']].head()

Unnamed: 0,text,word_count
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",20
1,is upset that he can't update his Facebook by ...,22
2,@Kenichan I dived many times for the ball. Man...,19
3,my whole body feels itchy and like its on fire,11
4,"@nationwideclass no, it's not behaving at all....",22
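One caveat with `split(" ")` is that it splits on every single space, so repeated or trailing spaces produce empty strings that are counted as words. The sketch below contrasts it with `split()`, which splits on any run of whitespace. The sample string is made up for illustration.

```python
# Hypothetical sample with a double space and a trailing space.
sample = "Need  a hug "

tokens_single_space = sample.split(" ")  # keeps the empty strings
tokens_whitespace = sample.split()       # drops the empty strings

print(tokens_single_space)
print(len(tokens_single_space), len(tokens_whitespace))
```

This appears to explain why each `word_count` above is one higher than the number of words implied by `avg_word` in section 1.3, which uses `split()`.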


### 1.2 Character Count

As with word count, we can also count the characters in each tweet (spaces included).

In [8]:
train['char_count']=train['text'].str.len()
train[['text','char_count']].head()

Unnamed: 0,text,char_count
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",115
1,is upset that he can't update his Facebook by ...,111
2,@Kenichan I dived many times for the ball. Man...,89
3,my whole body feels itchy and like its on fire,47
4,"@nationwideclass no, it's not behaving at all....",111


### 1.3 Average Word Length

For each tweet, calculate the average word length: sum of lengths of words / number of words.

In [10]:
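def avg_word(sentence):
    # Mean length of the whitespace-separated words in a tweet.
    # Raises ZeroDivisionError if the tweet contains no words.
    words = sentence.split()
    return sum(len(word) for word in words) / len(words)

train['avg_word'] = train['text'].apply(avg_word)
train[['text', 'avg_word']].head()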

Unnamed: 0,text,avg_word
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",5.052632
1,is upset that he can't update his Facebook by ...,4.285714
2,@Kenichan I dived many times for the ball. Man...,3.944444
3,my whole body feels itchy and like its on fire,3.7
4,"@nationwideclass no, it's not behaving at all....",4.285714
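The function above divides by the number of words, so it fails on a tweet that is empty or contains only whitespace. A minimal guarded variant might look like this; the name `avg_word_safe` and the choice to return `0.0` for empty input are assumptions, not part of the original notebook.

```python
def avg_word_safe(sentence):
    """Mean word length; returns 0.0 when the text has no words (assumed default)."""
    words = str(sentence).split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

print(avg_word_safe("my whole body feels itchy and like its on fire"))
print(avg_word_safe("   "))
```

It can be applied the same way as before, for example `train['text'].apply(avg_word_safe)`.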


### 1.4 Number of Stop Words

Stop words are words that are too common to be informative on their own. They are usually removed during preprocessing (see section 2.3), so recording how many each tweet contained preserves some of the information that removal discards.

In [12]:
# !pip install nltk

In [13]:
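from nltk.corpus import stopwords
# Requires the corpus once: nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set makes membership tests fast

train['stopwords'] = train['text'].apply(lambda sen: len([x for x in sen.split() if x in stop]))
train[['text', 'stopwords']].head()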

Unnamed: 0,text,stopwords
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",4
1,is upset that he can't update his Facebook by ...,8
2,@Kenichan I dived many times for the ball. Man...,5
3,my whole body feels itchy and like its on fire,4
4,"@nationwideclass no, it's not behaving at all....",10
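The count above is case-sensitive, and this cell runs before the text is lowercased, so a capitalised token such as `I` is not matched against a lowercase list. The sketch below shows a case-insensitive count. `STOP_WORDS` is a small hand-written stand-in for `stopwords.words('english')` so that the example runs without NLTK data, and `count_stopwords` is a hypothetical helper name.

```python
# Assumption: a tiny stand-in for the NLTK English stop word list.
STOP_WORDS = {"is", "that", "he", "his", "by", "it", "and", "i", "a", "the"}

def count_stopwords(sentence, stop_words=STOP_WORDS):
    """Count tokens that are stop words, ignoring case."""
    return sum(1 for token in sentence.split() if token.lower() in stop_words)

sample = "I think that he is upset"
case_sensitive = sum(1 for token in sample.split() if token in STOP_WORDS)
case_insensitive = count_stopwords(sample)
print(case_sensitive, case_insensitive)
```

Whether case should be ignored depends on the feature you want; the point is only that the two counts can differ.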


### 1.5 Number of Special Characters

We can count the hashtags (tokens starting with '#') in each tweet, which is potentially useful. Mentions (tokens starting with '@') can be counted in the same way.


In [14]:
train['hashtags']=train['text'].apply(lambda sen:len([x for x in sen.split() if x.startswith("#")]))
train[['text','hashtags']].head()

Unnamed: 0,text,hashtags
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,0
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",0
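The cell above counts only hashtags. A small generalisation that also handles mentions could look like the following; `count_prefixed` is a hypothetical helper and the sample tweet is made up.

```python
def count_prefixed(sentence, prefix):
    """Count whitespace-separated tokens that start with `prefix`."""
    return sum(1 for token in sentence.split() if token.startswith(prefix))

# Made-up sample tweet.
sample = "@someone thanks for the tip! #python #nlp"

mentions = count_prefixed(sample, "@")
hashtags = count_prefixed(sample, "#")
print(mentions, hashtags)
```

These counts need to be taken before punctuation is removed in section 2.2, because that step strips the '#' and '@' characters.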


### 1.6 Number of Numeric Tokens

Counting purely numeric tokens is less common, but it may still be useful.

In [15]:
train['numerics']=train['text'].apply(lambda sen:len([x for x in sen.split() if x.isdigit()]))
train[['text','numerics']].head()

Unnamed: 0,text,numerics
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,0
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",0
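`str.isdigit()` only matches tokens made up entirely of digits, so a token like `50%` is not counted. If digits embedded in other tokens matter, a regular expression is one alternative. This is a sketch on a made-up sample, not the notebook's method.

```python
import re

# Made-up sample text.
sample = "saved 50% of 2 files in 2009"

# Tokens made only of digits, as in the cell above.
digit_tokens = [tok for tok in sample.split() if tok.isdigit()]

# Runs of digits anywhere in the text, including inside other tokens.
digit_runs = re.findall(r"\d+", sample)

print(digit_tokens)
print(digit_runs)
```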


### 1.7 Number of Uppercase Words

Words written entirely in capitals often express strong emotion such as anger, so we count them.

In [16]:
train['upper']=train['text'].apply(lambda sen:len([x for x in sen.split() if x.isupper()]))
train[['text','upper']].head()

Unnamed: 0,text,upper
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",1
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,1
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",1
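`str.isupper()` is true for any token whose cased characters are all uppercase, which includes the pronoun `I` and emoticons such as `;D`. If the goal is to capture "shouting", a stricter count may be closer to the intent. The helper name `count_shouted_words` and the two-letter minimum are assumptions for illustration.

```python
import string

def count_shouted_words(sentence, min_length=2):
    """Count all-caps alphabetic words of at least `min_length` letters."""
    count = 0
    for token in sentence.split():
        core = token.strip(string.punctuation)  # drop surrounding punctuation
        if len(core) >= min_length and core.isalpha() and core.isupper():
            count += 1
    return count

# Made-up sample text.
sample = "I am SO mad ;D STOP!!!"
loose_count = sum(1 for token in sample.split() if token.isupper())
strict_count = count_shouted_words(sample)
print(loose_count, strict_count)
```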


## 2 Text Preprocessing

### 2.1 Lowercasing

Convert all text to lowercase so that the same word in different cases is treated as one token.

In [17]:
train['text']=train['text'].apply(lambda sen:" ".join(x.lower() for x in sen.split()))
train['text'].head()

0    @switchfoot http://twitpic.com/2y1zl - awww, t...
1    is upset that he can't update his facebook by ...
2    @kenichan i dived many times for the ball. man...
3       my whole body feels itchy and like its on fire
4    @nationwideclass no, it's not behaving at all....
Name: text, dtype: object

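### 2.2 Remove Punctuation

Punctuation usually carries little useful information for this task, and removing it reduces the size of the data.

In [18]:
# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default.
train['text'] = train['text'].str.replace(r'[^\w\s]', '', regex=True)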
train['text'].head()

0    switchfoot httptwitpiccom2y1zl  awww thats a b...
1    is upset that he cant update his facebook by t...
2    kenichan i dived many times for the ball manag...
3       my whole body feels itchy and like its on fire
4    nationwideclass no its not behaving at all im ...
Name: text, dtype: object
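To see what the pattern `[^\w\s]` does, here is the same substitution on a single made-up string using the standard `re` module. The names `PUNCTUATION`, `cleaned`, and `normalized` are illustrative.

```python
import re

# Matches any character that is neither a word character nor whitespace.
PUNCTUATION = re.compile(r"[^\w\s]")

# Made-up sample tweet.
sample = "@user check http://example.com - wow, that's great!!!"

cleaned = PUNCTUATION.sub("", sample)
normalized = " ".join(cleaned.split())  # collapse the leftover double spaces

print(repr(cleaned))
print(repr(normalized))
```

Two side effects are visible here and in the output above: URLs collapse into a single token (`httpexamplecom`), and underscores survive because `\w` includes them. Removing URLs before this step may be worth considering.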

### 2.3 Remove Stop Words

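We can build a stop word list and use it to filter the text. Because punctuation was already removed in 2.2, contractions such as `dont` no longer match list entries like `don't` and remain in the text.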

In [19]:
from nltk.corpus import stopwords
stop=stopwords.words('english')
train['text']=train['text'].apply(lambda sen:" ".join(x for x in sen.split() if x not in stop))
train['text'].head()

0    switchfoot httptwitpiccom2y1zl awww thats bumm...
1    upset cant update facebook texting might cry r...
2    kenichan dived many times ball managed save 50...
3                     whole body feels itchy like fire
4             nationwideclass behaving im mad cant see
Name: text, dtype: object
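Steps 2.1 to 2.3 can also be folded into one helper, which makes it easier to apply the same cleaning to new text later. The following is a minimal sketch: `clean_text` is a hypothetical name, and `STOP_WORDS` is a small hand-written stand-in for the NLTK list so the example runs on its own.

```python
import re

# Assumption: a tiny stand-in for stopwords.words('english').
STOP_WORDS = {"i", "am", "is", "it", "a", "the", "that", "not", "at", "all", "no"}
PUNCTUATION = re.compile(r"[^\w\s]")

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, then drop stop words."""
    text = text.lower()
    text = PUNCTUATION.sub("", text)
    return " ".join(word for word in text.split() if word not in stop_words)

print(clean_text("No, it is NOT behaving at all. I am mad!"))
```

With a DataFrame this would be used as `train['text'].apply(clean_text)`.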

### 2.4 Remove Common Words

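Inspect the most frequent words in the text, then decide whether to delete or keep them. We judged the top ten uninformative and remove all of them. Note that words such as `good` and `love` may carry sentiment, so this choice is worth checking against model performance.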

In [24]:
freq=pd.Series(' '.join(train['text']).split()).value_counts()[:10]
freq

im       177478
good      89398
day       82363
get       81484
like      77748
go        72911
dont      66923
today     64601
going     64087
love      63463
dtype: int64

In [25]:
freq=list(freq.index)
train['text']=train['text'].apply(lambda sen:' '.join(x for x in sen.split() if x not in freq))
train['text'].head()

0    switchfoot httptwitpiccom2y1zl awww thats bumm...
1    upset cant update facebook texting might cry r...
2    kenichan dived many times ball managed save 50...
3                          whole body feels itchy fire
4                nationwideclass behaving mad cant see
Name: text, dtype: object
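An alternative way to find and remove the most frequent words is `collections.Counter`, which avoids joining the whole corpus into one string. The sketch below runs on a made-up three-line corpus standing in for `train['text']`; the variable names are illustrative.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for train['text'].
texts = ["im good today", "im going home", "good day im tired"]

# Count every word across all texts.
counts = Counter(word for text in texts for word in text.split())

# Keep the top-k words in a set for fast membership tests.
top_words = {word for word, _ in counts.most_common(2)}

filtered = [
    " ".join(word for word in text.split() if word not in top_words)
    for text in texts
]
print(top_words)
print(filtered)
```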

### 2.5 Remove Rare Words

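Rare words occur too few times for their associations with other words to be learned reliably, so they mostly add noise. Note that `[-10:]` selects only ten of the words that occur once.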

In [26]:
freq = pd.Series(' '.join(train['text']).split()).value_counts()[-10:]
freq

kf_elakeboss247    1
lorikatehall       1
onlineesp          1
stalkarazzi        1
bkilat             1
jerriberri98       1
luvli_levi         1
theianmcneny       1
julyit             1
speakinguph4h      1
dtype: int64

In [27]:
freq = list(freq.index)
train['text'] = train['text'].apply(lambda sen: " ".join(x for x in sen.split() if x not in freq))
train['text'].head()

0    switchfoot httptwitpiccom2y1zl awww thats bumm...
1    upset cant update facebook texting might cry r...
2    kenichan dived many times ball managed save 50...
3                          whole body feels itchy fire
4                nationwideclass behaving mad cant see
Name: text, dtype: object
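Since taking the last ten entries of `value_counts()` removes only a handful of the words that appear once, a frequency threshold may be closer to the stated goal. This is a sketch under assumptions: the threshold `MIN_COUNT = 2` and the mini-corpus are made up.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for train['text'].
texts = ["upset cant update", "cant see jerriberri98", "upset cant sleep"]
MIN_COUNT = 2  # assumed threshold: drop words seen fewer than 2 times

counts = Counter(word for text in texts for word in text.split())
rare_words = {word for word, n in counts.items() if n < MIN_COUNT}

filtered = [
    " ".join(word for word in text.split() if word not in rare_words)
    for text in texts
]
print(sorted(rare_words))
print(filtered)
```

A suitable threshold depends on the corpus size and the downstream model, so it is best treated as a tunable parameter.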

### 2.6 Check Spelling

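Spelling errors are common in tweets written in a hurry. Correcting them can reduce the vocabulary size, but automatic correction also introduces mistakes: in the output below, `cant` becomes `can` and `texting` becomes `testing`. The correction is shown on the first five tweets only and is not assigned back to `train['text']`.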

In [29]:
# !pip install textblob

In [32]:
from textblob import TextBlob
train['text'][:5].apply(lambda x: str(TextBlob(x).correct()))
# train['text'].apply(lambda x: str(TextBlob(x).correct()))

0    switchfoot httptwitpiccom2y1zl www that summer...
1    upset can update facebook testing might cry re...
2    kenichan dived many times ball managed save 50...
3                          whole body feels itchy fire
4                 nationwideclass behaving mad can see
Name: text, dtype: object

### 2.7 Tokenization

We can use TextBlob to split a sentence into words.

In [34]:
TextBlob(train['text'][1]).words

WordList(['upset', 'cant', 'update', 'facebook', 'texting', 'might', 'cry', 'result', 'school', 'also', 'blah'])

### 2.8 Stemming

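Stemming strips suffixes such as 'ing', 'ly', and 's' using heuristic rules, so the result is not always a dictionary word (for example, `updat` and `cri` in the output below).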

In [36]:
from nltk.stem import PorterStemmer
st=PorterStemmer()
train['text'][:5].apply(lambda x:" ".join([st.stem(word) for word in x.split()]))

0    switchfoot httptwitpiccom2y1zl awww that bumme...
1    upset cant updat facebook text might cri resul...
2    kenichan dive mani time ball manag save 50 res...
3                           whole bodi feel itchi fire
4                   nationwideclass behav mad cant see
Name: text, dtype: object

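### 2.9 Lemmatization

Lemmatization maps each word to its dictionary form, so unlike stemming the output stays a valid word (for example, `times` becomes `time` and `feels` becomes `feel`). Without a part-of-speech tag, `Word.lemmatize()` treats words as nouns, which is likely why verbs such as `dived` are unchanged.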

In [42]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

from textblob import Word
train['text']=train['text'].apply(lambda x:" ".join([Word(word).lemmatize() for word in x.split()]))
train['text'].head()

[nltk_data] Downloading package wordnet to /Users/dizhen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/dizhen/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


0    switchfoot httptwitpiccom2y1zl awww thats bumm...
1    upset cant update facebook texting might cry r...
2    kenichan dived many time ball managed save 50 ...
3                           whole body feel itchy fire
4                nationwideclass behaving mad cant see
Name: text, dtype: object