# Basic Pre-processing of Text Data

- Lower casing
- Punctuation removal
- Stopwords removal
- Frequent words removal
- Rare words removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization

In [4]:
import pandas as pd

In [20]:
train = pd.read_csv('D:\\Datasets\\train_E6oV3lV.csv')

In [21]:
train.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


In [22]:
train['tweet'].head()

0     @user when a father is dysfunctional and is s...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model   i love u take with u all the time in ...
4               factsguide: society now    #motivation
Name: tweet, dtype: object

## Lower casing

In [23]:
# Lower-case every word; split/join also collapses repeated whitespace
train['tweet'] = train['tweet'].apply(lambda x: " ".join(word.lower() for word in x.split()))
train['tweet'].head()

0    @user when a father is dysfunctional and is so...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model i love u take with u all the time in ur...
4                  factsguide: society now #motivation
Name: tweet, dtype: object
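
The effect of the split/join idiom can be checked on a single string without pandas. The sketch below is only an illustration; `normalize_case` is a hypothetical helper name, not something defined in the notebook.

```python
def normalize_case(text):
    # str.split() with no argument splits on any run of whitespace,
    # so joining with a single space also collapses repeated spaces.
    return " ".join(word.lower() for word in text.split())

sample = "#Model   I love U take with U"
normalized = normalize_case(sample)
print(normalized)  # #model i love u take with u
```

This is why rows 3 and 4 in the output above lose their extra spaces as well as their capital letters.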

## Removing Punctuation

In [24]:
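# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
train['tweet'] = train['tweet'].str.replace(r'[^\w\s]', '', regex=True)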

In [26]:
train['tweet'].head()

0    user when a father is dysfunctional and is so ...
1    user user thanks for lyft credit i cant use ca...
2                                  bihday your majesty
3    model i love u take with u all the time in urð...
4                    factsguide society now motivation
Name: tweet, dtype: object
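
To see what the pattern `[^\w\s]` keeps and drops, the same substitution can be tried on one string with the standard `re` module. This is a minimal sketch; `strip_punctuation` is a hypothetical helper name.

```python
import re

def strip_punctuation(text):
    # Remove every character that is neither a word character nor whitespace.
    # In Python 3, \w is Unicode-aware, so letters such as "ð" are kept.
    return re.sub(r"[^\w\s]", "", text)

cleaned = strip_punctuation("@user thanks for #lyft credit, i can't!")
print(cleaned)  # user thanks for lyft credit i cant
print(strip_punctuation("urð!"))  # urð
```

Note that the `@` and `#` markers and the apostrophe in "can't" are removed along with ordinary punctuation, which matches the "user", "lyft", and "cant" tokens in the output above.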

## Removal of stop words

In [28]:
from nltk.corpus import stopwords
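# Requires the corpus to be available, e.g. via nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set makes membership tests fast
train['tweet'] = train['tweet'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))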
train['tweet'].head()

0    user father dysfunctional selfish drags kids d...
1    user user thanks lyft credit cant use cause do...
2                                       bihday majesty
3                model love u take u time urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object
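
The filtering step itself does not depend on NLTK. The sketch below uses a tiny hand-written stop list (`STOP_WORDS` is an assumption made for illustration and is far smaller than NLTK's English list) to show the same logic on plain strings.

```python
# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer.
STOP_WORDS = {"a", "is", "and", "when", "for", "i", "the", "your"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    # Keep only the words that are not in the stop list.
    return " ".join(word for word in text.split() if word not in stop_words)

first = remove_stopwords("when a father is dysfunctional and is selfish")
second = remove_stopwords("bihday your majesty")
print(first)   # father dysfunctional selfish
print(second)  # bihday majesty
```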

## Common word removal

In [50]:
freq= pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq

user     17473
love      2647
ð         2511
day       2199
â         1797
happy     1663
amp       1582
im        1139
u         1136
time      1110
dtype: int64

In [51]:
freq = list(freq.index)

In [52]:
train['tweet'] = train['tweet'].apply(lambda x: ' '.join(word for word in x.split() if word not in freq))

In [53]:
train['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object
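
The same idea can be sketched with `collections.Counter` from the standard library instead of `value_counts()`. The toy tweets and the cut-off of two words are assumptions chosen so the result is easy to follow.

```python
from collections import Counter

# Toy corpus standing in for train['tweet'].
tweets = ["user love day", "user love happy", "user time", "father majesty"]

# Count every word across the whole corpus.
counts = Counter(word for tweet in tweets for word in tweet.split())

# Take the two most frequent words (the notebook takes the top 10).
top_words = {word for word, _ in counts.most_common(2)}

cleaned = [
    " ".join(word for word in tweet.split() if word not in top_words)
    for tweet in tweets
]
print(sorted(top_words))  # ['love', 'user']
print(cleaned)            # ['day', 'happy', 'time', 'father majesty']
```

Whether the most frequent words should be dropped depends on the task: a word such as "love" or "happy" may carry signal for a sentiment model, so it is worth inspecting the list before removing it.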

## Rare words removal

In [56]:
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq.head()

callumscho1    1
earnedit       1
fulfill        1
fiji           1
delhiâs        1
dtype: int64

In [57]:
freq = list(freq.index)

In [59]:
train['tweet'] = train['tweet'].apply(lambda x: ' '.join(word for word in x.split() if word not in freq))
train['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object
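
Taking the last 10 entries of `value_counts()` removes only 10 of the many words that occur once, which is why the first five rows above are unchanged. One possible alternative, sketched here under the assumption that a count threshold is acceptable, is to drop every word seen fewer than `min_count` times (`min_count` is a hypothetical parameter, not part of the notebook).

```python
from collections import Counter

# Toy corpus standing in for train['tweet'].
tweets = ["father kids father", "kids fiji", "father kids earnedit"]

counts = Counter(word for tweet in tweets for word in tweet.split())

min_count = 2  # assumed threshold: keep words seen at least twice
rare_words = {word for word, count in counts.items() if count < min_count}

cleaned = [
    " ".join(word for word in tweet.split() if word not in rare_words)
    for tweet in tweets
]
print(sorted(rare_words))  # ['earnedit', 'fiji']
print(cleaned)             # ['father kids father', 'kids', 'father kids']
```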

## Spelling correction

In [61]:
from textblob import TextBlob

In [64]:
train['tweet'][:5].apply(lambda x : str(TextBlob(x).correct()))

0    father dysfunctional selfish drags kiss dysfun...
1    thanks left credit can use cause dont offer wh...
2                                       midday majesty
3                               model take or ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object
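
The output above shows that automatic correction can also change valid tokens ("kids" becomes "kiss", "lyft" becomes "left"). One way to limit this is to protect known tokens before correcting. The sketch below is not TextBlob's algorithm: it uses `difflib.get_close_matches` from the standard library against a small hypothetical vocabulary, and `PROTECTED` and `VOCABULARY` are assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical vocabulary and protected tokens, for illustration only.
VOCABULARY = ["father", "thanks", "credit", "majesty", "society"]
PROTECTED = {"lyft"}

def correct_word(word, cutoff=0.8):
    # Leave protected tokens and known words untouched.
    if word in PROTECTED or word in VOCABULARY:
        return word
    # Otherwise take the closest vocabulary entry above the similarity cutoff.
    matches = get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

corrected = [correct_word(w) for w in ["fther", "lyft", "socety", "xyzzy"]]
print(corrected)  # ['father', 'lyft', 'society', 'xyzzy']
```

The same guard could be placed in front of `TextBlob(x).correct()` by correcting word by word and skipping protected tokens.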

## Tokenization

Tokenization refers to dividing the text into a sequence of words or sentences.

In [68]:
TextBlob(train['tweet'][0]).words

WordList(['father', 'dysfunctional', 'selfish', 'drags', 'kids', 'dysfunction', 'run'])
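
For intuition, both kinds of tokenization can be approximated with regular expressions. This is a rough sketch and not how TextBlob tokenizes; the patterns are assumptions that handle only simple English text.

```python
import re

def word_tokenize(text):
    # A word is a run of letters or digits, optionally followed by an
    # apostrophe part such as the "'t" in "can't".
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

def sentence_tokenize(text):
    # Split on whitespace that follows a sentence-ending mark.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "I can't use the credit. They don't offer wheelchair vans!"
word_tokens = word_tokenize(sample)
sentences = sentence_tokenize(sample)
print(word_tokens)
print(sentences)
```

Abbreviations such as "e.g." or "Dr." would be split incorrectly by the sentence pattern, which is one reason to prefer a library tokenizer on real data.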

## Stemming

Stemming refers to the removal of suffixes such as 'ing', 'ly', and 's' using a simple rule-based approach.

In [69]:
from nltk.stem import PorterStemmer

In [71]:
st = PorterStemmer()
train['tweet'][:5].apply(lambda x: ' '.join([st.stem(word) for word in x.split()]))

0        father dysfunct selfish drag kid dysfunct run
1    thank lyft credit cant use caus dont offer whe...
2                                       bihday majesti
3                              model take urð ðððð ððð
4                              factsguid societi motiv
Name: tweet, dtype: object
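
Stems such as "dysfunct" and "societi" are not dictionary words because a stemmer only applies suffix rules. The toy function below illustrates that idea with three suffixes; it is a deliberately crude sketch, not the Porter algorithm, and `strip_suffix` and `min_stem` are hypothetical names.

```python
def strip_suffix(word, suffixes=("ing", "ly", "s"), min_stem=3):
    # Remove the first matching suffix, as long as at least
    # `min_stem` characters of the word remain.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["running", "quickly", "kids", "bus", "majesty"]
stems = [strip_suffix(w) for w in words]
print(stems)  # ['runn', 'quick', 'kid', 'bus', 'majesty']
```

The result "runn" shows the typical weakness of pure suffix stripping: the output is a consistent key for grouping word forms, but not necessarily a real word.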

## Lemmatization

Lemmatization is often a more effective option than stemming because it converts a word into its root form (lemma) rather than just stripping suffixes. It makes use of a vocabulary and morphological analysis to obtain the root word, which is why lemmatization is usually preferred over stemming.

In [74]:
from textblob import Word
train['tweet'][:5].apply(lambda x: ' '.join([Word(word).lemmatize() for word in x.split()]))


0    father dysfunctional selfish drag kid dysfunct...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object
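
The role of the vocabulary can be illustrated with a lookup table. The sketch below is a toy stand-in, not TextBlob's WordNet-based lemmatizer; `LEMMA_TABLE` is a hypothetical, hand-written mapping.

```python
# Hypothetical lookup table; a real lemmatizer consults a full lexicon.
LEMMA_TABLE = {
    "kids": "kid",
    "drags": "drag",
    "ran": "run",
    "mice": "mouse",
}

def toy_lemmatize(word):
    # Return the dictionary form if known, otherwise the word unchanged.
    return LEMMA_TABLE.get(word, word)

words = ["kids", "ran", "mice", "majesty"]
lemmas = [toy_lemmatize(w) for w in words]
print(lemmas)  # ['kid', 'run', 'mouse', 'majesty']
```

Irregular forms such as "ran" and "mice" cannot be recovered by removing suffixes, which is where a vocabulary-based approach differs from stemming. Unknown tokens such as "bihday" pass through unchanged, as in the output above.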