
## Text Preprocessing

We have the Twitter customer support dataset:
https://drive.google.com/file/d/1N6HZyzX-yj_sG1yFAZ4d2WE42OTDf4gV/view?usp=sharing

In any machine learning task, cleaning and preprocessing the data is at least as important as building the model. For unstructured data such as text, this step matters even more.

The objective of this notebook is to walk through the common text preprocessing steps with examples.

Some of the common text preprocessing / cleaning steps are:

* Lower casing
* Removal of Punctuation
* Removal of Stopwords
* Removal of Frequent words
* Removal of Rare words
* Number to words/ignoring numbers
* Stemming
* Lemmatization
* Removal of emojis
* Removal of emoticons
* Conversion of emoticons to words
* Conversion of emojis to words
* Removal of URLs
* Removal of HTML tags
* Chat words conversion
* Spelling correction


These are the common preprocessing steps for text data, but we do not need to apply all of them every time. The steps should be chosen carefully for the use case, because the choice determines what information is left for the model.

For example, in a sentiment analysis use case we should not remove emojis or emoticons, since they carry important information about the sentiment. Similar decisions have to be made for every other use case.
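
One way to make that choice explicit is to write each step as a small function and compose only the steps a use case needs. The sketch below is illustrative only: `make_pipeline`, the step functions and the two example pipelines are hypothetical names, not part of this notebook.

```python
import string
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[str], str]

def lower_case(text: str) -> str:
    return text.lower()

def strip_punctuation(text: str) -> str:
    # Removes ASCII punctuation only
    return text.translate(str.maketrans("", "", string.punctuation))

def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())

def make_pipeline(steps: Iterable[Step]) -> Step:
    """Return a function that applies the given steps from left to right."""
    steps = list(steps)
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

# Hypothetical choices: a sentiment model keeps "!!!" and ":(",
# a topic model does not need them.
sentiment_pipeline = make_pipeline([lower_case, collapse_whitespace])
topic_pipeline = make_pipeline([lower_case, strip_punctuation, collapse_whitespace])

sample = "Worst  service EVER!!! :("
print(sentiment_pipeline(sample))
print(topic_pipeline(sample))
```

Keeping the steps as plain functions also means each one can be applied to a DataFrame column with `.apply(...)`, as the cells below do.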

In [None]:
import pandas as pd
import re
import nltk
import spacy
import string
pd.set_option('display.max_colwidth', None)

In [None]:
from google.colab import drive  # while working on colab only
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
! ls -l "/content/drive/My Drive/data"

total 757033
-rw------- 1 root root   4882918 Feb  2  2019 banking.csv
-rw------- 1 root root 150828752 Jun 20  2021 creditcard.csv
drwx------ 2 root root      4096 Dec 27  2019 dataset
-rw------- 1 root root     11328 Apr 10  2022 heart_disease.csv
-rw------- 1 root root     49586 Dec 19  2019 housing.tsv
-rw------- 1 root root      3867 Nov 20  2019 iris.csv
-rw------- 1 root root      1389 May  1  2020 petrol_consumption.csv
drwx------ 2 root root      4096 Jul 12  2020 pubg
-rw------- 1 root root       452 Jan 20  2019 salaryData.csv
-rw------- 1 root root      4286 Jun  1  2022 shopping_data.csv
drwx------ 2 root root      4096 Aug 13  2022 stackoverflow
-rw------- 1 root root 516508641 Aug  6  2022 twcs.csv
-rw------- 1 root root 102895657 Oct 10  2019 uci-news-aggregator.csv
-rw------- 1 root root       180 May 28  2022 uci-news-aggregator.gsheet


In [None]:
! pwd

/content


In [None]:
# alternative for large file reads in pandas

# pd.read_csv('filename.csv', chunksize=1000)
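
The hint in the cell above can be expanded into a short sketch. With `chunksize`, `pd.read_csv` returns an iterator of DataFrames instead of a single DataFrame, so each chunk can be processed and reduced before the next one is read. The in-memory CSV and the `count_rows_with_mention` helper are hypothetical stand-ins for the real file and the real processing.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file on disk
csv_data = io.StringIO(
    "tweet_id,text\n"
    "1,@sprintcare and how do you propose we do that\n"
    "2,I did.\n"
    "3,@sprintcare is the worst customer service\n"
    "4,thanks\n"
    "5,@115712 please send us a private message\n"
)

def count_rows_with_mention(source, chunksize=2):
    """Count rows whose text starts with '@', reading `chunksize` rows at a time."""
    total = 0
    for chunk in pd.read_csv(source, usecols=["text"], chunksize=chunksize):
        total += int(chunk["text"].astype(str).str.startswith("@").sum())
    return total

n_mentions = count_rows_with_mention(csv_data)
print(n_mentions)
```

Only one chunk is held in memory at a time, which is the point of the approach; the `nrows=100000` option used in the next cell is the simpler choice when a sample of the file is enough.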


In [None]:
full_df = pd.read_csv("/content/drive/My Drive/data/twcs.csv", nrows=100000)
# .copy() gives df its own data, so adding columns later does not trigger SettingWithCopyWarning
df = full_df[["text"]].copy()
df["text"] = df["text"].astype(str)
full_df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messages and no one is responding as usual,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


In [None]:
df.head(25)

Unnamed: 0,text
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.
1,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messages and no one is responding as usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.
4,@sprintcare I did.
5,"@115712 Can you please send us a private message, so that I can gain further details about your account?"
6,@sprintcare is the worst customer service
7,"@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC"
8,@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯
9,"@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA"


### Lower case conversion

In [None]:
df["text_lower"] = df["text"].str.lower()
df.head()


Unnamed: 0,text,text_lower
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.
4,@sprintcare I did.,@sprintcare i did.
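
`str.lower()` is sufficient for the mostly English tweets in this dataset. For text in other languages, Python also provides `str.casefold()`, a more aggressive normalisation meant for caseless matching (pandas exposes it as `Series.str.casefold()`). The sample strings below are made up for illustration.

```python
samples = ["Please DM us", "STRASSE", "Straße"]

lowered = [s.lower() for s in samples]
folded = [s.casefold() for s in samples]

print(lowered)
print(folded)

# lower() keeps "ß", so the two German spellings still differ;
# casefold() maps "ß" to "ss", so they compare equal.
print(lowered[1] == lowered[2], folded[1] == folded[2])
```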


In [None]:
full_df.head(25)

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messages and no one is responding as usual,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0
5,6,sprintcare,False,Tue Oct 31 21:46:24 +0000 2017,"@115712 Can you please send us a private message, so that I can gain further details about your account?",57.0,8.0
6,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610.0,
7,11,sprintcare,False,Tue Oct 31 22:10:35 +0000 2017,"@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC",,12.0
8,12,115713,True,Tue Oct 31 22:04:47 +0000 2017,@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯,111314.0,15.0
9,15,sprintcare,False,Tue Oct 31 20:03:31 +0000 2017,"@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA",12.0,16.0


In [None]:
df.head(25)

Unnamed: 0,text,text_lower
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.
4,@sprintcare I did.,@sprintcare i did.
5,"@115712 Can you please send us a private message, so that I can gain further details about your account?","@115712 can you please send us a private message, so that i can gain further details about your account?"
6,@sprintcare is the worst customer service,@sprintcare is the worst customer service
7,"@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC","@115713 this is saddening to hear. please shoot us a dm, so that we can look into this for you. -kc"
8,@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯,@sprintcare you gonna magically change your connectivity for me and my whole family ? 🤥 💯
9,"@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA","@115713 we understand your concerns and we'd like for you to please send us a direct message, so that we can further assist you. -aa"


### Removal of Punctuation

In [None]:
# string.punctuation covers ASCII punctuation only, so curly quotes such as ‘ ’ are left in place
PUNCT_TO_REMOVE = string.punctuation
print(PUNCT_TO_REMOVE)

def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

df["text_wo_punct"] = df["text_lower"].apply(lambda text: remove_punctuation(text))
df.head()

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~



Unnamed: 0,text,text_lower,text_wo_punct
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did
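
Removing punctuation also glues URLs into single tokens: `https://t.co/abc123` becomes `httpstcoabc123` (tokens of this kind show up among the rare words later in this notebook). If the URLs are not needed, one option is to strip them before this step. The sketch below assumes a deliberately simple URL pattern; `URL_PATTERN`, `remove_urls` and `strip_ascii_punct` are hypothetical names.

```python
import re
import string

# Simple pattern (assumption): anything starting with http(s):// or www. up to the next space
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_urls(text: str) -> str:
    return URL_PATTERN.sub("", text)

def strip_ascii_punct(text: str) -> str:
    return text.translate(_PUNCT_TABLE)

tweet = "check your order here: https://t.co/abc123 thanks!"

punct_only = strip_ascii_punct(tweet)
# Remove URLs first, then punctuation, then tidy up the leftover double space
urls_first = " ".join(strip_ascii_punct(remove_urls(tweet)).split())

print(punct_only)
print(urls_first)
```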


In [None]:
df.head(25)

Unnamed: 0,text,text_lower,text_wo_punct
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did
5,"@115712 Can you please send us a private message, so that I can gain further details about your account?","@115712 can you please send us a private message, so that i can gain further details about your account?",115712 can you please send us a private message so that i can gain further details about your account
6,@sprintcare is the worst customer service,@sprintcare is the worst customer service,sprintcare is the worst customer service
7,"@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC","@115713 this is saddening to hear. please shoot us a dm, so that we can look into this for you. -kc",115713 this is saddening to hear please shoot us a dm so that we can look into this for you kc
8,@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯,@sprintcare you gonna magically change your connectivity for me and my whole family ? 🤥 💯,sprintcare you gonna magically change your connectivity for me and my whole family 🤥 💯
9,"@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA","@115713 we understand your concerns and we'd like for you to please send us a direct message, so that we can further assist you. -aa",115713 we understand your concerns and wed like for you to please send us a direct message so that we can further assist you aa
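
Row 3 above still contains the curly quotes in ‘message’, because `string.punctuation` lists ASCII characters only. A possible extension, sketched here under the assumption that all Unicode punctuation should be removed as well, builds the translation table from `unicodedata` categories. `remove_unicode_punctuation` is a hypothetical helper, not part of this notebook.

```python
import string
import sys
import unicodedata

# Map every code point whose Unicode category starts with "P" (punctuation) to None.
UNICODE_PUNCT_TABLE = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)
# Unicode classifies characters such as "$" or "+" as symbols ("S"), not punctuation,
# so add string.punctuation explicitly to keep the notebook's ASCII behaviour.
UNICODE_PUNCT_TABLE.update(dict.fromkeys(map(ord, string.punctuation)))

def remove_unicode_punctuation(text: str) -> str:
    return text.translate(UNICODE_PUNCT_TABLE)

sample = "just click ‘message’ at the top of your profile. 💯"
print(remove_unicode_punctuation(sample))
```

Emojis fall under symbol categories and are therefore kept, which matches the behaviour of the original `remove_punctuation` on row 8.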


### Removal of Stopwords

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [None]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df["text_wo_stop"] = df["text_wo_punct"].apply(lambda text: remove_stopwords(text))
df.head()


Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare


In [None]:
df.head(50)

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare
5,"@115712 Can you please send us a private message, so that I can gain further details about your account?","@115712 can you please send us a private message, so that i can gain further details about your account?",115712 can you please send us a private message so that i can gain further details about your account,115712 please send us private message gain details account
6,@sprintcare is the worst customer service,@sprintcare is the worst customer service,sprintcare is the worst customer service,sprintcare worst customer service
7,"@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC","@115713 this is saddening to hear. please shoot us a dm, so that we can look into this for you. -kc",115713 this is saddening to hear please shoot us a dm so that we can look into this for you kc,115713 saddening hear please shoot us dm look kc
8,@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯,@sprintcare you gonna magically change your connectivity for me and my whole family ? 🤥 💯,sprintcare you gonna magically change your connectivity for me and my whole family 🤥 💯,sprintcare gonna magically change connectivity whole family 🤥 💯
9,"@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA","@115713 we understand your concerns and we'd like for you to please send us a direct message, so that we can further assist you. -aa",115713 we understand your concerns and wed like for you to please send us a direct message so that we can further assist you aa,115713 understand concerns wed like please send us direct message assist aa
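
Because punctuation was removed before this step, contractions have already lost their apostrophes: "don't" is now "dont", which is not in the stopword list printed above (only "don" and "don't" are). One possible remedy, sketched below, is to add a punctuation-free variant of every stopword. `SAMPLE_STOPWORDS` is a small hand-picked stand-in for `stopwords.words('english')`, and `remove_stopwords_from` is a hypothetical helper.

```python
import string

# Hand-picked subset standing in for stopwords.words('english') (assumption for this sketch)
SAMPLE_STOPWORDS = {"i", "you", "we", "do", "that", "don", "don't", "you're", "it's"}

_strip_punct = str.maketrans("", "", string.punctuation)
# Add a punctuation-free variant of every stopword, e.g. "don't" -> "dont"
STOPWORDS_NO_PUNCT = SAMPLE_STOPWORDS | {w.translate(_strip_punct) for w in SAMPLE_STOPWORDS}

def remove_stopwords_from(text: str, stopword_set: set) -> str:
    return " ".join(word for word in text.split() if word not in stopword_set)

text_wo_punct = "i dont think youre listening"
before = remove_stopwords_from(text_wo_punct, SAMPLE_STOPWORDS)
after = remove_stopwords_from(text_wo_punct, STOPWORDS_NO_PUNCT)
print(before)
print(after)
```

The alternative is to remove stopwords before punctuation, at the cost of having to tokenise text that still has commas and full stops attached to the words.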


### Removal of Frequent words

In [None]:
from collections import Counter
cnt = Counter()
for text in df["text_wo_stop"].values:
    for word in text.split():
        cnt[word] += 1

cnt.most_common(10)

[('us', 15280),
 ('please', 14172),
 ('dm', 10290),
 ('help', 9181),
 ('hi', 8364),
 ('thanks', 7794),
 ('get', 7393),
 ('sorry', 6913),
 ('amazonhelp', 5916),
 ('know', 5443)]

In [None]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

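# Note: applied to text_wo_punct (not text_wo_stop), so stopwords are still present in text_wo_freq
df["text_wo_freq"] = df["text_wo_punct"].apply(lambda text: remove_freqwords(text))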
df.head()


Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_wo_freq
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist,115712 i understand i would like to assist you we would need to you into a private secured link to further assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual,sprintcare i have sent several private messages and no one is responding as usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile,115712 send a private message so that we can further assist you just click ‘message’ at the top of your profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare,sprintcare i did
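
The counting loop above can also be written with `Counter.update`. It is sometimes useful to count in how many tweets a word appears (document frequency) rather than how often it appears overall, so that a single repetitive tweet does not dominate the ranking. The toy `texts` list below is made up for illustration.

```python
from collections import Counter

texts = [
    "please dm us",
    "please please please help",
    "thanks for the help",
]

term_freq = Counter()  # total number of occurrences
doc_freq = Counter()   # number of texts containing the word
for text in texts:
    words = text.split()
    term_freq.update(words)       # every occurrence counts
    doc_freq.update(set(words))   # each word counts once per text

print(term_freq["please"], doc_freq["please"])
print(term_freq["help"], doc_freq["help"])
```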


### Removal of Rare words

In [None]:
n_rare_words = 20
# most_common() is sorted by descending count, so the reversed tail holds the rarest words
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
print(RAREWORDS)
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["text_wo_freqrare"] = df["text_wo_freq"].apply(lambda text: remove_rarewords(text))
df.head()

{'230k', 'httpstcoklebhaiotk', 'httpstcoldgvfgko7y', 'rajni', '144296', 'ulllrrrich', 'tale”', 'httpstcoehp8tcbeds', '7109', '144295', 'mailtemplate', 'knight’s', 'haveagainasked', 'enforcer', 'therelets', 'phone131027', 'lichtenstein', '7110', 'chaucer', 'httpstcosxccepn7bl'}



Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_wo_freq,text_wo_freqrare
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist,115712 i understand i would like to assist you we would need to you into a private secured link to further assist,115712 i understand i would like to assist you we would need to you into a private secured link to further assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcare and how do you propose we do that,sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile,115712 send a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 send a private message so that we can further assist you just click ‘message’ at the top of your profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare,sprintcare i did,sprintcare i did
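
Taking the last 20 entries of `most_common()` is somewhat arbitrary when many words share the lowest count, because which of them end up in the tail depends only on the order in which they were first seen. An alternative sketch, assuming a count threshold suits the use case better, removes every word seen fewer than `min_count` times. `rare_words_below`, `remove_words`, `min_count` and the toy counter are illustrative names and data.

```python
from collections import Counter

def rare_words_below(counter: Counter, min_count: int) -> set:
    """Return all words whose count is strictly below min_count."""
    return {word for word, count in counter.items() if count < min_count}

def remove_words(text: str, words_to_drop: set) -> str:
    return " ".join(word for word in text.split() if word not in words_to_drop)

# Toy counter standing in for the notebook's `cnt`
toy_cnt = Counter("send us a dm please send details httpstcoabc123 chaucer".split())

rare = rare_words_below(toy_cnt, min_count=2)
print(sorted(rare))
print(remove_words("please send chaucer a dm", rare))
```

On the real data the threshold would need tuning; with 100,000 tweets, a large share of the vocabulary is likely to occur only once.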


In [None]:
df.drop(["text_wo_freq", "text_wo_freqrare"], axis=1, inplace=True)



In [None]:
df.head(50)

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare
5,"@115712 Can you please send us a private message, so that I can gain further details about your account?","@115712 can you please send us a private message, so that i can gain further details about your account?",115712 can you please send us a private message so that i can gain further details about your account,115712 please send us private message gain details account
6,@sprintcare is the worst customer service,@sprintcare is the worst customer service,sprintcare is the worst customer service,sprintcare worst customer service
7,"@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC","@115713 this is saddening to hear. please shoot us a dm, so that we can look into this for you. -kc",115713 this is saddening to hear please shoot us a dm so that we can look into this for you kc,115713 saddening hear please shoot us dm look kc
8,@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯,@sprintcare you gonna magically change your connectivity for me and my whole family ? 🤥 💯,sprintcare you gonna magically change your connectivity for me and my whole family 🤥 💯,sprintcare gonna magically change connectivity whole family 🤥 💯
9,"@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA","@115713 we understand your concerns and we'd like for you to please send us a direct message, so that we can further assist you. -aa",115713 we understand your concerns and wed like for you to please send us a direct message so that we can further assist you aa,115713 understand concerns wed like please send us direct message assist aa


### Stemming

In [None]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
def stem_words(text):
    """custom function to stem every word in the text"""
    return " ".join([stemmer.stem(word) for word in text.split()])

df["text_stemmed"] = df["text_wo_stop"].apply(lambda text: stem_words(text))
df.head()


Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_stemmed
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist,115712 understand would like assist would need get privat secur link assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcar propos
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual,sprintcar sent sever privat messag one respond usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile,115712 pleas send us privat messag assist click messag top profil
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare,sprintcar


In [None]:
df.head(35)

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_stemmed
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist,115712 understand would like assist would need get privat secur link assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcar propos
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual,sprintcar sent sever privat messag one respond usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile,115712 pleas send us privat messag assist click messag top profil
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare,sprintcar
5,"@115712 Can you please send us a private message, so that I can gain further details about your account?","@115712 can you please send us a private message, so that i can gain further details about your account?",115712 can you please send us a private message so that i can gain further details about your account,115712 please send us private message gain details account,115712 pleas send us privat messag gain detail account
6,@sprintcare is the worst customer service,@sprintcare is the worst customer service,sprintcare is the worst customer service,sprintcare worst customer service,sprintcar worst custom servic
7,"@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC","@115713 this is saddening to hear. please shoot us a dm, so that we can look into this for you. -kc",115713 this is saddening to hear please shoot us a dm so that we can look into this for you kc,115713 saddening hear please shoot us dm look kc,115713 sadden hear pleas shoot us dm look kc
8,@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯,@sprintcare you gonna magically change your connectivity for me and my whole family ? 🤥 💯,sprintcare you gonna magically change your connectivity for me and my whole family 🤥 💯,sprintcare gonna magically change connectivity whole family 🤥 💯,sprintcar gonna magic chang connect whole famili 🤥 💯
9,"@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA","@115713 we understand your concerns and we'd like for you to please send us a direct message, so that we can further assist you. -aa",115713 we understand your concerns and wed like for you to please send us a direct message so that we can further assist you aa,115713 understand concerns wed like please send us direct message assist aa,115713 understand concern wed like pleas send us direct messag assist aa


7. Lemmatization

In [None]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer

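lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    # Without a POS argument, lemmatize() treats every token as a noun,
    # which is why "us" becomes "u" and "responding" stays unchanged in the output.
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])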

df["text_lemmatized"] = df["text_wo_stop"].apply(lambda text: lemmatize_words(text))
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["text_lemmatized"] = df["text_wo_stop"].apply(lambda text: lemmatize_words(text))


Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_stemmed,text_lemmatized
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist,115712 understand would like assist would need get privat secur link assist,115712 understand would like assist would need get private secured link assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcar propos,sprintcare propose
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual,sprintcar sent sever privat messag one respond usual,sprintcare sent several private message one responding usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile,115712 pleas send us privat messag assist click messag top profil,115712 please send u private message assist click ‘message’ top profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare,sprintcar,sprintcare


In [None]:
df.head(25)

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_stemmed,text_lemmatized
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist,115712 understand would like assist would need get privat secur link assist,115712 understand would like assist would need get private secured link assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcar propos,sprintcare propose
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual,sprintcar sent sever privat messag one respond usual,sprintcare sent several private message one responding usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile,115712 pleas send us privat messag assist click messag top profil,115712 please send u private message assist click ‘message’ top profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare,sprintcar,sprintcare
5,"@115712 Can you please send us a private message, so that I can gain further details about your account?","@115712 can you please send us a private message, so that i can gain further details about your account?",115712 can you please send us a private message so that i can gain further details about your account,115712 please send us private message gain details account,115712 pleas send us privat messag gain detail account,115712 please send u private message gain detail account
6,@sprintcare is the worst customer service,@sprintcare is the worst customer service,sprintcare is the worst customer service,sprintcare worst customer service,sprintcar worst custom servic,sprintcare worst customer service
7,"@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC","@115713 this is saddening to hear. please shoot us a dm, so that we can look into this for you. -kc",115713 this is saddening to hear please shoot us a dm so that we can look into this for you kc,115713 saddening hear please shoot us dm look kc,115713 sadden hear pleas shoot us dm look kc,115713 saddening hear please shoot u dm look kc
8,@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯,@sprintcare you gonna magically change your connectivity for me and my whole family ? 🤥 💯,sprintcare you gonna magically change your connectivity for me and my whole family 🤥 💯,sprintcare gonna magically change connectivity whole family 🤥 💯,sprintcar gonna magic chang connect whole famili 🤥 💯,sprintcare gonna magically change connectivity whole family 🤥 💯
9,"@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA","@115713 we understand your concerns and we'd like for you to please send us a direct message, so that we can further assist you. -aa",115713 we understand your concerns and wed like for you to please send us a direct message so that we can further assist you aa,115713 understand concerns wed like please send us direct message assist aa,115713 understand concern wed like pleas send us direct messag assist aa,115713 understand concern wed like please send u direct message assist aa


In [None]:
lemmatizer.lemmatize("running")

'running'

In [None]:
lemmatizer.lemmatize("running", "v") # v for verb

'run'

In [None]:
print("Word is : stripes")
print("Lemma result for verb : ",lemmatizer.lemmatize("stripes", 'v'))
print("Lemma result for noun : ",lemmatizer.lemmatize("stripes", 'n'))

Word is : stripes
Lemma result for verb :  strip
Lemma result for noun :  stripe


In [None]:
lemmatizer.lemmatize("stripes")

'stripe'

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
text = "We are meeting tomorrow for our business dealings and paperwork signing."
pos_tagged_text = nltk.pos_tag(text.split())
print(pos_tagged_text)

[('We', 'PRP'), ('are', 'VBP'), ('meeting', 'VBG'), ('tomorrow', 'NN'), ('for', 'IN'), ('our', 'PRP$'), ('business', 'NN'), ('dealings', 'NNS'), ('and', 'CC'), ('paperwork', 'NN'), ('signing.', 'NN')]


In [None]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
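# Map the first letter of a Penn Treebank tag to the matching WordNet POS constant
wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}
def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    # Tags outside the map (pronouns, numbers, ...) fall back to VERB in this notebook
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.VERB)) for word, pos in pos_tagged_text])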

df["text_lemmatized"] = df["text_wo_stop"].apply(lambda text: lemmatize_words(text))
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["text_lemmatized"] = df["text_wo_stop"].apply(lambda text: lemmatize_words(text))


Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_stemmed,text_lemmatized
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist,115712 understand would like assist would need get privat secur link assist,115712 understand would like assist would need get private secure link assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcar propos,sprintcare propose
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual,sprintcar sent sever privat messag one respond usual,sprintcare send several private message one respond usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile,115712 pleas send us privat messag assist click messag top profil,115712 please send us private message assist click ‘message’ top profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare,sprintcar,sprintcare


In [None]:
df.head(25)

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_stemmed,text_lemmatized
0,@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.,@115712 i understand. i would like to assist you. we would need to get you into a private secured link to further assist.,115712 i understand i would like to assist you we would need to get you into a private secured link to further assist,115712 understand would like assist would need get private secured link assist,115712 understand would like assist would need get privat secur link assist,115712 understand would like assist would need get private secure link assist
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcar propos,sprintcare propose
2,@sprintcare I have sent several private messages and no one is responding as usual,@sprintcare i have sent several private messages and no one is responding as usual,sprintcare i have sent several private messages and no one is responding as usual,sprintcare sent several private messages one responding usual,sprintcar sent sever privat messag one respond usual,sprintcare send several private message one respond usual
3,@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.,@115712 please send us a private message so that we can further assist you. just click ‘message’ at the top of your profile.,115712 please send us a private message so that we can further assist you just click ‘message’ at the top of your profile,115712 please send us private message assist click ‘message’ top profile,115712 pleas send us privat messag assist click messag top profil,115712 please send us private message assist click ‘message’ top profile
4,@sprintcare I did.,@sprintcare i did.,sprintcare i did,sprintcare,sprintcar,sprintcare
5,"@115712 Can you please send us a private message, so that I can gain further details about your account?","@115712 can you please send us a private message, so that i can gain further details about your account?",115712 can you please send us a private message so that i can gain further details about your account,115712 please send us private message gain details account,115712 pleas send us privat messag gain detail account,115712 please send us private message gain detail account
6,@sprintcare is the worst customer service,@sprintcare is the worst customer service,sprintcare is the worst customer service,sprintcare worst customer service,sprintcar worst custom servic,sprintcare bad customer service
7,"@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC","@115713 this is saddening to hear. please shoot us a dm, so that we can look into this for you. -kc",115713 this is saddening to hear please shoot us a dm so that we can look into this for you kc,115713 saddening hear please shoot us dm look kc,115713 sadden hear pleas shoot us dm look kc,115713 sadden hear please shoot us dm look kc
8,@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯,@sprintcare you gonna magically change your connectivity for me and my whole family ? 🤥 💯,sprintcare you gonna magically change your connectivity for me and my whole family 🤥 💯,sprintcare gonna magically change connectivity whole family 🤥 💯,sprintcar gonna magic chang connect whole famili 🤥 💯,sprintcare gonna magically change connectivity whole family 🤥 💯
9,"@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA","@115713 we understand your concerns and we'd like for you to please send us a direct message, so that we can further assist you. -aa",115713 we understand your concerns and wed like for you to please send us a direct message so that we can further assist you aa,115713 understand concerns wed like please send us direct message assist aa,115713 understand concern wed like pleas send us direct messag assist aa,115713 understand concern wed like please send us direct message assist aa
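
The tag-to-POS lookup used above can be sketched without NLTK, since the WordNet POS constants are just the letters `n`, `v`, `a`, and `r`. The helper below is a hypothetical illustration (the name `penn_to_wordnet` and its `default` parameter are not part of the notebook). It makes the fallback explicit: the cell above falls back to verb, while `WordNetLemmatizer.lemmatize` itself defaults to noun when no POS is given.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter.

    Hypothetical helper for illustration; `default` is used for tags
    whose first letter is not N, V, J, or R.
    """
    first_letter_map = {"N": "n", "V": "v", "J": "a", "R": "r"}
    return first_letter_map.get(tag[:1], default)


# Tags taken from the pos_tag output shown earlier
tagged = [("We", "PRP"), ("meeting", "VBG"), ("dealings", "NNS")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

Passing `default="v"` reproduces the fallback chosen in the notebook cell. Which default works better depends on the data, so it may be worth comparing both on a sample of tweets.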


8. Remove Emojis

In [None]:
def remove_emoji(text):
    # The parameter is named `text` so it does not shadow the imported `string` module
    emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "\U00002702-\U000027B0"  # dingbats
                           "\U000024C2-\U0001F251"  # very broad range: also matches CJK and other non-emoji text
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("game is on 🔥🔥")

'game is on '
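
The pattern above has two side effects worth knowing about. Its last range (`\U000024C2-\U0001F251`) also covers CJK ideographs, so non-English text is stripped too, and none of the ranges include U+1F925 (🤥), which appears in row 8 of the dataset. Below is a narrower sketch, assuming the listed Unicode blocks are enough for your data; the block list is an illustrative choice and is not exhaustive, and `remove_emoji_narrow` is a hypothetical name.

```python
import re

# Assumed set of emoji-related Unicode blocks; not exhaustive.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]+"
)


def remove_emoji_narrow(text):
    """Remove characters in the blocks above and leave all other text alone."""
    return EMOJI_PATTERN.sub("", text)


print(repr(remove_emoji_narrow("日本語 ok 🔥")))  # → '日本語 ok '
print(repr(remove_emoji_narrow("whole family 🤥 💯")))
```

For production use, a maintained emoji library is likely a safer choice than a hand-written list of ranges, because new emoji are added to Unicode regularly.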

Emojis to words: https://github.com/ikatyang/emoji-cheat-sheet/blob/master/README.md
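
For sentiment-style tasks it can be more useful to convert emojis to words than to delete them. The sketch below uses a tiny hand-written dictionary. The mapping, the word choices, and the function name `convert_emojis` are all hypothetical and only for illustration; a real mapping could be built from a list such as the cheat sheet linked above.

```python
import re

# Hypothetical, hand-written mapping for illustration only
EMOJI_TO_WORD = {
    "🔥": "fire",
    "💯": "hundred_points",
    "🤥": "lying_face",
}


def convert_emojis(text, mapping=EMOJI_TO_WORD):
    """Replace each known emoji with its word and tidy up the whitespace."""
    pattern = re.compile("|".join(re.escape(emoji) for emoji in mapping))
    # Pad each replacement with spaces so adjacent emojis do not fuse together
    replaced = pattern.sub(lambda match: f" {mapping[match.group(0)]} ", text)
    # Collapse the extra spaces introduced by the padding
    return " ".join(replaced.split())


print(convert_emojis("game is on 🔥🔥"))  # → game is on fire fire
```

Emojis that are not in the dictionary are left unchanged, so this step can run before an emoji-removal step without losing the mapped ones.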

10. Remove URLs

In [None]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

In [None]:
text = "Driverless AI NLP blog post on https://www.h2o.ai/blog/detecting-sarcasm-is-difficult-but-ai-may-have-an-answer/"
remove_urls(text)

'Driverless AI NLP blog post on '

In [None]:
text = "Want to know more. Checkout www.h2o.ai for additional information"
remove_urls(text)

'Want to know more. Checkout  for additional information'
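
Note the double space left behind in the output above where the URL used to be. A minimal sketch that removes URLs and then normalizes whitespace, reusing the same URL pattern (the function name `remove_urls_and_tidy` is an assumption for illustration):

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def remove_urls_and_tidy(text):
    """Remove URLs, then collapse whitespace runs and trim the ends."""
    without_urls = URL_PATTERN.sub("", text)
    return WHITESPACE_PATTERN.sub(" ", without_urls).strip()


text = "Want to know more. Checkout www.h2o.ai for additional information"
print(remove_urls_and_tidy(text))
# → Want to know more. Checkout for additional information
```

The same tidy-up step is useful after any removal step (emojis, emails, HTML tags), since each of them can leave stray spaces.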

In [None]:
# Remove email addresses

def remove_email(text):
    email_pattern = re.compile(r'\S+@\S+\.\S+')
    return email_pattern.sub(r'', text)

In [None]:
text = "Want to know more: send us an email at amrit@cs.unm.edu"
remove_email(text)

'Want to know more: send us an email at '

11. Remove HTML tags

In [None]:
def remove_html(text):
    # Non-greedy match of anything between '<' and '>'; '.' does not cross newlines
    html_pattern = re.compile(r'<.*?>')
    return html_pattern.sub(r'', text)

text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>"""

print(remove_html(text))


 H2O
 AutoML
 Driverless AI
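
The regex approach works for simple markup, but it leaves HTML entities such as `&amp;` untouched and does not match tags that span several lines. As an alternative sketch, the standard library's `html.parser.HTMLParser` can collect only the text nodes; the class name `TextExtractor` and the function name `strip_html` are assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_html("<p>Tom &amp; Jerry</p>"))  # → Tom & Jerry
```

For messy real-world pages, a dedicated HTML library may be more forgiving than either approach.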



You can also practice these preprocessing steps on any other dataset of your choice.