# Raw Data

In [0]:
raw_txt = "There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators.\nThe corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form 'UserNNN', and manually edited to remove any other identifying information.\nThe corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom)."

In [4]:
print(raw_txt)

There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators.
The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form 'UserNNN', and manually edited to remove any other identifying information.
The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom).


# Segmentation

## Segmentation by Space

In this method, we lowercase the text, split it on runs of non-word characters (whitespace and punctuation), and discard the punctuation.

In [15]:
import re
import string
pun = string.punctuation
def tokenize(sent):
  # Split on runs of non-word characters, so punctuation never becomes a token.
  return [x for x in re.split(r'\W+', sent.lower()) if x]
txt_tok1 = tokenize(raw_txt)
print(txt_tok1[:10])

['there', 'is', 'also', 'a', 'corpus', 'of', 'instant', 'messaging', 'chat', 'sessions']
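
To make the trade-off of this approach concrete, here is a minimal sketch comparing plain whitespace splitting with splitting on non-word characters. The `sample` sentence is a hypothetical example, not part of the corpus text above.

```python
import re

sample = "Hello, world! It's 10,000 posts."  # hypothetical sample sentence

# Plain whitespace splitting keeps punctuation attached to the words.
by_space = sample.lower().split()

# Splitting on runs of non-word characters drops punctuation,
# but it also breaks apart tokens such as "it's" and "10,000".
by_nonword = [t for t in re.split(r"\W+", sample.lower()) if t]

print(by_space)
print(by_nonword)
```

The second list has no punctuation, but contractions and formatted numbers are split into pieces, which is one reason to prefer a dedicated tokenizer.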


## Segmentation by NLTK

In [14]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
sent = sent_tokenize(raw_txt)
txt_tok2 = []
for s in sent:
  txt_tok2.extend(word_tokenize(s.lower()))
print(txt_tok2[:10])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['there', 'is', 'also', 'a', 'corpus', 'of', 'instant', 'messaging', 'chat', 'sessions']


# Cleaning

## Remove Stopwords

Here we can use the stopwords corpus in NLTK directly.

In [11]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Then we can see some examples of stopwords.

In [13]:
print(stopwords[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


Then we need to remove the stopwords from the text data.

In [18]:
print(f"The original length of our text is {len(txt_tok2)}.")
txt_tok2_restop = []
for w in txt_tok2:
  if w not in stopwords and w not in pun:
    txt_tok2_restop.append(w)
print(f"The length of our text after removing stopwords is {len(txt_tok2_restop)}.")
print(txt_tok2_restop[:10])

The original length of our text is 98.
The length of our text after removing stopwords is 54.
['also', 'corpus', 'instant', 'messaging', 'chat', 'sessions', 'originally', 'collected', 'naval', 'postgraduate']
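
The loop above tests membership against a list (`stopwords`) and a string (`pun`). For a string, `in` is a substring test, and for a list it is a linear scan. A possible alternative, sketched here with a hypothetical mini stopword list rather than NLTK's, is to convert both to sets so that each lookup is an exact match in constant time on average.

```python
import string

# Hypothetical mini stopword list, for illustration only; NLTK's list is much longer.
stop_set = {"there", "is", "a", "of"}
# A set of single punctuation characters gives exact-match lookups.
punct_set = set(string.punctuation)

tokens = ["there", "is", "also", "a", "corpus", "of", "chat", "sessions", "."]

# Keep only tokens that are neither stopwords nor punctuation marks.
filtered = [w for w in tokens if w not in stop_set and w not in punct_set]
print(filtered)
```

With the real data, `set(stopwords)` would play the role of `stop_set`.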


# Normalization

## Stemming

A word stem is the part of a word that remains after its morphological affixes are removed. For example, "wait" is the stem of "waited", "waiting", and "waits". We can use an NLTK stemmer to do this.

We first define some words.

In [0]:
words = ["waits", "waited", "waiting", "wait", "playing", "played", "winning"]

We import the module and use the PorterStemmer to stem our words.

In [24]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stem_word = []
for word in words:
  stem_word.append(ps.stem(word))
print(stem_word)

['wait', 'wait', 'wait', 'wait', 'play', 'play', 'win']
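
To illustrate the idea of affix removal, here is a deliberately naive suffix stripper. It is not the Porter algorithm; `toy_stem` and its suffix list are hypothetical names chosen for this sketch.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

sample_words = ["waits", "waited", "waiting", "wait", "playing", "played", "winning"]
toy_result = [toy_stem(w) for w in sample_words]
print(toy_result)
```

This sketch turns "winning" into "winn", whereas the Porter stemmer above returns "win". Rules such as undoing a doubled consonant are part of what a real stemmer adds.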


We can also stem our text data.

In [27]:
txt_tok2_restop_stem = []
for word in txt_tok2_restop:
  txt_tok2_restop_stem.append(ps.stem(word))
for i in range(10):
  print(f"{txt_tok2_restop[i]} : {txt_tok2_restop_stem[i]}")

also : also
corpus : corpu
instant : instant
messaging : messag
chat : chat
sessions : session
originally : origin
collected : collect
naval : naval
postgraduate : postgradu


## Lemmatization

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It is similar to stemming, but it maps each form to a valid dictionary word (its lemma) rather than just stripping affixes.

In [42]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
words = ["better", "best", "good", "play", "playing", "cats", "dogs", "are"]
lemma = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [43]:
stem_word = []
lem_word = []
for w in words:
  stem_word.append(ps.stem(w))
  lem_word.append(lemma.lemmatize(w, pos="v"))
print(f"Stemming result: {stem_word}.")
print(f"Lemmatization result: {lem_word}.")

Stemming result: ['better', 'best', 'good', 'play', 'play', 'cat', 'dog', 'are'].
Lemmatization result: ['better', 'best', 'good', 'play', 'play', 'cat', 'dog', 'be'].


We can also lemmatize our text data.

In [44]:
txt_tok2_restop_lem = []
for word in txt_tok2_restop:
  txt_tok2_restop_lem.append(lemma.lemmatize(word, pos='v'))
for i in range(10):
  print(f"{txt_tok2_restop[i]} : {txt_tok2_restop_lem[i]}")

also : also
corpus : corpus
instant : instant
messaging : message
chat : chat
sessions : sessions
originally : originally
collected : collect
naval : naval
postgraduate : postgraduate
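
In the output above, "sessions" is left unchanged, which is consistent with every token being looked up as a verb (`pos='v'`). The sketch below shows why the part of speech matters. It uses a hypothetical hand-written lookup table (`toy_lemmas`), not WordNet data.

```python
# Hypothetical lemma table keyed by (word, part of speech); not WordNet data.
toy_lemmas = {
    ("sessions", "n"): "session",
    ("collected", "v"): "collect",
    ("messaging", "v"): "message",
    ("are", "v"): "be",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if no entry exists."""
    return toy_lemmas.get((word, pos), word)

# Print each word with its lemma as a verb and as a noun.
for w in ["sessions", "collected", "are"]:
    print(f"{w} : {toy_lemmatize(w, 'v')} (v), {toy_lemmatize(w, 'n')} (n)")
```

A word only reaches its lemma when it is looked up under the matching part of speech, so one option for real text is to tag each token first and pass the corresponding `pos` value to the lemmatizer.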
