# Text preprocessing

### This notebook is an exercise in applying the standard text preprocessing steps to a given text document using 3 different libraries

### Load and explore the data

In [20]:
data = ''
with open('data.txt','r') as inputfile:
    data = inputfile.read()

In [21]:
print(data)

Neuro-linguistic programming (NLP) is a pseudoscientific approach to communication, personal development, and psychotherapy created by Richard Bandler and John Grinder in California, United States in the 1970s. NLP's creators claim there is a connection between neurological processes (neuro-), language (linguistic) and behavioral patterns learned through experience (programming), and that these can be changed to achieve specific goals in life.[1][2] Bandler and Grinder also claim that NLP methodology can "model" the skills of exceptional people, allowing anyone to acquire those skills.[3][4] They claim as well that, often in a single session, NLP can treat problems such as phobias, depression, tic disorders, psychosomatic illnesses, near-sightedness,[5] allergy, common cold,[6] and learning disorders.[7][8]

There is no scientific evidence supporting the claims made by NLP advocates and it has been discredited as a pseudoscience.[9][10][11]

Scientific reviews state that NLP is based o

### Data cleaning

In [22]:
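# Strip the bracketed citation markers ([1], [2], ...) left over from the source page
import re

# A greedy pattern such as r"\[.+\]" would delete everything between the first "[" and the last "]" on a line,
# so match only digits between the brackets instead.
data_clean = re.sub(r"\[[0-9]+\]", "", data)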
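To see why the digit-only pattern is the safer choice, here is a minimal sketch that runs both patterns on a short made-up sentence. The `sample` string is an assumption for illustration and is not taken from `data.txt`.

```python
import re

# Hypothetical sample with two citation markers on one line
sample = "NLP makes bold claims.[1] Reviews disagree.[12] The end."

# Greedy: ".+" runs from the first "[" to the last "]" on the line
greedy = re.sub(r"\[.+\]", "", sample)

# Specific: only brackets that contain digits are removed
specific = re.sub(r"\[[0-9]+\]", "", sample)

print(greedy)
print(specific)
```

The greedy version silently swallows the middle sentence, which is easy to miss on a long document.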

In [34]:
# data_clean

#### We can still see unneeded new-line (\n) characters, but the tokenizer will take care of those

### Tokenization

In [23]:
! pip install nltk
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

data_clean_word_tokenized = word_tokenize(data_clean)
data_clean_word_tokenized[:10]



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\arthu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Neuro-linguistic',
 'programming',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'pseudoscientific',
 'approach',
 'to']

#### Side note: some NLP tasks need the sentence boundaries preserved. In that case, first split the text into sentences, then split each sentence into tokens

In [24]:
from nltk.tokenize import sent_tokenize

data_clean_sent_tokenized = sent_tokenize(data_clean)
data_clean_sent_tokenized[:2]

['Neuro-linguistic programming (NLP) is a pseudoscientific approach to communication, personal development, and psychotherapy created by Richard Bandler and John Grinder in California, United States in the 1970s.',
 "NLP's creators claim there is a connection between neurological processes (neuro-), language (linguistic) and behavioral patterns learned through experience (programming), and that these can be changed to achieve specific goals in life."]

In [25]:
data_clean_word_sent_tokenized = [word_tokenize(sentence) for sentence in data_clean_sent_tokenized]
data_clean_word_sent_tokenized[0]

['Neuro-linguistic',
 'programming',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'pseudoscientific',
 'approach',
 'to',
 'communication',
 ',',
 'personal',
 'development',
 ',',
 'and',
 'psychotherapy',
 'created',
 'by',
 'Richard',
 'Bandler',
 'and',
 'John',
 'Grinder',
 'in',
 'California',
 ',',
 'United',
 'States',
 'in',
 'the',
 '1970s',
 '.']
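The result of sentence-then-word tokenization is a list of token lists, one per sentence. The sketch below shows that shape with two crude regex-based stand-ins; `naive_sent_tokenize` and `naive_word_tokenize` are hypothetical helpers, not NLTK functions, and they are far less robust than `sent_tokenize` and `word_tokenize`.

```python
import re

def naive_sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace (much cruder than NLTK's punkt model)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(sentence):
    # A token is a word (hyphens allowed inside) or a single punctuation mark
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

# Hypothetical two-sentence input
text = "NLP was created in the 1970s. It has been discredited."
nested = [naive_word_tokenize(s) for s in naive_sent_tokenize(text)]
print(nested)
```

Keeping the outer list means later steps can still work sentence by sentence, for example when padding n-grams at sentence boundaries.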

### Remove punctuation and lowercase

In [38]:
data_clean_word_tokenized = [word.lower() for word in data_clean_word_tokenized if word.isalpha()]
data_clean_word_tokenized[:10]

['a',
 'few',
 'words',
 'about',
 'dostoevsky',
 'himself',
 'may',
 'help',
 'the',
 'english']
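Note that filtering with `str.isalpha()` removes more than punctuation: any token containing a digit or a hyphen is dropped as well. A small sketch on a hand-written token list (an assumption, not actual tokenizer output) makes this visible:

```python
# Hand-written tokens for illustration only
tokens = ["Neuro-linguistic", "programming", "(", "NLP", ")", "1970s", "claim"]

# Same filter as in the cell above: keep alphabetic tokens, lowercased
kept = [t.lower() for t in tokens if t.isalpha()]
# Everything the filter throws away
dropped = [t for t in tokens if not t.isalpha()]

print(kept)
print(dropped)
```

If hyphenated terms or years matter for the task, a looser test than `isalpha()` would be needed.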

### Removing stopwords

In [39]:
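from nltk.corpus import stopwords
# requires a one-time nltk.download('stopwords')

# build the set once; calling stopwords.words() inside the comprehension reloads the list for every token
english_stopwords = set(stopwords.words('english'))
data_clean_word_tokenized = [word for word in data_clean_word_tokenized if word not in english_stopwords]
data_clean_word_tokenized[:10]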

['words',
 'dostoevsky',
 'may',
 'help',
 'english',
 'reader',
 'understand',
 'work',
 'dostoevsky',
 'son']
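Stopword removal is just a membership test against a fixed word list. The sketch below uses a tiny hand-picked set, `STOPWORDS`, which is an assumption for illustration; NLTK's English list is much longer.

```python
# Hypothetical mini stopword list; a set gives constant-time lookups
STOPWORDS = {"a", "about", "few", "himself", "the", "to", "his"}

tokens = ["a", "few", "words", "about", "dostoevsky", "himself", "may", "help", "the", "english"]

# Keep only the tokens that are not in the stopword set
filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)
```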

### Lemmatization / POS tagging

In [40]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet') 
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return ''
    
lemmatizer = WordNetLemmatizer()

data_clean_word_lemmatized = []

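for word in data_clean_word_tokenized:
    # tag each word on its own; without sentence context the tag is only a rough guess
    pos = get_wordnet_pos(pos_tag([word])[0][1])
    # lemmatize with the mapped POS, or keep the word unchanged when no WordNet POS applies
    data_clean_word_lemmatized.append(lemmatizer.lemmatize(word, pos) if pos else word)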

data_clean_word_lemmatized[:10]

['word',
 'dostoevsky',
 'may',
 'help',
 'english',
 'reader',
 'understand',
 'work',
 'dostoevsky',
 'son']
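The tag mapping is the part that is easiest to get wrong: the tagger emits Penn Treebank tags (`NNS`, `VBD`, ...), while the lemmatizer expects one of WordNet's four single-letter codes. A dictionary-based sketch of the same mapping, using the plain string values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`, might look like this (`to_wordnet_pos` is a hypothetical name):

```python
# First letter of the Treebank tag -> WordNet POS code
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag):
    # Only the first letter matters: NN, NNS and NNP all map to noun, and so on.
    # Tags with no WordNet counterpart (IN, MD, DT, ...) return None.
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1])

for tag in ["NNS", "VBD", "JJR", "RB", "IN", "MD"]:
    print(tag, "->", to_wordnet_pos(tag))
```

Returning `None` instead of an empty string is a stylistic choice here; either way the caller has to skip lemmatization when no mapping exists.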

## Now we have a dataset of pre-processed words


In [3]:
import re
import unicodedata
import string
import random
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\arthu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\arthu\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\arthu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\arthu\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [36]:
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return ''

In [37]:
#Tokenize and perform lemmatization
lemmatizer = WordNetLemmatizer()

def clean(text):
    output = []
    # tokenize
    text = word_tokenize(text)
    # keep purely alphabetic tokens (this drops numbers and punctuation) and lowercase them
    text = [word.lower() for word in text if word.isalpha()]
    # lemmatize each word using its mapped WordNet POS, if there is one
    for word in text:
        pos = get_wordnet_pos(pos_tag([word])[0][1])
        if pos != '':
            output.append(lemmatizer.lemmatize(word, pos))
        else:
            output.append(word)
    return output

In [38]:
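# normalize raw text with regular expressions before tokenizing
# (note: this name shadows Python's built-in filter())
def filter(text):
    # Intended to strip punctuation, but the leading "\[" is an escaped literal bracket,
    # so this is not a character class and the pattern never matches.
    # Punctuation is actually dropped later by clean(), which keeps only isalpha() tokens.
    text = re.sub(r"\[.,\/#!$%\^&\*;:{}=\-_`~()]", "", text)
    # collapse runs of tabs, newlines and carriage returns into a single space
    text = re.sub(r"[\t\n\r]+", " ", text)
    return text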

In [39]:
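# Predict the next word from the last three words of the user input
def predict(model, user_input):
    print("Filtering user input...")
    text = filter(user_input)
    print("Cleaning user input...")
    words = clean(text)
    if len(words) < 3:
        print("Need at least three words of context.")
        return
    # the model is keyed by the three preceding tokens
    preds = model[tuple(words[-3:])].most_common(5)
    print(preds)
    if preds:
        print(text + " " + str(preds[0][0]))
    else:
        print("No prediction for this context.")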

In [40]:
# Build a language model: map each 3-word context to the frequencies of the word that follows it
def n_gram_model(list_of_tokenized_text):
    # nltk.ngrams yields padded 4-grams: three context tokens plus the token that follows them
    fourgrams = list(nltk.ngrams(list_of_tokenized_text, 4, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
    # split each 4-gram into a (context, next_word) pair and count the pairs
    pairs = [(gram[:3], gram[3]) for gram in fourgrams]
    return nltk.ConditionalFreqDist(pairs)
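For readers who want to see what the conditional frequency table contains, here is a minimal pure-Python sketch of the same idea using `collections.Counter`. The function `context_model` and the toy sentence are assumptions for illustration; unlike `n_gram_model`, this version appends only a single `</s>` symbol on the right.

```python
from collections import Counter, defaultdict

def context_model(tokens, context_size=3):
    # Pad on the left so the first real word also gets a full-length context
    padded = ["<s>"] * context_size + list(tokens) + ["</s>"]
    model = defaultdict(Counter)
    for i in range(context_size, len(padded)):
        # The preceding context_size tokens form the key; the current token is counted
        context = tuple(padded[i - context_size:i])
        model[context][padded[i]] += 1
    return model

# Hypothetical toy corpus
tokens = "the cat sat on the mat the cat sat on the sofa".split()
model = context_model(tokens)

# Which words have followed "sat on the", and how often?
print(model[("sat", "on", "the")].most_common(2))
```

Looking up a context that was never seen returns an empty counter, so a caller has to handle the case where there is nothing to predict.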


In [41]:
# load the corpus; the with-block closes the file when reading is done
with open('n-gram-data.txt', 'r') as file:
    text = file.read()

# pre-process text
print("Filtering corpus...")
text = filter(text)
print("Cleaning corpus...")
words = clean(text)

Filtering corpus...
Cleaning corpus...


In [42]:
from nltk.tokenize import sent_tokenize

text_clean_sent_tokenized = sent_tokenize(text)
text_clean_sent_tokenized[:10]

['A few words about Dostoevsky himself may help the English reader to understand his work.',
 'Dostoevsky was the son of a doctor.',
 'His parents were very hard- working and deeply religious people, but so poor that they lived with their five children in only two rooms.',
 'The father and mother spent their evenings in reading aloud to their children, generally from books of a serious character.',
 'Though always sickly and delicate Dostoevsky came out third in the final examination of the Petersburg school of Engineering.',
 'There he had already begun his first work, "Poor Folk."',
 'This story was published by the poet Nekrassov in his review and was received with acclamations.',
 'The shy, unknown youth found himself instantly something of a celebrity.',
 'A brilliant and successful career seemed to open before him, but those hopes were soon dashed.',
 'In 1849 he was arrested.']

In [53]:
def main():

    # make language model
    print("Making model...")
    model = n_gram_model(words)

    print("Enter a phrase: ")
    user_input = input()
    predict(model, user_input)
    

main()

Making model...
Enter a phrase: 
Filtering user input...
Cleaning user input...
[('and', 1)]
Though always sickly and
