# Introduction

Data preparation for NLP is task-specific. Consider your goal and review the text to understand what may help.

# Manual Tokenization

## Load Data

In [None]:
my_file = 'metamorphosis_clean.txt'
# read the whole file into one string; the with-block closes the file automatically
with open(my_file, 'rt') as file:
    text = file.read()

## Split by Whitespace

In [None]:
words = text.split()
print(words[:100])

## Select Words

In [None]:
# The re module can split on non-word characters so that only words remain.
# This can create problems: "what's" becomes "what" and "s", while "armour-like" becomes "armour" and "like".
import re
words = re.split(r'\W+', text)
print(words[:100])
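
To make the splitting problem concrete, here is a minimal sketch on a short hand-written sentence (the sample string and the alternative pattern are illustrative assumptions, not part of the original workflow). It compares `re.split(r'\W+', ...)` with a pattern that keeps internal apostrophes and hyphens.

```python
import re

# Hand-written sample sentence, for illustration only
sample = "What's this? An armour-like back."

# Splitting on runs of non-word characters breaks contractions and
# hyphenated words apart; empty strings at the edges are filtered out
split_tokens = [t for t in re.split(r'\W+', sample) if t]

# One possible alternative: match runs of word characters that may be
# joined by an internal apostrophe or hyphen
kept_tokens = re.findall(r"\w+(?:['-]\w+)*", sample)

print(split_tokens)
print(kept_tokens)
```

Whether keeping "what's" and "armour-like" whole is desirable depends on the task; the second pattern is only one of several reasonable choices.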

## Split by Whitespace and Remove Punctuation

Essentially, split by whitespace, then remove all punctuation from each word.

In [None]:
# built-in string of punctuation characters
import string
print(string.punctuation)

In [None]:
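# split by whitespace again: `words` currently holds the regex split from the previous section
words = text.split()
# build a translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]  # list of words with punctuation removed
print(stripped[:100])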

## Normalizing Case

Lowercasing can shrink the vocabulary, but some words may lose meaning or distinction: Apple the company is not apple the fruit.

In [None]:
words = [word.lower() for word in words]
print(words[:100])
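
As a small illustration of the vocabulary effect, the sketch below counts distinct words before and after lowercasing. The word list is hand-written for the example and is not taken from the book text.

```python
# Hand-written sample, for illustration only
sample_words = ['The', 'the', 'Apple', 'apple', 'fell', 'Fell']

# Vocabulary = set of distinct tokens
vocab_original = set(sample_words)
vocab_lowered = {w.lower() for w in sample_words}

print(len(vocab_original), len(vocab_lowered))
```

The lowered vocabulary is smaller, but "Apple" and "apple" can no longer be told apart.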

## Final notes on manual tokenization

Simpler is usually better: start simple, then add complexity only where it helps, and avoid making things too complex.
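
One way to keep things simple is to wrap the manual steps in a single small function. The sketch below is only a starting point; the function name `simple_tokenize` and the sample sentence are hypothetical.

```python
import string

def simple_tokenize(text):
    """Split on whitespace, strip punctuation, lowercase, drop empty tokens."""
    table = str.maketrans('', '', string.punctuation)
    tokens = (w.translate(table).lower() for w in text.split())
    # a token made only of punctuation becomes '' and is dropped here
    return [t for t in tokens if t]

# Hand-written sample sentence, for illustration only
result = simple_tokenize("What's this? An armour-like back - hard, brown.")
print(result)
```

Note that removing punctuation turns "What's" into "whats" and "armour-like" into "armourlike", which may or may not suit the task.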

# Tokenization and Cleaning with NLTK

In [None]:
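import nltk
# download NLTK data (corpora, models) via the interactive downloader;
# the nltk package itself must already be installed
nltk.download()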

## Load Data

In [None]:
# did this above, but for consistency
filename = 'metamorphosis_clean.txt'
with open(filename, 'rt') as file:
    text = file.read()

## Split into Sentences

In [None]:
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])

## Split into Words

In [None]:
# word_tokenize splits on whitespace and punctuation, so commas and similar
# become separate tokens, and contractions are split apart
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])

## Filter Out Punctuation

In [None]:
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])
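
The `isalpha()` filter removes more than punctuation. The sketch below uses a hand-written token list (an assumption, meant to resemble tokenizer output rather than reproduce it) to show which kinds of tokens are dropped.

```python
# Hand-written tokens, for illustration only
sample_tokens = ['What', "'s", 'happened', '?', 'armour-like', 'back', ',', '1915']

# str.isalpha() is True only when every character is a letter
alpha_only = [t for t in sample_tokens if t.isalpha()]
dropped = [t for t in sample_tokens if not t.isalpha()]

print(alpha_only)
print(dropped)
```

Contraction fragments, hyphenated words and numbers are discarded along with the punctuation, so check whether that loss matters for the task.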

## Filter Out Stop Words

Stop words are words that carry little significance on their own, such as "the" and "is".

In [1]:
import nltk
from nltk.corpus import stopwords  # stop words from nltk

# The stopwords corpus must be downloaded first; without it, stopwords.words()
# raises "LookupError: Resource stopwords not found"
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)

# be careful, rabbit holes can become quite deep
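
Once a stop-word list is available, filtering is a simple membership test. The sketch below uses a tiny hand-picked stop-word set and a hypothetical helper `remove_stop_words` so that it runs without any downloads; in practice the list from `stopwords.words('english')` could be passed in instead.

```python
# Hypothetical, hand-picked stop-word set, for illustration only
example_stop_words = {'the', 'is', 'a', 'of', 'and', 'in'}

def remove_stop_words(words, stop_words):
    """Return the words whose lowercase form is not in stop_words."""
    stop_set = set(stop_words)  # set lookup is fast for long lists
    return [w for w in words if w.lower() not in stop_set]

# Hand-written sample, for illustration only
sample_words = ['The', 'room', 'is', 'a', 'proper', 'human', 'room']
filtered = remove_stop_words(sample_words, example_stop_words)
print(filtered)
```

Comparing the lowercase form matters here because stop-word lists are typically lowercase while the tokens may not be.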


## Stem Words

Stemming reduces words to a root or base form, so that "fishing", "fished" and "fisher" can all reduce to "fish" (the exact result depends on the stemmer).

In [None]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

Don't forget about lemmatization either: it groups the inflected forms of a word under a common base form (the lemma).