## Basic

1. Cleaning text
    - numbers
    - punctuation
    - whitespace
    - accented characters
    - case conversion
    - abbreviations
2. Tokenization
3. Removing stopwords
4. Stemming/Lemmatization
5. Removing sparse terms/specific words.

## Advanced
1. Removing URL/html tags
2. Cleaning and expanding emoticons.
3. POS Tagging
4. Chunking
5. NER Tagging

In [48]:
DATA_PATH='./data/text.txt'

### Reading text from files
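- open(filename, mode) : Returns a file object for `filename` opened in `mode`.
- f.readlines() : Returns all remaining lines in f as a list.
- f.readline() : Returns the next single line from f.
- f.read(size=-1) : Returns up to `size` characters from f (bytes in binary mode). With the default `size=-1`, returns the rest of the file.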

Reference : https://docs.python.org/3/tutorial/inputoutput.html 
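The functions listed above can be combined on one file object, since each call continues from where the previous one stopped. The following is a minimal sketch; `sample_path` and the file contents are hypothetical and created only so the example is self-contained.

```python
import os
import tempfile

# Hypothetical sample file, written here so the example runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\nthird line\n")

with open(sample_path, "r", encoding="utf-8") as f:
    head = f.read(5)             # the first 5 characters
    rest_of_line = f.readline()  # the rest of the current line, newline included
    remaining = f.readlines()    # every line that is left, as a list

print(repr(head))
print(repr(rest_of_line))
print(remaining)
```

Each call advances the file position, so `readlines()` here returns only the lines that `read` and `readline` have not already consumed.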


In [49]:
with open(DATA_PATH,'r') as F:
    text=F.read()
    print(text[:50])

﻿The Project Gutenberg EBook of Crime and Punishme


If the file is too large to fit in memory, one option is to iterate over the file object in a for loop and process the text one line at a time.
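Line-by-line processing can be sketched as below. Only the current line is held in memory while a running result is updated. The file `big_path` is a hypothetical stand-in for a large file, and counting words is just one example of per-line work.

```python
import os
import tempfile
from collections import Counter

# Hypothetical stand-in for a file that is too large to read at once.
big_path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(big_path, "w", encoding="utf-8") as f:
    f.write("the cat sat\non the mat\nthe end\n")

word_counts = Counter()
line_count = 0
with open(big_path, "r", encoding="utf-8") as f:
    for line in f:  # the file object yields one line at a time
        line_count += 1
        word_counts.update(line.split())  # update the running counts

print(line_count)
print(word_counts.most_common(1))
```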

### Regular Expressions
Regular expressions are provided by Python's `re` module. Important functions:
- re.match(pattern,text) : Matches `pattern` at the beginning of `text`. Returns a `match` object, or `None` if there is no match.

- re.search(pattern,text) : Scans `text` for the first location where `pattern` matches. Returns a `match` object, or `None` if there is no match.

- re.findall(pattern,text) : Finds all non-overlapping occurrences of `pattern` in `text`. Returns a Python list.

- re.sub(pattern,replacement,text) : Replaces all non-overlapping occurrences of `pattern` in `text` with `replacement`, scanning from left to right. Returns the resulting string. `replacement` can be a string or a function. If it is a function, it takes a single `match` object as input and returns the replacement string.

- re.split(pattern,text)  : Splits `text` at each occurrence of `pattern`. Returns a list of strings.

Reference : https://docs.python.org/3/library/re.html
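The five functions above can be compared on a single string. This is a small illustrative sketch; the sample sentence and variable names are made up for the example.

```python
import re

text = "Order 66 shipped; order 7 is pending."

at_start = re.match(r"\d+", text)       # None: the text does not begin with a digit
first = re.search(r"\d+", text)         # match object for the first number
all_numbers = re.findall(r"\d+", text)  # every number, as a list of strings
masked = re.sub(r"\d+", "#", text)      # string replacement
# Function replacement: receives a match object, returns the new string.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)
parts = re.split(r";\s*", text)         # split on a semicolon plus optional spaces

print(at_start)
print(first.group(), first.start())
print(all_numbers)
print(masked)
print(doubled)
print(parts)
```

Since `re.match` and `re.search` return `None` when nothing matches, it is safer to check the result before calling `.group()` on it.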

### Cleaning Data
#### Handling Numbers
- Using inflect 

Reference : https://pypi.org/project/inflect/


In [22]:
# Removing numbers using re
import re
text= "There are 12 items in a dozen, and 20 in a score."
modified_text = re.sub(r'\d+','',text)

print(text)
print(modified_text)

There are 12 items in a dozen, and 20 in a score.
There are  items in a dozen, and  in a score.


In [23]:

# Replacing numbers using inflect and re
import inflect
import re

text= "There are 12 items in a dozen, and 20 in a score."
inflect_engine = inflect.engine()

def replace_number(match_obj):
    matched_string = match_obj.group()
    return inflect_engine.number_to_words(matched_string)

modified_text = re.sub(r'\d+',replace_number,text)

print(text)
print(modified_text)

There are 12 items in a dozen, and 20 in a score.
There are twelve items in a dozen, and twenty in a score.


#### Handling Punctuation
- str.maketrans(x,y,z) : Returns a translation table to map chars in `x` to corresponding chars in `y` and chars in `z` to `None`.
- string.punctuation : Constant string containing all punctuation symbols.

In [28]:
import string

print('Python default punctuation: ', string.punctuation)

text = "Some [text], {;with? ra@ndom punctuation !"

translator = str.maketrans('','',string.punctuation)
modified_text = text.translate(translator)

print(text)
print(modified_text)

Python default punctuation:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Some [text], {;with? ra@ndom punctuation !
Some text with random punctuation 
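Deleting punctuation can glue neighbouring words together when there is no space around the symbol. A possible alternative, sketched here with a made-up sentence, uses the first two arguments of `str.maketrans` to map every punctuation character to a space instead.

```python
import string

text = "well-known fact:it works,mostly"

# Third argument only: punctuation is deleted, so words can merge.
delete_table = str.maketrans('', '', string.punctuation)
deleted = text.translate(delete_table)

# First two arguments: each punctuation character maps to a space.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = ' '.join(text.translate(space_table).split())  # also collapse repeated spaces

print(deleted)
print(spaced)
```

Which variant is preferable depends on the text: mapping to spaces keeps `fact:it` as two tokens but also splits hyphenated words such as `well-known`.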


#### Handling Whitespace

In [39]:
text = "   \t Some text with extra whitespace   "
modified_text = text.strip()

print('')
print(text)
print(modified_text)


   	 Some text with extra whitespace   
Some text with extra whitespace
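`str.strip()` removes only leading and trailing whitespace. If runs of whitespace inside the text should also be reduced to a single space, two common approaches are sketched below on a made-up string.

```python
import re

text = "  Some\t text   with \n extra   whitespace  "

# Replace every run of whitespace with one space, then trim the ends.
collapsed = re.sub(r"\s+", " ", text).strip()

# Equivalent result without re: split on whitespace and rejoin.
joined = " ".join(text.split())

print(collapsed)
print(joined)
```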


#### Handling Accented Characters
- Using unidecode

Reference : https://pypi.org/project/Unidecode/

In [41]:
import unidecode

text = "Would you like to have latté at our café?"
modified_text = unidecode.unidecode(text)

print(text)
print(modified_text)

Would you like to have latté at our café?
Would you like to have latte at our cafe?
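If installing unidecode is not an option, a rough standard-library alternative is to decompose characters with `unicodedata` and drop the combining marks. This is a sketch rather than a drop-in replacement, and the helper name `strip_accents` is made up for this example.

```python
import unicodedata

def strip_accents(s):
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", s)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

text = "Would you like to have latté at our café?"
modified_text = strip_accents(text)

print(text)
print(modified_text)
```

This approach handles only characters that decompose into a base letter plus marks. Letters without such a decomposition, for example `ø`, pass through unchanged.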


#### Tokenization
- str.split(sep=None) : Returns the list of tokens in `str` separated by `sep`. With no argument, splits on runs of whitespace.
- nltk : `word_tokenize` and `sent_tokenize` split text into words and sentences, respectively.

In [17]:
with open(DATA_PATH,'r') as F:
    text=F.read()
    tokens=text.split()
    print("Tokens : ",end='')
    print(tokens[:50])
    print("Number of tokens : ",len(tokens))

Tokens : ['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment,', 'by', 'Fyodor', 'Dostoevsky', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this']
Number of tokens :  206530
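The output above shows two side effects of plain `split()`: a byte order mark attached to the first token (`'\ufeffThe'`) and punctuation attached to words (`'Punishment,'`). A minimal sketch of one way to handle both is shown below. The regular expression is an illustrative choice, not a standard tokenizer, and the sample string is taken from the first line of the output.

```python
import re

text = "\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky"

naive_tokens = text.split()

# Drop a leading byte order mark. Opening the file with
# encoding='utf-8-sig' is another way to get rid of it.
cleaned = text.lstrip("\ufeff")

# Words made of letters and digits, optionally joined by hyphens or apostrophes.
regex_tokens = re.findall(r"\w+(?:[-']\w+)*", cleaned)

print(naive_tokens[0], naive_tokens[7])
print(regex_tokens)
```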


In [19]:
# Requires NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
from nltk import word_tokenize

with open(DATA_PATH,'r') as F:
    text=F.read()
    tokens=word_tokenize(text)
    print("Tokens : ",tokens[:50])
    print("Number of tokens : ",len(tokens))

In [50]:
with open(DATA_PATH,'r') as F:
    for line in F:
        print(line)
        break

﻿The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky

