# Text Mining 
### Information Extraction 

This notebook covers a simple but powerful tool in text processing: **Regex**. Regular expressions are extremely good at pulling out specific pieces of information that match a pattern.

While regex is powerful, it gets slow when you have to run many patterns of a similar type over a single document.

We will see how [FlashText](https://github.com/vi3k6i5/flashtext) helps speed this up by 5-10x. FlashText has seen a lot of love from the community. Adopters include [NLPre](https://github.com/NIHOPA/NLPre), the NLP preprocessing toolkit from the National Institutes of Health.

In [1]:
!python --version
__author__ = "nirant.bits@gmail.com"

Python 3.6.4 :: Anaconda, Inc.


# Spell Correction
One of the most frequently seen text challenges is correcting spellings. This is all the more true when data is entered by casual human users, for instance shipping addresses.

Let's take an example: we want to correct `Gujrat`, `Gujart` and other minor misspellings to `Gujarat`.
There are several good ways to do this, depending on your dataset and level of expertise. We discuss 2-3 popular ways along with their pros and cons.

Before we begin, we need to pay homage to the legendary [Peter Norvig's Spell Correct](https://norvig.com/spell-correct.html). It's still worth a read on how to _think_ about solving a problem and _exploring_ implementations. Even the way he refactors his code and writes functions is educational.

His spell correction module is not the simplest or best way. I recommend two packages: one with a bias towards simplicity, one with a bias towards giving you all the knives, bells and whistles to try:

**[FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy)** is easy to use. It gives a simple similarity score between two strings, from 0 to 100. Higher numbers mean the strings are more similar.

**[Jellyfish](https://github.com/jamesturk/jellyfish)** supports 6 edit distance functions and 4 phonetic encoding options which you can use as per your use case. 

## FuzzyWuzzy

If you have heard of or used the [Levenshtein distance](https://www.wikiwand.com/en/Levenshtein_distance) (or _edit distance_ functions in general), this package is a wrapper over the same. It also uses `difflib` from the Python standard library.
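
For intuition, here is a minimal standard-library sketch of the two ideas mentioned above: a plain dynamic-programming edit distance, and the `difflib.SequenceMatcher` ratio scaled to 0-100. The function names `levenshtein` and `similarity` are made up for illustration, and FuzzyWuzzy's own implementation differs in detail.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """difflib similarity scaled to an integer between 0 and 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


print(levenshtein("Gujrat", "Gujarat"))  # 1, a single missing 'a'
print(similarity("Gujrat", "Gujarat"))   # 92
```

A small edit distance and a high similarity score both say the same thing here: the two spellings are close.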

Let's see how we can use FuzzyWuzzy to correct our misspellings:

**Installation**

In [5]:
# import sys
# !{sys.executable} -m pip install fuzzywuzzy
# alternative for 4-10x faster computation: 
# !{sys.executable} -m pip install fuzzywuzzy[speedup]

# Collecting fuzzywuzzy
#   Downloading fuzzywuzzy-0.16.0-py2.py3-none-any.whl
# Installing collected packages: fuzzywuzzy
# Successfully installed fuzzywuzzy-0.16.0

Collecting python-levenshtein>=0.12; extra == "speedup" (from fuzzywuzzy[speedup])
  Downloading python-Levenshtein-0.12.0.tar.gz (48kB)
Building wheels for collected packages: python-levenshtein
  Running setup.py bdist_wheel for python-levenshtein: started
  Running setup.py bdist_wheel for python-levenshtein: finished with status 'done'
  Stored in directory: C:\Users\nirantk\AppData\Local\pip\Cache\wheels\c0\83\e9\b2cc2876e175d04091caf4e9f5de564ff2503b1f1885e7c3ba
Successfully built python-levenshtein
Installing collected packages: python-levenshtein
Successfully installed python-levenshtein-0.12.0


In [6]:
from fuzzywuzzy import fuzz

In [11]:
fuzz.ratio("Electronic City Phase One", "Electronic City Phase One, Bangalore")

82

In [12]:
fuzz.partial_ratio("Electronic City Phase One", "Electronic City Phase One, Bangalore")

100

We can see how the `ratio` function is confused by the trailing “Bangalore” in the address above, even though the two strings refer to the same address/entity. This is captured by `partial_ratio`.

Note that both `ratio` and `partial_ratio` are sensitive to the ordering of the words. This is useful for comparing addresses, which follow a rough logical order. On the other hand, if we want to compare something else, like person names, they might give counterintuitive results:
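
One way to picture the difference is the rough sketch below, which assumes that a partial match means "the best score of the shorter string against any same-length window of the longer one". The helper names `ratio` and `partial_ratio_sketch` are hypothetical, and the real `partial_ratio` picks its candidate windows more cleverly, so treat this as an approximation only.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Whole-string similarity, scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def partial_ratio_sketch(a, b):
    """Slide the shorter string over the longer one and keep the best score."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    width = len(shorter)
    return max(
        ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


short = "Electronic City Phase One"
full = "Electronic City Phase One, Bangalore"
print(ratio(short, full))                 # 82, penalised for the extra suffix
print(partial_ratio_sketch(short, full))  # 100, the first window matches exactly
```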



In [17]:
fuzz.ratio('Narendra Modi', 'Narendra D. Modi')

90

In [18]:
fuzz.partial_ratio('Narendra Modi', 'Narendra D. Modi')

77

Arrgh, this is not nice. Just because of an extra `D.` token, the scores drop even though both strings name the same person. We want something that is less sensitive to extra tokens and to word order. Luckily, the authors of fuzzywuzzy kept this in mind.

They support functions which lowercase the input, strip punctuation and non-ASCII characters, and split it into tokens on whitespace. Similarity is then computed on these tokens instead of the raw strings. Let's try that out:

In [19]:
fuzz.token_sort_ratio('Narendra Modi', 'Narendra D. Modi')

93

In [20]:
fuzz.token_set_ratio('Narendra Modi', 'Narendra D. Modi')

100
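
To see roughly why the token-based scores come out higher, here is a simplified sketch built on `difflib`. The helpers `tokens`, `token_sort_sketch` and `token_set_sketch` are hypothetical names, and the cleaning step assumes ASCII input; the library's real preprocessing and scoring are more careful.

```python
import re
from difflib import SequenceMatcher


def ratio(a, b):
    """Whole-string similarity, scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def tokens(text):
    """Lowercase, turn anything that is not a letter or digit into a space, then split."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def token_sort_sketch(a, b):
    """Compare the strings after sorting their tokens alphabetically."""
    return ratio(" ".join(sorted(tokens(a))), " ".join(sorted(tokens(b))))


def token_set_sketch(a, b):
    """Compare the shared tokens against each string's full token set."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    common = " ".join(sorted(set_a & set_b))
    with_a = (common + " " + " ".join(sorted(set_a - set_b))).strip()
    with_b = (common + " " + " ".join(sorted(set_b - set_a))).strip()
    return max(ratio(common, with_a), ratio(common, with_b), ratio(with_a, with_b))


print(token_sort_sketch('Narendra Modi', 'Narendra D. Modi'))  # 93
print(token_set_sketch('Narendra Modi', 'Narendra D. Modi'))   # 100
```

The set variant scores 100 because every token of the shorter name also appears in the longer one, so the extra `d` is effectively ignored.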

Nice, this works perfectly for us. In case we have a list of options and we want to find the closest match(es), we can use the process module:

In [22]:
from fuzzywuzzy import process

In [25]:
query = 'Gujrat'
choices = ['Gujarat', 'Gujjar', 'Gujarat Govt.']
# Get a list of matches ordered by score, default limit to 5
print(process.extract(query, choices))

# If we want only the top one
process.extractOne(query, choices)

[('Gujarat', 92), ('Gujarat Govt.', 75), ('Gujjar', 67)]


('Gujarat', 92)

In [27]:
query = 'Banglore'
choices = ['Bangalore', 'Bengaluru']
print(process.extract(query, choices))
process.extractOne(query, choices)

[('Bangalore', 94), ('Bengaluru', 59)]


('Bangalore', 94)

In [29]:
# Let's take an example of a common search typo in online shopping:
query = 'chili'
choices = ['chilli', 'chilled', 'chilling']
print(process.extract(query, choices))
process.extractOne(query, choices)

[('chilli', 91), ('chilling', 77), ('chilled', 67)]


('chilli', 91)
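
Conceptually, ranking choices is just "score every choice, sort, keep the top few". The sketch below does that with plain `difflib`; `extract_sketch` is a hypothetical helper, not the library's code. `process.extract` uses a weighted scorer by default, so scores for the lower-ranked choices can differ from the outputs above.

```python
from difflib import SequenceMatcher


def extract_sketch(query, choices, limit=5):
    """Score every choice against the query and return the best `limit` pairs."""
    scored = [
        (choice, int(round(100 * SequenceMatcher(None, query, choice).ratio())))
        for choice in choices
    ]
    # Highest score first; sorted() is stable, so ties keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


print(extract_sketch('Gujrat', ['Gujarat', 'Gujjar', 'Gujarat Govt.'])[0])  # ('Gujarat', 92)
print(extract_sketch('Banglore', ['Bangalore', 'Bengaluru'])[0])            # ('Bangalore', 94)
```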

In [2]:
import re

## [Basic] Information Extraction

As an example task, consider the challenge of automating Amazon Retail's customer service email response. We should be able to find the following attributes or mark them as missing with high confidence:

- Order Id
- Dates (such as Shopping Date, Order Delivery) 
- Any `$` amounts 

Please note that I don't have any relation to Amazon other than shopping from there. 

Let's consider the following totally imagined complaint email from me to Jeff Bezos, the CEO of Amazon:


In [32]:
complaint_email = """Hello Jeff,

I am Nirant, a loyal Amazon in first customer for months now. I am a huge fan of Kindle as well. 
I am stuck in a new city without a phone thanks to a sequence of problems - and are now compounded by Amazon's inhumane behaviour.

The particular issues I am facing: My new phone bought from Amazon stopped working. What did I do? Requested a replacement on Jul 23
- First Issue: The system did not allow a pick up on July 23 forcing a delay of more than a day to 24 July 8:00 - 11:00 AM
- Second Issue: Despite requesting the customer service on chat THRICE, the pickup is delayed to July 24 8:00 - 11:00 AM
- Third Issue: The pickup is rescheduled without any reason!

Is this how you want Amazon to be world's most customer centric company?

Here is how Amazon can help me:
- Pick up the order as urgently as possible
- Deliver the phone on a priority basis on Monday i.e. July 25 itself

Here are the order numbers for reference: 
ORDER # 402-4870778-5154753 and ORDER # 404-8689779-9721113

Here is my phone number: +91 7737887058

I am stuck in a new city, where I don't know the language or directions without a working phone. I would really appreciate it if you could help in anyway. 

Regards,
Nirant Kasliwal"""

Yikes, that is a lot of text. 

**The information to pull from this is (1) dates + times, (2) the phone number and (3) order numbers**. Let's figure out how to do that.
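
As a quick preview of the idea, here is a minimal sketch for the two most regular items. It assumes that order numbers always look like 3-7-7 digit groups and that the phone number is a `+` country code followed by ten digits, as in the sample email. Both patterns and the `sample` string are illustrative assumptions, not a general-purpose extractor.

```python
import re

# Two lines copied from the complaint email above.
sample = (
    "ORDER # 402-4870778-5154753 and ORDER # 404-8689779-9721113\n"
    "Here is my phone number: +91 7737887058"
)

order_pattern = re.compile(r"\d{3}-\d{7}-\d{7}")   # assumed order-id shape
phone_pattern = re.compile(r"\+\d{1,3}\s\d{10}")   # assumed phone shape

print(order_pattern.findall(sample))  # ['402-4870778-5154753', '404-8689779-9721113']
print(phone_pattern.findall(sample))  # ['+91 7737887058']
```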

### Extract Date and Times

If you are new to regex, consider reading the amazing [HOWTO on Python Regex](https://docs.python.org/3/howto/regex.html) and then coming back here. Let's warm up our regex muscles a bit: 

In [33]:
p = re.compile(r'\d+')
p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')

['12', '11', '10']

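`re.compile` turns the pattern string into a reusable pattern object, which saves re-parsing the pattern when we apply it many times. `findall` scans the string from left to right and returns every non-overlapping match as a list of strings.

In the pattern, `\d` matches a single digit and `+` means "one or more of the preceding item", so `\d+` matches each whole run of digits: `12`, `11` and `10`.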
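
A small experiment makes the pieces concrete. It reuses the sentence from the cell above; the variable names are only for illustration.

```python
import re

text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# Without '+', every digit is its own match.
print(re.findall(r'\d', text))   # ['1', '2', '1', '1', '1', '0']

# With '+', each run of digits is one match.
print(re.findall(r'\d+', text))  # ['12', '11', '10']

# A compiled pattern can be reused; search() returns only the first match.
digits = re.compile(r'\d+')
first = digits.search(text)
print(first.group(), first.span())  # 12 (0, 2)
```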

In [40]:
%%time
date_pattern = r"^(Jan|Feb|Mar|Apr|May|Jun|July|Aug|Sep|Oct|Nov|Dec)$"
p = re.compile(date_pattern)

Wall time: 0 ns


In [41]:
p.findall(complaint_email)

[]
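
The empty list is expected with this pattern: `^` and `$` anchor the match to the start and end of the whole string, so it can only match if the entire email is a single month name. The sketch below contrasts the anchored pattern with one possible unanchored variant. The variant is an assumption of mine, not necessarily where this notebook is headed: it uses `Jul` plus optional trailing letters so that both `Jul` and `July` match, followed by a day number. It only handles "month day" order and will over-match words such as "Decided 12".

```python
import re

# Two lines in the style of the complaint email.
sample = (
    "Requested a replacement on Jul 23\n"
    "The pickup is delayed to July 24 8:00 - 11:00 AM"
)

anchored = re.compile(r"^(Jan|Feb|Mar|Apr|May|Jun|July|Aug|Sep|Oct|Nov|Dec)$")
print(anchored.findall(sample))  # []

# Month abbreviation, optional rest of the month name, whitespace, 1-2 digit day.
month = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
unanchored = re.compile(r"\b" + month + r"\s+\d{1,2}\b")
print(unanchored.findall(sample))  # ['Jul 23', 'July 24']
```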