## What we are going to cover

<ul>
    <li>Text Processing with standard Python libraries</li>
    <li>Regular Expressions</li>
    <li>Basics of NLP - Text Processing with the spaCy library</li>
    <li>Exploratory Data Analysis</li>
    <li>Sentence Similarity via Vectorization</li>
    <li>Text Generation</li>
</ul>

In [3]:
with open('cv000_29590.txt') as f:
    text = f.read()

In [4]:
type(text)

str

In [5]:
lines = text.split('\n')

In [6]:
sentence = lines[0]

In [7]:
sentence

"films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . "

In [8]:
sentence.split()

['films',
 'adapted',
 'from',
 'comic',
 'books',
 'have',
 'had',
 'plenty',
 'of',
 'success',
 ',',
 'whether',
 "they're",
 'about',
 'superheroes',
 '(',
 'batman',
 ',',
 'superman',
 ',',
 'spawn',
 ')',
 ',',
 'or',
 'geared',
 'toward',
 'kids',
 '(',
 'casper',
 ')',
 'or',
 'the',
 'arthouse',
 'crowd',
 '(',
 'ghost',
 'world',
 ')',
 ',',
 'but',
 "there's",
 'never',
 'really',
 'been',
 'a',
 'comic',
 'book',
 'like',
 'from',
 'hell',
 'before',
 '.']

In [2]:
"Batman"[0].isupper()

True

In [14]:
"Batm@n".isalnum()

False

In [8]:
"12345".isnumeric()

True

In [10]:
"superman".capitalize()

'Superman'

In [13]:
"-".join(["First", "Second", "Third"])

'First-Second-Third'
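
The string methods above can be combined to clean a token list. The following is a minimal sketch, assuming a shortened copy of the first review sentence; the `sample` string and the variable names are illustrative and not part of the notebook.

```python
# Shortened, illustrative copy of the review's first sentence.
sample = "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn )"

tokens = sample.split()

# Keep only purely alphabetic tokens. This drops punctuation,
# but it also drops contractions such as "they're".
alpha_tokens = [t for t in tokens if t.isalpha()]

# Capitalize each surviving token and join them back into one string.
title_line = " ".join(t.capitalize() for t in alpha_tokens)

print(alpha_tokens)
print(title_line)
```

Filtering with `isalpha()` is a blunt tool: it removes the contraction along with the punctuation, which is one reason to look for a more flexible approach.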

In [11]:
words = []
for s in lines:
    words.extend(s.split(" "))
len(words)

828

In [14]:
count = {}
for word in words:
    if word in count:
        count[word]+=1
    else:
        count[word] = 1

In [15]:
count

{'films': 1,
 'adapted': 1,
 'from': 8,
 'comic': 5,
 'books': 1,
 'have': 2,
 'had': 3,
 'plenty': 1,
 'of': 14,
 'success': 1,
 ',': 43,
 'whether': 1,
 "they're": 1,
 'about': 4,
 'superheroes': 1,
 '(': 18,
 'batman': 1,
 'superman': 1,
 'spawn': 1,
 ')': 18,
 'or': 3,
 'geared': 1,
 'toward': 1,
 'kids': 1,
 'casper': 1,
 'the': 46,
 'arthouse': 1,
 'crowd': 1,
 'ghost': 1,
 'world': 2,
 'but': 7,
 "there's": 1,
 'never': 2,
 'really': 2,
 'been': 3,
 'a': 15,
 'book': 3,
 'like': 4,
 'hell': 2,
 'before': 1,
 '.': 23,
 '': 26,
 'for': 3,
 'starters': 1,
 'it': 6,
 'was': 1,
 'created': 1,
 'by': 1,
 'alan': 1,
 'moore': 3,
 'and': 20,
 'eddie': 1,
 'campbell': 3,
 'who': 4,
 'brought': 1,
 'medium': 1,
 'to': 15,
 'whole': 2,
 'new': 1,
 'level': 1,
 'in': 18,
 'mid': 1,
 "'80s": 1,
 'with': 4,
 '12-part': 1,
 'series': 1,
 'called': 2,
 'watchmen': 1,
 'say': 4,
 'thoroughly': 1,
 'researched': 1,
 'subject': 1,
 'jack': 2,
 'ripper': 3,
 'would': 1,
 'be': 3,
 'saying': 1,
 'mi
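
The `''` key in the output above appears because `split(" ")` returns an empty string for every repeated or trailing space and for every empty line. A minimal sketch of an alternative, assuming a made-up `sample` string rather than the review file, uses `str.split()` with no argument and `collections.Counter`:

```python
from collections import Counter

# Illustrative text; the double space mimics the spacing quirks
# that produce the '' key when splitting on a single space.
sample = "the film is based on the comic  and the comic is dark"

# split(" ") keeps an empty string between consecutive spaces ...
naive_words = sample.split(" ")
# ... while split() with no argument collapses runs of whitespace.
clean_words = sample.split()

# Counter does the same bookkeeping as the manual dictionary loop.
naive_count = Counter(naive_words)
clean_count = Counter(clean_words)

print(naive_count[""])             # how many empty-string "words" were counted
print(clean_count.most_common(3))  # the three most frequent words
```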

In [17]:
sent = "This is the first sentence. This is another sentence. This is the third sentence. This is the last sentence"
sent.split(". ")

['This is the first sentence',
 'This is another sentence',
 'This is the third sentence',
 'This is the last sentence']

## Can we do better? Regular Expressions to the Rescue

In [18]:
import re

### This module provides regular expression matching operations.

Below is a list of expressions and what they match.

| Expression | Matches                                  |
| ---------- | ---------------------------------------- |
| `abc…`     | Literal letters                          |
| `123…`     | Literal digits                           |
| `\d`       | Any digit                                |
| `\D`       | Any non-digit character                  |
| `.`        | Any character (except newline by default) |
| `\.`       | A literal period                         |
| `[abc]`    | Only a, b, or c                          |
| `[^abc]`   | Not a, b, nor c                          |
| `[a-z]`    | Characters a to z                        |
| `[0-9]`    | Digits 0 to 9                            |
| `\w`       | Any alphanumeric character or underscore |
| `\W`       | Any character not matched by `\w`        |
| `{m}`      | Exactly m repetitions                    |
| `{m,n}`    | m to n repetitions                       |
| `*`        | Zero or more repetitions                 |
| `+`        | One or more repetitions                  |
| `?`        | Optional (zero or one occurrence)        |
| `\s`       | Any whitespace                           |
| `\S`       | Any non-whitespace character             |
| `^…$`      | Start and end of the string              |
| `(…)`      | Capture group                            |
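
A few rows of the table can be tried out directly. This is a small sketch with made-up input strings; the variable names are illustrative only.

```python
import re

sample = "Order 66 shipped on 2021-03-04"

digits = re.findall(r"\d+", sample)                 # runs of one or more digits
chunks = re.findall(r"\S+", sample)                 # runs of non-whitespace
dates = re.findall(r"\d{4}-\d{2}-\d{2}", sample)    # {m}: exact repetition counts
vowels = re.findall(r"[aeiou]", "regex")            # character class
spellings = re.findall(r"colou?r", "color colour")  # ? makes the 'u' optional
all_digits = bool(re.match(r"^\d+$", "12345"))      # ^...$ spans the whole string

print(digits, chunks, dates, vowels, spellings, all_digits)
```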


In [19]:
re.findall(r'\w+', sentence)

['films',
 'adapted',
 'from',
 'comic',
 'books',
 'have',
 'had',
 'plenty',
 'of',
 'success',
 'whether',
 'they',
 're',
 'about',
 'superheroes',
 'batman',
 'superman',
 'spawn',
 'or',
 'geared',
 'toward',
 'kids',
 'casper',
 'or',
 'the',
 'arthouse',
 'crowd',
 'ghost',
 'world',
 'but',
 'there',
 's',
 'never',
 'really',
 'been',
 'a',
 'comic',
 'book',
 'like',
 'from',
 'hell',
 'before']

In [20]:
re.split('[?.!] ', sent)

['This is the first sentence',
 'This is another sentence',
 'This is the third sentence',
 'This is the last sentence']
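
Splitting on `[?.!] ` discards the punctuation mark together with the space. If the terminator should stay attached to its sentence, one option is a lookbehind, which checks for the punctuation without consuming it. A minimal sketch with an illustrative `sample` string:

```python
import re

sample = "Is this a sentence? Yes it is! Here is another one. And the last"

# Same pattern as in the cell above: the punctuation is consumed by the split.
dropped = re.split(r"[?.!] ", sample)

# The lookbehind (?<=...) only asserts that punctuation precedes the space,
# so each sentence keeps its terminator.
kept = re.split(r"(?<=[?.!]) ", sample)

print(dropped)
print(kept)
```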

In [21]:
sent = "My phone number is +1-972-1234567. Indian number is +91-987654321"
phone = re.findall(r"[+\-0-9]+", sent)
phone

['+1-972-1234567', '+91-987654321']

In [23]:
# Exactly six digits, bounded so that longer digit runs are not partially matched.
r_otp = r"\b[0-9]{6}\b"
text = "Your otp to login to xyz app is 567846. Go to the following link, https://xyz.co/34567"
otp = re.findall(r_otp, text)
otp

['567846']

### Groups

Groups of text show up everywhere.
<ul>
    <li>Names</li>
    <li>Phone Numbers</li>
    <li>Noun Phrases - "The" `<adjective>+` `<noun>` - for example, "The funny man"</li>
</ul>
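
The noun-phrase pattern can be sketched with a regular expression over part-of-speech tags instead of characters. Everything below is an assumption for illustration: the tiny `LEXICON`, the one-letter tag codes, and the `noun_phrases` helper are hypothetical, and a real pipeline would take its tags from an NLP library.

```python
import re

# Hypothetical, hand-written part-of-speech lexicon (illustration only).
LEXICON = {"the": "DET", "funny": "ADJ", "old": "ADJ",
           "man": "NOUN", "dog": "NOUN", "laughed": "VERB"}

# One letter per tag so that word positions line up with string positions.
CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

def noun_phrases(sentence):
    """Return phrases shaped like "The" <adjective>+ <noun>."""
    words = sentence.lower().split()
    # Unknown words and tags outside CODES are encoded as "x".
    tags = "".join(CODES.get(LEXICON.get(w), "x") for w in words)
    # D A+ N: a determiner, one or more adjectives, then a noun.
    return [" ".join(words[m.start():m.end()])
            for m in re.finditer(r"DA+N", tags)]

print(noun_phrases("The funny old man laughed"))
print(noun_phrases("The dog laughed"))
```

Because `A+` requires at least one adjective, "The dog" is not reported as a match.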

In [24]:
p = phone[0]
print(re.match("(?P<country_code>[+0-9]*)-(?P<area_code>[0-9]*)-(?P<number>[0-9]*)", p).groupdict())

{'country_code': '+1', 'area_code': '972', 'number': '1234567'}


## More complicated patterns - Email IDs, URLs, etc
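
As a rough sketch of what such patterns might look like, the two expressions below are deliberately simplified assumptions. They do not implement the full email or URL specifications, and the `sample` string is made up.

```python
import re

# Simplified patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"https?://[^\s,]+")

sample = "Write to jane.doe@example.com or visit https://xyz.co/34567, thanks."

emails = EMAIL.findall(sample)
urls = URL.findall(sample)

print(emails)
print(urls)
```

The non-capturing group `(?:…)` matters here: with a plain capture group, `findall` would return only the group contents rather than the whole address.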

## Fun Exercise

Building a regular expression to test the validity of a password

A valid password
<ul>
    <li>must contain at least one digit</li>
    <li>must contain at least one special symbol from [@$!%*#?&amp;]</li>
    <li>must contain at least one uppercase character</li>
    <li>must contain at least one lowercase character</li>
    <li>must be at least 6 and at most 20 characters long</li>
</ul>

In [25]:
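def is_valid(p):
    # Lookaheads require a lowercase letter, an uppercase letter, a digit, and a
    # special symbol; the final class enforces the allowed characters and length.
    regex = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*#?&])[A-Za-z\d@$!#%*?&]{6,20}")
    # fullmatch rejects passwords longer than 20 characters, which match() would accept.
    return regex.fullmatch(p) is not None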

passwords = ["Regex123", "Regex@123", "Rr@12"]
for p in passwords:
    print(is_valid(p))

False
True
False
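
A single regular expression only answers yes or no. When it is useful to know which rule failed, the same checks can be written one at a time with plain string operations. This is a sketch under assumptions: the `failed_rules` helper and the rule names are hypothetical, the symbol set is copied from the regex in `is_valid`, and letters and digits are restricted to ASCII.

```python
import string

SPECIAL = set("@$!%*#?&")           # symbol set used by the regex in is_valid
LOWER = set(string.ascii_lowercase)
UPPER = set(string.ascii_uppercase)
DIGITS = set(string.digits)
ALLOWED = SPECIAL | LOWER | UPPER | DIGITS

def failed_rules(password):
    """Return the names of the rules that the password breaks."""
    checks = {
        "lowercase": any(c in LOWER for c in password),
        "uppercase": any(c in UPPER for c in password),
        "digit": any(c in DIGITS for c in password),
        "special": any(c in SPECIAL for c in password),
        "length": 6 <= len(password) <= 20,
        "allowed characters": all(c in ALLOWED for c in password),
    }
    return [name for name, ok in checks.items() if not ok]

for p in ["Regex123", "Regex@123", "Rr@12"]:
    print(p, failed_rules(p))
```

An empty list means the password satisfies every rule.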
