# NLP Overview & RegEx Lessons

23 February 2023

- corpus : whole dataset  
    
- document : one observation (row)
                               
- tokenization : breaking down into tokens  
                               
- stemming and lemmatising : reducing words to a base form (a stem or a lemma)  
    "Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma."
    
- stopwords : articles ; common words that (usually) don't add value  

- ngrams : sequences of n consecutive words, which carry more meaning than single words (eg, the musical group 'the humans', vs the generic term 'humans').  

- POS : part of speech


DATA PREP   

- casing : almost always lowercase the text   
- tokenisation : splitting a phrase into tokens at whitespace and punctuation (including punctuation attached to the start or end of a word)  
- stopword removal ('and', 'the', 'a', 'of', etc)   
- stemming : chop off suffixes from words  
- lemmatising : map each word to its dictionary form (drove / drives --> drive).  
- use either a stemmer or a lemmatiser, not both (the stemmer is the quicker process).  
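
The prep steps above can be sketched in plain Python. This is only an illustration : the stopword set is a tiny made-up list, and `crude_stem` is a hypothetical toy suffix chopper, not a real stemming algorithm (a proper NLP library would normally do this work).

```python
import re

# Hypothetical, deliberately tiny stopword list ; real projects use a fuller one.
STOPWORDS = {"and", "the", "a", "of", "in", "to"}

def basic_clean(text):
    """Lowercase, tokenise on runs of letters / digits / apostrophes, drop stopwords."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z0-9']+", lowered)
    return [tok for tok in tokens if tok not in STOPWORDS]

def crude_stem(token):
    """Toy suffix chopper for illustration only."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = basic_clean("The Humans played in the streets of Verona.")
stems = [crude_stem(tok) for tok in tokens]

print(tokens)   # ['humans', 'played', 'streets', 'verona']
print(stems)    # ['human', 'played', 'street', 'verona']
```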


EXPLORATION  

- wordcount, document length, common words / ngrams, wordclouds  
- wordclouds are hard to treat as data analysis : what's the scale ? how accurately does the image reflect the counts ?


MODELLING 
- how to represent a document as numbers ?
    - count vectorisation (bag of words) : how many times each word appears in each document (one row per document, one column per word)   
    - TF-IDF : 'term frequency' (how often a term appears in a single document) weighted by 'inverse document frequency' (how rare the term is across all the documents in the corpus)  
        - then use a ML model to predict the text class
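
A rough sketch of both representations, using only the standard library. The three documents are made up, and the weighting is the plain `tf * log(N / df)` form ; ML libraries usually apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Made-up mini corpus : each string is one document.
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]
tokenised = [doc.split() for doc in docs]

# Bag of words : one Counter of raw term counts per document.
bags = [Counter(tokens) for tokens in tokenised]

def tf_idf(term, doc_index):
    """Plain tf * log(N / df). Assumes the term occurs in at least one document."""
    tf = bags[doc_index][term] / len(tokenised[doc_index])
    df = sum(1 for bag in bags if term in bag)
    return tf * math.log(len(docs) / df)

the_count = bags[0]["the"]
the_score = tf_idf("the", 0)
cat_score = tf_idf("cat", 0)

print(the_count)               # 2
print(cat_score > the_score)   # True
```

'the' appears twice in the first document but also shows up in another document, so it ends up with a lower weight than 'cat', which appears once and nowhere else.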

# REGEX

### RegEx library functions

```re.search``` : scans through a string, looks for any location where the RE matches.  
```re.findall``` : finds all substrings where the RE matches ; returns a list.  
```re.split``` : splits a string on a given RE pattern, removing that pattern ; returns a list of strings.  
```re.sub``` : matches a regex and subs in a new substring for the match.  
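
A quick side-by-side of the four functions on one made-up sentence (the phone numbers are invented for the example) :

```python
import re

text = "Call 555-1234 or 555-9876 before 5pm."
phone = r"\d{3}-\d{4}"

first = re.search(phone, text)              # Match object for the first hit only
all_hits = re.findall(phone, text)          # list of every matching substring
parts = re.split(r"\s+or\s+", text)         # the ' or ' delimiter is removed
masked = re.sub(phone, "<phone>", text)     # every match replaced

print(first.group())   # 555-1234
print(all_hits)        # ['555-1234', '555-9876']
print(parts)           # ['Call 555-1234', '555-9876 before 5pm.']
print(masked)          # Call <phone> or <phone> before 5pm.
```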


RegEx syntax carries over to most other programming languages, with only minor differences.    
If parsing HTML, JSON or XML, use a tool built for those formats.  
If the language already has a built-in for the job (eg, ```str.split```), use that instead of writing a RE.  
 
RegEx metacharacters ```. ^ $ * + ? { } [ ] \ | ( )``` have special meanings.  
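Most metacharacters lose their special meaning inside the character class square brackets ```[]``` ; the exceptions are ```^``` (at the start), ```-``` (between two chars), ```]``` and ```\```.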

```r"[a-z]"``` : r = 'raw' ; matches lowercase a thru z.   
Actual character meaning :  
```\+``` finds a literal '+'.  
```\d``` = [0-9].  
```\D``` = [^0-9].  
```\s``` = any whitespace char (space, tab, newline, etc).  
```\w``` = any alphanumeric char and underscore . = [0-9a-zA-Z_]  
```\W``` = anything non-alphanumeric . = [^0-9a-zA-Z_]  
```*``` = zero or more of the previous pattern.  
```+``` = matches one or more of the previous pattern.  
```?``` = zero or one of the previous pattern (ie, the previous pattern is optional).  
ANCHORS   
```^``` start  
```$``` end  
```\b``` word boundary    
GROUPS  
```(a)``` : groups part of a pattern and captures what it matched.    
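
A few of the entries above, tried on made-up strings. The sketch shows the difference between an unescaped and an escaped ```.```, metachars behaving as literals inside ```[]```, and ```?``` making a char optional. ```re.escape``` is the standard library helper for turning arbitrary text into a literal pattern.

```python
import re

# Unescaped '.' matches any char, so '3x50' matches too.
any_char = re.findall(r"\d.\d\d", "3.50 3x50")
# Escaped '\.' matches only a literal full stop.
literal_dot = re.findall(r"\d\.\d\d", "3.50 3x50")
# Inside a character class, '.', '(' and ')' are plain literals.
in_class = re.findall(r"[.()]", "Total: 3.50 (approx.)")
# '?' makes the preceding 'u' optional.
optional_u = re.findall(r"colou?r", "color colour")
# re.escape backslash-escapes the metacharacters in a string.
escaped = re.escape("a+b")

print(any_char)      # ['3.50', '3x50']
print(literal_dot)   # ['3.50']
print(in_class)      # ['.', '(', '.', ')']
print(optional_u)    # ['color', 'colour']
print(escaped)       # a\+b
```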




In [1]:
import pandas as pd
import re

## Matching literal expressions

In [2]:
string = 'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'

string

'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'

In [3]:
# re.search (r'pattern', 'our subject') : literal match of string 'Verona'

re.search(r'Verona', string)

<re.Match object; span=(47, 53), match='Verona'>

In [4]:
# string at location indicated

string[47:53]

'Verona'

In [5]:
# literal match of a multi-word phrase

re.search(r'In fair Verona', string)

<re.Match object; span=(39, 53), match='In fair Verona'>

In [6]:
# returns None, bc 'Leonardo' is not in the string

re.search(r'Leonardo', string)

In [7]:
# returns first and only first match

re.search(r'civil', string)

<re.Match object; span=(126, 131), match='civil'>

In [8]:
# every match with .findall

re.findall(r'civil', string)

['civil', 'civil']

In [9]:
print(re.findall(r'Two', string))
print(re.search(r'Two', string))

['Two']
<re.Match object; span=(0, 3), match='Two'>


In [10]:
re.findall(r'Leonardo', string)
# finds nothing, returns empty list

[]

In [11]:
print(re.search(r'Two', string))
print(re.search(r'two', string))
# None for 'two', bc matching is case-sensitive by default

<re.Match object; span=(0, 3), match='Two'>
None


In [12]:
# re.IGNORECASE ignores the case

print(re.search(r'two', string, re.IGNORECASE))

<re.Match object; span=(0, 3), match='Two'>


In [13]:
re.search(r'Aaaaaaa', 'aaaaaaa', re.IGNORECASE)
#searching for 'Aaaaaaa' in 'aaaaaaa'

<re.Match object; span=(0, 7), match='aaaaaaa'>

### | = 'or'



In [14]:
# or

re.findall(r'grey|gray', 'How does one spell gray ? Or is it grey or gray ?')

['gray', 'grey', 'gray']

In [15]:
# or : returns whichever alternative appears first in the subject string, not the first one listed in the pattern. 'apple' also matches inside the plural 'apples'.


re.search(r'orange|apple', 'I like both apples and oranges.')

<re.Match object; span=(12, 17), match='apple'>

In [16]:
# returns all

re.findall(r'this|that', 'this, that, this, that')

['this', 'that', 'this', 'that']

In [17]:
# has a vowel, anywhere
# only returns first instance in the subject

re.search(r'[aeiou]', 'banana, grape')

<re.Match object; span=(1, 2), match='a'>

In [18]:
# has a vowel, anywhere
# only returns first instance in the subject

re.search(r'a|e|i|o|u', 'banana, grape')

<re.Match object; span=(1, 2), match='a'>

In [19]:
# ^ starts with
# . any char
# * zero or more of whatever is before it. Here, the '.'  = (b followed by any char 0 or more times)

re.search(r'^b.*', 'bananarama is my jam')

# the pattern describes the entire phrase

<re.Match object; span=(0, 20), match='bananarama is my jam'>

In [20]:
print(re.search(r'^b.', 'b'))
# returns None, bc '.' requires one char after 'b' and there is none

# cf
print(re.search(r'^b.*', 'b'))

None
<re.Match object; span=(0, 1), match='b'>


In [21]:
# [^b.*] is a negated char class : any single char that is not 'b', '.' or '*'
# here, that is the 'a' right after the leading b

re.search(r'[^b.*]', 'bananarama is my jam')

<re.Match object; span=(1, 2), match='a'>

In [22]:
# nothing , bc, in this case, the string must start with the letter b

re.search(r'^b.*', 'my jam is bananarama')

### On to ```+```

In [23]:
# "+" = needs 1 or more instances of something after the b

re.search(r'^b.+', 'b')

# returns nothing, bc nothing follows 'b'

In [24]:
# "*" = accepts 0 or more instances of something after the b

re.search(r'^b.*', 'b')

# returns 'b', bc anything could follow 'b'

<re.Match object; span=(0, 1), match='b'>

In [25]:
# .* finds the largest-possible match : a greedy pattern that takes in all following the requirement before .*
 
re.search(r'^b.*', 'banamana daelkjjjdaf')

<re.Match object; span=(0, 20), match='banamana daelkjjjdaf'>
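
Greedy matching matters most when the pattern has something to stop at. A small sketch on a made-up sentence : adding ```?``` after ```*``` makes the quantifier lazy, so it takes the shortest possible match instead of the longest.

```python
import re

quoted = 'say "hi" and "bye"'

# Greedy : .* runs to the LAST closing quote.
greedy = re.search(r'".*"', quoted).group()
# Lazy : .*? stops at the FIRST closing quote.
lazy = re.search(r'".*?"', quoted).group()

print(greedy)   # "hi" and "bye"
print(lazy)     # "hi"
```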

In [26]:
# match b then 1 or more alphanumerics for a word
# \w means any a-zA-Z0-9_
# + means 1 or more of those word chars

re.search(r'^b\w+', 'banamana daelkjjjdaf')

# when it hit the ' ' before dael..., it stopped, bc not indicated in the RE

<re.Match object; span=(0, 8), match='banamana'>

In [27]:
re.search(r'^b\w+', 'cest chouette')

# returns nothing, bc there's no b

In [28]:
re.search(r'^b\w+', 'b chouette')

# returns nothing, bc there's a space after b

In [29]:
re.search(r'^b\w*', 'b chouette')

# returns b, bc anything is permitted after b

<re.Match object; span=(0, 1), match='b'>

In [30]:
re.search(r'^b.', 'b chouette')

# grabs 1 char after the b, bc no repetition requested in the RE

<re.Match object; span=(0, 2), match='b '>

In [31]:
re.search(r'b\w*\b', 'bddddd chouette b')

# b followed by word chars, up to the first word boundary

<re.Match object; span=(0, 6), match='bddddd'>

In [32]:
re.search(r'b\w\b', 'bddddd chouette b')

# nothing, bc no b is followed by exactly one word char and then a word boundary

In [33]:
# matches b, then any other char

re.search(r'b.', 'bddddd chouette b')

<re.Match object; span=(0, 2), match='bd'>

In [34]:
re.search(r'b\w', 'bddddd chouette b')

# b followed by any single word char (letter, digit or underscore)

<re.Match object; span=(0, 2), match='bd'>

In [35]:
re.search(r'b\w\w\w\w\w\D', 'bddddd chouette b')

<re.Match object; span=(0, 7), match='bddddd '>

In [36]:
re.search(r'b\w{3}', 'bddddd chouette b')

# the 3 alphanumeric chars following the 'b'

<re.Match object; span=(0, 4), match='bddd'>

In [37]:
re.search(r'[^b]', 'bddddd chouette b')

# bc of the brackets, it references 1 single char, then returns it

<re.Match object; span=(1, 2), match='d'>

In [38]:
re.search(r'[^a.*]', 'asdfd abddddd chouette b')

<re.Match object; span=(1, 2), match='s'>

In [39]:
re.search(r'^a.*', 'asdfd abddddd chouette b')

<re.Match object; span=(0, 24), match='asdfd abddddd chouette b'>

In [40]:
 # same result as below : starts with b, ends with a, anything (or nothing) in between

re.search(r'^b.*a$', 'basdfd abddddd chouette ba')

<re.Match object; span=(0, 26), match='basdfd abddddd chouette ba'>

In [41]:
# same result as above, but .+ requires at least one char between the b and the a

re.search(r'^b.+a$', 'basdfd abddddd chouette ba')

<re.Match object; span=(0, 26), match='basdfd abddddd chouette ba'>

In [42]:
# what does \w mean

re.search(r'\w{4}', 'abc123')

# any alphanum of 4 chars

<re.Match object; span=(0, 4), match='abc1'>

In [43]:
# \w matches [a-zA-Z0-9_]

re.search(r'\w{1,4}', 'abc 1234')

# {1,4} takes one to four word chars ; it stops at 3 bc the blank space is not alphanumeric

<re.Match object; span=(0, 3), match='abc'>

In [44]:
# + matches one or more of the pattern immediately to its left (here, only the F)
# so f|F+ reads as : 'f', or one or more 'F'
# greedy like *

re.search(r'f|F+', 'Fred asked a good question.')

<re.Match object; span=(0, 1), match='F'>

In [45]:
re.search(r'f|F\w*', 'Fred asked a good question.')

# the block of 'Fred' text, stopping bc of the space

<re.Match object; span=(0, 4), match='Fred'>
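
In the two cells above, the quantifier binds only to the ```F``` branch, bc ```|``` has the lowest precedence. A short sketch on made-up strings : a non-capturing group ```(?:...)``` or a char class makes the quantifier apply to both letters.

```python
import re

# r'f|F+' means : 'f' OR 'F+'.
only_upper_repeats = re.search(r"f|F+", "FFFred").group()
# The \w* belongs to the F branch only, so a lowercase start gives just 'f'.
lower_branch = re.search(r"f|F\w*", "fred").group()

# Grouping applies what follows to either letter.
grouped = re.search(r"(?:f|F)\w*", "fred").group()
# A character class does the same job for single chars.
class_repeat = re.search(r"[fF]+", "fFfred").group()

print(only_upper_repeats)   # FFF
print(lower_branch)         # f
print(grouped)              # fred
print(class_repeat)         # fFf
```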

In [46]:
# what if we want only letters, and not digits or the _ char
# [a-zA-Z]+ finds every sequence made up only of letters

print(re.findall(r'[a-zA-Z]+', '42 $stuff a****nd things and 123'))
print()
print(re.findall(r'[a-zA-Z]', '42 $stuff a****nd things and 123'))
print()
print(re.search(r'[a-zA-Z]+', '42 $stuff a****nd things and 123'))

['stuff', 'a', 'nd', 'things', 'and']

['s', 't', 'u', 'f', 'f', 'a', 'n', 'd', 't', 'h', 'i', 'n', 'g', 's', 'a', 'n', 'd']

<re.Match object; span=(4, 9), match='stuff'>


In [47]:
# any first char, then a word char, then a non-digit

re.search(r'^.\w\D', ';adfaeqe#$#^$hrtrq73489&^&$(&$)')


# only cares about the first 3 chars

<re.Match object; span=(0, 3), match=';ad'>

### Difference between ```*``` and ```+```

In [48]:
# match F followed by exactly one word char (a-zA-Z0-9_)
# does not match F on its own

re.search(r'F\w', 'F red asked a great question. Great job, Fred.')

<re.Match object; span=(41, 43), match='Fr'>

In [49]:
# match F then zero or more word chars (a-zA-Z0-9_), so F on its own matches

re.search(r'F\w*', 'F red asked a great question. Great job, Fred.')

<re.Match object; span=(0, 1), match='F'>

In [50]:
# {n, } matches n or more times

re.findall(r'[a-zA-Z]{1,}', 'abc2324 is the lace to be')

['abc', 'is', 'the', 'lace', 'to', 'be']

In [51]:
# 3 digits, then a single char of any kind, then 4 digits
re.search(r'\d{3}.\d{4}', '714-7576259')

<re.Match object; span=(0, 8), match='714-7576'>

In [52]:
# what if the delimiter is optional

# ? metachar means the thing to the left of the ? is optional


re.search(r'\d{3}.?\d{4}', '7147576259')

<re.Match object; span=(0, 8), match='71475762'>
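
Note that in the cell above ```.?``` still consumed a digit, bc ```.``` matches any char. A possible refinement, assuming the only delimiters of interest are a hyphen, a full stop or a space : restrict the optional delimiter with a char class. The three-part phone format is an assumption made for the example.

```python
import re

# '.?' accepts ANY optional char, so it can swallow a digit.
loose = re.search(r"\d{3}.?\d{4}", "7147576259").group()
# A character class limits the optional delimiter to '-', '.' or a space.
strict = re.search(r"\d{3}[-. ]?\d{4}", "7147576259").group()
print(loose)    # 71475762
print(strict)   # 7147576

# Groups pull out the three parts whatever the delimiter is.
phone_re = re.compile(r"(\d{3})[-. ]?(\d{3})[-. ]?(\d{4})")
samples = ["714-757-6259", "714.757.6259", "714 757 6259", "7147576259"]
parsed = [phone_re.search(raw).groups() for raw in samples]
print(parsed[0])   # ('714', '757', '6259')
```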

## Use a RegEx pattern to split a string

- ```re.split``` returns a list of strings  
- the matching substring is removed  
- we can split on any RE pattern, not only char literals

In [53]:
# splits on '-'

'333-333-3333'.split('-')

['333', '333', '3333']

In [54]:
# splitting on '-' does nothing here, bc this phone number uses spaces

re.split(r'-', '333 333 3333')

['333 333 3333']

In [55]:
# split the phone number on the '-' or a space

re.split(r'-| ', '333 333 3333')

['333', '333', '3333']

In [56]:
# split on the space

re.split(r' ', 'this, that and the other thing')

['this,', 'that', 'and', 'the', 'other', 'thing']
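
The last cell leaves the comma attached to 'this,'. A sketch of three ```re.split``` variations that plain ```str.split``` cannot do in one call : splitting on a char class, keeping the delimiter with a capturing group, and limiting the number of splits with ```maxsplit```.

```python
import re

phrase = "this, that and the other thing"

# Split on any run of commas and / or whitespace.
words = re.split(r"[,\s]+", phrase)
# A capturing group keeps the delimiter in the result.
with_delims = re.split(r"(-)", "333-333-3333")
# maxsplit stops after the first split.
first_only = re.split(r"-", "333-333-3333", maxsplit=1)

print(words)         # ['this', 'that', 'and', 'the', 'other', 'thing']
print(with_delims)   # ['333', '-', '333', '-', '3333']
print(first_only)    # ['333', '333-3333']
```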

### character classes

- square brackets make char classes

- char classes provide OR behaviour
- in a char class, a leading ```^``` works as a 'none of' operator
- metachars match their literal char when inside the square brackets of a char class

In [57]:
re.search(r'[aeiou]', 'banana')

<re.Match object; span=(1, 2), match='a'>

In [58]:
re.findall(r'gr[ae]y', 'some grey, gray clouds')

['grey', 'gray']

In [59]:
# matches a single vowel

re.search(r'[aeiou]{1}', 'a')

<re.Match object; span=(0, 1), match='a'>

In [60]:
# ends with a run of vowels (a leading ^ would be needed to require that the whole string is vowels)

re.search(r'[aeiou]*$', 'aeiouaeou')

<re.Match object; span=(0, 9), match='aeiouaeou'>
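
To check that a string is made up only of vowels, the pattern has to be tied to both ends. A sketch comparing the unanchored search with ```re.fullmatch```, which requires the whole string to match (the test words are made up) :

```python
import re

results = {}
for word in ["aeiou", "xaei", ""]:
    # Unanchored at the start : succeeds on any string, even with zero vowels.
    loose = bool(re.search(r"[aeiou]*$", word))
    # Whole string must be one or more vowels.
    strict = bool(re.fullmatch(r"[aeiou]+", word))
    results[word] = (loose, strict)
    print(repr(word), loose, strict)

# 'aeiou' True True
# 'xaei' True False
# '' True False
```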

In [61]:
# has a p or a q, anywhere

re.search(r'p|q', 'albuquerque', re.IGNORECASE)

<re.Match object; span=(4, 5), match='q'>

In [62]:
# has a p or a q, anywhere

re.search(r'[pq]', 'albuquerque', re.IGNORECASE)

<re.Match object; span=(4, 5), match='q'>

In [63]:
re.search(r'[pqPQ]*$', 'qqpqpqpqpqpqQPQPQPQP')

# a run of p / q (either case) that reaches the end of the string ; without a leading ^, the string would not have to start with a p or a q

<re.Match object; span=(0, 20), match='qqpqpqpqpqpqQPQPQPQP'>

In [64]:
# find all the occurrences of 'civil' followed by the word immediately after 'civil'

re.findall(r'civil\s[a-z]+', string)

['civil blood', 'civil hands']

## Repetition and special sequences

```.``` any single character (except a newline, by default).  
```*``` zero or more of the preceding pattern.   
```+``` one or more of the preceding pattern.  
```\b``` word boundary anchor.  
```\d``` matches any digit ; equivalent to [0-9]  
```\D``` any non-digit char ; equivalent to [^0-9]



In [65]:
# word match without the \b boundary : the o can sit in the middle of a word

re.search(r'o\w+', 'do you like apples or oranges ?')

<re.Match object; span=(4, 6), match='ou'>

In [66]:
# \b means word boundary : any word that starts with o

re.search(r'\bo\w+', 'do you like appels or orangs')

<re.Match object; span=(19, 21), match='or'>

In [68]:
# any word that starts with o

re.findall(r'\bo\w+', 'do you like apples or oranges ?')

['or', 'oranges']


# Groupings

Capture specific groups.

In [69]:
sentence = 'You can find us on the web at https://codeup.com, Our IP address is 123.123.123.123 (maybe)'

In [71]:
url_re = r'(https?)://(\w+)\.(\w+)'

re.search(url_re, sentence).groups()

('https', 'codeup', 'com')

In [72]:
# unpacking


protocol, domaine, tld = re.search(url_re, sentence).groups()

print(f'''
protocol: {protocol}
domain: {domaine}
tld: {tld}
''')


protocol: https
domain: codeup
tld: com



In [80]:


# name each group inside the pattern

url_re = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'
# (?P<name>...) : the group's name, followed by what to capture in that group



In [81]:
match = re.search(url_re, sentence)

In [82]:
# matching all groups in 'match'
match.groups()

('https', 'codeup', 'com')

In [85]:
# matching indicated group
match.group('domain')

'codeup'

In [83]:
# group dictionary

match.groupdict()

{'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}

In [None]:
print(f'groups: {match.groups()}')

In [87]:
#lines in a log

logs = """
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
"""
print(logs)


GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
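
One way these log lines could be broken into fields with named groups. This is a sketch, not the lesson's own solution : the group names (`method`, `path`, `timestamp`, `status`, `bytes_sent`, `user_agent`, `ip`) are assumptions, and only two of the three lines are repeated here to keep the block short.

```python
import re

# Two of the log lines from above, copied as-is.
logs = """
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
"""

log_re = re.compile(r'''
^(?P<method>GET|POST)
\s+
(?P<path>\S+)
\s+
\[(?P<timestamp>[^\]]+)\]     # everything between the square brackets
\s+
\S+                           # protocol, eg HTTP/1.1 ; matched but not captured
\s+
\{(?P<status>\d+)\}
\s+
(?P<bytes_sent>\d+)
\s+
"(?P<user_agent>[^"]*)"
\s+
(?P<ip>\d+\.\d+\.\d+\.\d+)$
''', re.VERBOSE | re.MULTILINE)

# finditer yields one Match per log line ; groupdict turns each into a dict.
rows = [m.groupdict() for m in log_re.finditer(logs)]

print(len(rows))            # 2
print(rows[0]["path"])      # /api/v1/sales?page=86
print(rows[1]["status"])    # 429
```

```re.MULTILINE``` lets ```^``` and ```$``` anchor each line of the multi-line string.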



In [106]:
emails = ["list_o_emails.surba@company.com", 'jane.janeee@han.com']

In [122]:
pattern = re.compile(r'''
(?P<first_name>\w+)?
\.?
(?P<last_name>\w+)? 
\@
(?P<domain>\w+)
\.
(?P<tld>\w+)
''', re.VERBOSE)



In [123]:
[re.search(pattern, email).groupdict() for email in emails]

[{'first_name': 'list_o_emails',
  'last_name': 'surba',
  'domain': 'company',
  'tld': 'com'},
 {'first_name': 'jane', 'last_name': 'janeee', 'domain': 'han', 'tld': 'com'}]



```re.MULTILINE``` : The ^ and $ anchors will apply line by line, instead of applying to start and end of the string.

```re.IGNORECASE``` : Ignore character casing when matching.

```re.VERBOSE``` : Ignore whitespace in the regular expression (except when escaped or inside a character class) and allow ```#``` comments. This can make long regular expressions much more readable.
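
A short sketch of the three flags on made-up strings. Flags are combined with ```|```. The time pattern and its group names are illustrative assumptions.

```python
import re

poem = "Two households\nboth alike\nin dignity"

# Without MULTILINE, ^ only matches at the very start of the string.
start_only = re.findall(r"^\w+", poem)
# With MULTILINE, ^ matches at the start of every line.
each_line = re.findall(r"^\w+", poem, re.MULTILINE)

# VERBOSE allows layout and comments ; IGNORECASE lets 'PM' match 'pm'.
time_re = re.compile(r"""
    (?P<hour>\d{1,2})   # hour, one or two digits
    :                   # literal separator
    (?P<minute>\d{2})   # minute, two digits
    \ ?                 # optional space (escaped so VERBOSE keeps it)
    (?P<ampm>am|pm)     # meridiem
""", re.VERBOSE | re.IGNORECASE)
parsed_time = time_re.search("Doors open at 7:30 PM").groupdict()

print(start_only)    # ['Two']
print(each_line)     # ['Two', 'both', 'in']
print(parsed_time)   # {'hour': '7', 'minute': '30', 'ampm': 'PM'}
```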