## Regular Expressions

Defining REs in Python is straightforward:

In [1]:
import re

pattern = re.compile('[bcrh]at')

pattern2 = re.compile('(.*)([bcrh]at)(.*)')

We can then apply the pattern to a string with `search()` or `match()`.

`search()` will return a result if the pattern occurs **anywhere** in the input string.

`match()` will only return a result if the pattern matches at the **beginning** of the input string. To require that the pattern covers the **entire** string, use `fullmatch()`.
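To make the difference concrete, here is a small sketch that applies the same pattern with `search()`, `match()` and `fullmatch()`. The example strings and variable names are made up for illustration.

```python
import re

pattern = re.compile('[bcrh]at')
text = 'the batter won the game'

found_anywhere = pattern.search(text)          # 'bat' occurs inside the string
at_start = pattern.match(text)                 # None: the text starts with 'the'
prefix_only = pattern.match('hat trick')       # matches 'hat' at position 0
whole_string = pattern.fullmatch('hat trick')  # None: the pattern does not cover the whole string

print(found_anywhere, at_start, prefix_only, whole_string)
```

Calling the methods on the compiled pattern (`pattern.search(text)`) is equivalent to `re.search(pattern, text)`.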

In [7]:
word = 'the batter won the game'
#word = 'hi how are you'

matches = re.match(pattern2, word) # matches: the leading (.*) lets pattern2 cover the string from its start
searches = re.search(pattern, word) # finds a substring

In [13]:
print(matches)

<re.Match object; span=(0, 23), match='the batter won the game'>


In [14]:
print(matches.groups())
print(searches)

('the ', 'bat', 'ter won the game')
<re.Match object; span=(4, 7), match='bat'>


Both have a number of attributes to access the results. 
- `span()` gives us a tuple with the start and end indices of the matching substring
- `group()` returns the matched substring

In [15]:
span = searches.span()
word[span[0]:span[1]], span

('bat', (4, 7))

In [16]:
searches.group()

'bat'

If we have used several RE groups (in parentheses `()`), we can retrieve them all as a tuple via `groups()`, or individually via `group(n)`.

In [17]:
word = 'preconstitutionalism'
affixes = re.compile('(...).+(...)')
re.search(affixes, word).groups()

('pre', 'ism')
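As a possible variation, groups can also be given names with the `(?P<name>...)` syntax, which makes longer patterns easier to read. The group names `prefix` and `suffix` below are chosen for illustration only.

```python
import re

word = 'preconstitutionalism'
# Same pattern as above, but with named groups (names are hypothetical)
affixes = re.compile(r'(?P<prefix>...).+(?P<suffix>...)')
m = affixes.search(word)

first = m.group(1)         # first group, by position
last = m.group('suffix')   # second group, by name
whole = m.group(0)         # the entire match
named = m.groupdict()      # all named groups as a dictionary

print(first, last, whole, named)
```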

For the email address finder, we can use a more advanced pattern and test it:

In [18]:
email = re.compile(r'^[A-Za-z0-9][A-Za-z0-9\.-]*@[A-Za-z0-9][A-Za-z0-9\.-]+\.[A-Za-z0-9\.-][A-Za-z0-9\.-][A-Za-z0-9\.-]?$')  # raw string keeps the backslashes intact
# for address in ['me.@unibocconi.it', '@web.de', '.@gmx.com', 'not working@aol.com']:

for address in 'notMyFault@webmail.com,smithie123@gmx,Free stuff@unibocconi.it,mark_my_words@hotmail;com,truthOrDare@webmail.in,look@me@twitter.com,how2GetAnts@aol.dfdsfgfdsgfd'.split(','):
    print(address, re.match(email, address))

notMyFault@webmail.com <re.Match object; span=(0, 22), match='notMyFault@webmail.com'>
smithie123@gmx None
Free stuff@unibocconi.it None
mark_my_words@hotmail;com None
truthOrDare@webmail.in <re.Match object; span=(0, 22), match='truthOrDare@webmail.in'>
look@me@twitter.com None
how2GetAnts@aol.dfdsfgfdsgfd None


We can also use `sub()` to replace every part of a string that matches the pattern, much like `str.replace()` does for fixed strings.

In [20]:
print('Are you all awake?'.replace('?', '!'))

numbers = re.compile('[0-9]')
re.sub(numbers, '5', '1 Back in the 90s, when I was a 12-year-old, a CD cost just 15,99EUR!')

Are you all awake!


'5 Back in the 55s, when I was a 55-year-old, a CD cost just 55,55EUR!'
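The replacement does not have to be a fixed character. The sketch below contrasts replacing each digit, replacing whole numbers, and reusing the matched text through the `\g<0>` backreference. The example sentence and the `NUM` token are assumptions made for illustration.

```python
import re

text = 'A CD cost just 15,99EUR in the 90s'
digits = re.compile('[0-9]')    # a single digit
number = re.compile('[0-9]+')   # a run of one or more digits

per_digit = digits.sub('0', text)       # every digit is replaced separately
per_number = number.sub('NUM', text)    # every run of digits becomes one token
tagged = number.sub(r'<\g<0>>', text)   # \g<0> inserts the matched text itself

print(per_digit)
print(per_number)
print(tagged)
```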

## Exercise

Write a RegEx to replace all user names in the tweets with the token `@USER`.

In [None]:
! pip install wget

In [21]:
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/tweets_en.txt'
wget.download(url, 'tweets_en.txt')
tweets = [line.strip() for line in open('tweets_en.txt', encoding='utf8')]

In [54]:
tweets[:10]

['@cosmetic_candy I think a lot of people just enjoy being a pain in the ass on there',
 'Best get ready sunbed and dinner with nana today :)',
 '@hardlyin70 thats awesome!',
 'Loving this weather',
 '“@danny_boy_37: Just seen an absolute idiot in shorts! Be serious!” Desperado gentleman',
 '@SamanthaOrmerod trying to resist a hardcore rave haha! Resisting towns a doddle! Posh dance floor should wear them in quite easy xx',
 '59 days until @Beyonce!!! Wooo @jfracassini #cannotwait',
 'That was the dumbest tweet i ever seen',
 'Oh what to do on this fine sunny day?',
 '@Brooke_C_X hows the fish ? Hope they r ok. Xx']

In [80]:
# your code here
pattern = re.compile('@[^ \t\n\r\f\v:]+')
[re.sub(pattern, '@USER', tweets[i]) for i in range(10)]

['@USER I think a lot of people just enjoy being a pain in the ass on there',
 'Best get ready sunbed and dinner with nana today :)',
 '@USER thats awesome!',
 'Loving this weather',
 '“@USER: Just seen an absolute idiot in shorts! Be serious!” Desperado gentleman',
 '@USER trying to resist a hardcore rave haha! Resisting towns a doddle! Posh dance floor should wear them in quite easy xx',
 '59 days until @USER Wooo @USER #cannotwait',
 'That was the dumbest tweet i ever seen',
 'Oh what to do on this fine sunny day?',
 '@USER hows the fish ? Hope they r ok. Xx']

Now, write a RegEx to extract all user names from the tweets


In [84]:
# your code here
[re.match(pattern, tweets[i]).group() for i in [0, 2, 5]]

['@cosmetic_candy', '@hardlyin70', '@SamanthaOrmerod']
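Since `re.match()` only looks at the start of a string and returns a single match, one alternative is `findall()`, which returns every occurrence. The sketch below uses the simpler pattern `@\w+` (an assumption, not the pattern from the cell above) on three of the tweets shown earlier.

```python
import re

# Assumed pattern: '@' followed by letters, digits or underscores
user = re.compile(r'@\w+')

sample = [
    '@hardlyin70 thats awesome!',
    '59 days until @Beyonce!!! Wooo @jfracassini #cannotwait',
    'Loving this weather',
]

mentions = [user.findall(tweet) for tweet in sample]         # one list per tweet
all_users = [name for names in mentions for name in names]   # flattened

print(mentions)
print(all_users)
```

Because `\w` excludes punctuation, the exclamation marks after `@Beyonce` are not treated as part of the name.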

## Exercise

Write a RegEx to search for all hashtags containing the word `good` in them.

In [94]:
import numpy as np

In [112]:
# your code here
pattern = re.compile("#[a-zA-Z0-9]+good[a-zA-Z0-9]+")

tweets_with_good = [tweet for tweet in tweets if re.search(pattern, tweet)]
len(tweets_with_good)

36
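Note that the pattern above requires at least one character on both sides of `good`, and it is case-sensitive. The sketch below compares it with a looser, hypothetical alternative on a few made-up hashtags. Which behavior is wanted depends on how "containing" is read.

```python
import re

strict = re.compile(r'#[a-zA-Z0-9]+good[a-zA-Z0-9]+')  # pattern from the cell above
loose = re.compile(r'#\w*good\w*', re.IGNORECASE)      # hypothetical alternative

# Made-up examples, not taken from the tweet file
sample = ['#feelgoodfriday', '#goodmorning', '#sogood', '#GoodVibes', 'good day #monday']

strict_hits = [tag for tag in sample if strict.search(tag)]
loose_hits = [tag for tag in sample if loose.search(tag)]

print(strict_hits)
print(loose_hits)
```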

## TF-IDF

Let's extract the most important words from Moby Dick

In [None]:
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')

'moby_dick.txt'

In [None]:
import pandas as pd
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8')]
print(documents[1])

Call me Ishmael .


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word',
                                   min_df=0.001,
                                   max_df=0.75,
                                   stop_words='english'
                                   )

X = tfidf_vectorizer.fit_transform(documents)

In [None]:
X.shape

(9768, 1850)
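For intuition about what `TfidfVectorizer` computes, here is a minimal pure-Python sketch on a hypothetical three-document toy corpus. It assumes scikit-learn's default settings: a smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document row. Tokenization, `min_df`/`max_df` and stop-word filtering are left out.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made up for illustration)
docs = [
    ['whale', 'ship', 'whale'],
    ['ship', 'sea'],
    ['whale', 'ship', 'harpoon'],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in doc_freq}

def tfidf_row(doc):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    counts = Counter(doc)
    raw = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(doc) for doc in docs]
print(idf)
print(rows[0])
```

A term that occurs in every document (`ship`) gets the lowest IDF, while rare terms (`sea`, `harpoon`) get the highest.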

Now, let's get the same information as raw counts:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english')

X2 = vectorizer.fit_transform(documents)

In [None]:
X.shape, X2.shape

((9768, 1850), (9768, 1850))

In [None]:
df = pd.DataFrame(data={'word': vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
                        'tf': X2.sum(axis=0).A1, 
                        'idf': tfidf_vectorizer.idf_,
                        'tfidf': X.sum(axis=0).A1
                       })

In [None]:
X2.sum(axis=0).A1

array([20, 21, 10, ..., 10, 18, 80], dtype=int64)

In [None]:
df = df.sort_values(['tfidf', 'tf', 'idf'], ascending=False)
df

| | word | tf | idf | tfidf |
|---|---|---|---|---|
| 1782 | whale | 1150 | 3.262357 | 224.212236 |
| 1838 | ye | 467 | 4.257380 | 153.091587 |
| 231 | chapter | 171 | 5.039475 | 148.370596 |
| 972 | man | 525 | 3.982412 | 134.964448 |
| 922 | like | 639 | 3.808543 | 133.426528 |
| ... | ... | ... | ... | ... |
| 554 | fleet | 11 | 7.702063 | 3.049731 |
| 1423 | shortly | 10 | 7.789074 | 3.032615 |
| 1735 | valiant | 10 | 7.789074 | 3.017954 |
| 1602 | surprise | 10 | 7.789074 | 2.934600 |


In [None]:
df = df.sort_values(['tf', 'idf'], ascending=False)
df

| | word | tf | idf | tfidf |
|---|---|---|---|---|
| 1782 | whale | 1150 | 3.262357 | 224.212236 |
| 922 | like | 639 | 3.808543 | 133.426528 |
| 972 | man | 525 | 3.982412 | 134.964448 |
| 21 | ahab | 511 | 4.019453 | 131.484086 |
| 1414 | ship | 509 | 4.006953 | 111.771529 |
| ... | ... | ... | ... | ... |
| 407 | downward | 10 | 7.789074 | 3.111894 |
| 1423 | shortly | 10 | 7.789074 | 3.032615 |
| 1735 | valiant | 10 | 7.789074 | 3.017954 |
| 1602 | surprise | 10 | 7.789074 | 2.934600 |


## Exercise
Extract **only** the bigrams (no unigrams) from Moby Dick and find the top 10 in terms of TF-IDF.

In [None]:
# your code 


## PMI
Extracting PMI from text is relatively straightforward, and `nltk` offers some functions to do so flexibly.

In [None]:
import nltk
nltk.download('all')


True

In [None]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords
from collections import Counter

stopwords_ = set(stopwords.words('english'))

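# Note: the stopword check runs on the original casing, so capitalized
# stopwords such as 'The' pass the filter and are only lowercased afterwards
words = [word.lower() for document in documents for word in document.split() 
         if len(word) > 2 
         and word not in stopwords_]
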
finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.mi_like  # PMI-like measure: bigram count cubed over the product of the two unigram counts
collocations = {'_'.join(bigram): value for bigram, value in finder.score_ngrams(score)}
collocations

{'moby_dick': 83.0,
 'sperm_whale': 20.002847184002935,
 'mrs_hussey': 10.5625,
 'mast_heads': 4.391152941176471,
 'sag_harbor': 4.0,
 'vinegar_cruet': 4.0,
 'try_works': 3.7944046844502277,
 'dough_boy': 3.7067873303167422,
 'white_whale': 3.698807453416149,
 'caw_caw': 3.4722222222222223,
 'samuel_enderby': 3.4285714285714284,
 'cape_horn': 3.4133333333333336,
 'new_bedford': 3.3402061855670104,
 'quarter_deck': 3.2339339991315676,
 'deacon_deuteronomy': 3.2,
 'father_mapple': 3.0,
 'gamy_jesty': 3.0,
 'hoky_poky': 3.0,
 'jesty_joky': 3.0,
 'joky_hoky': 3.0,
 'sporty_gamy': 3.0,
 'sulk_pout': 3.0,
 'twos_threes': 3.0,
 'mast_head': 2.464640949554896,
 '000_lbs': 2.45,
 'chief_mate': 2.4075114075114077,
 'old_man': 2.269660474055093,
 'straits_sunda': 2.25,
 'crow_nest': 2.227272727272727,
 'crested_comb': 2.0,
 'daboll_arithmetic': 2.0,
 'distension_contraction': 2.0,
 'gemini_twins': 2.0,
 'helter_skelter': 2.0,
 'hogs_bristles': 2.0,
 'kith_kin': 2.0,
 'lirra_skirra': 2.0,
 'pell_m

In [None]:
finder.score_ngrams(score)

[(('moby', 'dick'), 83.0),
 (('sperm', 'whale'), 20.002847184002935),
 (('mrs', 'hussey'), 10.5625),
 (('mast', 'heads'), 4.391152941176471),
 (('sag', 'harbor'), 4.0),
 (('vinegar', 'cruet'), 4.0),
 (('try', 'works'), 3.7944046844502277),
 (('dough', 'boy'), 3.7067873303167422),
 (('white', 'whale'), 3.698807453416149),
 (('caw', 'caw'), 3.4722222222222223),
 (('samuel', 'enderby'), 3.4285714285714284),
 (('cape', 'horn'), 3.4133333333333336),
 (('new', 'bedford'), 3.3402061855670104),
 (('quarter', 'deck'), 3.2339339991315676),
 (('deacon', 'deuteronomy'), 3.2),
 (('father', 'mapple'), 3.0),
 (('gamy', 'jesty'), 3.0),
 (('hoky', 'poky'), 3.0),
 (('jesty', 'joky'), 3.0),
 (('joky', 'hoky'), 3.0),
 (('sporty', 'gamy'), 3.0),
 (('sulk', 'pout'), 3.0),
 (('twos', 'threes'), 3.0),
 (('mast', 'head'), 2.464640949554896),
 (('000', 'lbs'), 2.45),
 (('chief', 'mate'), 2.4075114075114077),
 (('old', 'man'), 2.269660474055093),
 (('straits', 'sunda'), 2.25),
 (('crow', 'nest'), 2.2272727272727

In [None]:
Counter(collocations).most_common(20)

[('moby_dick', 83.0),
 ('sperm_whale', 20.002847184002935),
 ('mrs_hussey', 10.5625),
 ('mast_heads', 4.391152941176471),
 ('sag_harbor', 4.0),
 ('vinegar_cruet', 4.0),
 ('try_works', 3.7944046844502277),
 ('dough_boy', 3.7067873303167422),
 ('white_whale', 3.698807453416149),
 ('caw_caw', 3.4722222222222223),
 ('samuel_enderby', 3.4285714285714284),
 ('cape_horn', 3.4133333333333336),
 ('new_bedford', 3.3402061855670104),
 ('quarter_deck', 3.2339339991315676),
 ('deacon_deuteronomy', 3.2),
 ('father_mapple', 3.0),
 ('gamy_jesty', 3.0),
 ('hoky_poky', 3.0),
 ('jesty_joky', 3.0),
 ('joky_hoky', 3.0)]
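The scores above come from `mi_like`, which is related to PMI but not identical to it. The following pure-Python sketch computes both on a made-up token sequence. It assumes probabilities are estimated by dividing counts by the number of tokens, and it uses the `mi_like` default exponent of 3. The function names are chosen for illustration.

```python
import math
from collections import Counter

# Made-up token sequence for illustration
tokens = 'moby dick saw the whale and moby dick fled the whale ship'.split()
n = len(tokens)

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def pmi(w1, w2):
    """Pointwise mutual information: log2 of p(w1, w2) / (p(w1) * p(w2))."""
    p_xy = bigrams[(w1, w2)] / n
    return math.log2(p_xy / ((unigrams[w1] / n) * (unigrams[w2] / n)))

def mi_like(w1, w2, power=3):
    """Bigram count raised to `power`, divided by the product of the unigram counts."""
    return bigrams[(w1, w2)] ** power / (unigrams[w1] * unigrams[w2])

for pair in [('moby', 'dick'), ('whale', 'ship')]:
    print(pair, round(pmi(*pair), 3), mi_like(*pair))
```

In this toy example plain PMI scores both pairs identically, although `whale ship` occurs only once. `mi_like` ranks the more frequent pair `moby dick` higher, which is one reason to prefer it when rare bigrams would otherwise dominate the list.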

## Exercise

Extract the top 10 collocations for the Twitter data. You need to preprocess the data first!

In [None]:
# your code here
! pip install emoji
import nltk
nltk.download('stopwords')


True

In [None]:
# your code here
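One possible way to approach the preprocessing step is sketched below: strip URLs and user names, lowercase the text, and keep only simple word tokens. The patterns, the `preprocess` helper and the example tweet are assumptions for illustration. Stopword removal and the collocation scoring would then follow as in the Moby Dick cell.

```python
import re

# Assumed patterns for this sketch
url = re.compile(r'https?://\S+')    # links
user = re.compile(r'@\w+')           # user names
token = re.compile(r"[a-z0-9']+")    # lowercase words, numbers, apostrophes

def preprocess(tweet):
    """Remove URLs and user names, lowercase, and split into tokens."""
    tweet = url.sub(' ', tweet)
    tweet = user.sub(' ', tweet)
    return token.findall(tweet.lower())

sample = '59 days until @Beyonce!!! Wooo @jfracassini #cannotwait http://t.co/abc'
tokens = preprocess(sample)
print(tokens)
```

Here the `#` is dropped but the hashtag text is kept as a token. Whether hashtags, emoji or punctuation should be kept depends on the analysis.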
