## Regular Expressions

Defining REs in Python is straightforward:

In [None]:
import re

pattern = re.compile('[bcrh]at')
pattern2 = re.compile('(.*)([bcrh]at)(.*)')

We can then use `search()` or `match()` to test strings against the pattern.

`search()` will return a result if the pattern occurs **anywhere** in the input string.

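`match()` will only return a result if the pattern matches at the **beginning** of the input string; the match does not have to extend to the end. To require that the pattern matches the **complete** string, use `fullmatch()`.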
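
As a minimal sketch of how the matching functions differ, the cell below applies one pattern to a few made-up strings. The example strings and variable names are illustrative assumptions, not part of the original notebook.

```python
import re

# Character class: one of b, c, r, h, followed by "at"
pattern = re.compile('[bcrh]at')

# search() scans the whole string and returns the first occurrence
found = pattern.search('the cat sat')

# match() only tries position 0; here the string starts with 't', so no match
at_start = pattern.match('the cat sat')

# match() succeeds when the string starts with the pattern,
# even if more text follows
prefix = pattern.match('cats and dogs')

# fullmatch() requires the entire string to match the pattern
whole = pattern.fullmatch('cats')
exact = pattern.fullmatch('cat')

print(found, at_start, prefix, whole, exact, sep='\n')
```

Calls that find nothing return `None`, so check the result before calling methods such as `group()` on it.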

In [None]:
word = 'the batter won the game'
matches = re.match(pattern2, word) # matches from the start: the leading (.*) absorbs 'the '
searches = re.search(pattern, word) # finds a substring

In [None]:
print(matches.groups())
print(searches)

Both return a match object with several methods for accessing the result:
- `span()` returns a tuple with the start and end index of the matching substring
- `group()` returns the matched substring

In [None]:
span = searches.span()
word[span[0]:span[1]], span

In [None]:
searches.group()

If the pattern contains several RE groups (in parentheses `()`), we can access them via `groups()`, which returns a tuple with one entry per group.

In [None]:
word = 'preconstitutionalism'
affixes = re.compile('(...).+(...)')
re.search(affixes, word).groups()
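
Individual groups can also be retrieved by number or by name. The sketch below reuses the affix pattern from above; the group names `prefix` and `suffix` are hypothetical labels chosen for illustration.

```python
import re

# Same idea as the affix pattern, with named groups via (?P<name>...)
named = re.compile(r'(?P<prefix>...).+(?P<suffix>...)')
m = named.search('preconstitutionalism')

whole = m.group(0)           # group 0 is always the entire match
first = m.group(1)           # groups are numbered from 1, left to right
suffix = m.group('suffix')   # named groups can be accessed by name
parts = m.groupdict()        # all named groups as a dictionary

print(whole, first, suffix, parts)
```

Because `.+` is greedy, it consumes as much as possible and leaves only the last three characters for the second group.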

For the email address finder, we can use a more advanced pattern and test it:

In [None]:
# raw string (r'...') so that backslashes reach the regex engine unchanged
email = re.compile(r'^[A-Za-z0-9][A-Za-z0-9\.-]*@[A-Za-z0-9][A-Za-z0-9\.-]+\.[A-Za-z0-9\.-][A-Za-z0-9\.-][A-Za-z0-9\.-]?$')
# for address in ['me.@unibocconi.it', '@web.de', '.@gmx.com', 'not working@aol.com']:

for address in 'notMyFault@webmail.com,smithie123@gmx,Free stuff@unibocconi.it,mark_my_words@hotmail;com,truthOrDare@webmail.in,look@me@twitter.com,how2GetAnts@aol.dfdsfgfdsgfd'.split(','):
    print(address, re.match(email, address))

We can also use a pattern to replace every part of a string that matches it with `sub()`.

In [None]:
print('Are you all awake?'.replace('???', '!'))

numbers = re.compile('[0-9]')
re.sub(numbers, '0', 'Back in the 90s, when I was a 12-year-old, a CD cost just 15,99EUR!')
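
The replacement behaves differently depending on how the pattern is written. The following sketch, on a shortened version of the sentence above, compares replacing single digits, replacing runs of digits, and reusing captured groups in the replacement string. The variable names are illustrative.

```python
import re

text = 'Back in the 90s, a CD cost just 15,99EUR!'

# One replacement per digit: every digit becomes '0'
per_digit = re.sub(r'[0-9]', '0', text)

# One replacement per run of digits: '+' matches one or more digits at once
per_run = re.sub(r'[0-9]+', '0', text)

# Backreferences \1 and \2 insert the captured groups into the replacement
reformatted = re.sub(r'([0-9]+),([0-9]+)EUR', r'\1.\2 EUR', text)

print(per_digit)
print(per_run)
print(reformatted)
```

Unlike `str.replace()`, which only substitutes literal strings, `re.sub()` can rewrite every substring that fits a pattern.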

## Exercise

Write a RegEx that replaces all user names in the tweets with the token "@USER".

In [None]:
! pip install wget

In [None]:
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/tweets_en.txt'
wget.download(url, 'tweets_en.txt')
# the with-block closes the file once all lines have been read
with open('tweets_en.txt', encoding='utf8') as tweet_file:
    tweets = [line.strip() for line in tweet_file]

In [None]:
# your code here



Now, write a RegEx to extract all user names from the tweets.


In [None]:
# your code here


## Exercise

Write a RegEx to find all hashtags that contain the word `good`.

In [None]:
# your code here

## TF-IDF

Let's extract the most important words from Moby Dick.

In [None]:
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')

In [None]:
import pandas as pd
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8')]
print(documents[1])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word',
                                   min_df=0.001,
                                   max_df=0.75,
                                   stop_words='english',
                                   sublinear_tf=True)

X = tfidf_vectorizer.fit_transform(documents)

In [None]:
X.shape

Now, let's get the same information as raw counts:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english')

X2 = vectorizer.fit_transform(documents)

In [None]:
X.shape, X2.shape

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(data={'word': vectorizer.get_feature_names_out(),
                        'tf': X2.sum(axis=0).A1, 
                        'idf': tfidf_vectorizer.idf_,
                        'tfidf': X.sum(axis=0).A1
                       })

In [None]:
X2.sum(axis=0).A1

In [None]:
df = df.sort_values(['tfidf', 'tf', 'idf'], ascending=False)
df

In [None]:
df = df.sort_values(['tf', 'idf'], ascending=False)
df
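
To make the `tf`, `idf`, and `tfidf` columns more concrete, here is a small pure-Python sketch on a made-up three-document corpus. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the sublinear term frequency, 1 + ln(tf), that `TfidfVectorizer` applies with the settings used above. It skips the per-document length normalization, so the numbers will not match scikit-learn's output exactly.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = ['the whale the sea', 'the ship', 'the whale']
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in tokenized for word in set(doc))

def idf(word):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(word, doc):
    """Sublinear term frequency times IDF, without length normalization."""
    tf = doc.count(word)
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * idf(word)

for word in ['the', 'whale', 'ship']:
    print(word, round(idf(word), 3), round(tfidf(word, tokenized[0]), 3))
```

Words that occur in every document, such as `the`, receive the lowest IDF, while words confined to a single document receive the highest.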

## Exercise
Extract **only** the bigrams (no unigrams) from Moby Dick and find the top 10 in terms of TF-IDF.

In [None]:
# your code here


## PMI
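Extracting PMI from text is relatively straightforward, and `nltk` offers some functions to do so flexibly. Note that the score used below, `mi_like`, is a variant of mutual information rather than plain PMI.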

In [None]:
import nltk
nltk.download('all')

In [None]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords
from collections import Counter

stopwords_ = set(stopwords.words('english'))

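# compare the lowercased word, so that capitalized stopwords such as 'The' are filtered too
words = [word.lower() for document in documents for word in document.split()
         if len(word) > 2
         and word.lower() not in stopwords_]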
         
finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.mi_like
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}
collocations

In [None]:
finder.score_ngrams(score)

In [None]:
Counter(collocations).most_common(20)
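
For comparison, plain PMI can be computed by hand as log2(P(x, y) / (P(x) P(y))). The sketch below does this on a short made-up token sequence; it illustrates the formula only and does not reimplement `mi_like`.

```python
import math
from collections import Counter

# Toy token sequence (made up for illustration)
tokens = 'moby dick saw moby dick and the white whale saw the white whale'.split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(x, y):
    """Pointwise mutual information of the bigram (x, y), in bits."""
    p_xy = bigrams[(x, y)] / n_bigrams
    p_x = unigrams[x] / n_unigrams
    p_y = unigrams[y] / n_unigrams
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi('moby', 'dick'), 3))
print(round(pmi('saw', 'the'), 3))
```

A bigram whose words almost always appear together scores higher than one whose words also occur in other contexts. The function assumes the bigram occurs at least once, since the logarithm of zero is undefined.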

## Exercise

Extract the top 10 collocations for the Twitter data. You need to preprocess the data first!

In [None]:
# your code here