## Regular Expressions

Defining REs in Python is straightforward:

In [1]:
import re

pattern = re.compile('[bcrh]at')
pattern2 = re.compile('(.*)([bcrh]at)(.*)') #3 groups : each group between ()

We can then apply the pattern to strings with `search()` or `match()`.

`search()` returns a result if the pattern occurs **anywhere** in the input string.

`match()` only returns a result if the pattern matches at the **beginning** of the input string. To require that the pattern covers the **entire** string, use `fullmatch()`.
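
The difference is easiest to see side by side. The following is a minimal sketch that reuses the `[bcrh]at` pattern from above; the test strings are made up for illustration. It also shows `fullmatch()`, the variant that requires the whole string to match.

```python
import re

pattern = re.compile(r'[bcrh]at')

# search(): the pattern may occur anywhere in the string
found_anywhere = pattern.search('the batter')

# match(): the pattern must start at position 0, trailing text is allowed
at_start_fail = pattern.match('the batter')
at_start_ok = pattern.match('batter')

# fullmatch(): the pattern must cover the whole string
whole_fail = pattern.fullmatch('batter')
whole_ok = pattern.fullmatch('bat')

print(found_anywhere, at_start_fail, at_start_ok, whole_fail, whole_ok, sep='\n')
```
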

In [2]:
word = 'the batter won the game'
matches = re.match(pattern2, word) # matches: the leading (.*) lets pattern2 start at the beginning of the string
searches = re.search(pattern, word) # finds a substring

In [3]:
print(matches)

<re.Match object; span=(0, 23), match='the batter won the game'>


In [4]:
print(matches.groups()) #the parentheses split the match into groups
print(searches) #starts at 4 and ends at 7

('the ', 'bat', 'ter won the game')
<re.Match object; span=(4, 7), match='bat'>


Both return a match object with a number of methods to access the results.
- `span()` gives us a tuple with the start and end index of the matching substring
- `group()` returns the matched substring

In [5]:
span = searches.span() #span[0] is the index of the first char, span[1] is one past the last char
word[span[0]:span[1]], span

('bat', (4, 7))

In [6]:
searches.group()

'bat'

If we have used several RE groups (in parentheses `()`), we can access them all at once via `groups()`.

In [7]:
word = 'preconstitutionalism'
affixes = re.compile('(...).+(...)')
re.search(affixes, word).groups()

('pre', 'ism')
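
Groups can also be addressed one at a time. As a small sketch, `group(n)` returns the n-th group and `(?P<name>...)` gives a group a name. The group names `prefix` and `suffix` are illustrative and not part of the pattern used above.

```python
import re

# Same idea as the affix pattern above, with named groups added for illustration
affixes = re.compile(r'(?P<prefix>...).+(?P<suffix>...)')
m = affixes.search('preconstitutionalism')

whole = m.group(0)            # the entire match
first = m.group(1)            # first group, by position
last = m.group('suffix')      # second group, by name
as_dict = m.groupdict()       # all named groups as a dictionary

print(whole, first, last, as_dict)
```
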

For the email address finder, we can use a more advanced pattern and test it:

In [8]:
email = re.compile(r'^[A-Za-z0-9][A-Za-z0-9\.-]*@[A-Za-z0-9][A-Za-z0-9\.-]+\.[A-Za-z0-9\.-][A-Za-z0-9\.-][A-Za-z0-9\.-]?$')  # raw string, so that \. reaches the regex engine unchanged
# for address in ['me.@unibocconi.it', '@web.de', '.@gmx.com', 'not working@aol.com']:

for address in 'notMyFault@webmail.com,smithie123@gmx,Free stuff@unibocconi.it,mark_my_words@hotmail;com,truthOrDare@webmail.in,look@me@twitter.com,how2GetAnts@aol.dfdsfgfdsgfd'.split(','):
    print(address, re.match(email, address))

notMyFault@webmail.com <re.Match object; span=(0, 22), match='notMyFault@webmail.com'>
smithie123@gmx None
Free stuff@unibocconi.it None
mark_my_words@hotmail;com None
truthOrDare@webmail.in <re.Match object; span=(0, 22), match='truthOrDare@webmail.in'>
look@me@twitter.com None
how2GetAnts@aol.dfdsfgfdsgfd None


We can also use the pattern to replace elements of a string that match with `sub()`

In [9]:
print('Are you all awake?'.replace('???', '!'))

numbers = re.compile('[0-9]')
re.sub(numbers, '0', 'Back in the 90s, when I was a 12-year-old, a CD cost just 15,99EUR!')

Are you all awake?


'Back in the 00s, when I was a 00-year-old, a CD cost just 00,00EUR!'
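
Because `[0-9]` matches a single digit, every digit is replaced separately. As a sketch of one possible variation, adding the `+` quantifier replaces each run of digits once, and `re.subn()` additionally reports how many replacements were made.

```python
import re

text = 'Back in the 90s, when I was a 12-year-old, a CD cost just 15,99EUR!'

# One replacement per digit (the behaviour shown above)
per_digit = re.sub(r'[0-9]', '0', text)

# One replacement per run of digits; subn() also returns the number of replacements
per_run, n_runs = re.subn(r'[0-9]+', '0', text)

print(per_digit)
print(per_run, n_runs)
```
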

## Exercise

Write a RegEx to remove all user names from the tweets and replace them with the token "@USER"

In [10]:
! pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp37-none-any.whl size=9681 sha256=69fc9c581e619e5818bf81f4447e5529b89734711053c284d3f0dfb0a984c50d
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [11]:
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/tweets_en.txt'
wget.download(url, 'tweets_en.txt')
tweets = [line.strip() for line in open('tweets_en.txt', encoding='utf8')]

In [12]:
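# your code here
# Note: inside [...], `.-_` is read as a range from '.' to '_', so this class also accepts ':', '/', '@' and similar characters
user_names_pattern = re.compile(r'@[A-Za-z0-9\.-_]+')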
user_tweets = [re.sub(user_names_pattern, '@USER', tweet) for tweet in tweets[:20]]
user_tweets

['@USER I think a lot of people just enjoy being a pain in the ass on there',
 'Best get ready sunbed and dinner with nana today :)',
 '@USER thats awesome!',
 'Loving this weather',
 '“@USER Just seen an absolute idiot in shorts! Be serious!” Desperado gentleman',
 '@USER trying to resist a hardcore rave haha! Resisting towns a doddle! Posh dance floor should wear them in quite easy xx',
 '59 days until @USER!!! Wooo @USER #cannotwait',
 'That was the dumbest tweet i ever seen',
 'Oh what to do on this fine sunny day?',
 '@USER hows the fish ? Hope they r ok. Xx',
 '@USER 😠',
 'Or this @USER http://t.co/Gsb7V1oVLU',
 '@USER your diary is undoubtedly busier than mine, but feel free to check http://t.co/0pjNL1uwU9',
 'Willy',
 '@USER congrats gorgeous xxx',
 'Puppies are hard work!!! So rewarding though, I love my little Bovril so much! http://t.co/a8cHDbGUKo',
 'Hungover banter lol hehe what http://t.co/eqbAFlyHng',
 '@USER Then come down to London and see him. Lol :-) xxxx',
 "i'm not

Now, write a RegEx to extract all user names from the tweets


In [13]:
# your code here

#users_matches = [re.match(user_names_pattern, tweet) for tweet in tweets]
#users_matches
list_mentions = []
for tweet in tweets[:20]:
  # search() returns only the first mention in each tweet
  first_mention = re.search(user_names_pattern, tweet)
  if first_mention is not None:
    list_mentions.append(first_mention.group())
list_mentions

['@cosmetic_candy',
 '@hardlyin70',
 '@danny_boy_37:',
 '@SamanthaOrmerod',
 '@Beyonce',
 '@Brooke_C_X',
 '@Jbowe_',
 '@louise_munchi',
 '@guy_clifton',
 '@StephanieLee__',
 '@peterbaird5',
 '@GreigSweeney']

In [14]:
tweet = " Hi @azza"
re.findall(user_names_pattern,tweet)

['@azza']
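
Two details of the solution above are worth a closer look: the character class `[A-Za-z0-9\.-_]` contains the range `.-_`, which is why a trailing `:` ends up in one of the extracted names, and `search()` only reports the first mention per tweet. The following sketch compares that pattern with a stricter one and uses `findall()` to collect every mention. The stricter class assumes user names consist of letters, digits and underscores only, and the sample tweet is made up from pieces of the tweets shown above.

```python
import re

# Pattern as used above: '.-_' is a range from '.' to '_' and includes ':'
loose = re.compile(r'@[A-Za-z0-9\.-_]+')

# Stricter alternative (assumption: names contain only letters, digits and '_')
strict = re.compile(r'@[A-Za-z0-9_]+')

# Hypothetical sample tweet with three mentions
sample = 'RT @danny_boy_37: 59 days until @Beyonce!!! Wooo @jfracassini'

loose_mentions = loose.findall(sample)
strict_mentions = strict.findall(sample)

print(loose_mentions)
print(strict_mentions)
```
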

## Exercise

Write a RegEx to search for all hashtags containing the word `good` in them.

In [15]:
# your code here 

good_hashtag = re.compile(r'#[\w]*[Gg][Oo][Oo][Dd][\w]*')

for tweet in tweets:
  found = re.findall(good_hashtag,tweet)
  if found:
    print(found)

['#goodone']
['#goodtime']
['#homeforgood']
['#GoodbyeMrchips']
['#youresogoodforme']
['#goodone']
['#goodolddays']
['#goodtimes']
['#notagoodday']
['#skiingisGOOD']
['#goodbody']
['#naegood']
['#goodbyesavings']
['#goodtimes']
['#toogoodtoyou']
['#suchagoodfriend']
['#gooddriver']
['#needtobeagoodstudent']
['#goodnight']
['#GoodMood']
['#lookssoooogood']
['#shesnotthatgoodlooking']
['#verygood']
['#GoodLuck']
['#notgood']
['#GoodDay']
['#wasteofaperfectlygoodtweet']
['#goodyeh']
['#goodboy']
['#MeetAndGreetForGoodbyeTour']
['#goodbye']
['#goodgirlfriend']
['#goodreviews']
['#goodfeeling']
['#goodtimenolongtime']
['#goodtimes']
['#goodtimes', '#goodwin']
['#goodmemories']
['#good']
['#whyareallthegoodmengay']
['#goodflow']
['#feelinggood']
['#goodfriends']
['#goodone']
['#goodtimes']
['#goodpal']
['#goodnightssleepneeded']
['#SoGood']
['#damnitgood']
['#notgoodenough']
['#itsjustsogood']
['#lifeisgood']
['#notgood']
['#goodluck']
['#NotGood']
['#GoodTimes']
['#goodluck']
['#3goodthings
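
Spelling out every letter in both cases works, but gets tedious for longer words. As a sketch of an alternative, the `re.IGNORECASE` flag should match the same hashtags with a shorter pattern. The sample sentence is made up for illustration.

```python
import re

# Case-insensitive variant of the hashtag pattern
good_hashtag_ci = re.compile(r'#\w*good\w*', re.IGNORECASE)

# Hypothetical sample text
sample = 'Feeling #SoGood today, #goodtimes ahead, #nothinghere'
found = good_hashtag_ci.findall(sample)

print(found)
```
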

## TF-IDF

Let's extract the most important words from Moby Dick

In [16]:
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')

'moby_dick.txt'

In [17]:
import pandas as pd
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8')]
print(documents[1])

Call me Ishmael .


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word',
                                   min_df=0.001,
                                   max_df=0.75,
                                   stop_words='english',
                                   sublinear_tf=True)

X = tfidf_vectorizer.fit_transform(documents)

In [19]:
X.shape

(9768, 1850)

Now, let's get the same information as raw counts:

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english')

X2 = vectorizer.fit_transform(documents)

In [21]:
X.shape, X2.shape

((9768, 1850), (9768, 1850))

In [22]:
df = pd.DataFrame(data={'word': vectorizer.get_feature_names_out(),  # get_feature_names() in scikit-learn versions before 1.0
                        'tf': X2.sum(axis=0).A1, 
                        'idf': tfidf_vectorizer.idf_,
                        'tfidf': X.sum(axis=0).A1
                       })

In [23]:
df

| | word | tf | idf | tfidf |
|---|---|---|---|---|
| 0 | 000 | 20 | 7.478919 | 10.561871 |
| 1 | aboard | 21 | 7.191237 | 9.852074 |
| 2 | absent | 10 | 7.789074 | 3.251602 |
| 3 | according | 26 | 6.928873 | 8.937477 |
| 4 | account | 32 | 6.721233 | 10.686999 |
| ... | ... | ... | ... | ... |
| 1845 | yield | 15 | 7.414381 | 5.941932 |
| 1846 | yojo | 17 | 7.702063 | 4.528857 |
| 1847 | yon | 10 | 7.789074 | 5.455786 |
| 1848 | yonder | 18 | 7.296598 | 9.507531 |


In [24]:
X2.sum(axis=0).A1

array([20, 21, 10, ..., 10, 18, 80], dtype=int64)
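
To see where the `idf` values come from, here is a minimal pure-Python sketch of the smoothed IDF and the sublinear TF scaling as scikit-learn documents them for its default `smooth_idf=True` and for `sublinear_tf=True`. The helper names are hypothetical, and the document frequency of 10 is an assumption that happens to reproduce the `idf` of 7.789074 shown for the rarest words. Note that the `tfidf` column above is a sum over rows that the vectorizer normalizes by default, so it is not simply `tf * idf`.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def sublinear_tf(count):
    """Sublinear term frequency: 1 + ln(tf) for positive counts, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0

# 9768 documents (lines), a term that occurs in 10 of them (assumed)
rare_idf = smooth_idf(9768, 10)
print(round(rare_idf, 6))

# A term that occurs once in a line keeps weight 1; repeated terms grow slowly
print(sublinear_tf(1), sublinear_tf(3))
```
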

In [25]:
df = df.sort_values(['tfidf', 'tf', 'idf'], ascending=False)
df

| | word | tf | idf | tfidf |
|---|---|---|---|---|
| 1782 | whale | 1150 | 3.262357 | 224.212236 |
| 1838 | ye | 467 | 4.257380 | 153.091587 |
| 231 | chapter | 171 | 5.039475 | 148.370596 |
| 972 | man | 525 | 3.982412 | 134.964448 |
| 922 | like | 639 | 3.808543 | 133.426528 |
| ... | ... | ... | ... | ... |
| 554 | fleet | 11 | 7.702063 | 3.049731 |
| 1423 | shortly | 10 | 7.789074 | 3.032615 |
| 1735 | valiant | 10 | 7.789074 | 3.017954 |
| 1602 | surprise | 10 | 7.789074 | 2.934600 |


In [26]:
df = df.sort_values(['tf', 'idf'], ascending=False)
df

| | word | tf | idf | tfidf |
|---|---|---|---|---|
| 1782 | whale | 1150 | 3.262357 | 224.212236 |
| 922 | like | 639 | 3.808543 | 133.426528 |
| 972 | man | 525 | 3.982412 | 134.964448 |
| 21 | ahab | 511 | 4.019453 | 131.484086 |
| 1414 | ship | 509 | 4.006953 | 111.771529 |
| ... | ... | ... | ... | ... |
| 407 | downward | 10 | 7.789074 | 3.111894 |
| 1423 | shortly | 10 | 7.789074 | 3.032615 |
| 1735 | valiant | 10 | 7.789074 | 3.017954 |
| 1602 | surprise | 10 | 7.789074 | 2.934600 |


## Exercise
Extract **only** the bigrams (no unigrams) from Moby Dick and find the top 10 in terms of TF-IDF.

In [27]:
# your code 
tfidf_vectorizer_bigrams = TfidfVectorizer(analyzer='word',
                                   min_df=0.001,
                                   max_df=0.75,
                                   stop_words='english',
                                   sublinear_tf=True,
                                   ngram_range = (2,2)
                                   )
X3 = tfidf_vectorizer_bigrams.fit_transform(documents)
vectorizer_bigrams = CountVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english', ngram_range=(2,2))
X4 = vectorizer_bigrams.fit_transform(documents)

df_bigrams = pd.DataFrame(data={'word': vectorizer_bigrams.get_feature_names_out(),  # get_feature_names() in scikit-learn versions before 1.0
                        'tf': X4.sum(axis=0).A1, 
                        'idf': tfidf_vectorizer_bigrams.idf_,
                        'tfidf': X3.sum(axis=0).A1
                       })
df_bigrams = df_bigrams.sort_values(['tfidf', 'tf', 'idf'], ascending=False)
df_bigrams[:10]

| | word | tf | idf | tfidf |
|---|---|---|---|---|
| 56 | sperm whale | 176 | 5.051171 | 143.425771 |
| 73 | white whale | 106 | 5.581799 | 89.700192 |
| 43 | old man | 81 | 5.78025 | 73.301252 |
| 37 | moby dick | 83 | 5.817522 | 68.840652 |
| 8 | captain ahab | 64 | 6.043835 | 53.91183 |
| 48 | right whale | 55 | 6.161618 | 46.4371 |
| 35 | mast head | 47 | 6.315768 | 41.407495 |
| 36 | mast heads | 36 | 6.576051 | 32.451617 |
| 12 | cried ahab | 33 | 6.660609 | 31.403016 |
| 71 | whale ship | 33 | 6.660609 | 29.261082 |
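
One thing to keep in mind when reading these bigrams: as far as I understand the scikit-learn analyzer, stop words are removed before the n-grams are built, so a bigram can join two words that were separated by a stop word in the text. The following pure-Python sketch imitates that order of operations. The function name, the sentence and the tiny stop word set are all made up for illustration.

```python
def bigrams(tokens, stop_words=frozenset()):
    """Lowercase the tokens, drop stop words, then pair up neighbouring tokens."""
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    return [' '.join(pair) for pair in zip(kept, kept[1:])]

# Hypothetical sentence and stop word set
sentence = 'The old man and the sea'.split()
pairs = bigrams(sentence, stop_words={'the', 'and'})

print(pairs)
```

Here `man sea` is reported as a bigram although the two words are not adjacent in the original sentence.
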


## PMI
Extracting PMI from text is relatively straightforward, and `nltk` offers some functions to do so flexibly.

In [28]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipp

True

In [29]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords
from collections import Counter

stopwords_ = set(stopwords.words('english'))

# Note: the stopword test runs before lowercasing, so capitalized stopwords (e.g. 'The') are kept
words = [word.lower() for document in documents for word in document.split() 
         if len(word) > 2 
         and word not in stopwords_]
         
finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.mi_like  # a PMI variant without the logarithm; bgm.pmi gives plain PMI
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}
#collocations

In [30]:
finder.score_ngrams(score)

[(('moby', 'dick'), 83.0),
 (('sperm', 'whale'), 20.002847184002935),
 (('mrs', 'hussey'), 10.5625),
 (('mast', 'heads'), 4.391152941176471),
 (('sag', 'harbor'), 4.0),
 (('vinegar', 'cruet'), 4.0),
 (('try', 'works'), 3.7944046844502277),
 (('dough', 'boy'), 3.7067873303167422),
 (('white', 'whale'), 3.698807453416149),
 (('caw', 'caw'), 3.4722222222222223),
 (('samuel', 'enderby'), 3.4285714285714284),
 (('cape', 'horn'), 3.4133333333333336),
 (('new', 'bedford'), 3.3402061855670104),
 (('quarter', 'deck'), 3.2339339991315676),
 (('deacon', 'deuteronomy'), 3.2),
 (('father', 'mapple'), 3.0),
 (('gamy', 'jesty'), 3.0),
 (('hoky', 'poky'), 3.0),
 (('jesty', 'joky'), 3.0),
 (('joky', 'hoky'), 3.0),
 (('sporty', 'gamy'), 3.0),
 (('sulk', 'pout'), 3.0),
 (('twos', 'threes'), 3.0),
 (('mast', 'head'), 2.464640949554896),
 (('000', 'lbs'), 2.45),
 (('chief', 'mate'), 2.4075114075114077),
 (('old', 'man'), 2.269660474055093),
 (('straits', 'sunda'), 2.25),
 (('crow', 'nest'), 2.2272727272727

In [31]:
Counter(collocations).most_common(20)

[('moby_dick', 83.0),
 ('sperm_whale', 20.002847184002935),
 ('mrs_hussey', 10.5625),
 ('mast_heads', 4.391152941176471),
 ('sag_harbor', 4.0),
 ('vinegar_cruet', 4.0),
 ('try_works', 3.7944046844502277),
 ('dough_boy', 3.7067873303167422),
 ('white_whale', 3.698807453416149),
 ('caw_caw', 3.4722222222222223),
 ('samuel_enderby', 3.4285714285714284),
 ('cape_horn', 3.4133333333333336),
 ('new_bedford', 3.3402061855670104),
 ('quarter_deck', 3.2339339991315676),
 ('deacon_deuteronomy', 3.2),
 ('father_mapple', 3.0),
 ('gamy_jesty', 3.0),
 ('hoky_poky', 3.0),
 ('jesty_joky', 3.0),
 ('joky_hoky', 3.0)]
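
The scores above are not log values because `mi_like` is a variant of mutual information rather than plain PMI. As a rough sketch of the two formulas as I understand them from the `nltk` documentation (with `mi_like` using its default power of 3), both can be computed from simple counts. The function name and the toy token list are made up for illustration.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {bigram: (pmi, mi_like)} computed from raw counts."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (w1, w2), n_pair in pairs.items():
        # PMI: log2 of the observed pair count over the count expected by chance
        pmi = math.log2(n_pair * total / (unigrams[w1] * unigrams[w2]))
        # mi_like: cubed pair count over the product of the word counts, no log
        mi_like = n_pair ** 3 / (unigrams[w1] * unigrams[w2])
        scores[(w1, w2)] = (pmi, mi_like)
    return scores

# Hypothetical toy corpus
toy_scores = bigram_scores('moby dick saw moby dick'.split())
print(toy_scores[('moby', 'dick')])
```

Under this reading, the score of 83.0 for `moby_dick` is what one would get if both words occur 83 times and always together, since 83³ / (83 · 83) = 83.
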

## Exercise

Extract the top 10 collocations for the Twitter data. You need to preprocess the data first!

In [32]:
stopwords_ = set(stopwords.words('english'))

tweets = [line.strip() for line in open('tweets_en.txt', encoding='utf8')]

def preprocessing(tweet):
  # Remove stopwords, links, mentions, hashtags, emojis, numbers and punctuation.
  # Note: the stopword test runs before lowercasing, so capitalized stopwords are kept.
  tweet = ' '.join([w.lower() for w in tweet.split() if w not in stopwords_])
  # Note: inside [...], `.-_` is a range from '.' to '_' (it also covers digits, ':', '/', '@', ...).
  url_pattern = re.compile(r'https?://[A-Za-z0-9\.-_]*/[A-Za-z0-9\.-_]*')
  tweet = re.sub(url_pattern, '', tweet)
  user_names_pattern = re.compile(r'@[A-Za-z0-9\.-_]+')
  tweet = re.sub(user_names_pattern, '', tweet)
  hashtag_pattern = re.compile(r'#[\w]*')
  tweet = re.sub(hashtag_pattern, '', tweet)
  emojis_pattern = re.compile(pattern = "["
          u"\U0001F600-\U0001F64F"  # emoticons
          u"\U0001F300-\U0001F5FF"  # symbols & pictographs
          u"\U0001F680-\U0001F6FF"  # transport & map symbols
          u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
          u"\u2600-\u26FF\u2700-\u27BF"
                            "]+", flags = re.UNICODE)
  tweet = re.sub(emojis_pattern, '', tweet)

  numbers_patterns = re.compile(r'[0-9]+[\w]*')
  tweet = re.sub(numbers_patterns, '', tweet)

  # Note: `!-_` is a range from '!' to '_', so this class also strips apostrophes, digits and uppercase letters.
  punctuation_pattern = re.compile(r'[!-_@#$%^&*()?<>;\.,:"]')
  tweet = re.sub(punctuation_pattern, '', tweet) 

  return tweet

cleaned_tweets = [preprocessing(tweet) for tweet in tweets]
print(tweets[500:520])
print(cleaned_tweets[500:520])

words = [word.lower() for tweet in cleaned_tweets for word in tweet.split() 
         if len(word) > 2 ]
         
finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.mi_like
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}
#collocations

Counter(collocations).most_common(20)

["I'm at Warrington Central Railway Station (WAC) (Warrington, Cheshire) http://t.co/UNOZyxhVg6", 'I looked liked such a wanker in red shorts today when we were playing in our blue strip #Schoolboyerror', '@SUNSHHHEEEIINNE deaths = ratings. Everyone must have a mick foley attitude', 'About to watch #raw #spoiler alert @LeviKitson http://t.co/PmFuOIBOjR', 'I love justin', 'Thumbs up as, I tweet, from my 8yr old. We eat @ different times on a tues. Cooking it this way means no soggy pastry http://t.co/jtRj2ONA1X', "Don't say I'm better off dead, 'cause heavens full and hell doesn't want me.", 'Wtf my insurance is still gonna be under £700 even with £100 excess haha #winning', '@JacobMcparland where you watching football at?', "People that spell school 'skool' should go back there.", "@mrpeterandre I've never been more devastated in my whole life, won tickets to Gilgamesh and you aren't going to be there #hatemylife", 'Think about what to think.', "Can't wait to start going on the sunbeds

[('temperature_rain', 183.11751137578898),
 ('wind_mph', 155.16358564189488),
 ('cant_wait', 105.13793208133924),
 ('last_night', 79.39003525812679),
 ('slowly_temperature', 75.07782138676582),
 ('looking_forward', 69.01986497537506),
 ('mph_barometer', 51.48125544899739),
 ('happy_birthday', 49.91443818065343),
 ('barometer_hpa', 35.55359336609337),
 ('today_humidity', 34.377616000084835),
 ('rain_today', 26.259736082602583),
 ('falling_slowly', 21.991356382978722),
 ('railway_station', 20.288212986610944),
 ('arctic_monkeys', 20.031772575250837),
 ('rising_slowly', 18.086537050623626),
 ('cant_believe', 17.656357621879035),
 ('fingers_crossed', 17.527572016460905),
 ('jeremy_kyle', 16.329142488716958),
 ('advent_calendar', 15.419772256728779),
 ('paul_walker', 12.849506578947368)]
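
Collocations such as `cant_wait` come from the punctuation step: in a character class, `!-_` is a range that runs from `!` to `_` and therefore covers the apostrophe, the digits and the uppercase letters as well. The sketch below contrasts that range with an explicit class built from `string.punctuation`. Building the class this way is my own suggestion rather than something the solution above does, and the sample strings are made up.

```python
import re
import string

# '!-_' is a range: it covers punctuation, digits and uppercase letters
range_class = re.compile(r'[!-_]')

# Explicit alternative (assumption): exactly the ASCII punctuation characters
explicit_class = re.compile('[' + re.escape(string.punctuation) + ']')

sample = 'Room 101, OK?'
stripped_by_range = range_class.sub('', sample)
stripped_explicitly = explicit_class.sub('', sample)
apostrophe_gone = range_class.sub('', "can't")

print(repr(stripped_by_range), repr(stripped_explicitly), repr(apostrophe_gone))
```
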

In [35]:
# Alternative solution using the emoji module
# https://pypi.org/project/emoji/
! pip install emoji
import emoji
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
import spacy 
import re
from nltk.corpus import stopwords

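nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer exists in spaCy 3; install the model with: python -m spacy download en_core_web_sm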
stop_words = [w.lower() for w in stopwords.words()]
#for textual emoji
emoticon_string = r"""
(?:
[<>]?
[:;=8] #eyes
[\-o\*\']? #optional nose
[\)\]\(\[dDpP/\:\}\{@\|\\] #mouth
|
[\)\]\(\[dDpP/\:\}\{@\|\\] #mouth
[\-o\*\']? #optional nose
[:;=8] #eyes
[<>]?
)"""

#remove graphical emoji
def give_emoji_free_text(text):
  # emoji >= 2.0; older versions used emoji.get_emoji_regexp().sub(r'', text)
  return emoji.replace_emoji(text, replace='')

def sanitize(string):
  """ Sanitize one string """

  #remove graphical emoji
  string = give_emoji_free_text(string)

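  #remove textual emoji
  #note: emoticon_string is written in verbose style but used without re.VERBOSE,
  #so its line breaks and comments are matched literally and this step is effectively a no-op
  string = re.sub(emoticon_string, '', string)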

  #normalize to lowercase
  string = string.lower()

  # spacy tokenizer
  string_split = [token.text for token in nlp(string)]

  #in case the string is empty 
  if not string_split:
    return ''

  #remove user, assuming user has @ in front
  names = re.compile('@[A-Za-z0-9_][A-Za-z0-9_]+')
  string = re.sub(names, '', string)

  #remove 't.co/' links
  string = re.sub(r'http://t.co\/[^\s]+', '', string, flags=re.MULTILINE)

  #remove # symbol and punctuations
  for punc in '"?,/.:!#()""':
    string = string.replace(punc, '')

  #removing stopwords 
  string = ' '.join([w for w in string.split() if w not in stop_words])
  return string

list_sanitized = [sanitize(string) for string in tweets[:300]]
[(tweets[i], list_sanitized[i]) for i in range(0,40)]

[('@cosmetic_candy I think a lot of people just enjoy being a pain in the ass on there',
  'think lot people enjoy pain ass'),
 ('Best get ready sunbed and dinner with nana today :)',
  'best get ready sunbed dinner nana today'),
 ('@hardlyin70 thats awesome!', 'thats awesome'),
 ('Loving this weather', 'loving weather'),
 ('“@danny_boy_37: Just seen an absolute idiot in shorts! Be serious!” Desperado gentleman',
  '“ seen absolute idiot shorts serious” desperado gentleman'),
 ('@SamanthaOrmerod trying to resist a hardcore rave haha! Resisting towns a doddle! Posh dance floor should wear them in quite easy xx',
  'trying resist hardcore rave haha resisting towns doddle posh dance floor wear quite easy xx'),
 ('59 days until @Beyonce!!! Wooo @jfracassini #cannotwait',
  '59 days wooo cannotwait'),
 ('That was the dumbest tweet i ever seen', 'dumbest tweet ever seen'),
 ('Oh what to do on this fine sunny day?', 'oh fine sunny day'),
 ('@Brooke_C_X hows the fish ? Hope they r ok. Xx', 'ho
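
A pattern that is spread over several lines with `#` comments, like the emoticon pattern in the cell above, only works as intended when it is compiled with `re.VERBOSE`. Otherwise the line breaks and comments are treated as literal text to match. The sketch below uses a shortened pattern of my own (not the one from the cell) to show the difference. It also shows why the order of the cleaning steps matters once the flag is set: the emoticon pattern can match the `:/` inside a URL.

```python
import re

# Shortened, illustrative emoticon pattern written in verbose style
EMOTICON = r"""
(?:
  [<>]?
  [:;=8]          # eyes
  [\-o*']?        # optional nose
  [)\](\[dDpP/]   # mouth
)"""

literal = re.compile(EMOTICON)              # whitespace and comments are literal
verbose = re.compile(EMOTICON, re.VERBOSE)  # whitespace and comments are ignored

text = 'great day :-) see you'
literal_hit = literal.search(text)
verbose_hit = verbose.search(text)

# Side effect to be aware of: ':/' inside a link also looks like an emoticon
url_hit = verbose.search('http://t.co/x')

print(literal_hit, verbose_hit.group(), url_hit.group())
```

If the flag is added to the cleaning function, removing links before emoticons is one way to avoid damaging the URLs.
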

In [44]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords
from collections import Counter

words = [word for document in list_sanitized for word in document.split() if len(word) > 2]

finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.mi_like
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}
#collocations
Counter(collocations).most_common(20)

[('fingers_crossed', 2.0),
 ("''did_tattoos", 1.0),
 ("''you're_tall''", 1.0),
 ("'blow_go'", 1.0),
 ("'out_order'", 1.0),
 ("'piss_off'", 1.0),
 ("'side_effects'", 1.0),
 ("'when_dolmio", 1.0),
 ("'you_wot", 1.0),
 ('00mm_hum', 1.0),
 ('100231mb_temp', 1.0),
 ('13th_thurs', 1.0),
 ('14th_fri', 1.0),
 ('150mb_data', 1.0),
 ('153_miles', 1.0),
 ('25mph_baro', 1.0),
 ("2pprizesplease_how's", 1.0),
 ('5thhashtag_cmyk', 1.0),
 ('60d_lenses', 1.0),
 ("6ft_''did", 1.0)]