# Regular Expressions for Pattern Matching
## Regex metacharacters
- `\d` - digit
- `\s` - whitespace
- `\w` - word character (letter, digit, or underscore)
- `{3,10}` - preceding element repeats between 3 and 10 times
- `\D` - non-digit
- `\W` - non-word character
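
A minimal sketch of these classes using Python's `re` module. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text, not taken from the dataset
text = "Order 66 shipped to unit_7B, ETA 3 days!"

digits = re.findall(r"\d", text)    # every single digit
words = re.findall(r"\w+", text)    # runs of letters, digits and underscores
spaces = re.findall(r"\s", text)    # every whitespace character
non_word = re.findall(r"\W", text)  # spaces and punctuation

print(digits)
print(words)
print(len(spaces), len(non_word))
```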

## Regex Quantifiers
- `{8}` - repeated exactly 8 times
- `{2,8}` - repeated 2 to 8 times
- `{3,}` - repeated at least 3 times
- `+` - preceding element appears one or more times
- `*` - preceding element appears zero or more times
- `?` - preceding element appears zero or one time
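
A short sketch of how each quantifier behaves, using `re.fullmatch` so the whole string has to fit the pattern. The sample strings are illustrative only.

```python
import re

# Hypothetical samples: zero to three "a" characters followed by "b"
samples = ["b", "ab", "aab", "aaab"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

exactly_two = matching(r"a{2}b")
one_to_two = matching(r"a{1,2}b")
at_least_two = matching(r"a{2,}b")
one_or_more = matching(r"a+b")
zero_or_more = matching(r"a*b")
zero_or_one = matching(r"a?b")

print(exactly_two, one_to_two, at_least_two)
print(one_or_more, zero_or_more, zero_or_one)
```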

## Regex Special Chars
- `^` - start of string
- `$` - end of string
- `.` - any char
- `\` - escape
- `|` - or
- `[]` - set of chars
- `[^...]` - not (negation, only when `^` is the first character inside a set)
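
The special characters can be tried out on one small string. This is a sketch with a made-up sample line.

```python
import re

# Hypothetical sample line
line = "cat, cot, cut and c.t"

starts_with_cat = bool(re.search(r"^cat", line))  # anchored at the start
ends_with_cat = bool(re.search(r"cat$", line))    # anchored at the end
any_char = re.findall(r"c.t", line)               # "." matches any character
literal_dot = re.findall(r"c\.t", line)           # "\." matches only a dot
either = re.findall(r"cat|cut", line)             # alternation
in_set = re.findall(r"c[ao]t", line)              # one character from the set
not_in_set = re.findall(r"c[^ao]t", line)         # one character NOT in the set

print(any_char, literal_dot, either, in_set, not_in_set)
```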

## Greedy vs. non-greedy matching
- Greedy (default, e.g. `.+`): match as many characters as possible
- Non-greedy or lazy (add `?` after the quantifier, e.g. `.+?`): match as few characters as needed
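
A small sketch of the difference, assuming a made-up HTML-like string:

```python
import re

# Hypothetical sample string
html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" runs to the last ">" in the string
greedy = re.findall(r"<.+>", html)

# Lazy: ".+?" stops at the first ">" it can reach
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```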

## Alternation and non-capturing groups
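- `(?:regex)` - non-capturing group: groups a sub-pattern (for example an alternation with `|`) without saving what it matched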
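
A sketch of how capturing and non-capturing groups change what `re.findall` returns. The sentence is a made-up example.

```python
import re

# Hypothetical sample sentence
text = "I love cats, she loves dogs"

# Two capturing groups: findall returns one tuple per match
both_captured = re.findall(r"(love|loves) (cats|dogs)", text)

# First group non-capturing: only the second group is returned
animal_only = re.findall(r"(?:love|loves) (cats|dogs)", text)

# No capturing groups at all: the whole match is returned
whole_match = re.findall(r"(?:love|loves) (?:cats|dogs)", text)

print(both_captured)
print(animal_only)
print(whole_match)
```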

## Backreferences
- `(?P<name>regex)` - named group
- `\1`, `\2` - refer back to the text matched by the 1st, 2nd numbered group
- `(?P=name)` - refers back to the text matched by the named group
- `\g<name>` - refers to the named group inside a replacement string
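
A sketch that uses backreferences to find and collapse doubled words. The sentence is invented for illustration.

```python
import re

# Hypothetical sentence with two doubled words
text = "this is is a test test sentence"

# Numbered backreference: \1 must repeat what group 1 matched
doubled = re.findall(r"\b(\w+) \1\b", text)

# Named group plus named backreference
first = re.search(r"\b(?P<word>\w+) (?P=word)\b", text).group("word")

# \g<word> refers to the named group in the replacement string
collapsed = re.sub(r"\b(?P<word>\w+) (?P=word)\b", r"\g<word>", text)

print(doubled)
print(first)
print(collapsed)
```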

## Lookaround
### look-ahead
- `(?=regex)` - positive
- `(?!regex)` - negative
### look-behind
- `(?<=regex)` - positive
- `(?<!regex)` - negative
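
A sketch of the four lookaround forms on a made-up string. Lookarounds check the surrounding text without including it in the match; the `\b` anchors keep the engine from settling for part of a number.

```python
import re

# Hypothetical sample string
text = "price: $42, discount: 10%, total: $38"

followed_by_pct = re.findall(r"\d+(?=%)", text)            # look-ahead, positive
not_followed_by_pct = re.findall(r"\b\d+\b(?!%)", text)    # look-ahead, negative
after_dollar = re.findall(r"(?<=\$)\d+", text)             # look-behind, positive
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)       # look-behind, negative

print(followed_by_pct, not_followed_by_pct)
print(after_dollar, not_after_dollar)
```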

# Tweet analysis

1. Identify robot mentions  

In [113]:
import pandas as pd
import re

df = pd.read_csv('datasets/short_tweets.csv')  # forward slash avoids the invalid "\s" escape
print(df.shape)
df.head(3)

(19948, 6)




Unnamed: 0,target,id,date,flag,user,text
0,0,1467821085,Mon Apr 06 22:22:26 PDT 2009,NO_QUERY,crzy_cdn_bulas,our duck and chicken are taking wayyy too long...
1,0,1467821338,Mon Apr 06 22:22:30 PDT 2009,NO_QUERY,justnetgirl,Put vacation photos online (They were so cute)...
2,0,1467821455,Mon Apr 06 22:22:32 PDT 2009,NO_QUERY,CiaraRenee,I need a hug


In [114]:
regex = r"@robot\d\W"

print("mentions:")
print([tweet for tweet in df['text'].str.findall(regex) if len(tweet)>0])
print("tweets:")
display(df[df['text'].str.match(regex) == True])  # str.match only tests the start of each tweet


mentions:
[['@robot9!', '@robot4&', '@robot9$', '@robot7%']]
tweets:


Unnamed: 0,target,id,date,flag,user,text
105,0,1467852067,Mon Apr 06 22:30:34 PDT 2009,NO_QUERY,kirstenj0y,@robot9! @robot4& I have a good feeling that t...
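
The cells in this section list mentions with `str.findall`, which scans the whole text, but filter rows with `str.match`, which only tests the start of each string (the same behaviour as `re.match`). A minimal sketch of the difference using the standard `re` module on made-up tweets:

```python
import re

regex = r"@robot\d\W"

# Hypothetical tweets, not taken from the dataset
tweets = [
    "@robot9! hello there",
    "hello @robot4& again",
    "no mention here",
]

# re.match: the pattern must match at the beginning of the string
starts_with_mention = [t for t in tweets if re.match(regex, t)]

# re.search: the pattern may match anywhere in the string
contains_mention = [t for t in tweets if re.search(regex, t)]

print(starts_with_mention)
print(contains_mention)
```

On a pandas Series, `str.contains(regex)` is the counterpart of `re.search` if tweets with a mention in the middle should be kept too.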


2. Find user mentions

In [115]:
regex = r"@\w+"

print("mentions:")
print([tweet for tweet in df['text'].str.findall(regex) if len(tweet)>0])
print("tweets:")
display(df[df['text'].str.match(regex)== True])


mentions:
[['@andywana'], ['@oanhLove'], ['@BatManYNG'], ['@Starrbby'], ['@katortiz'], ['@Lt_Algonquin'], ['@jdarter'], ['@ninjen'], ['@ashleyac'], ['@statravelAU'], ['@markhardy1974'], ['@msdrama'], ['@januarycrimson'], ['@Hollywoodheat'], ['@makeherfamous'], ['@stark'], ['@mangaaa'], ['@kpreyes'], ['@paradisej'], ['@Henkuyinepu'], ['@marykatherine_q'], ['@jacobsummers'], ['@Alliana07'], ['@salancaster'], ['@mercedesashley'], ['@HibaNick'], ['@eRRe_sC'], ['@allyheman'], ['@grum'], ['@thecoolestout'], ['@chelserlynn'], ['@Knights_'], ['@BridgetsBeaches'], ['@JonathanRKnight'], ['@ozesteph1992'], ['@mrsaintnick'], ['@twista202'], ['@rumblepurr'], ['@onemoreproject'], ['@jonathanchard'], ['@robot9', '@robot4', '@robot9', '@robot7'], ['@RyanSeacrest'], ['@pinkserendipity'], ['@marieclr'], ['@naughtyhaughty'], ['@penndbad'], ['@machineplay'], ['@ColinDeMar'], ['@dannyvegasbaby'], ['@supersport'], ['@robluketic'], ['@HillyDoP'], ['@goodlaura'], ['@JonathanRKnight'], ['@stustone'], ['@DjAliz

Unnamed: 0,target,id,date,flag,user,text
3,0,1467821715,Mon Apr 06 22:22:37 PDT 2009,NO_QUERY,deelau,"@andywana Not sure what they are, only that th..."
4,0,1467822384,Mon Apr 06 22:22:47 PDT 2009,NO_QUERY,Lindsey0920,@oanhLove I hate when that happens...
8,0,1467822687,Mon Apr 06 22:22:52 PDT 2009,NO_QUERY,xVivaLaJuicyx,"@BatManYNG I miss my ps3, it's out of commissi..."
13,0,1467824199,Mon Apr 06 22:23:15 PDT 2009,NO_QUERY,adri_mane,@Starrbby too bad I won't be around I lost my ...
16,0,1467825003,Mon Apr 06 22:23:28 PDT 2009,NO_QUERY,leslierosales,@katortiz Not forever... See you soon!
...,...,...,...,...,...,...
19930,0,1556973243,Sun Apr 19 01:18:32 PDT 2009,NO_QUERY,RachaelPhillips,@CaiGriffiths poor thing. When they're out ru...
19937,0,1556974294,Sun Apr 19 01:18:54 PDT 2009,NO_QUERY,CaptiveCulture,@BindMe but that's true you have to block late...
19938,0,1556974470,Sun Apr 19 01:18:58 PDT 2009,NO_QUERY,paperclipface,@TyPie I think I am going to seek out more thi...
19939,0,1556974516,Sun Apr 19 01:18:59 PDT 2009,NO_QUERY,SirCrumpet,@Rogerthatv2 Not looking hopeful


3. Find links

In [116]:
regex = r"https?.+?\s"

print("mentions:")
print([tweet for tweet in df['text'].str.findall(regex) if len(tweet)>0])
print("tweets:")
display(df[df['text'].str.match(regex)== True])


mentions:
[['http://twitpic.com/2y2wr '], ['http://twitpic.com/2y2yi '], ['https://www.mycomicshop.com/search?TID=395031 '], ['http://twitpic.com/2y34e '], ['http://tinyurl.com/dc2htx '], ['http://twitpic.com/2y36e '], ['http://is.gd/r8Zf, ', 'http://is.gd/r8Zy, ', 'http://is.gd/r8ZG '], ['http://twitpic.com/2y3cf '], ['https://radio.foxnews.com '], ['http://twitpic.com/2y1pe '], ['http://sp2.ro/5b7bdb '], ['http://www.krispykreme.com.my/ '], ['http://tinyurl.com/dmukpr), '], ['http://www.rightpundits.com/?p=3669 '], ['http://community.livejournal.com/ohnotheydidnt/33907252.html '], ['http://twitpic.com/2y4vn '], ['http://bit.ly/kHBN '], ['http://is.gd/r9vr '], ['http://tinyurl.com/djjc46 '], ['http://twitpic.com/2y5s9 '], ['http://twitpic.com/2y65i '], ['http://twitpic.com/2xszg '], ['http://tinyurl.com/djh4pr '], ['http://bit.ly/4dVYg3 '], ['http://twitpic.com/2y606 '], ['http://twitpic.com/2y6z6 '], ['http://twurl.nl/iyar6d '], ['http://tinyurl.com/c5mja5 '], ['http://twitpic.com/2y

Unnamed: 0,target,id,date,flag,user,text
258,0,1467892720,Mon Apr 06 22:41:20 PDT 2009,NO_QUERY,sarawang,http://twitpic.com/2y2wr - according to my bro...
303,0,1467900244,Mon Apr 06 22:43:26 PDT 2009,NO_QUERY,Mowgli3,"http://twitpic.com/2y2yi - I love you, Buck."
382,0,1467922983,Mon Apr 06 22:49:51 PDT 2009,NO_QUERY,cinnayum,http://twitpic.com/2y34e - I wanna wear my Doc...
422,0,1467931396,Mon Apr 06 22:52:11 PDT 2009,NO_QUERY,utehbaik,http://twitpic.com/2y36e - cant see the flower...
465,0,1467944552,Mon Apr 06 22:56:00 PDT 2009,NO_QUERY,tjslater,"http://is.gd/r8Zf, http://is.gd/r8Zy, and ht..."
...,...,...,...,...,...,...
19221,0,1556810633,Sun Apr 19 00:27:33 PDT 2009,NO_QUERY,missC1977,http://twitpic.com/3l2zm - another rainy day i...
19308,0,1556832860,Sun Apr 19 00:34:07 PDT 2009,NO_QUERY,itscatbaby,http://twitpic.com/3l37i - @dangerradio i have...
19506,0,1556879397,Sun Apr 19 00:48:19 PDT 2009,NO_QUERY,marykate_L,http://twitpic.com/3l3m2 - My Bruised Arm
19596,0,1556903436,Sun Apr 19 00:55:57 PDT 2009,NO_QUERY,ReallyCookin,http://twitpic.com/3l3uq - for the record... t...
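
The pattern `https?.+?\s` keeps the trailing whitespace in each match (visible in the output above) and cannot match a link at the very end of a tweet, because it needs a whitespace character to stop on. A possible alternative, shown as a sketch on a made-up sentence that reuses two URLs from the output:

```python
import re

# Hypothetical sentence; the second link sits at the end of the string
text = "pics at http://twitpic.com/2y2wr and https://radio.foxnews.com"

# Original pattern: lazy match up to and including the next whitespace
with_space = re.findall(r"https?.+?\s", text)

# Alternative: scheme followed by a run of non-whitespace characters
without_space = re.findall(r"https?://\S+", text)

print(with_space)
print(without_space)
```

The alternative still captures punctuation glued to a link, such as a trailing comma, so some clean-up may be needed afterwards.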


In [117]:
regex = r"\S+\.com\S+"

print("mentions:")
print([tweet for tweet in df['text'].str.findall(regex) if len(tweet)>0])
print("tweets:")
display(df[df['text'].str.match(regex)== True])


mentions:
[['http://twitpic.com/2y2es'], ['http://apps.facebook.com/dogbook/profile/view/5248435'], ['http://apps.facebook.com/dogbook/profile/view/6176014'], ['http://twitpic.com/2y2wr'], ['http://tinyurl.com/cw2l9t'], ['http://tinyurl.com/ceprvs'], ['http://twitpic.com/2y2yi'], ['https://www.mycomicshop.com/search?TID=395031'], ['http://tinyurl.com/cec5ka'], ['http://twitpic.com/2y34e'], ['http://tinyurl.com/dc2htx'], ['http://twitpic.com/2y36e'], ['http://tinyurl.com/c4ooho'], ['http://twitpic.com/2y3cf'], ['http://tinyurl.com/c8bvqh'], ['http://tinyurl.com/cxe8w7'], ['http://fanclub.backstreetboys.com/chat.php'], ['http://twitpic.com/2y1pe'], ['http://twitpic.com/2y3jp'], ['http://twitpic.com/2y3ty'], ['http://twitpic.com/2y3y0'], ['http://plurk.com/p/mzxbg'], ['http://plurk.com/p/mzxcs'], ['http://www.tv.com/story/13720.html?ref_story_id=13720&amp;ref_type=1101&amp;ref_name=story'], ['http://plurk.com/p/mzygb'], ['http://tinyurl.com/cexkqy'], ['http://apps.facebook.com/dogbook/pro

Unnamed: 0,target,id,date,flag,user,text
258,0,1467892720,Mon Apr 06 22:41:20 PDT 2009,NO_QUERY,sarawang,http://twitpic.com/2y2wr - according to my bro...
303,0,1467900244,Mon Apr 06 22:43:26 PDT 2009,NO_QUERY,Mowgli3,"http://twitpic.com/2y2yi - I love you, Buck."
382,0,1467922983,Mon Apr 06 22:49:51 PDT 2009,NO_QUERY,cinnayum,http://twitpic.com/2y34e - I wanna wear my Doc...
422,0,1467931396,Mon Apr 06 22:52:11 PDT 2009,NO_QUERY,utehbaik,http://twitpic.com/2y36e - cant see the flower...
512,0,1467951931,Mon Apr 06 22:58:05 PDT 2009,NO_QUERY,mitrepeak,http://twitpic.com/2y3cf - Filled with curry ...
...,...,...,...,...,...,...
19221,0,1556810633,Sun Apr 19 00:27:33 PDT 2009,NO_QUERY,missC1977,http://twitpic.com/3l2zm - another rainy day i...
19308,0,1556832860,Sun Apr 19 00:34:07 PDT 2009,NO_QUERY,itscatbaby,http://twitpic.com/3l37i - @dangerradio i have...
19506,0,1556879397,Sun Apr 19 00:48:19 PDT 2009,NO_QUERY,marykate_L,http://twitpic.com/3l3m2 - My Bruised Arm
19596,0,1556903436,Sun Apr 19 00:55:57 PDT 2009,NO_QUERY,ReallyCookin,http://twitpic.com/3l3uq - for the record... t...


4. Remove hashtags

In [118]:
regex = r"#\S+"

print("tweets:")
display(df['text'][df['text'].str.match(regex)== True].head())
display(df['text'].str.replace(regex, '', regex=True)[df['text'].str.match(regex)== True].head())

tweets:


348     #3 woke up and was having an accident - &quot;...
979     #travian Total cost of the atk for the aggress...
2030    #mhbigcatch 8oz Golem  But finally got a Wight...
2404          #php gives me a segfault with a preg_split 
2964    #heyxboxlive Probably shouldn't mention any sh...
Name: text, dtype: object

348      woke up and was having an accident - &quot;It...
979      Total cost of the atk for the aggressor: 273,...
2030             8oz Golem  But finally got a Wight - 3oz
2404               gives me a segfault with a preg_split 
2964     Probably shouldn't mention any show with Drew...
Name: text, dtype: object
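
Removing a hashtag leaves the surrounding whitespace behind, as the leading spaces in the output above show. A small sketch of one way to tidy that up with `re.sub`; the helper name `remove_hashtags` and the second sample are assumptions for illustration.

```python
import re

def remove_hashtags(text):
    """Drop hashtags, then collapse leftover runs of whitespace."""
    without_tags = re.sub(r"#\S+", "", text)
    return re.sub(r"\s+", " ", without_tags).strip()

# First sample is tweet 2404 from the output above, second one is made up
start_tag = remove_hashtags("#php gives me a segfault with a preg_split ")
middle_tag = remove_hashtags("loving #python today")

print(start_tag)
print(middle_tag)
```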

5. Find e-mails

In [123]:
regex = r"([A-Za-z0-9]+)@\S+\.com\S+"  # dot escaped; still needs at least one character after ".com"

print("tweets:")
display(df['text'][df['text'].str.match(regex)== True].head())
display(df['text'].str.replace(regex, '', regex=True)[df['text'].str.match(regex)== True].head())

tweets:


Series([], Name: text, dtype: object)

Series([], Name: text, dtype: object)
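
Both results are empty: `str.match` only tests the start of each tweet, and the pattern requires at least one more character after `.com`. A looser pattern searched anywhere in the text may work better. The sketch below is a simplified e-mail pattern, not a full RFC-compliant one, and the sample text is made up.

```python
import re

# Simplified pattern (an assumption, not the notebook's regex):
# local part, "@", domain labels, then a dot and a TLD of 2+ letters
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical text
text = "write to jane.doe@example.com or admin@mail.example.org, thanks"

found = EMAIL.findall(text)

# A plain user mention has nothing before the "@", so it is not matched
mention_only = EMAIL.findall("@andywana not sure what they are")

print(found)
print(mention_only)
```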