Regular Expressions
-------------------

Regular expressions (regexes or re's) constitute an extremely powerful, flexible, and concise language for matching elements in text, ranging from a few characters to complex patterns. While mastering the syntax of the regular expression language does require climbing a learning curve, this curve is not particularly steep, and a newcomer can find herself performing useful tasks with regular expressions almost immediately. Effort spent learning regular expressions quickly pays off: tasks that are well suited for regular expressions abound. Indeed, fluency with regular expressions is one of the most useful computer skills, and a critical tool for data scientists.

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement. We will present examples using grep, which we covered previously. (In case you forgot, we used grep to find lines of a text file with a given string in them.) 

### `grep`:

A utility for pattern matching. grep is one of the most useful Unix utilities. grep is typically called like this:

`grep [options] [pattern] [files]`. 

With no options specified, this simply looks for the specified pattern in the given files, printing to the console only those lines that match the given pattern. 

Consider the file sample.txt (which we have downloaded previously):

In [11]:
!curl -L 'https://dl.dropboxusercontent.com/u/16006464/IPDS/sample.txt' -o sample.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   201  100   201    0     0   1148      0 --:--:-- --:--:-- --:--:--  1155


In [12]:
!cat sample.txt

123	1346699925	11122	foo bar
222	1346699955	11145	biz baz
140	1346710000	11122	hee haw
234	1346700000	11135	bip bop
146	1346699999	11123	foo bar
99	1346750000	11135	bip bop
99	1346750000	11135	bip bop


If we are trying to find a line that contains the string 'biz baz' in the sample.txt file, we issue the command:

In [13]:
!grep 'biz baz' sample.txt

222	1346699955	11145	biz baz


If we search for the lines containing the string 'foo bar' we type:

In [14]:
!grep 'foo bar' sample.txt

123	1346699925	11122	foo bar
146	1346699999	11123	foo bar
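
The two searches above can be approximated in a few lines of Python, which may help make precise what grep does by default. This is only a sketch: `grep_lines` is a hypothetical helper written for this illustration (not part of any library), the rows are a subset of sample.txt typed in by hand, and Python's `re` syntax is close to, but not identical to, grep's.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain a match for pattern (like a bare `grep pattern file`)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# A few tab-separated rows copied by hand from sample.txt
sample = [
    "123\t1346699925\t11122\tfoo bar",
    "222\t1346699955\t11145\tbiz baz",
    "140\t1346710000\t11122\thee haw",
    "146\t1346699999\t11123\tfoo bar",
]

for line in grep_lines("foo bar", sample):
    print(line)
```

Like grep, the helper reports a line as soon as the pattern matches anywhere inside it; the pattern does not have to cover the whole line.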


If you want grep to highlight the matched text in color, you can pass the parameter "--color=always".

In [15]:
!grep --color=always 'foo bar' sample.txt

123	1346699925	11122	[01;31m[Kfoo bar[m[K
146	1346699999	11123	[01;31m[Kfoo bar[m[K


If you also use the command `less` together with grep, you will want to pass the parameter -R to less, to allow less to display the colors:

`grep --color=always 'foo bar' sample.txt | less -R`

### NYC Restaurant Names Data

To have a longer data set to play with, let's download the list of restaurant names from the NYC Restaurant Inspection Dataset. (I have already extracted the names from the file, removed duplicates, and sorted them, to save us time. As an exercise, you may want to take the original 100 MB dataset, and then use the UNIX commands that we described previously to generate the file.)

In [17]:
!curl -L 'https://dl.dropboxusercontent.com/u/16006464/DwD_Winter2015/uniquenames.txt' -o /home/ubuntu/data/uniquenames.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  364k  100  364k    0     0  1403k      0 --:--:-- --:--:-- --:--:-- 1406k


Let's take a peek at the contents using the `head` and `tail` commands:

In [18]:
#change the working directory for iPython
%cd /home/ubuntu/data 

/home/ubuntu/data


In [19]:
!head -10 uniquenames.txt
!echo '........' # The "echo" command just prints in the output the string that follows the command (in this case "......")
!tail -10 uniquenames.txt

#1 GARDEN CHINESE
#1 ME. NICK'S
#1 SABOR LATINO RESTAURANT
$1.25 PIZZA
''U'' LIKE CHINESE RESTAURANT
''W'' CAFE
'WICHCRAFT
(LEWIS DRUG STORE) LOCANDA VINI E OLII
(LIBRARY)  FOUR & TWENTY BLACKBIRDS
(PUBLIC FARE) 81ST STREET AND CENTRAL PARK WEST (DELACORTE THEATRE)
........
ZUCKER'S BAGELS AND SMOKED FISH
ZUM SCHNEIDER
ZUM STAMMTISCH
ZUMA JAPANESE RESTAURANT NEW YORK
ZUMBA RESTAURANT
ZUTTO
ZUZU RAMEN
ZYMI BAR & GRILL
ZZ CLAM BAR
ZZ'S PIZZA & GRILL


Now, let's see if there are any restaurants with the string 'PANO' in them:

In [26]:
!grep --color=always 'PANO' uniquenames.txt

BUFFALO WILD WINGS,PEETS COOFEE &TEA, [01;31m[KPANO[m[KPOLIS BAKERY & CAFE
CAFE ES[01;31m[KPANO[m[KL
EL CHARRO ES[01;31m[KPANO[m[KL
EL POTE ES[01;31m[KPANO[m[KL
LA CANDELA ES[01;31m[KPANO[m[KLA
PAM[01;31m[KPANO[m[K
[01;31m[KPANO[m[KRAMA OF MY SILENCE-HEART
[01;31m[KPANO[m[KRAMA RESTAURANT
TIGIN IRISH PUB,PEETS COFFEE&TEA,[01;31m[KPANO[m[KPOLIS BAKERY&CAFE


What can we do if we want to search for something more complex than a fixed string? Regular expressions solve exactly this problem.

### The atoms

The simplest regular expressions are a sequence of `atoms`. An atom can be any of the following:
* a single character,
* a dot,
* a bracket expression, 
* an anchor.

#### Single character atom

A single character atom matches itself.

#### The `.` character atom

A dot atom matches any single character (except for a new line character `\n`).

Example: Using single character atoms, and the `.` atom, let's find all restaurant names that contain the characters `AB`, followed by any character (`.`) and then the character `D`:

In [27]:
!grep --color=always 'AB.D' uniquenames.txt 

[01;31m[KABID[m[KE BROOKLYN PITA
JJ PE[01;31m[KABOD[m[KY'S
L[01;31m[KABAD[m[KEE MANOIR
NEW KAB[01;31m[KAB D[m[KINER
RESTAURANT [01;31m[KABID[m[KJAN


In [33]:
! grep --color=always 'A...Z' uniquenames.txt 

99 CENTS MEG[01;31m[KA PIZ[m[KZA
ALB[01;31m[KA PIZ[m[KZA
ALITALI[01;31m[KA PIZ[m[KZA RESTAURANT
[01;31m[KALMAZ[m[K RESTAURANT
[01;31m[KANDAZ[m[K
[01;31m[KANDAZ[m[K FIFTH AVENUE
ANGELIC[01;31m[KA PIZ[m[KZERIA
ANNA MARI[01;31m[KA PIZ[m[KZA PASTA
ANTIK[01;31m[KA PIZ[m[KZERIA
ANTILLAN[01;31m[KA PIZ[m[KZERIA
[01;31m[KAPIZZ[m[K RESTAURANT
[01;31m[KARBUZ[m[K
ASI[01;31m[KA BAZ[m[KARR
ASTORI[01;31m[KA PIZ[m[KZA
ASTORI[01;31m[KA PIZ[m[KZA FACTORY
BELLA DONN[01;31m[KA PIZ[m[KZERIA
BELLA DONN[01;31m[KA PIZ[m[KZZA
BELL[01;31m[KA PIZ[m[KZA
BELL[01;31m[KA PIZ[m[KZA OF BRONX
BELLA ROM[01;31m[KA PIZ[m[KZA
BELL[01;31m[KA ROZ[m[KA
BELLA SER[01;31m[KA PIZ[m[KZA 
BELLA VIT[01;31m[KA PIZ[m[KZERIA
BELMOR[01;31m[KA PIZ[m[KZA & RESTAURANT
BON[01;31m[KA PIZ[m[KZA
BRONX LAL[01;31m[KA PIZ[m[KZA
BUON[01;31m[KA PIZ[m[KZA RESTAURANT
BUONISSIM[01;31m[KA PIZ[m[KZERIA
CAFE BAR N[0

#### Bracket expression atom

A bracket expression (defined by square brackets []) defines a set of characters. It matches exactly one character, which can be any of the characters in the set. Example: [ABL] matches either A, B, or L.

Now, let's use a bracket expression: We want to find restaurants that contain one of the letters A, B, C, X, Y, Z followed by two digits and a space. We specify the set of letters as `[ABCXYZ]` and the set of digits as `[0123456789]`.

In [39]:
!grep --color=always '[ABCXYZ][0123456789][0123456789] ' uniquenames.txt 

[01;31m[KB66 [m[KCLUB
YOGURT [01;31m[KY23 [m[KINC


##### Brackets and ranges

Instead of typing long lists of characters in a bracket expression, we can use the range character `-`: [0-9] is equivalent to [0123456789]. Similarly [A-Z] is equivalent to [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. And [D-T] is equivalent to [DEFGHIJKLMNOPQRST]. (You get the idea.) You can also combine multiple ranges: [a-e1-9] is equivalent to [abcde123456789]. Finally, you can specify characters to be excluded from the set by placing the character `^` right after the opening bracket. For example, [^0-9] matches any character other than a digit.

For example, let's find restaurants that contain a letter, followed by a digit, and then followed by a character that is not a digit:

In [42]:
!grep --color=always '[A-Z][0-9][^0-9]' uniquenames.txt 

[01;31m[KA1 [m[KOCHA SUSHI
A[01;31m[KH2 [m[KICE TEA
[01;31m[KB4 [m[KNYC
B[01;31m[KT3 [m[KBAR
B[01;31m[KT4 [m[KBAR
[01;31m[KC2 [m[KCAFE
CAF[01;31m[KE1 [m[K& CAFE 4 (AMERICAN MUSEUM OF NATURAL HISTORY)
[01;31m[KF1 [m[KLOUNGE AND GRILL
ILLY/VELOCITY BAR (E[01;31m[KC2)[m[K
[01;31m[KJ4 [m[KHOOKAH LOUNGE
JUIC[01;31m[KE4U[m[K
[01;31m[KM1-[m[K5
[01;31m[KM2M[m[K MART
[01;31m[KM2N[m[K BUFFET
NINET[01;31m[KY9 [m[K& UP DINER
N[01;31m[KO1 [m[KCHINESE RESTAURANT
[01;31m[KQ2 [m[KTHAI RESTAURANT
[01;31m[KT2 [m[K- GO
TERMINA[01;31m[KL1 [m[KEMPLOYEE CAFETERIA
THE NEW YORK PALACE HOTEL ([01;31m[KC1 [m[KLEVEL CAFETERIA)
TW[01;31m[KO8T[m[KWO BAR & BURGER
US FRIED CHICKEN & [01;31m[KP1Z[m[KZA


Hm, we do not want to get results that have a space after the number, so let's also exclude the space character:

In [44]:
!grep --color=always '[A-Z][0-9][^0-9 ]' uniquenames.txt 

ILLY/VELOCITY BAR (E[01;31m[KC2)[m[K
JUIC[01;31m[KE4U[m[K
[01;31m[KM1-[m[K5
[01;31m[KM2M[m[K MART
[01;31m[KM2N[m[K BUFFET
TW[01;31m[KO8T[m[KWO BAR & BURGER
US FRIED CHICKEN & [01;31m[KP1Z[m[KZA


In [45]:
!grep --color=always '[0-9][^A-Z0-9][0-9]' uniquenames.txt 

$[01;31m[K1.2[m[K5 PIZZA
[01;31m[K1 2[m[K 3 BURGER SHOT BEER
[01;31m[K1.5[m[K GALBI CORP
10[01;31m[K4-0[m[K1 FOSTER AVENUE COFFEE SHOP(UPS)
3[01;31m[K6-0[m[K2 DITMARS COFFEE CORP.
4[01;31m[K0/4[m[K0 CLUB
4[01;31m[K0/4[m[K0 CLUB BAR
4[01;31m[K4 1[m[K/2 CAFE
51[01;31m[K0 1[m[K1ST BAR
8[01;31m[K3 1[m[K/2
BRASSERIE [01;31m[K8 1[m[K/2
CAFE 10[01;31m[K1 1[m[K6TH FLOOR CAFETERIA
FOOD DEPOT 1[01;31m[K2*4[m[K
HOT DOG CONCESSION A80[01;31m[K3-1[m[K
LADY M CONFECTIONS (PLAZA HOTEL 77[01;31m[K0 5[m[KTH AVENUE)
M[01;31m[K1-5[m[K
PRB 2[01;31m[K4-7[m[K
THE BEST $[01;31m[K1.0[m[K0 PIZZA


#### Anchor

Anchor atoms are used to define the location of a regex within a line. 

The anchor `^` specifies the *beginning of a line*, the anchor `$` specifies the end of a line. The anchor `\<` specifies the beginning of a word, and the anchor `\>` specifies the end of a word.

Example: Find restaurant names that start with the characters `BAL`

In [46]:
!grep --color=always '^BAL' uniquenames.txt

[01;31m[KBAL[m[KABOOSTA
[01;31m[KBAL[m[KADE
[01;31m[KBAL[m[KBOA RESTAURANT.
[01;31m[KBAL[m[KCON QUITENO RESTAURANT
[01;31m[KBAL[m[KDOR SPECIALTY FOODS
[01;31m[KBAL[m[KDUCCI'S
[01;31m[KBAL[m[KI NUSA INDONESIAN RESTAURANT
[01;31m[KBAL[m[KILO DELI
[01;31m[KBAL[m[KIMAYA RESTAURANT
[01;31m[KBAL[m[KKANIKA
[01;31m[KBAL[m[KKH SHISH KABAB HOUSE
[01;31m[KBAL[m[KL PARK HOT DOG
[01;31m[KBAL[m[KLARO
[01;31m[KBAL[m[KLATO'S RESTAURANT
[01;31m[KBAL[m[KLFIELDS CAFE
[01;31m[KBAL[m[KLI DELI & SALAD BAR
[01;31m[KBAL[m[KLY TOTAL FITNESS
[01;31m[KBAL[m[KLY'S SPORT CLUB
[01;31m[KBAL[m[KNDIE'S PLACE, INC
[01;31m[KBAL[m[KON
[01;31m[KBAL[m[KTHAZAR BAKERY
[01;31m[KBAL[m[KTHAZAR RESTAURANT
[01;31m[KBAL[m[KUCHI
[01;31m[KBAL[m[KUCHI'S
[01;31m[KBAL[m[KUCHI'S FRESH
[01;31m[KBAL[m[KUCHI'S INDIAN FOOD
[01;31m[KBAL[m[KVANERA
[01;31m[KBAL[m[KZEM


Example: Find restaurant names that end with the characters `NORTH`

In [47]:
!grep --color=always 'NORTH$' uniquenames.txt

AQUEDUCT [01;31m[KNORTH[m[K
BOURGEOIS PIG [01;31m[KNORTH[m[K
PRATT INSTITUTE [01;31m[KNORTH[m[K


Example: Let's try to find restaurants containing the word `COLUMBIA`:

In [51]:
!grep --color=always 'COLUMBIA' uniquenames.txt

104[01;31m[K-[m[K01 FOSTER AVENUE COFFEE SHOP(UPS)
108 LOUNGE [01;31m[K-[m[K CLUB 108
1617[01;31m[K-[m[KA NATIONAL BAKERY
3[01;31m[K-[m[KJ RESTAURANT AND PIZZA
310 [01;31m[K-[m[K EXELSIOR
318 [01;31m[K-[m[K TWO BOOTS
337 [01;31m[K-[m[K BURGERS & DOGS
36[01;31m[K-[m[K02 DITMARS COFFEE CORP.
A[01;31m[K-[m[K1 PIZZA SHOP
A[01;31m[K-[m[K12 SUSHI & RAMEN
A[01;31m[K-[m[KJIAO SICHUAN CUISINE
A[01;31m[K-[m[KROMA BAKERY
A[01;31m[K-[m[KTOWN BUFFALO WINGS
A[01;31m[K-[m[KWAH RESTAURANT
AAMANNS[01;31m[K-[m[KCOPENHGEN
AGUA FRESCA[01;31m[K-[m[K TAPAS BAR EL KALLEJON
AL[01;31m[K-[m[KARAF HALAL FRIED CHICKEN AND FROZEN DELIGHTS
AL[01;31m[K-[m[KDENTE
AL[01;31m[K-[m[KMEHRAN RESTAURANT
AL[01;31m[K-[m[KRAHAMANIA RESTAURANT AND CATERING
ALIMENTOS SALUDABLES [01;31m[K-[m[K MEXICAN GRILL
ALITALIA [01;31m[K-[m[K COMPAGNIA AEREA ITALIANA
ALT [01;31m[K-[m[K A LITTLE TASTE
AMC THEATERS 34TH STREET [0

Hm, something is wrong. We also get COLUMBIANO, COLUMBIANAS, and other words. We want only the word COLUMBIA, so we add the word anchors:

In [53]:
!grep --color=always '\<COLUMBIA\>' uniquenames.txt

BROWNIE'S CAFE AT [01;31m[KCOLUMBIA[m[K
CAFE 212/[01;31m[KCOLUMBIA[m[K CATERING KITCHEN - ALFRED LERNER HALL
[01;31m[KCOLUMBIA[m[K UNIVERSITY MEDICAL CENTER BOOKSTORE CAFE
THE FACULTY CLUB ([01;31m[KCOLUMBIA[m[K UNIVERSITY)
THE SCHOOL AT [01;31m[KCOLUMBIA[m[K UNIVERSITY
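
The anchors can be tried in Python as well, with one difference worth flagging: Python's `re` module does not support `\<` and `\>`, so the sketch below uses `\b` (word boundary) in their place. Most names are copied by hand from the outputs above; `COLUMBIANO RESTAURANT` is a hypothetical name added only to show what the word anchors exclude.

```python
import re

names = [
    "BALTHAZAR BAKERY",
    "AQUEDUCT NORTH",
    "THE SCHOOL AT COLUMBIA UNIVERSITY",
    "COLUMBIANO RESTAURANT",  # hypothetical name, for illustration only
]

starts_with_bal = [n for n in names if re.search(r"^BAL", n)]
ends_with_north = [n for n in names if re.search(r"NORTH$", n)]

# Without word anchors, COLUMBIA also matches inside longer words
contains_columbia = [n for n in names if re.search(r"COLUMBIA", n)]

# \b on both sides plays the role of \< and \> in grep
whole_word_columbia = [n for n in names if re.search(r"\bCOLUMBIA\b", n)]

print(starts_with_bal, ends_with_north, contains_columbia, whole_word_columbia)
```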


#### In class exercises

Write a regular expression for:

* Match any character
* Match the end of line
* Match any digit
* Find all characters that are not digits
* Find all words with four letters
* Find every line that starts with a digit
* Find all empty lines
* Find all lines with 4 characters


### Regular Expressions: Operators

#### Alternation |

The alternation operator `|` separates two or more alternative regular expressions; a string matches if it matches at least one of the alternatives.

For example, if we are looking for names that contain the word `GREEK`, the word `RUSSIAN`, or the word `FRENCH`, we issue the following command:

In [54]:
!grep -E --color=always 'GREEK|RUSSIAN|FRENCH' uniquenames.txt

ANTHI'S [01;31m[KGREEK[m[K FOOD
AVLEE  [01;31m[KGREEK[m[K KITCHEN
AVLEE [01;31m[KGREEK[m[K KITCHEN
AVLI THE LITTLE [01;31m[KGREEK[m[K TAVERN
BREEZE THAI-[01;31m[KFRENCH[m[K KITCHEN
BY SUZETTE [01;31m[KFRENCH[m[K CREPES
DIRTY [01;31m[KFRENCH[m[K
ETHOS [01;31m[KGREEK[m[K CUISINE
[01;31m[KFRENCH[m[K CAFE GOURMAND
[01;31m[KFRENCH[m[K DINER
[01;31m[KFRENCH[m[K LOUIE
[01;31m[KFRENCH[m[K ROAST
[01;31m[KGREEK[m[K EXPRESS
[01;31m[KGREEK[m[K FAMILY KITCHEN
[01;31m[KGREEK[m[K GARDENS GRILL
[01;31m[KGREEK[m[K GRILL
[01;31m[KGREEK[m[K ISLANDS
GRK FRESH [01;31m[KGREEK[m[K
GYRO [01;31m[KGREEK[m[K STYLE
JEAN CLAUDE [01;31m[KFRENCH[m[K BISTRO
JEAN DANET [01;31m[KFRENCH[m[K PASTRY
JENNY [01;31m[KFRENCH[m[K TOAST COFFEE SHOP RESTAURANT
MEDITERRANEAN GRILL [01;31m[KGREEK[m[K TARVERNA
OKEANOS [01;31m[KGREEK[m[K SEAFOOD
OPA! [01;31m[KGREEK[m[K RESTAURANT
PIZZA AND [01;31m[KFRENCH[m

The -E flag specifies that we will be using the "Extended Regular Expressions" standard (see the slides). Its syntax is simpler than that of the original ("basic") regular expressions that grep uses by default: operators such as `|`, `+`, `?`, `{}`, and `()` can be written without a preceding backslash.
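
As a small illustration, the same alternation works unchanged in Python's `re` module, which needs no equivalent of the -E flag. The names below are copied by hand from the outputs in this document; `cuisine` and `found` are arbitrary variable names.

```python
import re

cuisine = re.compile(r"GREEK|RUSSIAN|FRENCH")

names = ["DIRTY FRENCH", "GREEK EXPRESS", "ZUM SCHNEIDER"]

# Keep each matching name together with the alternative that matched
found = []
for name in names:
    m = cuisine.search(name)
    if m:
        found.append((name, m.group(0)))

print(found)
```

A name only has to satisfy one of the alternatives, so the absence of `RUSSIAN` in these names does not prevent the other two from matching.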

#### Repetition {m,n}

A repetition operator specifies how many times the atom or expression immediately before it may be repeated: `{m,n}` means at least m and at most n times. For example, if we are looking for restaurants that contain the letter I three to five times in a row:

_**Note**: The double braces that we use here are *just* for iPython Notebooks. In a unix shell we would issue the command `grep -E 'I{3,5}' uniquenames.txt`, but due to the special way the iPython notebook treats the `{}` characters, we need to use double braces for the command to be interpreted properly._

In [56]:
!grep -E --color=always 'I{{3,5}}' uniquenames.txt

100% PATACON [01;31m[KCAC[m[KHAPA YAROA
555 VIV[01;31m[KACA[m[KFE
[01;31m[KAAA[m[K BURRITO MARIACHI
[01;31m[KAAA[m[K CARIDAD
[01;31m[KAAA[m[K ICHIBAN SUSHI
[01;31m[KAAA[m[K KENNEDY FRIED CHICKEN
[01;31m[KABA[m[K ASIAN FUSION CUISINE AVE
[01;31m[KABA[m[K TURKISH RESTAURANT
[01;31m[KABAC[m[KE SUSHI
[01;31m[KABAC[m[KKY POTLUCK
[01;31m[KABA[m[KLEH
[01;31m[KABB[m[KO[01;31m[KCCA[m[KTO RISTORANTE
[01;31m[KABB[m[KY CHINESE RESTAURANT
[01;31m[KABC[m[K BAKERY
[01;31m[KABC[m[K BEER CO.
[01;31m[KABC[m[K COCINA
[01;31m[KABC[m[K KITCHEN
[01;31m[KACA[m[KDEMIA BARILLA RESTAURANTS
[01;31m[KACA[m[KDEMIA COFFEE
[01;31m[KACA[m[KDEMY RESTAURAUNT
[01;31m[KACA[m[KPELLA GOURMET PIZZA & RESTAURANT CORP
[01;31m[KACA[m[KPPELLA RESTAURANT
[01;31m[KACA[m[KPULCO
[01;31m[KACA[m[KPULCO BAR RESTAURANT
[01;31m[KACA[m[KPULCO DELI & RESTAURANT
[01;31m[KACCC[m[KORD ASIAN CUISINE
[01;31m[K

Now, let's find all the restaurants that have a name length from 50 to 55 characters:

In [61]:
!grep --color=always -E '^.{{50,55}}$' uniquenames.txt

[01;31m[KBRASSIERIE 1605/BROADWAY 49 BAR & LOUNGE (MAIN KITCHEN)[m[K
[01;31m[KBROOKLYN CHILDREN'S MUSEUM CAFE/FOREST CITY RATNER CAFE[m[K
[01;31m[KCAFE 212/COLUMBIA CATERING KITCHEN - ALFRED LERNER HALL[m[K
[01;31m[KCAFE1 & CAFE 4 (AMERICAN MUSEUM OF NATURAL HISTORY)[m[K
[01;31m[KCARIBBEAN CONNECTION CATERING SERVICES INC RESTAURANT[m[K
[01;31m[KCHARTWELLS AT COLLEGE OF MOUNT ST. VINCENT-BENEDICT[m[K
[01;31m[KCOURTYARD & RESIDENCE INN BY MARRIOTT CENTRAL PARK[m[K
[01;31m[KFORDHAM UNIVERSITY/MCGINLEY CENTER/RAMSKELLER KITCHEN[m[K
[01;31m[KGREEN AND ACKERMAN KOSHER DAIRY RESTAURANT & PIZZA[m[K
[01;31m[KHOMESTYLE FOOD SERVICES (ST. BARNABAS HIGH SCHOOL)[m[K
[01;31m[KLOBBY LOUNGE AND TROUBLE'S TRUST @ THE PALACE HOTEL[m[K
[01;31m[KNATURAL TOFU & NOODLES RESTAURANT (BOOK CHANG DONG)[m[K
[01;31m[KNEW YORK BOTANICAL GARDENS TERRACE CAFE ( GARDEN CAFE )[m[K
[01;31m[KNEW YORK UNIVERSITY - KIMMEL STUDENT CENTER CAFETERIA[m[K


In the repetition operator {m,n}, we can omit the upper limit if we want to say "m repetitions or more". For example, let's find all the restaurants that have a name length of 60 characters or more:

In [62]:
!grep --color=always -E '^.{{60,}}$' uniquenames.txt

[01;31m[K(PUBLIC FARE) 81ST STREET AND CENTRAL PARK WEST (DELACORTE THEATRE)[m[K
[01;31m[KBUFFALO WILD WINGS,PEETS COOFEE &TEA, PANOPOLIS BAKERY & CAFE[m[K
[01;31m[KCENTER PLATE- CONCOURSE CAFE-JACOB K JAVITS CONVENTION CENTER[m[K
[01;31m[KCENTERPLATE-EMPLOYEE CAFETERIA-JACOB K JAVITS CONVENTION CENTER[m[K
[01;31m[KCENTRA`L MARKET ALL AMERICAN GRILL ( STATEN ISLAND FERRY TERMINAL)[m[K
[01;31m[KDELTA SKY CLUB (BARTENDER SERVICE TERMINAL D DELTA DEPARTURE)[m[K
[01;31m[KDUNKIN DONUTS (INSIDE GULF GAS STATION ON NORTH SIDE OF MAJ. DEEGAN EXWY- AFTER EXIT 13 - 233 ST.)[m[K
[01;31m[KFASHION INSTITUTE OF TECHNOLOGY DAVID DUBINSKY STUDENT CENTER[m[K
[01;31m[KGREATER NEW YORK SOCIAL AND HEALTH ADULT DAY CARE CENTER LLC[m[K
[01;31m[KHOMEWOOD SUITES BY HILTON NEW YORK MIDTOWN MANHATTAN TIMES SQUARE[m[K
[01;31m[KHONG KONG CAFE / FRESH SANDWICH BAKERY (BASEMENT FOOD COURT RESTAURANT & 1ST FL BAKERY)[m[K
[01;31m[KMARLIN BAR AT TOMMY BAHAMA AND

##### Repetition shortcuts (very common!): 

* `* = {0,}`. The `*` character means match the previous atom zero or more times
* `+ = {1,}`. The `+` character means match the previous atom one or more times
* `? = {0,1}`. The `?` character means match the previous atom zero or one times
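
The equivalences in this list can be checked mechanically. The following is a rough sketch in Python (`accepted` is a hypothetical helper and the test strings are made up); note that inside an ordinary Python string the braces are written once, since the doubled braces are only needed in notebook `!` commands.

```python
import re

# Each shortcut and the {m,n} form it abbreviates
shortcuts = {"*": "{0,}", "+": "{1,}", "?": "{0,1}"}

# Made-up test strings: zero, one, two and four copies of the letter I
tests = ["", "I", "II", "IIII"]

def accepted(pattern):
    """Return the test strings that the pattern matches in full."""
    return [s for s in tests if re.fullmatch(pattern, s)]

for short, longhand in shortcuts.items():
    # Both forms should accept exactly the same test strings
    print(short, accepted("I" + short), accepted("I" + longhand))
```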






Find all restaurants that start with one or more digits, followed by a space.

In [66]:
!grep -E --color=always '^[0-9]+ ' uniquenames.txt

[01;31m[K10T[m[KH AVENUE COOKSHOP
[01;31m[K10T[m[KH AVENUE PIZZA & CAFE
[01;31m[K12T[m[KH STREET BAR & GRILL
[01;31m[K14T[m[KH STREET PIZZA BAGEL CAFE
[01;31m[K16T[m[KH AVENUE GLATT
[01;31m[K19A[m[K EMPIRE RESTAURANT
[01;31m[K1A[m[K STORE INC
[01;31m[K1S[m[KT AVENUE GOURMET
[01;31m[K1S[m[KT BASE CONCESSION STAND
[01;31m[K1S[m[KT MAMA RESTAURANT
[01;31m[K1S[m[KT STOP
[01;31m[K224T[m[KH CORNER RESTAURANT & BAKERY
[01;31m[K241S[m[KT CAFE RESTAURANT
[01;31m[K25T[m[KH DELI
[01;31m[K29T[m[KH STREET HOTEL ACQUISITION LLC
[01;31m[K2A[m[K
[01;31m[K2F[m[KL
[01;31m[K2N[m[KD AVE BLUE 9 BURGER
[01;31m[K2N[m[KD AVENUE DELI
[01;31m[K31S[m[KT AVENUE GYRO.
[01;31m[K36T[m[KH AVE COFFEE SHOP
[01;31m[K38T[m[KH STREET DINER
[01;31m[K3E[m[K TASTE OF THAI
[01;31m[K3R[m[KD & 7
[01;31m[K40T[m[KH ROAD LUNCH BOX
[01;31m[K42N[m[KD STREET PIZZA DINER
[01;31m[K44T[m[KH STREET MINAR
[01;31m[K44T[m[KH STREE

Find all restaurants that start with a letter, followed by one or more digits, followed by a space.

In [67]:
!grep -E --color=always '^[A-Z][0-9]+ ' uniquenames.txt

[01;31m[KA1 [m[KOCHA SUSHI
[01;31m[KB4 [m[KNYC
[01;31m[KB66 [m[KCLUB
[01;31m[KC2 [m[KCAFE
[01;31m[KF1 [m[KLOUNGE AND GRILL
[01;31m[KH20 [m[KLOUNGE AND RESTAURANT
[01;31m[KJ4 [m[KHOOKAH LOUNGE
[01;31m[KQ2 [m[KTHAI RESTAURANT
[01;31m[KT2 [m[K- GO
[01;31m[KT49 [m[KCAFE


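Find all restaurants whose name contains the word STARBUCKS, followed by any number of characters, and then at least one digit. (The pattern has no `^` anchor, so STARBUCKS does not have to be at the start of the name.)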

In [68]:
!grep -E --color=always 'STARBUCKS.*[0-9]+' uniquenames.txt

[01;31m[KSTARBUCKS # 14840[m[K
[01;31m[KSTARBUCKS (JFK TERMINAL 5[m[K-POST SECURITY DEPARTURE)
[01;31m[KSTARBUCKS (STORE 16628[m[K)
[01;31m[KSTARBUCKS 22420[m[K
[01;31m[KSTARBUCKS COFFEE  #16608[m[K
[01;31m[KSTARBUCKS COFFEE # 15440[m[K
[01;31m[KSTARBUCKS COFFEE # 7463[m[K
[01;31m[KSTARBUCKS COFFEE # 7540[m[K
[01;31m[KSTARBUCKS COFFEE #14240[m[K
[01;31m[KSTARBUCKS COFFEE #18509[m[K
[01;31m[KSTARBUCKS COFFEE #20679[m[K
[01;31m[KSTARBUCKS COFFEE #21514[m[K
[01;31m[KSTARBUCKS COFFEE #22596[m[K
[01;31m[KSTARBUCKS COFFEE #23266[m[K
[01;31m[KSTARBUCKS COFFEE #23267[m[K
[01;31m[KSTARBUCKS COFFEE #3438[m[K
[01;31m[KSTARBUCKS COFFEE #7344[m[K
[01;31m[KSTARBUCKS COFFEE #7358[m[K
[01;31m[KSTARBUCKS COFFEE #7416[m[K
[01;31m[KSTARBUCKS COFFEE #7682[m[K
[01;31m[KSTARBUCKS COFFEE #7826[m[K
[01;31m[KSTARBUCKS COFFEE #9282[m[K
[01;31m[KSTARBUCKS COFFEE #9722[m[K
[01;31m[KSTARBUCKS COFFEE

#### Grouping ()

The group operator encloses part of a regular expression in parentheses, so that the operator that follows applies to the whole group and not only to the last character before it. For example, find all restaurant names that contain BA two or more times in a row:

In [73]:
!grep -E --color=always '(BA){{2,}}' uniquenames.txt

ALI [01;31m[KBABA[m[K
ALI [01;31m[KBABA[m[K RESTAURANT
ALI [01;31m[KBABA[m[K'S
ALI[01;31m[KBABA[m[K
ALI[01;31m[KBABA[m[K EXPRESS
ALI[01;31m[KBABA[m[K GRILL
[01;31m[KBABA[m[K COOL
[01;31m[KBABA[m[K GHANOUGE
[01;31m[KBABA[m[K'S PIEROGIES
[01;31m[KBABA[m[KGHANOUSH
[01;31m[KBABA[m[KLU
SA[01;31m[KBABA[m[K LOUNGE


#### In class exercises

What do these regular expressions match?

* b (cd)*
* h (d)+
* j? k+
* (cd){2,5}
* o(pre){3,}
* Panos|Ipeirotis

#### In class exercises (advanced)

Write down the regular expressions for the following:

* A telephone number (e.g., 212-555-0921)
* A zip+4 code (e.g., 10012-1809)
* A float number (e.g., +12.34 or -1.457 or 1023.4568)
* A dollar amount with optional cents (e.g., \$0.33, \$784)
* A time of day (e.g., 12:15am, 3:34pm)
* URLs only of the form http://www.alphanumeric.com
* An email of the form username@domain (assume that the domain might be in the form alphanumeric.alphanumeric, or alphanumeric.alphanumeric.alphanumeric)



### Backreferences

Sometimes it is handy to be able to refer to a match that was made earlier in a regex. This is done with backreferences. `\k` is the backreference specifier, where `k` is a number that refers to the text matched by the `k`-th group, i.e., the `k`-th sub-expression *that was enclosed in parentheses*.

For example, find the names whose first three (or more) characters appear again at the very end of the line:


In [75]:
!grep -E --color=always '^(.{{3,}}).*\1$' uniquenames.txt

[01;31m[K108 LOUNGE - CLUB 108[m[K
[01;31m[KANTEK RESTAURANT[m[K
[01;31m[KANTOJITOS RETAURANT[m[K
[01;31m[KANTONIO'S RESTAURANT[m[K
[01;31m[KARRIBA ARRIBA[m[K
[01;31m[KBARCELONA BAR[m[K
[01;31m[KBARRACUDA BAR[m[K
[01;31m[KBERONBERON[m[K
[01;31m[KBINGO BINGO BINGO[m[K
[01;31m[KBUMBLE AND BUMBLE[m[K
[01;31m[KBURGER BURGER[m[K
[01;31m[KCENTER PLATE- CONCOURSE CAFE-JACOB K JAVITS CONVENTION CENTER[m[K
[01;31m[KCENTERPLATE-EMPLOYEE CAFETERIA-JACOB K JAVITS CONVENTION CENTER[m[K
[01;31m[KCHARLES SALLY & CHARLES[m[K
[01;31m[KCHEEBURGER CHEEBURGER[m[K
[01;31m[KCHEN MOMMY KITCHEN[m[K
[01;31m[KCHEN'S KITCHEN[m[K
[01;31m[KCHOP CHOP[m[K
[01;31m[KCREPE SUCRE[m[K
[01;31m[KDIP DIP[m[K
[01;31m[KETCETERA ETCETERA[m[K
[01;31m[KGAJI GAJI[m[K
[01;31m[KGIT-IT-N-GIT[m[K
[01;31m[KGONZALEZ Y GONZALEZ[m[K
[01;31m[KGUDE GUDE[m[K
[01;31m[KHALF AND HALF[m[K
[01;31m[KHOME SWEET HOME[

Or find all the restaurant names that consist entirely of a sequence of letters written twice in a row:

In [76]:
!grep -E '^([A-Z]+)\1$' uniquenames.txt

BERONBERON
COCO
ISIS
MANGOMANGO
NONO
VIVI


Find all names that contain the same digit three times in a row:

In [None]:
!grep -E  --color=always '([0-9])\1\1' uniquenames.txt

As we are going to see, these backreferences will also be of tremendous use for extraction purposes.
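
As a preview of extraction, here is a hedged Python sketch of the same backreference patterns. In Python the text captured by the first group is available as `m.group(1)`; the names are copied by hand from the outputs above, and the two digit strings are made-up test inputs.

```python
import re

# A sequence of letters immediately followed by a copy of itself, and nothing else
doubled = re.compile(r"^([A-Z]+)\1$")

names = ["BERONBERON", "COCO", "CHOP CHOP", "ZUTTO"]

# Map each matching name to the part that is repeated (the captured group)
repeated_part = {}
for name in names:
    m = doubled.search(name)
    if m:
        repeated_part[name] = m.group(1)

# The same digit three times in a row
triple_digit = re.compile(r"([0-9])\1\1")
with_triple = triple_digit.search("888-888-2222")
without_triple = triple_digit.search("212-998-0902")

print(repeated_part, with_triple.group(0), without_triple)
```

`CHOP CHOP` is not reported by this pattern because the space between the two words is not matched by `[A-Z]`.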

#### In class exercise (advanced)

Say that you have a file with telephone numbers written in a variety of forms: 

* 679-397-5255
* 2126660921
* 212-998-0902
* 888-888-2222
* 800-555-1211
* 800 555 1212
* 800.555.1213
* (800) 555-1214
* 1-800-555-1215
* 1(800)555-1216
* 800-555-1212-1234
* 800-555-1212x1234
* 800-555-1212 ext. 1234
* work 1-(800) 555.1212 #1234

The task is to standardize everything in the form (xxx)-xxx-xxxx.


To make the process interactive, go to http://regex101.com/?#python, copy and paste the numbers above in the textarea called "Text String", and then try to write a regular expression for the task above. (As a side note, the website provides excellent explanations of the meaning of the regular expression that you write down.) Remember to put the "g" character in the small text field next to the regex: this has the same meaning as in sed, and it means "find globally", i.e., match every occurrence of the regex, not just the first one.


If you manage to deal with that task, consider the case of also having international country calling codes (e.g., +1 for US, +44 for UK, +7 for Russia, +30 for Greece, +354 for Iceland etc), and also standardizing the extensions.

### Additional Regex Resources

* [Visual Regular Expression Tester](http://www.debuggex.com/?flavor=pcre)
* [Test Python Regular Expressions Online](http://www.pyregex.com/)
* [Regular Expressions 101](http://regex101.com/?)
* [Python's re Library Official Documentation](http://docs.python.org/2/library/re.html)
* [Regular expression reference at W3schools](http://www.w3schools.com/jsref/jsref_obj_regexp.asp)
* [Parsing phone numbers using Python and regular expressions](http://www.diveintopython.net/regular_expressions/phone_numbers.html)

### Additional Regular Expressions

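While we have not used these before, they are commonly used shortcuts to simplify the construction of regular expressions. They come from Perl-style regular expression engines such as Python's `re` module; support in grep varies (for example, GNU grep understands `\d` only with the `-P` option):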

* `\d`: matches the digits, 0-9.
* `\D`: matches anything but `\d`.
* `\w`: matches any alphanumeric character plus underscore: `[A-Za-z0-9_]`.
* `\W`: matches anything but `\w`.
* `\s`: matches any "whitespace" character (space, tab, newline, etc): `[ \t\n\r\f\v]`.
* `\S`: matches anything but `\s`.
* `\b`: matches the breaks between alphanumeric and non-alphanumeric characters (an empty string), the boundary between `\w` and `\W`. Useful for ensuring that what you match is actually a word.
* `\B`: matches anything but `\b`. Useful for ensuring your match is in the middle of a word.
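
The shortcuts in this list can be tried out directly in Python's `re` module. The snippet below is a small sketch; all input strings are made up for the illustration.

```python
import re

text = "Call 212-555-0921 or visit room_12 at 3:34pm"  # made-up test string

digits = re.findall(r"\d+", text)           # runs of digits
words = re.findall(r"\w+", text)            # runs of letters, digits and underscores
fields = re.split(r"\s+", "a  b\tc\nd")     # split on any run of whitespace

# \b requires a word boundary, \B requires the absence of one
whole_words = re.findall(r"\bcat\b", "cat concat cat's")
inside_word = re.findall(r"\Bcat", "cat concat")

print(digits)
print(words)
print(fields, whole_words, inside_word)
```

Note that `\w` treats the underscore as a word character, so `room_12` is a single word while `3:34pm` is split at the colon.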

And the ones below get a little bit more advanced:

* `*?`, `+?`, `??`: ordinarily, `*`, `+` and `?` are greedy, matching the longest possible string that satisfies the regular expression. Adding a `?` after any of these makes it non-greedy, instead matching the shortest possible string.
* `(?: )`: A non-capturing group. This works just as `()`, but doesn't hold on to the matched contents, so it cannot be used in a backreference.
* `(?<=x)`: A positive lookbehind. Matches only at a position that is immediately preceded by x, without making x part of the match. (In Python's `re` module, x must match strings of a fixed length.)
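
These three constructs are easiest to understand by comparing each with its plain counterpart. The following is a hedged Python sketch; the input strings are made up (loosely modeled on names in the data set), and the behavior shown is that of Python's `re` module.

```python
import re

text = "(LIBRARY) FOUR (AND) TWENTY"  # made-up test string

# Greedy: runs from the first "(" to the last ")"
greedy = re.findall(r"\(.*\)", text)
# Non-greedy: stops at the first ")" it can
lazy = re.findall(r"\(.*?\)", text)

# With a capturing group, findall returns the group's last repetition;
# with a non-capturing group, it returns the whole match
capturing = re.findall(r"(BA){2,}", "ALIBABA")
non_capturing = re.findall(r"(?:BA){2,}", "ALIBABA")

# Lookbehind: an amount must be preceded by "$", but "$" is not part of the match
prices = re.findall(r"(?<=\$)[0-9]+\.[0-9]{2}", "THE BEST $1.00 PIZZA, 1.5 GALBI")

print(greedy, lazy)
print(capturing, non_capturing)
print(prices)
```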
