# Regular Expressions in the Hacker News Dataset

The dataset we will be working with is based on this CSV of Hacker News stories from September 2015 to September 2016. This notebook compiles some regular expression exercises from Dataquest. Thanks to Dataquest for this amazing content.

Let's start by reading our Hacker News dataset into a pandas dataframe.

In [1]:
import pandas as pd
hn = pd.read_csv("HN_posts_year_to_Sep_26_2016.csv")

**Let's look at the first five rows using `hn.head()`**

In [2]:
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14


**Let's look at the last five rows using `hn.tail()`**

In [3]:
hn.tail()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
293114,10176919,Ask HN: What is/are your favorite quote(s)?,,15,20,kumarski,9/6/2015 6:02
293115,10176917,Attention and awareness in stage magic: turnin...,http://people.cs.uchicago.edu/~luitien/nrn2473...,14,0,stakent,9/6/2015 6:01
293116,10176908,Dying vets fuck you letter (2013),http://dangerousminds.net/comments/dying_vets_...,10,2,mycodebreaks,9/6/2015 5:56
293117,10176907,"PHP 7 Coolest Features: Space Ships, Type Hint...",https://www.zend.com/en/resources/php-7,2,0,Garbage,9/6/2015 5:55
293118,10176903,Toyota Establishes Research Centers with MIT a...,http://newsroom.toyota.co.jp/en/detail/9233109/,4,0,tim_sw,9/6/2015 5:50


**Finding more information using `hn.info()`**

In [4]:
hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293119 entries, 0 to 293118
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            293119 non-null  int64 
 1   title         293119 non-null  object
 2   url           279256 non-null  object
 3   num_points    293119 non-null  int64 
 4   num_comments  293119 non-null  int64 
 5   author        293119 non-null  object
 6   created_at    293119 non-null  object
dtypes: int64(3), object(4)
memory usage: 15.7+ MB


In [5]:
hn.shape

(293119, 7)

The dataset has 293119 rows and 7 columns, and it has missing data in the `url` column.

We're going to find out how many times Python is mentioned in the title of stories in our Hacker News dataset. We'll use a set to check for both `Python` with capital 'P' and `python` with lowercase 'p'.

In [6]:
import re

titles = hn["title"].tolist()

python_mentions = 0

pattern = "[Pp]ython"  # Setting the pattern to find

for s in titles:
    if re.search(pattern,s):
        python_mentions += 1
python_mentions

2572

Let's find the titles that match this pattern.

In [7]:
python_titles = hn["title"].str.contains(pattern)
hn[python_titles].head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
109,12577538,Solve a wooden puzzle with Python and Jupyter,http://www.craig-wood.com/nick/articles/snake-...,2,0,nickcw,9/25/2016 21:25
257,12576002,A fast PostgreSQL client library for Python: 3...,https://github.com/MagicStack/asyncpg,3,1,arjun27,9/25/2016 16:33
284,12575660,Asynchronous Python,https://medium.com/@nhumrich/asynchronous-pyth...,1,0,nhumrich,9/25/2016 15:13
444,12574035,Python 4 Kids: Python for Kids: Python 3 Proj...,https://python4kids.brendanscott.com/2016/09/2...,3,0,samber,9/25/2016 4:51
478,12573573,Cubr A Rubiks Cube Solver Written in Python a...,http://www.cbarker.net/projects/cubr,81,7,Halienja,9/25/2016 2:22


Let's do the same for titles that mention `"Ruby"` or `"ruby"`.

In [8]:
ruby_titles = hn["title"].str.contains("[Rr]uby")
hn[ruby_titles].head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
1291,12565844,Wrote a simple ps clone in ruby,https://github.com/fredrb/ps-clone,1,0,fredrb,9/23/2016 16:30
1497,12564017,Ruby Poltergeist gem the best way to scrape data,http://joshuakemp.blogspot.com/2016/09/ruby-po...,4,0,asung40,9/23/2016 12:25
1624,12563059,A proposal of new concurrency model for Ruby 3...,http://www.atdot.net/~ko1/activities/2016_ruby...,2,0,robin_reala,9/23/2016 8:09
1734,12561981,Ruby gem to fetch gocd information as rich mod...,https://github.com/ajitsing/gocd,1,0,ajitsing,9/23/2016 2:49
1981,12559671,"Ruby vs. Python, the Definitive FAQ",https://hackernoon.com/ruby-vs-python-the-defi...,2,0,weatherlight,9/22/2016 19:39


Now we're going to find how many titles in our dataset mention `email` or `e-mail`. To do this, we'll need to use `?`, the optional quantifier, to specify that the dash character `-` is optional in our regular expression.

|Quantifier|Pattern|Explanation                              |
|:--------:|:-----:|-----------------------------------------|
|          |a{3}   |The character a three times              |
|Numeric   |a{3,5} |The character a three, four or five times|
|          |a{8,}  |The character a eight or more times      |

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.

|Quantifier  |Pattern|Equivalent|Explanation                     |
|:-----------|:-----:|:-----:|-----------------------------------|
|Zero or more|a*     |a{0,}  |The character a zero or more times |
|One or more |a+     |a{1,}  |The character a one or more times  |
|Optional    |a?     |a{0,1} |The character a zero or one times  |
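To make the quantifier tables concrete, here is a small sketch using Python's built-in `re` module. The sample strings are made up for illustration, and `re.fullmatch` is used so that a pattern only succeeds when it covers the whole string.

```python
import re

# Made-up sample strings, purely for illustration
samples = ["aa", "aaa", "aaaaa", "a" * 8]

# a{3}: exactly three, a{3,5}: three to five, a{8,}: eight or more
exactly_three = [s for s in samples if re.fullmatch(r"a{3}", s)]
three_to_five = [s for s in samples if re.fullmatch(r"a{3,5}", s)]
eight_or_more = [s for s in samples if re.fullmatch(r"a{8,}", s)]

# The optional quantifier lets one pattern cover "email" and "e-mail",
# but not a string with two dashes
optional_dash = [bool(re.search(r"e-?mail", s)) for s in ["email", "e-mail", "e--mail"]]

print(exactly_three, three_to_five, eight_or_more, optional_dash)
```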

In [9]:
email_bool = hn["title"].str.contains("e-?mail")
hn[email_bool].head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
90,12577773,This is what happens when you reply to spam email,https://www.ted.com/talks/james_veitch_this_is...,4,0,NicoJuicy,9/25/2016 22:23
173,12576882,Correct way to validate email adresses,https://hackernoon.com/the-100-correct-way-to-...,2,0,pvsukale3,9/25/2016 19:18
1010,12568819,"Obama used a pseudonym in emails with Clinton,...",http://www.politico.com/story/2016/09/hillary-...,8,2,douche,9/24/2016 0:15
1012,12568789,The most broken part of your user experience i...,https://uxdesign.cc/the-most-broken-part-of-yo...,2,0,toomanyapples,9/24/2016 0:07
1029,12568583,Visualization of Clinton email scandal,https://www.scedast.com/4,3,0,scedast,9/23/2016 23:11


Now let's find how many titles in our dataset have tags. To match unknown characters using regular expressions, we use **character classes**. Character classes allow us to match certain groups of characters.

|Character Class  |Pattern |Explanation                           |
|:----------------|:------:|--------------------------------------|
|Set              |[fud]   |Either f, u or d                      |
|                 |[a-e]   |Any of the characters a, b, c, d or e |
|Range            |[0-3]   |Any of the characters 0, 1, 2 or 3    |
|                 |[A-Z]   |Any uppercase letter                  |
|Set + Range      |[A-Za-z]|Any uppercase or lowercase letter     |

Just like with quantifiers, there are some common character classes which we'll use a lot.

|Character Class  |Pattern |Explanation                                                                         |
|:----------------|:------:|:-----------------------------------------------------------------------------------|
|Digit            |\d      |Any digit character                                                                 |
|Word             |\w      |Any digit, uppercase, lowercase, or underscore character (equivalent to [A-Za-z0-9_])|
|Whitespace       |\s      |Any space, tab, or linebreak character                                              |
|Dot              |.       |Any character except newline                                                        |
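As a quick illustration of the character classes in the tables above, the sketch below applies a few of them to a single made-up title with the standard `re` module. The title is an assumption chosen only to exercise each class.

```python
import re

# Made-up title, chosen only to exercise each character class
title = "Show HN: Python 3.6 released [pdf]"

digits = re.findall(r"\d", title)           # every digit character
words = re.findall(r"\w+", title)           # runs of word characters
capitals = re.findall(r"[A-Z]", title)      # a range: any uppercase letter
spaces = len(re.findall(r"\s", title))      # number of whitespace characters
tag = re.search(r"\[\w+\]", title).group()  # escaped brackets around word characters

print(digits, words, capitals, spaces, tag)
```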

Write a regular expression, assigning it as a string to the variable `pattern`. The regular expression should match, in order:
- A single open bracket character
- One or more word characters
- A single close bracket character

In [10]:
pattern = r"\[\w+\]"  # raw string, so the backslashes reach the regex engine unchanged
tag_titles = hn["title"].str.contains(pattern)
hn[tag_titles].head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
36,12578568,Cuba's DIY Inventions from 30 Years of Isolati...,https://www.youtube.com/watch?v=v-XS4aueDUg,1,0,GuiA,9/26/2016 1:26
114,12577471,A Possible Future of Software Development by ...,https://www.youtube.com/watch?v=4moyKUHApq4,2,0,adamnemecek,9/25/2016 21:11
153,12577026,FreeBSD Issue #1 [pdf],http://support.rossw.net/FreeBSD-Issue1.pdf,1,0,tachion,9/25/2016 19:44
154,12577024,Proprietary versus open instruction sets [pdf],http://research.cs.wisc.edu/multifacet/papers/...,21,7,jsnell,9/25/2016 19:44
217,12576307,Forever Alone Programming [FAP],https://github.com/nopara73/ForeverAloneProgra...,3,0,misnamed,9/25/2016 17:36


Count how many matching titles there are. Assign the result to `tag_count`

In [11]:
tag_count = tag_titles.sum()
tag_count

5871

We'll learn how to access capture groups in pandas by looking at just the first five matching titles from the previous exercise.

In [12]:
tag_5 = hn[tag_titles].head()
tag_5

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
36,12578568,Cuba's DIY Inventions from 30 Years of Isolati...,https://www.youtube.com/watch?v=v-XS4aueDUg,1,0,GuiA,9/26/2016 1:26
114,12577471,A Possible Future of Software Development by ...,https://www.youtube.com/watch?v=4moyKUHApq4,2,0,adamnemecek,9/25/2016 21:11
153,12577026,FreeBSD Issue #1 [pdf],http://support.rossw.net/FreeBSD-Issue1.pdf,1,0,tachion,9/25/2016 19:44
154,12577024,Proprietary versus open instruction sets [pdf],http://research.cs.wisc.edu/multifacet/papers/...,21,7,jsnell,9/25/2016 19:44
217,12576307,Forever Alone Programming [FAP],https://github.com/nopara73/ForeverAloneProgra...,3,0,misnamed,9/25/2016 17:36


In [13]:
pattern = r"(\[\w+\])"
tag_5_matches = tag_5["title"].str.extract(pattern, expand = False)
tag_5_matches

36     [video]
114    [video]
153      [pdf]
154      [pdf]
217      [FAP]
Name: title, dtype: object

In [14]:
tag_5_freq = tag_5_matches.value_counts()
tag_5_freq

[video]    2
[pdf]      2
[FAP]      1
Name: title, dtype: int64

Now we're going to produce a frequency table of all the tags in the `titles` series.

In [15]:
tag_freq = hn["title"].str.extract(pattern, expand = False).value_counts()
tag_freq

[pdf]            3531
[video]          1524
[audio]            65
[2015]             23
[Infographic]      18
                 ... 
[old]               1
[NIST]              1
[FR]                1
[rooted]            1
[Error]             1
Name: title, Length: 445, dtype: int64
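The frequency table above keeps the square brackets because the capture group wraps the whole tag. As a small variation, the sketch below moves the parentheses inside the brackets so that only the tag text is captured. It uses the standard `re` module and `collections.Counter` on a short list of sample titles (some taken from the output above, some made up), so it is an illustration rather than a rerun of the analysis.

```python
import re
from collections import Counter

# A few sample titles; "A talk [video]" and "No tag here" are made up
sample_titles = [
    "FreeBSD Issue #1 [pdf]",
    "Forever Alone Programming [FAP]",
    "A talk [video]",
    "No tag here",
    "Proprietary versus open instruction sets [pdf]",
]

# The parentheses sit inside the escaped brackets, so group 1 is the bare tag
tag_pattern = re.compile(r"\[(\w+)\]")

tags = []
for t in sample_titles:
    match = tag_pattern.search(t)
    if match:  # titles without a tag are skipped
        tags.append(match.group(1))

tag_counts = Counter(tags)
print(tag_counts)
```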

**Negative character classes** are character classes that match every character *except* those in a given character class. Let's look at a table of the common negative character classes:

|Character Class    |Pattern  |Explanation                                               |
|:------------------|:-------:|:---------------------------------------------------------|
|Negative Set       |[^fud]   |Any character except **f**, **u** or **d**                |
|                   |[^1-3Z\s]|Any character except **1**, **2**, **3**, **Z**, or whitespace characters |
|Negative Digit     |\D       |Any character except digit characters                     |
|Negative Word      |\W       |Any character except word characters                      |
|Negative Whitespace|\S       |Any character except whitespace characters                |
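Here is a short sketch of the negative classes in action, again with the standard `re` module. The input strings are illustrative: the date string follows the format of the `created_at` column, and the others are made up.

```python
import re

# \D removes everything that is not a digit
digits_only = re.sub(r"\D", "", "9/26/2016 3:26")

# A negative set: every character of "fudge" except f, u or d
not_fud = re.findall(r"[^fud]", "fudge")

# \W+ splits on runs of non-word characters (spaces and punctuation)
tokens = re.split(r"\W+", "Ruby vs. Python")

# \S+ collects runs of non-whitespace characters
chunks = re.findall(r"\S+", "a  b\tc")

print(digits_only, not_fud, tokens, chunks)
```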

In [16]:
def first_10_matches(df,name_column,pattern):
    """
    Return the first 10 rows whose `name_column` value matches the provided regular expression
    """
    all_matches = df[df[name_column].str.contains(pattern)]
    return all_matches.head(10)

In [17]:
pattern = r"[Jj]ava[^sS]"
#hn[hn["title"].str.contains(pattern)].head(10)
java_titles = first_10_matches(hn,"title",pattern)
java_titles

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
734,12571182,Show HN: New and Painless Couchbase Java SDK W...,https://github.com/RealityGamesLtd/couchbase-j...,1,0,Scotrix,9/24/2016 15:03
1244,12566258,2016 JavaOne Intel Keynote 32mn Talk,https://www.youtube.com/watch?v=MAi1eHpLY5M,1,1,BenoitP,9/23/2016 17:16
1803,12561409,RxJava library for Jersey framework,https://github.com/alex-shpak/rx-jersey,5,0,winterly,9/23/2016 0:13
2569,12555014,A Demo App of Zhihu Daily Based on MVP and RxJ...,https://github.com/hefuyicoder/ZhihuDaily,1,0,hefuyi,9/22/2016 7:02
2592,12554793,Swift versus Java: the bitset performance test,http://lemire.me/blog/2016/09/22/swift-versus-...,2,2,deafcalculus,9/22/2016 6:13
2919,12551356,Scala VS Java: fresh view,http://fruzenshtein.com/scala-vs-java-another-...,1,0,Fruzenshtein,9/21/2016 19:21
3590,12545243,HeapStats: JVMTI agent and JavaFX analyzer for...,https://github.com/HeapStats/heapstats,1,2,oza,9/21/2016 3:47
3904,12542486,Red Hat Links Java to Microsoft's Visual Studi...,http://www.infoworld.com/article/3122362/java/...,1,0,rbanffy,9/20/2016 19:24
4126,12540797,?Oracle pledges continued support for Java and...,http://www.zdnet.com/article/oracle-pledges-co...,1,1,CrankyBear,9/20/2016 16:32
4223,12540066,How Did We End Up with Java Running Inside of ...,https://www.linkedin.com/pulse/how-did-we-end-...,3,0,pescerosso,9/20/2016 15:07


Let's use the **word boundary anchor** (`\b`) as part of our regular expression to select the titles that mention Java.

In [18]:
pattern = r"\b[Jj]ava\b"
java_titles = first_10_matches(hn,"title",pattern) 
java_titles

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
734,12571182,Show HN: New and Painless Couchbase Java SDK W...,https://github.com/RealityGamesLtd/couchbase-j...,1,0,Scotrix,9/24/2016 15:03
2592,12554793,Swift versus Java: the bitset performance test,http://lemire.me/blog/2016/09/22/swift-versus-...,2,2,deafcalculus,9/22/2016 6:13
2919,12551356,Scala VS Java: fresh view,http://fruzenshtein.com/scala-vs-java-another-...,1,0,Fruzenshtein,9/21/2016 19:21
3584,12545277,Var comes to Java,https://www.voxxed.com/blog/2016/09/var-comes-...,1,0,antoaravinth,9/21/2016 3:59
3904,12542486,Red Hat Links Java to Microsoft's Visual Studi...,http://www.infoworld.com/article/3122362/java/...,1,0,rbanffy,9/20/2016 19:24
4126,12540797,?Oracle pledges continued support for Java and...,http://www.zdnet.com/article/oracle-pledges-co...,1,1,CrankyBear,9/20/2016 16:32
4223,12540066,How Did We End Up with Java Running Inside of ...,https://www.linkedin.com/pulse/how-did-we-end-...,3,0,pescerosso,9/20/2016 15:07
4315,12539439,Java Language Support for Visual Studio Code H...,http://developerblog.redhat.com/2016/09/19/jav...,2,0,tilt,9/20/2016 13:49
4512,12538064,How to find and fix memory leaks in your Java ...,http://developers.redhat.com/blog/2014/08/14/f...,1,0,iamcreasy,9/20/2016 8:52
4535,12537897,A Beginners Guide to Java Internationalization,https://phraseapp.com/blog/posts/a-beginners-g...,3,0,torbenfabel,9/20/2016 7:57
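The two result tables differ because the negative set `[^sS]` has to consume a character after `Java`, while `\b` only checks a position. The sketch below compares both patterns on four titles taken from the outputs above, using the standard `re` module.

```python
import re

# Four titles taken from the result tables above
sample_titles = [
    "Var comes to Java",
    "2016 JavaOne Intel Keynote 32mn Talk",
    "RxJava library for Jersey framework",
    "Scala VS Java: fresh view",
]

# [^sS] needs one more character, so "Java" at the very end is missed,
# while "JavaOne" and "RxJava " are accepted
negative_set = [t for t in sample_titles if re.search(r"[Jj]ava[^sS]", t)]

# \b requires a word boundary on both sides, so only the standalone word matches
word_boundary = [t for t in sample_titles if re.search(r"\b[Jj]ava\b", t)]

print(negative_set)
print(word_boundary)
```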


Other than the word boundary anchor, the other two most common anchors are the **beginning anchor** and the **end anchor**, which represent the start and end of the string.

|Anchor   |Pattern|Explanation                                     |
|:--------|:-----:|:-----------------------------------------------|
|Beginning| ^abc  | Matches **abc** only at the start of the string|
|End      | abc$  |  Matches **abc** only at the end of the string |
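A minimal sketch of the two anchors, using the tag pattern from earlier on three made-up titles:

```python
import re

# Made-up titles with a tag at the start, at the end, and in the middle
sample_titles = ["[pdf] A paper", "A paper [pdf]", "A [pdf] paper"]

# ^ pins the tag to the start of the string, $ pins it to the end
at_start = [t for t in sample_titles if re.search(r"^\[\w+\]", t)]
at_end = [t for t in sample_titles if re.search(r"\[\w+\]$", t)]

print(at_start, at_end)
```

The title with the tag in the middle matches neither anchored pattern, even though the unanchored pattern would find it.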


Let's use the beginning and end anchors to count how many titles have tags at the start versus the end of the story title in our Hacker News dataset. 

In [19]:
beginning_count = hn["title"].str.contains(r"^\[\w+\]").sum()
beginning_count

304

In [20]:
ending_count = hn["title"].str.contains(r"\[\w+\]$").sum()
ending_count

5384

Let's write a regular expression to count the number of times that email is mentioned in story titles. You'll need to use the ignorecase flag, `re.I`.

In [21]:
email_mentions = hn["title"].str.contains(r"e-?\s?mail", flags = re.I).sum()
email_mentions

1731

The last pattern covers all the following cases:
- Any combination of upper and lowercases
- e-mail 
- email
- E mail
- e mail
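One way to check these cases is to run the pattern over a list of spellings, with and without the flag. The list below is an assumption made for this check and uses only the standard `re` module.

```python
import re

# Spellings chosen to cover the cases listed above
variations = ["email", "Email", "e Mail", "e mail", "E-mail",
              "e-mail", "eMail", "E-Mail", "EMAIL"]

# With re.I every spelling matches
ignorecase = re.compile(r"e-?\s?mail", flags=re.I)
all_match = all(ignorecase.search(v) for v in variations)

# Without the flag only the all-lowercase spellings match
case_sensitive_hits = [v for v in variations if re.search(r"e-?\s?mail", v)]

print(all_match, case_sensitive_hits)
```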

In [22]:
email_mentions = hn["title"].str.contains(r"\be\s?mail\b", flags = re.I).sum()
first_10_matches(hn,"title",r"\be\s?mail\b")

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
90,12577773,This is what happens when you reply to spam email,https://www.ted.com/talks/james_veitch_this_is...,4,0,NicoJuicy,9/25/2016 22:23
173,12576882,Correct way to validate email adresses,https://hackernoon.com/the-100-correct-way-to-...,2,0,pvsukale3,9/25/2016 19:18
1012,12568789,The most broken part of your user experience i...,https://uxdesign.cc/the-most-broken-part-of-yo...,2,0,toomanyapples,9/24/2016 0:07
1029,12568583,Visualization of Clinton email scandal,https://www.scedast.com/4,3,0,scedast,9/23/2016 23:11
1069,12568034,Show HN: Inside The network of email newsletters,http://inside.com,1,0,awwstn,9/23/2016 21:18
2603,12554701,What are the top ten email services by number ...,,4,1,kartickv,9/22/2016 5:51
2650,12554279,SendGrid wrapper NPM package for email alert e...,https://www.npmjs.com/package/email-alerts,2,0,omgimanerd,9/22/2016 4:09
3347,12547312,Is your team's primary communication channel e...,,4,0,eli_oat,9/21/2016 11:56
3963,12541961,Show HN: Pepo Campaigns 1st enterprise-grade ...,https://pepocampaigns.com/,2,2,betashop,9/20/2016 18:27
4122,12540831,Ask HN: Suggestions for email provider separat...,,1,1,tmaly,9/20/2016 16:35


In [23]:
email_mentions

1217

Use a regex pattern and the ignorecase flag to count the number of mentions of SQL in title. Assign the result to `sql_counts`

In [24]:
titles = hn["title"]
sql_counts = titles.str.contains(r"sql", flags = re.I).sum()
sql_counts

1327

Let's create a new dataframe, `hn_sql`, including only rows that mention a SQL flavor.

In [25]:
hn_sql = hn[hn["title"].str.contains(r"\w+SQL", flags = re.I)].copy()

In [26]:
len(hn_sql)

775

In [27]:
hn_sql["flavor"] = hn_sql["title"].str.extract(r"(\w+SQL)", flags = re.I, expand = False)

In [28]:
hn_sql["flavor"].value_counts().head()

PostgreSQL    352
MySQL         206
NoSQL         106
Postgresql     16
mysql          13
Name: flavor, dtype: int64

In [29]:
hn_sql["flavor"] = hn_sql["flavor"].str.lower()

In [30]:
hn_sql["flavor"].value_counts().head()

postgresql    375
mysql         235
nosql         114
memsql         10
tsql            5
Name: flavor, dtype: int64

In [31]:
hn_sql.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,flavor
43,12578514,PostgreSQL RDS pg-stat-ramdisk-size new featur...,http://www.3manuek.com/pgstatramdisksize,1,0,3manuek,9/26/2016 1:16,postgresql
240,12576116,Bidirectional Replication is coming to Postgre...,http://blog.2ndquadrant.com/bdr-is-coming-to-p...,200,38,iamd3vil,9/25/2016 16:54,postgresql
257,12576002,A fast PostgreSQL client library for Python: 3...,https://github.com/MagicStack/asyncpg,3,1,arjun27,9/25/2016 16:33,postgresql
452,12573947,4 N00b MySQL Mistakes Every Programmer Makes,http://devops.com/2016/08/11/4-n00b-mysql-mist...,5,0,agsw,9/25/2016 4:22,mysql
569,12572611,Postgresql 9.6.0 release schedule (9-29-2016),https://www.postgresql.org/message-id/27572.14...,1,0,phaas,9/24/2016 20:58,postgresql


In [32]:
sql_pivot = hn_sql.pivot_table(index = "flavor", values = "num_comments")
sql_pivot

flavor,num_comments
cloudsql,5.0
continuousql,7.0
deepsql,0.0
firebirdsql,0.0
hgsql,0.0
hottsql,0.0
html5sql,0.0
htsql,17.0
knowsql,0.0
memsql,1.6


We'll use a capture group to capture the version number after the word "Python", and then build a frequency table of the different versions.

In [33]:
pattern = r"([pP]ython ?[\d.]+)"


            

|Pattern   |Explanation                                    |
|:--------:|:----------------------------------------------|
|[Pp]      |The characters P or p                          |
|ython     |The substring **ython**                        |
| ?        |Followed by a space character or nothing at all|
|[\d.]+    |One or more digit or period characters         |

In [34]:
py_versions_freq = dict(titles.str.extract(pattern, flags = re.I, expand = False).value_counts())

In [35]:
py_versions_freq

{'Python 3': 107,
 'Python 3.5': 16,
 'Python 201': 15,
 'Python 2': 13,
 'Python 2.7': 12,
 'Python 3.6': 10,
 'Python3': 10,
 'Python 101': 7,
 'Python 2.7.11': 5,
 'Python 3.5.1': 5,
 'Python 3.6.0': 4,
 'Python 3.4': 4,
 'Python.': 4,
 'Python 4': 3,
 'Python 2.7.': 3,
 'Python 3.5.0': 3,
 'Python 5': 3,
 'python 2': 2,
 'python2': 2,
 'Python2': 2,
 'python3': 2,
 'Python 2016': 1,
 'python 2.': 1,
 'Python3.4': 1,
 'Python 1.5': 1,
 'Python4': 1,
 'python3.5': 1,
 'Python 0.6': 1,
 'python 2.7.2': 1,
 'Python 8': 1,
 'python 3': 1,
 'Python3.5': 1,
 'Python 1.7': 1,
 'python.': 1,
 'Python 1.8.1': 1,
 'Python 5.0': 1,
 'Python 2.7.12': 1,
 'Python 3.5.2': 1}
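Entries such as `'Python.'` and `'Python 2.7.'` appear in the table because `[\d.]+` accepts periods on their own. The sketch below shows one possible tightening, assuming we only want a version that starts with a digit and has digits after every period. The pattern and the sample titles are illustrative, not part of the original exercise.

```python
import re

# Made-up titles covering the awkward cases seen in the frequency table
samples = [
    "Python 3.5 released",
    "Learning Python.",
    "python2 vs Python 3",
    "Python 2.7.",
]

# The group captures only the version: digits, optionally followed by
# more ".digits" parts; a trailing period is left out
version_pattern = re.compile(r"[Pp]ython ?(\d+(?:\.\d+)*)")

versions = [version_pattern.findall(s) for s in samples]
print(versions)
```

Capturing only the number also merges spellings such as `Python 3`, `python 3` and `Python3` into a single version key.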

Find the titles that mention the C language.

In [36]:
pattern = r"[^-]\b[Cc]\b[^+#.]"
hn[titles.str.contains(pattern)]

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
252,12576053,"Show HN: openemacs a tiny emacs clone, ? 1024...",https://github.com/practicalswift/openemacs,1,0,practicalswift,9/25/2016 16:43
457,12573886,Talking to C Programmers about C++ [video],https://www.youtube.com/watch?v=D7Sd8A6_fYU,81,112,adamnemecek,9/25/2016 3:59
517,12573105,Booksbyus/scalable-c: Scalable C The Book,https://github.com/booksbyus/scalable-c,4,1,mpweiher,9/24/2016 23:29
519,12573099,Python by the C side,https://www.paypal-engineering.com/2016/09/22/...,131,11,type0,9/24/2016 23:27
906,12569749,Tiny C Compiler,http://www.tinycc.org/,4,0,mynameislegion,9/24/2016 6:48
...,...,...,...,...,...,...,...
291655,10186851,"(C,C++,Java,PHP,SQL,Python,Linux,XML)asked que...",http://xquizzes.com/programming/XML,1,0,xquizzes,9/8/2015 17:03
291715,10186571,How to Create C/C++ Addons in Node,http://stackabuse.com/how-to-create-c-cpp-addo...,3,0,ScottWRobinson,9/8/2015 16:15
292405,10181992,A 1980s Commodore PC has controlled this schoo...,http://www.dailydot.com/technology/commadore-a...,2,0,DonnyV,9/7/2015 16:25
292563,10181120,Show HN: Libcox A C Library for Cross-Platfor...,http://libcox.net,47,27,symisc_devel,9/7/2015 12:10


**Lookarounds** let us define a character or sequence of characters that either must or must not come before or after our regex match. There are four types of lookarounds:

|Lookaround         |Pattern    |Explanation                                        |
|-------------------|:---------:|:--------------------------------------------------|
|Positive lookahead |zzz(?=abc) |Matches zzz only when it is followed by abc        |
|Negative lookahead |zzz(?!abc) |Matches zzz only when it is **not** followed by abc|
|Positive lookbehind|(?<=abc)zzz|Matches zzz only when it is preceded by abc        |
|Negative lookbehind|(?<!abc)zzz|Matches zzz only when it is **not** preceded by abc|
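The sketch below runs each of the four lookarounds on a made-up string and records where the match starts. The last pattern is a simplified, hypothetical C-language pattern built from the same idea, not the exact one used in this notebook.

```python
import re

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Lookaheads inspect what follows "zzz" without consuming it
positive_ahead = starts(r"zzz(?=abc)", "zzzabc zzzxyz")
negative_ahead = starts(r"zzz(?!abc)", "zzzabc zzzxyz")

# Lookbehinds inspect what precedes "zzz"
positive_behind = starts(r"(?<=abc)zzz", "abczzz defzzz")
negative_behind = starts(r"(?<!abc)zzz", "abczzz defzzz")

# Hypothetical simplified pattern: a lone C that is not part of a longer
# word and is not followed by +, #, . or - (hyphens escaped inside the sets)
c_pattern = r"(?<![\w.\-])[Cc](?![\w+#.\-])"
c_hits = re.findall(c_pattern, "C++, C#, Objective-C and plain C")

print(positive_ahead, negative_ahead, positive_behind, negative_behind, c_hits)
```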


In [37]:
pattern = r"(?<![A-Za-z0-9\-.])[cC](?![A-Za-z0-9+#\-.Ã])"  # hyphens escaped so they are literal, not a range
c_mentions = hn[titles.str.contains(pattern)]

Let's use a **backreference** to find story titles that have repeated words.

In [38]:
repeated_words = titles.str.contains(r"(\b\w\b)\s\1")
hn[repeated_words]["title"]

301       Google's self-driving car involved in Mountain...
510                         Am I Introverted, or Just Rude?
556       Google's self-driving car is the victim in a s...
801       Capcom's signed Windows driver allows arbitrar...
906                                         Tiny C Compiler
                                ...                        
292342            Steve Jobs was a Syrian migrant's son too
292656                          Asia's smartphone addiction
292852    Wikipedia founder backs site's systems after e...
292976    Ask HN: What Should I Including on My Company ...
293113    Why we aren't tempted to use ACLs on our Unix ...
Name: title, Length: 821, dtype: object
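Note that the group in the pattern above, `(\b\w\b)`, captures a single word character. That is why titles such as "Tiny C Compiler" appear in the result: the lone `C` is followed by a word that starts with `C`. The sketch below shows a possible variant that captures a whole word and requires a word boundary after the repeat. The sample titles are illustrative (the second one is made up), and the variant is an assumption about the intent, not the notebook's original pattern.

```python
import re

sample_titles = [
    "Tiny C Compiler",
    "Is this the the end?",       # made-up title with a repeated word
    "Google's self-driving car",
]

# Single-character group, as in the cell above
single_char = [t for t in sample_titles if re.search(r"(\b\w\b)\s\1", t)]

# Whole-word group: \w+ captures the word, \1 must repeat it in full,
# and the closing \b stops "C" from matching the start of "Compiler"
whole_word = re.compile(r"\b(\w+)\s\1\b")
repeated = [t for t in sample_titles if whole_word.search(t)]

print(single_char)
print(repeated)
```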

In [39]:
email_variations = pd.Series(['email','Email','e Mail','e mail','E-mail','e-mail','eMail','E-Mail','EMAIL'])

Use a regular expression to replace each of the matches in `email_variations` with "email"

In [40]:
email_uniform = email_variations.str.replace(r"e-?\s?mail", "email", flags = re.I, regex = True)  # regex=True treats the pattern as a regular expression
email_uniform

0    email
1    email
2    email
3    email
4    email
5    email
6    email
7    email
8    email
dtype: object

In [41]:
hn["titles_clean"] = titles.str.replace(r"e-?\s?mail", "email", flags = re.I, regex = True)  # regex=True treats the pattern as a regular expression

In [42]:
hn[hn["titles_clean"].str.contains(r"e-?\s?mail")]

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,titles_clean
90,12577773,This is what happens when you reply to spam email,https://www.ted.com/talks/james_veitch_this_is...,4,0,NicoJuicy,9/25/2016 22:23,This is what happens when you reply to spam email
173,12576882,Correct way to validate email adresses,https://hackernoon.com/the-100-correct-way-to-...,2,0,pvsukale3,9/25/2016 19:18,Correct way to validate email adresses
362,12574752,"Sharpen Your Pencils, Its Time for the #EmailI...",https://sendgrid.com/blog/email-iq-trivia-cont...,1,0,samber,9/25/2016 10:41,"Sharpen Your Pencils, Its Time for the #emailI..."
816,12570686,WHATSAPP MAY INCORPORATE PASSCODE AND RECOVERY...,http://naijafixer.com/phones/whatsapp-may-inco...,1,0,abula,9/24/2016 12:32,WHATSAPP MAY INCORPORATE PASSCODE AND RECOVERY...
1010,12568819,"Obama used a pseudonym in emails with Clinton,...",http://www.politico.com/story/2016/09/hillary-...,8,2,douche,9/24/2016 0:15,"Obama used a pseudonym in emails with Clinton,..."
...,...,...,...,...,...,...,...,...
292778,10179558,Show HN: LeadStage emaildomaindb Flag email a...,https://github.com/leadstage/email-domain-db,2,0,adamseabrook,9/6/2015 23:57,Show HN: LeadStage emaildomaindb Flag email a...
292819,10179305,Fake Mail Generator,http://www.fakemailgenerator.com/,2,0,akandiah,9/6/2015 22:23,Fakemail Generator
292886,10178637,Encrypted Email Service Tutanota Celebrates On...,https://tutanota.com/blog/posts/secure-email-o...,5,0,johnd03,9/6/2015 19:05,Encrypted email Service Tutanota Celebrates On...
292927,10178230,This would make good corporate mail footer,http://thinkbeforeyoumeet.com/,3,0,radoslawc,9/6/2015 17:10,This would make good corporatemail footer


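Handling every case in `email_variations` does not guarantee handling every case in the `title` column. The pattern also rewrites text that is not a mention of email: "Fake Mail Generator" becomes "Fakemail Generator" and "corporate mail" becomes "corporatemail", because `e-?\s?mail` can match an `e` at the end of one word followed by "mail".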
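One possible way to reduce these accidental rewrites is to require a word boundary before the `e`. The sketch below compares the loose pattern with a stricter, hypothetical one using `re.sub` from the standard library. The first and third titles come from the output above, and the other two are made up.

```python
import re

sample_titles = [
    "Fake Mail Generator",
    "E-Mail is not dead",                          # made-up
    "This would make good corporate mail footer",
    "e mail tips",                                 # made-up
]

loose = re.compile(r"e-?\s?mail", flags=re.I)
# \b before the e: the match must start at the beginning of a word
strict = re.compile(r"\be[-\s]?mail", flags=re.I)

loose_clean = [loose.sub("email", t) for t in sample_titles]
strict_clean = [strict.sub("email", t) for t in sample_titles]

print(loose_clean)
print(strict_clean)
```

With pandas, the same stricter pattern could be passed to `Series.str.replace` together with `regex=True`.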

## Analysis of the URLs

In [43]:
hn["url"].head()

0    http://www.regulations.gov/document?D=FDA-2015...
1     https://www.sqlite.org/sqlar/doc/trunk/README.md
2    https://medium.com/vanmoof/our-secrets-out-f21...
3    http://cacm.acm.org/magazines/2011/7/109891-al...
4    https://www.talend.com/blog/2016/05/12/talend-...
Name: url, dtype: object

In [44]:
hn["protocol"] = hn["url"].str.extract(r"(https?://)")

In [45]:
hn["protocol"] = hn["protocol"].str[:-3]

In [46]:
hn["protocol"].value_counts()

http     169622
https    109557
Name: protocol, dtype: int64

In [47]:
hn["domain"] = hn["url"].str.extract(r"([^https?://]\w+[.]\w+)")

In [48]:
hn["domain"].value_counts().head()

medium.com       15929
github.com       14419
www.nytimes       5987
www.youtube       5234
echcrunch.com     4115
Name: domain, dtype: int64

Now we're going to create **multiple capture groups** to extract the protocol, domain, and page path.

In [49]:
pattern = r"(.+)://(\w+.{0,1}\w+[.]\w+)[/?](.+)"
url_parts = hn["url"].str.extract(pattern)

Now we're going to give a name to each column by using **named capture groups**, written as `(?P<name>...)`.

In [50]:
pattern = r"(?P<protocol>.+)://(?P<domain>\w+.{0,1}\w+[.]\w+)[/?](?P<path>.+)"
url_parts = hn["url"].str.extract(pattern)
url_parts

Unnamed: 0,protocol,domain,path
0,http,www.regulations.gov,document?D=FDA-2015-D-3719-0018
1,https,www.sqlite.org,sqlar/doc/trunk/README.md
2,https,medium.com,vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43
3,http,cacm.acm.org,magazines/2011/7/109891-algorithmic-compositio...
4,https,www.talend.com,blog/2016/05/12/talend-and-Âthe-data-vaultÂ
...,...,...,...
293114,,,
293115,,,
293116,http,dangerousminds.net,comments/dying_vets_fuck_you_letter_to_george_...
293117,https,www.zend.com,en/resources/php-7
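Named groups work the same way in the standard `re` module, where `groupdict()` returns the captured parts keyed by name. The sketch below uses a simplified, hypothetical pattern in which the path is optional, so that a URL without a path still yields a protocol and a domain. It is an illustration of the syntax, not a replacement validated against the whole dataset.

```python
import re

# Hypothetical simplified pattern: explicit protocol, a domain made of
# word characters, dots and hyphens, then an optional separator and path
url_pattern = re.compile(
    r"(?P<protocol>https?)://(?P<domain>[\w.\-]+)[/?]?(?P<path>.*)"
)

# Both URLs appear in the dataset outputs shown earlier
parts = url_pattern.match(
    "https://www.sqlite.org/sqlar/doc/trunk/README.md"
).groupdict()
no_path = url_pattern.match("http://inside.com").groupdict()

print(parts)
print(no_path)
```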


# Conclusions

We learned advanced regular expression techniques to help us work with data, including:

- Character classes to match certain groups of characters, including sets to match different capitalizations of programming languages.
- Quantifiers to match different quantities of characters, including matching different variations of "email".
- Negative character classes for matching anything except certain groups of characters.
- Word boundaries to match only specific instances of words.
- Positional anchors to match only at the start and end of strings.
- The ignorecase flag to make patterns case insensitive.
- Using multiple capture groups to extract URL data.
- How to use lookarounds to customize matches based on the surrounding text.
- How to substitute a regular expression match to clean inconsistent data.
- How to use named capture groups to extract dataframes from a text column.

These techniques let us clean and analyze text data in a powerful way, and they are among the most useful tools for working with text. The key with regular expressions is to understand the core concepts and what is possible, and to know where and how to look up the rest.