# 1. Introduction

In [1]:
import pandas as pd

hn = pd.read_csv("hacker_news.csv")

In [2]:
print(hn.head())

         id                                              title  \
0  12224879                          Interactive Dynamic Video   
1  11964716  Florida DJs May Face Felony for April Fools' W...   
2  11919867       Technology ventures: From Idea to Enterprise   
3  10301696  Note by Note: The Making of Steinway L1037 (2007)   
4  10482257  Title II kills investment? Comcast and other I...   

                                                 url  num_points  \
0            http://www.interactivedynamicvideo.com/         386   
1  http://www.thewire.com/entertainment/2013/04/f...           2   
2  https://www.amazon.com/Technology-Ventures-Ent...           3   
3  http://www.nytimes.com/2007/11/07/movies/07ste...           8   
4  http://arstechnica.com/business/2015/10/comcas...          53   

   num_comments      author       created_at  
0            52    ne0phyte   8/4/2016 11:52  
1             1    vezycash  6/23/2016 22:20  
2             1     hswarna   6/17/2016 0:01  
3     

In [3]:
print(hn.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20099 entries, 0 to 20098
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            20099 non-null  int64 
 1   title         20099 non-null  object
 2   url           17659 non-null  object
 3   num_points    20099 non-null  int64 
 4   num_comments  20099 non-null  int64 
 5   author        20099 non-null  object
 6   created_at    20099 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB
None


In [4]:
print(hn.shape)

(20099, 7)


# 2. The Regular Expression Module

In [5]:
import re

titles = hn["title"].tolist()

# Count the titles that mention "Python" or "python"
python_mentions = 0
pattern = "[Pp]ython"

for title in titles:
    if re.search(pattern, title):
        python_mentions += 1
print(python_mentions)

160


# 3. Counting Matches with pandas Methods

In [6]:
pattern = "[Pp]ython"
titles = hn["title"]

# str.contains() returns a boolean Series; summing it counts the matches
python_mentions = titles.str.contains(pattern).sum()

print(python_mentions)

160


# 4. Using Regular Expressions to Select Data

In [7]:
titles = hn["title"]

pattern = "[Rr]uby"

ruby_titles = titles[titles.str.contains(pattern)]

print(ruby_titles)

190                     Ruby on Google AppEngine Goes Beta
484           Related: Pure Ruby Relational Algebra Engine
1388     Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949     Rewriting a Ruby C Extension in Rust: How a Na...
2022     Show HN: CrashBreak  Reproduce exceptions as f...
2163                   Ruby 2.3 Is Only 4% Faster than 2.2
2306     Websocket Shootout: Clojure, C++, Elixir, Go, ...
2620                       Why Startups Use Ruby on Rails?
2645     Ask HN: Should I continue working a Ruby gem f...
3290     Ruby on Rails and the importance of being stup...
3749     Telegram.org Bot Platform Webhooks Server, for...
3874     Warp Directory (wd) unix command line tool for...
4026     OS X 10.11 Ruby / Rails users can install ther...
4163     Charles Nutter of JRuby Banned by Rubinius for...
4602     Quiz: Ruby or Rails? Matz and DHH were not abl...
5832     Show HN: An experimental Python to C#/Go/Ruby/...
6180     Shrine  A new solution for handling file uploa.

# 5. Quantifiers

![](https://s3.amazonaws.com/dq-content/354/quantifier_example.svg)

![](https://s3.amazonaws.com/dq-content/354/quantifiers_numeric.svg)

![](https://s3.amazonaws.com/dq-content/354/quantifiers_other.svg)

In [8]:
email_bool = titles.str.contains("e-?mail")

email_count = email_bool.sum()

email_titles = titles[email_bool]

print("Number of email/e-mail occurrences", email_count, "Email Titles", email_titles, sep="\n")

Number of email/e-mail occurrences
86
Email Titles
119      Show HN: Send an email from your shell to your...
313          Disposable emails for safe spam free shopping
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
2421                    Ashley Madison hack treating email
                               ...                        
18098    House panel looking into Reddit post about Cli...
18583    Mailgen  Generates clean, responsive HTML for ...
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19446    Tell HN: Secure email provider Riseup will run...
Name: title, Length: 86, dtype: object
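
To make the quantifiers used above concrete, here is a minimal sketch with the standard `re` module. The sample strings are hypothetical and are not taken from the dataset; the point is only to show how `?`, `{1,2}`, and `*` change what the hyphen is allowed to do.

```python
import re

# Hypothetical sample strings (not from hacker_news.csv)
samples = ["email", "e-mail", "e--mail", "emal"]

# "-?" allows zero or one hyphen between "e" and "mail"
zero_or_one = [s for s in samples if re.search(r"e-?mail", s)]

# "-{1,2}" requires one or two hyphens
one_or_two = [s for s in samples if re.search(r"e-{1,2}mail", s)]

# "-*" allows any number of hyphens, including none
zero_or_more = [s for s in samples if re.search(r"e-*mail", s)]

print(zero_or_one)   # ['email', 'e-mail']
print(one_or_two)    # ['e-mail', 'e--mail']
print(zero_or_more)  # ['email', 'e-mail', 'e--mail']
```

Note that `"e-?mail"` is case sensitive, so the count of 86 above does not include titles such as "Email" or "E-mail".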


# 6. Character Classes

![](https://s3.amazonaws.com/dq-content/354/character_classes_v2_1.svg)

In [9]:
# Raw string so the backslashes reach the regex engine unchanged
pattern = r"\[\w+\]"

tag_titles = titles.str.contains(pattern)
tag_count = tag_titles.sum()

print(tag_count)

444
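
Patterns that contain backslashes are safest written as raw strings (the `r` prefix). The sketch below illustrates why, using `\b`, the word boundary covered later in this lesson: in a plain string Python turns `\b` into a backspace character before the regex engine ever sees it. Escapes Python does not recognize, such as `\[` and `\w`, are currently passed through unchanged, but recent Python versions warn about them. The sample sentence is hypothetical.

```python
import re

# In a plain string, "\b" is a single backspace character
plain = "\bJava\b"

# In a raw string, the backslash is preserved, so the regex engine sees \b
raw = r"\bJava\b"

print(len(plain), len(raw))  # 6 8

# The plain version looks for backspace + "Java" + backspace and finds nothing
print(re.search(plain, "I like Java"))  # None

# The raw version matches "Java" as a whole word
print(re.search(raw, "I like Java") is not None)  # True
```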


# 7. Accessing the Matching Text with Capture Groups

![](https://s3.amazonaws.com/dq-content/354/tags_syntax_breakdown_v2.svg)

In [10]:
# The parentheses capture the text between the square brackets
pattern = r"\[(\w+)\]"

tag_freq = titles.str.extract(pattern, expand=False).value_counts()

print(tag_freq)

pdf            276
video          111
audio            3
2015             3
slides           2
beta             2
2014             2
blank            1
Ubuntu           1
gif              1
detainee         1
Python           1
Map              1
React            1
crash            1
Benchmark        1
HBR              1
CSS              1
5                1
JavaScript       1
transcript       1
ANNOUNCE         1
SpaceX           1
Excerpt          1
Petition         1
GOST             1
coffee           1
survey           1
Live             1
Challenge        1
repost           1
SPA              1
Videos           1
png              1
song             1
USA              1
satire           1
viz              1
map              1
Beta             1
Infograph        1
NSFW             1
videos           1
1996             1
Australian       1
ask              1
German           1
2008             1
much             1
Skinnywhale      1
comic            1
updated          1
Name: title,
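
As a rough illustration of what the capture group does for a single title, the sketch below applies the same pattern with `re.search` and reads the captured text with `group(1)`. The titles are hypothetical examples. In the pandas version above, titles without a tag produce missing values, which `value_counts()` leaves out by default.

```python
import re

# Hypothetical titles (not from hacker_news.csv)
sample_titles = [
    "Deep learning notes [pdf]",
    "Conference keynote [video]",
    "No tag here",
]

pattern = r"\[(\w+)\]"

tags = []
for title in sample_titles:
    match = re.search(pattern, title)
    # group(0) would be the whole match, e.g. "[pdf]";
    # group(1) is only the text captured by the parentheses
    tags.append(match.group(1) if match else None)

print(tags)  # ['pdf', 'video', None]
```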

# 8. Negative Character Classes

![](https://s3.amazonaws.com/dq-content/354/negative_character_classes.svg)

In [11]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

# "Java" or "java" followed by any character other than "S" or "s"
pattern = r"[Jj]ava[^Ss]"

first_10_matches(pattern)

java_titles = titles[titles.str.contains(pattern)]

print(java_titles)

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1840                     Adopting RxJava on the Airbnb App
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
2910                 2016 JavaOne Intel Keynote  32mn Talk
3452     What are the Differences Between Java Platform...
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
5947                                        JavaFX is dead
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

# 9. Word Boundaries

In [12]:
# "Java" or "java" as a whole word
pattern = r"\b[Jj]ava\b"

first_10_matches(pattern)

java_titles = titles[titles.str.contains(pattern)]

print(java_titles)

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1023                          Pippo  Web framework in Java
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
3228                               Comparing Rust and Java
3452     What are the Differences Between Java Platform...
3627                     Friends don't let friends do Java
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.
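
Comparing the two outputs: titles that end in "Java" (such as rows 1023 and 3228) appear only with the word-boundary pattern, while "RxJava", "JavaOne", and "JavaFX" appear only with the negative character class. The sketch below reproduces that difference on a few hypothetical strings. A negative class such as `[^Ss]` still has to consume one character, so it cannot match when "Java" is the last thing in the string.

```python
import re

# Hypothetical sample strings modeled on the titles above
samples = ["Java rocks", "JavaScript", "I like Java", "Adopting RxJava today"]

negative_class = r"[Jj]ava[^Ss]"
word_boundary = r"\b[Jj]ava\b"

neg_matches = [s for s in samples if re.search(negative_class, s)]
wb_matches = [s for s in samples if re.search(word_boundary, s)]

# Misses "I like Java" (nothing follows "Java") but accepts "RxJava "
print(neg_matches)  # ['Java rocks', 'Adopting RxJava today']

# Matches "Java" only as a standalone word
print(wb_matches)   # ['Java rocks', 'I like Java']
```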

# 10. Matching at the Start and End of Strings

![](https://s3.amazonaws.com/dq-content/354/positional_anchors.svg)

In [13]:
# Tags at the very start of the title
beginning_count = titles.str.contains(r"^\[\w+\]").sum()
print(beginning_count)

# Tags at the very end of the title
ending_count = titles.str.contains(r"\[\w+\]$").sum()
print(ending_count)

15
417
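
A small sketch of the two anchors on hypothetical strings: `^` ties the tag to the start of the string and `$` ties it to the end, so a tag in the middle of a title matches neither pattern.

```python
import re

# Hypothetical sample strings (not from hacker_news.csv)
samples = ["[pdf] Annual report", "Annual report [pdf]", "Annual [pdf] report"]

at_start = [s for s in samples if re.search(r"^\[\w+\]", s)]
at_end = [s for s in samples if re.search(r"\[\w+\]$", s)]

print(at_start)  # ['[pdf] Annual report']
print(at_end)    # ['Annual report [pdf]']
```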


# 11. Challenge: Using Flags to Modify Regex Patterns

In [14]:
import re

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails'])

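# Optional hyphen or whitespace after "e", optional plural "s", whole words only
pattern = r"\be[\-\s]?mails?\b"

# Without the flag, only the lowercase variants match
email_tests[email_tests.str.contains(pattern)]

# re.I (short for re.IGNORECASE) makes the pattern case insensitive
emails = email_tests.str.contains(pattern, flags=re.I)

email_mentions = emails.sum()
print("Test String", email_tests[emails], email_mentions, sep="\n")

emails = titles.str.contains(pattern, flags=re.I)

email_mentions = emails.sum()

print("HN Titles", titles[emails], email_mentions, sep="\n")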


Test String
0       email
1       Email
2      e Mail
3      e mail
4      E-mail
5      e-mail
6       eMail
7      E-Mail
8       EMAIL
9      emails
10     Emails
11    E-Mails
dtype: object
12
HN Titles
119      Show HN: Send an email from your shell to your...
161      Computer Specialist Who Deleted Clinton Emails...
174                                        Email Apps Suck
261      Emails Show Unqualified Clinton Foundation Don...
313          Disposable emails for safe spam free shopping
                               ...                        
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19395    I used HTML Email when applying for jobs, here...
19446    Tell HN: Secure email provider Riseup will run...
19905    Gmail Will Soon Warn Users When Emails Arrive ...
Name: title, Length: 141, dtype: object
141
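
The sketch below shows the effect of the flag in isolation, using the same pattern with `re.search` on a few hypothetical strings. It also suggests why a title like row 19905 is in the output: "Gmail" itself does not match, because the pattern needs an "e" at a word boundary, but "Emails" later in the same title does.

```python
import re

pattern = r"\be[\-\s]?mails?\b"

# Hypothetical sample strings
samples = ["EMAIL", "e Mail", "E-Mails", "gmail", "emailing"]

case_sensitive = [s for s in samples if re.search(pattern, s)]
case_insensitive = [s for s in samples if re.search(pattern, s, flags=re.I)]

print(case_sensitive)    # []
print(case_insensitive)  # ['EMAIL', 'e Mail', 'E-Mails']
```

Going from 86 matches in the quantifiers section to 141 here comes from a broader pattern as well as the flag: optional whitespace and capital letters are now accepted.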


# 12. Next Steps

In this lesson, we learned the basics of using regular expressions to perform powerful text matching, including:

* Character classes to match certain groups of characters, including sets such as `[Pp]` to match different capitalizations of programming language names.
* Quantifiers to match different quantities of characters, including matching different variations of "email."
* Negative character classes for matching anything except certain groups of characters.
* Word boundaries to match only specific instances of words.
* Positional anchors to match only at the start and end of strings.
* The `re.IGNORECASE` flag (`re.I`) to make patterns case insensitive.

In the next lesson, we'll expand on our regular expression knowledge with some advanced regex concepts!