# 2 Regular Expressions Module

1. Initialize a variable python_mentions with the integer value 0.

1. Create a string — pattern — containing a regular expression pattern that uses a set to match Python or python.

1. Use a loop to iterate over each item in the titles list, and for each item:
    - Use the re.search() function to check whether pattern matches the title.
    - If re.search() returns a match object, increment (add 1 to) the python_mentions variable.

In [1]:
import pandas as pd
import re

hn = pd.read_csv("04_Data_Cleaning/04_2_Advanced_data_cleaning/hacker_news.csv")

titles = hn["title"].tolist()
python_mentions = 0
pattern = "[Pp]ython"

for t in titles:
    if re.search(pattern, t):
        python_mentions += 1
        
print(python_mentions)

160


# 3 Counting Matches with pandas Methods

The `Series.str.contains()` method tests whether each string in a Series matches a particular regex pattern.

Next, we'll create our regex pattern, and use `Series.str.contains()` to compare to each value in our series:

```
pattern = "[Bb]lue"
pattern_contained = eg_series.str.contains(pattern)
```

The result is a boolean mask: a series of **True** / **False** values.

One of the neat things about boolean masks is that you can use the `Series.sum()` method to sum all the values in the boolean mask, with each True value counting as 1, and each False as 0. This means that we can easily count the number of values in the original series that matched our pattern.

If we wanted, we could use method chaining to do the whole operation on one line:

```
pattern_count = eg_series.str.contains(pattern).sum()
print(pattern_count)
```
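The snippets above use `eg_series` without defining it. As a minimal sketch, assuming a small made-up series of strings (the values below are hypothetical and not taken from the Hacker News data), the mask and the count look like this:

```python
import pandas as pd

# Hypothetical example data, invented for illustration only.
eg_series = pd.Series(["Blue sky", "green grass", "deep blue sea", "Red", "BLUE"])

pattern = "[Bb]lue"

# Boolean mask: True where the pattern is found anywhere in the string.
pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained.tolist())  # [True, False, True, False, False]

# True counts as 1 and False as 0, so the sum is the number of matches.
pattern_count = int(pattern_contained.sum())
print(pattern_count)  # 2
```

Note that `"BLUE"` does not match, because the set `[Bb]` only covers the first letter.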




## Instructions

1. Assign the title column from the hn dataframe to the variable titles.

1. Use `Series.str.contains()` and `Series.sum()` with the provided regex pattern to count how many Hacker News titles contain **Python** or **python**. Assign the result to `python_mentions`.


In [2]:
pattern = '[Pp]ython'

titles = hn["title"]

python_mentions = titles.str.contains(pattern).sum()
print(python_mentions)

160


# 4 Using Regular Expressions to Select Data

On the previous two screens, we used regular expressions to count how many titles contain Python or python. What if we wanted to **view** those titles?

In that case, we can use the boolean array returned by `Series.str.contains()` to select just those rows from our series.

```
titles = hn['title']

py_titles_bool = titles.str.contains("[Pp]ython")
print(py_titles_bool.head())
```

```
0    False
1    False
2    False
3    False
4    False
Name: title, dtype: bool
```

Then, we can use that boolean array to select just the matching rows:


```
py_titles = titles[py_titles_bool]
print(py_titles.head())

```

We can also do it in a streamlined, single line of code:

```
py_titles = titles[titles.str.contains("[Pp]ython")]
print(py_titles.head())
```
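One caveat worth keeping in mind: if a series contains missing values, `str.contains()` may return a missing value for those rows (the exact behavior depends on the pandas version and the column's dtype), and such a mask cannot always be used for selection. A minimal sketch, assuming a hypothetical series with one missing title, passes `na=False` so that missing values count as non-matches:

```python
import pandas as pd

# Hypothetical titles, including one missing value.
sample_titles = pd.Series(
    ["Learning Python the hard way", None, "python 3.6 released", "Go vs Rust"]
)

# na=False makes missing values evaluate to False in the mask.
mask = sample_titles.str.contains("[Pp]ython", na=False)
py_sample = sample_titles[mask]
print(py_sample.tolist())  # ['Learning Python the hard way', 'python 3.6 released']
```

Whether the `title` column in the Hacker News dataset has missing values is not stated here, so this may not be needed for the exercises.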

## Instructions

Use `Series.str.contains()` to create a series of the values from titles that contain **Ruby** or **ruby**. Assign the result to `ruby_titles`.

In [3]:
titles = hn['title']
pattern = "[Rr]uby"

ruby_titles = titles[titles.str.contains(pattern)]
print(ruby_titles)

190                     Ruby on Google AppEngine Goes Beta
484           Related: Pure Ruby Relational Algebra Engine
1388     Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949     Rewriting a Ruby C Extension in Rust: How a Na...
2022     Show HN: CrashBreak  Reproduce exceptions as f...
2163                   Ruby 2.3 Is Only 4% Faster than 2.2
2306     Websocket Shootout: Clojure, C++, Elixir, Go, ...
2620                       Why Startups Use Ruby on Rails?
2645     Ask HN: Should I continue working a Ruby gem f...
3290     Ruby on Rails and the importance of being stup...
3749     Telegram.org Bot Platform Webhooks Server, for...
3874     Warp Directory (wd) unix command line tool for...
4026     OS X 10.11 Ruby / Rails users can install ther...
4163     Charles Nutter of JRuby Banned by Rubinius for...
4602     Quiz: Ruby or Rails? Matz and DHH were not abl...
5832     Show HN: An experimental Python to C#/Go/Ruby/...
6180     Shrine  A new solution for handling file uploa.

# 5 Quantifiers

We learned that we could use braces ({}) to specify that a character repeats in our regular expression. For instance, if we wanted to write a pattern that matches the numbers in text from `1000` to `2999` we could write the regular expression below:

![image.png](attachment:image.png)

**Quantifiers** specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths. As an example, we might want to match both `e-mail` and `email`. To do this, we would want to specify to match `-` either zero or one times.

![image-2.png](attachment:image-2.png)

You might notice that the last two examples above omit the **first** and **last** character as wildcards, in the same way that we can omit the first or last indices when slicing lists.

In addition to **numeric quantifiers**, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.

![image-3.png](attachment:image-3.png)
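The quantifiers described above can be tried out directly with the `re` module. This is a rough sketch: the patterns are assumptions consistent with the text (a number from `1000` to `2999`, and `email` or `e-mail`), and the test strings are made up.

```python
import re

# Numeric quantifier: one digit from 1-2 followed by exactly three digits 0-9.
year_like = "[1-2][0-9]{3}"
print(bool(re.search(year_like, "Founded in 1999")))  # True
print(bool(re.search(year_like, "Only 999 left")))    # False

# "?" makes the preceding "-" optional (zero or one times).
email = "e-?mail"
print([bool(re.search(email, s)) for s in ["email", "e-mail", "e--mail"]])
# [True, True, False]

# Single-character quantifiers applied to "b":
print(re.findall("ab*", "a ab abbb"))  # zero or more: ['a', 'ab', 'abbb']
print(re.findall("ab+", "a ab abbb"))  # one or more:  ['ab', 'abbb']
print(re.findall("ab?", "a ab abbb"))  # zero or one:  ['a', 'ab', 'ab']
```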





## Instructions

1. Use a regular expression and `Series.str.contains()` to create a boolean mask that matches items from titles containing `email` or `e-mail`. Assign the result to `email_bool`.

1. Use `email_bool` to count the number of titles that matched the regular expression. Assign the result to `email_count`.

1. Use `email_bool` to select only the items from `titles` that matched the regular expression. Assign the result to `email_titles`.

In [5]:
# The `titles` variable is available from 
# the previous screens

pattern = "e-?mail" #e-mail or email

#find email boolean mask
email_bool =  titles.str.contains(pattern)

#count email titles
email_count = email_bool.sum()

# Select only the titles that matched, reusing the mask
email_titles = titles[email_bool]

# 6 Character Classes

Some stories submitted to Hacker News include a topic tag in brackets, like `[pdf]`. In this screen, our task is going to be to find how many titles in our dataset have tags.

Our first inclination may be to create the regex `[pdf]`. Unfortunately, the brackets would be interpreted as a set, so our pattern would match the single characters `p`, `d`, or `f`.

![image.png](attachment:image.png)

To match the substring `"[pdf]"`, we can use backslashes to escape both the open and closing brackets: `\[pdf\]`.

![image-2.png](attachment:image-2.png)

To match unknown characters using regular expressions, we use **character classes**. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:

1. The set notation using brackets to match any of a number of characters.

1. The range notation, which we used to match ranges of digits (like `[0-9]`).

![image-3.png](attachment:image-3.png)

Just like with quantifiers, there are some other common character classes which we'll use a lot.

![image-4.png](attachment:image-4.png)

The one that we'll be using to match characters in tags is `\w`, which represents any word character: a letter, a digit, or an underscore. Each character class represents a single character, so to match multiple characters (e.g. words like `video` and `pdf`), we'll need to combine them with quantifiers.

In order to match word characters between our brackets, we can combine the word character class (`\w`) with the 'one or more' quantifier (`+`), giving us a combined pattern of `\w+`.

This will match sequences like `pdf`, `video`, `Python`, and `2018` but won't match a sequence containing a space or punctuation character like `PHP-DEV` or `XKCD Flowchart`. If we wanted to match those tags as well, we could use `.+`; however, in this case, we're just interested in single-word tags without special characters.

Let's quickly recap the concepts we learned in this screen:

* We can use a backslash to escape characters that have special meaning in regular expressions (e.g. `\[` will match an open bracket character).

* Character classes let us match certain groups of characters (e.g. `\w` will match any word character).

* Character classes can be combined with quantifiers when we want to match different numbers of characters.
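The points above can be checked with a short sketch using the `re` module. The titles are invented for illustration, and the patterns are written as raw strings (the `r` prefix) so that the backslashes reach the regex engine unchanged.

```python
import re

title = "Clean Code [pdf]"  # hypothetical title

# Unescaped brackets form a set: each match is a single p, d, or f.
print(re.findall("[pdf]", title))     # ['d', 'p', 'd', 'f']

# Escaped brackets match the literal substring "[pdf]".
print(re.findall(r"\[pdf\]", title))  # ['[pdf]']

# \w+ between escaped brackets matches any single-word tag.
tag = r"\[\w+\]"
samples = [
    "Intro to regex [video]",
    "Survey results [2018]",
    "Mailing list [PHP-DEV]",
    "Comic [XKCD Flowchart]",
    "No tag here",
]
results = [bool(re.search(tag, s)) for s in samples]
print(results)  # [True, True, False, False, False]
```

The tags containing a hyphen or a space are not matched, because neither character is a word character.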


## Instructions

1. Write a regular expression, assigning it as a string to the variable `pattern`. The regular expression should match, in order:

    1. A single open bracket character.

    2. One or more word characters.

    3. A single close bracket character.

1. Use the regular expression to select only items from titles that match. Assign the result to the variable `tag_titles`.

1. Count how many matching titles there are. Assign the result to `tag_count`.




In [None]:
#A single open bracket character. - \[
#One or more word characters. - \w+
#A single close bracket character. - \]

# Raw string (r prefix) so Python passes the backslashes to the regex engine unchanged
pattern = r"\[\w+\]"

#find pattern boolean mask
tag_titles =  titles[titles.str.contains(pattern)]

#count matched  titles
tag_count = tag_titles.shape[0]



# 7 Accessing the Matching Text with Capture Groups

In Python, a backslash followed by certain characters represents an escape sequence, like the `\n` sequence, which represents a new line. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring `\b`:

```

print('hello\b world')
hell world
```

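Python interpreted `\b` as a backspace character, so the `o` disappears from the printed output. To avoid this, we use raw strings, which we denote by prefixing our string with the `r` character. Let's take a look at the code from above with a raw string: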

```
print(r'hello\b world')
hello\b world
```
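To see why this matters for regular expressions specifically, here is a small sketch. In regex syntax, `\b` means a word boundary, but the regex engine only receives it if Python has not already converted it to a backspace. The test strings are made up.

```python
import re

# Without the r prefix, Python turns each \b into one backspace character.
print(len("\bcat\b"))   # 5
print(len(r"\bcat\b"))  # 7

# Raw string: the regex engine sees \b and treats it as a word boundary.
print(bool(re.search(r"\bcat\b", "the cat sat")))  # True
print(bool(re.search(r"\bcat\b", "concatenate")))  # False

# Regular string: the pattern contains literal backspaces, so nothing matches.
print(bool(re.search("\bcat\b", "the cat sat")))   # False
```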

In the previous screen, we were able to calculate that **444 of the 20,100** Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

In order to do this, we'll need to use **capture groups**. Capture groups allow us to specify one or more groups within our match that we can access separately.

We specify capture groups using **parentheses**. Let's add open and close parentheses to the pattern we wrote in the previous screen, and break down how each character in our regular expression works:

![image.png](attachment:image.png)

We use the `Series.str.extract()` method to extract the match within our parentheses:

```python
pattern = r"(\[\w+\])"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)
```

```
    67        [pdf]
    101    [German]
    160       [pdf]
    163       [pdf]
    196      [Beta]
    Name: title, dtype: object

```

We can move our parentheses inside the brackets to get just the text. We also specify `expand=False` with the `Series.str.extract()` method so that it returns a series rather than a dataframe.

```python
pattern = r"\[(\w+)\]"
tag_5_matches = tag_5.str.extract(pattern, expand=False)
print(tag_5_matches)
```
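The `tag_5` series is not defined in these notes, so here is a self-contained sketch of the same idea on a hypothetical three-item series. It compares the two capture group placements and shows the effect of `expand`:

```python
import pandas as pd

# Hypothetical titles, invented for illustration.
sample = pd.Series(["Clean Code [pdf]", "Regex talk [video]", "No tag here"])

# Capture group around the brackets: the brackets are part of the result.
with_brackets = sample.str.extract(r"(\[\w+\])", expand=False)
print(with_brackets.dropna().tolist())  # ['[pdf]', '[video]']

# Capture group inside the brackets: only the tag text is returned.
text_only = sample.str.extract(r"\[(\w+)\]", expand=False)
print(text_only.dropna().tolist())      # ['pdf', 'video']

# With the default expand=True, the result is a dataframe with one column per group.
as_frame = sample.str.extract(r"\[(\w+)\]")
print(type(as_frame).__name__)          # DataFrame
```

Titles without a tag produce a missing value, which is why `dropna()` is applied before printing.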

## Instructions

Let's use this technique to extract all of the tags from the Hacker News titles and build a frequency table of those tags.

1. Create a capture group inside the brackets.

1. Use `Series.str.extract()` and `Series.value_counts()` with the modified regex pattern to produce a frequency table of all the tags in the titles series. Assign the frequency table to `tag_freq`.

In [None]:
pattern = r"\[(\w+)\]"
    
tag_titles = titles.str.extract(pattern,expand=False)

tag_freq = tag_titles.value_counts()
print(tag_freq)

# 8 Negative Character Classes

 In reality, regular expressions are often complex. When creating complex regular expressions, you often need to work iteratively so you can find "bad" instances that match your pattern and then exclude them.

 In order to work faster as you build your regular expression, it can be helpful to create a function that returns the first few matching strings:

```python
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression.
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10
```
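The function above relies on the global `titles` series. As one possible variation (the name `first_n_matches` and its parameters are hypothetical, not part of the original lesson), the series and the number of rows can be passed in explicitly, which makes the helper easy to try on a small made-up sample:

```python
import pandas as pd


def first_n_matches(series, pattern, n=10):
    """Return the first n values of series that match the regex pattern."""
    return series[series.str.contains(pattern)].head(n)


# Hypothetical titles, invented for illustration.
sample_titles = pd.Series(
    ["Ruby on Rails tips", "JRuby 9000 released", "Why Go?", "ruby gems I like"]
)

matches = first_n_matches(sample_titles, "[Rr]uby", n=2)
print(matches.tolist())  # ['Ruby on Rails tips', 'JRuby 9000 released']
```

The second result shows the kind of "bad" match that iterative refinement is meant to catch: `JRuby` contains `Ruby` but refers to something different.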