<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#RegEx" data-toc-modified-id="RegEx-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>RegEx</a></span><ul class="toc-item"><li><span><a href="#Special-Characters" data-toc-modified-id="Special-Characters-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Special Characters</a></span></li><li><span><a href="#Capture-Groups" data-toc-modified-id="Capture-Groups-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Capture Groups</a></span></li><li><span><a href="#Negative-Character-Classes" data-toc-modified-id="Negative-Character-Classes-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Negative Character Classes</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Word-Boundary-Anchor" data-toc-modified-id="Word-Boundary-Anchor-1.3.0.1"><span class="toc-item-num">1.3.0.1&nbsp;&nbsp;</span>Word Boundary Anchor</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Continuing-RegEx" data-toc-modified-id="Continuing-RegEx-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Continuing RegEx</a></span><ul class="toc-item"><li><span><a href="#Lookarounds" data-toc-modified-id="Lookarounds-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Lookarounds</a></span></li><li><span><a href="#Backreferences" data-toc-modified-id="Backreferences-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Backreferences</a></span></li><li><span><a href="#re.sub()" data-toc-modified-id="re.sub()-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>re.sub()</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Extracting-Domains-and-URLs" data-toc-modified-id="Extracting-Domains-and-URLs-2.3.0.1"><span class="toc-item-num">2.3.0.1&nbsp;&nbsp;</span>Extracting Domains and URLs</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#List-Comprehension-&amp;-Lambda-Functions" data-toc-modified-id="List-Comprehension-&amp;-Lambda-Functions-3"><span 
class="toc-item-num">3&nbsp;&nbsp;</span>List Comprehension &amp; Lambda Functions</a></span><ul class="toc-item"><li><span><a href="#List-Comprehension" data-toc-modified-id="List-Comprehension-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>List Comprehension</a></span></li><li><span><a href="#Lambda-Functions" data-toc-modified-id="Lambda-Functions-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Lambda Functions</a></span></li></ul></li></ul></div>

# RegEx

Regular expressions are a powerful way of building patterns to match text. In the first two missions of this Data Cleaning Advanced course, we're going to extend our knowledge of this extremely powerful tool that every data scientist should be familiar with.

As powerful as regular expressions are, they can be difficult to learn at first and the syntax can look visually intimidating. As a result, a lot of students end up disliking regular expressions and try to avoid using them, instead opting to write more cumbersome code.

`(.+)://([\w\.]+)/?(.*)`

That said, learning (and loving!) regular expressions is a worthwhile investment:

- Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.
- Regular expressions are often faster to execute than their manual equivalents.
- Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.

One thing to keep in mind before we start: don't expect to remember all of the regular expression syntax. The most important thing is to understand the core principles, what is possible, and where to look up the details. This will mean you can quickly jog your memory whenever you need regular expressions.

With that in mind, don't be put off if some things in these missions don't stick in your memory. As long as you are able to write and understand regular expressions with the help of documentation and/or other reference guides, you have all the skills you need to excel.

When working with regular expressions, we use the term **pattern** to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has **matched**.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string `"and"` within another string, the regex pattern for that is simply `and`:

|RegEx|String      |Matches              |String with match|
|-----|------------|---------------------|-----------------|
| and |hand        |yes                  |h`and`           |
| and |android     |yes                  |`and`roid        |
| and |Andrew      |no                   |                 |
| and |antidote    |no                   |                 |


In the third example above, the pattern `and` does not match `Andrew` because regular expressions are case-sensitive by default: even though `a` and `A` are the same letter, they are different characters.

We previously used regular expressions with pandas, but Python also has a built-in module for regular expressions: The `re` [module](https://docs.python.org/3/library/re.html#module-re). This module contains a number of different functions and classes for working with regular expressions. One of the most useful functions from the `re` module is the `re.search()` [function](https://docs.python.org/3/library/re.html#re.search), which takes two required arguments:

- The regex pattern
- The string we want to search for that pattern

`import re
m = re.search("and", "hand")
print(m)`

`< _sre.SRE_Match object; span=(1, 4), match='and' >`

The `re.search()` function will return a `Match` [object](https://docs.python.org/3/library/re.html#match-objects) if the pattern is found anywhere within the string. If the pattern is not found, `re.search()` returns `None`:

`m = re.search("and", "antidote")
print(m)`

`None`

We'll learn more about match objects later. For now, we can use the fact that the boolean value of a match object is `True` while `None` is `False` to easily check whether our regex matches each string in a list. We'll create a list of three simple strings to use while learning these concepts:

`string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]`

`pattern = "Blue"`

`for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")`
        
`Match
No Match
No Match`

So far, we haven't done anything with regular expressions that we couldn't do using the `in` keyword. The power of regular expressions comes when we use one of the special character sequences.

The first of these we'll learn is called a **set**. A set allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match for in square brackets:

`[msb]end`

- `[` - Start Set
- `]` - End Set
- `msb` - Look for `m`, `s`, or `b`
- `end` - the substring end

The regular expression above will match the strings `mend`, `send`, and `bend`.
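As a quick check of the set syntax, the sketch below runs `[msb]end` against a handful of words. The word list is made up for illustration and is not part of the lesson data.

```python
import re

# Hypothetical sample words, chosen only to illustrate the set syntax.
words = ["mend", "send", "bend", "lend", "tend"]

pattern = "[msb]end"

# Keep only the words in which the pattern is found.
matched = [w for w in words if re.search(pattern, w)]
print(matched)  # ['mend', 'send', 'bend']
```

`lend` and `tend` are left out because `l` and `t` are not in the set.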

If you look closely at `string_list`, you'll notice the first string contains the substring `Blue` with a capital letter, whereas the third string contains the substring `blue` in all lowercase. We can use the set `[Bb]` for the first character so that we can match both variations, and then use that to count how many times `Blue` or `blue` occur in the list:

`blue_mentions = 0
pattern = "[Bb]lue"`

`for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1`

`print(blue_mentions)`

`2`

We've learned that we should avoid using loops in pandas, and that vectorized methods are often faster and require less code.

In the data cleaning course, we learned that the `Series.str.contains()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) can be used to test whether a Series of strings match a particular regex pattern. Let's look at how we can replicate the example from the previous screen using pandas.

We'll start by creating a pandas object containing our strings:

`eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]`

`import pandas as pd
eg_series = pd.Series(eg_list)
print(eg_series)`

Next, we'll create our regex pattern, and use `Series.str.contains()` to compare to each value in our series:

`pattern = "[Bb]lue"`

`pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)`

The result is a boolean mask: a series of `True`/`False` values.

One of the neat things about boolean masks is that you can use the `Series.sum()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html) to sum all the values in the boolean mask, with each `True` value counting as `1`, and each `False` as `0`. This means that we can easily count the number of values in the original series that matched our pattern:

`pattern_count = pattern_contained.sum()
print(pattern_count)`

`2`

**Check this main**

`pattern = '[Pp]ython'
titles = hn["title"].tolist()
python_mentions = pd.Series(titles).str.contains(pattern).sum()`

On the previous two screens, we used regular expressions to count how many titles contain `Python` or `python`. What if we wanted to view those titles?

In that case, we can use the boolean array returned by `Series.str.contains()` to select just those rows from our series. Let's look at that in action, starting by creating the boolean array.
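The code that creates the boolean array is not shown here, so the following is a minimal sketch of that step. It assumes `titles` is a pandas Series of story titles; the three titles below are hypothetical stand-ins for the Hacker News data.

```python
import pandas as pd

# Hypothetical stand-in for the Hacker News titles used in the lesson.
titles = pd.Series([
    "Show HN: A Python library for parsing logs",
    "Why Ruby is still relevant",
    "Speeding up python imports",
])

# Boolean mask: True wherever the title contains "Python" or "python".
py_titles_bool = titles.str.contains("[Pp]ython")
print(py_titles_bool.tolist())  # [True, False, True]
```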

Then, we can use that boolean array to select just the matching rows:

`py_titles = titles[py_titles_bool]`

We can also do it in a streamlined, single line of code:

`py_titles = titles[titles.str.contains("[Pp]ython")]
print(py_titles.head())`

Let's use this technique to select all titles that mention the programming language Ruby, using a set to account for whether the word is capitalized or not.

`titles = hn['title']
ruby_titles = titles[titles.str.contains("[Rr]uby")]`

In the data cleaning course, we learned that we could use braces (`{}`) to specify that a character repeats in our regular expression. For instance, if we wanted to write a pattern that matches the numbers in text from `1000` to `2999` we could write the regular expression below:

`[1-2][0-9]{3}`

- `[1-2]` - Any digit between 1 and 2
- `[0-9]` - Any digit between 0 and 9
- `{3}` - Repeat the previous range 3 times 

This type of regular expression syntax is called a **quantifier**. Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths. As an example, we might want to match both `e-mail` and `email`. To do this, we would want to specify to match `-` either zero or one times.

The specific type of quantifier we saw above is called a numeric quantifier. Here are the different types of numeric quantifiers we can use:

|Quantifier|Pattern     |Explanation                         |
|----------|------------|------------------------------------|
| numeric  |`a{3}`      |Character `a` 3 times               |
| numeric  |`a{3,5}`    |Character `a` 3, 4, OR 5 times      |
| numeric  |`a{,3}`     |Character `a` 0, 1, 2, OR 3 times   |
| numeric  |`a{8,}`     |Character `a` 8 or more             |

You might notice that the last two examples above omit the lower or upper bound, in the same way that we can omit the first or last indices when slicing lists.

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.

|Quantifier   |Pattern     |Numeric equivalent |
|-------------|------------|-------------------|
| Zero or more|`a*`        |`a{0,}`            |
| One or more |`a+`        |`a{1,}`            |
| Optional    |`a?`        |`a{0,1}`           |
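
To see the quantifiers in action, here is a small sketch using the two examples mentioned above: the optional hyphen in `e-mail` and the four-digit pattern `[1-2][0-9]{3}`. The test strings are made up for illustration.

```python
import re

# "-?" means zero or one "-" characters, so both spellings match.
optional_hyphen = "e-?mail"
hyphen_results = {s: bool(re.search(optional_hyphen, s))
                  for s in ["email", "e-mail", "e--mail"]}
print(hyphen_results)  # {'email': True, 'e-mail': True, 'e--mail': False}

# One digit from 1-2 followed by exactly three more digits.
four_digits = "[1-2][0-9]{3}"
year_match = re.search(four_digits, "founded in 1998")
print(year_match.group())  # 1998

# "305" has only three digits, so there is no match.
no_match = re.search(four_digits, "room 305")
print(no_match)  # None
```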


So far, we've learned how to perform simple matches with sets, and how to use quantifiers to specify when a character should repeat a certain number of times. Let's continue by looking at a more complex example.

Some stories submitted to Hacker News include a topic tag in brackets, like `[pdf]`. Here are a few examples of story titles with these tags:

To match the substring `"[pdf]"`, we can use backslashes to escape both the open and closing brackets: `\[pdf\]`.

`\[pdf\]`

- `\[` - The `[` character (escaped)
- `pdf` - The entire substring `pdf`
- `\]` - The `]` character (escaped)

The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like `pdf` and `video`) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use **character classes**. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:

1. The set notation using brackets to match any of a number of characters.
2. The range notation, which we used to match ranges of digits (like `[0-9]`).

Let's look at a summary of syntax for some of the regex character classes:

|Character Class |Pattern     |Explanation                          |
|----------------|------------|-------------------------------------|
| Set            |`[fud]`     |Either `f`, `u`, or `d`              |
| Range          |`[a-e]`     |Any character from `a` through `e`   |
| Range          |`[0-3]`     |Any character from `0` through `3`   |
| Range          |`[A-Z]`     |Any uppercase letter                 |
| Set + Range    |`[A-Za-z]`  |Any uppercase or lowercase letter    |

There are two new things we can observe from this table:

1. Ranges can be used for letters as well as numbers.
2. Sets and ranges can be combined.

Just like with quantifiers, there are some other common character classes which we'll use a lot.

|Character Class |Pattern  |Explanation                                                  |
|----------------|---------|-------------------------------------------------------------|
| Digit          |`\d`     |Any digit character (same as `[0-9]`)                        |
| Word           |`\w`     |Any digit, letter, or underscore (same as `[A-Za-z0-9_]`)    |
| Whitespace     |`\s`     |Any space, tab, or linebreak character                       |
| Dot            |`.`      |Any character except newline                                 |

The one that we'll be using in order to match characters in tags is `\w`, which represents any digit, uppercase or lowercase letter, or underscore. Each character class represents a single character, so to match multiple characters (e.g. words like `video` and `pdf`), we'll need to combine them with quantifiers.

In order to match word characters between our brackets, we can combine the word character class (`\w`) with the 'one or more' quantifier (`+`), giving us a combined pattern of `\w+`.

This will match sequences like `pdf`, `video`, `Python`, and `2018` but won't match a sequence containing a space or punctuation character like `PHP-DEV` or `XKCD Flowchart`. If we wanted to match those tags as well, we could use `.+`; however, in this case, we're just interested in single-word tags without special characters.

Let's quickly recap the concepts we learned in this screen:

- We can use a backslash to escape characters that have special meaning in regular expressions (e.g. `\[` will match an open bracket character).
- Character classes let us match certain groups of characters (e.g. `\w` will match any word character).
- Character classes can be combined with quantifiers when we want to match different numbers of characters.

We'll use these concepts to count the number of titles that contain a tag.

Use the regular expression to select only the items from `titles` that match. Assign the result to the variable `tag_titles`.

`pattern = "\[\w+\]"
tag_titles = titles[titles.str.contains(pattern)]
tag_count = titles.str.contains(pattern).sum()`

## Special Characters
On the previous screen, we learned that we can use backslashes to escape the `[` and `]` characters. Backslashes are used to escape many other characters in regular expressions, as well as to denote some special character sequences (like character classes).

In Python, a backslash followed by certain characters represents an [escape sequence](https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences) — like the `\n` sequence — which we previously learned represents a new line. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring `\b`:

`print('hello\b')`

`hell`

The escape sequence `\b` represents a backspace, so the final letter from our string is removed. The character sequence `\b` has a special meaning in regular expressions (which we'll learn about later), so we need a way to write these characters without triggering the escape sequence.

One way is to add an extra backslash before the "b":

`print('hello\\b')`

`hello\b`

This can make regular expressions even more difficult to read and interpret, so instead we use [raw strings](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals), which we denote by prefixing our string with the `r` character. Let's take a look at the code from above with a raw string:

`print(r'hello\b')`

`hello\b`
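
To make the difference concrete, here is a minimal sketch comparing a normal string and a raw string as regex patterns. It uses `\b` in its regex role (a word boundary, covered later in these notes); the sample text is made up.

```python
import re

text = "I like Java and JavaScript"

# In a normal string, "\b" is a single backspace character;
# in a raw string, it stays as two characters: a backslash and "b".
print(len("\b"), len(r"\b"))  # 1 2

# The normal-string pattern searches for literal backspace characters.
non_raw_match = re.search("\bJava\b", text)
print(non_raw_match)  # None

# The raw-string pattern reaches the regex engine intact.
raw_match = re.search(r"\bJava\b", text)
print(raw_match.group())  # Java
```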

We strongly recommend using raw strings for every regex you write, rather than remembering which sequences are escape sequences and using raw strings selectively. That way, you'll never encounter a situation where you forget or overlook something which causes your regex to break.

## Capture Groups
In the previous screen, we were able to calculate that 444 of the 20,100 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

In order to do this, we'll need to use **capture groups**. Capture groups allow us to specify one or more groups within our match that we can access separately. In this mission, we'll learn how to use one capture group per regular expression, but in the next mission we'll learn some more complex capture group patterns.

We specify capture groups using parentheses. Let's add opening and closing parentheses to the pattern we wrote in the previous screen, and break down how each character in our regular expression works:

`(\[\w+\])`

- `(` - Start capture group
- `\[` - The character `[` (escaped)
- `\w+` - One or more word characters
- `\]` - The `]` character (escaped)
- `)` - End capture group

We'll learn how to access capture groups in pandas by looking at just the first five matching titles from the previous exercise:

`tag_5 = tag_titles.head()
print(tag_5)`

We use the `Series.str.extract()` [method](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.str.extract.html) to extract the match within our parentheses:

`pattern = r"(\[\w+\])"
tag_5_matches = tag_5.str.extract(pattern, expand=False)
print(tag_5_matches)`

We can move our parentheses inside the brackets to get just the text:

`pattern = r"\[(\w+)\]"
tag_5_matches = tag_5.str.extract(pattern, expand=False)
print(tag_5_matches)`

If we then use `Series.value_counts()` we can quickly get a frequency table of the tags:

`tag_5_freq = tag_5_matches.value_counts()
print(tag_5_freq)`
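
For a version that runs without the Hacker News data, here is a self-contained sketch of the same steps. The three titles are hypothetical, and `expand=False` is passed so that `Series.str.extract()` returns a Series rather than a one-column DataFrame when there is a single capture group.

```python
import pandas as pd

# Hypothetical titles standing in for the Hacker News data.
sample_titles = pd.Series([
    "Deep learning primer [pdf]",
    "Conference keynote [video]",
    "Compiler design notes [pdf]",
])

# The capture group sits inside the escaped brackets, so only the tag text is kept.
tags = sample_titles.str.extract(r"\[(\w+)\]", expand=False)
print(tags.tolist())  # ['pdf', 'video', 'pdf']

# Frequency table of the extracted tags.
tag_freq = tags.value_counts()
print(tag_freq.to_dict())  # {'pdf': 2, 'video': 1}
```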

**Finding all of the things that match in this joint**

`pattern = r"\[(\w+)\]"
tag_freq = df.str.extract(pattern, expand=False).value_counts()`

On the previous screens, we wrote mostly simple regular expressions. In reality, regular expressions are often complex. When creating complex regular expressions, you often need to work iteratively so you can find "bad" instances that match your pattern and then exclude them.

In order to work faster as you build your regular expression, it can be helpful to create a function that returns the first few matching strings:
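
The function body is not shown in these notes. One plausible sketch, consistent with how `first_10_matches` is called later, is below. It assumes a module-level `titles` Series; the titles here are hypothetical placeholders for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the Hacker News titles used in the lesson.
titles = pd.Series([
    "Java 9 release notes",
    "Learning JavaScript in 2016",
    "Why I moved from java to Go",
    "A tour of Rust",
])

def first_10_matches(pattern):
    """Return the first 10 titles that match the regex pattern."""
    all_matches = titles[titles.str.contains(pattern)]
    return all_matches.head(10)

java_matches = first_10_matches(r"[Jj]ava")
print(java_matches.tolist())
```

Reading `titles` from the enclosing scope keeps the call short while iterating on a pattern; passing the Series as a second argument would be the more reusable design.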

Another useful approach is to use an online tool like [RegExr](https://regexr.com/) that allows you to build regular expressions and includes syntax highlighting, instant matches, and regex syntax reference. For this screen, we'll use the `first_10_matches` function we just built to iteratively build a regular expression.

Earlier, we counted the titles that included Python — let's write a simple regular expression to match Java (another popular language), and use our function to look at the matches:

`first_10_matches(r"[Jj]ava")`

We can see that there are a number of matches that contain `Java` as part of the word `JavaScript`. We want to exclude these titles from matching so we get an accurate count.

One way to do this is by using **negative character classes**. Negative character classes are character classes that match every character *except* those in a given character class. Let's look at a table of the common negative character classes:

## Negative Character Classes

|Character Class |Pattern     |Explanation                               |
|----------------|------------|------------------------------------------|
| Negative Set        |`[^fud]`    |Any char except `f`, `u`, or `d`           |
| Negative Set        |`[^1-3Z\s]` |Any char except `1-3`, `Z`, or whitespace  |
| Negative Digit      |`\D`        |Any char except digit characters           |
| Negative Word       |`\W`        |Any char except word characters            |
| Negative Whitespace |`\S`        |Any char except whitespace characters      |

Let's use the negative set `[^Ss]` to exclude instances like JavaScript and Javascript:

On the previous screen, we used a negative set to find all of the mentions of "Java" in our dataset:

`first_10_matches(r"[Jj]ava[^Ss]")`

While the negative set was effective in removing any bad matches that mention JavaScript, it also had the side-effect of removing any titles where `Java` occurs at the end of the string, like this title:

`Pippo  Web framework in Java`

This is because the negative set `[^Ss]` must match one character, so instances at the end of a string do not match.

#### Word Boundary Anchor
A different approach to take in cases like these is to use the **word boundary anchor**, specified using the syntax `\b`. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string. The diagram below shows all the word boundaries in an example string:

Let's look at how using a word boundary changes the match from the string in the example above:

`string = "Sometimes people confuse JavaScript with Java"
pattern_1 = r"Java[^S]"`

`m1 = re.search(pattern_1, string)
print(m1)`

`None`

The regular expression returns `None`, because there is no substring that contains `Java` followed by a character that isn't `S`.

Let's instead use word boundaries in our regular expression:

`pattern_2 = r"\bJava\b"`

`m2 = re.search(pattern_2, string)
print(m2)`

`<_sre.SRE_Match object; span=(41, 45), match='Java'>`

With the word boundary, our pattern matches the `Java` at the end of the string.

Let's use the word boundary anchor as part of our regular expression to select the titles that mention Java.

`pattern = r'\b[Jj]ava\b'
java_titles = titles[titles.str.contains(pattern)]`

So far, we've used regular expressions to match substrings contained anywhere within text. There are often scenarios where we want to specifically match a pattern at the start and end of strings.

On the previous screen, we learned that the **word boundary anchor** matches the position between a word character and a non-word character. More generally in regular expressions, an **anchor** matches something that isn't a character, as opposed to character classes which match specific characters.

Other than the word boundary anchor, the other two most common anchors are the **beginning anchor** and the **end anchor**, which represent the start and the end of the string, respectively.

|Anchor          |Pattern     |Explanation                                    |
|----------------|------------|-----------------------------------------------|
| Beginning      |`^abc`      |Matches `abc` only at the start of the string  |
| End            |`abc$`      |Matches `abc` only at the end of the string    |

Note that the `^` character is used both as a beginning anchor and to indicate a negative set, depending on whether the character preceding it is a `[` or not.

Let's start with a few test cases that all contain the substring `Red` at different parts of the string, as well as a test function:
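
The original test cases and test function are not reproduced in these notes. As a minimal sketch, the strings and the `check_pattern` helper below are hypothetical replacements that show how the two anchors behave.

```python
import re

# Hypothetical test cases, each containing "Red" in a different position.
test_cases = ["Red Nose Day", "My favorite color is Red", "The Red Carpet"]

def check_pattern(pattern):
    """Return the matched text (or None) for each test case."""
    results = []
    for tc in test_cases:
        m = re.search(pattern, tc)
        results.append(m.group() if m else None)
    return results

# Beginning anchor: only the first string starts with "Red".
beginning_results = check_pattern(r"^Red")
print(beginning_results)  # ['Red', None, None]

# End anchor: only the second string ends with "Red".
ending_results = check_pattern(r"Red$")
print(ending_results)  # [None, 'Red', None]
```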

`pattern_beg = r'^\[(\w+)\]'
beginning_count = titles.str.contains(pattern_beg).sum()`

`pattern_end = r'\[(\w+)\]$'
ending_count = titles.str.contains(pattern_end).sum()`

Up until now, we've been using sets like `[Pp]` to match different capitalizations in our regular expressions. This strategy works well when there is only one character that has capitalization, but becomes cumbersome when we need to cater for multiple instances.

Within the titles, there are many different formatting styles used to represent the word "email." Here is a list of the variations:

`email
Email
e Mail
e mail
E-mail
e-mail
eMail
E-Mail
EMAIL`

To write a regular expression for this, we would need to use a set for all five letters in email, which would make our regular expression very hard to read.

Instead, we can use `flags` to specify that our regular expression should ignore case.

Both `re.search()` and the pandas regular expression methods accept an optional `flags` argument. This argument accepts one or more flags, which are special variables in the `re` module that modify the behavior of the regex interpreter.

A [list of all available flags](https://docs.python.org/3/library/re.html#re.A) is in the documentation, but by far the most common and the most useful is the `re.IGNORECASE` flag, which is also available using the alias `re.I` for convenience.

When you use this flag, all uppercase letters will match their lowercase equivalents and vice versa. Let's look at an example without using the flag:

`email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])
email_tests.str.contains(r"email")`

Now let's look at what happens when we use the flag:

`import re
email_tests.str.contains(r"email",flags=re.I)`

No matter what the capitalization is, our regular expression matches.

We'll finish this mission by writing a regular expression and counting the number of times that email is mentioned in story titles. You'll need to use both the ignorecase flag and some of the other regex components you've already learned in this mission.
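
One possible pattern for this task is sketched below. It is not necessarily the intended solution, and the titles are hypothetical. The pattern combines word boundaries, an optional hyphen or whitespace character after the `e`, an optional trailing `s`, and the `re.I` flag.

```python
import re
import pandas as pd

# Hypothetical titles standing in for the Hacker News data.
sample_titles = pd.Series([
    "Why email is broken",
    "Securing your E-Mail server",
    "Ask HN: best e mail client?",
    "Sending emails at scale",
    "Remailer networks explained",  # "email" inside a longer word: should not match
])

# \b          word boundary, so "Remailer" is excluded
# e[\-\s]?    "e" followed by an optional hyphen or whitespace character
# mails?      "mail" with an optional plural "s"
pattern = r"\be[\-\s]?mails?\b"

email_bool = sample_titles.str.contains(pattern, flags=re.I)
email_count = email_bool.sum()
print(email_bool.tolist())  # [True, True, True, True, False]
print(email_count)          # 4
```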

In this mission, we learned the basics of using regular expressions to perform powerful text matching, including:

- Character classes to match certain groups of characters, including sets to match different capitalizations of programming languages.
- Quantifiers to match different quantities of characters, including matching different variations of "email."
- Negative character classes for matching anything except certain groups of characters.
- Word boundaries to match only specific instances of words.
- Positional anchors to match only at the start and end of strings.
- The ignorecase flag to make patterns case insensitive.

In the next mission, we'll expand on our regular expression knowledge with some advanced regex concepts!

# Continuing RegEx

So far we've used capture groups to extract all or most of the text in our regular expression pattern. Capture groups can also be useful to extract specific data from within our expression.

Let's look at a sample of Hacker News titles that mention Python:

`Developing a computational pipeline using the asyncio module in Python 3
Python 3 on Google App Engine flexible environment now in beta
Python 3.6 proposal, PEP 525: Asynchronous Generators
How async/await works in Python 3.5.0
Ubuntu Drops Python 2.7 from the Default Install in 16.04`

All of these examples have a number after the word "Python," which indicates a version number. We can use the following regular expression to match these cases:

`[Pp]ython [\d\.]+`

- `[Pp]` - Upper or Lowercase `P`
- `ython` - The substring `ython`
- `[\d\.]+` - One or more digits OR `.` chars

We can use capture groups to extract the version of Python that is mentioned most often in our dataset by wrapping parentheses around the part of our regular expression which captures the version number.

We'll use a capture group to capture the version number after the word "Python," and then build a frequency table of the different versions.

`pattern = r'[Pp]ython ([\d\.]+)'`
`py_versions_freq = dict(titles.str.extract(pattern, flags=re.I, expand=False).value_counts())`

It looks like we're getting close. In our first 10 matches we have one irrelevant result, which is about "Series C," a term used to represent a particular type of startup fundraising.

Additionally, we've run into the same issue as we did in the previous mission — by using a negative set, we may have eliminated any instances where the last character of the title is "C" (the second last line of output matches in spite of the fact that it ends with "C," because it also has "C" earlier in the string).

Neither of these can be avoided using negative sets, which are used to allow multiple matches for a single character. Instead we'll need a new tool: **lookarounds**.

## Lookarounds

Lookarounds let us define a character or sequence of characters that either must or must not come before or after our regex match. There are four types of lookarounds:

|Lookaround      |Pattern       |Explanation                                  |
|----------------|--------------|---------------------------------------------|
| + Lookahead    |`zzz(?=abc)`  |Matches `zzz` when it's followed by `abc`       |
| - Lookahead    |`zzz(?!abc)`  |Matches `zzz` when it's NOT followed by `abc`   |
| + Lookbehind   |`(?<=abc)zzz` |Matches `zzz` when it's preceded by `abc`       |
| - Lookbehind   |`(?<!abc)zzz` |Matches `zzz` when it's NOT preceded by `abc`   |

These tips can help you remember the syntax for lookarounds:

- Inside the parentheses, the first character of a lookaround is always `?`.
- If the lookaround is a lookbehind, the next character will be `<`, which you can think of as an arrow head pointing behind the match.
- The next character indicates whether the lookaround is positive (`=`) or negative (`!`).

Let's create some test data that we'll use to illustrate how lookarounds work:

`test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']`

We'll also create a function that will loop over our test cases and tell us whether our pattern matches. We'll use the `re` module rather than pandas since it tells us the exact text that matches, which will help us understand how the lookaround is working:

`def run_test_cases(pattern):
    for tc in test_cases:
        result = re.search(pattern, tc)
        print(result or "NO MATCH")`
        
In each instance, we'll aim to match the substring `Green` depending on the characters that precede or follow it. Let's start by using a **positive lookahead** to include instances where the match is followed by the substring `_Blue`. We'll include the underscore character in the lookahead, otherwise we will get zero matches:
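
The output for these cases is not reproduced in these notes, so here is a self-contained sketch that tries all four lookaround types against the test cases above. The `matching_cases` helper is a hypothetical variant of `run_test_cases` that returns the matching strings instead of printing match objects.

```python
import re

test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']

def matching_cases(pattern):
    """Return the test cases in which the pattern is found."""
    return [tc for tc in test_cases if re.search(pattern, tc)]

# Positive lookahead: Green followed by _Blue.
pos_ahead = matching_cases(r"Green(?=_Blue)")
print(pos_ahead)   # ['Red_Green_Blue', 'Yellow_Green_Blue']

# Negative lookahead: Green NOT followed by _Red.
neg_ahead = matching_cases(r"Green(?!_Red)")
print(neg_ahead)   # ['Red_Green_Blue', 'Yellow_Green_Blue', 'Green']

# Positive lookbehind: Green preceded by Red_.
pos_behind = matching_cases(r"(?<=Red_)Green")
print(pos_behind)  # ['Red_Green_Blue', 'Red_Green_Red']

# Negative lookbehind: Green NOT preceded by Yellow_.
neg_behind = matching_cases(r"(?<!Yellow_)Green")
print(neg_behind)  # ['Red_Green_Blue', 'Red_Green_Red', 'Green']

# The lookaround text is checked but is not part of the match itself.
matched_text = re.search(r"Green(?=_Blue)", "Red_Green_Blue").group()
print(matched_text)  # Green
```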

The contents of a lookaround can include any other regular expression component. For instance, here is an example where we match only cases that are followed by at least five characters:

`run_test_cases(r"Green(?=.{5})")`

The second and third test cases are followed by four characters, not five, and the last test case isn't followed by anything.

Sometimes programming languages won't implement support for all lookarounds (notably, lookbehinds were not part of the official JavaScript specification until ECMAScript 2018). As an example, to get full support in the `RegExr` tool, you'll need to set it to use the PCRE regex engine.

In this exercise, we're going to use lookarounds to refine the regular expression we built on the last screen to capture mentions of the "C" programming language. As a reminder, here is the last of the regular expressions we attempted to use with this exercise earlier, and the resultant titles that match:

`first_10_matches(r"\b[Cc]\b[^.+]")`

Let's now use lookarounds to exclude the matches we don't want. We want to:

- Keep excluding matches that are followed by . or +, but still match cases where "C" falls at the end of the string.
- Exclude matches that have the word 'Series' immediately preceding them.

This exercise is a little harder than those you've seen so far in this course — it's okay if it takes you a few attempts!
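One possible pattern that meets both requirements is sketched below. It is only one of several workable answers, and the sample titles are invented for illustration rather than taken from the dataset:

```python
import re

# (?<!Series\s) : negative lookbehind, not preceded by "Series" plus one whitespace character
# (?![.+])      : negative lookahead, not followed by "." or "+"
pattern = r"(?<!Series\s)\b[Cc]\b(?![.+])"

sample_titles = [                 # hypothetical titles
    "Why C is still relevant",
    "Learning C++ in 2016",
    "Raising a Series C round",
    "C.E.O. interview",
    "Memory safety in C",
]

matching_titles = [t for t in sample_titles if re.search(pattern, t)]
print(matching_titles)
```

Because a negative lookahead does not require any character to be present, the last title still matches even though "C" is the final character of the string.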

## Backreferences
Let's say we wanted to identify strings that had words with double letters, like the "ee" in "feed." Because we don't know ahead of time what letters might be repeated, we need a way to specify a capture group and then to repeat it. We can do this with **backreferences**.

Whenever we have one or more capture groups, we can refer to them using integers left to right as shown in this regex that matches the string `HelloGoodbye`:

`(Hello)(Goodbye)`

Within a regular expression, we can use a backslash followed by that integer to refer to the group:

`(Hello)(Goodbye)\2\1`

The regular expression above will match the text `HelloGoodbyeGoodbyeHello`. Let's look at how we could write a regex to capture instances of the same word character appearing twice in a row:

`(\w)\1`

Notice that there was no match for the word Aaron, despite it containing a double "a." This is because the uppercase and lowercase "a" are two different characters, so the backreference does not match.
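Both backreference patterns can be checked with a short sketch. The word list here is invented for illustration:

```python
import re

# \2 repeats whatever group 2 matched, then \1 repeats group 1.
swap = re.search(r"(Hello)(Goodbye)\2\1", "HelloGoodbyeGoodbyeHello")

test_words = ["feed", "Aaron", "cookie", "data"]   # hypothetical words
double_letters = {}
for word in test_words:
    # (\w) captures one word character; \1 requires that same character again.
    m = re.search(r"(\w)\1", word)
    double_letters[word] = m.group() if m else None

print(double_letters)
```

`Aaron` maps to `None` because `A` and `a` are different characters as far as the backreference is concerned.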

We can easily achieve the same thing using pandas:
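A sketch of the pandas version using `Series.str.contains()`, again with an invented series of words. pandas may emit a warning that the pattern contains a capture group; the group is needed here because the backreference refers to it:

```python
import pandas as pd

words = pd.Series(["feed", "Aaron", "cookie", "data"])  # hypothetical data

# True where the same word character appears twice in a row.
has_double = words.str.contains(r"(\w)\1")
print(words[has_double])
```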

Let's use this technique to identify story titles that have repeated words.
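One way to express "a repeated word" is sketched below, assuming a word is a run of word characters and that the repeat is separated by a single whitespace character. The titles are made up for illustration:

```python
import re

# \b(\w+) : a whole word, captured as group 1
# \s\1\b  : whitespace, then the same word again, ending at a word boundary
pattern = r"\b(\w+)\s\1\b"

sample_titles = [                      # hypothetical titles
    "Solving the problem problem",
    "A guide to regex",
    "This is is a typo",
]

repeats = []
for title in sample_titles:
    m = re.search(pattern, title)
    repeats.append(m.group() if m else None)

print(repeats)
```

The leading `\b` matters: without it, the `is` at the end of `This` followed by ` is` would count as a repeated word.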

## re.sub()
When we learned to work with basic string methods, we used the `str.replace()` method to replace simple substrings. We can achieve the same with regular expressions using the `re.sub()` [function](https://docs.python.org/3/library/re.html#re.sub). The basic syntax for `re.sub()` is:

`re.sub(pattern, repl, string, flags=0)`

The `repl` parameter is the text that you would like to substitute for the match. Let's look at a simple example where we replace all capital letters in a string with dashes:

`string = "aBcDEfGHIj"
print(re.sub(r"[A-Z]", "-", string))`

`a-c--f---j`

Earlier, we discovered that there were multiple different capitalizations for SQL in our dataset. Let's look at how we could make these uniform with the `Series.str.replace()` method and a regular expression:

`sql_variations = pd.Series(["SQL", "Sql", "sql"])`

`sql_uniform = sql_variations.str.replace(r"sql", "SQL", flags=re.I, regex=True)
print(sql_uniform)`

`pattern = r'e[-\s]{0,1}mail'`

`email_uniform = email_variations.str.replace(pattern, "email", flags=re.I, regex=True)
titles_clean = titles.str.replace(pattern, "email", flags=re.I, regex=True)`

Over the final three screens in this mission, we'll extract components of URLs from our dataset. As a reminder, most stories on Hacker News contain a link to an external resource.

The task we will be performing first is extracting the different components of the URLs in order to analyze them. On this screen, we'll start by extracting just the domains. Below is a list of some of the URLs in the dataset; the domain is the part of each string we want to capture.

`https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429
 http://www.interactivedynamicvideo.com/
 http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0
 http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/
 HTTPS://github.com/keppel/pinn
 Http://phys.org/news/2015-09-scale-solar-youve.html
 https://iot.seeed.cc
 http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html
 http://beta.crowdfireapp.com/?beta=agnipath
 https://www.valid.ly?param`
 
The domain of each URL excludes the protocol (e.g. https://) and the page path (e.g. /Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429).
 
There are several ways that you could use regular expressions to extract the domain, but we suggest the following technique:

- Using a series of characters that will match the protocol.
- Inside a capture group, using a set that will match the character classes used in the domain.
- Because all of the URLs either end with the domain or continue with a page path that starts with / (a character not found in any domains), we don't need to cater for this part of the URL in our regular expression.

Once you have extracted the domains, you will be building a frequency table so we can determine the most popular domains. There are over 7,000 unique domains in our dataset, so to make the frequency table easier to analyze, we'll look at only the top 20 domains.

We have provided some of the URLs from the dataset which will help you to iterate while you build your regular expression.

#### Extracting Domains and URLs

**Pattern to extract domains from urls**
`pattern = r'https?://([\w\.]+)'`
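A sketch of this pattern applied to the sample URLs with the `re` module. The `re.I` flag is an addition here: it is needed because some of the URLs write the protocol in uppercase or mixed case.

```python
import re

test_urls = [
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'http://www.interactivedynamicvideo.com/',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://iot.seeed.cc',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
    'http://beta.crowdfireapp.com/?beta=agnipath',
    'https://www.valid.ly?param',
]

pattern = r'https?://([\w\.]+)'

# group(1) is the capture group, i.e. the domain without the protocol.
domains = [re.search(pattern, url, flags=re.I).group(1) for url in test_urls]
print(domains)
```

With a pandas series of URLs, the same pattern and flag can be passed to `Series.str.extract()`, and `Series.value_counts().head(20)` would then give the top 20 domains.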

Having extracted just the domains from the URLs, on this final screen we'll extract each of the three component parts of the URLs:

- Protocol
- Domain
- Page path

**Pattern to extract URL components using capture groups**
`pattern = r"(.+)://([\w\.]+)/?(.*)"`

**Pattern to extract URL components using NAMED capture groups**
`pattern = r"(?P<protocol>.+)://(?P<domain>[\w\.]+)/?(?P<path>.*)"`
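As a rough illustration of what the named groups give us, the sketch below runs the pattern over three of the sample URLs and reads the components back with `groupdict()`:

```python
import re

pattern = r"(?P<protocol>.+)://(?P<domain>[\w\.]+)/?(?P<path>.*)"

urls = [
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'https://iot.seeed.cc',
    'https://www.valid.ly?param',
]

# groupdict() returns a dictionary keyed by the group names.
components = [re.search(pattern, url).groupdict() for url in urls]
for c in components:
    print(c)
```

When a URL has no page path, the `path` group matches an empty string rather than failing, because `/?` and `.*` can both match nothing.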

# List Comprehension & Lambda Functions
So far, we've learned how to use regular expressions to make cleaning and analyzing text data easier.

In this mission, we'll learn some tips and syntax shortcuts we can use on top of everything we've learned, including:

- Creating list comprehensions to replace loops with a single line of code.
- Creating single use functions called lambda functions.

The data set we'll use in this mission is in a format called [`JavaScript Object Notation`](https://www.json.org/) **(JSON)**. As the name indicates, JSON originated from the JavaScript language, but has now become a language-independent format.

From a Python perspective, JSON can be thought of as a collection of Python objects nested inside each other.

The example JSON we'll use below is a list, where each element in the list is a dictionary. Each of the dictionaries has the same keys, and one of the values of each dictionary is itself a list.

The Python [`json` module](https://docs.python.org/3.7/library/json.html#module-json) contains a number of functions to make working with JSON objects easier. We can use the [json.loads() function](https://docs.python.org/3.7/library/json.html#json.loads) to convert JSON data contained in a string to the equivalent set of Python objects:

`json_string = """
[
  {
    "name": "Sabine",
    "age": 36,
    "favorite_foods": ["Pumpkin", "Oatmeal"]
  },
  {
    "name": "Zoe",
    "age": 40,
    "favorite_foods": ["Chicken", "Pizza", "Chocolate"]
  },
  {
    "name": "Heidi",
    "age": 40,
    "favorite_foods": ["Caesar Salad"]
  }
]
"""`

`import json
json_obj = json.loads(json_string)
print(type(json_obj))`

`<class 'list'>`

We can see that `json_string` has turned into a list. Let's take a look at the values in the list:

`print(json_obj)`

`[{'age': 36, 'favorite_foods': ['Pumpkin', 'Oatmeal'], 'name': 'Sabine'},
 {'age': 40, 'favorite_foods': ['Chicken', 'Pizza', 'Chocolate'], 'name': 'Zoe'},
 {'age': 40, 'favorite_foods': ['Caesar Salad'], 'name': 'Heidi'}]`
 
We can observe a few things:

- The formatting from our original string is gone. This is because printing Python lists and dictionaries has a simple formatting structure.
- The order of the keys in the dictionary has changed. This is because, prior to version 3.6, Python dictionaries didn't have a fixed order. From Python 3.7 onward, dictionaries are guaranteed to preserve insertion order, so the keys would print in the order they appear in the JSON.

Let's practice using `json.loads()` to convert JSON data from a string to Python objects!

One of the places where the JSON format is commonly used is in the results returned by an [`Application programming interface`](https://en.wikipedia.org/wiki/Application_programming_interface) **(API)**. APIs are interfaces that can be used to send and receive data between different computer systems. We'll learn about how to work with APIs in a later course.

To read a file from JSON format, we use the `json.load()` [function](https://docs.python.org/3.7/library/json.html#json.load). Note that the function is `json.load()` without an "s" at the end. The `json.loads()` function is used for loading JSON data from a string ("loads" is short for "load string"), whereas the `json.load()` function is used to load from a file object. Let's look at how we would read in our data:

`import json
with open("hn_2014.json") as file:
    hn = json.load(file)`

`print(type(hn))`

`<class 'list'>`

Let's look at the first dictionary in full. To make it easier to understand, we're going to create a function which will print a JSON object with formatting to make it easier to read.

The function will use the `json.dumps()` [function](https://docs.python.org/3.7/library/json.html#json.dumps) ("dump string") which does the opposite of the `json.loads()` function — it takes a JSON object and returns a string version of it. The `json.dumps()` function accepts arguments that can specify formatting for the string, which we'll use to make things easier to read:
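A minimal sketch of such a function is shown below. The name `jprint` matches the calls used later in these notes, but the body, the `format_json` helper, and the choice of `sort_keys=True` and `indent=4` are assumptions made here; the exact layout of the output may differ from the examples shown elsewhere.

```python
import json

def format_json(obj):
    # Hypothetical helper: build a formatted string of the Python JSON object.
    # sort_keys orders dictionary keys alphabetically; indent sets the nesting width.
    return json.dumps(obj, sort_keys=True, indent=4)

def jprint(obj):
    # Print the formatted string.
    print(format_json(obj))

jprint({"name": "Sabine", "age": 36})
```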

You may notice that the `createdAt` and `createdAtI` keys both have the date and time data in two different formats. Because the format of `createdAt` is much easier to understand, let's do some data cleaning by deleting the `createdAtI` key from every dictionary.

To delete a key from a dictionary, we can use the `del` [statement](https://docs.python.org/3.7/reference/simple_stmts.html#del). Let's learn the syntax by looking at a simple example:

`d = {'a': 1, 'b': 2, 'c': 3}
del d['a']
print(d)`

`{'b': 2, 'c': 3}`

We can create a function using `del` that will return a copy of our dictionary with the key removed:
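One way such a function might look is sketched below. The name `del_key` and its parameters are hypothetical. Copying the dictionary first means the original is left unchanged:

```python
def del_key(dict_, key):
    # Work on a shallow copy so the caller's dictionary is not modified.
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

d = {'a': 1, 'b': 2, 'c': 3}
new_d = del_key(d, 'a')

print(new_d)  # the copy, without 'a'
print(d)      # the original still has all three keys
```

The same function could then be called in a loop over a list of dictionaries, appending each returned copy to a new list.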

## List Comprehension

The task we performed is an extremely common one. Specifically, we:

- Iterated over values in a list.
- Performed a transformation on those values.
- Assigned the result to a new list.

Python includes a special syntax shortcut for tasks that meet these criteria: **List Comprehensions**. A list comprehension provides a concise way of creating lists in a single line of code.

List comprehensions can look complex at first, but we are simply reordering the elements of our for loop code. To keep things simple, we'll start with a basic example, where we want to add 1 to each item in a list of integers.

To transform this structure into a list comprehension, we do the following within brackets:

- Start with the code that transforms each item.
- Continue with our for statement (without a colon).

We can then assign the list comprehension to a variable name.
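The conversion can be sketched as follows for the "add 1 to each item" example. The list values and variable names are invented for illustration:

```python
ints = [1, 2, 3, 4]   # hypothetical data

# Manual loop version
plus_one = []
for i in ints:
    plus_one.append(i + 1)

# List comprehension version: the transformation (i + 1) comes first,
# followed by the for statement without its colon.
plus_one_lc = [i + 1 for i in ints]

print(plus_one_lc)
```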

Just like in a normal loop, we can use any name for our iterator variable.

For the last example, we'll apply a method to each string in a list to capitalize it.
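A sketch of what that might look like, using `str.capitalize()` on an invented list of names:

```python
names = ["sabine", "zoe", "heidi"]   # hypothetical data

# Call the method on each string and collect the results in a new list.
capitalized = [name.capitalize() for name in names]

print(capitalized)
```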

Let's recap what we have learned so far. A list comprehension can be used where we:

- Iterated over values in a list.
- Performed a transformation on those values.
- Assigned the result to a new list.

To transform a loop to a list comprehension, in brackets we:

- Start with the code that transforms each item.
- Continue with our for statement (without a colon).

We are going to write a list comprehension version of the code from the previous screen. To help, we've provided a copy of the code with the components labeled.

List comprehensions can be used for many different things. Three common applications are:

1. Transforming a list
2. Creating a new list
3. Reducing a list

On this screen, we're going to look at the first two of these applications.

The first application, **transforming** a list, is the category that all the examples you've seen so far fit under. You are taking an existing list, applying a transformation to every value, and assigning it to a variable.

The example below uses a list comprehension to transform a list of square numbers into their "square roots":

`squares = [1, 4, 9, 16]`

`sqroots = [sq ** (1/2) for sq in squares]`

The second application, **creating a new list**, is useful for creating test data or data that is based on a set of numbers.

As an example, let's create a list of generic column names that we could use to create a dataframe, using the `range()` [function](https://docs.python.org/3/library/stdtypes.html#range) and the `str.format()` [method](https://docs.python.org/3/library/stdtypes.html#str.format) to combine numbers and text:

`cols = ["cols_{}".format(i) for i in range(1,5)]`

The last common application of list comprehensions is **reducing a list**. Let's say we had a list of integers and we wanted to remove any integers that were smaller than 50. We could do this by adding an if statement to our loop:
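A sketch of the loop version, using the same list of integers as the list comprehension example further down:

```python
ints = [1, 4, 9, 16, 50, 100, 75, 4324, 99]

big_ints = []
for i in ints:
    # Keep only the values that are at least 50.
    if i >= 50:
        big_ints.append(i)

print(big_ints)
```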

Our loop has one new component: the if statement. Notice that instead of a transformation, we have just the list item itself (`i`). Both if statements and transformations are optional in list comprehensions, but we must include some value to populate the elements in the new list.

To include an if statement in a list comprehension, we include it at the very end, before the closing bracket:

`ints = [1, 4, 9, 16, 50, 100, 75, 4324, 99]`

`big_ints = [i for i in ints if i >= 50]`

We can use this technique to quickly and easily filter our data set using an if statement. Let's look at how we could use this to quickly count the number of stories that have comments. We'll start with a version using a loop:

`has_comments = []`

`for d in hn_clean:
    if d['numComments'] > 0:
        has_comments.append(d)`

`num_comments = len(has_comments)`

Now, let's use a list comprehension to perform the same calculation:

`has_comments = [d for d in hn_clean if d['numComments'] > 0]`

`num_comments = len(has_comments)`

In previous missions, we learned to use Python's built-in functions to analyze data in lists, like `min()`, `max()`, and `sorted()`.

What if we wanted to use these functions to work with data in JSON form? Let's use our demo JSON object to try and see what happens. First, we'll quickly remind ourselves of the data:

`jprint(json_obj)`

`[
    {
        "age": 36,
        "favorite_foods": ["Pumpkin", "Oatmeal"],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": ["Chicken", "Pizza", "Chocolate"],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": ["Caesar Salad"],
        "name": "Heidi"
    }
]`

Let's try and use Python to return the dictionary of the person with the lowest age:

`min(json_obj)`

`---------------------------------------------------------
TypeError               Traceback (most recent call last)
<ipython-input-290-60cd4510e136> in <module>()
----> 1 min(json_obj)

TypeError: unorderable types: dict() < dict()`

We get an error because Python doesn't have any way to tell whether one dictionary object is "greater" than another.

There is a way we can actually tell functions like `min()`, `max()`, and `sorted()` how to sort complex objects like dictionaries and lists of lists. We do this by using the optional `key` argument. The official Python documentation contains the following excerpts that describe how the argument works:

_The key argument specifies a one-argument ordering function like that used for list.sort()._

_key specifies a function of one argument that is used to extract a comparison key from each list element. The key corresponding to each item in the list is calculated once and then used for the entire sorting process._

These excerpts tell us we need to specify a function as an argument to control the comparison between values. Up until now, we've only passed variables or values as arguments, but not functions!

We'll learn the specifics of this particular application in a moment, but for now, we're going to explore how to pass a function as an argument.

Let's define a very simple function as an example:

`def greet():
    return "hello"`

`greet()
'hello'`

If we try to examine the type of our function, we are unsuccessful:

`t = type(greet())
print(t)`

`<class 'str'>`

What happens is that `greet()` is executed first; it returns the string `"hello"`, and then the `type()` function tells us the type of that string.

We need to find a way to look at the function itself, rather than the result of the function. The key to this is the parentheses: `()`

The parentheses are what tells Python to execute the function, so if we omit the parentheses we can treat a function like a variable, rather than working with the output of the function:

`t = type(greet)
print(t)`

`<class 'function'>`

There are other variable-like behaviors we can also use when we omit the parentheses from a function. For instance, we can assign a function to a new variable name:

`greet_2 = greet
greet_2()
'hello'`

Now that we understand how to treat a function as a variable, let's look at how we can run a function inside another function by passing it as an argument:

`def run_func(func):
    print("RUNNING FUNCTION: {}".format(func))
    return func()`

`run_func(greet)`

`RUNNING FUNCTION: <function greet at 0x12a64c400>
'hello'`

Now that we have some intuition on how to pass functions as arguments, let's see how we use a function to control the behavior of the `min()` function:

`def get_age(json_dict):
    return json_dict['age']`

`youngest = min(json_obj, key=get_age)
jprint(youngest)`

`{
    "age": 36,
    "favorite_foods": ["Pumpkin", "Oatmeal"],
    "name": "Sabine"
}`
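The same key function also works with `max()` and `sorted()`. The sketch below rebuilds the demo data so that it runs on its own:

```python
json_obj = [
    {"name": "Sabine", "age": 36, "favorite_foods": ["Pumpkin", "Oatmeal"]},
    {"name": "Zoe", "age": 40, "favorite_foods": ["Chicken", "Pizza", "Chocolate"]},
    {"name": "Heidi", "age": 40, "favorite_foods": ["Caesar Salad"]},
]

def get_age(json_dict):
    return json_dict['age']

# max() returns the first item encountered with the largest key.
oldest = max(json_obj, key=get_age)

# sorted() is stable, so items with equal keys keep their original order.
by_age = sorted(json_obj, key=get_age)

print(oldest['name'])
print([d['name'] for d in by_age])
```

Zoe and Heidi are both 40, so `max()` returns Zoe, who appears first in the list.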

## Lambda Functions
Usually, we create functions when we want to perform the same task many times. In the previous exercise, we created a function to use just once — as an argument to `max()`.

Python provides a special syntax to create temporary functions for situations like these. These functions are called **lambda functions**. Lambda functions can be defined in a single line, which allows you to define a function you want to pass as an argument at the time you need it.

While it's unusual to assign a lambda function to a variable name, we'll do that while we learn lambda functions through some simple examples. We'll start with a function that returns a single argument without modifying it:

`def unchanged(x):
    return x`
    
We're calling the returned element "transformation," even though there is no transformation. This will make sense as we introduce more complex examples.

To create a lambda function equivalent of this function, we:

- Use the lambda keyword, followed by
- The parameter and a colon, and then
- The transformation we wish to perform on our argument

We can then assign that to the function name:

`unchanged = lambda x: x`

If a function is particularly complex, it may be a better choice to define a regular function rather than create a lambda, even if it will only be used once. For instance, consider the function below, which extracts digits from a string and then adds one to the resultant integer:

`def extract_and_increment(string):
    digits = re.search(r"\d+", string).group()
    incremented = int(digits) + 1
    return incremented`
    
It becomes tough to understand in its lambda form:

`extract_and_increment = lambda string: int(re.search(r"\d+", string).group()) + 1`

Being mindful of this will ensure our code remains easy to read and understand.

Let's practice creating a lambda function version of a simple function:
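As an illustration, here is a hypothetical two-parameter function and a lambda equivalent; the names `multiply` and `multiply_lambda` are invented for this sketch:

```python
def multiply(a, b):
    return a * b

# Lambda version: parameters before the colon, the returned expression after it.
multiply_lambda = lambda a, b: a * b

print(multiply(3, 4), multiply_lambda(3, 4))
```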

As we mentioned briefly on the previous screen, assigning a lambda to a variable so it can be called by name is a pretty uncommon pattern. The primary use of lambda functions is to define a function in place, like when we are providing a function as an argument.

So we have a more precise understanding of how a lambda function works, let's look at how our solution from the previous screen is executed:

Let's look at how this works in common usage with `min()`, `max()`, and `sorted()`. We'll use the JSON object from the previous few screens so it's easier to observe what is happening:

`jprint(json_obj)`

`[
    {
        "age": 36,
        "favorite_foods": ["Pumpkin", "Oatmeal"],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": ["Chicken", "Pizza", "Chocolate"],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": ["Caesar Salad"],
        "name": "Heidi"
    }
]`

Let's start by using a lambda function with `sorted()` to sort the items in our JSON list alphabetically by name:
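A sketch of that call, with the demo data rebuilt so the block runs on its own. The lambda receives one dictionary at a time and returns the value to sort by:

```python
json_obj = [
    {"name": "Sabine", "age": 36, "favorite_foods": ["Pumpkin", "Oatmeal"]},
    {"name": "Zoe", "age": 40, "favorite_foods": ["Chicken", "Pizza", "Chocolate"]},
    {"name": "Heidi", "age": 40, "favorite_foods": ["Caesar Salad"]},
]

# Sort alphabetically by the value stored under the 'name' key.
sorted_by_name = sorted(json_obj, key=lambda d: d['name'])

# The same idea with max(): the person with the most favorite foods.
most_foods = max(json_obj, key=lambda d: len(d['favorite_foods']))

print([d['name'] for d in sorted_by_name])
print(most_foods['name'])
```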

Over the past three screens, we have:

- Learned that functions can be passed as arguments.
- Created functions and used them to calculate the minimum, maximum, and to sort lists of lists.
- Learned about lambda functions and how to create them.
- Learned how to use a lambda function to pass an argument in place when calculating minimums, maximums, and sorting lists of lists.

We can now apply all of this new knowledge to our Hacker News data to calculate the posts that had the most points in 2014!

So far, we've worked with our JSON data using pure Python. One other option available to us is to convert the JSON to a pandas dataframe and then use pandas methods to manipulate it.

Pandas has the `pandas.read_json()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html), which is designed to read JSON from either a file or a JSON string. In our case, our JSON exists as Python objects already, so we don't need to use this function.

Because the structure of JSON objects can vary a lot, sometimes you will need to prepare your data in order to be able to convert it to a tabular form. In our case, our data is a list of dictionaries, which pandas is easily able to convert to a dataframe.

Let's look at our example JSON again:

`jprint(json_obj)`

`[
    {
        "age": 36,
        "favorite_foods": ["Pumpkin", "Oatmeal"],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": ["Chicken", "Pizza", "Chocolate"],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": ["Caesar Salad"],
        "name": "Heidi"
    }
]`

Each of the dictionaries will become a row in the dataframe, with each key corresponding to a column name.

We can use the `pandas.DataFrame()` [constructor](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) and pass the list of dictionaries directly to it to convert the JSON to a dataframe:

`json_df = pd.DataFrame(json_obj)
print(json_df)`

In this case, the `favorite_foods` column contains the list from the JSON. We'll see a similar thing with the `tags` column for our Hacker News data. We'll learn how to correct that on the next screen, but for now, let's convert our data to a pandas dataframe.

Just like the `favorite_foods` column in our example data on the previous screen, the `tags` column is a column where each item contains the list of data from our original JSON.

At first glance, it looks like each value in this column contains three items:

- The string `story`
- The name of the author
- The story ID

If that's the case, then the column doesn't contain any unique data, and we can remove it. We're going to analyze this column to make sure that's the case.

Let's start by exploring how pandas is storing that data. First, we'll extract the column as a series, and check its type:

`tags = hn_df['tags']
print(tags.dtype)`

`object`

The tags column is stored as an object type. Whenever pandas uses the object type, each item in the series uses a Python object to store the data. Most commonly we see this type used for string data.

We previously learned that we could use the `Series.apply()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply) to apply a function to every item in a series. Let's look at what we get when we pass the `type()` function as an argument to the column:

`tags_types = tags.apply(type)
type_counts = tags_types.value_counts(dropna=False)
print(type_counts)`

`<class 'list'>    35806
Name: tags, dtype: int64`

All 35,806 items in the column are a Python list type.

Next, let's use `Series.apply()` to check the length of each of those lists. If our hypothesis from earlier is correct, every row will have a list containing three items:

`tags_types = tags.apply(len)
type_lengths = tags_types.value_counts(dropna=False)
print(type_lengths)`

While most of the items have three values in the list, about 2,000 contain four. Let's use a boolean mask to look at the items where the list has four items:
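A sketch of the boolean mask on a small invented series standing in for the `tags` column. The values below are placeholders, not real rows from the dataset:

```python
import pandas as pd

tags = pd.Series([                      # hypothetical stand-in for hn_df['tags']
    ["story", "author_a", "story_1"],
    ["story", "author_b", "story_2", "show_hn"],
    ["story", "author_c", "story_3"],
])

# Boolean mask: True where the list in that row has exactly four items.
has_four = tags.apply(len) == 4
four_tags = tags[has_four]

print(four_tags)
```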

It looks like whenever there are four tags, the extra tag is the last of the four. In this final exercise of the mission, we're going to use a lambda function to extract this fourth value in cases where there is one. To do this for any single list, we'll need to:

- Check the length of the list.
- If the length of the list is equal to four, return the last value.
- If the length of the list isn't equal to four, return a null value.

This is how we could create this as a standard function:

`def extract_tag(l):
    if len(l) == 4:
        return l[-1]
    else:
        return None`
        
We could use `Series.apply()` to apply this function as is, but to practice working with lambda functions, let's look at how we can complete this operation in a single line.

To achieve this, we'll have to use a special version of an if statement known as a **ternary operator**. You can use the ternary operator whenever you need to return one of two values depending on a boolean expression. The syntax is as follows:
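In Python, the ternary operator is written as a conditional expression of the form `[on_true] if [expression] else [on_false]`. The sketch below uses it to write a lambda version of `extract_tag`; the sample lists are invented for illustration:

```python
# Return the last item when the list has four items, otherwise None.
extract_tag = lambda l: l[-1] if len(l) == 4 else None

sample_tags = [                              # hypothetical data
    ["story", "author_a", "story_1"],
    ["story", "author_b", "story_2", "show_hn"],
]

extracted = [extract_tag(t) for t in sample_tags]
print(extracted)
```

With a pandas series, the same lambda could be passed directly to `Series.apply()` without assigning it to a name first.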

Congratulations, you've reached the end of the mission! Let's quickly recap the techniques we learned:

- How to read and work with JSON data.
- How to use list comprehensions to extract specific values from JSON objects.
- Some of the theory behind passing functions as arguments.
- How to create single-use lambda functions.
- How to use lambda functions in pandas to extract tags from Hacker News stories.

A lot of these techniques allow us to take code that was three to four lines long and write it in a single line of code. This is a really neat trick, and it can be tempting to start trying to write your code in as few lines as possible.

While this can be fun, it's useful to keep in mind you should always balance brevity with readability. When you write code, one of your highest priorities should be to make it readable. The importance of making your code accessible to others shouldn't be underrated; the person reading your code might be a colleague you're collaborating with, a potential employer looking at your portfolio, or yourself in six months when you have forgotten the details of why you wrote what.

In some cases, employing the techniques you learned in this mission will make your code more readable, but using them for more complex scenarios can have the opposite effect. Try to keep this in mind as you continue to work through missions and when employing these techniques outside Dataquest.

In the final mission of the course, we'll learn techniques to fill missing values in data.