A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_ → _Restart & Run All_ to restart the kernel and run all cells.

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 2. Regular Expressions.

In this problem set, we will use regular expressions (regex) to process real Twitter data. Specifically, using a sample of real tweets that contain the hashtag `#informatics`, we will use the `re` library or `grep` to search for hashtags.

In [None]:
import re

from nose.tools import assert_equal, assert_is_instance, assert_true

For simplicity, we will use only five tweets in this problem, but it's straightforward to scale to a data set with a large number of tweets after we write and test our functions. Here are the five tweets:

In [None]:
tweets = """
New #job opening at The Ottawa Hospital in #Ottawa - #Clinical #Informatics Specialist #jobs http://t.co/3SlUy11dro
Looking for a #Clinical #Informatics Pharmacist Park Plaza Hospital #jobs http://t.co/4Qw8i6YaJI
Info Session 10/7: MSc in Biomedical Informatics, University of Chicago https://t.co/65G8dJmhdR #HIT #UChicago #informatics #healthcare
Here's THE best #Books I've read on #EHR #HIE #HIPAA and #Health #Informatics http://t.co/meFE0dMSPe
@RMayNurseDir @FNightingaleF Scholars talking passionately about what they believe in. #informatics &amp; #skincare  https://t.co/m8qiUSxk0h
""".strip().split("\n")

print(tweets)

Note that the tweets are saved as a list of strings. Our goal is to search for all words that start with a hashtag (#). So, it will be easier if we first split the tweets into a list of words.

In [None]:
def split_into_words(list_of_tweets):
    '''
    Takes a list of tweets and returns a list of the words in all tweets.
    Since words are separated by one or more whitespace characters,
    the return value is a list of strings with no whitespace.
    
    Parameters
    ----------
    list_of_tweets: a list of strings. Strings may contain whitespace.
    
    Returns
    -------
    A list of strings. Strings have no whitespace.
    Results from splitting each tweet in list_of_tweets by whitespace.
    '''
    
    words = [word for tweet in list_of_tweets for word in tweet.split()]
    
    return words

The list comprehension above contains a nested `for` loop. It is equivalent to

```python
words = []
for tweet in list_of_tweets:
    for word in tweet.split():
        words.append(word)
```
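
The docstring's promise of "no whitespace" depends on calling `split()` with no argument. The sketch below contrasts it with `split(" ")` on a made-up string; `messy_tweet` is a hypothetical example with irregular spacing, not one of the five tweets.

```python
# Hypothetical tweet text with leading spaces, repeated spaces, a tab,
# and a trailing newline (not part of the assignment data).
messy_tweet = "  #informatics   &amp;\t#skincare  https://t.co/m8qiUSxk0h\n"

# With no argument, split() treats any run of whitespace as one separator
# and ignores leading and trailing whitespace.
no_arg = messy_tweet.split()

# With an explicit " " separator, every single space is a boundary,
# so empty strings appear and the tab and newline are left in place.
with_space = messy_tweet.split(" ")

print(no_arg)
print("" in with_space)
```

This is why `tweet.split()` is enough here even if a tweet contains double spaces, as the last tweet does.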

In [None]:
words = split_into_words(tweets)
print(words)

You may use either the Python `re` module or the `grep` command in Unix to solve this problem. If you choose to use `grep`, you need a text file. Let's save this list as a text file with each word on a separate line.

In [None]:
with open("words.txt", "w") as f:
    for word in words:
        f.write("{}\n".format(word))

The <a href="https://en.wikipedia.org/wiki/Cat_(Unix)">cat</a> command reads files and prints out their contents.

In [None]:
!cat words.txt
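
If `cat` is not available, the same check can be done in Python by reading the file back. This is only a sketch: it writes a small hypothetical list (`sample_words`) to a temporary directory rather than touching the real `words.txt`.

```python
import os
import tempfile

# Hypothetical sample; in the notebook you would use the real `words` list.
sample_words = ["#Informatics", "Specialist", "@RMayNurseDir"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "words.txt")

    # Write one word per line, the same way the cell above does.
    with open(path, "w") as f:
        for word in sample_words:
            f.write("{}\n".format(word))

    # Read the file back; splitlines() drops the trailing newlines.
    with open(path) as f:
        words_read_back = f.read().splitlines()

# True if the round trip preserved every word and its order.
print(words_read_back == sample_words)
```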

## Use regex to search for all words that start with hashtags (#)

- You may use either the Python `re` module or the [grep](https://en.wikipedia.org/wiki/Grep) command in Unix to solve this problem.
- If you use `grep`, capture the results of your shell command, and assign it to a Python variable named `hashtags`. That is, your answer should begin with
```
hashtags = ! # YOUR CODE HERE
```
For example, let's say, instead of hashtags, we want to find all words that start with `@`. We can use `grep` as follows:
```shell
$ grep -E "\@.*" words.txt
```
```
@RMayNurseDir
@FNightingaleF
```
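The `\` before the `@` is optional here: `@` is not a regex metacharacter, so `\@` and `@` match the same literal character. A backslash is only required before characters that have a special meaning, such as `.` or `*`. The `.` matches any character (except newline), and `*` means zero or more repetitions. Note that the pattern is not anchored, so it matches any line that contains a `@`; in `words.txt`, those happen to be exactly the words that start with `@`. To require the `@` at the start of a word, anchor the pattern with `^`, as in `^@`.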

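- To perform the same task in Python, you can use `re.search()`. For example,
```python
[word for word in words if re.search(r"\@.*", word) is not None]
```
The `is not None` test works because `re.search()` returns `None` when a word doesn't match the pattern. The `r` prefix makes the pattern a raw string, so Python passes the backslash to `re` unchanged. Write the analogous expression for hashtags and assign the resulting list to a variable named `hashtags`.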
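
Because `re.search()` scans the whole string, an unanchored pattern also matches words that merely contain the character somewhere. The sketch below shows the difference using `@`; `sample_words` is hypothetical, and `user@example.com` does not appear in the tweets.

```python
import re

# Hypothetical words; "user@example.com" is added only to show the difference.
sample_words = ["@RMayNurseDir", "user@example.com", "talking", "@FNightingaleF"]

# Unanchored: any word that contains "@" anywhere matches.
contains_at = [w for w in sample_words if re.search(r"@", w) is not None]

# Anchored: "^" requires "@" to be the first character of the word.
starts_with_at = [w for w in sample_words if re.search(r"^@", w) is not None]

print(contains_at)
print(starts_with_at)
```

With the five tweets in this problem both forms give the same result, but the anchored form is the safer choice on larger data sets.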

In [None]:
# YOUR CODE HERE

In [None]:
print(hashtags)

In [None]:
assert_is_instance(hashtags, list)
assert_true(all(isinstance(h, str) for h in hashtags))
assert_equal(len(hashtags), 20)
assert_equal(
    set(hashtags),
    set(
        ['#job', '#Ottawa', '#Clinical', '#Informatics', '#jobs',
         '#Clinical', '#Informatics', '#jobs', '#HIT', '#UChicago',
         '#informatics', '#healthcare', '#Books', '#EHR', '#HIE',
         '#HIPAA', '#Health', '#Informatics', '#informatics', '#skincare']
    )
)

As an optional exercise, repeat the task with the other tool: use Python if you used `grep`, or `grep` if you used Python.
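
To compare the two tools outside a notebook, where the `!` syntax is not available, `grep` can also be called through the standard `subprocess` module. The sketch below assumes `grep` is installed and on the `PATH`, and it searches for `@` in a hypothetical `sample_words` list written to a temporary file, not in the real `words.txt`.

```python
import os
import subprocess
import tempfile

# Hypothetical sample; in the notebook you would search the real words.txt.
sample_words = ["@RMayNurseDir", "Scholars", "@FNightingaleF", "talking"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "words.txt")
    with open(path, "w") as f:
        f.write("\n".join(sample_words) + "\n")

    # Run grep with an extended regex; "^@" matches lines that start with "@".
    # Assumes a grep executable is available on this system.
    result = subprocess.run(
        ["grep", "-E", "^@", path],
        capture_output=True,
        text=True,
    )

# grep prints one matching line per row; splitlines() turns that into a list.
mentions = result.stdout.splitlines()
print(mentions)
```

The resulting `mentions` is a plain list of strings, the same shape the assertions in this problem expect for `hashtags`.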

In [None]:
!rm words.txt