# Practicing Python Generators

In this exercise, we are going to extract *keywords* from the "about" text of a Yelp business (which could be retrieved through the Yelp Fusion API). In our setting, *keywords* are defined as phrases that contain two or more words starting with an uppercase letter (e.g. "San Francisco" and "Los Angeles", but not "California"). For example, the keywords are highlighted below in AnQi Bistro's about text.

> **House of An** is a portfolio of restaurants and a catering company all featuring the **Euro-Asian** fusion cuisine of **Master Chef Helene "Mama" An**. The company is run by the close-knit members of the An family, which includes Helene's five daughters and now, a new generation of their offspring. The company began with **Thanh Long in San Francisco**, which opened in 1971, bringing **Euro-Vietnamese** food to the city. When **Crustacean San Francisco** opened, in 1991, the **San Francisco Chronicle** newspaper called it the first Asian fusion restaurant, and dubbed **Helene An the "Mother of Fusion."** In 1997, Helene's daughter **Elizabeth An** took the idea upscale by opening **Crustacean Restaurant in Beverly Hills**, which quickly attracted raves reviews in national press and a sizable celebrity clientele.

Note that stopwords such as ``the``, ``of``, ``in``, ``when`` may be included in the keywords **BUT** not at the beginning or end of the keywords. This is why **House of An** is okay, but not the **<s>the</s> San Francisco Chronicle**. The expected output of the above text is:

```
['House of An',
 'Euro Asian',
 'Master Chef Helene Mama An',
 'Thanh Long in San Francisco',
 'Euro Vietnamese',
 'Crustacean San Francisco',
 'San Francisco Chronicle',
 'Helene An the Mother of Fusion',
 'Elizabeth An',
 'Crustacean Restaurant in Beverly Hills']
 ```
 
In the next 7 steps, we will walk through how to achieve this output.


## Step 1
First, we split the about text into sentences, and then split each sentence into words. We are going to use [Python's regex](https://docs.python.org/3/library/re.html#re.split) to turn sentences into words; the code below calls `re.findall(r'\w+', sentence)`, which returns every run of word characters. This approach is more thorough than the plain string split (which we use for splitting the text into sentences) because it handles a wider set of whitespace characters and also drops punctuation. There is no work to be done in this step; just note the input sentences and their corresponding words.

In [4]:
about = '''House of An is a portfolio of restaurants and a catering 
company all featuring the Euro-Asian fusion cuisine of Master Chef
Helene "Mama" An. The company is run by the close-knit members of
the An family, which includes Helene's five daughters and now, a
new generation of their offspring. The company began with Thanh
Long in San Francisco, which opened in 1971, bringing
Euro-Vietnamese food to the city. When Crustacean San Francisco
opened, in 1991, the San Francisco Chronicle newspaper called it
the first Asian fusion restaurant, and dubbed Helene An
the "Mother of Fusion." In 1997, Helene's daughter Elizabeth An
took the idea upscale by opening Crustacean Restaurant in Beverly
Hills, which quickly attracted raves reviews in national press
and a sizable celebrity clientele.'''

import re
for sentence in about.split('.'):
    words = re.findall(r'\w+', sentence)
    print(f"\n'{sentence}'")
    print(words)


'House of An is a portfolio of restaurants and a catering 
company all featuring the Euro-Asian fusion cuisine of Master Chef
Helene "Mama" An'
['House', 'of', 'An', 'is', 'a', 'portfolio', 'of', 'restaurants', 'and', 'a', 'catering', 'company', 'all', 'featuring', 'the', 'Euro', 'Asian', 'fusion', 'cuisine', 'of', 'Master', 'Chef', 'Helene', 'Mama', 'An']

' The company is run by the close-knit members of
the An family, which includes Helene's five daughters and now, a
new generation of their offspring'
['The', 'company', 'is', 'run', 'by', 'the', 'close', 'knit', 'members', 'of', 'the', 'An', 'family', 'which', 'includes', 'Helene', 's', 'five', 'daughters', 'and', 'now', 'a', 'new', 'generation', 'of', 'their', 'offspring']

' The company began with Thanh
Long in San Francisco, which opened in 1971, bringing
Euro-Vietnamese food to the city'
['The', 'company', 'began', 'with', 'Thanh', 'Long', 'in', 'San', 'Francisco', 'which', 'opened', 'in', '1971', 'bringing', 'Euro', 'Vietnamese', 'food', 'to', 'the', 'city']
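
To make the difference between the two splitting approaches concrete, here is a small comparison. The `sample` string is made up for illustration and is not part of the exercise; it is only a sketch of how `str.split()` and `re.findall(r'\w+', ...)` treat punctuation and mixed whitespace.

```python
import re

# Hypothetical sample text with a hyphen, quotes, a tab and a newline.
sample = 'the Euro-Asian\tcuisine of "Mama" An,\nsince 1971'

# str.split() with no argument splits on whitespace only,
# so punctuation stays attached to the words.
by_split = sample.split()

# re.findall(r'\w+') keeps runs of letters, digits and underscores,
# so hyphens, quotes and commas act as separators too.
by_regex = re.findall(r'\w+', sample)

print(by_split)
print(by_regex)
```

With the regex, `Euro-Asian` becomes two words and `"Mama"` loses its quotes, which is what the later steps rely on.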

## Step 2
Next, we are going to replace every word that does not start with an uppercase letter with a `*`. Please write a generator that takes a list of words and replaces lowercase-starting words with a `*` on the fly. You may only edit the `lowerToStar` generator below; the rest should stay the same. Note that `lowerToStar` must be a generator function, i.e. it must use a `yield` statement. The expected output is also provided below.

In [2]:
import re

def lowerToStar(words):
    for w in words:
        yield w if w[0].isupper() else '*'

for sentence in about.split('.'):
    words = re.findall(r'\w+', sentence)
    print(list(lowerToStar(words)))

['House', '*', 'An', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', 'Euro', 'Asian', '*', '*', '*', 'Master', 'Chef', 'Helene', 'Mama', 'An']
['The', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', 'An', '*', '*', '*', 'Helene', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*']
['The', '*', '*', '*', 'Thanh', 'Long', '*', 'San', 'Francisco', '*', '*', '*', '*', '*', 'Euro', 'Vietnamese', '*', '*', '*', '*']
['When', 'Crustacean', 'San', 'Francisco', '*', '*', '*', '*', 'San', 'Francisco', 'Chronicle', '*', '*', '*', '*', '*', 'Asian', '*', '*', '*', '*', 'Helene', 'An', '*', 'Mother', '*', 'Fusion']
['In', '*', 'Helene', '*', '*', 'Elizabeth', 'An', '*', '*', '*', '*', '*', '*', 'Crustacean', 'Restaurant', '*', 'Beverly', 'Hills', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*']
[]
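
As a side note on what "must use a `yield` statement" implies, the sketch below checks that calling `lowerToStar` only builds a generator object and that values are produced one at a time on demand. The three-word input list is made up for illustration.

```python
import types

def lowerToStar(words):
    # Same logic as the Step-2 cell: keep capitalized words, star the rest.
    for w in words:
        yield w if w[0].isupper() else '*'

gen = lowerToStar(['San', 'francisco', 'Bay'])

# Calling the function runs none of its body yet; it only builds a generator.
is_generator = isinstance(gen, types.GeneratorType)

# Each next() resumes the body until the following yield.
first = next(gen)
second = next(gen)
rest = list(gen)     # consumes whatever is left

print(is_generator, first, second, rest)
```

This is why the cells wrap the call in `list(...)` before printing: printing the generator itself would only show the object, not its values.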


## Step 3
In this step, we will replace each run of consecutive `*` elements with a single `*`. Please complete the `removeStars` generator, which takes the output of your Step-2 generator and produces a new stream without consecutive `*`. Again, you must use `yield` inside `removeStars`.

In [3]:
import re

def lowerToStar(words):
    for w in words:
        yield w if w[0].isupper() else '*'

def removeStars(words):
    prev = None
    for w in words:
        if w != '*' or prev != '*':
            yield w
        prev = w

for sentence in about.split('.'):
    words = re.findall(r'\w+', sentence)
    print(list(removeStars(lowerToStar(words))))

['House', '*', 'An', '*', 'Euro', 'Asian', '*', 'Master', 'Chef', 'Helene', 'Mama', 'An']
['The', '*', 'An', '*', 'Helene', '*']
['The', '*', 'Thanh', 'Long', '*', 'San', 'Francisco', '*', 'Euro', 'Vietnamese', '*']
['When', 'Crustacean', 'San', 'Francisco', '*', 'San', 'Francisco', 'Chronicle', '*', 'Asian', '*', 'Helene', 'An', '*', 'Mother', '*', 'Fusion']
['In', '*', 'Helene', '*', 'Elizabeth', 'An', '*', 'Crustacean', 'Restaurant', '*', 'Beverly', 'Hills', '*']
[]
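
The same collapsing can also be written with `itertools.groupby` from the standard library. This is an alternative sketch, not the solution the exercise asks for; the name `removeStarsGroupby` and the sample stream are made up for illustration.

```python
from itertools import groupby

def removeStarsGroupby(words):
    # groupby() with no key bundles consecutive equal items into runs.
    for key, run in groupby(words):
        if key == '*':
            yield '*'          # a whole run of stars becomes one star
        else:
            yield from run     # repeated real words are kept as they are

stream = ['The', '*', '*', '*', 'San', 'San', '*', '*', 'Bay']
print(list(removeStarsGroupby(stream)))
```

Like the `prev`-tracking version, this variant still uses `yield` and processes the input lazily.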


## Step 4
In this step, we walk through the stream of potential keywords from Step-3, from left to right, and combine consecutive words into lists of words, using `*` as the separator. We will also remove all `*` in this step. Please perform this task in the `combineWords` generator.

In [4]:
import re

def lowerToStar(words):
    for w in words:
        yield w if w[0].isupper() else '*'

def removeStars(words):
    prev = None
    for w in words:
        if w != '*' or prev != '*':
            yield w
        prev = w

def combineWords(words):
    sublist = []                 # words collected since the last '*'
    for w in words:
        if w == '*':             # a '*' closes the current group
            yield sublist
            sublist = []
        else:
            sublist.append(w)
    if len(sublist) > 0:         # flush the final group
        yield sublist

for sentence in about.split('.'):
    words = re.findall(r'\w+', sentence)
    print(list(combineWords(removeStars(lowerToStar(words)))))


[['House'], ['An'], ['Euro', 'Asian'], ['Master', 'Chef', 'Helene', 'Mama', 'An']]
[['The'], ['An'], ['Helene']]
[['The'], ['Thanh', 'Long'], ['San', 'Francisco'], ['Euro', 'Vietnamese']]
[['When', 'Crustacean', 'San', 'Francisco'], ['San', 'Francisco', 'Chronicle'], ['Asian'], ['Helene', 'An'], ['Mother'], ['Fusion']]
[['In'], ['Helene'], ['Elizabeth', 'An'], ['Crustacean', 'Restaurant'], ['Beverly', 'Hills']]
[]
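
One edge case is not exercised by the sample text: every sentence in `about` starts with a capitalized word, so the stream never begins with `*`. A `combineWords` that yields on every `*` would emit an empty list first for a stream that does. The sketch below, under the hypothetical name `combineWordsGuarded`, shows one way to guard against that; it is an optional hardening, not something the exercise requires.

```python
def combineWordsGuarded(words):
    # Hypothetical variant of combineWords that never yields an empty list.
    sublist = []
    for w in words:
        if w == '*':
            if sublist:          # skip the empty group before a leading '*'
                yield sublist
            sublist = []
        else:
            sublist.append(w)
    if sublist:                  # flush the last group, if any
        yield sublist

# A made-up stream that starts with '*', as a lowercase-initial sentence would.
stream = ['*', 'San', 'Francisco', '*', 'Bay']
print(list(combineWordsGuarded(stream)))
```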


## Step 5
Please modify your `combineWords` generator from Step-4 to output a string (i.e. all words joined by a single space) instead of a list.

In [5]:
import re

def lowerToStar(words):
    for w in words:
        yield w if w[0].isupper() else '*'

def removeStars(words):
    prev = None
    for w in words:
        if w != '*' or prev != '*':
            yield w
        prev = w

def combineWords(words):
    sublist = []                 # words collected since the last '*'
    for w in words:
        if w == '*':             # a '*' closes the current group
            yield ' '.join(sublist)
            sublist = []
        else:
            sublist.append(w)
    if len(sublist) > 0:         # flush the final group
        yield ' '.join(sublist)

for sentence in about.split('.'):
    words = re.findall(r'\w+', sentence)
    print(list(combineWords(removeStars(lowerToStar(words)))))

['House', 'An', 'Euro Asian', 'Master Chef Helene Mama An']
['The', 'An', 'Helene']
['The', 'Thanh Long', 'San Francisco', 'Euro Vietnamese']
['When Crustacean San Francisco', 'San Francisco Chronicle', 'Asian', 'Helene An', 'Mother', 'Fusion']
['In', 'Helene', 'Elizabeth An', 'Crustacean Restaurant', 'Beverly Hills']
[]


## Step 6
We are very close to the expected output. However, the stopwords are missing from the keywords: **House of An** is being treated as **House** and **An**. In this step, we will take a set of stopwords (strings stored in the variable `stopwords`), and:

1. Modify `lowerToStar` to not convert stopwords into `*` even though they do not start with an uppercase letter.

2. Modify `combineWords` to keep only keywords with two or more words, and the first word of those keywords cannot be a stopword.

In [16]:
import re

stopwords = set(['the', 'of', 'when', 'in'])

def lowerToStar(words):
    for w in words:
        yield w if (w[0].isupper() or w.lower() in stopwords) else '*'

def removeStars(words):
    prev = None
    for w in words:
        if w != '*' or prev != '*':
            yield w
        prev = w

def combineWords(words):
    sublist = []
    for w in words:
        if w == '*':
            if len(sublist) > 1:
                yield ' '.join(sublist)
            sublist = []
        elif len(sublist) > 0 or w.lower() not in stopwords:
            sublist.append(w)
    if len(sublist) > 1:
        yield ' '.join(sublist)

for sentence in about.split('.'):
    words = re.findall(r'\w+', sentence)
    print(list(combineWords(removeStars(lowerToStar(words)))))

['House of An', 'Euro Asian', 'Master Chef Helene Mama An']
[]
['Thanh Long in San Francisco', 'Euro Vietnamese']
['Crustacean San Francisco', 'San Francisco Chronicle', 'Helene An the Mother of Fusion']
['Elizabeth An', 'Crustacean Restaurant in Beverly Hills']
[]
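
The introduction also says that a keyword may not end with a stopword. The Step-6 code only filters leading stopwords, which is enough for this text because no candidate ends with one. As a sketch of how the trailing case could be handled, the variant below trims stopwords from the end of each candidate before checking its length. The names `trimTrailingStopwords` and `combineWordsTrimmed`, and the sample stream, are hypothetical.

```python
stopwords = {'the', 'of', 'when', 'in'}

def trimTrailingStopwords(sublist):
    # Hypothetical helper: drop stopwords from the end of a candidate keyword.
    end = len(sublist)
    while end > 0 and sublist[end - 1].lower() in stopwords:
        end -= 1
    return sublist[:end]

def combineWordsTrimmed(words):
    # Same flow as the Step-6 combineWords, plus trailing-stopword trimming.
    sublist = []
    for w in words:
        if w == '*':
            sublist = trimTrailingStopwords(sublist)
            if len(sublist) > 1:
                yield ' '.join(sublist)
            sublist = []
        elif sublist or w.lower() not in stopwords:
            sublist.append(w)
    sublist = trimTrailingStopwords(sublist)
    if len(sublist) > 1:
        yield ' '.join(sublist)

# "Golden Gate in the" ends with two stopwords before the next '*'.
stream = ['Golden', 'Gate', 'in', 'the', '*', 'Bay', 'of', '*', 'Napa', 'Valley']
print(list(combineWordsTrimmed(stream)))
```

Here `Bay of` shrinks to a single word after trimming, so it is dropped by the two-word rule.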


## Step 7
Finally, we would like to wrap the entire code in Step-6 into a generator, `extractKeywords`, and call it directly on the input. You can simply copy `lowerToStar`, `removeStars`, and `combineWords` from your Step-6 here and run. No additional work needed.

In [17]:
import re

stopwords = set(['the', 'of', 'when', 'in'])


def lowerToStar(words):
    for w in words:
        yield w if (w[0].isupper() or w.lower() in stopwords) else '*'

def removeStars(words):
    prev = None
    for w in words:
        if w != '*' or prev != '*':
            yield w
        prev = w

def combineWords(words):
    sublist = []
    for w in words:
        if w == '*':
            if len(sublist) > 1:
                yield ' '.join(sublist)
            sublist = []
        elif len(sublist) > 0 or w.lower() not in stopwords:
            sublist.append(w)
    if len(sublist) > 1:
        yield ' '.join(sublist)

def extractKeywords(about):
    for sentence in about.split('.'):
        words = re.findall(r'\w+', sentence)
        yield from combineWords(removeStars(lowerToStar(words)))

list(extractKeywords(about))

['House of An',
 'Euro Asian',
 'Master Chef Helene Mama An',
 'Thanh Long in San Francisco',
 'Euro Vietnamese',
 'Crustacean San Francisco',
 'San Francisco Chronicle',
 'Helene An the Mother of Fusion',
 'Elizabeth An',
 'Crustacean Restaurant in Beverly Hills']
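
Because `extractKeywords` is itself a generator, a caller does not have to materialize the full list. The sketch below repeats the Step-7 functions so that it runs on its own, then uses `itertools.islice` to pull only the first two keywords from a made-up `text` (not from the exercise) before draining the rest.

```python
import re
from itertools import islice

stopwords = {'the', 'of', 'when', 'in'}

def lowerToStar(words):
    for w in words:
        yield w if (w[0].isupper() or w.lower() in stopwords) else '*'

def removeStars(words):
    prev = None
    for w in words:
        if w != '*' or prev != '*':
            yield w
        prev = w

def combineWords(words):
    sublist = []
    for w in words:
        if w == '*':
            if len(sublist) > 1:
                yield ' '.join(sublist)
            sublist = []
        elif len(sublist) > 0 or w.lower() not in stopwords:
            sublist.append(w)
    if len(sublist) > 1:
        yield ' '.join(sublist)

def extractKeywords(about):
    for sentence in about.split('.'):
        words = re.findall(r'\w+', sentence)
        yield from combineWords(removeStars(lowerToStar(words)))

# Made-up text for illustration.
text = 'We flew from New York to San Francisco. Then we drove to Los Angeles.'

keywords = extractKeywords(text)
first_two = list(islice(keywords, 2))   # pulls only as many items as asked
remaining = list(keywords)              # the generator resumes where it stopped
print(first_two, remaining)
```

The second sentence is not even tokenized until `remaining` is requested, which is the practical benefit of chaining generators instead of building intermediate lists.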