# Applied Machine Learning Using Scikit-learn

In this tutorial, you will learn how to use the scikit-learn (sklearn) library to train a model on a training set of data, and then use this model to make predictions on new data. In addition, you will learn how to verify the correctness and quality of your machine learning model.

This tutorial also briefly walks through an example of collecting, cleaning, and preparing data for the model, and shows how to use Python's `pickle` module to save Python objects to files for later use. The example shows how to save the model, load it again, and then make predictions on real data, much as you might in a real project.

The model we are going to train will take the text of a Yelp review for a restaurant and predict the rating for that review.

In [None]:
# Set up library imports
import json
import requests
import sklearn
from bs4 import BeautifulSoup
from testing.testing import test

## Part 1: Data Collection

The first task in any machine learning application is collecting data. Typically, you will want 2 portions of data: a training set and a test set. The former will be used to train your model, and the latter will be used to test its effectiveness. While you can use 2 entirely different data sets, it is usually sufficient to just break one set into 2 pieces.
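As a minimal sketch of breaking one data set into two pieces, scikit-learn's `train_test_split` can shuffle and divide a list of reviews. The toy reviews, the 80/20 ratio, and the `random_state` value are assumptions chosen for illustration only.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for scraped reviews (hypothetical data, not from Yelp)
reviews = [
    {"rating": float(1 + i % 5), "description": f"sample review number {i}"}
    for i in range(10)
]

# Hold out 20% of the reviews for testing; fix the seed so the split is repeatable
train_set, test_set = train_test_split(reviews, test_size=0.2, random_state=0)

print(len(train_set), len(test_set))
```

Fixing `random_state` keeps the split identical between runs, which makes later accuracy numbers comparable.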

Because data collection and data parsing are two very different parts of the process, I have split them up here. First, we will use the `BeautifulSoup` library to scrape `yelp.com` for all the reviews of a restaurant. While `yelp.com` maintains a useful API, most websites do not, so I will collect the data by web scraping because that technique applies more widely.

If you would like to learn more about `BeautifulSoup`, you can visit the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
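To get a feel for `BeautifulSoup` before pointing it at a real site, here is a small sketch that parses a hand-written page. The HTML string, the `example.com` link, and the `summary` class are all made up for illustration; a real Yelp page is far larger and its structure may differ.

```python
import json
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical markup, far simpler than a real Yelp page)
html = """
<html><head>
<script type="application/ld+json">
{"review": [{"reviewRating": {"ratingValue": 4}, "description": "Wonderful!"}]}
</script>
<link rel="next" href="https://example.com/reviews?page=2">
</head><body><p class="summary">1 review</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches
script_tag = soup.find("script", type="application/ld+json")
data = json.loads(script_tag.string)
ratings = [float(r["reviewRating"]["ratingValue"]) for r in data["review"]]

# Tag attributes are read like dictionary keys
next_url = soup.find("link", rel="next")["href"]
summary = soup.find("p", class_="summary").get_text()

print(ratings, next_url, summary)
```

Because `find()` can return `None`, real scraping code should check the result before indexing into it.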

A scraped review can be represented as a Python dictionary such as:
```python
{
    'author': 'Aaron W.',        # str
    'rating': 4.0,               # float
    'date': '2019-01-03',        # str, yyyy-mm-dd
    'description': "Wonderful!"  # str
}
``` 

In [31]:
def retrieve_html_test(retrieve_html):
    status_code, text = retrieve_html("http://www.example.com")
    test.equal(status_code, 200)
    test.true("This domain is established to be used for illustrative examples in documents." in text)
    # Note that the text hash may change depending on the remote server. Feel free to change the test.

@test
def retrieve_html(url, params=None, headers=None):
    """
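    Return the raw HTML at the specified URL.

    Args:
        url (string): the URL to request
        params (dict, optional): query parameters passed through to `requests.get`
        headers (dict, optional): HTTP headers passed through to `requests.get`

    Returns:
        status_code (integer): the HTTP status code of the response
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.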
    """
    
    response = requests.get(url, params=params, headers=headers)
    return (response.status_code, response.text)


### TESTING retrieve_html: PASSED 2/2
###



In [32]:
# Regular expressions are used in the tests and to locate the page count in parse_page_help.
import re

def reviews_check(reviews):
    type_check = lambda field, typ: all(field in r and typ(r[field]) for r in reviews)
    test.true(type_check("rating", lambda r: isinstance(r, float)))
    test.true(type_check("description", lambda r: isinstance(r, str)))

def parse_page_test(parse_page):
    reviews, num_pages = parse_page(retrieve_html("https://www.yelp.com/biz/the-porch-at-schenley-pittsburgh")[1])
    reviews_check(reviews)
    test.equal(len(reviews), 20)
    test.equal(num_pages, 33)

# This helper function returns more data than was requested so as to make extract_reviews easier
def parse_page_help(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    # Find the reviews
    tag = soup.find_all('script', type='application/ld+json')[0]
    my_json = json.loads(tag.string)
    reviews = my_json['review']
    results = []

    # Iterate through the reviews json and reformat the data
    for cur_review in reviews:
        results.append(
            {'rating':float(cur_review['reviewRating']['ratingValue']),
             'description':cur_review['description']}
        )
        
    # Find the url of the next page
    next_page_url = None
    temp = soup.find('link', rel='next')
    if temp is not None:
        next_page_url = temp['href']

    # Find how many total pages there are
    pages = soup.find_all(string=re.compile('^page', flags=re.IGNORECASE))[0]
    index = pages.find('of') + 3
    total_pages = int(pages[index:])
    
    return results, total_pages, next_page_url

@test
def parse_page(html):
    """
    Parse the reviews on a single page of a restaurant.
    
    Args:
        html (string): String of HTML corresponding to a Yelp restaurant

    Returns:
        tuple(list, int): a tuple of two elements
            first element: list of dictionaries corresponding to the extracted review information
            second element: total number of pages
    """
    
    results, total_pages, next_page_url = parse_page_help(html)
    return results, total_pages


### TESTING parse_page: PASSED 4/4
###



---

## Extract All of the Yelp Reviews for a Single Restaurant

Now that we have parsed a single page and figured out a method to go from one page to the next, we are ready to combine these two techniques and actually crawl through web pages!

Using `requests`, programmatically retrieve __ALL__ of the reviews for a __single__ restaurant (provided as a parameter). Just as the API paginates its results, the HTML paginates its reviews (showing 300 reviews at once would make for a very long web page), so to get all the reviews you will need to parse and traverse the HTML. As input, your function will receive a URL corresponding to a Yelp restaurant. As output, return a list of dictionaries (structured the same as the dictionaries returned by `parse_page`) containing the relevant information from the reviews.

Return reviews in the order that they are present on the page.

You will need to get the number of pages on the first request and generate the URL for subsequent pages automatically. Use the Yelp website to see how the URL changes for subsequent pages.
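One way to generate the URLs is sketched below. It assumes, purely for illustration, that each page is selected with a `start` query parameter holding the offset of its first review, and that every page shows 20 reviews. Both are assumptions, and `page_urls` and the `example-restaurant` URL are hypothetical, so check the live site before relying on them. The solution in this tutorial follows the page's `rel="next"` link instead.

```python
from urllib.parse import urlencode

def page_urls(base_url, total_pages, page_size=20):
    """Build one URL per results page, assuming a `start` offset parameter."""
    urls = [base_url]  # the first page needs no offset
    for page in range(1, total_pages):
        urls.append(f"{base_url}?{urlencode({'start': page * page_size})}")
    return urls

urls = page_urls("https://www.yelp.com/biz/example-restaurant", total_pages=3)
for u in urls:
    print(u)
```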

In [None]:
def extract_reviews_test(extract_reviews):
    reviews = extract_reviews("https://www.yelp.com/biz/larry-and-carols-pizza-pittsburgh")
    test.equal(len(reviews), 46) # This may change!
    reviews_check(reviews)

@test
def extract_reviews(url):
    """
    Retrieve ALL of the reviews for a single restaurant on Yelp.

    Parameters:
        url (string): Yelp URL corresponding to the restaurant of interest.

    Returns:
        reviews (list): list of dictionaries containing extracted review information
    """
    
    # Parse the results once so you know how many pages there are
    results, total_pages, next_page_url = parse_page_help(retrieve_html(url)[1])

    # Iterate total - 1 times, since you already read the first page of reviews
    for i in range(total_pages - 1):
        # Stop early if the current page has no rel="next" link to follow
        if next_page_url is None:
            break
        new_results, _, next_page_url = parse_page_help(retrieve_html(next_page_url)[1])
        results.extend(new_results)
    
    return results


Next, we clean each review description to make it more useful for model training.

In [35]:
def clean_reviews_test(clean_reviews):
    reviews = extract_reviews("https://www.yelp.com/biz/larry-and-carols-pizza-pittsburgh")
    og_len = len(reviews)

    clean_reviews(reviews)
    test.equal(len(reviews), og_len)

    for elem in reviews:
        desc = elem['description']
        test.equal(desc, desc.lower())

# Given a description, make it better for model training
#      Remove common words
#      Make all lowercase
#      Remove punctuation
def clean_description(review):
    desc = review['description']
    lower = desc.lower()
    
    # Remove all characters except a-z and ' '
    all_ascii = filter(lambda i: 97 <= ord(i) <= 122 or ord(i) == 32, lower)
    word_list = ''.join(all_ascii).split()
    
    # Remove most of the 100 most common words. Words not removed:
    # not, but, out, like, no, into, good, over, well, most
    # Taken from https://en.wikipedia.org/wiki/Most_common_words_in_English
    common_words = {
        'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'i',
        'it', 'for', 'on', 'with', 'he', 'as', 'you', 'do', 'at',
        'this', 'his', 'by', 'from', 'them', 'we', 'say', 'her', 'she',
        'or', 'an', 'will', 'my', 'one', 'or', 'would', 'there', 'their', 'what',
        'so', 'up', 'if', 'about', 'who', 'get', 'which', 'go', 'me',
        'when', 'make', 'can', 'time', 'just', 'him', 'know', 'take',
        'people', 'year', 'your', 'some', 'could', 'them', 'see', 'other',
        'than', 'then', 'now', 'look', 'only', 'come', 'its', 'think', 'also',
        'back', 'after', 'use', 'two', 'how', 'our', 'work', 'first', 'way',
        'even', 'new', 'want', 'because', 'any', 'these', 'give', 'day', 'us'
        }
    result = [word for word in word_list if word not in common_words]
    
    review['description'] = ' '.join(result)

# Given a list of reviews, clean the descriptions in place
@test
def clean_reviews(reviews):
    for review in reviews:
        clean_description(review)

### TESTING clean_reviews: PASSED 47/47
###



In [50]:
import pickle

filename = "halal.pickle"
# Fetch the reviews before opening the file so a failed scrape does not leave an empty file behind
reviews = extract_reviews("https://www.yelp.com/biz/the-halal-guys-new-york-2")
print(len(reviews))
with open(filename, 'wb') as file:
    pickle.dump(reviews, file)

IndexError: list index out of range

In [49]:
with open(filename, 'rb') as file:
    reviews = pickle.load(file)
    print(len(reviews))
    print(reviews[0])

46
{'rating': 4.0, 'description': "Not bad.  It's basic pizza, folks.  Just what I wanted.  Can't do the heavy toppings.  This was perfect.\n3 stars for the food.  1 extra star because I got my order in a half an hour and the price with tip was under 20 bucks.  That's a winner in my book. \nThey have a big menu to choose from, so I will try them again soon."}


Now that we have a way to obtain data, we can start training our model. We will need:
- training data taken from "https://www.yelp.com/biz/oishii-bento-pittsburgh"
- testing data taken from "https://www.yelp.com/biz/mount-everest-sushi-pittsburgh"

We will:
- train our model on the first set of data using sklearn
- evaluate the accuracy on the testing data
- time how long it takes to make a prediction on average (how good is it?)
- save our model to a file
- load a model from a file
- make predictions based on new data
- get the rating for a restaurant
- write a sample program that will use the above to predict the rating of a restaurant based on the reviews

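Note that this is a supervised regression problem. It is supervised because every training review comes with a known rating, and it is a regression problem because the value we predict, the rating, is a number.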
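The first few steps of this plan can be sketched as follows. This is one possible approach rather than the definitive one: the choice of `TfidfVectorizer` with `RandomForestRegressor`, the helper `split_fields`, and the toy reviews standing in for scraped data are all assumptions made for illustration.

```python
import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline

# Toy stand-ins for cleaned reviews (hypothetical data, not scraped from Yelp)
train_reviews = [
    {"rating": 5.0, "description": "amazing food great service"},
    {"rating": 4.0, "description": "good food friendly staff"},
    {"rating": 2.0, "description": "slow service cold food"},
    {"rating": 1.0, "description": "terrible rude staff awful food"},
]
test_reviews = [
    {"rating": 5.0, "description": "great food amazing staff"},
    {"rating": 1.0, "description": "awful service terrible food"},
]

def split_fields(reviews):
    """Separate review text (features) from star ratings (targets)."""
    return [r["description"] for r in reviews], [r["rating"] for r in reviews]

X_train, y_train = split_fields(train_reviews)
X_test, y_test = split_fields(test_reviews)

# TF-IDF turns each review into a numeric vector; the forest regresses the rating on it
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)

# Mean absolute error: the average number of stars a prediction is off by
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)

# Average wall-clock time per single-review prediction
start = time.perf_counter()
for text in X_test:
    model.predict([text])
seconds_per_prediction = (time.perf_counter() - start) / len(X_test)

print(f"MAE: {mae:.2f} stars, {seconds_per_prediction * 1000:.1f} ms per prediction")
```

Wrapping the vectorizer and the regressor in one pipeline means the same text transformation is applied at training and prediction time, and the whole pipeline can later be saved as a single object.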

In [None]:
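from sklearn.ensemble import RandomForestClassifier

# NOTE: `df` is assumed to be an already loaded pandas DataFrame with a binary
# target column named 'qual_student'; it is not defined in this notebook.
dependent_variable = 'qual_student'
# Every column except the target is used as a feature
x = df[df.columns.difference([dependent_variable])]
y = df[dependent_variable]
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(x, y)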

In [None]:
from sklearn.metrics import f1_score

# Caution: this scores the model on the same data it was trained on
pred = clf.predict(x)
f1_score(y, pred, average='binary')

It's not very good! We scored the model on its own training data and didn't even cross-validate. You'll need to do better :)
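As a hedged sketch of what cross-validation could look like, `cross_val_score` fits and scores the model on several train/validation splits. The synthetic data from `make_classification`, the fold count, and the forest size are assumptions standing in for the real features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary data stands in for the real features and labels (an assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Each of the 5 folds is scored on data the model did not see during fitting
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")

print(scores.mean(), scores.std())
```

The mean of the fold scores is a more honest estimate of quality than a score computed on the training data.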
Let's export this model so we can use it in a microservice (a Flask API).

In [None]:
import joblib
joblib.dump(clf, '/home/matrix/dockerfile/apps/model.pkl')
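To check that a saved model can be used again, a round trip with `joblib.dump` and `joblib.load` might look like the sketch below. The temporary directory, the synthetic data, and the small forest are assumptions for illustration; the notebook itself writes to a fixed path.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small model stand in for the real ones (an assumption)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(clf, model_path)        # write the fitted model to disk
    restored = joblib.load(model_path)  # read it back as a new object

# The restored model should predict exactly what the original predicts
same_predictions = (restored.predict(X) == clf.predict(X)).all()
print(same_predictions)
```

Only load model files you trust, since `joblib` files, like pickles, can execute arbitrary code when loaded.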

In [None]:
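import pandas as pd

# Columns are listed alphabetically to match the training features,
# because Index.difference returns its result sorted
query_df = pd.DataFrame({'absences': [10], 'age': [1], 'health': [15]})
pred = clf.predict(query_df)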