# Practical Machine Learning Using Scikit-learn
#### By Niko Gupta

This tutorial will take you through the process, from start to finish, of using machine learning in a real application. You will learn how to use the scikit-learn (sklearn) library to train a model on a training set of data, and then use it to make predictions on new data. In addition, you will learn how to verify the correctness and quality of your machine learning model.

This tutorial also covers an example of how to clean and prepare data for the model, along with how to use the Pickle library to save Python objects to files for later use. The example will show how to save the model, load it up again, and then make predictions on real data, much like you might do in a real application.

The model we are going to train will take the text from a Yelp review for a restaurant, and predict the star rating for that review.

#### Things the tutorial will cover:
* [Data Collection](#Part-1:-Data-Collection)
* [Cleaning the Data](#Part-2:-Cleaning-the-Data)
* [Training the Model](#Part-3:-Training-the-Model)
* [Evaluating the Model](#Part-4:-Evaluating-the-Model)
* [Improving the Model](#Part-5:-Improving-the-Model)

In [2]:
# Set up library imports
import json
import pickle
import requests
import sklearn
from bs4 import BeautifulSoup
from testing.testing import test

# Global variables
review_input_file = 'data/review.json'
training_data_file = 'training_data.pickle'
test_data_file = 'test_data.pickle'

## Part 1: Data Collection

The first task in any machine learning application is collecting data. Typically, you will want 2 sets of data: a training set and a test set. The former will be used to train your model, and the latter will be used to test its effectiveness. While you can use 2 entirely different data sets, it is usually sufficient to just break one set into 2 parts.
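As a rough sketch of breaking one set into two parts, sklearn's `train_test_split` shuffles the data and splits it in one call. The `reviews` list here is a hypothetical stand-in for real data, and the 50/50 ratio is only an assumption chosen to mirror the even split used later in this tutorial.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a list of loaded reviews
reviews = [{'text': f'review {i}', 'stars': 1 + i % 5} for i in range(10)]

# Hold out half of the reviews for testing; random_state makes the shuffle repeatable
training_data, test_data = train_test_split(reviews, test_size=0.5, random_state=0)

print(len(training_data), len(test_data))
```

Shuffling before splitting guards against any ordering in the source file (for example, by date) leaking into the comparison between the two sets.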

For this tutorial we will be using the [Yelp dataset](https://www.yelp.com/dataset/). This dataset was released for students to use in data science applications. It is publicly accessible, but due to Yelp's restrictions on caching their data, I cannot include the actual dataset in this tutorial. You can download the dataset [here](https://www.yelp.com/dataset/download) to follow along with the tutorial.

The dataset includes business data, reviews, user information (including a pseudo social network through friend mappings), business check-ins, and photo information. For this tutorial we are interested specifically in the restaurant reviews.

One problem we encounter with the Yelp dataset is that the uncompressed archive is 8 GB. In particular, the `review.json` file that we are interested in is 5 GB, containing over 6 million reviews. Loading this entire file into RAM at once could exhaust the available memory on many machines. In addition, we don't need all 6 million reviews for training or testing the model. Instead we are going to hardcode the total number of reviews we would like to use, and only read that many from the file.

In [10]:
data_set_size = 20000

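from itertools import islice

# Load up to data_set_size reviews from filename (one JSON object per line).
# Reading line by line means the whole file is never held in memory, and
# islice stops cleanly if the file has fewer lines than data_set_size.
def get_json(filename):
    with open(filename, 'rb') as json_file:
        return [json.loads(line) for line in islice(json_file, data_set_size)]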

## Part 2: Cleaning the Data

Now that we have the reviews loaded into a list of dictionaries, we need to reformat them into something that is more readily usable by a machine learning model. The operations we will perform to clean the data are:

* Remove data we don't care about, i.e. the extra JSON fields
* Standardize the reviews by removing non-alphabetical characters, such as punctuation
* Make all words lowercase so that the model doesn't treat the same word with different case as different features
* Remove the most common words in the English language (such as 'the' and 'as') so that the model trains on words that are actually relevant within each review. Some of the top 100 words seem likely to be relevant to a review, so it would not make sense to remove them. In this case, I chose not to remove `not`, `but`, `out`, `like`, `no`, `into`, `good`, `over`, `well`, and `most`.
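To make the text-cleaning steps concrete, here is a minimal worked example on a single sentence. It is a sketch only: `clean_text` and `COMMON_WORDS` are hypothetical names, the word set is a small illustrative subset of the full list, and this version replaces punctuation with a space (so `don't` becomes `don t`) rather than deleting it.

```python
import re

# Small illustrative subset of the common-word list, not the full set
COMMON_WORDS = {'the', 'and', 'a', 'i', 'it', 'to'}

def clean_text(text):
    # Lowercase, then replace anything that is not a-z or whitespace with a space
    letters_only = re.sub(r'[^a-z\s]', ' ', text.lower())
    # Drop the common words and rejoin the rest with single spaces
    return ' '.join(w for w in letters_only.split() if w not in COMMON_WORDS)

cleaned = clean_text("I LOVED the tacos, and it was NOT expensive!")
print(cleaned)  # loved tacos was not expensive
```

Note that `not` survives the cleaning, which matters here: without it, the sentence would read as a complaint about price.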

In [4]:
def get_datasets_test(get_datasets):
    get_datasets()

# Given a single review, remove unnecessary columns
def remove_extra(review):
    review.pop('review_id')
    review.pop('user_id')
    review.pop('business_id')
    review.pop('date')
    review.pop('useful')
    review.pop('funny')
    review.pop('cool')
    

# Given a single review, clean its description
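def clean_description(review):
    lower = review['text'].lower()
    
    # Keep only the letters a-z and whitespace. Keeping every kind of whitespace
    # (not just ' ') stops words on either side of a newline from being merged.
    letters = filter(lambda c: 'a' <= c <= 'z' or c.isspace(), lower)
    
    # Join the filtered characters back into a string and split it into words.
    # The set below is the 100 most common English words, minus a few that may
    # carry sentiment (such as 'not' and 'good'), which we deliberately keep.
    word_list = ''.join(letters).split()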
    common_words = {
        'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'i',
        'it', 'for', 'on', 'with', 'he', 'as', 'you', 'do', 'at',
        'this', 'his', 'by', 'from', 'they', 'we', 'say', 'her', 'she',
        'or', 'an', 'will', 'my', 'one', 'all', 'would', 'there', 'their', 'what',
        'so', 'up', 'if', 'about', 'who', 'get', 'which', 'go', 'me',
        'when', 'make', 'can', 'time', 'just', 'him', 'know', 'take',
        'people', 'year', 'your', 'some', 'could', 'them', 'see', 'other',
        'than', 'then', 'now', 'look', 'only', 'come', 'its', 'think', 'also',
        'back', 'after', 'use', 'two', 'how', 'our', 'work', 'first', 'way',
        'even', 'new', 'want', 'because', 'any', 'these', 'give', 'day', 'us'
        }
    result = [word for word in word_list if word not in common_words]
    
    # Combine the result back into a string, and overwrite the original description
    review['text'] = ' '.join(result)

# Given a list of reviews, clean the descriptions and remove extra columns.
#    Note: this function modifies the input in place
def clean_reviews(reviews):
    for review in reviews:
        remove_extra(review)
        clean_description(review)

# Load the data, clean it, and break it into our test and training sets
@test
def get_datasets():
    data = get_json(review_input_file)
    clean_reviews(data)

    # Split the loaded reviews in half
    mid = len(data) // 2
    training_data = data[:mid]
    test_data = data[mid:]

    # Save as pickle objects so we don't have to reload and clean the data every time
    with open(training_data_file, 'wb') as file:
        pickle.dump(training_data, file)
    
    with open(test_data_file, 'wb') as file:
        pickle.dump(test_data, file)

### TESTING get_datasets: PASSED 0/0
###



---

## Part 3: Training the Model

The first part of training a machine learning model is identifying what kind of problem you have. It can fall into a few different categories:

* __Unsupervised learning:__ the model is given a set of inputs, with no target outputs. The model will attempt to find structure in the inputs (for example, groups of similar items), and use that structure to describe future inputs.

* __Supervised learning:__ the model is given both a set of inputs and the target outputs corresponding to those inputs. Depending on the type of the target output, this can be further broken down into classification and regression.

    * __Classification:__ output is one of a set of categories. The goal is to find patterns that map the input to the correct category, and then, for unseen input, predict which output label to classify it with.

    * __Regression:__ output is a continuous value. For example, if we were predicting the expected GPA of students based on study time, absences, and age, the output would be continuous.

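In our case, the problem is a supervised classification problem. We already know the ratings for each review in the training set, and we want to be able to predict the rating for future reviews. Star ratings are ordered, so the problem could also be framed as regression, but in this tutorial we treat each rating from 1 to 5 as a separate category.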

One other problem comes to light: machine learning models are effectively tuning a complicated equation using the training data, and using that to "predict" output on new data. However, we have text data, which isn't easily usable as input for an equation. This means we need to do some further processing on the input datasets before we can use them.

One of the most common transformations used in natural language processing is the "bag of words" model. A piece of text is parsed into a dictionary mapping each word to the number of times it appears within the text. Then each word is assigned a unique ID, and a separate data structure is kept to maintain this "vocabulary mapping". Stacking these mappings of unique ID to frequency, one row per review and one column per word, gives a 2-dimensional matrix, which can be used to train our model.
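The bag-of-words transformation is easiest to see on a tiny input. The sketch below runs `CountVectorizer` on two made-up documents; the documents are assumptions for illustration, not part of the Yelp data.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['good food good service', 'bad food']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each word to its column index (assigned in alphabetical order)
vocabulary = vectorizer.vocabulary_
print(sorted(vocabulary.items()))

# toarray() is only sensible for tiny examples like this one
dense = counts.toarray()
print(dense)
```

The columns correspond to `bad`, `food`, `good`, `service`, so the first document becomes the row `[0, 1, 2, 1]` and the second becomes `[1, 1, 0, 0]`.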


In [5]:
model_file = "model.pickle"

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

In [6]:
def train_model_test(train_model):
    train_model(MultinomialNB())

@test
def train_model(algorithm):
    # Load our training data from the pickle file
    with open(training_data_file, 'rb') as file:
        data = pickle.load(file)
    
    # For training, we must separate the input and output
    text = list(map(lambda x: x['text'], data))
    stars = list(map(lambda x: int(x['stars']), data))
    
    # Generate our "bag of words" frequency mapping
    count_vect = CountVectorizer()
    bag_of_words = count_vect.fit_transform(text)
    # Uncomment to inspect the word -> column index mapping
    # print(count_vect.vocabulary_)

    # Reweight the raw counts by tf-idf, so that words appearing in
    # almost every review count for less
    transformer = TfidfTransformer()
    train_data = transformer.fit_transform(bag_of_words)
    
    # Training time! Fit first, so a failed fit doesn't leave an empty file behind
    model = algorithm.fit(train_data, stars)
    with open(model_file, 'wb') as file:
        pickle.dump((model, count_vect, transformer), file)

### TESTING train_model: PASSED 0/0
###



In order to streamline the process of vectorizing and transforming input text, then predicting the output for that input, sklearn provides the `Pipeline` class. This allows us to cleanly package everything together, and the above function simplifies to:

In [7]:
from sklearn.pipeline import Pipeline

def train_model_pipe_test(train_model_pipe):
    train_model_pipe(MultinomialNB())

@test
def train_model_pipe(algorithm):
    # Load our training data from the pickle file
    with open(training_data_file, 'rb') as file:
        data = pickle.load(file)
    
    # For training, we must separate the input and output
    text = list(map(lambda x: x['text'], data))
    stars = list(map(lambda x: int(x['stars']), data))
    
    # Note that the names here are arbitrary; they allow us
    # to refer back to it later
    model = Pipeline([
        ('count_vect', CountVectorizer()),
        ('transformer', TfidfTransformer()),
        ('algorithm', algorithm)
    ])
    
    # Training time! Fit first, so a failed fit doesn't leave an empty file behind
    model.fit(text, stars)
    with open(model_file, 'wb') as file:
        pickle.dump(model, file)

### TESTING train_model_pipe: PASSED 0/0
###
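One benefit of the pipeline is that the saved object accepts raw text directly, with no separate vectorizing step at prediction time. The sketch below shows this on toy data (the four example reviews are assumptions), using an in-memory pickle round trip in place of the file on disk.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the cleaned reviews
text = ['great food', 'terrible service', 'great service', 'terrible food']
stars = [5, 1, 5, 1]

model = Pipeline([
    ('count_vect', CountVectorizer()),
    ('transformer', TfidfTransformer()),
    ('algorithm', MultinomialNB()),
])
model.fit(text, stars)

# The step names give access to the fitted pieces
vocabulary = model.named_steps['count_vect'].vocabulary_

# Round-trip through pickle in memory, mirroring what train_model_pipe does on disk
restored = pickle.loads(pickle.dumps(model))
prediction = restored.predict(['great tacos'])
print(prediction)
```

The word `tacos` never appeared in training, so it is ignored; the prediction rests on `great` alone.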



## Part 4: Evaluating the Model

Now that we have a model, we have to test it to see how effective it is. Many metrics exist for evaluating performance, but we will focus primarily on the accuracy of the model against our test data.
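Accuracy is simply the fraction of predictions that exactly match the true label. The short sketch below computes it by hand and with `metrics.accuracy_score` on made-up ratings, and also builds a confusion matrix, which shows where the mistakes fall. The `stars` and `pred` values are hypothetical.

```python
from sklearn import metrics

# Hypothetical true and predicted star ratings
stars = [5, 5, 4, 1, 3, 5]
pred = [5, 5, 5, 1, 5, 4]

# Accuracy by hand: matching predictions divided by total predictions
by_hand = sum(s == p for s, p in zip(stars, pred)) / len(stars)
accuracy = metrics.accuracy_score(stars, pred)

# Rows are true ratings, columns are predicted ratings, both ordered 1 to 5
matrix = metrics.confusion_matrix(stars, pred, labels=[1, 2, 3, 4, 5])

print(by_hand, accuracy)
print(matrix)
```

Three of the six predictions match, so both calculations give 0.5.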

In [8]:
from sklearn import metrics

def test_model_test(test_model):
    test_model()

@test
def test_model(verbose=True):
    # Load model and test data
    with open(model_file, 'rb') as file:
        model = pickle.load(file)
    
    with open(test_data_file, 'rb') as file:
        data = pickle.load(file)
    
    # Separate the input and output
    text = list(map(lambda x: x['text'], data))
    stars = list(map(lambda x: int(x['stars']), data))
    
    # For a classifier, score() returns the mean accuracy: the fraction of
    # reviews whose predicted rating exactly matches the true rating
    accuracy = model.score(text, stars)
    if verbose:
        print('Accuracy: ', accuracy)
    
    # Get a little more insight
    if verbose:
        pred = model.predict(text)
        print(metrics.classification_report(stars, pred))
    
    return accuracy

Accuracy:  0.4619
              precision    recall  f1-score   support

           1       0.97      0.07      0.13      1403
           2       0.00      0.00      0.00       730
           3       0.00      0.00      0.00      1175
           4       0.43      0.01      0.01      2179
           5       0.46      1.00      0.63      4513

    accuracy                           0.46     10000
   macro avg       0.37      0.22      0.15     10000
weighted avg       0.44      0.46      0.30     10000

### TESTING test_model: PASSED 0/0
###





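This is pretty bad performance! The recall column shows why: the model labels almost every review as 5 stars, and it never predicts 2 or 3 stars at all. Let's look into how we can improve it.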
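To put the 0.46 accuracy in context, it helps to compare it with the simplest possible baseline: always predicting the most common rating. The sketch below computes that baseline from the support column of the classification report above.

```python
# Support column from the classification report: test reviews per star rating
support = {1: 1403, 2: 730, 3: 1175, 4: 2179, 5: 4513}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a "model" that predicts the majority class for every review
baseline_accuracy = support[majority_class] / total

print(majority_class, baseline_accuracy)
```

Always answering 5 stars would score about 0.45, so the naive Bayes model at 0.4619 is barely ahead of guessing the majority class.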

## Part 5: Improving the Model

There are a few different ways to improve our model. We will go through a couple of simple methods, and then discuss a few more in-depth techniques for improvement.

The first method (and the easiest for us to implement) is to change the classification algorithm. So far we have used multinomial naive Bayes (`MultinomialNB`). Sklearn provides many other machine learning algorithms. Let's try a few and see which gives us the best model.

In [9]:
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Uncomment one line at a time; each comment shows the test accuracy from the author's run
#train_model_pipe(MultinomialNB())           # 0.4619
#train_model_pipe(SGDClassifier())           # 0.6345
#train_model_pipe(SVC(kernel='linear'))      # 0.6391
train_model_pipe(DecisionTreeClassifier())   # 0.46

test_model()

Accuracy:  0.46
              precision    recall  f1-score   support

           1       0.46      0.50      0.48      1403
           2       0.16      0.15      0.15       730
           3       0.22      0.19      0.20      1175
           4       0.29      0.28      0.29      2179
           5       0.63      0.66      0.64      4513

    accuracy                           0.46     10000
   macro avg       0.35      0.35      0.35     10000
weighted avg       0.45      0.46      0.45     10000



0.46

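Other ways to improve the model:

* Tune hyperparameters with a grid search, rather than relying on each algorithm's defaults
* Experiment with the size of the training dataset (i.e. `data_set_size`)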
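As a sketch of what a grid search might look like, sklearn's `GridSearchCV` can tune the steps of a pipeline by addressing parameters as `<step name>__<parameter>`. The toy data and the particular values in `param_grid` are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the cleaned training reviews
text = ['great food', 'great service', 'great place', 'great tacos',
        'terrible food', 'terrible service', 'terrible place', 'terrible tacos']
stars = [5, 5, 5, 5, 1, 1, 1, 1]

pipe = Pipeline([
    ('count_vect', CountVectorizer()),
    ('transformer', TfidfTransformer()),
    ('algorithm', MultinomialNB()),
])

# Keys are '<step name>__<parameter>'; the values tried here are illustrative guesses
param_grid = {
    'count_vect__ngram_range': [(1, 1), (1, 2)],
    'algorithm__alpha': [0.1, 1.0],
}

# cv=2 only because the toy data is tiny; 5 folds is a more usual choice
search = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
search.fit(text, stars)

print(search.best_params_, search.best_score_)
```

Because the search uses cross-validation on the training data, the test set stays untouched until the final evaluation.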

## Part 6: Applying the Model to a Real Task

With a trained model saved to disk, the remaining steps are to use it the way a real application would:

* Time how long it takes to make a prediction on average, and judge whether that is fast enough
* Get the rating for a restaurant
* Write a sample program that uses the model to predict the rating of a restaurant based on its reviews
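A minimal sketch of such a program follows. It assumes that a restaurant's rating can be approximated by averaging the predicted stars over its reviews. The helper `predict_restaurant_rating` is a hypothetical name, and a toy model is trained inline so the example runs on its own; in practice you would load the pickled model from `model_file` and clean each review the same way as the training data.

```python
import time
from statistics import mean

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the pickled model trained earlier in the tutorial
model = Pipeline([
    ('count_vect', CountVectorizer()),
    ('transformer', TfidfTransformer()),
    ('algorithm', MultinomialNB()),
])
model.fit(['great food', 'great service', 'terrible food', 'terrible service'],
          [5, 5, 1, 1])

# Hypothetical helper: average the predicted stars over a restaurant's reviews
def predict_restaurant_rating(model, review_texts):
    predictions = model.predict(review_texts)
    return float(mean(int(p) for p in predictions))

# Made-up reviews for a single restaurant
reviews = ['great tacos', 'great place', 'terrible wait', 'great staff']

# Time the whole batch and divide to get the average cost per review
start = time.perf_counter()
rating = predict_restaurant_rating(model, reviews)
seconds_per_review = (time.perf_counter() - start) / len(reviews)

print(rating, seconds_per_review)
```

Predicting all of a restaurant's reviews in one `predict` call is generally faster than calling it once per review, since the vectorizing work is batched.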