# Bag of Words => Random Forest Model Training
Borrowed from https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
You will need to download two datasets, named "labeledTrainData.tsv" and "testData.tsv".  They can be found at: https://www.kaggle.com/c/word2vec-nlp-tutorial/data
An accompanying code file can be found at https://github.com/wendykan/DeepLearningMovies/blob/master/BagOfWords.py

### Read the data file, and take a minute to examine the dataset

In [1]:
# Import the packages used throughout this notebook;
# pandas will read the labeled training data
import os
import pandas as pd
import numpy as np

In [2]:
# Set the working directory to the folder where your data is stored
os.chdir(r"C:\Users\lickfett-john\Documents\Python Scripts")  # raw string, so backslashes are not treated as escapes
# See what files are in the folder.
os.listdir()

['.ipynb_checkpoints',
 'all_training.csv',
 'all_training.xlsx',
 'archive',
 'from the scrapy projects folder',
 'test results files',
 'test.xlsx',
 'TIL_Webscrape_FINAL.ipynb',
 'train.xlsx',
 'url_extract_crawler',
 'validate.xlsx']

In [3]:
# Read the training data
train = pd.read_excel("train.xlsx", header=0)

#   "header=0" indicates that the first row of the sheet contains column names.
#   "delimiter" and "quoting" are read_csv options, not documented read_excel parameters,
#   so they are dropped here. For the original tab-separated Kaggle files you would use:
#   pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [4]:
## This is a step you can use if you don't have your training data already broken apart into a training and test dataset.

#from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed in scikit-learn 0.20
#train, test = train_test_split( data, train_size = 0.8, random_state = 44 )

In [5]:
## Alternative to the cell above: split the row positions, then select rows by position.

#all_i = np.arange( len( data ))
#train_i, test_i = train_test_split( all_i, train_size = 0.8, random_state = 44 )

#train = data.iloc[train_i]   # .ix was removed from pandas; .iloc selects by position
#test = data.iloc[test_i]
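
The index-splitting idea in the commented-out cells above can be sketched without scikit-learn. This is only an illustration under assumptions: `split_indices` is a hypothetical helper, not part of any library, and its shuffle will not reproduce the exact rows that `train_test_split` picks for the same seed.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=44):
    """Shuffle row positions and cut them into train and test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)   # seeded, so the split is repeatable
    cut = int(n_rows * train_fraction)
    return positions[:cut], positions[cut:]

# Hypothetical dataset of 10 rows
train_i, test_i = split_indices(10)
print(len(train_i), len(test_i))
```

Every row position ends up in exactly one of the two lists, which is the property that keeps the test rows out of training.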

In [6]:
# check the size of the data (rows, columns)
train.shape
# output should be (# rows, # columns)

(127, 6)

In [7]:
# check the column labels
train.columns.values
# output should be an array of the column names in the file

array(['id', 'Provider Type', 'root url', 'scraped url', 'Visible Text',
       'page_MM'], dtype=object)

In [8]:
# Look at the first five values of the CMS MM data.
# MMs are coded as 1 or 0, for MM found or MM not found.
train["page_MM"][:5]

0    0
1    0
2    1
3    0
4    1
Name: page_MM, dtype: int64

In [9]:
# take a look at the text in the first CMS MM
train["Visible Text"][0]

'Skip to main content Home Careers About Us Every Day Giving Excellence Search Menu For Patients About Your Surgery Your Rights and Responsibilities Your Privacy Financial Information Information in Espanol Nondiscrimination Notice Staff & Physicians About Us Employment Opportunities Contact Us Print Email Home \r\r\n$name Address and Directions \r\r\n        600 Sioux Point Road  \r\r\n        Dakota Dunes, South Dakota, 57049\r\r\n         Phone: 605-232-3332 Fax:  Hours of Operation: \r\r\n        Monday-Friday \r\r\n        6:00am-5:00pm\r\r\n     Driving Directions Header Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Fusce nulla ligula, sollicitudin eu, adipiscing tincidunt, bibendum et, augue. Phasellus eget lacus sed lacus sagittis facilisis. Morbi id est sed mi facilisis venenatis. Nulla ut ante sit amet risus tempor vestibulum. Donec laoreet. Mauris convallis adipiscing arcu. Learn More About Your Surgery Your Rights and Responsibilities Your Privacy Financial Info

In [10]:
# There are HTML tags such as "<br/>", abbreviations, punctuation - all common issues when processing text from online. 
# First, we'll remove the HTML tags. For this purpose, we'll use the Beautiful Soup library.


# Import BeautifulSoup into your workspace
import bs4
from bs4 import BeautifulSoup             

# Initialize the BeautifulSoup object on a single CMS MM     
example1 = BeautifulSoup(train["Visible Text"][0],"lxml")

# Print the raw CMS MM and then the output of get_text(), for 
# comparison
# train["Visible Text"][0]

In [11]:
# compare the raw CMS MM above with the get_text output below
# example1.get_text()
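
To make concrete what `get_text()` does, here is a rough standard-library sketch of tag stripping. It is not how Beautiful Soup is implemented; `TextExtractor` and `strip_tags` are hypothetical names, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text found between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()          # flush any text still buffered by the parser
    return "".join(parser.chunks)

sample_html = "<p>Hours of Operation:<br/>Monday-Friday</p>"
plain = strip_tags(sample_html)
print(plain)
```

Note that the text on either side of `<br/>` is joined with no space, which is one reason the cleaning step later replaces unwanted characters with spaces rather than deleting them.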

### Clean the Text

When considering how to clean the text, we should think about the data problem we are trying to solve. 
For many problems, it makes sense to remove punctuation. 
In this tutorial, for simplicity, we remove most punctuation (the pattern used here keeps only "&" and "-"), but it is something you can play with on your own.
Similarly, in this tutorial we will remove most numbers (the pattern keeps only the digits 4 and 5), but there are other ways of dealing with them that make just as much sense. 
For example, we could treat them as words, or replace them all with a placeholder string such as "NUM".
To remove punctuation and numbers, we will use a package for dealing with regular expressions, called re. 
The package comes built-in with Python; no need to install anything. 
For a detailed description of how regular expressions work, see the package documentation. Now, try the following:

In [12]:
import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z4-5&-]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text() )  # The text to search

# print the output of letters_only
# letters_only

# A full overview of regular expressions is beyond the scope of this tutorial, 
# but for now it is sufficient to know that [] indicates group membership and ^ means "not". 
# In other words, the re.sub() statement above says,
# "Find anything that is NOT a letter (a-z, A-Z), the digit 4 or 5, an ampersand (&),
# or a hyphen (-), and replace it with a space."
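
A small comparison may help show what the character class keeps. The sample string is made up; the second pattern is the one used in this notebook, and the first is a stricter letters-only variant shown only for contrast.

```python
import re

sample = "Hours: Monday-Friday 6:00am-5:00pm, Staff & Physicians, Fax 605-232-3332"

# Stricter variant: keep letters only
strict = re.sub(r"[^a-zA-Z]", " ", sample)

# Pattern used in this notebook: also keeps 4, 5, "&" and "-"
lenient = re.sub(r"[^a-zA-Z4-5&-]", " ", sample)

print(strict.split())
print(lenient.split())
```

Because each removed character becomes a single space, both results have the same length as the input; `split()` then collapses the runs of spaces.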

In [13]:
# We'll also convert our CMS MMs to lower case and split them into individual words.
# This is called "tokenization" in Natural Language Processing (NLP) lingo.

lower_case = letters_only.lower()        # Convert to lower case
#words = lower_case.split()               # Split into words
letters_only

'Skip to main content Home Careers About Us Every Day Giving Excellence Search Menu For Patients About Your Surgery Your Rights and Responsibilities Your Privacy Financial Information Information in Espanol Nondiscrimination Notice Staff & Physicians About Us Employment Opportunities Contact Us Print Email Home     name Address and Directions                Sioux Point Road             Dakota Dunes  South Dakota  5  4             Phone    5-   -     Fax   Hours of Operation             Monday-Friday                am-5   pm        Driving Directions Header Lorem ipsum dolor sit amet  consectetuer adipiscing elit  Fusce nulla ligula  sollicitudin eu  adipiscing tincidunt  bibendum et  augue  Phasellus eget lacus sed lacus sagittis facilisis  Morbi id est sed mi facilisis venenatis  Nulla ut ante sit amet risus tempor vestibulum  Donec laoreet  Mauris convallis adipiscing arcu  Learn More About Your Surgery Your Rights and Responsibilities Your Privacy Financial Information Information i

In [14]:
# Finally, we need to decide how to deal with frequently occurring words that don't carry much meaning.
# Such words are called "stop words"; in English they include words such as "a", "and", "is", and "the".
# Conveniently, there are Python packages that come with stop word lists built in. 
# Let's import a stop word list from the Python Natural Language Toolkit (NLTK). 
# You'll need to install the library if you don't already have it on your computer;
# you'll also need to download the data packages that come with it, as follows:
import nltk
#nltk.download()  # Download text data sets, including stop words

In [15]:
# Now we can use NLTK to get a list of stopwords

from nltk.corpus import stopwords # Import the stop word list

# Show the stop words
# stopwords.words("english")

In [16]:
# This will allow you to view the list of English-language stop words. To remove stop words from our CMS MM, do:

# Split the lower-cased text from the earlier cell into a list of words
words = lower_case.split()

# Remove stop words from "words"
words = [w for w in words if w not in stopwords.words("english")]

# words

# This looks at each word in our "words" list, and discards anything that is found in the list of stop words.
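
Stop word removal can be tried without downloading the NLTK data. This is a minimal sketch: the `stops` set below is a hypothetical mini list, far shorter than NLTK's English list, so the result will differ from what the notebook produces.

```python
# Hypothetical mini stop list; NLTK's English list is much longer
stops = {"a", "an", "and", "is", "the", "to", "of", "your"}

words = "skip to main content about your surgery and the staff".split()

# Keep only the words that are not in the stop list
meaningful = [w for w in words if w not in stops]
print(meaningful)
```

Using a set rather than a list for `stops` keeps each membership check fast, which matters once the same filter runs over every document.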

### Loop the cleaning process through all of the text

Now we have code to clean one CMS MM - but we need to clean hundreds of training MMs! 
To make our code reusable, let's create a function that can be called many times:

In [18]:
def MM_to_words( raw_MM ):
    # Function to convert a raw CMS MM to a string of words
    # The input is a single string (a raw MM text), and 
    # the output is a single string (a preprocessed MM text)
    #
    # 1. Remove HTML
    MM_text = BeautifulSoup(raw_MM,"lxml").get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z4-5&-]", " ", MM_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 
    # 5. Remove stop words
    #meaningful_words = [w for w in words]
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space, 
    #    to make the output easier to use in our Bag of Words,
    #    then return the result. 
    return( " ".join( meaningful_words ))

In [19]:
# Call the function above to test it on a single MM text

clean_MM = MM_to_words( train["Visible Text"][0] )

clean_MM

'skip main content home careers us every day giving excellence search menu patients surgery rights responsibilities privacy financial information information espanol nondiscrimination notice staff & physicians us employment opportunities contact us print email home name address directions sioux point road dakota dunes south dakota 5 4 phone 5- - fax hours operation monday-friday am-5 pm driving directions header lorem ipsum dolor sit amet consectetuer adipiscing elit fusce nulla ligula sollicitudin eu adipiscing tincidunt bibendum et augue phasellus eget lacus sed lacus sagittis facilisis morbi id est sed mi facilisis venenatis nulla ut ante sit amet risus tempor vestibulum donec laoreet mauris convallis adipiscing arcu learn surgery rights responsibilities privacy financial information information espanol nondiscrimination notice affiliate united surgical partners international partnered local physicians accredited accreditation association hospitals health systems list owners please 

In [20]:
# Now let's loop through and clean all of the training set at once

# Get the number of MMs based on the dataframe column size
num_MM = train["Visible Text"].size

# Initialize an empty list to hold the clean MMs
clean_train_MM = []

# Loop over each MM; create an index i that goes from 0 to the length
# of the MM list 
for i in range( 0, num_MM ):
    # Call our function for each one, and add the result to the list of
    # clean MMs
    clean_train_MM.append( MM_to_words( train["Visible Text"][i] ) )

# This next version is optional: it repeats the loop above and prints a status update every 10 rows.
print ("Cleaning and parsing the training set CMS MMs...\n")
clean_train_MM = []
for i in range( 0, num_MM ):
    # If the index is evenly divisible by 10, print a message
    if( (i+1)%10 == 0 ):
        print ("MM %d of %d\n" % ( i+1, num_MM ))                                                                    
    clean_train_MM.append( MM_to_words( train["Visible Text"][i] ))

Cleaning and parsing the training set CMS MMs...

MM 10 of 127

MM 20 of 127

MM 30 of 127

MM 40 of 127

MM 50 of 127

MM 60 of 127

MM 70 of 127

MM 80 of 127

MM 90 of 127

MM 100 of 127

MM 110 of 127

MM 120 of 127
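
The loop-with-progress pattern above can be wrapped in a small helper. This is a sketch, not the notebook's code: `clean_all` is a hypothetical name, and the lambda stands in for `MM_to_words`. Iterating over the texts directly also avoids the label lookup `train["Visible Text"][i]`, which fails when the DataFrame index is not 0..n-1 (for example after a random split).

```python
def clean_all(texts, cleaner, report_every=10):
    """Apply cleaner to every text, printing a status line every report_every items."""
    cleaned = []
    total = len(texts)
    for i, text in enumerate(texts, start=1):
        if i % report_every == 0:
            print("MM %d of %d" % (i, total))
        cleaned.append(cleaner(text))
    return cleaned

# Stand-in cleaner: lower-case and collapse whitespace
demo = clean_all(["Home  Page", "About US", "Contact"],
                 lambda t: " ".join(t.lower().split()),
                 report_every=2)
print(demo)
```

In the notebook this would be called as `clean_all(list(train["Visible Text"]), MM_to_words)`.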



In [63]:
## If you want to see the output in a csv file, follow the steps below.
# Copy the results to a pandas dataframe.
# output = pd.DataFrame( data={"id":train["id"], "Visible Text":clean_train_MM} )

# Use pandas to write the comma-separated output file
# output.to_csv( "clean_model_text.csv", index=False, quoting=3 )

### Create a Bag of Words model
The Bag of Words model learns a vocabulary from all of the documents (CMS MMs), 
then models each document by counting the number of times each word appears.
We'll be using the feature_extraction module from scikit-learn to create bag-of-words features.
In the CMS MM data, the scraped pages contain many distinct words, which gives us a large vocabulary. 
To limit the size of the feature vectors, we should choose some maximum vocabulary size. 
Below, we keep at most the 10,000 most frequent words (remembering that stop words have already been removed).

In [21]:
print ("Creating the bag of words...\n")

import sklearn
# If you are using Anaconda, scikit-learn should already be installed. Otherwise you will have to install it.
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 10000) # This limits to the 10,000 most frequent words

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_features = vectorizer.fit_transform(clean_train_MM)

# Numpy arrays are easy to work with, so convert the result to an
# array
train_data_features = train_data_features.toarray()


# NOTE: CountVectorizer comes with its own options to automatically do preprocessing,
# tokenization, and stop word removal -- for each of these, instead of specifying "None",
# we could have used a built-in method or specified our own function to use.
# However, we wanted to write our own function for data cleaning in this tutorial 
# to show you how it's done step by step.

Creating the bag of words...



In [22]:
# To see what the training data array now looks like, do:
train_data_features.shape
# The output should read (# rows, # of text features)
# Each text feature is the count of one vocabulary word.

(127, 7384)
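
To make concrete what `fit_transform` computes, here is a dependency-free sketch of the bag-of-words idea. It is a simplification, not scikit-learn's implementation: `fit_vocabulary` and `to_counts` are hypothetical helpers, tokens are split on whitespace only (the real class tokenizes with a regular expression that, by default, keeps tokens of two or more alphanumeric characters), and ties at the `max_features` cutoff may be resolved differently.

```python
from collections import Counter

def fit_vocabulary(docs, max_features=None):
    """Learn a sorted vocabulary, optionally capped at the most frequent words."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Most frequent first; ties broken alphabetically
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    kept = ranked[:max_features] if max_features else ranked
    return sorted(kept)

def to_counts(doc, vocabulary):
    """Turn one document into a list of counts, one per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

docs = ["accredited surgery center", "surgery hours surgery contact"]
vocabulary = fit_vocabulary(docs, max_features=3)
features = [to_counts(d, vocabulary) for d in docs]
print(vocabulary)
print(features)
```

Each row of `features` has one column per vocabulary word, which is why the shape above is (number of documents, number of vocabulary words).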

In [23]:
# Now that the Bag of Words model is trained, let's look at the vocabulary.
# Older scikit-learn releases name this method get_feature_names()
vocab = vectorizer.get_feature_names_out().tolist()

vocab

['44',
 '444',
 '445',
 '4455',
 '44th',
 '45',
 '454',
 '455',
 '4th',
 '54',
 '544',
 '545',
 '55',
 '555',
 '5a',
 '5th',
 'aahrpp',
 'aamc',
 'abbeville',
 'abbey',
 'abdomen',
 'abilities',
 'ability',
 'abim',
 'abington',
 'able',
 'abos',
 'abpts',
 'abraham',
 'absences',
 'abundant',
 'abuse',
 'academic',
 'academy',
 'accentuates',
 'accept',
 'accepted',
 'accepting',
 'accepts',
 'access',
 'accessibility',
 'accessible',
 'accessory',
 'acclaimed',
 'accolades',
 'accommodate',
 'accommodates',
 'accommodations',
 'accompanied',
 'accomplished',
 'accomplishments',
 'accordance',
 'according',
 'accordingly',
 'account',
 'accountability',
 'accounting',
 'accounts',
 'accreditation',
 'accreditations',
 'accredited',
 'accrediting',
 'accreditor',
 'accredits',
 'accurate',
 'accusantium',
 'accustomed',
 'ace',
 'achieve',
 'achieved',
 'achievement',
 'achievements',
 'achieves',
 'achieving',
 'acknowledges',
 'acls',
 'acquainted',
 'acquired',
 'acquires',
 'acres'

In [24]:
# Optionally, you can also choose to print the counts of each word in the vocabulary

import numpy as np


# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print (count, tag)

29 44
1 444
6 445
1 4455
1 44th
25 45
1 454
1 455
3 4th
29 54
1 544
3 545
38 55
2 555
1 5a
4 5th
2 aahrpp
3 aamc
3 abbeville
1 abbey
1 abdomen
1 abilities
4 ability
1 abim
1 abington
9 able
1 abos
1 abpts
1 abraham
1 absences
1 abundant
4 abuse
6 academic
2 academy
1 accentuates
7 accept
22 accepted
3 accepting
4 accepts
47 access
7 accessibility
7 accessible
1 accessory
1 acclaimed
5 accolades
2 accommodate
2 accommodates
15 accommodations
2 accompanied
1 accomplished
3 accomplishments
5 accordance
10 according
1 accordingly
6 account
4 accountability
1 accounting
1 accounts
57 accreditation
18 accreditations
32 accredited
2 accrediting
2 accreditor
4 accredits
3 accurate
1 accusantium
1 accustomed
1 ace
8 achieve
14 achieved
14 achievement
6 achievements
1 achieves
4 achieving
1 acknowledges
1 acls
1 acquainted
7 acquired
1 acquires
2 acres
11 across
2 acs
8 act
8 action
1 actions
3 activation
17 active
5 actively
43 activities
11 activity
1 actual
1 actually
1 acupuncture
30 acute
1

### Feed the Bag of Words into a Random Forest Model (supervised learning)
At this point, we have numeric training features from the Bag of Words and the original CMS MM labels for each feature vector, so let's do some supervised learning! 
Here, we'll use the Random Forest classifier.  The Random Forest algorithm is included in scikit-learn (Random Forest uses many tree-based classifiers to make predictions, hence the "forest"). 
Below, we set the number of trees to 10,000 (scikit-learn's default is 100). More trees may (or may not) perform better, but will certainly take longer to run. Likewise, the more features you include for each MM, the longer this will take.

In [27]:
print ("Training the random forest...")
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 10,000 trees
forest = RandomForestClassifier(n_estimators = 10000) 

# Fit the forest to the training set, using the bag of words as 
# features and the MM labels as the response variable

# This may take a few minutes to run
forest = forest.fit( train_data_features, train["page_MM"] )

Training the random forest...
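
Two ideas behind the "forest" can be sketched in plain Python: each tree trains on a bootstrap sample of the rows, and the forest combines the trees' answers. This is a simplified illustration with hypothetical helper names and made-up votes; scikit-learn averages the trees' class probabilities rather than counting hard votes, and it also randomizes the features considered at each split.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one tree's training set."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the label predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                 # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)   # duplicates are expected

tree_votes = [1, 0, 1, 1, 0]           # hypothetical votes from 5 trees
forest_label = majority_vote(tree_votes)
print(sample, forest_label)
```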


### Predict on an out-of-sample dataset
Next you will run the trained Random Forest on a test dataset and save the results to a file.

In [28]:
# Note that when we use the Bag of Words for the test set we only call "transform", 
#  not "fit_transform" as we did for the training set. 
# In machine learning, you shouldn't use the test set to fit your model, 
# otherwise you run the risk of overfitting. 
# For this reason, we keep the test set off-limits until we are ready to make predictions.

# Read the test data
test = pd.read_excel("test.xlsx", header=0)   # "delimiter" and "quoting" are read_csv options

# Look at the column headers.
test.columns.values


array(['id', 'Provider Type', 'root url', 'scraped url', 'Visible Text',
       'page_MM'], dtype=object)

In [29]:
# Verify that there are the expected number of rows and columns
test.shape

(41, 6)

In [30]:
# Create an empty list and append the clean MMs one by one
num_MM = len(test["Visible Text"])
clean_test_MM = [] 

print ("Cleaning and parsing the test set MMs...\n")
for i in range(0,num_MM):
    if( (i+1) % 10 == 0 ):
        print ("CMS MM %d of %d\n" % (i+1, num_MM))
    clean_MM = MM_to_words( test["Visible Text"][i] )
    clean_test_MM.append( clean_MM )

Cleaning and parsing the test set MMs...

CMS MM 10 of 41

CMS MM 20 of 41

CMS MM 30 of 41

CMS MM 40 of 41



In [31]:
# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_MM)
test_data_features = test_data_features.toarray()
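
The reason for calling only `transform` on the test set can be seen in a tiny sketch: the vocabulary is fixed by the training text, so words that appear only in the test text are simply not counted. The vocabulary and document here are made up, and `transform` is a hypothetical stand-in for the vectorizer's method.

```python
from collections import Counter

# Vocabulary learned from the training text only (hypothetical)
train_vocabulary = ["accredited", "center", "surgery"]

def transform(doc, vocabulary):
    """Count only the words that already exist in the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

test_doc = "surgery center parking surgery"
row = transform(test_doc, train_vocabulary)
print(row)
```

The unseen word "parking" is ignored, and the row has the same number of columns as the training features, which is what the trained forest expects.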

In [32]:
# Use the random forest to make CMS MM predictions
result = forest.predict(test_data_features)

In [33]:
from sklearn.metrics import roc_auc_score

# Score the predicted probability of class 1 against the true test labels
p = forest.predict_proba( test_data_features )
auc = roc_auc_score( test["page_MM"], p[:,1] )
auc
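
For intuition about the AUC value, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up labels and scores; `roc_auc` is a hypothetical helper, and its pairwise loop is only practical for small datasets like this one.

```python
def roc_auc(labels, scores):
    """Pairwise estimate of the area under the ROC curve for 0/1 labels."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p_score in positives:
        for n_score in negatives:
            if p_score > n_score:
                wins += 1.0
            elif p_score == n_score:
                wins += 0.5          # ties count as half a win
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]                # true classes (made up)
scores = [0.1, 0.4, 0.35, 0.8]       # predicted probability of class 1 (made up)
example_auc = roc_auc(labels, scores)
print(example_auc)
```

A value of 0.5 means the scores rank positives no better than chance, and 1.0 means every positive outranks every negative.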

In [34]:
from sklearn import metrics

# Compare the true test labels with the forest's predictions
expected_test = np.array(test["page_MM"])
predicted_test = forest.predict(test_data_features)
print(metrics.classification_report(expected_test, predicted_test))
print(metrics.confusion_matrix(expected_test, predicted_test))
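
To read a confusion matrix and the precision and recall in a classification report, it can help to compute them by hand once. This sketch uses made-up labels and a hypothetical `confusion_counts` helper; the layout follows scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
from collections import Counter

def confusion_counts(expected, predicted):
    """2x2 matrix for 0/1 labels: rows are true classes, columns are predictions."""
    pairs = Counter(zip(expected, predicted))
    return [[pairs[(0, 0)], pairs[(0, 1)]],
            [pairs[(1, 0)], pairs[(1, 1)]]]

expected = [0, 0, 1, 1, 1]     # true labels (made up)
predicted = [0, 1, 1, 1, 0]    # model output (made up)

matrix = confusion_counts(expected, predicted)
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)     # of the pages flagged as MM, how many really are
recall = tp / (tp + fn)        # of the real MM pages, how many were flagged
print(matrix, precision, recall)
```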

In [63]:
# Copy the results to a pandas dataframe with an "id" column and
# an "mm_predict" column
output = pd.DataFrame( data={"id":test["id"], "mm_predict":result} )

# Use pandas to write the comma-separated output file
output.to_csv( "results_train-test_100000tree_10000maxfeat_no-numbers_stops-removed.csv", index=False, quoting=3 )

### Congratulations, you have run your first supervised learning model! 
You can try different things and see how your results change. 
For example, you can clean the MMs differently, 
choose a different number of vocabulary words for the Bag of Words representation, 
try Porter Stemming, a different classifier, or any number of other things.

### Now onto Word Vectors