# Deep Text Learning

### Using machine learning techniques (deep learning, classification, and regression), this notebook demonstrates how to predict a blogger's gender and age with high accuracy from his or her blog posts

In [1]:
import os
import graphlab as gl
from bs4 import  BeautifulSoup

[INFO] This non-commercial license of GraphLab Create is assigned to marvin.bertin@gmail.com and will expire on September 15, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-747 - Server binary: /Users/marvinbertin/.graphlab/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1444494833.log
[INFO] GraphLab Server Version: 1.6.1


## Preparing Dataset

In [15]:
BASE_DIR = "/Users/marvinbertin/graphlab_data" # NOTE: Update BASE_DIR to your own directory path
class BlogData2SFrameParser(object):
    #Some constants
    ID = "id"
    GENDER = "gender"
    AGE = "age"
    SIGN = "sign"
    POSTS = "posts"
    DATES = "dates"
    INDUSTRY = "industry"

    def __init__(self, xml_files_dir, sframe_outpath):
        """
        Parse all the blog post XML files in xml_files_dir and insert them into an SFrame object,
        which is then saved to `sframe_outpath`
        :param xml_files_dir: the directory that contains the XML files of the Blog Authorship Corpus
        :param sframe_outpath: the output path where the SFrame is saved.
        """
        self._bloggers_data = []


        for p in os.listdir(xml_files_dir):
            if p.endswith(".xml"):
                # Parse each XML file and convert it to a dict
                self._bloggers_data.append(self.parse_blog_xml_to_dict(os.path.join(xml_files_dir, p)))
        print "Successfully parsed %s blogs" % len(self._bloggers_data)

        # self._bloggers_data is a list of dicts, which loads easily into an SFrame. However, the dicts
        # are loaded into a single column named X1. To create a separate column for each dict key, we use the unpack function.
        self._sf = gl.SFrame(self._bloggers_data).unpack('X1')

        # Now use the rename function to remove the "X1." prefix from the column names, then save the SFrame for later use
        self._sf.rename({c: c.replace("X1.", "") for c in self._sf.column_names()})
        self._sf.save(sframe_outpath)


    def parse_blog_xml_to_dict(self, path):
        """
        Parse the blog posts in the input XML file and return a dict with the blogger's personal information and posts
        :param path: the path of the XML file
        :return: dict with the blogger's personal details and posts
        :rtype: dict
        """
        blogger_dict = {}
        #Extract the blogger personal details from the file name
        blog_id,gender,age,industry, sign = path.split(os.path.sep)[-1].split(".xml")[0].split(".")
        blogger_dict[self.ID] = blog_id
        blogger_dict[self.GENDER] = gender
        blogger_dict[self.AGE] = int(age)
        blogger_dict[self.INDUSTRY] = industry
        blogger_dict[self.SIGN] = sign
        blogger_dict[self.POSTS] = []
        blogger_dict[self.DATES] = []

        # The XML files are not well formed, so we need some workarounds.
        with open(path, "r") as f:
            s = f.read().replace("&nbsp;", " ")

        # First, strip the <Blog> and </Blog> tags at the beginning and end of the document
        s = s.replace("<Blog>", "").replace("</Blog>", "").strip()

        # Now, split the document into individual blog posts by the <date> tag
        for e in s.split("<date>")[1:]:
            # Separate the date stamp from the rest of the post
            date_and_post = e.split("</date>")
            blogger_dict[self.DATES].append(date_and_post[0].strip())
            post = date_and_post[1].replace("<post>","").replace("</post>","").strip()
            post = BeautifulSoup(post).get_text()
            blogger_dict[self.POSTS].append(post)


        if len(blogger_dict[self.DATES]) != len(blogger_dict[self.POSTS]):
            raise Exception("Warning: Mismatch between the number of posts and the number of dates in file %s" % path)

        return blogger_dict

    @property
    def sframe(self):
        return self._sf

sframe_save_path = "%s/blogs.sframe" % BASE_DIR
b = BlogData2SFrameParser("%s/blogs" % BASE_DIR, sframe_save_path)
sf = b.sframe

  '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)
  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)

Successfully parsed 19320 blogs
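The parser above reads each blogger's metadata from the file name rather than from the XML body. The sketch below isolates that step. It assumes the `<id>.<gender>.<age>.<industry>.<sign>.xml` naming pattern implied by the `split(".")` call; the helper name `parse_blog_filename` and the `/tmp/blogs` directory are hypothetical, and the example file name is assembled from the first row shown later by `sf.head()`.

```python
import os


def parse_blog_filename(path):
    """Split '<id>.<gender>.<age>.<industry>.<sign>.xml' into a dict.

    Hypothetical helper: it mirrors the unpacking done in
    parse_blog_xml_to_dict, but touches only the file name.
    """
    name = os.path.basename(path)
    # Drop the extension before splitting on the remaining dots
    stem = name[:-len(".xml")] if name.endswith(".xml") else name
    blog_id, gender, age, industry, sign = stem.split(".")
    return {
        "id": blog_id,
        "gender": gender,
        "age": int(age),  # age is the only numeric field
        "industry": industry,
        "sign": sign,
    }


# The directory is an assumption; only the file name matters here
example = parse_blog_filename("/tmp/blogs/1000331.female.37.indUnk.Leo.xml")
print(example)
```

A file name with a different number of dot-separated fields raises a `ValueError` at the unpacking line, which is also how the original parser would fail on an unexpected file.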




In [16]:
os.mkdir("%s/txt" % BASE_DIR)
sf.apply(lambda r: file("%s/txt/%s.txt" % (BASE_DIR, r["id"]),"w").write("\n".join(r['posts']))).__materialize__()

In [17]:
gl.canvas.set_target('ipynb')
sf.show()

## Training Word2Vec Model

### Warning: this step can take a while to train

In [41]:
import os
import gensim
import re
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

In [43]:
BASE_DIR = "/Users/marvinbertin/graphlab_data" # NOTE: Update BASE_DIR to your own directory path
class TrainSentences(object):
    """
    Iterator class that yields sentences from the text files in an input directory
    """
    RE_WIHTE_SPACES = re.compile(r"\s+")
    STOP_WORDS = set(stopwords.words("english"))
    def __init__(self, dirname):
        """
        Initialize a TrainSentences object with an input directory that contains text files for training
        :param dirname: name of the directory that contains the text files
        """
        self.dirname = dirname

    def __iter__(self):
        """
        Sentence iterator that yields sentences parsed from the files in the input directory.
        Each sentence is returned as a list of words
        """
        # First, iterate over all files in the input directory
        for fname in os.listdir(self.dirname):
            # Read the file line by line (without loading the entire file into memory)
            for line in file(os.path.join(self.dirname, fname), "rb"):
                # Split the line into sentences using NLTK
                for s in txt2sentences(line, is_html=True):
                    # Split the sentence into words using a regex
                    w = txt2words(s, lower=True, is_html=False, remove_stop_words=False,
                                  remove_none_english_chars=True)
                    # Skip short sentences with fewer than 3 words
                    if len(w) < 3:
                        continue
                    yield w

def txt2sentences(txt, is_html=False, remove_none_english_chars=True):
    """
    Split the English text into sentences using NLTK
    :param txt: input text.
    :param is_html: if True, then remove HTML tags using BeautifulSoup
    :param remove_none_english_chars: if True, then remove non-English characters from the text
    :return: generator that yields the sentences of the input text one at a time
    :rtype: generator of str
    """
    if is_html:
        txt = BeautifulSoup(txt).get_text()
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    # Split the text into sentences using the NLTK tokenizer
    for s in tokenizer.tokenize(txt):
        if remove_none_english_chars:
            # Replace non-English characters with spaces
            s = re.sub("[^a-zA-Z]", " ", s)
        yield s
    

In [39]:
def txt2words(txt, lower=True, is_html=False, remove_none_english_chars=True, remove_stop_words=True):
    """
    Split text into a list of words
    :param txt: the input text
    :param lower: if True, then convert the text to lowercase
    :param is_html: if True, then remove HTML tags using BeautifulSoup
    :param remove_none_english_chars: if True, then remove non-English characters from the text
    :param remove_stop_words: if True, then remove stop words from the text
    :return: list of words created from the input text according to the input parameters.
    :rtype: list
    """
    if is_html:
        txt = BeautifulSoup(txt).get_text()
    if lower:
        txt = txt.lower()
    if remove_none_english_chars:
        txt = re.sub("[^a-zA-Z]", " ", txt)

    words = TrainSentences.RE_WIHTE_SPACES.split(txt.strip().lower())
    if remove_stop_words:
        #remove stop words from text
        words = [w for w in words if w not in TrainSentences.STOP_WORDS]
    return words
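To make the effect of `txt2words` concrete, here is a minimal standard-library sketch of the same cleaning steps (lowercase, replace non-letters with spaces, split on whitespace, drop stop words). The function name `simple_txt2words` and the three-word stop list are assumptions for illustration only; the notebook itself uses NLTK's English stop-word list and BeautifulSoup for HTML stripping, neither of which is reproduced here.

```python
import re

WHITE_SPACES = re.compile(r"\s+")
# Tiny hypothetical stop-word list; the notebook uses NLTK's English list
TOY_STOP_WORDS = {"it", "the", "a"}


def simple_txt2words(txt, remove_stop_words=True):
    """Lowercase the text, keep only letters, and split it into words."""
    # Every character outside a-z becomes a space, so "it's" -> "it s"
    txt = re.sub("[^a-zA-Z]", " ", txt.lower())
    # The `if w` filter is an addition: it drops the empty string that
    # re.split returns for empty input
    words = [w for w in WHITE_SPACES.split(txt.strip()) if w]
    if remove_stop_words:
        words = [w for w in words if w not in TOY_STOP_WORDS]
    return words


tokens = simple_txt2words("Hello, World! It's 2004.")
print(tokens)
```

Note that digits and punctuation are discarded entirely, and contractions are split into fragments such as a lone `s`.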

In [None]:
sentences = TrainSentences("%s/txt" % BASE_DIR)
model = gensim.models.Word2Vec(sentences, size=300, workers=8, min_count=40)
model.save("%s/blog_posts_300_c_40.word2vec" % BASE_DIR)
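In the call above, `size=300` sets the dimensionality of the word vectors and `min_count=40` tells gensim to ignore words whose total frequency is below 40 (newer gensim releases name the first parameter `vector_size`). The sketch below illustrates only the frequency-threshold idea with the standard library; `prune_vocabulary` and the toy sentences are hypothetical and are not gensim code.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count):
    """Return {word: frequency} for words seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {w: c for w, c in counts.items() if c >= min_count}


# Toy corpus: each sentence is a list of words, as TrainSentences yields them
toy_sentences = [
    ["lol", "that", "was", "fun"],
    ["lol", "haha", "fun"],
    ["that", "was", "odd"],
]
vocab = prune_vocabulary(toy_sentences, min_count=2)
print(sorted(vocab))
```

With a threshold of 2, the words that occur only once in the toy corpus are excluded, so they would receive no vector at all.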

In [20]:
import gensim

In [21]:
BASE_DIR = "/Users/marvinbertin/graphlab_data/models" # NOTE: Update BASE_DIR to your own directory path
model_download_path = "%s/blog_posts_300_c_40.word2vec" % BASE_DIR
model = gensim.models.Word2Vec.load(model_download_path)

In [22]:
model.most_similar("lol")

[(u'haha', 0.7543795108795166),
 (u'lildevil', 0.7210554480552673),
 (u'dynamitedg', 0.7055381536483765),
 (u'hahaha', 0.704667329788208),
 (u'hellokittylovzme', 0.7038414478302002),
 (u'yea', 0.6890978813171387),
 (u'shevy', 0.6800159215927124),
 (u'fabityfabfab', 0.6786916851997375),
 (u'djkthegr', 0.6695927381515503),
 (u'sehne', 0.6663240790367126)]

In [25]:
model.most_similar_cosmul(positive=['young', 'black'], negative=['old'])

[(u'white', 0.9320946335792542),
 (u'supremacist', 0.9221380352973938),
 (u'caucasian', 0.8852033615112305),
 (u'bodied', 0.8755568861961365),
 (u'burgandy', 0.8735051155090332),
 (u'stripes', 0.87107253074646),
 (u'sequined', 0.8671029806137085),
 (u'bearded', 0.8589045405387878),
 (u'hispanic', 0.8537147641181946),
 (u'tights', 0.8529753088951111)]

In [27]:
model.most_similar("fantastic")

[(u'fabulous', 0.696677565574646),
 (u'terrific', 0.6917413473129272),
 (u'great', 0.6832793951034546),
 (u'wonderful', 0.6618353128433228),
 (u'superb', 0.6394547820091248),
 (u'brilliant', 0.6001139879226685),
 (u'marvelous', 0.5984939336776733),
 (u'amazing', 0.5513710379600525),
 (u'lovely', 0.5503456592559814),
 (u'delightful', 0.5462784767150879)]
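The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of that ranking, here is a standard-library sketch. The two-dimensional `toy_vectors` are invented for the example (the trained model uses 300 dimensions), and this `most_similar` function is a simplified stand-in, not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented 2-dimensional vectors, for illustration only
toy_vectors = {
    "lol": [1.0, 0.1],
    "haha": [0.9, 0.2],
    "taxes": [-0.2, 1.0],
}
ranking = most_similar("lol", toy_vectors)
print(ranking)
```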

In [28]:
print sf.num_rows()

19320


## Creating & Evaluating Classifiers

### Feature Engineering

In [29]:
sf.head()

age,dates,gender,id,industry,posts,sign
37,"[31,May,2004, 29,May,2004, 28,May,2 ...",female,1000331,indUnk,"[Well, everyone got up and going this morning. ...",Leo
17,"[23,November,2002, 20,November,2002, ...",female,1000866,Student,"[Yeah, sorry for not writing for a whole ...",Libra
23,"[19,June,2004, 19,June,2004, ...",male,1004904,Arts,"[cupid,please hear my cry, cupid, please let ...",Capricorn
25,"[31,May,2004, 30,May,2004, 30,May,2 ...",female,1005076,Arts,[and did i mention that i no longer have to deal ...,Cancer
25,"[05,July,2003, 04,July,2003, ...",male,1005545,Engineering,[B-Logs: The Business Blogs Paradox urlLink ...,Sagittarius
48,"[12,July,2003, 07,July,2003, ...",male,1007188,Religion,[1/03 DrKioni.com Awarded ByRegion.net Healers ...,Libra
26,"[23,July,2003, 14,July,2003, ...",female,100812,Architecture,[Friday My dear wife was walking on her ...,Aries
16,"[23,November,2002, 21,November,2002, ...",female,1008329,Student,"[Sorry, but I gotta..I couldn't remember the ...",Pisces
25,"[30,July,2004, 22,July,2004, ...",male,1009572,indUnk,"[Planning the Marathon I checked Active.com, ...",Cancer
27,"[28,April,2004, 26,April,2004, ...",female,1011153,Technology,[The astute among you will note that this run ...,Virgo


In [34]:
# first we join the posts list to a single string
sf["posts"] = sf['posts'].apply(lambda post: "\n".join(post))

In [36]:
# Construct Bag-of-Words features (1-gram and 2-gram counts) for each blogger
sf['1gram features'] = gl.text_analytics.count_ngrams(sf['posts'], 1)
sf['2gram features'] = gl.text_analytics.count_ngrams(sf['posts'], 2)
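Each n-gram feature cell is a dictionary that maps an n-gram to the number of times it occurs in the blogger's posts. The following standard-library sketch approximates that idea. The helper `count_word_ngrams` and its simple tokenizer are assumptions and will not reproduce GraphLab's tokenization exactly.

```python
import re
from collections import Counter


def count_word_ngrams(text, n):
    """Count word n-grams in `text` and return them as a plain dict."""
    # Simplistic tokenizer (an assumption): lowercase letters, digits, apostrophes
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return dict(Counter(grams))


unigrams = count_word_ngrams("The cat sat on the mat", 1)
bigrams = count_word_ngrams("The cat sat on the mat", 2)
print(unigrams)
print(bigrams)
```

The number of distinct keys grows quickly with `n`, which is why the combined 1-gram and 2-gram feature space in this notebook is so much larger than the 1-gram space alone.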

In [37]:
import graphlab as gl
import numpy as np
import gensim

class DeepTextAnalyzer(object):
    def __init__(self, word2vec_model):
        """
        Construct a DeepTextAnalyzer using the input Word2Vec model
        :param word2vec_model: a trained Word2Vec model
        """
        self._model = word2vec_model

    def txt2vectors(self,txt, is_html):
        """
        Convert input text into an iterator that returns the corresponding vector representation of each
        word in the text, if it exists in the Word2Vec model
        :param txt: input text
        :param is_html: if True, then extract the text from the input HTML
        :return: iterator of vectors created from the words in the text using the Word2Vec model.
        """
        words = txt2words(txt,is_html=is_html, lower=True, remove_none_english_chars=True)
        words = [w for w in words if w in self._model]
        if len(words) != 0:
            for w in words:
                yield self._model[w]


    def txt2avg_vector(self, txt, is_html):
        """
        Calculate the average vector representation of the input text
        :param txt: input text
        :param is_html: if True, then extract the text from the input HTML
        :return: the average of the vector representations of the words in the text, or None if no word is in the model
        """
        vectors = self.txt2vectors(txt,is_html=is_html)
        vectors_sum = next(vectors, None)
        if vectors_sum is None:
            return None
        count = 1.0
        for v in vectors:
            count += 1
            vectors_sum = np.add(vectors_sum, v)

        # Calculate the average vector and replace NaN, +inf, and -inf with finite numeric values
        avg_vector = np.nan_to_num(vectors_sum / count)
        return avg_vector
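The averaging in `txt2avg_vector` can be illustrated without NumPy or a trained model. This sketch keeps a running sum over the words that have a vector and divides by their count. The function `average_vector` and the two-dimensional toy vectors are hypothetical, and the `nan_to_num` cleanup step is omitted.

```python
def average_vector(words, word_vectors):
    """Average the vectors of the words present in `word_vectors`.

    Returns None when no word has a vector, as txt2avg_vector does.
    """
    total = None
    count = 0
    for w in words:
        vec = word_vectors.get(w)
        if vec is None:
            continue  # skip out-of-vocabulary words
        if total is None:
            total = list(vec)
        else:
            total = [t + x for t, x in zip(total, vec)]
        count += 1
    if total is None:
        return None
    return [t / float(count) for t in total]


# Invented 2-dimensional vectors, for illustration only
toy_model = {"good": [1.0, 3.0], "day": [3.0, 5.0]}
avg = average_vector(["good", "unknown", "day"], toy_model)
print(avg)
```

Averaging discards word order and document length, so two bloggers with very different posting volumes still end up with vectors of the same size.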

In [44]:
# Calculate each blogger's average vector
dt = DeepTextAnalyzer(model)
sf['vectors'] = sf['posts'].apply(lambda p: dt.txt2avg_vector(p, is_html=True))
sf['vectors'].head(1)

dtype: array
Rows: 1
[array('d', [0.0062507628463208675, -0.07606526464223862, 0.052258580923080444, 0.008684697560966015, -0.0038692429661750793, 0.02404574677348137, 0.012569146230816841, 0.02063075453042984, 0.015948006883263588, -0.012405338697135448, -0.022391920909285545, -0.022027013823390007, -0.035614486783742905, -0.01066309493035078, -0.03329771012067795, -0.020523464307188988, 0.023844784125685692, -0.013862643390893936, -0.04676587134599686, -0.05669616162776947, -0.004467571619898081, 0.02933463640511036, 0.03274542838335037, 0.010069825686514378, 0.017453908920288086, -0.008361246436834335, 0.01089525781571865, 0.04363299161195755, 0.048218220472335815, -0.0005510354530997574, -0.013837507925927639, -0.0027286745607852936, 0.02849958837032318, 0.0021772056352347136, -0.030761901289224625, -0.014515019953250885, 0.0350775271654129, 0.004702108912169933, -0.03487129136919975, 0.030265316367149353, 0.026896469295024872, -0.005913758650422096, -0.038957491517066956, -0.00260

In [45]:
sf = sf.dropna()
print sf.column_names()

['age', 'dates', 'gender', 'id', 'industry', 'posts', 'sign', '1gram features', '2gram features', 'vectors']


In [46]:
train_set, test_set = sf.random_split(0.8, seed=5)
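A seeded random split makes the train/test partition reproducible across runs. The sketch below shows one simple way to obtain an approximate 80/20 split with the standard library, by assigning each row independently. It is an assumption about the general idea, not a description of how `SFrame.random_split` is implemented.

```python
import random


def random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))
train_rows, test_rows = random_split(rows, 0.8, seed=5)
print(len(train_rows))
print(len(test_rows))
```

Because rows are assigned independently, the first part holds roughly, not exactly, 80% of the rows.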

In [48]:
train_set.head()

age,dates,gender,id,industry,posts,sign
37,"[31,May,2004, 29,May,2004, 28,May,2 ...",female,1000331,indUnk,"Well, everyone got up and going this morning. ...",Leo
17,"[23,November,2002, 20,November,2002, ...",female,1000866,Student,"Yeah, sorry for not writing for a whole ...",Libra
23,"[19,June,2004, 19,June,2004, ...",male,1004904,Arts,"cupid,please hear my cry, cupid, please let your ...",Capricorn
25,"[31,May,2004, 30,May,2004, 30,May,2 ...",female,1005076,Arts,and did i mention that i no longer have to deal ...,Cancer
25,"[05,July,2003, 04,July,2003, ...",male,1005545,Engineering,B-Logs: The Business Blogs Paradox urlLink ...,Sagittarius
48,"[12,July,2003, 07,July,2003, ...",male,1007188,Religion,1/03 DrKioni.com Awarded ByRegion.net Healers ...,Libra
16,"[23,November,2002, 21,November,2002, ...",female,1008329,Student,"Sorry, but I gotta..I couldn't remember the ...",Pisces
25,"[30,July,2004, 22,July,2004, ...",male,1009572,indUnk,"Planning the Marathon I checked Active.com, and ...",Cancer
27,"[28,April,2004, 26,April,2004, ...",female,1011153,Technology,The astute among you will note that this run is a ...,Virgo
25,"[29,November,2002, 28,November,2002, ...",female,1011289,indUnk,MSN conversation: 11.17am Iggbalbollywall ( this ...,Libra

1gram features,2gram features,vectors
"{'raining': 2, 'all': 5, 'infestations': 1, ...","{'know how': 1, 'in case': 1, 'many people': ...","[0.00625076284632, -0.0760652646422, ..."
"{'raining': 5, 'foul': 1, 'barraged': 1, 'woods': ...","{'or black': 3, 'my dinner': 3, 'probably ...","[0.00972067564726, -0.0654575005174, ..."
"{'hats': 2, 'saves': 1, 'four': 1, 'sleep': 2, ...","{'hear those': 1, 'kitties i': 1, 'the ...","[0.0155644528568, -0.0584896802902, ..."
"{'neighbors': 2, 'all': 19, 'forget': 1, ...","{'everyone says': 1, 'still yikes': 1, 'be ...","[0.0111539196223, -0.0412371978164, ..."
"{'limited': 3, 'similarity': 2, 'hats': ...","{'tutorials on': 1, 'say say': 1, 'your passwo ...","[-0.0735046640038, -0.149944871664, ..."
"{'limited': 1, 'all': 3, 'coach': 1, 'global': 1, ...","{'html and': 1, 'mp3motivators keen': 1, ...","[-0.0994467884302, -0.161703407764, ..."
"{'suicidal': 3, 'personally': 2, 'felt': ...","{'dance around': 1, 'above one': 1, 'went ...","[0.0120706614107, -0.0553446821868, ..."
"{'all': 6, 'forget': 1, 'people': 2, 'bombed' ...","{'i try': 1, 'mr smooth': 1, 'people i': 1, 'the ...","[0.0193899124861, -0.083295404911, ..."
"{'all': 6, 'managed': 1, 'chain': 3, 'whoever' ...","{'the second': 1, 'stairs it': 1, 'minutes ...","[0.00792560260743, -0.0958431214094, ..."
"{'raining': 7, 'daftness': 1, 'yellow': ...","{'or black': 1, 'things anyways': 1, 'downtown ...","[-0.00284207914956, -0.0830091014504, ..."


## Predicting Blogger Gender

### 1-gram

In [47]:
cls = gl.classifier.create(train_set, target = 'gender', features=['1gram features'])
baseline_result = cls.evaluate(test_set)
print baseline_result

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14777
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 652659
PROGRESS: Number of coefficients    : 652660
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +---

In [None]:
"""
WARNING very long training! 13 million features
cls2 = gl.classifier.create(train_set, target='gender',features=['2gram features', '1gram features'] )
ngram_result = cls2.evaluate(test_set)
print ngram_result

LogisticClassifier              : 0.758256
SVMClassifier                   : 0.743725
"""

In [49]:
cls3 = gl.classifier.create(train_set, target = 'gender', features=['vectors'])
word2vec_result = cls3.evaluate(test_set)
print word2vec_result

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14778
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Number of coefficients    : 301
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+---

In [51]:
cls4 = gl.classifier.create(train_set, target='gender', features=['vectors', 'industry', 'age'])
word2vec_ind_age_result = cls4.evaluate(test_set)
print word2vec_ind_age_result

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14765
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 3
PROGRESS: Number of unpacked features : 302
PROGRESS: Number of coefficients    : 341
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+---

In [52]:
train_set['posts_length'] = train_set['posts'].apply(lambda p: len(p))
test_set['posts_length'] = test_set['posts'].apply(lambda p: len(p))

cls5 = gl.classifier.create(train_set, target='gender', features=['vectors', 'industry', 'age', 'posts_length'])
word2vec_ind_age_len_result = cls5.evaluate(test_set)
print word2vec_ind_age_len_result

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14781
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 303
PROGRESS: Number of coefficients    : 342
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+---

In [54]:
# Number of posts for the first blogger (each post has one date)
len(train_set['dates'][0])

13

In [55]:
train_set['num_posts'] = train_set['dates'].apply(lambda d: len(d))
test_set['num_posts'] = test_set['dates'].apply(lambda d: len(d))

cls6 = gl.classifier.create(train_set, target='gender', features=['vectors', 'industry', 'age', 'posts_length', 'num_posts'])
word2vec_ind_age_len_num_result = cls6.evaluate(test_set)
print word2vec_ind_age_len_num_result

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14801
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 5
PROGRESS: Number of unpacked features : 304
PROGRESS: Number of coefficients    : 343
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+---

## Predicting Blogger Age

In [57]:
sf['age'].show()

### Regression Model

In [58]:
linear_model = gl.linear_regression.create(train_set, target = 'age', features=['vectors'])
linear_model.evaluate(test_set)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14781
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Number of coefficients    : 301
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 4.714229     | 27.470940          | 27.3

{'max_error': 27.407039865894234, 'rmse': 5.785284936754378}
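The two numbers reported by `evaluate` are the largest absolute error and the root-mean-square error (RMSE) of the predicted ages, both in years. A minimal sketch of how such metrics are computed follows. The helper `regression_metrics` and the three sample ages are made up for illustration.

```python
import math


def regression_metrics(actual, predicted):
    """Return the maximum absolute error and the RMSE of the predictions."""
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / float(len(errors)))
    max_error = max(abs(e) for e in errors)
    return {"max_error": max_error, "rmse": rmse}


# Made-up true and predicted ages for three bloggers
metrics = regression_metrics([17, 25, 37], [20, 24, 33])
print(metrics)
```

RMSE penalizes large misses more heavily than small ones, so a handful of badly predicted bloggers can raise it noticeably.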

In [60]:
boosted_tree_model = gl.boosted_trees_regression.create(train_set, target = 'age', features=['vectors'])
boosted_tree_model.evaluate(test_set)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Boosted trees regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14760
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter          RMSE          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0   1.721e+01   1.654e+01        0.86s
PROGRESS:      1   1.280e+01   1.216e+01        1.71s
PROGRESS:      2   9.886e+00   9.365e+00        2.57s
PROGRESS:      3   8.026e+00   7.656e+00        3.42s
PROGRESS:      4   6.867e+00   6.633e+00        4.27s
PROGRESS:      5   6.149e+00   6.090e+00        5.17s
PROGRESS:      6   5.705e+00   5.831e+00        6.03s
PROGRESS:      7   5.

{'max_error': 33.30920549278245, 'rmse': 6.237615538266947}

### Classification Models

In [62]:
valid_age = range(13,18) + range(23,28) + range(33,43)
sf_age_catego = sf.filter_by(valid_age, 'age')

In [63]:
sf_age_catego.num_rows()

18787

In [64]:
def get_age_category(age):
    if 13 <= age <= 17:
        return "10s"
    elif 23 <= age <= 27:
        return "20s"
    elif 33 <= age <= 42:
        return "30s"
    return None

sf['age_category'] = sf['age'].apply(lambda age: get_age_category(age))
sf_age_categories = sf.dropna()
print sf_age_categories.num_rows()

18787


In [65]:
train_set2, test_set2 = sf_age_categories.random_split(0.8, seed = 5)
cls = gl.classifier.create(train_set2, target = 'age_category', features=['vectors'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: BoostedTreesClassifier, RandomForestClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14306
PROGRESS: Number of classes           : 3
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0   7.263e-01   6.679e-01        1.43s
PROGRESS:      1   7.498e-01   6.916e-01        2.89s
PROGRESS:      2   7.6


In [66]:
age_catego_result = cls.evaluate(test_set2)
print age_catego_result

{'confusion_matrix': Columns:
	target_label	str
	predicted_label	str
	count	int

Rows: 9

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|     20s      |       30s       |   42  |
|     30s      |       10s       |   37  |
|     20s      |       20s       |  1360 |
|     20s      |       10s       |  242  |
|     10s      |       20s       |  321  |
|     30s      |       30s       |   28  |
|     30s      |       20s       |  369  |
|     10s      |       10s       |  1262 |
|     10s      |       30s       |   19  |
+--------------+-----------------+-------+
[9 rows x 3 columns]
, 'accuracy': 0.720108695652174}
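The overall accuracy hides large differences between the age groups. The sketch below recomputes the accuracy from the confusion matrix printed above and adds per-class recall (the share of each true class that was predicted correctly). The counts are copied from that output; the helper `accuracy_and_recall` is hypothetical.

```python
from collections import defaultdict

# (target_label, predicted_label, count), copied from the output above
confusion = [
    ("20s", "30s", 42), ("30s", "10s", 37), ("20s", "20s", 1360),
    ("20s", "10s", 242), ("10s", "20s", 321), ("30s", "30s", 28),
    ("30s", "20s", 369), ("10s", "10s", 1262), ("10s", "30s", 19),
]


def accuracy_and_recall(rows):
    """Compute overall accuracy and per-class recall from confusion counts."""
    total = 0
    correct = 0
    class_total = defaultdict(int)
    class_correct = defaultdict(int)
    for target, predicted, count in rows:
        total += count
        class_total[target] += count
        if target == predicted:
            correct += count
            class_correct[target] += count
    accuracy = correct / float(total)
    recall = {label: class_correct[label] / float(class_total[label])
              for label in class_total}
    return accuracy, recall


accuracy, recall = accuracy_and_recall(confusion)
print(accuracy)
print(recall)
```

Only 28 of the 434 test bloggers in the "30s" group are classified correctly, with most of the rest predicted as "20s", so the headline accuracy is carried by the two larger groups.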
