For this tutorial we were missing an important component: data! We therefore generated our data set in two steps:

1. Initial data was captured via a Google Form that asked internal Red Hatters about car issues they currently have or have had in the past.
That data was classified into three categories: brakes, starter, and other. It served as our initial training set. Because the training set was very small (50 examples), we decided to enlarge it using Markovify.

2. Using Markovify, we were able to take the initial training set and generate more data (e.g. car repair issues) for our tutorial.
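
To make the idea behind step 2 concrete, here is a minimal sketch of a word-level Markov chain in plain Python. It is not Markovify's implementation: the helper names (`build_chain`, `generate`), the `<s>`/`</s>` markers, and the toy corpus are assumptions made purely for illustration. Each state is the previous two words, and the next word is drawn at random from the words that followed that state in the training sentences.

```python
import random
from collections import defaultdict

# Toy corpus (hypothetical, not taken from the tutorial's data set)
corpus = [
    "my car does not start",
    "my truck does not move",
    "my brakes make a grinding noise",
]

def build_chain(sentences, state_size=2):
    """Map each tuple of `state_size` words to the words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        # Pad with start markers and close with an end marker
        words = ["<s>"] * state_size + sentence.split() + ["</s>"]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, rng=random):
    """Walk the chain from the start state until the end marker is drawn."""
    state = ("<s>",) * state_size
    output = []
    while True:
        next_word = rng.choice(chain[state])
        if next_word == "</s>":
            return " ".join(output)
        output.append(next_word)
        state = state[1:] + (next_word,)

chain = build_chain(corpus)
print(generate(chain))
```

Because "does not" is followed by both "start" and "move" in the toy corpus, the chain can produce a sentence such as "my car does not move" that never appeared in the input. That recombination is what lets a small training set yield many new examples.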

In [2]:
!pip install -r requirements.txt
import pandas as pd
import csv



In [3]:
# Read in the responses; keep only the client response, categorized issue, and car symptom

df = pd.read_csv('dataset/response.csv') 
df = df.fillna('')
df['response']=df.iloc[:,3]+df.iloc[:,5]+df.iloc[:,6]
df['issue'] = df.iloc[:,1]
df['symptom'] = df.iloc[:,2] + df.iloc[:,4]
subset = df.iloc[:,-3:]
subset

Unnamed: 0,response,issue,symptom
0,my brakes make a squeaking noise whenever I tr...,Brakes,Car makes grinding noise
1,super frustrating every time I start my car it...,Starter,Car starts then stops
2,I can't open the damn door to my car,Other,
3,I turn the key and nothing happens,Starter,Car doesn't start
4,Car doesn't always start when it's low on blin...,Starter,Car doesn't start
...,...,...,...
104,Parking brake doesn’t return once released,Brakes,"Car brakes, but then brakes disengage"
105,my lights do not work,Other,
106,I try to start the engine only to find that th...,Starter,Car doesn't start
107,The driver side window auto function does not ...,Other,


In [4]:
import markovify
import codecs

In [5]:
# markovify is a simple, extensible Markov chain generator.
# Its primary use is building Markov models of large corpora of text and generating random sentences from them.

#Function builds the model according to what issue (e.g. brakes, starter, other) is given
def train_markov_type(data, issue):
    return markovify.Text(data[data["issue"] == issue].response, retain_original=False, state_size=2)

#Function takes one of the 'issue' models and creates a randomly generated sentence of up to `length` characters (default 100). Note that it creates only one sentence, and may return None if no suitable sentence is found
def make_sentence(model, length=100):
    return model.make_short_sentence(length, max_overlap_ratio = .7, max_overlap_total=15)

#built models
other_model = train_markov_type(subset, "Other")
brakes_model = train_markov_type(subset, "Brakes")
starter_model = train_markov_type(subset, "Starter")

We can combine these models with relative weights, so that some categories are sampled more often than others:

In [6]:
import numpy

def generate_cases(models, weights=None):
    if weights is None:
        weights = [1] * len(models)
    
    choices = [] # List of (cumulative weight fraction, model) tuples
    
    total_weight = float(sum(weights))
    
    for i in range(len(weights)):
        choices.append((float(sum(weights[0:i+1])) / total_weight, models[i]))
    
    # Return a tuple of model and category that are randomly selected by given weights.
    def choose_model():
        r = numpy.random.uniform()
        for (model_weight, model) in choices:
            if r <= model_weight:
                return model
        return choices[-1][1]


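    while True:
        local_model = choose_model()
        # local_model[0] is the markovify model, local_model[1] is the category
        sentence = make_sentence(local_model[0])
        # markovify returns None when it cannot build a valid sentence; skip those
        if sentence is not None:
            yield sentence, local_model[1]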
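
The cumulative-threshold lookup in `choose_model` is one way to do weighted random selection. As a hedged illustration of what it computes, the sketch below uses example weights of 14, 7, and 7 for three labels, shows the resulting thresholds, and draws the same kind of weighted sample with NumPy's `Generator.choice`. The variable names and the seed are assumptions for this example only and are not part of the tutorial's code.

```python
import numpy as np

# Example relative weights and labels (assumed for illustration)
weights = [14, 7, 7]
labels = ["other", "brakes", "starter"]

# Cumulative thresholds: a uniform draw r in [0, 1) selects the first
# label whose threshold is >= r
cumulative = np.cumsum(weights) / np.sum(weights)
print(cumulative)

# Equivalent weighted sampling with NumPy's built-in support
rng = np.random.default_rng(0)
probabilities = np.array(weights) / np.sum(weights)
draws = rng.choice(labels, size=10_000, p=probabilities)

# Empirical share of each label in the sample
values, counts = np.unique(draws, return_counts=True)
shares = {str(v): c / len(draws) for v, c in zip(values, counts)}
print(shares)
```

With these weights, roughly half of the draws should be "other" and about a quarter each "brakes" and "starter".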
   


Generate new sentences and label each one as other, brakes, or starter.

Store the new sentences in the file dataset/testdata1.csv.

In [7]:
import numpy as np

generated_cases = generate_cases([(other_model,'other'), (brakes_model,'brakes'), (starter_model,'starter')], [14,7,7])


# Tuples with sentence and category
sentence_tuples = [next(generated_cases) for _ in range(1000)]  # create 1000 sentence/category tuples

# Write to csv file
with open('dataset/testdata1.csv', 'w') as file:
    writer = csv.writer(file, delimiter=',', lineterminator='\n')
    writer.writerows(sentence_tuples)

At this point we have created a new data set. There is, however, a problem we must overcome: machine learning models cannot work with raw text, so we must convert the textual data into some numeric form.

We can do this using tokenization. Jump to 02-TokenDemo.ipynb.
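
As a rough preview of what "converting text into numeric form" means, here is a minimal sketch in plain Python. It is not necessarily the approach taken in 02-TokenDemo.ipynb: the helpers `build_vocab` and `encode`, the whitespace splitting, and the choice of 0 for unknown words are all assumptions made for illustration.

```python
def build_vocab(sentences):
    """Assign each distinct lowercase word an integer id, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            # Id 0 is reserved for words not seen while building the vocabulary
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(sentence, vocab):
    """Replace each word with its id, using 0 for unknown words."""
    return [vocab.get(word, 0) for word in sentence.lower().split()]

# Toy sentences (hypothetical, not taken from the generated data set)
sentences = ["my brakes squeak", "my car will not start"]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]
print(encoded)  # [[1, 2, 3], [1, 4, 5, 6, 7]]
```

Each sentence becomes a list of integers that a model can consume, and a word the vocabulary has never seen maps to 0.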