# Classification

## Creating a Test Set

### Importing The Data
I have created a JS script, `createShuffledIntentsUtterancesEntitiesCsv.js`, that gives more information about each utterance. In addition to the intent and the complete utterance, each row has the entity and content ID, if applicable.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/training-data-with-entities.csv')

In [3]:
df.head()

Unnamed: 0,Intent,Utterance,Entitiy,Id
0,PlayRadioIntent,start playing radio w. m. ninety five point six,radio w. m. ninety five point six,bbc_wm
1,PlayRadioIntent,start radio shropshire please,radio shropshire,bbc_radio_shropshire
2,PlayRadioIntent,let me listen to b. b. c. london please,b. b. c. london,bbc_london
3,PlayRadioIntent,find me wales radio,wales radio,bbc_radio_wales_fm
4,PlayRadioIntent,can I play b. b. c. york please,b. b. c. york,bbc_radio_york


In [4]:
df.tail()

Unnamed: 0,Intent,Utterance,Entitiy,Id
1976576,PlayPodcastIntent,start podcast five live science podcast bbc ra...,five live science podcast bbc radio five live,p02pc9ny
1976577,PlayPodcastIntent,let me hear bbc newcastle rebecca o'neill,bbc newcastle rebecca o'neill,p06nzv9h
1976578,PlayPodcastIntent,continue listening to bbc world service world ...,bbc world service world update,p007dhp8
1976579,PlayPodcastIntent,start playing the five faces of leonardo please,the five faces of leonardo,m0004l8g
1976580,PlayPodcastIntent,can I hear podcast b. b. c. radio cumbria joe ...,b. b. c. radio cumbria joe costin,p001d72n


In [5]:
df.shape

(1976581, 4)

In [None]:
df['Intent'].value_counts()

There are far more examples of `PlayRadioIntent` and `PlayPodcastIntent` than the other intents. I am aware that this will undoubtedly affect the performance of the model.
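
To put a number on an imbalance like this, relative class frequencies are easier to read than raw counts. A minimal sketch on a made-up `toy_intents` series (an assumption for illustration, not the notebook's data), using `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical toy labels standing in for df['Intent']; not the real data.
toy_intents = pd.Series(
    ["PlayPodcastIntent"] * 8 + ["PlayRadioIntent"] * 2,
    name="Intent",
)

# normalize=True returns each class's share of the rows instead of raw counts.
toy_shares = toy_intents.value_counts(normalize=True)
print(toy_shares)
```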

### Creating a Stratified Training Set

The split is stratified so that the proportion of each intent in the training set matches its proportion in the overall data.

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X = df['Utterance']

In [11]:
Y = df['Intent']

Shuffle the data and hold out 20% of it that the model is never trained on.

In [12]:
x_in, x_out, y_in, y_out = train_test_split(X, Y, test_size=0.2, stratify=Y)
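
As a rough illustration of what `stratify` does, the sketch below splits a small made-up label set (990 of one class, 10 of the other) and counts the labels on each side. The toy data and `random_state=0` are assumptions for this example; the notebook's own call does not fix a seed, so its split will differ between runs.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1% of the labels belong to the minority class.
toy_x = [f"utterance {i}" for i in range(1000)]
toy_y = ["PlayPodcastIntent"] * 990 + ["PlayRadioIntent"] * 10

toy_x_in, toy_x_out, toy_y_in, toy_y_out = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

# With stratify, both sides keep the minority class at 1%.
toy_in_counts = Counter(toy_y_in)
toy_out_counts = Counter(toy_y_out)
print(toy_in_counts, toy_out_counts)
```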

In [13]:
x_in.shape, y_in.shape

((1581264,), (1581264,))

In [14]:
x_out.shape, y_out.shape

((395317,), (395317,))

In [15]:
df_y_in = pd.DataFrame(y_in)

In [16]:
df_y_out = pd.DataFrame(y_out)

In [17]:
df_y_in.apply(pd.value_counts)

Unnamed: 0,Intent
PlayPodcastIntent,1567855
PlayRadioIntent,13409


In [18]:
df_y_out.apply(pd.value_counts)

Unnamed: 0,Intent
PlayPodcastIntent,391965
PlayRadioIntent,3352
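
The counts above allow a quick sanity check that the stratification worked: the `PlayRadioIntent` share should be nearly identical in both splits. A small arithmetic sketch using only the numbers printed above (the helper name `radio_share` is made up for this example):

```python
# Counts copied from the two value_counts outputs above.
train_counts = {"PlayPodcastIntent": 1567855, "PlayRadioIntent": 13409}
held_out_counts = {"PlayPodcastIntent": 391965, "PlayRadioIntent": 3352}


def radio_share(counts):
    """Fraction of rows labelled PlayRadioIntent."""
    return counts["PlayRadioIntent"] / sum(counts.values())


train_share = radio_share(train_counts)
held_out_share = radio_share(held_out_counts)
print(f"train: {train_share:.4%}, held out: {held_out_share:.4%}")
```

Both shares come out at roughly 0.85%, so the radio intent makes up under 1% of each split.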


## Parameter Tuning
The goal here is to find the best-performing parameters to train my model with.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.svm import LinearSVC

In [27]:
pipeline = Pipeline([('tfidf', TfidfVectorizer(sublinear_tf=True)),
                     ('selectkbest', SelectKBest(chi2, k=10000)),
                     ('linearscv', LinearSVC(max_iter=10000, dual=False))])

Listing all the parameters that I _could_ tune.

In [22]:
pipeline.get_params()

{'memory': None,
 'steps': [('tfidf',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.float64'>, encoding='utf-8',
                   input='content', lowercase=True, max_df=1.0, max_features=None,
                   min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                   smooth_idf=True, stop_words=None, strip_accents=None,
                   sublinear_tf=True, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=None, use_idf=True, vocabulary=None)),
  ('selectkbest',
   SelectKBest(k=10, score_func=<function f_classif at 0x139841c20>)),
  ('linearscv',
   LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
             intercept_scaling=1, loss='squared_hinge', max_iter=10000,
             multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
             verbose=0))],
 'verbose': False,
 'tfidf': TfidfVectorizer(analyzer='word', binary=False, decode_error='str
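
The full `get_params()` dump is long. If only the names are needed for building a grid, one option is to list the keys, keeping those of the form `<step>__<parameter>`. A minimal sketch that rebuilds the same pipeline under a different name (`sketch_pipeline`, chosen for this example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same steps and step names as the notebook's pipeline.
sketch_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("selectkbest", SelectKBest(chi2, k=10000)),
    ("linearscv", LinearSVC(max_iter=10000, dual=False)),
])

# Keys containing "__" address a parameter of a single step,
# which is the form a GridSearchCV param grid expects.
param_names = sorted(sketch_pipeline.get_params().keys())
step_param_names = [name for name in param_names if "__" in name]
print("\n".join(step_param_names))
```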

In [28]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, x_in, y_in, scoring='accuracy', cv=5, n_jobs=1)
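
`scores` holds one accuracy value per fold, but the notebook never displays it. A minimal sketch of how such an array is commonly summarised, run on a tiny made-up dataset so that it executes quickly. The texts, labels, `cv=2`, and the omission of the `SelectKBest` step are all assumptions for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy utterances; not taken from the training CSV.
toy_texts = [
    "play radio one", "start radio york",
    "play bbc radio wales", "listen to radio london",
    "play the science podcast", "start podcast world update",
    "hear the news podcast", "continue the history podcast",
]
toy_labels = ["PlayRadioIntent"] * 4 + ["PlayPodcastIntent"] * 4

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("linearsvc", LinearSVC(dual=False)),
])

# One accuracy value per fold; mean and spread summarise them.
toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels,
                             scoring="accuracy", cv=2)
print(f"mean={toy_scores.mean():.3f} std={toy_scores.std():.3f}")
```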

In [47]:
from sklearn.model_selection import GridSearchCV

grid = {
    'tfidf__ngram_range': [(1, 2), (2, 3)],
    'tfidf__stop_words': [None, 'english'],
    'selectkbest__k': [10000, 15000],
    'selectkbest__score_func': [f_classif, chi2],
    'linearscv__penalty': ['l1', 'l2'],
}
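
Before launching a search over nearly 1.6 million training utterances, it is worth knowing how many fits the grid implies. A small sketch using scikit-learn's `ParameterGrid` to count the combinations, assuming the 5-fold cross-validation used in this notebook (`sketch_grid` is a copy of the grid under a different name):

```python
from sklearn.feature_selection import chi2, f_classif
from sklearn.model_selection import ParameterGrid

sketch_grid = {
    'tfidf__ngram_range': [(1, 2), (2, 3)],
    'tfidf__stop_words': [None, 'english'],
    'selectkbest__k': [10000, 15000],
    'selectkbest__score_func': [f_classif, chi2],
    'linearscv__penalty': ['l1', 'l2'],
}

# Five parameters with two options each give 2**5 candidates.
n_candidates = len(ParameterGrid(sketch_grid))
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

That is 32 candidates and 160 fits, plus one final refit on the full training data with the default `refit=True`.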

In [48]:
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring='accuracy', n_jobs=-1, cv=5)
grid_search.fit(X=x_in, y=y_in)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [49]:
grid_search.best_score_

0.9997546266784041

The following combination was found to be the best.

In [50]:
grid_search.best_params_

{'linearscv__penalty': 'l1',
 'selectkbest__k': 15000,
 'selectkbest__score_func': <function sklearn.feature_selection.univariate_selection.f_classif(X, y)>,
 'tfidf__ngram_range': (1, 2),
 'tfidf__stop_words': None}

In [52]:
grid_search.score(x_out, y_out)

0.9997369199907922

## Training a Model

Build a model using the best parameters found during parameter tuning. I increased `max_iter` because the model didn't converge on the first try.

In [51]:
tuned_pipeline = Pipeline([('tfidf', TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2), stop_words=None)),
                           ('selectkbest', SelectKBest(f_classif, k=15000)),
                           ('linearscv', LinearSVC(max_iter=15000, dual=False, penalty='l1'))])

In [67]:
model = tuned_pipeline.fit(x_in, y_in)

Score the model on the held-out set, which was not used for training or cross-validation.

In [68]:
model.score(x_out, y_out)

0.9997369199907922
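
Given how unbalanced the classes are, it helps to compare this accuracy with what a model that always predicts `PlayPodcastIntent` would score. A back-of-the-envelope sketch using only the held-out counts and the score printed in this notebook:

```python
# Held-out class counts and accuracy, copied from the outputs above.
held_out_podcast = 391965
held_out_radio = 3352
held_out_total = held_out_podcast + held_out_radio

model_accuracy = 0.9997369199907922

# Accuracy of always predicting the majority class.
majority_baseline = held_out_podcast / held_out_total

# Number of held-out utterances the model got wrong.
misclassified = round((1 - model_accuracy) * held_out_total)

print(f"baseline={majority_baseline:.4f} errors={misclassified}")
```

The baseline is about 99.15%, so 99.97% is a real improvement, but accuracy alone does not show how the 104 or so errors are divided between the two intents. Per-class precision and recall would show that.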

## Testing the Model
Trying out some example utterances that I made up. 

In [106]:
model_results = [['utterance', 'expectation', 'actual', 'correct?']]

In [108]:
samples = [['PlayPodcastIntent', 'can i listen to the new bbc womens show'],
          ['PlayPodcastIntent', 'can you play unexpected fluids for me'],
          ['PlayPodcastIntent', 'i want to listen to yesterdays quiz'],
          ['PlayPodcastIntent', 'play womans hour podcast'],
          ['PlayPodcastIntent', 'play me the nixtape'],
          ['PlayPodcastIntent', 'i want to hear the health show on demand'],
          ['PlayRadioIntent', 'play bbc radio one extra station'],
          ['PlayRadioIntent', 'stream bbc radio nottingham for me'],
          ['PlayRadioIntent', 'please may i listen to scottish radio'],
          ['PlayRadioIntent', 'i want to listen to new welsh radio station'],
          ['PlayRadioIntent', 'can i listen to radio leicester'],
          ['PlayRadioIntent', 'can you kindly play me bbc radio four']]

In [109]:
for sample in samples:
    prediction = model.predict([sample[1]])
    model_results.append([sample[1], sample[0], prediction[0], prediction[0] == sample[0]])

In [110]:
results_df = pd.DataFrame(model_results)

In [111]:
results_df

Unnamed: 0,0,1,2,3
0,utterance,expectation,actual,correct?
1,can i listen to the new bbc womens show,PlayPodcastIntent,PlayPodcastIntent,True
2,can you play unexpected fluids for me,PlayPodcastIntent,PlayPodcastIntent,True
3,i want to listen to yesterdays quiz,PlayPodcastIntent,PlayPodcastIntent,True
4,play womans hour podcast,PlayPodcastIntent,PlayPodcastIntent,True
5,play me the nixtape,PlayPodcastIntent,PlayPodcastIntent,True
6,i want to hear the health show on demand,PlayPodcastIntent,PlayPodcastIntent,True
7,play bbc radio one extra station,PlayRadioIntent,PlayRadioIntent,True
8,stream bbc radio nottingham for me,PlayRadioIntent,PlayRadioIntent,True
9,please may i listen to scottish radio,PlayRadioIntent,PlayPodcastIntent,False
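
Because the header list is appended as the first row of `model_results`, pandas labels the columns 0 to 3 and treats the header as data. A possible alternative, sketched here with three rows copied from the table above, is to pass the names through `columns=`. The boolean column can then be summarised directly. The names `named_results` and `n_correct` are made up for this example.

```python
import pandas as pd

columns = ["utterance", "expectation", "actual", "correct?"]

# Utterance, expected intent, predicted intent (rows 7 to 9 of the table).
rows = [
    ["play bbc radio one extra station", "PlayRadioIntent", "PlayRadioIntent"],
    ["stream bbc radio nottingham for me", "PlayRadioIntent", "PlayRadioIntent"],
    ["please may i listen to scottish radio", "PlayRadioIntent", "PlayPodcastIntent"],
]

# Append the comparison result so each row matches the four column names.
rows = [row + [row[1] == row[2]] for row in rows]

named_results = pd.DataFrame(rows, columns=columns)
n_correct = int(named_results["correct?"].sum())
print(named_results)
print(f"{n_correct} of {len(named_results)} correct")
```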


The results don't seem to be too bad. My human brain is surprised that the model hasn't associated the word "radio" with "PlayRadioIntent", but I'm only a human. 🤷🏻‍♀️