# Initial Investigations
## Importing the Data
I have already written a JS script that converts the data in `interaction-model.json` into another JSON document containing the full utterances for each intent.

In order to work with the data I need to read in the JSON document; `json.load` converts it into a Python dictionary. For ease of use, I then convert it to a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [1]:
import pandas as pd
import json

In [2]:
with open('./data/training-data.json') as training_data:
    training_data_json = json.load(training_data)

Checking that the data has been read in as expected.

In [3]:
training_data_json['PodcastOnlyIntent'][:10] # only first 10 items since I know this list is very long.

['the forum',
 'the forum bbc world service',
 'bbc world service the forum',
 'the forum from bbc world service',
 'the forum on bbc world service',
 'bbc world the forum',
 'the forum b. b. c. world service',
 'b. b. c. world service the forum',
 'the forum from b. b. c. world service',
 'the forum on b. b. c. world service']

In [4]:
training_data_json['WhatsPlayingTitleAndSynopsisIntent']

['what this is about',
 'what this programme is about',
 'what is this programme about']

I see if I can iterate through the dictionary (I can!). 

I do this because I am concerned about how the document will read into a DataFrame because the length of each list in the dictionary is very different.

In [5]:
# iterate through the python dictionary. Is there need for pandas? 
# use != for an exact match; `not in` against a string would be a substring test
[dt for dt in training_data_json['WhatsPlayingTitleAndSynopsisIntent'] if dt != 'what this is about']

['what this programme is about', 'what is this programme about']
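One thing to watch when filtering strings like this: between two strings, `in` is a substring test rather than an equality test. A minimal sketch of the difference, where the extra utterance `'about'` is hypothetical and not part of the real data:

```python
target = 'what this is about'
utterances = ['what this is about',
              'what this programme is about',
              'what is this programme about',
              'about']  # hypothetical extra utterance, only here to show the pitfall

# Substring test: 'about' occurs inside the target string, so it is dropped as well.
by_substring = [u for u in utterances if u not in target]

# Equality test: only the exact match is dropped.
by_equality = [u for u in utterances if u != target]

print(by_substring)
print(by_equality)
```

With the three real utterances both filters give the same answer, which is why the difference is easy to miss.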

As I expected, I had trouble reading in the data due to the differing lengths of each list. I found I had to pass `orient='index'` so that each intent becomes a row, with the shorter lists padded out with `None`. I am concerned that this will make the data hard to work with and won't be any more useful than working directly with the native Python dictionary.

In [6]:
data = pd.DataFrame.from_dict(training_data_json, orient='index')

Checking that the data has read in as expected. As is shown below, the lengths of the lists are extremely different. 

In [7]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1959810,1959811,1959812,1959813,1959814,1959815,1959816,1959817,1959818,1959819
PlayPodcastIntent,continue the forum,continue the forum bbc world service,continue bbc world service the forum,continue the forum from bbc world service,continue the forum on bbc world service,continue bbc world the forum,continue the forum b. b. c. world service,continue b. b. c. world service the forum,continue the forum from b. b. c. world service,continue the forum on b. b. c. world service,...,activate podcast the reith lectures archive ni...,activate podcast the reith lectures archive ni...,activate podcast bbc radio four the reith lect...,activate podcast the reith lectures archive ni...,activate podcast b. b. c. radio four the reith...,activate podcast letter from america by alista...,activate podcast letter from america by alista...,activate podcast bbc radio four letter from am...,activate podcast letter from america by alista...,activate podcast b. b. c. radio four letter fr...
PodcastOnlyIntent,the forum,the forum bbc world service,bbc world service the forum,the forum from bbc world service,the forum on bbc world service,bbc world the forum,the forum b. b. c. world service,b. b. c. world service the forum,the forum from b. b. c. world service,the forum on b. b. c. world service,...,,,,,,,,,,
PlayRadioIntent,play b. b. c. radio one,play radio one,play bbc radio one,play b. b. c. radio one extra,play bbc radio one extra,play one extra,play radio one extra,play one xtra,play radio one xtra,play bbc radio one xtra,...,,,,,,,,,,
RadioOnlyIntent,b. b. c. radio one,radio one,bbc radio one,b. b. c. radio one extra,bbc radio one extra,one extra,radio one extra,one xtra,radio one xtra,bbc radio one xtra,...,,,,,,,,,,
WhatsPlayingIntent,who is this d. j.,tell me who the d. j. is,what's this song,what song is this,what tune is this,what song is playing,what song is playing right now,what this song is,what the song title is,who this is,...,,,,,,,,,,
WhatsPlayingTitleAndSynopsisIntent,what this is about,what this programme is about,what is this programme about,,,,,,,,...,,,,,,,,,,
WhatStationIsPlayingIntent,what is this station,what channel is this,what channel,what station I am listening to,what station this is,what station,what radio station I am listening to,what radio station this is,what radio station,,...,,,,,,,,,,


Checking how I would access data since I am not used to working with data oriented this way in a DataFrame. I am still unconvinced of the benefits of using a DataFrame over working directly with the dictionary.

In [8]:
data.at['PlayPodcastIntent', 6374]

'continue darryl morris b. b. c. radio manchester'

In [9]:
# comparing with getting the same data directly from the dictionary.
training_data_json['PlayPodcastIntent'][6374]

'continue darryl morris b. b. c. radio manchester'

In [10]:
data.loc['WhatStationIsPlayingIntent'][:10]

0                    what is this station
1                    what channel is this
2                            what channel
3          what station I am listening to
4                    what station this is
5                            what station
6    what radio station I am listening to
7              what radio station this is
8                      what radio station
9                                    None
Name: WhatStationIsPlayingIntent, dtype: object

In [11]:
data.loc['PlayPodcastIntent'][:10]

0                                continue the forum
1              continue the forum bbc world service
2              continue bbc world service the forum
3         continue the forum from bbc world service
4           continue the forum on bbc world service
5                      continue bbc world the forum
6         continue the forum b. b. c. world service
7         continue b. b. c. world service the forum
8    continue the forum from b. b. c. world service
9      continue the forum on b. b. c. world service
Name: PlayPodcastIntent, dtype: object

Even though the text from Amazon would not contain punctuation such as parentheses, I have noticed that some entities contain them. I suspect that this is a mistake by whoever entered this data to be read into the interaction model. Full stops, however, are usually returned by the speech-to-text service, especially when the user says an initialism like "BBC".

All the text is already lowercase.

In [12]:
data.at['PlayPodcastIntent', 1959819]

'activate podcast b. b. c. radio four letter from america by alistair cooke the early years (1940s 1950s and 1960s)'

In [13]:
data.at['PlayPodcastIntent', 1959818]

'activate podcast letter from america by alistair cooke the early years (1940s 1950s and 1960s) b. b. c. radio four'

In [14]:
sentenceWithPunc=data.at['PlayPodcastIntent', 1959817]

Checking the shape of the data. There are 7 classes and each is 1959820 items long, but I am aware that most of these are 'None'. This isn't very useful.

In [15]:
data.shape

(7, 1959820)
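To put a number on how much of a frame like this is padding, the list lengths can be compared with the rectangular shape they are stretched into. This is a rough sketch on a hypothetical miniature dictionary (`toy_data` is made up for illustration), not the real training data:

```python
# Hypothetical miniature of the intent -> utterances dictionary.
toy_data = {
    'PlayRadioIntent': ['play radio one', 'play bbc radio one',
                        'play one extra', 'play one xtra'],
    'WhatStationIsPlayingIntent': ['what station', 'what channel'],
    'WhatsPlayingTitleAndSynopsisIntent': ['what this is about'],
}

lengths = {intent: len(utterances) for intent, utterances in toy_data.items()}

width = max(lengths.values())          # number of columns after padding
total_cells = len(toy_data) * width    # rows x columns of the padded frame
filled_cells = sum(lengths.values())   # cells holding a real utterance
fill_ratio = filled_cells / total_cells

print(lengths)
print(f'{filled_cells} of {total_cells} cells are filled ({fill_ratio:.0%})')
```

On the real DataFrame, `data.notna().sum(axis=1)` should give the same per-intent counts directly.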

## Removing Stop Words
[Stop words](https://en.wikipedia.org/wiki/Stop_words) are commonly used words that do not generally help with deriving meaning during natural language processing (NLP). I have decided to remove them here (at least initially), but I wonder whether, since each utterance is so short, it would be better to keep them in.

In [16]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/leives01/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
from string import punctuation
from nltk.corpus import stopwords

In [18]:
stopwords=set(stopwords.words('english')+list(punctuation))

I see if I can tokenize the words (break each string up into a list of words) so that I can then remove the stopwords. I had trouble getting the NLTK `word_tokenize` method to work. I finally, after a lot of internet searching, found I had to download 'punkt'. 

In [19]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/leives01/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [20]:
word_tokenize('Try this sentence here.')

['Try', 'this', 'sentence', 'here', '.']

In [21]:
[word for word in word_tokenize(sentenceWithPunc) if word not in stopwords]

['activate',
 'podcast',
 'bbc',
 'radio',
 'four',
 'letter',
 'america',
 'alistair',
 'cooke',
 'early',
 'years',
 '1940s',
 '1950s',
 '1960s']
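The worry about short utterances can be checked on a small scale: once stop words are removed, several distinct questions can collapse into the same string. The sketch below uses a tiny hypothetical stop list (`toy_stopwords`) instead of the NLTK one, and plain `str.split` instead of `word_tokenize`, so it is only an approximation of the real preprocessing:

```python
# Small hypothetical stop list; the NLTK English list is much longer.
toy_stopwords = {'what', 'this', 'is', 'the', 'to', 'am', 'i'}

utterances = ['what station this is',
              'what station',
              'what is this station',
              'what radio station']

def strip_stopwords(text):
    """Drop stop words and join the remaining tokens back into a string."""
    return ' '.join(word for word in text.split() if word not in toy_stopwords)

processed = [strip_stopwords(u) for u in utterances]

print(processed)
print(len(set(utterances)), 'distinct utterances before,',
      len(set(processed)), 'after')
```

If many utterances collapse like this, the classifier sees fewer distinct examples for the small intents than the raw row count suggests.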

## Bag of Words and TF-IDF

### What is Bag of Words?
A [Bag of Words](https://machinelearningmastery.com/gentle-introduction-bag-words-model/) model is a way of representing text data when modeling text with machine learning algorithms. It is a representation of text that describes the occurrence of words within a document. It involves two things:

1. A vocabulary of known words.
2. A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document. Here I am going to use TF-IDF as the measure of presence of known words.

### What is TF-IDF?
[Term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), as described on Wikipedia:

> ... is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.

* Term Frequency: is a scoring of the frequency of the word in the current document.
* Inverse Document Frequency: is a scoring of how rare the word is across documents.
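
To make the two scores concrete, here is a hand-rolled sketch of the textbook weighting, tf × log(N / df), on three made-up utterances. This is an illustration only: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Three made-up documents (utterances) for illustration.
docs = ['play radio one', 'play bbc radio one', 'play one extra']
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenised for term in set(tokens))

def tf_idf(tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    counts = Counter(tokens)
    return {term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()}

weights = [tf_idf(tokens) for tokens in tokenised]

for doc, w in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

Terms that appear in every document ('play' and 'one' here) get a weight of zero, while the rarer 'bbc' outweighs 'radio' in the second utterance.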

In [22]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

The `TfidfVectorizer` creates the Bag of Words 'vocabulary bag' and performs the TF-IDF vectorization together in one step.

In [23]:
vectorizer1=TfidfVectorizer(max_df=0.5,min_df=2,stop_words=list(stopwords))  # newer scikit-learn versions expect a list, not a set

In [24]:
# using the sklearn built in stop_words for english rather than my own for comparison. 
vectorizer2=TfidfVectorizer(max_df=0.5,min_df=2,stop_words='english')
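
The `max_df=0.5` and `min_df=2` arguments prune the vocabulary by document frequency: a float `max_df` drops terms that occur in more than that proportion of documents, and an integer `min_df` drops terms that occur in fewer than that many documents. A plain-Python sketch of that filter on five made-up utterances (not the vectorizer's actual implementation):

```python
from collections import Counter

# Made-up utterances for illustration.
docs = ['play radio one',
        'play bbc radio two',
        'play one extra',
        'play bbc three',
        'play radio four']
n_docs = len(docs)

# Document frequency of each term (each document counts a term once).
df = Counter(term for doc in docs for term in set(doc.split()))

min_df = 2     # integer: minimum number of documents a term must appear in
max_df = 0.5   # float: maximum proportion of documents a term may appear in

vocabulary = sorted(term for term, count in df.items()
                    if count >= min_df and count / n_docs <= max_df)

print(dict(df))
print(vocabulary)
```

Here 'play' (in every utterance) and 'radio' (in three of five) are dropped for being too common, and the words that appear only once are dropped for being too rare.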

In [25]:
X1 = vectorizer1.fit_transform(training_data_json['PlayRadioIntent'])

In [26]:
X2 = vectorizer2.fit_transform(training_data_json['PlayRadioIntent'])

From what is printed out below, the built-in `stop_words='english'` list removes more words: the vocabulary has 89 terms rather than 98, and there are fewer stored elements (42,299 versus 54,550). This also changes the weightings.

In [27]:
X1

<16761x98 sparse matrix of type '<class 'numpy.float64'>'
	with 54550 stored elements in Compressed Sparse Row format>

In [28]:
X2

<16761x89 sparse matrix of type '<class 'numpy.float64'>'
	with 42299 stored elements in Compressed Sparse Row format>

In [29]:
print(X1[30])

  (0, 48)	0.6802645092751256
  (0, 25)	0.5636176454553009
  (0, 3)	0.29976642843186957
  (0, 65)	0.3601602914499266


In [30]:
print(X2[30])

  (0, 44)	0.8235290841639401
  (0, 3)	0.36289762129829173
  (0, 60)	0.43601050903865124


## Working with the CSV file Instead

I have concluded that the wide DataFrame offers no benefit over the native Python dictionary, largely because it is so sparse. A DataFrame built from a CSV file with one utterance per row would be better because it would not be sparse and could be oriented the 'normal' way. I have made another JS script to convert the JSON training data into a CSV document.
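
The same reshaping could also be done in Python. The sketch below writes one `Intent,Utterance` row per utterance using the standard `csv` module; `toy_data` is a hypothetical miniature of the dictionary, and an in-memory buffer stands in for a real output file:

```python
import csv
import io

# Hypothetical miniature of the intent -> utterances dictionary.
toy_data = {
    'PlayRadioIntent': ['play radio one', 'play bbc radio one'],
    'WhatsPlayingTitleAndSynopsisIntent': ['what this is about'],
}

buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(['Intent', 'Utterance'])

# One row per (intent, utterance) pair, so no padding is needed.
for intent, utterances in toy_data.items():
    for utterance in utterances:
        writer.writerow([intent, utterance])

# Read it back to check the shape of the result.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

In this long format every row is a complete example, which is also the shape scikit-learn expects for training.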

For completeness, I import things again where I feel it will help me keep track of where I am and what is going on.

In [31]:
import pandas as pd

In [32]:
intents_df = pd.read_csv('./data/training-data.csv')

This looks so much better to work with 😅

In [33]:
intents_df.tail(12)

Unnamed: 0,Intent,Utterance
1995041,WhatsPlayingTitleAndSynopsisIntent,what this is about
1995042,WhatsPlayingTitleAndSynopsisIntent,what this programme is about
1995043,WhatsPlayingTitleAndSynopsisIntent,what is this programme about
1995044,WhatStationIsPlayingIntent,what is this station
1995045,WhatStationIsPlayingIntent,what channel is this
1995046,WhatStationIsPlayingIntent,what channel
1995047,WhatStationIsPlayingIntent,what station I am listening to
1995048,WhatStationIsPlayingIntent,what station this is
1995049,WhatStationIsPlayingIntent,what station
1995050,WhatStationIsPlayingIntent,what radio station I am listening to


In [34]:
intents_df.shape

(1995053, 2)

## Splitting out 20% for testing later
It is standard practice to split out a bit of the training data to test the model later. This is done with the sklearn `train_test_split`.

In [35]:
from sklearn.model_selection import train_test_split

In [36]:
X = intents_df['Utterance'] # data to train from

In [37]:
Y = intents_df['Intent'] # answers

`train_test_split` shuffles the data for you.

In [38]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # 20% of data held for testing

In [39]:
x_train.shape, y_train.shape

((1596042,), (1596042,))

In [40]:
x_test.shape, y_test.shape

((399011,), (399011,))
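
With a purely random split, an intent that has only three utterances can end up entirely in the training set or entirely in the test set. A stratified split avoids that by splitting each class separately. The function below is a hypothetical plain-Python sketch of the idea with made-up labels; `train_test_split` offers `stratify` and `random_state` parameters for the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one example of every class.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up, heavily imbalanced labels.
labels = ['PlayPodcastIntent'] * 20 + ['WhatsPlayingTitleAndSynopsisIntent'] * 3
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), 'train,', len(test_idx), 'test')
print({labels[i] for i in test_idx})
```

Both intents appear on each side of the split, which a plain shuffle does not guarantee for the rare one.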

Resources: [Text Classification](https://www.youtube.com/watch?v=5xDE06RRMFk), [Parameter Tuning](https://www.youtube.com/watch?v=CArkneSPNr4)

## Initial Attempt at Text Classification

To have all this in one place, helping me mentally as I experiment, I re-made `stopwords`.

In [41]:
import nltk
nltk.download('stopwords')

from string import punctuation
from nltk.corpus import stopwords

stopwords=set(stopwords.words('english')+list(punctuation))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/leives01/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [42]:
from nltk.tokenize import word_tokenize

Making a new column in my pandas DataFrame of utterances that have been tokenized, had their stop words removed, and been joined back into a string. All the text is already in lowercase with minimal punctuation. I therefore do not need to do much cleaning of the data.

In [43]:
intents_df['processed_utterance'] = intents_df['Utterance'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stopwords]))


In [44]:
intents_df.head()

Unnamed: 0,Intent,Utterance,processed_utterance
0,PlayPodcastIntent,continue the forum,continue forum
1,PlayPodcastIntent,continue the forum bbc world service,continue forum bbc world service
2,PlayPodcastIntent,continue bbc world service the forum,continue bbc world service forum
3,PlayPodcastIntent,continue the forum from bbc world service,continue forum bbc world service
4,PlayPodcastIntent,continue the forum on bbc world service,continue forum bbc world service


In [45]:
intents_df.tail()

Unnamed: 0,Intent,Utterance,processed_utterance
1995048,WhatStationIsPlayingIntent,what station this is,station
1995049,WhatStationIsPlayingIntent,what station,station
1995050,WhatStationIsPlayingIntent,what radio station I am listening to,radio station I listening
1995051,WhatStationIsPlayingIntent,what radio station this is,radio station
1995052,WhatStationIsPlayingIntent,what radio station,radio station


In [46]:
intents_df.Utterance.head()

0                           continue the forum
1         continue the forum bbc world service
2         continue bbc world service the forum
3    continue the forum from bbc world service
4      continue the forum on bbc world service
Name: Utterance, dtype: object

In [47]:
intents_df.shape

(1995053, 3)

In [48]:
from sklearn.model_selection import train_test_split

In [49]:
X = intents_df['processed_utterance'] # data to train from
Y = intents_df['Intent'] # answers

Again, I put aside 20% of the data.

In [50]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # 20% of data held for testing

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
#creates the bag of words matrix and tfidf vectorizer in one method call

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

I initially set the maximum number of iterations to 3000 but it didn't converge, so I raised `max_iter` to 10000.

In [52]:
pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1,2), sublinear_tf=True)),
                     ('chi', SelectKBest(chi2, k=10000)),
                     ('clf', LinearSVC(C=1.0, penalty='l1', max_iter=10000, dual=False))])
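
The `ngram_range=(1,2)` setting means the features are single words and pairs of adjacent words. A rough sketch of what that produces for one processed utterance follows; it splits on whitespace, which is a simplification of the vectorizer's own tokenizer, and `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of text for n within ngram_range (inclusive)."""
    tokens = text.split()
    smallest, largest = ngram_range
    return [' '.join(tokens[i:i + n])
            for n in range(smallest, largest + 1)
            for i in range(len(tokens) - n + 1)]

features = word_ngrams('continue forum bbc world service')
print(features)
```

Two-word features such as 'world service' can therefore show up alongside single words when inspecting the model.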

In [53]:
model = pipeline.fit(x_train, y_train)



In [54]:
vectorizer = model.named_steps['vect']
chi = model.named_steps['chi']
clf = model.named_steps['clf']

In [55]:
import numpy as np

In [56]:
feature_names = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn versions before 1.0
feature_names = [feature_names[i] for i in chi.get_support(indices=True)]
feature_names = np.asarray(feature_names)

In [57]:
# Rows of clf.coef_ follow clf.classes_ (sorted label order), so take the names from the
# fitted model; a hand-ordered list attaches the keywords to the wrong intents.
target_names = clf.classes_


Listing out the top ten keywords for each class

In [58]:
# top 10 keywords per class
for i, label in enumerate(target_names):
    top10 = np.argsort(clf.coef_[i])[-10:]
    print('%s: %s' % (label, ' '.join(feature_names[top10])))

PlayPodcastIntent: hear listen start begin play please launch continue resume podcast
PlayRadioIntent: cwr gaidheal cambridge sports extra coventry midlands devon leicester jersey norfolk
PodcastOnlyIntent: get job service listen fresh start global news naked podcast radio hour h2o early start let get listen play
RadioOnlyIntent: gaidheal 3cr swindon southampton somerset tees newcastle coventry cymru radio
WhatStationIsPlayingIntent: get oxford get physical get please get podcast get radio get saturday get sheffield zoo channel station
WhatsPlayingIntent: song title artist singing right listening song playing sings song band singing playing song
WhatsPlayingTitleAndSynopsisIntent: get oxford get physical get please get podcast get radio get saturday get scotland get one zoo programme


I was e

In [59]:
#accuracy score
model.score(x_test, y_test)

0.9989448912435998

In [60]:
model.predict(['can i listen to the new bbc womens show'])

array(['PlayPodcastIntent'], dtype=object)

In [61]:
model.predict(['i want to listen to bbc radio four'])

array(['PlayPodcastIntent'], dtype=object)

In [62]:
model.predict(['listen to bbc radio four'])

array(['PlayPodcastIntent'], dtype=object)

In [63]:
model.predict(['can you play bbc radio four'])

array(['PlayPodcastIntent'], dtype=object)

In [64]:
model.predict(['bbc radio four'])

array(['PodcastOnlyIntent'], dtype=object)

In [65]:
model.predict(['what am i listening to'])

array(['PodcastOnlyIntent'], dtype=object)

In [66]:
model.predict(['what artist is singing'])

array(['WhatsPlayingIntent'], dtype=object)

## Conclusions and Next Steps

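I am happy that my data can be read in, processed and classified. However, the results of the classification model are disappointing: in the spot checks above, every request to play a radio station was classified as `PlayPodcastIntent`, and 'what am i listening to' came back as `PodcastOnlyIntent`.

The high accuracy score is largely a product of the class imbalance in the dataset. `PlayPodcastIntent` accounts for 1,959,820 of the 1,995,053 utterances (about 98%), so a model that favours that class scores well on a random test split without handling the smaller classes. I [need to look at ways to overcome this](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/) and implement them. 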
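
A quick way to see how misleading plain accuracy is here: compare it with the score of always answering the largest class, and with per-class recall. In the sketch below the two counts come from the shapes printed earlier in this notebook, while `y_true` and `y_pred` are made-up labels used only to show the calculation.

```python
from collections import Counter

# Counts from the outputs above: the PlayPodcastIntent row spans all 1,959,820
# columns of the wide DataFrame, and the CSV has 1,995,053 rows in total.
n_play_podcast = 1_959_820
n_total = 1_995_053
majority_baseline = n_play_podcast / n_total  # accuracy of always predicting PlayPodcastIntent

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    totals = Counter(y_true)
    return {label: correct[label] / totals[label] for label in totals}

# Made-up labels: a model that always predicts the majority class.
y_true = ['PlayPodcastIntent'] * 8 + ['PlayRadioIntent'] * 2
y_pred = ['PlayPodcastIntent'] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
balanced_accuracy = sum(recall.values()) / len(recall)

print(f'majority baseline on the real counts: {majority_baseline:.4f}')
print(f'toy accuracy: {accuracy:.2f}, toy balanced accuracy: {balanced_accuracy:.2f}')
```

Balanced accuracy (the mean of the per-class recalls) and options such as `class_weight='balanced'` on `LinearSVC` are possible starting points, though I have not tried them yet.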

Depending on the outcome, I should also consider parameter tuning.