## Project Name: CSML1010 NLP Course Project - Part 1 - Proposal: Problem, Dataset, and Exploratory Data Analysis
#### Authors (Group3): Paul Doucet, Jerry Khidaroo
#### Project Repository: https://github.com/CSML1010-3-2020/NLPCourseProject

#### 1. Problem Definition and Data Preparation Notebook:

This notebook will review the following sections as part of our project proposal:

- <a href='#Problem_Definition'>__Problem Definition__</a>
- <a href='#Dataset_Description'>__Dataset Description__</a>
- <a href='#Roadmap'>__Roadmap__</a>
- <a href='#Import_the_Dataset'>__Import the Dataset__</a>


<a id='Problem_Definition'></a>
## Problem Definition
The problem we will be analysing is supervised text classification.
The goal is to investigate which supervised machine learning methods give the best results when classifying the texts from our dataset into the pre-defined categories. This is a multi-class text classification problem. The input will be the text elements of each conversation concatenated together. The output will be the instruction_id.

<a id='Dataset_Description'></a>
## Dataset Description
The dataset we will be using for our project is the __Taskmaster-1__ dataset from Google.
[Taskmaster-1](https://research.google/tools/datasets/taskmaster-1/)

The dataset can be obtained from: https://github.com/google-research-datasets/Taskmaster

>The dataset consists of 13,215 task-based dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.

Our initial data exploration will use the written dialog file with 7,708 records.

<a id='Roadmap'></a>
## Roadmap
As part of our study, we will consider the following steps to find the best classifier for incoming texts.
- __Feature Engineering:__
    - Count Vectors: 
    A count vector representation is a matrix in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell holds the frequency count of a particular term in a particular document. Count vectors provide no context and take no account of a word's relation to other words or its position in the sentence.
        - Bag-of-words
        - Bag of n-grams
    - TF-IDF Vectors: 
    The TF-IDF score represents the relative importance of a term in a document and in the entire corpus. It is composed of two terms: the first is the normalized Term Frequency (TF), and the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
        - Word-level TF-IDF
        - N-gram-level TF-IDF
    - Word Embeddings:
    A word embedding represents words and documents as dense vectors. The position of a word within the vector space is learned from text and is based on the words that surround it when it is used. These methods generate a context-free representation of each word in the vocabulary.
        - Word2vec
        - GloVe
    - NLP Based features: 
    An example would be the frequency distribution of part-of-speech tags.
        - Noun, Verb, Adjective, Adverb, Pronoun Counts
    - Language Models:
    These more recent models generate a context-dependent representation of each word, based on the other words in the sentence.
        - BERT or FLAIR
- __Model Training:__
    - Naive Bayes (multinomial): the multinomial variant is the one most suitable for word counts.
    - Logistic regression
    - Support vector machine
    - Decision tree (random forest)
    - Ensemble: Bagging, Boosting
- __Model Evaluation:__
    - Confusion Matrix
    - Metrics: Precision, Recall, F1 Score
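
The count-vector and TF-IDF descriptions in the roadmap can be illustrated with a minimal sketch. It uses only the standard library and a toy corpus that is hypothetical (not taken from Taskmaster-1), and it follows the unsmoothed definition given above: term frequency divided by document length, times the logarithm of the corpus size over the document frequency. Libraries such as scikit-learn apply a smoothed IDF by default, so their scores would differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; not taken from Taskmaster-1)
docs = [
    "order a large pizza",
    "book a table for two",
    "order a coffee",
]
tokenized = [d.split() for d in docs]
vocab = sorted({term for doc in tokenized for term in doc})

# Count vectors: one row per document, one column per vocabulary term
count_vectors = [[Counter(doc)[term] for term in vocab] for doc in tokenized]


def tf(term, doc):
    """Normalized term frequency: occurrences of the term / document length."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tfidf(term, doc, corpus):
    """Product of the two terms described in the roadmap."""
    return tf(term, doc) * idf(term, corpus)


print(vocab)
print(count_vectors[0])
# "a" occurs in every document, so its IDF is log(1) = 0
print(tfidf("a", tokenized[0], tokenized))
# "pizza" occurs in a single document, so it receives a positive score
print(tfidf("pizza", tokenized[0], tokenized))
```

A term that appears in every document contributes nothing under this definition, which is why TF-IDF tends to down-weight very common words relative to raw counts.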

<a id='Import_the_Dataset'></a>
## Import the Dataset

The JSON-format file we will be using from the __Taskmaster-1__ dataset is the following:
- __self-dialogs.json__ contains all the one-person dialogs.

This file can be divided into train/dev/test sets by matching the dialog IDs from the following files:
- train.csv
- dev.csv
- test.csv
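
One way to perform this matching is sketched below with the standard library. The layout of the split files is an assumption here (one dialog ID in the first column of each row, no header), and the IDs and in-memory files are hypothetical stand-ins, so the reading step may need adjusting to the real files.

```python
import csv
import io

# Toy stand-ins for train.csv / dev.csv / test.csv.
# Assumed layout: one dialog ID in the first column of each row, no header.
train_csv = io.StringIO("dlg-aaa\ndlg-bbb\n")
dev_csv = io.StringIO("dlg-ccc\n")
test_csv = io.StringIO("dlg-ddd\n")


def read_ids(handle):
    """Collect the first column of every non-empty row into a set of IDs."""
    return {row[0].strip() for row in csv.reader(handle) if row and row[0].strip()}


split_ids = {
    "train": read_ids(train_csv),
    "dev": read_ids(dev_csv),
    "test": read_ids(test_csv),
}

# Toy conversations (hypothetical IDs) shaped like records in self-dialogs.json
data = [
    {"conversation_id": "dlg-aaa", "instruction_id": "pizza-ordering-2", "utterances": []},
    {"conversation_id": "dlg-bbb", "instruction_id": "movie-tickets-1", "utterances": []},
    {"conversation_id": "dlg-ccc", "instruction_id": "restaurant-table-2", "utterances": []},
    {"conversation_id": "dlg-ddd", "instruction_id": "coffee-ordering-2", "utterances": []},
]

# Assign each conversation to the split whose ID set contains its ID
splits = {
    name: [conv for conv in data if conv["conversation_id"] in ids]
    for name, ids in split_ids.items()
}
print({name: len(convs) for name, convs in splits.items()})
```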

Supplementary information is provided to describe the data structure and annotation schema.
* __sample.json__  - A sample conversation describing the format of the data.
* __ontology.json__ - Schema file describing the annotation ontology.

The structure of the conversations in the data files is as follows:
* __conversation_id:__ A universally unique identifier with the prefix 'dlg-'. The ID has no meaning.
* __utterances:__ An array of utterances that make up the conversation.
* __instruction_id:__ A reference to the file(s) containing the user (and, if applicable, agent) instructions for this conversation.

The __utterances__ category has the following sub-categories, of which we will be using __text__ to perform our analysis:
* __index:__ A 0-based index indicating the order of the utterances in the conversation.
* __speaker:__ Either USER or ASSISTANT, indicating which role generated this utterance.
* __text:__ The raw text of the utterance. In case of self dialogs, this is written by the crowdsourced worker. In case of the WOz dialogs, 'ASSISTANT' turns are written and 'USER' turns are transcribed from the spoken recordings of crowdsourced workers.
* __segments:__ An array of various text spans with semantic annotations.
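
To make the structure concrete, here is a small sketch of a single record. The key names follow the ones used by the `pd.json_normalize` call in this notebook, and the values are the first two utterances of the first conversation shown in the output further down. The record is trimmed for illustration, and `segments` is left out because these two utterances show none.

```python
import json

# A trimmed record: only the first two utterances of the conversation,
# neither of which carries a `segments` array in the notebook output.
record_json = """
{
  "conversation_id": "dlg-00055f4e-4a46-48bf-8d99-4e477663eb23",
  "instruction_id": "restaurant-table-2",
  "utterances": [
    {"index": 0, "speaker": "USER",
     "text": "Hi, I'm looking to book a table for Korean fod."},
    {"index": 1, "speaker": "ASSISTANT",
     "text": "Ok, what area are you thinking about?"}
  ]
}
"""
record = json.loads(record_json)

# Order the utterances by their 0-based index and pull out the raw text
ordered = sorted(record["utterances"], key=lambda u: u["index"])
texts = [u["text"] for u in ordered]
speakers = [u["speaker"] for u in ordered]

print(record["instruction_id"], speakers)
print(texts)
```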


### Import the libraries

In [5]:
import json
import pandas as pd

Open the `self-dialogs.json` file and load its entire content

In [6]:
with open(r'./data/self-dialogs.json') as f:
    data = json.load(f)

Extract the `utterances` column and normalize it to view all individual text fields.  
This increases the row count from 7,708 (one row per conversation) to 169,469 (one row per utterance), since each text field now has its own row.

In [7]:
tt = pd.json_normalize(data, 'utterances', ['conversation_id','instruction_id'])
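
The effect of the two extra arguments can be seen on a toy input. This is a small sketch with hypothetical conversations; `record_path` and `meta` are the keyword names of the second and third positional arguments used above.

```python
import pandas as pd

# Two toy conversations (hypothetical IDs and text) shaped like the loaded data
toy = [
    {
        "conversation_id": "dlg-aaa",
        "instruction_id": "pizza-ordering-2",
        "utterances": [
            {"index": 0, "speaker": "USER", "text": "I want a pizza."},
            {"index": 1, "speaker": "ASSISTANT", "text": "What size?"},
        ],
    },
    {
        "conversation_id": "dlg-bbb",
        "instruction_id": "movie-tickets-1",
        "utterances": [
            {"index": 0, "speaker": "USER", "text": "Two tickets please."},
        ],
    },
]

# record_path: the nested list that is expanded into one row per element
# meta: top-level fields copied onto every expanded row
flat = pd.json_normalize(
    toy,
    record_path="utterances",
    meta=["conversation_id", "instruction_id"],
)
print(flat.shape)
print(list(flat.columns))
```

Two conversations with three utterances in total become three rows, each carrying the `conversation_id` and `instruction_id` of its parent conversation.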

View the dataframe with the text field visible outside the dictionary

In [8]:
tt

| | index | speaker | text | segments | conversation_id | instruction_id |
|---|---|---|---|---|---|---|
| 0 | 0 | USER | Hi, I'm looking to book a table for Korean fod. | | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 1 | 1 | ASSISTANT | Ok, what area are you thinking about? | | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 2 | 2 | USER | Somewhere in Southern NYC, maybe the East Vill... | [{'start_index': 13, 'end_index': 49, 'text': ... | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 3 | 3 | ASSISTANT | Ok, great. There's Thursday Kitchen, it has g... | [{'start_index': 20, 'end_index': 35, 'text': ... | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 4 | 4 | USER | That's great. So I need a table for tonight at... | [{'start_index': 26, 'end_index': 31, 'text': ... | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| ... | ... | ... | ... | ... | ... | ... |
| 169464 | 15 | ASSISTANT | Ok. | | dlg-fffa6565-32bb-4592-8d30-fff66df29633 | movie-tickets-3 |
| 169465 | 16 | USER | I think we'll pass for tonight. Thanks anyhow. | | dlg-fffa6565-32bb-4592-8d30-fff66df29633 | movie-tickets-3 |
| 169466 | 17 | ASSISTANT | Ok. Just let me know if you change your mind. | | dlg-fffa6565-32bb-4592-8d30-fff66df29633 | movie-tickets-3 |
| 169467 | 18 | USER | I will. Thanks | | dlg-fffa6565-32bb-4592-8d30-fff66df29633 | movie-tickets-3 |


Preview the dataframe without the `index`, `segments`, and `speaker` columns, keeping `text`, `conversation_id`, and `instruction_id`

# drop() returns a new dataframe here, so `tt` itself keeps every column for the cells below
tt.drop(columns=['index', 'segments', 'speaker'])

View the columns of the dataframe

In [9]:
tt.columns

Index(['index', 'speaker', 'text', 'segments', 'conversation_id',
       'instruction_id'],
      dtype='object')

View the content of the `text` column, then the `conversation_id`

In [10]:
tt['text']

0           Hi, I'm looking to book a table for Korean fod.
1                     Ok, what area are you thinking about?
2         Somewhere in Southern NYC, maybe the East Vill...
3         Ok, great.  There's Thursday Kitchen, it has g...
4         That's great. So I need a table for tonight at...
                                ...                        
169464                                                  Ok.
169465       I think we'll pass for tonight. Thanks anyhow.
169466        Ok. Just let me know if you change your mind.
169467                                       I will. Thanks
169468                                          No problem!
Name: text, Length: 169469, dtype: object

In [11]:
tt['conversation_id']

0         dlg-00055f4e-4a46-48bf-8d99-4e477663eb23
1         dlg-00055f4e-4a46-48bf-8d99-4e477663eb23
2         dlg-00055f4e-4a46-48bf-8d99-4e477663eb23
3         dlg-00055f4e-4a46-48bf-8d99-4e477663eb23
4         dlg-00055f4e-4a46-48bf-8d99-4e477663eb23
                            ...                   
169464    dlg-fffa6565-32bb-4592-8d30-fff66df29633
169465    dlg-fffa6565-32bb-4592-8d30-fff66df29633
169466    dlg-fffa6565-32bb-4592-8d30-fff66df29633
169467    dlg-fffa6565-32bb-4592-8d30-fff66df29633
169468    dlg-fffa6565-32bb-4592-8d30-fff66df29633
Name: conversation_id, Length: 169469, dtype: object

View the rows of the dataframe that belong to a single `conversation_id`

In [12]:
tt[tt.conversation_id == 'dlg-00055f4e-4a46-48bf-8d99-4e477663eb23']

| | index | speaker | text | segments | conversation_id | instruction_id |
|---|---|---|---|---|---|---|
| 0 | 0 | USER | Hi, I'm looking to book a table for Korean fod. | | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 1 | 1 | ASSISTANT | Ok, what area are you thinking about? | | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 2 | 2 | USER | Somewhere in Southern NYC, maybe the East Vill... | [{'start_index': 13, 'end_index': 49, 'text': ... | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 3 | 3 | ASSISTANT | Ok, great. There's Thursday Kitchen, it has g... | [{'start_index': 20, 'end_index': 35, 'text': ... | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 4 | 4 | USER | That's great. So I need a table for tonight at... | [{'start_index': 26, 'end_index': 31, 'text': ... | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 5 | 5 | ASSISTANT | They don't have any availability for 7 pm. | [{'start_index': 37, 'end_index': 41, 'text': ... | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 6 | 6 | USER | What times are available? | | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 7 | 7 | ASSISTANT | 5 or 8. | [{'start_index': 0, 'end_index': 1, 'text': '5... | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 8 | 8 | USER | Yikes, we can't do those times. | | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |
| 9 | 9 | ASSISTANT | Ok, do you have a second choice? | | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | restaurant-table-2 |


Collect the unique `conversation_id` values (TODO: confirm this step is necessary)

In [13]:
tt2 = tt.conversation_id.unique()

In [14]:
tt2

array(['dlg-00055f4e-4a46-48bf-8d99-4e477663eb23',
       'dlg-0009352b-de51-474b-9f13-a2b0b2481546',
       'dlg-00123c7b-15a0-4f21-9002-a2509149ee2d', ...,
       'dlg-ffcd1d53-c080-4acf-897d-48236513bc58',
       'dlg-ffd9db94-36e3-4534-b99d-89f7560db17c',
       'dlg-fffa6565-32bb-4592-8d30-fff66df29633'], dtype=object)

Verify the length of the `tt2` array to confirm the number of conversations. It should match the 7,708 conversations in the original file.

In [15]:
len(tt2)

7708

Loop through the entire `tt2` array and combine all the text for each `conversation_id`

In [16]:
# Loop through all the unique conversation_id values
conversation_id = []
conv_text = []
instr_id = []
for i in tt2:
    # All utterance rows that belong to this conversation
    tti = tt[tt.conversation_id == i]
    # Every row of a conversation carries the same instruction_id, so take the first
    instr_id.append(tti['instruction_id'].iloc[0])
    # Concatenate the utterances in order, each followed by a space
    conv2 = ''
    for k in tti['text']:
        conv2 = conv2 + k + " "
    conversation_id.append(i)
    conv_text.append(conv2)
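
The same concatenation could also be written with a pandas `groupby`, which avoids filtering the full dataframe once per conversation. This is an alternative sketch on toy data with hypothetical values, not the method used in this notebook. It differs in one detail: `" ".join` leaves no trailing space at the end of each conversation.

```python
import pandas as pd

# Toy utterance-level frame (hypothetical values) with the columns used from `tt`
toy_tt = pd.DataFrame({
    "text": ["I want a pizza.", "What size?", "Two tickets please."],
    "conversation_id": ["dlg-aaa", "dlg-aaa", "dlg-bbb"],
    "instruction_id": ["pizza-ordering-2", "pizza-ordering-2", "movie-tickets-1"],
})

# sort=False keeps conversations in order of first appearance, as unique() does
grouped = (
    toy_tt.groupby("conversation_id", sort=False)
    .agg(
        Conversation=("text", " ".join),             # join utterances with spaces
        Instruction_id=("instruction_id", "first"),  # same value on every row
    )
    .reset_index()
    .rename(columns={"conversation_id": "id"})
)
print(grouped)
```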

In [17]:
# View the content of the concatenated conversation list, created by combining all 'text' fields per conversation_id
conv_text[0:5]

["Hi, I'm looking to book a table for Korean fod. Ok, what area are you thinking about? Somewhere in Southern NYC, maybe the East Village? Ok, great.  There's Thursday Kitchen, it has great reviews. That's great. So I need a table for tonight at 7 pm for 8 people. We don't want to sit at the bar, but anywhere else is fine. They don't have any availability for 7 pm. What times are available? 5 or 8. Yikes, we can't do those times. Ok, do you have a second choice? Let me check. Ok. Lets try Boka, are they free for 8 people at 7? Yes. Great, let's book that. Ok great, are there any other requests? No, that's it, just book. Great, should I use your account you have open with them? Yes please. Great. You will get a confirmation to your phone soon. ",
 "Hi I would like to see if the Movie What Men Want is playing here. Yes it's showing here would you like to purchase a ticket? Yes, for me and a friend so two tickets please Okay. What time is that moving playing today? That movie is showing a

In [18]:
# View the content of the conversation_id list, which will be used to merge with original dataframe to match up topics
conversation_id[0:5]

['dlg-00055f4e-4a46-48bf-8d99-4e477663eb23',
 'dlg-0009352b-de51-474b-9f13-a2b0b2481546',
 'dlg-00123c7b-15a0-4f21-9002-a2509149ee2d',
 'dlg-0013673c-31c6-4565-8fac-810e173a5c53',
 'dlg-001d8bb1-6f25-4ecd-986a-b7eeb5fa4e19']

In [19]:
instr_id[0:5]

['restaurant-table-2',
 'movie-tickets-1',
 'movie-tickets-3',
 'pizza-ordering-2',
 'pizza-ordering-2']

In [20]:
# Create a dictionary to store the conversation_id and text lists, which will be stored to a dataframe
ex_dict = {'id':conversation_id, 'conv':conv_text, 'instr':instr_id}

In [21]:
# Create a dataframe with the conversation id and conversation
df = pd.DataFrame(ex_dict)
df.columns = ['id', 'Conversation','Instruction_id']
df

| | id | Conversation | Instruction_id |
|---|---|---|---|
| 0 | dlg-00055f4e-4a46-48bf-8d99-4e477663eb23 | Hi, I'm looking to book a table for Korean fod... | restaurant-table-2 |
| 1 | dlg-0009352b-de51-474b-9f13-a2b0b2481546 | Hi I would like to see if the Movie What Men W... | movie-tickets-1 |
| 2 | dlg-00123c7b-15a0-4f21-9002-a2509149ee2d | I want to watch avengers endgame where do you ... | movie-tickets-3 |
| 3 | dlg-0013673c-31c6-4565-8fac-810e173a5c53 | I want to order a pizza from Bertuccis in Chel... | pizza-ordering-2 |
| 4 | dlg-001d8bb1-6f25-4ecd-986a-b7eeb5fa4e19 | Hi I'd like to order two large pizzas. Sure, w... | pizza-ordering-2 |
| ... | ... | ... | ... |
| 7703 | dlg-ffc0c5fb-573f-40e0-b739-0e55d84100e8 | I feel like eating at a nice restaurant tonigh... | restaurant-table-1 |
| 7704 | dlg-ffc87550-389a-432e-927e-9a9438fc4f1f | Hi Sally, I need a Grande iced Americano with ... | coffee-ordering-2 |
| 7705 | dlg-ffcd1d53-c080-4acf-897d-48236513bc58 | Good afternoon. I would like to order a pizza ... | pizza-ordering-2 |
| 7706 | dlg-ffd9db94-36e3-4534-b99d-89f7560db17c | Hey. I'm thinking of seeing What Men Want toni... | movie-tickets-1 |


In [22]:
# View the first three rows of the dataframe's Conversation column
df['Conversation'][0:3]

0    Hi, I'm looking to book a table for Korean fod...
1    Hi I would like to see if the Movie What Men W...
2    I want to watch avengers endgame where do you ...
Name: Conversation, dtype: object

In [23]:
# Export the dataframe to csv to confirm content
df.to_csv(r'./data/DF_selfDialogs.csv', index=False)
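
A quick way to confirm the exported content is to read the file back and compare it with what was written. The sketch below does this on a toy dataframe with hypothetical values and a temporary file, so it does not touch `./data/DF_selfDialogs.csv`. It also counts conversations per `Instruction_id`, which is one simple check of the class balance.

```python
import os
import tempfile

import pandas as pd

# Toy conversation-level frame (hypothetical values) with the same columns as `df`
toy_df = pd.DataFrame({
    "id": ["dlg-aaa", "dlg-bbb", "dlg-ccc"],
    "Conversation": ["I want a pizza. What size? ", "Two tickets please. ", "One large pizza. "],
    "Instruction_id": ["pizza-ordering-2", "movie-tickets-1", "pizza-ordering-2"],
})

# Write to a temporary file and read it straight back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "DF_selfDialogs_toy.csv")
    toy_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The round trip should preserve the shape, the column names and the IDs
print(reloaded.shape == toy_df.shape)
print(list(reloaded.columns))
# Number of conversations per class label
print(reloaded["Instruction_id"].value_counts())
```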