# Intent Detection

### Data Conversion
Each record in the ATIS dataset has the following format:

* Actual Statement : BOS i want to fly from baltimore to dallas round trip EOS 
* Parsed Statement : O O O O O O B-fromloc.city_name O B-toloc.city_name B-round_trip I-round_trip atis_flight

Using only the parsed statement loses information about the semantics of the sentence, so it is not wise to rely on it alone. However, the named-entity tags detected for locations, places, times, etc. could still be helpful.

Hence, I convert the statement above into the following format:

* message : i want to fly from B-fromloc.city_name to B-toloc.city_name B-round_trip I-round_trip
* label : atis_flight
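
The conversion can be traced on the sample record above. The snippet below is a minimal sketch that applies the same substitution to a single hard-coded line. `convert_line` is a hypothetical helper name used only for illustration, and the tab separator between words and tags is an assumption taken from the parsing code in this notebook.

```python
def convert_line(line):
    """Turn one raw ATIS line into a (message, label) pair."""
    words_part, tags_part = line.rstrip("\n").split("\t")
    words = words_part.split()
    tags = tags_part.split()

    # Keep slot tags, but restore the original word wherever the tag is "O"
    merged = [word if tag == "O" else tag for word, tag in zip(words, tags)]

    # merged[0] is BOS; the last tag is the intent label
    return " ".join(merged[1:-1]), tags[-1]


sample = ("BOS i want to fly from baltimore to dallas round trip EOS\t"
          "O O O O O O B-fromloc.city_name O B-toloc.city_name "
          "B-round_trip I-round_trip atis_flight\n")

message, label = convert_line(sample)
print(message)
print(label)
```

The printed message keeps the plain words and replaces only the tagged tokens, which matches the message/label pair listed above.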

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Data conversion
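def getData(filename):
    messages = []
    labels = []
    # The context manager closes the file even if parsing fails
    with open(filename, "r") as raw_data:
        for line in raw_data:
            # Each line holds the words and the tags, separated by a tab
            sentence = line.rstrip("\n").split("\t")

            actual_words = sentence[0].split(" ")
            encoded_words = sentence[1].split(" ")

            # Keep the slot tags, but restore the original word wherever the tag is "O"
            for index, word in enumerate(encoded_words):
                if word == "O":
                    encoded_words[index] = actual_words[index]

            # Drop the leading BOS token and the trailing intent label
            msg = " ".join(encoded_words[1:-1])
            # The last token of the tag sequence is the intent label
            label = encoded_words[-1]

            messages.append(msg)
            labels.append(label)

    data = pd.DataFrame(data={'message':messages,'label':labels})
    return data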

In [2]:
train = getData("atis-2.train.w-intent.iob (3).txt")
test = getData("atis.test.w-intent.iob (2).txt")

### Data Understanding
* From the counts below, we can see that the dataset is highly imbalanced: atis_flight makes up the major portion of it.
* There are about 21 classes, most with comparatively little data, so it looks difficult to train a classifier that is not biased toward a few classes.


In [3]:
train.groupby('label')['message'].nunique()

label
atis_abbreviation                             56
atis_aircraft                                 67
atis_aircraft#atis_flight#atis_flight_no       1
atis_airfare                                 321
atis_airline                                 109
atis_airline#atis_flight_no                    2
atis_airport                                  16
atis_capacity                                 13
atis_cheapest                                  1
atis_city                                     17
atis_distance                                 16
atis_flight                                 2567
atis_flight#atis_airfare                      11
atis_flight_no                                12
atis_flight_time                              40
atis_ground_fare                              14
atis_ground_service                          177
atis_ground_service#atis_ground_fare           1
atis_meal                                      6
atis_quantity                                 37
atis_restricti
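
To put the imbalance in numbers, the sketch below takes the unique-message counts that are visible in the output above and computes the share held by atis_flight. The listing is cut off, so the total covers only the classes shown; the result is an approximation, not an exact figure for the whole training set.

```python
# Unique-message counts copied from the visible part of the output above
visible_counts = {
    "atis_abbreviation": 56,
    "atis_aircraft": 67,
    "atis_aircraft#atis_flight#atis_flight_no": 1,
    "atis_airfare": 321,
    "atis_airline": 109,
    "atis_airline#atis_flight_no": 2,
    "atis_airport": 16,
    "atis_capacity": 13,
    "atis_cheapest": 1,
    "atis_city": 17,
    "atis_distance": 16,
    "atis_flight": 2567,
    "atis_flight#atis_airfare": 11,
    "atis_flight_no": 12,
    "atis_flight_time": 40,
    "atis_ground_fare": 14,
    "atis_ground_service": 177,
    "atis_ground_service#atis_ground_fare": 1,
    "atis_meal": 6,
    "atis_quantity": 37,
}

total_visible = sum(visible_counts.values())
flight_share = visible_counts["atis_flight"] / total_visible
print(f"atis_flight share among visible classes: {flight_share:.1%}")
```

Among the classes shown, atis_flight accounts for roughly three quarters of the unique messages.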

### Preparing training data
* First, we clean the text. Stop words are usually removed before processing text. However, according to a post on fast.ai and the experimentation below, keeping stop words helps the model generalize better; some redundant information is needed for generalization.
* Second, we extract features using TfidfVectorizer (TF-IDF is a numerical statistic that describes how important a word is to a document in a corpus). We use n-grams to include combinations of words.
* This approach is based only on the words used in the context, not on the semantics of the sentence.
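
As a rough illustration of what an n-gram range of (1, 2) adds, the sketch below lists the unigrams and bigrams of one converted message. It splits on whitespace for simplicity; scikit-learn's vectorizer applies its own tokenization, so the actual feature names may differ. `word_ngrams` is a hypothetical helper, not part of any library.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of text with min_n <= n <= max_n."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


features = word_ngrams("fly from B-fromloc.city_name to B-toloc.city_name")
print(features)
```

Bigrams such as `from b-fromloc.city_name` keep a little word order that single words alone would lose.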

In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download("stopwords")

## Clean Data
stops = set(stopwords.words("english"))
def cleanData(text, lowercase = False, remove_stops = False, stemming = False):
    txt = str(text)
    #txt = re.sub(r'[^A-Za-z\s]',r'',txt)
    txt = re.sub(r'\n',r' ',txt)
    
    if lowercase:
        txt = " ".join([w.lower() for w in txt.split()])
        
    if remove_stops:
        txt = " ".join([w for w in txt.split() if w not in stops])
    
    if stemming:
        st = PorterStemmer()
        txt = " ".join([st.stem(w) for w in txt.split()])

    return txt

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/myworld/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
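
To see what the `remove_stops` option would discard, the sketch below filters one converted message against a small hand-picked stop list. The list is an assumption made for illustration; it is not the NLTK list downloaded above.

```python
# Hypothetical mini stop list, chosen by hand for this illustration
toy_stops = {"i", "to", "from", "the"}

message = "i want to fly from B-fromloc.city_name to B-toloc.city_name"

# Split the message into tokens that survive the filter and tokens that do not
kept = [w for w in message.split() if w not in toy_stops]
dropped = [w for w in message.split() if w in toy_stops]

print(" ".join(kept))
print(dropped)
```

Words like `from` and `to` mark the direction of travel, which is one plausible reason why keeping them can help.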


In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer,HashingVectorizer

## Clean data 
trainClean = train['message'].map(lambda x: cleanData(x, lowercase=True,remove_stops=False, stemming=True))
testClean = test['message'].map(lambda x: cleanData(x, lowercase=True,remove_stops=False, stemming=True))

# Feature extraction
vectorizer = TfidfVectorizer(analyzer='word', min_df=0.0, max_df=1.0,max_features=1024, ngram_range=(1,2))
vec = vectorizer.fit(trainClean)

X_train = vec.transform(trainClean)
X_test = vec.transform(testClean)

In [9]:
# Optional - to save tfidf matrix to csv
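# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later
result =  pd.DataFrame(X_train.todense(), columns = vec.get_feature_names_out())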
result.to_csv("tfidf.csv",index = False)

### Training model

* Though we know the limitations of our data, let's see how classifiers perform on this dataset.
* We roughly test a few classifiers: KNeighborsClassifier, GaussianNB and SVM.

In [10]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

neigh = KNeighborsClassifier(n_neighbors=3, weights="distance", p=2)
neigh_train = neigh.fit(X_train, train["label"]) 
y_pred = neigh_train.predict(X_test)

print("Multi-class accuracy:",accuracy_score(test["label"], y_pred),"\n")
print(classification_report(test["label"], y_pred))

Multi-class accuracy: 0.9025755879059351 

                             precision    recall  f1-score   support

          atis_abbreviation       1.00      1.00      1.00        33
              atis_aircraft       0.86      0.67      0.75         9
               atis_airfare       0.89      0.69      0.78        48
   atis_airfare#atis_flight       0.00      0.00      0.00         1
               atis_airline       1.00      0.89      0.94        38
               atis_airport       1.00      0.44      0.62        18
              atis_capacity       0.95      1.00      0.98        21
                  atis_city       0.20      0.17      0.18         6
              atis_day_name       0.00      0.00      0.00         2
              atis_distance       0.70      0.70      0.70        10
                atis_flight       0.91      0.98      0.95       632
   atis_flight#atis_airfare       0.00      0.00      0.00        12
   atis_flight#atis_airline       0.00      0.00      0.00 

  'precision', 'predicted', average, warn_for)


In [11]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train.toarray(),train["label"])
y_pred = clf.predict(X_test.toarray())

print("Multi-class accuracy:",accuracy_score(test["label"], y_pred),"\n")
print(classification_report(test["label"], y_pred))

Multi-class accuracy: 0.6606942889137738 

                             precision    recall  f1-score   support

          atis_abbreviation       0.97      0.91      0.94        33
              atis_aircraft       0.30      0.33      0.32         9
               atis_airfare       0.32      0.67      0.43        48
   atis_airfare#atis_flight       0.00      0.00      0.00         1
               atis_airline       0.65      0.82      0.72        38
atis_airline#atis_flight_no       0.00      0.00      0.00         0
               atis_airport       0.78      0.39      0.52        18
              atis_capacity       0.85      0.81      0.83        21
                  atis_city       0.00      0.00      0.00         6
              atis_day_name       0.00      0.00      0.00         2
              atis_distance       1.00      0.80      0.89        10
                atis_flight       0.89      0.66      0.76       632
   atis_flight#atis_airfare       0.00      0.00      0.00 

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [12]:
from sklearn.svm import SVC
clf = SVC(kernel="linear", C=10)

clf.fit(X_train.toarray(),train["label"])
y_pred = clf.predict(X_test.toarray())

print("Multi-class accuracy:",accuracy_score(test["label"], y_pred),"\n")
print(classification_report(test["label"], y_pred))

Multi-class accuracy: 0.9574468085106383 

                             precision    recall  f1-score   support

          atis_abbreviation       1.00      1.00      1.00        33
              atis_aircraft       0.73      0.89      0.80         9
               atis_airfare       0.89      1.00      0.94        48
   atis_airfare#atis_flight       0.00      0.00      0.00         1
               atis_airline       1.00      0.97      0.99        38
               atis_airport       1.00      0.78      0.88        18
              atis_capacity       0.95      0.95      0.95        21
                  atis_city       1.00      0.50      0.67         6
              atis_day_name       0.00      0.00      0.00         2
              atis_distance       1.00      0.90      0.95        10
                atis_flight       0.97      0.99      0.98       632
   atis_flight#atis_airfare       0.80      0.33      0.47        12
   atis_flight#atis_airline       0.00      0.00      0.00 

  'precision', 'predicted', average, warn_for)


### Cross-validation
* As we saw, the support vector machine works best among the classifiers tried. Let's check it with k-fold cross-validation to test for overfitting.


In [13]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=4)

clf = SVC(kernel="linear", C=10)

for train_index, test_index in skf.split(X_train, train["label"]):
    X_train_k, X_test_k = X_train[train_index], X_train[test_index]
    y_train_k, y_test_k = train["label"].iloc[train_index], train["label"].iloc[test_index]
       
    clf.fit(X_train_k,y_train_k)
    y_pred = clf.predict(X_test_k)
    print("Multi-class accuracy:",accuracy_score(y_test_k, y_pred),"\n")




Multi-class accuracy: 0.9708222811671088 

Multi-class accuracy: 0.9741302408563782 

Multi-class accuracy: 0.9802690582959641 

Multi-class accuracy: 0.9846984698469847 
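
The four fold accuracies printed above can be summarized with the standard library. This is a small sketch over the reported numbers only; it does not rerun the model.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the output above
fold_accuracies = [
    0.9708222811671088,
    0.9741302408563782,
    0.9802690582959641,
    0.9846984698469847,
]

avg = mean(fold_accuracies)
spread = pstdev(fold_accuracies)  # population standard deviation across folds
print(f"mean accuracy: {avg:.4f}, std: {spread:.4f}")
```

The mean is about 0.977 and the standard deviation is well under one percentage point.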



### Analysis
* The k-fold results are impressive: accuracy is high and varies little across folds, so the trained model shows no large bias or variance.
* If we analyse the failure cases, the f1-scores are relatively low for the classes atis_flight#atis_airfare and atis_quantity.
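
Per-class f1-scores matter here because overall accuracy is dominated by the majority class. The sketch below uses made-up toy labels (not the ATIS results) to show how a classifier that ignores a minority class can still score high accuracy while its macro-averaged F1 drops. `f1_for_class` is a hypothetical helper written for this example.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score of a single class, treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: the classifier predicts the majority class for every sample
y_true = ["atis_flight"] * 8 + ["atis_airfare"] * 2
y_pred = ["atis_flight"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
classes = sorted(set(y_true))
macro_f1 = sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
print(f"accuracy: {accuracy:.2f}, macro F1: {macro_f1:.2f}")
```

In this toy case accuracy is 0.80 while macro F1 is far lower, because the minority class contributes an F1 of zero.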

#### What more to do?
* Use GloVe vectors instead of the tf-idf vectorizer.
* Try semantics-based methods for intent detection.
* Survey more of the literature to find a more appropriate approach.