
In [None]:
# Intent Classification using TF-IDF with the ATIS Dataset. 
#
# Uses Libraries:  sklearn
# Runtime:  Google CoLab (cpu)
#
# Owner:  Lorrie Tomek
# 
# Data: 
# The data for ATIS intent classification was downloaded from Kaggle to
# my personal Google Drive for this course.  The URLs for downloading are
# given in the notebook code below.
#
# This data should not be shared outside of this course.  
#

# What is the ATIS Dataset? 

The Airline Travel Information System (ATIS) dataset is often used as a benchmark dataset for intent classification in natural language processing. The dataset consists of a collection of spoken and written queries from users looking to book a flight, along with their corresponding intent labels. The intent labels in the dataset include information such as flight departure and arrival times, flight availability, and pricing. The ATIS dataset is commonly used to train and evaluate models for natural language understanding and intent classification.

The ATIS (Airline Travel Information System) dataset is often distributed in a tab-separated format (.tsv) and it typically includes the following columns:

- Sentence #: a unique identifier for each sentence in the dataset.
- Utterance: the spoken or written text input from the user, which represents their request or query related to airline travel information such as flight departure and arrival times, flight availability, and pricing.
- Intent: the corresponding label that indicates the intent behind the utterance. The intents in the ATIS dataset cover query types such as flights (atis_flight), airfares (atis_airfare), flight times (atis_flight_time), and ground services (atis_ground_service).

# Where can we find the ATIS Dataset?

The ATIS dataset can be downloaded from the official website of the dataset, which is the Linguistic Data Consortium (LDC) website.  However, this would require creating an account on that website to allow a download of the dataset.  

The dataset (or a subset of it) can often be found on GitHub.  If you search for "atis dataset" on GitHub you will see several links.  Here's one I found. 

There is an intent/utterance dataset here: 
https://raw.githubusercontent.com/nawaz-kmr/Airline-Travel-Information-System-ATIS-Text-Analysis/nawaz-kmr/atis_intents.csv

It used to be available at https://github.com/yvchen/JointSLU, but I did not find the csv there any more.  (I did find .iob files, which are used for Entity Recognition training.)

# Tutorial on Intent Classification using TF-IDF Features with the ATIS Dataset

1. Start by installing the necessary libraries, such as scikit-learn, pandas, and NLTK.

2. Download the ATIS dataset from a reliable source. It used to be available at https://github.com/yvchen/JointSLU; this notebook uses a copy originally downloaded from Kaggle.

3. Load the dataset into a pandas DataFrame. The dataset should contain two columns: "intent" and "utterance". The "utterance" column contains the conversation utterances, and the "intent" column contains the annotated intent labels.

4. Pre-process the text data. This may involve steps such as lowercasing, tokenization, and removing stop words.

5. Split the dataset into training and test sets.

6. Calculate the TF-IDF scores for each word in the training set.

7. Use the TF-IDF scores as features to train a machine learning model, such as a support vector machine (SVM) or a random forest classifier.

8. Evaluate the performance of the trained model on the test set.

9. Use the trained model to predict the intent of new, unseen text data.
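
Before walking through the real data, the vectorize, train, and predict steps can be condensed into a minimal sketch on toy data. The utterances and the two intent names below are made up for illustration (they are not ATIS rows), and the sketch uses scikit-learn's built-in English stop word list instead of NLTK's, so it should be read as an outline of the workflow rather than the notebook's exact method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy utterances and intents (hypothetical, not taken from ATIS)
utterances = [
    "show me flights from boston to denver",
    "list flights from dallas to atlanta",
    "what flights leave seattle tomorrow morning",
    "how much is the fare from boston to denver",
    "cheapest fare from dallas to atlanta",
    "what is the round trip fare to seattle",
]
intents = ["flight", "flight", "flight", "airfare", "airfare", "airfare"]

# TfidfVectorizer lowercases and tokenizes by default; stop_words="english"
# selects scikit-learn's own stop word list (an assumption, not NLTK's list)
model = make_pipeline(TfidfVectorizer(stop_words="english"), SVC(kernel="linear"))
model.fit(utterances, intents)

# Predict the intent of a new utterance
prediction = model.predict(["fare from seattle to boston"])[0]
print(prediction)
```

Bundling the vectorizer and the classifier in one pipeline means new text passes through the same vectorizer automatically at prediction time. The train/test split and evaluation steps are left out here to keep the sketch short.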

In [2]:
# Import and/or load needed libraries
import pandas as pd
import sklearn
from pprint import pprint
import requests
from io import StringIO

The above code imports several libraries that will be used in the program:

1. "pandas" is a library used for data analysis and manipulation. It is imported and aliased as "pd".

2. "sklearn" (scikit-learn) is a library for machine learning in Python. It is imported without an alias.

3. "pprint" is a function for pretty-printing data structures. It is imported from the standard-library "pprint" module.

4. "requests" is a library for sending HTTP requests. It is imported as a module, without an alias.

5. "StringIO" is a class for working with string data as if it were a file or stream. It is imported from the standard-library "io" module.

As the data is no longer in JointSLU, it was originally downloaded from Kaggle.  It should not be shared outside of this course, but can be downloaded from Kaggle by signing up.

In [3]:
# atis_intent.csv
file_id = '1QU3YH0hfbC4Swq5Rs-zIWfHANJjOVJET'
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
response = requests.get(dwn_url)
response.raise_for_status()  # stop here if the download failed
csv_raw = StringIO(response.text)

This code downloads a file from Google Drive and wraps its contents in a file-like object:

1. The file to be downloaded is identified by its Google Drive file ID, stored in the variable "file_id".

2. The download URL is constructed by concatenating the Google Drive URL with the file ID, stored in the variable "dwn_url".

3. The "requests" library is used to send a GET request to the download URL, and the response is stored in the variable "response". Calling "raise_for_status()" raises an error if the request did not succeed.

4. The response text (the contents of the file as a string) is wrapped in a "StringIO" object, "csv_raw". The "StringIO" object allows the string to be treated as a file-like object, so it can be used in functions that normally work with files.
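
To see what "file-like" means in practice, here is a small sketch that needs no network access. The CSV text is made up to stand in for the body of an HTTP response, and the standard-library csv module is used as the reader purely for illustration.

```python
import csv
from io import StringIO

# Made-up CSV text standing in for the downloaded file contents
csv_text = (
    "atis_flight,show me flights to denver\n"
    "atis_airfare,cheapest fare to boston\n"
)

# Wrap the string so that it can be read like an open file
buffer = StringIO(csv_text)

# csv.reader expects an iterable of lines, which a StringIO object provides
rows = list(csv.reader(buffer))
print(rows)
```

Any function that accepts an open text file, including pandas' read_csv, can be handed a StringIO object in the same way.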

In [4]:
# Load the ATIS intents dataset into a pandas DataFrame
# The column names are not in the first row, so we can set them on read_csv using names
df = pd.read_csv(csv_raw, names=['intent', 'utterance'])

This code loads the contents of the file, stored as a "StringIO" object in the variable "csv_raw", into a pandas DataFrame:

1. The "read_csv" function from the "pandas" library is used to read the contents of the "csv_raw" object and store the resulting DataFrame in the variable "df".

2. The "names" parameter is set to a list of column names to be used for the DataFrame. In this case, the columns are named "intent" and "utterance".

3. This allows the data to be stored in a structured format, making it easier to manipulate and analyze.
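
The effect of the "names" parameter is easiest to see by reading the same header-less text twice. The two-row CSV below is made up for illustration; only the column names match the notebook.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text with no header row
csv_text = (
    "atis_flight,show me flights to denver\n"
    "atis_airfare,cheapest fare to boston\n"
)

# Default behavior: the first row is consumed as the header
without_names = pd.read_csv(StringIO(csv_text))

# With names=: every row is kept as data and the columns get the given names
with_names = pd.read_csv(StringIO(csv_text), names=["intent", "utterance"])

print(len(without_names), list(without_names.columns))
print(len(with_names), list(with_names.columns))
```

Without "names", one utterance would silently disappear into the column labels, which is why the parameter matters for a file like this one.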





In [17]:
# Print the column names
print(f"columns: {df.columns}")

# Print the first 5 rows of the DataFrame
print(df.head())

# Print size of the dataframe
print(df.shape)

columns: Index(['intent', 'utterance', 'utterance_text'], dtype='object')
             intent                                          utterance  \
0       atis_flight   i want to fly from boston at 838 am and arriv...   
1       atis_flight   what flights are available from pittsburgh to...   
2  atis_flight_time   what is the arrival time in san francisco for...   
3      atis_airfare            cheapest airfare from tacoma to orlando   
4      atis_airfare   round trip fares from pittsburgh to philadelp...   

                                      utterance_text  
0     want fly boston 838 arrive denver 1110 morning  
1  flights available pittsburgh baltimore thursda...  
2  arrival time san francisco 755 flight leaving ...  
3                    cheapest airfare tacoma orlando  
4  round trip fares pittsburgh philadelphia 1000 ...  
(4978, 3)


This code prints some information about the "df" DataFrame:

1. The first line uses an f-string to print the column names of the DataFrame, which are stored in the "columns" attribute of "df".

2. The second line uses the "head" method of the DataFrame to print the first 5 rows of the DataFrame.

3. The third line prints the dimensions of the DataFrame using the "shape" attribute of "df". The "shape" attribute returns a tuple of the number of rows and columns in the DataFrame.

Note that this cell's execution count (17) is higher than that of the preprocessing cell further down (10), so the output above was captured after preprocessing had already run. That is why it shows a third column, "utterance_text"; immediately after loading, the DataFrame has only the "intent" and "utterance" columns.

**These lines provide an overview of the contents and structure of the DataFrame, helping to ensure that it was loaded correctly and to get a feel for the data.**

## Understanding the Data

In [5]:
# What are all possible unique intent strings?
set(df['intent'].tolist())

{'atis_abbreviation',
 'atis_aircraft',
 'atis_aircraft#atis_flight#atis_flight_no',
 'atis_airfare',
 'atis_airfare#atis_flight_time',
 'atis_airline',
 'atis_airline#atis_flight_no',
 'atis_airport',
 'atis_capacity',
 'atis_cheapest',
 'atis_city',
 'atis_distance',
 'atis_flight',
 'atis_flight#atis_airfare',
 'atis_flight_no',
 'atis_flight_time',
 'atis_ground_fare',
 'atis_ground_service',
 'atis_ground_service#atis_ground_fare',
 'atis_meal',
 'atis_quantity',
 'atis_restriction'}


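This code converts the 'intent' column of the dataframe (df) to a Python list and then to a set, which discards duplicates. The result is the collection of all unique intent strings present in the dataframe. Labels containing '#', such as 'atis_flight#atis_airfare', mark utterances annotated with more than one intent.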
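
Besides listing the unique labels, it is often useful to count how many utterances carry each one. The sketch below uses a tiny made-up DataFrame (the label names come from the list above, the counts do not), so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Tiny made-up sample; the label strings follow the ATIS naming shown above
toy = pd.DataFrame({
    "intent": [
        "atis_flight",
        "atis_flight",
        "atis_airfare",
        "atis_flight#atis_airfare",
        "atis_flight",
    ],
})

# Unique labels, sorted for readability
unique_intents = sorted(toy["intent"].unique())

# Number of rows per label, most frequent first
counts = toy["intent"].value_counts()

# Compound labels join several intents with '#'
is_compound = toy["intent"].str.contains("#", regex=False)

print(unique_intents)
print(counts)
print(int(is_compound.sum()), "compound label(s)")
```

Running `df["intent"].value_counts()` on the real DataFrame would show how unevenly the intents are distributed, which is worth knowing before interpreting an accuracy score.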

# Preprocessing Utterance Text

In [8]:
import nltk
from nltk.corpus import stopwords

# downloads needed from nltk
nltk.download('punkt') 
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

This code imports the NLTK library and the stopwords corpus from NLTK. It then downloads the punkt and stopwords packages from the NLTK library, which are needed for the code that follows. 

This code is helpful for preprocessing because it will allow for the removal of stopwords from a text. Stopwords are common words that are usually filtered out before natural language processing, as they don't carry much meaning and can interfere with obtaining meaningful results from text analysis.

In [6]:
df[:10] # looking at our 2 column dataframe

|   | intent | utterance |
|---|---|---|
| 0 | atis_flight | i want to fly from boston at 838 am and arriv... |
| 1 | atis_flight | what flights are available from pittsburgh to... |
| 2 | atis_flight_time | what is the arrival time in san francisco for... |
| 3 | atis_airfare | cheapest airfare from tacoma to orlando |
| 4 | atis_airfare | round trip fares from pittsburgh to philadelp... |
| 5 | atis_flight | i need a flight tomorrow from columbus to min... |
| 6 | atis_aircraft | what kind of aircraft is used on a flight fro... |
| 7 | atis_flight | show me the flights from pittsburgh to los an... |
| 8 | atis_flight | all flights from boston to washington |
| 9 | atis_ground_service | what kind of ground transportation is availab... |


In [10]:
# Preprocess the Text Data by lowercasing, tokenization, and removing stopwords
# We will name the column utterance_text after doing the preprocessing

# Lowercase the utterance and store the result in a new utterance_text column
df["utterance_text"] = df["utterance"].str.lower()

# Tokenize the text (each utterance becomes a list of tokens)
df["utterance_text"] = df["utterance_text"].apply(nltk.word_tokenize)

# Remove stop words
stop_words = set(stopwords.words("english"))
df["utterance_text"] = df["utterance_text"].apply(lambda x: [word for word in x if word not in stop_words])
# Join the remaining tokens back into a single space-separated string
df["utterance_text"] = df["utterance_text"].apply(lambda x: ' '.join(x))

**This code preprocesses the text data in three steps:**


1. It lowercases the text by using the str.lower() method on the "utterance" column and stores the result in a new column called "utterance_text". 

2. It then tokenizes the text by applying NLTK's word_tokenize function to the "utterance_text" column. 

3. It removes stop words by building a Python set from NLTK's English stop word list, keeping only the tokens that are not in that set, and then joining the remaining tokens back together into a single string. All of these steps help make the text data more manageable and easier to analyze.
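
The three steps can also be packaged as one reusable function. The sketch below is a stand-in that avoids the NLTK downloads: it assumes a simple regular-expression tokenizer in place of nltk.word_tokenize and scikit-learn's ENGLISH_STOP_WORDS in place of NLTK's list, so its output can differ from the notebook's on some inputs. The function name preprocess is hypothetical.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def preprocess(utterance, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase, tokenize, drop stop words, and re-join into one string."""
    lowered = utterance.lower()
    # Simple stand-in for nltk.word_tokenize: runs of word characters,
    # or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", lowered)
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)


cleaned = preprocess("Cheapest airfare from Tacoma to Orlando")
print(cleaned)  # cheapest airfare tacoma orlando
```

Keeping the preprocessing in a single function makes it easier to apply exactly the same steps to new text at prediction time.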

In [12]:
df[:5] # looking at our new columns and the cleaned "utterance_text" column

|   | intent | utterance | utterance_text |
|---|---|---|---|
| 0 | atis_flight | i want to fly from boston at 838 am and arriv... | want fly boston 838 arrive denver 1110 morning |
| 1 | atis_flight | what flights are available from pittsburgh to... | flights available pittsburgh baltimore thursda... |
| 2 | atis_flight_time | what is the arrival time in san francisco for... | arrival time san francisco 755 flight leaving ... |
| 3 | atis_airfare | cheapest airfare from tacoma to orlando | cheapest airfare tacoma orlando |
| 4 | atis_airfare | round trip fares from pittsburgh to philadelp... | round trip fares pittsburgh philadelphia 1000 ... |


## Split Data into Train and Test

In [13]:
# Split the Dataset into Training and Testing
from sklearn.model_selection import train_test_split

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["utterance_text"], df["intent"], test_size=0.2, random_state=42)

This code uses the scikit-learn train_test_split() function to split the dataset into training and test sets:

1. The variables X_train and X_test hold the training and test data for the independent variable (the preprocessed utterances), while y_train and y_test hold the training and test data for the dependent variable (the intent labels).

2. The test_size argument is set to 0.2, indicating that 20% of the data will be used for testing and 80% for training. The split is random; setting the random_state argument to 42 fixes the seed of the random number generator, so the same split is produced every time the cell is run.
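
A short sketch on made-up data shows what random_state buys: two calls with the same seed return the same split. It also shows the optional stratify argument, which the notebook does not use but which keeps class proportions similar in both parts. The texts and labels here are placeholders.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 10 utterances, 6 of one intent and 4 of another
texts = [f"utterance {i}" for i in range(10)]
labels = ["flight"] * 6 + ["airfare"] * 4

# Same random_state, same split
split_a = train_test_split(texts, labels, test_size=0.2, random_state=42)
split_b = train_test_split(texts, labels, test_size=0.2, random_state=42)
same_split = split_a[1] == split_b[1]  # compare the two test sets

# stratify=labels keeps the 6:4 class ratio in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

print(same_split)
print(sorted(y_te))
```

Stratifying only works when every class has at least two examples, so rare or compound intents may need to be handled separately before it can be applied to the full dataset.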

In [14]:
# Let's take a look at the results of the split
print(f"X_train.shape = {X_train.shape}")
print(f"X_test.shape = {X_test.shape}")
print(f"y_train.shape = {y_train.shape}")
print(f"y_test.shape = {y_test.shape}")

X_train.shape = (3982,)
X_test.shape = (996,)
y_train.shape = (3982,)
y_test.shape = (996,)


In [15]:
print(X_train[:5]) # looking at the training set is a good way to get a feel for the data the model learns from

# here we take a look at the first utterances in the training data.

3414    american airlines flight denver san francisco ...
1499    twa flight leaving early morning san francisco...
4003    show flights atlanta washington dc wednesday n...
3866    types aircraft get first class ticket philadel...
422                                        many seats 100
Name: utterance_text, dtype: object


In [16]:
print(y_train[:5])  # and here are the intents!

3414      atis_flight
1499      atis_flight
4003      atis_flight
3866    atis_aircraft
422     atis_capacity
Name: intent, dtype: object


## Train TF-IDF Vectorizer

The code below calculates the TF-IDF score for each word in the training set. 

1. It starts by importing the TfidfVectorizer from the scikit-learn library. 

2. It then initializes the TfidfVectorizer and calculates the TF-IDF scores for the training set. 

3. Finally, it prints the shape of the resulting TF-IDF matrix: one row per training utterance and one column per vocabulary term.

In [18]:
# Calculate the TF-IDF score for each word in the training set
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Calculate the TF-IDF scores for the training set
X_train_tfidf = vectorizer.fit_transform(X_train)

# Print the shape of the TF-IDF matrix: (utterances, vocabulary terms)
print(X_train_tfidf.shape)

(3982, 724)
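
To make the scores less of a black box, the sketch below recomputes them by hand for three made-up documents and compares the result with TfidfVectorizer. It assumes the vectorizer's default settings: raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up documents
docs = ["flight boston denver", "flight fare", "fare boston boston"]

# Term frequency: raw counts of each vocabulary term per document
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (TfidfVectorizer's default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, from_sklearn))
```

Terms that appear in few documents receive a larger idf and therefore a larger weight, which is what lets distinctive words stand out from common ones.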


In [19]:
# Let's look at the TF-IDF vectors
# X_train_tfidf is a two-dimensional matrix of floating point numbers
# Each row represents the TF-IDF of one utterance
# It doesn't look like a matrix, because most of the matrix elements are zero, and it is stored in a sparse memory representation
print(X_train_tfidf[:1])

  (0, 459)	0.38911248129023623
  (0, 281)	0.5061299392133041
  (0, 328)	0.28014363567453604
  (0, 562)	0.26574624881946984
  (0, 253)	0.26355606460712255
  (0, 317)	0.2503194208876888
  (0, 138)	0.3493304655633338
  (0, 147)	0.4349356852801214


## Train Machine Learning Model

This code is creating and fitting an SVM classifier using TF-IDF scores as features:

1.  The SVM classifier is initialized using the "linear" kernel.

2. Then it is fit to the training data, which consists of the TF-IDF scores (X_train_tfidf) and the training labels (y_train).

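**SVM classifiers are a popular choice for text classification because they are effective in high-dimensional, sparse feature spaces such as TF-IDF vectors.** With a non-linear kernel (for example RBF) they can learn non-linear decision boundaries; the linear kernel used here learns linear ones, which is often sufficient for text. Maximizing the margin, together with the regularization parameter C, **helps limit overfitting**, although it does not rule it out. SVMs also have a regression variant, available in scikit-learn as SVR.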

In [20]:
# Use the TF-IDF scores as features to train a machine learning model selecting a classifier
# Here we are using a support vector machine (SVM) but we could use other types of classifiers (e.g., random forest classifier)
from sklearn.svm import SVC

# Initialize the SVM classifier
clf = SVC(kernel="linear")

# Fit the classifier to the training data
clf.fit(X_train_tfidf, y_train)


SVC(kernel='linear')
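
Because every scikit-learn classifier exposes the same fit and predict interface, swapping in another model (such as the random forest mentioned in the comment above) takes only a line or two. The sketch below trains both on made-up utterances; the hyperparameters are arbitrary assumptions, and the score is computed on the training data itself, so it is not a meaningful evaluation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up training data with three hypothetical intents
texts = [
    "flights boston denver",
    "flights dallas atlanta",
    "cheapest fare boston",
    "round trip fare atlanta",
    "ground transportation denver",
    "ground transportation dallas airport",
]
labels = ["flight", "flight", "airfare", "airfare", "ground", "ground"]

X = TfidfVectorizer().fit_transform(texts)

# Candidate models; both accept the sparse TF-IDF matrix directly
candidates = {
    "svm_linear": SVC(kernel="linear"),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

train_accuracy = {}
for name, model in candidates.items():
    model.fit(X, labels)
    train_accuracy[name] = model.score(X, labels)  # accuracy on training data

print(train_accuracy)
```

A fair comparison would score each candidate on held-out data, for example with the test split created earlier.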

## Evaluate the Performance of a Trained Model

After training, we can evaluate the performance of the trained model on a held-out test set, that is, data that was not used for training.

Various evaluation metrics can be used such as accuracy, precision, recall, and F1 score.

In [42]:
# Calculate the TF-IDF scores for the test set
X_test_tfidf = vectorizer.transform(X_test)

# Predict the labels for the test set
y_pred = clf.predict(X_test_tfidf)

# Evaluate the performance of the model
accuracy = sum(y_pred == y_test) / len(y_test)
print(f"Accuracy on the test set: {accuracy:.2f}")

# This score is very good!

Accuracy on the test set: 0.95
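
Accuracy alone can hide weak performance on rare intents, so it is worth looking at the other metrics mentioned above as well. The sketch below uses made-up true and predicted labels (variable names y_true and y_hat are placeholders) to show the relevant functions from sklearn.metrics.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Made-up labels: the rare "meal" intent is never predicted correctly
y_true = ["flight", "flight", "flight", "airfare", "airfare", "meal"]
y_hat = ["flight", "flight", "airfare", "airfare", "airfare", "flight"]

# Fraction of predictions that match the true label
acc = accuracy_score(y_true, y_hat)

# Macro F1 averages the per-class F1 scores, so every class counts equally;
# zero_division=0 scores a class with no predictions as 0 instead of warning
macro_f1 = f1_score(y_true, y_hat, average="macro", zero_division=0)

print(f"accuracy = {acc:.2f}, macro F1 = {macro_f1:.2f}")

# Per-class precision, recall, and F1 in one table
print(classification_report(y_true, y_hat, zero_division=0))
```

On the notebook's data, the same report could be produced with `classification_report(y_test, y_pred, zero_division=0)`.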


# Let's Use this Model to Predict Intents on Custom Data

This code uses a trained machine learning model to predict the intent of a new piece of text. 

1. The new text is preprocessed in the same way as the training data, including lowercasing and tokenizing the text, removing stop words, and joining the remaining tokens back into a single string.

2. The TF-IDF scores for the new text are calculated using the vectorizer, and the clf model is used to make a prediction on the intent of the text based on the TF-IDF scores.

3. The prediction is printed to the console with the statement print(prediction). **Note that the prediction output will be a label or class assigned to the new text based on the intent classification model.**

In [35]:
# Use the model to predict intents on custom data

# Pre-process the text we are evaluating the same way we 
# preprocessed the training data

# Define a new piece of text
new_text = "tickets?"

# Lower case
new_text = new_text.lower()
print(f"STEP 1: {new_text}") #passing it through the .lower() method

new_text_tokens = nltk.word_tokenize(new_text) # tokenizing the text
print(f"STEP 2: {new_text_tokens}")

new_text_tokens = [word for word in new_text_tokens if word not in stop_words] # keep only the tokens that are not in the stop word list
print(f"STEP 3: {new_text_tokens}")
new_text = ' '.join(new_text_tokens) # join with spaces so that separate words stay separate

# Calculate the TF-IDF scores for the new text
new_text_tfidf = vectorizer.transform([new_text]) # transform expects a list of strings

# Predict the label for the new text
prediction = clf.predict(new_text_tfidf) # returns an array with one label per input

# Print the prediction
print(prediction)

STEP 1: tickets?
STEP 2: ['tickets', '?']
STEP 3: ['tickets', '?']
['atis_airfare']


In [41]:

# defining a function for the above code

import nltk
from nltk.corpus import stopwords

def predict_intent(text, vectorizer, clf, stop_words=None):
    if stop_words is None:
        stop_words = set(stopwords.words('english'))
    
    # Pre-process the text
    text = text.lower()
    text_tokens = nltk.word_tokenize(text)
    text_tokens = [word for word in text_tokens if word not in stop_words]
    text = ' '.join(text_tokens)
    
    # Calculate the TF-IDF scores for the text
    text_tfidf = vectorizer.transform([text])
    
    # Predict the label for the text
    prediction = clf.predict(text_tfidf)
    
    return prediction[0]



predict_intent("How many bags can I carry?", vectorizer, clf)



'atis_quantity'
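
One caveat when trying custom text: the classifier always returns some label, even when none of the words in the input were seen during training. A quick way to spot this case is to check whether the TF-IDF vector has any non-zero entries. The sketch below uses a small vectorizer fitted on two made-up sentences (named vec to keep it separate from the notebook's vectorizer).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Small vectorizer fitted on two made-up sentences
vec = TfidfVectorizer()
vec.fit(["flights from boston to denver", "cheapest fare to atlanta"])

# All three words are in the fitted vocabulary
known = vec.transform(["fare to denver"])

# None of these words are in the vocabulary, so they are silently ignored
unknown = vec.transform(["can my dog sit beside me"])

# nnz is the number of non-zero entries in the sparse vector
print(known.nnz, unknown.nnz)  # 3 0
```

An all-zero vector carries no information about the utterance, so a prediction made from it reflects the model's default behavior rather than anything in the text.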

# Big Picture
## Why is intent classification important in NLP?

Classifying intents in NLP is important because **it helps in understanding the goal** or purpose of a text message, whether it's a question, request, or statement. This information is crucial for building conversational AI systems, such as chatbots, virtual assistants, and customer service bots, which need to understand the intent behind a user's message in order to respond appropriately.

Intent classification also **helps to improve the accuracy and efficiency** of NLP-based systems by reducing the number of potential interpretations for a given text message, and allowing the system to focus on a smaller set of potential responses. It also enables NLP systems to handle a wide range of user requests, even when the requests are phrased in different ways.

In addition, intent classification can also be **used to improve user experience** by anticipating user needs, providing relevant information proactively, and personalizing the interaction!

In this case, travelers need fast answers to flight inquiries, so a model trained on ATIS that can quickly identify the intent behind each query is especially valuable.

