
# **ASSIGNMENT 4**

**1. Group Description**

Group Number: 53 \\
Member1 Names: Junpeng Wen \\
Member1 Student Numbers: 300249282 \\
Member2 Names: Yongquan Long \\
Member2 Student Numbers: 300249549 \\

**2. Derived Datasets**

This notebook is a starting point for Assignment 4. In this assignment, you will perform a classification empirical study. This notebook will help you to create derived datasets in Section 2 of the assignment.

In [1]:
#let's start by installing spaCy
!pip install spacy



In [2]:
import spacy
import pandas as pd
import numpy as np

You have been given a list of datasets in the assignment description. Choose one of the datasets and provide the link below and read the dataset using pandas. You should provide a link to your own Github repository even if you are using a reduced version of a dataset from your TA's repository.

Our chosen dataset is the reduced version of the Airline Passenger Reviews, which we use to explore text classification of airline customer feedback. The original dataset contains 64,017 samples; the reduced version contains 10,761, which keeps training and analysis manageable for our project.

Within this dataset, each airline passenger review is categorized into one of three distinct classes: promoters, detractors, and passives. This classification provides a clear framework for understanding the varying levels of customer satisfaction and sentiment. Promoters likely represent positive reviews, reflecting satisfied customers who would recommend the airline. Detractors, on the other hand, are expected to offer critical feedback, indicating dissatisfaction. Passives likely fall between these two extremes, offering neither strongly positive nor negative views.

The dataset's design aligns well with the objectives of our assignment in the Introduction to Artificial Intelligence course, where we are tasked with applying supervised machine learning techniques. By analyzing these reviews, we aim to gain insights into customer sentiment, identifying key factors that drive satisfaction and dissatisfaction among airline passengers. This practical application not only enhances our understanding of text classification but also offers potential real-world implications for improving customer experience in the airline industry.

In [3]:
#Load the dataset you chose.
# Make sure the Notebook can load your dataset, just like previous assignments.

url = 'https://raw.githubusercontent.com/JunpengWen/CSI4106/main/reduced_file_AirPassengerReviews.csv'





In [4]:
print(url)
data = pd.read_csv(url)

https://raw.githubusercontent.com/JunpengWen/CSI4106/main/reduced_file_AirPassengerReviews.csv


In [5]:
data.head()

| | customer_review | NPS Score |
|---|---|---|
| 0 | London to Izmir via Istanbul. First time I'd ... | Passive |
| 1 | Istanbul to Bucharest. We make our check in i... | Detractor |
| 2 | Rome to Prishtina via Istanbul. I flew with t... | Detractor |
| 3 | Flew on Turkish Airlines IAD-IST-KHI and retu... | Promoter |
| 4 | Mumbai to Dublin via Istanbul. Never book Tur... | Detractor |


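This is where you create the NLP pipeline. spacy.load() loads the small English model, en_core_web_sm. It does not download the model, so the model must already be installed (for example with: python -m spacy download en_core_web_sm).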

In [6]:
nlp = spacy.load("en_core_web_sm")

Applying the pipeline to each review creates a Doc object in which every word is a Token object.

Doc: https://spacy.io/api/doc

Token: https://spacy.io/api/token

In [7]:
#Apply nlp pipeline to the column that has your sentences.
data['tokenized'] = data['customer_review'].apply(nlp)

In [8]:
data.head()

| | customer_review | NPS Score | tokenized |
|---|---|---|---|
| 0 | London to Izmir via Istanbul. First time I'd ... | Passive | ( , London, to, Izmir, via, Istanbul, ., First... |
| 1 | Istanbul to Bucharest. We make our check in i... | Detractor | ( , Istanbul, to, Bucharest, ., We, make, our,... |
| 2 | Rome to Prishtina via Istanbul. I flew with t... | Detractor | ( , Rome, to, Prishtina, via, Istanbul, ., I, ... |
| 3 | Flew on Turkish Airlines IAD-IST-KHI and retu... | Promoter | ( , Flew, on, Turkish, Airlines, IAD, -, IST, ... |
| 4 | Mumbai to Dublin via Istanbul. Never book Tur... | Detractor | ( , Mumbai, to, Dublin, via, Istanbul, ., Neve... |


A Token object has many attributes such as part-of-speech (pos_), lemma (lemma_), etc. Take a look at the documentation to see all attributes.

The following function is an example of how you can fetch the tokens with a specific part-of-speech tag from a sentence. We return each token's lemma (lemma_) rather than its surface form, so that inflected variants of a word collapse to a single base form.

In the process of creating additional datasets from the Airline Passenger Reviews, we carefully considered various linguistic elements to enrich our analysis. Here's an explanation of our approach for the two derived datasets:

**Derived-Dataset-1: Focus on Adjectives**

In Derived-Dataset-1, we decided to focus on adjectives because adjectives play a crucial role in expressing customers' opinions and evaluations of their airline travel experiences. These descriptive words often carry significant sentiments, whether positive or negative, and provide a deeper understanding of customer feelings.

By analyzing adjectives, we can identify common themes in customer feedback, such as comfort, cleanliness, service quality, and overall satisfaction. This information is invaluable for airlines aiming to understand their service's strengths and weaknesses and make targeted improvements to enhance customer satisfaction.

**Derived-Dataset-2: Focus on Location, Money, and Adjectives**

In Derived-Dataset-2, our choice to include named entities such as location and money, alongside adjectives, was driven by a multi-faceted approach to understanding customer reviews.

Location: The mention of specific locations in reviews can provide insights into where customers are travelling from and to, which routes are more popular or problematic, and any location-specific issues or praises. This can help airlines optimize their routes and address location-specific concerns.

Money: Discussions about money, such as ticket prices, fees, or value for money, are crucial for understanding the financial aspect of customer satisfaction. This aspect can influence a customer's perception of the service quality and their likelihood to recommend the airline.

The combination of these named entities with adjectives offers a more comprehensive view of the customer experience, encompassing both the qualitative aspects of service and the quantitative aspects of pricing and location-related factors.


In [9]:
#create empty dataframes that will store your derived datasets

derived_dataset1 = pd.DataFrame(columns = ['NPS Score', 'pos_ADJ'])
derived_dataset2 = pd.DataFrame(columns = ['NPS Score', 'pos_GPE_LOC_MONEY_ADJ'])

In [10]:
def get_adjectives(sentence):
    adjectives = []
    for token in sentence:
        if token.pos_ == 'ADJ':  # Check for adjectives
            adjectives.append(token.lemma_) # Using lemma_ to get the base form of the word
    return ' '.join(adjectives)

In [11]:
# Apply the function to the 'tokenized' column to extract adjectives
derived_dataset1['pos_ADJ'] = data['tokenized'].apply(get_adjectives)
# Copy the NPS Score labels to the derived dataset
derived_dataset1['NPS Score'] = data['NPS Score']

In [12]:
derived_dataset1.head()

| | NPS Score | pos_ADJ |
|---|---|---|
| 0 | Passive | first good nice great Most contradictory littl... |
| 1 | Detractor | first last |
| 2 | Detractor | several past bad bad normal most useless few w... |
| 3 | Promoter | excellent inflight extensive easy excellent in... |
| 4 | Detractor | turkish other more |


We use spaCy to tokenize each review. Our custom function, get_adjectives, iterates over the tokens, keeps those tagged as adjectives, and returns their lemmas joined into a single string. Applying it to every review fills the 'pos_ADJ' column of derived_dataset1, giving a dataset that contains only the adjectives of each review.
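
The filtering logic of get_adjectives can be tried without loading a spaCy model. The sketch below is only an illustration: FakeToken is a hypothetical stand-in for spaCy's Token that carries just the attributes the function reads, and the part-of-speech tags and lemmas are assigned by hand rather than predicted by a model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token (not part of spaCy): it only
# carries the surface text, the coarse POS tag and the lemma.
FakeToken = namedtuple("FakeToken", ["text", "pos_", "lemma_"])

def get_adjectives(sentence):
    """Return the lemmas of all adjective tokens, joined by spaces."""
    return " ".join(tok.lemma_ for tok in sentence if tok.pos_ == "ADJ")

# Hand-tagged tokens for "The friendliest crew served cold food".
review = [
    FakeToken("The", "DET", "the"),
    FakeToken("friendliest", "ADJ", "friendly"),
    FakeToken("crew", "NOUN", "crew"),
    FakeToken("served", "VERB", "serve"),
    FakeToken("cold", "ADJ", "cold"),
    FakeToken("food", "NOUN", "food"),
]

adjectives = get_adjectives(review)
print(adjectives)  # friendly cold
```

A review without any adjective yields an empty string, which the TF-IDF step later encodes as an all-zero row.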

In [13]:
def extract_entities_and_adjectives(sentence):
    extracted_info = []

    # Extracting named entities
    for ent in sentence.ents:
        if ent.label_ in ['GPE','LOC', 'MONEY']:
            extracted_info.append(ent.text)

    # Extracting adjectives
    for token in sentence:
        if token.pos_ == 'ADJ':
            extracted_info.append(token.lemma_)

    return ' '.join(extracted_info)


In [14]:
# Apply the function to the 'tokenized' column
derived_dataset2['pos_GPE_LOC_MONEY_ADJ'] = data['tokenized'].apply(extract_entities_and_adjectives)
# Copy the NPS Score labels to the derived dataset
derived_dataset2['NPS Score'] = data['NPS Score']


In [15]:
# Display the first few rows of the derived dataset
derived_dataset2.head()


| | NPS Score | pos_GPE_LOC_MONEY_ADJ |
|---|---|---|
| 0 | Passive | London Istanbul Istanbul Ukraine London Istanb... |
| 1 | Detractor | Istanbul first last |
| 2 | Detractor | Rome Prishtina Istanbul Rome Prishtina Istanbu... |
| 3 | Promoter | excellent inflight extensive easy excellent in... |
| 4 | Detractor | Mumbai Dublin Istanbul Dublin Mumbai Mumbai Is... |


The function extract_entities_and_adjectives is central to this process. It first collects the text of every named entity that spaCy labels as a geopolitical entity (GPE), location (LOC), or monetary value (MONEY), and then appends the lemma of every adjective. The function is applied to each tokenized review, and the resulting strings are stored in derived_dataset2.

These approaches to dataset derivation are designed to maximize the relevance and utility of the data for our machine learning models, allowing for more nuanced and actionable insights into airline passenger reviews.
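
To make the ordering of the output concrete, here is a small sketch that runs the same logic on hand-built data. FakeDoc, the entity labels and the tags are hypothetical stand-ins chosen for illustration; in the notebook these values come from the spaCy pipeline.

```python
from types import SimpleNamespace

class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc: iterable tokens plus .ents."""
    def __init__(self, tokens, ents):
        self.tokens = tokens
        self.ents = ents

    def __iter__(self):
        return iter(self.tokens)

def extract_entities_and_adjectives(doc):
    """Entity texts (GPE, LOC, MONEY) first, then adjective lemmas."""
    parts = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC", "MONEY")]
    parts += [tok.lemma_ for tok in doc if tok.pos_ == "ADJ"]
    return " ".join(parts)

# Hand-labelled version of "Rome to Istanbul cost 300 euros, terrible delays".
doc = FakeDoc(
    tokens=[
        SimpleNamespace(pos_="PROPN", lemma_="Rome"),
        SimpleNamespace(pos_="PROPN", lemma_="Istanbul"),
        SimpleNamespace(pos_="VERB", lemma_="cost"),
        SimpleNamespace(pos_="ADJ", lemma_="terrible"),
        SimpleNamespace(pos_="NOUN", lemma_="delay"),
    ],
    ents=[
        SimpleNamespace(text="Rome", label_="GPE"),
        SimpleNamespace(text="Istanbul", label_="GPE"),
        SimpleNamespace(text="300 euros", label_="MONEY"),
        SimpleNamespace(text="Turkish Airlines", label_="ORG"),  # ignored
    ],
)

features = extract_entities_and_adjectives(doc)
print(features)  # Rome Istanbul 300 euros terrible
```

All entities are emitted before any adjective, regardless of where they occur in the review, which is why the rows shown above start with place names. A multi-word entity such as "300 euros" is later split into separate terms by the vectorizer.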

****

**3. Perform Classification Task**

**1) Encode the text as input features with associated values**

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

# One TfidfVectorizer is reused below. Each fit_transform call refits it,
# so every dataset gets its own vocabulary and idf weights.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Original reviews
X_original = tfidf_vectorizer.fit_transform(data['customer_review'])
y_original = data['NPS Score']

# Derived dataset 1: adjectives only
X_derived1 = tfidf_vectorizer.fit_transform(derived_dataset1['pos_ADJ'])
y_derived1 = derived_dataset1['NPS Score']

# Derived dataset 2: GPE/LOC/MONEY entities plus adjectives
X_derived2 = tfidf_vectorizer.fit_transform(derived_dataset2['pos_GPE_LOC_MONEY_ADJ'])
y_derived2 = derived_dataset2['NPS Score']

The TfidfVectorizer from Scikit-learn's feature_extraction.text is used for text data transformation. It converts the airline passenger reviews into a matrix of TF-IDF features. By setting stop_words to 'english', common words are filtered out, reducing feature space dimensionality. The fit_transform method learns vocabulary and idf weightings from the text, creating a matrix where each row represents a review and each column, a unique word. The matrix values are TF-IDF scores, indicating the relative importance of words in each document, making it an effective tool for text-based machine learning tasks.
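
As a rough illustration of what the TF-IDF scores mean, the sketch below computes them by hand in plain Python. It assumes the smoothed idf formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 row normalisation, which to our understanding are TfidfVectorizer's documented defaults. The whitespace tokenizer and the three toy documents are simplifications made up for this example, and no stop words are removed.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2-normalised rows (one dict per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {t: count * idf[t] for t, count in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["good flight good crew", "bad flight", "good food"]
rows = tfidf(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the second document, "bad" receives a higher weight than "flight" because it occurs in fewer documents, which is the effect the idf term is meant to produce.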

**2) Define 2 models using some default parameters**

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler
from sklearn.pipeline import make_pipeline

# Defining a Logistic Regression pipeline: MaxAbsScaler, then LogisticRegression with max_iter raised to 1000
logistic_regression_model = make_pipeline(MaxAbsScaler(), LogisticRegression(max_iter=1000))





The logistic regression model is wrapped in a pipeline: MaxAbsScaler first rescales each TF-IDF feature by its maximum absolute value, which keeps the matrix sparse, and LogisticRegression is then fitted on the scaled features. Apart from max_iter=1000, raised from the default of 100 to give the solver more iterations to converge, the classifier keeps scikit-learn's defaults, including solver='lbfgs'. Although logistic regression is often introduced for binary outcomes, scikit-learn's implementation also handles multi-class targets, so it can be applied directly to our three NPS classes. The full configuration can be inspected with the get_params() method.
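
Because the target here has three classes (Detractor, Passive, Promoter), the logistic function generalises to the softmax, which turns one linear score per class into a probability distribution. The sketch below shows only this last step; the scores are made-up numbers, and we assume, without having checked the installed version, that recent scikit-learn releases use this multinomial formulation with the lbfgs solver.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Detractor", "Passive", "Promoter"]
scores = [2.0, 0.5, -1.0]  # hypothetical values of w.x + b, one per class

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]

for name, p in zip(classes, probs):
    print(f"{name}: {p:.3f}")
print("predicted:", predicted)
```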

In [18]:
from sklearn.neural_network import MLPClassifier

# Defining a Multilayer Perceptron (MLP) with one hidden layer of 50 neurons; other parameters keep their defaults
mlp_model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=200)



The MLPClassifier from scikit-learn defines a Multilayer Perceptron, a type of feed-forward neural network. We set hidden_layer_sizes=(50,), i.e. one hidden layer with 50 neurons instead of the default 100, and state max_iter=200 explicitly, which is also the default maximum number of training iterations. All other parameters keep their defaults, including activation='relu' for the hidden layer and solver='adam' for optimizing the network weights. The get_params() method lists the full configuration, which is useful before customizing and training the model.
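
To give a feel for what such a network computes, here is a minimal forward pass through one hidden layer of 50 ReLU units followed by a softmax over three classes. It is an illustrative sketch only: the weights are random rather than trained, the four input values are invented, and none of the helper names come from scikit-learn.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, w1, b1, w2, b2):
    """Input -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(dense(x, w1, b1))
    return softmax(dense(hidden, w2, b2))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_features, n_hidden, n_classes = 4, 50, 3
w1, b1 = random_matrix(n_hidden, n_features, rng), [0.0] * n_hidden
w2, b2 = random_matrix(n_classes, n_hidden, rng), [0.0] * n_classes

x = [0.2, 0.0, 0.7, 0.1]  # made-up TF-IDF values for one review
class_probs = mlp_forward(x, w1, b1, w2, b2)
print([round(p, 3) for p in class_probs])
```

Training consists of adjusting w1, b1, w2 and b2 so that these probabilities match the labels; that part is what the solver handles and is not shown here.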

**3) Train/test/evaluate the 2 models on our 3 datasets**

First we used KFold cross-validation to evaluate the models. This method splits the dataset into a specified number of 'folds'. In each iteration, one fold is used for testing, and the rest for training. This approach helps in assessing the model's performance across different subsets of data, ensuring a more robust evaluation.
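
The splitting itself can be sketched in a few lines. The function below mimics what we understand to be KFold's behaviour when shuffling is off (its default): folds are contiguous blocks of rows, and the first n_samples % n_splits folds receive one extra sample. The name kfold_indices is ours and is not part of scikit-learn.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(kfold_indices(10, 4))
for train, test in folds:
    print("test:", test, "train size:", len(train))
```

Because the folds follow the row order of the CSV, any ordering in the file carries over into the folds. Passing shuffle=True with a fixed random_state to KFold, or using StratifiedKFold, are options when that is a concern.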

In [19]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.metrics import f1_score

# Define the custom scoring functions for micro and macro averages
def custom_score_micro(y_true, y_pred):
    precision = precision_score(y_true, y_pred, average='micro')
    recall = recall_score(y_true, y_pred, average='micro')
    f1 = f1_score(y_true, y_pred, average='micro')
    return np.mean([precision, recall, f1])

def custom_score_macro(y_true, y_pred):
    precision = precision_score(y_true, y_pred, average='macro')
    recall = recall_score(y_true, y_pred, average='macro')
    f1 = f1_score(y_true, y_pred, average='macro')
    return np.mean([precision, recall, f1])


kf = KFold(n_splits=4)
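
The difference between the two averaging modes is easiest to see on a small example. The sketch below computes micro and macro F1 by hand for ten invented labels; it is not the scoring code used above, and only F1 is shown. One fact worth keeping in mind: for single-label multi-class problems, micro-averaged precision, recall and F1 all equal accuracy, so custom_score_micro effectively reports accuracy.

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, label) for label in labels]
    # Macro: average the per-class F1 scores, every class weighs the same.
    macro = sum(f1_from_counts(*c) for c in counts) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    tp, fp, fn = (sum(column) for column in zip(*counts))
    return f1_from_counts(tp, fp, fn), macro

# Invented labels: the two Passive reviews are both predicted as Detractor.
y_true = ["Detractor"] * 5 + ["Promoter"] * 3 + ["Passive"] * 2
y_pred = ["Detractor"] * 5 + ["Promoter"] * 3 + ["Detractor"] * 2

micro, macro = micro_macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}, accuracy = {accuracy:.3f}")
```

Here the micro score stays at 0.8 while the macro score drops to about 0.611, because the class that is never predicted contributes an F1 of zero. A macro average well below the micro average therefore points to at least one class being classified poorly.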

MLP (Multilayer Perceptron) is a type of neural network model. It's more complex than logistic regression and can capture non-linear relationships in the data. It’s particularly useful when the relationship between features and classes is not straightforward.

In [20]:
# Datasets
datasets = {
    'original': (X_original, y_original),
    'derived1': (X_derived1, y_derived1),
    'derived2': (X_derived2, y_derived2)
}

# Prepare a DataFrame to store results
results = []

# Evaluate each model on each dataset
for dataset_name, (X, y) in datasets.items():
    for model_name, model in {'Logistic Regression': logistic_regression_model, 'MLP': mlp_model}.items():
        scores_micro = cross_val_score(model, X, y, cv=kf, scoring=lambda estimator, X, y: custom_score_micro(y, estimator.predict(X)))
        scores_macro = cross_val_score(model, X, y, cv=kf, scoring=lambda estimator, X, y: custom_score_macro(y, estimator.predict(X)))
        results.append({
            'Model': model_name,
            'Dataset': dataset_name,
            'Micro Average': np.mean(scores_micro),
            'Macro Average': np.mean(scores_macro)
        })

# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Display the results
print(results_df)





                 Model   Dataset  Micro Average  Macro Average
0  Logistic Regression  original       0.762388       0.659122
1                  MLP  original       0.741014       0.645936
2  Logistic Regression  derived1       0.756531       0.637660
3                  MLP  derived1       0.671686       0.580654
4  Logistic Regression  derived2       0.746775       0.636026
5                  MLP  derived2       0.655613       0.569178




The micro-average scores for Logistic Regression on the original dataset were [0.70828688, 0.71858736, 0.76914498, 0.8535316], one value per fold. Their mean, 0.762388, is the figure reported in the table above. On the first fold, for instance, the model achieved a score of 0.70828688.

The MLP's per-fold scores can be read the same way. A run might produce, for example, [0.66555184, 0.63085502, 0.67286245, 0.73568773], where 0.66555184 is the score on the first fold.

Per-fold scores show how performance varies across different parts of our data. A high score, like 0.8535316 for Logistic Regression on the fourth fold of the original dataset, indicates that the model classified that subset well, while the gap to the first fold (0.70828688) shows that the result depends noticeably on the split. In the table above, Logistic Regression scores higher than the MLP on all three datasets for both the micro and the macro average, which suggests that it is the more reliable choice for our classification task.
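
The summary figure in the results table is simply the mean of the per-fold scores. A short check using Python's standard library and the Logistic Regression fold scores quoted above:

```python
from statistics import mean, pstdev

# Per-fold scores for Logistic Regression on the original dataset (from the text).
lr_original_folds = [0.70828688, 0.71858736, 0.76914498, 0.8535316]

fold_mean = mean(lr_original_folds)
fold_spread = pstdev(lr_original_folds)  # population standard deviation
fold_range = max(lr_original_folds) - min(lr_original_folds)

print(f"mean   = {fold_mean:.6f}")   # matches the 0.762388 in the results table
print(f"spread = {fold_spread:.4f}")
print(f"range  = {fold_range:.4f}")
```

Reporting the spread next to the mean makes it easier to judge whether a difference between two models is larger than the fold-to-fold variation.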

**4) Modify some parameters of the MLP model and perform a train/test/evaluate again.**

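Experiment 1: Increasing the hidden layer size to 150 and changing the activation function to 'tanh'. This might help the model capture more complex patterns.

Experiment 2: Raising the initial learning rate to 0.01 (the default is 0.001) and changing the solver to 'sgd'. This could affect how quickly and how reliably the model converges to a solution.

Both experiments also enable early stopping, with 10% of the training data held out for validation, and raise max_iter (to 250 and 300 respectively).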


In [21]:
# Define MLP with larger hidden layer and tanh activation
mlp_model_exp1 = MLPClassifier(hidden_layer_sizes=(150,), activation='tanh', max_iter=250, early_stopping=True, validation_fraction=0.1)

# Evaluation for each dataset
for dataset_name, (X, y) in datasets.items():
    for average_method in ['micro', 'macro']:
        # Define a custom scorer
        custom_scorer = make_scorer(lambda y_true, y_pred: np.mean([
            precision_score(y_true, y_pred, average=average_method),
            recall_score(y_true, y_pred, average=average_method),
            f1_score(y_true, y_pred, average=average_method)
        ]))

        # Perform cross-validation and scoring
        scores_mlp = cross_val_score(mlp_model_exp1, X, y, cv=kf, scoring=custom_scorer)
        print(f"Dataset: {dataset_name}, MLP Scores with Larger Hidden Layer and tanh Activation ({average_method} average): {scores_mlp}")




Dataset: original, MLP Scores with Larger Hidden Layer and tanh Activation (micro average): [0.70308436 0.72304833 0.76840149 0.85799257]
Dataset: original, MLP Scores with Larger Hidden Layer and tanh Activation (macro average): [0.66375506 0.66731123 0.68756795 0.63858367]
Dataset: derived1, MLP Scores with Larger Hidden Layer and tanh Activation (micro average): [0.73392791 0.70371747 0.76802974 0.84052045]
Dataset: derived1, MLP Scores with Larger Hidden Layer and tanh Activation (macro average): [0.63934174 0.63455791 0.63346286 0.63145829]
Dataset: derived2, MLP Scores with Larger Hidden Layer and tanh Activation (micro average): [0.71646228 0.69553903 0.76022305 0.84423792]
Dataset: derived2, MLP Scores with Larger Hidden Layer and tanh Activation (macro average): [0.64224209 0.62718396 0.65966777 0.59994952]


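Experiment 1: Changing Hidden Layer Size and Activation Function

In this first experiment, we increased the hidden layer to 150 neurons and switched the activation function to tanh. Averaged over the four folds, the micro/macro scores are about 0.763/0.664 on the original dataset, 0.762/0.635 on derived1 and 0.754/0.632 on derived2, all above the baseline MLP in the results table of step 3. Since early stopping and a higher max_iter were enabled at the same time, the gain cannot be attributed to the layer size and activation function alone.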
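
For reference, the two activation functions compared here behave quite differently on the same inputs: ReLU (the MLPClassifier default) clips negative values to zero and is unbounded above, while tanh squashes every input into the interval (-1, 1) and keeps the sign. A minimal sketch with arbitrary input values:

```python
import math

def relu(x):
    """Rectified linear unit: negative inputs become 0."""
    return max(0.0, x)

def tanh(x):
    """Hyperbolic tangent: output bounded in (-1, 1), sign preserved."""
    return math.tanh(x)

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
activations = [(x, relu(x), tanh(x)) for x in inputs]

for x, r, t in activations:
    print(f"x={x:+.1f}  relu={r:.3f}  tanh={t:+.3f}")
```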

In [22]:
# Define MLP with adjusted learning rate and sgd solver
mlp_model_exp2 = MLPClassifier(hidden_layer_sizes=(50,), learning_rate_init=0.01, solver='sgd', max_iter=300, early_stopping=True, validation_fraction=0.1, n_iter_no_change=5)

# Evaluation for each dataset
for dataset_name, (X, y) in datasets.items():
    for average_method in ['micro', 'macro']:
        # Define a custom scorer
        custom_scorer = make_scorer(lambda y_true, y_pred: np.mean([
            precision_score(y_true, y_pred, average=average_method),
            recall_score(y_true, y_pred, average=average_method),
            f1_score(y_true, y_pred, average=average_method)
        ]))

        # Perform cross-validation and scoring
        scores_mlp = cross_val_score(mlp_model_exp2, X, y, cv=kf, scoring=custom_scorer)
        print(f"Dataset: {dataset_name}, MLP Scores with Adjusted Learning Rate and sgd Solver ({average_method} average): {scores_mlp}")



Dataset: original, MLP Scores with Adjusted Learning Rate and sgd Solver (micro average): [0.74358974 0.68847584 0.78438662 0.8527881 ]
Dataset: original, MLP Scores with Adjusted Learning Rate and sgd Solver (macro average): [0.66094398 0.67192756 0.6719866  0.5999726 ]
Dataset: derived1, MLP Scores with Adjusted Learning Rate and sgd Solver (micro average): [0.71534745 0.69405204 0.74089219 0.84052045]
Dataset: derived1, MLP Scores with Adjusted Learning Rate and sgd Solver (macro average): [0.58984695 0.6150631  0.62534299 0.63702375]
Dataset: derived2, MLP Scores with Adjusted Learning Rate and sgd Solver (micro average): [0.73467113 0.66914498 0.75836431 0.85055762]
Dataset: derived2, MLP Scores with Adjusted Learning Rate and sgd Solver (macro average): [0.63421536 0.59986896 0.58123586 0.61148857]


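Experiment 2: Modifying Learning Rate and Solver

In the second experiment, we raised the initial learning rate to 0.01 and switched the solver to sgd. These settings affect the speed and convergence of the model's training process. Averaged over the four folds, the micro/macro scores are about 0.767/0.651 on the original dataset, 0.748/0.617 on derived1 and 0.753/0.607 on derived2. These are again above the baseline MLP, with macro averages slightly below those of Experiment 1.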
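
The effect of the learning rate can be illustrated on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2 with different step sizes. It is a deliberately simplified, deterministic example: the sgd solver in MLPClassifier works on mini-batches of a far more complex loss and, as far as we know, also applies momentum by default, so the numbers here only show the general trend.

```python
def gradient_descent(lr, steps=100, start=5.0):
    """Minimise f(w) = w**2 with fixed-step gradient descent; return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w      # derivative of w**2
        w -= lr * grad    # step against the gradient
    return w

w_slow = gradient_descent(lr=0.001)    # default-sized step
w_fast = gradient_descent(lr=0.01)     # ten times larger step
w_diverged = gradient_descent(lr=1.1)  # step too large: overshoots every time

print(f"lr=0.001 -> w={w_slow:.3f}")
print(f"lr=0.01  -> w={w_fast:.3f}")
print(f"lr=1.1   -> |w|={abs(w_diverged):.3e}")
```

After the same number of steps, the larger of the two small learning rates ends up much closer to the minimum at 0, while an overly large rate moves away from it. This is the trade-off that learning_rate_init controls.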

****

## **References**
[1] “spaCy: Industrial-strength Natural Language Processing in Python,” spaCy, https://spacy.io/ (accessed Dec. 3, 2023).

[2] “Linguistic Features,” spaCy Usage Documentation, https://spacy.io/usage/linguistic-features (accessed Dec. 3, 2023).

[3] “sklearn.neural_network.MLPClassifier,” scikit-learn, https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html (accessed Dec. 3, 2023).