**ASSIGNMENT 4**

1. Group #: 18
   Member Names: Natasha Hussain, Daanish Khan 

   Member Student Numbers: 300122562, 300126840 
   
   Report Title: Classification Empirical Study 

**Derived Datasets**

In [177]:
import spacy
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    f1_score,
    classification_report,
    precision_score, 
    recall_score
)

You have been given a list of datasets in the assignment description. Choose one of the datasets, provide the link below, and read the dataset using pandas. You should provide a link to your own GitHub repository even if you are using a reduced version of a dataset from your TA's repository.

**Dataset:** Airline Passenger Reviews 

**Description:** This dataset contains 64,017 passenger reviews, each labelled with one of three classes: Passive (neutral), Detractor (negative), or Promoter (positive). The reduced version we use contains 10,761 samples.

***[EXPLAIN WHAT WE DID TO THE DATA SETS HERE AND WHY]***

In [81]:
#Load the dataset you chose.
url = 'https://raw.githubusercontent.com/NatashaNaima/AI-NLPs/main/reduced_file_AirPassengerReviews.csv'

In [82]:
print(url)
data = pd.read_csv(url)

https://raw.githubusercontent.com/NatashaNaima/AI-NLPs/main/reduced_file_AirPassengerReviews.csv


In [83]:
data.head()

Unnamed: 0,customer_review,NPS Score
0,London to Izmir via Istanbul. First time I'd ...,Passive
1,Istanbul to Bucharest. We make our check in i...,Detractor
2,Rome to Prishtina via Istanbul. I flew with t...,Detractor
3,Flew on Turkish Airlines IAD-IST-KHI and retu...,Promoter
4,Mumbai to Dublin via Istanbul. Never book Tur...,Detractor


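This is where you create the NLP pipeline. spacy.load() loads the English model en_core_web_sm. It does not download the model, which must already be installed (for example with: python -m spacy download en_core_web_sm).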

In [84]:
nlp = spacy.load("en_core_web_sm")

Applying the pipeline to each review creates a Doc object in which every token (word, punctuation mark, or whitespace) is a Token object.

Doc: https://spacy.io/api/doc

Token: https://spacy.io/api/token

In [85]:
#Apply nlp pipeline to the column that has your sentences.
data['tokenized'] = data['customer_review'].apply(nlp)

In [86]:
data.head()

Unnamed: 0,customer_review,NPS Score,tokenized
0,London to Izmir via Istanbul. First time I'd ...,Passive,"( , London, to, Izmir, via, Istanbul, ., First..."
1,Istanbul to Bucharest. We make our check in i...,Detractor,"( , Istanbul, to, Bucharest, ., We, make, our,..."
2,Rome to Prishtina via Istanbul. I flew with t...,Detractor,"( , Rome, to, Prishtina, via, Istanbul, ., I, ..."
3,Flew on Turkish Airlines IAD-IST-KHI and retu...,Promoter,"( , Flew, on, Turkish, Airlines, IAD, -, IST, ..."
4,Mumbai to Dublin via Istanbul. Never book Tur...,Detractor,"( , Mumbai, to, Dublin, via, Istanbul, ., Neve..."


A Token object has many attributes such as part-of-speech (pos_), lemma (lemma_), etc. Take a look at the documentation to see all attributes.

The following function is an example of how to extract the tokens with a given part-of-speech tag from a sentence. It returns lemmas because we only want the base (dictionary) form of each word.

***[EXPLAIN WHY WE CHOSE ADJECTIVES ONLY FOR THE FIRST DATASET]***

In [87]:
#create empty dataframes that will store derived datasets

derived_dataset1 = pd.DataFrame(columns = ['Class', 'pos'])
derived_dataset2 = pd.DataFrame(columns = ['Class', 'pos-np'])

In [88]:
def get_pos(sentence, wanted_pos): # wanted_pos is the list of POS tags to keep, e.g. ['ADJ']
    lemmas = []
    for token in sentence:
        if token.pos_ in wanted_pos:
            lemmas.append(token.lemma_) # lemma returns a hash (int); lemma_ returns the string
    return ' '.join(lemmas) # return a single string, not a list, so a text vectorizer can consume it
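The filter can be checked without loading a spaCy model. The sketch below is only an illustration: FakeToken is a hypothetical stand-in that carries the two attributes get_pos reads, the POS tags are assigned by hand rather than by spaCy, and get_pos is restated compactly so the block runs on its own.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token: only the two attributes get_pos reads.
FakeToken = namedtuple("FakeToken", ["pos_", "lemma_"])

def get_pos(sentence, wanted_pos):
    """Join the lemmas of the tokens whose POS tag is in wanted_pos."""
    return ' '.join(tok.lemma_ for tok in sentence if tok.pos_ in wanted_pos)

# Hand-tagged version of "The crew were friendly and the seats were comfortable."
sentence = [
    FakeToken("DET", "the"), FakeToken("NOUN", "crew"), FakeToken("AUX", "be"),
    FakeToken("ADJ", "friendly"), FakeToken("CCONJ", "and"), FakeToken("DET", "the"),
    FakeToken("NOUN", "seat"), FakeToken("AUX", "be"), FakeToken("ADJ", "comfortable"),
]

adjectives = get_pos(sentence, ["ADJ"])
adjectives_and_nouns = get_pos(sentence, ["ADJ", "NOUN"])
print(adjectives)            # friendly comfortable
print(adjectives_and_nouns)  # crew friendly seat comfortable
```

Passing a longer tag list widens the filter while keeping the original token order.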

In [110]:
derived_dataset1['pos'] = data['tokenized'].apply(lambda sent : get_pos(sent, ['ADJ']))
derived_dataset1['Class'] = data['NPS Score']

In [111]:
derived_dataset1.head(10)

Unnamed: 0,Class,pos
0,Passive,first good nice great Most contradictory littl...
1,Detractor,first last
2,Detractor,several past bad bad normal most useless few w...
3,Promoter,excellent inflight extensive easy excellent in...
4,Detractor,turkish other more
5,Detractor,stuck rude slow unhelpful only only positive g...
6,Detractor,same technical complete armed civilian total m...
7,Detractor,unfriendly
8,Passive,comfortable small okay mid okay great friendly...
9,Detractor,devastating final final turkish much unhelpful...


In [100]:
def get_entities(sentence, wanted_entities):
    entity = []
    for ent in sentence.ents:
        if ent.label_ in wanted_entities:
            entity.append(ent.text)
            
    return ' '.join(entity) 
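One consequence of joining entity texts with spaces is that a word-level tokenizer later sees a multi-word entity such as "Turkish Airlines" as two separate words. The sketch below illustrates this with a hypothetical stand-in for a spaCy Doc (hand-written entities, not spaCy output). The underscore variant at the end is an assumed alternative for comparison, not what the notebook does.

```python
from types import SimpleNamespace

def get_entities(sentence, wanted_entities):
    """Join the text of the entities whose label is in wanted_entities."""
    entity = []
    for ent in sentence.ents:
        if ent.label_ in wanted_entities:
            entity.append(ent.text)
    return ' '.join(entity)

# Hypothetical stand-in for a spaCy Doc: only the .ents attribute is provided.
doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Cape Town", label_="GPE"),
    SimpleNamespace(text="Turkish Airlines", label_="ORG"),
    SimpleNamespace(text="two hours", label_="TIME"),
])

joined = get_entities(doc, ["ORG", "GPE"])
words = joined.split()
print(joined)  # Cape Town Turkish Airlines
print(words)   # ['Cape', 'Town', 'Turkish', 'Airlines']

# Assumed alternative: keep each entity as a single term by replacing inner spaces.
underscored = ' '.join(
    ent.text.replace(' ', '_') for ent in doc.ents if ent.label_ in ["ORG", "GPE"]
)
print(underscored)  # Cape_Town Turkish_Airlines
```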

***[EXPLAIN WHY WE CHOSE LOCATION AND ORGANIZATIONS FOR DATASET]***

In [109]:
derived_dataset2['ent-np'] = data['tokenized'].apply(lambda sent : get_entities(sent, ['ORG', 'GPE']))
derived_dataset2['pos-np'] = data['tokenized'].apply(lambda sent : get_pos(sent, ['ADJ']))
derived_dataset2['Class'] = data['NPS Score']
derived_dataset2.head(10)

Unnamed: 0,Class,pos-np,ent-np
0,Passive,first good nice great Most contradictory littl...,London Istanbul LHR Istanbul Ukraine London Is...
1,Detractor,first last,Istanbul
2,Detractor,several past bad bad normal most useless few w...,Rome Prishtina Istanbul Rome Prishtina Istanbu...
3,Promoter,excellent inflight extensive easy excellent in...,Turkish Airlines Turkish Airlines Turkish Airl...
4,Detractor,turkish other more,Mumbai Dublin Istanbul Dublin Mumbai Mumbai Is...
5,Detractor,stuck rude slow unhelpful only only positive g...,Istanbul Budapest Dublin Turkish Airlines Ista...
6,Detractor,same technical complete armed civilian total m...,Istanbul Algiers Algiers Algiers Turkish Airlines
7,Detractor,unfriendly,Basel Cape Town Istanbul Istanbul Burger King
8,Passive,comfortable small okay mid okay great friendly...,Abu Dhabi Luxembourg Istanbul AUH-IST AUH
9,Detractor,devastating final final turkish much unhelpful...,Turkish Airlines Turkish Airlines JetBlue Veni...


Now that you have your derived datasets, you can move on to the classification task.

**Perform Classification Task**

In [181]:
from sklearn.feature_extraction.text import TfidfVectorizer

dataset_tfidf = TfidfVectorizer().fit_transform(data.customer_review)
derived_dataset_1_tfidf = TfidfVectorizer().fit_transform(derived_dataset1.pos)
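# Join the adjective and entity strings row by row so the matrix keeps one row per review.
# (Concatenating the two columns along axis 0 would produce twice as many rows as there are labels.)
derived_dataset_2_tfidf = TfidfVectorizer().fit_transform(derived_dataset2['pos-np'] + ' ' + derived_dataset2['ent-np'])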
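To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up strings. It follows what we understand to be TfidfVectorizer's default scheme (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2-normalised rows), but it splits on whitespace only, so it is an approximation for illustration rather than a reimplementation of the library. The function name tfidf_rows and the sample documents are our own.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """TF-IDF with raw counts, smoothed idf and L2-normalised rows."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard empty documents
        rows.append([w / norm for w in weights])
    return vocabulary, rows

docs = ["good good nice", "bad rude", "good bad"]  # made-up adjective strings
vocabulary, rows = tfidf_rows(docs)
print(vocabulary)  # ['bad', 'good', 'nice', 'rude']
for row in rows:
    print([round(w, 3) for w in row])
```

Each input string yields exactly one row, so the matrix lines up with the Class column. Within the second document, "rude" receives a larger weight than "bad" because it appears in fewer documents.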

## Logistic Regression Model

In [182]:
def logistical_regression(x_train, x_test, y_train, y_test, C=1, solver='lbfgs'):
    # Pass C and solver through so the keyword arguments actually take effect
    model = LogisticRegression(max_iter=1000, C=C, solver=solver).fit(x_train, y_train)

    # Make prediction from model
    y_pred = model.predict(x_test)

    # calculate accuracy and weighted f1 (scikit-learn expects y_true first, then y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="weighted")

    # calculating micro precision/recall
    micro_precision = precision_score(y_test, y_pred, average="micro", zero_division=0)
    micro_recall = recall_score(y_test, y_pred, average="micro", zero_division=0)

    # calculating macro precision/recall
    macro_precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
    macro_recall = recall_score(y_test, y_pred, average="macro", zero_division=0)

    evaluation = {'accuracy': accuracy, 'f1': f1, 'macro_p': macro_precision, 'macro_r': macro_recall, 'micro_p': micro_precision, 'micro_r': micro_recall}
    return model, y_pred, evaluation
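The micro and macro averages in the evaluation dictionary can differ noticeably when the classes are imbalanced. The following is a small hand-rolled sketch of the two precision averages on made-up labels. The helper name precision_averages and the toy predictions are ours, and a class that is never predicted is given precision 0, mirroring zero_division=0.

```python
from collections import Counter

def precision_averages(y_true, y_pred):
    """Return per-class, macro-averaged and micro-averaged precision."""
    labels = sorted(set(y_true) | set(y_pred))
    predicted = Counter(y_pred)  # how often each class was predicted
    true_pos = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    per_class = {
        label: (true_pos[label] / predicted[label] if predicted[label] else 0.0)
        for label in labels
    }
    macro = sum(per_class.values()) / len(labels)  # every class counts equally
    micro = sum(true_pos.values()) / len(y_pred)   # every sample counts equally
    return per_class, macro, micro

# Made-up imbalanced example: the classifier never predicts "Passive".
y_true = ["Detractor"] * 6 + ["Passive"] * 2 + ["Promoter"] * 2
y_pred = ["Detractor"] * 6 + ["Detractor"] * 2 + ["Promoter", "Detractor"]

per_class, macro, micro = precision_averages(y_true, y_pred)
print(per_class)                          # Detractor 6/9, Passive 0.0, Promoter 1.0
print(round(macro, 3), round(micro, 3))   # 0.556 0.7
```

With exactly one label per sample, micro precision equals micro recall and accuracy, so the macro figures are the ones that reveal weak performance on a minority class.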

## MLP Model

In [186]:
from sklearn.neural_network import MLPClassifier

def mlp(x_train, x_test, y_train, y_test):
    model = MLPClassifier().fit(x_train, y_train)

    # Make prediction from model
    y_pred = model.predict(x_test)

    # calculate accuracy and weighted f1 (scikit-learn expects y_true first, then y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="weighted")

    # calculating micro precision/recall
    micro_precision = precision_score(y_test, y_pred, average="micro", zero_division=0)
    micro_recall = recall_score(y_test, y_pred, average="micro", zero_division=0)

    # calculating macro precision/recall
    macro_precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
    macro_recall = recall_score(y_test, y_pred, average="macro", zero_division=0)

    evaluation = {'accuracy': accuracy, 'f1': f1, 'macro_p': macro_precision, 'macro_r': macro_recall, 'micro_p': micro_precision, 'micro_r': micro_recall}
    return model, y_pred, evaluation