<a href="https://colab.research.google.com/github/Kavi-Dew-23/4338-Dewapura/blob/main/Smart_complaint_analyzer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder

In [3]:
df_train = pd.read_csv("/content/train.csv")
df_test = pd.read_csv("/content/test.csv")

print(df_train.head())


   Unnamed: 0      id  Gender      Customer Type  Age   Type of Travel  \
0           0   70172    Male     Loyal Customer   13  Personal Travel   
1           1    5047    Male  disloyal Customer   25  Business travel   
2           2  110028  Female     Loyal Customer   26  Business travel   
3           3   24026  Female     Loyal Customer   25  Business travel   
4           4  119299    Male     Loyal Customer   61  Business travel   

      Class  Flight Distance  Inflight wifi service  \
0  Eco Plus              460                      3   
1  Business              235                      3   
2  Business             1142                      2   
3  Business              562                      2   
4  Business              214                      3   

   Departure/Arrival time convenient  ...  Inflight entertainment  \
0                                  4  ...                       5   
1                                  2  ...                       1   
2                

In [4]:
df_train.drop(columns=["Unnamed: 0", "id"], inplace=True)

In [5]:
# Check for null values
print(df_train.isnull().sum())

Gender                                 0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Inflight wifi service                  0
Departure/Arrival time convenient      0
Ease of Online booking                 0
Gate location                          0
Food and drink                         0
Online boarding                        0
Seat comfort                           0
Inflight entertainment                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Inflight service                       0
Cleanliness                            0
Departure Delay in Minutes             0
Arrival Delay in Minutes             310
satisfaction                           0
dtype: int64


In [6]:
#drop missing values
df_train.dropna(inplace=True)


In [7]:
print(df_train.head())

   Gender      Customer Type  Age   Type of Travel     Class  Flight Distance  \
0    Male     Loyal Customer   13  Personal Travel  Eco Plus              460   
1    Male  disloyal Customer   25  Business travel  Business              235   
2  Female     Loyal Customer   26  Business travel  Business             1142   
3  Female     Loyal Customer   25  Business travel  Business              562   
4    Male     Loyal Customer   61  Business travel  Business              214   

   Inflight wifi service  Departure/Arrival time convenient  \
0                      3                                  4   
1                      3                                  2   
2                      2                                  2   
3                      2                                  5   
4                      3                                  3   

   Ease of Online booking  Gate location  ...  Inflight entertainment  \
0                       3              1  ...                

Encode the categorical variables

In [8]:
# List of columns to encode
label_cols = ['Gender', 'Customer Type', 'Type of Travel', 'Class','satisfaction']

# Encode each column
for col in label_cols:
    le = LabelEncoder()
    df_train[col] = le.fit_transform(df_train[col])


print(df_train.head(5))


   Gender  Customer Type  Age  Type of Travel  Class  Flight Distance  \
0       1              0   13               1      2              460   
1       1              1   25               0      0              235   
2       0              0   26               0      0             1142   
3       0              0   25               0      0              562   
4       1              0   61               0      0              214   

   Inflight wifi service  Departure/Arrival time convenient  \
0                      3                                  4   
1                      3                                  2   
2                      2                                  2   
3                      2                                  5   
4                      3                                  3   

   Ease of Online booking  Gate location  ...  Inflight entertainment  \
0                       3              1  ...                       5   
1                       3           

In [9]:
# Generate synthetic complaint text

def generate_complaint(row):
    complaint = []
    if row['Inflight wifi service'] <= 2:
        complaint.append("wifi was slow")
    if row['Food and drink'] <= 2:
        complaint.append("food was bad")
    if row['Departure/Arrival time convenient'] <= 2:
        complaint.append("departure and arrival time slow")
    if row['Ease of Online booking'] <= 2:
        complaint.append("online booking was difficult")
    if row['Seat comfort'] <= 2:
        complaint.append("seat was uncomfortable")
    if row['Gate location'] <= 2:
        complaint.append("gate location was bad")
    if row['Inflight entertainment'] <= 2:
        complaint.append("inflight entertaintment was bad")
    if row['On-board service'] <= 2:
        complaint.append("onboard service was poor")
    if row["Leg room service"] <= 2:
        complaint.append("leg room service was poor")
    if row['Baggage handling'] <= 2:
        complaint.append("baggage handling was bad")
    if row['Checkin service'] <= 2:
        complaint.append("checking service was poor")
    if row['Inflight service'] <= 2:
        complaint.append("inflight service was unhelpful")
    if row['Cleanliness'] <= 2:
        complaint.append("plane was not clean")
    return " and ".join(complaint) if complaint else "no complaint"

df_train['Complaint_Text'] = df_train.apply(generate_complaint, axis=1)


In [10]:
print(df_train.columns)

Index(['Gender', 'Customer Type', 'Age', 'Type of Travel', 'Class',
       'Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',
       'satisfaction', 'Complaint_Text'],
      dtype='object')


Text Preprocessing

In [11]:
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [13]:
def clean_text(text):
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())
    lemmatizer = WordNetLemmatizer()
    words = text.split()
    stop_words = set(stopwords.words('english'))
    cleaned = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return " ".join(cleaned)

df_train["Cleaned_Text"] = df_train["Complaint_Text"].apply(clean_text)
print(df_train["Cleaned_Text"])

0                                         gate location bad
1         food bad departure arrival time slow seat unco...
2         wifi slow departure arrival time slow online b...
3         wifi slow food bad seat uncomfortable inflight...
4                                                 complaint
                                ...                        
103899    wifi slow food bad departure arrival time slow...
103900                                             food bad
103901    wifi slow departure arrival time slow online b...
103902    wifi slow food bad departure arrival time slow...
103903    wifi slow food bad seat uncomfortable inflight...
Name: Cleaned_Text, Length: 103594, dtype: object


Word Embedding

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(df_train["Cleaned_Text"])

Transformer based model

In [19]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch

# Create a dataset
df_small = df_train.sample(n=2000, random_state=42).reset_index(drop=True)
df_nlp = df_small[['Cleaned_Text', 'satisfaction']].rename(columns={'Cleaned_Text': 'text', 'satisfaction': 'label'})
dataset = Dataset.from_pandas(df_nlp)

In [20]:
# Tokenize
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)
dataset = dataset.map(tokenize, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [None]:
# Train the BERT model
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mdewapuraka[0m ([33mdewapuraka-university-of-ruhuna[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss


In [None]:
matrices = trainer.evaluate(eval_dataset=dataset)
print(matrices)

In [None]:
# Save the models

trainer.save_model("./saved_distilbert_model")
tokenizer.save_pretrained("./saved_distilbert_model")

('./saved_distilbert_model/tokenizer_config.json',
 './saved_distilbert_model/special_tokens_map.json',
 './saved_distilbert_model/vocab.txt',
 './saved_distilbert_model/added_tokens.json')

In [17]:
# To reload the saved model

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model = DistilBertForSequenceClassification.from_pretrained("./saved_distilbert_model")
tokenizer = DistilBertTokenizer.from_pretrained("./saved_distilbert_model")

HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: './saved_distilbert_model'.

In [None]:
# Autoencoder implementation

from keras.models import Sequential
from keras.layers import Dense

# BUild the autoencoder
autoencoder = Sequential([
    Dense(128, activation='relu', input_shape=(X_text.shape[1],)),
    Dense(64, activation='relu'),
    Dense(128, activation='relu'),
    Dense(X_text.shape[1], activation='sigmoid')
])

# Compile the model

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Train the autoencoder
autoencoder.fit(X_text.toarray(), X_text.toarray(), epochs=5, batch_size=64)


Epoch 1/5
[1m1619/1619[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - loss: 0.2575
Epoch 2/5
[1m1619/1619[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 4ms/step - loss: 0.1955
Epoch 3/5
[1m1619/1619[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 3ms/step - loss: 0.1951
Epoch 4/5
[1m1619/1619[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 3ms/step - loss: 0.1950
Epoch 5/5
[1m1619/1619[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 4ms/step - loss: 0.1949


<keras.src.callbacks.history.History at 0x7f9ac26e8910>

In [None]:
from transformers import pipeline, set_seed

# Initialize text generation pipeline
generator = pipeline("text-generation", model="gpt2")

# Set random seed for reproducibility
set_seed(42)

# Prompt for generation
prompt = "The passenger was dissatisfied because"

# Generate 3 complaints with max 40 tokens
results = generator(
    prompt,
    max_length=40,
    num_return_sequences=3,
    pad_token_id=50256,  # to avoid padding warning
    truncation=True
)

# Display results
for i, r in enumerate(results):
    print(f"Generated Complaint {i+1}: {r['generated_text']}")

Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generated Complaint 1: The passenger was dissatisfied because she didn't get the job she wanted.

Rochester police say that after they received an anonymous tip from a woman who said she was a woman, they asked Rochester police for help locating the woman.

They took the woman to a local hospital and were able to identify her.

The woman was then taken to the hospital to be checked by a doctor, and then flown to a local hospital for a psychological evaluation.

The woman was charged with aggravated harassment with a weapon, and was in jail for 10 days.

Rochester police say that they will be appealing the incident to the Rochester Police Department, but they are still looking for the woman.
Generated Complaint 2: The passenger was dissatisfied because he did not see any signs of foul play.

"I was in the passenger seat when I heard four or five loud bangs and I thought, 'Oh my God, maybe he's going to explode,'" said the passenger. "I looked down and I saw the engine was going down."



In [None]:
# Split the data into features and target

X = df_train.drop(columns=['satisfaction'])
y = df_train['satisfaction']

In [None]:
# Split the data into training and testing

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Model Training

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report