# **AIRLINE REVIEW SENTIMENT ANALYSIS**


### **1. BUSINESS PROBLEM**


In today's competitive, customer-centric market, businesses receive large volumes of customer feedback through reviews on many platforms. Analyzing this feedback manually is time-consuming and often inconsistent. Our application addresses this problem by using sentiment analysis to automatically categorize customer reviews as positive, negative, or neutral. This enables businesses to quickly identify areas for improvement, monitor brand perception in real time, and make data-driven decisions that enhance customer satisfaction and loyalty.

### **2. PROBLEM STATEMENT**

ADD CONTENT HERE...

### **3. DATA UNDERSTANDING**

Visualize as much of the dataset as possible in terms of distribution, relationships between variables and so forth to get a solid understanding of the data.
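
As a starting point, a per-class summary gives a quick view of the label distribution and tweet lengths. The sketch below assumes a frame with the `airline_sentiment` and `text` columns used later in this notebook; `summarize_sentiment` is a hypothetical helper and the toy rows are made up for illustration.

```python
import pandas as pd

def summarize_sentiment(df: pd.DataFrame) -> pd.DataFrame:
    """Per-class tweet count, mean tweet length in characters, and share of the dataset."""
    lengths = df["text"].astype(str).str.len()
    summary = (
        df.assign(text_length=lengths)
        .groupby("airline_sentiment")
        .agg(count=("text", "size"), mean_length=("text_length", "mean"))
    )
    summary["share"] = summary["count"] / summary["count"].sum()
    return summary

# Toy rows for illustration only; in the notebook, pass the frame loaded from Tweets.csv.
toy = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "neutral", "positive"],
    "text": ["late again", "lost my bag", "what gate?", "great crew"],
})
summary = summarize_sentiment(toy)
print(summary)
```

If matplotlib is installed, `summary["count"].plot(kind="bar")` turns the same table into a class-distribution chart.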

### **4. DATA PREPARATION & TRAINING**

In [18]:
import pandas as pd
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download the required NLTK resources to the default NLTK data directory
# (avoids hard-coding a user-specific path); omw-1.4 is needed for lemmatization
for resource in ['stopwords', 'punkt', 'wordnet', 'omw-1.4']:
    nltk.download(resource)


df = pd.read_csv("Tweets.csv")

# Keep only relevant columns
df = df[['airline_sentiment', 'text']]

# Load stopwords once
stop_words = set(stopwords.words('english'))

lemmatizer = WordNetLemmatizer()

# stemmer = PorterStemmer()

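def clean_text(text):
    """Lowercase a tweet, strip URLs, mentions, '#' signs and punctuation, drop stopwords, lemmatize."""
    try:
        text = str(text).lower()
        # Remove URLs (the http pattern also matches https)
        text = re.sub(r"http\S+|www\S+", '', text)
        # Remove @mentions entirely; drop the '#' but keep the hashtag word
        text = re.sub(r'@\w+|#', '', text)
        # Remove remaining punctuation
        text = re.sub(r'[^\w\s]', '', text)
        tokens = text.split()
        tokens = [word for word in tokens if word not in stop_words]
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
        return ' '.join(tokens)
    except Exception as e:
        print(f"Error cleaning text: {text} -> {e}")
        return ""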

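# clean_text handles per-row errors itself, so a failure here should surface
# immediately rather than be printed and followed by a missing-column error later
df['clean_text'] = df['text'].apply(clean_text)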

# Encode labels
label_encoder = LabelEncoder()
df['sentiment_encoded'] = label_encoder.fit_transform(df['airline_sentiment'])

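# Vectorize text using TF-IDF over unigrams and bi-grams.
# A bi-gram is a sequence of two adjacent tokens (e.g. "flight delayed"), which captures more context than unigrams alone.
# lowercase=False because clean_text has already lowercased the text.
# Note: fitting the vectorizer on the full dataset before splitting lets test-set vocabulary
# and IDF statistics influence the features; fit on the training split only for a stricter evaluation.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), lowercase=False)
X = vectorizer.fit_transform(df['clean_text']).toarray()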

# Define features and labels
y = df['sentiment_encoded']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)




[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\marcu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\marcu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\marcu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\marcu\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
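
One caveat about the cell above: the TF-IDF vectorizer is fitted on every tweet before the train/test split, so the test rows contribute to the vocabulary and IDF weights. A possible alternative, sketched here under the assumption that the same vectorizer settings are kept, is to split the cleaned text first and wrap both steps in a scikit-learn `Pipeline`. The texts and labels below are made up so the example runs on its own, and the variable names are illustrative rather than taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up, already-cleaned texts standing in for df['clean_text'] (illustration only).
texts = [
    "flight delayed three hour", "lost luggage rude staff",
    "cancelled flight refund", "worst service ever",
    "great crew smooth flight", "love airline friendly staff",
    "thanks quick help", "amazing service on time",
]
labels = ["negative"] * 4 + ["positive"] * 4

# Split the text first so the test rows stay unseen by the vectorizer.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2), lowercase=False)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the vocabulary and IDF weights from the training texts only.
pipeline.fit(train_texts, train_labels)
test_predictions = pipeline.predict(test_texts)
print(len(test_predictions))
```

A side benefit of this design is that the fitted pipeline is a single object, so the vectorizer and classifier cannot drift apart when they are saved and reloaded.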


### **5. EVALUATING DIFFERENT TYPES OF MODELS**

##### **LOGISTIC REGRESSION**


In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Initialize and train Logistic Regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Predictions
y_pred_logreg = logreg.predict(X_test)

# Evaluation
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print("Logistic Regression Classification Report:\n", classification_report(y_test, y_pred_logreg))


Logistic Regression Accuracy: 0.7991803278688525
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.94      0.87      1889
           1       0.67      0.50      0.57       580
           2       0.82      0.61      0.70       459

    accuracy                           0.80      2928
   macro avg       0.77      0.68      0.72      2928
weighted avg       0.79      0.80      0.79      2928
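
The class indices 0, 1 and 2 in the report come from `LabelEncoder`, which sorts the class names before numbering them. A short sketch of how to recover the mapping, assuming the three sentiment labels present in this dataset (the `encoder` and `mapping` names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# The three label strings used in the airline_sentiment column.
encoder = LabelEncoder()
encoder.fit(["positive", "negative", "neutral"])

# classes_ is sorted, and each class's position is its encoded integer.
mapping = {code: str(name) for code, name in enumerate(encoder.classes_)}
print(mapping)  # {0: 'negative', 1: 'neutral', 2: 'positive'}
```

Passing `target_names=label_encoder.classes_` to `classification_report` prints these names in place of the indices.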



##### **TRAINING A MODEL USING RANDOM FOREST**

In [20]:
# from sklearn.ensemble import RandomForestClassifier

# # Initialize and train Random Forest model
# rf = RandomForestClassifier(n_estimators=100, random_state=42)
# rf.fit(X_train, y_train)

# # Predictions
# y_pred_rf = rf.predict(X_test)

# # Evaluation
# print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
# print("Random Forest Classification Report:\n", classification_report(y_test, y_pred_rf))


##### **COMPARING XGBOOST & LOGISTIC REGRESSION**

In [21]:
import xgboost as xgb

# use_label_encoder is omitted: this XGBoost version reports the parameter as unused
xgb_model = xgb.XGBClassifier(eval_metric='mlogloss')
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate Logistic Regression
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print("Logistic Regression Classification Report:\n", classification_report(y_test, y_pred_logreg))

# Evaluate XGBoost
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("XGBoost Classification Report:\n", classification_report(y_test, y_pred_xgb))


Logistic Regression Accuracy: 0.7991803278688525
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.94      0.87      1889
           1       0.67      0.50      0.57       580
           2       0.82      0.61      0.70       459

    accuracy                           0.80      2928
   macro avg       0.77      0.68      0.72      2928
weighted avg       0.79      0.80      0.79      2928

XGBoost Accuracy: 0.7612704918032787
XGBoost Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.94      0.85      1889
           1       0.66      0.30      0.41       580
           2       0.73      0.61      0.67       459

    accuracy                           0.76      2928
   macro avg       0.72      0.62      0.64      2928
weighted avg       0.75      0.76      0.73      2928



##### **PREDICTING WITH VADER**

In [22]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon to the default NLTK data directory
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# VADER prediction function
def get_vader_sentiment(text):
    score = sia.polarity_scores(text)['compound']
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    else:
        return 'neutral'
    
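# Get the original text of the test set (aligning indexes with y_test).
# VADER uses capitalization and punctuation cues, so it scores the raw tweet rather than clean_text.
X_test_text = df.loc[y_test.index, 'text']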

# Get VADER predictions for each tweet
vader_preds = X_test_text.apply(get_vader_sentiment)


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\marcu\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


##### **COMPARING VADER & LOGISTIC REGRESSION**

In [23]:
from sklearn.metrics import classification_report, accuracy_score

# Decode your model's predictions (you already trained your ML model above)
ml_preds_labels = label_encoder.inverse_transform(y_pred_logreg)
y_test_labels = label_encoder.inverse_transform(y_test)

# Compare results
print("ML Model Accuracy:", accuracy_score(y_test_labels, ml_preds_labels))
print("ML Classification Report:\n", classification_report(y_test_labels, ml_preds_labels))

print("\nVADER Accuracy:", accuracy_score(y_test_labels, vader_preds))
print("VADER Classification Report:\n", classification_report(y_test_labels, vader_preds))


ML Model Accuracy: 0.7991803278688525
ML Classification Report:
               precision    recall  f1-score   support

    negative       0.82      0.94      0.87      1889
     neutral       0.67      0.50      0.57       580
    positive       0.82      0.61      0.70       459

    accuracy                           0.80      2928
   macro avg       0.77      0.68      0.72      2928
weighted avg       0.79      0.80      0.79      2928


VADER Accuracy: 0.5502049180327869
VADER Classification Report:
               precision    recall  f1-score   support

    negative       0.90      0.51      0.65      1889
     neutral       0.36      0.43      0.39       580
    positive       0.34      0.88      0.49       459

    accuracy                           0.55      2928
   macro avg       0.54      0.60      0.51      2928
weighted avg       0.71      0.55      0.57      2928
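
The support column shows that the test set is imbalanced, so it helps to read each accuracy against a majority-class baseline. The sketch below computes that baseline from the supports reported above; it is plain arithmetic added for context, not part of the original notebook.

```python
# Test-set supports copied from the classification reports above.
support = {"negative": 1889, "neutral": 580, "positive": 459}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = support[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))  # negative 0.645
```

Against that baseline of about 0.645, logistic regression (0.80) and XGBoost (0.76) add real signal, while VADER's 0.55 falls below it, which is consistent with its low negative recall (0.51) and low positive precision (0.34).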



##### **TRAINING AN MLP TO COMPARE WITH LOGISTIC REGRESSION**

In [24]:
# from sklearn.neural_network import MLPClassifier
# from sklearn.metrics import classification_report, accuracy_score

# # Initialize and train MLP model
# mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)  # You can tune the hidden_layer_sizes
# mlp.fit(X_train, y_train)

# # Predictions
# y_pred_mlp = mlp.predict(X_test)

# # Evaluate MLP
# print("MLP Accuracy:", accuracy_score(y_test, y_pred_mlp))
# print("MLP Classification Report:\n", classification_report(y_test, y_pred_mlp))

# # Compare the results with Logistic Regression (already done in your code)
# print("\nML Model Accuracy:", accuracy_score(y_test, y_pred_logreg))
# print("ML Classification Report:\n", classification_report(y_test, y_pred_logreg))



##### **SAVING THE MODEL FILE**

In [25]:
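import pickle

# Refit the best-performing model with the same settings evaluated above
best_model = LogisticRegression(max_iter=1000)
best_model.fit(X_train, y_train)

# Save the model, vectorizer and label encoder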

with open('model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)
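
To use the saved files, the three objects are loaded back and applied in the same order as during training: vectorize, predict, decode. The sketch below is one way this could look. It builds tiny stand-in artifacts from made-up data in a temporary directory so it runs on its own; `load_artifact` and `predict_sentiment` are hypothetical helpers, and real input would first need to pass through the same `clean_text` step.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in artifacts (made-up data) so the sketch is self-contained;
# in the notebook these are the fitted objects saved above.
texts = ["flight delayed lost bag", "rude staff worst flight",
         "great crew thanks", "love friendly service"]
labels = ["negative", "negative", "positive", "positive"]
vectorizer = TfidfVectorizer(lowercase=False)
label_encoder = LabelEncoder()
model = LogisticRegression(max_iter=1000).fit(
    vectorizer.fit_transform(texts), label_encoder.fit_transform(labels)
)

artifact_dir = Path(tempfile.mkdtemp())
for name, obj in [("model.pkl", model), ("vectorizer.pkl", vectorizer),
                  ("label_encoder.pkl", label_encoder)]:
    with open(artifact_dir / name, "wb") as f:
        pickle.dump(obj, f)

def load_artifact(name):
    """Load one pickled object from the artifact directory."""
    with open(artifact_dir / name, "rb") as f:
        return pickle.load(f)

def predict_sentiment(cleaned_texts):
    """Hypothetical helper: vectorize cleaned texts and return label strings."""
    loaded_model = load_artifact("model.pkl")
    loaded_vectorizer = load_artifact("vectorizer.pkl")
    loaded_encoder = load_artifact("label_encoder.pkl")
    # Use transform (not fit_transform) so the training vocabulary is reused.
    features = loaded_vectorizer.transform(cleaned_texts)
    encoded = loaded_model.predict(features)
    return [str(label) for label in loaded_encoder.inverse_transform(encoded)]

predictions = predict_sentiment(["great friendly crew"])
print(predictions)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.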