## ANALYZING AIRLINE CUSTOMER SENTIMENT THROUGH SOCIAL MEDIA FEEDBACK

#### AUTHORS

* Jeremiah Waiguru
* Mercy Kiragu
* Paul Ngatia
* Winfred Kinya

## 1.0  PROJECT OVERVIEW

This project analyzes customer sentiment expressed on Twitter about various airlines. Using Natural Language Processing (NLP) techniques, we classify customer sentiment and identify key themes in the feedback. The insights derived from this analysis will help airlines enhance their customer service, identify common issues, and improve overall customer satisfaction.

## 1.1  BUSINESS UNDERSTANDING

In the competitive landscape of the airline industry, understanding and managing customer sentiment is crucial for maintaining high levels of customer satisfaction and loyalty. Airlines receive a substantial amount of feedback through various channels such as social media, customer service interactions, and surveys. Analyzing this feedback to discern customer sentiment and predict future trends can provide significant benefits. By proactively addressing customer concerns, airlines can enhance their service quality, optimize operational efficiency, and build a strong reputation. This project seeks to provide airlines with the tools and insights necessary to identify common themes in customer feedback, understand how sentiment evolves over time, and forecast future sentiment trends. These insights will enable airlines to make data-driven decisions that improve customer experiences, resolve issues proactively, and maintain a competitive advantage in the market.

## 1.2  PROBLEM STATEMENT

The airline industry is currently facing a notable decrease in customer satisfaction, leading to unfavorable brand perception and diminished customer loyalty. This decline in satisfaction can be attributed to several factors, including flight delays, inadequate customer service, mishandling of luggage, and other operational inefficiencies. As a result, addressing these customer concerns and enhancing the overall brand perception has become a crucial focus for airlines.

## 1.3  OBJECTIVES

### Primary objective

To analyze customer sentiment towards various airlines through sentiment analysis, providing actionable insights that will enhance customer satisfaction and optimize operational strategies.

### Specific objectives

1.	Implement a real-time monitoring system to continuously capture and process tweets related to airlines from Twitter.
2.	Implement and compare various NLP models (e.g., Logistic Regression) for sentiment classification.
3.	Generate actionable insights and recommendations based on sentiment analysis to improve customer satisfaction, address pain points, and enhance overall brand reputation. 
4.	Establish an effective response and engagement strategy to manage negative sentiment, address customer complaints, and foster positive customer experiences.


## 2.0  DATA UNDERSTANDING

Our dataset was publicly sourced from the CrowdFlower website and consists of tweets and retweets from Twitter users. The tweets were collected in February 2015, and contributors classified each one as positive, negative, or neutral.

The dataset has 14,640 rows and 20 columns. Below are the columns and their descriptions:

* Unit id: A unique identifier for each data unit.
* Golden: A boolean value indicating whether the entry is a golden unit in the dataset.
* Unit state: The state of the unit (e.g., finalized or golden).
* Trusted judgments: The number of trusted judgments for the entry.
* Last judgment at: Timestamp of the last judgment for the entry.
* Airline sentiment: The target variable, which represents the sentiment of the airline tweet (positive, negative, or neutral).
* Airline sentiment confidence: The confidence level associated with the airline sentiment.
* Negative reason: The reason for negative sentiment in the tweet.
* Negative reason confidence: The confidence level associated with the negative sentiment reason.
* Airline: The airline associated with the tweet.
* Airline sentiment gold: Additional information about airline sentiment (gold standard).
* Name: The name of the user who posted the tweet.
* Negative reason gold: Additional information about the negative sentiment reason (gold standard).
* Retweet count: The number of retweets for the tweet.
* Text: The text content of the tweet.
* Tweet coord: Coordinates of the tweet (if available).
* Tweet created: Timestamp of when the tweet was created.
* Tweet id: The unique identifier of the tweet.
* Tweet location: The location associated with the tweet (if provided).
* User time zone: The time zone of the user who posted the tweet.
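
The descriptions above use readable labels, while the raw CSV header uses names such as `_unit_id` and `airline_sentiment:confidence`. The leading underscores and colons make attribute-style access awkward, so one option is to normalise the names before analysis. The following is a minimal sketch, not part of the original notebook; the helper name `clean_column_name` is hypothetical.

```python
def clean_column_name(name: str) -> str:
    """Normalise a raw column name: drop leading underscores, replace ':' with '_'."""
    return name.lstrip("_").replace(":", "_")


# A subset of the raw names from the CSV header, used here for illustration.
raw_columns = [
    "_unit_id",
    "_golden",
    "airline_sentiment:confidence",
    "negativereason:confidence",
    "tweet_coord",
]

cleaned = [clean_column_name(column) for column in raw_columns]
print(cleaned)

# With a loaded DataFrame, the same helper could be applied as:
#     df = df.rename(columns=clean_column_name)
```

Names that are already clean, such as `tweet_coord`, pass through unchanged, so the helper can safely be applied to every column.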


## Importing Necessary Libraries

In [3]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")
import seaborn as sns
import plotly.express as px 
import re
import string
import joblib

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_curve, auc

#downloading dependencies
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('vader_lexicon')

from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [4]:
# loading the dataset
df = pd.read_csv('Airline-Sentiment-2-w-AA.csv', encoding='latin1')
df.head()

| | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | airline_sentiment | airline_sentiment:confidence | negativereason | negativereason:confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_id | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 681448150 | False | finalized | 3 | 2/25/15 5:24 | neutral | 1.0 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2/24/15 11:35 | 5.70306e+17 | NaN | Eastern Time (US & Canada) |
| 1 | 681448153 | False | finalized | 3 | 2/25/15 1:53 | positive | 0.3486 | NaN | 0.0 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials t... | NaN | 2/24/15 11:15 | 5.70301e+17 | NaN | Pacific Time (US & Canada) |
| 2 | 681448156 | False | finalized | 3 | 2/25/15 10:01 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I n... | NaN | 2/24/15 11:15 | 5.70301e+17 | Lets Play | Central Time (US & Canada) |
| 3 | 681448158 | False | finalized | 3 | 2/25/15 3:05 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast... | NaN | 2/24/15 11:15 | 5.70301e+17 | NaN | Pacific Time (US & Canada) |
| 4 | 681448159 | False | finalized | 3 | 2/25/15 5:50 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing... | NaN | 2/24/15 11:14 | 5.70301e+17 | NaN | Pacific Time (US & Canada) |
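
The `_last_judgment_at` and `tweet_created` columns are loaded as plain strings, which gets in the way of any analysis of how sentiment changes over time. The sketch below shows one way they might be converted to datetimes. It assumes every value follows the `month/day/two-digit-year hour:minute` pattern seen in the rows above; the helper `parse_timestamps` and the constant `TIMESTAMP_FORMAT` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the preview above; only the timestamp columns are kept.
sample = pd.DataFrame(
    {
        "_last_judgment_at": ["2/25/15 5:24", "2/25/15 1:53"],
        "tweet_created": ["2/24/15 11:35", "2/24/15 11:15"],
    }
)

# Assumed format, inferred from the sample rows rather than documented anywhere.
TIMESTAMP_FORMAT = "%m/%d/%y %H:%M"


def parse_timestamps(frame, columns, fmt=TIMESTAMP_FORMAT):
    """Return a copy of `frame` with the given columns converted to datetime64."""
    parsed = frame.copy()
    for column in columns:
        # errors="coerce" turns strings that do not match `fmt` into NaT
        # instead of raising, so unexpected values can be inspected afterwards.
        parsed[column] = pd.to_datetime(parsed[column], format=fmt, errors="coerce")
    return parsed


parsed_sample = parse_timestamps(sample, ["_last_judgment_at", "tweet_created"])
print(parsed_sample.dtypes)
```

If the full dataset contains timestamps in another layout, they would surface as `NaT` after conversion, and counting them with `isna().sum()` is a quick check on the format assumption.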


In [5]:
def describe_columns(df):
    # Print column names
    print("Column Names:")
    print(df.columns)

    # Print data types
    print("\nData Types:")
    print(df.dtypes)

    # Print number of rows and columns
    print("\nShape:")
    print(df.shape)

    # Print df information (df.info() prints directly and returns None)
    print("\nInfo:")
    df.info()

    # Print descriptive statistics for numerical columns
    print("\nDescriptive Statistics:")
    print(df.describe())

    # Print missing values count per column
    print("\nMissing Values in percentages:")
    print((df.isna().sum()/len(df)) * 100)

describe_columns(df)

Column Names:
Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'airline_sentiment',
       'airline_sentiment:confidence', 'negativereason',
       'negativereason:confidence', 'airline', 'airline_sentiment_gold',
       'name', 'negativereason_gold', 'retweet_count', 'text', 'tweet_coord',
       'tweet_created', 'tweet_id', 'tweet_location', 'user_timezone'],
      dtype='object')

Data Types:
_unit_id                          int64
_golden                            bool
_unit_state                      object
_trusted_judgments                int64
_last_judgment_at                object
airline_sentiment                object
airline_sentiment:confidence    float64
negativereason                   object
negativereason:confidence       float64
airline                          object
airline_sentiment_gold           object
name                             object
negativereason_gold              object
retweet_count                     i