# Airline Sentiment Analysis

Customer interests and sentiment shift as a business changes. Sentiment analysis lets a business track how customers react when it launches a new product or changes a price, and that feedback can guide improvements to products and services. Twitter is a useful source for this because the platform is built around quick, frequent messages, which makes it easy to follow what customers say about a product, service, or brand and to spot angry customers or negative comments. The results can support marketing by showing what customers love or hate about a business and its products. They can also support customer service: when a model can predict whether a tweet is positive or negative, a bot can respond with an appropriate message. My goal is to analyze the sentiment of tweets about a variety of airlines.

## Table of Contents:
* [Data Collection](#DataCollection)
* [Data Organization](#DataOrganization)
* [Data Definition](#DataDefinition)
* [Data Cleaning](#DataCleaning)
    * [Vectorization](#Vectorization)

In [1]:
### imports ###
import pandas as pd

### this will deal with punctuation ###
import string
import re

import nltk ### using the Natural Language Toolkit ###
from nltk.corpus import stopwords

#### Lemmatization ###
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# 1. Data Collection <a class="anchor" id="DataCollection"></a>

The data was collected from Kaggle and consists of tweets labeled as positive, negative, or neutral. It originally had 14,640 rows and 15 columns.

In [2]:
### get data ###
tweets = pd.read_csv("../Data/Tweets.csv")

# 2. Data Organization <a class="anchor" id="DataOrganization"></a>

In [3]:
### view the heading of data ###
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [4]:
### check shape ###
tweets.shape

(14640, 15)

In [5]:
### view summary of data ###
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

In [6]:
### check for null values ###
tweets.isnull().sum()

tweet_id                            0
airline_sentiment                   0
airline_sentiment_confidence        0
negativereason                   5462
negativereason_confidence        4118
airline                             0
airline_sentiment_gold          14600
name                                0
negativereason_gold             14608
retweet_count                       0
text                                0
tweet_coord                     13621
tweet_created                       0
tweet_location                   4733
user_timezone                    4820
dtype: int64

In [7]:
### percentage of null values ###
((tweets.isnull().sum() * 100)/ len(tweets.index))

tweet_id                         0.000000
airline_sentiment                0.000000
airline_sentiment_confidence     0.000000
negativereason                  37.308743
negativereason_confidence       28.128415
airline                          0.000000
airline_sentiment_gold          99.726776
name                             0.000000
negativereason_gold             99.781421
retweet_count                    0.000000
text                             0.000000
tweet_coord                     93.039617
tweet_created                    0.000000
tweet_location                  32.329235
user_timezone                   32.923497
dtype: float64

The "airline_sentiment_gold", "negativereason_gold", and "tweet_coord" columns are each missing over 90% of their values, so I will drop them.

In [8]:
tweets = tweets.drop(columns=["airline_sentiment_gold" , "negativereason_gold", "tweet_coord"])
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,cairdin,0,@VirginAmerica What @dhepburn said.,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,jnardino,0,@VirginAmerica plus you've added commercials t...,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,yvonnalynn,0,@VirginAmerica I didn't today... Must mean I n...,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,jnardino,0,@VirginAmerica it's really aggressive to blast...,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,jnardino,0,@VirginAmerica and it's a really big bad thing...,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
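
The columns dropped above could also be selected from the null percentages instead of being named by hand. The following is a minimal sketch on a small made-up frame; the `demo` frame and the `threshold` value are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the tweets data.
demo = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "tweet_coord": [np.nan, np.nan, np.nan, np.nan],
    "tweet_location": ["NY", np.nan, "LA", "SF"],
})

threshold = 90.0  # assumed cut-off, in percent of missing values

# isnull().mean() gives the fraction of missing values per column.
null_pct = demo.isnull().mean() * 100
mostly_null = null_pct[null_pct > threshold].index.tolist()

# Drop only the columns that exceed the threshold.
trimmed = demo.drop(columns=mostly_null)
print(mostly_null)
print(trimmed.columns.tolist())
```

On the real data, `tweets` would take the place of `demo`.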


In [9]:
### check for more null values ###

print(tweets.isnull().sum())
print(((tweets.isnull().sum() * 100)/ len(tweets.index)))

tweet_id                           0
airline_sentiment                  0
airline_sentiment_confidence       0
negativereason                  5462
negativereason_confidence       4118
airline                            0
name                               0
retweet_count                      0
text                               0
tweet_created                      0
tweet_location                  4733
user_timezone                   4820
dtype: int64
tweet_id                         0.000000
airline_sentiment                0.000000
airline_sentiment_confidence     0.000000
negativereason                  37.308743
negativereason_confidence       28.128415
airline                          0.000000
name                             0.000000
retweet_count                    0.000000
text                             0.000000
tweet_created                    0.000000
tweet_location                  32.329235
user_timezone                   32.923497
dtype: float64


In [10]:
### view negativereason column ###
tweets[(tweets["negativereason"].isnull()) & (tweets["airline_sentiment"] == "negative")]

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone


### Every row with a null "negativereason" has a neutral or positive "airline_sentiment"; no negative tweet is missing its reason.

Let's view the negativereason_confidence column.

In [11]:
### negativereason_confidence ###
tweets[(tweets["negativereason_confidence"].isnull())  & (tweets["airline_sentiment"] == "negative")]

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone


### Likewise, every row with a null "negativereason_confidence" has a neutral or positive "airline_sentiment".

In [12]:
### tweet_location ###
tweets[(tweets["tweet_location"].isnull())].head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,cairdin,0,@VirginAmerica What @dhepburn said.,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,jnardino,0,@VirginAmerica plus you've added commercials t...,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,jnardino,0,@VirginAmerica it's really aggressive to blast...,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,jnardino,0,@VirginAmerica and it's a really big bad thing...,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,jnardino,0,@VirginAmerica seriously would pay $30 a fligh...,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada)


In [13]:
tweets[(tweets["tweet_location"].isnull())].groupby("airline_sentiment").size()

airline_sentiment
negative    3142
neutral      962
positive     629
dtype: int64

Most of the tweets with a null tweet location have negative sentiment.

### The "tweet_location" values are most likely null because those users do not share their location.

Below are the counts of tweets with a null user_timezone, grouped by sentiment.

In [14]:
### view null user_timezone ###
tweets[tweets["user_timezone"].isnull()].groupby("airline_sentiment").size()

airline_sentiment
negative    3170
neutral      971
positive     679
dtype: int64

In [15]:
### user_timezone > where the user timezone is null and the tweet location isn't null ###
tzTweet = tweets[(tweets["user_timezone"].isnull()) & (tweets["tweet_location"].notnull())]

tzTweet.groupby("airline_sentiment").size()

airline_sentiment
negative    1362
neutral      430
positive     357
dtype: int64

In [16]:
### user_timezone > where the user timezone is null and the tweet location is null ###
timezoneTweet = tweets[(tweets["user_timezone"].isnull()) & (tweets["tweet_location"].isnull())]

timezoneTweet.groupby("airline_sentiment").size()

airline_sentiment
negative    1808
neutral      541
positive     322
dtype: int64
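
The group-by cells above can be condensed into a single cross-tabulation of sentiment against the two null flags. This is a sketch on made-up rows; the `demo` frame and the `null_flag` helper are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the tweets data.
demo = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "neutral", "positive"],
    "tweet_location": [np.nan, "NY", np.nan, "LA"],
    "user_timezone": [np.nan, np.nan, "Quito", np.nan],
})

def null_flag(series):
    """Label each value as 'missing' or 'present'; the series name is kept."""
    return series.isnull().map({True: "missing", False: "present"})

# Rows: sentiment. Columns: (tweet_location flag, user_timezone flag).
counts = pd.crosstab(
    demo["airline_sentiment"],
    [null_flag(demo["tweet_location"]), null_flag(demo["user_timezone"])],
)
print(counts)
```

Each column of `counts` corresponds to one combination of missing and present values, so the location-only, timezone-only, and both-missing counts appear side by side.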

In [17]:
### check for duplicates ###
tweets.tweet_id.duplicated().sum()

155

In [18]:
tweets[(tweets["tweet_id"].duplicated())].head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone
12040,570308309682675712,negative,1.0,Customer Service Issue,1.0,American,SweeLoTmac,0,@AmericanAir why would I even consider continu...,2015-02-24 11:44:31 -0800,,Quito
12041,570308064185880577,neutral,0.6755,,0.0,American,LancasterPattie,0,@AmericanAir we've already made other arrangem...,2015-02-24 11:43:32 -0800,,
12042,570307949614256128,negative,1.0,Can't Tell,1.0,American,ELLLORRAC,0,@AmericanAir thanks for getting back to me. Bu...,2015-02-24 11:43:05 -0800,,Central Time (US & Canada)
12043,570307948171423745,negative,1.0,Can't Tell,0.6758,American,SweeLoTmac,0,@AmericanAir why would I pay $200 to reactivat...,2015-02-24 11:43:05 -0800,,Quito
12044,570307434113310720,negative,1.0,Late Flight,1.0,American,LauraMolito,0,"@AmericanAir stranded for 24 hours in MIA, Pat...",2015-02-24 11:41:02 -0800,"New York, NY",Atlantic Time (Canada)


In [19]:
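### remove tweet_id duplicates ###
### note: keep=False drops every row whose tweet_id is repeated, including the first occurrence ###
tweets = tweets.drop_duplicates(subset=["tweet_id"], keep=False)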

### check for duplicates ###
tweets.tweet_id.duplicated().sum()

0
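
`drop_duplicates` behaves differently depending on its `keep` argument. The sketch below, on a made-up frame, contrasts `keep=False` (used above, which discards every row whose `tweet_id` repeats) with `keep="first"` (which retains one copy of each). The `demo` frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame in which tweet_id 2 appears twice.
demo = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "text": ["a", "b", "b", "c"],
})

# keep=False removes all rows that share a duplicated tweet_id.
drop_all = demo.drop_duplicates(subset=["tweet_id"], keep=False)

# keep="first" keeps the first row for each tweet_id.
keep_one = demo.drop_duplicates(subset=["tweet_id"], keep="first")

print(drop_all["tweet_id"].tolist())  # ids 1 and 3 remain
print(keep_one["tweet_id"].tolist())  # ids 1, 2 and 3 remain
```

Which option is appropriate depends on whether the repeated tweets should be treated as unreliable or simply as redundant copies.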

# 3. Data Definition <a class="anchor" id="DataDefinition"></a>

Below are the main columns I am working with.
                    
* <b>airline_sentiment</b>
    * Sentiment expressed toward the airline: positive, neutral, or negative
    * Object data type
* <b>negativereason</b>
    * Reason for the negative sentiment
    * Object data type
* <b>airline</b>
    * Name of the airline
    * Object data type
* <b>text</b>
    * The text of the tweet
    * Object data type

# 4. Data Cleaning <a class="anchor" id="DataCleaning"></a>

The following needs to be done to preprocess the data:

1. Make text lowercase
2. Remove punctuation, URLs, numbers, whitespace, and emojis
3. Lemmatize
4. Remove stopwords ("this", "is", etc.)
5. Tokenize words


After I preprocess the data, I will vectorize the text.
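
The cleaning steps listed above can be sketched as one function. This is a simplified illustration using only the standard library: the `STOPWORDS` set is a tiny hypothetical stand-in for NLTK's English list, whitespace splitting stands in for `word_tokenize`, the URL pattern is an assumption, and lemmatization is omitted.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list.
STOPWORDS = {"this", "is", "a", "the", "to", "and"}

def clean_tweet(text):
    """Lowercase, strip URLs, digits and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs (assumed pattern)
    text = re.sub(r"\d+", "", text)           # remove numbers
    text = re.sub(r"[^a-z\s]+", "", text)     # remove punctuation and symbols
    tokens = text.split()                     # split() also collapses whitespace
    return [t for t in tokens if t not in STOPWORDS]

print(clean_tweet("@VirginAmerica this is a GREAT flight!! 10/10 http://t.co/abc"))
```

The URL step runs before punctuation removal so that the link is removed as a whole rather than left behind as a run of letters.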

### Lowercase
Convert the "text" column to lowercase so that later steps treat differently cased words as the same token.

In [20]:
tweets["text"] = tweets["text"].str.lower()
tweets["text"].head()

0                  @virginamerica what @dhepburn said.
1    @virginamerica plus you've added commercials t...
2    @virginamerica i didn't today... must mean i n...
3    @virginamerica it's really aggressive to blast...
4    @virginamerica and it's a really big bad thing...
Name: text, dtype: object

### Remove numbers

In [21]:
### remove numbers ###
tweets["text"] = tweets["text"].str.replace(r'\d+', '', regex=True)
tweets["text"].head()

0                  @virginamerica what @dhepburn said.
1    @virginamerica plus you've added commercials t...
2    @virginamerica i didn't today... must mean i n...
3    @virginamerica it's really aggressive to blast...
4    @virginamerica and it's a really big bad thing...
Name: text, dtype: object

### Punctuation

In [22]:
### remove punctuations function ###
def remove_punc(text):
    # keep only ASCII letters, digits, and whitespace
    words_wo_punct = re.sub(r"[^A-Za-z0-9\s]+", "", text)
    return words_wo_punct

In [23]:
### remove all of the punctuations ###
tweets["text"] = tweets["text"].apply(lambda x: remove_punc(x))
tweets["text"].head()

0                     virginamerica what dhepburn said
1    virginamerica plus youve added commercials to ...
2    virginamerica i didnt today must mean i need t...
3    virginamerica its really aggressive to blast o...
4    virginamerica and its a really big bad thing a...
Name: text, dtype: object

### Remove whitespace

In [24]:
### remove leading and trailing whitespace ###
tweets["text"] = tweets["text"].str.strip()
tweets["text"].head()

0                     virginamerica what dhepburn said
1    virginamerica plus youve added commercials to ...
2    virginamerica i didnt today must mean i need t...
3    virginamerica its really aggressive to blast o...
4    virginamerica and its a really big bad thing a...
Name: text, dtype: object

### Remove Emojis

In [25]:
import emoji

### demojize replaces each emoji with its text name (e.g. ":thumbs_up:") rather than deleting it ###
tweets["text"] = tweets["text"].apply(lambda x: emoji.demojize(x))
tweets["text"].head()

0                     virginamerica what dhepburn said
1    virginamerica plus youve added commercials to ...
2    virginamerica i didnt today must mean i need t...
3    virginamerica its really aggressive to blast o...
4    virginamerica and its a really big bad thing a...
Name: text, dtype: object
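
Because the punctuation step above keeps only ASCII letters, digits, and whitespace, it already strips emoji characters before this cell runs. A small check of that, applying the same pattern as `remove_punc` to a made-up string (the `sample` text is hypothetical):

```python
import re

# Made-up tweet text containing one emoji (U+1F600, grinning face).
sample = "great flight \U0001F600 thanks"

# Same character class as remove_punc: anything outside it is deleted.
stripped = re.sub(r"[^A-Za-z0-9\s]+", "", sample)

print(stripped.split())
```

The emoji is gone after the substitution, which leaves only an extra space that tokenization later ignores.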

### Lemmatizing

Reduce each word to its base (dictionary) form.

In [26]:
### lemmatizer ###
lemmatizer = WordNetLemmatizer()

def word_lemmatizer(text):
    # lemmatize() defaults to pos="n", so every token is treated as a noun
    word_list = word_tokenize(text)
    lem_text = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    return lem_text

In [27]:
tweets["text"] = tweets["text"].apply(lambda x: word_lemmatizer(x))
tweets.text.head()

0                     virginamerica what dhepburn said
1    virginamerica plus youve added commercial to t...
2    virginamerica i didnt today must mean i need t...
3    virginamerica it really aggressive to blast ob...
4    virginamerica and it a really big bad thing ab...
Name: text, dtype: object

### Remove Stopwords and tokenize Words

In [28]:
### build the stopword set once instead of on every call ###
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = word_tokenize(text)
    tokens_wo_stopwords = [t for t in tokens if t not in english_stopwords]
    return tokens_wo_stopwords

tweets["text"] = tweets["text"].apply(lambda x: remove_stopwords(str(x)))
tweets["text"].head()

0                      [virginamerica, dhepburn, said]
1    [virginamerica, plus, youve, added, commercial...
2    [virginamerica, didnt, today, must, mean, need...
3    [virginamerica, really, aggressive, blast, obn...
4             [virginamerica, really, big, bad, thing]
Name: text, dtype: object

## All of the tweets are in English

### Vectorization <a class="anchor" id="Vectorization"></a>


In [29]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

### CountVectorizer

In [30]:
tweets["text"].dtypes

dtype('O')

In [31]:
### join the token lists back into space-separated strings ###
tweets["text"] = tweets["text"].str.join(" ")

### change sentiment to int ###
sentiment_to_int = {"positive": 1, "neutral": 0, "negative": -1}
tweets['airline_sentiment'] = tweets['airline_sentiment'].map(sentiment_to_int)

In [32]:
### vectorize the text, then train test split ###
cv = CountVectorizer(max_df=0.70)
X = cv.fit_transform(tweets.text)
y = tweets['airline_sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y)

#### Get base model with Logistic Regression 

In [33]:
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

### print accuracy ###
print("Logistic Regression Accuracy: %0.2f%%" % (100 * logReg.score(X_test, y_test)))

training_accuracy = logReg.score(X_train, y_train)
test_accuracy = logReg.score(X_test, y_test)

print("Accuracy on training data: %0.2f" % (training_accuracy))
print("Accuracy on test data:     %0.2f" % (test_accuracy))


Logistic Regression Accuracy: 77.98%
Accuracy on training data: 0.94
Accuracy on test data:     0.78
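
In the cells above, the vectorizer is fitted on all tweets before the split, so the vocabulary is learned partly from the test rows, and the split itself is unseeded. One possible alternative, sketched here on a made-up corpus, wraps the vectorizer and classifier in a scikit-learn `Pipeline` so that fitting only sees the training rows. The texts, labels, step names, and `random_state` value are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive, -1 = negative, 0 = neutral.
texts = [
    "great flight thanks", "love the crew", "best airline ever", "amazing service today",
    "flight delayed again", "lost my bag", "worst service ever", "rude gate agent",
    "what time is boarding", "is wifi available", "flight leaves at noon", "where is gate ten",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1, 0, 0, 0, 0]

# Split the raw text first; a fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer on the training texts only.
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

print("Vocabulary size:", len(model.named_steps["vectorizer"].vocabulary_))
print("Test accuracy:", model.score(X_test, y_test))
```

With this layout, words that appear only in the test texts never enter the vocabulary, which gives a more honest estimate of how the model handles unseen tweets.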


### Tf-idf 

In [34]:
td = TfidfVectorizer(max_df=0.70)
X = td.fit_transform(tweets['text'])
y = tweets['airline_sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [35]:
logReg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

### print accuracy ###
print("Logistic Regression Accuracy: %0.2f%%" % (100 * logReg.score(X_test, y_test)))

training_accuracy = logReg.score(X_train, y_train)
test_accuracy = logReg.score(X_test, y_test)

print("Accuracy on training data: %0.2f" % (training_accuracy))
print("Accuracy on test data:     %0.2f" % (test_accuracy))

Logistic Regression Accuracy: 78.01%
Accuracy on training data: 0.88
Accuracy on test data:     0.78


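TF-IDF performed marginally better: test accuracy was essentially tied (78.01% vs. 77.98% for CountVectorizer), while training accuracy was closer to test accuracy (0.88 vs. 0.94), which suggests less overfitting. Because each run used a different random split, the difference in test accuracy is too small to be conclusive.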

In [36]:
### TF-IDF Scores ###
df = pd.DataFrame(X[0].T.todense(), index=td.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
df

Unnamed: 0,TF-IDF
dhepburn,0.817802
said,0.450446
virginamerica,0.358187
aa,0.000000
owed,0.000000
...,...
getyouracttogether,0.000000
getyourlife,0.000000
gf,0.000000
gfc,0.000000
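
The scores in the table come from scikit-learn's default TF-IDF weighting, which (with `smooth_idf=True` and `norm="l2"`) multiplies each term count by `ln((1 + n) / (1 + df)) + 1` and then scales the document vector to unit length. Below is a minimal hand computation on a made-up three-document corpus; the `docs` list and helper names are hypothetical, and the resulting numbers are not expected to match the table.

```python
import math

# Hypothetical tokenized corpus; the first document mirrors the first tweet.
docs = [
    ["virginamerica", "dhepburn", "said"],
    ["virginamerica", "late", "flight"],
    ["virginamerica", "said", "thanks"],
]
n_docs = len(docs)

def doc_freq(term):
    """Number of documents that contain the term."""
    return sum(term in doc for doc in docs)

def idf(term):
    """Smoothed inverse document frequency, as in scikit-learn's default."""
    return math.log((1 + n_docs) / (1 + doc_freq(term))) + 1

# Raw tf-idf for the first document: term count times idf.
first = docs[0]
raw = {term: first.count(term) * idf(term) for term in set(first)}

# L2-normalize so the document vector has unit length.
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf = {term: value / norm for term, value in raw.items()}

for term, weight in sorted(tfidf.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.4f}")
```

The rarest term gets the largest weight and the term present in every document gets the smallest, which is the same ordering seen in the table (`dhepburn`, then `said`, then `virginamerica`).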
