# Airline Sentiment Analysis

## Table of Contents:
* [Data Collection](#DataCollection)
* [Data Organization](#DataOrganization)
* [Data Definition](#DataDefinition)
* [Data Cleaning](#DataCleaning)
    * [Vectorization](#Vectorization)

In [1]:
#imports
import pandas as pd
#this will deal with punctuation
import string
import re
#from string import digits
import nltk #using the Natural Language Toolkit
from nltk.corpus import stopwords
#Lemmatization
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# 1. Data Collection <a class="anchor" id="DataCollection"></a>

In [2]:
#get data
tweets = pd.read_csv("../Data/Tweets.csv")

# 2. Data Organization <a class="anchor" id="DataOrganization"></a>

In [3]:
#view the heading of data
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [4]:
#check shape
tweets.shape

(14640, 15)

In [5]:
#view summary of data
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

In [6]:
#check for null values
tweets.isnull().sum()

tweet_id                            0
airline_sentiment                   0
airline_sentiment_confidence        0
negativereason                   5462
negativereason_confidence        4118
airline                             0
airline_sentiment_gold          14600
name                                0
negativereason_gold             14608
retweet_count                       0
text                                0
tweet_coord                     13621
tweet_created                       0
tweet_location                   4733
user_timezone                    4820
dtype: int64

In [7]:
#percentage of null values
((tweets.isnull().sum() * 100)/ len(tweets.index))

tweet_id                         0.000000
airline_sentiment                0.000000
airline_sentiment_confidence     0.000000
negativereason                  37.308743
negativereason_confidence       28.128415
airline                          0.000000
airline_sentiment_gold          99.726776
name                             0.000000
negativereason_gold             99.781421
retweet_count                    0.000000
text                             0.000000
tweet_coord                     93.039617
tweet_created                    0.000000
tweet_location                  32.329235
user_timezone                   32.923497
dtype: float64

The "airline_sentiment_gold", "negativereason_gold", and "tweet_coord" columns are each missing over 90% of their values, so I will drop them.

In [8]:
tweets = tweets.drop(columns=["airline_sentiment_gold" , "negativereason_gold", "tweet_coord"])
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,cairdin,0,@VirginAmerica What @dhepburn said.,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,jnardino,0,@VirginAmerica plus you've added commercials t...,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,yvonnalynn,0,@VirginAmerica I didn't today... Must mean I n...,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,jnardino,0,@VirginAmerica it's really aggressive to blast...,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,jnardino,0,@VirginAmerica and it's a really big bad thing...,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
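
The 90% cutoff used above can also be applied programmatically instead of naming the columns by hand. A minimal sketch, assuming a hypothetical helper `drop_sparse_columns` and a toy DataFrame (neither is part of the original notebook):

```python
import pandas as pd

def drop_sparse_columns(df, threshold=0.90):
    """Drop columns whose fraction of null values exceeds `threshold` (hypothetical helper)."""
    null_fraction = df.isnull().mean()  # per-column share of nulls, between 0.0 and 1.0
    sparse = null_fraction[null_fraction > threshold].index.tolist()
    return df.drop(columns=sparse), sparse

# Toy data (made up): "gold" is about 91% null, "location" is about 45% null.
demo = pd.DataFrame({
    "text": list("abcdefghijk"),
    "gold": [None] * 10 + ["x"],
    "location": [None] * 5 + ["y"] * 6,
})
cleaned, dropped = drop_sparse_columns(demo)
print(dropped)
print(cleaned.columns.tolist())
```

On the real data this would be called as `drop_sparse_columns(tweets)`; the threshold is a judgment call and should be chosen per dataset.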


In [9]:
#check for more null values
print(tweets.isnull().sum())

print(((tweets.isnull().sum() * 100)/ len(tweets.index)))

tweet_id                           0
airline_sentiment                  0
airline_sentiment_confidence       0
negativereason                  5462
negativereason_confidence       4118
airline                            0
name                               0
retweet_count                      0
text                               0
tweet_created                      0
tweet_location                  4733
user_timezone                   4820
dtype: int64
tweet_id                         0.000000
airline_sentiment                0.000000
airline_sentiment_confidence     0.000000
negativereason                  37.308743
negativereason_confidence       28.128415
airline                          0.000000
name                             0.000000
retweet_count                    0.000000
text                             0.000000
tweet_created                    0.000000
tweet_location                  32.329235
user_timezone                   32.923497
dtype: float64


In [10]:
#view negativereason column
tweets[(tweets["negativereason"].isnull()) & (tweets["airline_sentiment"] == "negative")]

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone


### The null values in the "negativereason" column occur because the "airline_sentiment" for those rows is neutral or positive.

Let's view the negativereason_confidence column

In [11]:
#negativereason_confidence 
tweets[(tweets["negativereason_confidence"].isnull())  & (tweets["airline_sentiment"] == "negative")]

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone


### The null values in the "negativereason_confidence" column likewise occur only where the "airline_sentiment" is neutral or positive.
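
Both checks above follow the same pattern: count the nulls in a column per sentiment class and confirm that the negative class has none. A small sketch of that check, using a hypothetical helper `null_counts_by_sentiment` and made-up rows rather than the real dataset:

```python
import pandas as pd

def null_counts_by_sentiment(df, column):
    """Count null values of `column` within each airline_sentiment class (hypothetical helper)."""
    return df[column].isnull().groupby(df["airline_sentiment"]).sum()

# Toy rows (made up) that mimic the structure of the tweets DataFrame.
demo = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "neutral", "positive", "positive"],
    "negativereason": ["Bad Flight", "Can't Tell", None, None, None],
})
counts = null_counts_by_sentiment(demo, "negativereason")
print(counts)
```

A count of zero for the negative class means every negative tweet has a reason recorded.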

In [12]:
#tweet_location
tweets[(tweets["tweet_location"].isnull())].head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,cairdin,0,@VirginAmerica What @dhepburn said.,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,jnardino,0,@VirginAmerica plus you've added commercials t...,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,jnardino,0,@VirginAmerica it's really aggressive to blast...,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,jnardino,0,@VirginAmerica and it's a really big bad thing...,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,jnardino,0,@VirginAmerica seriously would pay $30 a fligh...,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada)


In [13]:
tweets[(tweets["tweet_location"].isnull())].groupby("airline_sentiment").size()

airline_sentiment
negative    3142
neutral      962
positive     629
dtype: int64

Most of the null tweet locations come from tweets with negative sentiment.

### The "tweet_location" values are most likely null because those users did not have location sharing turned on.

Below is a breakdown by sentiment of the rows where user_timezone is null.

In [14]:
#view null user_timezone
tweets[tweets["user_timezone"].isnull()].groupby("airline_sentiment").size()

airline_sentiment
negative    3170
neutral      971
positive     679
dtype: int64

In [15]:
#user_timezone > where the user timezone is null and the tweet location isn't null 
tzTweet = tweets[(tweets["user_timezone"].isnull()) & (tweets["tweet_location"].notnull())]

tzTweet.groupby("airline_sentiment").size()

airline_sentiment
negative    1362
neutral      430
positive     357
dtype: int64

In [16]:
#user_timezone > where the user timezone is null and the tweet location is null 
timezoneTweet = tweets[(tweets["user_timezone"].isnull()) & (tweets["tweet_location"].isnull())]

timezoneTweet.groupby("airline_sentiment").size()

airline_sentiment
negative    1808
neutral      541
positive     322
dtype: int64

In [17]:
#check for duplicates
tweets.text.duplicated().sum()

213

In [18]:
tweets[(tweets["text"].duplicated())].head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone
331,568605449659895808,positive,0.6482,,,Virgin America,SuuperG,0,@VirginAmerica Thanks!,2015-02-19 18:57:58 -0800,Wandering So-Cal-ian,Pacific Time (US & Canada)
515,570299889688702976,positive,0.6634,,,United,nydia376,0,@united thanks,2015-02-24 11:11:04 -0800,USA,
1477,569705563287896064,positive,0.6629,,,United,tiamariaroxs,0,@united thank you!,2015-02-22 19:49:25 -0800,,
1864,569545805826166784,positive,0.3512,,0.0,United,worldknits,0,@united thank you,2015-02-22 09:14:36 -0800,"Fredericksburg, VA",Eastern Time (US & Canada)
1938,569513703722393601,positive,1.0,,,United,Angry_VBK,0,@united thank you,2015-02-22 07:07:02 -0800,,
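
The notebook only counts the duplicated texts and does not decide what to do with them. If they were to be inspected or removed, one possible approach (a sketch on made-up rows, not the author's method) is:

```python
import pandas as pd

# Toy rows (made up): two tweets share the same text.
demo = pd.DataFrame({
    "text": ["@united thank you", "@united thank you", "@united late again"],
    "airline_sentiment": ["positive", "positive", "negative"],
})

# keep=False marks every copy, including the first, so all duplicates can be inspected together.
all_copies = demo[demo["text"].duplicated(keep=False)]

# Keep the first occurrence of each text and drop the later repeats.
deduped = demo.drop_duplicates(subset="text", keep="first")

print(len(all_copies), len(deduped))
```

Whether short repeated replies such as "thank you" should be dropped depends on the modelling goal, since they are genuine tweets from different users.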


# 3. Data Definition <a class="anchor" id="DataDefinition"></a>

                    
 1   airline_sentiment              
 2   airline_sentiment_confidence  
 3   negativereason                 
 4   negativereason_confidence    
 5   airline                        
 6   airline_sentiment_gold         
 7   name                          
 8   negativereason_gold            
 9   retweet_count                 
 10  text                           
 11  tweet_coord                  
 12  tweet_created                 
 13  tweet_location                
 14  user_timezone   

# 4. Data Cleaning <a class="anchor" id="DataCleaning"></a>

The following steps are needed to preprocess the data:

1. Make text lowercase
2. Remove punctuation, URLs, and names
3. Tokenize
4. Remove stopwords ("this", "is", etc.)
5. Lemmatize (reduce each word to its base form)
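
The steps above can be sketched as a single function. This is a simplified illustration using only the standard library: the regular expressions, the tiny `DEMO_STOPWORDS` set, and the `preprocess` name are assumptions for the example, and lemmatization is left out because it needs NLTK data.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
MENTION_RE = re.compile(r"@\w+")                # @user names
NON_ALPHA_RE = re.compile(r"[^a-z\s]+")         # digits, punctuation, emoji
DEMO_STOPWORDS = {"this", "is", "a", "the", "to", "and", "what"}  # tiny stand-in list

def preprocess(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, strip URLs/mentions/punctuation, tokenize, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub("", text)
    tokens = text.split()  # whitespace tokenization
    return [t for t in tokens if t not in stopwords]

result = preprocess("@VirginAmerica This is GREAT!! See http://t.co/abc123 :)")
print(result)
```

Removing URLs and mentions before stripping punctuation matters: otherwise `http://t.co/abc123` collapses into a meaningless token instead of disappearing.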

### Lowercase
Convert the "text" column to lowercase so that the same word is treated identically regardless of case.

In [19]:
tweets["text"] = tweets["text"].str.lower()
tweets["text"].head()

0                  @virginamerica what @dhepburn said.
1    @virginamerica plus you've added commercials t...
2    @virginamerica i didn't today... must mean i n...
3    @virginamerica it's really aggressive to blast...
4    @virginamerica and it's a really big bad thing...
Name: text, dtype: object

### Remove numbers

In [20]:
#remove numbers
tweets["text"] = tweets["text"].str.replace(r'\d+', '', regex=True)
tweets["text"].head()

0                  @virginamerica what @dhepburn said.
1    @virginamerica plus you've added commercials t...
2    @virginamerica i didn't today... must mean i n...
3    @virginamerica it's really aggressive to blast...
4    @virginamerica and it's a really big bad thing...
Name: text, dtype: object

### Punctuation

In [21]:
##REMOVE PUNCTUATIONS
def remove_punc(text):
    words_wo_punct = re.sub(r"[^A-Za-z0-9\s]+", "", text)
    return words_wo_punct

In [22]:
#remove all of the punctuations
tweets["text"] = tweets["text"].apply(lambda x: remove_punc(x))
tweets["text"].head()

0                     virginamerica what dhepburn said
1    virginamerica plus youve added commercials to ...
2    virginamerica i didnt today must mean i need t...
3    virginamerica its really aggressive to blast o...
4    virginamerica and its a really big bad thing a...
Name: text, dtype: object

### Remove whitespace

In [23]:
#remove leading and trailing whitespace
tweets["text"] = tweets["text"].str.strip()
tweets["text"].head()

0                     virginamerica what dhepburn said
1    virginamerica plus youve added commercials to ...
2    virginamerica i didnt today must mean i need t...
3    virginamerica its really aggressive to blast o...
4    virginamerica and its a really big bad thing a...
Name: text, dtype: object
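
`str.strip()` only trims the ends of each string. Removing URLs, digits, or punctuation can leave runs of spaces in the middle of a tweet; if those should be collapsed as well, a sketch on a toy Series (made-up values) looks like this:

```python
import pandas as pd

# Toy Series (made up) with extra whitespace inside and around the text.
s = pd.Series(["  hello   world  ", "a\t b"])

# Collapse every run of whitespace to a single space, then trim the ends.
collapsed = s.str.replace(r"\s+", " ", regex=True).str.strip()
print(collapsed.tolist())
```

The later tokenization step splits on whitespace anyway, so this mainly matters when the cleaned text is inspected or stored as a string.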

### Remove Emojis

In [24]:
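#the unused UNICODE_EMOJI import was dropped; it is not available in recent releases of the emoji package
import emoji

#demojize converts emoji to text aliases such as ":thumbs_up:" rather than deleting them
#note: the punctuation step above already stripped every non-ASCII character, so no emoji remain at this point
tweets["text"] = tweets["text"].apply(lambda x: emoji.demojize(x))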
tweets["text"].head()

0                     virginamerica what dhepburn said
1    virginamerica plus youve added commercials to ...
2    virginamerica i didnt today must mean i need t...
3    virginamerica its really aggressive to blast o...
4    virginamerica and its a really big bad thing a...
Name: text, dtype: object
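
The output is unchanged because the punctuation pattern used earlier keeps only ASCII letters, digits, and whitespace, so emoji are already gone by this step. A quick standard-library check with the same pattern (the sample string is made up):

```python
import re

# Same pattern as remove_punc: drop everything except ASCII letters, digits, whitespace.
PUNCT_RE = re.compile(r"[^A-Za-z0-9\s]+")

sample = "great flight \U0001F600 thanks"  # contains one emoji (grinning face)
cleaned = PUNCT_RE.sub("", sample)
print(cleaned.split())
```

If emoji are meant to carry sentiment into the model, the conversion to text aliases would have to run before the punctuation step, and the pattern would also need to keep the underscore used in aliases.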

### Lemmatizing

Reduce each word to its base (dictionary) form.

In [25]:
#lemmatizer
lemmatizer = WordNetLemmatizer()

def word_lemmatizer(text):
    word_list = nltk.word_tokenize(text)
    lem_text = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    return lem_text

In [26]:
tweets["text"] = tweets["text"].apply(lambda x: word_lemmatizer(x))
tweets.text.head()

0                     virginamerica what dhepburn said
1    virginamerica plus youve added commercial to t...
2    virginamerica i didnt today must mean i need t...
3    virginamerica it really aggressive to blast ob...
4    virginamerica and it a really big bad thing ab...
Name: text, dtype: object

### Remove Stopwords and tokenize Words

In [27]:
#build the stopword set once; set membership is faster than scanning a list for every token
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = word_tokenize(text)
    tokens_wo_stopwords = [t for t in tokens if t not in english_stopwords]
    return tokens_wo_stopwords

tweets["text"] = tweets["text"].apply(lambda x: remove_stopwords(str(x)))
tweets["text"].head()

0                      [virginamerica, dhepburn, said]
1    [virginamerica, plus, youve, added, commercial...
2    [virginamerica, didnt, today, must, mean, need...
3    [virginamerica, really, aggressive, blast, obn...
4             [virginamerica, really, big, bad, thing]
Name: text, dtype: object

## All of the tweets are in English

### Vectorization <a class="anchor" id="Vectorization"></a>


In [28]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

### CountVec

In [29]:
tweets["text"].dtypes

dtype('O')

In [30]:
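#join each token list back into one space-separated string
#(astype(str) would keep the brackets and quotes of the list repr in the text)
tweets["text"] = tweets["text"].apply(" ".join)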

#change sentiment to int

def changeSentiment(sentiment):
    if  sentiment == "positive":
        return 1
    elif sentiment == "neutral":
        return 0
    elif sentiment == "negative":
        return -1
    
tweets['airline_sentiment'] = tweets['airline_sentiment'].apply(lambda x : changeSentiment(x))
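
The same label encoding can be written as a dictionary lookup with `Series.map`. A minimal sketch on a toy Series; the `SENTIMENT_TO_INT` name is an assumption for the example:

```python
import pandas as pd

# Same encoding as changeSentiment: positive -> 1, neutral -> 0, negative -> -1.
SENTIMENT_TO_INT = {"positive": 1, "neutral": 0, "negative": -1}

labels = pd.Series(["positive", "neutral", "negative", "negative"])  # made-up labels
encoded = labels.map(SENTIMENT_TO_INT)
print(encoded.tolist())
```

With `map`, any label missing from the dictionary becomes NaN, which makes unexpected values easy to spot with `isnull()`.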

In [31]:
#train test split

cv = CountVectorizer(max_df=0.70)
X = cv.fit_transform(tweets.text)
y = tweets['airline_sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y)
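
The cell above fits the vectorizer on every tweet before splitting, so the vocabulary and document frequencies are learned partly from the test rows, and the split changes on each run because no seed is set. A sketch of an alternative (not the notebook's approach) that splits the raw text first, fits on the training part only, and fixes `random_state`; the toy texts and labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (made up): 4 positive, 4 negative, 4 neutral texts.
texts = [
    "great flight crew", "lovely friendly staff", "smooth easy boarding", "thanks wonderful service",
    "flight delayed again", "lost my luggage", "rude gate agent", "terrible customer support",
    "what time departure", "need seat change", "where is baggage claim", "checking booking status",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class balance, random_state makes it repeatable.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

cv = CountVectorizer(max_df=0.70)
X_train = cv.fit_transform(train_texts)  # vocabulary learned from training text only
X_test = cv.transform(test_texts)        # test text mapped onto that same vocabulary

print(X_train.shape, X_test.shape)
```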

#### Baseline model with Logistic Regression

In [32]:
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
#print accuracy

print("Logistic Regression Accuracy: %0.2f%%" % (100 * logReg.score(X_test, y_test)))

training_accuracy = logReg.score(X_train, y_train)
test_accuracy = logReg.score(X_test, y_test)

print("Accuracy on training data: %0.2f" % (training_accuracy))
print("Accuracy on test data:     %0.2f" % (test_accuracy))


Logistic Regression Accuracy: 78.63%
Accuracy on training data: 0.94
Accuracy on test data:     0.79


### Tf-idf 

In [33]:
td = TfidfVectorizer(max_df=0.70)
X = td.fit_transform(tweets['text'])
y = tweets['airline_sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [34]:
logReg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
#print accuracy

print("Logistic Regression Accuracy: %0.2f%%" % (100 * logReg.score(X_test, y_test)))

training_accuracy = logReg.score(X_train, y_train)
test_accuracy = logReg.score(X_test, y_test)

print("Accuracy on training data: %0.2f" % (training_accuracy))
print("Accuracy on test data:     %0.2f" % (test_accuracy))

Logistic Regression Accuracy: 78.39%
Accuracy on training data: 0.88
Accuracy on test data:     0.78


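On these runs the two vectorizers score almost the same on the test set: 78.63% with CountVectorizer versus 78.39% with TF-IDF. TF-IDF shows a smaller gap between training and test accuracy (0.88 vs. 0.78, compared with 0.94 vs. 0.79), which suggests less overfitting. Each run used a different random train/test split, so the small difference in test accuracy may simply reflect the split.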
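
A like-for-like comparison needs both vectorizers evaluated on the same split. A minimal sketch, assuming a hypothetical helper `compare_vectorizers`, a fixed `random_state`, and a made-up toy corpus (the numbers it prints say nothing about the real dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def compare_vectorizers(texts, labels, random_state=42):
    """Return {name: (train_accuracy, test_accuracy)} for both vectorizers on one shared split."""
    train_texts, test_texts, y_train, y_test = train_test_split(
        texts, labels, stratify=labels, random_state=random_state
    )
    candidates = [
        ("count", CountVectorizer(max_df=0.70)),
        ("tfidf", TfidfVectorizer(max_df=0.70)),
    ]
    scores = {}
    for name, vectorizer in candidates:
        # The pipeline fits the vectorizer on the training text only.
        model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
        model.fit(train_texts, y_train)
        scores[name] = (model.score(train_texts, y_train), model.score(test_texts, y_test))
    return scores

# Toy corpus (made up): 4 positive and 4 negative texts.
texts = [
    "great flight crew", "lovely friendly staff", "smooth easy boarding", "thanks wonderful service",
    "flight delayed again", "lost my luggage", "rude gate agent", "terrible customer support",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
scores = compare_vectorizers(texts, labels)
print(scores)
```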

In [35]:
#TF-IDF Scores
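#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
df = pd.DataFrame(X[0].T.todense(), index=td.get_feature_names_out(), columns=["TF-IDF"])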
df = df.sort_values('TF-IDF', ascending=False)
df

Unnamed: 0,TF-IDF
dhepburn,0.818278
said,0.448619
virginamerica,0.359390
owen,0.000000
owes,0.000000
...,...
gfc,0.000000
gg,0.000000
ggqzqd,0.000000
ggreenwald,0.000000
