# Project Description – Twitter US Airline Sentiment

## Data Description:

A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

## Dataset:

The project is from a dataset from Kaggle.

Link to the Kaggle project site: https://www.kaggle.com/crowdflower/twitter-airline-sentiment The dataset has to be downloaded from the above Kaggle website.

The dataset has the following columns:
    
- tweet_id                             
- airline_sentiment                    
- airline_sentiment_confidence         
- negativereason                       
- negativereason_confidence            
- airline                              
- airline_sentiment_gold               
- name
- negativereason_gold
- retweet_count
- text
- tweet_coord
- tweet_created
- tweet_location
- user_timezone

## Objective:

To implement the techniques learnt as a part of the course.

## Learning Outcomes:

- Basic understanding of text pre-processing.
- What to do after text pre-processing:
    - Bag of words
    - Tf-idf
- Build the classification model.
- Evaluate the Model.


In [101]:
SEED=1906

In [102]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
pd.set_option('display.max_colwidth',None)

In [103]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize  # Import Tokenizer.
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer.

In [104]:
import re
import unicodedata
import contractions
from bs4 import BeautifulSoup

In [105]:
FILE_NAME="Tweets.csv"

In [106]:
# Loading data into pandas dataframe
data = pd.read_csv(FILE_NAME)

In [107]:
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials to the experience... tacky.,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I need to take another trip!,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse",,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing about it,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [108]:
#Determine if there are any missing values. The number of variables, whether numeric and/or string and the total number of rows.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

In [109]:
data2 = data.drop(['tweet_id','airline_sentiment_confidence','negativereason','negativereason_confidence','airline', 'airline_sentiment_gold','name','negativereason_gold','retweet_count','tweet_coord','tweet_created','tweet_location','user_timezone'], axis=1)

In [110]:
data2.shape

(14640, 2)

In [111]:
data2.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials to the experience... tacky.
2,neutral,@VirginAmerica I didn't today... Must mean I need to take another trip!
3,negative,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse"
4,negative,@VirginAmerica and it's a really big bad thing about it


### Data Pre-processing:

- Remove html tags.
- Replace contractions in string. (e.g. replace I'm --> I am) and so on.\
- Remove numbers.
- Tokenization
- To remove Stopwords.
- Lemmatized data

We have used NLTK library to tokenize words , remove stopwords and lemmatize the remaining words.

### Text pre-processing: Data preparation. (20 Marks)
- a. Html tag removal.
- b. Tokenization.
- c. Remove the numbers.
- d. Removal of Special Characters and Punctuations.
- e. Conversion to lowercase.
- f. Lemmatize or stemming.
- g. Join the words in the list to convert back to text string in the dataframe. (So that each row
contains the data in text format.)
- h. Print first 5 rows of data after pre-processing.

In [112]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def tokenize_text(text):
    return nltk.word_tokenize(text)

def remove_numbers(text):
    text = re.sub(r'\d+', '', text)
    return text

def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)


def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words



def lemmatize_list(words):
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
        
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords:
            new_words.append(word)
    return new_words


def normalize(text):
    text = strip_html(text)
    words = tokenize_text(text)
    text = remove_numbers(text)
    text = replace_contractions(text)
    
    
    words = remove_non_ascii(words)
    words = remove_punctuation(words)
    words = to_lowercase(words)
    words = lemmatize_list(words)
    words = remove_stopwords(words)
    return ' '.join(words)

In [113]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kareemstreek/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kareemstreek/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [114]:
from nltk.corpus import stopwords   
stopwords = stopwords.words('english')

customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

stopwords = list(set(stopwords) - set(customlist))

lemmatizer = WordNetLemmatizer()

In [115]:
data2.head(5).apply(lambda row: normalize(row['text']), axis=1)

0                                                                   virginamerica dhepburn say
1                                          virginamerica plus add commercials experience tacky
2                                      virginamerica nt today must mean need take another trip
3    virginamerica really aggressive blast obnoxious entertainment guests face little recourse
4                                                           virginamerica really big bad thing
dtype: object

### Vectorization: (10 Marks)
- a. Use CountVectorizer.
- b. Use TfidfVectorizer.

In [116]:
# Use CountVectorizer object to create a matrix will all the words in every tweet
# For this analysis will use the default parameters for CountVectorizer
tweet_transformer = CountVectorizer().fit(data['text'])

# Print total number of vocabulary words
print(len(tweet_transformer.vocabulary_))

15051


In [117]:
# Check an example in detail and take the 4th tweet in the dataset and see its vector representation
# For reference this is the text of the tweet: 'realli aggress blast obnoxi entertain guest face littl recours'
tweet_3 = data['text'][3]

vector_3 = tweet_transformer.transform([tweet_3])
print(vector_3)
print(vector_3.shape)

# There are 9 unique words in this message and the second number (e.g. 124) will allow one to see what word that is

  (0, 2054)	1
  (0, 2263)	1
  (0, 3070)	1
  (0, 5455)	1
  (0, 5740)	1
  (0, 6733)	1
  (0, 6868)	1
  (0, 7381)	1
  (0, 7685)	1
  (0, 8392)	1
  (0, 9726)	1
  (0, 11020)	1
  (0, 11078)	1
  (0, 13167)	1
  (0, 13326)	1
  (0, 14273)	1
  (0, 14953)	1
(1, 15051)


In [118]:
# Apply the transformer in the entire tweets series
tweet_bag_of_words = tweet_transformer.transform(data['text'])
# Check the shape and number of non-zero ocurrences
print('Shape of Matrix: ', tweet_bag_of_words.shape)
print('Amount of Non-Zero occurences: ', tweet_bag_of_words.nnz)

Shape of Matrix:  (14640, 15051)
Amount of Non-Zero occurences:  234281


In [119]:
# Adjust the weights with TF-IDF
# Each weight is calculated with the following formula:
# Scenario: in one document with 100 words the word 'data' appears 5 times. There are 1,000 documents to classify and the word 'data' appears 90 times in all of them
# TF = 5/100 = 0.05
# IDF = log(1,000/90) = 1
# Tf-idf weight = 0.05 * 1 = 0.05
from sklearn.feature_extraction.text import TfidfTransformer
# Apply the transformer to the bag of words
tweet_tfidf_transformer = TfidfTransformer().fit(tweet_bag_of_words)
tweet_tfidf = tweet_tfidf_transformer.transform(tweet_bag_of_words)
# Check the shape
print(tweet_tfidf.shape)

(14640, 15051)


In [120]:
# Add new column for airline sentiment with binary outcome: 1 for negative comment 0 for not negative
# Create dictionary to map
sentiment_dictionary = {'negative': 1, 'neutral': 0, 'positive': 0}
# Add new column mapping the dictionary
data2['airline_sentiment_model'] = data2['airline_sentiment'].map(sentiment_dictionary)
# Check first 5 rows
data2.head()

Unnamed: 0,airline_sentiment,text,airline_sentiment_model
0,neutral,@VirginAmerica What @dhepburn said.,0
1,positive,@VirginAmerica plus you've added commercials to the experience... tacky.,0
2,neutral,@VirginAmerica I didn't today... Must mean I need to take another trip!,0
3,negative,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse",1
4,negative,@VirginAmerica and it's a really big bad thing about it,1


In [121]:
# Take X and y variables using the TF-IDF vectorization from the previous step
X = tweet_tfidf
y = data2['airline_sentiment_model']
# Do a train test split - will choose to leave the default 30% of the data for testing
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.3)
# Check size of each sample
print(X_train.shape[0], X_test.shape[0], X_train.shape[0] + X_test.shape[0])

10248 4392 14640


In [122]:
# Create the Multinomial Naives Bayes object
tweet_sentiment_model = MultinomialNB()
# Fit X_train and y_train to train the model
tweet_sentiment_model.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [123]:
# Make one prediction 
print('predicted:', tweet_sentiment_model.predict(X_test)[0])
print('expected:', y_test.iloc[0])

predicted: 1
expected: 1


In [124]:
# Apply the model to predict X_test values
predictions = tweet_sentiment_model.predict(X_test)

### Model Evaluation

In [125]:
# Print confusion matrix
from sklearn.metrics import confusion_matrix
print (confusion_matrix(y_test, predictions))

[[ 670  965]
 [  47 2710]]


In [126]:
# Print classification report
from sklearn.metrics import classification_report
print (classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.93      0.41      0.57      1635
           1       0.74      0.98      0.84      2757

    accuracy                           0.77      4392
   macro avg       0.84      0.70      0.71      4392
weighted avg       0.81      0.77      0.74      4392



Performance will always be subjective to what metric is more relevant to the model's objective. 
Will go through a few key points: 

- 0 - tweet is not negative 
- 1 - tweet is negative 


Model is showing a promsing recall rate, or probability of detection, by classifying correctly 99% of all negative tweets.


Precision indicates there is potential for improvement with about 72% of negative preditions to be correctly classified.



### KNN Algorithm

In [127]:
from sklearn.neighbors import KNeighborsClassifier
text_classifier2 = KNeighborsClassifier(n_neighbors = 5)#no of neighbors is hpyer parameter
text_classifier2.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [128]:
predictions2 = text_classifier2.predict(X_test)

In [129]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,predictions2))
print(classification_report(y_test,predictions2))
print(accuracy_score(y_test, predictions2))

[[1200  435]
 [ 539 2218]]
              precision    recall  f1-score   support

           0       0.69      0.73      0.71      1635
           1       0.84      0.80      0.82      2757

    accuracy                           0.78      4392
   macro avg       0.76      0.77      0.77      4392
weighted avg       0.78      0.78      0.78      4392

0.7782331511839709


### Logistic Regression

In [130]:
from sklearn.linear_model import LogisticRegression
model2 = LogisticRegression(max_iter=2000,multi_class='ovr')
cross_val_score(model2,X_train,y_train).mean()

0.8338215191229512

### Random Forest Classifier

In [131]:
from sklearn.ensemble import RandomForestClassifier
model3 = RandomForestClassifier()
cross_val_score(model3,X_train,y_train).mean()

0.8029872513659251