##### Objective 

This notebook performs sentiment analysis on tweets about the problems of each major U.S. airline. Each tweet in the Twitter data is labelled as positive, negative, or neutral.

###### Import the necessary libraries

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt# matplotlib.pyplot plots data
%matplotlib inline 
from sklearn.model_selection import train_test_split
import missingno as msno
import warnings
pd.options.display.max_columns = None
pd.options.display.max_rows = None
warnings.filterwarnings("ignore")
pd.options.display.float_format = '{:,.2f}'.format

In [None]:
# Import necessary libraries.
import re, string, unicodedata
import pandas as pd
import nltk           

# Natural language processing tool-kit
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


import contractions
from bs4 import BeautifulSoup                 # Beautiful soup is a parsing library that can use different parsers.
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet    # Stopwords, and wordnet corpus
from nltk.stem import LancasterStemmer, WordNetLemmatizer

In [None]:
# install and import necessary libraries.

#!pip install contractions

import re, string, unicodedata                          # Import Regex, string and unicodedata.
import contractions                                     # Import contractions library.
from bs4 import BeautifulSoup                           # Import BeautifulSoup.

import numpy as np                                      # Import numpy.
import pandas as pd                                     # Import pandas.
import nltk                                             # Import Natural Language Tool-Kit.

nltk.download('stopwords')                              # Download Stopwords.
nltk.download('punkt')
nltk.download('wordnet')

from nltk.corpus import stopwords                       # Import stopwords.
from nltk.tokenize import word_tokenize, sent_tokenize  # Import Tokenizer.
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer.

Read the tweet data from the Tweets CSV

In [None]:
data1=pd.read_csv('Tweets.csv')

In [None]:
data1.head()

In [None]:
def basic_checks(df):
    
    print('='*50)
    print('Shape of the dataframe is: \n',df.shape)
    print('='*50)
    print('Basic stats for the data: \n',df.describe())
    print('='*50)
    print('Data type and info :')
    print(df.info())
    print('='*50)
    print('Missing value information : \n',df.isnull().any())
    print('='*50)
    print('Sum of missing values if any : \n',df.isnull().sum())

In [None]:
basic_checks(data1)

###### EDA - Part 1

1. There are a total of 14640 records in the tweet reviews and 15 columns

2. Airline sentiment confidence has a minimum of 0.34 and a mean of 0.9

3. The highest number of retweets for a single tweet is 44

4. Most of the columns are of the object datatype; the tweet ID, the two confidence columns, and the retweet count are numeric

5. There are missing values in certain fields: negative reason, negative reason confidence, airline sentiment gold, negative reason gold, tweet location, and user timezone


###### Exploratory Data Analysis

Group by Airline and the Sentiment

In [None]:
data1.groupby(['airline','airline_sentiment']).size().unstack().plot(kind='bar',figsize=(11, 5))

Group by Sentiment - value counts and bar plot

In [None]:
data1.groupby('airline_sentiment').size().plot(kind='bar')

In [None]:
data1['airline_sentiment'].value_counts()

In [None]:
round(data1.airline_sentiment.value_counts(normalize=True)*100,2)

###### Number of reviews by Airlines

In [None]:
print("Total number of tweets for each airline \n ",data1.groupby('airline')['airline_sentiment'].count().sort_values(ascending=False))
airlines= ['US Airways','United','American','Southwest','Delta','Virgin America']
plt.figure(1,figsize=(12, 12))
for i in airlines:
    indices= airlines.index(i)
    plt.subplot(2,3,indices+1)
    new_df=data1[data1['airline']==i]
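    count=new_df['airline_sentiment'].value_counts().reindex(['negative','neutral','positive'], fill_value=0)  # fixed order so each bar matches its tick label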
    Index = [1,2,3]
    plt.bar(Index,count, color=['pink', 'orange', 'lightblue'])
    plt.xticks(Index,['negative','neutral','positive'])
    plt.ylabel('Mood Count')
    plt.xlabel('Mood')
    plt.title('Count of Moods of '+i)

###### Sentiment Word Clouds - Negative and positive 

In [None]:
from wordcloud import WordCloud,STOPWORDS

###### Negative Sentiment 

In [None]:
new_df=data1[data1['airline_sentiment']=='negative']
words = ' '.join(new_df['text'])
cleaned_word = " ".join([word for word in words.split()
                            if 'http' not in word
                                and not word.startswith('@')
                                and word != 'RT'
                            ])
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='black',
                      width=3000,
                      height=2500
                     ).generate(cleaned_word)
plt.figure(1,figsize=(12, 12))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

###### Positive Sentiment 

In [None]:
new_df=data1[data1['airline_sentiment']=='positive']
words = ' '.join(new_df['text'])
cleaned_word = " ".join([word for word in words.split()
                            if 'http' not in word
                                and not word.startswith('@')
                                and word != 'RT'
                            ])
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='black',
                      width=3000,
                      height=2500
                     ).generate(cleaned_word)
plt.figure(1,figsize=(12, 12))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

###### Reasons for negative reviews

In [None]:
data1['negativereason'].nunique()

# Count the negative reasons overall ('All') or for a single airline.
def NR_Count(Airline):
    if Airline=='All':
        a=data1
    else:
        a=data1[data1['airline']==Airline]
    count=dict(a['negativereason'].value_counts())
    Unique_reason=list(data1['negativereason'].unique())
    Unique_reason=[x for x in Unique_reason if str(x) != 'nan']
    Reason_frame=pd.DataFrame({'Reasons':Unique_reason})
    Reason_frame['count']=Reason_frame['Reasons'].apply(lambda x: count.get(x, 0))  # 0 if the airline has no tweets for this reason
    return Reason_frame
def plot_reason(Airline):
    
    a=NR_Count(Airline)
    count=a['count']
    Index = range(1,(len(a)+1))
    plt.bar(Index,count, color=['hotpink','pink','lightblue','lightgreen','violet','orange','gray','cyan','purple','orange'])
    plt.xticks(Index,a['Reasons'],rotation=90)
    plt.ylabel('Count')
    plt.xlabel('Reason')
    plt.title('Count of Reasons for '+Airline)
    
plot_reason('All')
plt.figure(2,figsize=(13, 13))
for i in airlines:
    indices= airlines.index(i)
    plt.subplot(2,3,indices+1)
    plt.subplots_adjust(hspace=0.9)
    plot_reason(i)

###### EDA - part 2

1. Overall, about 62% of the tweets are negative, followed by 21.16% neutral and 16% positive.
2. The highest number of negative tweets is for United, followed by US Airways and American.
3. The highest number of positive tweets is for Delta, followed by Southwest and United.
4. The positive, negative, and neutral counts are closest to each other for Virgin America, Delta, and Southwest, indicating an overall satisfactory customer sentiment.
5. The negative word cloud points to issues related to flight, bag, customer service, and help.
6. The issues raised in negative reviews were mainly customer service issues, followed by late flights and bad flight experiences.
7. Breaking this down further, the negative reviews for US Airways and United are mostly due to customer service and late flights, while American shows negative reviews due to customer service issues and late or cancelled flights.

**Dropping unnecessary columns** 

`Only extracting the useful columns for the sentiment analysis and discarding the remainder of the columns` 

1. From the original dataset we drop all columns except `airline_sentiment` and `text`.
2. There are no missing values in these two columns.

**Preprocessing of the text data**

`We go through the following steps to preprocess the text data:`

1. Remove HTML tags - strip anything matching the pattern '<.*?>'

2. Replace contractions - expand any contractions 

3. Remove numbers 

4. Tokenization - tokenize the text

5. Remove stop words - form a list of stop words and remove those from the text

6. Lemmatize the data

7. Join the words in the list of words 
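The steps above can be sketched end to end on a single tweet. This is a dependency-free illustration only: `CONTRACTIONS` and `STOP_WORDS` are small hypothetical stand-ins for the `contractions` package and the NLTK stop-word list used in this notebook, the tokenizer is a plain regular expression, and lemmatization is omitted.

```python
import re

# Hypothetical, tiny stand-ins for the `contractions` package and NLTK stop words.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
STOP_WORDS = {"the", "a", "is", "it", "my", "to", "and", "was", "i", "am"}

def preprocess(text):
    text = re.sub(r"<.*?>", "", text)                     # 1. remove HTML tags
    text = text.lower()                                   # lowercase before matching
    for short, full in CONTRACTIONS.items():              # 2. expand contractions
        text = text.replace(short, full)
    text = re.sub(r"\d+", "", text)                       # 3. remove numbers
    tokens = re.findall(r"[a-z]+", text)                  # 4. tokenize (letters only)
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 5. remove stop words
    return " ".join(tokens)                               # 7. join (step 6 omitted here)

sample = "<b>@united</b> it's 2 hours late and I can't rebook!"
cleaned = preprocess(sample)
print(cleaned)
```

The notebook's own cells below perform the same steps with `contractions.fix`, `nltk.word_tokenize`, and `WordNetLemmatizer`.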

In [None]:
data = data1[['airline_sentiment', 'text']].copy()  # copy so later column assignments do not act on a view of data1

In [None]:
data.head()

In [None]:
basic_checks(data)

###### Missing values

In [None]:
msno.matrix(data)

In [None]:
data.head()

In [None]:
data.columns

In [None]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

data['text'] = data['text'].apply(lambda x: replace_contractions(x))
data.head()

In [None]:
def remove_numbers(text):
  text = re.sub(r'\d+', '', text)
  return text

data['text'] = data['text'].apply(lambda x: remove_numbers(x))
data.head()

In [None]:
import re

TAG_RE = re.compile('<.*?>')

def remove_tags(text):
    return TAG_RE.sub('', text)

data['text']=data['text'].apply(lambda x:remove_tags(x))
data.head()

In [None]:
data['text'] = data.apply(lambda row: nltk.word_tokenize(row['text']), axis=1) # Tokenization of data

In [None]:
data.head()

In [None]:
# Note: this rebinds the imported `stopwords` name to a plain list, so run this cell only once per session.
stopwords = stopwords.words('english')

customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Negation words such as "not" and "couldn't" carry sentiment, so we take them out of the
# stop-word list; that way they stay in the tweets instead of being removed.

stopwords = list(set(stopwords) - set(customlist))
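To see why the negation words are taken out of the stop-word list, the sketch below filters one tokenized sentence twice. The stop-word set here is a hypothetical miniature, not the NLTK list, and `strip_stop_words` is an illustrative helper rather than a function from this notebook.

```python
# Hypothetical miniature stop-word list; the notebook uses NLTK's English list.
base_stop_words = {"the", "is", "was", "not", "a", "isn't"}
negations = {"not", "isn't"}

# Same set difference as in the cell above, on the miniature list.
kept_stop_words = base_stop_words - negations

def strip_stop_words(tokens, stop_words):
    """Return the tokens that are not in the given stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "flight", "was", "not", "good"]
negation_dropped = strip_stop_words(tokens, base_stop_words)
negation_kept = strip_stop_words(tokens, kept_stop_words)

print(negation_dropped)
print(negation_kept)
```

With the unmodified list the sentence collapses to "flight good", which reads as praise; keeping "not" preserves the complaint.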

In [None]:
lemmatizer = WordNetLemmatizer()

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words



In [None]:
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words



In [None]:
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words



In [None]:
def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords:
            new_words.append(word)
    return new_words



In [None]:
def lemmatize_list(words):
    new_words = []
    for word in words:
      new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words



In [None]:
def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return ' '.join(words)

data['text'] = data.apply(lambda row: normalize(row['text']), axis=1)
data.head()

In [None]:
words_list=[each.split(" ") for each in data['text']]

#words_list # list of lists 

import itertools
corpus=set(itertools.chain(*words_list))
len(corpus)

###### Applying Count and TF-IDF Vectorizers

###### Count Vectorization 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000)                # Keep only 1000 features as number of features will increase the processing time.
data_features = vectorizer.fit_transform(data['text'])

data_features = data_features.toarray()         

In [None]:
data_features.shape
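As a rough picture of what the count vectorizer produces, the sketch below builds a document-term count matrix by hand for three made-up documents. It keeps every term, whereas the cell above keeps only the 1000 most frequent ones through `max_features`; the documents and variable names are illustrative assumptions.

```python
from collections import Counter

# Three made-up, already-cleaned documents.
docs = ["flight late late", "great flight", "lost bag"]

# Vocabulary: every distinct term, in alphabetical order (one column per term).
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, holding how often each vocabulary term occurs in it.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the feature vector of one tweet, which is the shape that the classifiers below expect.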

In [None]:
data['sentiment']=data['airline_sentiment'].apply(lambda x: 0 if x=='negative' else  (2 if x=='neutral' else 1))

In [None]:
data['sentiment'].value_counts()

In [None]:
data.drop(['airline_sentiment'],axis=1,inplace=True)

In [None]:
data.head()

###### Split the data into training and testing sets 

In [None]:
labels=data['sentiment']

In [None]:
labels.dtype

In [None]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_features, labels,test_size=0.2, random_state=42)

###### Count Vectorizer - RF Model 

In [None]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest = RandomForestClassifier(n_estimators=500, n_jobs=4)

forest = forest.fit(X_train, y_train)

print(forest)


In [None]:
model_score=forest.score(X_test,y_test)
print('RandomForest Test Accuracy Score :',model_score)

In [None]:
result_rf = forest.predict(X_test)

###### Count Vectorizer - ExtraTrees Classifier

In [None]:

from sklearn.ensemble import ExtraTreesClassifier

xtrees = ExtraTreesClassifier(n_jobs=4,n_estimators=500)

xtrees = xtrees.fit(X_train, y_train)

print(xtrees)


In [None]:
model_score=xtrees.score(X_test,y_test)
print('Xtra Trees Test Accuracy Score :',model_score)

In [None]:

result_xt = xtrees.predict(X_test)

###### RF model - Confusion Matrix

In [None]:
from sklearn import metrics  # provides classification_report, used in the report cells below
print(metrics.classification_report(y_test,result_rf))

In [None]:

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test, result_rf)

print(conf_mat)

class_names = ['negative', 'positive', 'neutral']  # label codes 0, 1, 2 in sorted order
df_cm = pd.DataFrame(conf_mat, index=class_names, columns=class_names)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

###### ExtraTrees-Confusion Matrix

In [None]:
print(metrics.classification_report(y_test,result_xt))

In [None]:

conf_mat = confusion_matrix(y_test, result_xt)

print(conf_mat)

class_names = ['negative', 'positive', 'neutral']  # label codes 0, 1, 2 in sorted order
df_cm = pd.DataFrame(conf_mat, index=class_names, columns=class_names)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

###### TF-IDF vectorizer

In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
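data_features = vectorizer.fit_transform(data['text'])

data_features = data_features.toarray()

# Re-split so the models below train on the TF-IDF features rather than the earlier count features.
X_train, X_test, y_train, y_test = train_test_split(data_features, labels, test_size=0.2, random_state=42)

data_features.shape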
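For intuition about how TF-IDF differs from raw counts, here is a hand-rolled sketch on three made-up token lists. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which to my understanding matches the scikit-learn defaults; treat the formula and the helper names as assumptions rather than a reproduction of the library.

```python
import math
from collections import Counter

# Three made-up tokenized documents.
docs = [["flight", "late"], ["flight", "great"], ["late", "bag"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency (assumed default form)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(docs[1])
print(vec)
```

In the second document, "great" appears in fewer documents than "flight", so it receives the larger weight even though both occur once.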

###### Tf-idf vectorizer-RF Classifier

In [None]:
# Using Random Forest to build model for the classification of reviews.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

import numpy as np

forestvect = RandomForestClassifier(n_estimators=500, n_jobs=-1)

forestvect = forestvect.fit(X_train, y_train)

print(forestvect)


In [None]:
model_score=forestvect.score(X_test,y_test)
print('Random Forest - TFIDF Test Accuracy Score :',model_score)

In [None]:
result_rfvect = forestvect.predict(X_test)

In [None]:
print(metrics.classification_report(y_test,result_rfvect))

In [None]:

conf_mat = confusion_matrix(y_test, result_rfvect)

print(conf_mat)

class_names = ['negative', 'positive', 'neutral']  # label codes 0, 1, 2 in sorted order
df_cm = pd.DataFrame(conf_mat, index=class_names, columns=class_names)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

###### Tf-idf vectorizer-ExtraTrees Classifier

In [None]:

xtrees_vect = ExtraTreesClassifier(n_jobs=4,n_estimators=500)

xtrees_vect = xtrees_vect.fit(X_train, y_train)

print(xtrees_vect)


In [None]:
model_score=xtrees_vect.score(X_test,y_test)
print('Extra Trees classifier - TFIDF Test Accuracy Score :',model_score)

In [None]:
result_xtreesvect = xtrees_vect.predict(X_test)

In [None]:
print(metrics.classification_report(y_test,result_xtreesvect))

In [None]:

conf_mat = confusion_matrix(y_test, result_xtreesvect)

print(conf_mat)

class_names = ['negative', 'positive', 'neutral']  # label codes 0, 1, 2 in sorted order
df_cm = pd.DataFrame(conf_mat, index=class_names, columns=class_names)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

**Results - TF-IDF & Count Vectorizer - Random Forest vs Extra Trees Classifier**


`Count Vectorizer` 

1. We visually compare the results of the two classifiers that gave the best results among those tried: the RandomForest classifier and the ExtraTrees classifier.
2. The Extra Trees classifier is more accurate than the Random Forest classifier.
3. Accuracy on the test set is 79% for the Extra Trees classifier versus 77% for the Random Forest classifier.
4. The confusion matrix and the classification report show the FP, TN, TP, and FN values. Comparing these values, we conclude that the Extra Trees classifier is a better model than the Random Forest classifier, as its recall and F1 scores are higher.
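The recall and F1 scores referred to above can be derived directly from a confusion matrix. The sketch below does this per class on a hypothetical 3x3 matrix; the counts are invented for illustration and are not the results of this notebook.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class.
labels = ["negative", "positive", "neutral"]
conf = [
    [90, 4, 6],
    [10, 30, 5],
    [20, 5, 30],
]

def per_class_scores(conf, k):
    """Precision, recall, and F1 for class index k."""
    tp = conf[k][k]
    fn = sum(conf[k]) - tp                     # true class k, predicted as something else
    fp = sum(row[k] for row in conf) - tp      # predicted k, actually another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

scores = {name: per_class_scores(conf, k) for k, name in enumerate(labels)}
for name, (p, r, f) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

These are the same quantities that `classification_report` prints for each class.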

`TF IDF Vectorizer`

1. Again, we visually compare the results of the two classifiers that gave the best results among those tried: the RandomForest classifier and the ExtraTrees classifier.
2. The accuracy of the Extra Trees classifier is better than that of the Random Forest classifier: 78.5% compared to 77.4%.
3. The recall and F1 scores are better for the Extra Trees classifier.

`Business Insights`


True Negative (observed=0, predicted=0)

False Positive (observed=0, predicted=1)

True Positive (observed=1, predicted=1)

False Negative (observed=1, predicted=0)
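These four definitions are binary, while the models here predict three classes. One way to relate them, sketched below, is to treat label 0 (a negative tweet) as the 0 outcome and to merge labels 1 and 2 (positive and neutral) into the 1 outcome. That grouping is an assumption made for illustration, and the matrix holds invented counts, not this notebook's results.

```python
# Hypothetical confusion matrix: rows = observed label, columns = predicted label.
# Label codes: 0 = negative, 1 = positive, 2 = neutral.
conf = [
    [90, 4, 6],
    [10, 30, 5],
    [20, 5, 30],
]

rest = (1, 2)  # positive and neutral merged into the "1" outcome (assumption)

tn = conf[0][0]                                      # observed 0, predicted 0
fp = sum(conf[0][j] for j in rest)                   # observed 0, predicted 1
fn = sum(conf[i][0] for i in rest)                   # observed 1, predicted 0
tp = sum(conf[i][j] for i in rest for j in rest)     # observed 1, predicted 1

total = tn + fp + fn + tp
binary_accuracy = (tn + tp) / total
print(tn, fp, fn, tp, binary_accuracy)
```

Note that a positive tweet predicted as neutral counts as correct under this grouping, so the binary accuracy can be higher than the three-class accuracy.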


In my opinion, the main metric of interest here is accuracy, because we want to know how many customers who left a negative tweet were recognized as such, and how many who left a positive or neutral tweet were correctly identified as such.

True negatives and true positives enable us to analyze customer behavior correctly.

Although the false positive and false negative counts still have a significant impact on decision making, placing each review in the correct category remains of utmost importance.


**We can therefore conclude that the Extra Trees classifier is a better model than the Random Forest model for supporting business decisions**

`N.B`

1. The exercise was tried with multiple classifiers, not all of which are listed here. Keeping the purpose of the exercise in mind, only the 2 best models were picked and compared for both the Count and TF-IDF vectorizers.
2. The exercise was also tested using just 2 classes instead of 3, and the accuracy seemed better in that case. However, to retain the classes that were originally in the dataset, we keep all 3 classes: positive, negative, and neutral.


