Here we will build a simple sentiment classifier for airline tweets, using a dataset found on Kaggle (https://www.kaggle.com/crowdflower/twitter-airline-sentiment). We will see how accurate the model can be with no feature engineering beyond TF-IDF vectorization of the raw tweet text.

In [None]:
import pandas as pd
import numpy as np

In [47]:
df = pd.read_csv("Tweets.csv", sep = ',')
df.head(5)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


We will keep only the two columns we need: 'airline_sentiment' (the label) and 'text' (the tweet itself).

In [6]:
df = df[['airline_sentiment', 'text']]

df.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...


In [7]:
df.isnull().sum()

airline_sentiment    0
text                 0
dtype: int64

In [8]:
blanks = []   # row indices of tweets that are empty or whitespace-only

for index, sentiment, text in df.itertuples():
    if not text.strip():        # True for '' and for strings of only whitespace
        blanks.append(index)

In [9]:
blanks

[]
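
The same blank check can also be written without an explicit loop. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` and its rows are made up for illustration, not taken from the dataset); it assumes the text column contains only strings, which the `isnull()` check confirms for our data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the tweets DataFrame
toy = pd.DataFrame({
    "airline_sentiment": ["neutral", "negative", "positive"],
    "text": ["@VirginAmerica What @dhepburn said.", "   ", ""],
})

# True where the text is empty or contains only whitespace
is_blank = toy["text"].str.strip().eq("")

# Row indices of the blank tweets
blank_rows = toy.index[is_blank].tolist()
print(blank_rows)  # [1, 2]
```

On the real data, `df.index[df['text'].str.strip().eq('')].tolist()` should give the same empty list as the loop.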

There are no missing values and no blank tweets, so nothing needs to be removed. Now let's do a train/test split.

In [10]:
X = df['text']
y = df['airline_sentiment']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
                                                    random_state = 42)
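
The classes in this dataset are imbalanced (negative tweets clearly dominate), so one optional variation is a stratified split, which keeps the class proportions roughly equal in both parts. This is not what the notebook does; it is a small sketch on hypothetical toy labels (`texts` and `labels` are made up) showing the `stratify` argument of `train_test_split`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 negative, 4 neutral, 4 positive labels
labels = ["negative"] * 12 + ["neutral"] * 4 + ["positive"] * 4
texts = [f"tweet {i}" for i in range(len(labels))]

# stratify=labels asks for the same class mix in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```

For the real data the equivalent call would pass `stratify = y` alongside the existing arguments.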


In [11]:
# Build a pipeline and a linear support vector classifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_classifier = Pipeline([('tfidf', TfidfVectorizer()),
                             ('classifier', LinearSVC())])

text_classifier.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
        

In [12]:
predictions = text_classifier.predict(X_test)

In [13]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print(confusion_matrix(y_test, predictions))

[[2561  191   62]
 [ 324  493   67]
 [ 134   86  474]]
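
To read this matrix: rows are the true classes and columns are the predicted classes, in the alphabetical order scikit-learn uses for the labels (negative, neutral, positive). As a quick sanity check, here is a small pure-Python sketch that derives per-class precision and recall, plus overall accuracy, from the numbers printed above. The helper name `per_class_metrics` is just for illustration.

```python
labels = ["negative", "neutral", "positive"]

# Confusion matrix from the cell above: rows = true class, columns = predicted class
cm = [
    [2561, 191, 62],
    [324, 493, 67],
    [134, 86, 474],
]

def per_class_metrics(cm, labels):
    """Return {label: (precision, recall)} from a square confusion matrix."""
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = cm[k][k]
        predicted_as_k = sum(row[k] for row in cm)   # column sum
        actually_k = sum(cm[k])                      # row sum
        metrics[label] = (true_positives / predicted_as_k,
                          true_positives / actually_k)
    return metrics

metrics = per_class_metrics(cm, labels)
for label, (precision, recall) in metrics.items():
    print(f"{label:>8}  precision={precision:.2f}  recall={recall:.2f}")

# Accuracy = correctly classified tweets (the diagonal) / all test tweets
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy={accuracy:.4f}")
```

The biggest off-diagonal entry is the 324 neutral tweets predicted as negative, which is where most of the neutral class's recall is lost.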


In [14]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

    negative       0.85      0.91      0.88      2814
     neutral       0.64      0.56      0.60       884
    positive       0.79      0.68      0.73       694

    accuracy                           0.80      4392
   macro avg       0.76      0.72      0.74      4392
weighted avg       0.80      0.80      0.80      4392



In [16]:
print(accuracy_score(y_test, predictions))

0.8032786885245902


About 80% accuracy. Not bad. Performance is uneven across classes, though: the F1-score is 0.88 for negative and 0.73 for positive, but only 0.60 for neutral.
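
To put that accuracy in context, it helps to compare it with a trivial baseline that always predicts the most common class. A quick sketch using the test-set class counts from the support column of the classification report:

```python
# Test-set class counts ("support" column of the classification report)
support = {"negative": 2814, "neutral": 884, "positive": 694}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = support[majority_class] / total

print(majority_class, f"{baseline_accuracy:.4f}")  # negative 0.6407
```

So always guessing 'negative' would already score about 64%, and the classifier adds roughly 16 percentage points on top of that.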

This could probably be improved with dedicated sentiment-analysis tools, and by experimenting with different ML models.
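
As a rough idea of what experimenting with different models could look like, here is a sketch that swaps the final step of the same TF-IDF pipeline for a few other scikit-learn classifiers. The choice of `LogisticRegression` and `MultinomialNB` as alternatives, the helper name `compare_models`, and the toy tweets are all assumptions for illustration; none of this was run on the airline data, so no results are claimed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def compare_models(X_train, y_train, X_test, y_test):
    """Fit each candidate on TF-IDF features and return its test accuracy."""
    candidates = {
        "LinearSVC": LinearSVC(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "MultinomialNB": MultinomialNB(),
    }
    scores = {}
    for name, model in candidates.items():
        pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                             ("classifier", model)])
        pipeline.fit(X_train, y_train)
        scores[name] = pipeline.score(X_test, y_test)  # mean accuracy
    return scores

# Hypothetical toy tweets, only to show the call shape
toy_train_texts = [
    "great flight and friendly crew", "loved the service, thank you",
    "terrible delay and rude staff", "lost my bag, awful experience",
    "what time does boarding start", "is there wifi on this flight",
]
toy_train_labels = ["positive", "positive", "negative",
                    "negative", "neutral", "neutral"]
toy_test_texts = ["thank you for the great service", "awful delay again"]
toy_test_labels = ["positive", "negative"]

scores = compare_models(toy_train_texts, toy_train_labels,
                        toy_test_texts, toy_test_labels)
print(scores)
```

With the real split, the call would be `compare_models(X_train, y_train, X_test, y_test)`.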

Let's try some simple predictions:


In [43]:
text_classifier.predict(["Thanks guys! We had a great flight!"])

array(['positive'], dtype=object)

In [44]:
text_classifier.predict(["Honestly, what a terrible flight. Delayed, bad service, and not a smile in sight!"])

array(['negative'], dtype=object)