# Notebook 6 - Putting it all together

Although this notebook does not introduce new topics, it shows how to handle a text sentiment classification task from start to finish. We are going to combine many of the techniques covered in previous notebooks, which will (hopefully) let us build a reasonably accurate model. Additionally, we are going to create some visualizations of dataset statistics and model performance, so we can understand the problem better.

We are going to use the dataset containing tweets about US airlines. The dataset is stored in the CSV file `airline_tweets.csv` and contains tweet text labeled with either *"positive"*, *"neutral"* or *"negative"*. Besides, it contains (potentially) helpful information like the name of the airline and the user location. Let's explore it!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import swifter
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer, TfidfTransformer
from langdetect import detect
import seaborn as sns
import spacy
import en_core_web_sm

In [None]:
df_tweets = pd.read_csv("./datasets/airline_tweets.csv")

In [None]:
df_tweets.head()

Ok, so there are many columns with additional information. However, some of their fields are empty. To decide whether to include a given column in the further analysis, we may want to inspect how many fields in this column are null. Let's do it using the `.info()` method.

In [None]:
df_tweets.info()
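If the raw non-null counts are hard to compare, the share of missing values per column can also be computed directly. The following is only a sketch on a made-up frame (`toy` and its values are hypothetical stand-ins for `df_tweets`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_tweets
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "tweet_coord": [np.nan, np.nan, np.nan, "[40.7, -74.0]"],
    "user_timezone": ["Eastern Time", np.nan, "Pacific Time", "Eastern Time"],
})

# isnull() gives a boolean frame; the column-wise mean is the share of nulls
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

On the real dataset the same expression would be `df_tweets.isnull().mean()`.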

Ok, now we have an overview of the dataset and we can see that columns like `airline_sentiment_gold` or `tweet_coord` are generally empty (contain null values). How about columns like `tweet_location` or `user_timezone`? We can inspect them visually and see if "null" values are stacked or randomly scattered throughout the whole dataset.

In [None]:
sns.heatmap(df_tweets.isnull(), cbar=False)

In our study, besides "text" and "airline_sentiment", we will use the "airline" and "tweet_location" columns.

In [None]:
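# .copy() gives us an independent frame, so adding columns later does not raise SettingWithCopyWarning
df = df_tweets[["airline_sentiment", "airline", "text", "tweet_location"]].copy()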

In [None]:
df.head()

Now, let's train the Naive-Bayes classifier without any preprocessing or parameter tuning and let's see what results we get.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_tweets.text, df_tweets.airline_sentiment, test_size=0.2, 
                                                                                random_state=10)
count_vectorizer = CountVectorizer(ngram_range=(1,1),binary=True)


X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

In [None]:
# X_train_counts.shape

In [None]:
# X_test_counts.shape

In [None]:
mnb = MultinomialNB()
mnb.fit(X_train_counts, y_train)

In [None]:
y_predicted_counts = mnb.predict(X_test_counts)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

def get_metrics(y_test, y_predicted):  
    precision = precision_score(y_test, y_predicted, pos_label=None, average='weighted')             
    recall = recall_score(y_test, y_predicted, pos_label=None, average='weighted')
    f1 = f1_score(y_test, y_predicted, pos_label=None, average='weighted')
    
    accuracy = accuracy_score(y_test, y_predicted)
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))
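Since `get_metrics` uses `average='weighted'`, the scores are dominated by the largest class. A small sketch on made-up labels (not taken from the dataset) shows how the weighted and macro averages can diverge when a classifier always predicts the majority class:

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 8 negative, 2 positive
y_true = ["negative"] * 8 + ["positive"] * 2
# A degenerate classifier that always predicts the majority class
y_pred = ["negative"] * 10

# Weighted: per-class F1 averaged with class frequencies as weights
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
# Macro: plain mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("weighted F1 = %.3f, macro F1 = %.3f" % (weighted_f1, macro_f1))
```

The macro average is much lower here because the "positive" class, which is never predicted, counts as much as the "negative" class.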

It seems that even without much work we can get decent results! Let's verify them using cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(mnb, X_train_counts, y_train, cv=5)

In [None]:
scores.mean()
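Note that the cross-validation above reuses a vocabulary that was fitted on the whole training split, so every validation fold has already influenced the features. One possible alternative, sketched here on a few made-up tweets (`toy_texts` and `toy_labels` are hypothetical), is to wrap the vectorizer and the classifier in a pipeline so that the vocabulary is rebuilt inside each fold:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 6 negative and 6 positive tweets
toy_texts = [
    "flight delayed again", "lost my luggage", "worst service ever",
    "cancelled flight no refund", "rude staff terrible", "delayed for hours",
    "great crew thanks", "smooth flight loved it", "awesome service",
    "thanks for the upgrade", "friendly staff great", "best airline ever",
]
toy_labels = ["negative"] * 6 + ["positive"] * 6

# The vectorizer is fitted only on the training part of each fold
pipeline = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
toy_scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3)
print(toy_scores.mean())
```

On the real data the pipeline would receive the raw `X_train` texts instead of `X_train_counts`.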

Ok, so this baseline classifier run on the totally unprocessed dataset gives us **around 75% accuracy** using cross-validation. We will come back to this number later!

From now we aim to get a higher accuracy score. We are going to preprocess data, extract some features and work on the model parameters selection, so all of this work will hopefully let us get better results.

## Further dataset exploration

Let's see how the dataset is structured in terms of airline and sentiment.

In [None]:
sns.catplot(x="airline",kind="count", data=df)

In terms of airlines, the situation doesn't look bad, except for the low number of tweets about Virgin America.

In [None]:
sns.catplot(x="airline_sentiment", kind="count", data=df)

However, when it comes to sentiment, we can see that the dataset is highly imbalanced. We have many more negative tweets than neutral or positive ones. Such an imbalance often lowers accuracy, especially on the minority classes. To partially solve this issue we may want to *oversample* later.
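Random oversampling can be sketched with plain pandas: every class is resampled with replacement until it is as large as the biggest one. The frame below is made up (`toy` is hypothetical), and on real data this should be applied to the training split only, so that duplicated tweets do not leak into the test set:

```python
import pandas as pd

# Hypothetical imbalanced toy frame: 6 negative, 2 neutral, 2 positive
toy = pd.DataFrame({
    "text": ["tweet %d" % i for i in range(10)],
    "airline_sentiment": ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2,
})

# Size of the largest class
target = toy["airline_sentiment"].value_counts().max()

# Sample each class with replacement up to the size of the largest class
balanced = pd.concat(
    [group.sample(n=target, replace=True, random_state=0)
     for _, group in toy.groupby("airline_sentiment")],
    ignore_index=True,
)
print(balanced["airline_sentiment"].value_counts())
```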

We may also want to see what is the sentiment statistic for each airline.

In [None]:
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x="airline", data=df, hue="airline_sentiment", kind="count")

What can we read from this plot? Actually, quite useful information! We can see that the airline people tweet about has a strong influence on the tweet sentiment. For example, the great majority of all tweets about United Airlines are negative. On the other hand, neutral and positive tweets regarding Delta Airlines outnumber negative samples. Hence, we will definitely want to treat the airline a tweet refers to as a feature.

So how can we use the name of the airline as a feature? One way is to add it at the end of the text. By doing so, we give the classifier a "hint" that is related to the sentiment. Interestingly, since every tweet mentions one of the 6 airlines, the tweets already contain the airline name at the beginning! Let's check whether this is true for all tweets.

In [None]:
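# word_tokenize splits "@united" into "@" and "united", so the handle is the second token
df_starting_word = df["text"].apply(word_tokenize)
# skip tweets with fewer than two tokens to avoid an IndexError
pd.Series([x[1].lower() for x in df_starting_word if len(x) > 1]).value_counts().head(10)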

As you can see, the great majority of all tweets start with the airline name. Some tweets do not, but we will ignore them since these are rare cases. In the preprocessing, we will remove "@" characters and make all words lowercase, so "united" and "United" will be treated as the same token.

Hence, we can drop the "airline" column, since we will not use it anymore.

In [None]:
del df["airline"]
df

## Location extraction (feature)

We can also extract the location from the "tweet_location" column. One may think that since the column already exists, the only thing we need to do is to match it with the label column. This won't work here, because the "tweet_location" column contains geographical names in different formats, and sometimes more than one name. We are going to extract them from the column using the `spaCy` English language model "en_core_web_sm".

We have already seen named-entity recognition in Notebook 2, so let's apply it here. We will extract all geopolitical entities (GPE), i.e. countries, cities and states. Let's look at the example below.

In [None]:
text="San Mateo, CA & Las Vegas, NV"
NLP = en_core_web_sm.load()
output = NLP(text)
for item in output.ents:
    print(item.label_, item)


Now, let's create a function that, given a text from the "tweet_location" column, returns all GPE names found in it. If the field is empty or no location is found, the function returns "nolocationplaceholder" to consistently mark the missing location.

In [None]:
def filter_location(text):
    # Empty fields are read by pandas as NaN (a float), not as an empty string
    if not isinstance(text, str) or not text.strip():
        return "nolocationplaceholder"
    try:
        output = NLP(text)
    except Exception:
        return "nolocationplaceholder"
    # Keep only geopolitical entities (countries, cities, states)
    locations = [str(item) for item in output.ents if item.label_ == "GPE"]
    if not locations:
        return "nolocationplaceholder"
    return locations

Let's create a special series in the dataframe for the extracted locations list (or nolocationplaceholder).

In [None]:
df["location_ner"] = df.tweet_location.swifter.apply(filter_location)

In [None]:
df

However, we would like to have each extracted location in a specific column.

In [None]:
location_df = df["location_ner"].apply(pd.Series)
location_df

Now some concatenation and renaming...

In [None]:
df = pd.concat([location_df, df], axis=1)

In [None]:
df = df.rename(columns={0: 'loc_1', 1: 'loc_2', 2: "loc_3", 3: "loc_4"})

And done!

In [None]:
df

So what can we do with these extracted location names? We can try to find a relation between the location name and the sentiment of the tweet. First, let's look at the top 20 locations.

In [None]:
dict(df.loc_1.value_counts().iloc[:20])

Now, let's see if there is any relation between tweet sentiment and location. We can plot the sentiment counts for a few of the most frequent locations.

In [None]:
sns.catplot(x="airline_sentiment",kind="count", data=df[df.loc_1 == "London"])

In [None]:
sns.catplot(x="airline_sentiment",kind="count", data=df[df.loc_1 == "San Francisco"])

In [None]:
sns.catplot(x="airline_sentiment",kind="count", data=df[df.loc_1 == "New York"])

It seems that where a tweet was sent from makes no big difference, since the majority is negative everywhere. However, let's take loc_1 as a feature and concatenate it with the tweet text.

In [None]:
df["text_loc"] = df["text"].astype(str) + " " + df["loc_1"].astype(str)

Also, now, we can drop all other unused columns and just keep our edited text and the airline_sentiment label.

In [None]:
df = df[["text_loc", "airline_sentiment"]].copy()

In [None]:
df

## Preprocessing and extracting features from text
Now that we have extracted the location of the tweets (if it was an actual geographical location), we can start preprocessing the raw text.

## Language

First, we will determine the actual language of each tweet. The fact that the 10 tweets at the top and the bottom are in English doesn't mean that all of them are! We don't want to mix languages, since the same word can occur in several languages with different meanings. To detect the language of each tweet we will use the `langdetect` package.

In [None]:
df['language'] = df["text_loc"].swifter.apply(detect)

In [None]:
df.language.value_counts()

In [None]:
df[df.language =="it"]

In [None]:
df[df.language =="fr"]

It seems that `langdetect` was mistaken and the great majority of the tweets are actually in English, so **let's not delete anything**.

## Emojis

One interesting text-based feature that can be extracted is the occurrence of emojis with a negative or positive meaning. The idea is to compile a list of negative emojis and, if a tweet contains one of them, mark that tweet with a special token that the classifier will (hopefully) associate with negative sentiment. However, before treating it as a feature we should check whether it is actually related to the sentiment.

To start, we need a way of extracting emojis from text. Let's explore the example below. 

In [None]:
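import emoji
import re

example_emoji_sentence = "@VirginAmerica 👍 Need to start flying to @KCIAirport .  😊😀😃😄"

# emoji >= 2.0 exposes EMOJI_DATA; older 1.x releases expose UNICODE_EMOJI['en'] instead
KNOWN_EMOJIS = set(getattr(emoji, "EMOJI_DATA", None) or emoji.UNICODE_EMOJI['en'])

def extract_emojis(s):
    # Keep only the characters that are known emojis
    return ''.join(c for c in s if c in KNOWN_EMOJIS)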

In [None]:
extract_emojis(example_emoji_sentence)

We may want to explore which emojis occur in our dataset and how they are combined.

In [None]:
emojis_ = []

In [None]:
corpus = df.text_loc.tolist()
for i in corpus:
    h = extract_emojis(i)
    if len(h) != 0:
        emojis_.append(h)

In [None]:
# emojis_

Let's create a list of emojis with negative meaning and develop a function which will return the "negemoji" token if there is a negative emoji in the tweet text.

In [None]:
# A set of single emoji characters makes the membership test explicit
neg_emojis = set("😭🆘😡😩😞😢👿👎😔😪😫😤😖😠💩😑😕😒")

def contains_neg_emoji(text):
    # Return the "negemoji" token if the tweet contains at least one negative emoji
    for e in extract_emojis(text):
        if e in neg_emojis:
            return "negemoji"
    return ""

In [None]:
df["neg_emoji"] = df.text_loc.apply(contains_neg_emoji)

Ok, now let's see how many tweets actually have negative emojis.

In [None]:
df.neg_emoji.value_counts()

Oops, only 1% of all tweets contain negative emojis! Let's check the relation between negative emojis and the sentiment of the tweet.

In [None]:
sns.catplot(x="neg_emoji", data=df[df["neg_emoji"] == 'negemoji'], hue="airline_sentiment", kind="count")

However, the relation between containing negative emoji and the sentiment of a tweet looks very good - there is a very small number of positive-labeled tweets containing a negative emoji. 

In [None]:
df

The occurrence of negative emojis probably won't help the classifier much, since they appear in only 1% of all tweets, but let's still implement it as a feature!

In [None]:
df["text_loc"] = df["text_loc"].astype(str) + " " + df["neg_emoji"].astype(str)

In [None]:
df = df[["text_loc", "airline_sentiment"]].copy()

In [None]:
df

## Further feature extraction - urls, numbers, hashtags...

Here, we are going to extract more features from the text. If we find that a specific feature discriminates well (is related to positive or negative sentiment), we will mark a tweet with a specific token, as before.

## URLs

In [None]:
import re
def contains_link(text):
    text = re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr',
                     text)
    text = word_tokenize(text)
    if "httpaddr" in text:
        return "YES"
    else:
        return "NO"
    

In [None]:
df["contains_link"] = df.text_loc.apply(contains_link)

In [None]:
sns.catplot(x="contains_link", data=df, hue="airline_sentiment", kind="count")

## Number

In [None]:
def contains_number(text):
        
    text = re.sub(r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', 'numbr',
                     text)
    
    text = re.sub(r'\d+(\.\d+)?', 'numbr',
                     text)
    text = word_tokenize(text)
    if "numbr" in text:
        return "YES"
    else:
        return "NO"

In [None]:
df["contains_number"] = df.text_loc.apply(contains_number)

In [None]:
sns.catplot(x="contains_number", data=df, hue="airline_sentiment", kind="count")

## Hashtags

In [None]:
def contains_hashtag(text):
    # A hashtag is "#" followed by at least one word character
    if re.search(r'#\w+', text):
        return "YES"
    else:
        return "NO"

In [None]:

df["contains_hashtag"] = df.text_loc.apply(contains_hashtag)


In [None]:

sns.catplot(x="contains_hashtag", data=df, hue="airline_sentiment", kind="count")

Sadly, the occurrence of URLs, numbers or hashtags doesn't seem to have a strong influence on the tweet sentiment. We will not add them as features.
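One way to put numbers on such a visual impression is a normalized cross-tabulation: for each value of the flag, it gives the share of each sentiment. The sketch below uses made-up values (`toy` is hypothetical); if the rows look alike, the flag carries little information about the sentiment:

```python
import pandas as pd

# Hypothetical toy frame with a binary flag and a sentiment label
toy = pd.DataFrame({
    "contains_link": ["YES", "NO", "NO", "YES", "NO", "NO", "YES", "NO"],
    "airline_sentiment": ["negative", "negative", "positive", "neutral",
                          "negative", "neutral", "negative", "positive"],
})

# normalize="index" turns each row into shares that sum to 1
shares = pd.crosstab(toy["contains_link"], toy["airline_sentiment"],
                     normalize="index")
print(shares)
```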

## Preprocessing - cleaning

Now we need to clean the data. Here, we will replace all URLs and numbers with a standard token. Besides, we are going to remove all punctuation, so all mentions and hashtags will become normal words, and we will make all tweets lowercase.

In [None]:
from nltk.corpus import stopwords
stop_words = list(stopwords.words('english'))

# After our quick research, location also doesn't seem to have
# a strong influence, so we will remove the "nolocationplaceholder" token as well.
stop_words.append("nolocationplaceholder")

# Convert to set, so lookup operations are much faster
stop_words = set(stop_words)

In [None]:
def preprocess(text):
    # replace phone numbers with a standard token
    text = re.sub(r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', ' numbr ',
                  text)

    # replace any remaining numbers (integers and decimals)
    text = re.sub(r'\d+(\.\d+)?', ' numbr ',
                  text)

    # replace URLs and domain-like strings
    text = re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', ' httpaddr ',
                  text)

    # keep only letters; this also removes "@", "#" and all punctuation
    letters_only_text = re.sub("[^a-zA-Z]", " ", text)

    # convert to lower case and split
    words = word_tokenize(letters_only_text.lower())

    # remove stopwords
    cleaned_words = [word for word in words if word not in stop_words]

    return ' '.join(cleaned_words)

In [None]:
df["cleaned_text"] =df.text_loc.swifter.apply(preprocess)

In [None]:
df

## Removing rare words

If a word occurs only once or twice in the whole dataset, its occurrence carries little information. However, since these words are still "analyzed" by the classifier, they may in fact hurt the classification! It is common to remove such words. Let's start by creating a dictionary of word counts, basically a term-frequency dictionary.

In [None]:
cleaned_texts = df["cleaned_text"].tolist()
whole_corpus = " ".join(cleaned_texts)

In [None]:
from collections import Counter
rare_words = {} # again, storing it in dictionary instead of list will make looking through it much faster

counter_dic = Counter(whole_corpus.split())

Let's define word as "rare", if it occurs only once or twice in the corpus.

In [None]:
for (key,value) in counter_dic.items():
    if value < 3:
        rare_words[key] = value

Let's explore which words are actually rare.

In [None]:
rare_words

In [None]:
len(rare_words)

In [None]:
len(counter_dic)  # size of the whole vocabulary, for comparison with the number of rare words

Now, we can delete them.

In [None]:
def delete_rare_words(text):
    words = word_tokenize(text)

    cleaned_words = []
    
    for word in words:
        if word not in rare_words:
            cleaned_words.append(word)
            
    return ' '.join(cleaned_words)

In [None]:
df["cleaned_text"] = df["cleaned_text"].swifter.apply(delete_rare_words)
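A related shortcut is the `min_df` parameter of `CountVectorizer`, which drops terms at vectorization time. Note that it is not the same criterion as the one used above: `min_df` counts the number of documents a term appears in, not its total number of occurrences in the corpus. A small sketch on made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus
docs = [
    "flight delayed again",
    "flight cancelled",
    "great flight crew",
    "delayed luggage",
]

# Keep only terms that appear in at least 2 documents
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(docs)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)  # ['delayed', 'flight']
```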

In [None]:
df

One more thing - if there are any duplicates let's delete them as well.

In [None]:
df = df.drop_duplicates()

In [None]:
df

## Change textual labels into numeric values (processing)

And the very last thing we need to do before training classifiers is to encode the text labels as numbers. `sklearn` classifiers accept string labels directly (the models below are in fact trained on `airline_sentiment`), but a numeric encoding is handy for libraries that require it.

In [None]:
possible_labels = df.airline_sentiment.unique()
possible_labels

In [None]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

# We can refer to this mapping later
label_dict

In [None]:
df['label'] = df.airline_sentiment.map(label_dict)
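The manual mapping above can also be delegated to scikit-learn's `LabelEncoder`, which additionally remembers how to map the numbers back to text. This is only a sketch on made-up labels; unlike the dictionary built from `unique()`, the encoder assigns the numbers in alphabetical order of the class names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels
labels = ["neutral", "positive", "negative", "negative"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted: negative -> 0, neutral -> 1, positive -> 2
print(list(encoder.classes_))
print(list(encoded))

# The mapping can be reversed when inspecting predictions
decoded = encoder.inverse_transform(encoded)
```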

In [None]:
df

At the end, let's keep only the columns we need.

In [None]:
# "text_loc" is kept because it is preprocessed again later in the notebook
df = df[["text_loc", "cleaned_text", "airline_sentiment", "label"]].copy()
df

Now we are ready for the final confrontation!

## Comparing with baseline

First, let's use the same classifier with the same configuration, so we can see how our preprocessing and feature engineering changed the accuracy.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.cleaned_text, df.airline_sentiment, test_size=0.1, 
                                                                                random_state=40, stratify=df.airline_sentiment)
count_vectorizer = CountVectorizer(ngram_range=(1,1),binary=True)


X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

In [None]:
mnb = MultinomialNB()
mnb.fit(X_train_counts, y_train)  # fit() returns the estimator itself, not predictions
y_predicted_counts = mnb.predict(X_test_counts)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


In [None]:
scores = cross_val_score(mnb, X_train_counts, y_train, cv=5)

In [None]:
scores.mean()

In [None]:
#old score: 0.7596467191733065


In [None]:
#different vectorizer

count_vectorizer = TfidfVectorizer(ngram_range=(1,1))


X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

mnb = MultinomialNB()
mnb.fit(X_train_counts, y_train)  # fit() returns the estimator itself, not predictions
y_predicted_counts = mnb.predict(X_test_counts)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


In [None]:
scores = cross_val_score(mnb, X_train_counts, y_train, cv=5)
scores.mean()

So we will stick with `CountVectorizer`, because it gives us better results. Let's play around with the parameters of the vectorizer before we tweak the classifier.

In [None]:
# use unigrams and bigrams instead of unigrams only
count_vectorizer = CountVectorizer(ngram_range=(1,2), binary=True)


X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)
mnb.fit(X_train_counts, y_train)  # fit() returns the estimator itself, not predictions
y_predicted_counts = mnb.predict(X_test_counts)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


In [None]:
scores = cross_val_score(mnb, X_train_counts, y_train, cv=5)
scores.mean()

Increasing the n-gram range doesn't improve the performance of the classifier either, so let's try out another classifier.

In [None]:
count_vectorizer = CountVectorizer(ngram_range=(1,2) ,binary=True)


X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

In [None]:
clf = LogisticRegression(solver="liblinear") #(C=2.3, class_weight='balanced',solver='newton-cg',multi_class='multinomial',random_state=10)
clf.fit(X_train_counts, y_train)

y_predicted_counts = clf.predict(X_test_counts)

In [None]:
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


In [None]:
scores = cross_val_score(clf, X_train_counts, y_train, cv=5)
scores.mean()

Nice, logistic regression performs significantly better (you can even increase the n-gram range and the performance increases slightly). Let's do a grid search.

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'C': np.linspace(0.1, 1, 5), "solver":["newton-cg"]}#, "penalty":["l1","l2"], "solver": ["newton-cg", "lbfgs", "liblinear"]}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(X_train_counts, y_train)

print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)
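The grid search above tunes only the classifier on fixed features. Vectorizer and classifier parameters can also be searched together by putting both into a `Pipeline` and addressing them with the `step__parameter` naming scheme. The following is a sketch on made-up tweets (`toy_texts`, `toy_labels` and the grid values are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 6 negative and 6 positive tweets
toy_texts = [
    "flight delayed again", "lost my luggage", "worst service ever",
    "cancelled flight no refund", "rude staff terrible", "delayed for hours",
    "great crew thanks", "smooth flight loved it", "awesome service",
    "thanks for the upgrade", "friendly staff great", "best airline ever",
]
toy_labels = ["negative"] * 6 + ["positive"] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer(binary=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(toy_texts, toy_labels)
print(search.best_params_, search.best_score_)
```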

In [None]:
clf = LogisticRegression(C=0.325, solver='newton-cg' ) #(C=2.3, class_weight='balanced',solver='newton-cg',multi_class='multinomial',random_state=10)
clf.fit(X_train_counts, y_train)

y_predicted_counts = clf.predict(X_test_counts)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


In [None]:
scores = cross_val_score(clf, X_train_counts, y_train, cv=5)
scores.mean()

Increasing the n-gram range doesn't make a difference, so one could either play around more with the parameters or try out different classifiers. Let's try our last one: SVM.

In [None]:
from sklearn import svm
clf = svm.SVC()

clf.fit(X_train_counts, y_train)

y_predicted_counts = clf.predict(X_test_counts)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


In [None]:
scores = cross_val_score(clf, X_train_counts, y_train, cv=5)
scores.mean()

In [None]:
clf = svm.SVC(kernel="poly", degree=2)

clf.fit(X_train_counts, y_train)

y_predicted_counts = clf.predict(X_test_counts)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


In [None]:
clf = svm.SVC(kernel="linear")

clf.fit(X_train_counts, y_train)

y_predicted_counts = clf.predict(X_test_counts)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


In [None]:
parameters = {'C': np.linspace(0.1, 5, 5), "kernel":["rbf"]}#, "penalty":["l1","l2"], "solver": ["newton-cg", "lbfgs", "liblinear"]}
grid_search = GridSearchCV(svm.SVC(), parameters)
grid_search.fit(X_train_counts, y_train)

print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)

In [None]:
#increasing training set

X_train, X_test, y_train, y_test = train_test_split(df.cleaned_text, df.airline_sentiment, test_size=0.1, 
                                                                                random_state=40, stratify=df.airline_sentiment)
count_vectorizer = CountVectorizer(ngram_range=(1,1),binary=True)


X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

clf = LogisticRegression(C=0.325, solver='newton-cg' ) #(C=2.3, class_weight='balanced',solver='newton-cg',multi_class='multinomial',random_state=10)
clf.fit(X_train_counts, y_train)

y_predicted_counts = clf.predict(X_test_counts)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


scores = cross_val_score(clf, X_train_counts, y_train, cv=5)
scores.mean()  #no big difference - overfitting on training set

## Conclusion
It seems that with linear classifiers we will not get much more than 78% accuracy on this dataset. Of course, one should also look at the misclassifications.

In [None]:
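from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(clf, X_test_counts, y_test)
plt.show()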
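Besides the plot, the same information is available as plain numbers. The sketch below uses made-up labels (not the notebook's predictions) to show the raw confusion matrix together with `classification_report`, which lists precision, recall and F1 per class:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels
y_true = ["negative", "negative", "negative", "neutral", "neutral", "positive"]
y_hat = ["negative", "negative", "neutral", "negative", "neutral", "positive"]
class_names = ["negative", "neutral", "positive"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=class_names)
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_true, y_hat, labels=class_names, zero_division=0))
```

On the notebook's data, `y_true` and `y_hat` would be `y_test` and `y_predicted_counts`.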

In [None]:
y_df = pd.DataFrame({"text": X_test, "real target": y_test, "predicted target": y_predicted_counts})

In [None]:
y_df

In [None]:
wrong = y_df[y_df["real target"] != y_df["predicted target"]]

In [None]:
wrong.to_csv("wrong_predictions_airline.csv")

In [None]:
# What happens if we delete the airline names from the tweets, so that they don't influence the classifier?
# stop_words is a set by now, so we use update() instead of extend()
stop_words.update(["united", "virginamerica", "delta", "usairways", "americanair", "southwestair", "jetblue"])

In [None]:
df["cleaned_text"] =df.text_loc.swifter.apply(preprocess)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.cleaned_text, df.airline_sentiment, test_size=0.2, 
                                                                                random_state=40, stratify=df.airline_sentiment)
count_vectorizer = CountVectorizer(ngram_range=(1,2),binary=True)


X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

clf = LogisticRegression(C=0.325, solver='newton-cg' ) #(C=2.3, class_weight='balanced',solver='newton-cg',multi_class='multinomial',random_state=10)
clf.fit(X_train_counts, y_train)

y_predicted_counts = clf.predict(X_test_counts)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))


scores = cross_val_score(clf, X_train_counts, y_train, cv=5)
scores.mean()  #no big difference - overfitting on training set

In [None]:
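from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(clf, X_test_counts, y_test)
plt.show()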

So now the classifier less often mislabels negative tweets as neutral, but more often mislabels neutral tweets as negative. However, as we can see, the airline name has no significant impact on the predicted sentiment.

In [None]:
y_df = pd.DataFrame({"text": X_test, "real target": y_test, "predicted target": y_predicted_counts})
wrong = y_df[y_df["real target"] != y_df["predicted target"]]
wrong.to_csv("wrong_predictions_airline.csv")

In [None]:
from textblob import TextBlob

def polarity_tb(text):
    text = TextBlob(text)
    pol = text.sentiment.polarity
    return pol

df["polarity"] = df.text_loc.swifter.apply(polarity_tb)

In [None]:
df[df.airline_sentiment == "neutral"].polarity.mean()

In [None]:
def translate_polarity(pol):
    if pol < 0.1 and pol> -0.1:
        return "neutral"
    
    if pol >= 0.1:
        return "positive"
    
    if pol <= -0.1:
        return "negative"

In [None]:
df["polarity_tb"] = df.polarity.apply(translate_polarity)

In [None]:
df

In [None]:
tb_wrong = df[df["airline_sentiment"] != df["polarity_tb"]]

In [None]:
len(tb_wrong)/len(df)

In [None]:
df.polarity_tb.value_counts()

As we can see, TextBlob doesn't do a great job at classifying the sentiment here, as it tends to label tweets as neutral or positive rather than negative.