# Context
## Description
A sentiment analysis task about the problems of each major U.S. airline.<br/>
Twitter data from February 2015 was scraped, and contributors were asked to:
- Classify tweets as positive, negative, or neutral.
- Categorize the reasons for negative tweets (such as "late flight" or "rude service").

## Dataset
The dataset can be downloaded from: https://www.kaggle.com/crowdflower/twitter-airline-sentiment<br/>
It contains the following columns:
* tweet_id
* airline_sentiment
* airline_sentiment_confidence
* negativereason
* negativereason_confidence
* airline
* airline_sentiment_gold
* name
* negativereason_gold
* retweet_count
* text
* tweet_coord
* tweet_created
* tweet_location
* user_timezone

# Objective
To implement the techniques learnt as a part of the course.

# Learning Outcomes
- Basic understanding of text pre-processing.
- What to do after text pre-processing:
    - Bag of words
    - Tf-idf
- Build the classification model.
- Evaluate the Model.

## 1.1 Import libraries and load dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_colwidth', 0) # Display full dataframe contents (non-truncated text column).

In [2]:
tweetsData = pd.read_csv("Tweets.csv")

## 1.2 Shape of the dataset

In [3]:
tweetsData.shape

(14640, 15)

## 1.3 Data description

In [4]:
tweetsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
tweet_id                        14640 non-null int64
airline_sentiment               14640 non-null object
airline_sentiment_confidence    14640 non-null float64
negativereason                  9178 non-null object
negativereason_confidence       10522 non-null float64
airline                         14640 non-null object
airline_sentiment_gold          40 non-null object
name                            14640 non-null object
negativereason_gold             32 non-null object
retweet_count                   14640 non-null int64
text                            14640 non-null object
tweet_coord                     1019 non-null object
tweet_created                   14640 non-null object
tweet_location                  9907 non-null object
user_timezone                   9820 non-null object
dtypes: float64(2), int64(2), object(11)
memory usage: 1.7+ MB


Columns with null values:
* negativereason
* negativereason_confidence
* airline_sentiment_gold
* negativereason_gold
* tweet_coord
* tweet_location
* user_timezone
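
To put numbers on the list above, the missing count per column can be derived from the `info()` output as total rows minus non-null entries. This is a small sketch that copies the figures by hand; on the real frame, `tweetsData.isnull().sum()` should give the same counts directly.

```python
# Non-null counts copied from the info() output above.
total_rows = 14640
non_null = {
    "negativereason": 9178,
    "negativereason_confidence": 10522,
    "airline_sentiment_gold": 40,
    "negativereason_gold": 32,
    "tweet_coord": 1019,
    "tweet_location": 9907,
    "user_timezone": 9820,
}

# Missing values = total rows - non-null entries.
missing = {col: total_rows - count for col, count in non_null.items()}
missing_share = {col: n / total_rows for col, n in missing.items()}

# Print the columns from most to least incomplete.
for col in sorted(missing, key=missing.get, reverse=True):
    print(f"{col:28s} {missing[col]:6d} missing ({missing_share[col]:.1%})")
```

The two `*_gold` columns and `tweet_coord` are almost entirely empty, while the two columns kept for modelling, `text` and `airline_sentiment`, have no missing values.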

In [5]:
tweetsData.head(5)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials to the experience... tacky.,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I need to take another trip!,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse",,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing about it,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [6]:
tweetsData["airline"].value_counts(dropna = False)

United            3822
US Airways        2913
American          2759
Southwest         2420
Delta             2222
Virgin America    504 
Name: airline, dtype: int64

In [7]:
tweetsData["airline_sentiment"].value_counts(dropna = False)

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
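
The classes are imbalanced, which matters when reading accuracy figures later. As a rough reference point, the sketch below computes the accuracy of a trivial classifier that always predicts the most frequent class, using the counts printed above.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))  # prints: negative 0.6269
```

Any model trained on this data should be judged against roughly 63% accuracy rather than against 33%.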

In [8]:
tweetsData["negativereason"].value_counts(dropna = False)

NaN                            5462
Customer Service Issue         2910
Late Flight                    1665
Can't Tell                     1190
Cancelled Flight               847 
Lost Luggage                   724 
Bad Flight                     580 
Flight Booking Problems        529 
Flight Attendant Complaints    481 
longlines                      178 
Damaged Luggage                74  
Name: negativereason, dtype: int64

In [9]:
tweetsData["airline_sentiment_gold"].value_counts(dropna = False)

NaN         14600
negative    32   
positive    5    
neutral     3    
Name: airline_sentiment_gold, dtype: int64

In [10]:
tweetsData["negativereason_gold"].value_counts(dropna = False)

NaN                                         14608
Customer Service Issue                      12   
Late Flight                                 4    
Can't Tell                                  3    
Cancelled Flight                            3    
Cancelled Flight\nCustomer Service Issue    2    
Customer Service Issue\nLost Luggage        1    
Flight Attendant Complaints                 1    
Customer Service Issue\nCan't Tell          1    
Bad Flight                                  1    
Late Flight\nLost Luggage                   1    
Lost Luggage\nDamaged Luggage               1    
Late Flight\nFlight Attendant Complaints    1    
Late Flight\nCancelled Flight               1    
Name: negativereason_gold, dtype: int64

## 2.1 Drop all other columns except “text” and “airline_sentiment”

In [11]:
tweetsData_reduced = tweetsData[["text", "airline_sentiment"]].copy()  # Copy so later column assignments do not trigger SettingWithCopyWarning.

## 2.2 Shape of new dataset

In [12]:
tweetsData_reduced.shape

(14640, 2)

## 2.3 Print first 5 rows of this dataset

In [13]:
tweetsData_reduced.head(5)

Unnamed: 0,text,airline_sentiment
0,@VirginAmerica What @dhepburn said.,neutral
1,@VirginAmerica plus you've added commercials to the experience... tacky.,positive
2,@VirginAmerica I didn't today... Must mean I need to take another trip!,neutral
3,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse",negative
4,@VirginAmerica and it's a really big bad thing about it,negative


## 3.1 Text pre-processing: HTML tag removal

In [14]:
from bs4 import BeautifulSoup                           # Import BeautifulSoup.

def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

tweetsData_reduced['text'] = tweetsData_reduced['text'].apply(lambda x: strip_html(x))
tweetsData_reduced.head(5)


Unnamed: 0,text,airline_sentiment
0,@VirginAmerica What @dhepburn said.,neutral
1,@VirginAmerica plus you've added commercials to the experience... tacky.,positive
2,@VirginAmerica I didn't today... Must mean I need to take another trip!,neutral
3,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces & they have little recourse",negative
4,@VirginAmerica and it's a really big bad thing about it,negative


* In row 3, the HTML entity `&amp;` was decoded back to `&`.
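
The entity decoding can be reproduced in isolation. The notebook uses BeautifulSoup; the sketch below swaps in the standard library's `html.unescape` only to show the decoding step on its own. Unlike `get_text()`, it does not strip tags.

```python
import html

# Fragment of the tweet in row 3, as stored in the CSV.
raw = "faces &amp; they have little recourse"

# Decode HTML entities such as &amp; back to their literal characters.
clean = html.unescape(raw)

print(clean)
```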

## 3.2 Text pre-processing: Tokenization

In [15]:
from nltk.tokenize import word_tokenize, sent_tokenize  # Import Tokenizer.

tweetsData_reduced['text'] = tweetsData_reduced.apply(lambda row: word_tokenize(row['text']), axis=1) # Tokenization of data

tweetsData_reduced.head(5)


Unnamed: 0,text,airline_sentiment
0,"[@, VirginAmerica, What, @, dhepburn, said, .]",neutral
1,"[@, VirginAmerica, plus, you, 've, added, commercials, to, the, experience, ..., tacky, .]",positive
2,"[@, VirginAmerica, I, did, n't, today, ..., Must, mean, I, need, to, take, another, trip, !]",neutral
3,"[@, VirginAmerica, it, 's, really, aggressive, to, blast, obnoxious, ``, entertainment, '', in, your, guests, ', faces, &, they, have, little, recourse]",negative
4,"[@, VirginAmerica, and, it, 's, a, really, big, bad, thing, about, it]",negative


## 3.3 Text pre-processing: Remove the numbers

In [16]:
import re

def remove_numbers(words):
    new_words = []
    for word in words:
        new_word = re.sub(r'\d+', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

## 3.4 Text pre-processing: Removal of Special Characters and Punctuations

In [17]:
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words
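
A short illustration of what this regex does to tokens produced by `word_tokenize` may help. The token list below is hand-picked from the tokenized output in section 3.2; the substitution is the same as in `remove_punctuation`.

```python
import re

# Tokens taken from the tokenized tweets shown in section 3.2.
tokens = ["@", "VirginAmerica", "did", "n't", "today", "...", "it", "'s", "``"]

# Strip every character that is neither a word character nor whitespace.
cleaned = [re.sub(r"[^\w\s]", "", tok) for tok in tokens]

# Drop tokens that became empty (pure punctuation).
cleaned = [tok for tok in cleaned if tok != ""]

print(cleaned)  # ['VirginAmerica', 'did', 'nt', 'today', 'it', 's']
```

Note that the contraction token `n't` becomes `nt`, which is why `nt` shows up in the normalized text in section 3.7.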

In [18]:
import unicodedata

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        if new_word != '':
            new_words.append(new_word)
    return new_words
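
To see what the NFKD normalization followed by an ASCII encode does, here is a minimal sketch on a few sample words (the words are illustrative, not taken from the dataset). Accented letters are split into a base letter plus a combining mark, and the mark is then dropped; characters with no ASCII equivalent disappear entirely.

```python
import unicodedata

def to_ascii(word):
    """Decompose characters (NFKD), then drop everything outside ASCII."""
    return unicodedata.normalize("NFKD", word).encode("ascii", "ignore").decode("ascii")

examples = ["café", "naïve", "✈", "flight"]
converted = [to_ascii(word) for word in examples]

print(converted)  # ['cafe', 'naive', '', 'flight']
```

The empty string produced for the airplane symbol is the reason `remove_non_ascii` filters out empty results.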

In [19]:
from nltk.corpus import stopwords                       # Import stopwords.

stopwords = stopwords.words('english')

customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Negation words such as "not" and "couldn't" carry sentiment, so take them out of the stop-word list to keep them in the tweets.

stopwords = list(set(stopwords) - set(customlist))

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords:
            new_words.append(word)
    return new_words

## 3.5 Text pre-processing: Convert to lowercase

In [20]:
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

## 3.6 Text pre-processing: Lemmatize

In [21]:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer.
import nltk

lemmatizer = WordNetLemmatizer()
    
def lemmatize_list(words):
    new_words = []
    
    for word, pos in nltk.pos_tag(words):
        if word != '':
            new_words.append(lemmatizer.lemmatize(word, get_wordnet_pos(pos)))
    return new_words

## 3.7 Text pre-processing: Normalize

In [22]:
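def normalize(words):
    """Run all cleaning steps on a token list and join the result into one string.

    Note: stop words are removed before lowercasing, so capitalized stop words
    such as "What" and "I" are kept (see rows 0 and 2 of the output below).
    """
    words = remove_numbers(words)
    words = remove_punctuation(words)
    words = remove_non_ascii(words)
    words = remove_stopwords(words)
    words = to_lowercase(words)
    words = lemmatize_list(words)
    return ' '.join(words)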

tweetsData_reduced['text'] = tweetsData_reduced.apply(lambda row: normalize(row['text']), axis=1)
tweetsData_reduced.head(5)


Unnamed: 0,text,airline_sentiment
0,virginamerica what dhepburn say,neutral
1,virginamerica plus added commercial experience tacky,positive
2,virginamerica i nt today must mean i need take another trip,neutral
3,virginamerica really aggressive blast obnoxious entertainment guest face little recourse,negative
4,virginamerica really big bad thing,negative
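
Rows 0 and 2 still contain `what` and `i`, both of which are English stop words. A likely cause is the order of the steps: the stop-word list is lowercase, and it is applied before the tokens are lowercased. The sketch below shows the effect with a tiny hypothetical stop-word set (not the NLTK list) and two helper names introduced only for this illustration.

```python
# Hypothetical miniature stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"what", "i", "to", "the"}

def drop_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS (case-sensitive)."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = ["What", "I", "need", "to", "take", "the", "trip"]

# Order used in normalize(): filter first, lowercase afterwards.
stop_then_lower = [tok.lower() for tok in drop_stop_words(tokens)]

# Alternative order: lowercase first, then filter.
lower_then_stop = drop_stop_words([tok.lower() for tok in tokens])

print(stop_then_lower)  # ['what', 'i', 'need', 'take', 'trip']
print(lower_then_stop)  # ['need', 'take', 'trip']
```

Swapping the two calls in `normalize` would change the feature matrices, so the scores reported in the following sections would no longer apply as printed.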


## 4.1 Vectorization - Use CountVectorizer + TruncatedSVD (1000 components)

In [23]:
# Vectorization (Convert text data to numbers).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = CountVectorizer()                # Full term-count matrix; TruncatedSVD below reduces it to 1000 components to limit processing time.
tweetsData_reduced_features_1 = vectorizer.fit_transform(tweetsData_reduced['text'])
svd1 = TruncatedSVD(n_components=1000, random_state=1)
tweetsData_reduced_features_1 = svd1.fit_transform(tweetsData_reduced_features_1) 
tweetsData_reduced_features_1.shape

(14640, 1000)
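
Keeping the most frequent terms and projecting with `TruncatedSVD` are two different ways to arrive at a fixed number of columns. The sketch below contrasts them on a toy corpus (hypothetical sentences, 3 columns instead of 1000): `max_features` keeps the columns of the most frequent terms, while SVD produces dense components that mix all terms.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, for illustration only.
docs = [
    "flight late again",
    "late flight service",
    "great crew flight late",
    "bag lost flight service",
]

# Option A: keep only the 3 most frequent terms across the corpus.
top_terms = CountVectorizer(max_features=3)
counts_top = top_terms.fit_transform(docs)
kept_terms = sorted(top_terms.vocabulary_)

# Option B (the approach used in this notebook): count all terms,
# then project the count matrix onto 3 SVD components.
counts_full = CountVectorizer().fit_transform(docs)
svd_features = TruncatedSVD(n_components=3, random_state=1).fit_transform(counts_full)

print(kept_terms)            # terms that survive under option A
print(counts_full.shape)     # all terms are counted under option B
print(svd_features.shape)    # dense matrix with 3 columns
```

With option A each column is still a readable word count; with option B the columns are linear combinations of word counts and are no longer directly interpretable.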

## 4.2 Vectorization - Use TfidfVectorizer + TruncatedSVD (1000 components)

In [24]:
# Using TfidfVectorizer to convert text data to numbers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = TfidfVectorizer()
tweetsData_reduced_features_2 = vectorizer.fit_transform(tweetsData_reduced['text'])
svd2 = TruncatedSVD(n_components=1000, random_state=1)
tweetsData_reduced_features_2 = svd2.fit_transform(tweetsData_reduced_features_2) 
tweetsData_reduced_features_2.shape

(14640, 1000)
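
For readers who want to see what the TF-IDF weights look like, here is a hand-rolled sketch on a toy corpus. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each document vector is scaled to unit length. The corpus and variable names are illustrative.

```python
import math

# Toy tokenized corpus, for illustration only.
docs = [
    ["late", "flight"],
    ["late", "late", "bag"],
    ["great", "flight"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term occurs.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency (scikit-learn default formula).
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_vector(doc):
    """Raw term count times idf, followed by L2 normalization."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

vectors = [tfidf_vector(doc) for doc in docs]

for term in vocab:
    print(f"{term:7s} df={doc_freq[term]} idf={idf[term]:.4f}")
```

Terms that occur in fewer documents (`bag`, `great`) receive a larger weight than terms shared by several documents (`late`, `flight`).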

## 4.3 Split 4.1 & 4.2 data into training and testing sets

In [25]:
labels = tweetsData_reduced['airline_sentiment']
labels = labels.replace("neutral", "0").replace("negative", "-1").replace("positive", "1")
labels = labels.astype('int')

from sklearn.model_selection import train_test_split

X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(tweetsData_reduced_features_1, labels, test_size=0.3, random_state=1)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(tweetsData_reduced_features_2, labels, test_size=0.3, random_state=1)
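
One caveat with this setup: the vectorizer and the SVD were fitted on all 14640 tweets before the split, so the test rows influenced the features. A common alternative is to wrap the steps in a scikit-learn `Pipeline`, so that each cross-validation fold fits them on its training part only. The sketch below uses a toy corpus and 2 components; it is not the notebook's configuration and makes no claim about how the scores would change.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus with hypothetical labels (-1 = negative, 1 = positive).
texts = [
    "flight late again", "lost bag bad service", "rude crew late flight",
    "bad delay lost bag", "late delay rude service", "bad flight rude crew",
    "great crew great flight", "thanks nice service", "love great service",
    "nice crew thanks", "love nice flight", "thanks great crew",
]
toy_labels = [-1] * 6 + [1] * 6

# Vectorizer, SVD and classifier are refitted inside every fold,
# so the held-out part never influences the features.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=1),
    LinearSVC(random_state=1),
)

scores = cross_val_score(pipeline, texts, toy_labels, cv=3)
print(scores.mean())
```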

## 5.1 Evaluate model - Use LinearSVC + 4.1 data

In [26]:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

svc1 = LinearSVC(random_state=1)
svc1 = svc1.fit(X_train_1, y_train_1)

print(svc1)
print(np.mean(cross_val_score(svc1, tweetsData_reduced_features_1, labels, cv=10)))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=1, tol=0.0001,
          verbose=0)
0.7574453551912568


In [27]:
# Predict the result for test data using the model built above.

result1 = svc1.predict(X_test_1)
svc1.score(X_test_1, y_test_1)

0.7789162112932605

In [28]:
from sklearn.metrics import confusion_matrix

conf_mat_1 = confusion_matrix(y_test_1, result1)
print(conf_mat_1)

[[2396  252   93]
 [ 301  542   93]
 [ 135   97  483]]
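
The confusion matrix can be read as follows: scikit-learn puts the true classes in rows and the predicted classes in columns, sorted by label, so the order here is -1 (negative), 0 (neutral), 1 (positive). The sketch below recomputes the accuracy and the per-class recall from the printed matrix.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
conf_mat = [
    [2396, 252, 93],
    [301, 542, 93],
    [135, 97, 483],
]
class_labels = [-1, 0, 1]

total = sum(sum(row) for row in conf_mat)
correct = sum(conf_mat[i][i] for i in range(len(class_labels)))

# Accuracy: share of tweets on the diagonal.
accuracy = correct / total

# Recall per class: diagonal entry divided by its row sum.
recall = {
    label: conf_mat[i][i] / sum(conf_mat[i])
    for i, label in enumerate(class_labels)
}

print(round(accuracy, 4))
for label, value in recall.items():
    print(label, round(value, 4))
```

The accuracy matches the score printed in the previous cell. The recall is noticeably higher for the negative class than for the neutral and positive classes, which is consistent with the class imbalance in the data.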


## 5.2 Evaluate model - Use LinearSVC + 4.2 data

In [29]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svc2 = LinearSVC(random_state=1)
svc2 = svc2.fit(X_train_2, y_train_2)

print(svc2)
print(np.mean(cross_val_score(svc2, tweetsData_reduced_features_2, labels, cv=10)))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=1, tol=0.0001,
          verbose=0)
0.7661202185792348


In [30]:
# Predict the result for test data using the model built above.

result2 = svc2.predict(X_test_2)
svc2.score(X_test_2, y_test_2)

0.7871129326047359

In [31]:
from sklearn.metrics import confusion_matrix

conf_mat_2 = confusion_matrix(y_test_2, result2)
print(conf_mat_2)

[[2473  198   70]
 [ 341  520   75]
 [ 150  101  464]]


## 5.3 Evaluate model - Use H2O + 4.1 & 4.2 data

In [32]:
#!pip install requests
#!pip install tabulate
#!pip install "colorama>=0.3.8"
#!pip install future
#!pip uninstall h2o
#!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

In [33]:
# For automated model selection
import h2o
from h2o.automl import H2OAutoML

In [34]:
# Start H2O cluster
h2o.init(max_mem_size = 24)

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)
  Starting server from C:\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\tanvu\AppData\Local\Temp\tmpvk94qo0w
  JVM stdout: C:\Users\tanvu\AppData\Local\Temp\tmpvk94qo0w\h2o_tanvu_started_from_python.out
  JVM stderr: C:\Users\tanvu\AppData\Local\Temp\tmpvk94qo0w\h2o_tanvu_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Asia/Kuala_Lumpur
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.0.3
H2O_cluster_version_age:,8 days
H2O_cluster_name:,H2O_from_python_tanvu_bo5yvl
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,24 Gb
H2O_cluster_total_cores:,6
H2O_cluster_allowed_cores:,6


In [35]:
train_1 = np.column_stack((X_train_1, y_train_1))
test_1 = np.column_stack((X_test_1, y_test_1))

# Convert the NumPy arrays into H2O frames
h2o_train = h2o.H2OFrame(train_1)
h2o_test = h2o.H2OFrame(test_1)

h2o_train[1000] = h2o_train[1000].asfactor()
h2o_test[1000] = h2o_test[1000].asfactor()

h2o_model = H2OAutoML(max_models = 20, max_runtime_secs = 1800, seed = 1)
h2o_model.train(x = list(range(1000)), 
                y = 1000, 
                training_frame = h2o_train,  
                leaderboard_frame = h2o_test)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |
22:15:55.505: AutoML: XGBoost is not available; skipping it.

████████████████████████████████████████████████████████| 100%


In [36]:
h2o_model.leaderboard

model_id,mean_per_class_error,logloss,rmse,mse
DeepLearning_1_AutoML_20200521_221555,0.341929,1.09001,0.478358,0.228827
DeepLearning_grid__1_AutoML_20200521_221555_model_1,0.352857,1.89441,0.48942,0.239532
GBM_grid__1_AutoML_20200521_221555_model_1,0.367098,0.598104,0.447953,0.200662
DeepLearning_grid__2_AutoML_20200521_221555_model_1,0.418003,0.650591,0.453851,0.205981
GBM_4_AutoML_20200521_221555,0.423221,0.809137,0.548632,0.300997
GBM_2_AutoML_20200521_221555,0.426486,0.730861,0.510498,0.260608
GBM_grid__1_AutoML_20200521_221555_model_2,0.426687,0.755823,0.521551,0.272016
GBM_3_AutoML_20200521_221555,0.427254,0.754689,0.522226,0.27272
GBM_5_AutoML_20200521_221555,0.429603,0.775607,0.532899,0.283981
GBM_1_AutoML_20200521_221555,0.436896,0.727036,0.508376,0.258446




In [37]:
predictions = h2o_model.leader.predict(h2o_test)
predictions = predictions.as_data_frame()

deeplearning prediction progress: |███████████████████████████████████████| 100%


In [38]:
# Compare predictions with actuals
pd.crosstab(y_test_1.reset_index(drop=True), predictions['predict'], colnames = ['Predictions'], margins=True, margins_name="Total")

airline_sentiment \ Predictions,-1,0,1,Total
-1,2286,311,144,2741
0,341,486,109,936
1,152,119,444,715
Total,2779,916,697,4392


In [39]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test_1.reset_index(drop=True), predictions['predict'])

0.73224043715847
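
The leaderboard ranks models by `mean_per_class_error`, not by accuracy. Assuming this metric is the unweighted average of the per-class error rates, it can be recomputed from the crosstab above; since the leaderboard frame was `h2o_test`, the result should agree with the leader's entry (0.341929).

```python
# Crosstab copied from the output above: {actual: {predicted: count}}.
crosstab = {
    -1: {-1: 2286, 0: 311, 1: 144},
    0: {-1: 341, 0: 486, 1: 109},
    1: {-1: 152, 0: 119, 1: 444},
}

# Error rate per actual class: 1 - (correct predictions / row total).
per_class_error = {
    actual: 1 - row[actual] / sum(row.values())
    for actual, row in crosstab.items()
}

# Unweighted mean over the three classes.
mean_per_class_error = sum(per_class_error.values()) / len(per_class_error)

# Plain accuracy for comparison.
n_total = sum(sum(row.values()) for row in crosstab.values())
accuracy = sum(row[actual] for actual, row in crosstab.items()) / n_total

print(round(mean_per_class_error, 6))
print(round(accuracy, 6))
```

Because every class counts equally in this metric, weak performance on the smaller neutral and positive classes weighs more heavily than it does in plain accuracy.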

In [40]:
train_2 = np.column_stack((X_train_2, y_train_2))
test_2 = np.column_stack((X_test_2, y_test_2))

# Convert the NumPy arrays into H2O frames
h2o_train_2 = h2o.H2OFrame(train_2)
h2o_test_2 = h2o.H2OFrame(test_2)

h2o_train_2[1000] = h2o_train_2[1000].asfactor()
h2o_test_2[1000] = h2o_test_2[1000].asfactor()

h2o_model_2 = H2OAutoML(max_models = 20, max_runtime_secs = 1800, seed = 1)
h2o_model_2.train(x = list(range(1000)), 
                y = 1000, 
                training_frame = h2o_train_2,  
                leaderboard_frame = h2o_test_2)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |
22:46:09.271: AutoML: XGBoost is not available; skipping it.

████████████████████████████████████████████████████████| 100%


In [41]:
h2o_model_2.leaderboard

model_id,mean_per_class_error,logloss,rmse,mse
StackedEnsemble_BestOfFamily_AutoML_20200521_224609,0.335121,0.574048,0.432644,0.187181
DeepLearning_1_AutoML_20200521_224609,0.33814,1.12791,0.478508,0.22897
DeepLearning_grid__1_AutoML_20200521_224609_model_1,0.345546,1.76519,0.48572,0.235924
GBM_grid__1_AutoML_20200521_224609_model_1,0.38209,0.611784,0.454529,0.206596
DeepLearning_grid__3_AutoML_20200521_224609_model_1,0.385412,0.860154,0.503181,0.253191
DeepLearning_grid__2_AutoML_20200521_224609_model_1,0.425665,0.665526,0.45962,0.211251
GBM_2_AutoML_20200521_224609,0.426436,0.719101,0.504586,0.254607
GBM_3_AutoML_20200521_224609,0.429936,0.753945,0.521706,0.272177
GBM_4_AutoML_20200521_224609,0.432245,0.784857,0.537003,0.288372
GBM_1_AutoML_20200521_224609,0.432814,0.713927,0.501767,0.25177




In [42]:
predictions_2 = h2o_model_2.leader.predict(h2o_test_2)
predictions_2 = predictions_2.as_data_frame()

stackedensemble prediction progress: |████████████████████████████████████| 100%


In [43]:
# Compare predictions with actuals
pd.crosstab(y_test_2.reset_index(drop=True), predictions_2['predict'], colnames = ['Predictions'], margins=True, margins_name="Total")

airline_sentiment \ Predictions,-1,0,1,Total
-1,2522,153,66,2741
0,423,435,78,936
1,181,98,436,715
Total,3126,686,580,4392


In [44]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test_2.reset_index(drop=True), predictions_2['predict'])

0.7725409836065574

## 6.1 Vectorization - Use CountVectorizer + TruncatedSVD (2000 components)

In [45]:
# Vectorization (Convert text data to numbers).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = CountVectorizer()                # Full term-count matrix; TruncatedSVD below reduces it to 2000 components.
tweetsData_reduced_features_3 = vectorizer.fit_transform(tweetsData_reduced['text'])
svd1 = TruncatedSVD(n_components=2000, random_state=1)
tweetsData_reduced_features_3 = svd1.fit_transform(tweetsData_reduced_features_3) 
tweetsData_reduced_features_3.shape

(14640, 2000)

## 6.2 Vectorization - Use TfidfVectorizer + TruncatedSVD (2000 components)

In [47]:
# Using TfidfVectorizer to convert text data to numbers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = TfidfVectorizer()
tweetsData_reduced_features_4 = vectorizer.fit_transform(tweetsData_reduced['text'])
svd2 = TruncatedSVD(n_components=2000, random_state=1)
tweetsData_reduced_features_4 = svd2.fit_transform(tweetsData_reduced_features_4) 
tweetsData_reduced_features_4.shape

(14640, 2000)

## 6.3 Split 6.1 & 6.2 data into training and testing sets

In [48]:
labels = tweetsData_reduced['airline_sentiment']
labels = labels.replace("neutral", "0").replace("negative", "-1").replace("positive", "1")
labels = labels.astype('int')

from sklearn.model_selection import train_test_split

X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(tweetsData_reduced_features_3, labels, test_size=0.3, random_state=1)
X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(tweetsData_reduced_features_4, labels, test_size=0.3, random_state=1)

## 7.1 Evaluate model - Use LinearSVC + 6.1 data

In [50]:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

svc3 = LinearSVC(random_state=1)
svc3 = svc3.fit(X_train_3, y_train_3)

print(svc3)
print(np.mean(cross_val_score(svc3, tweetsData_reduced_features_3, labels, cv=10)))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=1, tol=0.0001,
          verbose=0)
0.7476092896174864


In [51]:
# Predict the result for test data using the model built above.

result3 = svc3.predict(X_test_3)
svc3.score(X_test_3, y_test_3)

0.7668488160291439

In [52]:
from sklearn.metrics import confusion_matrix

conf_mat_3 = confusion_matrix(y_test_3, result3)
print(conf_mat_3)

[[2363  260  118]
 [ 298  517  121]
 [ 132   95  488]]


## 7.2 Evaluate model - Use LinearSVC + 6.2 data

In [53]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svc4 = LinearSVC(random_state=1)
svc4 = svc4.fit(X_train_4, y_train_4)

print(svc4)
print(np.mean(cross_val_score(svc4, tweetsData_reduced_features_4, labels, cv=10)))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=1, tol=0.0001,
          verbose=0)
0.7594945355191257


In [54]:
# Predict the result for test data using the model built above.

result4 = svc4.predict(X_test_4)
svc4.score(X_test_4, y_test_4)

0.7841530054644809

In [55]:
from sklearn.metrics import confusion_matrix

conf_mat_4 = confusion_matrix(y_test_4, result4)
print(conf_mat_4)

[[2480  189   72]
 [ 347  500   89]
 [ 153   98  464]]
