# Sentiment Analysis 

## Project Preparation

1. Load libraries
2. Load dataset

In [1]:
import pandas as pd
import glob

fpattern = './data/twitter-20*train-*.tsv'
filenames = [filename for filename in sorted(glob.glob(fpattern))]
print(filenames)

['./data/twitter-2013train-A.tsv', './data/twitter-2015train-A.tsv', './data/twitter-2016train-A.tsv']


Having built the list of training files from a filename pattern, we can load them into a single pandas dataframe.

In [2]:
column_names = ['id', 'tag', 'tweet']
df = pd.concat([pd.read_csv(f, sep="\t", quoting=3, names=column_names) for f in filenames], ignore_index=True, sort=True)
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16173 entries, 0 to 16172
Data columns (total 3 columns):
id       16173 non-null int64
tag      16173 non-null object
tweet    16173 non-null object
dtypes: int64(1), object(2)
memory usage: 379.1+ KB


| | id | tag | tweet |
|---|---|---|---|
| 0 | 264183816548130816 | positive | Gas by my house hit $3.39!!!! I'm going to Cha... |
| 1 | 263405084770172928 | negative | Not Available |
| 2 | 262163168678248449 | negative | Not Available |
| 3 | 264249301910310912 | negative | Iranian general says Israel's Iron Dome can't ... |
| 4 | 262682041215234048 | neutral | Not Available |


Now all our training data are stored in a dataframe. However, we can see that some tweets are 'Not Available'. These rows carry no useful information for our future classifier, so we must remove them.

In [3]:
# Drop rows having 'Not Available'...
df = df[df.tweet != 'Not Available']
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12016 entries, 0 to 16172
Data columns (total 3 columns):
id       12016 non-null int64
tag      12016 non-null object
tweet    12016 non-null object
dtypes: int64(1), object(2)
memory usage: 375.5+ KB


| | id | tag | tweet |
|---|---|---|---|
| 0 | 264183816548130816 | positive | Gas by my house hit $3.39!!!! I'm going to Cha... |
| 3 | 264249301910310912 | negative | Iranian general says Israel's Iron Dome can't ... |
| 6 | 264105751826538497 | positive | with J Davlar 11th. Main rivals are team Polan... |
| 7 | 264094586689953794 | negative | Talking about ACT's &amp;&amp; SAT's, deciding... |
| 9 | 254941790757601280 | negative | They may have a SuperBowl in Dallas, but Dalla... |


# Problem Definition

Our goal is to train a classifier that, given a new text string (a tweet), assigns it one of our tags: positive, neutral or negative.
Before diving into the exploratory analysis, we must define our features, so we first split our tweets into words.

In [4]:
import string
import re
from nltk.corpus import stopwords

# turn a document into a list of clean tokens
def clean_doc(doc):
    # Remove links...
    doc = re.sub(r"\w+://\S+", " ", doc)
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words (NLTK's list is lowercase, so capitalized forms such as 'The' are kept)
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
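
To see what each cleaning step does, here is a minimal standalone sketch of the same pipeline applied to the first training tweet. It assumes a tiny hypothetical stop list (`STOP_WORDS`) in place of NLTK's English list, and the helper name `clean_tokens` and the trailing link in the sample are made up for illustration.

```python
import re
import string

# Hypothetical mini stop list standing in for NLTK's English stop words
STOP_WORDS = {"by", "my", "to", "on", "the"}

def clean_tokens(text, stop_words=STOP_WORDS):
    # drop anything that looks like scheme://rest-of-link
    text = re.sub(r"\w+://\S+", " ", text)
    # strip punctuation characters from every whitespace-separated token
    strip_punct = re.compile("[%s]" % re.escape(string.punctuation))
    tokens = [strip_punct.sub("", w) for w in text.split()]
    # keep purely alphabetic tokens only (this removes '339' and empty strings)
    tokens = [w for w in tokens if w.isalpha()]
    # case-sensitive stop word filter, as in clean_doc
    tokens = [w for w in tokens if w not in stop_words]
    # drop one-letter tokens
    return [w for w in tokens if len(w) > 1]

sample = "Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :) http://t.co/x"
print(clean_tokens(sample))
```

Because the stop word comparison is case-sensitive, a capitalized "The" passes through this sketch untouched, which is consistent with 'The' showing up among the most frequent tokens later on.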

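We apply this function to our dataset. (The cell below appears to have been re-run later in the session, so its info output already lists the vector_tokens and btag columns that are created further down.)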

In [19]:
import numpy as np

# one list of clean tokens per tweet
df['tokens'] = df.tweet.apply(clean_doc)
df.info()  
df.head()
df.groupby('tag').size()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12016 entries, 0 to 16172
Data columns (total 6 columns):
id               12016 non-null int64
tag              12016 non-null object
tweet            12016 non-null object
tokens           12016 non-null object
vector_tokens    12016 non-null object
btag             12016 non-null int8
dtypes: int64(1), int8(1), object(4)
memory usage: 895.0+ KB


tag
negative    1671
neutral     5181
positive    5164
dtype: int64

We repeat this procedure for dev and test data

In [6]:
fpattern = './data/twitter-20*dev-*.tsv'
devfs    = [filename for filename in sorted(glob.glob(fpattern))]
fpattern = './data/twitter-20*test-*.tsv'
testfs   = [filename for filename in sorted(glob.glob(fpattern))]
df_dev   = pd.concat([pd.read_csv(f, sep="\t", quoting=3, names=column_names) for f in devfs],  ignore_index=True, sort=True)
df_test  = pd.concat([pd.read_csv(f, sep="\t", quoting=3, names=column_names) for f in testfs], ignore_index=True, sort=True)
df_dev   = df_dev[df_dev.tweet != 'Not Available']
df_test  = df_test[df_test.tweet != 'Not Available']
df_dev['tokens']  = df_dev.tweet.apply(clean_doc)
df_test['tokens'] = df_test.tweet.apply(clean_doc)

To extract our vocabulary, we need to iterate through all the tokens that are found and count the occurrences of each token.

In [7]:
from collections import Counter
import itertools

vocabulary = Counter()
for tweet_tokens in itertools.chain(df.tokens, df_dev.tokens, df_test.tokens):
    vocabulary.update(tweet_tokens)

print('Total tweets: ', sum(1 for _ in itertools.chain(df.tokens, df_dev.tokens, df_test.tokens)))
vocabulary.most_common(10)

Total tweets:  30790


[('may', 4040),
 ('tomorrow', 3942),
 ('The', 1934),
 ('Im', 1754),
 ('going', 1704),
 ('amp', 1687),
 ('see', 1667),
 ('day', 1667),
 ('Friday', 1648),
 ('like', 1582)]
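
The counting step can be illustrated on toy data. This is a minimal sketch with made-up token lists; the `min_count` threshold and the `kept` set are hypothetical additions that show how a vocabulary could be pruned, which the notebook itself does not do.

```python
from collections import Counter
import itertools

# Made-up token lists standing in for df.tokens and df_dev.tokens
train_tokens = [["good", "day"], ["bad", "day"]]
dev_tokens = [["good", "game"]]

vocabulary = Counter()
for tokens in itertools.chain(train_tokens, dev_tokens):
    # update() adds one count per occurrence of each token
    vocabulary.update(tokens)

top_two = vocabulary.most_common(2)
print(top_two)

# Hypothetical pruning step: keep only tokens seen at least min_count times
min_count = 2
kept = {word for word, count in vocabulary.items() if count >= min_count}
print(sorted(kept))
```

Without such a threshold every token is in the vocabulary by construction, so a membership test against it keeps everything.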

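Now we can convert each tweet from a list of tokens to a single space-separated string of the tokens found in our vocabulary. Since the vocabulary was built from all three splits, no token is actually filtered out at this point.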

In [8]:
def token_to_vector_words(tokens, vocabulary):
    tokens = [w for w in tokens if w in vocabulary]
    return ' '.join(tokens)

print(df.tweet[0])
token_to_vector_words(df.tokens[0], vocabulary)

Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :)


'Gas house hit Im going Chapel Hill Sat'

And we apply this to all our dataframes

In [9]:
df['vector_tokens']      = np.array([ token_to_vector_words(tweet, vocabulary) for tweet in df.tokens ])
df_dev['vector_tokens']  = np.array([ token_to_vector_words(tweet, vocabulary) for tweet in df_dev.tokens ])
df_test['vector_tokens'] = np.array([ token_to_vector_words(tweet, vocabulary) for tweet in df_test.tokens ])
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12016 entries, 0 to 16172
Data columns (total 5 columns):
id               12016 non-null int64
tag              12016 non-null object
tweet            12016 non-null object
tokens           12016 non-null object
vector_tokens    12016 non-null object
dtypes: int64(1), object(4)
memory usage: 883.2+ KB


| | id | tag | tweet | tokens | vector_tokens |
|---|---|---|---|---|---|
| 0 | 264183816548130816 | positive | Gas by my house hit $3.39!!!! I'm going to Cha... | [Gas, house, hit, Im, going, Chapel, Hill, Sat] | Gas house hit Im going Chapel Hill Sat |
| 3 | 264249301910310912 | negative | Iranian general says Israel's Iron Dome can't ... | [Iranian, general, says, Israels, Iron, Dome, ... | Iranian general says Israels Iron Dome cant de... |
| 6 | 264105751826538497 | positive | with J Davlar 11th. Main rivals are team Polan... | [Davlar, Main, rivals, team, Poland, Hopefully... | Davlar Main rivals team Poland Hopefully make ... |
| 7 | 264094586689953794 | negative | Talking about ACT's &amp;&amp; SAT's, deciding... | [Talking, ACTs, ampamp, SATs, deciding, want, ... | Talking ACTs ampamp SATs deciding want go coll... |
| 9 | 254941790757601280 | negative | They may have a SuperBowl in Dallas, but Dalla... | [They, may, SuperBowl, Dallas, Dallas, aint, w... | They may SuperBowl Dallas Dallas aint winning ... |


As our Labels are categorical, we need to convert our classes to numeric tags

In [10]:
# Map tag from class (negative, neutral, positive) to integer codes (0, 1, 2)...
df['btag']      = df.tag.astype('category').cat.codes
df_dev['btag']  = df_dev.tag.astype('category').cat.codes
df_test['btag'] = df_test.tag.astype('category').cat.codes
df_dev.head(6)

| | id | tag | tweet | tokens | vector_tokens | btag |
|---|---|---|---|---|---|---|
| 0 | 638060586258038784 | neutral | 05 Beat it - Michael Jackson - Thriller (25th ... | [Beat, Michael, Jackson, Thriller, Anniversary... | Beat Michael Jackson Thriller Anniversary Edit... | 1 |
| 1 | 638061181823922176 | positive | Jay Z joins Instagram with nostalgic tribute t... | [Jay, joins, Instagram, nostalgic, tribute, Mi... | Jay joins Instagram nostalgic tribute Michael ... | 2 |
| 2 | 638083821364244480 | neutral | Michael Jackson: Bad 25th Anniversary Edition ... | [Michael, Jackson, Bad, Anniversary, Edition, ... | Michael Jackson Bad Anniversary Edition Pictur... | 1 |
| 4 | 638125563790557184 | positive | 18th anniv of Princess Diana's death. I still ... | [anniv, Princess, Dianas, death, still, want, ... | anniv Princess Dianas death still want believe... | 2 |
| 5 | 638130776727535617 | positive | @oridaganjazz The 1st time I heard Michael Jac... | [oridaganjazz, The, time, heard, Michael, Jack... | oridaganjazz The time heard Michael Jackson si... | 2 |
| 8 | 638162155250954241 | negative | @etbowser do u enjoy his 2nd rate Michael Jack... | [etbowser, enjoy, rate, Michael, Jackson, bit,... | etbowser enjoy rate Michael Jackson bit Honest... | 0 |
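
Category codes are derived separately for each dataframe, so they only agree across splits when every split contains all three tags. A minimal sketch of an explicit mapping that avoids this dependency follows; the names `LABEL_TO_CODE`, `encode` and `codes_from_split` are hypothetical and not part of the notebook.

```python
# Fixed mapping, matching the codes seen above (negative=0, neutral=1, positive=2)
LABEL_TO_CODE = {"negative": 0, "neutral": 1, "positive": 2}

def encode(tags):
    # same code for the same tag, regardless of which tags a split contains
    return [LABEL_TO_CODE[tag] for tag in tags]

def codes_from_split(tags):
    # mimics per-split category codes: sorted unique tags numbered from 0
    order = {tag: i for i, tag in enumerate(sorted(set(tags)))}
    return [order[tag] for tag in tags]

full_split = ["neutral", "positive", "negative"]
partial_split = ["neutral", "positive"]  # a split with no negative tweets

print(encode(full_split), codes_from_split(full_split))
print(encode(partial_split), codes_from_split(partial_split))
```

On the full split both approaches agree, while on the partial split the per-split codes shift down by one.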


**To vectorize our data we will use pre-trained embeddings**
Word2Vec: https://code.google.com/archive/p/word2vec/

From our vocabulary we will find the pre-trained embeddings and create an embedding matrix that will be used to fit our classification algorithms!

In [12]:
from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin', binary=True)
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df.vector_tokens)

word_index = tokenizer.word_index
embedding_dimension = 300

embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimension))

for word, i in word_index.items():
    try:
        # copy the pre-trained vector when word2vec knows the word
        embedding_matrix[i] = word_vectors[word]
    except KeyError:
        # unknown words get a random vector drawn from N(0, 0.25); only row 0 (padding) stays all-zeros
        embedding_matrix[i] = np.random.normal(0, np.sqrt(0.25), embedding_dimension)

del(word_vectors)

print(embedding_matrix.shape)
print(embedding_matrix)

Xtrain = tokenizer.texts_to_sequences(df.vector_tokens)
Ytrain = df.btag
Xtest  = tokenizer.texts_to_sequences(df_test.vector_tokens)
Ytest  = df_test.btag
from keras.utils import np_utils
Ytrain_one_hot = np_utils.to_categorical(Ytrain)
Ytest_one_hot  = np_utils.to_categorical(Ytest)
print(Xtrain[0])

# Get the longest tweet...
longest = max(df.tokens,key=len)
print(longest)
longest = max(Xtrain,key=len)
print(longest)
longest = len(longest)
print("Longest tweet (in words): ", longest)   


(23740, 300)
[[ 0.          0.          0.         ...  0.          0.
   0.        ]
 [-0.02001953  0.01135254  0.18847656 ... -0.02197266  0.07666016
  -0.16894531]
 [ 0.05981445  0.02380371  0.06738281 ... -0.10986328 -0.00872803
   0.03710938]
 ...
 [ 0.22594667 -0.22594741  0.18940675 ... -0.17294055  0.42542338
  -0.7116653 ]
 [-0.04418945 -0.0177002  -0.06982422 ... -0.10791016  0.01586914
   0.0378418 ]
 [-0.65509801  0.03796282  0.39025736 ...  0.34383929 -0.22580719
   0.76601887]]
[2625, 141, 358, 7, 6, 2340, 1249, 49]
['Make', 'Sure', 'To', 'Come', 'To', 'The', 'Bob', 'Jones', 'Game', 'Friday', 'Free', 'Hot', 'Dogs', 'Hamburgers', 'amp', 'Food', 'outside', 'gate', 'amp', 'watch', 'Bob', 'Jones', 'take', 'Austin', 'High']
[34, 119, 327, 29, 327, 5, 192, 1542, 15, 10, 96, 470, 1564, 14202, 11, 635, 1020, 3845, 11, 24, 192, 1542, 85, 1321, 465]
Longest tweet (in words):  25
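
The row layout of the embedding matrix is easier to see on a toy example. The sketch below uses plain Python lists and a made-up two-dimensional lookup table instead of word2vec; `build_embedding_matrix`, `pretrained` and the seed are assumptions for illustration only.

```python
import random

def build_embedding_matrix(word_index, pretrained, dim, seed=0):
    rng = random.Random(seed)
    # one row per word index plus row 0, which is reserved for padding
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in pretrained:
            # known word: copy its pre-trained vector
            matrix[i] = list(pretrained[word])
        else:
            # unknown word: random vector with standard deviation 0.5 (variance 0.25)
            matrix[i] = [rng.gauss(0.0, 0.5) for _ in range(dim)]
    return matrix

word_index = {"good": 1, "day": 2, "zzz": 3}            # tokenizer-style indices start at 1
pretrained = {"good": [0.1, 0.2], "day": [0.3, 0.4]}    # made-up vectors
matrix = build_embedding_matrix(word_index, pretrained, dim=2)
for row in matrix:
    print(row)
```

Row 0 stays all-zeros because no word maps to index 0, which is why the matrix has one more row than the tokenizer has words.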


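We pad every sequence to the length of the longest training tweet (25 tokens). By default pad_sequences adds zeros at the front and also truncates longer sequences from the front.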

In [13]:
from keras.preprocessing.sequence import pad_sequences
print(Xtrain[0])
Xtrain = pad_sequences(Xtrain, maxlen=longest)
Xtest  = pad_sequences(Xtest,  maxlen=longest)
print(Xtrain[0])

[2625, 141, 358, 7, 6, 2340, 1249, 49]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0 2625  141  358    7    6 2340 1249   49]
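
The padding behaviour can be reproduced with a few lines of plain Python. This is a sketch of the default front-padding and front-truncation only, with a hypothetical helper name `pad_left`; it assumes `maxlen` is at least 1.

```python
def pad_left(sequence, maxlen, value=0):
    # keep the last maxlen items (truncation drops items from the front)
    trimmed = sequence[-maxlen:]
    # fill the front with the padding value up to maxlen
    return [value] * (maxlen - len(trimmed)) + trimmed

first_tweet = [2625, 141, 358, 7, 6, 2340, 1249, 49]
padded = pad_left(first_tweet, 25)
print(padded)
print(pad_left([1, 2, 3, 4], 2))
```

Index 0 is safe to use as the padding value because the tokenizer never assigns it to a word.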


In [15]:
n_words = embedding_matrix.shape[0]
print(n_words)

23740


In [23]:
print(Ytrain_one_hot[0])

[0. 0. 1.]


In [24]:
print(Ytrain[0])

2
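
The two outputs above are the same label in two encodings: class 2 (positive) and its one-hot row. A minimal sketch of the conversion, with a hypothetical helper `one_hot`:

```python
def one_hot(label, num_classes):
    # all zeros except a 1.0 at the position of the class code
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

print(one_hot(2, 3))
print([one_hot(code, 3) for code in (0, 1, 2)])
```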


# Question A
# Supervised Classifier Selection

## 1. Let's try to train a simple neural network

In [28]:
from keras.utils.vis_utils import plot_model
from keras.layers import Flatten, Embedding, Dense
from keras.models import Sequential

# define network
model = Sequential()
model.add(Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
    weights=[embedding_matrix], input_length=longest, trainable=False))
model.add(Flatten())
model.add(Dense(units=3, activation='softmax'))
# compile network
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])
# summarize defined model
model.summary()

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(model).create(prog='dot', format='svg'))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 25, 300)           7122000   
_________________________________________________________________
flatten_5 (Flatten)          (None, 7500)              0         
_________________________________________________________________
dense_5 (Dense)              (None, 3)                 22503     
Total params: 7,144,503
Trainable params: 22,503
Non-trainable params: 7,122,000
_________________________________________________________________


ImportError: Failed to import `pydot`. Please install `pydot`. For example with `pip install pydot`.
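
The parameter counts in the summary can be checked by hand from the shapes reported earlier (23740 embedding rows, 300 dimensions, 25 tokens per tweet, 3 classes). The variable names below are only for illustration.

```python
vocab_rows, embedding_dim = 23740, 300
sequence_length, num_classes = 25, 3

# frozen embedding table: one 300-dimensional vector per row
embedding_params = vocab_rows * embedding_dim
# Flatten turns the (25, 300) output into one vector per tweet
flatten_units = sequence_length * embedding_dim
# dense softmax layer: one weight per input per class, plus one bias per class
dense_params = flatten_units * num_classes + num_classes

print(embedding_params, flatten_units, dense_params, embedding_params + dense_params)
```

Only the dense layer is trainable because the embedding layer is created with trainable=False.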

Time to fit our data to our network

In [30]:
from keras import backend as K
model.fit(K.cast_to_floatx(Xtrain), K.cast_to_floatx(Ytrain_one_hot), batch_size=10, epochs=50, verbose=2)

Epoch 1/50
 - 1s - loss: 0.1170 - acc: 0.7813
Epoch 2/50
 - 1s - loss: 0.1162 - acc: 0.7841
Epoch 3/50
 - 1s - loss: 0.1154 - acc: 0.7856
Epoch 4/50
 - 1s - loss: 0.1146 - acc: 0.7876
Epoch 5/50
 - 1s - loss: 0.1138 - acc: 0.7895
Epoch 6/50
 - 1s - loss: 0.1130 - acc: 0.7923
Epoch 7/50
 - 1s - loss: 0.1123 - acc: 0.7934
Epoch 8/50
 - 1s - loss: 0.1116 - acc: 0.7949
Epoch 9/50
 - 1s - loss: 0.1110 - acc: 0.7967
Epoch 10/50
 - 1s - loss: 0.1103 - acc: 0.7980
Epoch 11/50
 - 1s - loss: 0.1097 - acc: 0.8001
Epoch 12/50
 - 1s - loss: 0.1090 - acc: 0.8010
Epoch 13/50
 - 1s - loss: 0.1084 - acc: 0.8025
Epoch 14/50
 - 1s - loss: 0.1078 - acc: 0.8042
Epoch 15/50
 - 1s - loss: 0.1073 - acc: 0.8049
Epoch 16/50
 - 1s - loss: 0.1067 - acc: 0.8058
Epoch 17/50
 - 1s - loss: 0.1062 - acc: 0.8088
Epoch 18/50
 - 1s - loss: 0.1056 - acc: 0.8112
Epoch 19/50
 - 1s - loss: 0.1051 - acc: 0.8123
Epoch 20/50
 - 1s - loss: 0.1046 - acc: 0.8129
Epoch 21/50
 - 1s - loss: 0.1041 - acc: 0.8151
Epoch 22/50
 - 1s - lo

<keras.callbacks.History at 0x7f638e561748>

Time to evaluate our model

In [31]:
loss, acc = model.evaluate(K.cast_to_floatx(Xtest), K.cast_to_floatx(Ytest_one_hot), verbose=2)
print('Test Accuracy: %f' % (acc*100))

Test Accuracy: 54.383315


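Let's try again with the same architecture, this time with a binary cross-entropy loss and the Adam optimizer (no extra layer is actually added in the code below)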

In [32]:
from keras.utils.vis_utils import plot_model
from keras.layers import Flatten
from keras.layers import Embedding

# define network
model = Sequential()
model.add(Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
    weights=[embedding_matrix], input_length=longest, trainable=False))
model.add(Flatten())
model.add(Dense(units=3, activation='softmax'))
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# summarize defined model
model.summary()

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(model).create(prog='dot', format='svg'))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 25, 300)           7122000   
_________________________________________________________________
flatten_6 (Flatten)          (None, 7500)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 22503     
Total params: 7,144,503
Trainable params: 22,503
Non-trainable params: 7,122,000
_________________________________________________________________


ImportError: Failed to import `pydot`. Please install `pydot`. For example with `pip install pydot`.

In [33]:
from keras import backend as K
model.fit(K.cast_to_floatx(Xtrain), K.cast_to_floatx(Ytrain_one_hot), batch_size=10, epochs=60, verbose=2)

Epoch 1/60
 - 1s - loss: 0.5510 - acc: 0.7090
Epoch 2/60
 - 1s - loss: 0.4410 - acc: 0.7919
Epoch 3/60
 - 1s - loss: 0.3962 - acc: 0.8199
Epoch 4/60
 - 1s - loss: 0.3664 - acc: 0.8387
Epoch 5/60
 - 1s - loss: 0.3463 - acc: 0.8481
Epoch 6/60
 - 1s - loss: 0.3285 - acc: 0.8590
Epoch 7/60
 - 1s - loss: 0.3152 - acc: 0.8681
Epoch 8/60
 - 1s - loss: 0.3028 - acc: 0.8734
Epoch 9/60
 - 1s - loss: 0.2928 - acc: 0.8791
Epoch 10/60
 - 1s - loss: 0.2853 - acc: 0.8839
Epoch 11/60
 - 1s - loss: 0.2763 - acc: 0.8900
Epoch 12/60
 - 1s - loss: 0.2696 - acc: 0.8918
Epoch 13/60
 - 1s - loss: 0.2632 - acc: 0.8960
Epoch 14/60
 - 1s - loss: 0.2573 - acc: 0.8984
Epoch 15/60
 - 1s - loss: 0.2524 - acc: 0.9017
Epoch 16/60
 - 1s - loss: 0.2474 - acc: 0.9025
Epoch 17/60
 - 1s - loss: 0.2427 - acc: 0.9055
Epoch 18/60
 - 1s - loss: 0.2388 - acc: 0.9068
Epoch 19/60
 - 1s - loss: 0.2351 - acc: 0.9105
Epoch 20/60
 - 1s - loss: 0.2310 - acc: 0.9112
Epoch 21/60
 - 1s - loss: 0.2275 - acc: 0.9141
Epoch 22/60
 - 1s - lo

<keras.callbacks.History at 0x7f64a78ca470>

In [34]:
loss, acc = model.evaluate(K.cast_to_floatx(Xtest), K.cast_to_floatx(Ytest_one_hot), verbose=2)
print('Test Accuracy: %f' % (acc*100))

Test Accuracy: 66.436221
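
One caveat when comparing this figure with the 54.4% of the first network: to the best of my knowledge, Keras versions of this era resolve metrics=['accuracy'] to an element-wise binary accuracy when the loss is binary_crossentropy, and to categorical accuracy otherwise. Under that assumption the two numbers measure different things. The sketch below uses made-up predictions and hypothetical helper names to show how the element-wise figure can be higher for the same predictions.

```python
def categorical_accuracy(y_true, y_prob):
    # a sample counts as correct when the arg-max class matches the true class
    hits = 0
    for truth, probs in zip(y_true, y_prob):
        hits += truth.index(max(truth)) == probs.index(max(probs))
    return hits / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    # every output unit is scored on its own against its 0/1 target
    matches = total = 0
    for truth, probs in zip(y_true, y_prob):
        for t, p in zip(truth, probs):
            matches += (p > threshold) == (t == 1)
            total += 1
    return matches / total

y_true = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
y_prob = [[0.1, 0.2, 0.7], [0.45, 0.4, 0.15], [0.4, 0.35, 0.25]]

print(categorical_accuracy(y_true, y_prob))
print(binary_accuracy(y_true, y_prob))
```

Here two of three tweets are classified correctly, yet seven of the nine individual outputs fall on the right side of the threshold.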


## 2. Let's try several classification algorithms

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA

# Create a list, with one item per algorithm. Each item has a name, and a classifier object.

models = []
models.append(('LR',  LogisticRegression(solver='liblinear', multi_class='auto')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('kNN', KNeighborsClassifier()))
models.append(('DT',  DecisionTreeClassifier()))
models.append(('NB',  GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

In [35]:
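# We are going to evaluate all classifiers, and store results in two lists:
results = []
names   = []
# note: Xtrain holds padded word-index sequences, so the models see token ids rather than embedding vectors
# random_state is omitted: it has no effect without shuffle=True, and recent scikit-learn rejects the combination
kfold = KFold(n_splits=10)
for name, model in models:
  cv_results = cross_val_score(model, Xtrain, Ytrain, cv=kfold, scoring='f1_weighted')
  results.append(cv_results)
  names.append(name)
  print("%03s: %f (+/- %f)" % (name, cv_results.mean(), cv_results.std()))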


 LR: 0.387996 (+/- 0.035956)
LDA: 0.387613 (+/- 0.033707)




kNN: 0.436849 (+/- 0.017242)
 DT: 0.448094 (+/- 0.030456)
 NB: 0.235622 (+/- 0.092053)



SVM: 0.267242 (+/- 0.082280)
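
The f1_weighted score reported for each model averages the per-class F1 scores, weighted by how many true samples each class has. A minimal pure-Python sketch on made-up labels follows; `weighted_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # undefined ratios are treated as 0, e.g. for a class that is never predicted
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * support[label]
    return total / len(y_true)

y_true = [0, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 1]  # class 0 is never predicted
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

A class that a model never predicts contributes an F1 of 0, which pulls the weighted average down in proportion to that class's size.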



In [49]:
# Let's create an ensemble with the algorithms that performed best!
from sklearn.ensemble import VotingClassifier

models2 = []
models2.append(('LR', models[0][1]))
models2.append(('kNN', models[2][1]))
models2.append(('DT', models[3][1]))
ensemble = VotingClassifier(models2)
results = cross_val_score(ensemble, Xtrain, Ytrain, cv=kfold, scoring='f1_weighted')
print(results.mean())



0.4406354990166544
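
With its default settings the voting classifier uses hard voting: each model predicts a class and the majority wins. The sketch below shows that rule on made-up predictions. The helper `hard_vote` is hypothetical, and breaking ties toward the smallest class code is my understanding of scikit-learn's documented behaviour rather than something shown in this notebook.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    voted = []
    # zip(*...) regroups the predictions so that each tuple holds one sample's votes
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        best = max(counts.values())
        # among the classes with the most votes, pick the smallest class code
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted

lr_pred  = [0, 1, 2, 2]
knn_pred = [0, 2, 2, 1]
dt_pred  = [1, 2, 2, 0]
print(hard_vote([lr_pred, knn_pred, dt_pred]))
```

The last sample receives three different votes, so the tie rule alone decides its class.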


In [50]:
ensemble.fit(Xtrain,Ytrain)
print(ensemble.score(Xtest,Ytest))



0.4124208447103933


Let's tune the hyperparameters of these three models to try to achieve a better score

In [51]:
from sklearn.model_selection import GridSearchCV 

params = {"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}
gridlogreg = GridSearchCV(models[0][1], params)
gridlogreg.fit(Xtrain, Ytrain)
print(gridlogreg.best_score_)
print(gridlogreg.best_params_)





0.4388315579227696
{'C': 0.1, 'penalty': 'l2'}
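
To make the size of this search concrete, the grid can be expanded by hand: every value of C is paired with every penalty, and each resulting candidate is then cross-validated. A minimal sketch with plain Python lists in place of np.logspace (the variable names are illustrative):

```python
from itertools import product

# same values as np.logspace(-3, 3, 7): 0.001, 0.01, ..., 1000
param_grid = {"C": [10.0 ** e for e in range(-3, 4)], "penalty": ["l1", "l2"]}

names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

print(len(candidates))
print(candidates[0], candidates[-1])
```

The kNN grid in the next cell expands the same way, to 4 x 3 x 2 x 1 = 24 candidates.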




In [52]:
params = {'n_neighbors': [4, 5, 6, 7],
          'leaf_size': [1, 3, 5],
          'algorithm': ['auto', 'kd_tree'],
          'n_jobs': [-1]}
gridknn = GridSearchCV(models[2][1], params)
gridknn.fit(Xtrain, Ytrain)
print(gridknn.best_score_)
print(gridknn.best_params_)



0.4509820239680426
{'algorithm': 'auto', 'leaf_size': 3, 'n_jobs': -1, 'n_neighbors': 7}


In [53]:
params = {'max_depth': np.arange(3, 10)}

griddt = GridSearchCV(models[3][1], params)
griddt.fit(Xtrain, Ytrain)
print(griddt.best_score_)
print(griddt.best_params_)



0.47594873501997337
{'max_depth': 4}


Let's run our Voting classifier again, this time with the optimal parameters

In [54]:
models3 = []
models3.append(('LR',  gridlogreg.best_estimator_))
models3.append(('kNN', gridknn.best_estimator_))
models3.append(('DT',  griddt.best_estimator_))
ensemble2 = VotingClassifier(models3)
results = cross_val_score(ensemble2, Xtrain, Ytrain, cv=kfold, scoring='f1_weighted')
print(results.mean())



0.4405529238653859


In [55]:
ensemble2.fit(Xtrain,Ytrain)
print(ensemble2.score(Xtest,Ytest))



0.4484982280834253


Test accuracy improves from 0.412 to 0.448, although the cross-validated weighted F1 stays at about 0.44.

# Question B

For this question I will use a personal amateur project that did not use pretrained embeddings.
Instead, the tweets were collected via the Twitter API and tweet vectorization was done with sklearn's tfidf.
As a classifier, only logistic regression was used.
I use this implementation to answer this question purely out of curiosity.

For simplicity I will copy and paste the necessary code here, skipping the collection part. The script will be available though.

In [63]:
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
# will hold text
data = []
# will hold label
target = []
# Read the data from tweets1.csv (label in column 0, tweet text in column 4)
data_column = 4
with open("./data/tweets1.csv", "r", encoding="utf8") as csv_file:
    reader = csv.reader(csv_file, delimiter=',', quotechar='"')
    for row in reader:
        data.append(row[data_column])
        target.append(int(row[0]))

# instantiate the TFIDF vectorizer
extr = TfidfVectorizer(min_df=0, max_features=None, strip_accents='unicode', lowercase=True,
                       analyzer='word', token_pattern=r'\w{3,}', ngram_range=(1, 1),
                       use_idf=True, smooth_idf=True, sublinear_tf=True, stop_words="english")
# Vectorize our data
transformed_data = extr.fit_transform(data)

# Fit a Logistic regression model
model = LogisticRegression(C=1.)
model.fit(transformed_data, target)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
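
For intuition about the features the logistic regression is fitted on, here is a minimal pure-Python sketch of TF-IDF weighting with the smooth_idf and sublinear_tf options used above, followed by L2 normalisation. The formulas are the ones I understand scikit-learn to document for these options; the helper `tfidf` and the toy documents are made up, and tokenisation, stop words and accent stripping are left out.

```python
import math
from collections import Counter

def tfidf(docs):
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # sublinear tf: 1 + ln(count) instead of the raw count
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # scale each document vector to unit length (guarding against empty documents)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = [["good", "day"], ["bad", "day"], ["good", "good", "game"]]
vectors = tfidf(docs)
for vector in vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In the second toy document the rarer word "bad" receives a larger weight than "day", which appears in two of the three documents.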

In [66]:
from sklearn.metrics import f1_score
new_transformed_data = extr.transform(df_test.vector_tokens)
Y_predicted = model.predict(new_transformed_data)
f1score = f1_score(Ytest, Y_predicted, average="weighted")
print("F1_score is: ", f1score)

F1_score is:  0.4684877651469472


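With a weighted F1 of 0.468, this approach scores slightly higher than the voting ensemble (0.448 test accuracy, about 0.44 cross-validated weighted F1).
Probably the fact that the TF-IDF features were fitted on actual Twitter data, in contrast to word2vec, made a difference.
Of course the result is still well below the results of the neural network (54.4% and 66.4% test accuracy), though those are accuracies rather than F1 scores, so the comparison is only indicative.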