# Exploration Notebook v3 - Jacopo
Hi, it's me again

### Goal
Pass the 0.9 threshold for both accuracy and F1 score

### Plan
- Start to use external datasets, embeddings etc
- Initially keep the same vectorization found in v2, and start by increasing the complexity of the model
- We can tweak both together by using pipelines, grid and randomized search with (stratified) k-fold cross-validation as introduced in last notebook
- Keep track of the best model and its parameters, start saving them as pickle files as introduced in v2
- Start thinking about generalizing the problem to sentiment analysis instead of just detecting a smiley face, which we don't even have access to.
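
A minimal sketch of what the plan above could look like in code: one pipeline, a grid search over vectorizer and classifier parameters together, stratified k-fold, and a pickle round-trip of the winner. The toy tweets, the grid values and the step names `vec` / `clf` are placeholders made up for illustration, not the settings used in this notebook.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the tweets (hypothetical data, only to make the sketch runnable)
tweets = ["i love this", "so happy today", "great day", "best day ever",
          "i hate this", "so sad today", "awful day", "worst day ever"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Vectorizer and classifier parameters are searched together
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

# Shuffling is configured on the splitter, not on the search itself
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(tweets, labels)
print(search.best_params_, search.best_score_)

# The fitted pipeline (vectorizer included) round-trips through pickle
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)
print(restored.predict(["so happy"]))
```

Swapping `GridSearchCV` for `RandomizedSearchCV` (with `param_distributions` and `n_iter`) keeps the rest of the sketch unchanged.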

### Next
With v3 we plan on passing the 0.9 threshold. After that, we can aggregate the best model and its best parameters and tweak them to get the best possible result in a cleaner way, in preparation for a final submission: modularize, improve helpers, etc. Keeping this notebook self-contained turned out to be better; the Helper file just created more friction.

## Helper functions

In [11]:
def build_feature_matrix(df, vocab, embeddings, mode='avg'):
    X = np.zeros((df.shape[0], embeddings.shape[1]))
    for i, tweet in enumerate(df['tweet']):
        words = tweet.split()
        for word in words:
            if word in vocab:
                X[i] += embeddings[vocab[word]]
        if mode == 'avg':
            # Guard against empty tweets to avoid dividing by zero
            X[i] /= max(len(words), 1)
        elif mode == 'sum':
            pass
        else:
            raise ValueError('Unknown mode: {}'.format(mode))
    return X
    
def load_train_data(path_pos='data/twitter-datasets/train_pos_full.txt', path_neg='data/twitter-datasets/train_neg_full.txt'):
    # Load data, txt as csv
    #data_path = 'data/twitter-datasets/'
    df_train_pos = pd.read_csv(path_pos, sep = '\t', names = ['tweet'])
    df_train_pos['label'] = 1
    df_train_neg = pd.read_csv(path_neg, sep = '\t', names = ['tweet'], on_bad_lines='skip')
    df_train_neg['label'] = 0
    df_train = pd.concat([df_train_pos, df_train_neg])
    print('Train set: ', df_train.shape)
    print('Train set positives: ', df_train_pos.shape)
    print('Train set negatives: ', df_train_neg.shape)
    return df_train   

def load_test_data():
    # Load test data: id, tweet for each row
    data_path = 'data/twitter-datasets/'
    df_test = pd.read_csv(data_path + 'test_data.txt', header=None, names=['line'], sep='\t')
    # Extract id and tweet, limit split by 1 so we don't split the tweet (this is v0, at least we keep it intact)
    df_test['id'] = df_test['line'].apply(lambda x: x.split(',',1)[0]) 
    df_test['tweet'] = df_test['line'].apply(lambda x: x.split(',',1)[1])
    df_test = df_test.drop('line', axis=1)
    return df_test

def predict_test_data(X_test, classifier, filename='submission.csv'):
    # Predict test data and save to csv
    # NOTE: relies on a global `df_test` (see load_test_data) already defined in the notebook
    y_pred = classifier.predict(X_test)
    df_test['Prediction'] = y_pred
    df_test.rename(columns={'id': 'Id'}, inplace=True)
    df_test['Prediction'] = df_test['Prediction'].apply(lambda x: -1 if x == 0 else x)
    df_test.to_csv(filename, columns=['Id', 'Prediction'], index=False)
    return df_test
    
def predict_test_data_pipeline(df_test, pipe, filename='submission.csv'):
    # Predict test data and save to csv
    y_pred = pipe.predict(df_test['tweet'])
    df_test['Prediction'] = y_pred
    df_test.rename(columns={'id': 'Id'}, inplace=True)
    df_test['Prediction'] = df_test['Prediction'].apply(lambda x: -1 if x == 0 else x)
    df_test.to_csv(filename, columns=['Id', 'Prediction'], index=False)
    return df_test

def train_test(clf, X_train, y_train, X_eval=None, y_eval=None, cv=None):
    from sklearn.metrics import accuracy_score, f1_score
    if X_eval is None:
        from sklearn.model_selection import train_test_split
        X_train, X_eval, y_train, y_eval = train_test_split(X_train, y_train, test_size=0.2)
    if cv is not None:
        from sklearn.model_selection import cross_val_score
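        # cross_val_score has no `shuffle` argument; to shuffle, pass a splitter
        # such as StratifiedKFold(shuffle=True) as `cv`
        scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)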
        print('Cross validation Accuracy Scores: ', scores)
        print('Cross validation mean score: ', scores.mean())
        print('Cross validation std score: ', scores.std())
        clf.fit(X_train, y_train)
        return clf
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_eval)
    print('Accuracy: ', accuracy_score(y_eval, y_pred))
    print('F1 score: ', f1_score(y_eval, y_pred))
    return clf

## Notebook

In [3]:
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve, auc, f1_score
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [4]:
# Load data; let's work on partial data for now, we'll come back to this later
df_train = load_train_data(path_pos='data/twitter-datasets/train_pos.txt', path_neg='data/twitter-datasets/train_neg.txt')
# Load vectorization and classifier obtained from the previous notebook as a reference
with open('data/out/trained/tfidf_vectorizer-linSVC-pipeline-v2_4.pickle', 'rb') as f:
    pipe = pickle.load(f)

# we know that this achieves acc: 0.848, f1: 0.850 and runs in less than 3 min on full data from scratch
# let's train a linear SVC on the partial data and see how it performs so we can compare and iterate
svm = LinearSVC()
vec_pipe = pipe.steps[0][1]
svm_pipe = pipe.steps[1][1]
print(vec_pipe, svm_pipe)

Train set:  (196970, 2)
Train set positives:  (97902, 2)
Train set negatives:  (99068, 2)
TfidfVectorizer(binary=True, min_df=3, ngram_range=(1, 4)) LinearSVC()


In [5]:
# # Simple check to compare the fresh one fitted on partial data with the one fitted on full data
# # note that the full one will overfit because we are using data it was trained on to test it
# # but we can still compare the results as a rule of thumb (also because we have the real test data results)
# X_train, X_eval, y_train, y_eval = train_test_split(df_train['tweet'], df_train['label'], test_size=0.2, random_state=42)
# X_train = vec.fit_transform(X_train)
# X_eval = vec.transform(X_eval)
# svm.fit(X_train, y_train)
# y_pred = svm.predict(X_eval)
# print('Accuracy fresh svm: ', accuracy_score(y_eval, y_pred))
# print('F1 score fresh svm: ', f1_score(y_eval, y_pred))
# y_pred = svm_pipe.predict(X_eval)
# print('Accuracy: ', accuracy_score(y_eval, y_pred))
# print('F1 score: ', f1_score(y_eval, y_pred))

In [6]:
# I think a misconception I had was that for a second I thought the vectorizer was 'saved' as well
# but it's not, it's just a step in the pipeline, so we need to re-fit it on the partial data
# the fitting of the vectorizer is not the expensive part, the actual transformation is
# let's thus re-fit the vectorizer on the partial data and see how it performs

In [7]:
# check svm_pipe performance on partial data, don't even modify the vectorizer
X_train, X_eval, y_train, y_eval = train_test_split(df_train['tweet'], df_train['label'], test_size=0.2)
X_train = vec_pipe.transform(X_train)
X_eval = vec_pipe.transform(X_eval)
y_pred = svm_pipe.predict(X_eval)
print('Accuracy: ', accuracy_score(y_eval, y_pred))


Accuracy:  0.9642331319490278


In [8]:
# check svm performance on partial data
svm.fit(X_train, y_train)
y_pred = svm.predict(X_eval)
print('Accuracy: ', accuracy_score(y_eval, y_pred))
print('F1 score: ', f1_score(y_eval, y_pred))


Accuracy:  0.8333248718078895
F1 score:  0.8358335833583358


We can see the partial data is indeed a good representation of the full data, and as of now
there won't be a huge difference. To be even fairer in the comparison we could make an actual
submission with just the partial data, but for now it's fine. Let's now work on improving the classifier
while keeping the same vectorizer, which we know works well, as a baseline.
We'll then come back to it and play with embeddings and other vectorizers.
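
The 0.96 versus 0.83 accuracy in the two cells above is what evaluating on already-seen data looks like: the pickled pipeline was fitted on the full set, which contains the partial tweets. A small sketch of the same effect on synthetic data with random labels, where there is nothing to learn (the data, the sizes and the choice of a decision tree are assumptions made only for the illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)  # random labels: no real signal

X_tr, X_ev, y_tr, y_ev = train_test_split(X, y, test_size=0.2, random_state=0)

# "Leaky" model: fitted on everything, including the rows used for evaluation
leaky = DecisionTreeClassifier(random_state=0).fit(X, y)
# "Honest" model: never sees the evaluation rows
honest = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

leaky_acc = accuracy_score(y_ev, leaky.predict(X_ev))
honest_acc = accuracy_score(y_ev, honest.predict(X_ev))
print("leaky:", leaky_acc, "honest:", honest_acc)
```

The leaky score only measures memorization, so the freshly fitted classifier's number is the one to compare against later models.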

In [12]:
# # A few classifiers to have an initial idea and direction
# # I still like the random forest, let's try
# clf = RandomForestClassifier(
#     max_depth=5
#     )
# train_test(clf, X_train, y_train, X_eval, y_eval)
# # 34s
# # Accuracy:  0.5864852515611515
# # F1 score:  0.36840880893300243

Accuracy:  0.5864852515611515
F1 score:  0.36840880893300243


In [13]:
# clf = RandomForestClassifier(
#     max_depth=5,
#     n_estimators=100,
#     criterion='entropy',
#     min_samples_leaf=1,
#     min_samples_split=5,
# )
# clf = train_test(clf, X_train, y_train, X_eval, y_eval)
# # 34s
# # Accuracy:  0.582779103416764
# # F1 score:  0.3353283726949207

Accuracy:  0.582779103416764
F1 score:  0.3353283726949207


In [15]:
clf = RandomForestClassifier(
    max_depth=25,
    n_estimators=3,
    criterion='gini',
    min_samples_leaf=1,
    min_samples_split=2,
)
clf = train_test(clf, X_train, y_train, X_eval, y_eval)

Accuracy:  0.5370614814438747
F1 score:  0.679799841980511


In [16]:
# clf = RandomForestClassifier(
#     max_depth=None,
#     n_estimators=3,
#     criterion='gini',
#     min_samples_leaf=1,
#     min_samples_split=2,
#     n_jobs=-1,
# )
# clf = train_test(clf, X_train, y_train, X_eval, y_eval)
# # 7min 
# # Accuracy:  0.7342742549626847
# # F1 score:  0.7434062162957153

Accuracy:  0.7342742549626847
F1 score:  0.7434062162957153


In [19]:
#7min 9s without the n_jobs=-1
# Accuracy:  0.7342742549626847
# F1 score:  0.7434062162957153
# Improving but not much if we compare to the speed and accuracy of the linear SVC

In [17]:
svm = LinearSVC(verbose=1)
svm = train_test(svm, X_train, y_train, X_eval, y_eval)

[LibLinear]....*
optimization finished, #iter = 49
Objective value = -33163.060320
nSV = 123234
Accuracy:  0.8333248718078895
F1 score:  0.8358335833583358


In [20]:
lr = LogisticRegression(max_iter=4000, verbose=1, n_jobs=-1)
lr = train_test(lr, X_train, y_train, X_eval, y_eval)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 10 concurrent workers.


RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =      3018686     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.09223D+05    |proj g|=  5.71000D+02


 This problem is unconstrained.



           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
*****     38     47      1     0     0   1.324D-01   6.325D+04
  F =   63254.813249193365     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
Accuracy:  0.8233741178859725
F1 score:  0.8258933039735762


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   11.4s finished


Not bad, linear svm is still king but we can now start to add layers of complexity
I'm quite sad about random forests, I find them so cool :(
    

In [1]:
# ADA BOOST CLASSIFIER 
# Using lin SVM as base estimator - use svm linear and probability True
# ada = AdaBoostClassifier( 
#     base_estimator= SVC(kernel='linear', probability=True, verbose=1),
#     n_estimators=50,
#     learning_rate=1,
# )
# ada = train_test(ada, X_train, y_train, X_eval, y_eval)
# 145min into the computation using only one core............ I'll stop it here
# I'm testing in another notebook in the meantime
# [LibSVM]..............................................................
# Warning: using -h 0 may be faster
# *
# optimization finished, #iter = 62644
# obj = -0.794988, rho = 0.999154
# nSV = 125286, nBSV = 125286
# Total nSV = 125286
# ............................................

The above cell has been running for the last 65 min and counting, on a single poor CPU at 100%, and I have the impression it hasn't even finished the first base estimator...

Here's a few things I found:
- I really consider speed a measure of quality, so when a cell runs for more than 5 min on 1/10 of the data I should get suspicious. In this spirit, we can see how using SVC instead of LinearSVC does indeed expose the parameter 'probability', but runs dramatically slower.
- Note how an SVM is not based on probability, but on distance from the hyperplane. LinearSVC uses an optimized linear solver (liblinear) and we can see how fast it is. SVC(kernel='linear') goes through libsvm instead, whose training time grows at least quadratically with the number of samples. On top of that, probability=True adds an internal cross-validation, and the resulting 'probability' is only a proxy obtained by 'Platt calibration'.
- I need to re-evaluate from scratch the use of Decision Trees as a basis for either Random Forest (ensemble) or Boosting (sequential improvements on each subsequent tree). From the previous tests we can see that deeper trees with more leaves do better, without too much worry about canceling out 'noisy' data.
- We should definitely visualize the model, and see for example whether the SVC is able to perfectly separate the train data or not. This would give insight into how much we are overfitting and how much we can improve the model by adding regularization or other techniques. I didn't play with the C parameter, so as of now we are only using the default regularization strength (C=1.0) rather than a tuned one.
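
If probabilities are ever needed, one possible alternative to `SVC(probability=True)` is to keep the fast `LinearSVC` and wrap it in `CalibratedClassifierCV` with `method="sigmoid"`, which is Platt scaling applied from the outside. This is only a sketch on synthetic data (`make_classification` and `cv=3` are assumptions), not something tested on the tweets:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Plain LinearSVC: unbounded signed scores, proportional to the distance from the hyperplane
svm = LinearSVC().fit(X, y)
margins = svm.decision_function(X[:3])

# Platt scaling (method="sigmoid") fitted on top of a fast linear SVM
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X[:3])  # one row per sample, columns are the two classes

print(margins)
print(proba)
```

The calibration costs a few extra `LinearSVC` fits, which should still be far cheaper than a single libsvm fit on this amount of data.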

Trees:
- Diving deeper into these, I think they should work much better if we create a lower dimensional space. As of now the vectorization with 1-4 grams is quite humongous and allows a linearSVC to quickly divide the space but it just overcomplicates a Decision Tree. Using embeddings and PCA could probably help, in an effort to increase the information density of the data, giving more 'context' to each word, not just by n-grams.
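
A rough sketch of the lower-dimensional idea: TF-IDF, then a dense projection, then trees. `TruncatedSVD` is swapped in for PCA here because it works directly on sparse matrices. The toy tweets, `n_components=5` and the forest settings are placeholders for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy stand-in for the tweets (hypothetical data)
tweets = ["i love this", "so happy today", "great day", "best day ever",
          "i hate this", "so sad today", "awful day", "worst day ever"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

tree_pipe = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 2))),
    ("svd", TruncatedSVD(n_components=5, random_state=0)),  # sparse -> small dense matrix
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
tree_pipe.fit(tweets, labels)

# Inspect what the trees actually receive: everything except the last step
reduced = tree_pipe[:-1].transform(tweets)
print(reduced.shape)
```

On the real data the number of components would need tuning, for example through the same grid search used for the other pipelines.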

Vectorization:
- We should start playing with embeddings to give more context or reduce the token size.
- We could try to visualize some vectorizations of tweets to have a better understanding of what we are feeding our models with. How are we handling the various 'lolololol', 'whyyyy' and such?
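
One cheap way to handle the 'whyyyy' family before vectorizing is to squeeze long character runs. The helper below is hypothetical (`squeeze_repeats` is not part of this notebook), and it deliberately does nothing for alternating patterns like 'lolololol', which would need their own rule.

```python
import re


def squeeze_repeats(text, max_run=2):
    """Collapse runs of the same character longer than max_run down to max_run."""
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


for raw in ["whyyyy", "sooooo goood", "lolololol"]:
    print(raw, "->", squeeze_repeats(raw))
# whyyyy -> whyy
# sooooo goood -> soo good
# lolololol -> lolololol
```

A function like this could be passed as the `preprocessor` of a `TfidfVectorizer`, keeping in mind that a custom preprocessor replaces the default lowercasing step.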

Labels:
- It shouldn't change much theoretically, but maybe the libraries would be able to optimize better if we use True and False instead of 1 and 0. We can try that. Also, in the same vein, maybe -1 and 1 instead of 0 and 1. It could potentially speed things up slightly.

110min in...
Essentially we are throwing every single tweet, composed of at most 140 characters, into a vector space which is absolutely huge.
This ensures that LinearSVC is able to find a hyperplane, but at a huge cost in the case above.
What if we bounded the maximum amount of dimensions that the classifier can use?
We could bound it by the average number of non-zero entries in the vectorized tweets.
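
A small sketch of how the two quantities above could be measured and bounded: the average number of non-zero entries per tweet, and a vocabulary capped with `max_features`. The toy tweets and the cap of 10 are arbitrary placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the tweets (hypothetical data)
tweets = ["i love this", "so happy today", "great day", "best day ever",
          "i hate this", "so sad today", "awful day", "worst day ever"]

vec = TfidfVectorizer(ngram_range=(1, 4), binary=True)
X = vec.fit_transform(tweets)

n_features = X.shape[1]
# For a CSR matrix, consecutive differences of indptr give the stored entries per row
avg_nnz = np.diff(X.indptr).mean()
print("features:", n_features, "avg non-zeros per tweet:", avg_nnz)

# Cap the vocabulary: only the most frequent n-grams across the corpus are kept
capped = TfidfVectorizer(ngram_range=(1, 4), binary=True, max_features=10)
X_capped = capped.fit_transform(tweets)
print(X_capped.shape)
```

`max_features` ranks n-grams by corpus frequency only; a label-aware selection such as `SelectKBest` with `chi2` would be another option to try.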

Also, we could then play around with C values.
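
A sketch of a C sweep that prints train and evaluation accuracy side by side, which also answers whether the classifier separates the training data perfectly. The synthetic data and the list of C values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)
X_tr, X_ev, y_tr, y_ev = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

results = {}
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    # Smaller C means stronger regularization
    clf = LinearSVC(C=C, max_iter=10000).fit(X_tr, y_tr)
    results[C] = (clf.score(X_tr, y_tr), clf.score(X_ev, y_ev))
    print(f"C={C:<6} train={results[C][0]:.3f} eval={results[C][1]:.3f}")
```

A train score near 1.0 with a much lower evaluation score would point to overfitting, and lowering C would be the first thing to try.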

In [None]:
# It finished first base estimator in 80+min...

# [LibSVM]..............................................................
# Warning: using -h 0 may be faster
# *
# optimization finished, #iter = 62644
# obj = -0.794988, rho = 0.999154
# nSV = 125286, nBSV = 125286
# Total nSV = 125286

In [None]:
# ADA BOOST CLASSIFIER
# Using decision tree as base estimator - I won't give up on you, beautiful trees
adatree = AdaBoostClassifier(
    base_estimator= DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1,
)