# Picking Out the Hyperparameters in Bag of Words

When documents are vectorised with the Bag of Words method, several settings control how the words found in each document are interpreted.

## CountVectorizer & TF-IDF
The options considered here are:

- Stop Words
- N-gram Range
- Min-DF

(Question for Self: Why have I short-listed these?)
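To make the three options concrete, here is a minimal sketch on a made-up toy corpus (the sentences and the `vocabulary` helper are assumptions for illustration, not part of the HatEval data). It shows how each option changes the vocabulary that `CountVectorizer` keeps.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]


def vocabulary(**options):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    vect = CountVectorizer(**options)
    vect.fit(docs)
    return sorted(vect.vocabulary_)


baseline = vocabulary()                                    # default settings
no_stop_words = vocabulary(stop_words="english")           # drops words such as "the"
with_bigrams = vocabulary(stop_words="english", ngram_range=(1, 2))  # adds word pairs
frequent_only = vocabulary(min_df=2)                       # int: keep terms in >= 2 documents
half_or_more = vocabulary(min_df=0.5)                      # float: keep terms in >= 50% of documents

for name, vocab in [
    ("baseline", baseline),
    ("stop_words='english'", no_stop_words),
    ("ngram_range=(1, 2)", with_bigrams),
    ("min_df=2", frequent_only),
    ("min_df=0.5", half_or_more),
]:
    print("%-22s %s" % (name, vocab))
```

Note that an integer `min_df` is an absolute document count while a float is a proportion of the corpus, which is why the grids below can mix values such as `2` and `0.1`.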

## Metrics

- Using a Naive Bayes classifier
- Calculating the F1 score of each set
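As a reminder of what the F1 score measures, the following sketch computes it on a small made-up set of labels (the `y_true` and `y_pred` values are assumptions for illustration). F1 is the harmonic mean of precision and recall for the positive class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)

# The same value from the definition: harmonic mean of precision and recall.
manual_f1 = 2 * precision * recall / (precision + recall)

print("precision=%0.3f recall=%0.3f f1=%0.3f" % (precision, recall, f1))
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all come out at 0.75.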

In [41]:
import pandas as pd
import numpy as np
import logging
from pprint import pprint
from time import time
from sklearn import metrics
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import classification_report

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [42]:
trainPath = '../data/hateval2019_en_train_clean.csv'
testPath = '../data/hateval2019_en_test_clean.csv'

trainSet = pd.read_csv(trainPath)
testSet = pd.read_csv(testPath)

In [43]:
trainText = trainSet.text
trainHate = trainSet.HS
trainTarget = trainSet.TR
trainAggressive = trainSet.AG

In [44]:
def pipeSetUp(clf):
    pipe = Pipeline(steps=[('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', clf)])
    return pipe
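The pipeline places a `TfidfTransformer` after the `CountVectorizer`, and the grids below toggle its `use_idf` flag. The sketch below, on a made-up toy corpus, is one way to see what that flag changes: with `use_idf=False` the counts are only length-normalised, while with `use_idf=True` terms that occur in fewer documents are weighted up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, made up for illustration: "dog" appears in every document, "cat" in one.
docs = ["cat cat dog", "dog bird", "dog dog dog"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)
cat = vect.vocabulary_["cat"]
dog = vect.vocabulary_["dog"]

# use_idf=False: raw counts, L2-normalised per document (the default norm).
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()
# use_idf=True: counts reweighted by inverse document frequency, then normalised.
tf_idf = TfidfTransformer(use_idf=True).fit_transform(counts).toarray()

# Ratio of the "cat" weight to the "dog" weight in the first document.
ratio_tf = tf_only[0, cat] / tf_only[0, dog]
ratio_tfidf = tf_idf[0, cat] / tf_idf[0, dog]
print("cat/dog weight without idf: %0.3f" % ratio_tf)
print("cat/dog weight with idf:    %0.3f" % ratio_tfidf)
```

The rarer word gains weight relative to the common one once IDF is switched on, which may or may not help a given classifier; that is why it is left to the grid search.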

In [45]:
def runPipe(training_text, training_score, parameters, pipe):
    # Grid-search `pipe` over `parameters`, scoring each candidate by
    # cross-validated F1, and return the best mean F1 score.
    grid_pipeline = GridSearchCV(pipe, parameters, n_jobs=4, verbose=1, scoring='f1')

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipe.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_pipeline.fit(training_text, training_score)
    print("done in %0.3fs" % (time() - t0))
    print("scoring parameter: f1")

    print("Best score: %0.3f" % grid_pipeline.best_score_)
    F1 = grid_pipeline.best_score_
    print("Best parameters set:")
    best_parameters = grid_pipeline.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
    return F1
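The grid keys used in the cells that follow, such as `vect__min_df` or `clf__alpha`, rely on scikit-learn's `<step name>__<parameter>` convention for addressing a parameter of one pipeline step. A minimal sketch of that convention, using the same step names as `pipeSetUp`:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same three step names as pipeSetUp: 'vect', 'tfidf', 'clf'.
pipe = Pipeline(steps=[
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# '<step>__<param>' sets a parameter on the named step.
pipe.set_params(vect__min_df=2, tfidf__use_idf=False)
print(pipe.named_steps["vect"].min_df)     # value now held by the CountVectorizer step
print(pipe.named_steps["tfidf"].use_idf)   # value now held by the TfidfTransformer step

# Every key that a grid may contain is listed by get_params().
vect_keys = sorted(k for k in pipe.get_params() if k.startswith("vect__"))
print(vect_keys)
```

`GridSearchCV` applies each candidate in the same way, so a key that does not match a real step and parameter (for example `clf__penalty` on a Naive Bayes classifier) is rejected.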

In [6]:
pipe = pipeSetUp(MultinomialNB())

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0, 0.9),
    'vect__stop_words': ('english',),
    'vect__min_df': (2, 0.1, 3, 0.2, 4),
    'vect__ngram_range': ((1, 1), (1, 2),),  
    'tfidf__use_idf': (True, False),
#     'tfidf__norm': ('l1','l2'),
#     'clf__max_iter': (100000,),
#    'clf__penalty': ('l1','l2', 'elasticnet'),
#    'clf__alpha': (0.0001,0.00001,0.0002,0.00002),
}

print('Getting Hate Score...')
hate_F1 = runPipe(trainText, trainHate, parameters, pipe)
print('Getting Target Score...')
target_F1 = runPipe(trainText, trainTarget, parameters, pipe)
print('Getting Aggressive Score...')
aggressive_F1 = runPipe(trainText, trainAggressive, parameters, pipe)

overall_F1 = (hate_F1 + target_F1 + aggressive_F1)/3

print("Overall F1 Score : %0.3f" % (overall_F1))

Getting Hate Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    5.5s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   18.4s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   36.4s finished


done in 36.685s
scoring paramater: f1
Best score: 0.654
Best parameters set:
	tfidf__use_idf: False
	vect__max_df: 0.5
	vect__min_df: 4
	vect__ngram_range: (1, 1)
	vect__stop_words: 'english'
Getting Target Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.5s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   17.4s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   36.8s finished


done in 37.253s
scoring paramater: f1
Best score: 0.312
Best parameters set:
	tfidf__use_idf: True
	vect__max_df: 0.75
	vect__min_df: 4
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'
Getting Aggressive Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.6s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   17.1s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   35.2s finished


done in 35.660s
scoring paramater: f1
Best score: 0.236
Best parameters set:
	tfidf__use_idf: True
	vect__max_df: 0.5
	vect__min_df: 4
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'
Overall F1 Score : 0.400
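
The "80 candidates, totalling 400 fits" reported in the log above is simply the product of the option counts in the grid, multiplied by the 5 cross-validation folds. A short sketch that checks this with `ParameterGrid`, using the same grid as the MultinomialNB cell:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the MultinomialNB cell above.
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0, 0.9),
    'vect__stop_words': ('english',),
    'vect__min_df': (2, 0.1, 3, 0.2, 4),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
}

n_candidates = len(ParameterGrid(parameters))  # 4 * 1 * 5 * 2 * 2
n_folds = 5                                    # fold count reported in the log
n_fits = n_candidates * n_folds

print("candidates:", n_candidates)
print("fits:", n_fits)
```

Every extra option multiplies the total, which is why the larger grids further down take minutes or hours rather than seconds.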


In [9]:
pipe = pipeSetUp(BernoulliNB())

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0, 0.9),
    'vect__stop_words': ('english',),
    'vect__min_df': (2, 0.1, 3, 0.2, 4),
    'vect__ngram_range': ((1, 1), (1, 2),),  
    'tfidf__use_idf': (True, False),
#     'tfidf__norm': ('l1','l2'),
#     'clf__max_iter': (100000,),
#    'clf__penalty': ('l1','l2', 'elasticnet'),
#    'clf__alpha': (0.0001,0.00001,0.0002,0.00002),
}

print('Getting Hate Score...')
hate_F1 = runPipe(trainText, trainHate, parameters, pipe)
print('Getting Target Score...')
target_F1 = runPipe(trainText, trainTarget, parameters, pipe)
print('Getting Aggressive Score...')
aggressive_F1 = runPipe(trainText, trainAggressive, parameters, pipe)

overall_F1 = (hate_F1 + target_F1 + aggressive_F1)/3

print("Overall F1 Score : %0.3f" % (overall_F1))

Getting Hate Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.6s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   16.5s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   34.1s finished


done in 34.390s
scoring paramater: f1
Best score: 0.680
Best parameters set:
	tfidf__use_idf: True
	vect__max_df: 0.75
	vect__min_df: 3
	vect__ngram_range: (1, 1)
	vect__stop_words: 'english'
Getting Target Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.7s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   16.6s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   35.0s finished


done in 35.278s
scoring paramater: f1
Best score: 0.558
Best parameters set:
	tfidf__use_idf: True
	vect__max_df: 0.5
	vect__min_df: 4
	vect__ngram_range: (1, 1)
	vect__stop_words: 'english'
Getting Aggressive Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.6s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   16.5s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   38.7s finished


done in 39.182s
scoring paramater: f1
Best score: 0.386
Best parameters set:
	tfidf__use_idf: True
	vect__max_df: 0.75
	vect__min_df: 4
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'
Overall F1 Score : 0.541


In [13]:
pipe = pipeSetUp(SGDClassifier())

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0, 0.9),
    'vect__stop_words': ('english',),
    'vect__min_df': (2, 0.1, 3, 0.2, 4),
    'vect__ngram_range': ((1, 1), (1, 2),),  
    'tfidf__use_idf': (True, False),
#     'tfidf__norm': ('l1','l2'),
#     'clf__max_iter': (100000,),
    'clf__penalty': ('l1','l2', 'elasticnet'),
    'clf__alpha': (0.0001,0.00001,0.0002,0.00002),
}

print('Getting Hate Score...')
hate_F1 = runPipe(trainText, trainHate, parameters, pipe)
print('Getting Target Score...')
target_F1 = runPipe(trainText, trainTarget, parameters, pipe)
print('Getting Aggressive Score...')
aggressive_F1 = runPipe(trainText, trainAggressive, parameters, pipe)

overall_F1 = (hate_F1 + target_F1 + aggressive_F1)/3

print("Overall F1 Score : %0.3f" % (overall_F1))

Getting Hate Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'clf__alpha': (0.0001, 1e-05, 0.0002, 2e-05),
 'clf__penalty': ('l1', 'l2', 'elasticnet'),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 960 candidates, totalling 4800 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    4.7s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   20.7s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   46.9s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:  2.2min
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:  3.2min
[Parallel(n_jobs=4)]: Done 2442 tasks      | elapsed:  4.5min
[Parallel(n_jobs=4)]: Done 3192 tasks      | elapsed:  5.7min
[Parallel(n_jobs=4)]: Done 4042 tasks      | elapsed:  7.2min
[Parallel(n_jobs=4)]: Done 4800 out of 4800 | elapsed:  8.6min finished


done in 516.359s
scoring paramater: f1
Best score: 0.688
Best parameters set:
	clf__alpha: 0.0001
	clf__penalty: 'elasticnet'
	tfidf__use_idf: False
	vect__max_df: 1.0
	vect__min_df: 3
	vect__ngram_range: (1, 1)
	vect__stop_words: 'english'
Getting Target Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'clf__alpha': (0.0001, 1e-05, 0.0002, 2e-05),
 'clf__penalty': ('l1', 'l2', 'elasticnet'),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 960 candidates, totalling 4800 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    4.3s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   19.6s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   45.1s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:  2.1min
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:  3.1min
[Parallel(n_jobs=4)]: Done 2442 tasks      | elapsed:  4.2min
[Parallel(n_jobs=4)]: Done 3192 tasks      | elapsed:  5.4min
[Parallel(n_jobs=4)]: Done 4042 tasks      | elapsed:  6.9min
[Parallel(n_jobs=4)]: Done 4800 out of 4800 | elapsed:  8.2min finished


done in 490.511s
scoring paramater: f1
Best score: 0.533
Best parameters set:
	clf__alpha: 2e-05
	clf__penalty: 'elasticnet'
	tfidf__use_idf: False
	vect__max_df: 0.9
	vect__min_df: 2
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'
Getting Aggressive Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'clf__alpha': (0.0001, 1e-05, 0.0002, 2e-05),
 'clf__penalty': ('l1', 'l2', 'elasticnet'),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 960 candidates, totalling 4800 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    4.1s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   19.0s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   43.6s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:  2.0min
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:  3.0min
[Parallel(n_jobs=4)]: Done 2442 tasks      | elapsed:  4.1min
[Parallel(n_jobs=4)]: Done 3192 tasks      | elapsed:  5.3min
[Parallel(n_jobs=4)]: Done 4042 tasks      | elapsed:  6.8min
[Parallel(n_jobs=4)]: Done 4800 out of 4800 | elapsed:  8.0min finished


done in 482.550s
scoring paramater: f1
Best score: 0.396
Best parameters set:
	clf__alpha: 1e-05
	clf__penalty: 'l2'
	tfidf__use_idf: False
	vect__max_df: 0.9
	vect__min_df: 3
	vect__ngram_range: (1, 1)
	vect__stop_words: 'english'
Overall F1 Score : 0.539


In [21]:
pipe = pipeSetUp(LogisticRegression())

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0, 0.9),
    'vect__stop_words': ('english',),
    'vect__min_df': (2, 0.1, 3, 0.2, 4),
    'vect__ngram_range': ((1, 1), (1, 2),),  
    'tfidf__use_idf': (True, False),
#     'tfidf__norm': ('l1','l2'),
    'clf__max_iter': (100000,),
#    'clf__penalty': ('l1','l2', 'elasticnet'),
#    'clf__alpha': (0.0001,0.00001,0.0002,0.00002),
}

print('Getting Hate Score...')
hate_F1 = runPipe(trainText, trainHate, parameters, pipe)
print('Getting Target Score...')
target_F1 = runPipe(trainText, trainTarget, parameters, pipe)
print('Getting Aggressive Score...')
aggressive_F1 = runPipe(trainText, trainAggressive, parameters, pipe)

overall_F1 = (hate_F1 + target_F1 + aggressive_F1)/3

print("Overall F1 Score : %0.3f" % (overall_F1))

Getting Hate Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'clf__max_iter': (100000,),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    7.6s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   26.8s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   53.6s finished


done in 54.093s
scoring paramater: f1
Best score: 0.683
Best parameters set:
	clf__max_iter: 100000
	tfidf__use_idf: False
	vect__max_df: 0.5
	vect__min_df: 4
	vect__ngram_range: (1, 1)
	vect__stop_words: 'english'
Getting Target Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'clf__max_iter': (100000,),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    5.2s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   24.5s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   51.8s finished


done in 52.644s
scoring paramater: f1
Best score: 0.485
Best parameters set:
	clf__max_iter: 100000
	tfidf__use_idf: False
	vect__max_df: 0.5
	vect__min_df: 2
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'
Getting Aggressive Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
classifier: MultinomialNB
parameters:
{'clf__max_iter': (100000,),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    6.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   25.5s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:   51.9s finished


done in 52.550s
scoring paramater: f1
Best score: 0.299
Best parameters set:
	clf__max_iter: 100000
	tfidf__use_idf: False
	vect__max_df: 0.75
	vect__min_df: 4
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'
Overall F1 Score : 0.489


In [46]:
pipe = pipeSetUp(GradientBoostingClassifier())

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0, 0.9),
    'vect__stop_words': ('english',),
    'vect__min_df': (2, 0.1, 3, 0.2, 4),
    'vect__ngram_range': ((1, 1), (1, 2),),  
    'tfidf__use_idf': (True, False),
#     'tfidf__norm': ('l1','l2'),
#    'clf__max_iter': (100000,),
#    'clf__penalty': ('l1','l2', 'elasticnet'),
#    'clf__alpha': (0.0001,0.00001,0.0002,0.00002),
    'clf__n_estimators':(0, 100, 1000),
    'clf__learning_rate':(0, 0.5, 1.0 , 1.5, 2.0),
    'clf__max_depth':(0, 1, 2),
    'clf__random_state':(0, 1)
}

print('Getting Target Score...')
target_F1 = runPipe(trainText, trainTarget, parameters, pipe)



Getting Target Score...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__learning_rate': (0, 0.5, 1.0, 1.5, 2.0),
 'clf__max_depth': (0, 1, 2),
 'clf__n_estimators': (0, 100, 1000),
 'clf__random_state': (0, 1),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0, 0.9),
 'vect__min_df': (2, 0.1, 3, 0.2, 4),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__stop_words': ('english',)}
Fitting 5 folds for each of 7200 candidates, totalling 36000 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    5.4s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   15.8s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   34.4s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:   59.1s
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:  2.1min
[Parallel(n_jobs=4)]: Done 2442 tasks      | elapsed:  2.8min
[Parallel(n_jobs=4)]: Done 3192 tasks      | elapsed:  3.6min
[Parallel(n_jobs=4)]: Done 4042 tasks      | elapsed:  4.6min
[Parallel(n_jobs=4)]: Done 4992 tasks      | elapsed:  5.7min
[Parallel(n_jobs=4)]: Done 6042 tasks      | elapsed:  6.9min
[Parallel(n_jobs=4)]: Done 7192 tasks      | elapsed:  8.2min
[Parallel(n_jobs=4)]: Done 8442 tasks      | elapsed:  9.6min
[Parallel(n_jobs=4)]: Done 9792 tasks      | elapsed: 11.1min
[Parallel(n_jobs=4)]: Done 11242 tasks      | elapsed: 15.6mi

done in 14352.091s
scoring paramater: f1
Best score: 0.573
Best parameters set:
	clf__learning_rate: 1.5
	clf__max_depth: 2
	clf__n_estimators: 100
	clf__random_state: 0
	tfidf__use_idf: True
	vect__max_df: 0.75
	vect__min_df: 4
	vect__ngram_range: (1, 1)
	vect__stop_words: 'english'
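
A note on the grid above: as far as I can tell, `n_estimators=0` and `max_depth=0` are not valid settings for `GradientBoostingClassifier`, so every candidate containing them would fail to fit and only add to the run time. Searching over `random_state` also varies the seed rather than a real hyperparameter. The sketch below checks the first claim on a tiny made-up dataset; the `fits` helper and the data are assumptions for illustration, and the exact error raised may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Tiny made-up dataset, only used to trigger parameter validation.
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]] * 5)
y = np.array([0, 1, 1, 0] * 5)


def fits(**kwargs):
    """Return True if GradientBoostingClassifier(**kwargs) can be fitted on the toy data."""
    try:
        GradientBoostingClassifier(**kwargs).fit(X, y)
        return True
    except ValueError:
        # scikit-learn reports invalid parameter values as (a subclass of) ValueError.
        return False


valid_ok = fits(n_estimators=10, max_depth=1)
zero_estimators_ok = fits(n_estimators=0)
zero_depth_ok = fits(max_depth=0)

print("n_estimators=10, max_depth=1:", valid_ok)
print("n_estimators=0:", zero_estimators_ok)
print("max_depth=0:", zero_depth_ok)
```

Dropping the zero values and fixing `random_state` to a single value would cut the grid considerably without removing any candidate that could have won.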


In [40]:
vect = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=2, max_df=0.75)
# do comparison with one where you don't filter out HS
target_train_set = trainSet[(trainSet["HS"]==1)]
target_test_set = testSet[(testSet["HS"]==1)]

x_train_dtm = vect.fit_transform(target_train_set.text)
x_test_dtm = vect.transform(target_test_set.text)

bernoulli_nb = BernoulliNB()

bernoulli_nb.fit(x_train_dtm, target_train_set.AG)

y_pred_class_bernoulli_nb = bernoulli_nb.predict(x_test_dtm)

bernoulli_nb_acc = metrics.accuracy_score(target_test_set.AG, y_pred_class_bernoulli_nb)
bernoulli_nb_acc

print("Aggressive Score")
print(classification_report(target_test_set.AG, y_pred_class_bernoulli_nb, labels=[0,1]))

Aggressive Score
              precision    recall  f1-score   support

           0       0.59      0.81      0.68       666
           1       0.64      0.37      0.47       594

    accuracy                           0.60      1260
   macro avg       0.61      0.59      0.57      1260
weighted avg       0.61      0.60      0.58      1260



In [52]:
vect = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=4, max_df=0.75)
# do comparison with one where you don't filter out HS
target_train_set = trainSet[(trainSet["HS"]==1)]
target_test_set = testSet[(testSet["HS"]==1)]

x_train_dtm = vect.fit_transform(target_train_set.text)
x_test_dtm = vect.transform(target_test_set.text)

# the grid pipeline is missing these variables, which is one of the reasons why the F1-score is low
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=1.5,max_depth=2, random_state=0)

gb.fit(x_train_dtm, target_train_set.TR)

y_pred_class_gb = gb.predict(x_test_dtm)

gb_acc = metrics.accuracy_score(target_test_set.TR, y_pred_class_gb)
gb_acc

print("Target Score")
print(classification_report(target_test_set.TR, y_pred_class_gb, labels=[0,1]))

Target Score
              precision    recall  f1-score   support

           0       0.86      0.91      0.88       731
           1       0.86      0.80      0.83       529

    accuracy                           0.86      1260
   macro avg       0.86      0.85      0.86      1260
weighted avg       0.86      0.86      0.86      1260



Target Score
              precision    recall  f1-score   support

           0       0.94      0.84      0.89       731
           1       0.81      0.93      0.87       529

    accuracy                           0.88      1260
   macro avg       0.88      0.89      0.88      1260
weighted avg       0.89      0.88      0.88      1260





In [36]:
target_test_set[(y_pred_class_gb==1) & (target_test_set.TR==0)]

Unnamed: 0.1,Unnamed: 0,id,text,HS,TR,AG
2381,2381,30806,I can never get mad over a bitch cause i know ...,1,0,0
2384,2384,33611,I #want me🙌 and I will do anything to #keep me...,1,0,1
2387,2387,30408,I like my niggas mean😍😍😍 If a bitch touch you ...,1,0,1
2394,2394,31203,Bitches be selling they body and acting boogie...,1,0,1
2401,2401,33069,Bitch I'm the man. Hoe I'm the man. You know I...,1,0,1
2424,2424,31211,"swear, next bitch i fuck ima get her pregnant ...",1,0,1
2435,2435,32291,"3) ""you okay?"" ""Fuck you you fat bitch."" ""Fuck...",1,0,0
2441,2441,33807,You won't catch me on this hoe arguing with a ...,1,0,0
2447,2447,31109,A nigga never go respect a bitch who post noth...,1,0,1
2451,2451,33718,"When you're just tryna be happy, but then a bi...",1,0,0
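
The cell above selects false positives by combining the predictions with the true labels in a boolean mask. A minimal sketch of the same pattern on a made-up frame (the rows, index values and predictions are assumptions for illustration), extended to false negatives:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for target_test_set, with a non-default index like the real one.
toy = pd.DataFrame(
    {"text": ["a", "b", "c", "d", "e"], "TR": [1, 0, 0, 1, 0]},
    index=[10, 11, 12, 13, 14],
)
# Made-up predictions, in the same row order as `toy`.
y_pred = np.array([1, 1, 0, 0, 1])

# Predicted 1 but labelled 0.
false_positives = toy[(toy.TR == 0) & (y_pred == 1)]
# Predicted 0 but labelled 1.
false_negatives = toy[(toy.TR == 1) & (y_pred == 0)]

print(false_positives)
print(false_negatives)
```

The predictions are a plain NumPy array, so they are matched to the rows by position; this only works if the array is in the same order as the frame it came from.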


# Overall F1 Scores

Overall F1 = (F1(HS) + F1(TR) + F1(AG)) / 3

- MultinomialNB: 0.400
- BernoulliNB: 0.541
- LogisticRegression: 0.489
- SGDClassifier: 0.539
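
The averages can be re-derived from the per-task best scores printed in the logs above. The sketch below does that; because the logged scores are already rounded to three decimals, a recomputed average may differ from the printed one in the last digit.

```python
# Best cross-validated F1 per task, copied from the logs above, as (HS, TR, AG).
best_scores = {
    "MultinomialNB": (0.654, 0.312, 0.236),
    "BernoulliNB": (0.680, 0.558, 0.386),
    "SGDClassifier": (0.688, 0.533, 0.396),
    "LogisticRegression": (0.683, 0.485, 0.299),
}

# Overall F1 as printed by the notebook.
reported = {
    "MultinomialNB": 0.400,
    "BernoulliNB": 0.541,
    "SGDClassifier": 0.539,
    "LogisticRegression": 0.489,
}

# Unweighted mean over the three tasks.
overall = {name: sum(scores) / len(scores) for name, scores in best_scores.items()}

for name in sorted(overall, key=overall.get, reverse=True):
    print("%-20s recomputed=%0.3f reported=%0.3f" % (name, overall[name], reported[name]))
```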

