# Classifying tweet relevance to vaping with supervised machine learning algorithms

Authors: Patrick O'Halloran and Sanya Taneja

Home repository: https://github.com/CRMTH/AnnotationProjects

Summary: We evaluate several classification algorithms on the 2000 annotated tweets in Dataset 1 (D1). The target of this analysis is relevance to vaping as coded by annotators (note: inter-rater reliability metrics should be reported here). For classification, we evaluate the Bernoulli naive Bayes, random forest, logistic regression, and linear SVM algorithms, and we compare count vectorization with TF-IDF weighting for feature engineering. Measured by accuracy, the best combination of algorithm and feature engineering is logistic regression on count-vectorized features. We anticipate this result will not generalize as the dataset grows.

First, we import the Python libraries and read in D1, which was preprocessed with nlp_preprocess.py (written by Sanya) prior to analysis. Below, we display the first five rows of the relevant and non-relevant tweets.

In [63]:
# import python libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier, Perceptron, RidgeClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from time import time

# read in data and display first 5 rows for relevant and non-relevant tweets
d1 = pd.read_table('./data/processed_D1.tsv')
d1.groupby('relevant').head()

Unnamed: 0,tweetID,text,quote,relevant,com_vape,news_vape,pro_vape,anti_vape,metadata_text,mention_count,url_count,unicode_count,emoji_count,hashtag_count,keyword_count,hashtags,clean_text
0,'1037612658755821568,20% discount ON UBLO CBD E-Liquids ENTER UBL...,,1,1,0,0,0,discount ublo cbd e liquid enter ublo checkout...,0,0,0,0,0,1,[],discount ublo cbd e liquid enter ublo checkout...
1,'1032695372110413824,New Vaping SMOK Rolo Badge Starter Kit 250mAh...,,1,1,0,0,0,new vape smok rolo badg starter kit mah,0,0,0,0,0,1,[],new vape smok rolo badg starter kit mah
2,'1032148741359370240,knew she was real when it wasn't a Juul but a...,if you can smoke a jack & beat my ass like th...,1,0,0,0,1,knew real neg_a neg_juul neg_but neg_a jack,0,0,0,0,0,0,[],knew real neg_a neg_juul neg_but neg_a jack
3,'1031947667851472896,Take a minute today to check out some of the ...,,1,1,0,0,0,take minut today check new e juic help peopl a...,0,0,0,0,0,1,[],take minut today check new e juic help peopl a...
4,'1030995802305396736,New Vaping Vaporesso Revenger X 220W Kit with...,,1,1,0,0,0,new vape vaporesso reveng x w kit nrg tank ml ml,0,0,0,0,0,1,[],new vape vaporesso reveng x w kit nrg tank ml ml
5,'1037013414550228992,Parang ka usok ng vape napapasaya mo ako pero...,,0,0,0,0,0,parang ka usok ng vape napapasaya mo ako pero ...,0,0,0,0,0,1,[],parang ka usok ng vape napapasaya mo ako pero ...
6,'1033243474760323072,@hazim_wafiy,Sebab apa aku prefer # NanoSTIX dari rokok da...,0,0,0,0,0,,0,0,0,0,0,0,[],
8,'1034414473413517312,Nasty punya company semakin mantap dah. Try l...,Job vacancy. _newline_ Office yang suasananya...,0,0,0,0,0,nasti punya compani semakin mantap dah tri lah...,0,0,0,0,0,0,[],nasti punya compani semakin mantap dah tri lah...
26,'1032458291983659008,2nd\u3044\u3088\u3044\u3088\u660e\u65e5\u767a...,,0,0,0,0,0,nd vsc vscmod vscmodjapan kiyomasa nd vape vap...,0,0,0,0,0,3,[],nd vsc vscmod vscmodjapan kiyomasa nd vape vap...
27,'1036301216592982016,Ada lagi eh lelaki ceni sekarang?,Jangan tinggalkan lelaki yang: _newline_ 1.Ta...,0,0,0,0,0,ada lagi eh lelaki ceni sekarang,0,0,0,0,0,0,[],ada lagi eh lelaki ceni sekarang


Next, we check the completeness of D1 and how much data is lost by dropping rows with any null values. Only metadata_text and clean_text contain nulls (35 rows each), and dropping those rows leaves 1965 of the 2000 tweets (98.25%) available for evaluating the classifier algorithms.
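
The same completeness check can be sketched on a toy frame. The data below is made up for illustration and only mimics the shape of D1 (a text column, a cleaned text column with some nulls, and the relevance label); it also compares the class balance before and after dropping, which is worth confirming when the lost rows might be concentrated in one class.

```python
import numpy as np
import pandas as pd

# Toy stand-in for D1 (hypothetical rows): two of five have no cleaned text.
toy = pd.DataFrame({
    "text": ["vape sale", "@user", "new juul pod", "???", "no vaping here"],
    "clean_text": ["vape sale", np.nan, "new juul pod", np.nan, "neg_vape"],
    "relevant": [1, 0, 1, 0, 0],
})

# Per-column count of missing values.
null_counts = toy.isnull().sum()

# Rows that dropna() would remove, and their share of the data.
n_dropped = len(toy) - len(toy.dropna())
share_dropped = n_dropped / len(toy)

# Class balance before and after dropping, to check the loss is not one-sided.
balance_before = toy["relevant"].value_counts().to_dict()
balance_after = toy.dropna()["relevant"].value_counts().to_dict()

print(null_counts)
print(f"dropped {n_dropped} of {len(toy)} rows ({share_dropped:.1%})")
print(balance_before, balance_after)
```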

In [21]:
d1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 17 columns):
tweetID          2000 non-null object
text             2000 non-null object
quote            2000 non-null object
relevant         2000 non-null int64
com_vape         2000 non-null int64
news_vape        2000 non-null int64
pro_vape         2000 non-null int64
anti_vape        2000 non-null int64
metadata_text    1965 non-null object
mention_count    2000 non-null int64
url_count        2000 non-null int64
unicode_count    2000 non-null int64
emoji_count      2000 non-null int64
hashtag_count    2000 non-null int64
keyword_count    2000 non-null int64
hashtags         2000 non-null object
clean_text       1965 non-null object
dtypes: int64(11), object(6)
memory usage: 265.7+ KB


In [22]:
d1.dropna().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1965 entries, 0 to 1999
Data columns (total 17 columns):
tweetID          1965 non-null object
text             1965 non-null object
quote            1965 non-null object
relevant         1965 non-null int64
com_vape         1965 non-null int64
news_vape        1965 non-null int64
pro_vape         1965 non-null int64
anti_vape        1965 non-null int64
metadata_text    1965 non-null object
mention_count    1965 non-null int64
url_count        1965 non-null int64
unicode_count    1965 non-null int64
emoji_count      1965 non-null int64
hashtag_count    1965 non-null int64
keyword_count    1965 non-null int64
hashtags         1965 non-null object
clean_text       1965 non-null object
dtypes: int64(11), object(6)
memory usage: 276.3+ KB


In [23]:
d1.dropna(inplace=True)
d1.reset_index(drop=True,inplace=True)
x = d1.clean_text
y = d1.relevant

We use 5-fold cross-validation to assess the performance of logistic regression with L2 regularization, linear SVM with L2 regularization, Bernoulli naive Bayes, and random forest algorithms. Using scikit-learn, we construct a pipeline and run a grid search over a large parameter space: the number of count-vectorized features, the n-gram range (unigrams only, up to bigrams, or up to trigrams), and the classifier hyperparameters.

In [39]:
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV  # used below; not in the first import cell


class ClassifierPipeline(BaseEstimator):

    def __init__(self, estimator=LogisticRegression()):
        """
        A custom BaseEstimator that can switch between classifiers in the pipe.
        Defaults to Logistic Regression.
        
        :param estimator: sklearn object; switches between any sklearn estimator
        """
        self.estimator = estimator
    
    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self
    
    def predict(self, X, y=None):
        return self.estimator.predict(X)
    
    def predict_proba(self, X):
        return self.estimator.predict_proba(X)
    
    def score(self, X, y):
        return self.estimator.score(X, y)
    
cvec = CountVectorizer()

pipe = Pipeline(steps=[('vectorizer', cvec), ('clf', ClassifierPipeline())])

param_grid = [
    {
        'clf__estimator': [LogisticRegression()],
        'vectorizer__max_features': np.arange(1000,10000,100),
        'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vectorizer__stop_words': [None],
        'clf__estimator__penalty': ['l2'],
        'clf__estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    },
    {
        'clf__estimator': [LinearSVC()],
        'vectorizer__max_features': np.arange(1000,10000,100),
        'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vectorizer__stop_words': [None],
        'clf__estimator__penalty': ['l2'],
        'clf__estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    },
    {
        'clf__estimator': [BernoulliNB()],
        'vectorizer__max_features': np.arange(1000,10000,100),
        'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vectorizer__stop_words': [None],
        'clf__estimator__alpha': np.logspace(-4, 4, 5),
    },
    {
        'clf__estimator': [RandomForestClassifier()],
        'vectorizer__max_features': np.arange(1000,10000,100),
        'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vectorizer__stop_words': [None],
        'clf__estimator__n_estimators': [200, 500],
        'clf__estimator__max_features': ['auto', 'sqrt', 'log2'],
        'clf__estimator__max_depth': [4,5,6,7,8],
        'clf__estimator__criterion': ['gini', 'entropy'],
    },
]
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, return_train_score=False, verbose=3)
search.fit(x, y)

Fitting 5 folds for each of 21330 candidates, totalling 106650 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done 248 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done 888 tasks      | elapsed:   11.9s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:   22.8s
[Parallel(n_jobs=-1)]: Done 2936 tasks      | elapsed:   37.1s
[Parallel(n_jobs=-1)]: Done 4344 tasks      | elapsed:   54.8s
[Parallel(n_jobs=-1)]: Done 6008 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 7928 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 10104 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 12536 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 15224 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 18168 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 21368 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 24824 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 27097 tasks

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...enalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'clf__estimator': [LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)], 've...'], 'clf__estimator__max_depth': [4, 5, 6, 7, 8], 'clf__estimator__criterion': ['gini', 'entropy']}],
       pre_dispatch='2*n_jobs', refit=T
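
The candidate count reported in the log can be reproduced from the grid itself: 90 values of max_features times 3 n-gram ranges gives 270 vectorizer settings, which are multiplied by 7 values of C for logistic regression and for linear SVM, by 5 values of alpha for Bernoulli naive Bayes, and by 2 x 3 x 5 x 2 = 60 settings for random forest. A minimal sketch of that arithmetic with scikit-learn's ParameterGrid follows. The estimators are replaced by string labels and the hyperparameter keys are shortened, since only the number of values matters here; single-valued entries such as stop_words and penalty are omitted because they do not change the count.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Vectorizer options shared by all four sub-grids (values copied from the grid above).
vec = {
    "vectorizer__max_features": np.arange(1000, 10000, 100),  # 90 values
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],      # 3 values
}
c_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]               # 7 values

# String labels stand in for the estimators; key names are shortened for illustration.
grid = [
    {**vec, "clf": ["logreg"], "C": c_values},
    {**vec, "clf": ["svm"], "C": c_values},
    {**vec, "clf": ["bnb"], "alpha": np.logspace(-4, 4, 5)},
    {**vec, "clf": ["rf"], "n_estimators": [200, 500],
     "max_features": ["auto", "sqrt", "log2"],
     "max_depth": [4, 5, 6, 7, 8], "criterion": ["gini", "entropy"]},
]

sizes = [len(ParameterGrid(g)) for g in grid]
n_candidates = sum(sizes)
n_fits = n_candidates * 5  # one fit per candidate per fold in 5-fold CV

print(sizes)         # [1890, 1890, 1350, 16200]
print(n_candidates)  # 21330
print(n_fits)        # 106650
```

Random forest accounts for roughly three quarters of the candidates, so trimming its sub-grid is the most direct way to shorten the search.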

We find that logistic regression with L2 regularization performs the best, with a CV accuracy of 0.884. The best logistic regression model used a regularization parameter, C, of 10. Its features came from count vectorization of the preprocessed tweet text, with the vocabulary capped at 3700 features drawn from unigrams, bigrams, and trigrams (an n-gram range of (1, 3)).
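
To make the two vectorizer settings concrete, the sketch below fits CountVectorizer on three made-up documents (not taken from D1). It shows that an n-gram range of (1, 3) yields unigrams, bigrams, and trigrams in a single vocabulary, and that max_features then keeps only the most frequent terms across the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus for illustration.
docs = [
    "new vape starter kit",
    "new vape juice sale",
    "vape starter kit sale",
]

# ngram_range=(1, 3) extracts unigrams, bigrams and trigrams together.
full = CountVectorizer(ngram_range=(1, 3)).fit(docs)
vocab = sorted(full.vocabulary_)
n_by_length = {n: sum(1 for term in vocab if len(term.split()) == n)
               for n in (1, 2, 3)}

# max_features keeps only the terms with the highest corpus-wide counts.
capped = CountVectorizer(ngram_range=(1, 3), max_features=4).fit(docs)

print(n_by_length)  # {1: 6, 2: 6, 3: 5}
print(sorted(capped.vocabulary_))
```

The word "vape" occurs in every document, so it always survives the cap; which of the tied terms fill the remaining slots depends on scikit-learn's tie-breaking.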

In [40]:
### add dictionary of scores; refit AUC for best CV set of params
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Best parameter (CV score=0.884):
{'clf__estimator': LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False), 'clf__estimator__C': 10, 'clf__estimator__penalty': 'l2', 'vectorizer__max_features': 3700, 'vectorizer__ngram_range': (1, 3), 'vectorizer__stop_words': None}


Here we add the NLP preprocessing pipeline as a custom transformer, used in the pipeline above as the step before vectorization and classification. Including the transformer in the grid search lets us identify the combination of text preprocessing options that gives the best results.

The transformer calls the preprocess function from nlp_preprocess.py, the script used above for cleaning the text.


In [None]:
# %load nlp_preprocess.py
"""
Version: 03-13-2019
Author: sanyabt
"""
import re, string, sys
from nltk import tokenize
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

'''
Replace emojis with text translations given in emoji list (emojis: dictionary of translated emojis).
'''
def emojify(text, emojis):
	text = str(text.encode('unicode-escape'))[2:-1]
	if '\\u' in text.lower():
		text = text.replace('\\\\U' , '\\\\u')
		text = text.replace('\\\\u' , ' \\\\u')
		words = text.split(' ')
		for word in words:
			if '\\u' in word:
				if word in emojis.keys():
					words[words.index(word)] = emojis[word]
				elif word[0:11] in emojis:
					word_1 = word[11:len(word)]
					words[words.index(word)] = emojis[word[0:11]] + ' ' + word_1
				elif word[0:7] in emojis:
					word_1 = word[7:len(word)]
					words[words.index(word)] = emojis[word[0:7]] + ' ' + word_1
		return ' '.join(words)
	return text

'''
Remove translated emojis from text.
'''
def emoji_remove(text):
	words = tokenize.word_tokenize(text)
	for word in words:
		if 'emoj_' in word:
			words[words.index(word)] = ''
	return ' '.join(words)

'''
Remove twitter metadata information (_url_, _mention_, _hashtag_ etc) from text (excluding emojis).
'''
def metadata_remove(text):
	words = tokenize.word_tokenize(text)
	for word in words:
		if '_' in word and 'neg_' not in word and 'emoj_' not in word:
			words[words.index(word)] = ''
	return ' '.join(words)

'''
Replace Twitter URLs, hashtags, unrecognized unicodes and mentions.
'''
def metadata_clean(text, url_pat, unicode_pat, mention_pat):
	#Handle URL's
	text = url_pat.sub('_url_', text)

	#Unidentified unicodes
	text = unicode_pat.sub('_unicode_', text)
	
	#Handle @mentions and hashtags
	text = mention_pat.sub('_mention_', text)
	text = text.replace('#', '_hashtag_')
	
	#remove extra backslash due to post-parse emojify, but don't want to remove the \ in other unicodes
	text = text.replace('\\', '')
	return text

'''
Expand the negation words in text using negation dictionary (defined below).
'''
def negation_expand(text, neg_pattern, negations_dic):
	
	#Expand negation contractions mentioned in negations dictionary
	text = neg_pattern.sub(lambda x: negations_dic[x.group()], text)
	return text

'''
Remove punctuation from text. Keeps only runs of letters, digits and underscores, joined by single spaces.
'''
def punctuation_remove(text):

	tzer = tokenize.RegexpTokenizer(r'[A-Za-z0-9_]+')
	tokenized = tzer.tokenize(text)
	return ' '.join(tokenized)

'''
Remove digits from text.
'''
def digits_remove(text):
	result = ''.join(i for i in text if not i.isdigit())
	return result

def check_negation(token):
	flag = False
	if token in string.punctuation:
		flag = True
		return False, flag
	if '_' in token:
		return False, flag
	else:
		return True, flag

'''
Negation marking of text for all negation words defined below. 
1. Find all negation words in the text
2. Add NEG_ to tokens following the negation word till the next punctuation (end of sentence) - if punctuation present in next 4 tokens
3. Else add NEG_ to next 4 tokens (non-punctuation)
'''
def negation_marking(text, neg_mark_pattern):
	tokens = tokenize.word_tokenize(text)
	neg_matched = neg_mark_pattern.findall(' '.join(tokens))
	
	for item in neg_matched:
		if item in tokens:
			loc = tokens.index(item)
		
			if (len(tokens) - loc) <= 4:
				for tok in tokens[loc+1:]:
					ans, flag = check_negation(tok)
					if ans is True:
						tokens[tokens.index(tok)] = 'NEG_'+tok
					if flag is True:
						break
			else:
				for tok in tokens[loc+1:loc+5]:
					ans, flag = check_negation(tok)
					if ans is True:
						tokens[tokens.index(tok)] = 'NEG_'+tok
					if flag is True:
						break
	return ' '.join(tokens)

'''
Fix lengthening in text where consecutive similar characters occurring more than 2 times are reduced to 2.
'''
def normalize_text(text, pattern):
	return pattern.sub(r"\1\1", text)
	
'''
Remove stopwords from text using NLTK English stopwords list.
'''
def stopwords_remove(text):
	words = tokenize.word_tokenize(text)
	stop_words = set(stopwords.words('english'))
	words = [word for word in words if word.lower() not in stop_words]
	return ' '.join(words)

'''
Porter stemming algorithm applied to the text. Note: converts text to lowercase and also stems stopwords (such as 'was' to 'wa'). Use with caution.
'''
def stemming_apply(text, stemmer):
	tokens = tokenize.word_tokenize(text)
	stems = []
	for t in tokens:
		stems.append(stemmer.stem(t))
	return ' '.join(stems)

'''
Based on text options specified, run the pipeline and process tweets text.
'''
def preprocess(tweets, text_options):

	#Translate emojis for all tweets (this is the default in parsing)
	emojis = {}
	with open('data/emojilist5.csv', 'r') as f:
		for line in f:
			unic=line.split(',')[0].lower()
			trans=line.split(',')[1]
			emojis[unic]=trans
	tweets = tweets.apply(emojify, args=(emojis,))
	
	#Metadata information clean for all tweets
	pat1 = r'https?://[A-Za-z0-9./]+'
	pat2 = r'www\.[^ ]+'
	combined_pat = r'|'.join((pat1, pat2))
	url_pat = re.compile(combined_pat)

	pat3 = r'\\u[^ ]+'
	unicode_pat = re.compile(pat3)

	pat4 = r'@[A-Za-z0-9_]+'
	mention_pat = re.compile(pat4)
	tweets = tweets.apply(metadata_clean, args=(url_pat, unicode_pat, mention_pat))

	#Expand negations
	if text_options['negation_expand'] is True:
		negations_dic = {"isn\'t" : "is not",
				"aren\'t" : "are not",
				"wasn\'t" : "was not",
				"weren\'t" : "were not",
				"haven\'t" : "have not",
				"hasn\'t" : "has not",
				"hadn\'t" : "had not",
				"won\'t" : "will not",
				"wouldn\'t" : "would not",
				"don\'t" : "do not",
				"doesn\'t" : "does not",
				"didn\'t" : "did not",
				"can\'t" : "can not",
				"couldn\'t" : "could not",
				"shouldn\'t" : "should not",
				"mightn\'t" : "might not",
				"mustn\'t" : "must not",
				"shan\'t" : "shall not",
				"ain\'t" : "am not"}
	
		neg_expand_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')
		tweets = tweets.apply(negation_expand, args=(neg_expand_pattern, negations_dic))
	
	#Remove punctuation
	if text_options['punctuation_remove'] is True:
		tweets = tweets.apply(punctuation_remove)

	#Remove metadata- hashtags, urls, mentions, unicode
	if text_options['metadata_remove'] is True:
		tweets = tweets.apply(metadata_remove)

	#Remove emojis from tweets
	if text_options['emoji_remove'] is True:
		tweets = tweets.apply(emoji_remove)

	#Remove digits
	if text_options['digits_remove'] is True:
		tweets = tweets.apply(digits_remove)
	
	#Mark negations
	if text_options['negation_mark'] is True:
		neg_words = ['not', 'never', 'no', 'nothing', 'noone', 'nowhere', 'none',
				'isnt', 'arent', 'wasnt', 'werent', 'havent', 'hasnt', 'hadnt',
				'wont', 'wouldnt', 'dont', 'doesnt', 'didnt', 'cant', 'couldnt',
				'shouldnt', 'mightnt', 'mustnt', 'shant', 'aint']
		neg_mark_pattern = re.compile(r'\b(' + '|'.join(neg_words) + r')\b')
		tweets = tweets.apply(negation_marking, args=(neg_mark_pattern,))
	
	#Normalization
	if text_options['normalize'] is True:
		repeat_pattern = re.compile(r"(.)\1{2,}")
		tweets = tweets.apply(normalize_text, args=(repeat_pattern,))

	#Remove stopwords: done before stemming and after negation marking (stemmer stems stopwords, if done before negation marking, some negation words removed)
	if text_options['stopwords_remove'] is True:
		tweets = tweets.apply(stopwords_remove)

	#Stemming. Note: will convert to lowercase and also stem stopwords (such as 'was' to 'wa')
	if text_options['stemming'] is True:
		ps = PorterStemmer()
		tweets = tweets.apply(stemming_apply, args=(ps,))

	#Lowercasing of text
	if text_options['lower'] is True:
		tweets = tweets.str.lower()

	return tweets


There are 10 possible preprocessing steps:
1. *'metadata_remove'*: True to remove metadata (hashtags, URLs, mentions, and unrecognized unicodes) from text, else False. Emojis and other metadata are translated to placeholder tokens by default.
2. *'punctuation_remove'*: True if remove from text, else False
3. *'negation_expand'*: True if expand negation words, else False
4. *'digits_remove'*: True if remove digits (0-9), else False
5. *'negation_mark'*: True if negation marking applied to tokens, else False
6. *'normalize'*: True for text normalization, else False
7. *'stopwords_remove'*: True for removal of stop words, else False
8. *'stemming'*: True if stemming applied, else False
9. *'lower'*: True for lowercasing tweet text, else False
10. *'emoji_remove'*: True to remove emojis from text, else False

Options are all True by default.
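
A minimal, self-contained sketch of three of these steps (negation expansion, normalization, and digit removal) is shown below. It reuses the regular expressions from nlp_preprocess.py but not the script itself; the two-entry contraction dictionary and the example tweet are made up for illustration.

```python
import re

# Lengthening: any character repeated three or more times is cut to two.
repeat_pattern = re.compile(r"(.)\1{2,}")

# Two-entry excerpt of the contraction dictionary used for negation expansion.
negations = {"don't": "do not", "isn't": "is not"}
neg_pattern = re.compile(r"\b(" + "|".join(negations) + r")\b")

def expand_negations(text):
    return neg_pattern.sub(lambda m: negations[m.group()], text)

def normalize(text):
    return repeat_pattern.sub(r"\1\1", text)

def remove_digits(text):
    return "".join(ch for ch in text if not ch.isdigit())

tweet = "I don't vape, this isn't coooool 100%"
step1 = expand_negations(tweet)
step2 = normalize(step1)
step3 = remove_digits(step2)

print(step1)  # I do not vape, this is not coooool 100%
print(step2)  # I do not vape, this is not cool 100%
print(step3)  # I do not vape, this is not cool %
```

The order of the steps matters: in preprocess, negations are expanded before punctuation is removed, because the contraction pattern relies on the apostrophe still being present.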

In [49]:
from sklearn.base import TransformerMixin, BaseEstimator
class NLP_transformer(BaseEstimator, TransformerMixin):
    def __init__(self, 
                 metadata_remove=True, emoji_remove=True, negation_expand=True, punctuation_remove=True, 
                 digits_remove=True, negation_mark=True, normalize=True, stemming=True, 
                 stopwords_remove=True, lower=True):
        """
        A custom Transformer that takes in text column of dataframe and transforms to clean text.
        Arguments are text options specified above.
        """
        # scikit-learn's get_params/set_params/clone look each option up as an
        # attribute named after its argument, so store them one by one.
        self.metadata_remove = metadata_remove
        self.emoji_remove = emoji_remove
        self.negation_expand = negation_expand
        self.punctuation_remove = punctuation_remove
        self.digits_remove = digits_remove
        self.negation_mark = negation_mark
        self.normalize = normalize
        self.stemming = stemming
        self.stopwords_remove = stopwords_remove
        self.lower = lower

    def transform(self, X, y=None):
        # Build the options dict at call time so values set by a grid search are used.
        return preprocess(X, self.get_params())
        
    def fit(self, X, y=None):
        return self
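
For a transformer's options to be tunable in a grid search, scikit-learn needs to read and write them through get_params and set_params, and both rely on each constructor argument being stored as an attribute of the same name. The sketch below illustrates that convention with a hypothetical one-option transformer, LowercaseTransformer, which is not part of this project.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone

class LowercaseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical one-option text transformer, for illustration only."""

    def __init__(self, lower=True):
        # Store the argument unchanged under its own name so that
        # get_params(), set_params() and clone() can find it.
        self.lower = lower

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Read the option at call time so set_params() changes take effect.
        return [s.lower() if self.lower else s for s in X]

t = LowercaseTransformer()
params = t.get_params()

# Roughly what a grid search does for each candidate: clone, then set_params.
t2 = clone(t).set_params(lower=False)

print(params)                          # {'lower': True}
print(t.transform(["New VAPE kit"]))   # ['new vape kit']
print(t2.transform(["New VAPE kit"]))  # ['New VAPE kit']
```

If the options were copied into a separate dictionary inside the constructor, set_params would update the attribute but not the dictionary, and every candidate in the search would silently run with the same preprocessing.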

In [98]:
from sklearn.metrics import make_scorer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

pipe = Pipeline(
    steps=[
    ('preprocess', NLP_transformer()), 
    ('vectorizer', CountVectorizer()), 
    ('tfidf', TfidfTransformer()), 
    ('clf', ClassifierPipeline())
    ]
)

scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score),
           'Brier': 'brier_score_loss', 'f1-score': make_scorer(f1_score),
           'precision': make_scorer(precision_score), 'recall': make_scorer(recall_score)}

param_grid = [
    {
        'preprocess__metadata_remove': [False, True],
        'preprocess__emoji_remove': [False, True],
        'preprocess__punctuation_remove': [False, True],
        'preprocess__negation_expand': [False, True],
        'preprocess__digits_remove': [False, True],
        'preprocess__negation_mark': [False, True],
        'preprocess__normalize': [False, True],
        'preprocess__stopwords_remove': [False, True],
        'preprocess__stemming': [False, True],
        'preprocess__lower': [False, True],
        'clf__estimator': [LogisticRegression()],
        'vectorizer__max_features': np.arange(1000,10000,100),
        'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vectorizer__stop_words': [None],
        'tfidf__use_idf': [False, True],
        'clf__estimator__penalty': ['l2'],
        'clf__estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    },
    {
        'preprocess__metadata_remove': [False, True],
        'preprocess__emoji_remove': [False, True],
        'preprocess__punctuation_remove': [False, True],
        'preprocess__negation_expand': [False, True],
        'preprocess__digits_remove': [False, True],
        'preprocess__negation_mark': [False, True],
        'preprocess__normalize': [False, True],
        'preprocess__stopwords_remove': [False, True],
        'preprocess__stemming': [False, True],
        'preprocess__lower': [False, True],
        'clf__estimator': [LinearSVC()],
        'vectorizer__max_features': np.arange(1000,10000,100),
        'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vectorizer__stop_words': [None],
        'clf__estimator__penalty': ['l2'],
        'clf__estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    },
    {
        'preprocess__metadata_remove': [False, True],
        'preprocess__emoji_remove': [False, True],
        'preprocess__punctuation_remove': [False, True],
        'preprocess__negation_expand': [False, True],
        'preprocess__digits_remove': [False, True],
        'preprocess__negation_mark': [False, True],
        'preprocess__normalize': [False, True],
        'preprocess__stopwords_remove': [False, True],
        'preprocess__stemming': [False, True],
        'preprocess__lower': [False, True],
        'clf__estimator': [BernoulliNB()],
        'vectorizer__max_features': np.arange(1000,10000,100),
        'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vectorizer__stop_words': [None],
        'clf__estimator__alpha': np.logspace(-4, 4, 5),
    },
    {
        'preprocess__metadata_remove': [False, True],
        'preprocess__emoji_remove': [False, True],
        'preprocess__punctuation_remove': [False, True],
        'preprocess__negation_expand': [False, True],
        'preprocess__digits_remove': [False, True],
        'preprocess__negation_mark': [False, True],
        'preprocess__normalize': [False, True],
        'preprocess__stopwords_remove': [False, True],
        'preprocess__stemming': [False, True],
        'preprocess__lower': [False, True],
        'clf__estimator': [RandomForestClassifier()],
        'vectorizer__max_features': np.arange(1000,10000,100),
        'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vectorizer__stop_words': [None],
        'clf__estimator__n_estimators': [200, 500],
        'clf__estimator__max_features': ['auto', 'sqrt', 'log2'],
        'clf__estimator__max_depth': [4,5,6,7,8],
        'clf__estimator__criterion': ['gini', 'entropy'],
    },
]
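
The cell above defines the scoring dictionary and the grid but does not show the search call. As a hedged sketch of how a multi-metric search is usually run, the example below uses toy tweets, a plain LogisticRegression in place of the ClassifierPipeline wrapper, and a two-parameter grid; all names prefixed with toy_ are assumptions made for illustration. The point it demonstrates is that when scoring is a dictionary, refit has to name the metric that selects best_params_.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy labelled tweets (hypothetical); 1 = relevant to vaping.
toy_texts = ["new vape kit sale", "vape juice discount", "juul pod starter kit",
             "vape shop open today", "lunch was great", "traffic is terrible",
             "watching the game tonight", "rain again this morning"] * 3
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

toy_pipe = Pipeline(steps=[
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])

toy_scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}
toy_grid = {"tfidf__use_idf": [False, True], "clf__C": [0.1, 1, 10]}

# With several metrics, refit must name the one used to pick best_params_.
toy_search = GridSearchCV(toy_pipe, toy_grid, scoring=toy_scoring,
                          refit="Accuracy", cv=3)
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_)
print(sorted(k for k in toy_search.cv_results_ if k.startswith("mean_test_")))
```

Each metric gets its own mean_test_, std_test_ and rank_test_ columns in cv_results_, so the candidates can still be compared on AUC even though accuracy drives the refit.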

In [97]:
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Best parameter (CV score=0.907):
{'clf__estimator': LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False), 'clf__estimator__C': 1, 'clf__estimator__penalty': 'l2', 'preprocess__digits': False, 'preprocess__emoji': False, 'preprocess__lower': False, 'preprocess__metadata': 1, 'preprocess__negation_expand': False, 'preprocess__normalize': False, 'preprocess__punctuation': False, 'preprocess__stemming': False, 'preprocess__stopwords': False, 'tfidf__use_idf': False, 'vectorizer__max_features': 3500, 'vectorizer__ngram_range': (1, 1), 'vectorizer__stop_words': None}


In [75]:
results = search.cv_results_


In [85]:
results

{'mean_fit_time': array([0.0403512 , 0.03981028, 0.03670082, 0.03957047, 0.03650131,
        0.0398365 , 0.03961735, 0.03921094, 0.04045658, 0.03748951,
        0.04044547, 0.04018364, 0.04098425, 0.03730659, 0.03902502,
        0.03624907, 0.03958673, 0.03570094, 0.04386249, 0.0398757 ,
        0.03998594, 0.03855577, 0.03800268, 0.03647332, 0.04157043,
        0.03937721, 0.03979721, 0.03822179, 0.03811841, 0.03595705,
        0.03901234, 0.03559742, 0.04003134, 0.03902564, 0.03748217,
        0.0406796 , 0.03680849, 0.03754802, 0.03798261, 0.0407989 ,
        0.03781343, 0.03967919, 0.04260292, 0.03512135, 0.03941388,
        0.03880229, 0.04035039, 0.03936143, 0.03674197, 0.03978357,
        0.03875518, 0.03802009, 0.04066854, 0.03805256, 0.03836823,
        0.03971586, 0.03804245, 0.03980861, 0.03775954, 0.04282408,
        0.03944077, 0.04122005, 0.04133177, 0.03642893, 0.03982892,
        0.03559251, 0.03715873, 0.03926282, 0.03975039, 0.04234357,
        0.04080062, 0.03594041,
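
The raw cv_results_ dictionary is hard to read at this size. One common way to inspect it, sketched here on a small toy search rather than on the search above, is to load it into a pandas DataFrame and sort by rank. The demo_ names and the toy tweets are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy labelled tweets (hypothetical); 1 = relevant to vaping.
demo_texts = ["new vape kit sale", "vape juice discount", "juul pod starter kit",
              "vape shop open today", "lunch was great", "traffic is terrible",
              "watching the game tonight", "rain again this morning"] * 3
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

demo_search = GridSearchCV(
    Pipeline([("vectorizer", CountVectorizer()), ("clf", LogisticRegression())]),
    {"vectorizer__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1, 10]},
    cv=3,
)
demo_search.fit(demo_texts, demo_labels)

# One row per candidate; keep the columns that matter for comparison.
table = pd.DataFrame(demo_search.cv_results_)
cols = ["rank_test_score", "mean_test_score", "std_test_score", "params"]
summary = table[cols].sort_values("rank_test_score").reset_index(drop=True)

print(summary.head())
```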

In [89]:
search.best_estimator_

Pipeline(memory=None,
     steps=[('preprocess', NLP_transformer(digits=False, emoji=False, lower=False, metadata=1,
        negation_expand=False, negation_mark=None, normalize=False,
        punctuation=False, stemming=False, stopwords=False)), ('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='stri...enalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)))])