# Predict tweet sentiment

#### Data:
About 45,000 tweets about COVID-19, collected in 2020 and manually labelled with sentiments.

**corona_train.csv** contains tweets with the target variable (Sentiment) included. We will use the data to train the models and check their quality.

**corona_test.csv** contains tweets without associated values of the target variable. We will predict sentiments for every tweet from this file. 

#### Quality measure: 
accuracy

## 1. Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette('muted')
sns.set_color_codes('muted')
sns.set_style('white')

import warnings
warnings.filterwarnings('ignore')

import re
import nltk

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, ParameterGrid
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC, SVC
from tqdm import tqdm

# NLTK resources used by the tokenizer, POS tagger and lemmatizer below
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /home/zarina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/zarina/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /home/zarina/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/zarina/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 2. Load the files and look through them

In [2]:
%config InlineBackend.figure_format = 'retina'

In [3]:
train = pd.read_csv('corona_train.csv', encoding='ISO-8859-1', index_col=0)
test = pd.read_csv('corona_test.csv', encoding='ISO-8859-1', index_col=0)
train.head(10)

ID,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
1,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
2,3804,48756,"ÃT: 36.319708,-82.363649",16-03-2020,As news of the regions first confirmed COVID-1...,Positive
3,3807,48759,"Atlanta, GA USA",16-03-2020,Due to COVID-19 our retail store and classroom...,Positive
4,3808,48760,"BHAVNAGAR,GUJRAT",16-03-2020,"For corona prevention,we should stop to buy th...",Negative
5,3809,48761,"Makati, Manila",16-03-2020,All month there hasn't been crowding in the su...,Neutral
6,3810,48762,"Pitt Meadows, BC, Canada",16-03-2020,"Due to the Covid-19 situation, we have increas...",Extremely Positive
7,3811,48763,Horningsea,16-03-2020,#horningsea is a caring community. Lets ALL lo...,Extremely Positive
8,3813,48765,,16-03-2020,ADARA Releases COVID-19 Resource Center for Tr...,Positive
9,3814,48766,"Houston, Texas",16-03-2020,Lines at the grocery store have been unpredict...,Positive


In [4]:
train['type']='train'
test['type']='test'
test['Sentiment']=''

In [5]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
train = pd.concat([train, test])

In [6]:
train.drop_duplicates(subset='OriginalTweet').info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44955 entries, 0 to 17981
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UserName       44955 non-null  int64 
 1   ScreenName     44955 non-null  int64 
 2   Location       35531 non-null  object
 3   TweetAt        44955 non-null  object
 4   OriginalTweet  44955 non-null  object
 5   Sentiment      44955 non-null  object
 6   type           44955 non-null  object
dtypes: int64(2), object(5)
memory usage: 2.7+ MB


In [7]:
train.UserName = train.UserName.astype(str)
train.ScreenName = train.ScreenName.astype(str)

In [8]:
train.describe()

,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,type
count,44955,44955,35531,44955,44955,44955.0,44955
unique,44955,44955,13127,44,44955,6.0,2
top,3801,48753,United States,20-03-2020,Coronavirus Australia: Woolworths to give elde...,,train
freq,1,1,603,3448,1,17982.0,26973


In [9]:
train.TweetAt = pd.to_datetime(train.TweetAt)

In [10]:
train.TweetAt.loc[0]

ID
0   2020-03-16
0   2020-07-04
Name: TweetAt, dtype: datetime64[ns]
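The `TweetAt` strings in the preview above (for example `16-03-2020`) appear to be day-first. When no format is given, `pd.to_datetime` may read ambiguous values such as `01-04-2020` month-first, depending on the pandas version, so days 1 to 12 can end up with day and month swapped. A minimal sketch of an explicit parse, using a small hypothetical sample rather than the real column:

```python
import pandas as pd

# Hypothetical sample in the same dd-mm-yyyy layout as the TweetAt column
raw = pd.Series(['16-03-2020', '01-04-2020', '13-04-2020'])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(raw, format='%d-%m-%Y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2020-03-16', '2020-04-01', '2020-04-13']
```

Applying the same `format` argument to `train.TweetAt` would be the analogous change, assuming every value in the column follows this layout.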

In [11]:
# Number of tweets per calendar day
tweets_per_day = train[['TweetAt']].set_index(train['TweetAt']).resample('D').count()
tweets_per_day

TweetAt (index),TweetAt (count)
2020-01-04,630
2020-01-05,0
2020-01-06,0
2020-01-07,0
2020-01-08,0
...,...
2020-11-30,0
2020-12-01,0
2020-12-02,0
2020-12-03,685


In [12]:
# Same counts keyed by month-day (via strftime), skipping days without tweets
tweets_per_day_simple = train.TweetAt.dt.strftime('%m-%d').value_counts().sort_index()
tweets_per_day_simple

01-04     630
02-03       4
02-04     954
03-03       4
03-04     810
03-13    1233
03-14     614
03-15     519
03-16    1128
03-17    1977
03-18    2742
03-19    3215
03-20    3448
03-21    2653
03-22    2114
03-23    2062
03-24    1480
03-25    2979
03-26    1277
03-27     345
03-28      23
03-29     125
03-30      87
03-31     316
04-03       8
04-04     767
04-13    1428
04-14     284
05-03       6
05-04    1131
06-03       2
06-04    1742
07-03       7
07-04    1843
08-03       9
08-04    1881
09-03      16
09-04    1471
10-03      54
10-04    1005
11-03     165
11-04     909
12-03     685
12-04     803
Name: TweetAt, dtype: int64

In [13]:
train.Location.value_counts()

United States                  603
London, England                568
London                         565
New York, NY                   429
Washington, DC                 411
                              ... 
MD/DC                            1
Lenoir, N.C.                     1
Hopefully not Garden Center      1
Adn                              1
Mahwah, NJ                       1
Name: Location, Length: 13127, dtype: int64

In [14]:
train['month'] = train.TweetAt.dt.month
train['day'] = train.TweetAt.dt.day
train['dayofweek'] = train.TweetAt.dt.dayofweek
train['weekday'] = train.TweetAt.dt.weekday

In [15]:
train.dayofweek.value_counts()

2    7442
4    7315
1    7061
3    6935
5    6584
0    5849
6    3769
Name: dayofweek, dtype: int64

In [16]:
train['tweetlength'] = train.OriginalTweet.str.len()

## 3. Clean the tweets

In [17]:
def tweet_cleaner(tweet):
    
    # remove urls
    tweet = re.sub(r'http\S+', ' ', tweet)
    
    # remove html tags
    #tweet = re.sub(r'<.*?>', ' ', tweet)
    
    # remove digits
    #tweet = re.sub(r'\d+', ' ', tweet)
    
    # remove hashtags
    tweet = re.sub(r'#\w+', ' ', tweet)
    
    # remove mentions
    tweet = re.sub(r'@\w+', ' ', tweet)
    
    # collapse runs of whitespace (including newlines) into single spaces
    tweet = ' '.join(tweet.split())

    return tweet


train['OriginalTweet'] = train['OriginalTweet'].apply(tweet_cleaner)
train['CleanTweet'] = train['OriginalTweet'].str.lower()
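To make the effect of the cleaning step concrete, here is a small self-contained sketch that applies the same three substitutions to a made-up tweet. The function name `clean_tweet` and the sample text are illustrative only and are not part of the notebook.

```python
import re

def clean_tweet(tweet):
    """Strip URLs, hashtags and mentions, collapse whitespace, lowercase."""
    tweet = re.sub(r'http\S+', ' ', tweet)   # URLs
    tweet = re.sub(r'#\w+', ' ', tweet)      # hashtags (the whole tag is dropped)
    tweet = re.sub(r'@\w+', ' ', tweet)      # mentions
    return ' '.join(tweet.split()).lower()

sample = "@neighbour Stay SAFE!\nShops restock daily #covid19 https://t.co/abc123"
print(clean_tweet(sample))
# stay safe! shops restock daily
```

Note that dropping the whole hashtag also discards words that may carry sentiment; keeping the word and removing only the `#` sign is a possible alternative.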

## 4. Lemmatization

In [18]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [19]:
wnl = WordNetLemmatizer()

def lemmatize_text(text):
    # Split the text into words
    words = nltk.word_tokenize(text)
    # Lemmatize each word and join them back into a string
    return ' '.join([wnl.lemmatize(word, get_wordnet_pos(word)) for word in words])

# Apply the lemmatization function to the text data
train['CleanTweet'] = train['CleanTweet'].apply(lemmatize_text)

In [20]:
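def cleaning_repeating_char(text):
    # \1 is a backreference to the captured character: runs of 3+ collapse to two,
    # so "sooooo" becomes "soo" while ordinary doubles such as "good" stay intact
    return re.sub(r'(.)\1{2,}', r'\1\1', text)
train['CleanTweet'] = train['CleanTweet'].apply(cleaning_repeating_char)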
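Collapsing repeated characters relies on a regex backreference: `\1` matches whatever the first group captured, whereas a bare `1` matches only the literal digit. The sketch below is one possible variant, assuming that runs of three or more characters should shrink to two; the helper name `collapse_repeats` is hypothetical.

```python
import re

def collapse_repeats(text):
    # (.) captures one character, \1{2,} requires at least two more copies of it
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

print(collapse_repeats('sooooo goood news!!!'))
# soo good news!!
```

Keeping two characters instead of one avoids turning legitimate double letters ("good", "shopping") into different tokens.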

In [21]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44955 entries, 0 to 17981
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   UserName       44955 non-null  object        
 1   ScreenName     44955 non-null  object        
 2   Location       35531 non-null  object        
 3   TweetAt        44955 non-null  datetime64[ns]
 4   OriginalTweet  44955 non-null  object        
 5   Sentiment      44955 non-null  object        
 6   type           44955 non-null  object        
 7   month          44955 non-null  int64         
 8   day            44955 non-null  int64         
 9   dayofweek      44955 non-null  int64         
 10  weekday        44955 non-null  int64         
 11  tweetlength    44955 non-null  int64         
 12  CleanTweet     44955 non-null  object        
dtypes: datetime64[ns](1), int64(5), object(7)
memory usage: 5.8+ MB


In [22]:
train['OriginalTweet'][1]

ID
1    My food stock is not the only one which is emp...
1    In light of Covid-19, Canada's insurers are re...
Name: OriginalTweet, dtype: object

## 5. Split the dataset

In [23]:
X_train_text, X_test_text, y_train_text, y_test_text = train_test_split(train[['CleanTweet']], 
                                                                        train.Sentiment,
                                                                        stratify=train.Sentiment,
                                                                        test_size = 0.25,
                                                                        random_state = 42)

In [24]:
len(X_train_text)

33716
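At this point `train` holds the rows of both files (cell 5), and the rows from `corona_test.csv` carry an empty `Sentiment` (cell 4). The `describe()` output above reports 6 unique sentiment values, with the empty string as the most frequent one (17,982 rows), so a split over the combined frame treats "no label" as a sixth class. A minimal sketch of splitting only the labelled rows, shown on a hypothetical miniature frame with the same columns:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the combined frame: labelled and unlabelled rows together
combined = pd.DataFrame({
    'CleanTweet': [f'tweet {i}' for i in range(12)],
    'Sentiment': ['Positive', 'Negative'] * 4 + [''] * 4,
    'type': ['train'] * 8 + ['test'] * 4,
})

# Keep only rows that came from the labelled file
labelled = combined[combined['type'] == 'train']

X_tr, X_val, y_tr, y_val = train_test_split(
    labelled[['CleanTweet']], labelled['Sentiment'],
    stratify=labelled['Sentiment'], test_size=0.25, random_state=42)

print(sorted(y_tr.unique()), len(X_tr), len(X_val))
```

With this filter the hold-out accuracy measures only how well the five real sentiment classes are separated.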

## DecisionTreeClassifier

In [25]:
# Quick baseline: check how a decision tree does on our data.
# It is unlikely that we will use this model further.

from sklearn.tree import DecisionTreeClassifier
        
params_grid = dict(min_df=[.0001, .0005, .0007, .001, .005, .01], max_df=[.7, .75, .8, .85, .9])

resultsdtc = []

for params in tqdm(ParameterGrid(params_grid)):
    pipe = Pipeline(steps = [
        ('tf_idf_vec', TfidfVectorizer(
            token_pattern=r'[A-Za-z]{2,}',
            max_df=params['max_df'],
            min_df=params['min_df'],
            stop_words='english'
        )), 
        ('classifier', DecisionTreeClassifier())
    ])
    
    pipe.fit(X_train_text['CleanTweet'], y_train_text)
    
    pipe_preds_train = pipe.predict(X_train_text.CleanTweet)
    pipe_preds_test = pipe.predict(X_test_text.CleanTweet)
    
    resultsdtc.append(dict(
        params=params,
        
        precision_train=precision_score(y_true=y_train_text, y_pred=pipe_preds_train, average='macro'),
        precision_test=precision_score(y_true=y_test_text, y_pred=pipe_preds_test, average='macro'),       
        
        recall_train=recall_score(y_true=y_train_text, y_pred=pipe_preds_train, average='macro'),
        recall_test=recall_score(y_true=y_test_text, y_pred=pipe_preds_test, average='macro'),
        
        f1_train=f1_score(y_true=y_train_text, y_pred=pipe_preds_train, average='macro'),
        f1_test=f1_score(y_true=y_test_text, y_pred=pipe_preds_test, average='macro'),
    
        accuracy_train=accuracy_score(y_true=y_train_text, y_pred=pipe_preds_train),
        accuracy_test=accuracy_score(y_true=y_test_text, y_pred=pipe_preds_test)
    ))

100%|███████████████████████████████████████████████████████████████████████████████████| 30/30 [05:30<00:00, 11.03s/it]


In [26]:
resultsdtc = pd.DataFrame(resultsdtc)
resultsdtc.sort_values('accuracy_test', ascending=False).head(10).style.bar(vmin=0, vmax=1)

,params,precision_train,precision_test,recall_train,recall_test,f1_train,f1_test,accuracy_train,accuracy_test
12,"{'max_df': 0.8, 'min_df': 0.0001}",0.994848,0.296388,0.990851,0.280781,0.992835,0.286844,0.992289,0.334727
24,"{'max_df': 0.9, 'min_df': 0.0001}",0.994848,0.29445,0.990851,0.279753,0.992835,0.285468,0.992289,0.333927
21,"{'max_df': 0.85, 'min_df': 0.001}",0.990667,0.296941,0.986648,0.282308,0.988635,0.287955,0.987899,0.332948
0,"{'max_df': 0.7, 'min_df': 0.0001}",0.994848,0.291169,0.990851,0.277939,0.992835,0.283132,0.992289,0.331168
6,"{'max_df': 0.75, 'min_df': 0.0001}",0.994848,0.291713,0.990851,0.277704,0.992835,0.283197,0.992289,0.330901
26,"{'max_df': 0.9, 'min_df': 0.0007}",0.99194,0.294479,0.987652,0.277654,0.989777,0.284122,0.989056,0.330456
8,"{'max_df': 0.75, 'min_df': 0.0007}",0.99194,0.290684,0.987652,0.275368,0.989777,0.281358,0.989056,0.329389
18,"{'max_df': 0.85, 'min_df': 0.0001}",0.994848,0.290407,0.990851,0.277595,0.992835,0.2828,0.992289,0.329122
2,"{'max_df': 0.7, 'min_df': 0.0007}",0.99194,0.291048,0.987652,0.273681,0.989777,0.280433,0.989056,0.327965
20,"{'max_df': 0.85, 'min_df': 0.0007}",0.99194,0.294004,0.987652,0.277415,0.989777,0.28395,0.989056,0.327698


## Params for TF-IDF

In [27]:
# Tune the vectorizer separately, using a fast LinearSVC as a stand-in for SVC.
# This keeps the later (much slower) SVC grid search small.

params_grid = dict(min_df=[.0001, .0005, .0007, .001, .005, .01], max_df=[.7, .75, .8, .85, .9])

results = []

for params in tqdm(ParameterGrid(params_grid)):
    tfidf = TfidfVectorizer(token_pattern=r'[A-Za-z]{2,}',
                            max_df=params['max_df'],
                            min_df=params['min_df'],
                            stop_words='english')
    
    tfidf.fit(X_train_text['CleanTweet'])
    X_train_tfidf = tfidf.transform(X_train_text['CleanTweet'])
    X_test_tfidf = tfidf.transform(X_test_text['CleanTweet'])
    
    clf = LinearSVC()
    clf.fit(X_train_tfidf, y_train_text)
    y_pred = clf.predict(X_test_tfidf)
    
    acc = accuracy_score(y_test_text, y_pred)
    
    results.append(dict(
        params=params,
        accuracy=acc
    ))

df = pd.DataFrame(results).sort_values(by='accuracy', ascending=False)

100%|███████████████████████████████████████████████████████████████████████████████████| 30/30 [00:48<00:00,  1.62s/it]


In [28]:
pd.set_option('display.max_colwidth', None)
display(df)

,params,accuracy
29,"{'max_df': 0.9, 'min_df': 0.01}",0.401548
5,"{'max_df': 0.7, 'min_df': 0.01}",0.401548
23,"{'max_df': 0.85, 'min_df': 0.01}",0.401548
17,"{'max_df': 0.8, 'min_df': 0.01}",0.401548
11,"{'max_df': 0.75, 'min_df': 0.01}",0.401548
16,"{'max_df': 0.8, 'min_df': 0.005}",0.398167
28,"{'max_df': 0.9, 'min_df': 0.005}",0.398167
4,"{'max_df': 0.7, 'min_df': 0.005}",0.398167
22,"{'max_df': 0.85, 'min_df': 0.005}",0.398167
10,"{'max_df': 0.75, 'min_df': 0.005}",0.398167
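The loop above scores each vectorizer setting on a single hold-out split. As an alternative, the same search can be expressed with `GridSearchCV` over a `Pipeline`, which scores every setting by cross-validation on the training data only. The sketch below uses a tiny hypothetical corpus in place of `X_train_text['CleanTweet']` and `y_train_text`, and a shortened grid; it is an illustration, not the notebook's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the cleaned tweets and their labels
texts = ['good news today', 'great help from store', 'good service and help',
         'bad panic buying', 'terrible empty shelf', 'bad news and panic'] * 5
labels = ['Positive', 'Positive', 'Positive',
          'Negative', 'Negative', 'Negative'] * 5

pipe = Pipeline([
    ('tf_idf_vec', TfidfVectorizer(token_pattern=r'[A-Za-z]{2,}',
                                   stop_words='english')),
    ('classifier', LinearSVC()),
])

# Step name + double underscore + parameter name addresses a pipeline step
param_grid = {
    'tf_idf_vec__min_df': [0.0001, 0.001, 0.01],
    'tf_idf_vec__max_df': [0.7, 0.8, 0.9],
}

search = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=3)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

Because the vectorizer is refit inside each fold, the vocabulary never sees the validation fold, which keeps the score honest.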


In [29]:
# Same search for the vectorizer parameters, this time with Extra Trees

params_grid = dict(min_df=[.0001, .0005, .0007, .001, .005, .01], max_df=[.7, .75, .8, .85, .9])

results = []

for params in tqdm(ParameterGrid(params_grid)):
    tfidf = TfidfVectorizer(token_pattern=r'[A-Za-z]{2,}',
                            max_df=params['max_df'],
                            min_df=params['min_df'],
                            stop_words='english')
    
    tfidf.fit(X_train_text['CleanTweet'])
    X_train_tfidf = tfidf.transform(X_train_text['CleanTweet'])
    X_test_tfidf = tfidf.transform(X_test_text['CleanTweet'])
    
    clf = ExtraTreesClassifier()
    clf.fit(X_train_tfidf, y_train_text)
    y_pred = clf.predict(X_test_tfidf)
    
    acc = accuracy_score(y_test_text, y_pred)
    
    results.append(dict(
        params=params,
        accuracy=acc
    ))

df = pd.DataFrame(results).sort_values(by='accuracy', ascending=False)

100%|███████████████████████████████████████████████████████████████████████████████████| 30/30 [42:05<00:00, 84.20s/it]


In [30]:
pd.set_option('display.max_colwidth', None)
display(df)

,params,accuracy
9,"{'max_df': 0.75, 'min_df': 0.001}",0.409022
12,"{'max_df': 0.8, 'min_df': 0.0001}",0.408844
25,"{'max_df': 0.9, 'min_df': 0.0005}",0.408488
6,"{'max_df': 0.75, 'min_df': 0.0001}",0.40831
24,"{'max_df': 0.9, 'min_df': 0.0001}",0.407688
21,"{'max_df': 0.85, 'min_df': 0.001}",0.407599
19,"{'max_df': 0.85, 'min_df': 0.0005}",0.40751
27,"{'max_df': 0.9, 'min_df': 0.001}",0.406976
7,"{'max_df': 0.75, 'min_df': 0.0005}",0.406976
20,"{'max_df': 0.85, 'min_df': 0.0007}",0.406976


## ExtraTreesClassifier

In [31]:
# Tune the ExtraTrees hyperparameters with the TF-IDF parameters held fixed

params_grid = dict(min_df=[0.0007],
                   max_df=[0.85],
                   max_depth=[6, 7, 8, 9],
                   n_estimators=[100, 200, 300],
                   min_samples_split=[2, 3, 4])

results = []

for params in tqdm(ParameterGrid(params_grid)):
    pipe = Pipeline(steps=[('tf_idf_vec', TfidfVectorizer(token_pattern=r'[A-Za-z]{2,}',
                                                          max_df=params['max_df'],
                                                          min_df=params['min_df'],
                                                          stop_words='english')
                            ),
                           ('classifier', ExtraTreesClassifier(max_depth=params['max_depth'],
                                                                n_estimators=params['n_estimators'],
                                                                min_samples_split=params['min_samples_split']))
                           ])

    pipe.fit(X_train_text['CleanTweet'], y_train_text)

    pipe_preds_train = pipe.predict(X_train_text.CleanTweet)
    pipe_preds_test = pipe.predict(X_test_text.CleanTweet)

    results.append(dict(

        params=params,

        precision_train=precision_score(y_true=y_train_text, y_pred=pipe_preds_train, average='macro'),
        precision_test=precision_score(y_true=y_test_text, y_pred=pipe_preds_test, average='macro'),

        recall_train=recall_score(y_true=y_train_text, y_pred=pipe_preds_train, average='macro'),
        recall_test=recall_score(y_true=y_test_text, y_pred=pipe_preds_test, average='macro'),

        f1_train=f1_score(y_true=y_train_text, y_pred=pipe_preds_train, average='macro'),
        f1_test=f1_score(y_true=y_test_text, y_pred=pipe_preds_test, average='macro'),

        accuracy_train=accuracy_score(y_true=y_train_text, y_pred=pipe_preds_train),
        accuracy_test=accuracy_score(y_true=y_test_text, y_pred=pipe_preds_test)
    ))

100%|███████████████████████████████████████████████████████████████████████████████████| 36/36 [02:03<00:00,  3.44s/it]


In [32]:
results = pd.DataFrame(results)
results.sort_values('accuracy_test', ascending=False).head(10).style.bar(vmin=0, vmax=1)

,params,precision_train,precision_test,recall_train,recall_test,f1_train,f1_test,accuracy_train,accuracy_test
0,"{'max_depth': 6, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 2, 'n_estimators': 100}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
1,"{'max_depth': 6, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 2, 'n_estimators': 200}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
20,"{'max_depth': 8, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 2, 'n_estimators': 300}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
21,"{'max_depth': 8, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 3, 'n_estimators': 100}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
22,"{'max_depth': 8, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 3, 'n_estimators': 200}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
23,"{'max_depth': 8, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 3, 'n_estimators': 300}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
24,"{'max_depth': 8, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 4, 'n_estimators': 100}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
25,"{'max_depth': 8, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 4, 'n_estimators': 200}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
26,"{'max_depth': 8, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 4, 'n_estimators': 300}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
27,"{'max_depth': 9, 'max_df': 0.85, 'min_df': 0.0007, 'min_samples_split': 2, 'n_estimators': 100}",0.066665,0.066673,0.166667,0.166667,0.095236,0.095244,0.399988,0.400036
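Every row of this table shows the same numbers: accuracy 0.40, macro precision 0.0667, macro recall 0.1667. That pattern appears consistent with a model that always predicts one class covering 40% of the samples in a six-class problem: macro recall is then 1/6, macro precision is 0.4/6, and macro F1 is (2·0.4·1/1.4)/6. The following sketch reproduces the arithmetic on hypothetical labels:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: six classes, one of which covers 40% of the samples
y_true = ['a'] * 40 + ['b'] * 12 + ['c'] * 12 + ['d'] * 12 + ['e'] * 12 + ['f'] * 12
y_pred = ['a'] * len(y_true)   # a model that always predicts the majority class

# zero_division=0 scores the never-predicted classes as 0 without warnings
kwargs = dict(average='macro', zero_division=0)
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, **kwargs)
rec = recall_score(y_true, y_pred, **kwargs)
f1 = f1_score(y_true, y_pred, **kwargs)

print(round(acc, 6), round(prec, 6), round(rec, 6), round(f1, 6))
# 0.4 0.066667 0.166667 0.095238
```

If that reading is right, the shallow trees (`max_depth` 6 to 9) never split away from the majority class, and the grid over `n_estimators` and `min_samples_split` has no visible effect.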


## SVC

In [None]:
from sklearn.svm import SVC

params_grid = dict(min_df=[0.0005],
                   max_df=[0.85],
                   C=[1, 10, 100],
                   gamma=[0.1, 0.01, 0.001],
                   kernel=['linear', 'rbf'])

results = []

for params in tqdm(ParameterGrid(params_grid)):
    pipe = Pipeline(steps=[('tf_idf_vec', TfidfVectorizer(token_pattern=r'[A-Za-z]{2,}',
                                                          max_df=params['max_df'],
                                                          min_df=params['min_df'],
                                                          stop_words='english')
                            ),
                           ('classifier', SVC(C=params['C'],
                                               gamma=params['gamma'],
                                               kernel=params['kernel']))
                           ])

    pipe.fit(X_train_text['CleanTweet'], y_train_text)

    pipe_preds_train = pipe.predict(X_train_text.CleanTweet)
    pipe_preds_test = pipe.predict(X_test_text.CleanTweet)

    results.append(dict(

        params=params,

        precision_train=precision_score(y_true=y_train_text, y_pred=pipe_preds_train, average='macro'),
        precision_test=precision_score(y_true=y_test_text, y_pred=pipe_preds_test, average='macro'),

        recall_train=recall_score(y_true=y_train_text, y_pred=pipe_preds_train, average='macro'),
        recall_test=recall_score(y_true=y_test_text, y_pred=pipe_preds_test, average='macro'),

        f1_train=f1_score(y_true=y_train_text, y_pred=pipe_preds_train, average='macro'),
        f1_test=f1_score(y_true=y_test_text, y_pred=pipe_preds_test, average='macro'),

        accuracy_train=accuracy_score(y_true=y_train_text, y_pred=pipe_preds_train),
        accuracy_test=accuracy_score(y_true=y_test_text, y_pred=pipe_preds_test)
    ))

 72%|███████████████████████████████████████████████████████▌                     | 13/18 [2:04:18<1:27:03, 1044.77s/it]

In [None]:
results = pd.DataFrame(results)
results.sort_values('accuracy_test', ascending=False).head(10).style.bar(vmin=0, vmax=1)

## Results

| Model                  | Accuracy train | Accuracy test |
|------------------------|----------------|---------------|
| SVC                    | 98%            | 63%           |
| DecisionTreeClassifier | 99%            | 33%           |
| ExtraTreesClassifier   | 40%            | 40%           |

In [None]:
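# Save predictions for the unlabelled tweets from corona_test.csv (rows with type == 'test')
pd.Series(pipe.predict(train.loc[train['type'] == 'test', 'CleanTweet']), name='Sentiment').to_csv('result_tweet.csv')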
