1. For a high-dimensional or very sparse feature space, a Support Vector Machine (SVM) is a good candidate. In such cases, SVM often performs better than Random Forest.
2. SVM is also fairly robust to overfitting because its objective includes a regularization term (controlled by C in the classic formulation, or by alpha in scikit-learn's SGDClassifier used below). This parameter can be tuned by cross-validation to get the best generalization performance.
3. Because SVM uses hinge loss to find the hyperplane separating the two classes, it is relatively robust to outliers. The soft-margin hyperplane depends only on a few points, known as support vectors, that lie on the margin or violate it. Points that are correctly classified beyond the margin have no direct impact on the hyperplane.
4. We can account for correlation between input features by adding a regularization term to the hinge loss. The regularization term shrinks the weights of variables that are artificially boosted only because they are correlated with another variable that strongly impacts the outcome variable.
5. Although a linear SVM cannot model non-linear decision boundaries accurately, we can use a kernel to map the features into a higher-dimensional space. However, feature importances are then no longer available, because the model weights refer to the kernel-transformed features rather than the original ones.
6. SVM is sensitive to variable scale and gives more weight to variables of higher magnitude. However, we can overcome this by first re-scaling all the variables so that they have the same range.
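
Points 2 to 4 can be made concrete with a small sketch of the kind of objective a linear SVM trained by gradient descent minimizes: the mean hinge loss plus an elastic-net penalty. The function name `svm_objective` and the toy data are illustrative assumptions. `alpha` and `l1_ratio` mirror the scikit-learn `SGDClassifier` parameter names, but the exact scaling used inside the library may differ.

```python
import numpy as np

def svm_objective(w, b, X, y, alpha=0.01, l1_ratio=0.0):
    """Mean hinge loss plus elastic-net penalty (illustrative; labels in {-1, +1})."""
    margins = y * (X @ w + b)
    # Hinge loss is zero for points correctly classified beyond the margin
    hinge = np.maximum(0.0, 1.0 - margins)
    # Elastic net mixes an L1 term and an L2 term, weighted by l1_ratio
    penalty = alpha * (l1_ratio * np.abs(w).sum()
                       + (1.0 - l1_ratio) * 0.5 * np.dot(w, w))
    return hinge.mean() + penalty, hinge

# Toy data: three points well beyond the margin, one on the wrong side
X_toy = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [0.5, 0.0]])
y_toy = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0

total, per_point = svm_objective(w, b, X_toy, y_toy)
print(per_point)  # per-point hinge losses: 0, 0, 0 and 1.5
print(total)      # about 0.38 (0.375 mean hinge + 0.005 penalty)
```

Only the last point contributes to the loss, and its contribution grows linearly with the size of the violation rather than quadratically, which is the sense in which hinge loss limits the influence of outliers.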


In [1]:
import pandas as pd
import numpy as np
import json
import nltk
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from stemmer_util import Stem
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.kernel_approximation import Nystroem
from sklearn.preprocessing import RobustScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV, train_test_split, GridSearchCV
import joblib

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dgxuser_layersvanguard/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /home/dgxuser_layersvanguard/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
tweet_data = pd.read_csv("train_set.csv", engine ='python')
tweet_data[['has_hashtag','has_mentions','has_url','has_media']] = tweet_data[['has_hashtag','has_mentions','has_url','has_media']].astype(bool)
all_stopwords = stopwords.words('english') + stopwords.words('german') + stopwords.words('italian') + stopwords.words('french') + stopwords.words('portuguese') + stopwords.words('spanish')
all_stopwords.extend(['http', 'https'])

In [3]:
# Preprocessing
# 1. Remove rows with rare/undefined authors
tweet_data = tweet_data[tweet_data['author'].isin(['Barack Obama', 'Kim Kardashian West', 'KATY PERRY', 'Snoop Dogg', 'Cristiano Ronaldo', 'Elon Musk', 'Ellen DeGeneres', 'Sebastian Ruder', 'Donald J. Trump'])].reset_index(drop=True)
# 2. Convert to lower case, remove punctuation and collapse repeated whitespace
#    (regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default)
tweet_data['tweet'] = tweet_data['tweet'].str.lower().str.replace(r'[^\w\s]', '', regex=True).apply(lambda x: " ".join(x.split()))
tweet_data['tweet'] = tweet_data['tweet'].apply(lambda x: " ".join([word for word in x.split() if word != ""]))
# 3. Group all languages other than en, und, es and pt under 'others'
tweet_data.loc[~tweet_data['lang'].isin(['en', 'und', 'es', 'pt']),'lang'] = 'others'
source_map = {}
with open("source_mapping.txt" ,'r') as inp:
    for line in inp:
        source_map[int(line.strip().split(":")[1].strip())] = line.strip().split(":")[0]

tweet_data['source_info'] = tweet_data['source'].apply(lambda x:source_map[x])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
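
To see what the text-cleaning step in this cell does to a tweet, here is a minimal sketch on two made-up strings (the sample tweets are assumptions, not rows from `train_set.csv`). It passes `regex=True` explicitly so that the pattern is treated as a regular expression regardless of the pandas version.

```python
import pandas as pd

# Made-up sample tweets, for illustration only
tweets = pd.Series(["Hello, World!!  https://t.co/abc", "  It's   #MAGA time "])

cleaned = (
    tweets.str.lower()
    # Drop every character that is neither a word character nor whitespace
    .str.replace(r"[^\w\s]", "", regex=True)
    # Collapse runs of whitespace and trim the ends
    .apply(lambda x: " ".join(x.split()))
)
print(cleaned.tolist())
```

In this toy input the URL collapses into the single token `httpstcoabc`, because the colon, slashes and dot are removed before tokenization. A stop word such as `http` would therefore not match it, which may be worth checking against the real data.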
  


In [4]:
source_encoder = OneHotEncoder().fit(tweet_data[['source_info']])
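source_enc_df = pd.DataFrame(source_encoder.transform(tweet_data[['source_info']]).toarray())
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
source_enc_df.columns = source_encoder.get_feature_names_out()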

In [5]:
tweet_data = pd.concat([tweet_data[['author','lang','has_hashtag','has_mentions','has_url','has_media','tweet']],source_enc_df], axis =1)

In [6]:
X = tweet_data.drop(columns = ['author'])
output_enc = LabelEncoder().fit(tweet_data['author'])
Y = output_enc.transform(tweet_data['author'])
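
As a quick illustration of what `LabelEncoder` produces here, the sketch below encodes a short, made-up list of author names and maps integer codes back to names. The list is an assumption for demonstration; only the encoder calls match the cell above.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up subset of author labels, for illustration only
authors = ["Elon Musk", "Barack Obama", "Snoop Dogg", "Barack Obama"]

enc = LabelEncoder().fit(authors)
codes = enc.transform(authors)

print(list(enc.classes_))  # classes are stored in sorted order
print(codes.tolist())      # integer code of each author
print(enc.inverse_transform([2, 0]).tolist())  # codes back to names
```

Keeping the fitted encoder around matters because the classifier predicts integer codes, and `inverse_transform` is what turns them back into author names.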

In [17]:
text_feature_ext_ppl = Pipeline([
    ('stem', Stem(do_stem=False, str_col='tweet', lang_col='lang')),
    ('tfidf', ColumnTransformer(
        transformers=[('tfidf', TfidfVectorizer(stop_words=all_stopwords, analyzer='word'), 'tweet')],
        remainder='passthrough', sparse_threshold=0)),
])
Feature_ext_pipeline = Pipeline([('text_feat_ext', text_feature_ext_ppl),
                                 ('feature_scaling', RobustScaler()),
                                 ('Kernel_trans', Nystroem(n_components=1000))])
model_pipeline = Pipeline([
    ('feature_ext', Feature_ext_pipeline),
    ('svm', SGDClassifier(loss='hinge', penalty='elasticnet', random_state=100, verbose=1,
                          early_stopping=True, max_iter=1000, tol=0.001)),
])

hyperparameters = {'feature_ext__text_feat_ext__stem__do_stem':(True,False),
                    'feature_ext__Kernel_trans__kernel': ('linear','rbf', 'poly', 'sigmoid'), 
                   'feature_ext__Kernel_trans__gamma':(0.001, 0.01, 0.1), 
                   'svm__alpha': (0.01, 0.1, 0.5, 1, 2, 5), 
                   'svm__l1_ratio': list(np.arange(0, 1.1, 0.2))}
opt_svm = RandomizedSearchCV(model_pipeline, hyperparameters,scoring='f1_micro', n_jobs=12, cv=5, verbose=True, error_score=np.nan, n_iter=100)
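
The hyperparameter keys use scikit-learn's double-underscore convention for nested pipelines: `feature_ext__Kernel_trans__gamma` means the `gamma` parameter of the `Kernel_trans` step inside the `feature_ext` step. The sketch below recomputes the size of the search space defined above, to show how much of it a randomized search with `n_iter=100` and `cv=5` covers. The variable names are illustrative.

```python
import math
import numpy as np

# Same value lists as the search space in this cell
search_space = {
    'feature_ext__text_feat_ext__stem__do_stem': (True, False),
    'feature_ext__Kernel_trans__kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
    'feature_ext__Kernel_trans__gamma': (0.001, 0.01, 0.1),
    'svm__alpha': (0.01, 0.1, 0.5, 1, 2, 5),
    'svm__l1_ratio': list(np.arange(0, 1.1, 0.2)),
}

# Number of distinct combinations a full grid search would have to try
n_combinations = math.prod(len(values) for values in search_space.values())

n_iter, cv = 100, 5
n_fits = n_iter * cv                   # model fits during the search
coverage = n_iter / n_combinations     # share of the grid that is sampled

print(n_combinations, n_fits, round(coverage, 3))
```

The full grid has 864 combinations, so the randomized search samples roughly 12% of it at the cost of 500 fits, plus one final refit on the whole training set.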

In [18]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=100)
print('\nTraining support vector machine classifier ...')
opt_svm.fit(X_train, Y_train)
joblib.dump(opt_svm.best_estimator_, 'svm_model.pkl', compress=1)


Training support vector machine classifier ...
Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:  5.6min
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed: 28.4min
[Parallel(n_jobs=12)]: Done 426 tasks      | elapsed: 68.3min
[Parallel(n_jobs=12)]: Done 500 out of 500 | elapsed: 79.5min finished
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[self.str_col] = l
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


-- Epoch 1
Norm: 1.38, NNZs: 1000, Bias: -1.109418, T: 14823, Avg. loss: 0.217448
Total training time: 0.09 seconds.
-- Epoch 2
Norm: 1.36, NNZs: 1000, Bias: -1.074434, T: 29646, Avg. loss: 0.211259
Total training time: 0.22 seconds.
-- Epoch 3
Norm: 1.35, NNZs: 1000, Bias: -1.058904, T: 44469, Avg. loss: 0.210916
Total training time: 0.35 seconds.
-- Epoch 4
Norm: 1.35, NNZs: 1000, Bias: -1.048198, T: 59292, Avg. loss: 0.210831
Total training time: 0.48 seconds.
-- Epoch 5
Norm: 1.34, NNZs: 1000, Bias: -1.055650, T: 74115, Avg. loss: 0.210869
Total training time: 0.61 seconds.
-- Epoch 6
Norm: 1.34, NNZs: 1000, Bias: -1.035724, T: 88938, Avg. loss: 0.210843
Total training time: 0.76 seconds.
-- Epoch 7
Norm: 1.34, NNZs: 1000, Bias: -1.047302, T: 103761, Avg. loss: 0.210829
Total training time: 0.89 seconds.
Convergence after 7 epochs took 0.90 seconds
-- Epoch 1
Norm: 2.08, NNZs: 1000, Bias: -0.725183, T: 14823, Avg. loss: 0.186082
Total training time: 0.11 seconds.
-- Epoch 2
Norm: 2

[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    7.4s finished


['svm_model.pkl']

In [19]:
opt_svm.best_score_

0.7338352255479327

In [20]:
opt_svm.best_params_

{'svm__l1_ratio': 0.0,
 'svm__alpha': 0.01,
 'feature_ext__text_feat_ext__stem__do_stem': True,
 'feature_ext__Kernel_trans__kernel': 'linear',
 'feature_ext__Kernel_trans__gamma': 0.001}
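
One detail worth checking in the selected parameters: a linear kernel has no `gamma`, so the reported `gamma` value is presumably just the value that happened to be sampled alongside it. The sketch below compares Nystroem features for two `gamma` values under both a linear and an RBF kernel, on made-up data. It is a sanity check under the assumption that `Nystroem` forwards `gamma` only to kernels that accept it; the helper `nystroem_features` is hypothetical.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem

# Made-up data, for illustration only
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(50, 20))

def nystroem_features(kernel, gamma):
    """Fit a small Nystroem map with a fixed seed and return the transformed data."""
    approx = Nystroem(kernel=kernel, gamma=gamma, n_components=10, random_state=0)
    return approx.fit_transform(X_toy)

# Same kernel, two different gamma values
linear_same = np.allclose(nystroem_features("linear", 0.001),
                          nystroem_features("linear", 0.1))
rbf_same = np.allclose(nystroem_features("rbf", 0.001),
                       nystroem_features("rbf", 0.1))

print("linear kernel unaffected by gamma:", linear_same)
print("rbf kernel unaffected by gamma:", rbf_same)
```

If the linear features come out identical, the `gamma` entry in `best_params_` carries no information for this model. Similarly, `l1_ratio` of 0.0 means the elastic-net penalty reduces to a pure L2 penalty.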

In [21]:
#X_test
metrics.f1_score(Y_test, opt_svm.predict(X_test), average='micro')


0.754735308402137
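
The search optimizes `f1_micro`, and the same average is used for the held-out score above. For a single-label multiclass problem like this one, micro-averaged F1 equals plain accuracy, so it says little about how the rarer authors fare. The sketch below shows this on made-up labels (not the notebook's predictions) and contrasts it with the macro average.

```python
from sklearn import metrics

# Made-up, imbalanced labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

micro = metrics.f1_score(y_true, y_pred, average="micro")
accuracy = metrics.accuracy_score(y_true, y_pred)
macro = metrics.f1_score(y_true, y_pred, average="macro")

print(micro, accuracy)  # the two values coincide
print(macro)            # lower here, since each class counts equally
```

Reporting the macro average or a per-class breakdown (for example via `metrics.classification_report`) alongside the micro score would show whether some authors are classified much worse than others.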

In [22]:
joblib.dump(output_enc, "output_encoder.pkl")
joblib.dump(source_encoder, "source_encoder.pkl")

['source_encoder.pkl']
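
For later inference, the saved artifacts can be reloaded with `joblib.load`. The sketch below is a round trip on a toy encoder written to a temporary directory, assuming nothing about the real files. The file name `output_encoder_demo.pkl` and the `predicted_ids` list are hypothetical stand-ins.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Toy encoder standing in for the fitted output_enc
enc = LabelEncoder().fit(["Barack Obama", "Elon Musk", "Snoop Dogg"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output_encoder_demo.pkl")  # hypothetical file name
    joblib.dump(enc, path)
    restored = joblib.load(path)

# Stand-in for the integer codes returned by model.predict(...)
predicted_ids = [2, 0]
names = restored.inverse_transform(predicted_ids).tolist()
print(names)
```

Note that loading `svm_model.pkl` in a fresh session requires the `stemmer_util` module to be importable, since the pickle stores only a reference to the `Stem` class rather than its code.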