## Steps in text-cleaning for Natural Language Processing (NLP) problems

When working with text data for supervised machine learning or deep learning problems, the text needs to be prepared with the following key steps:

1. Tokenization

2. Removing stopwords and irrelevant special characters, and converting all characters to lower case.

3. Stemming/Lemmatization

4. Word Embeddings

Note that the steps above may be customized depending on the given problem statement.

In NLP, the following terms are frequently used:

1. Corpus: The full collection of documents being analyzed

2. Document: A single unit of text in the corpus, such as a sentence, a paragraph or a whole post

3. Vocabulary: The set of unique words in the corpus

4. Word: A sequence of characters that carries meaning on its own

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import make_scorer, f1_score, ConfusionMatrixDisplay
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim
from tqdm import tqdm
import pandas as pd
import numpy as np

In [2]:
train_data = fetch_20newsgroups(subset='train')

## Tokenization

Tokenization is the process of splitting text into individual words (tokens). It is the most fundamental step, because every later step processes the text one word at a time and treats each word according to its importance.

In [3]:
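def text_tokenization(text):
    # Replace every non-letter character with a space, lower-case the text, then split on whitespace
    review = re.sub('[^a-zA-Z]',' ', text).lower().split()
    return review
# Tokenize every post in the training set
corpus = pd.Series(train_data.data).map(lambda x: text_tokenization(str(x)))
corpus[2]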

['from',
 'twillis',
 'ec',
 'ecn',
 'purdue',
 'edu',
 'thomas',
 'e',
 'willis',
 'subject',
 'pb',
 'questions',
 'organization',
 'purdue',
 'university',
 'engineering',
 'computer',
 'network',
 'distribution',
 'usa',
 'lines',
 'well',
 'folks',
 'my',
 'mac',
 'plus',
 'finally',
 'gave',
 'up',
 'the',
 'ghost',
 'this',
 'weekend',
 'after',
 'starting',
 'life',
 'as',
 'a',
 'k',
 'way',
 'back',
 'in',
 'sooo',
 'i',
 'm',
 'in',
 'the',
 'market',
 'for',
 'a',
 'new',
 'machine',
 'a',
 'bit',
 'sooner',
 'than',
 'i',
 'intended',
 'to',
 'be',
 'i',
 'm',
 'looking',
 'into',
 'picking',
 'up',
 'a',
 'powerbook',
 'or',
 'maybe',
 'and',
 'have',
 'a',
 'bunch',
 'of',
 'questions',
 'that',
 'hopefully',
 'somebody',
 'can',
 'answer',
 'does',
 'anybody',
 'know',
 'any',
 'dirt',
 'on',
 'when',
 'the',
 'next',
 'round',
 'of',
 'powerbook',
 'introductions',
 'are',
 'expected',
 'i',
 'd',
 'heard',
 'the',
 'c',
 'was',
 'supposed',
 'to',
 'make',
 'an',
 'ap

## Removing stopwords and irrelevant special characters

Stopwords are very common words (such as "the", "is" or "and") that carry little information and can usually be removed. While the nltk library ships its own list of stopwords, the list can also be customized depending on the problem statement.
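As a minimal sketch of what customizing a stopword list could look like: the snippet below starts from a small hand-written base set and adjusts it for newsgroup posts. The base set is only a stand-in for `stopwords.words('english')` so the example runs without downloading NLTK data. The names `BASE_STOPWORDS`, `DOMAIN_STOPWORDS`, `KEEP` and `remove_stopwords` are hypothetical, and the choice of header words such as "subject" is an assumption about what is uninformative in this dataset.

```python
# Stand-in for stopwords.words('english'); only a tiny subset, for illustration.
BASE_STOPWORDS = {"i", "my", "the", "a", "in", "for", "from", "this", "up", "after", "as"}

# Hypothetical domain-specific additions (words from newsgroup headers).
DOMAIN_STOPWORDS = {"subject", "organization", "lines", "edu"}

# Words the base list would remove but that we assume matter for the task.
KEEP = {"after"}

CUSTOM_STOPWORDS = (BASE_STOPWORDS | DOMAIN_STOPWORDS) - KEEP


def remove_stopwords(tokens, stop_words=CUSTOM_STOPWORDS):
    """Return the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]


tokens = ["from", "twillis", "subject", "pb", "questions", "after", "the", "weekend"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a `set` keeps each membership test cheap, which matters when the filter runs over every token of every document.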

In [4]:
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...


True

In [5]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

## Stemming vs Lemmatization

<b>Stemming</b>: Process of reducing inflected or derived words to their word stem by stripping suffixes with heuristic rules.

Note that stemming is fast, but the stem is not always a valid word and part of the original meaning may be lost (for example, "university" becomes "univers").

<b>Lemmatization</b>: Process of reducing words to their dictionary form (lemma), which ensures that the base word belongs to the given language.

Note that lemmatization retains the meaning of the original word, but it is a slower process than stemming.

In cases where the grammar of a sentence matters, such as chatbots, language translation or text summarization, lemmatization is a more suitable alternative than stemming.
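To make the trade-off concrete, the sketch below runs NLTK's `PorterStemmer` on a few words that occur in the first post of the corpus. The stemmer needs no downloaded data. The word list is chosen only for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the first newsgroup post.
words = ["organization", "university", "college", "posting", "wondering"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>12} -> {stem}")
```

Several of the stems, such as "univers" and "colleg", are not dictionary words, which is the loss of meaning that lemmatization avoids.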

In [6]:
# Stemming
ps = PorterStemmer()
tqdm.pandas()
def text_stemming(text):
    review = re.sub('[^a-zA-Z]',' ', text).lower().split()
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    return review
corpus = pd.Series(train_data.data).progress_map(lambda x: text_stemming(str(x)))
corpus[0]

100%|██████████| 11314/11314 [07:18<00:00, 25.80it/s]


'lerxst wam umd edu thing subject car nntp post host rac wam umd edu organ univers maryland colleg park line wonder anyon could enlighten car saw day door sport car look late earli call bricklin door realli small addit front bumper separ rest bodi know anyon tellm model name engin spec year product car made histori whatev info funki look car pleas e mail thank il brought neighborhood lerxst'

In [7]:
# Lemmatization
wnl = WordNetLemmatizer()
tqdm.pandas()
def text_lemmatize(text):
    review = re.sub('[^a-zA-Z]',' ', text).lower().split()
    review = [wnl.lemmatize(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    return review
corpus = pd.Series(train_data.data).progress_map(lambda x: text_lemmatize(str(x)))
corpus[0]

100%|██████████| 11314/11314 [06:44<00:00, 27.99it/s]


'lerxst wam umd edu thing subject car nntp posting host rac wam umd edu organization university maryland college park line wondering anyone could enlighten car saw day door sport car looked late early called bricklin door really small addition front bumper separate rest body know anyone tellme model name engine spec year production car made history whatever info funky looking car please e mail thanks il brought neighborhood lerxst'
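Both cells above call `stopwords.words('english')` once per token, which re-reads the stopword list every time and is a likely reason for the multi-minute runtimes. A minimal sketch of a faster variant follows, assuming the stopword list is converted to a `set` once up front. The name `make_cleaner` and its parameters are hypothetical, and a tiny hand-written stopword list stands in for the NLTK list so the snippet runs without downloads.

```python
import re


def make_cleaner(stop_words, normalize=lambda word: word):
    """Build a text-cleaning function with the stopword set fixed up front.

    stop_words: iterable of words to drop (converted to a set for fast lookups).
    normalize:  callable applied to each kept word, e.g. a stemmer's stem method.
    """
    stop_set = set(stop_words)

    def clean(text):
        # Same cleaning as in the cells above: letters only, lower case, split on whitespace.
        tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
        return ' '.join(normalize(token) for token in tokens if token not in stop_set)

    return clean


# Stand-in stopword list; in the notebook this would be stopwords.words('english').
clean = make_cleaner(["i", "was", "if", "on", "this", "the"])
cleaned = clean("I was wondering if anyone could enlighten me on this car!")
print(cleaned)
```

With NLTK loaded, the same helper could be called as `make_cleaner(stopwords.words('english'), ps.stem)` or with `wnl.lemmatize` as the second argument.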

## Word Embeddings

Word embedding is a general process of converting words into vectors of numbers for machine learning/deep learning models to process.

Word embeddings can be generally classified into two main domains:

1. <b>Count/Frequency-based</b>

- One Hot Encoding
- Bag of Words
- Term Frequency - Inverse Document Frequency (TFIDF)

2. <b>Deep Learning</b>

- Word2Vec
- AverageWord2Vec
- Doc2Vec
- RNN and LSTM

### One Hot Encoding

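In NLP, this method represents every word as a binary vector with one position per vocabulary word: the position of the word itself is 1 and all other positions are 0.

<b>Advantages</b>: Simple to implement and intuitive

<b>Disadvantages</b>:

1. Sparse matrix

2. Out of vocabulary: words not seen when the vocabulary was built cannot be represented, and sentences of different lengths do not produce inputs of a fixed size

3. Semantic meaning between words is not captured.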
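The idea can be sketched in a few lines of plain Python. The helper names `build_vocabulary` and `one_hot` are hypothetical, and the two toy sentences are made up for illustration.

```python
def build_vocabulary(sentences):
    """Map each unique word to an index, in sorted order."""
    words = sorted({word for sentence in sentences for word in sentence.split()})
    return {word: index for index, word in enumerate(words)}


def one_hot(word, vocabulary):
    """Return a binary vector with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    if word in vocabulary:  # out-of-vocabulary words stay all zeros
        vector[vocabulary[word]] = 1
    return vector


sentences = ["the car is fast", "the bike is slow"]
vocabulary = build_vocabulary(sentences)

# Each sentence becomes a list of vectors, one per word, so its length varies.
encoded = [one_hot(word, vocabulary) for word in "the car is slow".split()]
for row in encoded:
    print(row)
```

Every vector has as many entries as the vocabulary has words and contains at most a single 1, which is why the representation is sparse and says nothing about how similar two words are.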

### Bag of Words

Bag of words converts each sentence into a vector of features, where each feature is the number of times a vocabulary word occurs in that sentence. The order of the features carries no meaning; in sklearn's CountVectorizer the columns follow the alphabetical order of the vocabulary.

<img src="https://user.oc-static.com/upload/2020/10/23/16034397439042_surfin%20bird%20bow.png" width="600">

<b>Note that Bag of Words in sklearn has an option (binary=True) to record only whether a word is present, so words with more than 1 occurrence in a given sentence are still encoded as 1.</b>

<b>Advantages</b>: Simple to implement and intuitive

<b>Disadvantages</b>:

1. Sparse matrix
2. Out of vocabulary: words not seen when the vocabulary was built are ignored
3. Word order is discarded, so the meaning of a sentence may be distorted
4. Semantic information (similarity) between words is not captured (using unigrams)

Some of the local context between words can be recovered for bag of words by using the concept of n-grams.

N-grams refer to bundling n sequential words per feature (e.g. A B C D E -> AB, BC, CD, DE, giving 4 bi-grams)
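The bundling can be sketched in plain Python. The helpers `ngrams` and `ngrams_in_range` are hypothetical names; the second one only mimics the idea behind an `ngram_range=(low, high)` setting and is not sklearn's implementation.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_in_range(tokens, low, high):
    """Collect n-grams for every n from low to high (inclusive)."""
    features = []
    for n in range(low, high + 1):
        features.extend(ngrams(tokens, n))
    return features


tokens = "A B C D E".split()
bigrams = ngrams(tokens, 2)
unigrams_and_bigrams = ngrams_in_range(tokens, 1, 2)

print(bigrams)
print(unigrams_and_bigrams)
```

Five tokens give four bi-grams, and combining uni-grams with bi-grams gives nine features, which shows how quickly the vocabulary grows as the upper bound of the range increases.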

<b>Important hyperparameters for Bag of Words (CountVectorizer)</b>:
1. max_features: Building vocabulary that contains top max_features sorted by frequency in descending order.
2. ngram_range: The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted.
3. binary: Indicator of whether bag-of-words should be strictly binary or not.

In [8]:
# Bag of words model
cv = CountVectorizer(max_features=300, ngram_range=(1,3))
X = cv.fit_transform(corpus).toarray()
data = pd.DataFrame(X, columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get))
data

Unnamed: 0,able,ac,access,actually,address,also,always,american,another,answer,...,word,work,world,would,writes,writes article,wrong,year,yes,yet
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,1,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,2,0,1,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11309,1,0,0,0,0,3,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
11310,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11311,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
11312,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0


In [9]:
def nested_cv(X, y, pipeline, search_space=None):
    """Nested cross-validation: Bayesian search on inner folds, macro F1 on outer folds."""
    num_folds = 10
    skfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=120)
    macro_f1 = make_scorer(f1_score, average='macro')
    val_f1, test_f1 = [], []
    for fold, (outer_train_idx, outer_test_idx) in enumerate(skfold.split(X, y)):
        # Outer split: the test part is never seen by the hyperparameter search
        X_train = X.iloc[outer_train_idx].reset_index(drop=True)
        y_train = y.iloc[outer_train_idx].reset_index(drop=True)
        X_test = X.iloc[outer_test_idx].reset_index(drop=True)
        y_test = y.iloc[outer_test_idx].reset_index(drop=True)
        # Inner loop: 3-fold Bayesian search over the given search space
        search = BayesSearchCV(estimator=pipeline, search_spaces=search_space, cv=3, n_iter=10,
                               scoring=macro_f1, refit=True, n_jobs=3)
        search.fit(X_train, y_train)
        val_f1.append(search.best_score_)
        print(f'Validation F1 score for fold {fold+1}:', search.best_score_)
        print(f'Best hyperparameters for fold {fold+1}:', search.best_params_)
        # Evaluate the refitted best pipeline on the held-out outer fold
        y_pred = search.best_estimator_.predict(X_test)
        fold_f1 = f1_score(y_test, y_pred, average='macro')
        test_f1.append(fold_f1)
        print(f'Test F1 score for fold {fold+1}:', fold_f1)
        print()
    print('----------------------')
    print('Average validation F1 score:', np.mean(val_f1))
    print('Average test F1 score:', np.mean(test_f1))

In [10]:
clf = DecisionTreeClassifier(random_state=120)
pipeline = Pipeline(steps=[])
pipeline.steps.append(('text',CountVectorizer(ngram_range=(1,2))))
pipeline.steps.append(('classification',clf))
search_space = dict()
search_space['text__max_features'] = Integer(1,100)
search_space['classification__ccp_alpha'] = Real(0,0.02)
search_space['classification__class_weight'] = Categorical(['balanced',None])
nested_cv(corpus, pd.Series(train_data.target), pipeline, search_space)

Validation F1 score for fold 1: 0.18392029731265216
Best hyperparameters for fold 1: OrderedDict([('classification__ccp_alpha', 0.002248774476935585), ('classification__class_weight', None), ('text__max_features', 94)])
Test F1 score for fold 1: 0.15752156904390618

Validation F1 score for fold 2: 0.14315626710130913
Best hyperparameters for fold 2: OrderedDict([('classification__ccp_alpha', 0.004836949011754973), ('classification__class_weight', None), ('text__max_features', 96)])
Test F1 score for fold 2: 0.11996673687303158

Validation F1 score for fold 3: 0.1503670391660286
Best hyperparameters for fold 3: OrderedDict([('classification__ccp_alpha', 0.0044015787973292544), ('classification__class_weight', None), ('text__max_features', 94)])
Test F1 score for fold 3: 0.13797332912150556

Validation F1 score for fold 4: 0.22948909596900116
Best hyperparameters for fold 4: OrderedDict([('classification__ccp_alpha', 0.0008963686954806895), ('classification__class_weight', None), ('text_

### TFIDF (Term Frequency - Inverse Document Frequency)

Unlike Bag of Words, which counts every occurrence equally, TFIDF captures the importance of words by giving more weight to words that are frequent within a sentence but rare across the corpus. Like Bag of Words, it does not capture semantic similarity between words.

![image.png](https://blog.expertrec.com/wp-content/uploads/2019/02/TF-IDF-calucation.png)

TFIDF consists of two components, each with its own formula:

1. Term Frequency (TF): Num. of times the word occurs in the sentence / Num. of words in the sentence

2. Inverse Document Frequency (IDF): log(Num. of sentences / Num. of sentences that contain the specified word)

Note that the TF component rewards words that are frequent within a sentence, while the IDF component rewards words that are rare across sentences.

TFIDF is simply the product of the TF and IDF components.
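The two formulas can be worked through by hand on a toy corpus. The sketch below uses plain Python with hypothetical helper names and three made-up sentences, and it assumes the word occurs in at least one sentence (otherwise the IDF would divide by zero).

```python
import math

# Toy corpus: each sentence is already tokenized.
sentences = [
    ["good", "boy"],
    ["good", "girl"],
    ["boy", "girl", "good"],
]


def term_frequency(word, sentence):
    """TF: occurrences of the word in the sentence / number of words in the sentence."""
    return sentence.count(word) / len(sentence)


def inverse_document_frequency(word, sentences):
    """IDF: log(number of sentences / number of sentences containing the word)."""
    containing = sum(1 for sentence in sentences if word in sentence)
    return math.log(len(sentences) / containing)


def tfidf(word, sentence, sentences):
    """TFIDF: product of the TF and IDF components."""
    return term_frequency(word, sentence) * inverse_document_frequency(word, sentences)


# Scores for the first sentence, ["good", "boy"].
for word in ["good", "boy", "girl"]:
    print(word, round(tfidf(word, sentences[0], sentences), 4))
```

"good" appears in every sentence, so its IDF is log(1) = 0 and its score vanishes, while "boy" keeps a positive weight. sklearn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its values differ from this textbook version.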

<b>Advantages</b>: Intuitive and captures word importance

<b>Disadvantages</b>:

1. Sparse matrix

2. Out of vocabulary: words not seen when the vocabulary was built are ignored

<b>Important hyperparameters for TFIDF (TfidfVectorizer)</b>:
1. max_features: Building vocabulary that contains top max_features sorted by term-frequency in descending order.
2. ngram_range: The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted.
3. binary: If set, all non-zero term counts are set to 1 before weighting. This makes only the TF term binary; the final TFIDF values are not binary.

In [11]:
# TFIDF model
cv = TfidfVectorizer(max_features=300, ngram_range=(1,3))
X = cv.fit_transform(corpus).toarray()
data = pd.DataFrame(X, columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get))
data

Unnamed: 0,able,ac,access,actually,address,also,always,american,another,answer,...,word,work,world,would,writes,writes article,wrong,year,yes,yet
0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.113780,0.000000,0.000000
1,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.000000,0.0,0.107916,0.101784,0.000000,0.000000,0.0,0.0,0.0,0.111287,...,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.000000,0.0,0.000000,0.000000,0.193904,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.133627,0.0,0.080919,0.158146,0.000000,0.000000,0.000000,0.000000
4,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.322645,0.0,0.097690,0.000000,0.000000,0.000000,0.214153,0.213929
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11309,0.144964,0.0,0.000000,0.000000,0.000000,0.298494,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.111044,0.0,0.000000,0.000000,0.000000,0.108469,0.000000,0.000000
11310,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
11311,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.160145,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.211890
11312,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.056444,0.000000,0.122216,0.000000,0.000000,0.000000


In [12]:
clf = DecisionTreeClassifier(random_state=120)
pipeline = Pipeline(steps=[])
pipeline.steps.append(('text',TfidfVectorizer(ngram_range=(1,2))))
pipeline.steps.append(('classification',clf))
search_space = dict()
search_space['text__max_features'] = Integer(1,100)
search_space['classification__ccp_alpha'] = Real(0,0.02)
search_space['classification__class_weight'] = Categorical(['balanced',None])
nested_cv(corpus, pd.Series(train_data.target), pipeline, search_space)

Validation F1 score for fold 1: 0.15067997172854272
Best hyperparameters for fold 1: OrderedDict([('classification__ccp_alpha', 0.00020614205652772058), ('classification__class_weight', 'balanced'), ('text__max_features', 44)])
Test F1 score for fold 1: 0.1518826204103875

Validation F1 score for fold 2: 0.13800662344050796
Best hyperparameters for fold 2: OrderedDict([('classification__ccp_alpha', 0.00546314958659191), ('classification__class_weight', None), ('text__max_features', 95)])
Test F1 score for fold 2: 0.12635475974516777

Validation F1 score for fold 3: 0.16958932386649783
Best hyperparameters for fold 3: OrderedDict([('classification__ccp_alpha', 0.002015986738326026), ('classification__class_weight', 'balanced'), ('text__max_features', 89)])
Test F1 score for fold 3: 0.16048181336410303

Validation F1 score for fold 4: 0.051044871304274646
Best hyperparameters for fold 4: OrderedDict([('classification__ccp_alpha', 0.016845250201640002), ('classification__class_weight', 'b

## Word2Vec and Doc2Vec

Word2Vec and Doc2Vec are neural embedding models for NLP that use shallow neural networks (with 1 hidden layer).

Word2Vec learns a dense vector with a fixed number of dimensions for every word.

<b>Neural-network methods of Word2Vec</b>:
1. <b>Continuous Bag of Words (CBOW)</b>

![image.png](https://1.bp.blogspot.com/-nZFc7P6o3Yc/XQo2cYPM_ZI/AAAAAAAABxM/XBqYSa06oyQ_sxQzPcgnUxb5msRwDrJrQCLcBGAs/s1600/image001.png)

- Using an odd window size, the center word is the target and its left and right context words are the inputs, with the window sliding over the text.

2. <b>SkipGram</b>

![image.png](https://1.bp.blogspot.com/-Vz5pLuZ49K8/XV0ErlMtdDI/AAAAAAAAB0A/FIM74z__LAUkCqpW12ViAnGX8Br56W2PQCEwYBhgL/s1600/image001.png)

- Using an odd window size, the center word is the input and its left and right context words are the targets, with the window sliding over the text.

![image.png](https://www.researchgate.net/profile/Wang-Ling-16/publication/281812760/figure/fig1/AS%3A613966665486361%401523392468791/Illustration-of-the-Skip-gram-and-Continuous-Bag-of-Word-CBOW-models.png)

From the diagram above, the two methods mirror each other: CBOW predicts the center word from its context, while SkipGram predicts the context from the center word.
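The difference is easiest to see in the training examples each method builds from the same sliding window. The sketch below is plain Python with hypothetical helper names; it only generates (input, target) pairs and does not train a network. Truncating the window at the edges of the sentence is an assumption made here for simplicity.

```python
def context_windows(tokens, window_size):
    """Yield (center, context) per position; window_size is the odd total width."""
    half = window_size // 2
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - half):i]
        right = tokens[i + 1:i + 1 + half]
        yield center, left + right


def cbow_examples(tokens, window_size):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context, center) for center, context in context_windows(tokens, window_size)]


def skipgram_examples(tokens, window_size):
    """SkipGram: the center word is the input, each context word is a target."""
    return [(center, word)
            for center, context in context_windows(tokens, window_size)
            for word in context]


tokens = ["the", "quick", "brown", "fox"]
cbow = cbow_examples(tokens, window_size=3)
skipgram = skipgram_examples(tokens, window_size=3)

print(cbow)
print(skipgram)
```

CBOW produces one example per position, with several input words, while SkipGram produces one example per (center, context) pair.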

Doc2Vec extends Word2Vec to learn one vector for a whole sentence or document, which is often what is needed in practice for tasks such as text classification.

Instead of averaging or summing the Word2Vec vectors of the words in a sentence ("row" averaging, as in AverageWord2Vec), Doc2Vec learns a single vector per document directly while the network is trained.

In its distributed bag of words variant, Doc2Vec is also faster and needs less memory, since word vectors do not have to be stored.

<b>Algorithms of Doc2Vec</b>:

1. Distributed Memory

![image.png](https://miro.medium.com/max/640/0%2Ax-gtU4UlO8FAsRvL.)

2. Distributed Bag of Words

![image.png](https://miro.medium.com/max/640/0%2ANtIsrbd4VQzUKVKr.)

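Comparing both algorithms above, the distributed memory model uses the paragraph vector as a memory of what is missing from the current context (the topic of the paragraph) and is analogous to CBOW, while the distributed bag of words model predicts words sampled from the paragraph using only the paragraph vector and is analogous to SkipGram.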

<b>Advantages</b>:

1. Semantic information (meaning of similar words) is retained

2. Reduces sparsity

<b>Disadvantages</b>:

1. Derived features are difficult to interpret

<b>Important hyperparameters for Doc2Vec</b>:
1. window: Maximum distance between the current word and the predicted word within a sentence
2. vector_size: Dimensionality of the document vectors (number of derived features)
3. dbow_words: 1 for training word vectors in skip gram together with DBOW for doc-vector training vs 0 for training doc-vectors only (usually faster). The gensim documentation describes this switch for DBOW training, i.e. when dm=0.
4. dm_mean: Method for combining the context word vectors in distributed memory mode (Average (1) or summation (0))
5. dm: Type of algorithm to use for training paragraph vectors (Distributed memory (1) vs Distributed bag of words (0))

In [5]:
# Lemmatization (returns a list of tokens instead of a joined string, as expected by TaggedDocument)
wnl = WordNetLemmatizer()
tqdm.pandas()
def text_lemmatize(text):
    review = re.sub('[^a-zA-Z]',' ', text).lower().split()
    review = [wnl.lemmatize(word) for word in review if not word in stopwords.words('english')]
    return review
corpus = pd.Series(train_data.data).progress_map(lambda x: text_lemmatize(str(x)))

100%|██████████| 11314/11314 [06:48<00:00, 27.67it/s]


In [7]:
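class Word2VecTransformer(BaseEstimator, TransformerMixin):
    """sklearn-compatible transformer that turns token lists into Doc2Vec vectors.

    Despite the name, it wraps gensim's Doc2Vec; `sg` is passed through as `dbow_words`.
    """

    def __init__(self, window_size=5, vector_size=100, sg=0):
        self.window_size = window_size
        self.vector_size = vector_size
        self.sg = sg

    def fit(self, X, y=None):
        # Tag every document with its position so Doc2Vec learns one vector per document
        documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(X)]
        # dm is left at gensim's default (1, distributed memory)
        self.model = Doc2Vec(documents, window=self.window_size, vector_size=self.vector_size, dbow_words=self.sg, dm_mean=1)
        return self

    def transform(self, X, y=None):
        # Infer a vector for each document, including documents unseen during fit
        corpus_list = [self.model.infer_vector(doc) for doc in X]
        return pd.DataFrame(corpus_list)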

In [6]:
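### Continuous Bag of Words
# Note: dm is not set, so gensim's default dm=1 (distributed memory) applies; dbow_words is documented for DBOW training (dm=0)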
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]
model = Doc2Vec(documents, window=5, vector_size=50, dbow_words=0, dm_mean=1)
vec_list = []
for i in tqdm(range(len(corpus))):
    vec_list.append(model.infer_vector(corpus.iloc[i]))
data = pd.DataFrame(vec_list)
data

100%|██████████| 11314/11314 [00:39<00:00, 288.10it/s]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,-0.181867,0.142621,0.160889,-0.059451,-0.185729,-0.384308,0.070571,0.058495,0.000561,-0.276120,...,-0.309787,-0.019852,-0.595044,-0.495648,-0.194946,0.115274,-0.745542,0.114599,-0.136785,-0.262709
1,-0.254696,-0.202066,0.326909,0.183216,0.108386,-0.020028,-0.388124,-0.143820,0.174458,0.170976,...,-0.366750,0.252011,-0.630868,-0.320176,-0.153325,-0.037158,-0.275498,0.586989,-0.539745,0.174718
2,-0.035048,-0.039179,0.297219,0.481677,-0.312172,-0.284266,0.595817,0.312010,-0.078611,-0.001274,...,-0.455461,0.064141,-1.011400,-0.095309,-0.093512,-0.437727,-0.355270,0.302072,0.369565,0.318478
3,-0.032163,-0.101395,-0.337188,-0.140559,0.078657,0.274490,0.251014,0.254533,-0.079445,-0.406573,...,-0.140821,0.188404,-0.294763,0.082081,-0.094968,-0.223158,-0.258877,0.289686,0.038878,-0.347468
4,0.225668,0.227880,0.239966,0.681074,-0.313954,-0.018924,0.005046,0.287880,0.020320,-0.435511,...,-0.552227,0.029562,-0.478643,-0.274663,-0.139647,-0.374686,-0.334586,0.066808,-0.552995,0.044064
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11309,0.196941,0.613343,0.549250,1.225908,0.017679,-0.303703,0.371028,-0.109360,1.115787,-0.310815,...,-0.625617,-0.119782,-0.515483,-0.186371,-0.818890,0.951120,-0.990050,0.567417,0.040211,1.155560
11310,-0.197742,0.203054,0.167696,0.160393,-0.025061,-0.496154,-0.179802,-0.189870,0.431367,-0.382585,...,-0.621502,-0.362972,-0.635916,0.044675,0.204900,-0.371604,-0.340396,0.327096,-0.128715,0.196688
11311,-0.255346,-0.360278,0.104715,0.717931,-0.148578,0.033314,0.190778,-0.350770,0.071576,0.191106,...,-0.025464,0.133005,-0.040903,-0.447190,-0.369890,-0.139451,-0.518170,0.230131,-0.522328,0.261976
11312,0.662580,0.442421,0.729126,-0.009726,-0.626188,-0.483847,-0.341213,0.329171,-0.173751,-0.705158,...,-0.723478,0.335752,0.030538,0.321691,0.176673,-0.209002,-0.387193,0.485543,-0.016760,0.189186


In [12]:
clf = DecisionTreeClassifier(random_state=120)
pipeline = Pipeline(steps=[])
pipeline.steps.append(('text',Word2VecTransformer()))
pipeline.steps.append(('classification',clf))
search_space = dict()
search_space['text__window_size'] = Categorical([3,5,7,9])
search_space['text__vector_size'] = Integer(1,100)
search_space['classification__ccp_alpha'] = Real(0,0.02)
search_space['classification__class_weight'] = Categorical(['balanced',None])
nested_cv(corpus, pd.Series(train_data.target), pipeline, search_space)

Validation F1 score for fold 1: 0.31464911188966166
Best hyperparameters for fold 1: OrderedDict([('classification__ccp_alpha', 0.0014507615937854993), ('classification__class_weight', 'balanced'), ('text__vector_size', 28), ('text__window_size', 5)])
Test F1 score for fold 1: 0.30708073075526265

Validation F1 score for fold 2: 0.2802194238270943
Best hyperparameters for fold 2: OrderedDict([('classification__ccp_alpha', 0.002214737395726681), ('classification__class_weight', None), ('text__vector_size', 10), ('text__window_size', 5)])
Test F1 score for fold 2: 0.2584749714270034

Validation F1 score for fold 3: 0.28429350539403836
Best hyperparameters for fold 3: OrderedDict([('classification__ccp_alpha', 0.0013276722768080765), ('classification__class_weight', None), ('text__vector_size', 41), ('text__window_size', 7)])
Test F1 score for fold 3: 0.25178622379934185

Validation F1 score for fold 4: 0.25682745921600997
Best hyperparameters for fold 4: OrderedDict([('classification__cc

In [13]:
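### SkipGram
# Note: dm is not set, so gensim's default dm=1 (distributed memory) applies; dbow_words is documented for DBOW training (dm=0)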
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]
model = Doc2Vec(documents, window=5, vector_size=50, dbow_words=1, dm_mean=1)
vec_list = []
for i in tqdm(range(len(corpus))):
    vec_list.append(model.infer_vector(corpus.iloc[i]))
data = pd.DataFrame(vec_list)
data

100%|██████████| 11314/11314 [00:38<00:00, 290.75it/s]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,0.138296,0.310877,-0.003023,-0.098006,-0.154416,-0.077633,0.149321,0.416268,-0.016233,-0.279939,...,-0.430208,0.129288,-0.318273,-0.403427,-0.080636,0.029839,-0.804690,0.010699,0.195876,-0.094002
1,-0.322533,-0.228493,0.259658,0.583815,0.054585,-0.410566,0.161045,-0.061435,0.061736,-0.183913,...,-0.710538,0.273902,-0.127719,-0.374665,0.014058,-0.084448,-0.308150,0.499244,-0.461537,0.258391
2,0.017253,-0.300993,0.078815,0.587306,-0.253314,-0.877639,0.610329,0.601593,-0.214956,-0.280966,...,-0.481166,-0.050697,-0.648011,-0.406621,0.348724,-0.056114,-0.155708,-0.284408,0.228785,0.338700
3,-0.112147,-0.147457,-0.052790,0.129133,-0.019868,-0.048773,0.186313,0.291806,-0.417658,-0.315570,...,-0.152403,0.243233,-0.547521,-0.151617,0.022199,-0.062400,-0.196100,0.390207,-0.152791,-0.245113
4,0.086882,0.430148,-0.073857,0.634513,-0.360836,-0.342152,0.314116,0.309371,0.206878,-0.316622,...,-0.659270,-0.253001,-0.315521,-0.161839,-0.295772,-0.184502,-0.257272,0.101284,-0.545335,0.042074
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11309,-0.203259,0.332130,0.381063,1.195262,-0.286618,-0.158270,0.479994,-0.150410,1.007477,-0.013528,...,-0.733101,-0.054535,-0.182382,0.214754,-0.824925,0.684483,-1.446060,0.552993,0.200208,0.639000
11310,-0.334115,0.119862,-0.094698,0.166379,0.210553,-0.534351,0.117241,0.087344,0.214325,-0.523951,...,-0.487798,-0.247501,-0.445998,-0.071153,-0.260226,-0.400416,-0.331814,0.413073,0.237487,0.409700
11311,-0.095471,-0.068837,0.139592,0.861653,-0.168663,-0.147079,0.390835,-0.215389,0.098699,0.097101,...,-0.042169,0.252173,0.097161,-0.351590,-0.181138,-0.352563,-0.658175,0.259115,-0.317019,0.102958
11312,0.342822,0.470287,0.440438,-0.141964,-0.583460,-0.216313,-0.118912,0.433664,0.012252,-0.473858,...,-0.448919,0.274798,0.052984,0.042835,-0.158311,-0.246133,-0.033715,0.465782,0.139712,0.428265


In [8]:
clf = DecisionTreeClassifier(random_state=120)
pipeline = Pipeline(steps=[])
pipeline.steps.append(('text',Word2VecTransformer(sg=1)))
pipeline.steps.append(('classification',clf))
search_space = dict()
search_space['text__window_size'] = Categorical([3,5,7,9])
search_space['text__vector_size'] = Integer(1,100)
search_space['classification__ccp_alpha'] = Real(0,0.02)
search_space['classification__class_weight'] = Categorical(['balanced',None])
nested_cv(corpus, pd.Series(train_data.target), pipeline, search_space)

Validation F1 score for fold 1: 0.14679261130596402
Best hyperparameters for fold 1: OrderedDict([('classification__ccp_alpha', 0.004414685066848233), ('classification__class_weight', None), ('text__vector_size', 34), ('text__window_size', 9)])
Test F1 score for fold 1: 0.1367525825302221

Validation F1 score for fold 2: 0.26865655194851895
Best hyperparameters for fold 2: OrderedDict([('classification__ccp_alpha', 0.0011607221554817196), ('classification__class_weight', None), ('text__vector_size', 83), ('text__window_size', 9)])
Test F1 score for fold 2: 0.23049590128547165

Validation F1 score for fold 3: 0.32601892430323814
Best hyperparameters for fold 3: OrderedDict([('classification__ccp_alpha', 0.00037908224336926873), ('classification__class_weight', None), ('text__vector_size', 80), ('text__window_size', 5)])
Test F1 score for fold 3: 0.3359490583581755

Validation F1 score for fold 4: 0.29275517102371446
Best hyperparameters for fold 4: OrderedDict([('classification__ccp_alp