# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, lowercasing, removal of punctuation signs, possessive pronouns and stop words, and lemmatization.
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [161]:
import pickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

Let's load the dataset:

In [162]:
path_df = "C:/Users/Keletso/Documents/Research Paper Classification/1. Exploratory Data Analysis/research_dataset.pickle"

with open(path_df, 'rb') as data:
    df_train = pickle.load(data)

In [163]:
df_train.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,Title_length,Abstract_length
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0,43,1912
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0,34,513
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0,70,668
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0,91,783
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0,142,860


And visualize one sample title and abstract:

In [164]:
df_train.loc[4]['TITLE']

'Comparative study of Discrete Wavelet Transforms and Wavelet Tensor Train decomposition to feature extraction of FTIR data of medicinal plants'

In [165]:
df_train.loc[0]['ABSTRACT']

"  Predictive models allow subject-specific inference when analyzing disease\nrelated alterations in neuroimaging data. Given a subject's data, inference can\nbe made at two levels: global, i.e. identifiying condition presence for the\nsubject, and local, i.e. detecting condition effect on each individual\nmeasurement extracted from the subject's data. While global inference is widely\nused, local inference, which can be used to form subject-specific effect maps,\nis rarely used because existing models often yield noisy detections composed of\ndispersed isolated islands. In this article, we propose a reconstruction\nmethod, named RSM, to improve subject-specific detections of predictive\nmodeling approaches and in particular, binary classifiers. RSM specifically\naims to reduce noise due to sampling error associated with using a finite\nsample of examples to train classifiers. The proposed method is a wrapper-type\nalgorithm that can be used with different binary classifiers in a diagn

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\n``
* ``$``
* ``\\``
* `` ` ``

In [166]:
# Replace newlines, dollar signs, backslashes and backticks with spaces.
# regex=False makes pandas treat every pattern as a literal string.
special_chars = ["\n", "$", "\\", "`"]

df_train['Title_Parsed_1'] = df_train['TITLE']
df_train['ABSTRACT_Parsed_1'] = df_train['ABSTRACT']

for char in special_chars:
    df_train['Title_Parsed_1'] = df_train['Title_Parsed_1'].str.replace(char, " ", regex=False)
    df_train['ABSTRACT_Parsed_1'] = df_train['ABSTRACT_Parsed_1'].str.replace(char, " ", regex=False)
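
As a quick sanity check of the replacements above, here is a minimal, self-contained sketch on two made-up strings. The sample texts and the names `sample` and `cleaned` are illustrative only and are not part of the notebook's pipeline; the sketch assumes a pandas version whose `str.replace` accepts the `regex` argument.

```python
import pandas as pd

# Illustrative strings containing a newline, dollar signs, backslashes and a backtick
sample = pd.Series([
    "Let $\\Omega \\subset \\mathbb{R}^n$ be\na bounded domain",
    "1I/`Oumuamua",
])

cleaned = sample
for char in ["\n", "$", "\\", "`"]:
    # regex=False: "$" and "\\" are matched literally, not as regex syntax
    cleaned = cleaned.str.replace(char, " ", regex=False)

print(cleaned.tolist())
```

Passing `regex=False` explicitly keeps the behaviour the same across pandas versions, since the default value of `regex` has changed over time.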

In [167]:
df_train

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,Title_length,Abstract_length,Title_Parsed_1,ABSTRACT_Parsed_1
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0,43,1912,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0,34,513,Rotation Invariance Neural Network,Rotation invariance and translation invarian...
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0,70,668,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0,91,783,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0,142,860,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...
5,6,On maximizing the fundamental frequency of the...,Let $\Omega \subset \mathbb{R}^n$ be a bound...,0,0,1,0,0,0,72,1224,On maximizing the fundamental frequency of the...,Let Omega subset mathbb{R}^n be a bound...
6,7,On the rotation period and shape of the hyperb...,We observed the newly discovered hyperbolic ...,0,1,0,0,0,0,102,625,On the rotation period and shape of the hyperb...,We observed the newly discovered hyperbolic ...
7,8,Adverse effects of polymer coating on heat tra...,The ability of metallic nanoparticles to sup...,0,1,0,0,0,0,78,829,Adverse effects of polymer coating on heat tra...,The ability of metallic nanoparticles to sup...
8,9,SPH calculations of Mars-scale collisions: the...,We model large-scale ($\approx$2000km) impac...,0,1,0,0,0,0,120,803,SPH calculations of Mars-scale collisions: the...,We model large-scale ( approx 2000km) impac...
9,10,$\mathcal{R}_{0}$ fails to predict the outbrea...,Time varying susceptibility of host at indiv...,0,0,0,0,1,0,102,1014,mathcal{R}_{0} fails to predict the outbrea...,Time varying susceptibility of host at indiv...


### 1.2. Uppercase to lowercase

We'll lowercase the text because we want, for example, `Reconstructing` and `reconstructing` to be treated as the same word.

In [168]:
df_train['Title_Parsed_2'] = df_train['Title_Parsed_1'].str.lower()
df_train['ABSTRACT_Parsed_2'] = df_train['ABSTRACT_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs have no predictive power, so we'll simply remove them.

In [169]:
punctuation_signs = list("?:!.,;")
df_train['Title_Parsed_3'] = df_train['Title_Parsed_2']
df_train['ABSTRACT_Parsed_3'] = df_train['ABSTRACT_Parsed_2']

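for punct_sign in punctuation_signs:
    # regex=False: "?" and "." must be matched literally, not as regex metacharacters
    df_train['Title_Parsed_3'] = df_train['Title_Parsed_3'].str.replace(punct_sign, '', regex=False)
    df_train['ABSTRACT_Parsed_3'] = df_train['ABSTRACT_Parsed_3'].str.replace(punct_sign, '', regex=False)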

This changes the title and abstract lengths, but that is not a problem, since we don't expect any predictive power from these characters.
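
The same removal can also be done in a single pass with one regular expression. The sketch below is only an alternative under that assumption, not what the notebook runs; `pattern`, `titles` and `no_punct` are hypothetical names and the two titles are made up.

```python
import re
import pandas as pd

punctuation_signs = list("?:!.,;")

# Build a character class such as [\?:!\.,;]; re.escape protects "?" and "."
pattern = "[" + re.escape("".join(punctuation_signs)) + "]"

titles = pd.Series([
    "sph calculations of mars-scale collisions: the role",
    "what is a graph?",
])

# regex=True because the pattern is a regular expression
no_punct = titles.str.replace(pattern, "", regex=True)
print(no_punct.tolist())
```

Hyphens are not in the list, so compound terms such as `mars-scale` are kept intact, as in the loop above.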

### 1.4. Possessive pronouns

We'll also remove the possessive ending `'s`:

In [170]:
df_train['Title_Parsed_4'] = df_train['Title_Parsed_3'].str.replace("'s", "")
df_train['ABSTRACT_Parsed_4'] = df_train['ABSTRACT_Parsed_3'].str.replace("'s", "")
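
A plain `'s` replacement also removes the two characters when they open a quoted word. The sketch below contrasts it with a word-boundary variant; this variant is a hypothetical refinement, not the notebook's method, and the sample phrases and variable names are made up.

```python
import pandas as pd

phrases = pd.Series(["the subject's data", "the 'standard' model"])

# Literal replacement: also strips the "'s" at the start of 'standard'
plain = phrases.str.replace("'s", "", regex=False)

# Regex variant: only match "'s" when it ends a word
anchored = phrases.str.replace(r"'s\b", "", regex=True)

print(plain.tolist())
print(anchored.tolist())
```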

### 1.5. Stemming and Lemmatization

Since stemming can produce words that don't exist, we'll use only lemmatization. Lemmatization relies on a morphological analysis of the words and returns words that do exist, so it will be more useful for us.
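
To illustrate the point about stemming, here is a small sketch using NLTK's Porter stemmer. The Porter stemmer is just one common stemming algorithm, picked here as an example; the word list is made up.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; the stemmer works by stripping suffixes with fixed rules
words = ["studies", "transforms", "decomposition"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

For instance, `studies` is reduced to `studi`, which is not an English word. A lemmatizer looks words up in a dictionary (WordNet in our case), so it only returns existing words.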

In [171]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Keletso\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Keletso\AppData\Roaming\nltk_data...


------------------------------------------------------------


[nltk_data]   Package wordnet is already up-to-date!


True

In [172]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

We iterate over every row and lemmatize each word in the title and in the abstract:

In [173]:
nrows = len(df_train)
lemmatized_title_list = []
lemmatized_abstract_list = []

for row in range(0, nrows):
    
    # Create empty lists to hold the lemmatized words
    title_list = []
    abstract_list = []
    
    # Save the text and its words into an object
    title = df_train.loc[row]['Title_Parsed_4']
    title_words = title.split(" ")
    
    abstract = df_train.loc[row]['ABSTRACT_Parsed_4']
    abstract_words = abstract.split(" ")

    # Iterate through every word to lemmatize
    for word in title_words:
        title_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
    
    for word in abstract_words:
        abstract_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
    # Join the list
    lemmatized_title = " ".join(title_list)
    lemmatized_abstract = " ".join(abstract_list)
    
    # Append to the list containing the texts
    lemmatized_title_list.append(lemmatized_title)
    lemmatized_abstract_list.append(lemmatized_abstract)

In [174]:
df_train['Title_Parsed_5'] = lemmatized_title_list
df_train['ABSTRACT_Parsed_5'] = lemmatized_abstract_list

Let's see what the results look like.

In [175]:
df_train['Title_Parsed_4'].loc[4]

'comparative study of discrete wavelet transforms and wavelet tensor train decomposition to feature extraction of ftir data of medicinal plants'

In [176]:
df_train['Title_Parsed_5'].loc[4]

'comparative study of discrete wavelet transform and wavelet tensor train decomposition to feature extraction of ftir data of medicinal plant'

In [177]:
df_train['ABSTRACT_Parsed_4'].loc[4]

'  fourier-transform infra-red (ftir) spectra of samples from 7 plant species were used to explore the influence of preprocessing and feature extraction on efficiency of machine learning algorithms wavelet tensor train (wtt) and discrete wavelet transforms (dwt) were compared as feature extraction techniques for ftir data of medicinal plants various combinations of signal processing steps showed different behavior when applied to classification and clustering tasks best results for wtt and dwt found through grid search were similar significantly improving quality of clustering as well as classification accuracy for tuned logistic regression in comparison to original spectra unlike dwt wtt has only one parameter to be tuned (rank) making it a more versatile and easier to use as a data processing tool in various signal processing applications '

In [178]:
df_train['ABSTRACT_Parsed_5'].loc[4]

'  fourier-transform infra-red (ftir) spectra of sample from 7 plant species be use to explore the influence of preprocessing and feature extraction on efficiency of machine learn algorithms wavelet tensor train (wtt) and discrete wavelet transform (dwt) be compare as feature extraction techniques for ftir data of medicinal plant various combinations of signal process step show different behavior when apply to classification and cluster task best result for wtt and dwt find through grid search be similar significantly improve quality of cluster as well as classification accuracy for tune logistic regression in comparison to original spectra unlike dwt wtt have only one parameter to be tune (rank) make it a more versatile and easier to use as a data process tool in various signal process applications '

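We can see that the lemmatization gives mixed results. Because every word is lemmatized as a verb (`pos="v"`), verb forms such as `were` and `used` are reduced to `be` and `use`, but plural nouns such as `algorithms` and `applications` are left unchanged, and nouns such as `learning` and `processing` become `learn` and `process`. It may still help in the following steps, since it reduces the number of distinct word forms.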

### 1.6. Stop words

In [179]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Keletso\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [180]:
# Loading the English stop words
stop_words = list(stopwords.words('english'))

In [181]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Let's remove stop words.

In [182]:
df_train['Title_Parsed_6'] = df_train['Title_Parsed_5']
df_train['ABSTRACT_Parsed_6'] = df_train['ABSTRACT_Parsed_5']
for stop_word in stop_words:
    # Match the stop word only as a whole word; \b requires regex=True
    regex_stopword = r"\b" + re.escape(stop_word) + r"\b"
    df_train['Title_Parsed_6'] = df_train['Title_Parsed_6'].str.replace(regex_stopword, '', regex=True)
    df_train['ABSTRACT_Parsed_6'] = df_train['ABSTRACT_Parsed_6'].str.replace(regex_stopword, '', regex=True)
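
Looping over the stop words scans the whole column once per word. A possible alternative, sketched here on a made-up title with a small hypothetical subset of stop words, is to join all of them into one alternation pattern and replace in a single pass.

```python
import re
import pandas as pd

# Small hypothetical subset; the notebook uses the full NLTK English list
stop_words = ["on", "the", "and", "of", "its", "from"]

# \b(?:on|the|and|...)\b matches any stop word as a whole word
pattern = r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b"

titles = pd.Series(["on the rotation period and shape of the hyperbolic asteroid"])
without_stop = titles.str.replace(pattern, "", regex=True)

print(without_stop.tolist())
```

The surrounding spaces are left in place, exactly as with the loop, so the result contains runs of blanks. This is harmless for TF-IDF, which splits the text into tokens anyway.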

Let's see how the steps we've implemented have changed the original title and abstract:

In [183]:
df_train.loc[6]['TITLE']

'On the rotation period and shape of the hyperbolic asteroid 1I/`Oumuamua (2017) U1 from its lightcurve'

In [184]:
df_train.loc[6]['Title_Parsed_6']

'  rotation period  shape   hyperbolic asteroid 1i/ oumuamua (2017) u1   lightcurve'

In [185]:
df_train.loc[6]['ABSTRACT']

"  We observed the newly discovered hyperbolic minor planet 1I/`Oumuamua (2017\nU1) on 2017 October 30 with Lowell Observatory's 4.3-m Discovery Channel\nTelescope. From these observations, we derived a partial lightcurve with\npeak-to-trough amplitude of at least 1.2 mag. This lightcurve segment rules out\nrotation periods less than 3 hr and suggests that the period is at least 5 hr.\nOn the assumption that the variability is due to a changing cross section, the\naxial ratio is at least 3:1. We saw no evidence for a coma or tail in either\nindividual images or in a stacked image having an equivalent exposure time of\n9000 s.\n"

In [186]:
df_train.loc[6]['ABSTRACT_Parsed_6']

'   observe  newly discover hyperbolic minor planet 1i/ oumuamua (2017 u1)  2017 october 30  lowell observatory 43- discovery channel telescope   observations  derive  partial lightcurve  peak--trough amplitude   least 12 mag  lightcurve segment rule  rotation periods less  3 hr  suggest   period   least 5 hr   assumption   variability  due   change cross section  axial ratio   least 31  saw  evidence   coma  tail  either individual image    stack image   equivalent exposure time  9000  '

We can delete the intermediate columns:

In [187]:
df_train.head(1)

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,Title_length,...,Title_Parsed_2,ABSTRACT_Parsed_2,Title_Parsed_3,ABSTRACT_Parsed_3,Title_Parsed_4,ABSTRACT_Parsed_4,Title_Parsed_5,ABSTRACT_Parsed_5,Title_Parsed_6,ABSTRACT_Parsed_6
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0,43,...,reconstructing subject-specific effect maps,predictive models allow subject-specific inf...,reconstructing subject-specific effect maps,predictive models allow subject-specific inf...,reconstructing subject-specific effect maps,predictive models allow subject-specific inf...,reconstruct subject-specific effect map,predictive model allow subject-specific infe...,reconstruct subject-specific effect map,predictive model allow subject-specific infe...


In [188]:
list_columns = ["TITLE", "ABSTRACT", "Computer Science", 'Physics','Mathematics' ,'Statistics' ,'Quantitative Biology' ,'Quantitative Finance' , "Title_Parsed_6", "ABSTRACT_Parsed_6"]
df_train = df_train[list_columns]

df_train = df_train.rename(columns={'Title_Parsed_6': 'Title_Parsed','ABSTRACT_Parsed_6': 'ABSTRACT_Parsed'})

We combine the two text columns into a single one, which makes it easier to pass the text to the train-test split.

In [189]:
df_train['Article Description']=df_train['Title_Parsed'] + df_train['ABSTRACT_Parsed']

In [190]:
df_train.loc[0]['Article Description']

'reconstruct subject-specific effect map  predictive model allow subject-specific inference  analyze disease relate alterations  neuroimaging data give  subject data inference   make  two level global ie identifiying condition presence   subject  local ie detect condition effect   individual measurement extract   subject data  global inference  widely use local inference    use  form subject-specific effect map  rarely use  exist model often yield noisy detections compose  disperse isolate islands   article  propose  reconstruction method name rsm  improve subject-specific detections  predictive model approach   particular binary classifiers rsm specifically aim  reduce noise due  sample error associate  use  finite sample  examples  train classifiers  propose method   wrapper-type algorithm    use  different binary classifiers   diagnostic manner ie without information  condition presence reconstruction  pose   maximum--posteriori problem   prior model whose parameters  estimate  trai

## 3. Train - test split

We'll set apart a test set to assess the quality of our models. We'll do cross-validation on the training set to tune the hyperparameters and then test performance on the unseen data of the test set. Note also that some papers are tagged with more than one research field. For simplicity, we assume that each paper belongs to a single field: the first one, in column order, that is flagged with 1.

In [192]:
field_columns = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
                 'Quantitative Biology', 'Quantitative Finance']

def get_fields(row):
    # Return the first field (in column order) flagged with 1, or None if there is none
    for c in field_columns:
        if row[c] == 1:
            return c

In [193]:
df_train['Research Field'] = df_train.apply(get_fields, axis=1)

In [194]:
Research_Field_codes = {
    'Computer Science': 0,
    'Physics': 1,
    'Mathematics': 2,
    'Statistics': 3,
    'Quantitative Biology': 4,
    'Quantitative Finance': 5
}

In [195]:
df_train['Research Field Code'] = df_train['Research Field']
df_train = df_train.replace({'Research Field Code':Research_Field_codes})
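
The single-label assignment and the label coding can be checked on a toy frame. The sketch below uses `idxmax` and `map` as a compact equivalent; this is an assumption made for illustration, not the notebook's code, and the toy rows are made up.

```python
import pandas as pd

field_columns = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
                 'Quantitative Biology', 'Quantitative Finance']
Research_Field_codes = {name: code for code, name in enumerate(field_columns)}

# Row 0 is tagged with two fields, row 1 with one
toy = pd.DataFrame([[1, 0, 0, 1, 0, 0],
                    [0, 0, 1, 0, 0, 0]], columns=field_columns)

# idxmax returns the first column holding the row maximum, i.e. the first 1
toy['Research Field'] = toy[field_columns].idxmax(axis=1)
toy['Research Field Code'] = toy['Research Field'].map(Research_Field_codes)

print(toy[['Research Field', 'Research Field Code']])
```

One difference to keep in mind: for a row with no field flagged at all, `idxmax` would return the first column, whereas `get_fields` returns `None`.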

In [196]:
df_train

Unnamed: 0,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,Title_Parsed,ABSTRACT_Parsed,Article Description,Research Field,Research Field Code
0,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0,reconstruct subject-specific effect map,predictive model allow subject-specific infe...,reconstruct subject-specific effect map predi...,Computer Science,0
1,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0,rotation invariance neural network,rotation invariance translation invariance ...,rotation invariance neural network rotation i...,Computer Science,0
2,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0,spherical polyharmonics poisson kernels poly...,introduce develop notion spherical polyh...,spherical polyharmonics poisson kernels poly...,Mathematics,2
3,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0,finite element approximation stochastic max...,stochastic landau--lifshitz--gilbert (llg) ...,finite element approximation stochastic max...,Mathematics,2
4,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0,comparative study discrete wavelet transform ...,fourier-transform infra-red (ftir) spectra ...,comparative study discrete wavelet transform ...,Computer Science,0
5,On maximizing the fundamental frequency of the...,Let $\Omega \subset \mathbb{R}^n$ be a bound...,0,0,1,0,0,0,maximize fundamental frequency complement ...,let omega subset mathbb{r}^n bound do...,maximize fundamental frequency complement ...,Mathematics,2
6,On the rotation period and shape of the hyperb...,We observed the newly discovered hyperbolic ...,0,1,0,0,0,0,rotation period shape hyperbolic asteroid...,observe newly discover hyperbolic minor pl...,rotation period shape hyperbolic asteroid...,Physics,1
7,Adverse effects of polymer coating on heat tra...,The ability of metallic nanoparticles to sup...,0,1,0,0,0,0,adverse effect polymer coat heat transport ...,ability metallic nanoparticles supply hea...,adverse effect polymer coat heat transport ...,Physics,1
8,SPH calculations of Mars-scale collisions: the...,We model large-scale ($\approx$2000km) impac...,0,1,0,0,0,0,sph calculations mars-scale collisions role ...,model large-scale ( approx 2000km) impact ...,sph calculations mars-scale collisions role ...,Physics,1
9,$\mathcal{R}_{0}$ fails to predict the outbrea...,Time varying susceptibility of host at indiv...,0,0,0,0,1,0,mathcal{r}_{0} fail predict outbreak pote...,time vary susceptibility host individual l...,mathcal{r}_{0} fail predict outbreak pote...,Quantitative Biology,4


In [197]:
X_train, X_test, y_train, y_test = train_test_split(df_train['Article Description'], 
                                                    df_train['Research Field Code'],
                                                    test_size=0.25, 
                                                    random_state=8)
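
As a small illustration of what the split does, the sketch below applies the same `test_size` and `random_state` to 20 made-up documents. The `stratify` argument is an optional variation that the notebook does not use; it keeps the class proportions similar in both subsets, which can matter when some fields are rare.

```python
from sklearn.model_selection import train_test_split

# Made-up corpus: 20 documents, two balanced classes
texts = ["doc %d" % i for i in range(20)]
labels = [0] * 10 + [1] * 10

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels,
                                          test_size=0.25,
                                          random_state=8,
                                          stratify=labels)  # optional, not in the notebook

print(len(X_tr), len(X_te))
```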

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider unigrams, bigrams and trigrams.
* `max_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only considers the top
    `max_features` terms ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.

Note that we are implicitly scaling our data when representing it as TF-IDF features with the argument `norm`: with `norm='l2'`, each document vector is normalized to unit length.
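
The effect of these parameters can be seen on a tiny made-up corpus. This is only a sketch: the four documents and the value `min_df=2` are chosen for illustration and are not the values used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "deep neural network",
    "deep neural network model",
    "magnetic field",
    "magnetic field theory",
]

toy_tfidf = TfidfVectorizer(ngram_range=(1, 3),  # unigrams, bigrams and trigrams
                            min_df=2,            # keep terms found in at least 2 documents
                            max_df=1.0,          # no upper limit on document frequency
                            norm='l2',
                            sublinear_tf=True)
X_toy = toy_tfidf.fit_transform(corpus)

vocab = sorted(toy_tfidf.vocabulary_)
# Euclidean length of each document vector
row_norms = np.asarray(np.sqrt(X_toy.multiply(X_toy).sum(axis=1))).ravel()

print(vocab)
print(row_norms)
```

Terms that occur in a single document, such as `model` or `field theory`, are dropped by `min_df`, and every row has length 1 because of `norm='l2'`.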

In [205]:
# Parameter selection
ngram_range = (1,3)
min_df = 10
max_df = 1.
max_features = 3000

We have chosen these values as a first approximation.

In [206]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(15729, 3000)
(5243, 3000)


In [207]:
from sklearn.feature_selection import chi2
import numpy as np

for label, label_id in sorted(Research_Field_codes.items()):
    features_chi2 = chi2(features_train, labels_train == label_id)
    indices = np.argsort(features_chi2[0])
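    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
    feature_names = np.array(tfidf.get_feature_names_out())[indices]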
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    trigrams = [v for v in feature_names if len(v.split(' ')) == 3]
    print("# '{}' label:".format(label))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-5:])))
    print("  . Most correlated trigrams:\n. {}".format('\n. '.join(trigrams[-5:])))
    print("")

# 'Computer Science' label:
  . Most correlated unigrams:
. neural
. train
. task
. learn
. network
  . Most correlated bigrams:
. paper propose
. convolutional neural
. magnetic field
. state art
. neural network
  . Most correlated trigrams:
. deep reinforcement learn
. experimental result show
. recurrent neural network
. deep neural network
. convolutional neural network

# 'Mathematics' label:
  . Most correlated unigrams:
. algebra
. algebras
. group
. prove
. mathbb
  . Most correlated bigrams:
. paper prove
. sufficient condition
. riemannian manifold
. neural network
. mathbb mathbb
  . Most correlated trigrams:
. experimental result show
. generative adversarial network
. recurrent neural network
. convolutional neural network
. deep neural network

# 'Physics' label:
  . Most correlated unigrams:
. electron
. phase
. temperature
. spin
. magnetic
  . Most correlated bigrams:
. spin orbit
. many body
. grind state
. dark matter
. magnetic field
  . Most correlated trigrams:
.

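As we can see, the unigrams and most bigrams correspond well to their category. However, the trigrams do not, and a few bigrams are off as well (for example, `neural network` under Mathematics). Keep in mind that the chi-squared score measures how strongly a term depends on the label regardless of direction, so a term that is unusually rare in a category can rank as high as one that is unusually common. If we list the trigrams among our features: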

In [208]:
trigrams

['experimental result show',
 'play important role',
 'paper propose novel',
 'convolutional neural network',
 'outperform state art',
 'long short term',
 'recurrent neural network',
 'achieve state art',
 'deep reinforcement learn',
 'state art methods',
 'deep learn model',
 'short term memory',
 'density functional theory',
 'support vector machine',
 'stochastic gradient descent',
 'generative adversarial network',
 'deep neural network',
 'machine learn model']

We can see there are only 18. Trigrams are much rarer than unigrams and bigrams, and since `max_features` keeps only the 3000 most frequent terms across the corpus, only a few trigrams make it into the vocabulary. With so few candidates, the same trigrams tend to show up under several labels.
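
One point to keep in mind when reading the per-label rankings above is that `chi2` scores the strength of the dependence between a term and a label, not its direction. The toy sketch below (made-up counts, hypothetical variable names) has one feature that only occurs in the positive class and one that only occurs in the negative class.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Column 0 occurs only in class 1, column 1 only in class 0
X_counts = np.array([[1, 0],
                     [1, 0],
                     [0, 1],
                     [0, 1]])
y_binary = np.array([1, 1, 0, 0])

scores, p_values = chi2(X_counts, y_binary)
print(scores)
```

Both features get the same score, even though only the first one is typical of class 1. This is consistent with terms from other fields appearing in a label's "most correlated" list.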

Let's save the files we'll need in the next steps:

In [211]:
# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df_train.pickle', 'wb') as output:
    pickle.dump(df_train, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
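
The repeated `with open(...)` blocks could also be collapsed into a small helper. The sketch below is one possible way to do it; `save_pickles` and `load_pickle` are hypothetical helpers, and the demo writes to a temporary folder rather than to `Pickles/`.

```python
import os
import pickle
import tempfile

def save_pickles(objects, folder):
    # Write each object to <folder>/<name>.pickle
    os.makedirs(folder, exist_ok=True)
    for name, obj in objects.items():
        with open(os.path.join(folder, name + ".pickle"), "wb") as output:
            pickle.dump(obj, output)

def load_pickle(name, folder):
    # Read <folder>/<name>.pickle back into memory
    with open(os.path.join(folder, name + ".pickle"), "rb") as data:
        return pickle.load(data)

# Demo with a made-up object and a temporary folder
demo_folder = tempfile.mkdtemp()
save_pickles({"labels_demo": [0, 1, 2]}, demo_folder)
restored = load_pickle("labels_demo", demo_folder)
print(restored)
```

In the notebook this would be called with a dictionary such as `{"X_train": X_train, "tfidf": tfidf}` and the folder `"Pickles"`.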