# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, downcasing, removal of punctuation signs, possessive endings and stop words, and lemmatization.
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [2]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

First of all we'll load the dataset:

In [2]:
#path_df = "/home/lnc/0. Latest News Classifier/02. Exploratory Data Analysis/News_dataset.pickle"

#with open(path_df, 'rb') as data:
#    df = pickle.load(data)

In [3]:
df = pd.read_csv('ml_model.csv')

In [4]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Name,Date,Speech,Type,Party,Republican,word_count,unique_word,unique_word_ratio,year,positive_words,negative_words,positive_words_ratio,negative_words_ratio,vader_compound,vader_neg,vader_neu,vader_pos
0,0,0,Donald Trump,"February 05, 2019","\r\n \r\n \r\nMadam Speaker, Mr. Vic...",State of the Union,Republican,1,30948,1909,0.061684,2019,226,159,0.007303,0.005138,0.9999,0.123,0.7,0.176
1,1,1,Donald Trump,"January 30, 2018","\r\n \r\n Mr. Speaker, Mr. Vice Pres...",State of the Union,Republican,1,30540,1905,0.062377,2018,232,132,0.007597,0.004322,0.9999,0.111,0.716,0.173
2,2,2,Donald Trump,"January 20, 2017","\r\n \r\n Chief Justice Roberts, Pre...",Inaugural Address,Republican,1,8551,621,0.072623,2017,77,26,0.009005,0.003041,0.9998,0.072,0.711,0.217
3,3,3,Barack Obama,"January 12, 2016","\r\n \r\n Mr. Speaker, Mr. Vice Pres...",State of the Union,Democrat,0,35410,1930,0.054504,2016,276,145,0.007794,0.004095,0.9999,0.099,0.731,0.171
4,4,4,Barack Obama,"January 20, 2015","\r\n \r\n Mr. Speaker, Mr. Vice Pres...",State of the Union,Democrat,0,40477,2067,0.051066,2015,299,162,0.007387,0.004002,1.0,0.083,0.739,0.178


And visualize the content of one sample speech:

In [6]:
df.loc[1]['Speech']

'\r\n      \r\n      Mr. Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States, and my fellow Americans:\r\n\r\nLess than 1 year has passed since I first stood at this podium, in this majestic chamber, to speak on behalf of the American People—and to address their concerns, their hopes, and their dreams. That night, our new Administration had already taken swift action. A new tide of optimism was already sweeping across our land.\r\n\r\nEach day since, we have gone forward with a clear vision and a righteous mission—to make America great again for all Americans.\r\n\r\nOver the last year, we have made incredible progress and achieved extraordinary success. We have faced challenges we expected, and others we could never have imagined. We have shared in the heights of victory and the pains of hardship. We endured floods and fires and storms. But through it all, we have seen the beauty of America’s soul, and the steel in America’s spine.\r\n\r\nEach test ha

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive pronouns (`government's = government\'s`)
* ``\`` before possessive pronouns 2 (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [7]:
# \r and \n
df['Speech_Parsed_1'] = df['Speech'].str.replace("\r", " ")
df['Speech_Parsed_1'] = df['Speech_Parsed_1'].str.replace("\n", " ")
df['Speech_Parsed_1'] = df['Speech_Parsed_1'].str.replace("    ", " ")

Regarding the 3rd and 4th bullets: although there appears to be a special character, it won't affect us, because the backslash is only an escape sequence in the string literal and is not part of the text itself:

In [6]:
text = "Mr Greenspan\'s"
text

"Mr Greenspan's"

In [8]:
# " when quoting text
df['Speech_Parsed_1'] = df['Speech_Parsed_1'].str.replace('"', '')
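As a side note, the three replacements above can also be expressed with a single regular expression. The following is only a sketch on a plain Python string; `normalize_whitespace` is a hypothetical helper name, and unlike the cells above it also strips leading and trailing spaces.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, CR, LF) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Shortened sample in the style of the raw speeches
sample = "\r\n      \r\n      Mr. Speaker,\r\n\r\nLess than 1 year"
print(normalize_whitespace(sample))
```

On a DataFrame column, the same pattern could be applied through `str.replace` with `regex=True`.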

### 1.2. Upcase/downcase

We'll downcase the texts because we want, for example, `Football` and `football` to be the same word.

In [9]:
# Lowercasing the text
df['Speech_Parsed_2'] = df['Speech_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [10]:
punctuation_signs = list("?:!.,;")
df['Speech_Parsed_3'] = df['Speech_Parsed_2']

for punct_sign in punctuation_signs:
    # regex=False: characters such as "?" and "." must be matched literally
    df['Speech_Parsed_3'] = df['Speech_Parsed_3'].str.replace(punct_sign, '', regex=False)

By doing this we also alter some numbers (decimal points and thousands separators are removed), but that is not a problem since we don't expect any predictive power from them.
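The effect on numbers is easy to see on a plain string. This sketch reuses the same list of punctuation signs; `strip_punctuation` is a hypothetical helper name and the sentence is made up for illustration.

```python
punctuation_signs = list("?:!.,;")

def strip_punctuation(text):
    """Remove each punctuation sign literally, one at a time."""
    for sign in punctuation_signs:
        text = text.replace(sign, "")
    return text

# The decimal point and the thousands separator disappear with the rest
print(strip_punctuation("spending rose 2.5 percent, to $1,200."))
```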

### 1.4. Possessive pronouns

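We'll also remove possessive endings (`'s`). Note that this pattern only matches the straight apostrophe; the speeches use the typographic apostrophe (`’`), so forms such as `america’s` pass through this step unchanged, as the sample output later in this section shows: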

In [11]:
df['Speech_Parsed_4'] = df['Speech_Parsed_3'].str.replace("'s", "")
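In the sample output shown later in this section, `america’s` keeps its ending after this step because the speeches use a typographic apostrophe. If that matters, one possible variant (a sketch, not what the cell above does) is a small regular expression that accepts both apostrophe styles; `strip_possessive` is a hypothetical helper name.

```python
import re

# Matches 's or ’s only when the "s" ends the word
POSSESSIVE = re.compile(r"['’]s\b")

def strip_possessive(text):
    """Drop possessive endings written with either apostrophe style."""
    return POSSESSIVE.sub("", text)

print(strip_possessive("america’s graduation rate and the government's plan"))
```

Like the original replacement, this also shortens contractions such as `it's`, which is usually harmless once stop words are removed.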

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [12]:
# Downloading punkt and wordnet from NLTK
#nltk.download('punkt')
#print("------------------------------------------------------------")
#nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sarah\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [13]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [14]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['Speech_Parsed_4']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [15]:
df['Speech_Parsed_5'] = lemmatized_text_list

Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below), it can be useful.

### 1.6. Stop words

In [16]:
# Downloading the stop words list
#nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sarah\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [17]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [18]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

To remove the stop words, we'll use a regular expression that matches whole words only, as seen in the following example:

In [19]:
example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # we need to build it like that to work properly

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

We can now loop through all the stop words:

In [20]:
df['Speech_Parsed_6'] = df['Speech_Parsed_5']

for stop_word in stop_words:

    # regex=True is required so that \b is interpreted as a word boundary
    regex_stopword = r"\b" + stop_word + r"\b"
    df['Speech_Parsed_6'] = df['Speech_Parsed_6'].str.replace(regex_stopword, '', regex=True)

We have some double/triple spaces between words because of the replacements. However, it's not a problem because we'll tokenize by the spaces later.
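The loop above runs one replacement per stop word. An alternative, sketched here on a plain string, is to combine all stop words into a single whole-word pattern and collapse the leftover spaces in the same helper. The short `stop_words` list is only an illustrative subset of the NLTK list, and `remove_stop_words` is a hypothetical name.

```python
import re

# Tiny illustrative subset; the notebook uses NLTK's full English list
stop_words = ["me", "a", "the", "of", "you're"]

# One alternation wrapped in word boundaries; re.escape guards special characters
pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b")

def remove_stop_words(text):
    """Delete whole-word stop words, then collapse the leftover spaces."""
    without_stops = pattern.sub("", text)
    return re.sub(r"\s+", " ", without_stops).strip()

print(remove_stop_words("me eating a meal of the day"))
```

Note that `meal` and `eating` are untouched even though they contain `me` and `a`, which is the point of the `\b` boundaries.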

As an example, we'll show an original speech and its modifications throughout the process:

In [22]:
df.loc[5]['Speech']

'\r\n      \r\n      Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans:\r\n\r\nToday in America, a teacher spent extra time with a student who needed it, and did her part to lift America’s graduation rate to its highest level in more than three decades.\r\n\r\nAn entrepreneur flipped on the lights in her tech startup, and did her part to add to the more than eight million new jobs our businesses have created over the past four years. \r\n\r\nAn autoworker fine-tuned some of the best, most fuel-efficient cars in the world, and did his part to help America wean itself off foreign oil.\r\n\r\nA farmer prepared for the spring after the strongest five-year stretch of farm exports in our history.  A rural doctor gave a young child the first prescription to treat asthma that his mother could afford.  A man took the bus home from the graveyard shift, bone-tired but dreaming big dreams for his son.  And in tight-knit communities across America, fathers and mothers will t

1. Special character cleaning

In [23]:
df.loc[5]['Speech_Parsed_1']

'    Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: Today in America, a teacher spent extra time with a student who needed it, and did her part to lift America’s graduation rate to its highest level in more than three decades. An entrepreneur flipped on the lights in her tech startup, and did her part to add to the more than eight million new jobs our businesses have created over the past four years.  An autoworker fine-tuned some of the best, most fuel-efficient cars in the world, and did his part to help America wean itself off foreign oil. A farmer prepared for the spring after the strongest five-year stretch of farm exports in our history.  A rural doctor gave a young child the first prescription to treat asthma that his mother could afford.  A man took the bus home from the graveyard shift, bone-tired but dreaming big dreams for his son.  And in tight-knit communities across America, fathers and mothers will tuck in their kids, put an arm around their s

2. Upcase/downcase

In [24]:
df.loc[5]['Speech_Parsed_2']

'    mr. speaker, mr. vice president, members of congress, my fellow americans: today in america, a teacher spent extra time with a student who needed it, and did her part to lift america’s graduation rate to its highest level in more than three decades. an entrepreneur flipped on the lights in her tech startup, and did her part to add to the more than eight million new jobs our businesses have created over the past four years.  an autoworker fine-tuned some of the best, most fuel-efficient cars in the world, and did his part to help america wean itself off foreign oil. a farmer prepared for the spring after the strongest five-year stretch of farm exports in our history.  a rural doctor gave a young child the first prescription to treat asthma that his mother could afford.  a man took the bus home from the graveyard shift, bone-tired but dreaming big dreams for his son.  and in tight-knit communities across america, fathers and mothers will tuck in their kids, put an arm around their s

3. Punctuation signs

In [25]:
df.loc[5]['Speech_Parsed_3']

'    mr speaker mr vice president members of congress my fellow americans today in america a teacher spent extra time with a student who needed it and did her part to lift america’s graduation rate to its highest level in more than three decades an entrepreneur flipped on the lights in her tech startup and did her part to add to the more than eight million new jobs our businesses have created over the past four years  an autoworker fine-tuned some of the best most fuel-efficient cars in the world and did his part to help america wean itself off foreign oil a farmer prepared for the spring after the strongest five-year stretch of farm exports in our history  a rural doctor gave a young child the first prescription to treat asthma that his mother could afford  a man took the bus home from the graveyard shift bone-tired but dreaming big dreams for his son  and in tight-knit communities across america fathers and mothers will tuck in their kids put an arm around their spouse remember falle

4. Possessive pronouns

In [26]:
df.loc[5]['Speech_Parsed_4']

'    mr speaker mr vice president members of congress my fellow americans today in america a teacher spent extra time with a student who needed it and did her part to lift america’s graduation rate to its highest level in more than three decades an entrepreneur flipped on the lights in her tech startup and did her part to add to the more than eight million new jobs our businesses have created over the past four years  an autoworker fine-tuned some of the best most fuel-efficient cars in the world and did his part to help america wean itself off foreign oil a farmer prepared for the spring after the strongest five-year stretch of farm exports in our history  a rural doctor gave a young child the first prescription to treat asthma that his mother could afford  a man took the bus home from the graveyard shift bone-tired but dreaming big dreams for his son  and in tight-knit communities across america fathers and mothers will tuck in their kids put an arm around their spouse remember falle

5. Stemming and Lemmatization

In [27]:
df.loc[5]['Speech_Parsed_5']

'    mr speaker mr vice president members of congress my fellow americans today in america a teacher spend extra time with a student who need it and do her part to lift america’s graduation rate to its highest level in more than three decades an entrepreneur flip on the light in her tech startup and do her part to add to the more than eight million new job our businesses have create over the past four years  an autoworker fine-tune some of the best most fuel-efficient cars in the world and do his part to help america wean itself off foreign oil a farmer prepare for the spring after the strongest five-year stretch of farm export in our history  a rural doctor give a young child the first prescription to treat asthma that his mother could afford  a man take the bus home from the graveyard shift bone-tired but dream big dream for his son  and in tight-knit communities across america father and mother will tuck in their kid put an arm around their spouse remember fall comrades and give tha

6. Stop words

In [28]:
df.loc[5]['Speech_Parsed_6']

'    mr speaker mr vice president members  congress  fellow americans today  america  teacher spend extra time   student  need     part  lift america’ graduation rate   highest level    three decades  entrepreneur flip   light   tech startup    part  add     eight million new job  businesses  create   past four years   autoworker fine-tune    best  fuel-efficient cars   world    part  help america wean   foreign oil  farmer prepare   spring   strongest five-year stretch  farm export   history   rural doctor give  young child  first prescription  treat asthma   mother could afford   man take  bus home   graveyard shift bone-tired  dream big dream   son    tight-knit communities across america father  mother  tuck   kid put  arm around  spouse remember fall comrades  give thank   home   war   twelve long years  finally come   end tonight  chamber speak  one voice   people  represent     citizens  make  state   union strong     result   efforts   lowest unemployment rate   five years   re

Finally, we can delete the intermediate columns:

In [29]:
df.head(1)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Name,Date,Speech,Type,Party,Republican,word_count,unique_word,...,vader_compound,vader_neg,vader_neu,vader_pos,Speech_Parsed_1,Speech_Parsed_2,Speech_Parsed_3,Speech_Parsed_4,Speech_Parsed_5,Speech_Parsed_6
0,0,0,Donald Trump,"February 05, 2019","\r\n \r\n \r\nMadam Speaker, Mr. Vic...",State of the Union,Republican,1,30948,1909,...,0.9999,0.123,0.7,0.176,"Madam Speaker, Mr. Vice President, Membe...","madam speaker, mr. vice president, membe...",madam speaker mr vice president members ...,madam speaker mr vice president members ...,madam speaker mr vice president members ...,madam speaker mr vice president members ...


In [28]:
# Keep the identifying columns, the label and the final parsed text.
# The parsed column keeps the name Speech_Parsed_6, which the train-test split uses.
list_columns = ["Name", "Date", "Type", "Party", "Republican", "Speech", "Speech_Parsed_6"]
df = df[list_columns]

In [29]:
df.head()



**IMPORTANT:**

We need to remember that our model will gather the latest news articles from different newspapers whenever we run it. For that reason, we need to take into account not only the peculiarities of the training set articles, but also those that may be present in the newly gathered articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.

## 2. Label coding

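The label is already coded in the `Republican` column (1 for a Republican speaker, 0 otherwise in the rows shown earlier). We'll create a dictionary with the label codification:

In [30]:
# The label names are chosen here for readability; the codes come from the data
category_codes = {
    'other': 0,
    'republican': 1
}

In [31]:
# Category mapping: the Republican column already stores these codes
df['Category_Code'] = df['Republican']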
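A code dictionary of this kind is mostly useful in the other direction, to turn predicted codes back into readable names. The sketch below assumes a two-label dictionary; the label names `other` and `republican` and the `predicted` list are illustrative assumptions, not values taken from the dataset.

```python
# Illustrative label names; only the codes 0 and 1 appear in the data
category_codes = {"other": 0, "republican": 1}

# Invert the dictionary so that codes map back to names
code_to_category = {code: name for name, code in category_codes.items()}

predicted = [1, 0, 1]  # made-up model output
print([code_to_category[code] for code in predicted])
```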

In [32]:
df.head()



## 3. Train - test split

We'll set apart a test set to assess the quality of our models. We'll do cross-validation on the training set in order to tune the hyperparameters and then test performance on the unseen data of the test set.

In [31]:
X_train, X_test, y_train, y_test = train_test_split(df['Speech_Parsed_6'], 
                                                    df['Republican'], 
                                                    test_size=0.15, 
                                                    random_state=8)

Since we don't have many observations (only 172: the 146 training rows plus the 26 test rows reported by the shapes printed in section 4), we'll choose a test set size of 15% of the full dataset.
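The split sizes can be checked by hand. This sketch assumes that `train_test_split` rounds a fractional test size up and gives the remaining rows to the training set, which is consistent with the shapes printed in section 4 (146 and 26 rows).

```python
import math

n_samples = 146 + 26   # rows of the train and test matrices printed in section 4
test_size = 0.15

n_test = math.ceil(test_size * n_samples)   # 25.8 rounded up
n_train = n_samples - n_test

print(n_train, n_test)
```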

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only considers the top
    `max_features` terms ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.
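The two document-frequency thresholds can be illustrated without scikit-learn. In this sketch an integer threshold is read as an absolute number of documents and a float as a proportion of the corpus, which is how the scikit-learn documentation describes them; the `vocabulary` helper and the four toy documents are assumptions made for illustration, and only unigrams are counted.

```python
from collections import Counter

docs = [
    "unite state congress",
    "unite state budget",
    "unite state court",
    "fiscal year budget",
]

def vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))   # count each term once per document
    # Integer thresholds are absolute counts, float thresholds are proportions
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, freq in doc_freq.items() if low <= freq <= high)

print(vocabulary(docs, min_df=2, max_df=1.0))
print(vocabulary(docs, min_df=2, max_df=0.5))
```

With `max_df = 1.` (a float, so 100% of the documents), as used in this notebook, no term is excluded for being too frequent.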

It needs to be mentioned that we are implicitly scaling our data when representing it as TF-IDF features with the argument `norm`.
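To make the scaling explicit, here is a hand-rolled sketch of the weighting for three toy documents. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default, the sublinear term frequency `1 + ln(tf)`, and L2 normalization; the helper name and the documents are made up.

```python
import math
from collections import Counter

docs = [["court", "court", "tax"], ["tax", "job"], ["job", "job", "job"]]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Sublinear TF times smoothed IDF, scaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = []
    for term in vocab:
        tf = counts[term]
        if tf == 0:
            weights.append(0.0)
            continue
        sublinear_tf = 1.0 + math.log(tf)
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        weights.append(sublinear_tf * idf)
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

for doc in docs:
    vector = tfidf_vector(doc)
    # The squared weights of every document sum to 1 after normalization
    print([round(w, 3) for w in vector], round(sum(w * w for w in vector), 6))
```

Because every row has unit length, long and short speeches end up on a comparable scale.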

In [32]:
# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300

We have chosen these values as a first approximation. Since the models that we develop later have very good predictive power, we'll stick to these values. But it has to be mentioned that different combinations could be tried in order to further improve the accuracy of the models.

In [33]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(146, 300)
(26, 300)


Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.

We can use the chi-squared test in order to see which unigrams and bigrams are most correlated with each category:

In [38]:
from sklearn.feature_selection import chi2
import numpy as np

# Loop over the distinct label codes (0 and 1), not over every row of the column
for party in sorted(df['Republican'].unique()):
    features_chi2 = chi2(features_train, labels_train == party)
    indices = np.argsort(features_chi2[0])
    # get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(party))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


# '0' category:
  . Most correlated unigrams:
. cent
. common
. commission
. per
. court
  . Most correlated bigrams:
. unite state
. per cent

# '1' category:
  . Most correlated unigrams:
. cent
. common
. commission
. per
. court
  . Most correlated bigrams:
. unite state
. per cent


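As we can see, the same unigrams and bigrams are listed for every category, so this output does not tell the categories apart. That is expected with a binary label: the chi-squared scores computed against `labels_train == 0` and against `labels_train == 1` are identical. If we get the bigrams in our features: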

In [39]:
bigrams

['fiscal year', 'unite state', 'per cent']

We can see there are only three. Unigrams are far more frequent than bigrams, and since we're restricting the vocabulary to the 300 most frequent terms (`max_features`), only a few bigrams are being considered.
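The chi-squared output above lists the same terms for every category. The sketch below shows why that happens with a two-class label: flipping the label leaves each feature's score unchanged. It follows the observed-versus-expected construction based on per-class sums of feature values, which is my understanding of how scikit-learn's `chi2` works; the function name and the toy matrix are assumptions.

```python
def chi2_scores(X, y):
    """Chi-squared score of each feature column against a binary label (0/1)."""
    n_samples = len(y)
    scores = []
    for j in range(len(X[0])):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for cls in (0, 1):
            class_prob = sum(1 for label in y if label == cls) / n_samples
            observed = sum(row[j] for row, label in zip(X, y) if label == cls)
            expected = class_prob * feature_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Toy TF-IDF-like matrix: 4 documents, 2 features
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9]]
y = [1, 1, 0, 0]
flipped = [1 - label for label in y]

print(chi2_scores(X, y))
print(chi2_scores(X, flipped))   # same scores: the two classes only swap roles
```

With more than two classes, a one-vs-rest comparison such as `labels == code` would produce a different ranking per class.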

Let's save the files we'll need in the next steps:

In [40]:
# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
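A later notebook can read these files back with `pickle.load`. The round trip below is a minimal sketch that writes to a temporary directory instead of `Pickles/`; the `settings` dictionary is a made-up stand-in for the real objects.

```python
import os
import pickle
import tempfile

# Stand-in object; the notebook pickles DataFrames, arrays and the vectorizer
settings = {"ngram_range": (1, 2), "min_df": 10, "max_features": 300}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "settings.pickle")

    # Write in binary mode, as in the cell above
    with open(path, "wb") as output:
        pickle.dump(settings, output)

    # Read it back the same way
    with open(path, "rb") as data:
        restored = pickle.load(data)

print(restored == settings)
```

Note that `open(..., 'wb')` does not create missing folders, so the `Pickles` directory has to exist before the cell above runs (for example via `os.makedirs('Pickles', exist_ok=True)`).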