# Feature Engineering
The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

Text cleaning and preparation: removal of special characters, lowercasing, removal of punctuation signs, possessive endings and stop words, and lemmatization.
Label coding: creation of a dictionary to map each category to a code.
Train-test split: to test the models on unseen data.
Text representation: use of TF-IDF scores to represent text.

In [1]:

import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

In [2]:

path_df = "News_dataset.pickle"

with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [3]:
df.head()

Unnamed: 0,Article,labels,News_length
0,You need enable JavaScript view site . Skip Co...,0,9339
1,Skip main content The Verge homepage Follow Th...,1,6029
2,Product Reviews Top Products Appliances Babies...,1,23926
3,Markets Tech Media Success Perspectives Videos...,1,12377
4,"Skip main content September 4 , 2021 Volume XI...",1,24799



# 1. Text cleaning and preparation
# 1.1. Special character cleaning
We can see the following special characters:

\r
\n
\ before possessive pronouns (government's = government\'s)
\ before possessive pronouns 2 (Yukos' = Yukos\')
" when quoting text

In [4]:
df['Content_Parsed_1'] = df['Article'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

In [6]:
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')
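
As a rough alternative to the chained replacements above, the same kind of cleanup can be sketched with the standard `re` module. The string `raw` is made up for illustration and is not taken from the dataset. Note that this version behaves slightly differently from the cells above: any run of whitespace (including `\r\n` and repeated spaces) collapses into a single space.

```python
import re

# Hypothetical raw text with carriage returns, newlines, repeated spaces and quotes
raw = 'Skip main\r\ncontent    "The Verge"\nhomepage'

# Collapse every run of whitespace into one space, then drop double quotes
cleaned = re.sub(r"\s+", " ", raw).replace('"', "")

print(cleaned)
```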

# 1.2. Upcase/downcase
We'll downcase the texts because we want, for example, Football and football to be the same word.

In [7]:

# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()


# 1.3. Punctuation signs
Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [8]:

punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']

for punct_sign in punctuation_signs:
    # regex=False: signs such as "?" and "." must be treated as literal characters
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)
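
The small sketch below, on a made-up sentence, shows why punctuation signs should be handled as literal characters: several of them are regular expression metacharacters. It uses only the standard library, and `str.translate` is shown as one possible single-pass alternative to the loop, not as what the notebook does.

```python
import re

text = "ready help. loading... why michigan?"  # hypothetical sample
punctuation_signs = list("?:!.,;")

# Interpreted as a regular expression, "." matches any character,
# so this call removes the whole string instead of just the dots
wiped = re.sub(".", "", text)

# Literal replacement, one sign at a time (same idea as the loop above)
cleaned = text
for sign in punctuation_signs:
    cleaned = cleaned.replace(sign, "")

# Single-pass alternative: a translation table that deletes every sign
table = str.maketrans("", "", "".join(punctuation_signs))
translated = text.translate(table)

print(repr(wiped))
print(cleaned)
```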

# 1.4. Possessive pronouns
We'll also remove possessive endings ('s):

In [9]:
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")
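
Web text often mixes the straight apostrophe (') with the typographic one (’), and the replacement above only covers the first. A possible variant that handles both is sketched here on a made-up sentence; the pattern is an assumption about what one might want, not something the notebook applies.

```python
import re

text = "government's plan and michigan’s future"  # hypothetical sample

# Remove 's or ’s only when it ends a word
cleaned = re.sub(r"['’]s\b", "", text)

print(cleaned)
```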

# 1.5. Stemming and Lemmatization
Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [10]:

# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tt0342\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tt0342\AppData\Roaming\nltk_data...


------------------------------------------------------------


[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [11]:
wordnet_lemmatizer = WordNetLemmatizer()

In [12]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list to hold the lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row, 'Content_Parsed_4']
    text_words = text.split(" ")

    # Iterate through every word and lemmatize it as a verb (pos="v")
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [13]:
df['Content_Parsed_5'] = lemmatized_text_list

# 1.6. Stop words
We'll remove the English stop words provided by NLTK, matching each one as a whole word (the NLTK stopwords corpus must be available, e.g. via nltk.download('stopwords')).

In [14]:
stop_words = list(stopwords.words('english'))

In [15]:

example = "me eating a meal"
word = "me"

# The regular expression wraps the word in \b (word boundary) anchors,
# so that only the whole word matches and not, say, the "me" in "meal"
regex = r"\b" + word + r"\b"

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

In [16]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

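for stop_word in stop_words:

    # Match the stop word only as a whole word; regex=True is required in pandas >= 2.0,
    # where str.replace treats the pattern as a literal string by default
    regex_stopword = r"\b" + re.escape(stop_word) + r"\b"
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)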
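
The loop above scans the column once per stop word. One possible alternative is to combine all stop words into a single pattern. The sketch below uses only the standard library and a tiny, hypothetical stop word list in place of the NLTK one.

```python
import re

# Hypothetical mini stop word list; the notebook uses NLTK's English list
stop_words = ["me", "a", "the", "in"]

# One alternation pattern with \b on both sides, so "me" does not match
# inside "meal" and "the" does not match inside "theatre"
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b")

def remove_stop_words(text):
    """Delete every whole-word stop word from text."""
    return pattern.sub("", text)

cleaned = remove_stop_words("me eating a meal in the theatre")
print(cleaned.split())
```

As in the notebook, the removed words leave extra spaces behind. That is harmless here because the vectorizer tokenizes the text again later.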

In [17]:
df.loc[2]['Article']



In [18]:
df.loc[2]['Content_Parsed_1']



In [19]:
df.loc[5]['Content_Parsed_2']

"industries services site selection why michigan news thousands resources . ready help . loading ... advantages office future mobility cost living calculator success stories awards rankings covid-19 resources popular newsletter sign up foreign direct investment michigan strategic fund ( msf ) about medc covid-19 response industries services site selection why michigan news about medc contact medc community michigan strategic fund freedom information act medc project map site selection pure michigan travel mobility pure opportunity pure opportunity `` in michigan , revolution air . '' talk specialist talk specialist trevor pawl chief mobility officer , office future mobility electrification michigan always kept world moving . our manufacturing automotive heritage paired mobility ecosystem talented workforce proving make difference helping shape future . join us make mark matters . learn more the office future mobility electrification michigan demonstrates leadership driving collaboratio

In [20]:
df.loc[5]['Content_Parsed_3']

"industries services site selection why michigan news thousands resources  ready help  loading  advantages office future mobility cost living calculator success stories awards rankings covid-19 resources popular newsletter sign up foreign direct investment michigan strategic fund ( msf ) about medc covid-19 response industries services site selection why michigan news about medc contact medc community michigan strategic fund freedom information act medc project map site selection pure michigan travel mobility pure opportunity pure opportunity `` in michigan  revolution air  '' talk specialist talk specialist trevor pawl chief mobility officer  office future mobility electrification michigan always kept world moving  our manufacturing automotive heritage paired mobility ecosystem talented workforce proving make difference helping shape future  join us make mark matters  learn more the office future mobility electrification michigan demonstrates leadership driving collaboration among pub

In [21]:
df.loc[5]['Content_Parsed_4']

"industries services site selection why michigan news thousands resources  ready help  loading  advantages office future mobility cost living calculator success stories awards rankings covid-19 resources popular newsletter sign up foreign direct investment michigan strategic fund ( msf ) about medc covid-19 response industries services site selection why michigan news about medc contact medc community michigan strategic fund freedom information act medc project map site selection pure michigan travel mobility pure opportunity pure opportunity `` in michigan  revolution air  '' talk specialist talk specialist trevor pawl chief mobility officer  office future mobility electrification michigan always kept world moving  our manufacturing automotive heritage paired mobility ecosystem talented workforce proving make difference helping shape future  join us make mark matters  learn more the office future mobility electrification michigan demonstrates leadership driving collaboration among pub

In [22]:
df.loc[5]['Content_Parsed_5']

"industries service site selection why michigan news thousands resources  ready help  load  advantage office future mobility cost live calculator success stories award rank covid-19 resources popular newsletter sign up foreign direct investment michigan strategic fund ( msf ) about medc covid-19 response industries service site selection why michigan news about medc contact medc community michigan strategic fund freedom information act medc project map site selection pure michigan travel mobility pure opportunity pure opportunity `` in michigan  revolution air  '' talk specialist talk specialist trevor pawl chief mobility officer  office future mobility electrification michigan always keep world move  our manufacture automotive heritage pair mobility ecosystem talented workforce prove make difference help shape future  join us make mark matter  learn more the office future mobility electrification michigan demonstrate leadership drive collaboration among public  private philanthropic p

In [23]:
df.loc[5]['Content_Parsed_6']

"industries service site selection  michigan news thousands resources  ready help  load  advantage office future mobility cost live calculator success stories award rank covid-19 resources popular newsletter sign  foreign direct investment michigan strategic fund ( msf )  medc covid-19 response industries service site selection  michigan news  medc contact medc community michigan strategic fund freedom information act medc project map site selection pure michigan travel mobility pure opportunity pure opportunity ``  michigan  revolution air  '' talk specialist talk specialist trevor pawl chief mobility officer  office future mobility electrification michigan always keep world move   manufacture automotive heritage pair mobility ecosystem talented workforce prove make difference help shape future  join us make mark matter  learn   office future mobility electrification michigan demonstrate leadership drive collaboration among public  private philanthropic partner advance state ’ mobilit

In [24]:
df.head(1)

Unnamed: 0,Article,labels,News_length,Content_Parsed_1,Content_Parsed_2,Content_Parsed_3,Content_Parsed_4,Content_Parsed_5,Content_Parsed_6
0,You need enable JavaScript view site . Skip Co...,0,9339,You need enable JavaScript view site . Skip Co...,you need enable javascript view site . skip co...,you need enable javascript view site skip con...,you need enable javascript view site skip con...,you need enable javascript view site skip con...,need enable javascript view site skip conten...


In [25]:
list_columns = ["labels","Article", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [26]:
df.head()

Unnamed: 0,labels,Article,Content_Parsed
0,0,You need enable JavaScript view site . Skip Co...,need enable javascript view site skip conten...
1,1,Skip main content The Verge homepage Follow Th...,skip main content verge homepage follow verg...
2,1,Product Reviews Top Products Appliances Babies...,product review top products appliances baby & ...
3,1,Markets Tech Media Success Perspectives Videos...,market tech media success perspectives videos ...
4,1,"Skip main content September 4 , 2021 Volume XI...",skip main content september 4 2021 volume xi ...


# 3. Train - test split
We'll set apart a test set to assess the quality of our models. We'll do cross-validation on the training set to tune the hyperparameters and then test performance on the unseen data of the test set.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['labels'], 
                                                    test_size=0.15, 
                                                    random_state=8)

In [31]:
X_train.shape

(5,)
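
To see where a training size of 5 can come from, here is a simplified sketch of a shuffled split in plain Python. It is not scikit-learn's implementation and its shuffling differs from `train_test_split`, so the selected rows will not match. It only reproduces the size arithmetic, where the test count is the fraction rounded up. The six placeholder items are an assumption based on the shapes shown in this notebook.

```python
import math
import random

def simple_split(items, test_size=0.15, seed=8):
    """Shuffle indices with a fixed seed and cut off ceil(test_size * n) items for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

# Six placeholder documents: ceil(0.15 * 6) = 1 goes to the test set
train, test = simple_split(list(range(6)))
print(len(train), len(test))
```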

# 4. Text representation
We have various options:

Count Vectors as features
TF-IDF Vectors as features
Word Embeddings as features
Text / NLP based features
Topic Models as features
We'll use TF-IDF Vectors as features.

We have to define the different parameters:

ngram_range: We want to consider both unigrams and bigrams.
max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
max_features: If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
See TfidfVectorizer? for further detail.
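
The parameters above can be illustrated with a simplified, pure-Python vocabulary builder. This is a sketch for intuition only, not scikit-learn's code. The helper names `ngrams` and `build_vocabulary` and the toy corpus are hypothetical, `min_df` is treated as an absolute document count and `max_df` as a proportion of documents.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams with n between the two bounds (inclusive)."""
    low, high = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocabulary(docs, ngram_range=(1, 2), min_df=1, max_df=1.0, max_features=None):
    doc_terms = [ngrams(doc.split(), ngram_range) for doc in docs]
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for terms in doc_terms for term in set(terms))
    # Term frequency across the whole corpus, used to rank terms for max_features
    term_freq = Counter(term for terms in doc_terms for term in terms)
    max_doc_count = max_df * len(docs)
    kept = [t for t in doc_freq if min_df <= doc_freq[t] <= max_doc_count]
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])

docs = ["self driving car", "self driving truck", "car market"]  # toy corpus
vocabulary = build_vocabulary(docs, min_df=2)
print(vocabulary)
```

With `min_df=2`, terms that occur in a single document, such as "truck" or "car market", are dropped, while the bigram "self driving" survives because it appears in two documents.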

Note that we implicitly scale our data when representing it as TF-IDF features: with the argument norm='l2', each document vector is normalized to unit Euclidean length.
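
A worked sketch of the weighting may help here. The function below is a simplified re-implementation for unigrams only, written under the assumption that it mirrors scikit-learn's documented defaults: smoothed idf, ln((1 + n) / (1 + df)) + 1, sublinear term frequency, 1 + ln(tf), and L2 normalization. The function name `tfidf_l2` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """Return (vocabulary, rows) where every row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(docs)
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # Sublinear tf: 1 + ln(count) for terms that occur, 0 otherwise
        raw = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0 for t in vocab]
        length = math.sqrt(sum(w * w for w in raw))
        rows.append([w / length for w in raw])
    return vocab, rows

docs = ["self driving car", "self driving truck", "car market news"]  # toy corpus
vocab, rows = tfidf_l2(docs)
for row in rows:
    print(round(sum(w * w for w in row), 6))
```

Every row has unit length after normalization, which is the implicit scaling mentioned above. Within a document, rarer terms such as "truck" receive a higher weight than terms shared with other documents.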

In [49]:
ngram_range = (1,2)
min_df = 3
max_df = 1.
max_features = 300

In [50]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(5, 300)
(1, 300)


In [35]:
# Iterate over the distinct category codes rather than over the rows of df
for category_id in sorted(df['labels'].unique()):
    # Chi-squared score of every feature against membership in this category
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names() was removed in scikit-learn 1.2; older versions may need it instead
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category_id))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


# '0' category:
  . Most correlated unigrams:
. help
. office
. manufacture
. mobility
. michigan
  . Most correlated bigrams:
. self driving
. covid 19

# '1' category:
  . Most correlated unigrams:
. help
. office
. manufacture
. mobility
. michigan
  . Most correlated bigrams:
. self driving
. covid 19



In [36]:
bigrams

['sign control',
 'navigate autopilot',
 'smart summon',
 'videos product',
 'home garden',
 'car buy',
 'consumer report',
 'baby kid',
 'car drive',
 'driving capability',
 'lane change',
 'light stop',
 'contact us',
 'stop sign',
 'privacy policy',
 'traffic light',
 'park lot',
 'tesla model',
 'product review',
 'law review',
 'assistance systems',
 'news videos',
 'national law',
 'dow jones',
 'tesla full',
 'beta testers',
 'driverless cars',
 'term use',
 'cnn business',
 'autonomous vehicles',
 'full self',
 'driving cars',
 'self driving',
 'covid 19']

In [38]:

# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
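
The cell above assumes that the `Pickles` folder already exists. A more compact variant is sketched below: it creates the folder if needed and writes every object in a loop. The helper names `save_pickles` and `load_pickle` are hypothetical, and the demo writes a small placeholder list to a temporary directory rather than the notebook's real objects.

```python
import os
import pickle
import tempfile

def save_pickles(objects, directory):
    """Write each object to <directory>/<name>.pickle, creating the folder if needed."""
    os.makedirs(directory, exist_ok=True)
    for name, obj in objects.items():
        with open(os.path.join(directory, name + ".pickle"), "wb") as output:
            pickle.dump(obj, output)

def load_pickle(name, directory):
    """Read back one object saved by save_pickles."""
    with open(os.path.join(directory, name + ".pickle"), "rb") as data:
        return pickle.load(data)

# Demo with a placeholder object in a temporary folder
demo_dir = os.path.join(tempfile.mkdtemp(), "Pickles")
save_pickles({"labels_train": [0, 1, 1]}, demo_dir)
restored = load_pickle("labels_train", demo_dir)
print(restored)
```

In the notebook, the dictionary passed to `save_pickles` would hold the objects saved above (for example `X_train`, `features_train` and `tfidf`), keyed by the file names used in that cell.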