# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, lowercasing, removal of punctuation signs, possessive pronoun endings and stop words, and lemmatization.
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [150]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

First of all we'll load the dataset:

In [151]:
path_df = "../02. Exploratory Data Analysis/AutonomousCars_dataset.pickle"

with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [152]:
df.head()

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length
0,1.txt,b'Road vehicle that is capable of moving safel...,Autonomous Cars,1.txt-Autonomous Cars,1,100847
1,100.txt,"b""The U.S. Department of Transportation (USDOT...",Autonomous Cars,100.txt-Autonomous Cars,1,2239
2,101.txt,b'National Security Implications of Leadership...,Autonomous Cars,101.txt-Autonomous Cars,1,28636
3,102.txt,b'Road vehicle that is capable of moving safel...,Autonomous Cars,102.txt-Autonomous Cars,1,100847
4,104.txt,b'The dream of the self-driving car \xe2\x80\x...,Autonomous Cars,104.txt-Autonomous Cars,1,17677


And look at the content of one sample article:

In [153]:
df.loc[1]['Content']

'b"The U.S. Department of Transportation (USDOT) is committed to facilitating a new era of transportation innovation and safety and ensuring that our country remains a leader in automation. USDOT is acting as a convener and facilitator, partnering with a broad coalition of industry, academic, states and local, safety advocacy, and transportation stakeholders to support the safe development, testing, and integration of automated vehicle technologies.\\n\\nUSDOT developed the Automated Vehicles Comprehensive Plan to advance the Department\'s work to prioritize safety while preparing for the future of transportation. Building upon the principles stated in AV 4.0, the plan defines three goals to achieve this vision for Automated Driving Systems (ADS): Promote Collaboration and Transparency, Modernize the Regulatory Environment, and Prepare the Transportation System\\n\\nUSDOT and the White House Office of Science and Technology Policy developed Ensuring American Leadership in Automated Veh

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive endings (`government's` = `government\'s`)
* ``\`` before a bare possessive apostrophe (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [154]:
# \r and \n
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

Regarding the third and fourth bullets: although a backslash appears before the apostrophe, it won't affect us, since it is only an escape in the printed representation of the string and not a *real* character in the text:

In [155]:
text = "Mr Greenspan\'s"
text

"Mr Greenspan's"

In [156]:
# " when quoting text
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')
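
The sample outputs in this notebook start with `b'` and contain literal two-character sequences such as `\n` and `\xe2\x80\x99`, which suggests that `Content` holds the printed form of a bytes object rather than decoded text. If that is the case, the `\r` and `\n` replacements above find nothing to replace. A minimal sketch of one way to recover the real text first, using only the standard library (`decode_bytes_repr` is a hypothetical helper, not part of the original pipeline):

```python
import ast


def decode_bytes_repr(text: str) -> str:
    """Turn a string like "b'...'" back into real text; leave anything else untouched."""
    if not (text.startswith("b'") or text.startswith('b"')):
        return text
    try:
        raw = ast.literal_eval(text)  # safely parses the bytes literal
    except (ValueError, SyntaxError):
        return text  # malformed or truncated literal: keep the original string
    if not isinstance(raw, bytes):
        return text
    return raw.decode("utf-8", errors="replace")


# Illustrative sample: backslashes are doubled so the string holds literal "\n" and "\xe2"
sample = "b'our nation\\xe2\\x80\\x99s roads\\n\\nIn September 2016'"

decoded = decode_bytes_repr(sample)
cleaned = decoded.replace("\r", " ").replace("\n", " ")
print(cleaned)
```

With pandas this could be applied as `df['Content'].apply(decode_bytes_repr)` before any other cleaning step. The outputs shown in the rest of this notebook were produced without it.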

### 1.2. Upcase/downcase

We'll downcase the texts because we want, for example, `Football` and `football` to be the same word.

In [157]:
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [158]:
punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']

for punct_sign in punctuation_signs:
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)



By doing this we also alter some numbers (for example, `2.0` becomes `20`), but that's not a problem since we aren't expecting any predictive power from them.
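
As a small illustration of this side effect, here is a sketch that removes the same six punctuation signs from a single string with `str.translate`. The sample sentence is made up for the example.

```python
punctuation_signs = "?:!.,;"

# Translation table that deletes every character listed in punctuation_signs
delete_punctuation = str.maketrans("", "", punctuation_signs)

text = "a vision for safety 2.0, issued in 2017."
stripped = text.translate(delete_punctuation)
print(stripped)  # a vision for safety 20 issued in 2017
```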

### 1.4. Possessive pronouns

We'll also remove possessive pronoun terminations:

In [159]:
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")
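
A plain replacement of `'s` only handles the straight apostrophe and also touches quoted words that start with `s`. The sketch below shows one possible stricter variant with a regular expression. The pattern and the sample sentence are assumptions made for illustration, not what the notebook ran.

```python
import re

# Straight or typographic apostrophe, followed by "s" at the end of a word
possessive = re.compile(r"['\u2019]s\b")

text = "the government's plan for the nation\u2019s 'safe' roads"

with_regex = possessive.sub("", text)
with_plain_replace = text.replace("'s", "")

print(with_regex)          # the government plan for the nation 'safe' roads
print(with_plain_replace)  # the typographic possessive stays and 'safe' loses its "s"
```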

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [160]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mohan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mohan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [161]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [162]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['Content_Parsed_4']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [163]:
df['Content_Parsed_5'] = lemmatized_text_list

Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below), it can be useful.

### 1.6. Stop words

In [164]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mohan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [165]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [166]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

To remove the stop words, we'll use a regular expression that only matches whole words, as seen in the following example:

In [167]:
example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # raw strings keep \b as a word boundary rather than a backspace

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

We can now loop through all the stop words:

In [168]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

for stop_word in stop_words:

    regex_stopword = r"\b" + stop_word + r"\b"
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)



We have some double/triple spaces between words because of the replacements. However, it's not a problem because we'll tokenize by the spaces later.
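
If single spaces are preferred, the per-word loop could be replaced by one compiled pattern followed by a whitespace cleanup. This is only a sketch: `remove_stop_words` is a hypothetical helper and the stop word list is a short illustrative subset, not the full NLTK list.

```python
import re

# Short illustrative subset of stop words (the notebook uses the full NLTK list)
stop_words = ["i", "me", "my", "the", "of", "you", "you're"]

# Longer entries first, so "you're" is tried before "you"
ordered = sorted(stop_words, key=len, reverse=True)
stop_word_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in ordered) + r")\b"
)


def remove_stop_words(text: str) -> str:
    """Delete whole-word stop words, then collapse the leftover runs of spaces."""
    without_stop_words = stop_word_pattern.sub("", text)
    return re.sub(r"\s+", " ", without_stop_words).strip()


print(remove_stop_words("me eating the meal of my life"))  # eating meal life
```

`re.escape` keeps any special character inside a stop word from being read as regex syntax, and the word boundaries keep `me` from matching inside `meal`.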

As an example, we'll show an original news article and its modifications throughout the process:

In [169]:
df.loc[5]['Content']

"b'NHTSA In Action NHTSA is dedicated to advancing the lifesaving potential of new vehicle technologies\\n\\nNHTSA demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways through its proactive and inclusive approach to the safe development, testing, and deployment of new and advanced vehicle technologies that have enormous potential for improving safety and mobility for all Americans, NHTSA demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways.\\n\\nIn September 2016, NHTSA and the U.S. Department of Transportation issued the Federal Automated Vehicles Policy which set forth a proactive approach to providing safety assurance and facilitating innovation. Building on that policy and incorporating feedback received through public comments, stakeholder meetings, and Congressional hearings, in September 2017, the agency issued, Automated Driving Systems: A Vision for Safety 2.0. The updated guidance, 2.0, offers a f

1. Special character cleaning

In [170]:
df.loc[5]['Content_Parsed_1']

"b'NHTSA In Action NHTSA is dedicated to advancing the lifesaving potential of new vehicle technologies\\n\\nNHTSA demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways through its proactive and inclusive approach to the safe development, testing, and deployment of new and advanced vehicle technologies that have enormous potential for improving safety and mobility for all Americans, NHTSA demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways.\\n\\nIn September 2016, NHTSA and the U.S. Department of Transportation issued the Federal Automated Vehicles Policy which set forth a proactive approach to providing safety assurance and facilitating innovation. Building on that policy and incorporating feedback received through public comments, stakeholder meetings, and Congressional hearings, in September 2017, the agency issued, Automated Driving Systems: A Vision for Safety 2.0. The updated guidance, 2.0, offers a f

2. Upcase/downcase

In [171]:
df.loc[5]['Content_Parsed_2']

"b'nhtsa in action nhtsa is dedicated to advancing the lifesaving potential of new vehicle technologies\\n\\nnhtsa demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways through its proactive and inclusive approach to the safe development, testing, and deployment of new and advanced vehicle technologies that have enormous potential for improving safety and mobility for all americans, nhtsa demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways.\\n\\nin september 2016, nhtsa and the u.s. department of transportation issued the federal automated vehicles policy which set forth a proactive approach to providing safety assurance and facilitating innovation. building on that policy and incorporating feedback received through public comments, stakeholder meetings, and congressional hearings, in september 2017, the agency issued, automated driving systems: a vision for safety 2.0. the updated guidance, 2.0, offers a f

3. Punctuation signs

In [172]:
df.loc[5]['Content_Parsed_3']

"b'nhtsa in action nhtsa is dedicated to advancing the lifesaving potential of new vehicle technologies\\n\\nnhtsa demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways through its proactive and inclusive approach to the safe development testing and deployment of new and advanced vehicle technologies that have enormous potential for improving safety and mobility for all americans nhtsa demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways\\n\\nin september 2016 nhtsa and the us department of transportation issued the federal automated vehicles policy which set forth a proactive approach to providing safety assurance and facilitating innovation building on that policy and incorporating feedback received through public comments stakeholder meetings and congressional hearings in september 2017 the agency issued automated driving systems a vision for safety 20 the updated guidance 20 offers a flexible nonregulato

4. Possessive pronouns

In [173]:
df.loc[5]['Content_Parsed_4']

"b'nhtsa in action nhtsa is dedicated to advancing the lifesaving potential of new vehicle technologies\\n\\nnhtsa demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways through its proactive and inclusive approach to the safe development testing and deployment of new and advanced vehicle technologies that have enormous potential for improving safety and mobility for all americans nhtsa demonstrates its dedication to saving lives on our nation\\xe2\\x80\\x99s roads and highways\\n\\nin september 2016 nhtsa and the us department of transportation issued the federal automated vehicles policy which set forth a proactive approach to providing safety assurance and facilitating innovation building on that policy and incorporating feedback received through public comments stakeholder meetings and congressional hearings in september 2017 the agency issued automated driving systems a vision for safety 20 the updated guidance 20 offers a flexible nonregulato

5. Stemming and Lemmatization

In [174]:
df.loc[5]['Content_Parsed_5']

"b'nhtsa in action nhtsa be dedicate to advance the lifesaving potential of new vehicle technologies\\n\\nnhtsa demonstrate its dedication to save live on our nation\\xe2\\x80\\x99s roads and highways through its proactive and inclusive approach to the safe development test and deployment of new and advance vehicle technologies that have enormous potential for improve safety and mobility for all americans nhtsa demonstrate its dedication to save live on our nation\\xe2\\x80\\x99s roads and highways\\n\\nin september 2016 nhtsa and the us department of transportation issue the federal automate vehicles policy which set forth a proactive approach to provide safety assurance and facilitate innovation build on that policy and incorporate feedback receive through public comment stakeholder meet and congressional hear in september 2017 the agency issue automate drive systems a vision for safety 20 the update guidance 20 offer a flexible nonregulatory approach to automate vehicle technology s

6. Stop words

In [175]:
df.loc[5]['Content_Parsed_6']

"b'nhtsa  action nhtsa  dedicate  advance  lifesaving potential  new vehicle technologies\\n\\nnhtsa demonstrate  dedication  save live   nation\\xe2\\x80\\x99s roads  highways   proactive  inclusive approach   safe development test  deployment  new  advance vehicle technologies   enormous potential  improve safety  mobility   americans nhtsa demonstrate  dedication  save live   nation\\xe2\\x80\\x99s roads  highways\\n\\nin september 2016 nhtsa   us department  transportation issue  federal automate vehicles policy  set forth  proactive approach  provide safety assurance  facilitate innovation build   policy  incorporate feedback receive  public comment stakeholder meet  congressional hear  september 2017  agency issue automate drive systems  vision  safety 20  update guidance 20 offer  flexible nonregulatory approach  automate vehicle technology safety  support  automotive industry   key stakeholders   consider  design best practice   safe test  deployment  ads level 3  5  also provi

Finally, we can delete the intermediate columns:

In [176]:
df.head(1)

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length,Content_Parsed_1,Content_Parsed_2,Content_Parsed_3,Content_Parsed_4,Content_Parsed_5,Content_Parsed_6
0,1.txt,b'Road vehicle that is capable of moving safel...,Autonomous Cars,1.txt-Autonomous Cars,1,100847,b'Road vehicle that is capable of moving safel...,b'road vehicle that is capable of moving safel...,b'road vehicle that is capable of moving safel...,b'road vehicle that is capable of moving safel...,b'road vehicle that be capable of move safely ...,b'road vehicle capable move safely little ...


In [177]:
list_columns = ["File_Name", "Category", "Complete_Filename", "Content", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [178]:
df.head()

Unnamed: 0,File_Name,Category,Complete_Filename,Content,Content_Parsed
0,1.txt,Autonomous Cars,1.txt-Autonomous Cars,b'Road vehicle that is capable of moving safel...,b'road vehicle capable move safely little ...
1,100.txt,Autonomous Cars,100.txt-Autonomous Cars,"b""The U.S. Department of Transportation (USDOT...",bthe us department transportation (usdot) co...
2,101.txt,Autonomous Cars,101.txt-Autonomous Cars,b'National Security Implications of Leadership...,b'national security implications leadership ...
3,102.txt,Autonomous Cars,102.txt-Autonomous Cars,b'Road vehicle that is capable of moving safel...,b'road vehicle capable move safely little ...
4,104.txt,Autonomous Cars,104.txt-Autonomous Cars,b'The dream of the self-driving car \xe2\x80\x...,b' dream self-driving car \xe2\x80\x94 get ...


**IMPORTANT:**

We need to remember that our model will gather the latest news articles from different newspapers whenever we run it. For that reason, we need to take into account not only the peculiarities of the training set articles, but also those that may be present in the gathered news articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.

## 2. Label coding

We'll create a dictionary with the label codification:

In [179]:
category_codes = {
    'Autonomous Cars': 0,
    'Other': 1,
}

In [180]:
# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})
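
The same dictionary can be inverted to turn predicted codes back into category names later on. A minimal sketch with plain Python lists (`code_to_category` and the sample list are names introduced here for illustration):

```python
category_codes = {
    'Autonomous Cars': 0,
    'Other': 1,
}

# Hypothetical sample of category labels
categories = ['Autonomous Cars', 'Other', 'Autonomous Cars']

# Forward mapping: category name -> code
codes = [category_codes[name] for name in categories]

# Inverse mapping: code -> category name
code_to_category = {code: name for name, code in category_codes.items()}
decoded = [code_to_category[code] for code in codes]

print(codes)    # [0, 1, 0]
print(decoded)  # ['Autonomous Cars', 'Other', 'Autonomous Cars']
```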

In [181]:
df.head()

Unnamed: 0,File_Name,Category,Complete_Filename,Content,Content_Parsed,Category_Code
0,1.txt,Autonomous Cars,1.txt-Autonomous Cars,b'Road vehicle that is capable of moving safel...,b'road vehicle capable move safely little ...,0
1,100.txt,Autonomous Cars,100.txt-Autonomous Cars,"b""The U.S. Department of Transportation (USDOT...",bthe us department transportation (usdot) co...,0
2,101.txt,Autonomous Cars,101.txt-Autonomous Cars,b'National Security Implications of Leadership...,b'national security implications leadership ...,0
3,102.txt,Autonomous Cars,102.txt-Autonomous Cars,b'Road vehicle that is capable of moving safel...,b'road vehicle capable move safely little ...,0
4,104.txt,Autonomous Cars,104.txt-Autonomous Cars,b'The dream of the self-driving car \xe2\x80\x...,b' dream self-driving car \xe2\x80\x94 get ...,0


## 3. Train - test split

We'll set apart a test set to assess the quality of our models. We'll do cross-validation on the training set in order to tune the hyperparameters, and then test performance on the unseen data of the test set.

In [182]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)
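
As a quick sanity check on `test_size=0.15`, the sketch below reproduces the split sizes by hand. It assumes that the test size is rounded up and the rest goes to training, which is my understanding of how `train_test_split` handles a fractional `test_size`. The total of 526 documents is taken from the feature matrix shapes printed further down (447 + 79).

```python
import math

n_samples = 447 + 79  # total documents, from the shapes printed later in the notebook
test_size = 0.15

# Assumption: the test fraction is rounded up and training gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 447 79
```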

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only considers the top
    `max_features` terms ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.

Note that we are implicitly scaling our data when representing it as TF-IDF features, through the `norm` argument.
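
To make the scaling concrete, here is a minimal pure-Python sketch of a TF-IDF weighting with sublinear term frequency and L2 normalization. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is my understanding of scikit-learn's default. The three toy documents and the helper `tfidf_vector` are made up for the example.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (illustrative only)
docs = [
    ["self", "driving", "car"],
    ["dark", "chocolate", "bar"],
    ["self", "driving", "self"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(doc):
    """Sublinear tf (1 + ln(count)) times idf, scaled to unit L2 norm."""
    counts = Counter(doc)
    weights = {term: (1 + math.log(count)) * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
    return {term: weight / norm for term, weight in weights.items()}


vector = tfidf_vector(docs[2])
print(vector)
print(sum(weight ** 2 for weight in vector.values()))  # squared weights sum to 1 (up to rounding)
```

Because every document vector ends up with length 1, long and short articles become comparable, which is the implicit scaling mentioned above.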

In [183]:
# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300

We have chosen these values as a first approximation. Since the models we develop later have very good predictive power, we'll stick to these values. Different combinations could still be tried in order to further improve the accuracy of the models.

In [184]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(447, 300)
(79, 300)


Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.
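
The reason is that the vocabulary (and the IDF weights) must come from the training data only, so that nothing from the test set leaks into the features. A simplified sketch of the idea with plain word counts instead of TF-IDF (the documents and the helper `count_vector` are illustrative):

```python
train_docs = ["autonomous vehicle safety", "dark chocolate bar"]
test_docs = ["autonomous truck safety"]

# "Fit": learn the vocabulary from the training documents only
vocabulary = sorted({token for doc in train_docs for token in doc.split()})


def count_vector(doc):
    """'Transform': count occurrences of each known term; unknown terms are ignored."""
    tokens = doc.split()
    return [tokens.count(term) for term in vocabulary]


test_vectors = [count_vector(doc) for doc in test_docs]

print(vocabulary)    # ['autonomous', 'bar', 'chocolate', 'dark', 'safety', 'vehicle']
print(test_vectors)  # [[1, 0, 0, 0, 1, 0]]
```

The word `truck` never appeared in training, so it has no column and is silently dropped from the test vector.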

We can use the Chi squared test in order to see which unigrams and bigrams are most correlated with each category:

As we can see, the unigrams correspond well to their category. However, bigrams do not. If we get the bigrams in our features:

In [185]:
bigrams

['xe2 x80',
 'x80 x99re',
 'x80 x93',
 'unite state',
 'x80 x99s',
 'x80 x94',
 'x80 x99t',
 'x80 x9d',
 'xc2 xb0c',
 'cocoa bean',
 'human driver',
 'driving car',
 'automate vehicles',
 'cocoa butter',
 'chocolate bar',
 'milk chocolate',
 'dark chocolate',
 'driving cars',
 'autonomous vehicle',
 'autonomous vehicles',
 'self driving']

We can see there are only 21 of them among the 300 features, and eight of those (such as `xe2 x80`) are fragments of escaped UTF-8 bytes rather than real word pairs. This suggests the unigrams have more correlation with the category than the bigrams, and since we're restricting the number of features to 300, only a few bigrams are being considered.
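
The Chi squared ranking mentioned above can be sketched in plain Python. The function below follows the usual observed-versus-expected formulation, which is also my understanding of what `sklearn.feature_selection.chi2` computes for non-negative features. The tiny feature matrix, the term names and `chi2_scores` itself are made up for the example and are not the notebook's data.

```python
def chi2_scores(X, y):
    """Chi-squared statistic per feature; X holds non-negative values, y the class labels."""
    n_rows = len(X)
    n_features = len(X[0])
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for cls in sorted(set(y)):
        class_prob = sum(1 for label in y if label == cls) / n_rows
        for j in range(n_features):
            # Observed: total weight of feature j inside this class
            observed = sum(row[j] for row, label in zip(X, y) if label == cls)
            # Expected: the same total if the feature were independent of the class
            expected = class_prob * feature_totals[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores


# Toy TF-IDF-like features for three terms and two classes
terms = ["vehicle", "chocolate", "new"]
X = [
    [1.0, 0.0, 0.5],
    [0.9, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.0, 0.8, 0.5],
]
y = [0, 0, 1, 1]

scores = chi2_scores(X, y)
ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Terms that appear in only one class get the highest scores, while a term spread evenly across both classes (`new` here) scores zero.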

Let's save the files we'll need in the next steps:

In [186]:
# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
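
The cell above assumes that a `Pickles/` folder already exists. As a sketch of a more compact alternative, the objects could be collected in a dictionary and written in a loop. The example below uses a temporary directory and two small stand-in objects so that it runs on its own; the names in `artifacts` are illustrative.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in objects; in the notebook these would be X_train, features_train, tfidf, etc.
artifacts = {
    "category_codes": {"Autonomous Cars": 0, "Other": 1},
    "feature_shape": (447, 300),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = Path(tmp_dir) / "Pickles"
    out_dir.mkdir()  # create the folder before writing into it

    # Save every object under <name>.pickle
    for name, obj in artifacts.items():
        with open(out_dir / f"{name}.pickle", "wb") as output:
            pickle.dump(obj, output)

    # Load them back to check the round trip
    restored = {}
    for name in artifacts:
        with open(out_dir / f"{name}.pickle", "rb") as data:
            restored[name] = pickle.load(data)

print(restored == artifacts)  # True
```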