# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, downcasing, removal of punctuation signs, possessive pronouns and stop words, and lemmatization.
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [2]:
# Data processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import numpy as np
import pandas as pd
import pickle
import re

# Feature selections/extractions.
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import TfidfVectorizer

# Dataset splitting
from sklearn.model_selection import train_test_split

#### Load the dataset.

In [3]:
# Set the file path.
path_df = '../02. EDA/AVArticles_dataset.pickle'

# Create the dataframe from the pickle file.
with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [4]:
df.head()

Unnamed: 0,FileName,Content,Category,Complete_Filename,id,News_length
0,na,Markets Tech Media Success Perspectives Videos...,tech,na,1,11121
1,na,"Full Episode Tuesday , Sep 7 Close Menu PBS Ne...",tech,na,1,4906
2,na,Accessibility links Skip main content Keyboard...,tech,na,1,3062
3,na,Skip main content Search Brookings About Us Pr...,tech,na,1,15330


And visualize one sample news content:

In [5]:
# Output index 1 row for column Content.
df.loc[1]['Content']

"Full Episode Tuesday , Sep 7 Close Menu PBS NewsHour Episodes Podcasts Subscribe The Latest Politics Brooks Capehart Politics Monday Supreme Court Arts CANVAS Poetry Now Read This Nation Supreme Court Race Matters Essays Brief But Spectacular World Agents Change Economy Making Sen $ e Paul Solman Science The Leading Edge ScienceScope Basic Research Innovation Invention Health Long-Term Care Education Teachers ' Lounge Student Reporting Labs For Teachers About Feedback Funders Support Jobs Close Menu Educate inbox Subscribe Here ’ Deal , politics newsletter analysis ’ find anywhere else . Email Address Subscribe Form error message goes . Thank . Please check inbox confirm . Close Popup PBS NewsHour Menu Notifications Get news alerts PBS NewsHour Turn desktop notifications ? Yes Not Full Episodes Podcasts Subscribe Live By — Nana Adwoa Antwi-Boasiako Nana Adwoa Antwi-Boasiako By — Joshua Barajas Joshua Barajas Leave feedback Share Copy URL https : //www.pbs.org/newshour/nation/uber-laun

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive pronouns (`government's = government\'s`)
* ``\`` before possessive pronouns 2 (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [6]:
# Replace \r, \n, and runs of four spaces with a single space.
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

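Regarding the 3rd and 4th bullets: although a backslash appears there, it won't affect us, since it is only an escape shown in the printed representation of the string and not a *real* character in the text. The double quotes, on the other hand, are removed: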

In [7]:
# " when quoting text
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')
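
To see why the backslash before an apostrophe is harmless, the following minimal sketch compares a string with its `repr()`, which is what a notebook cell echoes. The sample sentence is made up for illustration and is not taken from the dataset.

```python
# A string that contains both kinds of quotes (illustrative sample text).
text = 'He said "the government\'s plan"'

# repr() is what a notebook cell echoes: it escapes the apostrophe.
as_displayed = repr(text)

print(as_displayed)   # the echoed form contains a backslash
print(text)           # the real text does not
print("\\" in text)   # checks the actual characters of the string
```

The backslash exists only in the echoed form, so there is nothing to clean.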

### 1.2. Upcase/downcase

We'll downcase the texts because we want, for example, `Football` and `football` to be the same word.

In [8]:
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [9]:
# Set the punctuation signs.
punctuation_signs = list("?:!.,;")

# Create a copy of column, Content_Parsed_2
df['Content_Parsed_3'] = df['Content_Parsed_2']

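# Remove punctuation signs from column, Content_Parsed_3.
# regex=False treats signs such as "?" and "." as literal characters, not regex operators.
for punct_sign in punctuation_signs:
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)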

By doing this we also strip the separators inside some numbers (for example, `3.5` becomes `35`), but that's not a problem since we aren't expecting any predictive power from them.
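
As an alternative to looping over the signs, the same removal can be sketched in a single pass with a translation table. The helper name `strip_punctuation` and the sample sentence are assumptions made for this illustration; they are not part of the notebook.

```python
PUNCTUATION_SIGNS = "?:!.,;"  # same signs as in the cell above

# Translation table that maps every sign to "delete".
_DELETE_PUNCT = str.maketrans("", "", PUNCTUATION_SIGNS)

def strip_punctuation(text: str) -> str:
    """Return text with the listed punctuation signs removed (hypothetical helper)."""
    return text.translate(_DELETE_PUNCT)

sample = "wait, what? prices rose 3.5%!"
cleaned = strip_punctuation(sample)
print(cleaned)
```

On a dataframe column, the same table could presumably be passed to `Series.str.translate`, which would avoid one full pass over the column per sign.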

### 1.4. Possessive pronouns

We'll also remove possessive terminations (`'s`):

In [10]:
# Create a copy of column, Content_Parsed_3 with "'s" removed.
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [11]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\School\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\School\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [13]:
# Number of rows in the dataframe.
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['Content_Parsed_4']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [14]:
# Creates a new column from the lemmatized_text_list.
df['Content_Parsed_5'] = lemmatized_text_list

Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below), it can be useful.

### 1.6. Stop words

In [15]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\School\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [17]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

We can now loop through all the stop words:

In [19]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

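for stop_word in stop_words:

    # \b marks a word boundary, so only whole words are removed.
    regex_stopword = r"\b" + stop_word + r"\b"
    # regex=True is required: recent pandas versions treat the pattern as a literal by default.
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)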

We have some double/triple spaces between words because of the replacements. However, it's not a problem because we'll tokenize by the spaces later.
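
The stop-word removal can also be sketched as one compiled pattern instead of one pass per word, with the leftover spaces collapsed at the end. The helper names and the short stop-word list here are illustrative assumptions; the notebook itself uses the full NLTK list.

```python
import re

def build_stopword_pattern(stop_words):
    """Compile one whole-word pattern for all stop words (hypothetical helper)."""
    # Longest first, so "you're" is tried before "you".
    ordered = sorted(set(stop_words), key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")

def remove_stop_words(text, pattern):
    """Delete stop words, then collapse the double/triple spaces left behind."""
    without_stop_words = pattern.sub("", text)
    return " ".join(without_stop_words.split())

# Small illustrative subset of the English stop-word list.
stop_words = ["i", "me", "my", "we", "our", "you", "you're", "all", "on", "a", "for"]
pattern = build_stopword_pattern(stop_words)

result = remove_stop_words("all songs consider music on our watch", pattern)
print(result)
```

Because of the word boundaries, `on` is removed as a standalone word but the `on` inside `consider` and `songs` is left alone.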

As an example, we'll show an original news article and its modifications throughout the process:

In [20]:
df.loc[2]['Content']

"Accessibility links Skip main content Keyboard shortcuts audio player Open Navigation Menu NPR Shop Close Navigation Menu Home News Expand/collapse submenu News National World Politics Business Health Science Climate Race Arts & Life Expand/collapse submenu Arts & Life Books Movies Television Pop Culture Food Art & Design Performing Arts Life Kit Music Expand/collapse submenu Music # NowPlaying Tiny Desk All Songs Considered Music News Music Features Live Sessions Shows & Podcasts Expand/collapse submenu Shows & Podcasts Daily Morning Edition Weekend Edition Saturday Weekend Edition Sunday All Things Considered Fresh Air Up First Featured No Compromise On Our Watch Throughline Rough Translation More Shows & Podcasts Search NPR Shop # NowPlaying Tiny Desk All Songs Considered Music News Music Features Live Sessions About NPR Diversity Organization Support Careers Connect Press Ethics California Approves A Pilot Program For Driverless Rides California allowing company Cruise offer free 

1. Special character cleaning

In [21]:
df.loc[2]['Content_Parsed_1']

"Accessibility links Skip main content Keyboard shortcuts audio player Open Navigation Menu NPR Shop Close Navigation Menu Home News Expand/collapse submenu News National World Politics Business Health Science Climate Race Arts & Life Expand/collapse submenu Arts & Life Books Movies Television Pop Culture Food Art & Design Performing Arts Life Kit Music Expand/collapse submenu Music # NowPlaying Tiny Desk All Songs Considered Music News Music Features Live Sessions Shows & Podcasts Expand/collapse submenu Shows & Podcasts Daily Morning Edition Weekend Edition Saturday Weekend Edition Sunday All Things Considered Fresh Air Up First Featured No Compromise On Our Watch Throughline Rough Translation More Shows & Podcasts Search NPR Shop # NowPlaying Tiny Desk All Songs Considered Music News Music Features Live Sessions About NPR Diversity Organization Support Careers Connect Press Ethics California Approves A Pilot Program For Driverless Rides California allowing company Cruise offer free 

2. Upcase/downcase

In [22]:
df.loc[2]['Content_Parsed_2']

"accessibility links skip main content keyboard shortcuts audio player open navigation menu npr shop close navigation menu home news expand/collapse submenu news national world politics business health science climate race arts & life expand/collapse submenu arts & life books movies television pop culture food art & design performing arts life kit music expand/collapse submenu music # nowplaying tiny desk all songs considered music news music features live sessions shows & podcasts expand/collapse submenu shows & podcasts daily morning edition weekend edition saturday weekend edition sunday all things considered fresh air up first featured no compromise on our watch throughline rough translation more shows & podcasts search npr shop # nowplaying tiny desk all songs considered music news music features live sessions about npr diversity organization support careers connect press ethics california approves a pilot program for driverless rides california allowing company cruise offer free 

3. Punctuation signs

In [23]:
df.loc[2]['Content_Parsed_3']

"accessibility links skip main content keyboard shortcuts audio player open navigation menu npr shop close navigation menu home news expand/collapse submenu news national world politics business health science climate race arts & life expand/collapse submenu arts & life books movies television pop culture food art & design performing arts life kit music expand/collapse submenu music # nowplaying tiny desk all songs considered music news music features live sessions shows & podcasts expand/collapse submenu shows & podcasts daily morning edition weekend edition saturday weekend edition sunday all things considered fresh air up first featured no compromise on our watch throughline rough translation more shows & podcasts search npr shop # nowplaying tiny desk all songs considered music news music features live sessions about npr diversity organization support careers connect press ethics california approves a pilot program for driverless rides california allowing company cruise offer free 

4. Possessive pronouns

In [24]:
df.loc[2]['Content_Parsed_4']

'accessibility links skip main content keyboard shortcuts audio player open navigation menu npr shop close navigation menu home news expand/collapse submenu news national world politics business health science climate race arts & life expand/collapse submenu arts & life books movies television pop culture food art & design performing arts life kit music expand/collapse submenu music # nowplaying tiny desk all songs considered music news music features live sessions shows & podcasts expand/collapse submenu shows & podcasts daily morning edition weekend edition saturday weekend edition sunday all things considered fresh air up first featured no compromise on our watch throughline rough translation more shows & podcasts search npr shop # nowplaying tiny desk all songs considered music news music features live sessions about npr diversity organization support careers connect press ethics california approves a pilot program for driverless rides california allowing company cruise offer free 

5. Stemming and Lemmatization

In [25]:
df.loc[2]['Content_Parsed_5']

'accessibility link skip main content keyboard shortcuts audio player open navigation menu npr shop close navigation menu home news expand/collapse submenu news national world politics business health science climate race arts & life expand/collapse submenu arts & life book movies television pop culture food art & design perform arts life kit music expand/collapse submenu music # nowplaying tiny desk all songs consider music news music feature live sessions show & podcast expand/collapse submenu show & podcast daily morning edition weekend edition saturday weekend edition sunday all things consider fresh air up first feature no compromise on our watch throughline rough translation more show & podcast search npr shop # nowplaying tiny desk all songs consider music news music feature live sessions about npr diversity organization support career connect press ethics california approve a pilot program for driverless rid california allow company cruise offer free rid passengers driverless c

6. Stop words

In [26]:
df.loc[2]['Content_Parsed_6']

'accessibility link skip main content keyboard shortcuts audio player open navigation menu npr shop close navigation menu home news expand/collapse submenu news national world politics business health science climate race arts & life expand/collapse submenu arts & life book movies television pop culture food art & design perform arts life kit music expand/collapse submenu music # nowplaying tiny desk  songs consider music news music feature live sessions show & podcast expand/collapse submenu show & podcast daily morning edition weekend edition saturday weekend edition sunday  things consider fresh air  first feature  compromise   watch throughline rough translation  show & podcast search npr shop # nowplaying tiny desk  songs consider music news music feature live sessions  npr diversity organization support career connect press ethics california approve  pilot program  driverless rid california allow company cruise offer free rid passengers driverless cars — without safety drivers bo

Finally, we can delete the intermediate columns:

In [27]:
df.head(1)

Unnamed: 0,FileName,Content,Category,Complete_Filename,id,News_length,Content_Parsed_1,Content_Parsed_2,Content_Parsed_3,Content_Parsed_4,Content_Parsed_5,Content_Parsed_6
0,na,Markets Tech Media Success Perspectives Videos...,tech,na,1,11121,Markets Tech Media Success Perspectives Videos...,markets tech media success perspectives videos...,markets tech media success perspectives videos...,markets tech media success perspectives videos...,market tech media success perspectives videos ...,market tech media success perspectives videos ...


In [28]:
list_columns = ["FileName", "Category", "Complete_Filename", "Content", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [29]:
df.head()

Unnamed: 0,FileName,Category,Complete_Filename,Content,Content_Parsed
0,na,tech,na,Markets Tech Media Success Perspectives Videos...,market tech media success perspectives videos ...
1,na,tech,na,"Full Episode Tuesday , Sep 7 Close Menu PBS Ne...",full episode tuesday sep 7 close menu pbs new...
2,na,tech,na,Accessibility links Skip main content Keyboard...,accessibility link skip main content keyboard ...
3,na,tech,na,Skip main content Search Brookings About Us Pr...,skip main content search brook us press room ...


**IMPORTANT:**

We need to remember that our model will gather the latest news articles from different newspapers whenever we run it. For that reason, we need to take into account not only the peculiarities of the training set articles, but also those that may be present in the newly gathered articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.

## 2. Label coding

We'll create a dictionary with the label codification:

In [30]:
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

In [31]:
# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})
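
One thing to keep in mind is that `replace` leaves any category missing from the dictionary unchanged, so an unexpected label would pass through silently. The sketch below, in plain Python, shows a stricter encoding that fails loudly, plus the inverse mapping that is handy for reading predictions back. The function name `encode_labels` is an assumption for this illustration.

```python
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

# Inverse mapping, useful to turn predicted codes back into names.
code_to_category = {code: name for name, code in category_codes.items()}

def encode_labels(categories, codes):
    """Map category names to codes, raising on any unmapped name (hypothetical helper)."""
    unknown = sorted(set(categories) - set(codes))
    if unknown:
        raise ValueError(f"unmapped categories: {unknown}")
    return [codes[category] for category in categories]

encoded = encode_labels(['tech', 'sport', 'tech'], category_codes)
decoded = [code_to_category[code] for code in encoded]
print(encoded, decoded)
```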

In [32]:
df.head()

Unnamed: 0,FileName,Category,Complete_Filename,Content,Content_Parsed,Category_Code
0,na,tech,na,Markets Tech Media Success Perspectives Videos...,market tech media success perspectives videos ...,4
1,na,tech,na,"Full Episode Tuesday , Sep 7 Close Menu PBS Ne...",full episode tuesday sep 7 close menu pbs new...,4
2,na,tech,na,Accessibility links Skip main content Keyboard...,accessibility link skip main content keyboard ...,4
3,na,tech,na,Skip main content Search Brookings About Us Pr...,skip main content search brook us press room ...,4


## 3. Train - test split

In [33]:
# Create the train/test datasets.
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

In [33]:
# Output the training dimensions
X_train.shape

(3,)

Since we don't have many observations (the training set above has only 3 rows), we'll choose a test set size of 15% of the full dataset.
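
The split sizes follow from simple arithmetic. The sketch below assumes that, when `test_size` is a fraction, the test count is rounded up and the remainder goes to training; this is our understanding of how `train_test_split` behaves, and the helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (assumed rounding rule)."""
    n_test = math.ceil(test_size * n_samples)  # test count rounded up
    n_train = n_samples - n_test               # the rest is used for training
    return n_train, n_test

print(split_sizes(4, 0.15))
print(split_sizes(20, 0.15))
```

With 4 rows this gives 3 training rows and 1 test row, which matches the shapes printed in this notebook.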

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only considers the top
    `max_features` terms, ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.

It needs to be mentioned that we are implicitly scaling our data when representing it as TF-IDF features with the argument `norm`.

In [34]:
# Parameter selection
ngram_range = (1,2)
min_df = 3
max_df = 1.
max_features = 300

We have chosen these values as a first approximation. Since the models that we develop later have very good predictive power, we'll stick to these values. Different combinations could still be tried to improve the accuracy of the models further.

In [35]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(3, 78)
(1, 78)


Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.
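
To make the fit/transform distinction and the `norm` scaling concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1`, the sublinear term frequency `1 + ln(count)` and L2 normalization, handles unigrams only, and ignores `min_df`, `max_df` and `max_features`. The function names and the toy documents are made up for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn idf weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def tfidf_vector(doc, idf):
    """Transform one document with already-fitted idf weights."""
    # Terms never seen during fitting are dropped.
    counts = Counter(term for term in doc.split() if term in idf)
    # Sublinear tf: 1 + ln(count), multiplied by the idf weight.
    weights = {term: (1 + math.log(count)) * idf[term]
               for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    # L2 normalization: every non-empty vector ends up with length 1.
    return {term: w / norm for term, w in weights.items()}

train_docs = ["self driving car", "driving test car", "car market economy"]
idf = fit_idf(train_docs)                            # fit on training data only
vec = tfidf_vector("self driving car robot", idf)    # "robot" is unseen
length = math.sqrt(sum(w * w for w in vec.values()))
print(vec, length)
```

The unseen word is ignored at transform time, which is why the test set must reuse the vocabulary and weights learned from the training set rather than being fitted again.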

We can use the Chi squared test in order to see what unigrams and bigrams are most correlated with each category:

In [36]:
from sklearn.feature_selection import chi2
import numpy as np

for category, category_id in sorted(category_codes.items()):
    # Chi-squared score of each feature against "belongs to this category".
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


# 'business' category:
  . Most correlated unigrams:
. economy
. driving
. drive
. front
. years
  . Most correlated bigrams:
. self driving
. autonomous drive

# 'entertainment' category:
  . Most correlated unigrams:
. economy
. driving
. drive
. front
. years
  . Most correlated bigrams:
. self driving
. autonomous drive

# 'politics' category:
  . Most correlated unigrams:
. economy
. driving
. drive
. front
. years
  . Most correlated bigrams:
. self driving
. autonomous drive

# 'sport' category:
  . Most correlated unigrams:
. economy
. driving
. drive
. front
. years
  . Most correlated bigrams:
. self driving
. autonomous drive

# 'tech' category:
  . Most correlated unigrams:
. economy
. driving
. drive
. front
. years
  . Most correlated bigrams:
. self driving
. autonomous drive



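In this run, every category prints the same terms. All articles in the sample carry the `tech` label (code 4), so the test has no second class to contrast against and the ranking is not informative. With articles from several categories, we would expect the unigrams to correspond to their category better than the bigrams do. If we get the bigrams in our features: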

In [37]:
bigrams

['self driving', 'autonomous drive']

We can see there are only two. This means the unigrams have more correlation with the category than the bigrams, and since we're restricting the number of features to the most representative 300, only a few bigrams are being considered.

Let's save the files we'll need in the next steps:

In [38]:
# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
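
The repeated blocks above could be condensed into a loop over a dictionary of names and objects. The sketch below uses hypothetical helpers `save_pickles` and `load_pickle`, and writes small placeholder objects to a temporary folder instead of the real `Pickles/` directory.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickles(objects, folder):
    """Write each object to <folder>/<name>.pickle (hypothetical helper)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name, obj in objects.items():
        with open(folder / f"{name}.pickle", "wb") as output:
            pickle.dump(obj, output)

def load_pickle(name, folder):
    """Read one object back from <folder>/<name>.pickle (hypothetical helper)."""
    with open(Path(folder) / f"{name}.pickle", "rb") as data:
        return pickle.load(data)

# Placeholder objects standing in for the notebook's variables.
with tempfile.TemporaryDirectory() as tmp:
    save_pickles({"labels_train": [4, 4, 4], "category_codes": {"tech": 4}}, tmp)
    restored = load_pickle("labels_train", tmp)

print(restored)
```

In the notebook, the dictionary would hold `X_train`, `X_test`, `y_train`, `y_test`, `df`, the feature and label arrays, and `tfidf`, with `'Pickles'` as the folder.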