### Feature Engineering

The Next step is to create features from the raw text. The steps are as follows:

1. **Text Cleaning** : Removing special characters,stop words, proper nouns; explanding contractions; converting the whole text to lower case.
2. **Dictionary mapping**
3. **Train-Test split** : For training our model and testing on unseen data
4. **Text representation** : We will be using Tf-Idf scores to represent out data internally

Importing necessary packages

In [1]:
import pickle
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2

In [6]:
with open("Data/News_dataset.pickle",'rb') as data:
    df = pickle.load(data)

I have made a contraction mapping to expand contractions in our dataset

In [3]:
#contraction mapping to expand contractions
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                           "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",

                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",

                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",

                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                           "you're": "you are", "you've": "you have"}

defined a function to handle all precprocessing steps together. 

In [7]:
df.head(5)

Unnamed: 0,File_Name,Content,Category,Complete_Filename,News_length
1,001.txt,Ad sales boost Time Warner profit\n\nQuarterly...,business,001.txt-business,2559
2,002.txt,Dollar gains on Greenspan speech\n\nThe dollar...,business,002.txt-business,2251
3,003.txt,Yukos unit buyer faces loan claim\n\nThe owner...,business,003.txt-business,1551
4,004.txt,High fuel prices hit BA's profits\n\nBritish A...,business,004.txt-business,2411
5,005.txt,Pernod takeover talk lifts Domecq\n\nShares in...,business,005.txt-business,1569


In [8]:
stop_words = set(stopwords.words('english')) 
def text_cleaner1(text, remove_stopwords=True):
    text = text.lower()
    #Using Bautiful soup to remove html tags
    text = BeautifulSoup(text, "lxml").text
    #Replace contractions with their longer forms
    #text = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in text.split(" ")])    
    # Format words and remove unwanted characters
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'\([^)]*\)', '', text)
    text = re.sub('"','', text)
    text = re.sub(r"'s\b","",text)
    text = re.sub("[^a-zA-Z]", " ", text) 
    
    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
        text = " ".join(text)
    return text

In [9]:
clean_content = []
para = []
for i in range(len(df)):
    para = df.iloc[i]['Content']
    clean_content.append(text_cleaner1(para))

In [10]:
df['Clean_content']=clean_content
df.head()

Unnamed: 0,File_Name,Content,Category,Complete_Filename,News_length,Clean_content
1,001.txt,Ad sales boost Time Warner profit\n\nQuarterly...,business,001.txt-business,2559,ad sales boost time warner profit quarterly pr...
2,002.txt,Dollar gains on Greenspan speech\n\nThe dollar...,business,002.txt-business,2251,dollar gains greenspan speech dollar hit highe...
3,003.txt,Yukos unit buyer faces loan claim\n\nThe owner...,business,003.txt-business,1551,yukos unit buyer faces loan claim owners embat...
4,004.txt,High fuel prices hit BA's profits\n\nBritish A...,business,004.txt-business,2411,high fuel prices hit ba profits british airway...
5,005.txt,Pernod takeover talk lifts Domecq\n\nShares in...,business,005.txt-business,1569,pernod takeover talk lifts domecq shares uk dr...


Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [11]:
#Downloading punkt and wordnet from nltk
nltk.download('punkt')
print("=====================================================================")
nltk.download('wordnet')



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dorothyjeyson/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/dorothyjeyson/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
lemmatizer = WordNetLemmatizer()

In [13]:
rows = len(df)
lemma_list = []
for i in range(rows):
    #create and empty list of lemmatized words
    lemma = []
    
    #save the text in an object and split it into words
    text = df.iloc[i]['Clean_content']
    words_in_text = text.split(" ")
    
    #iterate through every word
    for word in words_in_text:
        lemma.append(lemmatizer.lemmatize(word,pos='v'))
    
    #join the list
    lemma_text = " ".join(lemma)
    
    #append to lemma_list to create a new column in our existing dataframe for better readability
    lemma_list.append(lemma_text)

In [14]:
df['Lemmatised_content']=lemma_list

In [15]:
df.head()

Unnamed: 0,File_Name,Content,Category,Complete_Filename,News_length,Clean_content,Lemmatised_content
1,001.txt,Ad sales boost Time Warner profit\n\nQuarterly...,business,001.txt-business,2559,ad sales boost time warner profit quarterly pr...,ad sales boost time warner profit quarterly pr...
2,002.txt,Dollar gains on Greenspan speech\n\nThe dollar...,business,002.txt-business,2251,dollar gains greenspan speech dollar hit highe...,dollar gain greenspan speech dollar hit highes...
3,003.txt,Yukos unit buyer faces loan claim\n\nThe owner...,business,003.txt-business,1551,yukos unit buyer faces loan claim owners embat...,yukos unit buyer face loan claim owners embatt...
4,004.txt,High fuel prices hit BA's profits\n\nBritish A...,business,004.txt-business,2411,high fuel prices hit ba profits british airway...,high fuel price hit ba profit british airways ...
5,005.txt,Pernod takeover talk lifts Domecq\n\nShares in...,business,005.txt-business,1569,pernod takeover talk lifts domecq shares uk dr...,pernod takeover talk lift domecq share uk drin...


In [16]:
df.iloc[0]['Content']

'Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (Â£600m) for the three months to December, from $639m year-earlier.\n\nThe firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.\n\nTime Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL\'s underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to si

In [17]:
df.iloc[0]['Clean_content']

'ad sales boost time warner profit quarterly profits us media giant timewarner jumped bn three months december year earlier firm one biggest investors google benefited sales high speed internet connections higher advert sales timewarner said fourth quarter sales rose bn bn profits buoyed one gains offset profit dip warner bros less users aol time warner said friday owns search engine google internet business aol mixed fortunes lost subscribers fourth quarter profits lower preceding three quarters however company said aol underlying profit exceptional items rose back stronger internet advertising revenues hopes increase subscribers offering online service free timewarner internet customers try sign aol existing customers high speed broadband timewarner also restate results following probe us securities exchange commission sec close concluding time warner fourth quarter profits slightly better analysts expectations film division saw profits slump helped box office flops alexander catwoma

In [20]:
df.iloc[0]['Lemmatised_content']

'ad sales boost time warner profit quarterly profit us media giant timewarner jump bn three months december year earlier firm one biggest investors google benefit sales high speed internet connections higher advert sales timewarner say fourth quarter sales rise bn bn profit buoy one gain offset profit dip warner bros less users aol time warner say friday own search engine google internet business aol mix fortunes lose subscribers fourth quarter profit lower precede three quarter however company say aol underlie profit exceptional items rise back stronger internet advertise revenues hop increase subscribers offer online service free timewarner internet customers try sign aol exist customers high speed broadband timewarner also restate result follow probe us securities exchange commission sec close conclude time warner fourth quarter profit slightly better analysts expectations film division saw profit slump help box office flop alexander catwoman sharp contrast year earlier third final 

https://github.com/miguelfzafra/Latest-News-Classifier/blob/master/0.%20Latest%20News%20Classifier/03.%20Feature%20Engineering/03.%20Feature%20Engineering.ipynb

In [21]:
columns = ['File_Name', 'Content', 'Category','Complete_Filename','News_length','Lemmatised_content']
df = df[columns]
df = df.rename(columns={'Lemmatised_content': 'Content_Preprocessed'})
df.head()

Unnamed: 0,File_Name,Content,Category,Complete_Filename,News_length,Content_Preprocessed
1,001.txt,Ad sales boost Time Warner profit\n\nQuarterly...,business,001.txt-business,2559,ad sales boost time warner profit quarterly pr...
2,002.txt,Dollar gains on Greenspan speech\n\nThe dollar...,business,002.txt-business,2251,dollar gain greenspan speech dollar hit highes...
3,003.txt,Yukos unit buyer faces loan claim\n\nThe owner...,business,003.txt-business,1551,yukos unit buyer face loan claim owners embatt...
4,004.txt,High fuel prices hit BA's profits\n\nBritish A...,business,004.txt-business,2411,high fuel price hit ba profit british airways ...
5,005.txt,Pernod takeover talk lifts Domecq\n\nShares in...,business,005.txt-business,1569,pernod takeover talk lift domecq share uk drin...


### Label coding

In [22]:
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

In [23]:
df['CategoryCode']=df['Category']
df = df.replace({'CategoryCode':category_codes})

In [24]:
df.head()

Unnamed: 0,File_Name,Content,Category,Complete_Filename,News_length,Content_Preprocessed,CategoryCode
1,001.txt,Ad sales boost Time Warner profit\n\nQuarterly...,business,001.txt-business,2559,ad sales boost time warner profit quarterly pr...,0
2,002.txt,Dollar gains on Greenspan speech\n\nThe dollar...,business,002.txt-business,2251,dollar gain greenspan speech dollar hit highes...,0
3,003.txt,Yukos unit buyer faces loan claim\n\nThe owner...,business,003.txt-business,1551,yukos unit buyer face loan claim owners embatt...,0
4,004.txt,High fuel prices hit BA's profits\n\nBritish A...,business,004.txt-business,2411,high fuel price hit ba profit british airways ...,0
5,005.txt,Pernod takeover talk lifts Domecq\n\nShares in...,business,005.txt-business,1569,pernod takeover talk lift domecq share uk drin...,0


### Train test split

In [25]:
X_train, X_test, Y_train, Y_test = train_test_split(df['Content_Preprocessed'],
                                                    df['CategoryCode'], 
                                                    test_size = 0.15,
                                                    random_state=8)

### Text representation
We have various options:

Count Vectors as features, TF-IDF Vectors as features, Word Embeddings as features, Text / NLP based features, Topic Models as features

In this case, I am going to use TF-IDF Vectors as features. 

We have to define the different parameters:

ngram_range: We want to consider both unigrams and bigrams.
max_df: When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold
min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold.
max_features: If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

It needs to be mentioned that we are implicitly scaling our data when representing it as TF-IDF features with the argument norm.

In [26]:
#parameter election
ngram_range = (1,2)
min_df = 1
max_df = 10
max_features = 300

We have chosen these values as a first approximation. Since the models that we develop later have a very good predictive power, we'll stick to these values. But it has to be mentioned that different combinations could be tried in order to improve even more the accuracy of the models.

In [27]:
tfidf=TfidfVectorizer(encoding='utf-8',
    lowercase=False,
    stop_words=None,
    ngram_range=(1, 2),
    max_df=10,
    min_df=1,
    max_features=300,
    norm='l2',
    sublinear_tf=False,
)
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = Y_train
print(features_train.shape)

features_test = tfidf.fit_transform(X_test).toarray()
labels_test = Y_test
print(features_test.shape)

(1796, 300)
(317, 300)


In [31]:
for product, category_id in sorted(category_codes.items()):
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")

# 'business' category:
  . Most correlated unigrams:
. camera
. systems
. seven
. payments
. hybrid
  . Most correlated bigrams:
. work pension
. election campaign

# 'entertainment' category:
  . Most correlated unigrams:
. broadband
. pledge
. students
. takeover
. cable
  . Most correlated bigrams:
. fannie mae
. last month

# 'politics' category:
  . Most correlated unigrams:
. la
. ready
. olympic
. colleagues
. hewitt
  . Most correlated bigrams:
. radio today
. oil price

# 'sport' category:
  . Most correlated unigrams:
. pro
. chip
. swiss
. visa
. squad
  . Most correlated bigrams:
. deutsche boerse
. mr smith

# 'tech' category:
  . Most correlated unigrams:
. goals
. chinese
. watchdog
. june
. car
  . Most correlated bigrams:
. blair say
. mr straw



In [30]:
bigrams

['anti virus',
 'file share',
 'house price',
 'lib dem',
 'music players',
 'fannie mae',
 'mr howard',
 'digital music',
 'hard drive',
 'foreign secretary',
 'radio today',
 'grand slam',
 'attorney general',
 'liberal democrats',
 'mr blunkett',
 'say expect',
 'civil service',
 'australian open',
 'deutsche boerse',
 'mr smith',
 'oil price',
 'box office',
 'last month',
 'could use',
 'work pension',
 'election campaign',
 'interest rat',
 'public service',
 'laser light',
 'mr brown',
 'hip hop',
 'blair say',
 'mr straw']

We can see there are only six. This means the unigrams have more correlation with the category than the bigrams, and since we're restricting the number of features to the most representative 300, only a few bigrams are being considered.

Let's save the files we'll need in the next steps:

In [33]:
#X_train
with open('Data/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)

#X_test
with open('Data/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# Y_train
with open('Data/Y_train.pickle', 'wb') as output:
    pickle.dump(Y_train, output)
    
# Y_test
with open('Data/Y_test.pickle', 'wb') as output:
    pickle.dump(Y_test, output)

# dataframe
with open('Data/Preprocessed.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Data/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Data/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Data/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Data/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Data/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)