# Feature extraction

This is the final notebook explaining the methodology for text feature extraction. We will explain each of the different features extracted and then we will explain the process and methodology to extract all the features and create a new dataset.

## Index

- [1. Features](#1.-Feature-explanation)

 - [1.1. Complexity features](#1.1.-Complexity-features)
 - [1.2. Stylometric features](#1.2.-Stylometric-features)


- [2. Requisites](#2.-Requisites)


- [3. Feature extraction for training](#3.-Feature-extraction-for-training)


- [4. Feature extraction function for predictions](#4.-Feature-extraction-function-for-predictions)

## 1. Feature explanation

On this section we will explain the features that we are going to extract from the News Headline and News Content text. These features are language-independent, for example, they do not consider specific terms from a language, in this case spanish.

Our objective is to extract features based on high-level structures. To accomplish this objective, we are going to extract features from 2 categories: Complexity and Stylometric

### 1.1. Complexity features

The objective of these features is to capture te overall intricacy of the news, in sentence and word level. To achive this, we use metrics like average word size, words count per sentence and type token ratio:

**avg_words_sentence**: Average words per sentence

**avg_word_size**: Average word size

**avg_syllables_word**: Average syllables per word

**unique_words**: Hapaxes or unique words that only appears once in a text

**ttr**: Type token ratio

### Bonus ###

Spanish readability tests:

**huerta_score**: Fernández Huerta's redability score (Reading comprehension of the text), spanish adaptation of the Flesch equation

$$Perspicuity = 206.84 - 0.60 \times Average Syllables Word - 1.02 \times Average Words Sentence$$

**szigriszt_score**: Szigriszt Pazos perspicuity score (Legibility and clarity of the text), a modern spanish adaptation of the Flesch equation.

$$Perspicuity = 206.835 - \frac{62.3 \times TotalSyllables}{Words} - \frac{Words}{Sentences}$$

 



### 1.2. Stylometric features
For stylometric or lexical features, we use NLP techniques to extract grammatical and lexical information for each text. We are using Spacy POS tagging techniques to track different word style frequencies:

**mltd**: Measure of Textual Lexical Diversity, based on McCarthy and Jarvis (2010).

**upper_case_ratio**: Uppercase letters to all letters ratio

**entityratio**: Ratio of named Entities to the text size

**quotes_ratio**: Ratio of quotes marks to text size

**propn_ratio**: Proper Noun tag frequency

**noun_ratio**: Noun tag frequency

**pron_ratio**: Pronoun tag frequency

**adp_ratio**: Adposition tag frequency

**det_ratio**: Determinant tag frequency

**punct_ratio**: Punctuation tag frequency

**verb_ratio**: Verb tag frequency

**adv_ratio**: Adverb tag frequency

**sym_ratio**: Symbol tag frequency

### 2. Requisites

*For Python 3 installations use ___!pip3 install___ and ___python3 *___

[NLTK package](https://pypi.org/project/nltk/)

`!pip install nltk`

`import nltk`


[Spacy spanish package](https://spacy.io/models/es)

`!pip install spacy`

`python -m spacy download es_core_news_lg`

`import spacy`


[lexical_diversity package](https://pypi.org/project/lexical-diversity/)

`!pip install lexical-diversity`

`from lexical_diversity import lex_div as ld`


[Syltippy](https://github.com/nur-ag/syltippy)

Syltippy is a simple, user friendly word syllabization package for spanish language with no additional dependencies.

`!pip install syltippy`

`from syltippy import syllabize`

## 3. Feature extraction for training

In [2]:
# Tried several syllabizers for spanish and this is the chosen solution. Believe me, i spent a whole day.
# I had to replace all symbols, punctuations and it includes accentuation from other languages like ä, à, etc...
# It's a bit inconsistent with words from others languages, acronyms and abreviations. However it performs really well for our case!!!

def get_nsyllables(text):
    from syltippy import syllabize

    text = text.replace(r"*NUMBER*", "número")
    text = text.replace(r"*PHONE*", "número")
    text = text.replace(r"*EMAIL*", "email")
    text = text.replace(r"*URL*", "url")
    text = re.sub(r'\d+', '', text)
    text = re.sub('\n', '', text)
    text = re.sub(r'[^ \nA-Za-z0-9ÁÉÍÓÚÑáéíóúñ/]+', '', text)
    
    n_syllables = len(syllabize(text)[0])
    
    return n_syllables

In [3]:
%%time

import itertools
import pandas as pd
import nltk
import spacy
import re
from nltk import FreqDist
from sklearn.preprocessing import LabelEncoder
from lexical_diversity import lex_div as ld
pd.options.display.max_columns = None

nlp = spacy.load('es_core_news_lg')

df = pd.read_csv('../data/corpus_spanish_v3.csv')

labelencoder = LabelEncoder()
df['Label'] = labelencoder.fit_transform(df['Category'])

# empty lists and df
df_features = pd.DataFrame()
list_text = []
list_sentences = []
list_words = []
list_words_sent = []
list_word_size = []
list_avg_syllables_word = []
list_unique_words = []
list_ttr = []
list_huerta_score = []
list_szigriszt_score = []
list_mltd = []
list_entity_ratio = []
list_upper_case_ratio = []
list_quotes = []
list_quotes_ratio = []
list_propn_ratio = [] 
list_noun_ratio = []
list_adp_ratio = []
list_det_ratio = []
list_punct_ratio = []
list_pron_ratio = []
list_verb_ratio = []
list_adv_ratio = []
list_sym_ratio = []

list_headline = []
list_words_h = []
list_word_size_h = []
list_avg_syllables_word_h = []
list_ttr_h = []
list_mltd_h = []
list_unique_words_h = []

# df iteration
for n, row in df.iterrows():
    
    ## headline ##
    headline = df['Headline'].iloc[n]
    headline = re.sub(r"http\S+", "", headline)
    headline = re.sub(r"http", "", headline)
    headline = re.sub(r"@\S+", "", headline)
    headline = re.sub("\n", " ", headline)
    headline = re.sub(r"(?<!\n)\n(?!\n)", " ", headline)
    headline = headline.replace(r"*NUMBER*", "número")
    headline = headline.replace(r"*PHONE*", "número")
    headline = headline.replace(r"*EMAIL*", "email")
    headline = headline.replace(r"*URL*", "url")
    headline_new = headline.lower()
    doc_h = nlp(headline_new)
    
    list_tokens_h = []
    list_tags_h = []

    for sentence_h in doc_h.sents:
        for token in sentence_h:
            list_tokens_h.append(token.text)

    fdist_h = FreqDist(list_tokens_h)
    syllables_h = get_nsyllables(headline)
    words_h = len(list_tokens_h)
    
    # headline complexity features
    avg_word_size_h = round(sum(len(word) for word in list_tokens_h) / words_h, 2)
    avg_syllables_word_h = round(syllables_h / words_h, 2)
    unique_words_h = round((len(fdist_h.hapaxes()) / words_h) * 100, 2)
    ttr_h = round(ld.ttr(list_tokens_h) * 100, 2)
    mltd_h = round(ld.mtld(list_tokens_h), 2)
    
    ## text content##   
    text = df['Text'].iloc[n]  
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"http", "", text)
    text = re.sub("\n", " ", text)
    text = text.replace(r"*NUMBER*", "número")
    text = text.replace(r"*PHONE*", "número")
    text = text.replace(r"*EMAIL*", "email")
    text = text.replace(r"*URL*", "url")
    
    # to later calculate upper case letters ratio
    alph = list(filter(str.isalpha, text))
    text_new = text.lower()
    doc = nlp(text)

    list_tokens = []
    list_pos = []
    list_tag = []
    list_entities = []
    sents = 0
    
    for entity in doc.ents:
        list_entities.append(entity.label_)

    for sentence in doc.sents:
        sents += 1
        for token in sentence:
            list_tokens.append(token.text)
            list_pos.append(token.pos_)
            list_tag.append(token.tag_)
    
    # Calculate entities, pos, tag, freq, syllables, words and quotes
    entities = len(list_entities)
    n_pos = nltk.Counter(list_pos)
    n_tag = nltk.Counter(list_tag)
    fdist = FreqDist(list_tokens)
    syllables = get_nsyllables(text)
    words = len(list_tokens)
    quotes = n_tag['PUNCT__PunctType=Quot']

    # complexity features
    avg_word_sentence = round(words / sents, 2)
    avg_word_size = round(sum(len(word) for word in list_tokens) / words, 2)
    avg_syllables_word = round(syllables / words, 2)
    unique_words = round((len(fdist.hapaxes()) / words) * 100, 2)
    ttr = round(ld.ttr(list_tokens) * 100, 2)
    mltd = round(ld.mtld(list_tokens), 2)

    # readability spanish test
    huerta_score = round(206.84 - (60 * avg_syllables_word) - (1.02 * avg_word_sentence), 2)
    szigriszt_score = round(206.835 - ((62.3 * syllables) / words) - (words / sents), 2)

    # stylometric features
    upper_case_ratio = round(sum(map(str.isupper, alph)) / len(alph) * 100, 2)
    entity_ratio = round((entities / words) * 100, 2)
    quotes_ratio = round((quotes / words) * 100, 2)
    propn_ratio = round((n_pos['PROPN'] / words) * 100 , 2)
    noun_ratio = round((n_pos['NOUN'] / words) * 100, 2) 
    adp_ratio = round((n_pos['ADP'] / words) * 100, 2)
    det_ratio = round((n_pos['DET'] / words) * 100, 2)
    punct_ratio = round((n_pos['PUNCT'] / words) * 100, 2)
    pron_ratio = round((n_pos['PRON'] / words) * 100, 2)
    verb_ratio = round((n_pos['VERB'] / words) * 100, 2)
    adv_ratio = round((n_pos['ADV'] / words) * 100, 2)
    sym_ratio = round((n_tag['SYM'] / words) * 100, 2)
    
    # appending on lists
    # headline
    list_headline.append(headline_new)
    list_words_h.append(words_h)
    list_word_size_h.append(avg_word_size_h)
    list_avg_syllables_word_h.append(avg_syllables_word_h)
    list_unique_words_h.append(unique_words_h)
    list_ttr_h.append(ttr_h)
    list_mltd_h.append(mltd_h)
    
    # text
    list_text.append(text_new)
    list_sentences.append(sents)
    list_words.append(words)
    list_words_sent.append(avg_word_sentence)
    list_word_size.append(avg_word_size)
    list_avg_syllables_word.append(avg_syllables_word)
    list_unique_words.append(unique_words)
    list_ttr.append(ttr)
    list_huerta_score.append(huerta_score)
    list_szigriszt_score.append(szigriszt_score)
    list_mltd.append(mltd)
    list_entity_ratio.append(entity_ratio)
    list_upper_case_ratio.append(upper_case_ratio)
    list_quotes.append(quotes)
    list_quotes_ratio.append(quotes_ratio)
    list_propn_ratio.append(propn_ratio)
    list_noun_ratio.append(noun_ratio)
    list_adp_ratio.append(adp_ratio)
    list_det_ratio.append(det_ratio)
    list_punct_ratio.append(punct_ratio)
    list_pron_ratio.append(pron_ratio)
    list_verb_ratio.append(verb_ratio)
    list_adv_ratio.append(adv_ratio)
    list_sym_ratio.append(sym_ratio)
    
# dataframe
df_features['topic'] = df['Topic']
df_features['text'] = list_text
df_features['headline'] = list_headline

# headline
df_features['words_h'] = list_words_h
df_features['word_size_h'] = list_word_size_h
df_features['avg_syllables_word_h'] = list_avg_syllables_word_h
df_features['unique_words_h'] = list_unique_words_h
df_features['ttr_h'] = list_ttr_h
df_features['mltd_h'] = list_mltd_h

# text
df_features['sents'] = list_sentences
df_features['words'] = list_words
df_features['avg_words_sent'] = list_words_sent
df_features['avg_word_size'] = list_word_size
df_features['avg_syllables_word'] = list_avg_syllables_word
df_features['unique_words'] = list_unique_words
df_features['ttr'] = list_ttr
df_features['mltd'] = list_mltd
df_features['huerta_score'] = list_huerta_score
df_features['szigriszt_score'] = list_szigriszt_score
df_features['upper_case_ratio'] = list_upper_case_ratio
df_features['entity_ratio'] = list_entity_ratio
df_features['quotes'] = list_quotes
df_features['quotes_ratio'] = list_quotes_ratio
df_features['propn_ratio'] = list_propn_ratio
df_features['noun_ratio'] = list_noun_ratio
df_features['adp_ratio'] = list_adp_ratio
df_features['det_ratio'] = list_det_ratio
df_features['punct_ratio'] = list_punct_ratio
df_features['pron_ratio'] = list_pron_ratio
df_features['verb_ratio'] = list_verb_ratio
df_features['adv_ratio'] = list_adv_ratio
df_features['sym_ratio'] = list_sym_ratio

df_features['label'] = df['Label']

df_features.to_csv('../data/spanish_corpus_features_v6.csv', encoding = 'utf-8', index = False)

CPU times: user 5min 30s, sys: 3.05 s, total: 5min 33s
Wall time: 5min 33s


In [11]:
df_features

Unnamed: 0,text,headline,words_h,word_size_h,avg_syllables_word_h,unique_words_h,ttr_h,mltd_h,sents,words,avg_words_sent,avg_word_size,avg_syllables_word,unique_words,ttr,mltd,huerta_score,szigriszt_score,upper_case_ratio,entity_ratio,quotes,quotes_ratio,propn_ratio,noun_ratio,adp_ratio,det_ratio,punct_ratio,pron_ratio,verb_ratio,adv_ratio,sym_ratio,label
0,el pasado jueves 5 de noviembre la superintend...,nueva sanción a doña gallina por discriminar g...,9,5.89,2.22,100.00,100.00,0.00,9,351,39.00,4.41,1.79,42.45,54.42,76.65,59.66,56.01,2.13,3.13,11,3.13,6.27,17.95,14.53,13.68,9.69,3.99,6.27,2.56,0.00,0
1,la rae estudia incluir «machirulo» en el dicci...,la rae estudia incluir «machirulo» en el dicci...,10,4.50,1.80,100.00,100.00,0.00,18,437,24.28,4.23,1.73,30.66,44.39,63.54,78.27,74.78,2.94,5.49,23,5.26,7.55,13.50,10.98,13.73,17.16,1.83,8.70,3.66,0.00,1
2,el alto comisionado de naciones unidas para lo...,save the children y acnur alertan de riesgos q...,16,4.62,1.81,87.50,93.75,71.68,27,1276,47.26,4.47,1.82,24.92,35.50,76.24,49.43,45.91,3.42,4.39,54,4.23,6.82,15.44,16.85,11.91,10.66,3.61,10.97,2.66,0.47,1
3,el colegio de abogados ha entregado en la maña...,colegio de abogados de granada entrega distinc...,13,5.00,2.23,76.92,84.62,23.66,10,587,58.70,4.39,1.85,28.79,40.89,55.69,35.97,32.87,4.42,8.01,16,2.73,16.35,15.67,17.55,13.29,10.39,2.21,7.67,0.85,0.00,1
4,era todo un misterio el paradero de la familia...,espera de tres años a instalación de internet ...,14,4.43,2.00,64.29,78.57,18.29,15,424,28.27,4.16,1.73,41.27,53.54,83.89,74.20,70.57,1.76,3.54,11,2.59,4.01,16.04,13.68,14.15,12.97,4.48,10.14,4.72,0.24,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3969,qué creen? acabo de descubrir que los administ...,twitter apoya a amlo,4,4.25,2.00,100.00,100.00,0.00,6,83,13.83,4.27,1.63,66.27,74.70,65.39,94.93,91.67,5.09,8.43,0,0.00,14.46,6.02,9.64,3.61,19.28,4.82,9.64,3.61,0.00,0
3970,ver todos europa press nueva york.- el grupo c...,el grupo chino hna compra el 25% del gigante h...,15,5.33,1.93,86.67,93.33,63.00,15,572,38.13,4.69,1.81,33.57,46.15,70.47,59.35,55.86,5.11,9.44,3,0.52,15.21,19.58,17.48,10.84,9.44,1.22,6.47,1.75,0.00,1
3971,un curandero infecta de sida a medio centenar ...,un curandero infecta de sida a medio centenar ...,12,4.67,2.00,83.33,91.67,40.32,10,367,36.70,4.46,1.85,36.78,49.86,67.32,58.41,54.87,4.45,6.81,6,1.63,11.72,21.53,20.71,14.17,10.08,0.82,6.81,1.91,0.00,1
3972,política primeras páginas de los diarios llega...,primeras páginas de los diarios llegados esta ...,11,5.55,2.09,100.00,100.00,0.00,12,591,49.25,3.97,1.54,23.69,38.07,70.94,64.20,61.55,9.02,10.66,62,10.49,11.68,15.23,13.71,15.74,22.00,1.02,8.12,1.86,2.03,1


## 4. Feature extraction function for predictions

To make predictions we need to extracte features from a given news headline and news text content. So we are going to pack the code above to extract the features for our predictions

In [9]:
%%time

import pandas as pd
import nltk
import spacy
import re
from nltk import FreqDist
from lexical_diversity import lex_div as ld

def get_news_features(headline, text):
    
    nlp = spacy.load('es_core_news_lg')

    ## headline ##
    headline = re.sub(r"http\S+", "", headline)
    headline = re.sub(r"http", "", headline)
    headline = re.sub(r"@\S+", "", headline)
    headline = re.sub("\n", " ", headline)
    headline = re.sub(r"(?<!\n)\n(?!\n)", " ", headline)
    headline = headline.replace(r"*NUMBER*", "número")
    headline = headline.replace(r"*PHONE*", "número")
    headline = headline.replace(r"*EMAIL*", "email")
    headline = headline.replace(r"*URL*", "url")
    headline_new = headline.lower()
    doc_h = nlp(headline_new)

    list_tokens_h = []
    list_tags_h = []

    for sentence_h in doc_h.sents:
        for token in sentence_h:
            list_tokens_h.append(token.text)

    fdist_h = FreqDist(list_tokens_h)
    syllables_h = get_nsyllables(headline)
    words_h = len(list_tokens_h)

    # headline complexity features
    avg_word_size_h = round(sum(len(word) for word in list_tokens_h) / words_h, 2)
    avg_syllables_word_h = round(syllables_h / words_h, 2)
    unique_words_h = round((len(fdist_h.hapaxes()) / words_h) * 100, 2)
    mltd_h = round(ld.mtld(list_tokens_h), 2)

    ## text content##     
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"http", "", text)
    text = re.sub("\n", " ", text)
    text = text.replace(r"*NUMBER*", "número")
    text = text.replace(r"*PHONE*", "número")
    text = text.replace(r"*EMAIL*", "email")
    text = text.replace(r"*URL*", "url")

    # to later calculate upper case letters ratio
    alph = list(filter(str.isalpha, text))
    text_new = text.lower()
    doc = nlp(text)

    list_tokens = []
    list_pos = []
    list_tag = []
    list_entities = []
    sents = 0

    for entity in doc.ents:
        list_entities.append(entity.label_)

    for sentence in doc.sents:
        sents += 1
        for token in sentence:
            list_tokens.append(token.text)
            list_pos.append(token.pos_)
            list_tag.append(token.tag_)

    # Calculate entities, pos, tag, freq, syllables, words and quotes
    entities = len(list_entities)
    n_pos = nltk.Counter(list_pos)
    n_tag = nltk.Counter(list_tag)
    fdist = FreqDist(list_tokens)
    syllables = get_nsyllables(text)
    words = len(list_tokens)
    quotes = n_tag['PUNCT__PunctType=Quot']

    # complexity features
    avg_word_sentence = round(words / sents, 2)
    avg_word_size = round(sum(len(word) for word in list_tokens) / words, 2)
    avg_syllables_word = round(syllables / words, 2)
    unique_words = round((len(fdist.hapaxes()) / words) * 100, 2)
    ttr = round(ld.ttr(list_tokens) * 100, 2)

    # readability spanish test
    huerta_score = round(206.84 - (60 * avg_syllables_word) - (1.02 * avg_word_sentence), 2)
    szigriszt_score = round(206.835 - ((62.3 * syllables) / words) - (words / sents), 2)

    # stylometric features
    mltd = round(ld.mtld(list_tokens), 2)
    upper_case_ratio = round(sum(map(str.isupper, alph)) / len(alph) * 100, 2)
    entity_ratio = round((entities / words) * 100, 2)
    quotes_ratio = round((quotes / words) * 100, 2)
    propn_ratio = round((n_pos['PROPN'] / words) * 100 , 2)
    noun_ratio = round((n_pos['NOUN'] / words) * 100, 2) 
    pron_ratio = round((n_pos['PRON'] / words) * 100, 2)
    adp_ratio = round((n_pos['ADP'] / words) * 100, 2)
    det_ratio = round((n_pos['DET'] / words) * 100, 2)
    punct_ratio = round((n_pos['PUNCT'] / words) * 100, 2)
    verb_ratio = round((n_pos['VERB'] / words) * 100, 2)
    adv_ratio = round((n_pos['ADV'] / words) * 100, 2)
    sym_ratio = round((n_tag['SYM'] / words) * 100, 2)

    # create df_features
    df_features = pd.DataFrame({'words_h': [words_h], 'avg_word_size_h': [avg_word_size_h],'avg_syllables_word': [avg_syllables_word_h],
                                'unique_words_h': [unique_words_h], 'mltd_h': [mltd_h], 'sents': [sents], 'words': [words], 
                                'avg_word_sentence': [avg_word_sentence], 'avg_word_size': [avg_word_size], 
                                'avg_syllables_word': avg_syllables_word, 'unique_words': [unique_words], 
                                'ttr': [ttr], 'huerta_score': [huerta_score], 'szigriszt_score': [szigriszt_score],
                                'mltd': [mltd], 'upper_case_ratio': [upper_case_ratio], 'entity_ratio': [entity_ratio], 
                                'quotes': [quotes], 'quotes_ratio': [quotes_ratio], 'propn_ratio': [propn_ratio], 
                                'noun_ratio': [noun_ratio], 'pron_ratio': [pron_ratio], 'adp_ratio': [adp_ratio],
                                'det_ratio': [det_ratio], 'punct_ratio': [punct_ratio], 'verb_ratio': [verb_ratio],
                                'adv_ratio': [adv_ratio], 'sym_ratio': [sym_ratio]})
    
    return df_features

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 24.8 µs


In [11]:
headline = input('Insert news headline:')
text = input('insert news content:')

get_news_features(text, headline)

Insert news headline:El Gobierno ha presentado hoy al Niño de Schrödinger, que va y no va al colegio
insert news content:La ministra de Educación y Formación Profesional, Isabel Celaá, ha presentado esta mañana al Niño de Schrödinger, fruto de un proyecto en el que han colaborado varias universidades españolas y que viene a resolver el problema de la vuelta a los colegios en plena ola de contagios por coronavirus.  «Va y no va al colegio y está expuesto al virus pero al mismo tiempo no lo está», ha explicado Celaá, insistiendo en que se trata de «una paradoja avalada científicamente».  La ministra ha mostrado a los medios al niño, cuyo nombre es Fernando Campos Leza, describiéndolo como «un alumno perfectamente sano y normal que ahora mismo, estando aquí con nosotros, está al mismo tiempo en casa, donde permanecerá mientras vaya al colegio con normalidad junto al resto de niños de Schrödinger».  A partir de mañana y hasta el inicio del nuevo curso escolar, los padres deberán adaptar a 

Unnamed: 0,words_h,avg_word_size_h,avg_syllables_word,unique_words_h,mltd_h,sents,words,avg_word_sentence,avg_word_size,unique_words,ttr,huerta_score,szigriszt_score,mltd,upper_case_ratio,entity_ratio,quotes,quotes_ratio,propn_ratio,noun_ratio,pron_ratio,adp_ratio,det_ratio,punct_ratio,verb_ratio,adv_ratio,sym_ratio
0,258,4.4,1.47,35.66,65.99,1,17,17.0,3.76,76.47,88.24,101.3,98.22,40.46,6.35,11.76,0,0.0,17.65,5.88,0.0,17.65,5.88,5.88,17.65,11.76,0.0
