# Clarity metrics
Included in this notebook:
- Average sentence length (by number of words)
- Average word length (by number of syllables)
- Readability index Szigriszt-Pazos
- Usage of common words
- Lexical diversity
 

- Jonathan Haidt analysis for categorical judgments

https://colab.research.google.com/drive/1jMVh_pXs8-9tIzMBKK11VRu31jn947xk

In [135]:
import pandas as pd
import numpy as np
import re

from wordfreq import word_frequency # Documentation: https://github.com/LuminosoInsight/wordfreq/
import textstat # Documentation: https://github.com/shivam5992/textstat
from textstat import szigriszt_pazos
textstat.set_lang('es')

In [2]:
# Load articles of authors
data = pd.read_csv('../Data/Data_clean_csv/clean_dataframe.csv')

with open('../Data/Data_clean_txt/Denisse Dresser.txt', 'r', encoding='utf8') as f:
    dresser_content = f.read()
    
with open('../Data/Data_clean_txt/Enrique Krauze.txt', 'r', encoding='utf8') as f:
    krauze_content = f.read()
    
with open('../Data/Data_clean_txt/John Ackerman.txt', 'r', encoding='utf8') as f:
    ackerman_content = f.read()
    
with open('../Data/Data_clean_txt/Ricardo Raphael.txt', 'r', encoding='utf8') as f:
    raphael_content = f.read()
    
with open('../Data/Data_clean_txt/Valeria Moy.txt', 'r', encoding='utf8') as f:
    moy_content = f.read()

## Average sentence length (by number of words)

In [55]:
def html_cleaner(text):
    """ Removes html expressions and line breaks"""
    
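    # Replace line breaks with a space so adjacent words are not joined
    text = re.sub(r'(\n|\r)', ' ', text)
    
    # Remove italics (opening tags, and closing tags written with / or \)
    text = re.sub(r'<[/\\]?i>', '', text)
    
    # Remove bold (opening tags, and closing tags written with / or \)
    text = re.sub(r'<[/\\]?b>', '', text)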
    
    #Remove multiple spaces
    text = re.sub(r'\s{2,}', ' ', text)
    text = re.sub(r'\s,\s', ', ', text)
        
    return text

In [122]:
def words_per_sentence(text):
    """Returns average words per sentence of a text"""
    
    # Clean html expressions and line breaks
    text = html_cleaner(text)
    
    # Split into sentences at '.', '?' or '!' followed by whitespace
    sentence_regex = re.compile(r'[.?!]\s+')
    sentences = sentence_regex.split(text)
    
    # Remove empty sentences (build a new list instead of mutating while iterating)
    sentences = [sentence for sentence in sentences if sentence.strip()]
    
    #Count words in each string
    words_per_sentence = [len(re.findall(r'\w+', sentence)) for sentence in sentences]
        
    #Get average words per sentence
    avg_words_per_sentence = sum(words_per_sentence) / len(words_per_sentence)
        
    return avg_words_per_sentence

## Average word length (by number of syllables)

In [106]:
def punctuation_cleaner(text):
    """Removes all punctuation and special characters from a text"""
    
    # Keep letters (including uppercase accented vowels, Ñ and ü) and whitespace
    text = re.sub(r'[^A-Za-z\sÁÉÍÓÚÜÑáéíóúüñ]+', '', text)
    
    return text

In [127]:
def syllables_per_word(text):
    """Returns average syllables per word of a text"""
    
    # Clean html expressions, line breaks, and punctuation
    text = html_cleaner(text)
    text = punctuation_cleaner(text)
    
    #Remove initialisms and acronyms
    text = re.sub(r'\b[A-ZÑ]{2,}\b', '', text)
    
    #Remove multiple spaces
    text = re.sub(r'\s{2,}', ' ', text)
    text = re.sub(r'\s,\s', ', ', text)
    text = re.sub(r'\s+$', '', text)
    
    # Lowercase all words
    text = text.lower()
    
    # Calculate average number of syllables per word
    words = text.split()
    syllables_per_word = [textstat.syllable_count(word) for word in words]
    avg_syllables_per_word = sum(syllables_per_word) / len(words)
    
    return avg_syllables_per_word

## Readability index Szigriszt-Pazos

This index is a Spanish adaptation of the Flesch reading-ease test. It combines the average number of words per sentence with the average number of syllables per word.

See https://legible.es/blog/perspicuidad-szigriszt-pazos/

| Score | Difficulty | Education level |
| ----- | ---------- | --------------- |
| 0-15 | Very hard | University graduates |
| 16-35 | Hard | University graduates |
| 36-50 | Somewhat hard | College |
| 51-65 | Normal | 13- to 15-year-old students |
| 66-75 | Somewhat easy | 12-year-old students |
| 76-85 | Easy | 11-year-old students |
| 86-100 | Very easy | 6- to 10-year-olds |
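
As a rough illustration of how the index combines the two averages computed in the previous sections, the sketch below applies the commonly cited Szigriszt-Pazos formula, P = 206.835 - 62.3 × (syllables / words) - (words / sentences), and maps the result onto the bands in the table. The names `szigriszt_pazos_from_averages`, `difficulty_band` and `BANDS` are hypothetical helpers for this example, not part of `textstat`. It is an assumption that `textstat` computes the same quantity, so its scores may differ slightly because of rounding and its own sentence and syllable counting.

```python
def szigriszt_pazos_from_averages(syllables_per_word, words_per_sentence):
    """Szigriszt-Pazos index from the two averages (hypothetical helper)."""
    return 206.835 - 62.3 * syllables_per_word - words_per_sentence


# Upper bound of each band, taken from the table above
BANDS = [
    (15, 'Very hard'),
    (35, 'Hard'),
    (50, 'Somewhat hard'),
    (65, 'Normal'),
    (75, 'Somewhat easy'),
    (85, 'Easy'),
    (100, 'Very easy'),
]


def difficulty_band(score):
    """Maps a score to the difficulty label of its band (hypothetical helper)."""
    for upper_bound, label in BANDS:
        if score <= upper_bound:
            return label
    return 'Very easy'  # Scores above 100 are treated as the easiest band


# Example: 2.0 syllables per word and 20 words per sentence
example_score = szigriszt_pazos_from_averages(2.0, 20)
print(round(example_score, 2), difficulty_band(example_score))
```

Longer words and longer sentences both lower the score, which is why dense academic prose lands in the "Hard" bands.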

In [136]:
def average_szigriszt_pazos(author):
    """Returns the average Szigriszt-Pazos index of an author's articles.
    - author: full name (e.g., 'Denisse Dresser', 'Enrique Krauze')
    """
    
    # Drop articles with a missing body (NaN) before scoring
    szigriszt_pazos_scores = data.loc[data['author'] == author, 'body'].dropna().apply(szigriszt_pazos)
    avg_szigriszt_pazos = szigriszt_pazos_scores.mean()
    
    return avg_szigriszt_pazos

In [141]:
print(average_szigriszt_pazos('Denisse Dresser'))
print(average_szigriszt_pazos('Enrique Krauze'))
print(average_szigriszt_pazos('Ricardo Raphael'))
print(average_szigriszt_pazos('Valeria Moy'))

44.84963963963964
50.79950171821305
11.723434343434345
57.3544827586207


In [199]:
data.iloc[1216]

author                                        John Ackerman
title                   Un voto razonado para Mario Delgado
date                                             2020/10/05
body                                                    NaN
source                                           La Jornada
link      https://johnackerman.mx/un-voto-razonado-para-...
Name: 1216, dtype: object

In [196]:
# Find John Ackerman articles whose body is missing (NaN)
a = data.loc[data['author'] == 'John Ackerman', 'body'].isna()
a[a]

1216    True
Name: body, dtype: bool

## Usage of common words

In [133]:
test1 = "La casa del gato"
test2 = 'La Filosofía del agnosticismo'
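
One possible way to score the test strings above is the mean Zipf frequency of their words, sketched below. This is an assumption about the intended metric, not the notebook's method. `common_word_score`, its `frequency` and `stopwords` parameters, and the `toy_frequencies` table are hypothetical. The default lookup relies on `wordfreq.zipf_frequency`, which is imported only when no lookup is supplied.

```python
import re


def common_word_score(text, frequency=None, stopwords=(), lang='es'):
    """Mean frequency score of the words in a text (hypothetical helper).

    frequency: callable mapping a lowercase word to a number; defaults to
    wordfreq's Zipf scale for the given language.
    """
    if frequency is None:
        # Imported lazily so the function also works with a custom lookup
        from wordfreq import zipf_frequency

        def frequency(word):
            return zipf_frequency(word, lang)

    # Lowercase, tokenize, and drop stopwords
    words = [w for w in re.findall(r'\w+', text.lower()) if w not in stopwords]
    if not words:
        return 0.0
    return sum(frequency(w) for w in words) / len(words)


# Toy lookup standing in for wordfreq, with made-up values for illustration
toy_frequencies = {'la': 7.0, 'casa': 5.0, 'del': 6.0, 'gato': 4.0}
toy_score = common_word_score(
    'La casa del gato',
    frequency=lambda w: toy_frequencies.get(w, 0.0),
)
print(toy_score)  # (7 + 5 + 6 + 4) / 4 = 5.5
```

With the real `wordfreq` lookup, a higher mean would suggest a text built from more common vocabulary. Passing a stopword set keeps very frequent function words from dominating the average.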

In [None]:
def score(text):
    """Scores a text by its usage of common words (not yet implemented)"""
    
    # Remove stopwords
    raise NotImplementedError

In [32]:
data.loc[data['author'] == 'Enrique Krauze', 'body'][2]

'Martha creció con su numerosa familia en una colonia popular de la Ciudad de México. Tiene dos hijos: Michelle, de 17, y Josef, de 11. Trabaja de cocinera en una casa particular. Desde hace algunos años había vivido separada de Rubén, su marido, un hombre de cuarenta y dos años que se ganaba la vida como taxista. \r\n\r\n Al inicio de la pandemia, Martha comenzó a recibir noticias alarmantes de su colonia. Mucha gente conocida se estaba muriendo. Su tía Esperanza, de cerca de 74 años, enfermó. Había llegado de Acapulco a la casa de varios pisos donde viven generaciones de familiares suyos y algunos inquilinos. Murió cuando iban a llevarla al hospital. El acta de defunción registró "complicaciones respiratorias". Les dieron la caja con sus cenizas. Cuatro de sus hijos y varios nietos se hicieron la prueba de Covid y salieron positivos. Los vecinos quisieron quemar la casa. \r\n\r\n Otra tía, Irma, murió el 17 de enero. Tenía 78 años. Sus hijos se reunieron con ella en su casa del Estad

In [84]:
scores[scores > 70]

2      74.51
19     73.37
83     70.70
116    72.83
125    72.83
289    76.36
Name: body, dtype: float64

In [59]:
# Filter rows with a boolean mask built from the column, not from the label string
data[data['author'] == 'Enrique Krauze']


In [57]:
print('Dresser:', szigriszt_pazos(dresser_content))
print('Krauze: ', szigriszt_pazos(krauze_content))
print('Ackerman: ', szigriszt_pazos(ackerman_content))
print('Raphael: ', szigriszt_pazos(raphael_content))
print('Moy: ', szigriszt_pazos(moy_content))

Dresser: 53.11
Krauze:  49.59
Ackerman:  39.49
Raphael:  11.5
Moy:  55.46


In [50]:
test = data['body'][0]
test_2 = 'Manotazoazo'

In [27]:
textstat.lexicon_count(test, removepunct=True)

757

In [41]:
textstat.flesch_reading_ease(test)

62.6

## Lexical diversity
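
The notebook does not yet say which measure of lexical diversity it intends. As a placeholder, the sketch below computes the type-token ratio (unique words divided by total words), a common and simple choice. `type_token_ratio` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import re


def type_token_ratio(text):
    """Returns unique words / total words for a text (hypothetical helper)."""
    words = re.findall(r'\w+', text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# 6 unique words out of 9 tokens
sample_ratio = type_token_ratio('La casa del gato y la casa del perro')
print(sample_ratio)
```

The ratio tends to fall as texts get longer, so it is only comparable across texts of similar length. Comparing authors would probably require sampling equal-sized chunks of their articles.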

# Clarity metrics of all authors

# Export database