# Generating dataframes

This *notebook* generates general dataframes from the corpus loaded by `read_data.py`, which is stored in the `src` folder. To modify the corpus, add the desired files to that script.

In [1]:
# Required libraries
import pandas as pd

from src.classes import Text, Corpus
from src.read_data import create_text_list

In [2]:
# Build the corpus
corpus = Corpus(create_text_list())

### Corpus information

This section produces a file with the title, author, number of sentences, number of tokens, total number of words, and number of unique words of each text. The texts were not lemmatized to obtain these figures.
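
The classes in `src.classes` are not shown here, so the following is only a minimal sketch of how the four counts could differ. It assumes that tokens include punctuation marks, that words are the tokens containing a letter or digit, and that unique words are distinct lowercased surface forms. The function `corpus_stats` and its regular expressions are hypothetical, not the project's tokenizer.

```python
import re

def corpus_stats(text):
    """Count sentences, tokens, words and unique words without lemmatizing."""
    # Naive sentence split: ., ! or ? followed by whitespace (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Tokens: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Words: tokens that start with a word character, so punctuation is excluded.
    words = [t for t in tokens if re.match(r"\w", t)]
    # Unique words: distinct lowercased forms; inflected forms stay separate.
    unique_words = {w.lower() for w in words}
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "words": len(words),
        "unique words": len(unique_words),
    }

sample = "La casa es grande. La casa es vieja, muy vieja."
stats = corpus_stats(sample)
print(stats)
```

Because nothing is lemmatized, forms such as `vieja` and `viejas` would count as two different unique words under these assumptions.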

In [3]:
# Build the dataframe
dict_corpus_info = {'title':[i.title for i in corpus.list_of_texts],
                    'author':[i.author for i in corpus.list_of_texts],
                    'sentences':[len(i.sentences) for i in corpus.list_of_texts],
                    'tokens':[len(i.tokens) for i in corpus.list_of_texts],
                    'words':[len(i.words) for i in corpus.list_of_texts],
                    'unique words':[len(i.unique_words) for i in corpus.list_of_texts]
                    }
corpus_info = pd.DataFrame(dict_corpus_info)
corpus_info

,title,author,sentences,tokens,words,unique words
0,Fortunata y Jacinta,Benito Pérez Galdós,23279,469256,386505,28795
1,Gerona,Benito Pérez Galdós,3058,71742,60405,9888
2,La corte de Carlos IV,Benito Pérez Galdós,3460,81242,68370,10525
3,La de Bringas,Benito Pérez Galdós,3666,83676,70307,11477
4,Marianela,Benito Pérez Galdós,3059,60722,49980,8126
5,Misericordia,Benito Pérez Galdós,4175,100143,81381,12310
6,Napoleón en Chamartín,Benito Pérez Galdós,3869,90710,75077,11461
7,Trafalgar,Benito Pérez Galdós,2060,59558,50159,8304
8,La prueba,Emilia Pardo Bazán,3584,74508,61160,11651
9,La tribuna,Emilia Pardo Bazán,2991,77267,63784,12736


In [4]:
# Save
corpus_info.to_csv('./data/processed/corpus_info.csv')

### Corpus content

Next we generate the dataframe needed to process the texts. It uses the title of each work as the index and holds, for each text, the number of occurrences of the 1000 most common tokens in the corpus.
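
The implementation of `comparar_textos` lives in `src.classes` and is not reproduced here. As a rough sketch of the idea, assuming each text is available as a list of tokens, a count table of this shape could be built as follows. The helper `top_token_counts` and the toy corpus are hypothetical.

```python
from collections import Counter

import pandas as pd

def top_token_counts(texts, n=1000):
    """Rows: titles. Columns: the n most common tokens in the whole corpus."""
    per_text = {title: Counter(tokens) for title, tokens in texts.items()}
    # Corpus-wide frequencies decide which tokens become columns.
    corpus_counts = Counter()
    for counts in per_text.values():
        corpus_counts.update(counts)
    columns = [token for token, _ in corpus_counts.most_common(n)]
    # A Counter returns 0 for tokens that a given text never uses.
    rows = {
        title: [counts[token] for token in columns]
        for title, counts in per_text.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)

toy = {
    "A": ["de", ",", "de", "la", "."],
    "B": ["de", "la", "la", ",", "de", "de"],
}
counts = top_token_counts(toy, n=3)
print(counts)
```

Since the columns are chosen from corpus-wide totals, a token can rank among the top `n` overall and still have a count of 0 in an individual text.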

In [5]:
# Counts of the 1000 most common tokens in the corpus, one row per text
corpus_content = corpus.comparar_textos(n=1000)
corpus_content

title,",",de,.,y,que,la,a,el,en,no,...,leer,llamar,condición,poca,hechos,quienes,basta,conocía,entendimiento,buscar
Fortunata y Jacinta,29501,18177,18803,12905,15545,14470,9826,7871,7554,7478,...,34,49,19,45,41,24,36,44,24,29
Gerona,5134,2944,2425,2135,2130,1937,1577,1282,1327,1009,...,5,2,2,3,8,2,4,4,4,8
La corte de Carlos IV,5143,3458,2718,2130,2797,2226,1722,1477,1481,1218,...,8,4,8,5,8,12,4,7,9,10
La de Bringas,5297,4117,2886,2144,2472,2567,1570,1480,1545,1247,...,4,8,2,6,7,6,7,5,6,9
Marianela,3709,2483,2295,1473,1788,1873,1105,1024,982,922,...,10,1,6,6,3,1,9,1,6,1
Misericordia,8005,4250,3109,3079,3245,3063,1900,1626,1866,1578,...,3,7,6,8,4,3,4,8,1,6
Napoleón en Chamartín,6234,4182,2779,2732,3152,2232,1833,1515,1590,1361,...,8,10,9,3,8,9,12,2,3,4
Trafalgar,4314,2561,1828,1623,1879,1700,1314,1299,1117,729,...,1,3,3,10,10,10,2,7,3,5
La prueba,4850,3224,2451,2094,2316,2162,1575,1365,1348,1191,...,5,8,4,5,4,6,7,3,4,3
La tribuna,5813,3639,2159,2419,2057,2805,1641,1629,1245,817,...,6,2,4,10,3,7,7,6,0,5


In [6]:
# Save
corpus_content.to_csv('./data/processed/corpus_content.csv')

### Normalized corpus content

The following dataframe is the same as the previous one, but with normalized data: each count is simply divided by the total number of words analyzed in the corresponding work.
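
How `analisis1` computes the divisor is not shown. The sketch below assumes it is the row total over the selected tokens, which would turn each row into relative frequencies that sum to 1. The printed values seem consistent with that reading: for Fortunata y Jacinta, 29501 / 0.080962 is roughly 364,400, which is below both the 469,256 tokens and the 386,505 words reported for that novel. The toy table is invented for illustration.

```python
import pandas as pd

# Toy count table: one row per text, one column per token (illustrative values).
counts = pd.DataFrame(
    {"de": [6, 1], "la": [3, 1], "no": [1, 2]},
    index=["A", "B"],
)

# Total of the analyzed tokens in each text (assumed divisor).
row_totals = counts.sum(axis=1)

# Divide every row by its own total; axis=0 aligns the totals with the index.
relative = counts.div(row_totals, axis=0)
print(relative)
```

With this choice, long and short texts become comparable because every row is expressed as a share of its own total.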

In the future, it would be better to run the same analysis with a classical standardized measure or with Burrows's Delta.
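
As a possible starting point for that future work, here is a minimal sketch of Burrows's Delta computed from a table of relative frequencies: each token column is converted to z-scores across the texts, and the Delta between two texts is the mean absolute difference of their z-scores. The function name `burrows_delta`, the toy data, and the use of the population standard deviation (`ddof=0`) are assumptions made for illustration.

```python
import pandas as pd

def burrows_delta(relative):
    """Pairwise Burrows's Delta between the rows of a relative-frequency table."""
    # z-score every token column across the texts of the corpus.
    z = (relative - relative.mean()) / relative.std(ddof=0)
    # Columns with zero spread produce undefined z-scores; drop them.
    z = z.dropna(axis=1)
    titles = z.index
    delta = pd.DataFrame(0.0, index=titles, columns=titles)
    for a in titles:
        for b in titles:
            # Mean absolute difference of z-scores over all remaining tokens.
            delta.loc[a, b] = (z.loc[a] - z.loc[b]).abs().mean()
    return delta

# Toy relative frequencies (illustrative values only).
relative = pd.DataFrame(
    {"de": [0.5, 0.4, 0.3], "la": [0.2, 0.3, 0.4]},
    index=["A", "B", "C"],
)
delta = burrows_delta(relative)
print(delta)
```

Smaller values indicate more similar token usage. Note that pandas defaults to `ddof=1` in `std`, so the choice of `ddof` changes the scale of the result but not the ranking of the distances.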

In [7]:
# Same table as corpus_content, with the counts normalized per text
corpus_content_normal = corpus.analisis1(n=1000)
corpus_content_normal

title,",",de,.,y,que,la,a,el,en,no,...,leer,llamar,condición,poca,hechos,quienes,basta,conocía,entendimiento,buscar
Fortunata y Jacinta,0.080962,0.049884,0.051602,0.035416,0.042661,0.039711,0.026966,0.021601,0.020731,0.020522,...,9.3e-05,0.000134,5.2e-05,0.000123,0.000113,6.6e-05,9.9e-05,0.000121,6.6e-05,8e-05
Gerona,0.096159,0.05514,0.04542,0.039988,0.039894,0.03628,0.029537,0.024012,0.024854,0.018898,...,9.4e-05,3.7e-05,3.7e-05,5.6e-05,0.00015,3.7e-05,7.5e-05,7.5e-05,7.5e-05,0.00015
La corte de Carlos IV,0.083676,0.056261,0.044222,0.034655,0.045507,0.036217,0.028017,0.024031,0.024096,0.019817,...,0.00013,6.5e-05,0.00013,8.1e-05,0.00013,0.000195,6.5e-05,0.000114,0.000146,0.000163
La de Bringas,0.083945,0.065245,0.045736,0.033977,0.039175,0.040681,0.024881,0.023454,0.024485,0.019762,...,6.3e-05,0.000127,3.2e-05,9.5e-05,0.000111,9.5e-05,0.000111,7.9e-05,9.5e-05,0.000143
Marianela,0.079592,0.053283,0.049249,0.031609,0.038369,0.040193,0.023712,0.021974,0.021073,0.019785,...,0.000215,2.1e-05,0.000129,0.000129,6.4e-05,2.1e-05,0.000193,2.1e-05,0.000129,2.1e-05
Misericordia,0.106195,0.056381,0.041244,0.040846,0.043049,0.040634,0.025206,0.021571,0.024755,0.020934,...,4e-05,9.3e-05,8e-05,0.000106,5.3e-05,4e-05,5.3e-05,0.000106,1.3e-05,8e-05
Napoleón en Chamartín,0.090934,0.061002,0.040537,0.039851,0.045978,0.032558,0.026738,0.022099,0.023193,0.019853,...,0.000117,0.000146,0.000131,4.4e-05,0.000117,0.000131,0.000175,2.9e-05,4.4e-05,5.8e-05
Trafalgar,0.099387,0.059001,0.042114,0.037391,0.043289,0.039165,0.030272,0.029927,0.025734,0.016795,...,2.3e-05,6.9e-05,6.9e-05,0.00023,0.00023,0.00023,4.6e-05,0.000161,6.9e-05,0.000115
La prueba,0.088803,0.059031,0.044878,0.038341,0.042406,0.039586,0.028838,0.024993,0.024682,0.021807,...,9.2e-05,0.000146,7.3e-05,9.2e-05,7.3e-05,0.00011,0.000128,5.5e-05,7.3e-05,5.5e-05
La tribuna,0.106146,0.066449,0.039424,0.044171,0.037561,0.05122,0.029965,0.029746,0.022734,0.014919,...,0.00011,3.7e-05,7.3e-05,0.000183,5.5e-05,0.000128,0.000128,0.00011,0.0,9.1e-05


In [8]:
# Save
corpus_content_normal.to_csv('./data/processed/corpus_content_normal.csv')
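
When the saved files are loaded again, pandas does not know that the first column holds the index unless it is told so. The following sketch round-trips a toy table through a temporary file to show the difference. The table contents and the temporary path are illustrative and do not touch `./data/processed`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy table with titles as the index (illustrative values).
table = pd.DataFrame({"de": [6, 1], "la": [3, 1]}, index=["A", "B"])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus_content.csv"
    table.to_csv(path)
    # Without index_col, the saved index comes back as an ordinary column.
    naive = pd.read_csv(path)
    # With index_col=0, the first column is restored as the index.
    restored = pd.read_csv(path, index_col=0)

print(list(naive.columns))
print(list(restored.index))
```

Passing `index_col=0` keeps the titles as row labels, which is the shape the later processing steps expect.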