# Basic comparison of newspapers

In this notebook, we make some further visualizations of the corpus. We use the preprocessing performed in the previous notebook to create plots that are more specific than the ones we made in Notebook 2.

## Import packages

Import the necessary packages for this notebook.

In [1]:
import pandas as pd
import plotly.express as px


## Load the dataset

In [3]:
# Load the preprocessed corpus
data = pd.read_csv('data_preprocessed.csv')
# Drop rows that have no article text
data = data.dropna(subset=['content'])
# Show the first rows of the DataFrame
data.head()

Unnamed: 0,identifier,oaiIdentifier,type,title,date,content,krantnaam,verspreidingsgebied,month,day,doc,tokens,lemmas,article_length
0,http://resolver.kb.nl/resolve?urn=ddd:01026234...,DDD:ddd:010262344:mpeg21,artikel,RIVIKR-VA.A.B-',1873-01-01,"&M.STERDAM 31 December, ü ding n_ar: Tot Mannh...",De Tĳd : godsdienstig-staatkundig dagblad,Landelijk,January,Wednesday,"&M.STERDAM 31 December, ü ding n_ar: Tot Mannh...","['&', 'M.STERDAM', '31', 'December', ',', 'ü',...","['&', 'M.STERDAM', '31', 'December', ',', 'ü',...",41
1,http://resolver.kb.nl/resolve?urn=ddd:01026234...,DDD:ddd:010262344:mpeg21,artikel,?sfde van den feestdag:,1873-01-01,?sfde van den feestdag: 'ESIIIJI.EIIIS O. 11.»...,De Tĳd : godsdienstig-staatkundig dagblad,Landelijk,January,Wednesday,?sfde van den feestdag: 'ESIIIJI.EIIIS O. 11.»...,"['?', 'sfde', 'van', 'den', 'feestdag', ':', ""...","['?', 'sfde', 'van', 'den', 'feestdag', ':', ""...",18
2,http://resolver.kb.nl/resolve?urn=ddd:01106545...,DDD:ddd:011065450:mpeg21,artikel,WATERHOOGTE.,1873-01-01,"M<i.bh'i;n,27l>cc. 10 »:. 0 dm. Gtv. 4 dm. CoU...",De standaard,Landelijk,January,Wednesday,"M<i.bh'i;n,27l>cc. 10 »:. 0 dm. Gtv. 4 dm. CoU...","['M', '<', ""i.bh'i;n,27l"", '>', 'cc', '.', '10...","['m', '<', ""i.bh'i;n,27l"", '>', 'cc', '.', '10...",74
3,http://resolver.kb.nl/resolve?urn=ddd:01106545...,DDD:ddd:011065450:mpeg21,artikel,KOERS VAN ANTWERPEN 30 Dec. 1872.,1873-01-01,Amiterdam ztgt £r. 210.05 Geld. fr. 210 30 pap...,De standaard,Landelijk,January,Wednesday,Amiterdam ztgt £r. 210.05 Geld. fr. 210 30 pap...,"['Amiterdam', 'ztgt', '£', 'r.', '210.05', 'Ge...","['Amiterdam', 'ztgt', '£', 'r.', '210.05', 'Ge...",21
4,http://resolver.kb.nl/resolve?urn=ddd:01106545...,DDD:ddd:011065450:mpeg21,artikel,Koers van het geld bij de Ned. Bank sedert 12 ...,1873-01-01,Wissel-DUcuulo S jiCt.; Promeneu-DijconU 51') ...,De standaard,Landelijk,January,Wednesday,Wissel-DUcuulo S jiCt.; Promeneu-DijconU 51') ...,"['Wissel-DUcuulo', 'S', 'jiCt', '.', ';', 'Pro...","['Wissel-DUcuulo', 'S', 'JiCt', '.', ';', 'Pro...",34
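
In the output above, the `tokens` and `lemmas` values appear as quoted strings rather than Python lists, which is what typically happens when list columns are written to CSV and read back. The plots in this notebook only need `article_length`, so nothing has to change here. If real lists are needed later, one option is `ast.literal_eval`. The sketch below is an illustration on a made-up two-row DataFrame (`original` and `toy` are hypothetical names), not part of the original workflow.

```python
import ast
import io

import pandas as pd

# Hypothetical toy data standing in for the real corpus
original = pd.DataFrame({'tokens': [['Een', 'kort', 'bericht'], ['Nog', 'een']]})

# Round trip through CSV, using an in-memory buffer instead of a file
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
toy = pd.read_csv(buffer)

# After reading, each cell is a string such as "['Een', 'kort', 'bericht']"
was_string = isinstance(toy['tokens'][0], str)

# Parse the string representation back into a real list
toy['tokens'] = toy['tokens'].apply(ast.literal_eval)

print(was_string)
print(toy['tokens'][0])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.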


## Visualize the average length of articles per newspaper

First, let's use [plotly](https://plotly.com/python/) to create a boxplot of all article lengths in the corpus.

In [7]:
fig = px.box(data,
    y='article_length'
)

fig.update_layout(
    height = 650,
    width = 500,
    title='Article lengths in entire corpus',
    yaxis_title_text='Article length (words)',
    xaxis_title_text='Articles'
)

fig.show()
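
A box plot summarizes a distribution by its quartiles, and points far from the box are drawn individually as outliers. As a rough numeric cross-check, similar statistics can be computed with pandas. The sketch below uses a small made-up DataFrame (`toy`, a hypothetical stand-in for `data`), so the numbers are illustrative only. plotly's default quartile computation can differ slightly from the interpolation pandas uses.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are made up for illustration
toy = pd.DataFrame({'article_length': [41, 18, 74, 21, 34, 500]})

# Count, mean, min, max and quartiles in one call
summary = toy['article_length'].describe()

# The quartiles that define the box
q1, median, q3 = toy['article_length'].quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points more than 1.5 * IQR above the box are shown as outliers
upper_fence = q3 + 1.5 * iqr
outliers = toy[toy['article_length'] > upper_fence]

print(summary)
print(outliers)
```

On the real corpus, replacing `toy` with `data` should give the values behind the plot above.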

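Next, we split the same plot by newspaper (`krantnaam`), so that the distributions can be compared side by side.

In [6]:
fig = px.box(data,
             x='krantnaam',
             y='article_length')

fig.update_layout(
    height = 650,
    width = 600,
    title='Article length per newspaper',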
    yaxis_title_text='Article length (words)',
    xaxis_title_text='Newspaper'
)

fig.show()
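
The line inside each box is the median, not the mean, and the two can differ a lot when a few very long articles are present. A minimal sketch of how to compare both per newspaper, assuming a made-up DataFrame (`toy`) with hypothetical newspaper names in place of `data`:

```python
import pandas as pd

# Hypothetical stand-in for `data`; names and lengths are made up
toy = pd.DataFrame({
    'krantnaam': ['Paper A', 'Paper A', 'Paper A',
                  'Paper B', 'Paper B', 'Paper B'],
    'article_length': [74, 21, 34, 41, 18, 400],
})

# Number of articles, median and mean length for each newspaper
per_paper = (
    toy.groupby('krantnaam')['article_length']
       .agg(['count', 'median', 'mean'])
)

print(per_paper)
```

In this toy example, the single 400-word article pulls the mean of Paper B far above its median, which is the kind of skew a box plot makes visible.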

## Looking at months and days

Does the average article length differ between months, or between days of the week? Let's take a look.

In [8]:
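# Average article length per month, one bar per newspaper.
# barmode='group' puts the bars side by side; the default would stack
# them, and a stack of averages is not a meaningful quantity.
fig = px.histogram(data,
                   x='month',
                   y='article_length',
                   color='krantnaam',
                   histfunc='avg',
                   barmode='group')

fig.update_layout(
    title='Average article length per month, by newspaper',
    yaxis_title_text='Average article length (words)',
    xaxis_title_text='Month'
)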

fig.show()
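
The averages shown in the chart above can also be computed directly with a `groupby`, which makes it easy to inspect the exact values and to put the months in calendar order. The sketch below is one way to do this, using a made-up DataFrame (`toy`) with hypothetical newspaper names in place of `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`; names and lengths are made up
toy = pd.DataFrame({
    'krantnaam': ['Paper A', 'Paper A', 'Paper B', 'Paper B', 'Paper A'],
    'month': ['March', 'January', 'January', 'March', 'January'],
    'article_length': [30, 10, 40, 60, 20],
})

month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# An ordered categorical makes sorting and grouping follow the calendar
toy['month'] = pd.Categorical(toy['month'], categories=month_order, ordered=True)

# Mean article length per month and newspaper; observed=True skips
# months that do not occur in the data
monthly_avg = (
    toy.groupby(['month', 'krantnaam'], observed=True)['article_length']
       .mean()
       .reset_index()
)

print(monthly_avg)
```

The same list can be passed to plotly express as `category_orders={'month': month_order}` to fix the order of the x-axis in the chart.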


In [9]:
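# Average article length per day of the week, one bar per newspaper.
# barmode='group' avoids stacking the averages on top of each other.
fig = px.histogram(data,
                   x='day',
                   y='article_length',
                   color='krantnaam',
                   histfunc='avg',
                   barmode='group')

fig.update_layout(
    title='Average article length per day of the week, by newspaper',
    yaxis_title_text='Average article length (words)',
    xaxis_title_text='Day'
)

# Show the days in calendar order instead of order of appearance.
# Any day missing from this list (such as Sunday) is placed after the listed ones.
fig.update_xaxes(
    categoryorder='array',
    categoryarray=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
)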

fig.show()
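
For a compact numeric view of the same comparison, the averages can be arranged in a table with one row per day and one column per newspaper. This is a sketch under the assumption of a made-up DataFrame (`toy`) with hypothetical newspaper names in place of `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`; names and lengths are made up
toy = pd.DataFrame({
    'krantnaam': ['Paper A', 'Paper A', 'Paper B', 'Paper A', 'Paper B'],
    'day': ['Monday', 'Monday', 'Monday', 'Saturday', 'Wednesday'],
    'article_length': [10, 30, 50, 20, 40],
})

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# Mean article length per day (rows) and newspaper (columns);
# reindex puts the days in calendar order and leaves NaN where
# a newspaper has no articles on that day
table = (
    toy.pivot_table(index='day', columns='krantnaam',
                    values='article_length', aggfunc='mean')
       .reindex(day_order)
)

print(table)
```

Cells with `NaN` mark day and newspaper combinations without any articles, which is worth checking before reading much into a difference between bars.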


## Do we want more visualizations here?