# Basic comparison of newspapers

In this notebook, we create further visualizations of the corpus. We build on the preprocessing from the previous notebook to make plots that are more specific than the general ones in Notebook 2.

## Import packages

Import the necessary packages for this notebook.

In [None]:
import pandas as pd
import plotly.express as px
import pickle
import matplotlib.pyplot as plt
import numpy as np

## Load the dataset

In [None]:
# Deserialize
with open('data/preprocessed_docs.pkl', 'rb') as f:
    processed_docs = pickle.load(f)

## Article lengths

In the next code block, we count the number of tokens (words) in each article and store the result in a new column called 'article_length'.

In [None]:
# Count the tokens in each article and store the result in a new column
processed_docs['article_length'] = processed_docs['tokens'].apply(len)

# Show the first rows of the title, tokens and article length columns
processed_docs[['title', 'tokens', 'article_length']].head()
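
Before plotting, a numeric summary per newspaper can make the charts that follow easier to read. The sketch below is only an illustration: `toy_docs`, the paper names and the token lists are invented stand-ins for `processed_docs`, and it assumes that the 'tokens' column holds Python lists (on a raw string, `len` would count characters instead of words).

```python
import pandas as pd

# Toy stand-in for processed_docs; the rows and values are invented for illustration
toy_docs = pd.DataFrame({
    'krantnaam': ['Paper A', 'Paper A', 'Paper B'],
    'tokens': [
        ['de', 'kat', 'zit'],
        ['een', 'hond'],
        ['het', 'regent', 'vandaag', 'hard'],
    ],
})

# len() on a list of tokens counts the tokens in that article
toy_docs['article_length'] = toy_docs['tokens'].apply(len)

# Summary statistics of the article length for each newspaper
length_summary = toy_docs.groupby('krantnaam')['article_length'].describe()
print(length_summary[['count', 'mean', 'min', 'max']])
```

On the real data, the same `groupby(...).describe()` call on `processed_docs` should give the count, mean, quartiles and extremes per newspaper in one table.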

### What is the distribution of article lengths for each newspaper?

In [None]:
# Define the number of bins
num_bins = 10

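# Calculate bin edges on multiples of 100. The outer edges are rounded
# outwards so that the shortest and longest articles still fall inside a bin.
min_length = processed_docs['article_length'].min()
max_length = processed_docs['article_length'].max()
lower_edge = np.floor(min_length / 100) * 100
upper_edge = np.ceil(max_length / 100) * 100
bin_edges = np.linspace(lower_edge, upper_edge, num_bins + 1)
# Round the inner edges and drop duplicate edges, which pd.cut does not accept
bin_edges = np.unique(np.round(bin_edges, -2))

# Assign each article to a bin (labels=False returns the integer bin index)
processed_docs['length_bin'] = pd.cut(processed_docs['article_length'], bins=bin_edges, labels=False, include_lowest=True)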

# Counting the number of articles per bin per paper
bin_counts = processed_docs.groupby(['length_bin', 'krantnaam']).size().reset_index(name='counts')

# Create the plot
fig = px.bar(bin_counts, x='length_bin', y='counts', color='krantnaam', title='Number of articles per length group',
             labels={'length_bin': 'Article length (number of words)', 'counts': 'Number of articles'})

# Update layout for the x-axis ticks and bar mode
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=list(range(len(bin_edges) - 1)),
        ticktext=[f'{int(bin_edges[i])}-{int(bin_edges[i+1])}' for i in range(len(bin_edges) - 1)]
    ),
    barmode='group',  # Change the bar mode to group (bars next to each other)
    height=700  # Increase the height of the plot
)

# Rotate the tick labels so that they do not overlap
fig.update_xaxes(tickangle=45)

fig.show()
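
To see what `pd.cut` does with the bin edges, it can help to run it on a handful of values. The following is a minimal sketch with invented lengths, not the notebook's data; the names `lengths`, `edges` and `short_edges` are hypothetical. It shows that `labels=False` returns the index of the bin, and that a value beyond the outermost edge gets no bin at all (NaN), so it would silently disappear from a `groupby` on the bin column.

```python
import numpy as np
import pandas as pd

# Invented article lengths, for illustration only
lengths = pd.Series([40, 120, 260, 480, 990])

# Outer edges rounded outwards to a multiple of 100 so they enclose min and max
lower = np.floor(lengths.min() / 100) * 100
upper = np.ceil(lengths.max() / 100) * 100
edges = np.linspace(lower, upper, 5 + 1)

# labels=False returns the integer index of the bin each value falls into
bin_index = pd.cut(lengths, bins=edges, labels=False, include_lowest=True)
print(bin_index.tolist())

# With edges that stop short of the maximum, the longest article gets no bin
short_edges = np.array([0, 200, 400, 600, 800, 900])
clipped = pd.cut(lengths, bins=short_edges, labels=False, include_lowest=True)
print(int(clipped.isna().sum()), 'article(s) without a bin')
```

A quick check such as `processed_docs['length_bin'].isna().sum()` after binning is one way to confirm that no article was left out.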


### Visualize outliers with a boxplot

In [None]:
# Extract the article lengths for each newspaper
newspaper_names = processed_docs['krantnaam'].unique()
article_lengths = [processed_docs[processed_docs['krantnaam'] == newspaper]['article_length'] for newspaper in newspaper_names]

# Create the box plot
fig, ax = plt.subplots(figsize=(5, 6.5))  # width=5 inches, height=6.5 inches

# Plot data
ax.boxplot(article_lengths, vert=True, patch_artist=True)

# Set the title and labels
ax.set_title('Article lengths in entire corpus')
ax.set_ylabel('Article length (words)')
ax.set_xticklabels(newspaper_names, rotation=45, ha='right')  # Set x-axis labels to newspaper names
ax.set_xlabel('Newspapers')

# Display the plot
plt.show()
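
The points drawn beyond the whiskers are, with matplotlib's default setting `whis=1.5`, the values that lie more than 1.5 times the interquartile range (IQR) below the first or above the third quartile. As a rough sketch of how those outliers could be counted per newspaper, the block below applies that rule with pandas. The data and the helper `count_outliers` are invented for illustration and are not part of the original notebook.

```python
import pandas as pd

# Invented data for illustration; the real notebook would use processed_docs
toy_docs = pd.DataFrame({
    'krantnaam': ['Paper A'] * 6 + ['Paper B'] * 6,
    'article_length': [100, 110, 120, 130, 140, 900,
                       200, 210, 220, 230, 240, 250],
})

def count_outliers(lengths: pd.Series, whis: float = 1.5) -> int:
    """Count values outside [Q1 - whis * IQR, Q3 + whis * IQR]."""
    q1 = lengths.quantile(0.25)
    q3 = lengths.quantile(0.75)
    iqr = q3 - q1
    outside = (lengths < q1 - whis * iqr) | (lengths > q3 + whis * iqr)
    return int(outside.sum())

# Number of outlying articles for each newspaper
outlier_counts = toy_docs.groupby('krantnaam')['article_length'].apply(count_outliers)
print(outlier_counts)
```

In this toy data only the 900-word article in 'Paper A' lies outside the whisker range.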


## Looking at months and days

Does the average article length differ between months, or between days of the week? Let's take a look.

In [None]:
fig = px.histogram(processed_docs,
                 x='month',
                 y='article_length',
                 color='krantnaam',
                 histfunc='avg')


fig.update_layout(
    title='Average article length per newspaper, by month',
    yaxis_title_text='Article length (words)',
    xaxis_title_text='Month',
    barmode='group',  # Place the bars next to each other; stacked averages would be misleading
    bargap=0.1  # Adjust this value to make the bars narrower (0 to 1)
)

fig.show()


In [None]:
fig = px.histogram(processed_docs,
                 x='day',
                 y='article_length',
                 color='krantnaam',
                 histfunc='avg')

fig.update_layout(
    title='Average article length per newspaper, by day of the week',
    yaxis_title_text='Article length (words)',
    xaxis_title_text='Day',
    barmode='group',  # Change the bar mode to group (bars next to each other)
)

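# Order the days chronologically; a day that is not listed (such as Sunday)
# is placed after the listed ones
fig.update_xaxes(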
    categoryorder='array',
    categoryarray=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])

fig.show()
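
The averages behind these bar charts can also be inspected as a table. The sketch below, on invented data, computes the mean article length per day and newspaper with `pivot_table`, which should correspond to what `histfunc='avg'` plots. The frame `toy_docs`, the paper names and the list `day_order` are assumptions made for the example.

```python
import pandas as pd

# Invented data for illustration; the real notebook would use processed_docs
toy_docs = pd.DataFrame({
    'krantnaam': ['Paper A', 'Paper A', 'Paper B', 'Paper B', 'Paper A'],
    'day': ['Monday', 'Monday', 'Monday', 'Saturday', 'Saturday'],
    'article_length': [100, 300, 250, 400, 500],
})

# Same day order as used for the x-axis of the plot
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# Mean article length per day (rows) and newspaper (columns)
mean_lengths = (
    toy_docs
    .pivot_table(index='day', columns='krantnaam',
                 values='article_length', aggfunc='mean')
    .reindex(day_order)   # put the days in chronological order
    .dropna(how='all')    # drop days without any article
)
print(mean_lengths)
```

Replacing 'day' with 'month' gives the corresponding table for the monthly plot.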
