In this notebook, we use the full texts of articles from BamS, BILD, Spiegel, Focus, Capital, and WamS, along with their sentiment labels, to prepare the input for sentiment models. The steps involved in this process are as follows:

1. **Combining Articles**: We combine all the articles from the different sources into a single DataFrame.
2. **Randomizing Sequence**: We randomize the sequence of articles.
3. **Creating Binary Sentiment Labels**: We map each article's sentiment score to a binary label: scores of 1.0 and 0.0 become `positive`, and a score of -1.0 becomes `negative`.
4. **Saving Data**: We save the article texts and their corresponding binary sentiment labels to separate text files.
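
The randomization step is only repeatable if the shuffle itself is seeded. As a minimal sketch on a hypothetical toy DataFrame (not the real article data), passing `random_state` to `DataFrame.sample` yields the same order on every run, because pandas draws from NumPy's random generator rather than from Python's `random` module:

```python
import pandas as pd

# Hypothetical toy data standing in for the combined article DataFrame.
toy = pd.DataFrame({'text': list('abcdef')})

# Shuffle twice with the same seed; both shuffles should give the same order.
first = toy.sample(frac=1, random_state=1).reset_index(drop=True)
second = toy.sample(frac=1, random_state=1).reset_index(drop=True)

same_order = first['text'].tolist() == second['text'].tolist()

# Shuffling only reorders rows; no article is lost or duplicated.
same_rows = sorted(first['text']) == sorted(toy['text'])
```

The seed value `1` is arbitrary here; any fixed integer gives a repeatable order.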

In [1]:
import pandas as pd
import numpy as np
import random
import os

# Read Data from CSV Files
bams_bild = pd.read_csv('bams_bild.csv', encoding='utf-8', sep=';')
spiegel = pd.read_csv('spiegel.csv', encoding='utf-8', sep=';')
focus = pd.read_csv('focus.csv', encoding='utf-8', sep=';')
capital = pd.read_csv('capital.csv', encoding='utf-8', sep=';')
wams = pd.read_csv('wams.csv', encoding='utf-8', sep=';')

# Combine Data into a Single DataFrame
media_tenor = pd.concat([bams_bild, spiegel, focus, capital, wams], sort=False).reset_index(drop=True)

# Save the DataFrame to a CSV file
media_tenor.to_csv('media_tenor.csv', encoding='utf-8-sig', sep=';')

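# Randomize the Sequence of Articles
# pandas draws from NumPy's random generator, so random.seed() has no effect
# on sample(); pass random_state to make the shuffle reproducible.
media_tenor = media_tenor.sample(frac=1, random_state=1).reset_index(drop=True)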

# Create Binary Sentiment Labels
def sentiment_binary(row):
    '''Return 'positive' for sentiment 1.0 or 0.0 and 'negative' for -1.0.

    Any other value (including NaN) falls through and yields None.
    '''
    if row['sentiment'] in [1.0, 0.0]:
        return 'positive'
    elif row['sentiment'] == -1.0:
        return 'negative'

# Apply sentiment_binary row by row to create the binary sentiment column
media_tenor['binary_sentiment'] = media_tenor.apply(sentiment_binary, axis=1)

# Save the Data to Files
# Extract text and binary sentiment labels
data = media_tenor[['text', 'binary_sentiment']]

# Define the output directory and create it if it does not exist yet
output_dir = os.path.join(os.getcwd(), '..', 'sentiment', 'MediaTenor_data')
os.makedirs(output_dir, exist_ok=True)

# Save the article texts
np.savetxt(os.path.join(output_dir, 'articles.txt'), data.text.values, fmt='%s', encoding='utf-8')

# Save the binary sentiment labels
np.savetxt(os.path.join(output_dir, 'labels_binary.txt'), data.binary_sentiment.values, fmt='%s', encoding='utf-8')
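
Because the texts and labels go into two separate files, line *i* of one file has to match line *i* of the other. Two things can break that pairing: rows whose sentiment is missing or unexpected (they receive no label), and article texts that contain line breaks. The following is a minimal sketch of one way to guard against both, using a hypothetical toy DataFrame and a vectorized `map` in place of the row-wise function; it is not part of the original pipeline.

```python
import pandas as pd

# Hypothetical toy stand-in for media_tenor.
toy = pd.DataFrame({
    'text': ['Erste Zeile\nzweite Zeile', 'Guter Artikel', 'Ohne Label'],
    'sentiment': [-1.0, 0.0, None],
})

# Vectorized equivalent of sentiment_binary: 1.0 and 0.0 -> 'positive', -1.0 -> 'negative'.
label_map = {1.0: 'positive', 0.0: 'positive', -1.0: 'negative'}
toy['binary_sentiment'] = toy['sentiment'].map(label_map)

# Values missing from label_map (including NaN) become NaN; drop those rows.
labelled = toy.dropna(subset=['binary_sentiment']).copy()

# Collapse all whitespace, including line breaks, so each article fits on one line.
labelled['text'] = labelled['text'].str.split().str.join(' ')

articles = labelled['text'].tolist()
labels = labelled['binary_sentiment'].tolist()
```

Whether unlabeled rows should be dropped or handled differently depends on the downstream model; dropping them is only an assumption made for this sketch.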

The following code computes descriptive statistics for the final dataset: the distribution of the original three-valued sentiment scores, as counts and percentages.

In [2]:
# Calculate counts and percentages
sentiment_counts = media_tenor['sentiment'].value_counts()
sentiment_percentages = media_tenor['sentiment'].value_counts(normalize=True) * 100

# Create a DataFrame with counts and percentages
sentiment_stats = pd.DataFrame({
    'Count': sentiment_counts,
    'Percentage': sentiment_percentages
})

# Move sentiment from the index into a column named 'Sentiment'
# (rename_axis works regardless of how value_counts() names the index)
sentiment_stats = sentiment_stats.rename_axis('Sentiment').reset_index()

sentiment_stats

| | Sentiment | Count | Percentage |
|---|---|---|---|
| 0 | -1.0 | 1604 | 48.813147 |
| 1 | 1.0 | 992 | 30.188679 |
| 2 | 0.0 | 690 | 20.998174 |
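
The table above describes the three-valued scores, while the models receive binary labels in which 0.0 is merged with 1.0. As a rough sketch, the binary class balance can be derived from the reported counts alone; the `label_map` dictionary below restates the mapping used by `sentiment_binary` and is introduced here only for illustration.

```python
import pandas as pd

# Three-way counts copied from the table above.
counts = pd.Series({-1.0: 1604, 1.0: 992, 0.0: 690}, name='Count')

# Same mapping as sentiment_binary: 1.0 and 0.0 -> 'positive', -1.0 -> 'negative'.
label_map = {1.0: 'positive', 0.0: 'positive', -1.0: 'negative'}

# Group the counts by their binary label and sum within each group.
binary_counts = counts.groupby(label_map).sum()

# Convert to percentages of all articles.
binary_pct = (binary_counts / binary_counts.sum() * 100).round(2)
```

With these counts, the binary dataset has 1682 `positive` and 1604 `negative` articles, roughly 51.2% versus 48.8%, so the two classes are close to balanced.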
