### EDA on Reuters-21578 dataset

This notebook aims to conduct an in-depth Exploratory Data Analysis (EDA) on the Reuters-21578 text classification dataset. The Reuters dataset is a collection of news documents that are categorized into various topics. Understanding the characteristics and nuances of this dataset is crucial for building efficient and effective text classification models.

Note that we are using the `ModApte` split in this case.

---

#### 1. Dataset Overview

In [1]:
from datasets import load_dataset

dataset = load_dataset("reuters21578", "ModApte")


In [2]:
dataset

DatasetDict({
    test: Dataset({
        features: ['text', 'text_type', 'topics', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title'],
        num_rows: 3299
    })
    train: Dataset({
        features: ['text', 'text_type', 'topics', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title'],
        num_rows: 9603
    })
    unused: Dataset({
        features: ['text', 'text_type', 'topics', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title'],
        num_rows: 722
    })
})

In [3]:
dataset["train"][0]

{'text': 'Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standing at almost 6.2 mln there\nare a few hundred thousand bags still in the hands of farmers,\nmiddlemen, exporters and processors.\n    There are doubts as to how much of this coc

---

#### 2. Basic Statistics

- Count the number of samples.
- Calculate the average length of the text.
- Estimate the vocabulary size.

In [4]:
import pandas as pd

df = pd.DataFrame(dataset["train"])

In [5]:
df.head()

Unnamed: 0,text,text_type,topics,lewis_split,cgis_split,old_id,new_id,places,people,orgs,exchanges,date,title
0,Showers continued throughout the week in\nthe ...,"""NORM""",[cocoa],"""TRAIN""","""TRAINING-SET""","""5544""","""1""","[el-salvador, usa, uruguay]",[],[],[],26-FEB-1987 15:01:01.79,BAHIA COCOA REVIEW
1,The U.S. Agriculture Department\nreported the ...,"""NORM""","[grain, wheat, corn, barley, oat, sorghum]","""TRAIN""","""TRAINING-SET""","""5548""","""5""",[usa],[],[],[],26-FEB-1987 15:10:44.60,NATIONAL AVERAGE PRICES FOR FARMER-OWNED RESERVE
2,Argentine grain board figures show\ncrop regis...,"""NORM""","[veg-oil, linseed, lin-oil, soy-oil, sun-oil, ...","""TRAIN""","""TRAINING-SET""","""5549""","""6""",[argentina],[],[],[],26-FEB-1987 15:14:36.41,ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
3,Moody's Investors Service Inc said it\nlowered...,"""NORM""",[],"""TRAIN""","""TRAINING-SET""","""5551""","""8""",[usa],[],[],[],26-FEB-1987 15:15:40.12,USX &lt;X> DEBT DOWGRADED BY MOODY'S
4,Champion Products Inc said its\nboard of direc...,"""NORM""",[earn],"""TRAIN""","""TRAINING-SET""","""5552""","""9""",[usa],[],[],[],26-FEB-1987 15:17:11.20,CHAMPION PRODUCTS &lt;CH> APPROVES STOCK SPLIT


In [6]:
from typing import List, Dict

def basic_statistics(text_column: pd.Series) -> Dict[str, float]:
    """
    Calculate basic statistics about the text data in the DataFrame.
    
    Args:
        text_column (pd.Series): The text column in the DataFrame.
        
    Returns:
        Dict[str, float]: A dictionary containing the basic statistics.
    """
    # Count the number of samples
    num_samples = len(text_column)
    
    # Calculate the average length of the text, in characters
    avg_length = text_column.apply(len).mean()
    
    # Estimate the vocabulary size (case-sensitive, whitespace-delimited tokens)
    all_words = ' '.join(text_column).split()
    vocab_size = len(set(all_words))
    
    return {
        'Number of Samples': num_samples,
        'Average Text Length': avg_length,
        'Vocabulary Size': vocab_size
    }

# Perform basic statistics on the 'text' column
basic_stats = basic_statistics(df['text'])
basic_stats

{'Number of Samples': 9603,
 'Average Text Length': 772.6746849942726,
 'Vocabulary Size': 66778}
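
The average length reported above is measured in characters. Since models usually consume tokens, it can help to look at both units side by side. The following is a minimal sketch, assuming plain whitespace tokenization is good enough for a first look; `length_statistics` is a hypothetical helper, not part of the notebook.

```python
from statistics import mean
from typing import Dict, Iterable


def length_statistics(texts: Iterable[str]) -> Dict[str, float]:
    """Report the average length in characters and in whitespace-delimited tokens."""
    texts = list(texts)
    char_lengths = [len(t) for t in texts]
    token_lengths = [len(t.split()) for t in texts]
    return {
        "avg_chars": mean(char_lengths),
        "avg_tokens": mean(token_lengths),
    }


# Toy input for illustration only, not taken from the dataset statistics
sample = ["Showers continued throughout the week", "mln dlrs"]
print(length_statistics(sample))
```

On the notebook's data this could be called as `length_statistics(df['text'])`, since a pandas Series iterates over its values.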

---

#### 3. Word Frequency Analysis

- Identify the most frequent words.
- Identify the least frequent words.
- Count the number of unique words.

In [7]:
from collections import Counter

def word_frequency_analysis(text_column: pd.Series, n_most_frequent: int = 10, n_least_frequent: int = 10) -> Dict[str, object]:
    """
    Analyze the frequency of words in the text column.
    
    Args:
        text_column (pd.Series): The text column in the DataFrame.
        n_most_frequent (int): Number of most frequent words to return.
        n_least_frequent (int): Number of least frequent words to return.
        
    Returns:
        Dict[str, object]: The most and least frequent words as lists of
            (word, count) tuples, plus the number of unique words.
    """
    # Tokenize the text and count word frequencies
    all_words = ' '.join(text_column).lower().split()
    word_freq = Counter(all_words)
    
    # Get the n most frequent words
    most_frequent_words = word_freq.most_common(n_most_frequent)
    
    # Get the n least frequent words
    least_frequent_words = word_freq.most_common()[:-n_least_frequent-1:-1]
    
    return {
        'Most Frequent Words': most_frequent_words,
        'Least Frequent Words': least_frequent_words,
        'Unique Words': len(word_freq)
    }

# Perform word frequency analysis on the 'text' column
word_freq_stats = word_frequency_analysis(df['text'])
word_freq_stats

{'Most Frequent Words': [('the', 64571),
  ('of', 34532),
  ('to', 31973),
  ('in', 25032),
  ('and', 24970),
  ('a', 23336),
  ('said', 15608),
  ('mln', 14266),
  ('for', 11835),
  ('it', 9755)],
 'Least Frequent Words': [('unsworth', 1),
  ('barrie', 1),
  ('ratification.', 1),
  ('260,000,', 1),
  ('845.50', 1),
  ('843.90.', 1),
  ('844.30', 1),
  ('midrate', 1),
  ('4,044', 1),
  ('3,978', 1)],
 'Unique Words': 61882}

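- Most Frequent Words: The most common words are mostly stop words like 'the', 'of', and 'to', which is expected in natural language text. Domain terms such as 'said' and 'mln' also rank in the top ten.
- Least Frequent Words: Tokens such as 'unsworth', 'barrie', and 'ratification.' appear only once in the training set. Several of them are numbers or carry trailing punctuation, because the text is split on whitespace only.
- Unique Words: There are 61,882 unique whitespace-delimited tokens after lowercasing the text.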
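
Because stop words dominate the top of the list and punctuation stays attached to tokens such as 'ratification.', a slightly stricter tokenizer can give a clearer picture of the content words. The sketch below is one possible approach, not the notebook's method: the regular expression, the tiny `STOP_WORDS` set, and the helper name `content_word_counts` are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Tiny illustrative stop-word list (an assumption); a fuller list would be needed in practice
STOP_WORDS: Set[str] = {"the", "of", "to", "in", "and", "a", "for", "it"}

# Keep runs of letters, optionally joined by internal hyphens or apostrophes.
# Numbers and surrounding punctuation are discarded by this pattern.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:[-'][a-z]+)*")


def content_word_counts(texts: Iterable[str], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count lowercased word tokens, skipping stop words and punctuation."""
    counts: Counter = Counter()
    for text in texts:
        tokens = TOKEN_PATTERN.findall(text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


# Toy input: 'ratification.' and 'Ratification,' collapse into one token
sample = ["The ratification of the treaty.", "Ratification, said the bank."]
print(content_word_counts(sample, top_n=3))
```

Dropping numbers is a deliberate simplification here; for financial news, figures may carry signal and could be mapped to a placeholder token instead.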

---

#### 4. N-gram analysis

In [8]:
#!pip install nltk

In [9]:
from typing import Tuple

from nltk.util import ngrams

def ngram_analysis(text_column: pd.Series, n: int, top_n: int = 10) -> List[Tuple[Tuple[str, ...], int]]:
    """
    Analyze the frequency of n-grams in the text column.
    
    Args:
        text_column (pd.Series): The text column in the DataFrame.
        n (int): The size of the n-gram (bi-gram: 2, tri-gram: 3, etc.)
        top_n (int): Number of most frequent n-grams to return.
        
    Returns:
        List[Tuple[Tuple[str, ...], int]]: The most frequent n-grams as (n-gram, count) pairs.
    """
    # Tokenize the text
    all_words = ' '.join(text_column).lower().split()
    
    # Generate n-grams
    n_grams = ngrams(all_words, n)
    
    # Count the frequency of each n-gram
    ngram_freq = Counter(n_grams)
    
    # Get the top_n most frequent n-grams
    most_frequent_ngrams = ngram_freq.most_common(top_n)
    
    return most_frequent_ngrams

# Perform n-gram analysis for bi-grams and tri-grams on the 'text' column
bigram_stats = ngram_analysis(df['text'], 2)
trigram_stats = ngram_analysis(df['text'], 3)

In [10]:
# bi-gram
bigram_stats

[(('of', 'the'), 6474),
 (('in', 'the'), 5918),
 (('said', 'it'), 3918),
 (('said', 'the'), 3305),
 (('mln', 'dlrs'), 2997),
 (('for', 'the'), 2570),
 (('mln', 'vs'), 2476),
 (('to', 'the'), 2328),
 (('will', 'be'), 2316),
 (('cts', 'vs'), 2175)]

In [11]:
# tri-gram
trigram_stats

[(('the', 'company', 'said'), 773),
 (('mln', 'dlrs', 'in'), 635),
 (('mln', 'dlrs', 'of'), 585),
 (('pct', 'of', 'the'), 555),
 (('said', 'it', 'has'), 554),
 (('inc', 'said', 'it'), 525),
 (('cts', 'vs', 'loss'), 495),
 (('corp', 'said', 'it'), 460),
 (('the', 'end', 'of'), 449),
 (('the', 'bank', 'of'), 406)]

- Most Frequent Bi-grams: The pair ('of', 'the') appears most frequently, followed by ('in', 'the'), ('said', 'it'), etc.
- Most Frequent Tri-grams: The sequence ('the', 'company', 'said') is the most common tri-gram, followed by ('mln', 'dlrs', 'in'), ('mln', 'dlrs', 'of'), etc.
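
The analysis above joins all documents into one token stream, so a few n-grams span the boundary between the end of one article and the start of the next. A minimal sketch of a per-document variant follows; it uses only the standard library, and `per_document_ngrams` is a hypothetical name. The effect on the top counts is likely small, but this has not been measured here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def per_document_ngrams(
    texts: Iterable[str], n: int, top_n: int = 10
) -> List[Tuple[Tuple[str, ...], int]]:
    """Count n-grams within each document so none crosses a document boundary."""
    counts: Counter = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Zipping n shifted views of the token list yields this document's n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_n)


# Toy input: ('b', 'c') is not counted because 'b' and 'c' are in different documents
print(per_document_ngrams(["a b", "c d"], n=2))
```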

---

#### 5. Label Analysis

In [12]:
from itertools import chain
from collections import defaultdict

def label_analysis(topics_column: pd.Series) -> Dict[str, object]:
    """
    Perform label analysis on the topics column.
    
    Args:
        topics_column (pd.Series): The topics column containing the labels.
        
    Returns:
        Dict[str, object]: A dictionary containing the per-label counts, the label cardinality, and the label density.
    """
    # Count individual labels
    individual_label_count = Counter(chain.from_iterable(topics_column))
    
    # Calculate label cardinality (average number of labels per instance)
    label_cardinality = topics_column.apply(len).mean()
    
    # Calculate label density (label cardinality divided by the number of unique labels)
    label_density = label_cardinality / len(individual_label_count) if individual_label_count else 0
    
    return {
        'unique_label_count': individual_label_count,
        'label_cardinality': label_cardinality,
        'label_density': label_density
    }

# Perform label analysis on the 'topics' column
label_stats = label_analysis(df['topics'])

In [13]:
for k, v in label_stats.items():
    print(f"{k}:\n\t{v}")

unique_label_count:
	Counter({'earn': 2877, 'acq': 1650, 'money-fx': 538, 'grain': 433, 'crude': 389, 'trade': 369, 'interest': 347, 'wheat': 212, 'ship': 197, 'corn': 182, 'money-supply': 140, 'dlr': 131, 'sugar': 126, 'oilseed': 124, 'coffee': 111, 'gnp': 101, 'gold': 94, 'veg-oil': 87, 'soybean': 78, 'livestock': 75, 'nat-gas': 75, 'bop': 75, 'cpi': 69, 'cocoa': 55, 'reserves': 55, 'carcass': 50, 'copper': 47, 'jobs': 46, 'yen': 45, 'ipi': 41, 'iron-steel': 40, 'cotton': 39, 'barley': 37, 'rubber': 37, 'gas': 37, 'rice': 35, 'alum': 35, 'meal-feed': 30, 'palm-oil': 30, 'sorghum': 24, 'retail': 23, 'silver': 21, 'zinc': 21, 'pet-chem': 20, 'wpi': 19, 'tin': 18, 'rapeseed': 18, 'stg': 17, 'housing': 16, 'strategic-metal': 16, 'hog': 16, 'orange': 16, 'lead': 15, 'soy-oil': 14, 'heat': 14, 'soy-meal': 13, 'fuel': 13, 'lei': 12, 'sunseed': 11, 'lumber': 10, 'dmk': 10, 'tea': 9, 'income': 9, 'oat': 8, 'nickel': 8, 'l-cattle': 6, 'sun-oil': 5, 'platinum': 5, 'rape-oil': 5, 'groundnut': 5,

In [14]:
# Calculate the overall topic frequency
overall_topic_freq = Counter(chain.from_iterable(df['topics']))

# Identify and count labels that appear only once
labels_once = [label for label, count in overall_topic_freq.items() if count == 1]
count_labels_once = len(labels_once)

print(f"{count_labels_once = }")
print(f"{labels_once = }")

count_labels_once = 20
labels_once = ['lin-oil', 'rye', 'red-bean', 'groundnut-oil', 'citruspulp', 'rape-meal', 'corn-oil', 'peseta', 'cotton-oil', 'ringgit', 'castorseed', 'castor-oil', 'lit', 'rupiah', 'skr', 'nkr', 'dkr', 'sun-meal', 'lin-meal', 'cruzado']


- Unique Label Count: 20 labels appear only once in the training set.
- Label Cardinality: The average number of labels per instance is approximately 1.005.
- Label Density: The label density, calculated as the label cardinality divided by the number of unique labels, is approximately 0.0087. On average, each article carries less than 1% of the available labels. In simpler terms, the label space is large, while each article is usually tied to only a small number of specific topics.
- Top 5 Most Common Topics: The most common topics across all training documents are 'earn', 'acq', 'money-fx', 'grain', and 'crude'.
- 5 of the Least Common Topics: Among the 20 topics that appear only once in the training set are 'cruzado', 'lin-meal', 'sun-meal', 'dkr', and 'nkr'.
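
To make the two metrics concrete, here is a small worked example on toy data. It is a sketch using only the standard library; `cardinality_and_density` is a hypothetical helper, and the extra `unlabeled_fraction` field is an addition for illustration, motivated by the fact that some articles (such as row 3 of `df.head()`) have an empty `topics` list.

```python
from itertools import chain
from typing import Dict, List, Sequence


def cardinality_and_density(label_lists: Sequence[List[str]]) -> Dict[str, float]:
    """Compute label cardinality, label density, and the share of unlabeled items."""
    n_docs = len(label_lists)
    n_unique = len(set(chain.from_iterable(label_lists)))
    # Cardinality: average number of labels per document
    cardinality = sum(len(labels) for labels in label_lists) / n_docs
    # Density: cardinality normalized by the size of the label space
    density = cardinality / n_unique if n_unique else 0.0
    unlabeled_fraction = sum(1 for labels in label_lists if not labels) / n_docs
    return {
        "cardinality": cardinality,
        "density": density,
        "unlabeled_fraction": unlabeled_fraction,
    }


# Four toy documents: 4 labels in total, 4 distinct labels, 1 unlabeled document
toy = [["earn"], ["acq"], ["grain", "wheat"], []]
print(cardinality_and_density(toy))
# {'cardinality': 1.0, 'density': 0.25, 'unlabeled_fraction': 0.25}
```

The toy case shows why a cardinality close to 1 does not by itself mean every document has exactly one label: multi-label and unlabeled documents can offset each other.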

---

#### 6. Document Length Analysis (median)

In [15]:
# Calculate the median document length
median_doc_length = df['text'].apply(len).median()

# Filter documents above and below the median length
df_above_median = df[df['text'].apply(len) > median_doc_length]
df_below_median = df[df['text'].apply(len) < median_doc_length]

# Count of documents above and below the median length
count_above_median = len(df_above_median)
count_below_median = len(df_below_median)

# Most common topics in documents above the median length
most_common_above_median = Counter(chain.from_iterable(df_above_median['topics'])).most_common(5)

# Most common topics in documents below the median length
most_common_below_median = Counter(chain.from_iterable(df_below_median['topics'])).most_common(5)

print(f"{median_doc_length = } ")
print(f"{count_above_median = } ")
print(f"{most_common_above_median = } ")
print(f"{count_below_median = } ")
print(f"{most_common_below_median = } ")

median_doc_length = 511.0 
count_above_median = 4801 
most_common_above_median = [('acq', 848), ('earn', 683), ('money-fx', 316), ('trade', 315), ('grain', 313)] 
count_below_median = 4796 
most_common_below_median = [('earn', 2192), ('acq', 799), ('money-fx', 222), ('interest', 187), ('grain', 120)] 


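- Median Document Length: The median length of the documents is 511 characters. Documents of exactly this length are excluded from both groups below, which is why the two counts sum to 9,597 rather than 9,603.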

For documents above the median length:

- Number of Documents: There are 4,801 documents that are longer than the median length.
- Most Common Topics: The top 5 most common topics in these documents are 'acq', 'earn', 'money-fx', 'trade', and 'grain'.

For documents below the median length:

- Number of Documents: There are 4,796 documents that are shorter than the median length.
- Most Common Topics: The top 5 most common topics in these documents are 'earn', 'acq', 'money-fx', 'interest', and 'grain'.
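
Since the strict `>` and `<` comparisons leave out documents whose length equals the median, it can be useful to track that third group explicitly. The following is a minimal sketch with a hypothetical helper, `split_by_median_length`, written against plain lists rather than the DataFrame.

```python
from statistics import median
from typing import Dict, List


def split_by_median_length(texts: List[str]) -> Dict[str, List[str]]:
    """Partition texts into those above, at, and below the median character length."""
    med = median(len(t) for t in texts)
    groups: Dict[str, List[str]] = {"above": [], "at": [], "below": []}
    for text in texts:
        if len(text) > med:
            groups["above"].append(text)
        elif len(text) < med:
            groups["below"].append(text)
        else:
            groups["at"].append(text)
    return groups


# Toy input with lengths 1, 2, 2, 3, so the median is 2.0
toy = ["a", "bb", "bb", "ccc"]
print({name: len(group) for name, group in split_by_median_length(toy).items()})
# {'above': 1, 'at': 2, 'below': 1}
```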

---

#### 7. Check single-appearance labels in the test set

In [16]:
single_appearance_labels = ['lin-oil',
 'rye',
 'red-bean',
 'groundnut-oil',
 'citruspulp',
 'rape-meal',
 'dfl',
 'corn-oil',
 'peseta',
 'cotton-oil',
 'ringgit',
 'lit',
 'rupiah',
 'skr',
 'nkr',
 'dkr',
 'sun-meal',
 'lin-meal',
 'cruzado']

In [17]:
# Initialize a list to store samples with single-appearance labels
samples_with_single_appearance_labels = []

for sample in dataset["test"]:
    if any(label in single_appearance_labels for label in sample['topics']):
        samples_with_single_appearance_labels.append(sample)

In [18]:
len(samples_with_single_appearance_labels)

7

Before removing labels that appear only once in the training set, it is worth checking how often these labels show up in the test set. Labels with a single training example are problematic for classification, because a model has too little data to learn them effectively.

Only seven of the 3,299 test samples contain one of these single-appearance labels. Dropping the labels therefore affects very little of the test set while reducing the size of the label space.
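
The list in cell 16 is written out by hand: it has 19 entries and includes 'dfl', while cell 14 printed 20 labels including 'castorseed' and 'castor-oil'. One way to avoid such drift is to derive the set directly from the training counts. The sketch below shows that idea under stated assumptions: the three helper names are hypothetical, and the toy data stands in for `df['topics']` and the test split's `topics`.

```python
from collections import Counter
from itertools import chain
from typing import Iterable, List, Set


def find_single_appearance_labels(train_topics: Iterable[List[str]]) -> Set[str]:
    """Return the labels that occur exactly once across the training topics."""
    counts = Counter(chain.from_iterable(train_topics))
    return {label for label, count in counts.items() if count == 1}


def count_affected(test_topics: Iterable[List[str]], rare: Set[str]) -> int:
    """Count samples that carry at least one of the rare labels."""
    return sum(1 for topics in test_topics if rare.intersection(topics))


def drop_labels(topics: List[str], rare: Set[str]) -> List[str]:
    """Remove rare labels from one sample's topic list, keeping the original order."""
    return [t for t in topics if t not in rare]


# Toy data: 'rye' occurs once in training, so it is the only rare label
train = [["earn"], ["earn", "rye"], ["acq"], ["acq"]]
test = [["rye", "earn"], ["acq"], []]

rare = find_single_appearance_labels(train)
print(rare, count_affected(test, rare), drop_labels(test[0], rare))
```

Note that a sample whose only label is rare ends up with an empty topic list after dropping, so it may need to be handled as unlabeled or filtered out, depending on the modelling setup.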

---

Key takeaways:

- This is a multilabel text dataset.
- Preprocess the train and test splits by removing the single-appearance labels.
- We will focus on the `title`, `text`, and `topics` fields for this multilabel text classification task.
- A reasonable starting point for `max_vocab_size` is about 20,000. PapersWithCode describes the corpus as follows: "The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary of 29,930 words."
- We are using the dataset hosted on the HF hub (see the first code cell), and we will use only the train and test splits.
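
As an illustration of the `max_vocab_size` takeaway, the sketch below builds a frequency-capped vocabulary with the standard library. It is one possible approach, not a prescribed one: the helper name `build_vocab`, the whitespace tokenization, and the reserved `<pad>` and `<unk>` entries are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def build_vocab(texts: Iterable[str], max_vocab_size: int = 20000) -> Dict[str, int]:
    """Map the most frequent lowercased tokens to integer ids, capped at max_vocab_size."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Reserve id 0 for padding and id 1 for out-of-vocabulary tokens (assumed convention)
    vocab: Dict[str, int] = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(max_vocab_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab


# Toy input: with a cap of 3, only the single most frequent word is kept
print(build_vocab(["a a b", "a c"], max_vocab_size=3))
# {'<pad>': 0, '<unk>': 1, 'a': 2}
```

Tokens outside the vocabulary would then be mapped to the `<unk>` id at encoding time, for example with `vocab.get(token, vocab["<unk>"])`.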