# **Homework 2 - Text Mining**

## Group Members:
1. **Iñigo Exposito**  
   <inigo.exposito@bse.eu>

2. **Deepak Malik**  
   <deepak.malik@bse.eu>

3. **Enzo Infantes**  
   <enzo.infantes@bse.eu>

<img src='https://upload.wikimedia.org/wikipedia/commons/4/41/BSE_primary_logo_color.jpg' width=300 />

# **0. Libraries**

In [1]:
import os
import time

import pandas as pd
import numpy as np
import re
from tabulate import tabulate
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline

from packages.preprocessing import Tokenizer, Normalizer, RemoveStopwords, Lemmatizer, Stemmer, JoinTokens
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# **1. Getting Main Text Data**

**Global News Dataset**: This dataset consists of news articles collected over the past few months using the NewsAPI (https://newsapi.org/). The main motivation for creating this dataset was to develop and experiment with various natural language processing (NLP) models. The goal of the dataset is to support the creation of text summarization models, sentiment analysis models, and other NLP applications.

**Source**: *Kumar Saksham. (2023). Global News Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7105651*

In [2]:
df = pd.read_csv('data.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105375 entries, 0 to 105374
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   article_id    105375 non-null  int64 
 1   source_id     24495 non-null   object
 2   source_name   105375 non-null  object
 3   author        97156 non-null   object
 4   title         105335 non-null  object
 5   description   104992 non-null  object
 6   url           105375 non-null  object
 7   url_to_image  99751 non-null   object
 8   published_at  105375 non-null  object
 9   content       105375 non-null  object
 10  category      105333 non-null  object
 11  full_content  58432 non-null   object
dtypes: int64(1), object(11)
memory usage: 9.6+ MB


In [5]:
print(tabulate(df.head(3), headers='keys', tablefmt='psql'))
print(df.shape)

**Drop Columns**: Some columns are not needed for the analysis, so we drop them.
- `source_id`: Redundant, because we already have `source_name`.
- `url` and `url_to_image`: Not needed for our analysis.
- `content`: We use `full_content` instead, which provides a more complete version of the article text.

In [6]:
df.drop(columns=['source_id', 'url', 'url_to_image', 'content'], inplace=True)

**Drop Duplicates**: To avoid counting the same entry more than once, we drop rows that are exact duplicates of an earlier row.

In [7]:
df.drop_duplicates(inplace=True)
print(df.shape)

(101832, 8)
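
As a small illustration of how `drop_duplicates` behaves: with no `subset` argument, a row is removed only when every column matches an earlier row. The toy frame below is made up for this sketch and is not part of the news dataset.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the news dataset).
toy = pd.DataFrame({
    "article_id": [1, 1, 2, 3],
    "title": ["A", "A", "A", "B"],
})

# Without `subset`, only rows identical in every column are removed.
full_row = toy.drop_duplicates()

# With `subset`, rows that share the listed columns count as duplicates,
# and the first occurrence is kept.
by_title = toy.drop_duplicates(subset=["title"])

print(full_row.shape)  # (3, 2)
print(by_title.shape)  # (2, 2)
```

If the same article can appear under different `article_id` values, a `subset` on the text columns may be worth considering; whether that applies to this dataset is not checked here.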


**Missing Values**: Our analysis relies on the `full_content` column, which holds the complete body of each article, so we first check the share of missing values in every column.

In [8]:
df.isnull().mean().sort_values(ascending=False).to_frame().T

|   | full_content | author | description | category | title | source_name | article_id | published_at |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.460985 | 0.080711 | 0.003732 | 0.000412 | 0.000393 | 0.0 | 0.0 | 0.0 |
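
The share of missing values per column is the mean of the boolean null mask, since `True` counts as 1 and `False` as 0. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import pandas as pd

# Toy data (hypothetical): 2 of 4 bodies and 1 of 4 authors are missing.
toy = pd.DataFrame({
    "full_content": ["text", None, None, "more text"],
    "author": ["a", "b", None, "d"],
    "title": ["t1", "t2", "t3", "t4"],
})

# isnull() gives a boolean mask; its column-wise mean is the missing share.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```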


**Missing Value Recategorization**: We replace missing values in `author`, `category`, `title`, and `description` with consistent placeholder text, so that null values in these columns do not affect later steps. Missing `full_content` values are left untouched at this stage.

In [11]:
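# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which recent pandas versions warn about and may not apply to `df`.
df['author'] = df['author'].fillna('Unknown')
df['category'] = df['category'].fillna('Uncategorized')
df['title'] = df['title'].fillna('No Title')
df['description'] = df['description'].fillna('No Description')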

print("\n=== Missing Values After Cleaning ===")
print(df.isna().sum())


=== Missing Values After Cleaning ===
article_id          0
source_name         0
author              0
title               0
description         0
published_at        0
category            0
full_content    46943
dtype: int64


In [16]:
sample_df = df[df['full_content'].notnull()].sample(5, random_state=42)
print(tabulate(sample_df, headers='keys', tablefmt='psql'))

In [None]:
df.dropna(subset=['full_content'], inplace=True)

In [None]:
pipeline = Pipeline([
    ('tokenizer', Tokenizer()),
    ('normalizer', Normalizer()),
    ('remove_stopwords', RemoveStopwords()),
    ('lemmatizer', Lemmatizer()),
    # ('stemmer', Stemmer()),  # Not used: stemming is less accurate than lemmatization.
    ('joiner', JoinTokens())
])
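
`packages.preprocessing` is a local module whose source is not shown in this notebook. As a rough sketch of how steps like these can plug into a scikit-learn `Pipeline`, the classes below (`SimpleTokenizer`, `SimpleNormalizer`, `SimpleJoinTokens`) are hypothetical stand-ins written for illustration; they are not the module's actual implementation and skip stopword removal and lemmatization entirely.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SimpleTokenizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: split each document into word tokens."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return [re.findall(r"\w+", doc) for doc in X]


class SimpleNormalizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: lowercase every token."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[tok.lower() for tok in doc] for doc in X]


class SimpleJoinTokens(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: join tokens back into one string per document."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [" ".join(doc) for doc in X]


sketch_pipeline = Pipeline([
    ("tokenizer", SimpleTokenizer()),
    ("normalizer", SimpleNormalizer()),
    ("joiner", SimpleJoinTokens()),
])

processed = sketch_pipeline.fit_transform(
    ["Breaking News: Markets RALLY!", "Rain expected today."]
)
print(processed)  # ['breaking news markets rally', 'rain expected today']
```

Each step receives the previous step's output, so the tokenizer must come first and the joiner last; the joined strings are what a vectorizer such as `CountVectorizer` would consume.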

In [None]:
df_sample = df.sample(10, random_state=42)

df_sample['full_content_processed'] = pipeline.fit_transform(df_sample['full_content'])

In [None]:
print("\n=== Article Counts by Source Name ===")
source_counts = df['source_name'].value_counts()
print(source_counts)

plt.figure(figsize=(15, 6)) 
sns.barplot(x=source_counts.index, y=source_counts.values, palette='viridis')

plt.title("Article Counts by Source Name")
plt.xlabel("Source Name")
plt.ylabel("Number of Articles")
plt.xticks(rotation=90) 
plt.tight_layout()
plt.show()

In [None]:
df['published_at'] = pd.to_datetime(df['published_at'], errors='coerce')

# Create a summary of articles over time by day.
df['date'] = df['published_at'].dt.date
date_summary = df['date'].value_counts().sort_index()

print("\n=== Article Counts Over Time (by Day) ===")
print(date_summary)

plt.figure(figsize=(12, 6))
plt.plot(date_summary.index, date_summary.values, marker='o', linestyle='-', color='teal')
plt.title("Number of Articles Over Time (Daily)")
plt.xlabel("Date")
plt.ylabel("Number of Articles")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
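
A short sketch of what `errors='coerce'` does in the cell above: values that cannot be parsed become `NaT` instead of raising an error, so it can be worth counting them before aggregating by day. The timestamp strings below are made up, and the ISO 8601 format is an assumption about how `published_at` is stored.

```python
import pandas as pd

# Toy timestamps (hypothetical); the second entry cannot be parsed.
raw = pd.Series([
    "2023-10-01T12:00:00Z",
    "not a date",
    "2023-10-01T18:30:00Z",
])

parsed = pd.to_datetime(raw, errors="coerce")

# Unparseable values are now NaT; count them before they are silently lost.
n_failed = parsed.isna().sum()

# Daily counts over the rows that parsed successfully.
daily = parsed.dropna().dt.date.value_counts().sort_index()

print(n_failed)
print(daily)
```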

In [None]:
# Count articles per category once, then take the mean of those counts
category_counts = df['category'].value_counts()
mean_count = category_counts.mean()

# Filter categories with more articles than the mean
categories_above_mean = category_counts[category_counts > mean_count]

# Print categories with more articles than the mean
print("\nCategories with more articles than the mean:")
print(categories_above_mean)

# Plot the categories with more articles than the mean
plt.figure(figsize=(13, 6))
sns.barplot(x=categories_above_mean.index, y=categories_above_mean.values, palette='viridis')
plt.title("Categories with More Articles Than the Mean")
plt.xlabel("Category")
plt.ylabel("Number of Articles")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# **2. Develop Methodology**

# **3. Implementation**