In [1]:
import pandas as pd

In [2]:
PATH = 'Data/News_Category_Dataset_v3.json'
df = pd.read_json(PATH, lines = True)

In [3]:
print(df.head(1).describe())
print(df['category'].value_counts())

                      date
count                    1
mean   2022-09-23 00:00:00
min    2022-09-23 00:00:00
25%    2022-09-23 00:00:00
50%    2022-09-23 00:00:00
75%    2022-09-23 00:00:00
max    2022-09-23 00:00:00
category
POLITICS          35602
WELLNESS          17945
ENTERTAINMENT     17362
TRAVEL             9900
STYLE & BEAUTY     9814
PARENTING          8791
HEALTHY LIVING     6694
QUEER VOICES       6347
FOOD & DRINK       6340
BUSINESS           5992
COMEDY             5400
SPORTS             5077
BLACK VOICES       4583
HOME & LIVING      4320
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3653
WOMEN              3572
CRIME              3562
IMPACT             3484
DIVORCE            3426
WORLD NEWS         3299
MEDIA              2944
WEIRD NEWS         2777
GREEN              2622
WORLDPOST          2579
RELIGION           2577
STYLE              2254
SCIENCE            2206
TECH               2104
TASTE              2096
MONEY              1756
ARTS   

In [4]:
print("First 10 links:")
for i in range(10):
    print("\t", df.iloc[i]['link'])

print("\nFirst entry:")
print(df.iloc[0])

print("\nFirst description:")
print(df.iloc[0]['short_description'])

First 10 links:
	 https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9
	 https://www.huffpost.com/entry/american-airlines-passenger-banned-flight-attendant-punch-justice-department_n_632e25d3e4b0e247890329fe
	 https://www.huffpost.com/entry/funniest-tweets-cats-dogs-september-17-23_n_632de332e4b0695c1d81dc02
	 https://www.huffpost.com/entry/funniest-parenting-tweets_l_632d7d15e4b0d12b5403e479
	 https://www.huffpost.com/entry/amy-cooper-loses-discrimination-lawsuit-franklin-templeton_n_632c6463e4b09d8701bd227e
	 https://www.huffpost.com/entry/belk-worker-found-dead-columbiana-centre-bathroom_n_632c5f8ce4b0572027b0251d
	 https://www.huffpost.com/entry/reporter-gets-adorable-surprise-from-her-boyfriend-while-working-live-on-tv_n_632ccf43e4b0572027b10d74
	 https://www.huffpost.com/entry/puerto-rico-water-hurricane-fiona_n_632bdfd8e4b0d12b54014e13
	 https://www.huffpost.com/entry/mija-documentary-immigration-isabel-castro-interview_n_632329aee4b000d98858dbda
	 

# The plan
With the initial data exploration done, it's time to lay out the plan.

We want to build a graph from a dataset with the columns `title`, `date`, `content` and `source`.

So let's build a DataFrame in that format and construct the graph from it. The idea is that any new dataset can simply be converted to this form and concatenated with the others before it enters the graph creation pipeline.

In [5]:
# Constructing standardized DataFrame
df_st = pd.DataFrame()
df_st['title'] = df['headline']
df_st['date'] = df['date']
df_st['content'] = df['short_description']
df_st = df_st.assign(source="Huffington Post") # Create new column with source as default

# This is the natural place to filter the data, before it enters the pipeline.
# COMPANIES lists the names we filter on (the filter itself is applied in a later cell).
print("Entries:", len(df_st))

COMPANIES = ["Apple", "Google", "Alphabet", "Facebook", "Meta", "Microsoft", "OpenAI"]
#COMPANIES = [company.lower() for company in COMPANIES]
df_st

Entries: 209527


Unnamed: 0,title,date,content,source
0,Over 4 Million Americans Roll Up Sleeves For O...,2022-09-23,Health experts said it is too early to predict...,Huffington Post
1,"American Airlines Flyer Charged, Banned For Li...",2022-09-23,He was subdued by passengers and crew when he ...,Huffington Post
2,23 Of The Funniest Tweets About Cats And Dogs ...,2022-09-23,"""Until you have a dog you don't understand wha...",Huffington Post
3,The Funniest Tweets From Parents This Week (Se...,2022-09-23,"""Accidentally put grown-up toothpaste on my to...",Huffington Post
4,Woman Who Called Cops On Black Bird-Watcher Lo...,2022-09-22,Amy Cooper accused investment firm Franklin Te...,Huffington Post
...,...,...,...,...
209522,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,2012-01-28,Verizon Wireless and AT&T are already promotin...,Huffington Post
209523,Maria Sharapova Stunned By Victoria Azarenka I...,2012-01-28,"Afterward, Azarenka, more effusive with the pr...",Huffington Post
209524,"Giants Over Patriots, Jets Over Colts Among M...",2012-01-28,"Leading up to Super Bowl XLVI, the most talked...",Huffington Post
209525,Aldon Smith Arrested: 49ers Linebacker Busted ...,2012-01-28,CORRECTION: An earlier version of this story i...,Huffington Post


In [6]:
import pandas as pd
import math
import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
import re
import string
class NewsData:
    def __init__(self, df: pd.DataFrame, scheme=None):
        """Handles news article dataframes."""
        self.df = df
        self.scheme = ["title", "date", "content", "source"] if not scheme else scheme
        # Index self.scheme (not the raw argument) so the default applies when scheme is None
        self.df["content_cleaned"] = self.df[self.scheme[2]].apply(self.clean_text)  # Clean text column
        self.vocabulary = self.build_vocabulary()
        self.idf = self.compute_idf()

    @staticmethod
    def from_df(df: pd.DataFrame, scheme=None):
        """
        Creates a NewsData object from a DataFrame.

        Args:
            df (pd.DataFrame): DataFrame containing news articles.
            scheme (list): List of column names in order of title, date, content, and source.

        Returns:
            NewsData: NewsData object.
        """
        scheme = ["title", "date", "content", "source"] if not scheme else scheme

        if len(scheme) != 4:
            raise ValueError("Scheme must contain four column names: for title, date, content, and source.")

        # scheme is always set at this point, so one check covers both the default and a custom scheme
        if not all(col in df.columns for col in scheme):
            raise ValueError(
                "All scheme column names must be present in the DataFrame. "
                "Without a scheme, the DataFrame must contain columns: title, date, content, and source."
            )

        return NewsData(df, scheme)
    
    def clean_text(self, text):
        """Standardizes and cleans text by lowercasing, removing punctuation, and extra whitespace."""
        text = text.lower()  # Lowercasing
        text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Remove punctuation
        text = re.sub(r"\s+", " ", text).strip()  # Remove extra whitespace
        return text
    
    def build_vocabulary(self):
        """Creates a consistent vocabulary across documents."""
        vocabulary = set()
        for article in self.df["content_cleaned"]:
            vocabulary.update(article.split())
        return vocabulary
    
    def compute_tf(self, text):
        """Calculates normalized term frequency (word count divided by document length)."""
        words = text.split()
        word_count = len(words)
        tf = Counter(words)
        return {word: count / word_count for word, count in tf.items()}
    
    def compute_idf(self):
        """Calculates inverse document frequency for each word in the vocabulary."""
        num_docs = len(self.df)
        doc_count = Counter()
        
        for article in self.df["content_cleaned"]:
            doc_count.update(set(article.split()))
        
        return {word: math.log((num_docs + 1) / (count + 1)) + 1 for word, count in doc_count.items()}
    
    def generate_vector(self, tf_dict):
        """Generates a sparse vector for a document based on TF and IDF."""
        vector = np.array([tf_dict.get(word, 0) * self.idf.get(word, 0) for word in self.vocabulary])
        return csr_matrix(vector)  # Convert to sparse matrix to save memory
    
    def compute_similarity_matrix(self, method="tfidf"):
        """Calculates pairwise cosine similarity for all document vectors.

        Note: `method` is currently unused; only TF-IDF is implemented.
        """
        # Ensure all vectors are created with the same vocabulary length
        vectors = [self.generate_vector(self.compute_tf(doc)).toarray()[0] for doc in self.df["content_cleaned"]]
        
        # Calculate cosine similarity
        return cosine_similarity(vectors)

In [7]:
# Example Usage
data = {
    "Category": ["Economy", "Politics", "Tech", "Health"],
    "Short Description": [
        "Stock markets react to the new policy changes.",
        "The new tax bill has been introduced in Congress.",
        "Innovative AI technology is shaping the future.",
        "Recent health studies show promising results."
    ],
    "Article Body": [
        "The stock market experienced significant changes after recent policy announcements affecting various sectors.",
        "The newly introduced tax bill has sparked debates in Congress and could have long-term impacts.",
        "Artificial Intelligence is evolving with applications in multiple industries including healthcare and finance.",
        "A new health study indicates that certain lifestyle changes could lead to improved well-being and longevity."
    ],
    "Date": ["2024-11-01", "2024-11-02", "2024-11-03", "2024-11-04"],
    "Link": [
        "https://news.example.com/economy1",
        "https://news.example.com/politics1",
        "https://news.example.com/tech1",
        "https://news.example.com/health1"
    ]
}
# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)
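# Create an instance of NewsData and calculate similarity matrix.
# The scheme order is title, date, content, source. This toy data has no title
# column, so "Short Description" stands in for it. Only the content column
# ("Article Body") is used for the similarity computation.
news_data = NewsData.from_df(df, scheme=["Short Description", "Date", "Article Body", "Link"])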
similarity_matrix = news_data.compute_similarity_matrix()
# Print similarity matrix
print("TF-IDF Cosine Similarity Matrix:")
print(similarity_matrix)

TF-IDF Cosine Similarity Matrix:
[[1.         0.04876266 0.         0.04641208]
 [0.04876266 1.         0.08143865 0.07379344]
 [0.         0.08143865 1.         0.03068949]
 [0.04641208 0.07379344 0.03068949 1.        ]]


In [8]:
# ~4 seconds
SCHEME = ["title", "date", "content", "source"]

print("n items before filter:", len(df_st))
# .copy() makes `filtered` an independent DataFrame, so NewsData can add its
# content_cleaned column without pandas raising a SettingWithCopyWarning.
filtered = df_st[df_st['content'].str.contains('|'.join(COMPANIES), case=False)].copy()
print("n items after filter:", len(filtered))
print(filtered.columns)

news_data = NewsData.from_df(filtered, SCHEME)

n items before filter: 209527
n items after filter: 2899
Index(['title', 'date', 'content', 'source'], dtype='object')


In [9]:
# ~8s
similarity_matrix = news_data.compute_similarity_matrix()

In [10]:
filtered.head(1)

Unnamed: 0,title,date,content,source,content_cleaned
185,Metal-Detecting Stranger Retrieves Woman’s Rin...,2022-08-21,Francesca Teal's plea for help on Facebook got...,Huffington Post,francesca teals plea for help on facebook got ...


In [11]:
print(similarity_matrix.shape)

# Check sparsity of matrix
sparsity = 1.0 - np.count_nonzero(similarity_matrix) / similarity_matrix.size
print("Sparsity:", sparsity)

# Print similarity matrix
print("TF-IDF Cosine Similarity Matrix:")
print(similarity_matrix)

(2899, 2899)
Sparsity: 0.10729562512843283
TF-IDF Cosine Similarity Matrix:
[[1.         0.09058845 0.08489038 ... 0.03169783 0.10387559 0.03119028]
 [0.09058845 1.         0.0106302  ... 0.         0.01091851 0.06622526]
 [0.08489038 0.0106302  1.         ... 0.         0.05170663 0.0205693 ]
 ...
 [0.03169783 0.         0.         ... 1.         0.01272418 0.04252987]
 [0.10387559 0.01091851 0.05170663 ... 0.01272418 1.         0.02798301]
 [0.03119028 0.06622526 0.0205693  ... 0.04252987 0.02798301 1.        ]]


# 2 - Building the Graph
In this section we:
- Load a JSON dataset
- Convert the data into the standardized structure
- Filter the data on a criterion (selected companies)
- Compute a TF-IDF similarity matrix
- Build a graph using networkx
    - Node IDs are the original row indices
    - Edges are weighted by cosine similarity
- Display the graph using netwulf
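
One caveat on the filtering step: joining the company names with `|` and matching case-insensitively is a substring match, so `Meta` also matches `metal` and `Apple` also matches `pineapple`. Below is a minimal sketch of a stricter variant, assuming whole-word matches are what we want. The names `WORD_PATTERN`, `COMPANY_RE` and `mentions_company` are hypothetical and not part of the notebook.

```python
import re

COMPANIES = ["Apple", "Google", "Alphabet", "Facebook", "Meta", "Microsoft", "OpenAI"]

# \b anchors each name at word boundaries; re.escape guards against names
# that contain regex metacharacters. The group is non-capturing.
WORD_PATTERN = r"\b(?:" + "|".join(re.escape(c) for c in COMPANIES) + r")\b"
COMPANY_RE = re.compile(WORD_PATTERN, flags=re.IGNORECASE)


def mentions_company(text: str) -> bool:
    """Return True if the text contains a company name as a whole word."""
    return COMPANY_RE.search(text) is not None


samples = [
    "A metal-detecting stranger found the ring.",
    "Meta announced a new headset.",
    "Pineapple prices are rising.",
    "Her plea for help on Facebook went viral.",
]
flags = [mentions_company(s) for s in samples]
print(flags)  # [False, True, False, True]
```

The string `WORD_PATTERN` could be passed to `str.contains(WORD_PATTERN, case=False)` in place of `'|'.join(COMPANIES)`. The number of articles surviving the filter would then likely differ from the count reported in this notebook.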

In [12]:
import networkx as nx

PATH = 'Data/News_Category_Dataset_v3.json'
SOURCE = "Huffington Post"
COLUMN_CONVERSION = {
    "headline": "title",
    "short_description": "content",
    "date": "date"
}

COMPANIES = ["Apple", "Google", "Alphabet", "Facebook", "Meta", "Microsoft", "OpenAI"]
SCHEME = ["title", "date", "content", "source"]

df_file = pd.read_json(PATH, lines = True)
df_data = df_file.rename(columns=COLUMN_CONVERSION).assign(source=SOURCE)[SCHEME]
# .copy() so NewsData can add its content_cleaned column to an independent DataFrame
df_data = df_data[df_data['content'].str.contains('|'.join(COMPANIES), case=False)].copy()

news_data = NewsData.from_df(df_data)
similarity_matrix = news_data.compute_similarity_matrix()

In [13]:
print("Creating graph")
G = nx.Graph()

# Create nodes based on the DataFrame index.
index_list = list(df_data.index)
G.add_nodes_from(index_list)

# Create edges based on the similarity matrix.
threshold_l = 0.3  # below this, no edge is created
threshold_h = 0.8  # above this, the pair is treated as near-duplicates
edge_list = []
n = 0
for i in range(similarity_matrix.shape[0]):
    node_i = index_list[i]
    if node_i not in G.nodes:
        continue
    # The matrix is symmetric, so only the lower triangle (j < i) is visited.
    for j in range(i):
        node_j = index_list[j]
        if node_j not in G.nodes:
            continue
        sim = float(similarity_matrix[i, j])
        if sim < threshold_l:
            continue
        if sim > threshold_h:
            # Near-duplicate: hand node_j's edges to node_i, then drop node_j.
            # NOTE: G has no edges until add_weighted_edges_from below, so this
            # loop currently finds no neighbors, and any edges already queued
            # for node_j in edge_list will re-add it to the graph.
            for neighbor in list(G.neighbors(node_j)):
                if neighbor != node_i:
                    G.add_edge(node_i, neighbor, weight=G[node_j][neighbor]["weight"])

            G.remove_node(node_j)
            continue

        edge_list.append((node_i, node_j, sim))
        n += 1
print(f"Created {n} edges @ {threshold_h} > weight > {threshold_l}")

G.add_weighted_edges_from(edge_list)

print("Done")

Creating graph
Created 59379 edges @ 0.8 > weight > 0.3
Done


### Similarity of two nodes in a strongly bound cluster

In [14]:
print(len(G.nodes))

id1 = 153246
id2 = 141857

print(G.has_edge(id1, id2))
if G.has_edge(id1, id2):
    print(G.edges[id1, id2])

print('\n'.join(list(df_data['content_cleaned'][[id1, id2]])))

print(df_file.loc[id1]['link'])

2869
False
want more be sure to check out huffpost style on twitter facebook tumblr pinterest and instagram at huffpoststyle photos
photos want more be sure to check out huffpost style on twitter facebook tumblr pinterest and instagram at huffpoststyle
https://www.huffingtonpost.com/entry/prada-fashion-week_us_5b9d87bae4b03a1dcc892a46
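
The two descriptions above contain exactly the same words in a different order, so their bag-of-words TF-IDF vectors are identical and their cosine similarity is 1. That is above `threshold_h`, which is why no edge is created between them. Below is a minimal sketch of one way to collapse such near-duplicates into a single representative before any edges are added, using a union-find structure. The function name `merge_near_duplicates`, the toy matrix, and the ids 150000 and 160000 are illustrative assumptions, not part of the notebook.

```python
def merge_near_duplicates(sim, ids, threshold_h=0.8):
    """Map every id to a representative id of its near-duplicate group."""
    parent = list(range(len(ids)))

    def find(x):
        # Follow parent pointers to the group root, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(ids)):
        for j in range(i):
            if sim[i][j] > threshold_h:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Keep the smaller position as root so the result is deterministic.
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    return {ids[k]: ids[find(k)] for k in range(len(ids))}


# Toy 4x4 similarity matrix: rows 0 and 2 are near-duplicates.
toy_sim = [
    [1.0, 0.4, 0.95, 0.1],
    [0.4, 1.0, 0.35, 0.0],
    [0.95, 0.35, 1.0, 0.2],
    [0.1, 0.0, 0.2, 1.0],
]
toy_ids = [141857, 150000, 153246, 160000]
representative = merge_near_duplicates(toy_sim, toy_ids)
print(representative)
```

Edges would then be added only between representatives, which avoids removing nodes from the graph while the edge loop is still running.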


In [15]:
from netwulf.interactive import visualize

In [16]:
id=135204

In [17]:
#visualize(G) # Uncomment to visualize graph
print()




In [18]:
#G_test = nx.dodecahedral_graph()
#nx.draw(G_test)
#G_test.edge

# 3 - Improving the graph
### Graph is kinda wack. What's going on?
Looking at the netwulf visual, we see large chunks of interconnected nodes, but also a lot of connections between the chunks, and then a sea of lone nodes and small bunches. We want something where communities are visually evident, and this really isn't it.

How do we target these inter-chunk edges and eliminate them? Do we limit the number of connections a node can have?

- Plot a histogram of node degree (the graph is undirected, so in- and out-degree coincide)
- Try Louvain to find clusters
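
Both ideas can be tried on a small stand-in graph first. The sketch below assumes a networkx version that ships `louvain_communities`; the toy graph `G_toy` and its weights are made up for illustration and do not come from the dataset.

```python
from collections import Counter

import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy stand-in for G: two dense groups joined by one weak edge, plus a lone node.
G_toy = nx.Graph()
G_toy.add_weighted_edges_from([
    (0, 1, 0.7), (0, 2, 0.6), (1, 2, 0.7),  # first group
    (3, 4, 0.6), (3, 5, 0.7), (4, 5, 0.6),  # second group
    (2, 3, 0.3),                            # weak bridge between the groups
])
G_toy.add_node(6)  # isolated article

# Degree histogram: maps degree -> number of nodes with that degree.
degree_hist = Counter(degree for _, degree in G_toy.degree())
print(sorted(degree_hist.items()))  # [(0, 1), (2, 4), (3, 2)]

# Louvain community detection, using similarity as the edge weight.
communities = louvain_communities(G_toy, weight="weight", seed=0)
print(sorted(sorted(c) for c in communities))
```

On the real graph, a long tail in the degree histogram would point at the hub nodes responsible for the links between chunks, and the Louvain partition gives the cluster labels needed for the impact analysis.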

### The project so far
- Make clusters (almost there)
- Get stock data
- Compute impact metric

### What is impact
- Impact is a metric of how relevant an article is.
- We calculate impact by comparing the slope of a stock's price before the article with the slope after it. We combine this slope change with sentiment: relevance = slope_change * sentiment. We can use this relevance (or impact) metric to sort our clusters by importance, assigning each cluster the average relevance of its articles.
- Ordering the clusters by relevance then gives us an interesting starting point for analysis. Which subjects are influencing the stock market? Which companies are related to which subjects, still with respect to importance/relevance? Is AI more important to Microsoft or Google? Are layoffs good news for Amazon but bad news for Apple? Do layoffs themselves define a cluster, and does that cluster have a higher variance of relevance (i.e. larger polarization: layoffs can be good or bad depending on context, but are rarely meaningless)?
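
The relevance formula above can be sketched as follows. This is only one possible reading: it assumes daily closing prices in a list, a fixed window of days on each side of the article, a least-squares slope, and a sentiment score in [-1, 1]. The function names, the window size, and the prices are all hypothetical.

```python
def slope(values):
    """Least-squares slope of values against their positions 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def relevance(prices, article_pos, sentiment, window=3):
    """relevance = slope_change * sentiment, as defined in the text."""
    if article_pos < window or article_pos + window > len(prices):
        raise ValueError("not enough price data around the article")
    before = prices[article_pos - window:article_pos]
    after = prices[article_pos:article_pos + window]
    slope_change = slope(after) - slope(before)
    return slope_change * sentiment


# Hypothetical closing prices: flat for three days, then rising by 2 per day.
prices = [100.0, 100.0, 100.0, 100.0, 102.0, 104.0]
score = relevance(prices, article_pos=3, sentiment=0.5)
print(score)  # 1.0
```

With this sign convention, relevance is positive whenever the price moves in the direction of the sentiment (good news followed by a steeper rise, or bad news followed by a steeper fall) and negative when it moves against it.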

### Wordcloud analysis
For a graph with clusters, we could sum the impact per cluster (or take the average) and rank the clusters by it. We could then say something like: the cluster with the highest impact is the one talking about "blablabla".

A cluster might be built around something like "Covid", "Artificial Intelligence", "Election" or "Quarterly Earnings", because these will have high impact.

We could plot a histogram of the average impact in each cluster.
We could also just plot a histogram of the impact over all articles.
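
A minimal sketch of the cluster ranking, assuming each article already has a cluster label and a relevance score. The dictionaries `cluster_of` and `relevance_of` and all values in them are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical inputs: article id -> cluster label, article id -> relevance.
cluster_of = {1: "ai", 2: "ai", 3: "layoffs", 4: "layoffs", 5: "earnings"}
relevance_of = {1: 0.8, 2: 0.4, 3: -0.5, 4: 0.9, 5: 0.1}

# Collect the relevance scores of the articles in each cluster.
by_cluster = defaultdict(list)
for article_id, cluster in cluster_of.items():
    by_cluster[cluster].append(relevance_of[article_id])

# Rank clusters by average relevance, highest first.
ranking = sorted(
    ((cluster, mean(values)) for cluster, values in by_cluster.items()),
    key=lambda item: item[1],
    reverse=True,
)
for cluster, avg in ranking:
    print(f"{cluster}: {avg:.2f}")
```

The averages in `ranking` are the values that would go into the per-cluster histogram. Swapping `mean` for `sum` gives the total-impact variant, at the cost of favoring large clusters.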