## Idea
In our network each node is a news article. We embed each article's title and short description, compute the cosine similarity between every pair of articles, and link the articles that are semantically similar.\
For each article, we have also looked up the press-freedom rank and corruption-perception score of the country it was published in.\
This is interesting because:
- It shows who gets talked about in English-language news.
- It lets us connect what stories are about with how free or corrupt the reporting environment is.

## Dataset
At the moment we have downloaded 5401 articles. Each article has the attributes: Link, Title, Short description, Country, Category, Publication date, Source ID and Source name.\
All the articles were downloaded through the News Data API and are restricted to recent English-language articles.

Using the country attribute, we have added two country-level indices to each article: the Press Freedom Index (current rank plus several sub-scores) and the Corruption Perceptions Index (latest score plus trend statistics).

## Network
After filtering to the largest connected component we have 5211 nodes. We embed the title and description of each article and add an undirected edge when the cosine similarity of two embeddings is above 0.3. This gives us 71167 edges. 96.48\% of all articles (5211 of 5401) are in the largest connected component, so the network is well connected.
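
The construction described above can be sketched as follows. This is a minimal, standard-library illustration on made-up 2-d vectors, not our actual pipeline: the function names (`cosine`, `similarity_graph`, `largest_component`) and the toy data are assumptions, and for thousands of real embeddings a vectorized library would be the more practical choice.

```python
import math
from collections import deque
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_graph(embeddings, threshold=0.3):
    """Undirected adjacency sets: i and j are linked when cosine > threshold."""
    adj = {i: set() for i in range(len(embeddings))}
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def largest_component(adj):
    """Node set of the largest connected component, found by breadth-first search."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy 2-d "embeddings" (made up); real text embeddings have far more dimensions.
toy = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.3), (0.0, 1.0), (-1.0, 0.0)]
adj = similarity_graph(toy)
core = largest_component(adj)
```

In the toy example the last vector points away from all the others, so it ends up isolated and is dropped when we keep only the largest component.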

$$\text{Average degree} = \frac{2E}{N}\approx 27$$
$$E_{max} = \frac{N(N-1)}{2} = 13{,}574{,}655$$
$$\text{Density}=\frac{E}{E_{max}} = 0.00524 \approx 0.5\%$$
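
As a quick check of the arithmetic, the three quantities can be recomputed from the node and edge counts reported above. The variable names are only illustrative.

```python
N = 5211   # nodes in the largest connected component
E = 71167  # undirected edges

average_degree = 2 * E / N      # every edge adds one to the degree of two nodes
max_edges = N * (N - 1) // 2    # number of edges in a complete graph on N nodes
density = E / max_edges

print(f"average degree = {average_degree:.2f}")
print(f"E_max = {max_edges:,}")
print(f"density = {density:.5f}")
```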

![](degree_distribution.png)

The degree distribution plot shows a heavy-tailed distribution: most articles have low to moderate degree, while a few hub articles connect to more than 150 other articles.

Network measures that could be added as node attributes later include degree, clustering coefficient, centralities and community labels.

![](degree_count.png)
When we plot node degree against count, the log-log plot roughly follows a falling straight line, which is typical of a heavy-tailed, approximately power-law degree distribution. This suggests that the network is close to scale-free and hub-dominated.
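
One way to put a number on the "falling line" is to count how many nodes have each degree and fit a straight line to the log-log points. The sketch below does this with ordinary least squares on a made-up degree sequence; the function names are assumptions, and a line fit of this kind is only a rough diagnostic, not a test of whether the distribution really is a power law.

```python
import math
from collections import Counter


def degree_counts(degrees):
    """Map each degree k to the number of nodes with that degree."""
    return Counter(degrees)


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(k), ignoring k = 0."""
    points = [(math.log(k), math.log(c)) for k, c in counts.items() if k > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Made-up degree sequence whose counts fall off exactly as k^-2.
toy_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope = loglog_slope(degree_counts(toy_degrees))
```

For the toy sequence the fitted slope is -2 up to rounding; on real data the tail is noisy, so the estimate depends on how the high-degree bins are handled.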

If we look at 1000 randomly chosen nodes, the friendship paradox holds for 803 of them: the neighbors of a typical article have, on average, a higher degree than the article itself. In other words, a typical article is linked to articles that are more central and resemble many other articles (e.g. broader and more general stories).

The network is sparse but well connected, with a big core of hub articles and a long tail of more peripheral articles. There is therefore good reason to look at communities, central articles, and differences between hub countries and more peripheral countries.
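
The check behind this number can be sketched as follows: for each sampled node, compare its own degree with the mean degree of its neighbors. The function names, the fixed seed and the toy star graph are assumptions for illustration only.

```python
import random


def paradox_holds(adj, node):
    """True when the node's neighbors have a higher mean degree than the node."""
    neighbors = adj[node]
    if not neighbors:
        return False  # an isolated node has no neighbors to compare with
    mean_neighbor_degree = sum(len(adj[nb]) for nb in neighbors) / len(neighbors)
    return mean_neighbor_degree > len(neighbors)


def count_paradox(adj, sample_size, seed=0):
    """Number of randomly sampled nodes for which the paradox holds."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(adj), min(sample_size, len(adj)))
    return sum(paradox_holds(adj, node) for node in sample)


# Toy star graph: hub 0 is linked to four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
holds = count_paradox(star, sample_size=5)
```

In the star graph the paradox holds for the four leaves but not for the hub, which mirrors why hub articles pull up the average degree of everyone's neighbors.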

## Text for analysis
We have the title and short description of each article, which we have already embedded into a vector space to compute similarities. We can reuse these embeddings later for topic and cluster analysis.

We could also look at keyword or n-gram frequencies, and topic distributions between groups of countries.

## Patterns
We already see that our data is biased towards a few countries.

![](num_article_bar.png)
![](cartogram_num.png)
![](cartogram_percapita.png)
The cartogram of the number of articles shows a clear domination by the US, with India and the UK also scaling up. When we normalize by population, New Zealand and some South American and African countries pop up.\
It is not really a surprise that the English-language news in our sample is heavily skewed towards a handful of mostly English-speaking countries.
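
The per-capita normalization is just article counts divided by population. A minimal sketch, assuming two dictionaries keyed by country; the function name is hypothetical, and the country labels and numbers below are placeholders, not values from our dataset.

```python
def articles_per_million(article_counts, populations):
    """Articles per million inhabitants, for countries present in both mappings."""
    return {
        country: count * 1_000_000 / populations[country]
        for country, count in article_counts.items()
        if populations.get(country, 0) > 0
    }


# Placeholder countries and numbers, chosen only to show the effect.
counts = {"A": 2000, "B": 50, "C": 10}
populations = {"A": 400_000_000, "B": 5_000_000}
rates = articles_per_million(counts, populations)
```

Country "A" has far more articles in absolute terms, but the smaller country "B" comes out higher per capita, which is the same effect we see when small countries pop up in the normalized cartogram. Countries without a population figure, like "C" here, are skipped rather than guessed.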

## Work to do
- Use the network structure to find communities of semantically similar articles.
- For each country: compute average degree / centrality, see how close its articles are to the dense "core", and see which communities its articles tend to fall into.
- Use the text to interpret the communities and look at which topics dominate each community. Are there correlations between the primary news topics of a country and its press freedom rank / corruption score?
- We could look at whether countries with less press freedom or higher corruption tend to have fewer articles, maybe sit isolated or in the periphery of the network, or are associated with specific topics.
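
As a starting point for the per-country comparison, the average degree per country can be aggregated directly from the adjacency structure. This is only a sketch under assumed data structures (`adj` maps a node to its set of neighbors, `country_of` maps a node to a country label); the toy graph and labels are made up.

```python
from collections import defaultdict


def mean_degree_by_country(adj, country_of):
    """Average node degree per country.

    adj:        node -> set of neighboring nodes
    country_of: node -> country label
    """
    totals = defaultdict(lambda: [0, 0])  # country -> [degree sum, node count]
    for node, neighbors in adj.items():
        entry = totals[country_of[node]]
        entry[0] += len(neighbors)
        entry[1] += 1
    return {country: s / n for country, (s, n) in totals.items()}


# Toy graph with two placeholder countries.
toy_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
toy_country = {0: "A", 1: "A", 2: "B", 3: "B"}
means = mean_degree_by_country(toy_adj, toy_country)
```

The resulting per-country values could then be set against the press freedom rank or corruption score, keeping in mind that countries with very few articles will give noisy averages.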