## Part 2: Stats of the Country Music Network 

This second part requires the network of country musicians built as described in the exercises for Week 4. It covers the following two exercises from **Part 2** of that week:
- A. Simple network statistics and analysis
- B. Build a simple visualization of the network

### Exercise 2A. Simple network statistics and analysis

We begin with the necessary imports and then construct the API URL for the Wikipedia list page.

In [21]:
#Remove all variable declarations from Part1
for name in dir():
    if not name.startswith('_'):
        del globals()[name]

import urllib3, re, json, requests, random
import matplotlib.pyplot as plt
import networkx as nx
from urllib.parse import quote
import matplotlib as mpl
import numpy as np

baseurl = "https://en.wikipedia.org/w/api.php?"
action = "action=query"
title = "titles=List_of_country_music_performers"
content = "prop=revisions&rvprop=content"
dataform = "format=json"
query = "%s%s&%s&%s&%s" % (baseurl,action,title,content,dataform)
print(query)

https://en.wikipedia.org/w/api.php?action=query&titles=List_of_country_music_performers&prop=revisions&rvprop=content&format=json
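The same URL can also be assembled with `urllib.parse.urlencode`, which joins the parameters with `&` and percent-encodes values where needed. A minimal sketch, assuming the same five parameters as in the cell above:

```python
from urllib.parse import urlencode

# Same endpoint and parameters as in the cell above.
baseurl = "https://en.wikipedia.org/w/api.php?"
params = {
    "action": "query",
    "titles": "List_of_country_music_performers",
    "prop": "revisions",
    "rvprop": "content",
    "format": "json",
}

# urlencode joins key=value pairs with "&" and escapes unsafe characters.
query = baseurl + urlencode(params)
print(query)
```

Keeping the parameters in a dictionary makes it harder to forget a separator when a parameter is added or removed.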


Next we call the API to fetch the page. We parse the JSON response and then serialise its `pages` entry back into a single string, so that the performer links can be extracted with a regular expression.

In [23]:
http = urllib3.PoolManager()
wikiresponse = http.request('GET', query)
wikisource = wikiresponse.data.decode('utf-8')

wikijson = json.loads(wikisource)
text = wikijson["query"]["pages"]
wiki_text = json.dumps(text)
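Note that `json.dumps` escapes quotes, newlines and non-ASCII characters, so the resulting string contains backslash sequences that are not part of the page text. As an alternative sketch, the wikitext could be read directly from the parsed response. The `sample` dictionary below is a hand-made stand-in, assuming the response layout in which the revision text sits under the `"*"` key; it is not real API output.

```python
import json

# Hand-made stand-in for a parsed API response (assumed layout, not real data).
sample = {
    "query": {
        "pages": {
            "12345": {
                "title": "Example",
                "revisions": [{"*": 'He sang "Café Blues" with [[Johnny Cash]]'}],
            }
        }
    }
}

pages = sample["query"]["pages"]

# Route 1 (as in the cell above): dump the whole entry back to a string.
dumped = json.dumps(pages)

# Route 2: take the wikitext of the first (and only) page directly.
page = next(iter(pages.values()))
wikitext = page["revisions"][0]["*"]

print(dumped)    # contains \" and \u00e9 escape sequences
print(wikitext)  # plain text, no escapes
```

Both routes keep the `[[...]]` links intact; the second simply avoids the escape sequences.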

Now we use a regular expression on this string to extract only the information we need, which in this case is the performer links. Every wiki link is enclosed in double square brackets (for example [[The Abrams Brothers]]), so all links can be extracted with `re.findall` and a non-greedy pattern.

In [24]:
results = re.findall(r"\[\[(.*?)\]\]", wiki_text)
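To illustrate what the pattern returns, here is a small sketch on a made-up snippet of wikitext (the `sample` string is an assumption for demonstration, not taken from the actual page):

```python
import re

sample = "* [[The Abrams Brothers]]\n* [[Alabama (band)|Alabama]]\n[[File:Guitar.jpg|thumb]]"

# .*? is non-greedy, so each match stops at the first closing ]].
links = re.findall(r"\[\[(.*?)\]\]", sample)
print(links)
# ['The Abrams Brothers', 'Alabama (band)|Alabama', 'File:Guitar.jpg|thumb']

# A greedy .* would instead run to the last ]] on the same line.
greedy = re.findall(r"\[\[(.*)\]\]", "[[A]] and [[B]]")
print(greedy)  # ['A]] and [[B']
```

The output also shows why the file links and the piped names have to be dealt with in the cleaning step.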

Now we have to clean the performer names. First, we replace all white spaces with underscores; otherwise, when we turn the names into URLs, we might not find all of the pages. Second, the performer list page contains many image and other file links, which we simply drop.

In [25]:
cleaned_data = [name.replace(" ", "_") for name in results if not (name.startswith("File:") or name.startswith("Image:"))]
cleaned_data = cleaned_data[1:]

Many performer names contain special characters, and there are too many variants to handle each one individually. Judging from the names, the most common artifact is the backslash (`\`), most likely left over from the `json.dumps` step, so we remove it. Some names also contain a vertical bar (`|`) followed by the display name; in our tests this had no influence on the construction of the performer URLs, so we leave it in. Last but not least, we noticed that for some performers the actual page title differs from the name shown in the list, but checking every performer's link by hand is not feasible.

In [26]:
list_of_singers = [name.replace("\\", "") for name in cleaned_data]
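If the display text is not wanted in the names (as in `Alabama_(band)|Alabama`), one option is to keep only the part before the bar, since a piped link has the form `[[target|display text]]`. The helper `normalize_link` below is hypothetical (it is not part of the notebook) and only sketches that idea:

```python
def normalize_link(raw: str) -> str:
    """Return a page-title-like key for the body of a [[...]] link."""
    target = raw.split("|", 1)[0]        # keep the link target, drop the display text
    target = target.replace("\\", "")    # same backslash removal as in the cell above
    return target.strip().replace(" ", "_")

print(normalize_link("Alabama (band)|Alabama"))  # Alabama_(band)
print(normalize_link("The Abrams Brothers"))     # The_Abrams_Brothers
```

Applying the same helper to both the performer names and the links found on each page would keep the two in the same form for matching.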

Now we can start fetching the information for every performer in the list. We loop over the performers, build the API URL for each one, fetch the page and convert the result to a string. The result of this loop is a dictionary that maps each performer name to the page information, in JSON format converted to text.

In [29]:
performer_text = {}

# These parts of the URL are the same for every performer
baseurl = "https://en.wikipedia.org/w/api.php?"
action = "action=query"
content = "prop=revisions&rvprop=content"
dataform = "format=json"

for singer in list_of_singers:
    title = f"titles={quote(singer)}"
    query = f"{baseurl}{action}&{title}&{content}&{dataform}"
    response = requests.get(query, timeout=10)

    if response.status_code == 200:
        # Reuse the response we already have instead of fetching the page a second time
        wikijson = response.json()
        text = wikijson["query"]["pages"]
        performer_text[singer] = json.dumps(text)
    else:
        print(f"Error fetching data for {singer}: {response.status_code}")
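The loop issues one request per performer. The MediaWiki API also accepts several titles in a single `titles` parameter, separated by `|`; as far as we know, ordinary clients may send up to 50 titles per request. The generator `batched_titles` below is a hypothetical helper that only sketches how the names could be grouped; the placeholder names are made up.

```python
from urllib.parse import quote

def batched_titles(titles, batch_size=50):
    """Yield URL-encoded 'titles=' parameters covering batch_size titles each."""
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        yield "titles=" + quote("|".join(batch))

names = [f"Performer_{i}" for i in range(120)]  # placeholder names
params = list(batched_titles(names))
print(len(params))  # 3 requests instead of 120
```

With this approach each response contains several pages, so the parsing step would have to loop over the entries of `pages` rather than store one string per request.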



Now we can use regex again and extract all the links in each performer wiki page using findall.

In [30]:
performer_contents = {}
for performer, content in performer_text.items():
    performer_contents[performer] = re.findall(r"\[\[(.*?)\]\]", content)

We apply the same transformation to the links so that they have the same form as the performer names. A link that is another performer's name can then be found and matched easily.

In [31]:
for performer, links in performer_contents.items():
    performer_contents[performer] = [link.replace(" ", "_").replace("\\", "") for link in links]

Next we match the links against the performer names and store every matched (performer, link) pair in a list.

In [32]:
matches = []
for performer, links in performer_contents.items():
    for link in links:
        if link in performer_contents:
            matches.append((performer, link))

And remove the duplicates.

In [33]:
matches = list(set(matches))
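The matching and de-duplication steps can be illustrated on toy data. The dictionary below has the same shape as `performer_contents`, but its contents are made up for the example:

```python
# Toy data in the same shape as performer_contents: page -> list of links.
toy_contents = {
    "Johnny_Cash": ["June_Carter_Cash", "Willie_Nelson", "Willie_Nelson", "Sun_Records"],
    "Willie_Nelson": ["Johnny_Cash", "Texas"],
    "June_Carter_Cash": ["Johnny_Cash"],
}

# A set comprehension keeps only links that are performers and drops duplicates.
toy_matches = {
    (performer, link)
    for performer, links in toy_contents.items()
    for link in links
    if link in toy_contents
}
print(sorted(toy_matches))
```

The repeated `Willie_Nelson` link yields a single edge, and links such as `Sun_Records` that are not performers are ignored.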

Now, we are ready to start building the network, which has the performer names as nodes and the matched pairs as directed edges. 

In [35]:
G = nx.DiGraph()
G.add_nodes_from(performer_contents.keys())
G.add_edges_from(matches)

In [38]:
print("Total number of nodes:", G.number_of_nodes())
print("Total number of edges:", G.number_of_edges())

Total number of nodes: 2103
Total number of edges: 17517
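From these two numbers the mean in-degree (which equals the mean out-degree in a directed network) and the density follow directly. A short worked computation, using only the counts printed above:

```python
n_nodes = 2103
n_edges = 17517

mean_degree = n_edges / n_nodes                # average in-degree = average out-degree
density = n_edges / (n_nodes * (n_nodes - 1))  # fraction of possible directed edges present

print(f"mean in/out-degree: {mean_degree:.2f}")  # 8.33
print(f"density: {density:.4f}")                 # 0.0040
```

So each page links to roughly eight other performers on average, and fewer than half a percent of all possible directed links are present.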


Let's get started with the statistics now. First, we plot the distributions of the in- and out-degree. Both are heavy-tailed: a few extreme nodes have a very large in- or out-degree, while the majority of nodes have a very small one. This shape is consistent with a power-law distribution, although a histogram alone cannot confirm it.

In [39]:
in_degrees = [d for n, d in G.in_degree()]
out_degrees = [d for n, d in G.out_degree()]

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(in_degrees, bins=100, color='blue')
plt.title('In-Degree Distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')

plt.subplot(1, 2, 2)
plt.hist(out_degrees, bins=100, color='darkorange')
plt.title('Out-Degree Distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')

plt.show()

If we compare the out-degree distribution of this network with that of a completely random directed network, we see major differences. The degree distribution of a random network follows a Poisson distribution, where the degree of most nodes is close to the average and there are few nodes with an extremely low or high degree. We can check this by plotting the out-degree distribution of a random network with the same number of nodes and edges.

In [40]:
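# Use the node and edge counts of G so that the two networks are directly comparable
D = nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges(), directed=True)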

out_degrees_D = [d for n, d in D.out_degree()]

plt.figure(figsize=(10, 6))
plt.hist(out_degrees_D, bins=50, color='darkorange')
plt.title('Out-Degree Distribution (random network)')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')

plt.show()
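To put a number on how unlikely the observed hubs would be in such a random network, we can approximate the out-degree by a Poisson distribution with mean E/N and compute the probability of reaching the out-degree of the top performers. This is a rough sketch: the Poisson form is an approximation for a random network of this kind, and `poisson_tail` is a helper written for this illustration.

```python
import math

def poisson_tail(lam: float, k_min: int, terms: int = 500) -> float:
    """Approximate P(K >= k_min) for K ~ Poisson(lam) by summing pmf terms."""
    total = 0.0
    for k in range(k_min, k_min + terms):
        # Working in log-space avoids overflow in lam**k and k!
        total += math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    return total

lam = 17517 / 2103            # mean out-degree of the performer network
print(poisson_tail(lam, 67))  # chance that a given node reaches out-degree >= 67
```

The result is vanishingly small, so out-degrees of 67 and above, as observed in the performer network, would be essentially impossible under the random model.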

On the other hand, if we compared the in-degree distribution with that of a scale-free network with the same number of nodes, we would expect the two to look similar, since a scale-free network also follows a power-law degree distribution. The plot below uses log-log axes: if we exclude the few high in-degree nodes in the bottom-right corner, which can be considered hubs, the remaining points form an approximately straight line, suggesting that the in-degrees do follow a power law like that of a scale-free network.

In [41]:
degree_counts = np.bincount(in_degrees)
degree_values = np.arange(len(degree_counts))

plt.figure(figsize=(8, 6))
plt.loglog(degree_values, degree_counts, 'bo', markersize=4, label="Data points")

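plt.title("In-Degree Distribution (log-log)")
plt.xlabel("In-degree (k)")
# The plotted values are raw counts, not normalised probabilities
plt.ylabel("Number of nodes with in-degree k")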
plt.grid(True, which="both", linestyle="--", linewidth=0.5)
plt.legend()
plt.show()
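A straight line on a log-log plot is only a visual hint. One way to quantify it is to estimate the power-law exponent of the tail. The sketch below uses a commonly cited approximation for discrete data, alpha ≈ 1 + n / Σ ln(k_i / (k_min − 0.5)), and checks it on synthetic degrees with a known exponent. The function name `estimate_alpha`, the cutoff `k_min` and the synthetic data are assumptions made for this illustration; they are not results from the performer network.

```python
import math
import random

def estimate_alpha(degrees, k_min):
    """Approximate maximum-likelihood exponent for degrees >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum

# Synthetic degrees with a known exponent, only to check the estimator.
random.seed(0)
true_alpha, k_min = 2.5, 5
synthetic = [
    int((k_min - 0.5) * (1.0 - random.random()) ** (-1.0 / (true_alpha - 1.0)) + 0.5)
    for _ in range(5000)
]
print(round(estimate_alpha(synthetic, k_min), 2))  # should land near 2.5
```

For the real network one would call `estimate_alpha(in_degrees, k_min)` with a cutoff chosen from the plot; the estimate depends on that choice, so it should be read as indicative only.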

Next, we look at the performers with the highest in- and out-degree. The top 5 by in-degree includes some very famous performers such as Elvis Presley, Johnny Cash and Dolly Parton. This is expected, as performers with a high in-degree can be considered very popular and their names are referenced frequently on other performers' pages.

In [42]:
in_degree = dict(G.in_degree())
out_degree = dict(G.out_degree())

top_5_in = sorted(in_degree.items(), key=lambda x: x[1], reverse=True)[:5]
top_5_out = sorted(out_degree.items(), key=lambda x: x[1], reverse=True)[:5]

print("Top 5 Performers by in-degree:")
# Use a separate loop variable so the in_degree / out_degree dictionaries are not overwritten
for performer, degree in top_5_in:
    print(f"{performer}: {degree}")

print("\nTop 5 Performers by out-degree:")
for performer, degree in top_5_out:
    print(f"{performer}: {degree}")

Top 5 Performers by in-degree:
Willie_Nelson: 202
Johnny_Cash: 185
Elvis_Presley: 173
Dolly_Parton: 161
Merle_Haggard: 159

Top 5 Performers by out-degree:
Hillary_Lindsey: 97
Pam_Tillis: 82
Randy_Travis: 75
Vince_Gill: 71
Patty_Loveless: 67


Now, to find the length of each performer's page, we use the `performer_text` dictionary from before, which stores the information returned by the API. Once again we use `findall`, this time to count the words on each page. The count is not completely accurate: it also includes wiki markup, and, as noted previously, some performers' pages have a different title from the name in the list of country music performers, so no content could be retrieved for them.

In [43]:
content_size = []

for performer, content in performer_text.items():
    words = re.findall(r'\w+', content)
    content_size.append((performer,len(words)))

#content_size
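To make concrete what the `\w+` pattern counts, here is a small sketch on a made-up line of wikitext (the `snippet` string is an assumption for demonstration):

```python
import re

snippet = "'''Johnny Cash''' was an [[United States|American]] singer"
tokens = re.findall(r"\w+", snippet)
print(tokens)
print(len(tokens))  # 8
```

Both the link target (`United States`) and its display text (`American`) are counted, so pages with many links or templates get a somewhat inflated word count.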

The top 10 performers with the longest wiki entries are displayed below.

In [44]:
top_10_content = sorted(content_size, key=lambda x: x[1], reverse=True)[:10]

print("Top 10 Performers by length of wiki page:")
for performer, n_words in top_10_content:
    print(f"{performer}: {n_words}")

Top 10 Performers by length of wiki page:
Taylor_Swift: 56923
Miley_Cyrus: 51750
Justin_Bieber: 51170
Carrie_Underwood: 45906
Justin_Timberlake: 45205
Demi_Lovato: 44441
Alabama_(band)|Alabama: 44304
Bob_Dylan: 41806
Ed_Sheeran: 36819
Elvis_Presley: 34565


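We can also look at the in- and out-degree of these top 10 performers. At a glance, page length and degree are not strongly correlated: Bob Dylan and Elvis Presley, for example, have far higher in-degrees than the longer pages of Miley Cyrus or Justin Bieber.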

In [45]:
# Each entry is a (name, word count) pair; only the name is a node in G
for performer, _ in top_10_content:
    print(f"{performer}: in-degree: {G.in_degree(performer)} and out-degree: {G.out_degree(performer)}")

Taylor_Swift: in-degree: 77 and out-degree: 27
Miley_Cyrus: in-degree: 17 and out-degree: 11
Justin_Bieber: in-degree: 15 and out-degree: 11
Carrie_Underwood: in-degree: 68 and out-degree: 61
Justin_Timberlake: in-degree: 13 and out-degree: 14
Demi_Lovato: in-degree: 9 and out-degree: 7
Alabama_(band)|Alabama: in-degree: 26 and out-degree: 24
Bob_Dylan: in-degree: 141 and out-degree: 21
Ed_Sheeran: in-degree: 14 and out-degree: 8
Elvis_Presley: in-degree: 173 and out-degree: 21
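The visual impression can be checked with a correlation coefficient. The sketch below computes the Pearson correlation between page length and in-degree for the ten pages listed above; `pearson` is a small helper written for this illustration. Since these ten pages were selected for their length, the result says little about the network as a whole.

```python
import math

# (page length in words, in-degree) for the ten pages listed above.
pairs = [
    (56923, 77), (51750, 17), (51170, 15), (45906, 68), (45205, 13),
    (44441, 9), (44304, 26), (41806, 141), (36819, 14), (34565, 173),
]

def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    return cov / math.sqrt(var_x * var_y)

print(round(pearson(pairs), 2))
```

Running the same function over all nodes, pairing each entry of `content_size` with the node's degree, would give a more meaningful picture.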


### Exercise 2B. Build a simple visualization of the network

Moving to the network visualization, we first convert the previous network (G) to an undirected network. Then we collect the total degree of each node, as it will be used as the node size later. The average node degree is ~14, so with a multiplier of 1.5 the average node size is around 21, which works well visually (we tested some other multipliers, but the plot gets messy above 2).

In [46]:
G_undirected = G.to_undirected()
node_degrees = dict(G_undirected.degree())
print(sum(node_degrees.values()) / len(node_degrees))
list_of_sizes = [degree * 1.5 for degree in node_degrees.values()]

14.088445078459344


The node color depends on the length of the performer's wiki page. We use a reversed blues scale, where dark blue means a low word count and light blue to white means a high word count. This makes it easy to distinguish the nodes.

In [47]:
node_sizes = dict(content_size).values()
norm = mpl.colors.Normalize(vmin=min(node_sizes), vmax=max(node_sizes))
cmap = plt.cm.Blues_r
node_colors = [cmap(norm(size)) for size in node_sizes]
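When colors are passed to a NetworkX drawing function as a list, they are paired with the nodes by position, so the list has to follow the node order of the graph. Here both come from dictionaries built in the same order. The sketch below makes the pairing explicit by looking up each node's value in node order, and applies plain min-max scaling, which is what `Normalize` computes for values inside the range. The node names and word counts are toy values, not the real ones.

```python
# Toy stand-ins: node order of the graph and word counts keyed by node name.
nodes = ["Johnny_Cash", "Willie_Nelson", "Dolly_Parton"]
word_count = {"Dolly_Parton": 30000, "Johnny_Cash": 20000, "Willie_Nelson": 10000}

# Look up each node's value in node order instead of relying on dict order.
values = [word_count.get(node, 0) for node in nodes]

# Min-max scaling to [0, 1].
vmin, vmax = min(values), max(values)
scaled = [(v - vmin) / (vmax - vmin) for v in values]
print(scaled)  # [0.5, 0.0, 1.0]
```

In the notebook the equivalent lookup would iterate over `G_undirected.nodes()` and read from `dict(content_size)`.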


Finally, we visualize the network. The visualization can be seen below:

There is a giant component in the middle, where most hubs (popular performers) are located. Most of the light blue nodes are also towards the middle and are bigger than average. This suggests that performers with long wiki entries are more likely to be connected with other performers. That is plausible: a performer with a long wiki page probably has many accomplishments and collaborations with other singers, so there is a high probability that other performers' names (links) appear on their page.