### Exploring RILM Data

RILM (https://www.rilm.org/) is one of the most important sources of information about writings on music in all languages, and it is widely used by music scholars.

The RILM team is especially interested in help exploring the scholarship they index: what changes are taking place in various subfields?

Possible terms of interest:

- women’s studies
- Jewish studies
- therapy
- psychology
- activism
- ecology
- sustainability
- migration
- gender


Of course we could also think of particular genres or traditions:

- K-pop
- techno

This Notebook will help you query the RILM database for responses, then sort, slice, group, and analyze the results.

Histograms, bar charts, and especially networks would help us understand how fields are changing.


### Load Code

In [6]:
import os
# from decouple import AutoConfig # Install python-decouple
import requests # Install requests
import pandas as pd
import plotly as plt



import pyvis
from pyvis import network as net
from pyvis.network import Network
import networkx as nx

from copy import deepcopy

from community import community_louvain



from itertools import tee
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return tuple(zip(a, b))

### Add Token as Hidden .env File in your Jupyter Hub

In [7]:


# config = AutoConfig("env")  # Create a text file called .env in the same directory.
#                             # It holds the BEARER TOKEN so that we don't
#                             # have to include it in the code.
#                             # It will have one line like the following (exclude the angle brackets):
#                             # BEARER_TOKEN=<MY_BEARER_TOKEN>
                
BASE = "https://api-ibis.rilm.org/200/haverford/"

BEARER_TOKEN = "INSERT TOKEN"  # replace with your own token (keep the quotes)

URLS = {
    "year": BASE + "rilm_index_RYs",
    "terms": BASE + "rilm_index_top_terms",
    "index": BASE + "rilm_index"
}

HEADERS = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}
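
As an alternative to pasting the token into the notebook, it can be read from an environment variable. The sketch below is only one way to do this; the helper names `load_bearer_token` and `build_headers` are hypothetical, and it assumes the token has been exported as `BEARER_TOKEN` before the notebook starts.

```python
import os


def load_bearer_token(var_name="BEARER_TOKEN"):
    """Return the token stored in an environment variable (hypothetical helper)."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before querying RILM")
    return token


def build_headers(token):
    """Build the same Authorization header used in HEADERS above."""
    return {"Authorization": f"Bearer {token}"}
```

With these in place, `HEADERS = build_headers(load_bearer_token())` would replace the hard-coded token.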

### Example Queries

https://api-ibis.rilm.org/200/haverford/rilm_index_RYs?termName=activism

https://api-ibis.rilm.org/200/haverford/rilm_index_top_terms?termName=activism

https://api-ibis.rilm.org/200/haverford/rilm_index?termName=activism

https://api-ibis.rilm.org/200/haverford/rilm_index?termName=activism&includeAuthors=true
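
The example URLs above all follow the same pattern: the base address, an endpoint name, and a query string. A minimal sketch of assembling them with the standard library follows; `example_url` is a hypothetical helper, and the only parameters it uses are the two that appear in the examples (`termName` and `includeAuthors`).

```python
from urllib.parse import urlencode

BASE = "https://api-ibis.rilm.org/200/haverford/"


def example_url(endpoint, term, include_authors=False):
    """Assemble a query URL like the examples above (hypothetical helper)."""
    query = {"termName": term}
    if include_authors:
        query["includeAuthors"] = "true"
    # urlencode escapes spaces and punctuation in multi-word terms
    return BASE + endpoint + "?" + urlencode(query)


print(example_url("rilm_index", "activism", include_authors=True))
```

In the notebook itself `requests.get(..., params=params)` does this encoding for you, so a helper like this is mainly useful for checking a query in the browser.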

Possible terms of interest:

- women’s studies
- Jewish studies
- therapy
- psychology
- activism
- ecology
- sustainability
- gender
- migration

In [8]:
## here is where we define the search term and author status

# "termName" is the search term
# 'includeAuthors': True will return author names in the data

params = {
    "termName": "Beethoven, Ludwig van",
    "includeAuthors": True
}

# and get the response
response = requests.get(
    URLS["index"], 
    headers=HEADERS, 
    params=params
)
# response.url  
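
Before building a DataFrame it can help to confirm that the request succeeded and returned something. The sketch below assumes the endpoint answers with a JSON list of records (which is what `pd.DataFrame(data)` in the next cell expects); `records_or_error` is a hypothetical helper, not part of the RILM API.

```python
def records_or_error(status_code, payload):
    """Return the payload if it looks like a non-empty list of records (hypothetical helper)."""
    if status_code != 200:
        raise RuntimeError(f"RILM request failed with HTTP status {status_code}")
    if not isinstance(payload, list) or not payload:
        raise ValueError("No index records returned; check the spelling of termName")
    return payload
```

It could be called as `data = records_or_error(response.status_code, response.json())`.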

In [9]:
# get the data

data = response.json()
results = pd.DataFrame(data)
results = results.fillna('')
# combines year and accession number to make unique id for each item
results['full_acc'] = results.ry.apply(str) + "-"  + results.ac.apply(str)
results.rename(columns = {'ry': 'year', 'ac': 'item', 'ent' : 'entry', 'lvl': 'level', 'name': 'term', 'cat': 'category', 'full_acc': 'full_id'}, inplace=True)

# how many items
print(len(results))

287580


In [5]:
results

Unnamed: 0,year,item,entry,level,id,term,category,author,pubCC,langItem,langTransFrom,full_id
0,1845,2,1,1,220229,"Beethoven, Ludwig van",N,"Breidenstein, Heinrich Carl",Germany,German,,1845-2
1,1845,2,1,2,254692,Festschriften,M,"Breidenstein, Heinrich Carl",Germany,German,,1845-2
2,1845,2,1,3,1842548,monument inauguration,,"Breidenstein, Heinrich Carl",Germany,German,,1845-2
3,1845,2,1,4,240858,1845,,"Breidenstein, Heinrich Carl",Germany,German,,1845-2
4,1845,3,1,1,220229,"Beethoven, Ludwig van",N,"Breidenstein, Heinrich",Germany,German,,1845-3
...,...,...,...,...,...,...,...,...,...,...,...,...
287575,2022,10019,3,3,218174,"trios, piano, op. 97",W,"Barry, Barbara R.",United Kingdom,English,,2022-10019
287576,2022,10019,4,1,184773,tonality,T,"Barry, Barbara R.",United Kingdom,English,,2022-10019
287577,2022,10019,4,2,220229,"Beethoven, Ludwig van",N,"Barry, Barbara R.",United Kingdom,English,,2022-10019
287578,2022,10019,4,3,218174,"trios, piano, op. 97",W,"Barry, Barbara R.",United Kingdom,English,,2022-10019


In [6]:
concepts = results[results['category'] == "T"]
# build the year mask from `concepts` itself so the boolean index lines up with the frame
concepts = concepts[concepts['year'] > 1900]
concepts


Unnamed: 0,year,item,entry,level,id,term,category,author,pubCC,full_id
57639,1901,132,3,1,180939,aesthetics,T,"Ferrarelli, Giuseppe",Italy,1901-132
57645,1902,47,2,1,112788,symphony,T,"Livonius, Dr.",Germany,1902-47
57647,1902,47,3,1,53413,orchestral music,T,"Livonius, Dr.",Germany,1902-47
57657,1902,140,3,1,149773,sonata,T,"Ernest, Gustav",United Kingdom,1902-140
57661,1902,140,4,1,192399,form,T,"Ernest, Gustav",United Kingdom,1902-140
...,...,...,...,...,...,...,...,...,...,...
285802,2022,7294,4,1,192399,form,T,"Pander, Gaila",United Kingdom,2022-7294
285806,2022,7294,5,1,34676,rhetoric,T,"Pander, Gaila",United Kingdom,2022-7294
285824,2022,10019,1,4,184773,tonality,T,"Barry, Barbara R.",United Kingdom,2022-10019
285829,2022,10019,3,1,192399,form,T,"Barry, Barbara R.",United Kingdom,2022-10019


In [8]:
lvb_top_terms = concepts["term"].value_counts().to_frame().head(50).index.to_list()

In [75]:
# `places` is defined in a later cell (the category filter); run that cell first
places.groupby(['full_id'])['term'].describe()


full_id,count,unique,top,freq
1963-10634,3,2,ecology,2
1973-24842,1,1,ecology,1
1975-3114,2,2,ecology,1
1976-7724,2,2,soundscape,1
1977-1720,3,3,ecology,1
...,...,...,...,...
2022-3820,7,7,mass media,1
2022-4388,15,5,children,3
2022-5255,5,4,mixed media,2
2022-5393,4,4,analyses--by composer,1


In [84]:
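# keep the summary under its own name: reassigning `places` to the describe() output
# drops the 'term' column, so any re-run fails with a KeyError
place_summary = places.groupby(['entry', 'level'])['term'].describe()
place_summary.head()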

In [8]:
terms = results.groupby(['term'])['entry'].count()
df = pd.DataFrame(terms)
df.sort_values('entry', ascending=False).head(25)

term,entry
vocal music,13962
works,4498
history of music,1307
China,1228
Italy,1107
writings,1094
vocal chamber music,986
traditional music,928
instrumental music,917
life,869


In [None]:
results.groupby(['entry'])['term'].describe().head(25)

In [108]:
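terms = results.groupby(['entry', 'level'])['term']
# wrapping a GroupBy in pd.DataFrame gives one row per (entry, level) key,
# with the key in column 0 and that group's Series of terms in column 1;
# terms.apply(list) would give a plain list of terms per group instead
df = pd.DataFrame(terms)
df
# df.sort_values('full_id', ascending=False).head(25)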

Unnamed: 0,0,1
0,"(1, 1)",0 ecology 5 ...
1,"(1, 2)",1 cultural ecology 6 acous...
2,"(1, 3)",2 ecology of music 7 ...
3,"(1, 4)",132 ...
4,"(1, 5)",133 Wester...
...,...,...
117,"(19, 4)","8093 dub techno Name: term, dtype: object"
118,"(19, 5)","8094 relation to urban space, economics, an..."
119,"(20, 1)",8095 sound recordings--general 10502 ...
120,"(20, 2)",8096 dub techno 10503 works Name: ...


### What do the column names mean?

- **year** = year of publication
- **item** = an accession number, or the id of that item within its year
- **full_id** = the combined year and accession number, thus a unique ID for the publication
- **term** = the index term
- **entry** and **level** = ways of grouping the index terms. Each entry ('ent' in the raw data) can have more than one level ('lvl'), and the levels of an entry are combined in order to make a full index string.
- **id** = the id number of the index term
- **category** = a 'category' for the index term (see below), such as:
    - **G** = Geographical
    - **O** = an Organization
    - **N** = name of a person
- **author** = author of the publication
- **pubCC** = where the item was published
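
As a rough illustration of how **entry** and **level** combine, the sketch below groups a few toy rows (adapted from the sample output above) by publication and entry, then joins the terms in level order. The ` -- ` separator and the `index_strings` helper are assumptions made for illustration, not RILM's own display format.

```python
from collections import defaultdict

# Toy rows: (full_id, entry, level, term)
rows = [
    ("2022-10019", 4, 2, "Beethoven, Ludwig van"),
    ("2022-10019", 4, 1, "tonality"),
    ("2022-10019", 4, 1, "tonality"),  # repeated row, as happens when an item has several authors
    ("2022-10019", 4, 3, "trios, piano, op. 97"),
    ("2022-10019", 3, 3, "trios, piano, op. 97"),
]


def index_strings(rows, sep=" -- "):
    """Join the terms of each (full_id, entry) group in level order (hypothetical helper)."""
    grouped = defaultdict(set)
    for full_id, entry, level, term in rows:
        grouped[(full_id, entry)].add((level, term))  # the set drops repeated rows
    return {
        key: sep.join(term for _, term in sorted(levels))
        for key, levels in grouped.items()
    }


for key, text in index_strings(rows).items():
    print(key, "->", text)
```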



### What are the Categories for the Category Field?

```
B = broadcasts, radio, TV, and podcasts
C = title of choreographic work
D = dictionary
E = ethnic group
F = films and videos
G = geographic name
I = instrument
L = literary work (poetry and prose)
M = margin
N = personal name
O = Organization (other than a school)
P = periodical
Q = databases
R = treatise
S = school
T = topic
V = visual art
W = work title
```
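
The one-letter codes can be turned into readable labels with a plain dictionary. The sketch below copies the table above into `CATEGORY_LABELS` and counts codes with the standard library; the `label_counts` helper and the "uncategorized" label for blank codes are assumptions added for illustration.

```python
from collections import Counter

CATEGORY_LABELS = {
    "B": "broadcasts, radio, TV, and podcasts",
    "C": "title of choreographic work",
    "D": "dictionary",
    "E": "ethnic group",
    "F": "films and videos",
    "G": "geographic name",
    "I": "instrument",
    "L": "literary work (poetry and prose)",
    "M": "margin",
    "N": "personal name",
    "O": "organization (other than a school)",
    "P": "periodical",
    "Q": "databases",
    "R": "treatise",
    "S": "school",
    "T": "topic",
    "V": "visual art",
    "W": "work title",
}


def label_counts(codes):
    """Count category codes by readable label; blank or unknown codes become 'uncategorized'."""
    return Counter(CATEGORY_LABELS.get(code, "uncategorized") for code in codes)


print(label_counts(["N", "M", "", "", "N"]))
```

On the DataFrame itself, `results['category'].map(CATEGORY_LABELS)` should give the same labels as a new column.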

In [109]:
# another basic plot based on the year of publication and the geographical places (category "G") named in the index terms

places = results[results['category'] == "G"]
# places.plot.scatter(x = 'year', y = 'term', s = 100, figsize=(10, 15))

In [34]:
communities = results[results['category'] == "E"]
communities

Unnamed: 0,year,item,entry,level,id,term,category,author,pubCC,full_id
11,1913,52,1,3,67472,German-speaking people,E,"Fryklund, Daniel",Sweden,1913-52
16,1913,52,2,4,67472,German-speaking people,E,"Fryklund, Daniel",Sweden,1913-52
7923,1997,2440,1,5,52561,Finno-Ugric peoples,E,"Tari, Lujza",Hungary,1997-2440
7926,1997,2440,2,3,52561,Finno-Ugric peoples,E,"Tari, Lujza",Hungary,1997-2440
11266,2000,12088,1,2,350070,Romani people,E,"Sárosi, Bálint",Austria,2000-12088
18843,2015,7033,1,2,320931,Rarámuri people,E,"Zenker, Miguel",Mexico,2015-7033
18850,2015,7033,2,4,320931,Rarámuri people,E,"Zenker, Miguel",Mexico,2015-7033


In [110]:
# histogram of publications by place of publication:  results['pubCC']


italy = results[results['pubCC'].str.contains("Italy")]
# italy.hist('year', figsize=(10, 5), bins=100)
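
Because each publication appears once per index term, a histogram of `year` on the raw rows counts terms rather than publications. A possible refinement, sketched here on a toy frame that stands in for `results`, is to keep one row per `full_id` and then count by decade. The toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for `results`; the real frame comes from the RILM query above.
toy = pd.DataFrame({
    "year": [1961, 1961, 1968, 1975, 1975, 1983],
    "pubCC": ["Italy", "Italy", "Italy", "Germany", "Italy", "Italy"],
    "full_id": ["1961-1", "1961-1", "1968-4", "1975-2", "1975-9", "1983-7"],
})

italy_toy = toy[toy["pubCC"].str.contains("Italy")]
# One row per publication, so repeated index terms do not inflate the counts.
italy_items = italy_toy.drop_duplicates("full_id")
# Integer-divide by 10 to map each year onto the first year of its decade.
decade = (italy_items["year"] // 10) * 10
per_decade = decade.value_counts().sort_index()
print(per_decade)
```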

### A Concept Map


Here we find all of the terms associated with a given initial term, as follows:

- Limit the 'term' field to the "T" category (concepts).  This could also be done for a person, with "N"
```
t_concepts = results[results['category'] == "T"]
```

- Now find all the **full_id numbers** that feature that term word and save as list
```
selected_concept = t_concepts[t_concepts['term'].str.contains('Black')]
selected_items = selected_concept['full_id'].to_list()
```
- Filter the original df so we have only the given full_id numbers (publications), and in turn filter that set so we only have terms corresponding to the "T" category.  This could instead be done for "N" or 'G', depending on your goal!
```
filtered_results = results[results['full_id'].isin(selected_items)]
filtered_results_t_concepts = filtered_results[filtered_results['category'] == "T"]
```

- Find the 'pairs' of terms mentioned at the various levels of the entries for each publication (grouped by full_id)
- Remove the pairs that are just one term 2x
```
topic_as_pairs = filtered_results_t_concepts.groupby('full_id')['term'].apply(pairwise).explode().dropna().unique()
final_topic_pairs = []
for pair in topic_as_pairs:
    if len(set(pair)) > 1:
        final_topic_pairs.append(pair)
final_topic_pairs
```
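
Note that `pairwise`, as defined in the Load Code cell, yields only consecutive pairs: the first term with the second, the second with the third, and so on. If every co-occurring pair is wanted instead, `itertools.combinations` is one alternative. The sketch below compares the two on a toy list; `all_pairs` is a hypothetical helper, not part of the notebook.

```python
from itertools import combinations, tee


def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return tuple(zip(a, b))


def all_pairs(terms):
    """Every unordered pair of distinct terms (hypothetical helper)."""
    unique_terms = sorted(set(terms))
    return list(combinations(unique_terms, 2))


terms = ["form", "sonata", "tonality"]
print(pairwise(terms))   # consecutive pairs only
print(all_pairs(terms))  # also includes ("form", "tonality")
```

Using all pairs produces a denser network, so which version is appropriate depends on whether the order of terms within a publication matters for the question at hand.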

In [10]:
# here we find all of the terms associated with a given term
# limit to terms with the "T" category
# t_concepts = results[results['category'] == "T"]

# then find all the full_id numbers that feature a given word as the term and save as list
selected_concept = results[results['term'].isin(lvb_top_terms)]
# selected_concept = t_concepts[t_concepts['term'].str.contains(lvb_top_terms)]
selected_items = selected_concept['full_id'].to_list()
# and return the original frame, now filtered for just those items
filtered_results = results[results['full_id'].isin(selected_items)]

# and filter those results to fit a certain category, such as "T" or "G" or "N"
filtered_results_t_concepts = filtered_results[filtered_results['category'] == "T"]
# check the list of names for each essay/item
# groups = ideas.groupby('year')['term'].apply(list)
# # instead find the 'pairs' of all names mentioned in the items
topic_as_pairs = filtered_results_t_concepts.groupby('full_id')['term'].apply(pairwise).explode().dropna().unique()
final_topic_pairs = []
# remove pairs that are just one name 2x
for pair in topic_as_pairs:
    if len(set(pair)) > 1:
        final_topic_pairs.append(pair)

final_topic_pairs
filtered_results

Unnamed: 0,year,item,entry,level,id,term,category,author,pubCC,full_id
24,1846,1,1,1,220229,"Beethoven, Ludwig van",N,"Héquet, Gustave",Germany,1846-1
25,1846,1,1,1,220229,"Beethoven, Ludwig van",N,"Hering, Carl Eduard",Germany,1846-1
26,1846,1,1,1,220229,"Beethoven, Ludwig van",N,"Höpner, Christian Gottlob",Germany,1846-1
27,1846,1,1,1,220229,"Beethoven, Ludwig van",N,"Koehler, Ernst",Germany,1846-1
28,1846,1,1,1,220229,"Beethoven, Ludwig van",N,"Körner, Gotthold Wilhelm",Germany,1846-1
...,...,...,...,...,...,...,...,...,...,...
285831,2022,10019,3,3,218174,"trios, piano, op. 97",W,"Barry, Barbara R.",United Kingdom,2022-10019
285832,2022,10019,4,1,184773,tonality,T,"Barry, Barbara R.",United Kingdom,2022-10019
285833,2022,10019,4,2,220229,"Beethoven, Ludwig van",N,"Barry, Barbara R.",United Kingdom,2022-10019
285834,2022,10019,4,3,218174,"trios, piano, op. 97",W,"Barry, Barbara R.",United Kingdom,2022-10019


### Make a Simple Network

- You will need to pass in the set of pairs created above and name the html file

In [36]:
G = nx.Graph()
# use the imported Network class and a fresh variable name;
# rebinding `net` shadows the pyvis module and breaks the cell on re-run
simple_net = Network(notebook=True, width="1000px", height="800px")
for a, b in final_topic_pairs:
    G.add_edge(a, b)
simple_net.from_nx(G)
# Showing the network
simple_net.show("final_topic_pairs.html")
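
The simple network treats every pair as equally important. One possible refinement is to count how many publications share each pair and use that count as an edge weight. The sketch below does the counting with the standard library on toy data; `weighted_pairs` is a hypothetical helper, and it assumes you have one collection of pairs per publication (for example, the output of the groupby step before `.explode()`).

```python
from collections import Counter


def weighted_pairs(pairs_per_item):
    """Count how many items contain each undirected pair (hypothetical helper)."""
    counts = Counter()
    for pairs in pairs_per_item:
        # Sort each pair so ("sonata", "form") and ("form", "sonata") are the same edge,
        # and use a set so a pair is counted once per item.
        for a, b in {tuple(sorted(p)) for p in pairs}:
            if a != b:
                counts[(a, b)] += 1
    return [(a, b, n) for (a, b), n in counts.most_common()]


toy_items = [
    [("form", "sonata"), ("sonata", "tonality")],
    [("sonata", "form")],
]
print(weighted_pairs(toy_items))
```

The resulting triples can be passed to networkx with `G.add_weighted_edges_from(...)`.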

### Community Network

In [205]:
# do not edit!

def add_communities(G):
    G = deepcopy(G)
    partition = community_louvain.best_partition(G)
    nx.set_node_attributes(G, partition, "group")
    return G

def create_node_html(node: str, source_df: pd.DataFrame, node_col: str):
    rows = source_df.loc[source_df[node_col] == node].itertuples()
    
    html_lis = []
    
    for r in rows:
        # the combined id column is called 'full_id' after the rename in the data cell
        html_lis.append(f"""<li>author: {r.author}<br>
                                id: {r.full_id}</li>"""
                       )
        
    html_ul = f"""<ul>{''.join(html_lis)}</ul>"""
        
    return html_ul


def add_nodes_from_edgelist(edge_list: list, 
                               source_df: pd.DataFrame, 
                               graph: nx.Graph,
                               node_col: str):
    
    graph = deepcopy(graph)
    
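    # flatten the (a, b) pairs into unique node names, keeping first-seen order;
    # unlike the Series-based version, this also works when edge_list is empty
    node_list = list(dict.fromkeys(n for edge in edge_list for n in edge))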
    
    for n in node_list:
        graph.add_node(n, title=create_node_html(n, source_df, node_col))
        
    return graph




### Create Community Network Here

In [206]:
pyvis_graph = Network(notebook=False, width="1800px", height="1400px", bgcolor="black", font_color="white")
G = nx.Graph()

try:
    # the index-term column is called 'term' after the rename in the data cell
    G = add_nodes_from_edgelist(edge_list=final_topic_pairs, source_df=filtered_results, graph=G, node_col='term')
except Exception as e:
    print(e)

G.add_edges_from(final_topic_pairs)
G = add_communities(G)
pyvis_graph.from_nx(G)
pyvis_graph.show('Black_studies_names.html')
