# Networking popular words in lyrics
For this exercise, we'll create a network of words (based on part of speech or some other classification) that are frequently used in collections of song lyrics assembled by the Olivia Rodriguez project team. For our network example, we ask: Which of the popular words is used the most frequently by particular artists? 

If you're adapting this to your own project, take some time with your team to think about what's interesting to explore as a network from your project. You can use this cell to sketch out in markdown what you want to do. 

The steps to create this network are:
## Collect the words, rank them, and select them
* Collect the distinct words from all of the lyrics together by part of speech. (Let's look at nouns in this example.) Return these as a sorted list with duplicates removed, ranked from most to least frequent. 
* Streamline the list, by choosing, say, the top 10 or 20 words.

## Find out which artists use each word and how much
* Reach into the collections assembled by artist.
* For each word in our streamlined list, return a count of how much the artist repeats that word
* Prepare network data (arranged as a TSV or pandas dataframe) with this structure. (The syntax will be different, but this is just an idea of the information we need.)

  ```
     word | used by | artist | count of times used by this artist
  ```
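
To make that structure concrete, here is a minimal sketch of the same four columns held in a pandas dataframe. The column names (`word`, `relation`, `artist`, `count`), the artist labels, and the counts are all made up for illustration: they are not the team's data.

```python
import pandas as pd

# Hypothetical rows, invented only to show the shape of the network data.
rows = [
    ("time", "used by", "artistA", 3),
    ("time", "used by", "artistB", 5),
    ("love", "used by", "artistA", 2),
]

edges = pd.DataFrame(rows, columns=["word", "relation", "artist", "count"])

# Edge weights need to be integers for the network plot.
edges["count"] = edges["count"].astype(int)

# The same data written out as tab-separated values (TSV).
print(edges.to_csv(sep="\t", index=False))
```

Each row is one edge of the network: a word, the relationship, the artist, and the weight of that connection.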

### Alternative ways to develop networks of information
There are lots of ways to think about how to explore word use in song lyrics. We are making something of a "big picture" study of the most popular words by part of speech used by all artists, and looking across their albums: a word is just used by an artist (regardless of which song or album it's in). But we could change this to take a closer look at other patterns. For example:
* Change the context: word use by song or within an album!

* Start with a word of interest **to accept as input into the notebook** and:
    * Look for all the ones most closely related in a collection using cosine similarity (see our early Python homework assignments to explore words similar to a word of interest).
    * Try an adjacency network of words: Find out which other words of its type are sitting close (adjacent) to the word of interest. 
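
As a rough sketch of the adjacency idea, the function below counts which tokens sit right next to a word of interest. The function name `adjacent_words`, the `window` parameter, and the sample tokens are assumptions made for this illustration. It also skips the part-of-speech filtering that spaCy would provide, so treat it as a starting point only.

```python
from collections import Counter

def adjacent_words(tokens, word_of_interest, window=1):
    """Count the tokens that sit within `window` positions of the word of interest."""
    neighbors = Counter()
    for i, token in enumerate(tokens):
        if token != word_of_interest:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the word of interest itself
                neighbors[tokens[j]] += 1
    return neighbors

# Invented sample tokens, just to exercise the function.
sample_tokens = "so show me the way home and tell me the way back".split()
neighbors = adjacent_words(sample_tokens, "way")
print(neighbors.most_common())
```

Each (neighbor, count) pair could become an edge from the word of interest to its neighbor, weighted by the count.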


# Practical considerations: an overview of our process!
## Consider how to visualize this network
This would be a bimodal network to show which words are shared the most by which artists in the collection. 

* Node1 = word
* Node2 = artist
* Edge connection = "used by"
* Edge weight data = count of times used by this artist (needs to be an integer)

## How will we do all this (with the Olivia Rodriguez Team collections)?
* Python imports and functions to coordinate processing
* Apply saxonche to import the team's XML: pull the text you want from the XML nodes in a collection using XQuery
    * _Without XML data_, collect strings of words by opening the files and reading them in. Refer to our early Python assignments.
* Use NLP to find the words of interest: We can send the text for the whole collection to spaCy (much as we did in [project ipynb exercise 3](3-projectExplore-dataCounts.ipynb)) to retrieve the words of interest (e.g. nouns).
    * Remove duplicates and rank them (use `Counter` and `most_common()`).
    * Slice this list to get you the top 20 (or however many you want to plot).
      
* Now return to our XML collection to find out information about who is using the words and where.
    * For our example from the Olivia Rodriguez team, the team has organized files in folders named by artist and album. We'll use this organization to help collect information based on the artists. (See local folder in this projectExamples directory named `lyricXML/`.)
    * We'll return to saxonche and **write an XQuery function that defines the collections we need to reach into based on the artist**. (**NOTE: This part is tricky: It will be specific to the project team's folder structure**. Ask for help if you need to on this part to adapt to your project!)
        * _Without XML data_, return to the text files: open them based on filename or folder name or whatever structure will help you establish context for your files
    * Start with a for loop over your words of interest. Each word is sent into XQuery to retrieve how much it's used by an artist and return the count of its use.
        *  _Without XML data_, each word is sent to a Python function that retrieves a count of how much it's used in whatever you're using to delimit a special "bucket" of files or folders in your project.
     
    * Output a pandas dataframe  to prepare data to be read by NetworkX and pyVis.
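
For the _Without XML data_ route, here is a minimal sketch of counting words per "bucket", assuming plain `.txt` lyric files sorted into one top-level folder per artist. The function name `count_words_by_folder`, the folder names, and the sample lyrics are invented for illustration. It matches exact word forms, not lemmas, so it is simpler than the spaCy approach.

```python
from collections import Counter
from pathlib import Path
import re
import tempfile

def count_words_by_folder(root, words):
    """Return (word, 'used by', folder name, count) rows for each top-level folder under root."""
    rows = []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts = Counter()
        # rglob recurses through nested folders (albums, for example).
        for txt_file in folder.rglob("*.txt"):
            text = txt_file.read_text(encoding="utf-8").lower()
            counts.update(re.findall(r"[a-z']+", text))
        for word in words:
            rows.append((word, "used by", folder.name, counts[word]))
    return rows

# Build a tiny throwaway collection so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "artistA" / "album1").mkdir(parents=True)
    (Path(tmp) / "artistB").mkdir()
    (Path(tmp) / "artistA" / "album1" / "song.txt").write_text("Time after time", encoding="utf-8")
    (Path(tmp) / "artistB" / "song.txt").write_text("No love, no time", encoding="utf-8")
    rows = count_words_by_folder(tmp, ["time", "love"])

for row in rows:
    print("\t".join(str(cell) for cell in row))
```

The rows come out in the same word, "used by", artist, count arrangement that the network data needs.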
 
## Time to code this...

In [3]:
# START WITH INSTALLS AND IMPORTS!

# If you're missing anything in the import cells below, you should install it with pip (or pip3) in your virtual environment. 

# TRY uncommenting the lines here to see if the notebook will handle the imports directly. 
# Here you need the `!` in front if it's going to work.
# IF THAT DOESN'T WORK:
# Go to your command line in the Git Bash shell (Windows) or Terminal (Mac) and 
# activate your virtual environment where you've set it on your local computer: 
# Windows: source Scripts/activate
# Mac: source bin/activate 
# Watch for your virtual environment to show you it's active in the shell.
# Then enter your pip installs (without the `!` exclamation points). 

# INSTALLS
# (pathlib is part of the Python 3 standard library, so it needs no install.)
# !pip install spacy
# !pip install saxonche
# !pip install pandas
# !pip install networkx
# !pip install pyvis


In [45]:
# IMPORTS for the text NLP processing
import pathlib
import spacy
from pathlib import Path
from saxonche import PySaxonProcessor
from collections import Counter

# Just in case you want it:

# import re as regex
# re is standard to Python3: lets us work with regular expressions in Python. 
# Uncomment it if you want to try it here to search for a specific pattern in your texts with Python.

In [44]:
# IMPORTS For the network visualizations
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pyvis.network import Network


In [8]:
# spacy.cli.download("en_core_web_lg")
# ONLY NEED THE ABOVE LINE ONCE: it only downloads the model, so we don't assign its result to nlp.
# REMEMBER: COMMENT OUT THE ABOVE LINE AGAIN THE NEXT TIME YOU RUN THIS.
nlp = spacy.load('en_core_web_lg')

## Collect the words, rank them, and select them
* Collect the distinct words from all of the lyrics together by part of speech. (Let's look at nouns in this example.) Return these as a sorted list with duplicates removed, ranked from most to least frequent. 
* Streamline the list, by choosing, say, the top 10 or 20 words.


### Sample XML code for a file in the lyricXML collection

```xml
<lyrics>
    <section type="verse" n="1">
        <l>Told you not to worry</l>
        <l>But maybe that's a lie</l>
        <l>Honey, what's your hurry?</l>
        <l>Won't you stay inside?</l>
        <l>Remember not to get too close to stars</l>
        <l>They're never gonna give you love like ours</l>
    </section>
    <section type="chorus">
        <l>Where did you go?</l>
        <l>I should know, but it's cold</l>
        <l>And I don't wanna be lonely</l>
        <l>So show me the way home</l>
        <l>I can't lose another life</l>
    </section>
    <section type="refrain">
        <l>Hurry, I'm worried</l>
    </section>
    <section type="verse" n="2">
        <l>The world's a little blurry</l>
        <l>Or maybe it's my eyes</l>
        <l>The friends I've had to bury</l>
        <l>They keep me up at night</l>
        <l>Said I couldn't love someone</l>
        <l>'Cause I might break</l>
        <l>If you're gonna die, not by mistake</l>
    </section>
    <section type="chorus">
        <l>So, where did you go?</l>
        <l>I should know, but it's cold</l>
        <l>And I don't wanna be lonely</l>
        <l>So tell me you'll come home</l>
        <l>Even if it's just a lie</l>
    </section>
    <section type="bridge">
        <l>I tried not to upset you</l>
        <l>Let you rescue me the day I met you</l>
        <l>I just wanted to protect you</l>
        <l>But now I'll never get to</l>
    </section>
    <section type="refrain">
        <l>Hurry, I'm worried</l>
    </section>
    <section type="outro">
        <l>Where did you go?</l>
        <l>I should know, but it's cold</l>
        <l>And I don't wanna be lonely</l>
        <l>Was hoping you'd come home</l>
        <l>I don't care if it's a lie</l>
    </section>
</lyrics>
```

### The next two cells...
Define your input and output filepaths...and send them to XQuery for processing.

#### About reaching into your file collections with XQuery
Our input for this exercise in lyricXML is a set of nested folders, so we need to recurse through them. 

The `collection()` function in our XQuery is set to **recurse** through each of the folders and find all the XML files inside. 
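
If you want to see the same idea without XQuery, here is a small sketch using Python's built-in `xml.etree.ElementTree`. This is an alternative technique, not the saxonche approach this notebook uses. It pulls the text of every `<l>` element from a shortened copy of the sample XML and joins the lines with spaces.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sample lyric XML shown earlier in this notebook.
sample_xml = """<lyrics>
    <section type="chorus">
        <l>Where did you go?</l>
        <l>I should know, but it's cold</l>
    </section>
    <section type="refrain">
        <l>Hurry, I'm worried</l>
    </section>
</lyrics>"""

root = ET.fromstring(sample_xml)

# iter("l") walks every <l> descendant, much like //l in XPath.
lines = [line.text for line in root.iter("l") if line.text]
joined = " ".join(lines)
print(joined)
```

To cover a whole folder tree this way, you could loop over `Path('lyricXML').rglob('*.xml')` and call `ET.parse()` on each file, which plays roughly the role of `recurse=yes` in the `collection()` call.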

#### Keeping your outputs from scrolling forever
On an output cell with a LONG BLOB of text, right-click and select "Enable Scrolling for Outputs"

In [11]:
# DEFINE SOME FILE PATHS FOR INPUT, AND (ONCE WE'RE READY) OUTPUT
InputPath = 'lyricXML'
OutputPath = 'testOutput' 

# NOTE: We need to use a return line on this function to return the string value of `r` as the result of our python function.
# With the return line, that makes it possible to call the function in the next cell when we need to deliver the output to nlp.

In [12]:
def xqueryAndNLP(InputPath):
    # This time, let's try XQuery over a collection of files:
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        xq = proc.new_xquery_processor()
        xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
        xq.set_query_content(f'''
let $allTheLyrics := collection('{InputPath}/.?select=*.xml;recurse=yes')
(: ebb: our collection variable is set to recurse through the internal nested folders. :)
let $lines := $allTheLyrics//l ! text()
return string-join($lines, ' ')

''')
        r = xq.run_query_to_string()
        # print(r)
        r = str(r)
    return r

xqueryAndNLP(InputPath)

SaxonC-HE 12.4.2 from Saxonica




### Let's roll this ball of text over to NLP now. . .

In [17]:
# If everything's working properly and you have lots of text for the computer to read, this cell may take a moment to run. 

inputstring = xqueryAndNLP(InputPath)

# start playing with spaCy and nlp:
words = nlp(inputstring)

Lemmas = []
for token in words:
    if token.pos_ == "VERB": #***CALEB COMMENT*** This was originally NOUN. I tried "ADVERB" but got 0 results: spaCy's tag for adverbs is "ADV".
        lemma = token.lemma_
        Lemmas.append(lemma)


lemmaFreq = Counter(Lemmas)
totalLemmaCount = len(lemmaFreq) 

print(f"Lemma count: {totalLemmaCount}")

print(f"Lemma frequency {lemmaFreq}")

# We can even calculate the percentage each lemma is used.
# The total token count will be the length of the Lemmas list: len(Lemmas).



SaxonC-HE 12.4.2 from Saxonica
Lemma count: 634
Lemma frequency Counter({'know': 410, 'get': 409, 'go': 266, 'think': 251, 'say': 240, 'tell': 234, 'do': 209, 'want': 194, 'wanna': 181, 'make': 178, 'see': 167, 'take': 156, 'feel': 136, 'have': 124, 'love': 117, 'look': 111, 'like': 105, 'be': 104, 'leave': 93, 'need': 80, 'let': 77, 'stop': 77, 'try': 75, 'come': 74, 'give': 74, 'find': 72, 'call': 66, 'keep': 62, 'stay': 58, 'watch': 57, 'lose': 53, 'push': 49, 'wish': 48, 'put': 47, 'break': 46, 'ask': 45, 'guess': 44, 'meet': 43, 'mean': 43, 'start': 43, 'forget': 42, 'change': 42, 'hate': 39, 'die': 38, 'run': 37, 'care': 37, 'talk': 36, 'comin': 36, 'walk': 36, 'fall': 34, 'hear': 32, 'burn': 30, 'hold': 30, 'lie': 30, 'use': 29, 'hurt': 28, 'hope': 28, 'play': 27, 'miss': 27, 'wait': 26, 'help': 26, 'remember': 25, 'end': 25, 'wonder': 22, 'sit': 21, 'cry': 20, 'write': 20, 'stand': 20, 'pretend': 19, 'promise': 18, 'buy': 18, 'talkin': 18, 'wake': 18, 'fake': 18, 'kiss': 17, 'b
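
The percentage calculation mentioned in the comments of the cell above could look something like this sketch. The function name `lemma_percentages` and the short sample list are made up for illustration. With the real data you would pass in the `Lemmas` list.

```python
from collections import Counter

def lemma_percentages(lemmas):
    """Return each lemma's share of all matching tokens, as a percentage."""
    freq = Counter(lemmas)
    total = sum(freq.values())  # every matching token, not just the distinct lemmas
    return {lemma: round(100 * count / total, 1) for lemma, count in freq.most_common()}

# Invented sample list standing in for the Lemmas list.
sample = ["know", "get", "know", "go", "know", "get", "say", "know"]
percentages = lemma_percentages(sample)
print(percentages)
```

Note the difference between `len(freq)` (how many distinct lemmas there are) and `sum(freq.values())` (how many tokens were counted in all). The percentage needs the second one.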

In [14]:
# As with our previous bar graph examples in exercise 3, we don't want to plot every last word here.
# But we have a lot of data, so we can experiment!
# To access data in our Counter list and keep it organized from highest to lowest value, we use `most_common()`.
# Then we can slice it to store however many we want to plot. [:10] would keep the first 10 values (positions 0 through 9): Python starts counting from zero, and the slice stops before position 10.

mostCommon = dict(lemmaFreq.most_common()[:29])
print(f"mostCommon Lemmas {mostCommon}")

# Here we are unpacking our sliced dictionary of most common lemmas into lists of the values and keys,
# and checking to make sure they remain in their dictionary order here. 
# We will use the list of lemmas in the next code cell to look for each one as used by each artist.  
# (We used lists like these when plotting bar graphs, 
# so you could output some bar graphs in the next cells if you want, and then return to the network we're building!)

listCounts = list(mostCommon.values())
listLems = list(mostCommon.keys())
print(f"listCounts: {listCounts}")
print(f"listLems: {listLems}")


mostCommon Lemmas {'time': 191, 'love': 172, 'thing': 108, 'friend': 105, 'baby': 86, 'way': 83, 'da': 81, 'night': 76, 'man': 72, 'eye': 66, 'bed': 62, 'boy': 58, 'mind': 57, 'heart': 55, '-': 52, 'girl': 47, 'head': 45, 'one': 43, 'day': 43, 'daylight': 39, 'guy': 38, 'life': 37, 'nothing': 35, 'diamond': 34, 'hand': 33, 'mine': 32, 'people': 31, 'name': 30, 'light': 29}
listCounts: [191, 172, 108, 105, 86, 83, 81, 76, 72, 66, 62, 58, 57, 55, 52, 47, 45, 43, 43, 39, 38, 37, 35, 34, 33, 32, 31, 30, 29]
listLems: ['time', 'love', 'thing', 'friend', 'baby', 'way', 'da', 'night', 'man', 'eye', 'bed', 'boy', 'mind', 'heart', '-', 'girl', 'head', 'one', 'day', 'daylight', 'guy', 'life', 'nothing', 'diamond', 'hand', 'mine', 'people', 'name', 'light']
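
Here is a quick check of how slicing `most_common()` behaves, using an invented string of letters in place of lyrics. It also shows that passing a number straight to `most_common()` gives the same result as slicing.

```python
from collections import Counter

# Twelve distinct made-up "words": a appears 3 times, b twice, the rest once.
freq = Counter("a b a c b a d e f g h i j k l".split())

top_by_slice = freq.most_common()[:10]  # slice the full ranked list
top_by_arg = freq.most_common(10)       # or ask for the top 10 directly

print(len(top_by_slice))  # 10 items: positions 0 through 9
print(top_by_slice[0])

# Unpack into separate lists of words and counts, keeping the ranked order.
words = [word for word, count in top_by_arg]
counts = [count for word, count in top_by_arg]
```

Either form works. The slice `[:10]` stops before position 10, so it holds exactly ten entries.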


## Okay, time to build our network...
### Find out which artists use each word and how much
* Reach into the collections assembled by artist, **with XQuery again, but this time, reaching into specific collections for each artist** 
* **For each word in our streamlined list**, return a count of how much the artist repeats that word
* Prepare network data (arranged as a TSV or pandas dataframe) with this structure.
* This is just to remind us what we're  constructing for our network. **The syntax for our dataframes will be different** (no vertical `|`'s ). 

  ```
     word | used by | artist | count of times used by this artist
  ```


In [37]:
from IPython.display import display, HTML

def networkClass(listLems, InputPath):
    # For each lemma, count the <l> lines in the taylor/ collection that contain it.
    with PySaxonProcessor(license=False) as proc:
        for lemma in listLems:
            xq = proc.new_xquery_processor()
            xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
            xq.set_query_content(
                f'''
                declare variable $lemma as xs:string external := "{lemma}";

                let $taylorLyrics := collection('{InputPath}/taylor/.?select=*.xml;recurse=yes')

                (: Count the matching lines instead of returning whole song files. :)
                return string(count($taylorLyrics//l[contains(., $lemma)]))
                '''
            )

            # run_query_to_value() plus str() is the same pattern networkQuery() uses below.
            count = int(str(xq.run_query_to_value()))
            display(HTML(f'<div id="output">Lines containing "{lemma}": {count}</div>'))

networkClass(["love"], "lyricXML")

#That's a lotta love XD

In [18]:
from io import StringIO
OutputPath = 'testOutput/networkdata.tsv' 

def networkQuery(listLems, InputPath):
    listdfs = []
    with PySaxonProcessor(license=False) as proc:
       for lemma in listLems:
            xq = proc.new_xquery_processor()
            xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
            xquery = f'''
                declare variable $lemma as xs:string* external := '{lemma}';
                declare variable $string as xs:string := string-join(
                let $allTheLyrics := collection('{InputPath}/.?select=*.xml;recurse=yes')
                (: ebb: our collection variable is set to recurse through the internal nested folders.:)
                
                let $artistNames := ('billie', 'olivia', 'sabrina', 'taylor')
                    for $name in $artistNames
                    (: return $name :) 
                
                    let $lemmaLines := $allTheLyrics[base-uri() ! contains(., $name)]//l ! text()[contains(., $lemma)]
                    let $lemmaCount := count($lemmaLines)
                    return ($lemma || '\t' || 'used by' || '\t' || $name || '\t' ||  $lemmaCount), '\n');
                
                (: The \t and \n above are turned into a real tab and newline by Python before XQuery sees them. :)
                (: IF NEEDED: in place of \t, we can use the character reference `&#x9;`. :)
                (: IF NEEDED: in place of \n, we can use the character reference `&#xA;` for a newline or hard return. :)
                $string
            '''
            xq.set_query_content(xquery)
        
            r = xq.run_query_to_value()
            r = str(r)
            # print(r)
            # ebb: Now we read this into a pandas dataframe based on tab-separated values as csv/tsv data:
            df = pd.read_csv(StringIO(r), header=None, sep="\t")
            # print(df) # ebb: This churns out lots of little dataframes based on each turn of the python for loop in this function.
            # So, we need to bundle them together. We'll start by putting them in a list. 
            listdfs.append(df)
            #print(listdfs)
    # ebb: Now we concatenate the list of pandas dataframes into just one using pd.concat:
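    merged_df = pd.concat(listdfs, ignore_index=True)
    # Make sure the output folder exists before writing the TSV file.
    Path(OutputPath).parent.mkdir(parents=True, exist_ok=True)
    merged_df.to_csv(OutputPath, sep="\t") 

    return merged_df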
    
networkQuery(listLems, InputPath)   
    
    

Unnamed: 0,0,1,2,3
0,time,used by,billie,24
1,time,used by,olivia,25
2,time,used by,sabrina,86
3,time,used by,taylor,51
4,love,used by,billie,44
...,...,...,...,...
111,name,used by,taylor,13
112,light,used by,billie,0
113,light,used by,olivia,6
114,light,used by,sabrina,10


### Network Vis Time!
We have lovely network data formatted as pandas dataframes. (We could have output the data and bundled it up in a dictionary structure or something else, but we thought we'd try dataframes for ease of reading.)

Dataframes are used frequently in text data analytics for organizing and reading values into charts and graphs. Read more about it at <https://www.geeksforgeeks.org/python-pandas-dataframe/>.

Now we need to send the dataframes to networkx and pyvis for networking. 
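
One way to hand a dataframe to NetworkX is `nx.from_pandas_edgelist()`, sketched below as an alternative to looping over rows. The column names and counts here are invented for illustration. The dataframe built in this notebook has integer column labels instead, so there the call would presumably use `source=0, target=2, edge_attr=3`.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge data in the word | used by | artist | count arrangement.
edges = pd.DataFrame(
    [
        ("time", "used by", "artistA", 3),
        ("time", "used by", "artistB", 5),
        ("love", "used by", "artistA", 2),
    ],
    columns=["word", "relation", "artist", "count"],
)

# Each row becomes one edge; the count column is stored as an edge attribute.
G = nx.from_pandas_edgelist(edges, source="word", target="artist", edge_attr="count")

print(G.number_of_nodes(), G.number_of_edges())
print(G["time"]["artistB"]["count"])
```

The row-by-row loop used in the cells below does the same job and leaves more room to set a shape or color per node.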

#### Fine-tuning the network display
Here are some resources for adjusting how the network displays. 
* PyVis Network documentation: <https://pyvis.readthedocs.io/en/latest/documentation.html>
* Using the Configuration UI to tweak the Network: <https://pyvis.readthedocs.io/en/latest/tutorial.html#using-the-configuration-ui-to-dynamically-tweak-network-settings>
  


In [47]:
networkData = networkQuery(listLems, InputPath)   

# Create the network graph
net = Network(height='1000px', width='100%', bgcolor='#ffffff', font_color='black', notebook=True, select_menu=True, cdn_resources="in_line")


# Iterate through the DataFrame and add nodes and edges
for i, row in networkData.iterrows():
    source = row[0]
    target = row[2]
    weight = row[3]
    net.add_node(source, shape='star', color= '#ffc2f6')
    net.add_node(target, shape='circle', color='#fffb91')
    net.add_edge(source, target, value=weight*5, title=weight)


# Customize the layout
# ebb: see PyVis docs: https://pyvis.readthedocs.io/en/latest/documentation.html#pyvis.network.Network.barnes_hut
# I'm trying out / commenting out various combinations of network properties here:
# net.barnes_hut()
net.barnes_hut(gravity=-80000, central_gravity=0.003, spring_length=5, spring_strength=.1, damping=0.09, overlap=0)

# print(net)
# Display the graph in the Jupyter Notebook
net.show_buttons(filter_=['physics'])
net.show('network_graph.html')

network_graph.html


UnicodeEncodeError: 'charmap' codec can't encode characters in position 263607-263621: character maps to <undefined>
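
A `'charmap' codec can't encode` error like the one above usually shows up when a file is written with the platform's default encoding (often cp1252 on Windows) and the text holds characters that encoding lacks. A possible workaround, sketched here under that assumption, is to write the HTML yourself as UTF-8. The helper name `save_html_utf8` and the sample HTML string are made up for this example.

```python
from pathlib import Path
import tempfile

def save_html_utf8(html, filename):
    """Write an HTML string to disk as UTF-8, whatever the platform default encoding is."""
    Path(filename).write_text(html, encoding="utf-8")
    return filename

# Stand-in for the HTML a network library would produce: it holds characters
# that a cp1252 ('charmap') encoder cannot represent.
sample_html = "<html><body><p>da ♪ love ✨</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    out = save_html_utf8(sample_html, Path(tmp) / "network_graph.html")
    round_trip = Path(out).read_text(encoding="utf-8")

print(round_trip == sample_html)
```

With PyVis, and assuming your installed version provides `Network.generate_html()`, you could try `save_html_utf8(net.generate_html(), 'network_graph.html')` in place of `net.show('network_graph.html')`, then open the saved file in a browser. Check the PyVis documentation for your version before relying on this.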

### Applying network statistics to the visualization
#### For experimentation, discussion, project development


Our network visualization should be already displaying weighted edges based on the count value (the number of times an artist uses a word). 

Nodes are not sized by network stats yet. 
What about applying color and size and other visual properties based on network statistics?

For this we need to work with NetworkX and PyVis libraries together. NetworkX calculates network statistics to apply to our visual plot. We'll have to redo our graph to generate network centrality calculations.


Network properties to investigate:
* degree centrality
* closeness centrality
* eigenvector centrality
* eccentricity
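
To get a feel for what degree centrality measures, here is a worked sketch in plain Python. The function below is written for this example and is not NetworkX's own code, though it follows the same definition: a node's number of neighbors divided by the number of other nodes. The word labels are placeholders. The setup mirrors this notebook's network, where each of 29 words gets an edge to each of 4 artists (even when the count is 0).

```python
def degree_centrality(edges):
    """Degree centrality = a node's number of neighbors / (number of nodes - 1)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}

words = [f"word{i}" for i in range(1, 30)]         # 29 placeholder words
artists = ["billie", "olivia", "sabrina", "taylor"]
edges = [(w, a) for w in words for a in artists]   # every word linked to every artist

dc = degree_centrality(edges)
print(dc["word1"] * 20)   # 4 neighbors / 32 other nodes, times 20
print(dc["billie"] * 20)  # 29 neighbors / 32 other nodes, times 20
```

These work out to 2.5 for every word and 18.125 for every artist, which matches the node sizes printed by the centrality cell further down. Because every word connects to every artist, the words all share one degree and every node's eccentricity is 2. Leaving out the rows with a count of 0 before adding edges would be one way to let those statistics vary.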

  

In [21]:
### ebb: UNDER CONSTRUCTION! 
networkData = networkQuery(listLems, InputPath)   

# Create the network graph
G = nx.Graph()


# Iterate through the DataFrame and add nodes and edges
for i, row in networkData.iterrows():
    source = row[0]
    target = row[2]
    weight = row[3]
    G.add_node(source, shape='dot')
    G.add_node(target, shape='triangle')
    if target == "billie":
        colorEdge = "blue"
    elif target == "olivia": 
        colorEdge = "green"
    # elif target == "sabrina":
    #    colorEdge = "coral"
    # elif target == "taylor":
    #    colorEdge = "purple"
    else:
        colorEdge = "coral"
    # How to write Python if else conditions: https://www.w3schools.com/python/python_conditions.asp 
    G.add_edge(source, target, value=weight, title=weight, color=colorEdge)

# Calculate this network's centrality statistics
degree_centrality = nx.degree_centrality(G)
# print(degree_centrality)

closeness_centrality = nx.closeness_centrality(G)
eigenvector_centrality = nx.eigenvector_centrality(G)
eccentricity = nx.eccentricity(G)
print(eccentricity)

# VISUALIZE THE NETWORKX NETWORK IN PYVIS

# Create node size list based on degree centrality
node_sizes = [v * 20 for v in degree_centrality.values()]
print(node_sizes)

# Generate node color values based on closeness centrality
node_colorVals = [c * 20 for c in closeness_centrality.values()]

# Create PyVis Network object
net = Network(height='600px', width='100%', bgcolor='#222222', font_color='white', notebook=True, select_menu=True, cdn_resources="in_line")


# Add nodes and edges to PyVis Network
for node in G.nodes:
    # ebb: Here we'll try basing the COLOR and SIZE of the nodes based on network calculations. 
    # Follow the code to see which variables store the network information. 
    # Closeness centrality falls between 0 and 1, so we scale it to the 0-255 range of an rgba channel.
    # We set rgba Red, Green, Blue color values: https://www.w3schools.com/cssref/func_rgba.php
    cv = round(closeness_centrality[node] * 255)
    # Size each node by its degree centrality, looked up directly by node name.
    net.add_node(node, shape=G.nodes[node]['shape'], color=f"rgba({cv}, 255, 255, 0.8)", size=degree_centrality[node] * 20)

for source, target, edge_data in G.edges(data=True):
    print(edge_data)
    net.add_edge(source, target, value=edge_data['value'], color=edge_data['color'], title=edge_data['title'])
# Show the interactive plot

# print(net)
# net.barnes_hut(gravity=80, central_gravity=0.0005, spring_length=50, spring_strength=.1, damping=0.09, overlap=0)
# net.force_atlas_2based(gravity=-50, central_gravity=0.01, spring_length=100, spring_strength=0.08, damping=0.4, overlap=0)
net.show_buttons(filter_=['physics'])
net.show('degreeNetworkVis.html')






{'time': 2, 'billie': 2, 'olivia': 2, 'sabrina': 2, 'taylor': 2, 'love': 2, 'thing': 2, 'friend': 2, 'baby': 2, 'way': 2, 'da': 2, 'night': 2, 'man': 2, 'eye': 2, 'bed': 2, 'boy': 2, 'mind': 2, 'heart': 2, '-': 2, 'girl': 2, 'head': 2, 'one': 2, 'day': 2, 'daylight': 2, 'guy': 2, 'life': 2, 'nothing': 2, 'diamond': 2, 'hand': 2, 'mine': 2, 'people': 2, 'name': 2, 'light': 2}
[2.5, 18.125, 18.125, 18.125, 18.125, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]
{'value': 24, 'title': 24, 'color': 'blue'}
{'value': 25, 'title': 25, 'color': 'green'}
{'value': 86, 'title': 86, 'color': 'coral'}
{'value': 51, 'title': 51, 'color': 'coral'}
{'value': 44, 'title': 44, 'color': 'blue'}
{'value': 57, 'title': 57, 'color': 'blue'}
{'value': 28, 'title': 28, 'color': 'blue'}
{'value': 3, 'title': 3, 'color': 'blue'}
{'value': 45, 'title': 45, 'color': 'blue'}
{'value': 76, 'title': 76, 'color': 'blue'}
{'v

UnicodeEncodeError: 'charmap' codec can't encode characters in position 263607-263621: character maps to <undefined>