[<img src='https://vermontcomplexsystems.org/index_files/large-526602.png' width="180" height="150" align="center"/>](https://vermontcomplexsystems.org/)


**CSYS 303:** Complex Networks

**Name:** Patrick L. Harvey

**Date:** 20230228

**Report:** Available on [Overleaf](https://www.overleaf.com/read/jwptqkrxjbkr) or [GitHub](https://github.com/P-Harvey/WebLaTex/blob/a6ff861fc8e49981e7c4deb5417826cbfd124951/CSYS_303_Complex_Networks/assignment20.tex)

****

**Description:** This notebook applies [methods](https://arxiv.org/abs/2008.02250) developed by members of the Vermont Complex Systems Center to examine sentiment (e.g. happiness) for a given corpus. The specific corpus examined in this implementation of their methods is *Blood Meridian* authored by Cormac McCarthy and available [here](https://altair.pw/pub/lib/Cormac%20Mccarthy%20-%20The%20Blood%20Meridian.pdf).

****

**Index**

[Setup](#Setup)

1. [Imports](##Imports)

2. [Functions](##Functions)

[Assignment 20](#Assignment_20)

1. [Question 1](##Question_1)

$\qquad$ a. [Text Destruction](###Part_a.)

$\qquad$ b. [Time-Series](###Part_b.)

$\qquad$ c. [Lensed Time-Series](###Part_c.)

2. [Question 2](##Question_2)

$\qquad$ a. 

$\qquad$ b. 

$\qquad$ c. 

$\qquad$ d. 

$\qquad$ e. 

$\qquad$ f. 

[Appendices](#Appendices)

  $\qquad$ a. [Hedonometer API](##Hedonometer_API) 

  $\qquad$ b. [Citations and References](##Citations_and_References)

****

# Setup

## Imports

In [None]:
""" Run the latest and greatest. """
%pip install -r requirements.txt --upgrade

""" Import them all! """
import math
import matplotlib as mpl
import matplotlib.pyplot as plt
from nltk import tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import numpy as np
import pandas as pd
import plotly
import re
import seaborn as sns
import string
from string import punctuation as punct
import sys

""" Sometimes ignorance = bliss. """
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

""" In case Computer Modern Roman size 10 isn't an option. """
# mpl.font_manager.findSystemFonts(fontpaths=None, fontext='ttf')

""" Set default font for plots. """
mpl.rc('font',
       family = 'serif',
       serif = 'cmr10')

source = 'Blood_Meridian_McCarthy.txt'

""" If running in Google Colab. Just don't. """
def mount_drive():
  from google.colab import drive
  drive.mount('/content/drive/', force_remount = True)

## Functions

In [None]:
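def load_text(f: str = '/content/text.txt',
              prnt: bool = True):
  """ Split the text at f into a vector of 1-grams. """
  with open(f) as txt:
    # A word followed by any run of characters other than whitespace
    # or . , ? ! ; (or a bare word), or one of those punctuation marks.
    gram_list = re.findall(r"\w+[^\s\.,?!;]+|\w+|[\.,?!;]",
                           txt.read())
  grams = np.array(gram_list)
  if prnt:
    print(f"Total n-grams: {len(grams):>14,}")
  return grams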

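def get_word_freq(f: str = '/content/text.txt',
                  prnt: bool = True):
  n_gram_vec = load_text(f, prnt)
  # Keep the (1-grams, counts) pair as a tuple. Wrapping it in np.array
  # would coerce both rows to one dtype, so the counts would no longer
  # be integers.
  freq = np.unique(n_gram_vec, return_counts=True)
  if prnt:
    print(f'Total unique n-grams: {len(freq[1]):>7,}')
  return freq, n_gram_vec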

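def plot_zipf(f: str = '/content/text.txt'):
  # Count each unique 1-gram and sort the counts from most to least
  # frequent, so that rank 1 is the most common 1-gram.
  _, counts = np.unique(load_text(f, False), return_counts=True)
  counts = np.sort(counts)[::-1]
  plt.scatter(x = np.arange(1, len(counts) + 1),
              y = counts,
              s = 2,
              c = 'k',
              alpha = 0.8,
              linewidths=0.)
  plt.xscale('log')
  plt.yscale('log')
  plt.suptitle('Blood Meridian by Cormac McCarthy',
              fontsize = 16)
  plt.title('Rank-Frequency (Zipf) Plot')
  # The axes are already log-scaled, so the labels name the raw quantities.
  plt.xlabel('Rank')
  plt.ylabel('Frequency')
  sns.despine()
  plt.show()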

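def test_str(n_gram: np.ndarray = None,
             prnt: bool = True):
  # Guard against the default of None, which ' '.join cannot handle,
  # and avoid shadowing the function's own name.
  joined = ' '.join(n_gram) if n_gram is not None else ''
  if prnt:
    print(joined)
  elif joined:
    print("Non-zero string created.")
  else:
    print('You probably messed up. Try again!')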

# Assignment 20

## Question 1

****

Using the main text you chose at the start of the semester, plot a time-series of happiness as described below using the labMT lexicon definitions of sentiment.

The labMT lexicon is introduced in the 2011 [paper](https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0026752&type=printable) by Peter S. Dodds et al.

The lexicon has been occasionally upgraded to accommodate major changes in language use.

****

**Notes:**

• The horizontal axis is “narrative time” corresponding to 1-grams in the text, running from $1 \to N$.

• The windows should overlap, sliding one word ahead each time. This is a simple averaging filter.

• Points should be located above the center of each window.

• So the point for the window running from $n$ to $n + T - 1$ ($T$ words) will be located at $n + (T - 1)/2$.

• Do not pre-filter the text for any given lens. Windows will contain variable numbers of words with and without happiness scores.

****
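
The notes above can be sketched as a sliding average over the 1-gram vector. This is a minimal illustration, not the notebook's final implementation: the function name `window_happiness`, the lowercasing of 1-grams before lookup, and the choice to average only over the scored words in each window are all assumptions, and the toy scores are made up rather than read from labMT.

```python
from typing import Dict, List, Optional, Tuple

def window_happiness(grams: List[str],
                     scores: Dict[str, float],
                     T: int) -> Tuple[List[float], List[Optional[float]]]:
    """Average happiness over every window of T consecutive 1-grams.

    Positions are 1-indexed, so the window n..n+T-1 is placed at
    n + (T - 1) / 2. A window with no scored words yields None.
    """
    # Prefix sums of scores and of scored-word counts, so each window
    # costs O(1) instead of O(T).
    score_sum = [0.0]
    score_cnt = [0]
    for g in grams:
        h = scores.get(g.lower())  # assumption: lexicon keys are lowercase
        score_sum.append(score_sum[-1] + (h if h is not None else 0.0))
        score_cnt.append(score_cnt[-1] + (1 if h is not None else 0))

    centers, averages = [], []
    for n in range(1, len(grams) - T + 2):  # n = 1, ..., N - T + 1
        total = score_sum[n + T - 1] - score_sum[n - 1]
        count = score_cnt[n + T - 1] - score_cnt[n - 1]
        centers.append(n + (T - 1) / 2)
        averages.append(total / count if count else None)
    return centers, averages

toy_scores = {'happy': 8.0, 'sad': 2.0}  # made-up values, not labMT
toy_text = ['happy', 'the', 'sad', ',', 'happy']
centers, averages = window_happiness(toy_text, toy_scores, T=3)
print(centers)   # [2.0, 3.0, 4.0]
print(averages)  # [5.0, 2.0, 5.0]
```

Because the text is not pre-filtered, unscored 1-grams such as `the` and `,` still occupy positions in each window; they simply do not contribute to the average.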

### Part a.

****

Process (or destroy) your text so that it is a simple text file with one 1-gram per line—a vector of 1-grams.

To the extent possible, keep punctuation in as separate 1-grams: periods, commas, semicolons, em-dashes, ellipses, and so on.

****
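
One possible way to keep ellipses and dashes as single 1-grams is sketched below. The pattern `TOKEN_RE` and the helper `to_one_grams` are hypothetical and are not the pattern used by `load_text`; the sample sentence is invented for the example.

```python
import re

# Hypothetical pattern, tried left to right:
#   1. an ellipsis ("..." or U+2026) or a dash (U+2014 or "--"),
#   2. a word, allowing internal apostrophes or hyphens,
#   3. any other single non-space, non-word character.
TOKEN_RE = re.compile(
    r"\.{3}|\u2026|\u2014|--|\w+(?:['\u2019-]\w+)*|[^\w\s]"
)

def to_one_grams(text: str) -> list:
    """Split text into words and punctuation marks, in reading order."""
    return TOKEN_RE.findall(text)

sample = "They rode on--slowly... and didn't stop."
tokens = to_one_grams(sample)

# One 1-gram per line, ready to be written to a plain text file.
print("\n".join(tokens))
```

Placing the multi-character punctuation alternatives first matters: otherwise `...` would be split into three separate periods.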

In [71]:
wf, ng = get_word_freq(source)

Total n-grams:        126,105
Total unique n-grams:  11,302


### Part b.

****

First, use the full lexical lens provided by labMT.

Make a single figure containing a stacked set of 7 plots with text windows of
sizes:

$T = [10^z]$ for $z = 1, 1.5, 2, 2.5, 3, 3.5,$ and $4.0$.

Stacked here means separated and stacked vertically, as opposed to directly
overlaid. 

See examples for Moby Dick at the end of this assignment. 

The notation $[\cdot]$ means that we round to the nearest integer.

*****
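
As a quick check of the window sizes, the rounding $T = [10^z]$ can be worked out directly. The value of `N` below is the total 1-gram count printed in Part a.; the variable names are illustrative only.

```python
# Exponents from the assignment and the rounded window sizes T = [10^z].
zs = [1, 1.5, 2, 2.5, 3, 3.5, 4.0]
window_sizes = [round(10 ** z) for z in zs]
print(window_sizes)  # [10, 32, 100, 316, 1000, 3162, 10000]

# Total 1-grams reported in Part a. A window of size T sliding one word
# at a time over N words gives N - T + 1 points.
N = 126_105
points_per_panel = [N - T + 1 for T in window_sizes]
print(points_per_panel)
```

Each of the seven window sizes would then get its own row in the stacked figure, for example via `plt.subplots(len(window_sizes), 1, sharex=True)`.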

In [72]:
ts = load_text(source, False)

### Part c.

****

Repeat the above for lenses which exclude the central words around the neutral point.

The blocked words are those with happiness scores inside $h_{avg} \pm \delta h_{avg}$, where $\delta h_{avg} = 0.5, 1.0, 1.5, 2.0, 2.5, 3.0,$ and $3.5$.

****
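
A lens of this kind can be sketched as a filter on the score dictionary before any averaging is done. The sketch assumes the neutral point is 5 on labMT's 1 to 9 scale and that words exactly on the lens boundary are kept; `apply_lens` is a hypothetical helper and the scores are toy values, not labMT entries.

```python
from typing import Dict

def apply_lens(scores: Dict[str, float],
               delta: float,
               h_neutral: float = 5.0) -> Dict[str, float]:
    """Drop words whose score lies strictly inside h_neutral +/- delta."""
    return {w: h for w, h in scores.items()
            if abs(h - h_neutral) >= delta}

# Toy scores for illustration only.
toy_scores = {'laughter': 8.5, 'road': 5.4, 'the': 5.0,
              'dust': 4.2, 'blood': 2.9}

for delta in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]:
    kept = apply_lens(toy_scores, delta)
    print(f"delta = {delta}: {sorted(kept)}")
```

The filtered dictionary replaces the full lexicon in the windowed average, so a wider lens leaves fewer scored words in each window while the window positions stay the same.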

## Question 2

****

Using your text of choice,
generate word shifts comparing two "interesting" regions of text.

Use the Python package described [here](https://github.com/ryanjgallagher/shifterator).

Various MATLAB versions made by REDACTED do exist and need to be shared.

Links to paper versions (arXiv is always best), 
Github repository,
and an exhilarating Twitter feed can be found [here](https://pdodds.w3.uvm.edu/research/papers/gallagher2021a/).

"Interesting" is anything you find interesting.

Could be books 3 and 12 in a series;
the second half of a book compared to the first half;
season 4 of a show versus all seasons;
etc.

Aim to find two texts that are both reasonably large (more than $10^{4}$ words)
and fairly different in average happiness scores.

(Though even texts with the same scores can be meaningfully explored with word shifts.)

Let's call the two texts
$T^{(A)}$
and
$T^{(B)}$.

(In your plots, you should label them meaningfully based on your choices.)

Use a reasonable exclusion lens of your choice, e.g., $[4, 6]$ or $[3, 7]$.

****
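
A rough outline of how the two texts could be prepared and passed to shifterator is given below. The helper names `split_frequencies` and `plot_word_shift` are hypothetical, splitting the book at its midpoint is only one possible choice of $T^{(A)}$ and $T^{(B)}$, and the shifterator calls follow the package's README as I understand it, so they should be checked against the installed version.

```python
from collections import Counter

def split_frequencies(grams, split_at):
    """Count lowercase 1-grams on either side of a split point."""
    freq_a = Counter(g.lower() for g in grams[:split_at])
    freq_b = Counter(g.lower() for g in grams[split_at:])
    return dict(freq_a), dict(freq_b)

def plot_word_shift(freq_a, freq_b, names=('First half', 'Second half')):
    """Draw a labMT word shift graph with the [4, 6] exclusion lens."""
    # Imported here so the counting step above runs without shifterator.
    import shifterator as sh
    shift = sh.WeightedAvgShift(type2freq_1=freq_a,
                                type2freq_2=freq_b,
                                type2score_1='labMT_English',
                                reference_value=5,
                                stop_lens=[(4, 6)])
    shift.get_shift_graph(system_names=list(names))
    return shift

# Toy 1-gram vector; in the notebook this would be the output of load_text.
toy = ['the', 'sun', 'rose', '.', 'The', 'dust', 'rose', '.']
freq_a, freq_b = split_frequencies(toy, len(toy) // 2)
print(freq_a)
print(freq_b)
```

With the real text, `plot_word_shift(freq_a, freq_b)` would be called on the two halves, and `names` would carry the meaningful labels asked for above.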

### Part a.

****


****

### Part b.

****


****

### Part c.

****


****

### Part d.

****


****

### Part e.

****


****

### Part f.

****


****

# Appendices

## Hedonometer API

In [None]:
"""
API Links:
https://hedonometer.org/api/v1/words/?format=json&wordlist__title=labMT-en-v2
...
...
...
"""

import requests
import pandas as pd

# TODO: Functionize

fmt   = 'json'
date  = '2018-01-01'
limit = '10000'
base = f'http://hedonometer.org/api/v1/events/?format={fmt}&'
uri = f'{base}happs__timeseries__title=en_all&'
uri = f'{uri}happs__date__gte={date}&limit={limit}'
req   = requests.get(uri)
req.raise_for_status()  # Stop here on an HTTP error.
happiness_df = pd.DataFrame(req.json()['objects'])

# TODO: Breakdown into sub DataFrames

# Collect the nested 'happs' fields row by row. Assigning them to plain
# variables inside the loop would keep only the last row (and would
# overwrite the query date defined above).
records = []
for i in range(happiness_df.shape[0]):
  happs = happiness_df['happs'][i]
  records.append({'date': happs['date'],
                  'frequency': happs['frequency'],
                  'happiness': happs['happiness'],
                  'timeseries': happs['timeseries']})
happs_df = pd.DataFrame(records)

## Citations and References

```
@article{10.1371/journal.pone.0026752,
    doi = {10.1371/journal.pone.0026752},
    author = {
      Dodds, Peter Sheridan 
      AND Harris, Kameron Decker 
      AND Kloumann, Isabel M. 
      AND Bliss, Catherine A. 
      AND Danforth, Christopher M.
    },
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {
      Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter
    },
    year = {2011},
    month = {12},
    volume = {6},
    url = {https://doi.org/10.1371/journal.pone.0026752},
    pages = {1-1},
    abstract = {
      Individual happiness is a fundamental societal metric.
      Normally measured through self-report, happiness has often 
      been indirectly characterized and overshadowed by more 
      readily quantifiable economic indicators such as gross 
      domestic product. Here, we examine expressions made on the 
      online, global microblog and social networking service 
      Twitter, uncovering and explaining temporal variations in 
      happiness and information levels over timescales ranging 
      from hours to years. Our data set comprises over 46 billion 
      words contained in nearly 4.6 billion expressions posted 
      over a 33 month span by over 63 million unique users. 
      In measuring happiness, we construct a tunable, real-time, 
      remote-sensing, and non-invasive, text-based hedonometer. 
      In building our metric, made available with this paper, 
      we conducted a survey to obtain happiness evaluations of 
      over 10,000 individual words, representing a tenfold size 
      improvement over similar existing word sets. Rather than 
      being ad hoc, our word list is chosen solely by frequency 
      of usage, and we show how a highly robust and tunable 
      metric can be constructed and defended.
    },
    number = {12}
}

@Misc{dodds2014a,
  author = {
    Dodds, P. S.
    and Clark, E. M.
    and Desu, S.
    and Frank, M. R.
    and Reagan, A. J.
    and Williams, J. R.
    and Mitchell, L.
    and Harris, K. D.
    and Kloumann, I. M.
    and Bagrow, J. P.
    and Megerdoomian, K.
    and McMahon, M. T.
    and Tivnan, B. F.
    and Danforth, C. M.
  },
  title = {
    Human language reveals a universal positivity bias
  },
  year = {2014},
  note = {
    Preprint available at
    \href{http://arxiv.org/abs/1406.3855}{http://arxiv.org/abs/1406.3855}
  }
}
```










