*Notebook made for the CDH workshop I-Analyzer, October 2022*
In this notebook we provide some entry points for working with I-Analyzer data in Python. It is by no means a tutorial, and some knowledge of Python is required.

You can copy this notebook to your own Google Drive to get an editable file.

# Data

There is a CSV file attached to this notebook called `queen_king.csv`. We generated it with the query `queen|king` on the `Times` corpus, filtered for OCR quality 80-100. The download includes the first 10,000 results. See https://ianalyzer.hum.uu.nl/search/times;query=queen%7Cking;$ocr=80:100 for the full results.

Try to replicate the download, or come up with an interesting query yourself! You can upload your file using the folder icon in the toolbar on the left-hand side of the screen.

Alternatively, you can use the example corpus that Ruben hosts on his GitHub.

In [None]:
# imports
import pandas as pd
import spacy
from collections import Counter
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
import unicodedata
import re
import networkx
import matplotlib.pyplot as plt
%matplotlib inline

# set the name of your uploaded file here
CSV_FILENAME = 'queen_king.csv'
# URL of the example corpus on GitHub, as an alternative to an uploaded file
CSV_FILE = 'https://raw.githubusercontent.com/RubenSchalk/textcorpora/main/data/times_sample.csv'

In [None]:
# loading the data into a Pandas dataframe
# see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html for documentation

# use CSV_FILE if you wish to read the GitHub file instead
full_data = pd.read_csv(CSV_FILENAME)

# show some information
full_data.info()

# take a smaller subset of the data, so we can work faster
sample_data = full_data.sample(100)
sample_data.head(5)



# Example analysis: plotting article length

In [None]:
# let's generate a new column containing the length (in words) of an article
# we use the spaCy package for some common NLP tasks

# We use a pre-trained English model provided by spaCy
nlp = spacy.load("en_core_web_sm")

def get_num_words(content):
    doc = nlp(content)
    # rough count: len(doc) is the number of tokens, which includes punctuation
    return len(doc)

sample_data['num_words'] = sample_data['content'].apply(get_num_words)

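# and plot this column against the publication date
# (parse and sort the dates first, so the line follows time instead of row order)
sample_data['date-pub'] = pd.to_datetime(sample_data['date-pub'], errors='coerce')
sample_data.sort_values('date-pub').plot(x='date-pub', y='num_words')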
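
Individual articles vary a lot in length, so a per-year average may be easier to read than one point per article. The following is a minimal sketch on made-up data: the `example` dataframe and its values are hypothetical, and it assumes the `date-pub` column holds ISO-style date strings.

```python
import pandas as pd

# Hypothetical miniature of the dataframe: same column names, made-up values
example = pd.DataFrame({
    'date-pub': ['1901-03-02', '1900-01-15', '1900-07-30', '1901-11-11'],
    'num_words': [120, 300, 100, 80],
})

# Parse the date strings, then average the article length per publication year
example['date-pub'] = pd.to_datetime(example['date-pub'])
yearly_mean = example.groupby(example['date-pub'].dt.year)['num_words'].mean()
print(yearly_mean)
```

Calling `yearly_mean.plot()` would then draw one point per year.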

# Example analysis: NER

In this section, we will perform Named Entity Recognition on the data. This extracts any named entities from text input, and classifies their types. 

In [None]:
# function to find entities in the 'content' field
# returns only the text and label of each entity
def find_entities(content):
    doc = nlp(content)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

sample_data['entities'] = sample_data['content'].apply(find_entities)

In [None]:
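# If we inspect the data again, we will notice an 'entities' column containing all detected entities
# (a notebook cell only shows its last expression, so we display and print explicitly)
display(sample_data.head())

# We can ask what the various labels mean:
print('NORP:', spacy.explain('NORP'))
print('GPE:', spacy.explain('GPE'))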

# Let's see if some entities occur more than others
all_ents = sample_data['entities'].tolist()
flattened_ents = [ent for row_ents in all_ents for ent in row_ents]

entity_counts = Counter(flattened_ents)
entity_counts.most_common(25)
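
Because each entity is a `(text, label)` tuple, the same `Counter` approach can also count per label, or count only the entities of one type. A minimal sketch, using a hypothetical hand-written list `example_ents` in place of real spaCy output:

```python
from collections import Counter

# Hypothetical (text, label) pairs, in the same shape find_entities returns
example_ents = [
    ('London', 'GPE'), ('Victoria', 'PERSON'), ('London', 'GPE'),
    ('British', 'NORP'), ('France', 'GPE'),
]

# Count how often each label occurs, ignoring the entity text
label_counts = Counter(label for _, label in example_ents)

# Count only the entities that carry one particular label
gpe_counts = Counter(text for text, label in example_ents if label == 'GPE')

print(label_counts.most_common())
print(gpe_counts.most_common())
```

To try this on the real data, replace `example_ents` with `flattened_ents`.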

In [None]:
# Use Spacy's built-in visualizer to show NER in the text
from spacy import displacy

# We take a random sample of size 1 and select the text
text = sample_data.sample(1).iloc[0]['content']
doc = nlp(text)
sentence_spans = list(doc.sents)

displacy.render(sentence_spans, style="ent", jupyter=True)

In [None]:
# The spaCy Doc object holds more than entities, for example the dependency parse:
displacy.render(sentence_spans, style='dep', jupyter=True, options={'compact': True})

# Example analysis: N-gram visualization

In this section, we use the [nltk](https://www.nltk.org/) package to generate N-grams.
Heavily inspired by and borrowed from https://towardsdatascience.com/from-dataframe-to-n-grams-e34e29df3460
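
For intuition: an N-gram is a run of N consecutive tokens. The sliding window can be sketched in plain Python as below. `simple_ngrams` is a hypothetical helper written for illustration only, not part of nltk; the rest of this section uses `nltk.ngrams` for the same job.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as a list of tuples."""
    # zip over n copies of the token list, each shifted one position further
    return list(zip(*(tokens[i:] for i in range(n))))

example_tokens = ['long', 'live', 'the', 'queen']
example_bigrams = simple_ngrams(example_tokens, 2)
print(example_bigrams)
# [('long', 'live'), ('live', 'the'), ('the', 'queen')]
```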


In [None]:
# Add any stopwords you wish to exclude from the ngrams to this list
ADDITIONAL_STOPWORDS = []

In [None]:
def basic_clean(text):
  """
  A simple function to clean up the data. After Unicode
  normalization and removal of punctuation, every word that
  is not a stop word is lemmatized.
  """
  wnl = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  words = re.sub(r'[^\w\s]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stopwords]

# apply the cleaning function to all articles, joined into one long string
words = basic_clean(' '.join(sample_data['content'].astype(str)))
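
To see what the normalization and punctuation steps do on their own, here is a small sketch that leaves out the stop word and lemmatization steps, so it needs no nltk downloads. The example sentence is made up.

```python
import re
import unicodedata

example = "The Queen's caf\u00e9, re-opened!"

# Decompose accented characters, drop everything that is not ASCII, lowercase
ascii_text = (unicodedata.normalize('NFKD', example)
              .encode('ascii', 'ignore')
              .decode('ascii')
              .lower())

# Remove punctuation, then split on whitespace
tokens = re.sub(r'[^\w\s]', '', ascii_text).split()
print(tokens)
# ['the', 'queens', 'cafe', 'reopened']
```

Note that removing punctuation merges "Queen's" into "queens" and "re-opened" into "reopened", which is worth keeping in mind when reading the N-gram counts.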

In [None]:
# bigrams
bigrams = pd.Series(nltk.ngrams(words, 2)).value_counts()
bigrams[:10]

In [None]:
# N-grams: set N to the length you want (3 gives trigrams)
N = 3
ngrams = pd.Series(nltk.ngrams(words, N)).value_counts()
ngrams[:10]

Visualize our bigrams using [NetworkX](https://networkx.org/).
Inspired by/borrowed from https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/calculate-tweet-word-bigrams/

In [None]:
# Create dictionary of bigrams and their counts
bigrams_dict = bigrams[:50].to_dict()

# Create an empty graph
graph = networkx.Graph()

# Create connections between nodes
for k, v in bigrams_dict.items():
    graph.add_edge(k[0], k[1], weight=(v * 10))
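
Before plotting, it can help to inspect the graph itself. The sketch below builds a tiny graph the same way, from a hypothetical dictionary `example_counts` with made-up bigram counts, and looks up node degrees and an edge weight.

```python
import networkx

# Hypothetical bigram counts, in the same shape as bigrams_dict
example_counts = {
    ('queen', 'victoria'): 5,
    ('king', 'george'): 3,
    ('queen', 'mother'): 2,
}

example_graph = networkx.Graph()
for (first, second), count in example_counts.items():
    example_graph.add_edge(first, second, weight=count)

# Degree = number of distinct words a word forms a bigram with
degrees = dict(example_graph.degree())
print(degrees)
print(example_graph['queen']['victoria']['weight'])
```

Words with a high degree appear in many different frequent bigrams, and will show up as hubs in the plot.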

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))

pos = networkx.spring_layout(graph, k=2)

# Plot networks
networkx.draw_networkx(graph, pos,
                 font_size=16,
                 width=3,
                 edge_color='grey',
                 node_color='purple',
                 with_labels = False,
                 ax=ax)

# Create offset labels
for key, value in pos.items():
    x, y = value[0]+.135, value[1]+.045
    ax.text(x, y,
            s=key,
            horizontalalignment='center', fontsize=13)
    
plt.show()