# Available Variables or Insights

## Added by Text Analysis
- Topic per text
- Sentiment polarity per text
- Sentiment classification
- Entities in the text

## Added by Network Analysis
- Centrality Measures of entity significance
- Relations between entities
- Entity Community 

## Method Driven Questions
I have a method, what can it do for me?
- What are the major topics of these articles?
- What are the major topics of this band's lyrics?
- Who is mentioned in these articles?
- What are the major and minor figures of these stories?
- What figures and organisations are commonly mentioned together?
- What are the sentiments of these documents?
- What are the most significant words of a collection of texts?



# Thinking in building blocks
Generally in social science research you are encouraged to be *question first*. This means choosing your data, your methods and the frameworks you use to interpret those methods, all in service of answering the question. The point of this module is to open up the range of data sources and methods available to you, so that you have a bigger toolkit for answering a greater variety of questions.

For the purposes of this course, your methods and sources of data are limited, but even in that limited range there are many options directly taught to you and other options that can come from *combining* those methods.


## Building Blocks
Think of your project as a pipeline with different stages of processing, analysis and interpretation. What goes into the pipeline at the beginning has a direct impact on what kinds of analysis make sense, how effective that analysis will be, and what kinds of interpretations you can draw from it. Each step in the pipeline has knock-on effects on what follows. Planning and being reflective about what it is you are *actually doing* in each step is critical to good research and solid outcomes.

- The data
- The primary method of analysis
- The secondary method of analysis for feature generation and/or greater insight (optional)
- The ordering or segmenting of data for comparison and/or closer analysis



### The Data
- The data is the foundation of your project: it starts your pipeline and has a knock-on effect on every other stage. From the very start, the data you collect, the features it has, and what the collection process has included or excluded (for example, through the way you wrote your query to an API) will limit the range of possible questions and interpretations, and the applicability of the techniques you're using.

- You're broadly limited to two sources from the course: Guardian news data and lyrics data. Other options are permissible, but you need to talk to me and get my approval first. My goal is a good grade for you, so if I think you're risking that, I will deny your request.

- Start with what you're interested in. You will find analysis easier and motivation higher if you're personally interested.

- Consider the most appropriate source at your disposal to fit that interest. Generally content from The Guardian, or music lyrics are your options.



#### Choosing the *right* data

- Now for the tricky part: what do you need from those sources to best address your question?

> Example: You want to study crime, you query 'crime' from the Guardian API and get as many articles using the word 'crime' as possible. You now have articles using the word 'crime' from the news, opinion, arts, and sports pillars across the entire Guardian archive's date range. Some of those articles are about crimes, some are reviews of plays where they declare that the obscurity of the lead actor is 'an absolute crime!'. Your dataset is truly massive, one of the biggest datasets of all time. Nobody's ever seen a dataset this big before, and it's going to be beautiful.
    - What kinds of questions could you actually answer with this dataset? How?

> Example: You want to study crime. You download all the lyrics of songs that you remember talk about doing crime along with their release dates and artist names.
    - What kinds of questions could you actually answer with this dataset? How?

> Example: You want to study instances of mass shootings over time. You search the Guardian website first and read some stories that look relevant to identify terms and phrases that may help you narrow it down. You use the Guardian API to search those terms and phrases using the OR and AND operators such that you receive a relatively small but targeted dataset. You specify that material should only come from the News pillar, and you set a date range to limit it to the last twenty years.
    -  What kinds of questions could you actually answer with this dataset? How?
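
The targeted query in the last example could look something like the sketch below. It assumes the Guardian Content API `search` endpoint and its `q`, `from-date`, `to-date`, `page-size` and `api-key` parameters. The search terms, dates and the `YOUR_API_KEY` placeholder are purely illustrative, and the helper names are hypothetical. Note one swap: rather than filtering by pillar in the request, this sketch filters the downloaded records on their `pillarName` field. No request is actually sent; the code only builds the URL.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def build_query(any_of, all_of):
    """Join alternative phrases with OR, then require extra terms with AND."""
    any_part = " OR ".join(f'"{term}"' for term in any_of)
    all_part = " AND ".join(f'"{term}"' for term in all_of)
    return f"({any_part}) AND {all_part}" if all_of else f"({any_part})"


def build_search_url(any_of, all_of, from_date, to_date, api_key):
    """Build (but do not send) a search request URL."""
    params = {
        "q": build_query(any_of, all_of),
        "from-date": from_date,  # ISO dates, e.g. 2005-01-01
        "to-date": to_date,
        "page-size": 50,
        "api-key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def keep_news_pillar(results):
    """Keep only records whose pillarName is 'News' (applied after download)."""
    return [r for r in results if r.get("pillarName") == "News"]


# Illustrative terms only; in practice they come from reading relevant stories.
query = build_query(["mass shooting", "school shooting"], ["gun"])
url = build_search_url(
    ["mass shooting", "school shooting"], ["gun"],
    "2005-01-01", "2024-12-31", "YOUR_API_KEY",
)
print(query)
```

Writing the query as a function makes your inclusion and exclusion decisions explicit, which helps when you later need to justify what is and is not in your dataset.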

### Greatness in, Greatness out
- The data you start with has a major impact on every other step of your project, so it is important to really think through what you need to collect and how. Whilst getting a good sample size is important, a more carefully targeted dataset with a good rationale makes it easier to understand why you're applying specific analysis techniques and what the results tell you.



## Analysis techniques
The techniques we have learned are:
- Document summarisation using TFIDF word significance
- Document similarity using TFIDF vectors
- Document similarity using embeddings, i.e. topic modelling
- Sentiment analysis
- Entity recognition to extract names and organisations
- Network analysis for entity significance and entity communities

Some of these techniques offer a wide range of additional analysis options. For example, topic modelling provides a wide array of additional interpretations of the data around the topics identified such as topics over time, topic clustering and comparison of topic to class (such as newspaper section). 

Some are more direct and specific, offering you one form of insight such as sentiment scores or most representative words. 

The network techniques, as they rely on there being a relation between things, move away from the 'document' as your main unit of analysis and focus instead on relations within documents (though we'll see there are ways to map those relational insights back to 'per document').

The choice of technique should balance your own confidence and comfort with the technique, the kind of data you are using, and the kinds of insights you want to get out of it.

### Single Dataset Style Questions 
The methods we've learned suggest certain types of questions:
- What are the major topics of these articles?
- What are the major topics of this band's lyrics?
- Who is mentioned in these articles?
- What are the major and minor figures of these stories?
- What figures and organisations are commonly mentioned together?
- What are the sentiments of these documents?
- What are the most significant words of these texts?

These can be reasonable questions, but there may be more insight to be gained by splitting, ordering or grouping your data such that you can make a comparison, or see trends. For example:

### Segmented or Ordered Dataset Style Questions
- What are the major topics of these articles over time?
- What is the average sentiment of a band's lyrics, over time?
- Who are the major and minor figures in these stories, by section?
- What are the most significant words of these different groups of texts?
- Which of these topics have the highest/lowest word count?
- Which entities get the highest wordcount?
- Which tags correlate with which topics?

> Variables for segmenting and ordering available from the source

> #### From the Guardian API
> - Pillar and Section (Generally exclusive)
> - Type of publication
> - Publication Date
> - Article Tags (Multiple)
> - Byline (Author)

> #### From the Genius API
> - Track and album release dates
> - Track and album titles
> - Track and album artist names

Rather than treating your dataset as a single lump, finding a way to make a comparison or identify change gives you more interesting results to interpret, and also opens up different, more nuanced kinds of questions.

A project based around these kinds of questions, or similar, will do well.
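
As a rough illustration of segmenting and ordering, the sketch below answers "what is the average sentiment of a band's lyrics, over time?" on a toy table. It assumes a pandas DataFrame with `artist`, `release_date` and `sentiment` columns; those column names and all the values are invented for the example.

```python
import pandas as pd

# Toy stand-in for a lyrics dataset; every value here is invented.
songs = pd.DataFrame({
    "artist": ["A", "A", "A", "B", "B"],
    "release_date": ["1999-05-01", "1999-11-20", "2004-03-15",
                     "2004-07-02", "2010-01-30"],
    "sentiment": [0.4, 0.2, -0.1, 0.5, -0.3],
})

# Order: derive a year from the release date so tracks can be placed in time.
songs["year"] = pd.to_datetime(songs["release_date"]).dt.year

# Segment: average sentiment per artist per year.
sentiment_over_time = (
    songs.groupby(["artist", "year"])["sentiment"]
    .mean()
    .reset_index()
    .sort_values(["artist", "year"])
)
print(sentiment_over_time)
```

The same `groupby` pattern works with any of the source variables listed above, for example section or byline instead of year.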

# Mixing Methods
One of the key benefits of computational social science, in my view, is the flexibility to build methods and tools specific to what you want to achieve. If you rely on pre-built software packages for the social sciences, your range of possible actions with the data and your choice of analysis techniques are often limited by the expectations and design limitations of that software.

## Creating your own features
- When you perform analysis, that is not necessarily the end-point of the pipeline. Doing analysis generates outputs which themselves can be put back into your dataset for use in another analysis technique.

- For example, topic modelling generates a topic assignment for each document.
- Network community analysis assigns a community to key entities within your documents, and so each document can have multiple 'entity communities'.
- Sentiment analysis can give a document a sentiment score or classification, or if broken down into paragraphs, a range of sentiments/classifications.
- Groups of documents can be given a list of significant words representing that group. A group could be based on topic (which is how topic modelling works) but also a time period, filtered by mention of a specific entity, grouped by mention of an entity community etc.

## Levels of Analysis
One thing to keep in mind when thinking about your project is that data can be examined at different 'levels'. Each level may tell you different things, and the claims you make may apply to different levels. A simple way to differentiate the levels would be:

- Whole dataset
- Subsets of the dataset
- Each item of the dataset
- A component part of the item

For data from the Guardian API this would be
- Whole corpus of documents
- Collections of documents based on some grouping or segmenting
- Each individual article
- Paragraphs within articles

### Whole Dataset
For example, if you were using topic modelling you may run it across the whole corpus and report on the different topics available across the whole dataset - how many articles per topic, what each topic is about.

### Subset
You may also examine whether certain topics occur more often with different subsets, such as time periods of reporting, or sections, or all documents mentioning specific entities.

### Individual articles
You may qualitatively examine representative articles from each topic, and use the topics to help you sample a smaller number of articles. This then allows you to provide a more in-depth analysis than the topic summaries which tend to be descriptive rather than explanatory.

### Paragraphs
You could choose to split each article into paragraphs (like we did for entity detection) before running topic modelling. This would then mean each document was a paragraph and may provide more nuanced topics and encourage a much closer reading when examining representative documents, comparing with entity presence etc.

### Which level?
- Topic modelling the whole corpus will tell you the broad stories being discussed, but not whether there is a pattern in the story publication, or whether there is something of interest in the language used.
- A more granular, focussed (lower) level provides opportunities for finding patterns and for qualitative interpretation, but may also need more thought behind the research design.

> #### Example: The Cypherpunks
> In a project with Dr. Amy Stevens (Sheffield), we applied network analysis, topic modelling and qualitative content analysis to explore the discussions of an online group called 'The Cypherpunks' across a ten year period.

> Using their archived mailing list discussions (think group email chains) as our data we...
> 1. Used network representation to rebuild the archives into a network of related messages.
> 2. We used this information to identify key periods of activity and select a specific period of high engagement before a lull of a few years.
> 3. We then used the network representation to identify the most central members of the community and all messages that were part of discussions.
> 4. We applied topic modelling to the discussions and generated our topics.
> 5. Finally we qualitatively examined all discussions that were considered most representative of a topic AND which were started by high centrality members of the community.
> 6. This allowed us to systematically sample the data so that we reduced the possible data to consider for qualitative analysis from ~190,000 messages from hundreds of users down to just the discussions started by the top 44 that best represent the topics.
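
The sampling logic in steps 5 and 6 can be sketched in a few lines. This is not the project's actual code: the member names, centrality scores and `representative` flags below are all invented, and the structure of the records is an assumption made for illustration.

```python
# Toy stand-ins: every name, score and flag below is invented.
centrality = {"alice": 0.9, "bob": 0.7, "carol": 0.2, "dave": 0.1}
discussions = [
    {"id": 1, "starter": "alice", "topic": 0, "representative": True},
    {"id": 2, "starter": "carol", "topic": 0, "representative": True},
    {"id": 3, "starter": "bob", "topic": 1, "representative": False},
    {"id": 4, "starter": "bob", "topic": 1, "representative": True},
]

TOP_N = 2  # the project kept the top 44 members; 2 suits the toy data

# Rank members by centrality and keep the TOP_N highest.
top_members = set(sorted(centrality, key=centrality.get, reverse=True)[:TOP_N])

# Keep discussions that are representative of a topic AND
# were started by a high-centrality member.
sample = [
    d for d in discussions
    if d["representative"] and d["starter"] in top_members
]
sample_ids = [d["id"] for d in sample]
print(sample_ids)
```

Combining two independent criteria like this gives you a sample that is small enough to read closely while still having a systematic rationale behind it.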




# Example Code Snippets
How do you mix some of these different elements? Some short code snippets below show you how.

## 1. Working with Paragraphs

We did a little of this when we generated entities per paragraph in the last networks session. However, if you want to work at the paragraph level in other ways, you'll need to do a little prep.

In [None]:
# if you want to save time in the future, save this version of the dataset and load it in instead of the usual article per line version.
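
A minimal sketch of that prep, assuming an article-per-row pandas DataFrame whose `body` column separates paragraphs with blank lines. The column names, the separator and the filename in the final comment are assumptions, so adjust them to match your own data.

```python
import pandas as pd

# Toy stand-in for the article-per-row dataset; column names are assumptions.
articles = pd.DataFrame({
    "id": ["a1", "a2"],
    "sectionName": ["UK news", "Politics"],
    "body": ["First paragraph.\n\nSecond paragraph.", "Only paragraph.\n\n"],
})

# Split each body on blank lines, then give every paragraph its own row.
paragraphs = (
    articles.assign(paragraph=articles["body"].str.split("\n\n"))
    .explode("paragraph")
    .drop(columns="body")
    .reset_index(drop=True)
)

# Tidy up: strip whitespace and drop empty paragraphs.
paragraphs["paragraph"] = paragraphs["paragraph"].str.strip()
paragraphs = paragraphs[paragraphs["paragraph"] != ""].reset_index(drop=True)

# Number paragraphs within each article so they can be traced back later.
paragraphs["paragraph_number"] = paragraphs.groupby("id").cumcount()

# To save time later (hypothetical filename):
# paragraphs.to_pickle("paragraphs.pkl")
print(paragraphs)
```

Each paragraph row keeps its article's `id` and `sectionName`, so anything you compute per paragraph can still be grouped back up to the article or section level.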

## 2. Topic Modelling Paragraphs

In [None]:
# save your model

# add topics to your paragraph data


In [None]:
# Topics per section of the article they came from
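
The two cells above might be filled in along these lines. To keep the sketch runnable without fitting a model, the `topics` list is an invented stand-in for the model output, and the BERTopic calls appear only in comments. The `sectionName` and `paragraph` column names are assumptions.

```python
import pandas as pd

# Toy paragraph-level data; in practice this is your paragraph dataset.
paragraphs = pd.DataFrame({
    "sectionName": ["UK news", "UK news", "Politics", "Politics", "Politics"],
    "paragraph": ["p1", "p2", "p3", "p4", "p5"],
})

# Stand-in for the model output. With BERTopic this would come from
#   topics, probs = topic_model.fit_transform(paragraphs["paragraph"].tolist())
# and the fitted model could be kept with topic_model.save(...).
topics = [0, 1, 1, 1, -1]  # invented values; -1 is BERTopic's outlier topic

# Add topics to the paragraph data (one topic per row, in the same order).
paragraphs["topic"] = topics

# Topics per section: rows are sections, columns are topics,
# and each cell counts the paragraphs in that combination.
topics_per_section = pd.crosstab(paragraphs["sectionName"], paragraphs["topic"])
print(topics_per_section)
```

This only works if the rows of the DataFrame are in the same order as the texts you passed to the model, so avoid sorting or filtering between the two steps.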


## 2. Mixing Topics and Entities
Which entities are most common per-topic?

In [None]:
# This is the same process we used in the second networks session

# Save your entities to your paragraph data

# just showing a few of the relevant columns
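
One way to answer "which entities are most common per topic?" is sketched below. It assumes the paragraph data already has a `topic` column and an `entities` column holding a list of entity names per paragraph; both column names and all the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in: each paragraph has a topic and a list of detected entities.
paragraphs = pd.DataFrame({
    "topic": [0, 0, 1, 1],
    "entities": [["Alice", "Acme"], ["Alice"], ["Bob", "Acme"], []],
})

# One row per entity mention; paragraphs with no entities are dropped.
entity_rows = paragraphs.explode("entities").dropna(subset=["entities"])

# Count mentions of each entity within each topic.
entity_counts = (
    entity_rows.groupby(["topic", "entities"])
    .size()
    .reset_index(name="mentions")
    .sort_values(["topic", "mentions"], ascending=[True, False])
)

# Keep the three most mentioned entities for every topic.
top_entities = entity_counts.groupby("topic").head(3)
print(top_entities)
```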


## 3. Ordering by network measure
Which people are the most 'important' in the network for each topic? First we turn our entities data into an adjacency table, then use this to build a network. We filter a little and then use it to calculate our chosen measurement of importance (in the example we use eigenvector centrality).

We then use the scores to assign the appropriate centrality score to each instance of an entity.


In [None]:
# Represent our entities as a network

# Remove low weight edges

# Remove low degree nodes (this will also clean up any nodes that were disconnected by the edge filter above)

# Get the giant component

# Calculate whole network scores with chosen metric

# Map those scores back to the dataset
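
The steps in the cell above can be sketched with `networkx` as follows. The entity lists are invented, and the thresholds (`MIN_WEIGHT`, `MIN_DEGREE`) are arbitrary values chosen to suit the toy data rather than recommendations.

```python
from collections import Counter
from itertools import combinations

import networkx as nx

# Toy stand-in: the entities found in each paragraph (names invented).
paragraph_entities = [
    ["Alice", "Bob", "Carol"],
    ["Alice", "Bob"],
    ["Alice", "Carol"],
    ["Bob", "Carol"],
    ["Alice", "Bob", "Carol"],
    ["Carol", "Dave"],
    ["Erin", "Frank"],
]

# Represent the entities as a network: count how often each pair co-occurs.
pair_counts = Counter()
for entities in paragraph_entities:
    for a, b in combinations(sorted(set(entities)), 2):
        pair_counts[(a, b)] += 1

graph = nx.Graph()
for (a, b), weight in pair_counts.items():
    graph.add_edge(a, b, weight=weight)

# Remove low weight edges.
MIN_WEIGHT = 2
graph.remove_edges_from(
    [(a, b) for a, b, w in graph.edges(data="weight") if w < MIN_WEIGHT]
)

# Remove low degree nodes, including any left isolated by the edge filter.
MIN_DEGREE = 1
graph.remove_nodes_from(
    [n for n, d in dict(graph.degree()).items() if d < MIN_DEGREE]
)

# Get the giant component.
giant = graph.subgraph(max(nx.connected_components(graph), key=len)).copy()

# Calculate whole network scores with the chosen metric.
centrality = nx.eigenvector_centrality(giant, weight="weight", max_iter=1000)

# Map scores back to each instance of an entity; filtered-out entities get None.
scored = [
    [(entity, centrality.get(entity)) for entity in entities]
    for entities in paragraph_entities
]
print(scored[5])
```

Entities that were filtered out of the network have no score, so decide deliberately whether to treat them as missing or drop them before ordering your data.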


## 4. TFIDF Keywords by Subset
Generate a set of keywords representing texts from any subset of the data.
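
A minimal, hand-rolled sketch of the idea is shown below, treating each subset as one long document and scoring terms by TF-IDF across subsets. A library vectoriser such as scikit-learn's `TfidfVectorizer` would normally do this work and uses a smoothed formula, so scores will differ. The subsets, texts and function names here are invented for illustration.

```python
import math
import re
from collections import Counter

# Toy subsets (could be sections, time periods or topics); texts invented.
subsets = {
    "sport": ["The team won the match", "The match ended in a draw"],
    "politics": ["The minister won the vote", "The vote was close"],
}


def tokenise(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


# Treat each subset as one long document: term counts per subset.
term_counts = {
    name: Counter(token for text in texts for token in tokenise(text))
    for name, texts in subsets.items()
}

# Document frequency: in how many subsets does each term appear?
doc_freq = Counter(term for counts in term_counts.values() for term in counts)
n_subsets = len(term_counts)


def top_keywords(name, k=3):
    """Return the k highest-scoring TF-IDF terms for one subset."""
    counts = term_counts[name]
    total = sum(counts.values())
    scores = {
        term: (count / total) * math.log(n_subsets / doc_freq[term])
        for term, count in counts.items()
    }
    # Sort by score (descending), breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


print(top_keywords("sport"))
print(top_keywords("politics"))
```

With this unsmoothed formula, any word that appears in every subset scores zero, which is why common words like "the" fall out without a stop-word list.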

# Getting more representative documents from your topic model

If you want to take a more qualitative approach to your documents after assigning them to topics, you might want more than just the few representative documents that BERTopic provides.

Whilst it is a common request to the developer, there is not yet a direct way to simply ask for more documents. Below is a function that does the job.
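
One way such a function could look is sketched here. It assumes you kept the `topics` and `probs` returned by `fit_transform`, and that `probs` holds one probability per document for its assigned topic (the default behaviour when `calculate_probabilities` is not switched on). It ranks documents by that probability, which may differ from how BERTopic chooses its own representative documents. The function name and the toy inputs are hypothetical.

```python
import pandas as pd


def most_representative(docs, topics, probs, n=10):
    """Return the n highest-probability documents for each topic."""
    table = pd.DataFrame(
        {"document": docs, "topic": topics, "probability": probs}
    )
    # Drop outliers, which BERTopic labels as topic -1.
    table = table[table["topic"] != -1]
    return (
        table.sort_values(["topic", "probability"], ascending=[True, False])
        .groupby("topic")
        .head(n)
        .reset_index(drop=True)
    )


# Toy inputs standing in for real model output; all values invented.
docs = ["d0", "d1", "d2", "d3", "d4"]
topics = [0, 0, 0, 1, -1]
probs = [0.6, 0.9, 0.8, 0.7, 0.0]

top_two = most_representative(docs, topics, probs, n=2)
print(top_two)
```

Raising `n` gives you a larger pool to sample from for close reading, at the cost of including documents that fit their topic less well.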