# Intro to Coding, Python and Text-as-Data

## Welcome!


![](https://media.giphy.com/media/UZStIMu2MTx3W/giphy.gif)

Let's start with a very philosophical question: **why are we here?**

Basically for two reasons (plus one bit of context):
- to learn the basics of coding with Python
- to harness these skills for interrogating digital texts (at scale)
- ... all this within the context of humanities research

We can convert this very existential question into three we can address within the scope of the summer school:

- why code?
- why Python?
- why quantify?



# Why Code?

### In the era of Large Language Models and ChatGPT, why should I learn to code?
- Easy to generate code, based on a 'prompt' (an instruction formulated in natural language).
- When doing computational research, it is valuable to think computationally. 
- Need to strike a fine balance between researcher's agency and automation 
- Code remains the most unambiguous way to give computers instructions (it's probably here to stay, for a while)

- ... However, we feel **reading** and **adapting** code will become more important.
    - most of the exercises in this course are built on snippets you can just copy-paste (i.e. adapt)
    - ... but this still requires some basic understanding of the syntax and principles
- Automatic translation doesn't stop us from learning new languages, so why should ChatGPT prevent us from coding?


## Why write programs for research?

- Not just labour-saving: scripted research can be tested and reproduced!

## Sensible Input - Reasonable Output
Programs are a rigorous way of describing data analysis for other researchers, as well as for computers.

Computational research suffers from people assuming each other’s data manipulation is correct. Sharing code, which is much easier for a non-author to understand than a spreadsheet, lets others check that manipulation instead of taking it on trust.
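As a minimal sketch of what "tested and reproduced" can mean in practice, the snippet below wraps one data-manipulation step in a function and checks it against an answer worked out by hand. The function name `mean_word_length` and the example sentence are made up for illustration; they are not part of the course materials.

```python
def mean_word_length(text):
    """Return the average number of characters per whitespace-separated word."""
    words = text.split()
    if not words:
        # Avoid dividing by zero for empty input
        return 0.0
    return sum(len(word) for word in words) / len(words)


# A tiny test that anyone re-running the script can repeat:
# "to be or not to be" has 6 words and 13 letters in total.
result = mean_word_length("to be or not to be")
assert abs(result - 13 / 6) < 1e-9
print("Test passed, mean word length:", round(result, 2))
```

If someone later changes the function, the `assert` line will complain as soon as the result no longer matches the hand-computed value.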

# Why Python?

![](images/python.jpg)

- A "high-level" programming language: it hides most of the machine-level detail, which makes it EASY (relatively speaking haha!)
- More readable (indentation, brevity)
- Very popular in NLP, data science and machine learning, and has lots of useful libraries (i.e. tools) thoroughly documented 
- Python is free, so you’ll never have a problem getting hold of it, wherever you go.



# Why Quantify (Texts): NLP and Text-as-Data

### Text-as-Data

- Converting text to numbers to say something about society.
- Texts become proxies for something else, e.g. to measure racial prejudice or emotion.
- Construct validity and validation

“The use of texts and language to make inferences about human behavior” (Grimmer et al., 2022)
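To make "converting text to numbers" a little more concrete, here is a minimal sketch that turns a short string into word counts and relative frequencies using only the Python standard library. The example sentence and the very rough tokenisation rule (lowercase runs of the letters a to z) are assumptions chosen for illustration, not a recommended method.

```python
import re
from collections import Counter

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# Very rough tokenisation: lowercase the text and keep runs of letters
tokens = re.findall(r"[a-z]+", text.lower())

# Absolute counts: how often does each word occur?
counts = Counter(tokens)

# Relative frequencies: counts divided by the total number of tokens
total = len(tokens)
relative = {word: n / total for word, n in counts.items()}

print(counts.most_common(2))
print(relative["the"])
```

Whether a number like the relative frequency of a word is a valid proxy for anything about society is exactly the construct-validity question raised above.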

### Natural Language Processing?

The development and application of computational techniques for the analysis of human language. Often used within the context of text-as-data.

<img src="images/nlp.png" width="800" />

Is it different to Computational Linguistics? Computational linguistics is a social science with the goal of understanding language through computational models.



### How can I use computational techniques in my Humanities Research?

Well, that is what we discuss this week. 

But it is good to know that the computational analysis of text in the humanities remains contested, and often rightly so. **Always question your method!**

For a more recent provocation see [this article](https://www.journals.uchicago.edu/doi/abs/10.1086/702594?journalCode=ci): 

> Da, Nan Z. "The computational case against computational literary studies." Critical inquiry 45, no. 3 (2019): 601-639.

### Benefits and drawbacks of quantifying language use

**Benefits**
- Go beyond anecdotal evidence and surface patterns, systemic observations
- Get more precise estimates of typicality ("how common") and scope ("who")
    - Provide context for qualitative analysis
    - Spot outliers
- Serendipity

**Drawbacks**
- We tend to project a 

# Our working environment for the next days




### Running Code is more complicated than Displaying Code!
[GitHub](https://github.com/) is a great service for sharing code, but the contents are static.

It would be great if you could run a GitHub repository without installing complicated requirements, for example directly in your browser.

However, to run code, you need:

1. Hardware on which to run the code
2. Software, including:
    - The code itself
    - The programming language (e.g. Python, R, Julia, and so on)
    - Relevant packages (e.g. pandas, matplotlib, tidyverse, ggplot)
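If you want to see which of these pieces are already present on a machine, a small check along the following lines can help. This is only a sketch: the package names are taken from the list above as examples, and the course may need others.

```python
import importlib.util
import sys

# Which version of the programming language is running this code?
print("Python version:", sys.version.split()[0])

# Example package names (an assumption; adjust to what you actually need)
packages = ["pandas", "matplotlib"]

# find_spec returns None when a top-level package cannot be found
installed = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, is_installed in installed.items():
    print(name, "is installed" if is_installed else "is missing")
```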

If you are already familiar with creating Python environments, installing libraries and Jupyter Notebooks, you can clone [our GitHub repo](https://github.com/Living-with-machines/dhoxss-text2tech) and use the notebooks on your machine.

## Binder

![](images/mybinder.png)

[Binder](https://mybinder.org/) is a service that provides your code and the hardware and software to execute it.

You can create a **link** to a live, interactive version of your code (like this one)!

## Local Installation

## The Jupyter Notebook

![](images/jupyter.png)

The easiest way to get started using Python, and one of the best for research data work, is the Jupyter Notebook.

In the notebook (like this), you can easily mix code with discussion and commentary, and mix code with the results of that code; including graphs and other data visualisations.

Jupyter notebooks consist of discussion cells, referred to as “markdown cells”, and “code cells”, which contain Python. This document was created as a Jupyter notebook, and this very cell is a Markdown cell. To learn how to write in Markdown, follow [this link](https://daringfireball.net/projects/markdown/).

And now, let's start with our first Code cell!
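A first code cell could look something like the sketch below. The variable name and the message are made up for illustration; any short piece of Python would do.

```python
# In Jupyter, run a code cell with Shift+Enter; the output appears beneath it
message = "Hello, text-as-data!"

# Show the text itself
print(message)

# Show a number derived from the text: its length in characters
print(len(message))
```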

# Fin.