# Intro to Coding, Python and Text-as-Data

## Welcome!

Let's start with a very philosophical question: **why are we here?**
![philosophical_hamster](https://media.giphy.com/media/UZStIMu2MTx3W/giphy.gif)

Basically, for two reasons:
- to learn the basics of **coding with Python**
- to harness these skills for interrogating **digital texts** (at scale)

... all this within the context of **humanities research**.

We can break this very existential question down into three that we can address within the scope of the summer school:

- Why code?
- Why Python?
- Why quantify?



# Why Code?

So, what's new in 2023?
![](https://www.theinsaneapp.com/wp-content/uploads/2023/05/chatgpt-meme-tree.webp)



### In the era of Large Language Models and ChatGPT, why should I learn to code?
- Easy to generate code based on a 'prompt' (an instruction formulated in natural language).

Yes, but:
- When doing computational research, one must learn to think computationally.
- We need to strike a fine balance between the researcher's agency and automation.
- Code remains the most unambiguous way to give computers instructions (it's probably here to stay, for a while)

However, 
-  **reading** and **adapting** code will become more important.
    - most of the exercises in this course are built on snippets you can just copy-paste (i.e. adapt)
    - ... but this still requires some basic understanding of the syntax and principles
- We encourage you to use tools such as ChatGPT to obtain a better understanding of code, to explain certain concepts in specific contexts, or to debug a piece of code.

![](https://i.imgflip.com/76eq2p.jpg)

Yes, we will talk about ChatGPT etc., but be sceptical of the hype:
- Please keep in mind that technological change does not automatically improve humanities research
- We code for research purposes, and need to keep in mind the community of researchers we want to engage and convince (simpler is often better)




## Why write programs for research?

- Not just labour-saving: scripted research can be **tested and reproduced**!
    - Others can confirm that you were right ;-)
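
To make "tested" concrete, here is a minimal sketch. The function name `count_words` and the example sentences are invented for illustration; they are not part of the course materials.

```python
def count_words(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())

# A tiny test: if the function ever changes behaviour, this line fails loudly.
assert count_words("to be or not to be") == 6

print(count_words("It was the best of times"))
```

Anyone who reruns this cell gets the same result, which is the basic idea behind reproducible, scripted research.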


# Why Python?

![](images/python.jpg)

- A "high-level" programming language: it hides low-level details such as memory management, which makes it EASY (relatively speaking haha!)
- More readable (indentation, brevity)
- Very popular in NLP, data science and machine learning, with lots of useful, thoroughly documented libraries (i.e. tools)
- Python is free, so you never have a problem getting hold of it



# Why Quantify (Texts): NLP and Text-as-Data

### Text-as-Data

“The use of texts and language to make inferences about human behavior” (Grimmer et al., 2022) Read the [book](https://press.princeton.edu/books/paperback/9780691207551/text-as-data)!

- Converting text to numbers to study society.
- Texts often become a proxy for something else, e.g. to measure racial prejudice or emotion. These proxies are then used to study social, economic and cultural dynamics.
- **Construct validity** and validation
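
As a rough illustration of "converting text to numbers", the sketch below counts words in two invented sentences. The sentences, the naive whitespace tokeniser and the helper `relative_frequency` are assumptions made for this example only; real projects usually need more careful tokenisation.

```python
from collections import Counter

# Two invented sentences standing in for a corpus (illustrative only).
documents = [
    "the machine broke the loom",
    "the workers repaired the machine",
]

# Lowercase and split on whitespace: a deliberately naive tokeniser.
counts = [Counter(doc.lower().split()) for doc in documents]


def relative_frequency(word, counter):
    """Share of all tokens in a document that are `word`."""
    total = sum(counter.values())
    return counter[word] / total if total else 0.0


for i, counter in enumerate(counts):
    print(i, counter["machine"], round(relative_frequency("machine", counter), 2))
```

Whether such a count is a good proxy for the concept you care about is exactly the validity question raised above.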



### Natural Language Processing

The development and application of **computational techniques** for the analysis of human language. Often used within the context of text-as-data.

<img src="images/nlp.png" width="800" />

Is it different to Computational Linguistics? Computational linguistics is a social science with the goal of understanding language through computational models.


### How can I use computational techniques in my Humanities Research?

Well, that is what we will discuss this week!

But it is good to know that the computational analysis of text in the humanities remains contested and often rightly so. **Always question your method!**

For a more recent provocation see [this article](https://www.journals.uchicago.edu/doi/abs/10.1086/702594?journalCode=ci): 

> Da, Nan Z. "The computational case against computational literary studies." Critical inquiry 45, no. 3 (2019): 601-639.



### Benefits and drawbacks of quantifying language use

**Benefits**
- Go beyond anecdotal evidence and surface **patterns** in data (systemic differences). See for example "World of Patterns" by [Rens Bod](https://muse.jhu.edu/book/98273)
- Get more precise estimates of **typicality** ("how common") and **scope** ("who")
    - Provide context for qualitative analysis
    - Spot outliers
- Reproducible

**Drawbacks and other issues**
- We **lose context** and tend to project our preconceived ideas onto abstract results
- Computational Humanities is not more objective than traditional qualitative research
- Code can contain (semantic) errors (it runs but does not do what we intend) and other bugs
- How to build a narrative with numbers?
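
A semantic error can be sketched with a small invented example: the code below runs without complaint, but the first count does not measure what was intended. The sentence is made up for illustration.

```python
text = "The other day the author gathered them there"

# Intended: count how often the word "the" occurs.
# This runs without error, but str.count matches substrings,
# so it also counts the "the" inside "other", "gathered", "them" and "there".
naive_count = text.lower().count("the")

# Closer to the intention: split into tokens and compare whole words.
tokens = text.lower().split()
word_count = tokens.count("the")

print(naive_count, word_count)
```

Python raises no error in either case, so only a check against the text itself reveals the difference.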

What do we want to automate, and what not?

# Our working environment for the next few days




### Running Code is more complicated than Displaying Code!
[GitHub](https://github.com/) is a great service for sharing code, but the contents are static.

It would be great if you could run a GitHub repository without installing complicated requirements, for example directly in your browser.

However, to run code, you need:

1. Hardware on which to run the code
2. Software, including:
    - The code itself
    - The programming language (e.g. Python, R, Julia, and so on)
    - Relevant packages (e.g. pandas, matplotlib, tidyverse, ggplot)
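
To see what the software side looks like in practice, the sketch below asks Python which version is running and whether some packages are available. The package names are simply the Python examples from the list above; whether they are installed depends on your environment.

```python
import sys
from importlib.util import find_spec

# The programming language: which Python interpreter is running this code?
print(sys.version)

# Relevant packages: the names below are just the examples from the list above.
for package in ["pandas", "matplotlib"]:
    status = "installed" if find_spec(package) is not None else "missing"
    print(package, status)
```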


## Binder

![](images/mybinder.png)
[Binder](https://mybinder.org/) is a service that provides your code and the hardware and software to execute it.

You can create a **link** to a live, interactive version of your code (like this one)!

## Local Installation

Oh this will be fun, but later!

![](https://media.giphy.com/media/5UJyoBqHtHmU0/giphy-downsized-large.gif)

Seriously, if you are already familiar with creating Python environments, installing libraries and Jupyter Notebooks you can clone [our GitHub repo](https://github.com/Living-with-machines/dhoxss-text2tech) and use the notebooks on your machine.

## The Jupyter Notebook

![](images/jupyter.png)

The easiest way to get started using Python, and one of the best for research data work, is the Jupyter Notebook.

In the notebook (like this), you can easily mix code with discussion and commentary, and mix code with the results of that code; including graphs and other data visualisations.

Jupyter notebooks consist of discussion cells, referred to as “markdown cells”, and “code cells”, which contain Python. This document has been created using Jupyter Notebook, and this very cell is a Markdown cell. To learn how to write in Markdown, follow [this link](https://daringfireball.net/projects/markdown/).

And now, let's start with our first Code cell!

# Fin.