![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Being a Computational Social Scientist

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and simulations (agent-based modelling), to name a few. We provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Table-of-Contents" data-toc-modified-id="Table-of-Contents-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Table of Contents</a></span></li></ul></li><li><span><a href="#Human-Thinking-and-Human-Problems" data-toc-modified-id="Human-Thinking-and-Human-Problems-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Human Thinking and Human Problems</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Basic-Problems" data-toc-modified-id="Basic-Problems-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Basic Problems</a></span></li><li><span><a href="#Less-Basic-Problems" data-toc-modified-id="Less-Basic-Problems-1.0.2"><span class="toc-item-num">1.0.2&nbsp;&nbsp;</span>Less Basic Problems</a></span></li><li><span><a href="#Problem-Work-arounds" data-toc-modified-id="Problem-Work-arounds-1.0.3"><span class="toc-item-num">1.0.3&nbsp;&nbsp;</span>Problem Work-arounds</a></span></li></ul></li></ul></li><li><span><a href="#Computer-Thinking-and-..." 
data-toc-modified-id="Computer-Thinking-and-...-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Computer Thinking and ...</a></span></li><li><span><a href="#Why-Different-Thinking-Matters" data-toc-modified-id="Why-Different-Thinking-Matters-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Why Different Thinking Matters</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Some-problems-need-human-and-computer-thinking" data-toc-modified-id="Some-problems-need-human-and-computer-thinking-3.0.1"><span class="toc-item-num">3.0.1&nbsp;&nbsp;</span>Some problems need human and computer thinking</a></span></li><li><span><a href="#Computational-Social-Science-is-an-Opportunity" data-toc-modified-id="Computational-Social-Science-is-an-Opportunity-3.0.2"><span class="toc-item-num">3.0.2&nbsp;&nbsp;</span>Computational Social Science is an Opportunity</a></span></li></ul></li></ul></li><li><span><a href="#Knowing-your-computational-environment" data-toc-modified-id="Knowing-your-computational-environment-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Knowing your computational environment</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#File-system-and-working-directory" data-toc-modified-id="File-system-and-working-directory-4.0.1"><span class="toc-item-num">4.0.1&nbsp;&nbsp;</span>File system and working directory</a></span></li><li><span><a href="#Environments" data-toc-modified-id="Environments-4.0.2"><span class="toc-item-num">4.0.2&nbsp;&nbsp;</span>Environments</a></span></li><li><span><a href="#Capturing-a-computational-environment" data-toc-modified-id="Capturing-a-computational-environment-4.0.3"><span class="toc-item-num">4.0.3&nbsp;&nbsp;</span>Capturing a computational environment</a></span></li></ul></li></ul></li><li><span><a href="#Acquiring,-understanding-and-manipulating-unstructured/unfamiliar-data" data-toc-modified-id="Acquiring,-understanding-and-manipulating-unstructured/unfamiliar-data-5"><span 
class="toc-item-num">5&nbsp;&nbsp;</span>Acquiring, understanding and manipulating unstructured/unfamiliar data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Acquiring-data" data-toc-modified-id="Acquiring-data-5.0.1"><span class="toc-item-num">5.0.1&nbsp;&nbsp;</span>Acquiring data</a></span></li><li><span><a href="#Data-types" data-toc-modified-id="Data-types-5.0.2"><span class="toc-item-num">5.0.2&nbsp;&nbsp;</span>Data types</a></span><ul class="toc-item"><li><span><a href="#Numbers" data-toc-modified-id="Numbers-5.0.2.1"><span class="toc-item-num">5.0.2.1&nbsp;&nbsp;</span>Numbers</a></span></li><li><span><a href="#Strings" data-toc-modified-id="Strings-5.0.2.2"><span class="toc-item-num">5.0.2.2&nbsp;&nbsp;</span>Strings</a></span></li><li><span><a href="#Boolean" data-toc-modified-id="Boolean-5.0.2.3"><span class="toc-item-num">5.0.2.3&nbsp;&nbsp;</span>Boolean</a></span></li></ul></li><li><span><a href="#Data-structures" data-toc-modified-id="Data-structures-5.0.3"><span class="toc-item-num">5.0.3&nbsp;&nbsp;</span>Data structures</a></span><ul class="toc-item"><li><span><a href="#Data-frame" data-toc-modified-id="Data-frame-5.0.3.1"><span class="toc-item-num">5.0.3.1&nbsp;&nbsp;</span>Data frame</a></span></li><li><span><a href="#Dictionaries" data-toc-modified-id="Dictionaries-5.0.3.2"><span class="toc-item-num">5.0.3.2&nbsp;&nbsp;</span>Dictionaries</a></span></li><li><span><a href="#XML" data-toc-modified-id="XML-5.0.3.3"><span class="toc-item-num">5.0.3.3&nbsp;&nbsp;</span>XML</a></span></li><li><span><a href="#Graphs" data-toc-modified-id="Graphs-5.0.3.4"><span class="toc-item-num">5.0.3.4&nbsp;&nbsp;</span>Graphs</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Writing-code" data-toc-modified-id="Writing-code-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Writing code</a></span></li><li><span><a href="#Documenting-and-enhancing-your-workflow" 
data-toc-modified-id="Documenting-and-enhancing-your-workflow-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Documenting and enhancing your workflow</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Bibliography" data-toc-modified-id="Bibliography-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Bibliography</a></span></li></ul></div>

### Table of Contents

A table of contents is provided here at the top of the notebook, but you can also open this menu at any point by clicking the Table of Contents button on the top toolbar (the icon with four horizontal bars; if unsure, hover your mouse over the buttons). 

# Introduction

Computational Social Science: it can be a scary, alluring, mystifying term. You may even be thinking, what's the big deal? Surely almost all social science involves the use of computers: we code our interviews using software such as NVivo; build our statistical models in SPSS, Stata etc; and generally conduct our research and teaching activities within a computational environment (e.g., personal desktop/laptop, Dropbox or iCloud for file storage). However, computational social science (CSS) refers to activities and technologies that go beyond what we're typically familiar with as social scientists:
* The use of datasets that are too large to store on your personal machine; 
* Writing programming scripts to access information held in online databases; 
* Employing analytical techniques, derived from computer science, that reveal structures and patterns in large or unfamiliar datasets (e.g. network analysis, text mining). 

More formally, CSS is an interdisciplinary branch of research, defined more by its methods and data than its substantive topics (Heiberger & Riebling, 2016). CSS is not limited to certain analytical approaches (e.g., Machine Learning) or data types (e.g., text data). So what makes CSS different from "traditional" social science? Which techniques are not well suited to familiar statistical software packages such as SAS or SPSS? And why would you want to be a computational social scientist?

# Why be a Computational Social Scientist?

## Human Thinking and Human Problems

Human thinking co-evolved with human problems. Specifically, human thinking co-evolved with the kinds of human problems that were persistent, frequent or important enough throughout our evolutionary history to drive adaptation.

#### Basic Problems 
By basic problems, I mean REALLY basic problems that apply to pretty much all living things, such as 'staying alive':

 - Responding to stimulus, including dangers that are both fast (like tigers) and slow (like the approach of winter) as well as useful things (like resources). 
 - Learning in order to turn the unknown into the known so it can be responded to appropriately (as dangerous, useful, etc.).
 - Recognising patterns so as to respond more quickly to known patterns as stimulus or to pique curiosity and drive learning around new patterns. 

#### Less Basic Problems

Humans also have less basic problems that apply to fewer kinds of living things. Most notably, these only apply to social living things. In essence, a big problem for people is 'other people':

 - Understanding the intentions, or likely intentions, of the people around us. 
 - Predicting likely next actions or responses. 
 - The role of intentions, actions and responses within wider patterns like collaboration, competition or some combination of them both.

Solving these problems involves a kind of thinking that doesn’t exactly look like “thinking”. 
 - People often respond to perceived dangers even before they are consciously aware of the danger. 
 - People don't have to put a whole lot of mental effort into learning things like "Avoid the snarling animal" or "Don't eat the bitter tasting fruit". 
 - People don’t really sit down and deliberately compare the footsteps heard on the stairs against the known footsteps of a housemate. 
 - People can recognise other people, read their faces, judge a situation and react all before they can consciously think about it.

These examples of common human thinking are nearly always invisible. They are not examples of **irrational** thinking so much as **sub-rational** thinking that operates at instinctual levels, much faster than the processes that underpin rational thinking.

#### Problem Work-arounds

The typical problems that humans have (in an evolutionary sense) have driven how we think as well as the limits we have on thinking, useful work-arounds, etc. 

- Working memory capacity according to Kimberg, 1997:
    - Limited focus - We can only hold between 1 and 7 things in mind at once. 
    - Complexity matters - We can hold fewer complex things in mind. 
    - Different capacity for structured vs. unstructured - complex but structured things are easier than complex and unstructured things.
    - Chunking (and others) as a work-around strategy - We have ways of creating structure to help us work with more/more complex things. 
- Bounded rationality according to Simon, 1997:
    - Satisficing - People tend to go with 'good enough' solutions for many things, especially when time or complexity is an issue. 
    - Heuristics - We use a lot of 'rules of thumb' to short cut decision-making, speeding up how we arrive at 'good enough' solutions.
- Communication assumptions according to Pinker, 2003:
    - Speaker and listener interpret the same way - We all assume others will understand something the way *we would have* understood it if we were the listener.
    - Objects and actions as distinct - We assume new terms to be either objects or actions, but not both.
    - Mid-level categories, but with knowledge of hierarchies - We assume a generic term like 'cat' to mean a household pet, rather than all felines in taxonomic terms or 'this particular cat, right now'. 
    - Transitive properties - We apply logical operations, so "The book that is on the desk" might actually mean "the book on top of the paperwork that is on the desk". 
- Two types of thinking according to Kahneman, 2011:
    - Fast, error-prone, intuition-based - this is the sub-rational stuff.
    - Slow, (more) accurate, rationality-based - still not perfectly rational, but often with conscious effort involved.

## Computer Thinking and ...

Computers do not have computer problems, at least not in the sense that humans have human problems. Without intervention, a computer does not try to stay alive, to communicate, to learn about its surroundings... Basically, computers just do computer thinking about human problems. 

And humans have tried to give computers ALL KINDS of human problems. Not all of that has worked equally well, because computers do not have the kind of co-evolutionary drive linked to human problems that humans have had. Nevertheless, they are very good at dealing with some problems, especially those that humans are **not good at**. For example:

 - Vast problems that exceed human working memory capacity, either through the number of elements involved, the complexity of elements, or both.
 - Hyper rational problems that require the best answer, not the best answer that can be found quickly.
 - Non-interpretive problems that do not have any embedded reciprocal, creative or assumed communication.

It is important to remember that computer thinking has *different* limitations when compared to human thinking. There are still limits, shortcuts and workarounds to drive efficiency, but these are in other areas and make computer thinking better or more efficient at some tasks than others. 


Those differences include:

|          Humans             |          Computers            |
|-----------------------------|-------------------------------|
|       Abstract concepts     |       Concrete definitions    |
|         Inference           |   Defined terms/rules only    |
| Shared/background knowledge | Nothing carried outside scope |
|      Fuzzy categories       |        Strict categories      |
|      Context-dependence     |           Absolutes           |
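
The "strict categories" contrast can be seen in a few lines of Python. This is only a minimal sketch: the `responses` list is made-up data, standing in for free-text survey answers that a human reader would treat as the same answer. A computer only treats them as the same once the rule for doing so is spelled out.

```python
# Hypothetical free-text survey answers that a human would read as identical
responses = ["Yes", "yes ", "YES"]

# Exact comparison: only a character-for-character match counts as "yes"
exact_matches = [r == "yes" for r in responses]

# The rule has to be stated explicitly: strip whitespace and ignore case
normalised_matches = [r.strip().lower() == "yes" for r in responses]

print(exact_matches)       # [False, False, False]
print(normalised_matches)  # [True, True, True]
```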


## Why Different Thinking Matters

#### Some problems need human and computer thinking

Human thinking is typically needed to identify the problem, possible solutions, relevant info, etc. This is especially true of problems that deal with humans and human behaviour. For example, trying to predict how people will react to an innovation, a policy change, a new trend, etc. requires understanding and predicting how people will interpret new information, how they will apply that information to their diverse personal goals and fears, how they will choose to react (hopefully) in line with existing laws and societal norms, etc. At the same time, computer thinking is needed if dealing with that problem involves working accurately and reproducibly with large volumes of (complex) data. 

For example, imagine a new law is proposed that makes smart meters obligatory for all new houses. Social science researchers could think about how people might react to this new law - they might create some abstract scenarios, identify potential problems, identify similar new laws in the past and try to categorise human reactions to those laws, etc. 

But if those social scientists want to know how people *actually* do react to this new law, then they will need to collect quite a lot of potentially very complex data, ideally from different sources (interviews, smart meter data, social media discussions, freedom of information requests, etc.). Those social scientists will need to employ computer thinking to integrate all of that vast and complex data in useful ways that might actually address the problem. 

The combination of human thinking and computer thinking to address social science problems through the use of potentially vast and complex data is, in a nutshell, what computational social science is all about. 


#### Computational Social Science is an Opportunity

Computational social science (CSS) means working with new forms of digital data and analytical methods to address new problems or to address existing problems in new ways. Both of these mean that computational social science opens up a whole new galaxy of potential research opportunities! That's a rather grand statement but it's true: vast swathes of our daily lives, social interactions, activities and other behaviours are captured digitally. These are often aggregated into large and rich data sets, many of which are available to anyone with the right programming skills to access and manage data from websites, documents, records, images, corpora and algorithms. Accessing and using these new forms of data is only possible through computational means (Kitchin, 2014).

Halford and Savage (2017) outline further advantages to engaging in CSS, especially from a data perspective:
* Utilise techniques for handling and analysing large-scale, unstructured data. 
* Capture data that is generated in real time and over time.
* Access information on new/previously unmeasured activities.
* Access information on familiar/currently measured activities at an unprecedented scale, dynamism or complexity.
* It's happening whether we (social scientists) like it or not (see also Heiberger & Riebling, 2016); empirical social science in general is having its moment in the sun, in much the same way the social theorists did in the recent past.

Last, but certainly not least, there's more to CSS than boosting your research standing or productivity. I rather like Sociologist Dr James Allen-Robertson's thoughts on this:

> In response to the question of what computational social science has helped me achieve, it may seem obvious to mention the concrete projects, the outputs, the measurable outcomes. However, for me computational social science has achieved something more substantial and enduring — a new way of working, a new way of thinking, and a new kind of enthusiasm for research.<sup>[1]</sup>

[1]: https://campus.sagepub.com/blog/james-allen-robertson-css-blog

There are limitations to CSS, of course: ethical consent is needed to access and use these new forms of data, and neither computational resources and capacity (e.g., hard drive space, memory) nor data quality issues are trivial matters (see Halford and Savage, 2017). But limitations aside, CSS depends on the ability *to write code* and *manipulate data*. 

These skills may not be easy to obtain as they require a mode of thinking and intellectual dexterity that you may not have practised before. However, a little ability goes a long way. You do not need to be a software engineer to successfully scrape a website and you do not need to have a high-performance computing environment or server to manipulate data accessed through Spotify or Twitter. 

Brooker (2020) calls this approach the "grilled cheese" methodology for programming: your activities just need to be effective, i.e., produce the results you need or expect. Elegance, concision and optimisation (e.g., shaving milliseconds off the running time of your code) can come later - or not at all - as a computational social scientist. The aim - in the short-term at least - is not to learn everything about a particular technique, but enough to achieve your research aim, e.g., write an executable programming script for collecting data from Wikipedia. Here is Dr James Allen-Robertson again:<sup>[1]</sup>

> What mattered here wasn’t necessarily the nuts and bolts of the techniques I was learning, but the development of a 'methodological imagination' and an understanding of the application of these techniques.

The good news is that much of the code and tools necessary to be a computational social scientist are readily accessible and masterable in a short period of time (think weeks instead of months/years). Grasping and working with basic concepts is key to developing the foundation of computational thinking, which is the real skill needed for CSS. Remember that social science research is increasingly interdisciplinary and knowledge of CSS gives you an advantage for future research funding, academic positions, and more (Brooker, 2020). 

My message is clear: embrace this new world with enthusiasm and a critical and reflexive mindset.

<!--Material added by JK, needs to be worked in -->
<!-- * We live in a world of data (what does that even mean?!?). In fact, we pretty much always have been, although in the pre-digital ages most of that data is probably not what we would consider to be “data” in the way that we understand it now. But it was data nonetheless! As data accumulated, it began to be a problem. How does someone remember it all? How can someone make sure that the right person has new data? How can someone get to the right bit of data at the right time? How best to ensure data is accurate? Fortunately, people developed some pretty good methods for dealing with the problems. These included systems for learning, transferring, testing and accessing data. For example; writing systems allowed data to be transmitted without relying on a specific living messenger being privy to the message, the scientific method allowed insights and discoveries to be replicated ensuring data accuracy, and the dewey decimal system allowed library goers to quickly zoom in on areas that are most likely to have relevant data.  The world of data that surrounds us now includes digital data... Obviously “data-ish” data. And, sensibly, we have developed some methods of dealing with digital data, some based on the data methods developed for non-digital data. Storage drives full of folders with sensible names, for example, are like digital versions of libraries with organised shelves. I am sure you can all imagine more examples.  But, problematically, our modern data world now includes some (very) fast data. Not all of the long-established data methods are well-suited to the fast-moving digital data. For example, the idea that data should be carefully divided into documents, labelled, and stored does not work especially well for data that is continuous, spread over many sources, subject to change, or that is most useful when rapidly accessed.   

* Social sciences have traditionally used certain kinds of (slow) data 
* Social sciences may need to embrace computational thinking in order to use the (very) fast data  -->

<!-- What is data and how is it different than information? 
There is a subtle difference between data and information. Data are the facts or details from which information is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context. 
Data exists, as it were, out there in the world. Information is created, for a purpose, by interpreting data within a given context.  
What kind of data are humans good at using? And computers? 
Humans tend to be good at finding patterns in messy, sparse or irregular data. This is to say, humans are good at deriving information out of minimal data. In theory, this is because we have an evolutionary drive to jump to conclusions as quickly as possible – it is better to flee from the tiger that you think you can hear creeping up behind you than to carry on gathering data about what might be causing the sound of approaching footsteps in the jungle. However, evolution-driven instincts are most applicable to evolutionary problems, so humans tend to be better deriving information efficiently from situations that involve immediate danger, other people, food, and other basic life situations. A good example is how almost all people acquire language in early childhood from the erratic and error-prone output of people around them under often difficult or unusual conditions. This is because language is a clear advantage for a species that reliably exists in a social and communicative context, both in the context of turning the jumble of noise that comes out of people’s mouths into coherent messages and in the context of language acquisition, in which a jumble of noise is turned into a syntax, vocabulary, etc.  
Likewise, people are very good at detecting faces, even doing so in tortilla scorch marks or clouds of smoke, even though the detected faces are detected against strange backgrounds, in all lighting conditions, at unpredictable angles and contrary to all expectations. This is because detecting human faces, in a wide range of potentially surprising conditions, is also a clear advantage to a social species like humans. Consequently, humans are very good at picking up on some very subtle and complex patterns in very challenging conditions.  
In contrast, humans are not good at reliably and accurately churning through boring and predictable data. Humans get bored, attention wanders, errors are made. In situations were errors have catastrophic consequences, then you really don’t want to rely on a human to be paying attention to a boring stream of data. Computers though, are very good at reliably working with boring data. You want a measurement added to a list once per hour? You want a computer for that!  
 
What problems arise when the differences between human and computer approaches to data are not acknowledged or dealt with? 
Errors in contexts where errors matter is the biggest problem of asking humans to do computer-things. On the other side, meaningless or counter-productive choices are the result of trusting data when information is actually more valuable.   -->

# How to be a Computational Social Scientist?

Though there are myriad aspects to the role, being a computational social scientist typically involves one or more of the following practices:
* Writing programming scripts to collect and manipulate data.
* Employing analytical techniques - many derived from computer/information sciences - to reveal patterns in data.
* Using technological tools and e-Research best practice to structure and document your research workflow.

The good news is that, as a trained social scientist, you do not need to learn all of these aspects from scratch and can instead apply the knowledge, skills and strengths that you have already! For example, social scientists possess knowledge - theoretical and empirical - of social systems and phenomena and already have advanced data skills, especially around:
 - categorising and coding responses (qualitative and/or quantitative), 
 - evaluating data quality (e.g., why is this survey response missing?), and 
 - making inferences from data (e.g., how representative is this pattern of a larger population?). 
 
Thus, the main gaps in your skillset are likely to concern one or more of the following:
 - awareness of computational structure and processes, 
 - experience with or knowledge of computational and/or data resources, 
 - programming skills and knowledge of programming languages, 
 - experience with or knowledge of how to document work to improve reproducibility, and 
 - communicating about computational social science. 
 
We'll explore many examples of each throughout this training series, but for now we're going to focus on the skills and behaviours that underpin the above activities. Let's call these our *Big Five for CSS*:
1. Knowing your computational environment
2. Acquiring, understanding and manipulating data
3. Writing code
4. Documentation and reproducibility
5. Communicating effectively

<!-- ## Big Five for Computational Social Science

General structure of each section:
1. Theory/abstract
2. Practical instantiation
3. Reproducibility -->

<!-- ### Thinking computationally

[Barba et al. (2019)](https://jupyter4edu.github.io/jupyter-edu-book/)

* Decomposition: Breaking down data, processes, or problems into smaller, manageable parts
* Pattern Recognition: Observing patterns, trends, and regularities in data
* Abstraction: Identifying the general principles that generate these patterns
* Algorithm Design: Developing the step by step instructions for solving this and similar problems

 -->

## Knowing your computational environment

All computational social science activities depend on knowing how to set up, manage and share a computational environment. This can range from simply understanding how and where files are located on your machine to defining and documenting which software packages, versions and configurations are necessary to execute your data analysis. Whether you are thinking about scraping a web page or implementing an advanced machine learning algorithm, it all begins with establishing your computational environment. First, let's understand how files are stored and accessed on your machine.

#### File system and working directory

It is critical that you think *logically* and in an *organised* way about how you manage and store files for your project. This goes beyond just keeping your files and folders tidy using a graphical user interface, and requires that you know how to move around in and interpret command line interfaces. Although this may look a bit unfamiliar or even scary, the black window with stark contrast text and a blinky cursor will become your friend!

The first thing to know is that files and folders stored on your machine's hard drive can be accessed in two ways:
- Absolute path
- Relative path

Both the absolute and relative path are like directions to the location of a file or folder, but they differ in that a relative path assumes that whoever is giving directions is in the same place as whoever is getting the directions, while an absolute path does not. 

For example, if someone were to ask me "Where is your office?" I would answer differently in different contexts. If I were talking on the phone to someone who wanted to send a book to my office, I would respond with an absolute path type answer and give the full postal address of my office. But if someone were standing in the lobby of my building and asked me where to drop something off after lunch, I would give a relative path answer and say which floor and which hallway to take out of the stairwell, plus my office number. 
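
The same distinction can be sketched in Python using the built-in `os` module. This is an illustration only: `example.csv` is a hypothetical file name, and nothing here checks whether such a file actually exists. The relative path only makes sense from wherever Python is currently "standing", while the absolute path is the full "postal address".

```python
import os

# A relative path: directions that assume a shared starting point
relative_path = os.path.join("data", "example.csv")  # hypothetical file name

# An absolute path: the full address, built from the current working directory
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True
print(absolute_path)                 # will differ from machine to machine
```

Using `os.path.join` rather than typing the separator by hand means the path is built with the right separator for whichever operating system runs the code.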

It is not always easy to know whether you (the one giving the directions) are in the same file system location as the computer (the one getting the directions), so you need to know how to ask for the current working directory (i.e., where *this* notebook, the one you are working in right now, is located).

One way to do that is to double click in the code cell below, just after the end of `%cd%`, and then either 
- hit the "Run" button at the top of the page or 
- use the keyboard shortcut Shift + Enter

In [8]:
!echo %cd%

C:\Users\mzyssjkc\GitWork\BCSS_JN


This asks the computer to repeat out loud (or 'echo') its current working directory (or 'cd'). The `%cd%` form is Windows command-line syntax, which matches the Windows-style path in the output above. The command happens to be structured very much like an English language command, with the verb at the front and the object at the end, not unlike "Pass the salt" or "Close the door". 

When you run a command like this in a code cell, the computer will execute the code and echo back to you its current directory. 

Another way to do that is to import a library called os (which stands for operating system) and ask the computer to use an os command called getcwd (short for get current working directory). Like the echo command above, this tells the computer to report where in the file structure it currently is. Try double clicking in the code cell below and hitting "Run" or Shift + Enter.

In [2]:
import os

os.getcwd()

'C:\\Users\\mzyssjkc\\GitWork\\BCSS_JN'

Unlike the echo command, this one is not structured so much like an English language command. Instead, it translates (more or less) to "Using os, run the getcwd command (here)". 

Although it is less English-like, os is very useful. For example, you can use it to get a list of the contents of a directory. Put another way, that means "Tell me everything that is in this folder". Go ahead and double click/Run in the next cell. 

In [9]:
os.listdir() # return contents of current working directory

['.git',
 '.ipynb_checkpoints',
 'bcss-code-2020-05-06.ipynb',
 'convert-data-structures-2010-03-16.ipynb',
 'data',
 'images',
 'outlines',
 'Quick_Guide_to_Jupyter_Notebooks.ipynb',
 'README.md',
 '_config.yml']

Roughly translated, this command says "Using os, list the contents of the directory (here)". If you did not run the earlier cell that imports os, you will get an error here. If so, go back and run the cell that imports os and then try this cell again. 


As well as getting os to list the contents of *here*, you can ask it to list the contents of directories that are *there*, without having to move there first and then use os.listdir(). 

Double click/Run in the next code cell to see how that works. 

In [3]:
os.listdir("./data/") # return contents of the "data" directory

['oxfam-csv-2020-03-16.csv',
 'oxfam-csv-2020-03-16.json',
 'oxfam-csv-2020-03-16.xml']

If you look back at the results of asking os to list the contents of *here*, you will see that one of the items in the list was 'data'. When we asked os to list the contents of "./data/", we were asking it to list the contents of a directory or folder called 'data' that is located here. 

To translate that a bit more, the "./" at the beginning of "./data/" means "this directory here" or "this directory where we are now". The "data/" part at the end of "./data/" means a directory called "data". If you put them together, it means "a directory called 'data' that is in the directory here".

You can tell that both directories are directories, because of the "/" after "data" and after ".". That "/" means that whatever precedes the "/" will be a directory, rather than a file or something else. 
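To tie this back to absolute and relative paths, the sketch below uses `os.path` to turn the relative path "./data" into an absolute one. It is only an illustration: the absolute path it prints depends on where your notebook is running, and the variable names are our own.

```python
import os

# Relative path: "a directory called 'data' that is in the directory here"
relative_path = os.path.join(".", "data")

# Absolute path: the full address, starting from the top of the file system
absolute_path = os.path.abspath(relative_path)

print("Relative path:", relative_path)
print("Is it absolute?", os.path.isabs(relative_path))

print("Absolute path:", absolute_path)
print("Is it absolute?", os.path.isabs(absolute_path))
```

The conversion works by joining the relative path onto the current working directory, which is why the same relative path can point to different places on different machines.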

#### Environments

Your computational environment consists of hardware (e.g., the physical machine and its Central Processing Unit) and software (e.g., operating system, programming languages and their versions, files). For instance, here is a snapshot of my computational environment as of 2020-03-30; first, the operating system:

And my version of Python, plus some of the additional packages installed:
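A snapshot along these lines can be produced with Python's standard library. The sketch below is one possible approach and not necessarily how the snapshot above was generated; it assumes Python 3.8 or later (for `importlib.metadata`), and what it prints will differ from machine to machine.

```python
import platform
import sys
from importlib import metadata  # part of the standard library from Python 3.8

# Operating system and Python details
print("Operating system:", platform.system(), platform.release())
print("Python version:", platform.python_version())
print("Python executable:", sys.executable)

# Collect "name version" for every package installed in this environment
installed = sorted(
    "{} {}".format(dist.metadata["Name"], dist.version)
    for dist in metadata.distributions()
)
print("Number of installed packages:", len(installed))
print(installed[:5])  # show the first five only
```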

Computational environments tend to be unique: for example, you may have different software applications installed on your machine compared to your classmate; or some machines in your computer lab run Windows 10, others Windows 7. This customisability presents considerable challenges for conducting, sharing and reproducing scientific work. In the words of the Turing Institute:<sup>[5]</sup>
> The analysis should be *mobile*. Mobility of compute is defined as the ability to define, create, and maintain a workflow locally while remaining confident that the workflow can be executed elsewhere.

Trying and failing to reproduce a piece of work after switching to a new machine is, frankly, soul destroying. Thankfully, there are numerous, simple technological solutions for capturing and sharing your computational environment.

#### Capturing a computational environment

If you run multiple projects, you may need more than one environment, achieved by using more than one machine or by _partitioning_ your machine into separate units. Each of these environments can then be customised for the kind of work you do on the different projects. 

For example, on one of my machines I have two environments: one for collecting charity data for Australia; and another for interacting with the [Companies House API](https://developer.companieshouse.gov.uk/api/docs/). Each environment has Python installed but they have different Python packages. I do not perform any web-scraping for the Companies House project, therefore I did not install the `requests` or `BeautifulSoup` packages in that environment. 

I find this beneficial because the work I do on the different projects requires different packages, each of which can be picky about which versions of *other* packages I have installed. By keeping them separate, I can install or update only the packages I need, when I need them, without worrying that improving things for one project will break the chain of requirements for another. If I were to use one environment for all the different kinds of work I do, some of my scripts may break whenever I try to upgrade or install something that I need for only one project. Running separate environments for different projects allows me to manage these package dependencies carefully and correctly.

Interacting with and understanding your computer at a more fundamental level is also excellent training for running your own server for research (or other) purposes. What is a server? Think of it as a more powerful form of personal computer, running in the cloud, and your primary means of communicating with it is through the Command Line Interface (CLI). It is always on (barring any planned or unplanned downtime) and thus is particularly useful for running automated, scheduled tasks e.g. conducting a weekly scrape of a particular web page.

[5]: https://the-turing-way.netlify.com/reproducible_environments/reproducible_environments.html

## Acquiring, understanding and manipulating unstructured/unfamiliar data

#### Acquiring data
There are LOADS of ways to get data, some that are more 'computational' than others. You are all surely familiar with surveys and interviews, as well as Official data sources and data requests. You may also be familiar with (at least the concepts of):
 - scraped data that comes from web-pages or APIs
 - “found” data that is captured alongside the originally intended data targets
 - meta-data, which is data about data
 - repurposed data, or data collected for some other purpose that is used in new and creative ways or 
 - other... because this list is definitely not exhaustive. 
 
 To some extent, using these data sources requires that you keep your ear to the ground so that you know when relevant new sources come available. But once you know *about* them, you still need to know *what* they are and *how* to access and use them. 
 
 So, we will set data acquisition aside for the moment and instead focus on data literacy, which is knowledge of the types of data that you might find. 
 
 Being data literate involves understanding two key properties of datasets:
1. How the contents of the dataset are stored (e.g., as numbers, text, etc.).
2. How the contents of the dataset are structured (e.g., as rows of observations, or networks of relations).

<!-- Data literacy is the ability to manipulate a wide variety of different types of data. Data literacy is front-and-centre in computational social science! Why do you need computional skills for handling data? 
 --> 
 <!-- Imagine that you've found a website that publishes statistics about a phenomenon of interest, however these are updated daily and you do not have the time (or patience) to visit that website every day to extract the figures, copy them to a new row in a file, etc. 
Now, if you are not data literate, you might just be a bit stuck. You would have to do your best to visit the website as often as possible, ask others to help you visit it on days that you can't do it yourself, or something similar. BUT! If you were data literate, you would be able to find out how the data was structured on the website and would be able to create a programming script that would visit the website for you, copy out the data you needed, and save it somewhere sensible. 
 -->
#### Data types

Data types provide a means of classifying the contents (values) of your dataset. For example, in [Understanding Society](https://www.understandingsociety.ac.uk/) there are questions where the answers are recorded as numbers e.g., [`prfitb`](https://www.understandingsociety.ac.uk/documentation/mainstage/dataset-documentation/variable/prfitb) which captures total personal income; [One more example of qualitative variable]

Data types are important as they determine which values can be assigned to them, and what operations can be performed using them e.g., can you calculate the mean value of a piece of text (Tagliaferri, n.d.)?<sup>[4]</sup> Let's cover some of the main data types in Python.

[4]: https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf
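To make the point about operations concrete, here is a minimal sketch using made-up values (the `incomes` and `jobs` lists and the `mean` function are our own illustrations, not taken from Understanding Society). Averaging numbers works; trying to average text raises an error.

```python
def mean(values):
    """Return the arithmetic mean of a list of values."""
    return sum(values) / len(values)

incomes = [1200, 1850, 990]             # hypothetical numeric answers
jobs = ["Nurse", "Chef", "Researcher"]  # hypothetical text answers

print("Mean income:", mean(incomes))

try:
    print("Mean job:", mean(jobs))
except TypeError as error:
    # Python refuses to add text values together as if they were numbers
    print("Cannot average text:", error)
```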

 <!-- Material added by JK to be worked in -->  
 <!-- Structured vs. unstructured (or more accurately, semi-structured) data 
# Fully structured data vs. completely unstructured data 
# Real-world data  
# Practical matters, or Rows and columns vs. free text 


# <!-- Think hard about data 
# Do you start by thinking about ideal data and then try to acquire the best possible match? 
# Or do you start with what is available and try to find the most useful thing to say about it? 
# Bit of both? What are the limitations? 
#  Searching structured data 
# Regular expressions 
# Databases 
# Hierarchies (semantic web, ontologies, etc.) 
# Other? 
# Searching un/semi-structured data 
# Regular expressions again 
# Free text fields 
# Text-mining 
# Machine learning 
# Deep learning 
# Edge-detection 
# AI 
# Other?  -->

 <!-- ### Working with data
#  -->
 <!-- Ideal or tidy data 
# Cutting and subsetting 
# Joining and merging 
# Recoding values 
# Coding data 
# Other? 
# Combining and cleaning data 
# Combining data with and without a common field 
# Cleaning messy, incomplete, or inconsistent data 
# Embrace the mess – using missing/incomplete/inconsistent data as a source of information  -->
 <!-- More material added by JK. Will this be another of the "Big Five"? Or maybe work into an existing section. But data prep is so important that it seems like it could be a section on its own.  -->


 <!-- Another item in the "Big Five"? or Just part of thinking computationally? Something to get people out of their normal mindsets of "How do I answer this research question...?" -->

##### Numbers

These can be integers or floats (decimals), both of which behave a little differently.

In the next code block, we name two variables (myint and yourint) and define them as integers (5 and 10 respectively). To do this, double click and Run/Shift+Enter in the cell block below. 

In [2]:
# Integers

myint = 5
yourint = 10


You double clicked, you hit run, but nothing happened, right? That is because naming and defining variables does not come with any commands that produce output. Basically, we ran a command that has no visible reaction. But maybe we want to check that it worked? To do that, we can call a print command. 

The cell below has a print command that includes some text (within the quotation marks) and the result of a numerical operation over the variables we defined. Go ahead, double click in the cell and hit Run/Shift+Enter.

In [3]:
print("Summing integers: ", myint + yourint)


Summing integers:  15


Great! The print command worked and we see that it correctly summed the numerical value of the two variables that we defined. 

Let's try it again with Floats. Click in the code block below and hit Run/Shift+Enter. 

In [4]:
# Floats

myflo = 5.5
yourflo = 10.7
print("Summing floats: ", myflo + yourflo)


Summing floats:  16.2


It might not be surprising, but it worked again. This time, the resulting sum had a decimal point and a following digit, which is how we know it was a float rather than an integer. 

What happens when we sum an integer and a float? Find out with the next code block!

In [5]:
# Combining integers and floats

newnum = myint + myflo

print("Value of summing an integer and a float: ", newnum)
print("Data type when we sum an integer and a float: ", type(newnum))

Value of summing an integer and a float:  10.5
Data type when we sum an integer and a float:  <class 'float'>


In this case, we create a new variable, called *newnum*, and assign it the value of the sum of one of our previous integers and one of our previous floats.  

Then, we have two print statements. One returns the value of *newnum* while the other returns the *type* of *newnum*. 

You can always ask for the type. Go ahead and double click in the cell above again. This time, instead of just running the code, copy and paste the final print statement onto a new line. Before you run the code again, change your new line by rewriting the text inside the quotation marks to anything you like and changing *newnum* to *myflo* or *myint* or any of the other variables we defined. 

You can even define a whole new variable and then ask for the type of your new variable. 
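As a short illustration of how integers and floats "behave a little differently", the sketch below shows two common surprises: division always produces a float, and floats are stored approximately, so exact comparisons can fail. The variable name `total` is our own.

```python
print(7 / 2)         # true division always returns a float
print(7 // 2)        # floor division of two integers returns an integer
print(type(10 / 5))  # a float, even though the result is a whole number

total = 0.1 + 0.2
print(total)                   # floats are stored approximately
print(total == 0.3)            # so an exact comparison can be False
print(round(total, 2) == 0.3)  # rounding before comparing is one common workaround
```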

##### Strings

This data type stores text information. This should be a bit familiar, as we used text information in the previous code blocks within quotation marks. 

Strings are immutable in Python, i.e., you cannot change a string's value in place after creating it. But you can see what type of variable a string is (just like with the numerical variables above). 

You can also re-define the variable, which rewrites or changes the original definition and you can create new strings by performing operations on existing strings, such as replacing bits of the string, splitting it into sub-strings, etc. 

Double click in the code block below and hit Run/Shift+Enter. 

You can also copy/paste/edit the commands to create your own string variables and run your own commands on them. 

In [12]:
# Strings

mystring = "Thsi is my feurst string."
print(mystring)

print("What type is mystring: ", type(mystring))

mystring = "This is my correct first string."
print(mystring)

yourstring = mystring.replace("my", "your") # replace the word "my" with "your"
print(yourstring)

splitstring = yourstring.split("your") # split into separate strings
print(splitstring)


Thsi is my feurst string.
What type is mystring:  <class 'str'>
This is my correct first string.
This is your correct first string.
['This is ', ' correct first string.']
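The immutability mentioned earlier can be seen directly. In this small sketch (the `greeting` variable is our own example), trying to change one character in place raises an error, while string methods hand back a new string and leave the original untouched.

```python
greeting = "hello world"

try:
    greeting[0] = "H"  # attempt to change the first character in place
except TypeError as error:
    print("Strings cannot be changed in place:", error)

capitalised = greeting.capitalize()  # returns a *new* string
print(greeting)     # the original is unchanged
print(capitalised)
```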


Manipulating strings will be a common and crucial task during your computational social science work. We'll cover intermediate and advanced string manipulation throughout these training materials but for now we highly suggest you consult the resources listed below.

*Further Resources*:
* [Principles and Techniques of Data Science](https://www.textbook.ds100.org) - Chapter 8.
* [Python 101](https://python101.pythonlibrary.org) - Chapter 2.

##### Boolean

This data type captures values that are true or false, 1 or 0, yes or no, etc. These will be like dummy or indicator variables, if you have used those in Stata, SPSS or other stats programmes. 

Boolean data allow us to evaluate expressions or calculations (e.g., is one variable equal to another? Is this word found in a paragraph?).

Double click in the code block below and hit Run/Shift+Enter. 

In [14]:
# Boolean

result = (10+5) == (14+1) # check if two sums are equal
print(result) # print the value of the "result" object
print(type(result)) # print the data type of the "result" object

True
<class 'bool'>


It is important to note that we did not define *result* as the value of 10+5 or the value of 14+1. We defined *result* as the value of whether 10+5 was exactly equal to 14+1. 

In this case, 10+5 is exactly equal to 14+1, so *result* was defined as True, which we can see in the output of the *print(result)* command. 
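The other question raised above, "Is this word found in a paragraph?", also evaluates to a Boolean. Here is a minimal sketch using Python's `in` operator on a made-up sentence; note that the check is case-sensitive.

```python
paragraph = "This series introduces new forms of data for social science research."

has_data = "data" in paragraph      # is the text "data" found in the paragraph?
has_survey = "survey" in paragraph  # is the text "survey" found in the paragraph?
print(has_data, has_survey)

print("Data" in paragraph)          # capital "D" does not match lower-case "data"
print("data" in paragraph.lower())  # lower-casing the text first avoids case problems
```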

Booleans are very useful for controlling the flow of your code: in the below example, we assign somebody a grade and then use boolean logic to test whether the grade is above a threshold, which determines whether or not that grade receives a pass or fail notification.

Double click in the code block below and hit Run/Shift+Enter. 

Then redefine grade as a different number by changing the number after the '=' and then hitting Run/Shift+Enter again. 

In [21]:
grade = 71

if grade >= 40:
    print("Congratulations, you have passed!")
else:
    print("uh oh, exam resits for you.")

Congratulations, you have passed!


You can write a boolean statement more concisely, as demonstrated in the next code block. This time, you don't get the nicely worded pass/fail messages, but those will not always be important. 

Double click in the code block below and hit Run/Shift+Enter. Try it again, but change the number 40. This changes the threshold that *grade* is compared against, and so whether the command returns True or False. 

Remember that you can redefine *grade* at any point, either by changing the definition in the code block above and re-running that code block or by copy/pasting/editing the grade = 71 line from above into this code block and re-running it here. 

In [22]:
print(grade >= 40) # evaluate this expression

True


*Further Resources*:
* [How To Code in Python](https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf) - Chapter 21.

##### Lists

A list is an ordered, mutable (i.e., you can change its values) sequence of elements. Lists are defined by naming a variable and setting it equal to elements inside of square brackets. 

Double click in the code block below and hit Run/Shift+Enter. 


In [29]:
# Creating a list

numbers = [1,2,3,4,5]
print("numbers is: ", numbers, type(numbers))

strings = ["Hello", "world"]
print("strings is: ", strings, type(strings))

mixed = [1,2,3,4,5,"Hello", "World"]
print("mixed is: ", mixed, type(mixed))

mixed_2 = [numbers, strings]
print("mixed_2 is: ", mixed_2, type(mixed_2)) # this is a list of lists

numbers is:  [1, 2, 3, 4, 5] <class 'list'>
strings is:  ['Hello', 'world'] <class 'list'>
mixed is:  [1, 2, 3, 4, 5, 'Hello', 'World'] <class 'list'>
mixed_2 is:  [[1, 2, 3, 4, 5], ['Hello', 'world']] <class 'list'>


Notice that most of these print commands print the value of the variable and also the type of variable.

Also notice that you can define a list variable by listing all of the elements that you want to be in that list inside of square brackets (like the code that defines 'mixed') or you can define a list by including *other* lists inside of the square brackets for a new list (like the code that defines 'mixed_2'). 

As you can see, mixed has only one set of square brackets, but mixed_2 has square brackets nested inside of other square brackets to create a list of lists. 

Feel free to re-define these variables or add/define new variables too (but leave 'numbers' alone as we need it for the next several steps). 

When you are done testing out how to define and work with lists, go on to run the next code block.

In [30]:
# List length

length_numbers = len(numbers)
print("The numbers list has {} items".format(length_numbers)) 
# the curly braces act as a placeholder for what we reference in .format()

The numbers list has 5 items


This one creates a new variable, called 'length_numbers' that is defined as the "len" of the "numbers" variable we defined above. 

The print statement underneath then goes on to tell us the value of the 'length_numbers' variable, but embeds that value inside of a sentence. We use the curly brackets as a placeholder for where the value should get embedded and use '.format(length_numbers)' to order the embedding and to define what is to be embedded. 

Try re-running the print command with other values embedded (by changing the variable that is to be embedded), or embedding the variable in different places (by repositioning the curly brackets). 
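You may also come across a shorter way of embedding values, known as an f-string (available from Python 3.6). The sketch below uses its own variable, `n_items`, so that it does not interfere with the variables defined above, and shows that both approaches produce the same text.

```python
n_items = len([1, 2, 3, 4, 5])

with_format = "The numbers list has {} items".format(n_items)
with_fstring = f"The numbers list has {n_items} items"  # note the f before the quotation mark

print(with_format)
print(with_fstring)
print(with_format == with_fstring)  # both approaches give identical strings
```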

In [31]:
# Accessing items (elements) within a list

print("{} is the second item in the list".format(numbers[1]))
# note that the position of items in a list (known as its 'index position')
# begins at zero i.e., [0] represents the first item in a list

# We can also loop through the items in a list:

print("\r") # add a new line to the output to aid readability
for item in numbers:
    print(item)
# note that the word 'item' in the for loop is not special and
# can instead be defined by the user - see below   

print("\r")
for chicken in numbers:
    print(chicken)
# of course, such a silly name does nothing to aid interpretability of the code    

2 is the second item in the list

1
2
3
4
5

1
2
3
4
5


In [None]:
# Adding or removing items in a list

numbers.append(6) # add the number six to the end of the list
print(numbers)

numbers.remove(3) # remove the number three from the list
print(numbers)
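Because lists are mutable, you can also change, insert and remove items by position. The sketch below works on a separate list called `letters` (our own example) so that 'numbers' is left alone.

```python
letters = ["a", "b", "c", "d"]

letters[0] = "z"        # replace the first item in place
print(letters)

letters.insert(1, "y")  # insert "y" at index position 1
print(letters)

last = letters.pop()    # remove and return the final item
print(last, letters)

print(letters[-1])      # negative positions count back from the end
print(letters[0:2])     # a slice: the items at positions 0 and 1
```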

_Dictionaries_

The dictionary data type maps keys (i.e., variables) to values; thus, data in a dictionary are stored in key-value pairs (known as items). Dictionaries are useful for storing data that are related e.g., variables and their values for an observation in a dataset.

In [None]:
# Creating a dictionary

person = {"name": "Diarmuid", "age": 32, "occupation": "Researcher"} # named "person" rather than "dict" so that Python's built-in dict is not overwritten
print(person)

In [None]:
# Accessing items in a dictionary

print(person["name"]) # print the value of the "name" key

In [None]:
print(person.keys()) # print the dictionary keys

In [None]:
print(person.items()) # print the key-value pairs

In [None]:
# Combining with lists

obs = [] # create a blank list

ind_1 = person # create dictionaries for three individuals
ind_2 = {"name": "Jeremy", "age": 50, "occupation": "Nurse"}
ind_3 = {"name": "Sandra", "age": 41, "occupation": "Chef"}

for ind in ind_1, ind_2, ind_3: # for each dictionary, add to the blank list
    obs.append(ind)

print(obs)# print the list
print("\r")
print(type(obs)) # now we have a list of dictionaries
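A list of dictionaries behaves much like a small dataset: each dictionary is an observation and each key is a variable. As a rough sketch (the list is re-created here so the example runs on its own), you can pull out one variable across all observations, summarise it, or filter observations by a condition.

```python
obs = [
    {"name": "Diarmuid", "age": 32, "occupation": "Researcher"},
    {"name": "Jeremy", "age": 50, "occupation": "Nurse"},
    {"name": "Sandra", "age": 41, "occupation": "Chef"},
]

# Extract the "age" variable from every observation
ages = [ind["age"] for ind in obs]
mean_age = sum(ages) / len(ages)
print("Mean age:", mean_age)

# Keep the names of individuals aged over 40
over_40 = [ind["name"] for ind in obs if ind["age"] > 40]
print("Aged over 40:", over_40)
```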

_Social science applications_

You may be wondering how the above examples have social science applications. To answer, here is an example from my research. Let's say I want to scrape First, I want to define a list of charity numbers that I 

#### Data structures

Indulge me: close your eyes and visualise a dataset. What do you picture? Heiberger and Riebling (2016, p. 4) are confident they can predict what you visualise:

> Ask any social scientist to visualize data; chances are they will picture a rectangular table consisting of observations along the rows and variables as columns.

This dataset (also known as a variable-by-case matrix or data frame) is a type of data structure: it stores values (e.g., text or numbers) in variables (e.g., strings or integers) in rows for _n_ number of observations. [Comment on why this is not always the best structure e.g. network data] As you engage in computational social science, you will encounter many more types of data structure, some of which may be unfamiliar; for now let's focus on some of the more common ones. We'll use some sample data - organisational and financial information for the charity [Oxfam](https://beta.charitycommission.gov.uk/charity-details/?regId=202918&subId=0) - to demonstrate and compare the properties of each data structure.

##### Data frame

A data frame is a rectangular data structure and is often stored in a Comma-Separated Values (CSV) file format. A CSV stores observations in rows, and separates (or "delimits") each value in an observation using a comma (','). Let's examine a CSV dataset in Python:

In [None]:
import csv # module for handling CSV files

with open("./data/oxfam-csv-2020-03-16.csv", "r", newline="") as f: # open file in 'read mode'; newline="" is recommended when using the csv module
    reader = csv.reader(f) # read data in file and store in a Python CSV object called 'reader'
    for row in reader: # for every row in the data, print the contents of the row
        print(row) 

Though not as readable as we would like (compared to opening the file in Excel, Stata etc), we can clearly identify the structure of this file:
* the first row contains the variables;
* the following rows contain values for those variables, separated by commas; and
* each row is clearly defined as beginning on a new line
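The same three properties can be seen in a tiny, made-up CSV held in memory. This sketch uses `csv.DictReader`, which treats the first row as the variable names; the data are invented for illustration and are not from the Oxfam file.

```python
import csv
import io

# A small, invented CSV: first row holds variables, later rows hold values
raw = "name,age,occupation\nJeremy,50,Nurse\nSandra,41,Chef\n"

reader = csv.DictReader(io.StringIO(raw))  # io.StringIO lets us treat text like a file
rows = list(reader)

print(reader.fieldnames)  # the variables taken from the first row
for row in rows:
    print(row["name"], row["age"])

# Values are read as text, so convert them before doing arithmetic
print(type(rows[0]["age"]))
print(int(rows[0]["age"]) + 1)
```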

There is another way of opening CSV files and handling their contents: using the `pandas` module.

In [None]:
import pandas as pd # module for handling data frames

df = pd.read_csv("./data/oxfam-csv-2020-03-16.csv") # open the file and store its contents in the "df" object
df # view the data frame

Being able to work with CSV files is a fairly simple but crucial skill as a computational social scientist: many open-source datasets are shared in this format, and transforming more complicated data structures to CSV files can aid your data analysis workflow (e.g. importing the subsequent CSV file into Stata or R).

Now let's look at the same information but stored in a different data structure.

##### Dictionaries

A dictionary is a hierarchical data structure based on key-value pairs. Dictionaries are often stored as JavaScript Object Notation (JSON) files. Let's examine a JSON dataset in Python:

In [None]:
import json # import Python module for handling JSON files

with open('./data/oxfam-csv-2020-03-16.json', 'r') as f: # open file in 'read mode' and store in a Python JSON object called 'data'
    data = json.load(f)
          
data # view the contents of the JSON file

Once again, readability is not great (see appendix A for other ways of viewing the contents of a JSON file) but we can pick out the core properties of the data structure:
* A dictionary begins with '{' and ends with '}';
* It can contain nested dictionaries - for example, the value of the first key ('name') is itself a dictionary containing ten key-value pairs (e.g. '0': '01/05/2008 00:00'); and
* key-value pairs are separated by a comma (',').

Let's dig into some of these properties in more detail:

In [None]:
data.keys() # view the keys

In [None]:
data.values() # view the values

In [None]:
data.items() # view the key-value pairs (items)

In [None]:
data['fye']['9'] 
# view the value of the '9' subkey under the 'fye' key i.e. the tenth value for the financial year end key 
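If you do not have the data file to hand, the same pattern of nested keys can be explored with a small JSON string. The content below is invented and only mimics the shape described above (each key maps to a dictionary of row numbers and values); `data_example` is our own name.

```python
import json

# Invented JSON text: each key maps to a dictionary of row numbers and values
text = '{"name": {"0": "Charity A", "1": "Charity B"}, "income": {"0": 100, "1": 250}}'

data_example = json.loads(text)  # parse a JSON string into a Python dictionary

print(type(data_example))
print(list(data_example.keys()))    # the top-level keys
print(data_example["income"])       # the value of a key is itself a dictionary
print(data_example["income"]["1"])  # key, then subkey, to reach a single value
print(len(data_example["name"]))    # number of key-value pairs under "name"
```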

##### XML

eXtensible Markup Language (XML) is a hierarchical data structure (known as a document) that uses tags to identify (or 'markup') the different types of information it contains; XML is also a file format (.xml) used to store XML documents.<sup>[4]</sup> Let's examine an XML dataset in Python:



[4]: https://www.w3schools.com/xml/xml_whatis.asp

In [None]:
from lxml import objectify # import Python module for handling XML files

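xml = objectify.parse('./data/oxfam-csv-2020-03-16.xml') # parse the file; passing the path lets lxml open and close it

root = xml.getroot() # the root is the outermost tag in the document
root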
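To see how tags give an XML document its structure, here is a small sketch on invented data. Note that it uses `xml.etree.ElementTree` from Python's standard library rather than `lxml`, so that it runs without installing anything; the tag names and values are made up.

```python
import xml.etree.ElementTree as ET

# Invented XML: a root tag containing two "charity" elements
xml_text = """
<charities>
    <charity id="1">
        <name>Charity A</name>
        <income>100</income>
    </charity>
    <charity id="2">
        <name>Charity B</name>
        <income>250</income>
    </charity>
</charities>
"""

root_example = ET.fromstring(xml_text)  # parse the text and return the root element
print(root_example.tag)

for charity in root_example:  # loop over the elements nested inside the root
    name = charity.find("name").text           # text between <name> and </name>
    income = int(charity.find("income").text)  # element text is a string, so convert it
    print(charity.get("id"), name, income)     # .get() reads an attribute of the tag
```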

##### Graphs

[Example of graph data structure e.g., an edge list]
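As a rough illustration of a graph data structure, the sketch below stores a network as an edge list (one pair per tie) and converts it to an adjacency dictionary. The names and ties are entirely hypothetical, and treating the ties as undirected is our assumption.

```python
# Hypothetical edge list: each pair records a tie between two people
edges = [("Ana", "Ben"), ("Ana", "Cal"), ("Ben", "Cal"), ("Cal", "Dee")]

# Convert to an adjacency dictionary: each person maps to a set of neighbours
adjacency = {}
for source, target in edges:
    adjacency.setdefault(source, set()).add(target)
    adjacency.setdefault(target, set()).add(source)  # ties are treated as undirected

for person in sorted(adjacency):
    neighbours = sorted(adjacency[person])
    print(person, neighbours, "degree:", len(neighbours))
```

Notice that this structure has no natural rows-and-columns layout: the unit of observation is the tie between two people rather than a single case.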

If you are feeling flustered about having to master new data structures, then don't: converting from one structure to another is perfectly fine and common. For example, in my research I often convert dictionaries to data frames and save these as CSV files (see chapter 3). For now it is important that you familiarise yourself with the above structures, as much of the data available via the web is stored in these structures and file formats. 
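Here is a minimal sketch of one such conversion, using only the standard library: a nested dictionary is regrouped into one dictionary per observation and then written out as CSV text. The data are invented, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Invented dictionary: each variable maps to a dictionary of row numbers and values
columns = {"name": {"0": "Charity A", "1": "Charity B"}, "income": {"0": 100, "1": 250}}

# Regroup by row number so that each observation becomes its own dictionary
row_ids = list(columns["name"].keys())
rows = [
    {variable: values[row_id] for variable, values in columns.items()}
    for row_id in row_ids
]

buffer = io.StringIO()  # stands in for a file opened in write mode
writer = csv.DictWriter(buffer, fieldnames=list(columns.keys()))
writer.writeheader()    # first row: the variable names
writer.writerows(rows)  # following rows: one per observation

csv_text = buffer.getvalue()
print(csv_text)
```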

## Writing code

Perhaps the most crucial aspect of being a computational social scientist, the ability to write code can bring enormous rewards. _Description/definition of programming and how it is similar to writing syntax for SPSS, Stata etc_

Programming can be conceived as a social research method:

> as a multipurpose toolkit for understanding and intervening in the (digital) social world in lots of different ways (Brooker, 2019, p.#).<sup>[2]</sup>

[2]: https://doi.org/10.1177/0038026119840988

This idea of "Programming-as-Social Science" carries with it two important distinctions (Brooker, 2019):
1. **Coding-as-method** - using code to interact with or probe the social world (e.g. through data collection scripts).
2. **Programming-as-analysis** - employing a coding mindset/knowledge of code to conceptualise and research social phenomena differently (e.g. ).

Though the objective of any research project is to produce a robust and defensible finding (theoretical or empirical), the manner in which you conduct your activities is increasingly important. This impinges on the code you write, also. There is a school of thought that emphasises the readability and fluency of code, known as _literate programming (LP)_. The father of this approach, Donald Knuth (n.d.), summarises its high level aim:

> Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.<sup>[3]</sup>

Such a statement probably appears grandiose and abstract, but there are important practical implications of this idea. The coder (or essayist in LP parlance):

> chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding.

Being a literate programmer does not mean writing screeds of comments and headings for the sake of it, far from it (remember: conciseness is paramount). We'll cover this topic in more detail in section 1.3.5 (add anchor).

[3]: http://www.literateprogramming.com/knuthweb.pdf
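As a small illustration of choosing names carefully, compare the two functions below. They do exactly the same thing, but only the second explains itself to a human reader. Both functions and the example grades are our own inventions, offered as a sketch rather than a definitive style guide.

```python
# Harder to read: the names say nothing about what the values mean
def f(a, b):
    return [x for x in a if x >= b]

# Easier to read: the names and a short docstring explain the intent
def select_passing_grades(grades, pass_mark):
    """Return only the grades that meet or exceed the pass mark."""
    return [grade for grade in grades if grade >= pass_mark]

exam_grades = [35, 71, 40, 52]
print(f(exam_grades, 40))
print(select_passing_grades(exam_grades, pass_mark=40))
```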

<!-- Material added by JK to be worked in -->
<!-- Learning to code (it is not as scary as you think) 
Why coding is useful 
What coding languages should you bother with learning 
But how do you actually “do” the coding 
More on this  -->

## Documenting and enhancing your workflow

There is a growing movement across the scientific community for greater transparency and reproducibility of research. Put simply, "Reproducible research is necessary to ensure that scientific work can be trusted."<sup>[5]</sup>

Reproducibility can be summarised as the availability of data and code to fully rerun an analysis. The Turing Way provides a delineation of the various connotations of reproducibility and related terms:
* **Reproducible**: A result is reproducible when the same analysis steps performed on the same dataset consistently produces the same answer.
* **Replicable**: A result is replicable when the same analysis performed on different datasets produces qualitatively similar answers.
* **Robust**: A result is robust when the same dataset is subjected to different analysis workflows to answer the same research question (for example one pipeline written in R and another written in Python) and a qualitatively similar or identical answer is produced. Robust results show that the work is not dependent on the specificities of the programming language chosen to perform the analysis.
* **Generalisable**: Combining replicable and robust findings allow us to form generalisable results. Note that running an analysis on a different software implementation and with a different dataset does not provide generalised results. There will be many more steps to know how well the work applies to all the different aspects of the research question. Generalised is an important step towards understanding that the result is not dependent on a particular dataset nor a particular version of the analysis pipeline.


Professor Vernon Gayle of the University of Edinburgh has distilled reproducibility best practices into the following guidance (or rules) for social science research:
1. Tell us about your software.
2. Tell us about your data.
3. Show us how you got your data ready.
4. Show us all the analysis you did.
5. Save all of this work openly.


*Further Resources:*
* [The Turing Way](https://the-turing-way.netlify.com) - Chapter 2.
* [New Rules of the Sociological Method](https://github.com/vernongayle/new_rules_of_the_sociological_method/blob/master/noobs.ipynb).

What is the relation between CSS and reproducible research? Well, the former provides a suite of tools and best practices for achieving the latter. Let's run through some of these quickly:
* **Jupyter Notebooks**: the materials you are working through were written in a Jupyter notebook, a software application that enables you to interleave live code, results and narrative in a single file. Traditionally, social scientists save their data cleaning and analysis work in one or more files (e.g., Stata DO files), and write up the results in another file (e.g., a MS Word or LaTeX file). Jupyter notebooks re-establish the connection between conducting and reporting research activities. As [Barba et al. (2019)](https://jupyter4edu.github.io/jupyter-edu-book/) espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

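* **GitHub**: a web platform for hosting Git repositories, where you can store, version and publicly share your code and notebooks.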

[J. Scott Long's Workflow of Data Analysis Using Stata]

[FAIR principles]

[Citing data]
    
[5]: https://the-turing-way.netlify.com/introduction/introduction.html

The most powerful aspect of the technologies outlined above is that they **integrate** with each other. For example, you can conduct and document your analysis in a Jupyter notebook, save it publicly in a GitHub repository, and then use mybinder.org to allow others to reproduce your analysis using only their web browser.
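
As a rough illustration of how these pieces connect, the sketch below builds the kind of link that mybinder.org uses to launch a public GitHub repository. The `binder_url` helper is hypothetical, and the URL pattern and the `HEAD` reference are assumptions based on how such links commonly look; use the link generator on mybinder.org to confirm the exact form for your own repository.

```python
def binder_url(owner, repo, ref="HEAD"):
    """Build a mybinder.org launch link for a public GitHub repository.

    `ref` is a branch, tag or commit; "HEAD" is assumed here to mean the
    repository's default branch.
    """
    return f"https://mybinder.org/v2/gh/{owner}/{repo}/{ref}"


# Example using this training series' repository. Whether it is configured
# to run on Binder is an assumption, not something confirmed here.
url = binder_url("UKDataServiceOpen", "new-forms-of-data")
print(url)
```

Anyone who follows such a link gets a temporary copy of the repository running in their browser, provided the repository also lists the software it needs (for example in a `requirements.txt` file).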

<!-- More material added by JK. Just notes, really. -->
<!-- Citing sources 
Collaborative work and version control 
Replication. Replication. Replication. 
Sharing data  -->


In [None]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/a5i42lSj-L4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Conclusion

Hopefully this chapter has demystified aspects of CSS and whetted your appetite for some applied work. The subsequent chapters provide plenty of opportunity to practise CSS with various forms of data. For now I wanted to reflect on some outstanding issues.

<!-- #### Python vs R vs Julia vs ....

[Perhaps a table with some properties of each?] The general point is it's your choice.
 -->

## Bibliography

Brooker, P. (2020). Programming in Python for Social Scientists. London: Sage Publications.

Kimberg, D. Y., et al. (1997). "Effects of bromocriptine on human subjects depend on working memory capacity." Neuroreport 8(16): 3581-3585.

Simon, H. A. (1997). Models of Bounded Rationality: Empirically Grounded Economic Reason. MIT Press.

Pinker, S. (2003). The Language Instinct: How the Mind Creates Language. Penguin UK.

Kahneman, D. (2011). Thinking, Fast and Slow. Macmillan.

<!-- ## Further Reading and Resources

[Copy AQMEN reading lists] -->