<span style="font-size: 20pt;"><span style="font-weight: bold;">Chapter 1.</span> Introduction: Thinking of life at scale</span>

Last update: 12 January 2024 

Thank you for checking out the code for: 

> Hogan, Bernie (2022, forthcoming) _From Social Science to Data Science_. Sage Publications. 

This notebook contains the code from the book, along with the headers and additional author notes that are not in the book as a way to help navigate the code. You can run this notebook in a browser by clicking the buttons below. 
    
The version uploaded to GitHub should have all the results pasted, but the best way to follow along is to clear all outputs and then start afresh. To do this in Jupyter, go to the menu and select "Kernel -> Restart Kernel and Clear all Outputs...". To do this on Google Colab, go to the menu and select "Edit -> Clear all outputs".
    
The most up-to-date version of this code can be found at https://www.github.com/berniehogan/fsstds 

Additional resources and teaching materials can be found on Sage's forthcoming website for this book. 

All code for the book and derivative code on the book's repository is released open source under the  MIT license. 
    

[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/berniehogan/fsstds/main?filepath=chapters%2FCh.01.Introduction.ipynb)[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/berniehogan/fsstds/blob/main/chapters/Ch.01.Introduction.ipynb)

<span style="font-size: 20pt;">📺 YouTube Video lecture for this chapter 📺</span>

In [11]:
from IPython.display import YouTubeVideo

YouTubeVideo('3SFpWNCPOi4')

# Introduction: From Social Science to what?

# (PO)DIKW - A potential theoretical framework for Data Science 

The book discusses DIKW, a framework from visualisation that delineates a hierarchy: Data > Information > Knowledge > Wisdom. It's not perfect, and I describe in the text why we first ought to consider phenomena and operationalisation, hence (PO)DIKW. 
    

## What is data? 

## From Data to Wisdom. 

# Beyond the interface 

# Fixed, variable, and marginal costs: Why not to build a barn.

The examples below are perhaps not ideal, but the logic is important. In this section I discuss how to consider fixed and marginal costs in programming. A fixed cost is paid once, whereas marginal costs are the recurring or ongoing costs. So we would pay the fixed cost for an oven, with the marginal costs being the electricity, the cleaning, and the food we put in the oven. 

In programming, marginal costs happen with repetitive code. We should write once and reference in many places where possible. This can increase the fixed costs (i.e. you might need to write more general code) but then it can be used repeatedly. This is not always advised in research, however. This is because the fixed cost of doing something sufficiently general might be really high (leading to "over-engineering"). Finding the optimal route takes time and finesse. 

## From Economics to Data Science 

In [1]:
# Attempt 1. High marginal costs, low fixed costs 
email1 = "user.example@mail.com"
email_parts = email1.split("@")
name1 = email_parts[0]

email2 = "generic.student@oii.ox.ac.uk"
email_parts = email2.split("@")
name2 = email_parts[0]

print(name1,name2)

user.example generic.student


In [2]:
# Attempt 2. Low marginal costs
email_list = ["user.example@mail.com",
              "generic.student@oii.ox.ac.uk",
              "dr.professor@oii.ox.ac.uk"]

print([x.split("@")[0] for x in email_list])

['user.example', 'generic.student', 'dr.professor']
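As a further sketch, the two attempts above could be pushed one step towards a higher fixed cost by wrapping the logic in a function that is written once and referenced anywhere. The name `local_part` and the choice to return `None` for unusable input are assumptions made for this illustration; they are not taken from the book.

```python
def local_part(email):
    """Return the text before the last "@", or None if the input is unusable."""
    # Paying a little more fixed cost: guard against non-strings and missing "@".
    if not isinstance(email, str) or "@" not in email:
        return None
    name, _, _ = email.rpartition("@")
    return name or None  # an address such as "@mail.com" has no local part


email_list = ["user.example@mail.com",
              "generic.student@oii.ox.ac.uk",
              "dr.professor@oii.ox.ac.uk"]

print([local_part(e) for e in email_list])
print(local_part("not-an-email"))
```

Whether this extra generality is worth it depends on how often the code will be reused; for a one-off analysis the list comprehension in Attempt 2 may be enough.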


## The challenges of maximising fixed costs 

# Code should be FREE

Here I introduce the FREE mnemonic for prioritising code. All code should be functioning, in the sense that with the right inputs you get the right outputs. Then code should be robust, meaning that it should account for bad input or other edge cases that might break it. Code should be elegant, in the sense that its structure makes sense and the language used reflects well the features of the programming language. Finally, we do want code to be efficient, but not at the expense of robustness or functionality. 

## Functioning code

In [3]:
def square(number):
    squarednumber = number * number  
    return squarednumber

print(square(3))

9


## Robust code

In [4]:
import numbers 

In [5]:
def square(number):
    # pre-emptively checking for inclusion
    if isinstance(number, numbers.Number):
        squarednumber = number * number  
        return squarednumber
    else:
        return float("NaN")

print(square("b"),square(2))

nan 4


In [6]:
def square(number):
    # duck typing to handle exclusion
    try:
        squarednumber = number * number  
        return squarednumber
    except TypeError:  # catch only the failed multiplication, not every error
        return float("NaN")

print(square("b"),square(2))

nan 4
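The two robust versions of `square` above behave the same for `"b"` and `2`, but they can differ for other inputs. The sketch below illustrates one such case. The function names `square_checked` and `square_duck` and the `Metres` class are hypothetical, invented only for this example.

```python
import numbers


def square_checked(number):
    """Check first: accept only values registered as numbers.Number."""
    if isinstance(number, numbers.Number):
        return number * number
    return float("nan")


def square_duck(number):
    """Duck typing: accept anything that can be multiplied by itself."""
    try:
        return number * number
    except TypeError:
        return float("nan")


class Metres:
    """Hypothetical unit type, used here for illustration only."""

    def __init__(self, value):
        self.value = value

    def __mul__(self, other):
        return Metres(self.value * other.value)


length = Metres(3)
print(square_checked(length))     # nan: Metres is not a numbers.Number
print(square_duck(length).value)  # 9: Metres defines __mul__
```

Neither behaviour is more correct in general; which one counts as robust depends on the inputs you expect to receive.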


## Elegant

In [7]:
def square(number):
    if isinstance(number, numbers.Number):
        return number * number
    return float("NaN")

print(square("b"),
      square(2))

nan 4


In [8]:
def to_exponent(number, power = 2):
    if isinstance(number, numbers.Number):
        return number ** power
    return float("NaN")

print(to_exponent("b"),
      to_exponent(2),
      to_exponent(3,3))

nan 4 27


## Efficient 

In [9]:
%%time

newlist = []

for i in range(1000):
    newlist.append(i)

CPU times: user 34 µs, sys: 3 µs, total: 37 µs
Wall time: 39.8 µs


In [10]:
newlist = [] 

%timeit -n 1000 for i in range(500): newlist.append(i)

%timeit -n 1000 newlist = [i for i in range(500)]

%timeit -n 1000 newlist = list(range(500))

7.66 µs ± 702 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
5.17 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
2.5 µs ± 55 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
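One caveat about the comparison above: the first `%timeit` statement appears to append to the same `newlist` on every repetition, so that list keeps growing while the other two statements build a fresh list each time. A minimal sketch of a like-for-like comparison, using the standard library `timeit` module, is given here. The function names are assumptions for this illustration, and the timings will vary by machine.

```python
import timeit


def build_with_append(n=500):
    result = []  # a fresh list on every call
    for i in range(n):
        result.append(i)
    return result


def build_with_comprehension(n=500):
    return [i for i in range(n)]


def build_with_list(n=500):
    return list(range(n))


repeats = 1000
for builder in (build_with_append, build_with_comprehension, build_with_list):
    seconds = timeit.timeit(builder, number=repeats)
    print(f"{builder.__name__}: {seconds / repeats * 1e6:.2f} µs per call")
```

All three functions return the same list, so any difference in timing reflects only how the list is built.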


# Pseudocode (and pseudo-pseudocode)

The book gives several examples of pseudocode. There is the formal pseudocode of computer science, but also looser pseudocode, as in a recipe. You should practise pseudocode, as it can help you structure your program rather than always having to worry about programming line by line. It also helps us consider elegance, as pseudocode can delineate specific parts of a program to be treated as modular. 
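As a rough illustration of moving from recipe-style pseudocode to code, the sketch below writes the steps as comments first and then fills in each one. This is not one of the book's own examples; the task (counting email domains) and the name `count_domains` are assumptions chosen to match the email addresses used earlier in this notebook.

```python
from collections import Counter

# Recipe-style pseudocode:
#   start with an empty tally
#   for each email address in the list:
#       split the address at the last "@"
#       if there is no "@", skip the address
#       add one to the tally for that domain
#   report the tally


def count_domains(emails):
    tally = Counter()                            # start with an empty tally
    for email in emails:                         # for each email address
        _, sep, domain = email.rpartition("@")   # split at the last "@"
        if not sep:                              # no "@": skip the address
            continue
        tally[domain.lower()] += 1               # add one for that domain
    return tally                                 # report the tally


email_list = ["user.example@mail.com",
              "generic.student@oii.ox.ac.uk",
              "dr.professor@oii.ox.ac.uk"]

print(count_domains(email_list).most_common())
```

Each line of the pseudocode maps onto one line of Python, which makes it easier to spot a step that deserves its own function.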

## Attempt 1. Pseudocode as written word

## Attempt 2. Pseudocode as mathematical formula

## Attempt 3. Pseudocode as written code 

## Attempt 4. Slightly more formal pseudocode (in a Python style)

# Summary

The summary, further reading, and extensions are found in the book. You can also find exercises related to the chapters in this repository. 

# Further reading 

# Extensions and reflections 