# Fundamentals of Social Data Science 

Week 1. Day 1. Exercises from Chapter 1 of FSStDS 

Within your week 1 study pod, discuss the following questions. Please submit an individual assignment on Canvas by 12:30pm tomorrow, Tuesday October 11, 2022.

These will not be marked and are solely for recordkeeping and review upon request. They will, however, be discussed in the Tuesday tutorial and briefing.

# Exercise 1. Data as operationalisation 

In the book we talk about data as being measurements from the world. The measurements represent phenomena but are not the phenomena themselves. To this end, we think of social data science as a 'science of the operationalisation of the social world'. Below are several key concepts about the world that we can operationalise in a variety of ways. For each of the concepts think of: 
1. A way that you can measure this concept in a survey. Is there a scale that people have used? Has this been mentioned in an academic paper? 
2. A question that you can ask someone in order to get a more in-depth response about the topic.
3. A set of data from a social media platform that might correlate strongly with the measure, either directly or indirectly. Do you think that you could collect this data just by browsing, or would you need to access it in a more structured form?


| Topic                      | Survey Q. | Interview Q. | Trace data |
|----------------------------|-----------|--------------|------------|
| 1. Number of close friends |           |              |            |
| 2. Political affiliation   |           |              |            |
| 3. Preferred social media  |           |              |            |



## Answer 1.

Please fill in the table, either in the markdown or below just in text. 

As an optional challenge, think about looking online for sources where people have done any of these. Can you find at least one academic paper for each of the nine cells? 

__Answer below here:__ 
1. Number of close friends (as a survey Q, interview Q, and as trace data):
- Survey: "Name all your close friends:", "How often have you seen X in the last 6 months?"
    - Should you standardise the definition of close friends?
- Interview: "What is a close friend to you? Name your close friends?"
- Trace: Number of people co-tagged more than x times on BeReal (publicly accessible for many profiles), Mutual friends (Facebook) or reciprocal following (Instagram, Twitter)

2. Political affiliation (as a survey Q, interview Q, and as trace data):
- Survey: What party did you vote for in the last X election(s)? Who are you considering voting for in the next election? A classical Likert-style scale could also be used.
- Interview: How much do you relate to the party you voted for in the last election?
- Trace: Retweet behavior on Twitter (easily accessible, and people are more likely to retweet things they agree with), or which Facebook pages they react positively to.

3. Preferred social media (as a survey Q, interview Q, and as trace data):
- Survey: What is your preferred social media platform?
- Interview: What is your preferred social media platform? Why? What qualities do you like in a social media platform?
- Trace: Time spent (but it might reflect addictiveness more than preference), "positive" activity (e.g. likes or reactions)
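As a rough illustration of the trace measure suggested for close friends, the sketch below counts reciprocal follows per user from a list of directed "follows" pairs. The user names and the edge list are made up for the example, and `reciprocal_ties` is a hypothetical helper, not something from the book or from any platform API. Reciprocity is only a proxy here: it says two accounts follow each other, not that the people are close.

```python
from collections import defaultdict

# Hypothetical directed edges: (follower, followed). Made-up data.
follows = [
    ("ana", "ben"), ("ben", "ana"),   # reciprocal
    ("ana", "cy"),                    # one-way
    ("cy", "dee"), ("dee", "cy"),     # reciprocal
    ("ben", "cy"),                    # one-way
]


def reciprocal_ties(edges):
    """Count, for each user, how many accounts they follow that follow them back."""
    edge_set = set(edges)
    ties = defaultdict(set)
    for follower, followed in edge_set:
        if follower != followed and (followed, follower) in edge_set:
            ties[follower].add(followed)
    # Users with no reciprocal ties are simply absent from the result.
    return {user: len(others) for user, others in ties.items()}


counts = reciprocal_ties(follows)
for user in sorted(counts):
    print(user, counts[user])
```

Collecting such an edge list by browsing profiles one at a time would be slow, which is one reason structured access to the data matters for this kind of measure.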

__Answer above here__

# Exercise 2. FREE coding 

Take the following function and try to find a way to refactor it so that it can:
1. [Be functioning] Give the right output with the right input (find the bug), 
2. [Be robust] Give a missing value with the wrong input (what if we sent it a number?), 
3. [Be elegant] Have less repetition (can we simplify the `elif` statements?), 
4. [Be efficient] Use a more efficient algorithm (did the last line take care of all the inefficiencies?)

*Challenge*: The function takes a string and returns only vowels. That means `"y"` is an edge case. What will you do with it? The most sophisticated NLP packages might know which `"y"` is a vowel. Do we need to go that far? Can we warn people or somehow use a parameter for an option to include or exclude `"y"`?

```python
%%time

def return_only_vowels(text):
    newtext = ""
    for letter in text: 
        if letter == 'a':
            newtext += letter
        elif letter == 'e':
            newtext += letter
        elif letter == 'i':
            newtext += letter
        elif letter == 'o':
            newtext += letter
        elif letter == 'u':
            newtext += letter

    return text

text1 = "The quick brown fox jumped over the lazy dog"
text2 = "It was the best of times, it was the blurst of times."
text3 = "A stitch in time saves 9"

for i in range(10000):
    result_list = []
    for text in [text1, text2, text3]:
        result_list.append(return_only_vowels(text))

for result in result_list: 
    print(result)
```

```
The quick brown fox jumped over the lazy dog
It was the best of times, it was the blurst of times.
A stitch in time saves 9
CPU times: total: 438 ms
Wall time: 560 ms
```
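The `CPU times` and `Wall time` lines come from the `%%time` cell magic, which only works as the first line of an IPython or Jupyter cell. As a rough sketch, the standard-library `timeit` module gives a comparable measurement in a plain Python script. `count_letters` below is a hypothetical stand-in, not part of the exercise; swap in whichever function you want to time.

```python
import timeit


def count_letters(text):
    """Hypothetical stand-in function: count alphabetic characters."""
    return sum(1 for ch in text if ch.isalpha())


# Run the call 10,000 times and report the total elapsed time in seconds.
elapsed = timeit.timeit(lambda: count_letters("The quick brown fox"), number=10_000)
print(f"{elapsed:.3f} s for 10,000 calls")
```

Timings vary between machines and runs, so compare two versions of a function on the same machine rather than against the numbers printed above.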


## Answer 2. 

Below try to rewrite the function and the inputs so that it runs faster, returns the right text, handles bad input gracefully, and reads a little better: 

```python
%%time
from typing import Optional

def return_only_vowels(text: str) -> Optional[str]:
    # Robust: anything that is not a string gives a missing value.
    if not isinstance(text, str):
        return None
    # Lowercase first, so the returned vowels are all lowercase.
    return "".join([letter for letter in text.lower() if letter in 'aeiouæøå']) 


text1 = "The quick brown fox jumped over the lazy dog"
text2 = "It was the best of times, it was the blurst of times."
text3 = "A stitch in time saves 9"
notatext = 123

# The rest is up to you! 
for i in range(10000):
    result_list = [return_only_vowels(text) for text in [text1, text2, text3, notatext ]]
for result in result_list: 
    print(result)
```

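One possible way to handle the `"y"` challenge is an optional parameter. The sketch below is an assumption about how that could look, not the intended solution: the name `only_vowels`, the `include_y` flag, and the choice to keep the original letter case are all illustrative. It also uses a `frozenset` for the membership test and leaves out the extra vowels `æøå` from the answer above to keep the example short.

```python
from typing import Optional

BASE_VOWELS = frozenset("aeiou")


def only_vowels(text: object, include_y: bool = False) -> Optional[str]:
    """Return the vowels of `text` in order, keeping their case.

    Returns None when `text` is not a string.
    """
    if not isinstance(text, str):
        return None
    # Treat "y" as a vowel only when the caller asks for it.
    vowels = BASE_VOWELS | {"y"} if include_y else BASE_VOWELS
    return "".join(ch for ch in text if ch.lower() in vowels)


print(only_vowels("A stitch in time saves 9"))
print(only_vowels("lazy dog", include_y=True))
print(only_vowels(123))
```

Leaving the decision to the caller avoids guessing which `"y"` is a vowel, at the cost of treating every `"y"` in the text the same way.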


# Exercise 3. Pseudocode

Pseudocode a recipe for making a pizza! It should have a dough base, a sauce, and two toppings. No worries about making it more complicated even if a great pizza can be an art. 

Some questions: 
1. Will you ask the user for what toppings they want?
2. What assumptions will you make about the ingredients? That is, will you assume they are already cooked or otherwise prepared? 
3. What assumptions will you make about the pizza oven? 

## Answer 3. 

Below write the pseudocode. Share it with a friend of yours and ask: do you think they would make the same pizza as you with these instructions? What might vary? 

__Answer below here__: 
```
1. Create a list of available toppings 
2. Create a list of available styles (Bianca and rosso)
3. Ask the user to pick a base
4. Ask the user to pick up to two valid toppings (in a while loop)
5. Turn on the oven at 250°C (async def)
6. Roll out the dough
7. Add the sauce (to the dough)
8. Add the cheese
9. Add the toppings (for loop)
10. Put in the oven for 10 minutes
11. Turn the pizza around
12. Bake until done (while loop)
```
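To see how much the pseudocode above leaves open, here is one way it might translate into Python. This is a sketch under assumptions: the topping menu is made up, `plan_pizza` is a hypothetical name, user input is replaced by function arguments so the example runs without prompts, and the oven is reduced to text steps rather than anything asynchronous.

```python
AVAILABLE_TOPPINGS = ["mushrooms", "olives", "ham", "basil"]  # hypothetical menu
AVAILABLE_STYLES = ["bianca", "rosso"]  # the two styles named in the pseudocode


def plan_pizza(style, toppings, max_toppings=2):
    """Return the ordered list of steps for one pizza, or raise ValueError."""
    if style not in AVAILABLE_STYLES:
        raise ValueError(f"Unknown style: {style!r}")
    if len(toppings) > max_toppings:
        raise ValueError(f"Pick at most {max_toppings} toppings")
    invalid = [t for t in toppings if t not in AVAILABLE_TOPPINGS]
    if invalid:
        raise ValueError(f"Unavailable toppings: {invalid}")

    steps = [
        "Turn on the oven at 250°C",
        "Roll out the dough",
        f"Add the {style} sauce to the dough",
        "Add the cheese",
    ]
    for topping in toppings:  # the "for loop" step
        steps.append(f"Add {topping}")
    steps += [
        "Put in the oven for 10 minutes",
        "Turn the pizza around",
        "Bake until done",
    ]
    return steps


for number, step in enumerate(plan_pizza("rosso", ["mushrooms", "basil"]), start=1):
    print(f"{number}. {step}")
```

Even in code, "bake until done" stays undefined, which is exactly the kind of step two people following the same instructions might carry out differently.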
...

__Answer above here__

# Exercise 4. What data is available for whom? 

Jeremy Singer-Vine has been compiling a list of really interesting data sets for several years. He shares these via his mailing list "Data Is Plural". The most recent version of this list is available [here on Google Sheets](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0). 

Browse through this list of data sets. Below are some questions to ask of any given row signifying a data set:

A. By viewing the summary of the data, give an example of a distribution that could be stored and summarised.

B. With this data, what is excluded? 
  - Would certain cases or classes of people/things be excluded that could be considered? 
  - What other data about the existing cases could be useful or interesting? 
  - Could we merge in data to compensate or would we need to do a separate data collection effort?
  - Would accessing this data be ethically reasonable for academic research?

## Answer 4. 

Please select a data set and answer the questions above. 

__Answer below here__:

I have chosen the Kickstarter data from [webrobots.io](https://webrobots.io/kickstarter-datasets/). The company does a monthly scrape of data and metadata on all Kickstarter projects. 

### A: Summary of the data
The dataset has many columns, covering timestamps, campaign information, and funding. The nature of the summary depends, of course, on the column and its datatype: timestamps afford summaries such as average frequency and periodic patterns (i.e. seasonality), whereas the numeric columns (like funding amount and number of backers) afford more classical summary statistics.
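The two kinds of summary can be sketched with the standard library. The rows below are made up for illustration and are not taken from the real dataset, and the field names `launched` and `backers` are assumptions rather than the dataset's actual column names.

```python
from collections import Counter
from datetime import date
from statistics import mean, median

# Made-up rows for illustration only; not taken from the real dataset.
projects = [
    {"launched": date(2022, 1, 15), "backers": 12},
    {"launched": date(2022, 1, 28), "backers": 150},
    {"launched": date(2022, 3, 2), "backers": 3},
    {"launched": date(2022, 3, 19), "backers": 47},
    {"launched": date(2022, 3, 30), "backers": 820},
]

# Timestamp column: count launches per month to look for periodic patterns.
launches_per_month = Counter(p["launched"].strftime("%Y-%m") for p in projects)

# Numeric column: classical summary statistics.
backers = [p["backers"] for p in projects]
mean_backers = mean(backers)
median_backers = median(backers)

print(dict(launches_per_month))
print(f"mean backers: {mean_backers}, median backers: {median_backers}")
```

In this toy sample the mean sits far above the median because of one very large project, a reminder that a single summary number can hide the shape of a skewed distribution.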


### B: What is excluded? 
- Would certain cases or classes of people/things be excluded that could be considered? 
    - Kickstarter has a quite [restricted user base](https://help.kickstarter.com/hc/en-us/articles/115005128594-Who-can-use-Kickstarter-). Only people aged 18 or over with a bank account in a relatively narrow set of countries, predominantly in Europe and North America, can create campaigns. This excludes most of the Global South as well as less privileged individuals from the included countries. Furthermore, the dataset only contains information available on the Kickstarter website itself. 
- What other data about the existing cases could be useful or interesting? 
    - Kickstarter projects are largely driven by hype (CITATION MISSING but probably true). Therefore it would be highly relevant to investigate how the projects are talked about on social media and in the traditional media, and whether any specific advertising campaigns are undertaken. 
- Could we merge in data to compensate or would we need to do a separate data collection effort?
    - Getting media data would probably require a separate data collection effort, as it relies on different sources (like Twitter and Reuters). This would also drastically increase the complexity of the data, as linking the datasets could be tricky, especially if the projects are only mentioned by name and not explicitly linked. Some kind of [named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) pipeline could probably be set up, but this would be a lot of work. 
- Would accessing this data be ethically reasonable for academic research?
    - Webrobots states in its [terms and conditions](https://webrobots.io/terms-and-conditions/) that the datasets are provided as is. Given that the datasets are publicly accessible, there should be no major ethical issues with analysing the data. 
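To make the linking problem mentioned above concrete, here is a deliberately naive baseline: case-insensitive matching of known project names in a piece of media text. This is plain string matching, not named entity recognition, and the project names, the text, and the helper `find_mentions` are all hypothetical.

```python
import re


def find_mentions(project_names, text):
    """Return the project names that appear in `text` as whole phrases."""
    found = []
    for name in project_names:
        # Lookarounds stop the name from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


# Hypothetical project names and media text, made up for the example.
names = ["Solar Kettle", "Pocket Loom"]
article = "Backers are excited about the solar kettle, while pocket looms remain niche."

print(find_mentions(names, article))
```

The plural "pocket looms" is missed here, and a project with a common name such as "Echo" would match far too often, which is why a more careful pipeline would be needed in practice.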

...

__Answer above here__