# An Introduction to Social Data Science

## Lecture One

### An Introduction to Python: Part One

#### 9th August, 2021

Welcome! This is the first of three lectures in the 'Introduction to Social Data Science' module, which teaches the fundamentals of Python. Please see the readme.md file in the [GitHub repository](https://github.com/crahal/Teaching/tree/master/PythonForSociologists) for the course handbook. Please note: this module revolves around Python 3.x. You can follow the examples and homeworks in a text editor and execute the scripts via the command line, but [Jupyter Notebooks](https://jupyter.org/) are advised! All important Python concepts are in bold typeface.

In today's lecture we will cover primitive data types (characters and numbers) and introduce various types of object (including types of collection), before moving on to for loops. Without getting ahead of ourselves, we can note at this early stage that in Python everything is an object (except control flow statements), and that it is an object-oriented language (although not a 'pure' one).

## Section 1.1: Primitive Data Types

Before we get to objects, which are the abstract building blocks of data, let's first introduce some primitive data types:

* Characters: 'A', '!', '1' -- one single 'glyph'
* Numbers, of which there are two types:
  * Integer: 1, 2, 3, -500, +600
  * Float: 1.123, 0.1232534, -4123.123123

### Section 1.1.1: Characters

A character is a single 'glyph' that is included in a character set. Some characters are visible, such as the letter A (or 'a' - note case sensitivity will be *extremely* important) and some are invisible. Within this specific document, multiple characters make content in 'markdown', or can be joined together to form variable names or control the flow of a Python program. There are three important invisible characters:

1. Space (the common whitespace which is separating these words)
2. Tab (which is usually about four spaces long - more in Lecture Two when we talk about indenting)
3. Newline character (which tells the computer to move to the next line)

Let's look at some examples, and introduce the **print()** command:

In [1]:
print('1. hello friends! how are you today?')
print('2. hello friends! \nhow are you today?')
print('3. hello friends! \thow are you today?')

and the **type()** command

In [2]:
type('this is a group of characters')

In [3]:
type(1)
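
Because tabs and newlines are invisible, it can help to ask Python to show them explicitly. The following is a small sketch (the `greeting` string is made up for illustration) that uses the built-in `repr()` and `len()` functions to reveal what is really stored in a string:

```python
# repr() shows a string the way Python stores it, escape sequences included.
greeting = 'hello\tfriends!\nhow are you?'  # illustrative string, not from the lecture

print(greeting)        # the tab and the newline are rendered as whitespace
print(repr(greeting))  # the tab and the newline are shown as \t and \n
print(len('\t'))       # an escape sequence still counts as one single character
```

Note that `'\t'` is typed as two keystrokes but is stored as one character.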

### Section 1.1.2: Numbers

Numbers come in several **types** (of object), but two are specifically important to mention:

1. Integers - whole numbers, such as 1 or 42,
2. Floating point numbers - these allow for decimal points such as 2.17 or 3.33

Let's first consider this:

In [4]:
12 / 5

What's happened here? We've divided two integers to get a float! Floats can be distinguished from integers because they have a fractional part. We can force a number to be either an integer or a float using int() and float():

In [5]:
int(12/5)

In [6]:
float(12/5)

In [7]:
type(3)

In [8]:
type(int(12/5))

In [9]:
type(12/5)

In [10]:
type(3)
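
One detail worth seeing for yourself is that `int()` does not round: it simply drops the fractional part. The sketch below (all variable names are illustrative) compares `int()` with the built-in `round()` and with the floor division (`//`) and remainder (`%`) operators:

```python
ratio = 12 / 5                  # true division always returns a float

truncated = int(ratio)          # int() drops the fractional part
rounded = round(2.7)            # round() goes to the nearest whole number
floored = 12 // 5               # floor division returns the whole part
remainder = 12 % 5              # the remainder left over after floor division

negative_truncated = int(-2.4)  # truncation moves towards zero
negative_floored = -12 // 5     # floor division moves downwards

print(ratio, truncated, rounded, floored, remainder)
print(negative_truncated, negative_floored)
```

The difference between truncating and flooring only shows up for negative numbers, which is why both are included here.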

## Section 1.2: Strings

Naturally, a character isn't very useful on its own! Multiple characters together form a _string_. In Python, strings are enclosed in quotation marks, and you can use either single or double quotes to do so. Two things to remember:

1. Always close the string with the same quotes used to open it.
2. Always escape a quotation character if you use it inside the string.

```python
'This is a string.'

"This is also a string."

'''This is yet another string!
It can even run across
several lines.'''

''This is not a string, but why?''
```

In Python 3.x, ```print``` is a function, so whatever we want to print must go inside parentheses, like the following:
```python
print("This is a string!")
```

How does this differ from Python 2.x? 

Now let's **assign** a string to a variable, then print the variable

In [11]:
SomeVar = "This is a string. It has been assigned to a variable called 'SomeVar'"
print(SomeVar)

A variable is a name that is given to an object whose contents can change.

There are some really important variable naming conventions:

* ALLCAPS should be used to represent a variable that we want to keep constant, like a secret (API) key.

* A leading underscore (```_name``` or ```__name```) marks a variable as internal: it shouldn't be referenced directly from outside.

* Variable names must start with a letter or an underscore (and not a number).

* use a consistent style, such as camelCaseNames or underscore_variable_names.

* alllowercasenounderscorenames are hard to read.

If we want to combine a float or int with a string (for example, to print them together), we have to convert the number to a string first. Numbers on their own can be printed directly.

In [12]:
RandomNumber = 2834
print('When joining numbers to text, we need to convert them into a string first: ' + str(RandomNumber))
print(5.00+RandomNumber)

### Section 1.2.1: 📝 Character Sets 📝

#### Section 1.2.1.1: ASCII

Strings are drawn from character sets. Loosely, 'the alphabet' is a character set, but not a very useful one, because it's so limited. The basic Western character set is ASCII. It has 128 code points. The first 32 are control characters, like 'new line', and the remainder are mostly the upper and lower case alphabet, the ten digits, the space and punctuation characters. ASCII is not really sufficient for most languages or most of our data intensive purposes.
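
To make the idea of a 'code point' concrete, the built-in functions `ord()` and `chr()` convert between a character and its number. This is a minimal sketch with illustrative variable names:

```python
upper_a = ord('A')    # code point of upper-case A
lower_a = ord('a')    # lower case letters sit further along the table
letter = chr(66)      # chr() goes the other way, from number to character
newline = ord('\n')   # the newline control character lives in the first 32

print(upper_a, lower_a, letter, newline)
```

This also shows why case matters so much: 'A' and 'a' are simply different code points.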

#### Section 1.2.1.2: UTF-8

Unicode is a very large character set, with room for over a million code points. As such, Unicode includes most characters from most languages around the world, as well as the emergent emoji character set. UTF-8 is the most common way of encoding those code points as bytes, and Python 3 uses it by default for source files, which makes working with Unicode pretty straightforward.
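
The difference between characters and the bytes that store them can be sketched with the string methods `encode()` and `decode()`. The word used here is just an illustration, and the accented letter is written as a code point escape so that the example does not depend on how your editor saves the file:

```python
word = 'caf\u00e9'               # 'cafe' with an acute accent on the final letter

encoded = word.encode('utf-8')   # turn the characters into UTF-8 bytes
decoded = encoded.decode('utf-8')  # and turn the bytes back into characters

print(encoded)                   # the accented letter needs two bytes
print(len(word), len(encoded))   # four characters, five bytes
print(decoded == word)
```

Characters in the ASCII range take one byte each in UTF-8, while other characters take between two and four.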

#### Section 1.2.1.3: Emoji

Emoji is an emerging Unicode standard for pictograms. You cannot rely on every computer displaying every emoji, or more nuanced emojis such as the Apple skin tone emojis. If you want to reference an emoji in Python, you'll need to know the Unicode definition. You can print emoji, but not really do much else with them. I mention emoji only to highlight that you don't need to strip your text right down to ASCII to work with it anymore, and this opens up new research questions. (Note: pictographic languages such as Chinese are far easier to deal with in Python 3 than 2.)

In [13]:
print('\U0001f334') # This is the emoji code point. 
print('\U0001f334'.encode('utf-8')) # These are the bytes that store it (a 'bytestring')
print('🌴') # You can print emoji directly
print('🌴' in 'Yeah, great job! 🌴')

### Section 1.2.2: String manipulation

It is critical to note that, unlike in some other languages, strings are [indexed starting from 0](https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html) and work sequentially forward. So in the string "python is the best", there is a 'p' as the 0th element and an 'n' as the 5th.

What other languages are zero indexed? What are not?

Let's look at string indexing:

In [14]:
variable = "python is the best"
print(variable[0], variable[1], variable[2],
     variable[3], variable[4], variable[5])

The above hopefully shows us that a string behaves much like a list of characters (as in a series of characters that one would string together).

##### Pop Quiz:

Can you print out the 'b' from this variable? Can you think of a shortcut to do it?

In [15]:
print(variable.find('b'))

In [16]:
print(variable.isalnum())
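
One possible way to approach the pop quiz is sketched below. It combines `find()`, which returns a position, with ordinary indexing, and also shows a negative index counting back from the end of the string:

```python
variable = "python is the best"

position = variable.find('b')  # where does the first 'b' occur?
print(position)
print(variable[position])      # index the string at that position
print(variable[-4])            # or count four characters back from the end
```

The negative index works here only because 'b' happens to be the fourth character from the end.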

Let's look at some standard string **methods** (a method is 'attached' to an **object**):

* upper: change to upper case
* lower: change to lower case
* title: change to title case (capitalize: upper-case only the first character)
* find: return index of first instance of input
* isalnum: is this string alphanumeric?
* isalpha: is this string just letters?
* replace: find all instances of something and change to something else
* strip: remove whitespace characters from both ends of a string (useful when reading in from a file)

The period [.] is used to link the object to the method. So if we have a string object:

"This is an object"

we can change it to upper case by attaching the 'upper' method like so:

"This is an object".upper()

Note that some methods take **arguments** (such as ```find``` and ```replace``` below). Try it below using ```variable``` from above:

In [17]:
print(variable)
print(variable.upper())
print(variable.lower())
print(variable.title())
print(variable.find('i'))
print(variable.isalnum())
print(variable.isalpha())
print(variable.replace(' is ',' is not ').replace('a string', 'a banana')) #we can 'chain' methods together
print(variable.strip(' '))
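
It is worth stressing that none of these methods change the original string: strings are immutable, so each method returns a new string. The sketch below (with illustrative variable names) shows this, along with the difference between `title()` and `capitalize()`:

```python
variable = "python is the best"

shouted = variable.upper()  # upper() returns a new string
print(shouted)
print(variable)             # the original is unchanged

# To keep a change, assign the result back to a name.
variable = variable.replace('best', 'most fun')
titled = variable.title()            # every word starts with a capital
capitalized = variable.capitalize()  # only the first character is a capital
print(titled)
print(capitalized)
```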

We can also get help on specific methods using a syntax such as ```help(somevar.title)``` (similar to other programming languages). We can also get a list of all methods associated with an object using ```dir(object)```. To determine the *type* of object, we can utilize ```type(object)```, and to get detailed help on any object or method ```help(object)```. Let's try a couple out:

In [18]:
type(variable)

In [19]:
dir(variable)

In [20]:
help(str)

### Section 1.2.3: Special Characters

What if we need to print a quotation character inside a string that uses quotes? Introducing the very important **escape character**! The escape character is the backslash: in order to print a quotation rather than use it to end the string, you would type:

```python
"Escaping a \" in a string"
```

However, sometimes you can sidestep this by using a different quotation type within the string itself:

```python
"This will 'work'."
'This will also "work".'
'''This will work for both " and ' types.'''

```
The triple quote is used for block quotes such as at the top of a function docstring, where you can just keep writing across lines. Let's try it out, and escape the escape character also (with a newline thrown in for good measure):

In [21]:
print('This will also "work".')
print("If you haven't inserted \\ characters\nThis will be \"totally\" broken")

In [22]:
''' This is a commented out block using
triple single
quotes!
'''

### Section 1.2.4: Combining strings

Imagine you have two words that you wish to add together, such as 'Data' and 'Science'. There are several ways to do this.

#### Section 1.2.4.1: Concatenation

In Python the + symbol means **concatenate** when it appears between two strings. This is the simplest way to combine two strings. Try it out, but don't forget the whitespace if and where necessary:

In [23]:
var1 = "Social"
var2 = " Data "
var3 = "Science!"
print(var1 + var2 + var3)
print(type(var1+var2+var3))

Note that the + symbol is used for both addition *and* concatenation. So be careful: if you mix strings and numbers, Python will raise a ```TypeError``` and show a **traceback** (try ```print(1 + '2')```: what do you get?).

To make a number into a string, you can use the string function (```str()```)

In [24]:
num = 123
strNum = str(num)
print(strNum)
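
To see the error from mixing strings and numbers without stopping the notebook, one option is to wrap the offending line in `try` and `except`, which are covered properly later in the course. This is only a sketch, and the variable names are illustrative:

```python
try:
    result = 1 + '2'           # mixing an int and a str is not allowed
except TypeError as error:
    result = None
    print('Python refused:', error)

fixed_as_text = str(1) + '2'    # concatenation: both sides are strings
fixed_as_number = 1 + int('2')  # addition: both sides are numbers

print(fixed_as_text, fixed_as_number)
```

Which fix is right depends on whether you wanted text or arithmetic, and Python refuses to guess for you.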

#### Section 1.2.4.2: Insertion

Sometimes you want to insert something in the middle of a statement but don't want to merely concatenate. Maybe you have a collection of things and want to insert them in a lot of places. The bonus is that you can also print digits really nicely this way.

In [25]:
print("Pi to two decimal places is %1.2f. Isn't that convenient?" % 3.14159)
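
The `%` style above still works, but you will also meet two newer ways of inserting values into a string: the `format()` method and 'f-strings' (available from Python 3.6). The following sketch shows all three producing the same text, using illustrative variable names:

```python
pi_estimate = 3.14159

old_style = "Pi is roughly %1.2f" % pi_estimate
format_method = "Pi is roughly {:.2f}".format(pi_estimate)
f_string = f"Pi is roughly {pi_estimate:.2f}"  # note the f before the quote

print(old_style)
print(format_method)
print(f_string)
```

Which one to use is largely a matter of taste, although f-strings tend to be the easiest to read.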

#### Section 1.2.4.3: Joining

Sometimes you want to join strings together with a specific separator:

In [26]:
";".join(["I want to"," join this together"])

More commonly, you want to join a list of words on whitespace to make a sentence: ```' '.join(list)```.
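
As a small sketch of that whitespace join (the list of words is made up for illustration), and of the fact that `join()` only accepts strings:

```python
words = ['Social', 'Data', 'Science']
sentence = ' '.join(words)  # the separator goes between each pair of items
print(sentence)

# join() only works on strings, so numbers must be converted first.
year_text = ', '.join([str(2019), str(2020), str(2021)])
print(year_text)
```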

#### Section 1.2.4.4: Splitting

If you can join strings together, you can also split them (into a **list**: see Section 1.3.1 below -- this will be our first **collection**)! This is crucial for data cleaning, especially with _free text_ like social media data. The default way to split the data is using the whitespace character, but we can also split on specific substrings:

In [27]:
BigChunk = 'Lets split this into chunks'
print(BigChunk.split(' '))
print(type(BigChunk.split(' ')))

['Lets', 'split', 'this', 'into', 'chunks']
<class 'list'>
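
Real free text is rarely this tidy. The sketch below uses a deliberately messy, made-up string to show how `split(' ')` differs from `split()` with no argument, which splits on any run of whitespace:

```python
messy = 'Lets  split\tthis   into chunks\n'  # double spaces, a tab, a newline

on_single_space = messy.split(' ')  # empty strings appear between repeated spaces
on_any_whitespace = messy.split()   # runs of spaces, tabs and newlines are merged

print(on_single_space)
print(on_any_whitespace)

# An optional second argument limits the number of splits.
first_and_rest = 'a,b,c'.split(',', 1)
print(first_and_rest)
```

For cleaning text data, `split()` with no argument is usually the safer starting point.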


## Section 1.3: Collections

Virtually every programming language has a notion of a collection. A collection is a means for referring to one or more things at the same time, and Python has many collection types (note: a string can be thought of as a joined up list of characters). In general, collections are **iterable**, which means that you can ask for each item in the collection one-by-one. But beyond that they vary quite dramatically. Here are the major collection *types* that you will come across in Python:

### Section 1.3.1: Lists

A list is a sequential (the order is relevant), zero-indexed (first item is indexed at 0, just as with strings) and *mutable* (you can add or delete elements) collection signified by ```[... , ...]``` (we saw this above with the split). Let's make and play around with a list:

In [28]:
libraries = ["matplotlib", "numpy", "pandas"]

print('My first and third favourite python libraries are: ' +
      libraries[0] + ' and ' + libraries[2])

My first and third favourite python libraries are: matplotlib and pandas


How about appending to this list?

In [29]:
libraries.append("seaborn")
print(libraries)

['matplotlib', 'numpy', 'pandas', 'seaborn']
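
Appending is only one of the ways in which a list can change. The sketch below runs through a few other common list operations; the extra library names are just examples:

```python
libraries = ['matplotlib', 'numpy', 'pandas', 'seaborn']

libraries.insert(1, 'networkx')  # insert at a given position
libraries.remove('seaborn')      # remove the first matching item
last = libraries.pop()           # remove and return the final item
libraries[0] = 'plotnine'        # overwrite an item by its index

print(libraries)
print(last)
print(len(libraries))            # how many items are left?
print('numpy' in libraries)      # membership test
```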


### Section 1.3.2: Tuples

A tuple is a sequential, zero-indexed and *immutable* collection signified by ```(... , ...)```: a list you can't change, denoted by parentheses rather than square brackets. Tuples are used in lots of places where you don't want a collection to change size, or where you want your object operations to be a little faster than with a list:

```python
libraries = ("matplotlib", "numpy", "pandas")
```
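
To see what 'immutable' means in practice, the sketch below tries to change a tuple and catches the resulting error. It uses `try` and `except`, which are covered later in the course, and all variable names are illustrative:

```python
libraries_tuple = ("matplotlib", "numpy", "pandas")

try:
    libraries_tuple[0] = "seaborn"  # tuples do not allow item assignment
except TypeError as error:
    print('Tuples cannot be changed:', error)

# If you do need to change the contents, convert to a list first.
libraries_list = list(libraries_tuple)
libraries_list.append("seaborn")
print(libraries_list)

single = ("numpy",)  # a one-item tuple needs a trailing comma
print(type(single))
```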

#### Section 1.3.2.1: Querying and Slicing Lists/Tuples 

You can index a list or tuple just like a string, and just like strings, you can ask for a range of values (a 'slice') using a colon. A single index that is out of range raises an error, whereas a slice that runs past the end simply stops at the last element:

```python
libraries[0:2]
```

Note here that the return is itself a list (or a tuple, if ```libraries``` is a tuple). If we want a specific string, we can index the result:

```python
libraries[0:2][0]
```

You can also index from the end of the list/tuple/string to walk backwards. This is done with negative numbers:

```python
libraries[-1]
```
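
The slicing rules can be sketched as follows, assuming ```libraries``` is a three-item list. Note how a slice is forgiving about its end point, while a single index is not:

```python
libraries = ["matplotlib", "numpy", "pandas"]

past_the_end = libraries[1:10]  # a slice just stops at the last item
first_two = libraries[:2]       # a missing start means 'from the beginning'
backwards = libraries[::-1]     # a step of -1 walks through in reverse
every_other = libraries[::2]    # a step of 2 takes every second item

try:
    libraries[10]               # a single out-of-range index is an error
except IndexError as error:
    print('Indexing failed:', error)

print(past_the_end, first_two, backwards, every_other)
```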

### Section 1.3.3: Dictionaries

A dictionary is a key-indexed and mutable collection signified by ```{... : ... , ... : ... }```. Since Python 3.7 a dictionary remembers the order in which its keys were inserted, but you look things up by key rather than by position. Like in English, where a dictionary defines a word, a dictionary in Python uses a key to fetch a value (we are in the world of 'key-value' pairs):

```python
FamousSociologists = {"Marx": "1", "Weber": "2", "Durkheim":"3"}
```

In [30]:
FamousSociologists = {"Marx":"1","Weber":"2", "Durkheim":"3"}
FamousSociologists['C. Wright Mills']="4" #add a new key:value pair 'on the fly'

print(FamousSociologists.keys())
print(FamousSociologists.values())
print(FamousSociologists.items())
print(FamousSociologists['Weber'])

dict_keys(['Marx', 'Weber', 'Durkheim', 'C. Wright Mills'])
dict_values(['1', '2', '3', '4'])
dict_items([('Marx', '1'), ('Weber', '2'), ('Durkheim', '3'), ('C. Wright Mills', '4')])
2
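
Asking for a key that does not exist raises a ```KeyError```, so it is often useful to check first or to supply a fallback. The sketch below shows a few options; the key 'Du Bois' is a made-up example of a missing key:

```python
FamousSociologists = {"Marx": "1", "Weber": "2", "Durkheim": "3"}

has_weber = 'Weber' in FamousSociologists  # 'in' checks the keys
fallback = FamousSociologists.get('Du Bois', 'not ranked')  # no error if missing

try:
    FamousSociologists['Du Bois']  # square brackets insist that the key exists
except KeyError as error:
    print('No such key:', error)

FamousSociologists['Weber'] = "1"  # assigning to an existing key overwrites it
del FamousSociologists['Marx']     # del removes a key:value pair

print(has_weber, fallback)
print(FamousSociologists)
```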


Set up a dictionary of countries and capitals to show that we can nest collections:

In [31]:
geography = {
    "England": "London",
    "China": "Beijing",
    "Malaysia": ["Kuala Lumpur", "Putrajaya"],
}

Note: this principle is largely how the .json data format works
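
As a rough illustration of that link, the standard library's ```json``` module can turn a nested dictionary into JSON text and back again. This is only a sketch using the ```geography``` dictionary from above:

```python
import json

geography = {
    "England": "London",
    "China": "Beijing",
    "Malaysia": ["Kuala Lumpur", "Putrajaya"],
}

as_text = json.dumps(geography, indent=2)  # dictionary -> JSON string
print(as_text)

back_again = json.loads(as_text)           # JSON string -> dictionary
print(back_again['Malaysia'][1])           # nested collections survive the trip
```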

Let's now take a break. In that time, I'll set you some exercise questions.

### Exercise Question 1.1:

Using the variables below, print the "[Konami Code](https://en.wikipedia.org/wiki/Konami_Code)"

```python
v1 = "up,"
v2 = "down,"
v3 = "left,"
v4 = "right,"
v5 = "b,"
v6 = "a,"
v7 = "start"
```

In [32]:
# Insert your answer here

### Exercise Question 1.2:

Change the cases of the following three variables, printing them out in their new cases one per line:
```python
uppercaseme = "Make me upper case!"
lowercaseme = "Make me lower case."
titlecaseme = "Make me title case."
```

In [33]:
# Insert your answer here

### Exercise Question 1.3:

Here is a list:
    
```python
wrong_order_list = ["third", "first", "second", "fourth", "sixth"]
```

Now create a new list that is in the right order by only indexing this list. Be sure to insert "fifth" into this new list.

In [34]:
# Insert your answer here

### Section 1.3.4: Advanced List Operations

#### 1.3.4.1. Slicing of lists

We can do more than simply query a list by its index. Indices can also be negative numbers: when we use negative numbers we are indexing the list from the end, rather than the front (we briefly saw this above). We can also ask for a part of a list in a range. This is called 'slicing'. Finally, if we are working with characters, we can chop up a string into a list, or take a list and join it together as a string. You can index a list using the []. To slice a list, you would use the : inside the []. Let's try an example of a slice:

In [35]:
mylist = ['sociology', 'economics', 'political science', 'social policy']

print(mylist[2:4])

['political science', 'social policy']


What happens if we try and call an index not in the range of the list? i.e. print(mylist[6]) ?

Now you try: define your own list (it can be as long as you like) and index and slice it in various ways. Note, we index and slice strings in exactly the same way as we do lists:

In [36]:
cat_breath = 'My cats breath smells like cat food'
print(cat_breath[22:])
print(cat_breath[3:7])
print(cat_breath[-4:])

like cat food
cats
food


We can also find the position at which something occurs with the find method:

In [37]:
cat_breath.find('cats')

3
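
A few details of ```find``` are easy to trip over, so here is a short sketch. It also uses the related ```count``` method and the ```in``` keyword; the search for 'dog' is just an example of something that is not there:

```python
cat_breath = 'My cats breath smells like cat food'

missing = cat_breath.find('dog')         # find returns -1 when nothing matches
first_cat = cat_breath.find('cat')       # the first match is inside 'cats'
second_cat = cat_breath.find('cat', 4)   # start searching from index 4 instead
how_many = cat_breath.count('cat')       # count the non-overlapping matches

print(missing, first_cat, second_cat, how_many)
print('dog' in cat_breath)               # 'in' gives a simple True or False
```

Because -1 is also a valid index (the last character), always check the result of ```find``` before using it to index a string.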

#### 1.3.4.2. Splitting

Strings behave like a special kind of list that only includes characters. We can query and slice strings the way we do lists. We can also alternate between strings and lists using ```.split()``` and ```.join()```, i.e.:

In [38]:
oldstring = 'History repeats itself,\n' + \
            'first as tragedy,\n' + \
            'second as farce.'

newlist = oldstring.split(' ')

print(newlist)

newstring = ' '.join(newlist)

print('\n\n' + newstring)

['History', 'repeats', 'itself,\nfirst', 'as', 'tragedy,\nsecond', 'as', 'farce.']


History repeats itself,
first as tragedy,
second as farce.


Note again how we are breaking lines (PEP 8). We can also re-join a split string!

Here (below) we are splitting the string on the '.', and re-joining them with ' ':

In [39]:
iwanttobreakfree="I.want.to.break.free."

print(iwanttobreakfree)

godknows=iwanttobreakfree.split('.')

print(godknows)

godknowsiwanttobreakfree=" ".join(godknows)

print(godknowsiwanttobreakfree)

I.want.to.break.free.
['I', 'want', 'to', 'break', 'free', '']
I want to break free 


This string really wants to break free...
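
Notice the empty string at the end of the split list, and the trailing space it leaves behind after joining. One possible way to avoid it, sketched below, is to strip the full stops from the ends of the string before splitting:

```python
iwanttobreakfree = "I.want.to.break.free."

# strip('.') removes full stops from both ends, but not from the middle.
words = iwanttobreakfree.strip('.').split('.')
print(words)

sentence = " ".join(words)
print(sentence)
```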

## Section 1.4: Iterating over a collection

We now move onto **control statements** which are absolutely critical. The first one we will consider is the _for loop_ for iterating over objects (collections) such as lists.

Virtually any collection can be iterated over. Python will keep track of the elements in a collection so that each element is used only once. If a collection is unordered, Python will not necessarily give you the elements in the order that they were created. The ordering is actually related to how they are stored in memory and so it could change at any point. The main way to iterate through a collection is to use the infamous *for loop*. This will iteratively 'loop' through all elements of our collection, operating on them one at a time. Where 'i' is the conventional name for the loop variable, try something like the following:

In [40]:
PrimeList = [2, 3, 5, 7, 11, 13]
for i in PrimeList: 
    print(i)

2
3
5
7
11
13


In [41]:
animals = ['dogs', 'cats', 'turtles']

for i in animals:
    print('Did you know that '+ i +' are my favourite animals?')

Did you know that dogs are my favourite animals?
Did you know that cats are my favourite animals?
Did you know that turtles are my favourite animals?


However, we can use any name for the loop variable. What are your favourite sandwiches?

In [42]:
sandwiches=['avocado', 'roast vegetables', 'salad']
for sandwich in sandwiches:
    print(sandwich + ' is my favourite sandwich filling!')

avocado is my favourite sandwich filling!
roast vegetables is my favourite sandwich filling!
salad is my favourite sandwich filling!


There are two important things to note here:

1. First, we *absolutely need* the colon at the end of the opening line of the control statement (try without!)

2. Secondly, note the indentation. In Python, whitespace is used to denote blocks (in other languages, brackets or braces are used). ```def```, ```if```, ```elif```, ```else```, ```try```, ```except```, ```finally```, ```with```, ```for```, ```while```, and ```class``` all start blocks. Some of these we will see more of in this course. To end the indentation, we just 'outdent' (i.e. go back to where we were before the block began). An indent can either be 4 spaces or a tab. People have [very heated](https://stackoverflow.com/questions/119562/tabs-versus-spaces-in-python-programming) discussions about [which to use](https://stackoverflow.blog/2017/06/15/developers-use-spaces-make-money-use-tabs/); at this level either is fine, as long as you never mix the two within a file (PEP 8 recommends four spaces). Put simply: this indentation is how Python manages what is inside a loop and what is after the loop. What happens if we don't indent, or don't include the colon?
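
The effect of indentation can be sketched with a running total. The numbers and variable names here are made up for illustration:

```python
total = 0
for number in [1, 2, 3]:
    total = total + number  # indented: runs once for every item
    print('inside the loop, total is', total)
print('after the loop, total is', total)  # not indented: runs once, at the end
```

The two indented lines run three times each, while the final line runs only once because it has been 'outdented'.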

Let's try some more examples because this concept of a for loop is so critical:

In [43]:
for name in ['what?', 'who?', 'Dr. Who']:
    print('Hi! My name is: ')
    print(name)

Hi! My name is: 
what?
Hi! My name is: 
who?
Hi! My name is: 
Dr. Who


Let's move on to the concept of a 'loop counter' ([as discussed in Season 2, episode 13 of the Big Bang Theory, where Sheldon displays his friendship algorithm](http://padcandy.com/wp-content/uploads/2013/11/cb55_full_view.jpg)):

In [44]:
counter=0
languages = ['Python', 'R']
for language in languages:
    counter=counter+1
    print('Computer programming language ' + language +' is rank: '+str(counter))

Computer programming language Python is rank: 1
Computer programming language R is rank: 2
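
Keeping a counter by hand works, but Python also has a built-in helper, ```enumerate```, which hands back the count and the item together. The sketch below reproduces the ranking above; the ```ranking``` list is only there to collect the results:

```python
languages = ['Python', 'R']
ranking = []

# start=1 makes the count begin at 1 rather than the default 0.
for rank, language in enumerate(languages, start=1):
    line = 'Computer programming language ' + language + ' is rank: ' + str(rank)
    ranking.append(line)
    print(line)
```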


### 1.4.1. [Iteration isn't always in sequence](https://stackoverflow.com/questions/3848091/set-iteration-order-varies-from-run-to-run).

Below, you will see iteration over sets, lists and dictionaries. *Importantly*: note that sets are not guaranteed to come back in order (as discussed above), but all elements will be returned eventually. Dictionaries come back in insertion order from Python 3.7 onwards, although older versions made no such promise. Notice how dictionaries are slightly different since they are not collections of single elements, but pairs of elements. There is also an explicitly ordered dictionary type (```collections.OrderedDict```), although this is left for the optional homework.

In [45]:
college_list = ['nuffield', 'st cats', 'st cross', 'nuffield', 'trinty']
college_set = set(college_list) # notice how some disappear? why?

print("List of colleges:")
for i in college_list: 
    print(i)
print("\n") # this just adds a space in between the results rather than actually printing any content

print("Set of colleges:")
for i in college_set: 
    print(i)

List of colleges:
nuffield
st cats
st cross
nuffield
trinty


Set of colleges:
st cats
st cross
nuffield
trinty
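
If you need the duplicates removed but also want a predictable order, there are a couple of common approaches. The sketch below uses a reordered copy of the college list (spelling kept as in the cell above) so that the two results differ:

```python
college_list = ['st cross', 'nuffield', 'st cats', 'nuffield', 'trinty']

# sorted() returns a new list, here in alphabetical order.
alphabetical = sorted(set(college_list))
print(alphabetical)

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while keeping the first-seen order.
first_seen = list(dict.fromkeys(college_list))
print(first_seen)
```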


In [46]:
colleges = {"a":"all souls", "b":["brasenose", "balliol"], "e":"exeter"}

print("\nDictionary [default, keys]:")
for i in colleges:
    print(i)
    
print("\nDictionary [values]:")
for i in colleges.values(): 
    print(i)

print("\nDictionary [values] - by querying:")
for i in colleges:
    print(colleges[i])

print("\nDictionary [items] - single query:")
for i in colleges.items(): 
    print(i)
    
print("\nDictionary [items] - double query:")
for i,j in colleges.items(): 
    print(i,":",j)


Dictionary [default, keys]:
a
b
e

Dictionary [values]:
all souls
['brasenose', 'balliol']
exeter

Dictionary [values] - by querying:
all souls
['brasenose', 'balliol']
exeter

Dictionary [items] - single query:
('a', 'all souls')
('b', ['brasenose', 'balliol'])
('e', 'exeter')

Dictionary [items] - double query:
a : all souls
b : ['brasenose', 'balliol']
e : exeter


More advanced functionality that we won't cover: [zip](https://www.w3schools.com/python/ref_func_zip.asp)
    
For example, we won't consider:

```python
for f, b in zip(foo, bar):
    print(f, b)
```

### Homework 1.1

Using ```split``` and ```len```, figure out how to compute how many words are in this famous quote by Albert Einstein:

```python
albert_einstein = 'Logic will get you from A to Z, but imagination will take you everywhere.'
```

In [47]:
# Answer 1.1 here

### Homework 1.2

Loop through [the words](https://soundcloud.com/leahkardos/leah-kardos-dface-practice): if the word starts with an 'a', then print it:

```python
practicethis = "The object of this lesson is to internalise the notes and the keys; in other words, it should become like a reflex for you to see the note on the page and immediately play that appropriate key, in the same way that you see a letter in a word, and immediately visualise the pronounciation."
```

In [48]:
# Answer 1.2 here

### Homework 1.3

```python
wrongphrase = "The slow red fox sat by the friendly dog"
```

Print the right phrase ("The quick brown fox jumps over the lazy dog") by splitting, replacing the wrong words with the right ones, and then (re-)joining.

In [49]:
# Answer 1.3 here

### Homework 1.4

Print a number UNLESS it has already been printed. Here's a bit of code to get you started:

```python
import random
for i in range(10):
    rando = random.randint(0,9)
```

In [50]:
# Answer 1.4 here