In [1]:
#prep
my_name = "kenobi"
my_num1 = 42 
my_num2 = "42"
my_calc = 42 * 42

# Introduction to Python for Social Science

## What is python and why use it?

Python is a general purpose programming language. It has in recent years gained a lot of popularity as a tool for data science and statistics.
It has a number of advantages:

1. General purpose language - add what you need
2. Portable (Linux, Windows, Mac)
3. Interactive
4. Free
5. Large community and eco-system

## Installing Anaconda
* Go to (https://www.anaconda.com/download)
* Install a Python 3.x version
* Open Jupyter notebook
  * Windows: search for jupyter
  * Mac and linux: $ <anaconda_root_dir>/jupyter notebook

## Content of the python introduction

- Working with python
- The Jupyter Notebook environment
- Working with variables
- Expanding python functionality by importing modules
- Working with tables in python (`pandas` module)
- Exporting data

### Working with Python
* Workflows - many - find your own!
* In this course - Jupyter notebook and pandas:
  * Python + Jupyter notebook + pandas = A complete environment
  * Interactive
  * Encourages an iterative work process (research?)
  * Documentation, code and visualization in one - literate programming
  * Reproducing results

## Classroom rules
* Discuss with your neighbour
* Help each other
* Try everything out yourself (write along as we go)
* Stickers: Put a sticker up on your laptop if you need help

# Using Jupyter Notebook

Jupyter Notebook is a web-based interactive computational environment for creating notebooks that contain code.

It was originally developed for python but can be used with other languages as well (for example R).

Jupyter Notebook works in a cell structure. Each notebook is made up of cells. The cells can be either code cells or markdown cells (or raw cells).

- Code cells: Contain python code
- Markdown cells: Contain formatted text

Run a cell by highlighting it and pressing `Shift + Enter`.

## Adding documentation
* Comment on a line using `#` (code following will not be interpreted)
* Doc-strings for longer comments using `"""`

In [None]:
2 + 2 # wow!

In [None]:
"""What I'm about to do here
is really cool!"""

2 - 2

# Python basics

Working with python means working almost entirely with code written in the python language.
Python works by writing lines of code (commands) and having python interpret that code (running commands or cells).

## Python as a calculator
So what does it mean that python interprets our code?
It means that you tell python to do something by writing a command and python will do that (if python can understand you).

Python, for example, understands mathematical expressions:

In [None]:
2 + 5

In [None]:
0.37 * 256

## Working with variables, functions and methods

Working with python means working with variables. 

Variables in python can store all kinds of values and information: text, numbers, datasets, graphs. 

Using a function or calling a method can then transform the variable or create something from it: calculating a statistical model, saving a file, creating a graph and so on.

Data analysis in python can mostly be boiled down to 3 basic steps:

1. Assign values to a variable: `data = pd.read_csv('employees.csv')`
2. Make sure python interprets the variable correctly (its type): `data['Compensation'].dtypes`
3. Perform some operation or manipulation on the variable using a function or calling a method: `data['Compensation'].mean()`
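
The three steps above can be sketched with a small table built by hand. The file `employees.csv` is not part of this material, so the names and numbers below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical stand-in for 'employees.csv': a tiny table built by hand
data = pd.DataFrame({
    'Name': ['Ann', 'Bo', 'Cy'],
    'Compensation': [30000, 42000, 36000],
})

# 1. The table is now assigned to the variable `data`

# 2. Check how python interprets the column (its type)
comp_type = data['Compensation'].dtypes
print(comp_type)

# 3. Perform an operation on the column by calling a method
comp_mean = data['Compensation'].mean()
print(comp_mean)
```

Note that `dtypes` is an attribute and is written without parentheses, while `mean()` is a method and needs them.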

## Variables

A lot of writing in python is about defining variables: names used to call up stored data.

Variables can be a lot of things: 
- a word
- a number
- a series of numbers
- a dataset 
- a URL
- a formula
- a result 
- a filepath
- and so on...

When a variable is defined, it is available in the current working space (or environment).

This makes it possible to store and work with a variety of information simultaneously.

### Defining variables

Variables are defined using `=`

In [None]:
a = 2 + 5

In [None]:
print(a)

In [None]:
b = 'jedi'

In [None]:
print(b)

Using `' '` or `" "` denotes that the code should be read as text.

### Naming variables
Variables can be named almost anything but a good rule of thumb is to use names that are indicative of what the variable contains.

#### Restrictions for naming variables
- Most special characters not allowed: `/`, `?`, `*`, `+`, `.` and so on (most characters mean something in python and will be read as an expression)
- Already existing names in python (will overwrite the function/variable in the environment)

#### Good naming conventions 
- Using '`_`': `my_variable`, `room_number`

or:

- Capitalize each word except the first: `myVariable`, `roomNumber`
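
As a rough sketch, python can check whether a name is allowed before you use it. The candidate names below are hypothetical examples; `str.isidentifier()` and the `keyword` module are part of standard python:

```python
import keyword

# Hypothetical candidate names, chosen only for illustration
candidates = ['my_variable', 'roomNumber', 'room-number', '2nd_room', 'class']

results = {}
for name in candidates:
    # isidentifier() checks the characters used; iskeyword() checks for reserved words
    results[name] = name.isidentifier() and not keyword.iskeyword(name)

for name, usable in results.items():
    print(name, usable)
```

This check does not catch names of existing functions such as `print` or `list`. Those are allowed as variable names, but using them overwrites the function in your environment.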


# EXERCISE 1: DEFINING VARIABLES

1. Define the following variables:

    - `my_num1`: `42` (without quotation marks)
    - `my_num2`: `"42"` (with quotation marks)
    - `my_calc`: `42 * 42`
    - `my_name`: `"kenobi"`

2. Divide each variable by 2

## Bonus exercise
- Try multiplying each variable by 2

# EXERCISE 1: DEFINING VARIABLES
*What happens?*

In [None]:
my_num1 / 2

In [None]:
my_num2 / 2

In [None]:
my_calc / 2

In [None]:
my_name / 2

# Different types of variables
Why did we get errors in the previous exercise?

Because python distinguishes between different types of variables!

A variable is stored as a *type*. The type denotes what kind of variable it is and affects what operations are possible.

## Numeric and character types
As you work with python, you will encounter a lot of different types. For now we will be focusing on two of the more common ones:
- Numeric types (like integers or floats)
- Character types (like strings)

Numbers are automatically stored as a numeric type (like a float, integer etc.).

When using `''` or `""` around the information to be stored in the variable, python will interpret that as text. Variables containing text are referred to as *strings*.

*Numbers enclosed in `''` or `""` are therefore stored as strings, as python interprets it as text!*

Python has to be told that something is text as Python would otherwise interpret it as an existing variable.

In [None]:
my_name2 = vader  # NameError: without quotation marks, python looks for a variable called vader

Note that python can interpret multiplying and adding text:

In [None]:
2 * my_name

In [None]:
2 * my_num2

In [None]:
'obi-wan ' + my_name

## Casting types

The type of a variable can be examined with `type(variable)`.

Variables can be coerced with specific functions:

- Coerce to character type:`str(variable)`
- Coerce to numeric type: `int(variable)` or `float(variable)`

Python will always try to "guess" the type. If python guesses wrong, you can tell python what type it should be (if possible).

In [None]:
a_number = '56'
print(type(a_number))

In [None]:
a_number = int(a_number)
print(type(a_number))
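
Coercion only works "if possible". The sketch below shows one way to handle text that cannot be read as a number. The values are made up, and storing `None` for failed conversions is just one possible choice:

```python
raw_values = ['56', '3.5', 'kenobi']

converted = []
for value in raw_values:
    try:
        converted.append(float(value))   # works for '56' and '3.5'
    except ValueError:
        converted.append(None)           # 'kenobi' cannot be read as a number
print(converted)
```

Without the `try`/`except`, `float('kenobi')` would stop the cell with a `ValueError`.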

### Integers and floats

Whole numbers will by default be interpreted as integers. Any number with a decimal point will be interpreted as a float.

Dividing with `/` always returns a float, even when both numbers are integers.

In [None]:
a_number = 36
b_number = 12
print(type(a_number), type(b_number))

In [None]:
c_number = a_number / b_number
print(c_number, type(c_number))
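
If you need an integer result, python also has the floor division operator `//`. A small sketch comparing the two operators, reusing the numbers from above:

```python
a_number = 36
b_number = 12

true_div = a_number / b_number     # `/` always gives a float
floor_div = a_number // b_number   # `//` gives an integer when both inputs are integers

print(true_div, type(true_div))
print(floor_div, type(floor_div))
```

Note that `//` rounds down, so `7 // 2` is `3`.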

# EXERCISE 2: TYPES

1. Create the variable `my_product` containing the product of your 2 my_num variables: `my_num1 * my_num2`
2. Does `my_product` contain the number you would expect? 
3. Change the type of `my_num2` to an integer (use `int()` and overwrite `my_num2`)
4. Try creating the variable `my_product` again
5. Check if `my_product` is equal to `my_calc` using `==`. Assign it to the variable `my_test`
6. Check the type of `my_test`. What type is it?

In [None]:
my_num1 * my_num2

In [None]:
my_num2 = int(my_num2)
my_product = my_num1 * my_num2
my_product

In [None]:
my_test = my_product == my_calc
my_test

In [None]:
type(my_test)

# The boolean type
*Boolean* variables are variables containing the value `True` or `False`.

Comparisons using the following operators (among others) return a boolean:
- `>`
- `>=`
- `<`
- `<=`
- `==`
- `!=`

Booleans can be used in functions, loops and if-statements to ensure that a certain condition is met before something is run.
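
A minimal sketch of a boolean used in an if-statement. The threshold of 40 and the messages are arbitrary choices for illustration:

```python
my_num1 = 42

is_large = my_num1 > 40   # the result of the comparison is stored as a boolean

if is_large:
    message = 'my_num1 is larger than 40'
else:
    message = 'my_num1 is 40 or smaller'

print(is_large, message)
```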

# Importing modules
Base python has limited functionality for data analysis. You will usually have to import various modules in order to perform your analysis.

A module is a collection of functions, variables and methods that can be loaded into your python environment.

Once loaded, the contents of the module are usable in the python environment.

It is possible to either import whole modules or parts of a module.

In [None]:
c = sqrt(a**2 + b**2) # Raises a NameError: sqrt has not been imported yet
print(c)

In [None]:
import math #whole module/package

# Or...

from math import sqrt #specific function/method

In [None]:
c = sqrt(my_num1**2 + my_num2**2)
print(c)
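
How you import decides how you refer to the function afterwards. The sketch below shows three common styles side by side. The variable names and the alias `m` are arbitrary choices:

```python
import math              # whole module: functions are called as math.<name>
from math import sqrt    # single function: called directly as sqrt
import math as m         # whole module under a shorter alias

side_a, side_b = 3, 4

c1 = math.sqrt(side_a**2 + side_b**2)
c2 = sqrt(side_a**2 + side_b**2)
c3 = m.sqrt(side_a**2 + side_b**2)

print(c1, c2, c3)
```

The alias style is the one used for pandas later on (`import pandas as pd`).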

# Lists
So far we have looked at python variables containing single values: a number, a word or a boolean.

Python has different ways of storing a series of elements (values, variables, etc.). One of the more common is the *list*.

A list is a grouping of elements. They are created by enclosing the values in `[]`:

In [None]:
my_list = [1, 9, 7, 3]
print(my_list)

Note that lists can contain variables of different types.

In [None]:
my_list2 = [my_name, 3, 42.0, my_test]
print(my_list2)

## Adding to lists

Elements can be added to the list with the method `append`.

*Note that using this method changes the contents of the list*

In [None]:
my_list.append(22)
print(my_list)

## Indexes

Each element in the list is assigned an index running from 0 to the number of elements - 1. We can use the index to refer to specific elements with `[]`:

In [None]:
my_list2[0]

Single elements in a list can be changed by referring to their index:

In [None]:
my_list2[0] = 'vader'
print(my_list2)
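
Indexes can also count from the end of the list or select a range of elements. A short sketch using a list like the one above:

```python
my_list = [1, 9, 7, 3, 22]

last = my_list[-1]         # negative indexes count from the end
first_two = my_list[0:2]   # a range of indexes; the end index (2) is excluded

print(last)
print(first_two)
print(len(my_list))        # number of elements; the last index is len(my_list) - 1
```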

## Tuples

*Tuples* are like lists but immutable, meaning their values can't be changed.

They are created by enclosing elements in `()`:

In [None]:
t = (1.0, 4.0)
t, type(t)

In [None]:
t[1]

In [None]:
t[1] = 2
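
The assignment above fails with a `TypeError`. The sketch below catches that error and shows one way around it: creating a new tuple instead of changing the old one.

```python
t = (1.0, 4.0)

try:
    t[1] = 2                 # not allowed: tuples cannot be changed
except TypeError as error:
    print('Not allowed:', error)

t_new = (t[0], 2)            # build a new tuple from the old values instead
print(t, t_new)
```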

# Working with table data: Pandas

*Pandas* is a module that allows you to create dataframes in python: A spreadsheet-like data structure for data with rows and columns.

The pandas module contains a lot of tools and methods for data handling and processing.

A `DataFrame` is a type. A closely related type is the `Series`: A one-dimensional data structure with an index (like a type-specific list or a variable, as it is understood in statistics). Each column of a DataFrame is a Series.

In [None]:
import pandas as pd

# Pandas Series
* One dimensional data
* Data is labelled with an index
* Series consist of pairs (index, data)

In [None]:
a = pd.Series([4, 2, 7, 8, 4, 4])
print(a)
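
The (index, data) pairs are easier to see when the index is something other than 0, 1, 2 and so on. A small sketch with made-up names and ages:

```python
import pandas as pd

# Hypothetical data: ages labelled by name instead of by position
ages = pd.Series([34, 51, 27], index=['ann', 'bo', 'cy'])

print(ages)
print(ages['bo'])          # look up a value by its index label
print(list(ages.index))    # the index on its own
```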

In [None]:
print(a*2 + 4)

A wide range of operations can be performed on series (like variables in any other statistics software).

In [None]:
print(a.unique())

In [None]:
print(a.isin([2, 4]))

# EXERCISE 3: Lists and Series
1. Create a list containing the numbers: 5, 13, 26, 42, 101
2. Try multiplying the list by 4 - What happens?
3. Create a pandas Series containing the same numbers (Note: you can create a pandas Series from a list)
4. Try multiplying the pandas Series by 4 - What happens?
5. Check if the number 20 is in your series using the `.isin()` method. (*NOTE:* `isin()` requires a list as input).

In [None]:
my_list = [5, 13, 26, 42, 101] #Creating the list
my_list * 4 #Multiplying by 4

In [None]:
my_series = pd.Series(my_list) #Converting to series
my_series * 4 #Multiplying by 4

In [None]:
my_series.isin([20]) #Does the series contain the number 20

# Pandas DataFrames
* Two dimensional data (rows and columns)
* Data is labelled with an index and a column name

With the pandas module, various files can be imported directly as dataframes (also from the web).

The `Iris` dataset contains measurements and species for various iris flowers. The measurements are in centimeters.

In [None]:
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

*Load the Iris dataset into your own environment now*

## Inspecting DataFrames

Use the method `.head()` to inspect the first 5 rows of the data.

In [None]:
iris.head()

Inspect column names with the attribute `.columns`.

In [None]:
list(iris.columns)

See key summary statistics using `.describe()` (n, mean, std, min, max, quartiles).

In [None]:
iris.describe()

## Slicing rows and selecting columns

Selecting rows is referred to as *slicing*. Rows can be selected by their position using `[]` and a range such as `0:1`. The end of the range is excluded.

Columns can be selected the same way by referring to the column name.

In [None]:
iris[0:1] # First row

In [None]:
iris['species'] # Selecting species column.

The indexer `.loc[]` is used for subsetting the data. First rows, then columns. Columns have to be specified by their name.

Several columns can be selected by referring to a list of column names.

Unlike the "standard" indexing/slicing, using `.loc[]` includes the last index.

*NOTE*: `.loc[]` is also used for recoding specific values.

In [None]:
iris.loc[2:4, 'species'] # Returns as a series

In [None]:
iris.loc[2:4, ['species', 'sepal_width']] # Returns as a dataframe

In [None]:
iris[2:4][['species', 'sepal_width']] # Alternative - excludes last index
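
The difference between the two ways of selecting rows is easiest to see on a very small table. The sketch below uses a made-up DataFrame with the default index (0 to 4):

```python
import pandas as pd

# Hypothetical five-row table, only for illustration
df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

by_position = df[2:4]     # standard slicing: rows 2 and 3 (end excluded)
by_label = df.loc[2:4]    # .loc[]: rows 2, 3 and 4 (end included)

print(list(by_position.index))
print(list(by_label.index))
```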

## Operations on dataframes
Operations can be performed on DataFrame columns (series) much like on the series we created earlier.

Operations on DataFrame series are not restricted to pandas functions!

In [None]:
(iris['sepal_width'] / 100).head() #converting to meters - first 5 rows

In [None]:
iris['sepal_width'].mean()

The type of a dataframe column (series) can be inspected using the attribute `dtypes`.

In [None]:
iris['sepal_width'].dtypes

## Creating variables

Variables are created by referring to columns not yet in the dataframe.

In [None]:
iris['sepal_length_m'] = iris['sepal_length'] / 100

In [None]:
list(iris.columns)

Empty variables/columns are created the same way but by filling them with missing values (NaN).

The `numpy` module allows us to work with the NaN value.

In [None]:
import numpy as np

iris['category'] = np.nan
iris.head()

### NaN: "Not a Number"

`NaN` is how pandas represents a missing value.

Notice that python does not treat NaN-values as larger or smaller than zero. NaN-values do not have a value. We therefore need to use specific methods to refer to them (like `isnull()`).

In [None]:
print(
    iris.loc[2, 'category'] < 0,
    iris.loc[2, 'category'] > 0,
    iris.loc[2, 'category'] == 0
)

In [None]:
iris[2:3]['category'].isnull()
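
NaN is not even equal to itself, which is why `==` cannot be used to find missing values. A short sketch with a made-up series:

```python
import numpy as np
import pandas as pd

missing = np.nan

equal_to_itself = missing == missing   # False: NaN never equals anything
detected = pd.isna(missing)            # True: use isna()/isnull() instead

# Hypothetical series with one missing value
values = pd.Series([1.0, np.nan, 3.0])
n_missing = values.isnull().sum()      # count the missing values
mean_value = values.mean()             # pandas skips NaN by default

print(equal_to_itself, detected, n_missing, mean_value)
```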

## Recoding variables

The standard way of recoding is by using booleans.

In [None]:
iris.loc[(iris['sepal_length'] <= 5.84), 'category'] = "short"
iris.loc[(iris['sepal_length'] > 5.84), 'category'] = "long"
print(iris)
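
It can help to look at the boolean on its own before using it in `.loc[]`. The sketch below uses a small made-up table and the same cut-off as above. Here the column is first filled with a default value and then partly overwritten, which is a slightly different approach from starting with NaN:

```python
import pandas as pd

# Hypothetical four-row table, only for illustration
flowers = pd.DataFrame({'sepal_length': [5.1, 6.3, 5.8, 7.0]})

is_short = flowers['sepal_length'] <= 5.84   # one True/False per row
print(is_short)

flowers['category'] = 'long'                 # default value for every row
flowers.loc[is_short, 'category'] = 'short'  # overwrite only the rows where is_short is True
print(flowers)
```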

# EXERCISE 4: DataFrames

1. Create a new variable/column called `petal_area` containing the product of `petal_length` and `petal_width`.
2. What is the smallest petal area? Use either `.describe()` or `.min()`

## Bonus exercise

- Using `.loc` and `.min()`, can you determine the species of the iris flower with the smallest petal area? (think in booleans)

In [None]:
iris['petal_area'] = iris['petal_length'] * iris['petal_width'] #Variable for petal_area

iris['petal_area'].min() #Smallest petal area

In [None]:
iris.loc[iris['petal_area'] == iris['petal_area'].min(), 'species'] #Determine species for the smallest petal area

# Writing and reading data
Pandas supports exporting DataFrames as various files, including:

* .csv (comma-separated values)
* .xlsx (excel)
* .dta (Stata)

## Exporting data with pandas

Files are created using the methods `to_csv()`, `to_excel()` and `to_stata()` respectively.

In [None]:
iris.to_csv('my_iris.csv')
iris.to_stata('my_iris.dta')

## Reading a simple csv file
The function `pd.read_csv()` can be used to read a .csv-file as a dataframe.

Note that unless you specify an index, one will be automatically generated.

In [None]:
iris_new = pd.read_csv('my_iris.csv', index_col=0) 
iris_new.head()
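
The effect of `index_col=0` can be seen without writing a file. In this sketch the csv text is kept in memory with `io.StringIO`, and the small table is made up for illustration:

```python
import io
import pandas as pd

# Hypothetical two-row table, only for illustration
small = pd.DataFrame({'species': ['setosa', 'virginica'],
                      'petal_width': [0.2, 2.5]})

csv_text = small.to_csv()   # without a filename, to_csv() returns the text

# Reading without index_col: the saved index comes back as an extra column
with_extra = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0: the first column is used as the index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(restored.columns))
```

Another option is to leave the index out when saving, using `to_csv('my_iris.csv', index=False)`.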

# EXERCISE 5: WRITE AND READ DATA

1. Save your iris data as a .csv
2. Read your data file as a dataframe - does it look right?