In [1]:
# Prep
import pandas as pd
import numpy as np

data_path = "/work/809805/data/eurobarometer-96_dk_subset.csv"

eurob = pd.read_csv(data_path)

age_recode = {"15 years": 15, "Refusal": np.nan}

eurob['d11'] = eurob['d11'].replace(age_recode)
eurob['d11'] = eurob['d11'].astype('float') # float = floating-point number

# Introduction to Python

## Program

- What is Python?
- Interacting with Python with Jupyter Notebooks (via Google Colab)
- Introduction to pandas data frames
- Data management with pandas
- Summary of data with pandas

## What is Python?

- Python as a programming language
- Python as an interpreter
- One works with Python through commands that are evaluated
- “Object-oriented”: Everything is defined as “variables” which can be used for different purposes depending on the type/class

## Python as a language

Limited vocabulary - the rest "made up" or imported!

```
and       del       global     not    with
as        elif      if         or    yield
assert    else      import     pass
break     except    in         raise
class     finally   is         return
continue  for       lambda     try
def       from      nonlocal   while
```
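The reserved words can also be listed from within Python itself. The sketch below uses the standard-library `keyword` module; the exact list depends on the Python version you run (recent Python 3 versions also reserve `True`, `False`, `None`, `async` and `await`, which the table above leaves out).

```python
import keyword

# keyword.kwlist holds the reserved words of the running Python version
print(len(keyword.kwlist))
print(keyword.kwlist)

# keyword.iskeyword() checks whether a single word is reserved
print(keyword.iskeyword("lambda"))  # True
print(keyword.iskeyword("pandas"))  # False: "pandas" has to be imported
```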

## Python as an interpreter

Commands written in Python are evaluated by a "Python interpreter".

Very literal language: An error is given if the command is not understood.

## Python as “object-oriented”

One interacts with Python by defining and redefining *objects*.

*Objects* are defined as *variables* - a name to call up information.

Every *variable* has a *class*.

The *class* sets the conditions for what the *variable* can do.

![dog1](./img/python-dog_eng.png)

In [2]:
the_ball = [2, 4, 6, 10, 21]

![dog1](../img/conf-dog_eng.png)

In [5]:
print(ball)

NameError: name 'ball' is not defined

![dog1](./img/dog-happy-text_eng.png)

In [6]:
print(the_ball)

[2, 4, 6, 10, 21]
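To see which class a variable has, you can ask Python with the built-in `type()`. The following is a minimal sketch; `the_name` and `the_age` are made-up example variables and not part of the slides.

```python
the_ball = [2, 4, 6, 10, 21]
the_name = "Fido"  # hypothetical example value
the_age = 3        # hypothetical example value

print(type(the_ball))  # <class 'list'>
print(type(the_name))  # <class 'str'>
print(type(the_age))   # <class 'int'>

# The class decides what a variable can do:
the_ball.append(42)       # lists can be extended
shout = the_name.upper()  # strings can be upper-cased

# hasattr() checks whether an object offers a given method
print(hasattr(the_ball, "append"))  # True
print(hasattr(the_age, "append"))   # False: integers cannot be appended to
```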


## Classes

<img src = "./img/classes-example_eng.png" style = "width: 85.0%"/>

In [8]:
class bookcase:
    def __init__(self, objects):
        self.top_shelf = objects[0]
        self.middle_shelf = objects[1]
        self.bottom_shelf = objects[2]

b = bookcase(["books", "boardgames", "vases"])

In [10]:
b.bottom_shelf

'vases'

In [11]:
b.bottom_drawer

AttributeError: 'bookcase' object has no attribute 'bottom_drawer'

# Functions and packages

Defining variables is one way of expanding the Python vocabulary - remember: limited vocabulary initially!

Other ways of expanding the vocabulary are:
- Writing functions
- Importing packages (containing functions, variables, etc.)

# Writing functions

A function is a block of code that runs when it is called.

It can take parameters (inputs) and return some output.

Functions are defined with `def` followed by a name for the function, followed by the parameters in parentheses.

When functions should return output, use `return`. A function ends as soon as it reaches a `return` statement.

In [1]:
def add_numbers(a, b):
    result = a + b
    return result

add_numbers(2, 7)

9
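That a function ends at the first `return` it reaches can be illustrated with a small sketch. The functions `describe_number` and `greet` are hypothetical helpers made up for this illustration.

```python
def describe_number(x):
    if x < 0:
        return "negative"  # the function stops here for negative input
    if x == 0:
        return "zero"
    return "positive"      # only reached if no earlier return was executed

def greet(name):
    print("Hello, " + name)  # prints, but has no return statement

print(describe_number(-5))
print(describe_number(3))

# A function without a return statement gives back None
result = greet("Ada")
print(result)  # None
```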

## Joint exercise: Simple problem solving with python

How can we create a function that calculates the area of a circle from a given radius?

$ A = \pi r^2 $

# Using packages

Packages in Python are a way of organizing and reusing code. 

They contain functions, classes, and variables defined by other programmers. 

One reason why Python is preferred for data science and machine learning is the extensive library of packages for these tasks.

Packages are first *installed* and then *imported*. This separation is important in order to control and maintain a specific Python *environment*.

# Using packages

In [3]:
import math

def comparea(r):
    A = math.pi * r**2
    return A
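An installed package can be imported in a few different ways. The sketch below shows three common forms using the standard-library `math` package and then calls an area function; `circle_area` is a hypothetical name chosen for this illustration.

```python
import math                # import the whole package
import math as m           # import the package under a shorter alias
from math import pi, sqrt  # import selected names only

def circle_area(r):
    # Same formula as above: A = pi * r^2
    return math.pi * r**2

print(circle_area(1))  # 3.141592653589793
print(m.pi == pi)      # True: all three forms refer to the same value
print(sqrt(16))        # 4.0
```

The alias form is the one used for `import pandas as pd` and `import numpy as np`.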

# Using packages

![envir](./img/environment_books.png)

# Data structures

You will encounter many different data structures when working with Python.

Python has built-in data structures, but many packages use their own data structures as well.

Many data handling tasks involve transforming data from one data structure to another.

## Lists

A fundamental Python data structure is the *list*. *Lists* are versatile, allowing you to store a sequence of items and modify them later. 

A list in Python is *ordered* and items can be of mixed types/different classes (you can even create lists of lists!).

Lists are defined by having values between square brackets [ ].

**Key features**:

- Ordered: The order of items in a list is preserved, which means items can be accessed by their position (index).
- Mutable: You can add, remove, or change items in a list after it has been created.
- Dynamic: Lists can grow or shrink in size as items are added or removed.

In [6]:
numbers = [10, 20, 30, 40, 50]

print(numbers)

[10, 20, 30, 40, 50]
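The key features of lists can be tried out directly. The following is a small sketch using the `numbers` list from above; the list `mixed` is a made-up example.

```python
numbers = [10, 20, 30, 40, 50]

# Ordered: access by position (positions start at 0)
first = numbers[0]     # 10
last = numbers[-1]     # 50, negative positions count from the end
middle = numbers[1:4]  # [20, 30, 40], a slice of positions 1, 2 and 3

# Mutable and dynamic: change, add and remove items
numbers[0] = 99
numbers.append(60)
numbers.remove(30)
print(numbers)  # [99, 20, 40, 50, 60]

# Mixed types, including a list inside a list
mixed = [1, "two", 3.0, [4, 5]]
print(len(mixed))  # 4
```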


## Dictionaries

Dictionaries are another fundamental data structure in Python. They store data in a key-value pair format, making them incredibly useful for associating information with unique identifiers. 

Dictionaries are defined by enclosing items in curly braces {} where each item consists of a key followed by a colon : and then the value.

**Key features**:

- Unordered: The order of items does not matter, and it is not possible to access items by position (dictionaries keep insertion order since Python 3.7, but items are still looked up by key).
- Mutable: You can add, remove, or modify entries after the dictionary is created.
- Indexed by keys: Instead of numeric indices, dictionaries use keys to retrieve values.
- No duplicate keys: Each key must be unique within a dictionary.

# Dictionaries

In [5]:
student_ages = {
    'Alice': 28,
    'Bob': 36,
    'Charlie': 20
}

print(student_ages)

{'Alice': 28, 'Bob': 36, 'Charlie': 20}
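The key features of dictionaries can be sketched the same way. The example reuses `student_ages`; the students `Dana` and `Eve` are made up for the illustration.

```python
student_ages = {'Alice': 28, 'Bob': 36, 'Charlie': 20}

# Indexed by keys: look up a value with its key
alice_age = student_ages['Alice']  # 28

# Mutable: add, modify and remove entries
student_ages['Dana'] = 31    # add a new key
student_ages['Bob'] = 37     # assigning to an existing key overwrites its value
del student_ages['Charlie']  # remove an entry
print(student_ages)  # {'Alice': 28, 'Bob': 37, 'Dana': 31}

# .get() returns a default value instead of raising KeyError for a missing key
unknown = student_ages.get('Eve', None)
print(unknown)  # None
```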


# Introduction to pandas data frames

## What is a (pandas) data frame?

- A data structure for tabular data in Python (a representation of data)

![DF](https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg)

- Each row and column has an *index*
- Rows are typically identified by their *index* (row number - but can also be something else!)
- Columns are typically identified by column name

### Each column in a data frame is a `Series`

- A `Series` is the single-column data structure in pandas
- Unlike a Python list, a `Series` has a single data type (its `dtype`)
- Indexes in a `Series` need not start at 0

![SERIES](https://pandas.pydata.org/pandas-docs/stable/_images/01_table_series.svg)
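The relation between a data frame and a `Series` can be sketched with a small made-up data frame. All values, labels and column names below are illustrative assumptions, not the Eurobarometer data.

```python
import pandas as pd

# A Series with labels instead of numbers as index
ages = pd.Series([28, 36, 20], index=["Alice", "Bob", "Charlie"], name="age")
print(ages["Bob"])   # access by index label
print(ages.iloc[0])  # access by position
print(ages.dtype)    # one dtype for the whole Series

# Selecting a single column of a data frame gives a Series
df = pd.DataFrame(
    {"age": [28, 36, 20], "city": ["Aarhus", "Odense", "Aalborg"]},
    index=[10, 11, 12],  # the index does not have to start at 0
)
col = df["age"]
print(type(col))
print(col.index.tolist())  # [10, 11, 12]
```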

## From data to data frame

- A data frame is just a representation of data in Python
- Many data formats can be converted to a data frame
- Data frames are usable for many forms of analysis

Examples of files that can be read into data frames (if in the correct format!):
- .csv
- .json
- .xls (Excel)
- .dta (Stata)
- .SAS7BDAT (SAS)
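Reading a file into a data frame typically takes one function call. The sketch below reads a few lines of made-up CSV text from memory (via `io.StringIO`) so that it runs without a file on disk; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# In-memory text standing in for a small .csv file (made-up data)
csv_text = "uniqid,d11,polintr\n1,34,Medium\n2,58,Strong\n3,21,Low\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (3, 3): three rows, three columns
print(df.dtypes)  # pandas infers a dtype for each column

# pandas has similar reader functions for the other formats listed above:
#   pd.read_json(), pd.read_excel(), pd.read_stata(), pd.read_sas()
```

Some readers may need an additional package to be installed, for example an Excel engine for `pd.read_excel()`.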

# Basic data management in pandas

## Select columns

![Col](https://pandas.pydata.org/pandas-docs/stable/_images/03_subset_columns.svg)

In [8]:
eurob['polintr']

0      Not at all
1          Medium
2          Medium
3          Medium
4          Medium
          ...    
988        Strong
989        Medium
990        Strong
991        Medium
992           Low
Name: polintr, Length: 993, dtype: object

## Select rows

![rows](https://pandas.pydata.org/pandas-docs/stable/_images/03_subset_rows.svg)

In [9]:
eurob[eurob['polintr'] == "Low"].head(2)  # Boolean indexing

Unnamed: 0,uniqid,d11,polintr,qb1,qb3_1,qb3_2,qb3_3,qb3_4,qb3_5,qb3_6,...,d10,d15a,d15b,d25,d63,d1,p1,p2,p3,region_denmark
10,110005583,91.0,Low,Don't know (SPONTANEOUS),Not mentioned,Not mentioned,Not mentioned,Not mentioned,Not mentioned,Not mentioned,...,Man,"Retired, unable to work","Employed position, travelling",Large town,The working class of society,5,17 Sep 21,13 - 16 h,2636,DK05 - Nordjylland
19,110005592,18.0,Low,Very important,Use of personal data and information by compan...,Not mentioned,Not mentioned,The safety and well-being of children,Not mentioned,The difficulty of disconnecting and finding a ...,...,Woman,Student,"Unskilled manual worker, etc.",Rural area or village,The middle class of society,3,17 Sep 21,13 - 16 h,3252,DK04 - Midtjylland


## Subsetting with `.loc[]` and `.iloc[]` (specific rows and columns)

![LOC](https://pandas.pydata.org/pandas-docs/stable/_images/03_subset_columns_rows.svg)

In [10]:
eurob.loc[eurob['polintr'] == "Low", ['polintr', 'd10']].head(3) 

Unnamed: 0,polintr,d10
10,Low,Man
19,Low,Woman
24,Low,Woman


## Subsetting with `.loc[]` and `.iloc[]`

- `.loc[]`: "Label-Based Location" (based on the naming of rows and columns)
- `.iloc[]`: "Integer-Based Location" (based on the integer position of rows and columns, counting from 0)

**Syntax:**

`.loc[rows, columns]`

- `rows` can be specified as row names (index labels) or via conditions ("Boolean Indexing")
- `columns` can be specified as a list of column names
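The difference between labels and positions is easiest to see when the row index is not simply 0, 1, 2, ... The sketch below uses a small made-up data frame that borrows the column names `polintr`, `d10` and `d11`; the values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "polintr": ["Low", "Medium", "Low", "Strong"],
        "d10": ["Man", "Woman", "Woman", "Man"],
        "d11": [91.0, 45.0, 18.0, 33.0],
    },
    index=[10, 11, 19, 24],  # row labels that differ from row positions
)

# .loc[] uses labels: the row labelled 19, the column named "d10"
by_label = df.loc[19, "d10"]

# .iloc[] uses positions: third row, second column (counting from 0)
by_position = df.iloc[2, 1]

print(by_label, by_position)  # both point at the same cell

# .loc[] with a condition for the rows and a list of column names
low = df.loc[df["polintr"] == "Low", ["polintr", "d10"]]
print(low)

# .iloc[] with slices: first two rows, first two columns
first_two = df.iloc[0:2, 0:2]
print(first_two)
```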

## Recoding with `.loc`

- Think of recoding as locating specific parts of the data and overwriting them with a value

<img src = "./img/loc_example.png" style = "width: 50.0%"/>

```python
df.loc[df['v1'] > 10, 'v1'] = 0
```

<img src = "./img/loc_example2.png" style = "width: 28.0%"/>
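A runnable version of this recoding could look as follows. The data frame and its values are made up; the recode is written to a new column, `v1_recoded` (a hypothetical name), so that the original values stay available for checking.

```python
import pandas as pd

df = pd.DataFrame({"v1": [3, 12, 7, 25]})

# Copy the column first, then overwrite the located values in the copy
df["v1_recoded"] = df["v1"]
df.loc[df["v1"] > 10, "v1_recoded"] = 0

print(df)
print(df["v1_recoded"].tolist())  # [3, 0, 7, 0]
```

Overwriting `v1` directly, as in the line above, works the same way but discards the original values.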

## Recoding with mappings

- When recoding categories, using `.loc[]` can be difficult
- Alternatively, you can use a *mapping* that specifies which values to replace and what to replace them with
- A mapping can be considered as a form of "search-and-replace" used on a column
- A mapping is made as a dictionary with old values as keys and new values as values:

```
mapping = {"Old Value X": "New Value X", "Old Value Y": "New Value Y"}
```

- A mapping can be used to replace values in a column (or `Series`) with the method `.replace()`

## Recoding with Mappings - Example

In [12]:
eurob['qb1'].value_counts()

qb1
Very important              716
Fairly important            191
Not very important           44
Not at all important         32
Don't know (SPONTANEOUS)     10
Name: count, dtype: int64

In [13]:
qb1_map = {"Very important": "Important", 
          "Fairly important": "Important", 
          "Not very important": "Not important",
          "Not at all important": "Not important",
          "Don't know (SPONTANEOUS)": np.nan}

eurob['qb1_bin'] = eurob['qb1'].replace(qb1_map)

eurob['qb1_bin'].value_counts()

qb1_bin
Important        907
Not important     76
Name: count, dtype: int64
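Note that the recoded counts (907 + 76 = 983) do not add up to the 993 rows: the 10 "Don't know" answers were mapped to `np.nan`, and `.value_counts()` leaves out missing values by default. The sketch below shows this on a small made-up `Series`; passing `dropna=False` includes the missing values in the count.

```python
import numpy as np
import pandas as pd

# Made-up answers using the same categories as qb1
answers = pd.Series([
    "Very important",
    "Fairly important",
    "Not very important",
    "Don't know (SPONTANEOUS)",
])

mapping = {
    "Very important": "Important",
    "Fairly important": "Important",
    "Not very important": "Not important",
    "Don't know (SPONTANEOUS)": np.nan,
}

recoded = answers.replace(mapping)

counts = recoded.value_counts()                          # missing values left out
counts_with_missing = recoded.value_counts(dropna=False) # missing values included

print(counts)
print(counts_with_missing)
print(recoded.isna().sum())  # number of missing values after recoding
```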

# Summary

- Python is a programming language and an interpreter
- Python *variables* are names used to call up information
- Variables always have a *class*
- A pandas data frame is a data structure in Python for working with data in tables
- A column in a data frame is called a `Series`
- `.loc[]` is used for both subsetting and recoding of data frames