In [None]:
# Prep
import pandas as pd
import numpy as np

data_path = "https://github.com/CALDISS-AAU/sds-ss-2024/raw/master/datasets/eurobarometer-96_dk_subset.csv"

eurob = pd.read_csv(data_path)

age_recode = {"15 years": 15, "Refusal": np.nan}

eurob['d11'] = eurob['d11'].replace(age_recode)
eurob['d11'] = eurob['d11'].astype('float') # float = floating-point number

# Introduction to pandas data frames

## What is a (pandas) data frame?

- A data structure for tabular data in Python (a representation of data)

![DF](https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg)

- Each row and column has an *index*
- Rows are typically identified by *index* (row number, but it can also be something else!)
- Columns are typically identified by column name
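
The points above can be illustrated with a minimal sketch. The data frame `df`, its column names and its row labels are made up for illustration and are not part of the Eurobarometer data.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration
df = pd.DataFrame(
    {"age": [34, 51, 27], "gender": ["Woman", "Man", "Woman"]},
    index=["r1", "r2", "r3"],  # row labels do not have to be numbers
)

print(df.index)    # the row index (labels)
print(df.columns)  # the column index (names)
print(df.shape)    # (number of rows, number of columns)
```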

### Each column in a data frame is a `Series`

- A `Series` is the single-column data structure in pandas
- Unlike a Python list, a `Series` has a single data type (`dtype`) for all its values
- Indexes in a `Series` need not start at 0

![SERIES](https://pandas.pydata.org/pandas-docs/stable/_images/01_table_series.svg)
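
As a small sketch of these properties, the example below builds a `Series` with an index that starts at 10 and then shows that a column pulled from a data frame is itself a `Series`. The values and the names `ages` and `df` are hypothetical.

```python
import pandas as pd

# Hypothetical values; the index deliberately starts at 10 rather than 0
ages = pd.Series([34, 51, 27], index=[10, 11, 12], name="age")

print(ages.dtype)    # one dtype for the whole Series
print(ages.loc[10])  # look up a value by its index label

# A column taken from a data frame is a Series
df = pd.DataFrame({"age": [34, 51, 27]})
print(type(df["age"]))
```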

## From data to data frame

- A data frame is just a representation of data in Python
- Many data formats can be converted to a data frame
- Data frames are usable for many forms of analysis

Examples of files that can be read into data frames (if correctly formatted!):
- .csv
- .json
- .xls (Excel)
- .dta (Stata)
- .sas7bdat (SAS)
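
A rough sketch of how such files are read: pandas has one `read_*` function per format. To keep the example runnable without any files on disk, the CSV and JSON content below are made-up strings wrapped in `io.StringIO`. The file names in the comments are hypothetical.

```python
import io
import pandas as pd

# Small in-memory stand-ins for files on disk (made-up content)
csv_text = "id,age\n1,34\n2,51\n"
json_text = '[{"id": 1, "age": 34}, {"id": 2, "age": 51}]'

df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)

# Other readers follow the same pattern (file names are hypothetical):
# pd.read_excel("data.xls")     # may need an extra package installed
# pd.read_stata("data.dta")
# pd.read_sas("data.sas7bdat")
```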

# Basic data management in pandas

## Select columns

![Col](https://pandas.pydata.org/pandas-docs/stable/_images/03_subset_columns.svg)

In [8]:
eurob['polintr']

0      Not at all
1          Medium
2          Medium
3          Medium
4          Medium
          ...    
988        Strong
989        Medium
990        Strong
991        Medium
992           Low
Name: polintr, Length: 993, dtype: object
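
Selecting several columns works the same way, except that a list of names goes inside the brackets. The sketch below uses a small hypothetical stand-in for `eurob` with made-up values. A single name returns a `Series`, while a list of names returns a data frame.

```python
import pandas as pd

# Hypothetical stand-in for the survey data
df = pd.DataFrame({
    "polintr": ["Low", "Medium", "Strong"],
    "d10": ["Man", "Woman", "Woman"],
    "d11": [91.0, 18.0, 45.0],
})

one_col = df["polintr"]            # single name -> Series
two_cols = df[["polintr", "d10"]]  # list of names -> DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
```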

## Select rows

![rows](https://pandas.pydata.org/pandas-docs/stable/_images/03_subset_rows.svg)

In [9]:
eurob[eurob['polintr'] == "Low"].head(2) #boolean indexing

Unnamed: 0,uniqid,d11,polintr,qb1,qb3_1,qb3_2,qb3_3,qb3_4,qb3_5,qb3_6,...,d10,d15a,d15b,d25,d63,d1,p1,p2,p3,region_denmark
10,110005583,91.0,Low,Don't know (SPONTANEOUS),Not mentioned,Not mentioned,Not mentioned,Not mentioned,Not mentioned,Not mentioned,...,Man,"Retired, unable to work","Employed position, travelling",Large town,The working class of society,5,17 Sep 21,13 - 16 h,2636,DK05 - Nordjylland
19,110005592,18.0,Low,Very important,Use of personal data and information by compan...,Not mentioned,Not mentioned,The safety and well-being of children,Not mentioned,The difficulty of disconnecting and finding a ...,...,Woman,Student,"Unskilled manual worker, etc.",Rural area or village,The middle class of society,3,17 Sep 21,13 - 16 h,3252,DK04 - Midtjylland
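
To see what boolean indexing does, it can help to split it into two steps: the comparison produces a `Series` of `True`/`False` values (a mask), and the mask then picks out the rows. The sketch below uses hypothetical toy data rather than `eurob`.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "polintr": ["Low", "Medium", "Low", "Strong"],
    "d11": [91.0, 18.0, 45.0, 30.0],
})

mask = df["polintr"] == "Low"  # Series of True/False, one value per row
print(mask)

low = df[mask]                 # keeps only the rows where mask is True
print(low)

# Conditions can be combined with & (and) and | (or);
# each condition needs its own parentheses
low_and_older = df[(df["polintr"] == "Low") & (df["d11"] > 50)]
print(low_and_older)
```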


## Subsetting with `.loc[]` and `.iloc[]` (specific rows and columns)

![LOC](https://pandas.pydata.org/pandas-docs/stable/_images/03_subset_columns_rows.svg)

In [10]:
eurob.loc[eurob['polintr'] == "Low", ['polintr', 'd10']].head(3) 

Unnamed: 0,polintr,d10
10,Low,Man
19,Low,Woman
24,Low,Woman


## Subsetting with `.loc[]` and `.iloc[]`

- `.loc[]`: "Label-Based Location" (based on the names (labels) of rows and columns)
- `.iloc[]`: "Integer-Based Location" (based on the integer position of rows and columns)

**Syntax:**

`.loc[rows, columns]`

- `rows` can be specified as row names (index labels) or via conditions ("Boolean Indexing")
- `columns` can be specified as list of column names
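
The difference between the two is easiest to see when the row labels are not 0, 1, 2. The sketch below uses a hypothetical data frame whose row labels are 10, 19 and 24.

```python
import pandas as pd

# Hypothetical toy data with row labels that do not start at 0
df = pd.DataFrame(
    {"polintr": ["Low", "Medium", "Low"], "d10": ["Man", "Woman", "Woman"]},
    index=[10, 19, 24],
)

by_label = df.loc[19, "d10"]  # row labelled 19, column named "d10"
by_position = df.iloc[1, 1]   # second row, second column (counting from 0)
print(by_label, by_position)

# Rows via a condition, columns via a list of names
subset = df.loc[df["polintr"] == "Low", ["polintr", "d10"]]
print(subset)
```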

## Recoding with `.loc`

- Think of recoding as locating specific parts of the data and overwriting them with a value

<img src = "https://github.com/CALDISS-AAU/sds-ss-2024/raw/master/slides/img/loc_example.png" style = "width: 50.0%"/>

```python
df.loc[df['v1'] > 10, 'v1'] = 0
```

<img src = "https://github.com/CALDISS-AAU/sds-ss-2024/raw/master/slides/img/loc_example2.png" style = "width: 28.0%"/>
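
A runnable version of the same idea, as a sketch on made-up data. One assumption differs from the slide: the recoded values go into a new column, `v1_recoded` (a hypothetical name), so that the original `v1` is kept for comparison.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"v1": [5, 12, 8, 20], "v2": ["a", "b", "c", "d"]})

# Copy the column first so the original values are preserved
df["v1_recoded"] = df["v1"]

# Locate the rows where v1 is above 10 and overwrite the copy with 0
df.loc[df["v1"] > 10, "v1_recoded"] = 0

print(df)
```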

## Recoding with mappings

- When recoding categories, using `.loc[]` can be difficult
- Alternatively, you can use a *mapping* that indicates which values to replace and what to replace them with
- A mapping can be considered as a form of "search-and-replace" used on a column
- A mapping is made as a dictionary with old values as keys and new values as values:

```python
mapping = {"old value X": "new value X", "old value Y": "new value Y"}
```

- A mapping can be used to replace values in a column (or `Series`) with the method `.replace()`

## Recoding with Mappings - Example

In [12]:
eurob['qb1'].value_counts()

qb1
Very important              716
Fairly important            191
Not very important           44
Not at all important         32
Don't know (SPONTANEOUS)     10
Name: count, dtype: int64

In [13]:
qb1_map = {"Very important": "Important",
           "Fairly important": "Important",
           "Not very important": "Not important",
           "Not at all important": "Not important",
           "Don't know (SPONTANEOUS)": np.nan}

eurob['qb1_bin'] = eurob['qb1'].replace(qb1_map)

eurob['qb1_bin'].value_counts()

qb1_bin
Important        907
Not important     76
Name: count, dtype: int64
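
Note that the recoded counts sum to 983 rather than 993: the 10 "Don't know" answers became missing values, and `.value_counts()` leaves missing values out by default. A minimal sketch of how to include them, using a short made-up `Series` in place of `eurob['qb1']`:

```python
import numpy as np
import pandas as pd

# Hypothetical answers standing in for eurob['qb1']
answers = pd.Series(
    ["Very important", "Fairly important",
     "Not very important", "Don't know (SPONTANEOUS)"],
    name="qb1",
)

qb1_map = {"Very important": "Important",
           "Fairly important": "Important",
           "Not very important": "Not important",
           "Not at all important": "Not important",
           "Don't know (SPONTANEOUS)": np.nan}

recoded = answers.replace(qb1_map)

print(recoded.value_counts())              # missing values are left out
print(recoded.value_counts(dropna=False))  # missing values get their own row
```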