# Session 5: Case 1

*Nicklas Johansen & Jacob Troelsgård*

## Agenda

In this session, we will prepare for solving our first case:
- Reading files
- Functions
- Practical Organizational Theory
- How to solve a case

## Recap 

- Missing and Duplicated Data
- Combining Data Sets
- Split-Apply-Combine
- Reshaping Data

In [1]:
# Loading packages

import numpy as np
import pandas as pd
import seaborn as sns

tips = sns.load_dataset('tips')
titanic = sns.load_dataset('titanic')

## Reading Files


Sometimes reading a .csv file...

In [2]:
pd.read_csv('data.csv')

Unnamed: 0,Name;Tænder;Indkomst
0,Nicklas;4;400
1,Jacob;6;500
2,Preben;8;800
3,Randi;1;200


... requires us to state the separator

In [4]:
pd.read_csv('data.csv', sep = ';')

Unnamed: 0,Name,Tænder,Indkomst
0,Nicklas,4,400
1,Jacob,6,500
2,Preben,8,800
3,Randi,1,200
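
The same behaviour can be reproduced without a file on disk. The following is a minimal sketch that assumes a small in-memory text (the variable `raw` is made up for illustration and mirrors the table above). It shows that the default comma separator leaves each row in a single column, while `sep=';'` gives one column per field.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content, mirroring the example table
raw = "Name;Tænder;Indkomst\nNicklas;4;400\nJacob;6;500\n"

# The default separator is ',', so each row ends up in a single column
wrong = pd.read_csv(io.StringIO(raw))

# Stating the separator gives one column per field
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(right.shape)
```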


You can also read a .xlsx file...

In [8]:
pd.read_excel('data.xlsx')

Unnamed: 0,Name,Tænder,Indkomst
0,Nicklas,4,400
1,Jacob,6,500
2,Preben,8,800
3,Randi,1,200


... sometimes you may...

In [13]:
pd.read_excel('data_2.xlsx')

Unnamed: 0,1
0,2
1,3


... need to specify which sheet you want to import.

In [16]:
pd.read_excel('data_2.xlsx', sheet_name='Sheet2')

Unnamed: 0,Name,Tænder,Indkomst
0,Nicklas,4,400
1,Jacob,6,500
2,Preben,8,800
3,Randi,1,200


Please note that this only works because the files we read are located in the same working directory as our Jupyter notebook.

If you want to work with multiple directories, you might want to learn about the Python module [`os`](https://www.geeksforgeeks.org/os-module-python-examples/).
```python
# Python program to explain the os.getcwd() method
import os

# Get the current working directory (CWD)
cwd = os.getcwd()

# Print the current working directory (CWD)
print("Current working directory:", cwd)
```
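
Building on `os.getcwd()`, a file in another folder can be reached by joining path parts with `os.path.join`. The sketch below is only an illustration: the folder name `data` and the file name `data.csv` are assumptions, and a temporary directory stands in for a real project folder so that the example runs anywhere.

```python
import os
import tempfile

# A temporary directory stands in for the project folder (assumption for the demo)
project_dir = tempfile.mkdtemp()

# Create a hypothetical subfolder 'data' with a small semicolon-separated file
data_dir = os.path.join(project_dir, 'data')
os.makedirs(data_dir, exist_ok=True)
file_path = os.path.join(data_dir, 'data.csv')
with open(file_path, 'w', encoding='utf-8') as f:
    f.write('Name;Indkomst\nNicklas;400\n')

# os.path.join builds a path that works on Windows, macOS and Linux
print(file_path)

# os.listdir shows the files in a folder; os.path.exists checks a path
files = os.listdir(data_dir)
print(files, os.path.exists(file_path))
```

The resulting `file_path` can then be passed to a reader, for example `pd.read_csv(file_path, sep=';')`.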

## Functions

Sometimes we need to create functions that can handle specific data wrangling challenges. Later, they can easily be applied to our DataFrame using `apply`. Let's look at two examples.


In [12]:
# from seaborn

titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,aldersgruppe
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,Tyverne
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,Trediverne
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,Tyverne
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,Trediverne
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,Trediverne


In [4]:
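# Function that splits people into age groups
# (Danish labels: 'Tyverne' = twenties, 'Trediverne' = thirties, 'Gammel' = old)

def lav_aldersgruppe(var):
    if pd.isna(var):
        return np.nan  # keep a missing age missing instead of labelling it
    if var < 20:
        return 'Teenager'
    if var < 30:
        return 'Tyverne'
    if var < 40:
        return 'Trediverne'
    return 'Gammel'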
        
titanic['aldersgruppe'] = titanic['age'].apply(lav_aldersgruppe)
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,aldersgruppe
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False,Tyverne
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,Trediverne
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,Tyverne
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,Trediverne
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True,Trediverne
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True,Tyverne
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,Teenager
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False,
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,Tyverne
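
For pure binning like this, pandas also offers `pd.cut`, which assigns each value to an interval without a hand-written function. This is an alternative technique, not the approach used above. The sketch below uses a small made-up series of ages and assumes the same cut-offs (20, 30, 40) and labels as `lav_aldersgruppe`.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration, including one missing value
ages = pd.Series([22.0, 38.0, 19.0, np.nan, 45.0])

# right=False makes the bins [0, 20), [20, 30), [30, 40), [40, inf)
groups = pd.cut(
    ages,
    bins=[0, 20, 30, 40, np.inf],
    right=False,
    labels=['Teenager', 'Tyverne', 'Trediverne', 'Gammel'],
)

# A missing age stays missing in the result
print(groups.tolist())
```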


In [None]:
# Alternative structure:
# Store the answer in a result variable and return it at the end

def lav_aldersgruppe(var):
    result = None  # default when no condition matches
    if 10 < var < 20:
        result = 'Teenager'
    if 20 <= var < 30:
        result = 'Tyverne'
    return result
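
One pitfall when writing conditions in a function like this: `&` is the bitwise operator and binds tighter than `<` and `>`, so it should not be used to combine comparisons on a single value. Chained comparisons (or `and`) do what is intended. The small check below, using the made-up value 25, illustrates the difference.

```python
var = 25

# Parsed as var > (10 & var) < 20, i.e. 25 > 8 < 20, which is True
with_bitwise_and = var > 10 & var < 20

# Chained comparison: 10 < 25 and 25 < 20, which is False
with_chaining = 10 < var < 20

# Equivalent spelling with the logical operator
with_logical_and = var > 10 and var < 20

print(with_bitwise_and, with_chaining, with_logical_and)
```

With a float such as `25.0`, the `&` version fails with a `TypeError` instead, because `&` is not defined between an integer and a float.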

In [9]:
# from ex. 4.1.1
# we want to split each person into an income group based on their occupation
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
labels = ['age','workclass', 'fnlwgt', 'educ', 'educ_num', 'marital_status', 'occupation','relationship', 'race', 'sex','capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'wage']
df_census = pd.read_csv(url, header=None, skipinitialspace = True, names=labels)
df_census

Unnamed: 0,age,workclass,fnlwgt,educ,educ_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [10]:
# We count the number of people in each occupation
df_census['occupation'].value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64
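
The `?` category above appears to be how unknown occupations are coded in this data set. A minimal sketch on a made-up series shows how `value_counts` can report shares instead of counts via `normalize=True`, and how the `?` entries could be turned into proper missing values. Whether that is the right treatment depends on the analysis.

```python
import pandas as pd

# Made-up occupations for illustration
occupation = pd.Series(['Sales', 'Sales', '?', 'Tech-support', '?', 'Sales'])

# Absolute counts and relative shares per category
counts = occupation.value_counts()
shares = occupation.value_counts(normalize=True)

# mask sets entries to a missing value where the condition is True
cleaned = occupation.mask(occupation == '?')
n_missing = int(cleaned.isna().sum())

print(counts)
print(shares)
print(n_missing)
```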

In [13]:
# then we create a rule for assigning specific occupations to income groups
# remember to check whether your function captures variations in your data, such as:
# spelling mistakes, capitalization, abbreviations, etc.

def beslut_indkomstsklasse(var):
    occupation = var.lower()  # compare in lower case so capitalization does not matter
    if occupation in ['prof-specialty', 'exec-managerial']:
        return 'High'
    if occupation == 'craft-repair':
        return 'Middel'
    return 'Low'

df_census['indkomstklasse'] = df_census['occupation'].apply(beslut_indkomstsklasse)
df_census.head()   

Unnamed: 0,age,workclass,fnlwgt,educ,educ_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage,indkomstklasse
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,Low
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,High
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,Low
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,Low
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,High
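
When the rule is a plain lookup, a dictionary combined with `Series.map` is a common alternative to an `if` chain. The sketch below is an illustration rather than the method used above: the series and the name `occupation_to_class` are made up, and the text is normalised with `str.strip` and `str.lower` first so that differences in spacing and capitalization do not matter.

```python
import pandas as pd

# Made-up occupations with inconsistent spacing and capitalization
occupation = pd.Series(['Prof-specialty', ' exec-managerial', 'CRAFT-REPAIR', 'Sales', '?'])

# Hypothetical lookup table from normalised occupation to income class
occupation_to_class = {
    'prof-specialty': 'High',
    'exec-managerial': 'High',
    'craft-repair': 'Middel',
}

# Normalise the text, look each value up, and fall back to 'Low' when there is no match
normalised = occupation.str.strip().str.lower()
income_class = normalised.map(occupation_to_class).fillna('Low')

print(income_class.tolist())
```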
