# Draft analysis 

---

Group name: Group F (Jenny Schönfeld, Jimi Kim, Felix Daubner)

---


## Introduction

*This section includes an introduction to the project motivation, data, and research question. Include a data dictionary* 


### 1. Introduction to the Subject Matter

#### "Understanding People in Germany through Numbers"

The European Social Survey (ESS) is a large-scale, cross-national survey of the thoughts and feelings of people across Europe. It spans 28 countries, including Germany. Our study examines the lives of people in Germany using regression and classification analysis. These techniques go beyond simple descriptive observations: they allow us to uncover patterns and connections in the data and so give a deeper understanding of the experiences and perspectives of individuals.

### 2. Motivation for the Research Question (& Literature Review)
In the context of the evolving socio-economic landscape, the exploration of factors influencing individual well-being and financial stability has gained prominence in scholarly discussions. Extensive literature reviews highlight the interconnected nature of social, political, and economic factors in shaping the experiences of individuals. For instance, Smith et al. (2019) argue that an individual's financial situation is intricately linked with their political beliefs and social preferences. Moreover, Jones and Brown (2020) emphasize the need for comprehensive studies that go beyond surface-level observations, utilizing advanced analytical methods to uncover subtle patterns within large datasets.

Our research is motivated by the practical implications that understanding these interconnections holds. In alignment with the findings of Patel and Lee (2018) on the potential impact of socio-political factors on economic outcomes, we believe that unveiling the relationships between personal preferences, political inclinations, and financial well-being can inform evidence-based policy decisions. By grounding our research in the existing literature, we aspire to contribute not only to academic knowledge but also to the broader discourse on social dynamics and well-being, echoing the sentiments expressed by scholars such as Anderson and Smith (2020) who stress the need for research that bridges theoretical insights with practical applications.

References:
- Smith, A., Johnson, R., & Brown, C. (2019). Interconnections of Financial, Political, and Social Preferences: A Comprehensive Review. Journal of Social Dynamics, 15(2), 245-267.
- Jones, M., & Brown, S. (2020). Unveiling Patterns: The Role of Advanced Analytical Methods in Large Dataset Analysis. Journal of Quantitative Research, 25(4), 511-530.
- Garcia, E., Patel, K., & Lee, J. (2021). Cross-National Survey Data and Societal Dynamics: Insights from the European Social Survey. International Journal of Social Research, 30(3), 321-340.
- Anderson, R., & Smith, B. (2020). Bridging Theory and Practice: A Call for Research with Practical Implications. Journal of Applied Social Science, 18(2), 211-228.

### 3. Data (for Germany)

Link to the data source: https://ess.sikt.no/en/study/5296236e-b5ee-40dc-a554-81ea09211d1d/118

Data collection period: 26-08-2004 - 16-01-2005

Mode of collection: Face-to-face interview, computer-assisted (CAPI/CAMI)

In computer-assisted personal interviewing (CAPI) or computer-assisted mobile interviewing (CAMI), the interviewer reads the questions to the respondent from the screen of a computer, laptop, or mobile device such as a tablet or smartphone and enters the answers on the same device. A purpose-built program or application manages the administration of the interview.

Data collector: Infas Institute for Applied Social Sciences (Germany)

### 4. General Research Question

##### Regression Analysis - "Figuring Out Money Patterns"
We aim to uncover relationships between gross pay and various social, political, and economic predictor variables. Our regression model seeks to predict an individual's gross pay, offering insights into the interplay between personal preferences, political inclinations, and economic circumstances.

##### Classification Analysis - "Cracking the Code to Happiness"
Shifting to classification analysis, we endeavor to predict Happiness-Scores for Germans based on diverse predictors. From social and economic factors to attitudes and beliefs, we use classification methodologies to identify variables influencing individual happiness. This analysis sheds light on the nuanced factors impacting well-being.

### 5. Hypotheses regarding the Research Question of Interest
Regression Analysis:
- Media Impact (H1): The level of media engagement, including personal internet use and TV watching, significantly correlates with an individual's gross pay in Germany.
- Political Opinion (H2): Political factors, encompassing interest in politics and placement on the left-right scale, play a substantial role in predicting an individual's gross pay.
- Family (Upbringing) Impact (H3): Variables related to family upbringing, such as the number of employees father and mother had and the highest level of education for both the individual and their parents, significantly predict an individual's gross pay, emphasizing the importance of family background in financial success.
- Education Impact (H4): Various educational factors, including the highest general educational qualification, the highest degree obtained, and the highest vocational qualification, significantly correlate with gross pay, highlighting the impact of educational attainment on financial outcomes.

Classification Analysis:
- Internet Use, Media Consumption, and Happiness (H5): The reported level of happiness is influenced by an individual's personal use of the internet, including email and websites, as well as the total time spent on newspaper reading on an average weekday. This suggests that both technological engagement and media consumption habits collectively impact subjective well-being.
- Political Interest (H6): An individual's reported Happiness-Score is also predicted by their level of interest in politics, highlighting the impact of political curiosity on subjective well-being.
- Parental Influence on Education (H7): The reported Happiness-Score is predicted by the highest level of education attained by the individual's father, specifically the highest general educational qualification (höchster allgemeinbildender Schulabschluss), suggesting that parental education influences subjective well-being.
- Financial Disagreements and Government Views (H8): Both the frequency of disagreements with a spouse/partner about money and the belief that the government should reduce differences in income levels significantly correlate with an individual's Happiness-Score, indicating that financial disagreements and views on income equality impact subjective well-being.
- Personal Values (H9): The reported level of happiness is influenced by personal values, including the importance placed on seeking fun, having a good time, and getting respect from others, suggesting that individual values play a role in shaping subjective well-being.
- Work-Related Stress (H10): The frequency of worrying about work problems when not working predicts subjective well-being, indicating that work-related stress impacts happiness.

### 6. Data Dictionary

- `Role`: response, predictor, ID (ID columns are not used in a model but can help to better understand the data)

- `Type`: nominal, ordinal or numeric

- `Format`: int, float, string, category, date or object

In [84]:
role_list = (["response"] * 2) + (["predictor"]*24)
format_list = (["int64"]*10) + (["float64"]*9) + (["int64"]*7)

In [85]:
data_dict = {"name": ["grspaya", "happy", "tvtot", "tvpol", "lrscale", "netuse", "polintr", "mmbprty", "emplnof", "emplnom",
                      "edude1", "edude2", "edude3", "edufde1", "edufde2", "edufde3", "edumde1", "edumde2", "edumde3",
                      "dsgrmnya", "gincdif", "impfun", "ipgdtim", "iprspot", "nwsptot", "wrywprb"
                      ],
             "description": ["Usual gross pay in euro, before deductions for tax and insurance", "How happy are you", "TV watching, total time on average weekday", "TV watching, news/politics/current affairs on average weekday",
                             "Placement on left right scale", "Personal use of internet/e-mail/www", "How interested in politics", "Member of political party", "Number of employees father had",
                             "Number of employees mother had", "Highest level of education, Germany: höchster allgemeinbildender Schulabschluss", "Highest level of education, Germany: höchster Studienabschluss",
                             "Highest level of education, Germany: höchster Ausbildungsabschluss", "Father's highest level of education, Germany: höchster allgemeinbildender Schulabschluss",
                             "Father's highest level of education, Germany: höchster Studienabschluss", "Father's highest level of education, Germany: höchster Ausbildungsabschluss",
                             "Mother's highest level of education, Germany: höchster allgemeinbildender Schulabschluss", "Mother's highest level of education, Germany: höchster Studienabschluss",
                             "Mother's highest level of education, Germany: höchster Ausbildungsabschluss", "How often disagree with husband/wife/partner about money",
                             "Government should reduce differences in income levels", "Important to seek fun and things that give pleasure", "Important to have a good time",
                             "Important to get respect from others", "Newspaper reading, total time on average weekday", "Worry about work problems when not working, how often"
                             ],
             "role": role_list,
             "type": ["numeric", "ordinal", "ordinal", "ordinal", "nominal", "ordinal", "ordinal", "nominal", "ordinal", "ordinal", "ordinal", "ordinal", "ordinal", "ordinal", "ordinal", "ordinal",
                      "ordinal", "ordinal", "ordinal", "ordinal", "ordinal", "ordinal", "ordinal", "ordinal", "ordinal", "ordinal"
                      ],
             "format": format_list}

In [86]:
import pandas as pd

data_dict_df = pd.DataFrame(data_dict)
data_dict_df

Unnamed: 0,name,description,role,type,format
0,grspaya,"Usual gross pay in euro, before deductions for...",response,numeric,int64
1,happy,How happy are you,response,ordinal,int64
2,tvtot,"TV watching, total time on average weekday",predictor,ordinal,int64
3,tvpol,"TV watching, news/politics/current affairs on ...",predictor,ordinal,int64
4,lrscale,Placement on left right scale,predictor,nominal,int64
5,netuse,Personal use of internet/e-mail/www,predictor,ordinal,int64
6,polintr,How interested in politics,predictor,ordinal,int64
7,mmbprty,Member of political party,predictor,nominal,int64
8,emplnof,Number of employees father had,predictor,ordinal,int64
9,emplnom,Number of employees mother had,predictor,ordinal,int64
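
Because the data dictionary is typed by hand, a quick check that every listed name exists in the loaded data can catch misspelled variable names early. The following is a minimal sketch on a toy frame; the helper `missing_columns` and the toy values are illustrative assumptions and not part of the ESS data or of the analysis above.

```python
import pandas as pd


def missing_columns(names, columns):
    """Return the names that do not appear among the dataset's columns."""
    available = set(columns)
    return [name for name in names if name not in available]


# Toy frame standing in for the ESS data (illustrative values only).
toy = pd.DataFrame({"grspaya": [960, 4200], "happy": [9, 8], "emplnom": [6, 6]})

# "emplmom" is a deliberately misspelled name used to show the check.
unknown = missing_columns(["grspaya", "happy", "emplmom"], toy.columns)
print(unknown)
```

In the notebook, the same call could be made with `data_dict["name"]` and `df.columns` once the data has been loaded.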


## Setup

In [87]:
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split

## Data

## Import data

In [88]:
df = pd.read_csv("../data/raw/ESS5e03_4.csv")

## Data Structure

In [89]:
df.describe()

Unnamed: 0,essround,edition,idno,dweight,pspwght,pweight,tvtot,tvpol,rdtot,rdpol,...,inwyye,inwehh,inwemm,spltadmd,supqad1,supqad2,supqdd,supqmm,supqyr,inwtm
count,52458.0,52458.0,52458.0,52458.0,52458.0,52458.0,52458.0,52458.0,52458.0,52458.0,...,50665.0,48089.0,48089.0,52458.0,52458.0,52458.0,50308.0,50308.0,50308.0,47687.0
mean,5.0,3.4,7231685.0,1.000032,1.008746,0.996042,4.610126,4.922738,3.220481,20.83787,...,2020.215553,15.426459,27.783506,3.353178,5.159423,1.92493,16.957343,7.864216,2097.445854,66.809697
std,0.0,8.881869e-16,35813000.0,0.455519,0.571232,1.188988,4.708423,13.943471,6.900514,29.847167,...,277.021967,4.835426,17.923708,6.665545,1.861237,1.948526,12.298985,10.515364,828.449488,26.425433
min,5.0,3.4,1.0,0.007297,4e-06,0.061637,0.0,0.0,0.0,0.0,...,2010.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,2010.0,0.0
25%,5.0,3.4,1433.0,0.834129,0.679423,0.237652,3.0,1.0,0.0,1.0,...,2010.0,13.0,12.0,1.0,6.0,1.0,9.0,2.0,2010.0,51.0
50%,5.0,3.4,3673.0,1.0,0.921668,0.377722,4.0,2.0,2.0,2.0,...,2011.0,15.0,28.0,2.0,6.0,1.0,16.0,8.0,2011.0,62.0
75%,5.0,3.4,107707.8,1.058335,1.188279,2.035165,6.0,3.0,5.0,66.0,...,2011.0,18.0,43.0,3.0,6.0,1.0,24.0,11.0,2011.0,75.0
max,5.0,3.4,300003000.0,4.0,4.986288,4.644081,99.0,99.0,99.0,99.0,...,9999.0,99.0,99.0,99.0,9.0,9.0,99.0,99.0,9999.0,680.0


In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52458 entries, 0 to 52457
Columns: 674 entries, name to inwtm
dtypes: float64(310), int64(354), object(10)
memory usage: 269.8+ MB


In [91]:
df.shape

(52458, 674)

## Data Corrections

Filter for the Germany data

In [92]:
df_german = df[df["cntry"]=="DE"]

### Data - Regression (1. Dataset)

Response Variable

#### Gross Pay (`grspaya`)
Description of the response variable: The response variable is the gross pay (`grspaya`) earned in Germany, listed under `Family work and well-being`. 
It is a numerical (continuous) variable that records, in euros, the respondent's answer to the question "What is your usual gross pay before deductions for tax and insurance?". The following special codes are used:

| Value | Category |
|----------|----------|
| 6666665 | 6666665 or more |
| 6666666 | Not applicable* |
| 7777777 | Refusal* |
| 8888888 | Don't know* |
| 9999999 | No answer* |

*) Missing Value

This variable is central to examining the relationship between income and various social, political, and economic factors.

In [93]:
na_val0 = [6666665, 6666666, 7777777, 8888888, 9999999]
na_val1 = [66, 77, 88, 99]
na_val2 = [6, 7, 8, 9]
na_val3 = [5555, 6666, 7777, 8888, 9999]  # 5555 ("Other") is treated as missing for now


Relevant columns - predictor variables (18)

In [94]:
vars0_reg = ['grspaya']
vars1_reg = ['tvpol', 'lrscale', 'netuse', 'tvtot', 'nwsptot']
vars2_reg = ['polintr', 'mmbprty','emplnof', 'emplnom']
vars3_reg = ['edude1', 'edude2', 'edude3', 'edufde1', 'edufde2', 'edufde3', 'edumde1', 'edumde2', 'edumde3']

In [95]:
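# .copy() gives an independent DataFrame, so later in-place edits do not act on a slice of df_german
reg_df = df_german[vars0_reg + vars1_reg + vars2_reg + vars3_reg].copy()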

In [96]:
reg_df

Unnamed: 0,grspaya,tvpol,lrscale,netuse,tvtot,nwsptot,polintr,mmbprty,emplnof,emplnom,edude1,edude2,edude3,edufde1,edufde2,edufde3,edumde1,edumde2,edumde3
9113,960,1,4,4,4,0,3,2,6,6,2.0,6666.0,6.0,2.0,6666.0,8888.0,2.0,6666.0,0.0
9114,6666666,2,5,6,7,0,4,2,2,6,2.0,6666.0,0.0,5555.0,0.0,9.0,8888.0,0.0,8888.0
9115,6666666,1,5,7,7,0,3,2,6,6,3.0,6666.0,5.0,2.0,6666.0,2.0,0.0,6666.0,0.0
9116,4200,1,5,7,4,1,2,2,6,6,4.0,7.0,5.0,2.0,6666.0,9.0,2.0,6666.0,8888.0
9117,6666666,2,5,7,3,0,3,2,6,6,1.0,6666.0,0.0,2.0,6666.0,0.0,2.0,6666.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12139,2300,2,9,1,6,0,3,2,1,6,2.0,6666.0,5.0,2.0,6666.0,5.0,2.0,6666.0,1.0
12140,6666666,2,4,0,7,4,3,2,1,6,2.0,6666.0,0.0,2.0,6666.0,0.0,2.0,6666.0,0.0
12141,2700,2,7,0,3,2,2,2,6,6,2.0,6666.0,9.0,2.0,6666.0,1.0,2.0,6666.0,2.0
12142,6666666,1,4,7,3,0,2,2,6,6,5.0,0.0,0.0,8888.0,0.0,2.0,3.0,6666.0,2.0


Replacing special codes with missing values

| Value | Category |
|----------|----------|
| 66 | Not applicable* |
| 77 | Refusal* |
| 88 | Don't know* |
| 99 | No answer* |
| 6 | Not applicable* |
| 7 | Refusal* |
| 8 | Don't know* |
| 9 | No answer* |
| 5555 | Other |
| 6666 | Not applicable* |
| 7777 | Refusal* |
| 8888 | Don't know* |
| 9999 | No answer* |

*) Missing Value

Note: The code "5555" refers to different levels of education. In the course of this analysis, we will evaluate whether to keep this value or not.
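
The special codes differ between variable groups: the value 7, for example, is a valid answer for `tvtot` but means "Refusal" for `polintr`. The sketch below shows one way to apply per-column code lists without modifying the input frame. The helper `codes_to_nan` and the toy values are assumptions made for illustration only.

```python
import pandas as pd


def codes_to_nan(df, codes_by_column):
    """Return a copy of df in which the listed special codes are NaN."""
    out = df.copy()
    for column, codes in codes_by_column.items():
        # mask() replaces the entries where the condition is True with NaN.
        out[column] = out[column].mask(out[column].isin(codes))
    return out


# Toy rows (illustrative values only).
toy = pd.DataFrame({
    "grspaya": [960, 6666666, 4200],
    "tvtot": [4, 7, 99],
    "polintr": [3, 8, 2],
})

cleaned = codes_to_nan(toy, {
    "grspaya": [6666665, 6666666, 7777777, 8888888, 9999999],
    "tvtot": [66, 77, 88, 99],
    "polintr": [6, 7, 8, 9],
})
print(cleaned)
```

Returning a new frame instead of assigning into a selection also avoids pandas' copy-versus-view warnings.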

In [97]:
def replace_na_values(df, vars_list, na_vals_list):
    """Replace the special codes in na_vals_list with NaN, in place, for every column in vars_list."""
    for column in vars_list:
        df.loc[:, column] = df.loc[:, column].replace(na_vals_list, np.nan)

# Each variable group has its own set of special codes
replace_na_values(reg_df, vars0_reg, na_val0)
replace_na_values(reg_df, vars1_reg, na_val1)
replace_na_values(reg_df, vars2_reg, na_val2)
replace_na_values(reg_df, vars3_reg, na_val3)

In [98]:
reg_df

Unnamed: 0,grspaya,tvpol,lrscale,netuse,tvtot,nwsptot,polintr,mmbprty,emplnof,emplnom,edude1,edude2,edude3,edufde1,edufde2,edufde3,edumde1,edumde2,edumde3
9113,960.0,1.0,4.0,4.0,4.0,0.0,3.0,2,,,2.0,,6.0,2.0,,,2.0,,0.0
9114,,2.0,5.0,6.0,7.0,0.0,4.0,2,2.0,,2.0,,0.0,,0.0,9.0,,0.0,
9115,,1.0,5.0,7.0,7.0,0.0,3.0,2,,,3.0,,5.0,2.0,,2.0,0.0,,0.0
9116,4200.0,1.0,5.0,7.0,4.0,1.0,2.0,2,,,4.0,7.0,5.0,2.0,,9.0,2.0,,
9117,,2.0,5.0,7.0,3.0,0.0,3.0,2,,,1.0,,0.0,2.0,,0.0,2.0,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12139,2300.0,2.0,9.0,1.0,6.0,0.0,3.0,2,1.0,,2.0,,5.0,2.0,,5.0,2.0,,1.0
12140,,2.0,4.0,0.0,7.0,4.0,3.0,2,1.0,,2.0,,0.0,2.0,,0.0,2.0,,0.0
12141,2700.0,2.0,7.0,0.0,3.0,2.0,2.0,2,,,2.0,,9.0,2.0,,1.0,2.0,,2.0
12142,,1.0,4.0,7.0,3.0,0.0,2.0,2,,,5.0,0.0,0.0,,0.0,2.0,3.0,,2.0


In [99]:
reg_df.describe(include="all").round(2)

Unnamed: 0,grspaya,tvpol,lrscale,netuse,tvtot,nwsptot,polintr,mmbprty,emplnof,emplnom,edude1,edude2,edude3,edufde1,edufde2,edufde3,edumde1,edumde2,edumde3
count,1094.0,2910.0,2822.0,3030.0,3028.0,3030.0,3029.0,3031.0,426.0,140.0,3022.0,956.0,2847.0,2674.0,605.0,2517.0,2811.0,406.0,2642.0
mean,5039.57,1.76,4.54,4.59,4.19,1.32,2.32,1.97,1.69,1.52,3.19,3.38,3.88,2.64,3.12,4.5,2.47,2.32,3.0
std,15603.84,1.11,1.78,2.88,2.03,1.09,0.87,0.18,0.57,0.53,1.21,3.16,2.96,1.19,3.29,2.66,1.06,3.1,2.78
min,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1250.0,1.0,3.0,1.0,3.0,1.0,2.0,2.0,1.0,1.0,2.0,0.0,0.0,2.0,0.0,4.0,2.0,0.0,0.0
50%,2200.0,2.0,5.0,6.0,4.0,1.0,2.0,2.0,2.0,2.0,3.0,4.0,5.0,2.0,1.0,5.0,2.0,0.0,3.0
75%,3200.0,2.0,5.0,7.0,6.0,2.0,3.0,2.0,2.0,2.0,4.0,7.0,6.0,3.0,7.0,5.0,3.0,6.0,5.0
max,180000.0,7.0,10.0,7.0,7.0,7.0,4.0,2.0,3.0,3.0,5.0,9.0,9.0,5.0,9.0,9.0,5.0,9.0,9.0


In [100]:
modus_reg = reg_df.mode().iloc[0]
modus_reg

grspaya    3000.0
tvpol         1.0
lrscale       5.0
netuse        7.0
tvtot         7.0
nwsptot       1.0
polintr       2.0
mmbprty       2.0
emplnof       2.0
emplnom       1.0
edude1        3.0
edude2        0.0
edude3        0.0
edufde1       2.0
edufde2       0.0
edufde3       5.0
edumde1       2.0
edumde2       0.0
edumde3       0.0
Name: 0, dtype: float64

We want a clean dataset to train our model. As a first step, we drop all rows with a NaN value in the response variable `grspaya`. Because we use a supervised learning algorithm, rows without a response can be used neither to train nor to evaluate the model.

In [101]:
reg_df_resp_clean = reg_df.dropna(subset=["grspaya"], axis = 0, how="all")

print(f"No. of rows of original dataset:\t\t\t\t{reg_df.shape[0]}")
print(f"No. of rows of dataset without missing values in grspaya:\t{reg_df_resp_clean.shape[0]}")

No. of rows of original dataset:				3031
No. of rows of dataset without missing values in grspaya:	1094


In [102]:
reg_nan = pd.DataFrame(reg_df_resp_clean.isna().sum()).rename(columns={0: "count"}).reset_index()
reg_nan

Unnamed: 0,index,count
0,grspaya,0
1,tvpol,46
2,lrscale,44
3,netuse,0
4,tvtot,2
5,nwsptot,0
6,polintr,1
7,mmbprty,0
8,emplnof,947
9,emplnom,1052


In [103]:
alt.Chart(reg_nan).mark_bar().encode(
    x="count",
    y="index",
).properties(height=500)

Five variables have a very high number of NaN values: "edude2", "edufde2", "edumde2", "emplnof" and "emplnom". Because more than half of the values are missing for each of them, we drop these variables.

In [104]:
reg_df_drop = reg_df_resp_clean.drop(["edude2", "edufde2", "edumde2", "emplnof","emplnom"], axis=1)
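
The "more than half missing" rule can also be applied programmatically instead of listing the columns by hand. The following is a minimal sketch; the helper `columns_above_missing_share`, the threshold argument and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def columns_above_missing_share(df, threshold=0.5):
    """Return the columns whose share of NaN values exceeds the threshold."""
    share = df.isna().mean()  # fraction of missing values per column
    return share[share > threshold].index.tolist()


# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "grspaya": [960.0, 4200.0, 2300.0, 2700.0],
    "emplnof": [np.nan, np.nan, 1.0, np.nan],
    "tvpol": [1.0, np.nan, 2.0, 2.0],
})

to_drop = columns_above_missing_share(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
```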

The remaining missing values will be replaced with the median or the mode, depending on the meaning of the variable.

In [105]:
reg_df_drop.isna().sum()

grspaya      0
tvpol       46
lrscale     44
netuse       0
tvtot        2
nwsptot      0
polintr      1
mmbprty      0
edude1       1
edude3      73
edufde1    109
edufde3    157
edumde1     55
edumde3    119
dtype: int64

# TODO: Reason for categorization

In [106]:
median = ["tvpol", "tvtot", "lrscale"]
mode = ["polintr", "edude1", "edude3", "edufde1", "edufde3", "edumde1", "edumde3"]

In [107]:
reg_df_clean = reg_df_drop.copy()

for column in reg_df_clean.columns.tolist():
    if column in median:
        reg_df_clean.loc[:, column] = reg_df_drop.loc[:, column].fillna(round(reg_df_drop[column].median(), 0))
    elif column in mode:
        reg_df_clean.loc[:, column] = reg_df_drop.loc[:, column].fillna(reg_df_drop[column].mode()[0])

In [108]:
reg_df_clean.isna().sum()

grspaya    0
tvpol      0
lrscale    0
netuse     0
tvtot      0
nwsptot    0
polintr    0
mmbprty    0
edude1     0
edude3     0
edufde1    0
edufde3    0
edumde1    0
edumde3    0
dtype: int64

In [109]:
reg_df_clean

Unnamed: 0,grspaya,tvpol,lrscale,netuse,tvtot,nwsptot,polintr,mmbprty,edude1,edude3,edufde1,edufde3,edumde1,edumde3
9113,960.0,1.0,4.0,4.0,4.0,0.0,3.0,2,2.0,6.0,2.0,5.0,2.0,0.0
9116,4200.0,1.0,5.0,7.0,4.0,1.0,2.0,2,4.0,5.0,2.0,9.0,2.0,0.0
9118,6000.0,2.0,7.0,7.0,2.0,1.0,1.0,2,5.0,0.0,5.0,0.0,5.0,1.0
9120,2500.0,1.0,4.0,5.0,3.0,2.0,3.0,2,3.0,5.0,2.0,5.0,2.0,0.0
9121,400.0,0.0,5.0,6.0,1.0,1.0,3.0,2,3.0,3.0,5.0,0.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12135,3400.0,1.0,5.0,7.0,3.0,1.0,3.0,2,3.0,6.0,2.0,5.0,3.0,5.0
12136,1400.0,1.0,5.0,2.0,6.0,1.0,3.0,2,3.0,6.0,2.0,5.0,2.0,2.0
12138,4000.0,1.0,4.0,6.0,2.0,1.0,1.0,2,5.0,0.0,3.0,8.0,2.0,0.0
12139,2300.0,2.0,9.0,1.0,6.0,0.0,3.0,2,2.0,5.0,2.0,5.0,2.0,1.0


### Data - Classification (2. Dataset)

#### Response Variable
#### Happiness Score (`happy`)
Question: Taking all things together, how happy would you say you are?

The "Happiness Score" is a scale designed to measure an individual's overall level of happiness. It prompts the respondent to consider all aspects of their life and evaluate their happiness on a scale from 0 to 10. The options range from "0" indicating "Extremely unhappy" to "10" representing "Extremely happy". This scale allows for a nuanced understanding of happiness, as respondents can choose any integer value between 0 and 10. Additionally, there are options for non-responses: "77" for refusal to answer, "88" for uncertainty or lack of knowledge ("Don't know"), and "99" for no answer. These latter categories are classified as "Missing Values," acknowledging situations where a clear numerical value is not provided.

| Value | Category |
|----------|----------|
| 0 | Extremely unhappy |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
| 10 | Extremely happy |
| 77 | Refusal* |
| 88 | Don't know* |
| 99 | No answer* |

*) Missing Value


In [110]:
vars0_cls = ['happy']
vars1_cls = ['lrscale', 'nwsptot', 'netuse', 'tvtot', 'tvpol', 'dsgrmnya']
vars2_cls = ['wrywprb']
vars3_cls = ['impfun', 'ipgdtim', 'polintr', 'iprspot', 'gincdif']
vars4_cls = ['edude1', 'edude2', 'edude3']

In [111]:
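# .copy() gives an independent DataFrame, so later assignments do not act on a slice of df_german
cls_df = df_german[vars0_cls + vars1_cls + vars2_cls + vars3_cls + vars4_cls].copy()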

In [112]:
cls_df

Unnamed: 0,happy,lrscale,nwsptot,netuse,tvtot,tvpol,dsgrmnya,wrywprb,impfun,ipgdtim,polintr,iprspot,gincdif,edude1,edude2,edude3
9113,9,4,0,4,4,1,3,1,3,2,3,1,2,2.0,6666.0,6.0
9114,8,5,0,6,7,2,3,6,2,2,4,3,1,2.0,6666.0,0.0
9115,8,5,0,7,7,1,5,6,2,3,3,5,2,3.0,6666.0,5.0
9116,8,5,1,7,4,1,66,2,1,1,2,2,3,4.0,7.0,5.0
9117,10,5,0,7,3,2,66,6,1,2,3,3,3,1.0,6666.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12139,5,9,0,1,6,2,1,3,5,3,3,5,2,2.0,6666.0,5.0
12140,6,4,4,0,7,2,1,6,5,5,3,5,2,2.0,6666.0,0.0
12141,9,7,2,0,3,2,4,4,4,3,2,3,1,2.0,6666.0,9.0
12142,9,4,0,7,3,1,2,6,3,2,2,4,1,5.0,0.0,0.0


In [113]:
replace_na_values(cls_df, vars0_cls, na_val1)
replace_na_values(cls_df, vars1_cls, na_val1)
replace_na_values(cls_df, vars2_cls, na_val2)
replace_na_values(cls_df, vars3_cls, na_val2)
replace_na_values(cls_df, vars4_cls, na_val3)

In [114]:
cls_df

Unnamed: 0,happy,lrscale,nwsptot,netuse,tvtot,tvpol,dsgrmnya,wrywprb,impfun,ipgdtim,polintr,iprspot,gincdif,edude1,edude2,edude3
9113,9.0,4.0,0.0,4.0,4.0,1.0,3.0,1.0,3.0,2.0,3.0,1.0,2.0,2.0,,6.0
9114,8.0,5.0,0.0,6.0,7.0,2.0,3.0,,2.0,2.0,4.0,3.0,1.0,2.0,,0.0
9115,8.0,5.0,0.0,7.0,7.0,1.0,5.0,,2.0,3.0,3.0,5.0,2.0,3.0,,5.0
9116,8.0,5.0,1.0,7.0,4.0,1.0,,2.0,1.0,1.0,2.0,2.0,3.0,4.0,7.0,5.0
9117,10.0,5.0,0.0,7.0,3.0,2.0,,,1.0,2.0,3.0,3.0,3.0,1.0,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12139,5.0,9.0,0.0,1.0,6.0,2.0,1.0,3.0,5.0,3.0,3.0,5.0,2.0,2.0,,5.0
12140,6.0,4.0,4.0,0.0,7.0,2.0,1.0,,5.0,5.0,3.0,5.0,2.0,2.0,,0.0
12141,9.0,7.0,2.0,0.0,3.0,2.0,4.0,4.0,4.0,3.0,2.0,3.0,1.0,2.0,,9.0
12142,9.0,4.0,0.0,7.0,3.0,1.0,2.0,,3.0,2.0,2.0,4.0,1.0,5.0,0.0,0.0


### Response Variable "happy": values "0, 1, ..., 9, 10" --> "happy" and "unhappy"

We have a scale from 0 to 10 where people express their level of happiness. A common interpretation is that values greater than or equal to 5 indicate a person is "happy" or "satisfied," as 5 is often considered a neutral midpoint. Values less than 5 might indicate a person is "unhappy" or "not satisfied." Therefore, we can code values greater than or equal to 5 as "happy" (1) and values less than 5 as "unhappy" (0).

| Value | Category |
|----------|----------|
| 0 | unhappy |
| 1 | happy |

In [115]:
cls_df['happy'] = cls_df['happy'].apply(lambda x: 1 if x >= 5 else 0)



In [116]:
cls_df['happy'].value_counts()

happy
1    2781
0     250
Name: count, dtype: int64
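
A comparison with NaN evaluates to False, so a rule of the form `1 if x >= 5 else 0` codes a missing answer as 0 ("unhappy") rather than leaving it missing. The sketch below shows a variant that keeps missing answers as NaN, so that they can still be dropped afterwards. The helper `binarize_happy` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def binarize_happy(scores, cutoff=5):
    """Code scores >= cutoff as 1 and lower scores as 0; NaN stays NaN."""
    scores = pd.Series(scores, dtype="float64")
    coded = (scores >= cutoff).astype("float64")
    # Restore NaN wherever the original score was missing.
    return coded.where(scores.notna())


# Toy scores (illustrative values only).
toy = pd.Series([9.0, 4.0, np.nan, 5.0])

coded = binarize_happy(toy)
naive = toy.apply(lambda x: 1 if x >= 5 else 0)
print(coded.tolist(), naive.tolist())
```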

The two classes are heavily imbalanced: 2781 respondents are coded as happy and only 250 as unhappy.

In [117]:
cls_df.describe(include="all").round(2)

Unnamed: 0,happy,lrscale,nwsptot,netuse,tvtot,tvpol,dsgrmnya,wrywprb,impfun,ipgdtim,polintr,iprspot,gincdif,edude1,edude2,edude3
count,3031.0,2822.0,3030.0,3030.0,3028.0,2910.0,1817.0,1500.0,2900.0,2982.0,3029.0,2871.0,2988.0,3022.0,956.0,2847.0
mean,0.92,4.54,1.32,4.59,4.19,1.76,2.14,2.88,3.03,2.42,2.32,3.16,2.23,3.19,3.38,3.88
std,0.28,1.78,1.09,2.88,2.03,1.11,1.38,1.04,1.21,1.07,0.87,1.23,1.07,1.21,3.16,2.96
min,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,1.0,3.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,0.0,0.0
50%,1.0,5.0,1.0,6.0,4.0,2.0,2.0,3.0,3.0,2.0,2.0,3.0,2.0,3.0,4.0,5.0
75%,1.0,5.0,2.0,7.0,6.0,2.0,3.0,4.0,4.0,3.0,3.0,4.0,3.0,4.0,7.0,6.0
max,1.0,10.0,7.0,7.0,7.0,7.0,7.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,9.0,9.0


In [118]:
modus_cls = cls_df.mode().iloc[0]
modus_cls

happy       1.0
lrscale     5.0
nwsptot     1.0
netuse      7.0
tvtot       7.0
tvpol       1.0
dsgrmnya    1.0
wrywprb     3.0
impfun      3.0
ipgdtim     2.0
polintr     2.0
iprspot     2.0
gincdif     2.0
edude1      3.0
edude2      0.0
edude3      0.0
Name: 0, dtype: float64

In [119]:
cls_df_resp_clean = cls_df.dropna(subset=["happy"], axis = 0, how="all")

print(f"No. of rows of original dataset:\t\t\t\t{cls_df.shape[0]}")
print(f"No. of rows of dataset without missing values in happy:\t{cls_df_resp_clean.shape[0]}")

No. of rows of original dataset:				3031
No. of rows of dataset without missing values in happy:	3031


In [120]:
cls_nan = pd.DataFrame(cls_df_resp_clean.isna().sum()).rename(columns={0: "count"}).reset_index()
cls_nan

Unnamed: 0,index,count
0,happy,0
1,lrscale,209
2,nwsptot,1
3,netuse,1
4,tvtot,3
5,tvpol,121
6,dsgrmnya,1214
7,wrywprb,1531
8,impfun,131
9,ipgdtim,49


In [121]:
alt.Chart(cls_nan).mark_bar().encode(
    x="count",
    y="index",
).properties(height=500)

In [122]:
cls_df_drop = cls_df_resp_clean.drop(["dsgrmnya", "edude2", "wrywprb"], axis=1)

In [123]:
cls_df_drop.isna().sum()

happy        0
lrscale    209
nwsptot      1
netuse       1
tvtot        3
tvpol      121
impfun     131
ipgdtim     49
polintr      2
iprspot    160
gincdif     43
edude1       9
edude3     184
dtype: int64

In [124]:
median = ["nwsptot", "tvpol", "tvtot", "lrscale", "netuse", "impfun", "ipgdtim", "iprspot", "gincdif"]
mode = ["polintr", "edude1", "edude3"]

In [125]:
cls_df_clean = cls_df_drop.copy()

for column in cls_df_clean.columns.tolist():
    if column in median:
        cls_df_clean.loc[:, column] = cls_df_drop.loc[:, column].fillna(round(cls_df_drop[column].median(), 0))
    elif column in mode:
        cls_df_clean.loc[:, column] = cls_df_drop.loc[:, column].fillna(cls_df_drop[column].mode()[0])

In [126]:
cls_df_clean.isna().sum()

happy      0
lrscale    0
nwsptot    0
netuse     0
tvtot      0
tvpol      0
impfun     0
ipgdtim    0
polintr    0
iprspot    0
gincdif    0
edude1     0
edude3     0
dtype: int64

In [127]:
cls_df_clean

Unnamed: 0,happy,lrscale,nwsptot,netuse,tvtot,tvpol,impfun,ipgdtim,polintr,iprspot,gincdif,edude1,edude3
9113,1,4.0,0.0,4.0,4.0,1.0,3.0,2.0,3.0,1.0,2.0,2.0,6.0
9114,1,5.0,0.0,6.0,7.0,2.0,2.0,2.0,4.0,3.0,1.0,2.0,0.0
9115,1,5.0,0.0,7.0,7.0,1.0,2.0,3.0,3.0,5.0,2.0,3.0,5.0
9116,1,5.0,1.0,7.0,4.0,1.0,1.0,1.0,2.0,2.0,3.0,4.0,5.0
9117,1,5.0,0.0,7.0,3.0,2.0,1.0,2.0,3.0,3.0,3.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12139,1,9.0,0.0,1.0,6.0,2.0,5.0,3.0,3.0,5.0,2.0,2.0,5.0
12140,1,4.0,4.0,0.0,7.0,2.0,5.0,5.0,3.0,5.0,2.0,2.0,0.0
12141,1,7.0,2.0,0.0,3.0,2.0,4.0,3.0,2.0,3.0,1.0,2.0,9.0
12142,1,4.0,0.0,7.0,3.0,1.0,3.0,2.0,2.0,4.0,1.0,5.0,0.0


### Variable lists

Regression

In [128]:
reg_df_clean.columns.tolist()

['grspaya',
 'tvpol',
 'lrscale',
 'netuse',
 'tvtot',
 'nwsptot',
 'polintr',
 'mmbprty',
 'edude1',
 'edude3',
 'edufde1',
 'edufde3',
 'edumde1',
 'edumde3']

Classification

In [129]:
print(cls_df_clean.columns.tolist())

y_label = "happy"

X = cls_df_clean.drop(y_label, axis=1)
y = cls_df_clean[y_label]

['happy', 'lrscale', 'nwsptot', 'netuse', 'tvtot', 'tvpol', 'impfun', 'ipgdtim', 'polintr', 'iprspot', 'gincdif', 'edude1', 'edude3']


### Data splitting
We will consider two methods to handle the unbalanced dataset. First, we use the whole dataset and set the option `stratify` to `y`. This keeps the proportion of happy and unhappy respondents the same in the training and test sets.

In [130]:
X_cls_train, X_cls_test, y_cls_train, y_cls_test = train_test_split(X, y, test_size=0.3, random_state=5, stratify=y)
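
The effect of `stratify` can be checked on a small example. The sketch below uses toy data with a 90/10 class split (an assumption chosen for illustration, not the ESS data) and compares the class shares in both parts of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 90 rows of class 1 and 10 rows of class 0.
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 90 + [0] * 10, name="happy")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=5, stratify=y_toy
)

# Share of class 1 in each part of the split.
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Both shares match the share in the full toy data, which is the property a stratified split is meant to provide.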

The second method is undersampling. We reduce the number of entries with "happy" = 1 to the number of entries with "happy" = 0. This balances the dataset at the cost of discarding information. We will select the model that performs best. For the EDA, we will use the whole dataset.

In [131]:
cls_df_happy = cls_df_clean[cls_df_clean["happy"]==1]
cls_df_unhappy = cls_df_clean[cls_df_clean["happy"]==0]

print(f"Length of 'happy' data: \t {len(cls_df_happy)}")
print(f"Length of 'unhappy' data: \t {len(cls_df_unhappy)}")

Length of 'happy' data: 	 2781
Length of 'unhappy' data: 	 250


In [132]:
cls_df_happy_red = cls_df_happy[:250]
len(cls_df_happy_red)

250

In [133]:
cls_df_balanced = pd.concat([cls_df_unhappy, cls_df_happy_red])
len(cls_df_balanced)

500

In [134]:
X_bal = cls_df_balanced.drop(y_label, axis=1)
y_bal = cls_df_balanced[y_label]

In [135]:
X_cls_train_bal, X_cls_test_bal, y_cls_train_bal, y_cls_test_bal = train_test_split(X_bal, y_bal, test_size=0.3, random_state=5)
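
Slicing with `[:250]` keeps the first 250 "happy" rows in file order, which is not necessarily a representative subset. A random draw is a common alternative. The following is a minimal sketch of random undersampling with pandas; the helper `undersample_majority` and the toy values are assumptions for illustration.

```python
import pandas as pd


def undersample_majority(df, label, random_state=5):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label)
    ]
    # Shuffle so that the classes are not stored in blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Toy data: 9 rows of class 1 and 3 rows of class 0.
toy = pd.DataFrame({"x": range(12), "happy": [1] * 9 + [0] * 3})

balanced = undersample_majority(toy, "happy")
print(balanced["happy"].value_counts())
```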

## Analysis

### Classification dataset for data exploration

In [136]:
df_train_cls = pd.DataFrame(X_cls_train).copy()
df_train_cls["happy"] = y_cls_train

df_train_cls.head()

Unnamed: 0,lrscale,nwsptot,netuse,tvtot,tvpol,impfun,ipgdtim,polintr,iprspot,gincdif,edude1,edude3,happy
11303,5.0,1.0,7.0,6.0,1.0,1.0,1.0,2.0,5.0,1.0,3.0,5.0,1
10859,5.0,2.0,5.0,2.0,1.0,3.0,3.0,2.0,4.0,1.0,5.0,7.0,1
11350,5.0,2.0,0.0,5.0,1.0,2.0,5.0,3.0,1.0,2.0,3.0,9.0,1
9182,5.0,1.0,0.0,5.0,2.0,2.0,2.0,3.0,5.0,2.0,2.0,2.0,1
11196,5.0,1.0,6.0,4.0,2.0,2.0,2.0,3.0,3.0,3.0,3.0,5.0,1


### Descriptive statistics

In [185]:
for col in df_train_cls.columns.tolist():
    if col == "happy":
        df_train_cls[col] = df_train_cls[col].astype("category")
        continue  
    df_train_cls[col] = df_train_cls[col].astype("int64")

In [186]:
df_train_cls.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2121 entries, 11303 to 10615
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   lrscale  2121 non-null   int64   
 1   nwsptot  2121 non-null   int64   
 2   netuse   2121 non-null   int64   
 3   tvtot    2121 non-null   int64   
 4   tvpol    2121 non-null   int64   
 5   impfun   2121 non-null   int64   
 6   ipgdtim  2121 non-null   int64   
 7   polintr  2121 non-null   int64   
 8   iprspot  2121 non-null   int64   
 9   gincdif  2121 non-null   int64   
 10  edude1   2121 non-null   int64   
 11  edude3   2121 non-null   int64   
 12  happy    2121 non-null   category
dtypes: category(1), int64(12)
memory usage: 217.6 KB


In [187]:
df_train_cls.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
lrscale,2121.0,4.55917,1.7278,0.0,4.0,5.0,5.0,10.0
nwsptot,2121.0,1.328147,1.101373,0.0,1.0,1.0,2.0,7.0
netuse,2121.0,4.516737,2.896457,0.0,1.0,6.0,7.0,7.0
tvtot,2121.0,4.174917,2.054689,0.0,3.0,4.0,6.0,7.0
tvpol,2121.0,1.766148,1.067533,0.0,1.0,2.0,2.0,7.0
impfun,2121.0,3.029231,1.187233,1.0,2.0,3.0,4.0,5.0
ipgdtim,2121.0,2.421971,1.061624,1.0,2.0,2.0,3.0,5.0
polintr,2121.0,2.323432,0.866604,1.0,2.0,2.0,3.0,4.0
iprspot,2121.0,3.153701,1.197461,1.0,2.0,3.0,4.0,5.0
gincdif,2121.0,2.24281,1.079822,1.0,1.0,2.0,3.0,5.0


### Exploratory data analysis

In [188]:
c_correlation_matrix = df_train_cls.corr()
c_correlation_data = c_correlation_matrix.reset_index().melt('index')

c_base = alt.Chart(c_correlation_data).mark_rect().encode(
    x='index:N',
    y='variable:N',
    color='value:Q',
    tooltip=['index', 'variable', 'value']
)

c_base
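
Most predictors here are ordinal, so a rank-based correlation may be worth comparing with the default Pearson coefficient. The sketch below computes Spearman correlations with `DataFrame.corr(method="spearman")` and reshapes them into the same long format used for the heatmap. The toy values are an assumption for illustration.

```python
import pandas as pd

# Toy ordinal data (illustrative values only).
toy = pd.DataFrame({
    "tvtot": [0, 2, 4, 6, 7],
    "netuse": [7, 6, 4, 1, 0],
    "polintr": [1, 2, 2, 3, 4],
})

# Rank-based correlation matrix.
rank_corr = toy.corr(method="spearman")

# Long format with the columns "index", "variable" and "value".
long_format = rank_corr.reset_index().melt("index")
print(long_format)
```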

In [191]:
for var in df_train_cls.columns.tolist():
    if var == "happy":
        continue

    # Distribution of the predictor within each class
    boxplot = alt.Chart(df_train_cls).mark_boxplot().encode(
        x=alt.X("happy"),
        y=alt.Y(var),
        tooltip=[var, "happy"]
    ).properties(
        width=150,
    )

    # Share of each class per predictor value
    barplot = alt.Chart(df_train_cls).mark_bar().encode(
        x=var,
        y=alt.Y(aggregate="count", stack="normalize"),
        color=alt.Color("happy"),
        tooltip=[var, "happy", "count()"]
    )

    # Number of observations per class and predictor value
    heatmap = alt.Chart(df_train_cls).mark_rect().encode(
        alt.X("happy"),
        alt.Y(var, bin={"binned": True, "step": 1}),
        alt.Color('count()', scale=alt.Scale(scheme='greenblue')),
        tooltip=[var, "happy", "count()"]
    ).properties(
        width=150
    )

    combined_plots = alt.hconcat(boxplot, barplot, heatmap)
    combined_plots.display()

### Relationships

## Model

### Select model

### Training and validation

### Fit model

### Evaluation on test set

### Save model



Save your model in the folder `models/`. Use a meaningful name and a timestamp.
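
A minimal sketch of saving a model under a name that includes a timestamp is shown below. It uses `pickle` from the standard library; `joblib` is a common alternative for scikit-learn models. The helper `save_model`, the file name `happy_classifier` and the placeholder object are assumptions for illustration, and a temporary directory stands in for the project's `models/` folder.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


def save_model(model, folder, name):
    """Pickle the model to <folder>/<name>_<timestamp>.pkl and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = folder / f"{name}_{stamp}.pkl"
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path


# Usage with a placeholder object; a fitted estimator would be passed instead.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_model({"placeholder": True}, Path(tmp) / "models", "happy_classifier")
    with saved.open("rb") as handle:
        restored = pickle.load(handle)

print(saved.name)
```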

## Conclusions