## **Mental Health in Tech Survey**

### **1. Library and data loading**


In [None]:
# library
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from scipy import stats
from scipy.stats import randint

# visualization libraries (matplotlib.pyplot and seaborn are already imported above)
import missingno as msno
import matplotlib
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline

# prep
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.preprocessing import binarize, LabelEncoder, MinMaxScaler

# models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Validation libraries
from sklearn import metrics
from sklearn.metrics import accuracy_score, mean_squared_error, precision_recall_curve
from sklearn.model_selection import cross_val_score

#Neural Network
from sklearn.neural_network import MLPClassifier

# Mount Google Drive and read the dataset
from google.colab import drive
drive.mount('/content/drive')

file_path = '/content/drive/MyDrive/survey.csv'

# Load the file
df = pd.read_csv(file_path)


Mounted at /content/drive


In [None]:
!pip install scikit-learn



### **2. Dataset overview**

* Goal: understand the factors that contribute to a person's mental health
* Survey from 2014 that captures attitudes toward mental health and the frequency of mental health disorders in the tech workplace

**Variables**
* **Timestamp**
* **Age**
* **Gender**
* **Country**
* **state**: If you live in the United States, which state or territory do you live in?
* **self_employed**: Are you self-employed?
* **family_history**: Do you have a family history of mental illness?
* **treatment**: Have you sought treatment for a mental health condition?
* **work_interfere**: If you have a mental health condition, do you feel that it interferes with your work?
* **no_employees**: How many employees does your company or organization have?
* **remote_work**: Do you work remotely (outside of an office) at least 50% of the time?
* **tech_company**: Is your employer primarily a tech company/organization?
* **benefits**: Does your employer provide mental health benefits?
* **care_options**: Do you know the options for mental health care your employer provides?
* **wellness_program**: Has your employer ever discussed mental health as part of an employee wellness program?
* **seek_help**: Does your employer provide resources to learn more about mental health issues and how to seek help?
* **anonymity**: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
* **leave**: How easy is it for you to take medical leave for a mental health condition?
* **mental_health_consequence**: Do you think that discussing a mental health issue with your employer would have negative consequences?
* **phys_health_consequence**: Do you think that discussing a physical health issue with your employer would have negative consequences?
* **coworkers**: Would you be willing to discuss a mental health issue with your coworkers?
* **supervisor**: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
* **mental_health_interview**: Would you bring up a mental health issue with a potential employer in an interview?
* **phys_health_interview**: Would you bring up a physical health issue with a potential employer in an interview?
* **mental_vs_physical**: Do you feel that your employer takes mental health as seriously as physical health?
* **obs_consequence**: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
* **comments**: Any additional notes or comments

In [None]:
# Look at the first rows
df.head()

# data row count
print(df.shape)

# types of data
print(df.info())

# missing data
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
print(missing_data)

(1259, 27)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 1

* 1259 entries, 27 columns
* All columns have the `object` dtype except Age (`int64`)
* Many missing values in **comments** (87%; plausible, because the field is optional) and **state** (about 41%); a few in **work_interfere** (about 21%) and **self_employed** (1.4%)

In [None]:
# Inspect the Country and state variables
print(df['Country'].value_counts())
print("\n \n")
print(df['state'].unique())

United States             751
United Kingdom            185
Canada                     72
Germany                    45
Ireland                    27
Netherlands                27
Australia                  21
France                     13
India                      10
New Zealand                 8
Poland                      7
Switzerland                 7
Sweden                      7
Italy                       7
South Africa                6
Belgium                     6
Brazil                      6
Israel                      5
Singapore                   4
Bulgaria                    4
Austria                     3
Finland                     3
Mexico                      3
Russia                      3
Denmark                     2
Greece                      2
Colombia                    2
Croatia                     2
Portugal                    2
Moldova                     1
Georgia                     1
Bahamas, The                1
China                       1
Thailand  

* **Timestamp** can be dropped (= date the survey was filled in; irrelevant)
* **comments** can be dropped (irrelevant)
* **state** can be dropped (cannot be interpreted meaningfully)
* **self_employed**: replace missing values with the most frequent category
* **work_interfere**: replace missing values with the most frequent category



In [None]:
# Inspect age and gender
print(df['Age'].unique())
print("\n \n")
print(df['Gender'].unique())

[         37          44          32          31          33          35
          39          42          23          29          36          27
          46          41          34          30          40          38
          50          24          18          28          26          22
          19          25          45          21         -29          43
          56          60          54         329          55 99999999999
          48          20          57          58          47          62
          51          65          49       -1726           5          53
          61           8          11          -1          72]

 

['Female' 'M' 'Male' 'male' 'female' 'm' 'Male-ish' 'maile' 'Trans-female'
 'Cis Female' 'F' 'something kinda male?' 'Cis Male' 'Woman' 'f' 'Mal'
 'Male (CIS)' 'queer/she/they' 'non-binary' 'Femake' 'woman' 'Make' 'Nah'
 'All' 'Enby' 'fluid' 'Genderqueer' 'Female ' 'Androgyne' 'Agender'
 'cis-female/femme' 'Guy (-ish) ^_^' 'male leaning androgynous

In [None]:
# Drop variables
df.drop(columns=['Timestamp', 'state', 'comments'], inplace=True)

# Consolidate the free-text gender answers into three categories.
# The result is assigned back instead of using inplace=True on a single column.
df['Gender'] = df['Gender'].replace(['Male ', 'male', 'M', 'm', 'Male', 'Cis Male',
                     'Man', 'cis male', 'Mail', 'Male-ish', 'Male (CIS)',
                      'Cis Man', 'msle', 'Malr', 'Mal', 'maile', 'Make',], 'Male')

df['Gender'] = df['Gender'].replace(['Female ', 'female', 'F', 'f', 'Woman', 'Female',
                     'femail', 'Cis Female', 'cis-female/femme', 'Femake', 'Female (cis)',
                     'woman',], 'Female')

df['Gender'] = df['Gender'].replace(['Female (trans)', 'queer/she/they', 'non-binary',
                     'fluid', 'queer', 'Androgyne', 'Trans-female', 'male leaning androgynous',
                      'Agender', 'A little about you', 'Nah', 'All',
                      'ostensibly male, unsure what that really means',
                      'Genderqueer', 'Enby', 'p', 'Neuter', 'something kinda male?',
                      'Guy (-ish) ^_^', 'Trans woman',], 'Divers')

df['Gender'].value_counts()

# Replace missing values in self_employed with the most frequent category
df['self_employed'] = df['self_employed'].fillna(df['self_employed'].value_counts().index[0])

## Show the most frequent answer of the variable
most_common_value = df['self_employed'].mode().iloc[0]
print(f'The most frequent answer of the variable is: {most_common_value}')

# Replace missing values in work_interfere with the most frequent category
df['work_interfere'] = df['work_interfere'].fillna(df['work_interfere'].value_counts().index[0])

# Limit ages to the range 18-75: values below 18 become 18, values above 75 become 75
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Age'] = df['Age'].clip(lower=18, upper=75)

# Fill any ages that could not be parsed as numbers with the median
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)

df.head()

The most frequent answer of the variable is: No


Unnamed: 0,Age,Gender,Country,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,37,Female,United States,No,No,Yes,Often,6-25,No,Yes,...,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,44,Male,United States,No,No,No,Rarely,More than 1000,No,No,...,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,32,Male,Canada,No,No,No,Rarely,6-25,No,Yes,...,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,31,Male,United Kingdom,No,Yes,Yes,Often,26-100,No,Yes,...,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,31,Male,United States,No,No,No,Never,100-500,Yes,Yes,...,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No


* Gender: Male 991, Female 247, Divers 21
* Ages below 18 and above 75 are limited to 18 and 75 (clamped to the bounds, not replaced by the median)
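
The cleaning cell pins implausible ages to the bounds 18 and 75. If replacing them with the median is preferred instead, the following minimal sketch shows one way to do it. The toy series and the 18 to 75 range are assumptions for illustration; the sketch does not touch `df`.

```python
import pandas as pd

# Hypothetical toy column containing a few implausible ages
ages = pd.Series([37, 44, -29, 329, 32, 5, 31], name="Age")

# Mark everything outside the plausible range as missing (between is inclusive)
plausible = ages.between(18, 75)
cleaned = ages.where(plausible)

# Fill the gaps with the median of the plausible values only
median_age = cleaned.median()
cleaned = cleaned.fillna(median_age)

print(cleaned.tolist())
```

Computing the median after masking keeps extreme entries such as 329 from influencing the replacement value.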

### **3. Converting categorical to numeric**

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create a new DataFrame for the encoded values
df_e = df.copy()

# Categorical columns to convert (no_employees is left out here)
categorical_columns = [
    'Gender', 'Country', 'self_employed', 'family_history', 'treatment',
    'work_interfere', 'remote_work', 'tech_company', 'benefits', 'care_options',
    'wellness_program', 'seek_help', 'anonymity', 'leave',
    'mental_health_consequence', 'phys_health_consequence', 'coworkers',
    'supervisor', 'mental_health_interview', 'phys_health_interview',
    'mental_vs_physical', 'obs_consequence',
]

# Fit one LabelEncoder per column and keep it in a dict for later lookups
label_encoders = {}
for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    df_e[column] = label_encoders[column].fit_transform(df[column])

Open question: should no_employees be grouped into categories, and if so, how?
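
One possible answer, sketched under assumptions: the company-size bands have a natural order, so they could be mapped to ordinal codes. The six bands below are the ones visible in the outputs of this notebook and are assumed to be the complete set; the toy series stands in for `df['no_employees']`.

```python
import pandas as pd

# Size bands as they appear in the notebook outputs, smallest to largest
size_order = ["1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"]

# Hypothetical toy column standing in for df["no_employees"]
no_employees = pd.Series(["6-25", "More than 1000", "26-100", "1-5"])

# An ordered categorical keeps the ranking; .codes yields 0..5 in that order
as_ordered = pd.Categorical(no_employees, categories=size_order, ordered=True)
codes = pd.Series(as_ordered.codes, index=no_employees.index)

print(codes.tolist())
```

Unlike `LabelEncoder`, which sorts labels alphabetically, this keeps the codes in order of company size. Any value outside `size_order` would receive the code -1.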

In [None]:
# Check the conversion (note: this shows df, which still holds the original labels; df_e holds the encoded values)
df.head()

Unnamed: 0,Age,Gender,Country,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,37,Female,United States,No,No,Yes,Often,6-25,No,Yes,...,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,44,Male,United States,No,No,No,Rarely,More than 1000,No,No,...,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,32,Male,Canada,No,No,No,Rarely,6-25,No,Yes,...,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,31,Male,United Kingdom,No,Yes,Yes,Often,26-100,No,Yes,...,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,31,Male,United States,No,No,No,Never,100-500,Yes,Yes,...,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No


**Show the mapping**

In [None]:
# Columns that were label-encoded above
encoded_columns = [
    'Gender', 'Country', 'self_employed', 'family_history', 'treatment',
    'work_interfere', 'remote_work', 'tech_company', 'benefits', 'care_options',
    'wellness_program', 'seek_help', 'anonymity', 'leave',
    'mental_health_consequence', 'phys_health_consequence', 'coworkers',
    'supervisor', 'mental_health_interview', 'phys_health_interview',
    'mental_vs_physical', 'obs_consequence',
]

# For each column, pair the original labels with their codes.
# unique() returns values in order of first appearance in both frames, so the rows line up.
for column in encoded_columns:
    mapping = pd.DataFrame({
        f'Original_{column}': df[column].unique(),
        f'Encoded_{column}': df_e[column].unique(),
    })
    print(mapping)


  Original_Gender  Encoded_Gender
0          Female               1
1            Male               2
2          Divers               0
          Original_Country  Encoded_Country
0            United States               45
1                   Canada                7
2           United Kingdom               44
3                 Bulgaria                6
4                   France               15
5                 Portugal               34
6              Netherlands               28
7              Switzerland               42
8                   Poland               33
9                Australia                0
10                 Germany               17
11                  Russia               36
12                  Mexico               26
13                  Brazil                5
14                Slovenia               38
15              Costa Rica               10
16                 Austria                1
17                 Ireland               21
18                   India  
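
The mapping can also be read directly from a fitted encoder. The following is a minimal sketch on a toy list with the three gender categories used in this notebook; `LabelEncoder` sorts the labels and uses each label's position in `classes_` as its code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy data with the three gender categories used in this notebook
gender = ["Female", "Male", "Male", "Divers", "Female"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(gender)

# classes_ is sorted; the index of each class is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Divers': 0, 'Female': 1, 'Male': 2}

# inverse_transform turns codes back into the original labels
restored = encoder.inverse_transform(encoded)
```

This agrees with the Gender table printed above (Divers 0, Female 1, Male 2) and does not rely on the order in which values first appear in the data.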

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumption: df_e is the encoded DataFrame

# Random split into training, test, and validation sets (70% / 15% / 15%)
train_df_e, temp_df_e = train_test_split(df_e, test_size=0.3, random_state=42)
test_df_e, valid_df_e = train_test_split(temp_df_e, test_size=0.5, random_state=42)

# Print the sizes of the datasets
print("Total number of records:", len(df_e))
print("Number of records in the training set:", len(train_df_e))
print("Number of records in the test set:", len(test_df_e))
print("Number of records in the validation set:", len(valid_df_e))

train_df_e.head()


Total number of records: 1259
Number of records in the training set: 881
Number of records in the test set: 189
Number of records in the validation set: 189


Unnamed: 0,Age,Gender,Country,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
380,41,2,7,0,0,1,0,500-1000,0,1,...,0,2,1,1,1,1,1,1,0,0
227,34,2,0,1,0,0,3,1-5,1,1,...,0,1,2,2,0,0,1,0,2,0
451,40,0,45,0,1,0,0,More than 1000,0,1,...,2,0,2,0,0,1,1,0,2,0
578,29,2,7,1,0,0,0,26-100,1,1,...,2,3,2,1,1,0,1,2,1,0
1197,34,2,21,0,0,1,1,6-25,0,1,...,0,1,2,0,1,2,1,0,1,0
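
The split sizes follow from the two-stage procedure: 30% of the rows are held out first, then that holdout is halved. The sketch below reproduces the arithmetic on a stand-in index array; only the row count of 1259 is taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 1259 survey rows (only the row count matters here)
rows = np.arange(1259)

# Stage 1: hold out 30% of the rows (rounded up to 378)
train, temp = train_test_split(rows, test_size=0.3, random_state=42)
# Stage 2: split the held-out rows in half
test, valid = train_test_split(temp, test_size=0.5, random_state=42)

print(len(train), len(test), len(valid))  # 881 189 189
```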


**Scaling the Age feature**

In [None]:
num_columns = train_df_e.shape[1]
print(f'Number of columns in the train data: {num_columns}')


Number of columns in the train data: 24


In [None]:
# Assuming train_df_e is your DataFrame
num_rows_train = train_df_e.shape[0]

print(f'Number of rows in the training data: {num_rows_train}')

Number of rows in the training data: 881


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Assumption: train_df_e is the training set
# Assumption: test_df_e is the test set

# Extract the 'Age' column of the training set
age_column_train = train_df_e['Age'].values.reshape(-1, 1)

# Initialize the MinMaxScaler and fit it on the 'Age' column of the training set
scaler = MinMaxScaler()
train_df_e['Age'] = scaler.fit_transform(age_column_train)

# Extract the 'Age' column of the test set
age_column_test = test_df_e['Age'].values.reshape(-1, 1)

# Apply the min-max normalization to the 'Age' column of the test set (with the same scaler)
test_df_e['Age'] = scaler.transform(age_column_test)

# Extract the 'Age' column of the validation set
age_column_valid = valid_df_e['Age'].values.reshape(-1, 1)

# Apply the min-max normalization to the 'Age' column of the validation set (with the same scaler)
valid_df_e['Age'] = scaler.transform(age_column_valid)

# Print the rescaled 'Age' column
print(train_df_e['Age'])

train_df_e.head()

380     0.403509
227     0.280702
451     0.385965
578     0.192982
1197    0.280702
          ...   
1044    0.192982
1095    0.315789
1130    0.421053
860     0.245614
1126    0.035088
Name: Age, Length: 881, dtype: float64


Unnamed: 0,Age,Gender,Country,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
380,0.403509,2,7,0,0,1,0,500-1000,0,1,...,0,2,1,1,1,1,1,1,0,0
227,0.280702,2,0,1,0,0,3,1-5,1,1,...,0,1,2,2,0,0,1,0,2,0
451,0.385965,0,45,0,1,0,0,More than 1000,0,1,...,2,0,2,0,0,1,1,0,2,0
578,0.192982,2,7,1,0,0,0,26-100,1,1,...,2,3,2,1,1,0,1,2,1,0
1197,0.280702,2,21,0,0,1,1,6-25,0,1,...,0,1,2,0,1,2,1,0,1,0
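
Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the training set only. The sketch below assumes the training ages span 18 to 75, which is consistent with the printed values: row 380 has age 41, and (41 - 18) / (75 - 18) ≈ 0.403509. The toy arrays are illustrative, not the survey data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed training ages; 18 and 75 are taken as the observed min and max
train_ages = np.array([[18], [29], [34], [41], [75]], dtype=float)
# Unseen ages, for example from a test split
test_ages = np.array([[40], [80]], dtype=float)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_ages)  # learns min=18, max=75
test_scaled = scaler.transform(test_ages)        # reuses the training min/max

# Same result by hand: (x - min) / (max - min)
manual = (41 - 18) / (75 - 18)
print(round(train_scaled[3, 0], 6), round(manual, 6))
```

Because the test and validation sets reuse the training bounds, a value outside the training range (such as 80 here) is mapped outside [0, 1].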


**Scaling the Country feature as well**

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Assumption: train_df_e is the training set
# Assumption: test_df_e is the test set

# Extract the 'Country' column of the training set
country_column_train = train_df_e['Country'].values.reshape(-1, 1)

# Initialize the MinMaxScaler and fit it on the 'Country' column of the training set
scaler_country = MinMaxScaler()
train_df_e['Country'] = scaler_country.fit_transform(country_column_train)

# Extract the 'Country' column of the test set
country_column_test = test_df_e['Country'].values.reshape(-1, 1)

# Apply the min-max normalization to the 'Country' column of the test set (with the same scaler)
test_df_e['Country'] = scaler_country.transform(country_column_test)

# Extract the 'Country' column of the validation set
country_column_valid = valid_df_e['Country'].values.reshape(-1, 1)

# Apply the min-max normalization to the 'Country' column of the validation set (with the same scaler)
valid_df_e['Country'] = scaler_country.transform(country_column_valid)


**Dropping the employee feature**

In [None]:
# Assuming "no_employees" is the column you want to drop

# Drop the "no_employees" column from the training set
train_df_e = train_df_e.drop(columns=['no_employees'])

# Drop the "no_employees" column from the test set
test_df_e = test_df_e.drop(columns=['no_employees'])

# Drop the "no_employees" column from the validation set
valid_df_e = valid_df_e.drop(columns=['no_employees'])


### **4. Model training, testing, and manual hyperparameter tuning**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Assuming you have your datasets: train_df_e, test_df_e, valid_df_e

# Target column
target_column = 'treatment'  # Replace with your actual target column

# Extract features and target for training set
X_train = train_df_e.drop(columns=[target_column])
y_train = train_df_e[target_column]

# Extract features and target for test set
X_test = test_df_e.drop(columns=[target_column])
y_test = test_df_e[target_column]


# Instantiate the MLPClassifier (you can customize the parameters)
neural_network = MLPClassifier(
    hidden_layer_sizes=(50, 20),  # Adjust as needed
    activation='relu',
    solver='adam',
    alpha=0.0001,
    batch_size=100,
    learning_rate_init=0.0001,
    max_iter=150,
    random_state=42
)

# Train the neural network
neural_network.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = neural_network.predict(X_train)

# Make predictions on the test set
y_test_pred = neural_network.predict(X_test)

# Evaluate the model on the training set (you can use other classification metrics)
accuracy_train = accuracy_score(y_train, y_train_pred)
print(f'Accuracy on the training set: {accuracy_train:.2f}')

# Evaluate the model on the test set
accuracy_test = accuracy_score(y_test, y_test_pred)
print(f'Accuracy on the test set: {accuracy_test:.2f}')


Accuracy on the training set: 0.76
Accuracy on the test set: 0.71


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Assuming you have your datasets: train_df_e, test_df_e, valid_df_e

# Target column
target_column = 'treatment'  # Replace with your actual target column

# Extract features and target for training set
X_train = train_df_e.drop(columns=[target_column])
y_train = train_df_e[target_column]

# Extract features and target for test set
X_test = test_df_e.drop(columns=[target_column])
y_test = test_df_e[target_column]


# Instantiate the MLPClassifier (you can customize the parameters)
neural_network = MLPClassifier(
    hidden_layer_sizes=(100,50),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    batch_size=50,
    learning_rate_init=0.0001,
    max_iter=100,
    random_state=42
)

# Train the neural network
neural_network.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = neural_network.predict(X_train)

# Make predictions on the test set
y_test_pred = neural_network.predict(X_test)

# Evaluate the model on the training set (you can use other classification metrics)
accuracy_train = accuracy_score(y_train, y_train_pred)
print(f'Accuracy on the training set: {accuracy_train:.2f}')

# Evaluate the model on the test set
accuracy_test = accuracy_score(y_test, y_test_pred)
print(f'Accuracy on the test set: {accuracy_test:.2f}')


Accuracy on the training set: 0.78
Accuracy on the test set: 0.74
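
The code comments mention that other classification metrics can be used. As a minimal sketch on hypothetical labels (not the notebook's predictions), the confusion matrix and the per-class report show which kind of error the accuracy figure hides. The class coding assumes 0 = "No" and 1 = "Yes" for `treatment`, which is what `LabelEncoder` produces for those two labels.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions (1 = sought treatment, 0 = did not)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"accuracy={accuracy:.2f}  TN={tn} FP={fp} FN={fn} TP={tp}")
print(classification_report(y_true, y_pred, target_names=["no treatment", "treatment"]))
```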


### **5. Hyperparameter optimization with random search**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import randint, uniform

# Assuming you have your datasets: train_df_e, test_df_e, valid_df_e

# Target column
target_column = 'treatment'  # Replace with your actual target column

# Extract features and target for training set
X_train = train_df_e.drop(columns=[target_column])
y_train = train_df_e[target_column]

# Extract features and target for test set
X_test = test_df_e.drop(columns=[target_column])
y_test = test_df_e[target_column]

# Define the parameter search space
param_dist = {
    'hidden_layer_sizes': [(30, 10), (50, 20), (100, 50)],
    'activation': ['relu'],
    'solver': ['adam'],
    'alpha': uniform(1e-7, 1e-1),
    'batch_size': [16, 32, 64, 128, 256],
    'learning_rate_init': uniform(1e-6, 1e-2),
    'max_iter': randint(50, 500),
}


# Instantiate the MLPClassifier
neural_network = MLPClassifier(random_state=42)

# Instantiate RandomizedSearchCV
random_search = RandomizedSearchCV(
    neural_network,
    param_distributions=param_dist,
    n_iter=7,  # Number of parameter settings that are sampled
    cv=3,  # Number of cross-validation folds
    random_state=42,
    n_jobs=-1,  # Use all available processors
    verbose=2,  # Higher verbosity for more information
)

# Perform random search on the training set
random_search.fit(X_train, y_train)

# Print the best parameters
print("Best Parameters: ", random_search.best_params_)

# Evaluate the best model on the training set
y_train_pred = random_search.best_estimator_.predict(X_train)

# Evaluate accuracy on the training set
accuracy_train = accuracy_score(y_train, y_train_pred)
print(f'Accuracy on the training set: {accuracy_train:.2f}')

# Evaluate the best model on the test set
y_test_pred = random_search.best_estimator_.predict(X_test)

# Evaluate accuracy on the test set
accuracy_test = accuracy_score(y_test, y_test_pred)
print(f'Accuracy on the test set: {accuracy_test:.2f}')

# Trying out cross-validation
# Note: this refits the whole random search inside each of the 5 folds of the test set (189 rows)
cv_scores = cross_val_score(random_search, X_test, y_test, cv=5)  # 5-fold cross-validation

print("Cross-Validation Scores:", cv_scores)
print("Mean accuracy:", cv_scores.mean())



Fitting 3 folds for each of 7 candidates, totalling 21 fits
Best Parameters:  {'activation': 'relu', 'alpha': 0.03998619717152555, 'batch_size': 128, 'hidden_layer_sizes': (100, 50), 'learning_rate_init': 0.004561699842170359, 'max_iter': 224, 'solver': 'adam'}
Accuracy on the training set: 1.00
Accuracy on the test set: 0.67
Fitting 3 folds for each of 7 candidates, totalling 21 fits
Fitting 3 folds for each of 7 candidates, totalling 21 fits
Fitting 3 folds for each of 7 candidates, totalling 21 fits
Fitting 3 folds for each of 7 candidates, totalling 21 fits
Fitting 3 folds for each of 7 candidates, totalling 21 fits
Cross-Validation Scores: [0.76315789 0.57894737 0.65789474 0.52631579 0.67567568]
Mean accuracy: 0.640398293029872
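
A training accuracy of 1.00 next to a test accuracy of 0.67 suggests that the selected model overfits. The search object already stores a cross-validated estimate for the winning candidate in `best_score_`, which is a fairer number to compare against the test accuracy than the training accuracy. The sketch below illustrates this on synthetic data from `make_classification`; the data, the small parameter grid, and the variable names are assumptions for illustration only.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the survey features (not the real data)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

search = RandomizedSearchCV(
    MLPClassifier(max_iter=300, random_state=42),
    param_distributions={
        "hidden_layer_sizes": [(10,), (20, 10)],
        "alpha": uniform(1e-4, 1e-1),  # samples from [loc, loc + scale]
    },
    n_iter=3,
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)

# Mean accuracy of the best candidate over the held-out CV folds
cv_accuracy = search.best_score_
# Accuracy of the refitted best model on the training data and on unseen data
train_accuracy = search.score(X_train, y_train)
test_accuracy = search.score(X_test, y_test)

print(f"train={train_accuracy:.2f}  cv={cv_accuracy:.2f}  test={test_accuracy:.2f}")
```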


**Validation Set**

In [None]:
# the best configuration I could find
# (based on the first code cell above; I tried out a few more things there and think this is good now :) )

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Assuming you have your datasets: train_df_e, test_df_e, valid_df_e

# Dataset Preparation
## Target column = treatment
target_column = 'treatment'

## Extract features and target for training set
X_train = train_df_e.drop(columns=[target_column])
y_train = train_df_e[target_column]

## Extract features and target for test set
X_test = test_df_e.drop(columns=[target_column])
y_test = test_df_e[target_column]

## Extract features and target for validation set
X_valid = valid_df_e.drop(columns=[target_column])
y_valid = valid_df_e[target_column]

# Neural Network Initialization
## Instantiate the MLPClassifier with optimized hyperparameters
neural_network = MLPClassifier(
    hidden_layer_sizes=(23, 31),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    batch_size=100,
    learning_rate_init=0.0001,
    max_iter=250,
    random_state=42
)

## Train the neural network
neural_network.fit(X_train, y_train)

## Make predictions on the training set
y_train_pred = neural_network.predict(X_train)

# Model testing
## Make predictions on the test set
y_test_pred = neural_network.predict(X_test)

# Model validation
## Make predictions on the validation set
y_valid_pred = neural_network.predict(X_valid)

## Evaluate the model on the training set (you can use other classification metrics)
accuracy_train = accuracy_score(y_train, y_train_pred)
print(f'Accuracy on the training set: {accuracy_train:.2f}')

## Evaluate the model on the test set
accuracy_test = accuracy_score(y_test, y_test_pred)
print(f'Accuracy on the test set: {accuracy_test:.2f}')

## Evaluate the model on the validation set
accuracy_valid = accuracy_score(y_valid, y_valid_pred)
print(f'Accuracy on the validation set: {accuracy_valid:.2f}')

Accuracy on the training set: 0.74
Accuracy on the test set: 0.76
Accuracy on the validation set: 0.67
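
The three evaluation steps repeat the same pattern, so they can be folded into one small helper. This is a sketch on synthetic data; `accuracy_by_split` is a hypothetical helper name, and the data come from `make_classification` rather than the survey.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def accuracy_by_split(model, splits):
    """Return {split name: accuracy} for an already fitted classifier."""
    return {name: accuracy_score(y, model.predict(X)) for name, (X, y) in splits.items()}


# Synthetic stand-in data, split roughly 70/15/15 like the notebook
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_valid, y_test, y_valid = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

model = MLPClassifier(hidden_layer_sizes=(23, 31), max_iter=250, random_state=42)
model.fit(X_train, y_train)

scores = accuracy_by_split(model, {
    "training": (X_train, y_train),
    "test": (X_test, y_test),
    "validation": (X_valid, y_valid),
})
for name, value in scores.items():
    print(f"Accuracy on the {name} set: {value:.2f}")
```

Keeping each score under its own key avoids reusing one variable for two different splits.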
