## Dataset Source
This dataset comes from the OSMH/OSMI Mental Health in Tech Survey 2014:
https://osmihelp.org/research

## Dataset Information
This dataset is from a 2014 survey that measures attitudes towards mental health and the frequency of mental health disorders in the tech workplace. The training set contains 1007 records: 510 respondents who sought treatment for a mental health condition and 497 who did not. Studying it may help companies build supportive environments for people affected by mental health disorders. The "treatment" field is the class label that separates the two groups (sought treatment or not).

## Attribute Information:
This dataset contains the following data:

1. Age
2. Gender
3. self_employed: Are you self-employed?
4. family_history: Do you have a family history of mental illness?
5. work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
6. no_employees: How many employees does your company or organization have?
7. remote_work: Do you work remotely (outside of an office) at least 50% of the time?
8. tech_company: Is your employer primarily a tech company/organization?
9. benefits: Does your employer provide mental health benefits?
10. care_options: Do you know the options for mental health care your employer provides?
11. wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
12. seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
13. anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
14. leave: How easy is it for you to take medical leave for a mental health condition?
15. mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
16. phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
17. coworkers: Would you be willing to discuss a mental health issue with your coworkers?
18. supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
19. mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?
20. phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?
21. mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?
22. obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
23. treatment: Have you sought treatment for a mental health condition?


### Download the training set

In [None]:
# Download from Google Drive
!gdown --id 1HZnYBOe8Z04UzK6T0BXeTH5oaU_ABjIz

Downloading...
From: https://drive.google.com/uc?id=1HZnYBOe8Z04UzK6T0BXeTH5oaU_ABjIz
To: /content/project2.zip
100% 19.2k/19.2k [00:00<00:00, 19.4MB/s]


In [None]:
!unzip project2.zip
# If you see the prompt "replace project2_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename:",
# you may enter "A"

Archive:  project2.zip
replace project2_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: project2_test.csv       
  inflating: project2_train.csv      


In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('project2_train.csv')
df.columns

Index(['Age', 'Gender', 'self_employed', 'family_history', 'work_interfere',
       'no_employees', 'remote_work', 'tech_company', 'benefits',
       'care_options', 'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'treatment'],
      dtype='object')

In [None]:
df.head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,treatment
0,37,Female,,No,Often,6-25,No,Yes,Yes,Not sure,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,Yes
1,44,M,,No,Rarely,More than 1000,No,No,Don't know,No,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,No
2,32,Male,,No,Rarely,6-25,No,Yes,No,No,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,No
3,31,Male,,Yes,Often,26-100,No,Yes,No,Yes,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,Yes
4,31,Male,,No,Never,100-500,Yes,Yes,Yes,No,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,No


In [None]:
for col in df.columns:
  print('Unique values in {} :'.format(col), len(df[col].unique()))

Unique values in Age : 52
Unique values in Gender : 44
Unique values in self_employed : 3
Unique values in family_history : 2
Unique values in work_interfere : 5
Unique values in no_employees : 6
Unique values in remote_work : 2
Unique values in tech_company : 2
Unique values in benefits : 3
Unique values in care_options : 3
Unique values in wellness_program : 3
Unique values in seek_help : 3
Unique values in anonymity : 3
Unique values in leave : 5
Unique values in mental_health_consequence : 3
Unique values in phys_health_consequence : 3
Unique values in coworkers : 3
Unique values in supervisor : 3
Unique values in mental_health_interview : 3
Unique values in phys_health_interview : 3
Unique values in mental_vs_physical : 3
Unique values in obs_consequence : 2
Unique values in treatment : 2
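
The counts above come from `len(df[col].unique())`, which treats a missing value as one more distinct entry. A minimal sketch on a made-up column (the `toy` series is an assumption, not part of the dataset) shows how that differs from `Series.nunique`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: two real answers plus one missing value.
toy = pd.Series([np.nan, 'Yes', 'No', 'No'], name='self_employed')

with_nan = len(toy.unique())          # NaN is counted as a distinct value
without_nan = toy.nunique()           # dropna=True by default, NaN is ignored
explicit = toy.nunique(dropna=False)  # matches len(unique())

print(with_nan, without_nan, explicit)
```

Read this way, a count of 3 for a yes/no question may simply indicate that the column contains missing entries.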


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Age                        1007 non-null   int64 
 1   Gender                     1007 non-null   object
 2   self_employed              994 non-null    object
 3   family_history             1007 non-null   object
 4   work_interfere             800 non-null    object
 5   no_employees               1007 non-null   object
 6   remote_work                1007 non-null   object
 7   tech_company               1007 non-null   object
 8   benefits                   1007 non-null   object
 9   care_options               1007 non-null   object
 10  wellness_program           1007 non-null   object
 11  seek_help                  1007 non-null   object
 12  anonymity                  1007 non-null   object
 13  leave                      1007 non-null   object
 14  mental_h

In [None]:
# Check for missing values
df.isnull().sum().sort_values(ascending=False)

work_interfere               207
self_employed                 13
Age                            0
leave                          0
obs_consequence                0
mental_vs_physical             0
phys_health_interview          0
mental_health_interview        0
supervisor                     0
coworkers                      0
phys_health_consequence        0
mental_health_consequence      0
seek_help                      0
anonymity                      0
Gender                         0
wellness_program               0
care_options                   0
benefits                       0
tech_company                   0
remote_work                    0
no_employees                   0
family_history                 0
treatment                      0
dtype: int64

In [None]:
df['work_interfere'].unique()

array(['Often', 'Rarely', 'Never', 'Sometimes', nan], dtype=object)

In [None]:
df.self_employed.unique()

array([nan, 'Yes', 'No'], dtype=object)

In [None]:
df.Gender.unique()

array(['Female', 'M', 'Male', 'female', 'male', 'm', 'maile',
       'Trans-female', 'Cis Female', 'F', 'something kinda male?',
       'Cis Male', 'Woman', 'f', 'Mal', 'Male (CIS)', 'queer/she/they',
       'non-binary', 'woman', 'Make', 'Nah', 'All', 'Enby', 'fluid',
       'Genderqueer', 'Androgyne', 'cis-female/femme', 'Guy (-ish) ^_^',
       'male leaning androgynous', 'Male ', 'Trans woman', 'Man', 'msle',
       'Neuter', 'queer', 'Female (cis)', 'Mail', 'cis male',
       'A little about you', 'Malr', 'p', 'femail', 'Cis Man',
       'ostensibly male unsure what that really means'], dtype=object)

In [None]:
other  = ['A little about you', 'p', 'Nah', 'Enby', 'Trans-female','something kinda male?','queer/she/they','non-binary','All','fluid', 'Genderqueer','Androgyne', 'Agender','Guy (-ish) ^_^', 'male leaning androgynous','Trans woman','Neuter', 'Female (trans)','queer','ostensibly male unsure what that really means','trans']
male   = ['male', 'Male','M', 'm', 'Male-ish', 'maile','Cis Male','Mal', 'Male (CIS)','Make','Male ', 'Man', 'msle','cis male', 'Cis Man','Malr','Mail']
female = ['Female', 'female','Cis Female', 'F','f','Femake', 'woman','Female ','cis-female/femme','Female (cis)','femail','Woman','female']

In [None]:
df.Age.min(), df.Age.max()

(-1726, 99999999999)

In [None]:
df.treatment = df.treatment.astype('category')
df.treatment = df.treatment.cat.codes
df.treatment.value_counts()

1    510
0    497
Name: treatment, dtype: int64
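
With 510 positive and 497 negative labels the classes are nearly balanced. As a rough reference point, the accuracy of always predicting the majority class can be worked out directly from those two counts:

```python
# Counts taken from the value_counts() output above.
sought, not_sought = 510, 497

total = sought + not_sought
majority_baseline = max(sought, not_sought) / total

print(f"{total} rows, majority-class baseline accuracy = {majority_baseline:.3f}")
```

Any model worth keeping should score clearly above this baseline on held-out data.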

### The stage is yours

In [None]:
# Handle the Age column
df['Age'].value_counts()

 32             67
 26             64
 29             62
 31             60
 27             60
 34             57
 28             54
 30             52
 33             51
 35             46
 25             46
 23             40
 24             33
 37             33
 38             32
 36             32
 40             27
 39             27
 43             20
 41             17
 22             16
 42             15
 21             12
 44             10
 46             10
 45              8
 19              7
 48              6
 18              4
 50              4
 56              4
 51              3
 20              3
 55              3
 54              3
 60              2
 49              2
-1726            1
 11              1
-1               1
 8               1
 61              1
 5               1
 47              1
 57              1
 65              1
 62              1
-29              1
 58              1
 99999999999     1
 329             1
 72              1
Name: Age, d

In [None]:
df["Age"].unique()

array([         37,          44,          32,          31,          33,
                35,          42,          23,          36,          29,
                46,          41,          34,          30,          40,
                27,          38,          50,          24,          28,
                26,          22,          19,          25,          39,
                18,          45,         -29,          43,          21,
                56,          60,          54,         329,          55,
       99999999999,          48,          20,          58,          47,
                62,          65,          49,          57,       -1726,
                 5,          51,          61,           8,          11,
                -1,          72])

In [None]:
# Clamp implausible ages into the 16-75 range (the same bounds applied to the test set later)
df.loc[df.Age<16,'Age']=16
df.loc[df.Age>75,'Age']=75
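
The assignments above pull out-of-range ages back to a boundary value. A similar effect can be sketched with `Series.clip`, assuming bounds of 16 and 75; the `ages` series below is a hypothetical sample built from a few of the values seen in the column:

```python
import pandas as pd

# Hypothetical sample containing some of the invalid ages observed above.
ages = pd.Series([37, -1726, 5, 99999999999, 329, 72, 16])

# Values below 16 become 16, values above 75 become 75, the rest are untouched.
clipped = ages.clip(lower=16, upper=75)

print(clipped.tolist())
```

Clamping keeps every row but invents an age for invalid answers; treating them as missing is another option, depending on how the model handles gaps.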

In [None]:
# Confirm that no ages above 80 remain
df[df['Age']>80].head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,treatment


In [None]:
df['remote_work'].value_counts()

No     692
Yes    315
Name: remote_work, dtype: int64

In [None]:
##
# Fill missing work_interfere values
df["work_interfere"]=df["work_interfere"].fillna("Sometimes")

In [None]:
# Fill missing self_employed values, using "Sometimes" as a placeholder third category
df["self_employed"]=df["self_employed"].fillna("Sometimes")

In [None]:
# Normalize Gender into three groups, using the male / female / other lists defined earlier.
# The replacement value must be passed explicitly: when it is omitted, older pandas
# versions forward-fill the matched entries, which collapses the column to one value.
df['Gender'] = df['Gender'].replace(male, 'Male')
df['Gender'] = df['Gender'].replace(female, 'Female')
df['Gender'] = df['Gender'].replace(other, 'Other')
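
Listing every spelling by hand is fragile when new variants appear. A minimal sketch of a more tolerant approach lowercases and strips each answer before looking it up; the `MALE` and `FEMALE` sets and the `normalize_gender` helper are assumptions made for illustration, not part of the original notebook:

```python
import pandas as pd

# Hypothetical lookup sets, already lowercased and stripped.
MALE = {'male', 'm', 'man', 'cis male', 'cis man', 'male (cis)',
        'maile', 'mal', 'make', 'msle', 'malr', 'mail'}
FEMALE = {'female', 'f', 'woman', 'cis female', 'female (cis)',
          'cis-female/femme', 'femail'}

def normalize_gender(raw: str) -> str:
    """Map a free-text answer to 'Male', 'Female', or 'Other'."""
    key = raw.strip().lower()
    if key in MALE:
        return 'Male'
    if key in FEMALE:
        return 'Female'
    return 'Other'

sample = pd.Series(['Female', 'M', 'Male ', 'maile', 'Cis Female',
                    'non-binary', 'Trans woman'])
normalized = sample.map(normalize_gender)

print(normalized.tolist())
```

Anything not recognised falls into 'Other', so unseen spellings in the test set cannot slip through as raw strings.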

In [None]:
# Confirm that no missing values remain
df.isna().sum()

Age                          0
Gender                       0
self_employed                0
family_history               0
work_interfere               0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
mental_vs_physical           0
obs_consequence              0
treatment                    0
dtype: int64

In [None]:
df.head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,treatment
0,37,Female,Sometimes,No,Often,6-25,No,Yes,Yes,Not sure,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,1
1,44,Female,Sometimes,No,Rarely,More than 1000,No,No,Don't know,No,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,0
2,32,Female,Sometimes,No,Rarely,6-25,No,Yes,No,No,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,0
3,31,Female,Sometimes,Yes,Often,26-100,No,Yes,No,Yes,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,1
4,31,Female,Sometimes,No,Never,100-500,Yes,Yes,Yes,No,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,0


In [None]:
##
# Encode each categorical answer as an integer code.
yes_no        = {'No': 0, 'Yes': 1}
yes_no_unsure = {'No': 0, 'Yes': 1, "Don't know": 2}
yes_no_maybe  = {'No': 0, 'Yes': 1, 'Maybe': 2}
yes_no_some   = {'No': 0, 'Yes': 1, 'Some of them': 2}

encodings = {
    'Gender': {'Female': 0, 'Male': 1, 'Other': 2},
    'self_employed': {'No': 0, 'Yes': 1, 'Sometimes': 2},
    'family_history': yes_no,
    'work_interfere': {'Never': 0, 'Often': 1, 'Rarely': 2, 'Sometimes': 3},
    'no_employees': {'1-5': 0, '6-25': 1, '26-100': 2, '100-500': 3,
                     '500-1000': 4, 'More than 1000': 5},
    'remote_work': yes_no,
    'tech_company': yes_no,
    'benefits': yes_no_unsure,
    'care_options': {'No': 0, 'Yes': 1, 'Not sure': 2},
    'wellness_program': yes_no_unsure,
    'seek_help': yes_no_unsure,
    'anonymity': yes_no_unsure,
    'leave': {'Somewhat difficult': 0, 'Somewhat easy': 1, "Don't know": 2,
              'Very easy': 3, 'Very difficult': 4},
    'mental_health_consequence': yes_no_maybe,
    'phys_health_consequence': yes_no_maybe,
    'coworkers': yes_no_some,
    'supervisor': yes_no_some,
    'mental_health_interview': yes_no_maybe,
    'phys_health_interview': yes_no_maybe,
    'mental_vs_physical': yes_no_unsure,
    'obs_consequence': yes_no,
}

# Series.map replaces the chained assignment df[col][mask] = value, which pandas
# may apply to a temporary copy. Answers missing from a mapping become NaN.
for col, mapping in encodings.items():
    df[col] = df[col].map(mapping)
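
The integer codes above impose an order that the answers do not necessarily have (for example, "Don't know" sits between "Somewhat easy" and "Very easy" for `leave`). One possible alternative, sketched here on a made-up frame, is one-hot encoding with `pd.get_dummies`; the `toy` data is an assumption for illustration only:

```python
import pandas as pd

# Hypothetical three-row frame with one multi-level and one binary column.
toy = pd.DataFrame({
    'leave': ['Somewhat easy', "Don't know", 'Very difficult'],
    'remote_work': ['No', 'Yes', 'No'],
})

# Each distinct 'leave' answer becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=['leave'], dtype=int)

print(encoded.columns.tolist())
```

Linear models and SVMs then see each answer as a separate feature instead of a position on an arbitrary scale, at the cost of more columns.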

In [None]:
df.value_counts()

Age  Gender  self_employed  family_history  work_interfere  no_employees  remote_work  tech_company  benefits  care_options  wellness_program  seek_help  anonymity  leave  mental_health_consequence  phys_health_consequence  coworkers  supervisor  mental_health_interview  phys_health_interview  mental_vs_physical  obs_consequence  treatment
35   0       1              1               1               0             0            1             0         1             1                 0          2          1      0                          0                        2          2           2                        1                      1                   0                1            2
32   0       0              1               2               1             0            0             0         0             0                 0          0          2      1                          2                        2          0           0                        0                      0               

In [None]:
df

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,treatment
0,37,0,2,0,1,1,0,1,1,2,...,1,0,0,2,1,0,2,1,0,1
1,44,0,2,0,2,5,0,0,2,0,...,2,2,0,0,0,0,0,2,0,0
2,32,0,2,0,2,1,0,1,0,0,...,0,0,0,1,1,1,1,0,0,0
3,31,0,2,1,1,2,0,1,0,1,...,0,1,1,2,0,2,2,0,1,1
4,31,0,2,0,0,3,1,1,1,0,...,2,0,0,2,1,1,1,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1002,36,0,0,1,2,5,0,0,2,0,...,1,2,2,2,2,0,0,2,0,0
1003,26,0,0,0,3,2,0,1,0,0,...,1,0,0,2,2,0,0,2,0,1
1004,32,0,0,1,1,2,1,1,1,1,...,0,0,0,2,1,0,0,1,0,1
1005,34,0,0,1,3,5,0,1,1,1,...,0,1,1,0,0,0,0,0,0,1


In [None]:
# Logistic regression
from sklearn.model_selection import train_test_split
# Train and Test set
X = df.drop('treatment', axis=1)
y = df.treatment

# Splitting
#X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 409570298) # set random_state to your own student ID (digits only)

In [None]:
#print("X_train dataset: ", X_train.shape)
#print("y_train dataset: ", y_train.shape)
#print("X_val dataset: ", X_val.shape)
#print("y_val dataset: ", y_val.shape)
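
The split above is commented out, so the models later in the notebook are fit on every training row. A minimal sketch of what a hold-out split could look like, using synthetic arrays (`X_toy`, `y_toy` and the seed are assumptions) and `stratify` to keep the class ratio equal in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 features, perfectly balanced labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array([0] * 50 + [1] * 50)

# stratify=y_toy keeps the 50/50 class ratio in both the train and validation parts.
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("X_train:", X_train.shape, "X_val:", X_val.shape)
print("positives in validation:", int(y_val.sum()))
```

Scoring on `X_val` gives an estimate of how the model may behave on the unlabeled test file.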

In [None]:
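##
# Standardization
# Note: X was extracted from df in an earlier cell, so X keeps the unscaled values;
# this call rescales every column of df, including the treatment label.
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
df[df.columns] = std_scaler.fit_transform(df)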
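
Scaling only helps if the model is trained on the scaled features and new rows pass through the same transformation. One way to tie the two together, sketched on synthetic data (all arrays and the labeling rule below are assumptions), is a scikit-learn pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=30, scale=8, size=(200, 3))
y_train = (X_train[:, 0] > 30).astype(int)
X_new = rng.normal(loc=30, scale=8, size=(5, 3))

# The scaler is fit on the training features only; predict() reuses those
# statistics for new rows, and the label is never scaled.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
pred = model.predict(X_new)

print(pred)
```

With this arrangement the test set never influences the scaling statistics, and there is no separate transform step to forget at prediction time.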

In [None]:
df

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,treatment
0,0.643414,0.0,4.660493,-0.798372,-0.932585,-0.819278,-0.674687,0.469687,-0.035176,1.439872,...,-0.876747,-1.121628,-0.589562,0.726846,0.040081,-0.495418,1.024941,-0.242424,-0.416725,0.987173
1,1.570524,0.0,4.660493,-0.798372,-0.081939,1.490065,-0.674687,-2.129077,1.229890,-1.086175,...,0.088154,1.155548,-0.589562,-1.754285,-1.261901,-0.495418,-1.159431,0.966096,-0.416725,-1.012994
2,-0.018808,0.0,4.660493,-0.798372,-0.081939,-0.819278,-0.674687,0.469687,-1.300241,-1.086175,...,-1.841647,-1.121628,-0.589562,-0.513720,0.040081,0.824387,-0.067245,-1.450945,-0.416725,-1.012994
3,-0.151252,0.0,4.660493,1.252548,-0.932585,-0.241942,-0.674687,0.469687,-1.300241,0.176848,...,-1.841647,0.016960,0.604983,0.726846,-1.261901,2.144192,1.024941,-1.450945,2.399664,0.987173
4,-0.151252,0.0,4.660493,-0.798372,-1.783231,0.335394,1.482169,0.469687,-0.035176,-1.086175,...,0.088154,-1.121628,-0.589562,0.726846,0.040081,0.824387,-0.067245,0.966096,-0.416725,-1.012994
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1002,0.510970,0.0,-0.391301,1.252548,-0.081939,1.490065,-0.674687,-2.129077,1.229890,-1.086175,...,-0.876747,1.155548,1.799527,0.726846,1.342062,-0.495418,-1.159431,0.966096,-0.416725,-1.012994
1003,-0.813474,0.0,-0.391301,-0.798372,0.768707,-0.241942,-0.674687,0.469687,-1.300241,-1.086175,...,-0.876747,-1.121628,-0.589562,0.726846,1.342062,-0.495418,-1.159431,0.966096,-0.416725,0.987173
1004,-0.018808,0.0,-0.391301,1.252548,-0.932585,-0.241942,1.482169,0.469687,-0.035176,0.176848,...,-1.841647,-1.121628,-0.589562,0.726846,0.040081,-0.495418,-1.159431,-0.242424,-0.416725,0.987173
1005,0.246081,0.0,-0.391301,1.252548,0.768707,1.490065,-0.674687,0.469687,-0.035176,0.176848,...,-1.841647,0.016960,0.604983,-1.754285,-1.261901,-0.495418,-1.159431,-1.450945,-0.416725,0.987173


In [None]:
X

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,37,0,2,0,1,1,0,1,1,2,...,1,1,0,0,2,1,0,2,1,0
1,44,0,2,0,2,5,0,0,2,0,...,2,2,2,0,0,0,0,0,2,0
2,32,0,2,0,2,1,0,1,0,0,...,2,0,0,0,1,1,1,1,0,0
3,31,0,2,1,1,2,0,1,0,1,...,0,0,1,1,2,0,2,2,0,1
4,31,0,2,0,0,3,1,1,1,0,...,2,2,0,0,2,1,1,1,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1002,36,0,0,1,2,5,0,0,2,0,...,2,1,2,2,2,2,0,0,2,0
1003,26,0,0,0,3,2,0,1,0,0,...,2,1,0,0,2,2,0,0,2,0
1004,32,0,0,1,1,2,1,1,1,1,...,1,0,0,0,2,1,0,0,1,0
1005,34,0,0,1,3,5,0,1,1,1,...,2,0,1,1,0,0,0,0,0,0


In [None]:
from sklearn.preprocessing import LabelEncoder
lb=LabelEncoder()
y=lb.fit_transform(y)

In [None]:
from sklearn.linear_model import LogisticRegression
lg= LogisticRegression(penalty='none',solver='sag')


In [None]:
lg.fit(X,y)

LogisticRegression(penalty='none', solver='sag')

In [None]:
y

array([1, 0, 0, ..., 1, 1, 0])

In [None]:
from sklearn.svm import SVC
s=SVC()
s.fit(X,y)

SVC()
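
Both models above are fit on the full training set, so the notebook has no held-out score to choose between them. A minimal sketch of comparing the two with 5-fold cross-validation follows; the data is synthetic and the `candidates` dictionary is a hypothetical name introduced for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data with a simple linear decision rule.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

candidates = {
    'logistic_regression': LogisticRegression(),
    'svc': SVC(),
}

# Mean accuracy over 5 folds for each candidate model.
scores = {
    name: cross_val_score(estimator, X_toy, y_toy, cv=5).mean()
    for name, estimator in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The same loop could be run on `X` and `y` from this notebook to decide which model to use for the submission.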

### Make prediction and submission file

In [None]:
df1 = pd.read_csv('project2_test.csv')

In [None]:
df1.loc[df1.Age<16,'Age']=16
df1.loc[df1.Age>75,'Age']=75

In [None]:
# Fill missing work_interfere values
df1["work_interfere"]=df1["work_interfere"].fillna("Sometimes")

In [None]:
df1["self_employed"]=df1["self_employed"].fillna("Sometimes")

In [None]:
df1.isnull().sum()

Age                          0
Gender                       0
self_employed                0
family_history               0
work_interfere               0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
mental_vs_physical           0
obs_consequence              0
dtype: int64

In [None]:
# Normalize Gender into three groups, using the male / female / other lists defined earlier.
# The replacement value must be passed explicitly: when it is omitted, older pandas
# versions forward-fill the matched entries, which collapses the column to one value.
df1['Gender'] = df1['Gender'].replace(male, 'Male')
df1['Gender'] = df1['Gender'].replace(female, 'Female')
df1['Gender'] = df1['Gender'].replace(other, 'Other')

In [None]:
# Encode the test set with the same integer codes used for the training set.
yes_no        = {'No': 0, 'Yes': 1}
yes_no_unsure = {'No': 0, 'Yes': 1, "Don't know": 2}
yes_no_maybe  = {'No': 0, 'Yes': 1, 'Maybe': 2}
yes_no_some   = {'No': 0, 'Yes': 1, 'Some of them': 2}

encodings = {
    'Gender': {'Female': 0, 'Male': 1, 'Other': 2},
    'self_employed': {'No': 0, 'Yes': 1, 'Sometimes': 2},
    'family_history': yes_no,
    'work_interfere': {'Never': 0, 'Often': 1, 'Rarely': 2, 'Sometimes': 3},
    'no_employees': {'1-5': 0, '6-25': 1, '26-100': 2, '100-500': 3,
                     '500-1000': 4, 'More than 1000': 5},
    'remote_work': yes_no,
    'tech_company': yes_no,
    'benefits': yes_no_unsure,
    'care_options': {'No': 0, 'Yes': 1, 'Not sure': 2},
    'wellness_program': yes_no_unsure,
    'seek_help': yes_no_unsure,
    'anonymity': yes_no_unsure,
    'leave': {'Somewhat difficult': 0, 'Somewhat easy': 1, "Don't know": 2,
              'Very easy': 3, 'Very difficult': 4},
    'mental_health_consequence': yes_no_maybe,
    'phys_health_consequence': yes_no_maybe,
    'coworkers': yes_no_some,
    'supervisor': yes_no_some,
    'mental_health_interview': yes_no_maybe,
    'phys_health_interview': yes_no_maybe,
    'mental_vs_physical': yes_no_unsure,
    'obs_consequence': yes_no,
}

# Series.map replaces the chained assignment df1[col][mask] = value, which pandas
# may apply to a temporary copy. Answers missing from a mapping become NaN.
for col, mapping in encodings.items():
    df1[col] = df1[col].map(mapping)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Treatment'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(df1))]
df_submit['Treatment'] = s.predict(df1)

In [None]:
df_submit.to_csv('submission_1.csv', index=False)
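
The `Id` column is written as zero-padded strings such as `000`. When the file is read back without a dtype hint, pandas parses those as integers and the padding is lost. A minimal round-trip sketch, using an in-memory buffer and made-up predictions (`preds` is an assumption):

```python
import io
import pandas as pd

# Hypothetical predictions standing in for the model output.
preds = [1, 0, 1]
submit = pd.DataFrame({
    'Id': [f'{i:03d}' for i in range(len(preds))],
    'Treatment': preds,
})

# Write to an in-memory buffer instead of a file for this sketch.
buffer = io.StringIO()
submit.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# dtype={'Id': str} keeps the leading zeros when reading the file back.
reloaded = pd.read_csv(io.StringIO(csv_text), dtype={'Id': str})

print(reloaded['Id'].tolist())
```

Checking the written file this way is a quick guard against a submission being rejected for malformed identifiers.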