The dataset (https://drive.google.com/drive/folders/1XNqNsnC0tIwdRn4FEaP71LmstbvgZSJ2?usp=sharing) contains demographic and medical data for a set of patients.
It has multiple features, including ID (a unique identifier for each patient), sex (the patient's gender), age (the patient's age), and a number of other categorical and numerical variables, some of which need to be cleaned. Use the training data to build a logistic regression model that predicts whether a patient has the disease (target=1) or not (target=0). Applying the model to the test set, which of the following is closest to your answer?

In [722]:
import numpy as np
import pandas as pd

In [723]:
test_data = pd.read_csv("medical_data/heart_test_data.csv")
train_data = pd.read_csv("medical_data/heart_training_data.csv")

In [724]:
print("Length of train data: ", len(train_data))
print("Length of test data: ", len(test_data))

Length of train data:  219
Length of test data:  94


In [725]:
train_data

Unnamed: 0,age,sex,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,ID,target
0,58.0,,2.0,105.0,240.0,0,0.0,154.0,1,0.6,1.0,0.0,253,1
1,60.0,0,3.0,150.0,240.0,0,1.0,171.0,0,0.9,2.0,0.0,333,1
2,67.0,0,0.0,106.0,223.0,,1.0,142.0,0,0.3,2.0,2.0,496,1
3,29.0,m,1.0,130.0,204.0,0,0.0,202.0,0,0.0,2.0,0.0,446,1
4,35.0,1,0.0,126.0,282.0,0,0.0,156.0,1,0.0,2.0,0.0,289,0
5,50.0,0,2.0,120.0,219.0,0,1.0,,0,1.6,1.0,0.0,391,1
6,54.0,,0.0,140.0,239.0,0,1.0,160.0,0,,2.0,0.0,956,1
7,63.0,1,0.0,130.0,,0,0.0,147.0,0,1.4,1.0,1.0,392,0
8,43.0,,0.0,110.0,211.0,0,1.0,161.0,,0.0,2.0,0.0,619,1
9,52.0,1,0.0,108.0,233.0,1,1.0,147.0,0,0.1,2.0,3.0,940,1


In [726]:
test_data

Unnamed: 0,age,sex,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,ID
0,44.0,1,2.0,140.0,235.0,0,0.0,180.0,0,0.0,2.0,0.0,531
1,57.0,1,2.0,150.0,168.0,0,1.0,174.0,0,1.6,2.0,0.0,636
2,59.0,1,0.0,110.0,239.0,0,0.0,142.0,,1.2,1.0,1.0,884
3,64.0,,0.0,128.0,263.0,0,1.0,105.0,1,0.2,1.0,1.0,266
4,71.0,0,2.0,110.0,265.0,,0.0,130.0,,0.0,2.0,1.0,900
5,54.0,1,2.0,125.0,,0,0.0,152.0,0,0.5,0.0,1.0,410
6,59.0,1,0.0,,176.0,1,0.0,90.0,0,1.0,1.0,2.0,141
7,60.0,1,0.0,130.0,253.0,0,1.0,144.0,1,1.4,2.0,1.0,261
8,63.0,0,0.0,124.0,197.0,0,1.0,136.0,1,0.0,1.0,0.0,831
9,57.0,1,1.0,124.0,261.0,0,1.0,141.0,0,0.3,2.0,0.0,552


In [727]:
# concatenate both train and test
data = pd.concat([train_data, test_data], axis = 0, sort = False)

In [728]:
data = data.set_index('ID')

# Data Preprocessing

## Dealing with Missing data


In [729]:
data.describe()

Unnamed: 0,age,feature1,feature2,feature3,feature5,feature6,feature8,feature9,feature10,target
count,303.0,304.0,302.0,303.0,302.0,302.0,303.0,303.0,304.0,219.0
mean,54.227723,0.976974,131.890728,245.135314,0.536424,150.119205,1.071947,1.39934,0.733553,0.543379
std,9.226956,1.029016,17.824299,51.829364,0.525431,23.173455,1.184547,0.621576,1.026735,0.499256
min,29.0,0.0,94.0,126.0,0.0,71.0,0.0,0.0,0.0,0.0
25%,47.0,0.0,120.0,210.5,0.0,136.0,0.0,1.0,0.0,0.0
50%,55.0,1.0,130.0,240.0,1.0,153.5,0.8,1.0,0.0,1.0
75%,61.0,2.0,140.0,274.0,1.0,168.0,1.6,2.0,1.0,1.0
max,77.0,3.0,200.0,564.0,2.0,202.0,6.2,2.0,4.0,1.0


In [730]:
# check number of missing values in each column
data.isnull().sum()

age          10
sex          11
feature1      9
feature2     11
feature3     10
feature4      9
feature5     11
feature6     11
feature7     10
feature8     10
feature9     10
feature10     9
target       94
dtype: int64

In [731]:
# count the rows that remain after dropping rows where all columns are NaN
len(data.dropna(how='all'))

313

In [732]:
data.groupby('sex').count()

sex,age,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,target
0,93,94,93,94,90,92,92,91,91,90,95,69
1,190,191,191,190,194,190,191,194,193,193,189,136
f,1,1,1,1,1,1,1,1,1,1,1,1
female,1,1,1,1,1,1,1,1,1,1,1,1
m,4,3,4,4,4,4,4,4,4,4,4,2
male,3,3,1,3,3,3,2,2,3,3,3,2


In [733]:
type(data.sex)

pandas.core.series.Series

In [734]:
#change where sex is not available to "unknown"
#train_data['sex']= train_data['sex'].fillna('unknown')
#train_data['sex'] = train_data['sex'].astype(str)

In [735]:
data['age'][data['age'] < 30]

ID
446    29.0
Name: age, dtype: float64

In [736]:
# Assumption: male is encoded as 1 and female as 0.
# Series.replace avoids the chained assignment that triggers SettingWithCopyWarning.
data['sex'] = data['sex'].replace({'f': '0', 'female': '0', 'm': '1', 'male': '1'})


In [737]:
data.groupby('sex').count()

sex,age,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,target
0,95,96,95,96,92,94,94,93,93,92,97,71
1,197,197,196,197,201,197,197,200,200,200,196,140


In [738]:
data['age'][data['age'] < 30]

ID
446    29.0
Name: age, dtype: float64

In [739]:
data['age'].isna().sum()

10

In [740]:
data['age'] = data['age'].fillna(-1)
data['age'] = data['age'].astype(int)

In [741]:
bins= [20, 30, 40, 50, 60, 70, 80]
labels = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-80']
data['age_group'] = pd.cut(data.age, bins, labels = labels,right = False)

In [742]:
data['age_group'] = data['age_group'].cat.add_categories('unknown').fillna('unknown')


In [743]:
data.groupby('age_group').count()

age_group,age,sex,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,target
20-29,1,1,1,1,1,1,1,1,1,1,1,1,1
30-39,17,17,16,17,16,17,17,17,16,15,16,16,14
40-49,73,70,72,67,73,72,71,69,71,72,71,69,51
50-59,124,119,121,121,117,121,121,120,121,118,120,122,86
60-69,78,75,77,77,76,74,74,75,75,77,75,76,56
70-80,10,10,8,9,10,9,10,10,9,10,10,10,5
unknown,10,10,9,10,10,10,8,10,10,10,10,10,6
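
The binning step above can be sketched on a few toy ages. This is a minimal illustration, not the notebook's data: the `ages` values are made up, and `-1` plays the role of the missing-age sentinel used earlier. With `right=False` each interval is closed on the left, so 30 falls into `30-39`, and any value outside all bins becomes NaN before it is relabelled.

```python
import pandas as pd

# Toy ages (made up); -1 stands in for a missing age, as in the cells above.
ages = pd.Series([29, 30, 59, 60, 79, -1], name="age")

bins = [20, 30, 40, 50, 60, 70, 80]
labels = ["20-29", "30-39", "40-49", "50-59", "60-69", "70-80"]

# right=False makes each interval closed on the left: [20, 30), [30, 40), ...
age_group = pd.cut(ages, bins, labels=labels, right=False)

# Values outside every bin (here the -1 sentinel) are NaN at this point,
# so they can be relabelled as an explicit 'unknown' category.
age_group = age_group.cat.add_categories("unknown").fillna("unknown")

print(age_group.tolist())
# ['20-29', '30-39', '50-59', '60-69', '70-80', 'unknown']
```

One consequence of `right=False` is that an age of exactly 80 would fall outside the last bin, despite the `70-80` label. The maximum age in this dataset is 77, so it does not matter here.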


In [744]:
data['sex'].isna().sum()

11

In [745]:
data['age'].isna().sum()

0

In [746]:
# Series.replace avoids the chained assignment that triggers SettingWithCopyWarning
data['age'] = data['age'].replace(-1, np.nan)


In [747]:
data.age.isna().sum()

10

In [748]:
data.age.dtype

dtype('float64')

In [749]:
data.groupby(['age_group', 'sex']).count()

age_group,sex,age,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,target
20-29,0,,,,,,,,,,,,
20-29,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
30-39,0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,5.0,5.0,6.0,4.0
30-39,1,11.0,10.0,11.0,10.0,11.0,11.0,11.0,10.0,10.0,11.0,10.0,10.0
40-49,0,18.0,18.0,16.0,18.0,18.0,17.0,18.0,18.0,17.0,17.0,18.0,14.0
40-49,1,52.0,51.0,48.0,52.0,51.0,51.0,48.0,51.0,52.0,51.0,48.0,35.0
50-59,0,34.0,34.0,34.0,33.0,33.0,33.0,31.0,33.0,33.0,32.0,34.0,25.0
50-59,1,85.0,82.0,82.0,80.0,83.0,83.0,84.0,83.0,81.0,83.0,83.0,57.0
60-69,0,32.0,32.0,32.0,32.0,29.0,32.0,32.0,30.0,31.0,31.0,32.0,25.0
60-69,1,43.0,42.0,42.0,41.0,42.0,39.0,40.0,42.0,43.0,41.0,41.0,29.0


In [750]:
data

ID,age,sex,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,target,age_group
253,58.0,,2.0,105.0,240.0,0,0.0,154.0,1,0.6,1.0,0.0,1.0,50-59
333,60.0,0,3.0,150.0,240.0,0,1.0,171.0,0,0.9,2.0,0.0,1.0,60-69
496,67.0,0,0.0,106.0,223.0,,1.0,142.0,0,0.3,2.0,2.0,1.0,60-69
446,29.0,1,1.0,130.0,204.0,0,0.0,202.0,0,0.0,2.0,0.0,1.0,20-29
289,35.0,1,0.0,126.0,282.0,0,0.0,156.0,1,0.0,2.0,0.0,0.0,30-39
391,50.0,0,2.0,120.0,219.0,0,1.0,,0,1.6,1.0,0.0,1.0,50-59
956,54.0,,0.0,140.0,239.0,0,1.0,160.0,0,,2.0,0.0,1.0,50-59
392,63.0,1,0.0,130.0,,0,0.0,147.0,0,1.4,1.0,1.0,0.0,60-69
619,43.0,,0.0,110.0,211.0,0,1.0,161.0,,0.0,2.0,0.0,1.0,40-49
940,52.0,1,0.0,108.0,233.0,1,1.0,147.0,0,0.1,2.0,3.0,1.0,50-59


In [751]:
data['sex'] = data.groupby('age_group')['sex'].apply(lambda x:x.fillna(x.value_counts().index.tolist()[0]))

In [752]:
data.groupby('sex').count()

sex,age,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,target,age_group
0,95,96,95,96,92,94,94,93,93,92,97,71,97
1,208,208,207,207,212,208,208,210,210,211,207,148,216


In [753]:
data['sex'] = data['sex'].astype(float)

In [754]:
data.groupby('sex')['age'].agg(np.mean)

sex
0.0    55.515789
1.0    53.639423
Name: age, dtype: float64

In [755]:
data['age']=data.groupby('sex')['age'].apply(lambda x:x.fillna(x.mean()).astype(int))

In [756]:
data.groupby('age').count()

age,sex,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,target,age_group
29,1,1,1,1,1,1,1,1,1,1,1,1,1
34,2,2,2,1,2,2,2,2,2,2,2,1,2
35,6,6,6,6,6,6,6,6,6,6,6,5,6
37,2,2,2,2,2,2,2,1,1,2,2,2,2
38,3,2,3,3,3,3,3,3,2,3,2,3,3
39,4,4,4,4,4,4,4,4,4,3,4,3,4
40,3,3,2,3,3,3,3,3,3,3,2,1,3
41,10,10,10,10,10,10,10,10,10,10,10,7,10
42,9,9,7,9,9,9,7,9,9,8,9,8,9
43,8,7,8,8,7,8,8,7,8,8,8,8,8


In [757]:
bins= [20, 30, 40, 50, 60, 70, 80]
labels = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-80']
data['age_group'] = pd.cut(data.age, bins, labels = labels,right = False)

In [758]:
data.groupby(['age_group']).count()

age_group,age,sex,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,target
20-29,1,1,1,1,1,1,1,1,1,1,1,1,1
30-39,17,17,16,17,16,17,17,17,16,15,16,16,14
40-49,73,73,72,67,73,72,71,69,71,72,71,69,51
50-59,134,134,130,131,127,131,129,130,131,128,130,132,92
60-69,78,78,77,77,76,74,74,75,75,77,75,76,56
70-80,10,10,8,9,10,9,10,10,9,10,10,10,5


In [759]:
data.isnull().sum()

age           0
sex           0
feature1      9
feature2     11
feature3     10
feature4      9
feature5     11
feature6     11
feature7     10
feature8     10
feature9     10
feature10     9
target       94
age_group     0
dtype: int64

In [760]:
data.iloc[:, 2:12]

ID,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10
253,2.0,105.0,240.0,0,0.0,154.0,1,0.6,1.0,0.0
333,3.0,150.0,240.0,0,1.0,171.0,0,0.9,2.0,0.0
496,0.0,106.0,223.0,,1.0,142.0,0,0.3,2.0,2.0
446,1.0,130.0,204.0,0,0.0,202.0,0,0.0,2.0,0.0
289,0.0,126.0,282.0,0,0.0,156.0,1,0.0,2.0,0.0
391,2.0,120.0,219.0,0,1.0,,0,1.6,1.0,0.0
956,0.0,140.0,239.0,0,1.0,160.0,0,,2.0,0.0
392,0.0,130.0,,0,0.0,147.0,0,1.4,1.0,1.0
619,0.0,110.0,211.0,0,1.0,161.0,,0.0,2.0,0.0
940,0.0,108.0,233.0,1,1.0,147.0,0,0.1,2.0,3.0


### Feature 1

In [761]:
data.feature1.value_counts()

0.0    142
2.0     91
1.0     49
3.0     22
Name: feature1, dtype: int64

In [762]:
# fill NaN with the most frequent value in each (sex, age_group) group
data['feature1'] = data.groupby(['sex', 'age_group'])['feature1'].apply(lambda x:x.fillna(x.value_counts().index.tolist()[0]))

In [763]:
data.feature1.value_counts()

0.0    150
2.0     92
1.0     49
3.0     22
Name: feature1, dtype: int64
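
The group-wise "most frequent value" fill used for the categorical features can also be written with `transform`, which keeps the original row index regardless of pandas version. The sketch below is an alternative under stated assumptions, not the notebook's exact method: the `toy` frame and the helper name `fill_with_mode` are hypothetical, and it groups by `sex` only to stay short.

```python
import numpy as np
import pandas as pd

# Toy stand-in (made up) for one categorical feature with gaps.
toy = pd.DataFrame({
    "sex": [0, 0, 0, 1, 1, 1],
    "feature1": [2.0, 2.0, np.nan, 0.0, np.nan, 0.0],
})

def fill_with_mode(s):
    """Fill NaN in one group with its most frequent value, if it has one."""
    mode = s.mode()  # sorted, so ties resolve to the smallest value
    return s.fillna(mode.iloc[0]) if not mode.empty else s

# transform returns a Series aligned to the original index.
toy["feature1"] = toy.groupby("sex")["feature1"].transform(fill_with_mode)

print(toy["feature1"].tolist())
# [2.0, 2.0, 2.0, 0.0, 0.0, 0.0]
```

A group in which every value is missing is left untouched here, whereas `x.value_counts().index.tolist()[0]` would raise an `IndexError` for such a group.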

### Feature 2

In [764]:
data.feature2.value_counts()

130.0    36
120.0    36
140.0    32
150.0    18
110.0    18
128.0    12
138.0    12
125.0    11
160.0    10
112.0     9
118.0     8
132.0     8
124.0     6
135.0     6
108.0     6
134.0     5
145.0     5
152.0     5
122.0     4
170.0     4
126.0     4
100.0     4
136.0     4
180.0     4
115.0     3
142.0     3
178.0     2
105.0     2
146.0     2
144.0     2
102.0     2
148.0     2
192.0     2
94.0      2
106.0     1
104.0     1
129.0     1
101.0     1
155.0     1
174.0     1
114.0     1
154.0     1
172.0     1
200.0     1
165.0     1
123.0     1
117.0     1
Name: feature2, dtype: int64

In [765]:
# fill NaN with the mean of each (sex, age_group) group; astype(int) truncates the filled value
data['feature2']=data.groupby(['sex', 'age_group'])['feature2'].apply(lambda x:x.fillna(x.mean()).astype(int))
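
For the numerical features, the group mean can be broadcast back to every row with `transform("mean")` and passed to `fillna`. The sketch below is a variation rather than the cell above: it rounds the filled value instead of truncating it, the `toy` values are made up, and it groups by `sex` only.

```python
import numpy as np
import pandas as pd

# Toy stand-in (made up) for an integer-valued measurement with gaps.
toy = pd.DataFrame({
    "sex": [0, 0, 0, 0, 1, 1, 1],
    "feature2": [120.0, 130.0, 127.0, np.nan, 140.0, np.nan, 150.0],
})

# Mean of each group, repeated for every row of that group.
group_mean = toy.groupby("sex")["feature2"].transform("mean")

# Round before casting; astype(int) alone would turn 125.67 into 125.
toy["feature2"] = toy["feature2"].fillna(group_mean).round().astype(int)

print(toy["feature2"].tolist())
# [120, 130, 127, 126, 140, 145, 150]
```

Rounding only makes sense for columns whose observed values are whole numbers. For a column with decimals, dropping the integer cast altogether keeps the original precision.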

### Feature 3

In [766]:
data.feature3.value_counts()

234.0    6
204.0    6
197.0    6
240.0    5
212.0    5
282.0    5
269.0    5
226.0    4
243.0    4
274.0    4
256.0    4
177.0    4
283.0    4
254.0    4
239.0    4
233.0    4
211.0    4
229.0    3
203.0    3
250.0    3
244.0    3
149.0    3
236.0    3
245.0    3
175.0    3
219.0    3
258.0    3
199.0    3
231.0    3
230.0    3
        ..
311.0    1
305.0    1
319.0    1
281.0    1
210.0    1
224.0    1
178.0    1
221.0    1
241.0    1
166.0    1
157.0    1
186.0    1
278.0    1
259.0    1
360.0    1
160.0    1
342.0    1
141.0    1
306.0    1
252.0    1
172.0    1
200.0    1
180.0    1
195.0    1
176.0    1
131.0    1
174.0    1
169.0    1
167.0    1
300.0    1
Name: feature3, Length: 149, dtype: int64

In [767]:
data['feature3']=data.groupby(['sex', 'age_group'])['feature3'].apply(lambda x:x.fillna(x.mean()).astype(int))

### Feature 4

In [768]:
data.feature4.value_counts()

0        247
1         46
false      4
FALSE      3
f          2
False      1
TRUE       1
Name: feature4, dtype: int64

In [716]:
#data.feature4  =data.feature4.astype(str)

In [769]:
# Map the boolean-like strings onto '0'/'1'. Series.replace avoids the chained
# assignment that triggers SettingWithCopyWarning.
data['feature4'] = data['feature4'].replace(
    {'f': '0', 'false': '0', 'FALSE': '0', 'False': '0', 'TRUE': '1'})

In [770]:
data.feature4.value_counts()

0    257
1     47
Name: feature4, dtype: int64
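
Listing every spelling by hand works for the values seen so far, but a case-insensitive lookup is less likely to miss a variant. The following is a sketch only: `TOKEN_MAP` is a hypothetical table that covers the spellings observed in this notebook plus a few obvious siblings, and the `raw` series is made up.

```python
import numpy as np
import pandas as pd

# Made-up sample of the spellings seen in feature4, plus one missing value.
raw = pd.Series(["0", "1", "false", "FALSE", "f", "False", "TRUE", np.nan])

# Hypothetical lookup table: lower-cased token -> '0' / '1'.
TOKEN_MAP = {
    "0": "0", "f": "0", "false": "0", "n": "0", "no": "0",
    "1": "1", "t": "1", "true": "1", "y": "1", "yes": "1",
}

# Normalise case and whitespace first, then translate.
# Tokens missing from the table come out as NaN, as do missing inputs.
cleaned = raw.str.strip().str.lower().map(TOKEN_MAP)

print(cleaned.tolist())
# ['0', '1', '0', '0', '0', '0', '1', nan]
```

Because unknown tokens turn into NaN, comparing `cleaned.isna().sum()` with `raw.isna().sum()` is a quick check that no spelling slipped through.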

In [595]:
#train_data.feature4 = train_data.feature4.astype(float)

In [771]:
data['feature4'] = data.groupby(['sex', 'age_group'])['feature4'].apply(lambda x:x.fillna(x.value_counts().index.tolist()[0]))

In [773]:
data.feature4.value_counts()

0    266
1     47
Name: feature4, dtype: int64

### Feature 5


In [774]:
data.feature5.value_counts()

1.0    154
0.0    144
2.0      4
Name: feature5, dtype: int64

In [775]:
data['feature5'] = data.groupby(['sex', 'age_group'])['feature5'].apply(lambda x:x.fillna(x.value_counts().index.tolist()[0]))

In [776]:
data.feature5.value_counts()

1.0    159
0.0    150
2.0      4
Name: feature5, dtype: int64

### Feature 6

In [777]:
data.feature6.value_counts()

162.0    11
160.0     9
163.0     8
152.0     8
150.0     8
173.0     8
144.0     7
156.0     7
172.0     7
143.0     7
182.0     6
169.0     6
174.0     6
142.0     6
125.0     6
140.0     6
168.0     5
126.0     5
178.0     5
132.0     5
179.0     5
157.0     5
154.0     5
170.0     5
158.0     5
161.0     5
165.0     5
147.0     5
130.0     4
159.0     4
         ..
194.0     2
116.0     2
96.0      2
202.0     1
192.0     1
134.0     1
185.0     1
99.0      1
95.0      1
90.0      1
106.0     1
112.0     1
88.0      1
129.0     1
184.0     1
137.0     1
187.0     1
118.0     1
127.0     1
97.0      1
167.0     1
190.0     1
117.0     1
71.0      1
177.0     1
121.0     1
139.0     1
113.0     1
124.0     1
128.0     1
Name: feature6, Length: 90, dtype: int64

In [778]:
data['feature6']=data.groupby(['sex', 'age_group'])['feature6'].apply(lambda x:x.fillna(x.mean()).astype(int))

### Feature 7

In [779]:
data.feature7.value_counts()

0      196
1       97
Yes      3
N        2
no       2
Y        2
No       1
Name: feature7, dtype: int64

In [783]:
# Map the yes/no strings onto '1'/'0'. Series.replace avoids the chained
# assignment that triggers SettingWithCopyWarning.
data['feature7'] = data['feature7'].replace(
    {'Yes': '1', 'Y': '1', 'N': '0', 'no': '0', 'No': '0'})

In [784]:
data['feature7'] = data.groupby(['sex', 'age_group'])['feature7'].apply(lambda x:x.fillna(x.value_counts().index.tolist()[0]))

In [785]:
data.feature7.value_counts()

0    211
1    102
Name: feature7, dtype: int64

### Feature 8


In [786]:
data.feature8.value_counts()

0.0    96
1.2    16
0.8    15
1.0    14
0.6    14
1.4    14
1.6    13
1.8    11
0.2    11
2.0     8
0.4     8
0.1     7
2.8     6
2.6     6
1.5     5
3.0     5
0.5     5
1.9     5
2.2     4
3.6     4
3.4     3
0.3     3
4.0     3
0.9     3
2.4     3
1.1     2
5.6     2
2.5     2
2.3     2
4.2     2
3.2     2
4.4     1
3.5     1
1.3     1
2.1     1
3.1     1
3.8     1
0.7     1
2.9     1
6.2     1
Name: feature8, dtype: int64

In [787]:
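# NOTE: astype(int) also truncates the observed decimal values (e.g. 1.6 becomes 1), not only the filled ones
data['feature8']=data.groupby(['sex', 'age_group'])['feature8'].apply(lambda x:x.fillna(x.mean()).astype(int))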

### Feature 9

In [788]:
data.feature9.value_counts()

2.0    143
1.0    138
0.0     22
Name: feature9, dtype: int64

In [789]:
data['feature9'] = data.groupby(['sex', 'age_group'])['feature9'].apply(lambda x:x.fillna(x.value_counts().index.tolist()[0]))

In [790]:
data.feature9.value_counts()

2.0    147
1.0    144
0.0     22
Name: feature9, dtype: int64

### Feature 10

In [791]:
data.feature10.value_counts()

0.0    175
1.0     66
2.0     37
3.0     21
4.0      5
Name: feature10, dtype: int64

In [792]:
data['feature10'] = data.groupby(['sex', 'age_group'])['feature10'].apply(lambda x:x.fillna(x.value_counts().index.tolist()[0]))

In [793]:
data.feature10.value_counts()

0.0    182
1.0     68
2.0     37
3.0     21
4.0      5
Name: feature10, dtype: int64

In [794]:
data.isnull().sum()

age           0
sex           0
feature1      0
feature2      0
feature3      0
feature4      0
feature5      0
feature6      0
feature7      0
feature8      0
feature9      0
feature10     0
target       94
age_group     0
dtype: int64




# Modelling Logistic Regression

In [None]:
pd.get_dummies(data['age_group'], prefix='age_group')

In [802]:
# use pd.concat to join the new columns with your original dataframe
df = pd.concat([data,pd.get_dummies(data['age_group'], prefix='age_group')],axis=1)

# now drop the original 'age_group' column (it is no longer needed)
df.drop(['age_group'],axis=1, inplace=True)
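
The concat-then-drop pattern above can be collapsed into a single `pd.get_dummies` call by passing the `columns` argument. The sketch below shows this on a made-up three-row frame; it is an equivalent formulation under that assumption, not a change to the notebook's result.

```python
import pandas as pd

# Toy frame (made up) with one categorical column to encode.
toy = pd.DataFrame({
    "age": [29, 45, 63],
    "age_group": pd.Categorical(["20-29", "40-49", "60-69"]),
})

# Encodes 'age_group', appends the dummy columns and drops the original.
encoded = pd.get_dummies(toy, columns=["age_group"], prefix="age_group")
dummy_cols = [c for c in encoded.columns if c.startswith("age_group_")]

print(encoded.columns.tolist())
# ['age', 'age_group_20-29', 'age_group_40-49', 'age_group_60-69']
```

Each row has exactly one dummy set, so the dummy columns always sum to one. Together with an intercept that makes them collinear, which the default L2-penalised logistic regression tolerates; `drop_first=True` is the usual way to remove the redundancy if it matters.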

In [803]:
# split data into train and test
len_train = len(train_data)
train_df = df[:len_train]
test_df = df[len_train:]
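
Splitting by position relies on the row order of the concatenated frame never changing. Since only the test rows lack a `target` (94 missing values, matching the test set length), the split can also be made on that column. This is a sketch on a made-up four-row frame; `train_part` and `test_part` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in (made up) for the combined frame: test rows have no target.
df = pd.DataFrame(
    {"age": [58, 60, 44, 57], "target": [1.0, 0.0, np.nan, np.nan]},
    index=pd.Index([253, 333, 531, 636], name="ID"),
)

is_train = df["target"].notna()
train_part = df[is_train]
# The test rows carry no label, so the empty target column is dropped.
test_part = df[~is_train].drop(columns="target")

print(len(train_part), len(test_part))
# 2 2
```

Dropping `target` from the test part at this point also guarantees that the frame later passed to `predict` has the same columns as the training features.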

In [825]:
# Run this cell before 'target' is dropped from train_df below;
# re-running it afterwards raises KeyError: 'target'
target = train_df['target']

In [823]:
train_df = train_df.drop('target', axis = 1)

In [828]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train_df, target , test_size=0.20, random_state=0)

In [830]:
from sklearn.linear_model import LogisticRegression

In [831]:
# all parameters not specified are set to their defaults
logisticRegr = LogisticRegression()

In [832]:
logisticRegr.fit(x_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [833]:
predictions = logisticRegr.predict(x_test)

In [834]:
score = logisticRegr.score(x_test, y_test)
print(score)

0.8636363636363636
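
The score above comes from a single hold-out of 44 rows (20% of 219), so it can shift noticeably with a different `random_state`. Cross-validation gives a steadier estimate. The sketch below runs on synthetic data from `make_classification`, not on the patient data, and adds a `StandardScaler` step that the notebook does not use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of rows as the training set.
X, y = make_classification(n_samples=219, n_features=12, random_state=0)

# Scaling inside the pipeline is refit on each training fold,
# so no information leaks from the validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

With the real data, `X` and `y` would be `train_df` and `target`; the spread of the five scores indicates how much weight a single hold-out figure deserves.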


# Predicting on the Test data

In [836]:
test_df

ID,age,sex,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,age_group_20-29,age_group_30-39,age_group_40-49,age_group_50-59,age_group_60-69,age_group_70-80
531,44,1.0,2.0,140,235,0,0.0,180,0,0,2.0,0.0,0,0,1,0,0,0
636,57,1.0,2.0,150,168,0,1.0,174,0,1,2.0,0.0,0,0,0,1,0,0
884,59,1.0,0.0,110,239,0,0.0,142,0,1,1.0,1.0,0,0,0,1,0,0
266,64,1.0,0.0,128,263,0,1.0,105,1,0,1.0,1.0,0,0,0,0,1,0
900,71,0.0,2.0,110,265,0,0.0,130,0,0,2.0,1.0,0,0,0,0,0,1
410,54,1.0,2.0,125,235,0,0.0,152,0,0,0.0,1.0,0,0,0,1,0,0
141,59,1.0,0.0,132,176,1,0.0,90,0,1,1.0,2.0,0,0,0,1,0,0
261,60,1.0,0.0,130,253,0,1.0,144,1,1,2.0,1.0,0,0,0,0,1,0
831,63,0.0,0.0,124,197,0,1.0,136,1,0,1.0,0.0,0,0,0,0,1,0
552,57,1.0,1.0,124,261,0,1.0,141,0,0,2.0,0.0,0,0,0,1,0,0


In [856]:
ID = pd.DataFrame(test_df.index, columns = ['ID'])

In [841]:
predictions = logisticRegr.predict(test_df)

preds = pd.DataFrame(predictions, columns=['predictions']) 

In [862]:
final = pd.concat([ID, preds], axis = 1).set_index('ID')

In [869]:
final.loc[946]

predictions    1.0
Name: 946, dtype: float64

In [870]:
final.loc[637]

predictions    1.0
Name: 637, dtype: float64

In [871]:
final.loc[688]

predictions    1.0
Name: 688, dtype: float64

In [872]:
final.loc[831]

predictions    1.0
Name: 831, dtype: float64

In [875]:
final['predictions'].value_counts()

1.0    54
0.0    40
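
The ID lookup table can also be built in one step by attaching the test frame's index directly to the predictions. In this sketch `test_part` is a made-up three-row frame and `fake_predictions` stands in for the output of `logisticRegr.predict`; both are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy test frame (made up), indexed by patient ID.
test_part = pd.DataFrame(
    {"age": [44, 57, 59]},
    index=pd.Index([531, 636, 884], name="ID"),
)
# Made-up predictions standing in for logisticRegr.predict(test_part).
fake_predictions = np.array([1.0, 0.0, 1.0])

# Reusing the index keeps each prediction tied to its patient ID.
final = pd.Series(fake_predictions, index=test_part.index, name="predictions")

print(final.loc[636])
# 0.0
print(final.value_counts(normalize=True))
```

`value_counts(normalize=True)` reports the share of each predicted class rather than the raw counts, which is convenient when the question asks for a proportion.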
Name: predictions, dtype: int64