-
Notifications
You must be signed in to change notification settings - Fork 0
/
w6p2.py
479 lines (344 loc) · 17 KB
/
w6p2.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
# coding: utf-8
# If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/lcdm-uiuc/info490-sp17/blob/master/help/act_assign_tab.md).
#
# A few things you should keep in mind when working on assignments:
#
# 1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer in anywhere else other than where it says `YOUR CODE HERE`. Anything you write anywhere else will be removed or overwritten by the autograder.
#
# 2. Before you submit your assignment, make sure everything runs as expected. Go to menubar, select _Kernel_, and restart the kernel and run all cells (_Restart & Run all_).
#
# 3. Do not change the title (i.e. file name) of this notebook.
#
# 4. Make sure that you save your work (in the menubar, select _File_ → _Save and CheckPoint_)
#
# 5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.
# # Problem 6.2. Anomaly Detection
#
# In this problem, we will use machine learning techniques for anomaly detection to search for outliers in the data that pertains to the presence or absence of a vertebal / backbone disorder in a patient.
# In[1]:
get_ipython().magic('matplotlib inline')
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn import svm
from sklearn.utils import check_random_state
from nose.tools import assert_equal, assert_almost_equal, assert_is_instance, assert_is_not, assert_in
from numpy.testing import assert_array_equal, assert_array_almost_equal
from pandas.util.testing import assert_frame_equal
import warnings
warnings.filterwarnings("ignore")
# ## Vertebral Column Data set
# We shall be looking at the vertebral column data set for this problem. This data set has been taken from the [UCI Machine Learning repository](http://archive.ics.uci.edu/ml/datasets/Vertebral+Column). The [vertebral column](https://en.wikipedia.org/wiki/Vertebral_column) or the backbone in humans consists of thirty three vertebrae that provide support to the body, but most importantly, houses the spinal cord which is the integral part of the [central nervous system](https://en.wikipedia.org/wiki/Central_nervous_system).
# The data set consists of data from 310 patients with six biomechanical features related to the pelvis and lumbar spine, namely, pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius and grade of spondylolisthesis. Further information on these features, can be found in this [publication](https://pdfs.semanticscholar.org/a972/7a403fc0b6c9324be101295fd6a42577f3c2.pdf). In addition to these features, the data set contains information about the condition of the patient. There are 210 patients that were diagnosed with either [sponylolisthesis](https://en.wikipedia.org/wiki/Spondylolisthesis) or [disc herniation](https://en.wikipedia.org/wiki/Spinal_disc_herniation) were classified with "abnormal" vertebral condition, denoted by 'AB', while there are 100 patients who didn't show any signs of these diseases were classified as having "normal" vetebral condition, denoted by 'NO'.
# Let's load the data set into a Pandas DataFrame.
# In[2]:
vc_data = pd.read_csv('/home/data_scientist/data/misc/vert_column.csv', header=0, na_values='NA')
#vc_data = pd.read_csv('./data/vert_column.csv', header=0, na_values='NA')
print(vc_data.head())
# In[3]:
#Let's take a look at the statistical summary of our attributes
vc_data.describe()
# ## Standardization
# PCA is a scale-dependent method. For example, if the range of one column is [-100, 100], while the that of another column is [-0.1, 0.1], PCA will place more weight on the feature with larger values. One way to avoid this is to standardize a data set by scaling each feature so that the individual features all look like Gausssian distributions with zero mean and unit variance.
# For further detail, see Preprocessing data. The function scale provides a quick and easy way to perform this operation on a single array-like dataset.
# Write a function named standardize() that uses sklearn.preprocessing.StandardScaler to scale each features so that they have zero mean and unit variance.
# In[4]:
def standardize(df):
'''
Uses sklearn.preprocessing.StandardScaler to make each features look like
a Gaussian with zero mean and unit variance.
Parameters
----------
df: A pandas.DataFrame
Returns
-------
A numpy array.
'''
#Creates scale for df features to look Gaussian with zero mean and unit variance
scaler= StandardScaler().fit(df)
#Transforms df to the scale
scaled=scaler.transform(df)
return scaled
# In[5]:
scaled = standardize(vc_data[vc_data.columns[:6]])
# In[6]:
rng = np.random.RandomState(0)
n_samples, n_features = 4, 5
df_t1 = pd.DataFrame(
rng.randn(n_samples, n_features),
index=[i for i in 'abcd'],
columns=[c for c in 'abcde']
)
df_t1.loc[:, 'a'] = 0.0 # make first feature zero
scaled_t1 = standardize(df_t1)
assert_is_not(df_t1, scaled_t1)
assert_is_instance(scaled_t1, np.ndarray)
assert_array_almost_equal(
scaled_t1.mean(axis=0),
n_features * [0.0] # scaled data should have mean zero
)
assert_array_almost_equal(
scaled_t1.std(axis=0),
[0., 1., 1., 1., 1.] # unit variance except for 1st feature
)
#
# ## Principal Components Analysis
# Write a function named fit_pca() that takes a pandas.DataFrame and uses sklearn.decomposition.PCA to fit a PCA model on all values of df.
# In[7]:
def fit_pca(df, n_components):
'''
Uses sklearn.decomposition.PCA to fit a PCA model on "df" via fit using fit_transform()
Parameters
----------
df: A pandas.DataFrame. Comes from delta.csv.
n_components: An int. Number of principal components to keep.
Returns
-------
a tuple of an sklearn.decomposition.pca.PCA instance and a numpy array containing the reduced data set
'''
#Creates model
pca=PCA(n_components)
#Reduces the df
reduced=pca.fit_transform(df)
return pca, reduced
# In[8]:
# using 4 components (randomly chosen)
n_components = 4
pca, reduced = fit_pca(scaled, n_components=n_components)
# In[9]:
assert_almost_equal(pca.explained_variance_ratio_.sum(), 0.945664604953, 3)
assert_is_instance(reduced, np.ndarray)
t_pca, t_reduced = fit_pca(scaled_t1, 5)
assert_almost_equal(t_pca.explained_variance_ratio_.sum(), 1.0, 3)
# ## Calculating gamma and nu for oneClass SVM
# Before we can create a [oneClass SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html) model, we first need to calculate the tow of the important parameters required. A rule of thumb for deciding upon the value of gamma is that it should be a value between the inverse of the total instances (number of rows) and inverse of the features. For the purposes of this assignment:
# $$ \gamma = \dfrac {1}{total number of rows} $$
# In[10]:
def get_gamma_nu(df, frac_of_outliers):
'''
Calculates the gamma and nu values to be used for the model.
Uses the inverse of the number of cases for calculating gamma.
Uses the scikit learn recommended formula for calculating nu.
Parameters
----------
df: A pandas.DataFrame containing pca data.
frac_of_outliers: A float. Expected Outlier Fraction.
Returns
-------
A tuple of gamma and nu values
'''
#Using the definitions from the notebook
gamma=1/len(df)
nu=0.95*frac_of_outliers+0.05
return gamma, nu
# In[11]:
rng = np.random.RandomState(0)
n_samples, n_features = 4, 5
df_t1 = pd.DataFrame(
rng.randn(n_samples, n_features),
index=[i for i in 'abcd'],
columns=[c for c in 'abcde']
)
df_t1.loc[:, 'a'] = 0.0 # make first feature zero
test1, test2 = get_gamma_nu(scaled,0.1)
assert_almost_equal(test1, 0.0032258064516129032, 5)
assert_almost_equal(test2, 0.14500000000000002, 3)
scaled_t1 = standardize(df_t1)
pca_t1,t_reduced = fit_pca(scaled_t1,5)
test3, test4 = get_gamma_nu(t_reduced, 0.2)
assert_almost_equal(test3, 0.25, 2)
assert_almost_equal(test4, 0.24, 2)
# ## OneClassSVM fitting and estimates
# In the following code cell, write a function named get_svc that requires four parameters: df, kernel, frac_of_outliers and random_state. It returns a sklearn.svm.OneClassSVM Classifier instance.
# In[12]:
def get_svc(df, kernel, frac_of_outliers, random_state):
'''
Takes in df, kernel and random state parameters, calculates nu & gamma using get_gamma_nu()
and returns a One Class SVM classifier
Parameters
----------
df: A pandas.DataFrame. Standardized version of data.
kernel: A string that specifies kernel to be used in SVM
random_state: A numpy.random.RandomState instance
Returns
-------
An sklearn.svm.OneClassSVM
'''
#Calling the previous function
gamma,nu=get_gamma_nu(df, frac_of_outliers)
#Creates the model
svc=svm.OneClassSVM(kernel=kernel, gamma=gamma, nu=nu, random_state=random_state)
#Fits the model
model=svc.fit(df)
return model
# In[13]:
t_ocSVM1 = get_svc(scaled_t1, 'rbf',0.1, check_random_state(0))
assert_is_instance(t_ocSVM1, svm.OneClassSVM)
assert_equal(t_ocSVM1.kernel, 'rbf')
assert_almost_equal(t_ocSVM1.gamma,0.25)
t_ocSVM2 = get_svc(scaled_t1, 'linear',0.1, check_random_state(0))
assert_is_instance(t_ocSVM2, svm.OneClassSVM)
assert_equal(t_ocSVM2.kernel, 'linear')
assert_almost_equal(t_ocSVM1.gamma,0.25)
# ## Fit and Predict
# In the following code cell, write a function named fit_and_pred that makes classification predictions on the standardized version of the dataset, other than taking in the classifier instance as a parameter.
#
# In[14]:
def fit_and_pred(svc, df):
'''
Fits the "svc" model on the standardized dataframe "df"
Makes predictions on "df".
Parameters
----------
ocSVM: A sklearn classifier instance.
df: A pandas.DataFrame. Standardized version of data.
Returns
-------
A numpy array
'''
#Creates the fit for the predict
svc.fit(df)
#Makes a prediction based off of the model
preds=svc.predict(df)
return preds
# In[15]:
t_svc = get_svc(scaled_t1, 'rbf',0.1, check_random_state(0))
test_pred = fit_and_pred(t_svc, scaled_t1)
assert_array_equal(test_pred, np.array([ 1., 1., -1., -1.]))
t_data = pd.DataFrame({
'X0': np.arange(100),
'X1': np.arange(-25, 75),
})
scaled_t2 =standardize(t_data)
pca_t2 = fit_pca(scaled_t2, 2)
t2_svc = get_svc(scaled_t2, 'rbf',0.1, check_random_state(0))
test2_pred = fit_and_pred(t2_svc, scaled_t2)
assert_array_equal(test2_pred, np.array(
[-1., -1., -1., -1., -1., -1., -1., -1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., -1., -1., -1., -1., -1., -1., -1., -1.]))
# In[16]:
svc = get_svc(reduced,'rbf',0.1, check_random_state(0))
preds = fit_and_pred(svc, scaled)
# ## Identifying inliers and outliers
# Next, we sort the inliers and outliers based on the prediction classes from the prediction results of the model we just fitted.
# In[17]:
def get_pred_classes(preds):
'''
Retrieves prediction classes (outliers and inliers) from the data from the predictions
made by the model.
Parameters
----------
preds: A numpy array, containing predictions.
Returns
-------
A tuple of outliers and inliers that contain indices from the data set
'''
#Creates a df of predictions
df = pd.DataFrame({'p': preds})
outliers=[]
inliers=[]
#Creates arrays for the outliers and inliers
outliers.append(df[df.p < 0].index)
inliers.append(df[df.p > 0].index)
return outliers, inliers
# In[18]:
t_outliers, t_inliers = get_pred_classes(test_pred)
assert_array_equal(t_outliers, [[2,3]])
assert_array_equal(t_inliers, [[0, 1]])
t2_outliers, t2_inliers = get_pred_classes(test2_pred)
assert_array_equal(t2_outliers, [[ 0, 1, 2, 3, 4, 5, 6, 7, 92, 93, 94, 95, 96, 97, 98, 99]])
assert_array_equal(t2_inliers, [[ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91]])
# In[19]:
outliers, inliers = get_pred_classes(preds)
# ## Visualizing data along the first two principal components
# Use maplotlib.plt to write a function named draw_scatter_plot() that creates a scatter plot of the data points (with inliers and outliers identified) along the first two principal components, as these components explain bulk of the variance in the data set, as can be observed by executing the code cell below.
#
# In[20]:
print("Explained Variance per PCA component\n", 35*'-')
for idx, val in enumerate(pca.explained_variance_ratio_):
print('PCA component {0:2d}: {1:4.1f}%'.format(idx, val*100))
print(35*'-', '\nTotal Explained Variance (All PCA components: {0:4.1f}%\n'.format(
sum(pca.explained_variance_ratio_[:]*100)))
print('Variance explained (First two PCA components): {0:4.1f}%\n'.format(
sum(pca.explained_variance_ratio_[:2]*100)))
print('Variance explained (Last two PCA components): {0:4.1f}%\n'.format(
sum(pca.explained_variance_ratio_[-2:]*100)))
#
# When you run the following code for draw_scatter_plot, you should generate the following plot:
# ![](./images/PCA1vsPCA2.png)
# It makes use of the parameters inliers and outliers (obtained from get_pred_classes() ) that contain indices or locations of rows that have been identified for the respective classes. These indices pertain to those rows in the dataframe "reduced" that contain the principal component values for all the four (n_components we arbitrarily chose) numerical features for each of 310 cases.
# ``` python
# >>> print(reduced)
#
# [[-0.23836132 -0.89346458 -1.54747828 0.05457487]
# [-2.24645299 -0.40130963 -0.30289504 0.64297846]
# [ 0.27829698 -0.65388174 -1.00857292 -0.51419182]
# ...,
# [-0.49758212 0.66527053 -0.36814074 -0.52018082]
# [-1.53121937 -0.40717078 0.2418938 -0.10213197]
# [-2.44753393 -0.0975671 0.52558415 0.10893291]]
# ```
#
# For the scatterplot, we use the first principal component for the x-axis and the second principal component for the y-axis.
# In[27]:
def draw_scatter_plot(inliers, outliers, reduced):
'''
Uses matplotlib.plt to plot a scatter plot with PCA1 on x-axis and PCA2 on the y-axis
First plot the inliers on the scatter plot and then plot the outliers on the same graph.
Parameters
----------
inliers: array containing indices of the data points identified as inliers
outliers: array containing indices of the data points identified as outliers
reduced: dataFrame containing the principal components of our data (returned by pca_fit() )
Returns
-------
A maplotlib.Axes instance
'''
fig, ax=plt.subplots()
#Creates a df for the data
r = pd.DataFrame(reduced[:,0:4],columns=['PCA 1','PCA 2','PCA 3','PCA 4'])
#Plots inliers
for i in inliers[0]:
plt.plot(r['PCA 1'].loc[i],r['PCA 2'].loc[i],marker='o',color=sns.xkcd_rgb['orange'])
#Plots outliers
for i in outliers[0]:
plt.plot(r['PCA 1'].loc[i],r['PCA 2'].loc[i],marker='^',color=sns.xkcd_rgb['blue'])
#Sets the title, x-axis, y-axis
ax.set(title='Explained Variance',xlabel='PCA 1',ylabel='PCA 2')
return ax
# In[28]:
ax = draw_scatter_plot(inliers, outliers, reduced)
# In[29]:
assert_is_instance(ax, mpl.axes.Axes, msg="Your function should return a matplotlib.axes.Axes object.")
assert_is_not(len(ax.title.get_text()), 0, msg="Your plot doesn't have a title.")
assert_is_not(ax.xaxis.get_label_text(), '', msg="Change the x-axis label to something more descriptive.")
assert_is_not(ax.yaxis.get_label_text(), '', msg="Change the y-axis label to something more descriptive.")
lines = ax.get_lines()
pts_lines= [l for l in lines]
#t_inliers = [l for l in lines if l.get_label()=='inliers'][0]
#t_outliers = [l for l in lines if l.get_label()=='outliers'][0]
t_xdata = pts_lines[0].get_xdata()
t_ydata = pts_lines[0].get_ydata()
t2_xdata = pts_lines[-1].get_xdata()
t2_ydata = pts_lines[-1].get_ydata()
assert_array_almost_equal(abs(t_xdata), [0.23836132], 3)
assert_array_almost_equal(abs(t_ydata), [0.89346458], 3)
assert_array_almost_equal(abs(t2_xdata), [2.59795974], 3)
assert_array_almost_equal(abs(t2_ydata), [0.1584883],3)
# In[ ]:
# In[ ]: