<span style="color:#6116BC">**Urine Analysis Data**</span><br>
<span style="color:#327689">**Description**</span><br>
The urine data frame has <span style="color:#5158BB">79</span> rows and <span style="color:#5158BB">7</span> columns.<br>
79 urine specimens were analyzed in an effort to determine if certain physical characteristics of the urine might be related to the formation of calcium oxalate crystals.<br>

**This data frame contains the following columns:**
<span style="color:#327689">*r, gravity, ph, osmo, cond, urea and calc*</span><br>
<span style="color:#327689">r</span> Presence of kidney stones<br>
Indicator of the presence of calcium oxalate crystals.<br>

<span style="color:#327689">gravity</span> Urine specific gravity<br>
The specific gravity of the urine.<br>

<span style="color:#327689">ph</span><br>
The pH reading of the urine.<br>

<span style="color:#327689">osmo</span> Osmotic pressure<br>
The osmolarity of the urine. Osmolarity is proportional to the concentration of molecules in solution.<br>
<span style="color:#0C6FF9">No.54 has no osmo value</span><br>

<span style="color:#327689">cond</span><br>
The conductivity of the urine. Conductivity is proportional to the concentration of charged ions in solution.<br>
<span style="color:#0C6FF9">No.0 has no cond value</span><br>

<span style="color:#327689">urea</span> Urea concentration<br>
The urea concentration in millimoles per litre.<br>

<span style="color:#327689">calc</span> Calcium concentration<br>
The calcium concentration in millimoles per litre.<br>

**Source**<br>
The data were obtained from Andrews, D.F. and Herzberg, A.M. (1985) Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer-Verlag.<br>

**References**<br>
Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application. Cambridge University Press.<br>

In [13]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.linear_model import LinearRegression

<span style="color:#6116BC">**Reading data**</span>

In [3]:
df = pd.read_csv("urine.csv")
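
The `Unnamed: 0` column that shows up in the summaries of this notebook is what pandas produces when a CSV stores its row index in a first column with no header. The sketch below uses a small in-memory CSV (a hypothetical stand-in, not the real `urine.csv`) to show that `index_col=0` reads that column as the index instead, which would make a later `df.iloc[:,1:]` step unnecessary.

```python
import io
import pandas as pd

# Small CSV shaped like urine.csv: the first, unnamed column is a saved row index.
# The values are illustrative only.
csv_text = """,r,gravity,ph
1,0,1.021,4.91
2,0,1.017,5.74
3,1,1.008,7.20
"""

# Default read: the blank header becomes a data column called "Unnamed: 0"
with_extra_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra_column.columns))
print(list(with_index.columns))
```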

In [12]:
df.describe()



Unnamed: 0.1,Unnamed: 0,r,gravity,ph,osmo,cond,urea,calc
count,79.0,79.0,79.0,79.0,78.0,78.0,79.0,79.0
mean,40.0,0.43038,1.018114,6.028481,615.038462,20.901282,266.405063,4.138987
std,22.949219,0.498293,0.007239,0.724307,238.247685,7.952072,131.25455,3.260051
min,1.0,0.0,1.005,4.76,187.0,5.1,10.0,0.17
25%,20.5,0.0,1.012,5.53,,,160.0,1.46
50%,40.0,0.0,1.018,5.94,,,260.0,3.16
75%,59.5,1.0,1.0235,6.385,,,372.0,5.93
max,79.0,1.0,1.04,7.94,1236.0,38.0,620.0,14.34


<span style="color:#0C6FF9">**NaN in urine testing data**</span>
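
Before choosing a strategy it helps to confirm where the missing values are. The following is a minimal sketch on a toy frame (hypothetical values and a reduced set of columns) that counts missing entries per column and lists the affected rows; the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the urine data: one missing osmo value and one missing cond value
toy = pd.DataFrame({
    "osmo": [725.0, 577.0, np.nan, 408.0],
    "cond": [np.nan, 20.0, 14.9, 12.6],
    "urea": [443, 296, 101, 224],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Row labels that contain at least one missing value
rows_with_nan = toy.index[toy.isna().any(axis=1)].tolist()
print(rows_with_nan)
```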

<span style="color:#327689">**Method 1: fill nulls with 0**</span>

In [14]:
# drop the extra index column; copy so that fillna does not act on a slice of df
df1 = df.iloc[:,1:].copy()

In [16]:
df1.fillna(0,inplace=True)

In [17]:
df1.describe()

Unnamed: 0,r,gravity,ph,osmo,cond,urea,calc
count,79.0,79.0,79.0,79.0,79.0,79.0,79.0
mean,0.43038,1.018114,6.028481,607.253165,20.636709,266.405063,4.138987
std,0.498293,0.007239,0.724307,246.622179,8.243462,131.25455,3.260051
min,0.0,1.005,4.76,0.0,0.0,10.0,0.17
25%,0.0,1.012,5.53,409.0,13.85,160.0,1.46
50%,0.0,1.018,5.94,594.0,21.4,260.0,3.16
75%,1.0,1.0235,6.385,792.0,26.55,372.0,5.93
max,1.0,1.04,7.94,1236.0,38.0,620.0,14.34


In [18]:
df1.describe().to_csv("description_after0.csv")

<span style="color:#327689">**Method 2: fill nulls with the mean of the feature**</span>

In [19]:
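# drop the extra index column; copy so the original df is left untouched
df2 = df.iloc[:,1:].copy()

In [20]:
# row 54 is missing osmo (column 3); row 0 is missing cond (column 4)
df2.iloc[54,3] = df2.iloc[:,3].mean()
df2.iloc[0,4] = df2.iloc[:,4].mean()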

In [22]:
df2.describe()

Unnamed: 0,r,gravity,ph,osmo,cond,urea,calc
count,79.0,79.0,79.0,79.0,79.0,79.0,79.0
mean,0.43038,1.018114,6.028481,615.038462,20.901282,266.405063,4.138987
std,0.498293,0.007239,0.724307,236.71553,7.900933,131.25455,3.260051
min,0.0,1.005,4.76,187.0,5.1,10.0,0.17
25%,0.0,1.012,5.53,413.0,14.45,160.0,1.46
50%,0.0,1.018,5.94,615.038462,21.4,260.0,3.16
75%,1.0,1.0235,6.385,792.0,26.55,372.0,5.93
max,1.0,1.04,7.94,1236.0,38.0,620.0,14.34


In [23]:
df2.describe().to_csv("description_nulltoaverage.csv")

<span style="color:#327689">**Method 3: drop the rows with nulls**</span>

In [7]:
# remove the extra index column; copy so that dropna does not act on a slice of df
df3 = df.iloc[:,1:].copy()

In [10]:
# Method 3: drop the rows that contain nulls
df3.dropna(inplace=True)

In [21]:
df3.describe()

Unnamed: 0,r,gravity,ph,osmo,cond,urea,calc
count,77.0,77.0,77.0,77.0,77.0,77.0,77.0
mean,0.428571,1.018026,6.040649,613.61039,20.905195,262.402597,4.16039
std,0.498117,0.007313,0.722068,239.473719,8.004142,130.486363,3.296907
min,0.0,1.005,4.76,187.0,5.1,10.0,0.17
25%,0.0,1.012,5.53,410.0,14.3,159.0,1.45
50%,0.0,1.018,5.94,594.0,21.4,255.0,3.16
75%,1.0,1.024,6.4,803.0,27.0,362.0,6.19
max,1.0,1.04,7.94,1236.0,38.0,620.0,14.34
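
The three summaries above differ in predictable ways: filling with 0 pulls the minimum to 0 and lowers the mean, filling with the mean keeps the mean but shrinks the standard deviation, and dropping reduces the count. The sketch below reproduces that pattern on a single toy column (hypothetical values), assuming the same three strategies.

```python
import numpy as np
import pandas as pd

# One toy column with a single missing value
col = pd.Series([10.0, 20.0, np.nan, 30.0], name="cond")

filled_zero = col.fillna(0)            # Method 1: replace NaN with 0
filled_mean = col.fillna(col.mean())   # Method 2: replace NaN with the column mean
dropped = col.dropna()                 # Method 3: drop the missing entry

# Side-by-side summary of the three results
summary = pd.DataFrame({
    "fill_zero": filled_zero.describe(),
    "fill_mean": filled_mean.describe(),
    "dropna": dropped.describe(),
})
print(summary.loc[["count", "mean", "std", "min"]])
```

Which distortion is acceptable depends on the analysis; with only two incomplete rows out of 79, dropping them changes the summaries very little.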


<span style="color:#6116BC">**Normalization**</span>

In [24]:
# features: gravity, ph, osmo, cond, urea, calc
X2 = df3.iloc[:, 1:]
# target: r, the presence of calcium oxalate crystals
Y2 = df3.iloc[:, 0]

In [25]:
X3 = (X2 - X2.mean()) / (X2.max() - X2.min())

In [334]:
X3.describe().to_csv("description_X3nornalized.csv")
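
The formula used for `X3` is mean normalization: each column is centered on its mean and divided by its range (max minus min). As a quick check, the sketch below applies the same expression to a toy frame (hypothetical values) and confirms that every column ends up with a mean of about 0 and a range of 1.

```python
import pandas as pd

# Toy frame with two columns on very different scales
toy = pd.DataFrame({
    "gravity": [1.005, 1.017, 1.021, 1.040],
    "urea": [91.0, 296.0, 443.0, 620.0],
})

# Same expression as for X3: center on the mean, scale by the range
normalized = (toy - toy.mean()) / (toy.max() - toy.min())

column_means = normalized.mean()                       # approximately 0 per column
column_ranges = normalized.max() - normalized.min()    # 1 per column
print(column_means)
print(column_ranges)
```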

<span style="color:#6116BC">**Visualization**</span>

In [27]:
data = X3.values

In [28]:
gravity = data[:,0]
ph = data[:,1]
osmo = data[:,2]
cond = data[:,3]
urea = data[:,4]
calc = data[:,5]

In [341]:
#binary
Y = Y2

In [342]:
x1 = gravity
x2 = urea

In [343]:
# scatter of two features (gravity vs. urea), colored by the target
plt.figure(2, figsize=(25, 16))

# Plot the training points
plt.scatter(x1, x2, c=Y, cmap=plt.cm.winter, edgecolor='k', s=200, 
            marker='8', alpha=0.8)

plt.xlabel('$Specific$ $gravity$ $of$ $the$ $urine$',fontsize=30, color = "#276C8C")
plt.ylabel('$Urea$ $concentration$ $per$ $litre$', fontsize=30, color = "#276C8C")
plt.title('$Scatter$ $between$ $Gravity$ $and$ $Urea$', fontsize = 40, color = "#3E9145")
#plt.xlim(x1.min() - 0.05, x1.max() + 0.05)
plt.ylim(x2.min() - 0.05, x2.max() + 0.05)
plt.grid(True)
plt.show()

In [44]:
dis = range(len(gravity))

In [345]:
#the presence of calcium oxalate crystals with distribution of each column
plt.figure(figsize=(18,8),dpi=100)

p1 = plt.subplot(231)  # 2 rows, 3 columns, panel 1
p2 = plt.subplot(232)  # 2 rows, 3 columns, panel 2
p3 = plt.subplot(233)  # 2 rows, 3 columns, panel 3
p4 = plt.subplot(234)  # 2 rows, 3 columns, panel 4
p5 = plt.subplot(235)  # 2 rows, 3 columns, panel 5
p6 = plt.subplot(236)  # 2 rows, 3 columns, panel 6

# mathtext labels: each word is wrapped in $...$
title_f1 = "$Gravity$ $of$ $the$ $urine$"
title_f2 = "$ph$"
title_f3 = "$Osmolarity$ $of$ $urine$"
title_f4 = "$Conductivity$"
title_f5 = "$Urea$ $Concentration$"
title_f6 = "$Calcium$ $Concentration$"

p1.scatter(dis, gravity, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8, label=title_f1)
p2.scatter(dis, ph, c=Y2, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)
p3.scatter(dis, osmo, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)
p4.scatter(dis, cond, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)
p5.scatter(dis, urea, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)
p6.scatter(dis, calc, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)

p1.set_ylabel(title_f1, fontsize=14, color = "#276C8C")
p2.set_ylabel(title_f2, fontsize=14, color = "#276C8C")
p3.set_ylabel(title_f3, fontsize=14, color = "#276C8C")
p4.set_ylabel(title_f4, fontsize=14, color = "#276C8C")
p5.set_ylabel(title_f5, fontsize=14, color = "#276C8C")
p6.set_ylabel(title_f6, fontsize=14, color = "#276C8C")

#missing value marker
#p3.text(49,0+20,"$missing$ $value$",fontsize= 15, color = "#103900", verticalalignment="bottom",horizontalalignment="left")
#p3.annotate("$missing$ $value$", fontsize= 15, color = "#103900",xy=(49, 0),xytext=(49,50),arrowprops=dict(facecolor="#F98BB0", shrink=0.001))

#p4.text(2,0+0.05,"$missing$ $value$",fontsize= 15, color = "#103900", verticalalignment="bottom",horizontalalignment="left")
#p4.annotate("$missing$ $value$",fontsize= 15, color = "#103900",xy=(0, 0),xytext=(2, 0+0.05),arrowprops=dict(facecolor="#F98BB0", shrink=0.001))


plt.show()

In [346]:
# each of the 6 features plotted against gravity (x1)
plt.figure(figsize=(18,8),dpi=100)

p1 = plt.subplot(231)  # 2 rows, 3 columns, panel 1
p2 = plt.subplot(232)  # 2 rows, 3 columns, panel 2
p3 = plt.subplot(233)  # 2 rows, 3 columns, panel 3
p4 = plt.subplot(234)  # 2 rows, 3 columns, panel 4
p5 = plt.subplot(235)  # 2 rows, 3 columns, panel 5
p6 = plt.subplot(236)  # 2 rows, 3 columns, panel 6

# mathtext labels: each word is wrapped in $...$
title_f1 = "$Gravity$ $of$ $the$ $urine$"
title_f2 = "$ph$"
title_f3 = "$Osmolarity$ $of$ $urine$"
title_f4 = "$Conductivity$"
title_f5 = "$Urea$ $Concentration$"
title_f6 = "$Calcium$ $Concentration$"

p1.scatter(x1, gravity, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8, label=title_f1)
p2.scatter(x1, ph, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)
p3.scatter(x1, osmo, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)
p4.scatter(x1, cond, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)
p5.scatter(x1, urea, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)
p6.scatter(x1, calc, c=Y, cmap=plt.cm.winter, edgecolor='k', s=100, 
            marker='8', alpha=0.8)

p1.set_ylabel(title_f1, fontsize=14, color = "#276C8C")
p2.set_ylabel(title_f2, fontsize=14, color = "#276C8C")
p3.set_ylabel(title_f3, fontsize=14, color = "#276C8C")
p4.set_ylabel(title_f4, fontsize=14, color = "#276C8C")
p5.set_ylabel(title_f5, fontsize=14, color = "#276C8C")
p6.set_ylabel(title_f6, fontsize=14, color = "#276C8C")

# missing value markers: left over from the fill-with-0 variant and disabled here,
# because df3 has no missing rows and these coordinates lie outside the normalized data range
#p3.text(49,0+20,"$missing$ $value$",fontsize= 15, color = "#103900", verticalalignment="bottom",horizontalalignment="left")
#p3.annotate("$missing$ $value$", fontsize= 15, color = "#103900",xy=(49, 0),xytext=(49,50),arrowprops=dict(facecolor="#F98BB0", shrink=0.001))

#p4.text(2,0+0.05,"$missing$ $value$",fontsize= 15, color = "#103900", verticalalignment="bottom",horizontalalignment="left")
#p4.annotate("$missing$ $value$",fontsize= 15, color = "#103900",xy=(0, 0),xytext=(2, 0+0.05),arrowprops=dict(facecolor="#F98BB0", shrink=0.001))


plt.show()

In [352]:
plt.figure(2, figsize=(25, 16))

# Plot urea against calcium so that the data match the axis labels and title
plt.scatter(urea, calc, c=Y, cmap=plt.cm.winter, edgecolor='k', s=200, 
            marker='8', alpha=0.8)
plt.xlabel('$Urea$ $concentration$ $per$ $litre$',fontsize=30, color = "#276C8C")
plt.ylabel('$Calcium$ $concentration$ $per$ $litre$', fontsize=30, color = "#276C8C")
plt.title('$Scatter$ $between$ $Urea$ $and$ $Calcium$', fontsize = 40, color = "#3E9145")
plt.xlim(urea.min() - 0.05, urea.max() + 0.05)
plt.ylim(calc.min() - 0.05, calc.max() + 0.05)
plt.grid(True)
plt.show()

In [351]:
plt.figure(2, figsize=(15, 8))

# Plot osmo against calc, colored by the target
plt.scatter(osmo, calc, c=Y, cmap=plt.cm.winter, edgecolor='k', s=200, 
            marker='8', alpha=0.4)
plt.xlabel('$osmo$',fontsize=30, color = "#348FBA")
plt.ylabel('$calc$', fontsize=30, color = "#348FBA")
plt.title('$Relation$ $Between$ $osmo$ $and$ $calc$', fontsize = 40, color = "#CF8BA3")
plt.xlim(osmo.min() - 0.05, osmo.max() + 0.05)
plt.ylim(calc.min() - 0.05, calc.max() + 0.05)
plt.grid(True)
plt.show()

In [350]:
plt.figure(2, figsize=(15, 8))

# Plot cond against ph, colored by the target
plt.scatter(cond, ph, c=Y, cmap=plt.cm.winter, edgecolor='k', s=200, 
            marker='8', alpha=0.4)
plt.xlabel('$cond$',fontsize=30, color = "#348FBA")
plt.ylabel('$ph$', fontsize=30, color = "#348FBA")
plt.title('$Relation$ $Between$ $cond$ $and$ $ph$', fontsize = 40, color = "#CF8BA3")
plt.xlim(cond.min() - 0.05, cond.max() + 0.05)
plt.ylim(ph.min() - 0.05, ph.max() + 0.05)
plt.grid(True)
plt.show()

In [None]:
# fig = plt.figure(1, figsize=(8, 6))
# ax = Axes3D(fig, elev=-150, azim=110)
# X_reduced = PCA(n_components=3).fit_transform(iris.data)
# ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
#            cmap=plt.cm.Set1, edgecolor='k', s=40)
# ax.set_title("First three PCA directions")
# ax.set_xlabel("1st eigenvector")
# ax.w_xaxis.set_ticklabels([])
# ax.set_ylabel("2nd eigenvector")
# ax.w_yaxis.set_ticklabels([])
# ax.set_zlabel("3rd eigenvector")
# ax.w_zaxis.set_ticklabels([])

In [172]:
df1.iloc[:45,:]

Unnamed: 0,r,gravity,ph,osmo,cond,urea,calc
0,0,1.021,4.91,725.0,0.0,443,2.45
1,0,1.017,5.74,577.0,20.0,296,4.49
2,0,1.008,7.2,321.0,14.9,101,2.36
3,0,1.011,5.51,408.0,12.6,224,2.15
4,0,1.005,6.52,187.0,7.5,91,1.16
5,0,1.02,5.27,668.0,25.3,252,3.34
6,0,1.012,5.62,461.0,17.4,195,1.4
7,0,1.029,5.67,1107.0,35.9,550,8.48
8,0,1.015,5.41,543.0,21.9,170,1.16
9,0,1.021,6.13,779.0,25.7,382,2.21


In [354]:
lm = LinearRegression()
lm.fit(np.reshape(gravity, (len(gravity), 1)), np.reshape(urea, (len(urea), 1)))

# print the coefficients
print(lm.coef_)

# print the intercept
print(lm.intercept_ )

[[ 0.8469039]]
[  1.69059979e-14]


In [355]:
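# NOTE: lm was fit on the normalized gravity, so a raw reading such as 1.0095
# should be normalized the same way before predicting; the output below is for the raw value
to_be_predicted = np.array([1.0095])
predicted_urea= lm.predict(np.reshape(to_be_predicted, (len(to_be_predicted), 1)))

print(predicted_urea)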

[[ 0.85494949]]
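
A model fit on mean-normalized values expects new inputs on that same scale, and its predictions come back on the normalized scale of the target. The sketch below uses hypothetical raw values and a hypothetical helper `mean_normalize` to show one way to transform a new reading, predict, and map the prediction back to the original units. It is an illustration under those assumptions, not the notebook's own pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical raw values standing in for gravity and urea
raw_x = np.array([1.005, 1.010, 1.015, 1.020, 1.025, 1.030])
raw_y = np.array([90.0, 180.0, 250.0, 330.0, 420.0, 510.0])

def mean_normalize(values, reference):
    """Scale `values` using the mean and range of `reference`."""
    return (values - reference.mean()) / (reference.max() - reference.min())

x_norm = mean_normalize(raw_x, raw_x)
y_norm = mean_normalize(raw_y, raw_y)

model = LinearRegression().fit(x_norm.reshape(-1, 1), y_norm)

# A new raw reading gets the same transform as the training inputs
new_raw = np.array([1.0095])
new_norm = mean_normalize(new_raw, raw_x)
pred_norm = model.predict(new_norm.reshape(-1, 1))

# Undo the target normalization to return to the original units
pred_raw = pred_norm * (raw_y.max() - raw_y.min()) + raw_y.mean()
print(new_norm, pred_norm, pred_raw)
```

Because ordinary least squares with an intercept is unaffected by shifting and rescaling the variables, `pred_raw` should match what a model fit directly on the raw values would predict.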


In [357]:
x1 = gravity
y1 = urea

In [356]:
X1 = np.reshape(x1, (len(x1), 1))
Y1 = np.reshape(y1, (len(y1), 1))

In [91]:
x1

array([ 1.017,  1.008,  1.011,  1.005,  1.02 ,  1.012,  1.029,  1.015,
        1.021,  1.011,  1.025,  1.006,  1.007,  1.011,  1.018,  1.007,
        1.025,  1.008,  1.014,  1.024,  1.019,  1.014,  1.02 ,  1.023,
        1.017,  1.017,  1.01 ,  1.008,  1.02 ,  1.017,  1.019,  1.017,
        1.008,  1.023,  1.02 ,  1.008,  1.02 ,  1.009,  1.018,  1.021,
        1.009,  1.015,  1.01 ,  1.02 ,  1.021,  1.024,  1.024,  1.021,
        1.024,  1.026,  1.013,  1.01 ,  1.011,  1.011,  1.031,  1.02 ,
        1.04 ,  1.021,  1.025,  1.026,  1.034,  1.033,  1.015,  1.013,
        1.014,  1.012,  1.025,  1.026,  1.028,  1.027,  1.018,  1.022,
        1.025,  1.017,  1.024,  1.016,  1.015])

In [359]:
#Sklearn linear regression
plt.figure(2, figsize=(25, 16))
plt.scatter(x1, y1,c=Y, cmap=plt.cm.winter, edgecolor='#F779CF', s=200, 
            marker='8', alpha=1)
plt.plot(x1, lm.predict(np.reshape(x1, (len(x1), 1))), color='#B75D69', linewidth=5)
#plt.plot(to_be_predicted, predicted_urea, color = '#309BDD', marker = '^', markersize = 20)

#p3.text(49,0+20,"$missing$ $value$",fontsize= 15, color = "#103900", verticalalignment="bottom",horizontalalignment="left")
#p3.annotate("$missing$ $value$", fontsize= 15, color = "#103900",xy=(49, 0),xytext=(49,50),arrowprops=dict(facecolor="#F98BB0", shrink=0.001))

plt.xlabel('$Specific$ $gravity$ $of$ $the$ $urine$',fontsize=30, color = "#276C8C")
plt.ylabel('$Urea$ $concentration$ $per$ $litre$', fontsize=30, color = "#276C8C")
plt.title('$Linear$ $regression$ $between$ $gravity$ $and$ $urea$', fontsize = 40, color = "#3E9145")
plt.xlim(x1.min() - 0.0005, x1.max() + 0.0005)
plt.ylim(y1.min() - 0.05, y1.max() + 0.05)
plt.show()

In [360]:
to_be_predicted, predicted_urea

(array([ 1.0095]), array([[ 0.85494949]]))

In [361]:
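# model performance
# X1 and Y1 must be on the same scale as the data passed to lm.fit;
# a negative R^2 means the fit is worse than always predicting the mean of Y1
mse = np.mean((lm.predict(X1) - Y1) ** 2)
r_squared = lm.score(X1, Y1)

# print the model performance
print(mse)
print(r_squared)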

85623.0185121
-4.09492049532
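
An R² below zero together with a very large MSE, as in the output above, can arise when a model is evaluated on data at a different scale than it was fit on. The sketch below illustrates that effect on synthetic data (all names and values are hypothetical): the same model is scored once on the normalized data it was trained on and once against raw inputs and targets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic raw data with a roughly linear relationship
rng = np.random.RandomState(0)
raw_x = rng.uniform(1.005, 1.040, size=50)
raw_y = 12000.0 * (raw_x - 1.0) + rng.normal(0.0, 40.0, size=50)

def mean_normalize(values):
    """Center on the mean and divide by the range."""
    return (values - values.mean()) / (values.max() - values.min())

x_norm = mean_normalize(raw_x).reshape(-1, 1)
y_norm = mean_normalize(raw_y)

model = LinearRegression().fit(x_norm, y_norm)

# Consistent: score on the same normalized data used for fitting
r2_same_scale = r2_score(y_norm, model.predict(x_norm))

# Inconsistent: score the same model against raw inputs and raw targets
pred_on_raw = model.predict(raw_x.reshape(-1, 1))
r2_mixed_scale = r2_score(raw_y, pred_on_raw)
mse_mixed_scale = mean_squared_error(raw_y, pred_on_raw)

print(r2_same_scale, r2_mixed_scale, mse_mixed_scale)
```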


In [210]:
X

array([[  1.01700000e+00,   5.74000000e+00,   5.77000000e+02,
          2.00000000e+01,   2.96000000e+02,   4.49000000e+00],
       [  1.00800000e+00,   7.20000000e+00,   3.21000000e+02,
          1.49000000e+01,   1.01000000e+02,   2.36000000e+00],
       [  1.01100000e+00,   5.51000000e+00,   4.08000000e+02,
          1.26000000e+01,   2.24000000e+02,   2.15000000e+00],
       [  1.00500000e+00,   6.52000000e+00,   1.87000000e+02,
          7.50000000e+00,   9.10000000e+01,   1.16000000e+00],
       [  1.02000000e+00,   5.27000000e+00,   6.68000000e+02,
          2.53000000e+01,   2.52000000e+02,   3.34000000e+00],
       [  1.01200000e+00,   5.62000000e+00,   4.61000000e+02,
          1.74000000e+01,   1.95000000e+02,   1.40000000e+00],
       [  1.02900000e+00,   5.67000000e+00,   1.10700000e+03,
          3.59000000e+01,   5.50000000e+02,   8.48000000e+00],
       [  1.01500000e+00,   5.41000000e+00,   5.43000000e+02,
          2.19000000e+01,   1.70000000e+02,   1.16000000e+00],


In [80]:
#https://ithelp.ithome.com.tw/articles/10187191

In [362]:
from sklearn import tree
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import neighbors
from sklearn.tree import DecisionTreeRegressor

In [394]:
xx = X3.values

In [396]:
train_X, test_X, train_y, test_y = train_test_split(xx, Y2, test_size = 0.2)

In [397]:
clf = tree.DecisionTreeClassifier()
urine_clf = clf.fit(train_X, train_y)

In [412]:
test_y_predicted = urine_clf.predict(test_X)
# test_y is a pandas Series; reshape its underlying NumPy array
a = test_y.values.reshape(1,len(test_y))
print(test_y_predicted)
print(a)

[1 1 1 0 1 0 1 1 1 0 1 0 0 1 1 0]
[[0 1 0 0 1 0 1 0 0 0 1 0 1 0 1 0]]


In [414]:
result = clf.predict_proba(test_X)

In [402]:
accuracy = metrics.accuracy_score(test_y, test_y_predicted)
print(accuracy)

0.625
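
With only 16 test specimens, each prediction moves the accuracy by 0.0625, so a single split gives a noisy estimate. One common alternative is k-fold cross-validation. The sketch below uses synthetic data of a similar size (77 samples, 6 features, all hypothetical) with `cross_val_score` to report the mean and spread of the accuracy over 5 folds.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 77 samples, 6 features, binary label driven by the last feature
rng = np.random.RandomState(0)
features = rng.normal(size=(77, 6))
labels = (features[:, 5] + 0.5 * rng.normal(size=77) > 0).astype(int)

# 5-fold cross-validated accuracy of a decision tree
tree_clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(tree_clf, features, labels, cv=5)

print(scores)
print(scores.mean(), scores.std())
```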


In [381]:
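# train_X2, train_y2, test_X2 and test_y2 are not defined in the cells shown here;
# they are assumed to come from a separate train_test_split call
clf2 = neighbors.KNeighborsClassifier(n_neighbors= 7)
knn_urine_clf = clf2.fit(train_X2, train_y2)
test_y_predicted2 = knn_urine_clf.predict(test_X2)
print(test_y_predicted2)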

[0 1 0 0 1 1 1 1 1 0 0 0 1 0 0 0]


In [379]:
accuracy2 = metrics.accuracy_score(test_y2, test_y_predicted2)
print(accuracy2)

0.4375


In [380]:
# candidate k values (named k_values to avoid shadowing the built-in range)
k_values = np.arange(1, round(0.2 * train_X.shape[0]) + 1)
accuracies = []

for k in k_values:
    knn = neighbors.KNeighborsClassifier(n_neighbors = k)
    knn_urine_clf = knn.fit(train_X, train_y)
    test_y_predicted = knn_urine_clf.predict(test_X)
    accuracy = metrics.accuracy_score(test_y, test_y_predicted)
    accuracies.append(accuracy)
# first k that reaches the highest accuracy
appr_k = accuracies.index(max(accuracies)) + 1

# visualization
plt.scatter(k_values, accuracies, c='#76DBD2', edgecolor='#1B998B', s=200, 
            marker='8', alpha=1)
plt.scatter(appr_k, max(accuracies), c='#103900', s = 200)
plt.xlabel('$k$',fontsize=20, color = "#276C8C")
plt.ylabel('$Accuracy$', fontsize=20, color = "#276C8C")
plt.title('$Choose$ $the$ $best$ $k$', fontsize = 26, color = "#3E9145")

# build the label from the computed values instead of hard-coding them
plt.annotate("$Appropriate$ $value$ $({},$ ${:.4f})$".format(appr_k, max(accuracies)), fontsize= 18, color = "#103900",xy=(appr_k, max(accuracies)),xytext=(appr_k-6,max(accuracies)+0.02))
plt.show()

print(appr_k, max(accuracies))

1 0.625
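
Picking k by its accuracy on the test set means the test set is no longer an independent check of the final model. A common alternative is to choose k by cross-validation on the training data only. The sketch below shows that idea on synthetic data (61 samples and 6 features, hypothetical values sized like an 80% training split of 77 rows).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split
rng = np.random.RandomState(0)
features = rng.normal(size=(61, 6))
labels = (features[:, 5] + 0.5 * rng.normal(size=61) > 0).astype(int)

# Mean 5-fold accuracy for each candidate k
k_values = list(range(1, 13))
mean_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores.append(cross_val_score(knn, features, labels, cv=5).mean())

# First k with the highest cross-validated accuracy
best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))
```

The held-out test set can then be used once, at the end, to report the accuracy of the model trained with `best_k`.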
