In [25]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings


warnings.filterwarnings('ignore') # ignore pandas future warnings
float_t = np.float64

### Loading data

In [26]:
data = pd.read_csv('task_data.csv')  # relative path that works on any OS
print(len(data))
data.head()

37


| Unnamed: 0 | ID | Cardiomegaly | Heart width | Lung width | CTR - Cardiothoracic Ratio | xx | yy | xy | normalized_diff | Inscribed circle radius | Polygon Area Ratio | Heart perimeter | Heart area | Lung area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 172 | 405 | 424691358 | 1682.360871 | 3153.67188 | -638.531109 | -0.304239 | 688186 | 0.213446 | 6794873689 | 24898 | 75419 |
| 1 | 2 | 1 | 159 | 391 | 4066496164 | 1526.66096 | 5102.159054 | -889.678405 | -0.539387 | 7392564 | 0.203652 | 7886589419 | 29851 | 94494 |
| 2 | 5 | 0 | 208 | 400 | 52 | 2465.903392 | 5376.834707 | -1755.344699 | -0.371163 | 6933974 | 0.320787 | 8623229369 | 33653 | 66666 |
| 3 | 7 | 1 | 226 | 435 | 5195402299 | 2509.063593 | 6129.82127 | -1025.079806 | -0.419123 | 8414868 | 0.317545 | 906724959 | 42018 | 82596 |
| 4 | 8 | 1 | 211 | 420 | 5023809524 | 2368.770135 | 5441.767075 | -1493.040062 | -0.393442 | 7378347 | 0.263542 | 8642396777 | 35346 | 85631 |


### Preprocessing
Since "xx", "yy", "xy" and "normalized_diff" describe images that are not provided, and "ID" carries no diagnostic information, we drop these columns because they are of no use for this task. We also convert all remaining values to floats and apply min-max normalization, which scales each column to the range [0, 1].

In [27]:
# Drop columns that are not used as features
data.drop(labels=['xx', 'yy', 'xy', 'ID', 'normalized_diff'], axis=1, inplace=True)
# Convert strings with decimal commas to floats
for col_name in data:
    new_col = []
    for val in data[col_name]:
        if isinstance(val, str):
            val = val.replace(',', '.')
        new_col.append(float_t(val))
    # Min-max normalization: scale the column to the range [0, 1]
    new_col = np.array(new_col)
    new_col = (new_col - new_col.min()) / (new_col.max() - new_col.min())
    data[col_name] = new_col

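# Hold out 20% of the rows for testing, then separate the label column
X_train, X_test = train_test_split(data, test_size=0.2)
y_train, y_test = X_train['Cardiomegaly'], X_test['Cardiomegaly']

# Remove the label from the features (not in place, to avoid modifying a slice of `data`)
X_train = X_train.drop('Cardiomegaly', axis=1)
X_test = X_test.drop('Cardiomegaly', axis=1)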

X_train = np.array(X_train, dtype=float_t)
X_test = np.array(X_test, dtype=float_t)
y_test = np.array(y_test, dtype=int)
y_train = np.array(y_train, dtype=int)
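
The conversion and scaling loop above can also be expressed with vectorized pandas operations. The following is only a sketch on a made-up two-column frame: the values are illustrative and not taken from `task_data.csv`, and `to_float` and `min_max_scale` are hypothetical helper names, not part of the notebook.

```python
import numpy as np
import pandas as pd


def to_float(col: pd.Series) -> pd.Series:
    """Convert a column to float64, accepting decimal commas in strings."""
    if not pd.api.types.is_numeric_dtype(col):
        # Non-numeric column: treat entries as text and swap ',' for '.'
        col = col.astype(str).str.replace(',', '.', regex=False)
    return col.astype(np.float64)


def min_max_scale(col: pd.Series) -> pd.Series:
    """Scale a numeric column linearly so that its min maps to 0 and its max to 1."""
    span = col.max() - col.min()
    if span == 0:
        # Constant column: return zeros instead of dividing by zero
        return col * 0.0
    return (col - col.min()) / span


# Illustrative data only (assumption, not the real dataset)
demo = pd.DataFrame({
    'ratio': ['0,40', '0,50', '0,60'],   # strings with decimal commas
    'area': [100, 200, 400],             # already numeric
})
scaled = demo.apply(to_float).apply(min_max_scale)
print(scaled)
```

After scaling, each column has a minimum of 0 and a maximum of 1, so features measured in very different units (pixels, areas, ratios) contribute on a comparable scale to distance-based models.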

## K-nearest neighbours

Since a patient can be described as a vector (in our case 8-dimensional: v = `[Heart width, Lung width, CTR - Cardiothoracic Ratio, Inscribed circle radius, Polygon Area Ratio, Heart perimeter, Heart area, Lung area]`), we can think of a patient as a point in n-dimensional space. For a new vector `v` we compute the distances to all known points and choose the k nearest ones. If most of the points closest to v represent sick patients, then the patient represented by v is most likely sick as well.
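
The idea can be sketched in a few lines of NumPy. This is not the implementation of `KNN_classifier` from the `ML` module, whose source is not shown here; it assumes Euclidean distance and a simple majority vote, and `knn_predict` and the toy data are hypothetical.

```python
import numpy as np


def knn_predict(X_train, y_train, v, k=3):
    """Predict the label of v by majority vote among its k nearest training points."""
    # Euclidean distance from v to every training point
    distances = np.linalg.norm(X_train - v, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels (labels must be non-negative ints)
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))


# Toy 2-D data: label 0 clusters near (0, 0), label 1 near (1, 1)
X_demo = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
                   [0.9, 1.0], [1.0, 0.9], [0.8, 0.8]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.85, 0.9])))  # 1
print(knn_predict(X_demo, y_demo, np.array([0.05, 0.1])))  # 0
```

An odd k avoids ties in a two-class problem, which is one reason k=3 is a common starting value.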

In [28]:
from ML import KNN_classifier
knn = KNN_classifier(k=3)
knn.fit(X_train, y_train)

### Evaluation of the model

In [29]:
knn.evaluate(X_test, y_test, intermediate_states=True)

Evaluation metrics:
 num  | Accuracy | Precision | Recall | F1_Score |
  1   |   0.00   |   0.00    |  0.00  |   0.00   |
  2   |   0.00   |   0.00    |  0.00  |   0.00   |
  3   |   0.33   |   0.00    |  0.00  |   0.00   |
  4   |   0.50   |   0.33    |  1.00  |   0.50   |
  5   |   0.60   |   0.50    |  1.00  |   0.67   |
  6   |   0.67   |   0.60    |  1.00  |   0.75   |
  7   |   0.71   |   0.67    |  1.00  |   0.80   |
  8   |   0.75   |   0.71    |  1.00  |   0.83   |

Final:
Accuracy=0.75 | Precision=0.71 | Recall=1.00 | F1 Score=0.83

                    Actual Positive  Actual Negative
Predicted Positive                5                2
Predicted Negative                0                1
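
The final metrics follow directly from the four counts in the confusion matrix above. As a check, here is a small sketch that recomputes them; `classification_metrics` is a hypothetical helper, and returning 0.0 when a denominator is empty is an assumption that merely matches the 0.00 entries in the first rows of the table.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no positive predictions or no positive samples)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Counts read off the confusion matrix: TP=5, FP=2, FN=0, TN=1
acc, prec, rec, f1 = classification_metrics(tp=5, fp=2, fn=0, tn=1)
print(f"Accuracy={acc:.2f} | Precision={prec:.2f} | Recall={rec:.2f} | F1 Score={f1:.2f}")
# Accuracy=0.75 | Precision=0.71 | Recall=1.00 | F1 Score=0.83
```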


### Conclusion
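The model performs reasonably well on this split: it detects all 5 cardiomegaly cases (recall 1.00) but misclassifies 2 of the 3 healthy patients as sick (precision 0.71). However, with only 37 patients in total and an 8-sample test set, the dataset is too small to give consistently accurate predictions.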

In [30]:
del knn

## Decision Tree

The data can be split into two subsets based on some criterion. Each of these subsets can then be split again, and so on for as long as needed. The result is a tree structure in which the smallest subsets (leaves) each carry a label. Given a feature vector v, we traverse the tree to find the leaf it falls into; that leaf's label is the most likely label for the vector.
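
A compact sketch of this procedure is shown below. It is not the `DecisionTree` class from the `ML` module, whose source is not shown here; it assumes threshold splits chosen by Gini impurity, and the names `gini`, `best_split`, `build`, `predict` and the `max_depth` parameter are all hypothetical.

```python
import numpy as np


def gini(y):
    """Gini impurity of an integer label array (0 means the set is pure)."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)


def best_split(X, y):
    """Return (feature, threshold) minimising the weighted Gini impurity, or None."""
    best, best_score = None, gini(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:          # the largest value cannot split
            left = X[:, f] <= t
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build(X, y, depth=0, max_depth=3):
    """Recursively split the data; a leaf stores the majority label of its subset."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        return int(np.argmax(np.bincount(y)))      # leaf
    f, t = split
    left = X[:, f] <= t
    return {'feature': f, 'threshold': t,
            'left': build(X[left], y[left], depth + 1, max_depth),
            'right': build(X[~left], y[~left], depth + 1, max_depth)}


def predict(node, v):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, dict):
        node = node['left'] if v[node['feature']] <= node['threshold'] else node['right']
    return node


# Toy data: the first feature alone separates the two classes
X_demo = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.1], [0.8, 0.2], [0.9, 0.7]])
y_demo = np.array([0, 0, 0, 1, 1])
tree = build(X_demo, y_demo)
print(predict(tree, np.array([0.85, 0.5])))  # 1
print(predict(tree, np.array([0.15, 0.5])))  # 0
```

Limiting the depth is one simple way to keep such a tree from memorising a very small training set.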

In [31]:
from ML import DecisionTree
dt = DecisionTree()
dt.fit(X_train, y_train)

### Evaluation of the model

In [32]:
dt.evaluate(X_test, y_test, intermediate_states=True)

Evaluation metrics:
 num  | Accuracy | Precision | Recall | F1_Score |
  1   |   0.00   |   0.00    |  0.00  |   0.00   |
  2   |   0.00   |   0.00    |  0.00  |   0.00   |
  3   |   0.33   |   0.00    |  0.00  |   0.00   |
  4   |   0.50   |   0.33    |  1.00  |   0.50   |
  5   |   0.60   |   0.50    |  1.00  |   0.67   |
  6   |   0.67   |   0.60    |  1.00  |   0.75   |
  7   |   0.71   |   0.67    |  1.00  |   0.80   |
  8   |   0.75   |   0.71    |  1.00  |   0.83   |

Final:
Accuracy=0.75 | Precision=0.71 | Recall=1.00 | F1 Score=0.83

                    Actual Positive  Actual Negative
Predicted Positive                5                2
Predicted Negative                0                1


### Conclusion

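As with KNN, the model does not give the best results because of the small size of the dataset. On this split the decision tree produces exactly the same metrics and confusion matrix as the KNN classifier.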