### Binary Classification with TuriCreate
+ TuriCreate
  - An open-source machine learning library developed by Apple

#### Task
+ Predict early-stage diabetes risk
+ Dataset: https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset

#### Installation
+ `pip install turicreate`


#### Features
+ Easy-to-use: Focus on tasks instead of algorithms
+ Visual: Built-in, streaming visualizations to explore your data
+ Flexible: Supports text, images, audio, video and sensor data
+ Fast and Scalable: Work with large datasets on a single machine
+ Ready To Deploy: Export models to Core ML for use in iOS, macOS, watchOS, and tvOS apps

In [3]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00529/diabetes_data_upload.csv"

In [4]:
!pip install turicreate

Collecting turicreate
  Downloading https://files.pythonhosted.org/packages/25/9f/a76acc465d873d217f05eac4846bd73d640b9db6d6f4a3c29ad92650fbbe/turicreate-6.4.1-cp37-cp37m-manylinux1_x86_64.whl (92.0MB)
Collecting coremltools==3.3
  Downloading https://files.pythonhosted.org/packages/1b/1d/b1a99beca7355b6a026ae61fd8d3d36136e5b36f13e92ec5f81aceffc7f1/coremltools-3.3-cp37-none-manylinux1_x86_64.whl (3.5MB)
Collecting numba<0.51.0
  Downloading https://files.pythonhosted.org/packages/04/be/8c88cee3366de2a3a23a9ff1a8be34e79ad1eb1ceb0d0e33aca83655ac3c/numba-0.50.1-cp37-cp37m-manylinux2014_x86_64.whl (3.6MB)
Collecting resampy==0.2.1
  Downloading https://files.pythonhosted.org/packages/14/b6/66a06d85474190b50aee1a6c09cdc95bb405ac47338b27e9b21409da1760/resampy-0.2.1.tar.gz (322kB)

In [5]:
# Load packages
import turicreate as tc

In [6]:
# Load Dataset
df = tc.SFrame(data_url)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [7]:
# Preview Dataset
df.head()

Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring
40,Male,No,Yes,No,Yes,No,No,No
58,Male,No,No,No,Yes,No,No,Yes
41,Male,Yes,No,No,Yes,Yes,No,No
45,Male,No,No,Yes,Yes,Yes,Yes,No
60,Male,Yes,Yes,Yes,Yes,Yes,No,Yes
55,Male,Yes,Yes,No,Yes,Yes,No,Yes
57,Male,Yes,Yes,No,Yes,Yes,Yes,No
66,Male,Yes,Yes,Yes,Yes,No,No,Yes
67,Male,Yes,Yes,No,Yes,Yes,Yes,No
70,Male,No,Yes,Yes,Yes,Yes,No,Yes

Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
Yes,No,Yes,No,Yes,Yes,Yes,Positive
No,No,No,Yes,No,Yes,No,Positive
Yes,No,Yes,No,Yes,Yes,No,Positive
Yes,No,Yes,No,No,No,No,Positive
Yes,Yes,Yes,Yes,Yes,Yes,Yes,Positive
Yes,No,Yes,No,Yes,Yes,Yes,Positive
No,No,Yes,Yes,No,No,No,Positive
Yes,Yes,No,Yes,Yes,No,No,Positive
Yes,Yes,No,Yes,Yes,No,Yes,Positive
Yes,Yes,No,No,No,Yes,No,Positive


In [8]:
# Check Datatype
df.dtype

[int,
 str,
 str,
 str,
 str,
 str,
 str,
 str,
 str,
 str,
 str,
 str,
 str,
 str,
 str,
 str,
 str]

In [9]:
# Plot the value counts (class distribution) of the target column
df['class'].show()

In [19]:
# Class/Target & Features
df.column_names()


['Age',
 'Gender',
 'Polyuria',
 'Polydipsia',
 'sudden weight loss',
 'weakness',
 'Polyphagia',
 'Genital thrush',
 'visual blurring',
 'Itching',
 'Irritability',
 'delayed healing',
 'partial paresis',
 'muscle stiffness',
 'Alopecia',
 'Obesity',
 'class']

In [20]:
feature_names = ['Age',
 'Gender',
 'Polyuria',
 'Polydipsia',
 'sudden weight loss',
 'weakness',
 'Polyphagia',
 'Genital thrush',
 'visual blurring',
 'Itching',
 'Irritability',
 'delayed healing',
 'partial paresis',
 'muscle stiffness',
 'Alopecia',
 'Obesity']

In [21]:
# Split the dataset into training and test sets (roughly 70% / 30%)
train_data, test_data = df.random_split(0.7)

In [22]:
# Shape of the training set
train_data.shape

(367, 17)

In [23]:
# Original Shape
df.shape


(520, 17)
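
`random_split(0.7)` assigns rows at random, so the training share is only approximately 70%. The arithmetic below is a small sketch that uses only the two shapes printed above to show the realized split. Passing a `seed` argument to `random_split` should make the split repeatable across runs.

```python
# Row counts taken from the shapes printed above.
total_rows = 520
train_rows = 367

# Every row that is not in the training set ends up in the test set.
test_rows = total_rows - train_rows

# Realized fractions; random_split targets 0.7 only approximately.
train_fraction = train_rows / total_rows
test_fraction = test_rows / total_rows

print(f"train: {train_rows} rows ({train_fraction:.1%})")
print(f"test:  {test_rows} rows ({test_fraction:.1%})")
```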

In [25]:
# Build a logistic regression classifier on the training set
lr_model = tc.logistic_classifier.create(train_data, target='class', features=feature_names)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [27]:
# Get model summary
lr_model.summary()

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 17
Number of examples             : 348
Number of classes              : 2
Number of feature columns      : 16
Number of unpacked features    : 16

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : newton
Solver iterations              : 7
Solver status                  : SUCCESS: Optimal solution found.
Training time (sec)            : 0.0333

Settings
--------
Log-likelihood                 : 42.9568

Highest Positive Coefficients
-----------------------------
Gender[Female]                 : 5.1663
Irritability[Yes]              : 3.5232
Itching[No]                    : 3.4068
Polyuria[Yes]                  : 3.2939
Genital thrush[Yes]            : 2.4532

Lowest Negative Coefficients
----------------------------
Polydipsia[No]                 : -6.5078
Alopecia
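
For reading the coefficients above, it may help to recall the standard logistic regression form. This is the usual textbook formulation rather than anything printed by the notebook. It assumes `Positive` is the class being modelled and that each `Feature[Value]` entry is a 0/1 indicator for that category.

$$
P(\text{class} = \text{Positive} \mid x) = \frac{1}{1 + e^{-z}}, \qquad z = b + \sum_{j} w_j x_j
$$

Under that reading, a positive coefficient such as `Polyuria[Yes]` raises the predicted probability of `Positive`, and a negative one such as `Polydipsia[No]` lowers it. The count of 17 coefficients is consistent with one weight for `Age`, one indicator for each of the 15 two-valued columns, and one intercept $b$.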

In [28]:
# Evaluate the model on the held-out test set
metrics = lr_model.evaluate(test_data)

In [29]:
metrics

{'accuracy': 0.9084967320261438,
 'auc': 0.9548611111111112,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |   Negative   |     Negative    |   51  |
 |   Negative   |     Positive    |   6   |
 |   Positive   |     Positive    |   88  |
 |   Positive   |     Negative    |   8   |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9263157894736843,
 'log_loss': 0.29441990146756725,
 'precision': 0.9361702127659575,
 'recall': 0.9166666666666666,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +-----------+--------------------+-----+----+----+
 | threshold |        fpr         | tpr | p  | n  |
 +-----------+--------------------+-----+----+----+
 |    0.0    |        1.0         | 1.0 | 96 | 57 |
 |   0.001   | 0.

In [30]:
type(metrics)

dict

In [31]:
# Get Accuracy
metrics['accuracy']

0.9084967320261438
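
The scalar metrics in the dictionary can be cross-checked against the confusion matrix. The sketch below recomputes them by hand from the four counts printed above, treating `Positive` as the positive class. The variable names are illustrative only.

```python
# Counts copied from the confusion matrix printed above,
# treating 'Positive' as the positive class.
tp = 88  # target Positive, predicted Positive
fn = 8   # target Positive, predicted Negative
fp = 6   # target Negative, predicted Positive
tn = 51  # target Negative, predicted Negative

total = tp + fn + fp + tn

# Standard definitions of the four metrics.
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall)

print(f"test rows: {total}")
print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1_score:  {f1_score:.4f}")
```

The recomputed values agree with the `accuracy`, `precision`, `recall`, and `f1_score` entries reported by `lr_model.evaluate`.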

#### Rules for Making a Single-Sample Prediction
+ The input must be an SFrame

In [52]:
d = {'Age': 41,
 'Alopecia': 'Yes',
 'Gender': 'Male',
 'Genital thrush': 'No',
 'Irritability': 'No',
 'Itching': 'Yes',
 'Obesity': 'No',
 'Polydipsia': 'No',
 'Polyphagia': 'Yes',
 'Polyuria': 'Yes',
 'class': 'Positive',
 'delayed healing': 'Yes',
 'muscle stiffness': 'Yes',
 'partial paresis': 'No',
 'sudden weight loss': 'No',
 'visual blurring': 'No',
 'weakness': 'Yes'}


In [53]:
# Wrap the sample in an SFrame; this stores all values in a single column named 'data'
ex1 = tc.SFrame({'data': [d.values()]})

In [54]:
ex1

data
"[41, Yes, Male, No, No, Yes, No, No, Yes, Yes, ..."


In [55]:
# Make Prediction
lr_model.predict(ex1)

dtype: str
Rows: 1
['Negative']

In [56]:
# Predicted class with its probability
lr_model.classify(ex1)

class,probability
Negative,0.9191211571241118
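
The frame `ex1` above has a single column named `data`, so the model may not be able to match the values to its feature columns. That could explain why a sample labelled `Positive` comes back as `Negative`. A minimal sketch of an alternative layout follows, with one column per feature and one row for the sample. The names `columns` and `ex2` are hypothetical, and the TuriCreate call is skipped when the library is not installed.

```python
# Feature columns the model was trained on (same list as feature_names above).
feature_names = ['Age', 'Gender', 'Polyuria', 'Polydipsia', 'sudden weight loss',
                 'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
                 'Itching', 'Irritability', 'delayed healing', 'partial paresis',
                 'muscle stiffness', 'Alopecia', 'Obesity']

# The sample from above; 'class' is the label, not a feature.
d = {'Age': 41, 'Alopecia': 'Yes', 'Gender': 'Male', 'Genital thrush': 'No',
     'Irritability': 'No', 'Itching': 'Yes', 'Obesity': 'No', 'Polydipsia': 'No',
     'Polyphagia': 'Yes', 'Polyuria': 'Yes', 'class': 'Positive',
     'delayed healing': 'Yes', 'muscle stiffness': 'Yes', 'partial paresis': 'No',
     'sudden weight loss': 'No', 'visual blurring': 'No', 'weakness': 'Yes'}

# Column-oriented layout: each feature name maps to a one-element list,
# which corresponds to a frame with a single row.
columns = {name: [d[name]] for name in feature_names}

# Assumption: TuriCreate may not be installed, so the import is guarded.
try:
    import turicreate as tc
except ImportError:
    tc = None

if tc is not None:
    # One row, 16 named columns (hypothetical name ex2).
    ex2 = tc.SFrame(columns)
    print(ex2)
```

`ex2` can then be passed to `lr_model.predict` and `lr_model.classify` in the same way as `ex1`.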


In [57]:
# Save Model
lr_model.save('dm_risk_lr_classifier_27_may_2021.model')

In [None]:
### Thanks For Watching
### Jesus Saves @JCharisTech
### Jesse E.Agbe(JCharis)