# 0 - Business Problem

Submissions are evaluated on the area under the ROC curve (AUC) between the predicted probability and the observed target.

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
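To make the metric concrete, here is a minimal sketch of what the area under the ROC curve measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The helper `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the competition data. For binary labels, `roc_auc_score` from scikit-learn should return the same value.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Hypothetical helper for illustration; ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 3 of the 4 positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)  # 0.75
```

Because the metric depends only on how the scores rank the examples, any monotonic rescaling of the predicted probabilities leaves the AUC unchanged.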

For each id in the test set, you must predict the probability of target (likelihood of the presence of a kidney stone). The file should contain a header and have the following format:
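The exact file layout is not reproduced here, so the following is only a sketch under an assumption: a two-column CSV with a header row, one `id` column and one `target` column holding the predicted probability. The ids and probabilities below are made-up placeholders, and the column names should be checked against the competition's sample submission.

```python
import io

import pandas as pd

# Placeholder ids and predicted probabilities, for illustration only.
test_ids = [414, 415, 416]
predicted_proba = [0.12, 0.87, 0.50]

# Assumed layout: header row, then one line per test id.
submission = pd.DataFrame({"id": test_ids, "target": predicted_proba})

# Write to an in-memory buffer here; in the notebook this would be a file path.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the pandas row index out of the file, so the header contains only the two assumed columns.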

This dataset can be used to predict the presence of kidney stones based on urine analysis.

The 79 urine specimens were analyzed in an effort to determine whether certain physical characteristics of the urine might be related to the formation of calcium oxalate crystals. The six physical characteristics of the urine are:

1. Specific gravity: the density of the urine relative to water.
2. pH: the negative logarithm of the hydrogen ion concentration.
3. Osmolarity (mOsm): a unit used in biology and medicine but not in physical chemistry. Osmolarity is proportional to the concentration of molecules in solution.
4. Conductivity (mMho, milliMho): one Mho is one reciprocal Ohm. Conductivity is proportional to the concentration of charged ions in solution.
5. Urea concentration in millimoles per litre.
6. Calcium concentration (CALC) in millimoles per litre.
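As a small worked example of the pH definition in the list above, the sketch below converts between hydrogen ion concentration (in mol/L) and pH using the base-10 logarithm. The function names are hypothetical, and the sketch uses concentration rather than activity, which is a simplification.

```python
import math


def ph_from_hydrogen_ion(concentration_mol_per_l: float) -> float:
    """pH = -log10([H+]), with [H+] in mol/L (concentration, not activity)."""
    return -math.log10(concentration_mol_per_l)


def hydrogen_ion_from_ph(ph: float) -> float:
    """Inverse relation: [H+] = 10 ** (-pH), in mol/L."""
    return 10.0 ** (-ph)


# A hydrogen ion concentration of 1e-6 mol/L corresponds to pH 6.
example_ph = ph_from_hydrogen_ion(1e-6)
print(example_ph)
```

Since the scale is logarithmic, a one-unit difference in the `ph` feature corresponds to a tenfold difference in hydrogen ion concentration.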

The data were obtained from 'Physical Characteristics of Urines With and Without Crystals', a chapter published in the Springer Series in Statistics.

https://link.springer.com/chapter/10.1007/978-1-4612-5098-2_45

# 1 - Import Packages

In [10]:
import pandas as pd
import numpy as np

import seaborn as sns

from scipy.stats import skew, kurtosis
from sklearn.metrics import roc_auc_score, roc_curve

## 1.1 Functions

# 2 - Load Data

In [4]:
raw_test = pd.read_csv("../data/test.csv")
raw_train = pd.read_csv("../data/train.csv")

In [5]:
print(f"Train shape = {raw_train.shape}   Test shape = {raw_test.shape}")

Train shape = (414, 8)   Test shape = (276, 7)


In [6]:
raw_train.columns

Index(['id', 'gravity', 'ph', 'osmo', 'cond', 'urea', 'calc', 'target'], dtype='object')

In [7]:
raw_train.dtypes

id           int64
gravity    float64
ph         float64
osmo         int64
cond       float64
urea         int64
calc       float64
target       int64
dtype: object

In [8]:
raw_train.head(5)

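|   | id | gravity | ph | osmo | cond | urea | calc | target |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.013 | 6.19 | 443 | 14.8 | 124 | 1.45 | 0 |
| 1 | 1 | 1.025 | 5.40 | 703 | 23.6 | 394 | 4.18 | 0 |
| 2 | 2 | 1.009 | 6.13 | 371 | 24.5 | 159 | 9.04 | 0 |
| 3 | 3 | 1.021 | 4.91 | 442 | 20.8 | 398 | 6.63 | 1 |
| 4 | 4 | 1.021 | 5.53 | 874 | 17.8 | 385 | 2.21 | 1 |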


In [9]:
raw_train.describe()

|   | id | gravity | ph | osmo | cond | urea | calc | target |
|---|---|---|---|---|---|---|---|---|
| count | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 |
| mean | 206.5 | 1.017894 | 5.955459 | 651.545894 | 21.437923 | 278.657005 | 4.114638 | 0.444444 |
| std | 119.655756 | 0.006675 | 0.64226 | 234.676567 | 7.51475 | 136.442249 | 3.217641 | 0.497505 |
| min | 0.0 | 1.005 | 4.76 | 187.0 | 5.1 | 10.0 | 0.17 | 0.0 |
| 25% | 103.25 | 1.012 | 5.53 | 455.25 | 15.5 | 170.0 | 1.45 | 0.0 |
| 50% | 206.5 | 1.018 | 5.74 | 679.5 | 22.2 | 277.0 | 3.13 | 0.0 |
| 75% | 309.75 | 1.022 | 6.28 | 840.0 | 27.6 | 385.0 | 6.63 | 1.0 |
| max | 413.0 | 1.04 | 7.94 | 1236.0 | 38.0 | 620.0 | 14.34 | 1.0 |
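The summary statistics could be complemented with the skewness and kurtosis of each feature, using the `skew` and `kurtosis` functions imported from `scipy.stats`. The sketch below is one possible way to do this, not the notebook's actual analysis. To stay self-contained it rebuilds only the five rows shown by `raw_train.head(5)`; in the notebook, `raw_train` would take the place of the hypothetical `sample` frame.

```python
import pandas as pd
from scipy.stats import kurtosis, skew

# The five rows printed by raw_train.head(5), used here as a stand-in for raw_train.
sample = pd.DataFrame(
    {
        "gravity": [1.013, 1.025, 1.009, 1.021, 1.021],
        "ph": [6.19, 5.40, 6.13, 4.91, 5.53],
        "osmo": [443, 703, 371, 442, 874],
        "cond": [14.8, 23.6, 24.5, 20.8, 17.8],
        "urea": [124, 394, 159, 398, 385],
        "calc": [1.45, 4.18, 9.04, 6.63, 2.21],
        "target": [0, 0, 0, 1, 1],
    }
)

features = sample.drop(columns="target")

# scipy's defaults: biased estimators, and Fisher (excess) kurtosis.
shape_stats = pd.DataFrame(
    {
        "skew": features.apply(lambda col: skew(col.to_numpy())),
        "kurtosis": features.apply(lambda col: kurtosis(col.to_numpy())),
    }
)

# Share of positive cases; the mean of a 0/1 target is the positive rate.
positive_rate = sample["target"].mean()

print(shape_stats)
print(f"Positive rate = {positive_rate:.3f}")
```

On the full training set, the `target` mean of 0.444 in the `describe()` output plays the same role as `positive_rate` here: about 44% of the rows are positive, so the classes are only mildly imbalanced.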


# 3 - Descriptive Analysis

# 4 - Feature Engineering

# 5 - EDA

# 6 - Model Data

# 7 - Machine Learning

# 8 - Fine Tuning

# 9 - Results