# Kidney Stone Prediction based on Urine Analysis

## Task: Develop ML/DL models to predict occurrence of kidney stones

### Data Understanding


This dataset can be used to predict the presence of kidney stones based on urine analysis.

A total of `79 urine specimens` were analyzed to
determine whether certain `physical characteristics of the urine` are related to the
formation of `calcium oxalate crystals`.

The `six physical characteristics` of the urine are: 

- `(1) specific gravity`, the density of the urine relative to water;
- `(2) pH`, the negative logarithm of the hydrogen ion concentration;
- `(3) osmolarity (mOsm)`, a unit used in biology and medicine but not in
physical chemistry. Osmolarity is proportional to the concentration of
molecules in solution;
- `(4) conductivity (mMho, i.e. milliMho)`. One Mho is one reciprocal Ohm.
Conductivity is proportional to the concentration of charged
ions in solution; 
- `(5) urea concentration in millimoles per litre`;
- `(6) calcium concentration (CALC) in millimoles per litre`.
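
As a small worked example of the pH definition in the list above, the sketch below converts a hydrogen ion concentration (in moles per litre) into a pH value. The function name `ph_from_concentration` and the sample concentrations are illustrative assumptions and do not come from the dataset.

```python
import math


def ph_from_concentration(h_ion_mol_per_l: float) -> float:
    """Return pH = -log10([H+]) for a hydrogen ion concentration in mol/L."""
    if h_ion_mol_per_l <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(h_ion_mol_per_l)


# Illustrative concentrations (not taken from the dataset)
neutral_ph = ph_from_concentration(1e-7)  # roughly neutral
acidic_ph = ph_from_concentration(1e-5)   # more acidic, so a lower pH
print(neutral_ph, acidic_ph)
```

A hundredfold increase in hydrogen ion concentration lowers the pH by two units, which is why small pH differences between specimens can reflect large chemical differences.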

The data is taken from 'Physical Characteristics of Urines With and Without Crystals', a chapter in the Springer Series in Statistics.

https://link.springer.com/chapter/10.1007/978-1-4612-5098-2_45

### Data Understanding II

There are two datasets: the original dataset and the generated dataset. We will use both so that we can compare and contrast their features.

Files from the generated dataset:

- `train.csv` - the training dataset; `target` is the likelihood of a kidney stone being present
- `test.csv` - the test dataset; your objective is to `predict the probability of target`
- `sample_submission.csv` - a sample submission file in the correct format

Files from the original dataset:

- `kidney_stone_urine_analysis.csv`


### Data Understanding III, Understanding our Features in Depth

- `specific gravity`: Urine specific gravity (USG) is a laboratory test that shows the concentration of all chemical particles in the urine. It is the ratio of the density (mass of a unit volume) of urine to the density (mass of the same unit volume) of a reference substance (water). The normal range for urine specific gravity is `1.005 to 1.030`. USG values vary between 1.000 and 1.040; a USG below 1.008 is regarded as dilute, and a USG above 1.020 is considered concentrated. In the study cited here, USG was slightly higher in patients with stone formation than in those without (1.018±0.007 vs. 1.017±0.007). Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7844516/

- `pH` : When the pH of urine drops below 5.5, the urine becomes saturated with uric acid, which can then crystallize. (Excess uric acid in the urine is called hyperuricosuria; hypercalciuria refers to excess calcium.) When there is too much uric acid in the urine, stones can form. Uric acid stones are more common in people who consume large amounts of protein, such as that found in red meat or poultry.
Source: https://www.hopkinsmedicine.org/health/conditions-and-diseases/kidney-stones

- `osmolarity (mOsm)` : Osmolarity is the number of solute particles (osmoles) per litre of solution.
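
The thresholds quoted above (USG below 1.008 or above 1.020, pH below 5.5) can be turned into simple exploratory flags. The sketch below is one way to do that; the function names and the `"intermediate"` label are assumptions introduced for illustration, and the flags are meant for exploring the data, not for diagnosis.

```python
def describe_specific_gravity(usg: float) -> str:
    """Label a urine specific gravity value using the cut-offs quoted above."""
    if usg < 1.008:
        return "dilute"
    if usg > 1.020:
        return "concentrated"
    # Hypothetical label for values between the two cut-offs
    return "intermediate"


def is_below_uric_acid_ph_threshold(ph: float) -> bool:
    """Return True when pH is below 5.5, the level quoted above for uric acid saturation."""
    return ph < 5.5


# Illustrative readings (not taken from the dataset)
for usg, ph in [(1.005, 6.2), (1.015, 5.4), (1.025, 5.0)]:
    print(usg, describe_specific_gravity(usg), ph, is_below_uric_acid_ph_threshold(ph))
```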


### Importing Libraries & Data

In [1]:
# Load packages and report their versions
import sys  # access to system parameters: https://docs.python.org/3/library/sys.html
print(f"Python version: {sys.version}")

import pandas as pd  # data processing and analysis, modeled after R dataframes with SQL-like features
print(f"pandas version: {pd.__version__}")

import matplotlib  # scientific and publication-ready visualization
print(f"matplotlib version: {matplotlib.__version__}")

import numpy as np  # foundational package for scientific computing
print(f"NumPy version: {np.__version__}")

import scipy as sp  # scientific computing and advanced mathematics
print(f"SciPy version: {sp.__version__}")

import IPython
from IPython import display  # pretty printing of dataframes in Jupyter notebook
print(f"IPython version: {IPython.__version__}")

import sklearn  # collection of machine learning algorithms
print(f"scikit-learn version: {sklearn.__version__}")

# Misc libraries
import random
import time

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-' * 25)

Python version: 3.8.3 (v3.8.3:6f8c8320e9, May 13 2020, 16:29:34) 
[Clang 6.0 (clang-600.0.57)]
pandas version: 2.0.0
matplotlib version: 3.7.1
NumPy version: 1.24.2
SciPy version: 1.10.1
IPython version: 8.12.0
scikit-learn version: 1.2.2
-------------------------


### Load Data Modelling Libraries

We will use the popular scikit-learn library to develop our machine learning models. In sklearn, algorithms are called Estimators, and each is implemented in its own class. For data visualization, we will use the matplotlib and seaborn libraries. Below are the common classes to load.

In [2]:

#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8
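
To make the Estimator idea concrete, here is a minimal sketch of the usual fit-then-predict pattern on synthetic data. The six random feature columns and the rule used to build the labels are assumptions made up for illustration; they are not the urine measurements, and `LogisticRegression` is just one example estimator, not necessarily the model used later.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 100 rows, 6 numeric features (an assumption, not the real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = (X[:, 0] + X[:, 5] > 0).astype(int)  # made-up binary target

# Every scikit-learn estimator follows the same pattern: construct, fit, predict
model = LogisticRegression()
model.fit(X, y)

# predict_proba returns one column per class; column 1 is the probability of class 1
positive_proba = model.predict_proba(X)[:, 1]
print(positive_proba[:5])
```

Because the competition asks for the probability of `target`, `predict_proba` is the relevant method here, as opposed to `predict`, which returns hard class labels.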


### Making our datasets available in our coding environment 

In [3]:
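# NOTE: PATH and ORIGINAL_PATH are Kaggle input directories and are not used below.
PATH = "/kaggle/input/playground-series-s3e12"
# The *_FILENAME values are absolute local paths; adjust them to your own environment.
TRAIN_FILENAME = "/Users/richeyjay/Desktop/Kidney_Stone_PredictionML/env/Code/train.csv"
TEST_FILENAME = "/Users/richeyjay/Desktop/Kidney_Stone_PredictionML/env/Code/test.csv"
SUBMISSION_FILENAME = "/Users/richeyjay/Desktop/Kidney_Stone_PredictionML/env/Code/sample_submission.csv"
# NOTE: this directory name refers to a different dataset (media campaign cost) and looks like a copy-paste leftover.
ORIGINAL_PATH = "/kaggle/input/media-campaign-cost-prediction"
ORIGINAL_FILENAME = "/Users/richeyjay/Desktop/Kidney_Stone_PredictionML/env/Code/kidney_stone_urine_analysis.csv"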

### Reading in our csv files and putting them into a dataframe object

In [None]:
original_data = pd.read_csv(ORIGINAL_FILENAME)
train_data = pd.read_csv(TRAIN_FILENAME)
test_data = pd.read_csv(TEST_FILENAME)
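
Once the files are loaded, a quick check of shapes, missing values and column differences helps when comparing the original and generated datasets. The sketch below shows one possible approach on two tiny demo frames; the helper names and the demo columns (`id`, `ph`) and values are assumptions for illustration, not the real data.

```python
import pandas as pd


def summarize_frame(name: str, df: pd.DataFrame) -> dict:
    """Collect basic facts about a dataframe: row count, column names, missing cells."""
    return {
        "name": name,
        "rows": len(df),
        "columns": list(df.columns),
        "missing": int(df.isna().sum().sum()),
    }


def column_differences(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report which columns appear in only one of the two dataframes."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }


# Tiny demo frames with made-up columns and values
demo_original = pd.DataFrame({"ph": [5.2, 6.1], "target": [1, 0]})
demo_train = pd.DataFrame({"id": [0, 1], "ph": [5.9, None], "target": [0, 1]})

print(summarize_frame("demo_train", demo_train))
print(column_differences(demo_original, demo_train))
```

With the real data, the same helpers could be called as `summarize_frame("original", original_data)` and `column_differences(original_data, train_data)`.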