## Step 1.1, Loading our Data

In [1]:
import pandas as pd

# pandas is a library for loading, cleaning, and summarizing tabular data
# here we read the CSV into the `data` variable as a pandas DataFrame
data = pd.read_csv('patient_priority.csv')  # the file includes a saved index column
data.describe() # shows some basic statistics about our data

,Unnamed: 0,age,gender,chest pain type,blood pressure,cholesterol,max heart rate,exercise angina,plasma glucose,skin_thickness,insulin,bmi,diabetes_pedigree,hypertension,heart_disease
count,6962.0,6962.0,6961.0,6962.0,6962.0,6962.0,6962.0,6962.0,6962.0,6962.0,6962.0,6962.0,6962.0,6962.0,6962.0
mean,2011.95418,57.450014,0.531964,0.529015,109.629991,184.71129,163.502442,0.061764,98.394283,56.813416,111.09164,27.190908,0.467386,0.071531,0.0395
std,1560.966466,11.904948,0.499013,1.253791,21.534852,32.010359,15.458693,0.240743,28.598084,22.889316,17.470033,7.362886,0.102663,0.257729,0.194796
min,0.0,28.0,0.0,0.0,60.0,150.0,138.0,0.0,55.12,21.0,81.0,10.3,0.078,0.0,0.0
25%,604.0,48.0,0.0,0.0,92.0,164.0,150.0,0.0,78.7075,36.0,97.0,21.8,0.467386,0.0,0.0
50%,1628.5,56.0,1.0,0.0,111.0,179.0,163.0,0.0,93.0,55.0,111.0,26.2,0.467386,0.0,0.0
75%,3368.75,66.0,1.0,0.0,127.0,192.0,177.0,0.0,111.6325,77.0,125.0,31.0,0.467386,0.0,0.0
max,5109.0,82.0,1.0,4.0,165.0,294.0,202.0,1.0,199.0,99.0,171.0,66.8,2.42,1.0,1.0


A quick look at the summary shows this is preliminary medical information.
<br>I am not familiar with every term, so I'll need to set up some definitions for later.
<br><br>
We also have an `Unnamed: 0` column, which is a leftover index from when the file was saved. This can be fixed at load time by passing `index_col=0` to `read_csv`, but this time I am just going to drop the column.
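As a rough sketch of the load-time fix: `read_csv` accepts an `index_col` argument, so a saved index can be consumed on the way in rather than dropped afterwards. The small inline CSV below is made up for illustration (including the `A`/`B` labels) and only mimics the shape of `patient_priority.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for patient_priority.csv: the first, unnamed column
# is an index that was written out when the file was saved.
csv_text = """,age,gender,triage
0,40.0,1.0,A
1,49.0,0.0,B
2,37.0,1.0,A
"""

# Default load: the saved index comes back as a column called 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 tells pandas to use that first column as the index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

Either route ends up with the same feature columns; dropping the column afterwards just keeps the default `RangeIndex`.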

## Step 1.2, Cleaning our Data

In [2]:
data = data.drop('Unnamed: 0', axis=1)  # axis=1 drops a column rather than a row

In [3]:
# first, let's separate our labels from the data
triage = data['triage']
data = data.drop('triage', axis=1)

In [4]:
# we should not have any null values, but let's check, and also see what types of
# variables our dataset contains
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6962 entries, 0 to 6961
Data columns (total 16 columns):
age                  6962 non-null float64
gender               6961 non-null float64
chest pain type      6962 non-null float64
blood pressure       6962 non-null float64
cholesterol          6962 non-null float64
max heart rate       6962 non-null float64
exercise angina      6962 non-null float64
plasma glucose       6962 non-null float64
skin_thickness       6962 non-null float64
insulin              6962 non-null float64
bmi                  6962 non-null float64
diabetes_pedigree    6962 non-null float64
hypertension         6962 non-null float64
heart_disease        6962 non-null float64
Residence_type       6962 non-null object
smoking_status       6962 non-null object
dtypes: float64(14), object(2)
memory usage: 870.3+ KB


We have a couple of object columns alongside our numerical data. Triage, which we separated out above, will be the target for this model; it represents the level of immediate care a patient requires upon assessment.
<br>Next we'll check for null values and duplicate rows.

In [26]:
# isnull() returns a DataFrame of booleans, True wherever a value is missing
if data.isnull().values.any():  # True if any value anywhere in the frame is null
    print('There are Null values')

# duplicated() returns a boolean Series, True for each row that repeats an earlier row
dups = data.duplicated()
if dups.any():
    print('There are duplicates')

There are Null values
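
One pitfall worth spelling out with this kind of check: `any` has to be called, and on a DataFrame a single call only reduces to one boolean per column. A minimal sketch on a made-up toy frame (not the patient data) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with no missing values at all
df = pd.DataFrame({"age": [40.0, 49.0], "gender": [1.0, 0.0]})

# Without parentheses this is the bound method itself, which is always truthy
print(bool(df.isnull().any))          # True, even though nothing is missing

# Calling it gives one boolean per column (a Series)
per_column = df.isnull().any()
print(per_column)

# Reducing once more gives a single True/False for the whole frame
has_nulls = bool(per_column.any())
print(has_nulls)                      # False

# Same check after introducing one missing value
df.loc[1, "gender"] = np.nan
print(bool(df.isnull().any().any()))  # True
```

So a condition written as `if frame.isnull().any:` prints its message whether or not the frame contains nulls.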


We assumed there would be no duplicates, and there are none, but it is always safe to check.
<br>Well, I guess a-hunting for these null values we shall go!

In [6]:
# Let's find which rows have nulls
print(data.loc[data.isnull().any(axis=1)])
# this is a handy trick to print every row that contains a null:
# isnull() marks each missing value as True, any(axis=1) collapses that to one
# boolean per row, and .loc keeps only the rows where that boolean is True

       age  gender  chest pain type  blood pressure  cholesterol  \
4968  72.0     NaN              0.0            85.0        160.0   

      max heart rate  exercise angina  plasma glucose  skin_thickness  \
4968           178.0              0.0          143.33            87.0   

      insulin   bmi  diabetes_pedigree  hypertension  heart_disease  \
4968    116.0  22.4           0.467386           0.0            0.0   

     Residence_type   smoking_status  
4968          Rural  formerly smoked  
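
Since the wrapped row output is a little hard to scan, a per-column count of missing values can help confirm where the nulls sit. This is only a sketch on a made-up toy frame; the same two lines should work on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`, with one missing gender
toy = pd.DataFrame({
    "age": [72.0, 55.0, 61.0],
    "gender": [np.nan, 1.0, 0.0],
    "bmi": [22.4, 30.1, 27.8],
})

# Count missing values per column, then keep only the columns that have any
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])
```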


Huh. A single row is missing a gender value. Is it controversial to drop this?
<br>Gender is encoded as 0 or 1, so we can treat it either as a true/false value or simply as an integer flag.
<br>Heart Disease also appears to be a true/false value stored as 0 or 1, so this is how the dataset represents binary values.
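
If we do end up dropping that row, one thing to keep in mind is that `triage` was split off from `data` earlier, so the same row has to leave both. A minimal sketch, assuming toy stand-ins for the two objects (the values and the `A`/`B` labels are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy versions of `data` and `triage` sharing the same index
toy_data = pd.DataFrame({"age": [72.0, 55.0, 61.0], "gender": [np.nan, 1.0, 0.0]})
toy_triage = pd.Series(["A", "B", "A"], name="triage")

# Boolean mask: True for rows that have a gender value
keep = toy_data["gender"].notnull()

# Apply the same mask to both so features and labels stay aligned
toy_data = toy_data[keep]
toy_triage = toy_triage[keep]

print(len(toy_data), len(toy_triage))
print(toy_data.index.equals(toy_triage.index))
```

Using one shared mask, rather than calling `dropna()` on the features alone, is what keeps each label attached to the right patient.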