In [8]:
!pip install scikit-learn



1. Diabetes Data

In [9]:
import numpy as np #Used to make numpy arrays
import pandas as pd #Used to create data frames
from sklearn.preprocessing import StandardScaler #Will be used to standardize the data to a common range 
from sklearn.model_selection import train_test_split #Split the data into training and testing data
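from sklearn import svm #Provides the support vector machine classifier (SVC)
from sklearn.metrics import accuracy_score #Used to measure the fraction of predictions that are correct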

In [10]:
df = pd.read_csv('diabetes.csv')
# This loads the diabetes dataset to a pandas DataFrame

In [11]:
df.head() #Prints the first 5 rows of the dataset

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


The Outcome column holds the label: 1 means the patient is diabetic and 0 means the patient is non-diabetic.

We need to develop a system that classifies each patient record as either 1 or 0.

In [12]:
df.shape #Gives the number of rows and columns: one row per person, and one column per attribute plus the Outcome label

(768, 9)

In [13]:
df.describe() # Gives the statistical measures of the data

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [15]:
df['Outcome'].value_counts() #Counts how many rows there are for each distinct value in the Outcome column.

Outcome
0    500
1    268
Name: count, dtype: int64
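
These counts also give a reference point for judging accuracy later on: a model that always predicts the most common class would already be right for that class's share of the rows. The sketch below rebuilds a label column from the printed counts instead of reading diabetes.csv; the names outcome, proportions and majority_baseline are illustrative.

```python
import pandas as pd

# Rebuild a label column with the counts printed above (500 zeros, 268 ones).
# It stands in for df['Outcome'] so the snippet runs without diabetes.csv.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# normalize=True returns proportions instead of raw counts.
proportions = outcome.value_counts(normalize=True)
print(proportions)

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = proportions.max()
print(round(majority_baseline, 3))
```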

In [165]:
df.groupby('Outcome').mean() #Groups the rows by their Outcome value and computes the mean of every other attribute within each group.

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


From this data, we can see that, on average, non-diabetic patients have lower glucose (about 110 versus 141) and are younger (about 31 versus 37 years) than diabetic patients. Differences like these between the two groups are what the machine learning algorithm relies on when deciding whether a patient is diabetic or non-diabetic.
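
The gap between the two groups can be put into numbers. The sketch below re-enters the two rows of class means printed above, so it runs without diabetes.csv, and computes the gap for every attribute. The names class_means, gap and relative_gap are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Class means copied from the groupby output above (row 0: Outcome 0, row 1: Outcome 1).
class_means = pd.DataFrame(
    {
        "Pregnancies": [3.298, 4.865672],
        "Glucose": [109.98, 141.257463],
        "BloodPressure": [68.184, 70.824627],
        "SkinThickness": [19.664, 22.164179],
        "Insulin": [68.792, 100.335821],
        "BMI": [30.3042, 35.142537],
        "DiabetesPedigreeFunction": [0.429734, 0.5505],
        "Age": [31.19, 37.067164],
    },
    index=pd.Index([0, 1], name="Outcome"),
)

# Absolute gap between the diabetic (1) and non-diabetic (0) class means.
gap = class_means.loc[1] - class_means.loc[0]

# Gap relative to the non-diabetic mean, so attributes on different scales can be compared.
relative_gap = gap / class_means.loc[0]

print(gap)
print(relative_gap.sort_values(ascending=False))
```

Every gap is positive, meaning each attribute's mean is higher in the diabetic group. A difference in means does not say how well a single attribute separates individual patients, which is why a classifier is trained on all attributes together.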

In [167]:
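X = df.drop(columns='Outcome') #Drops the Outcome column so that X holds only the features. columns= already says we are dropping a column, so axis is not needed (axis = 0 for rows and axis = 1 for columns only matters when labels are passed without columns=)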
Y = df['Outcome']

In [168]:
print(X.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
2                     0.672   32  
3                     0.167   21  
4                     2.288   33  


In [169]:
print(Y.head())

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64


2. Data Pre-processing - Standardize the data 

We need to standardize the data because each column has a different range (for example, Insulin goes up to 846 while DiabetesPedigreeFunction stays below 3).

In [170]:
scaler = StandardScaler()
scaler.fit(X)
standard = scaler.transform(X)
X = standard
#This code initializes a StandardScaler, fits it to X (fit only computes each column's mean and standard deviation), transforms X with those parameters (transform applies them to standardize each feature), and then replaces X with the standardized data.

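Now every column has a mean of 0 and a standard deviation of 1. The values are not confined to the range 0 to 1: they can be negative or greater than 1, as the printed array below shows. We have therefore standardized the data successfully.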
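
What StandardScaler computes can be checked by hand on a small column. The sketch below uses the five Glucose values shown by df.head() as a toy input and compares the scaler's output with the formula z = (x - mean) / std. The names toy_scaler, z and manual are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five Glucose values from df.head(), used here as a toy single-feature column.
x = np.array([[148.0], [85.0], [183.0], [89.0], [137.0]])

toy_scaler = StandardScaler()
z = toy_scaler.fit_transform(x)  # fit and transform in one call

# StandardScaler applies z = (x - mean) / std, with the population std (ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(z.ravel())
print(np.allclose(z, manual))      # the two computations agree
print(z.mean(), z.std())           # approximately 0 and 1
print(z.min() < 0, z.max() > 1)    # values fall outside the interval [0, 1]
```

The statistics here come from only five values, so the numbers differ from the standardized Glucose column in the notebook, where the mean and standard deviation are computed over all 768 rows.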

It's important to note that you should fit the scaler only on the training data and not on the full dataset, to prevent information leakage from the test data. After fitting on the training data, you use the same scaler (with statistics computed from the training set) to transform the test data. This mimics the real-world scenario where the exact statistics of new data are not known in advance, so the test score is a fairer estimate of how the model generalizes to new, unseen data. Note that the cell above fits the scaler on the full dataset before splitting, so the accuracy scores reported in this notebook may be slightly optimistic.
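
A minimal sketch of that order of operations, split first and fit the scaler on the training rows only. It uses randomly generated stand-in data rather than the diabetes dataset, and every name in it (X_demo, y_demo, demo_scaler and so on) is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and labels (not the diabetes data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows play no part in computing the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # mean and std come from the training rows only
X_te_scaled = demo_scaler.transform(X_te)      # the same statistics are reused on the test rows

# The stored means are the training means, not the means of the full dataset.
print(np.allclose(demo_scaler.mean_, X_tr.mean(axis=0)))

# Training columns have mean 0 after scaling; test columns are only close to 0.
print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```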

In [171]:
print(X)
print(Y)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


So now we have all the data in X and all the labels in Y. Now all we have to do is split the data into training and testing data.

3. Train Test split

In [172]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, stratify=Y, random_state=2) #test_size = 0.2 means 20% of the dataset is held out to test the model, so the remaining 80% is used to train it.

This line of code splits the dataset into training and testing sets. X_train and Y_train are the features and labels for training; X_test and Y_test are the features and labels for testing. 20% of the data is used for testing, stratification keeps the proportion of classes in Y the same in both sets, and the split is reproducible because random_state is fixed at 2. A different value, such as 1, would give a different split. Fixing the value lets anyone rerun the code and get the same split.

When you set stratify=Y, it means that the data is split in a way that preserves the same proportions of the outcome variable Y in the training and test subsets as are present in the full dataset.

Without stratify, the split is purely random, so the proportion of 1s and 0s in the training set could differ from the proportion in the test set just by chance. Stratify keeps the proportion of 1s and 0s the same in Y_train and Y_test.
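
Both effects, stratification and a fixed random_state, can be seen on a label array with the same class counts as the Outcome column. This is a sketch on placeholder data: the features are just row ids, and y_demo, X_demo and the helper split are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the Outcome column (500 zeros, 268 ones).
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features: row ids

def split(random_state, stratify):
    """Return X_train, X_test, y_train, y_test for a 20% test split."""
    return train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=stratify, random_state=random_state
    )

# With stratify, the share of 1s is nearly identical in every subset.
_, _, y_tr_s, y_te_s = split(random_state=2, stratify=y_demo)
print(y_demo.mean(), y_tr_s.mean(), y_te_s.mean())

# Without stratify, the share of 1s in the test set drifts from seed to seed.
for seed in range(5):
    _, _, _, y_te_plain = split(random_state=seed, stratify=None)
    print(seed, round(y_te_plain.mean(), 3))

# The same random_state reproduces the split; a different one selects other rows.
X_te_a = split(2, y_demo)[1]
X_te_b = split(2, y_demo)[1]
X_te_c = split(1, y_demo)[1]
print(np.array_equal(X_te_a, X_te_b), np.array_equal(X_te_a, X_te_c))
```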

In [152]:
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (614, 8) (154, 8)


4. Training the Model

In [173]:
classifier = svm.SVC(kernel = 'linear') #This line of code creates an SVM classifier (SVC) using a linear kernel, which is suitable for finding a linear decision boundary between different classes in a dataset.
classifier.fit(X_train, Y_train)
#The model is trained and is stored in the variable 'classifier'
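
To make "linear decision boundary" concrete: with a linear kernel, the fitted SVC boils down to one weight per feature plus an intercept, and the sign of the weighted sum decides the class. The sketch below checks this on synthetic data from make_classification, not on the diabetes data; clf, w, b and scores are illustrative names.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic two-class data with 8 features, standing in for the standardized X_train.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

clf = svm.SVC(kernel="linear")
clf.fit(X_demo, y_demo)

# For a linear kernel the fitted model exposes a weight vector and an intercept.
w = clf.coef_.ravel()   # one weight per feature
b = clf.intercept_[0]

# The decision value of a row x is w . x + b; a positive value selects classes_[1].
scores = X_demo @ w + b
manual_pred = np.where(scores > 0, clf.classes_[1], clf.classes_[0])

print(np.allclose(scores, clf.decision_function(X_demo)))
print(np.array_equal(manual_pred, clf.predict(X_demo)))
```

The coef_ attribute is only available for the linear kernel, which is one reason a linear SVC is easier to inspect than one with a non-linear kernel.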

5. Model Evaluation - Accuracy Score

In [174]:
#Accuracy score on the training data 
X_train_prediction = classifier.predict(X_train) #Predicts a label for every row of X_train. Ideally these predictions match Y_train.
X_train_accuracy = accuracy_score(Y_train, X_train_prediction) #Compares the actual labels (first argument) with the predicted labels (second argument).
print('Accuracy score of the training data', X_train_accuracy)

Accuracy score of the training data 0.7866449511400652


An accuracy score above 75% is a reasonable result here: always predicting the majority class (0) would already score 500/768, which is about 65%. Because the dataset is small (768 rows), a lower accuracy score is more likely than it would be with more data.

We also need the accuracy score on the test data, because the model has already seen the training data. The test set shows how the model performs on data it has not seen before.

In [175]:
#Accuracy score on the test data 
X_test_prediction = classifier.predict(X_test) #Predicts a label for every row of X_test. Ideally these predictions match Y_test.
X_test_accuracy = accuracy_score(Y_test, X_test_prediction) #Compares the actual labels (first argument) with the predicted labels (second argument).
print('Accuracy score of the testing data (Unknown data)', X_test_accuracy)

Accuracy score of the testing data (Unknown data) 0.7727272727272727
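
Accuracy is simply the number of correct predictions divided by the number of predictions. The sketch below checks this on a few made-up labels, then reads the two printed scores as fractions of the 614 training rows and 154 test rows. The counts 483 and 119 are not printed by the notebook; they are inferred from the scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels, made up for illustration (not taken from the dataset).
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 1])

# accuracy_score(y_true, y_pred) is the fraction of positions where the two agree.
acc = accuracy_score(y_true, y_pred)
manual = (y_true == y_pred).mean()
print(acc, manual)  # 6 of the 8 toy predictions match

# The printed scores are consistent with 483 of 614 training rows
# and 119 of 154 test rows being classified correctly.
train_fraction = 483 / 614
test_fraction = 119 / 154
print(train_fraction, test_fraction)
```

The training and test scores are close to each other (about 0.79 and 0.77), which suggests the model is not heavily overfitting the training data.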


6. Making a Predictive System

In [176]:
input_data = (4,110,92,0,0,37.6,0.191,30) #One person's 8 attribute values, in the same column order as X. Named input_data so that it does not shadow Python's built-in input()

input_to_numpy_array = np.asarray(input_data) #Converts the input tuple to a numpy array so that it can be reshaped.

input_data_reshaped = input_to_numpy_array.reshape(1,-1) #The model expects a 2D array of shape (number of samples, number of features). reshape(1,-1) turns the 8 values into 1 row with 8 columns, since we are predicting for one instance and not for many rows as we did above.

standardized_input = scaler.transform(input_data_reshaped) #We need to now standardize the input data since the model was trained to give predictions on standardized data only.

prediction = classifier.predict(standardized_input) #We now use the model to predict

print(prediction)

[0]
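
The steps of this cell can be collected into one reusable function. The sketch below is one possible way to do that, not part of the original notebook: predict_one is a hypothetical helper, and it is exercised on synthetic data so that it runs without diabetes.csv.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

def predict_one(fitted_scaler, fitted_classifier, features):
    """Hypothetical helper: standardize one record and return its predicted label."""
    row = np.asarray(features, dtype=float).reshape(1, -1)  # shape (1, n_features)
    row_scaled = fitted_scaler.transform(row)               # same scaling as in training
    return int(fitted_classifier.predict(row_scaled)[0])

# Synthetic stand-in data with 8 features (not the diabetes data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
demo_scaler = StandardScaler().fit(X_demo)
demo_clf = svm.SVC(kernel="linear").fit(demo_scaler.transform(X_demo), y_demo)

# Predict the label of a single record given as a plain sequence of 8 values.
label = predict_one(demo_scaler, demo_clf, X_demo[0])
print(label)

# The reshape step on its own: 8 values become 1 row with 8 columns.
one_row = np.asarray((4, 110, 92, 0, 0, 37.6, 0.191, 30)).reshape(1, -1)
print(one_row.shape)
```

With the notebook's own objects, the equivalent call would be predict_one(scaler, classifier, (4,110,92,0,0,37.6,0.191,30)). The printed [0] above then means the model classifies this person as non-diabetic.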


