# Diabetes Prediction

#### SVM (Support Vector Machine)

- A support vector machine (SVM) is a supervised machine learning algorithm that can be used for classification and regression tasks. The idea behind SVM is to find the boundary (hyperplane) that best separates the classes of data, meaning the one that leaves the widest gap to the nearest points of each class.




## Workflow

Diabetes data -> Data preprocessing -> Train/test split -> SVM model -> Trained SVM model -> New data -> Diabetic or Non-Diabetic (prediction)

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split   #### splits the data into training and test sets
from sklearn import svm                                #### SVM models (svm.SVC is used below)
from sklearn.metrics import accuracy_score             #### fraction of correct predictions

In [2]:
data=pd.read_csv('diabetes.csv')

In [3]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
data.shape

(768, 9)

In [5]:
data.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [6]:
# 0 represents Non-Diabetic
# 1 represents Diabetic

data['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [7]:
# mean of every feature, grouped by Outcome (Non-Diabetic = 0, Diabetic = 1)

data.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [8]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [9]:
x=data.iloc[:,:-1] #### x stores all the independent variables (the 8 feature columns)

# ------------------------------------------------or------------------------------------------

# x=data.drop('Outcome',axis=1) or x=data.drop(columns='Outcome')


y=data.iloc[:,-1] #### y stores the dependent variable (the Outcome column)

# ------------------------------------------------or------------------------------------------

#y=data['Outcome']

In [10]:
x

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


In [11]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

## Data Standardization
- We do this because the feature columns in the dataframe above are on very different scales.
- For example, Pregnancies ranges from 0 to 17 while Insulin ranges from 0 to 846. An SVM is sensitive to feature scale, so we use "StandardScaler" to rescale every column to zero mean and unit variance.
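As a minimal sketch of what "StandardScaler" computes, the block below standardizes a small made-up array and compares the result with the by-hand formula (value minus column mean, divided by column standard deviation). The array `toy` and its values are invented for illustration and are not taken from the diabetes data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two columns on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 150.0],
                [3.0, 200.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Same computation by hand: subtract the column mean and divide by the
# column standard deviation (population form, ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

After scaling, each column has mean 0 and standard deviation 1, so no single feature dominates just because of its units.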


In [12]:
standard_scaler=StandardScaler()

In [13]:
standard_scaler.fit(x)

In [14]:
standardized_data=standard_scaler.transform(x)

In [15]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [16]:
x=standardized_data #### x now holds the standardized features
y=data['Outcome'] #### y holds the Outcome labels

In [17]:
print(x,y)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]] 0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


## Train and Test

In [18]:
# stratify=y
# When we split the data into training and testing sets,
# we want both sets to have a similar distribution of target classes.
# This is where stratification comes in: the split is made so that each subset has
# approximately the same percentage of samples of each target class as the complete dataset.

# For example, in a binary classification problem with 90% of samples in class 0 and 10% in class 1,
# stratification ensures that both the training and testing sets have approximately 90% class 0 and 10% class 1.


X_train,X_test,Y_train,Y_test=train_test_split(x,y,test_size=0.2,random_state=1,stratify=y)

In [19]:
print(x.shape,X_train.shape,X_test.shape)

(768, 8) (614, 8) (154, 8)
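The effect of `stratify` can be checked directly. The sketch below builds synthetic labels with the same class counts as the Outcome column (500 zeros, 268 ones) and a placeholder feature column, then compares the share of class 1 in each subset. The names `labels` and `features` are illustrative stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as Outcome: 500 zeros, 268 ones
labels = np.array([0] * 500 + [1] * 268)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=1, stratify=labels
)

# Share of class 1 in the full data, the training set, and the test set
print(labels.mean(), y_tr.mean(), y_te.mean())
```

All three proportions should come out close to 0.35; without `stratify` the test-set share could drift further from the full-data share purely by chance.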


## Training the Model
- In machine learning, Support Vector Machines (SVM) are a set of supervised learning methods used for classification, regression and outlier detection. The standard SVM algorithm is designed for binary classification problems. Multi-class problems are handled by combining several binary classifiers, for example with a one-vs-one or one-vs-rest scheme.

- Support Vector Classification (SVC) is the SVM variant used for classification tasks; in scikit-learn it is available as `svm.SVC`. It works by finding the hyperplane that maximizes the margin between the two classes. The margin is defined as the distance between the hyperplane and the closest data points from each class, which are called the support vectors.
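To make the margin concrete, here is a small sketch on six made-up, linearly separable points. For a linear SVC with weight vector `w`, the distance from the hyperplane to the closest points is `1 / ||w||`. The points, the variable names, and the very large `C` (used to approximate a hard margin) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import svm

# Two small, linearly separable clusters (made-up points)
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
cls = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = svm.SVC(kernel='linear', C=1e6)
clf.fit(pts, cls)

w = clf.coef_[0]                     # normal vector of the separating hyperplane
margin = 1.0 / np.linalg.norm(w)     # distance from hyperplane to closest points

print(clf.support_vectors_)          # the points that define the margin
print(margin)
```

Only the support vectors determine the hyperplane; moving any of the other points (without crossing the margin) leaves the fitted boundary unchanged.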

#### Kernel

- In Support Vector Machines (SVM), a kernel is a function that measures the similarity between two samples. It lets the SVM behave as if the data had been transformed into another feature space without computing that transformation explicitly. The most commonly used kernels are linear, polynomial, radial basis function (RBF), and sigmoid.

- Non-linear kernels implicitly map the original data into a higher-dimensional space where it can be separated more easily by a hyperplane, while the linear kernel works in the original feature space. The choice of kernel function depends on the problem at hand and the nature of the data.
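The difference between kernels is easiest to see on data that no straight line can separate. The sketch below uses scikit-learn's synthetic `make_circles` dataset (one class forms a ring around the other), which is not related to the diabetes data, and compares a linear kernel with an RBF kernel. The dataset parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn import svm

# Two concentric rings: not separable by a straight line in 2D
pts, cls = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
p_train, p_test, c_train, c_test = train_test_split(
    pts, cls, test_size=0.25, random_state=0, stratify=cls
)

scores = {}
for kernel in ('linear', 'rbf'):
    model = svm.SVC(kernel=kernel)
    model.fit(p_train, c_train)
    scores[kernel] = model.score(p_test, c_test)  # test accuracy per kernel

print(scores)
```

On this kind of data the RBF kernel should score far higher than the linear one. On the diabetes features the picture may differ, so it is worth comparing kernels on the actual data rather than assuming one is best.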

In [20]:
classifier=svm.SVC(kernel='linear')

In [21]:
classifier.fit(X_train,Y_train)

### Model Evaluation

In [22]:
#### Finding the accuracy on the training data

X_train_prediction=classifier.predict(X_train)
X_train_accuracy=accuracy_score(Y_train,X_train_prediction) #### argument order is (y_true, y_pred)
print(X_train_accuracy)

0.7833876221498371


In [23]:
#### Finding the accuracy on the testing data

X_test_prediction=classifier.predict(X_test)
X_test_accuracy=accuracy_score(Y_test,X_test_prediction) #### argument order is (y_true, y_pred)
print(X_test_accuracy)

0.7792207792207793
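As a quick illustration of what `accuracy_score` reports, the sketch below computes accuracy by hand on five made-up labels and compares it with the library call. The arrays `y_true` and `y_pred` are invented for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 5 predictions match the true values
y_true = np.array([1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])

by_hand = np.mean(y_true == y_pred)          # fraction of matching entries
from_sklearn = accuracy_score(y_true, y_pred)

print(by_hand, from_sklearn)  # 0.8 0.8
```

For context, a model that always predicted class 0 would score 500/768, roughly 0.65, on the full dataset, so the accuracies above are best read against that baseline.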


### Making a Prediction System

In [30]:
input_data=(11,143,94,33,146,36.6,0.254,51)

# changing the input_data to a numpy array
input_data_as_numpy_array=np.asarray(input_data)

# reshape the array because we are predicting for a single instance:
# the scaler and the model expect a 2D array of shape (n_samples, n_features),
# and passing a 1D array of 8 values would raise an error

input_data_reshaped=input_data_as_numpy_array.reshape(1,-1) # (1,-1) means 1 row, with the number of columns (8) inferred

# after reshaping, standardize the data with the StandardScaler fitted earlier
std_data=standard_scaler.transform(input_data_reshaped)
print(std_data)

[[ 2.12477957  0.69183807  1.28699125  0.7818138   0.57481223  0.58477051
  -0.65801229  1.51108316]]
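The reshape step can be seen in isolation. This short sketch uses the same eight input values and prints the array shape before and after `reshape(1, -1)`; the names `sample` and `row` are illustrative.

```python
import numpy as np

sample = np.asarray((11, 143, 94, 33, 146, 36.6, 0.254, 51))
print(sample.shape)  # (8,)   a 1D array of 8 values

row = sample.reshape(1, -1)
print(row.shape)     # (1, 8) one sample with 8 features
```

scikit-learn estimators and transformers expect the second form, with one row per sample and one column per feature.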




In [31]:
# let's make the prediction of the input data
prediction_input_data=classifier.predict(std_data)
print(prediction_input_data)

[1]


In [32]:
# 0 means Non-Diabetic and 1 means Diabetic; the model predicted 1 for this input

if prediction_input_data[0]==0:
    print("The Patient is Non-Diabetic")
else:
    print("The Patient is Diabetic")

The Patient is Diabetic
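The scaling and prediction steps can also be bundled so that new inputs always pass through the same preprocessing. The following is a sketch under stated assumptions, not the notebook's method: it uses scikit-learn's `make_pipeline`, synthetic data from `make_classification` as a stand-in for the 8 feature columns, and a hypothetical helper `predict_label`. Unlike the cells above, the pipeline fits the scaler on the training rows only, which keeps test-set statistics out of the preprocessing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import svm

# Synthetic stand-in for the 8 feature columns; swap in the real x and y
features, labels = make_classification(n_samples=768, n_features=8, random_state=1)
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=1, stratify=labels
)

# The pipeline scales and classifies in one object; fit() learns the
# scaling parameters from the training rows only
model = make_pipeline(StandardScaler(), svm.SVC(kernel='linear'))
model.fit(f_train, l_train)

def predict_label(sample):
    """Hypothetical helper: readable label for one sample of 8 values."""
    row = np.asarray(sample, dtype=float).reshape(1, -1)
    return "Diabetic" if model.predict(row)[0] == 1 else "Non-Diabetic"

print(model.score(f_test, l_test))   # test accuracy on the synthetic data
print(predict_label(f_test[0]))
```

With the real data, `predict_label((11,143,94,33,146,36.6,0.254,51))` would take the raw measurements directly, since the pipeline applies the scaling itself.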
