# Support vector machines

**Name - Mitul Srivastava**

**ID - C00313606**


## **LOG** : Introduction to dataset and importing the data
### **DATASET** : Heart disease dataset
### **LINK** : https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
### **DETAIL** : The dataset has 11 features, including Age, Sex, ChestPainType and RestingBP, plus the target column HeartDisease.
### **AIM** : To train and fine-tune an SVM model that predicts whether a person has heart disease.

In [46]:
import pandas as pd 
data = pd.read_csv("C:/Users/Mitul/Desktop/Study/Algorithms/SVM/heart.csv",encoding='latin-1')
data.head()

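|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |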


In [72]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                918 non-null    int64  
 1   RestingBP          918 non-null    int64  
 2   MaxHR              918 non-null    int64  
 3   Oldpeak            918 non-null    float64
 4   HeartDisease       918 non-null    int64  
 5   Sex_M              918 non-null    bool   
 6   ChestPainType_ATA  918 non-null    bool   
 7   ChestPainType_NAP  918 non-null    bool   
 8   ChestPainType_TA   918 non-null    bool   
 9   ExerciseAngina_Y   918 non-null    bool   
 10  ST_Slope_Flat      918 non-null    bool   
 11  ST_Slope_Up        918 non-null    bool   
dtypes: bool(7), float64(1), int64(4)
memory usage: 42.3 KB


### **LOG** : Dropping columns judged to have little impact on heart disease prediction.

In [47]:
data = data.drop(columns=["Cholesterol", "FastingBS", "RestingECG"])

### **LOG** : Performing one-hot encoding to convert the categorical columns into numerical features.
### *One-hot encoding converts a categorical variable into binary columns, one per category. Each row gets a 1 (True) in the column for its own category and a 0 (False) elsewhere. With `drop_first=True`, the first category of each variable is dropped and acts as the implicit baseline.*

In [49]:
data = pd.get_dummies(data, columns=["Sex", "ChestPainType", "ExerciseAngina", "ST_Slope"], drop_first=True)
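
As a small illustration of the encoding step, the sketch below applies `pd.get_dummies` with `drop_first=True` to a toy frame. The rows are hypothetical and only the column names mirror the dataset; it is meant to show which columns appear and which category becomes the baseline, not to reproduce the notebook's data.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Down"],
})

# drop_first=True drops the first category (in sorted order) of each column:
# "F" for Sex and "Down" for ST_Slope. A row with all dummy columns equal to
# False therefore belongs to the dropped baseline category.
encoded = pd.get_dummies(toy, columns=["Sex", "ST_Slope"], drop_first=True)

print(list(encoded.columns))  # ['Sex_M', 'ST_Slope_Flat', 'ST_Slope_Up']
print(encoded)
```

This matches the `Sex_M`, `ST_Slope_Flat` and `ST_Slope_Up` columns that appear after encoding: there is no `Sex_F` or `ST_Slope_Down` column because those categories are the baselines.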

### **LOG** : Splitting the dataset into X and y, with X holding the independent features and y the target variable.

In [50]:
X = data.drop(columns=["HeartDisease"])
y = data["HeartDisease"] 

### **LOG** : Using SelectKBest to find the top 5 features to train the model.
### *SelectKBest is a feature selection method that scores each feature with a statistical test (here `f_classif`, the ANOVA F-test) and keeps only the k highest-scoring features.*

In [91]:
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("Selected Features:", selected_features)

Selected Features: Index(['Oldpeak', 'ChestPainType_ATA', 'ExerciseAngina_Y', 'ST_Slope_Flat',
       'ST_Slope_Up'],
      dtype='object')
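
To see how SelectKBest ranks features, it can help to inspect the `scores_` attribute of the fitted selector. The sketch below uses synthetic data (the column names `informative`, `noise_a` and `noise_b` are hypothetical) rather than the heart disease dataset, so the scores it prints say nothing about the real features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature that tracks the label, two pure-noise features.
rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
X_demo = pd.DataFrame({
    "informative": y_demo + rng.normal(0, 0.5, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

# Fit the selector and list the ANOVA F-score of every feature.
selector_demo = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)
scores = pd.Series(selector_demo.scores_, index=X_demo.columns)
scores = scores.sort_values(ascending=False)
print(scores)

# get_support() marks the k features with the highest scores.
kept = list(X_demo.columns[selector_demo.get_support()])
print(kept)
```

The same pattern, applied to the notebook's `selector` and `X.columns`, would show how close the sixth-ranked feature is to the five that were kept.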


### **LOG** : Standardising the data, since SVM is sensitive to features on different scales.
### *StandardScaler is a preprocessing technique in scikit-learn that standardizes numerical features by removing the mean and scaling to unit variance. It transforms data to have a mean of 0 and a standard deviation of 1, helping machine learning models perform better by ensuring features are on a similar scale.*

In [92]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_new)
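
A quick way to check what StandardScaler does is to apply it to a tiny array and inspect the result. The values below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
X_toy = np.array([[120.0, 0.0],
                  [140.0, 1.0],
                  [160.0, 2.0]])

scaler_demo = StandardScaler()
X_toy_scaled = scaler_demo.fit_transform(X_toy)

# After scaling, each column has mean 0 and standard deviation 1.
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))

# The fitted scaler stores the per-column mean and scale, so new rows are
# transformed with the same statistics; the column means map to zero.
print(scaler_demo.transform([[140.0, 1.0]]))
```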

### **LOG** : Training the SVM model with the RBF (Radial Basis Function) kernel and the C parameter set to 1, using an 80/20 train/test split.

In [93]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
model = SVC(kernel="rbf", C=1, gamma="scale")
model.fit(X_train, y_train)
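
The `gamma="scale"` setting used above is documented by scikit-learn as `1 / (n_features * X.var())`. The sketch below checks this on synthetic data from `make_classification` (an assumption made for illustration; it is not the heart disease data) by comparing an SVC trained with `gamma="scale"` against one given the value computed by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data with 5 features, like the 5 selected features.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# gamma="scale" corresponds to 1 / (n_features * variance of all entries of X).
gamma_manual = 1.0 / (X_syn.shape[1] * X_syn.var())

svc_scale = SVC(kernel="rbf", C=1, gamma="scale").fit(X_syn, y_syn)
svc_manual = SVC(kernel="rbf", C=1, gamma=gamma_manual).fit(X_syn, y_syn)

# Both models should produce the same decision values.
same = np.allclose(svc_scale.decision_function(X_syn),
                   svc_manual.decision_function(X_syn))
print(gamma_manual, same)
```

Because the notebook trains on standardised features, whose variance is close to 1, `gamma="scale"` there works out to roughly `1 / n_features`.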

### **LOG** : Evaluation report analysis
### Precision - Of the cases predicted as a given class, the fraction that actually belong to that class.
### Recall - Of the cases that actually belong to a given class, the fraction that were correctly identified.
### F1 score - Harmonic mean of Precision and Recall.

### **Results :**
### The SVM model has an overall accuracy of 81 percent on the 184 test cases.
### The model correctly identifies 63 cases without heart disease and 86 cases with heart disease. It wrongly flags 14 cases without heart disease as heart disease (false positives) and misses 21 heart disease cases (false negatives).

In [94]:
from sklearn.metrics import classification_report, confusion_matrix

# Prediction
y_pred = model.predict(X_test)

# Evaluation metrics
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.82      0.78        77
           1       0.86      0.80      0.83       107

    accuracy                           0.81       184
   macro avg       0.80      0.81      0.81       184
weighted avg       0.81      0.81      0.81       184

[[63 14]
 [21 86]]
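
The precision, recall, F1 and accuracy figures for class 1 can be recomputed directly from the confusion matrix printed above. The sketch below does that arithmetic, assuming scikit-learn's layout in which rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the RBF model: rows = actual, columns = predicted.
cm = np.array([[63, 14],
               [21, 86]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)          # 86 / (86 + 14)
recall = tp / (tp + fn)             # 86 / (86 + 21)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()     # (86 + 63) / 184

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.86 recall=0.80 f1=0.83 accuracy=0.81
```

These values agree with the class 1 row and the accuracy line of the classification report.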


### **LOG** : Creating a separate SVM model with the sigmoid kernel and the C parameter set to 10.

In [95]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
model2 = SVC(kernel="sigmoid", C=10, gamma="scale")
model2.fit(X_train, y_train)

### **LOG** : The above changes raise the accuracy by about 1 percentage point, from 81% to 82% (150 correct predictions out of 184, compared with 149).

In [96]:
from sklearn.metrics import classification_report, confusion_matrix

# Prediction
y_pred2 = model2.predict(X_test)

# Evaluation metrics
print(classification_report(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))

              precision    recall  f1-score   support

           0       0.78      0.78      0.78        77
           1       0.84      0.84      0.84       107

    accuracy                           0.82       184
   macro avg       0.81      0.81      0.81       184
weighted avg       0.82      0.82      0.82       184

[[60 17]
 [17 90]]
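
Rather than trying kernel and C combinations one at a time, the search can be automated with cross-validation. The following is a minimal sketch, assuming synthetic data from `make_classification` in place of the heart disease features and a small hypothetical parameter grid; it is one possible approach, not the procedure used in this notebook. Placing the scaler inside a `Pipeline` means it is refitted on each training fold, so the held-out fold does not influence the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; replace with the real features and target.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Scaling and the classifier are chained so both are fitted per CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(gamma="scale")),
])

# Hypothetical grid covering the two kernels and a few C values.
param_grid = {
    "svc__kernel": ["rbf", "sigmoid"],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the best model on held-out data
```

With a difference of a single test case between the two models above, a cross-validated comparison of this kind gives a steadier basis for choosing between kernels than one train/test split.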


### **REFERENCES** :
### https://chatgpt.com/
### https://www.kaggle.com/
### https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.07-Support-Vector-Machines.ipynb

## **END**