# **Title of Project**

-------------

Binary Classification

## **Objective**
The objective of binary classification is to categorize data points into one of two distinct classes or categories based on their features. This is commonly used in various applications, such as:

- Spam Detection: Classifying emails as either "spam" or "not spam."
- Sentiment Analysis: Determining whether a piece of text expresses a positive or negative sentiment.
- Medical Diagnosis: Identifying whether a patient has a certain disease or not based on symptoms and test results.
- Image Recognition: Classifying images as containing a specific object (e.g., "cat" vs. "not cat").

Key components of binary classification include:

- Feature Selection: Identifying relevant attributes that help distinguish between the two classes.
- Model Training: Using labeled data to train a model that can learn the patterns associated with each class.
- Prediction: Applying the trained model to new, unseen data to classify it into one of the two categories.
- Evaluation: Assessing the model's performance using metrics such as accuracy, precision, recall, and F1-score.

## **Data Source**


https://github.com/YBIFoundation/Dataset/raw/main/Diabetes.csv

## **Import Library**

In [13]:
import pandas as pd

## **Import Data**

In [14]:
diabetes = pd.read_csv('https://github.com/YBIFoundation/Dataset/raw/main/Diabetes.csv')

## **Describe Data**

In [15]:
diabetes.head()

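|   | pregnancies | glucose | diastolic | triceps | insulin | bmi | dpf | age | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |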


In [16]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [17]:
diabetes.describe()

|   | pregnancies | glucose | diastolic | triceps | insulin | bmi | dpf | age | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [18]:
diabetes.columns

Index(['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi',
       'dpf', 'age', 'diabetes'],
      dtype='object')

## **Data Visualization**

In [19]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')

# Plotting the distribution of the target variable
plt.figure(figsize=(8, 5))
sns.countplot(x='diabetes', data=diabetes)
plt.title('Distribution of Diabetes Outcome')
plt.xlabel('Diabetes Outcome (0: No, 1: Yes)')
plt.ylabel('Count')
plt.show()

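# Pairplot to visualize relationships between features
g = sns.pairplot(diabetes, hue='diabetes', diag_kind='kde')
# pairplot returns a PairGrid; plt.title would label only the last subplot,
# so the title is set on the whole figure instead
g.figure.suptitle('Pairplot of Features by Diabetes Outcome', y=1.02)
plt.show()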

# Correlation heatmap
plt.figure(figsize=(12, 8))
correlation_matrix = diabetes.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation Heatmap of Features')
plt.show()

# Boxplot to see the distribution of a specific feature
plt.figure(figsize=(10, 6))
sns.boxplot(x='diabetes', y='glucose', data=diabetes)
plt.title('Glucose Levels by Diabetes Outcome')
plt.xlabel('Diabetes Outcome')
plt.ylabel('Glucose Level')
plt.show()


## **Data Preprocessing**

In [20]:
from sklearn.preprocessing import StandardScaler  # imported here but not used in the cells below
# Count missing (NaN) values in each column
print(diabetes.isnull().sum())

pregnancies    0
glucose        0
diastolic      0
triceps        0
insulin        0
bmi            0
dpf            0
age            0
diabetes       0
dtype: int64
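
No column contains NaN, but the summary statistics above report a minimum of 0 for glucose, diastolic, triceps, insulin, and bmi, which is not a plausible measurement and may mean that missing readings were stored as zero. The following is a minimal sketch of how such zeros could be counted. It assumes (the source does not say this) that a zero in those columns means "not measured", and it copies the five rows shown by `diabetes.head()` so that it runs without downloading the file.

```python
import pandas as pd

# The five rows shown by diabetes.head(), copied here so the sketch is self-contained
sample = pd.DataFrame({
    'pregnancies': [6, 1, 8, 1, 0],
    'glucose': [148, 85, 183, 89, 137],
    'diastolic': [72, 66, 64, 66, 40],
    'triceps': [35, 29, 0, 23, 35],
    'insulin': [0, 0, 0, 94, 168],
    'bmi': [33.6, 26.6, 23.3, 28.1, 43.1],
    'dpf': [0.627, 0.351, 0.672, 0.167, 2.288],
    'age': [50, 31, 32, 21, 33],
    'diabetes': [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns means "not measured" rather than a real reading
suspect_columns = ['glucose', 'diastolic', 'triceps', 'insulin', 'bmi']

# Count zeros per column; isnull() cannot see them because they are stored as numbers
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

On the full dataset the same expression could be applied to `diabetes[suspect_columns]`.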


## **Define Target Variable (y) and Feature Variables (X)**

In [21]:
y = diabetes['diabetes']
X = diabetes.drop(['diabetes'],axis=1)

## **Train Test Split**

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, random_state=2529)

In [23]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((537, 8), (231, 8), (537,), (231,))
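
The split above is purely random, so the share of positive cases in the training and test sets can drift from the overall rate (about 35% according to `diabetes.describe()`). One option, which this notebook does not use, is the `stratify` argument of `train_test_split`. Below is a minimal sketch on synthetic labels; the arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows with 70 positives (35%, close to the diabetes column)
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.array([1] * 70 + [0] * 130)

# stratify=y_demo keeps the share of positives (nearly) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=2529, stratify=y_demo
)

# Share of positives in the training and test labels
print(y_tr.mean(), y_te.mean())
```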

## **Modeling**

In [24]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=500)

In [25]:
model.fit(X_train,y_train)

## **Model Evaluation**

In [27]:
model.intercept_


array([-8.13045782])

In [28]:
model.coef_

array([[ 1.01246406e-01,  3.60547984e-02, -2.09737931e-02,
        -2.57336457e-03, -2.04620718e-04,  8.24718338e-02,
         9.51045046e-01,  2.53467630e-02]])
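
Together with the intercept, these coefficients define the fitted model: logistic regression computes a linear score z = intercept + sum(coef_i * x_i) and maps it to a probability with the logistic function 1 / (1 + e^(-z)). The sketch below repeats that arithmetic by hand for the first two rows shown by `diabetes.head()`, assuming the coefficients follow the column order of `X` (pregnancies through age). Those rows may have been part of the training set, so this illustrates the calculation only, not held-out performance.

```python
import math

# Values copied from model.intercept_ and model.coef_ above
intercept = -8.13045782
coef = [1.01246406e-01, 3.60547984e-02, -2.09737931e-02, -2.57336457e-03,
        -2.04620718e-04, 8.24718338e-02, 9.51045046e-01, 2.53467630e-02]

# First two rows from diabetes.head():
# pregnancies, glucose, diastolic, triceps, insulin, bmi, dpf, age
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],   # labelled diabetes = 1
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],    # labelled diabetes = 0
]

def predict_probability(features):
    """Return P(diabetes = 1) from the linear score and the logistic function."""
    z = intercept + sum(c * x for c, x in zip(coef, features))
    return 1.0 / (1.0 + math.exp(-z))

probabilities = [predict_probability(r) for r in rows]
# A probability of at least 0.5 is mapped to class 1, matching model.predict
predicted = [int(p >= 0.5) for p in probabilities]
print(probabilities, predicted)
```

Because the features are unscaled, the size of a coefficient depends on the unit of its feature, so coefficient magnitudes should not be compared directly across features.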

## **Prediction**

In [29]:
y_pred = model.predict(X_test)
y_pred


array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1])

In [30]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
confusion_matrix(y_test,y_pred)

array([[133,  12],
       [ 41,  45]])

In [31]:
accuracy_score(y_test,y_pred)

0.7705627705627706

In [32]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.76      0.92      0.83       145
           1       0.79      0.52      0.63        86

    accuracy                           0.77       231
   macro avg       0.78      0.72      0.73       231
weighted avg       0.77      0.77      0.76       231
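
The figures for class 1 in the report can be recomputed from the confusion matrix above. This is a short worked sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix from above: rows = true class, columns = predicted class
tn, fp = 133, 12   # true class 0: correctly rejected, falsely flagged
fn, tp = 41, 45    # true class 1: missed, correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.77 0.79 0.52 0.63
```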



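## **Explanation**

Data Loading: We load the Diabetes dataset from the YBI Foundation repository. The binary target column `diabetes` indicates whether a patient has diabetes (1) or not (0).

Data Splitting: The dataset is split into training and testing sets (70% training, 30% testing), giving 537 training rows and 231 test rows.

Feature Scaling: `StandardScaler` is imported, but no scaling is applied; the model is trained on the raw feature values.

Model Training: A Logistic Regression model is initialized with `max_iter=500` and trained on the training data.

Making Predictions: The model predicts the classes for the test set.

Evaluation: The confusion matrix, accuracy, and classification report are used to evaluate the model's performance. Accuracy on the test set is about 0.77, but recall for the positive class is only 0.52: 41 of the 86 diabetic patients in the test set are predicted as non-diabetic.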
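
Standardization can be combined with the model in a scikit-learn pipeline so that the scaler is fit on the training rows only. The following is a minimal sketch on synthetic data from `make_classification`. The data, the variable names, and the resulting score are illustrative and are not part of this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features: 768 rows, 8 numeric columns
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2529)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=2529
)

# The pipeline fits the scaler on the training rows only, then reuses those
# means and standard deviations when it transforms the test rows
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
scaled_model.fit(X_tr, y_tr)

# Mean accuracy on the held-out synthetic rows
test_accuracy = scaled_model.score(X_te, y_te)
print(test_accuracy)
```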