## About the Dataset
    The dataset consists of several medical predictor (independent) variables and one target (dependent) variable,
    Outcome. The independent variables include the number of pregnancies the patient has had, their BMI, insulin level,
    age, and so on.

### Columns
    Pregnancies: Number of times pregnant
    Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    BloodPressure: Diastolic blood pressure (mm Hg)
    SkinThickness: Triceps skin fold thickness (mm)
    Insulin: 2-hour serum insulin (mu U/ml)
    BMI: Body mass index (weight in kg / (height in m)^2)
    DiabetesPedigreeFunction: A score summarizing the history of diabetes mellitus in relatives and the genetic
                              relationship of those relatives to the patient
    Age: Age (years)
    Outcome: Class variable (0 or 1); 268 of 768 are 1, the others are 0
    
### Source: 
     Online medical records stored in a CSV file

### Task - Data Cleaning and Preprocessing:
       
     Clean the data by handling missing values, duplicates, and outliers.
     Standardize data formats and units to ensure consistency.
     Perform data transformation, such as log scaling or normalization, for improved model performance.
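
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cleaning code: the helper `clean`, the `ZERO_AS_MISSING` list, and the choice of median imputation are all assumptions, and the demo frame is made up rather than taken from `diabetes.csv`.

```python
import numpy as np
import pandas as pd

# Columns where a value of 0 is assumed to mean "not recorded" (assumption of this sketch).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: drop duplicate rows, mark zeros as NaN, impute medians."""
    out = df.drop_duplicates().reset_index(drop=True)
    cols = [c for c in ZERO_AS_MISSING if c in out.columns]
    # Cast to float so NaN can be stored, then blank out the placeholder zeros.
    out[cols] = out[cols].astype(float).mask(out[cols] == 0)
    # Fill each column with its own median, computed from the remaining values.
    out[cols] = out[cols].fillna(out[cols].median())
    return out


# Small hypothetical frame: one duplicated row and one row with zero placeholders.
demo = pd.DataFrame(
    {
        "Glucose": [148, 85, 0, 85],
        "BMI": [33.6, 26.6, 0.0, 26.6],
        "Outcome": [1, 0, 1, 0],
    }
)
cleaned = clean(demo)
print(cleaned)
```

Median imputation is only one option; dropping the affected rows or imputing per class are equally reasonable, depending on how many values are affected.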

# 1. Importing Libraries 

In [3]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Fix the NumPy seed so that random operations are reproducible
np.random.seed(42)

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay is its replacement
# from sklearn.metrics import RocCurveDisplay

# Suppress warnings to keep the notebook output readable
from warnings import filterwarnings
filterwarnings("ignore")

# 2. Loading DataSet

In [4]:
#Load the dataset
data = pd.read_csv("diabetes.csv")
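
A slightly more defensive way to load the file is to check that the expected columns are present before continuing. The sketch below is illustrative only: `load_checked` and `EXPECTED_COLUMNS` are hypothetical names, and an in-memory string stands in for `diabetes.csv`, using the first two rows displayed in the Data Exploration section.

```python
import io

import pandas as pd

# Column names as listed in the "Columns" section of this notebook.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def load_checked(source) -> pd.DataFrame:
    """Read the CSV and fail early if any expected column is absent."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df[EXPECTED_COLUMNS]


# In-memory stand-in for diabetes.csv (two rows only, for illustration).
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
sample = load_checked(sample_csv)
print(sample.dtypes)
```

With the real file, the call would presumably be `load_checked("diabetes.csv")`.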

# 3. Data Exploration 

In [6]:
data.head(20)

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [7]:
data.tail(20)

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
747,1,81,74,41,57,46.3,1.096,32,0
748,3,187,70,22,200,36.4,0.408,36,1
749,6,162,62,0,0,24.3,0.178,50,1
750,4,136,70,0,0,31.2,1.182,22,1
751,1,121,78,39,74,39.0,0.261,28,0
752,3,108,62,24,0,26.0,0.223,25,0
753,0,181,88,44,510,43.3,0.222,26,1
754,8,154,78,32,0,32.4,0.443,45,1
755,1,128,88,39,110,36.5,1.057,37,1
756,7,137,90,41,0,32.0,0.391,39,0


In [8]:
data.shape

(767, 9)

In [9]:
data.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,767.0,767.0,767.0,767.0,767.0,767.0,767.0,767.0,767.0
mean,3.848761,120.9309,69.104302,20.522816,79.90352,31.994654,0.472081,33.254237,0.349413
std,3.370207,31.977581,19.36841,15.958143,115.283105,7.889095,0.331496,11.762079,0.477096
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.2435,24.0,0.0
50%,3.0,117.0,72.0,23.0,32.0,32.0,0.374,29.0,0.0
75%,6.0,140.5,80.0,32.0,127.5,36.6,0.6265,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
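
The summary above shows Insulin with a mean (about 79.9) well above its median (32) and a maximum of 846, which suggests a right-skewed distribution. The log scaling mentioned in the task can be sketched with `np.log1p`. This is only an illustration: it uses the ten Insulin values visible in the `data.head()` output rather than the full column, so the exact skew figures will differ on the real data.

```python
import numpy as np
import pandas as pd

# Insulin values from the ten rows displayed by data.head().
insulin = pd.Series([0, 0, 0, 94, 168, 0, 88, 0, 543, 0], name="Insulin")

# log1p(x) = log(1 + x) is defined at 0, unlike a plain logarithm.
insulin_log = np.log1p(insulin)

print("skew before:", round(insulin.skew(), 3))
print("skew after: ", round(insulin_log.skew(), 3))
```

The transform compresses the long right tail, which tends to help models that are sensitive to feature scale, such as logistic regression, k-nearest neighbours, and SVMs.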


In [10]:
len(data)

767

In [11]:
data.ndim,data.size

(2, 6903)

In [12]:
data.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
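
`isna()` only detects values stored as NaN. The `data.describe()` output shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, and a zero is not a plausible reading for these measurements, so it likely stands in for a missing value. Treating zeros this way is an assumption, not something the source states. A minimal check, using only the ten rows displayed by `data.head()`:

```python
import pandas as pd

# The ten rows shown by data.head(), restricted to the measurement columns.
head_rows = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
        "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
        "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
        "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
    }
)

# isna() finds nothing, because the gaps are stored as 0 rather than NaN.
nan_total = int(head_rows.isna().sum().sum())
print("NaN cells:", nan_total)

# Counting zeros per column exposes the placeholder values instead.
zero_counts = (head_rows == 0).sum()
print(zero_counts)
```

On the full dataset the same check would be `(data[cols] == 0).sum()` for the relevant columns.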

In [13]:
data["Outcome"].value_counts()

0    499
1    268
Name: Outcome, dtype: int64
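
The counts above mean that roughly 35% of the rows are positive (268 of 767), so the classes are moderately imbalanced. One common way to preserve that ratio when splitting is a stratified split. The sketch below rebuilds a label vector from the reported counts and uses a placeholder feature; the 80/20 split and `random_state=42` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Label vector with the class counts reported above (499 zeros, 268 ones).
y = np.array([0] * 499 + [1] * 268)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, for illustration only

print("positive share:", round(y.mean(), 4))

# stratify=y keeps the class ratio roughly equal in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("train positive share:", round(y_train.mean(), 4))
print("test positive share: ", round(y_test.mean(), 4))
```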

# DATASET 
  
    The dataset has no null values, as confirmed by isna().sum().
    Outliers and duplicate rows have been removed.
    The data is now ready for EDA (Exploratory Data Analysis).
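
Whether any outliers remain depends on the rule applied, which the notebook does not state. As a rough check under the common 1.5 × IQR convention (an assumption here), the Insulin quartiles from the `data.describe()` output can be plugged in directly:

```python
# Quartiles and maximum for Insulin, copied from the data.describe() output.
q1, q3 = 0.0, 127.5
col_max = 846.0

# Fences under the 1.5 * IQR convention (one rule among several).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("fences:", lower_fence, upper_fence)
print("max beyond upper fence:", col_max > upper_fence)
```

Under this convention the Insulin maximum lies above the upper fence, so either a different rule was used or the column may still warrant a closer look during EDA.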