**Instructions**

* Dataset description:

 This dataset includes medical measurements and indicators that can help in diagnosing diabetes.
 
 The goal is to explore the dataset and uncover any patterns, correlations, and potential issues such as missing values or outliers.
 

* Dataset Columns Description

Pregnancies: Number of times the patient was pregnant.

Glucose: Plasma glucose concentration (2 hours in an oral glucose tolerance test).

BloodPressure: Diastolic blood pressure (mm Hg).

SkinThickness: Triceps skinfold thickness (mm).

Insulin: 2-Hour serum insulin (mu U/ml).

BMI: Body Mass Index (weight in kg/(height in m)^2).

DiabetesPedigreeFunction: Diabetes pedigree function (a measure of genetic influence).

Age: Age of the patient (years).

Outcome: Diabetes diagnosis outcome (0 = No, 1 = Yes).

# Step 1: Data Exploration with Pandas

* *Q1* Load the Dataset
 
* *Q2* General Information:
 
Display the dataset's structure, including column names, data types, and memory usage.

Identify the number of missing values or zeros in the dataset.
 
* *Q3* Descriptive Analysis:

Use the describe() function to analyze:
 
Summary statistics for each column (mean, min, max, quartiles).

Look for irregularities, such as columns with unrealistic minimum or maximum values.


In [1]:
import pandas as pd

In [None]:
# Q1

data = pd.read_csv('../draft check point/first dataset_diabetes.csv')

data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [None]:
# Q2

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [10]:
# Count the zeros in each column
(data == 0).sum()

Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64

In [11]:
# Q3

data.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


**The `describe()` result shows some unrealistic minimum and maximum values:**

> BMI: min 0. A body mass index of zero is not possible.

> Insulin: min 0 and max 846. The maximum is far above the 75th percentile (127.25); I would normally expect a maximum of around 500 mu U/ml.

> SkinThickness: min 0. A triceps skinfold always has some thickness, so zero is not a valid measurement.

> BloodPressure: min 0. This value cannot occur while the heart is still functioning and pumping blood.

> Glucose: min 0 and max 199. A value of 0 is not possible; 199 is high (I would normally expect values between 70 and 180) but plausible for a 2-hour oral glucose tolerance test.

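One possible way to act on these observations is to treat the impossible zeros as missing values. The sketch below is only an illustration: `sample` holds five rows copied from the table displayed in Q1 (in the notebook the full `data` frame would be used), and the list `zero_as_missing` is an assumption about which columns cannot legitimately be zero.

```python
import numpy as np
import pandas as pd

# Five rows copied from the head of the dataset shown in Q1.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns means "not measured".
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Zeros now show up as missing values, and the minimum ignores them.
missing_counts = cleaned.isnull().sum()
print(missing_counts)
print(cleaned.describe().loc["min"])
```

After this step, `isnull().sum()` reports the hidden gaps and `describe()` no longer shows a minimum of 0 for these columns.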
# Step 2: Data Exploration with ydata-profiling

* *Q1* Generate a Profiling Report:
 
Use ydata-profiling to create an interactive report that includes:
 
Column descriptions (type, unique values, missing values).

Distributions for numerical columns.

Correlation matrices to identify relationships between variables.

Highlighted outliers or anomalies.
 
* *Q2* Analyze the Report:
 
Identify missing values in key columns such as Glucose, Insulin, and BMI.

Examine correlations between columns like Age, Glucose, and Outcome.

Note any interesting insights or patterns (e.g., higher glucose levels correlated with diabetes diagnosis).


In [12]:
from ydata_profiling import ProfileReport

In [15]:
# Q1

report = ProfileReport(data, title='diagnosing diabetes_report')

report.to_notebook_iframe()


# *Q2*

* **Identifying missing values**

In the Variables section of the report we can see that Glucose, Insulin, and BMI have no missing values.


* **Examining correlations**

The Interactions and Correlations sections of the report above help with a first analysis:

Between Age and Glucose: the heatmap gives a correlation of 0.285, a weak positive correlation.

Between Age and Outcome: the correlation is 0.314, which is still too weak to call a strong relationship.

Between Glucose and Outcome: the correlation is 0.487, a moderate positive correlation.
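The same kind of values can be cross-checked in Pandas with `DataFrame.corr()`. This is a minimal sketch on nine rows copied from the table displayed in Q1, so the numbers it prints will not match the report; the report may also use a different correlation coefficient than the Pearson default assumed here.

```python
import pandas as pd

# Nine rows copied from the dataset preview shown in Q1 (not the full data).
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 101, 122, 121, 126],
    "Age": [50, 31, 32, 21, 33, 63, 27, 30, 47],
    "Outcome": [1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Pearson correlation is the default method of DataFrame.corr().
corr = sample.corr()
print(corr.round(3))

# Individual pairs can be read with .loc.
age_glucose = corr.loc["Age", "Glucose"]
glucose_outcome = corr.loc["Glucose", "Outcome"]
```

In the notebook, calling `data.corr()` on the full frame would give the values to compare with the heatmap.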

* **Any interesting insights or patterns**

>>*There is a high correlation between:*

Age & Pregnancies

Insulin & SkinThickness

>>*Zero values in critical columns such as:*

SkinThickness has 227 zeros

Insulin has 374 zeros


>>*Some patterns to take into consideration:*

Glucose and the diabetes diagnosis are related, since they have a positive correlation.

Glucose may increase slightly with age, given the weak positive correlation between them.
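A simple way to look at the glucose and diagnosis pattern directly is to compare the average glucose per outcome group. The sketch below uses only the nine rows visible in the Q1 preview, so it illustrates the technique rather than a result for the whole dataset.

```python
import pandas as pd

# Nine rows copied from the dataset preview shown in Q1 (not the full data).
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 101, 122, 121, 126],
    "Outcome": [1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Mean glucose for non-diabetic (0) and diabetic (1) patients.
mean_glucose = sample.groupby("Outcome")["Glucose"].mean()
print(mean_glucose)
```

On the full frame, `data.groupby("Outcome")["Glucose"].mean()` would show whether the same gap holds across all 768 rows.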

# Step 3: Summary

1.	Document Findings:

Write a summary of key observations from both Pandas and the ydata-profiling report.

2. Mention:

Patterns or trends in glucose, BMI, or pregnancies.

Any notable correlations between variables.

Issues such as missing or zero values in critical columns. 

3.	Suggestions:

Recommend next steps, such as handling missing values, addressing outliers, or exploring predictive modeling with the data.

# *Q1*

### **In the Pandas steps**
We found that there are no missing values, but several variables contain zeros.

We also saw irregular minimum and maximum values, for example in BloodPressure and Glucose.


### **In the ydata-profiling report**
The report clearly shows that some variables contain zero values, and it confirms that there are no missing values.

There is a high correlation between Age and Pregnancies, and between Insulin and SkinThickness.

# *Q2*

* Patterns and trends:

Glucose values are usually between 50 and 150, as shown in the histogram, which suggests that glucose concentration is high in this medical dataset.

BMI values are usually between 25 and 40, as shown in the histogram, which suggests that body mass is high in this medical dataset.

The number of pregnancies follows a decreasing distribution in the histogram: higher counts are less frequent, and most values are between 0 and 8.
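The ranges read off the histograms can be checked numerically with `Series.between()`. This is a hedged sketch on the nine rows visible in the Q1 preview; the bounds are the ones quoted above, and the printed shares say nothing about the full dataset.

```python
import pandas as pd

# Nine rows copied from the dataset preview shown in Q1 (not the full data).
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 101, 122, 121, 126],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 32.9, 36.8, 26.2, 30.1],
})

# between() is inclusive on both ends by default;
# the mean of the boolean mask is the share of rows inside the range.
glucose_share = sample["Glucose"].between(50, 150).mean()
bmi_share = sample["BMI"].between(25, 40).mean()

print(f"Glucose in [50, 150]: {glucose_share:.1%}")
print(f"BMI in [25, 40]: {bmi_share:.1%}")
```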


* Correlations between variables:

There is a positive correlation between these variables:

Age and Pregnancies

BMI and SkinThickness, as well as SkinThickness and Insulin

Glucose and Outcome


* Missing and zero values in critical columns:

This dataset has no missing values, but some variables contain zero values that may cause irregularities.

>> First, these variables have a number of zeros that is significant given the number of rows in the dataset (768):

SkinThickness has 227 (29.6%) zeros

Insulin has 374 (48.7%) zeros


>> These variables have fewer rows with zero values, which may not be significant given the number of rows in the dataset:

BloodPressure has 35 (4.6%) zeros

BMI has 11 (1.4%) zeros
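The percentages can be reproduced from the zero counts and the 768 rows reported earlier. The sketch below only redoes that arithmetic; the variable names are illustrative.

```python
import pandas as pd

# Zero counts and row count taken from the outputs shown in Step 1.
zero_counts = pd.Series({
    "SkinThickness": 227,
    "Insulin": 374,
    "BloodPressure": 35,
    "BMI": 11,
})
n_rows = 768

# Share of zeros per column, in percent, rounded to one decimal.
zero_percent = (zero_counts / n_rows * 100).round(1)
print(zero_percent)
```

With the full frame loaded, `((data == 0).mean() * 100).round(1)` should give the same shares in one step.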

# **Q3**

## *Suggestions*:

* We have no missing data to handle.

If we did, we could impute the missing values with the mean, median, or mode, depending on the variable. This is only worth doing when the number of NA values is large enough to affect the exploration.
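As an illustration of the imputation idea, here is a minimal sketch of median imputation. It assumes, for the sake of the example, that the zeros in SkinThickness and Insulin have already been marked as missing (`NaN`); the five rows come from the Q1 preview.

```python
import numpy as np
import pandas as pd

# Five rows from the Q1 preview, with zeros already replaced by NaN
# (an assumption made for this example).
with_gaps = pd.DataFrame({
    "SkinThickness": [35, 29, np.nan, 23, 35],
    "Insulin": [np.nan, np.nan, np.nan, 94, 168],
})

# median() skips NaN by default; fillna() then fills each column
# with its own median.
medians = with_gaps.median()
imputed = with_gaps.fillna(medians)
print(imputed)
```

The median is used here because it is less sensitive to extreme values than the mean; the mean or mode could be swapped in depending on the variable.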


* Addressing outliers: we have not yet covered how to manage outliers.

One option would be to remove the column with the least impact that has a significant number of outliers, for example 'Pregnancies'.
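Before removing anything, outliers can be located with the common 1.5 × IQR rule. This is a sketch of one standard technique, not something covered in the exercise: `iqr_outliers` is a hypothetical helper, and the series contains the nine Insulin values from the Q1 preview plus the maximum of 846 reported by `describe()`.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Nine Insulin values from the Q1 preview plus the reported maximum (846).
insulin = pd.Series([0, 0, 0, 94, 168, 180, 0, 112, 0, 846], name="Insulin")

outliers = iqr_outliers(insulin)
print(outliers)
```

Applied to the quartiles from `describe()` (Q1 = 0, Q3 = 127.25), the same rule gives an upper fence of 318.125 for Insulin, so the maximum of 846 would be flagged there as well.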

* Exploring predictive modeling with the data: we have not covered this yet, but I have an idea.

I have noticed relationships between these variables:

DiabetesPedigreeFunction and Insulin || Insulin and SkinThickness || BMI and SkinThickness || Glucose and Outcome

Our model should focus more on these features, as they are related in some way to the outcome.
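One hedged way to decide which features deserve attention is to rank them by the absolute value of their correlation with Outcome. The sketch below uses the nine rows visible in the Q1 preview, so the ranking it prints is illustrative only and should be recomputed on the full `data` frame.

```python
import pandas as pd

# Nine rows copied from the dataset preview shown in Q1 (not the full data).
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0, 10, 2, 5, 1],
    "Glucose": [148, 85, 183, 89, 137, 101, 122, 121, 126],
    "BloodPressure": [72, 66, 64, 66, 40, 76, 70, 72, 60],
    "SkinThickness": [35, 29, 0, 23, 35, 48, 27, 23, 0],
    "Insulin": [0, 0, 0, 94, 168, 180, 0, 112, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 32.9, 36.8, 26.2, 30.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288,
                                 0.171, 0.340, 0.245, 0.349],
    "Age": [50, 31, 32, 21, 33, 63, 27, 30, 47],
    "Outcome": [1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Absolute Pearson correlation of each feature with the target,
# sorted from strongest to weakest.
ranking = (
    sample.corr()["Outcome"]
    .drop("Outcome")
    .abs()
    .sort_values(ascending=False)
)
print(ranking.round(3))
```

Correlation only captures linear, pairwise relationships, so this ranking is a starting point for feature selection rather than a substitute for fitting and validating a model.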