## Understanding Your Dataset with Pandas

## About Dataset

**Diabetes Diagnosis Dataset**

This dataset contains 9,538 medical records related to diabetes diagnosis and risk factors. It includes various health parameters, lifestyle habits, and genetic predispositions that contribute to diabetes risk. The data is structured with realistic distributions, making it valuable for medical research, statistical analysis, and machine learning applications.

- Age: The age of the individual (18-90 years).
- Pregnancies: Number of times the patient has been pregnant.
- BMI (Body Mass Index): A measure of body fat based on height and weight (kg/m²).
- Glucose: Blood glucose concentration (mg/dL), a key diabetes indicator.
- BloodPressure: Systolic blood pressure (mmHg); higher levels may indicate hypertension.
- HbA1c: Hemoglobin A1c level (%), representing average blood sugar over months.
- LDL (Low-Density Lipoprotein): "Bad" cholesterol level (mg/dL).
- HDL (High-Density Lipoprotein): "Good" cholesterol level (mg/dL).
- Triglycerides: Fat levels in the blood (mg/dL); high values increase diabetes risk.
- WaistCircumference: Waist measurement (cm), an indicator of central obesity.
- HipCircumference: Hip measurement (cm), used to calculate WHR.
- WHR (Waist-to-Hip Ratio): Waist circumference divided by hip circumference.
- FamilyHistory: Indicates if the individual has a family history of diabetes (1 = Yes, 0 = No).
- DietType: Dietary habits (0 = Unbalanced, 1 = Balanced, 2 = Vegan/Vegetarian).
- Hypertension: Presence of high blood pressure (1 = Yes, 0 = No).
- MedicationUse: Indicates if the individual is taking medication (1 = Yes, 0 = No).
- Outcome: Diabetes diagnosis result (1 = Diabetes, 0 = No Diabetes).

This dataset is useful for exploring the relationships between lifestyle choices, genetic factors, and diabetes risk, providing valuable insights for predictive modeling and health analytics.

---

Objective: In this tutorial, you will:
- Load a dataset using Pandas
- Explore its structure and contents
- Handle missing values

#### Questions

1. What are the columns in your dataset?
2. How many rows and columns does your dataset have?
3. Identify which columns are numerical and which are categorical.
4. Look at the mean, min, max, and standard deviation of the numerical data.
5. Look up **value_counts()**, explain what it is used for, and show an example of how to use it.
6. Identify which columns have missing values and how many.


**For each question, explain your observations about the data. Your insights will serve as the foundation for our next session on Data Exploration and Analysis.**


In [16]:
import pandas as pd

data = pd.read_csv('diabetes_dataset.csv')

data.head()

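|   | Age | Pregnancies | BMI | Glucose | BloodPressure | HbA1c | LDL | HDL | Triglycerides | WaistCircumference | HipCircumference | WHR | FamilyHistory | DietType | Hypertension | MedicationUse | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69 | 5 | 28.39 | 130.1 | 77.0 | 5.4 | 130.4 | 44.0 | 50.0 | 90.5 | 107.9 | 0.84 | 0 | 0 | 0 | 1 | 0 |
| 1 | 32 | 1 | 26.49 | 116.5 | 72.0 | 4.5 | 87.4 | 54.2 | 129.9 | 113.3 | 81.4 | 1.39 | 0 | 0 | 0 | 0 | 0 |
| 2 | 89 | 13 | 25.34 | 101.0 | 82.0 | 4.9 | 112.5 | 56.8 | 177.6 | 84.7 | 107.2 | 0.79 | 0 | 0 | 0 | 1 | 0 |
| 3 | 78 | 13 | 29.91 | 146.0 | 104.0 | 5.7 | 50.7 | 39.1 | 117.0 | 108.9 | 110.0 | 0.99 | 0 | 0 | 0 | 1 | 1 |
| 4 | 38 | 8 | 24.56 | 103.2 | 74.0 | 4.7 | 102.5 | 29.1 | 145.9 | 84.1 | 92.8 | 0.91 | 0 | 1 | 0 | 0 | 0 |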


In [17]:
print(data.shape)
print(data.columns)

(9538, 17)
Index(['Age', 'Pregnancies', 'BMI', 'Glucose', 'BloodPressure', 'HbA1c', 'LDL',
       'HDL', 'Triglycerides', 'WaistCircumference', 'HipCircumference', 'WHR',
       'FamilyHistory', 'DietType', 'Hypertension', 'MedicationUse',
       'Outcome'],
      dtype='object')


The dataset has 17 columns, each holding medical or lifestyle information relevant to a diabetes diagnosis. The last column, Outcome, is the diagnosis itself, and the other 16 are candidate predictors. With 9,538 records, there should be enough data to build a model that predicts whether someone has diabetes from their values for the other attributes.
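As a minimal sketch of how the shape and the predictor/diagnosis split could be handled in code, the snippet below uses a tiny hypothetical frame (`toy`, with made-up values and only three of the real column names) rather than the actual file.

```python
import pandas as pd

# Hypothetical toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Age": [69, 32, 89],
    "BMI": [28.39, 26.49, 25.34],
    "Outcome": [0, 0, 1],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape

# Separate the candidate predictors from the diagnosis column a model would try to predict.
features = toy.drop(columns="Outcome")
target = toy["Outcome"]

print(f"{n_rows} rows, {n_cols} columns")
print(list(features.columns))
```

The same two lines (`drop(columns="Outcome")` and `data["Outcome"]`) should apply unchanged to the full `data` frame, assuming the column names printed above.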

In [28]:
# info() prints its report directly and returns None, so it is not wrapped in print()
data.info()
print()
print(data.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9538 entries, 0 to 9537
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 9538 non-null   int64  
 1   Pregnancies         9538 non-null   int64  
 2   BMI                 9538 non-null   float64
 3   Glucose             9538 non-null   float64
 4   BloodPressure       9538 non-null   float64
 5   HbA1c               9538 non-null   float64
 6   LDL                 9538 non-null   float64
 7   HDL                 9538 non-null   float64
 8   Triglycerides       9538 non-null   float64
 9   WaistCircumference  9538 non-null   float64
 10  HipCircumference    9538 non-null   float64
 11  WHR                 9538 non-null   float64
 12  FamilyHistory       9538 non-null   int64  
 13  DietType            9538 non-null   int64  
 14  Hypertension        9538 non-null   int64  
 15  MedicationUse       9538 non-null   int64  
 16  Outcom

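No column contains a null entry, so no rows need to be dropped or imputed before analysis. Every column is stored with a numeric dtype (int64 or float64). According to the column descriptions, however, FamilyHistory, DietType, Hypertension, MedicationUse, and Outcome are categorical variables encoded as integers, while the remaining twelve columns are genuinely numerical.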
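The two checks in this cell can be illustrated on a small hypothetical frame. The sketch below uses made-up values (including one deliberately missing BMI) and an assumed threshold, `MAX_CATEGORIES`, to flag integer columns that probably hold category codes. The threshold is a heuristic for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration only.
toy = pd.DataFrame({
    "Age": [69, 32, 89, 78, 38, 51],
    "BMI": [28.39, 26.49, np.nan, 29.91, 24.56, 31.02],
    "DietType": [0, 0, 0, 0, 1, 2],
    "Outcome": [0, 0, 0, 1, 0, 1],
})

# Missing values per column: isnull() marks NaN cells, sum() counts them.
missing_per_column = toy.isnull().sum()

# Every column has a numeric dtype, so dtype alone cannot reveal coded categoricals.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Heuristic (an assumption): integer columns with few distinct values are likely category codes.
MAX_CATEGORIES = 3
likely_categorical = [
    col for col in toy.columns
    if pd.api.types.is_integer_dtype(toy[col]) and toy[col].nunique() <= MAX_CATEGORIES
]

print(missing_per_column)
print(numeric_cols)
print(likely_categorical)
```

On the real data a larger threshold would be needed, and the result should still be checked against the column descriptions, since an integer count such as Pregnancies is numerical even though it takes few distinct values.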

In [14]:
data.describe()

|   | Age | Pregnancies | BMI | Glucose | BloodPressure | HbA1c | LDL | HDL | Triglycerides | WaistCircumference | HipCircumference | WHR | FamilyHistory | DietType | Hypertension | MedicationUse | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 |
| mean | 53.577584 | 7.986161 | 27.052364 | 106.104183 | 84.475781 | 4.650661 | 100.133456 | 49.953418 | 151.147746 | 93.951678 | 103.060621 | 0.9174 | 0.302474 | 0.486161 | 0.001048 | 0.405012 | 0.344097 |
| std | 20.764651 | 4.933469 | 5.927955 | 21.91859 | 14.12348 | 0.476395 | 29.91191 | 15.242194 | 48.951627 | 15.594468 | 13.438827 | 0.140828 | 0.459354 | 0.661139 | 0.032364 | 0.49092 | 0.475098 |
| min | 18.0 | 0.0 | 15.0 | 50.0 | 60.0 | 4.0 | -12.0 | -9.2 | 50.0 | 40.3 | 54.8 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 36.0 | 4.0 | 22.87 | 91.0 | 74.0 | 4.3 | 80.1 | 39.7 | 117.2 | 83.4 | 94.0 | 0.82 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 53.0 | 8.0 | 27.05 | 106.0 | 84.0 | 4.6 | 99.9 | 50.2 | 150.55 | 93.8 | 103.2 | 0.91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 72.0 | 12.0 | 31.18 | 121.0 | 94.0 | 5.0 | 120.2 | 60.2 | 185.1 | 104.6 | 112.1 | 1.01 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| max | 89.0 | 16.0 | 49.66 | 207.2 | 138.0 | 6.9 | 202.2 | 107.8 | 345.8 | 163.0 | 156.6 | 1.49 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 |


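The mean age in this dataset is fairly high at about 53.6 years (median 53), and the mean number of pregnancies is also high at about 8. Two minimums stand out as implausible: LDL (-12.0) and HDL (-9.2) are negative, which is not possible for a cholesterol concentration, so those records deserve a closer look before modeling. The Hypertension mean of about 0.001 also shows that almost no one is flagged as hypertensive. The remaining statistics will inform how we build the prediction model.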
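The summary above lists negative minimums for LDL and HDL. A minimal sketch of how such values could be located is shown below, using a hypothetical frame (`toy`) whose two negative entries mirror those minimums. It assumes that these measurements cannot be negative; the other values are only illustrative.

```python
import pandas as pd

# Hypothetical toy frame; the two negative values mirror the minimums in the summary above.
toy = pd.DataFrame({
    "LDL": [130.4, 87.4, -12.0, 50.7],
    "HDL": [44.0, 54.2, 56.8, -9.2],
    "Triglycerides": [50.0, 129.9, 177.6, 117.0],
})

# Assumption: these measurements cannot be negative, so any value below zero is suspect.
negative_mask = toy < 0

# Count suspect values per column, then pull out the rows that contain at least one.
negative_counts = negative_mask.sum()
suspect_rows = toy[negative_mask.any(axis=1)]

print(negative_counts)
print(suspect_rows)
```

Whether suspect rows should be dropped, corrected, or kept is a separate decision that depends on how the data were collected.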

In [27]:
print(data['Age'].value_counts(sort=True),"\n")
print(data['DietType'].value_counts(sort=True),"\n")
print(data['Outcome'].value_counts(sort=True),"\n")

Age
34    168
71    153
43    153
79    152
36    150
     ... 
63    117
31    116
24    112
51    108
22    108
Name: count, Length: 72, dtype: int64 

DietType
0    5794
1    2851
2     893
Name: count, dtype: int64 

Outcome
0    6256
1    3282
Name: count, dtype: int64 



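The cell above shows how to use value_counts(). It returns the number of times each distinct value occurs in a Series (column), sorted from most to least frequent by default. Here it shows that ages are spread fairly evenly (each of the 72 distinct ages appears between 108 and 168 times), that most individuals have an unbalanced diet (5,794 of 9,538), and that about 34% of the records (3,282) carry a diabetes diagnosis.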
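Raw counts are often easier to interpret as proportions. The sketch below, on a hypothetical series of made-up outcomes, shows the `normalize=True` option of `value_counts()` and how to order the result by value instead of by frequency.

```python
import pandas as pd

# Hypothetical toy series with made-up values.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name="Outcome")

# Occurrences of each distinct value, most frequent first (the default ordering).
counts = outcome.value_counts()

# The same tally expressed as fractions of the total, which sum to 1.
shares = outcome.value_counts(normalize=True)

# Ordered by the value itself rather than by how often it occurs.
by_value = outcome.value_counts().sort_index()

print(counts)
print(shares)
print(by_value)
```

Applied to `data['Outcome']`, the normalized form would report the class balance directly, which is useful to know before training a classifier.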