# Analysis of Diabetes and its causes 

## Loading the data 

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt 
import seaborn as sns 
import numpy as np 
import pandas as pd 
from warnings import filterwarnings
filterwarnings("ignore") 

In [None]:
#import the data
db = pd.read_csv("diabetes.csv")

In [None]:
#view the first rows
db.head()

In [None]:
#view the last rows
db.tail()

In [None]:
#get information on the name, type and number of non-null values of each column
db.info()

In [None]:
#convert the 0/1 Outcome column to booleans
db["Outcome"] = db["Outcome"].astype(bool)


## Exploratory data analysis 

### What is the ratio of patients with diabetes?

In [None]:
def value_count(count, having=None, data=db):
    """Count the entries of column `count`, grouped by `having` (or by `count` itself)."""
    group_col = having if having is not None else count
    return data.groupby(group_col)[count].count()

In [None]:
def percent(num, total=None):
    """Return num as a percentage of total, or each entry of a Series as a percentage of its sum."""
    if isinstance(num, (int, float, np.integer, np.floating)):
        return (num / total) * 100
    return (num / num.sum()) * 100

In [None]:
out_vc=value_count("Outcome")
out_vc

In [None]:
#out_vc is sorted by Outcome, so position 0 is non-diabetic and position 1 is diabetic
out_pct = percent(out_vc).round(1)
label1 = "Non-\ndiabetic \n" + str(out_pct.iloc[0]) + "%"
label2 = "Diabetic \n" + str(out_pct.iloc[1]) + "%"
label_m = ["Diabetic","Non Diabetic"]
plt.pie(out_vc, labels=[label1,label2], labeldistance=0.4)
plt.show()

**65.1%** of those interviewed were **not diabetic**, while **34.9%** were **diabetic**.
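As a cross-check on the percentages above, here is a minimal sketch that computes class shares directly from a list of labels. The helper name `class_percentages` and the toy labels are assumptions made for illustration; they are not part of the original notebook or the real data.

```python
from collections import Counter

def class_percentages(labels):
    """Return {label: percentage of all labels}, rounded to one decimal place."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Toy labels (illustrative only): 3 diabetic, 5 non-diabetic
toy_outcomes = [True, False, False, True, False, False, True, False]
shares = class_percentages(toy_outcomes)
print(shares)
```

On the real data, `class_percentages(db["Outcome"])` or the pandas one-liner `db["Outcome"].value_counts(normalize=True) * 100` should reproduce the figures quoted above.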

In [None]:
label_m

## Analyse the age distribution of the candidates

In [None]:
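#overlay the age distributions of the diabetic (True) and non-diabetic (False) groups
for outcome, name in zip([True, False], label_m):
    db1 = db[db["Outcome"] == outcome]
    sns.distplot(db1["Age"], bins=90, label=name)
plt.legend()
plt.title("Age distribution by outcome")
plt.show()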

The distribution of **ages** shows a bias in the data: most of it was collected from people between the ages of **20 and 25**. This might affect the analysis.
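To put a number on the age bias rather than reading it off the plot, one option is to compute the share of candidates inside an age band. This is only a sketch: `share_in_range` is a hypothetical helper and the ages below are made-up toy values, not the real data.

```python
def share_in_range(values, low, high):
    """Return the percentage of values with low <= value <= high."""
    values = list(values)
    inside = sum(1 for v in values if low <= v <= high)
    return 100 * inside / len(values)

# Toy ages (illustrative only)
toy_ages = [21, 22, 24, 25, 29, 33, 41, 50, 23, 62]
young_share = share_in_range(toy_ages, 20, 25)
print(young_share)
```

Calling `share_in_range(db["Age"], 20, 25)` on the real data would show how much of the sample actually falls in the 20 to 25 band.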

## Analyse the distribution of pregnancies of the candidates

In [None]:
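#overlay the pregnancy distributions of the diabetic (True) and non-diabetic (False) groups
for outcome, name in zip([True, False], label_m):
    db1 = db[db["Outcome"] == outcome]
    sns.distplot(db1["Pregnancies"], bins=20, label=name)

plt.legend()
plt.title("Pregnancies distribution by outcome")
plt.show()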

Most of the candidates without diabetes had 0 to 5 **pregnancies**, whereas those with **diabetes** showed a more even spread of **pregnancies** between 0 and 10.

In [None]:
sns.scatterplot(x="Age", y="Pregnancies", hue="Outcome", data=db)

The **older** the candidates are, the more **pregnancies** they tend to have had, but in this plot the number of pregnancies does not seem to affect the probability that a person has **diabetes**.
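The relationship between age and pregnancies described above can also be summarised with a Pearson correlation coefficient. The sketch below implements it in plain Python; the function name `pearson_r` and the toy values are assumptions for illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    xs, ys = list(xs), list(ys)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy values (illustrative only): pregnancies rise steadily with age
toy_age = [21, 25, 30, 35, 40]
toy_pregnancies = [0, 1, 2, 3, 4]
r = pearson_r(toy_age, toy_pregnancies)
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship. On the real data, `db["Age"].corr(db["Pregnancies"])` computes the same coefficient with pandas.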

## Analyse the BMI of the candidates

In [None]:
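#overlay the BMI distributions of the diabetic (True) and non-diabetic (False) groups
for outcome, name in zip([True, False], label_m):
    db1 = db[db["Outcome"] == outcome]
    sns.distplot(db1["BMI"], bins=60, label=name)
plt.legend()
plt.title("BMI distribution by outcome")
plt.show()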

The **BMI** for **non-diabetic** patients peaks at about 33, and most of the candidates with **diabetes** had a **BMI** within the range of 30 to 40. The plot also shows some candidates whose **BMI** was not recorded.

In [None]:
sns.violinplot(x="Outcome", y="BMI", data=db)

The distributions of **BMI** for the **diabetic** and **non-diabetic** patients are similar, which suggests that **BMI** has no relationship with the **outcome**.
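A numeric comparison can help confirm or qualify what the violin plot suggests. The sketch below computes the median per group and skips zeros, on the assumption that a BMI of 0 marks an unrecorded value. The helper `median_by_group`, that zero-handling rule, and the toy values are all assumptions for illustration.

```python
from statistics import median

def median_by_group(values, groups, skip_zero=True):
    """Return {group: median of its values}, optionally ignoring zeros."""
    buckets = {}
    for value, group in zip(values, groups):
        if skip_zero and value == 0:
            continue  # assumption: 0 means the value was not recorded
        buckets.setdefault(group, []).append(value)
    return {group: median(vals) for group, vals in buckets.items()}

# Toy values (illustrative only); the 0.0 entry stands for a missing BMI
toy_bmi = [33.6, 26.6, 0.0, 28.1, 43.1, 25.6, 31.0, 35.3]
toy_outcome = [True, False, True, False, True, False, True, False]
bmi_medians = median_by_group(toy_bmi, toy_outcome)
print(bmi_medians)
```

On the real data this would be called as `median_by_group(db["BMI"], db["Outcome"])`. A small gap between the two medians would support the reading above, while a large one would call for a closer look.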

## Analyse the glucose levels of the candidates

In [None]:
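#overlay the glucose distributions of the diabetic (True) and non-diabetic (False) groups
for outcome, name in zip([True, False], label_m):
    db1 = db[db["Outcome"] == outcome]
    sns.distplot(db1["Glucose"], bins=60, label=name)
plt.legend()
plt.title("Glucose distribution by outcome")
plt.show()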

Most of those **without diabetes** have a glucose level of about **100**. The density for those with **diabetes** is shifted further to the right, showing that those with **diabetes** had **higher levels of glucose**.


In [None]:
sns.boxplot(y="Glucose",x="Outcome",data = db)
plt.show()

Those with diabetes tend to have higher glucose levels than those without, further suggesting that higher glucose is associated with a **higher** chance of having **diabetes**.
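One way to quantify how far apart the two glucose distributions are is the probability that a randomly chosen diabetic candidate has a higher glucose value than a randomly chosen non-diabetic one. This is a rough sketch: `prob_higher` is a hypothetical helper, ties are counted as half, and the numbers are toy values rather than the real data.

```python
def prob_higher(group_a, group_b):
    """Return P(random value from a > random value from b), counting ties as half."""
    a, b = list(group_a), list(group_b)
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return wins / (len(a) * len(b))

# Toy glucose values (illustrative only)
toy_diabetic = [148, 183, 137, 168]
toy_non_diabetic = [85, 89, 116, 137]
p = prob_higher(toy_diabetic, toy_non_diabetic)
print(p)
```

A value near 0.5 would mean the groups overlap almost completely, and a value near 1 would mean glucose separates them well. On the real data the two inputs would be `db[db["Outcome"] == True]["Glucose"]` and `db[db["Outcome"] == False]["Glucose"]`.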