# Dataset Description
The dataset originally has 22 features (columns). Based on research into the factors that influence diabetes and other chronic health conditions, only selected features are used in this analysis. The dataset contains the following information:

* Diabetes_binary: 0 = no diabetes 1 = prediabetes or diabetes
* HighBP (High blood pressure ): 0 = no high BP 1 = high BP
* HighChol: 0 = no high cholesterol 1 = high cholesterol
* CholCheck: 0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years
* BMI: Body Mass Index
* Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no 1 = yes
* Stroke: (Ever told) you had a stroke. 0 = no 1 = yes
* PhysActivity: physical activity in past 30 days - not including job 0 = no 1 = yes
* Fruits: Consume Fruit 1 or more times per day 0 = no 1 = yes
* Veggies: Consume Vegetables 1 or more times per day 0 = no 1 = yes

* HvyAlcoholConsump : (adult men >=14 drinks per week and adult women>=7 drinks per week) 0 = no 1 = yes
* AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no 1 = yes
* NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no 1 = yes
* GenHlth: Would you say that in general your health is: scale 1-5 1 = excellent 2 = very good 3 = good 4 = fair 5 = poor
* MentHlth: number of days of poor mental health in the past 30 days, scale 1-30
* PhysHlth: number of days of physical illness or injury in the past 30 days, scale 1-30
* DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no 1 = yes
* Sex: 0 = female 1 = male
* Age: 13-level age category / 1 = 18-24 9 = 60-64 13 = 80 or older
* Education : Education level (EDUCA see codebook) scale 1-6 1 = Never attended school or only kindergarten 2 = elementary etc.
* Income: Income scale (INCOME2 see codebook) scale 1-8 1 = less than $10,000 5 = less than $35,000 8 = $75,000 or more

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [None]:
Diabetes = pd.read_csv(r'C:\Users\Manaralbogamii\Desktop\ProjectT5\Datasetdiabetes.csv')
Diabetes.head()

# Exploratory data analysis (EDA)

In [None]:
print('Dimension of Diabetes Data : {}'.format(Diabetes.shape))

In [None]:
Diabetes.info()

In [None]:
Diabetes.describe()

In [None]:
#Get overall idea of the distribution of all columns using hist()

In [None]:
Diabetes.hist(figsize=(20,28))

### Outcomes
Diabetes_binary is imbalanced: the number of people without diabetes is much larger than the number of people with diabetes.
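The imbalance can be quantified directly rather than read off the histogram. The following is a minimal sketch, assuming a DataFrame with a `Diabetes_binary` column; the helper name `class_balance` and the toy data are illustrative and not part of the original notebook.

```python
import pandas as pd

def class_balance(df, target="Diabetes_binary"):
    """Return the count and percentage of each class in the target column."""
    counts = df[target].value_counts()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"Diabetes_binary": [0, 0, 0, 0, 1]})
balance = class_balance(toy)
print(balance)
```

On the real data the same call would be `class_balance(Diabetes)`.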

## Questions
  #### 1- What is the incidence rate of diabetics and non-diabetics?
  #### 2- What is the rate of diabetes by sex?
  #### 3- Do people with diabetes have high blood pressure?
  #### 4- Do people with diabetes have high cholesterol?
  #### 5- What age is most affected by diabetes?


 <span style="color:green ;font-size:20px; font-weight:bold"> 1-  What is the incidence rate of Diabetics and non-Diabetics? </span>


In [None]:
colors = ["b","r"]
labels = ["No Diabetes","Diabetes"]
Diabetes.Diabetes_binary.value_counts().plot.pie(labels = labels, figsize=(5,5), autopct='%1.1f%%',colors = colors)
plt.title("The Incidence Rate of Diabetics and non-Diabetics")

### Outcomes
- Percentage of people with diabetes represents 13.9%
- Percentage of people with no-diabetes represents 86.1%

In [None]:
Diabetes.Diabetes_binary.value_counts()

In [None]:
Diabetes.Sex.value_counts()

## Checking for missing values


In [None]:
Diabetes.isnull().sum()


### Outcomes:
- No missing values were found.

## Change values


In [None]:
# Replace the numeric codes with labels; map() avoids writing strings into a numeric column
Diabetes['Sex'] = Diabetes['Sex'].map({0: 'F', 1: 'M'})

In [None]:
#check value
Diabetes.Sex.value_counts()

In [None]:
# Replace the numeric codes with labels; map() avoids writing strings into a numeric column
Diabetes['Diabetes_binary'] = Diabetes['Diabetes_binary'].map({0: 'No Diabetes', 1: 'Diabetes'})

In [None]:
#check value
Diabetes.Diabetes_binary.value_counts()

<span style="color:green ;font-size:20px; font-weight:bold"> 2- What is the rate of Diabetes by Sex ?</span>


In [None]:
# Percentage of each diabetes status within each sex (each row sums to 100)
Percentage_of_infected = Diabetes.groupby('Sex')['Diabetes_binary'].value_counts(normalize=True).unstack() * 100
Percentage_of_infected

In [None]:
# unstack() turns the diabetes status into columns, so each status gets its own colour and legend entry
Diabetes_by_sex = (Diabetes.groupby('Sex').Diabetes_binary.value_counts()
                   .unstack()[['No Diabetes', 'Diabetes']]
                   .plot(kind='bar', figsize=(5,5), color=['blue','red']))
Diabetes_by_sex.set_title("Diabetes by Sex")
plt.ylabel("Count")
plt.xlabel("Sex")
plt.show()

### Outcomes:
- The number of females without diabetes is higher than the number of males without diabetes.
- The number of females with diabetes is approximately equal to the number of males with diabetes.


<span style="color:green ;font-size:20px; font-weight:bold"> 3-Do people with diabetes have high blood pressure?</span>


In [None]:
Diabetes['HighBP'].value_counts()

In [None]:
# Replace the numeric codes with labels; map() avoids writing strings into a numeric column
Diabetes['HighBP'] = Diabetes['HighBP'].map({0: 'No high blood pressure', 1: 'Have high blood pressure'})

In [None]:
sns.countplot(x='HighBP', hue='Diabetes_binary', data=Diabetes)

### Outcomes:

- Among people with diabetes, more have high blood pressure than do not.
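Raw counts are hard to compare when one class is much larger than the other, so a row-normalised cross-tabulation may answer the question more directly. This is a sketch only; `rate_by_group` and the toy data are hypothetical, and the function works whether the columns hold 0/1 codes or text labels.

```python
import pandas as pd

def rate_by_group(df, group_col, target="Diabetes_binary"):
    """Percentage of each target class within every level of group_col."""
    return pd.crosstab(df[group_col], df[target], normalize="index") * 100

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "HighBP":          [0, 0, 0, 0, 1, 1, 1, 1],
    "Diabetes_binary": [0, 0, 0, 1, 0, 0, 1, 1],
})
rates = rate_by_group(toy, "HighBP")
print(rates.round(1))
```

Each row of the result sums to 100, so the diabetes column can be compared across groups regardless of group size.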


<span style="color:green ;font-size:20px; font-weight:bold"> 4-Do people with diabetes have high cholesterol?</span>


In [None]:
Diabetes['HighChol'].value_counts()

In [None]:
# Replace the numeric codes with labels; map() avoids writing strings into a numeric column
Diabetes['HighChol'] = Diabetes['HighChol'].map({0: 'No high Cholesterol', 1: 'Have high Cholesterol'})

In [None]:
sns.countplot(x='HighChol', hue='Diabetes_binary', data=Diabetes)

### Outcomes:

- Among people with diabetes, more have high cholesterol than do not.


<span style="color:green ;font-size:20px; font-weight:bold">5-What age is most affected by diabetes? </span>


In [None]:
Diabetes['Age'].value_counts()

In [None]:
#change dtype for age to int
Diabetes['Age'] = Diabetes['Age'].astype('int')

In [None]:
Diabetes['Age'].value_counts()

In [None]:
age = Diabetes.groupby('Age')['Diabetes_binary'].value_counts().unstack().plot(kind='bar',width =1,figsize=(5,5), color=['red','blue'])
age.set_title("Age classification of people with diabetes")
age.set_xlabel('Age classification')
age.set_ylabel('count')
plt.show()

### Outcomes:
The age distribution is skewed to the left, which means:
- More people in age levels 8-11 have diabetes than in the other levels.
- In every age level, the number of people without diabetes is higher than the number with diabetes.

**hint: Age: 13-level age category** 

* 1 = 18-24

* 9 = 60-64 
* 13 = 80 or older
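For reading the plot, the numeric codes can be turned back into age ranges. The sketch below assumes the levels between the three anchors listed above follow five-year bands, which is consistent with the listed values but should be checked against the codebook; `age_label` is a hypothetical helper, not part of the original notebook.

```python
def age_label(code):
    """Map a 13-level age code to a readable range (assumes five-year bands)."""
    code = int(code)
    if not 1 <= code <= 13:
        raise ValueError(f"age code must be 1-13, got {code}")
    if code == 1:
        return "18-24"
    if code == 13:
        return "80 or older"
    low = 25 + (code - 2) * 5  # code 2 starts at 25, then steps of five years
    return f"{low}-{low + 4}"

# Lookup table for all 13 levels.
labels = {code: age_label(code) for code in range(1, 14)}
print(labels)
```

A table like `labels` could be passed to `Series.map` to relabel the x-axis of the age plot.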



In [None]:
Diabetes_binary_dict = {
    'No Diabetes' : 0,
    'Diabetes' : 1
}

In [None]:
Diabetes.Diabetes_binary.map(Diabetes_binary_dict) 

In [None]:
Diabetes['Diabetes_binary'] =Diabetes.Diabetes_binary.map(Diabetes_binary_dict) 
Diabetes.head()

In [None]:
Diabetes['Diabetes_binary'].value_counts()

In [None]:
HighBP_dict = {
    'No high blood pressure' : 0,
    'Have high blood pressure' : 1
}

In [None]:
Diabetes['HighBP'] = Diabetes.HighBP.map(HighBP_dict)
Diabetes.head()

In [None]:
HighChol_dict = {
    'No high Cholesterol' : 0,
    'Have high Cholesterol' : 1
}
Diabetes['HighChol'] = Diabetes.HighChol.map(HighChol_dict)
Diabetes['HighChol'].value_counts()

In [None]:
Sex_dict = {
    'F' : 0,
    'M' : 1
}
Diabetes['Sex'] = Diabetes.Sex.map(Sex_dict)
Diabetes['Sex'].value_counts()

In [None]:
#change the data type of the remaining columns from float to int
int_columns = ['CholCheck', 'BMI', 'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity',
               'Fruits', 'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
               'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Education', 'Income']
Diabetes = Diabetes.astype({column: 'int' for column in int_columns})

# Preprocessing for modeling


In [None]:
# Keep only the target (Diabetes_binary) and the features explored above: HighBP, HighChol, Sex, Age
Diabetes = Diabetes.drop(columns = ['Smoker','Stroke','HeartDiseaseorAttack','PhysActivity','Fruits','Veggies',
              'HvyAlcoholConsump', 'AnyHealthcare','NoDocbcCost','GenHlth','MentHlth','PhysHlth','DiffWalk','Education','CholCheck','BMI','Income'] )

In [None]:
Diabetes.head()


In [None]:
sns.pairplot(Diabetes,hue='Diabetes_binary',kind='kde'); # KDE pair plot of the remaining features, coloured by diabetes status

# Modeling 
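The model cells in this section use `x_train_scaled`, `x_test_scaled`, and `y_train`, which are not defined anywhere in the notebook as shown. One plausible way to produce them is a stratified train/test split followed by standardisation. The split ratio, the seed, the `make_scaled_split` helper, and the toy data below are all assumptions rather than the author's actual procedure.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def make_scaled_split(df, target="Diabetes_binary", test_size=0.2, seed=42):
    """Split features and target, then scale with statistics from the training set only."""
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the class ratio similar in both parts (the target is imbalanced)
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    scaler = StandardScaler().fit(x_train)
    return scaler.transform(x_train), scaler.transform(x_test), y_train, y_test

# Toy data standing in for the reduced dataset (hypothetical values).
toy = pd.DataFrame({
    "HighBP":   [0, 1] * 10,
    "HighChol": [0, 0, 1, 1] * 5,
    "Sex":      [0, 1, 1, 0] * 5,
    "Age":      list(range(1, 11)) * 2,
    "Diabetes_binary": [0, 0, 0, 1] * 5,
})
x_train_scaled, x_test_scaled, y_train, y_test = make_scaled_split(toy)
print(x_train_scaled.shape, x_test_scaled.shape)
```

Fitting the scaler on the training part only avoids leaking test-set statistics into the model. On the real data the call would be `make_scaled_split(Diabetes)`.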

## KNN 

In [None]:
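from sklearn.neighbors import KNeighborsClassifier
# x_train_scaled and y_train must be defined beforehand (e.g. by a train/test split and feature scaling)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train_scaled, y_train)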

## Logistic regression

In [None]:
# x_train_scaled is an array, not a function, so it is passed without parentheses
lr = LogisticRegression()
lr.fit(x_train_scaled, y_train)

In [None]:
y_pred = lr.predict(x_test_scaled)