
 # **EARLY STAGE DIABETES RISK PREDICTION**
   

![istockphoto-1267723572-612x612.jpg](attachment:ca2b4275-3dd8-4935-a88a-5123a923ea54.jpg)

 # **Introduction**

<span style="font-size:16px;"> Diabetes is a chronic disease that occurs either when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces. Insulin is a hormone that regulates blood glucose. Hyperglycaemia, also called raised blood glucose or raised blood sugar, is a common effect of uncontrolled diabetes and over time leads to serious damage to many of the body's systems, especially the nerves and blood vessels. <br/> <br/>According to the WHO, 8.5% of adults aged 18 years and older had diabetes in 2014. In 2019, diabetes was the direct cause of 1.5 million deaths, and 48% of all deaths due to diabetes occurred before the age of 70 years. Another 460 000 kidney disease deaths were caused by diabetes, and raised blood glucose causes around 20% of cardiovascular deaths (1).<br/> <br/> Between 2000 and 2019, there was a 3% increase in age-standardized mortality rates from diabetes. In lower-middle-income countries, the mortality rate due to diabetes increased by 13%. </span>

> **DATA SOURCE**

<span style="font-size:16px;">This dataset records the signs and symptoms of people who are newly diabetic or who may be developing diabetes. The data file was obtained from Kaggle. </span>

> **OBJECTIVES**

<span style="font-size:16px;">  
    To explore the symptoms that are associated with diabetes <br/>
    To use Exploratory Data Analysis (EDA) methods to show how these symptoms relate to the risk of diabetes<br/>
    To analyze the data and identify the symptoms of people with diabetes<br/>
    To present my final project as a KaggleX BIPOC Mentee 2022, which could help non-ML readers and any Kaggler gain insights
</span>
    

> **RESEARCH QUESTIONS**

<span style="font-size:16px;">  
    What is the distribution of the population of people at risk of having diabetes? <br/>
    Which gender is most likely to have diabetes?<br/>
    Which symptoms contribute most to having diabetes?<br/>
    Does age determine the likelihood of having diabetes? <br/>
    Which age groups are most at risk of having diabetes? <br/>
    Does obesity determine the risk of having diabetes? <br/>
</span>

> **MEDICAL TERMS AND DEFINITIONS**


<span style="font-size:16px;"><b>Polyuria</b> is a condition in which a person urinates more often than usual and passes excessive or abnormally large amounts of urine each time. It is defined as the frequent passage of large volumes of urine: more than 3 litres a day, compared to the normal daily urine output in adults of about 1 to 2 litres. <br/> <br/> <b>Polydipsia</b> is the medical name for the feeling of extreme thirstiness. Polydipsia is often linked to urinary conditions that cause you to urinate a lot. This can make your body feel a constant need to replace the fluids lost in urination. It can also be caused by physical processes that cause you to lose a lot of fluid.<br/> <br/> <b>Genital Thrush (or candidiasis)</b> is a common condition caused by a type of yeast called Candida. It mainly affects the vagina, though it may affect the penis too, and can be irritating and painful. Many types of yeast and bacteria naturally live in the vagina and rarely cause problems. <br/> <br/> <b>Partial Paresis</b>: paresis involves the weakening of a muscle or group of muscles. It may also be referred to as partial or mild paralysis. Unlike paralysis, people with paresis can still move their muscles. These movements are just weaker than normal. <br/> <br/> <b>Polyphagia</b>, also known as hyperphagia, is the medical term for excessive or extreme hunger. It is different from having an increased appetite after exercise or other physical activity. While your hunger level will return to normal after eating in those cases, polyphagia won't go away if you eat more food. <br/> <br/> <b>Alopecia</b> Areata is a condition that causes hair to fall out in small patches, which can be unnoticeable.</span>

<span style = "font-size:20px; font-weight: bold;"> DATASET INFORMATION:</span>

* **Age** 16-90 (years)

* **Sex** (column `Gender`) 1. Male, 2.Female

* **Polyuria** 1.Yes, 2.No.

* **Polydipsia** 1.Yes, 2.No.

* **sudden weight loss** 1.Yes, 2.No.

* **weakness** 1.Yes, 2.No.

* **Polyphagia** 1.Yes, 2.No.

* **Genital thrush** 1.Yes, 2.No.

* **visual blurring** 1.Yes, 2.No.

* **Itching** 1.Yes, 2.No.

* **Irritability** 1.Yes, 2.No

* **delayed healing** 1.Yes, 2.No.

* **partial paresis** 1.Yes, 2.No.
 
* **muscle stiffness** 1.Yes, 2.No

* **Alopecia** 1.Yes, 2.No.

* **Obesity** 1.Yes, 2.No.

* **Class** 1.Positive, 2.Negative.

# **Data Preparation**

<span style="font-size:18px; font-weight: bold;">Import Libraries </span>

In [None]:
#import libraries

import matplotlib.pyplot as plt  #data visualization
import seaborn as sns
import pandas as pd #data processing
import numpy as np #linear algebra
from collections import Counter
from sklearn.feature_selection import mutual_info_classif
from sklearn import preprocessing
import random
import spacy

import warnings # mute warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_colwidth', None)

print('imported libraries are loaded')

<span style="font-size:18px; font-weight: bold;">Obtain data file path</span>

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


<span style="font-size:18px; font-weight: bold;">Load data </span>

In [None]:
# read the csv file using pandas library and store it as diab_data
diab_data=pd.read_csv('/kaggle/input/early-stage-diabetes-risk-prediction-dataset/diabetes_data_upload.csv')

<span style="font-size:18px; font-weight: bold;">View data</span>

<span style="font-size:18px; font-weight: bold">Preview the first five rows of the dataset</span>

In [None]:
diab_data.head()

<span style="font-size:18px; font-weight: bold;">Preview the last five rows of the dataset</span>

In [None]:
diab_data.tail()

# **Data Exploration**

<span style="font-size:18px;">Preview the column names of the dataset.</span>

In [None]:
diab_data.columns #this checks the columns names

In [None]:
diab_data.shape #check the number of rows and columns in the dataset

<span style="font-size:18px; font-weight: bold;">Observation:</span>

<span style="font-size:20px;">There are 520 rows and 17 columns in the dataset.</span>

In [None]:
diab_data.dtypes #this checks the datatypes of the features in the dataset

<span style="font-size:18px; font-weight: bold;">Observation:</span>

<span style="font-size:20px;">All features in the dataset are of the object data type, except Age, which is an integer (numeric) data type. </span>

<span style="font-size:18px; font-weight: bold;">Statistical information of the dataset</span>

In [None]:
diab_data.info()

<span style="font-size:18px; font-weight: bold;">Descriptive analysis of the dataset</span>

In [None]:
diab_data.describe(include='all')

<span style="font-size:18px; font-weight: bold;">Observations:</span>

<span style="font-size:18px;">All features except <b>Age</b> are categorical, and every symptom feature is binary <b>(Yes/No)</b>. <br/>
 The dataset has 520 observations for analysis.<br/>
The average age of individuals in the dataset is about 48.<br/> 
Ages in the dataset range from 16 to 90. <br/>
Males are the majority, with 328 occurrences in the dataset. <br />
    There are more individuals at risk of being diabetic <b>(class = Positive)</b> than not in the dataset.</span>

# **Data Cleaning**

<span style="font-size:18px; font-weight: bold;">Check null values/missing data</span>

In [None]:
diab_data.isnull().sum()

<span style="font-size:18px; font-weight: bold;">Observation:</span>

<span style="font-size:20px;">There are no missing values in the dataset.</span>

<span style="font-size:20px; font-weight: bold;">Check for duplicates</span>

In [None]:
diab_data.duplicated().sum()

<span style="font-size:18px; font-weight: bold;">Preview the first five duplicated rows in the dataset</span>

In [None]:
diab_data[diab_data.duplicated()].head()

<span style="font-size:18px; font-weight: bold;">Observation:</span>

<span style="font-size:20px;">There are duplicates in the dataset. The dataset has no patient identifier and most features are binary, so two different individuals can legitimately produce identical rows. We will therefore not remove them. </span>
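To look at duplicates more closely, a minimal sketch on a small made-up DataFrame may help (the rows below are illustrative and not taken from the dataset). It shows the difference between counting repeats and listing every row that belongs to a duplicated group with `keep=False`.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "Age": [40, 40, 58, 40],
    "Gender": ["Male", "Male", "Female", "Male"],
    "Polyuria": ["No", "No", "Yes", "No"],
})

# duplicated() flags repeats of an earlier row; the first occurrence is not flagged
n_repeats = int(toy.duplicated().sum())

# keep=False flags every row that belongs to a duplicated group
all_members = toy[toy.duplicated(keep=False)]

print(n_repeats)
print(all_members)
```

Listing all members of a group makes it easier to judge whether identical rows look like repeated entries or like different people with the same profile.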

# **Feature Engineering**



<span style="font-size:20px;">Create a copy of the dataset and a new feature, <b>age group</b>.<br/>This gives me a clearer view of the age groups of people; plotting the individual ages would make the visualizations crowded. </span>

In [None]:
diab_data1 = diab_data.copy()

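def age(i):
    # Map an age to a 10-year bracket label such as '40-50' (lower bound inclusive)
    for x in range(10,100,10):
        if i<x:
            return f'{x-10}-{x}'
    # Ages of 90 and above match no bracket in the loop; without this line they would get None
    return '80-90'

diab_data1['Age_group'] = diab_data['Age'].apply(age)
diab_data1.head()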

<span style="font-size:18px; font-weight: bold;"> Observations:</span>

<span style="font-size:18px;"> <b>Age_group</b> feature has been created as a feature of the dataset. This could be seen from the first five rows of the dataset previewed above.</span>
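As an alternative way to build the same kind of feature, here is a minimal sketch using `pd.cut` on a few made-up ages (the values are illustrative only). The bracket edges and labels are my own choice. With half-open brackets (`right=False`), an age of exactly 90 is labelled `90-100`; the edges can be adjusted if the top bracket should absorb it.

```python
import pandas as pd

# Toy ages (illustrative values only)
ages = pd.Series([16, 35, 48, 60, 90], name="Age")

# Bracket edges 10, 20, ..., 100 and labels such as '10-20'
edges = list(range(10, 101, 10))
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# right=False makes each bracket include its lower edge and exclude its upper edge
age_group = pd.cut(ages, bins=edges, right=False, labels=labels)
print(age_group)
```

One advantage of `pd.cut` is that the result is an ordered categorical, so the brackets keep their natural order in tables and plots.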

# **Data Visualizations**

In [None]:
plt.figure(figsize=(12,6))
plt.hist(diab_data['Age'], bins =40)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age distribution with histogram')
plt.show()

<span style="font-size:20px; font-weight: bold;"> Observations: </span>

<span style="font-size:18px;">From the chart, we have <b>Age</b> on the x-axis and <b>Frequency of occurrence</b> on the y-axis. <br/>    
 The ages of people in the dataset are predominantly between 25 and 75 years. <br/>    
 We have few individuals less than 20 years and more than 75 years of age.</span>

In [None]:
diab_data.Age.value_counts().sort_values(ascending = False).to_frame() #value counts

In [None]:
diab_data1.Age_group.value_counts().sort_values(ascending = False).to_frame()

<span style="font-size:18px; font-weight: bold;">Observation:</span>

<span style="font-size:18px;">Most individuals in the dataset are between the ages of 35 and 68.</span>

<span style="font-size:18px; font-weight: bold;">Distribution of Categorical features in the dataset</span>

In [None]:
def bar_plot(variable):
    """
     input: variable
     output: barplot & value count
     """
    var = diab_data[variable]
    
    varValue = var.value_counts()
    #visualize
    plt.figure(figsize=(9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel('Frequency')
    plt.title(variable)
    plt.show()
    
    print("{}\n{}".format(variable, varValue))



In [None]:
categorical =['Gender', 'Polyuria', 'Polydipsia','sudden weight loss', 'weakness', 'Polyphagia', 'Genital thrush','visual blurring','Itching', 'Irritability', 'delayed healing', 'partial paresis', 'muscle stiffness', 'Alopecia', 'Obesity','class']

for c in categorical:
    bar_plot(c)

<span style="font-size:18px; font-weight: bold;">Observations:</span>
    
<span style="font-size:18px; font-weight: bold;">We were able to deduce the counts of the features in the dataset;</span>

<span style="font-size:18px;">There are more males than females in the population.<br/>
 The numbers of individuals with polyuria, polydipsia and sudden weight loss are slightly lower than the numbers of individuals not showing these symptoms.<br/> Individuals in the positive class (at risk of diabetes) outnumber those in the negative class.</span>


<span style="font-size:20px; font-weight: bold;">Age Distribution</span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">Histogram distribution of all features in the dataset versus Target variable (class)</span></p></div>

In [None]:
def plot_hist(variable):
    
    plt.figure(figsize=(9,3))
    sns.histplot(data=diab_data, x = variable, hue="class", multiple="dodge", shrink=.8, palette='Set1')
    plt.xlabel(variable)
    plt.ylabel('Frequency')
    plt.title('Histogram of {} vs class'.format(variable))
    plt.show()

    

In [None]:
Var = ['Age','Gender', 'Polyuria', 'Polydipsia','sudden weight loss', 'weakness', 'Polyphagia', 'Genital thrush','visual blurring','Itching', 'Irritability', 'delayed healing', 'partial paresis', 'muscle stiffness', 'Alopecia', 'Obesity']
for n in Var:
    plot_hist(n)
        

<span style="font-size:18px; font-weight: bold;">Observation:</span>

<span style="font-size:18px; font-weight: bold;">From the visualization, we observed that:</span>

<span style="font-size:18px;">As people grow older, they appear more prone to being diabetic in this dataset. This suggests that <b>Age</b> is one of the major factors associated with diabetes. A healthy lifestyle is advised as one grows older.<br/>
Females are seen to be more susceptible to being diabetic. Therefore, early medical check-ups are advised to prevent its occurrence.<br/><br/>
The proportion of positive cases is higher among people exhibiting the following symptoms:<br/>
    1) <b>Polyuria (excessive and frequent urine).</b><br/>
    2) <b>Polydipsia (extreme thirstiness).</b><br/>
    3) <b>Sudden weight loss (reduction in body mass).</b><br/>
    4) <b>Weakness (frequent tiredness).</b><br/>
    5) <b>Polyphagia (extreme and excessive hunger).</b><br/>
Therefore, if anyone exhibits one or more of these symptoms, a proper medical screening is advised to avert the risk of being diabetic.<br/></span>

In [None]:
#this cell plots the class distribution using a pie chart
class_counts = diab_data['class'].value_counts()

explode = (0, 0.2) #pull the second slice out slightly
labels = class_counts.index
sizes = class_counts.values
fig, ax = plt.subplots(1,1, figsize=(9,5))
plt.pie(sizes, labels=labels,explode = explode, autopct='%1.1f%%',shadow=True, startangle=90)
plt.axis("equal")
plt.title("Distribution of Population with Diabetes ",fontsize=26)
plt.show()

<span style="font-size:18px; font-weight: bold;">Observation</span>

<span style="font-size:18px;">About 62% of the people in the dataset are in the positive class (at risk of being diabetic), which is more than the negative class.</span>

<span style="font-size:18px; font-weight: bold;">Age in relation to gender</span>

In [None]:
sns.boxplot(x="Gender", y='Age', data = diab_data)

<span style="font-size:22px; font-weight: bold;">Observation</span>

<span style="font-size:18px;">Females are older than males in the dataset</span>

<span style="font-size:20px; font-weight: bold;">Age Distribution</span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">Which age_group has the high risk of being diabetic?</span></p></div>

In [None]:
plt.figure(figsize=(12,8))
ax = sns.countplot(x = 'Age_group', data = diab_data1, hue ='class', order=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90'])

for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()), (p.get_x()+.1, p.get_height()+1))
    
plt.title("Age Distribution with Class")
plt.show()

<span style="font-size:22px; font-weight: bold;">Observation</span>

<span style="font-size:18px;">We observed that patients in the age groups from 30 to 60 are more prone to be diabetic.</span>


<span style="font-size:20px; font-weight: bold;">Analysis Based on Gender</span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">Which Gender has the high risk of being diabetic? </span></p></div>

In [None]:
diab_data.groupby('Gender')['class'].value_counts().unstack()

In [None]:
def annotate_percent(ax, column, no_of_col_cat, no_hue_cat):
    # Annotate each bar with its share (in %) of its own x-axis category.
    # `column` is kept so existing calls still work; totals now come from the bars themselves,
    # so they no longer depend on the sort order of value_counts().
    patch = [p for p in ax.patches]
    a = [p.get_height() for p in patch]
    
    for i in range(no_of_col_cat):
        # assumes bars are stored hue by hue, so bar (i, j) sits at index j*no_of_col_cat + i
        total = sum(a[j*no_of_col_cat + i] for j in range(no_hue_cat))
        for j in range(no_hue_cat):
            bar = patch[j*no_of_col_cat + i]
            percentage = '{:.1f}%'.format(100 * bar.get_height()/total)
            x = bar.get_x() + bar.get_width() /1.8 - 0.07
            y = bar.get_y() + bar.get_height()
            ax.annotate(percentage, (x, y), size = 10)

In [None]:
#Set color palette for the chart
palette = sns.color_palette("Set1")

#Use a countplot to plot the chart
plt.rcParams["figure.figsize"] = (12,8)
#the palette sets the hue colors, so a separate color argument is not needed
g = sns.countplot(data= diab_data1, x='Gender',hue='class', palette=palette, saturation=0.8, linewidth=1)
plt.title("Class Distribution vs Gender", pad=20, fontsize = 26)
plt.xlabel("Gender", labelpad=20)
plt.ylabel("Count", labelpad=20)
annotate_percent(g, diab_data1.Gender, 2, 2)

<span style="font-size:20px; font-weight: bold;">Observation:<br/></span>

<span style="font-size:20px; ">From the visualization, we have <b>Gender</b> on the x-axis and the <b>count of occurrences</b> on the y-axis.<br/> 
   We observed that females are more prone to be diabetic: over 90% of females are in the positive class, against ~45% of males.<br/>
</span>
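The per-gender percentages can also be read from a table instead of the chart annotations. Below is a minimal sketch using `pd.crosstab` with `normalize="index"`; the counts in the toy DataFrame are made up for illustration and are not the dataset's counts.

```python
import pandas as pd

# Toy data (made-up counts, for illustration only)
toy = pd.DataFrame({
    "Gender": ["Female"] * 10 + ["Male"] * 20,
    "class": ["Positive"] * 9 + ["Negative"] * 1
             + ["Positive"] * 9 + ["Negative"] * 11,
})

# normalize="index" divides each row by its row total,
# so every gender's shares add up to 100%
share = pd.crosstab(toy["Gender"], toy["class"], normalize="index") * 100
print(share.round(1))
```

Normalizing by row answers "what share of each gender is positive?", which is the question asked here; normalizing by column would answer a different one.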



<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">Which age_group are more in the population?</span></p></div>

In [None]:


plt.figure(figsize=(12,8))
ax = sns.countplot(x = 'Age_group', data = diab_data1, hue ='Gender', order=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90'])

for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()), (p.get_x()+.1, p.get_height()+1))
    
plt.title("Age Distribution with Gender")
plt.show()

<span style="font-size:20px; font-weight: bold;">Observation:<br/></span>

<span style="font-size:20px;">From the visualization, we observed that there are more older males than older females in the population.
</span>


<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">What is the age distribution of people with polyuria?</span></p></div>

In [None]:
plt.figure(figsize=(12,8))
ax = sns.countplot(x = 'Age_group', data = diab_data1, hue ='Polyuria', order=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90'])

for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()), (p.get_x()+.1, p.get_height()+1))
    
plt.title("Age Distribution with Polyuria")
plt.show()

<span style="font-size:20px; font-weight: bold;">Observation:<br/></span>

<span style="font-size:20px; ">From the visualization, we observed that most people with polyuria (a condition in which a person urinates more often than usual and passes excessive or abnormally large amounts of urine each time) are between 30 and 70 years old. It is therefore advised that people in these age brackets carry out regular medical check-ups to rule out the risk of being diabetic. <br/>
</span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">What is the age distribution of people with polydipsia?</span></p></div>

In [None]:
plt.figure(figsize=(12,8))
ax = sns.countplot(x = 'Age_group', data = diab_data1, hue ='Polydipsia', order=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90'])

for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()), (p.get_x()+.1, p.get_height()+1))
    
plt.title("Age Distribution with Polydipsia")
plt.show()

<span style="font-size:20px; font-weight: bold;">Observation:<br/></span>

<span style="font-size:20px; ">From the visualization, we observed that most people with polydipsia (the feeling of extreme thirstiness) are between 30 and 70 years old. Therefore, people in these age brackets are advised to carry out regular medical check-ups to rule out the risk of being diabetic. <br/>
</span>

In [None]:
ax = sns.countplot(x=diab_data["class"]) #displaying the status distribution of target feature
for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()), (p.get_x()+.4, p.get_height()+1))
plt.title("Class Distribution");


# **Data Transformation**

<span style="font-size:18px;">Change the values of the features to <b>0s and 1s</b> using the LabelEncoder class.</span>

In [None]:
label_encoder = preprocessing.LabelEncoder()

for column in diab_data.columns[1:]:
    diab_data[column] =  label_encoder.fit_transform(diab_data[column])
    
diab_data.head()

<span style="font-size:20px; font-weight: bold;">Observation:<br/></span>

<span style="font-size:20px; ">From the table above, we observed that the values of the features have been changed to <b> 0s and 1s</b> for the model to understand it better. <br/>
</span>
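To make it explicit which value becomes 0 and which becomes 1, here is a minimal sketch that encodes a toy DataFrame with hand-written mappings. The rows are made up for illustration. The mappings mirror the alphabetical order that `LabelEncoder` uses for its classes (for example `Female` before `Male`, `No` before `Yes`).

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "Yes"],
    "class": ["Positive", "Positive", "Negative"],
})

# Explicit mappings, written in the same alphabetical order LabelEncoder would assign
mappings = {
    "Gender": {"Female": 0, "Male": 1},
    "Polyuria": {"No": 0, "Yes": 1},
    "class": {"Negative": 0, "Positive": 1},
}

encoded = toy.copy()
for column, mapping in mappings.items():
    encoded[column] = encoded[column].map(mapping)

print(encoded)
```

Writing the mappings out keeps the meaning of each code visible, which matters later when reading the sign of a correlation.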

In [None]:
neg = len(diab_data[diab_data['class']==0])
pos = len(diab_data[diab_data['class']==1])

pct_of_neg = neg/(neg+pos)
print("percentage of negative class is", pct_of_neg*100)
pct_of_pos = pos/(neg+pos)
print("percentage of positive class", pct_of_pos*100)

<span style="font-size:20px; font-weight: bold;">Observation:</span>

<span style="font-size:18px;">The dataset is imbalanced: <b>~38%</b> of the observations are in the negative class (not diabetic) and <b>~62%</b> are in the positive class.</span>
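The same percentages can be obtained in one step with `value_counts(normalize=True)`. The sketch below uses a small made-up series of encoded labels (8 positives and 5 negatives, chosen only for illustration).

```python
import pandas as pd

# Toy encoded labels (1 = positive, 0 = negative); made up for illustration
toy_class = pd.Series([1] * 8 + [0] * 5, name="class")

# normalize=True returns proportions instead of counts
pct = toy_class.value_counts(normalize=True).mul(100).round(1)
print(pct)
```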

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">What is the average age of people with diabetes?</span></p></div>

In [None]:
diab_data.groupby('class').mean()

<span style="font-size:20px; font-weight: bold;">Observation:</span>

<span style="font-size:20px;">The average age of patients with diabetes is <b>49</b>. </span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">What is the average age and the gender more prone to be diabetic?</span></p></div>

In [None]:
diab_data.groupby('Gender').mean()

<span style="font-size:20px; font-weight: bold;">Observation:</span>

<span style="font-size:20px;">Females are more prone to be diabetic at the average age of <b>47</b>.</span>

In [None]:
diab_data.groupby('Age').mean()

# **Correlation**

In [None]:
diab_data.corr()

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(diab_data.corr(),cmap="Blues", annot=True)
plt.title('Correlation Heatmap\n',fontweight='bold',fontsize=14)
plt.show()

<span style="font-size:20px; font-weight: bold;">Observations:<br/></span>

<span style="font-size:20px;">From the visualization:<br/> 1) <b>polyuria</b> and <b>polydipsia</b> have relatively strong positive correlations with <b>class.</b><br/>
    2) <b>sudden weight loss</b> and <b>partial paresis</b> have a slight correlation with <b>class.</b> <br/>
    3) <b>gender</b> has a negative (inverse) correlation with <b>class.</b> Because the encoder assigns Female = 0 and Male = 1, this means males are less often in the positive class.<br/>
    4) <b>polyuria</b> and <b>polydipsia</b> have slight correlations with <b>weakness</b> and <b>partial paresis.</b></span>
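Since almost all features are 0/1 after encoding, the Pearson correlation between two of them is the same quantity as the phi coefficient of their 2x2 contingency table. The sketch below checks this on two made-up binary vectors (illustrative values, not dataset columns).

```python
import numpy as np

# Toy binary vectors (made up for illustration)
x = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y = np.array([1, 1, 0, 0, 0, 1, 1, 0])

# Cells of the 2x2 contingency table
n11 = np.sum((x == 1) & (y == 1))
n10 = np.sum((x == 1) & (y == 0))
n01 = np.sum((x == 0) & (y == 1))
n00 = np.sum((x == 0) & (y == 0))

# Phi coefficient: (n11*n00 - n10*n01) / sqrt(product of the four marginal totals)
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)

# Pearson correlation of the same two vectors
pearson = np.corrcoef(x, y)[0, 1]
print(phi, pearson)
```

This is why the heatmap values for symptom pairs can be read as a measure of how often two symptoms occur together, beyond what their individual frequencies would suggest.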

# **Feature Ranking with mutual information**

In [None]:

X = diab_data.copy()
y = X.pop('class')

def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    color = np.array(["C0"] * scores.shape[0])
    # Create plot
    plt.barh(width, scores, color=color)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")
    
def make_mi_scores(X, y):
    # random_state makes the estimate reproducible between runs
    mi_scores = mutual_info_classif(X, y, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y)
print(mi_scores)  # show the features with their MI scores
plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores.head(20))

<span style="font-size:20px; font-weight: bold;">Observations:<br/></span>

<span style = 'font-size: 18px;'> The chart above shows the features that are most informative about whether an individual is at risk of having diabetes. We are going to consider the nine most important features / symptoms.<br/>
They are  <b>Polydipsia, Polyuria, Partial paresis, Age, Sudden weight loss, Weakness, Irritability, Alopecia, Gender</b>.</span>
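To give an intuition for what a mutual information score measures, here is a minimal sketch that computes it by hand for two discrete sequences. The helper `mutual_information` and the toy values are my own, for illustration. scikit-learn's `mutual_info_classif` may use a different estimator depending on its settings, so its scores need not match this formula exactly.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # counts of each x value
    py = Counter(ys)              # counts of each y value
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy binary symptom and two toy targets (made up for illustration)
symptom = [1, 1, 0, 0]
same = mutual_information(symptom, [1, 1, 0, 0])       # target identical to the symptom
unrelated = mutual_information(symptom, [1, 0, 1, 0])  # target independent of the symptom
print(same, unrelated)
```

A score near zero means the feature tells us almost nothing about the class; larger scores mean that knowing the feature reduces the uncertainty about the class more.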

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">CONCLUSION</span></p></div>
    
Thus far, we have been able to do the following:
* Load some modules and clean the data.
* Visualize the relationships between the predictor and target variables.
* Visualize the feature correlations.
* Visualize the feature rankings with mutual information.

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
<span style="font-size:18px;">
NEXT ACTION?
    </span>
    </p>
    </div>
    
*   To prepare the data for building machine learning models.
*   To build models using different machine learning algorithms.
*   To identify the features that affect model performance (feature importances).
*   To evaluate the performance of the models using recall.
*   To learn and create simple decision rules that enable non-ML medical providers to make a diagnosis.
*   To examine the probability and likelihood of the occurrence of early stage diabetes (ESD).
*   To estimate the future likelihood of ESD based on changes to current features.
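Since recall is the metric planned for evaluation, here is a minimal sketch of how it is computed: the share of truly positive cases that a model also predicts as positive. The helper `recall` and the toy labels are hypothetical and only illustrate the formula; no model has been built in this notebook yet.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy labels and predictions (made up for illustration)
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

score = recall(y_true, y_pred)
print(score)
```

Recall is a natural choice for screening problems like this one, because a missed positive case (a false negative) is usually more costly than a false alarm.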



<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    <span style="font-size:18px;">CREDITS</span></p></div>

Many were called but few were chosen. I would like to thank the organizers of the KaggleX BIPOC Mentorship Programme Cohort 2022 for finding me worthy to be part of this programme. The programme has given me the opportunity to meet amazing data scientists and engineers from around the world. I appreciate my mentors, **Dr. Olaitan Olaleye and Mani Sarkar**, for their resilience in ensuring that more knowledge was imparted to me. **Mani** taught me that machine learning is not just about deploying models but about understanding your data (data analysis). I also learnt how to use a Trello board and how to make my notebook understandable to non-ML individuals. **Dr. Olaitan** introduced me to building models for real-world applications and creating simple decision rules. I am also grateful to my fellow mentees for their contributions: **Sesugh, Chiemela, Florence, Patience and Vannia.**

# **REFERENCES**

**About Diabetes-** https://www.who.int/news-room/fact-sheets/detail/diabetes

**Factors that cause diabetes -** https://my.clevelandclinic.org/health/diseases/7104-diabetes

**Beautifying Jupyter Notebook -**  www.kaggle.com/code/shubhamksingh/create-beautiful-notebooks-formatting-tutorial/notebook

**How to make clean visualizations -** www.kaggle.com/code/gaetanlopez/how-to-make-clean-visualizations/notebook
 
**Kaggle Global Outreach Notebook by Mani Sarkar-** https://www.kaggle.com/code/neomatrix369/kaggle-global-outreach-

**Tweet Sentiment Extraction Analysis Notebook by Mani Sarkar-** https://www.kaggle.com/code/neomatrix369/fastchai-tweet-sentiment-extraction-analysis

