<a id="0"></a>
## Introduction
Anemia is a condition in which the body does not have enough healthy red blood cells. Red blood cells provide oxygen to body tissues.

Different types of anemia include:

* Anemia due to vitamin B12 deficiency
* Anemia due to folate (folic acid) deficiency
* Anemia due to iron deficiency
* Anemia of chronic disease
* Hemolytic anemia
* Idiopathic aplastic anemia
* Megaloblastic anemia
* Pernicious anemia
* Sickle cell anemia
* Thalassemia

<font color='blue'>
Contents:

1. [Load and Check Data](#1)
   * [Correcting Column Names](#2)
   * [Extracting from the Data](#4)
   * [Changing Data Type](#3)
   * [Grouping the Data](#5)
       * [Setting the Scale Gaps of Features](#10)
2. [Plotting](#6)
3. [Creating Model and Testing](#7)
   * [Creating Train and Test Data](#8)
   * [Testing the Model](#9)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score,confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from IPython.display import Image
import os

plt.style.use("seaborn-v0_8-darkgrid")
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# import warnings
import warnings
# filter warnings
warnings.filterwarnings('ignore')

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

---

<a id="1"></a>
# Load and Check Data

We load the data from the CSV file and take a first look at it.

In [None]:
data = pd.read_csv("/kaggle/input/anemia-diagnosis-dataset/CBC data_for_meandeley_csv.csv")

In [None]:
data.head(5)

In [None]:
data.info()

In [None]:
data.describe()

---

# Manipulate the Data

<a id="2"></a>
## Correcting Column Names
Here we remove the spaces from the column names. We use `str.replace` rather than `strip`, so spaces inside a name are removed as well as those at the beginning and end.

In [None]:
data = data.copy()

In [None]:
print(data.columns)

In [None]:
data.columns = data.columns.str.replace(' ', '')
print(data.columns)
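
The difference between trimming and removing spaces is easiest to see on a small example. The sketch below uses made-up column names (an assumption, not the dataset's actual raw headers) to compare `str.strip` with `str.replace`:

```python
import pandas as pd

# Hypothetical column names, chosen only to illustrate the two methods.
toy = pd.DataFrame(columns=[" Age ", "PLT /mm3", "HGB"])

# str.strip() removes only leading and trailing whitespace.
stripped = list(toy.columns.str.strip())

# str.replace(" ", "") removes every space, including internal ones.
replaced = list(toy.columns.str.replace(" ", ""))

print(stripped)
print(replaced)
```

Only the second variant turns a name with an internal space into one that can be typed without it.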

---

<a id="4"></a>
## Extracting from the Data


We remove data from our data frame that is not directly related to the disease. To apply the reference ranges for blood values, we need only the appropriate records, so we update our data, starting by dropping rows with missing values.

In [None]:
filtered_data = data.dropna(ignore_index=True)
data=filtered_data.copy()

In [None]:
filtered_data.describe()

In [None]:
filtered_data.Age[filtered_data.Age<18]

The test we use only works for adults, and the number of children is small, so we can remove them from the data.

Size of the full data set:

In [None]:
len(filtered_data)

We remove patients who are younger than 18.

In [None]:
data.drop(filtered_data[filtered_data['Age'] < 18].index,inplace=True)
filtered_data.drop(filtered_data[filtered_data['Age'] < 18].index,inplace=True)
filtered_data.reset_index(inplace=True)
data.reset_index(inplace=True)

Size of the data set after removing them:

In [None]:
len(filtered_data)
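
As an illustration of the same filtering step, here is a minimal sketch on an invented toy frame (the values are assumptions, not rows from the dataset). It keeps adults with a boolean mask instead of `drop`, and renumbers the rows in one step:

```python
import pandas as pd

# Toy data standing in for the real frame; the values are invented.
toy = pd.DataFrame({
    "Age": [12, 25, 17, 40, 66],
    "HGB": [11.0, 13.5, 12.1, 9.8, 14.2],
})

# Keep adults with a boolean mask, then renumber the rows from 0.
# drop=True discards the old index instead of adding it as an "index" column.
adults = toy[toy["Age"] >= 18].reset_index(drop=True)

print(len(toy), len(adults))
print(adults)
```

The cells above call `reset_index` without `drop=True`, so the old index is kept as an `index` column; that column is dropped later together with the other unused columns.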

---

<a id="3"></a>
## Changing Data Type


In [None]:
filtered_data.info()

Here we change the data type of the data frame to float.

In [None]:
filtered_data = filtered_data.astype("float64")

In [None]:
filtered_data.info()

---

<a id="5"></a>
## Grouping the Data
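Here we divide each blood-value feature into below-range, in-range, and above-range groups (coded 0, 1, and 2), using the reference ranges from this site:
https://www.compsim.com/demos/d60/Anemia.htm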

In [None]:
filtered_data.columns

<a id="10"></a>
### Setting the Scale Gaps of Features.

In [None]:
filtered_data["HGB"]=[(2 if filtered_data["HGB"][item] >15 else 0 if filtered_data["HGB"][item] < 12 else 1) if filtered_data["Sex"][item] == 0.0 else (2 if filtered_data["HGB"][item] >14 else 0 if filtered_data["HGB"][item] < 11 else 1) for item in range(len(filtered_data))]

In [None]:
filtered_data["PCV"]=[(2 if filtered_data["PCV"][item] >48.6 else 0 if filtered_data["PCV"][item] < 38.3 else 1) if filtered_data["Sex"][item] == 0.0 else (2 if filtered_data["PCV"][item] >44.9 else 0 if filtered_data["PCV"][item] < 35.5 else 1) for item in range(len(filtered_data))]

In [None]:
filtered_data["MCHC"]=[2 if item >38 else 0 if item < 30 else 1 for item in filtered_data["MCHC"]]

In [None]:
filtered_data["MCH"]=[2 if item >35 else 0 if item < 25 else 1 for item in filtered_data["MCH"]]

In [None]:
filtered_data["RDW"]=[2 if item >16 else 0 if item < 12.5 else 1 for item in filtered_data["RDW"]]

In [None]:
filtered_data["PLT/mm3"]=[2 if item >450 else 0 if item < 140 else 1 for item in filtered_data["PLT/mm3"]]

In [None]:
filtered_data["RBC"]=[2 if item >5.8 else 0 if item < 4 else 1 for item in filtered_data["RBC"]]

In [None]:
filtered_data["MCV"]=[2 if item >115 else 0 if item < 75 else 1 for item in filtered_data["MCV"]]

In [None]:
filtered_data["TLC"]=[2 if item >11 else 0 if item < 3 else 1 for item in filtered_data["TLC"]]

In [None]:
filtered_data["Age"]=[3 if item >64 else 2 if item > 48 else 1 if item>32 else 0 for item in filtered_data["Age"]]
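
The blood-value cells above all follow one pattern: 0 below the lower bound, 2 above the upper bound, 1 otherwise. As a rough sketch, the pattern could be written once with `np.select`. The helper name `code_range` and the sample values are assumptions made for illustration; the thresholds are the ones used above.

```python
import numpy as np
import pandas as pd

def code_range(values, low, high):
    """Return 0 where values < low, 2 where values > high, and 1 otherwise."""
    values = np.asarray(values, dtype=float)
    return np.select([values < low, values > high], [0, 2], default=1)

# Invented sample values for illustration only.
toy = pd.DataFrame({
    "Sex": [0.0, 0.0, 1.0, 1.0],
    "HGB": [11.5, 15.5, 11.5, 14.5],
    "MCV": [70.0, 90.0, 120.0, 75.0],
})

# A feature with a single range for everyone.
mcv_code = code_range(toy["MCV"], low=75, high=115)

# A sex-dependent feature: pick the bounds row by row (Sex == 0.0 is male).
is_male = (toy["Sex"] == 0.0).to_numpy()
hgb_low = np.where(is_male, 12, 11)
hgb_high = np.where(is_male, 15, 14)
hgb_code = code_range(toy["HGB"], hgb_low, hgb_high)

print(mcv_code, hgb_code)
```

Because the bounds are passed as arrays, the same helper covers both the plain and the sex-dependent ranges. Values exactly on a bound count as in range, as in the cells above.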

We drop the columns that are not directly related to the disease.

In [None]:
filtered_data2=filtered_data.drop(["Age","Sex","S.No.","index"],axis=1,inplace=False)

In [None]:
filtered_data2.head()

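We use this function to diagnose patients according to the above intervals: a patient is marked as diseased when the given feature falls outside its normal range (coded 0 or 2).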

In [None]:
def find_diagnose(feature):
    filtered_data2["diseased"]=[1 if filtered_data2[feature][item] == 0 or filtered_data2[feature][item] == 2 else filtered_data2["diseased"][item] for item in range(len(filtered_data2))]


In [None]:
filtered_data2["diseased"] = 0
for item in filtered_data2.columns:
    if(item!="diseased"):
        find_diagnose(item)
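
For comparison, the same labeling rule can be sketched without an explicit loop. The toy codes below are invented; the assumption is that every column holds only the range codes 0, 1, and 2, as in `filtered_data2`.

```python
import pandas as pd

# Invented range codes: 0 = below range, 1 = within range, 2 = above range.
codes = pd.DataFrame({
    "HGB": [1, 0, 1, 1],
    "MCV": [1, 1, 2, 1],
    "RBC": [1, 1, 1, 1],
})

# A patient is flagged when any feature is coded 0 or 2, i.e. differs from 1.
diseased = codes.ne(1).any(axis=1).astype(int)

print(diseased.tolist())
```

This gives the same labels as applying `find_diagnose` to each column in turn, since a row ends up as 1 as soon as one feature is out of range.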

---

<a id="6"></a>
# Plotting

We add the newly derived `diseased` feature to our earlier data frames so that we can plot it and examine the values.

In [None]:
data["diseased"]=filtered_data2["diseased"]
filtered_data["diseased"]=filtered_data2["diseased"]
data = data.astype("float64")

In [None]:
data.head()

In [None]:
corr=data.corr().drop(['S.No.', 'Age', 'Sex', 'RBC', 'PCV', 'MCV', 'MCH', 'MCHC', 'RDW', 'TLC',
       'PLT/mm3', 'HGB'],axis=1)
corr=corr.drop(["diseased",'S.No.'],axis=0)

In [None]:
corr.head(corr.size)

Correlation of each feature with the disease label.

In [None]:
f,ax = plt.subplots(figsize=(5, 5))
sns.heatmap(corr, annot=True, linewidths=.5, fmt= '.5f',ax=ax,)
plt.show()
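
The column of correlations with the label can also be pulled straight out of the correlation matrix. This is a small sketch on invented numbers, not the notebook's data:

```python
import pandas as pd

# Invented values; lower HGB goes with diseased == 1 here.
toy = pd.DataFrame({
    "HGB": [10.0, 11.0, 13.0, 14.0, 15.0],
    "MCV": [80.0, 85.0, 82.0, 90.0, 88.0],
    "diseased": [1, 1, 0, 0, 0],
})

# Take the "diseased" column of the correlation matrix
# and drop the label's correlation with itself.
label_corr = toy.corr()["diseased"].drop("diseased")

print(label_corr.sort_values())
```

Selecting one column by name avoids listing every other column in a `drop` call.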

Correlations of the features with each other.

In [None]:
f,ax = plt.subplots(figsize=(9, 9))
sns.heatmap(data.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax,)
plt.show()

In [None]:
def scatter_plot(names):
    plt.scatter(data[names[0]][data["diseased"]==1],data[names[1]][data["diseased"]==1],color="red",alpha=0.5,label="diseased")
    plt.scatter(data[names[0]][data["diseased"]==0],data[names[1]][data["diseased"]==0],color="blue",alpha=0.6,label="not diseased")
    plt.xlabel(names[0])
    plt.ylabel(names[1])
    plt.legend()

These scatter plots show in which value ranges of the most relevant features patients are diseased or not.

In [None]:
# Create an empty figure; plt.subplots would add an extra axes behind the 2x2 grid
plt.figure(figsize=(10, 10))
plt.subplot(2, 2, 1)
scatter_plot(["RBC","PCV"])
plt.subplot(2, 2, 2)
scatter_plot(["MCH","MCV"])
plt.subplot(2, 2, 3)
scatter_plot(["HGB","PCV"])
plt.subplot(2, 2, 4)
scatter_plot(["HGB","RBC"])
plt.tight_layout()
plt.show()

Disease rate by age group.

In [None]:
sns.barplot(filtered_data,x="Age",y="diseased")
plt.xticks(ticks=(0,1,2,3),labels=("18-32","32-48","48-64","64-"))
plt.xlabel("Age groups")
plt.ylabel("diseased")
plt.show()

Disease rate by sex.

In [None]:
sns.barplot(data,x="Sex",y="diseased")
plt.xticks(ticks=(0,1),labels=("Male","Female"))
plt.xlabel("Sex")
plt.ylabel("diseased")
plt.show()

Percentages of anemia types by sex.

In [None]:
female_mild_anemia=filtered_data[(filtered_data.Sex==1.0) & (filtered_data.HGB==0)][filtered_data.columns[0]].count()
female_normal_anemia=filtered_data[(filtered_data.Sex==1.0) & (filtered_data.HGB==1)][filtered_data.columns[0]].count()
female_anemia=filtered_data[filtered_data.Sex==1.0]["S.No."].count()

print("when sex is female")
print("mild anemia count is :",female_mild_anemia)
print("normal anemia count is :",female_normal_anemia)
# female_anemia holds the total number of female patients, so this is a ratio
print("mild anemia ratio is :",female_mild_anemia/female_anemia)

In [None]:
male_mild_anemia=filtered_data[(filtered_data.Sex==0.0) & (filtered_data.HGB==0)][filtered_data.columns[0]].count()
male_normal_anemia=filtered_data[(filtered_data.Sex==0.0) & (filtered_data.HGB==1)][filtered_data.columns[0]].count()
male_anemia=filtered_data[filtered_data.Sex==0.0]["S.No."].count()

print("when sex is male")
print("mild anemia count is :",male_mild_anemia)
print("normal anemia count is :",male_normal_anemia)
# male_anemia holds the total number of male patients, so this is a ratio
print("mild anemia ratio is :",male_mild_anemia/male_anemia)
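
The per-sex ratio can also be computed for both sexes at once. The sketch below uses invented rows and assumes the same coding as above (`Sex` 0.0 = male, 1.0 = female; `HGB` code 0 = below range):

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Sex": [0.0, 0.0, 0.0, 1.0, 1.0],
    "HGB": [0, 1, 1, 0, 0],
})

# The mean of a boolean column is the fraction of True values,
# so grouping by sex gives the share of low-HGB patients per sex.
low_hgb_share = (toy["HGB"] == 0).groupby(toy["Sex"]).mean()

print(low_hgb_share)
```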

---

<a id="7"></a>
# Creating Model and Testing

<a id="8"></a>
## Creating Train and Test Data

In [None]:
y = filtered_data2.diseased

In [None]:
x = filtered_data2.drop(["diseased"],axis=1)

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=42)

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape
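
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on an invented 20-row frame (the data is an assumption; only the `train_test_split` arguments match the cell above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 20 invented rows of range codes with a matching 0/1 label.
x_toy = pd.DataFrame({"HGB": [0, 1] * 10, "MCV": [1] * 20})
y_toy = x_toy.ne(1).any(axis=1).astype(int)

# test_size=0.3 puts 30% of the rows (6 of 20) in the test set.
xa_train, xa_test, ya_train, ya_test = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=42
)

# The same random_state reproduces exactly the same split.
xb_train, xb_test, yb_train, yb_test = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=42
)

print(xa_train.shape, xa_test.shape)
print(list(xa_test.index) == list(xb_test.index))
```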

---

<a id="9"></a>
## Testing the Model

Testing the data using Random Forest.

In [None]:
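from sklearn.metrics import accuracy_score

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(x_train, y_train)
y_pred = random_forest.predict(x_test)
# accuracy_score is the share of correct predictions; r2_score is a regression metric
print("%",accuracy_score(y_test,y_pred)*100)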

We use a confusion matrix to plot our Random Forest predictions.

In [None]:
cm = confusion_matrix(y_test,y_pred)
ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);
ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
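# confusion_matrix sorts the labels, so index 0 is "not diseased" (0) and index 1 is "diseased" (1)
ax.xaxis.set_ticklabels(['not diseased','diseased']); ax.yaxis.set_ticklabels(['not diseased','diseased']);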
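
How the cells of a confusion matrix line up with the labels can be checked on a tiny example. The predictions below are invented; passing `labels=[0, 1]` makes the ordering explicit rather than relying on the default sorted order.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels: 0 = not diseased, 1 = diseased.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 0]

# Rows are true labels, columns are predicted labels.
# With labels=[0, 1], index 0 is "not diseased" and index 1 is "diseased".
cm_toy = confusion_matrix(y_true, y_hat, labels=[0, 1])

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = cm_toy.trace() / cm_toy.sum()

print(cm_toy)
print(accuracy, accuracy_score(y_true, y_hat))
```

The tick labels of the heatmap have to follow this same order, with "not diseased" first.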

Testing the data using KNN.

In [None]:
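from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
# accuracy_score is the share of correct predictions; r2_score is a regression metric
print("%",accuracy_score(y_test,y_pred)*100)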

We use a confusion matrix to plot our KNN predictions.

In [None]:
cm = confusion_matrix(y_test,y_pred)
ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);
ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
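# confusion_matrix sorts the labels, so index 0 is "not diseased" (0) and index 1 is "diseased" (1)
ax.xaxis.set_ticklabels(['not diseased','diseased']); ax.yaxis.set_ticklabels(['not diseased','diseased']);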