## Hepatitis C prediction analysis

**Dataset link:**
https://www.kaggle.com/fedesoriano/hepatitis-c-dataset

**Dataset documentation:**
<br>1) Category (diagnosis) -> (values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis')
<br>2) Age (in years)
<br>3) Sex (f,m)

Attributes 4 to 13 refer to laboratory data:
<br>4) ALB
<br>5) ALP
<br>6) ALT
<br>7) AST
<br>8) BIL
<br>9) CHE
<br>10) CHOL
<br>11) CREA
<br>12) GGT
<br>13) PROT

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('HepC.csv', index_col = 0)

In [3]:
df.head()

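| Unnamed: 0 | Category | Age | Sex | ALB | ALP | ALT | AST | BIL | CHE | CHOL | CREA | GGT | PROT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0=Blood Donor | 32 | m | 38.5 | 52.5 | 7.7 | 22.1 | 7.5 | 6.93 | 3.23 | 106.0 | 12.1 | 69.0 |
| 2 | 0=Blood Donor | 32 | m | 38.5 | 70.3 | 18.0 | 24.7 | 3.9 | 11.17 | 4.8 | 74.0 | 15.6 | 76.5 |
| 3 | 0=Blood Donor | 32 | m | 46.9 | 74.7 | 36.2 | 52.6 | 6.1 | 8.84 | 5.2 | 86.0 | 33.2 | 79.3 |
| 4 | 0=Blood Donor | 32 | m | 43.2 | 52.0 | 30.6 | 22.6 | 18.9 | 7.33 | 4.74 | 80.0 | 33.8 | 75.7 |
| 5 | 0=Blood Donor | 32 | m | 39.2 | 74.1 | 32.6 | 24.8 | 9.6 | 9.15 | 4.32 | 76.0 | 29.9 | 68.7 |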


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 615 entries, 1 to 615
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Category  615 non-null    object 
 1   Age       615 non-null    int64  
 2   Sex       615 non-null    object 
 3   ALB       614 non-null    float64
 4   ALP       597 non-null    float64
 5   ALT       614 non-null    float64
 6   AST       615 non-null    float64
 7   BIL       615 non-null    float64
 8   CHE       615 non-null    float64
 9   CHOL      605 non-null    float64
 10  CREA      615 non-null    float64
 11  GGT       615 non-null    float64
 12  PROT      614 non-null    float64
dtypes: float64(10), int64(1), object(2)
memory usage: 67.3+ KB


The only two categorical columns are `Sex` and `Category`. `Category` is the target that our model will predict.
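As a rough sketch of how the observations from `df.info()` could be checked in code, the cell below lists the non-numeric columns, the class balance of the target, and the missing values per column. The `toy` frame and its values are made up for illustration and only stand in for `HepC.csv`; the same three expressions should work on `df`.

```python
import pandas as pd

# Hypothetical toy rows standing in for HepC.csv (values are invented)
toy = pd.DataFrame({
    "Category": ["0=Blood Donor", "0=Blood Donor", "1=Hepatitis", "3=Cirrhosis"],
    "Age": [32, 45, 38, 59],
    "Sex": ["m", "f", "m", "f"],
    "ALB": [38.5, 41.0, None, 33.0],
})

# Non-numeric columns are the categorical ones
categorical_cols = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
]

# Number of rows per class of the prediction target
class_counts = toy["Category"].value_counts()

# Number of missing values per column
missing = toy.isna().sum()

print(categorical_cols)
print(class_counts)
print(missing)
```

On the real data, the missing-value counts would be expected to match the gaps visible in the `Non-Null Count` column above (for example in `ALP` and `CHOL`).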

**Now that we have seen the dataset, we can start forming hypotheses about what the data has to tell us.**
<br>The first question to ask is: what is the mean age in each category?

In [5]:
df.groupby('Category')['Age'].mean()

Category
0=Blood Donor             47.131332
0s=suspect Blood Donor    57.571429
1=Hepatitis               38.708333
2=Fibrosis                52.333333
3=Cirrhosis               53.466667
Name: Age, dtype: float64

People in the hepatitis category have the lowest average age. This may be because fibrosis and cirrhosis can develop as a consequence of hepatitis, so patients in those categories have had more time for the disease to progress.
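A mean on its own can be misleading when some categories contain only a few patients. One possible way to put the averages in context is to report the group size and the median alongside the mean. This is a minimal sketch on a made-up `toy` frame (the ages are invented, not taken from `HepC.csv`); on the real data the same `agg` call could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data: ages are invented for illustration only
toy = pd.DataFrame({
    "Category": ["0=Blood Donor"] * 3 + ["1=Hepatitis"] * 2 + ["3=Cirrhosis"] * 2,
    "Age": [32, 45, 58, 30, 40, 50, 60],
})

# Group size, mean and median age for each category
age_summary = toy.groupby("Category")["Age"].agg(["count", "mean", "median"])
print(age_summary)
```

A small `count` for a category is a hint that its mean age should be read with caution.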

To understand the data better, the next question is whether hepatitis is more common in men or in women, and what the average age of each group is.

In [6]:
# How many men and women have hepatitis or a disease that hepatitis can cause
liver_disease = ['1=Hepatitis', '2=Fibrosis', '3=Cirrhosis']
df.loc[df['Category'].isin(liver_disease), 'Sex'].value_counts()

m    53
f    22
Name: Sex, dtype: int64

In [7]:
# Mean age of men and women with hepatitis or a disease that hepatitis can cause
df.loc[df['Category'].isin(['1=Hepatitis', '2=Fibrosis', '3=Cirrhosis'])].groupby('Sex')['Age'].mean()

Sex
f    53.181818
m    46.452830
Name: Age, dtype: float64

In this dataset, hepatitis and the diseases it can cause appear more often in men (53 cases) than in women (22 cases), and the affected men are younger on average (about 46.5 years versus 53.2). The higher count may be because men are more likely to have hepatitis than women; you can check out this article on the topic: https://www.medicalnewstoday.com/articles/323674. Keep in mind that these are raw counts, so they also depend on how many men and women the dataset contains.
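Because raw counts depend on how many men and women were sampled, a fairer comparison is the share of each sex that falls into one of the three disease categories. The sketch below shows one way this could be computed. The `toy` frame is invented for illustration, and `liver_categories` and `share_by_sex` are names introduced here, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data: rows are invented, not taken from HepC.csv
toy = pd.DataFrame({
    "Category": [
        "0=Blood Donor", "0=Blood Donor", "0=Blood Donor",
        "1=Hepatitis", "2=Fibrosis",
        "0=Blood Donor", "0=Blood Donor", "3=Cirrhosis",
    ],
    "Sex": ["m", "m", "m", "m", "m", "f", "f", "f"],
})

liver_categories = ["1=Hepatitis", "2=Fibrosis", "3=Cirrhosis"]

# True for rows in one of the disease categories
has_liver_disease = toy["Category"].isin(liver_categories)

# The mean of a boolean column is the fraction of True values,
# so this gives the share of each sex with a disease category
share_by_sex = has_liver_disease.groupby(toy["Sex"]).mean()
print(share_by_sex)
```

If the share for men stays clearly higher than for women on the real data, the conclusion drawn from the raw counts holds; if the two shares are close, the difference in counts mostly reflects the composition of the sample.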