# Data Visualization with Haberman Dataset

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Title: Haberman's Survival Data


Past Usage:

* Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models. Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.
* Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984). Graphical Methods for Assessing Logistic Regression Models (with discussion). Journal of the American Statistical Association 79: 61-83.
* Lo, W.-D. (1993). Logistic Regression Trees. PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.


Number of Instances: 306

Number of Attributes: 4 (including the class attribute)

Attribute Information:

* Age of patient at time of operation (numerical)
* Patient's year of operation (year - 1900, numerical)
* Number of positive axillary nodes detected (numerical)
* Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

Missing Attribute Values: None
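
Because the year of operation is stored as year - 1900, a value such as 58 stands for 1958. A minimal sketch of the conversion, assuming a column named `year` as in the notebook; the `sample` frame and its three values are hypothetical and not taken from the dataset:

```python
import pandas as pd

# Hypothetical sample rows; only the 'year' column name matches the notebook.
sample = pd.DataFrame({"year": [58, 64, 69]})

# The dataset stores year - 1900, so adding 1900 recovers the calendar year.
sample["calendar_year"] = sample["year"] + 1900
print(sample)
```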

# Objective:

To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, the year of the operation, and the number of positive axillary lymph nodes.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

hbm=pd.read_csv("../input/haberman.csv")
hbm

In [None]:
hbm.status.dtype

In [None]:
# Relabel the class attribute 'status' in a readable format.
# Assigning the result back avoids relying on an in-place replace through attribute access.
hbm['status'] = hbm['status'].replace({1: 'survived 5 years or longer', 2: 'died within 5 years'})
hbm

In [None]:
print(hbm.shape)
print(hbm.columns)

In [None]:
hbm.describe()

In [None]:
unique_years = sorted(hbm.year.unique())
print('Unique years in which operations were performed:', unique_years)
print('Number of unique years:', len(unique_years))
unique_nodes = sorted(hbm.nodes.unique())
print('Unique values of nodes:', unique_nodes)
print('Number of unique values of nodes:', len(unique_nodes))
unique_age = sorted(hbm.age.unique())
print('Unique patient ages at the time of operation:', unique_age)
print('Number of unique ages:', len(unique_age))

In [None]:
print(hbm.status.value_counts())
# Print percentages with their class labels so the order is unambiguous.
print('Percentage of patients in each class:')
print(hbm.status.value_counts(normalize=True) * 100)
sns.countplot(x='status',data=hbm)
plt.show()

# Observations:
* This dataset has 306 observations, each described by 4 attributes: 'age', 'year', 'nodes' and 'status'.
* There are no missing values in any attribute.
* Considerably more patients survived 5 years or longer than died within 5 years (225 vs 81).
* The dataset is therefore imbalanced.
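
To put the imbalance in numbers, the sketch below works from the two class counts stated above (225 and 81) and derives the class shares and the accuracy of always predicting the majority class. The variable names are illustrative; the baseline is a common reference point for imbalanced data rather than something the notebook itself computes.

```python
# Class counts as reported in the observations above.
survived, died = 225, 81
total = survived + died

# Share of each class, in percent.
survived_pct = 100 * survived / total
died_pct = 100 * died / total

# A model that always predicts the majority class is right this often.
baseline_accuracy = max(survived, died) / total

print(f"survived: {survived_pct:.1f}%, died: {died_pct:.1f}%")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier built on this data would need to beat that baseline accuracy to be considered useful.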

# Univariate Analysis

In [None]:
plt.figure(figsize=(15,5))
sns.set(style="ticks")
sns.countplot(x='age',data=hbm,hue='status')
plt.show()

In [None]:
sns.set(style="ticks")
sns.catplot(x='year',kind='count',data=hbm)
plt.show()
sns.catplot(x='year',kind='count',data=hbm,col='status')
plt.show()

In [None]:
sns.set(style="ticks")
sns.catplot(x='nodes',col='status',kind='count',data=hbm,height=6)
plt.show()

# Observations
* The highest number of operations was performed on patients aged 52, and the lowest on patients aged 71, 75, 76, 77, 78 and 83.
* The highest survival rate is found among patients aged 38.
* The highest number of operations was performed in 1958 and the lowest in 1969.
* The highest number of survivors came from operations in 1958, while the highest number of patients who died within 5 years came from 1965.
* Most patients who survived 5 years or longer had 0 positive nodes.
* As the number of nodes increases, both the number of patients who survived 5 years or longer and the number who died within 5 years decrease.
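
Raw counts fall in both classes as the node count grows, so they do not show how the proportion of survivors changes. One way to check this is to bin the node counts and compute the survival rate per bin. The sketch below is illustrative only: the `toy` rows are hypothetical (not taken from haberman.csv), and the bin edges are an arbitrary choice.

```python
import pandas as pd

S = "survived 5 years or longer"
D = "died within 5 years"

# Hypothetical toy rows, NOT from haberman.csv; column names follow the notebook.
toy = pd.DataFrame({
    "nodes": [0, 0, 0, 0, 1, 2, 3, 5, 9, 22],
    "status": [S, S, S, D, S, D, S, D, D, S],
})

# Group node counts into coarse bins (the edges are an arbitrary choice).
bins = pd.cut(toy["nodes"], bins=[-1, 0, 3, 100], labels=["0", "1-3", "4+"])

# Fraction of survivors within each bin.
is_survivor = toy["status"].eq(S)
rate_by_bin = is_survivor.groupby(bins, observed=False).mean()
print(rate_by_bin)
```

Applied to the real data, the same grouping on `hbm` would show whether the survival rate, and not just the count, drops with more nodes.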

In [None]:
sns.set_style('whitegrid')
fig, axes = plt.subplots(1, 3, figsize=(20, 7))
for i,j in enumerate(list(hbm.columns[:-1])):
    sns.violinplot(x='status',y=j,data=hbm,ax=axes[i])
plt.show()

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
for i,j in enumerate(list(hbm.columns[:-1])):
    sns.boxplot(x='status',y=j,data=hbm,ax=axes[i])
plt.show()

In [None]:
for i in hbm.columns[:-1]:
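    # sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
    # (seaborn 0.11 or later) is the closest replacement.
    sns.FacetGrid(hbm,hue='status',height=5).map(sns.histplot,i,kde=True,stat='density').add_legend()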
plt.show()

In [None]:
for i in hbm.columns[:-1]:
    counts, bin_edges = np.histogram(hbm[i], bins=10, density=True)
    # Normalise so the bin heights sum to 1, then accumulate to get the CDF.
    pdf = counts / counts.sum()
    cdf = np.cumsum(pdf)
    print('PDF:', pdf)
    print('CDF:', cdf)
    print('Bin edges:', bin_edges)
    plt.plot(bin_edges[1:], pdf, label='PDF')
    plt.plot(bin_edges[1:], cdf, label='CDF')
    plt.xlabel(i)
    plt.legend()
    plt.show()

# Observations
* From the box plots and violin plots we observe that the distributions of 'age' and 'year' are almost the same for the two classes, whereas the distribution of 'nodes' shows some variation.
* From the PDF plots we observe heavy overlap between the two classes for 'age' and 'year', and most of the density for 'nodes' lies between 0 and 10. The same can be inferred from the CDF plots.
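
Comparing the classes directly is easier with one CDF per class evaluated at the same points. The sketch below is one possible way to do that with an empirical CDF. The helper `empirical_cdf` and both node lists are hypothetical, introduced only for illustration.

```python
import numpy as np

def empirical_cdf(values, grid):
    """Fraction of values less than or equal to each grid point."""
    values = np.sort(np.asarray(values))
    # side='right' counts how many sorted values are <= each grid point.
    return np.searchsorted(values, grid, side="right") / len(values)

# Hypothetical node counts per class, for illustration only.
nodes_survived = [0, 0, 0, 1, 2, 3]
nodes_died = [0, 2, 5, 9, 13, 22]

grid = np.array([0, 3, 10])
cdf_survived = empirical_cdf(nodes_survived, grid)
cdf_died = empirical_cdf(nodes_died, grid)
print("survived:", cdf_survived)
print("died:    ", cdf_died)
```

A large gap between the two curves at a given node count suggests that a threshold there would help separate the classes.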

In [None]:
from statsmodels import robust  # imported once, outside the loop

for i in hbm.columns[:-1]:
    print(i, ':')
    print("Mean:")
    print(np.mean(hbm[i]))
    print("Median:")
    print(np.median(hbm[i]))
    print("Quantiles (0th, 25th, 50th, 75th percentiles):")
    print(np.percentile(hbm[i], np.arange(0, 100, 25)))
    print("90th percentile:")
    print(np.percentile(hbm[i], 90))
    print("Median Absolute Deviation:")
    print(robust.mad(hbm[i]))

# Bi-variate Analysis

In [None]:
sns.FacetGrid(hbm,hue='status',height=8).map(plt.scatter,'nodes','age').add_legend()
plt.show()
sns.FacetGrid(hbm,hue='status',height=8).map(plt.scatter,'nodes','year').add_legend()
plt.show()

In [None]:
plt.close()
sns.set_style("whitegrid")
sns.pairplot(hbm, hue="status", height=4)
plt.show()

In [None]:
sns.jointplot(x='nodes',y='year',data=hbm,height=8)
plt.show()

sns.jointplot(x='age',y='nodes',data=hbm,kind='kde',height=8)
plt.show()

# Observations:
* The pairplot shows heavy overlap between the two classes in almost all plots, so no single pair of attributes separates them cleanly.
* However, the plot of 'year' against 'nodes' looks the most promising for separating the classes.
* Node counts above 30 can be considered outliers.
* Patients in the 40-60 age group mostly have between 0 and 10 nodes.
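
The cutoff of 30 nodes above is a visual judgment from the plots. A common rule-based alternative is Tukey's fence, which flags values above Q3 + 1.5 × IQR. The sketch below applies that rule to hypothetical node counts (not taken from the dataset). It is a different technique from the visual cutoff and may give a different threshold on the real data.

```python
import numpy as np

# Hypothetical node counts, for illustration only.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 35])

# Quartiles and interquartile range.
q1, q3 = np.percentile(nodes, [25, 75])
iqr = q3 - q1

# Tukey's upper fence: values above it are flagged as outliers.
upper_fence = q3 + 1.5 * iqr
outliers = nodes[nodes > upper_fence]

print("upper fence:", upper_fence)
print("outliers:", outliers)
```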

# Conclusion
* The attributes 'age' and 'year' cannot be used for our prediction, as their distribution curves overlap heavily between the two classes.
* The attribute 'nodes' is likely to be the most useful one for prediction.
* More observations are needed to reach a definite conclusion about using the 'nodes' attribute for prediction.