#  Haberman’s Survival Data:
***
* This dataset contains three independent variables (features): age, year, and nodes
> &emsp;    **age** - age of the patient at the time of the operation
<br>  &emsp;  **year** - year in which the operation happened
<br>   &emsp; **nodes** - number of positive lymph nodes detected

* A dependent variable (class label), status
>  &emsp;  **status** - whether the patient survived 5 years or longer after the operation

## objective:
>Perform univariate analysis (PDF, CDF, box plots, violin plots) to understand which features are useful for classification
<br>Perform bi-variate analysis (scatter plots, pair plots) to see whether combinations of features are useful for classification
<br>Gain confidence in and clarity about the data

#### importing required libraries and loading data

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
haberman = pd.read_csv('/kaggle/input/haberman/haberman.csv')  #loading data using pandas and we use haberman as identifier



In [None]:
haberman.head() #to see first 5 rows of dataset and to confirm it loaded 

In [None]:
type(haberman) #data type

In [None]:
haberman.shape  #shape of the dataset

#### Observation:
>`haberman.shape` shows that the dataset contains 306 rows and 4 columns

In [None]:
haberman.columns #columns in the dataset

In [None]:
haberman.info()

#### Observations:
>There are no missing values in this data set
<br>All the columns are of the integer data type
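
Both observations can also be checked programmatically instead of being read off the `info()` output. The following is a minimal sketch on a small hand-made frame; `demo` and its values are illustrative assumptions, not rows from the real dataset.

```python
import pandas as pd

# Small stand-in frame with the same column names as the Haberman data
# (values are illustrative only, not taken from the dataset).
demo = pd.DataFrame({
    "age": [30, 45, 62],
    "year": [64, 60, 65],
    "nodes": [1, 0, 7],
    "status": [1, 2, 1],
})

# Number of missing values per column; all zeros means no gaps.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# True only if every column has an integer dtype.
all_integer = all(pd.api.types.is_integer_dtype(t) for t in demo.dtypes)
print(all_integer)
```

Running the same two checks on `haberman` should agree with what `haberman.info()` reports.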

In [None]:
haberman.status.value_counts()

#### Observations:
>Out of 306 patients, 225 patients survived and 81 did not.
<br>The dataset is imbalanced.
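
To put a number on the imbalance, the class proportions can be computed with `value_counts(normalize=True)`. The sketch below rebuilds the class column from the counts reported above (225 and 81); the variable names are illustrative.

```python
import pandas as pd

# Rebuild the class column from the counts reported above:
# 225 patients with status 1 and 81 with status 2.
status = pd.Series([1] * 225 + [2] * 81, name="status")

# normalize=True returns fractions instead of raw counts.
proportions = status.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = float(proportions.max())
print(round(majority_baseline, 3))
```

Roughly 73.5% of the patients belong to the majority class, so any classifier built on this data should be compared against that baseline rather than against 50%.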

In [None]:
haberman.status.unique()

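The class label status takes the two integer values 1 and 2. These codes are easy to misread and can be treated as numeric values by plotting functions, so we map them to descriptive labels.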

In [None]:
haberman.status=haberman.status.map({1:"yes",2:"no"}) #replace 1 with yes and 2 with no
haberman.status = haberman['status'].astype('category') #change type of status column from integer to category
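
One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary silently becomes `NaN`. The sketch below uses a made-up series (`toy_status`, with a deliberately invalid value 3) to show the behaviour and a quick sanity check.

```python
import pandas as pd

# Toy series for illustration; 3 is deliberately not in the mapping.
toy_status = pd.Series([1, 2, 1, 3])

mapped = toy_status.map({1: "yes", 2: "no"}).astype("category")
print(mapped.tolist())

# Count the values that were not covered by the mapping.
unmapped = int(mapped.isnull().sum())
print(unmapped)
```

On the real column the same count is expected to be 0, since `status` only contains 1 and 2.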

In [None]:
haberman.tail()


## Bi-variate analysis
***

### pair plots

In [None]:
sns.set_style("whitegrid")
sns.pairplot(haberman, hue = 'status',height = 5)
plt.show()

#### Observations: 

>There are 3 features, so we get 3 unique pair plots
<br>In the age vs. year plot the two classes overlap heavily
<br>The age vs. nodes plot separates the classes better than the age vs. year plot
<br>The year vs. nodes plot separates the classes comparatively better than the other two plots
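
The count of 3 unique plots is the number of ways to choose 2 features out of 3. The pair plot grid itself has 3 x 3 panels, but the diagonal shows single-feature distributions and the panels below the diagonal mirror those above it. A short sketch listing the unique pairs:

```python
from itertools import combinations

features = ["age", "year", "nodes"]

# Every unordered pair of distinct features.
pairs = list(combinations(features, 2))
print(pairs)
print(len(pairs))
```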

### scatter plot

In [None]:

sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue = 'status' , height = 6)\
 .map(plt.scatter,"nodes",'age')\
 .add_legend();
plt.show()

## Univariate analysis
***

### Probability Density Function (PDF)

In [None]:
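# FacetGrid's `size` argument was renamed to `height` in newer seaborn releases.
# Note: sns.distplot is deprecated; sns.histplot / sns.displot are its replacements.
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.distplot, "age") \
   .add_legend()
plt.show()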


#### Observation:
>Between ages 55 and 75, the distributions of survivors and non-survivors are very similar
<br>From 43 to 53, the survival rate is lower
<br>From 30 to 40, the survival rate is higher
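
Reading survival rates off overlapping density curves is approximate. One way to check such statements numerically is to bin the ages and compute the fraction of survivors per bin. This is a minimal sketch on made-up data: the frame `toy`, its values, and the bin edges are assumptions for illustration, not results from the real dataset.

```python
import pandas as pd

# Toy data (made up for illustration, not the real dataset).
toy = pd.DataFrame({
    "age": [32, 35, 38, 45, 47, 50, 52, 60, 66, 70],
    "status": ["yes", "yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes"],
})

# Bin edges are an assumption; right=False makes each bin [low, high).
bins = [30, 40, 55, 80]
toy["age_group"] = pd.cut(toy["age"], bins=bins, right=False)

# Fraction of patients in each age group with status "yes".
survival_rate = (
    toy.assign(survived=toy["status"].eq("yes"))
       .groupby("age_group", observed=True)["survived"]
       .mean()
)
print(survival_rate)
```

Applying the same steps to `haberman` gives per-bin survival rates that can be compared against the visual impression from the PDF.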

In [None]:
# `height` replaces the older `size` argument of FacetGrid.
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.distplot, "year") \
   .add_legend()
plt.show()


#### Observation:
> Around the years 1960 and 1965 there were more patients who did not survive 5 years after the operation
<br>Survival cannot be predicted from the year alone because the two distributions mostly overlap

In [None]:
sns.set_style("whitegrid")
# `height` replaces the older `size` argument of FacetGrid.
sns.FacetGrid(haberman, hue="status", height=10) \
   .map(sns.distplot, "nodes") \
   .add_legend()
plt.show()


#### Observation:
>Patients with 0 or 1 nodes are more likely to survive
<br>The survival rate is also fairly predictable and stays high for up to 3 nodes

### Cumulative Distribution Function(CDF)

In [None]:
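plt.figure(figsize=(20, 5))
# Loop over every column except the last one (status)
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    plt.subplot(1, 3, idx + 1)
    print("********* " + feature + " *********")

    # Histogram with 10 equal-width bins over the feature's range
    counts, bin_edges = np.histogram(haberman[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))
    pdf = counts / sum(counts)  # normalise so the bin probabilities sum to 1
    print("PDF: {}".format(pdf))
    cdf = np.cumsum(pdf)  # running total of the bin probabilities
    print("CDF: {}".format(cdf))

    plt.plot(bin_edges[1:], pdf, label="PDF")
    plt.plot(bin_edges[1:], cdf, label="CDF")
    plt.xlabel(feature)
    plt.legend()
plt.show()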

#### Observation:
>We can see that almost 80% of the patients have fewer than 10 positive lymph nodes
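
A CDF built from a 10-bin histogram only gives values at the bin edges, so percentages read from it depend on the bin width. An empirical CDF avoids binning entirely. The sketch below is one way to compute it; the helper names `fraction_below` and `ecdf` and the toy node counts are assumptions for illustration.

```python
import numpy as np

def fraction_below(values, threshold):
    """Return the fraction of entries strictly less than threshold."""
    values = np.asarray(values)
    return float(np.mean(values < threshold))

def ecdf(values):
    """Return the sorted values and the cumulative fraction at each one."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Toy node counts (made up for illustration, not the real dataset).
toy_nodes = [0, 0, 1, 1, 2, 3, 4, 8, 12, 25]

print(fraction_below(toy_nodes, 10))
x, y = ecdf(toy_nodes)
print(y[-1])  # the empirical CDF always ends at 1.0
```

Calling `fraction_below(haberman["nodes"], 10)` would give the exact share of patients with fewer than 10 nodes in the real data.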

### Box Plots and Violin Plots

In [None]:
sns.boxplot(x='status',y='age', data=haberman)
plt.show()
sns.boxplot(x='status',y='year', data=haberman)
plt.show()
sns.boxplot(x='status',y='nodes', data=haberman)
plt.show()

In [None]:
# violinplot has no `height` parameter, so it is not passed here
sns.violinplot(x='status', y='age', data=haberman)
plt.show()
sns.violinplot(x='status', y='year', data=haberman)
plt.show()
sns.violinplot(x='status', y='nodes', data=haberman)
plt.show()

#### Observations:
>The more nodes a patient has, the lower the chance of survival; even with few nodes there is a small probability of not surviving
<br>There were more non-survival cases around the year 1965
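
The quantities a box plot draws (first quartile, median, third quartile) can also be computed per class, which makes comparisons between the two groups less dependent on reading the figure. A minimal sketch on made-up data; the frame `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy data (made up for illustration, not the real dataset).
toy = pd.DataFrame({
    "nodes": [0, 1, 2, 3, 0, 4, 9, 15],
    "status": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
})

# One row per class, one column per quantile (0.25, 0.5, 0.75).
quartiles = (
    toy.groupby("status")["nodes"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

The same call with `haberman` in place of `toy` yields the per-class quartiles for the real data, and it works for `age` and `year` as well.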



## conclusion:

>1 Classifying survival status from the given features is difficult because the dataset is imbalanced and the class distributions overlap
<br>2 The chance of survival tends to decrease as the number of nodes increases, but the absence of nodes does not guarantee survival

