# Haberman's Cancer Survival - EDA

The dataset contains cases from a study that was conducted **between 1958 and 1970** at the University of Chicago's Billings Hospital on the **survival of patients who had undergone surgery for breast cancer**.

**FEATURES:**

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

[Missing values: Zero]

**About Lymphatic Nodes:**

- Lymph nodes tend to cluster around the armpits, neck and belly
- Cancer spreads in a predictable fashion from the breast to the lymph glands
- Cancer in the lymph nodes is perhaps the single most important factor in predicting outcomes
- If cancer is detected in the lymph nodes, axillary dissection surgery is performed


**Domain Knowledge Note:**

Here, the number of positive axil nodes may tell us little about survival beyond the presence of cancer. More positive nodes imply a higher chance of cancer.

**In other words, we might be predicting the survival of a patient after surgery when the patient did not have cancer at all.** Too much profit for hospitals >_<


**OBJECTIVE:**

To predict the chances that a patient who has undergone breast surgery survives 5 years or longer, depending on the given features

(**Breast Surgery may possibly mean removal of lymph nodes as well as tumor**)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
os.chdir("/kaggle/input/habermans-survival-data-set/")

In [None]:
#load data

df = pd.read_csv("haberman.csv")

# Renaming cols
df.columns=['age', 'op_year', 'axil_nodes', 'survived']
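# Replacing 2s with 0s in the survived col only
# (df[df==2] = 0 would also overwrite 2s in other columns, e.g. axil_nodes)
df["survived"] = df["survived"].replace(2, 0)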

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df.head()

In [None]:
df.columns

In [None]:
#  Number of obsvns and features
df.shape

In [None]:
#Value Counts

print(df["survived"].value_counts())

In [None]:
df.info()

In [None]:
print((224)/305)
print((224-81)/224)
# Same can be found from mean of 'survived'

In [None]:
df.describe()

**My Observations:**

1. About 73% of patients survived 5 years or longer after surgery (224 of 305)
2. This may be an imbalanced dataset, as close to 75% of the patients are survivors (see the quartiles of `survived`). Is it? We will see.
- There are no null values
- Operations were performed on diverse age groups
- Operations are centered around 1962 with a `std-dev` of about 3 years
    - We may try to find out the medical situation in the country around that year
- 75% of patients have at most 4 positive axil nodes
    - This could be the reason behind the high survival rate of 0.73!
- axil nodes are irregularly distributed: there may be outliers, or most of the data lies near the minimum value, since the mean is close to the minimum.

**External Observations:**

- Observation `4` (diverse age groups) can be wrong. See the graphs as well
- Observation `5` (operation year) can be wrong. See the graphs as well

**Takeaway**: Don't rely solely on numerical statistics or solely on graphs. USE BOTH
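
The survival share and the class balance discussed above can be read off with `value_counts`. The snippet below is a minimal sketch on a hypothetical toy column (not the Haberman data); the names `toy`, `shares` and `minority_to_majority` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy labels (1 = survived, 0 = died), NOT the Haberman data.
toy = pd.DataFrame({"survived": [1, 1, 1, 0, 1, 0, 1, 1]})

counts = toy["survived"].value_counts()                # absolute class counts
shares = toy["survived"].value_counts(normalize=True)  # class proportions

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
survival_rate = toy["survived"].mean()

# Ratio of the smaller class to the larger class: 1.0 means perfectly balanced.
minority_to_majority = counts.min() / counts.max()

print(counts)
print(shares)
print(survival_rate, minority_to_majority)
```

The minority-to-majority ratio is one possible way to put a number on "imbalanced"; what counts as too imbalanced depends on the model used later.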

# MULTIVARIATE ANALYSIS

In [None]:
# First, let's do multivariate analysis, which will help us in the univariate analysis
# Pair plots
sns.set_style("whitegrid")
# `height` replaces the older `size` argument (renamed in seaborn 0.9)
sns.pairplot(df, hue="survived", height=3)
plt.show()

**My Observation**

- Too much is hazy for now. Proceed with univariate analysis

# UNIVARIATE ANALYSIS

In [None]:
# AGE

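plt.close()
# `height` replaces the older `size` argument (renamed in seaborn 0.9).
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the closest replacement.
sns.FacetGrid(df, hue="survived", height=5) \
   .map(sns.distplot, "age") \
   .add_legend()
plt.show()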


**My Observations:**

7. `age` cannot be used effectively on its own, as patients from more or less the same age groups both survived and did not survive breast surgery
8. Correction to obsvn `4`: age follows a roughly normal distribution with a wide spread. Mostly diverse? Yes, but normally distributed.

In [None]:
# OPERATION YEAR

plt.close()
# `height` replaces the older `size` argument (renamed in seaborn 0.9)
sns.FacetGrid(df, hue="survived", height=5) \
   .map(sns.distplot, "op_year") \
   .add_legend()
plt.show()

**My Observations:**

9. `op_year` cannot be used effectively for classification, as patients from more or less the same operation years both survived and did not survive breast surgery
    - We can see a peak in deaths in 1964
    - We can also see survival decreasing with time.
    - Let us check the points above against the survival 'rate' or death 'rate' per year, as we cannot depend only on survival counts or only on death counts
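
One way to get a per-year survival rate is a `groupby` on the operation year. This is only a sketch on hypothetical toy rows (not the Haberman data); the column names `n_patients`, `n_survived`, `n_died` and `survival_rate` are illustrative.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Haberman data.
toy = pd.DataFrame({
    "op_year":  [60, 60, 60, 61, 61, 62, 62, 62, 62],
    "survived": [1,  1,  0,  1,  0,  1,  1,  1,  0],
})

# Count patients and survivors per operation year.
per_year = toy.groupby("op_year")["survived"].agg(
    n_patients="count",
    n_survived="sum",
)
per_year["n_died"] = per_year["n_patients"] - per_year["n_survived"]

# Share of that year's patients who survived 5 years or longer.
per_year["survival_rate"] = per_year["n_survived"] / per_year["n_patients"]

print(per_year)
```

Dividing by the total number of patients per year, rather than by the number of deaths, keeps the rate between 0 and 1 and stays defined for a year without any deaths.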
    
    
    

In [None]:
# Task: To calculate survival rate per year and plot it
# ------------------------------------------------------

# Counts of survived and dead patients combined wrt year
print(df["op_year"].value_counts())

# Creating separate dataframes for survived and dead patients
sur, ded = df[df["survived"]==1], df[df["survived"]==0]
print(sur.head())
print(sur.shape)
print(ded.head())
print(ded.shape)

In [None]:
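# Build a two-row frame: death counts and survival counts per operation year.
# (DataFrame.append was removed in pandas 2.0, so the rows go straight into the constructor.)
newDF = pd.DataFrame([
    ded["op_year"].value_counts(),   # death counts ONLY
    sur["op_year"].value_counts(),   # survival counts ONLY
])

# Years as rows, sorted so that later line plots run left to right in time
sur_rate = newDF.transpose().sort_index()
sur_rate.head()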

In [None]:
# renaming cols
sur_rate.columns=["no_of_death", "no_of_sur"]
sur_rate

In [None]:
print(sur_rate["no_of_death"].sum())
print(sur_rate["no_of_sur"].sum())
print(sur_rate["no_of_death"].max())
print(sur_rate["no_of_sur"].max())
# Let's use max for normalisation and compare it with sum normalisation

In [None]:
#sur_rate["normBySum_death"] = sur_rate["no_of_death"]/sur_rate["no_of_death"].sum()
#sur_rate["normBySum_sur"] = sur_rate["no_of_sur"]/sur_rate["no_of_sur"].sum()

# Normalizing no_of_death and no_of_sur to reduce effect of Imbalanced data
#sur_rate["normByMax_death"] = sur_rate["no_of_death"]/sur_rate["no_of_death"].max()

# Normalizing with total patients is better than normalizing just with 'no_of_sur'. Data won't be lost. Is it?
sur_rate["sur_ratio"] = sur_rate["no_of_sur"]/sur_rate["no_of_death"]
sur_rate["death_ratio"] = sur_rate["no_of_death"]/sur_rate["no_of_sur"]
sur_rate

In [None]:
sur_rate["sur_ratio_normalized"] = sur_rate["sur_ratio"] / sur_rate["sur_ratio"].max()
sur_rate["death_ratio_normalized"] = sur_rate["death_ratio"] / sur_rate["death_ratio"].max()
sur_rate

In [None]:
# Relative survival rate

#sur_rate["sur_rate_bySum"] = sur_rate["normBySum_sur"]/sur_rate["normBySum_death"]
#sur_rate["sur_rate_byMax"] = sur_rate["normByMax_sur"]/sur_rate["normByMax_death"]

# Normalizing survival rate - Helpful for our objective - "Classification"
#sur_rate["normalized_sur_rate1"] = sur_rate["sur_rate_byMax"] / sur_rate["sur_rate_byMax"].max()
#sur_rate["normalized_sur_rate2"] = sur_rate["sur_rate_byMax"] / sur_rate["sur_rate_byMax"].max()

#sur_rate

In [None]:
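plt.close()
plt.figure(figsize=(10, 5))
# Labels are needed so that plt.legend() has entries to show
plt.plot(sur_rate["sur_ratio_normalized"], label="sur_ratio_normalized")
plt.plot(sur_rate["death_ratio_normalized"], label="death_ratio_normalized")
plt.xlabel("op_year")
plt.legend()

plt.show()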

**My Observations:**
    
10. The survival rate shows a slightly decreasing trend. Is this an effect of imbalanced data? No, it is normalized. What can be the reason behind this? Not our objective.
11. Survival is lowest in 1965. People operated on in 1965 have lower chances of survival compared to 1961 and 1967. Patients operated on in 1963 have slightly higher chances of survival.
13. Operation year is a somewhat good feature for classification

**Task:**

Try to find the correlation between the `survived` col and `sur_ratio_normalized` after converting `sur_ratio_normalized` to discrete values
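
A possible approach to this task, sketched on hypothetical toy rows (not the Haberman data): compute a normalized per-year value, map it back to each patient, discretise it with `pd.cut`, and correlate it with `survived`. The per-year survival share used here is only a stand-in for `sur_ratio_normalized`, and the bin edges are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Haberman data.
toy = pd.DataFrame({
    "op_year":  [60, 60, 61, 61, 62, 62, 62, 62],
    "survived": [1,  0,  0,  0,  1,  1,  1,  0],
})

# Per-year survival share, scaled by its maximum (stand-in for sur_ratio_normalized).
rate = toy.groupby("op_year")["survived"].mean()
rate_normalized = rate / rate.max()

# Attach each patient's year-level value, then discretise it into two bins:
# 0 = "low survival year", 1 = "high survival year" (bin edges are an assumption).
toy["year_rate"] = toy["op_year"].map(rate_normalized)
toy["year_rate_bin"] = pd.cut(
    toy["year_rate"], bins=[-0.01, 0.5, 1.0], labels=[0, 1]
).astype(int)

# Pearson correlation between the label and the discretised year-level value.
corr = toy["survived"].corr(toy["year_rate_bin"])

# Contingency table as a second view of the same relationship.
table = pd.crosstab(toy["year_rate_bin"], toy["survived"])

print(corr)
print(table)
```

Because the year-level value is itself computed from `survived`, some correlation is expected by construction; on real data it would be safer to compute the rate on one split and check the correlation on another.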

In [None]:
# Normalized vs non-normalized. THIS IS NOT EDA

#plt.close();
#plt.figure(figsize=(10,5))
#plt.plot(sur_rate["normalized_sur_rate1"])
#plt.plot(sur_rate["no_of_sur"] - sur_rate["no_of_death"])
#plt.legend()

#plt.show()

In [None]:
# Positive Axil Nodes: Should be interesting. But remember it is ONLY a sign of cancer

plt.close()
# `height` replaces the older `size` argument (renamed in seaborn 0.9)
sns.FacetGrid(df, hue="survived", height=10) \
   .map(sns.distplot, "axil_nodes") \
   .add_legend()
plt.show()

**My Observations:**

10. People with **few or no positive axil nodes**, when operated upon, tend to survive more. This is intuitively correct, as they may not even have had cancer!
11. People with **more positive axil nodes**, when operated upon, tend to `die relatively more` and survive less.
    - Can this contradiction be an effect of the unbalanced dataset? No, it shouldn't be.
    - This means that positive axil nodes definitely indicate the presence of cancer
    - But it seems to imply that **breast surgeries are not very useful**. They may only profit medical organisations, as they can perform surgeries on patients without cancer and the patient still survives.
12. For any given number of positive axil nodes (except near zero), **the survival density is always lower than the non-survival density!!!**

Note: **Whenever there is a contradiction, look for bias, outliers etc. in the data**
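
To check a density-plot impression against plain numbers, the survival share can be computed per node-count group. A minimal sketch on hypothetical toy rows (not the Haberman data); the bin edges and the group labels are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Haberman data.
toy = pd.DataFrame({
    "axil_nodes": [0, 0, 0, 1, 2, 3, 5, 8, 13, 22],
    "survived":   [1, 1, 1, 1, 0, 1, 0, 1, 0,  0],
})

# Group patients by number of positive nodes (edges chosen for illustration only).
bins = [-1, 0, 3, 100]
labels = ["0", "1-3", "4+"]
toy["node_group"] = pd.cut(toy["axil_nodes"], bins=bins, labels=labels)

# Patients per group and share of survivors within each group.
by_group = toy.groupby("node_group", observed=True)["survived"].agg(["count", "mean"])

print(by_group)
```

Reporting the group size next to the survival share helps to spot groups that are too small to support a conclusion.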

In [None]:
plt.close()
# `height` replaces the older `size` argument (renamed in seaborn 0.9)
sns.FacetGrid(df, hue="survived", height=5) \
   .map(sns.distplot, "survived") \
   .add_legend()
plt.show()

**My Observations:**

12. This is only a moderately `imbalanced dataset`: the non-survival bar is about 36% of the survival bar (81/224). The 63% computed earlier, (224-81)/224, is the gap between the two classes relative to the survivors, not the ratio of the bars.
    - This answers obsvn. `11` doubt 1

In [None]:
# CDF - axil_node

survived = df[df["survived"] == 1];
died = df[df["survived"] == 0];

In [None]:
# CDF - Used to get misclassification error

plt.close()

# Survived: histogram -> PDF -> CDF
counts, bin_edges = np.histogram(survived['axil_nodes'], bins=10,
                                 density=True)
pdf = counts / sum(counts)   # normalise so that the bin probabilities sum to 1
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF survived")
plt.plot(bin_edges[1:], cdf, label="CDF survived")


# Died: same steps
counts, bin_edges = np.histogram(died['axil_nodes'], bins=10,
                                 density=True)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF died")
plt.plot(bin_edges[1:], cdf, label="CDF died")

plt.xlabel("axil_nodes")
plt.legend()
plt.show()

# 😂

**My Observations:**

13. Full of misclassification errors
14. Most people have at most ~26 positive nodes
15. One thing we can note otherwise is that (except near zero) **survival is always less compared to non-survival!!!**
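
How a CDF relates to misclassification error can be sketched with a simple threshold rule: predict "survived" when `axil_nodes <= t`. The rule, the toy values and the helper names `ecdf` and `threshold_error` are assumptions for illustration, not something taken from the dataset.

```python
import numpy as np

# Hypothetical node counts per class, NOT the Haberman data.
survived_nodes = np.array([0, 0, 0, 1, 1, 2, 3, 8])
died_nodes = np.array([0, 2, 4, 6, 9, 13])


def ecdf(values, t):
    """Empirical CDF at t: fraction of values that are <= t."""
    return np.mean(values <= t)


def threshold_error(t):
    """Error rate of the rule 'predict survived if axil_nodes <= t'."""
    wrong_survived = np.sum(survived_nodes > t)  # survivors predicted as died
    wrong_died = np.sum(died_nodes <= t)         # deaths predicted as survived
    return (wrong_survived + wrong_died) / (len(survived_nodes) + len(died_nodes))


for t in range(0, 5):
    print(t, ecdf(survived_nodes, t), ecdf(died_nodes, t), threshold_error(t))
```

At a threshold `t`, one minus the survivors' CDF is the share of survivors that the rule gets wrong, and the CDF of the patients who died is the share of deaths it gets wrong.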

In [None]:
# Lets play with mean, std-dev, median, mad, iqr etc. 

In [None]:
# MEAN

print("Mean Ages:")
print("Overall average age: \t\t\t {}".format(np.mean(df["age"])))
print("Average age of patients who survived: \t {}".format(np.mean(survived["age"])))
print("Average age of patients who died\t {}".format(np.mean(died["age"])))

print("\nStd Dev:")
print("Overall std-dev: \t\t\t {}".format(np.std(df["age"])))
print("Age std-dev of patients who survived: \t {}".format(np.std(survived["age"])))
print("Age std-dev of patients who died\t {}".format(np.std(died["age"])))

- No useful information

In [None]:
print("Mean op_year:")
print("Overall average op_year: \t\t\t {}".format(np.mean(df["op_year"])))
print("Average op_year of patients who survived: \t {}".format(np.mean(survived["op_year"])))
print("Average op_year of patients who died\t\t {}".format(np.mean(died["op_year"])))

print("\nStd Dev:")
print("Overall std-dev: \t\t\t\t {}".format(np.std(df["op_year"])))
print("op_year std-dev of patients who survived: \t {}".format(np.std(survived["op_year"])))
print("op_year std-dev of patients who died\t\t {}".format(np.std(died["op_year"])))

- No useful information

In [None]:
print("Mean axil_nodes:")
print("Overall average axil_nodes: \t\t\t {}".format(np.mean(df["axil_nodes"])))
print("Average axil_nodes of patients who survived: \t {}".format(np.mean(survived["axil_nodes"])))
print("Average axil_nodes of patients who died\t\t {}".format(np.mean(died["axil_nodes"])))

print("\nStd Dev:")
print("Overall std-dev: \t\t\t\t {}".format(np.std(df["axil_nodes"])))
print("axil_nodes std-dev of patients who survived: \t {}".format(np.std(survived["axil_nodes"])))
print("axil_nodes std-dev of patients who died\t\t {}".format(np.std(died["axil_nodes"])))

**My Observations:**

16. Patients who died had more positive axil nodes on average (~7), but the std-dev is even larger (~9)
    - Positive axil nodes can be a good indicator of cancer, i.e. death within 5 years (and unfortunately surgery won't work)

17. There is an average of ~3 positive axil nodes per person, but the spread is larger (std ~7)
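
The per-class means and standard deviations can also be collected in one table with `groupby`. This is a sketch on hypothetical toy rows (not the Haberman data). One detail worth knowing: `np.std` uses the population formula (`ddof=0`), while the pandas `std` default is the sample formula (`ddof=1`), so the two give slightly different numbers.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Haberman data.
toy = pd.DataFrame({
    "axil_nodes": [0, 1, 2, 5, 4, 8, 12],
    "survived":   [1, 1, 1, 1, 0, 0, 0],
})

summary = toy.groupby("survived")["axil_nodes"].agg(
    mean="mean",
    std_population=lambda s: s.std(ddof=0),  # same formula as np.std
    std_sample="std",                        # pandas default, ddof=1
    median="median",
)

print(summary)
```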

In [None]:
print("Mean survived:")
print("Fraction of people who survived surgery: {}".format(np.mean(df["survived"])))

Let us look at the **median and its related values** so that our results aren't affected by outliers. The median only breaks down if more than half of the data is corrupted with outliers.
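
A small illustration of why the median is preferred here, using hypothetical numbers: a single extreme value moves the mean a lot, while the median and the median absolute deviation (MAD) stay put. The helper `mad` below is a hand-written raw MAD; as far as I know, `statsmodels.robust.mad` additionally divides by a normal-consistency constant by default, so its output is scaled differently.

```python
import numpy as np

# Hypothetical values: identical except for one extreme observation.
clean = np.array([0, 1, 1, 2, 3])
with_outlier = np.array([0, 1, 1, 2, 52])


def mad(values):
    """Raw median absolute deviation (no normal-consistency scaling)."""
    med = np.median(values)
    return np.median(np.abs(values - med))


for name, values in [("clean", clean), ("with_outlier", with_outlier)]:
    print(name,
          "mean:", np.mean(values),
          "median:", np.median(values),
          "std:", np.std(values),
          "mad:", mad(values))
```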

In [None]:
print("Median Ages:")
print("Overall middle age: \t\t\t {}".format(np.median(df["age"])))
print("Middle age of patients who survived: \t {}".format(np.median(survived["age"])))
print("Middle age of patients who died\t\t {}".format(np.median(died["age"])))


print("\nCoverage: 0th, 25th, 50th, 75th Percentiles")

print(np.percentile(df["age"],np.arange(0, 100, 25)))
print(np.percentile(survived["age"],np.arange(0, 100, 25)))
print(np.percentile(died["age"], np.arange(0, 100, 25)))

print("\nInsights from 90th percentile:")
print("================================")
print("90% of patients are aged {} or below.".format(np.percentile(df["age"],90)))
print("Which is more than mean or median\n")
print("90% of survivors are aged {} or below.".format(np.percentile(survived["age"],90)))
print("Which is same as population 90th percentile\n")
print("90% of dead patients are aged {} or below.".format(np.percentile(died["age"],90)))
print("Which is same as population 90th percentile\n")



from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(df["age"]))
print(robust.mad(survived["age"]))
print(robust.mad(died["age"]))


print("\nIQR: {}".format(np.percentile(df["age"],75) - np.percentile(df["age"],25)))

 * The median is nearly the same as the mean.
 * 90% of the data deals with people aged below 67 years
 * Explain IQR
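
As a short explanation of the IQR: it is the distance between the 75th and the 25th percentile, i.e. the width of the range that holds the middle half of the data. A minimal sketch with hypothetical ages (not the Haberman data):

```python
import numpy as np

# Hypothetical ages, NOT the Haberman data.
ages = np.array([30, 34, 38, 42, 46, 50, 54, 58, 62])

# 25th, 50th (median) and 75th percentiles.
q25, q50, q75 = np.percentile(ages, [25, 50, 75])

# Interquartile range: width of the interval containing the middle half of the data.
iqr = q75 - q25

# Values that fall inside the interquartile interval.
middle_half = ages[(ages >= q25) & (ages <= q75)]

print(q25, q50, q75, iqr)
print(middle_half)
```

Like the median, the IQR ignores how far the extreme values lie from the centre, which makes it a robust counterpart to the standard deviation.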

In [None]:
print("Median op_year:")
print("op_year: \t\t{}".format(np.median(df["op_year"])))
print("op_year survived:\t{}".format(np.median(survived["op_year"])))
print("op_year died\t\t{}".format(np.median(died["op_year"])))


print("\nCoverage: 0th, 25th, 50th, 75th Percentiles")
print(np.percentile(df["op_year"],np.arange(0, 100, 25)))
print(np.percentile(survived["op_year"],np.arange(0, 100, 25)))
print(np.percentile(died["op_year"], np.arange(0, 100, 25)))

print("\nInsights from 90th percentile:")
print("================================")
print("90% of operations were performed in or before {}.".format(np.percentile(df["op_year"],90)))
print("Which is slightly more than mean or median\n")
print("90% of survivors were operated on in or before {}.".format(np.percentile(survived["op_year"],90)))
print("Which is same as population 90th percentile\n")
print("90% of patients who died were operated on in or before {}.".format(np.percentile(died["op_year"],90)))
print("Which is same as population 90th percentile\n")



from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(df["op_year"]))
print(robust.mad(survived["op_year"]))
print(robust.mad(died["op_year"]))


print("\nIQR: {}".format(np.percentile(df["op_year"],75) - np.percentile(df["op_year"],25)))

In [None]:
print("Median axil_nodes:")
print("axil_nodes: \t\t{}".format(np.median(df["axil_nodes"])))
print("axil_nodes survived:\t{}".format(np.median(survived["axil_nodes"])))
print("axil_nodes died\t\t{}".format(np.median(died["axil_nodes"])))

print("Cannot be corrupted with outliers unless > 50% corruption. We have to see graphs")


print("\nCoverage: 0th, 25th, 50th, 75th Percentiles")
print(np.percentile(df["axil_nodes"],np.arange(0, 100, 25)))
print(np.percentile(survived["axil_nodes"],np.arange(0, 100, 25)))
print(np.percentile(died["axil_nodes"], np.arange(0, 100, 25)))

print("\nInsights from 90th percentile:")
print("================================")
print("90% of patients had max. {} positive nodes.".format(np.percentile(df["axil_nodes"],90)))
print("Which is very high compared to population mean or median\n")
print("90% of SURVIVED patients had max. {} positive nodes.".format(np.percentile(survived["axil_nodes"],90)))
print("\n90% of DEAD patients had max. {} positive nodes.".format(np.percentile(died["axil_nodes"],90)))




from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(df["axil_nodes"]))
print(robust.mad(survived["axil_nodes"]))
print(robust.mad(died["axil_nodes"]))


print("\nIQR: {}".format(np.percentile(df["axil_nodes"],75) - np.percentile(df["axil_nodes"],25)))

* Note Medians

In [None]:
# Box Plots
plt.close()

# One figure per feature (the extra bare plt.figure() call only produced an empty figure)
plt.figure(figsize=(10, 5))
sns.boxplot(x='survived', y='op_year', data=df)

plt.figure(figsize=(10, 5))
sns.boxplot(x='survived', y='age', data=df)

plt.figure(figsize=(10, 5))
sns.boxplot(x='survived', y='axil_nodes', data=df)


plt.show()

**My observations:**

* Interesting outliers for axil nodes
* Not just most, but almost all patients who survived more than 5 years after surgery had at most about 7 to 8 positive axil nodes
* The median number of positive axil nodes for survived patients is zero. It is a central value
* We can repeat the EDA after removing the outliers for survived patients.
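
Removing outliers among survived patients could look like the sketch below. It uses the common 1.5 x IQR fence, which is also the default whisker length of seaborn box plots; the toy rows are hypothetical (not the Haberman data), and the choice of fence is an assumption rather than the only option.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Haberman data.
toy = pd.DataFrame({
    "axil_nodes": [0, 0, 0, 1, 1, 2, 3, 4, 25, 0, 3, 9, 14],
    "survived":   [1, 1, 1, 1, 1, 1, 1, 1, 1,  0, 0, 0, 0],
})

# Quartiles of the node count among survivors only.
survivors = toy[toy["survived"] == 1]
q1 = survivors["axil_nodes"].quantile(0.25)
q3 = survivors["axil_nodes"].quantile(0.75)

# Upper fence of the 1.5 x IQR rule.
upper_fence = q3 + 1.5 * (q3 - q1)

# Drop only survivors above the fence; rows of patients who died are kept as they are.
is_outlier = (toy["survived"] == 1) & (toy["axil_nodes"] > upper_fence)
trimmed = toy[~is_outlier]

print(upper_fence)
print(trimmed)
```

Any plot or statistic from the notebook can then be recomputed on `trimmed` and compared with the untrimmed result.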


# MULTIVARIATE ANALYSIS

In [None]:
plt.close()
sns.set_style("whitegrid")
# `height` replaces the older `size` argument (renamed in seaborn 0.9)
sns.pairplot(df, hue="survived", height=3)
plt.show()

* Cannot be used effectively

In [None]:
#3-D plot

import plotly.express as px
fig = px.scatter_3d(df, x='axil_nodes', y='op_year', z='age',
                    color='survived', size='op_year', size_max=25)
fig.show()

[see pic here](https://www.kaggle.com/l0new0lf/plot3dplotly)
> * op_year vs pairplot can be used for classification but with some misclassification errors

In [None]:

# 2D density plot, contour plot
plt.close();
sns.jointplot(x="op_year", y="axil_nodes", data=df, kind="kde", height=30);
plt.show();

**Overall Observations:**

1. About 73% of patients survived 5 years or longer after surgery (224 of 305)
2. This may be an imbalanced dataset, as close to 75% of the patients are survivors (see the quartiles of `survived`). Is it? We will see.
- There are no null values
- Operations were performed on diverse age groups
- Operations are centered around 1962 with a `std-dev` of about 3 years
    - We may try to find out the medical situation in the country around that year
- 75% of patients have at most 4 positive axil nodes
    - This could be the reason behind the high survival rate of 0.73!
- axil nodes are irregularly distributed: there may be outliers, or most of the data lies near the minimum value, since the mean is close to the minimum.

- Observation `4` (diverse age groups) can be wrong. See the graphs as well
- Observation `5` (operation year) can be wrong. See the graphs as well

7. `age` cannot be used effectively on its own, as patients from more or less the same age groups both survived and did not survive breast surgery
8. Correction to obsvn `4`: age follows a roughly normal distribution with a wide spread. Mostly diverse? Yes, but normally distributed.
9. `op_year` cannot be used effectively for classification, as patients from more or less the same operation years both survived and did not survive breast surgery
    - We can see a peak in deaths in 1964
    - We can also see survival decreasing with time.
    - Let us check the points above against the survival 'rate' or death 'rate' per year, as we cannot depend only on survival counts or only on death counts

- The survival rate shows a slightly decreasing trend. Is this an effect of imbalanced data? No, it is normalized. What can be the reason behind this? Not our objective.
- Survival is lowest in 1965. People operated on in 1965 have lower chances of survival compared to 1961 and 1967. Patients operated on in 1963 have slightly higher chances of survival.
- Operation year is a somewhat good feature for classification
- People with **few or no positive axil nodes**, when operated upon, tend to survive more. This is intuitively correct, as they may not even have had cancer!
- People with **more positive axil nodes**, when operated upon, tend to `die relatively more` and survive less.
    - Can this contradiction be an effect of the unbalanced dataset? No, it shouldn't be.
    - This means that positive axil nodes definitely indicate the presence of cancer
    - But it seems to imply that **breast surgeries are not very useful**. They may only profit medical organisations, as they can perform surgeries on patients without cancer and the patient still survives.
- For any given number of positive axil nodes (except near zero), **the survival density is always lower than the non-survival density!!!**
- This is only a moderately `imbalanced dataset`: the non-survival bar is about 36% of the survival bar (81/224). The 63% computed earlier, (224-81)/224, is the gap between the two classes relative to the survivors, not the ratio of the bars.
    - This answers obsvn. `11` doubt 1
- Most people have at most ~26 positive nodes
- One thing we can note otherwise is that (except near zero) **survival is always less compared to non-survival**
* Interesting outliers for axil nodes
* Not just most, but almost all patients who survived more than 5 years after surgery had at most about 7 to 8 positive axil nodes
* The median number of positive axil nodes for survived patients is zero. It is a central value
* We can repeat the EDA after removing the outliers for survived patients.
* op_year vs pairplot can be used for classification but with some misclassification errors [see pic here](https://www.kaggle.com/l0new0lf/plot3dplotly)
* See outputs from python for MEDIAN observations