# Breast Cancer Survival Prediction Using Machine Learning
**Author: Team Octopy**

# Define the problem
Survival prediction in breast cancer is an essential aspect of patient care. By analyzing various factors such as tumor size and genetic markers, healthcare providers can estimate a patient's prognosis and adjust treatment accordingly. This personalized approach improves the chances of long-term survival and helps patients make informed decisions about their healthcare journey.

Even experienced doctors can give imprecise prognoses because of the complexity of medicine, its inherent uncertainties, and the potential for human error. Machine learning can help here by offering tools that support medical decision-making, improve accuracy, and assist healthcare professionals in providing better patient care.

In this notebook, we explore breast cancer data and build a model that predicts a patient's overall survival.

# Creating the dataframe

In [46]:
# Silence scikit-learn convergence warnings to keep the notebook output readable
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

import pandas as pd
df = pd.read_csv('METABRIC_RNA_Mutation.csv', low_memory=False)

# Dataset cleaning

In [47]:
# Keep only the columns used in this analysis; drop everything else
keep_columns = [
    'patient_id', 'age_at_diagnosis', 'type_of_breast_surgery', 'cancer_type_detailed',
    'cellularity', 'chemotherapy', 'pam50_+_claudin-low_subtype', 'cohort',
    'er_status_measured_by_ihc', 'er_status', 'neoplasm_histologic_grade',
    'her2_status_measured_by_snp6', 'her2_status', 'tumor_other_histologic_subtype',
    'hormone_therapy', 'inferred_menopausal_state', 'integrative_cluster',
    'primary_tumor_laterality', 'oncotree_code', 'lymph_nodes_examined_positive',
    'mutation_count', 'nottingham_prognostic_index', 'overall_survival_months',
    'overall_survival', 'pr_status', 'radio_therapy', '3-gene_classifier_subtype',
    'tumor_size', 'tumor_stage', 'death_from_cancer',
]
df = df.drop(columns=df.columns.difference(keep_columns))

In [48]:
# Drop rows with invalid attribute values (candidates were found by inspecting unique())
# Drop rows that contain NaN in any attribute
df = df.dropna()
# Keep only rows whose cancer type is one of the valid values
valid_cancer_types = [
    'Breast Invasive Ductal Carcinoma',
    'Breast Mixed Ductal and Lobular Carcinoma',
    'Breast Invasive Lobular Carcinoma',
    'Breast Invasive Mixed Mucinous Carcinoma',
    'Metaplastic Breast Cancer',
]
df = df[df['cancer_type_detailed'].isin(valid_cancer_types)].copy()

print(df['type_of_breast_surgery'].unique())
print(df)

['BREAST CONSERVING' 'MASTECTOMY']
      patient_id  age_at_diagnosis type_of_breast_surgery  \
1              2             43.19      BREAST CONSERVING   
4              8             76.97             MASTECTOMY   
5             10             78.77             MASTECTOMY   
8             28             86.41      BREAST CONSERVING   
9             35             84.22             MASTECTOMY   
...          ...               ...                    ...   
1618        6232             71.22             MASTECTOMY   
1619        6233             70.65      BREAST CONSERVING   
1621        6237             75.62             MASTECTOMY   
1623        6239             52.84      BREAST CONSERVING   
1664        6346             63.20      BREAST CONSERVING   

                           cancer_type_detailed cellularity  chemotherapy  \
1              Breast Invasive Ductal Carcinoma        High             0   
4     Breast Mixed Ductal and Lobular Carcinoma        High             1   
5
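
Because `dropna()` removes every row that has a missing value in any column, it can be worth checking first how many rows each column would cost. The following is a minimal sketch on a toy frame; the helper name `summarize_missing` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and fractions per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "fraction": counts / len(frame),
    })
    # Keep only columns that actually contain missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data with made-up values; only the column names echo the dataset
toy = pd.DataFrame({
    "tumor_size": [22.0, None, 15.0, 30.0],
    "tumor_stage": [2.0, None, None, 1.0],
    "chemotherapy": [0, 1, 0, 1],
})

missing_summary = summarize_missing(toy)
rows_lost = len(toy) - len(toy.dropna())

print(missing_summary)
print(f"dropna() would remove {rows_lost} of {len(toy)} rows")
```

Running the same helper on `df` before the `dropna()` call would show which columns are responsible for most of the discarded rows.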

# Encoding Categorical data

In [49]:
# Convert categorical data to numerical codes using LabelEncoder
from sklearn.preprocessing import LabelEncoder

categorical_columns = [
    'oncotree_code', 'type_of_breast_surgery', 'cancer_type_detailed',
    'pam50_+_claudin-low_subtype', 'her2_status_measured_by_snp6',
    'er_status_measured_by_ihc', 'er_status', 'her2_status',
    'tumor_other_histologic_subtype', 'inferred_menopausal_state',
    'integrative_cluster', 'primary_tumor_laterality', 'pr_status',
    '3-gene_classifier_subtype', 'death_from_cancer', 'cellularity',
]

# Fit a separate LabelEncoder on each column and replace its values with integer codes
for column in categorical_columns:
    df[column] = LabelEncoder().fit_transform(df[column])

# Print the unique values for each column
for column in df.columns:
    unique_values = df[column].unique()
    print(f'Unique values in column {column}: {unique_values}')




Unique values in column patient_id: [   2    8   10 ... 6237 6239 6346]
Unique values in column age_at_diagnosis: [43.19 76.97 78.77 86.41 84.22 85.49 45.43 61.49 68.68 46.89 51.38 49.87
 54.23 83.89 48.59 39.84 42.55 60.07 82.73 72.1  78.73 58.95 76.89 43.46
 73.98 57.4  69.16 58.89 73.11 72.3  51.33 47.62 74.07 85.39 62.4  79.28
 53.16 77.13 70.22 67.7  63.77 62.46 62.72 71.5  57.56 37.87 53.75 44.98
 76.4  60.26 83.35 64.57 48.11 83.99 59.18 38.78 85.94 51.69 68.45 68.41
 57.79 62.55 56.45 75.18 50.48 43.37 49.5  46.   86.24 79.38 53.56 55.02
 41.41 75.63 43.39 37.24 77.72 50.45 58.21 63.35 45.5  44.73 52.55 43.63
 52.19 77.85 53.72 66.75 53.45 57.61 62.7  63.93 46.44 45.39 29.98 55.36
 68.66 51.81 55.22 51.19 80.34 46.17 47.13 82.46 81.53 47.71 60.62 82.53
 89.43 67.38 77.94 55.52 73.48 56.32 50.82 92.14 50.98 46.86 86.28 79.34
 74.76 79.97 59.75 37.3  32.61 81.88 72.26 61.32 86.26 73.01 58.16 57.87
 83.58 40.5  52.79 73.24 48.67 76.84 51.74 74.79 64.01 78.41 39.86 63.31
 34.68 72.
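
The encoding loop discards each fitted encoder, so the integer codes in the output can no longer be mapped back to their original labels. One possible variation, sketched here on a toy frame with made-up values, keeps the encoders in a dictionary (the names `encoders` and `mapping` are illustrative). `LabelEncoder` assigns codes in sorted label order, so the codes carry no clinical ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the column names echo the dataset, the values are illustrative
toy = pd.DataFrame({
    "cellularity": ["High", "Low", "Moderate", "High"],
    "er_status": ["Positive", "Negative", "Positive", "Positive"],
})

# Keep one fitted encoder per column so codes can be decoded later
encoders = {}
for column in ["cellularity", "er_status"]:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# Label-to-code lookup for each column (classes_ is sorted alphabetically)
mapping = {
    column: {str(label): code for code, label in enumerate(encoder.classes_)}
    for column, encoder in encoders.items()
}

# Recover the original labels from the integer codes
decoded = encoders["cellularity"].inverse_transform(toy["cellularity"].to_numpy())

print(mapping)
print(list(decoded))
```

Because a linear model treats these codes as ordered numbers, one-hot encoding may be a better fit for columns whose categories have no natural order.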

# Rounding the Values of Age

In [50]:
# Round age values to the nearest whole year

df['age_at_diagnosis'] = df['age_at_diagnosis'].round()
print(df['age_at_diagnosis'])


1       43.0
4       77.0
5       79.0
8       86.0
9       84.0
        ... 
1618    71.0
1619    71.0
1621    76.0
1623    53.0
1664    63.0
Name: age_at_diagnosis, Length: 1087, dtype: float64


# Capping Outliers

In [51]:


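# Cap (winsorize) every column at its 1st and 99th percentiles.
# Rows are kept; only extreme values are replaced by the nearest limit.
# Note: this loop also touches patient_id and the label-encoded columns.
for column in df.columns:
    lower_limit = df[column].quantile(0.01)
    upper_limit = df[column].quantile(0.99)
    df[column] = df[column].apply(lambda x: lower_limit if x < lower_limit else (upper_limit if x > upper_limit else x))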

for column in df.columns:
    print(df[column].unique())

print(df.shape)

[ 105.44  106.    111.   ... 6194.   6201.   6201.98]
[43. 77. 79. 86. 84. 85. 45. 61. 69. 47. 51. 50. 54. 49. 40. 60. 83. 72.
 59. 74. 57. 73. 48. 62. 53. 70. 68. 64. 63. 58. 38. 76. 65. 39. 52. 56.
 75. 46. 55. 41. 37. 78. 44. 67. 31. 80. 82. 33. 35. 34. 71. 66. 36. 42.
 81. 32.]
[0 1]
[0 3 1 2]
[0 2 1]
[0 1]
[2 3 1 0 5 6 4]
[1. 2. 3. 5.]
[1 0]
[1 0]
[3. 2. 1.]
[2. 0. 1.]
[0 1]
[0 3 1 5 4 2]
[1 0]
[1 0]
[ 4 10  8  3  1  9  7  0  2  6  5]
[1 0]
[ 0.  8.  1. 16.  5. 14.  6.  2.  3.  9. 19.  4.  7. 12. 10. 15. 13. 11.
 17. 18.]
[ 2.    4.    5.    1.    3.    8.    7.   11.    9.    6.   19.14 10.
 14.   13.   12.   16.   18.   15.   17.   19.  ]
[4.02    6.08    4.062   5.032   3.056   3.044   4.046   4.032   4.078
 3.068   5.08    4.14    2.054   6.12    3.06    4.05    4.12    5.052
 5.06    3.04    4.1     2.02572 3.028   3.16    5.044   3.026   3.036
 4.06    4.054   5.076   4.038   6.1     4.024   3.024   4.028   6.104
 5.048   4.034   3.052   4.04    6.088   4.048   6.072   4.088
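
The output above shows that the percentile limits were also applied to `patient_id` (for example 105.44 and 6201.98). A possible refinement is to cap only the columns that hold continuous measurements. The sketch below assumes a hypothetical helper `cap_outliers` and uses toy values with wide quantiles so the effect is visible on five rows.

```python
import pandas as pd


def cap_outliers(frame, columns, lower_q=0.01, upper_q=0.99):
    """Return a copy of frame with the given columns clipped to their quantile limits."""
    capped = frame.copy()
    for column in columns:
        lower = capped[column].quantile(lower_q)
        upper = capped[column].quantile(upper_q)
        capped[column] = capped[column].clip(lower=lower, upper=upper)
    return capped


# Toy data with made-up values; 200.0 plays the role of an outlier
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "tumor_size": [10.0, 12.0, 11.0, 13.0, 200.0],
})

# Only tumor_size is capped; patient_id is left untouched
capped = cap_outliers(toy, ["tumor_size"], lower_q=0.25, upper_q=0.75)
print(capped)
```

On the real data, the `columns` argument could list measurements such as `tumor_size` or `mutation_count`, with the default 1st and 99th percentiles used in the notebook.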

# Training

In [52]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

X = df.drop('overall_survival', axis=1)
y = df['overall_survival']

# Split the dataset: 70% for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)


y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('The accuracy of LogisticRegression: ')
print(accuracy)


The accuracy of LogisticRegression: 
0.9755351681957186
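
The feature matrix above still contains `patient_id`, `death_from_cancer`, and `overall_survival_months`. The last two appear to describe the outcome itself, which may account for part of the high accuracy. The sketch below shows one way to exclude such columns and to estimate accuracy with cross-validation instead of a single split. The list `LEAKY_COLUMNS`, the helper `build_features`, and the random toy data are assumptions for illustration, so the scores it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: these columns identify the patient or encode the outcome
LEAKY_COLUMNS = ["patient_id", "death_from_cancer", "overall_survival_months"]


def build_features(frame, target="overall_survival", leaky=LEAKY_COLUMNS):
    """Split frame into features and target, leaving out leaky columns."""
    to_drop = [target] + [c for c in leaky if c in frame.columns]
    return frame.drop(columns=to_drop), frame[target]


# Random toy data standing in for the cleaned dataframe
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "patient_id": np.arange(n),
    "age_at_diagnosis": rng.normal(60, 10, n).round(),
    "tumor_size": rng.normal(25, 8, n),
    "overall_survival": rng.integers(0, 2, n),
})
# In this toy frame the column reveals the label by construction
toy["death_from_cancer"] = 1 - toy["overall_survival"]

X, y = build_features(toy)

# 5-fold cross-validation (stratified by default for classifiers)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(list(X.columns))
print(scores.mean())
```

Applied to `df`, the same helper would give an accuracy estimate based only on information available at diagnosis and treatment time.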
