<a href="https://colab.research.google.com/github/ItsmeBlackOps/Breast-Cancer-Survival-Predication/blob/main/Breast_Cancer_Survival_Predication.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Breast Cancer Survival Prediction using Python**

**Importing Libraries**

---
In this step, we import the libraries used in the analysis: pandas for data manipulation, numpy for numerical operations, plotly.express for data visualization, and, from scikit-learn, train_test_split for splitting the data and SVC (Support Vector Classifier) for our machine learning model.





In [43]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

**Reading the Dataset**


---

Here, we load the breast cancer dataset from a CSV file named "BRCA.csv" using pandas and print the first few rows to get an overview of the data.



In [44]:
data = pd.read_csv("/content/drive/MyDrive/BRCA.csv")
print(data.head())

     Patient_ID   Age  Gender  Protein1  Protein2  Protein3  Protein4  \
0  TCGA-D8-A1XD  36.0  FEMALE  0.080353   0.42638   0.54715  0.273680   
1  TCGA-EW-A1OX  43.0  FEMALE -0.420320   0.57807   0.61447 -0.031505   
2  TCGA-A8-A079  69.0  FEMALE  0.213980   1.31140  -0.32747 -0.234260   
3  TCGA-D8-A1XR  56.0  FEMALE  0.345090  -0.21147  -0.19304  0.124270   
4  TCGA-BH-A0BF  56.0  FEMALE  0.221550   1.90680   0.52045 -0.311990   

  Tumour_Stage                      Histology ER status PR status HER2 status  \
0          III  Infiltrating Ductal Carcinoma  Positive  Positive    Negative   
1           II             Mucinous Carcinoma  Positive  Positive    Negative   
2          III  Infiltrating Ductal Carcinoma  Positive  Positive    Negative   
3           II  Infiltrating Ductal Carcinoma  Positive  Positive    Negative   
4           II  Infiltrating Ductal Carcinoma  Positive  Positive    Negative   

                  Surgery_type Date_of_Surgery Date_of_Last_Visit  \
0  Mo

**Checking for Null Values**
---
We examine the dataset for any missing values by using the isnull() function, which returns a boolean dataframe indicating whether each value is null or not. Then, we calculate the sum of null values for each column using sum().



In [45]:
print(data.isnull().sum())

Patient_ID             7
Age                    7
Gender                 7
Protein1               7
Protein2               7
Protein3               7
Protein4               7
Tumour_Stage           7
Histology              7
ER status              7
PR status              7
HER2 status            7
Surgery_type           7
Date_of_Surgery        7
Date_of_Last_Visit    24
Patient_Status        20
dtype: int64


**Dropping Rows with Null Values**
---
Since missing values can impact our analysis and modeling, we drop the rows containing null values using the dropna() function.



In [46]:
data = data.dropna()
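
As a small illustration of the two steps above, the sketch below runs isnull().sum() and dropna() on a toy frame. The values are made up for illustration and are not taken from BRCA.csv. It also shows dropna(subset=...), an alternative (not used in this notebook) that drops a row only when a chosen column, here the target, is missing.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values, not from BRCA.csv) with a few missing entries.
toy = pd.DataFrame({
    "Age": [36.0, 43.0, np.nan, 56.0],
    "Tumour_Stage": ["III", "II", "III", None],
    "Patient_Status": ["Alive", None, "Alive", "Dead"],
})

# Count missing values per column, as in the null-check cell.
null_counts = toy.isnull().sum()
print(null_counts)

# dropna() removes every row that has at least one missing value.
complete = toy.dropna()
print(len(toy), "rows before,", len(complete), "rows after")

# Alternative: drop a row only when the target column is missing.
labelled = toy.dropna(subset=["Patient_Status"])
print(len(labelled), "rows with a known Patient_Status")
```

Which variant fits depends on how much data we can afford to lose; this notebook drops every incomplete row.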

**Exploring the Dataset**
---
To gain insights into the dataset, we print the information about the data, including column names, data types, and non-null counts. Additionally, we count the number of patients based on their gender.



In [47]:
print(data.info())
print(data.Gender.value_counts())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 317 entries, 0 to 333
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Patient_ID          317 non-null    object 
 1   Age                 317 non-null    float64
 2   Gender              317 non-null    object 
 3   Protein1            317 non-null    float64
 4   Protein2            317 non-null    float64
 5   Protein3            317 non-null    float64
 6   Protein4            317 non-null    float64
 7   Tumour_Stage        317 non-null    object 
 8   Histology           317 non-null    object 
 9   ER status           317 non-null    object 
 10  PR status           317 non-null    object 
 11  HER2 status         317 non-null    object 
 12  Surgery_type        317 non-null    object 
 13  Date_of_Surgery     317 non-null    object 
 14  Date_of_Last_Visit  317 non-null    object 
 15  Patient_Status      317 non-null    object 
dtypes: float

**Visualizing Tumor Stage**
---
We create a pie chart using Plotly Express to visualize the distribution of tumor stages among the patients. The chart provides a visual representation of the relative proportions of different tumor stages.



In [48]:
# Count patients per tumour stage; the index holds the stage labels.
stage = data["Tumour_Stage"].value_counts()
labels = stage.index
counts = stage.values
figure = px.pie(data, values=counts, names=labels, hole=0.5, title="Tumour Stages of Patients")
figure.show()
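
The slice sizes of such a pie chart can also be read as numbers. The sketch below, which uses made-up stage labels rather than the real Tumour_Stage column, computes the absolute counts and the percentage share of each category with value_counts().

```python
import pandas as pd

# Made-up stage labels for illustration; the real column is data["Tumour_Stage"].
stages = pd.Series(["II", "II", "III", "I", "II", "III", "II", "I"], name="Tumour_Stage")

# Absolute counts, sorted from most to least frequent.
counts = stages.value_counts()

# Percentage share per stage, i.e. the relative slice sizes of the pie chart.
shares = (stages.value_counts(normalize=True) * 100).round(1)

print(counts)
print(shares)
```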


**Visualizing Histology**
---
Similarly, we create a pie chart to visualize the distribution of histology (the tumor type as classified from its tissue under the microscope) among the patients. This chart shows the proportions of the different histological types.



In [49]:
# Count patients per histological type; the index holds the type names.
histology = data["Histology"].value_counts()
labels = histology.index
counts = histology.values
figure = px.pie(data, values=counts, names=labels, hole=0.5, title="Histology of Patients")
figure.show()


**Exploring ER Status, PR Status, and HER2 Status**
---
We examine the counts of ER (Estrogen Receptor), PR (Progesterone Receptor), and HER2 (Human Epidermal Growth Factor Receptor 2) status among the patients. These counts show how many patients test positive or negative for each biomarker. In this dataset, every remaining patient is ER-positive and PR-positive; only HER2 status varies.



In [50]:
print(data["ER status"].value_counts())
print(data["PR status"].value_counts())
print(data["HER2 status"].value_counts())

Positive    317
Name: ER status, dtype: int64
Positive    317
Name: PR status, dtype: int64
Negative    288
Positive     29
Name: HER2 status, dtype: int64
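
A column that holds a single value cannot help a classifier tell patients apart. One way to detect such columns is nunique(), sketched here on made-up rows that mimic the pattern in the output above. Dropping the constant columns is an optional step that this notebook does not take.

```python
import pandas as pd

# Made-up rows that mimic the pattern above: ER and PR are always "Positive".
toy = pd.DataFrame({
    "ER status": ["Positive"] * 5,
    "PR status": ["Positive"] * 5,
    "HER2 status": ["Negative", "Negative", "Positive", "Negative", "Negative"],
})

# A column with one distinct value carries no information about differences
# between patients.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print("Constant columns:", constant_cols)

# Dropping them leaves only features that vary across patients.
informative = toy.drop(columns=constant_cols)
print(informative.columns.tolist())
```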


**Visualizing Type of Surgery of Patients**
---
We create a pie chart to visualize the distribution of surgery types performed on the patients. This chart helps us understand the relative frequencies of different surgical interventions.



In [51]:
# Count patients per surgery type; the index holds the surgery names.
surgery = data["Surgery_type"].value_counts()
labels = surgery.index
counts = surgery.values
figure = px.pie(data, values=counts, names=labels, hole=0.5, title="Type of Surgery of Patients")
figure.show()

**Transforming Categorical Features**
---
In this step, we convert categorical features into numerical representations for our machine learning model. We map categorical values to numerical codes using the map() function for the following columns: Tumour_Stage, Histology, ER status, PR status, HER2 status, Gender, and Surgery_type.



In [52]:
data["Tumour_Stage"] = data["Tumour_Stage"].map({"I": 1, "II": 2, "III": 3})
data["Histology"] = data["Histology"].map({"Infiltrating Ductal Carcinoma": 1,
                                           "Infiltrating Lobular Carcinoma": 2, "Mucinous Carcinoma": 3})
data["ER status"] = data["ER status"].map({"Positive": 1})
data["PR status"] = data["PR status"].map({"Positive": 1})
data["HER2 status"] = data["HER2 status"].map({"Positive": 1, "Negative": 2})
data["Gender"] = data["Gender"].map({"MALE": 0, "FEMALE": 1})
data["Surgery_type"] = data["Surgery_type"].map({"Other": 1, "Modified Radical Mastectomy": 2,
                                                 "Lumpectomy": 3, "Simple Mastectomy": 4})
data.head()

| | Patient_ID | Age | Gender | Protein1 | Protein2 | Protein3 | Protein4 | Tumour_Stage | Histology | ER status | PR status | HER2 status | Surgery_type | Date_of_Surgery | Date_of_Last_Visit | Patient_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCGA-D8-A1XD | 36.0 | 1 | 0.080353 | 0.42638 | 0.54715 | 0.27368 | 3 | 1 | 1 | 1 | 2 | 2 | 15-Jan-17 | 19-Jun-17 | Alive |
| 1 | TCGA-EW-A1OX | 43.0 | 1 | -0.42032 | 0.57807 | 0.61447 | -0.031505 | 2 | 3 | 1 | 1 | 2 | 3 | 26-Apr-17 | 09-Nov-18 | Dead |
| 2 | TCGA-A8-A079 | 69.0 | 1 | 0.21398 | 1.3114 | -0.32747 | -0.23426 | 3 | 1 | 1 | 1 | 2 | 1 | 08-Sep-17 | 09-Jun-18 | Alive |
| 3 | TCGA-D8-A1XR | 56.0 | 1 | 0.34509 | -0.21147 | -0.19304 | 0.12427 | 2 | 1 | 1 | 1 | 2 | 2 | 25-Jan-17 | 12-Jul-17 | Alive |
| 4 | TCGA-BH-A0BF | 56.0 | 1 | 0.22155 | 1.9068 | 0.52045 | -0.31199 | 2 | 1 | 1 | 1 | 2 | 1 | 06-May-17 | 27-Jun-19 | Dead |
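
One detail of map() is worth keeping in mind: any value that has no key in the mapping dictionary becomes NaN. The sketch below shows this with made-up stage labels, where "IV" is a hypothetical value that the dictionary does not cover, and lists the labels that were missed.

```python
import pandas as pd

# Made-up stage labels; "IV" is a hypothetical value missing from the mapping.
stages = pd.Series(["I", "II", "III", "IV"])
stage_codes = {"I": 1, "II": 2, "III": 3}

# map() returns NaN for any value that has no key in the dictionary.
encoded = stages.map(stage_codes)
print(encoded)

# List the original labels that were not covered by the mapping.
unmapped = stages[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

Running a check like this after encoding helps catch categories that would otherwise reach the model as NaN.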


**Splitting the Data into Training and Testing Sets**
---
We split the data into input features (x) and the target variable (y). Then, we further split the data into training and test sets using the train_test_split() function from scikit-learn. Here, the test set size is set to 10% of the data (test_size=0.10), and a fixed random state is used for reproducibility.



In [57]:
x = np.array(data[['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4', 'Tumour_Stage', 'Histology', 'ER status', 'PR status', 'HER2 status', 'Surgery_type']])
y = np.array(data['Patient_Status'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.10, random_state=42)
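
To make the effect of test_size concrete, the sketch below splits synthetic data (not BRCA.csv) with the same test_size and random_state. It also passes stratify, an optional argument that the cell above does not use, which keeps the class ratio the same in the training and test parts. The names x_demo and y_demo are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (not BRCA.csv): 100 samples, 3 features, 80/20 class imbalance.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = np.array(["Alive"] * 80 + ["Dead"] * 20)

# test_size=0.10 holds out 10% of the rows; stratify=y_demo keeps the
# Alive/Dead ratio the same in both parts (optional, not used in the notebook).
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.10, random_state=42, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
print(dict(zip(*np.unique(y_te, return_counts=True))))
```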


**Training the SVM Model**
---
We initialize a Support Vector Classifier (SVC) model and train it on the training data using the fit() method.



In [58]:
model = SVC()
model.fit(x_train, y_train)
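
SVC with its default RBF kernel is sensitive to the scale of its inputs, and in this dataset Age spans tens of years while the protein values lie close to zero. A common option, which this notebook does not use, is to standardize the features inside a pipeline. The sketch below shows the idea on synthetic data; the column names age_like and protein_like are placeholders, not real columns.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (not BRCA.csv): one wide-range column and one narrow-range column.
rng = np.random.default_rng(0)
age_like = rng.uniform(30, 90, size=200)
protein_like = rng.normal(0, 1, size=200)
x_demo = np.column_stack([age_like, protein_like])
# In this toy setup the label depends only on the narrow-range column.
y_demo = np.where(protein_like > 0, "Alive", "Dead")

# StandardScaler rescales each column to zero mean and unit variance
# before the SVC sees it.
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(x_demo, y_demo)

print("Training accuracy with scaling:", scaled_model.score(x_demo, y_demo))
```

Because the scaler is part of the pipeline, the same rescaling is applied automatically whenever predict() or score() is called.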


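**Evaluating the Model**
---
As a quick sanity check, we ask the trained SVM model to predict the survival status of a single patient. The feature values are those of the first row of the dataset (patient TCGA-D8-A1XD, recorded as Alive), which may have been part of the training set, so a correct prediction here says little about how well the model generalizes. For an accuracy score on held-out data, call model.score(x_test, y_test), which returns the fraction of test patients whose status is predicted correctly.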

In [59]:
# Predict the status of one patient.
# Feature order: Age, Gender, Protein1, Protein2, Protein3, Protein4, Tumour_Stage, Histology, ER status, PR status, HER2 status, Surgery_type
features = np.array([[36.0, 1, 0.080353, 0.42638, 0.54715, 0.273680, 3, 1, 1, 1, 2, 2]])
print(model.predict(features))

['Alive']
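
A single prediction does not tell us how often the model is right. The sketch below shows how an accuracy score on held-out data could be computed with score(), using synthetic data in place of x and y. It also compares the result with a majority-class baseline, which is a useful reference whenever one status is much more common than the other. All names ending in _demo are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for x and y (not BRCA.csv): 200 samples, 4 features.
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(200, 4))
y_demo = np.where(x_demo[:, 0] + x_demo[:, 1] > 0, "Alive", "Dead")

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.10, random_state=42
)

demo_model = SVC()
demo_model.fit(x_tr, y_tr)

# score() returns the fraction of test rows whose label is predicted correctly.
accuracy = demo_model.score(x_te, y_te)

# Baseline: always predict the most frequent training label.
labels, counts = np.unique(y_tr, return_counts=True)
majority = labels[np.argmax(counts)]
baseline = float(np.mean(y_te == majority))

print(f"SVC accuracy: {accuracy:.2f}, majority-class baseline: {baseline:.2f}")
```

On the real data, the equivalent call would be model.score(x_test, y_test); a model is only informative if it beats the baseline.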
