<h2><b>Breast Cancer Survival Prediction</b></h2>
<h4><b>Author:</b> Data Science @ Georgia Tech</h4>
<p><b>Reference:</b> <a href="https://medium.com/coders-camp/225-machine-learning-projects-with-python-44d6ea8ace18">Medium</a></p>

<p><b>Welcome to the Breast Cancer Survival Prediction self-guided project!</b></p>
<p>In this project, we will build a machine learning model that predicts whether a breast cancer patient survived (the <code>Patient_Status</code> column) from clinical and molecular features.</p>

Here is the schema for the dataset:

<ul>
    <li><code>Patient_ID</code>: ID of the patient</li>
    <li><code>Age</code>: Age of the patient</li>
    <li><code>Gender</code>: Gender of the patient</li>
    <li><code>Protein1, Protein2, Protein3, Protein4</code>: Protein expression levels</li>
    <li><code>Tumour_Stage</code>: Breast cancer stage of the patient</li>
    <li><code>Histology</code>: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, Mucinous Carcinoma</li>
    <li><code>ER status</code>: Positive/Negative</li>
    <li><code>PR status</code>: Positive/Negative</li>
    <li><code>HER2 status</code>: Positive/Negative</li>
    <li><code>Surgery_type</code>: Lumpectomy, Simple Mastectomy, Modified Radical Mastectomy, Other</li>
    <li><code>DateofSurgery</code>: The date of surgery</li>
    <li><code>DateofLast_Visit</code>: The date of the last visit of the patient</li>
    <li><code>Patient_Status</code>: Alive/Dead</li>
</ul>

We will import the modules for you.

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

Now, load the dataset and display its first few rows.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Reading the Dataset</b></font></summary>
  <pre>
    <code style="display: block;">
        # Reading data solution
        data = pd.read_csv("BRCA.csv")
        print(data.head())
    </code>
  </pre>
</details>

Next, we will clean the dataset. Start by counting the missing values in each column.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Checking for Missing Values</b></font></summary>
  <pre>
    <code style="display: block;">
        # Missing data in each column solution
        print(data.isnull().sum())
    </code>
  </pre>
</details>

Drop every row that contains a missing value.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Drop Missing Values</b></font></summary>
  <pre>
    <code style="display: block;">
        # Dropping the data in each column solution
        data = data.dropna()
    </code>
  </pre>
</details>
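Because `dropna()` removes an entire row whenever any column is missing, it can be worth checking how many rows survive. The sketch below uses a small made-up DataFrame (not the real BRCA data) to illustrate the check; in the notebook you would apply the same idea to `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the BRCA data; all values are invented for illustration.
toy = pd.DataFrame({
    "Age": [36, 43, np.nan, 69],
    "Gender": ["FEMALE", "FEMALE", "MALE", None],
    "Patient_Status": ["Alive", None, "Dead", "Alive"],
})

rows_before = len(toy)
cleaned = toy.dropna()  # drops every row with at least one missing value
rows_after = len(cleaned)

print(f"Dropped {rows_before - rows_after} of {rows_before} rows")
print(cleaned.isnull().sum().sum())  # total missing values left after cleaning
```

If a large share of rows disappears, filling in missing values may be a better option than dropping them.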

Next, inspect the data type of each column.

Print a summary of the dataset.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Dataset Summary</b></font></summary>
  <pre>
    <code style="display: block;">
        # Data summary solution
        data.info()
    </code>
  </pre>
</details>

Print out how many males and females there are in the dataset.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Gender Counts</b></font></summary>
  <pre>
    <code style="display: block;">
        # Males and females solution
        print(data.Gender.value_counts())
    </code>
  </pre>
</details>
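To see proportions rather than raw counts, `value_counts` accepts `normalize=True`. This is a minimal sketch on a made-up `Gender` series, not the real dataset:

```python
import pandas as pd

# Invented values for illustration only.
toy_gender = pd.Series(["FEMALE", "FEMALE", "FEMALE", "MALE"], name="Gender")

counts = toy_gender.value_counts()                # absolute counts per category
shares = toy_gender.value_counts(normalize=True)  # fraction of rows per category

print(counts)
print(shares.round(3))
```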

As you can see, females outnumber males in the dataset.

Now, let's look at the tumor stages of the patients.

Create a pie chart showing the percentage of patients in each tumour stage.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Tumor Stages Chart</b></font></summary>
  <pre>
    <code style="display: block;">
        # Tumor stages pie chart solution
        stage = data["Tumour_Stage"].value_counts()
        labels = stage.index       # stage names
        counts = stage.values      # number of patients per stage
        figure = px.pie(
            values=counts,
            names=labels, hole=0.5,
            title="Tumour Stages of Patients")
        figure.show()
    </code>
  </pre>
</details>

Most of the patients are in the second tumor stage.
Now let’s have a look at the histology of breast cancer patients.

<b>Histology</b> is a description of a tumour based on how abnormal the cancer cells and tissue look under a microscope and how quickly cancer can grow and spread.

In [None]:
# Write your code here. Use the "Histology" column!


<details>
  <summary>Click for solution: <font color="sky blue"><b>Our Histology Solution</b></font></summary>
  <pre>
    <code style="display: block;">
        # Histology
        histology = data["Histology"].value_counts()
        labels = histology.index     # histology types
        counts = histology.values    # number of patients per type
        figure = px.pie(
                     values=counts,
                     names=labels, hole=0.5,
                     title="Histology of Patients")
        figure.show()
    </code>
  </pre>
</details>

Now let's examine the ER, PR, and HER2 statuses of the patients.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Our ER, PR, and HER2 Solution</b></font></summary>
  <pre>
    <code style="display: block;">
        # ER status
        print(data["ER status"].value_counts())
        # PR status
        print(data["PR status"].value_counts())
        # HER2 status
        print(data["HER2 status"].value_counts())
    </code>
  </pre>
</details>

Let's take a look at the types of surgery performed on the patients.

Create a pie chart of the surgery types.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Surgery Types Chart</b></font></summary>
  <pre>
    <code style="display: block;">
        # Surgery Type
        surgery = data["Surgery_type"].value_counts()
        labels = surgery.index     # surgery types
        counts = surgery.values    # number of patients per surgery type
        figure = px.pie(
             values=counts,
             names=labels, hole=0.5,
             title="Type of Surgery of Patients")
        figure.show()
    </code>
  </pre>
</details>

Since most of the columns in the dataset are categorical, we have to convert them to numeric values before training a model.

Transform the categorical columns into their numeric counterparts.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Transformation Solution</b></font></summary>
  <pre>
    <code style="display: block;">
        # Categorical to quantitative transformation solution.
        data["Tumour_Stage"] = data["Tumour_Stage"].map({"I": 1, "II": 2, "III": 3})
        data["Histology"] = data["Histology"].map({"Infiltrating Ductal Carcinoma": 1,
                                                   "Infiltrating Lobular Carcinoma": 2, "Mucinous Carcinoma": 3})
        data["ER status"] = data["ER status"].map({"Positive": 1, "Negative": 2})
        data["PR status"] = data["PR status"].map({"Positive": 1, "Negative": 2})
        data["HER2 status"] = data["HER2 status"].map({"Positive": 1, "Negative": 2})
        data["Gender"] = data["Gender"].map({"MALE": 0, "FEMALE": 1})
        data["Surgery_type"] = data["Surgery_type"].map({"Other": 1, "Modified Radical Mastectomy": 2, "Lumpectomy": 3, "Simple Mastectomy": 4})
        print(data.head())
    </code>
  </pre>
</details>
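One thing to watch with `Series.map` and a dictionary: any category missing from the dictionary becomes `NaN` without an error. The sketch below shows one way to spot such values. The data is a toy column, and the `"IV"` stage is a hypothetical value added only to trigger the problem.

```python
import pandas as pd

# Toy column; "IV" is a hypothetical category that the mapping does not cover.
toy = pd.DataFrame({"Tumour_Stage": ["I", "II", "III", "IV"]})
stage_map = {"I": 1, "II": 2, "III": 3}

mapped = toy["Tumour_Stage"].map(stage_map)

# Original labels that had no entry in the mapping dictionary.
unmapped = toy.loc[mapped.isna(), "Tumour_Stage"].unique()

print(mapped.tolist())
print("Unmapped categories:", list(unmapped))
```

Running a check like this after each mapping helps catch spelling differences between the dictionary keys and the values in the column.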

Now let's transition to the machine learning steps.

Split the data into training and testing sets.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Train-Test Split</b></font></summary>
  <pre>
    <code style="display: block;">
        # Train-test splitting solution
        x = np.array(data[['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3','Protein4', 'Tumour_Stage', 'Histology', 'ER status', 'PR status', 'HER2 status', 'Surgery_type']])
        y = np.array(data['Patient_Status'])  # 1-D label array, as scikit-learn expects
        xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10, random_state=42)
    </code>
  </pre>
</details>
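With a small test set and imbalanced classes, a random split can leave the test set with very few examples of the rarer class. An optional variation is to pass `stratify` to `train_test_split` so that both sets keep the class proportions. The sketch below uses synthetic features and labels (an assumed 80/20 class balance, not the real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 3 features, 80 "Alive" and 20 "Dead" labels.
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(100, 3))
y_toy = np.array(["Alive"] * 80 + ["Dead"] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both the train and test sets.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.10, random_state=42, stratify=y_toy
)

labels, label_counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, label_counts)))
```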

Now, let's create and train a machine learning model.

In [None]:
# Write your code here.

<details>
  <summary>Click for solution: <font color="sky blue"><b>Model Creation + Training</b></font></summary>
  <pre>
    <code style="display: block;">
        # Our solution
        model = SVC()
        model.fit(xtrain, ytrain)
    </code>
  </pre>
</details>
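Support vector machines are sensitive to feature scale, and here the features range from ages in the tens to protein levels near zero. As an optional variation (not the solution above), the sketch below puts a `StandardScaler` in front of the `SVC` with a scikit-learn pipeline. The data is synthetic and the labelling rule is invented for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic features: column 0 on an age-like scale, column 1 on a protein-like scale.
rng = np.random.default_rng(0)
x_toy = np.column_stack([rng.uniform(30, 90, 60), rng.normal(0, 0.5, 60)])
# Invented rule so that the labels depend on the small-scale feature.
y_toy = np.where(x_toy[:, 1] > 0, "Alive", "Dead")

# The pipeline standardizes each feature, then fits the SVC on the scaled values.
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(x_toy, y_toy)

print(scaled_model.predict(x_toy[:5]))
```

The pipeline applies the same scaling at prediction time, so new samples can be passed in on their original scale.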

Create a test case (the feature values for one patient) and see what your machine learning model predicts for it.

You are welcome to print out metrics associated with the SVC model.

In [None]:
# Write your code here.


<details>
  <summary>Click for solution: <font color="sky blue"><b>Model Prediction</b></font></summary>
  <pre>
    <code style="display: block;">
        # Prediction
        # features = [['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3','Protein4', 'Tumour_Stage', 'Histology', 'ER status', 'PR status', 'HER2 status', 'Surgery_type']]
        features = np.array([[36.0, 1, 0.080353, 0.42638, 0.54715, 0.273680, 3, 1, 1, 1, 2, 2]])
        print(model.predict(features))
    </code>
  </pre>
</details>
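For the optional metrics, scikit-learn's `accuracy_score` and `confusion_matrix` compare true and predicted labels. The sketch below uses hand-written label arrays so the numbers are easy to follow; in the notebook you would compare `ytest` with `model.predict(xtest)` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration; not output from the real model.
y_true = np.array(["Alive", "Alive", "Alive", "Dead", "Dead"])
y_pred = np.array(["Alive", "Alive", "Alive", "Alive", "Dead"])

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
# Rows are true labels and columns are predicted labels, ordered Alive, Dead.
cm = confusion_matrix(y_true, y_pred, labels=["Alive", "Dead"])

print(acc)
print(cm)
```

When one class is much more common than the other, the confusion matrix is usually more informative than accuracy alone, because a model that always predicts the majority class can still score high accuracy.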

## **Summary**
**Congratulations on completing the Breast Cancer Survival Prediction project!**

We hope you have learned how breast cancer is classified and which factors can play a role in it.