# CSI 4142 - Introduction to Data Science
# Assignment 3: Predictive analysis Regression and Classification

Shacha Parker (300235525)\
Callum Frodsham (300199446)\
Group 79

### Setup Instructions To Reproduce this Data Cleaning Notebook:
(Step 1 Optional)
1. Create a virtual python environment in the project directory (if you want) for all of the packages required:  
``` 
python -m venv .venv
```
To enter the virtual environment: 
```
.venv/Scripts/activate.ps1 # on windows
source .venv/bin/activate # on mac/linux
```
2. Download all of the required packages (run in cmd/shell of choice):
```
pip install jupyter
pip install ipykernel
pip install pandas
pip install numpy
pip install scikit-learn
pip install catboost
pip install seaborn
pip install matplotlib
```
3. VSCode: Ensure you have the correct python kernel selected!
<br> 
If you are using a virtual environment, select the Python interpreter for that environment; otherwise the notebook will not run. If everything is installed globally, just make sure the correct Python kernel is selected.

<h1>Dataset 2: Breast Cancer</h1>
<h3>Decision Tree Classification</h3>

Author: Reihaneh Namdari
<br>
Purpose: The purpose of this dataset is to provide population-based cancer statistics on patients with infiltrating duct and lobular carcinoma breast cancer that were diagnosed in 2006-2010. 
<br>
Shape: This dataset is composed of 16 columns, and 4024 rows.
<br><br>
Link: <a href="https://www.kaggle.com/datasets/reihanenamdari/breast-cancer"> Breast Cancer Dataset</a>
<br>
Note: This description only includes 10/16 features, as the rest will not be used.\
The differentiate feature was removed because it is the same as the grade feature.\
The 6th Stage feature was removed because it can be derived by the n stage, t stage and grade features.

<h3>Dataset Feature List: </h3>
<ol>
    <li>Age:
    <br>
    Feature Type: Numerical - Discrete
    <br>
    Description: The age in years of the patient.
    </li>
    <br>
    <li>Race:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The race of the patient.
        </li>
    <br>
    <li>Marital Status:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The marital status of the patient.
        </li>
    <br>
    <li>T Stage:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: The T stage refers to the Tumor stage of the TNM staging system, which describes the extent of the cancer, such as tumor size and tumor invasion into nearby structures. Stages run from T1 to T3 in increasing severity. 
        </li>
    <br>
    <li>N Stage:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: N Stage refers to the Node stage of the TNM staging system that describes whether the cancer has spread to other nearby lymph nodes, with N1-N3 increasing in severity.
        </li>
    <br>
    <li>6th Stage:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: Refers to the Breast Adjusted AJCC 6th Stage variables describing the extent of the disease (EOD), and the collaborative stage (CS). 
        </li>
    <br>
    <li>Grade:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: The 4 grades, 1, 2, 3 and anaplastic Grade 4 describe the histologic grade of the cancer cells.
        </li>
    <br>
    <li>A Stage:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: Two categories, distant and regional. Distant means that the tumor has spread/metastasized to organs/regions far away from the original site. Regional implies that the tumor has extended into areas close to the original site. 
        </li>
    <br>
    <li>Tumor Size:
    <br>
    Feature Type: Numerical - Continuous
    <br>
    Description: The exact size of the tumor in millimeters.
        </li>
    <br>
    <li>Survival Months:
    <br>
    Feature Type: Numerical - Continuous
    <br>
    Description: The length of time in months until the patient's death or their last follow-up.
        </li>
    <br>
    <li>Status:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The status of the patient at their last follow up, with two categories, dead or alive.
        </li>
</ol>

In [None]:
# Initial Imports:
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.neighbors import LocalOutlierFactor
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import colormaps

# Then load the dataset
dataset = pd.read_csv("Breast_Cancer.csv")
# a list of excluded features to prevent the scope of this assignment from creeping. 
excluded_columns = ['Estrogen Status', 'Progesterone Status', 'Regional Node Examined', 'Reginol Node Positive', 'differentiate', '6th Stage']
dataset = dataset.drop(columns=excluded_columns)

## Data Cleaning
The Data Cleaning step will identify whether the dataset has any incorrect or missing values.

In [None]:
# Display information about the dataframe
print(dataset.info())

# Describe the dataframe
print(dataset.describe())

# Check for missing values
missing_values = dataset.isnull().sum()
print("Missing values in each column:\n", missing_values)


As can be seen from the data above, the data contains no null values, and each feature has the correct value types.

In [None]:
# Checking for correct value range for each categorical feature:
categorical_columns = dataset.select_dtypes(include=['object']).columns
for column in categorical_columns:
    print(f"\nUnique values in column '{column}':\n", dataset[column].unique())

As can be seen from the data above, each categorical feature contains only its expected values, and there are no incorrect or missing values. Therefore the dataset can be considered "clean".
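
One thing a printed list of unique values can hide is stray whitespace, since a padded label and its trimmed form look almost identical on screen. The following is a minimal sketch of an explicit check. The helper name `find_padded_values` and the toy frame are illustrative assumptions and are not part of this notebook.

```python
import pandas as pd


def find_padded_values(df: pd.DataFrame) -> dict:
    """Return, per column, the unique string values with leading/trailing whitespace."""
    padded = {}
    for column in df.columns:
        values = df[column].dropna().unique()
        hits = [v for v in values if isinstance(v, str) and v != v.strip()]
        if hits:
            padded[column] = hits
    return padded


# Toy data (hypothetical values, not taken from Breast_Cancer.csv)
toy = pd.DataFrame({
    "Grade": ["1", "2", " 3", "2"],
    "Status": ["Alive", "Dead", "Alive", "Alive"],
})
padded_report = find_padded_values(toy)
print(padded_report)
```

Running the same helper on `dataset` would list any padded labels, which matters later if category labels are used as dictionary keys.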

## EDA and Outlier detection


### Categorical Feature Exploration:

In [None]:
# Creating visualizations from the categorical columns:
color_list = list(colormaps)
count = 0
for column in categorical_columns:
    plt.figure(figsize=(10, 6))
    graph = sns.countplot(data=dataset, x=column, hue=column, palette=color_list[count])
    count += 3
    for bar in graph.containers:
        graph.bar_label(bar)
    plt.title(f"{column.title()} Count Plot")
    plt.show()


#### Handling Outliers for Categorical Features:

Most of the categorical features do not contain any outliers; however, the marital status feature does contain an outlying category, specifically the "separated" category. The separated category accounts for only 45/4024, or approximately 1.1%, of the values, and is similar to another category, divorced. The distinction between separated and divorced should not matter much here, and thus both categories will be combined into one. \
The "Grade" and "A Stage" features each have a category that is much rarer than the others: anaplastic and distant, respectively. These rare categories will not be considered outliers, as they are important and could provide meaningful insight later on in the empirical study.
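
The category merge described above can be sketched as follows. This is only an illustration on toy rows: the exact label strings in the real CSV may differ in case or spacing, so the mapping keys here are assumptions.

```python
import pandas as pd

# Toy data (hypothetical rows); labels in the real CSV may differ in case or spacing
toy = pd.DataFrame({
    "Marital Status": ["Married", "Separated", "Divorced", "Single", "Separated"],
})

# Fold the rare category into its closest neighbour
merged = toy.copy()
merged["Marital Status"] = merged["Marital Status"].replace({"Separated": "Divorced"})

print(merged["Marital Status"].value_counts())
```

Working on a copy keeps the original frame available for comparison, which is convenient if the merged and unmerged versions are evaluated side by side.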

### EDA of Numerical Features and Outlier Detection:

In [None]:
# get all of the numerical columns
numerical_data = dataset.select_dtypes(include=['int64'])

# create a box plot of the age to view any possible outliers
sns.boxplot(data=dataset, x='Age')
plt.title("Age Box Plot")
plt.show()

As can be seen from the Age Box plot above, there are no outliers in age.

In [None]:
sns.boxplot(data=dataset, x='Tumor Size', color='blue')
plt.title("Tumor Size Box Plot")
plt.show()

sns.boxplot(data=dataset, x='Survival Months', color='red')
plt.title("Survival Months Box Plot")
plt.show()

In [None]:
# Calculating the IQR for both survival months, and the Tumor Size Features:

# calculate IQR for the tumor size
Q1TS = np.percentile(dataset["Tumor Size"], 25)
Q3TS = np.percentile(dataset["Tumor Size"], 75)
IQRTS = Q3TS - Q1TS 

# get upper bound
upper_bound = Q3TS + 1.5 * IQRTS
outliers_ts = dataset[dataset["Tumor Size"] > upper_bound]
percentile_of_upper_bound_ts = (dataset["Tumor Size"] > upper_bound).mean() * 100  # share of rows above the bound, in percent
print("Tumor Size IQR info:")
print(f"Percentile of upper bound: {percentile_of_upper_bound_ts}%")
print(f"Number of outliers: {outliers_ts.shape[0]}")

# calculate IQR for the survival months
Q1SM = np.percentile(dataset["Survival Months"], 25) 
Q3SM = np.percentile(dataset["Survival Months"], 75) 
IQRSM = Q3SM - Q1SM

lower_bound_sm = Q1SM - 1.5 * IQRSM

# percentile rank of the lower bound of sm
outliers_sm = dataset[(dataset["Survival Months"] < lower_bound_sm)]
percentile_of_lower_bound_sm = (dataset["Survival Months"] < lower_bound_sm).mean() * 100
print("Survival Months IQR info:")
print(f"Percentile of lower bound: {percentile_of_lower_bound_sm}%")
num_rows_sm = outliers_sm.shape[0]
print(f"Number of outliers: {num_rows_sm}")


As can be seen, both features contain outliers. The Survival Months feature has 18 values below the lower bound (about 0.44% of the rows), and the Tumor Size feature has 222 values above the upper bound (about 5.5% of the rows). However, this was only to show that there are outliers that need to be handled. We will find the outliers we want to remove with LOF (Local Outlier Factor), applied jointly to the Tumor Size and Survival Months features.
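
The IQR rule used above flags values outside the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. A minimal sketch of that rule as a reusable helper is shown here. The name `iqr_bounds` and the toy values are assumptions made for illustration only.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data (hypothetical values)
sizes = pd.Series([10, 12, 14, 16, 18, 100])
lower, upper = iqr_bounds(sizes)
outlier_mask = (sizes < lower) | (sizes > upper)

print(f"fences: [{lower}, {upper}]")
print(f"outliers: {outlier_mask.sum()} ({outlier_mask.mean() * 100:.2f}% of rows)")
```

Multiplying the mean of the boolean mask by 100 gives the share of flagged rows as a percentage, which is the figure quoted in the text.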

In [None]:
# first get the features we want to apply LOF outlier removal to:
tumor_and_months_df = dataset[['Tumor Size', 'Survival Months']].copy()

# get the LOF of these two features
lof_labels = LocalOutlierFactor(n_neighbors=50, contamination= 0.035).fit_predict(tumor_and_months_df)

# modify the new df we created
tumor_and_months_df["LOF_Score"] = lof_labels
tumor_and_months_df["LOF_outlier_val"] = tumor_and_months_df["LOF_Score"] == -1


sns.scatterplot(data=tumor_and_months_df, x='Tumor Size', y ='Survival Months', hue="LOF_outlier_val")
plt.legend(title="Outlier")
plt.title("Survival Months Vs. Tumor Size")
plt.show()

# removal of outliers using LOF (kept commented out; removal happens later in the empirical study)
#tumor_and_months_df_clean = tumor_and_months_df[tumor_and_months_df["LOF_outlier_val"] == False]

This graph highlights the outliers found by LOF in orange. LOF was run with n_neighbors = 50 and the contamination set to 3.5%. We will deal with these outliers by removing them, and this will be shown later in the empirical study portion of this notebook. 
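
As a rough illustration of what removing LOF outliers looks like, the sketch below filters a toy frame with the labels returned by `fit_predict`. The data, the smaller `n_neighbors`, and the `contamination` value are assumptions chosen for the toy example and are not the settings used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Toy data (hypothetical): one dense cluster plus two far-away points
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Tumor Size": np.append(rng.normal(30, 5, 40), [200.0, 5.0]),
    "Survival Months": np.append(rng.normal(70, 5, 40), [5.0, 300.0]),
})

# fit_predict returns -1 for outliers and 1 for inliers
labels = LocalOutlierFactor(n_neighbors=10, contamination=0.05).fit_predict(toy)

# Keep only the inliers; the index still points back at the original rows
toy_clean = toy[labels == 1]
print(f"kept {len(toy_clean)} of {len(toy)} rows")
```

Because the filtered frame keeps the original index, the same boolean mask could also be applied to a wider frame with the remaining features so that whole rows are dropped together.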

#### Handling Outliers:

1. For the comparison of Tumor Size and Survival Months, the outliers found using LOF (as above) will be removed.  
2. The Marital Status feature's outlier category, separated, will be combined into the Divorced category.

We will expand on these methods of handling the outliers further into the empirical study when the different test/validation sets are created.

## Predictive Analysis: Decision Trees
We will be using CatBoost's decision tree classification methods instead of scikit-learn's, because CatBoost can natively process categorical features without any encoding.


## Feature Engineering


First Feature: Tumor Size to Grade ratio\
Second Feature: Age to Survival Months ratio

In [None]:
# First feature: Tumor Size to Grade ratio
grade_map = {'3':3, '2':2, '1':1,' anaplastic; Grade IV':4}

# modify the dataset to add the fixed grade map
dataset["GradeNumber"] = dataset["Grade"].map(grade_map)

# then add the new feature (tumor size divided by numeric grade, matching the feature name):
dataset['SizeToGradeRatio'] = dataset['Tumor Size'] / dataset['GradeNumber']


In [None]:
# Second Feature: Age to Survival Months Ratio
dataset['AgeToSurvivalMonthsRatio'] = dataset['Age'] / dataset['Survival Months']