# CSI 4142 - Introduction to Data Science
# Assignment 3: Predictive analysis Regression and Classification

Shacha Parker (300235525)\
Callum Frodsham (300199446)\
Group 79

### Setup Instructions To Reproduce this Data Cleaning Notebook:
1. (Optional) Create a Python virtual environment in the project directory to hold the required packages:  
``` 
python -m venv .venv
```
To activate the virtual environment: 
```
.venv/Scripts/activate.ps1 # on windows
source .venv/bin/activate # on mac/linux
```
2. Download all of the required packages (run in cmd/shell of choice):
```
pip install jupyter
pip install ipykernel
pip install pandas
pip install numpy
pip install catboost
pip install seaborn
pip install matplotlib
```
3. VSCode: Ensure you have the correct python kernel selected!
<br> 
If you are using a virtual environment, select the Python interpreter from that environment; otherwise the notebook will not find the installed packages. If you installed everything globally, make sure the matching global Python kernel is selected.
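
A quick way to confirm that the selected kernel can see the packages is to run a small check in the first cell. This is only a sketch: the `find_missing_packages` helper and the `REQUIRED_PACKAGES` list are illustrative names, and the list is assumed from the imports used later in this notebook.

```python
import importlib.util

# Assumed list, based on the imports used later in this notebook.
REQUIRED_PACKAGES = ["numpy", "pandas", "catboost", "seaborn", "matplotlib"]


def find_missing_packages(packages):
    """Return the names in `packages` that the current interpreter cannot import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available in this kernel.")
```

If anything is reported as missing, the wrong interpreter is probably selected or the package still needs to be installed.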

<h1>Dataset 2: Breast Cancer</h1>
<h3>Decision Tree Classification</h3>

Author: Reihaneh Namdari
<br>
Purpose: This dataset provides population-based cancer statistics on patients with infiltrating duct and lobular carcinoma breast cancer who were diagnosed in 2006-2010. 
<br>
Shape: This dataset has 16 columns and 4024 rows.
<br><br>
Link: <a href="https://www.kaggle.com/datasets/reihanenamdari/breast-cancer"> Breast Cancer Dataset</a>
<br>
Note: This description covers only 12 of the 16 features, as the remaining four will not be used. 
<h3>Dataset Feature List: </h3>
<ol>
    <li>Age:
    <br>
    Feature Type: Numerical - Discrete
    <br>
    Description: The age in years of the patient.
    </li>
    <br>
    <li>Race:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The race of the patient.
        </li>
    <br>
    <li>Marital Status:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The marital status of the patient.
        </li>
    <br>
    <li>T Stage:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: The T stage is the Tumor stage of the TNM staging system. It describes the extent of the cancer, such as tumor size and tumor invasion into nearby structures, with T1-T4 increasing in severity. 
        </li>
    <br>
    <li>N Stage:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: N Stage refers to the Node stage of the TNM staging system that describes whether the cancer has spread to other nearby lymph nodes, with N1-N3 increasing in severity.
        </li>
    <br>
    <li>6th Stage:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: Refers to the Breast Adjusted AJCC 6th Stage variables describing the extent of the disease (EOD), and the collaborative stage (CS). 
        </li>
    <br>
    <li>Differentiate:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: Refers to how closely the cancer cells resemble normal breast cells.
        </li>
    <br>
    <li>Grade:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: The histologic grade of the cancer cells. There are four grades: 1, 2, 3, and anaplastic (Grade 4).
        </li>
    <br>
    <li>A Stage:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: Two categories, distant and regional. Distant means that the tumor has spread (metastasized) to organs or regions far from the original site. Regional means that the tumor has extended into areas close to the original site. 
        </li>
    <br>
    <li>Tumor Size:
    <br>
    Feature Type: Numerical - Continuous
    <br>
    Description: The size of the tumor in exact millimeters.
        </li>
    <br>
    <li>Survival Months:
    <br>
    Feature Type: Numerical - Continuous
    <br>
    Description: The length of time in months until the patient's death or their last follow-up.
        </li>
    <br>
    <li>Status:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The status of the patient at their last follow-up, with two categories: dead or alive.
        </li>
</ol>
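
Several of the features above are ordinal, so their order carries meaning. A minimal sketch of one way to preserve that order in pandas is shown below. It uses toy values, not rows from the dataset, and the category order `N1 < N2 < N3` is assumed from the N Stage description; the `N Stage Code` column is a hypothetical name.

```python
import pandas as pd

# Toy values only; these are not rows from Breast_Cancer.csv.
toy = pd.DataFrame({"N Stage": ["N2", "N1", "N3", "N1"]})

# Assumed severity order, taken from the feature description above.
n_stage_order = ["N1", "N2", "N3"]
toy["N Stage"] = pd.Categorical(toy["N Stage"], categories=n_stage_order, ordered=True)

# Integer codes follow the declared order: N1 -> 0, N2 -> 1, N3 -> 2.
toy["N Stage Code"] = toy["N Stage"].cat.codes
print(toy)
```

Declaring the order explicitly avoids relying on alphabetical sorting, which happens to work for `N1`-`N3` but not for every ordinal feature.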

In [None]:
# Initial Imports:
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import colormaps

# Then load the dataset
dataset = pd.read_csv("Breast_Cancer.csv")

# Features excluded to keep the scope of this assignment manageable.
# Note: 'Reginol Node Positive' is kept exactly as written here so that it matches the column name used by drop().
excluded_columns = ['Estrogen Status', 'Progesterone Status', 'Regional Node Examined', 'Reginol Node Positive']
dataset = dataset.drop(columns=excluded_columns)

## Data Cleaning
The Data Cleaning step will identify whether the dataset has any incorrect or missing values.

In [None]:
# Display information about the dataframe
print(dataset.info())

# Describe the dataframe
print(dataset.describe())

# Check for missing values
missing_values = dataset.isnull().sum()
print("Missing values in each column:\n", missing_values)


As the output above shows, the dataset contains no null values, and each feature has the expected data type.
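
The same check can be condensed into a single per-column report. The sketch below uses a hypothetical helper, `summarize_missing`, and a toy frame with deliberately missing entries so that the output is not trivially zero; it is not part of the original analysis.

```python
import numpy as np
import pandas as pd


def summarize_missing(df):
    """Return one row per column with its dtype, missing count, and missing percentage."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
    })
    summary["missing_pct"] = (summary["missing"] / len(df) * 100).round(2)
    return summary


# Toy frame with one missing value per column (illustrative only).
toy = pd.DataFrame({
    "Age": [43, 55, np.nan, 61],
    "Status": ["Alive", None, "Dead", "Alive"],
})
report = summarize_missing(toy)
print(report)
```

Calling `summarize_missing(dataset)` on the real data would be expected to show zero in the `missing` column for every feature, in line with the observation above.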

In [None]:
# Checking for correct value range for each categorical feature:
categorical_columns = dataset.select_dtypes(include=['object']).columns
for column in categorical_columns:
    print(f"\nUnique values in column '{column}':\n", dataset[column].unique())

As the output above shows, each categorical feature contains only the expected values, and there are no incorrect or missing values. The dataset can therefore be considered "clean".
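
Reading the unique values by eye works for a dataset of this size, but the comparison can also be automated. The following is a sketch under stated assumptions: `unexpected_values` is a hypothetical helper, the allowed label set is assumed from the Status description, and the toy column includes a deliberately malformed entry.

```python
import pandas as pd


def unexpected_values(series, allowed):
    """Return the non-null values in `series`, after stripping whitespace, that are not in `allowed`."""
    cleaned = series.dropna().astype(str).str.strip()
    return set(cleaned) - set(allowed)


# Toy column with one mis-cased, padded label (illustrative only).
toy = pd.DataFrame({"Status": ["Alive", "Dead", "alive ", "Alive"]})
allowed_status = {"Alive", "Dead"}  # assumed label set

bad = unexpected_values(toy["Status"], allowed_status)
print(bad)
```

An empty set for every categorical column would support the conclusion that no incorrect values are present.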

## EDA and Outlier detection


### Categorical Feature Exploration:

In [None]:
# Create a count plot for each categorical column.
color_list = list(colormaps)  # names of the registered matplotlib colormaps
count = 0
for column in categorical_columns:
    plt.figure(figsize=(10, 6))
    graph = sns.countplot(data=dataset, x=column, hue=column, palette=color_list[count])
    print(color_list[count])  # show which colormap was used for this plot
    count += 3  # step through the list so each plot gets a different palette
    # Label each bar with its count.
    for bar in graph.containers:
        graph.bar_label(bar)
    plt.title(column.title())
    plt.show()



#### Handling Outliers for Categorical Features:
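
For categorical features, one possible notion of an "outlier" is a category that appears very rarely. The sketch below flags such categories on toy data. The `rare_categories` helper and the 10% threshold are assumptions chosen for illustration, not values taken from this analysis.

```python
import pandas as pd


def rare_categories(series, min_share=0.05):
    """Return the categories whose share of the rows is below `min_share`."""
    shares = series.value_counts(normalize=True)
    return shares[shares < min_share]


# Toy column: one dominant category and two rare ones (illustrative only).
toy = pd.Series(["White"] * 18 + ["Black"] * 1 + ["Other"] * 1, name="Race")
rare = rare_categories(toy, min_share=0.10)
print(rare)
```

Whether a rare category should be merged, dropped, or kept depends on the feature; a clinically meaningful but uncommon stage, for instance, is not an error.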

### EDA of Numerical Features:

In [None]:
# Select the numerical (int64) columns.
numerical_data = dataset.select_dtypes(include=['int64'])

colors = ["blue", "orange", "pink"]
# Create a boxplot for each numerical column.
count = 0
for column in numerical_data.columns:
    graph = sns.boxplot(data=dataset, x=column, color=colors[count])
    count += 1
    plt.title(column.title())
    plt.show()

#### Numerical Outlier Breakdown:
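
The points drawn beyond the whiskers in the boxplots above can be listed numerically with the 1.5 × IQR rule, which is the default whisker rule for these plots. The sketch below applies it to toy values; `iqr_outliers` is a hypothetical helper and the numbers are not taken from the dataset.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Toy tumor sizes in millimeters (illustrative only).
toy = pd.Series([10, 12, 15, 18, 20, 22, 25, 120], name="Tumor Size")
print(iqr_outliers(toy))
```

Values flagged this way are statistical outliers only; a large tumor size or a short survival time may still be a valid observation rather than a data error.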
 