# Project Overview


This project develops a classification model to assist in the diagnosis of breast cancer using historical patient data. The model uses clinical and demographic features: patient age, menopausal status, tumor size, lymph node involvement, breast affected, metastasis status, breast quadrant, and prior history of breast conditions. The primary objective is to predict the diagnosis outcome from these variables as accurately as possible.

The dataset used for this analysis was obtained from Kaggle, originally sourced from the University of Calabar Teaching Hospital cancer registry. While the data provides valuable insights into patient characteristics and cancer diagnosis trends, it is important to acknowledge that it reflects a specific hospital population. As a result, the findings and model performance may not generalize broadly to other populations or regions without additional validation on more diverse datasets.


## Column Descriptions

| **Feature Name**     | **Description**                                                                                  |
|----------------------|--------------------------------------------------------------------------------------------------|
| **S/N**              | Unique identifier for each patient.                                                              |
| **Year**             | Year in which the diagnosis was conducted.                                                       |
| **Age**              | Age of the patient at the time of diagnosis.                                                     |
| **Menopause**        | Menopausal status at diagnosis: `0` = Postmenopausal, `1` = Premenopausal.                       |
| **Tumor Size**       | Size of the excised tumor (in centimeters).                                                      |
| **Involved Nodes**   | Whether axillary lymph nodes contain metastasis: `1` = Present, `0` = Not present.               |
| **Breast**           | Indicates spread on both sides: `1` = Cancer has spread, `0` = Has not spread.                   |
| **Metastatic**       | Indicates whether cancer has spread to other organs: `1` = Yes, `0` = No.                        |
| **Breast Quadrant**  | Tumor location based on breast quadrants (e.g., Upper Outer, Lower Inner, etc.).                 |
| **History**          | Cancer history: `1` = Patient has a personal or family history, `0` = No history.                |
| **Diagnosis**        | Diagnosis outcome; used as the target variable for classification.                               |
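
The binary codes in the table can be turned into readable labels before exploration. The sketch below is illustrative only: the rows are made up rather than taken from the registry, and it assumes the column names in the file match the feature names listed above.

```python
import pandas as pd

# Hypothetical rows that follow the coding described in the table;
# the values are invented for illustration only.
sample = pd.DataFrame({
    "Age": [40, 56, 63],
    "Menopause": [1, 0, 0],
    "Tumor Size": [2.0, 4.5, 3.0],
    "Involved Nodes": [0, 1, 1],
    "Metastatic": [0, 1, 0],
    "History": [0, 1, 0],
})

# Code-to-label mappings taken from the column descriptions.
menopause_labels = {0: "Postmenopausal", 1: "Premenopausal"}
present_absent = {0: "Not present", 1: "Present"}
yes_no = {0: "No", 1: "Yes"}

# Build a decoded copy; the original coded frame is left untouched.
decoded = sample.copy()
decoded["Menopause"] = decoded["Menopause"].map(menopause_labels)
decoded["Involved Nodes"] = decoded["Involved Nodes"].map(present_absent)
decoded["Metastatic"] = decoded["Metastatic"].map(yes_no)
decoded["History"] = decoded["History"].map(yes_no)

print(decoded)
```

Keeping the coded frame alongside the decoded copy means the numeric codes stay available for modelling while the labels are used for plots and summaries.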


## Loading the Necessary Libraries


In [None]:

# Data manipulation
import pandas as pd
import numpy as np

# Data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
sns.set_style('darkgrid')

# Statistics
import statsmodels.api as sm
from scipy import stats
from scipy.stats import mannwhitneyu, pearsonr

# Data processing
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, PowerTransformer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Metrics
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report
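
To show how several of the imported tools can fit together, here is a minimal sketch on synthetic data. It is not the notebook's actual modelling code: the toy values, the synthetic target, and the choice of which columns to transform are all assumptions, and `Pipeline` is an extra import that is not in the list above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline  # assumption: not imported in the original cell
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

# Synthetic stand-in data; values are random, not from the registry.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(25, 80, size=n),
    "Tumor Size": rng.uniform(1.0, 10.0, size=n),
    "Breast Quadrant": rng.choice(
        ["Upper outer", "Upper inner", "Lower outer", "Lower inner"], size=n
    ),
})
# Synthetic binary target loosely tied to tumor size, for illustration only.
y = (toy["Tumor Size"] + rng.normal(0, 1.5, size=n) > 5.0).astype(int)

# Power-transform the numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", PowerTransformer(), ["Age", "Tumor Size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Breast Quadrant"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified split keeps the class balance similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.25, stratify=y, random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
```

Wrapping the preprocessing and the classifier in one pipeline means the transformers are fitted on the training split only, which avoids leaking information from the test split. Any of the other imported classifiers could be swapped in for `LogisticRegression` in the `clf` step.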
