# Project Overview


This project focuses on developing a classification model to assist in the diagnosis of breast cancer using historical patient data. The model leverages key features such as patient age, menopause status, tumor size, presence of invasive nodes, breast affected, metastasis status, breast quadrant, and prior history of breast conditions. The primary objective is to accurately predict the likelihood of cancer occurrence based on these clinical and demographic variables.

The dataset used for this analysis was obtained from Kaggle, originally sourced from the University of Calabar Teaching Hospital cancer registry. While the data provides valuable insights into patient characteristics and cancer diagnosis trends, it is important to acknowledge that it reflects a specific hospital population. As a result, the findings and model performance may not generalize broadly to other populations or regions without additional validation on more diverse datasets.


## Columns Description 

| **Feature Name**     | **Description**                                                                                  |
|----------------------|--------------------------------------------------------------------------------------------------|
| **S/N**              | Unique identifier for each patient.                                                              |
| **Year**             | Year in which the diagnosis was conducted.                                                       |
| **Age**              | Age of the patient at the time of diagnosis.                                                     |
| **Menopause**        | Menopausal status at diagnosis: `0` = Postmenopausal, `1` = Premenopausal.                       |
| **Tumor Size**       | Size of the excised tumor (in centimeters).                                                      |
| **Involved Nodes**   | Whether axillary lymph nodes contain metastasis: `1` = Present, `0` = Not present.               |
| **Breast**           | Breast affected: `Left` or `Right`.                                                              |
| **Metastatic**       | Indicates whether cancer has spread to other organs: `1` = Yes, `0` = No.                        |
| **Breast Quadrant**  | Tumor location based on breast quadrants (e.g., Upper Outer, Lower Inner, etc.).                 |
| **History**          | Cancer history: `1` = Patient has a personal or family history, `0` = No history.                |
| **Diagnosis**        | Diagnosis outcome; used as the target variable for classification.                               |


## Loading the Necessary Libraries


In [75]:

# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualisation 
import matplotlib.pyplot as plt 
plt.style.use('ggplot')
import seaborn as sns
sns.set_style('darkgrid')

# Statistical Analysis
import statsmodels.api as sm 
from scipy import stats 
from scipy.stats import mannwhitneyu
from scipy.stats import pearsonr


# Data Preprocessing 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import OneHotEncoder

# Classification Model
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Model Evaluation
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report


## Loading the Data

In [76]:
csv_path = r"C:\Users\sandr\OneDrive\Documents\Github_Testing\Data-Science-Portfolio\Code\breastcancer.csv"
data = pd.read_csv(csv_path)
print(data.head())

   S/N  Year   Age  Menopause Tumor Size (cm) Inv-Nodes Breast Metastasis  \
0    1  2019  40.0          1               2         0  Right          0   
1    2  2019  39.0          1               2         0   Left          0   
2    3  2019  45.0          0               4         0   Left          0   
3    4  2019  26.0          1               3         0   Left          0   
4    5  2019  21.0          1               1         0  Right          0   

  Breast Quadrant History Diagnosis Result  
0     Upper inner       0           Benign  
1     Upper outer       0           Benign  
2     Lower outer       0           Benign  
3     Lower inner       1           Benign  
4     Upper outer       1           Benign  


## Data Exploration

In [77]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   S/N               215 non-null    int64  
 1   Year              215 non-null    object 
 2   Age               214 non-null    float64
 3   Menopause         215 non-null    int64  
 4   Tumor Size (cm)   214 non-null    object 
 5   Inv-Nodes         214 non-null    object 
 6   Breast            213 non-null    object 
 7   Metastasis        215 non-null    object 
 8   Breast Quadrant   215 non-null    object 
 9   History           215 non-null    object 
 10  Diagnosis Result  214 non-null    object 
dtypes: float64(1), int64(2), object(8)
memory usage: 18.6+ KB


In [78]:
print(data.isnull().sum())  # number of missing values in each column

# Print the rows that contain a '#' placeholder in any column
mask = data.apply(lambda x: x.astype(str).str.contains('#')).any(axis=1)
rows_with_hash = data[mask]

print(rows_with_hash)


S/N                 0
Year                0
Age                 1
Menopause           0
Tumor Size (cm)     1
Inv-Nodes           1
Breast              2
Metastasis          0
Breast Quadrant     0
History             0
Diagnosis Result    1
dtype: int64
     S/N  Year   Age  Menopause Tumor Size (cm) Inv-Nodes Breast Metastasis  \
30    31  2019  56.0          0               9         1   Left          1   
40    41     #  34.0          1               #         #      #          #   
47    48  2019  25.0          1               5         0      #          0   
67    68  2019  40.0          1               1         0   Left          0   
143  144  2020  29.0          1               2         0      #          0   
164  165  2020  38.0          1               2         0      #          0   
166  167  2020  62.0          0               3         1      #          1   
178  179  2020  49.0          1               4         0      #          0   

    Breast Quadrant History Diagn

The summary shows that several columns contain missing values. Some rows are also populated with `#`, which is clearly a data entry error rather than a valid value.
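One way to surface such placeholders as missing values from the start is to tell `pd.read_csv` to treat `#` as NaN while reading. This is a minimal sketch, assuming `#` is the only placeholder in the file; the three-row sample is made up for illustration and is not the registry data.

```python
import io
import pandas as pd

# Tiny made-up sample mimicking the file layout (not the real data).
sample_csv = io.StringIO(
    "S/N,Year,Age,Breast\n"
    "1,2019,40,Right\n"
    "2,#,34,#\n"
    "3,2020,25,Left\n"
)

# na_values="#" makes pandas parse every '#' cell as NaN while reading.
sample = pd.read_csv(sample_csv, na_values="#")

# The placeholder cells are now counted as missing.
print(sample.isnull().sum())
```

With this option the `isnull()` counts already include the `#` cells, so no separate mask is needed to find them. Other problems, such as trailing spaces in category labels, would still need their own cleaning step.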

### Data Cleaning / Data Imputation

Data imputation is the process of replacing missing values in a dataset with substituted values. Instead of deleting rows with missing or incorrect entries, I imputed them, which retains most of the information in the dataset and avoids reducing its size. For this dataset I use single imputation methods: missing values in categorical and binary columns are replaced with the mode; for numerical columns, the median is used if the data is skewed and the mean if it is approximately normally distributed.
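The median-versus-mean rule can be sketched as a small helper. The function name `impute_numeric` and the skewness cut-off of 0.5 are assumptions made for illustration; they are not values taken from this analysis.

```python
import pandas as pd

def impute_numeric(series: pd.Series, skew_threshold: float = 0.5) -> pd.Series:
    """Fill NaNs with the median if the column is skewed, else with the mean."""
    # The 0.5 cut-off is an illustrative assumption, not a tuned value.
    skewness = series.skew()  # NaNs are skipped by default
    if abs(skewness) > skew_threshold:
        fill_value = series.median()
    else:
        fill_value = series.mean()
    return series.fillna(fill_value)

# Made-up values: one large outlier makes the first column right-skewed.
skewed = pd.Series([1.0, 2.0, 2.0, 3.0, 30.0, None])
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None])

print(impute_numeric(skewed).iloc[-1])     # filled with the median of the observed values
print(impute_numeric(symmetric).iloc[-1])  # filled with the mean of the observed values
```

The median is less sensitive to outliers, which is why it is the safer choice for a skewed column such as the first one here.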

In [79]:
data.columns = data.columns.str.replace(" ", "")  # remove spaces from the column names
data.columns

Index(['S/N', 'Year', 'Age', 'Menopause', 'TumorSize(cm)', 'Inv-Nodes',
       'Breast', 'Metastasis', 'BreastQuadrant', 'History', 'DiagnosisResult'],
      dtype='object')

In [80]:
columns = ['Year','Age','Menopause','TumorSize(cm)','Inv-Nodes','Metastasis','History']
for col in columns:
    data[col]= pd.to_numeric(data[col], errors= 'coerce')
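Because `errors='coerce'` silently turns anything unparseable into NaN, it can be worth checking which entries were lost in the conversion. A minimal sketch on made-up values (the names `raw`, `converted`, and `lost` are only for illustration):

```python
import pandas as pd

raw = pd.Series(["2019", "#", "2020"])  # made-up values
converted = pd.to_numeric(raw, errors="coerce")
print(converted.tolist())  # [2019.0, nan, 2020.0]

# Entries that were present before conversion but are NaN afterwards.
lost = raw[raw.notna() & converted.isna()]
print(lost.tolist())  # ['#']
```

This also explains why the missing-value counts go up after the conversion: every `#` in a numeric column becomes NaN.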

In [81]:
unique_column= ['Menopause','Inv-Nodes','Breast','Metastasis','BreastQuadrant','DiagnosisResult']
for col in unique_column:
    print(f"{col}: {data[col].unique()}")

Menopause: [1 0]
Inv-Nodes: [ 0.  1. nan  3.]
Breast: ['Right' 'Left' '#' nan]
Metastasis: [ 0.  1. nan]
BreastQuadrant: ['Upper inner' 'Upper outer' 'Lower outer' 'Lower inner' '#'
 'Upper outer ']
DiagnosisResult: ['Benign' 'Malignant' nan]


In [82]:
# to ensure that Inv-Nodes and Metastasis keep only 0 or 1
data['Inv-Nodes'] = data['Inv-Nodes'].apply(lambda x: x if x == 0 or x ==1 else np.nan)
data['Metastasis'] = data['Metastasis'].apply(lambda x : x if x == 0 or x == 1 else np.nan)

# to ensure Breast column has only Left or Right
data['Breast'] = data['Breast'].apply(lambda x : x if x == 'Right' or x == 'Left' else np.nan)

# Replace the '#' placeholder with NaN
data['BreastQuadrant'] = data['BreastQuadrant'].apply(lambda x : np.nan if x == '#' else x)

# Remove leading/trailing whitespace (e.g. 'Upper outer ')
data['BreastQuadrant'] = data['BreastQuadrant'].str.strip()
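The same cleaning rules can also be written without `apply`, using `Series.where` together with `isin`. This is an alternative sketch on a made-up frame with the same kinds of invalid entries, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same kinds of invalid entries seen above.
sample = pd.DataFrame({
    "Inv-Nodes": [0.0, 1.0, 3.0, np.nan],
    "Breast": ["Right", "Left", "#", np.nan],
    "BreastQuadrant": ["Upper outer ", "#", "Lower inner", "Upper inner"],
})

# Series.where keeps values where the condition holds and sets the rest to NaN.
sample["Inv-Nodes"] = sample["Inv-Nodes"].where(sample["Inv-Nodes"].isin([0, 1]))
sample["Breast"] = sample["Breast"].where(sample["Breast"].isin(["Left", "Right"]))

# Strip whitespace first, then blank out the '#' placeholder.
quadrant = sample["BreastQuadrant"].str.strip()
sample["BreastQuadrant"] = quadrant.where(quadrant != "#")

print(sample)
```

Listing the allowed values explicitly means any unexpected entry, not just `#`, ends up as NaN and is handled by the imputation step.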

In [83]:
data_null_values = data.isnull().sum().to_frame().rename(columns = {0:'count_Value'})
print(data_null_values)

                 count_Value
S/N                        0
Year                       1
Age                        1
Menopause                  0
TumorSize(cm)              2
Inv-Nodes                  3
Breast                     8
Metastasis                 1
BreastQuadrant             2
History                    2
DiagnosisResult            1


In [84]:
# To find the mode of the categorical and binary columns
columns_mode = ['Inv-Nodes','Breast','BreastQuadrant','History','DiagnosisResult','Metastasis']

for col in columns_mode:
    mode_value = data[col].mode()
    if not mode_value.empty:
        print(f"Mode of '{col}': {mode_value.iloc[0]}")
    else:
        print(f"Mode of '{col}': No mode found (column may be empty)")

Mode of 'Inv-Nodes': 0.0
Mode of 'Breast': Left
Mode of 'BreastQuadrant': Upper outer
Mode of 'History': 0.0
Mode of 'DiagnosisResult': Benign
Mode of 'Metastasis': 0.0


In [85]:
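# Function for filling missing values with the mode

def missing_value_mode(data, columns):
    """Fill missing values in the given columns with each column's mode (modifies `data`)."""
    for col in columns:
        if col in data.columns:
            data[col] = data[col].replace('#', pd.NA)
            # Cast to pandas' string dtype; numeric flags become '0.0' / '1.0'
            data[col] = data[col].astype("string")
            if data[col].isnull().any():
                mode_value = data[col].mode()
                if not mode_value.empty:
                    # Assign back: chained fillna(..., inplace=True) may not update `data` in recent pandas
                    data[col] = data[col].fillna(mode_value.iloc[0])
    return data
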
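Since the function above casts every column to strings, numeric flags such as `0.0` end up stored as text. A variant that keeps the original dtypes and leaves its input untouched could look like the sketch below; `fill_with_mode` and the sample frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def fill_with_mode(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of `frame` with NaNs in `columns` replaced by each column's mode."""
    filled = frame.copy()
    for col in columns:
        modes = filled[col].mode()  # mode() ignores NaN by default
        if not modes.empty:
            filled[col] = filled[col].fillna(modes.iloc[0])
    return filled

# Made-up sample: one categorical and one binary column, each with a gap.
sample = pd.DataFrame({
    "Breast": ["Left", "Left", "Right", np.nan],
    "History": [0.0, 0.0, 1.0, np.nan],
})
result = fill_with_mode(sample, ["Breast", "History"])
print(result)
print(result.dtypes)
```

Keeping `History` numeric avoids a later conversion step before modelling, at the cost of not normalising mixed-type columns the way the string cast does.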


In [86]:
df = data.copy()
#print(df)
df = missing_value_mode(df,columns_mode)
print(df)

     S/N    Year   Age  Menopause  TumorSize(cm) Inv-Nodes Breast Metastasis  \
0      1  2019.0  40.0          1            2.0       0.0  Right        0.0   
1      2  2019.0  39.0          1            2.0       0.0   Left        0.0   
2      3  2019.0  45.0          0            4.0       0.0   Left        0.0   
3      4  2019.0  26.0          1            3.0       0.0   Left        0.0   
4      5  2019.0  21.0          1            1.0       0.0  Right        0.0   
..   ...     ...   ...        ...            ...       ...    ...        ...   
210  211  2020.0  22.0          1            1.0       0.0   Left        0.0   
211  212  2020.0  19.0          1            1.0       0.0   Left        0.0   
212  213  2020.0  50.0          0            4.0       0.0  Right        0.0   
213  214  2020.0   NaN          0            5.0       0.0   Left        1.0   
214  215  2020.0  13.0          1            NaN       0.0   Left        0.0   

    BreastQuadrant History DiagnosisRes

In [87]:
columns_median = ['Year','Age','TumorSize(cm)']
for col in columns_median:
    if df[col].isnull().any():
        median_value = df[col].median()
        # Assign back: chained fillna(..., inplace=True) may not update df in recent pandas
        df[col] = df[col].fillna(median_value)

In [88]:
# Confirm that no missing values remain

print(df.isnull().sum())  


S/N                0
Year               0
Age                0
Menopause          0
TumorSize(cm)      0
Inv-Nodes          0
Breast             0
Metastasis         0
BreastQuadrant     0
History            0
DiagnosisResult    0
dtype: int64


In [90]:
# Drop the "S/N" column because it is not needed for the analysis or model building
df = df.drop('S/N', axis=1)
df


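|     | Year   | Age  | Menopause | TumorSize(cm) | Inv-Nodes | Breast | Metastasis | BreastQuadrant | History | DiagnosisResult |
|-----|--------|------|-----------|---------------|-----------|--------|------------|----------------|---------|-----------------|
| 0   | 2019.0 | 40.0 | 1         | 2.0           | 0.0       | Right  | 0.0        | Upper inner    | 0.0     | Benign          |
| 1   | 2019.0 | 39.0 | 1         | 2.0           | 0.0       | Left   | 0.0        | Upper outer    | 0.0     | Benign          |
| 2   | 2019.0 | 45.0 | 0         | 4.0           | 0.0       | Left   | 0.0        | Lower outer    | 0.0     | Benign          |
| 3   | 2019.0 | 26.0 | 1         | 3.0           | 0.0       | Left   | 0.0        | Lower inner    | 1.0     | Benign          |
| 4   | 2019.0 | 21.0 | 1         | 1.0           | 0.0       | Right  | 0.0        | Upper outer    | 1.0     | Benign          |
| ... | ...    | ...  | ...       | ...           | ...       | ...    | ...        | ...            | ...     | ...             |
| 210 | 2020.0 | 22.0 | 1         | 1.0           | 0.0       | Left   | 0.0        | Upper outer    | 1.0     | Benign          |
| 211 | 2020.0 | 19.0 | 1         | 1.0           | 0.0       | Left   | 0.0        | Lower inner    | 1.0     | Benign          |
| 212 | 2020.0 | 50.0 | 0         | 4.0           | 0.0       | Right  | 0.0        | Lower outer    | 1.0     | Benign          |
| 213 | 2020.0 | 40.0 | 0         | 5.0           | 0.0       | Left   | 1.0        | Upper outer    | 0.0     | Benign          |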
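Note that `DataFrame.drop` returns a new frame rather than changing the original, so the result has to be assigned to take effect. A short sketch on a made-up two-row frame (`sample` and `trimmed` are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame({"S/N": [1, 2], "Age": [40.0, 39.0]})  # made-up rows

sample.drop("S/N", axis=1)            # result discarded; `sample` is unchanged
print(list(sample.columns))           # ['S/N', 'Age']

trimmed = sample.drop(columns="S/N")  # keep the returned frame
print(list(trimmed.columns))          # ['Age']
```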


## Data Visualisation

In [None]:
columns_to_plot = ['Age', 'TumorSize(cm)']

     S/N  Year   Age  Menopause  Tumor Size (cm) Inv-Nodes Breast Metastasis  \
0      1  2019  40.0          1              2.0         0  Right          0   
1      2  2019  39.0          1              2.0         0   Left          0   
2      3  2019  45.0          0              4.0         0   Left          0   
3      4  2019  26.0          1              3.0         0   Left          0   
4      5  2019  21.0          1              1.0         0  Right          0   
..   ...   ...   ...        ...              ...       ...    ...        ...   
210  211  2020  22.0          1              1.0         0   Left          0   
211  212  2020  19.0          1              1.0         0   Left          0   
212  213  2020  50.0          0              4.0         0  Right          0   
213  214  2020  40.0          0              5.0         0   Left          1   
214  215  2020  13.0          1              4.0         0   Left          0   

    Breast Quadrant History Diagnosis R
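A first look at the two numeric columns named in `columns_to_plot` could be a pair of histograms. The sketch below is one possible way to draw them, assuming the cleaned column names `Age` and `TumorSize(cm)`; it uses a few illustrative values in place of the cleaned frame `df`, and the bin count is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few illustrative values standing in for the cleaned frame `df`.
sample = pd.DataFrame({
    "Age": [40, 39, 45, 26, 21, 56, 34, 25],
    "TumorSize(cm)": [2, 2, 4, 3, 1, 9, 2, 5],
})
columns_to_plot = ["Age", "TumorSize(cm)"]

# One histogram per column, side by side.
fig, axes = plt.subplots(1, len(columns_to_plot), figsize=(10, 4))
for ax, col in zip(axes, columns_to_plot):
    ax.hist(sample[col], bins=5)
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")
fig.tight_layout()
```

In a notebook the `matplotlib.use("Agg")` line can be dropped and `plt.show()` added at the end. The shape of each histogram also indicates whether the median or the mean is the more suitable imputation value for that column.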