Problem Statement

Breast cancer is a global health concern that affects both women and men. Early detection is crucial for effective treatment and improved survival rates. The abnormal growth of cells in breast tissue can lead to benign, pre-malignant, or malignant tumors. Common diagnostic methods include MRI, mammography, ultrasound, and biopsy. The challenge is to develop accurate, interpretable models for breast cancer prediction that contribute to early detection and improved treatment outcomes.

Expected Outcome

The objective is to develop a model for classifying breast cancer based on the results of a fine-needle aspiration (FNA) test. This quick and simple procedure involves extracting fluid or cells from a breast lesion or cyst using a fine needle, similar to a blood sample needle.

The model aims to classify tumors into two categories:

1: Malignant (Cancerous) - Present
0: Benign (Not Cancerous) - Absent
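
The label scheme above can be applied to the raw diagnosis codes with a simple mapping. The following is a minimal sketch, assuming the labels arrive as the strings 'M' and 'B' (as the dataset overview further down shows). The names `LABEL_MAP` and `encode_diagnosis` are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Hypothetical mapping: 'M' (malignant) -> 1, 'B' (benign) -> 0
LABEL_MAP = {"M": 1, "B": 0}


def encode_diagnosis(labels: pd.Series) -> pd.Series:
    """Map diagnosis codes to 1 (malignant) or 0 (benign)."""
    encoded = labels.map(LABEL_MAP)
    # Any label outside the two expected codes becomes NaN; fail loudly
    if encoded.isna().any():
        raise ValueError("unexpected diagnosis label found")
    return encoded.astype(int)


# Small stand-in series, not the real dataset
sample = pd.Series(["M", "B", "B", "M"], name="diagnosis")
target = encode_diagnosis(sample)
print(target.tolist())
```

Raising on unknown labels is a design choice for this sketch: it keeps a mislabeled row from silently turning into a missing target.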

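Data Source

The dataset contains:
ID numbers
A diagnosis label (M = malignant, B = benign)
Ten real-valued features computed for each cell nucleus, each reported as a mean, standard error, and worst value (30 feature columns)
569 samples: 357 benign, 212 malignant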

Data Preprocessing

Load Dataset

First, load the supplied CSV file with the pandas read_csv function.

In [None]:
import pandas as pd

In [2]:
# read data
file_path = '/Users/oumaymabamoh/PycharmProjects/BreastCancerPrediction/data/raw/data_t1.csv'
df = pd.read_csv(file_path)
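
The absolute path above only resolves on the author's machine. One way to make the load portable is to build the path from a project root. This is a sketch under assumptions: the helper `load_raw_data` is hypothetical, and the `data/raw/` layout is inferred from the path in the cell above. The demonstration writes a two-row stand-in file to a temporary directory so it runs anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_raw_data(project_root: Path, filename: str = "data_t1.csv") -> pd.DataFrame:
    """Read a raw CSV from <project_root>/data/raw/ (assumed layout)."""
    csv_path = project_root / "data" / "raw" / filename
    if not csv_path.is_file():
        raise FileNotFoundError(f"raw data file not found: {csv_path}")
    return pd.read_csv(csv_path)


# Demonstration against a throwaway directory with a tiny stand-in file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw_dir = root / "data" / "raw"
    raw_dir.mkdir(parents=True)
    (raw_dir / "data_t1.csv").write_text(
        "id,diagnosis,radius_mean\n842302,M,17.99\n842517,M,20.57\n"
    )
    demo_df = load_raw_data(root)

print(demo_df.shape)
```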

Displaying Data Overview

To understand the dataset, we use several pandas methods to display key information.

In [3]:
# Display the first few rows of the dataset
df.head()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
# Info of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [5]:
# Display the number of rows and columns in the DataFrame
df.shape

(569, 33)

Cleaning Data

Data cleaning involves standardizing column names and removing unnecessary columns.

In [6]:
df.columns = df.columns.str.strip()  # Remove leading/trailing whitespace from column names
df.drop('Unnamed: 32', axis=1, inplace=True)  # Drop the unneeded 'Unnamed: 32' column
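
Dropping a column by name fails with a KeyError if the cell is run twice or if the column is absent. As an alternative sketch, assuming the unwanted column contains only missing values (as the rows shown by df.head() suggest), `dropna(axis=1, how="all")` removes every all-empty column without naming it. The small `demo` frame is a stand-in, not the real data.

```python
import numpy as np
import pandas as pd

# Stand-in frame: padded column name plus one all-empty column
demo = pd.DataFrame(
    {
        " id ": [842302, 842517],
        "diagnosis": ["M", "M"],
        "Unnamed: 32": [np.nan, np.nan],
    }
)

# Strip whitespace from column names, as in the cell above
demo.columns = demo.columns.str.strip()

# Drop every column whose values are all missing; safe to re-run
cleaned = demo.dropna(axis=1, how="all")
print(list(cleaned.columns))
```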

Final Data Overview

After cleaning, we reassess the dataset to confirm it is in a consistent state.

In [7]:
# Display the number of rows and columns in the DataFrame
df.shape

(569, 32)

In [8]:
# Info of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

Checking for Missing Values

Ensure there are no missing values in the dataset.

In [9]:
# Check for missing values
df.isnull().sum()

id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
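
Reading a long column of zeros by eye is easy to get wrong. The check can also be expressed programmatically, as in this sketch. The helper `columns_with_missing` is a hypothetical name, and the two small frames are stand-ins for illustration.

```python
import numpy as np
import pandas as pd


def columns_with_missing(frame: pd.DataFrame) -> dict:
    """Return {column: missing count} for columns with at least one null."""
    counts = frame.isnull().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Stand-in frames: one complete, one with a single gap
complete = pd.DataFrame({"radius_mean": [17.99, 20.57], "texture_mean": [10.38, 17.77]})
gappy = pd.DataFrame({"radius_mean": [17.99, np.nan], "texture_mean": [10.38, 17.77]})

print(columns_with_missing(complete))
print(columns_with_missing(gappy))
```

An empty dictionary means the frame has no missing values, so a pipeline could assert on that before continuing.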

Exploring Class Distribution

Explore the distribution of classes in the 'diagnosis' column.

In [10]:
# Display the unique class labels
df.diagnosis.unique()

array(['M', 'B'], dtype=object)

In [11]:
# Count fully duplicated rows
df.duplicated().sum()

0

In [12]:
# Display class distribution
df['diagnosis'].value_counts()

diagnosis
B    357
M    212
Name: count, dtype: int64
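
The raw counts show a moderate class imbalance, which is easier to judge as proportions. This sketch rebuilds a label series from the counts reported above (357 benign, 212 malignant) rather than from the real file; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the label column from the counts reported in the output above
labels = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")

# normalize=True returns class shares instead of raw counts
proportions = labels.value_counts(normalize=True).round(3)
print(proportions)
```

Roughly 63% of the samples are benign and 37% malignant, which is worth keeping in mind when splitting the data or choosing evaluation metrics.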

Saving Cleaned Data

Save the cleaned data to a new CSV file.

In [13]:
# Save the cleaned DataFrame to the processed data folder
df.to_csv('/Users/oumaymabamoh/PycharmProjects/BreastCancerPrediction/data/processed/data_clean.csv', index=False)
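
Writing to a hard-coded path fails if the target folder does not exist. The sketch below shows one possible way to create the folder first and verify the file by reading it back. The helper `save_clean_data` is hypothetical, and the `data/processed/` layout is assumed from the path in the cell above. The demonstration uses a temporary directory and a one-row stand-in frame.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_clean_data(frame: pd.DataFrame, project_root: Path) -> Path:
    """Write frame to <project_root>/data/processed/data_clean.csv (assumed layout)."""
    out_dir = project_root / "data" / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    out_path = out_dir / "data_clean.csv"
    frame.to_csv(out_path, index=False)
    return out_path


# Demonstration: save a stand-in frame, then read it back
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"id": [842302], "diagnosis": ["M"]})
    saved = save_clean_data(demo, Path(tmp))
    round_trip = pd.read_csv(saved)

print(list(round_trip.columns))
```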