# **Data Cleaning**

In [1]:
# Modules & Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit Learn
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

### **Loading the Dataset**

In [2]:
# Path
path = '../Data/Raw/breast-cancer.csv'

# Reading csv file
df_raw = pd.read_csv(path)

# Displaying first 5 rows of dataframe
df_raw.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [3]:
# Checking dimension of the dataframe (# of rows & cols)
df_raw.shape

(569, 32)

In [4]:
# Concise summary of the dataframe
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

## **Dataset Features Overview**

The dataset contains various measurements and attributes related to breast tumors. Below is a detailed description of the columns:

### **Identifier and Target**
- **`id`**: A unique identifier for each tumor record.
- **`diagnosis`**: The tumor diagnosis, which is the target variable for this project:
  - `M`: Malignant (cancerous).
  - `B`: Benign (non-cancerous).

### **Mean Measurements**
These columns represent the average values calculated for each tumor's features:
- **`radius_mean`**: The average distance from the center of the tumor to its perimeter, reflecting the tumor's size.
- **`texture_mean`**: The average standard deviation of gray-scale values, measuring the smoothness or coarseness of the tumor's texture.
- **`perimeter_mean`**: The average length of the tumor's boundary perimeter.
- **`area_mean`**: The average surface area of the tumor.
- **`smoothness_mean`**: The average local variation in radius lengths, indicating how smooth the tumor's surface is.
- **`compactness_mean`**: The average compactness (calculated as perimeter² / area - 1.0), representing the roundness of the tumor.
- **`concavity_mean`**: The average severity of inward-curving portions of the tumor's boundary.
- **`concave points_mean`**: The average number of concave (inward) points on the tumor's boundary.
- **`symmetry_mean`**: The average symmetry of the tumor about its center.
- **`fractal_dimension_mean`**: The average fractal dimension, indicating the complexity of the tumor's surface.

### **Standard Error (SE) Measurements**
These columns capture the variability in the measurements:
- **`radius_se`**: Standard error of the radius.
- **`texture_se`**: Standard error of the texture.
- **`perimeter_se`**: Standard error of the perimeter.
- **`area_se`**: Standard error of the area.
- **`smoothness_se`**: Standard error of the smoothness.
- **`compactness_se`**: Standard error of the compactness.
- **`concavity_se`**: Standard error of the concavity.
- **`concave points_se`**: Standard error of the concave points.
- **`symmetry_se`**: Standard error of the symmetry.
- **`fractal_dimension_se`**: Standard error of the fractal dimension.

### **Worst (Largest) Measurements**
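These columns represent the largest values observed for each feature. In the Wisconsin Diagnostic Breast Cancer data, whose column layout this file matches, "worst" is defined as the mean of the three largest values: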
- **`radius_worst`**: Largest radius observed.
- **`texture_worst`**: Largest texture value observed.
- **`perimeter_worst`**: Largest perimeter length observed.
- **`area_worst`**: Largest area observed.
- **`smoothness_worst`**: Largest smoothness value observed.
- **`compactness_worst`**: Largest compactness value observed.
- **`concavity_worst`**: Largest concavity value observed.
- **`concave points_worst`**: Largest number of concave points observed.
- **`symmetry_worst`**: Largest symmetry value observed.
- **`fractal_dimension_worst`**: Largest fractal dimension observed.

#### **Summary**
For each tumor, the dataset provides:
- **Mean measurements**: Average values for key features.
- **Standard error measurements**: Variability in the measurements.
- **Worst measurements**: Largest observed values for the features.

These measurements collectively describe the physical characteristics of the tumors and aid in distinguishing between benign and malignant cases.
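
The three groups can be told apart by the suffix of each column name. The snippet below is a minimal sketch of that grouping; the helper name `group_by_suffix` is hypothetical, and the list holds only a few of the column names shown above rather than the full set.

```python
from collections import defaultdict

# A few column names copied from the dataset (not the full list of features)
columns = [
    "radius_mean", "texture_mean", "concave points_mean",
    "radius_se", "texture_se", "concave points_se",
    "radius_worst", "texture_worst", "concave points_worst",
]


def group_by_suffix(names):
    """Group column names by the text after the last underscore."""
    groups = defaultdict(list)
    for name in names:
        # rpartition splits on the LAST underscore, so 'fractal_dimension_mean'
        # would become ('fractal_dimension', '_', 'mean')
        base, _, suffix = name.rpartition("_")
        groups[suffix].append(base)
    return dict(groups)


groups = group_by_suffix(columns)
for suffix, bases in groups.items():
    print(f"{suffix}: {bases}")
```

Splitting on the last underscore keeps multi-word feature names such as `concave points` intact.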

### **Verifying Data Quality: Null Values, Duplicates**

In [5]:
# Checking for null values in dataframe
df_raw.isnull().sum()

id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

In [6]:
# Checking for duplicate values
df_raw.duplicated().sum()

0

##### **Note:** The dataset contains no null values and no duplicate rows.
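
The two checks above could be wrapped in one helper so they are easy to rerun after later cleaning steps. This is only a sketch: `quality_report` is a hypothetical name, and the toy frame uses made-up rows with one missing cell and one repeated row so that both counts are non-zero.

```python
import numpy as np
import pandas as pd


def quality_report(df):
    """Return the number of missing cells and of fully duplicated rows."""
    return {
        "null_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame (illustrative values): row 2 repeats row 1, row 3 has a missing value
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 20.57, np.nan],
    "diagnosis": ["M", "M", "M", "B"],
})

report = quality_report(toy)
print(report)  # {'null_cells': 1, 'duplicate_rows': 1}
```

On `df_raw`, both counts would be expected to be 0, matching the outputs above.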

### **Preprocessing Attributes**

In [7]:
# Preprocessing Target Attribute
# LabelEncoder assigns codes in sorted class order, so B -> 0 and M -> 1
le = LabelEncoder()
df_raw['diagnosis'] = le.fit_transform(df_raw['diagnosis'])
df_raw.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
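
To confirm which integer stands for which diagnosis, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a short illustrative list of labels instead of the real column; the variable names are assumptions for the example.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "B", "B", "M"]  # illustrative labels, not rows from the file

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so each label's position is its integer code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'B': 0, 'M': 1}

# inverse_transform recovers the original labels from the codes
decoded = [str(label) for label in le.inverse_transform(encoded)]
print(decoded)  # ['M', 'B', 'B', 'M']
```

This matches the output above, where every malignant row in the first five records is encoded as 1.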


In [8]:
# Dropping Unnecessary Attributes
df_raw.drop(['id'], axis=1, inplace=True)

In [9]:
# Standardizing the Attributes for Processing
# Separates the features from the target column
X = df_raw.drop(columns=['diagnosis'])

# Initializes the StandardScaler
scaler = StandardScaler()

# Standardizes the features (all columns except diagnosis) to zero mean and unit variance
X_scaled = scaler.fit_transform(X)

# Creates a new DataFrame with the standardized features, keeping the original index
df_processed = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

# Adds the 'diagnosis' column back to the DataFrame
df_processed['diagnosis'] = df_raw['diagnosis']

# Shows the first few rows of the scaled DataFrame
df_processed.head()


Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015,1
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119,1
2,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,-0.398008,...,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391,1
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501,1
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,-0.56245,...,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971,1
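
`StandardScaler` replaces each value with `(x - mean) / std`, using the population standard deviation (`ddof=0`) of the column. The sketch below checks this by hand on the first five `radius_mean` and `texture_mean` values shown earlier. Because it uses five rows rather than all 569, the scaled numbers will differ from the output above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five radius_mean and texture_mean values from the raw dataframe
X = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69, 11.42, 20.29],
    "texture_mean": [10.38, 17.77, 21.25, 20.38, 14.34],
})

X_scaled = StandardScaler().fit_transform(X)

# Same computation by hand: subtract the column mean, divide by the population std
manual = (X - X.mean()) / X.std(ddof=0)

matches = np.allclose(X_scaled, manual.to_numpy())
print(matches)  # True

# After scaling, every column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```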


In [10]:
df_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   radius_mean              569 non-null    float64
 1   texture_mean             569 non-null    float64
 2   perimeter_mean           569 non-null    float64
 3   area_mean                569 non-null    float64
 4   smoothness_mean          569 non-null    float64
 5   compactness_mean         569 non-null    float64
 6   concavity_mean           569 non-null    float64
 7   concave points_mean      569 non-null    float64
 8   symmetry_mean            569 non-null    float64
 9   fractal_dimension_mean   569 non-null    float64
 10  radius_se                569 non-null    float64
 11  texture_se               569 non-null    float64
 12  perimeter_se             569 non-null    float64
 13  area_se                  569 non-null    float64
 14  smoothness_se            5
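
The scaling cell above fits the scaler on all rows. If the processed data is later split into training and test sets, one common alternative is to fit the scaler on the training rows only and reuse those statistics for the test rows. The sketch below illustrates that idea on synthetic data; the array shapes, seeds, and split ratio are assumptions and do not come from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=15.0, scale=3.0, size=(100, 2))  # synthetic features
y = np.array([0, 1] * 50)                           # synthetic balanced 0/1 labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The mean and std are learned from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# The test rows are scaled with the training statistics, not their own
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.shape)
```

The test columns will generally not have a mean of exactly 0 after this transform, which is expected: they are scaled with statistics they did not contribute to.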

### **Saving Processed Dataframe**

In [11]:
# Defines the file path for saving the processed data
processed_file_path = '../Data/Processed/df_processed.csv'

# Saves the df_processed dataframe to the CSV file
df_processed.to_csv(processed_file_path, index=False)

print(f"Processed data saved to: {processed_file_path}")


Processed data saved to: ../Data/Processed/df_processed.csv
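
`to_csv` does not create missing folders, so the save step fails if `../Data/Processed/` does not exist yet. The sketch below creates the folder first and then reads the file back to check the round trip. It writes to a temporary directory with a two-row illustrative frame, so the paths and values here are assumptions rather than the notebook's own.

```python
import os
import tempfile

import pandas as pd

# Small illustrative frame standing in for df_processed
df = pd.DataFrame({"radius_mean": [1.097064, 1.829821], "diagnosis": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "Processed", "df_processed.csv")

    # Create the target folder if it is missing
    os.makedirs(os.path.dirname(out_path), exist_ok=True)

    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if columns, dtypes, or values differ
pd.testing.assert_frame_equal(reloaded, df)
print("Round trip OK:", reloaded.shape)
```

Saving with `index=False`, as the notebook does, keeps the row index out of the file so the reloaded frame has the same columns as the original.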
