# **Data Cleaning**

In [1]:
# Modules & Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### **Loading the Dataset**

In [2]:
# Path to the raw dataset, relative to the notebook
path = '../Data/Raw/breast-cancer.csv'

# Read the CSV file and preview the first five rows
df_raw = pd.read_csv(path)
df_raw.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [3]:
# Checking dimension of the dataframe (# of rows & cols)
df_raw.shape

(569, 32)

In [4]:
# Concise summary of the dataframe
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

## **Dataset Features Overview**

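The dataset contains 569 records and 32 columns: a record identifier, a diagnosis label, and 30 numeric measurements of breast tumors. The column names and shape match the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, in which each measurement is computed from the cell nuclei in a digitized image of a fine needle aspirate of the breast mass. The columns are described below: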

### **Identifier and Target**
- **`id`**: A unique identifier for each tumor record.
- **`diagnosis`**: The diagnosis of the tumor, which serves as the class label:
  - `M`: Malignant (cancerous).
  - `B`: Benign (non-cancerous).
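
As a quick illustration of the two diagnosis codes, the sketch below counts records per class and flags any code outside `M` and `B`. The `sample` frame and the `labels` mapping are hypothetical stand-ins introduced here; in the notebook the same calls would presumably be applied to `df_raw`.

```python
import pandas as pd

# Hypothetical sample of the `diagnosis` column (stand-in for df_raw).
sample = pd.DataFrame({"diagnosis": ["M", "M", "B", "M", "B"]})

# Human-readable names for the two documented codes.
labels = {"M": "Malignant", "B": "Benign"}

# Count records per class.
class_counts = sample["diagnosis"].map(labels).value_counts()

# Collect any code that is not one of the documented values.
unexpected = set(sample["diagnosis"].unique()) - set(labels)

print(class_counts)
print("Unexpected codes:", unexpected)
```

An empty `unexpected` set suggests the column only holds the two documented codes.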

### **Mean Measurements**
These columns represent the average values calculated for each tumor's features:
- **`radius_mean`**: The average distance from the center of the tumor to its perimeter, reflecting the tumor's size.
- **`texture_mean`**: The average standard deviation of gray-scale values, measuring the smoothness or coarseness of the tumor's texture.
- **`perimeter_mean`**: The average length of the tumor's boundary perimeter.
- **`area_mean`**: The average surface area of the tumor.
- **`smoothness_mean`**: The average local variation in radius lengths, indicating how smooth the tumor's surface is.
- **`compactness_mean`**: The average compactness (calculated as perimeter² / area - 1.0), representing the roundness of the tumor.
- **`concavity_mean`**: The average severity of inward-curving portions of the tumor's boundary.
- **`concave points_mean`**: The average number of concave (inward) points on the tumor's boundary.
- **`symmetry_mean`**: The average symmetry of the tumor about its center.
- **`fractal_dimension_mean`**: The average fractal dimension, indicating the complexity of the tumor's surface.

### **Standard Error (SE) Measurements**
These columns capture the variability in the measurements:
- **`radius_se`**: Standard error of the radius.
- **`texture_se`**: Standard error of the texture.
- **`perimeter_se`**: Standard error of the perimeter.
- **`area_se`**: Standard error of the area.
- **`smoothness_se`**: Standard error of the smoothness.
- **`compactness_se`**: Standard error of the compactness.
- **`concavity_se`**: Standard error of the concavity.
- **`concave points_se`**: Standard error of the concave points.
- **`symmetry_se`**: Standard error of the symmetry.
- **`fractal_dimension_se`**: Standard error of the fractal dimension.

### **Worst (Largest) Measurements**
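These columns represent the largest values observed for each feature. In the WDBC documentation, "worst" is defined as the mean of the three largest values rather than the single maximum: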
- **`radius_worst`**: Largest radius observed.
- **`texture_worst`**: Largest texture value observed.
- **`perimeter_worst`**: Largest perimeter length observed.
- **`area_worst`**: Largest area observed.
- **`smoothness_worst`**: Largest smoothness value observed.
- **`compactness_worst`**: Largest compactness value observed.
- **`concavity_worst`**: Largest concavity value observed.
- **`concave points_worst`**: Largest number of concave points observed.
- **`symmetry_worst`**: Largest symmetry value observed.
- **`fractal_dimension_worst`**: Largest fractal dimension observed.
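
To make the three statistic families concrete, here is a minimal sketch that derives a mean, a standard error, and a "worst" value from a handful of individual measurements. The `radii` array is invented for illustration, the standard error is assumed to be the usual sample standard deviation divided by the square root of the sample size, and "worst" is assumed to follow the WDBC convention of averaging the three largest values. None of this is taken from the notebook itself.

```python
import numpy as np

# Hypothetical individual radius measurements behind one record.
radii = np.array([12.0, 15.0, 18.0, 21.0, 24.0])

# Mean: the plain average of the measurements.
radius_mean = radii.mean()

# Standard error (assumed definition): sample std divided by sqrt(n).
radius_se = radii.std(ddof=1) / np.sqrt(radii.size)

# Worst (assumed WDBC convention): mean of the three largest values.
radius_worst = np.sort(radii)[-3:].mean()

print(radius_mean, radius_se, radius_worst)
```

The dataset only ships the aggregated columns, so these per-measurement values cannot be recovered from it; the sketch only shows how the three columns for one feature relate to each other.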

#### **Summary**
For each tumor, the dataset provides:
- **Mean measurements**: Average values for key features.
- **Standard error measurements**: Variability in the measurements.
- **Worst measurements**: Largest observed values for the features.

These measurements collectively describe the physical characteristics of the tumors and aid in distinguishing between benign and malignant cases.
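
The three groups summarized above can be separated by column suffix. The sketch below rebuilds the header from the feature names listed in this section and splits it into `_mean`, `_se`, and `_worst` groups. The `features`, `columns`, and `groups` names are illustrative; in the notebook the same comprehension could be run against `df_raw.columns`.

```python
import pandas as pd

# The ten base features described in this section.
features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
suffixes = ("mean", "se", "worst")

# Rebuild the header: id, diagnosis, then one block of columns per suffix.
columns = ["id", "diagnosis"] + [
    f"{name}_{suffix}" for suffix in suffixes for name in features
]
empty = pd.DataFrame(columns=columns)  # stand-in for df_raw

# Group the column names by their suffix.
groups = {
    suffix: [col for col in empty.columns if col.endswith(f"_{suffix}")]
    for suffix in suffixes
}

for suffix, cols in groups.items():
    print(suffix, len(cols))
```

Each group holds ten columns, which together with `id` and `diagnosis` accounts for the 32 columns reported by `df_raw.shape`.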

### **Verifying Data Quality: Null Values, Duplicates**

### **