# Data Cleaning and Preparation

This notebook loads `data.csv`, encodes the `diagnosis` column as integers (`M` → 1, `B` → 0), drops the unneeded `Unnamed: 32` column, and saves the cleaned dataset as `clean_data.csv`.
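
One caveat about the encoding step: `Series.map` returns `NaN` for any label that is missing from the mapping, so a typo or an unexpected category would pass through silently. The following is a minimal sketch of a guarded version, not the notebook's own code. It assumes a small hypothetical sample frame in place of `data.csv`, and the helper name `encode_diagnosis` is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data.csv (values are illustrative only).
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.99, 13.54, 20.57],
    "Unnamed: 32": [float("nan")] * 3,
})


def encode_diagnosis(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'diagnosis' mapped to 1 (malignant) or 0 (benign).

    Hypothetical helper: raises ValueError instead of silently producing NaN
    when a label is missing or not covered by the mapping.
    """
    mapping = {"M": 1, "B": 0}
    labels = frame["diagnosis"]
    unexpected = set(labels.dropna().unique()) - set(mapping)
    if unexpected or labels.isna().any():
        raise ValueError(
            f"Unexpected or missing diagnosis labels: {sorted(map(str, unexpected))}"
        )
    out = frame.copy()
    out["diagnosis"] = labels.map(mapping).astype("int64")
    return out


# errors="ignore" keeps the drop safe if the column is already absent.
cleaned = encode_diagnosis(sample).drop(columns=["Unnamed: 32"], errors="ignore")
print(cleaned.dtypes)
```

Raising on unknown labels keeps the "successfully encoded" message honest: if the check passes, every row really holds a 0 or a 1.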

In [1]:
# DATA CLEANING AND PREPARATION

import pandas as pd

# 1) Load original dataset
df = pd.read_csv("data.csv")
print("Initial shape:", df.shape)

# 2) Encode target column ('M' → 1, 'B' → 0)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
print("\n Target column 'diagnosis' successfully encoded (1=Malignant, 0=Benign).")

# 3) Drop unnecessary columns
df = df.drop(columns=['Unnamed: 32'], errors='ignore')

print("\n Dropped unnecessary columns")
print("New shape after cleaning:", df.shape)

# 4) Verify data types and check for missing values
print("\nData types after cleaning:\n", df.dtypes)
print("\nMissing values per column:\n", df.isnull().sum())

# 5) Save the cleaned dataset
df.to_csv("clean_data.csv", index=False)
print("\nCleaned dataset saved as 'clean_data.csv'")

Initial shape: (569, 33)

 Target column 'diagnosis' successfully encoded (1=Malignant, 0=Benign).

 Dropped unnecessary columns
New shape after cleaning: (569, 32)

Data types after cleaning:
 id                           int64
diagnosis                    int64
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
t