
# Breast Cancer Data Analysis (PySpark)

**Dataset file:** `/mnt/data/data.csv`


**Dataset description:**


This notebook performs exploratory data analysis (EDA) on a breast cancer dataset using **PySpark** (the code is written to be
compatible with version 4.0.1) and common Python libraries (`pandas`, `matplotlib`, `seaborn`). The analysis pipeline mirrors
the structure of the `govt_data_analysis_pyspark` notebook you provided, adapted for this medical dataset.

**What this notebook includes:**
- Spark session initialization (PySpark)
- Data loading and preview
- Column cleaning (normalize names)
- Null / duplicate handling
- Automatic identification and casting of numeric columns
- Summary statistics and correlations
- Visualizations (feature distributions, diagnosis-wise boxplots, correlation heatmap)

> Note: The notebook assumes the file `/mnt/data/data.csv` exists. If your environment uses a
> different path, update the `csv_path` variable in the Data Loading cell.

**Goal:** Provide clear EDA (summary statistics and visual insights) to help understand patterns in features and how they
relate to the target diagnosis (e.g., malignant vs benign).


In [None]:

# Imports and Spark session initialization
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, IntegerType
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Create Spark session (compatible with PySpark 4.0.1)
# Parentheses replace backslash continuations, which break if a blank line follows the backslash
spark = (
    SparkSession.builder
    .appName("BreastCancerEDA")
    .getOrCreate()
)

print("Spark version:", spark.version)


In [None]:

# Path to the CSV file (update if needed)
csv_path = "/mnt/data/data.csv"

# Read every column as a string (inferSchema=False), mimicking the govt notebook flow;
# numeric columns are detected and cast in a later cell
df = spark.read.csv(csv_path, header=True, inferSchema=False)

print("Number of rows:", df.count())
print("Number of columns:", len(df.columns))
print("Columns:", df.columns)

# Show sample rows
df.show(5, truncate=False)
df.printSchema()


In [None]:

# Normalize column names: replace spaces, hyphens, and slashes with underscores,
# drop parentheses, and spell out '%' as 'pct'
def normalize_col_name(c):
    return (c.strip()
            .replace(' ', '_')
            .replace('-', '_')
            .replace('/', '_')
            .replace('(', '')
            .replace(')', '')
            .replace('%', 'pct'))

old_cols = df.columns
new_cols = [normalize_col_name(c) for c in old_cols]

for old, new in zip(old_cols, new_cols):
    if old != new:
        df = df.withColumnRenamed(old, new)

print("Renamed columns:", df.columns)
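
To make the renaming rules concrete, the sketch below applies the same `normalize_col_name` function to a few made-up column names (hypothetical examples, not taken from the dataset). It is plain Python and needs no Spark session.

```python
def normalize_col_name(c):
    """Same rules as the cell above: underscores for separators, no parentheses, '%' spelled out."""
    return (c.strip()
            .replace(' ', '_')
            .replace('-', '_')
            .replace('/', '_')
            .replace('(', '')
            .replace(')', '')
            .replace('%', 'pct'))

# Hypothetical column names, chosen only to exercise each rule
samples = [" radius mean ", "area (worst)", "growth-rate", "fat/lean", "change %"]
normalized = [normalize_col_name(s) for s in samples]

for before, after in zip(samples, normalized):
    print(f"{before!r} -> {after!r}")
```

Two different raw names can collapse to the same normalized name (for example `a-b` and `a b`), so it may be worth checking `len(set(new_cols)) == len(new_cols)` before renaming.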


In [None]:

# Count nulls per column
null_counts = df.select([F.count(F.when(F.col(c).isNull() | (F.col(c) == ''), c)).alias(c) for c in df.columns])
print("Null / empty counts:")
null_counts.show(truncate=False)

# Drop exact duplicate rows (if any)
before = df.count()
df = df.dropDuplicates()
after = df.count()
print(f"Dropped {before-after} duplicate rows (if any).")


In [None]:

# Heuristic: detect numeric columns by trying to cast to double and checking non-null proportion.
# Note that identifier-like columns with numeric values also pass this check.
total_count = df.count()  # computed once; the row count does not change inside the loop
numeric_candidates = []
for c in df.columns:
    # Cast to double and count the values that survive (non-numeric strings become NULL)
    cast_col = F.col(c).cast('double')
    non_null_count = df.select(F.count(cast_col).alias('non_null')).collect()[0]['non_null']
    # If at least 80% of values cast to numeric, treat the column as numeric (threshold can be adjusted)
    if total_count > 0 and non_null_count / total_count >= 0.8:
        numeric_candidates.append(c)

print("Numeric-like candidate columns:", numeric_candidates)

# Cast the detected numeric columns to double
for c in numeric_candidates:
    df = df.withColumn(c, F.col(c).cast(DoubleType()))

# Show schema after casting
df.printSchema()
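
The 80% rule can be illustrated without Spark. The sketch below is an approximation in plain Python: `float()` stands in for Spark's `cast('double')`, and the two do not accept exactly the same strings, so treat it as an illustration of the threshold logic only. The sample values and the helper names `castable_fraction` and `looks_numeric` are hypothetical.

```python
def castable_fraction(values):
    """Fraction of values that parse as a float (None and '' count as failures)."""
    if not values:
        return 0.0
    ok = 0
    for v in values:
        try:
            float(v)
            ok += 1
        except (TypeError, ValueError):
            pass
    return ok / len(values)

def looks_numeric(values, threshold=0.8):
    """Apply the same 'at least 80% castable' rule as the detection cell."""
    return castable_fraction(values) >= threshold

# Made-up samples: one mostly numeric column with a blank, one label column
measurements = ["17.99", "20.57", "", "11.42", "20.29"]
labels = ["M", "B", "B", "M", "B"]

print("measurements:", castable_fraction(measurements), looks_numeric(measurements))
print("labels:", castable_fraction(labels), looks_numeric(labels))
```

Values that fail the cast become NULL once the column is cast to double, so a column that only just passes the threshold can carry up to 20% NULLs into the later cells.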


In [None]:

# Summary statistics for numeric columns
numeric_cols = [c for c, t in df.dtypes if t in ('double', 'int', 'bigint', 'float')]
print("Numeric columns:", numeric_cols)

if numeric_cols:
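    # Assign and display explicitly: a bare expression inside an `if` block is not rendered by the notebook
    summary_pdf = df.select(numeric_cols).describe().toPandas().set_index('summary').T
    display(summary_pdf)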

    # Compute medians using approxQuantile
    medians = {}
    for c in numeric_cols:
        med = df.approxQuantile(c, [0.5], 0.01)  # 1% relative error
        medians[c] = med[0] if med else None
    print("Medians (approx):")
    for k, v in medians.items():
        print(f"  {k}: {v}")
else:
    print("No numeric columns detected to summarize.")


In [None]:

# Correlation heatmap: convert numeric columns to pandas for ease of plotting
if numeric_cols:
    pdf = df.select(numeric_cols).toPandas()
    corr = pdf.corr()
    display(corr.head())

    # Plot correlation heatmap (matplotlib + seaborn)
    plt.figure(figsize=(10,8))
    sns.heatmap(corr, annot=True, fmt='.2f', square=True, linewidths=.5)
    plt.title('Correlation matrix (numeric features)')
    plt.show()
else:
    print('No numeric columns available for correlation heatmap.')


In [None]:

# Plot distributions for a selection of numeric columns (up to first 6)
if numeric_cols:
    sample_cols = numeric_cols[:6]
    pdf = df.select(sample_cols).toPandas()
    pdf.hist(bins=20, figsize=(12, 8))
    plt.suptitle('Feature distributions (sample)')
    plt.show()
else:
    print('No numeric columns to plot distributions.')


In [None]:

# Look for common target column names (diagnosis, target, class, label)
possible_targets = [c for c in df.columns if c.lower() in ('diagnosis', 'target', 'label', 'class')]
target_col = possible_targets[0] if possible_targets else None
print('Detected target column:', target_col)

if target_col and numeric_cols:
    # Show counts per class
    df.groupBy(target_col).count().show()

    # For each numeric column, produce boxplots grouped by target
    pdf = df.select([target_col] + numeric_cols).toPandas()
    melted = pdf.melt(id_vars=target_col, value_vars=numeric_cols)
    plt.figure(figsize=(12, 6))
    sns.boxplot(x='variable', y='value', hue=target_col, data=melted)
    plt.xticks(rotation=90)
    plt.title('Feature distributions by target (boxplots)')
    plt.show()
else:
    print('No target column detected or no numeric columns available for group comparisons.')


In [None]:

# Compute correlation of numeric features with target if possible (map categorical target to numeric)
if target_col and numeric_cols:
    # If target is non-numeric, create a mapping
    if dict(df.dtypes)[target_col] not in ('double', 'int', 'bigint', 'float'):
        # Skip NULL labels: sorted() cannot compare None with strings
        distinct_vals = [r[0] for r in df.select(target_col).distinct().collect() if r[0] is not None]
        mapping = {v: i for i, v in enumerate(sorted(distinct_vals))}
        print('Mapping target values to integers:', mapping)
        # create_map expects a flat sequence: key1, value1, key2, value2, ...
        mapping_expr = F.create_map([F.lit(x) for kv in mapping.items() for x in kv])
        df_num_target = df.withColumn('_target_num', mapping_expr[F.col(target_col)])
        target_name = '_target_num'
    else:
        df_num_target = df
        target_name = target_col

    # compute Pearson correlation between each numeric feature and target
    corrs = {}
    for c in numeric_cols:
        try:
            corr_val = df_num_target.stat.corr(c, target_name)
            corrs[c] = corr_val
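        except Exception as e:
            # Record the failure but report why, rather than discarding the error silently
            print(f'  Could not compute correlation for {c}: {e}')
            corrs[c] = None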
    # Show sorted correlations by absolute value
    sorted_corrs = sorted(corrs.items(), key=lambda x: abs(x[1]) if x[1] is not None else -1, reverse=True)
    print('Feature correlations with target (descending by absolute value):')
    for k, v in sorted_corrs:
        print(f'  {k}: {v}')
else:
    print('Cannot compute feature-target correlations (missing target or numeric columns).')
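
Because the mapping sorts the distinct labels before numbering them, the sign of each correlation depends on alphabetical order. The sketch below assumes the labels are `'M'` and `'B'` (an assumption; check the "Mapping target values to integers" output for the actual values). Under that assumption `'B'` maps to 0 and `'M'` to 1, so a positive correlation means the feature tends to be larger for `'M'` rows.

```python
# Assumed labels; the real values come from df.select(target_col).distinct()
distinct_vals = ["M", "B"]

# Same construction as in the cell above: alphabetical order decides the integer codes
mapping = {v: i for i, v in enumerate(sorted(distinct_vals))}
print(mapping)

# Flattened key/value sequence of the kind passed to F.create_map (before wrapping each item in F.lit)
flat = [x for kv in mapping.items() for x in kv]
print(flat)
```

With more than two classes the integer codes impose an arbitrary ordering, so a Pearson correlation against them is only easy to interpret in the binary case.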


In [None]:

# Optionally, save the cleaned & cast dataframe to Parquet for faster reuse
out_path = '/mnt/data/breast_cancer_cleaned.parquet'
df.write.mode('overwrite').parquet(out_path)
print('Saved cleaned dataframe to', out_path)



# Summary & Key Insights


*This section summarises the main findings from the exploratory data analysis above.*


- **Dataset file used:** `/mnt/data/data.csv`.
- **Columns inspected:** see the "Columns" output in the Data Loading cell.
- **Numeric features:** Detected and cast automatically; summary statistics and medians were computed.
- **Correlations:** The correlation heatmap above highlights pairwise relationships between numeric features.
- **Target comparisons:** If a `diagnosis` (or `target`/`label`/`class`) column was detected, boxplots and feature-target correlations were shown to compare feature distributions between classes (e.g., malignant vs benign).

**Suggested next steps (optional):**
1. Run feature selection (e.g., remove highly correlated features).
2. Train classification models using PySpark MLlib (Logistic Regression, RandomForest) and evaluate with cross-validation.
3. Create a summarized PPT and a written report describing the EDA results and any modelling outcomes.
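
Step 1 can be sketched as a greedy filter: walk through the features in order and keep one only if its absolute Pearson correlation with every feature already kept stays below a threshold. This is one common heuristic, not the only way to do feature selection. The 0.9 threshold, the helper names `pearson` and `drop_correlated`, and the toy values are all assumptions for illustration. In the notebook itself, the same idea could be applied to the pandas `corr` matrix computed for the heatmap.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Constant columns (zero variance) are not handled and would divide by zero.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def drop_correlated(columns, threshold=0.9):
    """Greedily keep features whose |r| with every already-kept feature is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (made up): 'perimeter' is an exact linear function of 'radius'
toy = {
    "radius":    [1.0, 2.0, 3.0, 4.0, 5.0],
    "perimeter": [6.3, 12.6, 18.9, 25.2, 31.5],
    "texture":   [2.0, 1.0, 4.0, 3.0, 5.0],
}
print(drop_correlated(toy))
```

The result depends on column order, since the first feature of a correlated pair is the one that survives. Ordering features by their correlation with the target before filtering is one way to make that choice deliberate.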

---

_This notebook was generated to match the structure and operations of your provided `govt_data_analysis_pyspark` notebook, adapted for a breast cancer dataset._
