# Explore ionosphere data

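This notebook explores the Ionosphere data set from the UCI Machine Learning Repository.
It takes the following steps:
1. Load the data
2. Explore the data
3. Generate a profile report on the data
4. Drop the constant column found by the report
5. Save the processed data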

### Load the data

In [3]:
from ucimlrepo import fetch_ucirepo
from loguru import logger

In [4]:
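try:
    # Fetch the Ionosphere dataset (UCI ML Repository id 52)
    ionosphere = fetch_ucirepo(id=52)
    logger.info("Dataset loaded successfully!")
except Exception as e:
    logger.error(f"Error loading dataset: {e}")
    raise  # re-raise, keeping the original traceback

# Features and targets are returned as pandas DataFrames
X = ionosphere.data.features
y = ionosphere.data.targets
# Binary target: True for class "g" (good), False for "b" (bad)
y = y["Class"] == "g"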

2025-11-09 13:34:47.268 | INFO     | __main__:<module>:4 - Dataset loaded successfully!


### Quick data exploration

In [5]:
logger.info(f"Size of the design matrix X: {X.shape} ")
logger.info(f"Size of the labels y: {y.shape} ")


2025-11-09 13:35:00.648 | INFO     | __main__:<module>:1 - Size of the design matrix X: (351, 34)
2025-11-09 13:35:00.651 | INFO     | __main__:<module>:2 - Size of the labels y: (351,)


In [6]:
X.head()

| | Attribute1 | Attribute2 | Attribute3 | Attribute4 | Attribute5 | Attribute6 | Attribute7 | Attribute8 | Attribute9 | Attribute10 | ... | Attribute25 | Attribute26 | Attribute27 | Attribute28 | Attribute29 | Attribute30 | Attribute31 | Attribute32 | Attribute33 | Attribute34 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.99539 | -0.05889 | 0.85243 | 0.02306 | 0.83398 | -0.37708 | 1.0 | 0.0376 | ... | 0.56811 | -0.51171 | 0.41078 | -0.46168 | 0.21266 | -0.3409 | 0.42267 | -0.54487 | 0.18641 | -0.453 |
| 1 | 1 | 0 | 1.0 | -0.18829 | 0.93035 | -0.36156 | -0.10868 | -0.93597 | 1.0 | -0.04549 | ... | -0.20332 | -0.26569 | -0.20468 | -0.18401 | -0.1904 | -0.11593 | -0.16626 | -0.06288 | -0.13738 | -0.02447 |
| 2 | 1 | 0 | 1.0 | -0.03365 | 1.0 | 0.00485 | 1.0 | -0.12062 | 0.88965 | 0.01198 | ... | 0.57528 | -0.4022 | 0.58984 | -0.22145 | 0.431 | -0.17365 | 0.60436 | -0.2418 | 0.56045 | -0.38238 |
| 3 | 1 | 0 | 1.0 | -0.45161 | 1.0 | 1.0 | 0.71216 | -1.0 | 0.0 | 0.0 | ... | 1.0 | 0.90695 | 0.51613 | 1.0 | 1.0 | -0.20099 | 0.25682 | 1.0 | -0.32382 | 1.0 |
| 4 | 1 | 0 | 1.0 | -0.02401 | 0.9414 | 0.06531 | 0.92106 | -0.23255 | 0.77152 | -0.16399 | ... | 0.03286 | -0.65158 | 0.1329 | -0.53206 | 0.02431 | -0.62197 | -0.05707 | -0.59573 | -0.04608 | -0.65697 |


In [7]:
y.head()

0     True
1    False
2     True
3    False
4     True
Name: Class, dtype: bool

In [8]:
logger.info(f"Class distribution: {y.value_counts()}")

2025-11-09 13:35:20.326 | INFO     | __main__:<module>:1 - Class distribution: Class
True     225
False    126
Name: count, dtype: int64
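
Raw counts are easier to compare as proportions. A minimal sketch, assuming a boolean Series shaped like `y`; the `labels` Series below is a small made-up stand-in, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for y: True marks class "g", False any other class
labels = pd.Series([True, False, True, True], name="Class")

# normalize=True returns the fraction of rows per class instead of counts
proportions = labels.value_counts(normalize=True)
majority_share = float(proportions.max())

print(proportions)
print(f"Majority-class share: {majority_share:.2f}")
```

For the counts logged above (225 `True` and 126 `False` out of 351 rows), the majority class accounts for about 64% of the data, which is the accuracy a classifier that always predicts `True` would reach.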


### Create a profile report


In [9]:
from ydata_profiling import ProfileReport

profile = ProfileReport(X, title="Profiling Report")
profile.to_file("report_data.html")



The report shows that the `Attribute2` column is constant (0 in every row, consistent with the preview above), so it carries no information for the classification task. We drop it.
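
Constant columns can also be found without the report. A minimal sketch, assuming a DataFrame shaped like `X`; the `features` frame and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the design matrix X; the values are made up
features = pd.DataFrame(
    {
        "Attribute1": [1, 1, 0],
        "Attribute2": [0, 0, 0],
        "Attribute3": [0.99, 1.0, 0.5],
    }
)

# A column with at most one distinct value carries no information
n_unique = features.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

print(constant_cols)
```

Run on the real `X` before the drop, this check would be expected to list `Attribute2`.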

In [10]:
X = X.drop(columns=["Attribute2"])
logger.info(f"Size of the design matrix X: {X.shape} ")

2025-11-09 00:34:37.440 | INFO     | __main__:<module>:2 - Size of the design matrix X: (351, 33)


### Save the processed data

In [11]:
import os

output_dir = "../data/processed"
# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)
X.to_csv(os.path.join(output_dir, "X.csv"), index=False)
y.to_csv(os.path.join(output_dir, "y.csv"), index=False)
logger.info(f"Data saved to {output_dir}")


2025-11-09 00:43:21.939 | INFO     | __main__:<module>:8 - Data saved to ../data/processed

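
A later notebook can read the two files back with `pandas.read_csv`. A minimal sketch of the round trip, using a temporary directory and made-up stand-ins (`features`, `labels`) in place of the real `X`, `y`, and `../data/processed`:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for X and y; the values are made up
features = pd.DataFrame({"Attribute1": [1, 0], "Attribute3": [0.5, 1.0]})
labels = pd.Series([True, False], name="Class")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same calls as in the cell above, written to a throwaway directory
    features.to_csv(os.path.join(tmp_dir, "X.csv"), index=False)
    labels.to_csv(os.path.join(tmp_dir, "y.csv"), index=False)

    # y.csv holds a single "Class" column, so select it to get a Series back
    features_loaded = pd.read_csv(os.path.join(tmp_dir, "X.csv"))
    labels_loaded = pd.read_csv(os.path.join(tmp_dir, "y.csv"))["Class"]

print(features_loaded.shape)
print(labels_loaded.tolist())
```

Because both files are written with `index=False`, the row order is the only link between `X.csv` and `y.csv`, so neither file should be shuffled or filtered on its own.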