## Objective
The goal of this notebook is to handle missing values in the approved dataset by:

- Identifying missing-value patterns  
- Removing non-imputable columns  
- Applying KNN Imputation for numerical features  
- Applying appropriate imputation for categorical features  
- Producing a clean dataset


In [1]:
import pandas as pd
import numpy as np


Pandas is used for data manipulation and analysis.
NumPy is used for numerical operations and missing value handling.


In [9]:
df = pd.read_csv("../../datasets/full_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4048 entries, 0 to 4047
Columns: 112 entries, P_NAME to P_SEMI_MAJOR_AXIS_EST
dtypes: float64(94), int64(4), object(14)
memory usage: 3.5+ MB


In [10]:
missing_summary = pd.DataFrame({
    "Missing_Count": df.isnull().sum(),
    "Missing_Percentage": df.isnull().mean() * 100
}).sort_values(by="Missing_Percentage", ascending=False)

missing_summary.head(30)


| Column | Missing_Count | Missing_Percentage |
|---|---|---|
| P_ATMOSPHERE | 4048 | 100.0 |
| P_ALT_NAMES | 4048 | 100.0 |
| P_DETECTION_RADIUS | 4048 | 100.0 |
| P_GEO_ALBEDO | 4048 | 100.0 |
| P_DETECTION_MASS | 4048 | 100.0 |
| S_MAGNETIC_FIELD | 4048 | 100.0 |
| S_DISC | 4048 | 100.0 |
| P_TEMP_MEASURED | 4043 | 99.876482 |
| P_GEO_ALBEDO_ERROR_MIN | 4043 | 99.876482 |
| P_GEO_ALBEDO_ERROR_MAX | 4043 | 99.876482 |


Both numerical and categorical features contain missing values.
They are handled separately using suitable imputation techniques.
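The missing-value summary can also be used to tell fully missing columns apart from columns that are only sparsely populated. The following is a minimal sketch on a made-up frame; the column names and the `HIGH_MISSING_THRESHOLD` cutoff are assumptions for illustration and are not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "all_missing": [np.nan, np.nan, np.nan, np.nan],
    "mostly_missing": [1.0, np.nan, np.nan, np.nan],
    "complete": [1.0, 2.0, 3.0, 4.0],
    "label": ["a", None, "b", "a"],
})

# Hypothetical cutoff (in percent); the notebook itself does not define one.
HIGH_MISSING_THRESHOLD = 70.0

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

# Columns with no observed values at all.
fully_missing = missing_pct[missing_pct == 100].index.tolist()

# Columns that are sparse but still contain some observed values.
high_missing = missing_pct[
    (missing_pct >= HIGH_MISSING_THRESHOLD) & (missing_pct < 100)
].index.tolist()

print(fully_missing)
print(high_missing)
```

Columns in the second list still have data, so whether to impute or drop them is a judgment call that depends on the downstream task.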


In [14]:

num_cols = df.select_dtypes(include=["int64", "float64"]).columns
cat_cols = df.select_dtypes(include=["object"]).columns

len(num_cols), len(cat_cols)


(98, 14)

After separating the features based on data type, we observe that the dataset
contains **98 numerical columns** and **14 categorical columns**.
Since these feature types have different characteristics, they are imputed
using different strategies.


In [17]:
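# Find numerical columns in which every value is missing
all_nan_num_cols = [col for col in num_cols if df[col].isnull().all()]

# Note: only the last expression in a cell is displayed, so this line shows nothing
len(all_nan_num_cols), all_nan_num_cols

# Drop the fully missing columns in place
df.drop(columns=all_nan_num_cols, inplace=True)

# Recompute the numerical column list so it no longer references dropped columns
num_cols = df.select_dtypes(include=["int64", "float64"]).columns
len(num_cols)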

91

Seven numerical columns contain only missing values across all 4048 rows, so
the numerical column count drops from 98 to 91. Since these columns provide no
usable information, they were removed from the dataset.

After dropping these fully-missing numerical columns, the list of numerical
features was recalculated to ensure consistency before applying imputation.
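As a possible alternative to building the list by hand, pandas can drop fully missing columns directly. The sketch below uses a made-up frame (the column names are hypothetical) and shows that `dropna(axis=1, how="all")` removes exactly the columns in which every value is missing, regardless of dtype.

```python
import numpy as np
import pandas as pd

# Toy frame with one fully missing column (values are made up).
toy = pd.DataFrame({
    "empty_num": [np.nan, np.nan, np.nan],
    "partial_num": [1.0, np.nan, 3.0],
    "cat": ["x", None, "y"],
})

# Columns in which every value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()

# how="all" drops a column only if all of its values are missing;
# a new frame is returned and the original is left unchanged.
trimmed = toy.dropna(axis=1, how="all")

print(empty_cols)
print(trimmed.columns.tolist())
```

Partially missing columns are kept, so this step does not replace imputation.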


In [18]:
from sklearn.impute import KNNImputer

# Impute each missing value from the 5 nearest rows (NaN-aware Euclidean distance)
knn_imputer = KNNImputer(n_neighbors=5)

df[num_cols] = pd.DataFrame(
    knn_imputer.fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index
)

df[num_cols].isnull().sum().sum()


np.int64(0)

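Numerical columns that were entirely missing were removed before imputation.
KNN Imputation was then applied to the remaining numerical features: each
missing value is replaced by the mean of that feature over the 5 most similar
rows, which preserves relationships between similar observations. Note that
some retained columns, such as P_TEMP_MEASURED, have only 5 observed values out
of 4048, so their imputed values rest on very little data.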
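Because KNN imputation is distance-based, features with large numeric ranges can dominate the neighbor search. The cell above imputes on the raw values; the following is a minimal sketch, on made-up data, of one common variation that standardizes the features first and maps the result back to the original units. The column names and `n_neighbors=2` are assumptions chosen for the toy example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy frame with two features on very different scales (values are made up).
toy = pd.DataFrame({
    "small_scale": [0.1, 0.2, np.nan, 0.4, 0.5],
    "large_scale": [1000.0, 2000.0, 3000.0, np.nan, 5000.0],
})

# StandardScaler ignores NaNs when fitting and keeps them when transforming.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute in the standardized space so both features contribute comparably.
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Map the imputed values back to the original units.
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
    index=toy.index,
)

print(imputed.isnull().sum().sum())
```

Whether scaling helps depends on the data; it is shown here only as an option to compare against the unscaled result.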


In [19]:
from sklearn.impute import SimpleImputer

cat_imputer = SimpleImputer(strategy="most_frequent")

df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

df[cat_cols].isnull().sum().sum()


np.int64(0)

Missing values in categorical features were imputed using the most frequent
category. After imputation, no missing values remain in categorical columns.
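To make the effect of mode-based imputation concrete, here is a small sketch on a made-up frame. The column names and values are hypothetical; the only call taken from this notebook is `SimpleImputer(strategy="most_frequent")`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical frame; missing entries are np.nan, as pandas produces
# when reading empty CSV fields (values are made up).
toy = pd.DataFrame({
    "star_type": ["M", "K", np.nan, "M", np.nan],
    "method": ["Transit", np.nan, "Transit", "Radial Velocity", "Transit"],
}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
filled = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# statistics_ holds the fill value learned for each column.
print(list(imputer.statistics_))
print(filled["star_type"].tolist())
```

Every gap in a column receives the same value, so the most frequent category becomes more dominant after imputation than it was in the observed data.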


In [20]:
df.isnull().sum().sum()

np.int64(0)

## Final Validation and Conclusion

After applying appropriate imputation strategies to both numerical and
categorical features, the dataset contains **no missing values**.

Fully missing numerical columns were removed to maintain data quality, numerical
features were imputed using KNN Imputation, and categorical features were
imputed using mode-based imputation. The dataset is now ready for further
analysis or modeling.



In [21]:
# Save the cleaned dataset to the current working directory
df.to_csv("cleaned_imputed_data.csv", index=False)
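A quick way to confirm that an export like this preserves the cleaned data is to read it back and repeat the checks. The sketch below does this on a made-up frame, with an in-memory buffer standing in for the CSV file; the column names are hypothetical.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned dataset (values are made up).
toy = pd.DataFrame({
    "radius": [1.0, 2.5, 3.0],
    "type": ["rocky", "gas", "rocky"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Read the data back and repeat the shape and missing-value checks.
reloaded = pd.read_csv(buffer)
same_shape = reloaded.shape == toy.shape
no_missing = int(reloaded.isnull().sum().sum()) == 0

print(same_shape, no_missing)
```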

## Final Observations

- Fully missing numerical features were removed as they provided no meaningful
  information; columns that were only partially missing were kept and imputed.
- Numerical features were imputed using KNN Imputation, which estimates missing
  values based on similar observations.
- Categorical features were imputed using the most frequent category to maintain
  data consistency.
- After applying all preprocessing steps, the dataset contains no missing values
  and is ready for further analysis or modeling.
