## Dataset Acquisition 

The land-mine detection dataset was retrieved programmatically from the UCI Machine Learning Repository (dataset ID 763) using the ucimlrepo Python package. Fetching by ID keeps the analysis reproducible and tied to a reliable public source.

The dataset is returned with separate feature and target components:

- land_mines.data.features: input variables (Voltage, Height, Soil Type)
- land_mines.data.targets: mine type labels (5 categories)

In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch the dataset 
land_mines = fetch_ucirepo(id=763) 
  
# data (as pandas dataframes) 
X = land_mines.data.features 
y = land_mines.data.targets 
  
# metadata 
print(land_mines.metadata) 
  
# variable information 
print(land_mines.variables) 


In [None]:
import pandas as pd
from ucimlrepo import fetch_ucirepo

# Fetch dataset
land_mines = fetch_ucirepo(id=763)

# Combine features and targets
df = pd.concat([land_mines.data.features, land_mines.data.targets], axis=1)

# Save as CSV
df.to_csv("land_mines_dataset.csv", index=False)
print("Dataset saved as land_mines_dataset.csv.")


## EDA: Summary Statistics

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("land_mines_dataset.csv")

# Summary statistics
summary_stats = df.describe()
print(summary_stats)

# Save statistics
summary_stats.to_csv("eda_summary.csv")
print("EDA summary saved.")


## Missing Data, Duplicates, and Outlier Detection using IQR

In [None]:
import pandas as pd
from ucimlrepo import fetch_ucirepo

# Fetch dataset
land_mines = fetch_ucirepo(id=763)
X = land_mines.data.features
y = land_mines.data.targets

# Check for missing values
missing_values = X.isnull().sum()
print("Missing Values Per Column:\n", missing_values)

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("land_mines_dataset.csv")

# Count duplicate rows (this cell only reports them; nothing is removed)
duplicates = df.duplicated().sum()
print(f"Total Duplicates: {duplicates}")
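
The cell above only reports how many duplicate rows exist. If removing them is desired, a minimal sketch could look like the following. The small `demo` frame and its values are made up for illustration and stand in for land_mines_dataset.csv; the names `demo`, `n_duplicates`, and `deduped` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the land-mine data (made-up values)
demo = pd.DataFrame({
    "V": [0.33, 0.33, 0.41, 0.52],
    "H": [0.00, 0.00, 0.18, 0.36],
    "S": [0.0, 0.0, 0.2, 0.4],
    "M": [1, 1, 2, 3],
})

# duplicated() marks rows that are identical to an earlier row
n_duplicates = int(demo.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = demo.drop_duplicates().reset_index(drop=True)

print(f"Total Duplicates: {n_duplicates}")
print(f"Rows before: {len(demo)}, after: {len(deduped)}")
```

Whether duplicates should be dropped depends on the data: repeated measurements can be legitimate observations, so counting them first, as the cell above does, is a reasonable default.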

In [None]:
import pandas as pd
from ucimlrepo import fetch_ucirepo

# Fetch dataset
land_mines = fetch_ucirepo(id=763)

# Combine features and the target column (M) into a new DataFrame,
# leaving the fetched feature frame unmodified
df = pd.concat([land_mines.data.features, land_mines.data.targets], axis=1)

# Function to detect outliers using IQR
def detect_outliers(df):
    outliers = pd.DataFrame()
    for col in df.columns:
        if df[col].dtype in ['float64', 'int64']:  # Only for numerical columns
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            outliers_in_col = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
            outliers = pd.concat([outliers, outliers_in_col])
    
    return outliers.drop_duplicates()

# Show outliers
outlier_df = detect_outliers(df)
print("Outliers in the dataset:")
print(outlier_df)
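
To make the IQR rule used above concrete, here is a small worked sketch on a single made-up series. The helper name `iqr_bounds` and the sample values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the IQR rule for one numeric column."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values, not taken from the land-mine data
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

lower, upper = iqr_bounds(sample)

# Points outside the fences are flagged as outliers
flagged = sample[(sample < lower) | (sample > upper)]

print(f"fences: [{lower}, {upper}]")
print(flagged.tolist())
```

Here Q1 = 3 and Q3 = 7, so IQR = 4 and the fences are 3 - 1.5 * 4 = -3 and 7 + 1.5 * 4 = 13. Only the value 100 falls outside them.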


## Data Transformation

Before training machine learning models on the land-mine dataset, a targeted transformation was applied to reduce skewness, particularly in the V variable, which was strongly right-skewed.

V has a heavy right tail: many small values and a few very large ones. Skewed distributions like this can negatively affect:

- Models sensitive to variance (Logistic Regression, SVM, Neural Networks)
- Distance-based measures
- Gradient convergence
- Feature scaling consistency

To correct this, a log1p transformation was applied:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis

# Load dataset 
df = pd.read_csv("land_mines_dataset.csv")  

# Apply log transformation only to V
df['V_log'] = np.log1p(df['V'])  # log1p(x) = log(1 + x) to avoid -inf

# Recalculate skewness and kurtosis
stats_df = pd.DataFrame({
    "Feature": ["V_log", "H", "S"],
    "Skewness": [skew(df['V_log']), skew(df['H']), skew(df['S'])],
    "Kurtosis": [kurtosis(df['V_log']), kurtosis(df['H']), kurtosis(df['S'])]
})

# Save transformed dataset
df.to_csv("transformed_land_mines.csv", index=False)

# Plot original vs log-transformed V distribution
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].hist(df['V'], bins=20, color='blue', alpha=0.7)
axes[0].set_title("Original V")

axes[1].hist(df['V_log'], bins=20, color='blue', alpha=0.7)
axes[1].set_title("Log-Transformed V")

plt.tight_layout()
plt.show()

# Print skewness and kurtosis
print(stats_df)
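
The effect of log1p on a right-skewed variable can be checked in isolation. The sketch below uses a synthetic log-normal sample as a hypothetical stand-in for V (it is not the real data) and also shows that `np.expm1` inverts the transformation, which is useful if values need to be reported on the original scale.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed, non-negative sample standing in for V (assumption)
v = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

v_log = np.log1p(v)       # log(1 + v), well defined at v = 0
v_back = np.expm1(v_log)  # exp(x) - 1, the inverse of log1p

print(f"skew before: {skew(v):.3f}")
print(f"skew after:  {skew(v_log):.3f}")
print("round trip ok:", np.allclose(v, v_back))
```

Note that log1p is only defined for values greater than -1, so this approach assumes the variable is non-negative.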


## Feature Engineering and Visualization

To understand the structure of the land-mine dataset and prepare it for model training, several feature engineering and visualization steps were performed. These steps help identify patterns, detect anomalies, and understand feature relationships, which guides model selection and optimization.

1. Histograms were generated for all features except the target M and the raw column V (represented by V_log after the transformation). They help to:

   - Understand the distribution of each feature
   - Detect skewness, outliers, and unusual value concentrations
   - Identify whether transformations (scaling/normalization) may be required

2. A correlation matrix heatmap was plotted with annotated values to visualize pairwise feature relationships. It helps to:

   - Identify highly correlated features (potential redundancy)
   - Detect multicollinearity
   - Understand which features may be informative for classification

3. PCA was used to project the dataset onto two principal components. It helps to:

   - Capture maximum variance in 2D
   - Understand linear separability between land-mine classes
   - Identify feature combinations that contribute most to variability

   The PCA scatter plot helps determine how well classes can be separated using linear transformations.

4. t-SNE was used to create a non-linear 2D embedding of the dataset. It helps to:

   - Visualize local clusters and neighborhood structures
   - Detect hidden non-linear separations between classes
   - Complement PCA to show complex interactions

   t-SNE can reveal class clusters that are not visible through linear methods.

5. A consolidated boxplot was created for all numerical features. It helps to:

   - Identify feature-wise outliers
   - Understand the spread (IQR) and distribution shape
   - Detect potential measurement errors or extreme values

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("transformed_land_mines.csv")

# Exclude "M" (target) and "V" from visualizations
X = df.drop(columns=["M", "V"])

# Histogram
X.hist(figsize=(10, 8))
plt.savefig("histograms_excluding_V.png")

# Box Plot
plt.figure(figsize=(12, 6))
sns.boxplot(data=X)
plt.xticks(rotation=90)
plt.savefig("boxplot_excluding_V.png")

# Correlation Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(X.corr(), annot=True, cmap="coolwarm")
plt.savefig("correlation_heatmap_excluding_V.png")

print("Visualizations saved (excluding V).")
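
Beyond reading the heatmap by eye, highly correlated feature pairs can be listed programmatically. The sketch below is one possible approach on synthetic data; the frame `demo`, its columns, and the 0.9 threshold are illustrative assumptions rather than values from the land-mine analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic features (assumption): "b" is nearly a linear copy of "a", "c" is independent
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr().abs()
cols = list(corr.columns)
threshold = 0.9  # arbitrary illustrative cut-off

# Visit each unordered pair once (upper triangle of the matrix)
high_pairs = [
    (cols[i], cols[j], float(corr.iloc[i, j]))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]

for first, second, value in high_pairs:
    print(f"{first} ~ {second}: |r| = {value:.3f}")
```

Pairs reported this way are candidates for redundancy; whether to drop one of them is a modelling decision.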


In [None]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# Load the transformed dataset
df = pd.read_csv("transformed_land_mines.csv")
X = df.drop(columns=["M", "V"])
# Shift mine-type labels to start at 0 (assumes M is coded 1-5)
y = df["M"] - 1

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_df = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
pca_df["label"] = y

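# t-SNE (the iteration argument was renamed from n_iter to max_iter in
# scikit-learn 1.5; 1000 iterations is the default, so it is omitted here)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)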
X_tsne = tsne.fit_transform(X)
tsne_df = pd.DataFrame(X_tsne, columns=["TSNE1", "TSNE2"])
tsne_df["label"] = y

# Plotting
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.scatterplot(data=pca_df, x="PC1", y="PC2", hue="label", palette="tab10", s=60)
plt.title("PCA Projection")
plt.legend(title="Class")

plt.subplot(1, 2, 2)
sns.scatterplot(data=tsne_df, x="TSNE1", y="TSNE2", hue="label", palette="tab10", s=60)
plt.title("t-SNE Projection")
plt.legend(title="Class")

plt.tight_layout()
plt.show()
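
To quantify how much variance the two components capture and which features drive them, the fitted PCA object can be inspected directly. The sketch below uses a synthetic three-column frame whose column names mirror the notebook's X; the values are made up. It also standardizes the features first, a step added here for illustration that the cell above does not perform.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature frame (assumption, not the real data)
demo = pd.DataFrame({
    "H": rng.uniform(0, 1, 300),
    "S": rng.uniform(0, 1, 300),
})
demo["V_log"] = 0.8 * demo["H"] + rng.normal(scale=0.05, size=300)

# Standardize so that no feature dominates purely because of its scale
X_scaled = StandardScaler().fit_transform(demo)

pca = PCA(n_components=2)
pca.fit(X_scaled)

# Fraction of the total variance captured by each component
ratios = pca.explained_variance_ratio_
print("explained variance ratio:", ratios.round(3))

# Loadings: rows are components, columns are the original features
loadings = pd.DataFrame(pca.components_, columns=demo.columns, index=["PC1", "PC2"])
print(loadings.round(3))
```

Features with large absolute loadings on a component contribute most to it. A low combined explained variance would suggest the 2D scatter plot hides part of the structure.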
