# Rock vs Mine Prediction using Machine Learning

This project builds a **binary classification model** to predict whether an underwater object detected by sonar signals is a **Rock** or a **Mine**.

### Problem Type
- Binary Classification

### Dataset
- Sonar dataset
- 60 numerical features representing sonar signal strengths
- Target labels:
  - `R` → Rock
  - `M` → Mine

### Model Used
- Logistic Regression (with feature scaling)

### Tools & Libraries
- NumPy
- Pandas
- Scikit-learn


In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score


In [2]:
sonar_data = pd.read_csv('/content/sonar data.csv', header=None)


- The dataset has **no column headers**, so `header=None` is used.
- Columns:
  - `0–59` → Sonar signal features
  - `60` → Target label (`R` or `M`)


In [3]:
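# Note: a notebook cell displays only its last expression, so run these
# three statements in separate cells (or wrap them in print()) to see each result.

# Dataset shape: (rows, columns)
sonar_data.shape

# Statistical summary of the numeric feature columns
sonar_data.describe()

# Class distribution of the target column
sonar_data[60].value_counts()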

| 60 | count |
|----|-------|
| M  | 111   |
| R  | 97    |
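
The class counts above also give a useful reference point: the accuracy of a trivial classifier that always predicts the majority class. A minimal sketch of that arithmetic, using the counts from the output (the variable names are illustrative, not part of the original notebook):

```python
# Class counts taken from the value_counts() output above.
class_counts = {"M": 111, "R": 97}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_counts[majority_label] / total

print(f"Total samples: {total}")
print(f"Majority class: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any trained model should be compared against this baseline rather than against 50%.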


In [4]:
X = sonar_data.drop(columns=60)
y = sonar_data[60]

- **X** contains 60 sonar signal features
- **y** contains target labels:
  - `R` → Rock
  - `M` → Mine


In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    stratify=y,
    random_state=1
)

### Why do we split the data?

In machine learning, a model must be evaluated on **unseen data** to measure how well it generalizes.
If we train and test on the same data, the model may simply memorize patterns instead of learning them.

Therefore, we split the dataset into:
- **Training set** → used to learn patterns
- **Test set** → used only for evaluation

---

### Explanation of each parameter

**X**
- Feature matrix containing 60 sonar signal values per sample.

**y**
- Target labels:
  - `R` → Rock
  - `M` → Mine

**test_size = 0.1**
- 10% of the data is reserved for testing.
- 90% is used for training.
- With only 208 samples, this keeps most of the data for training, but it leaves just 21 test samples, so the test accuracy is a coarse estimate.

**stratify = y**
- Keeps the Rock/Mine ratio approximately the same in the training and test sets.
- Without it, a small test set could over-represent one class by chance and make the accuracy misleading.
- Especially important for classification problems with limited data.

**random_state = 1**
- Fixes the random seed used to shuffle the data before splitting.
- The same seed produces the same split on every run, which makes results reproducible.
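
The effect of `test_size` and `stratify` can be checked on a stand-in dataset. The sketch below assumes synthetic random features with the same shape and class counts as the Sonar data (208 rows, 111 `M`, 97 `R`); `X_demo`, `y_demo`, and `mine_fraction` are illustrative names, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Sonar data: 208 rows, 60 features,
# with the same class counts as the real dataset (111 M, 97 R).
rng = np.random.default_rng(0)
X_demo = rng.random((208, 60))
y_demo = np.array(["M"] * 111 + ["R"] * 97)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=1
)


def mine_fraction(labels):
    """Fraction of samples labelled 'M'."""
    return float(np.mean(labels == "M"))


print("Shapes:", X_tr.shape, X_te.shape)
print(f"Mine fraction (all):   {mine_fraction(y_demo):.3f}")
print(f"Mine fraction (train): {mine_fraction(y_tr):.3f}")
print(f"Mine fraction (test):  {mine_fraction(y_te):.3f}")
```

The three fractions should be close to one another; dropping `stratify=y_demo` lets them drift apart, which matters most when the test set is this small.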

---

### Resulting Data Shapes

- X → (208, 60)
- X_train → (187, 60)
- X_test → (21, 60)

This confirms:
- Feature count remains unchanged
- Every sample lands in exactly one of the two sets (187 + 21 = 208)


In [6]:
model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(
        solver='liblinear',
        max_iter=1000,
        random_state=1
    ))
])

### What is a Pipeline?

A Pipeline is a structured sequence of data transformations followed by a machine learning model.

It ensures:
- Correct order of preprocessing and training
- No leakage of test-set statistics into preprocessing
- Clean, maintainable, and production-ready code

---

### Step 1: StandardScaler

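- The 60 sonar features differ in range and spread.
- Logistic Regression with regularization (scikit-learn's default) is sensitive to feature scale, because the penalty treats all coefficients alike.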
- StandardScaler transforms features to:
  - Mean = 0
  - Standard deviation = 1

Formula:
(x − mean) / standard deviation

This ensures all sonar frequencies contribute fairly.
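
The formula can be verified against `StandardScaler` directly. This is a small sketch on a made-up 3×2 matrix (`X_small` is illustrative only); it assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix (3 samples, 2 features), for illustration only.
X_small = np.array([[0.1, 10.0],
                    [0.2, 20.0],
                    [0.6, 60.0]])

# Manual standardization: subtract the column mean, divide by the column
# standard deviation. NumPy's std defaults to ddof=0, matching StandardScaler.
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

scaled = StandardScaler().fit_transform(X_small)

print("Manual and StandardScaler results match:", np.allclose(manual, scaled))
```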

---

### Step 2: Logistic Regression Classifier

- Used for binary classification problems.
- Predicts probability of an object being a Mine or Rock.
- Uses a sigmoid function to map values between 0 and 1.

**solver = 'liblinear'**
- Best suited for small datasets.
- Stable for binary classification.

**max_iter = 1000**
- Raises the iteration limit (the default is 100) to give the solver room to converge.
- Helps avoid convergence warnings.

**random_state = 1**
- Seeds the data shuffling used by the `liblinear` solver, so repeated runs give the same result.
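
To make the sigmoid step concrete, the sketch below fits the same classifier configuration on synthetic binary data (from `make_classification`, an assumption used only for illustration) and checks that applying the sigmoid to the linear score reproduces `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only to illustrate the sigmoid link.
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=1)

clf = LogisticRegression(solver="liblinear", max_iter=1000, random_state=1)
clf.fit(X_syn, y_syn)

# decision_function returns the linear score (w . x + b) for each sample.
scores = clf.decision_function(X_syn[:5])

# Applying the sigmoid to that score gives the probability of clf.classes_[1].
sigmoid = 1.0 / (1.0 + np.exp(-scores))
proba = clf.predict_proba(X_syn[:5])[:, 1]

print("Sigmoid of score matches predict_proba:", np.allclose(sigmoid, proba))
```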

---

### Why Pipeline is Industry Standard

- Prevents accidental fitting on test data
- Guarantees identical preprocessing during training and prediction
- Simplifies deployment and cross-validation
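
As an illustration of the cross-validation point, a pipeline can be passed straight to `cross_val_score`. The sketch below is not part of the original notebook; it assumes synthetic data with the Sonar dataset's shape, and names such as `cv_model` and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the Sonar data (208 x 60).
X_cv, y_cv = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=1
)

cv_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(
        solver="liblinear", max_iter=1000, random_state=1
    )),
])

# The whole pipeline is refit in every fold, so the scaler only ever
# sees that fold's training portion.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_scores = cross_val_score(cv_model, X_cv, y_cv, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {cv_scores.round(3)}")
print(f"Mean accuracy:   {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

On a dataset this small, the spread across folds is often more informative than a single train/test split.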


In [7]:
model.fit(X_train, y_train)

### What does model.fit() do?

Training teaches the model the relationship:
Sonar signals → Rock or Mine

---

### Step-by-step process

1. **Scaler learns statistics**
   - Mean and standard deviation are computed ONLY from training data.

2. **Features are scaled**
   - The training features are standardized using the learned statistics.

3. **Logistic Regression optimization**
   - Starts from initial weights and bias
   - Uses cross-entropy (log loss) as the objective, plus an L2 penalty by default
   - Iteratively updates the weights to minimize that objective
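
The steps above can be sketched by performing them manually and comparing the result with the pipeline. This assumes synthetic data in place of the Sonar file; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the Sonar data's shape (208 x 60).
X_syn, y_syn = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.1, stratify=y_syn, random_state=1
)

# Pipeline version: one call performs scaling and training.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(
        solver="liblinear", max_iter=1000, random_state=1
    )),
]).fit(X_tr, y_tr)

# Manual version of the same steps.
scaler = StandardScaler().fit(X_tr)                  # 1. learn mean/std from training data only
X_tr_scaled = scaler.transform(X_tr)                 # 2. scale the training features
clf = LogisticRegression(solver="liblinear", max_iter=1000, random_state=1)
clf.fit(X_tr_scaled, y_tr)                           # 3. optimize the weights

# The pipeline's scaler holds the training-set means, and both versions
# make the same predictions on the held-out data.
same_stats = np.allclose(pipe.named_steps["scaler"].mean_, X_tr.mean(axis=0))
same_preds = np.array_equal(pipe.predict(X_te), clf.predict(scaler.transform(X_te)))
print("Scaler statistics come from training data:", same_stats)
print("Pipeline and manual predictions agree:", same_preds)
```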

---

### Important Note

- The test data is NOT used during training.
- This ensures honest evaluation later.


In [8]:
y_train_pred = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)

print(f"Training Accuracy: {train_accuracy:.4f}")

Training Accuracy: 0.9198


### What does training accuracy tell us?

Training accuracy measures how well the model learned patterns from the training data.

---

### Interpretation

- High training accuracy is expected.
- It confirms the model is capable of learning.
- However, it does NOT guarantee real-world performance.

---

### Risk

A model may:
- Perform very well on training data
- Perform poorly on unseen data (overfitting)

That is why test accuracy is essential.


In [9]:
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Test Accuracy: {test_accuracy:.4f}")

Test Accuracy: 0.7619


### Why test accuracy matters most

Test accuracy measures performance on unseen data.
This reflects how the model would behave in real-world sonar detection.

---

### Interpretation

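- High test accuracy → good generalization
- Similar training and test accuracy → stable model
- Training accuracy well above test accuracy → overfitting
- Low accuracy on both → underfitting

Here, training accuracy (0.9198) is well above test accuracy (0.7619), which points to some overfitting, although an estimate from 21 test samples is noisy.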
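
To get a feel for how coarse an accuracy measured on a 21-sample test set is, the sketch below works through the arithmetic. It uses the binomial standard error as a rough approximation, which is an assumption added here rather than something computed in the original notebook.

```python
import math

n_test = 21             # ceil(0.1 * 208) samples in the test set
test_accuracy = 0.7619  # value reported above

# Number of correct predictions implied by the reported accuracy.
n_correct = round(test_accuracy * n_test)

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
std_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# A single test sample changes the accuracy by 1 / n_test.
step = 1 / n_test

print(f"Correct predictions: {n_correct} of {n_test}")
print(f"Approximate standard error: {std_error:.3f}")
print(f"Accuracy change per sample: {step:.3f}")
```

A standard error of roughly nine percentage points means the gap between training and test accuracy should be read with caution.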

---

### Industry Perspective

In mine detection systems:
- False negatives (missing a mine) are dangerous
- Therefore, reliable generalization is critical
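
Because accuracy does not distinguish missed mines from false alarms, a confusion matrix and the recall for the `M` class are worth inspecting. The sketch below uses small hand-written label lists (hypothetical, not results from this model) to show the calls.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration; in the notebook these would be
# y_test and y_test_pred.
y_true_demo = ["M", "M", "M", "M", "R", "R", "R", "R"]
y_pred_demo = ["M", "M", "M", "R", "R", "R", "R", "M"]

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["M", "R"])

# False negatives for mines: true 'M' predicted as 'R'.
missed_mines = cm[0, 1]

# Recall for the Mine class: share of real mines that were detected.
mine_recall = recall_score(y_true_demo, y_pred_demo, pos_label="M")

print(cm)
print(f"Missed mines: {missed_mines}")
print(f"Mine recall:  {mine_recall:.2f}")
```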


In [10]:
input_data = (
    0.0307,0.0523,0.0653,0.0521,0.0611,0.0577,0.0665,0.0664,
    0.1460,0.2792,0.3877,0.4992,0.4981,0.4972,0.5607,0.7339,
    0.8230,0.9173,0.9975,0.9911,0.8240,0.6498,0.5980,0.4862,
    0.3150,0.1543,0.0989,0.0284,0.1008,0.2636,0.2694,0.2930,
    0.2925,0.3998,0.3660,0.3172,0.4609,0.4374,0.1820,0.3376,
    0.6202,0.4448,0.1863,0.1420,0.0589,0.0576,0.0672,0.0269,
    0.0245,0.0190,0.0063,0.0321,0.0189,0.0137,0.0277,0.0152,
    0.0052,0.0121,0.0124,0.0055
)

input_array = np.asarray(input_data).reshape(1, -1)
prediction = model.predict(input_array)[0]

if prediction == 'M':
    print("The object is predicted to be a MINE")
else:
    print("The object is predicted to be a ROCK")


The object is predicted to be a MINE
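
A prediction is often more useful with a probability attached. The sketch below wraps the same logic in a hypothetical helper, `classify_reading`, that checks the feature count and returns the predicted label together with its probability. It is fitted on random stand-in data here, so its outputs are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def classify_reading(fitted_model, reading, n_features=60):
    """Return (label, probability) for one sonar reading.

    Hypothetical helper: validates the feature count before predicting.
    """
    arr = np.asarray(reading, dtype=float)
    if arr.shape != (n_features,):
        raise ValueError(f"expected {n_features} values, got shape {arr.shape}")
    row = arr.reshape(1, -1)
    label = fitted_model.predict(row)[0]
    proba = fitted_model.predict_proba(row)[0]
    # predict_proba columns follow the order of fitted_model.classes_.
    class_index = list(fitted_model.classes_).index(label)
    return label, float(proba[class_index])


# Random stand-in data with the Sonar shape and class counts (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.random((208, 60))
y_demo = np.array(["M"] * 111 + ["R"] * 97)

demo_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(
        solver="liblinear", max_iter=1000, random_state=1
    )),
]).fit(X_demo, y_demo)

label, confidence = classify_reading(demo_model, X_demo[0])
print(f"Predicted {label} with probability {confidence:.2f}")
```

With the notebook's own fitted `model` and `input_data`, the call would be `classify_reading(model, input_data)`.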


### Key Takeaways

- Applied a stratified train/test split to preserve class balance
- Used Pipeline to prevent data leakage
- Built a clean, reproducible ML workflow
- Evaluated the model on unseen data
- Implemented prediction for a single new sonar reading

This project demonstrates strong fundamentals in:
- Machine Learning
- Data Preprocessing
- Model Evaluation
- Industry-level ML practices
