<a href="https://www.kaggle.com/code/mohammedmohsen0404/sonar-rock-vs-mine-prediction?scriptVersionId=188649914" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
**<center><h1>SONAR Rock Vs Mine Prediction</h1></center>**
<center><h3>Part of 100 Days 100 ML Projects Challenge</h3></center>

---




The SONAR Rock vs Mine Prediction project is a **classification machine learning problem**. It aims to build a model that accurately distinguishes metal cylinders (mines) from rocks based on SONAR return data.

# **Import Libraries and Data**
---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.metrics import classification_report , f1_score

In [None]:
data = pd.read_csv('/kaggle/input/rock-vs-mine-prediction-machine-learning/Sonar dataset.csv', header = None)
df=data.copy()

# **Take a look at the data**
---

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.iloc[:,60].value_counts()

# **Exploratory Data Analysis**
---

In [None]:
plt.hist(df.drop(df.columns[60], axis=1))
plt.show()

**Univariate Analysis**

In [None]:
numerical_data = data.select_dtypes(include='number')
numerical_data.hist(figsize=(10, 8),color = 'b')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
sns.boxplot(data=numerical_data)
plt.show()

In [None]:
categorical_data = data.select_dtypes(include='object')
for column in categorical_data.columns:
    sns.countplot(data=categorical_data, x=column, palette="Set1")
    plt.title(f"Countplot of {column}")
    plt.show()

In [None]:
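class_counts = df.iloc[:,60].value_counts()
# Take the labels from the counts index so each slice is matched to its own class
plt.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%')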
plt.show()

**Multivariate Analysis**

In [None]:
sns.pairplot(data.select_dtypes(include='number'))
plt.show()

In [None]:
sns.heatmap(numerical_data.corr(), annot=True, cmap='coolwarm')
plt.show()

# **Data Cleaning**
---

**Handling Duplicate Rows**

In [None]:
# Check for duplicate rows
duplicate_rows = df.duplicated()
# Count of duplicate rows
print(f"Number of duplicate rows: {duplicate_rows.sum()}")

**Handling Missing Data**

In [None]:
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(10)

In [None]:
total = data.isnull().sum().sum()
print('Total Null values =' ,total)

# **Data Preprocessing**
----

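Preprocessing steps that learn parameters from the data (such as scaling) should be fitted on the training set only and then applied to the test set. Fitting them on the full dataset leaks information from the test set into training and leads to overly optimistic performance estimates.
So let's split the data first.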

**Data Splitting**

In [None]:
X = df.drop(df.columns[60], axis = 1)
y = df.iloc[:,60]

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.3,random_state=101)
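
With a small dataset, a purely random split can leave the two classes in different proportions in the train and test sets. The sketch below shows how the `stratify` argument of `train_test_split` keeps the class ratio roughly equal in both parts. The arrays `X_demo` and `y_demo` and the class counts are made-up stand-ins, not the SONAR data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows, 60 features, illustrative class counts.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 60))
y_demo = np.array(['M'] * 120 + ['R'] * 80)

# stratify=y_demo asks for the same class ratio in the train and test parts.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

train_share_m = np.mean(yd_train == 'M')
test_share_m = np.mean(yd_test == 'M')
print(f"Share of 'M' in train: {train_share_m:.3f}, in test: {test_share_m:.3f}")
```

Whether stratification changes the scores on this dataset is not something this notebook measures; it is shown only as an option.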

**Encoding Categorical Variables**

In [None]:
y_train = y_train.apply(lambda x : 0 if x == 'R' else 1)
y_test = y_test.apply(lambda x : 0 if x == 'R' else 1)
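
The lambda above maps every value that is not `'R'` to 1, so a malformed label would silently be counted as a mine. A possible alternative, sketched here on a hypothetical `labels` series, is an explicit dictionary with `Series.map`, which turns unknown labels into `NaN` so they can be inspected.

```python
import pandas as pd

# Hypothetical label column; 'm' is a deliberately malformed entry.
labels = pd.Series(['R', 'M', 'M', 'm', 'R'])

label_map = {'R': 0, 'M': 1}
encoded = labels.map(label_map)  # labels missing from label_map become NaN

# Collect the raw values that could not be encoded.
unexpected = labels[encoded.isna()]
print("Unexpected labels:", unexpected.tolist())
```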

**Dealing with Outliers**

In [None]:
# Boxplot
plt.figure(figsize=(20, 10))
plt.boxplot(X_train)
plt.title('Boxplot for Outlier Detection')
plt.show()

In [None]:
Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
IQR = Q3 - Q1
outliers = X_train[((X_train < (Q1 - 1.5 * IQR)) | (X_train > (Q3 + 1.5 * IQR))).any(axis=1)]

print("Outliers using IQR method:")
print(outliers)
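
To make the IQR rule used above easier to follow, here is a minimal sketch on a tiny made-up DataFrame (`df_demo` is hypothetical). It counts outliers per column and the number of rows flagged by `any(axis=1)`, the same row-wise test as in the cell above.

```python
import pandas as pd

# Hypothetical two-column example; each column contains one extreme value.
df_demo = pd.DataFrame({
    'a': [0.10, 0.12, 0.11, 0.13, 0.12, 0.95],
    'b': [0.50, 0.52, 0.01, 0.51, 0.49, 0.50],
})

q1 = df_demo.quantile(0.25)
q3 = df_demo.quantile(0.75)
iqr = q3 - q1

# A value is an outlier if it lies more than 1.5 * IQR outside the quartiles.
mask = (df_demo < (q1 - 1.5 * iqr)) | (df_demo > (q3 + 1.5 * iqr))

per_column = mask.sum()                # number of outliers in each column
rows_flagged = int(mask.any(axis=1).sum())  # rows with an outlier in any column
print(per_column)
print("Rows flagged:", rows_flagged)
```

Because a row is flagged as soon as any single column is out of range, the number of flagged rows tends to grow with the number of features.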

In [None]:
# log(1 + x) compresses large values, which reduces the influence of outliers
X_train = np.log1p(X_train)
X_test = np.log1p(X_test)

**Data Normalization**

In [None]:
scaler = StandardScaler()
# Fit the scaler on the training data only, then apply the same transform to the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
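
The fit-on-train, transform-on-test pattern used in the preprocessing steps above can also be wrapped in a scikit-learn `Pipeline`, which refits each step on the training portion of every split. The following is a rough sketch under that assumption, not the method this notebook uses. `X_demo` and `y_demo` are random stand-in data, and `LogisticRegression` is just one example of a final estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical stand-in data: non-negative features and a binary target.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 60))
y_demo = rng.integers(0, 2, size=200)

pipe = Pipeline([
    ('log1p', FunctionTransformer(np.log1p)),   # same log(1 + x) transform
    ('scale', StandardScaler()),                # fitted on training folds only
    ('clf', LogisticRegression(max_iter=1000)),
])

# Each of the 5 folds refits the whole pipeline, so no fold sees test statistics.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='f1_weighted')
print(f"Mean weighted F1 over 5 folds: {scores.mean():.2f}")
```

Since the stand-in labels are random, the printed score carries no meaning; only the structure is of interest.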

# **Modeling**
---

In [None]:
classifiers = [
    ('Logistic Regression', LogisticRegression(random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42)),
    ('K-Nearest Neighbors', KNeighborsClassifier()),
    ('Support Vector Machine', SVC(random_state=42)),
    ('xgboost', xgb.XGBClassifier(tree_method="hist")),
]

In [None]:
for clf_name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='weighted')  # Weighted average: per-class F1 scores weighted by class support
    print(f'{clf_name}: F1 Score = {f1:.2f}')
    print(f'{clf_name} Classification Report:\n{classification_report(y_test, y_pred)}')
    print('---------------------------------------------------')
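
Since this is a binary problem, it may help to see how `average='weighted'` differs from the default binary F1 in `f1_score`. The sketch below uses small hand-made label lists (`y_true` and `y_hat` are hypothetical, not model output from this notebook).

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions (0 = rock, 1 = mine).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 1, 0]

# Default: F1 of the positive class (label 1) only.
f1_binary = f1_score(y_true, y_hat)

# Weighted: F1 of each class, averaged with weights equal to class support.
f1_weighted = f1_score(y_true, y_hat, average='weighted')

print(f"Binary F1 (class 1): {f1_binary:.2f}")
print(f"Weighted F1: {f1_weighted:.2f}")
```

The weighted score takes both classes into account, while the binary score reports only how well mines are detected.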