In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the tsunami events dataset from the Kaggle input directory
df = pd.read_csv('/kaggle/input/tsunami-events-dataset-1900-present/tsunamis-2023-09-11_22-13-51_0530 (2).csv')

# Extract features and target variable
features = ['Year', 'Mo', 'Dy', 'Hr', 'Mn', 'Sec', 'TsunamiNanEventNanValidity',
            'TsunamiNanCauseNanCode', 'EarthquakeNanMagnitude']
X = df[features]
y = df['TotalNanInjuriesNanDescription'].notna().astype(int)  # Binary classification based on 'TotalNanInjuriesNanDescription'

# Replace NaN values with 0 for features
X = X.fillna(0)

# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Choose the number of neighbors (you can adjust this based on your analysis)
k_val = 5

# Create a KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=k_val)
knn_classifier.fit(X_train, Y_train)

# Make predictions on the test set
y_pred = knn_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(Y_test, y_pred)
print("Accuracy:", accuracy)

# Try different values of k; loop-local names keep the k=5 model and y_pred above intact
accuracy_vals = []
for k in range(1, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, Y_train)
    accuracy_vals.append(accuracy_score(Y_test, model.predict(X_test)))

# Plot accuracy vs. k value
plt.plot(range(1, 15), accuracy_vals, color='blue', marker='x', linestyle='dashed')
plt.xlabel("K Value")
plt.ylabel("Accuracy")
plt.title("KNN Accuracy vs. K Value")
plt.show()

# Visualize predictions on a random sample of the test set
sample_size = min(100, len(y_pred))  # guard against a test set smaller than 100 rows
random_indices = np.random.choice(len(y_pred), sample_size, replace=False)
sample = X_test.iloc[random_indices]
sample_pred = y_pred[random_indices]

plt.figure(figsize=(10, 6))
# One scatter call per class so each legend entry is bound to the right color
for label, color, name in [(0, 'orange', 'No Injuries'), (1, 'blue', 'Injuries')]:
    mask = sample_pred == label
    plt.scatter(sample.loc[mask, 'Year'], sample.loc[mask, 'EarthquakeNanMagnitude'],
                color=color, label=name)

plt.xlabel('Year')
plt.ylabel('Earthquake Magnitude')
plt.title('KNN Predictions for Tsunami Events (Sampled)')
plt.legend(loc='upper right')
plt.show()

This Python script uses scikit-learn to train a K-Nearest Neighbors (KNN) classifier on a tsunami events dataset. The goal is to predict whether an event has a recorded injury description, using nine features: year, month, day, hour, minute, second, tsunami event validity, tsunami cause code, and earthquake magnitude.

The script starts by importing the necessary libraries, then loads the tsunami events CSV into a pandas DataFrame and extracts the features and the target. The target is derived from the 'TotalNanInjuriesNanDescription' column and made binary: 1 if the value is present (treated as injuries reported) and 0 if it is NaN (treated as no injuries).
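To make the target encoding concrete, here is a minimal sketch on a toy DataFrame. The row values are made up for illustration; only the column name comes from the script.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values); only the column name matches the script.
toy = pd.DataFrame({
    "TotalNanInjuriesNanDescription": [1.0, np.nan, 3.0, np.nan],
})

# notna() is True where a description is recorded; astype(int) maps True/False to 1/0.
toy_target = toy["TotalNanInjuriesNanDescription"].notna().astype(int)
print(toy_target.tolist())  # [1, 0, 1, 0]
```

Note that this encoding treats "no description recorded" as "no injuries", which is an assumption about how the dataset was compiled rather than something the column itself guarantees.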

Any NaN values in the features are replaced with 0 using the DataFrame's `fillna(0)` method. The dataset is then split into a training set (70%) and a test set (30%) with scikit-learn's `train_test_split`; passing `random_state=42` makes the split reproducible across runs.
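The fill-and-split step can be sketched on a small hypothetical table. The values below are invented and only two of the nine feature columns are used, so this illustrates the mechanics rather than the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature table with two missing magnitudes.
toy_X = pd.DataFrame({
    "Year": [1900, 1950, 2000, 2010, 2011, 2015, 2018, 2020, 2021, 2022],
    "EarthquakeNanMagnitude": [7.1, np.nan, 8.0, 6.5, 9.1, np.nan, 7.5, 6.9, 7.3, 7.0],
})
toy_y = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Missing magnitudes become 0, as in the script.
toy_X = toy_X.fillna(0)

# Same split parameters as the script: 30% held out, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)
print(len(X_tr), len(X_te))
```

Filling with 0 is simple, but it places events with an unknown magnitude at magnitude 0, which KNN will treat as a real value when computing distances.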

A KNN classifier is created with a specified number of neighbors (`k_val = 5` in this case). The classifier is fit on the training data and then used to predict labels for the test set. `accuracy_score` reports the fraction of test events whose predicted label matches the actual one.
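KNN classifies by distance, so a feature with a large numeric range (such as a year) can outweigh one with a small range (such as a magnitude). The script itself does not scale its features. The sketch below shows one optional variant, assuming standardization is wanted, using synthetic stand-in data rather than the tsunami dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the tsunami dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then applies the same scaling before every prediction.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

demo_pred = scaled_knn.predict(X_te)
demo_accuracy = scaled_knn.score(X_te, y_te)  # mean accuracy on the held-out set
print("Accuracy:", demo_accuracy)
```

Whether scaling helps on the real data would need to be checked by comparing accuracies with and without it.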

The script then tries different values of k (from 1 to 14) and stores the corresponding accuracy values. These accuracy values are plotted against the k values to visualize how the choice of k affects the model's accuracy.
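Beyond plotting, the sweep can also report which k scored highest. The following is a minimal sketch on synthetic stand-in data; the names `k_values`, `scores`, and `best_k` are illustrative and do not appear in the script.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the tsunami dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Same range as the script: k = 1 .. 14.
k_values = list(range(1, 15))
scores = [
    KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    for k in k_values
]

# np.argmax returns the first index of the maximum, so ties go to the smallest k.
best_k = k_values[int(np.argmax(scores))]
print("Best k:", best_k, "accuracy:", max(scores))
```

Choosing k by test-set accuracy tends to make the reported accuracy optimistic, since the test set has influenced the model choice. Cross-validation on the training set (for example with `sklearn.model_selection.cross_val_score`) is a common alternative.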

Finally, the script draws a scatter plot for a random sample of 100 test-set events, using the predictions held in `y_pred` at that point. Year is on the x-axis, earthquake magnitude is on the y-axis, and color marks the predicted class (orange for no injuries, blue for injuries). Because missing feature values were filled with 0, events without a recorded magnitude appear at magnitude 0. The plot shows how the predictions are distributed across these two features; it does not show the decision boundary over all nine features.