Import libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt       # matplotlib.pyplot plots data
%matplotlib inline
import seaborn as sns

Basic Exploratory Data Analysis (EDA)

Load and review data

In [None]:
train_fdata = pd.read_csv("/kaggle/input/playground-series-s4e5/train.csv")

In [None]:
test_fdata = pd.read_csv('/kaggle/input/playground-series-s4e5/test.csv')

In [None]:
train_fdata.shape # Check number of columns and rows in data frame

In [None]:
train_fdata.head() # To check first 5 rows of data set

In [None]:
# Check for null values
train_fdata.isnull().sum()

In [None]:
train_fdata.info()

In [None]:
# Summary statistics
train_fdata.describe()

In [None]:
train_fdata.corr() # It will show correlation matrix

Enlarge the heatmap and adjust the annotation font size so the correlation values are easier to read:

- fmt='.2f': limits the annotations to two decimal places, reducing visual clutter.
- annot_kws={"size": 10}: sets the annotation font size.
- plt.xticks(rotation=45, ha='right'): rotates the x-axis labels for better readability.

In [None]:
# Correlation heatmap with larger figure size and bigger font size for annotations
plt.figure(figsize=(16,12))
sns.heatmap(train_fdata.corr(), annot=True, fmt='.2f', cmap='coolwarm', annot_kws={"size": 10})
plt.title('Correlation Heatmap - Train Data', fontsize=16)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)
plt.show()


## Logistic Regression Model

Prepare the Training Data

Separate the features (independent variables) from the target (FloodProbability). The id column is dropped as well, so it is not used as a feature.

In [None]:
# Separate features (X) and target (y)
X_train = train_fdata.drop(columns=['id', 'FloodProbability'])
y_train = train_fdata['FloodProbability']

Prepare the Test Data

The test dataset has the same feature columns but no target (FloodProbability) column, so only the id column needs to be dropped.

In [None]:
# Prepare the test data (drop 'id' column)
X_test = test_fdata.drop(columns=['id'])

Train the Logistic Regression Model
- Now, fit the logistic regression model using the training dataset.

In [None]:
from sklearn.linear_model import LogisticRegression

LogisticRegression is a classifier, so it expects a binary (or categorical) target. Here y_train (FloodProbability) is continuous (floating-point values), and scikit-learn rejects a continuous target when fitting a classifier.
- Logistic regression is used for classification tasks, not for continuous target variables.

To fix this, convert FloodProbability into binary labels (0 for non-flood, 1 for flood) by applying a threshold.
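
As a quick illustration of why the conversion is needed, the sketch below fits LogisticRegression on a tiny made-up dataset, first with a continuous target and then with the thresholded version. The data and the helper name fit_error are assumptions for this example only; they are not part of the notebook.

```python
from sklearn.linear_model import LogisticRegression

# Made-up toy data: one feature, continuous target in [0, 1].
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_continuous = [0.10, 0.40, 0.60, 0.90]


def fit_error(X, y):
    """Return the error message raised by fit, or None if fit succeeds."""
    try:
        LogisticRegression().fit(X, y)
    except ValueError as exc:
        return str(exc)
    return None


# A continuous target is rejected by the classifier ...
print(fit_error(X_toy, y_continuous))

# ... while the thresholded (binary) version of the same target is accepted.
y_binary = [int(v >= 0.5) for v in y_continuous]
print(fit_error(X_toy, y_binary))
```

The first call prints the error message from scikit-learn; the second prints None because the binary labels are a valid classification target.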

Convert Continuous Target to Binary
- Choose a threshold on FloodProbability: values at or above it become 1 (flood), values below it become 0 (non-flood).
- A threshold of 0.5 is used here; another value may suit the data better.

In [None]:
# Convert the continuous target y_train into binary labels (0 or 1)
# based on the FloodProbability threshold.
threshold = 0.5
y_train_binary = (y_train >= threshold).astype(int)
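
The choice of threshold also decides how many rows end up in each class. The sketch below shows that effect on synthetic values; the Gaussian stand-in data and the helper positive_share are assumptions for illustration and do not reproduce the competition data.

```python
import random


def positive_share(values, threshold):
    """Fraction of values that would be labeled 1 at the given cutoff."""
    labels = [int(v >= threshold) for v in values]
    return sum(labels) / len(labels)


# Synthetic stand-in for FloodProbability (not the competition data).
rng = random.Random(0)
synthetic_probs = [min(max(rng.gauss(0.5, 0.05), 0.0), 1.0) for _ in range(10_000)]

for cutoff in (0.4, 0.5, 0.6):
    share = positive_share(synthetic_probs, cutoff)
    print(f"cutoff={cutoff:.1f}  positive share={share:.3f}")
```

On the notebook's own data, the equivalent check would be (y_train >= threshold).mean(), which gives the share of rows labeled 1.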

In [None]:
# Initialize and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train_binary)

Predict on Test Data
- predict_proba returns one probability per class; column 1 is the predicted probability of class 1 (flood). Applying the same threshold to that column classifies each row as flood or non-flood.

In [None]:
# Predict probabilities on test data
y_test_pred_prob = model.predict_proba(X_test)[:,1]

# Classify based on the same threshold
y_test_pred = (y_test_pred_prob >= threshold).astype(int)
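
To make the [:,1] indexing concrete, the sketch below fits a model on made-up data and inspects predict_proba. The toy data and variable names are assumptions for this example; only LogisticRegression, classes_, predict_proba and predict are scikit-learn names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data: the label is 1 when the single feature is positive.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)  # shape (n_samples, n_classes)
print(clf.classes_)               # column order of predict_proba
print(proba.shape)

# Each row sums to 1, and column 1 belongs to clf.classes_[1].
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# For this binary model, predict() agrees with thresholding column 1 at 0.5.
manual_pred = (proba[:, 1] >= 0.5).astype(int)
matches_predict = np.array_equal(manual_pred, clf.predict(X_toy))
print(rows_sum_to_one, matches_predict)
```

Selecting column 1 therefore gives the probability of the class stored in classes_[1], which is the flood class when the labels are 0 and 1.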

Evaluate the Model (On the Training Set)

Now that the logistic regression model has been trained on the binary target derived from FloodProbability, evaluate its performance with the metrics below. Because they are computed on the training data, they can overstate performance on unseen data.

- Accuracy: The share of instances that are predicted correctly.
- Confusion Matrix: The counts of actual vs predicted classes (true negatives, false positives, false negatives, true positives).
- ROC-AUC Score: The area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the decision threshold varies. A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Predict on the training set
y_train_pred = model.predict(X_train)

# Accuracy on the training set
accuracy = accuracy_score(y_train_binary, y_train_pred)
print(f"Training Accuracy: {accuracy:.4f}")

# Confusion matrix
cm = confusion_matrix(y_train_binary, y_train_pred)
print("Confusion Matrix:")
print(cm)

# ROC-AUC score
roc_auc = roc_auc_score(y_train_binary, model.predict_proba(X_train)[:,1])
print(f"ROC-AUC Score: {roc_auc:.4f}")
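
The accuracy and confusion matrix reported above can be reproduced by hand from labels and predictions. The sketch below does so on made-up values; the helper confusion_counts and the toy lists are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (0 = non-flood, 1 = flood)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


# Made-up labels and predictions, for illustration only.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)
accuracy_toy = (tn + tp) / (tn + fp + fn + tp)

# Same layout as sklearn's confusion_matrix: rows = actual, columns = predicted.
print([[tn, fp], [fn, tp]])
print(f"accuracy = {accuracy_toy:.3f}")
```

Reading the printed matrix row by row gives TN and FP for the actual non-flood rows, then FN and TP for the actual flood rows.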

Confusion Matrix (Training Set) Visualization

- Visualizing the confusion matrix helps you understand how well the model distinguishes between floods and non-floods in the training dataset.

In [None]:
# Plot confusion matrix
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title('Confusion Matrix - Training Data')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

ROC Curve (Training Set)
- The ROC curve visualizes the trade-off between sensitivity and specificity, giving insight into the model's ability to classify flood vs non-flood cases.

In [None]:
from sklearn.metrics import roc_curve

# ROC Curve
fpr, tpr, _ = roc_curve(y_train_binary, model.predict_proba(X_train)[:,1])

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Training Data')
plt.legend(loc='lower right')
plt.show()
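
The AUC shown in the legend has a direct interpretation: it equals the probability that a randomly chosen flood row receives a higher score than a randomly chosen non-flood row. The sketch below computes it that way on four made-up points; pairwise_auc is a hypothetical helper, and its pairwise loop is far too slow for the full training set.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true_toy, scores_toy))  # 3 of 4 pairs are ordered correctly
```

Here one positive (0.35) is scored below one negative (0.4), so three of the four pairs are ranked correctly and the AUC is 0.75.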

Predict on Test Data and Save the Results
- The test dataset has no ground-truth labels, so performance cannot be measured on it. The model can still predict probabilities and classes for the test rows, which are saved to a CSV file.

In [None]:
# Predict probabilities on the test dataset
y_test_pred_prob = model.predict_proba(X_test)[:,1]

# Apply the same threshold (0.5 in this case)
threshold = 0.5
y_test_pred = (y_test_pred_prob >= threshold).astype(int)

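# Save the predictions to a CSV file.
# Note: the 'FloodProbability' column here is the predicted probability of class 1
# (original FloodProbability >= threshold), not an estimate of the continuous value.
submission = pd.DataFrame({'id': test_fdata['id'], 'FloodProbability': y_test_pred_prob, 'FloodClass': y_test_pred})
submission.to_csv('flood_predictions.csv', index=False)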

------

# Retraining the model

### Using Different Threshold

A classifier's decision threshold can be tuned to the specific objectives and the nature of the problem.
- The default threshold for logistic regression is 0.5, meaning that if the predicted probability of the positive class (e.g., flood) is greater than or equal to 0.5, the model will classify it as positive, otherwise negative.
- This default is not always the most suitable, for example with imbalanced datasets or when a metric other than accuracy matters most.

The cells below change a different threshold: the cutoff used to binarize FloodProbability is raised from 0.5 to 0.6. This redefines which rows count as floods, so the model is retrained on new labels and solves a different classification problem than before.
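
The decision threshold described above is applied to predicted probabilities and can be varied without retraining. The sketch below shows how recall and precision move as it changes; the labels, probabilities and helper names are made up for illustration.

```python
def classify(probabilities, decision_threshold):
    """Label a row 1 when its predicted probability reaches the threshold."""
    return [int(p >= decision_threshold) for p in probabilities]


def recall_and_precision(y_true, y_pred):
    """Return (recall, precision) for binary labels; 0.0 when undefined."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up labels and predicted probabilities, for illustration only.
y_true_toy = [0, 0, 0, 1, 0, 1, 1, 1]
probs_toy = [0.05, 0.20, 0.45, 0.40, 0.55, 0.65, 0.80, 0.95]

for t in (0.3, 0.5, 0.7):
    r, p = recall_and_precision(y_true_toy, classify(probs_toy, t))
    print(f"decision threshold={t:.1f}  recall={r:.2f}  precision={p:.2f}")
```

A lower decision threshold catches more of the positive rows at the cost of more false alarms, and a higher one does the opposite; the fitted model itself is unchanged in both cases.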

In [None]:
# Convert the continuous target y_train into binary labels (0 or 1)
# based on the new FloodProbability threshold.
threshold = 0.6
y_train_binary = (y_train >= threshold).astype(int)

In [None]:
# Initialize and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train_binary)

In [None]:
# Predict probabilities on test data
y_test_pred_prob = model.predict_proba(X_test)[:,1]

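# Classify the predicted probabilities.
# Note: `threshold` is the cutoff that defined the labels (0.6 here), reused as the
# decision threshold; model.predict() in the evaluation cell uses 0.5 instead.
y_test_pred = (y_test_pred_prob >= threshold).astype(int)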

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Predict on the training set
y_train_pred = model.predict(X_train)

# Accuracy on the training set
accuracy = accuracy_score(y_train_binary, y_train_pred)
print(f"Training Accuracy: {accuracy:.4f}")

# Confusion matrix
cm = confusion_matrix(y_train_binary, y_train_pred)
print("Confusion Matrix:")
print(cm)

# ROC-AUC score
roc_auc = roc_auc_score(y_train_binary, model.predict_proba(X_train)[:,1])
print(f"ROC-AUC Score: {roc_auc:.4f}")

In [None]:
# Plot confusion matrix
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Reds", cbar=False)
plt.title('Confusion Matrix - Training Data')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

In [None]:
from sklearn.metrics import roc_curve

# ROC Curve
fpr, tpr, _ = roc_curve(y_train_binary, model.predict_proba(X_train)[:,1])

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Training Data')
plt.legend(loc='lower right')
plt.show()

# Comparison of Model Performance


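## 1. Training Accuracy:
- Old accuracy (label threshold 0.5): 0.8483
- New accuracy (label threshold 0.6): 0.9885

Accuracy rose from 84.83% to 98.85%, but the two figures are measured against different labels. With the 0.6 threshold only 39,075 of the 1,117,957 training rows (about 3.5%) are labeled as floods, so a model that always predicts non-flood would already reach about 96.5% accuracy. The higher accuracy therefore reflects the class imbalance as well as the model's fit on the training set.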

## 2. Confusion Matrix:
- Old confusion matrix:
  - True Negatives (TN): 435,333
  - False Positives (FP): 72,745
  - False Negatives (FN): 96,864
  - True Positives (TP): 513,015
- New confusion matrix:
  - True Negatives (TN): 1,075,239
  - False Positives (FP): 3,643
  - False Negatives (FN): 9,239
  - True Positives (TP): 29,836

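#### Changes:

The raw counts are not directly comparable, because raising the label threshold to 0.6 moved most rows into the non-flood class: actual floods (TP + FN) fell from 609,879 to 39,075, while actual non-floods (TN + FP) rose from 508,078 to 1,078,882.

- True Negatives (TN) increased mainly because there are far more non-flood rows to classify correctly.
- False Positives (FP) decreased from 72,745 to 3,643, and precision, TP / (TP + FP), rose slightly from about 87.6% to about 89.1%.
- False Negatives (FN) decreased from 96,864 to 9,239 in absolute terms, but recall, TP / (TP + FN), dropped from about 84.1% to about 76.4%, so the new model misses a larger share of the rows labeled as floods.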
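
To compare the two confusion matrices on a common footing, rates can be derived from the reported counts. The sketch below does this with a hypothetical helper, summarize; the counts are copied from the matrices above and nothing else is assumed about the data.

```python
def summarize(tn, fp, fn, tp):
    """Derive rates from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "positive_share": (tp + fn) / total,
        # Accuracy of always predicting the larger class.
        "majority_baseline": max(tp + fn, tn + fp) / total,
    }


# Counts copied from the two confusion matrices reported above.
old = summarize(tn=435_333, fp=72_745, fn=96_864, tp=513_015)
new = summarize(tn=1_075_239, fp=3_643, fn=9_239, tp=29_836)

for name in old:
    print(f"{name:>17}: old={old[name]:.4f}  new={new[name]:.4f}")
```

Precision, recall and the majority-class baseline put the accuracy figures in context, since the share of rows labeled as floods differs between the two runs.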

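## 3. ROC-AUC Score:
- Old ROC-AUC: 0.9258
- New ROC-AUC: 0.9801

The ROC-AUC score rose from 0.9258 to 0.9801, so under the 0.6 label threshold the model ranks rows labeled as floods above non-flood rows more reliably across all decision thresholds. Both scores are computed on the training data and against different label definitions, so the increase does not by itself show better performance on the same task.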