# Objective
To check whether LLMs can detect data leaks in our code.

# Approach
Create sample code that fits a logistic regression on the Titanic dataset. Intentionally include a data leak in the feature-scaling step, i.e., fit the scaler on the entire dataset at once. The correct approach is to first split the data into training and test sets, then fit the scaler on the training set only and use it to transform both sets.
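To see why scaling before splitting counts as a leak, here is a minimal standard-library sketch. The toy `values` list is a hypothetical stand-in for a numeric column such as `Fare`, and the 80/20 slice is an assumed stand-in for `train_test_split`; neither comes from the experiment itself. It only shows that statistics fitted on all rows differ from statistics fitted on the training rows, which means the held-out rows have influenced the transformation.

```python
import random
import statistics

# Hypothetical toy feature standing in for a column such as 'Fare'.
random.seed(0)
values = [random.gauss(30.0, 10.0) for _ in range(100)]

# Hold out the last 20 rows as a stand-in for the test set.
train, test = values[:80], values[80:]

# Leaky: statistics computed on all rows, including the held-out ones.
leaky_mean = statistics.fmean(values)
leaky_std = statistics.pstdev(values)

# Leak-free: statistics computed on the training rows only.
clean_mean = statistics.fmean(train)
clean_std = statistics.pstdev(train)


def standardize(xs, mean, std):
    """Apply z-score scaling using already-fitted statistics."""
    return [(x - mean) / std for x in xs]


test_leaky = standardize(test, leaky_mean, leaky_std)
test_clean = standardize(test, clean_mean, clean_std)

print(f"mean fitted on all rows:   {leaky_mean:.3f}")
print(f"mean fitted on train only: {clean_mean:.3f}")
```

The two means differ because the first one has already "seen" the test rows. On a dataset of this size the effect on accuracy is probably small, but the evaluation is no longer a clean estimate of performance on unseen data.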

The following prompt was given to both OpenAI ChatGPT 3.5 and Microsoft Copilot:

"review the code for any data leaks"

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Load the Titanic dataset
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
titanic_data = pd.read_csv(url)

# Data preprocessing
titanic_data = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']]
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})
titanic_data = titanic_data.dropna()

# Split the data into features and target variable
X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

# Scale the features. This intentionally introduces a data leak: the scaler is fit on all rows, including the future test rows.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the scaled data into training and testing sets. The correct approach is to split first, then fit the scaler on the training set only and use it to transform both sets.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Create a logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the results
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{classification_rep}")


Note that the comments above are only for the reader's understanding. The code in the prompt contained no comments at all: I wanted to see whether the AI could work out the objective of the code on its own and then identify the data leak.
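For anyone repeating the experiment, the comments can be stripped programmatically instead of by hand. The sketch below is one possible way to do it with the standard-library `tokenize` module; the helper name `strip_comments` and the `sample` string are hypothetical, and this is not necessarily how the prompt above was prepared.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Return `source` without `#` comments; string literals are left untouched."""
    lines = source.split("\n")
    comment_only = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.COMMENT:
            continue
        row, col = tok.start  # row is 1-based, col is 0-based
        before = lines[row - 1][:col].rstrip()
        if before:
            # Trailing comment: keep only the code in front of it.
            lines[row - 1] = before
        else:
            # The line held nothing but a comment: drop it entirely.
            comment_only.add(row - 1)
    return "\n".join(line for i, line in enumerate(lines) if i not in comment_only)


sample = (
    "# Scale the features\n"
    "scaler = StandardScaler()  # leaks test statistics\n"
    "label = '# not a comment'\n"
)
print(strip_comments(sample))
```

Using the tokenizer rather than a regular expression means a `#` inside a string literal is not mistaken for a comment.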

## ChatGPT Response

The provided code looks well-structured, and I don't see any explicit data leakage. However, to ensure there is no unintentional data leakage, let's go through the code:

1. **Data Loading:**
   - The code loads the data directly from an external URL. This is generally fine as long as the data is publicly available and consistent over time. However, it's worth noting that if the external data source changes or is updated, it might affect the results of your analysis.

2. **Feature Selection:**
   - The features selected for training the model (`'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived'`) seem appropriate for a basic analysis of Titanic survival. Ensure that there are no features used in the model that would not be available at the time of prediction.

3. **Data Transformation:**
   - The 'Sex' column is transformed to numeric values, which is a common practice. However, if you later plan to deploy this model and use it in a real-world scenario, ensure that the encoding is consistent with what you expect when new data is provided.

4. **Handling Missing Values:**
   - Missing values are dropped from the dataset using `titanic_data = titanic_data.dropna()`. It's essential to be aware of the impact of this decision. If there are missing values in the test set during deployment, you'll need to handle them in a way consistent with what you did during training.

5. **Scaling:**
   - Features are scaled using `StandardScaler()`, which is a good practice for many machine learning models. Just ensure that the scaler is fit on the training data and applied consistently to both training and test sets to prevent data leakage.

6. **Train-Test Split:**
   - The train-test split is performed using `train_test_split` with a specified random seed (`random_state=42`). This ensures reproducibility, which is good. However, always be cautious with the choice of the random seed, and consider whether the chosen seed might affect the results.

7. **Model Training and Evaluation:**
   - The logistic regression model is trained on the training set and evaluated on the test set. There doesn't seem to be any explicit data leakage in this process.

In summary, while the code appears well-written and doesn't contain obvious data leakage, it's always a good practice to be mindful of the data sources, transformations, and preprocessing steps to ensure that your model is trained and evaluated correctly. Consider documenting your data processing steps and testing the model with new, unseen data to ensure its generalizability.

-------------------------------------------------------------------------------------------------------------------------------------------------
### Surprising!
It is surprising that the response explicitly says "ensure that the scaler is fit on the training data and applied consistently to both training and test sets to prevent data leakage" yet still fails to spot that exact issue in our code. In my opinion, this should not happen.

So I followed up with "there is data leak try again".

---------------------------------------------------------------------------------------------------------------------------------------------------



Apologies for any confusion in my previous responses. I've reviewed the code again, and I see a potential issue with data leakage in the following line:

```python
X_scaled = scaler.fit_transform(X)
```

The `fit_transform` method is used, which means the `StandardScaler` is fit on the entire dataset `X` and then applied to transform it. This can lead to data leakage because the scaler is supposed to be fitted only on the training data to avoid introducing information from the test set into the training process.

To address this potential data leakage, you should fit the scaler on the training set and then use the same scaler to transform both the training and test sets. Here's how you can modify that part of the code:

```python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

This modification ensures that the scaler is fit only on the training data (`X_train`) and then applied to both the training and test sets (`X_test`). This way, information from the test set doesn't influence the scaling process during training, eliminating the risk of data leakage.

-------------------------------------------------------------------------------------------------------------------------------------------------
This time it immediately identified the data leak and also suggested a fix.
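A complementary way to avoid this class of leak, not something either model suggested, is to wrap the scaler and the classifier in a scikit-learn pipeline so that the scaler can only ever be fit on the data passed to `fit`. The sketch below is illustrative only: it uses synthetic data from `make_classification` as an assumed stand-in for the Titanic features, so the printed numbers say nothing about the original experiment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (an assumption, to avoid a download).
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# Split first, on the unscaled features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler only on whatever data is passed to fit().
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# During cross-validation the scaler is refit on each training fold.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

print(f"Test accuracy: {test_accuracy:.3f}")
print(f"Mean CV accuracy: {cv_scores.mean():.3f}")
```

Because the split happens before the pipeline sees any data, there is no step at which test rows could contribute to the scaler's mean or variance.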

## Copilot Response

Your code appears to be well-structured and free from obvious data leaks. Let's break down the key steps:

1. **Data Preparation**:
   - You've loaded the Titanic dataset from Stanford's archive.
   - You've selected relevant features: 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', and 'Survived'.
   - You've mapped the 'Sex' column to numerical values (0 for male, 1 for female).
   - You've dropped rows with missing values (NaN) from the dataset.

2. **Feature Scaling**:
   - You've standardized the features using `StandardScaler`.

3. **Train-Test Split**:
   - You've split the data into training and testing sets.

4. **Logistic Regression Model**:
   - You've created a logistic regression model using `LogisticRegression`.
   - The model has been trained on the training data.

5. **Model Evaluation**:
   - You've made predictions on the test set using `predict`.
   - The accuracy of the model on the test data is printed.
   - The confusion matrix and classification report provide additional evaluation metrics.

Remember to continue testing your model on unseen data and consider cross-validation to assess its generalization performance. Great job! 🚀👩‍💻👨‍💻 .

-------------------------------------------------------------------------------------------------------------------------------------------------


Copilot made no mention of data leakage at all. So here too I followed up with "there is data leak try again".

-------------------------------------------------------------------------------------------------------------------------------------------------


I apologize for the oversight. Thank you for pointing it out. Upon reviewing the code again, I've identified the data leak. The issue lies in the **feature scaling** step. Specifically, the `StandardScaler` should be fit on the **training data only**, and then applied to both the training and test data. Here's the corrected version of your code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
titanic_data = pd.read_csv(url)

titanic_data = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']]
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})
titanic_data = titanic_data.dropna()

X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

# Corrected: Fit the scaler on the training data only
scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{classification_rep}")
```

This corrected version ensures that the scaling is applied consistently to both the training and test data, preventing any data leakage. Thank you for your attention, and happy coding! 🚀👩‍💻👨‍💻 .

-------------------------------------------------------------------------------------------------------------------------------------------------


#### TL;DR: Both LLMs failed to identify the data leak on their own. They detected it only after being told explicitly that a data leak existed. Maybe we should start using the prompt “there is an error try again” as a foolproof check.