<a href="https://colab.research.google.com/github/ShraddhaSharma24/Machine-learning/blob/main/Privacy_Preserving_Customer_Churn_Prediction_using_Differential_Privacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project demonstrates privacy-preserving machine learning by adding Laplace noise, in the style of Differential Privacy (DP), to a synthetic customer churn dataset. It compares a Logistic Regression model trained on the clean features with one trained on the noisy (privacy-preserving) features, with both models evaluated on the same clean test set.

📝 Project Overview

Objective

Generate synthetic customer churn data.
Train a baseline Logistic Regression model on the clean data.
Add Laplace noise to the training features, following the Differential Privacy approach.
Train a second Logistic Regression model on the noisy, privacy-preserving data.
Compare the accuracy and per-class performance of both models to observe the privacy-utility tradeoff.
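
For reference, the Laplace mechanism behind the third step can be written as below. The symbols $f$, $\Delta f$, $\varepsilon$, and $D \sim D'$ (datasets differing in one record) are standard notation, not names used in this notebook.

```latex
\[
\Delta f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1,
\qquad
\mathcal{M}(D) = f(D) + (Z_1, \dots, Z_k),
\quad
Z_i \overset{\text{iid}}{\sim} \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right)
\]
\[
\mathrm{Lap}(b):\ p(z) = \frac{1}{2b}\, e^{-\lvert z \rvert / b},
\qquad
\operatorname{Var}(Z_i) = 2b^2 = \frac{2\,(\Delta f)^2}{\varepsilon^2}
\]
```

The variance term shows why a small $\varepsilon$ or a large sensitivity quickly swamps the signal: the noise standard deviation grows as $\sqrt{2}\,\Delta f / \varepsilon$.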


In [9]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [10]:
# 1. Generate synthetic customer churn data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

In [12]:
# Convert to DataFrame for convenience
columns = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=columns)
df['Churn'] = y

print("Sample Data (first 5 rows):")
print(df.head())

Sample Data (first 5 rows):
   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  \
0   0.964799  -0.066449   0.986768  -0.358079   0.997266   1.181890   
1  -0.916511  -0.566395  -1.008614   0.831617  -1.176962   1.820544   
2  -0.109484  -0.432774  -0.457649   0.793818  -0.268646  -1.836360   
3   1.750412   2.023606   1.688159   0.006800  -1.607661   0.184741   
4  -0.224726  -0.711303  -0.220778   0.117124   1.536061   0.597538   

   feature_6  feature_7  feature_8  feature_9  Churn  
0  -1.615679  -1.210161  -0.628077   1.227274      0  
1   1.752375  -0.984534   0.363896   0.209470      1  
2   1.239086  -0.246383  -1.058145  -0.297376      1  
3  -2.619427  -0.357445  -1.473127  -0.190039      0  
4   0.348645  -0.939156   0.175915   0.236224      1  


In [13]:
# 2. Split into train and test
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df['Churn'], test_size=0.2, random_state=42)


In [14]:
# 3. Train baseline Logistic Regression (no privacy noise)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

In [15]:
# 4. Evaluate baseline model
y_pred = lr.predict(X_test)
print("\nBaseline Model Performance:")
print(classification_report(y_test, y_pred))


Baseline Model Performance:
              precision    recall  f1-score   support

           0       0.79      0.84      0.82        89
           1       0.87      0.82      0.84       111

    accuracy                           0.83       200
   macro avg       0.83      0.83      0.83       200
weighted avg       0.83      0.83      0.83       200



In [16]:
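# 5. Add Differential Privacy noise to training data (the test features stay clean)
epsilon = 1.0  # Privacy budget (lower = more privacy, but more noise)
# Caveat: this range is measured on the private data itself and shared by all features,
# so it is a rough stand-in for a true, data-independent sensitivity bound.
sensitivity = X_train.max().max() - X_train.min().min()  # global range of feature values
# No random seed is set, so the noise (and the scores below) will vary between runs.
noise = np.random.laplace(0, sensitivity/epsilon, X_train.shape)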

X_train_noisy = X_train + noise
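
A minimal sketch of a variant of step 5, assuming the features are clipped to fixed bounds chosen without looking at the data. The function name `laplace_perturb`, the bounds of -3 and 3, and the seed are all hypothetical choices for illustration, not part of the notebook. With every feature clipped to `[lower, upper]`, two records can differ by at most `n_features * (upper - lower)` in L1 norm, and that value is used as the sensitivity when perturbing whole records.

```python
import numpy as np

def laplace_perturb(X, epsilon, lower, upper, seed=None):
    """Clip features to [lower, upper], then add Laplace noise to every entry.

    Illustrative helper: the bounds must be fixed in advance,
    not derived from the private data.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)  # seeded generator for repeatable noise
    clipped = np.clip(X, lower, upper)
    n_features = clipped.shape[1]
    # Worst-case L1 distance between two clipped records.
    l1_sensitivity = n_features * (upper - lower)
    scale = l1_sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=clipped.shape)
    return clipped + noise, scale

# Stand-in data with the same shape as the notebook's feature matrix.
demo = np.random.default_rng(0).normal(size=(1000, 10))
noisy, scale = laplace_perturb(demo, epsilon=1.0, lower=-3.0, upper=3.0, seed=42)
print(f"Laplace scale per entry: {scale:.1f}")
print(f"Noise standard deviation: {np.sqrt(2) * scale:.1f}")
```

Comparing the noise standard deviation with the spread of the features themselves gives a quick sense of how much signal is likely to survive at a given epsilon.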

In [17]:
# 6. Train Logistic Regression on Noisy Data (privacy-preserving model)
lr_noisy = LogisticRegression(max_iter=1000)
lr_noisy.fit(X_train_noisy, y_train)


In [18]:
# 7. Evaluate Privacy-Preserving Model
y_pred_noisy = lr_noisy.predict(X_test)
print("\nPrivacy-Preserving Model Performance (with DP noise):")
print(classification_report(y_test, y_pred_noisy))


Privacy-Preserving Model Performance (with DP noise):
              precision    recall  f1-score   support

           0       0.47      1.00      0.64        89
           1       1.00      0.10      0.18       111

    accuracy                           0.50       200
   macro avg       0.74      0.55      0.41       200
weighted avg       0.76      0.50      0.39       200
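
The report above can also be read as a confusion matrix. The sketch below uses counts inferred from the rounded values in the report (they are not printed by the notebook) and checks that they reproduce the reported scores. All variable names are illustrative.

```python
# Counts inferred from the rounded report above, not produced by the notebook.
support_0, support_1 = 89, 111  # true class sizes in the test set
tp_1 = 11                       # churners correctly flagged: 11 / 111 rounds to 0.10
fn_1 = support_1 - tp_1         # churners predicted as class 0
tn_0 = support_0                # recall 1.00 for class 0: all non-churners predicted as 0
fp_1 = support_0 - tn_0         # non-churners predicted as class 1

recall_1 = tp_1 / support_1
precision_0 = tn_0 / (tn_0 + fn_1)
precision_1 = tp_1 / (tp_1 + fp_1)
accuracy = (tn_0 + tp_1) / (support_0 + support_1)

print(f"recall(1)={recall_1:.2f}  precision(0)={precision_0:.2f}  "
      f"precision(1)={precision_1:.2f}  accuracy={accuracy:.2f}")
print(f"predicted class 0 for {tn_0 + fn_1} of {support_0 + support_1} test rows")
```

Read this way, the model trained on noisy features assigns nearly every test row to class 0, which is why its accuracy is no better than a coin flip.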



In [19]:
# 8. Compare Accuracy
baseline_acc = accuracy_score(y_test, y_pred)
dp_acc = accuracy_score(y_test, y_pred_noisy)

In [20]:
print("\nComparison of Accuracy:")
print(f"Baseline Accuracy: {baseline_acc:.4f}")
print(f"DP Logistic Regression Accuracy: {dp_acc:.4f}")


Comparison of Accuracy:
Baseline Accuracy: 0.8300
DP Logistic Regression Accuracy: 0.5000
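
The comparison above uses a single privacy budget. One way to see the privacy-utility tradeoff more fully is to repeat steps 5 to 7 over several values of epsilon, as in the sketch below. The epsilon grid and the seed are hypothetical choices, and the sensitivity uses the same global-range heuristic as step 5, so this is an exploration aid rather than a formal privacy analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic setup as the notebook.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)  # hypothetical seed, added for repeatable runs
sensitivity = X_train.max() - X_train.min()  # global-range heuristic from step 5

results = {}
for epsilon in [0.1, 1.0, 10.0, 100.0]:  # illustrative privacy budgets
    noise = rng.laplace(0.0, sensitivity / epsilon, size=X_train.shape)
    model = LogisticRegression(max_iter=1000).fit(X_train + noise, y_train)
    # Evaluate on the clean test features, as the notebook does.
    results[epsilon] = model.score(X_test, y_test)

for epsilon, acc in results.items():
    print(f"epsilon={epsilon:>6}: test accuracy={acc:.4f}")
```

Larger epsilon values add less noise, so accuracy would be expected to move toward the baseline as epsilon grows. The exact numbers depend on the seed.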
