In [2]:
import pandas as pd

# Load the dataset
Online_Retail = "/content/drive/MyDrive/Privacy/Online Retail.xlsx"
data = pd.read_excel(Online_Retail)

# Display the first few rows of the dataset to understand its structure
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [3]:
# Display basic information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


**Part 1: Problem Demonstration**

**Identify Sensitive Attributes and Quasi-Identifiers**

Based on the dataset, we identify the following attributes:

*   Sensitive Attribute: CustomerID
*   Quasi-Identifiers: InvoiceDate, Country

**Privacy Risks Demonstration**

We'll show how customer data can be de-anonymized using quasi-identifiers, and calculate the re-identification risk as the ratio of distinct quasi-identifier combinations to total records.

In [6]:
import pandas as pd

# Convert InvoiceDate to datetime format
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

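# Extract the time first, because the next line overwrites InvoiceDate with the date only.
# Note: re-running this cell after the overwrite yields 00:00:00 for every InvoiceTime.
data['InvoiceTime'] = data['InvoiceDate'].dt.time
data['InvoiceDate'] = data['InvoiceDate'].dt.date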

# Select quasi-identifiers
quasi_identifiers = ['InvoiceDate', 'Country']

# Calculate uniqueness
unique_combinations = data[quasi_identifiers].drop_duplicates()
total_unique_combinations = unique_combinations.shape[0]
total_records = data.shape[0]

# Calculate re-identification risk
reidentification_risk = total_unique_combinations / total_records

total_unique_combinations, total_records, reidentification_risk

(1716, 541909, 0.003166583319339594)

**Problem Demonstration:**

The original dataset has 1,716 distinct combinations of the quasi-identifiers (**InvoiceDate** and **Country**) across 541,909 records, giving a re-identification risk of about **0.32%** by this measure.
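
The ratio above counts distinct quasi-identifier combinations per record. A complementary view is the share of records whose equivalence class has fewer than k members. The sketch below illustrates this on a small made-up frame; the `toy` data and the helper name `share_in_small_classes` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def share_in_small_classes(df, quasi_identifiers, k):
    """Fraction of rows whose quasi-identifier combination occurs fewer than k times."""
    # Size of each equivalence class, one row per distinct combination
    sizes = df.groupby(quasi_identifiers).size().rename("class_size").reset_index()
    # Attach the class size back onto every record
    merged = df.merge(sizes, on=quasi_identifiers, how="left")
    return (merged["class_size"] < k).mean()

# Hypothetical data: classes of size 3, 2 and 1
toy = pd.DataFrame({
    "InvoiceDate": ["2010-12-01"] * 3 + ["2010-12-02"] * 2 + ["2010-12-03"],
    "Country": ["United Kingdom"] * 3 + ["France"] * 2 + ["Norway"],
})

share = share_in_small_classes(toy, ["InvoiceDate", "Country"], k=3)
print(share)  # 3 of the 6 rows sit in classes smaller than 3
```

Unlike the combinations-per-record ratio, this figure is tied directly to the chosen k, so it can be compared before and after anonymization.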


**Proposed Solution:**

**Apply k-Anonymity**

We'll create a custom function to apply k-anonymity, ensuring that each group has at least k records.

In [8]:
def k_anonymity(data, quasi_identifiers, k):
    # Copy the data to avoid modifying the original dataset
    anonymized_data = data.copy()

    # Generalize quasi-identifiers
    for col in quasi_identifiers:
        # Generalize dates to month granularity; Country is left unchanged
        if col == 'InvoiceDate':
            anonymized_data[col] = pd.to_datetime(anonymized_data[col]).dt.to_period('M')

    # Ensure each group has at least k records
    anonymized_data = anonymized_data.groupby(quasi_identifiers).filter(lambda x: len(x) >= k)

    return anonymized_data

# Apply k-anonymity with k=5
k = 5
k_anonymous_data = k_anonymity(data, quasi_identifiers, k)

# Display basic information about the k-anonymous dataset
k_anonymous_info = k_anonymous_data.info()

# Display the first few rows of the k-anonymous dataset
k_anonymous_head = k_anonymous_data.head()

k_anonymous_info, k_anonymous_head

<class 'pandas.core.frame.DataFrame'>
Index: 541873 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype    
---  ------       --------------   -----    
 0   InvoiceNo    541873 non-null  object   
 1   StockCode    541873 non-null  object   
 2   Description  540419 non-null  object   
 3   Quantity     541873 non-null  int64    
 4   InvoiceDate  541873 non-null  period[M]
 5   UnitPrice    541873 non-null  float64  
 6   CustomerID   406801 non-null  float64  
 7   Country      541873 non-null  object   
 8   InvoiceTime  541873 non-null  object   
dtypes: float64(2), int64(1), object(5), period[M](1)
memory usage: 41.3+ MB


(None,
   InvoiceNo StockCode                          Description  Quantity  \
 0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
 1    536365     71053                  WHITE METAL LANTERN         6   
 2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
 3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
 4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
 
   InvoiceDate  UnitPrice  CustomerID         Country InvoiceTime  
 0     2010-12       2.55     17850.0  United Kingdom    00:00:00  
 1     2010-12       3.39     17850.0  United Kingdom    00:00:00  
 2     2010-12       2.75     17850.0  United Kingdom    00:00:00  
 3     2010-12       3.39     17850.0  United Kingdom    00:00:00  
 4     2010-12       3.39     17850.0  United Kingdom    00:00:00  )
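
One way to confirm that the filtered frame satisfies k-anonymity is to look at the size of its smallest equivalence class. The following is a minimal sketch on made-up data; the helper `min_class_size` and the `toy` frame are illustrative assumptions, and the generalize-then-filter steps mirror the function above only in outline.

```python
import pandas as pd

def min_class_size(df, quasi_identifiers):
    """Size of the smallest equivalence class over the given quasi-identifiers."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical data: three UK invoices in December 2010, one French invoice in January 2011
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-01", "2010-12-15", "2010-12-20", "2011-01-05"]),
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France"],
})
toy["InvoiceDate"] = toy["InvoiceDate"].dt.to_period("M")  # generalize to month

qi = ["InvoiceDate", "Country"]
before = min_class_size(toy, qi)                           # smallest class before filtering
kept = toy.groupby(qi).filter(lambda g: len(g) >= 2)       # drop classes smaller than k=2
after = min_class_size(kept, qi)                           # smallest class after filtering
print(before, after)
```

A frame is k-anonymous with respect to these quasi-identifiers when the returned minimum is at least k.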

**Apply l-Diversity**

We'll ensure diversity of the sensitive attribute within each group.

In [9]:
def l_diversity(data, quasi_identifiers, sensitive_attribute, l):
    # Group by quasi-identifiers and filter groups with at least l diverse sensitive attribute values
    diverse_groups = data.groupby(quasi_identifiers).filter(
        lambda x: x[sensitive_attribute].nunique() >= l
    )
    return diverse_groups

# Apply l-diversity with l=2
l = 2
l_diverse_data = l_diversity(k_anonymous_data, quasi_identifiers, 'CustomerID', l)

# Display basic information about the l-diverse dataset
l_diverse_info = l_diverse_data.info()

# Display the first few rows of the l-diverse dataset
l_diverse_head = l_diverse_data.head()

l_diverse_info, l_diverse_head


<class 'pandas.core.frame.DataFrame'>
Index: 537846 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype    
---  ------       --------------   -----    
 0   InvoiceNo    537846 non-null  object   
 1   StockCode    537846 non-null  object   
 2   Description  536392 non-null  object   
 3   Quantity     537846 non-null  int64    
 4   InvoiceDate  537846 non-null  period[M]
 5   UnitPrice    537846 non-null  float64  
 6   CustomerID   403248 non-null  float64  
 7   Country      537846 non-null  object   
 8   InvoiceTime  537846 non-null  object   
dtypes: float64(2), int64(1), object(5), period[M](1)
memory usage: 41.0+ MB


(None,
   InvoiceNo StockCode                          Description  Quantity  \
 0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
 1    536365     71053                  WHITE METAL LANTERN         6   
 2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
 3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
 4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
 
   InvoiceDate  UnitPrice  CustomerID         Country InvoiceTime  
 0     2010-12       2.55     17850.0  United Kingdom    00:00:00  
 1     2010-12       3.39     17850.0  United Kingdom    00:00:00  
 2     2010-12       2.75     17850.0  United Kingdom    00:00:00  
 3     2010-12       3.39     17850.0  United Kingdom    00:00:00  
 4     2010-12       3.39     17850.0  United Kingdom    00:00:00  )
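
The `info()` output above shows that `CustomerID` still contains missing values (403,248 non-null out of 537,846 rows). Because `nunique()` excludes missing values by default, the check counts only known customer IDs. The sketch below shows the effect on a made-up frame; the `toy` data is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: France has one known ID plus two missing ones
toy = pd.DataFrame({
    "Country": ["France", "France", "France", "Norway", "Norway"],
    "CustomerID": [12345.0, np.nan, np.nan, 20001.0, 20002.0],
})

# Distinct known IDs per group; NaN is not counted
diversity = toy.groupby("Country")["CustomerID"].nunique()

# Same filter shape as l_diversity with l=2
kept = toy.groupby("Country").filter(lambda g: g["CustomerID"].nunique() >= 2)
print(diversity)
print(kept["Country"].unique().tolist())
```

Here the France group is dropped even though it has three rows, since only one of them carries a known `CustomerID`.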

**Evaluation**

We'll evaluate the effectiveness of these techniques.

In [10]:
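# Re-identification risk after applying k-anonymity.
# The denominator is the original record count (total_records), so the value is
# directly comparable with the ratio computed before anonymization.
unique_combinations_k = k_anonymous_data[quasi_identifiers].drop_duplicates()
total_unique_combinations_k = unique_combinations_k.shape[0]
reidentification_risk_k = total_unique_combinations_k / total_records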

# Diversity check after applying l-diversity
diversity_check = l_diverse_data.groupby(quasi_identifiers)['CustomerID'].nunique().min()

reidentification_risk_k, diversity_check

(0.0005480624975780067, 2)
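
As a worked check on the figures above, the reported ratio can be turned back into a count of distinct quasi-identifier combinations. This sketch uses only numbers printed in the notebook outputs; the variable names are illustrative.

```python
total_records = 541909            # from data.shape[0]
combos_before = 1716              # distinct (InvoiceDate, Country) pairs before generalization
risk_after = 0.0005480624975780067

risk_before = combos_before / total_records
combos_after = round(risk_after * total_records)  # recover the count behind the ratio

pct_before = f"{risk_before:.2%}"
pct_after = f"{risk_after:.3%}"
print(combos_after, pct_before, pct_after)  # 297 0.32% 0.055%
```

The drop from 1,716 to 297 combinations reflects the move from daily to monthly dates.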

**Evaluation Results**

**Re-identification Risk after k-Anonymity:**

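The re-identification risk has been reduced to approximately 0.055%, down from about 0.32%. The figure is the ratio of distinct quasi-identifier combinations to records (297 combinations across 541,909 records), not the share of records that are unique. Most of the reduction comes from generalizing InvoiceDate to the month; removing groups with fewer than k records dropped only 36 rows.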

**Diversity Check after l-Diversity:**

The minimum diversity of the sensitive attribute (CustomerID) within each group is 2, ensuring that each equivalence class has at least 2 distinct values of the sensitive attribute, protecting against attribute disclosure.
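
The record counts in the `info()` outputs also give a rough measure of how much data the two filters removed. The arithmetic below uses only those reported counts; treating retained rows as a proxy for utility is an assumption of this sketch.

```python
original = 541909   # rows in the raw dataset
after_k = 541873    # rows after the k-anonymity filter
after_l = 537846    # rows after the l-diversity filter

removed_by_k = original - after_k
removed_by_l = after_k - after_l
retained = after_l / original

retained_pct = f"{retained:.2%}"
print(removed_by_k, removed_by_l, retained_pct)  # 36 4027 99.25%
```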

**Summary**

**Problem Demonstration:**

The original dataset had a re-identification risk of about 0.32%, measured as the ratio of distinct quasi-identifier combinations (InvoiceDate and Country) to records.

**Proposed Solution:**

We applied k-anonymity with k=5, generalizing InvoiceDate to the month and removing groups with fewer than 5 records, which reduced the re-identification risk to about 0.055%.

We then applied l-diversity with l=2 to ensure diversity within the sensitive attribute, maintaining a minimum of 2 distinct values within each equivalence class.

Together, the two steps removed 4,063 of the 541,909 records (about 0.75%), so the measured risk fell while more than 99% of the data remained available for analysis. This approach demonstrates the importance and application of k-anonymity and l-diversity in ensuring responsible innovation in AI and data science.


---
