# Analysis of Video Game Sales

This project explores the application of Secure Multi-Party Computation (SMPC) techniques to the analysis of video game sales data. Our objective is to gain insights without compromising the confidentiality of the data. By implementing Paillier encryption, Differential Privacy (DP), and the Advanced Encryption Standard (AES), we aim to showcase a privacy-preserving approach to data analysis.

**Note:** To analyze the video game sales data, we are using a CSV file named `vgsales.csv`.

Ensure that the `vgsales.csv` file is located in the same directory as your Jupyter notebook or Python script. If it's in a different directory, you'll need to specify the correct path when loading the file.

Remember to execute each cell in sequence if you're using a Jupyter notebook, as some cells may depend on the execution of previous ones.


To perform cryptographic functions such as encryption and decryption in our analysis, we need to install the `pycryptodome` library. This library provides various cryptographic modules that are essential for securing data. You can install it using the following pip command:


In [None]:
pip install pycryptodome

`pycryptodomex` is another self-contained Python package of low-level cryptographic primitives. It is the same library as `pycryptodome`, installed under the `Cryptodome` namespace instead of `Crypto`. Install it with pip as follows:



In [None]:
pip install pycryptodomex

The `phe` library (python-paillier) implements the Paillier cryptosystem used in Part 1. Install it in the same way:

In [None]:
pip install phe

Next, import the libraries used throughout the notebook:

In [7]:
from phe import paillier
from Cryptodome.Cipher import AES
from Cryptodome.Random import get_random_bytes
import time
import random
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from scipy.stats import laplace

## Part 1: Average sales per decade (1980-2020)

In this part, we aim to determine which geographical region (North America, Europe, or the Rest of the World) experienced the highest average video game sales per decade, spanning from 1980 to 2020. Then we plot the results for better comparison.

To protect the data, we employ **Paillier Encryption**. This additively homomorphic scheme lets us add encrypted values without decrypting them, so the running total of sales is accumulated as ciphertext and only the final sum is decrypted to compute the average. To keep the run time manageable, the code below sums the sales in batches of 100 and encrypts each batch total rather than each individual figure.
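To make the additive property concrete, here is a minimal sketch of textbook Paillier using only the standard library. It is not how `phe` is implemented: the primes are toy-sized assumptions chosen for readability, the helper names `encrypt`, `decrypt`, and `add_encrypted` are hypothetical, and the batch totals are made-up integers. It only illustrates that multiplying two ciphertexts modulo n² yields a ciphertext of the sum of the plaintexts.

```python
import math
import random

# Toy primes (assumption for illustration only; real keys use far larger primes)
p, q = 1789, 1861
n = p * q
n_sq = n * n
g = n + 1                                            # common simple choice of generator
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                                 # modular inverse (Python 3.8+)


def encrypt(m):
    """Encrypt an integer 0 <= m < n under the public key (n, g)."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(c):
    """Recover the plaintext using the private values (lam, mu)."""
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n


def add_encrypted(c1, c2):
    """Homomorphic addition: the product of two ciphertexts encrypts the sum."""
    return (c1 * c2) % n_sq


# Hypothetical batch totals, kept as integers for this toy scheme
batch_totals = [120, 75, 310]

encrypted_total = encrypt(0)
for total in batch_totals:
    encrypted_total = add_encrypted(encrypted_total, encrypt(total))

# Only the final sum is decrypted, then divided to get the average
average = decrypt(encrypted_total) / len(batch_totals)
```

The `phe` library additionally handles floating-point encoding and large keys, which is why the notebook relies on it rather than on a hand-written scheme like this one.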



In [8]:
# Reading the CSV file
df = pd.read_csv('vgsales.csv')

In [None]:
# Average a sequence of numbers with Paillier encryption, encrypting one total per batch
def paillier_avg_batch(numbers, batch_size=100):
    public_key, private_key = paillier.generate_paillier_keypair()
    values = [float(v) for v in numbers]

    encrypted_sum = public_key.encrypt(0.0)
    # Stepping by batch_size also covers a final, shorter batch
    for start in range(0, len(values), batch_size):
        batch_total = sum(values[start:start + batch_size])
        # phe adds EncryptedNumber objects homomorphically
        encrypted_sum += public_key.encrypt(batch_total)

    # Only the final sum is decrypted
    decrypted_sum = private_key.decrypt(encrypted_sum)
    avg = decrypted_sum / len(values)
    return avg

def decade_ranges(year):
    if 1980 <= year < 1990:
        return "1980-1990"
    elif 1990 <= year < 2000:
        return "1990-2000"
    elif 2000 <= year < 2010:
        return "2000-2010"
    elif 2010 <= year <= 2020:
        return "2010-2020"
    else:
        return None


# Add a new column for the decade
df['DecadeRange'] = df['Year'].apply(decade_ranges)

# Group the data by decade
grouped_data = df[df['DecadeRange'].notna()].groupby('DecadeRange')

decade_avgs = {}
for decade_range, group in grouped_data:
    na_avg = round(paillier_avg_batch(group['NA_Sales'], batch_size=100), 2)
    eu_avg = round(paillier_avg_batch(group['EU_Sales'], batch_size=100), 2)
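    # Rest of the World = global total minus the NA and EU figures
    rest_of_world_avg = round(paillier_avg_batch(group['Global_Sales'] - group['NA_Sales'] - group['EU_Sales'], batch_size=100), 2)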


    decade_avgs[decade_range] = {
        'NA': na_avg,
        'EU': eu_avg,
        'Rest of the World': rest_of_world_avg
    }

for decade_range in sorted(decade_avgs.keys()):
    print(f"Decade: {decade_range}")
    print(f"  NA Average Sales: {decade_avgs[decade_range]['NA']} million")
    print(f"  EU Average Sales: {decade_avgs[decade_range]['EU']} million")
    print(f"  Rest of World Average Sales: {decade_avgs[decade_range]['Rest of the World']} million\n")

plot_data = []
for decade_range, sales in decade_avgs.items():
    for region, avg_sales in sales.items():
        plot_data.append({'DecadeRange': decade_range, 'Region': region, 'Average Sales': avg_sales})

# Convert plot_data to a DataFrame
df_plot = pd.DataFrame(plot_data)

# Create the line graph
fig = px.line(df_plot, x="DecadeRange", y="Average Sales", color="Region",
              title="Average Video Game Sales per Decade Range (in millions)",
              labels={"DecadeRange": "Decade Range", "Average Sales": "Average Sales (Millions)"})


fig.show()

## Part 2: Ranking Top 5 Favorite Games in Each Region

In this part, we rank the top 5 favorite games in each region based on their sales figures. To protect the privacy of individual sales data, we apply **Differential Privacy (DP)** techniques by adding Laplace noise to the sales figures. This ensures the confidentiality of the dataset, adhering to privacy-preserving measures.
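As a rough illustration of the Laplace mechanism described above, the sketch below adds noise with scale `sensitivity / epsilon` using only the standard library (the notebook itself uses `np.random.laplace`). The helper names `laplace_noise` and `privatize`, the default sensitivity of 1, and the sample figures are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def privatize(values, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace noise with scale = sensitivity / epsilon to each value."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


sales = [40.0, 29.0, 15.5, 15.0, 11.2]   # hypothetical figures, in millions

strong_privacy = privatize(sales, epsilon=0.1, seed=0)   # noise scale 10
weak_privacy = privatize(sales, epsilon=100, seed=0)     # noise scale 0.01
```

A small ε gives a large noise scale and therefore stronger privacy at the cost of accuracy. A large ε leaves the figures almost unchanged.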


First, we plot a bar graph showing the top 5 favorite games in each region without any privacy measure applied.

Keep in mind that the added noise can change the exact order of the rankings, especially for small values of the privacy budget epsilon (ε): a smaller ε means a larger noise scale and stronger privacy, while a larger ε means less noise and rankings closer to the original. We also timed the noise-addition step for ε values sampled between 1 and just under 100,000. Because adding a noise term costs the same regardless of its magnitude, the measured time is not expected to depend on ε in any systematic way, and differences between runs mostly reflect timing jitter.

For our analysis, we have chosen an ε of 1. This provides a moderate level of noise that masks the exact sales figures while preserving their utility.
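
For reference, the standard Laplace mechanism ties ε to the noise scale as shown below. The symbol $\Delta f$ (the sensitivity of the released quantity) is introduced here for illustration and is assumed to be 1.

$$
\tilde{f}(D) = f(D) + X, \qquad X \sim \mathrm{Laplace}(0, b), \qquad b = \frac{\Delta f}{\varepsilon}, \qquad \mathbb{E}\,|X| = b
$$

Under that assumption, ε = 1 gives an expected absolute noise of 1 (one million units, since sales are reported in millions), while ε = 100 shrinks it to 0.01.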




In [None]:
barWidth = 0.25

list_sales = df.nlargest(5, ['NA_Sales'])
list_sales_NA = list_sales['NA_Sales'].tolist()

list_sales_1 = df.nlargest(5, ['EU_Sales'])
list_sales_EU = list_sales_1['EU_Sales'].tolist()

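# Rest of the World = global total minus the NA and EU figures
list_sales_2 = df.assign(Rest_Sales=df['Global_Sales'] - df['NA_Sales'] - df['EU_Sales']).nlargest(5, ['Rest_Sales'])
list_sales_Other = list_sales_2['Rest_Sales'].tolist()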


# Set the positions for the bars
pos1 = np.arange(len(list_sales_NA))
pos2 = [x + barWidth for x in pos1]
pos3 = [x + barWidth for x in pos2]

# Set figure size
plt.figure(figsize=(12, 8))

# Create the horizontal bars
barNA = plt.barh(pos1, list_sales_NA, height=barWidth, color='r', edgecolor='grey', label='NA')
barEU = plt.barh(pos2, list_sales_EU, height=barWidth, color='g', edgecolor='grey', label='EU')
barOther = plt.barh(pos3, list_sales_Other, height=barWidth, color='b', edgecolor='grey', label='Rest of the World')

# Label the bars with game names
for bar, name in zip(barNA, list_sales['Name']):
    plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, ' '+name, va='center', ha='left', color='black', fontweight='bold')
for bar, name in zip(barEU, list_sales_1['Name']):
    plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, ' '+name, va='center', ha='left', color='black', fontweight='bold')
for bar, name in zip(barOther, list_sales_2['Name']):
    plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, ' '+name, va='center', ha='left', color='black', fontweight='bold')

# Set the y-axis labels
plt.yticks([r + barWidth for r in range(len(list_sales_NA))], ['Top 1', 'Top 2', 'Top 3', 'Top 4', 'Top 5'])
plt.gca().invert_yaxis()


# Set labels and title
plt.xlabel('Sales (Millions)')
plt.title('Top 5 Favorite Games in Each Region')
plt.legend()

# Show the plot
plt.tight_layout()
plt.show()

The graph below plots the time taken to add noise against the sampled ε values.

Note: Only NA sales are used here, since the goal is just to show how ε affects the timing.

In [None]:
# Top 5 NA sales figures, used as the values to perturb
list_NAA = df.nlargest(5, ['NA_Sales'])['NA_Sales'].tolist()

# Time the noise addition for nine epsilon values drawn from 1 to 99,999
timings = []
for _ in range(9):
    eps = random.randint(1, 99999)
    start = time.perf_counter()
    # Laplace scale = sensitivity / epsilon, assuming a sensitivity of 1
    noisy_NAA = [sale + np.random.laplace(0, 1 / eps) for sale in list_NAA]
    end = time.perf_counter()
    timings.append((eps, end - start))

# Sort by epsilon so the line plot runs left to right
timings.sort()
x = [eps for eps, _ in timings]
y = [elapsed for _, elapsed in timings]

plt.plot(x, y)
plt.xlabel("Value of epsilon (ε)")
plt.ylabel("Computation Time (s)")
plt.title("Computation Time for Adding Laplace Noise")
plt.show()


Below is the bar graph representing the top 5 favorite games in each region after applying DP with ε = 1.

This visualization demonstrates the effect of DP on our dataset. Note that the names of the games remain constant, as our primary focus is on demonstrating the impact of DP on the sales figures, not on the actual titles.

In [None]:
# Add Laplace noise to the sales figures for differential privacy
epsilon = 1  # Privacy budget
scale = 1 / epsilon  # Laplace scale = sensitivity / epsilon, assuming a sensitivity of 1
list_sales_NA = [sale + np.random.laplace(0, scale) for sale in list_sales_NA]
list_sales_EU = [sale + np.random.laplace(0, scale) for sale in list_sales_EU]
list_sales_Other = [sale + np.random.laplace(0, scale) for sale in list_sales_Other]

# Set the positions for the bars
pos1 = np.arange(len(list_sales_NA))
pos2 = [x + barWidth for x in pos1]
pos3 = [x + barWidth for x in pos2]

label_position = max(max(list_sales_NA), max(list_sales_EU), max(list_sales_Other)) * 0.09  # 9% of the maximum sales

# Set figure size
plt.figure(figsize=(12, 8))

# Create the horizontal bars
plt.barh(pos1, list_sales_NA, height=barWidth, color='r', edgecolor='grey', label='NA')
plt.barh(pos2, list_sales_EU, height=barWidth, color='g', edgecolor='grey', label='EU')
plt.barh(pos3, list_sales_Other, height=barWidth, color='b', edgecolor='grey', label='Rest of the World')

for idx, name in enumerate(list_sales['Name']):
    plt.text(label_position, pos1[idx], ' '+name, va='center', ha='left', color='black', fontweight='bold')
for idx, name in enumerate(list_sales_1['Name']):
    plt.text(label_position, pos2[idx], ' '+name, va='center', ha='left', color='black', fontweight='bold')
for idx, name in enumerate(list_sales_2['Name']):
    plt.text(label_position, pos3[idx], ' '+name, va='center', ha='left', color='black', fontweight='bold')

# Set the y-axis labels in ascending order
plt.yticks([r + barWidth for r in range(len(list_sales_NA))], ['Top 1', 'Top 2', 'Top 3', 'Top 4', 'Top 5'])

# Invert the y-axis to have the top game at the top
plt.gca().invert_yaxis()

# Set labels and title
plt.xlabel('Sales (Millions)')
plt.title('Top 5 Favorite Games in Each Region with Differential Privacy (ε = 1)')
plt.legend()

# Print the noisy sales data
print("Noisy NA Sales:", list_sales_NA)
print("Noisy EU Sales:", list_sales_EU)
print("Noisy Other Sales:", list_sales_Other)

# Show the plot
plt.tight_layout()
plt.show()

## Part 3: Accuracy Analysis of DP on Game Rankings

In this part, we analyze how DP affects the accuracy of the game rankings in each region. To keep the visualization clear, we limit the analysis to the top 500 games. This makes the rank changes caused by the added noise easier to see and avoids crowding the graph with so many data points that patterns become hard to discern.

The scatter plots below show the change in ranking for each of these top 500 games. The x-axis represents the original ranking based on actual sales, while the y-axis indicates how much each game's rank has changed after the application of noise. A greater change in rank suggests a more significant impact of the noise on the game's sales figure, highlighting the trade-off between privacy and data accuracy.  
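
The rank-change calculation behind these plots can be sketched on a tiny example. The function name `rank_changes` and the figures are hypothetical. Ranks are keyed by list position rather than by game name, so two entries that share a name would not overwrite each other.

```python
def rank_changes(original, noisy):
    """Return (original_rank, absolute rank change) pairs, one per list position."""
    order_original = sorted(range(len(original)), key=original.__getitem__, reverse=True)
    order_noisy = sorted(range(len(noisy)), key=noisy.__getitem__, reverse=True)

    # Map each position to its 0-based rank in the respective ordering
    rank_original = {idx: rank for rank, idx in enumerate(order_original)}
    rank_noisy = {idx: rank for rank, idx in enumerate(order_noisy)}

    return [(rank_original[i], abs(rank_original[i] - rank_noisy[i]))
            for i in range(len(original))]


original_sales = [9.0, 7.5, 7.4, 3.0]   # hypothetical figures, in millions
noisy_sales = [8.6, 7.1, 7.9, 3.3]      # the same games after hypothetical noise

changes = rank_changes(original_sales, noisy_sales)
# → [(0, 0), (1, 1), (2, 1), (3, 0)]
```

Here the second and third games swap places because their original figures are close, while the first and last keep their ranks.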

In [None]:
# Get the top 500 games in the NA region
top_500 = df.nlargest(500, ['NA_Sales'])
original_sales = top_500['NA_Sales'].tolist()
original_names = top_500['Name'].tolist()

# Apply Laplace noise to the sales figures for differential privacy
epsilon = 1
# Laplace scale = sensitivity / epsilon, assuming a sensitivity of 1
noisy_sales = [sale + np.random.laplace(0, 1 / epsilon) for sale in original_sales]

# Sort the original and noisy sales
sorted_indices_original = sorted(range(len(original_sales)), key=original_sales.__getitem__, reverse=True)
sorted_indices_noisy = sorted(range(len(noisy_sales)), key=noisy_sales.__getitem__, reverse=True)

# Map each game's position in top_500 to its rank
# (keyed by position because a title can appear more than once, e.g. on several platforms)
original_ranks = {idx: rank for rank, idx in enumerate(sorted_indices_original)}
noisy_ranks = {idx: rank for rank, idx in enumerate(sorted_indices_noisy)}

# Calculate the rank differences for each game
x = []  # Original ranks
y = []  # Rank differences

for idx in range(len(original_names)):
    original_rank = original_ranks[idx]
    noisy_rank = noisy_ranks[idx]
    x.append(original_rank)
    y.append(abs(original_rank - noisy_rank))

# Plot the rank differences for the top 500 games
plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.6)  # Using a scatter plot for better visualization
plt.title('DP Impact on Game Ranking for Top 500 Games (NA Region)')
plt.xlabel('Original Rank')
plt.ylabel('Change in Rank')
plt.grid(True)
plt.show()


In [None]:
# Get the top 500 games in the EU region
top_500 = df.nlargest(500, ['EU_Sales'])
original_sales = top_500['EU_Sales'].tolist()
original_names = top_500['Name'].tolist()

# Apply Laplace noise to the sales figures for differential privacy
epsilon = 1
# Laplace scale = sensitivity / epsilon, assuming a sensitivity of 1
noisy_sales = [sale + np.random.laplace(0, 1 / epsilon) for sale in original_sales]

# Sort the original and noisy sales
sorted_indices_original = sorted(range(len(original_sales)), key=original_sales.__getitem__, reverse=True)
sorted_indices_noisy = sorted(range(len(noisy_sales)), key=noisy_sales.__getitem__, reverse=True)

# Map each game's position in top_500 to its rank
# (keyed by position because a title can appear more than once, e.g. on several platforms)
original_ranks = {idx: rank for rank, idx in enumerate(sorted_indices_original)}
noisy_ranks = {idx: rank for rank, idx in enumerate(sorted_indices_noisy)}

# Calculate the rank differences for each game
x = []  # Original ranks
y = []  # Rank differences

for idx in range(len(original_names)):
    original_rank = original_ranks[idx]
    noisy_rank = noisy_ranks[idx]
    x.append(original_rank)
    y.append(abs(original_rank - noisy_rank))

# Plot the rank differences for the top 500 games
plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.6)  # Using a scatter plot for better visualization
plt.title('DP Impact on Game Ranking for Top 500 Games (EU Region)')
plt.xlabel('Original Rank')
plt.ylabel('Change in Rank')
plt.grid(True)
plt.show()


In [None]:
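# Get the top 500 games in the rest of the world (global total minus NA and EU)
rest_sales = df['Global_Sales'] - df['NA_Sales'] - df['EU_Sales']
top_500 = df.assign(Rest_Sales=rest_sales).nlargest(500, ['Rest_Sales'])
original_sales = top_500['Rest_Sales'].tolist()
original_names = top_500['Name'].tolist()

# Apply Laplace noise to the sales figures for differential privacy
epsilon = 1
# Laplace scale = sensitivity / epsilon, assuming a sensitivity of 1
noisy_sales = [sale + np.random.laplace(0, 1 / epsilon) for sale in original_sales]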

# Sort the original and noisy sales
sorted_indices_original = sorted(range(len(original_sales)), key=original_sales.__getitem__, reverse=True)
sorted_indices_noisy = sorted(range(len(noisy_sales)), key=noisy_sales.__getitem__, reverse=True)

# Map each game's position in top_500 to its rank
# (keyed by position because a title can appear more than once, e.g. on several platforms)
original_ranks = {idx: rank for rank, idx in enumerate(sorted_indices_original)}
noisy_ranks = {idx: rank for rank, idx in enumerate(sorted_indices_noisy)}

# Calculate the rank differences for each game
x = []  # Original ranks
y = []  # Rank differences

for idx in range(len(original_names)):
    original_rank = original_ranks[idx]
    noisy_rank = noisy_ranks[idx]
    x.append(original_rank)
    y.append(abs(original_rank - noisy_rank))

# Plot the rank differences for the top 500 games
plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.6)  # Using a scatter plot for better visualization
plt.title('DP Impact on Game Ranking for Top 500 Games (Rest of the World)')
plt.xlabel('Original Rank')
plt.ylabel('Change in Rank')
plt.grid(True)
plt.show()


## Part 4: Analyzing Data with AES Encryption

In this section, we find the best-selling platform, genre, and publisher while keeping these categorical labels confidential. To do so, we protect the names of game **platforms**, **genres**, and **publishers** with Advanced Encryption Standard (AES) encryption.

The code first totals the sales per category, then encrypts each category name with AES in EAX mode under a freshly generated key. The maximum is selected using only the totals attached to the encrypted names, and just the top platform, genre, and publisher are decrypted and displayed.

EAX is an authenticated mode, so decryption also verifies that the ciphertext has not been altered. This protects the integrity of the labels as well as their confidentiality.


In [None]:
import csv

# Read the CSV with the csv module so quoted fields that contain commas stay intact
list_title = []
list_of_data = []
with open("vgsales.csv", 'r', newline='') as file:
    reader = csv.reader(file)
    # The first row holds the column titles
    list_title = [val.strip() for val in next(reader)]
    for row in reader:
        # Keep only complete rows (11 columns)
        if len(row) == 11:
            list_of_data.append([val.strip() for val in row])


def check_val (element) :
    try:
        float(element)
    except ValueError:
        return False
    return True
def aes(data) :
    key = get_random_bytes(16)
    cipher = AES.new(key, AES.MODE_EAX)
    ciphertext, tag = cipher.encrypt_and_digest(data)
    nonce = cipher.nonce
    return ciphertext,tag,nonce,key


ciphertext,tag,nonce,key = aes(b'10')
ciphertext1,tag1,nonce1,key1 = aes(b'10')

best_platform = {}
best_publisher = {};
best_genre = {}
for index in range(len(list_of_data)) :
    if check_val(list_of_data[index][6].strip()) and len(list_of_data[index]) == 11 :
        if list_of_data[index][2] not in best_platform :
            best_platform[list_of_data[index][2].strip()] = float(list_of_data[index][10])
        else :
            best_platform[list_of_data[index][2].strip()] += float(list_of_data[index][10])
        if list_of_data[index][5] not in best_publisher :
            # if list_of_data[index][5] not in best_publisher :
            best_publisher[list_of_data[index][5].strip()] = float(list_of_data[index][10])
        else:
            best_publisher[list_of_data[index][5].strip()] += float(list_of_data[index][10])

        if list_of_data[index][4] not in best_genre :
            best_genre[list_of_data[index][4].strip()] = float(list_of_data[index][10])
        else :
            best_genre[list_of_data[index][4].strip()] += float(list_of_data[index][10])

encypt_platform = []
for key in best_platform :
    encrypt_key =  bytes(key, 'utf-8')
    ciphertext1,tag1,nonce1,key1 = aes(encrypt_key)
    tup = (ciphertext1,tag1,nonce1,key1)
    ls = []
    ls.append(tup)
    ls.append(best_platform[key])
    encypt_platform.append(ls)

max_val_platform = 0
platform_tuple =()
for i in range(len(encypt_platform)) :
    if max_val_platform <encypt_platform[i][1] :
        max_val_platform = encypt_platform[i][1]
        platform_tuple = encypt_platform[i][0]

ls_platform = []
for x in platform_tuple :
    ls_platform.append(x)
#print(ls_platform[0])
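# Rebuild the cipher from the stored key and nonce, then decrypt and verify the tag
cipher = AES.new(ls_platform[3], AES.MODE_EAX, nonce=ls_platform[2])
best_platform = cipher.decrypt_and_verify(ls_platform[0], ls_platform[1]).decode('utf-8')
print(f"The most popular platform is: {best_platform}")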

encrypt_publisher = []
for key in best_publisher :
   # print(key)
    encrypt_key =  bytes(key, 'utf-8')
    ciphertext1,tag1,nonce1,key1 = aes(encrypt_key)
    tup = (ciphertext1,tag1,nonce1,key1)
    ls = []
    ls.append(tup)
    ls.append(best_publisher[key])
    encrypt_publisher.append(ls)

max_val_publisher = 0
publisher_tuple =()
for i in range(len(encrypt_publisher)) :
    if max_val_publisher <encrypt_publisher[i][1] :
        max_val_publisher = encrypt_publisher[i][1]
        publisher_tuple = encrypt_publisher[i][0]

ls_publisher = []
for x in publisher_tuple :
    ls_publisher.append(x)
# print(ls_publisher[0])
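# Rebuild the cipher from the stored key and nonce, then decrypt and verify the tag
cipher = AES.new(ls_publisher[3], AES.MODE_EAX, nonce=ls_publisher[2])
best_publisher = cipher.decrypt_and_verify(ls_publisher[0], ls_publisher[1]).decode('utf-8')
print(f"The most popular publisher is: {best_publisher}")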



encypt_genre = []
for key in best_genre :
    encrypt_key =  bytes(key, 'utf-8')
    ciphertext1,tag1,nonce1,key1 = aes(encrypt_key)
    tup = (ciphertext1,tag1,nonce1,key1)
    ls = []
    ls.append(tup)
    ls.append(best_genre[key])
    encypt_genre.append(ls)

max_val_genre = 0
genre_tuple =()
for i in range(len(encypt_genre)) :
    if max_val_genre < encypt_genre[i][1] :
        max_val_genre = encypt_genre[i][1]
        genre_tuple = encypt_genre[i][0]

ls_genre = []
for x in genre_tuple :
    ls_genre.append(x)
#print(ls_genre[2])
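# Rebuild the cipher from the stored key and nonce, then decrypt and verify the tag
cipher = AES.new(ls_genre[3], AES.MODE_EAX, nonce=ls_genre[2])
best_genre = cipher.decrypt_and_verify(ls_genre[0], ls_genre[1]).decode('utf-8')
print(f"The most popular genre is: {best_genre}")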