Team: 21

- Praanshu Patel (23110249)
- Rishank Soni (23110277)

### Part 1: A/B Testing using Ad Click Prediction


In [1]:
import pandas as pd
import numpy as np

# Loading dataset
ad_data = pd.read_csv('ad_click_dataset.csv')
print(f"Ad Click Dataset loaded successfully with shape: {ad_data.shape}")

# Displaying first few rows
print("\nFirst 5 rows of Ad Click dataset:")
display(ad_data.head())

# Dropping rows with missing values in 'ad_position' or 'click'
ad_data.dropna(subset=['ad_position', 'click'], inplace=True)
print(f"\nAfter dropping rows with missing 'ad_id' or 'click', new shape: {ad_data.shape}")

# Mapping ad_position to numerical values; positions other than 'Top' or
# 'Bottom' (e.g., 'Side') become NaN and fall outside both A/B groups
ad_data['ad_position'] = ad_data['ad_position'].map({
    'Top': 0,
    'Bottom': 1
})

# Encoding categorical columns as integer codes (missing values become -1)
categorical_cols = ['gender']
for col in categorical_cols:
    print(f"\nUnique values in '{col}': {ad_data[col].unique()}")
    ad_data[col] = ad_data[col].astype('category').cat.codes

print("\nDataset after encoding:")
display(ad_data.head())

Ad Click Dataset loaded successfully with shape: (10000, 9)

First 5 rows of Ad Click dataset:


Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,,Desktop,Top,Shopping,Afternoon,1
1,3044,User3044,,Male,Desktop,Top,,,1
2,5912,User5912,41.0,Non-Binary,,Side,Education,Night,1
3,5418,User5418,34.0,Male,,,Entertainment,Evening,1
4,9452,User9452,39.0,Non-Binary,,,Social Media,Morning,0



After dropping rows with missing 'ad_id' or 'click', new shape: (8000, 9)

Unique values in 'gender': [nan 'Male' 'Non-Binary' 'Female']

Dataset after encoding:


Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,-1,Desktop,0.0,Shopping,Afternoon,1
1,3044,User3044,,1,Desktop,0.0,,,1
2,5912,User5912,41.0,2,,,Education,Night,1
5,5942,User5942,,2,,1.0,Social Media,Evening,1
6,7808,User7808,26.0,0,Desktop,0.0,,,1


In [2]:
# Split dataset based on ad_position
group_A = ad_data[ad_data['ad_position'] == 0]  # Top
group_B = ad_data[ad_data['ad_position'] == 1]  # Bottom
# Display the sizes of each group
print(f"Group A (Top position) shape: {group_A.shape}")
print(f"Group B (Bottom position) shape: {group_B.shape}")
print("\nGroup A (Top) - First 5 rows:")
display(group_A.head())
print("\nGroup B (Bottom) - First 5 rows:")
display(group_B.head())

Group A (Top position) shape: (2597, 9)
Group B (Bottom position) shape: (2817, 9)

Group A (Top) - First 5 rows:


Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,-1,Desktop,0.0,Shopping,Afternoon,1
1,3044,User3044,,1,Desktop,0.0,,,1
6,7808,User7808,26.0,0,Desktop,0.0,,,1
15,7529,User7529,,-1,,0.0,Entertainment,Afternoon,0
18,2124,User2124,,1,Desktop,0.0,,Evening,1



Group B (Bottom) - First 5 rows:


Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
5,5942,User5942,,2,,1.0,Social Media,Evening,1
8,7993,User7993,,2,Mobile,1.0,Social Media,,1
9,4509,User4509,,-1,,1.0,Education,Afternoon,1
10,2595,User2595,,-1,,1.0,,Morning,1
11,7466,User7466,47.0,-1,Mobile,1.0,,Afternoon,1


In [3]:

from statsmodels.stats.proportion import proportions_ztest

# Number of clicks (successes) in each group
clicks_A = group_A['click'].sum()
clicks_B = group_B['click'].sum()

# Number of users in each group
n_A = group_A.shape[0]
n_B = group_B.shape[0]

# Perform two-sample z-test for proportions
count = np.array([clicks_A, clicks_B])
nobs = np.array([n_A, n_B])

z_stat, p_val = proportions_ztest(count, nobs)

# Display results
print(f"Z-statistic: {z_stat}")
print(f"P-value: {p_val}")

if p_val < 0.05:
    print("Result: Statistically significant difference in click rates between Top and Bottom ad positions.")
else:
    print("Result: No statistically significant difference in click rates between Top and Bottom ad positions.")

Z-statistic: -4.064215410098865
P-value: 4.819430188759422e-05
Result: Statistically significant difference in click rates between Top and Bottom ad positions.


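**Result:**

There is a statistically significant difference in click rates between the Top and Bottom ad positions.

**Justification:**

The two-sample z-test for proportions gives z ≈ -4.06 with p ≈ 4.8e-05, well below the 0.05 significance level, so we reject the null hypothesis of equal click rates. Because the counts are passed in the order (Top, Bottom), the negative z-statistic means the Top position has the lower observed click rate.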
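
To make the test statistic less of a black box, here is a minimal sketch of a pooled two-proportion z-test written with the standard library only. It assumes the pooled-variance, two-sided form, which should correspond to the default settings of `proportions_ztest`. The function name `two_proportion_ztest` and the click counts are hypothetical and are not taken from the ad click dataset.

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Pooled two-sample z-test for equal proportions (two-sided)."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Under H0 both groups share one click rate, estimated from the pooled data
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (not from ad_click_dataset.csv)
z, p = two_proportion_ztest(clicks_a=600, n_a=1000, clicks_b=650, n_b=1000)
print(f"z = {z:.4f}, p = {p:.4f}")
```

The sign of `z` follows the argument order: it is negative whenever the first group has the lower click rate.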

### Part 2: Covariate Shift Detection Using Air Quality Data


In [4]:
# Loading using pandas
import pandas as pd
import numpy as np

# Loading test1.csv, test2.csv, and train.csv from Air_Quality folder
train_data = pd.read_csv('Air_Quality/train.csv')
test1_data = pd.read_csv('Air_Quality/test1.csv')
test2_data = pd.read_csv('Air_Quality/test2.csv')

print(f"Train dataset loaded successfully with shape: {train_data.shape}")
print(f"Test1 dataset loaded successfully with shape: {test1_data.shape}")
print(f"Test2 dataset loaded successfully with shape: {test2_data.shape}")

# Printing column names for each dataset
print("\nDataset columns:")
print(train_data.columns.tolist())  

Train dataset loaded successfully with shape: (3200, 18)
Test1 dataset loaded successfully with shape: (800, 18)
Test2 dataset loaded successfully with shape: (800, 18)

Dataset columns:
['Unnamed: 0', 'Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH', 'Unnamed: 15', 'Unnamed: 16']


In [5]:
# KS test for covariate shift
from scipy.stats import ks_2samp

# Performing KS test for test1.csv against train.csv
ks_stat_test1, p_value_test1 = ks_2samp(train_data['NO2(GT)'], test1_data['NO2(GT)']) # NO2(GT) is the target variable
print(f"\nKS test statistic for test1.csv: {ks_stat_test1:.8f}, p-value: {p_value_test1:.8f}")
if p_value_test1 < 0.05:
    print("Result: Significant difference in distributions between train.csv and test1.csv.")
else:
    print("Result: No significant difference in distributions between train.csv and test1.csv.")

# Performing KS test for test2.csv against train.csv
ks_stat_test2, p_value_test2 = ks_2samp(train_data['NO2(GT)'], test2_data['NO2(GT)']) # NO2(GT) is the target variable
print(f"\nKS test statistic for test2.csv: {ks_stat_test2:.8f}, p-value: {p_value_test2:.8f}")
if p_value_test2 < 0.05:
    print("Result: Significant difference in distributions between train.csv and test2.csv.")
else:
    print("Result: No significant difference in distributions between train.csv and test2.csv.")

# Determining which test set exhibits a covariate shift using Kolmogorov–Smirnov test results
if p_value_test2 < 0.05 and p_value_test1 >= 0.05:
    print("\nConclusion: test2.csv exhibits a greater covariate shift compared to test1.csv.")
elif p_value_test1 < 0.05 and p_value_test2 >= 0.05:
    print("\nConclusion: test1.csv exhibits a greater covariate shift compared to test2.csv.")
elif p_value_test1 < 0.05 and p_value_test2 < 0.05:
    if ks_stat_test1 > ks_stat_test2:
        print("\nConclusion: test1.csv exhibits a greater covariate shift compared to test2.csv.")
    else:
        print("\nConclusion: test2.csv exhibits a greater covariate shift compared to test1.csv.")
else:
    print("\nConclusion: Neither test1.csv nor test2.csv exhibits a significant covariate shift.")



KS test statistic for test1.csv: 0.01906250, p-value: 0.97219406
Result: No significant difference in distributions between train.csv and test1.csv.

KS test statistic for test2.csv: 0.40750000, p-value: 0.00000000
Result: Significant difference in distributions between train.csv and test2.csv.

Conclusion: test2.csv exhibits a greater covariate shift compared to test1.csv.


**Result:**

test2.csv exhibits a greater covariate shift compared to test1.csv.

**Justification:**

- The Kolmogorov-Smirnov statistic for test2.csv (0.4075) is far larger than that for test1.csv (0.0191), indicating a much greater difference between the train and test distributions of `NO2(GT)`, the only variable tested here.
- The p-value for test2.csv rounds to zero at eight decimal places, so the null hypothesis of identical distributions is rejected at the 0.05 level; for test1.csv, p ≈ 0.97, so there is no evidence of a shift.
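
The KS statistic is the largest vertical gap between the two empirical CDFs. The sketch below computes it with the standard library and applies it column by column, which is one way the check could be extended beyond `NO2(GT)` to several features. The helper `ks_statistic`, the dictionaries, and all numeric values are hypothetical, and the code assumes the samples contain no missing values.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Made-up values for two feature columns (not taken from the Air_Quality files)
train_cols = {
    "T": [10.0, 12.5, 14.0, 15.5, 18.0, 20.0],
    "RH": [40.0, 45.0, 50.0, 55.0, 60.0, 65.0],
}
test_cols = {
    "T": [11.0, 13.0, 14.5, 16.0, 17.5, 19.0],
    "RH": [70.0, 72.0, 75.0, 78.0, 80.0, 85.0],
}

# One KS statistic per feature; larger values suggest a larger shift
shift_by_col = {col: ks_statistic(train_cols[col], test_cols[col])
                for col in train_cols}
for col, d in shift_by_col.items():
    print(f"{col}: D = {d:.4f}")
```

In practice `scipy.stats.ks_2samp` also returns a p-value, so the hand-written helper is only meant to show what the statistic measures.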