
# Assignment 10

## Part 1: A/B Testing using Ad Click Prediction

In [1]:
from google.colab import drive
import pandas as pd
import os

### 1. Load the dataset into a pandas DataFrame.

In [2]:
# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
ad_click_data = pd.read_csv("/content/ad_click_dataset.csv")
print(f"Shape: {ad_click_data.shape}")

Shape: (10000, 9)


In [5]:
ad_click_data

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,,Desktop,Top,Shopping,Afternoon,1
1,3044,User3044,,Male,Desktop,Top,,,1
2,5912,User5912,41.0,Non-Binary,,Side,Education,Night,1
3,5418,User5418,34.0,Male,,,Entertainment,Evening,1
4,9452,User9452,39.0,Non-Binary,,,Social Media,Morning,0
...,...,...,...,...,...,...,...,...,...
9995,8510,User8510,,,Mobile,Top,Education,,0
9996,7843,User7843,,Female,Desktop,Bottom,Entertainment,,0
9997,3914,User3914,,Male,Mobile,Side,,Morning,0
9998,7924,User7924,,,Desktop,,Shopping,Morning,1


### 2. Perform necessary data cleaning and preprocessing: [10 points]

#### a. Handle missing values

In [6]:
ad_click_data.dropna(subset=['ad_position', 'click'], inplace=True)
print(f"Shape after dropping the rows with missing values: {ad_click_data.shape}")

Shape after dropping the rows with missing values: (8000, 9)
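
Before dropping rows, it can help to see how many values are missing in each column. The sketch below uses a small made-up frame (`toy` is hypothetical, not the assignment data) to show `isna().sum()` next to the same `dropna(subset=...)` call used above.

```python
import pandas as pd

# Hypothetical toy frame standing in for ad_click_data
toy = pd.DataFrame({
    "ad_position": ["Top", None, "Side", "Bottom"],
    "gender": [None, "Male", None, "Female"],
    "click": [1, 0, 1, 0],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows where a column needed for the test is missing
cleaned = toy.dropna(subset=["ad_position", "click"])
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Restricting `subset` to the columns the test needs keeps rows whose other fields (such as `gender`) are missing.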


In [7]:
ad_click_data

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,,Desktop,Top,Shopping,Afternoon,1
1,3044,User3044,,Male,Desktop,Top,,,1
2,5912,User5912,41.0,Non-Binary,,Side,Education,Night,1
5,5942,User5942,,Non-Binary,,Bottom,Social Media,Evening,1
6,7808,User7808,26.0,Female,Desktop,Top,,,1
...,...,...,...,...,...,...,...,...,...
9992,5818,User5818,,,Tablet,Top,Social Media,Night,1
9995,8510,User8510,,,Mobile,Top,Education,,0
9996,7843,User7843,,Female,Desktop,Bottom,Entertainment,,0
9997,3914,User3914,,Male,Mobile,Side,,Morning,0


#### b. Convert categorical columns  (e.g., gender, ad_position)

In [8]:
# Inspect the unique values in 'gender' and 'ad_position'
unique_genders = ad_click_data['gender'].unique()
unique_ad_positions = ad_click_data['ad_position'].unique()

print(f"Unique genders: {unique_genders}")
print(f"Unique ad_positions: {unique_ad_positions}")

Unique genders: [nan 'Male' 'Non-Binary' 'Female']
Unique ad_positions: ['Top' 'Side' 'Bottom']


In [9]:
# Map 'gender' and 'ad_position' to numeric codes; missing values are encoded as -2
gender_mapping = {'Male': 1, 'Female': 0, 'Non-Binary': -1}
ad_click_data['gender'] = ad_click_data['gender'].map(gender_mapping).fillna(-2).astype('category') # Encode NaN as -2

ad_position_mapping = {'Top': 0, 'Bottom': 1, 'Side': -1}
ad_click_data['ad_position'] = ad_click_data['ad_position'].map(ad_position_mapping).fillna(-2).astype('category') # Encode NaN as -2

# Make sure the 'click' column is stored as integers (0 or 1)
ad_click_data['click'] = ad_click_data['click'].astype(int)
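
One thing to watch with `Series.map` and a dictionary: any label that is not a key in the dictionary becomes NaN, just like a genuinely missing value. The sketch below, on a made-up series (the label `"Left"` is hypothetical), shows one way to list such labels before they are folded into the `-2` code.

```python
import pandas as pd

# Made-up values; "Left" is a hypothetical label that the mapping does not cover
positions = pd.Series(["Top", "Bottom", "Side", "Left", None])
ad_position_mapping = {"Top": 0, "Bottom": 1, "Side": -1}

mapped = positions.map(ad_position_mapping)

# Labels that were present in the data but have no entry in the mapping
unmapped = positions[mapped.isna() & positions.notna()].unique().tolist()
print(f"Unmapped labels: {unmapped}")

# Same encoding step as in the cell above: NaN becomes -2
encoded = mapped.fillna(-2).astype(int)
print(encoded.tolist())
```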

In [10]:
display(ad_click_data)

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,-2.0,Desktop,0,Shopping,Afternoon,1
1,3044,User3044,,1.0,Desktop,0,,,1
2,5912,User5912,41.0,-1.0,,-1,Education,Night,1
5,5942,User5942,,-1.0,,1,Social Media,Evening,1
6,7808,User7808,26.0,0.0,Desktop,0,,,1
...,...,...,...,...,...,...,...,...,...
9992,5818,User5818,,-2.0,Tablet,0,Social Media,Night,1
9995,8510,User8510,,-2.0,Mobile,0,Education,,0
9996,7843,User7843,,0.0,Desktop,1,Entertainment,,0
9997,3914,User3914,,1.0,Mobile,-1,,Morning,0


### 3. Split the dataset into two groups: [10 points]
    a. Group A: Users with ad_position = 0 (Top)
    b. Group B: Users with ad_position = 1  (Bottom)

In [11]:
# Split the dataset into two groups
group_A = ad_click_data[ad_click_data['ad_position'] == 0]
group_B = ad_click_data[ad_click_data['ad_position'] == 1]

In [12]:
# Print the first few rows of each group to verify
print("Group A (ad_position = 0): Top")
display(group_A.head())

print("\nGroup B (ad_position = 1): Bottom")
display(group_B.head())

Group A (ad_position = 0): Top


Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,-2.0,Desktop,0,Shopping,Afternoon,1
1,3044,User3044,,1.0,Desktop,0,,,1
6,7808,User7808,26.0,0.0,Desktop,0,,,1
15,7529,User7529,,-2.0,,0,Entertainment,Afternoon,0
18,2124,User2124,,1.0,Desktop,0,,Evening,1



Group B (ad_position = 1): Bottom


Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
5,5942,User5942,,-1.0,,1,Social Media,Evening,1
8,7993,User7993,,-1.0,Mobile,1,Social Media,,1
9,4509,User4509,,-2.0,,1,Education,Afternoon,1
10,2595,User2595,,-2.0,,1,,Morning,1
11,7466,User7466,47.0,-2.0,Mobile,1,,Afternoon,1


### 4. Use statsmodels' proportions_ztest function to perform an independent two-sample z-test between Group A and Group B.

In [13]:
from statsmodels.stats.proportion import proportions_ztest

In [15]:
# Calculate the number of clicks and total observations for each group
successes = [group_A['click'].sum(), group_B['click'].sum()]
lengths = [len(group_A), len(group_B)]

# Perform the two-sample z-test
z_score, p_value = proportions_ztest(successes, lengths)
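
To make the test statistic less of a black box, here is a minimal hand-rolled sketch of a two-sided, two-proportion z-test with a pooled standard error, which should correspond to the default behaviour of `proportions_ztest`. The function name `two_proportion_ztest` and the counts are made up for illustration; they are not the notebook's values.

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 600 clicks out of 1000 vs 650 clicks out of 1000
z, p = two_proportion_ztest(600, 1000, 650, 1000)
print(f"z = {z:.4f}, p = {p:.4f}")
```

The sign of `z` follows the first group's proportion minus the second's, so swapping the groups flips the sign but leaves the p-value unchanged.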

### 5. Print the following:
    a. The z-score [10 points]
    b. The p-value [10 points]

In [16]:
# Print the results
print(f"Z-Score: {z_score}")
print(f"P-value: {p_value}")

Z-Score: -4.064215410098865
P-value: 4.819430188759425e-05


### 6. Interpret the result: Is there a statistically significant difference in click-through rates between the two groups? Justify your answer. [10 points]

To determine whether there is a statistically significant difference in click-through rates between the two groups, we analyze the **Z-score** and **p-value** obtained from the hypothesis test.

- **Z-Score**: -4.0642  
- **P-Value**: 0.0000482 (approximately)

A commonly used significance level is **α = 0.05**. The null hypothesis states that Group A (Top) and Group B (Bottom) have the same click-through rate; if the p-value is below α, we reject it.

In this case:

- The **p-value (0.0000482)** is **much smaller** than 0.05.
- Therefore, we **reject the null hypothesis** of equal click-through rates.
- The **negative Z-score** means the sample click-through rate of Group A (Top) is lower than that of Group B (Bottom), because the statistic is computed from the first group's proportion minus the second's.

**Conclusion**:  
There is a **statistically significant difference** in click-through rates between the two groups. A difference this large would be very unlikely if both ad positions had the same true click-through rate.
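
The test reports whether the rates differ, but the notebook never prints the rates themselves. A short sketch of how the per-group click-through rates could be computed is shown here on a made-up frame (`toy` and its values are hypothetical); on the real data the same `groupby` would be applied to `ad_click_data`.

```python
import pandas as pd

# Hypothetical data: ad_position 0 = Top (Group A), 1 = Bottom (Group B)
toy = pd.DataFrame({
    "ad_position": [0, 0, 0, 0, 1, 1, 1, 1],
    "click":       [1, 0, 0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the click-through rate
ctr = toy.groupby("ad_position")["click"].mean()
diff = ctr.loc[0] - ctr.loc[1]

print(ctr)
print(f"CTR difference (A - B): {diff:+.2f}")
```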

## Part 2: Covariate Shift Detection Using Air Quality Data

### 1. You are provided with 3 datasets via this Google Drive link:
    a. train.csv
    b. test1.csv
    c. test2.csv

### 2. Load all three datasets using pandas. [10 points]

In [17]:
# Load the datasets
train_df = pd.read_csv('/content/train.csv')
test1_df = pd.read_csv('/content/test1.csv')
test2_df = pd.read_csv('/content/test2.csv')

In [18]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,1849,26/05/2004,19.00.00,-200,1130.0,-200.0,227,1368.0,-200.0,933.0,-200.0,1709.0,1269.0,267,195,6754,,
1,2533,24/06/2004,07.00.00,12,1030.0,-200.0,69,851.0,102.0,824.0,68.0,1700.0,983.0,219,570,14742,,
2,3047,15/07/2004,17.00.00,32,1164.0,-200.0,203,1306.0,259.0,648.0,198.0,1886.0,1218.0,355,191,10888,,
3,805,13/04/2004,07.00.00,39,1496.0,524.0,191,1272.0,328.0,667.0,130.0,2011.0,1399.0,110,642,8398,,
4,2962,12/07/2004,04.00.00,-200,780.0,-200.0,18,568.0,24.0,1200.0,34.0,1331.0,501.0,199,513,11803,,


In [19]:
test1_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,3123,18/07/2004,21.00.00,12,1067.0,-200.0,90,938.0,102.0,825.0,99.0,1520.0,912.0,297,248,10160,,
1,877,16/04/2004,07.00.00,45,1657.0,523.0,232,1384.0,352.0,579.0,109.0,2176.0,1600.0,128,710,10428,,
2,3457,01/08/2004,19.00.00,14,1037.0,-200.0,80,900.0,75.0,817.0,95.0,1584.0,619.0,331,327,16200,,
3,1494,12/05/2004,00.00.00,17,1122.0,-200.0,87,926.0,105.0,805.0,88.0,1619.0,1174.0,169,588,11250,,
4,713,09/04/2004,11.00.00,26,-200.0,262.0,-2000,-200.0,219.0,-200.0,121.0,-200.0,-200.0,-200,-200,-200,,


In [20]:
test2_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,8500,27/02/2005,22.00.00,10,875.0,-200.0,21,594.0,128.0,1079.0,105.0,793.0,451.0,45,480,4085,,
1,8501,27/02/2005,23.00.00,13,943.0,-200.0,39,703.0,169.0,950.0,119.0,870.0,581.0,43,486,4069,,
2,8502,28/02/2005,00.00.00,16,947.0,-200.0,38,697.0,215.0,913.0,150.0,878.0,698.0,40,500,4115,,
3,8503,28/02/2005,01.00.00,10,865.0,-200.0,18,566.0,111.0,1119.0,94.0,797.0,423.0,40,529,4338,,
4,8504,28/02/2005,02.00.00,6,823.0,-200.0,10,503.0,60.0,1268.0,56.0,755.0,332.0,40,510,4200,,


In [21]:
# Keep only rows with a non-negative NO2(GT) reading (this removes the -200 entries)
train_df = train_df[train_df['NO2(GT)'] >= 0]
test1_df = test1_df[test1_df['NO2(GT)'] >= 0]
test2_df = test2_df[test2_df['NO2(GT)'] >= 0]
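
The displayed rows contain -200 in several columns, which looks like a placeholder for missing readings rather than a real measurement (this is an assumption; the assignment text does not say so). Under that assumption, an equivalent way to clean the column is to turn the placeholder into NaN and drop it. The frame `toy` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; -200 is assumed to mark a missing value
toy = pd.DataFrame({"NO2(GT)": [68.0, -200.0, 198.0, 130.0, -200.0]})

# Option 1: boolean filter, as in the cell above
filtered = toy[toy["NO2(GT)"] >= 0]["NO2(GT)"]

# Option 2: make the missingness explicit, then drop it
no2 = toy["NO2(GT)"].replace(-200, np.nan).dropna()

print(filtered.tolist())
print(no2.tolist())
```

Both options keep the same rows here; the second makes the intent (missing data, not negative concentrations) visible in the code.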

### 3. For each test dataset (test1.csv and test2.csv), compare it with train.csv using the Kolmogorov–Smirnov test (scipy.stats.ks_2samp).

Perform the KS test on the NO2(GT) column to identify whether there are any distributional differences. [20 points]

In [22]:
from scipy.stats import ks_2samp

# Kolmogorov-Smirnov test for Test1 vs Train
ks_stat_test1, p_value_test1 = ks_2samp(train_df['NO2(GT)'], test1_df['NO2(GT)'])

# Kolmogorov-Smirnov test for Test2 vs Train
ks_stat_test2, p_value_test2 = ks_2samp(train_df['NO2(GT)'], test2_df['NO2(GT)'])
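
The KS statistic returned by `ks_2samp` is the largest vertical gap between the two empirical cumulative distribution functions. A minimal standard-library sketch of that statistic (not the p-value) follows; `ks_statistic` is a hypothetical helper and the samples are made up.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Hypothetical samples
same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
print(same, overlap, disjoint)
```

A statistic near 0 means the two empirical distributions almost coincide, while a value near 1 means they barely overlap.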

### 4. Report the KS statistic and p-value for each feature. [10 points]

In [31]:
# Report the results
print("=== Kolmogorov–Smirnov Test Results ===")
print("\n")
print("Test1 vs Train")
print(f"KS Statistic: {ks_stat_test1} ≈ {ks_stat_test1:.5f}")
print(f"P-Value: {p_value_test1} ≈ {p_value_test1:.5f}")
print("\n")
print("Test2 vs Train")
print(f"KS Statistic: {ks_stat_test2} ≈ {ks_stat_test2:.5f}")
print(f"P-Value: {p_value_test2} ≈ {p_value_test2:.5f}")

=== Kolmogorov–Smirnov Test Results ===


Test1 vs Train
KS Statistic: 0.017062220028073977 ≈ 0.01706
P-Value: 0.9971378232852736 ≈ 0.99714


Test2 vs Train
KS Statistic: 0.3688536442438679 ≈ 0.36885
P-Value: 2.53172387531317e-74 ≈ 0.00000


### 5. Determine which of the two test datasets (test1.csv or test2.csv) exhibits a covariate shift relative to the training dataset (train.csv). Use the results of the Kolmogorov–Smirnov test to support your answer. [10 points]

In [32]:
# Determine which dataset exhibits covariate shift
print("\n=== Covariate Shift Check ===")
if p_value_test1 < 0.05:
    print("Test1 shows a significant distributional change compared to the training set.")
else:
    print("Test1 is statistically similar to the training set (no covariate shift detected).")

if p_value_test2 < 0.05:
    print("Test2 shows a significant distributional change compared to the training set.")
else:
    print("Test2 is statistically similar to the training set (no covariate shift detected).")



=== Covariate Shift Check ===
Test1 is statistically similar to the training set (no covariate shift detected).
Test2 shows a significant distributional change compared to the training set.


In [33]:
print("\n=== Final Assessment ===")
if p_value_test1 < 0.05 and p_value_test2 >= 0.05:
    print("Conclusion: Test1 dataset exhibits covariate shift; Test2 does not.")
elif p_value_test2 < 0.05 and p_value_test1 >= 0.05:
    print("Conclusion: Test2 dataset exhibits covariate shift; Test1 does not.")
elif p_value_test1 < 0.05 and p_value_test2 < 0.05:
    if ks_stat_test1 > ks_stat_test2:
        print("Conclusion: Both test datasets show shift, but Test1 differs more from the training set.")
    elif ks_stat_test2 > ks_stat_test1:
        print("Conclusion: Both test datasets show shift, but Test2 differs more from the training set.")
    else:
        print("Conclusion: Both test datasets show shift to the same extent.")
else:
    print("Conclusion: Neither Test1 nor Test2 shows significant covariate shift.")


=== Final Assessment ===
Conclusion: Test2 dataset exhibits covariate shift; Test1 does not.
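
The same decision rule can be written more compactly by keeping the p-values in a dictionary, which also scales to more than two test sets. This is only a sketch: the p-values are copied from the output above, and `ALPHA` and `shift_detected` are names introduced here for illustration.

```python
ALPHA = 0.05  # significance level used in the cells above

# p-values copied from the KS test output above
p_values = {
    "Test1": 0.9971378232852736,
    "Test2": 2.53172387531317e-74,
}

# A dataset is flagged when the KS test rejects equality of distributions
shift_detected = {name: p < ALPHA for name, p in p_values.items()}

for name, shifted in shift_detected.items():
    verdict = "shift detected" if shifted else "no shift detected"
    print(f"{name}: {verdict}")
```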


### Inference: Kolmogorov–Smirnov Test Results

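- For **Test1** (KS statistic = 0.01706, p-value = 0.99714), we cannot reject the hypothesis that NO2(GT) follows the same distribution as in the training set; the two empirical distributions are **very similar**.
  
- For **Test2** (KS statistic = 0.36885, p-value ≈ 2.53e-74, which prints as 0.00000 at five decimal places), the distribution differs significantly from the training set, suggesting **a substantial covariate shift**.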

### Conclusion:
- **Test1** shows **no detectable covariate shift** in NO2(GT).
- **Test2** exhibits **a significant covariate shift** in NO2(GT).
