# STTAI - Lab Assignment 10

Name | Roll Number
---|---
Romit Mohane | 23110279
Rudra Pratap Singh | 23110281

Using real-world data, this assignment introduces key concepts in A/B testing and covariate shift detection. We performed hypothesis testing with the statsmodels and scipy libraries and checked for distributional shift between datasets using the two-sample Kolmogorov–Smirnov test.

## Part 1: A/B Testing using Ad Click Prediction


In [71]:
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import ks_2samp

### 1. Load the ad_click_dataset.csv into a pandas dataframe


In [45]:
#!/bin/bash
!curl -L -o ad-click-prediction-dataset.zip https://www.kaggle.com/api/v1/datasets/download/marius2303/ad-click-prediction-dataset
!unzip ad-click-prediction-dataset.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 83158  100 83158    0     0  56283      0  0:00:01  0:00:01 --:--:--  398k
Archive:  ad-click-prediction-dataset.zip
replace ad_click_dataset.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ad_click_dataset.csv    


In [46]:
df = pd.read_csv('ad_click_dataset.csv')

df

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,,Desktop,Top,Shopping,Afternoon,1
1,3044,User3044,,Male,Desktop,Top,,,1
2,5912,User5912,41.0,Non-Binary,,Side,Education,Night,1
3,5418,User5418,34.0,Male,,,Entertainment,Evening,1
4,9452,User9452,39.0,Non-Binary,,,Social Media,Morning,0
...,...,...,...,...,...,...,...,...,...
9995,8510,User8510,,,Mobile,Top,Education,,0
9996,7843,User7843,,Female,Desktop,Bottom,Entertainment,,0
9997,3914,User3914,,Male,Mobile,Side,,Morning,0
9998,7924,User7924,,,Desktop,,Shopping,Morning,1


### 2. Perform necessary data cleaning and preprocessing: [10 points]

  #### a. Handle missing values


In [47]:
# Print unique values for each column
for col in df.columns:
    print(f"Unique values in column '{col}':")
    print(df[col].unique())
    print()

Unique values in column 'id':
[ 670 3044 5912 ... 7843 3914 3056]

Unique values in column 'full_name':
['User670' 'User3044' 'User5912' ... 'User7843' 'User3914' 'User3056']

Unique values in column 'age':
[22. nan 41. 34. 39. 26. 40. 47. 19. 56. 24. 52. 42. 36. 43. 62. 45. 37.
 31. 58. 59. 48. 38. 49. 30. 46. 54. 44. 27. 57. 28. 51. 25. 61. 32. 64.
 23. 55. 21. 20. 35. 53. 33. 29. 63. 50. 18. 60.]

Unique values in column 'gender':
[nan 'Male' 'Non-Binary' 'Female']

Unique values in column 'device_type':
['Desktop' nan 'Mobile' 'Tablet']

Unique values in column 'ad_position':
['Top' 'Side' nan 'Bottom']

Unique values in column 'browsing_history':
['Shopping' nan 'Education' 'Entertainment' 'Social Media' 'News']

Unique values in column 'time_of_day':
['Afternoon' nan 'Night' 'Evening' 'Morning']

Unique values in column 'click':
[1 0]



In [48]:
# Check which columns contain missing values
df.isnull().any()

Unnamed: 0,0
id,False
full_name,False
age,True
gender,True
device_type,True
ad_position,True
browsing_history,True
time_of_day,True
click,False


In [49]:
# 1) DROP rows with any nulls
df_nullhandled_dropped = df.dropna()

# 2) IMPUTE:
df_nullhandled_imputated = df.copy()

# median imputation for numbers
median_age = df_nullhandled_imputated['age'].median()
df_nullhandled_imputated['age'] = df_nullhandled_imputated['age'].fillna(median_age)

# mode imputation for categorical
for col in ['gender', 'device_type', 'ad_position', 'browsing_history', 'time_of_day']:
    mode_val = df_nullhandled_imputated[col].mode()[0]
    df_nullhandled_imputated[col] = df_nullhandled_imputated[col].fillna(mode_val)

# Quick check:
print("Dropped – any nulls left?")
print(df_nullhandled_dropped.isnull().any(), "\n")

print("Imputed – any nulls left?")
print(df_nullhandled_imputated.isnull().any())


Dropped – any nulls left?
id                  False
full_name           False
age                 False
gender              False
device_type         False
ad_position         False
browsing_history    False
time_of_day         False
click               False
dtype: bool 

Imputed – any nulls left?
id                  False
full_name           False
age                 False
gender              False
device_type         False
ad_position         False
browsing_history    False
time_of_day         False
click               False
dtype: bool
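
The trade-off between the two strategies above is easiest to see by counting rows. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from `ad_click_dataset.csv`): dropping keeps only fully observed rows, while imputation keeps every row but fills the gaps with the median or mode.

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative (not rows from ad_click_dataset.csv)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 41.0, 34.0, np.nan],
    "gender": ["Male", "Male", None, "Female", "Male"],
    "click": [1, 1, 0, 1, 0],
})

# Strategy 1: drop every row that has at least one missing value
toy_dropped = toy.dropna()

# Strategy 2: median for the numeric column, mode for the categorical column
toy_imputed = toy.copy()
toy_imputed["age"] = toy_imputed["age"].fillna(toy_imputed["age"].median())
toy_imputed["gender"] = toy_imputed["gender"].fillna(toy_imputed["gender"].mode()[0])

print("rows kept after dropping:", len(toy_dropped))
print("rows kept after imputing:", len(toy_imputed))
```

On the real data the same comparison would be `len(df_nullhandled_dropped)` versus `len(df_nullhandled_imputated)`.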


In [50]:
df_nullhandled_dropped

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
17,188,User188,56.0,Female,Tablet,Bottom,News,Morning,1
25,4890,User4890,43.0,Male,Tablet,Bottom,Education,Afternoon,1
33,4985,User4985,37.0,Male,Mobile,Top,News,Evening,0
52,9888,User9888,49.0,Male,Mobile,Top,News,Morning,1
102,8201,User8201,59.0,Female,Desktop,Bottom,Social Media,Morning,0
...,...,...,...,...,...,...,...,...,...
9951,7268,User7268,28.0,Female,Desktop,Bottom,News,Evening,1
9952,5912,User5912,41.0,Non-Binary,Mobile,Side,Education,Night,1
9960,9638,User9638,64.0,Non-Binary,Desktop,Top,Entertainment,Morning,0
9986,5574,User5574,52.0,Female,Desktop,Bottom,Shopping,Afternoon,1


In [51]:
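# Print unique values for each column.
# Note: the loop iterates over the imputed frame's columns but reads from the
# original `df`, so NaN still appears in the output below; printing
# df_nullhandled_imputated[col].unique() would show the imputed values instead.

for col in df_nullhandled_imputated.columns:
    print(f"Unique values in column '{col}':")
    print(df[col].unique())
    print()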

Unique values in column 'id':
[ 670 3044 5912 ... 7843 3914 3056]

Unique values in column 'full_name':
['User670' 'User3044' 'User5912' ... 'User7843' 'User3914' 'User3056']

Unique values in column 'age':
[22. nan 41. 34. 39. 26. 40. 47. 19. 56. 24. 52. 42. 36. 43. 62. 45. 37.
 31. 58. 59. 48. 38. 49. 30. 46. 54. 44. 27. 57. 28. 51. 25. 61. 32. 64.
 23. 55. 21. 20. 35. 53. 33. 29. 63. 50. 18. 60.]

Unique values in column 'gender':
[nan 'Male' 'Non-Binary' 'Female']

Unique values in column 'device_type':
['Desktop' nan 'Mobile' 'Tablet']

Unique values in column 'ad_position':
['Top' 'Side' nan 'Bottom']

Unique values in column 'browsing_history':
['Shopping' nan 'Education' 'Entertainment' 'Social Media' 'News']

Unique values in column 'time_of_day':
['Afternoon' nan 'Night' 'Evening' 'Morning']

Unique values in column 'click':
[1 0]



In [52]:
df_nullhandled_imputated

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,Female,Desktop,Top,Shopping,Afternoon,1
1,3044,User3044,39.5,Male,Desktop,Top,Entertainment,Morning,1
2,5912,User5912,41.0,Non-Binary,Desktop,Side,Education,Night,1
3,5418,User5418,34.0,Male,Desktop,Bottom,Entertainment,Evening,1
4,9452,User9452,39.0,Non-Binary,Desktop,Bottom,Social Media,Morning,0
...,...,...,...,...,...,...,...,...,...
9995,8510,User8510,39.5,Female,Mobile,Top,Education,Morning,0
9996,7843,User7843,39.5,Female,Desktop,Bottom,Entertainment,Morning,0
9997,3914,User3914,39.5,Male,Mobile,Side,Entertainment,Morning,0
9998,7924,User7924,39.5,Female,Desktop,Bottom,Shopping,Morning,1


#### b. Convert categorical columns  (e.g., gender, ad_position)


In [77]:
# Work on the dropped values DF
df_current = df_nullhandled_dropped.copy()
# Keep only the two positions of interest
df_current = df_current[df_current['ad_position'].isin(['Top','Bottom'])]

# Map to 0/1
df_current['ad_position_flag'] = df_current['ad_position'].map({'Top': 0, 'Bottom': 1})


In [78]:
df_current # df with dropped rows where null values were present

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click,ad_position_flag
17,188,User188,56.0,Female,Tablet,Bottom,News,Morning,1,1
25,4890,User4890,43.0,Male,Tablet,Bottom,Education,Afternoon,1,1
33,4985,User4985,37.0,Male,Mobile,Top,News,Evening,0,0
52,9888,User9888,49.0,Male,Mobile,Top,News,Morning,1,0
102,8201,User8201,59.0,Female,Desktop,Bottom,Social Media,Morning,0,1
...,...,...,...,...,...,...,...,...,...,...
9928,7790,User7790,43.0,Non-Binary,Mobile,Top,Social Media,Morning,0,0
9951,7268,User7268,28.0,Female,Desktop,Bottom,News,Evening,1,1
9960,9638,User9638,64.0,Non-Binary,Desktop,Top,Entertainment,Morning,0,0
9986,5574,User5574,52.0,Female,Desktop,Bottom,Shopping,Afternoon,1,1
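
The mapping above converts only `ad_position`. As a sketch of how a multi-valued column such as `gender` could be converted as well, one common option (an assumption here, not something this notebook does) is one-hot encoding with `pandas.get_dummies`. The frame below is made up for illustration.

```python
import pandas as pd

# Toy data, purely illustrative
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Non-Binary", "Male"],
    "ad_position": ["Top", "Bottom", "Top", "Bottom"],
})

# Binary column: an explicit map keeps the meaning of 0 and 1 obvious
toy["ad_position_flag"] = toy["ad_position"].map({"Top": 0, "Bottom": 1})

# Multi-valued column: one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)

print(encoded.columns.tolist())
```

An explicit map suits the A/B split because the flag directly names the two groups, whereas one-hot columns avoid implying an order among categories that have none.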


###  3. Split the dataset into two groups: [10 points]
a. Group A: Users with ad_position = 0 (Top)


b. Group B: Users with ad_position = 1  (Bottom)

In [79]:
# Group A: Top (flag=0)
groupA = df_current[df_current['ad_position_flag']==0]

# Group B: Bottom (flag=1)
groupB = df_current[df_current['ad_position_flag']==1]

In [80]:
groupA

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click,ad_position_flag
33,4985,User4985,37.0,Male,Mobile,Top,News,Evening,0,0
52,9888,User9888,49.0,Male,Mobile,Top,News,Morning,1,0
158,3007,User3007,42.0,Male,Desktop,Top,Shopping,Night,0,0
204,8530,User8530,52.0,Female,Mobile,Top,Social Media,Afternoon,1,0
231,4625,User4625,33.0,Non-Binary,Mobile,Top,News,Morning,0,0
...,...,...,...,...,...,...,...,...,...,...
9888,5055,User5055,45.0,Male,Desktop,Top,Education,Morning,1,0
9915,3335,User3335,24.0,Male,Mobile,Top,Entertainment,Night,1,0
9928,7790,User7790,43.0,Non-Binary,Mobile,Top,Social Media,Morning,0,0
9960,9638,User9638,64.0,Non-Binary,Desktop,Top,Entertainment,Morning,0,0


In [81]:
groupB

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click,ad_position_flag
17,188,User188,56.0,Female,Tablet,Bottom,News,Morning,1,1
25,4890,User4890,43.0,Male,Tablet,Bottom,Education,Afternoon,1,1
102,8201,User8201,59.0,Female,Desktop,Bottom,Social Media,Morning,0,1
154,118,User118,43.0,Female,Tablet,Bottom,Social Media,Night,0,1
170,3062,User3062,34.0,Male,Desktop,Bottom,Entertainment,Evening,1,1
...,...,...,...,...,...,...,...,...,...,...
9866,6989,User6989,28.0,Male,Mobile,Bottom,Shopping,Afternoon,1,1
9904,7267,User7267,20.0,Male,Desktop,Bottom,Shopping,Night,0,1
9925,5574,User5574,52.0,Female,Desktop,Bottom,Shopping,Afternoon,1,1
9951,7268,User7268,28.0,Female,Desktop,Bottom,News,Evening,1,1


### 4. Use the statsmodels proportions_ztest function to perform an independent two-sample z-test between Group A and Group B.


In [82]:
# number of clicks in each group
clicks = [ groupA['click'].sum(),
           groupB['click'].sum() ]

# number of users in each group
nobs   = [ len(groupA),
           len(groupB) ]
print(clicks[0]/nobs[0], clicks[1]/nobs[1], "\n")


z_score, p_value = proportions_ztest(count=clicks, nobs=nobs)
print("z‑score:", z_score)
print("p‑value:", p_value)

if p_value < 0.05:
    print("Reject null hypothesis: the two CTRs differ significantly.")
else:
    print("Fail to reject null hypothesis: no significant difference between the two CTRs.")


0.6327272727272727 0.6784452296819788 

z‑score: -1.1365075404030447
p‑value: 0.2557442115851094
Fail to reject null hypothesis: no significant difference between the two CTRs.


In our A/B test, run on the rows that remained after dropping missing values and keeping only the Top and Bottom ad positions, we compared click-through rates (CTRs) for Group A (Top, CTR ≈ 0.633) and Group B (Bottom, CTR ≈ 0.678). The two-sample z-test gave `z = -1.137` and `p = 0.256` (> 0.05), so we fail to reject H₀ that the two CTRs are equal. The negative z-score reflects that Group A (Top) had a lower observed CTR than Group B (Bottom), i.e. Bottom-positioned ads were clicked slightly more often in this sample.

**However, this difference is not statistically significant.**
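
To make the z-score and p-value less of a black box, here is a minimal sketch of the pooled two-proportion z-test written with the standard library only. The helper name `two_proportion_ztest` is hypothetical, and the counts passed to it are an assumption: the notebook does not print the group sizes, so the values below were chosen because they reproduce the printed CTRs.

```python
import math

def two_proportion_ztest(count_a, n_a, count_b, n_b):
    """Two-sided z-test for equal proportions using the pooled standard error."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    # Under H0 both groups share one proportion, estimated from the pooled data
    p_pool = (count_a + count_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed counts (not printed in the notebook), chosen to match the printed CTRs
z, p = two_proportion_ztest(174, 275, 192, 283)
print(round(z, 3), round(p, 3))
```

The sign of `z` follows the order of the arguments: it is negative whenever the first group's proportion is smaller than the second's.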

---
## Part 2: Covariate Shift Detection Using Air Quality Data

In [83]:
!unzip air-quality-dataset.zip

Archive:  air-quality-dataset.zip
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: train.csv               
replace test2.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: test2.csv               
replace test1.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: test1.csv               


### 1. You are provided with 3 datasets via this Google Drive link:
train.csv  
test1.csv  
test2.csv

### 2. Load all three datasets using pandas.

In [84]:
import pandas as pd

air_df_train = pd.read_csv('train.csv')
air_df_test1 = pd.read_csv('test1.csv')
air_df_test2 = pd.read_csv('test2.csv')

air_df_train.head()

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,1849,26/05/2004,19.00.00,-200,1130.0,-200.0,227,1368.0,-200.0,933.0,-200.0,1709.0,1269.0,267,195,6754,,
1,2533,24/06/2004,07.00.00,12,1030.0,-200.0,69,851.0,102.0,824.0,68.0,1700.0,983.0,219,570,14742,,
2,3047,15/07/2004,17.00.00,32,1164.0,-200.0,203,1306.0,259.0,648.0,198.0,1886.0,1218.0,355,191,10888,,
3,805,13/04/2004,07.00.00,39,1496.0,524.0,191,1272.0,328.0,667.0,130.0,2011.0,1399.0,110,642,8398,,
4,2962,12/07/2004,04.00.00,-200,780.0,-200.0,18,568.0,24.0,1200.0,34.0,1331.0,501.0,199,513,11803,,


In [85]:
air_df_train.columns

Index(['Unnamed: 0', 'Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)',
       'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)',
       'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH', 'Unnamed: 15',
       'Unnamed: 16'],
      dtype='object')

### 3. For each test dataset (test1.csv and test2.csv), compare it with train.csv using the Kolmogorov–Smirnov test (scipy.stats.ks_2samp). Perform the KS test on the NO2(GT) column to identify whether there are any distributional differences.

### 4. Report the KS statistic and p-value for each feature.

In [86]:
print("Mean for train set:", air_df_train['NO2(GT)'].mean())

# Perform KS test for test1.csv
ks_statistic_test1, p_value_test1 = ks_2samp(air_df_train['NO2(GT)'], air_df_test1['NO2(GT)'])
print("\nMean for test set 1:", air_df_test1['NO2(GT)'].mean())

print(f"KS Test for test1.csv:")
print(f"KS Statistic: {ks_statistic_test1}")
print(f"P-value: {p_value_test1}")

# Perform KS test for test2.csv
ks_statistic_test2, p_value_test2 = ks_2samp(air_df_train['NO2(GT)'], air_df_test2['NO2(GT)'])
print("\nMean for test set 2:", air_df_test2['NO2(GT)'].mean())

print(f"KS Test for test2.csv:")
print(f"KS Statistic: {ks_statistic_test2}")
print(f"P-value: {p_value_test2}")

if p_value_test1 < 0.05:
    print(f"\nReject the null hypothesis for test1.csv")
else:
    print(f"\nFail to reject the null hypothesis for test1.csv")

if p_value_test2 < 0.05:
    print(f"Reject the null hypothesis for test2.csv")
else:
    print(f"Fail to reject the null hypothesis for test2.csv")

Mean for train set: 45.605625

Mean for test set 1: 42.62125
KS Test for test1.csv:
KS Statistic: 0.0190625
P-value: 0.9721940612395358

Mean for test set 2: 129.6825
KS Test for test2.csv:
KS Statistic: 0.4075
P-value: 7.2019977111245e-96

Fail to reject the null hypothesis for test1.csv
Reject the null hypothesis for test2.csv


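Therefore, the two test sets differ in how their `NO2(GT)` values are distributed relative to the training set.

Only the 2nd test set exhibits a covariate shift relative to the training set:
- For test set 1 vs. the train set, the KS statistic is `0.0191` and the p-value is `0.9722`, so we fail to reject the null hypothesis that both samples come from the same distribution.
- For test set 2 vs. the train set, the KS statistic is `0.4075` and the p-value is about `7.2e-96`, which is effectively `0`.

We therefore reject the null hypothesis for `test2`, which indicates a strong covariate shift in `NO2(GT)` with respect to the `train` set. The mean also moves from about `45.6` in `train` to about `129.7` in `test2`.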
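
One caveat about this comparison: the `head()` output shows `-200` in several columns, and in the UCI Air Quality data that value is documented as a missing-value marker. Assuming the same holds for these files (the notebook does not confirm it), a KS test on raw `NO2(GT)` partly measures how many readings are missing rather than how NO2 itself is distributed. The sketch below uses synthetic data and a hypothetical helper, `ks_without_sentinel`, to show the effect.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_without_sentinel(a, b, sentinel=-200):
    """KS test after dropping values equal to the assumed missing-value marker."""
    a_clean = pd.Series(a).replace(sentinel, np.nan).dropna()
    b_clean = pd.Series(b).replace(sentinel, np.nan).dropna()
    return ks_2samp(a_clean, b_clean)

# Synthetic readings drawn from the same distribution (illustrative only)
rng = np.random.default_rng(0)
train_like = rng.normal(100, 20, size=1000)
test_like = rng.normal(100, 20, size=1000)

# Mark 30% of the "train" readings as missing using the sentinel
train_like[:300] = -200

raw = ks_2samp(train_like, test_like)
clean = ks_without_sentinel(train_like, test_like)

print("raw KS statistic:  ", raw.statistic)
print("clean KS statistic:", clean.statistic)
```

On the lab data the call would be `ks_without_sentinel(air_df_train['NO2(GT)'], air_df_test1['NO2(GT)'])`. Whether the conclusions for `test1` and `test2` change after filtering is not something this sketch establishes.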