### Load the data as a pandas DataFrame

In [11]:
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

In [2]:
df_adclick=pd.read_csv("ad_click_dataset.csv")
df_adclick

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,,Desktop,Top,Shopping,Afternoon,1
1,3044,User3044,,Male,Desktop,Top,,,1
2,5912,User5912,41.0,Non-Binary,,Side,Education,Night,1
3,5418,User5418,34.0,Male,,,Entertainment,Evening,1
4,9452,User9452,39.0,Non-Binary,,,Social Media,Morning,0
...,...,...,...,...,...,...,...,...,...
9995,8510,User8510,,,Mobile,Top,Education,,0
9996,7843,User7843,,Female,Desktop,Bottom,Entertainment,,0
9997,3914,User3914,,Male,Mobile,Side,,Morning,0
9998,7924,User7924,,,Desktop,,Shopping,Morning,1


In [3]:
print(df_adclick.isnull().sum()) 

id                     0
full_name              0
age                 4766
gender              4693
device_type         2000
ad_position         2000
browsing_history    4782
time_of_day         2000
click                  0
dtype: int64


Several columns contain missing (NaN) values, which we need to handle before running the test.

### Cleaning and pre-processing

In [4]:
# Fill missing ages with the median age
df_adclick['age'] = df_adclick['age'].fillna(df_adclick['age'].median())

categorical_cols = ['gender', 'device_type', 'ad_position', 'browsing_history', 'time_of_day']
# Fill missing categorical values with the mode (the most frequent value).
# .mode() returns a pandas Series, so [0] selects its first entry.
for col in categorical_cols:
    df_adclick[col] = df_adclick[col].fillna(df_adclick[col].mode()[0])

# id and full_name do not contribute to the A/B test, so we drop them
df_adclick = df_adclick.drop(['id', 'full_name'], axis=1)

# One-hot encode the categorical columns; drop_first=True drops the first
# level of each column (e.g. ad_position 'Bottom'), which becomes the baseline
df_adclick = pd.get_dummies(df_adclick, columns=categorical_cols, drop_first=True)

In [5]:
df_adclick.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   age                             10000 non-null  float64
 1   click                           10000 non-null  int64  
 2   gender_Male                     10000 non-null  bool   
 3   gender_Non-Binary               10000 non-null  bool   
 4   device_type_Mobile              10000 non-null  bool   
 5   device_type_Tablet              10000 non-null  bool   
 6   ad_position_Side                10000 non-null  bool   
 7   ad_position_Top                 10000 non-null  bool   
 8   browsing_history_Entertainment  10000 non-null  bool   
 9   browsing_history_News           10000 non-null  bool   
 10  browsing_history_Shopping       10000 non-null  bool   
 11  browsing_history_Social Media   10000 non-null  bool   
 12  time_of_day_Evening             1

In [6]:
df_adclick.head()

Unnamed: 0,age,click,gender_Male,gender_Non-Binary,device_type_Mobile,device_type_Tablet,ad_position_Side,ad_position_Top,browsing_history_Entertainment,browsing_history_News,browsing_history_Shopping,browsing_history_Social Media,time_of_day_Evening,time_of_day_Morning,time_of_day_Night
0,22.0,1,False,False,False,False,False,True,False,False,True,False,False,False,False
1,39.5,1,True,False,False,False,False,True,True,False,False,False,False,True,False
2,41.0,1,False,True,False,False,True,False,False,False,False,False,False,False,True
3,34.0,1,True,False,False,False,False,False,True,False,False,False,True,False,False
4,39.0,0,False,True,False,False,False,False,False,False,False,True,False,True,False


In [7]:
df_adclick.columns

Index(['age', 'click', 'gender_Male', 'gender_Non-Binary',
       'device_type_Mobile', 'device_type_Tablet', 'ad_position_Side',
       'ad_position_Top', 'browsing_history_Entertainment',
       'browsing_history_News', 'browsing_history_Shopping',
       'browsing_history_Social Media', 'time_of_day_Evening',
       'time_of_day_Morning', 'time_of_day_Night'],
      dtype='object')

### Splitting Dataset 

In [8]:
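# Group A: ads shown at the Top position; Group B: ads shown at the Side position
group_A = df_adclick[df_adclick['ad_position_Top'] == 1]
group_B = df_adclick[df_adclick['ad_position_Side'] == 1]

# Successes are clicks; trials are the number of rows (impressions) in each group
success_A = group_A['click'].sum()
trials_A = group_A.shape[0]

success_B = group_B['click'].sum()
trials_B = group_B.shape[0]

successes = [success_A, success_B]
trials = [trials_A, trials_B]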

### Running the two-proportion z-test with statsmodels

In [None]:
successes = [success_A, success_B]
trials = [trials_A, trials_B]

z_score, p_value = proportions_ztest(successes, trials)

print(f"Z-score: {z_score}")
print(f"P-value: {p_value}")

Z-score: 0.2603976662571061
P-value: 0.794557042938064


### Explanation

#### 1. Z-score: 0.26
The Z-score is very close to 0: the observed difference between the two proportions (click-through rates) is only about 0.26 standard errors away from zero.

In the context of hypothesis testing, a Z-score this small means that the observed difference between the two groups (Top ad position vs. Side ad position) is not large enough to indicate a significant effect. For a two-sided test at the 0.05 level, the absolute Z-score would need to exceed about 1.96.

#### 2. P-value: 0.795
The P-value is 0.795, which is much greater than 0.05.

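In hypothesis testing, if the P-value is greater than the chosen significance level (here 0.05), we fail to reject the null hypothesis that the two groups have the same click-through rate.

A P-value of 0.795 means that, if the two ad positions truly had the same click-through rate, a difference at least as large as the one observed would be expected about 79.5% of the time. The observed difference is therefore consistent with random chance, and there is no statistically significant difference between the two groups.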
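
As a quick check on how the two reported numbers relate, the two-sided P-value can be recovered from the Z-score using the standard normal distribution. The sketch below uses only the Python standard library; the helper name `two_sided_p_value` is made up for illustration and is not part of statsmodels.

```python
import math

def two_sided_p_value(z: float) -> float:
    """Two-sided P-value for a z statistic under the standard normal."""
    # P(|Z| >= |z|) = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

z_score = 0.2603976662571061  # Z-score reported by proportions_ztest above
p_value = two_sided_p_value(z_score)
print(f"P-value: {p_value:.4f}")
```

This should reproduce the P-value of roughly 0.795 printed by the test, which shows that the Z-score and the P-value carry the same information in two different forms.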

#### Conclusion:

Z-score: 0.26 suggests that the difference in click-through rates is small.

P-value: 0.795 indicates no statistically significant difference between the two groups (Top and Side ad positions).

#### What Does This Mean for our A/B Test?

Since the P-value is greater than 0.05, we fail to reject the null hypothesis.

Therefore, based on this analysis, there is no evidence to suggest that the position of the ad (Top vs. Side) has a statistically significant effect on the click-through rate.

In other words, the Top ad position and the Side ad position have similar click-through rates in our data, and the observed difference could easily have occurred by random chance.

These results are **not statistically significant**.

### Extra math and logic for self understanding

The Z-score (also called the standard score) is a statistical measure that describes how many standard deviations a data point or sample statistic is from the mean of a distribution. It is used to standardize values so that they can be compared across different distributions.

The Z-score is calculated using the following formula, where $X$ is the observed value, $\mu$ is the mean of the distribution, and $\sigma$ is its standard deviation:
$$
Z = \frac{X - \mu}{\sigma}
$$
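
To make the formula concrete, here is a minimal sketch in Python. The function name `standard_score` and the numbers are hypothetical and chosen only for illustration; they do not come from the ad click dataset.

```python
def standard_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical example: a value of 130 in a distribution
# with mean 100 and standard deviation 15
z = standard_score(130.0, 100.0, 15.0)
print(z)  # 2.0, i.e. two standard deviations above the mean
```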

### Z-test for Two Proportions

The formula for the Z-test for two proportions is:

$$
Z = \frac{(p_1 - p_2)}{\sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$$

Where:
- $p_1 = x_1 / n_1$ and $p_2 = x_2 / n_2$ are the sample proportions of the two groups.
- $x_1$ and $x_2$ are the number of successes (clicks) in each group.
- $n_1$ and $n_2$ are the sample sizes of the two groups.
- $P$ is the pooled proportion, calculated as:

$$
P = \frac{x_1 + x_2}{n_1 + n_2}
$$
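
The formula above can be sketched directly in Python using only the standard library. The function name `two_proportion_ztest` and the counts in the example are hypothetical (they are not the counts from our dataset). As far as we understand, the default two-sample behaviour of `proportions_ztest` also uses the pooled proportion, so this hand-written version is meant as a way to see what the library call is doing, not as a replacement for it.

```python
import math

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1 = x1 / n1                      # sample proportion, group 1
    p2 = x2 / n2                      # sample proportion, group 2
    pooled = (x1 + x2) / (n1 + n2)    # pooled proportion P
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Happens only when every trial is a success or every trial is a failure
        raise ValueError("standard error is zero; z is undefined")
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, standard normal
    return z, p_value

# Hypothetical counts: 120 clicks out of 1000 vs. 100 clicks out of 1000
z, p = two_proportion_ztest(120, 1000, 100, 1000)
print(f"Z-score: {z:.4f}, P-value: {p:.4f}")
```

Passing `success_A, trials_A, success_B, trials_B` from the notebook to this function should give values close to those printed by `proportions_ztest`.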

