# A/B Testing for ShoeFly.com

Our favorite online shoe store, ShoeFly.com, is performing an A/B test. They have two different versions of an ad, which they have placed in emails as well as in banner ads on Facebook, Twitter, and Google. They want to know how the two ads are performing on each platform on each day of the week. Help them analyze the data using aggregate measures.


In [2]:
import pandas as pd

### 1. Examine the first few rows of ad_clicks.

In [6]:
ad_clicks = pd.read_csv('ad_clicks.csv')

print(ad_clicks.head(5))

                                user_id utm_source           day  \
0  008b7c6c-7272-471e-b90e-930d548bd8d7     google  6 - Saturday   
1  009abb94-5e14-4b6c-bb1c-4f4df7aa7557   facebook    7 - Sunday   
2  00f5d532-ed58-4570-b6d2-768df5f41aed    twitter   2 - Tuesday   
3  011adc64-0f44-4fd9-a0bb-f1506d2ad439     google   2 - Tuesday   
4  012137e6-7ae7-4649-af68-205b4702169c   facebook    7 - Sunday   

  ad_click_timestamp experimental_group  
0               7:18                  A  
1                NaN                  B  
2                NaN                  A  
3                NaN                  B  
4                NaN                  B  


### 2. Your manager wants to know which ad platform is getting you the most views.

### How many views (i.e., rows of the table) came from each utm_source?

In [7]:
utm_source_count = ad_clicks.groupby('utm_source').user_id.count().reset_index()
utm_source_count.columns = ["utm_source", "count"]

print(utm_source_count)

  utm_source  count
0      email    255
1   facebook    504
2     google    680
3    twitter    215


### 3. If the column ad_click_timestamp is not null, then someone actually clicked on the ad that was displayed.

### Create a new column called is_click, which is True if ad_click_timestamp is not null and False otherwise.

In [9]:
ad_clicks["is_click"] = ad_clicks.ad_click_timestamp.apply(lambda x: isinstance(x, str))

print(ad_clicks.head(5))

                                user_id utm_source           day  \
0  008b7c6c-7272-471e-b90e-930d548bd8d7     google  6 - Saturday   
1  009abb94-5e14-4b6c-bb1c-4f4df7aa7557   facebook    7 - Sunday   
2  00f5d532-ed58-4570-b6d2-768df5f41aed    twitter   2 - Tuesday   
3  011adc64-0f44-4fd9-a0bb-f1506d2ad439     google   2 - Tuesday   
4  012137e6-7ae7-4649-af68-205b4702169c   facebook    7 - Sunday   

  ad_click_timestamp experimental_group  is_click  
0               7:18                  A      True  
1                NaN                  B     False  
2                NaN                  A     False  
3                NaN                  B     False  
4                NaN                  B     False  
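
The `isinstance(x, str)` check works here because every recorded timestamp is stored as a string. A more direct way to express "not null" may be pandas' own `notnull()` method. The sketch below is only an illustration: the Series `timestamps` is a hypothetical stand-in built from the five values printed above, not the full `ad_clicks` table.

```python
import pandas as pd

# First five ad_click_timestamp values, copied from the head() output above.
# (Hypothetical stand-in so the example runs without ad_clicks.csv.)
nan = float("nan")
timestamps = pd.Series(["7:18", nan, nan, nan, nan], name="ad_click_timestamp")

# notnull() is True wherever a value is present, whatever its type.
is_click = timestamps.notnull()

print(is_click.tolist())
```

On the real data, the equivalent assignment would presumably be `ad_clicks["is_click"] = ad_clicks.ad_click_timestamp.notnull()`.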


### 4. We want to know the percent of people who clicked on ads from each utm_source.

### Start by grouping by utm_source and is_click and counting the number of user_ids in each of those groups. Save your answer to the variable clicks_by_source.

In [11]:
clicks_by_source = ad_clicks.groupby(['utm_source', 'is_click']).user_id.count().reset_index()

print(clicks_by_source.head(5))

  utm_source  is_click  user_id
0      email     False      175
1      email      True       80
2   facebook     False      324
3   facebook      True      180
4     google     False      441


### 5. Now let’s pivot the data so that the columns are is_click (either True or False), the index is utm_source, and the values are user_id.

### Save your results to the variable clicks_pivot.

In [12]:
clicks_pivot = clicks_by_source.pivot(
  columns='is_click',
  index='utm_source',
  values='user_id'
)
print(clicks_pivot.head(5))

is_click    False  True
utm_source             
email         175    80
facebook      324   180
google        441   239
twitter       149    66


### 6. Create a new column in clicks_pivot called percent_clicked which is equal to the percent of users who clicked on the ad from each utm_source.

### Was there a difference in click rates for each source?

In [13]:
clicks_pivot["percent_clicked"] = clicks_pivot.apply(lambda row: round(row[True] / (row[False] + row[True]), 2), axis=1) 

print(clicks_pivot.head(5))

is_click    False  True  percent_clicked
utm_source                              
email         175    80             0.31
facebook      324   180             0.36
google        441   239             0.35
twitter       149    66             0.31
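
Because `is_click` is a boolean column, the same click rates can probably be obtained without the count-and-pivot steps: the mean of a boolean column is the fraction of `True` values. The sketch below rebuilds one row per view from the counts printed above so that it runs without `ad_clicks.csv`. The names `counts`, `views`, and `click_rate` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# (no_click, click) counts per source, copied from the clicks_pivot output above.
counts = {
    "email": (175, 80),
    "facebook": (324, 180),
    "google": (441, 239),
    "twitter": (149, 66),
}

# Rebuild one row per view so the example is self-contained.
rows = []
for source, (no_click, click) in counts.items():
    rows += [{"utm_source": source, "is_click": False}] * no_click
    rows += [{"utm_source": source, "is_click": True}] * click
views = pd.DataFrame(rows)

# The mean of a boolean column is the fraction of True values.
click_rate = views.groupby("utm_source").is_click.mean().round(2)

print(click_rate)
```

Applied to the real table, this would be `ad_clicks.groupby("utm_source").is_click.mean()`. The rates are close but not identical across sources, with Facebook and Google slightly ahead of email and Twitter.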


### 7. The column experimental_group tells us whether the user was shown Ad A or Ad B.

### Were approximately the same number of people shown both ads?

In [14]:
a_b_testing = ad_clicks.groupby('experimental_group').user_id.count().reset_index()

print(a_b_testing)

  experimental_group  user_id
0                  A      827
1                  B      827


### 8. Using the column is_click that we defined earlier, check to see if a greater percentage of users clicked on Ad A or Ad B.

In [15]:
a_b_testing_advanced = ad_clicks.groupby(["experimental_group", "is_click"]).user_id.count().reset_index()

a_b_testing_advanced_pivot = a_b_testing_advanced.pivot(
  columns="experimental_group",
  index="is_click",
  values="user_id"
).reset_index()

print(a_b_testing_advanced_pivot)

experimental_group  is_click    A    B
0                      False  517  572
1                       True  310  255
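
The pivot above shows raw counts, while the question asks for a percentage. A minimal sketch of the remaining arithmetic follows, using the four counts printed above. The frame `clicks` and its column names `no_click` and `click` are assumptions introduced for illustration.

```python
import pandas as pd

# Counts copied from the pivot output above, one row per experimental group.
clicks = pd.DataFrame({
    "experimental_group": ["A", "B"],
    "no_click": [517, 572],
    "click": [310, 255],
})

# Fraction of each group's users who clicked: clicks / (clicks + non-clicks).
total = clicks["click"] + clicks["no_click"]
clicks["percent_clicked"] = (clicks["click"] / total).round(3)

print(clicks)
```

With these counts, about 37.5% of users shown Ad A clicked, compared with about 30.8% of users shown Ad B.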


### 9. The Product Manager for the A/B test thinks that the clicks might have changed by day of the week.

### Start by creating two DataFrames: a_clicks and b_clicks, which contain only the results for A group and B group, respectively.

In [19]:
b_clicks = ad_clicks[ad_clicks.experimental_group == 'B']

a_clicks = ad_clicks[ad_clicks.experimental_group == 'A']

print(b_clicks)

print(a_clicks)

                                   user_id utm_source            day  \
1     009abb94-5e14-4b6c-bb1c-4f4df7aa7557   facebook     7 - Sunday   
3     011adc64-0f44-4fd9-a0bb-f1506d2ad439     google    2 - Tuesday   
4     012137e6-7ae7-4649-af68-205b4702169c   facebook     7 - Sunday   
9     01a210c3-fde0-4e6f-8efd-4f0e38730ae6      email    2 - Tuesday   
10    01adb2e7-f711-4ae4-a7c6-29f48457eea1     google  3 - Wednesday   
...                                    ...        ...            ...   
1645  fd2a5852-f0ef-4162-84a6-107a42dc46b5    twitter  3 - Wednesday   
1648  fe6cfa5a-cc63-4770-8d56-c13ac8cf5bef     google  3 - Wednesday   
1650  fed3db6d-8c92-40e3-a4fb-1fb9d7337eb1   facebook     5 - Friday   
1651  ff3a22ff-521c-478c-87ca-7dc7b8f34372    twitter  3 - Wednesday   
1653  ffdfe7ec-0c74-4623-8d90-d95d80f1ba34   facebook   6 - Saturday   

     ad_click_timestamp experimental_group  is_click  
1                   NaN                  B     False  
3                   NaN  

### 10. For each group (a_clicks and b_clicks), calculate the percent of users who clicked on the ad by day.

In [23]:
# Count users per (is_click, day) combination within each group.
b_clicks_grouped = b_clicks.groupby(['is_click', 'day']).user_id.count().reset_index()

a_clicks_grouped = a_clicks.groupby(['is_click', 'day']).user_id.count().reset_index()

# Pivot so that each day is a row and the False/True click counts are columns.
b_clicks_grouped_pivot = b_clicks_grouped.pivot(
  columns="is_click",
  index="day",
  values="user_id"
).reset_index()

a_clicks_grouped_pivot = a_clicks_grouped.pivot(
  columns="is_click",
  index="day",
  values="user_id"
).reset_index()

# Fraction of users who clicked on each day: clicks / (clicks + non-clicks).
a_clicks_grouped_pivot["percentage"] = a_clicks_grouped_pivot.apply(lambda row: round(row[True] / (row[False] + row[True]), 2), axis=1)
b_clicks_grouped_pivot["percentage"] = b_clicks_grouped_pivot.apply(lambda row: round(row[True] / (row[False] + row[True]), 2), axis=1)


# Group A is printed first, then group B.
print(a_clicks_grouped_pivot)
print(b_clicks_grouped_pivot)

is_click            day  False  True  percentage
0            1 - Monday     70    43        0.38
1           2 - Tuesday     76    43        0.36
2         3 - Wednesday     86    38        0.31
3          4 - Thursday     69    47        0.41
4            5 - Friday     77    51        0.40
5          6 - Saturday     73    45        0.38
6            7 - Sunday     66    43        0.39
is_click            day  False  True  percentage
0            1 - Monday     81    32        0.28
1           2 - Tuesday     74    45        0.38
2         3 - Wednesday     89    35        0.28
3          4 - Thursday     87    29        0.25
4            5 - Friday     90    38        0.30
5          6 - Saturday     76    42        0.36
6            7 - Sunday     75    34        0.31
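
Reading two separate tables against each other is error-prone, so one option is to merge them on `day` and look at the difference. The sketch below copies the `percentage` values from the two tables printed above (Ad A first, Ad B second). The names `a_rates`, `b_rates`, `comparison`, and `b_wins` are hypothetical and introduced only for this example.

```python
import pandas as pd

days = ["1 - Monday", "2 - Tuesday", "3 - Wednesday", "4 - Thursday",
        "5 - Friday", "6 - Saturday", "7 - Sunday"]

# Click rates copied from the two "percentage" columns printed above.
a_rates = pd.DataFrame({"day": days,
                        "percentage": [0.38, 0.36, 0.31, 0.41, 0.40, 0.38, 0.39]})
b_rates = pd.DataFrame({"day": days,
                        "percentage": [0.28, 0.38, 0.28, 0.25, 0.30, 0.36, 0.31]})

# Join on day; the suffixes distinguish the two percentage columns.
comparison = a_rates.merge(b_rates, on="day", suffixes=("_a", "_b"))
comparison["a_minus_b"] = (comparison.percentage_a - comparison.percentage_b).round(2)

# Days on which Ad B had the higher click rate.
b_wins = comparison[comparison.a_minus_b < 0].day.tolist()

print(comparison)
print(b_wins)
```

A negative `a_minus_b` marks a day on which Ad B outperformed Ad A. With the values above, that happens only on Tuesday.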


### 11. Compare the results for A and B. What happened over the course of the week?

#### Do you recommend that your company use Ad A or Ad B?

In [24]:
print("The company should use Ad A, which had a higher click rate on every day of the week except Tuesday.")

The company should use Ad A, which had a higher click rate on every day of the week except Tuesday.
