# A/B Testing for ShoeFly.com

## This is a project for the Data Science career path on Codecademy.

#### For this project, Codecademy provided the dataset and instructions describing what the A/B test analysis needed to cover, but I wrote all of the code myself.

Our favorite online shoe store, ShoeFly.com, is performing an A/B test. They have two different versions of an ad, which they have placed in emails as well as in banner ads on Facebook, Twitter, and Google. They want to know how the two ads are performing on each platform on each day of the week. Help them analyze the data using aggregate measures.

### Analyzing Ad Sources

#### 1. Examine the first few rows of ad_clicks.

In [8]:
#import codecademylib
import pandas as pd

ad_clicks = pd.read_csv('ad_clicks.csv')
print(ad_clicks.head())

                                user_id utm_source           day  \
0  008b7c6c-7272-471e-b90e-930d548bd8d7     google  6 - Saturday   
1  009abb94-5e14-4b6c-bb1c-4f4df7aa7557   facebook    7 - Sunday   
2  00f5d532-ed58-4570-b6d2-768df5f41aed    twitter   2 - Tuesday   
3  011adc64-0f44-4fd9-a0bb-f1506d2ad439     google   2 - Tuesday   
4  012137e6-7ae7-4649-af68-205b4702169c   facebook    7 - Sunday   

  ad_click_timestamp experimental_group  
0               7:18                  A  
1                NaN                  B  
2                NaN                  A  
3                NaN                  B  
4                NaN                  B  


#### 2. Your manager wants to know which ad platform is getting you the most views. How many views (i.e., rows of the table) came from each utm_source?


In [9]:
most_views = ad_clicks.groupby('utm_source').user_id.count().reset_index()
print(most_views)

  utm_source  user_id
0      email      255
1   facebook      504
2     google      680
3    twitter      215


#### 3. If the column ad_click_timestamp is not null, then someone actually clicked on the ad that was displayed. Create a new column called is_click, which is True if ad_click_timestamp is not null and False otherwise.


In [10]:
ad_clicks['is_click'] = ~ad_clicks.ad_click_timestamp.isnull()
print(ad_clicks)

                                   user_id utm_source            day  \
0     008b7c6c-7272-471e-b90e-930d548bd8d7     google   6 - Saturday   
1     009abb94-5e14-4b6c-bb1c-4f4df7aa7557   facebook     7 - Sunday   
2     00f5d532-ed58-4570-b6d2-768df5f41aed    twitter    2 - Tuesday   
3     011adc64-0f44-4fd9-a0bb-f1506d2ad439     google    2 - Tuesday   
4     012137e6-7ae7-4649-af68-205b4702169c   facebook     7 - Sunday   
...                                    ...        ...            ...   
1649  fe8b5236-78f6-4192-9da6-a76bba67cfe6    twitter     7 - Sunday   
1650  fed3db6d-8c92-40e3-a4fb-1fb9d7337eb1   facebook     5 - Friday   
1651  ff3a22ff-521c-478c-87ca-7dc7b8f34372    twitter  3 - Wednesday   
1652  ff3af0d6-b092-4c4d-9f2e-2bdd8f7c0732     google     1 - Monday   
1653  ffdfe7ec-0c74-4623-8d90-d95d80f1ba34   facebook   6 - Saturday   

     ad_click_timestamp experimental_group  is_click  
0                  7:18                  A      True  
1                   NaN  

#### 4. We want to know the percent of people who clicked on ads from each utm_source. Start by grouping by utm_source and is_click and counting the number of user_id's in each of those groups. Save your answer to the variable clicks_by_source.


In [11]:
clicks_by_source = ad_clicks.groupby(['utm_source','is_click']).user_id.count().reset_index()
print(clicks_by_source)

  utm_source  is_click  user_id
0      email     False      175
1      email      True       80
2   facebook     False      324
3   facebook      True      180
4     google     False      441
5     google      True      239
6    twitter     False      149
7    twitter      True       66


#### 5. Now let’s pivot the data so that the columns are is_click (either True or False), the index is utm_source, and the values are user_id. Save your results to the variable clicks_pivot.


In [12]:
clicks_pivot = clicks_by_source.pivot(columns = 'is_click', index = 'utm_source', values = 'user_id').reset_index()
print(clicks_pivot)

is_click utm_source  False  True
0             email    175    80
1          facebook    324   180
2            google    441   239
3           twitter    149    66


#### 6. Create a new column in clicks_pivot called percent_clicked which is equal to the percent of users who clicked on the ad from each utm_source. Was there a difference in click rates for each source?


In [13]:
# Percent of views that resulted in a click, per source.
clicks_pivot['percent_clicked'] = clicks_pivot.apply(lambda row: (100 * row[True])/(row[True] + row[False]), axis = 1)
print(clicks_pivot)

is_click utm_source  False  True  percent_clicked
0             email    175    80        31.372549
1          facebook    324   180        35.714286
2            google    441   239        35.147059
3           twitter    149    66        30.697674
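
In the table above, Facebook (35.71%) and Google (35.15%) have slightly higher click rates than email (31.37%) and Twitter (30.70%). Because `is_click` is boolean, its mean within a group is the share of `True` values, so the same percentages could also be computed without the pivot. A minimal sketch of that shortcut, assuming a small made-up DataFrame (`toy_clicks` is a hypothetical name, not the project data):

```python
import pandas as pd

# Hypothetical toy data standing in for ad_clicks (not the real dataset).
toy_clicks = pd.DataFrame({
    'utm_source': ['email', 'email', 'email', 'email', 'google', 'google'],
    'is_click':   [True, False, False, False, True, True],
})

# The mean of a boolean column is the fraction of True values,
# so multiplying by 100 gives the percent of views that were clicked.
percent_clicked = (toy_clicks.groupby('utm_source').is_click.mean() * 100).round(2)
print(percent_clicked)
```

On the real data, this approach would replace the group, pivot, and apply steps with a single `groupby`, at the cost of no longer showing the raw click counts.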


### Analyzing an A/B Test

#### 7. The column experimental_group tells us whether the user was shown Ad A or Ad B. Were approximately the same number of people shown both ads?


In [14]:
print(ad_clicks.groupby('experimental_group').user_id.count())


experimental_group
A    827
B    827
Name: user_id, dtype: int64


#### 8. Using the column is_click that we defined earlier, check to see if a greater percentage of users clicked on Ad A or Ad B.

In [15]:
abtest = ad_clicks.groupby(['experimental_group','is_click']).user_id.count().reset_index()
abtest_pivot = abtest.pivot(columns = 'is_click', index = 'experimental_group', values = 'user_id').reset_index()
print(abtest)
print(abtest_pivot)

  experimental_group  is_click  user_id
0                  A     False      517
1                  A      True      310
2                  B     False      572
3                  B      True      255
is_click experimental_group  False  True
0                         A    517   310
1                         B    572   255
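
The output above shows counts, while the question asks for percentages. A short worked calculation using only the numbers printed in the pivot table (the variable names are illustrative):

```python
# Click / no-click counts copied from the pivot table above.
counts = {
    'A': {'clicked': 310, 'not_clicked': 517},
    'B': {'clicked': 255, 'not_clicked': 572},
}

percent_clicked = {}
for group, c in counts.items():
    total = c['clicked'] + c['not_clicked']
    # Share of users in this group who clicked, as a percentage.
    percent_clicked[group] = round(100 * c['clicked'] / total, 2)

print(percent_clicked)
```

Ad A was clicked by about 37.5% of the users who saw it, compared with about 30.8% for Ad B.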


#### 9. The Product Manager for the A/B test thinks that the clicks might have changed by day of the week. Start by creating two DataFrames: a_clicks and b_clicks, which contain only the results for A group and B group, respectively.


In [16]:
a_clicks = ad_clicks[ad_clicks.experimental_group == 'A'].reset_index(drop = True)
b_clicks = ad_clicks[ad_clicks.experimental_group == 'B'].reset_index(drop = True)
print(a_clicks, b_clicks)

                                  user_id utm_source            day  \
0    008b7c6c-7272-471e-b90e-930d548bd8d7     google   6 - Saturday   
1    00f5d532-ed58-4570-b6d2-768df5f41aed    twitter    2 - Tuesday   
2    013b0072-7b72-40e7-b698-98b4d0c9967f   facebook     1 - Monday   
3    0153d85b-7660-4c39-92eb-1e1acd023280     google   4 - Thursday   
4    01555297-d6e6-49ae-aeba-1b196fdbb09f     google  3 - Wednesday   
..                                    ...        ...            ...   
822  fceb13ea-fd8c-446a-a61f-f977d404330a    twitter   6 - Saturday   
823  fd7d06ea-38b5-4ed9-acc9-777047db8c56     google   4 - Thursday   
824  fe570a20-448f-40ed-930b-8482b8a7c231   facebook     1 - Monday   
825  fe8b5236-78f6-4192-9da6-a76bba67cfe6    twitter     7 - Sunday   
826  ff3af0d6-b092-4c4d-9f2e-2bdd8f7c0732     google     1 - Monday   

    ad_click_timestamp experimental_group  is_click  
0                 7:18                  A      True  
1                  NaN                 

#### 10. For each group (a_clicks and b_clicks), calculate the percent of users who clicked on the ad by day.

In [17]:
a_clicks_by_day = a_clicks.groupby(['day','is_click']).user_id.count().reset_index()
a_clicks_by_day_pivot = a_clicks_by_day.pivot(columns = 'is_click', index = 'day', values = 'user_id')
a_clicks_by_day_pivot['percentage_true_over_total_users'] = (a_clicks_by_day_pivot[True] * 100  / ad_clicks.user_id.count()).round(2)
a_clicks_by_day_pivot['percentage_true_by_day'] = (a_clicks_by_day_pivot[True] * 100  / (a_clicks_by_day_pivot[True] + a_clicks_by_day_pivot[False])).round(2)

b_clicks_by_day = b_clicks.groupby(['day','is_click']).user_id.count().reset_index()
b_clicks_by_day_pivot = b_clicks_by_day.pivot(columns = 'is_click', index = 'day', values = 'user_id')
b_clicks_by_day_pivot['percentage_true_over_total_users'] = (b_clicks_by_day_pivot[True] * 100  / ad_clicks.user_id.count()).round(2)
b_clicks_by_day_pivot['percentage_true_by_day'] = (b_clicks_by_day_pivot[True] * 100  / (b_clicks_by_day_pivot[True] + b_clicks_by_day_pivot[False])).round(2)

print(a_clicks_by_day_pivot, b_clicks_by_day_pivot)

is_click       False  True  percentage_true_over_total_users  \
day                                                            
1 - Monday        70    43                              2.60   
2 - Tuesday       76    43                              2.60   
3 - Wednesday     86    38                              2.30   
4 - Thursday      69    47                              2.84   
5 - Friday        77    51                              3.08   
6 - Saturday      73    45                              2.72   
7 - Sunday        66    43                              2.60   

is_click       percentage_true_by_day  
day                                    
1 - Monday                      38.05  
2 - Tuesday                     36.13  
3 - Wednesday                   30.65  
4 - Thursday                    40.52  
5 - Friday                      39.84  
6 - Saturday                    38.14  
7 - Sunday                      39.45   is_click       False  True  percentage_true_over_total_users  \
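
The code for groups A and B above repeats the same group, pivot, and percentage steps. One possible way to avoid the duplication is a small helper function. This is only a sketch: `clicks_by_day` and `toy` are hypothetical names, and the toy rows are made up for illustration.

```python
import pandas as pd

def clicks_by_day(clicks):
    """Count clicks per day and add the percent of users who clicked."""
    counts = clicks.groupby(['day', 'is_click']).user_id.count().reset_index()
    pivot = counts.pivot(columns='is_click', index='day', values='user_id')
    # Percent of that day's users who clicked the ad.
    pivot['percentage_true_by_day'] = (
        pivot[True] * 100 / (pivot[True] + pivot[False])
    ).round(2)
    return pivot

# Hypothetical toy data with the same columns the helper relies on.
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4', 'u5', 'u6'],
    'day': ['1 - Monday'] * 4 + ['2 - Tuesday'] * 2,
    'is_click': [True, False, False, False, True, False],
})
toy_by_day = clicks_by_day(toy)
print(toy_by_day)
```

With such a helper, the two pivots would come from `clicks_by_day(a_clicks)` and `clicks_by_day(b_clicks)`. Note that a day with no clicks at all (or only clicks) would leave a missing value in the pivot, which this sketch does not handle.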

#### 11. Compare the results for A and B. What happened over the course of the week? Do you recommend that your company use Ad A or Ad B?


In [18]:
# Put the daily click percentages for A and B side by side.
compareAB = a_clicks_by_day_pivot[['percentage_true_by_day']].rename(
    columns = {'percentage_true_by_day': 'percentage_true_by_day_A'})
compareAB['percentage_true_by_day_B'] = b_clicks_by_day_pivot['percentage_true_by_day']
print(compareAB)

is_click       percentage_true_by_day_A  percentage_true_by_day_B
day                                                              
1 - Monday                        38.05                     28.32
2 - Tuesday                       36.13                     37.82
3 - Wednesday                     30.65                     28.23
4 - Thursday                      40.52                     25.00
5 - Friday                        39.84                     29.69
6 - Saturday                      38.14                     35.59
7 - Sunday                        39.45                     31.19


Ad A performs better overall: its daily click rate is higher than Ad B's on every day except Tuesday (36.13% for A versus 37.82% for B). Based on these results, I recommend that the company use Ad A.
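
The comparison above relies on the percentages alone. One possible follow-up, not part of the project instructions, is to check whether the overall gap between the two ads could plausibly be due to chance. The sketch below applies a two-proportion z-test to the overall click counts from step 8, using only the standard library. The choice of test and the variable names are assumptions made for illustration.

```python
import math

# Click and view counts per ad, copied from the step 8 output.
clicks_a, views_a = 310, 827
clicks_b, views_b = 255, 827

rate_a = clicks_a / views_a
rate_b = clicks_b / views_b

# Pooled click rate under the null hypothesis that both ads perform the same.
pooled = (clicks_a + clicks_b) / (views_a + views_b)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))

z = (rate_a - rate_b) / std_err
# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value here would suggest that the overall difference in click rate is unlikely to come from random variation alone. The day-by-day differences rest on far fewer users each and would need their own check.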
