# Aggregates in Pandas

In [1]:
import pandas as pd

#### 1.

Examine the first few rows of `ad_clicks`.

In [2]:
df = pd.read_csv("ad_clicks.csv")
df.head()

,user_id,utm_source,day,ad_click_timestamp,experimental_group
0,008b7c6c-7272-471e-b90e-930d548bd8d7,google,6 - Saturday,7:18,A
1,009abb94-5e14-4b6c-bb1c-4f4df7aa7557,facebook,7 - Sunday,,B
2,00f5d532-ed58-4570-b6d2-768df5f41aed,twitter,2 - Tuesday,,A
3,011adc64-0f44-4fd9-a0bb-f1506d2ad439,google,2 - Tuesday,,B
4,012137e6-7ae7-4649-af68-205b4702169c,facebook,7 - Sunday,,B


#### 2.

Your manager wants to know which ad platform is getting you the most views.

How many views (i.e., rows of the table) came from each `utm_source`?

In [3]:
df.groupby("utm_source").user_id.count().reset_index()

,utm_source,user_id
0,email,255
1,facebook,504
2,google,680
3,twitter,215
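
As a cross-check, `value_counts()` should give the same per-source tallies without an explicit `groupby`. The sketch below uses a small made-up frame named `toy` (its rows are invented for illustration and are not taken from `ad_clicks.csv`) to show the two approaches agreeing.

```python
import pandas as pd

# Toy stand-in for ad_clicks; these rows are invented for illustration.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5"],
    "utm_source": ["google", "facebook", "google", "email", "google"],
})

# Approach used above: count user_id values within each utm_source group.
by_groupby = toy.groupby("utm_source").user_id.count()

# Alternative: count how often each utm_source value occurs.
by_value_counts = toy.utm_source.value_counts().sort_index()

print(by_groupby.to_dict())
print(by_value_counts.to_dict())
```

One difference to keep in mind: `count()` skips rows where `user_id` is null, while `value_counts()` counts every row with a non-null `utm_source`, so the two can diverge on data with missing IDs.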


#### 3.


If the column `ad_click_timestamp` is not null, then someone actually clicked on the ad that was displayed.

Create a new column called `is_click`, which is `True` if `ad_click_timestamp` is not null and `False` otherwise.

In [None]:
df["is_click"] = ~df.ad_click_timestamp.isnull()

# '~' negates a boolean Series element-wise: True becomes False and vice versa

df.head()

,user_id,utm_source,day,ad_click_timestamp,experimental_group,is_click
0,008b7c6c-7272-471e-b90e-930d548bd8d7,google,6 - Saturday,7:18,A,True
1,009abb94-5e14-4b6c-bb1c-4f4df7aa7557,facebook,7 - Sunday,,B,False
2,00f5d532-ed58-4570-b6d2-768df5f41aed,twitter,2 - Tuesday,,A,False
3,011adc64-0f44-4fd9-a0bb-f1506d2ad439,google,2 - Tuesday,,B,False
4,012137e6-7ae7-4649-af68-205b4702169c,facebook,7 - Sunday,,B,False
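
The negated `isnull()` mask can also be written with `notnull()`, which some readers find easier to scan. A minimal sketch on a hypothetical `timestamps` Series (values invented for illustration) showing that both spellings produce the same mask:

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps: two clicks, two missing values.
timestamps = pd.Series(["7:18", np.nan, np.nan, "15:21"])

# Spelling used above: negate the "is null" mask.
via_negation = ~timestamps.isnull()

# Equivalent spelling: ask directly for "not null".
via_notnull = timestamps.notnull()

print(via_negation.tolist())
print(via_negation.equals(via_notnull))
```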


#### 4.

We want to know the percent of people who clicked on ads from each `utm_source`.

Start by grouping by `utm_source` and `is_click` and counting the number of `user_id`s in each of those groups. Save your answer to the variable `clicks_by_source`.

In [10]:
clicks_by_source = df.groupby(["utm_source", "is_click"]).user_id.count().reset_index()
clicks_by_source

,utm_source,is_click,user_id
0,email,False,175
1,email,True,80
2,facebook,False,324
3,facebook,True,180
4,google,False,441
5,google,True,239
6,twitter,False,149
7,twitter,True,66


#### 5.

Now let’s pivot the data so that the columns are `is_click` (either `True` or `False`), the index is `utm_source`, and the values are `user_id`.

Save your results to the variable `clicks_pivot`.

In [14]:
clicks_pivot = clicks_by_source.pivot(columns="is_click", index="utm_source", values="user_id")
clicks_pivot

is_click,False,True
utm_source,,
email,175,80
facebook,324,180
google,441,239
twitter,149,66
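
The group-then-pivot sequence builds a table of counts with one row per source and one column per `is_click` value. `pd.crosstab` can build a table of the same shape in one call. The sketch below compares the two on a hypothetical frame `toy` (rows invented for illustration); it is one possible shortcut, not the method this exercise asks for.

```python
import pandas as pd

# Toy stand-in for ad_clicks; these rows are invented for illustration.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "utm_source": ["google", "google", "google", "email", "email", "email"],
    "is_click": [True, False, False, True, True, False],
})

# Two-step approach from the exercise: group and count, then pivot.
counts = toy.groupby(["utm_source", "is_click"]).user_id.count().reset_index()
pivoted = counts.pivot(columns="is_click", index="utm_source", values="user_id")

# One-step alternative: cross-tabulate the two columns directly.
crossed = pd.crosstab(toy.utm_source, toy.is_click)

print(pivoted)
print(crossed)
```

Note that `crosstab` counts rows rather than non-null `user_id` values, so the two tables match only when `user_id` has no missing entries.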


#### 6.

Create a new column in `clicks_pivot` called `percent_clicked` which is equal to the percent of users who clicked on the ad from each `utm_source`.

Was there a difference in click rates for each source?

In [20]:
clicks_pivot["percent_clicked"] = clicks_pivot[True] * 100 / (clicks_pivot[True] + clicks_pivot[False])
clicks_pivot

is_click,False,True,percent_clicked
utm_source,,,
email,175,80,31.372549
facebook,324,180,35.714286
google,441,239,35.147059
twitter,149,66,30.697674
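
Because `True` counts as 1 and `False` as 0, the mean of a boolean column is the fraction of `True` values. That gives a shorter route to the same percentage. A minimal sketch on a hypothetical frame `toy` (rows invented for illustration):

```python
import pandas as pd

# Toy stand-in for ad_clicks; these rows are invented for illustration.
toy = pd.DataFrame({
    "utm_source": ["google"] * 4 + ["email"] * 4,
    "is_click": [True, False, False, True, True, False, False, False],
})

# Mean of a boolean column = share of True values; scale to a percentage.
percent_clicked = toy.groupby("utm_source").is_click.mean() * 100

print(percent_clicked)
```

Here `google` has 2 clicks out of 4 rows and `email` has 1 out of 4, so the result is 50% and 25% respectively.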


#### 7.

The column `experimental_group` tells us whether the user was shown Ad A or Ad B.

Were approximately the same number of people shown both ads?

In [22]:
df.groupby("experimental_group").user_id.count().reset_index()

,experimental_group,user_id
0,A,827
1,B,827


#### 8.

Using the column `is_click` that we defined earlier, check to see if a greater percentage of users clicked on Ad A or Ad B.

In [24]:
df.groupby(["experimental_group", "is_click"]).user_id.count().reset_index()

,experimental_group,is_click,user_id
0,A,False,517
1,A,True,310
2,B,False,572
3,B,True,255
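
The grouped output above gives counts, not percentages. One way to finish the comparison is to pivot those counts and divide, as was done for `utm_source`. The sketch below re-enters the four counts shown above by hand into a hypothetical frame `counts`, so it runs without the CSV.

```python
import pandas as pd

# Counts copied from the grouped output above.
counts = pd.DataFrame({
    "experimental_group": ["A", "A", "B", "B"],
    "is_click": [False, True, False, True],
    "user_id": [517, 310, 572, 255],
})

# One row per ad, with False/True click counts as columns.
pivot = counts.pivot(index="experimental_group", columns="is_click", values="user_id")

# Share of users who clicked, as a percentage.
pivot["percent_clicked"] = pivot[True] * 100 / (pivot[True] + pivot[False])

print(pivot)
```

This works out to about 37.5% for Ad A and about 30.8% for Ad B.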


#### 9.

The Product Manager for the A/B test thinks that the clicks might have changed by day of the week.

Start by creating two DataFrames: `a_clicks` and `b_clicks`, which contain only the results for `A` group and `B` group, respectively.

In [28]:
a_clicks = df[df.experimental_group == 'A']
b_clicks = df[df.experimental_group == 'B']

In [29]:
a_clicks.head()

,user_id,utm_source,day,ad_click_timestamp,experimental_group,is_click
0,008b7c6c-7272-471e-b90e-930d548bd8d7,google,6 - Saturday,7:18,A,True
2,00f5d532-ed58-4570-b6d2-768df5f41aed,twitter,2 - Tuesday,,A,False
5,013b0072-7b72-40e7-b698-98b4d0c9967f,facebook,1 - Monday,,A,False
6,0153d85b-7660-4c39-92eb-1e1acd023280,google,4 - Thursday,,A,False
7,01555297-d6e6-49ae-aeba-1b196fdbb09f,google,3 - Wednesday,,A,False


In [30]:
b_clicks.head()

,user_id,utm_source,day,ad_click_timestamp,experimental_group,is_click
1,009abb94-5e14-4b6c-bb1c-4f4df7aa7557,facebook,7 - Sunday,,B,False
3,011adc64-0f44-4fd9-a0bb-f1506d2ad439,google,2 - Tuesday,,B,False
4,012137e6-7ae7-4649-af68-205b4702169c,facebook,7 - Sunday,,B,False
9,01a210c3-fde0-4e6f-8efd-4f0e38730ae6,email,2 - Tuesday,15:21,B,True
10,01adb2e7-f711-4ae4-a7c6-29f48457eea1,google,3 - Wednesday,,B,False


#### 10.

For each group (`a_clicks` and `b_clicks`), calculate the percent of users who clicked on the ad by `day`.

In [45]:
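def sol_10(df):
    """Return click counts and percent clicked for each day in `df`."""
    # Count users for every (is_click, day) combination.
    counts = df.groupby(["is_click", "day"]).user_id.count().reset_index()
    # One row per day, with the False/True click counts as columns.
    df_pivot = counts.pivot(columns="is_click", index="day", values="user_id")
    # Share of users who clicked, as a percentage.
    df_pivot["percent_clicked"] = df_pivot[True] * 100 / (df_pivot[True] + df_pivot[False])
    return df_pivot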

In [46]:
sol_10(a_clicks)

is_click,False,True,percent_clicked
day,,,
1 - Monday,70,43,38.053097
2 - Tuesday,76,43,36.134454
3 - Wednesday,86,38,30.645161
4 - Thursday,69,47,40.517241
5 - Friday,77,51,39.84375
6 - Saturday,73,45,38.135593
7 - Sunday,66,43,39.449541


In [47]:
sol_10(b_clicks)

is_click,False,True,percent_clicked
day,,,
1 - Monday,81,32,28.318584
2 - Tuesday,74,45,37.815126
3 - Wednesday,89,35,28.225806
4 - Thursday,87,29,25.0
5 - Friday,90,38,29.6875
6 - Saturday,76,42,35.59322
7 - Sunday,75,34,31.192661
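
To see both ads side by side, the per-day percentages can also be computed in one pass by grouping on `day` and `experimental_group` together and taking the mean of `is_click`. This is a sketch of an alternative, not the approach the exercise prescribes; the frame `toy` and its rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for ad_clicks; these rows are invented for illustration.
toy = pd.DataFrame({
    "experimental_group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "day": ["1 - Monday", "1 - Monday", "2 - Tuesday", "2 - Tuesday"] * 2,
    "is_click": [True, False, True, True, False, False, True, False],
})

# Mean of the boolean is_click column is the click rate per (day, group);
# unstack moves experimental_group into the columns for a side-by-side view.
percent_by_day = (
    toy.groupby(["day", "experimental_group"])
    .is_click.mean()
    .mul(100)
    .unstack("experimental_group")
)

print(percent_by_day)
```

A side effect of using the mean: a day with no clicks at all (like Monday for group B in the toy data) comes out as 0 rather than `NaN`, which is what the count-and-pivot approach would produce when the `True` count is missing for that day.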


#### 11.

Compare the results for `A` and `B`. What happened over the course of the week?

Do you recommend that your company use Ad A or Ad B?

<details><summary style="display:list-item; font-size:16px; color:white;">Solution</summary>

I would recommend using Ad A, because overall it had a higher click rate than Ad B (310 of 827 users, about 37.5%, versus 255 of 827, about 30.8%).
Over the course of the week, Ad A had the higher click rate on every day except Tuesday, when Ad B was slightly ahead (37.8% versus 36.1%). The gap was widest on Thursday (40.5% versus 25.0%).