## Aggregate Functions in Pandas

In [None]:
An aggregate statistic is a single number that describes a group of numbers.

In other words, aggregate functions summarize many data points (for example, a column of a DataFrame) into a smaller set of values.

Common aggregate statistics include the mean, median, and standard deviation.

#### General Syntax for calculation

df.column_name.command()

Command	Description
===================
mean     :	Average of all values in column
std      :	Standard deviation
median   :	Median
max      :	Maximum value in column
min      :	Minimum value in column
count    :	Number of non-missing values in column
nunique  :	Number of unique values in column
unique   :	Array of unique values in column
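
As a quick illustration of the commands in the table, here is a minimal sketch. The DataFrame `customers_demo` and its values are made up for this example and are not part of the ShoeFly data.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the commands in the table
customers_demo = pd.DataFrame({'age': [23, 25, 31, 35, 35, 46, 62]})

age_mean = customers_demo.age.mean()        # average of all values
age_median = customers_demo.age.median()    # middle value
age_max = customers_demo.age.max()          # largest value
age_min = customers_demo.age.min()          # smallest value
age_count = customers_demo.age.count()      # number of non-missing values
age_nunique = customers_demo.age.nunique()  # number of distinct values
age_unique = customers_demo.age.unique()    # NumPy array of distinct values

print(age_median, age_max, age_min, age_count, age_nunique)
```

Each call reduces the whole column to one value, except `unique`, which returns the distinct values themselves.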



## Calculating Column Statistics

### Example 1

In [None]:
Example: The DataFrame customers contains the names and ages of all of your customers.

print(customers.age)
>> [23, 25, 31, 35, 35, 46, 62]

print(customers.age.median())
>> 35

### Example 2

In [None]:
The DataFrame shipments contains address information for all shipments that you’ve sent out in the past year. 

You want to know how many different states you have shipped to (and how many shipments went to the same state).

print(shipments.state)
>> ['CA', 'CA', 'CA', 'CA', 'NY', 'NY', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ']
print(shipments.state.nunique())
>> 3
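
`nunique()` answers the first half of the question (how many different states). For the second half (how many shipments went to each state), `value_counts()` is one option. The sketch below uses a made-up `shipments_demo` DataFrame that mirrors the values printed above.

```python
import pandas as pd

# Hypothetical stand-in for the shipments DataFrame described above
shipments_demo = pd.DataFrame({
    'state': ['CA'] * 4 + ['NY'] * 2 + ['NJ'] * 7
})

# Number of distinct states
num_states = shipments_demo.state.nunique()

# Number of shipments per state, sorted from most to fewest
shipments_per_state = shipments_demo.state.value_counts()

print(num_states)
print(shipments_per_state)
```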

### Example 3

In [None]:
The DataFrame inventory contains a list of types of t-shirts that your company makes. 

You want a list of the colors that your shirts come in.

print(inventory.color)
>> ['blue', 'blue', 'blue', 'blue', 'blue', 'green', 'green', 'orange', 'orange', 'orange']
    
print(inventory.color.unique())
>> ['blue', 'green', 'orange']

## Revisit our orders from ShoeFly.com.

In [1]:
import pandas as pd

In [2]:
orders_df = pd.read_csv(r'D:\GIT_Repositories\pandas\orders.csv')

In [4]:
orders_df.head()

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,price
0,41874,Kyle,Peck,KylePeck71@gmail.com,ballet flats,faux-leather,black,385.0
1,31349,Elizabeth,Velazquez,EVelazquez1971@gmail.com,boots,fabric,brown,388.0
2,43416,Keith,Saunders,KS4047@gmail.com,sandals,leather,navy,346.0
3,56054,Ryan,Sweeney,RyanSweeney14@outlook.com,sandals,fabric,brown,344.0
4,77402,Donna,Blankenship,DB3807@gmail.com,stilettos,fabric,brown,289.0


#### Task: 

Our finance department wants to know the price of the most expensive pair of shoes purchased.

In [5]:
most_expensive = orders_df.price.max()

In [8]:
print(most_expensive)

493.0


Fetch the details of the most expensive order: who placed it, and the shoe's material, color, etc.


In [15]:
orders_df[orders_df['price'] == orders_df.price.max()]

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,price
64,13553,Aaron,Hanson,AH3867@gmail.com,clogs,faux-leather,navy,493.0
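
The boolean mask used above keeps every row whose price equals the maximum, so ties are all returned. If only one row is needed, `idxmax()` is an alternative. The sketch below assumes a tiny made-up table, `orders_demo`, rather than the real orders.csv.

```python
import pandas as pd

# Small made-up orders table; the real orders.csv is not reproduced here
orders_demo = pd.DataFrame({
    'id': [41874, 31349, 13553],
    'shoe_type': ['ballet flats', 'boots', 'clogs'],
    'price': [385.0, 388.0, 493.0],
})

# Boolean mask: keeps every row whose price equals the maximum (handles ties)
top_rows = orders_demo[orders_demo['price'] == orders_demo['price'].max()]

# idxmax returns the index label of the first maximum only
top_row = orders_demo.loc[orders_demo['price'].idxmax()]

print(top_rows)
print(top_row)
```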


### Task:

Our fashion department wants to know how many different colors of shoes we are selling. Save your answer to the variable num_colors.

In [12]:
num_colors = orders_df['shoe_color'].nunique()

In [13]:
num_colors

5

What are those unique colors?

In [18]:
print( orders_df['shoe_color'].unique() )

['black' 'brown' 'navy' 'white' 'red']


## Calculating Aggregate Functions I

When we have a bunch of data, we often want to calculate aggregate statistics (mean, standard deviation, median, percentiles, etc.) 
over certain subsets of the data.

### General Syntax:

df.groupby('column1').column2.measurement()

where:
------
column1     : column that we want to group by ('student' in our example)
column2     : column that we want to perform a measurement on (grade in our example)
measurement : measurement function we want to apply (mean in our example)


Example

In [None]:
Suppose we have a grade book with columns student, assignment_name, and grade. 

The first few lines look like this:

student	          assignment_name	grade
Amy	              Assignment 1      75
Amy	              Assignment 2      35
Bob	              Assignment 1      99
Bob	              Assignment 2      35

### Task:

Get the average grade for each student across all assignments.

In [20]:
grade_book_df = pd.DataFrame([
    ['Amy', 'Assignment 1', 75],
    ['Amy', 'Assignment 2', 35],
    ['Bob', 'Assignment 1', 99],
    ['Bob', 'Assignment 2', 35],
], columns=['student', 'assignment_name', 'grade']
)

In [21]:
grade_book_df

Unnamed: 0,student,assignment_name,grade
0,Amy,Assignment 1,75
1,Amy,Assignment 2,35
2,Bob,Assignment 1,99
3,Bob,Assignment 2,35


Since we need the average for each student, we group by the student column.

In [26]:
grade_book_df.groupby('student').grade.mean()

student
Amy    55.0
Bob    67.0
Name: grade, dtype: float64
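
As a side note, the same grouping can report several statistics at once with `agg`. This is a sketch on a copy of the grade book (named `grade_book_demo` here so it stands alone), not something the exercise requires.

```python
import pandas as pd

# Copy of the grade book from the example, rebuilt so this cell is self-contained
grade_book_demo = pd.DataFrame([
    ['Amy', 'Assignment 1', 75],
    ['Amy', 'Assignment 2', 35],
    ['Bob', 'Assignment 1', 99],
    ['Bob', 'Assignment 2', 35],
], columns=['student', 'assignment_name', 'grade'])

# Bracket syntax is equivalent to .grade and also works for names with spaces
mean_by_student = grade_book_demo.groupby('student')['grade'].mean()

# agg computes several statistics per group in one pass
summary = grade_book_demo.groupby('student')['grade'].agg(['mean', 'min', 'max'])

print(mean_by_student)
print(summary)
```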

## Exercise

## Task

In [None]:
In the previous exercise, our finance department wanted to know the most expensive shoe that we sold.

Now, they want to know the most expensive shoe for each shoe_type (i.e., the most expensive boot, the most expensive ballet flat, etc.).

Save your answer to the variable pricey_shoes.

In [27]:
orders_df.head()

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,price
0,41874,Kyle,Peck,KylePeck71@gmail.com,ballet flats,faux-leather,black,385.0
1,31349,Elizabeth,Velazquez,EVelazquez1971@gmail.com,boots,fabric,brown,388.0
2,43416,Keith,Saunders,KS4047@gmail.com,sandals,leather,navy,346.0
3,56054,Ryan,Sweeney,RyanSweeney14@outlook.com,sandals,fabric,brown,344.0
4,77402,Donna,Blankenship,DB3807@gmail.com,stilettos,fabric,brown,289.0


In [32]:
pricey_shoes = orders_df.groupby('shoe_type').price.max()

In [33]:
pricey_shoes

shoe_type
ballet flats    481.0
boots           478.0
clogs           493.0
sandals         456.0
stilettos       487.0
wedges          461.0
Name: price, dtype: float64

In [34]:
print(type(pricey_shoes))

<class 'pandas.core.series.Series'>


### Task: list the shoe types by their highest price, beginning with the most expensive

In [35]:
pricey_shoes = orders_df.groupby('shoe_type').price.max().sort_values(ascending = False)

In [36]:
pricey_shoes

shoe_type
clogs           493.0
stilettos       487.0
ballet flats    481.0
boots           478.0
wedges          461.0
sandals         456.0
Name: price, dtype: float64

In [37]:
print(type(pricey_shoes))

<class 'pandas.core.series.Series'>


## Calculating Aggregate Functions II

After using groupby, we often need to clean our resulting data.

Selecting one column from a groupby and applying an aggregate function creates a new Series, not a DataFrame.

For our ShoeFly.com example, the indices of the Series were different values of shoe_type, and the name property was price.

Usually, we’d prefer that those indices were actually a column.

In order to get that, we can use reset_index(). This will transform our Series into a DataFrame and move the indices into their own column.

### Typically, a groupby statement is followed by reset_index:

In [None]:
df.groupby('column1').column2.measurement() \
    .reset_index()

Example

In [None]:
Suppose we have a DataFrame teas containing data on types of tea:

id	tea	               category	  caffeine	price
0	earl grey	       black	  38	    3
1	english breakfast  black	  41	    3
2	irish breakfast    black	  37	    2.5
3	jasmine	           green	  23	    4.5
4	matcha	           green	  48	    5
5	camomile	       herbal	  0	        3

In [49]:
teas_df = pd.DataFrame([
    [0, 'earl grey', 'black', 38, 3],
    [1, 'english breakfast', 'black', 41, 3],
    [2, 'irish breakfast', 'black', 37, 2.5],
    [3, 'jasmine', 'green', 23, 4.5],
    [4, 'matcha', 'green', 48, 5],
    [5, 'camomile', 'herbal', 0, 3],
], 
  columns=['id', 'tea', 'category', 'caffeine', 'price']
)

In [50]:
teas_df

Unnamed: 0,id,tea,category,caffeine,price
0,0,earl grey,black,38,3.0
1,1,english breakfast,black,41,3.0
2,2,irish breakfast,black,37,2.5
3,3,jasmine,green,23,4.5
4,4,matcha,green,48,5.0
5,5,camomile,herbal,0,3.0


### Task:

Find the number of teas in each category.

In [51]:
teas_df.groupby(['category']).id.count()

category
black     3
green     2
herbal    1
Name: id, dtype: int64

In [52]:
teas_category_counts = teas_df.groupby(['category']).id.count().reset_index()

In [53]:
teas_category_counts

Unnamed: 0,category,id
0,black,3
1,green,2
2,herbal,1


In [None]:
The new column contains the counts of each category of tea sold. We have 3 black teas, 2 green teas, and so on. 

However, this column is called id because we used the id column of teas to calculate the counts. 

We actually want to call this column counts, so we rename it.

In [57]:
teas_category_counts = teas_category_counts.rename(columns = {'id' : 'counts'})

In [58]:
teas_category_counts

Unnamed: 0,category,counts
0,black,3
1,green,2
2,herbal,1
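
One possible shortcut, shown here as a sketch rather than the approach used in this notebook, is to count rows with `size()` and name the resulting column directly in `reset_index`. The `teas_demo` DataFrame below is a cut-down, made-up version of the teas table.

```python
import pandas as pd

# Cut-down version of the teas table: only the column we group on
teas_demo = pd.DataFrame({
    'category': ['black', 'black', 'black', 'green', 'green', 'herbal']
})

# size() counts rows per group; name= sets the column name in one step
counts_demo = teas_demo.groupby('category').size().reset_index(name='counts')

print(counts_demo)
```

This avoids the separate `rename` call, at the cost of counting rows rather than non-missing values of a particular column.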


### Task

Find the most expensive shoe for each shoe_type.

End with reset_index, which will change pricey_shoes into a DataFrame.

In [60]:
pricey_shoes = orders_df.groupby('shoe_type').price.max() \
               .reset_index()

In [61]:
pricey_shoes

Unnamed: 0,shoe_type,price
0,ballet flats,481.0
1,boots,478.0
2,clogs,493.0
3,sandals,456.0
4,stilettos,487.0
5,wedges,461.0


### reset_index() changes the output to a DataFrame

In [62]:
print(type(pricey_shoes))

<class 'pandas.core.frame.DataFrame'>
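
An alternative worth knowing is the `as_index=False` argument of `groupby`, which keeps the grouping column as an ordinary column so that no `reset_index()` is needed. The sketch below uses a small made-up `orders_demo` table.

```python
import pandas as pd

# Made-up orders; only the columns needed for the illustration
orders_demo = pd.DataFrame({
    'shoe_type': ['boots', 'boots', 'clogs'],
    'price': [478.0, 120.0, 493.0],
})

# as_index=False keeps shoe_type as a regular column in the result
pricey_demo = orders_demo.groupby('shoe_type', as_index=False)['price'].max()

print(pricey_demo)
print(type(pricey_demo))
```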


## Calculating Aggregate Functions III

Sometimes, the operation that you want to perform is more complicated than mean or count.

In such cases, use the .apply() method and lambda functions, just like we did for individual column operations.

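*** Note *** : when we select a single column after groupby, the input to our lambda function is a pandas Series containing one group's values, which behaves much like a list of values.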
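
To see what the function passed to `.apply()` actually receives, the sketch below records the type of each input. The helper `wage_range` and the `employee_demo` table are made up for this check.

```python
import pandas as pd

# Made-up employee table matching the columns used in this section
employee_demo = pd.DataFrame({
    'wage': [39, 17, 33, 27],
    'category': ['product', 'design', 'marketing', 'design'],
})

seen_types = []

def wage_range(values):
    # Record what kind of object pandas hands us for each group
    seen_types.append(type(values).__name__)
    return values.max() - values.min()

ranges = employee_demo.groupby('category')['wage'].apply(wage_range)

print(seen_types)
print(ranges)
```

Because each group arrives as a Series, both pandas methods (`values.max()`) and NumPy functions (`np.percentile(values, 75)`) can be used inside the function.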

#### Example

In [None]:
Suppose we have a DataFrame of employee information called df that has the following columns:

id: the employee’s id number
name: the employee’s name
wage: the employee’s hourly wage
category: the type of work that the employee does

Our data might look something like this:

id	    name	        wage	category
10131	Sarah Carney	39	    product
14189	Heather Carey	17	    design
15004	Gary Mercado	33	    marketing
11204	Cora Copaz	    27	    design


### Task:

In [None]:
Percentile: a statistic that gives the value below which a given percentage of the values fall.


In [None]:
Suppose we want to calculate the 75th percentile (i.e., the point at which 75% of employees have a lower wage and 25% have a higher wage) 
for each category. We can combine groupby, apply, and np.percentile:

In [65]:
employee_df = pd.DataFrame([
    [10131, 'Sarah Carney', 39, 'product'],
    [14189, 'Heather Carey', 17, 'design'],
    [15004, 'Gary Mercado', 33, 'marketing'],
    [11204, 'Cora Copaz', 27, 'design']
], 
  columns=['id', 'name', 'wage', 'category']
)

In [66]:
employee_df

Unnamed: 0,id,name,wage,category
0,10131,Sarah Carney,39,product
1,14189,Heather Carey,17,design
2,15004,Gary Mercado,33,marketing
3,11204,Cora Copaz,27,design


In [72]:
# np.percentile can calculate any percentile over an array of values
import numpy as np

high_earners = employee_df.groupby('category').wage   \
               .apply(lambda x: np.percentile(x, 75)) \
               .reset_index()

In [73]:
high_earners

Unnamed: 0,category,wage
0,design,24.5
1,marketing,33.0
2,product,39.0


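From the output above:

The 75th percentile of design wages is 24.5, i.e. about 75% of design employees earn less than 24.5 and about 25% earn 24.5 or more. With only two design employees (wages 17 and 27), this value is interpolated between them.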
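
For percentiles specifically, pandas also offers a built-in `quantile` method on grouped columns, which takes a fraction between 0 and 1 instead of a percentage. The sketch below, on a made-up `employee_demo` table, compares it with the `apply` plus `np.percentile` approach.

```python
import numpy as np
import pandas as pd

# Made-up employee table with the same wages as the example
employee_demo = pd.DataFrame({
    'wage': [39, 17, 33, 27],
    'category': ['product', 'design', 'marketing', 'design'],
})

grouped_wage = employee_demo.groupby('category')['wage']

# Built-in quantile: 0.75 corresponds to the 75th percentile
q75_quantile = grouped_wage.quantile(0.75)

# Same statistic through apply and np.percentile
q75_apply = grouped_wage.apply(lambda x: np.percentile(x, 75))

print(q75_quantile)
print(q75_apply)
```

Both use linear interpolation by default, so the two results agree for this data.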

## Task

In [None]:
Once more, we’ll return to the data from ShoeFly.com. Our Marketing team says that it’s important to have some affordably priced shoes 
available for every color of shoe that we sell.

Let’s calculate the 25th percentile for shoe price for each shoe_color to help Marketing decide if we have enough cheap shoes on sale. 
Save the data to the variable cheap_shoes.

Note: Be sure to use reset_index() at the end of your query so that cheap_shoes is a DataFrame.

In [90]:
orders_df = pd.read_csv(r'D:\GIT_Repositories\pandas\orders.csv')

In [91]:
orders_df

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,price
0,41874,Kyle,Peck,KylePeck71@gmail.com,ballet flats,faux-leather,black,385
1,54885,Carol,Mclaughlin,CM3415@gmail.com,ballet flats,faux-leather,brown,440
2,35853,Jacob,Juarez,JJuarez1977@outlook.com,ballet flats,leather,red,331
3,35916,Michael,Christensen,Michael.Christensen@gmail.com,ballet flats,faux-leather,red,270
4,39587,Dennis,Vega,Dennis.Vega@gmail.com,ballet flats,faux-leather,brown,91
...,...,...,...,...,...,...,...,...
94,86546,Lisa,Spence,LSpence1998@gmail.com,wedges,faux-leather,black,115
95,19127,Philip,Dillard,Philip.Dillard@gmail.com,wedges,faux-leather,red,411
96,55075,Ashley,Rogers,Ashley.Rogers@hotmail.com,wedges,faux-leather,brown,269
97,35529,Carol,Reilly,CarolReilly18@gmail.com,wedges,fabric,brown,429


In [92]:
cheap_shoes = orders_df.groupby('shoe_color').price \
         .apply(lambda x: np.percentile(x,25)) \
         .reset_index()

In [93]:
cheap_shoes

Unnamed: 0,shoe_color,price
0,black,222.25
1,brown,193.5
2,navy,205.5
3,red,250.0
4,white,196.0


Decoding the above output:

About 25% of black shoes are priced below 222.25
About 25% of brown shoes are priced below 193.50
About 25% of navy shoes are priced below 205.50
About 25% of red shoes are priced below 250.00
About 25% of white shoes are priced below 196.00

In [95]:
# Counting NaN values in the 'price' column

print(orders_df['price'].isna().sum())

0
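
Checking for NaN matters here because `np.percentile` returns NaN when its input contains a missing value. The sketch below uses made-up prices to show the difference between `np.percentile`, `np.nanpercentile`, and the pandas `quantile` method.

```python
import numpy as np
import pandas as pd

# Made-up prices with one missing value
prices_demo = np.array([1.0, np.nan, 3.0])

# np.percentile propagates NaN
with_nan = np.percentile(prices_demo, 25)

# np.nanpercentile ignores NaN values
skip_nan = np.nanpercentile(prices_demo, 25)

# pandas quantile also skips NaN
pandas_q25 = pd.Series(prices_demo).quantile(0.25)

print(with_nan, skip_nan, pandas_q25)
```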


## Calculating Aggregate Functions IV - group by more than one column

We can group by more than one column by passing a list of column names into the groupby method.

#### Example

Imagine that we run a chain of stores and have data about the number of sales at different locations on different days:

Location	    Date	    Day of Week	Total Sales
West Village	February 1	W	        400
West Village	February 2	Th	        450
Chelsea	        February 1	W	        375
Chelsea	        February 2	Th	        390

In [None]:
We suspect that sales are different at different locations on different days of the week. 

In order to test this hypothesis, we could calculate the average sales for each store on each day of the week across multiple months.

In [97]:
store_chain_df = pd.DataFrame([
    ['West Village', 'February 1', 'W', 400],
    ['West Village', 'February 2', 'Th', 450],
    ['Chelsea', 'February 1', 'W', 375],
    ['Chelsea', 'February 2', 'Th', 390],
], columns=['Location', 'Date', 'Day of Week', 'Total Sales']
)

In [98]:
store_chain_df

Unnamed: 0,Location,Date,Day of Week,Total Sales
0,West Village,February 1,W,400
1,West Village,February 2,Th,450
2,Chelsea,February 1,W,375
3,Chelsea,February 2,Th,390


### calculate the average sales for each store on each day of the week across multiple months

In [105]:
av_sales1 = store_chain_df.groupby(['Location', 'Day of Week'])['Total Sales'].mean()

In [106]:
av_sales1

Location      Day of Week
Chelsea       Th             390.0
              W              375.0
West Village  Th             450.0
              W              400.0
Name: Total Sales, dtype: float64

In [107]:
print(type(av_sales1))

<class 'pandas.core.series.Series'>


In [108]:
av_sales2 = store_chain_df.groupby(['Location', 'Day of Week'])['Total Sales']  \
        .mean()   \
        .reset_index()

In [109]:
av_sales2

Unnamed: 0,Location,Day of Week,Total Sales
0,Chelsea,Th,390.0
1,Chelsea,W,375.0
2,West Village,Th,450.0
3,West Village,W,400.0


In [110]:
print(type(av_sales2))

<class 'pandas.core.frame.DataFrame'>


## Exercise - Task

In [None]:
At ShoeFly.com, our Purchasing team thinks that certain shoe_type/shoe_color combinations are particularly popular this year 
(for example, blue ballet flats are all the rage in Paris).

Create a DataFrame with the total number of shoes of each shoe_type/shoe_color combination purchased. 
Save it to the variable shoe_counts.

You should be able to do this using groupby and count().

Note: When we’re using count(), it doesn’t really matter which column we perform the calculation on. 
You should use id in this example, but we would get the same answer if we used shoe_type or last_name.

Remember to use reset_index() at the end of your code!

#### total number of shoes of each shoe_type/shoe_color combination purchased

In [111]:
orders_df.head()

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,price
0,41874,Kyle,Peck,KylePeck71@gmail.com,ballet flats,faux-leather,black,385
1,54885,Carol,Mclaughlin,CM3415@gmail.com,ballet flats,faux-leather,brown,440
2,35853,Jacob,Juarez,JJuarez1977@outlook.com,ballet flats,leather,red,331
3,35916,Michael,Christensen,Michael.Christensen@gmail.com,ballet flats,faux-leather,red,270
4,39587,Dennis,Vega,Dennis.Vega@gmail.com,ballet flats,faux-leather,brown,91


In [112]:
orders_df.groupby(['shoe_type','shoe_color']).id.count()

shoe_type     shoe_color
ballet flats  black         2
              brown         5
              red           3
              white         5
boots         black         3
              brown         5
              navy          6
              red           2
              white         3
clogs         black         4
              brown         6
              navy          1
              red           4
              white         1
sandals       black         1
              brown         4
              navy          5
              red           3
              white         4
stilettos     black         5
              brown         3
              navy          2
              red           2
              white         2
wedges        black         3
              brown         4
              navy          4
              red           5
              white         2
Name: id, dtype: int64

In [113]:
shoe_counts = orders_df.groupby(['shoe_type','shoe_color']).id.count().reset_index()

In [114]:
shoe_counts

Unnamed: 0,shoe_type,shoe_color,id
0,ballet flats,black,2
1,ballet flats,brown,5
2,ballet flats,red,3
3,ballet flats,white,5
4,boots,black,3
5,boots,brown,5
6,boots,navy,6
7,boots,red,2
8,boots,white,3
9,clogs,black,4


### Note

In [None]:
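When we’re using count(), it usually doesn’t matter which column we perform the calculation on, as long as that column has no missing values (count() skips NaN). 

You should use id in this example, but we would get the same answer if we used shoe_type or last_name, provided those columns are complete.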
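
`count()` only counts non-missing values, so the choice of column can change the answer when a column has gaps. The sketch below shows this on a made-up `orders_demo` table with one missing last name, and uses `size()` to count rows regardless of missing values.

```python
import pandas as pd

# Made-up orders with one missing last_name
orders_demo = pd.DataFrame({
    'id': [1, 2, 3],
    'shoe_type': ['boots', 'boots', 'clogs'],
    'last_name': ['Peck', None, 'Hanson'],
})

by_id = orders_demo.groupby('shoe_type')['id'].count()
by_last_name = orders_demo.groupby('shoe_type')['last_name'].count()

# size() counts rows per group, whether or not values are missing
by_size = orders_demo.groupby('shoe_type').size()

print(by_id)
print(by_last_name)
print(by_size)
```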

## Pivot Tables

When we perform a groupby across multiple columns, we often want to change how our data is stored.

In [None]:
For instance, recall the example where we are running a chain of stores and have data about the number of sales at different locations on 
different days:

Location	    Date	    Day of Week	Total Sales
West Village	February 1	W	        400
West Village	February 2	Th	        450
Chelsea	        February 1	W	        375
Chelsea	        February 2	Th	        390

In [None]:
We suspected that there might be different sales on different days of the week at different stores, so we performed a `groupby` across 
two different columns (`Location` and `Day of Week`). 

This gave us results that looked like this: av_sales2

In [115]:
av_sales2

Unnamed: 0,Location,Day of Week,Total Sales
0,Chelsea,Th,390.0
1,Chelsea,W,375.0
2,West Village,Th,450.0
3,West Village,W,400.0


In order to test our hypothesis, it would be more useful if the table was formatted like this:

Location	   W	 Th
--------      ---   ---
Chelsea	      375   390
West Village  400   450

##### Reorganizing a table in this way is called pivoting. The new table is called a pivot table.

### In Pandas, the command for pivot is:


df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')

For our specific example, we would write the command like this:

In [120]:
# Non-Pivot

Unpivoted = store_chain_df.groupby(['Location', 'Day of Week'])['Total Sales']  \
        .mean()   \
        .reset_index()

In [121]:
Unpivoted

Unnamed: 0,Location,Day of Week,Total Sales
0,Chelsea,Th,390.0
1,Chelsea,W,375.0
2,West Village,Th,450.0
3,West Village,W,400.0


In [125]:
# STEP1 :  First use the groupby statement

Unpivoted = store_chain_df.groupby(['Location', 'Day of Week'])['Total Sales']  \
        .mean()   \
        .reset_index()

# STEP2 : Now, Pivot the table

pivoted = Unpivoted.pivot(
    columns = ['Day of Week'],
    index   = ['Location'],
    values  = ['Total Sales'] 
)

In [126]:
pivoted

Unnamed: 0_level_0,Total Sales,Total Sales
Day of Week,Th,W
Location,Unnamed: 1_level_2,Unnamed: 2_level_2
Chelsea,390.0,375.0
West Village,450.0,400.0


In [127]:
print(type(pivoted))

<class 'pandas.core.frame.DataFrame'>


### Note

The output of a pivot command is a new DataFrame. As with groupby, the row labels (here, Location) end up in the index rather than in an ordinary column, 
so we usually follow up with .reset_index().
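
The two-level column header in the output above comes from passing `values` as a list. As a sketch of an alternative, passing plain strings gives a single header row. The `unpivoted_demo` table below is rebuilt by hand from the grouped store data.

```python
import pandas as pd

# Grouped store data, rebuilt by hand so this cell is self-contained
unpivoted_demo = pd.DataFrame({
    'Location': ['Chelsea', 'Chelsea', 'West Village', 'West Village'],
    'Day of Week': ['Th', 'W', 'Th', 'W'],
    'Total Sales': [390.0, 375.0, 450.0, 400.0],
})

# Plain strings (not lists) produce a single level of column labels
pivoted_flat = unpivoted_demo.pivot(
    index='Location',
    columns='Day of Week',
    values='Total Sales'
).reset_index()

print(pivoted_flat)
```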

### Exercise

In [None]:
In the previous example, you created a DataFrame with the total number of shoes of each shoe_type/shoe_color combination purchased 
for ShoeFly.com.

The purchasing manager complains that this DataFrame is confusing.

Make it easier for her to compare purchases of different shoe colors of the same shoe type by creating a pivot table. 
Save your results to the variable shoe_counts_pivot.

Your table should look like this:

shoe_type	  black	 brown	navy  red  white
ballet flats	…	 …	    …	  …	   …
sandals	        …	 …	    …	  …	   …
stilettos	    … 	 …	    …	  …	   …
wedges	        …	 …	    …	  …	   …

In [128]:
shoe_counts

Unnamed: 0,shoe_type,shoe_color,id
0,ballet flats,black,2
1,ballet flats,brown,5
2,ballet flats,red,3
3,ballet flats,white,5
4,boots,black,3
5,boots,brown,5
6,boots,navy,6
7,boots,red,2
8,boots,white,3
9,clogs,black,4


In [132]:
shoe_counts_pivot = shoe_counts.pivot(
    columns = ['shoe_color'],
    index   = ['shoe_type'],
    values  = ['id']
).reset_index()

In [133]:
shoe_counts_pivot

Unnamed: 0_level_0,shoe_type,id,id,id,id,id
shoe_color,Unnamed: 1_level_1,black,brown,navy,red,white
0,ballet flats,2.0,5.0,,3.0,5.0
1,boots,3.0,5.0,6.0,2.0,3.0
2,clogs,4.0,6.0,1.0,4.0,1.0
3,sandals,1.0,4.0,5.0,3.0,4.0
4,stilettos,5.0,3.0,2.0,2.0,2.0
5,wedges,3.0,4.0,4.0,5.0,2.0
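
The NaN for navy ballet flats appears because that combination has no row in shoe_counts, and the NaN in turn forces the counts to floats. If a missing combination can be treated as zero purchases (an assumption about the data, not something the exercise states), the sketch below fills the gaps and restores integer counts on a made-up `shoe_counts_demo` table.

```python
import pandas as pd

# Made-up counts with two missing type/color combinations
shoe_counts_demo = pd.DataFrame({
    'shoe_type': ['ballet flats', 'ballet flats', 'boots', 'boots'],
    'shoe_color': ['black', 'brown', 'black', 'navy'],
    'id': [2, 5, 3, 6],
})

pivot_demo = shoe_counts_demo.pivot(
    index='shoe_type',
    columns='shoe_color',
    values='id'
)

# Assumption: a missing combination means zero purchases
pivot_filled = pivot_demo.fillna(0).astype(int).reset_index()

print(pivot_filled)
```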


## Final Exercise

Let’s examine some more data from ShoeFly.com. This time, we’ll be looking at data about user visits to the website.

In [138]:
user_visits_df = pd.read_csv(r'D:\GIT_Repositories\pandas\page_visits.csv')

In [139]:
user_visits_df

Unnamed: 0,id,first_name,last_name,email,month,utm_source
0,10043,Louis,Koch,LouisKoch43@gmail.com,03-Mar,yahoo
1,10150,Bruce,Webb,BruceWebb44@outlook.com,03-Mar,twitter
2,10155,Nicholas,Hoffman,Nicholas.Hoffman@gmail.com,02-Feb,google
3,10178,William,Key,William.Key@outlook.com,03-Mar,yahoo
4,10208,Karen,Bass,KB4971@gmail.com,02-Feb,google
...,...,...,...,...,...,...
2995,99850,Gerald,Mccarthy,GM3575@hotmail.com,03-Mar,facebook
2996,99914,Denise,Frost,DF1650@outlook.com,01-Jan,google
2997,99929,Noah,Ferguson,NF6909@hotmail.com,02-Feb,facebook
2998,99968,Grace,Vaughan,GVaughan1973@outlook.com,03-Mar,email


### Task

In [None]:
The column utm_source contains information about how users got to ShoeFly’s homepage. 

For instance, if utm_source is facebook, then the user came to ShoeFly by clicking on an ad on Facebook.com.

Use a groupby statement to calculate how many visits came from each of the different sources. Save your answer to the variable click_source.

Remember to use reset_index()!

In [140]:
user_visits_df.head()

Unnamed: 0,id,first_name,last_name,email,month,utm_source
0,10043,Louis,Koch,LouisKoch43@gmail.com,03-Mar,yahoo
1,10150,Bruce,Webb,BruceWebb44@outlook.com,03-Mar,twitter
2,10155,Nicholas,Hoffman,Nicholas.Hoffman@gmail.com,02-Feb,google
3,10178,William,Key,William.Key@outlook.com,03-Mar,yahoo
4,10208,Karen,Bass,KB4971@gmail.com,02-Feb,google


In [142]:
click_source = user_visits_df.groupby('utm_source').id.count().reset_index()
click_source

Unnamed: 0,utm_source,id
0,email,462
1,facebook,823
2,google,543
3,twitter,415
4,yahoo,757


## Task 2

In [None]:
Our Marketing department thinks that the traffic to our site has been changing over the past few months. 

Use groupby to calculate the number of visits to our site from each utm_source for each month. 

Save your answer to the variable click_source_by_month.

In [143]:
click_source_by_month = user_visits_df.groupby(['utm_source', 'month']).id.count().reset_index()

In [144]:
click_source_by_month

Unnamed: 0,utm_source,month,id
0,email,01-Jan,43
1,email,02-Feb,147
2,email,03-Mar,272
3,facebook,01-Jan,404
4,facebook,02-Feb,263
5,facebook,03-Mar,156
6,google,01-Jan,127
7,google,02-Feb,196
8,google,03-Mar,220
9,twitter,01-Jan,164


## Task 3

In [None]:
The head of Marketing is complaining that this table is hard to read. 

Use pivot to create a pivot table where the rows are utm_source and the columns are month. 

Save your results to the variable click_source_by_month_pivot.

It should look something like this:

utm_source	1 - January	2 - February	3 - March
email	    …	        …	            …
facebook	…	        …	            …
google	    …	        …	            …
twitter	    …	        …	            …
yahoo	    …	        …	            …

In [145]:
click_source_by_month_pivot = click_source_by_month.pivot(
    columns = ['month'],
    index = ['utm_source'],
    values = ['id']
).reset_index()

In [146]:
click_source_by_month_pivot

Unnamed: 0_level_0,utm_source,id,id,id
month,Unnamed: 1_level_1,01-Jan,02-Feb,03-Mar
0,email,43,147,272
1,facebook,404,263,156
2,google,127,196,220
3,twitter,164,154,97
4,yahoo,262,240,255
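
The groupby and pivot steps can also be combined with `pivot_table`, which aggregates while it reshapes. This is a sketch on a tiny made-up `visits_demo` table, not the method the exercise asks for.

```python
import pandas as pd

# Tiny made-up visits table with the same column names as page_visits.csv
visits_demo = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'month': ['01-Jan', '01-Jan', '02-Feb', '01-Jan', '02-Feb'],
    'utm_source': ['email', 'email', 'email', 'google', 'google'],
})

# pivot_table groups by index/columns and applies aggfunc to values
visits_pivot = visits_demo.pivot_table(
    index='utm_source',
    columns='month',
    values='id',
    aggfunc='count'
).reset_index()

print(visits_pivot)
```

Unlike `pivot`, `pivot_table` accepts raw, unaggregated rows, so the intermediate grouped DataFrame is not needed.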
