# Summary Statistics

Summary statistics are single numbers, such as the mean, minimum, or standard deviation, that describe a column of your dataset.

In [2]:
import pandas as pd
import numpy as np


pitches = pd.read_csv("/data/workspace_files/stocks_pitches.csv")
pitches.head()

,Name,Associate,Ticker,Pitch Date,Pitch Price,Curr,Today Price,Price %,Total Divs,Currency ?,Total %
0,ABB Ltd,Matt Smith,ABB,06/03/2018,22.8,CHF,21.86,-0.04,1.623,1.02,0.03
1,Alibaba Group Holding Ltd,Anna Erofeeva,BABA,05/12/2017,177.62,USD,186.78,0.05,0.0,1.04,0.05
2,Apple Inc,Darren Wyy,AAPL,16/01/2018,176.3,USD,261.78,0.48,5.86,1.06,0.55
3,Arista Networks Inc,Xu Jiawei,ANET,14/11/2017,225.5,USD,194.49,-0.14,0.0,1.02,-0.14
4,BHP Billiton Ltd,Matt Smith,BHP,05/12/2017,42.1,USD,50.84,0.21,7.06,1.04,0.39


In [3]:
pitches["Price %"].mean()

-0.0364

In [4]:
pitches["Price %"].min()

-0.88

In [5]:
pitches["Price %"].max()

0.83

In [6]:
pitches["Price %"].var()

0.17796566666666666

In [7]:
pitches["Price %"].std()

0.4218597713300791

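Other built-in methods include .sum(), .quantile(), and .mode(). Note that .min() and .max() work on dates too, provided the column has been parsed as datetimes; on date strings they compare alphabetically instead.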
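
As a rough illustration of these methods, here is a minimal sketch on a small made-up frame (the values are hypothetical, not taken from stocks_pitches.csv). It assumes the dates are day-first strings, like the Pitch Date column appears to be, and parses them with pd.to_datetime before calling .min() and .max().

```python
import pandas as pd

# Hypothetical toy data, not the real pitches file
toy = pd.DataFrame({
    "Pitch Date": ["06/03/2018", "05/12/2017", "16/01/2018", "14/11/2017"],
    "Price %": [-0.04, 0.05, 0.48, 0.05],
})

total = toy["Price %"].sum()
median = toy["Price %"].quantile(0.5)
modes = toy["Price %"].mode()  # returns a Series, because ties are possible

# Parse the day-first strings so min/max compare chronologically
dates = pd.to_datetime(toy["Pitch Date"], format="%d/%m/%Y")
earliest, latest = dates.min(), dates.max()

# On the raw strings, min/max would compare character by character instead
earliest_as_text = toy["Pitch Date"].min()

print(total, median, modes.tolist(), earliest.date(), latest.date(), earliest_as_text)
```

The last line is there to show the difference: the smallest string is not the earliest date, which is why parsing first matters.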

We can use .agg() to apply custom calculations. It can be applied to several columns at once.

In [8]:
def pct30(column): 
    return column.quantile(0.3)

pitches["Price %"].agg(pct30)

-0.16400000000000003

.agg() can also take a list of functions and apply each of them.

In [9]:
def pct40(column): 
    return column.quantile(0.4)

pitches[["Price %", "Total %"]].agg([pct30,pct40])

,Price %,Total %
pct30,-0.164,-0.134
pct40,-0.128,-0.11
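
Writing one function per percentile gets repetitive. One possible alternative, sketched here on made-up data, is a small factory that builds the quantile function for any percentile. The helper name pct is hypothetical; the sketch relies on .agg() labelling each result row with the function's __name__, as it does with pct30 and pct40 above.

```python
import pandas as pd

def pct(q):
    """Build a quantile function; its __name__ becomes the row label in .agg()."""
    def _pct(column):
        return column.quantile(q)
    _pct.__name__ = f"pct{round(q * 100)}"
    return _pct

# Hypothetical toy data: 11 evenly spaced values per column
toy = pd.DataFrame({
    "Price %": [i / 10 for i in range(11)],
    "Total %": [float(i) for i in range(11)],
})

result = toy[["Price %", "Total %"]].agg([pct(0.3), pct(0.4)])
print(result)
```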


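Pandas also has cumulative methods: .cumsum(), .cummax(), .cummin(), and .cumprod(). Each returns a Series the same length as the input, holding the running result at every row. Be wary of NaN values: by default they are skipped, so the output is NaN at that row and the running calculation carries on from the next valid value.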

In [10]:
pitches["Price %"].cumsum()
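
To make the NaN behaviour concrete, here is a minimal sketch on a made-up Series (hypothetical values, not from the pitches data). It compares the default, skipna=False, and filling the gaps with zero first.

```python
import numpy as np
import pandas as pd

returns = pd.Series([0.10, np.nan, -0.05, 0.20])  # made-up values

running = returns.cumsum()             # NaN stays NaN, the total carries on
strict = returns.cumsum(skipna=False)  # everything from the NaN onwards is NaN
filled = returns.fillna(0).cumsum()    # treat a missing value as zero

print(pd.DataFrame({"running": running, "strict": strict, "filled": filled}))
```

Which variant is appropriate depends on what a missing value means in your data; filling with zero is only reasonable if a gap really does mean "no change".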

# Counting in Pandas

Counting is used to summarise categorical data. .value_counts() returns the number of rows for each distinct value. Results are sorted by count, largest first, by default (sort=True); pass normalize=True to get each value's share of the total as a fraction.

In [11]:
pitches["Curr"].value_counts()

In [12]:
pitches["Curr"].value_counts(normalize=True)
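
A minimal sketch of both calls on a made-up Series (hypothetical currency codes, not the real Curr column), including one way to turn the fractions into percentages:

```python
import pandas as pd

curr = pd.Series(["USD", "EUR", "USD", "GBX", "USD", "EUR"])  # made-up values

counts = curr.value_counts()                # sorted by count, largest first
shares = curr.value_counts(normalize=True)  # fractions that sum to 1
percent = (shares * 100).round(1)           # same shares as percentages

print(counts)
print(percent)
```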

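We can use .drop_duplicates() to remove duplicate rows. With subset=, rows count as duplicates when they match on those columns only, and the first occurrence is kept by default. Note that the column name in the call below is written with a trailing space ("Associate ").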

In [13]:
pitches.drop_duplicates(subset="Associate ")

,Name,Associate,Ticker,Pitch Date,Pitch Price,Curr,Today Price,Price %,Total Divs,Currency ?,Total %
0,ABB Ltd,Matt Smith,ABB,06/03/2018,22.8,CHF,21.86,-0.04,1.623,1.02,0.03
1,Alibaba Group Holding Ltd,Anna Erofeeva,BABA,05/12/2017,177.62,USD,186.78,0.05,0.0,1.04,0.05
2,Apple Inc,Darren Wyy,AAPL,16/01/2018,176.3,USD,261.78,0.48,5.86,1.06,0.55
3,Arista Networks Inc,Xu Jiawei,ANET,14/11/2017,225.5,USD,194.49,-0.14,0.0,1.02,-0.14
5,BlackRock Inc,Marcus McLaney,BLK,27/02/2018,563.97,USD,485.0,-0.14,21.92,1.07,-0.11
6,Domino's Pizza Inc,Dev Singh,DPZ,12/12/2017,180.96,USD,285.76,0.58,4.61,1.03,0.62
7,Estee Lauder Companies Inc,Kelvin Fang,EL,20/02/2018,139.65,USD,193.18,0.38,2.86,1.08,0.44
8,Foot Locker Inc,Arjun Kandola,FL,06/03/2018,42.1,USD,40.25,-0.04,2.52,1.07,0.02
11,Intertek Group plc,Joshua Zeng,ITRK.L,23/01/2018,5178.0,GBX,5504.0,0.06,181.1,1.0,0.1
13,Marriott International Inc,Jerry Kim,MAR,20/02/2018,140.44,USD,136.25,-0.03,3.41,1.08,-0.01
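
The subset= and keep= options can be sketched on a small made-up frame (hypothetical associates "A" and "B", and a column named "Associate" without the trailing space):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Associate": ["A", "B", "A", "A"],
    "Curr": ["CHF", "USD", "USD", "USD"],
    "Total %": [0.03, 0.05, 0.39, -0.11],
})

first_per_associate = toy.drop_duplicates(subset="Associate")              # keeps rows 0, 1
last_per_associate = toy.drop_duplicates(subset="Associate", keep="last")  # keeps rows 1, 3
unique_pairs = toy.drop_duplicates(subset=["Associate", "Curr"])           # keeps rows 0, 1, 2

print(first_per_associate)
```

The original row labels are preserved, which is why the index in the output above has gaps.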


# Grouped Summary Statistics

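Summary statistics can be used to compare groups. We can already do this by filtering each group and summarising it individually, but .groupby() does it in one call. Note the different brackets used here: parentheses for the .groupby() call, then square brackets to select the column to summarise.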

We can combine this with .agg() to compute several statistics per group at once.

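Note that np.mean is used for the mean in .agg() below because Python has no built-in mean function to pass alongside min, max, and sum. NumPy is not strictly required: .agg() also accepts method names as strings, such as "mean", and recent pandas versions recommend the string form.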

In [14]:
print("USD mean:", pitches[pitches["Curr"] == "USD"]["Total %"].mean())
print("EUR mean:", pitches[pitches["Curr"] == "EUR"]["Total %"].mean())
print("GBX mean:", pitches[pitches["Curr"] == "GBX"]["Total %"].mean())

USD mean: 0.005263157894736841
EUR mean: -0.08999999999999998
GBX mean: -0.0049999999999999975


In [15]:
pitches.groupby("Curr")["Total %"].mean()

In [16]:
pitches.groupby("Curr")["Total %"].agg([min, max, sum, np.mean])

Curr,min,max,sum,mean
CHF,0.03,0.03,0.03,0.03
EUR,-0.41,0.31,-0.27,-0.09
GBX,-0.11,0.1,-0.01,-0.005
USD,-0.94,0.87,0.1,0.005263


In [17]:
pitches.groupby(["Associate ", "Curr"])["Total %"].agg([min, max, np.mean])

Associate,Curr,min,max,mean
Adam Barbarowicz,EUR,0.31,0.31,0.31
Amos Jaupi,EUR,-0.41,-0.41,-0.41
Anna Erofeeva,USD,-0.94,0.05,-0.445
Arjun Kandola,USD,-0.55,0.02,-0.265
Charlotte Pinder,EUR,-0.17,-0.17,-0.17
Charlotte Pinder,USD,-0.41,-0.41,-0.41
Darren Wyy,USD,0.27,0.55,0.41
Dev Singh,USD,-0.87,0.62,0.1
Edmund Xia,GBX,-0.11,-0.11,-0.11
Jerry Kim,USD,-0.01,-0.01,-0.01
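
As an alternative to passing function objects, the same kind of summary can be sketched with string names, which needs no NumPy import. The data below is made up, and the output column names worst, best, and average are arbitrary labels chosen for this example.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Curr": ["USD", "USD", "EUR", "EUR", "GBX"],
    "Total %": [0.10, -0.20, 0.30, 0.10, 0.05],
})

by_curr = toy.groupby("Curr")["Total %"]

# A list of method names gives one column per statistic
summary = by_curr.agg(["min", "max", "mean"])

# Keyword form lets you choose the output column names
named = by_curr.agg(worst="min", best="max", average="mean")

print(summary)
print(named)
```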


# Pivot Tables

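Pivot tables are another way to calculate grouped summary statistics, similar to pivot tables in Excel. By default .pivot_table() takes the mean of values= for each group in index=. We can change that by passing aggfunc= to the call, either a single function or a list of them.

We can group on a second variable by passing columns= to the call. Combinations that do not occur in the data come out as NaN unless fill_value= is set, and margins=True adds an "All" row and column summarising across every group. Keep in mind that with fill_value=0 a filled cell looks the same as a genuine value of zero.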

In [18]:
pitches.pivot_table(values="Total %", index="Curr")

Curr,Total %
CHF,0.03
EUR,-0.09
GBX,-0.005
USD,0.005263


In [19]:
pitches.pivot_table(values="Total %", index="Curr", aggfunc=np.max)

Curr,Total %
CHF,0.03
EUR,0.31
GBX,0.1
USD,0.87


In [20]:
pitches.pivot_table(values="Total %", index="Curr", aggfunc=[np.max, np.min])

,amax,amin
Curr,Total %,Total %
CHF,0.03,0.03
EUR,0.31,-0.41
GBX,0.1,-0.11
USD,0.87,-0.94


In [21]:
pitches.pivot_table(values="Total %", index="Curr", columns="Associate ", aggfunc=[np.max, np.min], fill_value=0, margins=True)

,amax,amax,amax,amax,amax,amax,amax,amax,amax,amax,...,amin,amin,amin,amin,amin,amin,amin,amin,amin,amin
Associate,Adam Barbarowicz,Amos Jaupi,Anna Erofeeva,Arjun Kandola,Charlotte Pinder,Darren Wyy,Dev Singh,Edmund Xia,Jerry Kim,Joshua Zeng,...,Darren Wyy,Dev Singh,Edmund Xia,Jerry Kim,Joshua Zeng,Kelvin Fang,Marcus McLaney,Matt Smith,Xu Jiawei,All
Curr
CHF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.03
EUR,0.31,-0.41,0.0,0.0,-0.17,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.41
GBX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.11,0.0,0.1,...,0.0,0.0,-0.11,0.0,0.1,0.0,0.0,0.0,0.0,-0.11
USD,0.0,0.0,0.05,0.02,-0.41,0.55,0.62,0.0,-0.01,-0.53,...,0.27,-0.87,0.0,-0.01,-0.53,0.44,-0.11,-0.11,-0.14,-0.94
All,0.31,-0.41,0.05,0.02,-0.17,0.55,0.62,-0.11,-0.01,0.1,...,0.27,-0.87,-0.11,-0.01,-0.53,0.44,-0.11,-0.11,-0.14,-0.94
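
To see how these options fit together, here is a minimal sketch on a small made-up frame (hypothetical associates "A" and "B", and a column named "Associate" without the trailing space). It checks that the default pivot table matches a grouped mean, then builds a two-way table with fill_value= and margins=.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Curr": ["USD", "USD", "EUR", "GBX"],
    "Associate": ["A", "B", "A", "B"],
    "Total %": [0.10, -0.20, 0.30, 0.05],
})

# The default pivot table is a grouped mean
via_pivot = toy.pivot_table(values="Total %", index="Curr")
via_groupby = toy.groupby("Curr")[["Total %"]].mean()
same = np.allclose(via_pivot.to_numpy(), via_groupby.to_numpy())

# Two-way table: missing combinations become 0, "All" holds the overall max
wide = toy.pivot_table(values="Total %", index="Curr", columns="Associate",
                       aggfunc="max", fill_value=0, margins=True)

print(same)
print(wide)
```

In this sketch the "All" margins are computed from the underlying data rather than from the filled zeros, which matches the output above, where the "All" row for amax can be negative even though the column contains filled zeros.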
