## Aggregating Statistics

#### EDA = Exploratory Data Analysis

##### Main Types Of EDA:
##### 1) Summary Statistics
###### - Group By and Pivot Tables
##### 2) Data Visualization
###### - Numerical/Categorical Distributions of Features, Seaborn Library, Relationships Between Features in Numerical/Categorical Data

In [2]:
import pandas as pd

In [3]:
air_quality =  pd.read_pickle('air_quality.pkl')

In [4]:
## Having worked on this dataset previously, we can see there is no missing data: the Non-Null Count is the same for every column and matches the number of entries
## We can also see there is a mix of categorical data (object/category/bool) and numerical data (float64 and integer dtypes)
air_quality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95685 entries, 0 to 95684
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype          
---  ------                 --------------  -----          
 0   date_time              95685 non-null  datetime64[ns] 
 1   PM2.5                  95685 non-null  float64        
 2   PM10                   95685 non-null  float64        
 3   SO2                    95685 non-null  float64        
 4   NO2                    95685 non-null  float64        
 5   CO                     95685 non-null  float64        
 6   O3                     95685 non-null  float64        
 7   TEMP                   95685 non-null  float64        
 8   PRES                   95685 non-null  float64        
 9   DEWP                   95685 non-null  float64        
 10  RAIN                   95685 non-null  float64        
 11  wd                     95685 non-null  object         
 12  WSPM                   95685 non-null  float64

### Summary Statistics For A Series

## Numeric

In [7]:
## This can be done on one column of a DataFrame
## Selecting a single column returns a Series
air_quality['TEMP']

0        -0.5
1        -0.7
2        -2.4
3        -2.5
4        -1.4
         ... 
95680    15.4
95681    14.9
95682    10.8
95683    10.5
95684     8.6
Name: TEMP, Length: 95685, dtype: float64

In [8]:
## We can then apply methods to that Series
## The count() method is used here. It returns the number of non-missing values
## Notice it matches the Non-Null Count shown by the info() method on the whole DataFrame
air_quality['TEMP'].count()

95685

In [9]:
## The mean() method is used here.
air_quality['TEMP'].mean()

13.72944615261222

In [10]:
## The std() method is used here.
## std = standard deviation of the data.
## It measures how spread out the values are around their mean.
## The higher the std, the more spread out the data is from its mean.
air_quality['TEMP'].std()

11.320713457662487
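To make the standard deviation less of a black box, here is a minimal sketch that computes it by hand and compares it with `.std()`. The `temps` values are made up for illustration and are not taken from the air quality data. Note that pandas divides by n - 1 by default (`ddof=1`, the sample standard deviation).

```python
import math

import pandas as pd

# Hypothetical toy temperatures, not taken from the air quality data.
temps = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = temps.mean()                    # 5.0
squared_dev = (temps - mean) ** 2      # squared distance of each value from the mean

# Sample standard deviation: divide by n - 1 (the pandas default, ddof=1).
sample_std = math.sqrt(squared_dev.sum() / (len(temps) - 1))

# Population standard deviation: divide by n (ddof=0).
population_std = math.sqrt(squared_dev.sum() / len(temps))

print(sample_std, temps.std())              # the two numbers should agree
print(population_std, temps.std(ddof=0))    # the two numbers should agree
```

Passing `ddof=0` is only needed when the data is the whole population rather than a sample.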

In [11]:
## The min() method is used here.
air_quality['TEMP'].min()

-16.8

In [12]:
## The max() method is used here.
air_quality['TEMP'].max()

41.6

In [13]:
## The quantile(0.25) method is used here to get the 25% quantile of the data
## This is better known as the 1st quartile
## It means that 25% of the readings are at or below 3.5 degrees
air_quality['TEMP'].quantile(0.25)

3.5

In [14]:
## The median() method is used here.
air_quality['TEMP'].median()

14.6

In [15]:
## All of these may look familiar because they can all be seen at once using the .describe() method
air_quality['TEMP'].describe()

count    95685.000000
mean        13.729446
std         11.320713
min        -16.800000
25%          3.500000
50%         14.600000
75%         23.400000
max         41.600000
Name: TEMP, dtype: float64
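As a quick check that `.describe()` really is a bundle of the individual methods used so far, the sketch below rebuilds the summary by hand on a small made-up Series (`values` is hypothetical) and compares the two.

```python
import pandas as pd

# Hypothetical data, chosen so the quartiles are easy to verify by eye.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

summary = values.describe()

# Rebuild the same summary from the individual methods.
by_hand = pd.Series({
    "count": values.count(),
    "mean": values.mean(),
    "std": values.std(),
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),
    "75%": values.quantile(0.75),
    "max": values.max(),
})

print(summary)
print((summary - by_hand).abs().max())  # largest difference between the two summaries
```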

In [16]:
## The sum() method is used here.
air_quality['RAIN'].sum()

6171.299999999999

## Categorical

In [18]:
## The mode() method is used here.
## Unhealthy is the most frequent category in this column
air_quality['PM2.5_category'].mode()

0    Unhealthy
Name: PM2.5_category, dtype: category
Categories (6, object): ['Good' < 'Moderate' < 'Unhealthy for sensitive groups' < 'Unhealthy' < 'Very unhealthy' < 'Hazardous']

In [19]:
## The nunique() method is used here.
## Gives us the number of unique categories in this column
air_quality['PM2.5_category'].nunique()

6

In [20]:
## The describe() method gives a different set of outputs for categorical data than for numerical data
## Most are self-explanatory except freq, which shows the frequency of the most common category.
## So Unhealthy appeared 34257 times!
air_quality['PM2.5_category'].describe()

count         95685
unique            6
top       Unhealthy
freq          34257
Name: PM2.5_category, dtype: object
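The `top` and `freq` entries can be cross-checked with `value_counts()` and `mode()`. This is a minimal sketch on made-up labels (`labels` is hypothetical, not the real `PM2.5_category` column).

```python
import pandas as pd

# Hypothetical category labels for illustration only.
labels = pd.Series(
    ["Good", "Unhealthy", "Moderate", "Unhealthy", "Good", "Unhealthy"],
    dtype="category",
)

counts = labels.value_counts()   # one count per category, most frequent first
summary = labels.describe()      # count, unique, top, freq

print(counts)
print(summary)
print(labels.mode())             # the most frequent category
```

Here `top` is the first label in `counts` and `freq` is its count.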

## Summarize Multiple Columns At Once

In [51]:
air_quality.count()

date_time                95685
PM2.5                    95685
PM10                     95685
SO2                      95685
NO2                      95685
CO                       95685
O3                       95685
TEMP                     95685
PRES                     95685
DEWP                     95685
RAIN                     95685
wd                       95685
WSPM                     95685
station                  95685
year                     95685
month                    95685
day                      95685
hour                     95685
quarter                  95685
day_of_week_num          95685
day_of_week_name         95685
time_until_2022          95685
time_until_2022_days     95685
time_until_2022_weeks    95685
prior_2016_ind           95685
PM2.5_category           95685
TEMP_category            95685
dtype: int64

In [57]:
## The .mean() method can be applied to all columns of a DataFrame at once
## Depending on the pandas version, non-numeric columns are either skipped with a warning (older versions) or cause a TypeError (pandas 2.x)
## Use mean(numeric_only=True) to avoid this and get the mean of all numeric columns in the DataFrame
air_quality.mean(numeric_only=True)

PM2.5                      83.477884
PM10                      111.899959
SO2                        15.369771
NO2                        54.178310
CO                       1313.024142
O3                         56.786295
TEMP                       13.729446
PRES                     1011.397848
DEWP                        2.473027
RAIN                        0.064496
WSPM                        1.692044
year                     2014.737106
month                       6.503841
day                        15.648336
hour                       11.519862
quarter                     2.505732
day_of_week_num             3.021707
time_until_2022_days     2470.962593
time_until_2022_weeks     352.994656
prior_2016_ind              0.693369
dtype: float64
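An alternative to `numeric_only=True` is to pick the numeric columns first with `select_dtypes`. The sketch below uses a small made-up DataFrame (`df` and its values are hypothetical) to show that both routes give the same result.

```python
import pandas as pd

# Hypothetical mixed-type data for illustration only.
df = pd.DataFrame({
    "TEMP": [10.0, 20.0, 30.0],
    "RAIN": [0.0, 0.5, 1.0],
    "station": ["A", "B", "A"],   # non-numeric column
})

# Option 1: let mean() ignore the non-numeric column.
means_flag = df.mean(numeric_only=True)

# Option 2: keep only the numeric columns, then aggregate.
means_select = df.select_dtypes(include="number").mean()

print(means_flag)
print(means_select)
```

The second form is handy when several different aggregations should all run on the same numeric subset.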

In [61]:
## It usually doesn't make sense to take the mean of every column, so select specific columns first
air_quality[['PM2.5', 'TEMP']].mean()

PM2.5    83.477884
TEMP     13.729446
dtype: float64

In [63]:
air_quality[['PM2.5', 'TEMP']].min()

PM2.5     2.0
TEMP    -16.8
dtype: float64

In [65]:
air_quality[['PM2.5', 'TEMP']].max()

PM2.5    821.0
TEMP      41.6
dtype: float64

In [71]:
## The .describe() method includes only the numeric columns by default (recent pandas versions also include datetime columns, as date_time shows below)
## Use .T (transpose) to make the describe() output easier to read when there are many columns
air_quality.describe().T

             count                           mean                  min                  25%                  50%                  75%                  max          std
date_time  95685.0  2015-03-28 00:53:51.929769472  2013-03-01 00:00:00  2014-04-19 13:00:00  2015-04-08 10:00:00  2016-03-17 11:00:00  2017-02-28 23:00:00          NaN
PM2.5      95685.0                      83.477884                  2.0                 23.0                 59.0                116.0                821.0    82.678134
PM10       95685.0                     111.899959                  2.0                 41.0                 90.0                154.0                994.0    94.775897
SO2        95685.0                      15.369771                  1.0                  3.0                  7.0                 20.0                500.0    20.381452
NO2        95685.0                       54.17831                  2.0                 27.0                 48.0                 74.0                271.0      34.2446
CO         95685.0                    1313.024142                100.0                600.0                900.0               1600.0              10000.0  1187.100019
O3         95685.0                      56.786295               0.4284                  9.0                 43.0                 82.0               1071.0    57.899402
TEMP       95685.0                      13.729446                -16.8                  3.5                 14.6                 23.4                 41.6    11.320713
PRES       95685.0                    1011.397848                984.0               1003.0               1011.0               1019.6               1042.0     10.28721
DEWP       95685.0                       2.473027                -35.3                 -8.7                  3.0                 15.0                 28.8    13.717762


In [75]:
## To show the summary stats of the categorical columns use this:

air_quality.describe(include=['object','category','bool'])

           wd  station  day_of_week_name  prior_2016_ind  PM2.5_category  TEMP_category
count   95685    95685             95685           95685           95685          95685
unique     16        3                 7               2               6              5
top        NE  Tiantan            Sunday            True       Unhealthy            Hot
freq     9447    32843             13931           66345           34257          19189


In [77]:
## We can apply methods to categorical data in a similar manner
## Notice how hour lists 2 most frequent values. Let's investigate why
air_quality[['PM2.5_category','TEMP_category','hour']].mode()

   PM2.5_category  TEMP_category  hour
0       Unhealthy            Hot    21
1             NaN            NaN    23


In [83]:
## value_counts() counts how often each value occurs; head(5) keeps the 5 most frequent hours
## Notice how 23 and 21 have the same count.
## This is why the mode() method above listed both of them
air_quality['hour'].value_counts(ascending=False).head(5)

hour
23    4062
21    4062
0     4052
22    4043
1     4037
Name: count, dtype: int64
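The tie behaviour is easy to reproduce on a tiny made-up Series (`hours` is hypothetical). The sketch shows that `mode()` returns every value tied for the highest count, and that the same values can be recovered from `value_counts()`.

```python
import pandas as pd

# Hypothetical hours: 21 and 23 both appear twice, the rest once.
hours = pd.Series([21, 23, 21, 23, 0, 22])

modes = hours.mode()               # all values tied for the highest count, sorted ascending
counts = hours.value_counts()      # count per distinct value
top = counts[counts == counts.max()]   # keep only the values that share the maximum count

print(modes.tolist())
print(top)
```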

# Combinations Of Statistics

## .agg() Method


In [89]:
## The agg() method is used to aggregate the data
## The aggregation performed is determined by what is passed inside the () of agg()
## Inside the () you specify the function(s) to apply. See the examples below:

## Below, "mean" is passed as a string to the .agg() method. pandas resolves the string to the method with that name, so "mean" runs .mean() on each column
## Since we are only passing 1 function to .agg(), this is basically the same as just using .mean()
## But we can use more!

air_quality[['PM2.5','TEMP']].agg("mean")

PM2.5    83.477884
TEMP     13.729446
dtype: float64

In [91]:
## Here we can see the mean() method is the same as above

air_quality[['PM2.5','TEMP']].mean()

PM2.5    83.477884
TEMP     13.729446
dtype: float64

In [95]:
## Now let's pass multiple functions to the .agg() method.

air_quality[['PM2.5','TEMP']].agg(["min", "max","mean"])

         min    max       mean
PM2.5    2.0  821.0  83.477884
TEMP   -16.8   41.6  13.729446
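The shape of what `.agg()` returns depends on what is passed in. The following sketch uses a small made-up DataFrame (`df` is hypothetical): a single string gives a Series, while a list of strings gives a DataFrame with one row per function and one column per original column, at least on the pandas versions I am aware of. Transposing with `.T` flips that to one row per column.

```python
import pandas as pd

# Hypothetical readings for illustration only.
df = pd.DataFrame({"PM2.5": [9.0, 4.0, 5.0], "TEMP": [-0.5, -0.7, -2.4]})

single = df.agg("mean")                     # Series: one value per column
multiple = df.agg(["min", "max", "mean"])   # DataFrame: one row per function
by_column = multiple.T                      # DataFrame: one row per column

print(single)
print(multiple)
print(by_column)
```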


In [103]:
## The above was tried on 2 numeric columns. Let's try a numeric and a categorical column
## This does not work well because not every function (e.g. "mean") applies to categorical data
## Let's see if we can fix this

air_quality[['PM2.5','PM2.5_category']].agg(["min", "max","mean","nunique"])

In [118]:
## Using a dictionary {} we can specify the function per column
## See, this works now! But each column (key) gets only one function this way
## See the next cell for what happens when the same key is repeated with different functions

air_quality[['PM2.5','PM2.5_category']].agg({'PM2.5': "min",'PM2.5_category' :"nunique"})

PM2.5             2.0
PM2.5_category    6.0
dtype: float64

In [116]:
## Notice how only the mean is returned for PM2.5 rather than all 3 of min, max, and mean.
## This is because a dictionary can hold each key only once, so each repeated "PM2.5" entry overwrites the previous one
## BUT there is a way around this using LISTS!

air_quality[['PM2.5','PM2.5_category']].agg({'PM2.5': "min",'PM2.5': "max",'PM2.5': "mean",'PM2.5_category' :"nunique"})

PM2.5             83.477884
PM2.5_category     6.000000
dtype: float64
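The overwriting happens in plain Python before pandas ever sees the dictionary. This small sketch (the name `spec` is just for illustration) builds the same dictionary literal and inspects it.

```python
# The same key repeated in a dict literal: Python keeps only the last value.
spec = {
    "PM2.5": "min",
    "PM2.5": "max",
    "PM2.5": "mean",
    "PM2.5_category": "nunique",
}

print(spec)        # {'PM2.5': 'mean', 'PM2.5_category': 'nunique'}
print(len(spec))   # 2
```

So `.agg()` only ever receives two entries, which is why just the mean shows up for PM2.5.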

In [120]:
## Notice how a list of functions as the dictionary value is used to apply multiple functions to PM2.5

air_quality[['PM2.5','PM2.5_category']].agg({'PM2.5': ["min","max","mean"], 'PM2.5_category' :"nunique"})

             PM2.5  PM2.5_category
min            2.0             NaN
max          821.0             NaN
mean     83.477884             NaN
nunique        NaN             6.0


# Defining Our Own Functions For Statistics

In [124]:
## We want to find the range of one of the columns, but pandas has no built-in range method, so we need to write our own.
## Range = Max - Min
## The function takes a Series and uses the Series methods max() and min() to compute the range.

def max_minus_min(series):
    return series.max()-series.min()

In [126]:
## We can run the function by passing the column of air_quality['TEMP'] as a series into the function
max_minus_min(air_quality['TEMP'])

58.400000000000006

In [134]:
## Now let's try this with the .agg() method
## Note that max_minus_min is not in quotes because it is a function we defined ourselves, not a method name pandas can look up

air_quality[['PM2.5','TEMP']].agg(["min", "max",max_minus_min])

               PM2.5   TEMP
min              2.0  -16.8
max            821.0   41.6
max_minus_min  819.0   58.4
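The same pattern works for any statistic that takes a Series and returns a single number. As a second sketch, here is an interquartile range function (`iqr` and the toy `df` are hypothetical names and data, not part of the original notebook) mixed with built-in names in one `.agg()` call. The row label comes from the function's name.

```python
import pandas as pd


def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)


# Hypothetical data, chosen so the quartiles are easy to verify.
df = pd.DataFrame({
    "PM2.5": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TEMP": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Built-in aggregations are given as strings, the custom one as a function object.
result = df.agg(["min", "max", iqr])
print(result)
```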


# Summarize Data By Its ROWS

## Less common than by columns as we've been doing but still helpful to know

In [138]:
air_quality[["PM2.5","PM10"]]

       PM2.5  PM10
0        9.0   9.0
1        4.0   4.0
2        4.0   4.0
3        5.0   5.0
4        3.0   6.0
...      ...   ...
95680    9.0   9.0
95681   10.0  29.0
95682   18.0  32.0
95683   15.0  42.0


In [142]:
## By default this returns the minimum of each column
## But how do we make this give us the min across each row?
air_quality[["PM2.5","PM10"]].min()

PM2.5    2.0
PM10     2.0
dtype: float64

In [146]:
## Setting axis=1 (documented in the min() docstring, which Jupyter shows with Shift + Tab) applies the min() method across the columns of each row
## So for each row it gives us the minimum of that row
## Row 4 is a good example: above it holds 3.0 and 6.0, and the min() method below chose 3.0
air_quality[["PM2.5","PM10"]].min(axis = 1)

0         9.0
1         4.0
2         4.0
3         5.0
4         3.0
         ... 
95680     9.0
95681    10.0
95682    18.0
95683    15.0
95684    15.0
Length: 95685, dtype: float64
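The difference between the two axes is easiest to see side by side on a tiny made-up DataFrame (`df` is hypothetical). `axis=0`, the default, collapses the rows and leaves one value per column; `axis=1` collapses the columns and leaves one value per row. The string `"columns"` is an accepted alias for `axis=1`.

```python
import pandas as pd

# Hypothetical readings for illustration only.
df = pd.DataFrame({"PM2.5": [9.0, 3.0, 15.0], "PM10": [9.0, 6.0, 50.0]})

col_min = df.min()                       # axis=0 (default): one value per column
row_min = df.min(axis=1)                 # axis=1: one value per row
row_min_named = df.min(axis="columns")   # same as axis=1, spelled out

print(col_min)
print(row_min)
```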

In [148]:
## Now let's say we want the average across the rows
air_quality[["PM2.5","PM10"]].mean(axis = 1)

0         9.0
1         4.0
2         4.0
3         5.0
4         4.5
         ... 
95680     9.0
95681    19.5
95682    25.0
95683    28.5
95684    32.5
Length: 95685, dtype: float64

In [150]:
## Now let's say we want the sum across the rows
air_quality[["PM2.5","PM10"]].sum(axis = 1)

0        18.0
1         8.0
2         8.0
3        10.0
4         9.0
         ... 
95680    18.0
95681    39.0
95682    50.0
95683    57.0
95684    65.0
Length: 95685, dtype: float64
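Row-wise results are often stored back as new columns. The sketch below does that on a small made-up DataFrame (the data and the column names `pm_sum`, `pm_sum_strict`, and `pm_mean` are hypothetical). It also shows one thing the air quality data does not exercise because it has no missing values: by default these methods skip NaN, and `skipna=False` makes a missing value propagate instead.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing PM2.5 value.
df = pd.DataFrame({"PM2.5": [9.0, np.nan, 15.0], "PM10": [9.0, 6.0, 50.0]})
pm_cols = ["PM2.5", "PM10"]

df["pm_sum"] = df[pm_cols].sum(axis=1)                        # NaN is skipped
df["pm_sum_strict"] = df[pm_cols].sum(axis=1, skipna=False)   # NaN propagates to the result
df["pm_mean"] = df[pm_cols].mean(axis=1)                      # mean of the available values

print(df)
```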