<a href="https://colab.research.google.com/github/JunoJames-JJ/AI-ML-Learning/blob/main/Lab_1_Aggregation_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### OBJECTIVE

> Library Introductions

> Apply Data Load processes

> Apply Data Transformation methods

> Apply Data Aggregation techniques

> Apply Descriptive Statistics Calculations  

#### Part 1: Library Introduction & data transformations

Seaborn Library: A Python data visualization library that is built on the matplotlib library.   
It is one of the most popular libraries for drawing informative, attractive statistical graphs.  
It is also useful for simple statistical calculations.  

In the example below we use one of seaborn's built-in datasets and read it with the load_dataset() function, which returns a pandas DataFrame.


In [None]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

To view the first few rows of the dataset, use head().

In [None]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [None]:
# Display the full dataset (pandas truncates the middle rows in the output)
planets

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.300000,7.10,77.40,2006
1,Radial Velocity,1,874.774000,2.21,56.95,2008
2,Radial Velocity,1,763.000000,2.60,19.84,2011
3,Radial Velocity,1,326.030000,19.40,110.62,2007
4,Radial Velocity,1,516.220000,10.50,119.47,2009
...,...,...,...,...,...,...
1030,Transit,1,3.941507,,172.00,2006
1031,Transit,1,2.615864,,148.00,2007
1032,Transit,1,3.191524,,174.00,2007
1033,Transit,1,4.125083,,293.00,2008


Data Description:
> method: how the planet was detected. There is a large class imbalance here, which you will observe when we start performing group-by operations in the upcoming steps.

> orbital period: the time the planet takes to complete one orbit around its star. Note: for some planets the orbit cannot be determined (for example, when the planet is very far from its star), so you may see NaN.  

> mass: relative mass of the exoplanet.  

> distance: distance from Earth in light-years (a light-year is the distance light travels in one year).  

Now that you have some understanding of the dataset, let's take a look at what we can do with it.  


**Calculate** standard descriptive statistics, and also drop the null values while calculating.

In [None]:
# Sanity check: look for missing values and confirm the column dtypes before computing descriptive statistics
# As shown below, some columns have missing values (Non-Null Count is less than 1035)
planets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          1035 non-null   object 
 1   number          1035 non-null   int64  
 2   orbital_period  992 non-null    float64
 3   mass            513 non-null    float64
 4   distance        808 non-null    float64
 5   year            1035 non-null   int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 48.6+ KB


In [None]:
planets.dropna().describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


In [None]:
# How many rows remain if we drop every row with a missing value in any column?
planets.dropna().shape

(498, 6)
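
One detail worth keeping in mind: `describe()` already skips missing values column by column, whereas `dropna()` first discards every row that has a missing value in any column, so the two approaches can report different counts. The sketch below illustrates this on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not taken from the planets data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two different rows
toy = pd.DataFrame({
    "mass": [1.0, np.nan, 3.0, 4.0],
    "distance": [10.0, 20.0, np.nan, 40.0],
})

# describe() ignores NaN per column, so each column keeps 3 values
per_column_counts = toy.describe().loc["count"]

# dropna() removes any row containing a NaN, leaving only rows 0 and 3
complete_rows = toy.dropna()
complete_counts = complete_rows.describe().loc["count"]

print(per_column_counts.to_dict())
print(complete_rows.shape)
print(complete_counts.to_dict())
```

Which behavior is preferable depends on the question: dropping rows guarantees that every statistic is computed on the same set of planets, at the cost of discarding data.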

| Simple aggregation | Description |
| --- | --- |
| count() | Total number of items |
| first(), last() | First and last item |
| mean(), median() | Mean and median |
| min(), max() | Minimum and maximum value of a column |
| std(), var() | Standard deviation and variance |
| mad() | Mean absolute deviation (removed in pandas 2.0) |
| prod() | Product of all items |
| sum() | Sum of all items |
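
As a quick illustration of the aggregates listed above, here is a minimal sketch on a small made-up Series (the values are assumptions chosen for easy arithmetic). Because `mad()` is not available in recent pandas releases, the mean absolute deviation is written out by hand:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 6.0])  # illustrative values

summary = {
    "count": s.count(),
    "mean": s.mean(),
    "median": s.median(),
    "min": s.min(),
    "max": s.max(),
    "std": s.std(),    # sample standard deviation (ddof=1)
    "var": s.var(),    # sample variance (ddof=1)
    "prod": s.prod(),
    "sum": s.sum(),
    # mean absolute deviation, computed explicitly
    "mad": (s - s.mean()).abs().mean(),
}
print(summary)
```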

###### What we saw above is the simple transformation and aggregation. To go deeper into the data, however, simple aggregates and transformations are often not enough.  

###### The next level of data summarization is the groupby operation, which allows you to quickly and efficiently compute aggregates on subsets of data.

#### Conditional Aggregation:  
Simple aggregations can give you a basic understanding of your dataset, but often we would prefer to aggregate conditionally on some label or index.  
This is implemented in the so-called groupby operation.   
The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminating to think of it in the terms coined by Hadley Wickham of R fame: split, apply, combine.
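
To make split, apply, combine concrete, the sketch below performs the three steps by hand on a small made-up DataFrame and then compares the result with a single `groupby` call. The column names `key` and `value` are hypothetical and chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "B", "A", "B", "C"],
    "value": [1, 2, 3, 4, 5],
})

# Split: one sub-frame per distinct key
pieces = {k: df[df["key"] == k] for k in df["key"].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece["value"].sum() for k, piece in pieces.items()}

# Combine: assemble the results into one Series, sorted by key
manual = pd.Series(sums, name="value").sort_index()

# groupby performs the same three steps in one call
grouped = df.groupby("key")["value"].sum()

print(manual.to_dict())
print(grouped.to_dict())
```

The manual version is only meant to show the idea; `groupby` avoids building the intermediate pieces explicitly.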
    

#### The GroupBy object
The GroupBy object is a very flexible abstraction.   
In many ways you can simply treat it as a collection of DataFrames, and it does the difficult work under the hood.   

Let's explore this using the Planets data.

The most important operations made available by a GroupBy are aggregate, filter, transform, and apply.    
As we discussed in the lecture, a groupby can be viewed as a combination of these four operations. Let's apply it to see how it condenses the data while also providing some basic descriptive statistics.

#### Column indexing
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a modified GroupBy object. For example:

In [None]:
planets.groupby('method')['orbital_period'].median()

method,orbital_period
Astrometry,631.18
Eclipse Timing Variations,4343.5
Imaging,27500.0
Microlensing,3300.0
Orbital Brightness Modulation,0.342887
Pulsar Timing,66.5419
Pulsation Timing Variations,1170.0
Radial Velocity,360.2
Transit,5.714932
Transit Timing Variations,57.011


Here we've selected a particular Series group from the original DataFrame group by reference to its column name. As with the GroupBy object itself, no computation is done until we call an aggregate on the object.
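
The deferred evaluation described above can be observed by inspecting the intermediate objects. This is a minimal sketch on a three-row made-up DataFrame (the values are assumptions, although the column names mirror the planets data):

```python
import pandas as pd

df = pd.DataFrame({
    "method": ["Transit", "Transit", "Imaging"],
    "orbital_period": [3.0, 5.0, 1000.0],
    "year": [2008, 2010, 2006],
})

gb = df.groupby("method")      # a DataFrameGroupBy, no aggregate computed yet
col = gb["orbital_period"]     # a SeriesGroupBy, still no aggregate computed

print(type(gb).__name__)
print(type(col).__name__)

medians = col.median()         # calling the aggregate produces the result
print(medians.to_dict())
```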

#### Iteration over groups

The GroupBy object also supports direct iteration over the groups, returning each group as a Series or DataFrame.

The example below groups the planets by method and prints the dimensions of each group:

In [None]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)
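
When only the group sizes or one particular group are needed, the loop can be replaced by `size()` and `get_group()`. The sketch below uses a small made-up DataFrame (the rows are illustrative, not taken from the planets data):

```python
import pandas as pd

df = pd.DataFrame({
    "method": ["Transit", "Imaging", "Transit", "Imaging", "Transit"],
    "year": [2008, 2006, 2010, 2009, 2012],
})

gb = df.groupby("method")

# Row count per group, the same numbers as group.shape[0] in a loop
sizes = gb.size()

# Retrieve a single group as an ordinary DataFrame
transit = gb.get_group("Transit")

print(sizes.to_dict())
print(transit["year"].tolist())
```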


#### Dispatch methods

Through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through and called on the groups, whether they are DataFrame or Series objects. For example, you can use the describe() method of DataFrames to perform a set of aggregations that describe each group in the data; unstack() then reshapes the result into a long Series indexed by statistic and method.

In [None]:
planets.groupby('method')['year'].describe().unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Unnamed: 0_level_1,method,Unnamed: 2_level_1
count,Astrometry,2.0
count,Eclipse Timing Variations,9.0
count,Imaging,38.0
count,Microlensing,23.0
count,Orbital Brightness Modulation,3.0
...,...,...
max,Pulsar Timing,2011.0
max,Pulsation Timing Variations,2007.0
max,Radial Velocity,2014.0
max,Transit,2014.0
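
To see what `unstack()` does to the `describe()` result, it can help to look at both forms side by side. The following sketch uses a four-row made-up DataFrame (the values are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "method": ["Transit", "Transit", "Imaging", "Imaging"],
    "year": [2008, 2012, 2004, 2010],
})

# Wide form: one row per method, one column per statistic
wide = df.groupby("method")["year"].describe()

# unstack() on a DataFrame with a plain index moves the columns into the
# index, giving a Series keyed by (statistic, method)
long = wide.unstack()

print(wide.shape)
print(long.loc[("mean", "Transit")])
print(long.loc[("max", "Imaging")])
```

The wide form is usually easier to read; the long form is convenient when a single value is looked up by statistic and method.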


#### Aggregate, filter, transform, apply

Grouping example:
From the same dataset, and with a few extra lines of code, we can calculate   
> the count of discovered planets by method and by decade, combining the grouping and aggregation techniques we learned above

In [None]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

method,1980s,1990s,2000s,2010s
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0
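
For reference, the four operations named in this section's heading can each be tried in isolation. This is a minimal sketch on a small made-up DataFrame; the column names `key` and `value` and the threshold of 4 are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "value": [0, 1, 2, 3, 4, 5],
})
gb = df.groupby("key")

# aggregate: several reductions at once, one row per group
agg = gb["value"].aggregate(["min", "max", "sum"])

# filter: keep only the rows of groups whose total exceeds 4
kept = gb.filter(lambda g: g["value"].sum() > 4)

# transform: same length as the input, here centering each group on its mean
centered = gb["value"].transform(lambda x: x - x.mean())

# apply: arbitrary function per group, here the range (max minus min)
spread = gb["value"].apply(lambda x: x.max() - x.min())

print(agg)
print(kept["key"].unique().tolist())
print(centered.tolist())
print(spread.to_dict())
```

Note the difference in output shape: aggregate and apply return one entry per group here, filter returns a subset of the original rows, and transform returns one value per original row.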


#### Reflection Task: (5 points)   

Research more conditional aggregation methods that can be used for basic transformations and list <b>2 methods</b>
> and apply the <b>2 methods</b> you listed to the planets dataset (the one used in this sheet)

1.

In [None]:
# first(): year in the first row of each method group (not necessarily the earliest year)
planets.groupby('method')['year'].first()

method,year
Astrometry,2013
Eclipse Timing Variations,2009
Imaging,2005
Microlensing,2008
Orbital Brightness Modulation,2011
Pulsar Timing,1992
Pulsation Timing Variations,2007
Radial Velocity,2006
Transit,2008
Transit Timing Variations,2011


2.

In [None]:
# min(): earliest year a planet was discovered with each method
planets.groupby('method')['year'].min()

method,year
Astrometry,2010
Eclipse Timing Variations,2008
Imaging,2004
Microlensing,2004
Orbital Brightness Modulation,2011
Pulsar Timing,1992
Pulsation Timing Variations,2007
Radial Velocity,1989
Transit,2002
Transit Timing Variations,2011
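
The outputs of `first()` and `min()` differ for several methods because `first()` depends on row order while `min()` does not. The sketch below reproduces the effect on a few made-up rows that are deliberately not sorted by year:

```python
import pandas as pd

df = pd.DataFrame({
    "method": ["Radial Velocity", "Radial Velocity", "Transit", "Transit"],
    "year": [2006, 1989, 2008, 2002],   # illustrative, unsorted years
})

gb = df.groupby("method")["year"]

first_seen = gb.first()   # value from the first row of each group
earliest = gb.min()       # smallest value in each group

# Sorting by year beforehand makes first() agree with min()
first_sorted = df.sort_values("year").groupby("method")["year"].first()

print(first_seen.to_dict())
print(earliest.to_dict())
print(first_sorted.to_dict())
```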


3.

In [None]:
# std(): standard deviation of the discovery year per method (NaN when a group has a single row)
planets.groupby('method')['year'].std()


method,year
Astrometry,2.12132
Eclipse Timing Variations,1.414214
Imaging,2.781901
Microlensing,2.859697
Orbital Brightness Modulation,1.154701
Pulsar Timing,8.38451
Pulsation Timing Variations,
Radial Velocity,4.249052
Transit,2.077867
Transit Timing Variations,1.290994
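
The empty value for Pulsation Timing Variations comes from the default sample standard deviation, which divides by n - 1 and is therefore undefined for a group with a single row. A minimal sketch, using made-up rows whose years are chosen to be consistent with the outputs above:

```python
import pandas as pd

df = pd.DataFrame({
    "method": ["Astrometry", "Astrometry", "Pulsation Timing Variations"],
    "year": [2013, 2010, 2007],
})

gb = df.groupby("method")["year"]

sample_std = gb.std()            # default ddof=1 divides by n - 1
population_std = gb.std(ddof=0)  # ddof=0 divides by n

print(sample_std.to_dict())
print(population_std.to_dict())
```

With `ddof=0` the single-row group gets a standard deviation of zero instead of NaN; which convention is appropriate depends on whether the rows are treated as a sample or as the whole population.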


4.

In [None]:
# Sort the methods by the earliest year a planet was discovered
planets.groupby('method')['year'].min().sort_values()

method,year
Radial Velocity,1989
Pulsar Timing,1992
Transit,2002
Microlensing,2004
Imaging,2004
Pulsation Timing Variations,2007
Eclipse Timing Variations,2008
Astrometry,2010
Orbital Brightness Modulation,2011
Transit Timing Variations,2011
