# Groupby Tutorial Notebook

## What Is This Notebook About?

GroupBy and categorical metadata are the latest additions to the Antigranular environment. In this notebook, we'll show you a specialised implementation of the groupby operation, tailored for use within Antigranular.

The focus here is on a restricted version of groupby, a critical tool for data manipulation and analysis, particularly on large datasets. The restricted version is designed to offer more controlled and secure data handling, in line with the privacy requirements of Antigranular.

The groupby operation, as we will explore, aggregates data based on specified criteria, which makes complex analysis tasks easy and efficient. However, in Antigranular environments, where data sensitivity and privacy are paramount, a standard groupby is too permissive because of its broad functionality. Our restricted version therefore offers a more focused approach. Let's see what it's all about.

### Getting Started: Setting Up the Environment

To get started, install the Antigranular package with pip and log in.

In [None]:
!pip install antigranular



In [None]:
import antigranular as ag
session = ag.login("<client_id>", "<client_secret>", competition = "Sandbox Competition")

Output: Dataset "Medical Treatment" loaded to the kernel as medical_treatment
Key Name                       Value Type     
---------------------------------------------
train_x                        PrivateDataFrame
train_y                        PrivateDataFrame
test_x                         DataFrame      

Connected to Antigranular server session id: a3322684-8ed5-4835-9866-df420376393b, the session will time out if idle for 25 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
Everything's set up and ready to roll!


The output of the above cell summarises the data that was loaded. In the notebook, the name under which the dataset is available (`medical_treatment` here) is highlighted in green.

As the dataset is loaded as a dictionary, the table lists its key-value pairs: each key name along with the type of the value stored under it.

### Creating a Dataset

For this tutorial, we will use a custom-made dataset that contains salary information for 10,000 randomly generated individuals.

The dataset contains the following columns:

* name: A randomly generated string of 10 lowercase letters.
* gender: Gender of the individual. Either 'M' for male or 'F' for female.
* education: Education level of the individual. One of 4 categories: '10th', '12th', 'graduate' or 'post-graduate'.
* age: Age of the individual. Integer between 16 and 60.
* salary: Salary of the individual. Integer between 20000 and 700000.

In [None]:
import numpy as np
import pandas as pd

# Create a randomised DataFrame for analysis with the following columns:
#   name: random string of 10 lowercase letters
#   gender: {'M', 'F'}
#   education: {'10th', '12th', 'graduate', 'post-graduate'}
#   age: integer in [16, 60]
#   salary: integer in [20000, 700000]
n_rows = 10000

# Each name is built from 10 random lowercase ASCII letters (code points 97 to 122).
name = [
    "".join(chr(code) for code in np.random.randint(97, 97 + 26, 10))
    for _ in range(n_rows)
]

# Draw each categorical value uniformly at random.
gender = np.random.choice(["M", "F"], n_rows)
education = np.random.choice(["10th", "12th", "graduate", "post-graduate"], n_rows)

# randint excludes its upper bound, so add 1 to include the maximum value.
df = pd.DataFrame({'name': name, 'gender': gender, 'education': education,
                   'age': np.random.randint(16, 61, n_rows),
                   'salary': np.random.randint(20000, 700001, n_rows)})

In [None]:
df.head()

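|   | name | gender | education | age | salary |
|---|------|--------|-----------|-----|--------|
| 0 | xgdxxiikyf | M | 12th | 18 | 171152 |
| 1 | ruokvgduvc | F | graduate | 31 | 71884 |
| 2 | qciklhrsri | F | post-graduate | 42 | 262369 |
| 3 | wtgjefupbk | F | 10th | 57 | 643175 |
| 4 | nzunqlhcji | F | 12th | 28 | 448692 |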


### Importing the Data

Now, we use `session.private_import` to import this dataset into the Antigranular environment.

In [None]:
session.private_import(data = df, name = "df")

dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernel



To create a PrivateDataFrame, we observe the following rules:
* Metadata is given only for numerical columns. Statistics are only calculated on the columns for which metadata is given. Non-numerical columns have no metadata, because statistics cannot be computed on them.
* Categorical metadata can be given for any column, but all categories of a column must share one datatype. This keeps each column uniform, as a column must contain elements of a single datatype.
* The set of columns in `metadata` must not intersect the set of columns in `categorical_metadata`.
* String columns such as name or address should not be listed in `metadata` or `categorical_metadata`.

In [None]:
%%ag
import op_pandas
pdf = op_pandas.PrivateDataFrame(df, metadata = {"age": (16, 60), "salary": (20000, 700000)}, categorical_metadata = {"gender": ['M', 'F'], "education": ["10th", "12th", "graduate", "post-graduate"]})
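The rules listed above can be checked on the two dictionaries before they are passed to the constructor. The following is a minimal sketch in plain Python; `check_metadata` is a hypothetical helper written for illustration and is not part of `op_pandas`, and the requirement that the lower bound does not exceed the upper bound is an assumption rather than something stated in this notebook.

```python
def check_metadata(metadata, categorical_metadata):
    """Hypothetical helper (not part of op_pandas): check the rules on plain dicts."""
    # A column may appear in metadata or in categorical_metadata, never in both.
    overlap = set(metadata) & set(categorical_metadata)
    if overlap:
        raise ValueError(f"columns in both mappings: {sorted(overlap)}")

    # Assumption: numerical metadata is a (lower, upper) pair with lower <= upper.
    for column, (lower, upper) in metadata.items():
        if lower > upper:
            raise ValueError(f"{column}: lower bound {lower} exceeds upper bound {upper}")

    # All categories of one column must share a single datatype.
    for column, categories in categorical_metadata.items():
        if len({type(category) for category in categories}) > 1:
            raise ValueError(f"{column}: categories mix several datatypes")
    return True


metadata = {"age": (16, 60), "salary": (20000, 700000)}
categorical_metadata = {
    "gender": ["M", "F"],
    "education": ["10th", "12th", "graduate", "post-graduate"],
}
print(check_metadata(metadata, categorical_metadata))
```

Listing `age` in both dictionaries, for example, would raise a `ValueError` in this sketch.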

### Getting to Know the Dataset

Some important information about the dataset is printed below. It is important to know what the dataset looks like before starting any kind of analysis.

In [None]:
%%ag
ag_print("columns: ", pdf.columns)
ag_print("metadata: ", pdf.metadata)
ag_print("categorical_metadata: ",pdf.categorical_metadata)
ag_print("dtypes:\n ", pdf.dtypes)

columns:  ['name', 'gender', 'education', 'age', 'salary']
metadata:  {'age': (16, 60), 'salary': (20000, 700000)}
categorical_metadata:  {'gender': ['M', 'F'], 'education': ['10th', '12th', 'graduate', 'post-graduate']}
dtypes:
  name         object
gender       object
education    object
age           int64
salary        int64
dtype: object



Here, we can see that `age` and `salary` are numerical columns while `gender` and `education` are categorical columns. `name` is not present in any metadata, as it is neither categorical nor numerical.

Now, we use the `PrivateDataFrame.describe()` function to get basic summary statistics for the numerical columns.

In [None]:
%%ag
ag_print(pdf.describe(eps = 0.1))

               age         salary
count  9786.000000    9786.000000
mean     37.124402  402257.614219
std      11.472617  152898.866737
min      16.000000   44878.336264
25%      29.504922  168327.575987
50%      33.312865  369077.605656
75%      42.519508  500675.973945
max      57.875210  677394.650052
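The statistics above are noisy estimates rather than exact values: the reported count is 9786 although the DataFrame holds 10000 rows. The sketch below illustrates, in plain Python, one common way such noise can arise, namely clamping values to the metadata bounds and adding Laplace noise scaled by `eps`. It is an illustration under stated assumptions only; this notebook does not say which mechanism Antigranular uses, and `laplace_noise` and `noisy_mean` are hypothetical names.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with rate 1/scale
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_mean(values, lower, upper, eps, rng):
    """Illustrative differentially private mean; not Antigranular's implementation."""
    # Clamp every value into the declared metadata bounds.
    clamped = [min(max(value, lower), upper) for value in values]
    true_mean = sum(clamped) / len(clamped)
    # Assumption: the row count is public, so replacing one row changes the
    # mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clamped)
    return true_mean + laplace_noise(sensitivity / eps, rng)


rng = random.Random(0)
ages = [rng.randint(16, 60) for _ in range(10000)]
true_mean = sum(ages) / len(ages)
private_mean = noisy_mean(ages, lower=16, upper=60, eps=0.1, rng=rng)
print(true_mean, private_mean)
```

In this sketch, a smaller `eps` gives a larger noise scale and therefore a less accurate but more private estimate.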



### GroupBy Usage

[Groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) is a common pandas function used to split large amounts of data into groups and compute operations on these groups.


Due to privacy limitations, the [Groupby](https://docs.antigranular.com/private-python/api/op_pandas/private_dataframe#groupby) function in Antigranular is a restricted version of the original pandas groupby, but it is used in the same way.

The primary distinction between the two lies in the `by` argument, which defines the groups in the groupby operation. It accepts several types of input:

* **Column**: Groups by a specific column. The column must be categorical and must be listed in the categorical_metadata.
* **Boolean Series / PrivateSeries**: A series composed exclusively of boolean values. If the series contains non-boolean values, it is automatically converted to boolean before the groupby operation, without raising any exception or warning.
* **List**: A combination of the above two options, allowing for more flexible grouping criteria.
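For intuition, the three kinds of grouper can be tried out in plain, non-private pandas first. The sketch below uses a tiny made-up DataFrame (`toy` and its values are illustrative only) and the standard `DataFrame.groupby`, not the Antigranular version, so no `eps` argument is involved.

```python
import pandas as pd

# Tiny non-private stand-in for the tutorial dataset (values are made up).
toy = pd.DataFrame({
    "education": ["10th", "12th", "10th", "12th"],
    "age": [25, 35, 45, 28],
    "salary": [100, 200, 300, 500],
})

# Column grouper: one group per category.
by_column = toy.groupby("education")["salary"].mean()

# Boolean Series grouper: one group for False and one for True.
by_boolean = toy.groupby(toy["age"] > 30)["salary"].mean()

# List grouper: a column combined with a boolean Series gives a MultiIndex result.
by_list = toy.groupby(["education", toy["age"] > 30])["salary"].mean()

print(by_column, by_boolean, by_list, sep="\n\n")
```

Plain pandas labels the boolean groups `False` and `True`, whereas the Antigranular outputs later in this notebook label them 0 and 1.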

For the first example, let's say we want to group the data by education level and find the mean salary of each group. This is how we can do that:

In [None]:
%%ag
grouped_pdf = pdf.groupby("education")

ag_print(grouped_pdf["salary"].mean(eps = 1))

                           0
10th           353870.272720
12th           356065.234855
graduate       355715.180512
post-graduate  351349.504497



Here, `eps = 1` is the total privacy budget spent on the query. As there are 4 groups, epsilon is divided equally among them, so `eps = 0.25` is used in each group to compute the result.
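The equal split described above is simple arithmetic, sketched here in plain Python. `per_group_epsilon` is a hypothetical helper for illustration, not an Antigranular function.

```python
def per_group_epsilon(total_eps, n_groups):
    # Equal split of the query budget across the groups.
    return total_eps / n_groups


education_levels = ["10th", "12th", "graduate", "post-graduate"]
print(per_group_epsilon(1.0, len(education_levels)))  # 0.25
```

Under this equal split, a grouper that produces more groups leaves less budget for each group.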

However, grouping cannot be done on the `age` column, as it is not a categorical column.

In [None]:
%%ag
grouped_pdf = pdf.groupby("age")

ValueError: age is not a categorical column.Group by only categorical columns are allowed.


However, we can split the `age` column into two categories by passing a boolean PrivateSeries as the `by` argument. If we want to compare the mean salary of people with `age > 30` against those with `age <= 30`, we can do that as follows:

In [None]:
%%ag

grouped_pdf = pdf.groupby(pdf['age'] > 30)

ag_print(grouped_pdf["salary"].mean(eps=1))

               0
0  356326.694380
1  355419.558985



Here, index 0 covers all rows where `pdf['age'] <= 30` (the boolean PrivateSeries is `False`), and index 1 covers all rows where `pdf['age'] > 30` (the boolean PrivateSeries is `True`).

After grouping, the following functions can be applied to `grouped_pdf`:

* [sum](https://docs.antigranular.com/private-python/api/op_pandas/private_dataframe#sum)
* [mean](https://docs.antigranular.com/private-python/api/op_pandas/private_dataframe#mean)
* [std](https://docs.antigranular.com/private-python/api/op_pandas/private_dataframe#standard-deviation)
* [var](https://docs.antigranular.com/private-python/api/op_pandas/private_dataframe#variance)
* [count](https://docs.antigranular.com/private-python/api/op_pandas/private_dataframe#count)
* [quantile](https://docs.antigranular.com/private-python/api/op_pandas/private_dataframe#quantile)
* [median](https://docs.antigranular.com/private-python/api/op_pandas/private_dataframe#median)
* [percentile](https://docs.antigranular.com/private-python/api/op_pandas/private_dataframe#percentile)

Each function is called in the same way as its PrivateDataFrame counterpart.

For example, the 25th percentile of `salary` grouped by `education` can be found as follows:

In [None]:
%%ag

ag_print(pdf.groupby("education")["salary"].quantile(eps=1, q=0.25))

                           0
10th           188497.706648
12th           193790.777589
graduate       193125.740756
post-graduate  184543.475519



You can also pass a `list` as the grouper to group over multiple columns and conditions. Each of these conditions appears as one level of the multi-index in the output dataframe.

In [None]:
%%ag

ag_print(pdf.groupby(["education", "gender"]).mean(eps=1))

                       age         salary
10th          M  37.990412  365835.485731
              F  38.069294  350376.102060
12th          M  39.139811  362511.882352
              F  37.098640  353427.048131
graduate      M  37.778909  365651.118838
              F  37.897245  359373.224765
post-graduate M  37.944639  358544.388604
              F  39.009578  347755.868706



In [None]:
%%ag

ag_print(pdf.groupby(["education", pdf['age']>30]).mean(eps=1))

                       age         salary
10th          0  22.160349  358415.957601
              1  45.337913  360898.064176
12th          0  23.224790  361659.177613
              1  45.656241  357461.129681
graduate      0  23.726825  358778.026076
              1  45.430969  364589.124464
post-graduate 0  22.939682  366483.436018
              1  45.819218  353105.066373



This query is structured to accomplish the following:

- ***Grouping Data***: The dataset is divided into groups based on two key attributes:

  - `education`: This categorises individuals according to their educational qualifications.
  - `age > 30`: It further segments these groups into two age-based categories: those above 30 years of age and those 30 or younger.

- ***Calculating Averages***: For each of these groups, we compute the differentially private average values of the numerical columns, `age` and `salary`.

This methodology allows us to extract meaningful insights about average characteristics across different educational backgrounds and age brackets, while ensuring the privacy and integrity of the data.

### Conclusion

In this notebook, we've explored a specialised groupby operation tailored for Antigranular environments, emphasising data privacy and precision. Through examples and queries, we demonstrated how to analyse sensitive data effectively, uncovering insights while maintaining data integrity.

This guide should serve as a foundation for your future data analysis projects, particularly in scenarios requiring careful data management. Remember, the techniques discussed here are crucial for responsible and secure data handling. Happy analysing!

Now that we're all done, we use this line to close our work session neatly. It's like turning off the lights when you leave a room: it's a good habit to wrap things up properly!

In [None]:
session.terminate_session()

{'status': 'ok'}