# Group By (Split-Apply-Combine) Practice

This notebook demonstrates the split-apply-combine strategy using pandas' `groupby()` and `describe()` methods on the dataset in `groupby_practice.csv`.

## What is Split-Apply-Combine?
- **Split:** Split the data into groups based on some criteria (e.g., a column value).
- **Apply:** Apply a function to each group independently (e.g., aggregation, transformation).
- **Combine:** Combine the results into a data structure.

This pattern covers most per-group summaries, for example total revenue per region or average units sold per product category.
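
The three steps can be spelled out by hand on a small table. The sketch below uses hypothetical data (the `toy` frame and its values are made up for illustration and are not taken from `groupby_practice.csv`) and builds the same per-region totals twice: once with an explicit split, apply, and combine, and once with `groupby()`.

```python
import pandas as pd

# Hypothetical toy data, not from groupby_practice.csv
toy = pd.DataFrame({
    "Region": ["North", "West", "North", "West", "South"],
    "Revenue": [100.0, 50.0, 25.0, 75.0, 10.0],
})

# Split: one sub-DataFrame per distinct Region value
pieces = {
    region: toy[toy["Region"] == region]
    for region in toy["Region"].unique()
}

# Apply: reduce each piece independently
totals = {region: piece["Revenue"].sum() for region, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(totals, name="Revenue").sort_index()
manual.index.name = "Region"

# The same three steps in a single groupby call
via_groupby = toy.groupby("Region")["Revenue"].sum()

print(manual)
print(via_groupby)
```

Both Series hold the same per-region totals; `groupby()` simply performs the bookkeeping of the manual loop internally.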

In [58]:
import pandas as pd
import os

# Load the dataset using the absolute Windows path
df = pd.read_csv(r'C:\Jupyter Notebook\Group By (Split Apply Combine)\groupby_practice.csv')

# Use a constant filename for the head output
outfile = 'groupby_head.csv'
if not os.path.exists(outfile):
    df.head().to_csv(outfile, index=False)
    print(f"First few rows saved to {outfile}")
else:
    print(f"File {outfile} already exists. Not overwriting.")
df.head()

File groupby_head.csv already exists. Not overwriting.


| | Region | Store | Date | ProductCategory | UnitsSold | UnitPrice | Revenue |
|---|---|---|---|---|---|---|---|
| 0 | North | Store_12 | 2020-06-28 | Home & Kitchen | 6 | 341.7 | 2050.2 |
| 1 | West | Store_10 | 2020-07-05 | Sports | 6 | 262.05 | 1572.3 |
| 2 | South | Store_4 | 2020-10-21 | Toys | 10 | 359.36 | 3593.6 |
| 3 | North | Store_15 | 2020-06-12 | Sports | 19 | 352.82 | 6703.58 |
| 4 | West | Store_1 | 2021-01-29 | Clothing | 19 | 493.96 | 9385.24 |


## The groupby() Method in Pandas

The `groupby()` method splits a DataFrame into groups based on the values in one or more columns. It returns a `DataFrameGroupBy` object, an intermediate representation that records which rows belong to which group but computes no statistics until an aggregation or transformation is applied.
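
A minimal sketch of that deferred behavior, using hypothetical toy data (the `toy` frame and the names `grouped_toy`, `east_rows`, and `mean_units` are made up for illustration): creating the groupby object yields only a grouping description, and per-group numbers appear once an aggregation such as `mean()` is called.

```python
import pandas as pd

# Hypothetical toy data, not from groupby_practice.csv
toy = pd.DataFrame({
    "Region": ["East", "West", "East", "West"],
    "UnitsSold": [4, 10, 6, 2],
})

grouped_toy = toy.groupby("Region")

# No statistics yet: the object only describes how rows map to groups
print(type(grouped_toy).__name__)   # DataFrameGroupBy
print(grouped_toy.ngroups)          # 2

# Pull out the rows of a single group for inspection
east_rows = grouped_toy.get_group("East")
print(east_rows)

# An aggregation triggers the actual per-group computation
mean_units = grouped_toy["UnitsSold"].mean()
print(mean_units)
```

Inspecting a group with `get_group()` before aggregating is a cheap way to confirm that the grouping key splits the rows as intended.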

In [59]:
# Group by the 'Region' column (you can also try 'ProductCategory' or others)
grouped = df.groupby('Region')

# Show the available groups
print("Groups:", grouped.groups.keys())

# Compute descriptive statistics for each group once, then save them
# to a fixed filename only if that file does not already exist
outfile = 'groupby_describe.csv'
result = grouped.describe()
if not os.path.exists(outfile):
    result.to_csv(outfile)
    print(f"Descriptive statistics saved to {outfile}")
else:
    print(f"File {outfile} already exists. Not overwriting.")
result

Groups: dict_keys(['East', 'North', 'South', 'West'])
File groupby_describe.csv already exists. Not overwriting.


| Region | UnitsSold count | UnitsSold mean | UnitsSold std | UnitsSold min | UnitsSold 25% | UnitsSold 50% | UnitsSold 75% | UnitsSold max | UnitPrice count | UnitPrice mean | ... | UnitPrice 75% | UnitPrice max | Revenue count | Revenue mean | Revenue std | Revenue min | Revenue 25% | Revenue 50% | Revenue 75% | Revenue max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| East | 2408.0 | 9.928156 | 5.47573 | 1.0 | 5.0 | 10.0 | 15.0 | 19.0 | 2408.0 | 253.584344 | ... | 377.3825 | 499.76 | 2408.0 | 2516.980395 | 2120.458596 | 9.19 | 736.585 | 1887.6 | 3868.9475 | 9491.83 |
| North | 2507.0 | 10.050259 | 5.55814 | 1.0 | 5.0 | 10.0 | 15.0 | 19.0 | 2507.0 | 253.054615 | ... | 375.95 | 499.74 | 2507.0 | 2576.847623 | 2189.760682 | 6.62 | 738.28 | 1973.07 | 3980.27 | 9477.58 |
| South | 2562.0 | 9.839188 | 5.457147 | 1.0 | 5.0 | 10.0 | 15.0 | 19.0 | 2562.0 | 252.916862 | ... | 375.735 | 499.04 | 2562.0 | 2494.255921 | 2122.599579 | 11.61 | 771.0075 | 1875.3 | 3843.57 | 9421.53 |
| West | 2523.0 | 10.100674 | 5.499421 | 1.0 | 5.0 | 10.0 | 15.0 | 19.0 | 2523.0 | 250.325192 | ... | 375.43 | 499.98 | 2523.0 | 2532.694023 | 2161.272705 | 7.35 | 739.31 | 1928.32 | 3863.36 | 9484.99 |


## Using describe() with groupby (Split-Apply-Combine)

Calling `describe()` on a groupby object returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column within each group. This is split-apply-combine in a single call: the data is split into groups, `describe()` is applied to each group, and the results are combined into a single DataFrame.
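
The combined result has two-level column labels, pairing each original column with a statistic. The sketch below, on hypothetical toy data (the `toy` frame and the variable names are illustrative only), shows how to pull one column's statistics or a single statistic out of that result, and how `agg()` can compute a chosen subset when the full `describe()` output is more than is needed.

```python
import pandas as pd

# Hypothetical toy data, not from groupby_practice.csv
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "UnitsSold": [2, 4, 1, 9],
    "Revenue": [20.0, 40.0, 10.0, 90.0],
})

summary = toy.groupby("Region").describe()

# Columns are (original column, statistic) pairs
print(summary.columns.nlevels)   # 2

# All statistics for one column
revenue_stats = summary["Revenue"]
print(revenue_stats)

# A single statistic for one column, selected with a tuple
revenue_mean = summary[("Revenue", "mean")]
print(revenue_mean)

# Only a few statistics needed: agg() computes just those
subset = toy.groupby("Region")["Revenue"].agg(["count", "mean", "max"])
print(subset)
```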

In [60]:
# Compute descriptive statistics for each group once, then save them
# to a fixed filename only if that file does not already exist
outfile = 'groupby_describe.csv'
result = grouped.describe()
if not os.path.exists(outfile):
    result.to_csv(outfile)
    print(f"Descriptive statistics saved to {outfile}")
else:
    print(f"File {outfile} already exists. Not overwriting.")
result

File groupby_describe.csv already exists. Not overwriting.


| Region | UnitsSold count | UnitsSold mean | UnitsSold std | UnitsSold min | UnitsSold 25% | UnitsSold 50% | UnitsSold 75% | UnitsSold max | UnitPrice count | UnitPrice mean | ... | UnitPrice 75% | UnitPrice max | Revenue count | Revenue mean | Revenue std | Revenue min | Revenue 25% | Revenue 50% | Revenue 75% | Revenue max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| East | 2408.0 | 9.928156 | 5.47573 | 1.0 | 5.0 | 10.0 | 15.0 | 19.0 | 2408.0 | 253.584344 | ... | 377.3825 | 499.76 | 2408.0 | 2516.980395 | 2120.458596 | 9.19 | 736.585 | 1887.6 | 3868.9475 | 9491.83 |
| North | 2507.0 | 10.050259 | 5.55814 | 1.0 | 5.0 | 10.0 | 15.0 | 19.0 | 2507.0 | 253.054615 | ... | 375.95 | 499.74 | 2507.0 | 2576.847623 | 2189.760682 | 6.62 | 738.28 | 1973.07 | 3980.27 | 9477.58 |
| South | 2562.0 | 9.839188 | 5.457147 | 1.0 | 5.0 | 10.0 | 15.0 | 19.0 | 2562.0 | 252.916862 | ... | 375.735 | 499.04 | 2562.0 | 2494.255921 | 2122.599579 | 11.61 | 771.0075 | 1875.3 | 3843.57 | 9421.53 |
| West | 2523.0 | 10.100674 | 5.499421 | 1.0 | 5.0 | 10.0 | 15.0 | 19.0 | 2523.0 | 250.325192 | ... | 375.43 | 499.98 | 2523.0 | 2532.694023 | 2161.272705 | 7.35 | 739.31 | 1928.32 | 3863.36 | 9484.99 |


In [61]:
# Write each region's rows to its own CSV file under a fixed filename,
# skipping files that already exist. A preview of the rows is displayed
# only for files created in this run.
from IPython.display import display

# Group by 'Region'
groups = df.groupby('Region')

created_files = []

# Create a directory for group files if it doesn't exist
output_dir = 'groupby_region_outputs'
os.makedirs(output_dir, exist_ok=True)

# Save each group to a separate file and display a preview
for group_name, group_df in groups:
    filename = f'{output_dir}/region_{group_name}.csv'
    if not os.path.exists(filename):
        group_df.to_csv(filename, index=False)
        created_files.append(filename)
        print(f"Created file: {filename}")
        print(f"Preview of {filename}:")
        display(group_df.head())
    else:
        print(f"File {filename} already exists. Not overwriting.")

File groupby_region_outputs/region_East.csv already exists. Not overwriting.
File groupby_region_outputs/region_North.csv already exists. Not overwriting.
File groupby_region_outputs/region_South.csv already exists. Not overwriting.
File groupby_region_outputs/region_West.csv already exists. Not overwriting.
