# When Do We Optimize Column Stores?

Three scenarios may class a column store as requiring maintenance:

- The average number of rows per segment is below the optimum (1024\*1024 = 1,048,576). We call this density and it can be expressed as a percentage. For example, 10% **fragmentation** means each segment is, on average, only 90% full.
- The table contains a lot of soft-**deleted** rows.
- The table contains a lot of **inserted** rows that have not yet been compressed by the [tuple mover](https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index). It can take over 60 million rows before the tuple mover is engaged, because each of the 60 distributions has to fill one segment first (60 × 1,048,576 rows).
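
To make the density figure concrete, the sketch below turns an average row count per compressed segment into a fragmentation percentage. It assumes fragmentation is defined as 100 × (1 − average rows / 1,048,576); that is an assumption about how such a number can be derived, not the confirmed definition used by any particular view, and the sample averages are illustrative values only.

```sql
/* Sketch: density fragmentation from the average rows per compressed segment.
   Assumption: fragmentation % = 100 * (1 - avg_rows / 1,048,576).
   The three sample values are illustrative, not measured. */
SELECT  s.avg_rows_per_segment
,       CAST(100.0 * (1 - s.avg_rows_per_segment / 1048576.0) AS decimal(5,2)) AS fragmentation_density_pct
FROM (
        SELECT 1048576 AS avg_rows_per_segment   /* completely full segments */
        UNION ALL
        SELECT 943718                            /* roughly 90% full */
        UNION ALL
        SELECT 333333                            /* roughly a third full */
     ) AS s;
```

The derived table uses `UNION ALL` rather than a `VALUES` constructor so the query stays portable across SQL Server flavours.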


### Demo - Fragmentation and Density

In [2]:
/* Demo - Examining column store density on the initial insert of 130 million rows */
IF OBJECT_ID('[dbo].[FactFinance100m]') is not null 
	DROP TABLE [dbo].[FactFinance100m]
GO
CREATE TABLE [dbo].[FactFinance100m] WITH (
	DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX 
) AS
SELECT TOP 130000000 * FROM [dbo].[FactFinance1b]



In [3]:

/* This view shows that fragmentation_density is perfect, with some open row groups for the overspill. Not unusual, as this table holds only 2 complete segments per distribution. */ 
SELECT * FROM dbo.vColumnstoreStats WHERE table_name='FactFinance100m'



execution_date,database_name,schema_name,table_name,partition_number,partition_scheme,object_id,index_name,row_count,deleted_row_count,row_group_count,compressed_row_count,compressed_rowgroup_count,open_rowgroup_count,open_row_count,compressed_row_max,compressed_row_avg,fragmentation_density,fragmentation_deletes,fragmentation_open
2022-07-08 11:45:52.233,AdventureWorksDW,dbo,FactFinance100m,,,114867526,ClusteredIndex_6177510f7ca142d4a29cb235500b018b,130000000,0,180,125829120,120,60,4170880,1048576,1048576,0.0,0.0,6.63


In [5]:
/* Using the DMV sys.[dm_pdw_nodes_db_column_store_row_group_physical_stats] we can track the size of each segment and WHY it was closed.
    The view "[dbo].[vCS_rg_physical_stats]" from https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-memory-optimizations-for-columnstore-compression 
    helps a lot in understanding why segments aren't full.

    BULK INSERT, over-partitioning or an over-enthusiastic REORG are common causes.

    In this case we see only OPEN and NO_TRIM.

*/
select * From [dbo].[vCS_rg_physical_stats] WHERE logical_table_name='FactFinance100m'

logical_table_name,row_group_id,partition_number,state,state_desc,total_rows,trim_reason_desc,physical_name,created_time
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_1,2022-07-08 11:45:51.030
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_2,2022-07-08 11:45:46.053
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_3,2022-07-08 11:45:50.983
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_4,2022-07-08 11:45:50.780
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_5,2022-07-08 11:45:46.017
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_6,2022-07-08 11:45:46.083
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_7,2022-07-08 11:45:51.023
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_8,2022-07-08 11:45:46.047
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_9,2022-07-08 11:45:46.057
FactFinance100m,2,1,1,OPEN,69736,,Table_e60660723e60487b97aa571c0cf2e7d1_10,2022-07-08 11:45:45.697
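
With one row per row group per distribution, the detail output gets long quickly. A minimal sketch of a summary query is shown below; it relies only on the columns visible in the output above (`state_desc`, `trim_reason_desc`, `total_rows`, `logical_table_name`), and the aggregate column aliases are names chosen here for illustration.

```sql
/* Sketch: one summary row per (state, trim reason) instead of one row per row group.
   Column aliases (row_group_count, sum_rows, min_rows, max_rows) are illustrative. */
SELECT  state_desc
,       trim_reason_desc
,       COUNT(*)                        AS row_group_count
,       SUM(CAST(total_rows AS bigint)) AS sum_rows
,       MIN(total_rows)                 AS min_rows
,       MAX(total_rows)                 AS max_rows
FROM    [dbo].[vCS_rg_physical_stats]
WHERE   logical_table_name = 'FactFinance100m'
GROUP BY state_desc, trim_reason_desc
ORDER BY state_desc, trim_reason_desc;
```

A large `row_group_count` combined with a low `max_rows` for a given trim reason points at the cause of under-filled segments.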


In [6]:
/*
    Let's create the same table, but with just 20 million rows (not a complete segment per distribution), then insert in batches to make 120 million.

    Here we can see the row groups closed prematurely by BULKLOAD. If a bulk operation is larger than about 200k records, a close may happen on the initial insert.

    This results in fragmentation and very small row groups (about 333k rows each).
*/

IF OBJECT_ID('[dbo].[FactFinance100m]') is not null 
	DROP TABLE [dbo].[FactFinance100m]
GO
CREATE TABLE [dbo].[FactFinance100m] WITH (
	DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX 
) AS
SELECT TOP 20000000 * FROM [dbo].[FactFinance1b]
GO
SELECT * FROM dbo.vColumnstoreStats WHERE table_name='FactFinance100m'
select * From [dbo].[vCS_rg_physical_stats] WHERE logical_table_name='FactFinance100m'


execution_date,database_name,schema_name,table_name,partition_number,partition_scheme,object_id,index_name,row_count,deleted_row_count,row_group_count,compressed_row_count,compressed_rowgroup_count,open_rowgroup_count,open_row_count,compressed_row_max,compressed_row_avg,fragmentation_density,fragmentation_deletes,fragmentation_open
2022-07-08 11:46:37.263,AdventureWorksDW,dbo,FactFinance100m,,,130867583,ClusteredIndex_766ace22025346dba8b1db2efab6ee27,20000000,0,60,20000000,60,0,0,333884,333333,0.17,0.0,0.0


logical_table_name,row_group_id,partition_number,state,state_desc,total_rows,trim_reason_desc,physical_name,created_time
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_1,2022-07-08 11:46:35.730
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_2,2022-07-08 11:46:35.780
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_3,2022-07-08 11:46:36.693
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_4,2022-07-08 11:46:35.727
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_5,2022-07-08 11:46:35.770
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_6,2022-07-08 11:46:36.633
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_7,2022-07-08 11:46:35.740
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_8,2022-07-08 11:46:35.743
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_9,2022-07-08 11:46:35.767
FactFinance100m,0,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_10,2022-07-08 11:46:35.790


In [7]:
/*
    Let's insert another 100 million rows to make the table similar to the initial test case (120 million).

    We end up with 360 row groups instead of 180, and fragmentation is now 68%, with only about 333k rows per row group instead of 1024k.
*/

INSERT INTO FactFinance100m ([AccountKey], [ScenarioKey], [DepartmentGroupKey], [DateKey], [OrganizationKey], [Amount], [Date], [LineageKey])
SELECT TOP 20000000 [AccountKey], [ScenarioKey], [DepartmentGroupKey], [DateKey], [OrganizationKey], [Amount], [Date], [LineageKey] FROM [dbo].[FactFinance1b]
GO 5


In [8]:
/* Now let's look at the fragmentation: 68% fragmented, i.e. only 32% full. */

SELECT * FROM dbo.vColumnstoreStats WHERE table_name='FactFinance100m'
select * From [dbo].[vCS_rg_physical_stats] WHERE logical_table_name='FactFinance100m'

execution_date,database_name,schema_name,table_name,partition_number,partition_scheme,object_id,index_name,row_count,deleted_row_count,row_group_count,compressed_row_count,compressed_rowgroup_count,open_rowgroup_count,open_row_count,compressed_row_max,compressed_row_avg,fragmentation_density,fragmentation_deletes,fragmentation_open
2022-07-08 11:50:04.637,AdventureWorksDW,dbo,FactFinance100m,,,130867583,ClusteredIndex_766ace22025346dba8b1db2efab6ee27,120000000,0,360,120000000,360,0,0,333884,333333,68.21,0.0,0.0


logical_table_name,row_group_id,partition_number,state,state_desc,total_rows,trim_reason_desc,physical_name,created_time
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_1,2022-07-08 11:50:02.537
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_2,2022-07-08 11:50:03.283
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_3,2022-07-08 11:50:02.487
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_4,2022-07-08 11:50:04.167
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_5,2022-07-08 11:50:04.200
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_6,2022-07-08 11:50:02.567
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_7,2022-07-08 11:50:02.573
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_8,2022-07-08 11:50:04.197
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_9,2022-07-08 11:50:02.550
FactFinance100m,5,1,3,COMPRESSED,333324,BULKLOAD,Table_3b03618df7c6458cb362bc5edb705bfb_10,2022-07-08 11:50:02.560
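
The figures above follow from simple arithmetic, which the sketch below spells out. It assumes 60 distributions, an even ROUND_ROBIN spread, and that each of the 6 loads (the initial CTAS plus 5 inserts) closes one compressed row group per distribution; real row counts per distribution vary slightly around the computed value.

```sql
/* Sketch: expected row-group layout after 6 bulk loads of 20 million rows each.
   Assumptions: 60 distributions, rows spread evenly, one compressed row group
   closed per distribution per load. */
SELECT  l.load_count * l.distribution_count     AS expected_row_groups
,       l.rows_per_load / l.distribution_count  AS approx_rows_per_row_group
,       CAST(100.0 * (l.rows_per_load / l.distribution_count) / 1048576 AS decimal(5,2)) AS pct_full
FROM (
        SELECT 6 AS load_count, 60 AS distribution_count, 20000000 AS rows_per_load
     ) AS l;
```

This gives 360 row groups of about 333,333 rows each, roughly 32% of the 1,048,576-row optimum.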


In [9]:
/* We could fix this with a REORG:
    ALTER INDEX ClusteredIndex_1fba0db5c48b40288124497ec2198389 ON [dbo].[FactFinance100m] REORGANIZE

    But let's introduce ColumnstoreOptimize, which can locate all column stores with low density and REORG them.
*/
EXEC [dbo].[ColumnstoreOptimize]
     @Tables = 'FactFinance100m'
    ,@DensityThreshold = 25 /* Default = 25 */
    ,@OpenThreshold = NULL
    ,@DeleteThreshold = NULL
    ,@TimeLimit = NULL
    ,@Execute = 'N'
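
For comparison with the procedure call, a manual fix for this single table might look like the sketch below. Using `ALL` instead of the generated index name is a convenience chosen here, not something the original demo does, and whether a reorganize or a full rebuild is the better option depends on table size and how much resource use is acceptable.

```sql
/* Sketch of the manual alternative for one table. Run only one of the two statements. */

/* Option 1: reorganize, which can merge small compressed row groups. */
ALTER INDEX ALL ON [dbo].[FactFinance100m] REORGANIZE;

/* Option 2: rebuild, recompressing all row groups of the table.
   Typically heavier than a reorganize, so it is left commented out here. */
-- ALTER INDEX ALL ON [dbo].[FactFinance100m] REBUILD;
```

Re-running the two stats queries afterwards shows whether the row group count and fragmentation_density actually improved.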


In [10]:
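/*  Now we can check the CommandLog and see whether the stats improved.
    From 180 row groups to just 60 (one per distribution).
    The output below was captured after the @Execute='N' run above and still shows 360 row groups.
 */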
SELECT TOP 1 * FROM dbo.CommandLog ORDER BY StartTime DESC
SELECT * FROM dbo.vColumnstoreStats WHERE table_name='FactFinance100m'


ID,DatabaseName,SchemaName,ObjectName,ObjectType,IndexName,IndexType,StatisticsName,PartitionNumber,ExtendedInfo,Command,CommandType,StartTime,EndTime,ErrorNumber,ErrorMessage
4,AdventureWorksDW,dbo,FactFinance100m_nostats,U,,,ALL,,<ExtendedInfo><StatsRowCount>67844111</StatsRowCount><ActualRowCount>67403926</ActualRowCount><UpdateLevel>259622</UpdateLevel></ExtendedInfo>,UPDATE STATISTICS [dbo].[FactFinance100m_nostats] WITH FULLSCAN,UPDATE STATISTICS,2022-07-07 20:37:52.130,2022-07-07 20:37:55.077,,


execution_date,database_name,schema_name,table_name,partition_number,partition_scheme,object_id,index_name,row_count,deleted_row_count,row_group_count,compressed_row_count,compressed_rowgroup_count,open_rowgroup_count,open_row_count,compressed_row_max,compressed_row_avg,fragmentation_density,fragmentation_deletes,fragmentation_open
2022-07-08 11:50:12.400,AdventureWorksDW,dbo,FactFinance100m,,,130867583,ClusteredIndex_766ace22025346dba8b1db2efab6ee27,120000000,0,360,120000000,360,0,0,333884,333333,68.21,0.0,0.0
