# How to work with B-tree indexes
In this recipe we will explore how indexes behave:
1. We'll create a heap and test queries against it
2. We'll create nonclustered indexes
3. We'll create a clustered index
4. We'll create a filtered index

Let's create a table called SalesOrders and fill it with data.

## Preparation of the data

In [None]:
set nocount on
go

-- a large transaction is coming in a moment, so pre-size the log file
alter database Demo modify file (name = 'DemoLog', size = 4096 MB)
alter database Demo set recovery simple
go

use Demo
go

-- this table is subject for optimization
drop table if exists SalesOrders
create table SalesOrders
(
    Id int not null identity constraint pk_SalesOrders primary key nonclustered
    , OrderDate date not null
    , OrderNumber nvarchar(10) not null
    , CustomerId int not null
    , Subtotal dec(10, 3) not null
    , OrderStatus tinyint not null 
)
GO

-- this table serves as a row multiplier and will be dropped
drop table if exists Numbers 
create table Numbers (Id int not null identity)
go
insert Numbers default values 
go 100

-- one row per day from 2018-01-01 through 2021-03-31
drop table if exists #dates
;with cte as
(
select cast('20180101' as date) as TheDate
union all
select dateadd(dd, 1, cte.TheDate) from cte where cte.TheDate < '20210331'
)
select * 
into #dates
from cte
option (maxrecursion 2000)

declare @date date
	, @OrderNo tinyint
	, @i tinyint = 1

declare crs cursor 
for
select TheDate from #dates
open crs
fetch crs into @date
while @@FETCH_STATUS = 0
 begin
	set @OrderNo = ceiling(rand() * 10)
	set @i = 1
	while @i <= @OrderNo
	 begin
		insert SalesOrders (OrderDate, OrderNumber, CustomerId, Subtotal, OrderStatus)
		select @date, concat('OD', year(@date), @i), Numbers.Id, rand() * 1000, cast((rand() * 100) as int) % 5
		from Numbers
		set @i += 1
	 end
	fetch crs into @date
 end
close crs
deallocate crs
go

drop table if exists Numbers
go

-- let's create more records
insert SalesOrders (OrderDate, OrderNumber, CustomerId, Subtotal, OrderStatus)
select top 4000000 a.OrderDate, a.OrderNumber, a.CustomerId, a.Subtotal, a.OrderStatus
from SalesOrders a
	cross apply SalesOrders b

The SalesOrders table contains approximately 6.5 million rows. The table is a heap. Let's execute three queries:

- a query seeking by primary key
- a query seeking a range of dates
- an aggregate query
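
Before running the queries, it can help to confirm the row count and that the table really is a heap. The following is a minimal sketch using the `sys.indexes` and `sys.partitions` catalog views. The column aliases are illustrative, and the exact row count depends on the random data generated above.

```sql
use Demo
go

-- index_id = 0 is the heap itself; the nonclustered primary key shows up as a separate row
select
    i.index_id
    , i.name
    , i.type_desc
    , sum(p.rows) as row_count
from sys.indexes i
    join sys.partitions p on p.object_id = i.object_id and p.index_id = i.index_id
where i.object_id = object_id('dbo.SalesOrders')
group by i.index_id, i.name, i.type_desc
order by i.index_id
go
```

A row with `type_desc = 'HEAP'` means that no clustered index exists on the table.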
    

## Querying heap
We'll observe the number of reads, the CPU time, and the degree of parallelism (DOP).

In [None]:
use Demo
go

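-- reset: remove indexes left over from a previous run (if exists avoids an error on the first run)
drop index if exists ix_SalesOrders_OrderDate on dbo.SalesOrders
drop index if exists ix_SalesOrders_CustomerId on dbo.SalesOrders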

In [None]:
use Demo
go

set statistics IO, time ON
set statistics profile on
go
select * from SalesOrders where Id = 234567
go
select * from SalesOrders where OrderDate between '2020-04-01' and '2020-04-30'
go

SELECT
    CustomerId
    , sum(Subtotal)
from SalesOrders
group by CustomerId
order by CustomerId
go

With the exception of the first query, which can seek on the nonclustered primary key, every query reads the whole table. Now let's create indexes on the OrderDate column (to support the date filter) and on the CustomerId column (to support the aggregation).
  
## Creating indexes

In [None]:
use Demo
go

create index ix_SalesOrders_OrderDate on dbo.SalesOrders (OrderDate)
create index ix_SalesOrders_CustomerId on dbo.SalesOrders (CustomerId)

In [None]:
use Demo
go

set statistics IO, time ON
set statistics profile on
go

select * from SalesOrders where OrderDate between '2020-04-01' and '2020-04-30'
go

SELECT
    CustomerId
    , sum(Subtotal)
from SalesOrders
group by CustomerId
order by CustomerId
go

Let's comment on the results. Basically, nothing has changed.

  

The range query uses \* in its SELECT clause. The number of matching rows looks small (about 16,400 out of roughly 6.5 million), but it is large enough that SQL Server chooses a table scan over an index seek followed by 16,400 RID lookups, because each lookup is a separate iteration of a nested loop.
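
To see why the optimizer made that choice, one option is to force the index with a table hint and compare the logical reads of both variants. This is a sketch for experimentation only, and the actual numbers depend on the generated data.

```sql
use Demo
go

set statistics io on
go

-- plan chosen by the optimizer (a table scan in the run described above)
select * from SalesOrders where OrderDate between '2020-04-01' and '2020-04-30'
go

-- same query, forced to use the OrderDate index; every matching row then needs a RID lookup into the heap
select *
from SalesOrders with (index(ix_SalesOrders_OrderDate))
where OrderDate between '2020-04-01' and '2020-04-30'
go
```

Index hints like this are useful for comparing plans, but they are rarely a good idea in production code.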

  

The second query needs a column that the newly created index does not cover: the index on CustomerId does not contain Subtotal. Because there is no predicate, every row has to be read anyway, so SQL Server still scans the whole table.

  

In conclusion, an index helps only when it fits the design of the query. Let's correct the problems here with three actions:

1. We will reduce the number of rows returned by the range query
2. We will replace the \* with a list of columns
3. We will redesign the index for the aggregate query

In [None]:
use Demo
GO

select * from SalesOrders where OrderDate between '2020-04-01' and '2020-04-01'
go

select OrderDate from SalesOrders
GO

-- the parenthesized option syntax replaces the deprecated "with DROP_EXISTING" form
create index ix_SalesOrders_CustomerId on dbo.SalesOrders (CustomerId, Subtotal) with (drop_existing = on)
go

SELECT
    CustomerId
    , sum(Subtotal)
from SalesOrders
group by CustomerId
order by CustomerId
go
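
The cell above covers the aggregate query by adding Subtotal as a second key column. As an alternative sketch, the same query can be covered with an included column, which keeps CustomerId as the only key column and stores Subtotal only at the leaf level of the index. Running this replaces the index created above, so treat it as a variant to compare rather than a required step.

```sql
use Demo
go

-- CustomerId stays the only key column; Subtotal is carried as a non-key (included) column
create index ix_SalesOrders_CustomerId on dbo.SalesOrders (CustomerId)
include (Subtotal)
with (drop_existing = on)
go

-- the aggregate can now be answered from the index alone
select
    CustomerId
    , sum(Subtotal)
from SalesOrders
group by CustomerId
order by CustomerId
go
```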

## Creating filtered index
Let's assume that the OrderStatus column contains values from 0 to 4 (from new order to finished order). Most records will have the value 4 (finished order), so the distribution of data in this column is heavily skewed. An index over all rows would mostly hold entries for finished orders, so we are going to create a filtered index that contains only the rows with OrderStatus < 4.

In [None]:
use Demo
go
-- data prep.
update SalesOrders set OrderStatus = 4 where OrderDate < '2021-03-20'

select
    OrderStatus
    , count(*)
from SalesOrders
group by OrderStatus

In [None]:
use Demo
go

create index ix_SalesOrders_OrderStatus on SalesOrders (OrderStatus)
where OrderStatus < 4

In [None]:
use Demo
go
select * from SalesOrders where OrderStatus = 3
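
A filtered index can only serve queries whose predicate falls inside its filter. The sketch below first shows the filter stored with the index, using the `has_filter` and `filter_definition` columns of `sys.indexes`, and then runs a query for finished orders that the filtered index cannot answer.

```sql
use Demo
go

-- has_filter and filter_definition show the predicate stored with each index
select name, has_filter, filter_definition
from sys.indexes
where object_id = object_id('dbo.SalesOrders')
go

-- OrderStatus = 4 lies outside the filter (OrderStatus < 4), so the filtered index holds none of these rows
select count(*) from SalesOrders where OrderStatus = 4
go
```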

## Conclusion

1. When an index does not contain all the columns used in a query, SQL Server decides whether it is cheaper to scan all the data or to use the index and fetch the missing columns through Nested Loops with a RID Lookup (heap) or Key Lookup (clustered index).
2. When an index contains all the columns a query needs, the query is answered from the index alone, by an Index Seek or Index Scan operator. SQL Server does not access the base table.
3. Filtered indexes are a good fit for skewed data distributions.
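
One way to check which of these behaviours occurred during the recipe is the `sys.dm_db_index_usage_stats` dynamic management view. This is a minimal sketch. The counters are cumulative since the last instance restart, and reading the view requires the appropriate server-level permission.

```sql
use Demo
go

-- seeks, scans and lookups recorded per index; a null row means the index has not been used yet
select
    i.name
    , i.type_desc
    , s.user_seeks
    , s.user_scans
    , s.user_lookups
from sys.indexes i
    left join sys.dm_db_index_usage_stats s
        on s.object_id = i.object_id
        and s.index_id = i.index_id
        and s.database_id = db_id()
where i.object_id = object_id('dbo.SalesOrders')
order by i.index_id
go
```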