# AtliQ Products Analysis

## Contents

1. [Introduction](#introduction)
2. [Data loading and preprocessing](#data-loading-and-preprocessing)
    - [Import libraries](#libraries)
    - [Load/Download database](#loaddownload-database)
    - [Helper functions](#helper-functions)
    - [Basic tables info](#basic-tables-info)
    - [Missing values](#missing-values)
    - [Duplicates](#duplicates)
    - [Other](#other)
3. [Analysis]
    1. [Finding the bestsellers]
    2. [Popularity across time and markets]
    3. [Variant sales]
    4. [Division sales]
    5. [Product margin]
    6. [Price vs cost]
4. [Conclusion]


## Introduction

Our team has been commissioned by AtliQ Hardware to conduct a thorough analysis of their product portfolio and sales data.

As a prominent computer hardware producer in India, AtliQ is keen on enhancing their understanding of product performance. This analysis aims to identify top-selling products, uncover trends, and develop strategies to optimize sales and market share.

The primary objective of this research is to analyze AtliQ Hardware's product portfolio and sales data to better understand product performance and identify strategies for optimizing sales. We aim to answer these key questions:
- Which items are the bestsellers?
- How has popularity changed over time/across markets?
- Are there some variants that contribute a disproportionate amount to the product sales?
- Are some channels responsible for a large portion of a division’s sales?
- What are the products with the best/worst margin?
- Is gross price keeping up with manufacturing costs?

Through this analysis, our goal is to provide AtliQ Hardware with actionable insights and recommendations to help drive business growth.

[Back to Contents](#contents)

## Data loading and Preprocessing

### Libraries
These are the libraries that we are going to use for this project:

In [1]:
import pandas as pd
import sqlite3
import os
import requests
import shutil

[Back to Contents](#contents)

### Load/Download database

We have access to an SQLite database with data on products, clients and sales. 

First let's check that it exists, and if doesn't, we'll download it.

In [2]:
# Local path to the Database
db_directory_path = 'Data'
db_file_path = os.path.join(db_directory_path, 'atliq_db.sqlite3')


In [3]:
# Check if directory exists. If it doesn't, create it
if not os.path.exists(db_directory_path):
    os.makedirs(db_directory_path)


In [4]:
# Check if file exists. If it doesn't, download it
if not os.path.exists(db_file_path):
    print('Database not found. Downloading the file...')

    db_url = 'https://practicum-content.s3.us-west-1.amazonaws.com/data-eng/databases/atliq_db.sqlite3'
    
    response = requests.get(db_url)
    with open(db_file_path, 'wb') as f:
        f.write(response.content)
    
    print('Database downloaded successfully!')
else:
    print('Database found.')


Database found.


We have our database. But we don't want to directly change the raw data, and we will be working directly with the database as much as possible. So we'll make a copy and modify that instead.

In [5]:
# Check if the copy exists
work_db_path = os.path.join(db_directory_path, 'atliq_db_processed.sqlite3')

work_db_found = False
if os.path.exists(work_db_path):
    work_db_found = True
    print('Previous copy found.')
else:
    shutil.copyfile(db_file_path, work_db_path)
    print('Database duplicated.')


Previous copy found.


We can now connect to our working copy and start processing it. If we found that the copy already exists, we can assume that it is already processed, and we can skip those steps.

In [6]:
# Connect to the DB
con = sqlite3.connect(work_db_path)

Let's check that we have access to the tables that we are supposed to.

In [7]:
# Check all tables
cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(*cursor.fetchall(), sep='\n')

('dim_customer',)
('dim_product',)
('fact_pre_discount',)
('fact_manufacturing_cost',)
('fact_gross_price',)
('fact_sales_monthly',)


[Back to contents](#contents)

### Helper functions

Lets check for missing values. We can't load the whole tables into pandas, so we'll have to rely only on SQL queries.

Lets build a function to help us, similar to pandas info().

We want to display the following:
- Table name
- Row count
- Column info, including name, type, null count, and Primary Key status.

In [8]:
# Count rows in the table
def row_count(table: str):
    query = """
    SELECT COUNT(*)
    from fact_sales_monthly
    """

    cursor.execute(query)
    return cursor.fetchone()[0]

In [9]:
# Find Null values in a column
def count_nulls_in_column(column: str, table: str):
    query = f"""
    SELECT COUNT(*)
    FROM {table}
    WHERE {column} IS NULL
    """

    cursor.execute(query)
    return cursor.fetchone()[0]

In [10]:
# Get all the column names from a table
def get_column_names(table: str):
    query = f"""
    PRAGMA table_info({table}) 
    """

    cursor.execute(query)
    result = cursor.fetchall()
    name_pos_in_row = 1

    column_names = []
    for row in result:
        column_names.append(row[name_pos_in_row])

    return column_names

In [11]:
# Check missing values in all columns of the table
def check_nulls(table: str):
    column_names = get_column_names(table)
    null_counts = []
    for column in column_names:
        null_counts.append((column, count_nulls_in_column(column, table)))

    return null_counts

In [12]:
def table_schema(table: str):
    query = f"""
    PRAGMA table_info({table})
    """

    return pd.read_sql_query(query, con)[['name', 'type', 'pk']]


In [13]:
def get_table_info(table: str):
    print(f'Table: {table}')
    print(f'Rows: {row_count(table)}')
    null_count = check_nulls(table)
    schema = table_schema(table)
    schema['nulls'] = [t[1] for t in null_count]
    display(schema)

To find duplicates, we'll compare the row count of a regular `SELECT` vs a `SELECT DISTINCT`, using relevant columns for each table.

In [14]:
def count_duplicates(table: str, list_cols: list, verbose=True):

    columns = ', '.join(list_cols)

    query = f"""
    SELECT COUNT(*)
    FROM {table}
    """
    cursor.execute(query)
    sel = cursor.fetchone()[0]

    query = f"""
    SELECT DISTINCT COUNT(*)
    FROM (
        SELECT DISTINCT {columns}
        FROM {table}
    )
    """
    cursor.execute(query)
    dis = cursor.fetchone()[0]
    duplicate_count = sel - dis

    if verbose:
        print(f'Table: {table}')
        print(f'Duplicates found: {duplicate_count}')
    return(duplicate_count)


[Back to Contents](#contents)

### Basic tables info

With our function ready, lets check the info on our tables.

In [15]:
get_table_info('dim_customer')

Table: dim_customer
Rows: 67250


Unnamed: 0,name,type,pk,nulls
0,customer_code,INTEGER,0,0
1,customer,TEXT,0,0
2,platform,TEXT,0,0
3,channel,TEXT,0,0
4,market,TEXT,0,0
5,sub_zone,TEXT,0,0
6,region,TEXT,0,0


In [16]:
get_table_info('dim_product')

Table: dim_product
Rows: 67250


Unnamed: 0,name,type,pk,nulls
0,product_code,TEXT,0,0
1,division,TEXT,0,0
2,segment,TEXT,0,0
3,category,TEXT,0,0
4,product,TEXT,0,0
5,variant,TEXT,0,0


In [17]:
get_table_info('fact_pre_discount')

Table: fact_pre_discount
Rows: 67250


Unnamed: 0,name,type,pk,nulls
0,customer_code,INTEGER,0,0
1,fiscal_year,INTEGER,0,0
2,pre_invoice_discount_pct,float,0,0


In [18]:
get_table_info('fact_manufacturing_cost')

Table: fact_manufacturing_cost
Rows: 67250


Unnamed: 0,name,type,pk,nulls
0,product_code,TEXT,0,0
1,cost_year,INTEGER,0,0
2,manufacturing_cost,float,0,0


In [19]:
get_table_info('fact_gross_price')

Table: fact_gross_price
Rows: 67250


Unnamed: 0,name,type,pk,nulls
0,product_code,TEXT,0,0
1,fiscal_year,INTEGER,0,0
2,gross_price,float,0,0


In [20]:
get_table_info('fact_sales_monthly')

Table: fact_sales_monthly
Rows: 67250


Unnamed: 0,name,type,pk,nulls
0,date,TEXT,0,0
1,product_code,TEXT,0,0
2,customer_code,INTEGER,0,0
3,sold_quantity,INTEGER,0,0
4,fiscal_year,INTEGER,0,0


There is only one missing value in the whole database. 

Curiously, none of the tables have primary keys declared. 

[Back to contents](#contents)

### Missing values

There is only one row with missing values in the whole Database. Let's print that row.

In [21]:
query='''
SELECT *
FROM fact_sales_monthly
WHERE fiscal_year IS NULL
'''

cursor.execute(query)
print(*cursor.fetchall(), sep='\n')




It's for product `A0` during `June 2019`. It could mean that this product didn't get any sales that month. Let's see some more info about it.

In [22]:
# Look for other sales of this product
query='''
SELECT *
FROM fact_sales_monthly
WHERE product_code = "A0"
'''

cursor.execute(query)
print(*cursor.fetchall(), sep='\n')




There are no other record of sales of this product.

In [23]:
# What product is this
query='''
SELECT *
FROM dim_product
WHERE product_code = "A0"
'''

cursor.execute(query)
print(*cursor.fetchall(), sep='\n')




There is no register of this product in `dim_product`. It doesn't exist. We can delete this row.

In [24]:
# What product is this
query='''
DELETE 
FROM fact_sales_monthly
WHERE product_code = "A0"
'''

cursor.execute(query)
print(*cursor.fetchall(), sep='\n')

con.commit()




[Back to Contents](#contents)

### Duplicates

The tables don't have Primary keys set up. That means it's possible that some of them have duplicated values in critical columns.

We'll be examining various column combinations in each table. Unless we encounter a non-zero value, we'll proceed to the next one.

In [25]:
# No duplicate customer codes
table = 'dim_customer'
columns = ['customer_code']
count_duplicates(table, columns);

Table: dim_customer
Duplicates found: 0


In [26]:
# Customer codes are unique to the ['customer', 'platform', 'channel', 'region'] combination
table = 'dim_customer'
columns = ['customer', 'platform', 'channel', 'region']
count_duplicates(table, columns);

Table: dim_customer
Duplicates found: 101


We got duplicates here. It's possible that the batabase notes are mistaken and `customer_code` is given by market, instead of by region.

In [27]:
# Check if adding 'market' to the previous query makes the results unique
table = 'dim_customer'
columns = ['customer', 'platform', 'channel', 'region', 'sub_zone', 'market']
count_duplicates(table, columns);

Table: dim_customer
Duplicates found: 0


It seems to be the case. We can move on.

In [28]:
# No duplicate product_code
table = 'dim_product'
columns = ['product_code']
count_duplicates(table, columns);

Table: dim_product
Duplicates found: 0


In [29]:
# No duplicate variants for the same product
table = 'dim_product'
columns = ['product', 'variant']
count_duplicates(table, columns);

Table: dim_product
Duplicates found: 0


In [30]:
# No duplicate category for the same product variant
table = 'dim_product'
columns = ['category', 'product', 'variant']
count_duplicates(table, columns);

Table: dim_product
Duplicates found: 0


In [31]:
# No duplicate segment for the same category product variant
table = 'dim_product'
columns = ['segment', 'category', 'product', 'variant']
count_duplicates(table, columns);

Table: dim_product
Duplicates found: 0


In [32]:
# No duplicate division for the same category product variant
table = 'dim_product'
columns = ['division', 'segment', 'category', 'product', 'variant']
count_duplicates(table, columns);

Table: dim_product
Duplicates found: 0


In [33]:
# The combination ['product', 'category', 'variant', 'segment', 'division'] should be unique
table = 'dim_product'
columns = ['product', 'category', 'variant', 'segment', 'division']
count_duplicates(table, columns);

Table: dim_product
Duplicates found: 0


In [34]:
# Each customer_code should have only one discount per fiscal_year
table = 'fact_pre_discount'
columns = ['customer_code', 'fiscal_year']
count_duplicates(table, columns);

Table: fact_pre_discount
Duplicates found: 0


In [35]:
# Each product_code should have only one cost per year
table = 'fact_manufacturing_cost'
columns = ['product_code', 'cost_year']
count_duplicates(table, columns);

Table: fact_manufacturing_cost
Duplicates found: 0


In [36]:
# Each product_code should have only one price per year
table = 'fact_gross_price'
columns = ['product_code', 'fiscal_year']
count_duplicates(table, columns);

Table: fact_gross_price
Duplicates found: 0


In [37]:
# The sales data should be aggrergated by customer, product and date.
table = 'fact_sales_monthly'
columns = ['date', 'product_code', 'customer_code']
count_duplicates(table, columns);

Table: fact_sales_monthly
Duplicates found: 0


We only found duplicates in `dim_customer`, and they are easily explainable. There is nothing to fix here.

[Back to Contents](#contents)

### Other

We should check the consistency of the data that we have, and that it obeys the rules stated in the Database documentation.

Let's see if the time period is the same in all tables.

In [38]:
# Period for fact_pre_discount data
query = """
SELECT MIN(fiscal_year), MAX(fiscal_year)
FROM fact_pre_discount
"""

cursor.execute(query)
cursor.fetchone()

(2018, 2022)

In [39]:
# Period for fact_manufacturing_cost data
query = """
SELECT MIN(cost_year), MAX(cost_year)
FROM fact_manufacturing_cost
"""

cursor.execute(query)
cursor.fetchone()

(2018, 2022)

In [40]:
# Period for fact_gross_price data
query = """
SELECT MIN(fiscal_year), MAX(fiscal_year)
FROM fact_gross_price
"""

cursor.execute(query)
cursor.fetchone()

(2018, 2022)

In [41]:
# Period for fact_sales_monthly data
query = """
SELECT MIN(fiscal_year), MAX(fiscal_year)
FROM fact_sales_monthly
"""

cursor.execute(query)
cursor.fetchone()

(2018, 2022)

All data is from fiscal years `2018` to `2022`

In [42]:
# dim_customer.platform should have 2 values
query = """
SELECT DISTINCT platform
FROM dim_customer
"""
cursor.execute(query)
print(*cursor.fetchall(), sep="\n")

('Brick & Mortar',)
('E-Commerce',)


In [43]:
# dim_customer.channel should have 3 values
query = """
SELECT DISTINCT channel
FROM dim_customer
"""
cursor.execute(query)
print(*cursor.fetchall(), sep="\n")

('Direct',)
('Distributor',)
('Retailer',)


In [54]:
# Each dim_customer.customer_code should have 1 platform
query = """
SELECT customer_code, count(platform) as platform_count
FROM dim_customer
GROUP BY customer_code
ORDER BY platform_count DESC
"""
cursor.execute(query)
print (f'Max platform count: {cursor.fetchone()[1]}')

Max platform count: 1


In [59]:
# Each dim_customer.market should have 1 sub-zone
query = """
SELECT market, count(sub_zone) as sub_zone_count
FROM (
    SELECT DISTINCT market, sub_zone
    FROM dim_customer
)
GROUP BY market
ORDER BY sub_zone_count DESC
"""
cursor.execute(query)
print (f'Max sub zone count: {cursor.fetchone()[1]}')

Max sub zone count: 1


In [60]:
# Each dim_customer.market should have 1 region
query = """
SELECT market, count(region) as region_count
FROM (
    SELECT DISTINCT market, region
    FROM dim_customer
)
GROUP BY market
ORDER BY region_count DESC
"""
cursor.execute(query)
print (f'Max region count: {cursor.fetchone()[1]}')

Max region count: 1


In [61]:
# dim_product.division should have 3 values
query = """
SELECT DISTINCT division
FROM dim_product
"""
cursor.execute(query)
print(*cursor.fetchall(), sep="\n")

('P & A',)
('PC',)
('N & S',)


In [62]:
# dim_product.segment should have 6 values
query = """
SELECT DISTINCT segment
FROM dim_product
"""
cursor.execute(query)
print(*cursor.fetchall(), sep="\n")

('Peripherals',)
('Accessories',)
('Notebook',)
('Desktop',)
('Storage',)
('Networking',)


In [63]:
# fact_pre_discount.pre_invoice_discount_pct should always be less than 1
query = """
SELECT max(pre_invoice_discount_pct)
FROM fact_pre_discount
"""
cursor.execute(query)
print(*cursor.fetchall(), sep="\n")

(0.3099,)


In [64]:
# fact_sales_monthly.date consistently assigns the same months to the right fiscal_year
query= """
SELECT fiscal_year, MIN(date), MAX(date)
FROM fact_sales_monthly
GROUP BY fiscal_year
"""
cursor.execute(query)
print(*cursor.fetchall(), sep="\n")

(2018, '2017-09-01', '2018-08-01')
(2019, '2018-09-01', '2019-08-01')
(2020, '2019-09-01', '2020-08-01')
(2021, '2020-09-01', '2021-08-01')
(2022, '2021-09-01', '2021-12-01')


There is only 3 months worth of data for fiscal year 2022, as opposed to the full 12 months for the rest of the fiscal years.

In [65]:
# fact_sales_monthly.sold_quantity should always be positive
query= """
SELECT MIN(sold_quantity)
FROM fact_sales_monthly
"""
cursor.execute(query)
print(*cursor.fetchall(), sep="\n")

(0,)


In [66]:
query= """
SELECT *
FROM fact_sales_monthly
WHERE sold_quantity = 0
"""
cursor.execute(query)
print(*cursor.fetchall(), sep="\n")

('2017-09-01', 'A0118150101', 70012042, 0, 2018)
('2017-09-01', 'A0118150101', 70012043, 0, 2018)
('2017-09-01', 'A0118150101', 90012033, 0, 2018)
('2017-09-01', 'A0118150101', 90012034, 0, 2018)
('2017-09-01', 'A0118150101', 90012035, 0, 2018)
('2017-09-01', 'A0118150101', 90012037, 0, 2018)
('2017-09-01', 'A0118150101', 90012038, 0, 2018)
('2017-09-01', 'A0118150101', 90012039, 0, 2018)
('2017-09-01', 'A0118150101', 90012041, 0, 2018)
('2017-09-01', 'A0118150102', 70012042, 0, 2018)
('2017-09-01', 'A0118150102', 70012043, 0, 2018)
('2017-09-01', 'A0118150102', 90012033, 0, 2018)
('2017-09-01', 'A0118150102', 90012034, 0, 2018)
('2017-09-01', 'A0118150102', 90012035, 0, 2018)
('2017-09-01', 'A0118150102', 90012037, 0, 2018)
('2017-09-01', 'A0118150102', 90012038, 0, 2018)
('2017-09-01', 'A0118150102', 90012039, 0, 2018)
('2017-09-01', 'A0118150102', 90012041, 0, 2018)
('2017-09-01', 'A0118150103', 70012042, 0, 2018)
('2017-09-01', 'A0118150103', 70012043, 0, 2018)
('2017-09-01', 'A011