Fetch_AnalyEng_Exercise

FIRST: Review Existing Unstructured Data and Diagram a New Structured Relational Data Model

Make sure you have the following libraries:

pip install matplotlib

pip install networkx

To generate the ER diagram:

python diagram.py

You can see the actual diagram:

SECOND: Write queries that directly answer predetermined questions from a business stakeholder

What are the top 5 brands by receipts scanned for most recent month?

    WITH RecentMonth AS (
        SELECT 
                b.name AS brand_name,
                COUNT(r.receipt_id) AS receipt_count
        FROM 
                Receipts r
        JOIN 
                Receipt_Items ri ON r.receipt_id = ri.receipt_id
        JOIN 
                Brands b ON ri.brand_id = b.brand_id
        WHERE 
                r.date_scanned >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01')
                AND r.date_scanned < DATE_FORMAT(NOW(), '%Y-%m-01')
        GROUP BY 
                b.name
        ORDER BY 
                receipt_count DESC
        LIMIT 5
        );

How does the ranking of the top 5 brands by receipts scanned for the recent month compare to the ranking for the previous month?

    -- Get the top 5 brands for the most recent month
    WITH RecentMonth AS (
    SELECT 
            b.name AS brand_name,
            COUNT(r.receipt_id) AS receipt_count
    FROM 
            Receipts r
    JOIN 
            Receipt_Items ri ON r.receipt_id = ri.receipt_id
    JOIN 
            Brands b ON ri.brand_id = b.brand_id
    WHERE 
            r.date_scanned >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01')
            AND r.date_scanned < DATE_FORMAT(NOW(), '%Y-%m-01')
    GROUP BY 
            b.name
    ORDER BY 
            receipt_count DESC
    LIMIT 5
    ),
    -- Get the top 5 brands for the previous month
    PreviousMonth AS (
    SELECT 
            b.name AS brand_name,
            COUNT(r.receipt_id) AS receipt_count
    FROM 
            Receipts r
    JOIN 
            Receipt_Items ri ON r.receipt_id = ri.receipt_id
    JOIN 
            Brands b ON ri.brand_id = b.brand_id
    WHERE 
            r.date_scanned >= DATE_FORMAT(NOW() - INTERVAL 2 MONTH, '%Y-%m-01')
            AND r.date_scanned < DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01')
    GROUP BY 
            b.name
    ORDER BY 
            receipt_count DESC
    LIMIT 5
    )
    -- Compare the top brands between the two months
    SELECT 
    rm.brand_name AS recent_month_brand,
    rm.receipt_count AS recent_month_count,
    pm.brand_name AS previous_month_brand,
    pm.receipt_count AS previous_month_count
    FROM 
    RecentMonth rm
    LEFT JOIN 
    PreviousMonth pm ON rm.brand_name = pm.brand_name
    UNION
    SELECT 
    pm.brand_name AS recent_month_brand,
    NULL AS recent_month_count,
    pm.brand_name AS previous_month_brand,
    pm.receipt_count AS previous_month_count
    FROM 
    PreviousMonth pm
    LEFT JOIN 
    RecentMonth rm ON pm.brand_name = rm.brand_name
    WHERE 
    rm.brand_name IS NULL;

Which brand has the most transactions among users who were created within the past 6 months?

   -- Get the brand with the most transactions among users created within the past 6 months
    SELECT 
    b.name AS brand_name,
    COUNT(ri.receipt_item_id) AS transaction_count
    FROM 
    Users u
    JOIN 
    Receipts r ON u.user_id = r.user_id
    JOIN 
    Receipt_Items ri ON r.receipt_id = ri.receipt_id
    JOIN 
    Brands b ON ri.brand_id = b.brand_id
    WHERE 
    u.created_date >= NOW() - INTERVAL 6 MONTH
    GROUP BY 
    b.name
    ORDER BY 
    transaction_count DESC
    LIMIT 1;

THIRD: Evaluate Data Quality Issues in the Data Provided

python Data_Quality.py

You can checkout the errors found in data_quality_issues.csv.

Made SQL Scripts for all json files.

python json_to_sql.py

The Generated SQL Scripts (MySQL)

Creating Tables.sql
brands_insert.sql
receipts_items_insert.sql
receipts_insert.sql
users_insert.sql

When I was trying to do this I was encountering multiple issues but the major issues were Missing Values and Duplicate entries so I tried to go around it as much as possible.

I also assumed that date format was given as a timestamp in milliseconds so I converted it into YYYY-MM-DD HH:MM format.

Thats when I realized I can make a detailed csv file about Missing Values and Duplicate entries error in the json files therefore I made the python script and generated a csv file.

FOURTH: Communicate with Stakeholders

Hi Team,

After reviewing the provided data and developing a new structured relational data model, I wanted to share some insights and raise a few questions.

Data Model and Insights:

We can now efficiently determine the top 5 brands by the number of receipts scanned for the most recent month.
We can also compare the ranking of these top 5 brands between the recent month and the previous month.
We can also get the brand with the most transactions among users created within the past 6 months

Data Quality Issues Identified:

Missing Values: We found records with missing values in some fields.
Duplicate Entries: There are duplicate records in the receipts and users data.

I am attaching the issue details in the data_quality_issues.csv.

Questions:

Can you provide clarity on how to handle missing values? Should we exclude such records or use a placeholder?
For duplicate entries, should we consider the most recent record or aggregate data from duplicates?

Further Information Needed:

Data Validation Rules: Understanding the rules for valid data entries will help in cleaning and maintaining the dataset.
Business Priorities: What are the key metrics or business goals we should focus on while optimizing the data model?

Performance and Scaling Concerns:

With growing data, we anticipate the need for optimized indexing and partitioning strategies to ensure query performance.
Ensuring data consistency and integrity during concurrent transactions will be critical.

Looking forward to your feedback and guidance on the next steps.

Best regards,

Nav Sanya Anand

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.vscode		.vscode
Creating Tables.sql		Creating Tables.sql
Data_Quality.py		Data_Quality.py
ER_Diagram.png		ER_Diagram.png
README.md		README.md
brands.json		brands.json
brands_insert.sql		brands_insert.sql
data_quality_issues.csv		data_quality_issues.csv
diagram.py		diagram.py
fetch-hiring.s3.amazonaws.com_analytics-engineer_ineeddata-data-modeling_data-modeling.pdf		fetch-hiring.s3.amazonaws.com_analytics-engineer_ineeddata-data-modeling_data-modeling.pdf
json_to_sql.py		json_to_sql.py
receipt_items_insert.sql		receipt_items_insert.sql
receipts.json		receipts.json
receipts_insert.sql		receipts_insert.sql
users.json		users.json
users_insert.sql		users_insert.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fetch_AnalyEng_Exercise

FIRST: Review Existing Unstructured Data and Diagram a New Structured Relational Data Model

SECOND: Write queries that directly answer predetermined questions from a business stakeholder

THIRD: Evaluate Data Quality Issues in the Data Provided

Made SQL Scripts for all json files.

FOURTH: Communicate with Stakeholders

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fetch_AnalyEng_Exercise

FIRST: Review Existing Unstructured Data and Diagram a New Structured Relational Data Model

SECOND: Write queries that directly answer predetermined questions from a business stakeholder

THIRD: Evaluate Data Quality Issues in the Data Provided

Made SQL Scripts for all json files.

FOURTH: Communicate with Stakeholders

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages