🍊Case Study #8: Fresh Segments

Please view the detailed case study info here.

📚Table of Contents

Business Task
Questions And Solutions
Data Exploration and Cleansing
Interest Analysis
Segment Analysis
Index Analysis

Business Task

Fresh Segments is a digital marketing agency that helps other businesses analyze trends in online ad click behavior for their unique customer base.

The task is to analyze the aggregated metrices and provide high levels insights for the given dataset.

There are 2 datasets in this case study to answer the business questions as follows:

Questions and Solutions

Data Exploration and Cleansing

1. Update the `interest_metrics` table by modifying the month_year column to be a date data type with the start of the month.

-- Drop month_year column
ALTER TABLE interest_metrics 
DROP COLUMN month_year;

-- Add new month_year column
ALTER TABLE interest_metrics
ADD COLUMN month_year DATE; 

-- Add data to the month_year column
UPDATE interest_metrics 
SET month_year = CAST(CONCAT(_year, "-", _month, "-01") AS DATE);

-- To check the data type of each column
DESC interest_metrics;

Answer:

2. What is the count of records in the `interest_metrics` for each `month_year` value sorted in chronological order (earliest to latest) with the null values appearing first?

SELECT
	month_year, 
	COUNT(*) AS count
FROM interest_metrics
GROUP BY month_year
ORDER BY month_year;

Answer:

3. What do you think we should do with these null values in the `interest_metrics`?

-- To find the percentage of null values in this dataset
SELECT 
    100 * COUNT(*) / (SELECT 
            COUNT(*)
        FROM
            interest_metrics) AS null_percent
FROM
    interest_metrics
WHERE
    interest_id IS NULL;

Answer:

Since the data is meaningless without interest_id, _month, and _year info and the null values percentage is only around 8.35%, I suggest dropping these null values.

-- To delete null values
DELETE FROM interest_metrics 
WHERE interest_id IS NULL;

-- To check if all null values are deleted
SELECT 
    100 * COUNT(*) / (SELECT 
            COUNT(*)
        FROM
            interest_metrics) AS null_percent
FROM
    interest_metrics
WHERE
    interest_id IS NULL;

4. How many `interest_id` values exist in the `interest_metrics` table but not in the `interest_map` table? What about the other way around?

-- Outer Join two tables 
WITH combine_cte AS (
	SELECT * FROM interest_metrics mt LEFT JOIN interest_map mp ON mt.interest_id = mp.id 
    UNION
    SELECT * FROM interest_metrics mt RIGHT JOIN interest_map mp ON mt.interest_id = mp.id
)

SELECT 
	COUNT(DISTINCT interest_id) AS metrics_id_count, 
    COUNT(DISTINCT id) AS maps_id_count, 
    SUM(IF(interest_id IS NULL, 1, 0)) AS not_in_metrics, 
    SUM(IF(id IS NULL, 1, 0)) AS not_in_maps
FROM 
	combine_cte;

Answer:

5. Summarize the `id` values in the `interest_map` by its total record count in this table.

SELECT 
    id, interest_name, COUNT(*) AS record_count
FROM
    interest_map mp
        JOIN
    interest_metrics mt ON mt.interest_id = mp.id
GROUP BY mp.id
ORDER BY record_count DESC;

Answer:

Only showing the partial result here. There are 1,000 results from this query.

6. What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where `interest_id = 21246` in your joined output and include all columns from `interest_metrics` and all columns from `interest_map` except the `id` column.

The good old INNER JOIN should be used to connect two tables by matching interest_id & id. This JOIN method would allow us to have the data records that are available in both tables.

SELECT 
    mt.*,
    mp.interest_name,
    mp.interest_summary,
    mp.created_at,
    mp.last_modified
FROM
    interest_metrics mt
        JOIN
    interest_map mp ON mt.interest_id = mp.id
WHERE
    mt.interest_id = 21246
        AND mt._month IS NOT NULL;

Answer:

7. Are there any records in your joined table where the `month_year` value is before the `created_at` value from the `interest_map` table? Do you think these values are valid and why?

-- To check the data counts where the month_year is before the created_at value
SELECT
	COUNT(*) AS count
FROM 
	interest_metrics mt
JOIN interest_map mp ON mt.interest_id = mp.id 
WHERE mt.month_year < mp.created_at;

Answer:

Even though there are 188 records where the month_year value is before the created_at value, we also have to keep in mind how the month_year column is created. The month_year column was updated to have the data values at the start of each month. So, I would check again whether there are any values of month_year before created_at and not in the same month and same year as the created_at, and there are none in such condition as seen below.

SELECT 
    COUNT(*) AS ctn
FROM
    interest_metrics mt
        JOIN
    interest_map mp ON mt.interest_id = mp.id
WHERE
    mt.month_year < mp.created_at
        AND MONTH(mt.month_year) != MONTH(mp.created_at)
        AND YEAR(mt.month_year) != YEAR(mp.created_at);

Answer:

Interest Analysis

1. Which interests have been present in all `month_year` dates in our dataset?

-- To check the unique number of interests and month_year
SELECT 
    COUNT(DISTINCT interest_id) AS interest_count,
    COUNT(DISTINCT month_year) AS month_year_count
FROM
    interest_metrics;

Answer:

--To check the interests count that present in all month_year 
WITH count_cte AS (
	SELECT 
		interest_id, 
        COUNT(DISTINCT month_year) AS total_months 
	FROM 
		interest_metrics 
	WHERE month_year IS NOT NULL
	GROUP BY 
		interest_id
)

SELECT 
	total_months,
	COUNT(DISTINCT interest_id) AS interest_count
FROM 
	count_cte 
WHERE total_months = 14;

Answer:

2. Using this same `total_months` measure - calculate the cumulative percentage of all records starting at 14 months - which `total_months` value passes the 90% cumulative percentage value?

WITH month_count_cte AS (
	SELECT
		interest_id, 
        COUNT(DISTINCT month_year) AS total_months
	FROM 
		interest_metrics
	WHERE month_year IS NOT NULL
    GROUP BY interest_id
), 

interest_count AS (
	SELECT
		total_months, 
        COUNT(DISTINCT interest_id) AS interest_count 
	FROM 
		month_count_cte
	GROUP BY total_months
), 

percent_cte AS (
	SELECT *, 
		ROUND(100* SUM(interest_count) OVER(ORDER BY total_months DESC)/(SELECT COUNT(DISTINCT interest_id) FROM interest_metrics), 2) AS cumulative_percent
	FROM 
		interest_count
) 

SELECT 
	*
FROM 
	percent_cte 
WHERE cumulative_percent > 90;

Answer:

3. If we were to remove all `interest_id` values which are lower than the `total_months` value we found in the previous question - how many total data points would we be removing?

WITH month_cte AS (
	SELECT 
		COUNT(DISTINCT month_year) AS total_months, 
        interest_id
	FROM 
		interest_metrics 
	WHERE month_year IS NOT NULL 
    GROUP BY interest_id
    HAVING total_months <= 6
)

SELECT 
	COUNT(interest_id) AS interest_count 
FROM 
	month_cte;

Answer:

4. Does this decision make sense to remove these data points from a business perspective? Use an example where there are all 14 months present to a removed `interest` example for your arguments - think about what it means to have less months present from a segment perspective.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

🍊Case Study #8: Fresh Segments

📚Table of Contents

Business Task

Questions and Solutions

Data Exploration and Cleansing

1. Update the `interest_metrics` table by modifying the month_year column to be a date data type with the start of the month.

2. What is the count of records in the `interest_metrics` for each `month_year` value sorted in chronological order (earliest to latest) with the null values appearing first?

3. What do you think we should do with these null values in the `interest_metrics`?

4. How many `interest_id` values exist in the `interest_metrics` table but not in the `interest_map` table? What about the other way around?

5. Summarize the `id` values in the `interest_map` by its total record count in this table.

6. What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where `interest_id = 21246` in your joined output and include all columns from `interest_metrics` and all columns from `interest_map` except the `id` column.

7. Are there any records in your joined table where the `month_year` value is before the `created_at` value from the `interest_map` table? Do you think these values are valid and why?

Interest Analysis

1. Which interests have been present in all `month_year` dates in our dataset?

2. Using this same `total_months` measure - calculate the cumulative percentage of all records starting at 14 months - which `total_months` value passes the 90% cumulative percentage value?

3. If we were to remove all `interest_id` values which are lower than the `total_months` value we found in the previous question - how many total data points would we be removing?

4. Does this decision make sense to remove these data points from a business perspective? Use an example where there are all 14 months present to a removed `interest` example for your arguments - think about what it means to have less months present from a segment perspective.

Files

README.md

Latest commit

History

README.md

File metadata and controls

🍊Case Study #8: Fresh Segments

📚Table of Contents

Business Task

Questions and Solutions

Data Exploration and Cleansing

1. Update the interest_metrics table by modifying the month_year column to be a date data type with the start of the month.

2. What is the count of records in the interest_metrics for each month_year value sorted in chronological order (earliest to latest) with the null values appearing first?

3. What do you think we should do with these null values in the interest_metrics?

4. How many interest_id values exist in the interest_metrics table but not in the interest_map table? What about the other way around?

5. Summarize the id values in the interest_map by its total record count in this table.

6. What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where interest_id = 21246 in your joined output and include all columns from interest_metrics and all columns from interest_map except the id column.

7. Are there any records in your joined table where the month_year value is before the created_at value from the interest_map table? Do you think these values are valid and why?

Interest Analysis

1. Which interests have been present in all month_year dates in our dataset?

2. Using this same total_months measure - calculate the cumulative percentage of all records starting at 14 months - which total_months value passes the 90% cumulative percentage value?

3. If we were to remove all interest_id values which are lower than the total_months value we found in the previous question - how many total data points would we be removing?

4. Does this decision make sense to remove these data points from a business perspective? Use an example where there are all 14 months present to a removed interest example for your arguments - think about what it means to have less months present from a segment perspective.

1. Update the `interest_metrics` table by modifying the month_year column to be a date data type with the start of the month.

2. What is the count of records in the `interest_metrics` for each `month_year` value sorted in chronological order (earliest to latest) with the null values appearing first?

3. What do you think we should do with these null values in the `interest_metrics`?

4. How many `interest_id` values exist in the `interest_metrics` table but not in the `interest_map` table? What about the other way around?

5. Summarize the `id` values in the `interest_map` by its total record count in this table.

6. What sort of table join should we perform for our analysis and why? Check your logic by checking the rows where `interest_id = 21246` in your joined output and include all columns from `interest_metrics` and all columns from `interest_map` except the `id` column.

7. Are there any records in your joined table where the `month_year` value is before the `created_at` value from the `interest_map` table? Do you think these values are valid and why?

1. Which interests have been present in all `month_year` dates in our dataset?

2. Using this same `total_months` measure - calculate the cumulative percentage of all records starting at 14 months - which `total_months` value passes the 90% cumulative percentage value?

3. If we were to remove all `interest_id` values which are lower than the `total_months` value we found in the previous question - how many total data points would we be removing?

4. Does this decision make sense to remove these data points from a business perspective? Use an example where there are all 14 months present to a removed `interest` example for your arguments - think about what it means to have less months present from a segment perspective.