# Data Analyst Intern – Sportradar Challenge - Task 1
Date: 4 July 2024 <br>
Author: Kristina Chuang

## Instructions
- In programmatic advertising, ads (commonly referred to as impressions) are shown on websites with the help of DSPs (Demand Side Platforms). 
- These platforms want to make sure that they are showing the right impressions to the right users to generate as many conversions as possible. 
- If a user clicks on the ad and lands on the advertiser’s website, we store this event in a table called clicks (schema shown below). 
- If the user successfully places a deposit after landing on the advertiser’s website, we store this event in a table called conversions (schema shown below). 
- For each conversion there exists at least one impression, but not all impressions have a conversion or a click.
<br>
You are given 3 tables with the following schemas, defined by column names and types:
### impressions
1. <b>impression_id</b>: string
2. url_address: string
3. user_id: string
4. <b>request_country</b>: string ("Austria")
5. tracking_type: string [this is the tracking type (fingerprinted or cookie-based)]
6. dynamic_display: boolean [this is whether the impression was served through Dynamic Display]
7. dynamic_display_variables: string [content served in the impression, e.g. soccer vs. baseball]
8. request_browser_name: string
9. timestamp: date 
<br>
### clicks
1. <b>impression_id</b>: int
2. user_id: <i>int</i>
3. timestamp: string
<br>
### conversions
1. conversion_id: string
2. user_id: string
3. dval: integer [this is the deposit value of the conversion]
4. curr: string [this is the currency of the deposit value]
5. timestamp: date


# Q1: What is the CTR (%) for impressions served in "Austria"?


CTR (click-through rate) is the number of clicks on an ad divided by the number of times the ad was shown (impressions), expressed as a percentage. Here both counts are restricted to impressions served in Austria.

The formula to calculate CTR is (clicks / impressions) × 100.

$$ \mathrm{CTR}_{\mathrm{AT}} = \frac{\text{total clicks in Austria}}{\text{total impressions in Austria}} \times 100 $$
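As a quick illustration of the arithmetic, here is a minimal Python sketch of the formula. The function name `ctr_percent` and the example counts are hypothetical, and returning 0.0 when there are no impressions is just one possible convention.

```python
def ctr_percent(clicks: int, impressions: int) -> float:
    """Return the click-through rate as a percentage."""
    if impressions == 0:
        # CTR is undefined without impressions; 0.0 is an assumed convention.
        return 0.0
    return 100 * clicks / impressions


# Hypothetical counts, chosen only to show the calculation.
print(ctr_percent(25, 1000))  # 2.5
```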

- create a mock database (ad_data.db) to test the SQL queries in notebook mock_ads.ipynb
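The notebook that builds the mock database is not shown here, so the following is only a sketch of how such a database could be set up with Python's `sqlite3` module. The column types, the key constraints and the use of an in-memory database (so that an existing ad_data.db is left alone) are assumptions; the rows are a small subset of the mock data printed later in this notebook.

```python
import sqlite3

# In-memory database, so an existing ad_data.db file is never overwritten.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Assumed table definitions; IDs are stored as text, as in the mock data.
conn.executescript("""
CREATE TABLE impressions (
    impression_id TEXT PRIMARY KEY,
    url_address TEXT,
    user_id TEXT,
    request_country TEXT,
    tracking_type TEXT,
    dynamic_display INTEGER,  -- SQLite stores booleans as 0/1
    dynamic_display_variables TEXT,
    request_browser_name TEXT,
    timestamp TEXT
);
CREATE TABLE clicks (
    impression_id TEXT REFERENCES impressions(impression_id),
    user_id TEXT,
    timestamp TEXT
);
CREATE TABLE conversions (
    conversion_id TEXT PRIMARY KEY,
    user_id TEXT,
    dval INTEGER,
    curr TEXT,
    timestamp TEXT
);
""")

impressions = [
    ("imp1", "http://example.com", "user1", "Austria", "cookie-based", 1,
     "soccer", "Chrome", "2022-11-21"),
    ("imp2", "http://example2.com", "user2", "Austria", "fingerprinted", 0,
     "baseball", "Firefox", "2022-11-22"),
    ("imp5", "http://example5.com", "user4", "Germany", "cookie-based", 1,
     "basketball", "Chrome", "2022-11-25"),
]
clicks = [
    ("imp1", "user1", "2022-11-21"),
    ("imp2", "user2", "2022-11-22"),
    ("imp5", "user4", "2022-11-25"),
]
conversions = [
    ("conv1", "user1", 100, "EUR", "2022-11-21"),
    ("conv2", "user2", 150, "EUR", "2022-11-22"),
]

conn.executemany("INSERT INTO impressions VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", impressions)
conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", clicks)
conn.executemany("INSERT INTO conversions VALUES (?, ?, ?, ?, ?)", conversions)
conn.commit()

# Row counts per table, as a sanity check on the inserts.
counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("impressions", "clicks", "conversions")
}
print(counts)
```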

### Load SQL extension to run SQL commands in code cells

In [14]:
#!pip install ipython-sql
%load_ext sql

In [17]:
# connect to mock database

%sql sqlite:///ad_data.db

In [19]:
# enable foreign keys (SQLite specific)
%sql PRAGMA foreign_keys = ON

 * sqlite:///ad_data.db
Done.


[]
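SQLite only enforces foreign keys on connections where the pragma has been switched on. The sketch below shows the difference using Python's `sqlite3` module; the two-column-free tables and the `imp_missing` ID are hypothetical and serve only to trigger the constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE impressions (impression_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE clicks (impression_id TEXT REFERENCES impressions(impression_id))"
)

# With enforcement off, a click pointing at a missing impression is accepted.
conn.execute("PRAGMA foreign_keys = OFF")
conn.execute("INSERT INTO clicks VALUES ('imp_missing')")
orphans_accepted = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
conn.execute("DELETE FROM clicks")
conn.commit()  # the pragma has no effect inside an open transaction

# With enforcement on, the same insert is rejected.
conn.execute("PRAGMA foreign_keys = ON")
try:
    conn.execute("INSERT INTO clicks VALUES ('imp_missing')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(orphans_accepted, rejected)
```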

## Computing CTR for Austria
- Use common table expressions (CTEs) to calculate the total impressions and total clicks in Austria.
- The "total_impressions" CTE counts the rows in impressions where request_country is 'Austria' and returns the result as impression_count.
- The "total_clicks" CTE joins clicks to impressions on impression_id and counts the clicks whose impression has request_country 'Austria', returned as click_count.
- Finally, CTR is calculated by dividing click_count by impression_count (both cast to float so that the division is not truncated to an integer) and multiplying by 100 to get the percentage.

In [58]:
%%sql

WITH total_impressions AS (
    SELECT COUNT(*) AS impression_count
    FROM impressions
    WHERE request_country = 'Austria'
),
total_clicks AS (
    SELECT COUNT(*) AS click_count
    FROM clicks
    JOIN impressions ON impressions.impression_id = clicks.impression_id
    WHERE impressions.request_country = 'Austria'
)

SELECT 
    (CAST(click_count AS FLOAT) / CAST(impression_count AS FLOAT)) * 100 AS CTR_Percentage
FROM 
    total_impressions, total_clicks


 * sqlite:///ad_data.db
Done.


CTR_Percentage
100.0
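The query above counts click events. If one impression can be clicked more than once, counting distinct clicked impressions instead gives a CTR that cannot exceed 100%. The sketch below compares the two variants in a single LEFT JOIN query; the rows are hypothetical and the tables are reduced to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE impressions (impression_id TEXT PRIMARY KEY, request_country TEXT);
CREATE TABLE clicks (impression_id TEXT, user_id TEXT);
""")

# Hypothetical rows: four Austrian impressions and one German impression.
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?)",
    [("a1", "Austria"), ("a2", "Austria"), ("a3", "Austria"),
     ("a4", "Austria"), ("g1", "Germany")],
)
# a1 is clicked twice, a2 once, a3 and a4 never; g1 is outside Austria.
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("a1", "u1"), ("a1", "u1"), ("a2", "u2"), ("g1", "u3")],
)

query = """
SELECT
    100.0 * COUNT(c.impression_id) / COUNT(DISTINCT i.impression_id),
    100.0 * COUNT(DISTINCT c.impression_id) / COUNT(DISTINCT i.impression_id)
FROM impressions AS i
LEFT JOIN clicks AS c ON c.impression_id = i.impression_id
WHERE i.request_country = 'Austria'
"""
# First value: click events per impression; second: clicked impressions per impression.
event_ctr, unique_ctr = conn.execute(query).fetchone()
print(event_ctr, unique_ctr)  # 75.0 50.0
```

Which variant is the right one depends on how the clicks table is populated, which the task description does not specify.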


# Q2: For each converted user, find out how many impressions they were served, and capture the timestamps of the first and last impression.
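- If there is a record in the conversions table, then the same user_id must be in the impressions table, so an inner join on user_id keeps every converted user.
- Match the user_id from the conversions table with the user_id in the impressions table, then group by user_id to aggregate each user's impressions.
- A user with several conversions (user1 has two in the mock data) is joined to each of their impressions once per conversion, so the impressions are counted with COUNT(DISTINCT ...) to avoid double counting.
- MIN and MAX of the impression timestamp give the first and last impression.

In [68]:
%%sql
SELECT 
    conversions.user_id,
    COUNT(DISTINCT impressions.impression_id) AS impressions_count,
    MIN(impressions.timestamp) AS first_impression_timestamp,
    MAX(impressions.timestamp) AS last_impression_timestamp
FROM 
    conversions
JOIN 
    impressions ON conversions.user_id = impressions.user_id
GROUP BY 
    conversions.user_id;

 * sqlite:///ad_data.db
Done.


user_id,impressions_count,first_impression_timestamp,last_impression_timestamp
user1,2,2022-11-21,2022-11-23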
user2,1,2022-11-22,2022-11-22
user3,1,2022-11-24,2022-11-24
user4,1,2022-11-25,2022-11-25
user5,1,2022-11-26,2022-11-26
user6,1,2022-11-27,2022-11-27
user7,1,2022-11-28,2022-11-28
user8,1,2022-11-29,2022-11-29
user9,1,2022-11-30,2022-11-30
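To see how a join on user_id behaves when a user has more than one conversion, the sketch below rebuilds the user1 and user2 rows of the mock tables in an in-memory database and compares COUNT with COUNT(DISTINCT). The tables are cut down to the columns the query needs, which is an assumption made for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE impressions (impression_id TEXT, user_id TEXT, timestamp TEXT);
CREATE TABLE conversions (conversion_id TEXT, user_id TEXT);
""")

# user1 has two impressions and two conversions; user2 has one of each.
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [("imp1", "user1", "2022-11-21"),
     ("imp2", "user2", "2022-11-22"),
     ("imp3", "user1", "2022-11-23")],
)
conn.executemany(
    "INSERT INTO conversions VALUES (?, ?)",
    [("conv1", "user1"), ("conv2", "user2"), ("conv3", "user1")],
)

rows = conn.execute("""
SELECT
    c.user_id,
    COUNT(i.impression_id) AS joined_rows,
    COUNT(DISTINCT i.impression_id) AS distinct_impressions,
    MIN(i.timestamp) AS first_impression_timestamp,
    MAX(i.timestamp) AS last_impression_timestamp
FROM conversions AS c
JOIN impressions AS i ON i.user_id = c.user_id
GROUP BY c.user_id
ORDER BY c.user_id
""").fetchall()

for row in rows:
    print(row)
```

For user1 the join produces one row per (conversion, impression) pair, so the plain count is 4 while the distinct count is 2. The first and last timestamps are the same in both cases.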


In [48]:
%%sql
SELECT * FROM clicks

 * sqlite:///ad_data.db
Done.


impression_id,user_id,timestamp
imp1,user1,2022-11-21
imp2,user2,2022-11-22
imp3,user1,2022-11-23
imp4,user3,2022-11-24
imp5,user4,2022-11-25
imp6,user5,2022-11-26
imp7,user6,2022-11-27
imp8,user7,2022-11-28
imp9,user8,2022-11-29
imp10,user9,2022-11-30


In [50]:
%%sql
SELECT * FROM impressions

 * sqlite:///ad_data.db
Done.


impression_id,url_address,user_id,request_country,tracking_type,dynamic_display,dynamic_display_variables,request_browser_name,timestamp
imp1,http://example.com,user1,Austria,cookie-based,1,soccer,Chrome,2022-11-21
imp2,http://example2.com,user2,Austria,fingerprinted,0,baseball,Firefox,2022-11-22
imp3,http://example3.com,user1,Austria,cookie-based,1,soccer,Safari,2022-11-23
imp4,http://example4.com,user3,Austria,fingerprinted,0,soccer,Edge,2022-11-24
imp5,http://example5.com,user4,Germany,cookie-based,1,basketball,Chrome,2022-11-25
imp6,http://example6.com,user5,Germany,fingerprinted,0,soccer,Firefox,2022-11-26
imp7,http://example7.com,user6,France,cookie-based,1,tennis,Safari,2022-11-27
imp8,http://example8.com,user7,France,fingerprinted,0,soccer,Edge,2022-11-28
imp9,http://example9.com,user8,Italy,cookie-based,1,volleyball,Chrome,2022-11-29
imp10,http://example10.com,user9,Italy,fingerprinted,0,soccer,Firefox,2022-11-30


In [64]:
%%sql
SELECT * FROM conversions

 * sqlite:///ad_data.db
Done.


conversion_id,user_id,dval,curr,timestamp
conv1,user1,100,EUR,2022-11-21
conv2,user2,150,EUR,2022-11-22
conv3,user1,200,EUR,2022-11-23
conv4,user3,250,EUR,2022-11-24
conv5,user4,300,EUR,2022-11-25
conv6,user5,350,EUR,2022-11-26
conv7,user6,400,EUR,2022-11-27
conv8,user7,450,EUR,2022-11-28
conv9,user8,500,EUR,2022-11-29
conv10,user9,550,EUR,2022-11-30


In [15]:
import pandas as pd
df = pd.read_csv('world_cup_data.csv')
df.head()


Unnamed: 0,date_partition,hour_of_day_utc,country_code,os_name,platform_type,imps,viewable_imps,clicks,reg_fin,ftd,deposit,spend_usd
0,2022-12-02,4,BR,Android,website,263424.0,177356.0,964.0,27.0,20.0,197.0,178.62
1,2022-12-02,14,BR,Windows,website,206976.0,160251.0,509.0,14.0,12.0,128.0,122.47
2,2022-12-02,10,DE,Android,website,5790.0,4501.0,73.0,0.0,0.0,12.0,13.41
3,2022-12-02,9,IN,Android,website,376320.0,188159.0,1437.0,1.0,2.0,10.0,220.12
4,2022-12-02,1,KE,Android,website,34737.0,14405.0,36.0,1.0,0.0,11.0,24.42


In [17]:
df.shape


(659849, 12)