# TASK 2: World Cup 2022
With the Euros underway, it is a repetitive task for the team to produce insights around the
importance of sporting events in Ads. Using the attached data, feel free to generate any insights
you think are relevant around what we have seen at the World Cup in 2022 & how this could be
interesting. Below are some hints to help you get started:
- Q1: Which countries have the largest reduction in Cost per Acquisition (CPA) for different
conversions during the World Cup compared to other date ranges available?
- Q2: What insights could you present which clients might find interesting & encourage them to
spend more on the next major sporting event?
- Q3: Euro 2024 is currently underway; how could you adapt your analysis for this competition
instead? Are there any insights or questions that would be more or less relevant?
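
One possible way to approach Q1 is sketched below. It assumes the tournament window of 20 November to 18 December 2022 (the official World Cup dates, not a field in the data) and computes CPA from aggregated totals per country rather than averaging row-level ratios. The function name `cpa_change_by_country` and the `period` label are illustrative, not part of the data set.

```python
import numpy as np
import pandas as pd

# Assumption: the World Cup window is 20 Nov to 18 Dec 2022 (inclusive).
WC_START = pd.Timestamp("2022-11-20")
WC_END = pd.Timestamp("2022-12-18")


def cpa_change_by_country(df: pd.DataFrame, conversion: str = "ftd") -> pd.DataFrame:
    """Compare CPA (spend / conversions) inside vs. outside the World Cup window."""
    dates = pd.to_datetime(df["date_partition"])
    period = np.where(dates.between(WC_START, WC_END), "world_cup", "other")

    # Sum spend and conversions first, then divide, so that hours with
    # zero conversions do not distort the ratio.
    totals = (
        df.assign(period=period)
        .groupby(["country_code", "period"])[["spend_usd", conversion]]
        .sum()
    )
    # Zero conversions give NaN instead of an infinite CPA.
    cpa = (totals["spend_usd"] / totals[conversion].replace(0, np.nan)).unstack("period")
    cpa = cpa.reindex(columns=["other", "world_cup"])
    cpa["cpa_change_pct"] = (cpa["world_cup"] / cpa["other"] - 1) * 100
    # Most negative change (largest CPA reduction) comes first.
    return cpa.sort_values("cpa_change_pct")
```

Calling `cpa_change_by_country(df, "reg_fin")` would give the same comparison for registrations. Countries with very few conversions in either period will produce unstable percentages, so a minimum-volume filter is probably worth adding before presenting results.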

# Notes to self
- I am not sure if I should put this csv file in a sqlite database and query with sql.
- It seems to me that the data analysis process is much easier done directly in pandas.
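
To weigh the two options, here is a minimal sketch that runs the same aggregation once through an in-memory SQLite database and once directly in pandas. The table name `world_cup` and both function names are made up for illustration.

```python
import sqlite3

import pandas as pd


def spend_by_country_sql(df: pd.DataFrame) -> pd.DataFrame:
    """Total spend per country via SQLite (in-memory database)."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table name; the frame is copied into SQLite first.
        df.to_sql("world_cup", conn, index=False)
        query = (
            "SELECT country_code, SUM(spend_usd) AS spend_usd "
            "FROM world_cup GROUP BY country_code ORDER BY country_code"
        )
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


def spend_by_country_pandas(df: pd.DataFrame) -> pd.DataFrame:
    """The same aggregation in pandas, without the extra copy."""
    return df.groupby("country_code", as_index=False)["spend_usd"].sum()
```

For a single CSV of this size the pandas version avoids the load step. Note that the two differ on missing keys: SQL keeps a NULL group, while `groupby` drops NaN keys by default.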

## Columns meanings
- The data set has 659,849 rows and 12 columns:
1. date_partition: date the data was recorded.
2. hour_of_day_utc: hour of day the data was recorded, in Coordinated Universal Time (UTC). Range 0-23.
3. country_code: country of origin of the traffic, as a two-letter ISO 3166-1 alpha-2 code.
4. os_name: operating system of the device used, e.g. Android, iOS, Windows.
5. platform_type: type of platform: website or mobile app.
6. imps: number of ad impressions, i.e. how many times the ad was fetched and shown, independent of user interaction.
7. viewable_imps: number of viewable ad impressions, i.e. ads that met the viewability standard of 50% on screen for one second.
8. clicks: number of ad clicks (an interaction measure).
9. reg_fin: finalized registrations, i.e. number of users who registered after clicking.
10. ftd: first-time deposits, i.e. number of users who made their first deposit after clicking (users need to deposit funds to use the service).
11. deposit: total amount deposited by users. Probably a cumulative value per UTC hour.
12. spend_usd: total ad spend in US dollars, i.e. the cost of displaying the ads.

Therefore, this data set contains data points for online marketing campaign metrics and can be used to calculate further digital marketing KPIs (key performance indicators):
- Cost per Acquisition (CPA)
- Click-Through Rate (CTR)
- Cost-Per-Click (CPC)
- Conversion Rate (CVR)
- Return on Investment (ROI)
- Return on Ad Spend (ROAS)
- Customer Acquisition Cost (CAC)
- Customer Lifetime Value (CLV)
- Marketing Qualified Lead (MQL)
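
Several of these KPIs can be derived directly from the columns described above. The sketch below is one way to do so on already-aggregated totals. The helper name `add_kpis` and the output column names are illustrative, and treating `deposit` as revenue for ROAS is an assumption. ROI, CLV and MQL are left out because they need information (margins, customer history, lead scoring) that these 12 columns do not appear to contain.

```python
import numpy as np
import pandas as pd


def add_kpis(totals: pd.DataFrame) -> pd.DataFrame:
    """Derive ratio KPIs from aggregated totals (one row per segment)."""

    def ratio(num: str, den: str) -> pd.Series:
        # A zero denominator gives NaN rather than inf.
        return totals[num] / totals[den].replace(0, np.nan)

    return totals.assign(
        ctr=ratio("clicks", "imps"),            # Click-Through Rate
        cpc=ratio("spend_usd", "clicks"),       # Cost-Per-Click
        cvr_reg=ratio("reg_fin", "clicks"),     # Conversion Rate (registrations)
        cvr_ftd=ratio("ftd", "clicks"),         # Conversion Rate (first deposits)
        cpa_reg=ratio("spend_usd", "reg_fin"),  # CPA per registration
        cpa_ftd=ratio("spend_usd", "ftd"),      # CPA per first-time deposit
        roas=ratio("deposit", "spend_usd"),     # ROAS, assuming deposit ~ revenue
    )
```

A typical call would aggregate first, for example `add_kpis(df.groupby("country_code")[["imps", "clicks", "reg_fin", "ftd", "deposit", "spend_usd"]].sum())`, so that ratios are computed on totals rather than on sparse hourly rows.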



## Definitions
- <b>Viewable impressions</b>: a standard measure of ad viewability defined by the Interactive Advertising Bureau (IAB). A display ad counts as viewable when at least 50% of it is on screen for at least one continuous second. Advertisers use this metric to <b>quantify the percentage of ads that are actually viewed</b> by real people.
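
In terms of the columns in this data set, that percentage is the ratio of viewable_imps to imps. A minimal sketch, with the function name `viewability_rate` chosen here for illustration:

```python
import pandas as pd


def viewability_rate(df: pd.DataFrame, by: str = "platform_type") -> pd.Series:
    """Share of impressions that were viewable, per segment."""
    totals = df.groupby(by)[["imps", "viewable_imps"]].sum()
    # Ratio of summed columns; a segment with zero impressions yields NaN.
    return (totals["viewable_imps"] / totals["imps"]).rename("viewability_rate")
```

Passing `by="country_code"` or `by="os_name"` gives the same rate for other segments.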


# Data Inspection 

In [2]:
import pandas as pd
df = pd.read_csv('data/world_cup_data.csv')
df.head()


Unnamed: 0,date_partition,hour_of_day_utc,country_code,os_name,platform_type,imps,viewable_imps,clicks,reg_fin,ftd,deposit,spend_usd
0,2022-12-02,4,BR,Android,website,263424.0,177356.0,964.0,27.0,20.0,197.0,178.62
1,2022-12-02,14,BR,Windows,website,206976.0,160251.0,509.0,14.0,12.0,128.0,122.47
2,2022-12-02,10,DE,Android,website,5790.0,4501.0,73.0,0.0,0.0,12.0,13.41
3,2022-12-02,9,IN,Android,website,376320.0,188159.0,1437.0,1.0,2.0,10.0,220.12
4,2022-12-02,1,KE,Android,website,34737.0,14405.0,36.0,1.0,0.0,11.0,24.42


In [5]:
# 659,849 rows and 12 columns
df.shape

(659849, 12)

In [7]:
# summary of dataframe, check for missing values and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 659849 entries, 0 to 659848
Data columns (total 12 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   date_partition   659849 non-null  object 
 1   hour_of_day_utc  659849 non-null  int64  
 2   country_code     659466 non-null  object 
 3   os_name          658339 non-null  object 
 4   platform_type    659849 non-null  object 
 5   imps             659849 non-null  float64
 6   viewable_imps    659849 non-null  float64
 7   clicks           659849 non-null  float64
 8   reg_fin          659849 non-null  float64
 9   ftd              659849 non-null  float64
 10  deposit          659849 non-null  float64
 11  spend_usd        659849 non-null  float64
dtypes: float64(7), int64(1), object(4)
memory usage: 60.4+ MB


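- Missing values are present in two columns: country_code has 383 nulls (659,849 - 659,466) and os_name has 1,510 nulls (659,849 - 658,339). All other columns are complete.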
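
The Non-Null Count column above can be cross-checked by counting nulls per column directly. In the df.info() output, country_code (659,466 non-null) and os_name (658,339 non-null) fall short of the 659,849 rows. The helper below is a small sketch, and its name `missing_summary` is illustrative.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    n_missing = df.isna().sum()
    out = pd.DataFrame(
        {"n_missing": n_missing, "pct_missing": n_missing / len(df) * 100}
    )
    return out[out["n_missing"] > 0].sort_values("n_missing", ascending=False)
```

Because `groupby` drops NaN keys by default, rows with a missing country_code or os_name would silently fall out of per-country or per-OS aggregates unless `dropna=False` is passed.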

In [9]:
# summary statistics of numerical columns
df.describe()

Unnamed: 0,hour_of_day_utc,imps,viewable_imps,clicks,reg_fin,ftd,deposit,spend_usd
count,659849.0,659849.0,659849.0,659849.0,659849.0,659849.0,659849.0,659849.0
mean,11.726145,19456.9,12322.35,41.446393,1.236513,0.506079,8.604884,16.637455
std,6.821883,68584.29,46599.76,183.442549,10.460184,7.901057,56.606414,65.575744
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6.0,735.0,0.0,0.0,0.0,0.0,0.0,0.39
50%,12.0,2792.0,1649.0,0.0,0.0,0.0,0.0,2.15
75%,18.0,11339.0,6934.0,21.0,0.0,0.0,2.0,10.21
max,23.0,1507310.0,1269530.0,6592.0,1868.0,2213.0,2453.0,4149.73


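- High variability: every metric column has a minimum of 0, a mean far below its maximum, and a standard deviation several times the mean.
- The median of clicks, reg_fin, ftd and deposit is also 0.
    - This implies that at least half of the hourly records produced no clicks or conversions, so most impression records were ineffective for engagement.
    - To do: investigate how many hours were recorded per date_partition.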
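
Both follow-ups can be checked with short helpers. This is a sketch, and the function names `hours_per_day` and `zero_share` are made up for illustration.

```python
import pandas as pd


def hours_per_day(df: pd.DataFrame) -> pd.Series:
    """Number of distinct UTC hours that have at least one record, per date."""
    return df.groupby("date_partition")["hour_of_day_utc"].nunique()


def zero_share(
    df: pd.DataFrame, cols=("clicks", "reg_fin", "ftd", "deposit")
) -> pd.Series:
    """Fraction of rows in which each metric is exactly 0."""
    return (df[list(cols)] == 0).mean()
```

A date with fewer than 24 distinct hours in `hours_per_day(df)` would point to gaps in recording, while `zero_share(df)` quantifies how sparse each metric is at the row level.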
    


In [10]:
# Inspect the categorical variables, how many unique values
df.select_dtypes('object').nunique()

date_partition    123
country_code      155
os_name            10
platform_type       2
dtype: int64

- records for 123 unique days (convert to datetime datatype)
- 155 countries 
- 10 operating systems (which ones?)
- 2 platform types
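
To answer "which ones?", the category labels and their row counts can be listed. Since `nunique()` ignores missing values by default, the sketch below keeps them visible with `dropna=False`. The helper name `category_counts` is illustrative.

```python
import pandas as pd


def category_counts(df: pd.DataFrame, column: str) -> pd.Series:
    """Row count per category, most frequent first, including missing values."""
    return df[column].value_counts(dropna=False)
```

For example, `category_counts(df, "os_name")` lists the 10 operating systems plus a NaN entry for the rows where os_name is missing.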

In [14]:
# convert date_partition to datetime
df['date_partition'] = pd.to_datetime(df['date_partition'])

# Get the start date (earliest date)
start_date = df['date_partition'].min()

# Get the end date (latest date)
end_date = df['date_partition'].max()

date_difference = (end_date - start_date).days

# Display the difference in days
print(f"Difference between start and end date: {date_difference} days")

# Display the start and end dates
print(f"Start date: {start_date}")
print(f"End date: {end_date}")

Difference between start and end date: 122 days
Start date: 2022-10-01 00:00:00
End date: 2023-01-31 00:00:00


In [16]:
# double check if there are any days missing in the range

# Generate a complete date range from start_date to end_date
complete_date_range = pd.date_range(start=start_date, end=end_date)

# Identify the missing dates by comparing the complete date range with the dates in the DataFrame
present_dates = pd.to_datetime(df['date_partition'].unique())
missing_dates = complete_date_range.difference(present_dates)

# Display the missing dates
print("Missing dates:")
print(missing_dates)

Missing dates:
DatetimeIndex([], dtype='datetime64[ns]', freq='D')


In [18]:
# Extract unique dates from the DataFrame and sort them
# (start_date, end_date and the complete date range from the earlier cells are not needed here)
unique_dates = pd.Series(df['date_partition'].unique()).sort_values()

# Convert the unique dates to a list
ordered_unique_dates = unique_dates.tolist()

# Display the ordered list of unique dates
print("Ordered list of unique dates between the min and max dates:")
print(ordered_unique_dates)

Ordered list of unique dates between the min and max dates:
[Timestamp('2022-10-01 00:00:00'), Timestamp('2022-10-02 00:00:00'), Timestamp('2022-10-03 00:00:00'), Timestamp('2022-10-04 00:00:00'), Timestamp('2022-10-05 00:00:00'), Timestamp('2022-10-06 00:00:00'), Timestamp('2022-10-07 00:00:00'), Timestamp('2022-10-08 00:00:00'), Timestamp('2022-10-09 00:00:00'), Timestamp('2022-10-10 00:00:00'), Timestamp('2022-10-11 00:00:00'), Timestamp('2022-10-12 00:00:00'), Timestamp('2022-10-13 00:00:00'), Timestamp('2022-10-14 00:00:00'), Timestamp('2022-10-15 00:00:00'), Timestamp('2022-10-16 00:00:00'), Timestamp('2022-10-17 00:00:00'), Timestamp('2022-10-18 00:00:00'), Timestamp('2022-10-19 00:00:00'), Timestamp('2022-10-20 00:00:00'), Timestamp('2022-10-21 00:00:00'), Timestamp('2022-10-22 00:00:00'), Timestamp('2022-10-23 00:00:00'), Timestamp('2022-10-24 00:00:00'), Timestamp('2022-10-25 00:00:00'), Timestamp('2022-10-26 00:00:00'), Timestamp('2022-10-27 00:00:00'), Timestamp('2022-10-28