#### ID 2020

```Which company had the biggest monthly call decline from March to April 2020? Return the company_id and calls difference for the company with the highest decline.```

In [None]:
%%sql
WITH total_call_in_march AS (SELECT company_id,
                                    SUM(CASE WHEN call_id IS NOT NULL THEN 1 ELSE 0 END) AS total_calls
                             FROM rc_calls
                                       JOIN rc_users USING (user_id)
                             WHERE date BETWEEN '2020-03-01 00:00:00' AND '2020-03-31 23:59:59'
                             GROUP BY company_id),
     total_call_in_april AS (SELECT company_id,
                                    SUM(CASE WHEN call_id IS NOT NULL THEN 1 ELSE 0 END) AS total_calls
                             FROM rc_calls
                                      JOIN rc_users USING (user_id)
                             WHERE date BETWEEN '2020-04-01 00:00:00' AND '2020-04-30 23:59:59'
                             GROUP BY company_id),
     variance_calculation AS (SELECT m.company_id,
                                     m.total_calls         AS march_total_calls,
                                     a.total_calls                 AS april_total_calls,
                                     a.total_calls - m.total_calls AS variance
                              FROM total_call_in_march AS m
                                       JOIN total_call_in_april AS a
                                            ON m.company_id = a.company_id)
SELECT company_id, march_total_calls, april_total_calls, variance
FROM variance_calculation
WHERE variance = (SELECT MIN(variance)
                  FROM variance_calculation)
ORDER BY variance;

In [None]:
total_call_in_march = pd.merge(rc_calls, rc_users, on='user_id', how='inner').query(
    'date >= "2020-03-01 00:00:00" & date <= "2020-03-31 23:59:59"').groupby('company_id', as_index=False).agg(
    total_calls=('call_id', 'count'))

total_call_in_april = pd.merge(rc_calls, rc_users, on='user_id', how='inner').query(
    'date >= "2020-04-01 00:00:00" & date <= "2020-04-30 23:59:59"').groupby('company_id', as_index=False).agg(
    total_calls=('call_id', 'count'))
variance_calculation = pd.merge(total_call_in_march, total_call_in_april, on='company_id', suffixes=('_m', '_a'))
variance_calculation['variance'] = variance_calculation['total_calls_a'] - variance_calculation['total_calls_m']
variance_calculation.nsmallest(1, 'variance')[['company_id', 'variance']]

#### ID 2021

```Redfin helps clients to find agents. Each client will have a unique request_id and each request_id has several calls. For each request_id, the first call is an “initial call” and all the following calls are “update calls”.  What's the average call duration for all initial calls?```

In [None]:
%%sql
WITH cte AS (SELECT request_id,
                    call_duration,
                    RANK() OVER (PARTITION BY request_id ORDER BY created_on) AS rnk
             FROM redfin_call_tracking)
SELECT AVG(call_duration)
FROM cte
WHERE rnk = 1

In [None]:
df = redfin_call_tracking
df['rnk'] = df.groupby('request_id')['created_on'].rank(method='first', ascending=True)
df.query('rnk == 1')['call_duration'].mean()

#### ID 2022

```Redfin helps clients to find agents. Each client will have a unique request_id and each request_id has several calls. For each request_id, the first call is an “initial call” and all the following calls are “update calls”.  What's the average call duration for all update calls?```

In [None]:
%%sql
SELECT AVG(call_duration)
FROM (SELECT call_duration,
             DENSE_RANK() OVER (PARTITION BY request_id ORDER BY created_on) AS rnk
      FROM redfin_call_tracking) t1
WHERE rnk > 1

In [None]:
df = redfin_call_tracking
df['rnk'] = df.sort_values('created_on').groupby('request_id')['created_on'].rank(method='dense')
df.query('rnk > 1')['call_duration'].mean()
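
The split between initial and update calls used in IDs 2021 and 2022 can be cross-checked by hand on a tiny table. The following is a minimal sketch in plain Python, assuming made-up rows of the form `(request_id, created_on, call_duration)`; none of the values come from the real `redfin_call_tracking` data.

```python
from collections import defaultdict

# Hypothetical (request_id, created_on, call_duration) rows, invented for illustration only.
calls = [
    (1, "2020-03-01 09:00", 100),
    (1, "2020-03-01 10:00", 200),
    (1, "2020-03-02 09:00", 300),
    (2, "2020-03-01 12:00", 50),
]

# Group calls by request and order each group chronologically
# (ISO-formatted timestamp strings sort in time order).
by_request = defaultdict(list)
for request_id, created_on, duration in calls:
    by_request[request_id].append((created_on, duration))

initial, updates = [], []
for rows in by_request.values():
    rows.sort()
    initial.append(rows[0][1])              # first call of the request
    updates.extend(d for _, d in rows[1:])  # every later call

avg_initial = sum(initial) / len(initial)
avg_update = sum(updates) / len(updates)
print(avg_initial, avg_update)  # 75.0 250.0
```

This sketch always treats exactly one call per request as the initial call. If two calls of a request shared the same earliest timestamp, `RANK()` and `DENSE_RANK()` would assign rank 1 to both, so the results could differ in that edge case.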

#### ID 2023

```Redfin helps clients to find agents. Each client will have a unique request_id and each request_id has several calls. For each request_id, the first call is an “initial call” and all the following calls are “update calls”.  How many customers have called 3 or more times between 3 PM and 6 PM (initial and update calls combined)?```

In [None]:
%%sql
WITH total_calls AS (SELECT request_id, COUNT(call_duration) AS cnt
             FROM redfin_call_tracking
             WHERE EXTRACT(HOUR FROM created_on) BETWEEN 15 AND 17
             GROUP BY request_id
             HAVING COUNT(call_duration) >= 3)
SELECT COUNT(request_id)
FROM total_calls

In [None]:
df = redfin_call_tracking
df[(df['created_on'].dt.hour >= 15) & (df['created_on'].dt.hour <= 17)].groupby('request_id', as_index=False).agg(
    total_cnt=('call_duration', 'count')).query('total_cnt >= 3')['request_id'].count()
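
One detail worth checking is how the "3 PM to 6 PM" window maps onto hour values. As a minimal sketch in plain Python (the timestamps are invented for illustration), filtering on the hour component keeps everything up to the end of the last hour in the range, so an upper bound of 17 stops at 17:59 while an upper bound of 18 extends to 18:59:

```python
from datetime import datetime

# Hypothetical call timestamps, invented for illustration only.
calls = [
    datetime(2020, 3, 1, 14, 59),
    datetime(2020, 3, 1, 15, 0),
    datetime(2020, 3, 1, 17, 59),
    datetime(2020, 3, 1, 18, 0),
    datetime(2020, 3, 1, 18, 30),
]

def in_hour_range(ts, first_hour, last_hour):
    """Mimic EXTRACT(HOUR FROM ts) BETWEEN first_hour AND last_hour."""
    return first_hour <= ts.hour <= last_hour

upto_17 = [ts for ts in calls if in_hour_range(ts, 15, 17)]
upto_18 = [ts for ts in calls if in_hour_range(ts, 15, 18)]

print(len(upto_17))  # 2
print(len(upto_18))  # 4
```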

#### ID 2025

```Write a query that returns a number of users who are exclusive to only one client. Output the client_id and number of exclusive users.```

In [None]:
%%sql
WITH distinct_users AS (SELECT user_id, COUNT(DISTINCT client_id)
                        FROM fact_events
                        GROUP BY user_id
                        HAVING COUNT(DISTINCT client_id) = 1)

SELECT client_id, COUNT(DISTINCT fe.user_id)
FROM fact_events fe
         JOIN distinct_users du ON fe.user_id = du.user_id
GROUP BY client_id

In [None]:
df = fact_events
grouped_users = df.groupby('user_id', as_index=False).agg(cnt=('client_id', 'nunique')).query('cnt == 1')[
    'user_id'].to_list()
df.query('user_id.isin(@grouped_users)').groupby('client_id', as_index=False).agg(cnt_users=('user_id', 'nunique'))

#### ID 2026

```Write a query that returns a list of the bottom 2 companies by mobile usage. Company is defined in the customer_id column. Mobile usage is defined as the number of events registered on a client_id == 'mobile'. Order the result by the number of events ascending. In the case where there are multiple companies tied for the bottom ranks (rank 1 or 2), return all the companies. Output the customer_id and number of events.```

In [None]:
%%sql
WITH ranked_mobile_events AS (SELECT customer_id,
                    COUNT(1)                              AS events,
                    DENSE_RANK() OVER (ORDER BY COUNT(1)) AS rnk
             FROM fact_events
             WHERE client_id = 'mobile'
             GROUP BY customer_id)
SELECT customer_id, events
FROM ranked_mobile_events
WHERE rnk <= 2
ORDER BY events

In [None]:
df = fact_events
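events = df.query('client_id == "mobile"').groupby('customer_id', as_index=False).agg(events=('client_id', 'count'))
# Dense rank so that tied companies share a rank and both bottom ranks (1 and 2) are returned
events['rnk'] = events['events'].rank(method='dense')
events.query('rnk <= 2').sort_values('events')[['customer_id', 'events']]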
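
The tie rule in the prompt (return every company at rank 1 or 2) corresponds to a dense rank, which is not the same as "take the two smallest rows and keep ties with the last one", the behaviour of pandas `nsmallest(2, ..., keep='all')`. A minimal sketch in plain Python, using invented event counts, shows where the two diverge:

```python
# Hypothetical event counts per customer, invented for illustration only.
events = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 5}

# Dense rank: equal counts share a rank and ranks have no gaps.
distinct_counts = sorted(set(events.values()))
dense_rank = {cust: distinct_counts.index(cnt) + 1 for cust, cnt in events.items()}
bottom_two_ranks = sorted(c for c, r in dense_rank.items() if r <= 2)

# Two smallest rows, keeping ties with the value of the second row.
ordered = sorted(events.values())
cutoff = ordered[1]
two_smallest_with_ties = sorted(c for c, cnt in events.items() if cnt <= cutoff)

print(bottom_two_ranks)        # ['A', 'B', 'C', 'D']
print(two_smallest_with_ties)  # ['A', 'B', 'C']
```

With three companies tied at the minimum, the dense-rank version still returns the company at rank 2, while the two-smallest version never gets past the tie.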

#### ID 2027

```Write a query that returns the company (customer id column) with highest number of users that use desktop only.```

In [None]:
%%sql
SELECT customer_id
FROM (SELECT customer_id,
             RANK() OVER (
                 ORDER BY COUNT(DISTINCT user_id) DESC) AS rnk
      FROM fact_events
      WHERE user_id IN (SELECT user_id
                        FROM fact_events
                        GROUP BY user_id
                        HAVING COUNT(DISTINCT client_id) = 1)
        AND client_id = 'desktop'
      GROUP BY customer_id) t1
WHERE rnk = 1

In [None]:
df = fact_events
# Flag events of users who have only ever used a single client
single_client = df.groupby('user_id')['client_id'].transform('nunique') == 1
desktop_only = df[single_client & (df['client_id'] == 'desktop')]
# Count distinct desktop-only users per company and keep the top company (ties included)
counts = desktop_only.groupby('customer_id')['user_id'].nunique()
counts[counts == counts.max()].index.to_list()

#### ID 2028

```Calculate the share of new and existing users for each month in the table. Output the month, share of new users, and share of existing users as a ratio. New users are defined as users who started using services in the current month (there is no usage history in previous months). Existing users are users who used services in current month, but they also used services in any previous month. Assume that the dates are all from the year 2020. HINT: Users are contained in user_id column```

In [None]:
%%sql
WITH started_month AS (SELECT time_id,
                    user_id,
                    EXTRACT(MONTH FROM time_id)                                      AS current_month,
                    EXTRACT(MONTH FROM MIN(time_id)
                                       OVER (PARTITION BY user_id ORDER BY time_id)) AS start_month
             FROM fact_events)
SELECT EXTRACT(MONTH FROM time_id) AS month,
       COUNT(DISTINCT user_id) FILTER (WHERE current_month = start_month) * 1.0 /
       COUNT(DISTINCT user_id)     AS share_new_users,
       COUNT(DISTINCT user_id) FILTER (WHERE current_month != start_month) * 1.0 /
       COUNT(DISTINCT user_id)     AS share_existing_users
FROM started_month
GROUP BY month
ORDER BY month;

In [None]:
df = fact_events
df['current_month'] = df['time_id'].dt.month
df['start_month'] = df.groupby('user_id')['time_id'].transform('min').dt.month
df = df[['user_id', 'current_month', 'start_month']].drop_duplicates()
df['is_new_user'] = (df['current_month'] == df['start_month']).astype(int)
df['is_old_user'] = (df['current_month'] != df['start_month']).astype(int)

df.groupby('current_month', as_index=False).agg(share_of_new_users=('is_new_user', 'mean'), share_of_old_users=('is_old_user', 'mean'))
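
The new versus existing definition can be verified by hand on a small example. This is a minimal sketch in plain Python, assuming invented `(user_id, month)` usage records rather than the real `fact_events` table: a user counts as new in the month of their first appearance and as existing in every later month in which they are active.

```python
from collections import defaultdict

# Hypothetical (user_id, month) usage records, invented for illustration only.
usage = [
    ("u1", 1), ("u2", 1),
    ("u1", 2), ("u3", 2),
    ("u1", 3), ("u2", 3), ("u3", 3), ("u4", 3),
]

# First month in which each user appears.
first_month = {}
for user, month in usage:
    first_month[user] = min(month, first_month.get(user, month))

# Distinct active users per month.
active = defaultdict(set)
for user, month in usage:
    active[month].add(user)

# (share of new users, share of existing users) per month.
shares = {}
for month in sorted(active):
    users = active[month]
    new = sum(1 for u in users if first_month[u] == month)
    shares[month] = (new / len(users), (len(users) - new) / len(users))

print(shares)  # {1: (1.0, 0.0), 2: (0.5, 0.5), 3: (0.25, 0.75)}
```

The two shares sum to 1 in every month because each active user falls into exactly one of the two groups.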

#### ID 2031

```Get list of signups which have a transaction start date earlier than 10 months ago from March 2021. For all of those users get the average transaction value and group it by the billing cycle. Your output should include the billing cycle, signup_id of the user, and average transaction amount. Sort your results by billing cycle in reverse alphabetical order and signup_id in ascending order.```

In [None]:
%%sql
SELECT p.billing_cycle, t.signup_id, AVG(amt) AS avg_amt
FROM transactions t
         JOIN signups s ON t.signup_id = s.signup_id
         JOIN plans p ON p.id = s.plan_id
WHERE t.transaction_start_date < '2021-03-01'::TIMESTAMP - INTERVAL '10 month'
GROUP BY p.billing_cycle, t.signup_id
ORDER BY p.billing_cycle DESC, t.signup_id ASC

In [None]:
df = pd.merge(pd.merge(transactions, signups, how='inner', on='signup_id'), plans, how='inner', left_on='plan_id',
              right_on='id')
# Cutoff is 10 months before 2021-03-01; pd.DateOffset avoids the extra dateutil import
cutoff = pd.to_datetime('2021-03-01') - pd.DateOffset(months=10)
df[df['transaction_start_date'] < cutoff].groupby(
    ['billing_cycle', 'signup_id'], as_index=False).agg(avg_amt=('amt', 'mean')).sort_values(
    ['billing_cycle', 'signup_id'], ascending=[False, True])
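
The cutoff date is easy to get wrong by a month, so it helps to compute it explicitly. The sketch below is plain Python and assumes, as the SQL expression `'2021-03-01'::TIMESTAMP - INTERVAL '10 month'` does, that "March 2021" means the first day of that month; `months_before` is a hypothetical helper written for this illustration.

```python
from datetime import date

def months_before(d, months):
    """Return the first day of the month that lies `months` months before d's month."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)

cutoff = months_before(date(2021, 3, 1), 10)
print(cutoff)  # 2020-05-01
```

Transactions that started before this date satisfy the "earlier than 10 months ago" condition.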

#### ID 2032

```Write a query that returns a table containing the number of signups for each weekday and for each billing cycle frequency. The day of the week standard we expect is from Sunday as 0 to Saturday as 6. Output the weekday number (e.g., 1, 2, 3) as rows in your table and the billing cycle frequency (e.g., annual, monthly, quarterly) as columns. If there are NULLs in the output replace them with zeroes.```

In [None]:
%%sql
SELECT EXTRACT(DOW FROM signup_start_date)                 AS weekday,
       COUNT(*) FILTER (WHERE billing_cycle = 'annual')    AS annual,
       COUNT(*) FILTER (WHERE billing_cycle = 'monthly')   AS monthly,
       COUNT(*) FILTER (WHERE billing_cycle = 'quarterly') AS quarterly
FROM signups s
         JOIN plans p ON p.id = s.plan_id
GROUP BY weekday
ORDER BY weekday

In [None]:
df = pd.merge(signups, plans, how='inner', left_on='plan_id', right_on='id')
# .dt.weekday is Monday=0..Sunday=6; shift it to the required Sunday=0..Saturday=6
df['weekday'] = (df['signup_start_date'].dt.weekday + 1) % 7
df.pivot_table(index='weekday', columns='billing_cycle', values='signup_id', aggfunc='nunique').reset_index().fillna(0)
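
The weekday numbering deserves a quick check, because PostgreSQL's `EXTRACT(DOW ...)` counts from Sunday as 0 while Python's `date.weekday()` (and pandas' `.dt.weekday`) counts from Monday as 0. A minimal sketch of the conversion in plain Python, where `sunday_zero` is a hypothetical helper name:

```python
from datetime import date

def sunday_zero(d):
    """Convert Python's Monday=0..Sunday=6 numbering to Sunday=0..Saturday=6."""
    return (d.weekday() + 1) % 7

sunday = date(2021, 3, 7)
monday = date(2021, 3, 8)
saturday = date(2021, 3, 13)

print(sunday.weekday(), sunday_zero(sunday))      # 6 0
print(monday.weekday(), sunday_zero(monday))      # 0 1
print(saturday.weekday(), sunday_zero(saturday))  # 5 6
```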

#### ID 2033

```Find the most profitable location. Write a query that calculates the average signup duration and average transaction amount for each location, and then compare these two measures together by taking the ratio of the average transaction amount and average duration for each location. Your output should include the location, average duration, average transaction amount, and ratio. Sort your results from highest ratio to lowest.```

In [None]:
%%sql
SELECT location,
       AVG(DISTINCT (signup_stop_date - signup_start_date))                  AS mean_duration,
       AVG(amt)                                                              AS mean_revenue,
       AVG(amt) * 1.0 / AVG(DISTINCT (signup_stop_date - signup_start_date)) AS ratio
FROM transactions t
         JOIN signups s ON t.signup_id = s.signup_id
GROUP BY location
ORDER BY AVG(amt) * 1.0 / AVG(DISTINCT (signup_stop_date - signup_start_date)) DESC

In [None]:
df = pd.merge(transactions, signups, how='inner', on='signup_id')
df['diff_signup_date'] = (df['signup_stop_date'] - df['signup_start_date']).dt.days
duration_df = df[['location', 'diff_signup_date']].drop_duplicates().groupby('location', as_index=False).agg(mean_duration=('diff_signup_date', 'mean'))
revenue_df = df.groupby('location', as_index=False).agg(mean_revenue=('amt', 'mean'))
result = pd.merge(duration_df, revenue_df, how='inner', on='location')
result['ratio'] = result['mean_revenue'] / result['mean_duration']
result.sort_values('ratio', ascending=False)

#### ID 2034

```You have been asked to calculate the average earnings per order segmented by a combination of weekday (all 7 days) and hour using the column customer_placed_order_datetime. You have also been told that the column order_total represents the gross order total for each order. Therefore, you'll need to calculate the net order total. The gross order total is the total of the order before adding the tip and deducting the discount and refund. Note: In your output, the day of the week should be represented in text format (i.e., Monday). Also, round earnings to 2 decimals```

In [None]:
%%sql
SELECT TO_CHAR(customer_placed_order_datetime, 'FMDay') AS weekday,
       EXTRACT(HOUR FROM customer_placed_order_datetime) AS hour,
       ROUND(AVG((order_total - discount_amount + tip_amount - refunded_amount)::numeric), 2) AS avg_earnings
FROM doordash_delivery
GROUP BY TO_CHAR(customer_placed_order_datetime, 'FMDay'), EXTRACT(HOUR FROM customer_placed_order_datetime);

In [None]:
df = doordash_delivery
df['order_value'] = df['order_total'] - df['discount_amount'] - df['refunded_amount'] + df['tip_amount']
df['hour'] = df['customer_placed_order_datetime'].dt.hour
df['weekday'] = df['customer_placed_order_datetime'].dt.strftime('%A').str.strip()
result_df = df.groupby(['weekday', 'hour'], as_index=False).agg(avg_earnings=('order_value', 'mean'))
result_df['avg_earnings'] = result_df['avg_earnings'].round(2)
result_df

#### ID 2035

```The company you work for has asked you to look into the average order value per hour during rush hours in the San Jose area. Rush hour is from 15H - 17H inclusive. You have also been told that the column order_total represents the gross order total for each order. Therefore, you'll need to calculate the net order total. The gross order total is the total of the order before adding the tip and deducting the discount and refund. Use the column customer_placed_order_datetime for your calculations.```

In [None]:
%%sql
SELECT EXTRACT(HOUR FROM customer_placed_order_datetime)                 AS order_hour,
       AVG(order_total - discount_amount + tip_amount - refunded_amount) AS avg_earnings
FROM delivery_details
WHERE delivery_region = 'San Jose'
  AND EXTRACT(HOUR FROM customer_placed_order_datetime) BETWEEN 15 AND 17
GROUP BY order_hour

In [None]:
df = delivery_details
df['order_hour'] = df['customer_placed_order_datetime'].dt.hour
df['order_value'] = df['order_total'] - df['refunded_amount'] - df['discount_amount'] + df['tip_amount']
df.query('delivery_region == "San Jose" & order_hour.between(15,17)').groupby('order_hour', as_index=False).agg(final_order_value=('order_value', 'mean'))

#### ID 2037

```You have been asked to investigate whether there is a correlation between the average total order value and the average time in minutes between placing an order and having it delivered per restaurant. You have also been told that the column order_total represents the gross order total for each order. Therefore, you'll need to calculate the net order total. The gross order total is the total of the order before adding the tip and deducting the discount and refund.```

In [None]:
%%sql
WITH sq AS (SELECT restaurant_id,
                   AVG(order_total + tip_amount - discount_amount -
                       refunded_amount) AS avg_gross_order,
                   AVG(EXTRACT(EPOCH
                               FROM (delivered_to_consumer_datetime -
                                     customer_placed_order_datetime)) /
                       60)              AS avg_diff_time
            FROM delivery_details
            GROUP BY restaurant_id)
SELECT CORR(avg_diff_time, avg_gross_order)
FROM sq

In [None]:
df = delivery_details
df['gross_order'] = df['order_total'] - df['discount_amount'] - df['refunded_amount'] + df['tip_amount']

df['diff_time'] = (df['delivered_to_consumer_datetime'] - df['customer_placed_order_datetime']) / pd.Timedelta(minutes=1)

result_df = df.groupby('restaurant_id', as_index=False).agg(avg_gross_order=('gross_order', 'mean'), avg_diff_time=('diff_time', 'mean'))

result_df['avg_gross_order'].corr(result_df['avg_diff_time'])
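
Both `CORR` in PostgreSQL and `Series.corr` with its default method compute the Pearson correlation coefficient. As a minimal sketch of what that number is, the plain-Python function below applies the textbook formula to invented per-restaurant averages; `pearson` is a hypothetical helper written for this illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-restaurant averages, invented for illustration only.
avg_order = [10.0, 20.0, 30.0, 40.0]
avg_minutes = [30.0, 35.0, 40.0, 45.0]
avg_minutes_reversed = list(reversed(avg_minutes))

r_up = pearson(avg_order, avg_minutes)
r_down = pearson(avg_order, avg_minutes_reversed)
print(round(r_up, 6), round(r_down, 6))
```

A perfectly increasing linear relationship gives a value close to 1 and a perfectly decreasing one gives a value close to -1; real data will usually land somewhere in between.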

#### ID 2039

```Find the number of unique transactions and total sales for each of the product categories in 2017. Output the product categories, number of transactions, and total sales in descending order. The sales column represents the total cost the customer paid for the product so no additional calculations need to be done on the column. Only include product categories that have products sold.```

In [None]:
%%sql
SELECT product_category, COUNT(DISTINCT transaction_id), SUM(sales) AS sales
FROM wfm_transactions AS wt
         JOIN wfm_products AS wp ON wt.product_id = wp.product_id
WHERE EXTRACT(YEAR FROM transaction_date) = 2017
GROUP BY product_category
ORDER BY sales DESC

In [None]:
df = pd.merge(wfm_transactions, wfm_products, on='product_id', how='inner')
df['year'] = df['transaction_date'].dt.year
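# Sort by total sales descending, as the prompt requires
df.query('year == 2017').groupby('product_category', as_index=False).agg(
    transactions=('transaction_id', 'nunique'), sales=('sales', 'sum')).sort_values('sales', ascending=False)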