________
## Final Data Processing

Want to understand the full experimental design? Here's the process; we apologize in advance.
________

#### **Part 1: Definitions**

This notebook walks through the final data processing for identifying "valid observations" that can be used in the final regression analysis. A "valid observation" is defined as a *collection of posts* from a single profile that matches the required pattern for either a control observation or a treatment observation. The following definitions will be used throughout the notebook and are the basis for this causal experimental design.

- **Viral Post**: A single post whose (square root transformed) engagement exceeds 3 standard deviations from the profile mean engagement.

- **Baseline**: A series of `w` consecutive posts where the (square root transformed) engagement for every post falls within 2 standard deviations of the profile mean. A baseline is expected to represent a level of consistent engagement for the profile's posts.

- **Window (`w`)**: Required number of consecutive posts to define a valid "baseline". The window is varied between 5 and 40 posts to perform a sensitivity analysis on the consistency of the results. 

- **Treatment Observation**: A treatment observation is one or more viral posts surrounded by a baseline before and after the instance(s) of virality. Data from each baseline before and after the viral post(s) will be submitted as individual post data to the regression analysis. For example, if the window `w` is equal to 20 posts, a "valid treatment observation" with a single viral post between baselines would submit 40 samples of data to the regression (20 pre-baseline posts and 20 post-baseline posts; the viral post itself is not submitted). However, in total, all 40 samples would be evaluating the effect of a single observation of virality. Hence, we call it a single treatment observation.

- **Control Observation**: If a treatment observation is virality sandwiched between two baselines, a control observation is the same pattern with no viral posts between the two baselines. In other words, two consecutive baselines stitched together for a total baseline length of `2*w`.
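
As a minimal sketch of the viral and baseline thresholds defined above, consider a toy engagement series. The numbers and variable names below are made up for illustration and are not taken from the dataset; the baseline flag checks only the upper threshold, which is how the notebook's `calculate_thresholds()` function appears to implement it.

```python
import numpy as np

# Hypothetical engagement counts for one profile (illustrative only)
engagement = np.array([100, 110, 95, 105, 98, 102, 97, 104, 101, 99,
                       103, 96, 108, 100, 5000], dtype=float)

# Square-root transform, then standardize against the profile's own mean and stdev
sqrt_eng = np.sqrt(engagement)
z = (sqrt_eng - sqrt_eng.mean()) / sqrt_eng.std(ddof=1)

is_viral = z >= 3      # at or above 3 standard deviations over the profile mean
is_baseline = z <= 2   # at or below the 2-standard-deviation threshold

print(np.flatnonzero(is_viral))  # positions of the viral posts
```

Only the final post is far enough above the profile mean to count as viral; every other post is at baseline level.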

To minimize the chances of observing seasonality effects for a particular profile (say a brand known for summer-time clothing that only experiences virality during the summer and only experiences control observations during winter), treatment observations and control observations are taken from pairs of companies matched on followers, engagement levels, and other profile-level evaluation metrics. By splitting companies into treatment and control, we also prevent potential data leakage from residual effects of treatment data found in control (or vice versa). Company matching is explained in another notebook. At this point we are provided a list of 10 matched pairs of control and treatment profiles, from which we'll search for valid control and treatment observations.
_______


#### **Part 2: Extracting Control and Treatment Observations**

The objective of this notebook is to search each treatment profile's engagement data for valid instances of the "baseline-viral-baseline" pattern at each window size `w`. For this explanation we'll use a window size of `w=20` posts. The high-level process and associated functions are described below.


>1. Sort Data by 'Date' and calculate appropriate standard deviation thresholds.
>    
>    Functions:
>    - sort_data()
>    - calculate_thresholds()

*Treatment Process*:

>2. Find and store indices of viral posts.
>    
>    Functions:
>    - find_virality()

>3. Find valid pairs of baselines (pre- and post- baseline with a gap in between). Overlapping data is not permitted - each instance of virality must have a unique pre and post baseline. Therefore, for treatment, the first and last identified baseline for a particular profile can be `w=20` posts in length, but any baseline in the middle must be twice as long to adequately represent a *post-baseline* for the previous virality and a *pre-baseline* for the subsequent virality. To maintain consistency between measuring baselines for treatment and control, the baselines are limited to window size `w=20` posts. Therefore, if a profile starts the year with 100 consecutive posts at baseline levels before a viral post, the pre-baseline data will include only the last `w=20` posts before virality. This enables the "control observation" (of length `2*w`) to have the same size pre-baseline and post-baseline as the treatment observations.
>
>    Functions:
>    - find_treatment_baselines()
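
The baseline search in step 3 can be sketched on a toy series. This is a simplified illustration, not the notebook's `find_treatment_baselines()` function: the flags, the small window `w = 3`, and the `runs` merging loop are assumptions made for the example.

```python
import pandas as pd

w = 3  # small window for illustration; the notebook uses 5 to 40

# Hypothetical per-post flags: 1 when the post is at baseline level, 0 otherwise
under_2sd = pd.Series([1, 1, 1, 1, 0, 1, 1, 1, 0, 1])

# A position is True when it and the previous w-1 posts are all at baseline level
closes_window = under_2sd.rolling(window=w, min_periods=w).sum() == w

# Each True position i marks a full window covering posts i-(w-1) .. i
windows = [(int(i) - (w - 1), int(i)) for i in closes_window[closes_window].index]

# Merge overlapping windows into maximal baseline runs
runs = []
for start, end in windows:
    if runs and start <= runs[-1][1]:
        runs[-1] = (runs[-1][0], end)
    else:
        runs.append((start, end))

# Cap each side at w posts: last w posts of the earlier run, first w of the later run
pre_baseline = (runs[0][1] - (w - 1), runs[0][1])
post_baseline = (runs[1][0], runs[1][0] + (w - 1))
print(runs, pre_baseline, post_baseline)
```

Here the first run spans four posts, but only its last three are kept as the pre-baseline, mirroring the rule that a long run before virality contributes only its final `w` posts.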

>4. For each pair of pre- and post-baselines, check whether any viral posts fall between them. If there is at least one viral post, store the indices of the pre-baseline and post-baseline together with the number of viral posts as a single treatment observation.
>
>    Functions:
>    - find_valid_treatment_obs()
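
Step 4 amounts to counting the viral indices that fall strictly between the end of the pre-baseline and the start of the post-baseline. A minimal sketch with made-up indices (the values below are hypothetical, with `w = 10`):

```python
# Hypothetical positions in the date-sorted post list
viral_posts = [12, 13, 60]
baseline_pairs = [
    ((2, 11), (14, 23)),   # posts 12 and 13 sit in the gap
    ((24, 33), (36, 45)),  # posts 34 and 35 sit in the gap, neither is viral
]

observations = []
for (pre_start, pre_end), (post_start, post_end) in baseline_pairs:
    # Count viral posts strictly between the two baselines
    viral_count = sum(pre_end < v < post_start for v in viral_posts)
    if viral_count > 0:
        observations.append(((pre_start, pre_end), (post_start, post_end), viral_count))

print(observations)  # → [((2, 11), (14, 23), 2)]
```

The second pair is dropped because its gap contains no viral post, so it is not a valid treatment observation.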

>5. Using the indices stored for the starts and ends of the treatment observation baselines, extract these rows of data (individual posts) from the main dataset and tag them with a binary indicator for before or after virality (`Time_Frame = 0/1`) along with the number of viral posts associated with the baselines (`Treatment = V`). By using the number of viral posts between baselines, we quantify the treatment intensity and derive a *per-viral-post* treatment effect from the eventual interaction term coefficient between `Time_Frame` and `Treatment`.
>
>    Functions:
>    - extract_observations()
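
The tagging in step 5 suggests a difference-in-differences style specification. As a sketch only, assuming a linear model (the coefficient symbols $\beta_k$, the control vector $X_i$, and the outcome $y_i$ are assumptions for illustration; the actual regression specification is not shown in this section):

$$
y_i = \beta_0 + \beta_1\,\mathrm{TimeFrame}_i + \beta_2\,\mathrm{Treatment}_i + \beta_3\,(\mathrm{TimeFrame}_i \times \mathrm{Treatment}_i) + \gamma^{\top} X_i + \varepsilon_i
$$

Under this reading, $\beta_3$ is the change in post-baseline engagement associated with each additional viral post, while $\beta_1$ captures the pre-to-post change when `Treatment = 0`.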

The process is repeated for identifying the control observations, but instead of searching for isolated baselines with unique strings of `w=20` posts, we search for consecutive baselines of `2w=40` posts, then split the single baseline into a simulated *pre-baseline* and *post-baseline*, both of length `w=20` posts. These control observations will control for time-based natural fluctuations in engagement and represent "a world without virality" for the matched treatment company. As before, the time frame is set to `Time_Frame = 0/1` and the treatment is set to `Treatment = 0`.
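
The control split can be sketched as follows. The helper name `split_control_run` is hypothetical; it mirrors the idea of carving a long baseline run into non-overlapping pre/post pairs of length `w`.

```python
def split_control_run(start, end, w):
    """Split one baseline run [start, end] into non-overlapping (pre, post) pairs of length w."""
    pairs = []
    n_pairs = (end - start + 1) // (2 * w)  # how many full 2*w blocks fit in the run
    for k in range(n_pairs):
        pre_start = start + k * 2 * w
        pre = (pre_start, pre_start + w - 1)
        post = (pre[1] + 1, pre[1] + w)
        pairs.append((pre, post))
    return pairs


# A run of 100 baseline posts with w=20 yields two control observations
print(split_control_run(0, 99, 20))
# → [((0, 19), (20, 39)), ((40, 59), (60, 79))]
```

The final 20 posts of the run are left unused because they cannot form a complete `2*w` block.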

At this point we have two dataframes: Treatment DataFrame and Control DataFrame - observations extracted from their respective treatment/control matched company pairs.

______
#### **Part 3: Matching... Again?**

Because we chose to match companies together into control and treatment groups, we end up with different quantities of treatment and control observations. Not only is the number different, but the distribution of valid observations across matched company pairs is different for control companies and treatment companies. In order to fix this discrepancy, we need to match again, but this time match treatment observations directly to control observations within the matched company pairs. 

Descriptively, this means we can only use the minimum number of observations found between the control and treatment company of each matched pair. By doing this, we ensure the same number of total control and treatment observations for the regression and we ensure the control group and treatment group have the same distribution of company type/size representation. 
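
A minimal sketch of this per-pair minimum, using made-up company names and observation counts (none of these values come from the dataset):

```python
# Hypothetical observation counts per company (illustrative only)
treatment_obs = {"brand_a": 5, "brand_b": 0, "brand_c": 2}
control_obs = {"brand_x": 3, "brand_y": 4, "brand_z": 9}
matched_pairs = [("brand_a", "brand_x"), ("brand_b", "brand_y"), ("brand_c", "brand_z")]

# Each pair can only contribute as many observations as its scarcer side provides
usable = {pair: min(treatment_obs[pair[0]], control_obs[pair[1]]) for pair in matched_pairs}

print(usable)
# → {('brand_a', 'brand_x'): 3, ('brand_b', 'brand_y'): 0, ('brand_c', 'brand_z'): 2}
print(sum(usable.values()))  # → 5 observations on each side
```

A pair where either company has no valid observations drops out entirely, which is why some matched pairs contribute nothing at larger window sizes.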

Unfortunately, this method reduces the number of valid observations even further than what we've already pruned. But we believe this step was necessary to ensure balanced representation between the treatment and control groups.

________
#### **Part 4: Merging**

The final step in this notebook is to merge the final dataset with additional control variables that will be used in the regression.

- Month - Month that it was posted
- Post type - Instagram TV, Reel, Photo
- Number of Posts - Company based metric to define the posting frequency for the company
- Number of Followers - Company based metric to define the size of the profile

When all the data is merged, a single CSV file is output with all treatment and control observations. The steps in Parts 1-4 are repeated for each window size in `window_sizes` (5, 10, 20, 30, and 40 posts).
_____

## Functions

In [2]:
def sort_data(df):
    # Parse dates, sort chronologically, and reset to a clean 0..n-1 index
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.sort_values(by='Date').reset_index(drop=True)
    return df

In [3]:
def find_treatment_baselines(df, base_col, w):

    #Copy and shift the baseline column down one space
    shifted = df[base_col].shift(1)

    # Find all baseline starts
    baseline_starts = (df[base_col] - shifted)==1

    # Find all baseline ends
    baseline_ends = (df[base_col] - shifted)==-1

    # Adjust start based on window size
    starts = [ind - (w-1) for ind in baseline_starts[baseline_starts].index]

    # Adjust end to exclude the post that broke the baseline
    ends = [ind - 1 for ind in baseline_ends[baseline_ends].index]

    # If the last index is still part of a baseline, end the baseline at the last index.
    last_ind = len(df[base_col]) - 1
    if len(starts) > len(ends):
        ends.append(last_ind)

    # Create list of tuples for baseline starts and ends
    baselines = [(starts[i], ends[i]) for i in range(len(starts))]

    # Prevent overlapping pre-baselines and post-baselines
    # If there are less than 2 baselines, we cannot establish a pre-viral and post-viral baseline
    if len(baselines) < 2:
        return []
    
    else:
        first_baseline = [baselines[0]]
        last_baseline = [baselines[-1]]
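        # Middle baselines serve as both a post-baseline and a pre-baseline, so they must be
        # about twice the window; t[1] - t[0] >= 2*w means a run of at least 2*w + 1 posts
        b2b_baselines = [t for i, t in enumerate(baselines) if 0 < i < len(baselines)-1 and t[1] - t[0] >= 2*w]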
    
    # Final list of acceptable baselines
    baselines = first_baseline + b2b_baselines + last_baseline

    # Modify baselines to be capped at the window size length
    final_baselines = []

    for i in range(len(baselines)-1):

        _, pre_base_end = baselines[i]
        pre_base_st = pre_base_end - (w-1)
        pre_base = (pre_base_st, pre_base_end)

        post_base_st, _ = baselines[i+1]
        post_base_end = post_base_st + (w-1)
        post_base = (post_base_st, post_base_end)

        final_baselines.append((pre_base, post_base))

    return final_baselines


In [4]:
def find_virality(df, viral_col):
    # viral_col is the binary viral indicator column, e.g. 'sqrt_over_3SD'
    viral_cond = (df[viral_col] == 1)
    viral_posts = viral_cond[viral_cond].index.tolist()
    return viral_posts

In [5]:
def find_valid_treatment_obs(df, window_sizes):
    # Iterate through the different consecutive posts required for baseline
    results = {}
    for w in window_sizes:
        results[w] = {}
        baseline_column = f'baseline_{w}'
        baseline_eval_column = 'sqrt_under_2SD'
        viral_column = 'sqrt_over_3SD'

        # Create binary indicator column for baseline
        df[baseline_column] = df[baseline_eval_column].rolling(window=w, min_periods=w).sum() == w

        # Find indices for viral posts
        viral_posts = find_virality(df, viral_column)

        # Return (pre-baseline, post-baseline) index tuples for all baselines of this company
        baselines = find_treatment_baselines(df, baseline_column, w)

        # Set up variables for number of (valid) observations for this company (baseline-viral-baseline combos)
        results[w]["valid_obs"] = []
        results[w]["num_valid_obs"] = 0

        # Calculate number of viral posts between baselines
        valid_obs_temp = []
        for b in baselines:
            (start_lower, end_lower), (start_upper, end_upper) = b
            viral_count = sum([1 for num in viral_posts if end_lower < num < start_upper])
            obs = ((start_lower, end_lower), (start_upper, end_upper), viral_count)

            # if viral post is found between baselines, observation is valid and can be used in regression
            if viral_count > 0:
                valid_obs_temp.append(obs)
        
        # Save valid observations associated with current window size
        results[w]["valid_obs"] = valid_obs_temp
        results[w]["num_valid_obs"] = len(valid_obs_temp)
        
    return results


In [6]:
def extract_observations(valid_obs, df, treatment):
    
    new_observations = pd.DataFrame(columns=list(df.columns))
    
    for obs in valid_obs:

        (lower_base_st, lower_base_end), (upper_base_st, upper_base_end), vposts = obs
        lower_base = df.loc[lower_base_st:lower_base_end].copy()
        upper_base = df.loc[upper_base_st:upper_base_end].copy()

        # Tag each row: Time_Frame is 0 before / 1 after; Treatment is the number of
        # viral posts between the baselines (0 for control observations)
        treatment_value = vposts if treatment else 0
        lower_base['Time_Frame'] = 0
        lower_base['Treatment'] = treatment_value
        upper_base['Time_Frame'] = 1
        upper_base['Treatment'] = treatment_value

        # Concatenate baselines to final dataframes 
        new_observations = pd.concat([new_observations, lower_base, upper_base])

    return new_observations

In [7]:
def calculate_thresholds(df):
    # Square Root Engagement
    df['sqrt_engagement'] = np.sqrt(df['Engagement'])

    # Z-Score Transformation
    df['sqrt_engagement_zscore'] = (df['sqrt_engagement'] - df['sqrt_engagement'].mean()) / df['sqrt_engagement'].std()
    #df['z_score_mean'] = df['sqrt_engagement_zscore'].mean()
    #df['z_score_std'] = df['sqrt_engagement_zscore'].std()

    # Engagement Z-Score (untransformed engagement)
    df['engagement_zscore'] = (df['Engagement'] - df['Engagement'].mean()) / df['Engagement'].std()

    # Mean and standard deviation across the entire year.
    # NOTE: despite the 'sqrt_' prefix, these columns (and the thresholds and indicators
    # below) are computed from 'engagement_zscore', not from 'sqrt_engagement_zscore'.
    df['sqrt_mean'] = df['engagement_zscore'].mean()
    df['sqrt_stdev'] = df['engagement_zscore'].std()

    # 2SD and 3SD thresholds
    df['3sd_sqrt'] = df['sqrt_mean'] + df['sqrt_stdev']*3
    df['2sd_sqrt'] = df['sqrt_mean'] + df['sqrt_stdev']*2

    # Binary indicator for post at or below the +2SD threshold (baseline)
    df['sqrt_under_2SD'] = df['engagement_zscore'] <= df['2sd_sqrt']

    # Binary indicator for post at or above the +3SD threshold (viral)
    df['sqrt_over_3SD'] = df['engagement_zscore'] >= df['3sd_sqrt']

    return df

In [8]:
def find_control_baselines(df, base_col, w):

    #Copy and shift the baseline column down one space
    shifted = df[base_col].shift(1)

    # Find all baseline starts
    baseline_starts = (df[base_col] - shifted)==1

    # Find all baseline ends
    baseline_ends = (df[base_col] - shifted)==-1

    # Adjust start based on window size
    starts = [ind - (w-1) for ind in baseline_starts[baseline_starts].index]

    # Adjust end to exclude the post that broke the baseline
    ends = [ind - 1 for ind in baseline_ends[baseline_ends].index]

    # If the last index is still part of a baseline, end the baseline at the last index.
    last_ind = len(df[base_col]) - 1
    if len(starts) > len(ends):
        ends.append(last_ind)

    # Create list of tuples for baseline starts and ends
    baselines = [(starts[i], ends[i]) for i in range(len(starts))]
    final_baselines = []
    for b in baselines:
        start_ind, end_ind = b
        divisions = (end_ind - start_ind + 1)//(2*w)
        for _ in range(divisions):
            pre_base_start = start_ind
            pre_base_end = start_ind + w - 1
            pre_base = (pre_base_start, pre_base_end)
            post_base_start = pre_base_end + 1
            post_base_end = post_base_start + w - 1
            post_base = (post_base_start, post_base_end)

            final_baselines.append((pre_base, post_base))
            start_ind = post_base_end+1

    return final_baselines


In [9]:
def find_valid_control_obs(df, window_sizes):
    # Iterate through the different consecutive posts required for baseline
    results = {}
    for w in window_sizes:
        results[w] = {}
        baseline_column = f'baseline_{w}'
        baseline_eval_column = 'sqrt_under_2SD'
        viral_column = 'sqrt_over_3SD'

        # Create binary indicator column for baseline
        df[baseline_column] = df[baseline_eval_column].rolling(window=w, min_periods=w).sum() == w

        # Find indices for viral posts
        viral_posts = find_virality(df, viral_column)

        # Return (pre-baseline, post-baseline) index tuples for all baselines of this company
        baselines = find_control_baselines(df, baseline_column, w)

        # Set up variables for number of (valid) observations for this company (baseline-baseline combos)
        results[w]["valid_obs"] = []
        results[w]["num_valid_obs"] = 0

        # Count viral posts between the two halves (always 0 here because the halves are adjacent)
        valid_obs_temp = []
        for b in baselines:
            (start_lower, end_lower), (start_upper, end_upper) = b
            viral_count = sum([1 for num in viral_posts if end_lower < num < start_upper])
            obs = ((start_lower, end_lower), (start_upper, end_upper), viral_count)

            # all baselines returned from the find_control_baselines function are valid
            valid_obs_temp.append(obs)
        
        # Save valid observations associated with current window size
        results[w]["valid_obs"] = valid_obs_temp
        results[w]["num_valid_obs"] = len(valid_obs_temp)
        
    return results


_____
## Data Pipeline

In [10]:
import pandas as pd
import numpy as np
import warnings

# Set pandas dataframe max columns/rows
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [11]:
# Set up final treatment and control dataframes for different window sizes

treatment3df = pd.DataFrame()
treatment5df = pd.DataFrame()
treatment10df = pd.DataFrame()
treatment20df = pd.DataFrame()
treatment30df = pd.DataFrame()
treatment40df = pd.DataFrame()
treatment50df = pd.DataFrame()
treatment_dfs = {
    3: treatment3df,
    5: treatment5df,
    10: treatment10df,
    20: treatment20df,
    30: treatment30df,
    40: treatment40df,
    50: treatment50df
}

control3df = pd.DataFrame()
control5df = pd.DataFrame()
control10df = pd.DataFrame()
control20df = pd.DataFrame()
control30df = pd.DataFrame()
control40df = pd.DataFrame()
control50df = pd.DataFrame()
control_dfs = {
    3: control3df,
    5: control5df,
    10: control10df,
    20: control20df,
    30: control30df,
    40: control40df,
    50: control50df
}


# Set up treatment and control companies
treatment_companies = ['adidasoriginals', 'allbirds', 'asos', 'everlane', 'ganni', 'goat', 'lunya', 'pomelofashion', 'prettylittlething', 'fashionphile']
control_companies = ['vinted', 'urbanic', 'stockx', 'lenskart', 'mejuri', 'veja', 'rixo', 'printful', 'skims', 'primark']

# Set up windows sizes
window_sizes = [5,10,20,30,40]
window_sizes_str = [str(w) for w in window_sizes]

# Set up dataframes to track posts used in the regression per company (PPC)
summarization_metrics = ['total_posts', 'total_companies_contributed','contribution_mean', 'contribution_stdev']
treatment_PPC = pd.DataFrame(columns = treatment_companies + summarization_metrics, index = window_sizes_str)
control_PPC = pd.DataFrame(columns = control_companies + summarization_metrics, index = window_sizes_str)


# Set up dataframe to describe company contributions
matched_companies = [f'{treatment_companies[i]} : {control_companies[i]}' for i in range(len(treatment_companies))]
matched_contribution_ratio = pd.DataFrame(columns = matched_companies, index = window_sizes_str)


**Extract Treatment Data**

In [12]:
# Read all data csv
file_path = "data/SocialInsider_Instagram_posts.csv"
main_df = pd.read_csv(file_path)

for company in treatment_companies:
    df = main_df[main_df['Profile'] == company].copy()
    df = sort_data(df)
    df = calculate_thresholds(df) # 2/3 standard deviation levels
    valid_observations = find_valid_treatment_obs(df, window_sizes) # baseline-viral-baseline

    for w in window_sizes:

        valid_obs = valid_observations[w]['valid_obs']
        new_observations = extract_observations(valid_obs=valid_obs, df=df, treatment=True)
        treatment_PPC.loc[str(w), company] = len(new_observations)
        treatment_dfs[w] = pd.concat([treatment_dfs[w], new_observations], ignore_index=True)

# Create a dataframe for number of posts that can be used for each combination company and window size
treatment_subset_df = treatment_PPC.iloc[:, 0:10]
treatment_PPC['total_companies_contributed'] = treatment_subset_df.apply(lambda row: row[row != 0].count(), axis=1)
treatment_PPC['total_posts'] = treatment_subset_df.apply(lambda row: row[row != 0].sum(), axis=1)
treatment_PPC['contribution_mean'] = treatment_subset_df.mean(axis=1)
treatment_PPC['contribution_stdev'] = treatment_subset_df.std(axis=1)


In [13]:
treatment_PPC

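| window_size | adidasoriginals | allbirds | asos | everlane | ganni | goat | lunya | pomelofashion | prettylittlething | fashionphile | total_posts | total_companies_contributed | contribution_mean | contribution_stdev |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 10 | 30 | 150 | 10 | 60 | 10 | 60 | 80 | 110 | 10 | 530 | 10 | 53.0 | 48.773855 |
| 10 | 20 | 60 | 220 | 20 | 40 | 20 | 60 | 160 | 180 | 20 | 800 | 10 | 80.0 | 76.594169 |
| 20 | 40 | 120 | 240 | 40 | 40 | 40 | 40 | 80 | 200 | 40 | 880 | 10 | 88.0 | 74.951836 |
| 30 | 60 | 120 | 180 | 60 | 60 | 60 | 60 | 60 | 240 | 60 | 960 | 10 | 96.0 | 64.498062 |
| 40 | 80 | 160 | 160 | 80 | 80 | 80 | 80 | 0 | 320 | 80 | 1120 | 9 | 112.0 | 85.997416 |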


**Extract Control Data**

In [14]:
# Read all data csv
file_path = "data/SocialInsider_Instagram_posts.csv"
main_df = pd.read_csv(file_path)

for company in control_companies:
    df = main_df[main_df['Profile'] == company].copy()
    df = sort_data(df)
    df = calculate_thresholds(df)
    valid_observations = find_valid_control_obs(df, window_sizes)
    
    for w in window_sizes:

        valid_obs = valid_observations[w]['valid_obs']
        new_observations = extract_observations(valid_obs=valid_obs, df=df, treatment=False)
        control_PPC.loc[str(w), company] = len(new_observations)
        control_dfs[w] = pd.concat([control_dfs[w], new_observations], ignore_index=True)
    
control_subset_df = control_PPC.iloc[:, 0:10]
control_PPC['total_companies_contributed'] = control_subset_df.apply(lambda row: row[row != 0].count(), axis=1)
control_PPC['total_posts'] = control_subset_df.apply(lambda row: row[row != 0].sum(), axis=1)
control_PPC['contribution_mean'] = control_subset_df.mean(axis=1)
control_PPC['contribution_stdev'] = control_subset_df.std(axis=1)

In [15]:
control_PPC

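| window_size | vinted | urbanic | stockx | lenskart | mejuri | veja | rixo | printful | skims | primark | total_posts | total_companies_contributed | contribution_mean | contribution_stdev |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 110 | 40 | 220 | 810 | 370 | 260 | 740 | 200 | 1230 | 890 | 4870 | 10 | 487.0 | 400.501075 |
| 10 | 80 | 20 | 180 | 780 | 340 | 160 | 560 | 200 | 1120 | 780 | 4220 | 10 | 422.0 | 368.836007 |
| 20 | 40 | 0 | 40 | 720 | 320 | 40 | 160 | 160 | 920 | 600 | 3000 | 9 | 300.0 | 330.521473 |
| 30 | 60 | 0 | 60 | 720 | 120 | 0 | 180 | 180 | 720 | 360 | 2400 | 8 | 240.0 | 274.226184 |
| 40 | 0 | 0 | 0 | 640 | 160 | 0 | 0 | 160 | 640 | 320 | 1920 | 5 | 192.0 | 259.092433 |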


___________
**Minimum Usable Posts based on observation matched pairs**

In [16]:
df1 = control_subset_df
df2 = treatment_subset_df
min_df = pd.DataFrame(np.minimum(df1.values, df2.values), columns=matched_companies, index=df1.index)
min_copy = min_df.copy()
min_df['total_companies_contributed'] = min_copy.apply(lambda row: row[row != 0].count(), axis=1)
min_df['total_posts'] = min_copy.apply(lambda row: row[row != 0].sum(), axis=1)

min_df

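| window_size | adidasoriginals : vinted | allbirds : urbanic | asos : stockx | everlane : lenskart | ganni : mejuri | goat : veja | lunya : rixo | pomelofashion : printful | prettylittlething : skims | fashionphile : primark | total_companies_contributed | total_posts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 10 | 30 | 150 | 10 | 60 | 10 | 60 | 80 | 110 | 10 | 10 | 530 |
| 10 | 20 | 20 | 180 | 20 | 40 | 20 | 60 | 160 | 180 | 20 | 10 | 720 |
| 20 | 40 | 0 | 40 | 40 | 40 | 40 | 40 | 80 | 200 | 40 | 9 | 560 |
| 30 | 60 | 0 | 60 | 60 | 60 | 0 | 60 | 60 | 240 | 60 | 8 | 660 |
| 40 | 0 | 0 | 0 | 80 | 80 | 0 | 0 | 0 | 320 | 80 | 4 | 560 |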


________
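**How many viral observations will be evaluated for each company/window_size combination?**

Each observation contributes `2*w` posts, so dividing the usable post counts by `2*w` gives the number of observations. A treatment observation can contain more than one viral post, so `total_viral_posts` below counts observations rather than individual viral posts.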

In [18]:
# Each observation contributes 2*w posts, so posts / (2*w) is the number of observations
total_viral_df = min_copy.div(window_sizes, axis=0)
total_viral_df = total_viral_df/2
total_viral_df['total_viral_posts'] = total_viral_df.apply(lambda row: row[row != 0].sum(), axis=1)
total_viral_df

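| window_size | adidasoriginals : vinted | allbirds : urbanic | asos : stockx | everlane : lenskart | ganni : mejuri | goat : veja | lunya : rixo | pomelofashion : printful | prettylittlething : skims | fashionphile : primark | total_viral_posts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 1.0 | 3.0 | 15.0 | 1.0 | 6.0 | 1.0 | 6.0 | 8.0 | 11.0 | 1.0 | 53.0 |
| 10 | 1.0 | 1.0 | 9.0 | 1.0 | 2.0 | 1.0 | 3.0 | 8.0 | 9.0 | 1.0 | 36.0 |
| 20 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 | 1.0 | 14.0 |
| 30 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 4.0 | 1.0 | 11.0 |
| 40 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 7.0 |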


Convert the counts to control and treatment dictionaries. The same values can be reused for both because each matched pair contributes the same number of observations on each side.

In [19]:
control_num_baselines = total_viral_df.iloc[:,0:10].copy()
control_num_baselines.columns = df1.columns
control_num_baselines.index = df1.index
ctrl_dict = control_num_baselines.to_dict()
ctrl_dict

{'vinted': {'5': 1.0, '10': 1.0, '20': 1.0, '30': 1.0, '40': 0.0},
 'urbanic': {'5': 3.0, '10': 1.0, '20': 0.0, '30': 0.0, '40': 0.0},
 'stockx': {'5': 15.0, '10': 9.0, '20': 1.0, '30': 1.0, '40': 0.0},
 'lenskart': {'5': 1.0, '10': 1.0, '20': 1.0, '30': 1.0, '40': 1.0},
 'mejuri': {'5': 6.0, '10': 2.0, '20': 1.0, '30': 1.0, '40': 1.0},
 'veja': {'5': 1.0, '10': 1.0, '20': 1.0, '30': 0.0, '40': 0.0},
 'rixo': {'5': 6.0, '10': 3.0, '20': 1.0, '30': 1.0, '40': 0.0},
 'printful': {'5': 8.0, '10': 8.0, '20': 2.0, '30': 1.0, '40': 0.0},
 'skims': {'5': 11.0, '10': 9.0, '20': 5.0, '30': 4.0, '40': 4.0},
 'primark': {'5': 1.0, '10': 1.0, '20': 1.0, '30': 1.0, '40': 1.0}}

In [20]:
treat_num_baselines = total_viral_df.iloc[:,0:10].copy()
treat_num_baselines.columns = df2.columns
treat_num_baselines.index = df2.index
treat_dict = treat_num_baselines.to_dict()
treat_dict

{'adidasoriginals': {'5': 1.0, '10': 1.0, '20': 1.0, '30': 1.0, '40': 0.0},
 'allbirds': {'5': 3.0, '10': 1.0, '20': 0.0, '30': 0.0, '40': 0.0},
 'asos': {'5': 15.0, '10': 9.0, '20': 1.0, '30': 1.0, '40': 0.0},
 'everlane': {'5': 1.0, '10': 1.0, '20': 1.0, '30': 1.0, '40': 1.0},
 'ganni': {'5': 6.0, '10': 2.0, '20': 1.0, '30': 1.0, '40': 1.0},
 'goat': {'5': 1.0, '10': 1.0, '20': 1.0, '30': 0.0, '40': 0.0},
 'lunya': {'5': 6.0, '10': 3.0, '20': 1.0, '30': 1.0, '40': 0.0},
 'pomelofashion': {'5': 8.0, '10': 8.0, '20': 2.0, '30': 1.0, '40': 0.0},
 'prettylittlething': {'5': 11.0, '10': 9.0, '20': 5.0, '30': 4.0, '40': 4.0},
 'fashionphile': {'5': 1.0, '10': 1.0, '20': 1.0, '30': 1.0, '40': 1.0}}

________
**Selected matched observations for control and treatment**

Re-run the previous code, but limit the valid observations for each company to the maximum allowable number identified in the dictionaries above.
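
The limiting step can be sketched as a seeded random draw without replacement. The observation tuples and `n_allowed` below are hypothetical stand-ins for one company's `valid_obs` list and its dictionary entry.

```python
import random

random.seed(42)  # fixed seed so the same subset is drawn on every run

# Hypothetical (pre-baseline, post-baseline, viral_count) tuples for one company
valid_obs = [
    ((0, 19), (21, 40), 1),
    ((41, 60), (63, 82), 2),
    ((83, 102), (104, 123), 1),
]
n_allowed = 2  # cap taken from the matched-pair minimum

# Draw n_allowed distinct observations; raises ValueError if n_allowed > len(valid_obs)
selected = random.sample(valid_obs, n_allowed)
print(selected)
```

Because the cap is the minimum across the matched pair, it never exceeds the number of observations available on either side, so the draw should not fail.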

In [28]:
# Set up final treatment and control dataframes for observations that will be used in the regression
# Different dataframe for each of the different window sizes

treatment5df = pd.DataFrame()
treatment10df = pd.DataFrame()
treatment20df = pd.DataFrame()
treatment30df = pd.DataFrame()
treatment40df = pd.DataFrame()
treatment_dfs = {
    5: treatment5df,
    10: treatment10df,
    20: treatment20df,
    30: treatment30df,
    40: treatment40df
}

control5df = pd.DataFrame()
control10df = pd.DataFrame()
control20df = pd.DataFrame()
control30df = pd.DataFrame()
control40df = pd.DataFrame()
control_dfs = {
    5: control5df,
    10: control10df,
    20: control20df,
    30: control30df,
    40: control40df
}


# Set up treatment and control companies
treatment_companies = ['adidasoriginals', 'allbirds', 'asos', 'everlane', 'ganni', 'goat', 'lunya', 'pomelofashion', 'prettylittlething', 'fashionphile']
control_companies = ['vinted', 'urbanic', 'stockx', 'lenskart', 'mejuri', 'veja', 'rixo', 'printful', 'skims', 'primark']

# Set up windows sizes
window_sizes = [5,10,20,30,40]
window_sizes_str = [str(w) for w in window_sizes]

# Set up dataframes to track posts used in the regression per company (PPC)
summarization_metrics = ['total_posts', 'total_companies_contributed','contribution_mean', 'contribution_stdev']
treatment_PPC = pd.DataFrame(columns = treatment_companies + summarization_metrics, index = window_sizes_str)
control_PPC = pd.DataFrame(columns = control_companies + summarization_metrics, index = window_sizes_str)


# Set up dataframe to describe company contributions
matched_companies = [f'{treatment_companies[i]} : {control_companies[i]}' for i in range(len(treatment_companies))]
matched_contribution_ratio = pd.DataFrame(columns = matched_companies, index = window_sizes_str)


**Extract Control Data**

In [29]:
import random
random.seed(42)

# Read all data csv
file_path = "data/SocialInsider_Instagram_posts.csv"
main_df = pd.read_csv(file_path)

for company in control_companies:
    df = main_df[main_df['Profile'] == company].copy()
    df = sort_data(df)
    df = calculate_thresholds(df)
    valid_observations = find_valid_control_obs(df, window_sizes)
    
    for w in window_sizes:

        valid_obs = valid_observations[w]['valid_obs']
        str_w = str(w)
        n = int(ctrl_dict[company][str_w])
        valid_obs = random.sample(valid_obs, n)
        new_observations = extract_observations(valid_obs=valid_obs, df=df, treatment=False)
        control_PPC.loc[str(w), company] = len(new_observations)
        control_dfs[w] = pd.concat([control_dfs[w], new_observations], ignore_index=True)
    
control_subset_df = control_PPC.iloc[:, 0:10]
control_PPC['total_companies_contributed'] = control_subset_df.apply(lambda row: row[row != 0].count(), axis=1)
control_PPC['total_posts'] = control_subset_df.apply(lambda row: row[row != 0].sum(), axis=1)
control_PPC['contribution_mean'] = control_subset_df.mean(axis=1)
control_PPC['contribution_stdev'] = control_subset_df.std(axis=1)

**Extract Treatment Data**

In [30]:
# Read all data csv
file_path = "data/SocialInsider_Instagram_posts.csv"
main_df = pd.read_csv(file_path)

for company in treatment_companies:
    df = main_df[main_df['Profile'] == company].copy()
    df = sort_data(df)
    df = calculate_thresholds(df)
    valid_observations = find_valid_treatment_obs(df, window_sizes)

    for w in window_sizes:

        valid_obs = valid_observations[w]['valid_obs']
        str_w = str(w)
        n = int(treat_dict[company][str_w])
        valid_obs = random.sample(valid_obs, n)
        new_observations = extract_observations(valid_obs=valid_obs, df=df, treatment=True)
        treatment_PPC.loc[str(w), company] = len(new_observations)
        treatment_dfs[w] = pd.concat([treatment_dfs[w], new_observations], ignore_index=True)

treatment_subset_df = treatment_PPC.iloc[:, 0:10]
treatment_PPC['total_companies_contributed'] = treatment_subset_df.apply(lambda row: row[row != 0].count(), axis=1)
treatment_PPC['total_posts'] = treatment_subset_df.apply(lambda row: row[row != 0].sum(), axis=1)
treatment_PPC['contribution_mean'] = treatment_subset_df.mean(axis=1)
treatment_PPC['contribution_stdev'] = treatment_subset_df.std(axis=1)


**Compare Treatment PPC and Control PPC**
These dataframes should be the same for each matched company now. Matched companies are paired by column index.

In [31]:
control_PPC

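| window_size | vinted | urbanic | stockx | lenskart | mejuri | veja | rixo | printful | skims | primark | total_posts | total_companies_contributed | contribution_mean | contribution_stdev |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 10 | 30 | 150 | 10 | 60 | 10 | 60 | 80 | 110 | 10 | 530 | 10 | 53.0 | 48.773855 |
| 10 | 20 | 20 | 180 | 20 | 40 | 20 | 60 | 160 | 180 | 20 | 720 | 10 | 72.0 | 71.30529 |
| 20 | 40 | 0 | 40 | 40 | 40 | 40 | 40 | 80 | 200 | 40 | 560 | 9 | 56.0 | 53.995885 |
| 30 | 60 | 0 | 60 | 60 | 60 | 0 | 60 | 60 | 240 | 60 | 660 | 8 | 66.0 | 66.030296 |
| 40 | 0 | 0 | 0 | 80 | 80 | 0 | 0 | 0 | 320 | 80 | 560 | 4 | 56.0 | 100.133245 |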


In [32]:
treatment_PPC

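| window_size | adidasoriginals | allbirds | asos | everlane | ganni | goat | lunya | pomelofashion | prettylittlething | fashionphile | total_posts | total_companies_contributed | contribution_mean | contribution_stdev |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 10 | 30 | 150 | 10 | 60 | 10 | 60 | 80 | 110 | 10 | 530 | 10 | 53.0 | 48.773855 |
| 10 | 20 | 20 | 180 | 20 | 40 | 20 | 60 | 160 | 180 | 20 | 720 | 10 | 72.0 | 71.30529 |
| 20 | 40 | 0 | 40 | 40 | 40 | 40 | 40 | 80 | 200 | 40 | 560 | 9 | 56.0 | 53.995885 |
| 30 | 60 | 0 | 60 | 60 | 60 | 0 | 60 | 60 | 240 | 60 | 660 | 8 | 66.0 | 66.030296 |
| 40 | 0 | 0 | 0 | 80 | 80 | 0 | 0 | 0 | 320 | 80 | 560 | 4 | 56.0 | 100.133245 |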


________
#### **Merging**

Concatenate treatment observations with control observations

In [33]:
merged5df = pd.concat([control_dfs[5], treatment_dfs[5]], ignore_index=True)
merged10df = pd.concat([control_dfs[10], treatment_dfs[10]], ignore_index=True)
merged20df = pd.concat([control_dfs[20], treatment_dfs[20]], ignore_index=True)
merged30df = pd.concat([control_dfs[30], treatment_dfs[30]], ignore_index=True)
merged40df = pd.concat([control_dfs[40], treatment_dfs[40]], ignore_index=True)
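
The five concatenations follow the same pattern, so they could also be kept in a dictionary keyed by window size. This is only a sketch of that alternative: `control_toy`, `treatment_toy`, and `merged` are hypothetical names, and the toy frames stand in for `control_dfs` and `treatment_dfs`.

```python
import pandas as pd

window_sizes = [5, 10]  # shortened list for illustration

# Toy stand-ins for control_dfs and treatment_dfs, keyed by window size.
control_toy = {
    w: pd.DataFrame({"Profile": ["a"] * w, "Treatment": [0.0] * w})
    for w in window_sizes
}
treatment_toy = {
    w: pd.DataFrame({"Profile": ["b"] * w, "Treatment": [1.0] * w})
    for w in window_sizes
}

# One merged frame per window size, control rows first, with a fresh 0..n-1 index.
merged = {
    w: pd.concat([control_toy[w], treatment_toy[w]], ignore_index=True)
    for w in window_sizes
}
print({w: len(frame) for w, frame in merged.items()})
```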

Sanity Check: Verify that the per-`Treatment` counts for `Time_Frame = 0` mirror those for `Time_Frame = 1`

In [34]:
# Verify equal number of time_frame = 0 and time_frame = 1
combination_counts = merged10df.groupby(['Time_Frame', 'Treatment']).size().reset_index(name='Count')

print("Unique Combination Counts:")
print(combination_counts)


Unique Combination Counts:
    Time_Frame  Treatment  Count
0          0.0        0.0    360
1          0.0        1.0    190
2          0.0        2.0     70
3          0.0        3.0     60
4          0.0        4.0     10
5          0.0        5.0     10
6          0.0       10.0     10
7          0.0       20.0     10
8          1.0        0.0    360
9          1.0        1.0    190
10         1.0        2.0     70
11         1.0        3.0     60
12         1.0        4.0     10
13         1.0        5.0     10
14         1.0       10.0     10
15         1.0       20.0     10
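
The mirror-image property can also be asserted rather than read off the printout. The following is a minimal sketch under assumptions: `time_frames_mirrored` is a hypothetical helper, and `toy` is a small made-up frame that only shares the `Time_Frame` and `Treatment` column names with `merged10df`.

```python
import pandas as pd

def time_frames_mirrored(df):
    """Return True when every Treatment value has equal counts in Time_Frame 0 and 1."""
    counts = (
        df.groupby(["Time_Frame", "Treatment"])
        .size()
        .unstack("Time_Frame", fill_value=0)
        # Guarantee both columns exist even if one time frame is absent.
        .reindex(columns=[0.0, 1.0], fill_value=0)
    )
    return bool((counts[0.0] == counts[1.0]).all())

# Made-up data: two control rows and one treatment row in each time frame.
toy = pd.DataFrame({
    "Time_Frame": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    "Treatment": [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
})

balanced = time_frames_mirrored(toy)
unbalanced = time_frames_mirrored(toy.iloc[:-1])  # drop one post-period row
print(balanced, unbalanced)
```

The same helper could be applied to each window size in turn, rather than to the `w = 10` frame only.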


Merge Followers Data

In [35]:
# read in follower count data and merge with dataframes
followers = pd.read_csv("Follower-count_Pivot - Sheet1.csv")
followers['Followers'] = followers['Followers'].astype(int)
followers['Profile'] = followers['Profile'].str.lower()

In [36]:
followers5df = pd.merge(merged5df, followers, on='Profile', how='left')
followers10df = pd.merge(merged10df, followers, on='Profile', how='left')
followers20df = pd.merge(merged20df, followers, on='Profile', how='left')
followers30df = pd.merge(merged30df, followers, on='Profile', how='left')
followers40df = pd.merge(merged40df, followers, on='Profile', how='left')
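
A left merge silently leaves `Followers` as `NaN` for any profile missing from the follower file, and duplicates rows if a profile appears there twice. The sketch below shows one way to surface both problems with the `validate` and `indicator` parameters of `pd.merge`. The frames `posts_toy` and `followers_toy` are hypothetical stand-ins, and the follower count is illustrative.

```python
import pandas as pd

# Toy stand-ins for a merged post-level frame and the follower table.
posts_toy = pd.DataFrame({"Profile": ["lenskart", "lenskart", "veja"]})
followers_toy = pd.DataFrame({"Profile": ["Lenskart"], "Followers": [1045041]})
followers_toy["Profile"] = followers_toy["Profile"].str.lower()

checked = pd.merge(
    posts_toy,
    followers_toy,
    on="Profile",
    how="left",
    validate="many_to_one",  # raises MergeError if a profile repeats in followers_toy
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Profiles present in the post data but absent from the follower table.
unmatched_profiles = sorted(
    checked.loc[checked["_merge"] == "left_only", "Profile"].unique()
)
print(unmatched_profiles)
```

An empty list here would indicate that every profile received a follower count.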

Merge Number of Posts Data

In [37]:
# Read the full posts CSV and count the number of posts per profile
file_path = "data/SocialInsider_Instagram_posts.csv"
main_df = pd.read_csv(file_path)
num_posts = dict(main_df['Profile'].value_counts())
posts_df = pd.DataFrame(list(num_posts.items()), columns=['Profile','num_posts'])

In [38]:
final5df = pd.merge(followers5df, posts_df, on='Profile', how='left')
final10df = pd.merge(followers10df, posts_df, on='Profile', how='left')
final20df = pd.merge(followers20df, posts_df, on='Profile', how='left')
final30df = pd.merge(followers30df, posts_df, on='Profile', how='left')
final40df = pd.merge(followers40df, posts_df, on='Profile', how='left')

Create Month Column

In [39]:
final5df['Month'] = final5df['Date'].dt.month
final10df['Month'] = final10df['Date'].dt.month
final20df['Month'] = final20df['Date'].dt.month
final30df['Month'] = final30df['Date'].dt.month
final40df['Month'] = final40df['Date'].dt.month
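
The `.dt` accessor only works on datetime columns, so `Date` must already have been parsed before this cell runs. A minimal sketch of that dependency, assuming ISO-formatted date strings (the actual format of `Date` in the source data is not shown here, and `toy` is a hypothetical frame):

```python
import pandas as pd

# Made-up dates stored as strings, as they would be after a plain read_csv.
toy = pd.DataFrame({"Date": ["2023-05-14", "2023-04-02"]})

# Parse to datetime first; .dt.month raises AttributeError on string columns.
toy["Date"] = pd.to_datetime(toy["Date"])
toy["Month"] = toy["Date"].dt.month
print(toy)
```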

________
#### Save CSVs


In [None]:
final_dfs = [final5df, final10df, final20df, final30df, final40df]
window_sizes = ["5", "10", "20", "30", "40"]
columns_to_save = ['Profile', 'Type', 'engagement_zscore', 'Time_Frame', 'Treatment', 'Followers', 'Month', 'num_posts']

# Write one regression-ready CSV per window size
for window_size, final_df in zip(window_sizes, final_dfs):
    file_name = f"window{window_size}_regressiondata.csv"
    final_df[columns_to_save].to_csv(file_name, index=False)
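
Selecting a column that does not exist raises a `KeyError` partway through the loop, after some files may already have been written. The CSV read back in the sanity check below lists a `sqrt_engagement_zscore` column, so it may be worth confirming the column names before saving. The following sketch validates every frame first and then writes the files. `save_regression_csvs`, `toy_final`, and the shortened column list are hypothetical, and the files go to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

columns_to_save = ["Profile", "Treatment"]  # shortened, illustrative column list
toy_final = {
    "5": pd.DataFrame(
        {"Profile": ["a", "b"], "Treatment": [0.0, 1.0], "scratch": [1, 2]}
    ),
}

def save_regression_csvs(frames_by_window, columns, out_dir):
    """Check that every frame has the requested columns, then write one CSV per window."""
    for window_size, frame in frames_by_window.items():
        missing = [c for c in columns if c not in frame.columns]
        if missing:
            raise KeyError(f"window {window_size} is missing columns: {missing}")
    paths = []
    for window_size, frame in frames_by_window.items():
        path = Path(out_dir) / f"window{window_size}_regressiondata.csv"
        frame[columns].to_csv(path, index=False)
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as tmp_dir:
    written = save_regression_csvs(toy_final, columns_to_save, tmp_dir)
    round_trip = pd.read_csv(written[0])  # read back while the file still exists

print(list(round_trip.columns))
```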

Sanity Check: Read one of the CSVs

In [40]:
pd.read_csv('window40_regressiondata.csv')

| | Profile | Type | sqrt_engagement_zscore | Time_Frame | Treatment | Followers | Month | num_posts |
|---|---|---|---|---|---|---|---|---|
| 0 | lenskart | CAROUSEL_ALBUM | 0.660412 | 0.0 | 0.0 | 1045041 | 5 | 857 |
| 1 | lenskart | CAROUSEL_ALBUM | 0.937294 | 0.0 | 0.0 | 1045041 | 5 | 857 |
| 2 | lenskart | IMAGE | 0.287251 | 0.0 | 0.0 | 1045041 | 5 | 857 |
| 3 | lenskart | reel | -0.680130 | 0.0 | 0.0 | 1045041 | 5 | 857 |
| 4 | lenskart | reel | -0.651569 | 0.0 | 0.0 | 1045041 | 5 | 857 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 955 | fashionphile | CAROUSEL_ALBUM | -0.377025 | 1.0 | 1.0 | 517955 | 4 | 417 |
| 956 | fashionphile | CAROUSEL_ALBUM | -0.262065 | 1.0 | 1.0 | 517955 | 4 | 417 |
| 957 | fashionphile | reel | -0.520655 | 1.0 | 1.0 | 517955 | 4 | 417 |
| 958 | fashionphile | reel | -0.610770 | 1.0 | 1.0 | 517955 | 4 | 417 |
