<img src="./images/instagram_header.png" align="left" style="margin-bottom: 20px"/>

<h2> Web Appendix - Hiding Like Counts on Instagram </h2>

<p style="clear: both;">This online appendix complements the master thesis "Goodbye Likes, Hello Mental Health: How Hiding Like Counts Affects User Behavior & Self-Esteem":</p> 

<p><i>Likes are widely available on social network services and are known to influence people’s self-image. An emerging literature has started to look at potential detrimental effects of social media use among teenagers. We study how Instagram users’ posting frequency, variety, like behavior, and relative self-esteem are affected by an intervention in which like counts were hidden in selected treatment countries. Using a unique panel data set of individual users’ Instagram posts across multiple years, we find evidence that users posted more frequently and more varied than in the months prior to the intervention. On the other hand, the number of likes decreases as people are no longer influenced by others’ evaluations, especially among users with a small following. Further, in an experiment we show that the number of likes people see on others’ posts affects their relative self-esteem, and that users are more likely to self-disclose once they rate themselves more positively. These results are critical to understanding the dynamics on visual-based social media in order to foster a healthy online environment.</i></p>

<p>In this notebook, we perform the following steps (run this .ipynb-file locally for clickable anchors): </p>

A. [Instagram Influencer Seed](#instagram-influencer-seed)  
B. [Instagram Consumers Selection](#instagram-consumers-selection)  
C. [Collect & Preprocess Instagram Data](#preprocess-instagram-data)  
D. [Computer Vision](#computer-vision)  
E. [Cosine Similarity & Image Similarity](#cosine-similarity)  
F. [Outlier Screening](#outlier-screening)  
G. [Propensity Score Matching](#propensity-score-matching)  
H. [Differences in Differences](#differences-in-differences)  
I. [Randomized Experiment](#experiment)  

In [1]:
import pandas as pd, pickle, random, datetime, json, requests, sys, os
from sqlalchemy import create_engine
from PIL import Image
from io import BytesIO
from sklearn.metrics.pairwise import cosine_similarity 

# define path to local SQL database
engine = create_engine('postgresql+psycopg2://postgres:thesis@localhost:5433/thesis')

# support R in Jupyter Notebook
%load_ext rpy2.ipython

In [None]:
%%R
set.seed(123) # for reproducibility
library(RPostgreSQL)
library(Matching)
library(dplyr)
library(plm)
library(lme4)
library(lmerTest)
library(nlme)
library(ez)
library(ggplot2)
library(plotrix)
library(reshape)
library(lsr)
library(pscl)
library(psych)
library(mfx)
library(erer)

<a id="instagram-influencer-seed"></a>
### A. Instagram Influencer Seed

<a href="https://hypeauditor.com/top-instagram/">HypeAuditor</a> lists the top influencers by  category by country. The ranking is updated daily and takes into account quality audience and authentic engagement to control for bots and inactive accounts. For each listing the number of followers from a given country and the total number of followers are indicated. First, we scraped all listings for each category and country combination in February 2020. The number of listings for each combination varies depending on the volume of influencers in the domain. In 45 cases (0.3%) the ``followers_from_country``  column was missing because the data was unavailable on HypeAuditor. These records have been excluded from our analysis. Second, we divide both metrics by one another to derive the *purity*. That is, the percentage of influencers' followers from a given country. Third, we exclude followers whose purity is below 50%. As a result, we end up with a list of influencers (N = 5391) whose main following base is located in a single country (41% of all top influencers). Fourth, we sort the influencers by purity and pick the top 20 influencers for each country in our dataset.

<img src="./images/hypeauditor.png" align="left" />

In [3]:
def convert_follower_counts(df, column):
    '''convert string follower counts (e.g., 3K) into numeric (e.g., 3000)'''
    for counter in range(len(df)): 
        if "M" in df.loc[counter, column]:
            df.loc[counter, column] = float(df.loc[counter, column].replace("M", "")) * 1000000        
        elif "K" in df.loc[counter, column]:
            df.loc[counter, column] = float(df.loc[counter, column].replace("K", "")) * 1000
    return df


def extract_country(df): 
    '''extract country from URL (e.g., https://hypeauditor.com/top-instagram-beer-wine-spirits-brazil/ --> brazil)'''
    countries = ["australia", "brazil", "canada", "china", "france", "germany", "hong-kong", "india", "indonesia", \
                 "italy", "malaysia", "mexico", "russia", "saudi-arabia", "slovakia", "spain", "switzerland", "ukraine", \
                 "united-arab-emirates", "united-kingdom", "united-states"]
    for counter in range(len(df)): 
        for country in countries: 
            if country in df.loc[counter, 'url']:
                df.loc[counter, 'country'] = country
    return df


def calculate_purity(df):
    '''determine the percentage of influencers' followers from a given country'''
    for counter in range(len(df)):
        df.loc[counter, 'percentage_country'] = float(df.loc[counter, 'followers_from_country']) / float(df.loc[counter,'total_follower'])
    return df


# import data      
df = pd.read_csv("data/hypeauditor.csv", keep_default_na=False)

# add country column to data
df = extract_country(df)

# convert follower counts to numeric
df = convert_follower_counts(df, 'followers_from_country')
df = convert_follower_counts(df, 'total_follower')

# exclude records for which either the followers from country or total follower count is missing
df = df[(df['followers_from_country'] != "") & (df['total_follower'] != "")].reset_index(drop=True)

# add purity measure
df = calculate_purity(df)

# exclude influencers whose purity is below 50%; mean purity increases from 37.1% to 70.1%
df = df.loc[df.percentage_country > .5]

# select top influencers from each country whose purity is highest (mean purity: 82.8%)
top_20 = df.groupby('country')['percentage_country'].nlargest(20)

# obtain the usernames belonging to these influencers
indices = [top_20.index[counter][1] for counter in range(len(top_20.index))] 
selected_df = df.loc[indices, ['username', 'country', 'percentage_country']]

# export influencers' username, country of main following, and purity to local database
# --- selected_df.to_sql("purity", engine, index=None, if_exists='replace')

<a id="instagram-consumers-selection"> </a>
### B. Instagram Consumers Selection

For each selected influencer we collect a list of their followers using [Phantombuster](https://phantombuster.com/automations/instagram/7085/instagram-profile-scraper) (figure below), draw a random sample of  followers, and validate their country of origin in order to construct our dataset of consumers. That is, public Instagram users who i) followed an influencer account from our list of top 20 influencers by country we identified in step 1, ii) posted at least 50 pictures and/or videos (of which at least 5 before and after the intervention), iii) were followed by up to 5000 users, and iv) were not used for business purposes. We derived the latter from the Instagram account type (business or personal), whether the account was owned by an individual or an organisation, and whether posts are commercially affiliated (i.e., they promote products or services). Our sample includes personal accounts owned by individual users who do not engage in commercial activities on Instagram. 

<img src="./images/phantombuster.png" width="650px"/>


<p style="clear: both;">Furthermore, we validate the user's country of origin as follows. First, the language used in the bio and post captions should correspond with the main language in the country of origin. Second, language use in post comments should also be in line with the main language in the country of origin. Third, location tags should primarily refer to places in the country of origin (though a vacation photo taken elsewhere may sporadically occur). We repeat this process until we have gathered 40 accounts by country:</p>


| Country | Type | #Accounts |
| ------ | ------ |------ |
| Australia | Treatment | 40 |
| Canada | Treatment | 40 |
| France | Control | 40 |
| Germany | Control | 40 |
| Italy | Treatment | 40 |
| Netherlands | Control | 40 |
| Spain | Control | 40 |
| United Kingdom | Control | 40 |


<a id="preprocess-instagram-data"></a>
### C. Collect & Preprocess Instagram Data
In step 1 and 2 we created a list of Instagram usernames of which we collected historical post data using [Instagram Scraper](https://github.com/arc298/instagram-scraper). This is a command-line application written in Python to obtain user information, social relationship information, and photo information. Since the photo and video files attached to each post take up a significant amount of memory, we only store the links to the online media files. More specifically, we run the command below to collect the post information of usernames in `FILE_NAME.txt` (i.e., text file that contains all usernames in our sample). 

<img src="./images/instagram_scraper.png" alt="Instagram Scraper Github" align="left" width="600px">

`instagram-scraper -f FILE_NAME.txt --media-types none --media-metadata --profile-metadata -T {username}_{urlname}`


The scraping process yields a separate JSON-file for each account which requires further preprocessing for follow-up analysis. For each user we extracted post and user level data and stored it into a dataframe which we then pushed to a local database. 

In [4]:
def open_seed(file_path):
    '''open usernames and store in array'''
    with open(file_path) as f:
        names = [line.split() for line in f]
    return [name for name_list in names for name in name_list]

def pickle_files(pickle_name, df):
    '''store output of data frame as pickle'''
    with open(pickle_name, 'wb') as f:
        pickle.dump(df, f)

def parse_json_files(usernames, path):
    '''parse the json file for each username in the text file and store preprocessed records in a data frame'''
    
    # declare dataframes for post data (df), user profile data (profile), and arrays for user accounts which were missing (i.e., deleted after constructing seeds) or private (i.e., user data could not be scraped)
    df = pd.DataFrame()
    profile = pd.DataFrame()
    missing = []
    private = []
    
    for username in usernames: 
        # for each username load raw json file in memory
        try: 
            with open("./data/" + path + "/" + username + "/" + username + ".json") as f:
                d = json.load(f)
        except: 
            missing.append(username)

        # test if user profile is publicly available
        try:        
            # post level data
            shortcode = [d['GraphImages'][counter]['shortcode'] for counter in range(len(d['GraphImages']))]
            description = [d['GraphImages'][counter]['edge_media_to_caption']['edges'][0]['node']['text'] if len(d['GraphImages'][counter]['edge_media_to_caption']['edges']) > 0 else "NA" for counter in range(len(d['GraphImages']))]
            total_likes = [d['GraphImages'][counter]['edge_media_preview_like']['count'] for counter in range(len(d['GraphImages']))]
            total_comments = [d['GraphImages'][counter]['edge_media_to_comment']['count'] for counter in range(len(d['GraphImages']))]
            hashtags = [d['GraphImages'][counter]['tags'] if type(d['GraphImages'][counter].get('tags')) == list else "NA" for counter in range(len(d['GraphImages']))]
            content_type = [d['GraphImages'][counter]['__typename'] for counter in range(len(d['GraphImages']))]
            timestamp = [d['GraphImages'][counter]['taken_at_timestamp'] for counter in range(len(d['GraphImages']))]
            video_views = [d['GraphImages'][counter]['video_view_count'] if type(d['GraphImages'][counter].get('video_view_count')) == int else 0 for counter in range(len(d['GraphImages']))]
            media1 = [d['GraphImages'][counter]['urls'][0] for counter in range(len(d['GraphImages']))]

            df_temp = pd.DataFrame({
                           "username": username,
                           "shortcode": shortcode, 
                           "description": description, 
                           "total_likes": total_likes,
                           "total_comments": total_comments,
                           "hashtags": hashtags,
                           "content_type": content_type,
                           "timestamp": timestamp, 
                           "video_views": video_views,
                           "media1": media1 # for image caroussels we focused on the first photo/video
                          })

            df = pd.concat([df_temp, df]).reset_index(drop=True)

            # user level profile data
            followers_count = [d['GraphProfileInfo']['info']['followers_count']]
            following_count = [d['GraphProfileInfo']['info']['following_count']]
            posts_count = [d['GraphProfileInfo']['info']['posts_count']]
            biography = [d['GraphProfileInfo']['info']['biography']]
            full_name = [d['GraphProfileInfo']['info']['full_name']]

            profile_temp = pd.DataFrame({
                "username": username,
                "followers_count": followers_count,  
                "following_count": following_count, 
                "posts_count": posts_count,
                "biography": biography, 
                "full_name": full_name 
            })

            profile = pd.concat([profile_temp, profile]).reset_index(drop=True)

        except: 
            private.append(username)

    # convert epoch time to regular timestamp
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

    # add regular date (without time)
    df['date'] = df['timestamp'].dt.date

    return df, missing, private, profile

# list with usernames, JSON path, user type
paths = [["./data/json_consumer/consumers_new.txt", "json_consumer", "consumers"]]

# for all usernames process JSON files, push to SQL database, and store a pickle copy
for path in paths: 
    temp_output = parse_json_files(open_seed(path[0]), path[1])
    temp_df = temp_output[0]
    temp_missing = temp_output[1]
    temp_private = temp_output[2]
    temp_profile = temp_output[3]
    
    temp_df.to_sql(path[2], engine, index=None, if_exists='replace')
    temp_profile.to_sql(path[2] + "_profile", engine, index=None, if_exists='replace')
    pickle_files('pickles/' + path[2] + '.pickle', temp_df)

<a id="computer-vision"></a>
### D. Computer Vision
We use Azure Cognitive Services Computer Vision Application Programming Interface ([API](https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/)) to analyze image content. For every image, the API returns a vector of tags and confidence scores (figure below). First, we make an API request and pickle all output data for further analysis. Second, we compute image similarity within and between-subjects using image tags data. Note that Instagram image URLs are valid for a limited amount of time. The code sample below, therefore, only runs for recently scraped data. Expired URLs are printed in the console.

<img src="./images/vision_api_example.png" align="left" alt="Computer Vision API Example (tags)">

In [5]:
if 'COMPUTER_VISION_SUBSCRIPTION_KEY' in os.environ:
    subscription_key = os.environ['COMPUTER_VISION_SUBSCRIPTION_KEY']
else:
    print("\nSet the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable.\n**Restart your shell or IDE for changes to take effect.**")
    sys.exit()

if 'COMPUTER_VISION_ENDPOINT' in os.environ:
    endpoint = os.environ['COMPUTER_VISION_ENDPOINT']
    
analyze_url = endpoint + "vision/v2.1/analyze"


def process_image(temp_df, categories=True):
    '''obtain categories or tags image data from all images in dataframe using Azure Cognitive Services'''
    df = pd.DataFrame(columns=['uri', 'timestamp', 'category', 'score'])
    
    for counter in range(len(temp_df)):
        image_url = temp_df.loc[counter, 'media1']
        time_stamp = temp_df.loc[counter, 'timestamp']
    
        headers = {'Ocp-Apim-Subscription-Key': subscription_key}
        data = {'url': image_url}
        params = {'visualFeatures': 'Categories'} if categories else {'visualFeatures': 'Tags'}
        
        try: 
            response = requests.post(analyze_url, headers=headers,
                                 params=params, json=data)
            output = response.json()

            if categories: 
                for category in output['categories']: 
                    df = df.append(
                        dict(
                            uri = image_url,
                            timestamp = time_stamp,
                            category = category['name'],
                            score = category['score'],
                        ), ignore_index = True)

            else: 
                for tag in output['tags']: 
                    df = df.append(
                            dict(
                                uri = image_url,
                                timestamp = time_stamp,
                                category = tag['name'],
                                score = tag['confidence'],
                            ), ignore_index=True)           
                    
        except: 
            #image url expired
            print(image_url)             
            
    return df

# to save time and computing resources we only collect image tags data among users selected after matching
# in step 7 we describe the propensity score matching procedure
connection = engine.connect()
consumers_psm_query = connection.execute("SELECT * FROM consumers_psm").fetchall()
consumers_selected = pd.DataFrame(consumers_psm_query)[0]

# consumers' post level data
temp_df = pickle.load(open("pickles/consumers.pickle", "rb"))

for consumer in consumers_selected:
    if not os.path.isfile('./pickles/image_output/Azure_Tags/' + consumer + '.pickle'):
        temp = temp_df.loc[(temp_df.username == consumer) & (temp_df.content_type != "GraphVideo")].reset_index(drop=True) 
        image_tags = process_image(temp, False)
        pickle_files('pickles/image_output/Azure_Tags/' + consumer + '.pickle', image_tags)

<a id='cosine-similarity'> </a>
### E. Cosine Similarity & Image Similarity

#### E.1 Cosine Similarity
To illustrate how the cosine similarity scores are derived, we go over a fictitious example for the within-subject design. Let's assume a user posted two pictures of which we want to compute the cosine similarity: 

<img src="./images/cosine_similarity.jpg" align="left" alt="Cosine Similarity Example"/>

As follows from the figure the computer vision algorithm API returned three tags for both pictures. The first picture contains a group of people watching the sunset together, and the second picture also shows a group of people standing in a forest. To account for uncertainty each of these tags is associated with a confidence score, which we can write down in matrix notation as follows: 

In [6]:
pictures = pd.DataFrame([[0.67, 0.93, 0.89, 0.00], [0.74, 0.92, 0.00, 0.88]], columns=['people_group', 'outdoor', 'sunset', 'forest'], index=['picture1.jpg', 'picture2.jpg'])
pictures

Unnamed: 0,people_group,outdoor,sunset,forest
picture1.jpg,0.67,0.93,0.89,0.0
picture2.jpg,0.74,0.92,0.0,0.88


In the first and second row `forest` and `sunset` were assigned a confidence score of `0.00` respectively as these tags were not present in the images. Next, we perform the cosine similarity operation which measures the angle between two vectors and determines whether two vectors are pointing in the same direction. More specifically, we multiply the confidence scores of pictures 1 and 2 for each image tag (e.g., for people: 0.67 x 0.74) and divide by the multiplication of the length of both vectors. Mathematically, this can be denoted as: 
$$sim(r,c)=  (r \cdot c)/(\left\Vert r \right\Vert \cdot \left\Vert c \right\Vert)$$ 
<br /> 
Here $r$ and $c$ are the image vectors for picture 1 and 2 respectively, and $||r||$ is defined as $\sqrt{r_1^2+r_2^2+ ... + r_n^2}$. A larger confidence score has a larger weight and more overlapping image tags gives a higher cosine similarity score. Filling in the confidence scores above, we find a cosine similarity of `0.63` between picture 1 and 2. The diagonal contains 1s as comparing any image with itself always yields a cosine similarity of 1. 

In [7]:
cosine_similarity(pictures)

array([[1.        , 0.63240507],
       [0.63240507, 1.        ]])

Next, we explain how we can apply cosine similarity transformations to address whether the variety of posts changes after the introduction of the intervention. First, we compute the cosine similarity between pictures taken before [after] the intervention with all other pictures taken before [after] the intervention. This gives a mean image similarity score by user: 
<br/>

| username | before_after | image_similarity |
| -------  | -------- | --------- | 
| alexanderkuckart | before | 0.2140 | 
| alexanderkuckart | after | 0.1984 | 
| ... | ... | ... | 
| xannabellex27 | before | 0.3134 | 
| xannabellex27 | after | 0.3087 | 

Again, let's consider a hypothethical user who used to only share like-seeking selfies on Instagram. After hiding like counts about half of the posts still include selfies, but the remaining posts include other subjects (e.g. scenery). This implies that the image similarity would drop since the cosine similarity of a blend of selfies and scenery photos is lower than the cosine similarity among a homogeneous sample of selfies. 

#### E.2 Image Similarity

First, we make within-subject comparisons to address whether the variety of posts changes after the introduction of the intervention. Second, we make between-subjects comparisons to determine whether treated users share more unique content relative to others. 

*Within-subject similarity*  
For each user i we compute the cosine similarity between pictures taken by the same user $i$. We distinguish between pictures taken before ($1_{before}$…$n_{before}$) and after  
($1_{after}$…$n_{after}$) the intervention. This yields a similarity matrix in which each picture *before* [*after*] the intervention is compared with all pictures *before* [*after*] hiding like counts (i.e., white squares in figure below). Each row ($r$) and column ($c$) name present a picture from user $i$ in $k$, where $k$ can take on the value *before* or *after*. Given these two separate subsets $k$, we calculate how similar each picture on average is to all other pictures in the same subset. That is, for each row we take the row average excluding the diagonal values. Finally, we aggregate the results across all rows in $k$:


$\omega_{ik} = \frac{1}{n_{ik}(n_{ik}-1)}\sum_{r=1}^{n} \sum_{c=1|c \neq r}^{i-1} sim(r_{ik},c_{ik})$ 

The row and column names represent pictures before and after the intervention for user *i* ($\mu_i$). Values in the matrix denote the cosine similarity for each picture pair (only the diagonal of 1s have been reported).  For the purpose of this analysis we restrict ourselves to the white squares in the top left and bottom right quadrant of the figure. Within these areas we compute row means (excluding the diagonal values) to determine how similar a given picture is to all other pictures in the same subset on average. Thereafter, we derive the before and after within-subject similarity by taking the average of the row means in the top left and bottom right squares, respectively. Note: calculating column means, rather than row means, yields identical outcomes. 

<img src="./images/within_subjects_similarity.png" align="left"/>


In [8]:
def within_subject_similarity(consumers, consumers_selected, categories=False):
    '''compute the within-subject similarity from image tags before and after the intervention'''

    for consumer in consumers_selected:
        if categories: 
            image_data = pickle.load(open("pickles/image_output/Azure_Categories/" + consumer + ".pickle", "rb"))
        else: 
            image_data = pickle.load(open("pickles/image_output/Azure_Tags/" + consumer + ".pickle", "rb"))
            
        consumers_df = pd.merge(image_data, consumers, left_on='uri', right_on='media1')[['uri', 'category', 'score', 'before_after']]

        # turn image categories/tags into a matrix (rows: images, columns: categories/tags) and order by date (year & month)
        tags_matrix = consumers_df.pivot_table(index=["before_after", "uri"], columns="category")
        tags_matrix = tags_matrix.fillna(0)
        similarity = cosine_similarity(tags_matrix)

        try: 
            before_intervention = len(tags_matrix.loc[('before')])
            after_intervention = len(tags_matrix.loc[('after')])

            before_similarity = pd.Series([similarity[counter][list(range(0, counter)) + list(range(counter + 1, before_intervention))].mean() for counter in range(before_intervention)]).mean()
            similarity_scores.loc[len(similarity_scores) + 1,] = [consumer, 'before', before_similarity]

            after_similarity = pd.Series([similarity[counter][list(range(before_intervention, counter)) + list(range(counter + 1, before_intervention + after_intervention))].mean() for counter in range(before_intervention, before_intervention + after_intervention)]).mean()
            similarity_scores.loc[len(similarity_scores) + 1,] = [consumer, 'after', after_similarity]
        
        except: 
            pass
            
    return similarity_scores 

# load consumers data and extract year and month from dates            
connection = engine.connect()
consumers_before_after_query = "SELECT c.username, media1, CASE WHEN cc.country = 'canada' AND c.date > '2019-04-30' THEN 'after' WHEN c.date > '2019-07-17' THEN 'after' ELSE 'before' END as before_after FROM consumers c INNER JOIN consumers_psm cp ON c.username = cp.username INNER JOIN consumers_country cc ON c.username = cc.username WHERE c.date > '2018-04-30' AND c.date < '2020-04-30'"
consumers_before_after = connection.execute(consumers_before_after_query).fetchall()
consumers_before_after_df = pd.DataFrame(consumers_before_after).rename({0: 'username', 1: 'media1', 2: 'before_after'}, axis=1)

# compute within-subject similarity 
similarity_scores_tags = within_subject_similarity(consumers_before_after_df, consumers_before_after_df['username'].unique())
similarity_scores_tags.to_sql("image_similarity_within_tags", engine, index=None, if_exists='replace')

*Between-subjects similarity*  
To assess between-subjects similarity (B) we distinguish between cohorts of users in the treatment and control group. We choose for these comparisons for two reasons. First, Instagram users may especially stay on top of the trends in their local market and therefore their postings might have already been more like other treatment units prior to the intervention. Second, by defining cohorts we establish more homogeneous clusters of users. Within these two cohorts, we determine the cosine similarity of each user pair ($u_i, u_j$) in k = {before, after} (i.e., white squares in figure below). That is, how similar pictures from user i are to pictures from another user j on average, where $u_i$  and $u_j$  belong to the same cohort.

$B_{ijk} = \frac{1}{n_i n_j} \sum_{r=1}^{n} \sum_{c=1}^{n} sim(r_{ijk},c_{ijk})$

The row and column names represent pictures before the intervention for user i ($u_i$) and user j ($u_j$) in the same cohort. Values in the matrix denote the cosine similarity for each picture pair (note: values are left out for simplicity). For the purpose of this analysis we restrict ourselves to the top right or the bottom left white square. Within this area we compute row means to determine how similar a given picture from $u_i$ [$u_j$] is to all pictures from $u_j$ [$u_i$] on average. Thereafter, we sum up the row means and divide by the number of rows ($n_{ik}$ [$n_{jk}$]) to derive the before between-subjects similarity. In a similar fashion, the after between-subjects similarity can be determined. Note that only one of both white squares should be used to avoid duplicates.

<img src="./images/between_subjects_similarity.png" width="500px" align="left"/>

In [9]:
def create_image_output(consumer, consumers, before_after='before'):
    '''construct cosine similarity matrix for either all images before or after the intervention'''
    image_input = pickle.load(open("pickles/image_output/Azure_Tags/" + consumer + ".pickle", "rb"))
    consumers_df = pd.merge(image_input, consumers, left_on='uri', right_on='media1')[['uri', 'category', 'score', 'before_after', 'treatment_control']]
    tags_matrix = consumers_df.pivot_table(index=["before_after", "uri"], columns="category")
    tags_matrix = tags_matrix.fillna(0)
    if before_after == 'before':
        tags_matrix = tags_matrix.loc[('before')]
    else: 
        tags_matrix = tags_matrix.loc[('after')]    
    return tags_matrix


def between_subjects_similarity(consumers):
    '''compute the between-subjects similarity from image tags before and after the intervention'''
    for consumer1 in consumers.username.unique():
        for consumer2 in consumers.username.unique(): 
            b_similarity_scores = pd.DataFrame(columns = ['username1' , 'username2', 'username1_2', 'before_after', 'similarity']) 
           
            try: 
                if consumer1 != consumer2: # do not compare image data of the same user
                    for before_after in ['before', 'after']: # run this procedure for images before and after the intervention separately
                        consumer1_df = create_image_output(consumer1, consumers, before_after)
                        consumer2_df = create_image_output(consumer2, consumers, before_after)
                        consumers_1_2 = pd.concat([consumer1_df, consumer2_df])
                        consumers_1_2 = consumers_1_2.fillna(0)
                        similarity = cosine_similarity(consumers_1_2)

                        # for each image of consumer 1 take the mean cosine similarity with all image of consumer 2
                        comparisons = [similarity[counter][len(consumer1_df):].mean() for counter in range(len(consumer1_df))]

                        # aggregate the results across all images of consumer 1 (so take the mean of all mean cosine similarities) 
                        before_after_similarity = pd.Series(comparisons).mean() 
                        b_similarity_scores = b_similarity_scores.append({
                                                                "username1": consumer1,
                                                                "username2": consumer2,
                                                                "username1_2": "_".join(sorted([consumer1, consumer2])),
                                                                "before_after": before_after, 
                                                                "similarity": before_after_similarity
                                                                }, ignore_index=True)

                b_similarity_scores.to_sql("image_similarity_between_tags", engine, index=None, if_exists='append')

            except: 
                pass        

# load consumers data and extract year and month from dates            
connection = engine.connect()
consumers_before_after_treatment_control_query = "SELECT c.username, media1, CASE WHEN cc.country = 'canada' AND c.date > '2019-04-30' THEN 'after' WHEN c.date > '2019-07-17' THEN 'after' ELSE 'before' END as before_after, CASE WHEN cc.country IN ('australia', 'canada', 'italy') THEN 'treatment' ELSE 'control' END as treatment_control FROM consumers c INNER JOIN consumers_psm cp ON c.username = cp.username INNER JOIN consumers_country cc ON c.username = cc.username WHERE c.date > '2018-04-30' AND c.date < '2020-04-30'"
consumers_before_after_treatment_control = connection.execute(consumers_before_after_treatment_control_query).fetchall()
consumers = pd.DataFrame(consumers_before_after_treatment_control).rename({0: 'username', 1: 'media1', 2: 'before_after', 3: 'treatment_control'}, axis=1)

# compute between-subjects similarity 
between_subjects_similarity(consumers)

<a id='outlier-screening'></a>
### F. Outlier Screening
Even though we use a stringent [Instagram consumers selection](#instagram-consumers-selection) procedure, it may sporadically occur that a user systematically differs in terms of the number of followers, followings, and the average number of likes of their image posts. To overcome this issue we use a multivariate outlier screening approach and remove these users from our sample before propensity score matching.

In [10]:
%%R
# connect to local database
drv = dbDriver("PostgreSQL")
con = dbConnect(drv, host='localhost', port='5433', dbname='thesis',
                user='postgres', password='admin')

# split query into subqueries
query_1 = "SELECT x.username, followers_count, following_count, average_likes FROM" 
query_2 = "x INNER JOIN (SELECT username, AVG(total_likes) as average_likes FROM"
query_3 = "l GROUP BY username) l ON x.username = l.username"

query_user_data = function(user_profile, user){
    # paste queries and determine follower count, following count, and average number of likes by user
    query = paste(query_1, user_profile, query_2, user, query_3)
    return(dbGetQuery(con, query))
}

outlier_screening = function(df){
    # determine if the mahalanobis distance exceeds the threshold value
    mahal = mahalanobis(df[,-1], colMeans(df[,-1]), cov(df[,-1]), tol=1e-20)
    cutoff = qchisq(1-0.001, ncol(df[,-1]))
    outliers = subset(df, mahal > cutoff)    
    no_outliers = subset(df, mahal < cutoff)
    
    return(c(outliers, no_outliers))
}


remove_record = function(table, usernames){
    # remove all posts of users labeled as outliers
    for(username in usernames){
        del_query = paste("DELETE FROM ", table, " WHERE username='", username, "'", sep="")
        dbGetQuery(con, del_query)
    }
}

# collect user stats and then screen for outliers
consumers_stats = query_user_data("consumers_profile", "consumers")
consumers_screening = outlier_screening(consumers_stats)    

# remove outliers from analysis (don't run this cell twice to avoid removing outliers after already excluding outliers)
remove_record("consumers", consumers_screening[1])

<a id='propensity-score-matching'></a>
### G. Propensity Score Matching
To reduce bias of distribution overlap and different density weighting, we rebalance our data through matching non-treated users to treated ones on similar covariate values. First, we estimate a probit model of receiving treatment on the number of followers, the number of followings, the adoption speed of Instagram users, and the percentage of image posts relative to all types of media posts. Second, we compute the Mahalanobis distance for each treated and control user pair and select unique matches sequentially, in order of closeness of their Mahalanobis distances. We match without replacement such that control units are only allowed to be used as a match once. Each treatment unit is matched with a single control unit as a higher number of matches deteriorates matching quality significantly. Third, we conduct an imbalance check before and after matching of which the results are reported in the paper. 

In [11]:
%%R
# prepare data for propensity score matching (treatment/control country), follower/following count, image share, days since adoption
consumers_query = 
"
SELECT w.username, treatment_control, followers_count, following_count, image, days_since_adoption 
FROM consumers_profile w 
INNER JOIN 
    (SELECT username, CASE     
        WHEN country in ('australia', 'brazil', 'canada', 'italy') THEN 1   
        ELSE 0 END as treatment_control 
    FROM consumers_country cc WHERE country != 'brazil') x ON w.username = x.username

INNER JOIN (SELECT w.username, AVG(CAST(image_count AS DECIMAL) / CAST(posts_count AS DECIMAL)) as image 
            FROM consumers_profile w 
            INNER JOIN (SELECT username, SUM(CASE WHEN content_type = 'GraphImage' THEN 1 END) as image_count 
            FROM consumers GROUP BY username) l ON l.username = w.username GROUP BY w.username) y ON y.username = w.username 

INNER JOIN (SELECT username, DATE_PART('day', '2020-06-01'::timestamp - MIN(timestamp)) as days_since_adoption 
            FROM consumers GROUP BY username) z ON z.username = y.username
"

PSM = function(df){  
    # propensity score matching
    Tr = cbind(as.vector(df$treatment_control))
    X = as.matrix(df[,c('followers_count', 'following_count', 'days_since_adoption', 'image')])
    
    # replace NAs in the image column with zero
    X[is.na(X)] = 0 
    
    glm1 = glm(Tr ~ X, family=binomial)
    
    rr1 = Match(Tr=Tr, X=glm1$fitted, replace = FALSE, Weight=1, M=1)
    summary(rr1)
  
    # check balancing properties (results may deviate between bootstrap iterations)
    MatchBalance(Tr ~ X, match.out = rr1, nboots=10000)
  
    # store indices of matched users
    treatment = data.frame(df[rr1$index.treated,'username'], 'treatment')
    colnames(treatment) = c("username", "type")
    control = data.frame(df[rr1$index.control,'username'], 'control')
    colnames(control) = c("username", "type")
    return(rbind(treatment, control))
}

consumers_PSM_input = dbGetQuery(con, consumers_query)

# lines below are commented to ensure consistency with paper (PSM results may slightly deviate for each run)
# consumers_PSM = PSM(consumers_PSM_input)
# dbWriteTable(con, "consumers_psm", consumers_PSM, overwrite = TRUE, row.names = FALSE) 

<a id='differences-in-differences'></a>
### H. Difference in Differences
In line with our hypotheses, we examine posting frequency (H1), variety (H2), and like behavior (H3). To this end, we query the local data base and apply a difference in differences (DiD) approach to estimate the effect of hiding like counts on our matched sample of users. 

We compare the outcome measures of Instagram users in the treatment countries with those in control countries. As Canadians enter the treatment group prior to Australians, Brazilians, and Italians, we estimate a DiD where the time variable is relative to the intervention date.

$Y_{it} = \alpha_i + \gamma_t + I_{t} + \tau_{i} \cdot I_{t}  + \epsilon_{it}$

where $Y_it$ is the dependent variable for user i at time t, 
$\alpha_i$ is a user-level fixed effect, 
$\gamma_t$ is a trend variable, 
$\tau_{i}$ is 1 of if user i was assigned to the treatment group and 0 otherwise, 
$I_{t}$ is 1 if the intervention was implemented at time t and 0 otherwise, 
$\epsilon_{it}$ is the error term for user i at time t 

User level fixed effects control for time-invariant user characteristics. Intervention 1 and 2 take place in late April and mid-July, respectively. Given above equation, we are especially interested in the coefficient estimate and significance of the interaction between the treatment group and intervention variables as this indicates whether treatment units respond significantly different to the intervention than control units. To account for any serial correlation, we use robust standard errors clustered at the user level.

#### H.1 Posting frequency
We run a difference in differences model on the monthly number of Instagram posts and interpret the coefficients. Reported R-Square values are obtained by running a linear model with user fixed effects. The model coefficients relate to the regression output as follows:

| Coefficient | Regression Output | 
| :--- | :--- |
| $\gamma_t$ | `counter` |
| $I_{t}$ | `interventionTRUE` |
| $\tau_{i} \cdot I_{t}$ | `treatment:interventionTRUE`|

In [12]:
%%R
# for each user collect the number of posts, mean number of likes per post, and mean number of comments per post in each month
posts_likes_comments_query = 
"
SELECT c.username, followers_count, following_count, treatment, 
CASE WHEN country = 'canada' THEN date_part('year', age(month, '2019-05-01')) * 12 + date_part('month', age(month, '2019-05-01'))
WHEN country != 'canada' THEN date_part('year', age(month, '2019-08-01')) * 12 + date_part('month', age(month, '2019-08-01')) END as months_since_intervention,
posts, likes, comments
FROM
(SELECT c.username, country, followers_count, following_count,
 to_date(concat_ws('-', date_part('year', timestamp), date_part('month', timestamp), '1'), 'YYYY-MM-DD') as month, 
 COUNT(DISTINCT(shortcode)) as posts, 
 CASE WHEN country IN ('australia', 'canada', 'italy') THEN 1 ELSE 0 END as treatment,
 AVG(total_likes) as likes,
AVG(total_comments) as comments
FROM consumers c
INNER JOIN consumers_country cc ON cc.username = c.username
INNER JOIN consumers_profile cp ON c.username = cp.username 
INNER JOIN consumers_psm cpsm ON cpsm.username = c.username
WHERE CASE WHEN country = 'canada' THEN DATE(timestamp) >= '2018-04-30' AND DATE(timestamp) <= '2020-04-30'
WHEN country != 'canada' THEN DATE(timestamp) >= '2018-07-17' AND DATE(timestamp) <= '2020-07-17' END
GROUP BY c.username, followers_count, following_count, treatment, cc.country, to_date(concat_ws('-', date_part('year', timestamp), date_part('month', timestamp), '1'), 'YYYY-MM-DD')) as c
"

posts = dbGetQuery(con, posts_likes_comments_query)
posts$after = posts$months_since_intervention >= 0 # create boolean that indicates whether the intervention was in place in a given month
posts$counter = posts$months_since_intervention # copy variable for the panel data analysis (see below)
    
fill_missing_months = function(df){
    # if users do not post in a given month we do not have any record of this. From this we can deduce that the number of posts in that month equals zero. This function searches for missing months and adds these records to the data frame.
    for(username in unique(df$username)){
        df_user = df[df$username == username,] 
        min_counter = min(df_user[, 'counter'])
        max_counter = max(df_user[, 'counter'])
        
        for(counter in min_counter:max_counter){
            if(!counter %in% df_user$counter){
                df[nrow(df) + 1, ] = c(username, df_user[1,'followers_count'], df_user[1,'following_count'], df_user[1,'treatment'], counter, 0, 0, 0, df_user[1,'after'], counter)
            }
        }
    }
    num_columns = c('followers_count', 'following_count', 'treatment', 'months_since_intervention', 'posts', 'likes', 'comments', 'counter')
    df[, num_columns] = sapply(df[, num_columns], as.numeric) 
    return(df)
}

posts = fill_missing_months(posts)

# log-transform posting frequency to account for skewness
posts[posts$posts == 0, 'posts'] = 0.0001
posts$log_posts = log(posts$posts)
posts$after = as.logical(posts$after)
posts$before = 1-posts$after

In [13]:
%%R
# run a Hausman test comparing random and fixed effects 
df.p = pdata.frame(posts, index=c('username', 'months_since_intervention'))
fixed_effects_posts = plm(as.formula(paste('log_posts', '~ treatment + after + treatment:after + counter + counter:treatment + after:treatment:counter')), data=df.p, model='within')
random_effects_posts = plm(as.formula(paste('log_posts', '~ treatment + after + treatment:after + counter + counter:treatment + after:treatment:counter')), data=df.p, model='random')
phtest(fixed_effects_posts, random_effects_posts) # choose for fixed effects (p < .001)
summary(fixed_effects_posts)

Oneway (individual) effect Within Model

Call:
plm(formula = as.formula(paste("log_posts", "~ treatment + after + treatment:after + counter + counter:treatment + after:treatment:counter")), 
    data = df.p, model = "within")

Unbalanced Panel: n = 238, T = 1-25, N = 5016

Residuals:
       Min.     1st Qu.      Median     3rd Qu.        Max. 
-9.93207332 -1.00494633  0.00010968  1.39768620  8.21545595 

Coefficients:
                             Estimate Std. Error  t-value  Pr(>|t|)    
afterTRUE                    4.649928   0.149663  31.0693 < 2.2e-16 ***
counter                     -0.304943   0.010933 -27.8912 < 2.2e-16 ***
treatment:afterTRUE         -0.804188   0.222859  -3.6085 0.0003111 ***
treatment:counter           -0.147786   0.016576  -8.9157 < 2.2e-16 ***
treatment:afterTRUE:counter  0.465593   0.025173  18.4956 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    39812
Residual Sum of Squares: 25925
R-Squared:      

#### H.2 Variety

In [14]:
%%R
# collect data for within-subject image similarity
within_subject_tags_query = 
"
SELECT ist.username, CASE WHEN cc.country IN ('australia', 'canada', 'italy') THEN 1 ELSE 0 END as treatment, 
CASE WHEN before_after = 'before' THEN 0 ELSE 1 END as after, image_similarity 
FROM image_similarity_within_tags ist 
INNER JOIN consumers_psm cp ON cp.username = ist.username 
INNER JOIN consumers_country cc ON cc.username = ist.username;
"
within_subjects_tags = dbGetQuery(con, within_subject_tags_query)
within_subjects_tags = within_subjects_tags[!duplicated(within_subjects_tags),] 
within_subjects_tags$after = factor(within_subjects_tags$after)
within_subjects_tags$treatment = factor(within_subjects_tags$treatment)

# image similarity did not differ between both treatment conditions nor before and after the intervention
ezANOVA(data = within_subjects_tags, 
        wid = username, 
        within = .(after), 
        between = .(treatment),
        dv = image_similarity)

R[write to console]:  Converting "username" to factor for ANOVA.

R[write to console]:  Data is unbalanced (unequal N per group). Make sure you specified a well-considered value for the type argument to ezANOVA().



$ANOVA
           Effect DFn DFd         F         p p<.05          ges
2       treatment   1 227 0.1333997 0.7152729       0.0005183815
3           after   1 227 0.4248760 0.5151734       0.0002197592
4 treatment:after   1 227 0.3231539 0.5702803       0.0001671541



In [15]:
%%R
# collect data for between-subjects image similarity (only compare treatment units with treatment units or control units with control units)
between_subjects_tags_query = 
"
SELECT username1, username2, username1_2, 
CASE WHEN before_after = 'after' THEN 1 ELSE 0 END as after,
CASE WHEN cc1.country IN ('australia', 'canada', 'italy') THEN 1 ELSE 0 END as treatment,
CAST(similarity as numeric)
FROM image_similarity_between_tags i
INNER JOIN consumers_country cc1 ON i.username1 = cc1.username
INNER JOIN consumers_country cc2 ON i.username2 = cc2.username
WHERE (cc1.country IN ('australia', 'canada', 'italy') AND cc2.country IN ('australia', 'canada', 'italy'))
OR (cc1.country NOT IN ('australia', 'canada', 'italy') AND cc2.country NOT IN ('australia', 'canada', 'italy'))
"
between_subjects_tags = dbGetQuery(con, between_subjects_tags_query)

# remove duplicates
between_subjects_tags = between_subjects_tags[!duplicated(between_subjects_tags[,c('username1_2', 'after')]),]
between_subjects_tags$after = factor(between_subjects_tags$after)
between_subjects_tags$treatment = factor(between_subjects_tags$treatment)

# the treatment group reacted significantly different to the intervention than the control group (repeated-measures ANOVA)
ezANOVA(data = between_subjects_tags, 
        wid = username1_2, 
        within = .(after), 
        between = .(treatment),
        dv = similarity)

R[write to console]:  Converting "username1_2" to factor for ANOVA.

R[write to console]:  Data is unbalanced (unequal N per group). Make sure you specified a well-considered value for the type argument to ezANOVA().



$ANOVA
           Effect DFn   DFd         F            p p<.05          ges
2       treatment   1 13810 71.801250 2.618727e-17     * 4.617082e-03
3           after   1 13810  8.931177 2.808368e-03     * 6.974213e-05
4 treatment:after   1 13810 69.793206 7.208961e-17     * 5.447451e-04



#### H.3 Like Behavior

In [16]:
%%R
likes = dbGetQuery(con, posts_likes_comments_query)
likes$after = likes$months_since_intervention >= 0 
likes$counter = likes$months_since_intervention

for(counter in 1:nrow(likes)){
    if(likes[counter, 'likes'] == 0){
        likes[counter, 'log_likes'] = 0 # to avoid log-transformation issues
    } else {
        likes[counter, 'log_likes'] = log(likes[counter, 'likes'])  
    }         
}

df.p = pdata.frame(likes, index=c('username', 'months_since_intervention'))
df.p$log_likes1000 = df.p$log_likes * 1000 # so that coefficients can be easier interpreted
fixed_effects_likes = plm(as.formula(paste('log_likes1000', "~ treatment + after + treatment:after + counter + followers_count + following_count")), data=df.p, model='within')
random_effects_likes = plm(as.formula(paste('log_likes1000', "~ treatment + after + treatment:after + counter + followers_count + following_count")), data=df.p, model='random')
phtest(fixed_effects_likes, random_effects_likes) # choose for random effects (p > .05)

# number of likes goes down after the intervention, especially among small-scale audiences
summary(random_effects_likes)

Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = as.formula(paste("log_likes1000", "~ treatment + after + treatment:after + counter + followers_count + following_count")), 
    data = df.p, model = "random")

Unbalanced Panel: n = 238, T = 1-25, N = 4436

Effects:
                   var  std.dev share
idiosyncratic 111290.1    333.6 0.182
individual    498599.0    706.1 0.818
theta:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5728  0.8862  0.8974  0.8926  0.9040  0.9059 

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2149.22  -180.64    19.13     0.36   190.44  1640.88 

Coefficients:
                       Estimate  Std. Error z-value  Pr(>|z|)    
(Intercept)         3280.592227   93.083612 35.2435 < 2.2e-16 ***
treatment            -94.735094   92.765368 -1.0212  0.307144    
afterTRUE            -66.739973   22.063356 -3.0249  0.002487 ** 
counter                6.150829    1.444412  4.2584 2.059e-05

### I. Randomized Experiment
<a id="experiment"></a>

<h5> I.1 Manipulation</h5>
<p>Our responses were recruited via a questionnaire-based <a href="https://tilburgss.co1.qualtrics.com/jfe/form/SV_brsLIHF0unNdsBT">experiment</a> on Prolific. We required workers to be aged between 18 to 30 years and active Instagram users to ensure they are familiar with the platform dynamics. Participants were randomly assigned to one of three same-gender conditions: high likes (128), low likes (15), and hidden likes. Individuals who self-identify as "Other" in the gender question were shown a picture of a woman.</p>

<img src="./images/experiment_pictures.png" align="left" width="800px"/>

<h5 style="clear: both;"> I.2 Target- and self-ratings</h5>
<p>They were asked to imagine that the person portrayed on the photo was their neighbor. Then, we asked participants to make specific evaluations of themselves and the target person in terms of likeability, popularity, and attractiveness:</p>

<img src="./images/target_self_ratings.png" alt="Questionnaire self and target ratings" align="left" width="600px"/>

<h5 style="clear: both;"> I.3 User behavior & demographics</h5>
<p>Thereafter, we measured user’s intention to like, comment, or share the picture after viewing the post. Finally, the questionnaire concluded by collecting the frequency of Instagram use and demographic information of the participants such as age and ethnicity. We received 600 responses which we analyze in the code blocks below.</p>

In [17]:
%%R
# import data
df = read.csv('./data/experiment.csv')

# exclude unknown genders (users who filled out "Other")
df = df[df$gender %in% c(1,2),]

# convert to factor
df$ethnicity = as.factor(df$ethnicity)

# construct reliability (Cronbach-Alpha)
alpha(df[,c('other.evaluation_1', 'other.evaluation_2', 'other.evaluation_3')]) #CR: 0.67
alpha(df[,c('self.evaluation_1', 'self.evaluation_2', 'self.evaluation_3')]) #CR: 0.78

# calculate aggregated target and self-ratings
df$target_rating = (df$other.evaluation_1 + df$other.evaluation_2 + df$other.evaluation_3)/3
df$self_rating = (df$self.evaluation_1 + df$self.evaluation_2 + df$self.evaluation_3)/3

# derive relative self-esteem (difference between target and self-ratings)
df$difference_evaluation = df$target_rating - df$self_rating

In [18]:
%%R
# there is a significant effect of the like count condition on relative self-esteem
anova = aov(difference_evaluation ~ condition + gender + age + ethnicity + instagram_usage, data=df)
print(summary(anova))

TukeyHSD(anova, which = "condition")

                 Df Sum Sq Mean Sq F value Pr(>F)  
condition         2   12.2   6.087   3.305 0.0374 *
gender            1    4.6   4.612   2.504 0.1141  
age               1    3.1   3.120   1.694 0.1936  
ethnicity         4   10.7   2.671   1.450 0.2161  
instagram_usage   1    4.9   4.913   2.667 0.1030  
Residuals       582 1072.0   1.842                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = difference_evaluation ~ condition + gender + age + ethnicity + instagram_usage, data = df)

$condition
                   diff        lwr          upr     p adj
high-hidden  0.05117642 -0.2718321  0.374184944 0.9264655
low-hidden  -0.27268164 -0.5975855  0.052222200 0.1200701
low-high    -0.32385806 -0.6400208 -0.007695346 0.0432387



In [19]:
%%R 
# here we conduct a similar analysis but this time as a 2 (source: self or target) X 3 (likes: low, high, hidden) mixed-model ANOVA (Vogel et al., 2014)
# this approach gives comparable results for the interaction between source and likes
df_melt = melt(df, colnames(df)[-c(51,52)])
anova_source = aov(value ~ variable + condition + variable:condition + gender + age + ethnicity + instagram_usage, data=df_melt)
summary(anova_source)

                     Df Sum Sq Mean Sq F value   Pr(>F)    
variable              1  372.4   372.4 356.032  < 2e-16 ***
condition             2    1.8     0.9   0.869  0.41977    
gender                1    0.0     0.0   0.022  0.88265    
age                   1    6.0     6.0   5.737  0.01677 *  
ethnicity             4   16.0     4.0   3.831  0.00424 ** 
instagram_usage       1   48.8    48.8  46.636 1.37e-11 ***
variable:condition    2    6.1     3.0   2.910  0.05486 .  
Residuals          1171 1224.8     1.0                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


In [20]:
%%R
# follow-up user behavior as a function of the like count
aggregate(df[,c('instagram_actions_1', 'instagram_actions_2', 'instagram_actions_3', 'instagram_actions_4')]-53, list(df$condition), mean)

  Group.1 instagram_actions_1 instagram_actions_2 instagram_actions_3
1  hidden            3.362162            1.686486            2.713514
2    high            3.368932            1.635922            2.509709
3     low            3.368159            1.716418            2.761194
  instagram_actions_4
1            1.870270
2            1.781553
3            2.049751


In [21]:
%%R
# the manipulation (condition) did not affect any of the dependent measures
manova_results = manova(cbind(instagram_actions_1, instagram_actions_2, instagram_actions_4) ~ condition + gender + age + ethnicity + instagram_usage, data = df)
summary(manova_results)

                 Df   Pillai approx F num Df den Df    Pr(>F)    
condition         2 0.005771   0.5605      6   1162   0.76200    
gender            1 0.029619   5.9012      3    580   0.00057 ***
age               1 0.094243  20.1162      3    580 2.050e-12 ***
ethnicity         4 0.071790   3.5672     12   1746 2.840e-05 ***
instagram_usage   1 0.058748  12.0668      3    580 1.135e-07 ***
Residuals       582                                              
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


In [22]:
%%R
# ordered probit regression - posting frequency (example)
# note that these type of models are prefered given the ordinal scale of the answer response options (yet a simple lm-model gives comparable results)
df$instagram_actions_4_fact = as.factor(df$instagram_actions_4)
post_probit = polr(instagram_actions_4_fact ~ condition + gender + age + ethnicity + instagram_usage + self_rating + target_rating, data = df, Hess = TRUE, method='probit')
print(summary(post_probit))
print(pR2(post_probit))
ocME(post_probit)$out

Call:
polr(formula = instagram_actions_4_fact ~ condition + gender + 
    age + ethnicity + instagram_usage + self_rating + target_rating, 
    data = df, Hess = TRUE, method = "probit")

Coefficients:
                   Value Std. Error t value
conditionhigh   -0.08909    0.10786 -0.8260
conditionlow     0.07798    0.10807  0.7216
gender           0.08151    0.08741  0.9325
age              0.03278    0.01232  2.6613
ethnicity2       0.33974    0.13819  2.4585
ethnicity3       0.05818    0.15940  0.3650
ethnicity4      -0.11820    0.12859 -0.9192
ethnicity5      -0.07347    0.31183 -0.2356
instagram_usage  0.05618    0.03931  1.4290
self_rating      0.38661    0.04185  9.2378
target_rating   -0.01271    0.04856 -0.2617

Intercepts:
      Value   Std. Error t value
53|54  1.9932  0.4381     4.5492
54|55  2.7028  0.4407     6.1333
55|56  3.1568  0.4432     7.1225
56|57  3.7745  0.4491     8.4043
57|58  4.2121  0.4558     9.2417
58|59  4.8627  0.4695    10.3579

Residual Deviance: 1967.7

<img src="./images/instagram_header.png" align="left"/>

*Klaasse Bos, R.J. (2020). Web Appendix: Goodbye Likes, Hello Mental Health: How Hiding Like Counts Affects User Behavior & Self-Esteem.*