# Dataset Documentation


## Observational Study


#### 1. Motivation of data collection 
The data were scraped from Instagram in July 2020 for academic purposes. The authors created and prepared the dataset for analysis.


#### 2. Composition of dataset 
The dataset provides a list of 111,960 historical posts from 316 Instagram users. The `consumers_psm` table can be used to select the usernames included in the matched sample. Image content associated with these posts has been analyzed with the Azure Computer Vision API to compute image similarity within and between subjects. All data have been normalized and separated into several tables, whose variable operationalisations we describe next.



#### `consumers_country`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| username | Instagram username of the account |
| country | The country of origin of an Instagram user |

#### `consumers_posts_json`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| username | Instagram username of the account |
| date | The date of data retrieval (i.e., scraping from Instagram) |
| posts | A dictionary that contains the individual posts of a user (formatted as JSON) |


#### `consumers_posts`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| content_type | Type of content: GraphImage (image), GraphSidecar (carousel), or GraphVideo (video) |
| media1 | Link to the media file (the URL signature is valid only for a limited time) |
| shortcode | Instagram post ID (visit instagram.com/p/{shortcode} to view the associated post) |
| total_comments | Number of comments at data retrieval |
| username | Instagram username of the account |
| video_views | Number of video views at data retrieval (null for images) |
| date | Publication date of the post |
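
As a small illustration of the `shortcode` and `content_type` columns, the sketch below builds a post URL from the pattern given in the table and maps the three content types to readable labels. The helper names `post_url` and `content_label` are hypothetical and are not part of the dataset or its code base.

```python
# Labels follow the content_type description in the table above.
CONTENT_LABELS = {
    "GraphImage": "image",
    "GraphSidecar": "carousel",
    "GraphVideo": "video",
}


def post_url(shortcode: str) -> str:
    """Return the public URL of a post, following instagram.com/p/{shortcode}."""
    return f"https://www.instagram.com/p/{shortcode}/"


def content_label(content_type: str) -> str:
    """Map a content_type value to a label; unknown values are passed through."""
    return CONTENT_LABELS.get(content_type, content_type)
```

Note that posts may have been deleted or made private since data retrieval, so a generated URL is not guaranteed to resolve.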

#### `consumers_profile`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| biography | The free-text section of a user's profile page for personal information |
| followers_count | The number of accounts that follow the user at data retrieval |
| following_count | The number of accounts a user follows at data retrieval |
| full_name | The first and last name of a user (may deviate from the username) |
| posts_count | The total number of posts on Instagram for a user at data retrieval |
| username | Instagram username of the account |



#### `consumers_profile_json`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| username | Instagram username of the account |
| date | The date of data retrieval (i.e., scraping from Instagram) |
| profile | A dictionary that contains the Instagram profile information of a user (formatted as JSON) |

#### `consumers_psm`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| username | Instagram username of the account (only includes usernames that are part of the matched sample) |
| type | Indicator variable for the test condition (either treatment or control) |
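
To illustrate how `consumers_psm` can be used to select the matched sample, the following minimal sketch filters post rows to matched usernames and attaches the test condition. It assumes rows are available as plain dictionaries (for example from a database cursor or `csv.DictReader`). The function name, the example usernames, and the literal strings `"treatment"` and `"control"` are assumptions for illustration.

```python
def restrict_to_matched_sample(posts, psm):
    """Keep posts by users listed in consumers_psm and attach their condition."""
    condition_by_user = {row["username"]: row["type"] for row in psm}
    return [
        {**post, "type": condition_by_user[post["username"]]}
        for post in posts
        if post["username"] in condition_by_user
    ]


# Made-up rows for illustration only; real rows come from the database.
psm = [
    {"username": "user_a", "type": "treatment"},
    {"username": "user_b", "type": "control"},
]
posts = [
    {"username": "user_a", "shortcode": "aaa"},
    {"username": "user_c", "shortcode": "ccc"},  # not in the matched sample
]
matched = restrict_to_matched_sample(posts, psm)
```

The same selection could equally be expressed as an inner join on `username` in SQL.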

#### `hypeauditor`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| followers_from_country | The number of accounts that follow the user from a given country at data retrieval (i.e., a subset of `total_follower`) |
| topics | Tag words associated with the image content of the user |
| total_follower | The total number of accounts that follow the user at data retrieval |
| url | Link to the HypeAuditor ranking for a topic (the country can be extracted from the URL) |
| username | Instagram username of the influencer account |
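
Because `followers_from_country` is a subset of `total_follower`, the two columns together give the share of a user's audience in a country. The sketch below shows that arithmetic; the function name `country_share` is hypothetical, and whether the percentages reported elsewhere in the dataset were computed in exactly this way is an assumption.

```python
def country_share(followers_from_country: int, total_follower: int) -> float:
    """Percentage of a user's followers that come from the given country."""
    if total_follower <= 0:
        raise ValueError("total_follower must be positive")
    return 100.0 * followers_from_country / total_follower
```

For example, 250 followers from a country out of 1,000 total followers correspond to a share of 25 percent.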

#### `influencer_country_purity`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| country | The major country of origin among the influencer's followers |
| percentage_country | The percentage of the influencer's followers located in `country` |
| username | Instagram username of the influencer account |

#### `image_similarity_between_tags`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| username1 | Instagram username of the account whose image content is compared with username2 |
| username2 | Instagram username of the account whose image content is compared with username1 |
| username1_2 | Concatenation of username1 and username2 in alphabetical order, so that the mirrored pairs (username1, username2) and (username2, username1) share one identifier |
| before_after | Indicator variable for the intervention time frame (either before or after) |
| similarity | Mean cosine similarity of the images of username1 and username2 before or after hiding like counts |
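
The `username1_2` column makes it possible to treat mirrored pairs as a single observation. A minimal sketch of such an order-independent key follows. The function name `pair_key`, the `"_"` separator, and the example rows are assumptions, since the documentation does not state the exact separator or sort order used in the dataset.

```python
def pair_key(username1: str, username2: str, sep: str = "_") -> str:
    """Join two usernames in sorted order so (a, b) and (b, a) give one key."""
    first, second = sorted((username1, username2))
    return f"{first}{sep}{second}"


# Made-up rows: (username1, username2, before_after, similarity).
rows = [
    ("zoe", "adam", "before", 0.42),
    ("adam", "zoe", "before", 0.42),  # mirrored duplicate of the row above
]

# Keying on (pair, period) collapses mirrored pairs into a single entry.
unique = {(pair_key(u1, u2), period): sim for u1, u2, period, sim in rows}
```

Sorting here uses Python's default string ordering, which places uppercase letters before lowercase ones.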

#### `image_similarity_within_tags`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| username | Instagram username of the account |
| before_after | Indicator variable for the intervention time frame (either before or after) |
| image_similarity | Mean cosine similarity of all images of a user before or after hiding like counts |
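
To make the notion of mean cosine similarity concrete, the sketch below averages the cosine similarity over all pairs of image vectors. It assumes each image is represented as a numeric vector of equal length (for instance, indicators over a shared tag vocabulary). How the authors built these vectors and which pairs they averaged over is not documented here, so this is an illustration of the measure rather than their exact procedure.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return None  # fewer than two images: the mean is undefined
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

For a within-user score, `vectors` would hold one user's images from a single period (before or after); a between-user score would instead average over pairs that take one image from each user.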

#### 3. Collection Process
The data were collected in July 2020 and comprise posts published between 2011 and 2020 (depending on when users joined Instagram). We used an [external library](https://github.com/arc298/instagram-scraper) to scrape posts from selected Instagram accounts. Our seed set of influencer accounts was obtained in February 2020 from [HypeAuditor](https://hypeauditor.com/top-instagram/). The dataset comprises only data from users whose Instagram account is public (no private data). Image content data were obtained through the [Azure Computer Vision API](https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/).

#### 4. Preprocessing
A detailed description of the preprocessing of the raw JSON data can be found in the [web appendix notebook](https://github.com/RoyKlaasseBos/Hiding-Instagram-Likes/blob/master/Web_Appendix_Hiding_Like_Counts_Instagram.ipynb). The data were cleaned using a combination of Python and R.

#### 5. Uses
The dataset has been used to study the effect of hiding like counts on user behavior and self-esteem. To this end, it contains observational data from before and after the intervention. Other potential use cases include natural language processing and object recognition (deep learning).

#### 6. Distribution
The repository containing all code is publicly available on GitHub. Raw and preprocessed data are stored in a cloud database. Anonymized data may be requested for academic purposes only.

#### 7. Maintenance
The dataset is stored on Amazon Web Services (RDS) and backed up daily.

## Experiment

#### `experiment`

| **Variable** | **Operationalisation** |
| :---- | :--- |
| gender | The gender of a participant (1=Male, 2=Female, 3=Other) |
| condition | The test condition a participant was assigned to (low, high, or hidden likes) |
| gender_condition | The combination of the gender of the person shown in the image and the like count test condition ({female, male} × {low, high, hidden}) |
| other.evaluation_1 | The extent to which the target person is considered likeable (1=Strongly disagree, 7=Strongly agree) |
| other.evaluation_2 | The extent to which the target person is considered popular (1=Strongly disagree, 7=Strongly agree) |
| other.evaluation_3 | The extent to which the target person is considered attractive (1=Strongly disagree, 7=Strongly agree) |
| self.evaluation_1 | The extent to which the participant considers himself/herself likeable (1=Strongly disagree, 7=Strongly agree) |
| self.evaluation_2 | The extent to which the participant considers himself/herself popular (1=Strongly disagree, 7=Strongly agree) |
| self.evaluation_3 | The extent to which the participant considers himself/herself attractive (1=Strongly disagree, 7=Strongly agree) |
| instagram_actions_1 | The participant's likelihood to like the photo (1=Extremely unlikely, 7=Extremely likely) |
| instagram_actions_2 | The participant's likelihood to comment on the photo (1=Extremely unlikely, 7=Extremely likely) |
| instagram_actions_3 | The participant's likelihood to follow the Instagram user (1=Extremely unlikely, 7=Extremely likely) |
| instagram_actions_4 | The participant's likelihood to share a photo on Instagram (1=Extremely unlikely, 7=Extremely likely) |
| instagram_usage | The average daily time spent on Instagram (1=Less than 10 minutes, 2=11-30 minutes, 3=31-60 minutes, 4=1-2 hours, 5=2-3 hours, 6=More than 3 hours) |
| age | The participant's age (in years). The survey was conducted among 18- to 30-year-olds. |
| ethnicity | The participant's ethnicity (1=White, 2=Black, 3=Hispanic, 4=Asian, 5=Other) |
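
The numeric codes in the table can be translated back to their labels with a small codebook. The sketch below copies the mappings for `gender`, `ethnicity`, and `instagram_usage` from the table; the names `CODEBOOK` and `decode_row` are hypothetical, and the sketch assumes the codes are stored as integers.

```python
GENDER = {1: "Male", 2: "Female", 3: "Other"}
ETHNICITY = {1: "White", 2: "Black", 3: "Hispanic", 4: "Asian", 5: "Other"}
INSTAGRAM_USAGE = {
    1: "Less than 10 minutes",
    2: "11-30 minutes",
    3: "31-60 minutes",
    4: "1-2 hours",
    5: "2-3 hours",
    6: "More than 3 hours",
}
CODEBOOK = {
    "gender": GENDER,
    "ethnicity": ETHNICITY,
    "instagram_usage": INSTAGRAM_USAGE,
}


def decode_row(row):
    """Return a copy of row with coded columns replaced by their labels."""
    decoded = dict(row)
    for column, labels in CODEBOOK.items():
        if column in decoded:
            # Codes outside the codebook (e.g. missing values) are left as-is.
            decoded[column] = labels.get(decoded[column], decoded[column])
    return decoded
```

Columns that are not in the codebook, such as the 7-point evaluation items, pass through unchanged.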






---

*Klaasse Bos, R.J. (2020). Dataset Documentation: Goodbye Likes, Hello Mental Health: How Hiding Like Counts Affects User Behavior & Self-Esteem.*