# Social Influence Project - Reddit Submission Popularity RCT <a class="anchor" id="first-bullet"></a>

This notebook concretely illustrates the data gathering and analysis for which the main paper is based upon.

# Table of contents:
TODO

# Data gathering setup and pipeline

The following section contains all the relevant python code used to interface with reddit and store the resulting data in a local SQLite database. Keep in mind that the code is not meant to be deployed and run from within a jupyter notebook. Instead, it was designed such that the entrypoint `main.py` can be run on a schedule with a cronjob. For a greater viewing experience and better overview, it is recommended to visit [the GitHub repo](https://github.com/NValsted/RDS-Project-2022-1)

### Database interface

In [1]:
#src/database.py

from dataclasses import dataclass
from typing import Optional, TypeVar, List, Type
from contextlib import contextmanager

from sqlalchemy.engine import Engine
from sqlalchemy.sql.schema import Table
from sqlmodel import create_engine, SQLModel, Session

ModelType = TypeVar("ModelType", bound=SQLModel)


@dataclass
class Database:
    """
    Database class with methods to create/drop tables and add/retrieve table entries
    """
    engine: Engine

    @contextmanager
    def session(self):
        with Session(self.engine) as session:
            yield session

    def create_tables(self, tables: Optional[List[Table]] = None) -> None:
        SQLModel.metadata.create_all(self.engine, tables=tables)

    def drop_tables(self, tables: Optional[List[Table]] = None) -> None:
        SQLModel.metadata.drop_all(self.engine, tables=tables)

    def add(self, instances: List[ModelType]) -> None:
        with self.session() as session:
            session.add_all(instances)
            session.commit()

    def get(self, model: Type[ModelType], id: int) -> Optional[ModelType]:
        with Session(self.engine) as session:
            matches = session.query(model).filter(model.id == id).all()
            if len(matches) > 1:
                raise ValueError(
                    f"Multiple matches for {id=} in {model.__name__}:\n{matches}"
                )
            elif len(matches) == 0:
                return None
            return matches[0]


@dataclass
class DBFactory:
    """
    Factory to create Database instances
    """
    engine_url: str = "sqlite:///../database.db"

    def __call__(self, *args, **kwargs) -> Database:
        engine = create_engine(url=self.engine_url, **kwargs)
        return Database(engine=engine, **kwargs)

### Model and database table definitions
Python Pydantic models and SQLite table definitions are made simultaneously using the SQLModel ORM capabilities

A `RedditPost` entry will be created once when a post is fetched the first time, which includes generic metadata about the post, while a `RedditPostLogPoint` will be created periodically, which is responsible for keeping track of the score and number of comments. 

In [2]:
# src/database_models.py
from enum import Enum
from datetime import datetime
from typing import Optional

from sqlalchemy import Column, Enum as SAEnum
from sqlmodel import SQLModel, Field


class GroupEnum(str, Enum):
    CONTROL = "CONTROL"
    TREATMENT = "TREATMENT"


class RedditPost(SQLModel):
    id: str = Field(primary_key=True, index=True)
    batch_id: str = Field(
        index=True,
        description="Unique ID of the batch in which the post was added",
    )
    active: bool = Field(
        default=True, description="Indicates whether the post is reachable"
    )
    group: GroupEnum = Field(sa_column=Column(SAEnum(GroupEnum)))
    subreddit: str = Field()
    title: str = Field()
    creation_date: datetime = Field(description="Date at which post was created")


class RedditPostTable(RedditPost, table=True):
    __tablename__ = "RedditPost"


class RedditPostLogPoint(SQLModel):
    pk: Optional[int] = Field(primary_key=True, default=None, index=True)
    id: str = Field(index=True)
    score: int = Field()
    num_comments: int = Field()
    date: datetime = Field(description="Date at which stats were collected")


class RedditPostLogPointTable(RedditPostLogPoint, table=True):
    __tablename__ = "RedditPostLogPoint"

### Utilities
A few utility functions are also defined which are primarily concerned with increasing the robustness of the data gathering solution.

The logger and its factory method provide a structured interface for saving logs persistently to disk when the main job runs for longer periods of time without manual intervention, and the `safe_call` will prevent the process from terminating immediately on any error - e.g. if a single post out of thousands in a batch raise a 404 error because it was deleted, then we still want to process the rest of the batch. Likewise, it makes sense to retry requests if the connection temporarily drops.

In [3]:
# src/utils.py
from typing import Callable, Any, List, Dict, Optional
from time import sleep
import logging
import traceback

from prawcore import exceptions


def get_logger(name: str = "RDS-PROJECT") -> logging.Logger:
    logger = logging.getLogger(name)
    fhandler = logging.FileHandler(filename="logs.log", mode="a")
    formatter = logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    )
    fhandler.setFormatter(formatter)
    logger.addHandler(fhandler)
    logger.setLevel(logging.DEBUG)
    return logger


def safe_call(
    func: Callable,
    args: Optional[List] = None,
    kwargs: Optional[Dict] = None,
    max_retries: int = 3,
    sleep_time: int = 1,
    exception: Exception = exceptions.NotFound,
    raise_on_failure: bool = True,
) -> Any:
    """
    Wraps a function and retries it if it raises an exception.
    """
    logger = get_logger()

    if args is None:
        args = []
    if kwargs is None:
        kwargs = {}

    error = Exception("Unknown error")

    while max_retries > 0:
        try:
            return func(*args, **kwargs)
        except exception as e:
            error = e
            max_retries -= 1
            logger.info(
                f"{func.__name__} failed with args {args} and kwargs {kwargs}\n"
                f"{e}\n{traceback.format_exc()}"
                f"{max_retries} retries left"
            )
            sleep(sleep_time)

    if raise_on_failure:
        logger.error(f"Failed to execute function {func.__name__}")
        raise error

### The reddit interface
Interfacing with reddit is done via the praw reddit API wrapper, which in turn is wrapped in the RedditBot class. This class provides methods to:
- Fetch new posts:
    - A random batch of new posts can be fetched with `get_batch_of_posts`
    - This batch of posts can be randomly divided into CONTROL and TREATMENT groups with `group_posts`
- Interface with database:
    - Add:
        - Posts (i.e. metadata about subreddit, creation_date, CONTROL/TREATMENT group, etc.) can be added with `add_posts_to_db`
        - Log points (i.e. observations of score and number of comments) can be added with `add_log_points`
    - Get:
        - Posts no older than a certain amount of days can be retrieved from the database with `get_stored_posts`
- The data (title, score, etc.) for a list of posts can be fetched given a list of ids with `get_posts`

In [4]:
# src/reddit_bot.py
import os
import random
import traceback
from datetime import datetime, timedelta
from multiprocessing.pool import ThreadPool
from typing import Tuple, List, Optional, Dict
from uuid import uuid4
import json

import praw

# Local imports suppressed in notebook cells since the objects are already available in scope.
# from src.database import DBFactory
# from src.database_models import RedditPostTable, RedditPostLogPointTable, GroupEnum
# from src.utils import get_logger, safe_call

CLIENT_ID = os.getenv("CLIENT_ID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")
USERNAME = os.getenv("USERNAME")
PASSWORD = os.getenv("PASSWORD")
USER_AGENT = os.getenv("USER_AGENT")
RATELIMIT = int(os.getenv("RATELIMIT", 5))

logger = get_logger("REDDIT-BOT")


class RedditBot:
    """
    Wrapper for the Reddit bot.

    It contains the following methods:
    - get_batch_of_posts: Selects a batch of posts for the experiment
    - group_posts: Groups a list of posts into treatment and control groups
    - add_posts_to_db: Adds a list of posts to the database
    - add_log_points: Adds a list of log points to the database
    - get_stored_posts: Fetches posts from the database given a date filter
    - get_posts: Fetches posts from the Reddit API given a list of ids
    """

    reddit: praw.Reddit
    url: str = "https://www.reddit.com"

    def __init__(
        self,
        client_id: str = CLIENT_ID,
        client_secret: str = CLIENT_SECRET,
        username: str = USERNAME,
        password: str = PASSWORD,
        user_agent: str = USER_AGENT,
        ratelimit: int = RATELIMIT,
    ):
        """
        Authenticates the bot and initializes the Reddit instance.
        """
        assert isinstance(client_id, str)
        assert isinstance(client_secret, str)
        assert isinstance(username, str)
        assert isinstance(password, str)
        assert isinstance(user_agent, str)
        assert isinstance(ratelimit, int)

        self.reddit = praw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            username=username,
            password=password,
            user_agent=user_agent,
            ratelimit=ratelimit,
        )

    def get_batch_of_posts(
        self,
        subreddit: str = "all",
        score: int = 1,
        num_comments: int = 1,
        batch_size: int = 64,
    ):
        """
        Selects a batch of posts with at most 'score' number of upvotes and
        'num_comments' number of comments in the given subreddit.

        NOTE: batch_size is an upper bound on the number of posts returned.
        """

        posts = [
            post
            for post in self.reddit.subreddit(subreddit).new(limit=batch_size)
            if post.score <= score and post.num_comments <= num_comments
        ]

        return posts

    @staticmethod
    def group_posts(
        posts: List[praw.models.Submission],
    ) -> Tuple[List[praw.models.Submission], List[praw.models.Submission]]:
        """
        Assigns posts into treatment and control groups.
        """

        batch_id = str(uuid4())
        random.shuffle(posts)
        if len(posts) % 2 != 0:
            posts.pop()  # Drop a random post to make the list even

        middle = len(posts) // 2
        treatment_posts = posts[:middle]
        control_posts = posts[middle:]

        for post in treatment_posts:
            post.upvote()
            post.group = GroupEnum.TREATMENT
            post.batch_id = batch_id

        for post in control_posts:
            post.group = GroupEnum.CONTROL
            post.batch_id = batch_id

        return treatment_posts, control_posts

    @staticmethod
    def add_posts_to_db(
        posts: List[praw.models.Submission],
        backup: bool = False,
    ) -> None:

        prepared_posts = []

        def _prepare_post(post: praw.models.Submission) -> Dict:
            return dict(
                id=post.id,
                batch_id=post.batch_id,
                group=post.group,
                subreddit=str(post.subreddit),
                title=post.title,
                creation_date=post.created_utc,
            )

        for post in posts:
            prepared_post = safe_call(
                _prepare_post,
                args=[post],
                exception=Exception,
                raise_on_failure=False,
            )
            if prepared_post is not None:
                prepared_posts.append(prepared_post)

        if backup:
            today = datetime.today().date().isoformat()
            with open(f"backup/REDDITBOT_{today}_{str(uuid4())}.json", "w") as f:
                json.dump(prepared_posts, f, indent=4, default=str)

        db = DBFactory()()

        parsed_posts = [RedditPostTable(**post) for post in prepared_posts]
        db.add(parsed_posts)
        logger.info(f"Added {len(parsed_posts)} posts to the database")

        RedditBot.add_log_points(posts, backup=backup)

    @staticmethod
    def add_log_points(
        posts: List[praw.models.Submission], backup: bool = False
    ) -> None:

        prepared_posts = []
        stale_posts = []

        def _prepare_post(post: praw.models.Submission) -> Dict:
            return dict(
                id=post.id,
                score=post.score,
                num_comments=post.num_comments,
                date=datetime.now(),
            )

        for post in posts:
            prepared_post = safe_call(
                _prepare_post,
                args=[post],
                exception=Exception,
                raise_on_failure=False,
            )
            if prepared_post is not None:
                prepared_posts.append(prepared_post)
            else:
                stale_posts.append(post)

        if backup:
            today = datetime.today().date().isoformat()
            with open(f"backup/REDDITBOT_{today}_{str(uuid4())}.json", "w") as f:
                json.dump(prepared_posts, f, indent=4, default=str)

        db = DBFactory()()

        parsed_posts = [RedditPostLogPointTable(**post) for post in prepared_posts]
        db.add(parsed_posts)
        logger.info(f"Added {len(parsed_posts)} log points to database")

        for post in stale_posts:
            old_instance = db.get(RedditPostTable, id=post.id)
            if old_instance is not None:
                old_instance.active = False
                db.add([old_instance])

        logger.info(f"Marked {len(stale_posts)} stale posts")

    @staticmethod
    def get_stored_posts(max_age: int = 8) -> List[RedditPostTable]:
        db = DBFactory()()

        with db.session() as session:
            posts = (
                session.query(RedditPostTable)
                .filter(
                    RedditPostTable.creation_date
                    >= (datetime.now() - timedelta(days=max_age))
                )
                .filter(RedditPostTable.active)
                .all()
            )

            logger.info(f"Fetched {len(posts)} active posts from the database")

            return posts

    def _submission_wrapper(self, *args, **kwargs) -> Optional[praw.models.Submission]:
        try:
            return self.reddit.submission(*args, **kwargs)
        except Exception as e:
            logger.error(
                f"{e}\n{traceback.format_exc()}\nargs: {args}\nkwargs: {kwargs}"
            )
            return None

    def get_posts(
        self, ids: List[str], threads: int = 4
    ) -> List[praw.models.Submission]:
        with ThreadPool(threads) as pool:
            posts = pool.map(self._submission_wrapper, ids)

        posts = [post for post in posts if post is not None]

        return posts

### One-time entrypoint - `setup.py`
The `setup` function simply establishes a connection to- and possibly creates the database, after which it drops any existing tables and creates new ones afresh.

In [5]:
# setup.py

# Local imports suppressed in notebook cells since the objects are already available in scope.
# from src.database_models import RedditPostTable, RedditPostLogPointTable  # NOQA : F401
# from src.database import DBFactory


def setup():
    db = DBFactory()()
    db.drop_tables()
    db.create_tables()

### Main entrypoint - `main.py`
With everything set up, the `main` function defines a simple routine for monitoring the stats of previously fetched posts as well as fetching a batch of newly created posts. This is the function that is deployed to run periodically. Check [the GitHub repo](https://github.com/NValsted/RDS-Project-2022-1) for more details.

In [6]:
# main.py

# Local imports suppressed in notebook cells since the objects are already available in scope.
# from src import RedditBot
# from src.utils import safe_call


def main():
    bot = RedditBot()

    # Update old Posts
    posts = bot.get_stored_posts()
    ids = {post.id for post in posts}

    posts = bot.get_posts(ids)
    bot.add_log_points(posts)

    # New posts
    treatment, control = safe_call(
        func=lambda: bot.group_posts(bot.get_batch_of_posts())
    )
    bot.add_posts_to_db(treatment)
    bot.add_posts_to_db(control)

# Data analysis
At this point, over 20000 posts have been fetched and monitored over the course of 7 days each, which has resulted in around 1.5 million log points. This section marks the start of an analysis of the resulting data.

A snapshot of the database is available at https://ituniversity-my.sharepoint.com/:u:/g/personal/nicv_itu_dk/ESyJlN06ZbJEsYJSpBH6zSEB8IKTH5iKjAMcHIjumsXfIQ?e=YGatIC

In [7]:
from datetime import timedelta

import pandas as pd
import numpy as np
from scipy import stats
from tqdm import tqdm

## Load data

In [8]:
reddit_post_df = pd.read_sql_table(RedditPostTable.__tablename__, DBFactory.engine_url)
reddit_post_log_point_df = pd.read_sql_table(RedditPostLogPointTable.__tablename__, DBFactory.engine_url)

## Preprocessing

Before commencing with the analysis, a little preprocessing is beneficial, e.g. due to the fact that certain posts are marked inactive since they have been deleted or otherwise made unreachable, which has unbalanced the dataset slightly.

The preprocessing steps are the following (Which are intertwined in practice for convenience):
- Join post info and log points (`RedditPostTable` and `RedditPostLogPointTable`)
- Balance dataset. 
    - For each inactive post, identify a post from the conjugate group with the same `batch_id` and filter away both.
    - Filter away any post that has not been monitored for at least 7 days (i.e. younger than 7 days or marked inactive before the 7 day mark).
- Create derived columns: `age` and `saturation`.
- Unbias treatment group by subtracting 1 from the score
- For each of the two groups, create dataframes containing only the latest log point for a post.

These steps will be described further in the relevant cells.

### Join dataframes
A join between the RedditPost and RedditPostLogPoint on the `id` column is done (reddit's id for a given post), which essentially yields RedditPostLogPoint but with all the relevant metadata attached for the post which the log point corresponds to. A random sample of the dataframe is shown as an example of the resulting format.

In [123]:
joined = pd.merge(reddit_post_df, reddit_post_log_point_df, on="id", how="left")

In [124]:
joined.sample(n=5)

Unnamed: 0,group,id,batch_id,active,subreddit,title,creation_date,pk,score,num_comments,date
409005,TREATMENT,ta0bzb,151f376f-c0c4-46e1-9e70-e61037d1ce8d,True,DealAndSale,üî•50% Off Code ‚Äì $19.99 Waterproof Solar Panel ...,2022-03-09 05:12:40,331365,2,0,2022-03-12 09:14:58.107552
134850,TREATMENT,t80w05,21f1f24d-4c5e-4e59-a376-d57474fd5c0a,True,danganronpa,(Future Arc) I figured out what the ‚ÄúNG‚Äù in ‚ÄúN...,2022-03-06 15:19:35,106452,515,25,2022-03-08 03:49:24.295059
998993,TREATMENT,tp5g27,08b3d48f-1c14-4501-a4a9-c133aaf20411,True,relationship_advice,"I (23 m) my gf (20 f), she broke up with me a ...",2022-03-26 22:06:31,995099,2,3,2022-03-30 16:05:02.038658
370403,TREATMENT,t9giz1,d684ad72-d692-4f5e-90d9-c0178e92a1c2,True,BayleyBooty,its fucking huge,2022-03-08 13:13:14,146927,11,0,2022-03-08 15:29:02.891053
1175679,TREATMENT,ttn5nk,8bad43ba-4009-4a20-8f7a-13e8bdb457de,True,DBZDokkanBattle,Global is dead,2022-04-01 10:06:33,1105329,199,53,2022-04-03 06:17:46.574560


### Derived columns p0 - age
The `age` column simply denotes how long it has been since a post was created when a log point was recorded.

In [125]:
joined["age"] = joined["date"] - joined["creation_date"]

### Unbias TREATMENT group
The initial upvote of 1 is subtracted from the TREATMENT group, since we are interested in investigating whether the treatment has increased popularity as expressed by external users - i.e. all users that are not our bot.

Perhaps, this is best explained with an example in the extremes: If we assume an artificial world where no other users interact with posts on the platform, then we can safely say beforehand that the treatment will not have an effect. If we then perform our experiment, all posts in the CONTROL group will have a score of 0, and all posts in the TREATMENT group will have a score of 1. If we were to perform any analysis on the resulting data without unbiasing the TREATMENT group, we would wrongfully conclude effectiveness of the TREATMENT due to the difference in score distributions. 

In [126]:
joined.loc[joined["group"] == GroupEnum.TREATMENT.value, ("score")] -= 1

### Latest posts
We have record many log points for each individual post, so the `joined_latest` dataframe consists of only the latest log point for each post  

In [127]:
joined_latest = joined.sort_values(["date"], ascending=False).groupby(by="id", as_index=False).first()

### Balance dataset
To balance the dataset, we want remove any posts that have not yet been tracked for at least 7 days, and we also want to ensure that there are equally many data points in the control and treatment groups. Since we sample an equal amount of control and treatment posts in each batch, the only reason these can be different is if one or more posts have been marked inactive (i.e. deleted or otherwise unreachable)

In [128]:
ids_to_drop = set()

First, filter posts younger than 7 days:

In [129]:
for idx, row in joined_latest[joined_latest["age"] < timedelta(days=7)].iterrows():
    ids_to_drop.add(str(row["id"]))

Then balance pairs within batches, prioritizing removing conjugate posts which are also inactive - e.g. if a batch contains a single inactive control post and inactive treatment post, then these two should simply cancel out. Otherwise, simply sample a random active post of the conjugate group to cancel out. 

In [130]:
inactive_posts = joined_latest[joined_latest["active"] == False]

for group in inactive_posts.groupby(by="batch_id", as_index=False):
    group_key, group_df = group

    control_remainder, treatment_remainder = (
        joined_latest[
            (joined_latest["batch_id"] == group_key)
            & (joined_latest["group"] == group.value)
            & ~(joined_latest["id"].isin(set(group_df["id"])))
        ]
        for group in (GroupEnum.CONTROL, GroupEnum.TREATMENT)
    )
    num_control = control_remainder.shape[0]
    num_treatment = treatment_remainder.shape[0]
    
    if num_control != num_treatment:
        conjugate_remainder = control_remainder if num_treatment < num_control else treatment_remainder
        num_to_drop = abs(num_control - num_treatment)
        to_drop = conjugate_remainder.sample(n=num_to_drop)
        ids_to_drop.add(str(to_drop["id"]))
        
    for entry in group_df["id"]:
        ids_to_drop.add(str(entry))

Apply filters

In [131]:
print(f"Dropping {len(ids_to_drop)} post(s)")

joined_latest = joined_latest[~(joined_latest["id"].isin(ids_to_drop))]
# joined = joined[~(joined["id"].isin(ids_to_drop))]

Dropping 4046 post(s)


### Derived columns p1 - saturation
Saturation is useful in a temporal context and describes the ratio of some maximum value for a post. E.g. a post with `score` values 0, 50, and 100 will be mapped to `score_saturation` values of 0, 0.5, 1.0.

In [132]:
joined["score_saturation"] = joined["score"] / joined.groupby("id")["score"].transform(np.max)
joined["num_comments_saturation"] = joined["num_comments"] / joined.groupby("id")["num_comments"].transform(np.max)

### Overview of resulting dataframe

In [133]:
joined.sample(n=5)

Unnamed: 0,group,id,batch_id,active,subreddit,title,creation_date,pk,score,num_comments,date,age,score_saturation,num_comments_saturation
767375,TREATMENT,ti2bl1,bda7a90c-4b82-4cfe-8848-14a8f5fc5c75,True,natureporn,"Bijela, Bosnia and Herzegovina",2022-03-19 19:06:44,733526,22,0,2022-03-22 15:48:54.255870,2 days 20:42:10.255870,0.88,
944012,CONTROL,tndnmc,db09c191-5065-4a2e-8203-7eaf8e3ad80a,True,crossfit,Am I the only idiot that can‚Äôt figure out how ...,2022-03-25 04:06:48,828782,2,2,2022-03-25 15:08:55.277513,0 days 11:02:07.277513,1.0,1.0
388521,TREATMENT,t9mqzs,6cf848ad-85d1-4660-9352-3eced31bcbe7,True,autotldr,Biden set to ban Russian oil under pressure fr...,2022-03-08 18:00:40,350172,1,0,2022-03-12 18:12:55.308119,4 days 00:12:15.308119,1.0,
38470,TREATMENT,t7due5,4ffbf91c-01c8-444d-b1cf-f70095977a11,True,AskReddit,You are a main author in for 2000's cartoon. W...,2022-03-05 17:04:58,38575,0,4,2022-03-07 02:21:32.407340,1 days 09:16:34.407340,0.0,1.0
1071130,TREATMENT,tqrbze,be3e6add-00f3-4548-90f5-33143fae2199,True,Wonderlands,Seeing that Top 1 always makes me feel good.,2022-03-29 04:06:54,1194339,0,6,2022-04-06 03:08:39.028365,7 days 23:01:45.028365,0.0,1.0


### Split dataframe in groups
Finally, we simply create dataframes containing only values from the respective groups for convenience.

In [134]:
control, treatment = (joined[joined["group"] == group.value] for group in (GroupEnum.CONTROL, GroupEnum.TREATMENT))
control_latest, treatment_latest = (
    joined_latest[joined_latest["group"] == group.value] for group in (GroupEnum.CONTROL, GroupEnum.TREATMENT)
)

## Distribution similarity
The following section investigates the similarity between the score and number of comments distributions between the two groups, in order to evaluate the effectiveness of the treatment.

The following plot displays histograms of the data points within each group. The general distribution type seems similar, but parameters seem to be noticeably different.

In [135]:
from plotly import io as pio, graph_objects as go, express as px
from plotly.subplots import make_subplots
pio.renderers.default = "iframe"
# Jupyter notebooks don't handle plotly figures well.
# Therefore, iframes and %%capture are used to save the resulting html files to disk instead.
# These should appear in the iframe_figures directory, following the figure_{cell_number}.html naming scheme.

In [136]:
%%capture

fig = make_subplots(
    rows=2,
    cols=2,
    shared_yaxes=True,
    subplot_titles=("Score distribution", "Number of comments distribution"),
    horizontal_spacing=0.05,
)

for i in (1, 2):
    for j, attr in ((1, "score"), (2, "num_comments")):
        for df, color in ((control_latest, "#4969AA"), (treatment_latest, "#C90D0D")):
            fig.add_trace(
                go.Histogram(
                    x=df[attr],
                    histnorm="percent",
                    name=df["group"].min(),
                    xbins=dict(
                        start=df[attr].min(),
                        end=df[attr].max(),
                        size=0.5
                    ),
                    marker_color=color,
                    opacity=0.75,
                    showlegend=i == 1 and j == 1
                ),
                row=i,
                col=j,
            )

        fig.update_layout(
            yaxis_title_text="Percent",
            bargap=0.2,
            bargroupgap=0.1
        )

fig.update_xaxes(type="log", row=1, col=1)
fig.update_xaxes(type="log", row=1, col=2)
fig.update_xaxes(title_text="Score (log)", row=2, col=1, type="log")
fig.update_xaxes(title_text="Number of comments (log)", row=2, col=2, type="log")

fig.update_yaxes(title_text="Percent (log)", type="log", row=2, col=1)
fig.update_yaxes(type="log", row=2, col=2)

fig.show()

### Kolmogorov-Smirnov test
With the Kolmogorov-Smirnov test, we perform an empirical test to investigate whether the data points appear to belong to the same distribution. The null hypothesis is that the distributions are identical, and thus a low p-value indicates that there is a significant difference between the control and treatment groups. As can be seen below, the results are in favour of the alternative hypothesis.

In [137]:
stats.kstest(control_latest["score"], treatment_latest["score"])

KstestResult(statistic=0.05385415462845256, pvalue=2.6247763454387786e-11)

In [138]:
stats.kstest(control_latest["num_comments"], treatment_latest["num_comments"])

KstestResult(statistic=0.03548486826193091, pvalue=3.7392165653535435e-05)

However, small fluctuations might provide an inaccurate picture - e.g. a post with a score of 34 is probably not significantly different from one with a score of 35. So we will investigate whether rounding the score and number of comments to the nearest multiple of a range of numbers will jeopardize the result above.

In [139]:
binned_ks_test_score = [stats.kstest(*(df["score"].map(lambda x: i * round(x / i)) for df in (control_latest, treatment_latest))).pvalue for i in tqdm(range(1, 10))]
binned_ks_test_num_comments = [stats.kstest(*(df["num_comments"].map(lambda x: i * round(x / i)) for df in (control_latest, treatment_latest))).pvalue for i in tqdm(range(1, 10))]

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 9/9 [00:47<00:00,  5.29s/it]
100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñ

...But as the plot below shows, even pessimistically trying several different bin sizes, the maximum p-value is still no larger than on the order of $10^{-3}$, which occurs for the score distributions when rounding the score to the nearest multiple of 5. Thus, it seems more likely that there is a significant difference between the control and treatment groups. 

In [140]:
fig = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    subplot_titles=("KS-test binned score distributions", "KS-test binned number of comments distributions"),
)
for i, lst in enumerate((binned_ks_test_score, binned_ks_test_num_comments)):
    fig.add_trace(go.Scatter(x=list(range(1, len(lst) + 1)), y=lst), row=i+1, col=1)
fig.update_yaxes(type="log", row="all", col="all")
fig.update_layout(showlegend=False)
fig.show()

### Descriptive statistics
The following section simply displays key descriptive statistics for the 2 distribution pairs

#### Quantile statistics

In [141]:
for attr in ("score", "num_comments"):
    for name, df in (("CONTROL", control_latest), ("TREATMENT", treatment_latest)):
        print(f"{name}: {attr}")
        print(df[attr].describe())
        print()

CONTROL: score
count     8583.000000
mean        66.090994
std        577.805053
min          0.000000
25%          1.000000
50%          2.000000
75%         14.000000
max      28010.000000
Name: score, dtype: float64

TREATMENT: score
count     8653.000000
mean        71.197041
std        705.669589
min         -1.000000
25%          1.000000
50%          3.000000
75%         16.000000
max      46709.000000
Name: score, dtype: float64

CONTROL: num_comments
count     8583.000000
mean        13.459047
std        356.097275
min          0.000000
25%          0.000000
50%          2.000000
75%          7.000000
max      32770.000000
Name: num_comments, dtype: float64

TREATMENT: num_comments
count    8653.000000
mean       11.984861
std        95.288079
min         0.000000
25%         0.000000
50%         2.000000
75%         8.000000
max      6577.000000
Name: num_comments, dtype: float64



#### Skewness and Kurtosis

In [142]:
for metric in ("skew", "kurtosis"):
    print(f"METRIC: {metric}")
    print(
        "\n".join(
            (
                f"{name} {attr} {metric}: {getattr(df[attr], metric)()}"
                for attr in ("score", "num_comments")
                for name, df in (("CONTROL", control_latest), ("TREATMENT", treatment_latest))
            )
        )
    )
    print()

METRIC: skew
CONTROL score skew: 25.830230227857694
TREATMENT score skew: 41.480538097552596
CONTROL num_comments skew: 90.75046398862972
TREATMENT num_comments skew: 50.795881485103315

METRIC: kurtosis
CONTROL score kurtosis: 906.1568988458805
TREATMENT score kurtosis: 2381.0497768056844
CONTROL num_comments kurtosis: 8346.061735457748
TREATMENT num_comments kurtosis: 3139.3810454998165



#### (score, num_comments)-correlation

In [143]:
for name, df in (("CONTROL", control_latest), ("TREATMENT", treatment_latest)):
    print(f"{name}:")
    print(df["score"].corr(df["num_comments"]))
    print()

CONTROL:
0.07851519321759166

TREATMENT:
0.14937576884249684



#### Filter extremes

In [144]:
def filter_top(df: pd.DataFrame, attr: str, qt: float = 0.95):
    return df[df[attr] <= df[attr].quantile(qt)][attr]

In [145]:
for attr in ("score", "num_comments"):
    for name, df in (("CONTROL", control_latest), ("TREATMENT", treatment_latest)):
        print(f"{name}: {attr}")
        print(df[df[attr] <= df[attr].quantile(0.95)][attr].describe())
        print()

CONTROL: score
count    8153.000000
mean       13.143137
std        27.702794
min         0.000000
25%         1.000000
50%         2.000000
75%        10.000000
max       184.000000
Name: score, dtype: float64

TREATMENT: score
count    8220.000000
mean       14.498905
std        30.120945
min        -1.000000
25%         1.000000
50%         2.000000
75%        12.000000
max       198.000000
Name: score, dtype: float64

CONTROL: num_comments
count    8155.000000
mean        4.484365
std         6.556211
min         0.000000
25%         0.000000
50%         2.000000
75%         6.000000
max        34.000000
Name: num_comments, dtype: float64

TREATMENT: num_comments
count    8223.000000
mean        5.177429
std         7.452877
min         0.000000
25%         0.000000
50%         2.000000
75%         7.000000
max        40.000000
Name: num_comments, dtype: float64



In [146]:
for metric in ("skew", "kurtosis"):
    print(f"METRIC: {metric}")
    print(
        "\n".join(
            (
                f"{name} {attr} {metric}: {getattr(df[df[attr] <= df[attr].quantile(0.95)][attr], metric)()}"
                for attr in ("score", "num_comments")
                for name, df in (("CONTROL", control_latest), ("TREATMENT", treatment_latest))
            )
        )
    )
    print()

METRIC: skew
CONTROL score skew: 3.438887605605816
TREATMENT score skew: 3.3554357393754417
CONTROL num_comments skew: 2.1308976390012226
TREATMENT num_comments skew: 2.144510730065316

METRIC: kurtosis
CONTROL score kurtosis: 12.9692022378779
TREATMENT score kurtosis: 12.304818847783517
CONTROL num_comments kurtosis: 4.579087890105466
TREATMENT num_comments kurtosis: 4.763500495194618



In [147]:
for name, df in (("CONTROL", control_latest), ("TREATMENT", treatment_latest)):
    print(f"{name}:")
    print(filter_top(df, "score").corr(filter_top(df, "num_comments")))
    print()

CONTROL:
0.2811512893667608

TREATMENT:
0.29734789310834914



## Temporal similarity
TODO

In [148]:
%%capture
px.scatter(joined, x="age", y="score_saturation", color="group", opacity=0.005)

In [149]:
control["score_saturation"].mean()

0.8121326617171523

In [152]:
treatment["score_saturation"].mean()

-inf

# Draft conclusion

In conclusion: From the KS-tests we deduce that the probability of the samples from the two distribution pairs - (TREATMENT_score, CONTROL_score) and (TREATMENT_num_comments, CONTROL_num_comments) - originating from the respective same distribution are both significantly less than 5% (p-value < 0.05). Investigating descriptive statistics of these distributions leads us to believe that the popularity of a post in terms of its score is indeed higher for posts in the treatment group. However, the opposite is true for the number of comments. Initially, we had hypothesized that score and number of comments were correlated, which does not seem to be the case, and this does tie together with the aforementioned uncorrelated effectiveness of the treatment on score and comments. However, a word of caution: The distributions are severely heavy-tailed as indicated by the kurtosis measures, and thus the extreme sample values can easily impact the results significantly, given the fairly limited size of the sample. This is demonstrated by filtering away the top 5% of the data from each distribution, yielding noticeably different perceived relationships between the distributions.