# Data Collection

The first step is collecting the data. Submissions are retrieved from the [Pushshift API](https://github.com/pushshift/api), accessed through the [PMAW](https://github.com/mattpodolak/pmaw) wrapper. In addition, [PRAW](https://github.com/praw-dev/praw) is used to refresh each submission against Reddit itself, so that the corresponding metadata is up to date.

In [None]:
# Versions used:
#
# python 3.9.7
# pandas 1.3.5
# praw   7.5.0
# pmaw   2.1.1

import datetime as dt

import pandas as pd
import praw
from pmaw import PushshiftAPI

## Defining Parameters

The data will be collected from [r/college](https://reddit.com/r/college/), taking into account submissions from January 1, 2020 00:00 UTC to January 1, 2022 00:00 UTC.

In [None]:
reddit = praw.Reddit(client_id='[REDACTED]',
                     client_secret='[REDACTED]',
                     user_agent='python: PMAW request enrichment (by u/[REDACTED])')

subreddit = 'college'

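# Use timezone-aware datetimes: .timestamp() on a naive datetime assumes local time, not UTC.
start_date = int(dt.datetime(2020, 1, 1, 0, 0, tzinfo=dt.timezone.utc).timestamp())
end_date = int(dt.datetime(2022, 1, 1, 0, 0, tzinfo=dt.timezone.utc).timestamp())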
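
Because the collection window is stated in UTC, it is worth confirming that the epoch boundaries really correspond to midnight UTC: calling `timestamp()` on a naive `datetime` interprets it in the machine's local time zone. The following is a minimal sketch of that check; `utc_epoch` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
import datetime as dt


def utc_epoch(year, month, day):
    """Return the Unix timestamp for midnight UTC on the given date."""
    return int(dt.datetime(year, month, day, tzinfo=dt.timezone.utc).timestamp())


start_date = utc_epoch(2020, 1, 1)
end_date = utc_epoch(2022, 1, 1)

# Convert the timestamps back to UTC datetimes to confirm the boundaries.
print(dt.datetime.fromtimestamp(start_date, tz=dt.timezone.utc).isoformat())
print(dt.datetime.fromtimestamp(end_date, tz=dt.timezone.utc).isoformat())
```

Both printed values should show `00:00:00+00:00`, regardless of the time zone of the machine running the notebook.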

## Collecting the Data

PMAW's `PushshiftAPI` can be configured with decorrelated jitter, which randomizes the delay between retried requests. This reduces contention between threads, spreads requests to the API more evenly, and so lowers the number of rejected requests.
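
To illustrate the idea, the sketch below generates retry delays with the commonly cited decorrelated jitter formula, `delay = min(cap, uniform(base, 3 * previous_delay))`. It is an illustration of the general technique, not PMAW's internal code; the function name and the `base`, `cap`, and `attempts` parameters are assumptions made for this example.

```python
import random


def decorrelated_jitter_delays(base, cap, attempts, rng=None):
    """Return a list of retry delays (in seconds) using decorrelated jitter.

    Each delay is drawn uniformly between `base` and three times the
    previous delay, then clipped to `cap`.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    previous = base
    for _ in range(attempts):
        previous = min(cap, rng.uniform(base, previous * 3))
        delays.append(previous)
    return delays


# Two simulated threads draw independent delay sequences, so their
# retries tend to drift apart rather than hit the API at the same time.
for thread_id in range(2):
    delays = decorrelated_jitter_delays(base=0.5, cap=60.0, attempts=5)
    print(thread_id, [round(d, 2) for d in delays])
```

Because each delay depends on the previous one and on a random draw, threads that fail at the same moment are unlikely to retry at the same moment.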

In [None]:
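# PMAW applies jitter as part of exponential backoff, so that rate limiter is selected explicitly.
api = PushshiftAPI(rate_limit_type='backoff', jitter='decorr', praw=reddit)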

posts = api.search_submissions(subreddit=subreddit,
                               after=start_date,
                               before=end_date,
                               is_video=False,
                               limit=None)

print(f"Total submissions collected: {len(posts)}")
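
Before going further, it can be useful to check the collected records for duplicate IDs and for timestamps outside the requested window. The sketch below runs on made-up records that only mimic the `id` and `created_utc` fields of a submission; `summarize` is a hypothetical helper, and treating the window as half-open (start included, end excluded) is an assumption rather than documented API behavior.

```python
import datetime as dt

start_date = int(dt.datetime(2020, 1, 1, tzinfo=dt.timezone.utc).timestamp())
end_date = int(dt.datetime(2022, 1, 1, tzinfo=dt.timezone.utc).timestamp())

# Made-up records standing in for collected submissions.
sample_posts = [
    {"id": "aaa111", "created_utc": 1577836800},  # exactly at the start
    {"id": "bbb222", "created_utc": 1609459200},  # inside the window
    {"id": "bbb222", "created_utc": 1609459200},  # duplicate of the previous record
    {"id": "ccc333", "created_utc": 1640995200},  # exactly at the end
]


def summarize(posts, start, end):
    """Count duplicate IDs and list IDs whose timestamp is outside [start, end)."""
    ids = [post["id"] for post in posts]
    duplicates = len(ids) - len(set(ids))
    out_of_window = [
        post["id"] for post in posts
        if not (start <= post["created_utc"] < end)
    ]
    return duplicates, out_of_window


duplicates, out_of_window = summarize(sample_posts, start_date, end_date)
print(f"Duplicate IDs: {duplicates}, outside window: {out_of_window}")
```

The same function could be applied to the real collection by passing `list(posts)` in place of `sample_posts`.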

Once collected, the submissions can be loaded into a pandas DataFrame, with one row per submission.

In [None]:
posts_df = pd.DataFrame(posts)

posts_df.head(5)

## Saving

Finally, the DataFrame can be converted to a CSV file.

In [None]:
# All columns and the header row are written by default; only the index is dropped.
posts_df.to_csv('./rcollege_20200101-20220101_praw.csv', index=False)
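
As a quick check on the export settings, the sketch below writes a small made-up DataFrame to CSV text with `index=False` and reads it back. The column names and values are invented for illustration and do not come from the collected data.

```python
import io

import pandas as pd

# Made-up rows; the title with a comma shows that quoting is handled by pandas.
sample_df = pd.DataFrame({
    "id": ["abc123", "def456"],
    "title": ["First, with a comma", "Second"],
    "score": [10, 3],
})

# With no path given, to_csv returns the CSV content as a string.
csv_text = sample_df.to_csv(index=False)

# Reading it back yields the same columns, with no extra index column.
restored_df = pd.read_csv(io.StringIO(csv_text))
print(restored_df)
```

Omitting `index=False` would add an unnamed first column holding the row index, which then reappears as an extra column when the file is read back.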