Creddit

Objective

This repository contains the tools to ingest, analyze, and train ML models that predict misinformation in comments on Reddit.

Included is the discourse from three threads on r/politics:

  1. Megathread: Mueller files final report with Attorney General
  2. Megathread: Attorney General Releases Redacted Version of Special Counsel Report
  3. Michael Cohen Testifies Before House Oversight Committee

We split the required data into two sets:

Comment Frame

Extracted by Ingest.CommentExtractor into data/comment_frame/{post_id}_comment_frame.pkl. The pickle is of type pandas.core.frame.DataFrame; each comment occupies one row, with no duplicates. It contains the 12 features below, and a loading snippet follows the table.

| Feature | dType | Example | Inference |
| --- | --- | --- | --- |
| author | string | santaKlaus | Author of the comment |
| body | string | Now it's up to Congress ... | Comment body |
| comment_created_utc | datetime64 | 2019-03-22 22:35:58 | Comment creation time |
| comment_deleted | bool | False | True if the comment was later deleted by the user |
| comment_id | string | ej5l32l | ID of the comment |
| comment_removed_by_mods | bool | True | True if the comment was removed by moderators |
| edited | bool | True | True if the comment was edited |
| golds | int | 0 | Number of "gold"/"awards" received by the author for the comment |
| is_submitter | bool | False | True if the author of the comment is also the author of the post |
| parent_id | string | t3_b4agza | The comment_id this comment responds to; if the parent_id is the post_id, it is a "top-level" comment |
| post_id | string | b4agza | The post to which the comment belongs |
| score | int | 19 | The "karma" / score, defined as upvotes minus downvotes for the comment |
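
A frame can be inspected with plain pandas (the path below uses the Mueller report thread's post_id; any extracted post works):

    import pandas as pd

    comment_frame = pd.read_pickle('data/comment_frame/b4agza_comment_frame.pkl')
    print(comment_frame.dtypes)               # the 12 columns listed above
    print(comment_frame['score'].describe())  # score distribution across comments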

User Frame

Extracted by Ingest.UserExtractor into data/user_frame/{post_id}_user_frame.pkl. The pickle is of type pandas.core.frame.DataFrame; each user occupies one row, with no duplicates. It contains the 11 features below, and an aggregation example follows the table.

| Feature | dType | Example | Inference |
| --- | --- | --- | --- |
| author | string | HereWeGoAgainTJ | Reddit username of the comment author |
| subreddit_comment_count | Dict[str, int] | {"worldnews": 38, "confession": 6, ...} | Count of the user's comments per subreddit |
| subreddit_comment_karma | Dict[str, int] | {"worldnews": 333, "conspiracy": 31, ...} | Sum of the user's comment karma per subreddit |
| subreddit_post_count | Dict[str, int] | {"bestoflegaladvice": 1, "SubredditDrama": 1, ...} | Count of the user's posts per subreddit |
| subreddit_post_karma | Dict[str, int] | {"SubredditDrama": 32, "bestoflegaladvice": 2, ...} | Sum of the user's post karma per subreddit |
| user_account_creation_utc | datetime64 | 2018-08-11 12:02:37 | When the user's account was created |
| user_email_verified | bool | True | True if the user verified their email |
| user_total_comment_count | int | 993 | Total number of comments made by the author on Reddit |
| user_total_comment_karma | int | 54048 | Total karma accumulated by the author from their comments on Reddit |
| user_total_post_count | int | 13 | Total number of posts made by the author on Reddit |
| user_total_post_karma | int | 226 | Total karma accumulated by the author from their posts on Reddit |
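
The four Dict[str, int] columns aggregate with plain pandas. For example, recomputing each user's comment karma from the per-subreddit sums (a sanity check; whether it exactly matches user_total_comment_karma depends on how complete the per-subreddit dicts are):

    import pandas as pd

    user_frame = pd.read_pickle('data/user_frame/b4agza_user_frame.pkl')
    # Sum each user's per-subreddit comment karma dict into a single number
    recomputed = user_frame['subreddit_comment_karma'].apply(lambda d: sum(d.values()))
    print((recomputed - user_frame['user_total_comment_karma']).describe())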

Training Data Construction Pipeline

You can skip steps 1 and 2 if you intend to work with the existing dataset.

  1. Generate your credentials from Reddit's app preferences page (https://www.reddit.com/prefs/apps). Select the app type "Script for personal use".

  2. Insert your secret-key and client-id into reddit_api_creds.json
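
     The exact schema of reddit_api_creds.json isn't documented here; a minimal sketch that writes a plausible shape (the key names are assumptions mirroring the prose above, so match whatever Ingest actually reads):

    import json

    # Hypothetical key names; adjust to the schema the repo expects.
    creds = {'client-id': 'YOUR_CLIENT_ID', 'secret-key': 'YOUR_SECRET_KEY'}
    with open('reddit_api_creds.json', 'w') as f:
        json.dump(creds, f, indent=2)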

  3. Call Ingest.extractors.CommentExtractor with a valid post_id. This will first pull a list of comment_ids associated with the post_id from PushShift.io. We do this because Reddit's API does not give access to several deleted and removed comments; PushShift still hosts these datasets. A post_id is the base-36 ID of the Reddit post submission.
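
     A hypothetical invocation (the constructor and method names below are assumptions, not the extractor's documented API):

    from Ingest.extractors import CommentExtractor

    # 'b4agza' is the base-36 id of the Mueller report megathread above
    extractor = CommentExtractor('b4agza')  # assumed: post_id passed at construction
    extractor.run()                         # assumed entry point; writes the comment_frame pickle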

  4. The previous step pulls all the comments for the given post_id. We then augment the predictors in the user_frame with Analysis.FeatureBuilder.UserFeatureBuilder. This class aggregates the user's historical activity in partisan subreddits along with r/news and r/politics; Analysis/partisan_resource.json is used to determine the partisan bias of subreddits. It also aggregates post and comment karma and counts. Sample usage is below.

    import pandas as pd
    from pandas import DataFrame as Df  # alias assumed for these snippets

    post_id = 'b4agza'
    post_creation_time = pd.to_datetime(get_reddit_instance().submission(post_id).created_utc, unit='s')
    user_frame = Common.load_user_frame(post_id)  # the pickled user frame described above
    userFeatureBuilder = UserFeatureBuilder(post_creation_time)
    user_features = Df(list(user_frame.apply(userFeatureBuilder.get_features_for_row, axis=1)))
  5. Similarly, we augment the features in the comment_frame with Analysis.FeatureBuilder.PostFeatureBuilder, which generates network-based features for every comment passed in. Network features include network_comment_thread_max_depth, network_comment_thread_top_level_count, and network_comment_thread_size; see the CommentNetworkFeatureBuilder class for more documentation. Features on the sentiment of the comment body (profanity, polarity, subjectivity, spelling errors) are captured by the CommentTextFeatureBuilder class. Additional features, such as the timedelta between comment creation and post creation, are also captured. Sample usage is below.

    from tqdm import tqdm
    tqdm.pandas()  # registers DataFrame.progress_apply used below

    comment_frame = Common.load_comment_frame(post_id)
    top_level_frame = Common.get_top_level_comments(comment_frame)  # comments replying directly to the post
    postFeatureBuilder = PostFeatureBuilder(comment_frame)
    post_features = Df(list(top_level_frame.progress_apply(postFeatureBuilder.get_features_for_row, axis=1)))
  6. We then stitch the user_frame and comment_frame, along with all augmented features, into a single dataset containing 59 columns. The code for generating this dataset is in Analysis/featureGenScript.py. The dataset can contain data from multiple posts; for example, the frame in data/dataset/combined.pkl merges the top-level comments from the 3 posts mentioned above. An illustrative merge sketch is below.
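     The real join logic lives in Analysis/featureGenScript.py; purely as an illustration (the merge key 'author' is an assumption), stitching one post's frames could look like:

    import pandas as pd

    # post_features and user_features come from the step 4 and 5 snippets above.
    single_post = post_features.merge(user_features, on='author', how='left')
    # Repeating per post and concatenating yields data/dataset/combined.pkl.
    combined = pd.concat([single_post], ignore_index=True)
    combined.to_pickle('data/dataset/combined.pkl')
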

  7. The training and inference of the models can be found in the Jupyter notebook Model.ipynb and in the conclusion of the paper cReddit_Munchen. The feature set used for training is restricted to the 42 features below; a generic training sketch follows the table.

| Feature | dType | Example |
| --- | --- | --- |
| golds | float64 | 1 |
| comment_char_count | int64 | 134 |
| comment_text_polarity | float64 | -0.123 |
| network_comment_thread_max_depth | int64 | 2 |
| network_user_total_comment_count | int64 | 5 |
| news_subreddit_comment_karma | int64 | 7 |
| politics_subreddit_post_karma | int64 | 31231 |
| left_subreddit_comment_count | int64 | 2981 |
| news_subreddit_comment_count | int64 | 7 |
| comment_has_user_ref | bool | True |
| user_email_verified | object | 1 |
| user_total_comment_count | int64 | 157 |
| comment_spelling_error_count | int64 | 1 |
| right_subreddit_comment_count | int64 | 1 |
| network_user_thread_comment_count | int64 | 1 |
| user_account_age_seconds | float64 | 5.96577e+06 |
| center_subreddit_comment_karma | int64 | 210 |
| post_comment_timedelta_seconds | float64 | 7730 |
| right_subreddit_post_karma | int64 | -2120 |
| comment_url_refer_count | int64 | 0 |
| center_subreddit_comment_count | int64 | 12 |
| comment_has_citation | bool | True |
| center_subreddit_post_karma | int64 | 0 |
| politics_subreddit_comment_count | int64 | 50 |
| politics_subreddit_post_count | int64 | 0 |
| user_total_post_karma | int64 | 15 |
| politics_subreddit_comment_karma | int64 | 69 |
| left_subreddit_comment_karma | int64 | 0 |
| user_total_post_count | int64 | 4 |
| comment_text_profanity | int64 | 7 |
| network_comment_thread_size | int64 | 3 |
| user_total_comment_karma | int64 | 753 |
| network_comment_thread_top_level_count | int64 | 2 |
| network_user_top_level_comment_count | int64 | 2 |
| left_subreddit_post_count | int64 | 3123 |
| news_subreddit_post_count | int64 | 0 |
| right_subreddit_comment_karma | int64 | 0 |
| left_subreddit_post_karma | int64 | 0 |
| comment_text_subjectivity | float64 | 0 |
| right_subreddit_post_count | int64 | 0 |
| center_subreddit_post_count | int64 | 9012 |
| news_subreddit_post_karma | int64 | 18000 |
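
Purely as a generic illustration (the estimator, label column, and train/test split below are assumptions, not what Model.ipynb actually does):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    dataset = pd.read_pickle('data/dataset/combined.pkl')
    features = ['golds', 'comment_char_count', 'comment_text_polarity']  # small subset of the 42 columns above
    X = dataset[features]
    y = dataset['label']  # hypothetical label column name
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))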

Setting up the environment

  1. This codebase was developed on Python 3.7.3; requirements.txt lists all dependencies. To set up the environment, first create and activate a virtual environment, then install the dependencies.

    python3 -m pip install --user virtualenv
    python3 -m venv env
    source env/bin/activate
    pip install -r requirements.txt
  2. One of the dependencies, textblob==0.15.3, requires the appropriate NLTK corpora to function. Run the command below to fetch them:

    python -m textblob.download_corpora
