# **SHapley Additive exPlanations Model Interpretability Analysis for Top2Vec Natural Language Processing**: Part 1

## **Abstract**

### **III.A. Data Sourcing and Management:**

  The data in this study were web-scraped from comment conversations in the subreddit “r/AirForce”. This subreddit is used by members of the United States Air Force (USAF) and by Reddit users who are interested in the USAF. Users post about mutual interests, experiences, and questions within the USAF community, and express grievances about current events and trends. While other branches of the military have analogous subreddit pages, the USAF was chosen for this research project because of my general familiarity with it from having worked with the Department of the Air Force (DAF). [[1]](https://www.reddit.com/r/AirForce/)

### **1.a. How the data was obtained/collected:**

The subreddit data was scraped from 2015 to 2022 for use in time series, correlation, regression, and model interpretability analyses of the association between comments related to the "Permanent Change of Station (PCS)" topic and comments related to the "depression" topic. About 100,000 comments were scraped for analysis using the Pushshift Reddit API. [[2]](https://medium.com/swlh/how-to-scrape-large-amounts-of-reddit-data-using-pushshift-1d33bde9286)

### **Installations:**

Data Retrieval:

In [None]:
# Main wrapper in Python for Pushshift.
pip install pmaw pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pmaw
  Using cached pmaw-2.1.3-py3-none-any.whl (25 kB)
Collecting praw
  Using cached praw-7.6.1-py3-none-any.whl (188 kB)
Collecting update-checker>=0.18
  Using cached update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting websocket-client>=0.54.0
  Using cached websocket_client-1.4.2-py3-none-any.whl (55 kB)
Collecting prawcore<3,>=2.1
  Using cached prawcore-2.3.0-py3-none-any.whl (16 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw, pmaw
Successfully installed pmaw-2.1.3 praw-7.6.1 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.4.2


### **Imports and Scrape:**

In [None]:
import pandas as pd # For data manipulation.
from pmaw import PushshiftAPI # For web scrape.
api = PushshiftAPI() # Application program interface.

import datetime as dt # For date time manipulation.
before = int(dt.datetime(2021,1,1,0,0).timestamp()) # End date (a leading zero such as 01 is a syntax error in Python 3).
after = int(dt.datetime(2015,12,1,0,0).timestamp()) # Beginning date.

subreddit = "AirForce" # Subreddit location.
limit=100000 # Approximate comment document scrape limit.
# Scraping process:
comments = api.search_comments(subreddit=subreddit, # Subreddit.
                               limit=limit, # Data scrape amount.
                               before=before, # End date.
                               after=after # Beginning date.
                               )
# Number of comments/rows from the dataset.
print(f'Retrieved {len(comments)} comments from Pushshift')

comments_df = pd.DataFrame(comments) # Turns dataset to dataframe. 
# Preview the comments data.
comments_df.head(5)
# Saves dataframe to csv file (all columns are written by default).
comments_df.to_csv('./AF_comments_100k.csv', header=True, index=False)



Retrieved 5672 comments from Pushshift
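
One detail of the date bounds above: calling `.timestamp()` on a naive `datetime` interprets it in the local timezone of the machine running the notebook, so the resulting epoch values can shift by a few hours between environments. The following is a minimal sketch of one way to pin the bounds to UTC. The helper name `utc_epoch` is hypothetical and not part of the original notebook; the dates are the ones used in the scrape cell.

```python
import datetime as dt

def utc_epoch(year: int, month: int, day: int) -> int:
    """Return the Unix timestamp (in seconds) for midnight UTC on the given date."""
    # Attaching tzinfo makes the result independent of the machine's local timezone.
    return int(dt.datetime(year, month, day, tzinfo=dt.timezone.utc).timestamp())

after = utc_epoch(2015, 12, 1)   # Beginning date, as in the scrape cell.
before = utc_epoch(2021, 1, 1)   # End date, as in the scrape cell.
print(after, before)
```

These integers could be passed as the `after` and `before` arguments of the scrape in place of the locally interpreted values.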


Example data file generated from Reddit scrape to be used for project analysis:
* [AF_comments_100k.csv](https://drive.google.com/file/d/11ROOAKTTEj9djDy1c-ct7I3W2-Jmmby5/view?usp=share_link)
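
As a rough illustration of how the saved file might be prepared for the time series analysis mentioned earlier, the sketch below counts comments per calendar month. It assumes the scraped data has a `created_utc` column holding epoch seconds, which should be checked against the actual CSV. The small `sample` frame and the helper `monthly_comment_counts` are hypothetical stand-ins so the example runs without the file.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('./AF_comments_100k.csv').
# The column names are assumptions about the scraped data.
sample = pd.DataFrame({
    "created_utc": [1448928000, 1449014400, 1451606400],
    "body": ["first comment", "second comment", "third comment"],
})

def monthly_comment_counts(df: pd.DataFrame) -> pd.Series:
    """Count comments per calendar month (UTC), indexed by 'YYYY-MM' strings."""
    # Convert epoch seconds to timezone-aware timestamps.
    created = pd.to_datetime(df["created_utc"], unit="s", utc=True)
    # Label each comment by month, count the labels, and sort chronologically.
    return created.dt.strftime("%Y-%m").value_counts().sort_index()

counts = monthly_comment_counts(sample)
print(counts)
```

With the real file, `sample` would be replaced by the dataframe loaded from `AF_comments_100k.csv`.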


Tutorial site reference:
* How to Scrape Large Amounts of Reddit Data by Matt Podolak
  * https://medium.com/swlh/how-to-scrape-large-amounts-of-reddit-data-using-pushshift-1d33bde9286
