---

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
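import os
import sys

# Make the project's src/ directory importable without overwriting an existing sys.path entry
sys.path.append(os.path.join(sys.path[0], 'src'))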

from preprocessing.preprocessing_util import load_raw_data, NlpPipe, concat_processed, pd

---
The preprocessing pipeline produces a corpus of posts, each represented as a list of its lemmatized words. For a more granular view of each step, see the ```preprocessing_util.py``` file.

One point worth addressing: by default, stopwords and punctuation are removed. For word/document embeddings, many recommend keeping stopwords, and potentially punctuation as well. I remove both for 2 reasons:
1. This corpus is used for a variety of methods that include both LDA and embeddings. Combining the two (i.e., contextual topic embeddings in the ```modeling``` notebook) would have required 2 separate preprocessing objects, as well as 2 separately processed corpora, one for training the LDA model and one for the embeddings. My memory usage was already near capacity with just 1.
2. [This paper](https://ep.liu.se/ecp/131/039/ecp17131039.pdf) found that keeping or removing stopwords had almost no influence on semantic similarity tasks.
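
To make the default filtering concrete, here is a minimal sketch of stopword and punctuation removal using only the standard library. It is not the code in ```preprocessing_util.py```: the function name `filter_tokens`, the toy stopword list, and the regex tokenizer are all assumptions made for illustration, and lemmatization is left out entirely.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration only; the real pipeline's
# list is defined elsewhere and is not shown in this notebook.
TOY_STOPWORDS = {"i", "was", "with", "this", "for", "about", "a", "an", "the"}


def filter_tokens(text, stopwords=TOY_STOPWORDS, drop_stopwords=True, drop_punct=True):
    """Lowercase and tokenize text, optionally dropping stopwords and punctuation."""
    # Split into runs of word characters or single non-space, non-word characters.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    kept = []
    for tok in tokens:
        if drop_punct and all(ch in string.punctuation for ch in tok):
            continue
        if drop_stopwords and tok in stopwords:
            continue
        kept.append(tok)
    return kept


sample = "I was with this guy for about 3 months."
filtered = filter_tokens(sample)
unfiltered = filter_tokens(sample, drop_stopwords=False, drop_punct=False)
```

Exposing both switches as parameters would let a single function serve the LDA and the embedding use cases, at the cost of processing and storing the corpus twice, which is the memory trade-off described above.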

In [3]:
data = load_raw_data('data/raw/raw_sample.csv')
data.head(1)

Unnamed: 0,created,id,title,text,comments,url,flair,edited,ups,down,num_comments,gilded,awards,sub,total_text
0,2018-05-20 05:15:30,8kqg3d,Staying friends with an ex,I was with this guy for about 3 months. During...,['I cant really say since I dont know how old...,https://www.reddit.com/r/relationship_advice/c...,,0.0,1,0,3,0,0,relationships_advice,Staying friends with an ex. I was with this gu...


In [4]:
pipeline = NlpPipe(data['total_text'].tolist()) 

In [5]:
processed = pipeline.lemmatize()

In [6]:
# First 10 words of example post after preprocessing 
processed[0][:10]

['stay',
 'friend',
 'ex',
 'guy',
 '3',
 'month',
 'time',
 'start',
 'develop',
 'depression']

In [7]:
processed_df = concat_processed(processed, data, ['id', 'url', 'title', 'text'])
processed_df.head(1)

Unnamed: 0,id,url,title,text,processed
0,8kqg3d,https://www.reddit.com/r/relationship_advice/c...,Staying friends with an ex,I was with this guy for about 3 months. During...,"[stay, friend, ex, guy, 3, month, time, start,..."
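
For readers without access to ```preprocessing_util.py```, the step above can be approximated as follows. This is a sketch of what a helper like `concat_processed` might do, assuming it selects the requested metadata columns and attaches the token lists row by row; the name `attach_processed` and the toy data are hypothetical.

```python
import pandas as pd


def attach_processed(processed, data, keep_cols):
    """Return data[keep_cols] with an extra 'processed' column of token lists."""
    if len(processed) != len(data):
        raise ValueError("processed and data must have the same number of rows")
    out = data[keep_cols].copy()
    # Build the Series on the frame's own index so rows align by position.
    out["processed"] = pd.Series(processed, index=out.index)
    return out


# Toy stand-ins for the raw data and the lemmatized output.
toy_data = pd.DataFrame(
    {
        "id": ["a1", "b2"],
        "title": ["First post", "Second post"],
        "ups": [1, 5],
    }
)
toy_processed = [["first", "post"], ["second", "post"]]

toy_df = attach_processed(toy_processed, toy_data, ["id", "title"])
```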


In [17]:
#processed_df.to_csv('data/interim/processed_sample.csv')

---
<br>
Since the goal of this project is a recommender system, and the data needs to be looked up for each recommendation, I also uploaded the data to a Google Cloud PostgreSQL instance.

In [None]:
# data_save_df(data=processed_df, 
#              user='postgres', 
#              pw=___, 
#              ip='34.94.44.13', 
#              port='5432')
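
The body of `data_save_df` is not shown in this notebook. As a rough sketch of the same idea, the snippet below writes a frame to a SQL table with `DataFrame.to_sql` and reads one row back by `id`. It uses an in-memory SQLite database as a stand-in for the Postgres instance so that it runs without credentials; the table name `posts` and the JSON encoding of the token lists are assumptions, not the author's schema.

```python
import json
import sqlite3

import pandas as pd

toy_df = pd.DataFrame(
    {
        "id": ["8kqg3d"],
        "title": ["Staying friends with an ex"],
        "processed": [["stay", "friend", "ex"]],
    }
)

# SQL columns cannot hold Python lists, so serialize the tokens to JSON text.
to_store = toy_df.copy()
to_store["processed"] = to_store["processed"].apply(json.dumps)

# In-memory SQLite stands in for the remote database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
to_store.to_sql("posts", conn, index=False, if_exists="replace")

# Look up a single post by id, as a recommender would at serving time.
row = pd.read_sql("SELECT * FROM posts WHERE id = ?", conn, params=("8kqg3d",))
tokens = json.loads(row.loc[0, "processed"])
conn.close()
```

Against a real Postgres instance, the connection object would typically be a SQLAlchemy engine instead of a `sqlite3` connection, and the parameter placeholder style would depend on the driver.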