# Pre-processing and cleaning the Reddit data

By: Iris Luden 

Last edited: 27-03-2023



The retrieved Reddit posts and comments are stored in the folder:

'Reddit_data'

    |__> 2016
        |__> ids
            |__> ids_2016_1.txt
            |__> ids_2016_2.txt
            |__> ...
        |__> texts
            |__> texts_2016_1.txt
            |__> texts_2016_2.txt
            |__> ...
        |__> texts_clean*
            |__> Monthly_freqs_2016_1.json* 
            |__> ...
            |__> {year}_{month}_cleaned_texts.json* 
            |__> ... 
    |__> 2017
    |__> 2018
    |__> ... 
    |__> 2023

The cleaned texts are written to the folder `{year}/texts_clean/`.
The monthly word frequency counts are stored there as well.
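
To illustrate what a monthly frequency file could contain, here is a minimal sketch. It assumes the counts are a plain word-to-count mapping over whitespace-separated tokens; the actual format written by `helpers.py` is not shown in this notebook, and the function names `monthly_word_freqs` and `save_monthly_freqs` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def monthly_word_freqs(texts):
    """Count how often each whitespace-separated token occurs in the texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def save_monthly_freqs(counts, out_dir, year, month):
    """Write the counts to Monthly_freqs_{year}_{month}.json inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"Monthly_freqs_{year}_{month}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        # A Counter is a dict subclass, so it serializes as a JSON object.
        json.dump(counts, f)
    return out_path
```

A call such as `save_monthly_freqs(monthly_word_freqs(texts), 'Reddit_data/2016/texts_clean', 2016, 1)` would then produce a file named like the one in the tree above.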

In [4]:
from collections import Counter
import time
import os
import nltk 
import matplotlib.pyplot as plt
from gensim.models.word2vec import PathLineSentences
import numpy as np
import pandas as pd

from helpers import * 

# 1. Cleaning 

1. Filtering non-English texts: 
    
    The texts are filtered for non-English content using `filer_non_English` in helpers.py. This function checks whether any of the stopwords occurs in the text. If none does, the text is not added to the clean_texts file. Inspired by [TO DO]
    
2. Sentence tokenizing & word tokenizing 
    - The texts are tokenized into sentences using the standard `nltk.tokenize.sent_tokenize`
    - Sentences are tokenized into words using the standard `nltk.tokenize.TreebankWordTokenizer()`
        - [TO DO] Another option could have been to use the TweetTokenizer  https://towardsdatascience.com/top-5-word-tokenizers-that-every-nlp-data-scientist-should-know-45cc31f8e8b9

3. Cleaning punctuation
    - The special punctuation characters ’ and … are replaced by the default punctuation ' and ... 
    - All words are stripped of the punctuation in `string.punctuation` and of the characters ”“
    
4. Lower case 

5. Discard posts/comments of fewer than 10 terms, separated by spaces.
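
The steps above can be sketched roughly as follows. This is not the code in `helpers.py`. It makes several assumptions: a tiny hard-coded stop-word set stands in for the real list, a whitespace split stands in for the NLTK sentence and word tokenizers, punctuation is stripped only from the edges of each token, and the filters run before cleaning. All function names here are hypothetical.

```python
import string

# Tiny illustrative stop-word set (assumption); the real list is not shown here.
STOPWORDS = {"the", "a", "an", "and", "is", "of", "to", "in", "it", "that"}

# Characters stripped from the edges of every token (step 3).
STRIP_CHARS = string.punctuation + "”“"


def looks_english(text, stopwords=STOPWORDS):
    """Step 1: keep a text only if it contains at least one stop word."""
    return any(tok in stopwords for tok in text.lower().split())


def normalize_punctuation(text):
    """Step 3: replace the special characters ’ and … by ' and ... ."""
    return text.replace("’", "'").replace("…", "...")


def clean_tokens(tokens):
    """Steps 3 and 4: strip punctuation from token edges, lower-case, drop empties."""
    cleaned = (tok.strip(STRIP_CHARS).lower() for tok in tokens)
    return [tok for tok in cleaned if tok]


def clean_text(text, min_terms=10):
    """Return the cleaned tokens, or None if the text is filtered out."""
    # Step 5: discard texts with fewer than min_terms space-separated terms.
    if len(text.split()) < min_terms:
        return None
    # Step 1: discard texts without any stop word.
    if not looks_english(text):
        return None
    # Step 2 is simplified: a whitespace split replaces the NLTK tokenizers.
    return clean_tokens(normalize_punctuation(text).split())


sample = "The cat’s toy is in the box… and “that” is where it stays!"
print(clean_text(sample))
```

Stripping only at the token edges keeps the apostrophe inside words such as `cat's` while still removing the trailing `...` and the curly quotes.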

#### A: Clean data for the year 2015, months 7-12

In [None]:
year = 2015
start = time.time()

for month in range(7, 13):
    collect_clean_texts(year, month)
end = time.time()
print(f'Cleaning the year {year} took {end-start} seconds')

#### B: Clean data of full years 2016-2022

In [None]:
# years 2016-2022
for year in range(2016, 2023): 
    
    start = time.time()
    
    # loop over the months of the year and clean the texts
    for month in range(1,13):
        collect_clean_texts(year, month)
        
    end = time.time()
    
    print(f'Cleaning the year {year} took {end-start} seconds')

#### C:  Clean data for the year 2023, months 1, 2


In [None]:
year = 2023
start = time.time()
# loop over the months of the year
for month in range(1, 3):
    collect_clean_texts(year, month)
    
end = time.time()

print(f'Cleaning the year {year} took {end-start} seconds')

# Move the files into the two corpus directories 

Corpus 1: July 2015 until and including April 2019 (total 46 months)

Corpus 2: May 2019 until and including February 2023 (total 46 months)
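
As a quick check of the split, the sketch below assigns each (year, month) pair to a corpus and counts the months per corpus. The helper names `corpus_for` and `months_between` are hypothetical and are not part of `helpers.py`.

```python
def corpus_for(year, month):
    """Return 1 or 2 for months inside the two periods, otherwise None."""
    key = (year, month)
    if (2015, 7) <= key <= (2019, 4):
        return 1
    if (2019, 5) <= key <= (2023, 2):
        return 2
    return None


def months_between(start, end):
    """List all (year, month) pairs from start to end, both inclusive."""
    year, month = start
    pairs = []
    while (year, month) <= end:
        pairs.append((year, month))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return pairs


all_months = months_between((2015, 7), (2023, 2))
corpus1 = [ym for ym in all_months if corpus_for(*ym) == 1]
corpus2 = [ym for ym in all_months if corpus_for(*ym) == 2]
print(len(corpus1), len(corpus2))  # prints: 46 46
```

Comparing (year, month) tuples relies on Python's lexicographic tuple ordering, which avoids separate checks for the boundary years.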

#### Divide and sort cleaned data in two folders

- Reddit_data/Corpus1
- Reddit_data/Corpus2

These will be moved to the general folders Corpus1 and Corpus2 together with the Twitter data set. 



In [5]:
# os.mkdir('Reddit_data/Corpus1')
# os.mkdir('Reddit_data/Corpus2')

In [None]:
# # CORPUS 1
# year = 2015
# for month in range(7, 13):
#     old_path = f'Reddit_data/{year}/texts_clean/{year}_{month}_cleaned_texts.txt'
#     new_path = f'Reddit_data/Corpus1/{year}_{month}_cleaned_texts.txt'
#     os.replace(old_path, new_path)
        
# ## corpus 1 for t5: all documents from 2016-01 until  2019-04
# for year in range(2016, 2020): 
    
#     for month in range(1, 13):
        
#         if year == 2019 and month > 4: 
#             break 
#         else: 
#             old_path = f'Reddit_data/{year}/texts_clean/{year}_{month}_cleaned_texts.txt'
#             new_path = f'Reddit_data/Corpus1/{year}_{month}_cleaned_texts.txt'
#             os.replace(old_path, new_path )

# # CORPUS 2
# for year in range(2019, 2024): 
    
#     for month in range(1, 13):
#         if year == 2019 and month < 5: 
#             pass
#         elif year == 2023 and month > 2: 
#             break 
#         else: 
#             old_path = f'Reddit_data/{year}/texts_clean/{year}_{month}_cleaned_texts.txt'
#             new_path = f'Reddit_data/Corpus2/{year}_{month}_cleaned_texts.txt'
#             os.replace(old_path, new_path )