# Flatiron Capstone Project – Notebook #1: Data Prep

Student name: **Angelo Turri**

Student pace: **self paced**

Project finish date: **1/19/24**

Instructor name: **Mark Barbour**

# Instructions

Due to the size of this project, there are four notebooks instead of one. The proper order to execute these notebooks is as follows:

- Gathering Data **<---- You are here**
- Preprocessing
- Feature Engineering
- Modeling

### Stakeholder
Your stakeholder is a social media communications team working for a political candidate, Donald Trump. They have requested that you analyze a body of social media posts from their voter base and extract meaningful insights on their base's attitudes.

### Data: Origin
Data is taken from the former subreddit ***r/the_donald***. This subreddit has been archived along with 20,000 others on [the-eye.eu](https://the-eye.eu/redarcs/). If you want to download it yourself, type "the_donald" in the search bar on that website and use the "Comments" link provided there. However, this notebook should download the file for you.

Fair warning - if you are about to explore on this website, be cautious. I looked at some of the archived reddits and will never be the same again.

### Data: Statistics
The compressed .zst file is sizeable at 3.8GB. Once decompressed to a .txt file, it takes up a whopping 37.48 GB of space and contains data on approximately 48 million posts. Given the sheer amount of data and the limitations of my machine, I chose not to analyze all 48 million posts. I wanted roughly 2 million posts, so I kept every 23rd post from this file.
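As a quick sanity check on the sampling arithmetic above, the sketch below estimates how many posts survive when every 23rd line is kept. The 48 million figure is the approximate count quoted above, so the result is only an estimate.

```python
# Approximate figures taken from the text above; the exact line count may differ.
total_posts = 48_000_000
stride = 23  # keep every 23rd post

kept = total_posts // stride
print(f"Keeping every {stride}rd post leaves about {kept:,} posts.")
```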

Each post is recorded as a dictionary. Only some of the keys were relevant to our analysis:
- Raw text
- Post score (upvotes - downvotes)
- Author
- Date posted

After extraction, our initial dataframe had 2.1 million total entries ranging from August of 2015 to April of 2020, for a total of 1710 days – approximately 4.7 years. There are 178,308 unique authors.


### Basic spam removal
Thorough spam removal occurs in the **preprocessing notebook**, but we conduct some very basic spam removal in this notebook. Posts that were deleted by their authors or removed by moderators are dropped, as are empty posts. All in all, 198,237 of our original posts were removed by these measures.
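One way the basic filter described above could look, as a minimal sketch. It assumes that deleted and removed posts carry the placeholder bodies `[deleted]` and `[removed]` (both are visible in the data preview later in this notebook) and treats whitespace-only text as empty. The helper name `is_removed_or_empty` is hypothetical, and this is not necessarily the exact rule behind the 198,237 figure.

```python
# Placeholder bodies assumed to mark deleted or removed posts.
REMOVED_MARKERS = {"[deleted]", "[removed]"}


def is_removed_or_empty(body):
    """Return True for posts with no usable text (hypothetical helper)."""
    if body is None:
        return True
    text = str(body).strip()
    return text == "" or text in REMOVED_MARKERS


# Small hand-made sample standing in for the 'post' column.
sample = ["An*", "[removed]", "   ", "[deleted]", "I really support Trump"]
kept = [post for post in sample if not is_removed_or_empty(post)]
print(kept)
```

On a dataframe with a `post` column, the same helper could be applied with `df[~df['post'].map(is_removed_or_empty)]`.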

# Importing packages

In [1]:
import pandas as pd
import json
import numpy as np

import time
from tqdm.notebook import tqdm
import os
import urllib.request
import warnings
warnings.filterwarnings('ignore')

import zstandard as zstd

### Downloading data

In [8]:
name='the_donald_comments'
path = '../data/the_donald_comments.zst'

# Downloading can be a lengthy process. If you have already downloaded the file,
# this cell will not re-download.
exists = os.path.exists(path)

if exists:
    print(f"Compressed .zst file for {name} already exists. Skipping download.")
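else:
    url = 'https://the-eye.eu/redarcs/files/The_Donald_comments.zst'
    # Make sure the target directory exists before downloading into it.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    urllib.request.urlretrieve(url, path)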

Compressed .zst file for the_donald_comments already exists. Skipping download.
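The skip-if-present logic above can also be wrapped in a small helper. The sketch below is one possible shape, not the notebook's own method: the function name `download_if_missing` and the `.part` suffix are hypothetical. It downloads to a temporary name first, so an interrupted download is not mistaken for a complete file on the next run.

```python
import os
import urllib.request


def download_if_missing(url, path):
    """Download url to path unless path already exists.

    Returns True if a download ran, False if it was skipped.
    """
    if os.path.exists(path):
        return False

    # Create the target directory if needed ('.' when path has no directory part).
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)

    # Download under a temporary name, then rename once the transfer completes.
    tmp_path = path + ".part"  # hypothetical suffix
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, path)
    return True
```

Calling `download_if_missing(url, path)` with the `url` and `path` from the cell above would then replace the `if`/`else` block.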


### Converting .zst file to a .txt file

The raw data is in a compressed .zst format and has to be decompressed. I stream-read the file to avoid crashing my kernel. The file is huge – once decompressed, it takes up almost 38GB of space.

In [None]:
input_path = f"../data/{name}.zst"
output_path = f"../data/{name}.txt"
exists = os.path.exists(output_path)


# Decompression can take time and is avoided if the text file already exists.
if exists:
    print(f"Output file for {name} already exists. Decompression skipped.")

else:
    print(f"Output file for {name} does not exist. Commencing decompression.")

    # Get the size of the input file
    input_path_size = os.path.getsize(input_path)

    # Open the input file in binary read mode
    with open(input_path, 'rb') as compressed_path:
        # Create a ZstdDecompressor object
        decompressor = zstd.ZstdDecompressor()

        # Create a decompression stream reader
        with decompressor.stream_reader(compressed_path) as reader:
            # Open the output file in binary write mode
            with open(output_path, 'wb') as decompressed_path:
                # Initialize tqdm with total size
                with tqdm(total=input_path_size, unit='B', unit_scale=True, desc='Decompressing', leave=True) as pbar:
                    # Read and decompress data in chunks
                    while True:
                        # Read a chunk of data from the decompression stream
                        chunk = reader.read(65536)  # Read 64KB at a time

                        # Check if there's no more data to read
                        if not chunk:
                            break

                        # Write the decompressed chunk to the output file
                        decompressed_path.write(chunk)

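                        # Update progress bar by the compressed bytes consumed so far,
                        # because the bar's total is the size of the compressed file
                        pbar.update(compressed_path.tell() - pbar.n)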

### Extracting important information from the .txt file

Now that we've uncompressed the file, it's time to start extracting information. I am not analyzing all 38GB of this file. Instead, I am taking every 23rd post for analysis, which comes up to about 2,000,000 posts. This is more than enough.

In [None]:
# Lists for older posts
old_lines = []
ups = []
downs = []
old_scores = []

# Lists for newer posts
lines = []
scores = []
authors = []
utcs = []
posts = []

with open(f"../data/{name}.txt", 'r', encoding='utf-8') as txt_file:
    lines_read = 0
    
    for line in tqdm(txt_file, desc=f"Extracting text from {name}"):
        lines_read += 1
        
        # Only every 23rd line is kept, so only those lines need to be parsed.
        if lines_read % 23 == 0:
            data = json.loads(line)

            # Older posts record upvotes and downvotes separately.
            if 'ups' in data and 'downs' in data:
                downs.append(data['downs'])
                old_scores.append(data['score'])
                ups.append(data['ups'])
                old_lines.append(data)
            # Newer posts record only the total score.
            else:
                lines.append(data)
                scores.append(data['score'])
                authors.append(data['author'])
                utcs.append(data['created_utc'])
                posts.append(data['body'])
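The every-23rd-line sampling can also be expressed with `itertools.islice`, which skips the unwanted lines without parsing them. The sketch below is a minimal illustration on an in-memory stand-in for the text file; the function name `every_nth_record` and the fake records are hypothetical.

```python
import io
import itertools
import json


def every_nth_record(lines, n):
    """Yield the parsed JSON object from the n-th, 2n-th, 3n-th, ... line."""
    # islice counts from zero, so the n-th line sits at index n - 1.
    for line in itertools.islice(lines, n - 1, None, n):
        yield json.loads(line)


# Stand-in for the decompressed file: 100 one-line JSON records numbered 1 to 100.
fake_file = io.StringIO("".join(json.dumps({"n": i}) + "\n" for i in range(1, 101)))

picked = [record["n"] for record in every_nth_record(fake_file, 23)]
print(picked)
```

This selects the same lines as the `lines_read % 23 == 0` check, since both keep the 23rd, 46th, 69th line and so on.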

### Proving that "score" is equal to upvotes minus downvotes

Older Reddit posts have upvotes and downvotes recorded separately as well as the total post score. Newer Reddit posts only have the total post score recorded.

For every post with upvotes and downvotes recorded, I subtracted the number of downvotes from the number of upvotes and compared the result to the total score. For all 45,529 such posts, the two values were the same. This shows that the "score" attribute of the Reddit posts is equal to the number of upvotes minus the number of downvotes.

In [None]:
# Example of an older post
old_lines[0]

In [None]:
# Example of a newer post
lines[0]

In [4]:
import joblib

# Save one example post for later reference
with open('../data/sample_post.pkl', 'wb') as f:
    joblib.dump(lines[0], f)

In [8]:
old_posts_df = pd.DataFrame({'upvotes': ups, 'downvotes': downs, 'score': old_scores})
old_posts_df

| | upvotes | downvotes | score |
|---|---|---|---|
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 |
| 2 | 2 | 0 | 2 |
| 3 | 1 | 0 | 1 |

In [9]:
# For every entry, the score is equal to the upvotes minus the downvotes.
(old_posts_df['upvotes'] - old_posts_df['downvotes'] == old_posts_df['score']).value_counts()

True    4
Name: count, dtype: int64

### Creating initial dataframe

In [10]:
df = pd.DataFrame({'date': pd.to_datetime(utcs, unit='s', utc=True), 
                   'author': authors, 
                   'post': posts, 
                   'score': scores})

df.index = df.index.rename('id')
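For reference, `created_utc` holds seconds since the Unix epoch, which is why `pd.to_datetime` is called with `unit='s'`. The sketch below shows the same conversion for a single value using only the standard library. The epoch value is derived from the first timestamp shown in the preview of `df`, not read from the file.

```python
from datetime import datetime, timezone

# Epoch seconds corresponding to the first row of the preview (derived value).
created_utc = 1438813184

# Interpret the value as UTC, mirroring unit='s', utc=True in pd.to_datetime.
stamp = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(stamp.isoformat(sep=" "))
```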

In [11]:
# A preview of the data and the number of rows – 2.1 million
df

| id | date | author | post | score |
|---|---|---|---|---|
| 0 | 2015-08-05 22:19:44+00:00 | NYPD-32 | A lot of latinos are annoyed with illegal immi... | 2 |
| 1 | 2015-08-09 23:51:28+00:00 | shitheadsean2 | If Donald Trump liquidated everything, and the... | 5 |
| 2 | 2015-08-13 19:16:16+00:00 | NYPD-32 | An* | 7 |
| 3 | 2015-08-14 16:38:24+00:00 | the_achiever | I really support Trump, but he has to work on ... | 1 |
| 4 | 2015-08-17 14:53:52+00:00 | Degenerate_Nation | Two-part Trump-centric podcasts:\n\nhttp://www... | 1 |
| ... | ... | ... | ... | ... |
| 2093201 | 2020-04-09 21:21:21+00:00 | [deleted] | [deleted] | 1 |
| 2093202 | 2020-04-10 04:26:11+00:00 | Fordheartskav | yes and yes. So many spez suckers lying about ... | 1 |
| 2093203 | 2020-04-10 15:57:11+00:00 | RhettOracle | Now it's brutality? You are so biased it's  h... | 1 |
| 2093204 | 2020-04-11 03:00:14+00:00 | [deleted] | [removed] | 17 |


In [12]:
# The dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2093206 entries, 0 to 2093205
Data columns (total 4 columns):
 #   Column  Dtype              
---  ------  -----              
 0   date    datetime64[ns, UTC]
 1   author  object             
 2   post    object             
 3   score   int64              
dtypes: datetime64[ns, UTC](1), int64(1), object(2)
memory usage: 63.9+ MB


In [13]:
print(f"The earliest post is {df.date.min()}.")
print(f"The latest post is {df.date.max()}.")
print(f"The data spans {df.date.max() - df.date.min()}.")

The earliest post is 2015-08-05 22:19:44+00:00.
The latest post is 2020-04-11 18:49:38+00:00.
The data spans 1710 days 20:29:54.
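To express the span above in years, the sketch below repeats the subtraction with the two timestamps from the output and divides by the average length of a year. Using 365.25 days per year is an assumption made for the conversion.

```python
from datetime import datetime, timezone

# Earliest and latest timestamps copied from the output above.
earliest = datetime(2015, 8, 5, 22, 19, 44, tzinfo=timezone.utc)
latest = datetime(2020, 4, 11, 18, 49, 38, tzinfo=timezone.utc)

span = latest - earliest

# Assumed conversion factor: 365.25 days per year.
years = span.total_seconds() / (365.25 * 24 * 3600)
print(f"{span.days} days is roughly {years:.2f} years.")
```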


In [14]:
print(f"The data contains {df.author.nunique()} unique authors.")

The data contains 178308 unique authors.


# Storing data externally

Now we have collected all the essential data for analysis. We are going to store the data externally and pick up from there in a different notebook.

In [16]:
df.to_parquet(path=f'../data/{name}.parquet')