# SC4021 - Data Collection, Cleaning and Analysis
This notebook presents the data collection, cleaning and analysis for the SC4021 - Information Retrieval course project. We will start with the data collection from Reddit using the Python Reddit API Wrapper (PRAW).

## 1. Reddit Data Collection
In this section, we will collect the submissions and comments of the following subreddits:
- VisionPro
- virtualreality
- augmentedreality
- MetaQuestVR
- oculus
- OculusQuest

We will collect the top 1000 submissions of each subreddit. Reddit API listings are capped at roughly 1000 items, so this is the most we can retrieve per subreddit.

But first, we need to install and import the necessary libraries. We will install the PRAW library using the following command: `pip install praw`.

In [None]:
# Importing the necessary libraries
import praw  # Python Reddit API Wrapper
import pandas as pd # Data manipulation library
import os # Operating system library
from datetime import datetime # Datetime library

After we have successfully installed the necessary libraries, we can proceed with the data collection. First, we will define the constants and the Reddit instance.

In [None]:
# Define the constants and the Reddit instance

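# Create a Reddit instance.
# The credentials are read from environment variables so that no secret is stored in the notebook
# (the variable names below are our own choice, set them before running this cell).
# user_agent must be a descriptive string, not a boolean.
reddit = praw.Reddit(user_agent="SC4021 course project data collection",
                     client_id=os.environ["REDDIT_CLIENT_ID"],
                     client_secret=os.environ["REDDIT_CLIENT_SECRET"],
                     username=os.environ["REDDIT_USERNAME"],
                     password=os.environ["REDDIT_PASSWORD"])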

# Define the columns for the submissions and comments dataframes
submission_columns = ["author", "created_utc", "distinguished", "id", "name", "num_comments", "score", "selftext", "title", "upvote_ratio", "url"]
comment_columns = ["author", "body", "body_html", "created_utc", "distinguished", "id", "link_id", "parent_id", "score"]

# Define the directory to save the data for the subreddits
data_directory = "subreddits"

Now, we will define the functions to crawl the data from Reddit. We will define the function to get the submissions and the corresponding comments for a subreddit.

In [None]:
# Define the functions to crawl the data from Reddit

# Function to get the submissions and the corresponding comments for a subreddit
def get_submissions_and_comments(reddit, subreddit_name, limit=None):
    # Create the lists to store the submissions and comments
    submissions_list = []
    comments_list = []
    
    # Define the counter for the submissions and comments
    submission_cnt, comment_cnt = 1, 1
    
    # Browse the submissions
    for submission in reddit.subreddit(subreddit_name).top(limit=limit):
        # Print the progress - submission title, submission cnt
        print(f'{subreddit_name}-{submission_cnt}', submission.title)
        # Define the submission
        new_submission = {
            "author": submission.author,
            "created_utc": submission.created_utc,
            "distinguished": submission.distinguished,
            "id": submission.id,
            "name": submission.name,
            "num_comments": submission.num_comments,
            "score": submission.score,
            "selftext": submission.selftext,
            "title": submission.title,
            "upvote_ratio": submission.upvote_ratio,
            "url": submission.url
        }
        # Add the submission to the list
        submissions_list.append(new_submission)
        # Get the comments
        submission.comments.replace_more(limit=0)
        # Browse the comments
        for comment in submission.comments.list():
            # Print the progress - comment cnt
            print(f'comment #{comment_cnt}')
            # Define the comment
            new_comment = {
                "author": comment.author,
                "body": comment.body,
                "body_html": comment.body_html,
                "created_utc": comment.created_utc,
                "distinguished": comment.distinguished,
                "id": comment.id,
                "link_id": comment.link_id,
                "parent_id": comment.parent_id,
                "score": comment.score
            }
            # Add the comment to the list
            comments_list.append(new_comment)
            comment_cnt += 1
        submission_cnt += 1

    # Convert the lists to dataframes
    submissions = pd.DataFrame(submissions_list, columns=submission_columns)
    comments = pd.DataFrame(comments_list, columns=comment_columns)
    
    return submissions, comments
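
In the function above, `submission.comments.list()` turns the nested comment forest into one flat list. The sketch below mimics that flattening with a breadth-first walk over a hypothetical `Node` class. `Node` and `flatten` are illustrative names, not PRAW types, and the exact visiting order PRAW uses is an assumption here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a comment (not a PRAW class)."""
    id: str
    replies: list = field(default_factory=list)


def flatten(top_level):
    """Return every node of the forest, visiting it level by level."""
    queue = deque(top_level)
    flat = []
    while queue:
        node = queue.popleft()
        flat.append(node)
        # Replies are queued behind the nodes that are already waiting
        queue.extend(node.replies)
    return flat


forest = [Node("a", [Node("a1", [Node("a1x")]), Node("a2")]), Node("b")]
flat_ids = [node.id for node in flatten(forest)]
print(flat_ids)
```

Because replies of any depth end up in the same list, the `parent_id` column is what later allows us to rebuild the thread structure if needed.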

Having defined the function to crawl the data from Reddit, we will now define some helper functions to save the data to CSV files in the corresponding directory.

In [None]:
# Define the functions to save the data to csv files in the corresponding directory

# Function to save the submissions and comments to csv files
def save_submissions_and_comments(submissions, comments, subreddit_name):
    # Create the directory if it does not exist
    os.makedirs(f'{data_directory}/{subreddit_name}', exist_ok=True)
    
    # Save the submissions to a csv file
    submissions.to_csv(f'{data_directory}/{subreddit_name}/submissions.csv', index=False)
    # Save the comments to a csv file
    comments.to_csv(f'{data_directory}/{subreddit_name}/comments.csv', index=False)
    
# Function to check if the subreddit directory exists and if the data is already collected
def check_subreddit_data(subreddit_name):
    # Check if the subreddit directory exists
    if not os.path.exists(f'{data_directory}/{subreddit_name}'):
        return False
    # Check if the submissions and comments csv files exist
    if not os.path.exists(f'{data_directory}/{subreddit_name}/submissions.csv') or not os.path.exists(f'{data_directory}/{subreddit_name}/comments.csv'):
        return False
    
    return True
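
The same existence check can also be written with `pathlib`. This is only a sketch of an alternative: `data_is_collected` and the temporary directory are illustrative and not part of the notebook.

```python
import tempfile
from pathlib import Path


def data_is_collected(root, subreddit_name):
    """Return True only if both CSV files exist for the subreddit."""
    folder = Path(root) / subreddit_name
    return all((folder / name).is_file() for name in ("submissions.csv", "comments.csv"))


with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "VisionPro"
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "submissions.csv").write_text("id\n")
    before = data_is_collected(root, "VisionPro")  # comments.csv is still missing
    (folder / "comments.csv").write_text("id\n")
    after = data_is_collected(root, "VisionPro")

print(before, after)
```

Checking both files matters because an interrupted run could leave only one of them behind, in which case the subreddit should be crawled again.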

Having defined all the functions needed for the crawling and saving the data, we can now proceed with the data collection for the subreddits. But first, let's define the subreddits we want to crawl the data from.

In [None]:
# Define the subreddits to crawl the data from
subreddits = ["VisionPro", "virtualreality", "augmentedreality", "MetaQuestVR", "oculus", "OculusQuest"]

Now, let's crawl the data for the subreddits.

In [None]:
# Crawl the data for the subreddits
for subreddit_name in subreddits:
    # Check if the data is already collected
    if check_subreddit_data(subreddit_name):
        continue
    # Get the submissions and comments
    submissions, comments = get_submissions_and_comments(reddit, subreddit_name, limit=1000)
    # Save the data to csv files
    save_submissions_and_comments(submissions, comments, subreddit_name)

## 2. Analyzing the data
In this section, we will analyze the collected data. This section is important to understand the data and to identify any issues that need to be addressed in the data cleaning section. 

First, we will need to import the matplotlib library, used to visualize the data.

In [None]:
# Importing the necessary libraries
import matplotlib.pyplot as plt

Now, we will define the functions to load the data and analyze it.

In [None]:
# Define the functions to load the data and analyze it

# Function to load the submissions and comments dataframes
def load_submissions_and_comments(subreddit_name):
    # Load the submissions and comments dataframes
    submissions = pd.read_csv(f'{data_directory}/{subreddit_name}/submissions.csv')
    comments = pd.read_csv(f'{data_directory}/{subreddit_name}/comments.csv')
    
    return submissions, comments

def simple_analyze_submissions_and_comments(submissions, comments):
    # Create a dictionary to store the results
    results = {}

    # Calculate and store the results in the dictionary
    results["Number of submissions"] = len(submissions)
    results["Number of comments"] = len(comments)
    results["Number of unique authors in submissions"] = len(submissions["author"].unique())
    results["Number of unique authors in comments"] = len(comments["author"].unique())
    results["Number of unique submissions"] = len(submissions["id"].unique())
    results["Number of unique comments"] = len(comments["id"].unique())
    comments["word_length"] = comments["body"].apply(lambda x: len(str(x).split()))
    results["Average number of words per comment"] = comments["word_length"].mean()
    results["Number of comments that have more than 50 words"] = len(comments[comments["word_length"] > 50])
    results["Number of submissions that have more than 50 words in the selftext"] = len(submissions[submissions["selftext"].apply(lambda x: len(str(x).split())) > 50])
    results["Average score of submissions"] = submissions["score"].mean()
    results["Average score of comments"] = comments["score"].mean()
    results["Average number of comments per submission"] = submissions["num_comments"].mean()

    # Convert the dictionary to a DataFrame
    results = pd.DataFrame(list(results.items()), columns=['Description', 'Data'])
    
    return results

Now, we will combine the data for all subreddits and perform the simple analysis.

In [None]:
# Perform analysis on the combined data for all subreddits

# Load the data for the subreddits and analyze it
submissions_list = []
comments_list = []
for subreddit_name in subreddits:
    # Load the submissions and comments dataframes
    submissions, comments = load_submissions_and_comments(subreddit_name)
    # Add the subreddit name to the submissions and comments dataframes
    submissions["subreddit"] = subreddit_name
    comments["subreddit"] = subreddit_name
    # Add the submissions and comments to the lists
    submissions_list.append(submissions)
    comments_list.append(comments)

# Concatenate the submissions and comments dataframes
all_submissions = pd.concat(submissions_list)
all_comments = pd.concat(comments_list)

# Run the simple analysis on the combined data
simple_analyze_submissions_and_comments(all_submissions, all_comments)

We will now look at the authors with the most comments.

In [None]:
# Show the users with most comments and the number of comments
all_comments["author"].value_counts()

We will now check if the top 100 authors with the most comments have a lot of duplicated comments.
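
One possible way to run this check is sketched below on a toy DataFrame. The helper name `duplicate_share_by_top_authors` and the toy data are our own assumptions; only the `author` and `body` columns come from the collected data.

```python
import pandas as pd


def duplicate_share_by_top_authors(comments, top_n=100):
    """For the top_n most active authors, compare total comments with distinct bodies."""
    top_authors = comments["author"].value_counts().head(top_n).index
    subset = comments[comments["author"].isin(top_authors)]
    stats = subset.groupby("author")["body"].agg(total="size", distinct="nunique")
    # Share of an author's comments that repeat one of their earlier bodies
    stats["duplicate_share"] = 1 - stats["distinct"] / stats["total"]
    return stats.sort_values("duplicate_share", ascending=False)


toy = pd.DataFrame({
    "author": ["bot", "bot", "bot", "bot", "alice", "alice", "bob"],
    "body": ["I am a bot", "I am a bot", "I am a bot", "hello", "first", "second", "only one"],
})
toy_stats = duplicate_share_by_top_authors(toy, top_n=2)
print(toy_stats)
```

On the real data the call would be `duplicate_share_by_top_authors(all_comments)`. Authors with a share close to 1 are likely bots or moderators posting templated messages.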

We will proceed with the analysis of the authors with most submissions (posts).

In [None]:
# Show the users with most submissions
all_submissions["author"].value_counts()

We will now check the most common comments.

In [None]:
# Show the most common comments
all_comments["body"].value_counts()

Let's check the comments with at least 10 words that occur more than once.

In [None]:
# Show the comments with at least 10 words that occur more than once
long_comment_counts = all_comments[all_comments["body"].apply(lambda x: len(str(x).split()) >= 10)]["body"].value_counts()
all_repeated_comments = long_comment_counts[long_comment_counts > 1]
all_repeated_comments

We will lastly plot the distribution of the scores for the comments.

In [None]:
# Plot the distribution of the scores for the comments in a log-log scale
# (non-positive scores cannot be represented on a logarithmic x-axis)
plt.figure(figsize=(10, 6))
plt.hist(all_comments["score"], bins=100)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.title("Distribution of the scores for the comments")
plt.show()

As the plot shows, many comments have a low score (roughly 0 to 10), typically because they are new or drew little attention. However, there are also thousands of comments with a score higher than 100, so the dataset contains many popular comments. We may use this score to perform weighted retrieval in the future.
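
To illustrate what such weighted retrieval could look like, the sketch below boosts a relevance score with a damped function of the comment score. The formula, the `weight` parameter, and the relevance values are all hypothetical; nothing here is an implemented part of the project.

```python
import math


def weighted_score(relevance, upvotes, weight=0.1):
    """Boost a relevance score by a logarithmically damped comment score."""
    # Scores can be zero or negative, so clamp at 0 before taking the log
    return relevance * (1 + weight * math.log1p(max(upvotes, 0)))


candidates = [
    {"id": "c1", "relevance": 0.80, "upvotes": 2},
    {"id": "c2", "relevance": 0.78, "upvotes": 450},
    {"id": "c3", "relevance": 0.90, "upvotes": -5},
]
ranked = sorted(candidates, key=lambda c: weighted_score(c["relevance"], c["upvotes"]), reverse=True)
ranked_ids = [c["id"] for c in ranked]
print(ranked_ids)
```

The logarithm keeps a handful of very popular comments from dominating the ranking, which matters given the heavy-tailed score distribution.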

## 3. Data Cleaning
In this section, we will clean the data. This matters because we do not want to keep duplicated comments, very short comments, comments made up mostly of special characters, and similar low-quality text.

First, we will define a function that removes the comments that:
- Occur in `all_repeated_comments` more than 3 times
- Have fewer than a given number of words
- Are duplicated (same body and same author)
- Have a high percentage of special characters (50% or more non-alphanumeric characters)
- Have a low share of unique words (40% or less)

In [None]:
# Define the function to clean the comments
def clean_comments(subreddits, min_word_count):
    subreddits_data = {}
    for subreddit_name in subreddits:
        # Load the submissions and comments dataframes
        submissions, comments = load_submissions_and_comments(subreddit_name)
        # Add the submissions and comments dataframes to the dictionary
        subreddits_data[subreddit_name] = {"submissions": submissions, "comments": comments}
        
    # Loop through the subreddits
    for subreddit_name, data in subreddits_data.items():
        initial_count = len(data["comments"])

        # Remove the comments that occur in all_repeated_comments and have more than 3 occurrences
        data["comments"] = data["comments"][~data["comments"]["body"].isin(all_repeated_comments[all_repeated_comments > 3].index)]
        print(f"{subreddit_name}: Removed {initial_count - len(data['comments'])} comments that occur in all_repeated_comments and have more than 3 occurrences")
        initial_count = len(data["comments"])

        # Clear the comments with less than min_word_count words
        data["comments"] = data["comments"][data["comments"]["body"].apply(lambda x: len(str(x).split()) >= min_word_count)]
        print(f"{subreddit_name}: Removed {initial_count - len(data['comments'])} comments with less than {min_word_count} words")
        initial_count = len(data["comments"])

        # Remove duplicated comments with same body and author
        data["comments"] = data["comments"].drop_duplicates(subset=["body", "author"])
        print(f"{subreddit_name}: Removed {initial_count - len(data['comments'])} duplicated comments")
        initial_count = len(data["comments"])

        # Remove comments with high percentage of special characters
        data["comments"] = data["comments"][data["comments"]["body"].apply(lambda x: len([c for c in str(x) if not c.isalnum()]) / len(str(x)) < 0.5)]
        print(f"{subreddit_name}: Removed {initial_count - len(data['comments'])} comments with high percentage of special characters")
        initial_count = len(data["comments"])

        # Remove comments with less than 40% unique words
        data["comments"] = data["comments"][data["comments"]["body"].apply(lambda x: len(set(str(x).split())) / len(str(x).split()) > 0.4)]
        print(f"{subreddit_name}: Removed {initial_count - len(data['comments'])} comments with less than 40% unique words")
        

        # Save the cleaned comments to a csv file
        data["comments"].to_csv(f'{data_directory}/{subreddit_name}/comments_cleaned_{min_word_count}.csv', index=False)

        # Print the number of comments after cleaning
        print(f"{subreddit_name}: Number of comments after cleaning: {len(data['comments'])}")
        print()

    # Print the total number of comments after cleaning
    total_comments = sum([len(data["comments"]) for data in subreddits_data.values()])
    print(f"Total number of comments after cleaning: {total_comments}")

    # Print the number of comments removed, relative to the combined all_comments DataFrame
    print(f"Number of comments removed compared to all_comments: {len(all_comments) - total_comments}")
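
To make the per-comment filters easier to reason about, here is a small pandas-free sketch of the same three text checks applied to example strings. The helper names and the example comments are illustrative, and the thresholds (0.5 and 0.4) are copied from the function above.

```python
def special_char_ratio(text):
    """Share of characters that are not alphanumeric (whitespace counts as well)."""
    return sum(not c.isalnum() for c in text) / len(text)


def unique_word_ratio(text):
    """Share of distinct words among all whitespace-separated words."""
    words = text.split()
    return len(set(words)) / len(words)


def keep_comment(text, min_word_count):
    """Apply the length, special-character, and unique-word filters in order."""
    if len(text.split()) < min_word_count:
        return False
    return special_char_ratio(text) < 0.5 and unique_word_ratio(text) > 0.4


examples = {
    "normal": "The passthrough quality on this headset is better than I expected",
    "repetitive": "buy now " * 6,
    "symbols": "!!! ??? ### $$$ %%% ^^^ &&& *** ((( )))",
    "short": "Nice one",
}
decisions = {name: keep_comment(text, min_word_count=5) for name, text in examples.items()}
print(decisions)
```

Note that spaces count as non-alphanumeric characters in this ratio, so ordinary prose already has a non-zero value. The 0.5 threshold leaves room for that.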

As we want our comments to be meaningful, we will keep only the comments that have at least 50 words. We will clean the comments using the function defined above.

In [None]:
# Define the min_word_count variable
min_word_count = 50

In [None]:
# Clean the comments with min_word_count = 50
clean_comments(subreddits, min_word_count)

In [None]:
# Combine the cleaned comments for all subreddits
cleaned_comments_list = []
for subreddit_name in subreddits:
    # Load the cleaned comments
    cleaned_comments = pd.read_csv(f'{data_directory}/{subreddit_name}/comments_cleaned_{min_word_count}.csv')
    # Add the subreddit name to the cleaned comments
    cleaned_comments["subreddit"] = subreddit_name
    # Add the cleaned comments to the list
    cleaned_comments_list.append(cleaned_comments)

# Concatenate all the cleaned comments into a single DataFrame
all_cleaned_comments = pd.concat(cleaned_comments_list)

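# Save the cleaned comments to a csv file (create the "all" directory first if it does not exist)
os.makedirs(f'{data_directory}/all', exist_ok=True)
all_cleaned_comments.to_csv(f'{data_directory}/all/comments_cleaned_{min_word_count}.csv', index=False)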

In [None]:
# Load the cleaned comments for all subreddits
cleaned_comments = pd.read_csv(f'{data_directory}/all/comments_cleaned_{min_word_count}.csv')
cleaned_comments

In [None]:
# Rename the score column to upvotes
cleaned_comments = cleaned_comments.rename(columns={"score": "upvotes"})

# Save the cleaned comments to a csv file
cleaned_comments.to_csv(f'{data_directory}/all/comments_cleaned_{min_word_count}.csv', index=False)

We will also combine all the submissions for all subreddits and save them to a csv file.

In [None]:
# Define the list to store the submissions for all subreddits
submissions_list = []

# Load the submissions for all subreddits
for subreddit_name in subreddits:
    # Load the submissions
    submissions = pd.read_csv(f'{data_directory}/{subreddit_name}/submissions.csv')
    # Add the subreddit name to the submissions
    submissions["subreddit"] = subreddit_name
    # Add the submissions to the list
    submissions_list.append(submissions)

# Concatenate all the submissions into a single DataFrame
all_submissions = pd.concat(submissions_list)

# Save the submissions to a csv file
all_submissions.to_csv(f'{data_directory}/all/submissions.csv', index=False)

Now that we have all the cleaned comments, we will collect the submissions they belong to once more.

In [None]:
# Get the set of unique submission IDs for the cleaned comments
unique_submission_ids = set(all_cleaned_comments["link_id"].unique())
print(f"Number of unique submission IDs: {len(unique_submission_ids)}")

# Use PRAW to get the submissions for the unique submission IDs
submissions_list = []
cnt = 0
total = len(unique_submission_ids)  # avoid shadowing the built-in max()
unique_submission_ids = list(unique_submission_ids)
for submission_id in unique_submission_ids:
    # link_id has the form "t3_<id>", so strip the "t3_" prefix
    submission = reddit.submission(id=submission_id[3:])
    # PRAW loads a submission lazily, so a missing submission only fails
    # when its attributes are read; catch that and move on
    try:
        new_submission = {
            "author": submission.author,
            "created_utc": submission.created_utc,
            "distinguished": submission.distinguished,
            "id": submission.id,
            "name": submission.name,
            "num_comments": submission.num_comments,
            "upvotes": submission.score,
            "selftext": submission.selftext,
            "title": submission.title,
            "upvote_ratio": submission.upvote_ratio,
            "url": submission.url
        }
    except Exception as error:
        print(f"Submission with ID {submission_id} could not be fetched: {error}")
        continue
    submissions_list.append(new_submission)
    # Print the progress on a single line
    cnt += 1
    print(f"Submission {cnt}/{total} collected", end="\r")
    

# Convert the list to a DataFrame
submissions = pd.DataFrame(submissions_list, columns=new_submission.keys())
submissions

In [None]:
# Define the final submissions DataFrame as a copy of the submissions DataFrame
final_submissions = submissions.copy()
# Convert the created_utc column to a date string (YYYY-MM-DD)
final_submissions["created_utc"] = final_submissions["created_utc"].apply(lambda x: datetime.utcfromtimestamp(x).strftime('%Y-%m-%d'))
# Save the submissions to a csv file
final_submissions.to_csv(f'{data_directory}/all/submissions_cleaned_{min_word_count}.csv', index=False)
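
`datetime.utcfromtimestamp` is deprecated since Python 3.12. A timezone-aware alternative is sketched below. The helper name `utc_date` is our own, and the sketch is meant as a possible drop-in for the lambda used above rather than a required change.

```python
from datetime import datetime, timezone


def utc_date(epoch_seconds):
    """Convert a Unix timestamp (seconds) to a YYYY-MM-DD string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


epoch_start = utc_date(0)
new_year_2024 = utc_date(1704067200)
print(epoch_start, new_year_2024)
```

It could be applied as `final_submissions["created_utc"].apply(utc_date)`. Passing `tz=timezone.utc` keeps the result independent of the machine's local time zone.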

In [None]:
# Load the comments with the classification
class_comments = pd.read_csv('comments_class.csv')

# Load the cleaned comments
cleaned_comments = pd.read_csv(f'{data_directory}/all/comments_cleaned_{min_word_count}.csv')

# Convert the created_utc column to a date string (YYYY-MM-DD)
cleaned_comments["created_utc"] = cleaned_comments["created_utc"].apply(lambda x: datetime.utcfromtimestamp(x).strftime('%Y-%m-%d'))

# Add the Predicted_Class and Confidence_Level columns to the cleaned comments (match by id)
cleaned_comments = cleaned_comments.merge(class_comments[["id", "Predicted_Class", "Confidence_Level"]], on="id", how="left")

# Save the cleaned comments with the classification to a csv file
cleaned_comments.to_csv(f'{data_directory}/all/comments_cleaned_{min_word_count}_class.csv', index=False)
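
Because the merge above uses `how="left"`, comments without a matching row in the classification file are kept with empty class columns. The sketch below, on toy data of our own, shows how the `indicator` option of `merge` could be used to count such unlabeled comments.

```python
import pandas as pd

# Toy stand-ins for the cleaned comments and the classification file
comments = pd.DataFrame({"id": ["a", "b", "c"], "body": ["x", "y", "z"]})
labels = pd.DataFrame({
    "id": ["a", "c"],
    "Predicted_Class": ["positive", "negative"],
    "Confidence_Level": [0.9, 0.7],
})

# indicator=True adds a "_merge" column telling where each row came from
merged = comments.merge(labels, on="id", how="left", indicator=True)
unlabeled_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(unlabeled_ids)
```

If the list is not empty on the real data, the classification file does not cover every cleaned comment, which is worth knowing before the labels are used downstream.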