# **CSE 5095: Social Media Mining and Analysis**
Fall 2024, Assignment #1, 200 points


In this assignment, we explore the statistical properties of the quantitative features associated with each subreddit in your data set. Each data set contains observations from two subreddits. In some data sets each observation is a post; in the others, each observation is a compilation of the comments on a unique post.
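As a minimal sketch of what "statistical properties per subreddit" means here, the example below computes the mean and sample variance of two features for each subreddit on a toy frame. The frame `toy`, its values, and the names `toy_features`, `per_subreddit`, and `summary` are made up for illustration and are not part of the assignment data.

```python
import pandas as pd

# Toy stand-in for the real data; every value here is made up for illustration.
toy = pd.DataFrame({
    "subreddit": ["action", "action", "science", "science"],
    "post_score": [10, 20, 30, 50],
    "user_total_karma": [100, 300, 200, 600],
})
toy_features = ["post_score", "user_total_karma"]

# Mean and sample variance (ddof=1, the pandas default) per subreddit.
per_subreddit = toy.groupby("subreddit")[toy_features].agg(["mean", "var"])

# Reshape so that features are rows and (subreddit, statistic) pairs are columns.
summary = per_subreddit.T.unstack(level=1)
print(summary)
```

For the toy `action` rows, `post_score` has mean 15 and sample variance 50. Note that pandas divides by n - 1 rather than n when computing `var`.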


In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

In [4]:
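def calc_stats(group):
    # Per-feature mean and sample variance (ddof=1) for one subreddit's rows.
    # Relies on the module-level `features` list, which is defined below
    # before this function is first called.
    return pd.DataFrame({
        'mean': group[features].mean(),
        'variance': group[features].var()
    })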


df = pd.read_csv('project10.csv')

post_features = ['post_score', 'post_upvote_ratio', 'post_thumbs_ups', 'post_total_awards_received']
comment_features = ['score', 'controversiality', 'ups', 'downs']
user_features = ['user_awardee_karma', 'user_awarder_karma', 'user_link_karma', 'user_comment_karma', 'user_total_karma']

is_post_level = all(feature in df.columns for feature in post_features)

if is_post_level:
    features = post_features + user_features
else:
    features = comment_features + user_features

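# Select the feature columns before apply so that the grouping column is not
# passed to calc_stats (newer pandas versions warn when it is).
stats = df.groupby('subreddit')[features].apply(calc_stats).reset_index()
# reset_index turns the unnamed feature level of the index into a 'level_1' column.
stats = stats.pivot(index='level_1', columns='subreddit', values=['mean', 'variance'])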

subreddits = stats.columns.get_level_values(1).unique()
new_order = [(stat, subreddit) for subreddit in subreddits for stat in ['mean', 'variance']]
stats = stats.reindex(columns=new_order)

stats = stats.round(2)
stats.index.name = 'Feature'
stats.columns.names = ['Statistic', 'Subreddit']

print(stats.to_string())

stats.to_csv('subreddit_statistics.csv')

Statistic                        mean      variance       mean      variance
Subreddit                      action        action    science       science
Feature                                                                     
post_score                      94.45  1.750466e+04     180.31  1.177669e+05
post_thumbs_ups                 94.45  1.750466e+04     180.31  1.177669e+05
post_total_awards_received       0.00  0.000000e+00       0.00  0.000000e+00
post_upvote_ratio                0.93  1.000000e-02       0.93  1.000000e-02
user_awardee_karma            2089.47  5.427181e+07    1944.27  3.774821e+07
user_awarder_karma             575.29  9.556991e+06     605.58  5.119978e+06
user_comment_karma           77849.73  6.208256e+10   66500.80  1.322043e+10
user_link_karma              47975.48  4.181358e+10   90178.09  1.244889e+11
user_total_karma            128489.97  1.787689e+11  159228.75  1.764452e+11
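In the table above, `post_score` and `post_thumbs_ups` share the same mean and variance, and `post_total_awards_received` has zero variance in both subreddits. Matching summary statistics suggest, but do not prove, that two columns are identical. A minimal sketch of a direct check follows, shown on a toy frame with made-up values; the helper name `find_redundant_features` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def find_redundant_features(frame, columns):
    """Return (constant_columns, duplicate_pairs) among the given columns."""
    # A column with at most one distinct value carries no information.
    constant = [c for c in columns if frame[c].nunique(dropna=False) <= 1]
    # Compare every pair of columns element by element.
    duplicates = [
        (a, b)
        for i, a in enumerate(columns)
        for b in columns[i + 1:]
        if frame[a].equals(frame[b])
    ]
    return constant, duplicates

# Toy frame; values are made up and only mimic the pattern seen in the table.
toy = pd.DataFrame({
    "post_score": [3, 10, 7],
    "post_thumbs_ups": [3, 10, 7],
    "post_total_awards_received": [0, 0, 0],
    "post_upvote_ratio": [0.9, 0.95, 1.0],
})
constant, duplicates = find_redundant_features(toy, list(toy.columns))
print(constant)
print(duplicates)
```

On the real data, the same call would be `find_redundant_features(df, features)`. Any column it flags could be dropped before further analysis, since it adds nothing beyond what another feature already provides.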


