# <a href="http://www.redditscore.com">![redditscore](https://s3.us-east-2.amazonaws.com/redditscore/logo.png)</a> 
***A machine learning approach to predicting how badly you'll get roasted for your sub-par reddit comments.***

Alex Hartford & Trevor Hacker

### **Dataset**

Reddit comments from September 2018 (<a href="http://files.pushshift.io/reddit/">source</a>).  This is well over 100 GB of data.  We will likely start with a subset of the data, but will ultimately try to use the entire dataset.

### **Objectives**

Create a linear regression model to predict the reddit score of a comment a user is considering posting. 

Stretch goal - narrow comment scoring down by subreddit, as comment popularity will differ between reddit communities.

Allow users to use this model with a publicly available <a href="http://www.redditscore.com">Website</a>.

Open source the project to allow further contributions if anyone is interested.

## Formal Hypothesis and Data Analysis

We believe that, by analyzing comments previously made on Reddit, we can predict how well a new comment will be received, and that people who seek to gather as much reputation as possible on the platform would find value in that prediction.  In the process, finding some of the most common highly and poorly received comments would be interesting in its own right, as it can provide insight into current trends on the web.

This dataset is just one of many: there are publicly available datasets covering everything ever posted on Reddit. Community members assembled the data by running scripts against the Reddit API and did most of the cleaning for us. Interestingly, these datasets were released quite a while ago in the hope that people would build something with them. From what I can tell, Redditscore is one of the first applications that uses this data, rather than just providing a few nice graphs.  There is actually a problem potentially being solved here, as there are people who live for Reddit karma.

In [62]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

print('Libraries loaded!')

Libraries loaded!


## Import and Clean Data

In [63]:
print('Loading memes...')

# df = pd.read_csv('https://s3.us-east-2.amazonaws.com/redditscore/2500rows.csv')
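# on_bad_lines='skip' drops malformed rows (pandas >= 1.3; older versions used error_bad_lines=False).
df = pd.read_csv('https://s3.us-east-2.amazonaws.com/redditscore/2mrows.csv', on_bad_lines='skip', engine='python', encoding='utf-8')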

print('Memes are fully operational!')

Loading memes...


UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7941: character maps to <undefined>

In [None]:
print(df.dtypes)
print()
print(df.shape)
df.head(10)

The score will always be an integer, since it is based on upvotes and downvotes. Before converting the column, however, we need to check whether there are any null values.

In [None]:
df.isna().sum()

In [None]:
df[df.isnull().any(axis=1)].head(20)

There are only a small number of null values, and those rows appear to be of little use, so removing them seems to be the best bet. Once the null values are removed, we can convert score to an integer.

In [None]:
df = df.dropna()
df['score'] = df['score'].astype('int')
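
The order matters here because pandas stores a numeric column that contains NaN as a float, and casting NaN to an integer fails. A minimal sketch on a made-up three-row frame (the toy values are assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({"body": ["nice", None, "meh"], "score": [3.0, np.nan, -2.0]})

# A missing value forces the whole column to a float dtype.
dtype_before = toy["score"].dtype

# Casting NaN to int raises, which is why dropna() has to come first.
try:
    toy["score"].astype("int")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# copy() avoids chained-assignment warnings when overwriting the column.
clean = toy.dropna().copy()
clean["score"] = clean["score"].astype("int")
print(dtype_before, cast_failed, clean["score"].tolist())
```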

In [None]:
print(df.shape)
df.head(10)

## Initial Data Analysis

Before handling the comment body, we need a better understanding of the score column.

In [None]:
df['score'].describe()

In [None]:
sns.distplot(df["score"], kde=False)

As the standard deviation and the distribution plot show, the scores are spread over a very wide range, which makes the dataset heavily skewed.

To address this, log scaling can be applied, which might be useful later on.

In [None]:
mask = df["score"] > 0
sns.distplot(np.log1p(df["score"][mask]), kde=False)

The positive scores appear to be skewed with a significant majority of values being equal to 1. 

In [None]:
mask = df["score"] < 0
sns.distplot(-np.log1p(-df["score"][mask]), kde=False)

The negative scores also seem a little skewed.
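
The two plots above use `np.log1p(score)` for positive scores and `-np.log1p(-score)` for negative ones. As a rough sketch, both can be written as one sign-preserving transform with an explicit inverse. The helper names `signed_log1p` and `signed_expm1` and the sample scores are assumptions for illustration, not part of the notebook:

```python
import numpy as np

def signed_log1p(scores):
    """Compress magnitudes with log1p while keeping the sign of each score."""
    scores = np.asarray(scores, dtype=float)
    return np.sign(scores) * np.log1p(np.abs(scores))

def signed_expm1(values):
    """Invert signed_log1p, mapping transformed values back to raw scores."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.expm1(np.abs(values))

raw = np.array([-250, -3, 0, 1, 40, 12000])  # invented scores
scaled = signed_log1p(raw)
restored = signed_expm1(scaled)
print(np.round(scaled, 2))
```

Because `log1p` grows slowly, a score of 12000 ends up only a few units away from a score of 40, which is what tames the long tail in the histograms.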

#### Adding another score column

In order to understand the data better, and to create a logistic regression model, a separate column was created with the values positive, negative, or one. A positive score is anything greater than 1, a negative score is anything less than 1 (that is, 0 or below), and one is exactly 1. The reason for this classification is how comments on Reddit work: whenever a comment is made it automatically gets an upvote, so a score of zero means it received a downvote.

In [None]:
# Label each comment from its score: > 1 is positive, <= 0 is negative, exactly 1 is one.
# np.select is vectorized and avoids chained assignment through df['pn_score'].at[i].
df['pn_score'] = np.select(
    [df['score'] > 1, df['score'] <= 0],
    ['positive', 'negative'],
    default='one'
)

df.head(10)

In [None]:
pn_counts = df['pn_score'].value_counts()
print(pn_counts)
pn_counts.plot.bar()
plt.ylabel("Number of Samples", fontsize=16)

Again, there is an issue with the distribution here. The majority of the dataset has positive score values, while negative scores are much less frequent.

## Logistic Regression Model
A combination of logistic regression and linear regression models will be used.

The logistic model is built on the categorical score values, so it predicts whether a comment will have a positive score, a negative score, or a score of 1.

For the comments to be meaningful predictors of score, they first need to be turned into a vector of numerical features. The vectorizer used implements Term Frequency-Inverse Document Frequency (TF-IDF) weighting. Additionally, English stop words were removed from the vocabulary.

In [None]:
log_vect = TfidfVectorizer(max_df = 0.95, min_df = 5, binary=True, stop_words='english')
text_features = log_vect.fit_transform(df.body)
print(text_features.shape)

In [None]:
list(log_vect.vocabulary_)[:10]
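
To make the TF-IDF step more concrete, here is a small sketch on three invented comments. It drops the notebook's `min_df` and `max_df` settings, since they would filter a corpus this small down to nothing; everything else about the toy data is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented comments; not taken from the dataset.
comments = [
    "this is the funniest comment",
    "this is the dumbest comment",
    "the funniest bot wins",
]

# binary=True records only whether a term occurs in a comment, so the final
# weight depends on how rare the term is across comments, not on repeats.
vectorizer = TfidfVectorizer(binary=True, stop_words="english")
features = vectorizer.fit_transform(comments)

# Stop words such as "this", "is" and "the" never enter the vocabulary.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(features.shape)  # (number of comments, number of vocabulary terms)
```

Each row of `features` is the sparse numeric vector the classifier sees in place of the raw comment text.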

In [None]:
encoder = LabelEncoder()
numerical_labels = encoder.fit_transform(df['pn_score'])

training_X, testing_X, training_y, testing_y = train_test_split(text_features,
                                                               numerical_labels,
                                                               stratify=numerical_labels)
print(training_y)

# Logistic loss is named "log_loss" in scikit-learn >= 1.1 (older releases used loss="log").
logistic_regression = SGDClassifier(loss="log_loss", penalty="l2", max_iter=1500)
logistic_regression.fit(training_X, training_y)
pred_labels = logistic_regression.predict(testing_X)

accuracy = accuracy_score(testing_y, pred_labels)
cm = confusion_matrix(testing_y, pred_labels)

print("Accuracy:", accuracy)
print("Classes:", str(encoder.classes_))
print("Confusion Matrix:")
print(cm)

Since the data is so skewed, simple random over-sampling was used to increase the number of negative scores. Over-sampling was chosen over under-sampling because we didn't want to lose any comments that could contribute as predictors. This does run the risk of overfitting the data, however.

In [None]:
# Look up the class size by label instead of relying on the order value_counts() returns.
count_one = df['pn_score'].value_counts()['one']

df_pos_score = df[df['pn_score'] == 'positive']
df_neg_score = df[df['pn_score'] == 'negative']
df_one_score = df[df['pn_score'] == 'one']

df_neg_score_over = df_neg_score.sample(count_one, replace=True)
df_score_over = pd.concat([df_pos_score, df_neg_score_over, df_one_score], axis=0)

print('Random over-sampling:')

pn_counts = df_score_over['pn_score'].value_counts()
print(pn_counts)
pn_counts.plot.bar()
plt.ylabel("Number of Samples", fontsize=16)
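
One caveat with over-sampling before the train/test split: copies of the same negative comment can land on both sides of the split, which may inflate the measured accuracy for that class. A hedged alternative, not what this notebook does, is to split first and over-sample only the training portion. The toy frame, the 75/25 split, and the helper name `oversample_minority` are assumptions:

```python
import pandas as pd

def oversample_minority(train_df, label_col, minority, target_size, seed=0):
    """Resample rows of one class with replacement until it has target_size rows."""
    minority_rows = train_df[train_df[label_col] == minority]
    other_rows = train_df[train_df[label_col] != minority]
    resampled = minority_rows.sample(target_size, replace=True, random_state=seed)
    return pd.concat([other_rows, resampled], axis=0)

# Invented, imbalanced toy data: 16 'one' rows and 4 'negative' rows.
toy = pd.DataFrame({
    "body": [f"comment {i}" for i in range(20)],
    "pn_score": ["one"] * 16 + ["negative"] * 4,
})

# Split first (stratified by class) so held-out rows are never duplicated into training.
test_df = toy.groupby("pn_score").sample(frac=0.25, random_state=0)
train_df = toy.drop(test_df.index)

n_one = int((train_df["pn_score"] == "one").sum())
balanced_train = oversample_minority(train_df, "pn_score", "negative", n_one)
print(balanced_train["pn_score"].value_counts())
```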

As with the first model, the comments need to be vectorized.

In [None]:
log_vect_over = TfidfVectorizer(max_df = 0.95, min_df = 5, binary=True, stop_words='english')
text_features = log_vect_over.fit_transform(df_score_over.body)
print(text_features.shape)

In [None]:
list(log_vect_over.vocabulary_)[:10]

Now that the comments have been turned into vectorized features, they can be used in the logistic regression model. To achieve better results, the randomly over-sampled data is used.

In [None]:
encoder = LabelEncoder()
numerical_labels = encoder.fit_transform(df_score_over['pn_score'])

training_X, testing_X, training_y, testing_y = train_test_split(text_features,
                                                               numerical_labels,
                                                               stratify=numerical_labels)
print(training_y)

# Logistic loss is named "log_loss" in scikit-learn >= 1.1 (older releases used loss="log").
logistic_regression_over = SGDClassifier(loss="log_loss", penalty="l2", max_iter=1500)
logistic_regression_over.fit(training_X, training_y)
pred_labels = logistic_regression_over.predict(testing_X)

accuracy = accuracy_score(testing_y, pred_labels)
cm = confusion_matrix(testing_y, pred_labels)

print("Accuracy:", accuracy)
print("Classes:", str(encoder.classes_))
print("Confusion Matrix:")
print(cm)

According to the confusion matrix, the model struggles to identify comments with a score of 1 and usually mistakes them for positive comments. It seems to perform best on negative comments, which could indicate overfitting to the duplicated, over-sampled rows.
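
Reading a confusion matrix by eye gets easier with per-class recall and precision. The sketch below uses an invented matrix, not the notebook's actual output, laid out the way `confusion_matrix` returns it (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Invented counts, ordered as LabelEncoder sorts the labels: negative, one, positive.
classes = ["negative", "one", "positive"]
cm = np.array([
    [90,  5,  5],
    [10, 40, 50],
    [ 5, 25, 70],
])

# Recall: of the comments that truly belong to a class, how many were found.
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: of the comments predicted as a class, how many were correct.
precision = cm.diagonal() / cm.sum(axis=0)

for name, r, p in zip(classes, recall, precision):
    print(f"{name:>8}: recall={r:.2f} precision={p:.2f}")
```

In this made-up matrix the low recall for `one` comes from the 50 comments in its row that were predicted as positive, the same kind of pattern described above.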

## Linear Regression Models

There will be two linear regression models, one for predicting the value of positive-score comments and another for predicting the value of negative-score comments. Which one is used depends on the outcome of the logistic regression model.

### Positive Scores

The first linear regression model will predict the positive scores. For that, only the rows with a positive score are needed.

In [None]:
pos_score_df = df[df.pn_score == 'positive']

pos_score_df.head()

As with the logistic regression, the comments need to be transformed into a vector of numerical values.

In [None]:
pos_vect = TfidfVectorizer(max_df = 0.95, min_df = 5, binary=True, stop_words='english')
text_features = pos_vect.fit_transform(pos_score_df.body)
print(text_features.shape)

In [None]:
list(pos_vect.vocabulary_)[:10]

Now that the comments are vectorized, the model can be created. To reduce the effect of the wide spread of scores noticed during the analysis, the scores are log scaled.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(text_features, np.log1p(pos_score_df['score']))

pos_linear_regression = SGDRegressor(max_iter=1500)
pos_linear_regression.fit(X_train, y_train)
test = pos_linear_regression.predict(X_test)
mse = mean_squared_error(y_test, test)
rmse = np.sqrt(mse)
print()
print("Positive Score Model MSE:", mse)
print("Positive Score Model RMSE:", rmse)

Based on the RMSE, which is measured on the log-scaled scores, the model seems to perform pretty well.
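
Because the target is `np.log1p(score)`, the reported RMSE is in log units rather than score points. A small sketch of how the same predictions look on both scales, using invented numbers rather than the model's real output:

```python
import numpy as np

# Invented true scores and log-scale predictions, for illustration only.
true_scores = np.array([2, 5, 12, 150])
pred_log = np.array([1.0, 1.5, 2.0, 3.0])  # what a model trained on log1p(score) outputs

# RMSE on the log scale, which is what the cell above reports.
rmse_log = np.sqrt(np.mean((np.log1p(true_scores) - pred_log) ** 2))

# Undo the transform, then measure the error in actual score points.
pred_scores = np.expm1(pred_log)
rmse_raw = np.sqrt(np.mean((true_scores - pred_scores) ** 2))

print(f"log-scale RMSE: {rmse_log:.3f}, score-scale RMSE: {rmse_raw:.3f}")
```

A small log-scale error can still hide a large miss on a high-scoring comment, so it may be worth checking both numbers before judging the fit.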

### Negative Scores

The second linear regression model will predict the negative scores. As with the first model, only the rows with negative scores are needed, and the comments need to be vectorized using those rows.

In [None]:
neg_score_df = df[df.pn_score == 'negative']

neg_score_df.head()

In [None]:
neg_vect = TfidfVectorizer(max_df = 0.95, min_df = 5, binary=True, stop_words='english')
text_features = neg_vect.fit_transform(neg_score_df.body)
print(text_features.shape)

In [None]:
list(neg_vect.vocabulary_)[:10]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(text_features, -np.log1p(-neg_score_df["score"]))

neg_linear_regression = SGDRegressor(max_iter=1500)
neg_linear_regression.fit(X_train, y_train)
test = neg_linear_regression.predict(X_test)
mse = mean_squared_error(y_test, test)
rmse = np.sqrt(mse)
print()
print("Negative Score Model MSE:", mse)
print("Negative Score Model RMSE:", rmse)

The results are similar to the first model.

## Combining Models

First, the logistic regression model is used to predict whether the score is negative or positive. Then, depending on the outcome, the appropriate linear regression model is used to predict the score value.

In [None]:
a = (["You sir a simple idiot. Or a Russian bot. Either way not worth an actual sentence on why I didn't vote for that loon."])

logistic_result = logistic_regression_over.predict(log_vect_over.transform(a))
print('Logistic Result: ')
print(logistic_result)
print()

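# LabelEncoder sorts labels alphabetically: 0 = negative, 1 = one, 2 = positive.
predicted_class = logistic_result[0]

if predicted_class == 2:
    # Output is on the log1p(score) scale the positive model was trained on.
    linear_result = pos_linear_regression.predict(pos_vect.transform(a))
    print('Linear Result: ')
    print(linear_result)
elif predicted_class == 0:
    # Output is on the -log1p(-score) scale the negative model was trained on.
    linear_result = neg_linear_regression.predict(neg_vect.transform(a))
    print('Linear Result: ')
    print(linear_result)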
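
For deployment, the two-stage routing could be wrapped in one function that also undoes the log scaling, so the caller gets an estimated raw score. This is only a sketch: the function name `predict_score`, its argument order, and the `_Constant` stand-in class are assumptions made so the example runs without the fitted models.

```python
import numpy as np

# LabelEncoder sorts labels alphabetically, so the codes are assumed to be:
NEGATIVE, ONE, POSITIVE = 0, 1, 2

def predict_score(comment, clf, clf_vect, pos_reg, pos_vect, neg_reg, neg_vect):
    """Route a comment through the classifier, then the matching regressor.

    Returns an estimated raw score, undoing the log scaling used in training.
    """
    text = [comment]
    label = int(clf.predict(clf_vect.transform(text))[0])
    if label == POSITIVE:
        log_score = float(pos_reg.predict(pos_vect.transform(text))[0])
        return float(np.expm1(log_score))      # inverse of np.log1p(score)
    if label == NEGATIVE:
        log_score = float(neg_reg.predict(neg_vect.transform(text))[0])
        return float(-np.expm1(-log_score))    # inverse of -np.log1p(-score)
    return 1.0                                 # 'one': only the author's own upvote


class _Constant:
    """Hypothetical stand-in for a fitted vectorizer or model; used only to run the sketch."""
    def __init__(self, value):
        self.value = value

    def transform(self, texts):
        return texts

    def predict(self, features):
        return np.array([self.value] * len(features))


vect = _Constant(None)
pos_reg, neg_reg = _Constant(np.log1p(7)), _Constant(-np.log1p(3))
positive_case = predict_score("great point", _Constant(POSITIVE), vect, pos_reg, vect, neg_reg, vect)
negative_case = predict_score("bad take", _Constant(NEGATIVE), vect, pos_reg, vect, neg_reg, vect)
neutral_case = predict_score("ok", _Constant(ONE), vect, pos_reg, vect, neg_reg, vect)
print(positive_case, negative_case, neutral_case)
```

With the real objects, the call would pass `logistic_regression_over`, `log_vect_over`, `pos_linear_regression`, `pos_vect`, `neg_linear_regression`, and `neg_vect` in that order.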

Lastly, we want to pickle our models and vectorizers for deployment.

In [None]:
import pickle

# Map each output file to the fitted object it should hold.
artifacts = {
    'logreg.pkl': logistic_regression_over,
    'poslinreg.pkl': pos_linear_regression,
    'neglinreg.pkl': neg_linear_regression,
    'log_vect.pkl': log_vect_over,
    'pos_vect.pkl': pos_vect,
    'neg_vect.pkl': neg_vect,
}
for filename, obj in artifacts.items():
    with open(filename, 'wb') as f:  # the context manager flushes and closes each file
        pickle.dump(obj, f)
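
On the serving side, the pickles have to be read back before any prediction can be made. A minimal round-trip sketch, using a plain dict as a stand-in for a fitted model and a temporary directory instead of the real file locations (both are assumptions):

```python
import pickle
import tempfile
from pathlib import Path

# A plain dict stands in for a fitted model so the sketch runs without training.
stand_in_model = {"name": "logreg", "classes": ["negative", "one", "positive"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logreg.pkl"

    # Save: the context manager closes the file even if dump() raises.
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load: only unpickle files you created yourself, since pickle can run code.
    with open(path, "rb") as f:
        restored_model = pickle.load(f)

print(restored_model == stand_in_model)
```

For real scikit-learn objects it is safest to load them with the same library version that saved them, since pickles are not guaranteed to work across releases.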

## **Conclusion**

We did a fair job of proving our hypothesis.  We found that due to the large volume of comments that go relatively unseen,
a model isn't going to be able to predict any huge scores.  However, it does a really nice job of pointing you in the right
direction as it can properly determine if the comment will be received positively or negatively, and does give a range,
usually between -10 and 10.  This is enough to let a commenter know what to expect from their comment.

Our analysis could be improved by including the subreddit as a category.  Reddit is a large community with many subcommunities.
These subcommunities often have completely different audiences, so it's important to recognize that what is received
positively by one community may be received very negatively by another.  I'm sure there are also other factors that could
be added to help improve the model, as the dataset is quite extensive and we decided to focus only on what we considered
the strongest predictor.
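
One possible way to add the subreddit, sketched here as an untested idea rather than something we built, is to one-hot encode it and place those columns next to the TF-IDF features. The `subreddit` column name and the toy rows are assumptions:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Invented rows; the 'subreddit' column name is an assumption about the data.
toy = pd.DataFrame({
    "body": ["great goal last night", "great recipe thanks", "terrible goal keeping"],
    "subreddit": ["soccer", "cooking", "soccer"],
})

text_vect = TfidfVectorizer(stop_words="english")
text_features = text_vect.fit_transform(toy["body"])

# handle_unknown='ignore' encodes subreddits unseen during fitting as all zeros.
sub_encoder = OneHotEncoder(handle_unknown="ignore")
sub_features = sub_encoder.fit_transform(toy[["subreddit"]])

# Place the subreddit indicator columns next to the TF-IDF columns.
combined = hstack([text_features, sub_features]).tocsr()
print(combined.shape)
```

The combined matrix can be passed to the same classifiers and regressors, which would let a word carry different weight depending on where it is posted only if interaction terms or per-subreddit models are added on top.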

A lesson learned was to really know the data you are working with.  We struggled for a long time to get varying values,
with most scores hovering around 1.  Suddenly, it hit us: Reddit comments have a score of 1 by default, as the poster automatically
upvotes their own comment.  Therefore, using zero to categorize comments that saw no attention was leaving all the 1 values in,
which led to them dominating the prediction of positive scores.  By instead counting 0 as a negative, as someone would have to
downvote a comment for it to have a 0, and 1 as the neutral value, our model became far more predictive.  It was the
very definition of a lightbulb moment, and if we had seen this earlier it could have given us more time to spend in other areas.

Having the freedom to work on a dataset of our choice is a really great way to end a course, as it gives you an application of what
you spent the term learning.  So often that is missed: you complete a course, never apply the material, and it is lost.
This was a great opportunity to apply our new skills to a domain we found very interesting.