## Reddit submission analysis

In this project, we predict the number of upvotes a subreddit submission received, based on its title. Because upvotes are an indicator of popularity, this shows which types of submissions tend to be the most popular.

The submissions are taken from the hot section of each subreddit, which ranks submissions by submission time and upvotes. The program iterates through multiple subreddits to find the popular submissions. As a sample, we consider two subreddits: python and politics.

The program runs linear regression and ridge regression to predict the number of upvotes for the test set.
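
The comparison between the two models can be sketched on synthetic data. This is a minimal illustration, not the notebook's own pipeline: the random count matrix, the `true_weights` vector, and the `rmse` helper are assumptions made up for the example, and only `alpha=0.1` and the 80/20 split mirror the code further down.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the word-count matrix: 200 "titles", 20 "words".
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 20)).astype(float)
true_weights = rng.normal(0, 10, size=20)          # hypothetical per-word effect
y = X @ true_weights + rng.normal(0, 5, size=200)  # hypothetical upvote counts

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

def rmse(model):
    """Fit the model on the training split and return its test-set RMSE."""
    model.fit(X_train, y_train)
    errors = model.predict(X_test) - y_test
    return float(np.sqrt(np.mean(errors ** 2)))

rmse_by_model = {
    "linear": rmse(LinearRegression()),
    "ridge": rmse(Ridge(alpha=0.1)),
}
print(rmse_by_model)
```

Ridge adds an L2 penalty on the coefficients, so it tends to help most when there are many sparse features relative to the number of rows, which is the situation with word counts from about 100 titles.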

For the regression, each title is tokenized into words and converted to a numerical representation (one count per word). Only words that appear at least 5 times are used as features, to avoid overfitting. Words that appear more than 50 times are dropped, since they are most likely stop words and might skew the result.
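
The tokenizing and frequency filtering can be sketched in a few lines. This is a simplified illustration under stated assumptions: the sample titles are made up, `tokenize` and `select_features` are hypothetical helpers, and `string.punctuation` (ASCII only) stands in for the hand-written punctuation list used in the notebook. The bounds are inclusive, as in the notebook's filter.

```python
from collections import Counter
import string

def tokenize(title):
    """Lowercase a title, strip ASCII punctuation, and split on whitespace."""
    cleaned = title.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def select_features(titles, lower_limit, upper_limit):
    """Return the words whose total count lies within [lower_limit, upper_limit]."""
    counts = Counter(token for title in titles for token in tokenize(title))
    return sorted(
        word for word, count in counts.items()
        if lower_limit <= count <= upper_limit
    )

# Hypothetical titles, for illustration only.
titles = [
    "The Python bot and the parser",
    "Python tips for the bot",
    "The best Python tips",
]

# "the" occurs 4 times (too frequent); "and", "parser", "for", "best" occur once (too rare).
features = select_features(titles, lower_limit=2, upper_limit=3)
print(features)  # ['bot', 'python', 'tips']
```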

In the end, we should be able to see the top posts (ranked by predicted upvotes) in each subreddit.

The program can be improved by:
1) Finding proper upper and lower cut-off points for selecting the features.
2) Increasing the dataset size.
3) Adding features such as headline length and word length.
4) Using other algorithms, such as random forest.
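
Improvements 3 and 4 could be explored along the following lines. This is only a sketch: the titles and upvote counts are invented, `add_length_features` is a hypothetical helper, and the `RandomForestRegressor` settings are arbitrary rather than tuned.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical titles and upvote counts, for illustration only.
submissions = pd.DataFrame({
    "title": [
        "How to parse JSON in Python",
        "Need help with lists",
        "A short guide to decorators and context managers",
        "Why is my loop slow",
        "Announcing a new release of my plotting library",
        "Regex question",
        "Ten tips for writing cleaner functions",
        "Async basics explained",
    ],
    "ups": [120, 5, 300, 8, 150, 2, 210, 40],
})

def add_length_features(frame):
    """Build length-based features from the title column."""
    features = pd.DataFrame(index=frame.index)
    features["title_chars"] = frame["title"].str.len()
    features["title_words"] = frame["title"].str.split().str.len()
    features["mean_word_len"] = frame["title"].apply(
        lambda title: np.mean([len(word) for word in title.split()])
    )
    return features

X = add_length_features(submissions)

model = RandomForestRegressor(n_estimators=50, random_state=1)
model.fit(X, submissions["ups"])

# Relative importance the forest assigned to each length feature.
importances = dict(zip(X.columns, model.feature_importances_))
print(importances)
```

In practice these columns would be joined to the word-count matrix, so the model sees both the words and the length of each title.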




In [23]:
client_id = 'xxxxxA'
secret = 'xxxxx'

#-----------------Import Modules-----------------#
import praw
import pandas as pd
from datetime import datetime
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt

#-----------------Define variables-----------------#
subreddits = ['python', 'politics']
data_list = {}  # keyed by submission id; shared across subreddits (never reset inside the loop)
rows = 100
word_count_lower_limit = 5
word_count_upper_limit = 50
subreddit_top_5 = []
num_top_posts = 5



#-----------------connect to reddit api using Praw wrapper-----------------#
reddit = praw.Reddit(client_id= client_id, client_secret = secret, username = '',
                     password  ='PuzzledTarget', user_agent ='Incubator_project')

#-----------------extract data from the subreddit -----------------#
for one_subreddit in subreddits:
    subreddit = reddit.subreddit(one_subreddit)
    
    tokenized_title = []
    unique_tokens = []
    single_token =[]

    # extract from the hot tab (time and votes both count toward the hot ranking)
    hot_submissions = subreddit.hot(limit=rows)


    for submission in hot_submissions:
        if not submission.stickied:
            if submission.id not in data_list :
                created_time = (datetime.utcfromtimestamp(submission.created_utc).strftime('%Y-%m-%d %H:%M:%S'))
                data_list[submission.id] =[submission.title, submission.ups, submission.downs, created_time, submission.upvote_ratio]
            else:
                raise ValueError('duplicate id found')
                
    #-----------------Create a pandas DataFrame out of the dictionary -----------------#
    columns = ['title' , 'ups' , 'downs', 'create_time' , 'upvote_ratio' ]
    reddit_submission = pd.DataFrame.from_dict(data_list, orient = 'index' , columns = columns)
    print('Analysis for subreddit:' + one_subreddit)
    
    if reddit_submission[reddit_submission.isnull().any(axis=1)].shape[0] > 0:
        raise ValueError('null values observed')
    
    
    # convert each title into a numerical representation


    for item in reddit_submission['title']:
        tokenized_title.append(item.split(" "))
        
    # lowercase all the tokens and remove punctuation
    punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")", "|", ">", "<", "[", "]"]

    clean_tokenized = []

    for item in tokenized_title:
        tokens = []
        for token in item:
            token = token.lower()
            for punc in punctuation:
                token = token.replace(punc, "")
            if token != "":
                tokens.append(token)
        clean_tokenized.append(tokens)
        
    # Find the tokens that occur more than once across all titles and store them in unique_tokens.
    # single_token records every token seen so far; a token seen a second time is added to unique_tokens.
    for item in clean_tokenized:
        for element in item:
            if element not in single_token:
                single_token.append(element)
            elif element not in unique_tokens:
                unique_tokens.append(element)


    # initialising DataFrame to hold the numeric values for each token 
    counts =  pd.DataFrame(0, index = np.arange(len(clean_tokenized)) , columns = unique_tokens)
    
    
    # for each title (row i), increment the count of every token that is a feature column
    for i, item in enumerate(clean_tokenized):
        for element in item:
            if element in unique_tokens:
                # .loc writes to the DataFrame itself; chained indexing may update a copy
                counts.loc[i, element] += 1


    # Words that occur too few times lead to overfitting: such features will probably
    # correlate differently with upvotes in the training set and the test set.
    # Words that occur too many times are also a problem (stop words such as 'and', 'or'),
    # because they add no information to the model.
    # After looking at the word-count distribution, we reduce the feature set by removing
    # words that occur fewer than 5 (word_count_lower_limit) times or more than
    # 50 (word_count_upper_limit) times.
    
    word_counts = counts.sum(axis =0)
    counts = counts.loc[:,(word_counts >= word_count_lower_limit) & (word_counts <= word_count_upper_limit)]
    counts.shape
    
    # Now we split the data into two sets, train and test, to evaluate the algorithm effectively.
    # We select 20% of the rows for testing and 80% for training.
    
    #-------------Linear Regression----------------------------------------------------
    
    # We use the linear regression algorithm.
    
    X_train, X_test, y_train, y_test = train_test_split(counts, reddit_submission["ups"], test_size=0.2, random_state=1)

    lr = LinearRegression()
    lr.fit(X_train, y_train)
    predictions = lr.predict(X_test)
    
    # calculate the MSE (mean squared error) of the predictions and its square root (RMSE)
    mse_lr = ((predictions - y_test)**2).sum()/len(predictions)
    mse_lr_std_error = mse_lr**(1/2)
    
    # mean and standard deviation of up_votes in the original dataset
    ups_mean = reddit_submission["ups"].mean()
    ups_std_dev = reddit_submission["ups"].std()


    print (' The mean of up_votes is ' + str(ups_mean) + ' and the std dev is ' + str(ups_std_dev)
           + ' The average predicted upvotes using Linear regression is ' + str(mse_lr_std_error) 
           + 'away from real value')
        
    #-------------Ridge Regression----------------------------------------------------
    # We will use ridge regression to predict the up_votes 
    
    train_rows = int(counts.shape[0]* .8)
    # Set a seed to get the same "random" shuffle every time.
    random.seed(1)

    # Shuffle the indices for the matrix.
    indices = list(range(counts.shape[0]))
    random.shuffle(indices)

    
    # Create train and test sets.
    X_train_ridge = counts.loc[indices[:train_rows], :]
    X_test_ridge = counts.loc[indices[train_rows:], :]
    y_train_ridge = reddit_submission["ups"].iloc[indices[:train_rows]]
    y_test_ridge = reddit_submission["ups"].iloc[indices[train_rows:]]
    X_train_ridge = np.nan_to_num(X_train_ridge)

    # Run the regression and generate predictions for the test set.
    reg = Ridge(alpha=.1)
    reg.fit(X_train_ridge, y_train_ridge)
    predictions_ridge = reg.predict(X_test_ridge)
    
    mse_ridge = ((predictions_ridge - y_test_ridge)**2).sum()/len(y_test_ridge)
    mse_ridge_std_error = mse_ridge**(1/2)
    

    print (' The mean of up_votes is ' + str(ups_mean) + ' and the std dev is ' + str(ups_std_dev)
           + ' The average predicted upvotes using ridge regression is ' + str(mse_ridge_std_error) 
           + 'away from real value')
                
                
    # select the model with the lower RMSE and use its predictions to rank the posts
    
    if mse_ridge_std_error < mse_lr_std_error:
        
        predictions_ridge = pd.DataFrame(data = predictions_ridge , index = X_test_ridge.index )
        reddit_submission.index = counts.index
        reddit_submission.loc[X_test_ridge.index,:]
        reddit_predictions = pd.merge(reddit_submission, predictions_ridge, left_index = True, right_index = True)
       
    else:
        predictions = pd.DataFrame(data = predictions , index = X_test.index )
        reddit_submission.index = counts.index
        reddit_submission.loc[X_test.index,:]
        reddit_predictions = pd.merge(reddit_submission, predictions, left_index = True, right_index = True)
    
    reddit_predictions['predicted_ups'] = reddit_predictions[0] 
        
    # top 5 posts that have maximum predicted up_votes
    top_5 = reddit_predictions.sort_values('predicted_ups' , ascending = False)['title'].head(num_top_posts)
    subreddit_top_5.append((one_subreddit, top_5))
     

                


Analysis for subreddit:python
 The mean of up_votes is 64.7171717172 and the std dev is 289.271641295 The average predicted upvotes using Linear regression is 227.914981362away from real value
 The mean of up_votes is 64.7171717172 and the std dev is 289.271641295 The average predicted upvotes using ridge regression is 349.781565327away from real value
Analysis for subreddit:politics
 The mean of up_votes is 2070.17085427 and the std dev is 5269.87849331 The average predicted upvotes using Linear regression is 8706.18086051away from real value
 The mean of up_votes is 2070.17085427 and the std dev is 5269.87849331 The average predicted upvotes using ridge regression is 6341.22223522away from real value
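
One way to read these numbers is to compare each RMSE with a baseline that always predicts the mean number of upvotes; the RMSE of that baseline is roughly the standard deviation printed above. By that measure, only linear regression on the python subreddit (about 228 against a standard deviation of about 289) beats the baseline in this run. The sketch below shows the comparison on made-up numbers; `rmse` and `baseline_rmse` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def rmse(predictions, actual):
    """Root mean squared error between two equal-length sequences."""
    predictions = np.asarray(predictions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predictions - actual) ** 2)))

def baseline_rmse(y_train, y_test):
    """RMSE of a model that always predicts the training-set mean."""
    constant_prediction = np.full(len(y_test), np.mean(y_train))
    return rmse(constant_prediction, y_test)

# Hypothetical upvote counts, for illustration only.
y_train = [10, 20, 30, 40]
y_test = [20, 30]

print(baseline_rmse(y_train, y_test))  # 5.0
```

A model is only useful for ranking posts if its test RMSE is clearly below this baseline.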


In [25]:
# subreddits with their top titles
subreddit_top_5

[('python', 31    Download Instagram Photos and Videos Based on ...
  78    How to distribute Python Slack bot running on ...
  92        Need help with PDF split based on the content
  77    Control a web page that is signed in to your a...
  54                How to average across many wav files?
  Name: title, dtype: object),
 ('politics', 99     Trump launches unprovoked attack on beloved bl...
  184    Baltimore stands up for its city after Trump t...
  145    U.S. Senator Patty Murray supports Trump impea...
  195    Steph Curry Forcefully Responds to Donald Trum...
  189    'His feed is the most hate-filled, racist, and...
  Name: title, dtype: object)]