# NLP modeling on baseball subreddits

# Problem statements

Our client, Major League Baseball (MLB), wants to improve its fan outreach. By getting fans excited about the game, MLB hopes to increase attendance and generate other types of revenue. The league wants to know whether the same type of outreach will work for different types of fans. We will compare more analytically inclined fans to fans as a whole by building a model that predicts whether a post comes from the mlb subreddit or the sabermetrics subreddit. The mlb subreddit is for all MLB fans; the sabermetrics subreddit is specifically for fans interested in a more analytical outlook. If the model can tell the two apart, the groups write about the game differently, which suggests that outreach should be tailored to each group. MLB can also use the model to study those differences. To be useful, the model needs to beat the baseline accuracy on new data.

To create the model we will use AdaBoost, random forests, logistic regression, and naive Bayes. Each has advantages. Logistic regression and naive Bayes need relatively little computing power and tend not to overfit the data. Random forests and AdaBoost tend to perform well and correct the overfitting that single decision trees are prone to. The models will be evaluated using accuracy, since neither type of error is better or worse than the other in this case. We will also look at other metrics to get a full picture of how each model is doing.
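
As a rough illustration of how accuracy relates to the other metrics we will look at, the sketch below computes accuracy, precision, and recall from a confusion count. The labels and predictions are made-up toy values, not model output, and class 1 is assumed to be the positive class.

```python
# Toy labels and predictions (hypothetical, for illustration only).
# 0 = mlb subreddit, 1 = sabermetrics subreddit (treated as the positive class).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 1, 0, 1, 0]

# Count the four cells of the confusion matrix.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)   # share of all posts classified correctly
precision = tp / (tp + fp)          # share of predicted class-1 posts that really are class 1
recall = tp / (tp + fn)             # share of true class-1 posts that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Accuracy treats both kinds of mistake the same, which matches our problem. Precision and recall show whether the errors lean toward one subreddit.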

In addition, for each model we will try both CountVectorizer and TfidfVectorizer to see which performs better.
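
To give a feel for why the two vectorizers can behave differently, here is a hand-rolled sketch in plain Python. It is not scikit-learn's code. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is our understanding of TfidfVectorizer's defaults, and the two posts are invented for the example.

```python
import math
from collections import Counter

# Hypothetical toy posts, not taken from the scraped data.
posts = [
    "the game was great",
    "the woba model beats the ops model",
]

tokenized = [post.split() for post in posts]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def count_vector(tokens):
    """Raw term counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def tfidf_vector(tokens):
    """Counts reweighted by smoothed idf, then L2-normalized (assumed formula)."""
    n_docs = len(tokenized)
    counts = Counter(tokens)
    weights = []
    for word in vocabulary:
        doc_freq = sum(word in doc for doc in tokenized)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        weights.append(counts[word] * idf)
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

counts_doc2 = dict(zip(vocabulary, count_vector(tokenized[1])))
tfidf_doc2 = dict(zip(vocabulary, tfidf_vector(tokenized[1])))

# "the" and "model" both occur twice in the second post,
# but only "the" also occurs in the first post.
print(counts_doc2["the"], counts_doc2["model"])
print(round(tfidf_doc2["the"], 3), round(tfidf_doc2["model"], 3))
```

In the second post, "the" and "model" have the same raw count, but the tf-idf weighting scores "model" higher because it appears in fewer posts. That is the kind of difference we are testing for when we compare the two vectorizers.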

# Data Scraping

Data will be scraped from each subreddit. We are interested in the posts themselves rather than the comments, since comments sometimes go off topic. The PushshiftAPI class from the pmaw library, a multithreaded wrapper for the Pushshift API, lets us read in the data faster.

At first we scraped the same number of posts from both subreddits. However, the mlb subreddit has more posts that were removed or deleted. To get more balanced classes we need to pull in more posts from the mlb subreddit initially, so that a similar number of posts from each subreddit remains after cleaning. There are around 2250 posts in the sabermetrics subreddit. By pulling 4500 posts from the mlb subreddit we achieve a good balance.

In [243]:
import pandas as pd # used below for DataFrame, read_csv and concat
from pmaw import PushshiftAPI

In [284]:
def data_scrapper(subreddit): # received guidance on this function from Tanveer Kahn and David Coons
    api = PushshiftAPI()
    posts = list(api.search_submissions(subreddit=subreddit, limit=4500)) # a limit of 4500 led to fairly balanced data after preprocessing
    df = pd.DataFrame(posts)
    df = df[['selftext', 'subreddit']] # keep only the columns used in the models
    df.to_csv(f'{subreddit}.csv', index=False) # write one CSV per subreddit

In [285]:
data_scrapper('sabermetrics') # there are only about 2250 posts in this subreddit

Not all PushShift shards are active. Query results may be incomplete.

In [290]:
data_scrapper('mlb')

Not all PushShift shards are active. Query results may be incomplete.


In [289]:
df1 = pd.read_csv('sabermetrics.csv') 

In [291]:
df2 = pd.read_csv('mlb.csv')

In [293]:
df = pd.concat([df1, df2]) # merge into one dataframe

# Data Cleaning

Now that we have downloaded the data, we have to clean it so it can be used in the modeling process.

In [294]:
df.info() #check the new data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6750 entries, 0 to 4499
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   selftext   3585 non-null   object
 1   subreddit  6750 non-null   object
dtypes: object(2)
memory usage: 158.2+ KB


There are over 6000 posts in the data, but only 3585 of them have non-null text. The subreddit column needs to be changed to a binary variable: 0 for the mlb subreddit and 1 for the sabermetrics subreddit.

In [296]:
df['subreddit'] = df['subreddit'].map(lambda x: 0 if x == 'mlb' else 1) #from nlp lesson 2 code

Now the target variable is binary.

In [297]:
df.dropna(subset = ['selftext'], inplace = True) # remove nas

In [298]:
df = df[df.selftext != '[deleted]'] #remove deleted posts

In [299]:
df = df[df.selftext != '[removed]'] # remove removed posts

In [300]:
df = df[df.selftext != ''] #remove blank posts

In [301]:
df.info() # check the data now

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3091 entries, 1 to 4499
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   selftext   3091 non-null   object
 1   subreddit  3091 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 72.4+ KB


There are still over 3000 posts in our dataset. That should be enough to answer our question about whether we can predict which subreddit a post belongs to.
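
The separate filtering steps above could also be folded into a single boolean mask. The sketch below shows one way to do that on a small made-up frame. The helper name `drop_unusable_posts` and the `PLACEHOLDERS` list are our own labels for illustration and are not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy frame standing in for the scraped posts.
toy = pd.DataFrame({
    "selftext": ["Who wins the AL East?", None, "[deleted]", "[removed]", "", "wOBA vs. OPS?"],
    "subreddit": [0, 0, 0, 0, 1, 1],
})

# Values that mark a post as having no usable text.
PLACEHOLDERS = ["[deleted]", "[removed]", ""]

def drop_unusable_posts(frame):
    """Return only the rows whose selftext holds real post text."""
    has_text = frame["selftext"].notna()
    is_placeholder = frame["selftext"].isin(PLACEHOLDERS)
    return frame[has_text & ~is_placeholder]

cleaned = drop_unusable_posts(toy)
print(cleaned)
```

Keeping the placeholder values in one list makes it easy to add another marker later without writing a new filter line.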

In [302]:
df['subreddit'].value_counts()

0    1568
1    1523
Name: subreddit, dtype: int64

The value counts are very similar now. We will not have a serious issue with unbalanced classes. There are slightly more posts from the mlb subreddit so the baseline will be above 50%.
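
The baseline is the accuracy we would get by always predicting the larger class. A minimal sketch of that calculation, using the class counts printed above:

```python
# Class counts from the value_counts output (0 = mlb, 1 = sabermetrics).
class_counts = {0: 1568, 1: 1523}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(majority_class, f"{baseline_accuracy:.4f}")  # prints: 0 0.5073
```

Any model we keep should score above this figure on data it has not seen.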

In [303]:
df.to_csv('data.csv',index = False)

Further analysis and modeling are in the project 2 notebook.