## Predicting the forum of origin for text data from Reddit.com using latent semantic analysis and a k-nearest-neighbors classifier

***Table of Contents***


$\S$**1**: Data acquisition

$\S$**2**: Data preprocessing

$\S$**3**: Model fitting

$\S$**4**: Model assessment


---

$\S1$

Using ```praw``` (the Python Reddit API Wrapper), we acquire text data from reddit.com.

Then, for each subreddit in ```forums```, we keep the posts that satisfy a chosen condition*.

*In this case, the post body contains at least 100 alphabetic characters.

In [1]:
import praw
import re


# Instantiate reddit object
reddit = praw.Reddit(client_id="id",client_secret='secret',user_agent = 'Reddit Scraper')

# List of subreddits to serve as our data labels 
forums = [
    'astrology',
    'datascience'] 

# Count alphabetic characters by stripping non-word characters, digits, and underscores
char_count = lambda post: len(re.sub(r'[\W\d_]', '', post.selftext))

# Condition for filtering the posts
condition = lambda post: char_count(post) >= 100

# Instantiate lists for data/labels and add data from each forum
data,labels = [],[]
for i, forum in enumerate(forums):
    # Get latest posts from the subreddit
    subreddit_data = reddit.subreddit(forum).new(limit=200)
    # Filter out posts not satisfying condition
    posts = [post.selftext for post in filter(condition,subreddit_data)]
    # Add posts and labels to respective lists
    data += posts
    labels += [i]*len(posts)
    print(f"Number of posts from {forum}: {len(posts)}")

print("Example Post:\n",data[1][:100])

Number of posts from astrology: 106
Number of posts from datascience: 179
Example Post:
 Back in December when Saturn went into Capricorn I experienced major depression, major losses, and j
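
The filtering condition can be tried without Reddit credentials. The following is a minimal sketch on plain strings; the helper names `count_letters` and `long_enough` are hypothetical, and it assumes that "alphabetic" means ASCII letters only, which is slightly stricter than the regular expression used in the cell above.

```python
import re


def count_letters(text: str) -> int:
    """Count ASCII letters by deleting every character that is not a letter."""
    return len(re.sub(r"[^A-Za-z]", "", text))


def long_enough(text: str, minimum: int = 100) -> bool:
    """Return True when the text has at least `minimum` letters."""
    return count_letters(text) >= minimum


# Hypothetical stand-ins for post bodies
samples = [
    "Too short!!! 12345",           # 8 letters
    "a" * 99 + " 2024",             # 99 letters, digits are ignored
    "b" * 100 + " https://x.y",     # more than 100 letters
]
kept = [s for s in samples if long_enough(s)]
print(f"Kept {len(kept)} of {len(samples)} samples")
```

Only the last sample passes the threshold, since digits, spaces, and punctuation do not count toward the total.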


$\S2$ 

- Remove symbols, numbers, and URL-like strings with a custom preprocessor
- Vectorize text using term frequency-inverse document frequency (tf-idf)
- Reduce the tf-idf matrix to a smaller number of latent components using truncated singular value decomposition
- Partition data and labels into training/validation sets


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split

# Tune the following parameters for optimal performance
MIN_DOC_FREQ = 2 # minimum number of documents a term must appear in to be kept
N_COMPONENTS = 1000 # number of latent components kept by the SVD
TEST_SIZE = .1 # fraction of total data partitioned for validation

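# Function to lowercase text and replace URL-like strings and non-alphabetic characters with spaces
# (a custom preprocessor replaces TfidfVectorizer's built-in lowercasing, so we lowercase here)
preprocessor = lambda doc: re.sub(r'http\S+|www\S+|[\W\d_]', ' ', doc.lower())
# tf-idf vectorizer with custom preprocessing function
vectorizer = TfidfVectorizer(preprocessor=preprocessor, stop_words='english', min_df=MIN_DOC_FREQ)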
# SVD object to combine with vectorizer for latent semantic analysis
decomposition = TruncatedSVD(n_components=N_COMPONENTS, n_iter=10)

# Partition training/validation sets
X_train,X_test,y_train,y_test = train_test_split(
    data,labels,test_size=TEST_SIZE,random_state=0)

print(f"Selected {len(y_test)} samples for model testing")

Selected 29 samples for model testing
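
To make the vectorize-then-reduce step concrete, here is a minimal sketch of latent semantic analysis on a made-up four-document corpus. The documents and the names `toy_docs`, `toy_vectorizer`, and `toy_svd` are illustrative assumptions, and `n_components=2` is chosen only because the toy vocabulary is tiny.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: two astrology-like and two data-science-like documents
toy_docs = [
    "saturn enters capricorn and the moon is in aries",
    "my horoscope says mercury is in retrograde",
    "training a regression model on tabular data",
    "cross validation helps estimate model accuracy",
]

# Step 1: tf-idf turns each document into a sparse row of term weights
toy_vectorizer = TfidfVectorizer(stop_words="english")
tfidf = toy_vectorizer.fit_transform(toy_docs)

# Step 2: truncated SVD projects those rows onto a few latent components
toy_svd = TruncatedSVD(n_components=2, random_state=0)
reduced = toy_svd.fit_transform(tfidf)

print("tf-idf shape:", tfidf.shape)     # (documents, vocabulary size)
print("reduced shape:", reduced.shape)  # (documents, n_components)
```

The number of components cannot exceed the vocabulary size, so a value such as `N_COMPONENTS = 1000` presumes that the fitted vocabulary has at least that many terms.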


$\S3$

Create a k-nearest-neighbors classifier with the specified ```N_NEIGHBORS```.

In [3]:
from sklearn.neighbors import KNeighborsClassifier

# Number of neighbors used for comparison when predicting on unseen data
N_NEIGHBORS = 4

# create classification model using k nearest neighbors
model = KNeighborsClassifier(n_neighbors=N_NEIGHBORS)
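
As an illustration of how the classifier votes, the sketch below fits a k-nearest-neighbors model on six hand-made 2-D points. The points, labels, and the name `toy_knn` are assumptions for demonstration and are not part of the Reddit data.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters: class 0 near the origin, class 1 near (1, 1)
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
point_labels = [0, 0, 0, 1, 1, 1]

# Each prediction is a majority vote among the 3 closest training points
toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(points, point_labels)

predictions = toy_knn.predict([[0.05, 0.05], [1.0, 0.95]])
print(predictions)
```

With two classes, an even number of neighbors can produce tied votes, so an odd value is a common choice when tuning `N_NEIGHBORS`.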


$\S4$

Send the training set through the pipeline and evaluate model performance on the validation set.

In [4]:
from sklearn.pipeline import Pipeline

# Establish pipeline
pipe = Pipeline([
    ('vectorizer',vectorizer),
    ('svd',decomposition),
    ('model',model)
])

# Send training data/labels through pipeline
pipe.fit(X_train,y_train)
# Predict on test data and get accuracy score
score = pipe.score(X_test,y_test)
print("Accuracy: {0:.2f} %".format(score*100))

Accuracy: 89.66 %
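
A single accuracy figure from 29 validation samples can vary noticeably with the random split. One possible complement, sketched here on a made-up corpus rather than the Reddit data, is k-fold cross-validation of the whole pipeline with `cross_val_score`. The texts, labels, and the settings `n_components=2`, `n_neighbors=1`, and `cv=2` are assumptions sized for the toy example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical corpus: label 0 = astrology-like, label 1 = data-science-like
toy_texts = [
    "saturn enters capricorn this winter",
    "the moon is in aries tonight",
    "mercury retrograde affects my horoscope",
    "venus and saturn align in capricorn",
    "training a regression model on data",
    "cross validation estimates model accuracy",
    "feature engineering improves the model",
    "the data pipeline feeds the regression model",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ("model", KNeighborsClassifier(n_neighbors=1)),
])

# The pipeline is refit from scratch on each training fold, so the
# vectorizer and SVD never see the documents held out for scoring
fold_scores = cross_val_score(toy_pipe, toy_texts, toy_labels, cv=2)
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy: {0:.2f} %".format(fold_scores.mean() * 100))
```

On the real data, `pipe`, `data`, and `labels` from the cells above could be passed to `cross_val_score` in the same way, which would report one accuracy per fold instead of a single number.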
