## Fundamentals of Social Data Science
# Week 4 Day 1 Lab. Classification 

In this lab, you will explore your subreddits of choice using Multinomial Naive Bayes classification and k-means clustering. Determine which one is more suitable by comparing accuracy scores. Use both the `TfidfVectorizer` and the `CountVectorizer`.

Consider the use of stop words and lemmatisation. 

1. Plot the documents using t-SNE and then color the documents according to the most accurate solution. 
2. For Naive Bayes report the 5 most informative terms per solution.  
* Would you be able to report the 5 most informative terms with k-means? This goes a bit beyond this lecture, but if you are adventurous you can explore approaches like k-nearest neighbors using the centroids (that is, report the 5 nearest neighbors to the centroid for each of the k solutions). 

There is only limited example code for this exercise. It is up to you to stitch together what you have learned as well as potentially draw upon external sources. On Wednesday we will provide an example solution.

Some guidance: 
1. Transform your headlines into a list similar to the walkthrough: [("headline (and maybe selftext)", "subreddit_label"), ("next headline", "next subreddit_label")]
 * Create one long list for all three subreddits to send to the Vectorizer. This is different to what I showed in Week 3 Day 3 where we had a separate vectorizer for each subreddit. To help you out I've started some code that creates a DataFrame for all the subs. 
2. Consider your tokenization. Will you use stop words or not? 
3. Consider plotting the classification on t-SNE to get some intuitions for how the solution maps out visually. 
4. Remember, are you classifying the documents using the terms? Or classifying the terms using the documents? Be careful with how you set this up. Notice that in the examples in the walkthrough we were classifying the documents using the terms. 
5. Consider the structure of this repository. Will you want to place some code for a plotting function in `analysis.py`? What about creating a function in `text_processor.py` to transform the Reddit data into the data structure needed? You can do everything in this Jupyter lab notebook, but you should use this opportunity to think about how you might make use of this structure to help keep your code tidy. 
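The data structure from point 1 could be built with a small helper along these lines. This is a sketch under assumptions: `posts_to_labelled_docs` is a hypothetical name (it does not exist in the repository), and it assumes the DataFrame has `title`, `selftext` and `subreddit` columns as in the output shown further down. The toy DataFrame is made up so the example runs on its own.

```python
import pandas as pd


def posts_to_labelled_docs(df, include_selftext=True):
    """Return [(text, subreddit_label), ...] from a posts DataFrame.

    Hypothetical helper: assumes 'title', 'selftext' and 'subreddit' columns.
    """
    text = df["title"].fillna("")
    if include_selftext:
        # Missing selftext (e.g. link posts) becomes an empty string.
        text = text + " " + df["selftext"].fillna("")
    return list(zip(text.str.strip(), df["subreddit"]))


# Made-up rows standing in for posts_df.
toy_df = pd.DataFrame({
    "title": ["AITA for leaving early?", "TIFU by oversleeping"],
    "selftext": ["Long story short...", None],
    "subreddit": ["AmItheAsshole", "tifu"],
})

labelled_docs = posts_to_labelled_docs(toy_df)
texts, labels = zip(*labelled_docs)  # split back out for the vectorizer
print(labelled_docs)
```

A function like this is a natural candidate for `text_processor.py`, leaving the notebook to call it on `posts_df`.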


In [1]:
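import os
import pickle

import pandas as pd  # used below for pd.concat; imported explicitly rather than relying on the star import

from models.reddit_scraper import RedditScraper
from config.settings import USER_AGENT
from utils.analysis import *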

scraper = RedditScraper(USER_AGENT)
subs_of_interest = ['AmItheAsshole', 'confessions', 'tifu']

posts_list = []

for sub in subs_of_interest:    
    posts = scraper.get_subreddit_posts(sub, limit=100, cache=True)
    df = create_posts_dataframe(posts)
    df['subreddit'] = sub
    posts_list.append(df)

posts_df = pd.concat(posts_list)
posts_df = posts_df.reset_index(drop=True)


In [None]:
# Naive Bayes Classifier

posts_df

Unnamed: 0,title,selftext,url,domain,time,author,subreddit
0,AITA for being mad my friend didn’t pay me back?,So I’ve paid for two things in the past year a...,https://www.reddit.com/r/AmItheAsshole/comment...,self.AmItheAsshole,2024-11-04 17:55:17,VirgoEsti,AmItheAsshole
1,AITA for not responding to the father of my un...,So to give context I dated this guy for a coup...,https://www.reddit.com/r/AmItheAsshole/comment...,self.AmItheAsshole,2024-11-04 17:51:35,True_Traffic_9928,AmItheAsshole
2,AITA - Child Care and Individual Careers.,I was wondering if I’ve been the asshole to my...,https://www.reddit.com/r/AmItheAsshole/comment...,self.AmItheAsshole,2024-11-04 17:48:37,movie2019,AmItheAsshole
3,WIBTA if I told my friend she can’t be friends...,I (20F) have been friends with Beth (20F) for ...,https://www.reddit.com/r/AmItheAsshole/comment...,self.AmItheAsshole,2024-11-04 17:42:50,SierraWe,AmItheAsshole
4,AITA for screaming at my mom to die in a car c...,My mom was close with both me (34f) and my sis...,https://www.reddit.com/r/AmItheAsshole/comment...,self.AmItheAsshole,2024-11-04 17:42:12,Katthevamp,AmItheAsshole
...,...,...,...,...,...,...,...
295,TIFU by telling my friend congrats on her daug...,"Obligatory ""Didn't happen today.""\n\nWhen I wa...",https://www.reddit.com/r/tifu/comments/1gdrfgv...,self.tifu,2024-10-28 02:06:24,10Kfireants,tifu
296,TIFU by thinking something was actually wrong ...,So kind of embarrassing but I felt it was nece...,https://www.reddit.com/r/tifu/comments/1gdp9fh...,self.tifu,2024-10-28 00:14:14,Gumbinator10,tifu
297,TIFU by trying to make my boyfriend choose bet...,"Hey, do last time I told everybody about how I...",https://www.reddit.com/r/tifu/comments/1gdommr...,self.tifu,2024-10-27 23:43:41,ThisAntiMatter,tifu
298,TIFU passing gas on the dance floor,I’m a mid 30s Female. last night I went out fo...,https://www.reddit.com/r/tifu/comments/1gdmvza...,self.tifu,2024-10-27 22:20:36,queerharveybabe,tifu
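A minimal sketch of how the Naive Bayes comparison might be set up is given below. It is not the example solution: the twelve headlines are made up, and the split size and `random_state` are arbitrary choices. It fits one pipeline per vectorizer on a training split and scores each on the held-out documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up headlines, four per subreddit; replace with your own texts and labels.
texts = [
    "AITA for refusing to lend my friend money",
    "AITA for skipping my sister's wedding dinner",
    "AITA for telling my roommate to clean up",
    "WIBTA if I cancelled on my friend again",
    "Confession: I copied homework during school",
    "Confession: I lied about my broken phone",
    "Confession: I never read the books I recommend",
    "I have to confess I ate my roommate's leftovers",
    "TIFU by locking my keys inside the car",
    "TIFU by spilling coffee on my laptop",
    "TIFU by replying all to the office email",
    "TIFU by forgetting my own wedding anniversary",
]
labels = ["AmItheAsshole"] * 4 + ["confessions"] * 4 + ["tifu"] * 4

# Hold out a stratified test set so every subreddit appears in it.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

vectorizers = {
    "count": CountVectorizer(stop_words="english"),
    "tfidf": TfidfVectorizer(stop_words="english"),
}

scores = {}
for name, vectorizer in vectorizers.items():
    # The pipeline fits the vectorizer on the training texts only.
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

With so few documents the scores mean little; the point is the shape of the workflow, which should carry over to the 300 posts in `posts_df`.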


Summarise NBC results: 



In [None]:
# K-means Classifier 
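
Because k-means never sees the subreddit labels, its cluster numbers cannot be compared with them directly. One possible workaround, sketched here on made-up headlines, is to give each cluster the majority label of its members and then compute accuracy. This score is optimistic, since the mapping uses the same documents it is scored on. The sketch also computes 2-D t-SNE coordinates for plotting; the `perplexity` value is small only because the toy corpus is tiny.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score

# Made-up headlines, four per subreddit.
texts = [
    "AITA for refusing to lend my friend money",
    "AITA for skipping my sister's wedding dinner",
    "AITA for telling my roommate to clean up",
    "AITA for cancelling on my friend again",
    "Confession: I copied homework during school",
    "Confession: I lied about my broken phone",
    "Confession: I never read the books I recommend",
    "Confession: I ate my roommate's leftovers",
    "TIFU by locking my keys inside the car",
    "TIFU by spilling coffee on my laptop",
    "TIFU by replying all to the office email",
    "TIFU by forgetting my own wedding anniversary",
]
labels = ["AmItheAsshole"] * 4 + ["confessions"] * 4 + ["tifu"] * 4

X = TfidfVectorizer(stop_words="english").fit_transform(texts)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)

# Map each cluster to the most common true label among its members.
cluster_to_label = {}
for c in np.unique(cluster_ids):
    members = [lab for lab, cid in zip(labels, cluster_ids) if cid == c]
    cluster_to_label[c] = Counter(members).most_common(1)[0][0]

predicted = [cluster_to_label[c] for c in cluster_ids]
kmeans_accuracy = accuracy_score(labels, predicted)
print(kmeans_accuracy)

# 2-D coordinates for plotting; dense input with random initialisation.
tsne = TSNE(n_components=2, perplexity=3, init="random", random_state=0)
coords = tsne.fit_transform(X.toarray())
print(coords.shape)
```

The columns `coords[:, 0]` and `coords[:, 1]` can then be passed to a scatter plot, coloured by either `predicted` or the true `labels`.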

Summarise k-means results: 