# Let's test out zero-shot classification

This notebook is the next step after scraping posts and comments from the subreddit r/AITA and analysing the collected data: <br />
https://github.com/Nico404/scrap_reddit <br />
https://github.com/Nico404/AITA_data_exploration_and_ML/blob/master/AITA_data_exploration.ipynb

In [2]:
import pandas as pd
import pickle
from transformers import pipeline
verbose=0

In [3]:
# Load the data from AITA_data_exploration.ipynb
with open('data/pickled_post_data.pkl', 'rb') as f:
    data = pickle.load(f)

posts_df = pd.DataFrame(data)

posts_df.head()

Unnamed: 0,post_id,post_content,post_title,comment_results
0,10uxee0,"I know this post sounds super petty, but this ...",AITA for telling my boyfriend I'll shave my le...,"{'NTA': 31, 'YWBTA': 0, 'YWNBTA': 0, 'ESH': 0,..."
1,10ur722,My daughter Bryn F9 is going on a trip to a ne...,AITA for pulling my daughter from a waterpark ...,"{'NTA': 14, 'YWBTA': 0, 'YWNBTA': 0, 'ESH': 0,..."
2,10upxdd,Alright so my son (17) has weekly therapy appo...,AITA for not letting an elderly woman have my ...,"{'NTA': 27, 'YWBTA': 0, 'YWNBTA': 0, 'ESH': 0,..."
3,10v2vra,We live three blocks away from my parents and ...,AITA for taking my kids to my parents house to...,"{'NTA': 63, 'INFO': 1, 'YWBTA': 0, 'YWNBTA': 0..."
4,10ung90,My daughter (16) and I have gotten into a mass...,AITA for calling my daughter a selfish insecur...,"{'NTA': 17, 'ESH': 3, 'YWBTA': 0, 'YWNBTA': 0,..."


In [4]:
# set up pipeline & candidate labels for classification
candidate_labels = ["NTA", "YTA", "ESH", "NAH", "INFO", "YWBTA", "YWNBTA"]
candidate_labels_short = ["NTA", "YTA"]

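# name the task explicitly instead of relying on it being inferred from the model
pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")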

Let's run the zero-shot model on both the post content and the post title and compare its predictions with the actual verdicts we got from the comments.
Let's also try a shortlist of candidate labels and add that to the mix. Let's write functions and store all the results in a dataframe to make comparison easy.
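
Before writing the functions, here is a small sketch of the data shape we will be working with. For a single text the zero-shot pipeline returns a dict with `sequence`, `labels` and `scores`, with labels sorted by descending score. The prediction below is mocked (no model is loaded): its scores are the rounded `post_content_results` values for post `10ubcp5` from the results table further down, and the helper names `to_score_dict` and `top_label` are just illustrative.

```python
# Mocked pipeline output for one post (scores copied from the results table, not recomputed)
mock_prediction = {
    "sequence": "(post text)",
    "labels": ["INFO", "NAH", "ESH", "YWBTA", "NTA", "YWNBTA", "YTA"],
    "scores": [0.21, 0.19, 0.14, 0.13, 0.12, 0.11, 0.10],
}


def to_score_dict(prediction):
    """Map each candidate label to its score, rounded to 2 decimals."""
    return {
        label: round(score, 2)
        for label, score in zip(prediction["labels"], prediction["scores"])
    }


def top_label(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


scores = to_score_dict(mock_prediction)
predicted = top_label(scores)
print(predicted, scores[predicted])
```

With seven labels and a top score this low, the model is barely preferring one label over the others, which is worth keeping in mind when reading the tables.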

In [5]:
# function to get predictions for a dataframe
# nb_iterations is the number of posts to predict on and will greatly impact run time
def get_predictions(dataframe, labels, prompt_type, nb_iterations=20):
    column_name = 'post_content' if prompt_type == 'content' else 'post_title'
    predictions = {}
    # head() caps the number of posts even if the index is not 0..n-1
    for _, row in dataframe.head(nb_iterations).iterrows():
        post_prediction = pipe(row[column_name], labels)
        # map each label to its score, rounded to 2 decimals
        predictions[row['post_id']] = {
            label: round(score, 2)
            for label, score in zip(post_prediction["labels"], post_prediction["scores"])
        }
    return predictions

# get predictions for post content and post title with or without shortlist of labels
post_content_results = get_predictions(posts_df, candidate_labels, 'content')
post_content_shortlist_result = get_predictions(posts_df, candidate_labels_short, 'content')
post_title_results = get_predictions(posts_df, candidate_labels, 'title')
post_title_shortlist_results = get_predictions(posts_df, candidate_labels_short, 'title')



In [6]:
# returns a dictionary of post_id and the share of comments (between 0 and 1) that are NTA, YTA, ... etc
def get_comment_results(dataframe):
    comment_results = {}
    for _, row in dataframe.iterrows():
        post_id = row['post_id']
        # only return results for posts that have been predicted (uses the global post_content_results)
        if post_id in post_content_results:
            counts = row['comment_results']
            total_comments = sum(counts.values())
            if total_comments == 0:
                # assign 0 when there are no comments
                percentage_results = {key: 0 for key in counts}
            else:
                # share of comments for each label
                percentage_results = {key: value / total_comments for key, value in counts.items()}
            comment_results[post_id] = percentage_results
    return comment_results

# get actual comment results for posts
actual_comments_results = get_comment_results(posts_df)

# combine all results into one dictionary
dict_list = [{'post_id': key, **value, 'source_dict': 'post_content_results'} for key, value in post_content_results.items()] + \
            [{'post_id': key, **value, 'source_dict': 'post_content_shortlist_result'} for key, value in post_content_shortlist_result.items()] + \
            [{'post_id': key, **value, 'source_dict': 'post_title_results'} for key, value in post_title_results.items()] + \
            [{'post_id': key, **value, 'source_dict': 'post_title_shortlist_results'} for key, value in post_title_shortlist_results.items()] + \
            [{'post_id': key, **value, 'source_dict': 'actual_comments_results'} for key, value in actual_comments_results.items()]

In [7]:
# convert to dataframe clean and sort
df = pd.DataFrame(dict_list).sort_values(by=['post_id', 'source_dict'])
df.fillna(0.00, inplace=True)
df

Unnamed: 0,post_id,ESH,INFO,NAH,YTA,YWNBTA,YWBTA,NTA,source_dict
96,10ubcp5,0.00,0.000000,0.00,0.00,0.00,0.00,1.000000,actual_comments_results
16,10ubcp5,0.14,0.210000,0.19,0.10,0.11,0.13,0.120000,post_content_results
36,10ubcp5,0.00,0.000000,0.00,0.45,0.00,0.00,0.550000,post_content_shortlist_result
56,10ubcp5,0.12,0.240000,0.12,0.23,0.09,0.06,0.140000,post_title_results
76,10ubcp5,0.00,0.000000,0.00,0.62,0.00,0.00,0.380000,post_title_shortlist_results
...,...,...,...,...,...,...,...,...,...
83,10v2vra,0.00,0.015625,0.00,0.00,0.00,0.00,0.984375,actual_comments_results
3,10v2vra,0.15,0.200000,0.18,0.09,0.14,0.12,0.110000,post_content_results
23,10v2vra,0.00,0.000000,0.00,0.45,0.00,0.00,0.550000,post_content_shortlist_result
43,10v2vra,0.17,0.190000,0.15,0.18,0.11,0.07,0.140000,post_title_results


Okay, let's face it, the results are not great at first glance.
- similar results for both title and content on shortlisted labels. Less noise?
- unclear results on content vs title. No clearly better candidate yet.
- all zero-shot runs reach drastically different conclusions than the results we scraped from the comments


-> Maybe acronyms for labels are a bad idea? Especially since the model is not familiar with the context or data at all...
We can try making the candidate labels more explicit and try again.
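
To see why acronyms might hurt, it helps to look at what the model actually scores. As far as I know, the zero-shot pipeline turns each label into a hypothesis sentence using a template that defaults to `"This example is {}."`, then asks the NLI model whether the text entails it. The sketch below only builds those sentences (no model is loaded); `build_hypotheses` is an illustrative helper, not part of transformers.

```python
# Default template of the transformers zero-shot pipeline, to the best of my knowledge
DEFAULT_TEMPLATE = "This example is {}."


def build_hypotheses(labels, template=DEFAULT_TEMPLATE):
    """Return the hypothesis sentence the NLI model would be asked about for each label."""
    return [template.format(label) for label in labels]


acronym_hypotheses = build_hypotheses(["NTA", "YTA"])
explicit_hypotheses = build_hypotheses(["Not the Asshole", "You're the Asshole"])

for hypothesis in acronym_hypotheses + explicit_hypotheses:
    print(hypothesis)
```

"This example is NTA." means very little to a model that has never seen the subreddit's jargon. The pipeline call also accepts a `hypothesis_template` argument, so a template closer to the task (for instance something like `"The author of this post is {}."`, which is only a guess at a better wording) could be another thing to try.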

In [8]:
# set up explicit candidate labels for classification
explicit_candidate_labels = ["You're the Asshole", "You Would Be the Asshole", "Not the Asshole", "You Would Not be the Asshole", "Everyone Sucks here", "No Assholes here", "Not Enough Info"]
explicit_candidate_labels_short = ["You're the Asshole", "Not the Asshole"]

# get predictions and results
post_content_results = get_predictions(posts_df, explicit_candidate_labels, 'content')
post_content_shortlist_result = get_predictions(posts_df, explicit_candidate_labels_short, 'content')
post_title_results = get_predictions(posts_df, explicit_candidate_labels, 'title')
post_title_shortlist_results = get_predictions(posts_df, explicit_candidate_labels_short, 'title')
actual_comments_results = get_comment_results(posts_df)

# Our actual_comments_results dictionary still uses acronyms as keys; rename them to the explicit labels so the column names match when merging with the other results
key_map = {'NTA': "Not the Asshole", 'YWBTA': "You Would Be the Asshole", 'YWNBTA': "You Would Not be the Asshole", 'ESH': "Everyone Sucks here", 'NAH': "No Assholes here", 'INFO': "Not Enough Info", 'YTA': "You're the Asshole"}
new_actual_comments_results = {}

for k, v in actual_comments_results.items():
    new_actual_comments_results[k] = {key_map.get(key, key): val for key, val in v.items()}

# combine all results into one dictionary
dict_list = [{'post_id': key, **value, 'source_dict': 'post_content_results'} for key, value in post_content_results.items()] + \
            [{'post_id': key, **value, 'source_dict': 'post_content_shortlist_result'} for key, value in post_content_shortlist_result.items()] + \
            [{'post_id': key, **value, 'source_dict': 'all_post_title_results'} for key, value in post_title_results.items()] + \
            [{'post_id': key, **value, 'source_dict': 'all_post_title_results_short'} for key, value in post_title_shortlist_results.items()] + \
            [{'post_id': key, **value, 'source_dict': 'new_actual_comments_results'} for key, value in new_actual_comments_results.items()]

In [14]:
# convert to dataframe clean and sort
df = pd.DataFrame(dict_list).sort_values(by=['post_id', 'source_dict'])
df.fillna(0.00, inplace=True)
pd.set_option('display.max_rows', None)
df


Unnamed: 0,post_id,Not Enough Info,No Assholes here,You Would Not be the Asshole,You Would Be the Asshole,You're the Asshole,Not the Asshole,Everyone Sucks here,source_dict
56,10ubcp5,0.06,0.03,0.22,0.29,0.28,0.05,0.07,all_post_title_results
76,10ubcp5,0.0,0.0,0.0,0.0,0.85,0.15,0.0,all_post_title_results_short
96,10ubcp5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,new_actual_comments_results
16,10ubcp5,0.15,0.21,0.15,0.17,0.11,0.12,0.1,post_content_results
36,10ubcp5,0.0,0.0,0.0,0.0,0.48,0.52,0.0,post_content_shortlist_result
57,10uczru,0.02,0.04,0.26,0.28,0.25,0.07,0.07,all_post_title_results
77,10uczru,0.0,0.0,0.0,0.0,0.79,0.21,0.0,all_post_title_results_short
97,10uczru,0.0,0.0,0.0,0.0,0.9375,0.0,0.0625,new_actual_comments_results
17,10uczru,0.18,0.17,0.18,0.14,0.11,0.1,0.12,post_content_results
37,10uczru,0.0,0.0,0.0,0.0,0.52,0.48,0.0,post_content_shortlist_result


Let's apply some conditional formatting

<img src="assets/condi_format.png" alt="condi_format" />

Our best candidate is post_content_shortlist_result, which matches the comment section on 9 of the 19 posts (about 47%).
These are not great results, which suggests we need to move away from zero-shot. We should try to find a good text classifier that can be fine-tuned on our data.
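
For reference, here is one way such a score could be computed in code rather than read off the formatted table. This is a sketch under an assumption: that "same result" means the model's top-scoring label equals the most common verdict in the comments. The helper names are illustrative, and the sample only contains the two posts shown in the table above, so it does not reproduce the 9/19 figure.

```python
def best_label(scores):
    """Return the label with the highest score (or comment share)."""
    return max(scores, key=scores.get)


def agreement_rate(predicted, actual):
    """Count posts where the predicted top label equals the top comment verdict.

    Both arguments map post_id -> {label: score}. Only posts present in both are compared.
    Posts without comments have all-zero shares and should be filtered out beforehand.
    """
    shared = [post_id for post_id in predicted if post_id in actual]
    matches = sum(best_label(predicted[pid]) == best_label(actual[pid]) for pid in shared)
    return matches, len(shared)


# Values copied from the post_content_shortlist_result and new_actual_comments_results rows above
predicted_sample = {
    "10ubcp5": {"You're the Asshole": 0.48, "Not the Asshole": 0.52},
    "10uczru": {"You're the Asshole": 0.52, "Not the Asshole": 0.48},
}
actual_sample = {
    "10ubcp5": {"Not the Asshole": 1.0},
    "10uczru": {"You're the Asshole": 0.9375, "Everyone Sucks here": 0.0625},
}

matches, total = agreement_rate(predicted_sample, actual_sample)
print(f"{matches}/{total} posts agree")
```

Note how close to 50/50 the shortlist scores are on both posts: the model happens to land on the right side here, but with very little margin.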