# Notes
## Dataset details:
[OpenAI reddit dataset](https://huggingface.co/datasets/openai/summarize_from_feedback)
2 subsets:
1. Axis
Users (raters) rate a summary along several axes ("overall", "accuracy", "coverage", "coherence", "compatible").
2. Comparison
The user is given a document and a pair of summaries and asked to pick the better one. The user also gives a confidence score for that choice.

|            | Train  | Valid                    | Test           |
| ---------- | ------ | ------------------------ | -------------- |
| Comparison | Reddit | Reddit<br>CNN/Daily Mail |                |
| Axis       | x      | Reddit                   | CNN/Daily Mail |

## Appropriation

End goal: We need user history (a series of clicks/skips) as well as user-generated document summaries. The OpenAI dataset has neither, hence the appropriation.
### User generated summaries:  
We use the Axis subset for this. Each user evaluates summaries produced by different models/policies for a given document. The summary a user rates highest is treated as if the user had written it.  
**Subset used:** axis  
**Split used:** validation  
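
The selection rule above can be sketched on toy data. This is a minimal, dependency-free illustration, not the notebook's pandas implementation; the user ids, document ids, summary texts, and scores are made up, and `best_summary_per_user_doc` is a hypothetical helper name.

```python
# Toy axis-style ratings; all ids, texts and scores are invented for illustration.
ratings = [
    {"uid": "u1", "doc_id": "d1", "summary_text": "summary A", "overall": 4},
    {"uid": "u1", "doc_id": "d1", "summary_text": "summary B", "overall": 6},
    {"uid": "u1", "doc_id": "d2", "summary_text": "summary C", "overall": 3},
    {"uid": "u2", "doc_id": "d1", "summary_text": "summary A", "overall": 7},
]


def best_summary_per_user_doc(rows):
    """Return {(uid, doc_id): row}, keeping the row with the highest 'overall' rating."""
    best = {}
    for row in rows:
        key = (row["uid"], row["doc_id"])
        # Strict '>' keeps the first row seen when two ratings tie.
        if key not in best or row["overall"] > best[key]["overall"]:
            best[key] = row
    return best


user_summaries = best_summary_per_user_doc(ratings)
for (uid, doc_id), row in sorted(user_summaries.items()):
    print(uid, doc_id, row["summary_text"], row["overall"])
```

Here user `u1` rated two summaries of `d1`, and only the higher-rated one is kept as that user's simulated summary.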

### User click/skip history
We use the comparisons subset for this. For a given document, the user is shown multiple summary pairs, picks the better summary in each, and states a confidence in that choice. We take the mean of those confidence scores as a proxy for how easily the user understands the document and can make a decision about it. We use this mean to separate documents the user might click (high mean confidence) from those the user might skip (low mean confidence).  
**Subset used:** comparisons  
**Split used:** validation  
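
The click/skip split can be illustrated with a small sketch. It assumes the strict `> 5.0` threshold used later in this notebook; the records are invented, and `split_clicks_and_skips` is a hypothetical helper written in plain Python rather than pandas.

```python
from collections import defaultdict
from statistics import mean

# Toy comparison records: one row per summary pair judged by a user.
comparisons = [
    {"uid": "u1", "doc_id": "d1", "confidence": 8},
    {"uid": "u1", "doc_id": "d1", "confidence": 6},
    {"uid": "u1", "doc_id": "d2", "confidence": 2},
    {"uid": "u1", "doc_id": "d2", "confidence": 4},
    {"uid": "u2", "doc_id": "d1", "confidence": 5},
]


def split_clicks_and_skips(rows, threshold=5.0):
    """Group confidences per (uid, doc_id) and split the pairs on mean confidence."""
    by_pair = defaultdict(list)
    for row in rows:
        by_pair[(row["uid"], row["doc_id"])].append(row["confidence"])

    clicks, skips = [], []
    for pair, scores in by_pair.items():
        # Strictly above the threshold counts as a click, everything else as a skip.
        (clicks if mean(scores) > threshold else skips).append(pair)
    return sorted(clicks), sorted(skips)


clicks, skips = split_clicks_and_skips(comparisons)
print("clicks:", clicks)
print("skips:", skips)
```

Because the comparison is strict, a pair whose mean confidence equals the threshold exactly (here `u2`, `d1`) lands in the skip group.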

## Desired output format for model input
1. news.tsv
News ID: document id  
Category:  category  
Headline: actual title  
News Body: document text  



2. personalized_test.tsv (contains user click history)   
userid: user id  
clicknewsID: document ids clicked by user   
posnewID: document ids user was asked to summarize  
rewrite_titles: personalized summaries generated by the user  
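
As a rough sketch of what one `personalized_test.tsv` row could look like and how it can be read back: the separators below (`,` between document ids, `;;` between summaries) follow the joins used later in this notebook, while the ids and summary texts are placeholders. The sketch assumes no summary contains `;;` itself.

```python
import csv
import io

fields = ["userid", "clicknewsID", "posnewID", "rewrite_titles"]

# One toy user row; ids and texts are placeholders.
row = {
    "userid": "u1",
    "clicknewsID": ",".join(["d3", "d7"]),
    "posnewID": ",".join(["d1", "d2"]),
    "rewrite_titles": ";;".join(["summary for d1", "summary for d2"]),
}

# Write the row as tab-separated text into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerow(row)

# Read it back and split the packed columns into Python structures.
buffer.seek(0)
parsed = next(csv.DictReader(buffer, delimiter="\t"))
clicked = parsed["clicknewsID"].split(",")
summaries = dict(zip(parsed["posnewID"].split(","), parsed["rewrite_titles"].split(";;")))

print(clicked)
print(summaries)
```

The i-th id in `posnewID` is paired with the i-th entry of `rewrite_titles`, so both columns must keep the same order.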


# Imports

In [26]:
import json
import pandas as pd
from datasets import load_dataset


# Prepare master data

In [29]:
def get_master_data(hf_dataset="openai/summarize_from_feedback", subset="axis", split="validation"):
    # Load the requested subset, or the default configuration when subset is falsy.
    if subset:
        ds = load_dataset(hf_dataset, subset)
    else:
        ds = load_dataset(hf_dataset)

    df = ds[split].to_pandas()
    print(df.columns)
    unique_users = df["worker"].drop_duplicates().shape
    print(f"unique_users: {unique_users}")
    unique_docs = df["info"].drop_duplicates().shape
    print(f"unique_docs: {unique_docs}")

    # Flatten the nested "info" dict into the news.tsv columns:
    # News ID (document id), News Body (document text), Headline (actual title), Category.
    df["News ID"] = df["info"].apply(lambda x: x["id"])
    df["News Body"] = df["info"].apply(lambda x: x["post"])
    df["Headline"] = df["info"].apply(lambda x: x["title"])
    df["Category"] = df["info"].apply(lambda x: x["subreddit"])
    # Drop rows that have no subreddit, then keep one row per document.
    df = df[~df["Category"].isna()]
    df = df[["News ID", "News Body", "Headline", "Category"]]
    df = df.drop_duplicates("News ID")
    return df
    

# df_comparison
master_df.shape

# ds_history = load_dataset("openai/summarize_from_feedback", "comparisons")
# df_history = ds_history["validation"].to_pandas()


Index(['info', 'summary', 'worker', 'batch', 'split'], dtype='object')
unique_users: (32,)
unique_docs: (1038,)
Index(['info', 'summaries', 'choice', 'worker', 'batch', 'split', 'extra'], dtype='object')
unique_users: (63,)
unique_docs: (6714,)


(6320, 4)

In [31]:
df_axis = get_master_data(hf_dataset="openai/summarize_from_feedback", subset="axis", split="validation") # unique_users: (32,),  unique_docs: (1038,)
df_comparison = get_master_data(hf_dataset="openai/summarize_from_feedback", subset="comparisons", split="validation") # unique_users: (63,), unique_docs: (6714,)

master_df = pd.concat([df_axis, df_comparison], axis=0)  # equivalent to the older df_axis.append(df_comparison, ignore_index=True)
master_df = master_df.drop_duplicates("News ID")

master_df.to_csv("news.csv")

Index(['info', 'summary', 'worker', 'batch', 'split'], dtype='object')
unique_users: (32,)
unique_docs: (1038,)
Index(['info', 'summaries', 'choice', 'worker', 'batch', 'split', 'extra'], dtype='object')
unique_users: (63,)
unique_docs: (6714,)



# Prepare personalized data
User summaries are simulated from the `axis` subset, `validation` split.  
User click histories are simulated from the `comparisons` subset, `validation` split.  
Note: any (user, document) pair that occurs in the summaries is excluded from the click history.

In short:
1. User clicks are the documents whose summary pairs the user rated with a mean confidence > 5.  
2. User summaries are the model/policy summaries the user rated highest for each document, kept only when that overall rating is > 5.  
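
The exclusion rule from the note above can be sketched without pandas. The pairs below are invented, and `build_click_history` is a hypothetical helper; it only illustrates dropping summary pairs from the clicks and packing the rest per user.

```python
from collections import defaultdict

# (uid, doc_id) pairs already used as simulated user summaries (toy data).
summary_pairs = {("u1", "d1"), ("u2", "d4")}

# (uid, doc_id) pairs that passed the mean-confidence threshold (toy data).
click_pairs = [("u1", "d1"), ("u1", "d3"), ("u2", "d4"), ("u2", "d5"), ("u2", "d6")]


def build_click_history(click_pairs, summary_pairs):
    """Drop pairs that appear in the summaries, then join each user's doc ids with commas."""
    history = defaultdict(list)
    for uid, doc_id in click_pairs:
        if (uid, doc_id) not in summary_pairs:
            history[uid].append(doc_id)
    return {uid: ",".join(docs) for uid, docs in history.items()}


click_history = build_click_history(click_pairs, summary_pairs)
print(click_history)
```

Keeping `summary_pairs` as a set makes each membership check constant time, which matters when there are many thousands of pairs.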

In [125]:
def get_user_summaries(hf_dataset="openai/summarize_from_feedback", subset="axis", split="validation", rating_summ_threshold=5.0):
    if subset:
        ds_history = load_dataset(hf_dataset, subset)
    else:
        ds_history = load_dataset(hf_dataset)
    df = ds_history[split].to_pandas()
    # Flatten the nested "info" and "summary" dicts into plain columns.
    df = df.rename(columns={"worker": "uid"})
    df["doc_id"] = df["info"].apply(lambda x: x["id"])

    df["summary_text"] = df["summary"].apply(lambda x: x["text"])
    df["summary_rating"] = df["summary"].apply(lambda x: x["axes"]["overall"])
    df["summary_model"] = df["summary"].apply(lambda x: x["policy"])
    df = df[["uid", "doc_id", "summary_model", "summary_text", "summary_rating"]]
    # For each (document, user) pair, keep the summary with the highest overall rating.
    idx = df.groupby(["doc_id", "uid"])["summary_rating"].idxmax()
    df = df.loc[idx]

    # Keep only top-rated summaries whose rating is above the threshold.
    df_usummaries = df[df["summary_rating"] > rating_summ_threshold]
    df_usummaries = df_usummaries[["uid", "doc_id", "summary_text"]]
    print(df_usummaries.columns)
    # Grouping per user into posnewID / rewrite_titles happens in the consolidation step.
    return df_usummaries


def get_user_clicks(hf_dataset="openai/summarize_from_feedback", subset="comparisons", split="validation", rating_click_threshold=5.0):
    if subset:
        ds_history = load_dataset(hf_dataset, subset)
    else:
        ds_history = load_dataset(hf_dataset)
    df_history = ds_history[split].to_pandas()
    # Flatten the nested "info" dict into plain columns.
    df_history = df_history.rename(columns={"worker": "uid"})
    df_history["doc_id"] = df_history["info"].apply(lambda x: x["id"])
    df_history["doc_text"] = df_history["info"].apply(lambda x: x["post"])
    df_history["subreddit"] = df_history["info"].apply(lambda x: x["subreddit"])

    # Drop rows that have no subreddit.
    df_history = df_history[~df_history["subreddit"].isna()]

    # Mean confidence per (document, user) pair across all summary pairs the user judged.
    df_history["confidence"] = df_history["extra"].apply(lambda x: x["confidence"])
    history_candidates = df_history[["uid", "doc_id", "confidence"]].groupby(["doc_id", "uid"]).aggregate("mean")["confidence"]

    # Treat pairs with mean confidence above the threshold as clicks.
    history_candidates = history_candidates.reset_index()
    history_candidates = history_candidates[history_candidates["confidence"] > rating_click_threshold]
    history_candidates = history_candidates[["uid", "doc_id"]]

    return history_candidates
    

# consolidate history
user_summaries = get_user_summaries()
print(f"user_summaries.columns: {user_summaries.columns}")

user_clicks = get_user_clicks()
print(f"user_clicks.columns: {user_clicks.columns}")

Index(['uid', 'doc_id', 'summary_text'], dtype='object')
user_summaries.columns: Index(['uid', 'doc_id', 'summary_text'], dtype='object')
user_clicks.columns: Index(['uid', 'doc_id'], dtype='object')


## Exclude user summary docs from click docs

In [126]:
# Build the set of (uid, doc_id) pairs that already serve as user summaries.
user_doc_tuples = {tuple(x) for x in user_summaries[["uid", "doc_id"]].values}

# Drop those pairs from the click history, then collect the remaining doc ids per user.
user_clicks["summary_flag"] = user_clicks.apply(lambda x: (x["uid"], x["doc_id"]) in user_doc_tuples, axis=1)
user_clicks = user_clicks[~user_clicks["summary_flag"]]
user_clicks = user_clicks.groupby("uid").aggregate({"doc_id": lambda x: ",".join(x.to_list())})
user_clicks = user_clicks.reset_index()
user_clicks = user_clicks.rename(columns={"doc_id": "clicknewsID"})
user_clicks

Unnamed: 0,uid,clicknewsID
0,3AFaFd3w9NjDGnO51kupLyK1N44DQ2,"t3_10uftj,t3_115df2,t3_11fvr8,t3_11z6b6,t3_121..."
1,43gHDyCi222pTzozK8X47V7YdLit7P,"t3_12a7za,t3_14bpbp,t3_1aoiah,t3_1i2p7t,t3_1j6..."
2,44Z8ttpKcY6Kr1sNymNnBA0nL0h4dZ,"t3_10de6c,t3_1cq11l,t3_1glbok,t3_1ke2y7,t3_1p0..."
3,4voZkkCJyOCnpQ5f8WFf5unLW1dSjC,"t3_19b8cq,t3_19zkjr,t3_1aoiah,t3_1bal64,t3_1k8..."
4,6TDC3rcGcujIOhfdq3356VhN4NzveC,"t3_12yk7r,t3_13cdkb,t3_1krnth,t3_2coxcy,t3_2f2..."
...,...,...
56,rmgbTjW1stlproQnuHE2bUpK78Jxle,"t3_103e7p,t3_10erz1,t3_10rz2c,t3_137uxi,t3_14y..."
57,sC4a4UNRMSYCGopXr3K8znnyna6TVh,"t3_11yrlm,t3_1hanb5,t3_1p0iwc,t3_1s4igw,t3_1tq..."
58,thott7XepukYSbOL2QgSlyXd0rgHvr,"t3_10x2g2,t3_115svb,t3_11yrlm,t3_12md1j,t3_138..."
59,uvzut5OK2bvei9zoCDdktcfLENYioY,"t3_11fvr8,t3_11yrlm,t3_120wzc,t3_12lkx5,t3_12m..."


## Consolidate users that have both history and summaries

In [127]:
print(f"user_summaries.columns: {user_summaries.columns}")
# Collect each user's summarized doc ids (comma-separated) and summaries (";;"-separated).
user_summaries = user_summaries.groupby("uid").aggregate({"doc_id": lambda x: ",".join(x.to_list()), "summary_text": lambda x: ";;".join(x.to_list())})
user_summaries = user_summaries.reset_index()
user_summaries = user_summaries.rename(columns={"doc_id": "posnewID", "summary_text": "rewrite_titles"})
# Inner join: keep only users that have both a click history and summaries.
personalized_test = pd.merge(user_clicks, user_summaries, on=["uid"])
personalized_test = personalized_test.rename(columns={"uid": "userid"})
personalized_test.to_csv("personalized_test.csv", sep="\t", index=False)

user_summaries.columns: Index(['uid', 'doc_id', 'summary_text'], dtype='object')
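
The consolidation step keeps only users present on both sides, which is what the default inner join of `pd.merge` does. A minimal plain-Python sketch of that behaviour, on invented data and with a hypothetical `inner_join_users` helper, might look like this:

```python
# Toy per-user click histories and summaries, already packed into strings.
clicks = {"u1": "d3", "u2": "d5,d6", "u3": "d9"}
summaries = {
    "u1": ("d1", "summary for d1"),
    "u2": ("d4", "summary for d4"),
    "u4": ("d8", "summary for d8"),
}


def inner_join_users(clicks, summaries):
    """Keep only users found in both mappings and emit personalized_test-style rows."""
    rows = []
    for uid in sorted(clicks.keys() & summaries.keys()):
        posnew_id, rewrite_titles = summaries[uid]
        rows.append(
            {
                "userid": uid,
                "clicknewsID": clicks[uid],
                "posnewID": posnew_id,
                "rewrite_titles": rewrite_titles,
            }
        )
    return rows


personalized_rows = inner_join_users(clicks, summaries)
for row in personalized_rows:
    print(row)
```

Users `u3` (clicks only) and `u4` (summaries only) are dropped, so the final file can contain fewer users than either input.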
