# Analysis of Praise by Words (🐙octopus)

**Date: June 9, 2011**

In this session we explore how different types of activities influence the distribution of Impact Hours.

## Loading Packages and Importing the Data

In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
praise_df = pd.read_csv("cleaned-non-quantifier-data.csv")
praise_df.columns

In [None]:
praise_df.head()

## Getting a Feel for the Words 

Before attempting any sophisticated algorithms, we want to do some human analysis of the data. We look at a large sample of "Reason for dishing" to get a feel for the reasons that praise is given.

In [None]:
praise_df["Reason for dishing"].to_list()

## Doing a Slight Bit of Cleaning 

It will help to get consistency in the words. First, we make all of the strings lower case. 

In [None]:
make_lower_case = lambda x: x.lower()

In [None]:
praise_df["Reason for dishing"] = praise_df["Reason for dishing"].map(make_lower_case)
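
As an aside, pandas also offers a vectorized string accessor that does the same job and leaves missing values untouched. The sketch below uses a small made-up Series (the values are hypothetical, not taken from the praise data):

```python
import numpy as np
import pandas as pd

# Hypothetical reasons, including one missing entry.
reasons = pd.Series(["For Retweeting", np.nan, "for HOSTING the call"])

# .str.lower() lower-cases each string and passes NaN through unchanged,
# whereas calling x.lower() on a NaN (a float) would raise an AttributeError.
lowered = reasons.str.lower()
print(lowered.to_list())
```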

As we've seen before, the Impact Hours are quite right-skewed.

In [None]:
praise_df['IH per Praise'].describe()
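
One way to put a number on the skew is to compare the mean with the median and to compute the sample skewness. This is only a sketch on made-up values (the Series below is illustrative, not the real Impact Hours):

```python
import pandas as pd

# Hypothetical Impact Hours: most values are small, one is very large.
ih = pd.Series([0.1, 0.2, 0.2, 0.3, 0.5, 0.8, 1.0, 12.0])

# A mean well above the median and a positive skewness both point to right skew.
summary = {"mean": ih.mean(), "median": ih.median(), "skew": ih.skew()}
print(summary)
```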

## Let's check for missing values. 

In [None]:
praise_df['IH per Praise'].isna().sum()

## There appear to be several -- let's verify. 

In [None]:
praise_df[praise_df['IH per Praise'].isna()]

## We drop these 502 missing values since they have no quantitative information. We then reset the index.

In [None]:
praise_df.dropna(subset=["IH per Praise"], inplace = True)

In [None]:
praise_df["IH per Praise"].isna().sum()

In [None]:
# drop = True discards the old index instead of keeping it as an extra "index" column
praise_df = praise_df.reset_index(drop = True)
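
A minimal sketch of the drop-and-reset step on a toy frame (hypothetical values). It also shows one detail worth knowing: `reset_index()` without `drop=True` stores the old index as a new column named "index", which then travels along with the data.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values.
toy = pd.DataFrame({"IH per Praise": [1.0, np.nan, 2.5, np.nan, 0.3]})

# Keep only rows that have a value; the surviving index labels are 0, 2, 4.
kept = toy.dropna(subset=["IH per Praise"])

with_old_index = kept.reset_index()       # old labels saved in an "index" column
clean = kept.reset_index(drop=True)       # old labels discarded

print(with_old_index.columns.to_list())
print(clean.columns.to_list())
```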

## Many of the columns will not be useful for this particular analysis. 

We drop all of the following:
* To
* From
* v1 norm
* v2 norm
* v3 norm
* IH per person
* Cred per Praise
* Cred per person
* period
* Room (since we have Room-NoEmoji, which is cleaner)
* v1
* v2
* v3
* Unnamed: 12 (a duplicate of "To")
* To.1
We are focusing on other information for this analysis. 

In [None]:
praise_df.drop(inplace = True, columns = ["To", "From", "v1 norm", "v2 norm", "v3 norm", "IH per person", "Cred per Praise", "Cred per person",
                                         "period", "Room", "v1", "v2", "v3", "Unnamed: 12", "To.1"])

In [None]:
praise_df.head()

## We can also drop "Avg %" and "Date" (since these are duplicated in other columns).

In [None]:
praise_df.drop(columns = ["Avg %", "Date"], inplace = True)

In [None]:
praise_df.head()

## We notice an issue with "Server" name that we perhaps should have caught earlier. 

In [None]:
np.unique(praise_df["Server"])

## Both "TG" and "Telegram" are used for Telegram. It's time to fix this. 

In [None]:
replace_TG = lambda x: "Telegram" if x == "TG" else x

In [None]:
praise_df["Server"] = praise_df["Server"].map(replace_TG)

In [None]:
np.unique(praise_df["Server"])
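
For reference, the same relabeling can be written without a lambda by using `Series.replace` with a mapping. A small sketch on made-up server names:

```python
import pandas as pd

# Hypothetical server labels with both spellings of Telegram.
servers = pd.Series(["TG", "Telegram", "Token Engineering Commons", "TG"])

# Map the abbreviation to the full name; every other value is left as it is.
unified = servers.replace({"TG": "Telegram"})
print(unified.unique())
```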

## Let's see how much each Server contributed. 

In [None]:
praise_df.groupby("Server")["IH per Praise"].count()

In [None]:
praise_df.groupby("Server")["IH per Praise"].count()/len(praise_df["IH per Praise"])

In [None]:
praise_df.groupby("Server")["IH per Praise"].sum()

In [None]:
praise_df.groupby("Server")["IH per Praise"].sum()/praise_df["IH per Praise"].sum()
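
The four cells above can also be collected into a single table, which makes it easier to compare a server's share of praises with its share of Impact Hours. The sketch below uses a toy frame with hypothetical server names "A", "B" and "C":

```python
import pandas as pd

# Toy data: server names and values are made up for illustration.
toy = pd.DataFrame({
    "Server": ["A", "A", "B", "B", "B", "C"],
    "IH per Praise": [1.0, 3.0, 0.5, 0.5, 1.0, 4.0],
})

# Count and total Impact Hours per server in one pass.
by_server = toy.groupby("Server")["IH per Praise"].agg(["count", "sum"])

# Convert both to shares of the overall total.
by_server["share of praises"] = by_server["count"] / by_server["count"].sum()
by_server["share of IH"] = by_server["sum"] / by_server["sum"].sum()
print(by_server)
```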

## Since both "Bot Training Ground" and "TEC template" contributed less than 1%, we will remove them from the data frame. 

In [None]:
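# Remove the two servers that contributed less than 1%.
praise_df = praise_df[~praise_df["Server"].isin(["Bot Training Ground", "TEC template"])]
# drop = True discards the old index instead of keeping it as an extra column.
praise_df = praise_df.reset_index(drop = True)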

In [None]:
praise_df.groupby("Server")["IH per Praise"].count().plot.pie()

In [None]:
praise_df.groupby("Server")["IH per Praise"].count().plot.bar()

In [None]:
praise_df.groupby("Server")["IH per Praise"].sum().plot.pie()

In [None]:
praise_df.groupby("Server")["IH per Praise"].mean().plot.bar()

In [None]:
import seaborn as sns

## Since we originally made "Source" based on Server name and we have now changed Server, we also change Source. 

In [None]:
praise_df["Source"] = praise_df["Server"] + " : " + praise_df["Room-NoEmoji"]

In [None]:
praise_df["Source"]

## There are some activities that, according to the quantifiers, are exceptionally great (such as "inventing augmented bonded curves") and thus atypical. We'd like to decide on a cutoff point, to remove these occurrences and analyze them separately.

In [None]:
list_of_quantiles = np.quantile(praise_df["IH per Praise"], [0.8,0.85,0.9,0.95,0.99])
list_of_quantiles

## The top 5% begins at approximately 4.42. This feels right to us, though we can change the quantile later if we wish. We create a new data frame that holds only the exceptional data (top 5%).

In [None]:
cutoff_quant = 0.95
exceptional_df = praise_df[praise_df["IH per Praise"] >= np.quantile(praise_df["IH per Praise"], cutoff_quant)]
exceptional_df
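
To make the cutoff logic concrete, here is a sketch on synthetic values 1 through 100 (not the real Impact Hours). It splits the data at the 95th percentile and reports how much of the total falls in the top group.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for "IH per Praise": the values 1.0, 2.0, ..., 100.0.
ih = pd.Series(np.arange(1, 101, dtype=float))

# The quantile gives the threshold; rows at or above it count as exceptional.
cutoff = ih.quantile(0.95)
exceptional = ih[ih >= cutoff]
typical = ih[ih < cutoff]

# Share of the total that comes from the exceptional group.
share = exceptional.sum() / ih.sum()
print(cutoff, len(exceptional), len(typical), round(share, 3))
```

Because the two groups are defined by the same threshold, they partition the data, so the typical rows can be analyzed separately without overlap.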

In [None]:
print("How many exceptional praises are there?")
# cutoff_quant is a fraction (0.95), so multiply by 100 to report it as a percentile.
print("There are {} praises in this exceptional data frame, which begins at the {:.0f}th percentile.".format(len(exceptional_df), 100 * cutoff_quant))
print("\n")
print("What percentage of the Impact Hours awarded come from these praises?")
pct_exceptional_IH = 100 * (exceptional_df["IH per Praise"].sum() / praise_df["IH per Praise"].sum())
print("{:.1f} percent of the Impact Hours awarded come from this group.".format(pct_exceptional_IH))

In [None]:
exceptional_df.groupby("Server")["IH per Praise"].count()/len(exceptional_df)

In [None]:
exceptional_df.groupby("Server")["IH per Praise"].sum()/exceptional_df["IH per Praise"].sum()

## Organizing Into Words

## Now we are interested in the word-level data, with a hope of recognizing and grouping like activities. 

We use a basic [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which creates a new column for each word that appears in the data set. Because we set `binary = True`, each observation gets a 1 if the word appears in it and a 0 otherwise, rather than the number of times the word appeared.

Settings:
* input: "content" -- since our source is a list
* stop_words = "english" -- to avoid common words like "the", "and", "a", etc.
* binary = True -- record presence or absence of a word instead of a count
* analyzer = "word" -- build features from whole words (this is the default)

Settings We Did Not Use, but Could:
* max_df: maximum document frequency -- ignore words which occur too frequently
* min_df: minimum document frequency -- ignore words which are very rare
In [None]:
vectorizer = CountVectorizer(input = "content", stop_words = 'english', binary = True, analyzer = "word")

In [None]:
words_encoded = vectorizer.fit_transform(praise_df["Reason for dishing"].to_list())


In [None]:
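# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() (scikit-learn >= 1.0) replaces it
words_df = pd.DataFrame(data = words_encoded.toarray(), columns = vectorizer.get_feature_names_out())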

In [None]:
words_df.columns

In [None]:
words_df.columns.to_list()

In [None]:
words_df.head()

Now we join this data back to the original data, so we also have context of source, date, and Impact Hours. 

In [None]:
len(words_df)

In [None]:
sum(words_df.index == praise_df.index) == len(praise_df)

In [None]:
final_df = pd.concat([words_df, praise_df], axis = 1)

In [None]:
len(final_df)
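
The index check above matters because `pd.concat` with `axis = 1` aligns rows by index label, not by position. A small sketch with hypothetical frames shows what happens when one side still carries gaps from earlier filtering:

```python
import pandas as pd

# Word columns come straight from the vectorizer, so they are indexed 0, 1, 2.
left = pd.DataFrame({"word": [1, 0, 1]})

# A filtered frame whose index was never reset has gaps in its labels.
right = pd.DataFrame({"IH": [0.5, 2.0, 1.0]}, index=[0, 2, 3])

# Aligning on labels produces extra rows and missing values.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first makes the rows line up by position.
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(len(misaligned), len(aligned))
```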

## Now we can use this data frame to answer questions related to the textual information in the praise, as well as date, server, and room. To give an example of what this might look like, we focus on social media, finding all messages which contain "retweeting" or "mentioning".

In [None]:
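# The column is a 0/1 indicator, so sum() gives the number of praises containing the word
# (count() would return the number of rows per server regardless of the word).
final_df.groupby("Server")["retweeting"].sum()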

In [None]:
final_df[final_df["retweeting"] == 1].groupby("Server")["IH per Praise"].mean()

In [None]:
final_df[final_df["mentioning"] == 1].groupby("Server")["IH per Praise"].mean()

In [None]:
final_df[final_df["retweeting"] == 1].groupby("Month")["IH per Praise"].mean().plot.barh()
plt.title("Impact Hours for retweeting by Month")
plt.show()

In [None]:
final_df[final_df["retweeting"] == 1].groupby("Month")["IH per Praise"].count().plot.barh()

In [None]:
final_df[final_df["retweeting"] == 1].groupby("Month")["IH per Praise"].sum().plot.barh()

In [None]:
social_condition = (final_df["retweeting"]== 1) | (final_df["mentioning"] == 1)
social_df = final_df[social_condition]

In [None]:
len(social_df)

In [None]:
social_df["Reason for dishing"].to_list()

In [None]:
month_and_server = pd.pivot_table(data = social_df,  values = "IH per Praise", index = "Month", 
                                  columns = "Server", aggfunc = 'mean')

In [None]:
month_and_server

In [None]:
month_and_server = month_and_server.fillna(0)

In [None]:
fig, ax = plt.subplots(figsize = (12, 7))
sns.heatmap(month_and_server, cmap ='RdYlGn', linewidths = 0.30, 
            annot = True)
ax.set_title("Value of Retweets and Mentions")

### <span style="color:red"> Well, this looks strange -- why would social media activity on Telegram in December be worth twice as much as similar activity in Token Engineering Commons in April? </span>

In [None]:
dec_TG_df = social_df.query("Server == 'Telegram' & Month == 12")

In [None]:
dec_TG_df["Reason for dishing"].to_list()

In [None]:
dec_TG_df[["Reason for dishing","IH per Praise"]]

### <span style = "color:purple"> 1. One reason for the discrepancy is clearly fair: the work of writing an article vs. the work of retweeting. So perhaps "mentioning" needs to be used more carefully.  </span>

### <span style = "color: red"> 2. There are identical items which receive various values. This is likely a result of "hand-editing" by the quantifiers. In [my conversation with Griff on May 20](https://www.youtube.com/watch?v=XTlfElzjPWg), he said that validators would sometimes "hand-edit" the overall amount of Impact Hours that a contributor received if they felt it was not proportional to actual impact. They did this by adjusting an arbitrary row. I think we are seeing that here.  </span>
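
One way to look for such cases systematically is to group by the praise text and flag any reason that was given more than one distinct value. This is only a sketch on a toy frame with made-up reasons and values:

```python
import pandas as pd

# Hypothetical praise rows: the same reason appears with different values.
toy = pd.DataFrame({
    "Reason for dishing": [
        "for retweeting",
        "for retweeting",
        "for retweeting",
        "for writing an article",
    ],
    "IH per Praise": [0.2, 0.2, 0.9, 3.0],
})

# For each distinct reason: how often it occurs and how many distinct values it received.
spread = toy.groupby("Reason for dishing")["IH per Praise"].agg(["count", "nunique", "min", "max"])

# Reasons with more than one distinct value are candidates for hand-editing.
inconsistent = spread[spread["nunique"] > 1]
print(inconsistent)
```

A flagged reason is not proof of hand-editing; it only narrows down which rows are worth reading by eye.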



## Using Clustering to Group Like Praise

This would be a good place for someone with more expertise in clustering to come in:
* How many clusters/categories should we make?
* What is a good algorithm?
* How should the data be preprocessed/cleaned/embedded beforehand? (Perhaps tf-idf or Doc2Vec, etc.)
* How to use human labeling? 
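
As a possible starting point for the first and third questions, the sketch below weights words with tf-idf and compares a few cluster counts using the silhouette score. The documents are made up, the range of k is arbitrary, and the silhouette score is just one of several heuristics, so treat this as an assumption-laden illustration rather than an answer.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical praise reasons.
docs = [
    "retweeting the community call announcement",
    "retweeting the hatch announcement",
    "retweeting our latest article",
    "hosting the weekly community call",
    "hosting the stewards call",
    "taking notes during the weekly call",
    "writing an article about bonding curves",
    "writing documentation about bonding curves",
]

# tf-idf down-weights words that appear in many documents.
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Fit KMeans for several k and record the silhouette score of each clustering.
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is one candidate for the number of clusters.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```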

In [None]:
clusterer = KMeans(n_clusters = 4, n_init = 20)

In [None]:
word_clusters = clusterer.fit(words_df)

In [None]:
(clusterer.labels_==0).mean()

In [None]:
(clusterer.labels_==1).mean()

In [None]:
(clusterer.labels_==2).mean()

In [None]:
(clusterer.labels_==3).mean()

In [None]:
words_df.loc[clusterer.labels_==0, :].sum().sort_values(ascending = False)[0:20]

In [None]:
words_df.loc[clusterer.labels_==1, :].sum().sort_values(ascending = False)[0:20]

In [None]:
words_df.loc[clusterer.labels_==2, :].sum().sort_values(ascending = False)[0:20]

In [None]:
words_df.loc[clusterer.labels_==3, :].sum().sort_values(ascending = False)[0:20]

In [None]:
praise_df.loc[clusterer.labels_==0, :]["IH per Praise"].describe()

In [None]:
praise_df.loc[clusterer.labels_==1, :]["IH per Praise"].describe()

In [None]:
praise_df.loc[clusterer.labels_==2, :]["IH per Praise"].describe()

In [None]:
praise_df.loc[clusterer.labels_==3, :]["IH per Praise"].describe()
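
The per-cluster cells above repeat the same pattern for each label, so they can be condensed into a loop and a grouped summary. The sketch below uses a tiny made-up word matrix and made-up Impact Hours; the names `words`, `ih` and `km` are hypothetical stand-ins for `words_df`, the "IH per Praise" column and `clusterer`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy binary word matrix: rows are praises, columns are words.
words = pd.DataFrame({
    "retweeting": [1, 1, 1, 0, 0, 0],
    "hosting":    [0, 0, 0, 1, 1, 1],
    "article":    [1, 0, 1, 0, 0, 0],
    "call":       [0, 0, 0, 1, 0, 1],
})
ih = pd.Series([0.2, 0.3, 0.2, 1.5, 2.0, 1.0], name="IH per Praise")

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(words)

# One describe() row per cluster instead of one cell per cluster.
summary = ih.groupby(km.labels_).describe()

# Most frequent words in each cluster.
top_words = {
    label: words[km.labels_ == label].sum().sort_values(ascending=False).head(2)
    for label in np.unique(km.labels_)
}

print(summary)
for label, counts in top_words.items():
    print(label, counts.to_dict())
```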