# Cleaning the Impact Hours Column 🐙
Date: June 21, 2021

## Loading Packages and Data 

In [11]:
import pandas as pd
import numpy as np
from scipy.stats import mode

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px 

In [3]:
praise_df = pd.read_csv("data/all-hand-labeled-data.csv")
praise_df.head()

Unnamed: 0.1,Unnamed: 0,To,From,Reason for dishing,Server,Date,Room,v1 norm,v2 norm,v3 norm,...,Cred per Praise,Cred per person,To.1,Room-NoEmoji,Source,Year,Month,Day,Category Code,QUESTIONS
0,0,zeptimusQ,Tam2140,for hosting this kicking params party!,Token Engineering Commons,2021-05-07,🙏praise,10000.0,100.0,200.0,...,,,,praise,Token Engineering Commons:praise,2021,5,7,,
1,1,zeptimusQ,iviangita,for hosting and leading a lot of params parties,Token Engineering Commons,2021-05-07,🙏praise,10000.0,100.0,100.0,...,,,,praise,Token Engineering Commons:praise,2021,5,7,,
2,2,zeptimusQ,JuankBell,for testing and deploying the bot to record an...,Token Engineering Commons,2021-04-28,🙏praise,1000.0,200.0,200.0,...,,,,praise,Token Engineering Commons:praise,2021,4,28,,
3,3,zeptimusQ,iviangita,for the huge success of the MVV process,Token Engineering Commons,2021-04-30,🙏praise,1000.0,200.0,100.0,...,,,,praise,Token Engineering Commons:praise,2021,4,30,TEC2,
4,4,zeptimusQ,iviangita,"for his awesome work on the recorder bot, for ...",Token Engineering Commons,2021-04-30,🙏praise,1000.0,200.0,100.0,...,,,,praise,Token Engineering Commons:praise,2021,4,30,,


In [6]:
praise_df.columns

Index(['Unnamed: 0', 'To', 'From', 'Reason for dishing', 'Server', 'Date',
       'Room', 'v1 norm', 'v2 norm', 'v3 norm', 'Avg %', 'IH per Praise',
       'IH per person', 'Unnamed: 12', 'v1', 'v2', 'v3', 'period',
       'Cred per Praise', 'Cred per person', 'To.1', 'Room-NoEmoji', 'Source',
       'Year', 'Month', 'Day', 'Category Code', 'QUESTIONS'],
      dtype='object')

In [4]:
len(praise_df)

7625

In [5]:
len(praise_df["Reason for dishing"].unique())

2787

## Cleaning the Data 

We drop every row where 'IH per Praise' is not a number, since such rows carry no quantitative information about Impact Hours. We also drop every row that received 0 IH.

In [14]:
praise_df = praise_df[~(praise_df["IH per Praise"].isna())]
praise_df = praise_df[praise_df["IH per Praise"] > 0]

In [15]:
len(praise_df)

6948

In [16]:
len(praise_df["Reason for dishing"].unique())

2533
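
The two filters above can also be written as a single step. The sketch below uses a made-up toy frame (`toy_df` and its values are assumptions, not the real data) and adds `.copy()`, which in some pandas versions avoids a `SettingWithCopyWarning` when a column is assigned to the filtered frame later on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for praise_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Reason for dishing": ["a", "b", "c", "d"],
    "IH per Praise": [1.5, np.nan, 0.0, 0.25],
})

# NaN > 0 evaluates to False, so one comparison drops both the missing
# and the zero rows. .copy() makes the result an independent frame.
filtered_df = toy_df[toy_df["IH per Praise"] > 0].copy()
print(len(filtered_df))
```

This assumes the column is already numeric, as it appears to be in the notebook.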

## We convert the period in which each praise was received to an integer.

In [28]:
def clean_period(period_string):
    new_str = period_string.replace("#","")
    new_str = new_str.split()
    final = int(new_str[0])
    return final

In [29]:
praise_df["period"] = praise_df["period"].apply(clean_period)
praise_df["period"].unique()

array([17, 16, 15, 14, 13, 12, 11, 10,  9,  8,  7], dtype=int64)
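
As a possible alternative to `clean_period`, the same conversion can be sketched with pandas string methods. The labels below are hypothetical, since the raw format of the "period" column is not shown here; the only assumption is that each label contains the period number as its first run of digits.

```python
import pandas as pd

# Hypothetical period labels; the real strings in the "period" column may differ.
periods = pd.Series(["#17 May 2021", "#9", "16 (partial)"])

# Capture the first run of digits in each label and convert it to an integer.
period_numbers = periods.str.extract(r"(\d+)", expand=False).astype(int)
print(period_numbers.tolist())
```

A label without any digits would become NaN and make `astype(int)` raise, which is arguably a useful signal that the column holds an unexpected value.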

## Notice that we only have periods 7-17 in our data. 

This is actually good, as it covers:
- the period with the contemporary Praise process
- the period when the TEC Discord Server existed

## Data Cleaning

An issue has been raised repeatedly, and we now want to address it. At times, an individual's exceptional contributions were recognized by adjusting that individual's Impact Hours (indirectly, through a quantifier adjusting their own score). 

We are working from the following assumption:
> **Assumption:** if two Praises had *both*
> - the same "Reason for dishing"
> - the same time period
>
> then they should have the same value.
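
One way to see how often this assumption is broken is to count the distinct Impact Hours values inside each (period, reason) group. This is a minimal sketch on made-up data; `toy_df`, `distinct_values`, and `violations` are illustrative names, and only the column names come from the notebook.

```python
import pandas as pd

# Toy data with made-up values; column names follow the notebook.
toy_df = pd.DataFrame({
    "period": [17, 17, 17, 16, 16],
    "Reason for dishing": ["x", "x", "x", "x", "y"],
    "IH per Praise": [2.0, 2.0, 3.5, 1.0, 0.5],
})

# Count distinct IH values within each (period, reason) group.
distinct_values = toy_df.groupby(["period", "Reason for dishing"])["IH per Praise"].nunique()

# Groups with more than one distinct value break the assumption.
violations = distinct_values[distinct_values > 1]
print(violations)
```

In the toy data only the group (17, "x") holds more than one value, so it is the only group a fix would change.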

We will make the following fix:
> **Procedure:** If a Praise is part of a group that came from the same time period **and** the exact same "Reason for dishing", we will assign it the mode of that group's Impact Hours. 

Now we add a new column called "IH_per_praise_clean" that adjusts for this. 

In [38]:
def find_mode_by_period_and_reason(df = praise_df, period = 0, reason = ""):
    # Select every praise that shares this period and this exact reason.
    new_df = df[(df["period"] == period) & (df["Reason for dishing"] == reason)]
    # Series.mode() returns all modes in sorted order, so on a tie
    # we keep the smallest value.
    mode_of_new_df_IH = new_df["IH per Praise"].mode().to_list()[0]
    return mode_of_new_df_IH

def mode_by_period_and_reason(x):
    # x is a 0-based row position in praise_df, not an index label.
    obs = praise_df.iloc[x]
    my_reason = obs["Reason for dishing"]
    my_period = obs["period"]
    my_mode = find_mode_by_period_and_reason(df = praise_df, period = my_period, reason = my_reason)
    return my_mode

mode_by_period_and_reason(0)


2.355251551

In [39]:
cleaned_IH = [mode_by_period_and_reason(x) for x in range(len(praise_df))]

In [40]:
praise_df["IH_per_praise_clean"] = cleaned_IH

In [42]:
praise_df["IH_per_praise_clean"]

0       2.355252
1       1.995539
2       1.610872
3       1.251160
4       1.251160
          ...   
7135    0.267460
7136    0.267460
7137    0.267460
7138    0.048312
7139    0.048312
Name: IH_per_praise_clean, Length: 6948, dtype: float64
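
The same per-group mode can be sketched without a row-by-row loop by using `groupby` with `transform`. This is an alternative under assumptions, not the code the notebook ran: the toy data and the helper name `first_mode` are made up. Note that `Series.mode()` returns every mode in sorted order, so when a group has a tie (including a group where all values differ) both this sketch and `find_mode_by_period_and_reason` keep the smallest value.

```python
import pandas as pd

# Toy data with made-up values; column names follow the notebook.
toy_df = pd.DataFrame({
    "period": [17, 17, 17, 16, 16],
    "Reason for dishing": ["x", "x", "x", "x", "y"],
    "IH per Praise": [2.0, 2.0, 3.5, 1.0, 0.5],
})

def first_mode(values):
    # Series.mode() returns all modes in sorted order; take the smallest one.
    return values.mode().iloc[0]

# transform broadcasts each group's mode back onto every row of that group.
toy_df["IH_per_praise_clean"] = (
    toy_df.groupby(["period", "Reason for dishing"])["IH per Praise"]
    .transform(first_mode)
)
print(toy_df["IH_per_praise_clean"].tolist())
```

On the full data this does one pass per group instead of filtering the whole frame once per row, which should matter as the number of praises grows.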

In [45]:
len(praise_df["IH_per_praise_clean"].unique())

1000

In [46]:
praise_df.to_csv("data/praise-df-hand-labeled-with-clean-IH.csv")
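
The "Unnamed: 0" columns seen when the data was loaded are what pandas produces when a CSV that was saved with its index is read back. If that is not wanted in the output file, passing `index=False` is one option. The round trip below is a sketch on a made-up frame that uses an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Toy frame with made-up values.
toy_df = pd.DataFrame({"To": ["a", "b"], "IH_per_praise_clean": [1.0, 2.0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the named columns.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

Keeping the index can still be the right choice here, since the filtered frame's index labels point back to rows of the original file.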