# Aggregation

Now that several of our dataframes are properly sorted, our next goal is to generate some useful information from the scores. Specifically, we want to load our Answers-Sorted.feather and Comments-Sorted.feather files and aggregate the scores on the 'ParentId' and 'PostId' columns respectively. The key idea is to determine the total score of the Answers and of the Comments for each Question (Parent). We will then save these new dataframes for the next exercise and write out the top 10 rows of each to check that we got the right answer. Hint: after the aggregation we need to re-sort the two dataframes on the aggregated 'Score' column, using the sort function from the previous exercise.

### Read a feather file and return a dataframe. This is already done for you. You just have to call it from main to convert a feather file into a dataframe.

In [1]:
import pyarrow.feather as feather
import pandas as pd

In [2]:
def arrow_to_df(input_file_name):
    df = feather.read_feather(input_file_name)
    return df

### Write a dataframe to a feather file. This is done for you; you just need to call it.

In [3]:
def df_to_arrow(output_file_name, df):
    feather.write_feather(df, output_file_name, compression='zstd')
    return

### This function will write the top ten items from a dataframe to a file. You simply need to pass in a dataframe, and a file name to write, like 'out.txt'.

In [4]:
def write_top10_to_file(df,out_file):
    with open(out_file,'w') as f:
        f.write(df.head(10).to_string())
    return

### This is the same sort function from Q2. You need to copy your solution here.

In [5]:
def sort_df(df, key):
    new_df = df.sort_values(by=[key], ascending=False, ignore_index=True)
    return new_df
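
To see what the sort above does, here is a minimal sketch on a made-up three-row dataframe. The values are invented for illustration and are not taken from the dataset; the call is the same `sort_values` call used inside `sort_df`.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the real feather files).
toy = pd.DataFrame({'PostId': [10, 11, 12], 'Score': [3, 9, 5]})

# Same call as sort_df(toy, 'Score'): highest score first,
# and ignore_index=True renumbers the rows 0, 1, 2 after sorting.
sorted_toy = toy.sort_values(by=['Score'], ascending=False, ignore_index=True)

print(sorted_toy)
```

Without `ignore_index=True` the rows would keep their original index labels, which would make the top 10 output harder to read as a ranking.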

### The key idea of this function is to take a dataframe, group it by the given id column, and sum the Score column for each group, cleaning up the indexing in the process (as there are now fewer rows). Finally, you will want to sort the resulting dataframe using the sort function we created in the last exercise before returning it.

In [6]:
def aggregate_scores(df, parent_id_name):
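    # Group rows by the parent id; as_index=False keeps it as a regular column
    val = df.groupby(parent_id_name, as_index=False)
    # Sum the Score column within each group (one row per parent id)
    new_val = val.aggregate({'Score': 'sum'})
    # Re-sort on the summed Score, highest first
    agg_val = sort_df(new_val, 'Score')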
    
    return agg_val
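
As a hedged illustration of the aggregation step, the sketch below runs the same groupby, sum, and sort sequence on a small invented dataframe. The name `toy_answers` and its values are assumptions made for this example only.

```python
import pandas as pd

# Toy stand-in for the answers dataframe; values are invented for illustration.
toy_answers = pd.DataFrame({
    'ParentId': [1, 2, 1, 3, 2],
    'Score':    [4, 1, 6, 2, 7],
})

# Same steps as aggregate_scores(toy_answers, 'ParentId'):
# one row per ParentId with the summed Score, then sorted highest first.
summed = toy_answers.groupby('ParentId', as_index=False).aggregate({'Score': 'sum'})
summed = summed.sort_values(by=['Score'], ascending=False, ignore_index=True)

print(summed)

# A cheap sanity check: aggregation should not change the overall total.
print(summed['Score'].sum() == toy_answers['Score'].sum())
```

The five input rows collapse to three, one per question, and the overall total score is unchanged, which is a quick check that no rows were lost in the grouping.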

### Main Loop:
* First read the Answers-Sorted.feather and Comments-Sorted.feather files into dataframes using the provided function.
* Now get the aggregate_scores function working so that it generates a new dataframe of answers or comments whose scores are summed per 'ParentId' or 'PostId' key respectively, and returns the newly sorted dataframe.
* Now write out the two new dataframes to Answers-Sum.feather and Comments-Sum.feather for later.
* Finally, you need to call write_top10_to_file() for the new dataframes. The output file names should be "Answers-Sum-10.txt" and "Comments-Sum-10.txt". This will be our sanity check that you got the aggregate function correct. Make sure you use the output file names exactly as shown (case sensitive).

In [7]:

def main():
    # Read Answers-Sorted and Comments-Sorted into dataframes
    answer_sort_df = arrow_to_df('Answers-Sorted.feather')
    comments_sort_df = arrow_to_df('Comments-Sorted.feather')
    
    # Aggregate the scores into new dataframes keyed on 'ParentId' or 'PostId'
    agg_ans = aggregate_scores(answer_sort_df,'ParentId')
    agg_comm = aggregate_scores(comments_sort_df, 'PostId')
    
    # Write the aggregated dataframes to feather files
    df_to_arrow('Answers-Sum.feather',agg_ans)
    df_to_arrow('Comments-Sum.feather',agg_comm)

    #Output files
    write_top10_to_file(agg_ans,'Answers-Sum-10.txt')
    write_top10_to_file(agg_comm,'Comments-Sum-10.txt')

    return

In [8]:

if __name__ == '__main__':
    main()
    
    print('[INFO]: Script completed with no errors')

[INFO]: Script completed with no errors
