### Heroes Of Pymoli Data Analysis
* Of the 576 active players, the vast majority are male (84%). There is also a smaller but notable proportion of female players (14%).

* Our peak age demographic is 20-24 (44.79%), with secondary groups at 15-19 (18.58%) and 25-29 (13.37%).
-----

### Notes
* This notebook shows example code that differs from the Trilogy-provided solutions. These examples may or may not improve on those solutions; they are meant to demonstrate different techniques that can be used.
* I'll write DataFrame instead of data frame when talking about DataFrames, because the pandas class is named DataFrame. Run `?pd.DataFrame` in a notebook cell for more information.

In [1]:
# Dependencies and Setup
import pandas as pd
import numpy as np

# Read the purchase data CSV; pd.read_csv returns a DataFrame for us.
purchase_data = pd.read_csv("purchase_data.csv")

In [2]:
type(purchase_data) # Notice how it's of type pandas DataFrame

pandas.core.frame.DataFrame

<div class="alert alert-success">
<h2><b>Part 1</b>: Player Count</h2>

<ul>
  <li>Display the total number of players</li>
</ul>
</div>

This question doesn't ask me to format or save the results.
As such, I will only create an output.
Strictly speaking, placing this information into a DataFrame is also unnecessary.
This answer takes what is presented in the Trilogy solution notebook and 'simplifies' it.
I do this with method chaining, which is a 'preferred' style in pandas.

In [3]:
# Using chaining methodology:
pd.DataFrame({"Total Players": 
              [purchase_data.loc[:, ["Gender", "SN", "Age"]]
               .drop_duplicates()
               # .count() returns one value per column; take the first by position
               .count().iloc[0]]})

Unnamed: 0,Total Players
0,576


This next cell is an alternative solution.
I save the result in a variable because I'll use it again,
which helps reduce code redundancy.

For this reason I think this is the best solution.

In [20]:
total_players = purchase_data.loc[:, "SN"].drop_duplicates().count()

total_players

576
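
As a side note, the same count can also be had from `Series.nunique()`, which counts distinct values directly. The sketch below uses a small made-up DataFrame (the screen names and prices are hypothetical, not rows from purchase_data.csv) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for purchase_data: "Ana1" bought twice
toy_purchases = pd.DataFrame({
    "SN": ["Ana1", "Bo22", "Ana1", "Cy333"],
    "Price": [1.50, 2.25, 3.00, 4.10],
})

# Two ways to count distinct players
count_via_drop = toy_purchases.loc[:, "SN"].drop_duplicates().count()
count_via_nunique = toy_purchases["SN"].nunique()

print(count_via_drop, count_via_nunique)
```

Both approaches ignore missing screen names, so I'd expect them to agree on real data as well.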

<div class="alert alert-success">
<h2><b>Part 2</b>: Purchasing Analysis (Total)</h2>

<ul>
  <li>Run basic calculations to obtain number of unique items, average price, etc.</li>
  <li>Create a summary data frame to hold the results</li>
  <li>Optional: give the displayed data cleaner formatting</li>
  <li>Display the summary data frame</li>
</ul>
</div>

In [5]:
# Create a summary DataFrame using: unique item count, purchase sum, purchase count, and mean
# Each value is wrapped in a list so every column is built the same way (one row)
summary = (pd.DataFrame({"Number of Unique Items": [purchase_data["Item ID"].nunique()],
                        "Total Revenue in USD": [purchase_data["Price"].sum()],
                        "Number of Purchases": [purchase_data["Price"].count()],
                        "Average Price in USD": [purchase_data["Price"].mean()]})
           .round(2)
          )

summary

Unnamed: 0,Number of Unique Items,Total Revenue in USD,Number of Purchases,Average Price in USD
0,183,2379.77,780,3.05


I run the calculations on the original DataFrame,
then build the columns from these calculations using a dictionary
in the form of key : value (column : data).
The question asks for a DataFrame, so I use pd.DataFrame.

For the formatting, I round the numbers to 2 decimals.
I don't include $ but instead name my columns descriptively.
Since we're dealing with numbers, I keep them as data type int and/or float.
Two decimals make sense when dealing with money.
The column names also tell us which columns hold money and in what units.
If you ever took college science classes: LABEL YOUR UNITS!

<div class="alert alert-success">
<h2><b>Part 3</b>: Gender Demographics</h2>

<ul>
  <li>Percentage and Count of Male Players</li>
  <li>Percentage and Count of Female Players</li>
  <li>Percentage and Count of Other / Non-Disclosed</li>
  <li>Display the demographics</li>
</ul>
</div>

_To assist in following along with this next code segment:_
* I'm creating a DataFrame so the results display as a table
* I build the DataFrame using key:value format (a dictionary)
* The keys will be used as column names, the value(s) will be my data
* I slice my original data on SN and Gender, then drop any duplicates
 * `drop_duplicates('SN')` keeps the first row for each SN; with no argument, it drops only rows that match in every selected column
* I then slice further on 'Gender' only and take the value counts
* For the second column I divide by total players and multiply by 100 to get the percent
* I save this DataFrame because I'll reuse it shortly after...
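
To make the duplicate-dropping step concrete, here is a minimal sketch on made-up rows (the screen names are hypothetical, and the conflicting Gender for "Bo22" is contrived purely to show the difference). It compares `drop_duplicates()` with and without a column subset.

```python
import pandas as pd

# Hypothetical rows: "Ana1" appears twice with the same Gender,
# "Bo22" appears twice with conflicting Gender values
toy = pd.DataFrame({
    "SN": ["Ana1", "Ana1", "Bo22", "Bo22"],
    "Gender": ["Female", "Female", "Male", "Other / Non-Disclosed"],
})

# No subset: a row is dropped only if SN AND Gender both repeat
all_columns = toy.drop_duplicates()

# Subset on "SN": only the first row for each SN is kept
by_sn = toy.drop_duplicates("SN")

print(len(all_columns), len(by_sn))
```

If every player reports a single Gender, the two calls keep the same rows, which is why either form can work for this question.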

In [6]:
# Calculate the Number of players and their Percentage by Gender
gender_demographics = (pd.DataFrame({"Total Count": 
                                    purchase_data.loc[:, ['SN', "Gender"]].drop_duplicates('SN')
                                    ['Gender'].value_counts(),
              
                                    "Percentage of Players": 
                                    (purchase_data.loc[:, ['SN', "Gender"]].drop_duplicates('SN')
                                     ['Gender'].value_counts()
                                     / total_players * 100)})
                       .round(2)
                      )
gender_demographics

Unnamed: 0,Total Count,Percentage of Players
Male,484,84.03
Female,81,14.06
Other / Non-Disclosed,11,1.91


The above code cell is written as a single statement. Doing it this way can lead to extraneous code,
which reduces legibility and increases redundancy.

Notice how I reused the variable from earlier: total_players. This is why we created a variable for it, to use it again.

The result of the cell below will be the same as above.
This is an example of a single statement versus several smaller ones.
I think the cell below is more legible and reduces redundancy,
__see what you think and let me know!__

In [7]:
gender_counts = (purchase_data.loc[:, ['SN', "Gender"]]
                 .drop_duplicates('SN')['Gender']
                 .value_counts()
                )

gender_demographics = (pd.DataFrame({"Total Count": gender_counts,
                                    "Percentage of Players": gender_counts / total_players * 100})
                       .round(2)
                      )

<div class="alert alert-success">
<h2><b>Part 4</b>: Purchasing Analysis (Gender)</h2>

<ul>
  <li>Run basic calculations to obtain purchase count, avg. purchase price, avg. purchase total per person etc. by gender</li>
  <li>Create a summary data frame to hold the results</li>
  <li>Optional: give the displayed data cleaner formatting</li>
    <li>Display the summary data frame</li>
</ul>
</div>

In [8]:
# Rinse and repeat from prior exercises
# Select "Price" before aggregating so only the numeric column is aggregated
price_by_gender = purchase_data.groupby("Gender")["Price"]

gender_data = pd.DataFrame({"Purchase Count": price_by_gender.count(), 
                            "Average Purchase Price (USD)": price_by_gender.mean(), 
                            "Total Purchase Value (USD)": price_by_gender.sum(), 
                            "Normalized Totals (USD)": 
                            price_by_gender.sum() / gender_demographics['Total Count']}).round(2)
gender_data

Unnamed: 0_level_0,Purchase Count,Average Purchase Price (USD),Total Purchase Value (USD),Normalized Totals (USD)
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,113,3.2,361.94,4.47
Male,652,3.02,1967.64,4.07
Other / Non-Disclosed,15,3.35,50.19,4.56


I won't go into what I'm doing with this code; it's a slightly more elaborate form of what I've already done.
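
One piece that may still be worth a sketch is the 'Normalized Totals' column, which divides each gender's revenue by its number of distinct players. A possible alternative, shown here on made-up rows (names and prices are hypothetical), is to count distinct SN values inside the same groupby instead of reaching into gender_demographics.

```python
import pandas as pd

# Hypothetical purchases: one female player with two purchases,
# two male players with one purchase each
toy = pd.DataFrame({
    "SN": ["Ana1", "Ana1", "Bo22", "Cy333"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Price": [2.00, 4.00, 3.00, 5.00],
})

by_gender = toy.groupby("Gender")

# Total revenue per gender divided by distinct players per gender
per_person = by_gender["Price"].sum() / by_gender["SN"].nunique()

print(per_person.round(2))
```

This keeps the calculation self-contained, at the cost of recomputing the player counts.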

### _Stands on soap box:_

I dislike changing format from an integer or float into a string or object. The data that I'm working with is numbers and I think at all times it should remain numbers. An exception to this is if we're simply outputting information visually. Then I think it's fine to change the type to display it well. Because I am creating this as a new DataFrame I WILL NOT change the data type so in any future use of this DataFrame I know what to expect from the values (numbers).
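
Here is a small sketch of that exception: formatting only the rendered text while the stored values stay numeric. The two values are copied from the Part 2 summary output, and `to_string` with `formatters` is just one of several ways to do this.

```python
import pandas as pd

# Values taken from the Part 2 summary, kept as floats
summary_demo = pd.DataFrame({"Total Revenue in USD": [2379.77],
                             "Average Price in USD": [3.05]})

# Format only the rendered text; the DataFrame itself is untouched
rendered = summary_demo.to_string(
    formatters={"Total Revenue in USD": "${:,.2f}".format,
                "Average Price in USD": "${:,.2f}".format}
)

print(rendered)
print(summary_demo.dtypes)
```

The printed table carries the dollar signs, while the dtypes stay float64 for any later math.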

## Now the fun begins... but first:
__In this next section I use np.inf__:
* __What is it__: it's 'infinity' that's built into numpy

* __Why do I use it__: As a standalone Python object it's physically smaller in memory, albeit barely, than an integer (see the getsizeof demo below)

 * __Also__, this way I catch ALL ages of 40 and above (up to and including infinity).
 
When dealing with age, as we are in this question, this isn't really a problem. I am over-engineering this solution by doing this.

In [9]:
# getsizeof reports the size of a Python object in bytes
from sys import getsizeof

# Demo:
print(getsizeof(999))
print(getsizeof(np.inf))

28
24
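
One caveat on the memory point, sketched below: getsizeof measures standalone Python objects. If the bin edges end up in a NumPy array (an assumption about how they get stored, not something this notebook checks), the presence of np.inf makes the whole array float64, so every edge takes 8 bytes either way.

```python
import numpy as np

# Same bin edges as used in Part 5
bins_with_inf = np.array([0, 9, 14, 19, 24, 29, 34, 39, np.inf])

# A single float in the list promotes every element to float64
print(bins_with_inf.dtype)
print(bins_with_inf.itemsize)
```

So the real benefit of np.inf here is the open-ended last bin, not the bytes saved.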


<div class="alert alert-success">
<h2><b>Part 5</b>: Age Demographics</h2>
<ul>
    <li>Establish bins for ages</li>
    <li>Categorize the existing players using the age bins. Hint: use pd.cut()</li>
    <li>Calculate the numbers and percentages by age group</li>
    <li>Create a summary data frame to hold the results</li>
    <li>Optional: round the percentage column to two decimal points</li>
    <li>Display Age Demographics Table</li>
</ul>
</div>

### Notes about using pd.cut(): 


There is a 'right' parameter that controls whether the right edge of each bin is included. The default setting for this parameter is True. What this means is that our bins look like:

<p style="text-align: center;"> `(0, 9], (9, 14],...(39, np.inf]`</p>

* For those less familiar with interval notation in mathematics: a square bracket means the endpoint is included and a parenthesis means it is excluded. The notation `[ a , c )` indicates an interval from a to c that is inclusive of `a` but exclusive of `c`. That is, `[ 5 , 12 )` would be the set of all real numbers between 5 and 12, including 5 but not 12. The numbers may come as close as they like to 12, including 11.999 and so forth (with any finite number of 9s), but 12.0 is not included. Likewise, `( 5 , 12 ]` excludes 5 but includes 12.

The important thing to note with this interval style is that if we had created bins like: 

<p style="text-align: center;"> `(0, 10], (10, 15], ..., (40, np.inf]` </p>

Then we actually have a first group holding all values up to and including 10, and a last group holding all values greater than 40, not including 40. Looking at the labels that we have, we actually want all ages lower than 10 (not including 10), and all ages 40 or older. Some students changed the bins and this caused their answers to diverge from the solutions. This behavior of pd.cut is what caused that divergence.
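
To see the edge behavior in isolation, here is a minimal sketch with four hand-picked ages that sit right on the bin edges. The edges and labels are simplified for the demo and are not the ones used in the cells that follow.

```python
import numpy as np
import pandas as pd

ages = pd.Series([9, 10, 39, 40])
edges = [0, 10, 40, np.inf]
labels = ["<10", "10-39", "40+"]

# right=True (default): bins are (0, 10], (10, 40], (40, inf]
right_closed = pd.cut(ages, bins=edges, labels=labels).astype(str).tolist()

# right=False: bins are [0, 10), [10, 40), [40, inf)
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False).astype(str).tolist()

print(right_closed)  # ['<10', '<10', '10-39', '10-39']
print(left_closed)   # ['<10', '10-39', '10-39', '40+']
```

With the default setting, ages 10 and 40 land one group too low for these labels, which is the divergence described above.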

In [10]:
# Notice the bin intervals for pd.cut(right=True)
age_bins = [0, 9, 14, 19, 24, 29, 34, 39, np.inf]
group_names = ["<10", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40+"]

cut_GroupBy = purchase_data.drop_duplicates('SN').groupby(pd.cut(purchase_data.drop_duplicates('SN').Age, 
                                                                 bins = age_bins, 
                                                                 labels = group_names))

In [11]:
age_demo = pd.DataFrame(data={'player_totals' : cut_GroupBy.SN.count(),
                             'percent_players' : ((cut_GroupBy.SN.count() / total_players) * 100).round(2)})
age_demo

Unnamed: 0_level_0,player_totals,percent_players
Age,Unnamed: 1_level_1,Unnamed: 2_level_1
<10,17,2.95
10-14,22,3.82
15-19,107,18.58
20-24,258,44.79
25-29,77,13.37
30-34,52,9.03
35-39,31,5.38
40+,12,2.08


### An example of pd.cut(right=False)

In [12]:
# Notice the bin intervals for pd.cut(right=False)
age_bins = [0, 10, 15, 20, 25, 30, 35, 40, np.inf]

cut_GroupBy = purchase_data.drop_duplicates('SN').groupby(pd.cut(purchase_data.drop_duplicates('SN').Age, 
                                                                 right=False,
                                                                 bins = age_bins, 
                                                                 labels = group_names))

In [13]:
age_demo = pd.DataFrame(data={
    'player_totals' : cut_GroupBy.SN.count(),
    'percent_of_players' : ((cut_GroupBy.SN.count() / total_players) * 100).round(2)
})

age_demo

Unnamed: 0_level_0,player_totals,percent_of_players
Age,Unnamed: 1_level_1,Unnamed: 2_level_1
<10,17,2.95
10-14,22,3.82
15-19,107,18.58
20-24,258,44.79
25-29,77,13.37
30-34,52,9.03
35-39,31,5.38
40+,12,2.08


<div class="alert alert-success">
<h2><b>Part 6</b>: Purchasing Analysis (Age)</h2>
<ul>
    <li>Bin the purchase_data data frame by age</li>
    <li>Run basic calculations to obtain:
        purchase count, 
        avg. purchase price, 
        avg. purchase total per person etc. in the table below</li>
    <li>Create a summary data frame to hold the results</li>
    <li>Optional: give the displayed data cleaner formatting</li>
    <li>Display the summary data frame</li>
</ul>
</div>

In [14]:
age_GroupBy = (purchase_data
               .assign(age_range = lambda df: pd.cut(df["Age"], 
                                                       right = False,
                                                       bins = age_bins, 
                                                       labels=group_names))
               .groupby('age_range'))

In [15]:
# Select 'Price' before aggregating so only the numeric column is aggregated
price_age = pd.DataFrame(data={
    'purchase_count' : age_GroupBy['Price'].count(),
    'mean_purchase_USD' : age_GroupBy['Price'].mean(),
    'total_purchase_USD' : age_GroupBy['Price'].sum(),
    'mean_total_USD_person' : age_GroupBy['Price'].sum() / age_demo['player_totals']
}).round(2)

price_age

Unnamed: 0_level_0,purchase_count,mean_purchase_USD,total_purchase_USD,mean_total_USD_person
age_range,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
<10,23,3.35,77.13,4.54
10-14,28,2.96,82.78,3.76
15-19,136,3.04,412.89,3.86
20-24,365,3.05,1114.06,4.32
25-29,101,2.9,293.0,3.81
30-34,73,2.93,214.0,4.12
35-39,41,3.6,147.67,4.76
40+,13,2.94,38.24,3.19


<div class="alert alert-success">
<h2><b>Part 7</b>: Top Spenders</h2>
<ul>
    <li>Run basic calculations to obtain the results in the table below</li>
    <li>Create a summary data frame to hold the results</li>
    <li>Sort the total purchase value column in descending order</li>
    <li>Optional: give the displayed data cleaner formatting</li>
    <li>Display a preview of the summary data frame</li>
</ul>
</div>

## For this solution I'll walk through two ways
* The first will create a new DataFrame of summary info using pd.DataFrame
* The second will build the same summary in a single method chain and display it without storing it

Walking through the first answer:
* First I create a new DataFrame
* I then create three columns
 * Built by using purchase_data
 * Grouped by screen name (SN)
 * Then aggregate Price by
  * Sum
  * Mean
  * Count
* I round the data to two places
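* Lastly, the data is sorted by purchase_count, descending. Note that the prompt asks for a sort on total purchase value; pass 'total_purchase_USD' to sort_values to get that ordering instead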

In [16]:
# Select 'Price' before aggregating so only the numeric column is aggregated
top_spenders = (pd.DataFrame(data={
    'mean_purchase_USD' : purchase_data.groupby('SN')['Price'].mean(),
    'purchase_count' : purchase_data.groupby('SN')['Price'].count(),
    'total_purchase_USD' : purchase_data.groupby('SN')['Price'].sum()
})
                .round(2)
                .sort_values('purchase_count', ascending=False)
               )

top_spenders.head(5)

Unnamed: 0_level_0,mean_purchase_USD,purchase_count,total_purchase_USD
SN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Lisosia93,3.79,5,18.96
Iral74,3.4,4,13.62
Idastidru52,3.86,4,15.45
Asur53,2.48,3,7.44
Inguron55,3.7,3,11.11


Walking through this second answer:

I take purchase_data:
* groupby SN
* Then aggregate Price by: mean, count, and sum
* I rename the columns
* Sort the values by purchase_count, descending
* Round the values to two decimals
* Take the top 5 answers, head(5)

In [17]:
(purchase_data.groupby('SN')
              ['Price'].agg(['mean', 'count', 'sum'])
              .rename({'sum' : 'total_purchase_USD',
                     'mean' : 'mean_purchase_USD',
                     'count' : 'purchase_count'}, axis=1)
              .sort_values('purchase_count', ascending=False)
              .round(2)
              .head(5)
)

Unnamed: 0_level_0,mean_purchase_USD,purchase_count,total_purchase_USD
SN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Lisosia93,3.79,5,18.96
Iral74,3.4,4,13.62
Idastidru52,3.86,4,15.45
Asur53,2.48,3,7.44
Inguron55,3.7,3,11.11
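
As a possible third variation, the rename step can be folded into the aggregation with named aggregation, where each keyword passed to `agg` becomes a column name. This is a sketch on made-up purchases (screen names and prices are hypothetical), not a rerun of the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "SN": ["Ana1", "Ana1", "Bo22", "Cy333", "Cy333", "Cy333"],
    "Price": [2.00, 4.00, 3.00, 1.00, 2.00, 3.00],
})

# Named aggregation: new_column_name="aggregation function"
spenders = (toy.groupby("SN")["Price"]
               .agg(mean_purchase_USD="mean",
                    purchase_count="count",
                    total_purchase_USD="sum")
               .sort_values("purchase_count", ascending=False)
               .round(2))

print(spenders)
```

The column order follows the keyword order, so there's no separate rename to keep in sync.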


<div class="alert alert-success">
<h2><b>Part 8</b>: Most Popular Items</h2>
<ul>
    <li>Retrieve the Item ID, Item Name, and Item Price columns</li>
    <li>Group by Item ID and Item Name. Perform calculations to obtain purchase count, item price, and total purchase value</li>
    <li>Create a summary data frame to hold the results</li>
    <li>Sort the purchase count column in descending order</li>
    <li>Optional: give the displayed data cleaner formatting</li>
    <li>Display a preview of the summary data frame</li>
</ul>
</div>

In [18]:
top_items = (purchase_data.groupby(["Item ID", "Item Name"])
              ['Price'].agg(['mean', 'count', 'sum'])
              .rename({'sum' : 'total_purchase_USD',
                     'mean' : 'mean_purchase_USD',
                     'count' : 'purchase_count'}, axis=1)
              .sort_values('purchase_count', ascending=False)
              .round(2)
)

top_items.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean_purchase_USD,purchase_count,total_purchase_USD
Item ID,Item Name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
178,"Oathbreaker, Last Hope of the Breaking Storm",4.23,12,50.76
145,Fiery Glass Crusader,4.58,9,41.22
108,"Extraction, Quickblade Of Trembling Hands",3.53,9,31.77
82,Nirvana,4.9,9,44.1
19,"Pursuit, Cudgel of Necromancy",1.02,8,8.16


<div class="alert alert-success">
<h2><b>Part 9</b>: Most Profitable Items</h2>
<ul>
    <li>Sort the above table by total purchase value in descending order</li>
    <li>Optional: give the displayed data cleaner formatting</li>
    <li>Display a preview of the data frame</li>
</ul>
</div>

In [19]:
(top_items.sort_values('total_purchase_USD', 
                       ascending=False)
          .head(5)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean_purchase_USD,purchase_count,total_purchase_USD
Item ID,Item Name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
178,"Oathbreaker, Last Hope of the Breaking Storm",4.23,12,50.76
82,Nirvana,4.9,9,44.1
145,Fiery Glass Crusader,4.58,9,41.22
92,Final Critic,4.88,8,39.04
103,Singed Scalpel,4.35,8,34.8
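
For a top-N preview like this one, `DataFrame.nlargest` is another option worth knowing. The sketch below uses a tiny made-up table (the item labels are hypothetical) rather than top_items itself.

```python
import pandas as pd

toy_items = pd.DataFrame(
    {"purchase_count": [12, 9, 8],
     "total_purchase_USD": [50.76, 41.22, 44.10]},
    index=["item_a", "item_b", "item_c"],
)

# Comparable to sort_values("total_purchase_USD", ascending=False).head(2)
top_two = toy_items.nlargest(2, "total_purchase_USD")

print(top_two)
```

It reads a little closer to the question being asked, though the two approaches may order tied rows differently.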


## This solution guide isn't meant to discourage any students based on what they did or turned in.

## It's meant to demonstrate the power of pandas and the variety of ways to get things done in it.