# Grouping and aggregating
Okay, so now we're going to get some statistics and summary information from our data set.

- Aggregation: 'aggregate' or 'summary' functions. We take the data in a group and compute some kind of summary statistic from it. For example, the average cost of something, the maximum or minimum value in a group, or taking a certain value from each row, summing it up, and outputting the total.
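
As a rough sketch of what "aggregation" means in practice, here is a tiny made-up data frame (the `toy` frame, its `group` and `cost` columns, and all of the values are assumptions for illustration, not part of the survey data). It computes a few summary statistics on one column, first over the whole column and then per group:

```python
import pandas as pd

# Toy data, made up for illustration only (not from the survey).
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "cost": [10.0, 20.0, 5.0, 15.0, 40.0],
})

# Aggregations over the whole column: each call reduces many rows to one value.
mean_cost = toy["cost"].mean()
max_cost = toy["cost"].max()
min_cost = toy["cost"].min()
total_cost = toy["cost"].sum()

# The same summaries, but computed separately for each group.
per_group = toy.groupby("group")["cost"].agg(["mean", "max", "min", "sum"])
print(per_group)
```

Each aggregation collapses many rows into a single value; doing it per group gives one row of summaries for each group.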

In [None]:
'''
1. Get the median salary by calling the 'median' method on the 'ConvertedComp' series/column.
2. This goes through the data frame, looks at all the numerical series, and calculates a median value for each of them.
3. This outputs summary statistics for all the numerical series, giving us things such as count, mean, standard deviation, etc. Note that the mean
isn't as useful here, since it is heavily affected by outliers (a few strangely high salaries). Additionally, the 'count' value just counts the number
of rows that don't have 'NaN' or empty values (how many people answered the question).

4. 'value_counts()' is kind of like a groupby followed by a count. It takes all of the values from the series and puts them in boxes, 'Yes' or 'No' in this case, then counts the
number of values in each box. So 'Yes' has 71257, whilst 'No' has 17626.
5. Another example, this time on the 'SocialMedia' column, to see which social media platform each person uses. It puts all of the series values in boxes such as
'Twitter', 'Facebook', 'Instagram', etc., then counts the number of values in each box, which gives us the distribution of social media platform usage.
If you don't want the integer counts for each 'box', passing 'normalize=True' returns each box's share of the total as a fraction between 0 and 1 instead.
'''
import pandas as pd
csvPath = "../data/survey_results_public.csv"
df = pd.read_csv(csvPath)

# 1.
df['ConvertedComp'].median()

# 2.
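df.median(numeric_only=True)  # numeric_only=True skips text columns; newer pandas versions raise an error without it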

# 3.
df.describe()
df["ConvertedComp"].count()  # 55823: the number of non-NaN rows, i.e. people who answered the salary question

# 4.
df["Hobbyist"].value_counts()
# 5.
df["SocialMedia"].value_counts(normalize=True)

# 6. Separates the values into boxes based on 'Country', then tallies them up. Remember this is only done on a series, so the entire rows aren't boxed up.
df["Country"].value_counts()

# 7. This puts the rows into groups/boxes based on their 'Country' values, much like GROUP BY in SQL.
# The result is a DataFrameGroupBy object. Note the method name is all lowercase: 'groupby', not 'groupBy'.
countryGroups = df.groupby("Country")

'''
8. Get all rows (as a data frame) that have the country "United States". You could achieve the same thing with a filter, but
the advantage here is that the grouping has already been done for every country at once.
'''
countryGroups.get_group("United States")

'''
9. Starting small, let's find the most popular social media platforms
for the United States.
'''
countryFilter = df["Country"] == "United States"
df.loc[countryFilter, "SocialMedia"].value_counts()

'''
10. Now let's apply a function to all of those country groups. We want
to find out the most popular social media platform per country. For every country
group, we count the different values of 'SocialMedia' and tally
them up. Remember that the outer index level here is the 'Country' value, e.g.
'India', 'United States', etc., so '.loc["United States"]' pulls out the counts for that country.
'''
countryGroups["SocialMedia"].value_counts().loc["United States"]

'''
11. Here we get the median value of 'ConvertedComp' for each country group. We get
back a series indexed by country, where each value is that country's median.
As a result, we can look up the median 'ConvertedComp' for the country group of 'Germany'.
'''
countryGroups["ConvertedComp"].median().loc["Germany"]

'''
12. Let's move back to the string methods. Get all rows for India. Then
'.str.contains("Python")' gives back a series of booleans, one per person, saying whether they use Python.
Calling '.sum()' on it counts the 'True' values, i.e. how many people in India use Python.
'''
indiaFilter = df["Country"] == "India"
indiaDf = df.loc[indiaFilter]
indiaDf["LanguagesWorkedWith"].str.contains("Python").sum()  # Returns a single integer: the number of rows containing "Python"
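
The India-only count above could be extended to every country group at once. A grouped column is a 'SeriesGroupBy', which has no `.str` accessor, so one option is to pass a function to `apply` that runs the string method on each group's series. The sketch below uses a small made-up frame (the `toy` data and the `python_users`, `respondents`, and `python_share` names are assumptions for illustration, not part of the survey or the original notebook):

```python
import pandas as pd

# Toy stand-in for the survey data; values are made up for illustration.
toy = pd.DataFrame({
    "Country": ["India", "India", "India", "Germany", "Germany"],
    "LanguagesWorkedWith": ["Python;Java", "C++", None, "Python", "Go;Rust"],
})

# For each country group, count the rows whose language string mentions Python.
# na=False treats missing answers as "does not use Python".
python_users = toy.groupby("Country")["LanguagesWorkedWith"].apply(
    lambda langs: langs.str.contains("Python", na=False).sum()
)

# Number of rows per country, used to turn the counts into a share.
respondents = toy["Country"].value_counts()

# Both series are indexed by country, so the division lines up by label.
python_share = python_users / respondents
print(python_share)
```

On the real data the same pattern would use `df` in place of `toy`; whether missing answers should count towards the total is a choice to make explicitly.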

