# 08. Grouping and Aggregating - Analyzing and Exploring The Data

---

In [2]:
import pandas as pd

In the last notebook, we looked at how filtering works in Pandas. In this notebook, we go into Exploring and Analyzing Data in Pandas, which is one of the first steps in examining any dataset.

Since the techniques we cover today are most useful with large datasets, we start with a real-world example: the Stack Overflow survey dataset we've been examining in the previous notebooks.

First, we load our files, then we change the default options for the maximum number of visible rows and columns to 85 each:

In [3]:
df = pd.read_csv('survey_results_public.csv')
schema_df = pd.read_csv('survey_results_schema.csv')

In [4]:
pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)

Let's look at the head of our dataset again:

In [5]:
df.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",,,4.0,10,,,,,,,,,,,,,,,,,,,,,,,,,,,,,HTML/CSS;Java;JavaScript;Python,C;C++;C#;Go;HTML/CSS;Java;JavaScript;Python;SQL,SQLite,MySQL,MacOS;Windows,Android;Arduino;Windows,Django;Flask,Flask;jQuery,Node.js,Node.js,IntelliJ;Notepad++;PyCharm,Windows,I do not use containers,,,Yes,"Fortunately, someone else has that title",Yes,Twitter,Online,Username,2017,A few times per month or weekly,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,31-60 minutes,No,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are",Neutral,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,,"Developer, desktop or enterprise applications;...",,17,,,,,,,I am actively looking for a job,I've never had a job,,,Financial performance or funding status of the...,"Something else changed (education, award, medi...",,,,,,,,,,,,,,,,,C++;HTML/CSS;Python,C++;HTML/CSS;JavaScript;SQL,,MySQL,Windows,Windows,Django,Django,,,Atom;PyCharm,Windows,I do not use containers,,Useful across many domains and could change ma...,Yes,Yes,Yes,Instagram,Online,Username,2017,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,11-30 minutes,Yes,A few times per month or weekly,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,"Taught yourself a new language, framework, or ...",100 to 499 employees,"Designer;Developer, back-end;Developer, front-...",3.0,22,1,Slightly satisfied,Slightly satisfied,Not at all confident,Not sure,Not sure,"I’m not actively looking, but I am open to new...",1-2 years ago,Interview with people in peer roles,No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,THB,Thai baht,23000.0,Monthly,8820.0,40.0,There's no schedule or spec; I work on what se...,Distracting work environment;Inadequate access...,Less than once per month / Never,Home,Average,No,,"No, but I think we should",Not sure,I have little or no influence,HTML/CSS,Elixir;HTML/CSS,PostgreSQL,PostgreSQL,,,,Other(s):,,,Vim;Visual Studio Code,Linux-based,I do not use containers,,,Yes,Yes,Yes,Reddit,In real life (in person),Username,2011,A few times per week,Find answers to specific questions;Learn how t...,6-10 times per week,They were about the same,,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,100 to 499 employees,"Developer, full-stack",3.0,16,Less than 1 year,Very satisfied,Slightly satisfied,Very confident,No,Not sure,I am not interested in new job opportunities,Less than a year ago,"Write code by hand (e.g., on a whiteboard);Int...",No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,61000.0,Yearly,61000.0,80.0,There's no schedule or spec; I work on what se...,,Less than once per month / Never,Home,A little below average,No,,"No, but I think we should",Developers typically have the most influence o...,I have little or no influence,C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,,,.NET,.NET,Eclipse;Vim;Visual Studio;Visual Studio Code,Windows,I do not use containers,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Daily or almost daily,Find answers to specific questions;Pass the ti...,1-2 times per week,Stack Overflow was much faster,31-60 minutes,Yes,Less than once per month or monthly,Yes,"No, and I don't know what those are","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,"10,000 or more employees","Academic researcher;Developer, desktop or ente...",16.0,14,9,Very dissatisfied,Slightly dissatisfied,Somewhat confident,Yes,No,I am not interested in new job opportunities,Less than a year ago,"Write any code;Write code by hand (e.g., on a ...",No,"Industry that I'd be working in;Languages, fra...",I was preparing for a job search,UAH,Ukrainian hryvnia,,,,55.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Inadequ...,A few days each month,Office,A little above average,"Yes, because I see value in code review",,"Yes, it's part of our process",Not sure,I have little or no influence,C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA,HTML/CSS;Java;JavaScript;SQL;WebAssembly,Couchbase;MongoDB;MySQL;Oracle;PostgreSQL;SQLite,Couchbase;Firebase;MongoDB;MySQL;Oracle;Postgr...,Android;Linux;MacOS;Slack;Windows,Android;Docker;Kubernetes;Linux;Slack,Django;Express;Flask;jQuery;React.js;Spring,Flask;jQuery;React.js;Spring,Cordova;Node.js,Apache Spark;Hadoop;Node.js;React Native,IntelliJ;Notepad++;Vim,Linux-based,"Outside of work, for personal projects",Not at all,,Yes,Also Yes,Yes,Facebook,In real life (in person),Username,I don't remember,Multiple times per day,Find answers to specific questions,More than 10 times per week,Stack Overflow was much faster,,Yes,A few times per month or weekly,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...","Yes, definitely",Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy


To start off, let's look at some basic aggregations. Aggregate functions are functions which summarize many values into one value. The **mean**, **mode** and **median** mathematical functions are examples of aggregate functions.

One interesting piece of information that we can only extract from our dataset with an aggregate function is the typical salary of a developer who took this survey. By a "typical" salary we mean the **median**: the value that lies exactly in the middle of the values once they are sorted in ascending order, so that there are as many values below the median as there are above it. **We should remember that real-world data is rarely evenly distributed or symmetrical, so we should not expect the median to be equal to the mean, or even close to it.**

Before grabbing the median salary, let's first look at the salaries as they are presented in the dataset, by examining the first 20 values:

In [6]:
df['ConvertedComp'].head(20)

0          NaN
1          NaN
2       8820.0
3      61000.0
4          NaN
5     366420.0
6          NaN
7          NaN
8      95179.0
9      13293.0
10         NaN
11         NaN
12     90000.0
13     57060.0
14         NaN
15    455352.0
16     65277.0
17     21996.0
18     31140.0
19     41244.0
Name: ConvertedComp, dtype: float64

Getting the median is as simple as calling the **median** function:

In [7]:
df['ConvertedComp'].median()

57287.0

This is already a good start for us to understand the typical salary of a developer who took this survey.

However, it would be much more useful if we could list the medians by country, since that would give us a more accurate look at what the distribution of salaries in the real world looks like.

We are going to look at that in a bit, when we get to grouping data, but first, let's look at the simple aggregate functions we can use in Pandas some more:

We saw what happens when we call **median** on a single column (Series) within our DataFrame, but what happens when we call it on our entire DataFrame?

In [8]:
df.median(numeric_only=True)

Respondent       44442.0
CompTotal        62000.0
ConvertedComp    57287.0
WorkWeekHrs         40.0
CodeRevHrs           4.0
Age                 29.0
dtype: float64

As we can see, we get the median of all numeric columns, which gives us a very useful overview of some of the columns within the DataFrame. Note that not all numeric columns can produce a meaningful median, like the "Respondent" column here for example, which simply gives the number of the respondent as a way to identify them.

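**Note: Older pandas versions only emit a deprecation warning when numeric_only=True is omitted on a DataFrame with non-numeric columns. From pandas 2.0 onward, numeric_only defaults to False, so the same call typically raises a TypeError instead.**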
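To make the note above concrete, here is a minimal sketch on a small toy DataFrame (the values are made up for illustration and are not taken from the survey). It shows two ways of restricting an aggregation to numeric columns: the `numeric_only=True` flag, and selecting the numeric columns first with `select_dtypes`.

```python
import pandas as pd

# Toy data with one text column and two numeric columns (hypothetical values)
toy_df = pd.DataFrame({
    'Country': ['India', 'Germany', 'India', 'Germany'],
    'ConvertedComp': [10000.0, 60000.0, 14000.0, 70000.0],
    'Age': [25, 31, 29, 40],
})

# Option 1: tell the aggregate function to skip non-numeric columns
medians = toy_df.median(numeric_only=True)

# Option 2: select the numeric columns ourselves, then aggregate
medians_alt = toy_df.select_dtypes(include='number').median()

print(medians)
```

Both options skip the "Country" column, so they should behave the same way regardless of the pandas version in use.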

Another useful aggregate function which provides us with a broad statistical overview of our dataset is **describe**. Let's take a look at what it does:

In [9]:
df.describe()

Unnamed: 0,Respondent,CompTotal,ConvertedComp,WorkWeekHrs,CodeRevHrs,Age
count,88883.0,55945.0,55823.0,64503.0,49790.0,79210.0
mean,44442.0,551901400000.0,127110.7,42.127197,5.084308,30.336699
std,25658.456325,73319260000000.0,284152.3,37.28761,5.513931,9.17839
min,1.0,0.0,0.0,1.0,0.0,1.0
25%,22221.5,20000.0,25777.5,40.0,2.0,24.0
50%,44442.0,62000.0,57287.0,40.0,4.0,29.0
75%,66662.5,120000.0,100000.0,44.75,6.0,35.0
max,88883.0,1e+16,2000000.0,4850.0,99.0,99.0


As we can see, we receive a few key statistical metrics that summarize a lot of information about the dataset. We get the **count** (= number of non-NaN rows), **mean**, **std** (standard deviation), **min**, **max** and three quartiles (25%, 50% and 75%), the second of which, **"50%", is the median**.

One reason the median is often better to look at than the mean is that the mean is heavily affected by outliers and values at the extreme ends of the spectrum, while the median is far less sensitive to them. We can see this in the table above: the mean of "ConvertedComp" is about 127,111, more than twice its median of 57,287.
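The effect of outliers can be illustrated with a small sketch. The salary values below are invented for the example; we simply add one extreme value and compare how the mean and the median react.

```python
import pandas as pd

# Five evenly spread salaries (hypothetical values)
salaries = pd.Series([40000, 50000, 60000, 70000, 80000])

# The same salaries plus one extreme outlier
with_outlier = pd.concat([salaries, pd.Series([2000000])], ignore_index=True)

# Without the outlier, mean and median agree
print(salaries.mean(), salaries.median())          # 60000.0 60000.0

# With the outlier, the mean jumps to roughly 383333,
# while the median only moves to 65000.0
print(with_outlier.mean(), with_outlier.median())
```

A single extreme value is enough to pull the mean far away from what most respondents earn, which is why we prefer the median for salary data.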

It might be useful to note that the **count** method can be used on its own to count non-NaN values within any given column. For example, we can use this to find the number of people who answered the salary question within the survey:

In [10]:
df['ConvertedComp'].count()

55823

As we can imagine, this comes in handy in many situations, since it helps us analyze how much missing data we have in any given Series or DataFrame.
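As a rough sketch of how **count** relates to missing data, the toy Series below (made-up values) compares the number of answered entries with the number of NaN entries, using `isna()` to flag the missing ones.

```python
import numpy as np
import pandas as pd

# Toy salary column with two missing answers (hypothetical values)
comp = pd.Series([np.nan, 8820.0, 61000.0, np.nan, 366420.0], name='ConvertedComp')

answered = comp.count()            # number of non-NaN values
missing = comp.isna().sum()        # number of NaN values
share_missing = comp.isna().mean() # fraction of values that are NaN

print(answered, missing, share_missing)
```

The answered and missing counts always add up to the length of the Series, so either one tells us how complete a column is.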

Let's look at a similar method, which counts the specific frequencies of all given values within a given column. An example is the "Hobbyist" column within the survey here, which asks people if they code as a hobby. The question is a Yes/No question, and we might want to know how many people picked each answer. For that, we use the **value_counts** method, as follows:

In [11]:
df['Hobbyist'].value_counts()

Yes    71257
No     17626
Name: Hobbyist, dtype: int64

Another case where we could use this method is to find out which social media platforms people prefer the most. Here's the social media platform column:

In [12]:
df['SocialMedia']

0          Twitter
1        Instagram
2           Reddit
3           Reddit
4         Facebook
           ...    
88878      YouTube
88879          NaN
88880          NaN
88881          NaN
88882     WhatsApp
Name: SocialMedia, Length: 88883, dtype: object

And here it is after applying the **value_counts** method to it:

In [13]:
df['SocialMedia'].value_counts()

Reddit                      14374
YouTube                     13830
WhatsApp                    13347
Facebook                    13178
Twitter                     11398
Instagram                    6261
I don't use social media     5554
LinkedIn                     4501
WeChat 微信                     667
Snapchat                      628
VK ВКонта́кте                 603
Weibo 新浪微博                     56
Youku Tudou 优酷                 21
Hello                          19
Name: SocialMedia, dtype: int64

**Tip: we can see the share of each value (as a fraction between 0 and 1) instead of the raw counts by passing the normalize=True flag:**

In [14]:
df['SocialMedia'].value_counts(normalize=True)

Reddit                      0.170233
YouTube                     0.163791
WhatsApp                    0.158071
Facebook                    0.156069
Twitter                     0.134988
Instagram                   0.074150
I don't use social media    0.065777
LinkedIn                    0.053306
WeChat 微信                   0.007899
Snapchat                    0.007437
VK ВКонта́кте               0.007141
Weibo 新浪微博                  0.000663
Youku Tudou 优酷              0.000249
Hello                       0.000225
Name: SocialMedia, dtype: float64
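One detail worth keeping in mind is that **value_counts** ignores NaN values by default, so the shares above are relative to the people who answered the question, not to all respondents. The following sketch, on a made-up toy Series, shows the difference when we pass `dropna=False`, and how to turn the fractions into percentages.

```python
import numpy as np
import pandas as pd

# Toy answers with one missing value (hypothetical data)
social = pd.Series(['Reddit', 'YouTube', 'Reddit', np.nan, 'Reddit'], name='SocialMedia')

# Shares among the people who answered (NaN is excluded by default)
shares = social.value_counts(normalize=True)

# Shares among all rows, with NaN counted as its own category
shares_with_nan = social.value_counts(normalize=True, dropna=False)

# Fractions converted to percentages, rounded to one decimal place
percent = (shares * 100).round(1)

print(shares)
print(shares_with_nan)
print(percent)
```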

We know that some of these social media platforms are only used in specific regions of the world. This teases something that could be very useful to know: **What are the most popular social media platforms by country?**

To answer that question, we resort to **grouping**, a feature within Pandas which lets us:

1. Split our object into smaller parts.
2. Apply a certain function of our choice to each one of these parts.
3. Combine the results into a single object.
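The three steps above can be sketched by hand on a small toy DataFrame (invented values, not survey data) and then compared with what `groupby` gives us in a single line. The manual loop is only there to illustrate the idea; it is not how Pandas implements grouping internally.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    'Country': ['India', 'Germany', 'India', 'Germany', 'India'],
    'ConvertedComp': [10000.0, 60000.0, 14000.0, 70000.0, 12000.0],
})

# Manual version of split-apply-combine
manual = {}
for country in toy_df['Country'].unique():
    part = toy_df[toy_df['Country'] == country]        # 1. split
    manual[country] = part['ConvertedComp'].median()   # 2. apply
manual = pd.Series(manual)                             # 3. combine

# The same computation expressed with groupby
grouped = toy_df.groupby('Country')['ConvertedComp'].median()

print(manual)
print(grouped)
```

Both approaches give the same median per country, but the `groupby` version is shorter and scales to any number of groups.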

To give an explanatory example, we go back to the question: What are the most popular social media platforms by country?

Before we answer that question, we look once more at the number of participants by country:

In [17]:
df['Country'].value_counts()

United States        20949
India                 9061
Germany               5866
United Kingdom        5737
Canada                3395
                     ...  
Tonga                    1
Timor-Leste              1
North Korea              1
Brunei Darussalam        1
Chad                     1
Name: Country, Length: 179, dtype: int64

We can see the majority of participants come from only a handful of countries. To look at the trending social media platforms for all of these countries at once, we start by grouping our DataFrame rows by country, using:

In [18]:
df.groupby('Country')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000021AB970CB50>

We notice that we receive back an object of the "DataFrameGroupBy" type. To work with this object later on, it will be easier and more readable if we store it in a variable. We pass the column name as a plain string rather than a one-element list, since recent pandas versions then still accept a plain string as the group key:

In [19]:
country_group = df.groupby('Country')

**Now, since we grouped our DataFrame by country, we can look at the specific groups, which are accessible by country name in this case.**

Let's take a look at one of these, for example:

In [20]:
country_group.get_group('United States')

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,100 to 499 employees,"Developer, full-stack",3,16,Less than 1 year,Very satisfied,Slightly satisfied,Very confident,No,Not sure,I am not interested in new job opportunities,Less than a year ago,"Write code by hand (e.g., on a whiteboard);Int...",No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,61000.0,Yearly,61000.0,80.0,There's no schedule or spec; I work on what se...,,Less than once per month / Never,Home,A little below average,No,,"No, but I think we should",Developers typically have the most influence o...,I have little or no influence,C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,,,.NET,.NET,Eclipse;Vim;Visual Studio;Visual Studio Code,Windows,I do not use containers,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Daily or almost daily,Find answers to specific questions;Pass the ti...,1-2 times per week,Stack Overflow was much faster,31-60 minutes,Yes,Less than once per month or monthly,Yes,"No, and I don't know what those are","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
12,13,I am a developer by profession,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,10 to 19 employees,Data or business analyst;Database administrato...,17,11,8,Very satisfied,Very satisfied,,,,I am not interested in new job opportunities,3-4 years ago,Complete a take-home project;Interview with pe...,Yes,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,90000.0,Yearly,90000.0,40.0,There is a schedule and/or spec (made by me or...,"Meetings;Non-work commitments (parenting, scho...",All or almost all the time (I'm full-time remote),Home,A little above average,"Yes, because I see value in code review",5.0,"No, but I think we should",Developers and management have nearly equal in...,I have a great deal of influence,Bash/Shell/PowerShell;HTML/CSS;JavaScript;PHP;...,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Rust...,Couchbase;DynamoDB;Firebase;MySQL,Firebase;MySQL;Redis,Android;AWS;Docker;IBM Cloud or Watson;iOS;Lin...,Android;AWS;Docker;IBM Cloud or Watson;Linux;S...,Angular/Angular.js;ASP.NET;Express;jQuery;Vue.js,Express;Vue.js,Node.js;Xamarin,Node.js;TensorFlow,Vim;Visual Studio;Visual Studio Code;Xcode,Windows,Development;Testing;Production,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,Yes,Yes,Twitter,In real life (in person),Username,2011,Multiple times per day,Find answers to specific questions,More than 10 times per week,Stack Overflow was much faster,11-30 minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Somewhat more welcome now than last year,Tech articles written by other developers;Cour...,28.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
21,22,I am a developer by profession,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,Some college/university study without earning ...,,Taken an online course in programming or softw...,"10,000 or more employees","Data or business analyst;Designer;Developer, b...",35,12,18,Slightly satisfied,Very dissatisfied,Somewhat confident,No,No,"I’m not actively looking, but I am open to new...",More than 4 years ago,Interview with people in senior / management r...,No,Industry that I'd be working in;Financial perf...,I had a negative experience or interaction at ...,USD,United States dollar,103000.0,Yearly,103000.0,40.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Meeting...,"Less than half the time, but at least one day ...",Home,Average,No,,"No, but I think we should","The CTO, CIO, or other management purchase new...",I have little or no influence,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...,Elasticsearch;MySQL;Oracle;Redis,Elasticsearch;MySQL;Oracle;Redis,Docker;Linux;Raspberry Pi;Windows,Docker;Linux;Raspberry Pi;Windows,Angular/Angular.js;Ruby on Rails,Angular/Angular.js;Ruby on Rails,Node.js,Node.js,Sublime Text;Visual Studio;Visual Studio Code,Windows,"Outside of work, for personal projects",Not at all,,Yes,Yes,Yes,Instagram,Online,Username,I don't remember,Daily or almost daily,Find answers to specific questions,3-5 times per week,Stack Overflow was much faster,0-10 minutes,Yes,A few times per week,Yes,"No, and I don't know what those are","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,47.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
22,23,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Information systems, information technology, o...",Taken an online course in programming or softw...,"10,000 or more employees","Developer, full-stack",3,19,1,Slightly satisfied,Slightly satisfied,Very confident,No,Not sure,"I’m not actively looking, but I am open to new...",Less than a year ago,"Write any code;Write code by hand (e.g., on a ...",No,Opportunities for professional development;How...,I was preparing for a job search,USD,United States dollar,69000.0,Yearly,69000.0,40.0,There is a schedule and/or spec (made by me or...,Distracting work environment;Meetings;Non-work...,A few days each month,Office,Average,"Yes, because I see value in code review",8.0,"Yes, it's part of our process",Developers and management have nearly equal in...,I have little or no influence,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Pyth...,Bash/Shell/PowerShell;Go;HTML/CSS;Java;JavaScr...,Oracle;SQLite,Couchbase;DynamoDB;Elasticsearch;Firebase;Oracle,Docker;Google Cloud Platform,Docker;iOS;Slack,React.js;Ruby on Rails,Express;React.js;Ruby on Rails;Vue.js,,React Native;TensorFlow,Visual Studio Code,MacOS,Development;Testing;Production,,Useful for immutable record keeping outside of...,Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Multiple times per day,Find answers to specific questions;Learn how t...,6-10 times per week,They were about the same,,Yes,I have never participated in Q&A on Stack Over...,Yes,"No, I've heard of them, but I am not part of a...","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,22.0,Man,No,Straight / Heterosexual,Black or of African descent,No,Appropriate in length,Easy
25,26,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","Taught yourself a new language, framework, or ...","10,000 or more employees","Designer;Developer, back-end;Developer, deskto...",12,8,8,Very satisfied,Very satisfied,,,,"I’m not actively looking, but I am open to new...",Less than a year ago,Interview with people in peer roles;Interview ...,No,Remote work options;Diversity of the company o...,I was preparing for a job search,USD,United States dollar,114000.0,Yearly,114000.0,40.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Meeting...,"Less than half the time, but at least one day ...",Home,Far above average,"Yes, because I see value in code review",2.0,"Yes, it's not part of our process but the deve...",Developers typically have the most influence o...,I have a great deal of influence,Bash/Shell/PowerShell;C++;C#;HTML/CSS;JavaScri...,C#;HTML/CSS;JavaScript;Objective-C;Ruby;SQL;Sw...,Microsoft SQL Server;MySQL;Redis;SQLite,Microsoft SQL Server;MySQL;Redis;SQLite,AWS;Docker;Linux;MacOS;Microsoft Azure;Windows...,Android;Docker;iOS;Linux;MacOS;Microsoft Azure...,Angular/Angular.js;ASP.NET;Drupal;Express;jQue...,Angular/Angular.js;ASP.NET,.NET;.NET Core;Node.js;Xamarin,.NET;.NET Core;Node.js,Notepad++;Sublime Text;Vim;Visual Studio;Xcode,MacOS,Development;Testing,Not at all,A passing fad,Yes,SIGH,Yes,I don't use social media,In real life (in person),Username,2008,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,11-30 minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Just as welcome now as I felt last year,,34.0,Man,No,Gay or Lesbian,,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88818,78292,,No,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",United States,No,"Other doctoral degree (Ph.D, Ed.D., etc.)","A health science (ex. nursing, pharmacy, radio...",Completed an industry certification program (e...,"Just me - I am a freelancer, sole proprietor, ...",Academic researcher,42,14,31,,,,,,,,,,,,,,,,,,,,,,,,,,,,Bash/Shell/PowerShell;C;Python,Bash/Shell/PowerShell;C;Python,SQLite,SQLite,Linux;Raspberry Pi;Other(s):,Linux;Raspberry Pi;Other(s):,,,Chef,,Emacs;IPython / Jupyter,Linux-based,I do not use containers,,Useful for immutable record keeping outside of...,No,Yes,Yes,I don't use social media,In real life (in person),,2013,A few times per week,Find answers to specific questions,Less than once per week,The other resource was slightly faster,11-30 minutes,Not sure / can't remember,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are","No, not really",Somewhat less welcome now than last year,,60.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Too long,Neither easy nor difficult
88840,82717,,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,No,"Secondary school (e.g. American high school, G...",,,,,Less than 1 year,,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Android;Windows,Android;Microsoft Azure;Windows,,,,,,MacOS,Testing,,,No,SIGH,Yes,Facebook,In real life (in person),Username,2018,Less than once per month or monthly,Find answers to specific questions,Less than once per week,,60+ minutes,No,,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...",Not sure,,Industry news about technologies you're intere...,44.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Neither easy nor difficult
88844,83397,,Yes,Less than once per year,,"Not employed, but looking for work",United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,,,12,9,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,HTML/CSS;JavaScript;Python;SQL,C;C++;C#;Go;Java;JavaScript;Python;R;Ruby;SQL;...,,,Android;Arduino;Slack,Android;Arduino;Docker;iOS;Raspberry Pi;Slack,Flask,Django;Drupal;Flask;jQuery;React.js,,Chef;Torch/PyTorch,Eclipse;IPython / Jupyter;Sublime Text,MacOS,I do not use containers,,,,SIGH,Yes,,,Handle,I don't remember,A few times per week,Find answers to specific questions;Learn how t...,3-5 times per week,They were about the same,,Not sure / can't remember,,Yes,"No, and I don't know what those are","No, not at all",Just as welcome now as I felt last year,,27.0,Woman,No,Bisexual,White or of European descent,No,Appropriate in length,Easy
88859,85642,,No,Less than once per year,"OSS is, on average, of LOWER quality than prop...","Independent contractor, freelancer, or self-em...",United States,No,Associate degree,"Information systems, information technology, o...",Taken an online course in programming or softw...,"Just me - I am a freelancer, sole proprietor, ...",Designer;Marketing or sales professional,20,7,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,Go;HTML/CSS,,,,,,,,,,Visual Studio Code,Windows,I do not use containers,,Useful for immutable record keeping outside of...,No,SIGH,Yes,,In real life (in person),Handle,2008,Less than once per month or monthly,Find answers to specific questions,Less than once per week,Stack Overflow was slightly faster,60+ minutes,Yes,I have never participated in Q&A on Stack Over...,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","No, not at all",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,34.0,"Non-binary, genderqueer, or gender non-conforming",,Bisexual;Gay or Lesbian,White or of European descent,No,Appropriate in length,Easy


So we can see that grouping our DataFrame by country has broken it into multiple smaller DataFrames, or groups, each of which consists of all the rows from the dataset in which the participants said they were from a specific country.

This opens the door to applying any function of our choice to the object returned by **groupby**, which applies that function to every group in a single call.

To understand the unique utility this provides us, we can make a quick comparison to another feature within Pandas: Filtering.

Filtering lets us select rows based on almost any condition. We can use it, for example, to extract all rows which correspond to a given country:

In [30]:
filter_1 = df['Country'] == 'United States'
df.loc[filter_1]

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,100 to 499 employees,"Developer, full-stack",3,16,Less than 1 year,Very satisfied,Slightly satisfied,Very confident,No,Not sure,I am not interested in new job opportunities,Less than a year ago,"Write code by hand (e.g., on a whiteboard);Int...",No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,61000.0,Yearly,61000.0,80.0,There's no schedule or spec; I work on what se...,,Less than once per month / Never,Home,A little below average,No,,"No, but I think we should",Developers typically have the most influence o...,I have little or no influence,C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,,,.NET,.NET,Eclipse;Vim;Visual Studio;Visual Studio Code,Windows,I do not use containers,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Daily or almost daily,Find answers to specific questions;Pass the ti...,1-2 times per week,Stack Overflow was much faster,31-60 minutes,Yes,Less than once per month or monthly,Yes,"No, and I don't know what those are","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
12,13,I am a developer by profession,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,10 to 19 employees,Data or business analyst;Database administrato...,17,11,8,Very satisfied,Very satisfied,,,,I am not interested in new job opportunities,3-4 years ago,Complete a take-home project;Interview with pe...,Yes,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,90000.0,Yearly,90000.0,40.0,There is a schedule and/or spec (made by me or...,"Meetings;Non-work commitments (parenting, scho...",All or almost all the time (I'm full-time remote),Home,A little above average,"Yes, because I see value in code review",5.0,"No, but I think we should",Developers and management have nearly equal in...,I have a great deal of influence,Bash/Shell/PowerShell;HTML/CSS;JavaScript;PHP;...,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Rust...,Couchbase;DynamoDB;Firebase;MySQL,Firebase;MySQL;Redis,Android;AWS;Docker;IBM Cloud or Watson;iOS;Lin...,Android;AWS;Docker;IBM Cloud or Watson;Linux;S...,Angular/Angular.js;ASP.NET;Express;jQuery;Vue.js,Express;Vue.js,Node.js;Xamarin,Node.js;TensorFlow,Vim;Visual Studio;Visual Studio Code;Xcode,Windows,Development;Testing;Production,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,Yes,Yes,Twitter,In real life (in person),Username,2011,Multiple times per day,Find answers to specific questions,More than 10 times per week,Stack Overflow was much faster,11-30 minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Somewhat more welcome now than last year,Tech articles written by other developers;Cour...,28.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
21,22,I am a developer by profession,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,Some college/university study without earning ...,,Taken an online course in programming or softw...,"10,000 or more employees","Data or business analyst;Designer;Developer, b...",35,12,18,Slightly satisfied,Very dissatisfied,Somewhat confident,No,No,"I’m not actively looking, but I am open to new...",More than 4 years ago,Interview with people in senior / management r...,No,Industry that I'd be working in;Financial perf...,I had a negative experience or interaction at ...,USD,United States dollar,103000.0,Yearly,103000.0,40.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Meeting...,"Less than half the time, but at least one day ...",Home,Average,No,,"No, but I think we should","The CTO, CIO, or other management purchase new...",I have little or no influence,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...,Elasticsearch;MySQL;Oracle;Redis,Elasticsearch;MySQL;Oracle;Redis,Docker;Linux;Raspberry Pi;Windows,Docker;Linux;Raspberry Pi;Windows,Angular/Angular.js;Ruby on Rails,Angular/Angular.js;Ruby on Rails,Node.js,Node.js,Sublime Text;Visual Studio;Visual Studio Code,Windows,"Outside of work, for personal projects",Not at all,,Yes,Yes,Yes,Instagram,Online,Username,I don't remember,Daily or almost daily,Find answers to specific questions,3-5 times per week,Stack Overflow was much faster,0-10 minutes,Yes,A few times per week,Yes,"No, and I don't know what those are","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,47.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
22,23,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Information systems, information technology, o...",Taken an online course in programming or softw...,"10,000 or more employees","Developer, full-stack",3,19,1,Slightly satisfied,Slightly satisfied,Very confident,No,Not sure,"I’m not actively looking, but I am open to new...",Less than a year ago,"Write any code;Write code by hand (e.g., on a ...",No,Opportunities for professional development;How...,I was preparing for a job search,USD,United States dollar,69000.0,Yearly,69000.0,40.0,There is a schedule and/or spec (made by me or...,Distracting work environment;Meetings;Non-work...,A few days each month,Office,Average,"Yes, because I see value in code review",8.0,"Yes, it's part of our process",Developers and management have nearly equal in...,I have little or no influence,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Pyth...,Bash/Shell/PowerShell;Go;HTML/CSS;Java;JavaScr...,Oracle;SQLite,Couchbase;DynamoDB;Elasticsearch;Firebase;Oracle,Docker;Google Cloud Platform,Docker;iOS;Slack,React.js;Ruby on Rails,Express;React.js;Ruby on Rails;Vue.js,,React Native;TensorFlow,Visual Studio Code,MacOS,Development;Testing;Production,,Useful for immutable record keeping outside of...,Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Multiple times per day,Find answers to specific questions;Learn how t...,6-10 times per week,They were about the same,,Yes,I have never participated in Q&A on Stack Over...,Yes,"No, I've heard of them, but I am not part of a...","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,22.0,Man,No,Straight / Heterosexual,Black or of African descent,No,Appropriate in length,Easy
25,26,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","Taught yourself a new language, framework, or ...","10,000 or more employees","Designer;Developer, back-end;Developer, deskto...",12,8,8,Very satisfied,Very satisfied,,,,"I’m not actively looking, but I am open to new...",Less than a year ago,Interview with people in peer roles;Interview ...,No,Remote work options;Diversity of the company o...,I was preparing for a job search,USD,United States dollar,114000.0,Yearly,114000.0,40.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Meeting...,"Less than half the time, but at least one day ...",Home,Far above average,"Yes, because I see value in code review",2.0,"Yes, it's not part of our process but the deve...",Developers typically have the most influence o...,I have a great deal of influence,Bash/Shell/PowerShell;C++;C#;HTML/CSS;JavaScri...,C#;HTML/CSS;JavaScript;Objective-C;Ruby;SQL;Sw...,Microsoft SQL Server;MySQL;Redis;SQLite,Microsoft SQL Server;MySQL;Redis;SQLite,AWS;Docker;Linux;MacOS;Microsoft Azure;Windows...,Android;Docker;iOS;Linux;MacOS;Microsoft Azure...,Angular/Angular.js;ASP.NET;Drupal;Express;jQue...,Angular/Angular.js;ASP.NET,.NET;.NET Core;Node.js;Xamarin,.NET;.NET Core;Node.js,Notepad++;Sublime Text;Vim;Visual Studio;Xcode,MacOS,Development;Testing,Not at all,A passing fad,Yes,SIGH,Yes,I don't use social media,In real life (in person),Username,2008,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,11-30 minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Just as welcome now as I felt last year,,34.0,Man,No,Gay or Lesbian,,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88818,78292,,No,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",United States,No,"Other doctoral degree (Ph.D, Ed.D., etc.)","A health science (ex. nursing, pharmacy, radio...",Completed an industry certification program (e...,"Just me - I am a freelancer, sole proprietor, ...",Academic researcher,42,14,31,,,,,,,,,,,,,,,,,,,,,,,,,,,,Bash/Shell/PowerShell;C;Python,Bash/Shell/PowerShell;C;Python,SQLite,SQLite,Linux;Raspberry Pi;Other(s):,Linux;Raspberry Pi;Other(s):,,,Chef,,Emacs;IPython / Jupyter,Linux-based,I do not use containers,,Useful for immutable record keeping outside of...,No,Yes,Yes,I don't use social media,In real life (in person),,2013,A few times per week,Find answers to specific questions,Less than once per week,The other resource was slightly faster,11-30 minutes,Not sure / can't remember,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are","No, not really",Somewhat less welcome now than last year,,60.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Too long,Neither easy nor difficult
88840,82717,,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,No,"Secondary school (e.g. American high school, G...",,,,,Less than 1 year,,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Android;Windows,Android;Microsoft Azure;Windows,,,,,,MacOS,Testing,,,No,SIGH,Yes,Facebook,In real life (in person),Username,2018,Less than once per month or monthly,Find answers to specific questions,Less than once per week,,60+ minutes,No,,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...",Not sure,,Industry news about technologies you're intere...,44.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Neither easy nor difficult
88844,83397,,Yes,Less than once per year,,"Not employed, but looking for work",United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,,,12,9,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,HTML/CSS;JavaScript;Python;SQL,C;C++;C#;Go;Java;JavaScript;Python;R;Ruby;SQL;...,,,Android;Arduino;Slack,Android;Arduino;Docker;iOS;Raspberry Pi;Slack,Flask,Django;Drupal;Flask;jQuery;React.js,,Chef;Torch/PyTorch,Eclipse;IPython / Jupyter;Sublime Text,MacOS,I do not use containers,,,,SIGH,Yes,,,Handle,I don't remember,A few times per week,Find answers to specific questions;Learn how t...,3-5 times per week,They were about the same,,Not sure / can't remember,,Yes,"No, and I don't know what those are","No, not at all",Just as welcome now as I felt last year,,27.0,Woman,No,Bisexual,White or of European descent,No,Appropriate in length,Easy
88859,85642,,No,Less than once per year,"OSS is, on average, of LOWER quality than prop...","Independent contractor, freelancer, or self-em...",United States,No,Associate degree,"Information systems, information technology, o...",Taken an online course in programming or softw...,"Just me - I am a freelancer, sole proprietor, ...",Designer;Marketing or sales professional,20,7,Less than 1 year,,,,,,,,,,,,,,,,,,,,,,,,,,,,Go;HTML/CSS,,,,,,,,,,Visual Studio Code,Windows,I do not use containers,,Useful for immutable record keeping outside of...,No,SIGH,Yes,,In real life (in person),Handle,2008,Less than once per month or monthly,Find answers to specific questions,Less than once per week,Stack Overflow was slightly faster,60+ minutes,Yes,I have never participated in Q&A on Stack Over...,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","No, not at all",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,34.0,"Non-binary, genderqueer, or gender non-conforming",,Bisexual;Gay or Lesbian,White or of European descent,No,Appropriate in length,Easy


As we can see, this returns the same rows we would get from the corresponding group. However, a filter isolates only one group at a time: the one that matches our chosen criteria.
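To check this equivalence on something small enough to read at a glance, here is a minimal sketch on made-up data. The names `toy_df` and `toy_group` and all the values are invented for illustration and are not part of the survey.

```python
import pandas as pd

# Toy data standing in for the survey; the values are made up.
toy_df = pd.DataFrame({
    'Country': ['Egypt', 'India', 'Egypt', 'India', 'Japan'],
    'SocialMedia': ['Facebook', 'WhatsApp', 'YouTube', 'WhatsApp', 'Twitter'],
})

toy_group = toy_df.groupby('Country')

# A filter isolates the rows of one country.
filtered = toy_df.loc[toy_df['Country'] == 'Egypt']

# get_group pulls the same rows out of the GroupBy object.
grouped = toy_group.get_group('Egypt')

print(filtered.equals(grouped))   # both hold rows 0 and 2
print(sorted(toy_group.groups))   # every country has its own group
```

The filter had to be written for one specific country, while the GroupBy object already holds a group for every country in the data.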

We can even apply the **value_counts** method to the "SocialMedia" column of this filtered result, and obtain the same results as we would get by applying it to the same group within our DataFrameGroupBy object:

In [37]:
df.loc[filter_1, 'SocialMedia'].value_counts()

Reddit                      5700
Twitter                     3468
Facebook                    2844
YouTube                     2463
I don't use social media    1851
Instagram                   1652
LinkedIn                    1020
WhatsApp                     609
Snapchat                     326
WeChat 微信                     93
VK ВКонта́кте                  9
Weibo 新浪微博                     8
Hello                          2
Youku Tudou 优酷                 1
Name: SocialMedia, dtype: int64

In [34]:
country_group.get_group('United States')['SocialMedia'].value_counts()

Reddit                      5700
Twitter                     3468
Facebook                    2844
YouTube                     2463
I don't use social media    1851
Instagram                   1652
LinkedIn                    1020
WhatsApp                     609
Snapchat                     326
WeChat 微信                     93
VK ВКонта́кте                  9
Weibo 新浪微博                     8
Hello                          2
Youku Tudou 优酷                 1
Name: SocialMedia, dtype: int64

This does indeed produce the same results in both cases. However, the true utility of the **groupby** function shows when we apply a function directly to it:

In [38]:
country_group['SocialMedia'].value_counts()

Country      SocialMedia             
Afghanistan  Facebook                    15
             YouTube                      9
             I don't use social media     6
             WhatsApp                     4
             Instagram                    1
                                         ..
Zimbabwe     Facebook                     3
             YouTube                      3
             Instagram                    2
             LinkedIn                     2
             Reddit                       1
Name: SocialMedia, Length: 1220, dtype: int64

This immediately delivers the results for each and every country - or group - which is something no single filter can produce!

With that, finding out which social media platforms are used the most in any given country is as simple as calling **loc** on the result:

In [41]:
country_group['SocialMedia'].value_counts().loc['Egypt']

SocialMedia
Facebook                    116
YouTube                      83
WhatsApp                     42
Twitter                      34
LinkedIn                     14
Reddit                       12
Instagram                     7
I don't use social media      5
VK ВКонта́кте                 1
Name: SocialMedia, dtype: int64
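The lookup above works because the grouped **value_counts** result is a single Series with a two-level index (Country, then SocialMedia). The sketch below shows this on made-up data; `toy_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up answers for two countries.
toy_df = pd.DataFrame({
    'Country': ['Egypt', 'Egypt', 'Egypt', 'India', 'India'],
    'SocialMedia': ['Facebook', 'Facebook', 'YouTube', 'WhatsApp', 'WhatsApp'],
})

counts = toy_df.groupby('Country')['SocialMedia'].value_counts()

# The result is one Series indexed by (Country, SocialMedia).
print(list(counts.index.names))

# Selecting on the first index level returns the counts for one country.
egypt_counts = counts.loc['Egypt']
print(egypt_counts)

# A tuple selects a single (country, platform) pair.
print(counts.loc[('Egypt', 'Facebook')])
```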

**Remember: we can switch to proportions instead of counts at any time by passing normalize=True to the value_counts method (multiply by 100 to read them as percentages):**

In [43]:
country_group['SocialMedia'].value_counts(normalize=True).loc['China']

SocialMedia
WeChat 微信                   0.670549
YouTube                     0.088186
Weibo 新浪微博                  0.069884
I don't use social media    0.044925
Twitter                     0.044925
Reddit                      0.019967
LinkedIn                    0.018303
Facebook                    0.013311
Instagram                   0.011647
Youku Tudou 优酷              0.011647
WhatsApp                    0.004992
VK ВКонта́кте               0.001664
Name: SocialMedia, dtype: float64

Using this, we see that approximately 67% of participants from China said that "WeChat" is their most used social media platform!
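One detail worth checking is that the proportions are computed within each group, not across the whole dataset. A minimal sketch on made-up data (the `toy_df` values are invented for illustration):

```python
import pandas as pd

# Made-up answers: 3 of 4 respondents from China pick WeChat.
toy_df = pd.DataFrame({
    'Country': ['China', 'China', 'China', 'China', 'Egypt', 'Egypt'],
    'SocialMedia': ['WeChat', 'WeChat', 'WeChat', 'YouTube', 'Facebook', 'YouTube'],
})

shares = toy_df.groupby('Country')['SocialMedia'].value_counts(normalize=True)

# Fractions are computed within each country, so each country sums to 1.
per_country_total = shares.groupby(level='Country').sum()
print(per_country_total)

# Multiply by 100 to display percentages.
print((shares.loc['China'] * 100).round(1))
```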

**Grouping** can also be used with simple aggregate functions. For example, we previously looked at median salaries. Let's now see the median salary in each country:

In [49]:
country_group['ConvertedComp'].median()

Country
Afghanistan                               6222.0
Albania                                  10818.0
Algeria                                   7878.0
Andorra                                 160931.0
Angola                                    7764.0
                                          ...   
Venezuela, Bolivarian Republic of...      6384.0
Viet Nam                                 11892.0
Yemen                                    11940.0
Zambia                                    5040.0
Zimbabwe                                 19200.0
Name: ConvertedComp, Length: 179, dtype: float64

Et voila! We can easily view the median salary for all countries at the same time.

A quick inquiry using **loc** can also show us the result of a specific group:

In [51]:
country_group['ConvertedComp'].median().loc['Germany']

63016.0

When we want to use multiple aggregate functions on our grouped data at the same time, we can use **agg** and pass it a list of the desired aggregate functions:

In [52]:
country_group['ConvertedComp'].agg(['median', 'mean'])

Unnamed: 0_level_0,median,mean
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,6222.0,101953.333333
Albania,10818.0,21833.700000
Algeria,7878.0,34924.047619
Andorra,160931.0,160931.000000
Angola,7764.0,7764.000000
...,...,...
"Venezuela, Bolivarian Republic of...",6384.0,14581.627907
Viet Nam,11892.0,17233.436782
Yemen,11940.0,16909.166667
Zambia,5040.0,10075.375000


As before, we can use **loc** on this result to look up a specific country.
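As a small sketch of that lookup, assuming made-up salaries in a toy DataFrame (`toy_df` and its numbers are not from the survey):

```python
import pandas as pd

# Made-up salaries for two countries.
toy_df = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'Germany', 'India', 'India'],
    'ConvertedComp': [50000.0, 60000.0, 100000.0, 8000.0, 12000.0],
})

stats = toy_df.groupby('Country')['ConvertedComp'].agg(['median', 'mean'])

# One row per country, one column per aggregate function.
print(stats.loc['Germany'])             # both statistics for Germany
print(stats.loc['Germany', 'median'])   # a single cell
```

Passing a row label and a column label to **loc** together returns one value, which is handy when only one statistic is needed.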

It is natural, while doing **E**xploratory **D**ata **A**nalysis (**EDA**), to run into problems which have no straightforward solution. This is to be expected, and handling them gets easier with practice.

One example, which we will tackle here, is getting the percentage of developers who know Python from each country within our dataset.

Before we start tackling this exercise using grouping, let's first see how this could be achieved on a single country using filtering:

In [72]:
# na=False counts respondents who left the question blank as not knowing Python
filter_2 = (df['Country'] == 'India') & (df['LanguageWorkedWith'].str.contains('Python', na=False))

In [73]:
len(df[filter_2])

Another way of achieving this using filtering is:

In [74]:
filter_3 = (df['Country'] == 'India')
df[filter_3]['LanguageWorkedWith'].str.contains('Python')

7         True
9         True
14       False
49        True
64       False
         ...  
88808    False
88825     True
88852    False
88853     True
88864    False
Name: LanguageWorkedWith, Length: 9061, dtype: object

**Remember: applying .str.contains() results in boolean values. Rows with a missing answer stay NaN, which is why the dtype above is shown as object.**

To find the number of True values within this Series, we can call **sum** on the result, which will treat every True as 1, and every False as 0:

In [75]:
df[filter_3]['LanguageWorkedWith'].str.contains('Python').sum()

3105
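The sketch below shows how missing answers behave in this count, using a made-up Series (`langs` is a name invented for this illustration). It is built with an explicit object dtype to mirror the "dtype: object" shown in the output above.

```python
import numpy as np
import pandas as pd

# Made-up answers; the third respondent left the question blank.
langs = pd.Series(['Python;SQL', 'Java', np.nan, 'C;Python'], dtype=object)

# Missing answers stay missing in the result.
raw = langs.str.contains('Python')
print(raw.tolist())

# sum() skips missing values by default and counts each True as 1.
print(raw.sum())

# na=False turns missing answers into False, giving a plain boolean Series
# that is also safe to use as a filter.
clean = langs.str.contains('Python', na=False)
print(clean.dtype, clean.sum())
```

Both sums agree here, because a missing answer is either skipped or counted as False. The difference matters when the result is used as a filter, where NaN values can cause an error.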

Now, replicating this code using grouping might sound like an idea that should work. However, it does not:

In [79]:
## country_group['LanguageWorkedWith'].str.contains('Python').sum()
## Does not work, produces the following error message:
## AttributeError: 'SeriesGroupBy' object has no attribute 'str'

So, alternatively, we have to come up with a different solution to achieve the desired outcome.

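The error message tells us what is going on: selecting the "LanguageWorkedWith" column from our DataFrameGroupBy object gives a SeriesGroupBy object, not a Series, so it has no **str** accessor. What it does hold is one ordinary Series per group.

In our case, each of those Series contains the "LanguageWorkedWith" answers of a single country, since Country is what we chose to group by.

This is why we can use **apply** here: it calls a function of our choice once per group and passes that group's Series to it, so inside the function we are free to use **.str** again.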
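To make the per-group behaviour concrete, here is a minimal sketch on made-up data. The names `toy_df`, `lang_group` and `describe_group` are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up answers; one respondent from India left the question blank.
toy_df = pd.DataFrame({
    'Country': ['India', 'India', 'India', 'Japan', 'Japan'],
    'LanguageWorkedWith': ['Python;SQL', 'Java', np.nan, 'Python', 'C;Python'],
})
lang_group = toy_df.groupby('Country')['LanguageWorkedWith']

# The grouped object itself has no .str accessor...
print(hasattr(lang_group, 'str'))

def describe_group(group):
    # ...but each group handed to apply is an ordinary Series.
    return f"{type(group).__name__} with {len(group)} rows"

summary = lang_group.apply(describe_group)
print(summary)
```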

Remember what we discussed in Notebook 5:

"Using apply on a Series applies it to every value within that Series, and using it on a DataFrame applies it to every Series within that DataFrame."

So let's try using **apply** along with **lambda** to achieve the desired effect here:

In [81]:
country_group['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())

Country
Afghanistan                              8
Albania                                 23
Algeria                                 40
Andorra                                  0
Angola                                   2
                                        ..
Venezuela, Bolivarian Republic of...    28
Viet Nam                                78
Yemen                                    3
Zambia                                   4
Zimbabwe                                14
Name: LanguageWorkedWith, Length: 179, dtype: int64

This does indeed work, but we would like to get the percentages for each country, since that is more useful, generally speaking. For that we try:

In [100]:
knows_python_num = country_group['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
total_num = country_group.size()  # number of respondents in each country
knows_python_percent = (knows_python_num / total_num) * 100

In [101]:
knows_python_percent

Country
Afghanistan                             18.181818
Albania                                 26.744186
Algeria                                 29.850746
Andorra                                  0.000000
Angola                                  40.000000
                                          ...    
Venezuela, Bolivarian Republic of...    31.818182
Viet Nam                                33.766234
Yemen                                   15.789474
Zambia                                  33.333333
Zimbabwe                                35.897436
Length: 179, dtype: float64
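As a cross-check, the same percentages can be reached by a different route than the one used above: flag every row first, then take the mean of the flags per country. This is an alternative technique, not the notebook's method, and `toy_df` with its values is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up answers; one respondent from India left the question blank.
toy_df = pd.DataFrame({
    'Country': ['India', 'India', 'India', 'Japan', 'Japan'],
    'LanguageWorkedWith': ['Python;SQL', 'Java', np.nan, 'Python', 'C;Python'],
})

# Approach from the text: apply a lambda to each country's Series.
knows = toy_df.groupby('Country')['LanguageWorkedWith'].apply(
    lambda s: s.str.contains('Python').sum()
)
total = toy_df.groupby('Country').size()
percent_apply = knows / total * 100

# Alternative: flag each row, then average the flags per country.
# The mean of a boolean Series is the fraction of True values.
flags = toy_df['LanguageWorkedWith'].str.contains('Python', na=False)
percent_mean = flags.groupby(toy_df['Country']).mean() * 100

print(percent_apply)
print(percent_mean)
```

Both versions count respondents with a blank answer in the denominator, so they agree. The division relies on the two Series sharing the Country index, which Pandas aligns automatically.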

Let's examine which countries have the highest percentages of Python developers within our survey. For that, we use **sort_values**:

In [102]:
knows_python_percent.sort_values(ascending=False).head(50)

Country
Dominica                            100.000000
Niger                               100.000000
Timor-Leste                         100.000000
Sao Tome and Principe               100.000000
Turkmenistan                         85.714286
Mauritania                           71.428571
Bahamas                              66.666667
Guinea                               66.666667
Guyana                               66.666667
Uganda                               65.277778
Iceland                              61.224490
Namibia                              60.000000
Benin                                60.000000
Haiti                                60.000000
Congo, Republic of the...            57.142857
Oman                                 54.545455
Cuba                                 53.333333
Republic of Korea                    51.282051
Seychelles                           50.000000
San Marino                           50.000000
South Korea                          50.000000
Monac

Finally, let's combine all of these results together for ease of viewing. We can use **concat** to combine multiple Series into a DataFrame, by specifying the **axis** to be **"columns"**:

In [103]:
python_data_df = pd.concat([knows_python_num, total_num, knows_python_percent], axis='columns')

**Note: Omitting "axis='columns'" stacks the passed Series end to end into a single Series whose length is the combined length of all of them, instead of showing them next to each other in a DataFrame.**
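The difference between the two axes is easy to see on two tiny Series. The names and numbers below are made up for illustration.

```python
import pandas as pd

# Two made-up Series sharing the same index.
a = pd.Series([1, 2], index=['Egypt', 'India'], name='KnowsPython')
b = pd.Series([10, 20], index=['Egypt', 'India'], name='TotalNumber')

stacked = pd.concat([a, b])                       # default axis=0
side_by_side = pd.concat([a, b], axis='columns')

print(stacked.shape)         # one Series holding all four values
print(side_by_side.shape)    # two rows, two columns
print(list(side_by_side.columns))
```

Note that a named Series passes its name on as the column header, while unnamed Series get integer column labels.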

Let's look at the current column names first, then rename the columns so that they describe what they contain. We use **inplace=True** so that the change is applied to python_data_df itself:

In [106]:
python_data_df.head()

Unnamed: 0_level_0,LanguageWorkedWith,0,1
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,8,44,18.181818
Albania,23,86,26.744186
Algeria,40,134,29.850746
Andorra,0,7,0.0
Angola,2,5,40.0


In [107]:
python_data_df.rename(columns={'LanguageWorkedWith': 'KnowsPython', 0: 'TotalNumber', 1: 'KnowsPythonPercent'}, inplace=True)

In [108]:
python_data_df

Unnamed: 0_level_0,KnowsPython,TotalNumber,KnowsPythonPercent
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,8,44,18.181818
Albania,23,86,26.744186
Algeria,40,134,29.850746
Andorra,0,7,0.000000
Angola,2,5,40.000000
...,...,...,...
"Venezuela, Bolivarian Republic of...",28,88,31.818182
Viet Nam,78,231,33.766234
Yemen,3,19,15.789474
Zambia,4,12,33.333333
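For comparison, here is a minimal sketch of what **rename** does without **inplace=True**, on a made-up two-row DataFrame (`toy` and `renamed` are names invented for this illustration):

```python
import pandas as pd

# Made-up frame with one string column label and one integer label.
toy = pd.DataFrame({'LanguageWorkedWith': [8, 23], 0: [44, 86]},
                   index=['Afghanistan', 'Albania'])

# Without inplace=True, rename returns a new DataFrame
# and leaves the original untouched.
renamed = toy.rename(columns={'LanguageWorkedWith': 'KnowsPython', 0: 'TotalNumber'})

print(list(toy.columns))
print(list(renamed.columns))
```

Assigning the result to a variable is an alternative to **inplace=True** when we want to keep the original DataFrame around.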


We can now view the information of any country by simply using **loc**, since grouping gives us an index made up of the names of the available groups:

In [110]:
python_data_df.loc['Japan']

KnowsPython           182.000000
TotalNumber           391.000000
KnowsPythonPercent     46.547315
Name: Japan, dtype: float64