Grouping and Aggregating - Analyzing and Exploring Your Data

In [1]:
import pandas as pd

In [2]:
region = pd.read_csv('Data/administrative-divisions/Region.csv', index_col='region_id')
district = pd.read_csv('Data/administrative-divisions/Districts.csv', index_col='district_id', dtype={'district_id':'Int64', 'old_region' : 'Int64', 'new_region':'Int64'})
local_types = pd.read_csv('Data/administrative-divisions/LocalBodyTypes.csv', index_col='local_body_type_id')
local_bodies = pd.read_csv('Data/administrative-divisions/localBodies.csv', index_col='local_body_id')

In [3]:
filter1 = (region['Region'] == 'Bagmati') & (region['old_new'] == 1)
filter1_index = region.loc[filter1].index
filter1_index

Int64Index([1103], dtype='int64', name='region_id')

In [4]:
bag_dis_filter = district['new_region'] == filter1_index[0]
bagmati_districts = district.loc[bag_dis_filter, ['District']]
bagmati_districts.sort_values(by='District', ascending=False)

district_id,District
100108,Sindhupalchok
100206,Sindhuli
100107,Rasuwa
100204,Ramechhap
100106,Nuwakot
100303,Makwanpur
100105,Lalitpur
100104,Kavrepalanchok
100103,Kathmandu
100202,Dolakha


In [5]:
# Let's find all the local bodies of Kavrepalanchok
kavrefilter = (bagmati_districts['District'] == 'Kavrepalanchok')
kavre_index = bagmati_districts.loc[kavrefilter].index
kavre_index

Index([100104], dtype='object', name='district_id')

In [6]:
kavre_local_filter = (local_bodies['district_id'] == kavre_index[0])
kavre_local = local_bodies.loc[kavre_local_filter, ['local_body', 'local_body_type_id', 'max_ward']]
kavre_local

local_body_id,local_body,local_body_type_id,max_ward
1001045001,Banepa,105,
1001045002,Dhulikhel,105,
1001045003,Panauti,105,
1001047001,Anekot,107,9.0
1001047002,Balthali,107,9.0
...,...,...,...
1101045161,Namobuddha,105,11.0
1101045162,Panauti,105,12.0
1101045163,Panchkhal,105,13.0
1101049466,Roshi,119,12.0
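The lookup chain in the cells above (find a parent row's id, then select the child rows that point to it) can be sketched on miniature tables. The frames and values below are hypothetical stand-ins, not the real CSV contents. As an assumption for illustration, `isin` replaces the `index[0]` lookup so that an empty or multi-row match does not raise an error or silently drop ids.

```python
import pandas as pd

# Hypothetical miniature versions of the district and local-body tables.
district = pd.DataFrame(
    {"District": ["Kathmandu", "Kavrepalanchok"], "new_region": [1103, 1103]},
    index=pd.Index([100103, 100104], name="district_id"),
)
local_bodies = pd.DataFrame(
    {
        "local_body": ["Banepa", "Dhulikhel", "Kirtipur"],
        "district_id": [100104, 100104, 100103],
    },
    index=pd.Index([1, 2, 3], name="local_body_id"),
)

# Step 1: boolean mask on the parent table, then read the matching index labels.
kavre_mask = district["District"] == "Kavrepalanchok"
kavre_ids = district.loc[kavre_mask].index

# Step 2: keep the child rows whose foreign key is among those labels.
child_mask = local_bodies["district_id"].isin(kavre_ids)
kavre_local = local_bodies.loc[child_mask, ["local_body"]]

print(kavre_local)
```
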


In [7]:
# Working with an example DataFrame
people = {
    "first" : ["Anish", "Ramish", "Samish", "Bamish", "Bamish"],
    "last" : ["Khadka", "Mainali", "Shrestha", "Karki", "Mainali"],
    "email" : ["anishramish56@gmail.com", "mainaliramish89@gmail.com", 
               "shresthasamish28@gmail.com", "bamishkarki819@gmail.com",
               "bamishmainali78@gmail.com"]
}
mydf = pd.DataFrame(people)
mydf

,first,last,email
0,Anish,Khadka,anishramish56@gmail.com
1,Ramish,Mainali,mainaliramish89@gmail.com
2,Samish,Shrestha,shresthasamish28@gmail.com
3,Bamish,Karki,bamishkarki819@gmail.com
4,Bamish,Mainali,bamishmainali78@gmail.com


In [8]:
# Aggregation: combining multiple pieces of data into a single result
# df['column_name'].median()       # median of one column
# df.median(numeric_only=True)     # median of every numeric column in the DataFrame
# df.describe()                    # summary statistics for the numeric columns
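A minimal sketch of the three calls listed above, run on a small hypothetical frame (the column names and values are made up for illustration). It assumes a pandas version where `numeric_only=True` should be passed explicitly when the frame also holds text columns.

```python
import pandas as pd

# Hypothetical toy data; one missing salary to show how NaN is handled.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "salary": [40000.0, 55000.0, None, 70000.0],
    "years": [1, 3, 5, 8],
})

salary_median = toy["salary"].median()           # missing values are skipped
numeric_medians = toy.median(numeric_only=True)  # one median per numeric column
summary = toy.describe()                         # count, mean, std, min, quartiles, max

print(salary_median)
print(numeric_medians)
print(summary)
```
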


In [9]:
# Reading real data
df = pd.read_csv('Data/stack-overflow-developer-survey-2019/survey_results_public.csv', index_col='Respondent')
schema_df = pd.read_csv('Data/stack-overflow-developer-survey-2019/survey_results_schema.csv', index_col='Column')

In [10]:
schema_df

Column,QuestionText
Respondent,Randomized respondent ID number (not in order ...
MainBranch,Which of the following options best describes ...
Hobbyist,Do you code as a hobby?
OpenSourcer,How often do you contribute to open source?
OpenSource,How do you feel about the quality of open sour...
...,...
Sexuality,Which of the following do you currently identi...
Ethnicity,Which of the following do you identify as? Ple...
Dependents,"Do you have any dependents (e.g., children, el..."
SurveyLength,How do you feel about the length of the survey...


In [11]:
df['ConvertedComp'].median() # median salary

57287.0

In [13]:
df['ConvertedComp'].count()

55823

In [14]:
df['ConvertedComp'].value_counts()

2000000.0    709
1000000.0    558
120000.0     502
100000.0     480
150000.0     434
            ... 
24156.0        1
255720.0       1
614.0          1
179261.0       1
57889.0        1
Name: ConvertedComp, Length: 9162, dtype: int64
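As a small illustration of what `count()` and `value_counts()` report, the sketch below uses a hypothetical Series with one missing entry. Both calls ignore missing values by default, and `normalize=True` returns shares instead of raw counts.

```python
import pandas as pd

# Hypothetical compensation values; None becomes NaN.
comp = pd.Series([100.0, 200.0, 100.0, None, 100.0, 200.0, 300.0], name="comp")

non_missing = comp.count()                   # number of non-NaN entries
counts = comp.value_counts()                 # frequency of each value, most common first
shares = comp.value_counts(normalize=True)   # same, as fractions of the non-NaN total

print(non_missing)
print(counts)
print(shares)
```
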

In [15]:
country_grp = df.groupby('Country')  # a plain column name lets get_group() take a scalar key

In [16]:
country_grp.get_group('United States')

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
13,I am a developer by profession,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Somewhat more welcome now than last year,Tech articles written by other developers;Cour...,28.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
22,I am a developer by profession,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,Some college/university study without earning ...,,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,47.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
23,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Information systems, information technology, o...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,22.0,Man,No,Straight / Heterosexual,Black or of African descent,No,Appropriate in length,Easy
26,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,,34.0,Man,No,Gay or Lesbian,,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78292,,No,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",United States,No,"Other doctoral degree (Ph.D, Ed.D., etc.)","A health science (ex. nursing, pharmacy, radio...",Completed an industry certification program (e...,...,Somewhat less welcome now than last year,,60.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Too long,Neither easy nor difficult
82717,,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,No,"Secondary school (e.g. American high school, G...",,,...,,Industry news about technologies you're intere...,44.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Neither easy nor difficult
83397,,Yes,Less than once per year,,"Not employed, but looking for work",United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,,27.0,Woman,No,Bisexual,White or of European descent,No,Appropriate in length,Easy
85642,,No,Less than once per year,"OSS is, on average, of LOWER quality than prop...","Independent contractor, freelancer, or self-em...",United States,No,Associate degree,"Information systems, information technology, o...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,34.0,"Non-binary, genderqueer, or gender non-conforming",,Bisexual;Gay or Lesbian,White or of European descent,No,Appropriate in length,Easy


In [18]:
# This gives the same rows as filtering with a boolean mask
country_filt = (df['Country'] == 'United States')
df.loc[country_filt]

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
13,I am a developer by profession,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Somewhat more welcome now than last year,Tech articles written by other developers;Cour...,28.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
22,I am a developer by profession,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,Some college/university study without earning ...,,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,47.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
23,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Information systems, information technology, o...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,22.0,Man,No,Straight / Heterosexual,Black or of African descent,No,Appropriate in length,Easy
26,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,,34.0,Man,No,Gay or Lesbian,,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78292,,No,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",United States,No,"Other doctoral degree (Ph.D, Ed.D., etc.)","A health science (ex. nursing, pharmacy, radio...",Completed an industry certification program (e...,...,Somewhat less welcome now than last year,,60.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Too long,Neither easy nor difficult
82717,,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,No,"Secondary school (e.g. American high school, G...",,,...,,Industry news about technologies you're intere...,44.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Neither easy nor difficult
83397,,Yes,Less than once per year,,"Not employed, but looking for work",United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,,27.0,Woman,No,Bisexual,White or of European descent,No,Appropriate in length,Easy
85642,,No,Less than once per year,"OSS is, on average, of LOWER quality than prop...","Independent contractor, freelancer, or self-em...",United States,No,Associate degree,"Information systems, information technology, o...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,34.0,"Non-binary, genderqueer, or gender non-conforming",,Bisexual;Gay or Lesbian,White or of European descent,No,Appropriate in length,Easy
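The equivalence shown in the two cells above can be checked on a toy frame. This is a sketch with hypothetical data. The group key is passed as a plain column name because, as far as I know, recent pandas versions expect a tuple key in `get_group` when grouping by a one-element list. The last line shows why the group object is worth having: one call aggregates every group at once.

```python
import pandas as pd

# Hypothetical miniature survey.
toy = pd.DataFrame(
    {
        "Country": ["Nepal", "India", "Nepal", "India", "Nepal"],
        "Comp": [10, 20, 30, 60, 50],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Respondent"),
)

grp = toy.groupby("Country")

via_group = grp.get_group("Nepal")                # rows of a single group
via_filter = toy.loc[toy["Country"] == "Nepal"]   # same rows via a boolean mask
same_rows = list(via_group.index) == list(via_filter.index)

# A filter handles one country at a time; the group object handles all of them.
medians = grp["Comp"].median()

print(same_rows)
print(medians)
```
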


In [24]:
schema_df.loc['SocialMedia']

QuestionText    What social media site do you use the most?
Name: SocialMedia, dtype: object

In [25]:
df['SocialMedia'].value_counts()

Reddit                      14374
YouTube                     13830
WhatsApp                    13347
Facebook                    13178
Twitter                     11398
Instagram                    6261
I don't use social media     5554
LinkedIn                     4501
WeChat 微信                     667
Snapchat                      628
VK ВКонта́кте                 603
Weibo 新浪微博                     56
Youku Tudou 优酷                 21
Hello                          19
Name: SocialMedia, dtype: int64
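Combining the grouping and the counting steps is a natural follow-up: `value_counts()` can be applied per group. The sketch below uses a hypothetical toy frame rather than the survey data. The result is a Series with a two-level index (group key, then value), so one country's counts can be pulled out with `.loc`.

```python
import pandas as pd

# Hypothetical responses; column names mirror the survey, values are made up.
toy = pd.DataFrame({
    "Country": ["Nepal", "Nepal", "Nepal", "India", "India"],
    "SocialMedia": ["Facebook", "Facebook", "YouTube", "WhatsApp", "WhatsApp"],
})

# Count each SocialMedia answer separately within each country.
per_country = toy.groupby("Country")["SocialMedia"].value_counts()

# Select one group's counts from the first index level.
nepal_counts = per_country.loc["Nepal"]

print(per_country)
print(nepal_counts)
```
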