# Index
* [StackOverflow Survey Analysis](#StackOverflow_Survey_Analysis)
    * [Part 1: Loading Data](#part-01)
    * [Part 2: DataFrame and Series Objects](#part-02)
    * [Part 3: Indexes](#part-03)
    * [Part 4: Filtering data from Dataframe and series objects](#part-04)
    * [Part 5: Updating rows and columns, modifying data within Dataframe](#part-05)
    * [Part 6: Adding and Removing Rows and Columns from DataFrame](#part-06)
    * [Part 7: Sorting Data](#part-07)
    * [Part 8: Grouping, Aggregating, Analysing and Exploring Data](#part-08)
    * [Part 9: Cleaning Data - Casting Data Types and Handling Missing Values](#part-09)
    

In [1]:
import pandas as pd
import numpy as np

# StackOverflow Survey Analysis <a name='StackOverflow_Survey_Analysis'></a>
In this section, we bring together everything we have learnt and use it to analyse the [2019 StackOverflow Survey Data](https://insights.stackoverflow.com/).

## Part 1: Loading and overviewing data <a name='part-01'></a>



In [2]:
import requests
import zipfile
import io


In [3]:
# URL of the dataset
url = "https://survey.stackoverflow.co/datasets/stack-overflow-developer-survey-2019.zip"

# Local directory to save and extract
extract_dir = "stackoverflow_2019_data"

# Download the file
print("Downloading file...")
response = requests.get(url)
response.raise_for_status()  # Raise an error for bad status codes

# Extract the zip file
print("Extracting contents...")
with zipfile.ZipFile(io.BytesIO(response.content)) as zip_ref:
    zip_ref.extractall(extract_dir)

print(f"Files extracted to '{extract_dir}'")


Downloading file...
Extracting contents...
Files extracted to 'stackoverflow_2019_data'
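If writing the extracted files to disk is not needed, a CSV member can also be read straight out of the archive. The sketch below is only an illustration: it builds a tiny in-memory zip (the file name `demo_results.csv` and its two rows are hypothetical stand-ins, not the survey data) so that it runs without a network connection.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory zip archive.
# "demo_results.csv" and its contents are hypothetical stand-ins for the survey download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo_results.csv", "Respondent,Hobbyist\n1,Yes\n2,No\n")

# Read the CSV member directly from the archive, without extracting it to disk.
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    with zf.open("demo_results.csv") as member:
        demo_df = pd.read_csv(member)

print(demo_df)
```

With the real download, `buffer` would be replaced by `io.BytesIO(response.content)` and the member name by the CSV file inside the archive.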


In [4]:
# get path of data files
data_file = f"./{extract_dir}/survey_results_public.csv"
schema_data = f"./{extract_dir}/survey_results_schema.csv"

In [5]:
# load data into DataFrame
schema_df = pd.read_csv(schema_data)
schema_df

Unnamed: 0,Column,QuestionText
0,Respondent,Randomized respondent ID number (not in order ...
1,MainBranch,Which of the following options best describes ...
2,Hobbyist,Do you code as a hobby?
3,OpenSourcer,How often do you contribute to open source?
4,OpenSource,How do you feel about the quality of open sour...
...,...,...
80,Sexuality,Which of the following do you currently identi...
81,Ethnicity,Which of the following do you identify as? Ple...
82,Dependents,"Do you have any dependents (e.g., children, el..."
83,SurveyLength,How do you feel about the length of the survey...


Note: To change the number of columns and rows that are displayed, use ***set_option*** as follows:
```Python
pd.set_option('display.max_columns', 85)  # display at most 85 columns
pd.set_option('display.max_rows', 85)  # display at most 85 rows
```
Try it out.
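When the wider display is only needed for a single cell, the change can be scoped with `pd.option_context`, which restores the previous values on exit. A minimal sketch, using a hypothetical 30-column frame named `wide_df`:

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
default_max_columns = pd.get_option("display.max_columns")

# Hypothetical wide frame, used only to have something to print.
wide_df = pd.DataFrame({f"col{i}": [i] for i in range(30)})

# Inside the block the limits are raised to 85; outside, the old limits apply again.
with pd.option_context("display.max_columns", 85, "display.max_rows", 85):
    inside_max_columns = pd.get_option("display.max_columns")
    print(wide_df)

restored_max_columns = pd.get_option("display.max_columns")
```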

In [6]:
df = pd.read_csv(data_file)
display(df.head(3))
display(df.tail(3))

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult


Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
88880,88802,,No,Never,,Employed full-time,,,,,...,,,,,,,,,,
88881,88816,,No,Never,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",,,,,...,,,,,,,,,,
88882,88863,,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, and not looking for work",Spain,"Yes, full-time","Professional degree (JD, MD, etc.)","Computer science, computer engineering, or sof...",...,Somewhat less welcome now than last year,Tech articles written by other developers;Indu...,18.0,Man,No,Straight / Heterosexual,Hispanic or Latino/Latina;White or of European...,No,Appropriate in length,Easy


In [7]:
# get dimension of DataFrame
df.shape

(88883, 85)

In [8]:
# display name of columns
df.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOrg', 'BlockchainIs', 'BetterLife'

In [9]:
# get an overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88883 entries, 0 to 88882
Data columns (total 85 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Respondent              88883 non-null  int64  
 1   MainBranch              88331 non-null  object 
 2   Hobbyist                88883 non-null  object 
 3   OpenSourcer             88883 non-null  object 
 4   OpenSource              86842 non-null  object 
 5   Employment              87181 non-null  object 
 6   Country                 88751 non-null  object 
 7   Student                 87014 non-null  object 
 8   EdLevel                 86390 non-null  object 
 9   UndergradMajor          75614 non-null  object 
 10  EduOther                84260 non-null  object 
 11  OrgSize                 71791 non-null  object 
 12  DevType                 81335 non-null  object 
 13  YearsCode               87938 non-null  object 
 14  Age1stCode              87634 non-null

## Part 2: Dataframe and Series Datatypes <a name='part-02'></a>


In [10]:
# get counts of Hobbyist response
df["Hobbyist"].value_counts()

Hobbyist
Yes    71257
No     17626
Name: count, dtype: int64

In [11]:
# Slicing DataFrame
display(df.loc[0:3, ["Hobbyist"]])
display(df.loc[[0, 1, 2], ["Hobbyist", "Ethnicity"]])
display(df.loc[10:20, "Hobbyist":"Employment"])

Unnamed: 0,Hobbyist
0,Yes
1,No
2,Yes
3,No


Unnamed: 0,Hobbyist,Ethnicity
0,Yes,
1,No,
2,Yes,


Unnamed: 0,Hobbyist,OpenSourcer,OpenSource,Employment
10,Yes,Once a month or more often,The quality of OSS and closed source software ...,
11,No,Never,"OSS is, on average, of HIGHER quality than pro...",Employed part-time
12,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time
13,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time
14,Yes,Never,"OSS is, on average, of HIGHER quality than pro...","Not employed, but looking for work"
15,Yes,Never,The quality of OSS and closed source software ...,Employed full-time
16,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,Employed full-time
17,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,Employed full-time
18,Yes,Never,The quality of OSS and closed source software ...,Employed full-time
19,No,Never,"OSS is, on average, of HIGHER quality than pro...",Employed full-time
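Two details of the slicing above are easy to miss: `.loc` slices by label and includes the end label (which is why `0:3` returned four rows), and passing a list of columns returns a DataFrame while a single column name returns a Series. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the survey.
demo_df = pd.DataFrame(
    {
        "Hobbyist": ["Yes", "No", "Yes", "No", "Yes"],
        "Employment": ["Student", "Full-time", "Full-time", "Part-time", "Full-time"],
    }
)

# Label-based slice: the end label 3 is included, so this selects 4 rows.
by_label = demo_df.loc[0:3, "Hobbyist"]

# Position-based slice: the end position 3 is excluded, so this selects 3 rows.
by_position = demo_df.iloc[0:3, 0]

# A single column name gives a Series; a list of names gives a DataFrame.
as_series = demo_df.loc[0:1, "Hobbyist"]
as_frame = demo_df.loc[0:1, ["Hobbyist"]]
```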


## Part 3: Indexes <a name='part-03'></a>


In [12]:
# read csv as DataFrame while specifying index
df = pd.read_csv(data_file, index_col="Respondent")
df.head(2)

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult


In [13]:
# set column as index
schema_df.set_index("Column", inplace=True)
schema_df.head()

Column,QuestionText
Respondent,Randomized respondent ID number (not in order ...
MainBranch,Which of the following options best describes ...
Hobbyist,Do you code as a hobby?
OpenSourcer,How often do you contribute to open source?
OpenSource,How do you feel about the quality of open sour...


In [14]:
# printing MainBranch via recent index setting
print(schema_df.loc["MainBranch"])
print(schema_df.loc["MainBranch", "QuestionText"])

QuestionText    Which of the following options best describes ...
Name: MainBranch, dtype: object
Which of the following options best describes you today? Here, by "developer" we mean "someone who writes code."


In [15]:
# sort index of schema_df
schema_df.sort_index(ascending=False)

Column,QuestionText
YearsCodePro,How many years have you coded professionally (...
YearsCode,"Including any education, how many years have y..."
WorkWeekHrs,"On average, how many hours per week do you work?"
WorkRemote,How often do you work remotely?
WorkPlan,How structured or planned is your work?
...,...
BlockchainOrg,How is your organization thinking about or imp...
BlockchainIs,Blockchain / cryptocurrency technology is prim...
BetterLife,Do you think people born today will have a bet...
Age1stCode,At what age did you write your first line of c...
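The index operations used in this part can be summarised on a small frame. This is a sketch under assumptions: `demo_schema` and two of its question texts are made up for illustration. It shows that `set_index` without `inplace=True` returns a new frame, that `.loc` then looks rows up by label, and that `reset_index` undoes the change.

```python
import pandas as pd

# Hypothetical miniature schema; the real one has one row per survey column.
demo_schema = pd.DataFrame(
    {
        "Column": ["Respondent", "Hobbyist", "Age"],
        "QuestionText": [
            "Randomized respondent ID",
            "Do you code as a hobby?",
            "What is your age?",
        ],
    }
)

# Without inplace=True, set_index returns a new frame and leaves the original untouched.
indexed = demo_schema.set_index("Column")

# Row label first, column label second.
question = indexed.loc["Hobbyist", "QuestionText"]

# sort_index orders rows by label; ascending=True is the default.
sorted_labels = list(indexed.sort_index().index)

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
```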


## Part 4: Filtering data from Dataframe and series objects <a name='part-04'></a>

In [16]:
# filtering rows having salary > 70k
high_sal = df["ConvertedComp"] > 70000
df.loc[high_sal]

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
6,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Canada,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,28.0,Man,No,Straight / Heterosexual,East Asian,No,Too long,Neither easy nor difficult
9,I am a developer by profession,Yes,Once a month or more often,The quality of OSS and closed source software ...,Employed full-time,New Zealand,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,,23.0,Man,No,Bisexual,White or of European descent,No,Appropriate in length,Neither easy nor difficult
13,I am a developer by profession,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Somewhat more welcome now than last year,Tech articles written by other developers;Cour...,28.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
16,I am a developer by profession,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,United Kingdom,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,26.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Neither easy nor difficult
22,I am a developer by profession,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,Some college/university study without earning ...,,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,47.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88876,I am a developer by profession,Yes,Never,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Received on-the-job training in software devel...,...,A lot less welcome now than last year,,23.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
88877,I am a developer by profession,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Somewhat more welcome now than last year,Tech articles written by other developers;Cour...,48.0,Man,No,Straight / Heterosexual,South Asian,Yes,Too long,Neither easy nor difficult
88878,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,26.0,Man,No,Straight / Heterosexual,South Asian,No,Appropriate in length,Easy
88879,I am a developer by profession,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Finland,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...","Taught yourself a new language, framework, or ...",...,Not applicable - I did not use Stack Overflow ...,,34.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy


In [17]:
# get only some columns, satisfying the condition
df.loc[high_sal, ["Country", "ConvertedComp", "LanguageWorkedWith"]]

Respondent,Country,ConvertedComp,LanguageWorkedWith
6,Canada,366420.0,Java;R;SQL
9,New Zealand,95179.0,Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;P...
13,United States,90000.0,Bash/Shell/PowerShell;HTML/CSS;JavaScript;PHP;...
16,United Kingdom,455352.0,Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;T...
22,United States,103000.0,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...
...,...,...,...
88876,United States,180000.0,Bash/Shell/PowerShell;C#;HTML/CSS;Java;Python;...
88877,United States,2000000.0,Bash/Shell/PowerShell;C;Clojure;HTML/CSS;Java;...
88878,United States,130000.0,HTML/CSS;JavaScript;Scala;TypeScript
88879,Finland,82488.0,Bash/Shell/PowerShell;C++;Python


In [18]:
# get data for selected countries
countries = ["India", "United States", "Germany", "Canada", "United Kingdom"]
filtr = df["Country"].isin(countries)
df.loc[filtr, "Country"]

Respondent
1        United Kingdom
4         United States
6                Canada
8                 India
10                India
              ...      
85642     United States
85961    United Kingdom
86012             India
88282     United States
88377            Canada
Name: Country, Length: 45008, dtype: object

In [19]:
# filter using substring
filtr = df["LanguageWorkedWith"].str.contains("Python", na=False)
df.loc[filtr, "LanguageWorkedWith"]

Respondent
1                          HTML/CSS;Java;JavaScript;Python
2                                      C++;HTML/CSS;Python
4                                      C;C++;C#;Python;SQL
5              C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA
8        Bash/Shell/PowerShell;C;C++;HTML/CSS;Java;Java...
                               ...                        
84539    Bash/Shell/PowerShell;C;C++;HTML/CSS;Java;Java...
85738      Bash/Shell/PowerShell;C++;Python;Ruby;Other(s):
86566      Bash/Shell/PowerShell;HTML/CSS;Python;Other(s):
87739             C;C++;HTML/CSS;JavaScript;PHP;Python;SQL
88212                           HTML/CSS;JavaScript;Python
Name: LanguageWorkedWith, Length: 36443, dtype: object
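The boolean masks built above can be combined with `&` (and), `|` (or) and `~` (not). The sketch below uses a hypothetical five-row frame, with column names borrowed from the survey, to show how missing values behave in each kind of mask.

```python
import pandas as pd

# Hypothetical toy data with some missing values.
demo_df = pd.DataFrame(
    {
        "Country": ["India", "Germany", "Canada", "India", None],
        "ConvertedComp": [90000.0, 50000.0, None, 20000.0, 120000.0],
        "LanguageWorkedWith": ["Python;SQL", "Java", "Python", None, "C++;Python"],
    }
)

# Comparisons against NaN evaluate to False, so rows without a salary are dropped.
high_sal = demo_df["ConvertedComp"] > 70000

# isin also yields False for missing values.
in_countries = demo_df["Country"].isin(["India", "Canada"])

# na=False turns missing strings into False instead of leaving NaN in the mask.
uses_python = demo_df["LanguageWorkedWith"].str.contains("Python", na=False)

both = demo_df.loc[high_sal & in_countries]   # high salary AND selected country
either = demo_df.loc[high_sal | uses_python]  # high salary OR uses Python
not_python = demo_df.loc[~uses_python]        # does NOT list Python
```

Python's `and`, `or` and `not` keywords do not work element-wise on Series, which is why the operator forms are used here.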

## Part 5: Updating rows and columns, modifying data within Dataframe <a name='part-05'></a>


In [20]:
# rename ConvertedComp to SalaryUSD (returns a new DataFrame; df itself is unchanged)
df.rename(columns={"ConvertedComp": "SalaryUSD"}).columns

Index(['MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource', 'Employment',
       'Country', 'Student', 'EdLevel', 'UndergradMajor', 'EduOther',
       'OrgSize', 'DevType', 'YearsCode', 'Age1stCode', 'YearsCodePro',
       'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney', 'MgrWant', 'JobSeek',
       'LastHireDate', 'LastInt', 'FizzBuzz', 'JobFactors', 'ResumeUpdate',
       'CurrencySymbol', 'CurrencyDesc', 'CompTotal', 'CompFreq', 'SalaryUSD',
       'WorkWeekHrs', 'WorkPlan', 'WorkChallenge', 'WorkRemote', 'WorkLoc',
       'ImpSyn', 'CodeRev', 'CodeRevHrs', 'UnitTests', 'PurchaseHow',
       'PurchaseWhat', 'LanguageWorkedWith', 'LanguageDesireNextYear',
       'DatabaseWorkedWith', 'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOrg', 'BlockchainIs', 'BetterLife', 'ITperson', 'OffOn',
  

In [21]:
df.rename(columns={"ConvertedComp": "SalaryUSD"}, inplace=True)
df["SalaryUSD"]

Respondent
1            NaN
2            NaN
3         8820.0
4        61000.0
5            NaN
          ...   
88377        NaN
88601        NaN
88802        NaN
88816        NaN
88863        NaN
Name: SalaryUSD, Length: 88883, dtype: float64

In [22]:
# convert Hobbyist's Yes/No values to True/False
df["Hobbyist"] = df["Hobbyist"].map({"Yes": True, "No": False})
df.head()

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
1,I am a student who is learning to code,True,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,False,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
3,"I am not primarily a developer, but I write co...",True,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
4,I am a developer by profession,False,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
5,I am a developer by profession,True,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy
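One caveat with `map`: any value missing from the mapping dictionary becomes `NaN`. `Hobbyist` contains only `Yes` and `No`, so nothing is lost here, but on other columns it is worth checking. A minimal sketch with a hypothetical extra answer, `"Maybe"`:

```python
import pandas as pd

# Hypothetical answers; "Maybe" does not occur in the survey's Hobbyist column.
answers = pd.Series(["Yes", "No", "Maybe", "Yes"])

# Values that are not keys of the dictionary are mapped to NaN.
mapped = answers.map({"Yes": True, "No": False})

# Inspect which original values fell through the mapping.
unmapped = answers[mapped.isna()]
print(unmapped)
```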


## Part 6: Adding and Removing Rows and Columns from DataFrame <a name='part-06'></a>
There is no survey-specific exercise for this part; see the other notebook for examples.

## Part 7: Sorting Data <a name='part-07'></a>


In [23]:
# sort data by country name (ascending) and salary (descending)
df.sort_values(by=["Country", "SalaryUSD"], ascending=[True, False], inplace=True)
df[["Country", "SalaryUSD"]].head(10)

Respondent,Country,SalaryUSD
63129,Afghanistan,1000000.0
50499,Afghanistan,153216.0
39258,Afghanistan,19152.0
58450,Afghanistan,17556.0
7085,Afghanistan,14364.0
22450,Afghanistan,7980.0
48436,Afghanistan,4464.0
10746,Afghanistan,3996.0
8149,Afghanistan,1596.0
29736,Afghanistan,1116.0


In [24]:
# getting top n salaries
df["SalaryUSD"].nlargest(3)

Respondent
25983    2000000.0
87896    2000000.0
22013    2000000.0
Name: SalaryUSD, dtype: float64

In [25]:
# getting the n smallest salaries
df["SalaryUSD"].nsmallest(2)

Respondent
722      0.0
28638    0.0
Name: SalaryUSD, dtype: float64

In [26]:
# get all details of top n
df.nlargest(3, "SalaryUSD")

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
25983,I am a developer by profession,True,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Canada,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Received on-the-job training in software devel...,...,Just as welcome now as I felt last year,,24.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
87896,I am a developer by profession,True,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,Germany,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,32.0,Man,No,Gay or Lesbian,White or of European descent,No,Appropriate in length,Neither easy nor difficult
22013,I am a developer by profession,True,Never,The quality of OSS and closed source software ...,Employed full-time,India,No,"Professional degree (JD, MD, etc.)","A natural science (ex. biology, chemistry, phy...",Taken an online course in programming or softw...,...,A lot more welcome now than last year,Tech articles written by other developers;Indu...,,Man,No,Straight / Heterosexual,,Yes,Too long,Easy
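The top salaries above are all tied at 2,000,000, so which rows `nlargest` returns depends on its `keep` parameter. The sketch below uses a hypothetical Series with three tied values and one missing value to show the default (`keep="first"`) next to `keep="all"`.

```python
import pandas as pd

# Hypothetical salaries: three tied maxima and one missing value.
salaries = pd.Series(
    [2000000.0, 500.0, 2000000.0, 2000000.0, None],
    index=[11, 12, 13, 14, 15],
    name="SalaryUSD",
)

# Default keep="first": among ties, the earliest rows win. NaN is ignored.
top_two_first = salaries.nlargest(2)

# keep="all": every row tied with the last selected value is returned,
# so the result can be longer than n.
top_two_all = salaries.nlargest(2, keep="all")

# Equivalent idea with sort_values; the order among tied rows is not guaranteed here.
via_sort = salaries.sort_values(ascending=False).head(2)
```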


## Part 8: Grouping, Aggregating, Analysing and Exploring Data <a name='part-08'></a>


In [27]:
df.sort_index(inplace=True)
df.index

Index([    1,     2,     3,     4,     5,     6,     7,     8,     9,    10,
       ...
       88874, 88875, 88876, 88877, 88878, 88879, 88880, 88881, 88882, 88883],
      dtype='int64', name='Respondent', length=88883)

In [28]:
df["SalaryUSD"].head(10)

Respondent
1          NaN
2          NaN
3       8820.0
4      61000.0
5          NaN
6     366420.0
7          NaN
8          NaN
9      95179.0
10     13293.0
Name: SalaryUSD, dtype: float64

In [29]:
# get median of salary
df["SalaryUSD"].median()

57287.0

In [30]:
# get median of every numerical column
df.select_dtypes(np.number).median()

CompTotal      62000.0
SalaryUSD      57287.0
WorkWeekHrs       40.0
CodeRevHrs         4.0
Age               29.0
dtype: float64

In [31]:
# get summary statistics of the numeric columns
df.describe()
# note: count is the number of non-NaN rows in each column

Unnamed: 0,CompTotal,SalaryUSD,WorkWeekHrs,CodeRevHrs,Age
count,55945.0,55823.0,64503.0,49790.0,79210.0
mean,551901400000.0,127110.7,42.127197,5.084308,30.336699
std,73319260000000.0,284152.3,37.28761,5.513931,9.17839
min,0.0,0.0,1.0,0.0,1.0
25%,20000.0,25777.5,40.0,2.0,24.0
50%,62000.0,57287.0,40.0,4.0,29.0
75%,120000.0,100000.0,44.75,6.0,35.0
max,1e+16,2000000.0,4850.0,99.0,99.0
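In the summary above, the mean of `CompTotal` (about 5.5e11) is far from its median (62000) because of a few extreme entries such as the maximum of 1e16. The sketch below reproduces that effect on hypothetical numbers, which is one reason the median is used for salaries in this part.

```python
import pandas as pd

# Hypothetical salaries with one extreme outlier, mimicking the 1e16 maximum above.
salaries = pd.Series([40000.0, 50000.0, 60000.0, 70000.0, 1e16], name="CompTotal")

mean_salary = salaries.mean()      # pulled far upwards by the single outlier
median_salary = salaries.median()  # middle value, unaffected by the outlier

# agg returns several statistics at once.
print(salaries.agg(["mean", "median"]))
```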


In [32]:
# check how many responded to SalaryUSD
df["SalaryUSD"].count()

55823

In [33]:
# check how many are hobbyist
df["Hobbyist"].value_counts()

Hobbyist
True     71257
False    17626
Name: count, dtype: int64

In [34]:
# check which SocialMedia is popular among devs
df["SocialMedia"].value_counts()

SocialMedia
Reddit                      14374
YouTube                     13830
WhatsApp                    13347
Facebook                    13178
Twitter                     11398
Instagram                    6261
I don't use social media     5554
LinkedIn                     4501
WeChat 微信                     667
Snapchat                      628
VK ВКонта́кте                 603
Weibo 新浪微博                     56
Youku Tudou 优酷                 21
Hello                          19
Name: count, dtype: int64

In [35]:
# check which SocialMedia is popular among devs in percentages
df["SocialMedia"].value_counts(normalize=True)

SocialMedia
Reddit                      0.170233
YouTube                     0.163791
WhatsApp                    0.158071
Facebook                    0.156069
Twitter                     0.134988
Instagram                   0.074150
I don't use social media    0.065777
LinkedIn                    0.053306
WeChat 微信                   0.007899
Snapchat                    0.007437
VK ВКонта́кте               0.007141
Weibo 新浪微博                  0.000663
Youku Tudou 优酷              0.000249
Hello                       0.000225
Name: proportion, dtype: float64
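Note that `value_counts` drops missing values by default, so the proportions above are shares of the respondents who answered the question, not of all 88883 respondents. A sketch with a hypothetical five-row Series shows the difference when `dropna=False` is passed:

```python
import pandas as pd

# Hypothetical answers; None stands for a respondent who skipped the question.
social = pd.Series(["Reddit", "YouTube", "Reddit", None, "Reddit"], name="SocialMedia")

# Default: NaN is excluded, so the denominator is the 4 answers given.
share_of_answers = social.value_counts(normalize=True)

# dropna=False: NaN gets its own row, so the denominator is all 5 rows.
share_of_all = social.value_counts(normalize=True, dropna=False)

print(share_of_answers)
print(share_of_all)
```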

### Grouping
Split the data into groups -> apply a function to each group -> combine the results
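The three steps can be sketched on a hypothetical toy frame (the column names are borrowed from the survey, the values are made up): `groupby` does the split, an aggregation such as `median` or `agg` is the function applied to each group, and pandas combines the per-group results into one object indexed by the group key.

```python
import pandas as pd

# Hypothetical toy data.
demo_df = pd.DataFrame(
    {
        "Country": ["India", "India", "Germany", "Germany", "Germany"],
        "SalaryUSD": [10000.0, 30000.0, 50000.0, 70000.0, None],
        "LanguageWorkedWith": ["Python;SQL", "Java", "Python", "C++;Python", "Java"],
    }
)

# Split: one group per distinct country.
country_grp = demo_df.groupby("Country")

# Apply + combine: one median per group, indexed by country.
median_by_country = country_grp["SalaryUSD"].median()

# Several aggregations at once; count ignores NaN.
summary = country_grp["SalaryUSD"].agg(["median", "mean", "count"])

# A custom function per group: number of respondents who list Python.
python_users = country_grp["LanguageWorkedWith"].apply(
    lambda langs: langs.str.contains("Python").sum()
)

print(summary)
```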

In [36]:
# check how many countries are there
print(df["Country"].nunique())
# and their counts
df["Country"].value_counts()

179


Country
United States        20949
India                 9061
Germany               5866
United Kingdom        5737
Canada                3395
                     ...  
Tonga                    1
Timor-Leste              1
North Korea              1
Brunei Darussalam        1
Chad                     1
Name: count, Length: 179, dtype: int64

In [37]:
# split data by group
country_grp = df.groupby("Country")  # a plain string key lets get_group take a plain string

In [38]:
# get group information
country_grp.get_group("India")


Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
8,I code primarily as a hobby,True,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, but looking for work",India,,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...","Taught yourself a new language, framework, or ...",...,A lot more welcome now than last year,Tech articles written by other developers;Indu...,24.0,Man,No,Straight / Heterosexual,,,Appropriate in length,Neither easy nor difficult
10,I am a developer by profession,True,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",,,...,Somewhat less welcome now than last year,Tech articles written by other developers;Tech...,,,,,,Yes,Too long,Difficult
15,I am a student who is learning to code,True,Never,"OSS is, on average, of HIGHER quality than pro...","Not employed, but looking for work",India,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,20.0,Man,No,,,Yes,Too long,Neither easy nor difficult
50,I am a developer by profession,True,Once a month or more often,"OSS is, on average, of LOWER quality than prop...",Employed full-time,India,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Another engineering discipline (ex. civil, ele...",Received on-the-job training in software devel...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,23.0,Man,No,,South Asian,No,Too long,Easy
65,I am a developer by profession,True,Never,,Employed full-time,India,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Information systems, information technology, o...",,...,A lot more welcome now than last year,,21.0,Man,No,,,Yes,Appropriate in length,Neither easy nor difficult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88829,I am a student who is learning to code,True,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...","Not employed, but looking for work",India,"Yes, full-time","Secondary school (e.g. American high school, G...",,"Taught yourself a new language, framework, or ...",...,Somewhat more welcome now than last year,Tech articles written by other developers;Indu...,21.0,Man,No,Straight / Heterosexual,South Asian,No,Appropriate in length,Easy
88843,I am a developer by profession,True,Never,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Somewhat more welcome now than last year,Industry news about technologies you're intere...,22.0,Man,No,Straight / Heterosexual,South Asian,Yes,Appropriate in length,Easy
88856,I am a developer by profession,True,Never,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,"Yes, full-time","Master’s degree (MA, MS, M.Eng., MBA, etc.)",Web development or web design,Participated in online coding competitions (e....,...,Somewhat less welcome now than last year,Tech articles written by other developers;Indu...,,,,,,,,
88869,I am a developer by profession,False,Never,"OSS is, on average, of LOWER quality than prop...",Employed full-time,India,,,,,...,A lot more welcome now than last year,Tech articles written by other developers,30.0,Man,No,,South Asian,No,Too long,Neither easy nor difficult
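
To make the role of `get_group` concrete, here is a minimal sketch on a made-up toy frame (the `toy` data and the variable names are illustrative assumptions, not survey values). It suggests that `get_group` selects the same rows as an ordinary boolean filter on the grouping column.

```python
import pandas as pd

# Toy stand-in for the survey data (values are made up for illustration).
toy = pd.DataFrame(
    {
        "Country": ["India", "Canada", "India", "Canada", "India"],
        "SocialMedia": ["WhatsApp", "Reddit", "YouTube", "Reddit", "WhatsApp"],
    }
)

toy_grp = toy.groupby("Country")

# get_group returns the rows whose group key equals the given value...
via_group = toy_grp.get_group("India")

# ...which is the same selection as a boolean filter on the column.
via_filter = toy.loc[toy["Country"] == "India"]

print(via_group.equals(via_filter))
print(toy_grp.ngroups)  # number of distinct countries in the toy frame
```

The advantage of the groupby object is that the split is done once and can then be reused for every country.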


In [39]:
# get social media popular in a specific country
country_grp.get_group("India")["SocialMedia"].value_counts()

SocialMedia
WhatsApp                    2990
YouTube                     1820
LinkedIn                     955
Facebook                     841
Instagram                    822
Twitter                      542
Reddit                       473
I don't use social media     250
Snapchat                      23
Hello                          5
WeChat 微信                      5
VK ВКонта́кте                  4
Youku Tudou 优酷                 2
Weibo 新浪微博                     1
Name: count, dtype: int64

In [40]:
# SocialMedia counts within each country
country_grp["SocialMedia"].value_counts().head(50)
# this returns a Series with a two-level MultiIndex (Country, SocialMedia)

Country              SocialMedia             
Afghanistan          Facebook                     15
                     YouTube                       9
                     I don't use social media      6
                     WhatsApp                      4
                     Instagram                     1
                     LinkedIn                      1
                     Twitter                       1
Albania              WhatsApp                     18
                     Facebook                     16
                     Instagram                    13
                     YouTube                      10
                     Twitter                       8
                     LinkedIn                      7
                     Reddit                        6
                     I don't use social media      4
                     Snapchat                      1
                     WeChat 微信                     1
Algeria              YouTube                      42


In [41]:
# get the most popular social media in a specific country
country_grp["SocialMedia"].value_counts().loc["India"]
# with this approach we only need to change the country name to get its results

SocialMedia
WhatsApp                    2990
YouTube                     1820
LinkedIn                     955
Facebook                     841
Instagram                    822
Twitter                      542
Reddit                       473
I don't use social media     250
Snapchat                      23
Hello                          5
WeChat 微信                      5
VK ВКонта́кте                  4
Youku Tudou 优酷                 2
Weibo 新浪微博                     1
Name: count, dtype: int64
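
The grouped `value_counts()` result is indexed by two levels, so it can be sliced in more than one way. The sketch below uses a made-up toy frame (an assumption for illustration only) to show selection by the outer level, selection by both levels, and `unstack` to pivot the inner level into columns.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame(
    {
        "Country": ["India", "India", "India", "Canada", "Canada"],
        "SocialMedia": ["WhatsApp", "YouTube", "WhatsApp", "Reddit", "Reddit"],
    }
)

counts = toy.groupby("Country")["SocialMedia"].value_counts()

# Outer level only: every platform counted for one country.
india = counts.loc["India"]

# Both levels as a tuple: a single count.
india_whatsapp = counts.loc[("India", "WhatsApp")]

# Pivot the inner level into columns; combinations that never occur become 0.
wide = counts.unstack(fill_value=0)

print(india)
print(india_whatsapp)
print(wide)
```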

In [42]:
# get median salary for every country
country_grp["SalaryUSD"].median()

Country
Afghanistan                               6222.0
Albania                                  10818.0
Algeria                                   7878.0
Andorra                                 160931.0
Angola                                    7764.0
                                          ...   
Venezuela, Bolivarian Republic of...      6384.0
Viet Nam                                 11892.0
Yemen                                    11940.0
Zambia                                    5040.0
Zimbabwe                                 19200.0
Name: SalaryUSD, Length: 179, dtype: float64

### Aggregation

In [43]:
# check mean, median
country_grp["SalaryUSD"].agg(["mean", "median"])

Country,mean,median
Afghanistan,101953.333333,6222.0
Albania,21833.700000,10818.0
Algeria,34924.047619,7878.0
Andorra,160931.000000,160931.0
Angola,7764.000000,7764.0
...,...,...
"Venezuela, Bolivarian Republic of...",14581.627907,6384.0
Viet Nam,17233.436782,11892.0
Yemen,16909.166667,11940.0
Zambia,10075.375000,5040.0


In [44]:
# check median and mean for a specific country
country_grp["SalaryUSD"].agg(["median", "mean"]).loc["Canada"]

median     68705.000000
mean      134018.564909
Name: Canada, dtype: float64
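
As an optional variation, pandas also supports named aggregation, which lets each output column get an explicit name. The sketch below runs on made-up salaries (the `toy` frame and the output column names are assumptions for illustration). It also shows how a single large value pulls the mean well above the median, which is one plausible reading of the gap seen for some countries above.

```python
import pandas as pd

# Toy salaries, made up for illustration; group "A" has one large outlier.
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B"],
        "SalaryUSD": [10_000.0, 20_000.0, 90_000.0, 50_000.0, 70_000.0],
    }
)

# Named aggregation: output_name=(input_column, aggregation)
summary = toy.groupby("Country").agg(
    median_salary=("SalaryUSD", "median"),
    mean_salary=("SalaryUSD", "mean"),
    n=("SalaryUSD", "count"),
)

print(summary)
```

Including the count next to the statistics makes it easier to spot groups that are too small for the mean or median to be meaningful.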

In [45]:
# how many people in India know (have worked with) Python?
filtr = df["Country"] == "India"
df.loc[filtr, "LanguageWorkedWith"].str.contains("Python").sum()

3105

Note: the following will not work:
```Python
country_grp['LanguageWorkedWith'].str.contains('Python').sum()
```
`country_grp['LanguageWorkedWith']` is a SeriesGroupBy object, not a Series, so it has no `.str` accessor. We need to use `apply` instead, which runs the string method on each group's Series.

In [46]:
# how many people know Python in every country, using SeriesGroupBy?
knows_python = country_grp["LanguageWorkedWith"].apply(
    lambda x: x.str.contains("Python").sum()
)
knows_python

Country
Afghanistan                              8
Albania                                 23
Algeria                                 40
Andorra                                  0
Angola                                   2
                                        ..
Venezuela, Bolivarian Republic of...    28
Viet Nam                                78
Yemen                                    3
Zambia                                   4
Zimbabwe                                14
Name: LanguageWorkedWith, Length: 179, dtype: int64
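
An alternative to `apply` with a lambda is to build the boolean mask once on the whole column and then group that mask by country. The sketch below is one possible way to do this on made-up toy data (the `toy` frame and variable names are assumptions); `na=False` treats a missing answer as "does not know Python".

```python
import pandas as pd

# Toy data, made up for illustration; one answer is missing.
toy = pd.DataFrame(
    {
        "Country": ["India", "India", "India", "Canada", "Canada"],
        "LanguageWorkedWith": ["Python;SQL", "Java", None, "Python", "C#;Python"],
    }
)

# Vectorised mask over the whole column; missing answers count as False.
uses_python = toy["LanguageWorkedWith"].str.contains("Python", na=False)

# Group the boolean mask by the country column.
by_country = uses_python.groupby(toy["Country"])

knows_python_toy = by_country.sum()       # number of True values per country
pct_python_toy = by_country.mean() * 100  # share of True values per country

print(knows_python_toy)
print(pct_python_toy)
```

Because the mean of a boolean Series is the fraction of `True` values, this gives the percentage directly, with all respondents of a country (including those who skipped the question) in the denominator.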

In [47]:
# what % of people in each country know Python? first, count respondents per country
country_respondents = df["Country"].value_counts()
country_respondents

Country
United States        20949
India                 9061
Germany               5866
United Kingdom        5737
Canada                3395
                     ...  
Tonga                    1
Timor-Leste              1
North Korea              1
Brunei Darussalam        1
Chad                     1
Name: count, Length: 179, dtype: int64

In [48]:
python_df = pd.concat([knows_python, country_respondents], axis="columns")
python_df

Country,LanguageWorkedWith,count
Afghanistan,8,44
Albania,23,86
Algeria,40,134
Andorra,0,7
Angola,2,5
...,...,...
"Venezuela, Bolivarian Republic of...",28,88
Viet Nam,78,231
Yemen,3,19
Zambia,4,12
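
The two Series above are sorted differently (one alphabetically, one by count), yet the combined rows line up. That is because `pd.concat(..., axis="columns")` aligns on index labels rather than on position. A minimal sketch, reusing three rows from the output above plus one hypothetical label (`"Elsewhere"`) that exists in only one Series:

```python
import pandas as pd

# Same labels in different orders; "Elsewhere" is a hypothetical extra label.
knows = pd.Series({"Albania": 23, "Yemen": 3, "Zambia": 4}, name="KnowsPython")
total = pd.Series(
    {"Zambia": 12, "Elsewhere": 1, "Albania": 86, "Yemen": 19},
    name="NumRespondents",
)

# Rows are matched by label; a label missing from one Series yields NaN there.
combined = pd.concat([knows, total], axis="columns")

print(combined)
```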


In [49]:
# renaming columns
# python_df.reset_index(inplace=True)
python_df.rename(
    columns={"count": "NumRespondents", "LanguageWorkedWith": "KnowsPython"},
    inplace=True,
)
python_df

Country,KnowsPython,NumRespondents
Afghanistan,8,44
Albania,23,86
Algeria,40,134
Andorra,0,7
Angola,2,5
...,...,...
"Venezuela, Bolivarian Republic of...",28,88
Viet Nam,78,231
Yemen,3,19
Zambia,4,12


In [50]:
# calculate the % of people who know Python
python_df["PctKnowsPython"] = (
    python_df["KnowsPython"] / python_df["NumRespondents"]
) * 100
python_df[python_df["NumRespondents"] > 100].sort_values(
    ["PctKnowsPython"], ascending=False
).head(20)

Country,KnowsPython,NumRespondents,PctKnowsPython
South Korea,80,160,50.0
Chile,102,206,49.514563
Finland,266,546,48.717949
Kenya,120,249,48.192771
United States,10083,20949,48.131176
Israel,457,952,48.004202
Taiwan,88,187,47.058824
Switzerland,460,978,47.034765
Hong Kong (S.A.R.),88,188,46.808511
Japan,182,391,46.547315


In [51]:
python_df.loc[["India", "United States"]]

Country,KnowsPython,NumRespondents,PctKnowsPython
India,3105,9061,34.267741
United States,10083,20949,48.131176


## Part 9: Cleaning Data - Casting Data Types and Handling Missing Values <a name='part-09'></a>


In [52]:
# list the placeholder strings that should be treated as missing (NaN) and reload the DataFrame
na_vals = ["NA", "Missing"]
df = pd.read_csv(data_file, index_col="Respondent", na_values=na_vals)
df.head()

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy
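
To see the effect of `na_values` in isolation, here is a small sketch that reads a made-up in-memory CSV (the `csv_text` content is an assumption for illustration). `"NA"` is already on pandas' default list of missing markers, so listing it is harmless; `"Missing"` is the custom placeholder that would otherwise be read as an ordinary string.

```python
import io

import pandas as pd

# Made-up CSV content with two different placeholders for missing data.
csv_text = (
    "Respondent,Country,YearsCode\n"
    "1,India,4\n"
    "2,Missing,NA\n"
    "3,Canada,Missing\n"
)

raw = pd.read_csv(
    io.StringIO(csv_text),
    index_col="Respondent",
    na_values=["NA", "Missing"],
)

print(raw)
print(raw.isna().sum())  # missing values per column
```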


In [53]:
# inspect YearsCode before calculating the average number of years of coding experience
df["YearsCode"].head()

Respondent
1      4
2    NaN
3      3
4      3
5     16
Name: YearsCode, dtype: object

In [54]:
# casting YearsCode (object) to a number
# df['YearsCode'].astype(np.float32)  # this won't work because the column contains strings such as 'Less than 1 year'
# check unique values of 'YearsCode'
df["YearsCode"].unique()

array(['4', nan, '3', '16', '13', '6', '8', '12', '2', '5', '17', '10',
       '14', '35', '7', 'Less than 1 year', '30', '9', '26', '40', '19',
       '15', '20', '28', '25', '1', '22', '11', '33', '50', '41', '18',
       '34', '24', '23', '42', '27', '21', '36', '32', '39', '38', '31',
       '37', 'More than 50 years', '29', '44', '45', '48', '46', '43',
       '47', '49'], dtype=object)

In [55]:
# replace 'Less than 1 year' and 'More than 50 years' with numbers, then cast to float
# (assigning the result back avoids calling replace(..., inplace=True) on a column selection)
df["YearsCode"] = df["YearsCode"].replace(
    {"More than 50 years": 51, "Less than 1 year": 0}
)
df["YearsCode"] = df["YearsCode"].astype(np.float32)
df["YearsCode"].unique()

array([ 4., nan,  3., 16., 13.,  6.,  8., 12.,  2.,  5., 17., 10., 14.,
       35.,  7.,  0., 30.,  9., 26., 40., 19., 15., 20., 28., 25.,  1.,
       22., 11., 33., 50., 41., 18., 34., 24., 23., 42., 27., 21., 36.,
       32., 39., 38., 31., 37., 51., 29., 44., 45., 48., 46., 43., 47.,
       49.], dtype=float32)
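
As a possible alternative to `astype`, `pd.to_numeric` with `errors="coerce"` converts anything it cannot parse into NaN instead of raising. The sketch below uses a made-up Series (the values, including the hypothetical stray answer `"unsure"`, are assumptions for illustration).

```python
import pandas as pd

# Made-up answers; "unsure" is a hypothetical value that cannot be parsed as a number.
years = pd.Series(["4", None, "Less than 1 year", "16", "More than 50 years", "unsure"])

# Map the two known text answers to numeric strings first.
cleaned = years.replace({"Less than 1 year": "0", "More than 50 years": "51"})

# Anything still non-numeric becomes NaN rather than raising an error.
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric)
print(numeric.agg(["mean", "median"]))
```

The trade-off is that `errors="coerce"` silently hides unexpected values, so it is worth checking `unique()` first, as done above.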

In [56]:
# check mean and median of YearsCode
df["YearsCode"].agg(["mean", "median"])

mean      11.662114
median     9.000000
Name: YearsCode, dtype: float32