**Project Overview: Stack Overflow Developer Survey Analysis**
This project explores responses from the Stack Overflow Developer Survey to uncover insights about developers' experience, remote work, education, compensation, and Python usage. The following key questions are addressed in the analysis:

How many people answered the survey?

How many people answered all questions in the survey?

What are the measures of central tendency (mean, median, mode) for respondents' work experience?


How many respondents work remotely?


What percentage of respondents use Python for programming?


How many respondents learned to program through online courses?


Among Python users, what are the average and median annual compensations by country?


What are the education levels of the top 5 highest-paid respondents?


Within each age group, what percentage of respondents use Python?


Among respondents in the top 25% compensation bracket who work remotely, which industries are most common?

In [24]:
import pandas as pd 
import numpy as np
import math

In [25]:
df = pd.read_csv('Downloads/survey_results.csv')
schema = pd.read_csv('Downloads/survey_results_schema.csv')
print(df.head())

   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

In [15]:
df

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65432,65433,I am a developer by profession,18-24 years old,"Employed, full-time",Remote,Apples,Hobby;School or academic work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","On the job training;School (i.e., University, ...",,...,,,,,,,,,,
65433,65434,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects,,,,...,,,,,,,,,,
65434,65435,I am a developer by profession,25-34 years old,"Employed, full-time",In-person,Apples,Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Stack Overflow;Social ...,...,,,,,,,,,,
65435,65436,I am a developer by profession,18-24 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Apples,Hobby;Contribute to open-source projects;Profe...,"Secondary school (e.g. American high school, G...",On the job training;Other online resources (e....,Technical documentation;Blogs;Written Tutorial...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,


In [18]:
df.describe()

Unnamed: 0,ResponseId,CompTotal,WorkExp,JobSatPoints_1,JobSatPoints_4,JobSatPoints_5,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,ConvertedCompYearly,JobSat
count,65437.0,33740.0,29658.0,29324.0,29393.0,29411.0,29450.0,29448.0,29456.0,29456.0,29450.0,29445.0,23435.0,29126.0
mean,32719.0,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,10.955713,9.953948,86155.29,6.935041
std,18890.179119,5.444117e+147,9.168709,25.966221,18.422661,21.833836,27.08936,27.01774,26.10811,24.845032,22.906263,21.775652,186757.0,2.088259
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,16360.0,60000.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32712.0,6.0
50%,32719.0,110000.0,9.0,10.0,0.0,0.0,20.0,15.0,10.0,5.0,0.0,0.0,65000.0,7.0
75%,49078.0,250000.0,16.0,22.0,5.0,10.0,30.0,30.0,25.0,20.0,10.0,10.0,107971.5,8.0
max,65437.0,1e+150,50.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,16256600.0,10.0


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65437 entries, 0 to 65436
Columns: 114 entries, ResponseId to JobSat
dtypes: float64(13), int64(1), object(100)
memory usage: 56.9+ MB


In [19]:
count_responds = df['ResponseId'].nunique()

In [20]:
count_responds

65437

How many people answered the survey in total? 65437.
This is the number of unique ResponseId values, and it matches the 65437 rows reported by df.info().

In [21]:
df.dropna(axis = 0)

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat


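No row is free of NA values across all 114 columns. This does not mean that nobody completed the survey: many questions appear to be optional or shown only to some respondents (in the preview above, RemoteWork is NaN for full-time students), so some columns are bound to contain NA even when a respondent answered everything they were asked.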
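Counting missing values per column is one way to see which questions empty out the result. A minimal sketch on a toy frame; the column names are borrowed from the survey, but the rows are invented for illustration:

```python
import pandas as pd

# Toy rows, invented for illustration (not survey data).
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "Employment": ["Employed, full-time", "Student, full-time", "Employed, full-time"],
    "RemoteWork": ["Remote", None, "In-person"],
    "SurveyEase": [None, "Easy", None],
})

# Number of missing answers per column, most incomplete first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Rows with no missing value in any column: every row has at least one gap.
complete_rows = toy.dropna(axis=0)
print(len(complete_rows))
```

Each toy row is missing a different column, so requiring all columns at once leaves nothing, even though no single column is mostly empty.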

In [34]:
questions_from_data = df.columns
questions_from_schema = schema.qname
print(questions_from_data)
print(questions_from_schema)

Index(['ResponseId', 'MainBranch', 'Age', 'Employment', 'RemoteWork', 'Check',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       ...
       'JobSatPoints_6', 'JobSatPoints_7', 'JobSatPoints_8', 'JobSatPoints_9',
       'JobSatPoints_10', 'JobSatPoints_11', 'SurveyLength', 'SurveyEase',
       'ConvertedCompYearly', 'JobSat'],
      dtype='object', length=114)
0          MainBranch
1                 Age
2          Employment
3          RemoteWork
4               Check
           ...       
82     JobSatPoints_7
83     JobSatPoints_8
84     JobSatPoints_9
85    JobSatPoints_10
86    JobSatPoints_11
Name: qname, Length: 87, dtype: object


In [35]:
questions_in_both = set(questions_from_data) & set(questions_from_schema)

In [36]:
print(questions_in_both)

{'Currency', 'Knowledge_8', 'LearnCodeOnline', 'ProfessionalQuestion', 'JobSatPoints_7', 'AIBen', 'NEWSOSites', 'Employment', 'TechEndorse', 'TechDoc', 'EdLevel', 'YearsCodePro', 'SOComm', 'BuildvsBuy', 'CompTotal', 'SOHow', 'Knowledge_7', 'JobSatPoints_4', 'Age', 'ProfessionalCloud', 'WorkExp', 'Check', 'JobSat', 'AISent', 'Frustration', 'TimeAnswering', 'DevType', 'Knowledge_6', 'JobSatPoints_6', 'JobSatPoints_10', 'Frequency_2', 'RemoteWork', 'Knowledge_2', 'AIEthics', 'SOVisitFreq', 'PurchaseInfluence', 'CodingActivities', 'Frequency_3', 'SurveyLength', 'BuyNewTool', 'SurveyEase', 'Knowledge_5', 'Knowledge_1', 'AIThreat', 'JobSatPoints_1', 'Knowledge_4', 'TBranch', 'Knowledge_9', 'ICorPM', 'SOPartFreq', 'MainBranch', 'AIChallenges', 'YearsCode', 'JobSatPoints_5', 'AIAcc', 'TimeSearching', 'AIComplex', 'ProfessionalTech', 'JobSatPoints_11', 'Frequency_1', 'LearnCode', 'Knowledge_3', 'JobSatPoints_8', 'SOAccount', 'AISelect', 'Industry', 'OrgSize', 'JobSatPoints_9', 'Country'}


To count respondents who answered all the questions, we only check the columns that appear in both the data and the schema, stored in the set "questions_in_both". This leaves out derived columns such as ResponseId and ConvertedCompYearly.

In [38]:
df.dropna(subset = questions_in_both).shape[0]

6306

How many people answered all questions? 6306 respondents have no missing value in any of these columns.

In [39]:
df.WorkExp.mean()

11.46695663901814

Mean work experience is about 11.5 years.

In [40]:
df.WorkExp.median()

9.0

Median work experience is 9 years.

In [41]:
df.WorkExp.mode()

0    3.0
Name: WorkExp, dtype: float64

The mode (most common value) of work experience is 3 years.
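The three statistics can also be collected in one place. A minimal sketch on invented values (not survey data); note that `mode()` returns a Series because ties are possible, and that pandas skips NaN by default in all three:

```python
import pandas as pd

# Invented values for illustration; None stands for a skipped question.
work_exp = pd.Series([3.0, 3.0, 9.0, 12.0, None, 20.0], name="WorkExp")

summary = {
    "mean": work_exp.mean(),           # NaN is skipped by default
    "median": work_exp.median(),
    "mode": work_exp.mode().tolist(),  # a list, since several values can tie
}
print(summary)
```

A mean above the median, as in the survey results, points to a right-skewed distribution: a smaller number of very experienced respondents pulls the mean up.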

In [44]:
df[df.RemoteWork == 'Remote'].shape[0]

20831

Number of respondents working fully remotely: 20831

In [47]:
df[df.RemoteWork == 'In-person'].shape[0]

10960

In [48]:
df[df.RemoteWork == 'Hybrid (some remote, some in-person)'].shape[0]

23015
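The separate filters can be replaced by a single `value_counts` call, which also makes missing answers visible. The three counts above sum to 54806 of 65437 rows, so the remainder presumably left RemoteWork blank. A sketch on invented rows:

```python
import pandas as pd

# Invented rows for illustration; None stands for a skipped question.
remote = pd.Series(
    ["Remote", "In-person", None, "Hybrid (some remote, some in-person)", "Remote"],
    name="RemoteWork",
)

# dropna=False keeps a separate bucket for respondents who did not answer.
counts = remote.value_counts(dropna=False)
print(counts)

# Shares among respondents who did answer.
shares = remote.value_counts(normalize=True) * 100
print(shares.round(1))
```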

In [58]:
Works_with_python_mask = df.LanguageHaveWorkedWith.str.lower().str.contains('python', na = False)

In [59]:
df.loc[Works_with_python_mask].shape[0]

30795

Number of respondents who have worked with Python: 30795

In [61]:
Perc_work_with_python = (df.loc[Works_with_python_mask].shape[0])/ count_responds
print(Perc_work_with_python*100)

47.0605315035836


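About 47% of all respondents have worked with Python. The denominator is every respondent, including those who skipped the language question.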
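Two details affect this percentage: the choice of denominator, and the fact that a substring match counts any language whose name merely contains "python". The sketch below uses invented rows; the "MicroPython" entry is a hypothetical example of such a name and is an assumption, not something shown in the outputs above:

```python
import pandas as pd

# Invented rows for illustration; "MicroPython" is a hypothetical option name.
langs = pd.Series(
    ["Python;SQL", "JavaScript", None, "MicroPython;C"],
    name="LanguageHaveWorkedWith",
)

# Substring match, equivalent to the mask used above.
substring_mask = langs.str.contains("python", case=False, na=False)

# Exact match on the semicolon-separated items.
exact_mask = langs.str.split(";").apply(
    lambda items: isinstance(items, list) and "Python" in items
)

share_of_all = substring_mask.mean() * 100                      # every row
share_of_answered = substring_mask[langs.notna()].mean() * 100  # answered rows only
print(share_of_all, round(share_of_answered, 1), exact_mask.sum())
```

Taking the mean of a boolean mask gives the share directly, without dividing two separately computed counts.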

In [62]:
df.LearnCode.unique()

array(['Books / Physical media',
       'Books / Physical media;Colleague;On the job training;Other online resources (e.g., videos, blogs, forum, online community)',
       'Books / Physical media;Colleague;On the job training;Other online resources (e.g., videos, blogs, forum, online community);School (i.e., University, College, etc)',
       'Other online resources (e.g., videos, blogs, forum, online community);School (i.e., University, College, etc);Online Courses or Certification',
       'Other online resources (e.g., videos, blogs, forum, online community)',
       'School (i.e., University, College, etc);Online Courses or Certification',
       'Other online resources (e.g., videos, blogs, forum, online community);Online Courses or Certification;Coding Bootcamp',
       'Books / Physical media;Other online resources (e.g., videos, blogs, forum, online community);Online Courses or Certification',
       'On the job training;Other online resources (e.g., videos, blogs, forum, onli

In [63]:
Learned_online_mask = df.LearnCode.str.lower().str.contains('online', na = False)

In [64]:
df.loc[Learned_online_mask].shape[0]

54061

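54061 respondents mention some online resource in LearnCode. The mask matches any answer containing "online", so this figure also includes "Other online resources (e.g., videos, blogs, forum, online community)", not only "Online Courses or Certification". The number who learned specifically through online courses is therefore at most 54061.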
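A stricter count would match the literal option "Online Courses or Certification". A minimal sketch on invented rows that reuse option strings from the `unique()` output above; `regex=False` treats the pattern as plain text:

```python
import pandas as pd

# Invented rows built from option strings seen in the output above.
learn = pd.Series(
    [
        "Books / Physical media",
        "Other online resources (e.g., videos, blogs, forum, online community)",
        "School (i.e., University, College, etc);Online Courses or Certification",
        None,
    ],
    name="LearnCode",
)

# Broad match: any answer containing "online" (as in the mask above).
any_online = learn.str.contains("online", case=False, na=False)

# Narrow match: only the online-courses option.
online_courses = learn.str.contains(
    "Online Courses or Certification", regex=False, na=False
)

print(any_online.sum(), online_courses.sum())
```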

In [71]:
df[Works_with_python_mask].groupby (by = 'Country').agg({'ConvertedCompYearly' : ['mean', 'median']})

Unnamed: 0_level_0,ConvertedCompYearly,ConvertedCompYearly
Unnamed: 0_level_1,mean,median
Country,Unnamed: 1_level_2,Unnamed: 2_level_2
Afghanistan,4543.000000,4768.5
Albania,56295.000000,56295.0
Algeria,9053.285714,6230.0
Andorra,193331.000000,193331.0
Angola,6.000000,6.0
...,...,...
"Venezuela, Bolivarian Republic of...",21500.000000,7100.0
Viet Nam,14014.562500,10180.0
Yemen,10297.333333,5333.0
Zambia,28123.666667,22803.0


Mean and median yearly compensation (ConvertedCompYearly) among Python users, by country
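Some rows in this table look like they rest on very few answers (for example, a mean of 6.0 for Angola, or identical mean and median values). Adding a count to the aggregation makes that visible and allows filtering. A sketch on invented rows; the `min_responses` threshold is a hypothetical choice:

```python
import pandas as pd

# Invented rows for illustration; country names are placeholders.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B"],
    "ConvertedCompYearly": [50000.0, 70000.0, None, 6.0],
})

# count ignores NaN, so it is the number of respondents who reported compensation.
stats = toy.groupby("Country")["ConvertedCompYearly"].agg(["mean", "median", "count"])

min_responses = 2  # hypothetical threshold, not taken from the analysis
reliable = stats[stats["count"] >= min_responses].sort_values("median", ascending=False)
print(reliable)
```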

In [72]:
selected_df = df[['ConvertedCompYearly', 'EdLevel']] #selecting 2 columns

In [73]:
sorted_df = selected_df.sort_values(by='ConvertedCompYearly', ascending=False) #sorting the rows by 'ConvertedCompYearly', highest first

In [74]:
sorted_df.head(5) #showing top 5 results

Unnamed: 0,ConvertedCompYearly,EdLevel
15837,16256603.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)"
12723,13818022.0,"Professional degree (JD, MD, Ph.D, Ed.D, etc.)"
28379,9000000.0,"Professional degree (JD, MD, Ph.D, Ed.D, etc.)"
17593,6340564.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)"
17672,4936778.0,"Professional degree (JD, MD, Ph.D, Ed.D, etc.)"


The education levels of the top 5 highest-paid respondents: three hold a professional degree (JD, MD, Ph.D, Ed.D, etc.) and two hold a bachelor's degree.
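The select, sort, and head steps can be combined with `nlargest`, which places rows with missing compensation last. A sketch on invented rows with shortened, made-up education labels:

```python
import pandas as pd

# Invented rows for illustration; the labels are shortened placeholders.
toy = pd.DataFrame({
    "ConvertedCompYearly": [65000.0, None, 9000000.0, 120000.0],
    "EdLevel": ["Bachelor", "Master", "Professional", "Bachelor"],
})

# Two highest-paid rows, highest first.
top2 = toy.nlargest(2, "ConvertedCompYearly")[["ConvertedCompYearly", "EdLevel"]]
print(top2)

# Tally of education levels among those rows.
ed_counts = top2["EdLevel"].value_counts()
print(ed_counts)
```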

In [75]:
df['worked_with_python'] = Works_with_python_mask #creating new column based on a mask

In [76]:
temp_table = df.groupby('Age').agg({'ResponseId' : 'count', 'worked_with_python' : 'sum'}) 
#making a new table grouped by 'Age': the number of responses and the number of respondents who worked with Python in each group

In [79]:
responders_who_worked_with_python = (temp_table.worked_with_python/ temp_table.ResponseId)*100 #dividing one value by another to get the percentage 
responders_who_worked_with_python

Age
18-24 years old       55.922826
25-34 years old       45.773912
35-44 years old       41.520546
45-54 years old       41.910706
55-64 years old       40.427184
65 years or older     37.564767
Prefer not to say     45.341615
Under 18 years old    64.875389
dtype: float64

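Percentage of respondents in each age group who have worked with Python. The share is highest among respondents under 18 (about 65%) and aged 18-24 (about 56%), and lowest among those 65 or older (about 38%).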
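Because the mean of a boolean column is the share of True values, the count and sum steps can be folded into one aggregation. A sketch on invented rows:

```python
import pandas as pd

# Invented rows for illustration.
toy = pd.DataFrame({
    "Age": ["18-24 years old", "18-24 years old", "35-44 years old", "35-44 years old"],
    "worked_with_python": [True, True, False, True],
})

# Share of Python users per age group, as a percentage.
pct_python_by_age = (
    toy.groupby("Age")["worked_with_python"].mean().mul(100).round(1)
)
print(pct_python_by_age)
```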

In [80]:
df[(df.ConvertedCompYearly > df.ConvertedCompYearly.quantile(0.75)) & (df.RemoteWork == 'Remote')].Industry.value_counts()
#keeping respondents whose yearly compensation is above the 75th percentile (computed over all respondents) and who work remotely, then counting them per industry

Industry
Software Development                          768
Other:                                        239
Healthcare                                    156
Fintech                                       156
Internet, Telecomm or Information Services    145
Retail and Consumer Services                  106
Media & Advertising Services                  103
Banking/Financial Services                     69
Government                                     69
Computer Systems Design and Services           69
Transportation, or Supply Chain                67
Insurance                                      50
Manufacturing                                  48
Higher Education                               42
Energy                                         36
Name: count, dtype: int64

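The most common industries among remote respondents whose yearly compensation is above the 75th percentile. Software Development leads with 768 respondents, followed by "Other:" (239), Healthcare (156), and Fintech (156).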
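The filter reads more easily when the threshold is named first. A sketch on invented rows; as in the cell above, the 75th percentile is taken over all respondents with a reported compensation, not only remote ones:

```python
import pandas as pd

# Invented rows for illustration.
toy = pd.DataFrame({
    "ConvertedCompYearly": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    "RemoteWork": ["Remote", "In-person", "Remote", "Remote",
                   "In-person", "Remote", "Remote", "Remote"],
    "Industry": ["Energy", "Energy", "Insurance", "Insurance",
                 "Fintech", "Healthcare", "Software Development", "Fintech"],
})

# 75th percentile of compensation; quantile() skips NaN by default.
threshold = toy["ConvertedCompYearly"].quantile(0.75)

top_remote = toy[
    (toy["ConvertedCompYearly"] > threshold) & (toy["RemoteWork"] == "Remote")
]
industry_counts = top_remote["Industry"].value_counts()
print(threshold)
print(industry_counts)
```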