# Case Study

## Analyzing the Stack Overflow Developer Survey dataset

##### The dataset contains responses to an annual survey conducted by Stack Overflow.

In [1]:
import os

In [3]:
os.listdir()

['case_study.ipynb',
 'climate.csv',
 'climate_results.txt',
 'italy-covid-daywise.csv',
 'kerala.csv',
 'locations.csv',
 'numpy.ipynb',
 'pandas.ipynb',
 'predictive_model.ipynb',
 'Pyspark-With-Python-main',
 'results.csv',
 'survey_results_public.csv',
 'survey_results_schema.csv']

In [4]:
import pandas as pd

In [5]:
survey_raw_df = pd.read_csv('survey_results_public.csv')

In [6]:
survey_raw_df

,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
0,1,I am a developer by profession,Yes,,13,Monthly,,,Germany,European Euro,...,Neither easy nor difficult,Appropriate in length,No,"Computer science, computer engineering, or sof...",ASP.NET Core,ASP.NET;ASP.NET Core,Just as welcome now as I felt last year,50.0,36,27
1,2,I am a developer by profession,No,,19,,,,United Kingdom,Pound sterling,...,,,,"Computer science, computer engineering, or sof...",,,Somewhat more welcome now than last year,,7,4
2,3,I code primarily as a hobby,Yes,,15,,,,Russian Federation,,...,Neither easy nor difficult,Appropriate in length,,,,,Somewhat more welcome now than last year,,4,
3,4,I am a developer by profession,Yes,25.0,18,,,,Albania,Albanian lek,...,,,No,"Computer science, computer engineering, or sof...",,,Somewhat less welcome now than last year,40.0,7,4
4,5,"I used to be a developer by profession, but no...",Yes,31.0,16,,,,United States,,...,Easy,Too short,No,"Computer science, computer engineering, or sof...",Django;Ruby on Rails,Ruby on Rails,Just as welcome now as I felt last year,,15,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64456,64858,,Yes,,16,,,,United States,,...,,,,"Computer science, computer engineering, or sof...",,,,,10,Less than 1 year
64457,64867,,Yes,,,,,,Morocco,,...,,,,,,,,,,
64458,64898,,Yes,,,,,,Viet Nam,,...,,,,,,,,,,
64459,64925,,Yes,,,,,,Poland,,...,,,,,Angular;Angular.js;React.js,,,,,


##### The dataset contains over 64,000 responses to 60 questions (although many questions are optional). The responses have been anonymized to remove personally identifiable information, and each respondent has been assigned a randomized respondent ID.

In [7]:
survey_raw_df.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
       'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
       'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
       'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'MiscTechDesireNextYear', 'MiscTechWorkedWith',
       'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
       'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
       'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
       'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
       'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
       'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
       'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
       'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
  

In [8]:
schema_fname = 'survey_results_schema.csv'
schema_raw = pd.read_csv(schema_fname, index_col='Column').QuestionText

In [9]:
schema_raw

Column
Respondent            Randomized respondent ID number (not in order ...
MainBranch            Which of the following options best describes ...
Hobbyist                                        Do you code as a hobby?
Age                   What is your age (in years)? If you prefer not...
Age1stCode            At what age did you write your first line of c...
                                            ...                        
WebframeWorkedWith    Which web frameworks have you done extensive d...
WelcomeChange         Compared to last year, how welcome do you feel...
WorkWeekHrs           On average, how many hours per week do you wor...
YearsCode             Including any education, how many years have y...
YearsCodePro          NOT including education, how many years have y...
Name: QuestionText, Length: 61, dtype: object

In [11]:
schema_raw['YearsCodePro']

'NOT including education, how many years have you coded professionally (as a part of your work)?'
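Reading the schema with `index_col='Column'` and selecting `QuestionText` yields a Series keyed by column name, so the full question text for any column can be looked up by label. A minimal sketch of the same pattern, using a two-row stand-in for the schema file (the CSV text and the names `schema_text` and `toy_schema` are illustrative; the question wording is copied from the outputs above):

```python
import io
import pandas as pd

# Two-row stand-in for survey_results_schema.csv (illustrative only)
schema_text = '''Column,QuestionText
Hobbyist,Do you code as a hobby?
YearsCodePro,"NOT including education, how many years have you coded professionally (as a part of your work)?"
'''

# Same call as in the notebook, reading from an in-memory buffer instead of a file
toy_schema = pd.read_csv(io.StringIO(schema_text), index_col='Column').QuestionText

# Look up one question by column name
print(toy_schema['Hobbyist'])

# Look up several at once by passing a list of column names
print(toy_schema[['Hobbyist', 'YearsCodePro']])
```

Passing a list of column names returns a smaller Series in the same order as the list, which is how a subset of the schema can be kept alongside a subset of the survey columns.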

## Data Preparation & Cleaning

##### The analysis is limited to the following areas:

- Demographics of the survey respondents and the global programming community.
- Distribution of programming skills, experience, and preferences.
- Employment-related information, preferences, and opinions.

In [12]:
selected_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]

In [13]:
len(selected_columns)

20

In [27]:
survey_df = survey_raw_df[selected_columns].copy()

In [28]:
schema = schema_raw[selected_columns]

In [29]:
survey_df.shape

(64461, 20)

In [30]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 64072 non-null  object 
 1   Age                     45446 non-null  float64
 2   Gender                  50557 non-null  object 
 3   EdLevel                 57431 non-null  object 
 4   UndergradMajor          50995 non-null  object 
 5   Hobbyist                64416 non-null  object 
 6   Age1stCode              57900 non-null  object 
 7   YearsCode               57684 non-null  object 
 8   YearsCodePro            46349 non-null  object 
 9   LanguageWorkedWith      57378 non-null  object 
 10  LanguageDesireNextYear  54113 non-null  object 
 11  NEWLearn                56156 non-null  object 
 12  NEWStuck                54983 non-null  object 
 13  Employment              63854 non-null  object 
 14  DevType                 49370 non-null

In [31]:
survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')
survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
survey_df['YearsCodePro'] = pd.to_numeric(
    survey_df.YearsCodePro, errors='coerce')
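With `errors='coerce'`, `pd.to_numeric` turns any value it cannot parse into NaN instead of raising an error. The three columns are stored as `object` because some answers are text, such as `Less than 1 year` in the `YearsCodePro` column shown earlier. A small sketch on made-up values (the Series `years` is illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative answers: two numeric strings, one text answer, one missing value
years = pd.Series(['27', '4', 'Less than 1 year', None])

# Unparseable entries become NaN; the result is a float column
converted = pd.to_numeric(years, errors='coerce')

print(converted)
print(converted.isna().sum())  # number of entries that are now missing
```

Text answers are therefore counted as missing afterwards, which is why the `YearsCodePro` count in the `describe()` output that follows (44133) is lower than its non-null count in `info()` (46349).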


In [32]:
survey_df.describe()

             Age  Age1stCode  YearsCode  YearsCodePro  WorkWeekHrs
count    45446.0     57473.0    56784.0       44133.0      41151.0
mean   30.834111   15.476572  12.782051      8.869667    40.782174
std     9.585392    5.114081   9.490657      7.759961    17.816383
min          1.0         5.0        1.0           1.0          1.0
25%         24.0        12.0        6.0           3.0         40.0
50%         29.0        15.0       10.0           6.0         40.0
75%         35.0        18.0       17.0          12.0         44.0
max        279.0        85.0       50.0          50.0        475.0


In [33]:
# Discard rows with implausible ages; describe() above reports a minimum of 1 and a maximum of 279
survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
survey_df.drop(survey_df[survey_df.Age > 100].index, inplace=True)

In [34]:
# Discard rows reporting more than 140 working hours a week (20 hours a day, 7 days a week)
survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)
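The cells above remove outliers by looking up the index labels of the offending rows and dropping them. One equivalent way to write the same filter is to build a single boolean mask and keep the rows that do not match it. The sketch below uses made-up rows (the frame `toy` and the names `bad_rows` and `cleaned` are illustrative); only the thresholds come from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; thresholds match the cells above
toy = pd.DataFrame({
    'Age': [25.0, 1.0, 279.0, np.nan, 40.0],
    'WorkWeekHrs': [40.0, 40.0, 40.0, 50.0, 475.0],
})

# True for every row the drop() calls would remove
bad_rows = (toy.Age < 10) | (toy.Age > 100) | (toy.WorkWeekHrs > 140)

# Keep the remaining rows without modifying the original frame
cleaned = toy[~bad_rows]
print(cleaned)
```

In both versions, rows with a missing `Age` or `WorkWeekHrs` are kept, because a comparison against NaN evaluates to False.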

In [35]:
survey_df['Gender'].value_counts()

Man                                                            45895
Woman                                                           3835
Non-binary, genderqueer, or gender non-conforming                385
Man;Non-binary, genderqueer, or gender non-conforming            121
Woman;Non-binary, genderqueer, or gender non-conforming           92
Woman;Man                                                         73
Woman;Man;Non-binary, genderqueer, or gender non-conforming       25
Name: Gender, dtype: int64

In [36]:
import numpy as np

In [39]:
survey_df.where(~(survey_df.Gender.str.contains(
    ';', na=False)), np.nan, inplace=True)
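Because the mask is passed to `DataFrame.where`, it is applied to every column: each row whose `Gender` value contains `;` becomes NaN across the whole row, not only in the `Gender` column. If the aim were to blank only the multi-option `Gender` values and keep the rest of the response, a column-level assignment is one alternative. The sketch below contrasts the two on made-up rows (the frame `toy` and the names `multi`, `whole_row`, and `gender_only` are illustrative); it is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    'Country': ['Germany', 'India', 'Poland'],
    'Gender': ['Man', 'Woman;Man', np.nan],
})

# True where more than one option was selected; missing values count as False
multi = toy.Gender.str.contains(';', na=False)

# Whole-row approach, as in the cell above: every column of a matching row is blanked
whole_row = toy.where(~multi, np.nan)

# Column-level alternative: only the Gender value is blanked
gender_only = toy.copy()
gender_only.loc[multi, 'Gender'] = np.nan

print(whole_row)
print(gender_only)
```

Which variant is appropriate depends on whether the other answers from those respondents should stay in the analysis.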


In [41]:
survey_df.sample(10)

,Country,Age,Gender,EdLevel,UndergradMajor,Hobbyist,Age1stCode,YearsCode,YearsCodePro,LanguageWorkedWith,LanguageDesireNextYear,NEWLearn,NEWStuck,Employment,DevType,WorkWeekHrs,JobSat,JobFactors,NEWOvertime,NEWEdImpt
22650,Germany,25.0,Man,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Computer science, computer engineering, or sof...",Yes,15.0,8.0,,HTML/CSS;Python,C++;Python,Once a year,Visit Stack Overflow;Watch help / tutorial videos,Employed part-time,Data scientist or machine learning specialist;...,,,Industry that I’d be working in;Flex time or a...,,
4443,Uruguay,33.0,Man,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Yes,16.0,16.0,6.0,HTML/CSS;Java;JavaScript;Kotlin;SQL,HTML/CSS;Java;JavaScript;Kotlin;SQL;TypeScript,Once every few years,Call a coworker or friend;Visit Stack Overflow...,Employed full-time,"Developer, back-end;Developer, desktop or ente...",40.0,Slightly satisfied,Flex time or a flexible schedule;Remote work o...,Sometimes: 1-2 days per month but less than we...,Fairly important
8646,India,,Man,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Mathematics or statistics,Yes,18.0,3.0,3.0,Swift,Swift,,Call a coworker or friend;Watch help / tutoria...,Employed full-time,"Developer, mobile",,Slightly satisfied,,Never,Not at all important/not necessary
26286,Russian Federation,35.0,Man,Some college/university study without earning ...,"Information systems, information technology, o...",Yes,12.0,15.0,10.0,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Python,Bash/Shell/PowerShell;Dart;Go;Kotlin;Python;Rust,Once a year,Visit Stack Overflow;Go for a walk or other ph...,"Independent contractor, freelancer, or self-em...","Data or business analyst;Developer, back-end;E...",20.0,Very satisfied,Remote work options;Financial performance or f...,Occasionally: 1-2 days per quarter but less th...,Not at all important/not necessary
4002,United States,26.0,Woman,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Computer science, computer engineering, or sof...",Yes,18.0,8.0,3.0,Bash/Shell/PowerShell;C;C++;Java;Python;Swift,C;C#;C++;Python,Once a year,Visit Stack Overflow;Panic;Watch help / tutori...,Employed full-time,Data scientist or machine learning specialist;...,40.0,Neither satisfied nor dissatisfied,Diversity of the company or organization;How w...,Occasionally: 1-2 days per quarter but less th...,Very important
34257,United Kingdom,25.0,Man,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Computer science, computer engineering, or sof...",Yes,11.0,7.0,3.0,HTML/CSS;JavaScript;Ruby;SQL;TypeScript,HTML/CSS;Kotlin;Python;Ruby;SQL;TypeScript,Once a year,Call a coworker or friend;Visit Stack Overflow...,Employed full-time,"Developer, back-end;Developer, front-end;Devel...",37.5,Slightly dissatisfied,"Languages, frameworks, and other technologies ...",Occasionally: 1-2 days per quarter but less th...,Somewhat important
58079,Mexico,,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",No,15.0,21.0,14.0,,,,,Employed part-time,"Database administrator;Developer, full-stack;S...",27.5,Very dissatisfied,,Sometimes: 1-2 days per month but less than we...,Critically important
35283,Poland,,,,,Yes,,,,Bash/Shell/PowerShell;C;Go;Java,Bash/Shell/PowerShell;C;Go;Haskell;R,Every few months,Do other work and come back later,Student,,,,,,
56109,Belgium,23.0,Man,Some college/university study without earning ...,"Computer science, computer engineering, or sof...",Yes,18.0,3.0,,Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;P...,Bash/Shell/PowerShell;C#;Go;HTML/CSS;Python;Ty...,Every few months,Visit Stack Overflow;Go for a walk or other ph...,Employed full-time,"Database administrator;Developer, back-end;Dev...",42.0,Slightly satisfied,,Often: 1-2 days per week or more,Fairly important
30387,Germany,47.0,Man,"Associate degree (A.A., A.S., etc.)","Computer science, computer engineering, or sof...",No,14.0,19.0,16.0,Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...,Java;JavaScript;Kotlin;Python;TypeScript,Once every few years,Call a coworker or friend;Visit Stack Overflow...,Employed full-time,"Developer, back-end;Developer, desktop or ente...",40.0,Slightly dissatisfied,Flex time or a flexible schedule;Office enviro...,Sometimes: 1-2 days per month but less than we...,Critically important
