
# **Stack Overflow Survey Data 2019: Data Transformation & Analysis Using Pandas**

Link to dataset: https://survey.stackoverflow.co/



---



# Dataset Loading

In [1]:
import pandas as pd
import numpy as np

In [31]:
# Reading the dataset and schema files from Colab's 'Files' tab

data_df = pd.read_csv('/content/survey_results_public_SOF2019.csv')
schema_df = pd.read_csv('/content/survey_results_schema_SOF2019.csv')   # schema contains info about column names of 'data_df'

In [32]:
# Checking the number of rows and columns in the two files

d_rows, d_cols = data_df.shape
s_rows, s_cols = schema_df.shape
print('data_df has', d_rows, 'rows and', d_cols, 'columns.')
print('schema_df has', s_rows, 'rows and', s_cols, 'columns.')

data_df has 88883 rows and 85 columns.
schema_df has 85 rows and 2 columns.


In [5]:
# Glancing at the survey data

data_df.head(5)

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy


In [6]:
# Glancing at the schema

schema_df.head(5)

Unnamed: 0,Column,QuestionText
0,Respondent,Randomized respondent ID number (not in order ...
1,MainBranch,Which of the following options best describes ...
2,Hobbyist,Do you code as a hobby?
3,OpenSourcer,How often do you contribute to open source?
4,OpenSource,How do you feel about the quality of open sour...


# Dataset Transformation & Cleaning

In [33]:
# Setting index columns in both dataframes

data_df.set_index('Respondent', inplace = True)
schema_df.set_index('Column', inplace = True)

In [5]:
data_df.head(2)

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult


In [6]:
schema_df.head(3)

Column,QuestionText
Respondent,Randomized respondent ID number (not in order ...
MainBranch,Which of the following options best describes ...
Hobbyist,Do you code as a hobby?


**By now it should be evident that the** 'QuestionText' **column of** 'schema_df' **holds the survey question behind each column of** 'data_df'**.**
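The schema previews above cut each question off with `...`. The sketch below shows two ways to read a question in full: a label lookup with `.loc`, and a temporary display option. `toy_schema` is a hypothetical two-row stand-in for 'schema_df'; its two question strings are the ones that appear untruncated in the schema output above.

```python
import pandas as pd

# Hypothetical stand-in for 'schema_df' (only two rows, for illustration)
toy_schema = pd.DataFrame({
    'Column': ['Hobbyist', 'OpenSourcer'],
    'QuestionText': ['Do you code as a hobby?',
                     'How often do you contribute to open source?'],
}).set_index('Column')

# .loc[row_label, column_label] returns the single cell as a plain string
question = toy_schema.loc['OpenSourcer', 'QuestionText']
print(question)

# Temporarily lift the column-width limit so long text is not shortened in the preview
with pd.option_context('display.max_colwidth', None):
    print(toy_schema)
```

The `option_context` block restores the previous display setting on exit, so other cells keep their compact previews.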

In [34]:
# Making all columns of 'data_df' visible in the output to glance at

pd.set_option('display.max_columns', d_cols)
data_df.head(2)

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",,,4.0,10,,,,,,,,,,,,,,,,,,,,,,,,,,,,,HTML/CSS;Java;JavaScript;Python,C;C++;C#;Go;HTML/CSS;Java;JavaScript;Python;SQL,SQLite,MySQL,MacOS;Windows,Android;Arduino;Windows,Django;Flask,Flask;jQuery,Node.js,Node.js,IntelliJ;Notepad++;PyCharm,Windows,I do not use containers,,,Yes,"Fortunately, someone else has that title",Yes,Twitter,Online,Username,2017,A few times per month or weekly,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,31-60 minutes,No,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are",Neutral,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,,"Developer, desktop or enterprise applications;...",,17,,,,,,,I am actively looking for a job,I've never had a job,,,Financial performance or funding status of the...,"Something else changed (education, award, medi...",,,,,,,,,,,,,,,,,C++;HTML/CSS;Python,C++;HTML/CSS;JavaScript;SQL,,MySQL,Windows,Windows,Django,Django,,,Atom;PyCharm,Windows,I do not use containers,,Useful across many domains and could change ma...,Yes,Yes,Yes,Instagram,Online,Username,2017,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,11-30 minutes,Yes,A few times per month or weekly,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult


**After going through all the columns in the output above, we drop the columns that are irrelevant to our analysis.**

In [None]:
cols_to_drop = ['OpenSourcer', 'OpenSource', 'EduOther', 'OrgSize', 'CareerSat', 'JobSat',
                'MgrIdiot', 'MgrMoney', 'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
                'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc', 'CompTotal', 'CompFreq',
                'WorkWeekHrs', 'WorkPlan', 'WorkChallenge', 'CodeRev', 'CodeRevHrs', 'UnitTests',
                'PurchaseHow', 'PurchaseWhat', 'WebFrameWorkedWith', 'WebFrameDesireNextYear', 'MiscTechWorkedWith',
                'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers', 'BlockchainOrg', 'BlockchainIs',
                'BetterLife', 'ITperson', 'OffOn', 'Extraversion', 'ScreenName', 'SOVisit1st', 'SOVisitFreq', 'SOVisitTo',
                'SOFindAnswer', 'SOTimeSaved', 'SOHowMuchTime', 'SOAccount', 'SOPartFreq', 'SOJobs', 'EntTeams',
                'SOComm', 'WelcomeChange', 'SONewContent', 'Age', 'Gender', 'Trans', 'Sexuality', 'Ethnicity', 'Dependents']

# len(cols_to_drop) # o/p: 60
data_df.drop(columns = cols_to_drop, inplace = True)
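Re-running the drop cell above in the same session raises a `KeyError`, because the listed columns are already gone. A minimal sketch of a re-runnable variant follows. It uses `errors='ignore'`, a documented parameter of `DataFrame.drop`. `toy_df` and `toy_cols_to_drop` are hypothetical stand-ins with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for 'data_df'; column names come from the survey, values are made up
toy_df = pd.DataFrame({'Hobbyist': ['Yes', 'No'],
                       'Age': [14.0, 19.0],
                       'Gender': ['Man', 'Man']})
toy_cols_to_drop = ['Age', 'Gender']

# errors='ignore' skips labels that no longer exist instead of raising KeyError
toy_df = toy_df.drop(columns=toy_cols_to_drop, errors='ignore')
toy_df = toy_df.drop(columns=toy_cols_to_drop, errors='ignore')  # second run is a no-op

print(list(toy_df.columns))
```

The trade-off is that a misspelled column name is silently skipped as well, so it is worth checking the resulting shape afterwards.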

In [38]:
# Checking the 'data_df' frame's columns

pd.set_option('display.max_columns', d_cols)
# data_df.shape  # o/p: (88883, 24)
data_df.head(2)

Respondent,MainBranch,Hobbyist,Employment,Country,Student,EdLevel,UndergradMajor,DevType,YearsCode,Age1stCode,YearsCodePro,ConvertedComp,WorkRemote,WorkLoc,ImpSyn,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,SocialMedia,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,,4.0,10,,,,,,HTML/CSS;Java;JavaScript;Python,C;C++;C#;Go;HTML/CSS;Java;JavaScript;Python;SQL,SQLite,MySQL,MacOS;Windows,Android;Arduino;Windows,Twitter,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,No,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,"Developer, desktop or enterprise applications;...",,17,,,,,,C++;HTML/CSS;Python,C++;HTML/CSS;JavaScript;SQL,,MySQL,Windows,Windows,Instagram,Appropriate in length,Neither easy nor difficult


**Now, let us get more clarity on what the remaining column names mean. For that, we look up their descriptions in** 'schema_df'.

In [39]:
final_cols = ['EdLevel', 'UndergradMajor', 'DevType', 'YearsCode', 'Age1stCode', 'YearsCodePro', 'ImpSyn', 'SocialMedia' ]
for col in final_cols:
  print(col, ' : ', schema_df.loc[col, 'QuestionText'])

EdLevel  :  Which of the following best describes the highest level of formal education that you’ve completed?
UndergradMajor  :  What was your main or most important field of study?
DevType  :  Which of the following describe you? Please select all that apply.
YearsCode  :  Including any education, how many years have you been coding?
Age1stCode  :  At what age did you write your first line of code or program? (E.g., webpage, Hello World, Scratch project)
YearsCodePro  :  How many years have you coded professionally (as a part of your work)?
ImpSyn  :  For the specific work you do, and the years of experience you have, how do you rate your own level of competence?
SocialMedia  :  What social media site do you use the most?


**Altering some column names to make the context more obvious.**

In [40]:
data_df.rename(columns = {'EdLevel' : 'HighestEdLevel', 'DevType' : 'YourDevType', 'YearsCode' : 'CodingExp',
                          'Age1stCode' : 'CodingSinceAge', 'YearsCodePro' : 'ProCodingExp', 'ConvertedComp' : 'SalaryUSD',
                          'WorkRemote' : 'RemoteWorkFreq', 'ImpSyn' : 'SelfCompetenceLevel',
                          'SocialMedia' : 'MainSocialMedia'}, inplace = True)
data_df.head(3)

Respondent,MainBranch,Hobbyist,Employment,Country,Student,HighestEdLevel,UndergradMajor,YourDevType,CodingExp,CodingSinceAge,ProCodingExp,SalaryUSD,RemoteWorkFreq,WorkLoc,SelfCompetenceLevel,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,MainSocialMedia,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,,4.0,10,,,,,,HTML/CSS;Java;JavaScript;Python,C;C++;C#;Go;HTML/CSS;Java;JavaScript;Python;SQL,SQLite,MySQL,MacOS;Windows,Android;Arduino;Windows,Twitter,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,No,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,"Developer, desktop or enterprise applications;...",,17,,,,,,C++;HTML/CSS;Python,C++;HTML/CSS;JavaScript;SQL,,MySQL,Windows,Windows,Instagram,Appropriate in length,Neither easy nor difficult
3,"I am not primarily a developer, but I write co...",Yes,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,"Designer;Developer, back-end;Developer, front-...",3.0,22,1.0,8820.0,Less than once per month / Never,Home,Average,HTML/CSS,Elixir;HTML/CSS,PostgreSQL,PostgreSQL,,,Reddit,Appropriate in length,Neither easy nor difficult


In [65]:
# Check the no. of NaN/None vals in each column

[data_df.isna().sum()] # brackets make output more compactly visible

[MainBranch                  552
 Hobbyist                      0
 OpenSourcer                   0
 Employment                 1702
 Country                     132
 Student                    1869
 HighestEdLevel             2493
 UndergradMajor            13269
 YourDevType                7548
 CodingExp                   945
 CodingSinceAge             1249
 ProCodingExp              14552
 SalaryUSD                 33060
 RemoteWorkFreq            18599
 WorkLoc                   18828
 SelfCompetenceLevel       17104
 LanguageWorkedWith         1314
 LanguageDesireNextYear     4795
 DatabaseWorkedWith        12857
 DatabaseDesireNextYear    19736
 PlatformWorkedWith         8169
 PlatformDesireNextYear    11440
 MainSocialMedia            4446
 SurveyLength               1899
 SurveyEase                 1802
 dtype: int64]

**The output above shows that many columns contain a significant number of** NaN **values. Dropping every row that has a** NaN **in one go would discard much of the data and greatly affect our analysis later on. So, we'll transform the values one column at a time.**
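To make the cost of a blanket `dropna()` concrete, the sketch below computes the share of missing values per column and counts how many rows would survive. `toy_df` is a hypothetical four-row frame with made-up values, not the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror 'data_df', values are made up
toy_df = pd.DataFrame({'Country': ['India', np.nan, 'Spain', 'Canada'],
                       'SalaryUSD': [np.nan, np.nan, 8820.0, np.nan]})

# isna() gives True/False per cell; the mean of booleans is the fraction of True values
nan_share = toy_df.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(nan_share)

# dropna() with default arguments removes every row that has at least one NaN
rows_after_dropna = len(toy_df.dropna())
print(rows_after_dropna, 'of', len(toy_df), 'rows would remain')
```

Here a single sparse column ('SalaryUSD') is enough to eliminate three of the four rows, which is why the columns are handled individually.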

In [69]:
# Identifying unique values in column 'MainBranch'

data_df['MainBranch'].unique()

array(['I am a student who is learning to code',
       'I am not primarily a developer, but I write code sometimes as part of my work',
       'I am a developer by profession', 'I code primarily as a hobby',
       'I used to be a developer by profession, but no longer am', nan],
      dtype=object)

In [41]:
# Altering the values in 'MainBranch'

data_df['MainBranch'] = data_df['MainBranch'].replace({'I am a student who is learning to code' : 'Student',
                               'I am not primarily a developer, but I write code sometimes as part of my work' : 'Amateur',
                               'I am a developer by profession' : 'Pro Developer',
                               'I code primarily as a hobby' : 'Hobbyist',
                               'I used to be a developer by profession, but no longer am' : 'Retired Developer'})
data_df['MainBranch'].unique()

array(['Student', 'Amateur', 'Pro Developer', 'Hobbyist',
       'Retired Developer', nan], dtype=object)

**Repeating the above step for the other columns:**

In [79]:
# Column 'HighestEdLevel'

data_df['HighestEdLevel'].unique()

array(['Primary/elementary school',
       'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)',
       'Bachelor’s degree (BA, BS, B.Eng., etc.)',
       'Some college/university study without earning a degree',
       'Master’s degree (MA, MS, M.Eng., MBA, etc.)',
       'Other doctoral degree (Ph.D, Ed.D., etc.)', nan,
       'Associate degree', 'Professional degree (JD, MD, etc.)',
       'I never completed any formal education'], dtype=object)

In [42]:
data_df['HighestEdLevel'] = data_df['HighestEdLevel'].replace({'Primary/elementary school' : 'Primary',
                               'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)' : 'Secondary',
                               'Bachelor’s degree (BA, BS, B.Eng., etc.)' : 'Bachelors Degree',
                               'Some college/university study without earning a degree' : 'College/Uni & No Degree',
                               'Master’s degree (MA, MS, M.Eng., MBA, etc.)' : 'Masters Degree',
                               'Other doctoral degree (Ph.D, Ed.D., etc.)' : 'Doctoral Degree',
                                  'Associate degree' : 'Associate Degree', 'Professional degree (JD, MD, etc.)' : 'Professional Degree',
                                  'I never completed any formal education' : 'No formal education' })

data_df['HighestEdLevel'].unique()

array(['Primary', 'Secondary', 'Bachelors Degree',
       'College/Uni & No Degree', 'Masters Degree', 'Doctoral Degree',
       nan, 'Associate Degree', 'Professional Degree',
       'No formal education'], dtype=object)

In [100]:
# Column 'CodingExp'

data_df['CodingExp'].unique()

array(['4', nan, '3', '16', '13', '6', '8', '12', '2', '5', '17', '10',
       '14', '35', '7', 'Less than 1 year', '30', '9', '26', '40', '19',
       '15', '20', '28', '25', '1', '22', '11', '33', '50', '41', '18',
       '34', '24', '23', '42', '27', '21', '36', '32', '39', '38', '31',
       '37', 'More than 50 years', '29', '44', '45', '48', '46', '43',
       '47', '49'], dtype=object)

In [43]:
# The output above shows that every value in 'CodingExp' is a string, which numeric aggregate functions cannot process during analysis,
# so the text labels (and NaN) are replaced with numbers and then the column's dtype is changed to float.

data_df['CodingExp'] = data_df['CodingExp'].replace({'Less than 1 year' : 0, 'More than 50 years' : 51, np.nan : 0})  # np.nan: the 'np.NaN' alias was removed in NumPy 2.0
data_df['CodingExp'] = data_df['CodingExp'].astype(float)
data_df['CodingExp'].unique() # checking that the changes have taken effect

array([ 4.,  0.,  3., 16., 13.,  6.,  8., 12.,  2.,  5., 17., 10., 14.,
       35.,  7., 30.,  9., 26., 40., 19., 15., 20., 28., 25.,  1., 22.,
       11., 33., 50., 41., 18., 34., 24., 23., 42., 27., 21., 36., 32.,
       39., 38., 31., 37., 51., 29., 44., 45., 48., 46., 43., 47., 49.])
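Filling `NaN` with 0, as done above, puts respondents who skipped the question in the same bucket as 'Less than 1 year'. The sketch below is an alternative, not the notebook's method: it converts the text labels the same way but leaves missing answers as `NaN`, which pandas aggregates skip by default. `toy_exp` is a hypothetical five-value series.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw 'CodingExp' answers
toy_exp = pd.Series(['4', np.nan, 'Less than 1 year', '16', 'More than 50 years'])

# Same label-to-number substitutions as above, but NaN is left untouched;
# pd.to_numeric then parses the numeric strings into a float64 series
toy_exp_num = pd.to_numeric(
    toy_exp.replace({'Less than 1 year': '0', 'More than 50 years': '51'})
)

mean_skipping_nan = toy_exp_num.mean()          # NaN is ignored
mean_nan_as_zero = toy_exp_num.fillna(0).mean() # NaN counted as 0 years
print(mean_skipping_nan, mean_nan_as_zero)
```

Which treatment is appropriate depends on the analysis; the point is only that the two choices give different averages.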

In [113]:
# column 'CodingSinceAge'

data_df['CodingSinceAge'].unique()

array(['10', '17', '22', '16', '14', '15', '11', '20', '13', '18', '12',
       '19', '21', '8', '35', '6', '9', '29', '7', '5', '23', '30', nan,
       '27', '24', 'Younger than 5 years', '33', '25', '26', '39', '36',
       '38', '28', '31', 'Older than 85', '32', '37', '50', '65', '42',
       '34', '40', '67', '43', '44', '60', '46', '45', '49', '51', '41',
       '55', '83', '48', '53', '54', '47', '56', '79', '61', '68', '77',
       '66', '52', '80', '62', '84', '57', '58', '63'], dtype=object)

In [44]:
data_df['CodingSinceAge'] = data_df['CodingSinceAge'].replace({'Younger than 5 years' : 4, 'Older than 85' : 86, np.nan : 0})
data_df['CodingSinceAge'] = data_df['CodingSinceAge'].astype(float)
data_df['CodingSinceAge'].unique() # checking that the changes have taken effect

array([10., 17., 22., 16., 14., 15., 11., 20., 13., 18., 12., 19., 21.,
        8., 35.,  6.,  9., 29.,  7.,  5., 23., 30.,  0., 27., 24.,  4.,
       33., 25., 26., 39., 36., 38., 28., 31., 86., 32., 37., 50., 65.,
       42., 34., 40., 67., 43., 44., 60., 46., 45., 49., 51., 41., 55.,
       83., 48., 53., 54., 47., 56., 79., 61., 68., 77., 66., 52., 80.,
       62., 84., 57., 58., 63.])

In [118]:
# Column 'ProCodingExp'

data_df['ProCodingExp'].unique()

array([nan, '1', 'Less than 1 year', '9', '3', '4', '10', '8', '2', '13',
       '18', '5', '14', '22', '23', '19', '35', '20', '25', '7', '15',
       '27', '6', '48', '12', '31', '11', '17', '16', '21', '29', '30',
       '26', '33', '28', '37', '40', '34', '24', '39', '38', '36', '32',
       '41', '45', '43', 'More than 50 years', '44', '42', '46', '49',
       '50', '47'], dtype=object)

In [45]:
data_df['ProCodingExp'] = data_df['ProCodingExp'].replace({'Less than 1 year' : 0, 'More than 50 years' : 51, np.nan : 0})
data_df['ProCodingExp'] = data_df['ProCodingExp'].astype(float)
data_df['ProCodingExp'].unique() # checking that the changes have taken effect

array([ 0.,  1.,  9.,  3.,  4., 10.,  8.,  2., 13., 18.,  5., 14., 22.,
       23., 19., 35., 20., 25.,  7., 15., 27.,  6., 48., 12., 31., 11.,
       17., 16., 21., 29., 30., 26., 33., 28., 37., 40., 34., 24., 39.,
       38., 36., 32., 41., 45., 43., 51., 44., 42., 46., 49., 50., 47.])

In [143]:
# Quantifying the col 'SelfCompetenceLevel' on scale 0-10, 5 being average.

data_df['SelfCompetenceLevel'].unique()

array([nan, 'Average', 'A little below average', 'A little above average',
       'Far above average', 'Far below average'], dtype=object)

In [47]:
data_df['SelfCompetenceLevel'] = data_df['SelfCompetenceLevel'].replace({'Average' : 5,
                                                                         'A little below average' : 4,
                                                                         'A little above average' : 6,
                                                                         'Far above average' : 9,
                                                                         'Far below average' : 2, np.nan : 0})
data_df['SelfCompetenceLevel'] = data_df['SelfCompetenceLevel'].astype(float)
data_df['SelfCompetenceLevel'].unique() # checking that the changes have taken effect

array([0., 5., 4., 6., 9., 2.])

In [131]:
# Identifying values with non-English characters in column 'MainSocialMedia', then replacing the Chinese-script ones (the VK entry is left unchanged)

data_df['MainSocialMedia'].unique()

array(['Twitter', 'Instagram', 'Reddit', 'Facebook', 'YouTube', nan,
       'VK ВКонта́кте', 'WhatsApp', "I don't use social media",
       'WeChat 微信', 'LinkedIn', 'Snapchat', 'Weibo 新浪微博', 'Hello',
       'Youku Tudou 优酷'], dtype=object)

In [48]:
data_df['MainSocialMedia'] = data_df['MainSocialMedia'].replace({'Weibo 新浪微博' : 'Weibo', 'WeChat 微信' : 'WeChat', 'Youku Tudou 优酷' : 'Youku Tudou'})
data_df['MainSocialMedia'].unique()

array(['Twitter', 'Instagram', 'Reddit', 'Facebook', 'YouTube', nan,
       'VK ВКонта́кте', 'WhatsApp', "I don't use social media", 'WeChat',
       'LinkedIn', 'Snapchat', 'Weibo', 'Hello', 'Youku Tudou'],
      dtype=object)
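The replacement above lists each affected label by hand and leaves the VK entry as it was. A more general sketch follows. It assumes every label starts with its Latin-script name and strips all non-ASCII characters with a regular expression. `toy_media` is a hypothetical series holding a few of the labels seen above.

```python
import pandas as pd

# Hypothetical sample of 'MainSocialMedia' labels taken from the output above
toy_media = pd.Series(['Twitter', 'VK ВКонта́кте', 'WeChat 微信',
                       'Weibo 新浪微博', 'Youku Tudou 优酷'])

# Remove every run of non-ASCII characters, then trim the leftover whitespace
toy_media_ascii = (toy_media
                   .str.replace(r'[^\x00-\x7F]+', '', regex=True)
                   .str.strip())
print(toy_media_ascii.tolist())
```

This would also alter any label whose only name is in a non-Latin script, so the explicit dictionary is the safer choice when such labels may exist.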

In [139]:
# column 'SurveyEase'
data_df['SurveyEase'].unique()

array(['Neither easy nor difficult', 'Easy', 'Difficult', nan],
      dtype=object)

In [49]:
data_df['SurveyEase'] = data_df['SurveyEase'].replace({'Neither easy nor difficult' : 'Moderate'})
data_df['SurveyEase'].unique()

array(['Moderate', 'Easy', 'Difficult', nan], dtype=object)

In [151]:
# Altering the 'Employment' column's data

data_df['Employment'].unique()

array(['Not employed, and not looking for work',
       'Not employed, but looking for work', 'Employed full-time',
       'Independent contractor, freelancer, or self-employed', nan,
       'Employed part-time', 'Retired'], dtype=object)

In [50]:
data_df['Employment'] = data_df['Employment'].replace({'Not employed, and not looking for work' : 'Unemployed',
       'Not employed, but looking for work' : 'Job-seeker', 'Employed full-time' : 'Full-time',
       'Independent contractor, freelancer, or self-employed' : 'Self-employed',
       'Employed part-time' : 'Part-time'})
data_df['Employment'].unique()

array(['Unemployed', 'Job-seeker', 'Full-time', 'Self-employed', nan,
       'Part-time', 'Retired'], dtype=object)

**Dropping rows that have a significant number of** NaN **values.**
<p>Let us identify rows that have a significant count of NaN (say, more than 10) across the columns.</p>

In [53]:
# First, checking count of NaN in EACH row

nan_count_per_row = data_df.isna().sum(axis=1) # setting axis = 1 counts NaN in a row (i.e. left to right across cols)
nan_count_per_row

Respondent,0
1,5
2,5
3,2
4,0
5,1
...,...
88377,6
88601,19
88802,18
88816,18


In [54]:
# Fetching rows that have NaN in more than 10 columns (i.e. more than 45% of the columns in that row have no data)

data_df[nan_count_per_row > 10]

Respondent,MainBranch,Hobbyist,Employment,Country,Student,HighestEdLevel,UndergradMajor,YourDevType,CodingExp,CodingSinceAge,ProCodingExp,SalaryUSD,RemoteWorkFreq,WorkLoc,SelfCompetenceLevel,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,MainSocialMedia,SurveyLength,SurveyEase
11,Hobbyist,Yes,,Antigua and Barbuda,"Yes, full-time",Primary,,,2.0,11.0,0.0,,,,0.0,Other(s):,Other(s):,,,,,,Appropriate in length,Easy
199,Amateur,Yes,Job-seeker,Netherlands,,Doctoral Degree,Mathematics or statistics,Academic researcher;Data or business analyst;D...,35.0,0.0,0.0,,,,9.0,R;SQL,R;SQL,,,,,,,
294,Student,Yes,Part-time,Mexico,"Yes, full-time",,"Computer science, computer engineering, or sof...",,4.0,17.0,0.0,,,,0.0,,,,,,,,Too long,Easy
534,Pro Developer,Yes,Full-time,India,"Yes, full-time",Bachelors Degree,"Computer science, computer engineering, or sof...","Developer, full-stack",2.0,17.0,2.0,,,,0.0,,,,,,,,,
644,Student,Yes,,Netherlands,"Yes, full-time",Secondary,,,2.0,17.0,0.0,,,,0.0,,,,,,,Reddit,Appropriate in length,Moderate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88062,,No,,,,,,,0.0,0.0,0.0,,,,0.0,,,,,,,,,
88076,,No,Full-time,,,,,,0.0,0.0,0.0,,,,0.0,,,,,,,,,
88601,,No,,,,,,,0.0,0.0,0.0,,,,0.0,,,,,,,,,
88802,,No,Full-time,,,,,,0.0,0.0,0.0,,,,0.0,,,,,,,,,


In [73]:
# Dropping all the above rows from 'data_df'

data_df.drop(index = data_df[nan_count_per_row > 10].index, inplace = True)
data_df

Respondent,MainBranch,Hobbyist,Employment,Country,Student,HighestEdLevel,UndergradMajor,YourDevType,CodingExp,CodingSinceAge,ProCodingExp,SalaryUSD,RemoteWorkFreq,WorkLoc,SelfCompetenceLevel,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,MainSocialMedia,SurveyLength,SurveyEase
1,Student,Yes,Unemployed,United Kingdom,No,Primary,,,4.0,10.0,0.0,,,,0.0,HTML/CSS;Java;JavaScript;Python,C;C++;C#;Go;HTML/CSS;Java;JavaScript;Python;SQL,SQLite,MySQL,MacOS;Windows,Android;Arduino;Windows,Twitter,Appropriate in length,Moderate
2,Student,No,Job-seeker,Bosnia and Herzegovina,"Yes, full-time",Secondary,,"Developer, desktop or enterprise applications;...",0.0,17.0,0.0,,,,0.0,C++;HTML/CSS;Python,C++;HTML/CSS;JavaScript;SQL,,MySQL,Windows,Windows,Instagram,Appropriate in length,Moderate
3,Amateur,Yes,Full-time,Thailand,No,Bachelors Degree,Web development or web design,"Designer;Developer, back-end;Developer, front-...",3.0,22.0,1.0,8820.0,Less than once per month / Never,Home,5.0,HTML/CSS,Elixir;HTML/CSS,PostgreSQL,PostgreSQL,,,Reddit,Appropriate in length,Moderate
4,Pro Developer,No,Full-time,United States,No,Bachelors Degree,"Computer science, computer engineering, or sof...","Developer, full-stack",3.0,16.0,0.0,61000.0,Less than once per month / Never,Home,4.0,C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,Reddit,Appropriate in length,Easy
5,Pro Developer,Yes,Full-time,Ukraine,No,Bachelors Degree,"Computer science, computer engineering, or sof...","Academic researcher;Developer, desktop or ente...",16.0,14.0,9.0,,A few days each month,Office,6.0,C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA,HTML/CSS;Java;JavaScript;SQL;WebAssembly,Couchbase;MongoDB;MySQL;Oracle;PostgreSQL;SQLite,Couchbase;Firebase;MongoDB;MySQL;Oracle;Postgr...,Android;Linux;MacOS;Slack;Windows,Android;Docker;Kubernetes;Linux;Slack,Facebook,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88182,,Yes,Part-time,Pakistan,,Secondary,,Academic researcher,1.0,4.0,0.0,,,,0.0,HTML/CSS;Java;JavaScript,,,,Google Cloud Platform,,Twitter,Too short,Moderate
88212,,No,Full-time,Spain,No,Secondary,,"Designer;Developer, front-end;Developer, full-...",18.0,7.0,15.0,,,,0.0,HTML/CSS;JavaScript;Python,JavaScript,MySQL;PostgreSQL,PostgreSQL,,Arduino,WhatsApp,Appropriate in length,Easy
88282,,Yes,Job-seeker,United States,No,College/Uni & No Degree,"Computer science, computer engineering, or sof...","Developer, back-end;Developer, desktop or ente...",38.0,10.0,38.0,,,,0.0,Bash/Shell/PowerShell;Go;HTML/CSS;JavaScript;W...,Bash/Shell/PowerShell;C;Go;HTML/CSS;JavaScript...,,,Linux,Linux;Raspberry Pi,I don't use social media,Too short,Moderate
88377,,Yes,Unemployed,Canada,No,Primary,,,0.0,0.0,0.0,,,,0.0,HTML/CSS;JavaScript;Other(s):,C++;HTML/CSS;JavaScript;SQL;WebAssembly;Other(s):,Firebase;SQLite,Firebase;MySQL;SQLite,Linux,Google Cloud Platform;Linux,YouTube,Appropriate in length,Easy
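
The row-dropping step above can also be written as a boolean filter or with `dropna(thresh=...)`. The sketch below uses a small toy frame with hypothetical values, and it assumes `nan_count_per_row` was computed as `isna().sum(axis=1)`; the threshold of 2 is chosen only to suit the toy data (the notebook uses 10).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for 'data_df' (hypothetical values, 4 columns)
toy_df = pd.DataFrame(
    {
        "Hobbyist": ["Yes", "No", np.nan, "No"],
        "Country": ["India", np.nan, np.nan, "Spain"],
        "CodingExp": [4.0, np.nan, np.nan, 18.0],
        "SalaryUSD": [np.nan, np.nan, np.nan, 61000.0],
    },
    index=pd.Index([1, 2, 3, 4], name="Respondent"),
)

max_nans = 2  # hypothetical threshold for the toy data; the notebook uses 10

# Assumed definition of the per-row NaN count
nan_count_per_row = toy_df.isna().sum(axis=1)

# Option 1: keep the rows whose NaN count does not exceed the threshold
kept_by_mask = toy_df[nan_count_per_row <= max_nans]

# Option 2: dropna(thresh=k) keeps the rows with at least k non-NaN values
kept_by_thresh = toy_df.dropna(thresh=toy_df.shape[1] - max_nans)

print(kept_by_mask.index.tolist())  # [1, 4]
```

Both options return a new frame rather than modifying the original in place, so the result has to be assigned back if the original name should refer to the filtered data.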


**Because rows were deleted, the index** ('Respondent') **is no longer contiguous (it has gaps where rows were removed). We'll reset the index, then drop the old** 'Respondent' **column, and finally name the new index** 'Respondent'.

In [81]:
data_df.reset_index(inplace = True)  # moves the old 'Respondent' index into a regular column
data_df.index += 1  # sets the index column to start from 1 instead of 0
data_df

Unnamed: 0,Respondent,MainBranch,Hobbyist,Employment,Country,Student,HighestEdLevel,UndergradMajor,YourDevType,CodingExp,CodingSinceAge,ProCodingExp,SalaryUSD,RemoteWorkFreq,WorkLoc,SelfCompetenceLevel,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,MainSocialMedia,SurveyLength,SurveyEase
1,1,Student,Yes,Unemployed,United Kingdom,No,Primary,,,4.0,10.0,0.0,,,,0.0,HTML/CSS;Java;JavaScript;Python,C;C++;C#;Go;HTML/CSS;Java;JavaScript;Python;SQL,SQLite,MySQL,MacOS;Windows,Android;Arduino;Windows,Twitter,Appropriate in length,Moderate
2,2,Student,No,Job-seeker,Bosnia and Herzegovina,"Yes, full-time",Secondary,,"Developer, desktop or enterprise applications;...",0.0,17.0,0.0,,,,0.0,C++;HTML/CSS;Python,C++;HTML/CSS;JavaScript;SQL,,MySQL,Windows,Windows,Instagram,Appropriate in length,Moderate
3,3,Amateur,Yes,Full-time,Thailand,No,Bachelors Degree,Web development or web design,"Designer;Developer, back-end;Developer, front-...",3.0,22.0,1.0,8820.0,Less than once per month / Never,Home,5.0,HTML/CSS,Elixir;HTML/CSS,PostgreSQL,PostgreSQL,,,Reddit,Appropriate in length,Moderate
4,4,Pro Developer,No,Full-time,United States,No,Bachelors Degree,"Computer science, computer engineering, or sof...","Developer, full-stack",3.0,16.0,0.0,61000.0,Less than once per month / Never,Home,4.0,C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,Reddit,Appropriate in length,Easy
5,5,Pro Developer,Yes,Full-time,Ukraine,No,Bachelors Degree,"Computer science, computer engineering, or sof...","Academic researcher;Developer, desktop or ente...",16.0,14.0,9.0,,A few days each month,Office,6.0,C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA,HTML/CSS;Java;JavaScript;SQL;WebAssembly,Couchbase;MongoDB;MySQL;Oracle;PostgreSQL;SQLite,Couchbase;Firebase;MongoDB;MySQL;Oracle;Postgr...,Android;Linux;MacOS;Slack;Windows,Android;Docker;Kubernetes;Linux;Slack,Facebook,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88348,88182,,Yes,Part-time,Pakistan,,Secondary,,Academic researcher,1.0,4.0,0.0,,,,0.0,HTML/CSS;Java;JavaScript,,,,Google Cloud Platform,,Twitter,Too short,Moderate
88349,88212,,No,Full-time,Spain,No,Secondary,,"Designer;Developer, front-end;Developer, full-...",18.0,7.0,15.0,,,,0.0,HTML/CSS;JavaScript;Python,JavaScript,MySQL;PostgreSQL,PostgreSQL,,Arduino,WhatsApp,Appropriate in length,Easy
88350,88282,,Yes,Job-seeker,United States,No,College/Uni & No Degree,"Computer science, computer engineering, or sof...","Developer, back-end;Developer, desktop or ente...",38.0,10.0,38.0,,,,0.0,Bash/Shell/PowerShell;Go;HTML/CSS;JavaScript;W...,Bash/Shell/PowerShell;C;Go;HTML/CSS;JavaScript...,,,Linux,Linux;Raspberry Pi,I don't use social media,Too short,Moderate
88351,88377,,Yes,Unemployed,Canada,No,Primary,,,0.0,0.0,0.0,,,,0.0,HTML/CSS;JavaScript;Other(s):,C++;HTML/CSS;JavaScript;SQL;WebAssembly;Other(s):,Firebase;SQLite,Firebase;MySQL;SQLite,Linux,Google Cloud Platform;Linux,YouTube,Appropriate in length,Easy


In [None]:
data_df.drop(columns = ['Respondent'], inplace = True)    # Dropping the old 'Respondent' column (shown in the output above)

In [85]:
data_df.rename_axis('Respondent', inplace = True)  # Naming the newly reset index 'Respondent'
data_df.head(3)

Respondent,MainBranch,Hobbyist,Employment,Country,Student,HighestEdLevel,UndergradMajor,YourDevType,CodingExp,CodingSinceAge,ProCodingExp,SalaryUSD,RemoteWorkFreq,WorkLoc,SelfCompetenceLevel,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,MainSocialMedia,SurveyLength,SurveyEase
1,Student,Yes,Unemployed,United Kingdom,No,Primary,,,4.0,10.0,0.0,,,,0.0,HTML/CSS;Java;JavaScript;Python,C;C++;C#;Go;HTML/CSS;Java;JavaScript;Python;SQL,SQLite,MySQL,MacOS;Windows,Android;Arduino;Windows,Twitter,Appropriate in length,Moderate
2,Student,No,Job-seeker,Bosnia and Herzegovina,"Yes, full-time",Secondary,,"Developer, desktop or enterprise applications;...",0.0,17.0,0.0,,,,0.0,C++;HTML/CSS;Python,C++;HTML/CSS;JavaScript;SQL,,MySQL,Windows,Windows,Instagram,Appropriate in length,Moderate
3,Amateur,Yes,Full-time,Thailand,No,Bachelors Degree,Web development or web design,"Designer;Developer, back-end;Developer, front-...",3.0,22.0,1.0,8820.0,Less than once per month / Never,Home,5.0,HTML/CSS,Elixir;HTML/CSS,PostgreSQL,PostgreSQL,,,Reddit,Appropriate in length,Moderate
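
The reset, drop, and rename steps above can be condensed, since `reset_index(drop=True)` discards the old index instead of keeping it as a column. A minimal sketch on a toy frame (the values and the name `renumbered_df` are hypothetical):

```python
import pandas as pd

# Toy frame whose 'Respondent' index has gaps after rows were dropped
toy_df = pd.DataFrame(
    {"Hobbyist": ["Yes", "No", "Yes"]},
    index=pd.Index([1, 4, 7], name="Respondent"),
)

# drop=True discards the old index instead of inserting it as a column
renumbered_df = toy_df.reset_index(drop=True)
renumbered_df.index += 1  # start numbering at 1 instead of 0
renumbered_df = renumbered_df.rename_axis("Respondent")

print(renumbered_df.index.tolist())  # [1, 2, 3]
```

Note that renumbering breaks the link to the respondent IDs in the original survey file, so this is only suitable when those IDs are not needed later.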


**The dataset is now cleaned. We'll save a copy of it for the data analysis that follows:**

In [87]:
data_df.to_csv('/content/sofsurvey2019_cleaned.csv')
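
When the saved file is read back, the 'Respondent' index comes back as an ordinary column unless it is named in `index_col`. A minimal sketch of the round trip, using a toy frame and a temporary file (both are hypothetical stand-ins for the real data and the '/content' path):

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned 'data_df' (hypothetical values)
cleaned_df = pd.DataFrame(
    {"Country": ["Thailand", "Ukraine"], "CodingExp": [3.0, 16.0]},
    index=pd.Index([1, 2], name="Respondent"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "cleaned_sample.csv")  # hypothetical file name
    cleaned_df.to_csv(csv_path)  # the named index is written as the first column
    # index_col restores 'Respondent' as the index instead of a data column
    reloaded_df = pd.read_csv(csv_path, index_col="Respondent")

print(reloaded_df.index.name)  # Respondent
```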



---



# Data Analysis