# Data Cleaning: Employee Survey Results

### The goal of this project is to clean and transform the dataset in a form suitable for further exploratory data analysis

## Objectives:

- To identify errors and inconsistencies and understand how to clean and transform the dataset for further analysis
- Perform an overall and column-wise inspection 
- Document all the findings and analyses
- Implement a function that will perform all the cleaning and transformations steps

## Importing Libraries

In [1]:
from collections import Counter

from IPython.display import HTML, display

import numpy as np

import pandas as pd

import os

In [2]:
pd.set_option("display.max_columns", None)

## Getting the Data

In [3]:
file_dir = r"C:\Python Programs\datasets"
file_name = "survey_results_public.csv"
full_path = os.path.join(file_dir, file_name)
full_path

'C:\\Python Programs\\datasets\\survey_results_public.csv'

In [4]:
df = pd.read_csv(full_path)
display(HTML(f"<h3>Data Shape: {df.shape}<h3>"))
df.head()

Unnamed: 0,ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,DevType,OrgSize,PurchaseInfluence,BuyNewTool,Country,Currency,CompTotal,CompFreq,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,PlatformHaveWorkedWith,PlatformWantToWorkWith,WebframeHaveWorkedWith,WebframeWantToWorkWith,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,OpSysProfessional use,OpSysPersonal use,VersionControlSystem,VCInteraction,VCHostingPersonal use,VCHostingProfessional use,OfficeStackAsyncHaveWorkedWith,OfficeStackAsyncWantToWorkWith,OfficeStackSyncHaveWorkedWith,OfficeStackSyncWantToWorkWith,Blockchain,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOComm,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,TBranch,ICorPM,WorkExp,Knowledge_1,Knowledge_2,Knowledge_3,Knowledge_4,Knowledge_5,Knowledge_6,Knowledge_7,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,None of these,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,I am a developer by profession,"Employed, full-time",Fully remote,Hobby;Contribute to open-source projects,,,,,,,,,,,Canada,CAD\tCanadian dollar,,,JavaScript;TypeScript,Rust;TypeScript,,,,,,,,,,,,,macOS,Windows Subsystem for Linux (WSL),Git,,,,,,,,Very unfavorable,Collectives on Stack Overflow;Stack Overflow f...,Daily or almost daily,Yes,Daily or almost daily,Not sure,,,,,,,,No,,,,,,,,,,,,,,,,,,,,Too long,Difficult,
2,3,"I am not primarily a developer, but I write co...","Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Friend or family member...,Technical documentation;Blogs;Programming Game...,,14.0,5.0,Data scientist or machine learning specialist;...,20 to 99 employees,I have some influence,,United Kingdom of Great Britain and Northern I...,GBP\tPound sterling,32000.0,Yearly,C#;C++;HTML/CSS;JavaScript;Python,C#;C++;HTML/CSS;JavaScript;TypeScript,Microsoft SQL Server,Microsoft SQL Server,,,Angular.js,Angular;Angular.js,Pandas,.NET,,,Notepad++;Visual Studio,Notepad++;Visual Studio,Windows,Windows,Git,Code editor,,,,,Microsoft Teams,Microsoft Teams,Very unfavorable,Collectives on Stack Overflow;Stack Overflow;S...,Multiple times per day,Yes,Multiple times per day,Neutral,25-34 years old,Man,No,Bisexual,White,None of the above,"I have a mood or emotional disorder (e.g., dep...",No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Neither easy nor difficult,40205.0
3,4,I am a developer by profession,"Employed, full-time",Fully remote,I don’t code outside of work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Books / Physical media;School (i.e., Universit...",,,20.0,17.0,"Developer, full-stack",100 to 499 employees,I have some influence,Other (please specify):,Israel,ILS\tIsraeli new shekel,60000.0,Monthly,C#;JavaScript;SQL;TypeScript,C#;SQL;TypeScript,Microsoft SQL Server,Microsoft SQL Server,,,ASP.NET;ASP.NET Core,ASP.NET;ASP.NET Core,.NET,.NET,,,Notepad++;Visual Studio;Visual Studio Code,Notepad++;Visual Studio;Visual Studio Code,Windows,Windows,Git,Code editor;Command-line;Version control hosti...,,,Jira Work Management;Trello,Jira Work Management;Trello,Slack;Zoom,Slack;Zoom,Very unfavorable,Collectives on Stack Overflow;Stack Overflow f...,Daily or almost daily,Yes,A few times per week,"Yes, definitely",35-44 years old,Man,No,Straight / Heterosexual,White,None of the above,None of the above,No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Easy,215232.0
4,5,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Stack Overflow;O...,,8.0,3.0,"Developer, front-end;Developer, full-stack;Dev...",20 to 99 employees,I have some influence,Start a free trial;Visit developer communities...,United States of America,USD\tUnited States dollar,,,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript,C#;Elixir;F#;Go;JavaScript;Rust;TypeScript,Cloud Firestore;Elasticsearch;Microsoft SQL Se...,Cloud Firestore;Elasticsearch;Firebase Realtim...,Firebase;Microsoft Azure,Firebase;Microsoft Azure,Angular;ASP.NET;ASP.NET Core ;jQuery;Node.js,Angular;ASP.NET Core ;Blazor;Node.js,.NET,.NET;Apache Kafka,npm,Docker;Kubernetes,Notepad++;Visual Studio;Visual Studio Code;Xcode,Rider;Visual Studio;Visual Studio Code,Windows,macOS;Windows,Git;Other (please specify):,Code editor,,,,,Microsoft Teams;Zoom,,Unfavorable,Collectives on Stack Overflow;Stack Overflow f...,Multiple times per day,Yes,Daily or almost daily,"Yes, definitely",25-34 years old,,,,,,,No,,,,,,,,,,,,,,,,,,,,Too long,Easy,


## Data Dictionary
- Provides a description of all the columns in the dataset

In [5]:
file_dir = r"C:\Python Programs\datasets"
file_name = "survey_results_schema.csv"
full_path = os.path.join(file_dir, file_name)
full_path

'C:\\Python Programs\\datasets\\survey_results_schema.csv'

In [6]:
dictionary = (pd
             .read_csv(full_path)
             .query("selector != 'TB'")
             .loc[:, ["qname", "question"]]
             .set_axis(["column", "description"], axis=1)
             .iloc[1:]
             .set_index("column"))
dictionary

Unnamed: 0_level_0,description
column,Unnamed: 1_level_1
MainBranch,Which of the following options best describes ...
Employment,Which of the following best describes your cur...
RemoteWork,Which best describes your current work situation?
CodingActivities,Which of the following best describes the code...
EdLevel,Which of the following best describes the high...
...,...
Frequency_2,Interacting with people outside of your immedi...
Frequency_3,Encountering knowledge silos (where one indivi...
TrueFalse_1,Are you involved in supporting new hires durin...
TrueFalse_2,Do you use learning resources provided by your...


## Meta-data

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73268 entries, 0 to 73267
Data columns (total 79 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ResponseId                      73268 non-null  int64  
 1   MainBranch                      73268 non-null  object 
 2   Employment                      71709 non-null  object 
 3   RemoteWork                      58958 non-null  object 
 4   CodingActivities                58899 non-null  object 
 5   EdLevel                         71571 non-null  object 
 6   LearnCode                       71580 non-null  object 
 7   LearnCodeOnline                 50685 non-null  object 
 8   LearnCodeCoursesCert            29389 non-null  object 
 9   YearsCode                       71331 non-null  object 
 10  YearsCodePro                    51833 non-null  object 
 11  DevType                         61302 non-null  object 
 12  OrgSize                         

In [8]:
(df
 .dtypes
 .value_counts())

object     73
float64     5
int64       1
dtype: int64

In [9]:
df.isna().sum()

ResponseId                 0
MainBranch                 0
Employment              1559
RemoteWork             14310
CodingActivities       14369
                       ...  
TrueFalse_2            37553
TrueFalse_3            37519
SurveyLength            2824
SurveyEase              2760
ConvertedCompYearly    35197
Length: 79, dtype: int64

In [10]:
df.memory_usage(deep=True).sum()

375549705

In [11]:
375549705 / 1024 / 1024

358.15210819244385

#### Observations:

- The dataset has 73,268 observations
- There're totally 79 variables:
  - `int`: 1
  - `float`: 5
  - `object`: 73
- The dataset utilizes 375,549,705 bytes of memory
  - Roughly 358 mega-bytes of memory
- There're several columns with missing values
  - `VCHostingPersonal use` and `VCHostingProfessional use` have 0 non-null values
  - Values are missing most likely because the users chose to skip/ignore the questions
- Many columns contain inconsisent characters and potentially incorrect data types

#### Steps:

- The column names need to be renamed for better readability
- Irrelevant columns to be dropped:
  - `ResponseId`
  - `VCHostingPersonal use`
  - `VCHostingProfessional use`
- Check for duplicates
- Inspect each column individually for errors and inconsistencies
- The missing values can be replaced with `unanswered`

## Checking Duplicates

In [12]:
df.duplicated().sum()

0

In [13]:
df.ResponseId.duplicated().sum()

0

In [14]:
df.duplicated(subset=["ResponseId"]).sum()

0

- There seem to be no duplicates in the dataset

## Column-wise Inspection

### Convenience Functions

In [15]:
def col_value_counts(name):
    """
    This function accepts the name of a categorical column
    and returns the distribution of its categories with the
    counts and normalized percentages
    
    Parameters:
    -----------
    
    name: str
          The name of the categorical column
    """
    return (df
            .loc[:, name]
            .value_counts()
            .pipe(lambda ser: pd.concat([ser,
                                         df
                                         .loc[:, name]
                                         .value_counts(normalize=True)],
                                        axis=1))
            .set_axis(["count", "pct"], axis=1)
            .rename_axis(index=name))

In [16]:
def col_memory_usage(name, category=False):
    """
    This function accepts the name of a categorical column
    and returns the amount of memory utilized in bytes
    
    Parameters:
    -----------
    
    name: str
          The name of the categorical column
          
    category: bool
              Whether to convert the variable to type 'category' or not
    """
    if category:
        return (df
                .loc[:, name]
                .astype("category")
                .memory_usage(deep=True))
    else:
        return (df
                .loc[:, name]
                .memory_usage(deep=True))

In [17]:
def col_strip_nested(name):
    """
    This function accepts the name of a nested categorical column
    and returns the unique values
    """
    
    result_set = dict()
    for i, entry in df[name].items():
        if isinstance(entry, str):
            for value in entry.split(";"):
                result_set[value] = result_set.get(value, 0) + 1
    display(HTML("<h4>Unique Values:<h4>"))
    return sorted(result_set.items(),
                  key=lambda val: val[1],
                  reverse=True)

In [18]:
def col_description(name):
    """
    This function accepts the name of a column
    and returns its description as per the data dictionary
    """
    
    display(HTML("<h4>Description:<h4>"))
    print(dictionary
          .loc[name]
          .values[0])

### MainBranch

In [19]:
col_description("MainBranch")

Which of the following options best describes you today? Here, by "developer" we mean "someone who writes code." <b>*</b>


In [20]:
df.MainBranch

0                                            None of these
1                           I am a developer by profession
2        I am not primarily a developer, but I write co...
3                           I am a developer by profession
4                           I am a developer by profession
                               ...                        
73263                       I am a developer by profession
73264                       I am a developer by profession
73265    I am not primarily a developer, but I write co...
73266                       I am a developer by profession
73267    I used to be a developer by profession, but no...
Name: MainBranch, Length: 73268, dtype: object

In [21]:
df.MainBranch.unique()

array(['None of these', 'I am a developer by profession',
       'I am not primarily a developer, but I write code sometimes as part of my work',
       'I code primarily as a hobby', 'I am learning to code',
       'I used to be a developer by profession, but no longer am'],
      dtype=object)

In [22]:
col_value_counts("MainBranch")

Unnamed: 0_level_0,count,pct
MainBranch,Unnamed: 1_level_1,Unnamed: 2_level_1
I am a developer by profession,53507,0.730292
I am learning to code,6309,0.086109
"I am not primarily a developer, but I write code sometimes as part of my work",5794,0.07908
I code primarily as a hobby,4865,0.0664
None of these,1497,0.020432
"I used to be a developer by profession, but no longer am",1296,0.017688


In [23]:
(df
 .MainBranch
 .rename("coding_proficiency")
 .replace(["I am a developer by profession",
           "I am learning to code",
           "I am not primarily a developer, but I write code sometimes as part of my work",
           "I code primarily as a hobby",
           "None of these",
           "I used to be a developer by profession, but no longer am"],
          ["developer",
           "learning",
           "work_partly",
           "hobby",
           "other",
           "former_developer"])
 .astype("category"))

0                   other
1               developer
2             work_partly
3               developer
4               developer
               ...       
73263           developer
73264           developer
73265         work_partly
73266           developer
73267    former_developer
Name: coding_proficiency, Length: 73268, dtype: category
Categories (6, object): ['developer', 'former_developer', 'hobby', 'learning', 'other', 'work_partly']

In [24]:
col_memory_usage("MainBranch")

6583633

In [25]:
col_memory_usage("MainBranch", category=True)

74134

In [26]:
6583633 / 74134

88.80720047481587

#### Observations
- There're 6 unique categories
- Values are valid


#### Steps
- The column can be renamed
- The categories can be shortened
- Since cardinality is less, the type can be converted to `category`
  - Will utilize about 89 times less memory

### Employment

In [27]:
df.Employment

0                                                      NaN
1                                      Employed, full-time
2                                      Employed, full-time
3                                      Employed, full-time
4                                      Employed, full-time
                               ...                        
73263                                  Employed, full-time
73264                                  Employed, full-time
73265                                  Employed, full-time
73266                                  Employed, full-time
73267    Independent contractor, freelancer, or self-em...
Name: Employment, Length: 73268, dtype: object

In [28]:
# df.Employment.unique()

In [29]:
col_value_counts("Employment").head(20)

Unnamed: 0_level_0,count,pct
Employment,Unnamed: 1_level_1,Unnamed: 2_level_1
"Employed, full-time",42962,0.599116
"Student, full-time",6756,0.094214
"Independent contractor, freelancer, or self-employed",4978,0.069419
"Employed, full-time;Independent contractor, freelancer, or self-employed",3486,0.048613
"Not employed, but looking for work",1831,0.025534
"Student, full-time;Employed, part-time",1168,0.016288
"Employed, part-time",1132,0.015786
"Student, part-time",1045,0.014573
"Employed, full-time;Student, full-time",972,0.013555
"Employed, full-time;Student, part-time",946,0.013192


In [30]:
col_strip_nested("Employment")

[('Employed, full-time', 49199),
 ('Student, full-time', 10932),
 ('Independent contractor, freelancer, or self-employed', 10721),
 ('Employed, part-time', 4154),
 ('Student, part-time', 3722),
 ('Not employed, but looking for work', 3381),
 ('Not employed, and not looking for work', 1244),
 ('I prefer not to say', 611),
 ('Retired', 396)]

In [31]:
# (df
#  .Employment
#  .dropna()
#  .loc[lambda ser: ser.str.contains("employed")]
#  .unique())

In [32]:
(df
 .Employment
 .str.contains("Retired")
 .sum())

396

In [33]:
(df
 .Employment
 .fillna("unanswered")
 .str.lower()
 .replace({"i prefer not to say": "unanswered"})
 .str.replace("independent contractor, freelancer, or self-employed", "freelancer")
 .str.replace("not employed, but looking for work", "unemployed")
 .str.replace("not employed, and not looking for work", "unemployed")
 .str.replace(", ", "_")
 .str.replace("-", "_"))

0                unanswered
1        employed_full_time
2        employed_full_time
3        employed_full_time
4        employed_full_time
                ...        
73263    employed_full_time
73264    employed_full_time
73265    employed_full_time
73266    employed_full_time
73267            freelancer
Name: Employment, Length: 73268, dtype: object

#### Observations
- There're 103 unique categories
- About 90 categories occur in less than 1% of the total observations
  - These could be considered as `rare`
- Values are nested, separated by a semicolon


#### Steps
- The categories can be shortened
- The value `I prefer not to say` can be renamed `unanswered`
- The missing values could be imputed with `unanswered` because it makes sense to think that this field has values missing because the user chose to skip this question
- This variable can be handled in various ways:
  - The rare categories (<1%) can be grouped together
  - The categories can be shortened based on presence of keywords:
    - `Employed`
    - `Not employed`
    - `Student`
    - `Retired`
  - Multiple binary columns can be created:
    - is_employed
    - is_student
    - is_freelancer
- These steps should be kept in mind for performing feature engineering during exploratory analysis

### RemoteWork

In [34]:
df.RemoteWork

0                                         NaN
1                                Fully remote
2        Hybrid (some remote, some in-person)
3                                Fully remote
4        Hybrid (some remote, some in-person)
                         ...                 
73263                            Fully remote
73264                          Full in-person
73265    Hybrid (some remote, some in-person)
73266    Hybrid (some remote, some in-person)
73267                            Fully remote
Name: RemoteWork, Length: 73268, dtype: object

In [35]:
df.RemoteWork.unique()

array([nan, 'Fully remote', 'Hybrid (some remote, some in-person)',
       'Full in-person'], dtype=object)

In [36]:
col_value_counts("RemoteWork")

Unnamed: 0_level_0,count,pct
RemoteWork,Unnamed: 1_level_1,Unnamed: 2_level_1
Fully remote,25341,0.429814
"Hybrid (some remote, some in-person)",25021,0.424387
Full in-person,8596,0.145799


In [37]:
df.RemoteWork.isna().sum()

14310

In [38]:
col_memory_usage("RemoteWork")

5143846

In [39]:
col_memory_usage("RemoteWork", category=True)

73737

In [40]:
(df
 .RemoteWork
 .rename("work_type")
 .replace(["Fully remote",
           "Hybrid (some remote, some in-person)",
           "Full in-person"],
          ["remote",
           "hybrid",
           "in-person"])
 .astype("category"))

0              NaN
1           remote
2           hybrid
3           remote
4           hybrid
           ...    
73263       remote
73264    in-person
73265       hybrid
73266       hybrid
73267       remote
Name: work_type, Length: 73268, dtype: category
Categories (3, object): ['hybrid', 'in-person', 'remote']

#### Observations
- There're 3 unique categories
- Cardinality is less


#### Steps
- The column ca be renamed to `work_type`
- The data type can be converted to `category` to save memory

### CodingActivities

In [41]:
df.CodingActivities

0                                                      NaN
1                 Hobby;Contribute to open-source projects
2                                                    Hobby
3                             I don’t code outside of work
4                                                    Hobby
                               ...                        
73263                              Freelance/contract work
73264                                                Hobby
73265                        Hobby;School or academic work
73266                                                Hobby
73267    Hobby;Contribute to open-source projects;Boots...
Name: CodingActivities, Length: 73268, dtype: object

In [42]:
df.CodingActivities.unique()

array([nan, 'Hobby;Contribute to open-source projects', 'Hobby',
       'I don’t code outside of work',
       'Hobby;Contribute to open-source projects;Bootstrapping a business',
       'Hobby;Contribute to open-source projects;Freelance/contract work',
       'Hobby;Freelance/contract work', 'Hobby;Bootstrapping a business',
       'Other (please specify):', 'Contribute to open-source projects',
       'Hobby;Other (please specify):',
       'Hobby;Contribute to open-source projects;Bootstrapping a business;Freelance/contract work',
       'Bootstrapping a business', 'Freelance/contract work',
       'Hobby;Bootstrapping a business;Freelance/contract work',
       'Bootstrapping a business;Freelance/contract work',
       'Hobby;Contribute to open-source projects;Other (please specify):',
       'Contribute to open-source projects;Freelance/contract work',
       'Hobby;Freelance/contract work;Other (please specify):',
       'Contribute to open-source projects;Bootstrapping a busine

In [43]:
col_value_counts("CodingActivities").head(20)

Unnamed: 0_level_0,count,pct
CodingActivities,Unnamed: 1_level_1,Unnamed: 2_level_1
Hobby,18118,0.307611
I don’t code outside of work,7311,0.124128
Hobby;Contribute to open-source projects,6549,0.11119
Hobby;Freelance/contract work,3554,0.060341
Hobby;School or academic work,3016,0.051206
Freelance/contract work,2189,0.037165
Hobby;Bootstrapping a business,2136,0.036265
Hobby;Contribute to open-source projects;Freelance/contract work,2028,0.034432
Hobby;Contribute to open-source projects;School or academic work,1254,0.021291
School or academic work,1208,0.02051


In [44]:
col_strip_nested("CodingActivities")

[('Hobby', 42922),
 ('Contribute to open-source projects', 15378),
 ('Freelance/contract work', 13305),
 ('School or academic work', 8561),
 ('Bootstrapping a business', 8401),
 ('I don’t code outside of work', 7311),
 ('Other (please specify):', 2179)]

In [45]:
(df
 .CodingActivities
 .dropna()
 .loc[lambda ser: ser.str.contains("I don’t code outside of work")]
 .unique())

array(['I don’t code outside of work'], dtype=object)

In [46]:
(df
 .CodingActivities
 .rename("coding_activity")
 .fillna("unanswered")
 .str.lower()
 .str.replace("contribute to open-source projects", "open_source_contribution")
 .str.replace("freelance/contract work", "freelancing")
 .str.replace("school or academic work", "academics")
 .str.replace("bootstrapping a business", "startup")
 .str.replace("i don’t code outside of work", "only_work")
 .str.replace("other \(please specify\):", "other", regex=True)
 .astype("category"))

0                                    unanswered
1                hobby;open_source_contribution
2                                         hobby
3                                     only_work
4                                         hobby
                          ...                  
73263                               freelancing
73264                                     hobby
73265                           hobby;academics
73266                                     hobby
73267    hobby;open_source_contribution;startup
Name: coding_activity, Length: 73268, dtype: category
Categories (64, object): ['academics', 'freelancing', 'freelancing;academics', 'freelancing;other', ..., 'startup;freelancing;other', 'startup;other', 'startup;other;academics', 'unanswered']

#### Observations
- There're 63 unique categories
- About 45 categories occur in less than 1% of the total observations
  - These could be considered as `rare`
- Values are nested, separated by a semicolon
- `Bootstrapping a business` means beginning a new business with one's own savings


#### Steps
- The categories can be shortened
- This variable can be handled in various ways:
  - The rare categories (<1%) can be grouped together
  - The categories can be shortened based on presence of keywords
  - Multiple binary columns can be created
- These steps should be kept in mind for performing feature engineering during exploratory analysis

### EdLevel

In [47]:
df.EdLevel

0                                                    NaN
1                                                    NaN
2        Master’s degree (M.A., M.S., M.Eng., MBA, etc.)
3           Bachelor’s degree (B.A., B.S., B.Eng., etc.)
4           Bachelor’s degree (B.A., B.S., B.Eng., etc.)
                              ...                       
73263       Bachelor’s degree (B.A., B.S., B.Eng., etc.)
73264    Master’s degree (M.A., M.S., M.Eng., MBA, etc.)
73265       Bachelor’s degree (B.A., B.S., B.Eng., etc.)
73266       Bachelor’s degree (B.A., B.S., B.Eng., etc.)
73267       Bachelor’s degree (B.A., B.S., B.Eng., etc.)
Name: EdLevel, Length: 73268, dtype: object

In [48]:
df.EdLevel.unique()

array([nan, 'Master’s degree (M.A., M.S., M.Eng., MBA, etc.)',
       'Bachelor’s degree (B.A., B.S., B.Eng., etc.)',
       'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)',
       'Some college/university study without earning a degree',
       'Something else', 'Primary/elementary school',
       'Other doctoral degree (Ph.D., Ed.D., etc.)',
       'Associate degree (A.A., A.S., etc.)',
       'Professional degree (JD, MD, etc.)'], dtype=object)

In [49]:
col_value_counts("EdLevel")

Unnamed: 0_level_0,count,pct
EdLevel,Unnamed: 1_level_1,Unnamed: 2_level_1
"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",30276,0.42302
"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",15486,0.216373
Some college/university study without earning a degree,9326,0.130304
"Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",7904,0.110436
"Associate degree (A.A., A.S., etc.)",2236,0.031242
"Other doctoral degree (Ph.D., Ed.D., etc.)",2169,0.030306
Primary/elementary school,1806,0.025234
Something else,1247,0.017423
"Professional degree (JD, MD, etc.)",1121,0.015663


In [50]:
(df
 .EdLevel
 .replace(["Bachelor’s degree (B.A., B.S., B.Eng., etc.)",
           "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",
           "Some college/university study without earning a degree",
           "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",
           "Associate degree (A.A., A.S., etc.)",
           "Other doctoral degree (Ph.D., Ed.D., etc.)",
           "Primary/elementary school",
           "Something else",
           "Professional degree (JD, MD, etc.)	"],
          ["bachelors",
           "masters",
           "college",
           "secondary_school",
           "associate",
           "doctorate",
           "primary_school",
           "other",
           "professional"])
 .astype("category"))

0              NaN
1              NaN
2          masters
3        bachelors
4        bachelors
           ...    
73263    bachelors
73264      masters
73265    bachelors
73266    bachelors
73267    bachelors
Name: EdLevel, Length: 73268, dtype: category
Categories (9, object): ['Professional degree (JD, MD, etc.)', 'associate', 'bachelors', 'college', ..., 'masters', 'other', 'primary_school', 'secondary_school']

#### Observations
- There're 9 unique categories
- the categories are unnecessarily lengthy
- Less cardinality


#### Steps
- The categories can be shortened
- The type can be made `category` for less memory usage

### LearnCode

In [51]:
df.LearnCode

0                                                      NaN
1                                                      NaN
2        Books / Physical media;Friend or family member...
3        Books / Physical media;School (i.e., Universit...
4        Other online resources (e.g., videos, blogs, f...
                               ...                        
73263    Books / Physical media;Other online resources ...
73264    Other online resources (e.g., videos, blogs, f...
73265    Books / Physical media;Other online resources ...
73266           Books / Physical media;On the job training
73267    Books / Physical media;Friend or family member...
Name: LearnCode, Length: 73268, dtype: object

In [52]:
df.LearnCode.unique()

array([nan,
       'Books / Physical media;Friend or family member;Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc)',
       'Books / Physical media;School (i.e., University, College, etc)',
       'Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc);On the job training',
       'Other online resources (e.g., videos, blogs, forum)',
       'Online Courses or Certification',
       'On the job training;Coding Bootcamp',
       'Books / Physical media;Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc)',
       'School (i.e., University, College, etc)',
       'Books / Physical media',
       'Books / Physical media;Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc);Online Courses or Certification;Colleague',
       'Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc);On the job training

In [53]:
col_value_counts("LearnCode")

Unnamed: 0_level_0,count,pct
LearnCode,Unnamed: 1_level_1,Unnamed: 2_level_1
"School (i.e., University, College, etc)",3669,0.051257
"Other online resources (e.g., videos, blogs, forum)",3292,0.045991
"Books / Physical media;Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc)",2873,0.040137
"Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc)",2697,0.037678
"Books / Physical media;Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc);On the job training;Online Courses or Certification",2392,0.033417
...,...,...
"Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc);On the job training;Online Courses or Certification;Coding Bootcamp;Other (please specify):;Hackathons (virtual or in-person)",1,0.000014
Friend or family member;Online Courses or Certification;Coding Bootcamp;Colleague;Hackathons (virtual or in-person),1,0.000014
"Books / Physical media;Friend or family member;Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc);Online Courses or Certification;Coding Bootcamp;Other (please specify):",1,0.000014
"Books / Physical media;School (i.e., University, College, etc);On the job training;Coding Bootcamp;Hackathons (virtual or in-person)",1,0.000014


In [54]:
col_strip_nested("LearnCode")

[('Other online resources (e.g., videos, blogs, forum)', 50756),
 ('School (i.e., University, College, etc)', 44506),
 ('Books / Physical media', 38994),
 ('Online Courses or Certification', 33379),
 ('On the job training', 28523),
 ('Colleague', 13188),
 ('Friend or family member', 9987),
 ('Coding Bootcamp', 7731),
 ('Hackathons (virtual or in-person)', 5269),
 ('Other (please specify):', 3558)]

In [55]:
(df
 .LearnCode
 .rename("learnt_coding")
 .fillna("unanswered")
 .str.replace("Other online resources \(e.g., videos, blogs, forum\)", "online_resources", regex=True)
 .str.replace("School \(i.e., University, College, etc\)", "academia", regex=True)
 .str.replace("Books / Physical media", "books")
 .str.replace("Online Courses or Certification", "online_courses")
 .str.replace("On the job training", "job_training")
 .str.replace("Friend or family member", "friends_family")
 .str.replace("Hackathons \(virtual or in-person\)", "hackathons", regex=True)
 .str.replace("Other \(please specify\):", "other", regex=True))

0                                               unanswered
1                                               unanswered
2           books;friends_family;online_resources;academia
3                                           books;academia
4                   online_resources;academia;job_training
                               ...                        
73263    books;online_resources;job_training;online_cou...
73264    online_resources;academia;job_training;online_...
73265       books;online_resources;academia;online_courses
73266                                   books;job_training
73267    books;friends_family;online_resources;academia...
Name: learnt_coding, Length: 73268, dtype: object

#### Observations
- There're 737 unique categories
- Many values occur in less than 1% of the total observations
- Even the most frequent value occurs only in 5% of observations
- Values are nested, separated by a semicolon


#### Steps
- The categories can be shortened
- This variable can be handled in various ways:
  - The rare categories (<1%) can be grouped together
  - The categories can be shortened based on presence of keywords
  - Multiple binary columns can be created
- These steps should be kept in mind for performing feature engineering during exploratory analysis

### LearnCodeOnline

In [56]:
df.LearnCodeOnline

0                                                      NaN
1                                                      NaN
2        Technical documentation;Blogs;Programming Game...
3                                                      NaN
4        Technical documentation;Blogs;Stack Overflow;O...
                               ...                        
73263    Technical documentation;Blogs;Written Tutorial...
73264    Technical documentation;Blogs;Written Tutorial...
73265    Technical documentation;Programming Games;Stac...
73266                                                  NaN
73267    Technical documentation;Blogs;Programming Game...
Name: LearnCodeOnline, Length: 73268, dtype: object

In [57]:
df.LearnCodeOnline.unique()

array([nan,
       'Technical documentation;Blogs;Programming Games;Written Tutorials;Stack Overflow',
       'Technical documentation;Blogs;Stack Overflow;Online books;Video-based Online Courses;Online challenges (e.g., daily or weekly coding challenges)',
       ...,
       'Written Tutorials;Online books;Video-based Online Courses;How-to videos;Written-based Online Courses;Coding sessions (live or recorded);Certification videos',
       'Programming Games;Stack Overflow;Video-based Online Courses;Online challenges (e.g., daily or weekly coding challenges);How-to videos;Written-based Online Courses;Interactive tutorial;Coding sessions (live or recorded);Certification videos',
       'Technical documentation;Programming Games;Stack Overflow;Online books;Video-based Online Courses;How-to videos;Written-based Online Courses;Coding sessions (live or recorded);Certification videos'],
      dtype=object)

In [58]:
col_value_counts("LearnCodeOnline").head(20)

Unnamed: 0_level_0,count,pct
LearnCodeOnline,Unnamed: 1_level_1,Unnamed: 2_level_1
Technical documentation;Blogs;Written Tutorials;Stack Overflow,715,0.014107
Technical documentation;Blogs;Written Tutorials;Stack Overflow;Online forum,706,0.013929
Technical documentation;Blogs;Stack Overflow,589,0.011621
Technical documentation;Blogs;Written Tutorials;Stack Overflow;Online books;Online forum,530,0.010457
Technical documentation;Blogs;Written Tutorials;Stack Overflow;Online books,447,0.008819
Technical documentation;Blogs;Written Tutorials;Stack Overflow;Online forum;How-to videos,430,0.008484
Technical documentation;Blogs;Stack Overflow;How-to videos,398,0.007852
Technical documentation;Blogs;Stack Overflow;Online forum,374,0.007379
Technical documentation;Blogs;Written Tutorials;Stack Overflow;How-to videos,370,0.0073
Technical documentation;Blogs;Stack Overflow;Video-based Online Courses;How-to videos,365,0.007201


In [59]:
col_strip_nested("LearnCodeOnline")

[('Technical documentation', 44669),
 ('Stack Overflow', 43658),
 ('Blogs', 38192),
 ('How-to videos', 30371),
 ('Written Tutorials', 29436),
 ('Video-based Online Courses', 26064),
 ('Online books', 22238),
 ('Online forum', 20446),
 ('Written-based Online Courses', 17424),
 ('Coding sessions (live or recorded)', 14626),
 ('Interactive tutorial', 13287),
 ('Online challenges (e.g., daily or weekly coding challenges)', 12723),
 ('Certification videos', 7541),
 ('Programming Games', 6752),
 ('Auditory material (e.g., podcasts)', 3652),
 ('Other (Please specify):', 1028)]

In [60]:
(df
 .LearnCodeOnline
 .fillna("unanswered")
 .str.replace("Other \(Please specify\):", "other", regex=True)
 .str.replace("Technical documentation", "documentation")
 .str.replace("How-to videos", "videos")
 .str.replace("Video-based Online Courses", "courses")
 .str.replace("Online books", "books")
 .str.replace("Online forum", "forums")
 .str.replace("Written-based Online Courses", "courses")
 .str.replace("Coding sessions \(live or recorded\)", "coding_sessions", regex=True)
 .str.replace("Written Tutorials", "tutorials")
 .str.replace("Interactive tutorial", "tutorials")
 .str.replace("Certification videos", "videos")
 .str.replace("Auditory material \(e.g., podcasts\)", "podcasts", regex=True)
 .str.replace("Online challenges \(e.g., daily or weekly coding challenges\)", "coding_challenges", regex=True)
 .str.lower()
 .str.replace(" ", "_")
 .value_counts())

unanswered                                                                                                               22583
documentation;blogs;tutorials;stack_overflow                                                                               715
documentation;blogs;tutorials;stack_overflow;forums                                                                        706
documentation;blogs;stack_overflow                                                                                         589
documentation;blogs;tutorials;stack_overflow;books;forums                                                                  530
                                                                                                                         ...  
documentation;blogs;programming_games;tutorials;stack_overflow;books;courses;other                                           1
documentation;blogs;programming_games;tutorials;courses;videos;courses;other                                   

#### Observations
- There're 7192 unique entries
- All the entries occur in around 1% of the total observations
  - These could be considered as `rare`
- Values are nested, separated by a semicolon


#### Steps
- The categories can be shortened
- This variable can be handled in various ways:
  - The rare categories (<1%) can be grouped together
  - The categories can be shortened based on presence of keywords
  - Multiple binary columns can be created
- These steps should be kept in mind for performing feature engineering during exploratory analysis

### LearnCodeCoursesCert

In [61]:
df.LearnCodeCoursesCert

0                                     NaN
1                                     NaN
2                                     NaN
3                                     NaN
4                                     NaN
                       ...               
73263                               Udemy
73264              Coursera;Udemy;Udacity
73265    Udemy;Codecademy;Pluralsight;edX
73266                                 NaN
73267                   Udemy;Pluralsight
Name: LearnCodeCoursesCert, Length: 73268, dtype: object

In [62]:
df.LearnCodeCoursesCert.unique()

array([nan, 'Coursera;Udemy', 'Udemy;Codecademy', 'Coursera;Pluralsight',
       'Coursera;Udemy;Codecademy;edX;Udacity',
       'Coursera;Udemy;Pluralsight;edX', 'Udemy', 'Other',
       'Coursera;Udemy;Udacity', 'Udemy;Pluralsight',
       'Coursera;Udemy;Pluralsight', 'Codecademy', 'Coursera',
       'Coursera;Udemy;edX', 'Udemy;Other', 'Pluralsight',
       'Coursera;Udemy;Codecademy', 'Codecademy;Pluralsight',
       'Coursera;edX', 'Udemy;Codecademy;Pluralsight',
       'Pluralsight;Udacity', 'Coursera;Udemy;Other',
       'Codecademy;Pluralsight;Other',
       'Udemy;Codecademy;Pluralsight;Other', 'Udemy;Pluralsight;Udacity',
       'Coursera;Udemy;Codecademy;Udacity', 'Udemy;edX',
       'Coursera;Udemy;edX;Udacity',
       'Coursera;Pluralsight;edX;Udacity;Other', 'edX',
       'Coursera;Codecademy', 'Coursera;Other', 'Codecademy;Other',
       'Udemy;Codecademy;Pluralsight;edX;Udacity', 'Coursera;Udacity',
       'Udemy;Pluralsight;Other', 'Coursera;Codecademy;Pluralsight;edX

In [63]:
col_value_counts("LearnCodeCoursesCert")

Unnamed: 0_level_0,count,pct
LearnCodeCoursesCert,Unnamed: 1_level_1,Unnamed: 2_level_1
Udemy,5643,0.192011
Other,2594,0.088264
Coursera;Udemy,1893,0.064412
Udemy;Codecademy,1472,0.050087
Udemy;Pluralsight,1402,0.047705
...,...,...
Udemy;Codecademy;edX;Udacity;Skillsoft,1,0.000034
Coursera;Other;Skillsoft,1,0.000034
Codecademy;edX;Udacity;Skillsoft,1,0.000034
Pluralsight;edX;Udacity;Other,1,0.000034


In [64]:
col_strip_nested("LearnCodeCoursesCert")

[('Udemy', 19540),
 ('Coursera', 10261),
 ('Codecademy', 7712),
 ('Pluralsight', 6594),
 ('Other', 6528),
 ('edX', 4590),
 ('Udacity', 3995),
 ('Skillsoft', 553)]

In [65]:
(df
 .LearnCodeCoursesCert
 .fillna("unanswered")
 .str.lower()
 .value_counts())

unanswered                                43879
udemy                                      5643
other                                      2594
coursera;udemy                             1893
udemy;codecademy                           1472
                                          ...  
edx;other;skillsoft                           1
codecademy;pluralsight;other;skillsoft        1
udemy;codecademy;other;skillsoft              1
coursera;codecademy;udacity;skillsoft         1
coursera;pluralsight;other;skillsoft          1
Name: LearnCodeCoursesCert, Length: 207, dtype: int64

#### Observations
- There're 206 unique entries
- Entries are nested, separated by a semicolon


#### Steps
- The entries can be shortened
- This variable can be handled in various ways:
  - The rare categories (<1%) can be grouped together
  - The categories can be shortened based on presence of keywords

### YearsCode

In [66]:
df.YearsCode

0        NaN
1        NaN
2         14
3         20
4          8
        ... 
73263      8
73264      6
73265     42
73266     50
73267     16
Name: YearsCode, Length: 73268, dtype: object

In [67]:
df.YearsCode.unique()

array([nan, '14', '20', '8', '15', '3', '1', '6', '37', '5', '12', '22',
       '11', '4', '7', '13', '36', '2', '25', '10', '40', '16', '27',
       '24', '19', '9', '17', '18', '26', 'More than 50 years', '29',
       '30', '32', 'Less than 1 year', '48', '45', '38', '39', '28', '23',
       '43', '21', '41', '35', '50', '33', '31', '34', '46', '44', '42',
       '47', '49'], dtype=object)

In [68]:
col_value_counts("YearsCode")

Unnamed: 0_level_0,count,pct
YearsCode,Unnamed: 1_level_1,Unnamed: 2_level_1
10,5217,0.073138
5,5193,0.072801
6,4651,0.065203
4,4480,0.062806
7,4237,0.059399
8,4227,0.059259
3,4122,0.057787
2,3351,0.046978
12,2995,0.041987
15,2962,0.041525


#### Observations
- This is an object type column but should be numeric
- Contains some string values


#### Steps
- Some values need to be replaced:
  - `More than 50 years` ---> 50
  - `Less than 1 year` ---> 1
- The type should be `int`

### YearsCodePro

In [69]:
df.YearsCodePro

0        NaN
1        NaN
2          5
3         17
4          3
        ... 
73263      5
73264      5
73265     33
73266     31
73267      5
Name: YearsCodePro, Length: 73268, dtype: object

In [70]:
df.YearsCodePro.unique()

array([nan, '5', '17', '3', '6', '30', '2', '10', '15', '4', '22', '20',
       '40', '9', '14', '21', '7', '18', '25', '8', '12', '45', '1', '19',
       '28', '24', '11', '23', 'Less than 1 year', '32', '27', '16', '44',
       '26', '37', '46', '13', '31', '39', '34', '38', '35', '29', '42',
       '36', '33', '43', '41', '48', '50', 'More than 50 years', '47',
       '49'], dtype=object)

- Same steps as `YearsCode`

### DevType

In [71]:
df.DevType

0                                                      NaN
1                                                      NaN
2        Data scientist or machine learning specialist;...
3                                    Developer, full-stack
4        Developer, front-end;Developer, full-stack;Dev...
                               ...                        
73263                                  Developer, back-end
73264        Data scientist or machine learning specialist
73265    Developer, full-stack;Developer, desktop or en...
73266    Developer, front-end;Developer, desktop or ent...
73267    Developer, front-end;Engineer, data;Engineer, ...
Name: DevType, Length: 73268, dtype: object

In [72]:
df.DevType.unique()

array([nan,
       'Data scientist or machine learning specialist;Developer, front-end;Engineer, data;Engineer, site reliability',
       'Developer, full-stack', ...,
       'Data scientist or machine learning specialist;Developer, front-end;Developer, full-stack;Developer, back-end;Developer, QA or test;Developer, mobile;Database administrator;Cloud infrastructure engineer;Data or business analyst;Designer;Blockchain',
       'Developer, front-end;Developer, full-stack;Developer, back-end;Developer, desktop or enterprise applications;Developer, mobile;Educator;Developer, embedded applications or devices',
       'Developer, front-end;Engineer, data;Engineer, site reliability;Developer, full-stack;Developer, back-end;Developer, desktop or enterprise applications;Developer, QA or test;Student;Developer, mobile;Academic researcher;DevOps specialist;Developer, embedded applications or devices;Developer, game or graphics;Cloud infrastructure engineer;Data or business analyst;Designer;Scie

In [73]:
col_value_counts("DevType").head(50)

Unnamed: 0_level_0,count,pct
DevType,Unnamed: 1_level_1,Unnamed: 2_level_1
"Developer, full-stack",7142,0.116505
"Developer, back-end",5301,0.086474
"Developer, front-end",2385,0.038906
"Developer, front-end;Developer, full-stack;Developer, back-end",1807,0.029477
"Developer, full-stack;Developer, back-end",1535,0.02504
"Developer, mobile",1492,0.024339
Other (please specify):,1229,0.020048
Student,976,0.015921
"Developer, front-end;Developer, full-stack",941,0.01535
"Developer, desktop or enterprise applications",758,0.012365


In [74]:
col_strip_nested("DevType")

[('Developer, full-stack', 28701),
 ('Developer, back-end', 26595),
 ('Developer, front-end', 15915),
 ('Developer, desktop or enterprise applications', 9546),
 ('Developer, mobile', 7634),
 ('DevOps specialist', 6170),
 ('Student', 5595),
 ('Cloud infrastructure engineer', 5283),
 ('Database administrator', 4934),
 ('System administrator', 4908),
 ('Developer, embedded applications or devices', 3923),
 ('Project manager', 3897),
 ('Designer', 3764),
 ('Engineer, data', 3600),
 ('Engineering manager', 3574),
 ('Data scientist or machine learning specialist', 3424),
 ('Data or business analyst', 3201),
 ('Developer, QA or test', 3096),
 ('Academic researcher', 2709),
 ('Other (please specify):', 2618),
 ('Product manager', 2514),
 ('Educator', 2090),
 ('Engineer, site reliability', 1947),
 ('Security professional', 1928),
 ('Developer, game or graphics', 1837),
 ('Senior Executive (C-Suite, VP, etc.)', 1805),
 ('Scientist', 1762),
 ('Blockchain', 1302),
 ('Marketing or sales professiona

In [75]:
(df
 .DevType
 .fillna("unanswered")
 .str.replace("Other \(Please specify\):", "other", regex=True)
 .str.replace("Developer, embedded applications or devices", "embedded_app_dev")
 .str.replace("Engineer, data", "data_engineer")
 .str.replace("Developer, desktop or enterprise applications", "enterprise_app_dev")
 .str.replace("Developer, full-stack", "full_stack_dev")
 .str.replace("Developer, front-end", "front_end_dev")
 .str.replace("Developer, back-end", "back_end_dev")
 .str.replace("Developer, mobile", "mobile_dev",)
 .str.replace("Data scientist or machine learning specialist", "data_scientist")
 .str.replace("Data or business analyst", "data_analyst")
 .str.replace("Developer, QA or test", "testing_dev")
 .str.replace("Engineer, site reliability", "site_reliability_engineer")
 .str.replace("Developer, game or graphics", "game_dev")
 .str.replace("Senior Executive \(C-Suite, VP, etc.\)", "senior_executive", regex=True)
 .str.replace("Marketing or sales professional", "marketing_professional")
 .str.lower()
 .str.replace(" ", "_")
 .value_counts())

unanswered                                                                                                                                                                                                                                                                                                11966
full_stack_dev                                                                                                                                                                                                                                                                                             7142
back_end_dev                                                                                                                                                                                                                                                                                               5301
front_end_dev                                                                           

#### Observations
- There're 9,984 unique entries
- Entries are nested, separated by a semicolon


#### Steps
- The entries can be shortened
- This variable can be handled in various ways:
  - The rare categories (<1%) can be grouped together
  - The categories can be shortened based on presence of keywords
  - Multiple binary columns can be created
- These steps should be kept in mind for performing feature engineering during exploratory analysis

### OrgSize

In [76]:
df.OrgSize

0                         NaN
1                         NaN
2          20 to 99 employees
3        100 to 499 employees
4          20 to 99 employees
                 ...         
73263    100 to 499 employees
73264            I don’t know
73265      20 to 99 employees
73266      10 to 19 employees
73267                     NaN
Name: OrgSize, Length: 73268, dtype: object

In [77]:
df.OrgSize.unique()

array([nan, '20 to 99 employees', '100 to 499 employees', 'I don’t know',
       'Just me - I am a freelancer, sole proprietor, etc.',
       '2 to 9 employees', '5,000 to 9,999 employees',
       '1,000 to 4,999 employees', '10,000 or more employees',
       '500 to 999 employees', '10 to 19 employees'], dtype=object)

In [78]:
col_value_counts("OrgSize")

Unnamed: 0_level_0,count,pct
OrgSize,Unnamed: 1_level_1,Unnamed: 2_level_1
20 to 99 employees,10343,0.202649
100 to 499 employees,9289,0.181998
"10,000 or more employees",6922,0.135622
"1,000 to 4,999 employees",5736,0.112385
2 to 9 employees,4887,0.09575
10 to 19 employees,4251,0.083289
500 to 999 employees,3645,0.071416
"Just me - I am a freelancer, sole proprietor, etc.",2771,0.054292
"5,000 to 9,999 employees",2189,0.042889
I don’t know,1006,0.01971


In [79]:
df.OrgSize.isna().sum()

22229

In [80]:
(df
 .OrgSize
 .str.replace(" employees", "")
 .replace("Just me - I am a freelancer, sole proprietor, etc.", "freelancer")
 .replace(["2 to 9",
           "10 to 19",
           "20 to 99",
           "100 to 499",
           "500 to 999",
           "1,000 to 4,999",
           "5,000 to 9,999",
           "10,000 or more",
           "I don’t know"],
          ["small",
           "small",
           "small",
           "small",
           "medium",
           "medium",
           "large",
           "large",
           "unknown"])
 .value_counts(dropna=False))

small         28770
NaN           22229
medium         9381
large          9111
freelancer     2771
unknown        1006
Name: OrgSize, dtype: int64

#### Observations
- There're 10 unique entries
- the entries are unnecessarily lengthy
- Less cardinality


#### Steps
- The entries can be grouped together to make meaningful values
- The type can be made `category` for less memory usage

### PurchaseInfluence

In [81]:
df.PurchaseInfluence

0                                     NaN
1                                     NaN
2                   I have some influence
3                   I have some influence
4                   I have some influence
                       ...               
73263               I have some influence
73264       I have little or no influence
73265    I have a great deal of influence
73266    I have a great deal of influence
73267                                 NaN
Name: PurchaseInfluence, Length: 73268, dtype: object

In [82]:
df.PurchaseInfluence.unique()

array([nan, 'I have some influence', 'I have little or no influence',
       'I have a great deal of influence'], dtype=object)

In [83]:
col_value_counts("PurchaseInfluence")

Unnamed: 0_level_0,count,pct
PurchaseInfluence,Unnamed: 1_level_1,Unnamed: 2_level_1
I have some influence,21991,0.431458
I have little or no influence,17345,0.340305
I have a great deal of influence,11633,0.228237


In [84]:
(df
 .PurchaseInfluence
 .replace(["I have some influence",
           "I have little or no influence",
           "I have a great deal of influence"],
          ["some",
           "little",
           "great"])
 .astype("category"))

0           NaN
1           NaN
2          some
3          some
4          some
          ...  
73263      some
73264    little
73265     great
73266     great
73267       NaN
Name: PurchaseInfluence, Length: 73268, dtype: category
Categories (3, object): ['great', 'little', 'some']

#### Observations
- There're 3 unique entries
- Less cardinality


#### Steps
- The entries can be shortened
- The type can be made `category` for less memory usage

### BuyNewTool

In [85]:
col_description("BuyNewTool")

When buying a new tool or software, how do you discover and research available solutions? Select all that apply.


In [86]:
df.BuyNewTool

0                                                      NaN
1                                                      NaN
2                                                      NaN
3                                  Other (please specify):
4        Start a free trial;Visit developer communities...
                               ...                        
73263    Visit developer communities like Stack Overflo...
73264    Other (please specify):;Ask developers I know/...
73265    Start a free trial;Ask developers I know/work ...
73266    Start a free trial;Visit developer communities...
73267    Start a free trial;Visit developer communities...
Name: BuyNewTool, Length: 73268, dtype: object

In [87]:
# df.BuyNewTool.unique()

In [88]:
col_value_counts("BuyNewTool").head(20)

Unnamed: 0_level_0,count,pct
BuyNewTool,Unnamed: 1_level_1,Unnamed: 2_level_1
Start a free trial;Visit developer communities like Stack Overflow;Ask developers I know/work with,11787,0.173433
Start a free trial;Ask developers I know/work with,7454,0.109677
Start a free trial;Visit developer communities like Stack Overflow;Ask developers I know/work with;Read ratings or reviews on third party sites like G2Crowd,5377,0.079117
Start a free trial,5210,0.076659
Visit developer communities like Stack Overflow;Ask developers I know/work with,4886,0.071892
Start a free trial;Visit developer communities like Stack Overflow,4192,0.061681
Ask developers I know/work with,2555,0.037594
Visit developer communities like Stack Overflow,2441,0.035917
Start a free trial;Visit developer communities like Stack Overflow;Read ratings or reviews on third party sites like G2Crowd,2215,0.032591
Start a free trial;Ask developers I know/work with;Read ratings or reviews on third party sites like G2Crowd,2023,0.029766


In [89]:
col_strip_nested("BuyNewTool")

[('Start a free trial', 48849),
 ('Ask developers I know/work with', 45588),
 ('Visit developer communities like Stack Overflow', 42762),
 ('Read ratings or reviews on third party sites like G2Crowd', 20235),
 ('Research companies that have advertised on sites I visit', 9136),
 ('Other (please specify):', 4438),
 ('Research companies that have emailed me', 3667)]

In [90]:
(df
 .BuyNewTool
 .fillna("unanswered")
 .str.replace("Other \(please specify\):", "other", regex=True)
 .str.replace("Start a free trial", "free_trial")
 .str.replace("Ask developers I know/work with", "ask_known_devs")
 .str.replace("Visit developer communities like Stack Overflow", "stack_overflow")
 .str.replace("Read ratings or reviews on third party sites like G2Crowd", "Read ratings")
 .str.replace("Research companies that have advertised on sites I visit", "Research companies")
 .str.replace("Research companies that have emailed me", "Research companies")
 .str.lower()
 .str.replace(" ", "_")
 .value_counts()
 .index)

Index(['free_trial;stack_overflow;ask_known_devs', 'free_trial;ask_known_devs',
       'free_trial;stack_overflow;ask_known_devs;read_ratings', 'unanswered',
       'free_trial', 'stack_overflow;ask_known_devs',
       'free_trial;stack_overflow', 'ask_known_devs', 'stack_overflow',
       'free_trial;stack_overflow;read_ratings',
       ...
       'read_ratings;research_companies',
       'other;stack_overflow;research_companies;research_companies',
       'other;research_companies;research_companies',
       'other;stack_overflow;read_ratings;research_companies',
       'other;free_trial;read_ratings;research_companies',
       'other;free_trial;stack_overflow;research_companies;read_ratings;research_companies',
       'other;ask_known_devs;read_ratings;research_companies',
       'other;ask_known_devs;research_companies;research_companies',
       'other;free_trial;ask_known_devs;read_ratings;research_companies',
       'other;ask_known_devs;research_companies;read_ratings'],
      

#### Observations
- There're 125 unique entries
- Many entries occur in less than 1% of the total observations
- Entries are nested, separated by a semicolon


#### Steps
- The entries can be shortened
- This variable can be handled in various ways:
  - The rare categories (<1%) can be grouped together
  - The categories can be shortened based on presence of keywords
  - Multiple binary columns can be created
- These steps should be kept in mind for performing feature engineering during exploratory analysis

### Country

In [91]:
df.Country

0                                                      NaN
1                                                   Canada
2        United Kingdom of Great Britain and Northern I...
3                                                   Israel
4                                 United States of America
                               ...                        
73263                                              Nigeria
73264                             United States of America
73265                             United States of America
73266    United Kingdom of Great Britain and Northern I...
73267                                               Canada
Name: Country, Length: 73268, dtype: object

In [92]:
df.Country.unique()

array([nan, 'Canada',
       'United Kingdom of Great Britain and Northern Ireland', 'Israel',
       'United States of America', 'Germany', 'India', 'Netherlands',
       'Croatia', 'Australia', 'Russian Federation', 'Czech Republic',
       'Austria', 'Serbia', 'Italy', 'Ireland', 'Poland', 'Slovenia',
       'Iraq', 'Sweden', 'Madagascar', 'Norway', 'Taiwan',
       'Hong Kong (S.A.R.)', 'Mexico', 'France', 'Brazil', 'Lithuania',
       'Uruguay', 'Denmark', 'Spain', 'Egypt', 'Turkey', 'South Africa',
       'Ukraine', 'Finland', 'Romania', 'Portugal', 'Singapore', 'Oman',
       'Belgium', 'Chile', 'Bulgaria', 'Latvia', 'Philippines', 'Greece',
       'Belarus', 'Saudi Arabia', 'Kenya', 'Switzerland', 'Iceland',
       'Viet Nam', 'Thailand', 'China', 'Montenegro', 'Slovakia', 'Japan',
       'Luxembourg', 'Turkmenistan', 'Argentina', 'Hungary', 'Tunisia',
       'Bangladesh', 'Maldives', 'Dominican Republic', 'Jordan',
       'Pakistan', 'Nepal', 'Iran, Islamic Republic of...', 'I

In [93]:
with pd.option_context("display.max_rows", 180):
    display(col_value_counts("Country").head(180))

Unnamed: 0_level_0,count,pct
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
United States of America,13543,0.188697
India,6639,0.092503
Germany,5395,0.07517
United Kingdom of Great Britain and Northern Ireland,4190,0.05838
Canada,2490,0.034694
France,2328,0.032436
Brazil,2109,0.029385
Poland,1732,0.024132
Netherlands,1555,0.021666
Spain,1521,0.021192


In [94]:
(df
 .Country
 .replace(["Lao People's Democratic Republic",
           "Democratic Republic of the Congo",
           "Brunei Darussalam",
           "Swaziland",
           "Congo, Republic of the...",
           "Libyan Arab Jamahiriya",
           "United Republic of Tanzania",
           "Syrian Arab Republic",
           "Republic of Moldova",
           "The former Yugoslav Republic of Macedonia",
           "Republic of Korea",
           "Venezuela, Bolivarian Republic of...",
           "United Arab Emirates",
           "Hong Kong (S.A.R.)",
           "Viet Nam",
           "Iran, Islamic Republic of...",
           "Russian Federation",
           "United Kingdom of Great Britain and Northern Ireland",
           "United States of America"],
          ["Laos",
           "Congo",
           "Brunei",
           "Eswatini",
           "Congo",
           "Libya",
           "Tanzania",
           "Syria",
           "Moldova",
           "North Macedonia",
           "South Korea",
           "Venezuela",
           "UAE",
           "Hong Kong",
           "Vietnam",
           "Iran",
           "Russia",
           "UK",
           "USA"]))

0            NaN
1         Canada
2             UK
3         Israel
4            USA
          ...   
73263    Nigeria
73264        USA
73265        USA
73266         UK
73267     Canada
Name: Country, Length: 73268, dtype: object

In [95]:
df.Country.dropna().loc[lambda ser: ser.str.contains("Congo")].iloc[1]

'Congo, Republic of the...'

- Some of the entries were inconsistent and irregular
- These inconsistencies will be handled accordingly

### Currency

In [323]:
col_description("Currency")

Which currency do you use day-to-day? If your answer is complicated, please pick the one you're most comfortable estimating in. *


In [96]:
df.Currency

0                              NaN
1             CAD\tCanadian dollar
2              GBP\tPound sterling
3          ILS\tIsraeli new shekel
4        USD\tUnited States dollar
                   ...            
73263    USD\tUnited States dollar
73264    USD\tUnited States dollar
73265    USD\tUnited States dollar
73266          GBP\tPound sterling
73267                          NaN
Name: Currency, Length: 73268, dtype: object

In [333]:
(df
 .loc[df.Country.isna()]
 .Currency
 .unique())

array([nan], dtype=object)

In [339]:
(df
 .Country
 .fillna("unanswered")
 .replace(["Lao People's Democratic Republic", "Democratic Republic of the Congo",
           "Brunei Darussalam", "Swaziland", "Congo, Republic of the...", "Libyan Arab Jamahiriya",
           "United Republic of Tanzania", "Syrian Arab Republic", "Republic of Moldova", 
           "The former Yugoslav Republic of Macedonia", "Republic of Korea", 
           "Venezuela, Bolivarian Republic of...", "United Arab Emirates", 
           "Hong Kong (S.A.R.)", "Viet Nam", "Iran, Islamic Republic of...", "Russian Federation",
           "United Kingdom of Great Britain and Northern Ireland", "United States of America"],
          ["Laos", "Congo", "Brunei", "Eswatini", "Congo", "Libya", "Tanzania", "Syria", "Moldova", 
           "North Macedonia", "South Korea", "Venezuela", "UAE", "Hong Kong", "Vietnam", "Iran", 
           "Russia", "UK", "USA"])
 .unique())

array(['unanswered', 'Canada', 'UK', 'Israel', 'USA', 'Germany', 'India',
       'Netherlands', 'Croatia', 'Australia', 'Russia', 'Czech Republic',
       'Austria', 'Serbia', 'Italy', 'Ireland', 'Poland', 'Slovenia',
       'Iraq', 'Sweden', 'Madagascar', 'Norway', 'Taiwan', 'Hong Kong',
       'Mexico', 'France', 'Brazil', 'Lithuania', 'Uruguay', 'Denmark',
       'Spain', 'Egypt', 'Turkey', 'South Africa', 'Ukraine', 'Finland',
       'Romania', 'Portugal', 'Singapore', 'Oman', 'Belgium', 'Chile',
       'Bulgaria', 'Latvia', 'Philippines', 'Greece', 'Belarus',
       'Saudi Arabia', 'Kenya', 'Switzerland', 'Iceland', 'Vietnam',
       'Thailand', 'China', 'Montenegro', 'Slovakia', 'Japan',
       'Luxembourg', 'Turkmenistan', 'Argentina', 'Hungary', 'Tunisia',
       'Bangladesh', 'Maldives', 'Dominican Republic', 'Jordan',
       'Pakistan', 'Nepal', 'Iran', 'Indonesia', 'Ecuador',
       'Bosnia and Herzegovina', 'Armenia', 'Colombia', 'Kazakhstan',
       'South Korea', 'Costa R

In [None]:
currency_map = {"unanswered": "unanswered", "Canada": "CAD", "UK": "GBP", "Israel": "ILS",
                "USA": "USD", "Germany": "EUR", "India": "INR", "Netherlands": "EUR",
                "Croatia": "HRK", "Australia": "AUD", "Russia": "RUB",
                "Czech Republic": "CZK", "Austria": "EUR", "Serbia": "RSD", "Italy": "EUR",
                "Ireland": "EUR", "Poland": "PLN", "Slovenia": "EUR", "Iraq": "IQD",
                "Sweden": "SEK", "Madagascar": "MGA", "Norway": "NOK", "Taiwan": "TWD", 
                "Hong Kong": "HKD", "Mexico": "MXN", "France": "EUR", "Brazil": "BRL",
                "Lithuania": "LTL", "Uruguay": "UYU", "Denmark": "DKK", "Spain": "EUR", 
                "Egypt": "EGP", "Turkey": "TRY", "South Africa": "ZAR", "Ukraine": "UAH", 
                "Finland": "EUR", "Romania": "RON", "Portugal": "EUR", "Singapore": "SGD", 
                "Oman": "OMR", "Belgium": "EUR", "Chile": "CLP", "Bulgaria": "BGN", 
                "Latvia": "LVL", "Philippines": "PHP", "Greece": "EUR", "Belarus": "BYR",
                "Saudi Arabia": "SAR", "Kenya": "KES", "Switzerland": "CHF", "Iceland": "ISK", 
                "Vietnam": "VND", "Thailand": "THB", "China": "CNY", "Montenegro": "EUR", 
                "Slovakia": "EUR", "Japan": "jpy", "Luxembourg": "LUF", "Turkmenistan": "TMT",
                "Argentina": "ARS", "Hungary": "HUF", "Tunisia": "TND", "Bangladesh": "BDT", 
                "Maldives": "MVR", "Dominican Republic": "DOP", "Jordan": "JOD", "Pakistan": "PKR",
                "Nepal": "NPR", "Iran": "IRR", "Indonesia": "IDR", "Ecuador": "USD", 
                "Bosnia and Herzegovina": "BAM", "Armenia": "AMD",
                "Colombia": "COP", "Kazakhstan": "KZT", "South Korea": "KRW",
                "Costa Rica": "CRC", "Honduras": "HNL", "Mauritius": "MUR",
                "Estonia": "EUR", "Algeria": "DZD", "Trinidad and Tobago": "TTD",
                "Mali": "XOF", "Morocco": "MAD", "Eswatini": "SZL",
                "New Zealand": "NZD", "North Macedonia": "MKD", "Afghanistan": "AFN",
                "Cyprus": "CYP", "UAE": "AED", "Peru": "PEN",
                "Uzbekistan": "UZS", "Ethiopia": "ETB", "Bahrain": "BHD",
                "Malta": "MLT", "Nicaragua": "NIO", "Andorra": "ADP",
                "Lebanon": "LBP", "Belize": "BZD", "Zambia": "ZMW",
                "Bolivia": "BOB", "Malaysia": "MYR", "Sri Lanka": "LKR",
                "Laos": "LAK", "Guatemala": "GTQ", "Azerbaijan": "AZN",
                "Suriname": "SRD", "El Salvador": "USD", "Syria": "SYP",
                "Qatar": "QAR", "Nigeria": "NGN", "Kyrgyzstan": "KGS",
                "Zimbabwe": "ZWD", "Rwanda": "RWF", "Georgia": "GEL",
                "Cambodia": "KHR", "Malawi": "MWK", "Yemen": "YER",
                "Fiji": "FJD", "Nomadic": "unknown", "Uganda": "UGX",
                "Albania": "ALL", "Timor-Leste": "USD", "Mongolia": "MNT",
                "Moldova": "MDL", "Tajikistan": "TJS", "Ghana": "GHS",
                "Tanzania": "TZS", "Myanmar": "MMK", "Kuwait": "KWD",
                "Cameroon": "XAF", "Kosovo": "EUR", "Jamaica": "JMD",
                "Benin": "XOF", "Botswana": "BWP", "Niger": "XOF",
                "Palestine": "EGP", "Cape Verde": "CVE", "Libya": "LYD",
                "Venezuela": "VES", "Senegal": "XOF", "Cuba": "CUP",
                "Togo": "XOF", "Angola": "AOA", "Isle of Man": "IMP",
                "Panama": "PAB", "Bahamas": "BSD", "Paraguay": "PYG",
                "Sudan": "SDG", "Liberia": "LRD", "Bhutan": "BTN",
                "Congo": "CDF", "Côte d'Ivoire": "XOF", "Barbados": "BBD",
                "Namibia": "NAD", "Somalia": "SOS", "Sierra Leone": "SLL",
                "Mozambique": "MZN", "Lesotho": "LSL", "Chad": "XAF",
                "North Korea": "KPW", "Antigua and Barbuda": "XCD", "Papua New Guinea": "PGK",
                "Palau": "USD", "Guinea": "GNF", "Haiti": "HTG",
                "Gabon": "XAF", "Mauritania": "MRU", "San Marino": "EUR",
                "Guyana": "GYD", "Saint Lucia": "XCD", "Burkina Faso": "XOF",
                "Brunei": "BND", "Gambia": "GMD", "Monaco": "MCO",
                "Djibouti": "DJF", "Seychelles": "SCR", "Solomon Islands": "SBD",
                "Saint Kitts and Nevis": "KN"}

### Observations:
- Many countries are missing values for currency
- Some of the currencies are incorrectly mentioned
- Whenever the country column is missing, so is the corresponding value of the currency column
- It doesn't make sense to have missing values for this column when the country is specified

### Steps:
- The accurate currencies will be mapped manually based on the values of the `Country` feature

### CompTotal

In [100]:
col_description("CompTotal")

What is your current total compensation (salary, bonuses, and perks, before taxes and deductions)? Please enter a whole number in the box below, without any punctuation.  If you are paid hourly, please estimate an equivalent weekly, monthly, or yearly salary. If you prefer not to answer, please leave the box empty.


In [101]:
df.CompTotal

0             NaN
1             NaN
2         32000.0
3         60000.0
4             NaN
           ...   
73263     60000.0
73264    107000.0
73265         NaN
73266     58500.0
73267         NaN
Name: CompTotal, Length: 73268, dtype: float64

In [102]:
with pd.option_context("display.float_format", "{:,.3f}".format):
    display(df.CompTotal.describe(percentiles=[0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]))

count                                           38,422.000
mean    23,424,340,221,747,959,297,504,942,608,856,691,...
std     4,591,478,187,815,219,735,465,859,539,580,935,5...
min                                                  0.000
5%                                               2,200.000
10%                                              4,000.000
25%                                             30,000.000
50%                                             77,500.000
75%                                            154,000.000
90%                                            500,000.000
95%                                          1,800,000.000
99%                                         30,000,000.000
max     900,000,000,000,000,060,934,480,090,350,342,481...
Name: CompTotal, dtype: float64

In [103]:
(df
 .dropna(subset=["CompTotal"])
 .sort_values(by="CompTotal", ascending=False))

Unnamed: 0,ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,DevType,OrgSize,PurchaseInfluence,BuyNewTool,Country,Currency,CompTotal,CompFreq,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,PlatformHaveWorkedWith,PlatformWantToWorkWith,WebframeHaveWorkedWith,WebframeWantToWorkWith,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,OpSysProfessional use,OpSysPersonal use,VersionControlSystem,VCInteraction,VCHostingPersonal use,VCHostingProfessional use,OfficeStackAsyncHaveWorkedWith,OfficeStackAsyncWantToWorkWith,OfficeStackSyncHaveWorkedWith,OfficeStackSyncWantToWorkWith,Blockchain,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOComm,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,TBranch,ICorPM,WorkExp,Knowledge_1,Knowledge_2,Knowledge_3,Knowledge_4,Knowledge_5,Knowledge_6,Knowledge_7,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
35786,35787,I am a developer by profession,"Employed, full-time;Employed, part-time","Hybrid (some remote, some in-person)",I don’t code outside of work,"Secondary school (e.g. American high school, G...",Colleague,,,4,16,"Developer, full-stack;Academic researcher","Just me - I am a freelancer, sole proprietor, ...",I have a great deal of influence,Visit developer communities like Stack Overflow,Ecuador,DOP\tDominican peso,9.000000e+56,Weekly,APL;COBOL;Scala,C++;Dart;MATLAB;VBA,IBM DB2,Neo4j;Redis,DigitalOcean,Heroku,Angular,Blazor;Svelte,Scikit-learn,Tidyverse,Puppet,Yarn,Android Studio,Android Studio,Windows,BSD,Other (please specify):;Mercurial,Command-line,,,Airtable,Planview Projectplace or Clarizen,Google Chat,RingCentral,Indifferent,Stack Overflow for Teams (private knowledge sh...,A few times per week,No,,"No, not really",65 years or older,"Man;Woman;Non-binary, genderqueer, or gender n...",No,Straight / Heterosexual,White;Indian;European;North American;Middle Ea...,I am deaf / hard of hearing;I am blind / have ...,"I have a mood or emotional disorder (e.g., dep...",Yes,People manager,50.0,Disagree,Neither agree nor disagree,Agree,Strongly agree,Agree,Neither agree nor disagree,Disagree,10+ times a week,10+ times a week,10+ times a week,Over 120 minutes a day,Over 120 minutes a day,Just right,DevOps function;Automated testing,Yes,No,Yes,Too long,Difficult,
3068,3069,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Contribute to open-source projects;Boots...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","School (i.e., University, College, etc);Other ...",,,10,2,"Developer, back-end;Scientist","10,000 or more employees",I have some influence,,United States of America,USD\tUnited States dollar,1.000000e+52,Weekly,C;C++;Fortran;Java;Python;Scala,Python;Rust;Scala,Cassandra;PostgreSQL;Redis;SQLite,Cassandra;PostgreSQL;Redis;SQLite,AWS,,Flask,Flask,Apache Kafka;Apache Spark,Apache Kafka;Apache Spark,Docker;Kubernetes,Docker;Kubernetes,IntelliJ;PyCharm;Vim,IntelliJ;PyCharm;Vim,Linux-based,Linux-based;Windows,Git,Command-line;Version control hosting service w...,,,Confluence,Confluence,Microsoft Teams;Rocketchat,Rocketchat,Unsure,Stack Overflow;Stack Exchange,Daily or almost daily,Yes,Less than once per month or monthly,"No, not at all",18-24 years old,Man,"Or, in your own words:",Prefer to self-describe:,"Or, in your own words:","Or, in your own words:","Or, in your own words:",No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Difficult,
70597,70598,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Fully remote,Hobby;Contribute to open-source projects,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Other online resources ...,Technical documentation;Blogs;Written Tutorial...,,20,19,"Developer, front-end;Developer, full-stack;Dev...","Just me - I am a freelancer, sole proprietor, ...",I have a great deal of influence,Other (please specify):,United States of America,ZMW Zambian kwacha,1.000000e+22,Yearly,Bash/Shell;HTML/CSS;JavaScript;Perl;PHP;SQL,,MariaDB;MySQL,,,,jQuery,,,,,,Nano,,BSD;Linux-based,Linux-based,Git;SVN,Command-line,,,,,,,Unfavorable,Stack Overflow;Stack Exchange,A few times per week,Yes,Less than once per month or monthly,"No, not really",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
17567,17568,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Contribute to open-source projects;Freel...,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Stack Overflow;Online ...,,10,6,"Developer, front-end;Developer, full-stack;Dev...",10 to 19 employees,I have some influence,Visit developer communities like Stack Overflo...,Italy,EUR European Euro,1.000000e+15,Monthly,HTML/CSS;JavaScript;PHP;SQL,C#;Go;HTML/CSS;Java;JavaScript;PHP;Python;R;SQ...,MariaDB;MongoDB;MySQL;SQLite,MariaDB;MongoDB;MySQL;SQLite,AWS;DigitalOcean;Managed Hosting,AWS;DigitalOcean;Google Cloud;Managed Hosting;...,Angular.js;jQuery;Laravel;Node.js;Vue.js,Angular;Angular.js;jQuery;Laravel;Node.js;Reac...,,,Docker;Homebrew;npm,Docker;Homebrew;npm,Nano;PhpStorm;Vim,Nano;PhpStorm;Vim,macOS,Linux-based;Windows,Git,Code editor;Command-line,,,Trello,,Google Chat;Microsoft Teams;Slack;Zoom,Google Chat;Microsoft Teams;Slack;Zoom,Indifferent,Collectives on Stack Overflow;Stack Overflow;S...,Multiple times per day,Yes,A few times per week,"Yes, somewhat",25-34 years old,Man,No,Straight / Heterosexual,European,None of the above,None of the above,Yes,Independent contributor,8.0,Strongly agree,Neither agree nor disagree,Strongly agree,Strongly agree,Strongly agree,Neither agree nor disagree,Agree,Never,Never,Never,15-30 minutes a day,Less than 15 minutes a day,Somewhat short,DevOps function;Microservices;Continuous integ...,Yes,Yes,Yes,Too long,Neither easy nor difficult,
19244,19245,"I am not primarily a developer, but I write co...","Independent contractor, freelancer, or self-em...",Fully remote,Hobby;Contribute to open-source projects;Freel...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Friend or family member...,Technical documentation;Programming Games;Writ...,,15,10,"Developer, front-end;Engineer, data;Developer,...",10 to 19 employees,I have little or no influence,Other (please specify):;Visit developer commun...,Kyrgyzstan,KGS\tKyrgyzstani som,5.000000e+12,Yearly,Bash/Shell;C++;Go;HTML/CSS;JavaScript;MATLAB;P...,Bash/Shell;HTML/CSS;JavaScript;PHP;SQL;TypeScript,Elasticsearch;MariaDB;MongoDB;MySQL;Oracle;Pos...,Elasticsearch;MariaDB;MySQL,DigitalOcean;Firebase;Google Cloud;Oracle Clou...,DigitalOcean;Firebase,Angular;Django;Drupal;Gatsby;jQuery;Laravel;No...,Angular;Vue.js,Apache Kafka;Apache Spark;React Native;Torch/P...,,Docker,Ansible;Docker;Flow;Kubernetes;npm;Puppet;Unit...,Android Studio;Atom;IntelliJ;NetBeans;Neovim;N...,Atom;Notepad++;PhpStorm;PyCharm,Linux-based,BSD;Linux-based;Windows;Windows Subsystem for ...,Git,Code editor;Command-line;Version control hosti...,,,Microsoft Planner;Trello,,Google Chat;Rocketchat;Slack;Symphony;Zoom,,Favorable,Stack Overflow;Stack Exchange,Daily or almost daily,Yes,Daily or almost daily,"Yes, definitely",35-44 years old,"Man;Or, in your own words:","Or, in your own words:",Straight / Heterosexual;Prefer to self-describe:,White;Asian;Multiracial;Central Asian,"Or, in your own words:;I am deaf / hard of hea...","Or, in your own words:",No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Neither easy nor difficult,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3891,3892,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Fully remote,Contribute to open-source projects;Freelance/c...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Other online resources ...,Technical documentation;Blogs;Stack Overflow;O...,Coursera;Udemy;Pluralsight;Udacity;Other,10,6,Data or business analyst,"Just me - I am a freelancer, sole proprietor, ...",I have some influence,Start a free trial;Visit developer communities...,Russian Federation,RUB\tRussian ruble,0.000000e+00,Yearly,Bash/Shell;JavaScript;Python;SQL,Dart,Microsoft SQL Server;MongoDB;PostgreSQL;Redis;...,,AWS;DigitalOcean,,FastAPI;Flask,Django,Apache Kafka;Flutter;Keras;NumPy;Pandas;Scikit...,,Docker,Kubernetes,IPython/Jupyter;Sublime Text;Visual Studio Code,,Linux-based,Linux-based,Git,Command-line,,,,,Slack;Zoom,,Unsure,Stack Overflow;Stack Exchange,A few times per month or weekly,Yes,Less than once per month or monthly,"Yes, definitely",25-34 years old,Man,No,Bisexual,White;European,None of the above,None of the above,,,,,,,,,,,,,,,,,,,,,Appropriate in length,Easy,
8085,8086,I am a developer by profession,"Independent contractor, freelancer, or self-em...","Hybrid (some remote, some in-person)",Hobby;Contribute to open-source projects;Boots...,Some college/university study without earning ...,"Books / Physical media;School (i.e., Universit...",,,25,20,"Developer, front-end;Developer, full-stack;Dev...",2 to 9 employees,I have a great deal of influence,Other (please specify):;Start a free trial,Russian Federation,USD\tUnited States dollar,0.000000e+00,Monthly,Assembly;Bash/Shell;C;C++;HTML/CSS;JavaScript;...,,MongoDB;PostgreSQL;Redis;SQLite,MongoDB;PostgreSQL;Redis;SQLite,Managed Hosting,,Node.js;React.js,Deno;Node.js;React.js,Electron;Flutter;GTK;React Native,Electron;Flutter;GTK;React Native,Docker;Flow;npm;Unreal Engine;Yarn,Ansible;Docker;Flow;Kubernetes;Unreal Engine;Yarn,Android Studio;Vim;Visual Studio Code,Vim;Visual Studio Code,Linux-based,Linux-based,Git,Command-line,,,Trello,,,,Unsure,Stack Overflow;Stack Exchange,Less than once per month or monthly,Not sure/can't remember,,"Yes, somewhat",35-44 years old,Man,No,Straight / Heterosexual,I don't know,I am blind / have difficulty seeing,,No,,,,,,,,,,,,,,,,,,,,,Neither easy nor difficult,
47360,47361,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Fully remote,Bootstrapping a business,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Blogs;Stack Overflow;Video-based Online Course...,Udemy,20,20,"Developer, front-end;Developer, full-stack;Dev...","Just me - I am a freelancer, sole proprietor, ...",I have a great deal of influence,Start a free trial;Visit developer communities...,United States of America,USD\tUnited States dollar,0.000000e+00,Yearly,HTML/CSS;JavaScript;PHP;SQL;Swift,HTML/CSS;JavaScript;PHP;SQL;Swift,Cloud Firestore;MariaDB;Firebase Realtime Data...,Cloud Firestore;MariaDB;Firebase Realtime Data...,DigitalOcean;Firebase;Linode,DigitalOcean;Firebase;Linode,Laravel;Vue.js,Laravel;Vue.js,,,Homebrew;npm,Homebrew;npm,Visual Studio Code;Xcode,Visual Studio Code;Xcode,macOS,macOS;Windows,Git,Command-line,,,,,,,Unfavorable,Stack Overflow;Stack Exchange,Multiple times per day,Yes,I have never participated in Q&A on Stack Over...,Neutral,35-44 years old,Man,No,Straight / Heterosexual,White,None of the above,None of the above,,,,,,,,,,,,,,,,,,,,,Appropriate in length,Neither easy nor difficult,
51860,51861,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Fully remote,Contribute to open-source projects;Freelance/c...,Some college/university study without earning ...,Books / Physical media;Friend or family member...,Technical documentation;Blogs;Written Tutorial...,,42,37,"Developer, front-end;Developer, back-end;Devel...","Just me - I am a freelancer, sole proprietor, ...",I have a great deal of influence,Other (please specify):,United States of America,USD\tUnited States dollar,0.000000e+00,Yearly,JavaScript;Perl,JavaScript;Perl,SQLite,SQLite,DigitalOcean;Firebase,DigitalOcean;Firebase,Node.js,Node.js,Cordova,Cordova,,,Android Studio;Vim;Xcode,Android Studio;Vim;Xcode,Linux-based;macOS,Linux-based;macOS,Other (please specify):,Command-line,,,,,Mattermost;Zoom,Mattermost;Zoom,Unsure,Stack Overflow;Stack Exchange,A few times per week,Yes,A few times per month or weekly,Not sure,55-64 years old,Man,No,Straight / Heterosexual,White;North American,None of the above,"I have a mood or emotional disorder (e.g., dep...",,,,,,,,,,,,,,,,,,,,,Appropriate in length,Easy,


#### Checking who are the people having 0 compensation

In [104]:
(df
 .query("CompTotal == 0")
 .Employment
 .pipe(lambda ser: col_strip_nested(ser.name)))

[('Employed, full-time', 49199),
 ('Student, full-time', 10932),
 ('Independent contractor, freelancer, or self-employed', 10721),
 ('Employed, part-time', 4154),
 ('Student, part-time', 3722),
 ('Not employed, but looking for work', 3381),
 ('Not employed, and not looking for work', 1244),
 ('I prefer not to say', 611),
 ('Retired', 396)]

In [105]:
(df
 .query("CompTotal == 0")
 .Employment
 .pipe(lambda ser: col_value_counts(ser.name)))

Unnamed: 0_level_0,count,pct
Employment,Unnamed: 1_level_1,Unnamed: 2_level_1
"Employed, full-time",42962,0.599116
"Student, full-time",6756,0.094214
"Independent contractor, freelancer, or self-employed",4978,0.069419
"Employed, full-time;Independent contractor, freelancer, or self-employed",3486,0.048613
"Not employed, but looking for work",1831,0.025534
...,...,...
"Student, part-time;Independent contractor, freelancer, or self-employed;Retired",1,0.000014
"Employed, full-time;Independent contractor, freelancer, or self-employed;Employed, part-time;Not employed, and not looking for work",1,0.000014
"Employed, full-time;Student, full-time;Retired",1,0.000014
"Employed, part-time;Not employed, and not looking for work",1,0.000014


In [106]:
((df.Employment.str.contains("Employed, full-time")) & (df.CompTotal == 0)).sum()

27

In [107]:
df.CompTotal.isna().sum()

34846

In [108]:
(df
 .CompTotal
 .where(lambda ser: ~(df.Employment.str.contains("Employed, full-time") & (ser == 0)), np.nan)
 .isna().sum())

34873

#### Observations
- The values are of the right type
- Some values are extremely large and way out of range
  - These could be because of the varying compensation frequencies
  - Or due to different currencies
- Some entries are 0; these are people who are:
  - Students
  - Unemployes
  - Retired
  - Self-employed
  - Part-time employed
- There're also some `Full-time Employed` people having 0 compensation:
  - Maybe they didn't want to share their income


#### Steps
- The column `CompFreq` must be analyzed for inaccuracies
- For full-time employees having 0 compensation:
  - These values will be replaced with `np.nan`

### CompFreq

In [109]:
col_description("CompFreq")

Is that compensation weekly, monthly, or yearly?


In [110]:
df.CompFreq

0            NaN
1            NaN
2         Yearly
3        Monthly
4            NaN
          ...   
73263     Yearly
73264     Yearly
73265        NaN
73266     Yearly
73267        NaN
Name: CompFreq, Length: 73268, dtype: object

In [111]:
df.CompFreq.unique()

array([nan, 'Yearly', 'Monthly', 'Weekly'], dtype=object)

In [112]:
col_value_counts("CompFreq")

Unnamed: 0_level_0,count,pct
CompFreq,Unnamed: 1_level_1,Unnamed: 2_level_1
Yearly,23267,0.523737
Monthly,19983,0.449814
Weekly,1175,0.026449


#### Observations
- There're 3 unique entries
- Values are valid
- No cleaning required


#### Steps
- The type can be made `category` for less memory usage

### LanguageHaveWorkedWith

In [113]:
df.LanguageHaveWorkedWith

0                                                      NaN
1                                    JavaScript;TypeScript
2                        C#;C++;HTML/CSS;JavaScript;Python
3                             C#;JavaScript;SQL;TypeScript
4              C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript
                               ...                        
73263    Bash/Shell;Dart;JavaScript;PHP;Python;SQL;Type...
73264            Bash/Shell;HTML/CSS;JavaScript;Python;SQL
73265                   HTML/CSS;JavaScript;PHP;Python;SQL
73266                                        C#;Delphi;VBA
73267          C#;JavaScript;Lua;PowerShell;SQL;TypeScript
Name: LanguageHaveWorkedWith, Length: 73268, dtype: object

In [114]:
col_value_counts("LanguageHaveWorkedWith")

Unnamed: 0_level_0,count,pct
LanguageHaveWorkedWith,Unnamed: 1_level_1,Unnamed: 2_level_1
HTML/CSS;JavaScript;TypeScript,1250,0.017612
Python,962,0.013554
HTML/CSS;JavaScript,914,0.012878
HTML/CSS;JavaScript;PHP;SQL,745,0.010497
C#;HTML/CSS;JavaScript;SQL;TypeScript,570,0.008031
...,...,...
C#;Java;JavaScript;PHP;PowerShell;SQL;TypeScript,1,0.000014
C++;HTML/CSS;JavaScript;Python;Solidity;SQL,1,0.000014
Dart;Haskell;HTML/CSS;Java;R;SQL;TypeScript,1,0.000014
Bash/Shell;C;C#;Go;HTML/CSS;Java;JavaScript;Lua;Perl;PHP;Python;Ruby;Rust;SQL;TypeScript,1,0.000014


In [115]:
col_strip_nested("LanguageHaveWorkedWith")

[('JavaScript', 46443),
 ('HTML/CSS', 39142),
 ('SQL', 35127),
 ('Python', 34155),
 ('TypeScript', 24752),
 ('Java', 23644),
 ('Bash/Shell', 20656),
 ('C#', 19883),
 ('C++', 16024),
 ('PHP', 14827),
 ('C', 13692),
 ('PowerShell', 8575),
 ('Go', 7922),
 ('Rust', 6625),
 ('Kotlin', 6507),
 ('Dart', 4648),
 ('Ruby', 4299),
 ('Assembly', 3887),
 ('Swift', 3489),
 ('R', 3308),
 ('VBA', 3185),
 ('MATLAB', 2913),
 ('Lua', 2867),
 ('Groovy', 2357),
 ('Delphi', 2311),
 ('Scala', 1837),
 ('Objective-C', 1698),
 ('Perl', 1644),
 ('Haskell', 1577),
 ('Elixir', 1528),
 ('Julia', 1084),
 ('Clojure', 1070),
 ('Solidity', 1031),
 ('LISP', 932),
 ('F#', 730),
 ('Fortran', 646),
 ('Erlang', 641),
 ('APL', 504),
 ('COBOL', 464),
 ('SAS', 435),
 ('OCaml', 422),
 ('Crystal', 340)]

In [116]:
(df
 .LanguageHaveWorkedWith
 .apply(lambda val: len(val.split(";")) if isinstance(val, str) else np.nan))

0        NaN
1        2.0
2        5.0
3        4.0
4        6.0
        ... 
73263    7.0
73264    5.0
73265    5.0
73266    3.0
73267    6.0
Name: LanguageHaveWorkedWith, Length: 73268, dtype: float64

#### Observations:
- There're 25,068 unique entries
- Entries are nested, separated by a semicolon
- The entries are clean and valid


#### Steps:
- This variable requires feature engineering
- Can create a column, `Num of languages known`
- These steps should be kept in mind for performing feature engineering during exploratory analysis

### LanguageWantToWorkWith

In [117]:
df.LanguageWantToWorkWith

0                                                   NaN
1                                       Rust;TypeScript
2                 C#;C++;HTML/CSS;JavaScript;TypeScript
3                                     C#;SQL;TypeScript
4            C#;Elixir;F#;Go;JavaScript;Rust;TypeScript
                              ...                      
73263    Bash/Shell;Go;JavaScript;Python;SQL;TypeScript
73264                        HTML/CSS;JavaScript;Python
73265             C#;HTML/CSS;JavaScript;PHP;Python;SQL
73266                                            Delphi
73267                        PowerShell;Rust;TypeScript
Name: LanguageWantToWorkWith, Length: 73268, dtype: object

- Same steps as `LanguageHaveWorkedWith`

### DatabaseHaveWorkedWith

In [118]:
df.DatabaseHaveWorkedWith

0                                                      NaN
1                                                      NaN
2                                     Microsoft SQL Server
3                                     Microsoft SQL Server
4        Cloud Firestore;Elasticsearch;Microsoft SQL Se...
                               ...                        
73263                 Elasticsearch;MySQL;PostgreSQL;Redis
73264                  Elasticsearch;MongoDB;Oracle;SQLite
73265    MariaDB;Microsoft SQL Server;MySQL;PostgreSQL;...
73266                  Microsoft SQL Server;MongoDB;Oracle
73267                     Microsoft SQL Server;Neo4j;Redis
Name: DatabaseHaveWorkedWith, Length: 73268, dtype: object

In [119]:
col_value_counts("DatabaseHaveWorkedWith")

Unnamed: 0_level_0,count,pct
DatabaseHaveWorkedWith,Unnamed: 1_level_1,Unnamed: 2_level_1
MySQL,3563,0.059264
PostgreSQL,3519,0.058532
Microsoft SQL Server,3238,0.053858
SQLite,2022,0.033632
MongoDB,1655,0.027528
...,...,...
Elasticsearch;IBM DB2;MongoDB;MySQL,1,0.000017
Cloud Firestore;Microsoft SQL Server;MongoDB;MySQL;Neo4j,1,0.000017
Cloud Firestore;MongoDB;Oracle;PostgreSQL;Redis;SQLite,1,0.000017
MariaDB;Microsoft SQL Server;MySQL;Neo4j;Oracle;PostgreSQL;SQLite,1,0.000017


In [120]:
col_strip_nested("DatabaseHaveWorkedWith")

[('MySQL', 28520),
 ('PostgreSQL', 26538),
 ('SQLite', 19487),
 ('MongoDB', 17228),
 ('Microsoft SQL Server', 16355),
 ('Redis', 13471),
 ('MariaDB', 10912),
 ('Elasticsearch', 7430),
 ('Oracle', 6994),
 ('Firebase Realtime Database', 5309),
 ('DynamoDB', 5029),
 ('Cloud Firestore', 4535),
 ('Cassandra', 1617),
 ('Neo4j', 1291),
 ('IBM DB2', 1219),
 ('Couchbase', 807),
 ('CouchDB', 783)]

In [121]:
(df
 .DatabaseHaveWorkedWith
 .str.replace(" ", "_"))

0                                                      NaN
1                                                      NaN
2                                     Microsoft_SQL_Server
3                                     Microsoft_SQL_Server
4        Cloud_Firestore;Elasticsearch;Microsoft_SQL_Se...
                               ...                        
73263                 Elasticsearch;MySQL;PostgreSQL;Redis
73264                  Elasticsearch;MongoDB;Oracle;SQLite
73265    MariaDB;Microsoft_SQL_Server;MySQL;PostgreSQL;...
73266                  Microsoft_SQL_Server;MongoDB;Oracle
73267                     Microsoft_SQL_Server;Neo4j;Redis
Name: DatabaseHaveWorkedWith, Length: 73268, dtype: object

### DatabaseWantToWorkWith

In [122]:
col_strip_nested("DatabaseWantToWorkWith")

[('PostgreSQL', 25212),
 ('MongoDB', 17297),
 ('MySQL', 16271),
 ('Redis', 16211),
 ('SQLite', 14085),
 ('Microsoft SQL Server', 9867),
 ('Elasticsearch', 8533),
 ('MariaDB', 7181),
 ('Firebase Realtime Database', 5312),
 ('DynamoDB', 5309),
 ('Cloud Firestore', 3873),
 ('Oracle', 3511),
 ('Cassandra', 3178),
 ('Neo4j', 2354),
 ('CouchDB', 1154),
 ('Couchbase', 955),
 ('IBM DB2', 519)]

### PlatformHaveWorkedWith

In [123]:
col_value_counts("PlatformHaveWorkedWith")

Unnamed: 0_level_0,count,pct
PlatformHaveWorkedWith,Unnamed: 1_level_1,Unnamed: 2_level_1
AWS,8719,0.174645
Microsoft Azure,4989,0.099932
Google Cloud,2306,0.046190
AWS;Microsoft Azure,2039,0.040842
Heroku,2003,0.040121
...,...,...
AWS;Colocation;Firebase;Google Cloud;IBM Cloud or Watson;OpenStack;VMware,1,0.000020
Firebase;Google Cloud;Heroku;Linode;Managed Hosting;Microsoft Azure,1,0.000020
AWS;DigitalOcean;Firebase;Google Cloud;Heroku;IBM Cloud or Watson;Managed Hosting;Microsoft Azure;Oracle Cloud Infrastructure;VMware,1,0.000020
AWS;DigitalOcean;Firebase;Google Cloud;IBM Cloud or Watson;Oracle Cloud Infrastructure;OpenStack;VMware,1,0.000020


In [124]:
col_strip_nested("PlatformHaveWorkedWith")

[('AWS', 25939),
 ('Microsoft Azure', 14604),
 ('Google Cloud', 13634),
 ('Firebase', 10751),
 ('Heroku', 10160),
 ('DigitalOcean', 7953),
 ('VMware', 4429),
 ('Managed Hosting', 2927),
 ('Linode', 1994),
 ('OVH', 1913),
 ('Oracle Cloud Infrastructure', 1110),
 ('OpenStack', 1029),
 ('IBM Cloud or Watson', 853),
 ('Colocation', 642)]

In [125]:
(df
 .PlatformHaveWorkedWith
 .fillna("unanswered")
 .str.replace("IBM Cloud or Watson", "watson")
 .str.replace("OpenStack", "open_stack")
 .str.replace("DigitalOcean", "Digital Ocean")
 .str.lower()
 .str.replace(" ", "_")
 .unique())

array(['unanswered', 'firebase;microsoft_azure',
       'aws;google_cloud;heroku', ...,
       'aws;digital_ocean;google_cloud;heroku;open_stack',
       'watson;linode;vmware',
       'colocation;digital_ocean;heroku;linode;oracle_cloud_infrastructure;ovh'],
      dtype=object)

### PlatformWantToWorkWith

In [126]:
col_strip_nested("PlatformWantToWorkWith")

[('AWS', 23701),
 ('Google Cloud', 13394),
 ('Microsoft Azure', 12785),
 ('Firebase', 8643),
 ('DigitalOcean', 6983),
 ('Heroku', 5951),
 ('VMware', 2495),
 ('Linode', 2259),
 ('Managed Hosting', 1887),
 ('OpenStack', 1385),
 ('OVH', 1233),
 ('Oracle Cloud Infrastructure', 1144),
 ('IBM Cloud or Watson', 888),
 ('Colocation', 537)]

### WebframeHaveWorkedWith

In [127]:
df.WebframeHaveWorkedWith

0                                                 NaN
1                                                 NaN
2                                          Angular.js
3                               ASP.NET;ASP.NET Core 
4        Angular;ASP.NET;ASP.NET Core ;jQuery;Node.js
                             ...                     
73263                         Express;FastAPI;Node.js
73264                          FastAPI;Flask;React.js
73265                                ASP.NET;React.js
73266                                             NaN
73267    ASP.NET Core ;Blazor;Node.js;React.js;Svelte
Name: WebframeHaveWorkedWith, Length: 73268, dtype: object

In [128]:
col_strip_nested("WebframeHaveWorkedWith")

[('Node.js', 25733),
 ('React.js', 23277),
 ('jQuery', 15602),
 ('Express', 12557),
 ('Angular', 11138),
 ('Vue.js', 10278),
 ('ASP.NET Core ', 10155),
 ('ASP.NET', 8139),
 ('Django', 8002),
 ('Flask', 7994),
 ('Next.js', 7386),
 ('Laravel', 5159),
 ('Angular.js', 4912),
 ('FastAPI', 3289),
 ('Ruby on Rails', 3182),
 ('Svelte', 2500),
 ('Blazor', 2438),
 ('Nuxt.js', 2089),
 ('Symfony', 1955),
 ('Gatsby', 1889),
 ('Drupal', 1211),
 ('Phoenix', 1164),
 ('Fastify', 1008),
 ('Deno', 925),
 ('Play Framework', 450)]

In [129]:
(df
 .WebframeHaveWorkedWith
 .fillna("unanswered")
 .str.lower()
 .str.replace(" ", "_")
 .str.replace(".", "_", regex=True))

0                                          unanswered
1                                          unanswered
2                                          angular_js
3                               asp_net;asp_net_core_
4        angular;asp_net;asp_net_core_;jquery;node_js
                             ...                     
73263                         express;fastapi;node_js
73264                          fastapi;flask;react_js
73265                                asp_net;react_js
73266                                      unanswered
73267    asp_net_core_;blazor;node_js;react_js;svelte
Name: WebframeHaveWorkedWith, Length: 73268, dtype: object

### WebframeWantToWorkWith

In [130]:
col_strip_nested("WebframeWantToWorkWith")

[('React.js', 21910),
 ('Node.js', 20948),
 ('Vue.js', 12666),
 ('Next.js', 10348),
 ('Express', 9333),
 ('ASP.NET Core ', 8894),
 ('Angular', 8402),
 ('Django', 7200),
 ('Svelte', 6741),
 ('jQuery', 5937),
 ('Flask', 5437),
 ('FastAPI', 4571),
 ('Blazor', 3880),
 ('ASP.NET', 3848),
 ('Laravel', 3808),
 ('Nuxt.js', 3805),
 ('Deno', 3717),
 ('Ruby on Rails', 3272),
 ('Angular.js', 2773),
 ('Phoenix', 1862),
 ('Gatsby', 1569),
 ('Fastify', 1452),
 ('Symfony', 1433),
 ('Drupal', 600),
 ('Play Framework', 371)]

### MiscTechHaveWorkedWith

In [131]:
df.MiscTechHaveWorkedWith

0                                                      NaN
1                                                      NaN
2                                                   Pandas
3                                                     .NET
4                                                     .NET
                               ...                        
73263                                              Flutter
73264    Keras;NumPy;Pandas;Scikit-learn;TensorFlow;Tor...
73265                             .NET;Pandas;React Native
73266                                                  NaN
73267                            Apache Kafka;Apache Spark
Name: MiscTechHaveWorkedWith, Length: 73268, dtype: object

In [132]:
col_strip_nested("MiscTechHaveWorkedWith")

[('.NET', 15850),
 ('NumPy', 13144),
 ('Pandas', 11506),
 ('Spring', 7399),
 ('TensorFlow', 5942),
 ('Flutter', 5799),
 ('Scikit-learn', 5776),
 ('React Native', 5765),
 ('Apache Kafka', 4748),
 ('Electron', 4390),
 ('Torch/PyTorch', 3952),
 ('Qt', 3906),
 ('Keras', 3333),
 ('Ionic', 2417),
 ('Xamarin', 2388),
 ('Apache Spark', 2298),
 ('Cordova', 1903),
 ('Hadoop', 1581),
 ('GTK', 1380),
 ('Capacitor', 1142),
 ('Tidyverse', 996),
 ('Hugging Face Transformers', 925),
 ('Uno Platform', 334)]

In [133]:
(df
 .MiscTechHaveWorkedWith
 .fillna("unanswered")
 .str.replace("Torch/PyTorch", "pytorch")
 .str.replace(".NET", "dot_net", regex=True)
 .str.lower()
 .str.replace(" ", "_")
 .str.replace("-", "_"))

0                                               unanswered
1                                               unanswered
2                                                   pandas
3                                                  dot_net
4                                                  dot_net
                               ...                        
73263                                              flutter
73264    keras;numpy;pandas;scikit_learn;tensorflow;pyt...
73265                          dot_net;pandas;react_native
73266                                           unanswered
73267                            apache_kafka;apache_spark
Name: MiscTechHaveWorkedWith, Length: 73268, dtype: object

### MiscTechWantToWorkWith

In [134]:
col_strip_nested("MiscTechWantToWorkWith")

[('.NET', 11824),
 ('NumPy', 10248),
 ('Pandas', 9070),
 ('TensorFlow', 8836),
 ('Flutter', 7980),
 ('React Native', 6728),
 ('Apache Kafka', 6153),
 ('Torch/PyTorch', 6073),
 ('Spring', 5472),
 ('Scikit-learn', 5175),
 ('Electron', 4349),
 ('Keras', 3493),
 ('Apache Spark', 3155),
 ('Qt', 2895),
 ('Xamarin', 2259),
 ('Hadoop', 1993),
 ('Ionic', 1679),
 ('Hugging Face Transformers', 1462),
 ('GTK', 1314),
 ('Capacitor', 1073),
 ('Tidyverse', 828),
 ('Cordova', 745),
 ('Uno Platform', 578)]

### ToolsTechHaveWorkedWith

In [135]:
df.ToolsTechHaveWorkedWith

0                                           NaN
1                                           NaN
2                                           NaN
3                                           NaN
4                                           npm
                          ...                  
73263            Docker;Homebrew;Kubernetes;npm
73264                                       NaN
73265                                       npm
73266                                       NaN
73267    Docker;Kubernetes;npm;Pulumi;Terraform
Name: ToolsTechHaveWorkedWith, Length: 73268, dtype: object

In [136]:
col_strip_nested("ToolsTechHaveWorkedWith")

[('npm', 35778),
 ('Docker', 34981),
 ('Yarn', 15175),
 ('Homebrew', 14420),
 ('Kubernetes', 12624),
 ('Terraform', 6160),
 ('Unity 3D', 5840),
 ('Ansible', 5210),
 ('Unreal Engine', 2180),
 ('Puppet', 1025),
 ('Chef', 828),
 ('Pulumi', 461),
 ('Flow', 444)]

In [137]:
(df
 .ToolsTechHaveWorkedWith
 .fillna("unanswered")
 .str.lower()
 .str.replace(" ", "_"))

0                                    unanswered
1                                    unanswered
2                                    unanswered
3                                    unanswered
4                                           npm
                          ...                  
73263            docker;homebrew;kubernetes;npm
73264                                unanswered
73265                                       npm
73266                                unanswered
73267    docker;kubernetes;npm;pulumi;terraform
Name: ToolsTechHaveWorkedWith, Length: 73268, dtype: object

### ToolsTechWantToWorkWith

In [138]:
col_strip_nested("ToolsTechWantToWorkWith")

[('Docker', 32642),
 ('npm', 23147),
 ('Kubernetes', 18647),
 ('Yarn', 11154),
 ('Homebrew', 10424),
 ('Terraform', 8130),
 ('Unity 3D', 6293),
 ('Ansible', 5540),
 ('Unreal Engine', 5061),
 ('Puppet', 1056),
 ('Pulumi', 1050),
 ('Chef', 911),
 ('Flow', 453)]

### NEWCollabToolsHaveWorkedWith

In [139]:
df.NEWCollabToolsHaveWorkedWith

0                                                      NaN
1                                                      NaN
2                                  Notepad++;Visual Studio
3               Notepad++;Visual Studio;Visual Studio Code
4         Notepad++;Visual Studio;Visual Studio Code;Xcode
                               ...                        
73263    IPython/Jupyter;Sublime Text;Vim;Visual Studio...
73264    IPython/Jupyter;Notepad++;Spyder;Vim;Visual St...
73265              Spyder;Visual Studio;Visual Studio Code
73266       RAD Studio (Delphi, C++ Builder);Visual Studio
73267                     Visual Studio;Visual Studio Code
Name: NEWCollabToolsHaveWorkedWith, Length: 73268, dtype: object

In [140]:
col_strip_nested("NEWCollabToolsHaveWorkedWith")

[('Visual Studio Code', 52523),
 ('Visual Studio', 22673),
 ('IntelliJ', 19723),
 ('Notepad++', 19543),
 ('Vim', 16458),
 ('Android Studio', 13963),
 ('PyCharm', 12158),
 ('Sublime Text', 11698),
 ('Eclipse', 8866),
 ('IPython/Jupyter', 8188),
 ('Xcode', 7425),
 ('Atom', 6595),
 ('Nano', 6530),
 ('Webstorm', 5602),
 ('PhpStorm', 4790),
 ('Neovim', 4759),
 ('NetBeans', 3695),
 ('CLion', 3543),
 ('Rider', 3480),
 ('Emacs', 3178),
 ('RStudio', 2387),
 ('GoLand', 2345),
 ('RAD Studio (Delphi, C++ Builder)', 1894),
 ('Qt Creator', 1892),
 ('Spyder', 1637),
 ('RubyMine', 975),
 ('TextMate', 516)]

In [141]:
(df
 .NEWCollabToolsHaveWorkedWith
 .fillna("unanswered")
 .str.replace("RAD Studio \(Delphi, C\++ Builder\)", "RAD Studio", regex=True)
 .str.replace("IPython/Jupyter", "Jupyter", regex=True)
 .str.lower()
 .str.replace(" ", "_"))

0                                              unanswered
1                                              unanswered
2                                 notepad++;visual_studio
3              notepad++;visual_studio;visual_studio_code
4        notepad++;visual_studio;visual_studio_code;xcode
                               ...                       
73263         jupyter;sublime_text;vim;visual_studio_code
73264     jupyter;notepad++;spyder;vim;visual_studio_code
73265             spyder;visual_studio;visual_studio_code
73266                            rad_studio;visual_studio
73267                    visual_studio;visual_studio_code
Name: NEWCollabToolsHaveWorkedWith, Length: 73268, dtype: object

### NEWCollabToolsWantToWorkWith

In [142]:
col_strip_nested("NEWCollabToolsWantToWorkWith")

[('Visual Studio Code', 44282),
 ('IntelliJ', 15040),
 ('Visual Studio', 14958),
 ('Vim', 12862),
 ('Notepad++', 11734),
 ('Android Studio', 9096),
 ('PyCharm', 8882),
 ('Sublime Text', 6370),
 ('IPython/Jupyter', 6224),
 ('Xcode', 5780),
 ('Neovim', 4977),
 ('Webstorm', 4638),
 ('Nano', 3897),
 ('Rider', 3537),
 ('PhpStorm', 3265),
 ('Eclipse', 2942),
 ('Emacs', 2885),
 ('CLion', 2828),
 ('GoLand', 2725),
 ('Atom', 2508),
 ('RAD Studio (Delphi, C++ Builder)', 1566),
 ('RStudio', 1449),
 ('Qt Creator', 1226),
 ('NetBeans', 1080),
 ('RubyMine', 850),
 ('Spyder', 837),
 ('TextMate', 299)]

- The following columns have the exact same workflow as `LanguageHaveWorkedWith`
  - LanguageWantToWorkWith
  - DatabaseHaveWorkedWith
  - DatabaseWantToWorkWith
  - PlatformHaveWorkedWith
  - PlatformWantToWorkWith
  - WebframeHaveWorkedWith
  - WebframeWantToWorkWith
  - MiscTechHaveWorkedWith
  - MiscTechWantToWorkWith
  - ToolsTechHaveWorkedWith
  - ToolsTechWantToWorkWith
  - NEWCollabToolsHaveWorkedWith
  - NEWCollabToolsWantToWorkWith

### OpSysProfessional use 

In [143]:
df["OpSysProfessional use"]

0                        NaN
1                      macOS
2                    Windows
3                    Windows
4                    Windows
                ...         
73263                  macOS
73264    Linux-based;Windows
73265                Windows
73266                Windows
73267    Linux-based;Windows
Name: OpSysProfessional use, Length: 73268, dtype: object

In [144]:
df["OpSysProfessional use"].unique()

array([nan, 'macOS', 'Windows', 'Linux-based;macOS',
       'Windows;Windows Subsystem for Linux (WSL)',
       'Linux-based;macOS;Windows', 'Linux-based;Windows',
       'Windows Subsystem for Linux (WSL)', 'macOS;Windows',
       'Linux-based', 'Linux-based;Windows Subsystem for Linux (WSL)',
       'Linux-based;Windows;Windows Subsystem for Linux (WSL)',
       'Linux-based;macOS;Windows;Windows Subsystem for Linux (WSL)',
       'Other (please specify):',
       'BSD;Linux-based;Windows;Windows Subsystem for Linux (WSL)',
       'macOS;Other (please specify):',
       'macOS;Windows;Windows Subsystem for Linux (WSL)',
       'macOS;Windows Subsystem for Linux (WSL)',
       'Windows;Other (please specify):', 'BSD;Linux-based;macOS;Windows',
       'BSD;Linux-based',
       'BSD;Linux-based;Windows;Other (please specify):',
       'Linux-based;macOS;Windows Subsystem for Linux (WSL)',
       'BSD;Linux-based;Windows Subsystem for Linux (WSL)',
       'BSD;Linux-based;macOS',
       

In [145]:
col_value_counts("OpSysProfessional use")#.shape

Unnamed: 0_level_0,count,pct
OpSysProfessional use,Unnamed: 1_level_1,Unnamed: 2_level_1
Windows,16645,0.25411
macOS,12541,0.191457
Linux-based,10934,0.166924
Linux-based;Windows,5637,0.086057
Linux-based;macOS,4906,0.074897
Windows;Windows Subsystem for Linux (WSL),3800,0.058013
Linux-based;Windows;Windows Subsystem for Linux (WSL),3003,0.045845
macOS;Windows,2327,0.035525
Linux-based;macOS;Windows,1551,0.023678
Linux-based;macOS;Windows;Windows Subsystem for Linux (WSL),897,0.013694


In [146]:
col_strip_nested("OpSysProfessional use")

[('Windows', 34905),
 ('Linux-based', 28523),
 ('macOS', 23578),
 ('Windows Subsystem for Linux (WSL)', 10252),
 ('BSD', 737),
 ('Other (please specify):', 284)]

In [147]:
(df["OpSysProfessional use"]
 .str.replace("Linux-based", "Linux", regex=True)
 .str.replace("Windows Subsystem for Linux \(WSL\)", "WSL", regex=True)
 .str.replace("Other \(please specify\)\:", "Other", regex=True)
 .str.lower()
 .str.replace("macos", "mac_os"))

0                  NaN
1               mac_os
2              windows
3              windows
4              windows
             ...      
73263           mac_os
73264    linux;windows
73265          windows
73266          windows
73267    linux;windows
Name: OpSysProfessional use, Length: 73268, dtype: object

#### Observations:
- There're 55 unique entries
- Entries are nested, separated by a semicolon


#### Steps:
- Multiple binary columns can be created based on the languages known
- Can create a column, `Num of operating systems used`
- These steps should be kept in mind for performing feature engineering during exploratory analysis

### OpSysPersonal use

In [148]:
df["OpSysPersonal use"]

0                                      NaN
1        Windows Subsystem for Linux (WSL)
2                                  Windows
3                                  Windows
4                            macOS;Windows
                       ...                
73263                    Linux-based;macOS
73264                  Linux-based;Windows
73265                              Windows
73266                              Windows
73267                  Linux-based;Windows
Name: OpSysPersonal use, Length: 73268, dtype: object

In [149]:
col_strip_nested("OpSysPersonal use")

[('Windows', 44567),
 ('Linux-based', 28765),
 ('macOS', 22217),
 ('Windows Subsystem for Linux (WSL)', 10724),
 ('BSD', 1054),
 ('Other (please specify):', 349)]

- Same steps as `OpSysProfessional use`

### VersionControlSystem

In [150]:
col_description("VersionControlSystem")

What are the primary <b>version control systems</b> you use? Select all that apply.


In [151]:
df.VersionControlSystem

0                                NaN
1                                Git
2                                Git
3                                Git
4        Git;Other (please specify):
                    ...             
73263                            Git
73264                            Git
73265                            Git
73266                            SVN
73267                            Git
Name: VersionControlSystem, Length: 73268, dtype: object

In [152]:
df.VersionControlSystem.unique()

array([nan, 'Git', 'Git;Other (please specify):', 'Mercurial;SVN',
       "I don't use one", 'Git;SVN', 'SVN', 'Other (please specify):',
       'Git;Mercurial', 'Git;Other (please specify):;SVN', 'Mercurial',
       'Git;Other (please specify):;Mercurial', 'Git;Mercurial;SVN',
       'Other (please specify):;SVN',
       'Git;Other (please specify):;Mercurial;SVN',
       'Other (please specify):;Mercurial',
       'Other (please specify):;Mercurial;SVN'], dtype=object)

In [153]:
col_value_counts("VersionControlSystem")#.shape

Unnamed: 0_level_0,count,pct
VersionControlSystem,Unnamed: 1_level_1,Unnamed: 2_level_1
Git,62055,0.869373
I don't use one,3080,0.04315
Git;SVN,2858,0.04004
Git;Other (please specify):,1356,0.018997
SVN,590,0.008266
Other (please specify):,523,0.007327
Git;Mercurial,498,0.006977
Mercurial,134,0.001877
Git;Mercurial;SVN,104,0.001457
Git;Other (please specify):;SVN,83,0.001163


In [154]:
col_strip_nested("VersionControlSystem")

[('Git', 67006),
 ('SVN', 3700),
 ("I don't use one", 3080),
 ('Other (please specify):', 2047),
 ('Mercurial', 808)]

In [155]:
(df
 .VersionControlSystem
 .fillna("unanswered")
 .str.replace("Other \(please specify\):", "other", regex=True)
 .str.replace("I don't use one", "none")
 .str.lower()
 .unique())

array(['unanswered', 'git', 'git;other', 'mercurial;svn', 'none',
       'git;svn', 'svn', 'other', 'git;mercurial', 'git;other;svn',
       'mercurial', 'git;other;mercurial', 'git;mercurial;svn',
       'other;svn', 'git;other;mercurial;svn', 'other;mercurial',
       'other;mercurial;svn'], dtype=object)

#### Observations
- There're 16 unique entries
- Git is most widely used version control system

#### Steps
- Will create a binary column `use_git`

### VCInteraction

In [156]:
col_description("VCInteraction")

How do you interact with your version control system? Select all that apply.


In [157]:
df.VCInteraction

0                                                      NaN
1                                                      NaN
2                                              Code editor
3        Code editor;Command-line;Version control hosti...
4                                              Code editor
                               ...                        
73263                             Code editor;Command-line
73264                             Code editor;Command-line
73265    Code editor;Command-line;Version control hosti...
73266            Dedicated version control GUI application
73267    Code editor;Command-line;Version control hosti...
Name: VCInteraction, Length: 73268, dtype: object

In [158]:
col_value_counts("VCInteraction")

Unnamed: 0_level_0,count,pct
VCInteraction,Unnamed: 1_level_1,Unnamed: 2_level_1
Command-line,17602,0.25826
Code editor;Command-line,16502,0.242121
Code editor;Command-line;Version control hosting service web GUI,7288,0.106931
Code editor;Command-line;Version control hosting service web GUI;Dedicated version control GUI application,4420,0.064851
Command-line;Dedicated version control GUI application,3841,0.056356
Code editor,3762,0.055197
Dedicated version control GUI application,3373,0.049489
Command-line;Version control hosting service web GUI,3363,0.049343
Code editor;Command-line;Dedicated version control GUI application,2665,0.039101
Command-line;Version control hosting service web GUI;Dedicated version control GUI application,1280,0.01878


In [159]:
col_strip_nested("VCInteraction")

[('Command-line', 56961),
 ('Code editor', 37137),
 ('Version control hosting service web GUI', 19382),
 ('Dedicated version control GUI application', 17976)]

In [160]:
(df
 .VCInteraction
 .fillna("unanswered")
 .str.replace("Version control hosting service web GUI", "vc hosting service web gui")
 .str.replace("Dedicated version control GUI application", "ded vc gui app")
 .str.lower()
 .str.replace(" ", "_")
 .str.replace("-", "_")
 .unique())

array(['unanswered', 'code_editor',
       'code_editor;command_line;vc_hosting_service_web_gui;ded_vc_gui_app',
       'command_line;vc_hosting_service_web_gui;ded_vc_gui_app',
       'code_editor;command_line', 'command_line',
       'command_line;ded_vc_gui_app',
       'vc_hosting_service_web_gui;ded_vc_gui_app',
       'code_editor;ded_vc_gui_app',
       'code_editor;command_line;ded_vc_gui_app',
       'command_line;vc_hosting_service_web_gui',
       'code_editor;command_line;vc_hosting_service_web_gui',
       'code_editor;vc_hosting_service_web_gui',
       'vc_hosting_service_web_gui', 'ded_vc_gui_app',
       'code_editor;vc_hosting_service_web_gui;ded_vc_gui_app'],
      dtype=object)

### VCHostingPersonal use

In [161]:
df["VCHostingPersonal use"].isna().all()

True

### VCHostingProfessional use

In [162]:
df["VCHostingProfessional use"].isna().all()

True

- All entries in `VCHostingPersonal use` and `VCHostingProfessional use` are `nan`
- These columns will be dropped

### OfficeStackAsyncHaveWorkedWith

In [163]:
df.OfficeStackAsyncHaveWorkedWith

0                                                 NaN
1                                                 NaN
2                                                 NaN
3                         Jira Work Management;Trello
4                                                 NaN
                             ...                     
73263                            Jira Work Management
73264                                             NaN
73265                                 Microsoft Lists
73266                                             NaN
73267    Confluence;Jira Work Management;Trello;Wrike
Name: OfficeStackAsyncHaveWorkedWith, Length: 73268, dtype: object

In [164]:
col_value_counts("OfficeStackAsyncHaveWorkedWith")

Unnamed: 0_level_0,count,pct
OfficeStackAsyncHaveWorkedWith,Unnamed: 1_level_1,Unnamed: 2_level_1
Confluence;Jira Work Management,7084,0.153257
Jira Work Management,6059,0.131082
Trello,4953,0.107154
Confluence,3791,0.082015
Notion,2577,0.055751
...,...,...
Adobe Workfront;Airtable;Asana;Confluence;Jira Work Management;Microsoft Planner;Smartsheet;Trello,1,0.000022
Asana;DingTalk (Teambition);Microsoft Planner;monday.com,1,0.000022
ClickUp;monday.com;Notion;Stack Overflow for Teams,1,0.000022
Airtable;Asana;Confluence;Jira Work Management;Microsoft Lists;monday.com;Trello,1,0.000022


In [165]:
col_strip_nested("OfficeStackAsyncHaveWorkedWith")

[('Jira Work Management', 24234),
 ('Confluence', 19496),
 ('Trello', 16324),
 ('Notion', 9711),
 ('Asana', 3874),
 ('ClickUp', 2704),
 ('Microsoft Planner', 2282),
 ('Stack Overflow for Teams', 1804),
 ('monday.com', 1639),
 ('Airtable', 1438),
 ('Microsoft Lists', 973),
 ('Smartsheet', 654),
 ('Wrike', 417),
 ('Adobe Workfront', 368),
 ('DingTalk (Teambition)', 227),
 ('Swit', 131),
 ('Workzone', 112),
 ('Planview Projectplace or Clarizen', 84),
 ('Cerri', 61),
 ('Wimi', 61),
 ('Leankor', 57)]

In [166]:
(df
 .OfficeStackAsyncHaveWorkedWith
 .fillna("unanswered")
 .str.replace("Stack Overflow for Teams", "Stack Overflow Teams")
 .str.replace("monday\.com", "monday_com", regex=True)
 .str.replace("DingTalk \(Teambition\)", "DingTalk", regex=True)
 .str.replace("Planview Projectplace or Clarizen", "Clarizen")
 .str.lower()
 .str.replace(" ", "_")
 .value_counts())

unanswered                                                                                            27045
confluence;jira_work_management                                                                        7084
jira_work_management                                                                                   6059
trello                                                                                                 4953
confluence                                                                                             3791
                                                                                                      ...  
adobe_workfront;airtable;asana;confluence;jira_work_management;microsoft_planner;smartsheet;trello        1
asana;dingtalk;microsoft_planner;monday_com                                                               1
clickup;monday_com;notion;stack_overflow_teams                                                            1
airtable;asana;confluence;ji

### OfficeStackAsyncWantToWorkWith

In [167]:
col_strip_nested("OfficeStackAsyncWantToWorkWith")

[('Jira Work Management', 14695),
 ('Confluence', 10568),
 ('Trello', 8968),
 ('Notion', 7923),
 ('Stack Overflow for Teams', 2440),
 ('Asana', 1973),
 ('ClickUp', 1808),
 ('Microsoft Planner', 1379),
 ('monday.com', 1024),
 ('Airtable', 997),
 ('Microsoft Lists', 690),
 ('Adobe Workfront', 314),
 ('Smartsheet', 307),
 ('Wrike', 211),
 ('DingTalk (Teambition)', 156),
 ('Workzone', 136),
 ('Swit', 134),
 ('Planview Projectplace or Clarizen', 81),
 ('Leankor', 77),
 ('Cerri', 75),
 ('Wimi', 73)]

### OfficeStackSyncHaveWorkedWith

In [168]:
df.OfficeStackSyncHaveWorkedWith

0                               NaN
1                               NaN
2                   Microsoft Teams
3                        Slack;Zoom
4              Microsoft Teams;Zoom
                    ...            
73263                    Slack;Zoom
73264                    Rocketchat
73265          Microsoft Teams;Zoom
73266                          Zoom
73267    Microsoft Teams;Slack;Zoom
Name: OfficeStackSyncHaveWorkedWith, Length: 73268, dtype: object

In [169]:
col_strip_nested("OfficeStackSyncHaveWorkedWith")

[('Zoom', 36153),
 ('Microsoft Teams', 36097),
 ('Slack', 34440),
 ('Google Chat', 13019),
 ('Cisco Webex Teams', 6238),
 ('Mattermost', 2603),
 ('Rocketchat', 1438),
 ('RingCentral', 560),
 ('Symphony', 359),
 ('Wire', 283),
 ('Wickr', 189),
 ('Unify Circuit', 123),
 ('Coolfire Core', 89)]

In [170]:
(df
 .OfficeStackSyncHaveWorkedWith
 .fillna("unanswered")
 .str.replace("Rocketchat", "Rocket chat")
 .str.replace("RingCentral", "Ring Central")
 .str.lower()
 .str.replace(" ", "_")
 .value_counts())

unanswered                                                                                        11140
microsoft_teams                                                                                    9257
slack;zoom                                                                                         7722
microsoft_teams;slack;zoom                                                                         6068
slack                                                                                              5334
                                                                                                  ...  
cisco_webex_teams;coolfire_core;google_chat;ring_central;rocket_chat;slack;symphony;wickr;zoom        1
coolfire_core;google_chat;microsoft_teams;slack;zoom                                                  1
google_chat;microsoft_teams;ring_central;rocket_chat;slack;zoom                                       1
coolfire_core;google_chat;mattermost;microsoft_teams            

### OfficeStackSyncWantToWorkWith

In [171]:
col_strip_nested("OfficeStackSyncWantToWorkWith")

[('Slack', 27844),
 ('Microsoft Teams', 19140),
 ('Zoom', 18042),
 ('Google Chat', 7655),
 ('Mattermost', 1972),
 ('Cisco Webex Teams', 1944),
 ('Rocketchat', 973),
 ('Wire', 301),
 ('Symphony', 268),
 ('RingCentral', 215),
 ('Wickr', 183),
 ('Unify Circuit', 118),
 ('Coolfire Core', 102)]

- The following columns have the exact same workflow as `LanguageHaveWorkedWith`
  - OfficeStackAsyncHaveWorkedWith
  - OfficeStackAsyncWantToWorkWith
  - OfficeStackSyncHaveWorkedWith
  - OfficeStackSyncWantToWorkWith

### Blockchain

In [172]:
col_description("Blockchain")

How favorable are you about blockchain, crypto, and decentralization?


In [173]:
df.Blockchain

0                     NaN
1        Very unfavorable
2        Very unfavorable
3        Very unfavorable
4             Unfavorable
               ...       
73263      Very favorable
73264              Unsure
73265    Very unfavorable
73266         Indifferent
73267           Favorable
Name: Blockchain, Length: 73268, dtype: object

In [174]:
df.Blockchain.unique()

array([nan, 'Very unfavorable', 'Unfavorable', 'Favorable',
       'Very favorable', 'Indifferent', 'Unsure'], dtype=object)

In [175]:
col_value_counts("Blockchain")

Unnamed: 0_level_0,count,pct
Blockchain,Unnamed: 1_level_1,Unnamed: 2_level_1
Indifferent,18331,0.257925
Favorable,14629,0.205836
Very unfavorable,11625,0.163569
Unfavorable,10549,0.148429
Unsure,8128,0.114365
Very favorable,7809,0.109876


#### Observations
- There're 6 unique entries
- The entries are clean and valid


#### Steps
- No cleaning required
- The type can be made `category` for less memory usage

### NEWSOSites

In [176]:
col_description("NEWSOSites")

Which of the following Stack Overflow sites have you visited? Select all that apply.


In [177]:
df.NEWSOSites

0                                                      NaN
1        Collectives on Stack Overflow;Stack Overflow f...
2        Collectives on Stack Overflow;Stack Overflow;S...
3        Collectives on Stack Overflow;Stack Overflow f...
4        Collectives on Stack Overflow;Stack Overflow f...
                               ...                        
73263                        Stack Overflow;Stack Exchange
73264                                       Stack Overflow
73265                        Stack Overflow;Stack Exchange
73266                                       Stack Overflow
73267    Collectives on Stack Overflow;Stack Overflow;S...
Name: NEWSOSites, Length: 73268, dtype: object

In [178]:
col_value_counts("NEWSOSites")#.shape

Unnamed: 0_level_0,count,pct
NEWSOSites,Unnamed: 1_level_1,Unnamed: 2_level_1
Stack Overflow;Stack Exchange,41859,0.586548
Stack Overflow,19229,0.269446
Collectives on Stack Overflow;Stack Overflow;Stack Exchange,3814,0.053444
Stack Overflow for Teams (private knowledge sharing & collaboration platform for companies);Stack Overflow;Stack Exchange,2100,0.029426
Collectives on Stack Overflow;Stack Overflow for Teams (private knowledge sharing & collaboration platform for companies);Stack Overflow;Stack Exchange,1124,0.01575
Collectives on Stack Overflow;Stack Overflow,992,0.0139
Stack Overflow for Teams (private knowledge sharing & collaboration platform for companies);Stack Overflow,589,0.008253
Collectives on Stack Overflow,542,0.007595
I have never visited Stack Overflow or the Stack Exchange network,461,0.00646
Stack Exchange,219,0.003069


In [179]:
col_strip_nested("NEWSOSites")

[('Stack Overflow', 69879),
 ('Stack Exchange', 49216),
 ('Collectives on Stack Overflow', 6783),
 ('Stack Overflow for Teams (private knowledge sharing & collaboration platform for companies)',
  4178),
 ('I have never visited Stack Overflow or the Stack Exchange network', 461)]

In [180]:
(df
 .NEWSOSites
 .str.replace("Collectives on Stack Overflow", "collectives")
 .str.replace("Stack Overflow for Teams \(private knowledge sharing & collaboration platform for companies\)", "stack overflow teams", regex=True)
 .str.replace("I have never visited Stack Overflow or the Stack Exchange network", "none", regex=True)
 .str.lower()
 .str.replace(" ", "_")
 .value_counts())

stack_overflow;stack_exchange                                     41859
stack_overflow                                                    19229
collectives;stack_overflow;stack_exchange                          3814
stack_overflow_teams;stack_overflow;stack_exchange                 2100
collectives;stack_overflow_teams;stack_overflow;stack_exchange     1124
collectives;stack_overflow                                          992
stack_overflow_teams;stack_overflow                                 589
collectives                                                         542
none                                                                461
stack_exchange                                                      219
collectives;stack_overflow_teams;stack_overflow                     172
stack_overflow_teams                                                110
collectives;stack_exchange                                           71
collectives;stack_overflow_teams                                

#### Observations
- There're 16 unique entries
- Most of the entries entries occur in less than 1% of the total observations
- Entries are nested, separated by a semicolon


#### Steps
- The entries can be shortened
- Can create a column based on no. of sites visited
- Multiple binary columns can be created
- These steps should be kept in mind for performing feature engineering during exploratory analysis

### SOVisitFreq

In [181]:
col_description("SOVisitFreq")

How frequently would you say you visit Stack Overflow?


In [182]:
df.SOVisitFreq

0                           NaN
1         Daily or almost daily
2        Multiple times per day
3         Daily or almost daily
4        Multiple times per day
                  ...          
73263     Daily or almost daily
73264     Daily or almost daily
73265    Multiple times per day
73266     Daily or almost daily
73267     Daily or almost daily
Name: SOVisitFreq, Length: 73268, dtype: object

In [183]:
col_value_counts("SOVisitFreq")

Unnamed: 0_level_0,count,pct
SOVisitFreq,Unnamed: 1_level_1,Unnamed: 2_level_1
Daily or almost daily,21712,0.305971
A few times per week,19770,0.278604
Multiple times per day,15965,0.224983
A few times per month or weekly,11185,0.157622
Less than once per month or monthly,2329,0.032821


In [184]:
(df
 .SOVisitFreq
 .dropna()
 .replace(["Daily or almost daily",
           "Multiple times per day",
           "A few times per week",
           "A few times per month or weekly",
           "Less than once per month or monthly"],
          ["daily",
           "daily",
           "weekly",
           "weekly",
           "monthly"])
 .astype("category")
 .value_counts())

daily      37677
weekly     30955
monthly     2329
Name: SOVisitFreq, dtype: int64

#### Observations
- There're 5 unique entries
- the entries are unnecessarily lengthy


#### Steps
- The entries can be shortened
- The type can be made `category` for less memory usage

### SOAccount

In [185]:
col_description("SOAccount")

Do you have a Stack Overflow account?


In [186]:
df.SOAccount

0                            NaN
1                            Yes
2                            Yes
3                            Yes
4                            Yes
                  ...           
73263                        Yes
73264    Not sure/can't remember
73265                        Yes
73266                        Yes
73267                        Yes
Name: SOAccount, Length: 73268, dtype: object

In [187]:
col_value_counts("SOAccount")

Unnamed: 0_level_0,count,pct
SOAccount,Unnamed: 1_level_1,Unnamed: 2_level_1
Yes,58519,0.817624
No,8951,0.125063
Not sure/can't remember,4102,0.057313


In [188]:
(df
 .SOAccount
 .replace("Not sure/can't remember", "unsure"))

0           NaN
1           Yes
2           Yes
3           Yes
4           Yes
          ...  
73263       Yes
73264    unsure
73265       Yes
73266       Yes
73267       Yes
Name: SOAccount, Length: 73268, dtype: object

#### Observations
- There're 3 unique entries
- The entries are clean and valid


#### Steps
- No cleaning required
- Just one category will be renamed for clarity
- The type can be made `category` for less memory usage

### SOPartFreq

In [189]:
col_description("SOPartFreq")

How frequently would you say you participate in Q&amp;A on Stack Overflow? By participate we mean ask, answer, vote for, or comment on questions.


In [190]:
df.SOPartFreq

0                                                      NaN
1                                    Daily or almost daily
2                                   Multiple times per day
3                                     A few times per week
4                                    Daily or almost daily
                               ...                        
73263                      A few times per month or weekly
73264                                                  NaN
73265                  Less than once per month or monthly
73266    I have never participated in Q&A on Stack Over...
73267                  Less than once per month or monthly
Name: SOPartFreq, Length: 73268, dtype: object

In [191]:
col_value_counts("SOPartFreq")

Unnamed: 0_level_0,count,pct
SOPartFreq,Unnamed: 1_level_1,Unnamed: 2_level_1
Less than once per month or monthly,26846,0.461042
I have never participated in Q&A on Stack Overflow,13498,0.231809
A few times per month or weekly,10559,0.181336
A few times per week,4433,0.07613
Daily or almost daily,1881,0.032303
Multiple times per day,1012,0.01738


In [192]:
(df
 .SOPartFreq
 .replace(["Daily or almost daily",
           "Multiple times per day",
           "A few times per week",
           "A few times per month or weekly",
           "Less than once per month or monthly",
           "I have never participated in Q&A on Stack Overflow"],
          ["daily",
           "daily",
           "weekly",
           "weekly",
           "monthly",
           "never"])
 .astype("category")
 .value_counts())

monthly    26846
weekly     14992
never      13498
daily       2893
Name: SOPartFreq, dtype: int64

#### Observations:
- There're 6 unique entries
- The entries are clean and valid


#### Steps:
- Some categories will be renamed for clarity
- The type can be made `category` for less memory usage

### SOComm

In [193]:
col_description("SOComm")

Do you consider yourself a member of the Stack Overflow community?


In [194]:
df.SOComm

0                    NaN
1               Not sure
2                Neutral
3        Yes, definitely
4        Yes, definitely
              ...       
73263    Yes, definitely
73264            Neutral
73265      Yes, somewhat
73266     No, not at all
73267      Yes, somewhat
Name: SOComm, Length: 73268, dtype: object

In [195]:
col_value_counts("SOComm")

Unnamed: 0_level_0,count,pct
SOComm,Unnamed: 1_level_1,Unnamed: 2_level_1
"Yes, somewhat",19674,0.275515
"No, not really",18728,0.262268
Neutral,14929,0.209066
"Yes, definitely",10381,0.145376
"No, not at all",6456,0.09041
Not sure,1240,0.017365


In [196]:
(df
 .SOComm
 .replace(["Yes, somewhat",
           "Yes, definitely",
           "No, not at all",
           "No, not really",
           "Not sure"],
          ["yes",
           "yes",
           "no",
           "no",
           "unsure"])
 .str.lower()
 .astype("category")
 .value_counts())

yes        30055
no         25184
neutral    14929
unsure      1240
Name: SOComm, dtype: int64

#### Observations:
- There're 6 unique entries
- The entries are clean and valid


#### Steps:
- Some categories will be renamed for clarity
- The type can be made `category` for less memory usage

### Age

In [197]:
df.Age

0                    NaN
1                    NaN
2        25-34 years old
3        35-44 years old
4        25-34 years old
              ...       
73263    25-34 years old
73264    25-34 years old
73265    55-64 years old
73266    55-64 years old
73267    25-34 years old
Name: Age, Length: 73268, dtype: object

In [198]:
df.Age.unique()

array([nan, '25-34 years old', '35-44 years old', 'Under 18 years old',
       '18-24 years old', '45-54 years old', '55-64 years old',
       '65 years or older', 'Prefer not to say'], dtype=object)

In [199]:
col_value_counts("Age")

Unnamed: 0_level_0,count,pct
Age,Unnamed: 1_level_1,Unnamed: 2_level_1
25-34 years old,28112,0.396245
18-24 years old,16646,0.234629
35-44 years old,13988,0.197164
45-54 years old,5281,0.074437
Under 18 years old,3866,0.054492
55-64 years old,1978,0.02788
65 years or older,554,0.007809
Prefer not to say,521,0.007344


In [200]:
(df
 .Age
 .rename("age_group")
 .replace({"Under 18 years old": "minor",
           "18-24 years old": "young_adult",
           "25-34 years old": "young_adult",
           "35-44 years old": "middle_aged",
           "45-54 years old": "middle_aged",
           "55-64 years old": "senior",
           "65 years or older": "senior",
           "Prefer not to say": "unanswered"})
 .astype("category"))

0                NaN
1                NaN
2        young_adult
3        middle_aged
4        young_adult
            ...     
73263    young_adult
73264    young_adult
73265         senior
73266         senior
73267    young_adult
Name: age_group, Length: 73268, dtype: category
Categories (5, object): ['middle_aged', 'minor', 'senior', 'unanswered', 'young_adult']

#### Observations:
- There're 8 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- The type can be made `category` for less memory usage

### Gender

In [201]:
df.Gender

0        NaN
1        NaN
2        Man
3        Man
4        NaN
        ... 
73263    Man
73264    Man
73265    Man
73266    Man
73267    Man
Name: Gender, Length: 73268, dtype: object

In [202]:
df.Gender.unique()

array([nan, 'Man', 'Or, in your own words:', 'Woman',
       'Non-binary, genderqueer, or gender non-conforming',
       'Prefer not to say',
       'Man;Non-binary, genderqueer, or gender non-conforming',
       'Or, in your own words:;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Non-binary, genderqueer, or gender non-conforming',
       'Man;Woman', 'Man;Or, in your own words:',
       'Or, in your own words:;Woman;Non-binary, genderqueer, or gender non-conforming',
       'Man;Woman;Non-binary, genderqueer, or gender non-conforming',
       'Or, in your own words:;Woman',
       'Man;Or, in your own words:;Woman;Non-binary, genderqueer, or gender non-conforming',
       'Man;Or, in your own words:;Non-binary, genderqueer, or gender non-conforming',
       'Man;Or, in your own words:;Woman'], dtype=object)

In [203]:
col_value_counts("Gender")#.shape

Unnamed: 0_level_0,count,pct
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Man,64607,0.911846
Woman,3399,0.047973
Prefer not to say,1172,0.016541
"Non-binary, genderqueer, or gender non-conforming",704,0.009936
"Or, in your own words:",279,0.003938
"Man;Non-binary, genderqueer, or gender non-conforming",235,0.003317
"Man;Or, in your own words:",171,0.002413
"Woman;Non-binary, genderqueer, or gender non-conforming",160,0.002258
"Man;Woman;Non-binary, genderqueer, or gender non-conforming",31,0.000438
Man;Woman,24,0.000339


In [204]:
col_strip_nested("Gender")

[('Man', 65097),
 ('Woman', 3662),
 ('Non-binary, genderqueer, or gender non-conforming', 1186),
 ('Prefer not to say', 1172),
 ('Or, in your own words:', 521)]

In [205]:
(df
 .Gender
 .fillna("unanswered")
 .str.replace("Non-binary, genderqueer, or gender non-conforming", "non_binary")
 .str.replace("Or, in your own words:", "other")
 .str.replace("Prefer not to say", "unanswered")
 .str.replace("Man", "male")
 .str.replace("Woman", "female")
 .value_counts())

male                            64607
unanswered                       3587
female                           3399
non_binary                        704
other                             279
male;non_binary                   235
male;other                        171
female;non_binary                 160
male;female;non_binary             31
male;female                        24
male;other;female;non_binary       18
other;female;non_binary            15
other;non_binary                   14
other;female                       13
male;other;non_binary               9
male;other;female                   2
Name: Gender, dtype: int64

#### Observations:
- There're 16 unique entries
- The entries are nested


#### Steps:
- The categories will be renamed for clarity
- The type can be made `category` for less memory usage

### Trans

In [206]:
col_description("Trans")

Do you identify as transgender?


In [207]:
df.Trans

0        NaN
1        NaN
2         No
3         No
4        NaN
        ... 
73263     No
73264     No
73265     No
73266     No
73267     No
Name: Trans, Length: 73268, dtype: object

In [208]:
df.Trans.unique()

array([nan, 'No', 'Or, in your own words:', 'Yes', 'Prefer not to say'],
      dtype=object)

In [209]:
col_value_counts("Trans")

Unnamed: 0_level_0,count,pct
Trans,Unnamed: 1_level_1,Unnamed: 2_level_1
No,67392,0.95843
Prefer not to say,1379,0.019612
Yes,1064,0.015132
"Or, in your own words:",480,0.006826


In [210]:
(df
 .Trans
 .str.lower()
 .replace({"prefer not to say": "unanswered",
           "or, in your own words:": "yes"})
 .astype("category"))

0        NaN
1        NaN
2         no
3         no
4        NaN
        ... 
73263     no
73264     no
73265     no
73266     no
73267     no
Name: Trans, Length: 73268, dtype: category
Categories (3, object): ['no', 'unanswered', 'yes']

#### Observations:
- There're 4 unique entries
- The entries are clean and valid
- No cleaning required


#### Steps:
- The type can be made `category` for less memory usage

### Sexuality

In [211]:
col_description("Sexuality")

Which of the following describe you, if any? Please check all that apply.


In [212]:
df.Sexuality

0                            NaN
1                            NaN
2                       Bisexual
3        Straight / Heterosexual
4                            NaN
                  ...           
73263    Straight / Heterosexual
73264    Straight / Heterosexual
73265    Straight / Heterosexual
73266    Straight / Heterosexual
73267    Straight / Heterosexual
Name: Sexuality, Length: 73268, dtype: object

In [213]:
col_value_counts("Sexuality")#.shape

Unnamed: 0_level_0,count,pct
Sexuality,Unnamed: 1_level_1,Unnamed: 2_level_1
Straight / Heterosexual,55238,0.829835
Prefer not to say,4350,0.06535
Bisexual,2700,0.040562
Gay or Lesbian,1382,0.020762
Prefer to self-describe:,1079,0.01621
Queer,394,0.005919
Bisexual;Straight / Heterosexual,354,0.005318
Bisexual;Queer,282,0.004236
Straight / Heterosexual;Prefer to self-describe:,169,0.002539
Gay or Lesbian;Queer,150,0.002253


In [214]:
col_strip_nested("Sexuality")

[('Straight / Heterosexual', 55975),
 ('Prefer not to say', 4350),
 ('Bisexual', 3626),
 ('Gay or Lesbian', 1778),
 ('Prefer to self-describe:', 1429),
 ('Queer', 1131)]

In [215]:
(df
 .Sexuality
 .fillna("unanswered")
 .str.replace("Straight / Heterosexual", "Heterosexual")
 .str.replace("Prefer not to say", "unanswered")
 .str.replace("Gay or Lesbian", "Homosexual")
 .str.replace("Prefer to self-describe:", "other")
 .str.lower()
 .value_counts())

heterosexual                                    55238
unanswered                                      11053
bisexual                                         2700
homosexual                                       1382
other                                            1079
queer                                             394
bisexual;heterosexual                             354
bisexual;queer                                    282
heterosexual;other                                169
homosexual;queer                                  150
heterosexual;queer                                 79
bisexual;homosexual                                73
other;queer                                        51
bisexual;homosexual;queer                          44
bisexual;heterosexual;homosexual;queer             36
bisexual;other                                     30
bisexual;other;queer                               26
bisexual;heterosexual;other;homosexual;queer       25
bisexual;heterosexual;homose

#### Observations:
- There're 32 unique entries
- The entries are nested


#### Steps:
- The categories will be renamed for clarity
- The entries having more than 1 value will be renamed `lgbqt`
- The entries having `Prefer to self-describe` will be renamed `lgbqt`
- The type can be made `category` for less memory usage

### Ethnicity

In [216]:
col_description("Ethnicity")

Which of the following describe you, if any? Please check all that apply.


In [217]:
df.Ethnicity

0                           NaN
1                           NaN
2                         White
3                         White
4                           NaN
                  ...          
73263                   African
73264                     White
73265               Multiracial
73266                  European
73267    Or, in your own words:
Name: Ethnicity, Length: 73268, dtype: object

In [218]:
(df
 .Ethnicity
 .unique())

array([nan, 'White', 'Or, in your own words:', ...,
       'White;European;North American;Middle Eastern;Asian;Multiracial',
       'White;Middle Eastern;Central American;Hispanic or Latino/a',
       'White;European;North African;Hispanic or Latino/a'], dtype=object)

In [219]:
col_value_counts("Ethnicity")

Unnamed: 0_level_0,count,pct
Ethnicity,Unnamed: 1_level_1,Unnamed: 2_level_1
European,14612,0.210323
White,13633,0.196232
White;European,8694,0.125140
Indian,5240,0.075424
Asian,3346,0.048162
...,...,...
Indian;European;African;Multiracial,1,0.000014
Asian;East Asian;Southeast Asian;Pacific Islander,1,0.000014
Middle Eastern;Asian;Multiracial;Biracial,1,0.000014
"Or, in your own words:;European;Asian",1,0.000014


In [220]:
col_strip_nested("Ethnicity")

[('White', 27360),
 ('European', 25877),
 ('Indian', 6739),
 ('Asian', 6586),
 ('Hispanic or Latino/a', 3967),
 ('Middle Eastern', 2850),
 ('South American', 2624),
 ('North American', 2331),
 ('African', 2294),
 ('South Asian', 1797),
 ('Prefer not to say', 1732),
 ('Southeast Asian', 1618),
 ('Or, in your own words:', 1524),
 ('Multiracial', 1222),
 ('East Asian', 1214),
 ('Black', 1028),
 ('Biracial', 798),
 ("I don't know", 701),
 ('North African', 611),
 ('Caribbean', 460),
 ('Central American', 416),
 ('Central Asian', 397),
 ('Ethnoreligious group', 348),
 ('Indigenous (such as Native American or Indigenous Australian)', 330),
 ('Pacific Islander', 147)]

In [221]:
(df
 .Ethnicity
 .fillna("unanswered")
 .str.replace("Hispanic or Latino/a", "Hispanic")
 .str.replace("Prefer not to say", "unanswered")
 .str.replace("Or, in your own words:", "other")
 .str.replace("I don't know", "unsure")
 .str.replace("Indigenous \(such as Native American or Indigenous Australian\)", "Indigenous", regex=True)
 .str.lower()
 .str.replace(" ", "_")
 .unique()
 .tolist())

['unanswered',
 'white',
 'other',
 'indian',
 'european',
 'white;european',
 'white;north_american',
 'white;european;middle_eastern;ethnoreligious_group',
 'middle_eastern',
 'european;ethnoreligious_group',
 'north_american',
 'african',
 'white;european;asian;east_asian',
 'white;middle_eastern',
 'black;caribbean',
 'asian;east_asian;southeast_asian',
 'asian',
 'central_american',
 'white;european;north_african',
 'hispanic',
 'european;african',
 'european;southeast_asian',
 'white;hispanic',
 'european;hispanic',
 'white;european;middle_eastern',
 'east_asian',
 'white;south_american',
 'african;black',
 'white;hispanic;south_american',
 'african;north_african',
 'indian;asian',
 'asian;south_asian',
 'southeast_asian',
 'unsure',
 'middle_eastern;ethnoreligious_group',
 'white;european;hispanic',
 'north_african',
 'european;north_american;hispanic;multiracial',
 'biracial',
 'white;other;european',
 'other;ethnoreligious_group',
 'white;european;north_american;multiracial;in

#### Observations:
- There're 1055 unique entries
- The entries are nested


#### Steps:
- The categories will be renamed for clarity
- I will create a sum column and binary indicators
- The type can be made `category` for less memory usage

### Accessibility

In [222]:
col_description("Accessibility")

Which of the following describe you, if any? Please check all that apply. 


In [223]:
df.Accessibility

0                      NaN
1                      NaN
2        None of the above
3        None of the above
4                      NaN
               ...        
73263    None of the above
73264    None of the above
73265    None of the above
73266    None of the above
73267    None of the above
Name: Accessibility, Length: 73268, dtype: object

In [224]:
df.Accessibility.unique()

array([nan, 'None of the above', 'Or, in your own words:',
       'I am deaf / hard of hearing', 'Prefer not to say',
       'I am blind / have difficulty seeing',
       'I am unable to / find it difficult to type',
       'I am unable to / find it difficult to walk or stand without assistance',
       'Or, in your own words:;I am blind / have difficulty seeing',
       'I am unable to / find it difficult to type;I am unable to / find it difficult to walk or stand without assistance',
       'I am deaf / hard of hearing;I am unable to / find it difficult to walk or stand without assistance',
       'I am deaf / hard of hearing;I am blind / have difficulty seeing',
       'I am deaf / hard of hearing;I am blind / have difficulty seeing;I am unable to / find it difficult to type;I am unable to / find it difficult to walk or stand without assistance',
       'Or, in your own words:;I am deaf / hard of hearing;I am blind / have difficulty seeing',
       'I am blind / have difficulty seei

In [225]:
col_value_counts("Accessibility")

Unnamed: 0_level_0,count,pct
Accessibility,Unnamed: 1_level_1,Unnamed: 2_level_1
None of the above,63064,0.937838
Prefer not to say,1633,0.024285
I am blind / have difficulty seeing,981,0.014589
"Or, in your own words:",579,0.00861
I am deaf / hard of hearing,436,0.006484
I am unable to / find it difficult to walk or stand without assistance,186,0.002766
I am unable to / find it difficult to type,131,0.001948
I am deaf / hard of hearing;I am blind / have difficulty seeing,48,0.000714
I am deaf / hard of hearing;I am blind / have difficulty seeing;I am unable to / find it difficult to type;I am unable to / find it difficult to walk or stand without assistance,31,0.000461
"Or, in your own words:;I am blind / have difficulty seeing",30,0.000446


In [226]:
col_strip_nested("Accessibility")

[('None of the above', 63064),
 ('Prefer not to say', 1633),
 ('I am blind / have difficulty seeing', 1142),
 ('Or, in your own words:', 650),
 ('I am deaf / hard of hearing', 570),
 ('I am unable to / find it difficult to walk or stand without assistance',
  298),
 ('I am unable to / find it difficult to type', 232)]

In [227]:
(df
 .Accessibility
 .fillna("unanswered")
 .replace({"Prefer not to say": "unanswered",
           "None of the above": "none_of_these"})
 .str.replace("none of the above", "none")
 .str.replace("I am blind / have difficulty seeing", "blind")
 .str.replace("Or, in your own words:", "other")
 .str.replace("I am deaf / hard of hearing", "deaf")
 .str.replace("I am unable to / find it difficult to walk or stand without assistance", "cant_walk")
 .str.replace("I am unable to / find it difficult to type", "cant_type")
 .value_counts())

none_of_these                           63064
unanswered                               7657
blind                                     981
other                                     579
deaf                                      436
cant_walk                                 186
cant_type                                 131
deaf;blind                                 48
deaf;blind;cant_type;cant_walk             31
other;blind                                30
cant_type;cant_walk                        25
deaf;cant_walk                             16
blind;cant_walk                            15
blind;cant_type                            11
other;deaf                                  9
other;deaf;blind;cant_type;cant_walk        9
other;cant_type                             8
other;cant_walk                             7
deaf;cant_type                              6
deaf;blind;cant_walk                        5
deaf;blind;cant_type                        5
other;deaf;blind                  

#### Observations:
- There're 27 unique entries
- The entries are nested


#### Steps:
- The categories will be renamed for clarity
- I will create a sum column and binary indicators
- The type can be made `category` for less memory usage

### MentalHealth

In [228]:
col_description("MentalHealth")

Which of the following describe you, if any? Please check all that apply. 


In [229]:
df.MentalHealth

0                                                      NaN
1                                                      NaN
2        I have a mood or emotional disorder (e.g., dep...
3                                        None of the above
4                                                      NaN
                               ...                        
73263                                    None of the above
73264                                    None of the above
73265                                    None of the above
73266                                    None of the above
73267                                    None of the above
Name: MentalHealth, Length: 73268, dtype: object

In [230]:
col_value_counts("MentalHealth")#.shape

Unnamed: 0_level_0,count,pct
MentalHealth,Unnamed: 1_level_1,Unnamed: 2_level_1
None of the above,46849,0.705058
Prefer not to say,3435,0.051695
"I have a concentration and/or memory disorder (e.g., ADHD, etc.)",2936,0.044186
I have an anxiety disorder,2404,0.036179
"I have a mood or emotional disorder (e.g., depression, bipolar disorder, etc.)",1914,0.028805
"I have a mood or emotional disorder (e.g., depression, bipolar disorder, etc.);I have an anxiety disorder",1495,0.022499
"I have autism / an autism spectrum disorder (e.g. Asperger's, etc.)",1025,0.015426
"I have a mood or emotional disorder (e.g., depression, bipolar disorder, etc.);I have an anxiety disorder;I have a concentration and/or memory disorder (e.g., ADHD, etc.)",995,0.014974
"I have learning differences (e.g., Dyslexic, Dyslexia, etc.)",833,0.012536
"I have a mood or emotional disorder (e.g., depression, bipolar disorder, etc.);I have a concentration and/or memory disorder (e.g., ADHD, etc.)",677,0.010189


In [231]:
col_strip_nested("MentalHealth")

[('None of the above', 46849),
 ('I have a concentration and/or memory disorder (e.g., ADHD, etc.)', 7026),
 ('I have an anxiety disorder', 6848),
 ('I have a mood or emotional disorder (e.g., depression, bipolar disorder, etc.)',
  6449),
 ('Prefer not to say', 3435),
 ("I have autism / an autism spectrum disorder (e.g. Asperger's, etc.)", 2834),
 ('I have learning differences (e.g., Dyslexic, Dyslexia, etc.)', 1840),
 ('Or, in your own words:', 815)]

In [232]:
(df
 .MentalHealth
 .fillna("unanswered")
 .replace({"None of the above": "none_of_these",
           "Prefer not to say": "unanswered"})
 .str.replace("Or, in your own words:", "other")
 .str.replace("I have an anxiety disorder", "anxiety_disorder")
 .str.replace("I have a mood or emotional disorder \(e.g., depression, bipolar disorder, etc.\)", "emotional_disorder", regex=True)
 .str.replace("I have a concentration and/or memory disorder \(e.g., ADHD, etc.\)", "memory_disorder", regex=True)
 .str.replace("I have autism / an autism spectrum disorder \(e.g. Asperger's, etc.\)", "autism", regex=True)
 .str.replace("I have learning differences \(e.g., Dyslexic, Dyslexia, etc.\)", "dyslexia", regex=True)
 .unique()
 .tolist())

['unanswered',
 'emotional_disorder;anxiety_disorder',
 'none_of_these',
 'other',
 'emotional_disorder',
 'emotional_disorder;anxiety_disorder;memory_disorder',
 'memory_disorder;dyslexia',
 'anxiety_disorder',
 'autism',
 'dyslexia',
 'memory_disorder',
 'anxiety_disorder;memory_disorder',
 'memory_disorder;dyslexia;autism',
 'emotional_disorder;memory_disorder;autism',
 'anxiety_disorder;other',
 'emotional_disorder;anxiety_disorder;memory_disorder;autism',
 'emotional_disorder;autism',
 'memory_disorder;autism',
 'other;memory_disorder',
 'emotional_disorder;anxiety_disorder;memory_disorder;dyslexia',
 'emotional_disorder;anxiety_disorder;autism',
 'emotional_disorder;memory_disorder',
 'anxiety_disorder;memory_disorder;dyslexia;autism',
 'anxiety_disorder;memory_disorder;autism',
 'anxiety_disorder;autism',
 'anxiety_disorder;dyslexia',
 'emotional_disorder;anxiety_disorder;dyslexia',
 'other;autism',
 'emotional_disorder;dyslexia',
 'dyslexia;autism',
 'emotional_disorder;anxiety

In [233]:
(df
 .MentalHealth
 .str.lower()
 .replace({np.nan: "unanswered"})
 .loc[lambda ser: ser.str.contains("none")]
 .value_counts())

none of the above    46849
Name: MentalHealth, dtype: int64

#### Observations:
- There're 57 unique entries
- The entries are nested


#### Steps:
- The categories will be renamed for clarity

### TBranch

In [234]:
col_description("TBranch")

<span style="font-size:16px;">Would you like to participate in the Professional Developer Series?</span>


In [235]:
df.TBranch

0        NaN
1         No
2         No
3         No
4         No
        ... 
73263    Yes
73264    Yes
73265    Yes
73266     No
73267    NaN
Name: TBranch, Length: 73268, dtype: object

In [236]:
df.TBranch.unique()

array([nan, 'No', 'Yes'], dtype=object)

In [237]:
col_value_counts("TBranch")

Unnamed: 0_level_0,count,pct
TBranch,Unnamed: 1_level_1,Unnamed: 2_level_1
Yes,37200,0.706284
No,15470,0.293716


#### Observations:
- There're 2 unique entries
- The entries are clean and valid


#### Steps:
- No cleaning required
- The type can be made `category` for less memory usage

### ICorPM

In [238]:
col_description("ICorPM")

Are you an independent contributor or people manager?


In [239]:
df.ICorPM

0                            NaN
1                            NaN
2                            NaN
3                            NaN
4                            NaN
                  ...           
73263    Independent contributor
73264    Independent contributor
73265    Independent contributor
73266                        NaN
73267                        NaN
Name: ICorPM, Length: 73268, dtype: object

In [240]:
df.ICorPM.unique()

array([nan, 'Independent contributor', 'People manager'], dtype=object)

In [241]:
col_value_counts("ICorPM")

Unnamed: 0_level_0,count,pct
ICorPM,Unnamed: 1_level_1,Unnamed: 2_level_1
Independent contributor,30592,0.84315
People manager,5691,0.15685


#### Observations:
- There're 2 unique entries
- The entries are clean and valid


#### Steps:
- No cleaning required
- The type can be made `category` for less memory usage

### WorkExp

In [242]:
col_description("WorkExp")

How many years of working experience do you have?


In [243]:
df.WorkExp

0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
         ... 
73263     5.0
73264     6.0
73265    42.0
73266     NaN
73267     NaN
Name: WorkExp, Length: 73268, dtype: float64

In [244]:
df.WorkExp.describe()

count    36769.000000
mean        10.242378
std          8.706850
min          0.000000
25%          4.000000
50%          8.000000
75%         15.000000
max         50.000000
Name: WorkExp, dtype: float64

In [245]:
(df
 .query("WorkExp == 0")
 .Age
 .unique())

array(['18-24 years old', '25-34 years old', '35-44 years old',
       '65 years or older', nan, 'Under 18 years old',
       'Prefer not to say', '45-54 years old'], dtype=object)

#### Observations:
- The entries are clean and valid


#### Steps:
- No cleaning required

### Knowledge_1

In [246]:
col_description("Knowledge_1")

I have interactions with people outside of my immediate team.


In [247]:
df.Knowledge_1

0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
           ...   
73263       Agree
73264       Agree
73265    Disagree
73266         NaN
73267         NaN
Name: Knowledge_1, Length: 73268, dtype: object

In [248]:
df.Knowledge_1.unique()

array([nan, 'Agree', 'Strongly agree', 'Neither agree nor disagree',
       'Strongly disagree', 'Disagree'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- No cleaning required
- The type can be made `category` for less memory usage

### Knowledge_2

In [249]:
col_description("Knowledge_2")

Knowledge silos prevent me from getting ideas across the organization (i.e., one individual or team has information that isn't shared with others)


In [250]:
df.Knowledge_2

0                               NaN
1                               NaN
2                               NaN
3                               NaN
4                               NaN
                    ...            
73263                      Disagree
73264                         Agree
73265    Neither agree nor disagree
73266                           NaN
73267                           NaN
Name: Knowledge_2, Length: 73268, dtype: object

In [251]:
df.Knowledge_2.unique()

array([nan, 'Disagree', 'Agree', 'Neither agree nor disagree',
       'Strongly agree', 'Strongly disagree'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- No cleaning required
- The type can be made `category` for less memory usage

### Knowledge_3

In [252]:
col_description("Knowledge_3")

I can find up-to-date information within my organization to help me do my job.


In [253]:
df.Knowledge_3.unique()

array([nan, 'Agree', 'Disagree', 'Strongly agree',
       'Neither agree nor disagree', 'Strongly disagree'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- No cleaning required
- The type can be made `category` for less memory usage

### Knowledge_4

In [254]:
col_description("Knowledge_4")

I am able to quickly find answers to my questions with existing tools and resources.


In [255]:
df.Knowledge_4.unique()

array([nan, 'Agree', 'Strongly agree', 'Neither agree nor disagree',
       'Disagree', 'Strongly disagree'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- No cleaning required
- The type can be made `category` for less memory usage

### Knowledge_5

In [256]:
col_description("Knowledge_5")

I know which system or resource to use to find information and answers to questions I have.


In [257]:
df.Knowledge_5.unique()

array([nan, 'Agree', 'Strongly agree', 'Neither agree nor disagree',
       'Disagree', 'Strongly disagree'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- No cleaning required
- The type can be made `category` for less memory usage

### Knowledge_6

In [258]:
col_description("Knowledge_6")

I often find myself answering questions that I’ve already answered before.


In [259]:
df.Knowledge_6.unique()

array([nan, 'Agree', 'Neither agree nor disagree', 'Strongly agree',
       'Disagree', 'Strongly disagree'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- No cleaning required
- The type can be made `category` for less memory usage

### Knowledge_7

In [260]:
col_description("Knowledge_7")

Waiting on answers to questions often causes interruptions and disrupts my workflow.


In [261]:
df.Knowledge_7.unique()

array([nan, 'Disagree', 'Agree', 'Neither agree nor disagree',
       'Strongly agree', 'Strongly disagree'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- No cleaning required
- The type can be made `category` for less memory usage

### Frequency_1

In [262]:
col_description("Frequency_1")

Needing help from people outside of your immediate team?


In [263]:
df.Frequency_1.unique()

array([nan, '3-5 times a week', '10+ times a week', 'Never',
       '1-2 times a week', '6-10 times a week'], dtype=object)

In [264]:
col_value_counts("Frequency_1")

Unnamed: 0_level_0,count,pct
Frequency_1,Unnamed: 1_level_1,Unnamed: 2_level_1
1-2 times a week,21689,0.613186
Never,8754,0.247491
3-5 times a week,3489,0.09864
6-10 times a week,796,0.022504
10+ times a week,643,0.018179


In [265]:
(df
 .Frequency_1
 .str.lower()
 .fillna("unanswered")
 .replace({"1-2 times a week": "rarely",
           "3-5 times a week": "mildly",
           "6-10 times a week": "frequently",
           "10+ times a week": "frequently"})
 .astype("category"))

0        unanswered
1        unanswered
2        unanswered
3        unanswered
4        unanswered
            ...    
73263         never
73264        rarely
73265         never
73266    unanswered
73267    unanswered
Name: Frequency_1, Length: 73268, dtype: category
Categories (5, object): ['frequently', 'mildly', 'never', 'rarely', 'unanswered']

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- Type will be made `category`

### Frequency_2

In [266]:
col_description("Frequency_2")

Interacting with people outside of your immediate team?


In [267]:
df.Frequency_2.unique()

array([nan, '3-5 times a week', '10+ times a week', 'Never',
       '6-10 times a week', '1-2 times a week'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- Type will be made `category`

### Frequency_3

In [268]:
col_description("Frequency_3")

Encountering knowledge silos (where one individual or team has information that's not shared or distributed with other individuals or teams) at work?


In [269]:
df.Frequency_3.unique()

array([nan, 'Never', '3-5 times a week', '1-2 times a week',
       '6-10 times a week', '10+ times a week'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- Type will be made `category`

### TimeSearching

In [270]:
col_description("TimeSearching")

On an average day, how much time do you typically spend searching for answers or solutions to problems you encounter at work? (This includes time spent searching on your own, asking a colleague, and waiting for a response).


In [271]:
df.TimeSearching

0                        NaN
1                        NaN
2                        NaN
3                        NaN
4                        NaN
                ...         
73263    30-60 minutes a day
73264    15-30 minutes a day
73265    30-60 minutes a day
73266                    NaN
73267                    NaN
Name: TimeSearching, Length: 73268, dtype: object

In [272]:
df.TimeSearching.unique()

array([nan, '15-30 minutes a day', '30-60 minutes a day',
       '60-120 minutes a day', 'Less than 15 minutes a day',
       'Over 120 minutes a day'], dtype=object)

In [273]:
col_value_counts("TimeSearching")

Unnamed: 0_level_0,count,pct
TimeSearching,Unnamed: 1_level_1,Unnamed: 2_level_1
30-60 minutes a day,13652,0.377148
15-30 minutes a day,10122,0.279629
60-120 minutes a day,6371,0.176004
Less than 15 minutes a day,3528,0.097464
Over 120 minutes a day,2525,0.069755


In [274]:
(df
 .TimeSearching
 .fillna("unanswered")
 .str.lower()
 .str.replace(" minutes a day", "")
 .replace({"less than 15": "quarter",
           "15-30": "half",
           "30-60": "one",
           "60-120": "two",
           "over_120": "above_two"})
 .astype("category"))

0        unanswered
1        unanswered
2        unanswered
3        unanswered
4        unanswered
            ...    
73263           one
73264          half
73265           one
73266    unanswered
73267    unanswered
Name: TimeSearching, Length: 73268, dtype: category
Categories (6, object): ['half', 'one', 'over 120', 'quarter', 'two', 'unanswered']

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- Type will be made `category`

### TimeAnswering

In [275]:
col_description("TimeAnswering")

On an average day, how much time do you typically spend answering questions you get asked at work?


In [276]:
df.TimeAnswering

0                               NaN
1                               NaN
2                               NaN
3                               NaN
4                               NaN
                    ...            
73263    Less than 15 minutes a day
73264          60-120 minutes a day
73265          60-120 minutes a day
73266                           NaN
73267                           NaN
Name: TimeAnswering, Length: 73268, dtype: object

In [277]:
df.TimeAnswering.unique()

array([nan, 'Over 120 minutes a day', '60-120 minutes a day',
       'Less than 15 minutes a day', '30-60 minutes a day',
       '15-30 minutes a day'], dtype=object)

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- Type will be made `category`

### Onboarding

In [278]:
col_description("Onboarding")

The time it takes to onboard new hires at my company is:


In [279]:
df.Onboarding

0               NaN
1               NaN
2               NaN
3               NaN
4               NaN
            ...    
73263    Just right
73264     Very long
73265    Just right
73266           NaN
73267           NaN
Name: Onboarding, Length: 73268, dtype: object

In [280]:
df.Onboarding.unique()

array([nan, 'Somewhat long', 'Just right', 'Somewhat short', 'Very short',
       'Very long'], dtype=object)

In [281]:
col_value_counts("Onboarding")

Unnamed: 0_level_0,count,pct
Onboarding,Unnamed: 1_level_1,Unnamed: 2_level_1
Somewhat long,12961,0.363267
Just right,12526,0.351075
Somewhat short,4434,0.124275
Very long,4352,0.121977
Very short,1406,0.039407


In [282]:
(df
 .Onboarding
 .fillna("unanswered")
 .str.lower()
 .str.replace(" ", "_")
 .str.replace("somewhat_", "")
 .astype("category")
 .value_counts())

unanswered    37589
long          12961
just_right    12526
short          4434
very_long      4352
very_short     1406
Name: Onboarding, dtype: int64

#### Observations:
- There're 5 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- Type will be made `category`

### ProfessionalTech

In [283]:
col_description("ProfessionalTech")

My company has:


In [284]:
df.ProfessionalTech

0                                                      NaN
1                                                      NaN
2                                                      NaN
3                                                      NaN
4                                                      NaN
                               ...                        
73263    DevOps function;Microservices;Developer portal...
73264                                        None of these
73265                                        None of these
73266                                                  NaN
73267                                                  NaN
Name: ProfessionalTech, Length: 73268, dtype: object

In [285]:
(df
 .ProfessionalTech
 .unique()
 .tolist())

[nan,
 'Innersource initiative;DevOps function;Microservices;Developer portal or other central places to find tools/services;Continuous integration (CI) and (more often) continuous delivery;Automated testing;Observability tools',
 'Innersource initiative;DevOps function;Microservices;Continuous integration (CI) and (more often) continuous delivery;Automated testing;Observability tools',
 'DevOps function;Microservices',
 'Continuous integration (CI) and (more often) continuous delivery;Automated testing',
 'DevOps function;Continuous integration (CI) and (more often) continuous delivery;Automated testing',
 'None of these',
 'DevOps function;Microservices;Continuous integration (CI) and (more often) continuous delivery;Automated testing;Observability tools',
 'Developer portal or other central places to find tools/services;Continuous integration (CI) and (more often) continuous delivery;Automated testing',
 'DevOps function;Continuous integration (CI) and (more often) continuous delive

In [286]:
col_value_counts("ProfessionalTech")

Unnamed: 0_level_0,count,pct
ProfessionalTech,Unnamed: 1_level_1,Unnamed: 2_level_1
None of these,4658,0.133444
DevOps function;Microservices;Developer portal or other central places to find tools/services;Continuous integration (CI) and (more often) continuous delivery;Automated testing;Observability tools,2421,0.069358
Innersource initiative;DevOps function;Microservices;Developer portal or other central places to find tools/services;Continuous integration (CI) and (more often) continuous delivery;Automated testing;Observability tools,2414,0.069157
DevOps function;Microservices;Continuous integration (CI) and (more often) continuous delivery;Automated testing;Observability tools,1915,0.054862
DevOps function;Microservices;Continuous integration (CI) and (more often) continuous delivery;Automated testing,1575,0.045121
...,...,...
Microservices;Continuous integration (CI) and (more often) continuous delivery;None of these,1,0.000029
Innersource initiative;Developer portal or other central places to find tools/services;Automated testing;Observability tools;None of these,1,0.000029
Microservices;Automated testing;None of these,1,0.000029
Developer portal or other central places to find tools/services;Continuous integration (CI) and (more often) continuous delivery;None of these,1,0.000029


In [287]:
col_strip_nested("ProfessionalTech")

[('Continuous integration (CI) and (more often) continuous delivery', 24361),
 ('DevOps function', 20716),
 ('Automated testing', 20278),
 ('Microservices', 17094),
 ('Developer portal or other central places to find tools/services', 13327),
 ('Observability tools', 12941),
 ('Innersource initiative', 5692),
 ('None of these', 4757)]

In [288]:
(df
 .ProfessionalTech
 .fillna("unanswered")
 .str.replace(";None of these", "")
 .str.replace("Continuous integration \(CI\) and \(more often\) continuous delivery", "ci_cd", regex=True)
 .str.replace("DevOps function", "devops")
 .str.replace("Developer portal or other central places to find tools/services", "dev_portal")
 .str.replace(" ", "_")
 .str.lower()
 .value_counts().to_frame())

Unnamed: 0,ProfessionalTech
unanswered,38362
none_of_these,4658
innersource_initiative;devops;microservices;dev_portal;ci_cd;automated_testing;observability_tools,2429
devops;microservices;dev_portal;ci_cd;automated_testing;observability_tools,2421
devops;microservices;ci_cd;automated_testing;observability_tools,1915
...,...
innersource_initiative;devops;observability_tools,8
innersource_initiative;devops;dev_portal;automated_testing;observability_tools,7
innersource_initiative;microservices;dev_portal;observability_tools,5
innersource_initiative;microservices;dev_portal;automated_testing;observability_tools,5


#### Observations:
- There're 155 unique entries
- Entries are nested, separated by a semicolon
- Some of the entries contain `none of these` along with other values. which doesn't make much sense


#### Steps:
- The entries can be shortened
- This variable can be handled in various ways:
  - The rare categories (<1%) can be grouped together
  - The categories can be shortened based on presence of keywords
  - Multiple binary columns can be created
- These steps should be kept in mind for performing feature engineering during exploratory analysis
- The entries in which there's `none of these` along with other values,  `none of these` will be removed

### TrueFalse_1

In [289]:
col_description("TrueFalse_1")

Are you involved in supporting new hires during their onboarding?


In [290]:
df.TrueFalse_1

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
73263    Yes
73264     No
73265     No
73266    NaN
73267    NaN
Name: TrueFalse_1, Length: 73268, dtype: object

In [291]:
df.TrueFalse_1.unique()

array([nan, 'Yes', 'No'], dtype=object)

#### Observations:
- There're 2 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- Type will be made `category`

### TrueFalse_2

In [292]:
col_description("TrueFalse_2")

Do you use learning resources provided by your employer?


In [293]:
df.TrueFalse_2

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
73263    Yes
73264    Yes
73265     No
73266    NaN
73267    NaN
Name: TrueFalse_2, Length: 73268, dtype: object

In [294]:
df.TrueFalse_2.unique()

array([nan, 'Yes', 'No'], dtype=object)

- Same as `TrueFalse_1`

### TrueFalse_3

In [295]:
col_description("TrueFalse_3")

Does your employer give you time to learn new skills?


In [296]:
df.TrueFalse_3

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
73263    Yes
73264    Yes
73265     No
73266    NaN
73267    NaN
Name: TrueFalse_3, Length: 73268, dtype: object

In [297]:
df.TrueFalse_3.unique()

array([nan, 'Yes', 'No'], dtype=object)

- Same as `TrueFalse_1`

### SurveyLength

In [298]:
col_description("SurveyLength")

How do you feel about the length of the survey this year?


In [299]:
df.SurveyLength

0                          NaN
1                     Too long
2        Appropriate in length
3        Appropriate in length
4                     Too long
                 ...          
73263                 Too long
73264                 Too long
73265    Appropriate in length
73266    Appropriate in length
73267    Appropriate in length
Name: SurveyLength, Length: 73268, dtype: object

In [300]:
df.SurveyLength.unique()

array([nan, 'Too long', 'Appropriate in length', 'Too short'],
      dtype=object)

In [301]:
(df
 .SurveyLength
 .fillna("unanswered")
 .str.lower()
 .str.replace(" in length", "")
 .str.replace(" ", "_")
 .astype("category")
 .value_counts())

appropriate    53883
too_long       14491
unanswered      2824
too_short       2070
Name: SurveyLength, dtype: int64

#### Observations:
- There're 3 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- Type will be made `category`

### SurveyEase

In [302]:
col_description("SurveyEase")

How easy or difficult was this survey to complete?


In [303]:
df.SurveyEase

0                               NaN
1                         Difficult
2        Neither easy nor difficult
3                              Easy
4                              Easy
                    ...            
73263                          Easy
73264                          Easy
73265                          Easy
73266                          Easy
73267                          Easy
Name: SurveyEase, Length: 73268, dtype: object

In [304]:
df.SurveyEase.unique()

array([nan, 'Difficult', 'Neither easy nor difficult', 'Easy'],
      dtype=object)

In [305]:
(df
 .SurveyEase
 .fillna("unanswered")
 .str.lower()
 .replace("neither easy nor difficult", "neutral")
 .astype("category")
 .value_counts())

easy          47886
neutral       21627
unanswered     2760
difficult       995
Name: SurveyEase, dtype: int64

#### Observations:
- There're 3 unique entries
- The entries are clean and valid


#### Steps:
- The categories will be renamed for clarity
- Type will be made `category`

### ConvertedCompYearly

In [306]:
df.ConvertedCompYearly

0             NaN
1             NaN
2         40205.0
3        215232.0
4             NaN
           ...   
73263         NaN
73264         NaN
73265         NaN
73266         NaN
73267         NaN
Name: ConvertedCompYearly, Length: 73268, dtype: float64

In [307]:
df.ConvertedCompYearly.describe()

count    3.807100e+04
mean     1.707613e+05
std      7.814132e+05
min      1.000000e+00
25%      3.583200e+04
50%      6.784500e+04
75%      1.200000e+05
max      5.000000e+07
Name: ConvertedCompYearly, dtype: float64

## Data Cleaning Function

In [340]:
def clean_data(data):
    """
    This function accepts a raw dataset and returns the cleaned version
    
    Parameters:
    -----------
    
    data: pd.DataFrame
          The raw dataset to clean and transform
    """    
    
    return (data
            .drop(columns=["ResponseId",
                           "VCHostingPersonal use",
                           "VCHostingProfessional use"])
            .rename(columns=str.lower)
            .rename(columns={"mainbranch": "coding_proficiency",
                             "remotework": "work_type",
                             "codingactivities": "coding_activity",
                             "edlevel": "education_level",
                             "learncode": "learnt_coding",
                             "learncodeonline": "learnt_coding_online",
                             "learncodecoursescert": "learnt_coding_courses",
                             "yearscode": "coding_years",
                             "yearscodepro": "coding_pro_years",
                             "devtype": "profession",
                             "orgsize": "org_size",
                             "purchaseinfluence": "purchase_influence_level",
                             "buynewtool": "learn_new_tool",
                             "comptotal": "comp_total",
                             "compfreq": "comp_freq",
                             "languagehaveworkedwith": "lang_worked_with",
                             "languagewanttoworkwith": "lang_want_work_with",
                             "databasehaveworkedwith": "db_worked_with",
                             "databasewanttoworkwith": "db_want_work_with",
                             "platformhaveworkedwith": "platform_worked_with",
                             "platformwanttoworkwith": "platform_want_work_with",
                             "webframehaveworkedwith": "web_frame_worked_with",
                             "webframewanttoworkwith": "web_frame_want_work_with",
                             "misctechhaveworkedwith": "misc_tech_worked_with",
                             "misctechwanttoworkwith": "misc_tech_want_work_with",
                             "toolstechhaveworkedwith": "tools_tech_worked_with",
                             "toolstechwanttoworkwith": "tools_tech_want_work_with",
                             "newcollabtoolshaveworkedwith": "new_collab_tools_worked_with",
                             "newcollabtoolswanttoworkwith": "new_collab_tools_want_work_with",
                             "opsysprofessional use": "op_sys_pro_use",
                             "opsyspersonal use": "op_sys_personal_use",
                             "versioncontrolsystem": "ver_control_sys",
                             "vcinteraction": "vc_interaction",
                             "officestackasynchaveworkedwith": "office_stack_async_worked_with",
                             "officestackasyncwanttoworkwith": "office_stack_async_want_work_with",
                             "officestacksynchaveworkedwith": "office_stack_sync_worked_with",
                             "officestacksyncwanttoworkwith": "office_stack_sync_want_work_with",
                             "newsosites": "new_sites_visited",
                             "sovisitfreq": "sites_visit_freq",
                             "soaccount": "have_account",
                             "sopartfreq": "participate",
                             "socomm": "consider_self_member",
                             "age": "age_group",
                             "mentalhealth": "mental_health",
                             "tbranch": "participate_dev_series",
                             "icorpm": "ind_cont_ppl_manager",
                             "workexp": "work_exp",
                             "knowledge_1": "interact_ppl_out_team",
                             "knowledge_2": "info_not_shared_team",
                             "knowledge_3": "can_find_info_org",
                             "knowledge_4": "useful_resources",
                             "knowledge_5": "know_resources",
                             "knowledge_6": "ans_questions_repeated",
                             "knowledge_7": "interrupted_waiting",
                             "frequency_1": "get_help_out_team",
                             "frequency_2": "interact_out_team",
                             "frequency_3": "meet_silos_work",
                             "timesearching": "hours_spent_searching",
                             "timeanswering": "hours_spent_answering",
                             "onboarding": "onboarding_time",
                             "professionaltech": "company_tech",
                             "truefalse_1": "support_new_emp",
                             "truefalse_2": "use_resources",
                             "truefalse_3": "given_time_learning",
                             "surveylength": "survey_length",
                             "surveyease": "survey_difficulty",
                             "convertedcompyearly": "conv_yearly_comp"})
            .apply(lambda ser: ser.str.strip().fillna("unanswered") if ser.dtype == "O" else ser)
            .assign(coding_proficiency=lambda df_: df_
                                                     .coding_proficiency
                                                     .replace(["I am a developer by profession",
                                                               "I am learning to code",
                                                               "I am not primarily a developer, but I write code sometimes as part of my work",
                                                               "I code primarily as a hobby",
                                                               "None of these",
                                                               "I used to be a developer by profession, but no longer am"],
                                                              ["developer",
                                                               "learning",
                                                               "work_partly",
                                                               "hobby",
                                                               "other",
                                                               "former_developer"])
                                                     .astype("category"),
                    employment=lambda df_: df_
                                             .employment
                                             .str.lower()
                                             .replace({"i prefer not to say": "unanswered"})
                                             .str.replace("independent contractor, freelancer, or self-employed", "freelancer")
                                             .str.replace("not employed, but looking for work", "unemployed")
                                             .str.replace("not employed, and not looking for work", "unemployed")
                                             .str.replace(", ", "_")
                                             .str.replace("-", "_"),
                    work_type=lambda df_: df_
                                           .work_type
                                           .replace(["Fully remote",
                                                     "Hybrid (some remote, some in-person)",
                                                     "Full in-person"],
                                                    ["remote",
                                                     "hybrid",
                                                     "in_person"])
                                           .astype("category"),
                    coding_activity=lambda df_: df_
                                                  .coding_activity
                                                  .str.lower()
                                                  .str.replace("contribute to open-source projects", "open_source_contribution")
                                                  .str.replace("freelance/contract work", "freelancing")
                                                  .str.replace("school or academic work", "academics")
                                                  .str.replace("bootstrapping a business", "startup")
                                                  .str.replace("i don’t code outside of work", "only_work")
                                                  .str.replace("other \(please specify\):", "other", regex=True),
                    education_level=lambda df_: df_
                                                 .education_level
                                                 .replace(["Bachelor’s degree (B.A., B.S., B.Eng., etc.)",
                                                           "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",
                                                           "Some college/university study without earning a degree",
                                                           "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",
                                                           "Associate degree (A.A., A.S., etc.)",
                                                           "Other doctoral degree (Ph.D., Ed.D., etc.)",
                                                           "Primary/elementary school",
                                                           "Something else",
                                                           "Professional degree (JD, MD, etc.)	"],
                                                          ["bachelors",
                                                           "masters",
                                                           "college",
                                                           "secondary_school",
                                                           "associate",
                                                           "doctorate",
                                                           "primary_school",
                                                           "other",
                                                           "professional"])
                                                 .astype("category"),
                    learnt_coding=lambda df_: df_
                                                .learnt_coding
                                                .str.replace("Other online resources \(e.g., videos, blogs, forum\)", "online_resources", regex=True)
                                                .str.replace("School \(i.e., University, College, etc\)", "academia", regex=True)
                                                .str.replace("Books / Physical media", "books")
                                                .str.replace("Online Courses or Certification", "online_courses")
                                                .str.replace("On the job training", "job_training")
                                                .str.replace("Friend or family member", "friends_family")
                                                .str.replace("Hackathons \(virtual or in-person\)", "hackathons", regex=True)
                                                .str.replace("Other \(please specify\):", "other", regex=True),
                    learnt_coding_online=lambda df_: df_
                                                       .learnt_coding_online
                                                       .str.replace("Other \(Please specify\):", "other", regex=True)
                                                       .str.replace("Technical documentation", "documentation")
                                                       .str.replace("How-to videos", "videos")
                                                       .str.replace("Video-based Online Courses", "courses")
                                                       .str.replace("Online books", "books")
                                                       .str.replace("Online forum", "forums")
                                                       .str.replace("Written-based Online Courses", "courses")
                                                       .str.replace("Coding sessions \(live or recorded\)", "coding_sessions", regex=True)
                                                       .str.replace("Written Tutorials", "tutorials")
                                                       .str.replace("Interactive tutorial", "tutorials")
                                                       .str.replace("Certification videos", "videos")
                                                       .str.replace("Auditory material \(e.g., podcasts\)", "podcasts", regex=True)
                                                       .str.replace("Online challenges \(e.g., daily or weekly coding challenges\)", "coding_challenges", regex=True)
                                                       .str.lower()
                                                       .str.replace(" ", "_"),
                    learnt_coding_courses=lambda df_: df_
                                                        .learnt_coding_courses
                                                        .str.lower(),
                    coding_years=lambda df_: df_
                                              .coding_years
                                              .replace(["Less than 1 year",
                                                        "More than 50 years",
                                                        "unanswered"],
                                                       ["1",
                                                        "50",
                                                        np.nan])
                                              .pipe(lambda ser: pd.to_numeric(ser)),
                    coding_pro_years=lambda df_: df_
                                                  .coding_pro_years
                                                  .replace(["Less than 1 year",
                                                            "More than 50 years",
                                                            "unanswered"],
                                                           ["1",
                                                            "50",
                                                            np.nan])
                                                  .pipe(lambda ser: pd.to_numeric(ser)),
                    profession=lambda df_: df_
                                             .profession
                                             .str.replace("Other \(Please specify\):", "other", regex=True)
                                             .str.replace("Developer, embedded applications or devices", "embedded_app_dev")
                                             .str.replace("Engineer, data", "data_engineer")
                                             .str.replace("Developer, desktop or enterprise applications", "enterprise_app_dev")
                                             .str.replace("Developer, full-stack", "full_stack_dev")
                                             .str.replace("Developer, front-end", "front_end_dev")
                                             .str.replace("Developer, back-end", "back_end_dev")
                                             .str.replace("Developer, mobile", "mobile_dev",)
                                             .str.replace("Data scientist or machine learning specialist", "data_scientist")
                                             .str.replace("Data or business analyst", "data_analyst")
                                             .str.replace("Developer, QA or test", "testing_dev")
                                             .str.replace("Engineer, site reliability", "site_reliability_engineer")
                                             .str.replace("Developer, game or graphics", "game_dev")
                                             .str.replace("Senior Executive \(C-Suite, VP, etc.\)", "senior_executive", regex=True)
                                             .str.replace("Marketing or sales professional", "marketing_professional")
                                             .str.lower()
                                             .str.replace(" ", "_"),
                    org_size=lambda df_: df_
                                          .org_size
                                          .str.replace(" employees", "")
                                          .replace("Just me - I am a freelancer, sole proprietor, etc.", "freelancer")
                                          .replace(["2 to 9",
                                                    "10 to 19",
                                                    "20 to 99",
                                                    "100 to 499",
                                                    "500 to 999",
                                                    "1,000 to 4,999",
                                                    "5,000 to 9,999",
                                                    "10,000 or more",
                                                    "I don’t know"],
                                                   ["small",
                                                    "small",
                                                    "small",
                                                    "small",
                                                    "medium",
                                                    "medium",
                                                    "large",
                                                    "large",
                                                    "unknown"])
                                          .astype("category"),
                    purchase_influence_level=lambda df_: df_
                                                           .purchase_influence_level
                                                           .replace(["I have some influence",
                                                                     "I have little or no influence",
                                                                     "I have a great deal of influence"],
                                                                    ["some",
                                                                     "little",
                                                                     "great"])
                                                           .astype("category"),
                    learn_new_tool=lambda df_: df_
                                                 .learn_new_tool
                                                 .str.replace("Other \(please specify\):", "other", regex=True)
                                                 .str.replace("Start a free trial", "free_trial")
                                                 .str.replace("Ask developers I know/work with", "ask_known_devs")
                                                 .str.replace("Visit developer communities like Stack Overflow", "stack_overflow")
                                                 .str.replace("Read ratings or reviews on third party sites like G2Crowd", "Read ratings")
                                                 .str.replace("Research companies that have advertised on sites I visit", "Research companies")
                                                 .str.replace("Research companies that have emailed me", "Research companies")
                                                 .str.lower()
                                                 .str.replace(" ", "_"),
                    country=lambda df_: df_
                                          .country
                                          .replace(["Lao People's Democratic Republic",
                                                    "Democratic Republic of the Congo",
                                                    "Brunei Darussalam",
                                                    "Swaziland",
                                                    "Congo, Republic of the...",
                                                    "Libyan Arab Jamahiriya",
                                                    "United Republic of Tanzania",
                                                    "Syrian Arab Republic",
                                                    "Republic of Moldova",
                                                    "The former Yugoslav Republic of Macedonia",
                                                    "Republic of Korea",
                                                    "Venezuela, Bolivarian Republic of...",
                                                    "United Arab Emirates",
                                                    "Hong Kong (S.A.R.)",
                                                    "Viet Nam",
                                                    "Iran, Islamic Republic of...",
                                                    "Russian Federation",
                                                    "United Kingdom of Great Britain and Northern Ireland",
                                                    "United States of America"],
                                                   ["Laos",
                                                    "Congo",
                                                    "Brunei",
                                                    "Eswatini",
                                                    "Congo",
                                                    "Libya",
                                                    "Tanzania",
                                                    "Syria",
                                                    "Moldova",
                                                    "North Macedonia",
                                                    "South Korea",
                                                    "Venezuela",
                                                    "UAE",
                                                    "Hong Kong",
                                                    "Vietnam",
                                                    "Iran",
                                                    "Russia",
                                                    "UK",
                                                    "USA"]),
                    currency=lambda df_: df_
                                           .country
                                           .map({"unanswered": "unanswered", "Canada": "CAD", 
                                                 "UK": "GBP", "Israel": "ILS", "USA": "USD", 
                                                 "Germany": "EUR", "India": "INR", "Netherlands": "EUR",
                                                 "Croatia": "HRK", "Australia": "AUD", "Russia": "RUB",
                                                 "Czech Republic": "CZK", "Austria": "EUR", "Serbia": "RSD", 
                                                 "Italy": "EUR", "Ireland": "EUR", "Poland": "PLN", 
                                                 "Slovenia": "EUR", "Iraq": "IQD", "Sweden": "SEK", 
                                                 "Madagascar": "MGA", "Norway": "NOK", "Taiwan": "TWD", 
                                                 "Hong Kong": "HKD", "Mexico": "MXN", "France": "EUR", 
                                                 "Brazil": "BRL", "Lithuania": "LTL", "Uruguay": "UYU", 
                                                 "Denmark": "DKK", "Spain": "EUR", "Egypt": "EGP", "Turkey": "TRY", 
                                                 "South Africa": "ZAR", "Ukraine": "UAH", "Finland": "EUR", 
                                                 "Romania": "RON", "Portugal": "EUR", "Singapore": "SGD", 
                                                 "Oman": "OMR", "Belgium": "EUR", "Chile": "CLP", "Bulgaria": "BGN", 
                                                 "Latvia": "LVL", "Philippines": "PHP", "Greece": "EUR", "Belarus": "BYR",
                                                 "Saudi Arabia": "SAR", "Kenya": "KES", "Switzerland": "CHF", 
                                                 "Iceland": "ISK", "Vietnam": "VND", "Thailand": "THB", "China": "CNY", 
                                                 "Montenegro": "EUR", "Slovakia": "EUR", "Japan": "jpy", 
                                                 "Luxembourg": "LUF", "Turkmenistan": "TMT", "Argentina": "ARS", 
                                                 "Hungary": "HUF", "Tunisia": "TND", "Bangladesh": "BDT", "Maldives": "MVR", 
                                                 "Dominican Republic": "DOP", "Jordan": "JOD", "Pakistan": "PKR",
                                                 "Nepal": "NPR", "Iran": "IRR", "Indonesia": "IDR", "Ecuador": "USD", 
                                                 "Bosnia and Herzegovina": "BAM", "Armenia": "AMD",
                                                 "Colombia": "COP", "Kazakhstan": "KZT", "South Korea": "KRW",
                                                 "Costa Rica": "CRC", "Honduras": "HNL", "Mauritius": "MUR",
                                                 "Estonia": "EUR", "Algeria": "DZD", "Trinidad and Tobago": "TTD",
                                                 "Mali": "XOF", "Morocco": "MAD", "Eswatini": "SZL",
                                                 "New Zealand": "NZD", "North Macedonia": "MKD", "Afghanistan": "AFN",
                                                 "Cyprus": "CYP", "UAE": "AED", "Peru": "PEN",
                                                 "Uzbekistan": "UZS", "Ethiopia": "ETB", "Bahrain": "BHD",
                                                 "Malta": "MLT", "Nicaragua": "NIO", "Andorra": "ADP",
                                                 "Lebanon": "LBP", "Belize": "BZD", "Zambia": "ZMW",
                                                 "Bolivia": "BOB", "Malaysia": "MYR", "Sri Lanka": "LKR",
                                                 "Laos": "LAK", "Guatemala": "GTQ", "Azerbaijan": "AZN",
                                                 "Suriname": "SRD", "El Salvador": "USD", "Syria": "SYP",
                                                 "Qatar": "QAR", "Nigeria": "NGN", "Kyrgyzstan": "KGS",
                                                 "Zimbabwe": "ZWD", "Rwanda": "RWF", "Georgia": "GEL",
                                                 "Cambodia": "KHR", "Malawi": "MWK", "Yemen": "YER",
                                                 "Fiji": "FJD", "Nomadic": "unknown", "Uganda": "UGX",
                                                 "Albania": "ALL", "Timor-Leste": "USD", "Mongolia": "MNT",
                                                 "Moldova": "MDL", "Tajikistan": "TJS", "Ghana": "GHS",
                                                 "Tanzania": "TZS", "Myanmar": "MMK", "Kuwait": "KWD",
                                                 "Cameroon": "XAF", "Kosovo": "EUR", "Jamaica": "JMD",
                                                 "Benin": "XOF", "Botswana": "BWP", "Niger": "XOF",
                                                 "Palestine": "EGP", "Cape Verde": "CVE", "Libya": "LYD",
                                                 "Venezuela": "VES", "Senegal": "XOF", "Cuba": "CUP",
                                                 "Togo": "XOF", "Angola": "AOA", "Isle of Man": "IMP",
                                                 "Panama": "PAB", "Bahamas": "BSD", "Paraguay": "PYG",
                                                 "Sudan": "SDG", "Liberia": "LRD", "Bhutan": "BTN",
                                                 "Congo": "CDF", "Côte d'Ivoire": "XOF", "Barbados": "BBD",
                                                 "Namibia": "NAD", "Somalia": "SOS", "Sierra Leone": "SLL",
                                                 "Mozambique": "MZN", "Lesotho": "LSL", "Chad": "XAF",
                                                 "North Korea": "KPW", "Antigua and Barbuda": "XCD", "Papua New Guinea": "PGK",
                                                 "Palau": "USD", "Guinea": "GNF", "Haiti": "HTG",
                                                 "Gabon": "XAF", "Mauritania": "MRU", "San Marino": "EUR",
                                                 "Guyana": "GYD", "Saint Lucia": "XCD", "Burkina Faso": "XOF",
                                                 "Brunei": "BND", "Gambia": "GMD", "Monaco": "MCO",
                                                 "Djibouti": "DJF", "Seychelles": "SCR", "Solomon Islands": "SBD",
                                                 "Saint Kitts and Nevis": "KN"}),
                    comp_total=lambda df_: df_
                                            .comp_total
                                            .where(lambda ser: ~(data.Employment.str.contains("Employed, full-time") & (ser == 0)), np.nan)
                                            .pipe(lambda ser: pd.to_numeric(ser)),
                    comp_freq=lambda df_: df_
                                           .comp_freq
                                           .str.lower()
                                           .astype("category"),
                    db_worked_with=lambda df_: df_
                                                 .db_worked_with
                                                 .str.replace(" ", "_"),
                    db_want_work_with=lambda df_: df_
                                                 .db_want_work_with
                                                 .str.replace(" ", "_"),
                    platform_worked_with=lambda df_: df_
                                                       .platform_worked_with
                                                       .str.replace("IBM Cloud or Watson", "watson")
                                                       .str.replace("OpenStack", "open_stack")
                                                       .str.replace("DigitalOcean", "Digital Ocean")
                                                       .str.lower()
                                                       .str.replace(" ", "_"),
                    platform_want_work_with=lambda df_: df_
                                                          .platform_want_work_with
                                                          .str.replace("IBM Cloud or Watson", "watson")
                                                          .str.replace("OpenStack", "open_stack")
                                                          .str.replace("DigitalOcean", "Digital Ocean")
                                                          .str.lower()
                                                       .str.replace(" ", "_"),
                    web_frame_worked_with=lambda df_: df_
                                                        .web_frame_worked_with
                                                        .str.lower()
                                                        .str.replace(" ", "_")
                                                        .str.replace(".", "_", regex=True),
                    web_frame_want_work_with=lambda df_: df_
                                                           .web_frame_want_work_with
                                                           .str.lower()
                                                           .str.replace(" ", "_")
                                                           .str.replace(".", "_", regex=True),
                    misc_tech_worked_with=lambda df_: df_
                                                        .misc_tech_worked_with
                                                        .str.replace("Torch/PyTorch", "pytorch")
                                                        .str.replace(".NET", "dot_net", regex=True)
                                                        .str.lower()
                                                        .str.replace(" ", "_")
                                                        .str.replace("-", "_"),
                    misc_tech_want_work_with=lambda df_: df_
                                                           .misc_tech_want_work_with
                                                           .str.replace("Torch/PyTorch", "pytorch")
                                                           .str.replace(".NET", "dot_net", regex=True)
                                                           .str.lower()
                                                           .str.replace(" ", "_")
                                                           .str.replace("-", "_"),
                    tools_tech_worked_with=lambda df_: df_
                                                         .tools_tech_worked_with
                                                         .str.lower()
                                                         .str.replace(" ", "_"),
                    tools_tech_want_work_with=lambda df_: df_
                                                            .tools_tech_want_work_with
                                                            .str.lower()
                                                            .str.replace(" ", "_"),
                    new_collab_tools_worked_with=lambda df_: df_
                                                               .new_collab_tools_worked_with
                                                               .str.replace("RAD Studio \(Delphi, C\++ Builder\)", "RAD Studio", regex=True)
                                                               .str.replace("IPython/Jupyter", "Jupyter", regex=True)
                                                               .str.lower()
                                                               .str.replace(" ", "_"),
                    new_collab_tools_want_work_with=lambda df_: df_
                                                                  .new_collab_tools_want_work_with
                                                                  .str.replace("RAD Studio \(Delphi, C\++ Builder\)", "RAD Studio", regex=True)
                                                                  .str.replace("IPython/Jupyter", "Jupyter", regex=True)
                                                                  .str.lower()
                                                                  .str.replace(" ", "_"),
                    op_sys_pro_use=lambda df_: df_
                                                 .op_sys_pro_use
                                                 .str.replace("Linux-based", "Linux", regex=True)
                                                 .str.replace("Windows Subsystem for Linux \(WSL\)", "WSL", regex=True)
                                                 .str.replace("Other \(please specify\)\:", "Other", regex=True)
                                                 .str.lower()
                                                 .str.replace("macos", "mac_os"),
                    op_sys_personal_use=lambda df_: df_
                                                      .op_sys_personal_use
                                                      .str.replace("Linux-based", "Linux", regex=True)
                                                      .str.replace("Windows Subsystem for Linux \(WSL\)", "WSL", regex=True)
                                                      .str.replace("Other \(please specify\)\:", "Other", regex=True)
                                                      .str.lower()
                                                      .str.replace("macos", "mac_os"),
                    ver_control_sys=lambda df_: df_
                                                  .ver_control_sys
                                                  .str.replace("Other \(please specify\):", "other", regex=True)
                                                  .str.replace("I don't use one", "none")
                                                  .str.lower(),
                    vc_interaction=lambda df_: df_
                                                 .vc_interaction
                                                 .str.replace("Version control hosting service web GUI", "vc hosting service web gui")
                                                 .str.replace("Dedicated version control GUI application", "ded vc gui app")
                                                 .str.lower()
                                                 .str.replace(" ", "_")
                                                 .str.replace("-", "_"),
                    office_stack_async_worked_with=lambda df_: df_
                                                                 .office_stack_async_worked_with
                                                                 .str.replace("Stack Overflow for Teams", "Stack Overflow Teams")
                                                                 .str.replace("monday\.com", "monday_com", regex=True)
                                                                 .str.replace("DingTalk \(Teambition\)", "DingTalk", regex=True)
                                                                 .str.replace("Planview Projectplace or Clarizen", "Clarizen")
                                                                 .str.lower()
                                                                 .str.replace(" ", "_"),
                    office_stack_async_want_work_with=lambda df_: df_
                                                                    .office_stack_async_want_work_with
                                                                    .str.replace("Stack Overflow for Teams", "Stack Overflow Teams")
                                                                    .str.replace("monday\.com", "monday_com", regex=True)
                                                                    .str.replace("DingTalk \(Teambition\)", "DingTalk", regex=True)
                                                                    .str.replace("Planview Projectplace or Clarizen", "Clarizen")
                                                                    .str.lower()
                                                                    .str.replace(" ", "_"),
                    office_stack_sync_worked_with=lambda df_: df_
                                                                .office_stack_sync_worked_with
                                                                .str.replace("Rocketchat", "Rocket chat")
                                                                .str.replace("RingCentral", "Ring Central")
                                                                .str.lower()
                                                                .str.replace(" ", "_"),
                    office_stack_sync_want_work_with=lambda df_: df_
                                                                   .office_stack_sync_want_work_with
                                                                   .str.replace("Rocketchat", "Rocket chat")
                                                                   .str.replace("RingCentral", "Ring Central")
                                                                   .str.lower()
                                                                   .str.replace(" ", "_"),
                    blockchain=lambda df_: df_
                                             .blockchain
                                             .str.lower()
                                             .astype("category"),
                    new_sites_visited=lambda df_: df_
                                                    .new_sites_visited
                                                    .str.replace("Collectives on Stack Overflow", "collectives")
                                                    .str.replace("Stack Overflow for Teams \(private knowledge sharing & collaboration platform for companies\)", "stack overflow teams", regex=True)
                                                    .str.replace("I have never visited Stack Overflow or the Stack Exchange network", "none", regex=True)
                                                    .str.lower()
                                                    .str.replace(" ", "_"),
                    sites_visit_freq=lambda df_: df_
                                                   .sites_visit_freq
                                                   .replace(["Daily or almost daily",
                                                              "Multiple times per day",
                                                              "A few times per week",
                                                              "A few times per month or weekly",
                                                              "Less than once per month or monthly"],
                                                             ["daily",
                                                              "daily",
                                                              "weekly",
                                                              "weekly",
                                                              "monthly"])
                                                    .astype("category"),
                    have_account=lambda df_: df_
                                               .have_account
                                               .str.lower()
                                               .replace("Not sure/can't remember", "unsure")
                                               .astype("category"),
                    participate=lambda df_: df_
                                              .participate
                                              .replace(["Daily or almost daily",
                                                        "Multiple times per day",
                                                        "A few times per week",
                                                        "A few times per month or weekly",
                                                        "Less than once per month or monthly",
                                                        "I have never participated in Q&A on Stack Overflow"],
                                                       ["daily",
                                                        "daily",
                                                        "weekly",
                                                        "weekly",
                                                        "monthly",
                                                        "never"])
                                              .astype("category"),
                    consider_self_member=lambda df_: df_
                                                       .consider_self_member
                                                       .replace(["Yes, somewhat",
                                                                 "Yes, definitely",
                                                                 "No, not at all",
                                                                 "No, not really",
                                                                 "Not sure"],
                                                                ["yes",
                                                                 "yes",
                                                                 "no",
                                                                 "no",
                                                                 "unsure"])
                                                       .str.lower()
                                                       .astype("category"),
                    age_group=lambda df_: df_
                                            .age_group
                                            .replace({"Under 18 years old": "minor",
                                                      "18-24 years old": "young_adult",
                                                      "25-34 years old": "young_adult",
                                                      "35-44 years old": "middle_aged",
                                                      "45-54 years old": "middle_aged",
                                                      "55-64 years old": "senior",
                                                      "65 years or older": "senior",
                                                      "Prefer not to say": "unanswered"})
                                             .astype("category"),
                    gender=lambda df_: df_
                                         .gender
                                         .str.replace("Non-binary, genderqueer, or gender non-conforming", "non_binary")
                                         .str.replace("Or, in your own words:", "other")
                                         .str.replace("Prefer not to say", "unanswered")
                                         .str.replace("Man", "male")
                                         .str.replace("Woman", "female"),
                    trans=lambda df_: df_
                                       .trans
                                       .str.lower()
                                       .replace({"prefer not to say": "unanswered",
                                                 "or, in your own words:": "yes"})
                                       .astype("category"),
                    sexuality=lambda df_: df_
                                            .sexuality
                                            .str.replace("Straight / Heterosexual", "Heterosexual")
                                            .str.replace("Prefer not to say", "unanswered")
                                            .str.replace("Gay or Lesbian", "Homosexual")
                                            .str.replace("Prefer to self-describe:", "other")
                                            .str.lower(),
                    ethnicity=lambda df_: df_
                                            .ethnicity
                                            .replace({"Prefer not to say": "unanswered",
                                                      "I don't know": "unsure"})
                                            .str.replace("Hispanic or Latino/a", "Hispanic")
                                            .str.replace("Or, in your own words:", "other")
                                            .str.replace("Indigenous \(such as Native American or Indigenous Australian\)", "Indigenous", regex=True)
                                            .str.lower()
                                            .str.replace(" ", "_"),
                    accessibility=lambda df_: df_
                                                .accessibility
                                                .replace({"Prefer not to say": "unanswered",
                                                          "None of the above": "none_of_these"})
                                                .str.replace("none of the above", "none")
                                                .str.replace("I am blind / have difficulty seeing", "blind")
                                                .str.replace("Or, in your own words:", "other")
                                                .str.replace("I am deaf / hard of hearing", "deaf")
                                                .str.replace("I am unable to / find it difficult to walk or stand without assistance", "cant_walk")
                                                .str.replace("I am unable to / find it difficult to type", "cant_type"),
                    mental_health=lambda df_: df_
                                                .mental_health
                                                .replace({"None of the above": "none_of_these",
                                                          "Prefer not to say": "unanswered"})
                                                .str.replace("Or, in your own words:", "other")
                                                .str.replace("I have an anxiety disorder", "anxiety_disorder")
                                                .str.replace("I have a mood or emotional disorder \(e.g., depression, bipolar disorder, etc.\)", "emotional_disorder", regex=True)
                                                .str.replace("I have a concentration and/or memory disorder \(e.g., ADHD, etc.\)", "memory_disorder", regex=True)
                                                .str.replace("I have autism / an autism spectrum disorder \(e.g. Asperger's, etc.\)", "autism", regex=True)
                                                .str.replace("I have learning differences \(e.g., Dyslexic, Dyslexia, etc.\)", "dyslexia", regex=True),
                    participate_dev_series=lambda df_: df_
                                                        .participate_dev_series
                                                        .str.lower()
                                                        .astype("category"),
                    ind_cont_ppl_manager=lambda df_: df_
                                                       .ind_cont_ppl_manager
                                                       .str.lower()
                                                       .str.replace(" ", "_")
                                                       .astype("category"),
                    interact_ppl_out_team=lambda df_: df_
                                                        .interact_ppl_out_team
                                                        .str.lower()
                                                        .str.replace(" ", "_")
                                                        .astype("category"),
                    info_not_shared_team=lambda df_: df_
                                                        .info_not_shared_team
                                                        .str.lower()
                                                        .str.replace(" ", "_")
                                                        .astype("category"),
                    can_find_info_org=lambda df_: df_
                                                    .can_find_info_org
                                                    .str.lower()
                                                    .str.replace(" ", "_")
                                                    .astype("category"),
                    useful_resources=lambda df_: df_
                                                   .useful_resources
                                                   .str.lower()
                                                   .str.replace(" ", "_")
                                                   .astype("category"),
                    know_resources=lambda df_: df_
                                                 .know_resources
                                                 .str.lower()
                                                 .str.replace(" ", "_")
                                                 .astype("category"),
                    ans_questions_repeated=lambda df_: df_
                                                         .ans_questions_repeated
                                                         .str.lower()
                                                         .str.replace(" ", "_")
                                                         .astype("category"),
                    interrupted_waiting=lambda df_: df_
                                                      .interrupted_waiting
                                                      .str.lower()
                                                      .str.replace(" ", "_")
                                                      .astype("category"),
                    get_help_out_team=lambda df_: df_
                                                    .get_help_out_team
                                                    .str.lower()
                                                    .replace({"1-2 times a week": "rarely",
                                                              "3-5 times a week": "mildly",
                                                              "6-10 times a week": "frequently",
                                                              "10+ times a week": "frequently"})
                                                    .astype("category"),
                    interact_out_team=lambda df_: df_
                                                    .interact_out_team
                                                    .str.lower()
                                                    .replace({"1-2 times a week": "rarely",
                                                              "3-5 times a week": "mildly",
                                                              "6-10 times a week": "frequently",
                                                              "10+ times a week": "frequently"})
                                                    .astype("category"),
                    meet_silos_work=lambda df_: df_
                                                  .meet_silos_work
                                                  .str.lower()
                                                  .replace({"1-2 times a week": "rarely",
                                                            "3-5 times a week": "mildly",
                                                            "6-10 times a week": "frequently",
                                                            "10+ times a week": "frequently"})
                                                  .astype("category"),
                    hours_spent_searching=lambda df_: df_
                                                        .hours_spent_searching
                                                        .str.lower()
                                                        .str.replace(" minutes a day", "")
                                                        .replace({"less than 15": "quarter",
                                                                  "15-30": "half",
                                                                  "30-60": "one",
                                                                  "60-120": "two",
                                                                  "over 120": "above_two"})
                                                        .astype("category"),
                    hours_spent_answering=lambda df_: df_
                                                        .hours_spent_answering
                                                        .str.lower()
                                                        .str.replace(" minutes a day", "")
                                                        .replace({"less than 15": "quarter",
                                                                  "15-30": "half",
                                                                  "30-60": "one",
                                                                  "60-120": "two",
                                                                  "over 120": "above_two"})
                                                        .astype("category"),
                    onboarding_time=lambda df_: df_
                                                  .onboarding_time
                                                  .str.lower()
                                                  .str.replace(" ", "_")
                                                  .str.replace("somewhat_", "")
                                                  .astype("category"),
                    company_tech=lambda df_: df_
                                               .company_tech
                                               .str.replace(";None of these", "")
                                               .str.replace("Continuous integration \(CI\) and \(more often\) continuous delivery", "ci_cd", regex=True)
                                               .str.replace("DevOps function", "devops")
                                               .str.replace("Developer portal or other central places to find tools/services", "dev_portal")
                                               .str.lower()
                                               .str.replace(" ", "_"),
                    support_new_emp=lambda df_: df_
                                                  .support_new_emp
                                                  .str.lower()
                                                  .astype("category"),
                    use_resources=lambda df_: df_
                                                .use_resources
                                                .str.lower()
                                                .astype("category"),
                    given_time_learning=lambda df_: df_
                                                      .given_time_learning
                                                      .str.lower()
                                                      .astype("category"),
                    survey_length=lambda df_: df_
                                                .survey_length
                                                .str.lower()
                                                .str.replace(" in length", "")
                                                .str.replace(" ", "_")
                                                .astype("category"),
                    survey_difficulty=lambda df_: df_
                                                    .survey_difficulty
                                                    .str.lower()
                                                    .replace("neither easy nor difficult", "neutral")
                                                    .astype("category")))

In [341]:
df_cleaned = clean_data(df)
df_cleaned.shape

(73268, 76)

## Data Validation

### 1. Checking if the total no. of years coded is less than no. of years coded professionally:

In [311]:
col_description("YearsCode")

Including any education, how many years have you been coding in total?


In [312]:
col_description("YearsCodePro")

NOT including education, how many years have you coded professionally (as a part of your work)?


In [313]:
(clean_data(df)
 .dropna(subset=["coding_years", "coding_pro_years"], how="any")
 .pipe(lambda df_: df_.coding_pro_years.gt(df_.coding_years).sum()))

536

In [314]:
def validate_coding_years(data):
    result = np.full(data.shape[0], True)
    for i, (ix, row) in enumerate(data.iterrows()):
        try:
            years_coded = int(row.coding_years)
        except:
            continue
        
        try:
            years_coded_pro = int(row.coding_pro_years)
        except:
            continue
        
        if years_coded_pro > years_coded:
            result[i] = False
    
    return data[result]

### Observations:
- There're 536 entries where `coding_pro_years` is greater than `coding_years`
  - This is not possible and it doesn't make sense
  - Might be better to delete these observations
  
  
### Steps:
- The function `validate_coding_years` will handle this error and delete the erranous observations

### 2. Checking the currency values when country isn't mentioned

In [345]:
(df_cleaned
 .query("country == 'unanswered'")
 .currency
 .unique())

array(['unanswered'], dtype=object)

### Observations:
- Whenever country is not mentioned, the corresponding currency value is also missing, as expected
  
  
### Steps:
- No cleaning steps required

## Cleaned Data

In [346]:
df_cleaned = (df
              .pipe(clean_data)
              .pipe(validate_coding_years)
              .reset_index(drop=True))

df_cleaned

Unnamed: 0,coding_proficiency,employment,work_type,coding_activity,education_level,learnt_coding,learnt_coding_online,learnt_coding_courses,coding_years,coding_pro_years,profession,org_size,purchase_influence_level,learn_new_tool,country,currency,comp_total,comp_freq,lang_worked_with,lang_want_work_with,db_worked_with,db_want_work_with,platform_worked_with,platform_want_work_with,web_frame_worked_with,web_frame_want_work_with,misc_tech_worked_with,misc_tech_want_work_with,tools_tech_worked_with,tools_tech_want_work_with,new_collab_tools_worked_with,new_collab_tools_want_work_with,op_sys_pro_use,op_sys_personal_use,ver_control_sys,vc_interaction,office_stack_async_worked_with,office_stack_async_want_work_with,office_stack_sync_worked_with,office_stack_sync_want_work_with,blockchain,new_sites_visited,sites_visit_freq,have_account,participate,consider_self_member,age_group,gender,trans,sexuality,ethnicity,accessibility,mental_health,participate_dev_series,ind_cont_ppl_manager,work_exp,interact_ppl_out_team,info_not_shared_team,can_find_info_org,useful_resources,know_resources,ans_questions_repeated,interrupted_waiting,get_help_out_team,interact_out_team,meet_silos_work,hours_spent_searching,hours_spent_answering,onboarding_time,company_tech,support_new_emp,use_resources,given_time_learning,survey_length,survey_difficulty,conv_yearly_comp
0,other,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,
1,developer,employed_full_time,remote,hobby;open_source_contribution,unanswered,unanswered,unanswered,unanswered,,,unanswered,unanswered,unanswered,unanswered,Canada,CAD,,unanswered,JavaScript;TypeScript,Rust;TypeScript,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,mac_os,wsl,git,unanswered,unanswered,unanswered,unanswered,unanswered,very unfavorable,collectives;stack_overflow_teams;stack_overflo...,daily,yes,daily,unsure,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,no,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,too_long,difficult,
2,work_partly,employed_full_time,hybrid,hobby,masters,books;friends_family;online_resources;academia,documentation;blogs;programming_games;tutorial...,unanswered,14.0,5.0,data_scientist;front_end_dev;data_engineer;sit...,small,some,unanswered,UK,GBP,32000.0,yearly,C#;C++;HTML/CSS;JavaScript;Python,C#;C++;HTML/CSS;JavaScript;TypeScript,Microsoft_SQL_Server,Microsoft_SQL_Server,unanswered,unanswered,angular_js,angular;angular_js,pandas,dot_net,unanswered,unanswered,notepad++;visual_studio,notepad++;visual_studio,windows,windows,git,code_editor,unanswered,unanswered,microsoft_teams,microsoft_teams,very unfavorable,collectives;stack_overflow;stack_exchange,daily,yes,daily,neutral,young_adult,male,no,bisexual,white,none_of_these,emotional_disorder;anxiety_disorder,no,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,appropriate,neutral,40205.0
3,developer,employed_full_time,remote,only_work,bachelors,books;academia,unanswered,unanswered,20.0,17.0,full_stack_dev,small,some,other,Israel,ILS,60000.0,monthly,C#;JavaScript;SQL;TypeScript,C#;SQL;TypeScript,Microsoft_SQL_Server,Microsoft_SQL_Server,unanswered,unanswered,asp_net;asp_net_core,asp_net;asp_net_core,dot_net,dot_net,unanswered,unanswered,notepad++;visual_studio;visual_studio_code,notepad++;visual_studio;visual_studio_code,windows,windows,git,code_editor;command_line;vc_hosting_service_we...,jira_work_management;trello,jira_work_management;trello,slack;zoom,slack;zoom,very unfavorable,collectives;stack_overflow_teams;stack_overflo...,daily,yes,weekly,yes,middle_aged,male,no,heterosexual,white,none_of_these,none_of_these,no,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,appropriate,easy,215232.0
4,developer,employed_full_time,hybrid,hobby,bachelors,online_resources;academia;job_training,documentation;blogs;stack_overflow;books;cours...,unanswered,8.0,3.0,front_end_dev;full_stack_dev;back_end_dev;ente...,small,some,free_trial;stack_overflow,USA,USD,,unanswered,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript,C#;Elixir;F#;Go;JavaScript;Rust;TypeScript,Cloud_Firestore;Elasticsearch;Microsoft_SQL_Se...,Cloud_Firestore;Elasticsearch;Firebase_Realtim...,firebase;microsoft_azure,firebase;microsoft_azure,angular;asp_net;asp_net_core_;jquery;node_js,angular;asp_net_core_;blazor;node_js,dot_net,dot_net;apache_kafka,npm,docker;kubernetes,notepad++;visual_studio;visual_studio_code;xcode,rider;visual_studio;visual_studio_code,windows,mac_os;windows,git;other,code_editor,unanswered,unanswered,microsoft_teams;zoom,unanswered,unfavorable,collectives;stack_overflow_teams;stack_overflo...,daily,yes,daily,yes,young_adult,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,no,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,too_long,easy,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72727,developer,employed_full_time,remote,freelancing,bachelors,books;online_resources;job_training;online_cou...,documentation;blogs;tutorials;stack_overflow;c...,udemy,8.0,5.0,back_end_dev,small,some,stack_overflow;ask_known_devs,Nigeria,NGN,60000.0,yearly,Bash/Shell;Dart;JavaScript;PHP;Python;SQL;Type...,Bash/Shell;Go;JavaScript;Python;SQL;TypeScript,Elasticsearch;MySQL;PostgreSQL;Redis,MySQL;PostgreSQL;Redis,aws;digital_ocean;google_cloud,aws;digital_ocean;google_cloud,express;fastapi;node_js,express;fastapi;node_js,flutter,unanswered,docker;homebrew;kubernetes;npm,docker;homebrew;kubernetes;npm,jupyter;sublime_text;vim;visual_studio_code,sublime_text;vim;visual_studio_code,mac_os,linux;mac_os,git,code_editor;command_line,jira_work_management,jira_work_management,slack;zoom,slack;zoom,very favorable,stack_overflow;stack_exchange,daily,yes,weekly,yes,young_adult,male,no,heterosexual,african,none_of_these,none_of_these,yes,independent_contributor,5.0,agree,disagree,strongly_agree,strongly_agree,strongly_agree,strongly_agree,neither_agree_nor_disagree,never,never,never,one,quarter,just_right,devops;microservices;dev_portal;ci_cd;automate...,yes,yes,yes,too_long,easy,
72728,developer,employed_full_time,in_person,hobby,masters,online_resources;academia;job_training;online_...,documentation;blogs;tutorials;stack_overflow;c...,coursera;udemy;udacity,6.0,5.0,data_scientist,unknown,little,other;ask_known_devs,USA,USD,107000.0,yearly,Bash/Shell;HTML/CSS;JavaScript;Python;SQL,HTML/CSS;JavaScript;Python,Elasticsearch;MongoDB;Oracle;SQLite,Elasticsearch;Neo4j;SQLite,unanswered,unanswered,fastapi;flask;react_js,fastapi;react_js,keras;numpy;pandas;scikit_learn;tensorflow;pyt...,numpy;pandas;pytorch;hugging_face_transformers,unanswered,unanswered,jupyter;notepad++;spyder;vim;visual_studio_code,notepad++;spyder;vim;visual_studio_code,linux;windows,linux;windows,git,code_editor;command_line,unanswered,unanswered,rocket_chat,unanswered,unsure,stack_overflow,daily,not sure/can't remember,unanswered,neutral,young_adult,male,no,heterosexual,white,none_of_these,none_of_these,yes,independent_contributor,6.0,agree,agree,neither_agree_nor_disagree,disagree,disagree,agree,agree,rarely,frequently,frequently,half,two,very_long,none_of_these,no,yes,yes,too_long,easy,
72729,work_partly,employed_full_time,hybrid,hobby;academics,bachelors,books;online_resources;academia;online_courses,documentation;programming_games;stack_overflow...,udemy;codecademy;pluralsight;edx,42.0,33.0,full_stack_dev;enterprise_app_dev;system_admin...,small,great,free_trial;ask_known_devs,USA,USD,,unanswered,HTML/CSS;JavaScript;PHP;Python;SQL,C#;HTML/CSS;JavaScript;PHP;Python;SQL,MariaDB;Microsoft_SQL_Server;MySQL;PostgreSQL;...,MariaDB;Microsoft_SQL_Server;MySQL;PostgreSQL;...,managed_hosting;microsoft_azure;vmware,firebase;linode;managed_hosting;microsoft_azur...,asp_net;react_js,asp_net;asp_net_core_;blazor;laravel;next_js;r...,dot_net;pandas;react_native,dot_net;cordova;ionic;pandas;react_native;xamarin,npm,npm;unreal_engine,spyder;visual_studio;visual_studio_code,spyder;visual_studio;visual_studio_code,windows,windows,git,code_editor;command_line;vc_hosting_service_we...,microsoft_lists,microsoft_lists,microsoft_teams;zoom,microsoft_teams;zoom,very unfavorable,stack_overflow;stack_exchange,daily,yes,monthly,yes,senior,male,no,heterosexual,multiracial,none_of_these,none_of_these,yes,independent_contributor,42.0,disagree,neither_agree_nor_disagree,disagree,agree,agree,agree,neither_agree_nor_disagree,never,never,never,one,two,just_right,none_of_these,no,no,no,appropriate,easy,
72730,developer,employed_full_time,hybrid,hobby,bachelors,books;job_training,unanswered,unanswered,50.0,31.0,front_end_dev;enterprise_app_dev,small,great,free_trial;stack_overflow;read_ratings,UK,GBP,58500.0,yearly,C#;Delphi;VBA,Delphi,Microsoft_SQL_Server;MongoDB;Oracle,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,rad_studio;visual_studio,rad_studio;visual_studio,windows,windows,svn,ded_vc_gui_app,unanswered,unanswered,zoom,zoom,indifferent,stack_overflow,daily,yes,never,no,senior,male,no,heterosexual,european,none_of_these,none_of_these,no,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,appropriate,easy,


### Memory Usage

In [347]:
df.memory_usage(deep=True).sum()

375776556

In [348]:
df_cleaned.memory_usage(deep=True).sum()

216710996

In [349]:
375776346 / 218260617

1.7216864460710288

### Meta-data

In [319]:
df_cleaned.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72732 entries, 0 to 72731
Data columns (total 76 columns):
 #   Column                             Non-Null Count  Dtype   
---  ------                             --------------  -----   
 0   coding_proficiency                 72732 non-null  category
 1   employment                         72732 non-null  object  
 2   work_type                          72732 non-null  category
 3   coding_activity                    72732 non-null  object  
 4   education_level                    72732 non-null  category
 5   learnt_coding                      72732 non-null  object  
 6   learnt_coding_online               72732 non-null  object  
 7   learnt_coding_courses              72732 non-null  object  
 8   coding_years                       70795 non-null  float64 
 9   coding_pro_years                   51297 non-null  float64 
 10  profession                         72732 non-null  object  
 11  org_size                           72732 

## Final Results:

- The raw dataset had 73,268 rows and 79 columns
- The cleaned dataset has 72,732 rows and 76 columns
- The 76 columns in the cleaned dataset are of the types:
  - `float` - 5
  - `category` - 33
  - `object` - 38
- The cleaned dataset utilizes nearly 2 times less memory compared to the raw dataset

## Feature Engineering (for the nested columns)

### Convenience Functions

In [320]:
def get_top_categories(data, name, n=5):
    """
    This function will return the top categories of the specified column
    from the dataset

    Parameters:
    -----------

    data: pd.DataFrame
          The dataframe to work with

    name: str
          The column name whose top categories to return

    n: int
       No. of top categories to return
    """
    
    result_set = dict()
    for i, entry in data[name].items():
        if isinstance(entry, str):
            for value in entry.split(";"):
                result_set[value] = result_set.get(value, 0) + 1
    
    return sorted(result_set,
                  key=lambda val: result_set[val],
                  reverse=True)[:n]

In [321]:
def clean_nested_column(data,
                        name,
                        replace_values=False,
                        to_replace=None,
                        replace_with=None,
                        default=None,
                        sum_values=False,
                        suffix=None,
                        terms=[],
                        add_indicators=False,
                        n=None):
    """
    Description:
    ------------
    
    This function will perform cleaning operations based on the user input
    on the  nested columns and return the newly transformed variables concatenated 
    with the provided dataset
    
    
    Operations performed:
    ---------------------
    
    1. Replace nested entries with values provided, according to a mapping
    2. Add the no. of values present in the nested entries
    3. Create binary indicator features for values in the nested entries
    
    
    Parameters:
    -----------
    
    data: pd.DataFrame
          The dataframe to work with

    name: str
          The column name whose top categories to return
          
    replace_values: bool
                    Whether the nested entries should be replaced with valid values
    
    to_replace: list
                A list of values to replace in the nested entries
                Used only if 'replace_values' is True
    
    replace_with: list
                  A list of valid values to replace the nested entries
                  Used only if 'replace_values' is True
    
    default: str
             Default value to replace the nested entries
             Used only if 'replace_values' is True
             
    sum_values: bool
                Whether to add the no. of values in the nested entries
    
    suffix: str
            To append to column new column name
            Used only if 'sum_values' is True
            
    terms: list
           Terms to avoid while calculating the number of values in the nested entries
    
    add_indicators: bool
                    Whether to create binary indicators for values in the nested entries
    
    n: int
       The no. of values in the nested entries to create the binary indicators for
       Used only if 'add_indicators' is True             
    """
    
    temp = pd.DataFrame()
    
    # replacing nested entries with valid values; flattening the nested feature
    if replace_values:
        temp[f"{name}_flattened"] = np.select([data[name].str.contains(value)
                                               for value in to_replace], 
                                              replace_with,
                                              default)
        temp[f"{name}_flattened"] = temp[f"{name}_flattened"].astype("category")
    
    # for accurate implementation of the following 2 operations
    data = data.assign(**{f"{name}": data[name].replace("unanswered", np.nan)})
    
    # summing the no. of values in the nested entries
    if sum_values:
        avoid_terms = ["none", "unanswered"]
        avoid_terms.extend(terms)
        result = np.empty(data.shape[0])
        for i, entry in data[name].items():
            if isinstance(entry, str):
                total = 0
                values = set(entry.split(";"))
                for value in values:               
                    if value in avoid_terms:
                        total += 0
                    else:
                        total += 1
                result[i] = total
            else:
                result[i] = np.nan
        temp[f"{name}_num_{suffix}"] = result
    
    # adding indicators for values in the nested entries
    if add_indicators:
        data = (data)
        top_categories = get_top_categories(data, name, n)
        for category in top_categories:
            temp[f"{name}_{category}"] = (data
                                          .loc[:, name]
                                          .str.contains(category)
                                          .astype("float"))
            
    return pd.concat([data, temp], axis=1)

- The above functions can be used for extracting various new features from the nested columns present in the dataset
- The function `clean_nested_column` performs 3 feature engineering operations:
  1. Flattening the nested column
  2. Creating a new column by calculating the total no. of values in the nested entries
  3. Creating new binary features for the most frequent values in the nested entries
  - All these operations can be controlled by specifying the corresponding parameters
- This step should actually be done during EDA
- By varying the specific parameters of this function, the usefulness of different features will be analyzed
  - This is an iterative step

### Example

In [322]:
clean_nested_column(data=df_cleaned,
                    name="lang_worked_with",
                    replace_values=True,
                    to_replace=["JavaScript",
                                "HTML/CSS",
                                "Python",
                                "SQL",
                                "C++"],
                    replace_with=["java_script",
                                  "html_css",
                                  "python",
                                  "sql",
                                  "c_plus_plus"],
                    default="other",
                    sum_values=True,
                    suffix="used",
                    terms=[],
                    add_indicators=True,
                    n=4) \
.head(50)

Unnamed: 0,coding_proficiency,employment,work_type,coding_activity,education_level,learnt_coding,learnt_coding_online,learnt_coding_courses,coding_years,coding_pro_years,profession,org_size,purchase_influence_level,learn_new_tool,country,currency,comp_total,comp_freq,lang_worked_with,lang_want_work_with,db_worked_with,db_want_work_with,platform_worked_with,platform_want_work_with,web_frame_worked_with,web_frame_want_work_with,misc_tech_worked_with,misc_tech_want_work_with,tools_tech_worked_with,tools_tech_want_work_with,new_collab_tools_worked_with,new_collab_tools_want_work_with,op_sys_pro_use,op_sys_personal_use,ver_control_sys,vc_interaction,office_stack_async_worked_with,office_stack_async_want_work_with,office_stack_sync_worked_with,office_stack_sync_want_work_with,blockchain,new_sites_visited,sites_visit_freq,have_account,participate,consider_self_member,age_group,gender,trans,sexuality,ethnicity,accessibility,mental_health,participate_dev_series,ind_cont_ppl_manager,work_exp,interact_ppl_out_team,info_not_shared_team,can_find_info_org,useful_resources,know_resources,ans_questions_repeated,interrupted_waiting,get_help_out_team,interact_out_team,meet_silos_work,hours_spent_searching,hours_spent_answering,onboarding_time,company_tech,support_new_emp,use_resources,given_time_learning,survey_length,survey_difficulty,conv_yearly_comp,lang_worked_with_flattened,lang_worked_with_num_used,lang_worked_with_JavaScript,lang_worked_with_HTML/CSS,lang_worked_with_SQL,lang_worked_with_Python
0,other,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,,other,,,,,
1,developer,employed_full_time,remote,hobby;open_source_contribution,unanswered,unanswered,unanswered,unanswered,,,unanswered,unanswered,unanswered,unanswered,Canada,CAD\tCanadian dollar,,unanswered,JavaScript;TypeScript,Rust;TypeScript,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,mac_os,wsl,git,unanswered,unanswered,unanswered,unanswered,unanswered,very unfavorable,collectives;stack_overflow_teams;stack_overflo...,daily,yes,daily,unsure,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,no,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,too_long,difficult,,java_script,2.0,1.0,0.0,0.0,0.0
2,work_partly,employed_full_time,hybrid,hobby,masters,books;friends_family;online_resources;academia,documentation;blogs;programming_games;tutorial...,unanswered,14.0,5.0,data_scientist;front_end_dev;data_engineer;sit...,small,some,unanswered,UK,GBP\tPound sterling,32000.0,yearly,C#;C++;HTML/CSS;JavaScript;Python,C#;C++;HTML/CSS;JavaScript;TypeScript,Microsoft_SQL_Server,Microsoft_SQL_Server,unanswered,unanswered,angular_js,angular;angular_js,pandas,dot_net,unanswered,unanswered,notepad++;visual_studio,notepad++;visual_studio,windows,windows,git,code_editor,unanswered,unanswered,microsoft_teams,microsoft_teams,very unfavorable,collectives;stack_overflow;stack_exchange,daily,yes,daily,neutral,young_adult,male,no,bisexual,white,none_of_these,emotional_disorder;anxiety_disorder,no,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,appropriate,neutral,40205.0,java_script,5.0,1.0,1.0,0.0,1.0
3,developer,employed_full_time,remote,only_work,bachelors,books;academia,unanswered,unanswered,20.0,17.0,full_stack_dev,small,some,other,Israel,ILS\tIsraeli new shekel,60000.0,monthly,C#;JavaScript;SQL;TypeScript,C#;SQL;TypeScript,Microsoft_SQL_Server,Microsoft_SQL_Server,unanswered,unanswered,asp_net;asp_net_core,asp_net;asp_net_core,dot_net,dot_net,unanswered,unanswered,notepad++;visual_studio;visual_studio_code,notepad++;visual_studio;visual_studio_code,windows,windows,git,code_editor;command_line;vc_hosting_service_we...,jira_work_management;trello,jira_work_management;trello,slack;zoom,slack;zoom,very unfavorable,collectives;stack_overflow_teams;stack_overflo...,daily,yes,weekly,yes,middle_aged,male,no,heterosexual,white,none_of_these,none_of_these,no,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,appropriate,easy,215232.0,java_script,4.0,1.0,0.0,1.0,0.0
4,developer,employed_full_time,hybrid,hobby,bachelors,online_resources;academia;job_training,documentation;blogs;stack_overflow;books;cours...,unanswered,8.0,3.0,front_end_dev;full_stack_dev;back_end_dev;ente...,small,some,free_trial;stack_overflow,USA,USD\tUnited States dollar,,unanswered,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript,C#;Elixir;F#;Go;JavaScript;Rust;TypeScript,Cloud_Firestore;Elasticsearch;Microsoft_SQL_Se...,Cloud_Firestore;Elasticsearch;Firebase_Realtim...,firebase;microsoft_azure,firebase;microsoft_azure,angular;asp_net;asp_net_core_;jquery;node_js,angular;asp_net_core_;blazor;node_js,dot_net,dot_net;apache_kafka,npm,docker;kubernetes,notepad++;visual_studio;visual_studio_code;xcode,rider;visual_studio;visual_studio_code,windows,mac_os;windows,git;other,code_editor,unanswered,unanswered,microsoft_teams;zoom,unanswered,unfavorable,collectives;stack_overflow_teams;stack_overflo...,daily,yes,daily,yes,young_adult,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,no,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,too_long,easy,,java_script,6.0,1.0,1.0,1.0,0.0
5,work_partly,student_full_time,unanswered,unanswered,masters,books;academia,unanswered,unanswered,15.0,,unanswered,unanswered,unanswered,other,Germany,unanswered,,unanswered,C++;Lua,Lua,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,homebrew,homebrew,visual_studio_code;xcode,visual_studio_code,linux;mac_os,mac_os,git,command_line;vc_hosting_service_web_gui;ded_vc...,confluence,unanswered,rocket_chat;slack;zoom,rocket_chat;slack;zoom,very unfavorable,stack_overflow;stack_exchange,daily,yes,daily,yes,young_adult,other,yes,other,other,other,other,unanswered,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,appropriate,easy,,c_plus_plus,2.0,0.0,0.0,0.0,0.0
6,hobby,student_part_time,unanswered,unanswered,secondary_school,online_resources,stack_overflow;courses,unanswered,3.0,,unanswered,unanswered,unanswered,free_trial;stack_overflow,India,unanswered,,unanswered,C++;HTML/CSS;JavaScript;PHP;Python;TypeScript,C;C#;C++;Elixir;Go;HTML/CSS;Java;JavaScript;Ko...,Cloud_Firestore;MongoDB;Firebase_Realtime_Data...,MySQL;Oracle;PostgreSQL,unanswered,unanswered,angular;next_js;node_js;react_js;svelte;vue_js,django;flask;gatsby;jquery;next_js;node_js;rea...,unanswered,unanswered,homebrew;npm,npm,atom;intellij;notepad++;pycharm;sublime_text;v...,visual_studio_code;webstorm,mac_os,mac_os,git,code_editor;command_line,unanswered,unanswered,google_chat;microsoft_teams;slack;zoom,google_chat;slack;zoom,favorable,stack_overflow,daily,yes,daily,yes,minor,male,no,unanswered,indian,none_of_these,none_of_these,unanswered,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,appropriate,easy,,java_script,6.0,1.0,1.0,0.0,1.0
7,developer,unemployed,unanswered,unanswered,college,online_courses,unanswered,coursera;udemy,1.0,,full_stack_dev;student,unanswered,unanswered,free_trial,India,unanswered,,unanswered,C;C++;HTML/CSS;Java;JavaScript;SQL,APL;Bash/Shell;Go;Python;TypeScript,MongoDB;MySQL,Neo4j;PostgreSQL,aws;google_cloud;heroku,digital_ocean;firebase;microsoft_azure;vmware,jquery;node_js,angular;angular_js;next_js;vue_js,unanswered,unanswered,npm,unity_3d;yarn,atom;clion;eclipse;intellij;notepad++;visual_s...,android_studio;jupyter;sublime_text;vim;visual...,linux;mac_os,windows,git,command_line,unanswered,unanswered,google_chat;microsoft_teams;zoom,unanswered,very favorable,collectives;stack_overflow;stack_exchange,weekly,yes,never,yes,young_adult,male,no,heterosexual,indian,none_of_these,none_of_these,unanswered,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,appropriate,easy,,java_script,6.0,1.0,1.0,1.0,0.0
8,developer,employed_full_time,hybrid,only_work,masters,job_training;Coding Bootcamp,unanswered,unanswered,6.0,6.0,back_end_dev,unknown,little,unanswered,Netherlands,EUR European Euro,46000.0,yearly,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,emacs;notepad++,emacs;notepad++,windows,windows,git,command_line;ded_vc_gui_app,confluence;jira_work_management,confluence;jira_work_management,microsoft_teams,microsoft_teams,very unfavorable,stack_overflow_teams;stack_overflow;stack_exch...,weekly,yes,monthly,no,young_adult,female,no,other,european,other,other,yes,independent_contributor,6.0,agree,disagree,agree,agree,agree,agree,disagree,mildly,mildly,never,half,above_two,long,innersource_initiative;devops;microservices;de...,yes,yes,yes,appropriate,easy,49056.0,other,,,,,
9,developer,freelancer,remote,hobby;open_source_contribution;startup,college,books;online_resources;academia,documentation;blogs;tutorials;stack_overflow;b...,unanswered,37.0,30.0,enterprise_app_dev;mobile_dev;educator,freelancer,great,free_trial;ask_known_devs;research_companies,Croatia,HRK\tCroatian kuna,,unanswered,Delphi;Java;Swift,Delphi;Java;Swift,unanswered,unanswered,digital_ocean;firebase,digital_ocean;firebase,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,android_studio;rad_studio;visual_studio_code;x...,android_studio;rad_studio;visual_studio_code;x...,windows,windows,git,vc_hosting_service_web_gui;ded_vc_gui_app,unanswered,unanswered,google_chat;slack,google_chat;slack,very unfavorable,collectives;stack_overflow;stack_exchange,daily,yes,daily,yes,middle_aged,female,no,heterosexual,white;european,none_of_these,none_of_these,unanswered,unanswered,,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,unanswered,appropriate,easy,,other,3.0,0.0,0.0,0.0,0.0


- This example demonstrates engineering new features for the column `lang_worked_with`
- In the resulting dataset, the last 6 rows depict the newly engineered features
- This is a very convenient function and provides various options to create new features from any of the nested columns present in the dataset