## Job Satisfaction

This notebook answers the question:

Do developers who frequently visit Stack Overflow report higher job satisfaction?



- Import Libraries: Start by importing libraries you'll need for data manipulation, analysis, and visualization.
- Load Data (Optional): If the data is stored in the "data" folder, use appropriate functions to load it into the notebook environment.
- Data Understanding: Briefly describe the data, including column names and data types.
- Data Preparation: This will involve handling missing values, converting data types if needed, and cleaning any inconsistencies.
- Exploratory Data Analysis (EDA): Perform relevant Exploratory Data Analysis (EDA) techniques like descriptive statistics, visualizations (histograms, scatter plots) to understand the distribution of variables and relationships between them.
- Modeling (Optional): Depending on the question, you might choose to use statistical tests, correlations, or even simple machine learning models (e.g., linear regression). Clearly explain the chosen approach and why it's suitable.
- Visualization: Create clear and impactful visualizations to communicate your findings visually (bar charts, line graphs, heatmaps).
- Results and Conclusion: Summarize your key findings related to the specific business question being addressed in the notebook. Tie back these findings to the initial question and the data you analyzed.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime


df_schema = pd.read_csv('../data/raw/stack-overflow-developer-survey-2024/survey_results_schema.csv')
df_public = pd.read_csv('../data/raw/stack-overflow-developer-survey-2024/survey_results_public.csv')

df_schema.head()

Unnamed: 0,qid,qname,question,force_resp,type,selector
0,QID2,MainBranch,Which of the following options best describes ...,True,MC,SAVR
1,QID127,Age,What is your age?*,True,MC,SAVR
2,QID296,Employment,Which of the following best describes your cur...,True,MC,MAVR
3,QID308,RemoteWork,Which best describes your current work situation?,False,MC,SAVR
4,QID341,Check,Just checking to make sure you are paying atte...,True,MC,SAVR


In [2]:
df_public.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,


In [3]:
df_public.columns.values.tolist()

['ResponseId',
 'MainBranch',
 'Age',
 'Employment',
 'RemoteWork',
 'Check',
 'CodingActivities',
 'EdLevel',
 'LearnCode',
 'LearnCodeOnline',
 'TechDoc',
 'YearsCode',
 'YearsCodePro',
 'DevType',
 'OrgSize',
 'PurchaseInfluence',
 'BuyNewTool',
 'BuildvsBuy',
 'TechEndorse',
 'Country',
 'Currency',
 'CompTotal',
 'LanguageHaveWorkedWith',
 'LanguageWantToWorkWith',
 'LanguageAdmired',
 'DatabaseHaveWorkedWith',
 'DatabaseWantToWorkWith',
 'DatabaseAdmired',
 'PlatformHaveWorkedWith',
 'PlatformWantToWorkWith',
 'PlatformAdmired',
 'WebframeHaveWorkedWith',
 'WebframeWantToWorkWith',
 'WebframeAdmired',
 'EmbeddedHaveWorkedWith',
 'EmbeddedWantToWorkWith',
 'EmbeddedAdmired',
 'MiscTechHaveWorkedWith',
 'MiscTechWantToWorkWith',
 'MiscTechAdmired',
 'ToolsTechHaveWorkedWith',
 'ToolsTechWantToWorkWith',
 'ToolsTechAdmired',
 'NEWCollabToolsHaveWorkedWith',
 'NEWCollabToolsWantToWorkWith',
 'NEWCollabToolsAdmired',
 'OpSysPersonal use',
 'OpSysProfessional use',
 'OfficeStackAsyncHa

## Data Understanding

First, we need to decide which columns are required to answer the question. Job satisfaction is recorded in JobSat, so that column is our target variable. We also need to know how often respondents visit Stack Overflow, how often they take part in answering questions, and whether they have an account on the platform. So we need SOVisitFreq, SOPartFreq, and SOAccount. Then we can start the data preparation.
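
Before fixing the column selection, it can help to look at which answers each candidate column actually contains. The following is a minimal sketch on a small toy DataFrame rather than the real survey file; the helper name `answer_counts` and the toy values are assumptions made for illustration only.

```python
import pandas as pd


def answer_counts(df: pd.DataFrame, columns: list[str]) -> dict[str, pd.Series]:
    """Return the count of each distinct answer (NaN included) per column."""
    return {col: df[col].value_counts(dropna=False) for col in columns}


# Toy stand-in for df_public; the real survey has many more rows and columns.
toy = pd.DataFrame({
    "JobSat": [8.0, None, 5.0, 8.0],
    "SOVisitFreq": ["Multiple times per day", "Daily or almost daily",
                    None, "Multiple times per day"],
    "SOAccount": ["Yes", "No", "Yes", None],
})

counts = answer_counts(toy, ["JobSat", "SOVisitFreq", "SOAccount"])
for col, series in counts.items():
    print(f"--- {col} ---")
    print(series)
```

Passing `dropna=False` keeps missing answers visible in the counts, which is useful input for the data preparation step.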

In [4]:
# Build the working DataFrame from df_public, keeping only the columns used for this question
df_q1 = df_public[['MainBranch', 'CodingActivities', 'JobSat', 'SOVisitFreq', 'SOAccount', 'SOPartFreq']].copy()

df_q1.head(10)

Unnamed: 0,MainBranch,CodingActivities,JobSat,SOVisitFreq,SOAccount,SOPartFreq
0,I am a developer by profession,Hobby,,,,
1,I am a developer by profession,Hobby;Contribute to open-source projects;Other...,,Multiple times per day,Yes,Multiple times per day
2,I am a developer by profession,Hobby;Contribute to open-source projects;Other...,,Multiple times per day,Yes,Multiple times per day
3,I am learning to code,,,Daily or almost daily,No,
4,I am a developer by profession,,,Multiple times per day,Yes,Multiple times per day
5,I code primarily as a hobby,,,Multiple times per day,Yes,Multiple times per day
6,"I am not primarily a developer, but I write co...",I don’t code outside of work,,Daily or almost daily,Yes,Daily or almost daily
7,I am learning to code,,,Less than once per month or monthly,No,
8,I code primarily as a hobby,Hobby,,Multiple times per day,Yes,Multiple times per day
9,I am a developer by profession,Bootstrapping a business,,A few times per week,Yes,Less than once per month or monthly


## Data Preparation

Since JobSat is the target variable, we should drop all rows where it is NaN, because job satisfaction cannot be measured for those respondents. <br>
For the visit frequency (SOVisitFreq) and the participation frequency (SOPartFreq), we should consider replacing NaN values with "never", since every other answer seems to cover a specific timeframe of usage. <br>
Furthermore, if SOAccount is NaN, we can assume that this information is unknown and replace the NaN values with "unknown". <br>
Last but not least, there are a few NaN values in CodingActivities that we can drop without a large impact on the DataFrame.
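
To judge how many rows each of these choices affects, the missing values can be counted per column. This is a minimal sketch on toy data; the helper name `missing_summary` and the toy values are assumptions for illustration, not part of the survey analysis itself.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a share of all rows."""
    missing = df.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "share": (missing / len(df)).round(3),
    })


# Toy stand-in for df_q1 with a few missing answers.
toy = pd.DataFrame({
    "JobSat": [8.0, None, None, 6.0],
    "SOVisitFreq": ["A few times per week", None,
                    "Daily or almost daily", "Multiple times per day"],
    "SOAccount": ["Yes", "Yes", None, "No"],
})

summary = missing_summary(toy)
print(summary)
```

A column with a large share of missing values, such as the target variable here, is one where dropping rows shrinks the sample noticeably, so it is worth checking before and after the cleaning step.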

In [5]:
df_q1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65437 entries, 0 to 65436
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MainBranch        65437 non-null  object 
 1   CodingActivities  54466 non-null  object 
 2   JobSat            29126 non-null  float64
 3   SOVisitFreq       59536 non-null  object 
 4   SOAccount         59560 non-null  object 
 5   SOPartFreq        45237 non-null  object 
dtypes: float64(1), object(5)
memory usage: 3.0+ MB


In [6]:
# Drop rows where JobSat (the target variable) or CodingActivities is missing
df_q1 = df_q1.dropna(subset=["JobSat", "CodingActivities"])

# Fill missing SOVisitFreq and SOPartFreq with 'never'
df_q1["SOVisitFreq"] = df_q1["SOVisitFreq"].fillna("never")
df_q1["SOPartFreq"] = df_q1["SOPartFreq"].fillna("never")
# Fill missing SOAccount with 'unknown'
df_q1["SOAccount"] = df_q1["SOAccount"].fillna("unknown")

df_q1.head()

Unnamed: 0,MainBranch,CodingActivities,JobSat,SOVisitFreq,SOAccount,SOPartFreq
10,"I used to be a developer by profession, but no...",Hobby;Contribute to open-source projects,8.0,A few times per week,Yes,Less than once per month or monthly
12,I am a developer by profession,Hobby;Contribute to open-source projects;Profe...,8.0,Multiple times per day,Yes,A few times per week
15,I am a developer by profession,Hobby,5.0,A few times per month or weekly,Yes,A few times per month or weekly
18,I am a developer by profession,Hobby;Professional development or self-paced l...,10.0,Daily or almost daily,Yes,A few times per week
20,"I am not primarily a developer, but I write co...",Hobby;Professional development or self-paced l...,6.0,A few times per month or weekly,Yes,Less than once per month or monthly


In [7]:
df_q1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29118 entries, 10 to 65412
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MainBranch        29118 non-null  object 
 1   CodingActivities  29118 non-null  object 
 2   JobSat            29118 non-null  float64
 3   SOVisitFreq       29118 non-null  object 
 4   SOAccount         29118 non-null  object 
 5   SOPartFreq        29118 non-null  object 
dtypes: float64(1), object(5)
memory usage: 1.6+ MB


As we can see, there are no NaN values left. The next step is to encode the categorical variables as numbers so that machine learning models can be trained on them. We should use:

- `Ordinal Encoding` for SOVisitFreq and SOPartFreq: Since these columns imply a frequency order, we should assign numerical values that follow that order, from least to most frequent.
- 
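
One way to sketch the ordinal encoding is a plain dictionary mapping applied with `Series.map`. The ordering below is an assumption based on the answer labels visible in the outputs above plus the "never" fill value; the names `FREQ_ORDER`, `FREQ_TO_RANK`, and `encode_frequency` are hypothetical. Any label that is not in the mapping becomes a missing value, so the real columns should be checked for additional answer options first.

```python
import pandas as pd

# Assumed order from least to most frequent; "never" is the fill value for NaN.
FREQ_ORDER = [
    "never",
    "Less than once per month or monthly",
    "A few times per month or weekly",
    "A few times per week",
    "Daily or almost daily",
    "Multiple times per day",
]
FREQ_TO_RANK = {label: rank for rank, label in enumerate(FREQ_ORDER)}


def encode_frequency(series: pd.Series) -> pd.Series:
    """Map frequency labels to integer ranks; unmapped labels become <NA>."""
    return series.map(FREQ_TO_RANK).astype("Int64")


# Toy stand-in for the SOVisitFreq column.
toy = pd.DataFrame({
    "SOVisitFreq": ["never", "Multiple times per day", "A few times per week"],
})
toy["SOVisitFreq_enc"] = encode_frequency(toy["SOVisitFreq"])
print(toy)
```

The nullable `Int64` dtype keeps the ranks as integers even when some labels are unmapped, which makes leftover categories easy to spot with `isna()`.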