# stackoverflow_survey_analysis

This Jupyter notebook presents an analysis of the Stack Overflow surveys from 2020 and 2024.   
The analysis follows the [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) process.

The [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) process consists of the following steps:    
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Data Modeling
5. Result Evaluation
6. Deployment

## 1. Business Understanding
The main objective of this analysis is to get used to the data science process according to [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) taught during the [Udacity - Data Scientist](https://www.udacity.com/enrollment/nd025) Course and to apply the acquired knowledge to a real-world problem.

To achieve this, the analysis will use the [Stack Overflow surveys from 2020 and 2024](https://survey.stackoverflow.co/) to address the following three questions:

**When comparing the results from 2020 to 2024 ...**  

- **Question 1: ... did the difficulty of the survey increase over the past four years?**
  - *Relevance*: Understanding changes in survey difficulty over time can help to ensure that any observed trends or patterns in the results are not due to changes in survey complexity.

- **Question 2: ... did the past four years change the general job satisfaction?**   
  - *Relevance*: Analyzing shifts in job satisfaction over time can provide insights into the impact of external factors such as economic conditions or general changes in working conditions.

- **Question 3: ... did the past four years change job compensation?**   
  - *Relevance*: Examining changes in job compensation over the past four years helps to see trends in salary growth, inflation adjustments, and how economic factors or industry shifts may have influenced compensation practices.




## 2. Data understanding

This chapter describes the steps taken to understand the data provided by the Stack Overflow surveys from 2020 and 2024.

### 2.1 Import packages and load dataframes

First, we import the packages needed for the analysis:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Then we load the survey results and their schema files into our workspace:

In [3]:
df_20 = pd.read_csv('./data/stack-overflow-developer-survey-2020/survey_results_public.csv')
df_20_schema = pd.read_csv('./data/stack-overflow-developer-survey-2020/survey_results_schema.csv')
df_24 = pd.read_csv('./data/stack-overflow-developer-survey-2024/survey_results_public.csv')
df_24_schema = pd.read_csv('./data/stack-overflow-developer-survey-2024/survey_results_schema.csv')

Now, let's take a look at the first rows of the 2020 and 2024 surveys:

In [4]:
df_20.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
0,1,I am a developer by profession,Yes,,13,Monthly,,,Germany,European Euro,...,Neither easy nor difficult,Appropriate in length,No,"Computer science, computer engineering, or sof...",ASP.NET Core,ASP.NET;ASP.NET Core,Just as welcome now as I felt last year,50.0,36,27.0
1,2,I am a developer by profession,No,,19,,,,United Kingdom,Pound sterling,...,,,,"Computer science, computer engineering, or sof...",,,Somewhat more welcome now than last year,,7,4.0
2,3,I code primarily as a hobby,Yes,,15,,,,Russian Federation,,...,Neither easy nor difficult,Appropriate in length,,,,,Somewhat more welcome now than last year,,4,
3,4,I am a developer by profession,Yes,25.0,18,,,,Albania,Albanian lek,...,,,No,"Computer science, computer engineering, or sof...",,,Somewhat less welcome now than last year,40.0,7,4.0
4,5,"I used to be a developer by profession, but no...",Yes,31.0,16,,,,United States,,...,Easy,Too short,No,"Computer science, computer engineering, or sof...",Django;Ruby on Rails,Ruby on Rails,Just as welcome now as I felt last year,,15,8.0


In [5]:
df_24.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,


### 2.2 Find common columns relevant for answering the questions

Let's find the columns that are available in both dataframes and that can be used to answer the three questions:    
**When comparing the results from 2020 to 2024 ...**  
- **Question 1: ... did the difficulty of the survey increase over the past four years?**
- **Question 2: ... did the past four years change the general job satisfaction?**   
- **Question 3: ... did the past four years change job compensation?**

In [6]:
common_columns = df_24.columns.intersection(df_20.columns) # find the common columns in both dataframes
common_columns_list = common_columns.tolist()
common_columns_list

['MainBranch',
 'Age',
 'Employment',
 'EdLevel',
 'YearsCode',
 'YearsCodePro',
 'DevType',
 'OrgSize',
 'Country',
 'CompTotal',
 'NEWSOSites',
 'SOVisitFreq',
 'SOAccount',
 'SOPartFreq',
 'SOComm',
 'SurveyLength',
 'SurveyEase',
 'JobSat']
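
The column comparison above can be tried out on small stand-in frames. The following is a minimal sketch, assuming two empty toy dataframes (`toy_20` and `toy_24` are hypothetical names, with a handful of column names borrowed from the surveys). It also uses `Index.difference` to list the columns that exist in only one survey, which is not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins for the two surveys (hypothetical, only a few columns each)
toy_20 = pd.DataFrame(columns=["Respondent", "MainBranch", "Hobbyist", "JobSat"])
toy_24 = pd.DataFrame(columns=["ResponseId", "MainBranch", "RemoteWork", "JobSat"])

# Index.intersection keeps only the labels present in both column indexes
shared = toy_24.columns.intersection(toy_20.columns).tolist()

# Index.difference lists the labels that exist on one side only
only_20 = toy_20.columns.difference(toy_24.columns).tolist()
only_24 = toy_24.columns.difference(toy_20.columns).tolist()

print("shared:", shared)
print("only in 2020:", only_20)
print("only in 2024:", only_24)
```

Listing the one-sided columns can help to spot questions that were renamed between the surveys, for example `Respondent` and `ResponseId`.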

The description of each column can be found in the schema file:

In [15]:
pd.set_option('display.max_colwidth', None) # do not limit output
for loc_column in common_columns_list:
    # Print the question text from df_20_schema
    print(f"df_20_schema: Question text for '{loc_column}'")
    print(df_20_schema[df_20_schema['Column'] == loc_column]['QuestionText'].to_string(index=False))

    # Print the question text from df_24_schema
    print(f"df_24_schema: Question text for '{loc_column}'")
    print(df_24_schema[df_24_schema['qname'] == loc_column]['question'].to_string(index=False))
    
    print("\n" + "-"*50 + "\n")  # Separator between different columns for better readability
pd.reset_option('display.max_colwidth')

df_20_schema: Question text for 'MainBranch'
Which of the following options best describes you today? Here, by "developer" we mean "someone who writes code."
df_24_schema: Question text for 'MainBranch'
Which of the following options best describes you today? For the purpose of this survey, a developer is "someone who writes code".*

--------------------------------------------------

df_20_schema: Question text for 'Age'
What is your age (in years)? If you prefer not to answer, you may leave this question blank.
df_24_schema: Question text for 'Age'
What is your age?*

--------------------------------------------------

df_20_schema: Question text for 'Employment'
Which of the following best describes your current employment status?
df_24_schema: Question text for 'Employment'
Which of the following best describes your current employment status? Select all that apply.*

--------------------------------------------------

df_20_schema: Question text for 'EdLevel'
Which of the following
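
Because the two schema files use different column names (`Column`/`QuestionText` in 2020, `qname`/`question` in 2024), the lookup can be wrapped in a small helper. This is only a sketch: `question_text` and the toy schema frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def question_text(schema, column, name_col, text_col):
    """Return the question text for `column`, or None if the schema has no entry."""
    match = schema.loc[schema[name_col] == column, text_col]
    return match.iloc[0] if not match.empty else None

# Toy schemas that mirror the column layout of the two schema files
toy_schema_20 = pd.DataFrame({"Column": ["Age"],
                              "QuestionText": ["What is your age (in years)?"]})
toy_schema_24 = pd.DataFrame({"qname": ["Age"],
                              "question": ["What is your age?*"]})

print(question_text(toy_schema_20, "Age", "Column", "QuestionText"))
print(question_text(toy_schema_24, "Age", "qname", "question"))
# A column without a schema entry yields None instead of an empty printout
print(question_text(toy_schema_24, "Hobbyist", "qname", "question"))
```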

Given the three questions, the following columns, which are present in both dataframes, seem suitable for the analysis:

In [19]:
common_columns_analysis = ['SurveyEase', 'JobSat','CompTotal']

Let's have a short look at the content of these columns:

In [23]:
df_20[common_columns_analysis] # survey from 2020

Unnamed: 0,SurveyEase,JobSat,CompTotal
0,Neither easy nor difficult,Slightly satisfied,
1,,Very dissatisfied,
2,Neither easy nor difficult,,
3,,Slightly dissatisfied,
4,Easy,,
...,...,...,...
64456,,,
64457,,,
64458,,,
64459,,,


In [22]:
df_24[common_columns_analysis]  # survey from 2024

Unnamed: 0,SurveyEase,JobSat,CompTotal
0,,,
1,,,
2,Easy,,
3,Easy,,
4,Easy,,
...,...,...,...
65432,,,
65433,,,
65434,,,
65435,,,


Since there seem to be a lot of NaNs, let's count them for the selected columns.

In [39]:
df_20_nan_count_per_row_perc = (df_20[common_columns_analysis].isna().sum() / df_20[common_columns_analysis].shape[0])*100
df_24_nan_count_per_row_perc = (df_24[common_columns_analysis].isna().sum() / df_24[common_columns_analysis].shape[0])*100
print(f"2020 survey: NaN values in %:\n{df_20_nan_count_per_row_perc.round(1)}\n") # print 2020's Percentage of NaN's
print(f"2024 survey: NaN values in %:\n{df_24_nan_count_per_row_perc.round(1)}\n") # print 2024's Percentage of NaN's

2020 survey: NaN values in %:
SurveyEase    19.6
JobSat        29.9
CompTotal     46.0
dtype: float64

2024 survey: NaN values in %:
SurveyEase    14.1
JobSat        55.5
CompTotal     48.4
dtype: float64
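
The same percentage can be obtained more compactly: the mean of a boolean mask is the share of `True` values, so `isna().mean()` gives the fraction of missing entries per column. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Toy data: 1 of 4 values is missing in SurveyEase, 3 of 4 in JobSat
toy = pd.DataFrame({
    "SurveyEase": ["Easy", np.nan, "Easy", "Difficult"],
    "JobSat": [np.nan, np.nan, "Very satisfied", np.nan],
})

# isna() gives a boolean mask; its column-wise mean is the NaN fraction
nan_share = toy.isna().mean() * 100
print(nan_share.round(1))
```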



Furthermore, the answer options differed between 2020 and 2024.
- This can be seen in the attached *.pdf files that describe each survey (so_survey_2020.pdf and 2024 Developer Survey.pdf).
    - E.g. for the column 'CompTotal': in 2020 it was possible to enter the value as weekly, monthly OR yearly (defined in column ["CompFreq"]). In 2024 only a yearly value could be entered.

In [52]:
pd.set_option('display.max_colwidth', None) # do not limit output
# Get unique values for each column
df_20_unique_values = df_20[common_columns_analysis].apply(lambda x: set(x.unique()))
df_24_unique_values = df_24[common_columns_analysis].apply(lambda x: set(x.unique()))
# Display the result
print(f"unique values of 2020 survey: '{df_20_unique_values}'")
print("-------------------------------")
print(f"unique values of 2024 survey: '{df_24_unique_values}'")
pd.reset_option('display.max_colwidth')

unique values of 2020 survey: 'SurveyEase    {Difficult, nan, Neither easy nor difficult, Easy}
JobSat                                                                                                                                                                                 
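
To see directly which answer options differ between the two years, the unique values of a column can be compared as sets. The following is a minimal sketch on hypothetical toy series; the label "Hard" is invented for illustration and does not come from either survey.

```python
import numpy as np
import pandas as pd

def answer_options(series):
    """Return the set of distinct non-missing answers in a column."""
    return set(series.dropna().unique())

# Toy columns standing in for the same question in both surveys
toy_20 = pd.Series(["Easy", "Difficult", "Neither easy nor difficult", np.nan])
toy_24 = pd.Series(["Easy", "Hard", np.nan])  # "Hard" is a made-up label

opts_20 = answer_options(toy_20)
opts_24 = answer_options(toy_24)

print("only in 2020:", sorted(opts_20 - opts_24))
print("only in 2024:", sorted(opts_24 - opts_20))
print("in both:", sorted(opts_20 & opts_24))
```

Dropping NaN before building the sets keeps missing answers from showing up as a difference between the years.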

### 2.3 Summary of Findings during Data understanding

During the Data Understanding step, the following points stood out:

- There are over 60,000 respondents within each dataset (2020: 64,461, 2024: 65,437)
- The columns `['SurveyEase', 'JobSat','CompTotal']` seem promising to solve the questions listed in 1. Business Understanding
- There are a lot of NaN values in these columns (between 14.1 % and 55.5 % per column)
- In the 2020 survey the column 'CompTotal' could be answered either weekly, monthly or yearly. In 2024 it was only yearly.

This means that the data has to be prepared before it can be analyzed.

## 3. Data preparation

First, we add a yearly compensation column to the 2020 dataframe by converting weekly and monthly values to a yearly amount:

In [50]:
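# CompFreq values are capitalized in the data (see "Monthly" in the df_20.head() output);
# "Weekly" is assumed to follow the same capitalization
df_20["YearlyCompensation"] = np.where(df_20["CompFreq"] == "Weekly", df_20["CompTotal"] * 52,
                                       np.where(df_20["CompFreq"] == "Monthly", df_20["CompTotal"] * 12,
                                                df_20["CompTotal"]))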
df_20["YearlyCompensation"]

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
         ..
64456   NaN
64457   NaN
64458   NaN
64459   NaN
64460   NaN
Name: YearlyCompensation, Length: 64461, dtype: float64
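
The conversion to a yearly amount can also be written with a lookup table instead of nested `np.where` calls. The sketch below runs on a hypothetical toy frame and assumes that `CompFreq` takes the values "Weekly", "Monthly" and "Yearly" (only "Monthly" is visible in the output shown earlier). Note one difference in behavior: rows without a `CompFreq` entry become NaN here instead of keeping their `CompTotal` value.

```python
import numpy as np
import pandas as pd

# Assumed CompFreq categories and the number of pay periods per year
PERIODS_PER_YEAR = {"Weekly": 52, "Monthly": 12, "Yearly": 1}

# Toy data standing in for the 2020 survey
toy = pd.DataFrame({
    "CompFreq": ["Weekly", "Monthly", "Yearly", np.nan],
    "CompTotal": [1000.0, 4000.0, 50000.0, 30000.0],
})

# map() turns each frequency into its multiplier; unknown or missing
# frequencies map to NaN, so the product is NaN for those rows
toy["YearlyCompensation"] = toy["CompTotal"] * toy["CompFreq"].map(PERIODS_PER_YEAR)
print(toy)
```

Whether a missing frequency should be treated as yearly or as unknown is a modeling decision; the lookup table makes that choice explicit.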