# Stack Overflow Survey Analysis

## Introduction

**Aside:** In this notebook, we analyze the **2024 Stack Overflow Survey** results. An initial investigation of the variables contributing to the main parameter of interest, **CompTotal** (the total compensation of developers), found no meaningful correlation with the numerical parameters. As such, we will perform a descriptive analysis and tell a data story rather than deploy an ML model to predict total compensation.
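
As a rough illustration of the kind of check described above, the sketch below ranks numeric columns by the absolute value of their correlation with a target column. It runs on a small toy dataframe whose values are invented for illustration; the column names other than `CompTotal` are placeholders, and this is not necessarily how the initial investigation was carried out.

```python
import pandas as pd

# Toy data: every value here is invented for illustration, NOT survey results.
toy = pd.DataFrame({
    "CompTotal": [50_000, 62_000, 58_000, 90_000, 71_000, 66_000],
    "YearsCodePro": [1, 4, 3, 12, 7, 5],         # placeholder numeric column
    "WorkExp": [2, 3, 9, 10, 4, 8],              # placeholder numeric column
    "Country": ["A", "B", "A", "C", "B", "C"],   # non-numeric, excluded below
})

# Keep numeric columns only, then rank them by absolute Pearson correlation
# with the target column.
numeric = toy.select_dtypes(include="number")
corr_with_target = (
    numeric.corr()["CompTotal"]
    .drop("CompTotal")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real dataset, uniformly small values in such a ranking would be consistent with the lack of meaningful correlation noted above.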


### Purpose

To demonstrate an end-to-end Data Science project that follows CRISP-DM where applicable, while answering the following questions about the survey dataset:
1. How happy are developers in 2024?
2. Do you enjoy coding more if you do it as a hobby?
3. Do younger developers enjoy coding more than older developers?
4. Does happiness as a developer have any impact (positive or negative) on your compensation?
5. What are some of the biggest predictors of satisfaction as a developer?

## 1: Business Understanding
A few of our initial questions are answered by the PDF document provided on the [dataset main page](https://survey.stackoverflow.co/) on Stack Overflow. From this PDF, we can see that the dataset contains questions for developers who use Stack Overflow, a technology platform designed for programming-inclined subject-matter experts (SMEs).

These questions range from various background inquiries to further recommendations and more. As part of the survey, users are questioned about their job satisfaction. We will explore some of the relationships with job satisfaction based on the questions asked above.

## 2: Data Understanding
To develop a more in-depth understanding of the data, we will load the CSV dataset (obtainable from the link referenced above) into a pandas DataFrame and begin our data exploration.

In [0]:
# Importing required libraries for our end-to-end Data Science project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline

# Loading data from the 2024 survey into a DataFrame, then running df.head() to verify the records have loaded successfully
df = pd.read_csv('./Data/survey_results_public.csv')
df.head()

Now that we have loaded our data successfully, we will begin our exploratory analysis by checking the dataset's dimensions and visualizing its numeric columns.

In [0]:
# Viewing complete number of rows and columns in the dataset
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

In [0]:
# Display a series of histograms, leveraging tight_layout to display readable outputs
df.hist(figsize=(10,9), ec="k")
plt.tight_layout()
plt.show()

The histograms above include no panels for JobSatPoints 2 or 3, which implies that the dataset has no columns for those two items. Let us verify this below:

In [0]:
# Verifying the lack of JobSatPoints 2 and 3 by listing every column whose name contains "JobSatPoints"
list(df.loc[:, df.columns.str.contains("JobSatPoints")].columns)
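
Reading the gaps off a printed list works, but the same check can be made explicit. The sketch below assumes the columns follow a `JobSatPoints_<number>` naming pattern (an assumption) and uses a hypothetical helper, `missing_suffixes`, on a toy column list rather than the real dataset.

```python
def missing_suffixes(columns, prefix="JobSatPoints_"):
    """Return the integers missing from the numbered columns sharing `prefix`.

    Hypothetical helper; the underscore-numbered naming is an assumption.
    """
    found = sorted(
        int(name[len(prefix):])
        for name in columns
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    )
    if not found:
        return []
    present = set(found)
    # Report every number between the smallest and largest suffix that is absent.
    return [n for n in range(found[0], found[-1] + 1) if n not in present]


# Toy column list for illustration only, not the real survey schema.
toy_columns = ["JobSat", "JobSatPoints_1", "JobSatPoints_4", "JobSatPoints_5"]
gaps = missing_suffixes(toy_columns)
print(gaps)
```

With the real dataframe, the call would be `missing_suffixes(df.columns)`.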

## 3: Data Preparation

In [0]:
# Checking the proportion of null values in each column, sorted from most to least missing
nulls = df.isnull().mean().sort_values(ascending=False)
nulls

Based on the above, we can see that several questions relating to the integration of AI have more than 75% null values. Because so few respondents answered them, we will drop these columns from our dataset.

In [0]:
# Keep only the columns whose null proportion is below 75%; in .loc, ":" selects every row and the boolean mask selects the columns
df = df.loc[:, df.isnull().mean() < 0.75]
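
To make the effect of this filter concrete, here is a minimal sketch on a toy dataframe. The column names and values are invented for illustration; only the 75% threshold comes from the cell above.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 4 of 5 null (80%)
    "partly_missing": [1.0, np.nan, 2.0, np.nan, 3.0],        # 2 of 5 null (40%)
    "complete": [1, 2, 3, 4, 5],                              # no nulls
})

# Proportion of null values per column.
null_share = toy.isnull().mean()

# Same pattern as above: keep every row, keep columns under the threshold.
kept = toy.loc[:, null_share < 0.75]

# Recording what was dropped makes the step easier to audit later.
dropped = list(null_share[null_share >= 0.75].index)
print("Kept:", list(kept.columns))
print("Dropped:", dropped)
```

Keeping the list of dropped columns is optional, but it documents exactly which questions were excluded from the analysis.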

Now that we have removed the columns that contain an unusable amount of null values, we will examine the null values for the Job Satisfaction parameters.

In [0]:
# Build a dataframe to store the null proportions of our columns, and then filter for those columns pertinent to Job Satisfaction
sat_nulls = pd.DataFrame({'Column': nulls.index, 'Null Proportion': nulls.values})
sat_nulls[sat_nulls["Column"].str.contains("JobSat")]

Based on the above result, we can see that only ~45% of survey respondents have provided us with information about their job satisfaction. To address this, we will build a separate dataframe containing all respondents with a null JobSat for further analysis. This operation is performed below.

In [0]:
# Handling Nulls, we will be separating out our analysis into two segments - those with job satisfaction information and those without
sat_nulls_df = df[df.JobSat.isnull()]
sat_df = df.dropna(subset=['JobSat'])
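
A quick sanity check for a split like this is that the two segments are disjoint and together account for every row. The sketch below shows the idea on a toy dataframe with invented values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "JobSat": [8.0, np.nan, 5.0, np.nan],
})

# Same split as above: rows without a JobSat value, and rows with one.
missing = toy[toy["JobSat"].isnull()]
answered = toy.dropna(subset=["JobSat"])

# The two segments should partition the original rows exactly.
assert len(missing) + len(answered) == len(toy)
assert set(missing.index).isdisjoint(answered.index)

# Share of rows that carry a JobSat value.
answered_share = len(answered) / len(toy)
print(f"Answered: {answered_share:.0%}")
```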

In [0]:
# Let us create a new dataframe representing only the job satisfaction information of the dataset for exploration purposes
job_satisfaction_df = df.loc[:, df.columns.str.contains("JobSat")]
job_satisfaction_df.sum().plot(kind = 'bar', figsize=(10,9), ec="k")

## Questions
### 1. How happy are developers in 2024?

In [0]:
list(df.columns)

In [0]:
# Mean job satisfaction for each remote-work arrangement
df2 = df[["JobSat", "RemoteWork"]].groupby("RemoteWork").mean()
ax = df2.plot(kind="bar", figsize=(5, 5), ec="k")
ax.set_ylim(0, 10)
# Label each bar with its mean value (Axes.bar_label requires matplotlib >= 3.4)
ax.bar_label(ax.containers[0], fmt="%.2f", label_type="edge")
ax.set_title("Job Satisfaction by Remote Work")
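
A group mean is easier to interpret alongside the number of responses behind it. As a hedged sketch, the grouping above can be extended with a count per category. The toy values and category labels below are invented for illustration and are not survey results.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; labels are simplified placeholders.
toy = pd.DataFrame({
    "RemoteWork": ["Remote", "Remote", "Hybrid", "Hybrid", "In-person", "In-person"],
    "JobSat": [8.0, 8.0, 6.0, 7.0, 5.0, np.nan],
})

# Mean satisfaction and number of non-null answers per work arrangement.
summary = toy.groupby("RemoteWork")["JobSat"].agg(["mean", "count"])
print(summary)
```

Because `count` ignores null values, it reports how many answers each mean is actually based on.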