# Tech Job Market and Salaries Analysis 

For our final project, we selected the Stack Overflow Developer Survey dataset, 
which contains detailed responses from developers worldwide about their job roles, 
skills, technologies used, and salaries. The dataset is particularly relevant to the 
tech industry, which is a major focus of our group, and offers insight into the tech 
job market. It covers topics such as job roles, salary, coding activities, education, 
technology usage, and job satisfaction.<br>

Team Eyy<br>
Members:  
- Julianne Kristine D. Aban 
- Derich Andre G. Arcilla 
- Jennifer Bendoy 
- Richelle Ann C. Candidato 
- Marc Francis B. Gomolon 
- Phoebe Kae A. Plasus

##### Data Preparation

LOADING DATA SET & LIBRARIES

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
# df = pd.read_csv('survey_results_filtered.csv')
df = pd.read_csv('survey_results_public.csv')
df.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,


In [2]:
# Expand display settings to show all columns
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 200)     # Adjust rows if needed


In [3]:
# Display column information: name, number of missing values, and dtype
column_info = pd.DataFrame({
    'Column Name': df.columns,
    'Missing Values': df.isnull().sum(),
    'Data Type': df.dtypes
}).reset_index(drop=True)

# Print the column information
print(column_info)

                        Column Name  Missing Values Data Type
0                        ResponseId               0     int64
1                        MainBranch               0    object
2                               Age               0    object
3                        Employment               0    object
4                        RemoteWork           10631    object
5                             Check               0    object
6                  CodingActivities           10971    object
7                           EdLevel            4653    object
8                         LearnCode            4949    object
9                   LearnCodeOnline           16200    object
10                          TechDoc           24540    object
11                        YearsCode            5568    object
12                     YearsCodePro           13827    object
13                          DevType            5992    object
14                          OrgSize           17957    object
15      

In [4]:
# Calculate the percentage of missing values
missing_percentage = (df.isnull().sum() / len(df)) * 100

# Filter columns with more than 50% missing values
high_missing_cols = missing_percentage[missing_percentage > 50]
print("Columns with more than 50% missing values:")
print(high_missing_cols)


Columns with more than 50% missing values:
PlatformAdmired                  52.050063
EmbeddedHaveWorkedWith           66.052845
EmbeddedWantToWorkWith           73.103901
EmbeddedAdmired                  74.428840
MiscTechAdmired                  54.771765
AIToolInterested in Using        53.098400
AIToolNot interested in Using    62.690832
AINextMuch more integrated       79.464217
AINextNo change                  80.900714
AINextMore integrated            62.669438
AINextLess integrated            96.401119
AINextMuch less integrated       98.245641
ICorPM                           54.458487
WorkExp                          54.677018
Knowledge_1                      56.196036
Knowledge_2                      57.178660
Knowledge_3                      57.065575
Knowledge_4                      57.164907
Knowledge_5                      57.394135
Knowledge_6                      57.418586
Knowledge_7                      57.550010
Knowledge_8                      57.580574
Knowledge_9
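
The 50% cut-off used above can be checked on a small example. The sketch below uses a made-up three-column frame (the values are illustrative assumptions, not survey data) and `isnull().mean()`, which gives the same share as dividing `isnull().sum()` by the row count.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names echo the survey
toy = pd.DataFrame({
    "Age": ["18-24", "25-34", "35-44", "25-34"],
    "WorkExp": [np.nan, 5.0, np.nan, np.nan],
    "RemoteWork": ["Remote", None, "Hybrid", "Remote"],
})

# Share of missing values per column, expressed as a percentage
missing_pct = toy.isnull().mean() * 100

# Columns above the 50% threshold are candidates for dropping
to_drop = missing_pct[missing_pct > 50].index.tolist()
print(missing_pct)
print(to_drop)
```

Only `WorkExp` (3 of 4 values missing) crosses the threshold here; `RemoteWork` (1 of 4 missing) is kept.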

In [5]:
# Drop columns with more than 50% missing values
df_cleaned = df.drop(columns=high_missing_cols.index)
print(f"Dataset shape after dropping columns: {df_cleaned.shape}")

# Show the names of the remaining columns
remaining_columns = df_cleaned.columns
print(f"Remaining columns ({len(remaining_columns)}):")
print(remaining_columns)


Dataset shape after dropping columns: (65437, 70)
Remaining columns (70):
Index(['ResponseId', 'MainBranch', 'Age', 'Employment', 'RemoteWork', 'Check',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       'TechDoc', 'YearsCode', 'YearsCodePro', 'DevType', 'OrgSize',
       'PurchaseInfluence', 'BuyNewTool', 'BuildvsBuy', 'TechEndorse',
       'Country', 'Currency', 'CompTotal', 'LanguageHaveWorkedWith',
       'LanguageWantToWorkWith', 'LanguageAdmired', 'DatabaseHaveWorkedWith',
       'DatabaseWantToWorkWith', 'DatabaseAdmired', 'PlatformHaveWorkedWith',
       'PlatformWantToWorkWith', 'WebframeHaveWorkedWith',
       'WebframeWantToWorkWith', 'WebframeAdmired', 'MiscTechHaveWorkedWith',
       'MiscTechWantToWorkWith', 'ToolsTechHaveWorkedWith',
       'ToolsTechWantToWorkWith', 'ToolsTechAdmired',
       'NEWCollabToolsHaveWorkedWith', 'NEWCollabToolsWantToWorkWith',
       'NEWCollabToolsAdmired', 'OpSysPersonal use', 'OpSysProfessional use',
       'Offi

In [6]:
# Fill missing numerical values with median
numerical_cols = df_cleaned.select_dtypes(include=['float64', 'int64']).columns
df_cleaned[numerical_cols] = df_cleaned[numerical_cols].fillna(df_cleaned[numerical_cols].median())

# Fill missing categorical values with mode
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns
df_cleaned[categorical_cols] = df_cleaned[categorical_cols].fillna(df_cleaned[categorical_cols].mode().iloc[0])

# Check for missing values in numerical columns
print("Missing values in numerical columns:")
print(df_cleaned[numerical_cols].isnull().sum())

# Check for missing values in categorical columns
print("Missing values in categorical columns:")
print(df_cleaned[categorical_cols].isnull().sum())



Missing values in numerical columns:
ResponseId    0
CompTotal     0
dtype: int64
Missing values in categorical columns:
MainBranch                        0
Age                               0
Employment                        0
RemoteWork                        0
Check                             0
CodingActivities                  0
EdLevel                           0
LearnCode                         0
LearnCodeOnline                   0
TechDoc                           0
YearsCode                         0
YearsCodePro                      0
DevType                           0
OrgSize                           0
PurchaseInfluence                 0
BuyNewTool                        0
BuildvsBuy                        0
TechEndorse                       0
Country                           0
Currency                          0
LanguageHaveWorkedWith            0
LanguageWantToWorkWith            0
LanguageAdmired                   0
DatabaseHaveWorkedWith            0
DatabaseWantToW
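
Median imputation removes every missing value, but it also pulls the filled column toward its centre. A minimal sketch with invented `CompTotal` values (assumptions, not survey data) shows the median staying the same while the spread shrinks. The `was_missing` flag is a hypothetical helper column, not part of the notebook, for telling imputed rows apart later.

```python
import numpy as np
import pandas as pd

# Invented compensation values; half of them are missing
comp = pd.Series([40_000, 55_000, np.nan, np.nan, 120_000, np.nan], name="CompTotal")

observed = comp.dropna()             # only the values respondents reported
was_missing = comp.isnull()          # hypothetical flag marking imputed rows
imputed = comp.fillna(comp.median())  # same strategy as the cell above

print("median before/after:", observed.median(), imputed.median())
print("std before/after:   ", round(observed.std()), round(imputed.std()))
print("rows imputed:       ", int(was_missing.sum()))
```

When a column is analysed directly, as compensation is in a salary study, it may be safer to drop or flag rows with missing values than to treat the filled-in median as a reported salary.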

In [7]:
# Save the cleaned file
df_cleaned.to_csv('cleaned_survey_results.csv', index=False)


##### Apriori Algorithm
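
Apriori expects one boolean column per item, while the survey stores multi-select answers as a single semicolon-separated string. The sketch below shows, on three invented responses, how `Series.str.get_dummies(sep=';')` turns such strings into the binary matrix used in the next cell.

```python
import pandas as pd

# Three invented answers in the survey's semicolon-separated style
langs = pd.Series(["Python;SQL", "C#;SQL", "Python"], name="LanguageHaveWorkedWith")

# One column per distinct item, True where the respondent listed it
encoded = langs.str.get_dummies(sep=";").astype(bool)
print(encoded)
```

Because missing answers were filled with the column mode during data preparation, respondents who skipped a question are encoded as if they had chosen the most common answer. This is worth keeping in mind when reading the support values.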


In [None]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Load the cleaned dataset
df_cleaned = pd.read_csv('cleaned_survey_results.csv')

# Define the columns to process (update based on your data)
columns_to_encode = [
    'LanguageHaveWorkedWith',
    'DatabaseHaveWorkedWith',
    'WebframeHaveWorkedWith',
    'ToolsTechHaveWorkedWith',
    'DevType'
]

# Create a binary matrix
binary_df = pd.DataFrame()

for col in columns_to_encode:
    if col in df_cleaned.columns:
        # Split semicolon-separated values and create a binary matrix
        split_data = df_cleaned[col].str.get_dummies(sep=';')
        binary_df = pd.concat([binary_df, split_data], axis=1)

# Convert the binary matrix to bool type
binary_df_bool = binary_df.astype(bool)

# Apply the Apriori algorithm using the bool DataFrame
frequent_itemsets = apriori(binary_df_bool, min_support=0.05, use_colnames=True)

# Number of transactions (rows) in the encoded data; recent mlxtend releases
# take this as the 'num_itemsets' argument, not the count of frequent itemsets
num_itemsets = len(binary_df_bool)

# Generate association rules, including the 'num_itemsets' parameter
rules = association_rules(frequent_itemsets, num_itemsets=num_itemsets, metric="lift", min_threshold=1.0)

# Sort and display the top rules
rules = rules.sort_values(by='lift', ascending=False)
print("Top 10 association rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))


# This code identifies patterns in how developers combine programming languages, databases, frameworks, and tools.
# The Apriori algorithm finds item combinations that appear in at least 5% of responses, and the rules are ranked by lift.
# In the output below, the strongest associations link Microsoft-stack items such as C#, ASP.NET CORE, and Microsoft SQL Server.



Top 10 association rules:
                                  antecedents  \
93346         (SQL, ASP.NET CORE, JavaScript)   
93343    (C#, Microsoft SQL Server, HTML/CSS)   
39066                     (SQL, ASP.NET CORE)   
39071         (ASP.NET, Microsoft SQL Server)   
93339  (C#, Microsoft SQL Server, JavaScript)   
93350           (SQL, ASP.NET CORE, HTML/CSS)   
39069    (ASP.NET CORE, Microsoft SQL Server)   
39068                          (SQL, ASP.NET)   
49926                (ASP.NET CORE, HTML/CSS)   
49915                 (NuGet, C#, JavaScript)   

                                  consequents   support  confidence      lift  
93346    (C#, Microsoft SQL Server, HTML/CSS)  0.050384    0.670940  7.406260  
93343         (SQL, ASP.NET CORE, JavaScript)  0.050384    0.556174  7.406260  
39066         (ASP.NET, Microsoft SQL Server)  0.050659    0.545051  7.322213  
39071                     (SQL, ASP.NET CORE)  0.050659    0.680558  7.322213  
93339           (SQL, ASP.NET CORE
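
The `support`, `confidence`, and `lift` columns above follow the standard definitions for a rule A → B: support is the share of rows containing both A and B, confidence is support(A and B) / support(A), and lift is confidence / support(B). A small worked example on an invented five-row matrix (illustrative values only) makes the arithmetic concrete.

```python
import pandas as pd

# Invented five-respondent matrix, one boolean column per item
basket = pd.DataFrame({
    "C#":  [True, True, False, False, True],
    "SQL": [True, True, True, False, False],
})

support_a = basket["C#"].mean()                     # 3 of 5 rows
support_b = basket["SQL"].mean()                    # 3 of 5 rows
support_ab = (basket["C#"] & basket["SQL"]).mean()  # 2 of 5 rows

confidence = support_ab / support_a  # P(SQL | C#)
lift = confidence / support_b        # above 1 means positive association

print(round(support_ab, 3), round(confidence, 3), round(lift, 3))
```

The same identity can be checked against the first rule printed above: confidence / lift = 0.670940 / 7.406260 ≈ 0.0906, which matches the support of its consequent (0.050384 / 0.556174 ≈ 0.0906, taken from the reversed rule).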

In [None]:
# Columns to analyze
employment_columns = ['Employment', 'RemoteWork', 'OrgSize']
tech_columns = [
    'LanguageHaveWorkedWith', 'DatabaseHaveWorkedWith',
    'WebframeHaveWorkedWith', 'ToolsTechHaveWorkedWith'
]

# One-hot encode the employment columns as booleans
binary_employment = pd.get_dummies(df_cleaned[employment_columns], prefix=employment_columns).astype(bool)
binary_tech = pd.DataFrame()

for col in tech_columns:
    if col in df_cleaned.columns:
        split_data = df_cleaned[col].str.get_dummies(sep=';').astype(bool)
        binary_tech = pd.concat([binary_tech, split_data], axis=1)


# Combine employment and tech binary data
binary_data = pd.concat([binary_employment, binary_tech], axis=1)

# Apply Apriori algorithm
frequent_itemsets = apriori(binary_data, min_support=0.05, use_colnames=True)

# Generate association rules; 'num_itemsets' is the number of transactions (rows) in this cell's data
rules = association_rules(frequent_itemsets, num_itemsets=len(binary_data), metric="lift", min_threshold=1.0)

# Filter and sort the rules
rules = rules.sort_values(by='lift', ascending=False)
print("Top 10 association rules for Employment and Technology:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))


# This code explores relationships between employment factors (e.g., job type, remote work, organization size) and the technologies developers use.
# The aim is to see whether work arrangements are associated with particular tools; association rules show co-occurrence, not influence.
# In the output below, the top-ranked rules shown contain only technology items, so they do not yet say anything about employment.

Top 10 association rules for Employment and Technology:
                                   antecedents  \
125594         (SQL, ASP.NET CORE, JavaScript)   
125591    (C#, Microsoft SQL Server, HTML/CSS)   
50852                      (SQL, ASP.NET CORE)   
50857          (ASP.NET, Microsoft SQL Server)   
125587  (C#, Microsoft SQL Server, JavaScript)   
125598           (SQL, ASP.NET CORE, HTML/CSS)   
50855     (ASP.NET CORE, Microsoft SQL Server)   
50854                           (SQL, ASP.NET)   
77462                 (ASP.NET CORE, HTML/CSS)   
77451                  (NuGet, C#, JavaScript)   

                                   consequents   support  confidence      lift  
125594    (C#, Microsoft SQL Server, HTML/CSS)  0.050384    0.670940  7.406260  
125591         (SQL, ASP.NET CORE, JavaScript)  0.050384    0.556174  7.406260  
50852          (ASP.NET, Microsoft SQL Server)  0.050659    0.545051  7.322213  
50857                      (SQL, ASP.NET CORE)  0.050659    0.680558 
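
Since sorting by lift alone surfaces the same technology-only rules as before, one way to bring employment into view is to keep only rules that mention an employment item. The sketch below assumes the `Employment_`, `RemoteWork_`, and `OrgSize_` prefixes produced by `pd.get_dummies(..., prefix=employment_columns)` and runs on a hand-made rules table. The items and lift values in it are invented for illustration and are not results from the survey.

```python
import pandas as pd

# Hand-made stand-in for the 'rules' frame; all values are invented
rules_toy = pd.DataFrame({
    "antecedents": [
        frozenset({"SQL", "ASP.NET CORE"}),
        frozenset({"RemoteWork_Remote"}),
        frozenset({"C#"}),
    ],
    "consequents": [
        frozenset({"C#"}),
        frozenset({"Docker"}),
        frozenset({"OrgSize_20 to 99 employees"}),
    ],
    "lift": [7.3, 1.2, 1.1],
})

# Prefixes assumed from the one-hot encoding of the employment columns
employment_prefixes = ("Employment_", "RemoteWork_", "OrgSize_")

def mentions_employment(items):
    """Return True if any item in the itemset is an employment column."""
    return any(item.startswith(employment_prefixes) for item in items)

mask = (
    rules_toy["antecedents"].apply(mentions_employment)
    | rules_toy["consequents"].apply(mentions_employment)
)
employment_rules = rules_toy[mask].sort_values(by="lift", ascending=False)
print(employment_rules)
```

Applied to the real `rules` frame, the same mask would show whether any rule at the 5% support level involves employment at all, and with what lift.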