<a href="https://colab.research.google.com/github/Harish-lvrk/Data-Analysis-project/blob/main/EDA_StackOverflow_2025_surveydata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 EDA Case Study: Stack Overflow Annual Developer Survey 2020

## 👨‍💻 Author
**L Hareesh**  


---

## 📌 Project Overview

This project performs **Exploratory Data Analysis (EDA)** on the **Stack Overflow Annual Developer Survey 2020 dataset**.  
The dataset contains responses from **over 64,000 developers worldwide**, who shared details about their age, country, education, job, salary, programming languages, and preferences.

The main goal is to analyze and visualize the data to understand **global developer trends in 2020**.

---

## 📝 Understanding the Title

### 🔹 What is Stack Overflow?
A popular website where programmers ask and answer coding questions.

### 🔹 What is the Annual Developer Survey?
Every year, Stack Overflow collects responses from developers worldwide about their demographics, skills, and work.

### 🔹 What are Responses?
Each respondent (developer) provides answers. Example:

* Question: *What programming languages do you use?*  
* Response: *Python, JavaScript, SQL*

### 🔹 What does Analyzing Responses mean?
Studying those answers to find insights like:

* Most popular programming languages  
* Average salaries by country  
* Work preferences during COVID-19 (2020)

### 🔹 Where does the Data come from?
Published by **Stack Overflow**, freely available on their research page or Kaggle.

---

## 💡 What You Can Do with the Data

1. **Demographics** – Age, gender, countries of developers.  
2. **Programming Languages & Tools** – Most popular languages, databases, frameworks.  
3. **Job & Salary Analysis** – Salary by country, experience, education.  
4. **Learning & Education** – University vs self-taught vs bootcamp.  
5. **Work Preferences** – Remote work, job satisfaction, working hours.  
6. **Trends & Patterns** – Younger vs older developers, technology shifts.  
7. **Advanced Analysis** – Correlations, clustering, deeper insights.  

---

## 🚀 Roadmap for EDA

### **Step 1: Load the Data**
* Import libraries (`pandas`, `matplotlib`, `seaborn`)  
* Load dataset with `pd.read_csv()`  
* Check shape (rows, columns) and first few rows  

---

### **Step 2: Understand the Data**
* `df.info()` → Column names & datatypes  
* `df.describe()` → Summary stats  
* `df.isnull().sum()` → Missing values  

---

### **Step 3: Clean the Data**
* Remove irrelevant columns (IDs, metadata)  
* Handle missing values (drop/fill)  
* Rename columns (e.g., `YearsCodePro → Years_Professional`)  
* Convert text into numeric values where needed  

---

### **Step 4: Demographics**
* Age distribution (histogram)  
* Top countries by respondents  
* Gender distribution  

---

### **Step 5: Programming Languages & Tools**
* Most popular programming languages (bar chart)  
* Databases, frameworks, cloud platforms analysis  

---

### **Step 6: Job & Salary**
* Salary distribution (boxplot)  
* Salary by country  
* Salary vs years of experience  
* Salary vs education level  

---

### **Step 7: Learning & Education**
* Coding learning methods (bootcamp, university, self-taught)  
* Education level vs salary  
* Education level vs languages used  

---

### **Step 8: Work Preferences**
* Remote work preference (important in 2020 – pandemic)  
* Job satisfaction levels  
* Weekly working hours  

---

### **Step 9: Correlation & Multivariate Analysis**
* Correlation heatmap (experience, age, salary)  
* Salary vs Experience scatterplot  
* Grouping developers by skills (optional clustering)  

---

### **Step 10: Summarize Insights**
* Top 3 languages: **JavaScript, Python, SQL**  
* **US developers** earn highest salaries; **India** has more developers but lower median salary  
* Younger developers → Python; Older developers → Java, C#  
* Rise of self-taught developers  

---

## 🔑 Final Summary

This project analyzes the **Stack Overflow 2020 Developer Survey** to uncover:

* Who the developers are  
* What tools they use  
* How much they earn  
* How they learn coding  
* How their preferences changed in 2020 (COVID-19 year)  

It helps us **understand global trends in software development**.  

---

## 🙌 Acknowledgements
* Dataset Source: [Stack Overflow Developer Survey 2020](https://insights.stackoverflow.com/survey)  
* Analysis & Documentation: **L Hareesh**  
* AI Assistance: **ChatGPT**, **Gemini Pro**

---

## 📚 References & Resources

* [Google Colab](https://colab.research.google.com/) – Cloud-based Python environment used for running the analysis.  
* [Pandas Documentation](https://pandas.pydata.org/) – Python data analysis library.  
* [NumPy Documentation](https://numpy.org/) – Numerical computing library.  
* [Matplotlib Documentation](https://matplotlib.org/) – Visualization library.  
* [Seaborn Documentation](https://seaborn.pydata.org/) – Statistical data visualization.  
* [YouTube Playlist: Pandas & NumPy Tutorials](https://www.youtube.com/watch?v=GPVsHOlRBBI&list=PLyMom0n-MBrpr1Q3OQC5Od1o1zczHEO0u) – Helpful for learning data manipulation and analysis.  

---


In [10]:
# Install the necessary library for downloading files from GitHub

In [11]:
import requests
import os

# Replace with the direct URL to the raw zip file content on GitHub
# You can usually get this by going to the zip file on GitHub, clicking "Raw", and copying the URL
zip_file_url = 'https://github.com/Harish-lvrk/Data-Analysis-project/raw/main/stack-overflow-developer-survey-2020%20(1).zip'
local_zip_path = '/content/downloaded_file.zip'

try:
    # Download the zip file
    response = requests.get(zip_file_url, stream=True)
    response.raise_for_status()  # Raise an exception for bad status codes

    with open(local_zip_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

    print(f"Zip file downloaded to {local_zip_path}")

except requests.exceptions.RequestException as e:
    print(f"Error downloading zip file: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Zip file downloaded to /content/downloaded_file.zip


### Explanation of the Code

1. **Importing Libraries**

   * `requests` is used to make HTTP requests and download files from the internet.
   * `os` is imported for handling file paths and directories. It is not used in this cell, but the extraction cell that follows relies on it.

2. **File URL and Destination Path**

   * `zip_file_url` contains the direct download link of the zip file stored in GitHub.
   * `local_zip_path` specifies the location where the file will be saved locally (here, in the Colab environment under `/content/`).

3. **Making the Request**

   * `requests.get(zip_file_url, stream=True)` sends a GET request to the URL and enables streaming, which means the file will be downloaded in chunks rather than loading it entirely into memory. This helps when downloading large files.

4. **Checking the Response**

   * `response.raise_for_status()` checks whether the request was successful. If the server returned an error status code (4xx or 5xx), it raises an `HTTPError`; otherwise it does nothing.

5. **Saving the File**

   * `with open(local_zip_path, 'wb') as f:` opens a new file in **write-binary mode (`wb`)**. Binary mode is necessary since we are writing raw bytes of a zip file, not text.
   * The `with` statement ensures that the file is properly closed after writing, even if an error occurs.

6. **Writing in Chunks**

   * The `for chunk in response.iter_content(chunk_size=8192):` loop reads the file data in small pieces (8 KB each) instead of loading the whole file at once.
   * `f.write(chunk)` writes each chunk to the file until the entire file is downloaded.

7. **Success Message**

   * If everything goes well, it prints the location of the saved zip file.

8. **Error Handling**

   * `except requests.exceptions.RequestException as e:` catches errors related to the HTTP request (e.g., wrong URL, network error).
   * `except Exception as e:` catches any other unexpected errors and prints them.


In [12]:
import zipfile

# Use the path where the file was downloaded by the previous cell
local_zip_path = '/content/downloaded_file.zip'
extract_path = '/content/extracted_data'

try:
    # Create the extraction directory if it doesn't exist
    os.makedirs(extract_path, exist_ok=True)

    # Open and extract the zip file
    with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)
    print(f"Zip file extracted to {extract_path}")

    # List the files in the extracted directory
    print("\nFiles in the zip file:")
    for root, dirs, files in os.walk(extract_path):
        for name in files:
            print(os.path.join(root, name))
        for name in dirs:
            print(os.path.join(root, name))

except FileNotFoundError:
    print(f"Error: Zip file not found at {local_zip_path}. Please ensure the file path is correct and the previous cell ran successfully.")
except Exception as e:
    print(f"Error extracting zip file: {e}")

Zip file extracted to /content/extracted_data

Files in the zip file:
/content/extracted_data/so_survey_2020.pdf
/content/extracted_data/README_2020.txt
/content/extracted_data/survey_results_public.csv
/content/extracted_data/survey_results_schema.csv


### Explanation of Zip Extraction Process

1. **Importing Libraries**

   * `zipfile` is used to work with ZIP archive files.
   * `os` is used to manage file paths and directories.

2. **Setting File Paths**

   * `local_zip_path` stores the location where the ZIP file was downloaded in the previous step.
   * `extract_path` specifies the folder where the contents of the ZIP file will be extracted.

3. **Creating Extraction Directory**

   * `os.makedirs(extract_path, exist_ok=True)` ensures the target folder exists. If it doesn’t, it creates it. If it already exists, no error is raised (because of `exist_ok=True`).

4. **Opening and Extracting the Zip File**

   * `with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:` opens the ZIP file in read mode.
   * `zip_ref.extractall(extract_path)` extracts all contents of the ZIP file into the given directory.

5. **Listing Extracted Files**

   * `os.walk(extract_path)` is used to go through all files and subfolders inside the extracted directory.
   * It prints the paths of both files and folders inside the extracted directory so you can verify the contents.

6. **Error Handling**

   * `FileNotFoundError`: Triggered if the ZIP file does not exist at the given path.
   * `Exception as e`: Catches any other unexpected errors during extraction.
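
Before extracting an archive, it can be useful to check what it contains. The following is a minimal sketch using an in-memory ZIP built just for illustration (the member names and contents are placeholders, not the real survey files); `namelist()` and `testzip()` are standard `zipfile.ZipFile` methods.

```python
import io
import zipfile

# Build a small in-memory archive so the example does not depend on any download.
# The member names and contents below are placeholders for illustration only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("survey_results_public.csv", "Respondent,Country\n1,Germany\n")
    archive.writestr("README_2020.txt", "placeholder text")

# Rewind and reopen in read mode, then list the members without extracting them.
buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as archive:
    member_names = archive.namelist()
    # testzip() returns the name of the first corrupt member, or None if all are fine.
    first_bad_member = archive.testzip()

print(member_names)
print(first_bad_member)
```

With a real file, the same two calls could be made on `zipfile.ZipFile(local_zip_path, 'r')` before calling `extractall`.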

---

✅ **Summary:** This code safely extracts the downloaded ZIP file into a specified folder, ensures the folder exists before extraction, and lists all files and folders inside it. It also includes error handling for missing files or unexpected issues.

The `for` loop with `os.walk(extract_path)` goes through the folder where the ZIP file was extracted and lists everything inside it. Let's break it down step by step:

1. **`os.walk(extract_path)`**

   * This function goes through the folder (`extract_path`) and gives three things for each directory it visits:

     * `root` → The current folder path.
     * `dirs` → A list of all sub-folders inside the current folder.
     * `files` → A list of all files inside the current folder.

2. **First inner loop (`for name in files:`)**

   * It goes through every file found in that folder.
   * `os.path.join(root, name)` combines the folder path (`root`) with the file name (`name`) to get the full file path.
   * `print(...)` then shows the full path of each file.

3. **Second inner loop (`for name in dirs:`)**

   * It goes through every sub-folder found inside that folder.
   * Again, `os.path.join(root, name)` creates the full path of the folder.
   * `print(...)` then shows the full path of each folder.

👉 In this case, there were **no sub-folders**, only files, so only the file paths were printed.
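
To see both inner loops produce output, here is a minimal sketch that runs `os.walk` over a temporary folder that does contain a sub-folder. The folder layout (`data.csv`, `docs/readme.txt`) is hypothetical and created only for this demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    # Create a hypothetical layout: one file at the top level, one inside "docs".
    os.makedirs(os.path.join(base_dir, "docs"))
    for relative_path in ("data.csv", os.path.join("docs", "readme.txt")):
        with open(os.path.join(base_dir, relative_path), "w") as handle:
            handle.write("placeholder")

    found_files, found_dirs = [], []
    # os.walk yields one (root, dirs, files) tuple per directory it visits.
    for root, dirs, files in os.walk(base_dir):
        for name in files:
            found_files.append(os.path.relpath(os.path.join(root, name), base_dir))
        for name in dirs:
            found_dirs.append(os.path.relpath(os.path.join(root, name), base_dir))

print(sorted(found_files))
print(found_dirs)
```

Paths are stored relative to the temporary folder so the printed result does not depend on where that folder was created.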


In [13]:
import pandas as pd
import numpy as np


In [18]:

survey_raw_data = pd.read_csv('/content/extracted_data/survey_results_public.csv')

In [30]:
schema_data = pd.read_csv('/content/extracted_data/survey_results_schema.csv')

In [31]:
schema_data.shape

(61, 2)

This code cell uses the `.shape` attribute of the `schema_data` pandas DataFrame.

- `.shape` returns a tuple representing the dimensions of the DataFrame.
- The output `(61, 2)` indicates that the `schema_data` DataFrame has 61 rows and 2 columns.

In [24]:
display(survey_raw_data)

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
0,1,I am a developer by profession,Yes,,13,Monthly,,,Germany,European Euro,...,Neither easy nor difficult,Appropriate in length,No,"Computer science, computer engineering, or sof...",ASP.NET Core,ASP.NET;ASP.NET Core,Just as welcome now as I felt last year,50.0,36,27
1,2,I am a developer by profession,No,,19,,,,United Kingdom,Pound sterling,...,,,,"Computer science, computer engineering, or sof...",,,Somewhat more welcome now than last year,,7,4
2,3,I code primarily as a hobby,Yes,,15,,,,Russian Federation,,...,Neither easy nor difficult,Appropriate in length,,,,,Somewhat more welcome now than last year,,4,
3,4,I am a developer by profession,Yes,25.0,18,,,,Albania,Albanian lek,...,,,No,"Computer science, computer engineering, or sof...",,,Somewhat less welcome now than last year,40.0,7,4
4,5,"I used to be a developer by profession, but no...",Yes,31.0,16,,,,United States,,...,Easy,Too short,No,"Computer science, computer engineering, or sof...",Django;Ruby on Rails,Ruby on Rails,Just as welcome now as I felt last year,,15,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64456,64858,,Yes,,16,,,,United States,,...,,,,"Computer science, computer engineering, or sof...",,,,,10,Less than 1 year
64457,64867,,Yes,,,,,,Morocco,,...,,,,,,,,,,
64458,64898,,Yes,,,,,,Viet Nam,,...,,,,,,,,,,
64459,64925,,Yes,,,,,,Poland,,...,,,,,Angular;Angular.js;React.js,,,,,


The dataset contains over 64,000 responses to 60 questions (although many questions are optional). The responses have been anonymized and contain no personally identifiable information about respondents, only a randomized respondent ID.

In [32]:
survey_raw_data.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
       'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
       'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
       'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'MiscTechDesireNextYear', 'MiscTechWorkedWith',
       'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
       'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
       'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
       'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
       'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
       'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
       'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
       'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
  

Short codes are used as column names. We can refer to the schema file to see the full text of each question.

In [33]:
schema_data

Unnamed: 0,Column,QuestionText
0,Respondent,Randomized respondent ID number (not in order ...
1,MainBranch,Which of the following options best describes ...
2,Hobbyist,Do you code as a hobby?
3,Age,What is your age (in years)? If you prefer not...
4,Age1stCode,At what age did you write your first line of c...
...,...,...
56,WebframeWorkedWith,Which web frameworks have you done extensive d...
57,WelcomeChange,"Compared to last year, how welcome do you feel..."
58,WorkWeekHrs,"On average, how many hours per week do you wor..."
59,YearsCode,"Including any education, how many years have y..."


Let's check what each question looks like.
- In the `Column` column, each row holds the short code used for that question in the main data.
- To make lookups convenient, we set the `Column` values as the index of the schema data.

In [34]:
schema_data = schema_data.set_index('Column')
display(schema_data)


Column,QuestionText
Respondent,Randomized respondent ID number (not in order ...
MainBranch,Which of the following options best describes ...
Hobbyist,Do you code as a hobby?
Age,What is your age (in years)? If you prefer not...
Age1stCode,At what age did you write your first line of c...
...,...
WebframeWorkedWith,Which web frameworks have you done extensive d...
WelcomeChange,"Compared to last year, how welcome do you feel..."
WorkWeekHrs,"On average, how many hours per week do you wor..."
YearsCode,"Including any education, how many years have y..."


In [48]:
type(schema_data)

In [42]:
schemq = schema_data.loc['Respondent']
schemq

Unnamed: 0,Respondent
QuestionText,Randomized respondent ID number (not in order ...


- Since this mapping behaves like a dictionary, we can convert it into a simpler format (a Series) to access each question's text directly.

In [45]:
schema_raw = schema_data.QuestionText
type(schema_raw)

In [49]:
schema_raw['YearsCodePro']

'NOT including education, how many years have you coded professionally (as a part of your work)?'

- `schema_raw` is a pandas Series, so we can look up a question by its short code just like a dictionary key. Before this step, the schema data was a DataFrame.
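
The difference between the DataFrame lookup and the Series lookup can be sketched on a two-row toy schema. The `toy_schema` frame below is built by hand for illustration; its question texts are copied from the outputs shown in this notebook.

```python
import pandas as pd

# Hand-built stand-in for the schema file, limited to two questions.
toy_schema = pd.DataFrame({
    "Column": ["Hobbyist", "YearsCodePro"],
    "QuestionText": [
        "Do you code as a hobby?",
        "NOT including education, how many years have you coded "
        "professionally (as a part of your work)?",
    ],
})

indexed = toy_schema.set_index("Column")

# .loc on the DataFrame returns a whole row (a Series with one entry per column).
row = indexed.loc["Hobbyist"]

# Selecting the single column gives a Series keyed by the short codes,
# so a lookup returns the question text itself.
question_text = indexed["QuestionText"]
hobby_question = question_text["Hobbyist"]

print(type(row).__name__)
print(hobby_question)
```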

## Data Preparation & Cleaning

While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:

- Demographics of the survey respondents & the global programming community
- Distribution of programming skills, experience and preferences
- Employment-related information & preferences

Let's select a subset of columns with the relevant data.

In [50]:
selected_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]

In [51]:
len(selected_columns)

20

Let's extract a copy of the data from these columns into a new data frame `survey_df`, which we can continue to modify further without affecting the original data frame.

In [52]:
survey_df = survey_raw_data[selected_columns].copy()

- Do the same for the schema.

In [54]:
schema = schema_raw[selected_columns]

In [56]:
len(schema)

20

- Let's view some basic information about the survey data.

In [57]:
survey_df.shape

(64461, 20)

In [58]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 64072 non-null  object 
 1   Age                     45446 non-null  float64
 2   Gender                  50557 non-null  object 
 3   EdLevel                 57431 non-null  object 
 4   UndergradMajor          50995 non-null  object 
 5   Hobbyist                64416 non-null  object 
 6   Age1stCode              57900 non-null  object 
 7   YearsCode               57684 non-null  object 
 8   YearsCodePro            46349 non-null  object 
 9   LanguageWorkedWith      57378 non-null  object 
 10  LanguageDesireNextYear  54113 non-null  object 
 11  NEWLearn                56156 non-null  object 
 12  NEWStuck                54983 non-null  object 
 13  Employment              63854 non-null  object 
 14  DevType                 49370 non-null

Most columns have the data type `object`, either because they contain values of different types or because they contain empty values, which are represented as `np.nan`. It appears that every column contains some empty values, since the Non-Null count for every column is lower than the total number of rows (64461). We'll need to deal with empty values and manually adjust the data type for each column on a case-by-case basis.

Only two of the columns were detected as numeric (`Age` and `WorkWeekHrs`), even though a few other columns contain mostly numeric values. To make our analysis easier, let's convert some of these columns into numeric data types, ignoring any non-numeric values (they will be converted to `NaN`).
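
As a minimal sketch of how this conversion behaves, the example below applies `pd.to_numeric` with `errors="coerce"` to a small hand-made Series. The values are illustrative, chosen to mirror the kinds of entries seen in `YearsCodePro`.

```python
import pandas as pd

# Illustrative values: numeric strings, a missing value, and two text answers.
raw_years = pd.Series(["27", "4", None, "Less than 1 year", "More than 50 years"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric_years = pd.to_numeric(raw_years, errors="coerce")

# Count how many originally non-missing answers were lost to coercion.
lost_to_coercion = int(numeric_years.isna().sum() - raw_years.isna().sum())

print(numeric_years)
print(lost_to_coercion)
```

Comparing the missing-value counts before and after is a quick way to see how many text answers were discarded by the conversion.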

In [61]:
age_1st_code_unique = survey_df['Age1stCode'].unique()
age_1st_code_unique

array(['13', '19', '15', '18', '16', '14', '12', '20', '42', '8', '25',
       '22', '30', '17', '21', '10', '46', '9', '7', '11', '6', nan, '31',
       '29', '5', 'Younger than 5 years', '28', '38', '23', '27', '41',
       '24', '53', '26', '35', '32', '40', '33', '36', '54', '48', '56',
       '45', '44', '34', 'Older than 85', '39', '51', '68', '50', '37',
       '47', '43', '52', '85', '64', '55', '58', '49', '76', '72', '73',
       '83', '63'], dtype=object)

In [62]:
yearscodepro_unique = survey_df['YearsCodePro'].unique()
yearscodepro_unique

array(['27', '4', nan, '8', '13', '2', '7', '20', '1', '23', '3', '12',
       '17', '18', '10', '14', '29', '6', '28', '9', '15', '11', '16',
       '25', 'Less than 1 year', '5', '21', '19', '35', '24', '32', '22',
       '30', '38', '26', '40', '33', '31', 'More than 50 years', '34',
       '36', '39', '37', '41', '45', '47', '42', '46', '50', '43', '44',
       '48', '49'], dtype=object)

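- Some values, such as `'Less than 1 year'` and `'More than 50 years'`, are not numeric, so they will become `NaN` when the column is converted.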

In [63]:
yearscode_unique = survey_df['YearsCode'].unique()
yearscode_unique

array(['36', '7', '4', '15', '6', '17', '8', '10', '35', '5', '37', '19',
       '9', '22', '30', '23', '20', '2', 'Less than 1 year', '3', '13',
       '25', '16', '43', '11', '38', '33', nan, '24', '21', '12', '40',
       '27', '50', '46', '14', '18', '28', '32', '44', '26', '42', '31',
       '34', '29', '1', '39', '41', '45', 'More than 50 years', '47',
       '49', '48'], dtype=object)

In [64]:
survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors = 'coerce')
survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors = 'coerce')
survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors = 'coerce')

In [66]:
survey_df.describe()

Unnamed: 0,Age,Age1stCode,YearsCode,YearsCodePro,WorkWeekHrs
count,45446.0,57473.0,56784.0,44133.0,41151.0
mean,30.834111,15.476572,12.782051,8.869667,40.782174
std,9.585392,5.114081,9.490657,7.759961,17.816383
min,1.0,5.0,1.0,1.0,1.0
25%,24.0,12.0,6.0,3.0,40.0
50%,29.0,15.0,10.0,6.0,40.0
75%,35.0,18.0,17.0,12.0,44.0
max,279.0,85.0,50.0,50.0,475.0


### Observations from `survey_df.describe()`

Based on the summary statistics from `survey_df.describe()`, we can observe some potential issues in the data:

*   **Age:** The minimum age is listed as 1 and the maximum age as 279. These values are highly improbable for survey respondents and are likely data entry errors or outliers.
*   **WorkWeekHrs:** The maximum work week hours is reported as 475. Given that there are only 168 hours in a week (7 days * 24 hours), working 475 hours is impossible. This also indicates a data error or an extreme outlier.

These observations suggest that some data cleaning or handling of outliers might be necessary for the 'Age' and 'WorkWeekHrs' columns before performing further analysis.

This is a common issue with surveys: responses may contain invalid values due to accidental or intentional errors while responding. A simple fix is to treat rows where the value in the `Age` column is higher than 100 years or lower than 10 years as invalid survey responses and drop them.

In [67]:
survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
survey_df.drop(survey_df[survey_df.Age > 100].index, inplace=True)

The same holds true for `WorkWeekHrs`. Let's ignore entries where the value for the column is higher than 140 hours.

In [68]:
survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)
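
The `drop` calls above could also be written as a single boolean mask. The sketch below shows this on a small made-up frame (`toy` and its values are illustrative, not survey data), using the same thresholds. Comparisons against `NaN` evaluate to `False`, so rows with a missing `Age` are kept, just as they are with the `drop` approach.

```python
import numpy as np
import pandas as pd

# Made-up values: two implausible ages, one missing age, one impossible work week.
toy = pd.DataFrame({
    "Age": [25.0, 1.0, 279.0, np.nan, 40.0],
    "WorkWeekHrs": [40.0, 35.0, 50.0, 45.0, 475.0],
})

# A row is invalid if any of the three conditions holds.
invalid = (toy["Age"] < 10) | (toy["Age"] > 100) | (toy["WorkWeekHrs"] > 140)

# Keep everything that is not flagged; the row with a missing age survives.
cleaned = toy[~invalid]

print(cleaned)
```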

The gender column also allows picking multiple options, but to simplify our analysis, we'll remove values containing more than one option.
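
One possible way to do this is sketched below on a toy Series. It assumes that multiple selections are joined with `;`, as the multi-select columns in the table preview above appear to be, and it interprets "remove" as replacing those values with `NaN` rather than dropping the rows. Both points are assumptions, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy values; "Man;Woman" stands in for any answer with more than one option.
gender = pd.Series(["Man", "Woman", "Man;Woman", np.nan])

# Flag answers containing the assumed separator; missing answers count as False.
multiple_selected = gender.str.contains(";", na=False, regex=False)

# Keep single-option answers and blank out the rest.
single_choice_gender = gender.where(~multiple_selected, np.nan)

print(single_choice_gender)
```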