<a href="https://colab.research.google.com/github/Harish-lvrk/Data-Analysis-project/blob/main/EDA_StackOverflow_2025_surveydata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 EDA Case Study: Stack Overflow Annual Developer Survey 2020

## 👨‍💻 Author
**L Hareesh**  


---

## 📌 Project Overview

This project performs **Exploratory Data Analysis (EDA)** on the **Stack Overflow Annual Developer Survey 2020 dataset**.  
The dataset contains responses from **nearly 65,000 developers worldwide**, who shared details about their age, country, education, job, salary, programming languages, and preferences.

The main goal is to analyze and visualize the data to understand **global developer trends in 2020**.

---

## 📝 Understanding the Title

### 🔹 What is Stack Overflow?
A popular website where programmers ask and answer coding questions.

### 🔹 What is the Annual Developer Survey?
Every year, Stack Overflow collects responses from developers worldwide about their demographics, skills, and work.

### 🔹 What are Responses?
Each respondent (developer) provides answers. Example:

* Question: *What programming languages do you use?*  
* Response: *Python, JavaScript, SQL*

### 🔹 What does Analyzing Responses mean?
Studying those answers to find insights like:

* Most popular programming languages  
* Average salaries by country  
* Work preferences during COVID-19 (2020)

### 🔹 Where does the Data come from?
Published by **Stack Overflow**, freely available on their research page or Kaggle.

---

## 💡 What You Can Do with the Data

1. **Demographics** – Age, gender, countries of developers.  
2. **Programming Languages & Tools** – Most popular languages, databases, frameworks.  
3. **Job & Salary Analysis** – Salary by country, experience, education.  
4. **Learning & Education** – University vs self-taught vs bootcamp.  
5. **Work Preferences** – Remote work, job satisfaction, working hours.  
6. **Trends & Patterns** – Younger vs older developers, technology shifts.  
7. **Advanced Analysis** – Correlations, clustering, deeper insights.  

---

## 🚀 Roadmap for EDA

### **Step 1: Load the Data**
* Import libraries (`pandas`, `matplotlib`, `seaborn`)  
* Load dataset with `pd.read_csv()`  
* Check shape (rows, columns) and first few rows  
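
A minimal sketch of this step, using a tiny in-memory CSV in place of the real file so it runs anywhere. The column names and values below are made up for illustration; with the real data you would pass the path to `survey_results_public.csv` instead.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the survey file.
csv_text = """Respondent,Country,Age
1,India,24
2,United States,31
3,Germany,28
"""

# pd.read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample_df.shape  # (rows, columns)
print(f"{n_rows} rows x {n_cols} columns")
print(sample_df.head())  # first rows (at most 5)
```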

---

### **Step 2: Understand the Data**
* `df.info()` → Column names & datatypes  
* `df.describe()` → Summary stats  
* `df.isnull().sum()` → Missing values  
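
The three calls above can be tried on a small hand-made DataFrame. This is only a sketch: the sample values are invented, and `None`/`NaN` stand in for questions a respondent skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; missing entries mimic skipped questions.
sample_df = pd.DataFrame({
    "Country": ["India", "United States", None, "Germany"],
    "Age": [24.0, np.nan, 35.0, 28.0],
})

sample_df.info()                    # column names, non-null counts, dtypes
summary = sample_df.describe()      # summary statistics for numeric columns
missing = sample_df.isnull().sum()  # number of missing values per column

print(summary)
print(missing)
```

Note that `describe()` only covers numeric columns by default, so `Country` does not appear in `summary`.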

---

### **Step 3: Clean the Data**
* Remove irrelevant columns (IDs, metadata)  
* Handle missing values (drop/fill)  
* Rename columns (e.g., `YearsCodePro → Years_Professional`)  
* Convert text into numeric values where needed  
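
One possible way to combine these cleaning steps is sketched below on invented rows. The text answer `"Less than 1 year"` and the choice to map it to `0.5` are assumptions for illustration; check the real column's values before deciding how to encode them.

```python
import pandas as pd

# Hypothetical rows; the real column may contain other text values.
sample_df = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "YearsCodePro": ["3", "Less than 1 year", "12", None],
})

cleaned = (
    sample_df
    .drop(columns=["Respondent"])                            # remove an ID column
    .rename(columns={"YearsCodePro": "Years_Professional"})  # rename as in the roadmap
)

raw = cleaned["Years_Professional"]
years = pd.to_numeric(raw, errors="coerce")         # non-numeric text becomes NaN
years = years.mask(raw == "Less than 1 year", 0.5)  # arbitrary numeric stand-in
cleaned["Years_Professional"] = years

# Drop rows where the value is still missing.
cleaned = cleaned.dropna(subset=["Years_Professional"])
print(cleaned)
```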

---

### **Step 4: Demographics**
* Age distribution (histogram)  
* Top countries by respondents  
* Gender distribution  
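
As a rough sketch of the counting behind these charts (plotting omitted), `value_counts()` gives respondents per country and `pd.cut` groups ages into bins. The data and the bin edges here are invented for illustration.

```python
import pandas as pd

# Hypothetical respondents.
sample_df = pd.DataFrame({
    "Country": ["India", "United States", "India", "Germany", "United States", "India"],
    "Age": [22, 35, 27, 41, 30, 24],
})

# Most common countries, largest first.
top_countries = sample_df["Country"].value_counts().head(2)

# Bucket ages into (0, 25], (25, 35], (35, 100] and count each bucket.
age_groups = (
    pd.cut(sample_df["Age"], bins=[0, 25, 35, 100], labels=["<=25", "26-35", "36+"])
    .value_counts()
    .sort_index()
)

print(top_countries)
print(age_groups)
```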

---

### **Step 5: Programming Languages & Tools**
* Most popular programming languages (bar chart)  
* Databases, frameworks, cloud platforms analysis  
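
Multi-select answers are often stored as one delimited string per respondent. The sketch below assumes a semicolon-separated column named `LanguageWorkedWith`; both the name and the delimiter are assumptions to verify against the actual CSV.

```python
import pandas as pd

# Hypothetical multi-select answers, one string per respondent.
sample_df = pd.DataFrame({
    "LanguageWorkedWith": [
        "Python;SQL",
        "JavaScript;Python",
        None,
        "JavaScript;SQL;Python",
    ],
})

language_counts = (
    sample_df["LanguageWorkedWith"]
    .dropna()        # skip respondents who did not answer
    .str.split(";")  # one list of languages per respondent
    .explode()       # one row per (respondent, language)
    .value_counts()  # respondents per language
)
print(language_counts)
```

The resulting Series can be passed directly to a bar chart.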

---

### **Step 6: Job & Salary**
* Salary distribution (boxplot)  
* Salary by country  
* Salary vs years of experience  
* Salary vs education level  
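
Salary data tends to contain extreme values, so the median is usually a safer summary than the mean. A small sketch with invented numbers (the `Salary` column name is also hypothetical):

```python
import pandas as pd

# Hypothetical salaries; one India row is a deliberate outlier.
sample_df = pd.DataFrame({
    "Country": ["India", "India", "India", "United States", "United States", "Germany"],
    "Salary": [10000, 14000, 900000, 110000, 130000, 65000],
})

grouped = sample_df.groupby("Country")["Salary"]
median_by_country = grouped.median().sort_values(ascending=False)
mean_by_country = grouped.mean()

print(median_by_country)
print(mean_by_country)
```

In this sample the single outlier pulls the India mean far above its median, which is why the roadmap compares median salaries.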

---

### **Step 7: Learning & Education**
* Coding learning methods (bootcamp, university, self-taught)  
* Education level vs salary  
* Education level vs languages used  

---

### **Step 8: Work Preferences**
* Remote work preference (important in 2020 – pandemic)  
* Job satisfaction levels  
* Weekly working hours  

---

### **Step 9: Correlation & Multivariate Analysis**
* Correlation heatmap (experience, age, salary)  
* Salary vs Experience scatterplot  
* Grouping developers by skills (optional clustering)  
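
The numbers behind a correlation heatmap come from `DataFrame.corr()`. The sketch below uses invented values and hypothetical column names; the resulting matrix is what a heatmap function would then draw.

```python
import pandas as pd

# Hypothetical numeric columns.
sample_df = pd.DataFrame({
    "Age": [22, 27, 31, 38, 45],
    "YearsExperience": [1, 4, 8, 15, 22],
    "Salary": [30000, 48000, 70000, 95000, 120000],
})

# Pairwise Pearson correlation between all numeric columns.
corr = sample_df.corr()
print(corr.round(2))
```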

---

### **Step 10: Summarize Insights**
* Top 3 languages: **JavaScript, Python, SQL**  
* **US developers** earn highest salaries; **India** has more developers but lower median salary  
* Younger developers → Python; Older developers → Java, C#  
* Rise of self-taught developers  

---

## 🔑 Final Summary

This project analyzes the **Stack Overflow 2020 Developer Survey** to uncover:

* Who the developers are  
* What tools they use  
* How much they earn  
* How they learn coding  
* How their preferences changed in 2020 (COVID-19 year)  

It helps us **understand global trends in software development**.  

---

## 🙌 Acknowledgements
* Dataset Source: [Stack Overflow Developer Survey 2020](https://insights.stackoverflow.com/survey)  
* Analysis & Documentation: **L Hareesh**  
* AI Assistance: **ChatGPT**, **Gemini Pro**

---

## 📚 References & Resources

* [Google Colab](https://colab.research.google.com/) – Cloud-based Python environment used for running the analysis.  
* [Pandas Documentation](https://pandas.pydata.org/) – Python data analysis library.  
* [NumPy Documentation](https://numpy.org/) – Numerical computing library.  
* [Matplotlib Documentation](https://matplotlib.org/) – Visualization library.  
* [Seaborn Documentation](https://seaborn.pydata.org/) – Statistical data visualization.  
* [YouTube Playlist: Pandas & NumPy Tutorials](https://www.youtube.com/watch?v=GPVsHOlRBBI&list=PLyMom0n-MBrpr1Q3OQC5Od1o1zczHEO0u) – Helpful for learning data manipulation and analysis.  

---


In [None]:
# No installation is needed: `requests` is preinstalled in Google Colab

In [4]:
import requests
import os

# Replace with the direct URL to the raw zip file content on GitHub
# You can usually get this by going to the zip file on GitHub, clicking "Raw", and copying the URL
zip_file_url = 'https://github.com/Harish-lvrk/Data-Analysis-project/raw/main/stack-overflow-developer-survey-2020%20(1).zip'
local_zip_path = '/content/downloaded_file.zip'

try:
    # Download the zip file
    response = requests.get(zip_file_url, stream=True)
    response.raise_for_status()  # Raise an exception for bad status codes

    with open(local_zip_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

    print(f"Zip file downloaded to {local_zip_path}")

except requests.exceptions.RequestException as e:
    print(f"Error downloading zip file: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Zip file downloaded to /content/downloaded_file.zip


### Explanation of the Code

1. **Importing Libraries**

   * `requests` is used to make HTTP requests and download files from the internet.
   * `os` is imported here but not used in this cell; the extraction cell below uses it to create the output folder and list files.

2. **File URL and Destination Path**

   * `zip_file_url` contains the direct download link of the zip file stored in GitHub.
   * `local_zip_path` specifies the location where the file will be saved locally (here, in the Colab environment under `/content/`).

3. **Making the Request**

   * `requests.get(zip_file_url, stream=True)` sends a GET request to the URL and enables streaming, which means the file will be downloaded in chunks rather than loading it entirely into memory. This helps when downloading large files.

4. **Checking the Response**

   * `response.raise_for_status()` raises `requests.exceptions.HTTPError` if the server returned an error status (4xx or 5xx). For a successful response it does nothing.

5. **Saving the File**

   * `with open(local_zip_path, 'wb') as f:` opens a new file in **write-binary mode (`wb`)**. Binary mode is necessary since we are writing raw bytes of a zip file, not text.
   * The `with` statement ensures that the file is properly closed after writing, even if an error occurs.

6. **Writing in Chunks**

   * The `for chunk in response.iter_content(chunk_size=8192):` loop reads the file data in small pieces (8 KB each) instead of loading the whole file at once.
   * `f.write(chunk)` writes each chunk to the file until the entire file is downloaded.

7. **Success Message**

   * If everything goes well, it prints the location of the saved zip file.

8. **Error Handling**

   * `except requests.exceptions.RequestException as e:` catches errors related to the HTTP request (e.g., wrong URL, network error).
   * `except Exception as e:` catches any other unexpected errors and prints them.
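
A download can finish without an HTTP error and still not be a ZIP archive, for example when the URL points at a GitHub page rather than the raw file and HTML is saved instead. As a sketch of an extra check (not part of the original cell), `zipfile.is_zipfile` can confirm the file before extraction. The two files below are created locally so the example runs without network access.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.zip")
    bad_path = os.path.join(tmp_dir, "bad.zip")

    # A real archive containing one small text file.
    with zipfile.ZipFile(good_path, "w") as zf:
        zf.writestr("hello.txt", "hello")

    # A file with a .zip name but non-zip contents (e.g. a saved HTML page).
    with open(bad_path, "wb") as f:
        f.write(b"<html>Not Found</html>")

    good_is_zip = zipfile.is_zipfile(good_path)
    bad_is_zip = zipfile.is_zipfile(bad_path)

print(good_is_zip, bad_is_zip)
```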


In [5]:
import os
import zipfile

# Use the path where the file was downloaded by the previous cell
local_zip_path = '/content/downloaded_file.zip'
extract_path = '/content/extracted_data'

try:
    # Create the extraction directory if it doesn't exist
    os.makedirs(extract_path, exist_ok=True)

    # Open and extract the zip file
    with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)
    print(f"Zip file extracted to {extract_path}")

    # List the files in the extracted directory
    print("\nFiles in the zip file:")
    for root, dirs, files in os.walk(extract_path):
        for name in files:
            print(os.path.join(root, name))
        for name in dirs:
            print(os.path.join(root, name))

except FileNotFoundError:
    print(f"Error: Zip file not found at {local_zip_path}. Please ensure the file path is correct and the previous cell ran successfully.")
except Exception as e:
    print(f"Error extracting zip file: {e}")

Zip file extracted to /content/extracted_data

Files in the zip file:
/content/extracted_data/so_survey_2020.pdf
/content/extracted_data/README_2020.txt
/content/extracted_data/survey_results_public.csv
/content/extracted_data/survey_results_schema.csv


### Explanation of Zip Extraction Process

1. **Importing Libraries**

   * `zipfile` is used to work with ZIP archive files.
   * `os` is used to manage file paths and directories.

2. **Setting File Paths**

   * `local_zip_path` stores the location where the ZIP file was downloaded in the previous step.
   * `extract_path` specifies the folder where the contents of the ZIP file will be extracted.

3. **Creating Extraction Directory**

   * `os.makedirs(extract_path, exist_ok=True)` ensures the target folder exists. If it doesn’t, it creates it. If it already exists, no error is raised (because of `exist_ok=True`).

4. **Opening and Extracting the Zip File**

   * `with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:` opens the ZIP file in read mode.
   * `zip_ref.extractall(extract_path)` extracts all contents of the ZIP file into the given directory.

5. **Listing Extracted Files**

   * `os.walk(extract_path)` is used to go through all files and subfolders inside the extracted directory.
   * It prints the paths of both files and folders inside the extracted directory so you can verify the contents.

6. **Error Handling**

   * `FileNotFoundError`: Triggered if the ZIP file does not exist at the given path.
   * `Exception as e`: Catches any other unexpected errors during extraction.
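
If you only want to see what an archive contains, `ZipFile.namelist()` lists the entries without extracting anything. The sketch below builds a small archive in memory; the file names are chosen for illustration only.

```python
import io
import zipfile

# Build a small archive in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("survey_results_public.csv", "Respondent,Country\n1,India\n")
    zf.writestr("README.txt", "demo")

# Inspect the archive without extracting it.
with zipfile.ZipFile(buffer, "r") as zf:
    names = zf.namelist()
    csv_names = [name for name in names if name.endswith(".csv")]

print(names)
print(csv_names)
```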

---

✅ **Summary:** This code extracts the downloaded ZIP file into a specified folder, creates the folder first if it does not exist, and lists all files and folders inside it. It also handles a missing ZIP file and other unexpected errors.

The `for` loop with `os.walk(extract_path)` goes through the folder where the ZIP file was extracted and lists everything inside it. Let’s break it down step by step:

1. **`os.walk(extract_path)`**

   * This function goes through the folder (`extract_path`) and gives three things for each directory it visits:

     * `root` → The current folder path.
     * `dirs` → A list of all sub-folders inside the current folder.
     * `files` → A list of all files inside the current folder.

2. **First inner loop (`for name in files:`)**

   * It goes through every file found in that folder.
   * `os.path.join(root, name)` combines the folder path (`root`) with the file name (`name`) to get the full file path.
   * `print(...)` then shows the full path of each file.

3. **Second inner loop (`for name in dirs:`)**

   * It goes through every sub-folder found inside that folder.
   * Again, `os.path.join(root, name)` creates the full path of the folder.
   * `print(...)` then shows the full path of each folder.

👉 In this run, the archive contained **no sub-folders**, only files, so only the four file paths were printed.
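
To see what `os.walk` yields when sub-folders do exist, here is a small sketch that builds a temporary folder tree and records each `(root, dirs, files)` triple. The folder and file names are invented.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Create base/data.csv and base/docs/readme.txt.
    os.makedirs(os.path.join(base, "docs"))
    for rel_path in ("data.csv", os.path.join("docs", "readme.txt")):
        with open(os.path.join(base, rel_path), "w") as f:
            f.write("x")

    visited = []
    for root, dirs, files in os.walk(base):
        # Store paths relative to base so the output is easy to read.
        rel_root = os.path.relpath(root, base)
        visited.append((rel_root, sorted(dirs), sorted(files)))

for entry in visited:
    print(entry)
```

The top folder is visited first, and each sub-folder then gets its own iteration with its own `dirs` and `files` lists.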


In [6]:
import pandas as pd
import numpy as np


In [7]:
schema_data = pd.read_csv('/content/extracted_data/survey_results_schema.csv')
df = pd.read_csv('/content/extracted_data/survey_results_public.csv')

In [9]:
schema_data.shape

(61, 2)

This code cell uses the `.shape` attribute of the `schema_data` pandas DataFrame.

- `.shape` returns a tuple representing the dimensions of the DataFrame.
- The output `(61, 2)` indicates that the `schema_data` DataFrame has 61 rows and 2 columns.

In [None]:
schema_data.info()

The loading cell above (`In [7]`) reads two CSV files into pandas DataFrames:

- `schema_data = pd.read_csv('/content/extracted_data/survey_results_schema.csv')`: This line reads the `survey_results_schema.csv` file, which likely contains information about each column in the main survey data, into a DataFrame named `schema_data`.
- `df = pd.read_csv('/content/extracted_data/survey_results_public.csv')`: This line reads the main survey data from `survey_results_public.csv` into a DataFrame named `df`. This is the primary DataFrame for the analysis.

Both files are read from the `/content/extracted_data/` directory where the zip file was extracted.
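
A two-column schema file is convenient as a lookup table from column name to question wording. The sketch below assumes the two columns are named `Column` and `QuestionText`, which is not confirmed by the output shown here; check `schema_data.columns` first. The rows are invented.

```python
import pandas as pd

# Hypothetical schema rows; column names are assumptions.
schema_sample = pd.DataFrame({
    "Column": ["Age", "Country"],
    "QuestionText": ["What is your age (in years)?", "Where do you live?"],
})

# Index by column name so each question can be looked up directly.
question_text = schema_sample.set_index("Column")["QuestionText"]
print(question_text["Country"])
```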