# Task B: Stack Overflow Developer Survey 2024 Analytics

You are provided with the latest developer survey results from Stack Overflow. Your task is to perform analytics on the survey to extract insights on the programming industry.

## Setup
If you are in Google Colab, running the cell below should be enough. Otherwise, create an environment from the provided conda `environment.yml` file, which lists all the dependencies (e.g. `conda env create -f environment.yml`).

In [None]:
%pip install pandas
import pandas as pd

The utility class below downloads and reads the data for you.

In [None]:
import csv
import requests
import os
from io import BytesIO
import zipfile
from typing import List, Dict, Any, Optional
from pathlib import Path

RESPONSE_ID_FIELD_NAME = "ResponseId"
QUESTION_ID_FIELD_NAME = "qid"

DATASET_SUBDIR = "."
SO_DEVELOPER_SURVEY_URL = "https://cdn.sanity.io/files/jo7n4k8s/production/262f04c41d99fea692e0125c342e446782233fe4.zip/stack-overflow-developer-survey-2024.zip"


class SurveyDataReader:
    """
    A class to read and process Stack Overflow Developer Survey data.
    """

    def __init__(self, schema_file: str, data_file: str):
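        # Download only when one of the expected files is missing. Counting every
        # .csv in the directory would re-download whenever an unrelated CSV is present.
        if not (os.path.exists(schema_file) and os.path.exists(data_file)):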
            self._download_datasets()
        self.schema = self._parse_schema(schema_file)
        self.data = self._parse_data(data_file)

    def _download_datasets(self):
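        # A timeout stops the notebook from hanging on a stalled connection.
        response = requests.get(SO_DEVELOPER_SURVEY_URL, timeout=60)

        if response.status_code == 200:
            zip_file = BytesIO(response.content)

            with zipfile.ZipFile(zip_file, "r") as zip_ref:
                os.makedirs(DATASET_SUBDIR, exist_ok=True)

                zip_ref.extractall(DATASET_SUBDIR)
        else:
            # Fail loudly here; otherwise parsing raises a less helpful FileNotFoundError.
            raise RuntimeError(f"Failed to download datasets: HTTP {response.status_code}")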

    def _parse_schema(self, schema_file: str) -> List[Dict[str, str]]:
        schema_path = Path(schema_file).resolve()
        # newline="" is what the csv module expects; utf-8-sig reads UTF-8 with or without a BOM.
        with open(schema_path, mode="r", encoding="utf-8-sig", newline="") as file:
            schema = list(csv.DictReader(file))
        return schema

    def _parse_data(self, data_file: str) -> List[Dict[str, Any]]:
        data_path = Path(data_file).resolve()
        # newline="" is what the csv module expects; utf-8-sig reads UTF-8 with or without a BOM.
        with open(data_path, mode="r", encoding="utf-8-sig", newline="") as file:
            data = list(csv.DictReader(file))
        return data

    def get_schema(self) -> List[Dict[str, str]]:
        return self.schema

    def get_data(self) -> List[Dict[str, Any]]:
        return self.data

    def get_question_by_id(self, qid: str) -> Optional[Dict[str, str]]:
        for question in self.schema:
            if question[QUESTION_ID_FIELD_NAME] == qid:
                return question
        return None

    def get_responses_for_question(self, qname: str) -> List[Any]:
        return [response[qname] for response in self.data if qname in response]

    def get_response_by_id(self, response_id: str | int) -> Optional[Dict[str, Any]]:
        response_id_str = str(response_id)
        for response in self.data:
            if response[RESPONSE_ID_FIELD_NAME] == response_id_str:
                return response
        return None

## Getting to know the data reader

In [None]:
SURVEY_SUBDIR = "."
SCHEMA_RELATIVE_PATH = f"{SURVEY_SUBDIR}/survey_results_schema.csv"
DATA_RELATIVE_PATH = f"{SURVEY_SUBDIR}/survey_results_public.csv"

reader = SurveyDataReader(SCHEMA_RELATIVE_PATH, DATA_RELATIVE_PATH)

In [None]:
print(reader.get_schema())

print(len(reader.get_data()))

print(reader.get_data()[0:10]) # Be careful when trying to output the data, there's lots of it!
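
Because each row carries many columns, printing even ten rows produces a lot of output. The sketch below shows one way to preview a row without flooding the notebook. It assumes rows are plain dictionaries, as returned by `reader.get_data()`; `preview_row` and the sample row are hypothetical and not part of the provided class.

```python
from itertools import islice
from typing import Any, Dict


def preview_row(row: Dict[str, Any], max_fields: int = 5) -> Dict[str, Any]:
    """Return only the first `max_fields` key/value pairs of a row."""
    return dict(islice(row.items(), max_fields))


# Hypothetical stand-in row; with the real data, pass reader.get_data()[0] instead.
sample_row = {"ResponseId": "1", "A": "x", "B": "y", "C": "z"}
print(preview_row(sample_row, max_fields=2))
```

Dictionaries preserve insertion order, so the preview shows the leading columns of the CSV.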

## Questions

1. Print all the questions asked in the developer survey.

2. Which age range has the most responses in the survey?

3. How many survey respondents do we know definitely work for a company larger than Marshall Wace? (Feel free to ask one of us if you don't remember how large Marshall Wace is!)

4. How many people had less than 1 year of coding experience before (or outside of) coding professionally?

5. Of the people who had 1 or more years of coding experience outside of coding professionally, what is the average number of years they spent coding outside of work? For simplicity, consider only the people who gave an exact number of years in both columns (i.e. exclude those who answered more than 50 years or less than 1 year).

6. What is the cumulative compensation among those that disclosed their total compensation? (What assumption are we making here?)

7. What is the most used language for software development?

8. How do developers perceive the benefits of AI in their respective fields (as specified by the first question with id `MainBranch`)?

9. Recreate the graph displaying the most used IDEs, found [here](https://survey.stackoverflow.co/2024/technology/#1-integrated-development-environment). As an extension, recreate it per type of individual as shown on the site too (you do not need to make it interactive).
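
Some of the questions above involve multi-select answers. In the public CSV these appear to be stored as a single semicolon-separated string, with `NA` marking a missing answer; this is an assumption worth checking against the data before relying on it. A minimal sketch of splitting such a cell, where `split_multi_select` is a hypothetical helper:

```python
from typing import List, Optional


def split_multi_select(value: Optional[str], sep: str = ";", missing: str = "NA") -> List[str]:
    """Split a multi-select cell into its options; missing or empty cells give []."""
    # Assumption: options are joined with `sep` and a missing answer is the literal `missing`.
    if value is None or value.strip() in ("", missing):
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


print(split_multi_select("Python;SQL;Rust"))
print(split_multi_select("NA"))
```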

## Bonus Task: SurveyDataReader

`SurveyDataReader` is a basic class that gives you programmatic access to the underlying survey data. It is implemented with basic data structures and, apart from `requests` for the download, no external dependencies, so there is plenty of room for optimisation. If you're feeling adventurous, try to improve the speed of the basic operations and add some of your own, for example by leveraging a package such as [NumPy](https://numpy.org/).
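
One possible starting point, sketched under the assumption that every row has a unique `ResponseId`: build a dictionary index once, so that a lookup by id no longer scans the whole list as `get_response_by_id` does. `IndexedRows` and the sample rows are hypothetical and not part of the provided class.

```python
from typing import Any, Dict, List, Optional, Union


class IndexedRows:
    """Wrap a list of row dicts with a dictionary index for fast id lookups."""

    def __init__(self, rows: List[Dict[str, Any]], id_field: str = "ResponseId"):
        # Built once in O(n); assumes ids are unique (a later duplicate would overwrite).
        self._by_id = {row[id_field]: row for row in rows}

    def get(self, response_id: Union[str, int]) -> Optional[Dict[str, Any]]:
        # Ids are strings in the CSV, so normalise integers before the lookup.
        return self._by_id.get(str(response_id))


# Hypothetical stand-in rows; with the real data, pass reader.get_data() instead.
rows = [{"ResponseId": "1", "Age": "a"}, {"ResponseId": "2", "Age": "b"}]
index = IndexedRows(rows)
print(index.get(2))
```

The trade-off is extra memory for the index in exchange for constant-time lookups on average.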