In [1]:
# TODO: What is going on with the 'When developing in Python, how often do you …?' question?
# It looks like a series of responses left over after the collapse operation. Do we have the
# series of sub-questions in the column name before splitting on the colon?

# Follow-up: the same issue was found in another group of columns, and
# both groups are processed the same way. See the rename_reverse_colon function
# for the processing used.

In [2]:
# TODO: "Other – Write In:" showed up in a dataframe printout. How important is this?
# Do we get the write-in answer? If we do, should we ignore it and just use 'other'
# as their category?

In [3]:
%reload_ext nb_black


In [4]:
import pandas as pd
import numpy as np

from tqdm.notebook import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


In [5]:
def comma_collapse(x):
    """Utility function to remove non-strings before applying str.join()

    :param x: iterable of items to apply str.join() to (may be None)
    :return: single str of the string items in x joined by ", "
             Returns empty string if x is None or has no strings
    """
    # Check for None first; iterating over None would raise a TypeError
    if x is None:
        return ""

    # Keep only strings; non-string placeholders such as NaN are dropped
    x_str = [k for k in x if isinstance(k, str)]
    return ", ".join(x_str)


def rename_reverse_colon(df):
    """Rename question groups that put the identifying info before the colon.

    Examples:
        * "use a debugger:When developing in Python, how often do you …?"
        * "Computer graphics:To what extent are you involved in the following activities?"

    Most questions with a `:` in them put the identifying info after the
    colon. These columns do the opposite, so they need different processing.

    This function should be applied before other processing that relies on `:`.

    :param df: jetbrains python survey dataframe
    :return: df with columns of interest renamed
    """
    phrases = [
        {
            "key_phrase": "When developing in Python, ",
            "new_form": "how often do you {x}?",
        },
        {
            "key_phrase": "To what extent are you involved",
            "new_form": "how involved are you with {x}?",
        },
    ]

    name_map = {}
    for p in phrases:
        phrase_df = df.filter(like=p["key_phrase"])
        phrase_cols = phrase_df.columns

        for col in phrase_cols:
            x = col.split(":")[0]
            q = p["new_form"].format(x=x)
            name_map[col] = q

    return df.rename(columns=name_map)
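
The renaming idea can be checked on a toy frame. The sketch below is a simplified stand-in for `rename_reverse_colon` that handles a single key phrase; the column names come from the docstring examples, while the frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame: column names follow the docstring examples, values are made up.
toy = pd.DataFrame(
    {
        "response_id": [1, 2],
        "use a debugger:When developing in Python, how often do you …?": [
            "Often",
            None,
        ],
        "Computer graphics:To what extent are you involved in the following activities?": [
            None,
            "hobby",
        ],
    }
)

key_phrase = "When developing in Python, "
new_form = "how often do you {x}?"

# Keep the text before the colon and slot it into the new question form.
name_map = {
    col: new_form.format(x=col.split(":")[0])
    for col in toy.filter(like=key_phrase).columns
}
renamed = toy.rename(columns=name_map)
print(list(renamed.columns))
```

Columns that do not contain the key phrase are left untouched, which is why the full function loops over every entry in `phrases`.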


## Data comments:

* All questions were multiple choice. The only columns with more than 24 unique responses are the id field and the country field.
* All fields except the id are strings.
* Many columns have an "Other – Write In:" response category. We don't have the write-in responses in our current data.

----

## Data fields/groups of interest
(After processing.) Any column not listed here can go through the default categorical encoding.
    
### Do you consider yourself as a Data-Scientist?

* Responses: `['Other – Write In:', 'No', 'Yes']`
* Encode as binary and treat write-ins as NaN? Or one-hot encode?
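
A minimal sketch of the binary option, assuming write-ins should become NaN. The `responses` series is a made-up stand-in for the real column.

```python
import pandas as pd

# Made-up responses using the three categories listed above, plus a missing value.
responses = pd.Series(["Yes", "No", "Other – Write In:", None])

# Map Yes/No to 1/0; anything not in the dict (write-ins, missing) becomes NaN.
is_data_scientist = responses.map({"Yes": 1, "No": 0})
print(is_data_scientist)
```

With this encoding a write-in is indistinguishable from a skipped question. One-hot encoding would keep the two apart at the cost of extra columns.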
        
        
### "Which version of Python" question group

* Related questions:
    * Which version of Python do you use the most?
    * Which version of Python 2 do you use the most?
    * Which version of Python 3 do you use the most?
* Initially start with just the first column for the major version? Maybe create a category specifically for the oldest and newest versions of Python in the dataset?
* Potentially convert to a single float column and take the most granular version available?
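
One way the single-float idea might look. The response strings and the short column names (`major`, `py2`, `py3`) below are hypothetical, since the actual categories are not shown in this notebook.

```python
import pandas as pd

# Hypothetical responses and column names; the real survey categories may differ.
toy = pd.DataFrame(
    {
        "major": ["Python 3", "Python 2", "Python 3"],
        "py2": [None, "Python 2.7", None],
        "py3": ["Python 3.7", None, None],
    }
)


def to_float(s):
    """Extract the first 'major' or 'major.minor' number and cast it to float."""
    return s.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)


# Prefer the granular answers, falling back to the major version.
most_granular = (
    to_float(toy["py3"])
    .fillna(to_float(toy["py2"]))
    .fillna(to_float(toy["major"]))
)
print(most_granular.tolist())
```

A float cannot tell 3.1 from 3.10, so this only holds up if every minor version in the data is a single digit.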


### "How involved are you" question group:

* Ordinal: `{'primary activity': 3, 'secondary activity': 2, 'hobby': 1}`
* Columns:
```
['how involved are you with Computer graphics?',
 'how involved are you with Data analysis?',
 'how involved are you with Desktop development?',
 'how involved are you with DevOps / System administration / Writing automation scripts?',
 'how involved are you with Educational purposes?',
 'how involved are you with Embedded development?',
 'how involved are you with Game development?',
 'how involved are you with Machine learning?',
 'how involved are you with Mobile development?',
 'how involved are you with Multimedia applications development?',
 'how involved are you with Network programming?',
 'how involved are you with Other?',
 'how involved are you with Programming of web parsers / scrapers / crawlers?',
 'how involved are you with Software prototyping?',
 'how involved are you with Software testing / Writing automated tests?',
 'how involved are you with Web development?']
```
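
A sketch of applying the ordinal mapping to this group, assuming the columns can be selected by their shared prefix. The toy frame and its values are made up.

```python
import pandas as pd

involvement_order = {"primary activity": 3, "secondary activity": 2, "hobby": 1}

# Made-up rows using two of the column names listed above.
toy = pd.DataFrame(
    {
        "response_id": [10, 100, 1000],
        "how involved are you with Data analysis?": ["primary activity", None, "hobby"],
        "how involved are you with Web development?": [None, "secondary activity", None],
    }
)

# Select the group by its shared prefix and map each column to its ordinal value.
involved_cols = toy.filter(like="how involved are you with").columns
encoded = toy.copy()
for col in involved_cols:
    encoded[col] = toy[col].map(involvement_order)

print(encoded)
```

Unanswered cells stay NaN here. Filling them with 0 to mean "not involved" is a separate modelling decision.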
        
### "How often do you" question group:

* Ordinal: `{'Often': 3, 'From time<br />to time': 2, 'Never or<br />Almost never': 1}`
* Columns:
```
['how often do you use autocompletion  in your editor?',
 'how often do you use a debugger?',
 'how often do you refactor your code?',
 'how often do you use Version Control Systems?',
 'how often do you use code linting (programs that analyze code for potential errors)?',
 'how often do you use Python virtual environments for your projects?',
 'how often do you use SQL databases ?',
 'how often do you use NoSQL databases?',
 'how often do you run / debug or edit code on remote machines (remote hosts, VMs, etc.)?',
 'how often do you use a Python profiler?',
 'how often do you write tests for your code?',
 'how often do you use code coverage?',
 'how often do you use optional type hinting?',
 'how often do you use Continuous Integration tools?',
 'how often do you use Issue Trackers?']
```

### Ordinal columns:

```
{
    "Is Python the main language you use for your current projects?": {
        "Yes": 3,
        "No, I use Python as a secondary language": 2,
        "No, I don’t use Python for my current projects": 1,
    },
    "How long have you been programming in Python?": {
        "11+ years": 5,
        "6–10 years": 4,
        "3–5 years": 3,
        "1–2 years": 2,
        "Less than 1 year": 1,
    },
    "How many years of professional coding experience do you have?": {
        "11+ years": 5,
        "6–10 years": 4,
        "3–5 years": 3,
        "1–2 years": 2,
        "Less than 1 year": 1,
    },
    "How often do you use your primary IDE?": {
        "Daily": 4,
        "Weekly": 3,
        "Monthly": 2,
        "Less frequently": 1,
    },
    "How many people are in your project team?": {
        "More than 40 people": 5,
        "21-40 people": 4,
        "13-20 people": 3,
        "8-12 people": 2,
        "2-7 people": 1,
    },
    "How many people work for your company / organization?": {
        "More than 5,000": 7,
        "1,001–5,000": 6,
        "501–1,000": 5,
        "51–500": 4,
        "11–50": 3,
        "2–10": 2,
        "Just me": 1,
        "Not sure": 0,
    },
    "Could you tell us your age range?": {
        "60 or older": 6,
        "50–59": 5,
        "40–49": 4,
        "30–39": 3,
        "21–29": 2,
        "18–20": 1,
    },
}
``` 
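
A sketch of how a nested dict like the one above could be applied column by column. Only two of the mappings are repeated here, and the toy frame is made up.

```python
import pandas as pd

# Two of the mappings from the dict above, repeated so the example runs on its own.
ordinal_maps = {
    "How often do you use your primary IDE?": {
        "Daily": 4,
        "Weekly": 3,
        "Monthly": 2,
        "Less frequently": 1,
    },
    "How many people are in your project team?": {
        "More than 40 people": 5,
        "21-40 people": 4,
        "13-20 people": 3,
        "8-12 people": 2,
        "2-7 people": 1,
    },
}

# Made-up responses.
toy = pd.DataFrame(
    {
        "How often do you use your primary IDE?": ["Daily", "Monthly", None],
        "How many people are in your project team?": ["2-7 people", None, "21-40 people"],
    }
)

encoded = toy.copy()
for col, mapping in ordinal_maps.items():
    if col in encoded.columns:  # skip questions missing from this frame
        encoded[col] = encoded[col].map(mapping)

print(encoded)
```

`Series.map` turns any response missing from the mapping into NaN, so a typo in a key shows up as unexpected missing values. Note that coding "Not sure" as 0 ranks it below "Just me" on the company-size scale; mapping it to NaN may be worth considering.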
    
### Column contents are a comma-separated list, use `pd.Series.str.get_dummies()`:

```
[
    "What other language(s) do you use?",
    "What do you use Python for?",
    "What do you typically use to upgrade your Python version?",
    "Do you use any of the following tools to isolate Python environments, if any?",
    "What web frameworks / libraries do you use in addition to Python?",
    "What data science framework(s) do you use in addition to Python?",
    "Which of the following frameworks / libraries do you use in addition to Python?",
    "Which of the following cloud platforms do you use?",
    "How do you run code in the cloud (in the production environment)?",
    "How do you develop for the cloud?",
    "What operating system(s) are your development environment?",
    "Which Python unit-testing framework(s) do you use, if any?",
    "What ORM(s) do you use together with Python, if any?",
    "Which database(s) do you regularly use, if any?",
    "Which of the following Big Data tool(s) do you use, if any?",
    "Which Continuous Integration (CI) system(s) do you regularly use?",
    "Which configuration management tools do you use, if any?",
    "What editors/IDEs do you use for Python development in addition to your primary IDE?",
    "Which of the following best describes your job role(s)?",
]
```
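
A minimal sketch of the `get_dummies` approach on one of these columns. The values are made up but follow the `", "`-joined shape of the collapsed columns.

```python
import pandas as pd

# Made-up collapsed responses, including an empty string and a missing value.
frameworks = pd.Series(
    ["mock, pytest, unittest", "pytest", "", None],
    name="Which Python unit-testing framework(s) do you use, if any?",
)

# The separator must match the ", " used when the columns were joined.
dummies = frameworks.str.get_dummies(sep=", ")
print(dummies)
```

Empty and missing responses become all-zero rows. Any answer option whose own text contains `", "` would be split into separate indicator columns, so the option labels are worth checking before relying on this.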

In [6]:
data_path = "data/python_psf_external_19.csv"

# read_csv was warning about dtypes; took the lazy way out with low_memory=False
# instead of setting the dtype arg
raw_df = pd.read_csv(data_path, low_memory=False)
raw_df = rename_reverse_colon(raw_df)
raw_df.shape

(47308, 290)
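
For comparison, a sketch of the `dtype` route, assuming every column except the id should be read as a string (as the data comments suggest). The inline CSV and its `q1`/`q2` columns are made up.

```python
import io

import pandas as pd

# Made-up CSV standing in for the survey file.
csv_text = "response_id,q1,q2\n10,Yes,3–5 years\n100,,1–2 years\n"

# Read everything as str, then cast the id column back to int.
toy = pd.read_csv(io.StringIO(csv_text), dtype=str)
toy["response_id"] = toy["response_id"].astype(int)

print(toy.dtypes)
```

Empty fields are still parsed as NaN with `dtype=str`, so the missing-value handling downstream does not change.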


In [7]:
# Columns with a colon in the name belong to multi-column questions
colon_cols = [c for c in raw_df if ":" in c]
non_colon_cols = [c for c in raw_df if c not in colon_cols]

# Group those columns by the text after the last colon
colon_col_names = pd.Series(colon_cols)
colon_unique_qs = colon_col_names.str.split(":").str[-1].unique()

# Join the string responses across each group's columns into a single column
collapsed_series = []
for qoi in tqdm(colon_unique_qs):
    df_oi = raw_df.filter(like=qoi)
    collapsed = df_oi.apply(comma_collapse, axis=1)
    collapsed.name = qoi.strip()
    collapsed_series.append(collapsed)


colon_df = pd.concat(collapsed_series, axis=1)
df = pd.concat((raw_df[non_colon_cols], colon_df), axis=1)
df.head()





Unnamed: 0,response_id,Is Python the main language you use for your current projects?,How long have you been programming in Python?,How many years of professional coding experience do you have?,For what purposes do you mainly use Python?,how involved are you with Computer graphics?,how involved are you with Data analysis?,how involved are you with Desktop development?,how involved are you with DevOps / System administration / Writing automation scripts?,how involved are you with Educational purposes?,...,How do you develop for the cloud?,What operating system(s) are your development environment?,"Which Python unit-testing framework(s) do you use, if any?","What ORM(s) do you use together with Python, if any?","Which database(s) do you regularly use, if any?","Which of the following Big Data tool(s) do you use, if any?",Which Continuous Integration (CI) system(s) do you regularly use?,"Which configuration management tools do you use, if any?",What editors/IDEs do you use for Python development in addition to your primary IDE?,Which of the following best describes your job role(s)?
0,10,Yes,6–10 years,11+ years,Both for work and personal,,,,secondary activity,,...,Locally with virtualenv (or similar),macOS,"mock, pytest, unittest",Django ORM,"MySQL, Oracle Database",Apache Kafka,"AppVeyor, Gitlab CI, Jenkins / Hudson, Travis CI","Chef, Puppet, Salt",Vim,Developer / Programmer
1,100,Yes,3–5 years,1–2 years,For work,,,,,,...,"In Docker containers, In virtual machines",macOS,"mock, pytest, unittest",Django ORM,PostgreSQL,,Gitlab CI,,PyCharm Professional Edition,Developer / Programmer
2,1000,Yes,3–5 years,3–5 years,Both for work and personal,,secondary activity,,,,...,With local system interpreter,"Linux, macOS",pytest,,,Apache Kafka,,,"PyCharm Community Edition, Sublime Text",Other – Write In:
3,10000,Yes,1–2 years,Less than 1 year,"For personal, educational or side projects",,,,,,...,,,,,,,,,,
4,10001,,,,,,,,,,...,,,,,,,,,,
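
The collapse step can be seen in isolation on a toy frame. The `pytest:`/`unittest:` column names are an assumption about how the raw columns look, and `comma_collapse` is restated in compact form so the example runs on its own.

```python
import pandas as pd


def comma_collapse(x):
    """Join only the string items of x with ', ' (compact restatement)."""
    return ", ".join(k for k in x if isinstance(k, str))


question = "Which Python unit-testing framework(s) do you use, if any?"

# Hypothetical raw layout: one column per answer option for the same question.
toy = pd.DataFrame(
    {
        f"pytest:{question}": ["pytest", None],
        f"unittest:{question}": ["unittest", "unittest"],
    }
)

# Select every column for the question and join each row's selected options.
collapsed = toy.filter(like=question).apply(comma_collapse, axis=1)
collapsed.name = question
print(collapsed.tolist())
```

`DataFrame.filter(like=...)` does plain substring matching, so the parentheses and question mark in the question text need no escaping.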

