# What does your IDE say about you?
When you sit down to write some code, what application do you open? Jupyter Notebook? Atom? IDEs are central to data science and analytical workflows. I believe that by diving into the IDEs that Kaggle 2018 survey respondents say they use, we can gain insight into what the IDEs we use say about us.

<table align="center"><tr>
<td>    <img src="http://jupyter.org/assets/main-logo.svg" alt="Jupyter" style="width: 100px;"> </td>
<td>    <img src="https://www.rstudio.com/wp-content/uploads/2014/06/RStudio-Ball.png" alt="RStudio" style="width: 100px;"> </td>
<td>   <img src="https://caktus-website-production-2015.s3.amazonaws.com/media/blog-images/logo.png" alt="PyCharm" style="width: 100px;"> </td>
<td>   <img src="https://upload.wikimedia.org/wikipedia/commons/f/f5/Notepad_plus_plus.png" alt="Notepad" style="width: 100px;"> </td>
<td>   <img src="https://upload.wikimedia.org/wikipedia/en/thumb/d/d2/Sublime_Text_3_logo.png/150px-Sublime_Text_3_logo.png" alt="Sublime" style="width: 100px;"> </td>
<td>   <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/2d/Visual_Studio_Code_1.18_icon.svg/1028px-Visual_Studio_Code_1.18_icon.svg.png" alt="VSC" style="width: 100px;"> </td>
<td>   <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/21/Matlab_Logo.png/220px-Matlab_Logo.png" alt="Matlab" style="width: 100px;"> </td>
 <td>   <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/80/Atom_editor_logo.svg/2000px-Atom_editor_logo.svg.png" alt="Atom" style="width: 100px;"> </td>
 </tr></table>
    
## In this analysis I will explore the 2018 Kaggle Survey results, specifically Q13 on IDE usage:
`Q13: Which of the following integrated development environments (IDE's) have you used at work or school in the last 5 years?`


## A Tale of the 4 types of Kagglers.
Later we will use clustering techniques to identify 4 types of Kagglers based on IDE use. Surprisingly, these groups are quite distinct and divide our population into 4 roughly equal parts.
- **The Jupyter Lover** - Respondents who primarily use Jupyter as their development environment.
- **Jack of All IDEs** - Respondents who use a wide range of IDEs.
- **RStudio + Jupyter** - Respondents who use both RStudio and Jupyter.
- **Never Jupyters (and non-coders)** - Respondents who either don't use development software or use a specialized IDE exclusively.

## First, let's look at overall IDE use for Kaggle 2018 Survey respondents.
**Some general questions to explore:**
1. Which IDEs do respondents say they've used? Are certain types of IDE used more or less often? Are some IDEs commonly used together?
2. Can we define some distinct groups of "coders" by the IDEs they use? If someone uses Vim, are they more likely to be an experienced programmer? Are students more likely to use Jupyter notebooks?
3. Do other details about a respondent correlate with the IDEs they use? Do respondents who use specific IDEs typically make more or less money? Do specific job titles trend towards specific IDEs?
4. What were some of the freeform responses? Which IDEs should be added as options next year?


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
import seaborn as sns
from matplotlib_venn import venn3, venn2
warnings.filterwarnings("ignore")
import matplotlib.pylab as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Read Multiple Choice
mc = pd.read_csv('../input/multipleChoiceResponses.csv')

In [None]:
# Data Prep
ide_qs = mc[['Q13_Part_1','Q13_Part_2','Q13_Part_3','Q13_Part_4','Q13_Part_5',
             'Q13_Part_6','Q13_Part_7','Q13_Part_8','Q13_Part_9','Q13_Part_10',
             'Q13_Part_11','Q13_Part_12','Q13_Part_13','Q13_Part_14','Q13_Part_15']].drop(0)

# ide_qs = mc.drop(0).copy()

column_rename = {'Q13_Part_1': 'Jupyter/IPython',
                 'Q13_Part_2': 'RStudio',
                'Q13_Part_3': 'PyCharm',
                'Q13_Part_4': 'Visual Studio Code',
                'Q13_Part_5': 'nteract',
                'Q13_Part_6': 'Atom',
                'Q13_Part_7': 'MATLAB',
                'Q13_Part_8': 'Visual Studio',
                'Q13_Part_9': 'Notepad++',
                'Q13_Part_10': 'Sublime Text',
                'Q13_Part_11': 'Vim',
                'Q13_Part_12': 'IntelliJ',
                'Q13_Part_13': 'Spyder',
                'Q13_Part_14': 'None',
                'Q13_Part_15': 'Other',
                }
ide_qs_binary = ide_qs.rename(columns=column_rename).fillna(0).replace('[^\\d]',1, regex=True)
mc_and_ide = pd.concat([mc.drop(0), ide_qs_binary], axis=1)

# Overall IDE Popularity
Takeaways:
* Jupyter is a big favorite, with over 50% of respondents using it.
* RStudio leads the rest of the pack. It seems safe to assume that respondents who code exclusively in R prefer RStudio as their main IDE.
* Notepad++ is the top pure text editor, ahead of Sublime Text.
* Some responded with *None*. Is it safe to assume that these people don't code?
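
The percentages behind these takeaways come from a simple property of 0/1 columns: the column mean is the share of respondents who ticked that IDE. Here is a minimal sketch on made-up data (the `toy_ide` frame and its values are illustrative assumptions, not survey results):

```python
import pandas as pd

# Toy stand-in for the binary IDE table: 1 = respondent selected the IDE.
# These rows are invented for illustration only.
toy_ide = pd.DataFrame({
    'Jupyter/IPython': [1, 1, 0, 1],
    'RStudio':         [0, 1, 0, 1],
    'None':            [0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of respondents who selected it.
# With no missing values this matches sum() / count().
toy_share = toy_ide.mean().sort_values(ascending=False)
print(toy_share)
```

Sorting the resulting series is what puts the most popular IDE at the top of a bar chart.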

In [None]:
color_pal = sns.color_palette("husl", 16)

(ide_qs_binary.sum() / ide_qs_binary.count()) \
    .sort_values() \
    .plot(kind='barh', figsize=(10, 10),
          title='Percentage of respondents who have used this IDE in past 5 Years.',
         color=color_pal)

plt.show()

# Overlap Between IDE Usage
We can start by making Venn diagrams of the most popular IDEs. These are interesting because we can see not only how many people have used each IDE, but also the overlap between IDEs.
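
A three-set Venn diagram needs seven counts, one per region. As a rough sketch, a small helper can compute them from any three 0/1 columns in the order `venn3` expects for its `subsets` tuple. The helper name `venn3_counts` and the toy data are my own assumptions for illustration:

```python
import pandas as pd

def venn3_counts(df, a, b, c):
    """Return the 7 region sizes in venn3's subsets order:
    (A only, B only, A&B only, C only, A&C only, B&C only, A&B&C)."""
    in_a, in_b, in_c = df[a] == 1, df[b] == 1, df[c] == 1
    return (
        int((in_a & ~in_b & ~in_c).sum()),
        int((~in_a & in_b & ~in_c).sum()),
        int((in_a & in_b & ~in_c).sum()),
        int((~in_a & ~in_b & in_c).sum()),
        int((in_a & ~in_b & in_c).sum()),
        int((~in_a & in_b & in_c).sum()),
        int((in_a & in_b & in_c).sum()),
    )

# Invented rows: one Jupyter-only user, one Jupyter+RStudio user,
# one RStudio+Notepad++ user, one user of all three, one user of none.
toy = pd.DataFrame({
    'Jupyter/IPython': [1, 1, 0, 1, 0],
    'RStudio':         [0, 1, 1, 1, 0],
    'Notepad++':       [0, 0, 1, 1, 0],
})
toy_regions = venn3_counts(toy, 'Jupyter/IPython', 'RStudio', 'Notepad++')
print(toy_regions)
```

The resulting tuple could be passed straight to `venn3(subsets=..., set_labels=...)`. Respondents who selected none of the three IDEs do not appear in any region.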

In [None]:
plt.figure(figsize=(15, 8))

venn3(subsets=(len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 0) & (ide_qs_binary['Notepad++'] == 0)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 0) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 0)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 0)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 0) & (ide_qs_binary['RStudio'] == 0) & (ide_qs_binary['Notepad++'] == 1)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 0) & (ide_qs_binary['Notepad++'] == 1)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 0) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 1)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 1)])),
      set_labels=('Jupyter', 'RStudio', 'Notepad++'))
plt.title('Jupyter vs RStudio vs Notepad++ (All users)')
plt.show()

# Students Love Jupyter!

A lot of Kagglers (nearly half of all respondents) answered that they were "students". Since Jupyter is widely used in teaching, it's interesting to see how the Venn diagram differs between student and non-student respondents. While Jupyter is still very popular among both groups, the overlap between Jupyter and the next two most popular IDEs (RStudio and Notepad++) is more varied among non-students. Among non-students there are some exclusive users of RStudio and Notepad++, but almost every student who uses these IDEs has also had some interaction with Jupyter.
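
One thing to watch when splitting the counts by student status: in Python, `&` binds more tightly than `==`, so every comparison in a pandas filter needs its own parentheses. A minimal sketch on invented rows (the `toy` frame is an assumption, though `Q6` mirrors the job title column used in this notebook):

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Notepad++': [0, 0, 1, 1],
    'Q6': ['Student', 'Data Scientist', 'Student', 'Data Scientist'],
})

# Each comparison is wrapped in its own parentheses before combining with &.
# Without them, `0 & (...)` would be evaluated first and change the meaning.
student_no_notepad = (toy['Notepad++'] == 0) & (toy['Q6'] == 'Student')
n_student_no_notepad = int(student_no_notepad.sum())
print(n_student_no_notepad)
```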

In [None]:
# Venn diagrams of IDE use: students vs. non-students
plt.figure(figsize=(15, 8))
plt.subplot(1, 2, 1)
venn3(subsets=(len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 0) & (ide_qs_binary['Notepad++'] == 0) & (mc_and_ide['Q6'] == 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 0) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 0)& (mc_and_ide['Q6'] == 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 0)& (mc_and_ide['Q6'] == 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 0) & (ide_qs_binary['RStudio'] == 0) & (ide_qs_binary['Notepad++'] == 1)& (mc_and_ide['Q6'] == 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 0) & (ide_qs_binary['Notepad++'] == 1)& (mc_and_ide['Q6'] == 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 0) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 1)& (mc_and_ide['Q6'] == 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 1) & (mc_and_ide['Q6'] == 'Student')])),
      set_labels=('Jupyter', 'RStudio', 'Notepad++'))
plt.title('Students IDE Use')
plt.subplot(1, 2, 2)
venn3(subsets=(len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 0) & (ide_qs_binary['Notepad++'] == 0) & (mc_and_ide['Q6'] != 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 0) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 0)& (mc_and_ide['Q6'] != 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 0)& (mc_and_ide['Q6'] != 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 0) & (ide_qs_binary['RStudio'] == 0) & (ide_qs_binary['Notepad++'] == 1)& (mc_and_ide['Q6'] != 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 0) & (ide_qs_binary['Notepad++'] == 1)& (mc_and_ide['Q6'] != 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 0) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 1)& (mc_and_ide['Q6'] != 'Student')]),
               len(ide_qs_binary.loc[(ide_qs_binary['Jupyter/IPython'] == 1) & (ide_qs_binary['RStudio'] == 1) & (ide_qs_binary['Notepad++'] == 1) & (mc_and_ide['Q6'] != 'Student')])),
      set_labels=('Jupyter', 'RStudio', 'Notepad++'))
plt.title('Non-Students IDE Use')
plt.show()

# Notepad++ is the most popular 'text editor' but others are close behind.
I wanted to look at a Venn diagram of some of the IDEs that serve a similar purpose. Since RStudio is exclusively for R, and Jupyter is unique in its interactive cells, how did respondents overlap in the text editors they say they use?

This is a pretty clean Venn diagram. You can see that there are more exclusive Notepad++ users than exclusive PyCharm or Sublime Text users, but the areas where they overlap are strikingly similar. The 1556 people in the center are possibly just people who enjoy trying everything out, since the question asked "[which IDE] have you used at work or school in the last 5 years?" Respondents aren't necessarily stating which IDE they like the most, just the ones they've used.

In [None]:
# Maybe do some more Venn diagrams like this, but it might be too much.
plt.figure(figsize=(15, 8))

venn3(subsets=(len(ide_qs_binary.loc[(ide_qs_binary['Sublime Text'] == 1) & (ide_qs_binary['PyCharm'] == 0) & (ide_qs_binary['Notepad++'] == 0)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Sublime Text'] == 0) & (ide_qs_binary['PyCharm'] == 1) & (ide_qs_binary['Notepad++'] == 0)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Sublime Text'] == 1) & (ide_qs_binary['PyCharm'] == 1) & (ide_qs_binary['Notepad++'] == 0)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Sublime Text'] == 0) & (ide_qs_binary['PyCharm'] == 0) & (ide_qs_binary['Notepad++'] == 1)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Sublime Text'] == 1) & (ide_qs_binary['PyCharm'] == 0) & (ide_qs_binary['Notepad++'] == 1)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Sublime Text'] == 0) & (ide_qs_binary['PyCharm'] == 1) & (ide_qs_binary['Notepad++'] == 1)]),
               len(ide_qs_binary.loc[(ide_qs_binary['Sublime Text'] == 1) & (ide_qs_binary['PyCharm'] == 1) & (ide_qs_binary['Notepad++'] == 1)])),
      set_labels=('Sublime Text', 'PyCharm', 'Notepad++'))
plt.title('Sublime Text vs PyCharm vs Notepad++ (All)')
plt.show()

# Job Title and IDEs - the results might surprise you!
## *Hover over a cell to see the percentage*

Let's work off of the hypothesis that the type of work you do will impact the type of tools you use. We compare the IDE use of different job types below. This chart allows us to see what percentage of professional groups use each IDE. 

Some interesting insights we can see are:
1. Jupyter/IPython is popular for pretty much every job title. This is obviously biased because the respondents also use Kaggle! Surveying non-Kagglers would likely give a different picture.
2. Statisticians use Jupyter the least (51.9%). Is this because they also tend to be R users (74.68% use RStudio)?
3. DBA/Database Engineers and Software Engineers are heavy Notepad++ users, but I personally find it surprising that Data Journalists and Developer Advocates also use this IDE. What drives this similarity among disparate occupations?
4. Visual Studio and Visual Studio Code are popular among Developer Advocates and Software Engineers.
5. MATLAB looks to be quite popular within academia. Students, Research Assistants, Research Scientists, and Principal Investigators stand out as MATLAB users.
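
The percentages in the chart come from grouping the 0/1 IDE columns by job title and taking the mean. A small sketch with invented rows (not survey data) shows the idea:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Job Title': ['Student', 'Student', 'Statistician', 'Statistician'],
    'Jupyter/IPython': [1, 1, 1, 0],
    'RStudio': [0, 1, 1, 1],
})

# Selecting columns with a list (double brackets) and taking the group mean
# gives, per job title, the share of respondents who use each IDE.
toy_by_title = toy.groupby('Job Title')[['Jupyter/IPython', 'RStudio']].mean()
print(toy_by_title)
```

Each row of the result describes one job title, which is why the color gradient is applied along rows: it highlights which IDEs stand out within a profession rather than across professions.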

In [None]:
ide_by_q6 = mc_and_ide \
    .rename(columns={'Q6':'Job Title'}) \
    .groupby('Job Title')[['Jupyter/IPython','RStudio','PyCharm','Visual Studio Code',
                   'nteract','Atom','MATLAB','Visual Studio','Notepad++','Sublime Text',
                   'Vim','IntelliJ','Spyder','None','Other']] \
    .mean()

def magnify():
    return [dict(selector="th",
                 props=[("font-size", "8pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "9pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '9pt')])
]
np.random.seed(25)
cmap = sns.diverging_palette(5, 250, as_cmap=True)
#bigdf = pd.DataFrame(np.random.randn(20, 25)).cumsum()
ide_by_q6.T.sort_values('Data Analyst', ascending=False).T \
    .sort_values('RStudio', ascending=False) \
    .rename(columns={'Jupyter/IPython': 'Jupyter',
                     'Visual Studio':'VStudio',
                     'Visual Studio Code': "VSCode",
                     'Sublime Text': 'Sublime'}) \
    [['Jupyter','RStudio','Notepad++','PyCharm','Spyder','Sublime','MATLAB','VStudio','VSCode']] \
    .sort_index() \
    .style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '100px', 'font-size': '1pt'})\
    .set_caption("Hover to magnify")\
    .set_table_styles(magnify()) \
    .format("{:.2%}")

# Correlations between IDEs and Salary

It's always fun to plot the data by salary. Of course, we should be wary of drawing any causal conclusions from this analysis. Correlation does not equal causation! Don't change the IDE you use thinking it will make you more money.

*Note: I ignored NA values and respondents who answered "I do not wish to disclose my approximate yearly compensation" to the salary question. This may bias our results and should be kept in mind.*

Some takeaways:
- Each IDE appears to have its own unique trend when looking at popularity vs. income, which I find unexpected and interesting.
- RStudio use appears to have a somewhat normal distribution around the mid-salary range.
- PyCharm and VSCode appear to be fairly flat, with little change in popularity across income levels.
- Notepad++ starts to trend downward as salary goes above \$100k. Vim trends upwards.
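
Plotting these trends only works if the salary bands sort by amount rather than alphabetically. One way to get that, sketched here on invented rows, is an ordered categorical; answers outside the listed bands (such as declining to disclose) become missing values and drop out of the groupby:

```python
import pandas as pd

# A short, made-up subset of bands, listed in the order we want them plotted.
toy_bands = ['0-10,000', '20-30,000', '100-125,000']

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Q9': ['100-125,000', '0-10,000', '20-30,000', '0-10,000',
           'I do not wish to disclose my approximate yearly compensation'],
    'Vim': [1, 0, 1, 0, 1],
})

# Values not listed in `categories` are converted to NaN.
toy['Salary'] = pd.Categorical(toy['Q9'], categories=toy_bands, ordered=True)

# Group means come back in category order, not string order.
toy_vim_by_salary = toy.groupby('Salary', observed=False)['Vim'].mean()
print(toy_vim_by_salary)
```

A plain string sort would have placed the 100-125,000 band before 20-30,000, which would scramble the x-axis of any trend line.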

In [None]:
# Make Salary into a categorical so it can be sorted
salary_ordered = ['0-10,000' ,
                    '10-20,000',
                    '20-30,000',
                    '30-40,000',
                    '40-50,000',
                    '50-60,000',
                    '60-70,000',
                    '70-80,000',
                    '80-90,000',
                    '90-100,000',
                    '100-125,000',
                    '125-150,000',
                    '150-200,000',
                    '200-250,000',
                    '250-300,000',
                    '300-400,000',
                    '400-500,000',
                    '500,000+',
                  #  'I do not wish to disclose my approximate yearly compensation'
                 ]
mc_and_ide['Salary'] = pd.Categorical(mc_and_ide['Q9'], salary_ordered)


ide_salary_breakdown = mc_and_ide.groupby('Salary')[['Jupyter/IPython','RStudio','PyCharm','Visual Studio Code',
                   'nteract','Atom','MATLAB','Visual Studio','Notepad++','Sublime Text',
                   'Vim','IntelliJ','Spyder','None','Other']].mean().sort_index()

ide_salary_breakdown['Mean Salary'] = [5, 15, 25, 35, 45, 55, 65, 75,
                                       85, 95, 112.500, 137.500, 175.000, 225.000, 275.000,
                                       350.000, 450.000, 550.000]

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=3, sharex=True, sharey=True, figsize=(15, 15))
color_pal = sns.color_palette("husl", 16)
n = 1
for col in ide_salary_breakdown.set_index('Mean Salary').columns:
    #print(col)
    plt.subplot(5, 3, n)
    ide_salary_breakdown.set_index('Mean Salary')[col].plot(title=col, xlim=(20,300), color=color_pal[n])
    plt.ylabel('% Use')
    plt.xlabel('Mean Salary ($1,000)')
    n += 1
plt.subplots_adjust(hspace = 0.5)
plt.suptitle('IDE Use (% of respondents) by Salary (\$20k-$300k)', size=20)
fig.tight_layout()
fig.subplots_adjust(top=0.93)
plt.show()

## Let's look closer at the relationship between salary and IDE use
In the chart below we can see the exact number of respondents from each salary grouping.
With a chart like this, insights are usually in the abnormalities. For instance:
- PyCharm is comparatively popular, relative to other IDEs, in the \$120-\$150k range. Could this be because this is the salary range of a serious Python developer?
- The proportional relationship between Jupyter and Unix-based text editors varies across salary ranges. We see a Jupyter-to-Vim ratio of almost `4:1` in the \$0-\$10k range, but that drops to a `2:1` ratio in the \$150-\$200k range. Could it be that those with more job experience also spend more time in the Unix shell?
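
The ratio comparison is just a column-wise division of the count table. As a sketch, with hypothetical counts chosen only to mirror the `4:1` and `2:1` pattern described above (these are not the survey numbers, and the band labels are placeholders):

```python
import pandas as pd

# Hypothetical counts per salary band, NOT survey results.
toy_counts = pd.DataFrame(
    {'Jupyter/IPython': [400, 120], 'Vim': [100, 60]},
    index=['low band', 'high band'],
)

# Jupyter users per Vim user within each band.
toy_ratio = toy_counts['Jupyter/IPython'] / toy_counts['Vim']
print(toy_ratio)
```

Looking at ratios instead of raw counts removes the effect of some salary bands simply having more respondents than others.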

In [None]:
salary_ide_counts = mc_and_ide.groupby('Salary')[['Jupyter/IPython', 'RStudio', 'PyCharm',
                             'Visual Studio Code', 'nteract', 'Atom',
                             'MATLAB', 'Visual Studio', 'Notepad++',
                             'Sublime Text', 'Vim', 'IntelliJ', 'Spyder',
                             'None', 'Other']] \
    .sum() \
    .T \
    .sort_values('90-100,000', ascending=False) \
    .T \
    .sort_index() \
    .T

salary_ide_counts.columns = [str(col) for col in salary_ide_counts.columns]
salary_ide_counts['\$250k+'] = salary_ide_counts[['250-300,000', '300-400,000', '400-500,000', '500,000+']].sum(axis=1)
salary_ide_counts['\$150k-$250k'] = salary_ide_counts[['150-200,000', '200-250,000']].sum(axis=1)
salary_ide_counts['\$100k-$150k'] = salary_ide_counts[['100-125,000', '125-150,000']].sum(axis=1)
salary_ide_counts['\$80k-$100k'] = salary_ide_counts[['80-90,000', '90-100,000']].sum(axis=1)
salary_ide_counts['\$60k-$80k'] = salary_ide_counts[['60-70,000', '70-80,000']].sum(axis=1)
salary_ide_counts['\$40k-$60k'] = salary_ide_counts[['40-50,000', '50-60,000']].sum(axis=1)
salary_ide_counts['\$20k-$40k'] = salary_ide_counts[['20-30,000', '30-40,000']].sum(axis=1)
salary_ide_counts['\$0-$20k'] = salary_ide_counts[['0-10,000', '10-20,000']].sum(axis=1)

salary_ide_counts[['\$0-$20k','\$20k-$40k','\$40k-$60k','\$60k-$80k','\$80k-$100k','\$100k-$150k','\$150k-$250k','\$250k+']] \
    .style.background_gradient(cmap)

# The country you reside in and your IDE
(Results shown are % of respondents who use the IDE within the country)

Some really interesting insights can be drawn from looking at IDE use by country. To keep things simple I'm only looking at the top 10 countries (and later the top 5) by number of respondents. This is only to keep the visualizations easy to read.

Some insights that stand out are:
- China and Russia have less love for RStudio compared to other countries. They also appear to be heavier PyCharm users.
- Jupyter is not as popular in China as in other countries, with only 46% of Chinese respondents saying they use Jupyter. China also has an above-average share of MATLAB users.
- Visual Studio Code is popular in Brazil, with 30% of Brazilians saying they've used it.
- Spyder appears to have a slightly stronger following in India than in other countries.
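
Restricting the view to the countries with the most respondents can be sketched as follows. The rows are invented, and `value_counts` is just one way to rank the countries; `Q3` mirrors the country column used in this notebook:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Q3': ['India', 'India', 'India', 'Brazil', 'Brazil', 'Chile'],
    'PyCharm': [1, 0, 0, 1, 1, 1],
})

# value_counts() sorts by frequency, so the first N labels are the top N.
top_countries = toy['Q3'].value_counts().index[:2]

# Share of PyCharm users within each top country, kept in ranking order.
toy_country_share = (
    toy[toy['Q3'].isin(top_countries)]
    .groupby('Q3')['PyCharm']
    .mean()
    .reindex(top_countries)
)
print(toy_country_share)
```

Note how the country with a single respondent would show a 100% share if it were included, which is one reason to limit the comparison to countries with many responses.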

In [None]:
# Make a sorted list of country by number of responses.
country_sorted = mc_and_ide.groupby('Q3').count().sort_values('Q1', ascending=False).index

country_ide_stats = mc_and_ide.groupby('Q3')[['Jupyter/IPython','RStudio','PyCharm',
                                             'Visual Studio Code',
                   'nteract','Atom','MATLAB','Visual Studio','Notepad++','Sublime Text',
                   'Vim','IntelliJ','Spyder','None','Other']] \
    .mean()
country_ide_stats.index = pd.Categorical(country_ide_stats.index, country_sorted)
country_ide_stats.sort_index()[:10] \
    .rename({'United Kingdom of Great Britain and Northern Ireland':'GB/N.Ireland',
             'United States of America':'USA'}) \
    .T \
    .sort_values('USA', ascending=False) \
    .style.background_gradient(cmap, axis=0).format("{:.0%}")

In [None]:
country_ide_stats.sort_index()[:6].drop('Other',axis=1) \
    .rename({'United Kingdom of Great Britain and Northern Ireland':'GB/N.Ireland',
             'United States of America':'USA'}) \
    .T \
    .sort_values('USA', ascending=False) \
    .plot(kind='bar', figsize=(15, 5), title='Percentage of Respondents who use IDE for Top 5 Countries')
plt.show()

# IDE Use By Age
Surprisingly, we don't see any major visual differences in the distribution of IDE use across age groups. It does appear that RStudio is proportionally less popular among the younger respondents (ages 18-24). Looking at the distribution for each IDE, we can see that some have higher use among younger respondents.
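
To compare age groups of very different sizes, each row of counts is rescaled so that it sums to 100%. A minimal sketch with made-up counts:

```python
import pandas as pd

# Made-up counts of IDE selections per age group, for illustration only.
toy_counts = pd.DataFrame(
    {'Jupyter/IPython': [30, 10], 'RStudio': [10, 10]},
    index=['18-21', '22-24'],
)

# Divide every row by its own total, then express as a percentage.
toy_pct = toy_counts.div(toy_counts.sum(axis=1), axis=0) * 100
print(toy_pct)
```

Because respondents can select several IDEs, these are shares of all selections within an age group, not shares of respondents.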

In [None]:
age_ide_counts = mc_and_ide.groupby('Q2')[['Jupyter/IPython','RStudio','PyCharm','Visual Studio Code',
                   'nteract','Atom','MATLAB','Visual Studio','Notepad++','Sublime Text',
                   'Vim','Spyder']] \
    .sum()

age_ide_counts.apply(lambda x: x / x.sum() * 100, axis=1) \
    .plot(kind='bar', stacked=True, figsize=(15, 5), title="IDE Use by Age", colormap=plt.get_cmap('tab20'))
plt.style.use('ggplot')
plt.ylabel('Percent total Use')
plt.xlabel('Age')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=4, ncols=3, sharex=True, sharey=True, figsize=(15, 10))
n = 1
for col in age_ide_counts.columns:
    plt.subplot(4, 3, n)
    age_ide_counts[col].plot.bar(title=col, color=color_pal[n])
    plt.xlabel('Age')
    plt.ylabel('Count')
    n += 1
plt.subplots_adjust(hspace = 0.9)
#fig.tight_layout()
#fig.subplots_adjust(top=0.93)
plt.show()

# Freeform Responses
We should also look at the freeform responses. The word cloud shows the most commonly listed ones: Eclipse, Emacs, NetBeans, Xcode, and Octave. These should be added as options in the 2019 survey.
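
Freeform answers need a little normalization before counting, otherwise "Eclipse" and "eclipse" end up as separate entries. A rough sketch on invented answers (the whitespace stripping is an extra step I am assuming is harmless here):

```python
import pandas as pd

# Invented freeform answers for illustration only.
toy_ff = pd.Series(['Eclipse', 'eclipse ', 'Emacs', None, 'ECLIPSE'])

# Drop blanks, trim whitespace, lowercase, then count identical answers.
toy_ff_counts = toy_ff.dropna().str.strip().str.lower().value_counts()
print(toy_ff_counts)
```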

In [None]:
ff = pd.read_csv('../input/freeFormResponses.csv')
ff['count'] = 1
ff['IDE_lower'] = ff['Q13_OTHER_TEXT'].str.lower()
ff.drop(0)[['IDE_lower','count']].groupby('IDE_lower').sum()[['count']].sort_values('count', ascending=False)

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
plt.figure(figsize=[15,8])

# Create and generate a word cloud image:
ide_words = ' '.join(ff['IDE_lower'].drop(0).dropna().values)
wordcloud = WordCloud(colormap="Blues",
                      width=1200,
                      height=480,
                      normalize_plurals=False).generate(ide_words)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

# The 4 Types of Kagglers (by IDE use)
## *This section is still in progress, please let me know if you have suggestions for better describing these clusters*

Next we will use the KMeans clustering algorithm to divide the Kaggle survey respondents into 4 groups based solely on their IDE use. We will then see whether we can gain any insights from these four groups.
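
To get a feel for what KMeans does with 0/1 data, here is a tiny sketch using scikit-learn on a made-up matrix with two obvious usage patterns. The data, the choice of 2 clusters, and the `n_init` value are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 0/1 matrix (columns: Jupyter, RStudio, Notepad++); values are made up.
toy_X = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 1, 1],
])

# Fit two clusters; random_state makes the run repeatable.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=1)
toy_labels = toy_model.fit_predict(toy_X)

# Each center is the per-column mean of its members, i.e. the share of
# that cluster's respondents who use each IDE.
toy_centers = toy_model.cluster_centers_
print(toy_labels)
print(toy_centers)
```

The cluster numbers themselves are arbitrary, so any descriptive names have to be assigned afterwards by inspecting which IDEs dominate each cluster.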

In [None]:
ide_qs_binary = ide_qs.rename(columns=column_rename).fillna(0).replace('[^\\d]',1, regex=True)
ide_qs_binary['no reponse'] = ide_qs_binary.sum(axis=1).apply(lambda x: 1 if x == 0 else 0)
ide_qs_binary = ide_qs_binary.loc[ide_qs_binary['no reponse'] == 0].drop('no reponse', axis=1).copy()

# Make the clusters
from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=4, random_state=1).fit_predict(ide_qs_binary)
ide_qs_binary['cluster'] = y_pred

# Name the clusters
y_pred_named = ['Cluster1' if x == 0 else \
                'Cluster2' if x == 1 else \
                'Cluster3' if x == 2 else \
                'Cluster4' for x in y_pred]

ide_qs_binary['cluster_name'] = y_pred_named

cluster1 = ide_qs_binary.loc[ide_qs_binary['cluster'] == 0]
cluster2 = ide_qs_binary.loc[ide_qs_binary['cluster'] == 1]
cluster3 = ide_qs_binary.loc[ide_qs_binary['cluster'] == 2]
cluster4 = ide_qs_binary.loc[ide_qs_binary['cluster'] == 3]

ide_qs_binary = ide_qs_binary.replace({ide_qs_binary.groupby('cluster_name').sum().sort_values('Jupyter/IPython', ascending=False).iloc[0].name: 'Jupyter Lovers',
                     ide_qs_binary.groupby('cluster_name').sum().sort_values('Jupyter/IPython', ascending=True).iloc[0].name: 'Anti-Jupyters',
                     ide_qs_binary.groupby('cluster_name').sum().sort_values('RStudio', ascending=False).iloc[0].name: 'RStudio and Jupyter',
                     ide_qs_binary.groupby('cluster_name').sum().sort_values('PyCharm', ascending=False).iloc[0].name: 'Jack of All IDEs'}).copy()

mc_and_ide['cluster_name'] = ide_qs_binary['cluster_name']
mc_and_ide['cluster_name'] = mc_and_ide['cluster_name'].fillna('No Response')
mc_and_ide['count'] = 1

Let's return to the Venn diagrams from before, so that we can get a sense of our four clusters. Note that the Anti-Jupyters don't have any Jupyter users, so their diagram only compares RStudio and Notepad++.

In [None]:
def ven3_jrn(df):
    if df['cluster_name'].iloc[0] == 'Anti-Jupyters':
        return venn2(subsets=(len(df.loc[(df['RStudio'] == 1) & (df['Notepad++'] == 0)]),
                              len(df.loc[(df['RStudio'] == 0) & (df['Notepad++'] == 1)]),
                              len(df.loc[(df['RStudio'] == 1) & (df['Notepad++'] == 1)])),
                    set_labels=('RStudio', 'Notepad++'))
    if df['cluster_name'].iloc[0] == 'Jupyter Lovers':
        return venn2(subsets=(len(df.loc[(df['Jupyter/IPython'] == 1) & (df['Notepad++'] == 0)]),
                              len(df.loc[(df['Jupyter/IPython'] == 0) & (df['Notepad++'] == 1)]),
                              len(df.loc[(df['Jupyter/IPython'] == 1) & (df['Notepad++'] == 1)])),
                    set_labels=('Jupyter/IPython', 'Notepad++'))
    return venn3(subsets=(len(df.loc[(df['Jupyter/IPython'] == 1) & (df['RStudio'] == 0) & (df['Notepad++'] == 0)]),
               len(df.loc[(df['Jupyter/IPython'] == 0) & (df['RStudio'] == 1) & (df['Notepad++'] == 0)]),
               len(df.loc[(df['Jupyter/IPython'] == 1) & (df['RStudio'] == 1) & (df['Notepad++'] == 0)]),
               len(df.loc[(df['Jupyter/IPython'] == 0) & (df['RStudio'] == 0) & (df['Notepad++'] == 1)]),
               len(df.loc[(df['Jupyter/IPython'] == 1) & (df['RStudio'] == 0) & (df['Notepad++'] == 1)]),
               len(df.loc[(df['Jupyter/IPython'] == 0) & (df['RStudio'] == 1) & (df['Notepad++'] == 1)]),
               len(df.loc[(df['Jupyter/IPython'] == 1) & (df['RStudio'] == 1) & (df['Notepad++'] == 1)])),
      set_labels=('Jupyter', 'RStudio', 'Notepad++'))

plt.figure(figsize=(15, 8))
n = 1

for i, d in ide_qs_binary.groupby('cluster_name'):
    plt.subplot(1, 4, n)
    ven3_jrn(d)
    plt.title(i)
    n += 1
plt.show()

In [None]:
ide_qs_binary.groupby('cluster_name') \
    .sum() \
    .T \
    .sort_values('Jupyter Lovers', ascending=False) \
    .T \
    .drop('cluster', axis=1) \
    .plot(kind='bar', figsize=(15, 5), title='IDE by Cluster Group', rot=0, colormap=plt.get_cmap('tab20'))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()

With this information we will group our population into four distinct groups:

Group 1 - **The Jupyter Lover** (5549 respondents in this group)

 <img src="https://ih1.redbubble.net/image.457277318.8824/ra,unisex_tshirt,x2200,heather_grey,front-c,392,146,750,1000-bg,f8f8f8.jpg" alt="Jupyter Lover" style="width: 200px;">
- These people tend to use Jupyter more than any other IDE, and they also tend to use Jupyter exclusively.

Group 2 - **Jack of All IDEs** (4222 respondents in this group)
<img src="https://cdn-images-1.medium.com/max/1200/1*rV2Jl_qaqzCzCp9Vf0_kUw.png" alt="Jupyter Lover" style="width: 300px;">
- A slightly smaller group of Kagglers than the others. These Kagglers use Jupyter like group 1, but also use Notepad++, RStudio and other IDEs.

Group 3 - **RStudio + Jupyter** (5872 respondents in this group)
 <img src="https://ih1.redbubble.net/image.319703067.4496/ra,womens_tshirt,x1900,fafafa:ca443f4786,front-c,265,125,750,1000-bg,f8f8f8.u1.jpg" alt="Jupyter Lover" style="width: 200px;">
 - These respondents use RStudio but also use Jupyter.
 
Group 4 - **Never Jupyters (and non-coders)** (4588 respondents in this group)
 <img src="https://i.imgflip.com/2m8ycy.jpg" alt="Jupyter Lover" style="width: 400px;">
- These kagglers don't use Jupyter at all. While some use IDEs, many of them selected no IDEs at all. This group contains many non-coders but also very specialized IDE users that use one IDE exclusively.

**NO RESPONSE GROUP** (4743 respondents) - We dropped the respondents who didn't select anything for this question. The assumption is that they skipped the question entirely, since they didn't even select 'None', so we will not use them for clustering.



## What other insights can we pull from these groups?
We can start by looking at the top 3 answers each group gave to some of the other survey questions.




|           | Jupyter Lovers                                    | Jack of All IDEs                                  | RStudio + Jupyter                            |                    No Jupyter                   | No Response                                       |
|-----------|---------------------------------------------------|---------------------------------------------------|----------------------------------------------|:-----------------------------------------------:|---------------------------------------------------|
| Age       | 1. 25-29 2. 22-24 3. 30-34                        | 1. 25-29 2. 22-24 3. 30-34                        | 1. 25-29 2. 30-34 3. 22-24                   |            1. 25-29 2. 22-24 3. 30-34           | 1. 22-24 2. 25-29 3. 18-21                        |
| Country   | 1. India 2. USA 3. China                          | 1. USA 2. India 3. China                          | 1. USA 2. India 3. UK                        |             1. USA 2. India 3. China            | 1. India 2. USA 3. China                          |
| Job Title | 1. Student 2. Data Scientist 3. Software Engineer | 1. Student 2. Software Engineer 3. Data Scientist | 1. Data Scientist 2. Student 3. Data Analyst | 1. Student 2. Software Engineer 3. Data Analyst | 1. Student 2. Software Engineer 3. Data Scientist |
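
A table of top answers per group like the one above can be produced with a short loop. This is only a sketch on invented rows, not the code that generated the table; it keeps the top 2 answers to stay small:

```python
import pandas as pd

# Invented rows for illustration only; cluster labels 'A' and 'B' are placeholders.
toy = pd.DataFrame({
    'cluster_name': ['A'] * 6 + ['B'] * 3,
    'Q6': ['Student', 'Student', 'Student',
           'Data Scientist', 'Data Scientist', 'Data Analyst',
           'Data Scientist', 'Data Scientist', 'Student'],
})

# For each cluster, rank the answers by frequency and keep the top 2.
toy_top = {
    name: group['Q6'].value_counts().head(2).index.tolist()
    for name, group in toy.groupby('cluster_name')
}
print(toy_top)
```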

In [None]:
# groups = [('count','Jupyter Lovers'),
#          ('count','Jack of All IDEs'),
#          ('count','RStudio and Jupyter'),
#          ('count','Anti-Jupyters'),
#          ('count','No Response'),]

# for col in mc_and_ide.columns:
#     print('-----' +col)
#     print('-----------------' + mc[col][0])
#     for group in groups:
#         print(group)
#         print('-----{}------'.format(group[1]))
#         print(mc_and_ide.groupby(['cluster_name',col])[['count']].sum().unstack('cluster_name').sort_values(group, ascending=False)[[group]][:3])

# Work in progress - Plot the groups as % of other questions....

In [None]:
for col in df.columns:
    df[[col]].sort_values(col, ascending=True).plot.barh(title = col, figsize=(10, 5))
    plt.show()

## TODO - think of the best way to display this data without overloading the reader and showing how each *group* is unique.

In [None]:
def plot_question_groups(question):
    mc_and_ide.groupby(['cluster_name', question]) \
        .count()['count'] \
        .unstack('cluster_name') \
        .apply(lambda x: x / x.sum(), axis=1) \
        .sort_values('Jupyter Lovers', ascending=False) \
        .sort_values('Anti-Jupyters') \
        .plot.barh(stacked=True, figsize=(15, 5), title=mc[question][0])
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.show()

In [None]:
plot_question_groups('Q1')
plot_question_groups('Q2')
plot_question_groups('Q3')
plot_question_groups('Q5')
plot_question_groups('Q6')
plot_question_groups('Q7')
plot_question_groups('Q8')
plot_question_groups('Q9')
plot_question_groups('Q10')

In [None]:
plot_question_groups('Q17')
plot_question_groups('Q18')
plot_question_groups('Q20')
plot_question_groups('Q22')

In [None]:
plot_question_groups('Q23')
plot_question_groups('Q24')
plot_question_groups('Q25')

In [None]:
plot_question_groups('Q26')

In [None]:
plot_question_groups('Q32')

In [None]:
plot_question_groups('Q37')

In [None]:
plot_question_groups('Q40')

In [None]:
plot_question_groups('Q43')

In [None]:
plot_question_groups('Q46')

In [None]:
plot_question_groups('Q48')

# Conclusion
Here I will discuss some main takeaway points from this analysis.....

# MORE TO COME - Check back soon. - Please upvote if you enjoyed it!