## Python and R. Let's compare the users

The Kaggle survey asks a lot of interesting questions, and one of them touches a hot topic: what language do you use? There are plenty of arguments on the internet about "the best" language, and they are relevant not only for Data Science but for programming in general. I was interested in Kagglers who use Python and R, so I split the respondents into 4 groups:
* use both Python and R
* use only Python, not R
* use only R, not Python
* use neither Python nor R

Then I compare the users in these four groups to see how similar or different they are. It was a lot of fun to do, and I hope you'll enjoy reading my notebook too :)
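
The four-way split described above can be sketched on a toy table. This is only an illustration: the rows are made up, the column names `Q7_Part_1` and `Q7_Part_2` mirror the ones used further down in this notebook, and `np.select` is just one convenient way to assign the labels.

```python
import numpy as np
import pandas as pd

# Made-up respondents: each multi-select column holds the answer text or NaN.
toy = pd.DataFrame({
    'Q7_Part_1': ['Python', 'Python', np.nan, np.nan],
    'Q7_Part_2': ['R', np.nan, 'R', np.nan],
})

# eq() returns False for NaN, so missing answers count as "does not use".
uses_python = toy['Q7_Part_1'].eq('Python')
uses_r = toy['Q7_Part_2'].eq('R')

# np.select takes the label of the first condition that is True for each row.
toy['Python_R'] = np.select(
    [uses_python & uses_r, uses_python & ~uses_r, ~uses_python & uses_r],
    ['Python & R', 'Python, Not R', 'R, Not Python'],
    default='Not Python, Not R',
)
print(toy['Python_R'].tolist())
```

Every respondent lands in exactly one group, which is what makes the later comparisons between groups straightforward.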

![](https://miro.medium.com/max/640/1*SoBbCn6tUkhiUlgDl80PzQ.jpeg)

In [None]:
import re
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from plotly.subplots import make_subplots

warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)
pd.set_option('display.max_columns', None)

In [None]:
def plot_var_one(var_name: str = '', normalize_by_var: bool = False, title_name: str = ''):
    """
    Create traces for plotting one variable.
    
    Args:
        var_name: name of the variable to plot
        normalize_by_var: normalize values within each category
        title_name: title name to show
    
    """
    
    data_loc = []
    for i, e in enumerate(data['Python_R'].unique()):
        # rename_axis + reset_index(name=...) yields the same two columns on old and new pandas versions
        grouped = (data.loc[data['Python_R'] == e, var_name].value_counts().sort_index()
                   .rename_axis(var_name).reset_index(name='Count'))
        # in two columns variables should be sorted, so I add a dict with mapping for sorting
        if var_name == 'Q6':
            map_dict = {'I have never written code': 6,
                        '5-10 years': 3,
                        '3-5 years': 2,
                        '< 1 years': 0,
                        '1-2 years': 1,
                        '10-20 years': 4,
                        '20+ years': 5}
            grouped['sorting'] = grouped[var_name].apply(lambda x: map_dict[x])
            grouped = grouped.sort_values('sorting', ascending=True)
            
        
        elif var_name == 'Q15':
            map_dict = {'1-2 years': 1,
                         '10-20 years': 6,
                         '2-3 years': 2,
                         '20 or more years': 7,
                         '3-4 years': 3,
                         '4-5 years': 4,
                         '5-10 years': 5,
                         'I do not use machine learning methods': 8,
                         'Under 1 year': 0}
            grouped['sorting'] = grouped[var_name].apply(lambda x: map_dict[x])
            grouped = grouped.sort_values('sorting', ascending=True)
            
        if normalize_by_var:
            d = data[var_name].value_counts().to_dict()
            grouped['Count'] = grouped.apply(lambda row: row.Count / d[row[var_name]], axis=1)
            
        trace = go.Bar(
            x=grouped[var_name],
            y=grouped['Count'],
            name=e,
            legendgroup=i
        )
        data_loc.append(trace)
    return data_loc


def plot_var(var_name: str = '', title_name: str = ''):
    """
    Make plots with a defined variable.
    The first plot is in absolute values, the second one in relative.
    
    Args:
        var_name: name of the variable to plot
        title_name: title name to show
    
    """
    fig = make_subplots(rows=2, cols=1, subplot_titles=[f'Language groups by {title_name} in absolute values',
                                                        f'Language groups by {title_name} in relative values'])
    for tr in plot_var_one(var_name=var_name, title_name=title_name):
        fig.add_trace(tr, row=1, col=1)
    for tr in plot_var_one(var_name=var_name, normalize_by_var=True, title_name=title_name):
        fig.add_trace(tr, row=2, col=1)
        
    fig['layout'].update(height=800, width=1000, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)')
    fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='orange')
    return fig


def make_choice_var(var: str = '', df: pd.DataFrame = pd.DataFrame(), title_name: str = ''):
    """
    Prepare traces for a variable, in which responders could select several answers.
    Args:
        var: name of the variable to plot
        df: data to use for plots
        title_name: title name to show
    
    """
    col_names = [col for col in df.columns if f'{var}_Part' in col]
    small_df = df[col_names]
    counts = []
    nms = []
    for m in col_names:
        # skip options nobody in this group selected, so that labels and counts stay aligned
        if small_df[m].nunique() == 0:
            continue
        counts.append(small_df[m].notna().sum())
        nms.append(small_df[m].value_counts().index[0].strip())

    trace = go.Bar(
        x=nms,
        y=counts,
        name=title_name,
        marker=dict(color='silver'),
        showlegend=False
    )
    return trace


def plot_choice_var(var: str, title_name: str, height: int = 800):
    """
    
    Make plots with a defined variable.
    Separate plot for each group of users.
    
    Args:
        var: name of the variable to plot
        title_name: title name to show
        height: height of the plot
    """
    fig = make_subplots(rows=2, cols=2,subplot_titles=group_names)
    for i, v in enumerate(group_names):
        f = make_choice_var(var=var, df=data.loc[data['Python_R'] == v], title_name=title_name)
        fig.add_trace(f, row=(i // 2) + 1, col=(i % 2) + 1)

    fig['layout'].update(height=height, width=1000, paper_bgcolor='rgba(0,0,0,0)',
                         plot_bgcolor='rgba(0,0,0,0)', title=f'Popular {title_name} by language groups');
    
    fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')
    return fig

In [None]:
data = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
# remove row with technical info
data = data[1:]

# create column with groups
data['Python_R'] = ''
data.loc[(data['Q7_Part_1'] == 'Python') & (data['Q7_Part_2'] == 'R'), 'Python_R'] = 'Python & R'
data.loc[(data['Q7_Part_1'] != 'Python') & (data['Q7_Part_2'] == 'R'), 'Python_R'] = 'R, Not Python'
data.loc[(data['Q7_Part_1'] == 'Python') & (data['Q7_Part_2'] != 'R'), 'Python_R'] = 'Python, Not R'
data.loc[(data['Q7_Part_1'] != 'Python') & (data['Q7_Part_2'] != 'R'), 'Python_R'] = 'Not Python, Not R'

# make some texts shorter, so that plots look better
data['Q3'] = data['Q3'].str.replace('United States of America', 'USA').replace('United Kingdom of Great Britain and Northern Ireland', 'UK')
data['Q4'] = data['Q4'].str.replace('Some college/university study without earning a bachelor’s degree', 'Some study')

# use regex to remove text in brackets, so that plots look better
regex_cols = ['Q9', 'Q17', 'Q18', 'Q23', 'Q37', 'Q39']
for col_name in regex_cols:
    for col in [c for c in data.columns if col_name in c]:
        data[col] = data[col].apply(lambda x: re.sub(r"[\(\[].*?[\)\]]", "", str(x)).strip() if pd.notna(x) else x)
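
To see what the bracket-stripping pattern in the cleaning loop does, here is a small standalone sketch. The helper name `strip_brackets` and the sample strings are only illustrative; the pattern itself is the one used above.

```python
import re

# Same pattern as in the cleaning loop, written as a raw string.
BRACKETS = re.compile(r"[\(\[].*?[\)\]]")


def strip_brackets(text: str) -> str:
    """Drop every (...) or [...] chunk and trim the surrounding whitespace."""
    return BRACKETS.sub("", text).strip()


print(strip_brackets("Visual Studio Code (VSCode)"))
print(strip_brackets("Jupyter (JupyterLab, Jupyter Notebooks, etc)"))
```

The `.*?` part is non-greedy, so each bracketed chunk is removed separately instead of everything between the first opening and the last closing bracket.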

## General information about groups of users

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x=data['Python_R'])
plt.title('Number of users in groups');

As we can see, most people use Python and not R. What is more interesting is that the number of people using only R is very low. I suppose they aren't that interested in Kaggle? R is often used for statistics and beautiful visualizations, but Kaggle isn't a very suitable place for improving statistical skills.

There is also a big group of people who use neither of these two languages. Let's see what languages they use.

In [None]:
# count the non-null answers in every Q7 (programming language) column for people who use neither Python nor R
other_langs = data.loc[data['Python_R'] == 'Not Python, Not R', [col for col in data.columns if 'Q7' in col]].notna().sum()
other_langs.index = ['Python', 'R', 'SQL', 'C', 'C++', 'Java', 'Javascript', 'Julia', 'Swift', 'Bash', 'MATLAB', 'None', 'Other']
other_langs = other_langs.sort_values(ascending=False)

trace = go.Bar(
    x=other_langs.index,
    y=other_langs.values,
    name='Top languages',
)
layout = dict(height=400, width=1000, title="Top languages used by people who use neither Python nor R")
fig = dict(data=[trace], layout=layout)

iplot(fig);

Oh! I suppose most of these people are either Data Analysts or Software Developers.
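
The counts in this chart rely on the survey's multi-select layout: every answer option has its own column, which holds the option text if it was ticked and is empty otherwise. A minimal sketch with made-up rows (the column names follow the Q7 order used above, where part 3 is SQL, part 6 is Java and part 8 is Julia):

```python
import numpy as np
import pandas as pd

# Made-up answers: one column per option, option text if ticked, NaN otherwise.
toy = pd.DataFrame({
    'Q7_Part_3': ['SQL', np.nan, 'SQL'],
    'Q7_Part_6': [np.nan, 'Java', np.nan],
    'Q7_Part_8': [np.nan, np.nan, np.nan],
})

# notna() marks the ticked options; summing the booleans counts them per column.
counts = toy.notna().sum().sort_values(ascending=False)
print(counts)
```

An option that nobody ticked simply gets a count of zero, as the last column does here.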

## Are there differences in countries?

In [None]:
group_names = ['Python & R', 'Not Python, Not R', 'Python, Not R', 'R, Not Python']

fig = make_subplots(rows=2, cols=2, subplot_titles=group_names)
for i, v in enumerate(group_names):
    # value_counts() is sorted in descending order, so head(10) keeps the ten most common countries
    top_countries = data.loc[data['Python_R'] == v, 'Q3'].value_counts().head(10).sort_values()
    fig.add_trace(go.Bar(x=top_countries.values,
                         y=top_countries.index, orientation='h', name=v), row=(i // 2) + 1, col=(i % 2) + 1)
    
fig['layout'].update(height=800, width=1000,paper_bgcolor='rgba(0,0,0,0)',
                     plot_bgcolor='rgba(0,0,0,0)', title='Top countries by language groups');
iplot(fig);

That's quite interesting. The top countries are similar across the groups simply because a lot of people live in them, so it is worth having a look at the other countries.
* As I live in Russia, I can confirm that R isn't very popular here.
* It looks like R is also not really popular in Japan.
* It is noticeable that a lot of people in Nigeria use only Python.

## More information to compare

In [None]:
fig = plot_var(var_name='Q2', title_name='gender')
iplot(fig);

It seems that the difference between genders is insignificant.
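
A note on reading the two panels: the relative plot divides each group's count by the total number of respondents who gave that answer, so the group shares add up to 1 within each answer. The sketch below shows the idea on made-up rows. It uses `pd.crosstab` for brevity instead of the helper functions from this notebook, which should give the same numbers as long as every respondent belongs to one of the groups.

```python
import pandas as pd

# Made-up respondents with an answer (Q2) and a language group each.
toy = pd.DataFrame({
    'Q2': ['Man', 'Man', 'Man', 'Woman', 'Woman'],
    'Python_R': ['Python, Not R', 'Python, Not R', 'Python & R',
                 'Python, Not R', 'Python & R'],
})

# Absolute values: number of respondents per (language group, answer) pair.
absolute = pd.crosstab(toy['Python_R'], toy['Q2'])

# Relative values: each column (answer) is divided by its own total.
relative = pd.crosstab(toy['Python_R'], toy['Q2'], normalize='columns')

print(absolute)
print(relative.round(2))
```

This is why a small category can look unremarkable in the absolute panel and still stand out in the relative one.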

In [None]:
fig = plot_var(var_name='Q4', title_name='education')
iplot(fig);

One interesting observation is that the higher the degree level, the more likely people are to use both Python & R. A lot of PhDs also use R. I suppose they need it for its statistical packages, as an alternative to Matlab and other statistical software?

In [None]:
fig = plot_var(var_name='Q5', title_name='title')
iplot(fig);

It looks like a lot of analysts and managers use neither Python nor R. I suppose the former rely more on SQL, and the latter don't need programming at all.

And statisticians, of course, use R a lot :)

In [None]:
fig = plot_var(var_name='Q6', title_name='years of programming')
iplot(fig);

In [None]:
plot_choice_var('Q9', 'IDE')

It isn't surprising that people using R prefer RStudio and those who use Python prefer Jupyter/VSCode/PyCharm. Notepad++ is more popular than other lightweight text/code editors.

In [None]:
plot_choice_var('Q10', 'hosted notebooks', height=1200)

It isn't really surprising that R users prefer Kaggle Notebooks, because they have R ready to use.

In [None]:
plot_choice_var('Q14', 'data visualization libraries')

R users, of course, use Shiny and ggplot2. It is also worth noting that those who use both Python and R make heavy use of both matplotlib and ggplot2.

In [None]:
fig = plot_var(var_name='Q15', title_name='years of using ML')
iplot(fig);

This is an interesting plot. On the one hand, we could say that the more experienced people are, the more likely they are to use both Python and R. But I think it is the other way round: experienced people may have started using R as an alternative to software like SAS and then later adopted Python as an additional tool.

In [None]:
plot_choice_var('Q16', 'ML frameworks')

Here we can see a lot of language-specific libraries, like Caret and Tidymodels on the one hand and Scikit-learn on the other. It also seems that some R users use `reticulate` to get access to some Python packages.

In [None]:
plot_choice_var('Q17', 'ML algorithms')

Here we can see one of the biggest differences - almost no R users use deep learning algorithms.

In [None]:
plot_choice_var('Q36', 'sites to share work')

I find it fascinating that R users prefer using personal blogs for sharing their work. I suppose it is easier to share Python projects on GitHub than R ones.

In [None]:
plot_choice_var('Q37', 'platforms for courses')

* Everyone uses Coursera :) Let's praise it :pray:
* R users prefer edX and DataCamp.
* New Python users like Kaggle Learn Courses.
* Those who use neither Python nor R use Udemy a lot, as it provides a lot of cheap courses.

In [None]:
plot_choice_var('Q39', 'media sources')

In the end, I think that while there are some differences between Python and R users, they aren't very serious. So we can embrace one or both of the languages and do what we like :)