# **Introduction**
**The 2021 Kaggle DS & ML Survey received 25,973 usable responses from participants in 171 different countries and territories. It covers topics such as programming tools, machine learning usage, cloud platforms, and big data products. The results give a general picture of the current data science and machine learning landscape.**

**First, install the visualization package "pyecharts", then load the necessary packages.**

In [None]:
pip install pyecharts

In [None]:
import pandas as pd
import numpy as np

from pyecharts import options as opts
from pyecharts.charts import Pie,Grid,Bar,Map,Sankey
from pyecharts.commons.utils import JsCode
from pyecharts.globals import ThemeType
from collections import defaultdict

from pyecharts.globals import CurrentConfig, NotebookType
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_NOTEBOOK
pd.options.mode.chained_assignment = None

In [None]:
df=pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
print(df.shape)
df.head()

**The results are stored as a CSV file: each row holds the answers from one interviewee and the columns correspond to the questions. Some questions are single choice and the others are multiple choice.**
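
The helper functions below tell the two question types apart by column name: a question that owns a single column is treated as single choice, and a question spread over several prefixed columns is treated as multiple choice. A minimal sketch of that rule, using hypothetical column names that only imitate the survey's naming pattern:

```python
# Hypothetical column names following the survey's naming pattern.
columns = ["Q1", "Q2", "Q7_Part_1", "Q7_Part_2", "Q7_OTHER", "Q27_A_Part_1"]

def question_columns(columns, q):
    """Return the columns that belong to question `q` (same rule as the notebook)."""
    return [c for c in columns if c == q or c.startswith(f"{q}_")]

def is_multiple_choice(columns, q):
    """A question spread over several columns is treated as multiple choice."""
    return len(question_columns(columns, q)) > 1

print(question_columns(columns, "Q7"))
print(is_multiple_choice(columns, "Q1"))
```

Matching on `q + "_"` rather than on `q` alone keeps `Q2` from picking up the `Q27_A` columns.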

In [None]:
df=df.iloc[1:,:]
df.head()

**I skip the first row, which contains the full text of each question.**
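
If the question text is still needed later (for chart titles, for example), it could be kept in a lookup before the row is dropped. This is only a sketch on a made-up two-column frame; the names `raw`, `question_text` and `answers` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy frame imitating the survey layout: the first row holds the question text.
# The values are made up for illustration.
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "18-21"],
    "Q2": ["What is your gender?", "Man", "Woman"],
})

# Keep the question text as a {column: question} lookup ...
question_text = raw.iloc[0].to_dict()

# ... and keep only the real answers, with a clean 0-based index.
answers = raw.iloc[1:].reset_index(drop=True)

print(question_text["Q1"])
print(answers)
```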

# **About personal information**

In [None]:
def single_pie(q,title):
    cols=[x for x in df.columns if x.startswith('%s_'%q) or x==q]
    choice={col:[i for i in set(df[col].dropna())][0] for col in cols}
    if len(cols)==1:
        df_draw=df[q].value_counts()
    else:
        idx=[]
        df_draw=[]
        for col in cols:
            tmp=df[col].value_counts()
            if len(tmp)==0:
                df_draw.append(0)
            else:
                df_draw.append(tmp.iloc[0])
            idx.append(choice[col])
        df_draw=pd.Series(df_draw,index=idx).sort_values(ascending=False)

    pie=(
        Pie(init_opts=opts.InitOpts(width="1200px", height="800px"))
        .add(
            series_name="count/ratio",
            radius=["40%", "55%"],
            data_pair=[list(z) for z in zip(df_draw.index,df_draw.iloc[:])],
            rosetype="radius",
            label_opts=opts.LabelOpts(
                position="outside",
                formatter="{a|{a}}{abg|}\n{hr|}\n {b|{b}: }{c}  {per|{d}%}  ",
                background_color="#eee",
                border_color="#aaa",
                border_width=1,
                border_radius=4,
                rich={
                    "a": {"color": "#999", "lineHeight": 22, "align": "center"},
                    "abg": {
                        "backgroundColor": "#e3e3e3",
                        "width": "100%",
                        "align": "right",
                        "height": 22,
                        "borderRadius": [4, 4, 0, 0],
                    },
                    "hr": {
                        "borderColor": "#aaa",
                        "width": "100%",
                        "borderWidth": 0.5,
                        "height": 0,
                    },
                    "b": {"fontSize": 16, "lineHeight": 33},
                    "per": {
                        "color": "#eee",
                        "backgroundColor": "#334455",
                        "padding": [2, 4],
                        "borderRadius": 2,
                    },
                },
            ),
        )
        .set_global_opts(legend_opts=opts.LegendOpts(pos_left="right", orient="vertical",type_='scroll'),title_opts=opts.TitleOpts(title=title))
        .set_series_opts(
            tooltip_opts=opts.TooltipOpts(
                trigger="item", formatter="{a} <br/>{b}: {c} ({d}%)"
            )
        )
    )
    return pie
    

pie=single_pie('Q1','Age')
pie.render_notebook()

**More than half of the interviewees are 18-29 years old.**
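
A share like this can also be read directly from the answer counts rather than from the chart. The sketch below uses made-up answers, not the survey data, and assumes the age buckets are labelled as shown:

```python
import pandas as pd

# Made-up answers for illustration only; not the survey data.
ages = pd.Series(["18-21", "22-24", "25-29", "30-34",
                  "25-29", "40-44", "18-21", "22-24"])

# Relative frequency of each age bucket.
share = ages.value_counts(normalize=True)

# Combined share of the three youngest buckets (18-29).
young = share.reindex(["18-21", "22-24", "25-29"]).sum()
print(f"{young:.0%}")
```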

In [None]:
def nested_pie(inner_feature,outer_feature):
    df_new=df.groupby([inner_feature])[outer_feature].apply(lambda x:x.value_counts()).unstack()
    df_new['total']=df_new.sum(axis=1)

    inner_x_data=list(df_new.index)
    inner_y_data=df_new['total']
    inner_data_pair = [list(z) for z in zip(inner_x_data, inner_y_data)]

    df_new=df_new.drop('total',axis=1)
    outer_x_data=list(df_new.columns)*len(df_new.index)
    data_tmp=np.array(df_new).flatten()
    data_tmp[np.isnan(data_tmp)]=0.
    outer_y_data=data_tmp.tolist()
    outer_data_pair = [list(z) for z in zip(outer_x_data, outer_y_data)]

    pie=(
        Pie(init_opts=opts.InitOpts(width="1600px", height="1000px"))
        .add(
            series_name="count/ratio",
            data_pair=inner_data_pair,
            radius=[0, "30%"],
            label_opts=opts.LabelOpts(position="inner"),
        )
        .add(
            series_name="count/ratio",
            radius=["40%", "55%"],
            data_pair=outer_data_pair,
            label_opts=opts.LabelOpts(
                position="outside",
                formatter="{a|{a}}{abg|}\n{hr|}\n {b|{b}: }{c}  {per|{d}%}  ",
                background_color="#eee",
                border_color="#aaa",
                border_width=1,
                border_radius=4,
                rich={
                    "a": {"color": "#999", "lineHeight": 22, "align": "center"},
                    "abg": {
                        "backgroundColor": "#e3e3e3",
                        "width": "100%",
                        "align": "right",
                        "height": 22,
                        "borderRadius": [4, 4, 0, 0],
                    },
                    "hr": {
                        "borderColor": "#aaa",
                        "width": "100%",
                        "borderWidth": 0.5,
                        "height": 0,
                    },
                    "b": {"fontSize": 16, "lineHeight": 33},
                    "per": {
                        "color": "#eee",
                        "backgroundColor": "#334455",
                        "padding": [2, 4],
                        "borderRadius": 2,
                    },
                },
            ),
        )
        .set_global_opts(legend_opts=opts.LegendOpts(pos_left="left", orient="vertical"))
        .set_series_opts(
            tooltip_opts=opts.TooltipOpts(
                trigger="item", formatter="{a} <br/>{b}: {c} ({d}%)"
            )
        )
    )
    return pie

In [None]:
pie=nested_pie('Q2','Q1')
pie.render_notebook()

**Most interviewees are men, accounting for about 80%, while women account for about 20%. Among male respondents, the most frequent age group is 25-29, while it is 18-21 for female respondents.**
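
The same two readings (the share of each gender and the most common age group within each gender) can be taken from a cross-tabulation. A minimal sketch on made-up data, assuming `Q2` holds gender and `Q1` holds age as in the chart above:

```python
import pandas as pd

# Made-up answers for illustration only.
toy = pd.DataFrame({
    "Q2": ["Man", "Man", "Man", "Woman", "Woman", "Woman"],
    "Q1": ["25-29", "25-29", "18-21", "18-21", "18-21", "22-24"],
})

# Rows: gender, columns: age group, cells: number of respondents.
table = pd.crosstab(toy["Q2"], toy["Q1"])

# Inner ring of the nested pie: share of each gender.
gender_share = table.sum(axis=1) / table.to_numpy().sum()

# Most frequent age group within each gender.
top_age = table.idxmax(axis=1)

print(gender_share)
print(top_age)
```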

In [None]:
def draw_country():
    df_new=df['Q3'].value_counts().sort_values(ascending=False)
    countries=pd.Series(df_new.index)
    countries=countries.replace('United States of America','United States')
    
    # Named world_map so the built-in map() is not shadowed inside the function.
    world_map = (
        Map()
        .add("Number of interviewees", [list(z) for z in zip(countries, df_new.iloc[:])], "world")
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Interviewees around the world"),
            visualmap_opts=opts.VisualMapOpts(max_=2000),
        )
    )
    return world_map

In [None]:
pie=single_pie('Q3','Country')
pie.render_notebook()

In [None]:
map=draw_country()
map.render_notebook()

**The largest group of interviewees comes from India, accounting for more than 25%, followed by more than 2,600 respondents from the United States.**

In [None]:
def get_SankeyData(questions):
    cols_all=[]
    multi_choice=[]
    for q in questions:
        col_tmp=[x for x in df.columns if x.startswith('%s_'%q) or x==q]
        if len(col_tmp)>1:
            multi_choice.extend(col_tmp)
        cols_all.extend(col_tmp)
    df_new=df[cols_all]
    cols=list(df_new.columns)

    replace_choices=['NoAnwser','None','Other']
    nodes=[]
    name_set=set()
    surfix=defaultdict(int)
    for i in range(len(cols)):
        values=df_new[cols[i]].unique()
        for value in values:
            if not pd.isnull(value):
                if value in name_set:
                    nodes.append({'name':'%s_%d'%(value,surfix[value])})
                    df_new[cols[i]]=df_new[cols[i]].replace(value,'%s_%d'%(value,surfix[value]))
                    surfix[value]+=1
                else:
                    nodes.append({'name':value})
                    name_set.add(value)

    df_new['count']=1
    new_cols=['source','target','value']
    groups=[]
    for i in range(1,len(cols)):
        if cols[i-1] not in multi_choice:
            pre_choice=cols[i-1]
        if cols[i] not in multi_choice:
            j=i-1
            if cols[j] in multi_choice:
                while j>=0 and cols[j] in multi_choice:
                    df_tmp=df_new.groupby([cols[j],cols[i]])['count'].sum().reset_index()
                    df_tmp.columns=new_cols
                    groups.append(df_tmp)
                    j-=1
            else:
                df_tmp=df_new.groupby([cols[j],cols[i]])['count'].sum().reset_index()
                df_tmp.columns=new_cols
                groups.append(df_tmp)
        else:
            # Multiple-choice column: link it to the last single-choice column.
            df_tmp=df_new.groupby([pre_choice,cols[i]])['count'].sum().reset_index()
            df_tmp.columns=new_cols
            groups.append(df_tmp)
    df_concat=pd.concat(groups)

    links=[]
    for item in df_concat.values:
        links.append({k:v for k,v in zip(new_cols,item) if not(pd.isnull(item[0]) or pd.isnull(item[1]))})
    return nodes,links

def draw_Sankey(cols,title=''):
    nodes,links=get_SankeyData(cols)
    sankey=(
        Sankey(init_opts=opts.InitOpts(width="1200px", height="800px"))
        .add('',
            nodes,
            links,
            linestyle_opt=opts.LineStyleOpts(opacity=0.3,curve=0.5,color='source'),
            label_opts=opts.LabelOpts(position='right'),
            #node_gap=30
        )
        .set_global_opts(title_opts=opts.TitleOpts(title=title))
    )
    return sankey

In [None]:
pie=single_pie('Q5','Identity')
pie.render_notebook()

**More than 1/4 of respondents are students, followed by data scientists and software engineers.**

In [None]:
q=['Q4','Q5','Q6']
s=draw_Sankey(q)
s.render_notebook()

**More than half of the interviewees hold a Master's or Bachelor's degree, and most of these are students and data scientists. Doctoral degrees are held mainly by research scientists. Compared with other interviewees, most students have less programming experience (<3 years). Data scientists make up the largest share of respondents with 5-10 years of coding experience, while software engineers make up the largest share of those with more than 20 years of coding experience.**
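
A Sankey chart is driven by a list of nodes and a list of weighted links between them. The sketch below shows, under simplifying assumptions, how the links between two adjacent single-choice questions can be counted; the answer pairs are made up, and the real `get_SankeyData` additionally handles multiple-choice columns and repeated labels:

```python
from collections import Counter

# Made-up (degree, role) answer pairs; None marks a skipped question.
answers = [
    ("Master's degree", "Student"),
    ("Master's degree", "Data Scientist"),
    ("Bachelor's degree", "Student"),
    ("Master's degree", "Student"),
    (None, "Student"),
]

# Count each (source, target) pair, ignoring rows with a missing answer.
pair_counts = Counter(
    (s, t) for s, t in answers if s is not None and t is not None
)

# One link per distinct pair, weighted by the number of respondents.
links = [{"source": s, "target": t, "value": n}
         for (s, t), n in pair_counts.items()]

# One node per distinct label that appears in a link.
nodes = [{"name": name}
         for name in sorted({label for pair in pair_counts for label in pair})]

print(links)
print(nodes)
```

Each label appears as exactly one node here. When the same label occurs in two different questions, the notebook appends a numeric suffix to keep the node names distinct.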

# **About programming environments**

In [None]:
def draw_groupby(by,feature):
    cols=[x for x in df.columns if x.startswith('%s_'%feature) or x==feature]
    choice={col:[i for i in set(df[col].dropna())][0] for col in cols}

    if len(cols)>1:
        df_group=list(df.groupby(by))
        n_group=len(df_group)
        idx=[df_group[i][0] for i in range(n_group)]
        dic_all=defaultdict(list)
        for i in range(n_group):
            df_tmp=df_group[i][1]
            for col in cols:
                tmp=df_tmp[col].value_counts()
                if len(tmp)==0:
                    dic_all[choice[col]].append(0)
                else:
                    dic_all[tmp.index[0]].append(tmp.iloc[0])
        df_new=pd.DataFrame(dic_all,index=idx)
    else:
        df_new=df.groupby(by)[cols[0]].apply(lambda x:x.value_counts().sort_values(ascending=False)).unstack()

    df_new['total']=df_new.sum(axis=1)
    df_new=df_new.sort_values(by=['total'],ascending=True)
    #print(df_new)

    df_ratio=df_new.apply(lambda x:x/x['total'],axis=1)

    cols=df_new.columns[:-1]
    rows=df_new.index

    draw_lists=[]
    for col in cols:
        percents=df_new[col]/df_new['total']
        tmp=[{'value':v,'percent':p} for v,p in zip(df_new[col],percents)]
        #print(tmp)
        draw_lists.append(tmp)

    bar1=(
        Bar()
        .add_xaxis(list(rows))
        .add_yaxis("",list(df_new['total']),category_gap="50%")
        .set_series_opts(
            label_opts=opts.LabelOpts(
            position="right",
            # formatter=JsCode(
            #     "function(x){return Number(x.data* 100).toFixed() + '%';}"
            # ),
        )
    )
        .reversal_axis()
    )

    bar2=Bar().add_xaxis(list(rows))
    for col in cols:
        bar2.add_yaxis(col,[x*100 for x in df_ratio[col]],stack='stack1',category_gap="50%")
    bar2.set_series_opts(
        label_opts=opts.LabelOpts(
            position="bottom",
            formatter=JsCode(
                "function(x){return Number(x.data).toFixed() + '%';}"
            ),
        )
    )
    bar2.set_global_opts(
        xaxis_opts=opts.AxisOpts(
            type_="value",
            min_=0,
            max_=100,
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
        ),
    )
    bar2.reversal_axis()

    grid = (
        Grid(init_opts=opts.InitOpts(width="1600px", height="800px"))
        .add(bar1, grid_opts=opts.GridOpts(pos_right="55%"))
        .add(bar2, grid_opts=opts.GridOpts(pos_left="55%"))
    )
    return grid

In [None]:
g=draw_groupby('Q8','Q7')
g.render_notebook()

**The left bar chart shows the programming language respondents recommend; most of them suggest learning Python first. The right bar chart shows the languages respondents use. Interestingly, Python is the most frequently used language regardless of which language a respondent recommends, while the second most frequently used language is the one they recommend. Python, R and SQL are clearly the most popular.**
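
The two panels correspond to a table of counts and the same table normalized per row. As a simplified sketch (it treats the language used as a single answer, whereas the notebook counts every selected option, and the data are made up):

```python
import pandas as pd

# Made-up answers for illustration only.
toy = pd.DataFrame({
    "recommended": ["Python", "Python", "Python", "R", "R"],
    "used":        ["Python", "Python", "SQL",    "Python", "R"],
})

# Counts of used language within each recommended language.
counts = pd.crosstab(toy["recommended"], toy["used"])

# Left panel: total per recommended language.
totals = counts.sum(axis=1)

# Right panel: each row rescaled so that it sums to 100%.
percent = pd.crosstab(toy["recommended"], toy["used"], normalize="index") * 100

print(totals)
print(percent.round(1))
```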

In [None]:
p=single_pie('Q9','IDE')
p.render_notebook()

**About 1/4 of respondents use Jupyter Notebook as their IDE, followed by VSCode and PyCharm.**
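
For a multiple-choice question, the percentages in the pie are shares of all selections, which is not the same as the share of respondents who picked an option. A small sketch of the difference, assuming the one-column-per-option layout that `single_pie` handles; the column names and values here are made up:

```python
import pandas as pd

# Toy multi-choice layout: one column per option, missing when not selected.
toy = pd.DataFrame({
    "Q9_Part_1": ["Jupyter Notebook", None, "Jupyter Notebook", "Jupyter Notebook"],
    "Q9_Part_2": ["VSCode", "VSCode", None, None],
    "Q9_Part_3": [None, None, "PyCharm", None],
})

# Number of respondents who selected each option.
selections = toy.notna().sum()
# Label each count with the option text instead of the column name.
selections.index = [toy[c].dropna().iloc[0] for c in toy.columns]

# What the pie shows: share of all selections.
share_of_selections = selections / selections.sum()
# Alternative reading: share of respondents who selected the option.
share_of_respondents = selections / len(toy)

print(share_of_selections)
print(share_of_respondents)
```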

In [None]:
g=draw_groupby('Q5','Q10')
g.render_notebook()

**Among notebook products, Kaggle Notebooks and Colab Notebooks are used most, while about 1/5 of respondents do not use notebooks.**

In [None]:
q=['Q11','Q12','Q13']
s=draw_Sankey(q)
s.render_notebook()

**Most respondents use a laptop or a PC/desktop as their computing platform, and many of them do not use GPUs or TPUs. Many others use NVIDIA GPUs or Google Cloud TPUs, and many people also have experience using TPUs.**

In [None]:
g=draw_groupby('Q5','Q14')
g.render_notebook()

**For visualization libraries, most respondents use Matplotlib, Seaborn and Plotly, while statisticians may use ggplot more often.**

# **About machine learning methods**

In [None]:
q=['Q15','Q5','Q16']
s=draw_Sankey(q)
s.render_notebook()

**Most respondents have used machine learning methods for 1-3 years. Students and data scientists are likely to use frameworks such as Sklearn, TensorFlow, Keras, PyTorch and Xgboost, while other people may use a more diverse set of machine learning frameworks.**

In [None]:
p=single_pie('Q17','Machine learning methods')
p.render_notebook()

**The currently popular machine learning methods are all frequently used by interviewees, including LR, decision trees, random forests, GBMs, CNNs, DNNs and so on.**

In [None]:
q=['Q5','Q18']
s=draw_Sankey(q)
s.render_notebook()

In [None]:
q=['Q5','Q19']
s=draw_Sankey(q)
s.render_notebook()

**Popular CV and NLP methods are also frequently used by respondents across different roles.**

# **About industry and company**

In [None]:
q=['Q20','Q21','Q22','Q23','Q26']
s=draw_Sankey(q)
s.render_notebook()

**There are more respondents from the Computers/Technology and Academics/Education industries than from other industries. Companies with more than 1000 employees are more likely to have more people responsible for data science and may have well-established ML methods. They also spend more on machine learning or cloud computing services.**

In [None]:
p=single_pie('Q24','Important work')
p.render_notebook()

**For most respondents, work related to data and machine learning methods plays an important role in their daily work.**

In [None]:
g=draw_groupby('Q25','Q20')
g.render_notebook()

**Because many interviewees are students, yearly compensation is concentrated in the 0-999 range. People who work in Computers/Technology account for the largest share in every compensation range, and their share is larger in the higher ranges.**

# **About cloud platforms**

In [None]:
g=draw_groupby('Q5','Q27_A')
g.render_notebook()

In [None]:
p=single_pie('Q28','Most enjoyable cloud platforms')
p.render_notebook()

In [None]:
p=single_pie('Q27_B','Cloud platforms respondents want to learn')
p.render_notebook()

**Data scientists use cloud computing platforms most often. Amazon Web Services, Google Cloud Platform and Microsoft Azure are the most popular platforms, and they are also the ones most people want to learn. Many people also think all the platforms offer a similarly enjoyable experience.**

In [None]:
q=['Q5','Q29_A']
s=draw_Sankey(q)
s.render_notebook()

In [None]:
p=single_pie('Q29_B','Cloud computing products respondents want to learn')
p.render_notebook()

**Data scientists are also the main users of cloud computing products, including Amazon Elastic Compute Cloud, Google Cloud Compute Engine and Microsoft Azure Virtual Machines. Many respondents also want to learn these 3 products.**

# **About data and ML products**

In [None]:
g=draw_groupby('Q20','Q30_A')
g.render_notebook()

**The Computers/Technology industry uses data storage products the most, among which Amazon Simple Storage Service and Google Cloud Storage are the most popular.**

In [None]:
g=draw_groupby('Q15','Q31_A')
g.render_notebook()

In [None]:
p=single_pie('Q31_B','Managed machine learning products respondents want to learn')
p.render_notebook()

**Many people do not use managed machine learning products, and people with more machine learning experience tend to use them more. Amazon SageMaker, Azure Machine Learning Studio, Google Cloud Vertex AI and Databricks are the most popular, and they are also the ones many people want to learn.**

In [None]:
g=draw_groupby('Q5','Q33')
g.render_notebook()

In [None]:
p=single_pie('Q32_B','Big data products respondents want to learn')
p.render_notebook()

**For big data products, people in different roles show different preferences. For example, statisticians may use MySQL more and database engineers may use Microsoft SQL Server more. MySQL is clearly the product most people want to learn, and MongoDB is the second most popular, although it may not be used as much as other products such as Microsoft SQL Server.**

In [None]:
g=draw_groupby('Q5','Q35')
g.render_notebook()

In [None]:
p=single_pie('Q34_B','Business intelligence tools respondents want to learn')
p.render_notebook()

**Fewer people use business intelligence tools. Microsoft Power BI and Tableau are the most popular, and many interviewees also want to become more familiar with them.**

# **About Auto Machine Learning**

In [None]:
p=single_pie('Q36_A','Usage of AutoML tools')
p.render_notebook()

In [None]:
p=single_pie('Q36_B','AutoML tools respondents want to learn')
p.render_notebook()

**More than half of the interviewees do not use auto machine learning tools, while some use popular AutoML tools such as auto-sklearn, hyperopt and tpot. However, many people are interested in various AutoML tools and are willing to learn them.**

In [None]:
p=single_pie('Q37_A','Usage of AutoML tools on a regular basis')
p.render_notebook()

In [None]:
p=single_pie('Q37_B','AutoML products respondents want to learn')
p.render_notebook()

**Many people also do not use AutoML tools on a regular basis, while some use Google Cloud AutoML, Azure Automated Machine Learning, Amazon SageMaker Autopilot, etc. These products are also the ones many respondents want to learn.**

In [None]:
p=single_pie('Q38_A','Tools to help manage machine learning experiments')
p.render_notebook()

In [None]:
p=single_pie('Q38_B','ML experiment management tools respondents want to learn')
p.render_notebook()

**Most people do not use tools to help manage machine learning experiments, while some use TensorBoard, MLflow, etc. Many respondents also appear not to be interested in tools for ML experiments, which suggests that many people do not pay much attention to the training process in deep learning. These tools therefore still have great potential.**

# **About media**

In [None]:
p=single_pie('Q39','Platform to share data analysis or machine learning applications')
p.render_notebook()

**Many respondents share their data analysis or ML applications on GitHub, Kaggle, Colab, blogs, etc.**

In [None]:
p=single_pie('Q40','Data science course platforms')
p.render_notebook()

**Most people take data science courses on Coursera, Kaggle, Udemy, etc. People who take courses at a university account for less than 10%.**

In [None]:
g=draw_groupby('Q5','Q41')
g.render_notebook()

**For tools to analyze data, business analysts, project managers and product managers may use basic statistical software more, while data scientists, research scientists and ML engineers may use local development environments more.**

In [None]:
p=single_pie('Q42','Favorite media sources')
p.render_notebook()

**Among media sources that report on data science topics, Kaggle is the most popular, followed by YouTube, blogs, Twitter, etc.**

# **Conclusion**

**From the results above, we can conclude that:**
* Most interviewees are 18-29 years old, the ratio of male to female is about 4:1, and many respondents are from India and the United States. Students make up the largest group of respondents, and most respondents hold a Bachelor's degree or higher.
* Python is the most popular programming language and Jupyter Notebook is the most frequently used IDE. Many respondents do not use GPUs or TPUs.
* Popular methods and tools for visualization, machine learning, CV and NLP are widely used by Kagglers.
* Respondents come from different industries, and companies with more employees are likely to spend more on machine learning.
* Many respondents are not familiar with cloud computing platforms or big data products, because many of them are students.
* Auto machine learning methods and tools are not yet widely used among respondents, so AutoML tools still have great potential.

**Thanks for reading. This is my first notebook on Kaggle, so there may be many problems in it. Any comments and suggestions are welcome.**