# Kaggle Data Science Survey Analysis EDA: Changes from 2020 to 2021 and Difference from Professionals

- **Goal**: to check whether visualization library usage differs between 2020 and 2021, and between all respondents and (experienced) professionals
- **Rationale**: including beginners in the analysis may contaminate the answers to the questions we want to address. By separating beginners out, we can focus on the professional usage of tools for **data visualization and ML modeling**

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from numpy.polynomial.polynomial import polyfit


import warnings
warnings.filterwarnings('ignore')


matplotlib.rcParams['axes.unicode_minus'] = False 


# load data
ks2020 = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
ks2020.head()

ks2021 = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
ks2021.head()

## Age and education level values used to filter out wannabes

In [None]:
# dfk2101 = ks2021[]
# ks2021.Q1.value_counts().index

ageList = ['25-29', '22-24', '30-34', '35-39', '40-44', '45-49', '50-54','55-59']

# ks2021.Q4.value_counts().index
educationList = ['Master’s degree', 'Doctoral degree', 'Professional doctorate']


# print(ks2021.head(1).values)
# ks2021.head(2).T

## 2021 stats on visualization library and ML algorithm

In [None]:
tmpa = ks2021.head(4).T

# get the list of questions regarding the two target subjects to assess
Q_14s = tmpa[tmpa[0].str.contains('What data visualization libraries')].index
Q_17s = tmpa[tmpa[0].str.contains('algorithm')].index

# to get top 10s - visualization libraries
dfq14 = pd.Series(np.array(ks2021[Q_14s].tail(len(ks2021)-1)).flatten()).value_counts().head(10)

colors1 = ['red'] + list(np.repeat('grey',3))
plt.barh(dfq14.head(4).index, dfq14.head(4)/dfq14.head(4).max()*100,
        color=colors1)           
plt.title('Data Visualization Library 2021')
plt.xlabel('% compared to the most popular')
plt.show()

# to get top 10s - ML algorithms
dfq17 = pd.Series(np.array(ks2021[Q_17s].tail(len(ks2021)-1)).flatten()).value_counts().head(10)

colors1 = ['red'] + list(np.repeat('grey',9))
plt.barh(dfq17.index, dfq17/dfq17.max()*100,
        color=colors1)           
plt.title('ML Algorithm 2021')
plt.xlabel('% compared to the most popular')
plt.show()



[Recap]
- Matplotlib is the most popular choice
- **Seaborn**, which is built on top of Matplotlib, is also quite popular
<br><br>
- Classic **regression** is still the baseline
- Trees and simple tree ensembles also get a lot of attention
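
The counts behind these bar charts come from multi-select questions, where each option has its own column. The following is a minimal sketch of that counting step on toy data. The column names and values are hypothetical stand-ins, not the real survey columns.

```python
import pandas as pd

# Toy stand-in for the multi-select columns of one question (hypothetical data).
# Each column holds the option name, or None when the option was not ticked.
toy = pd.DataFrame({
    "Q14_Part_1": ["Matplotlib", "Matplotlib", None, "Matplotlib"],
    "Q14_Part_2": ["Seaborn", None, "Seaborn", None],
    "Q14_Part_3": [None, None, "Plotly", None],
})

# Flatten all option columns into one long Series and count each option.
# value_counts() ignores missing values by default.
counts = pd.Series(toy.to_numpy().ravel()).value_counts()

# Express each count relative to the most popular option, as in the bar charts.
relative = counts / counts.max() * 100
print(relative.round(1))
```

Because the values are scaled by the top item, the chart shows relative popularity rather than the share of respondents.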

In [None]:
# record summaries

dfall = pd.DataFrame()
dfall['vis_itm21'] = dfq14.index
dfall['vis_val21'] = dfq14.values
dfall['alg_itm21'] = dfq17.index
dfall['alg_val21'] = dfq17.values
dfall

## After filtering WANNABEs out - Difference between Pros and Wannabes

In [None]:
ks2021c = ks2021[(ks2021.Q1.isin(ageList)) & (ks2021.Q4.isin(educationList))]

# filtering out beginners
dfq14 = pd.Series(np.array(ks2021c[Q_14s].tail(len(ks2021c))).flatten()).value_counts().head(10)
dfq17 = pd.Series(np.array(ks2021c[Q_17s].tail(len(ks2021c))).flatten()).value_counts().head(10)

colors1 = ['red'] + list(np.repeat('grey',3))
plt.barh(dfq14.head(4).index, dfq14.head(4)/dfq14.head(4).max()*100,
        color=colors1)           
plt.title('Data Visualization Library 2021')
plt.xlabel('% compared to the most popular')
plt.show()


colors1 = ['red'] + list(np.repeat('grey',9))
plt.barh(dfq17.index, dfq17/dfq17.max()*100,
        color=colors1)           
plt.title('ML Algorithm 2021')
plt.xlabel('% compared to the most popular')
plt.show()


# add pros' stats

dfall['vis_itm21p'] = dfq14.index
dfall['vis_val21p'] = dfq14.values
dfall['alg_itm21p'] = dfq17.index
dfall['alg_val21p'] = dfq17.values
dfall



## comparing all and professionals
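
The summary table stores each group's top items by rank, so rows only line up when both groups rank the items in the same order. As a hedged alternative, the sketch below aligns the two groups by item name before comparing them. All counts and respondent totals are hypothetical.

```python
import pandas as pd

# Hypothetical counts for two groups; note the different rank order.
counts_all = pd.Series({
    "Linear or Logistic Regression": 900,
    "Decision Trees or Random Forests": 800,
    "Convolutional Neural Networks": 500,
})
counts_pro = pd.Series({
    "Linear or Logistic Regression": 300,
    "Convolutional Neural Networks": 260,
    "Decision Trees or Random Forests": 250,
})
n_all, n_pro = 1500, 400  # hypothetical respondent counts

# concat with axis=1 aligns on the index (item name), not on row position.
shares = pd.concat(
    {"all_pct": counts_all / n_all * 100, "pro_pct": counts_pro / n_pro * 100},
    axis=1,
)
print(shares.round(1))
```

Each row of `shares` then refers to a single item, whatever its rank in either group.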

In [None]:
nAll = (len(ks2021) - 1)/100  # row 0 holds the question text, not a response
nPro = len(ks2021c)/100

plt.figure(figsize=(10,6))
colors1 = np.where(dfall.head(4).vis_itm21.str.contains('Networks'), 'navy', 'dodgerblue')
plt.scatter(dfall.head(4).vis_val21/nAll, 
            dfall.head(4).vis_val21p/nPro, 
            alpha=0.3, s=70, color=colors1)
plt.title('Visualization library popularity - all vs. professional (2021) - Top 4',
         weight='bold')
plt.xlabel('all users %')
plt.ylabel('professionals %')
for i in dfall.head(4).index:
    plt.text(dfall.head(4).vis_val21[i]/nAll, 
             dfall.head(4).vis_val21p[i]/nPro,
             dfall.head(4).vis_itm21[i],)
plt.xlim(0,80)    
plt.ylim(0,80)    
plt.plot([5,75],[5,75], color='grey',alpha=0.2, linewidth=10)
plt.show()


plt.figure(figsize=(10,6))
colors1 = np.where(dfall.head(8).alg_itm21.str.contains('Networks'), 'navy', 'dodgerblue')
plt.scatter(dfall.head(8).alg_val21/nAll, 
            dfall.head(8).alg_val21p/nPro, 
            alpha=0.3, s=70, color=colors1)
plt.title('ML algorithm popularity - all vs. professional (2021)',
         weight='bold')
plt.xlabel('all users %')
plt.ylabel('professionals %')
for i in dfall.head(8).index:
    plt.text(dfall.head(8).alg_val21[i]/nAll, 
             dfall.head(8).alg_val21p[i]/nPro,
             dfall.head(8).alg_itm21[i],)
plt.xlim(0,63)    
plt.ylim(0,63)    
plt.plot([5,60],[5,60], color='grey', alpha=0.2, linewidth=10)
plt.show()


## How were things in the previous year, i.e. 2020?

In [None]:
tmpa = ks2020.head(4).T
Q_14s = tmpa[tmpa[0].str.contains('What data visualization libraries')].index
Q_17s = tmpa[tmpa[0].str.contains('algorithm')].index


dfq14 = pd.Series(np.array(ks2020[Q_14s].tail(len(ks2020)-1)).flatten()).value_counts().head(10)

colors1 = ['red'] + list(np.repeat('grey',3))
plt.barh(dfq14.head(4).index, dfq14.head(4)/dfq14.head(4).max()*100,
        color=colors1)           
plt.title('Data Visualization Library 2020')
plt.show()

dfq17 = pd.Series(np.array(ks2020[Q_17s].tail(len(ks2020)-1)).flatten()).value_counts().head(10)

colors1 = ['red'] + list(np.repeat('grey',9))
plt.barh(dfq17.index, dfq17/dfq17.max()*100,
        color=colors1)           
plt.title('ML Algorithm 2020')
plt.show()

dfall['vis_itm20'] = dfq14.index
dfall['vis_val20'] = dfq14.values
dfall['alg_itm20'] = dfq17.index
dfall['alg_val20'] = dfq17.values
# dfall


In [None]:
# Professionals only
ks2020c = ks2020[(ks2020.Q1.isin(ageList)) & (ks2020.Q4.isin(educationList))]

dfq14 = pd.Series(np.array(ks2020c[Q_14s].tail(len(ks2020c))).flatten()).value_counts().head(10)
dfq17 = pd.Series(np.array(ks2020c[Q_17s].tail(len(ks2020c))).flatten()).value_counts().head(10)

colors1 = ['red'] + list(np.repeat('grey',3))
plt.barh(dfq14.head(4).index, dfq14.head(4)/dfq14.head(4).max()*100,
        color=colors1)           
plt.title('Data Visualization Library 2020')
plt.show()

colors1 = ['red'] + list(np.repeat('grey',9))
plt.barh(dfq17.index, dfq17/dfq17.max()*100,
        color=colors1)           
plt.title('ML Algorithm 2020')
plt.show()


# adding additional features

dfall['vis_itm20p'] = dfq14.index
dfall['vis_val20p'] = dfq14.values
dfall['alg_itm20p'] = dfq17.index
dfall['alg_val20p'] = dfq17.values
# dfall

## What changed from 2020 to 2021?

In [None]:
nAll20 = (len(ks2020) - 1)/100  # row 0 holds the question text, not a response
nPro20 = len(ks2020c)/100


# all responders
plt.figure(figsize=(10,6))
plt.scatter(dfall.head(4).vis_val20/nAll20, 
            dfall.head(4).vis_val21/nAll, 
            alpha=0.3, s=70)
plt.title('Visualization library popularity - 2020 vs. 2021',
         weight='bold')
plt.xlabel('2020 %')
plt.ylabel('2021 %')
for i in dfall.head(4).index:
    plt.text(dfall.head(4).vis_val20[i]/nAll20,  # 2020 counts use the 2020 denominator
             dfall.head(4).vis_val21[i]/nAll,
             dfall.head(4).vis_itm20[i],)
plt.xlim(0,80)    
plt.ylim(0,80)    
plt.plot([5,75],[5,75], color='grey',alpha=0.2, linewidth=10)
plt.plot([5,75],[5,75*1.1], color='lightgrey',alpha=0.2, linewidth=5)
plt.show()


[interpretation]
- The great Matplotlib is the good old leader in this category, but Seaborn, which is built on top of Matplotlib, is growing a little faster




In [None]:
nAll20 = (len(ks2020) - 1)/100  # row 0 holds the question text, not a response
nPro20 = len(ks2020c)/100

plt.figure(figsize=(10,6))

plt.scatter(dfall.head(4).vis_val21/nAll, 
            (dfall.head(4).vis_val21/nAll)/(dfall.head(4).vis_val20/nAll20), 
            alpha=0.3, s=90)
plt.title('Visualization library popularity - 2020 vs. 2021',
         weight='bold')
plt.xlabel('2021 %')
plt.ylabel('change from 2020')
for i in dfall.head(4).index:
    plt.text(dfall.head(4).vis_val21[i]/nAll, 
             (dfall.head(4).vis_val21[i]/nAll)/(dfall.head(4).vis_val20[i]/nAll20),
             dfall.head(4).vis_itm20[i])
plt.xlim(0,80)    
plt.ylim(0.9,1.15)    
plt.axhline(1, linestyle=':', color='grey')
plt.axhline(((dfall.head(4).vis_val21/nAll)/(dfall.head(4).vis_val20/nAll20)).max(), linestyle=':')
plt.show()

- Matplotlib keeps expanding its user base.
- Seaborn is growing very slightly faster.
- Ggplot is not that popular and its share of the total user base is shrinking.
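
The "change from 2020" axis is the ratio of two respondent shares. The sketch below shows one way to compute it, matching items by name. The helper `share_ratio` and all numbers are hypothetical illustrations, not the survey results.

```python
import pandas as pd

def share_ratio(counts_new, n_new, counts_old, n_old):
    """Ratio of respondent shares; values above 1 mean a growing share."""
    share_new = counts_new / n_new
    share_old = counts_old / n_old
    # Series division aligns on the index, so items are matched by name.
    # Items missing from either year become NaN and are dropped.
    return (share_new / share_old).dropna()

# Hypothetical counts and respondent totals for two survey years.
counts_2020 = pd.Series({"Matplotlib": 600, "Seaborn": 400, "Ggplot / ggplot2": 200})
counts_2021 = pd.Series({"Seaborn": 550, "Matplotlib": 780, "Ggplot / ggplot2": 216})

ratio = share_ratio(counts_2021, 1200, counts_2020, 1000)
print(ratio.round(3))
```

Dividing by the number of respondents first matters because the two surveys differ in size. A raw count can rise while the share falls.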

### Changes in the Community of Professionals

In [None]:
nAll20 = (len(ks2020) - 1)/100  # row 0 holds the question text, not a response
nPro20 = len(ks2020c)/100

plt.figure(figsize=(10,6))
plt.scatter(dfall.head(4).vis_val21p/nPro, 
            (dfall.head(4).vis_val21p/nPro)/(dfall.head(4).vis_val20p/nPro20), 
            alpha=0.5, s=110,
           color='dodgerblue')
plt.title('Visualization library popularity [Professionals] - 2020 vs. 2021',
         weight='bold')
plt.xlabel('2021 %')
plt.ylabel('change from 2020')
for i in dfall.head(4).index:
    plt.text(dfall.head(4).vis_val21p[i]/nPro, 
             (dfall.head(4).vis_val21p[i]/nPro)/(dfall.head(4).vis_val20p[i]/nPro20),
             dfall.head(4).vis_itm20[i],)
plt.xlim(0,80)    
plt.ylim(0.9,1.15)    
plt.axhline(1, linestyle=':')
plt.show()

- Matplotlib keeps expanding its user base in **professional communities**.
- Seaborn is expanding a little faster.
- Ggplot is not that popular but is still growing, especially in **professional communities**.

In [None]:
plt.figure(figsize=(10,6))
colors1 = np.where(dfall.head(8).alg_itm21.str.contains('Networks'), 'navy', 'dodgerblue')
plt.scatter(dfall.head(8).alg_val21p/nPro, 
            (dfall.head(8).alg_val21p/nPro)/(dfall.head(8).alg_val20p/nPro20), 
            alpha=0.5, s=110,
           color=colors1)
plt.title('ML algorithm popularity [Professionals] - 2020 vs. 2021',
         weight='bold')
plt.xlabel('2021 %')
plt.ylabel('change from 2020')
for i in dfall.head(8).index:
    plt.text(dfall.head(8).alg_val21p[i]/nPro, 
             (dfall.head(8).alg_val21p[i]/nPro)/(dfall.head(8).alg_val20p[i]/nPro20),
             dfall.head(8).alg_itm20[i],)
plt.xlim(0,80)    
plt.ylim(0.9,1.4)    
plt.axhline(1, linestyle=':')
plt.show()

- Note that many professionals now use one or more algorithms from the neural network (NN) family, and their popularity is growing quite rapidly.

[Caution] Separating entry-level users from professionals is a critical prerequisite. However, the results of the analysis may vary depending on the condition used to filter out beginners.
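
To see how sensitive the subset is to that condition, the filter can be parameterized and run with different definitions. This is a minimal sketch on toy data. The helper `select_professionals` and the alternative "loose" definition are assumptions for illustration. Only the `Q1` and `Q4` column names come from the survey files.

```python
import pandas as pd

def select_professionals(df, ages, degrees, age_col="Q1", edu_col="Q4"):
    """Keep respondents whose age band and degree are both in the allowed lists."""
    return df[df[age_col].isin(ages) & df[edu_col].isin(degrees)]

# Toy respondents (hypothetical values in the survey's answer format).
toy = pd.DataFrame({
    "Q1": ["18-21", "25-29", "30-34", "22-24", "40-44"],
    "Q4": ["Bachelor’s degree", "Master’s degree", "Doctoral degree",
           "Master’s degree", "Bachelor’s degree"],
})

# Two candidate definitions of "professional" to compare.
strict = select_professionals(
    toy, ["25-29", "30-34", "40-44"], ["Master’s degree", "Doctoral degree"])
loose = select_professionals(
    toy, ["22-24", "25-29", "30-34", "40-44"],
    ["Bachelor’s degree", "Master’s degree", "Doctoral degree"])

print(len(strict), len(loose))
```

Rerunning the counts and plots on each subset would show which conclusions hold regardless of the definition.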