## Aim of the Notebook: 
### Investigate patterns in the number of skills/tools people know or use, how those patterns differ across levels of experience, job titles, and education levels, and whether they are linked to salary level.

Since I started doing data analysis, I have often found myself asking: how many languages/tools do I need to learn? And as I get older, the question gradually becomes "do I need to learn this as well?" whenever a new trend comes along.

Therefore, I wanted to find out from the survey how many skills or tools (e.g., programming languages) people know and/or use, and whether and how that number relates to job title, experience, salary level, and so on.

## Specific methods:
### The most important survey question I focused on here is
- *Q7: What programming languages do you use on a regular basis?*

In addition, I also looked at two other related questions:
- *Q9: Which of the following integrated development environments (IDE's) do you use on a regular basis?*
- *Q14: What data visualization libraries or tools do you use on a regular basis? (Select all that apply)*

** There are other, more specific questions regarding cloud computing, database management, etc., but I will stick to the general questions regarding languages, IDEs, and visualization libraries.
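
To make the counting concrete: in the survey file, each option of a multiple-choice question has its own column (e.g. `Q7_Part_1`), which holds the option text if it was selected and is empty otherwise. Below is a minimal sketch on made-up data. The toy table, the helper name `count_selected`, and the way the "None" option is detected (by its cell value rather than by the question text) are illustrative assumptions, not the exact procedure used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey: one column per answer option (made-up data)
toy = pd.DataFrame({
    'Q7_Part_1': ['Python', 'Python', np.nan],
    'Q7_Part_2': ['R', np.nan, np.nan],
    'Q7_Part_11': [np.nan, np.nan, 'None'],  # hypothetical "None" option column
    'Q17_Part_1': ['x', 'y', 'z'],           # unrelated question, must not be counted
})

def count_selected(df, question, exclude_values=('None',)):
    """Count the selected options of one multi-select question per respondent."""
    # match on the 'Q7_' prefix so that e.g. 'Q17_...' is never picked up
    cols = [c for c in df.columns if c.startswith(question + '_')]
    # drop option columns that only ever contain an excluded answer such as 'None'
    keep = [c for c in cols if not set(df[c].dropna().unique()) <= set(exclude_values)]
    return df[keep].notna().sum(axis=1)

n_languages = count_selected(toy, 'Q7')
print(n_languages.tolist())  # prints [2, 1, 0]
```

The third toy respondent only ticked "None", so their count is 0 rather than 1.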

### Regarding background information of the responders, I focused on the following 4 questions:
1. *Q6: For how many years have you been writing code and/or programming?*
* I found this question particularly interesting because it raises the question "does more experience mean more learned hard skills?", and therefore "if one keeps working in the field, does one constantly learn more tools?".

1. *Q5: Select the title most similar to your current role (or most recent title if retired)*
1. *Q4: What is the highest level of formal education that you have attained or plan to attain within the next 2 years?*
* However, I do not expect this question to be very informative because the variability is probably small: the majority of people have, or plan to have, a high level of education. In particular, the question includes "plan to attain within the next 2 years", which I think makes the answers noisier and even less informative.
    
### Finally, I also looked at salary level with *Q24: What is your current yearly compensation?*

### Note: In this Notebook I only examined the data from the **US**, as I found that different countries have quite different patterns and responder compositions, on top of other differences in work culture, including job titles and salary. I also believe that the US data alone is of great interest, since many people on Kaggle outside the US probably consider working in the US as well.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

In [None]:
## Read the data and select responders who reside in the US
survey = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
questions = survey.iloc[0, :].T  # question texts are in the first row
# this filter also drops the question-text row, whose Q3 value is the question itself
survey = survey[survey['Q3'] == 'United States of America']
n_resp = survey.shape[0]

## For each responder, count how many languages(Q7), IDEs(Q9), and visualization libraries(Q14) they use,
# And construct a new dataset df_mul 
selected_multiple_choice = ['Q7','Q9','Q14']
# assign each question a short name for visualization purpose
selected_multiple_choice_name = ['language','IDE','Vis']
n_selected_multiple_choice = len(selected_multiple_choice)
ds = np.zeros((n_resp,n_selected_multiple_choice))
for iqx, qx in enumerate(selected_multiple_choice):
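    # select the columns of this multiple-choice question; matching on the 'Qx_' prefix
    # avoids accidental substring hits (e.g. 'Q1' would otherwise also match 'Q14_...')
    targLbl = np.nonzero(np.asarray([q.startswith(qx + '_') for q in questions.index]))[0]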
    options = questions.values[targLbl]
    # exclude 'None' answer column:
    targLbl = np.setdiff1d(targLbl,targLbl[np.nonzero(['None' in option for option in options])])
    list_of_cols = questions.index[targLbl]#list_of_columns_for_a_single_question
    #print('#%d %s %s n_opt=%d' % (iqx,qx,selected_multiple_choice_name[iqx],list_of_cols.shape[0]))
    #if want to examine indvd numbers of each option: 
    #for icol,col in enumerate(list_of_cols): 
    #    print('#%d %s %s nSel=%d' % (icol,col,options[icol].split(' - ')[2],np.sum(survey.iloc[:,targLbl].iloc[:,icol].notna())))
    ds[:,iqx] = survey.iloc[:,targLbl].notna().sum(axis=1)
#create a dataframe from the numpy array
df_mul = pd.DataFrame(ds,columns=selected_multiple_choice_name,dtype=int, index=survey.index) 
df_mul

## 1. How many skills do people know overall (in the US)
### First, we just take a look at the overall distributions of N of languages, IDEs, and visualization tools, and the relationships among them
1. N of languages (Q7): on average one uses 2.46 languages, median=2. 
1. N of IDEs (Q9): on average one uses 2.2 IDEs, median=2
1. N of visualization tools/libraries (Q14): on average one uses 1.91 visualization tools, median=2.

** All three distributions are skewed to the right, i.e., fewer people know/use more tools. For visualization tools in particular, a large number of people know/use none or only one tool/library.
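
As a small illustration of what "skewed to the right" means for counts like these: a long tail of a few people with many tools pulls the mean above the median. The numbers below are made up for the sketch and are not survey results.

```python
import pandas as pd

# Made-up tool counts: most people use 1-3 tools, one person uses 8
counts = pd.Series([0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 8])

summary = {
    'mean': counts.mean(),      # pulled upward by the long right tail
    'median': counts.median(),  # robust to the tail
    'skew': counts.skew(),      # positive for a right-skewed distribution
}
print(summary)
```

This is why both the mean and the median are annotated on the histograms: for skewed counts the median is often the more representative "typical" value.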

In [None]:
plt.figure(figsize=(15,4))
for iqx,qx in enumerate(selected_multiple_choice_name):
    plt.subplot(1,3,iqx+1)
    plt.hist(df_mul[qx])
    plt.title(qx)
    plt.ylim([0,700])
    plt.annotate('mean='+str(round(df_mul[qx].mean(),2))+'+/-'+str(round(df_mul[qx].std(),2))\
                 +'\nmedian='+str(round(df_mul[qx].median(),2)), (4,600))
plt.show()

### Correlations between N of skills:
* Not surprisingly, the three categories (language, IDE, visualization tool) are positively correlated.
* Nonetheless, it is worth noting that they **are only moderately correlated**, that is, knowing/using more languages does not always mean knowing/using more IDEs and visualization tools, and so on.
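
For readers who want to check individual coefficients rather than read them off a heatmap, here is a minimal sketch on made-up counts (not survey data). It assumes the Pearson coefficient, which is what `np.corrcoef` and the default of `DataFrame.corr` compute; the pandas version simply keeps the column labels attached.

```python
import numpy as np
import pandas as pd

# Made-up counts for six hypothetical respondents
toy = pd.DataFrame({
    'language': [1, 2, 2, 3, 4, 5],
    'IDE':      [1, 1, 3, 2, 2, 4],
    'Vis':      [0, 2, 1, 1, 3, 2],
})

corr_labeled = toy.corr()                  # Pearson by default, labeled rows/columns
corr_numpy = np.corrcoef(toy.to_numpy().T)  # same numbers; rows must be variables

print(corr_labeled.round(2))
r_lang_ide = corr_labeled.loc['language', 'IDE']
```

A value well below 1 for `r_lang_ide` is what "only moderately correlated" refers to: the counts tend to rise together, but one does not determine the other.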

In [None]:
corr_mat = np.corrcoef(df_mul.T)
sns.heatmap(corr_mat,square=True,annot=True,xticklabels=selected_multiple_choice_name,
               yticklabels=selected_multiple_choice_name,cmap='Reds')
plt.show()

## 2. Examine responders of different levels of coding experience, job roles, and education levels

In [None]:
# Select the columns of the single-choice questions of interest,
# and construct the dataset (df_all) that combines the N of skills (df_mul) with the single-choice Qs.
selected_single_choice = ['Q4','Q5','Q6']
# assign each question a short name for visualization purpose
selected_single_choice_name = ['edu','work role','coding experience']
n_selected_single_choice = len(selected_single_choice)
df_single = survey[selected_single_choice].copy()
# combine the data frames of multiple-choice and single-choice into df_all (aligned on the survey index)
df_all = pd.concat([df_mul, df_single],axis=1)
df_all

### First, we will have a look at the frequency distribution of the three selected single-choice questions (categories are plotted in descending order of frequency, i.e., from most frequent to least).
1. Years of coding experience (Q6): 
    * The responders are very experienced, as the top three categories span 3-20 years of experience.
1. Current job role (Q5): 
    * The top category is data scientist, followed by student.
    * Interestingly, a lot of people identified themselves as "other" (3rd place). I wonder what the composition of this group is; maybe Kaggle can improve the options for this question to capture those demographics in the future.
1. Education (Q4): 
    * As many notebooks have already noted, the top category is Master's degree. However, it is important to note that the question included the phrase **"plan to attain within the next 2 years"**. Thus it is really not surprising that most people in the field figure they need something more than a bachelor's degree; many non-STEM majors also try to join the workforce by obtaining a relevant Master's degree.
    * I personally found it interesting that fewer people have or want to have a doctoral degree (3rd place). It reminds me of the question you hear a lot: "should I do a PhD?".

In [None]:
single_choice_question = 'Q6'
print('%s %s' %(single_choice_question, questions[single_choice_question]))
options = survey[single_choice_question].value_counts().index
sns.barplot(y=options, x=survey[single_choice_question].value_counts())
plt.show()
single_choice_question = 'Q5'
print('%s %s' %(single_choice_question, questions[single_choice_question]))
options = survey[single_choice_question].value_counts().index
sns.barplot(x=survey[single_choice_question].value_counts(),y=options)
plt.show()
single_choice_question = 'Q4'
print('%s %s' %(single_choice_question, questions[single_choice_question]))
options = survey[single_choice_question].value_counts().index
sns.barplot(y=options, x=survey[single_choice_question].value_counts())
plt.show()

### Second, we want to examine how the numbers of skills differ (or not) across levels of coding experience and jobs.

## 2.1. Do the numbers of languages differ by experience and/or job roles?
### As above, the categories are ***plotted in descending order***, i.e., from the group that knows/uses the most languages to the one that knows/uses the fewest. The same holds for the analyses of IDEs and visualization tools later.
a) Years of coding experience: A consistent relationship is found, i.e., **the N of languages increases as the number of coding years increases** 
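
The descending order used in these plots comes from the per-category mean. A minimal sketch of that step on made-up data is shown below; the helper name `sorted_group_means` and the toy values are assumptions for illustration. Selecting the count column before calling `.mean()` keeps text columns (such as the job role) out of the average.

```python
import pandas as pd

# Made-up example: coding experience (Q6), job role (Q5), and N of languages
toy = pd.DataFrame({
    'Q6': ['< 1 years', '< 1 years', '3-5 years', '3-5 years', '10-20 years', '10-20 years'],
    'Q5': ['Student', 'Student', 'Data Scientist', 'Data Analyst', 'Software Engineer', 'Data Scientist'],
    'language': [1, 2, 2, 3, 4, 3],
})

def sorted_group_means(df, group_col, value_col):
    """Mean of value_col per category of group_col, largest first."""
    return df.groupby(group_col)[value_col].mean().sort_values(ascending=False)

means = sorted_group_means(toy, 'Q6', 'language')
print(means)
```

In this toy table the order is `10-20 years` (3.5), `3-5 years` (2.5), `< 1 years` (1.5); passing `means.index` as the `order` argument of the plot reproduces that ranking on the axis.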


In [None]:
multi_choice_question = 'language'
single_choice_question = 'Q6' 
print('Multi-choice Q: %s\nSingle-choice Q: %s %s' \
    % (multi_choice_question, single_choice_question,questions[single_choice_question]))
# calculate the mean value of each category and sort in descending order
# (select the count column before .mean() so text columns are never averaged):
sortedMeans = df_all.groupby(single_choice_question)[multi_choice_question].mean().sort_values(ascending=False)
sns.catplot(x=multi_choice_question,y=single_choice_question,kind='bar',data=df_all, ci=99, order=(sortedMeans.index))
for iopt, opt in enumerate(sortedMeans.index):
    plt.text(0.1, iopt+.2,str(round(sortedMeans[opt],2)),color='w',fontsize=12)
plt.show()

b) Job role: 
* Software Engineers know/use the most languages, followed by other data jobs.
* Management jobs and business analysts know/use the fewest (excluding the not-employed and "other" categories).
* Academics are in the middle.

In [None]:
multi_choice_question = 'language'
single_choice_question = 'Q5' 
print('Multi-choice Q: %s\nSingle-choice Q: %s %s' \
    % (multi_choice_question, single_choice_question,questions[single_choice_question]))
# calculate the mean value of each category and sort in descending order
# (select the count column before .mean() so text columns are never averaged):
sortedMeans = df_all.groupby(single_choice_question)[multi_choice_question].mean().sort_values(ascending=False)
sns.catplot(x=multi_choice_question,y=single_choice_question,kind='bar',data=df_all, ci=99, order=(sortedMeans.index))
for iopt, opt in enumerate(sortedMeans.index):
    plt.text(0.1, iopt+.2,str(round(sortedMeans[opt],2)),color='w',fontsize=12)
plt.show()

## 2.2. Number of IDEs
a) Years of coding experience: Pretty much the same among people with more than 2 years of experience.
* Perhaps people stick to certain IDEs after a few years in the workforce.
* The current landscape of IDEs may be relatively stable.

In [None]:
multi_choice_question = 'IDE'
single_choice_question = 'Q6' 
print('Multi-choice Q: %s\nSingle-choice Q: %s %s' \
    % (multi_choice_question, single_choice_question,questions[single_choice_question]))
# calculate the mean value of each category and sort in descending order
# (select the count column before .mean() so text columns are never averaged):
sortedMeans = df_all.groupby(single_choice_question)[multi_choice_question].mean().sort_values(ascending=False)
sns.catplot(x=multi_choice_question,y=single_choice_question,kind='bar',data=df_all, ci=99, order=(sortedMeans.index))
for iopt, opt in enumerate(sortedMeans.index):
    plt.text(0.1, iopt+.2,str(round(sortedMeans[opt],2)),color='w',fontsize=12)
plt.show()

b) Job role: 
* As with the N of languages, people in technical positions know/use the most IDEs, management jobs and business analysts know/use the fewest, and academics are in the middle.

In [None]:
multi_choice_question = 'IDE'
single_choice_question = 'Q5' 
print('Multi-choice Q: %s\nSingle-choice Q: %s %s' \
    % (multi_choice_question, single_choice_question,questions[single_choice_question]))
# calculate the mean value of each category and sort in descending order
# (select the count column before .mean() so text columns are never averaged):
sortedMeans = df_all.groupby(single_choice_question)[multi_choice_question].mean().sort_values(ascending=False)
sns.catplot(x=multi_choice_question,y=single_choice_question,kind='bar',data=df_all, ci=99, order=(sortedMeans.index))
for iopt, opt in enumerate(sortedMeans.index):
    plt.text(0.1, iopt+.2,str(round(sortedMeans[opt],2)),color='w',fontsize=12)
plt.show()

## 2.3. Number of visualization tools/libraries
a) Years of coding experience: People with 3-10 years of experience know/use the most. Interestingly, the very experienced people (10 to 20 years) are at a similar level to the recent learners. **Maybe visualization is the playground for the youngsters** 

In [None]:
multi_choice_question = 'Vis'
single_choice_question = 'Q6' 
print('Multi-choice Q: %s\nSingle-choice Q: %s %s' \
    % (multi_choice_question, single_choice_question,questions[single_choice_question]))
# calculate the mean value of each category and sort in descending order
# (select the count column before .mean() so text columns are never averaged):
sortedMeans = df_all.groupby(single_choice_question)[multi_choice_question].mean().sort_values(ascending=False)
sns.catplot(x=multi_choice_question,y=single_choice_question,kind='bar',data=df_all, ci=99, order=(sortedMeans.index))
for iopt, opt in enumerate(sortedMeans.index):
    plt.text(0.1, iopt+.2,str(round(sortedMeans[opt],2)),color='w',fontsize=12)
plt.show()

b) Job role: 
* A similar pattern to the N of languages and IDEs: people in technical positions know/use the most visualization tools, management jobs and business analysts know/use the fewest, and academics are in the middle.
* The exceptions are Database management and Software Engineer roles, which know/use few visualization tools.

In [None]:
multi_choice_question = 'Vis'
single_choice_question = 'Q5' 
print('Multi-choice Q: %s\nSingle-choice Q: %s %s' \
    % (multi_choice_question, single_choice_question,questions[single_choice_question]))
# calculate the mean value of each category and sort in descending order
# (select the count column before .mean() so text columns are never averaged):
sortedMeans = df_all.groupby(single_choice_question)[multi_choice_question].mean().sort_values(ascending=False)
sns.catplot(x=multi_choice_question,y=single_choice_question,kind='bar',data=df_all, ci=99, order=(sortedMeans.index))
for iopt, opt in enumerate(sortedMeans.index):
    plt.text(0.1, iopt+.2,str(round(sortedMeans[opt],2)),color='w',fontsize=12)
plt.show()

## 3. Do the numbers of skills relate to one's salary level?
* Finally, I want to look into whether the N of skills (languages, IDEs, visualization tools) can be linked to one's salary level.
* The salary categories in Q24 are first converted to an ordinal scale (from 0 to 7).
* Spearman's rank correlations are calculated between N of skills and salary for
a) Each job role
b) Each education level
** As salary is multifactorial and likely varies greatly depending on the kind of job, experience, etc., we should look at people with different jobs and degrees separately.
* Correlations with p<0.05 given by scipy.stats are considered significant (look for the ones that say **"Significant!"**)
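
The two steps listed above (salary category to ordinal level, then a rank correlation within each group) can be sketched on made-up data as follows. The helper `salary_to_level`, its bin edges (chosen to mirror the 0-7 scale described here), and the toy table are assumptions for illustration rather than the exact code of the next cell. Missing salaries are dropped pairwise before calling `scipy.stats.spearmanr`.

```python
import numpy as np
import pandas as pd
from scipy import stats

def salary_to_level(label):
    """Map a Q24-style salary label to an ordinal level 0-7 (NaN if missing)."""
    if not isinstance(label, str):
        return np.nan
    # use the upper bound of the range, e.g. '100,000-124,999' -> 124999
    upper = float(label.replace('$', '').replace('>', '').split('-')[-1].replace(',', ''))
    edges = [1_000, 10_000, 50_000, 100_000, 125_000, 150_000, 200_000]
    return int(np.searchsorted(edges, upper, side='right'))

# Made-up respondents in two hypothetical groups
toy = pd.DataFrame({
    'role': ['A'] * 5 + ['B'] * 5,
    'language': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'Q24': ['$0-999', '1,000-1,999', '10,000-14,999', '100,000-124,999', '> $500,000',
            '150,000-199,999', '125,000-149,999', '50,000-59,999', '10,000-14,999', np.nan],
})
toy['salary'] = toy['Q24'].map(salary_to_level)

results = {}
for role, grp in toy.groupby('role'):
    pairs = grp[['language', 'salary']].dropna()  # omit respondents without a salary
    rho, p_value = stats.spearmanr(pairs['language'], pairs['salary'])
    results[role] = (rho, p_value)
    print('%s: rho=%.2f, p=%.4f, n=%d' % (role, rho, p_value, len(pairs)))
```

Because Spearman's rho only uses ranks, it does not matter that the salary bins have unequal widths. With only a handful of respondents per group, as in this toy table, the p-values should not be over-interpreted.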

In [None]:
# Convert salary categories to an ordinal scale from 0 (<1K) to 7 (>=200K)
salary_bin = np.zeros(n_resp)
for iopt , opt in enumerate(survey['Q24'].values):
    #print(iopt, opt)
    try:
        min_end = opt.split('-')[0]
        max_end = opt.split('-')[1]
        if ',' in max_end:# values > 1K
            min_end = float(min_end.split(',')[0])
            max_end = float(max_end.split(',')[0])
            if max_end < 10: #<10K
                #print('<10K:',min_end, max_end)
                salary_bin[iopt] = 1
            elif max_end < 50: #<50K
                salary_bin[iopt] = 2
            elif max_end < 100:#<100K
                salary_bin[iopt] = 3
            elif max_end < 125:
                salary_bin[iopt] = 4
            elif max_end < 150:
                salary_bin[iopt] = 5
            elif max_end < 200:
                salary_bin[iopt] = 6
            else: #max_end >= 200:#>200K
                salary_bin[iopt] = 7  
        else: # value < 1K
            salary_bin[iopt] = 0
    except IndexError as ie: #"$500,000"
        #print('Error:',ie, opt)
        salary_bin[iopt] = 7
    except AttributeError as ae:
        #print('Error:', ae, opt)
        salary_bin[iopt] = np.nan
salary_bin = pd.DataFrame(salary_bin,columns=['salary'], index=survey.index)
df_all = pd.concat([df_all, salary_bin],axis=1)

In [None]:
# Use scipy to calculate Spearman's rank
from scipy import stats

## 3.1. Relationship between N of skills and salary per job role:
* No significant correlation except for "Other" jobs (again, I wonder what the "others" consist of).
* **There is no consistent relationship between the N of skills and your salary across the job market.**

In [None]:
single_choice_question = 'Q5'
options = survey[single_choice_question].value_counts().index
for iqx,qx in enumerate(selected_multiple_choice_name):
    print(qx,':')
    for iopt, opt in enumerate(options):
        try:
            rho,p_value = stats.spearmanr(df_all[df_all[single_choice_question]==opt][qx], \
                          df_all[df_all[single_choice_question]==opt]['salary'],nan_policy='omit')
            if p_value<0.05:
                print('%s: Corr(rho)=%.4f,p=%.4f (Significant!)' % (opt,rho,p_value))
            else:
                print('%s: Corr(rho)=%.4f,p=%.4f' % (opt,rho,p_value))
        except ValueError as e:
            print(opt,': No data, skip')
    sns.lmplot(x=qx,y='salary', hue=single_choice_question, data=df_all)
    plt.title(qx)
    plt.show()

## 3.2. Relationship between N of skills and salary per education level
* Significant positive correlations are found only for Bachelor's degree and "Some college/university study without earning a bachelor’s degree".
* **The N of skills only matters for those with a relatively low level of education.**

In [None]:
single_choice_question = 'Q4'
options = survey[single_choice_question].value_counts().index
for iqx,qx in enumerate(selected_multiple_choice_name):
    print(qx,':')
    for iopt, opt in enumerate(options):
        try:
            rho,p_value = stats.spearmanr(df_all[df_all[single_choice_question]==opt][qx], \
                          df_all[df_all[single_choice_question]==opt]['salary'],nan_policy='omit')
            if p_value<0.05:
                print('%s: Corr(rho)=%.4f,p=%.4f (Significant!)' % (opt,rho,p_value))
            else:
                print('%s: Corr(rho)=%.4f,p=%.4f' % (opt,rho,p_value))
        except ValueError as e:
            print(opt,': No data, skip')
    sns.lmplot(x=qx,y='salary', hue=single_choice_question, data=df_all)
    plt.title(qx)
    plt.show()

## Conclusions
1. We examined three general skills: programming languages, IDEs, and visualization tools. They are only moderately correlated, i.e., it is not necessarily the case that if one knows or uses more languages, one also uses more IDEs or visualization tools.
1. People with longer coding experience do know/use more languages and IDEs, but not more visualization tools. Instead, more sophisticated visualization may be a recent trend, and people with a medium amount of experience (~5 years) are the ones that use the most.
1. The job roles can be roughly grouped into three categories based on the N of languages/IDEs people know/use:
    * a) Most skills: technical positions, including the various engineers and data scientists.
    * b) Fewest skills: project management and business analysts.
    * c) Medium skills: academics, including statisticians.
1. The N of skills is only related to salary for people who have (or plan to attain within the next 2 years) a Bachelor's degree or below.