# Introduction

The data we will use comes from the Stack Overflow 2021 Developer Survey ([See Survey Results Page](https://insights.stackoverflow.com/survey/)). The dataset is a compiled collection of anonymous survey responses about programming and software development in general. Part of it describes the languages each respondent currently works with and the languages they would like to work with.

This notebook compiles and presents the correlation between the languages developers currently work with and the languages they want to work with.

# Importing Libraries


In [1]:
from IPython.display import display
import pandas as pd
import requests
import shutil
import os
import zipfile
import json

# Downloading the Survey Data

Because the survey data is too large to keep in the repository, we download it into a `tmp` directory and exclude that directory through `.gitignore`.

In [2]:
if os.path.exists(os.path.join('.','tmp')):
    shutil.rmtree(os.path.join('.','tmp'))
os.mkdir(os.path.join('.','tmp'))

request = requests.get('https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip')
request.raise_for_status()  # Stop here if the download failed instead of writing an error page to disk
with open(os.path.join('.','tmp','survey.zip'),'wb') as file:
    file.write(request.content)

# Extract Zip File
with zipfile.ZipFile(os.path.join('.','tmp','survey.zip')) as zip_ref:
    zip_ref.extractall(os.path.join('.','tmp'))

# Importing the Survey Data
The results are compiled into a `.csv` file. We will use the [pandas](https://pandas.pydata.org/) library to load and iterate over the data.

In [3]:
data_full = pd.read_csv(os.path.join('.','tmp','survey_results_public.csv'))
display(data_full)

Unnamed: 0,ResponseId,MainBranch,Employment,Country,US_State,UK_Country,EdLevel,Age1stCode,LearnCode,YearsCode,...,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Slovakia,,,"Secondary school (e.g. American high school, G...",18 - 24 years,Coding Bootcamp;Other online resources (ex: vi...,,...,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,62268.0
1,2,I am a student who is learning to code,"Student, full-time",Netherlands,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",7,...,18-24 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,
2,3,"I am not primarily a developer, but I write co...","Student, full-time",Russian Federation,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",,...,18-24 years old,Man,No,Prefer not to say,Prefer not to say,None of the above,None of the above,Appropriate in length,Easy,
3,4,I am a developer by profession,Employed full-time,Austria,,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",11 - 17 years,,,...,35-44 years old,Man,No,Straight / Heterosexual,White or of European descent,I am deaf / hard of hearing,,Appropriate in length,Neither easy nor difficult,
4,5,I am a developer by profession,"Independent contractor, freelancer, or self-em...",United Kingdom of Great Britain and Northern I...,,England,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",5 - 10 years,Friend or family member,17,...,25-34 years old,Man,No,,White or of European descent,None of the above,,Appropriate in length,Easy,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83434,83435,I am a developer by profession,Employed full-time,United States of America,Texas,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",6,...,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,I have a concentration and/or memory disorder ...,Appropriate in length,Easy,160500.0
83435,83436,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Benin,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",4,...,18-24 years old,Man,No,Straight / Heterosexual,Black or of African descent,None of the above,None of the above,Appropriate in length,Easy,3960.0
83436,83437,I am a developer by profession,Employed full-time,United States of America,New Jersey,,"Secondary school (e.g. American high school, G...",11 - 17 years,School,10,...,25-34 years old,Man,No,,White or of European descent,None of the above,None of the above,Appropriate in length,Neither easy nor difficult,90000.0
83437,83438,I am a developer by profession,Employed full-time,Canada,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,Online Courses or Certification;Books / Physic...,5,...,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,I have a mood or emotional disorder (e.g. depr...,Appropriate in length,Neither easy nor difficult,816816.0


# Trimming the Data
Most of these columns are not needed, so we trim the data down to just the two columns we are looking for.

In [4]:
data_trimmed = data_full[['LanguageHaveWorkedWith','LanguageWantToWorkWith']]
display(data_trimmed)

Unnamed: 0,LanguageHaveWorkedWith,LanguageWantToWorkWith
0,C++;HTML/CSS;JavaScript;Objective-C;PHP;Swift,Swift
1,JavaScript;Python,
2,Assembly;C;Python;R;Rust,Julia;Python;Rust
3,JavaScript;TypeScript,JavaScript;TypeScript
4,Bash/Shell;HTML/CSS;Python;SQL,Bash/Shell;HTML/CSS;Python;SQL
...,...,...
83434,Clojure;Kotlin;SQL,Clojure
83435,,
83436,Groovy;Java;Python,Java;Python
83437,Bash/Shell;JavaScript;Node.js;Python,Go;Rust
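
As an aside, the same two columns can also be selected at load time with the `usecols` argument of `pd.read_csv`, which avoids holding the full table in memory. A minimal sketch on an in-memory CSV; the sample rows are made up for illustration, and only the column names are taken from the survey file:

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for survey_results_public.csv
sample_csv = io.StringIO(
    "ResponseId,LanguageHaveWorkedWith,LanguageWantToWorkWith\n"
    "1,JavaScript;Python,Rust\n"
    "2,C;Python,\n"
)

# usecols restricts parsing to the listed columns
sample = pd.read_csv(
    sample_csv,
    usecols=['LanguageHaveWorkedWith', 'LanguageWantToWorkWith'],
)
print(sample.columns.tolist())
```

The empty last field of the second sample row is read as `NaN`, just like an unanswered question in the real data.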


# Filtering the Data

As you might notice, some of the rows above have a missing value (`NaN`) in one or both columns. This indicates that the respondent did not select anything for that question. Since we need both columns, we filter out any row that has a `NaN` in either one.

In [5]:
data_filtered = data_trimmed[data_trimmed['LanguageHaveWorkedWith'].notnull() & data_trimmed['LanguageWantToWorkWith'].notnull()]
display(data_filtered)

Unnamed: 0,LanguageHaveWorkedWith,LanguageWantToWorkWith
0,C++;HTML/CSS;JavaScript;Objective-C;PHP;Swift,Swift
2,Assembly;C;Python;R;Rust,Julia;Python;Rust
3,JavaScript;TypeScript,JavaScript;TypeScript
4,Bash/Shell;HTML/CSS;Python;SQL,Bash/Shell;HTML/CSS;Python;SQL
5,C;C#;C++;HTML/CSS;Java;JavaScript;Node.js;Powe...,C#;C++;Go;HTML/CSS;Java;JavaScript;Node.js;Obj...
...,...,...
83433,Java;JavaScript;Kotlin;Objective-C;TypeScript,Kotlin
83434,Clojure;Kotlin;SQL,Clojure
83436,Groovy;Java;Python,Java;Python
83437,Bash/Shell;JavaScript;Node.js;Python,Go;Rust
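
The same filter can also be expressed with `DataFrame.dropna` and its `subset` argument. A small sketch on a toy frame (the three rows are invented for illustration) shows which rows survive:

```python
import pandas as pd

# Toy frame mirroring the two survey columns; None marks an unanswered question
toy = pd.DataFrame({
    'LanguageHaveWorkedWith': ['JavaScript;Python', 'C;Rust', None],
    'LanguageWantToWorkWith': [None, 'Rust', 'Go'],
})

# Keep only the rows where both columns are present
toy_filtered = toy.dropna(subset=['LanguageHaveWorkedWith', 'LanguageWantToWorkWith'])
print(toy_filtered)
```

Only the middle row has both answers, so it is the only one kept. Note that the original index label is preserved, which is why the filtered survey output above skips some index values.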


# Iterating on the Data

Our results will be stored as a nested dictionary. An example is as follows:
```python
result = {
    'javascript': {
        'Total': 23,
        'python': 12
    }
}
```
Each top-level key of the `result` dictionary is a language that users currently work with. Each key of the inner dictionary is a language those users want to work with, mapped to the number of times that pair has shown up.
Additionally, each inner dictionary has a `'Total'` entry, the number of respondents who work with the outer language, which will be useful in further steps.

After initializing the result, we will populate it based on the dataframe values.

In [6]:
result_raw = {}
for index,row in data_filtered.iterrows():
    known_languages = row['LanguageHaveWorkedWith'].split(';')
    wanted_languages = row['LanguageWantToWorkWith'].split(';')
    for lang_1 in known_languages:
        if lang_1 not in result_raw:
            result_raw[lang_1] = {'Total': 1}
        else:
            result_raw[lang_1]['Total'] = result_raw[lang_1]['Total'] + 1
        for lang_2 in wanted_languages:
            if lang_2 not in result_raw[lang_1]:
                result_raw[lang_1][lang_2] = 1
            else:
                result_raw[lang_1][lang_2] = result_raw[lang_1][lang_2] + 1

result_raw_df = pd.DataFrame.from_dict(result_raw,orient='index')
display(result_raw_df)

Unnamed: 0,Total,Swift,C#,C++,Go,HTML/CSS,Java,JavaScript,Node.js,Objective-C,...,Elixir,Erlang,LISP,Groovy,Crystal,Delphi,Julia,VBA,APL,COBOL
C++,18656,1805,4733,9876,4201,6615,4871,8158,5695,637,...,699,433,581,300,202,271,902,332,206,146
HTML/CSS,43197,3897,11518,8341,9434,27842,10009,27658,18731,927,...,2070,806,833,668,512,418,1089,662,303,215
JavaScript,49904,4312,12630,8908,11412,25391,11347,32964,21367,985,...,2586,986,911,787,598,426,1175,645,321,219
Objective-C,2100,1139,469,547,455,742,539,942,722,622,...,131,97,78,90,59,70,91,66,51,47
PHP,16795,1651,3801,3250,3928,9271,3977,10857,7572,500,...,701,349,305,287,197,292,342,360,141,129
Swift,3879,2672,757,761,818,1338,907,1695,1247,511,...,213,119,96,109,91,74,129,79,69,58
Assembly,4309,484,1151,2071,1056,1646,1255,1926,1365,220,...,249,173,261,112,102,147,259,140,126,93
C,16086,1672,3621,6804,3956,5793,4456,7081,4892,607,...,664,440,639,313,228,246,800,301,214,158
Python,37371,3031,7443,10351,9500,13320,8725,16745,11156,714,...,1698,806,978,630,374,265,1959,539,279,192
R,3903,367,627,1029,830,1231,864,1521,992,139,...,190,126,182,104,81,71,663,148,79,65
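
To make the counting logic easier to follow, here is a minimal sketch of the same idea on two invented responses. It uses `collections.defaultdict` and `collections.Counter` instead of the explicit `if`/`else` checks; that is a substitution for illustration, not what the cell above does:

```python
from collections import Counter, defaultdict

# Hypothetical responses as (languages worked with, languages wanted) pairs
responses = [
    ('JavaScript;Python', 'Python;Rust'),
    ('Python', 'Rust'),
]

counts = defaultdict(Counter)
for known, wanted in responses:
    for lang_1 in known.split(';'):
        counts[lang_1]['Total'] += 1      # one more respondent works with lang_1
        for lang_2 in wanted.split(';'):
            counts[lang_1][lang_2] += 1   # one more (lang_1, lang_2) pair

# Convert to plain dictionaries with the same shape as result_raw
toy_result = {lang: dict(counter) for lang, counter in counts.items()}
print(toy_result)
```

Here `'Python'` ends up with a `'Total'` of 2 and a `'Rust'` count of 2, because both respondents work with Python and both want to work with Rust.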


# Converting to Percentages

To make widely used languages comparable with less common ones in the visualization, we divide each count by the total number of people who said they currently work with that language. The resulting values are fractions, so 0.25 means 25%.

In [7]:
result_normal = {}
for lang_1 in result_raw:
    result_normal[lang_1] = {}
    for lang_2 in result_raw[lang_1]:
        if lang_2 != 'Total':
            result_normal[lang_1][lang_2] = result_raw[lang_1][lang_2] / result_raw[lang_1]['Total']

result_normal_df = pd.DataFrame.from_dict(result_normal,orient='index')
display(result_normal_df)

Unnamed: 0,Swift,C#,C++,Go,HTML/CSS,Java,JavaScript,Node.js,Objective-C,Perl,...,Elixir,Erlang,LISP,Groovy,Crystal,Delphi,Julia,VBA,APL,COBOL
C++,0.096752,0.253699,0.529374,0.225182,0.354578,0.261096,0.437286,0.305264,0.034145,0.026479,...,0.037468,0.02321,0.031143,0.016081,0.010828,0.014526,0.048349,0.017796,0.011042,0.007826
HTML/CSS,0.090215,0.266639,0.193092,0.218395,0.644536,0.231706,0.640276,0.433618,0.02146,0.017223,...,0.04792,0.018659,0.019284,0.015464,0.011853,0.009677,0.02521,0.015325,0.007014,0.004977
JavaScript,0.086406,0.253086,0.178503,0.228679,0.508797,0.227377,0.660548,0.428162,0.019738,0.01559,...,0.051819,0.019758,0.018255,0.01577,0.011983,0.008536,0.023545,0.012925,0.006432,0.004388
Objective-C,0.542381,0.223333,0.260476,0.216667,0.353333,0.256667,0.448571,0.34381,0.29619,0.043333,...,0.062381,0.04619,0.037143,0.042857,0.028095,0.033333,0.043333,0.031429,0.024286,0.022381
PHP,0.098303,0.226317,0.19351,0.233879,0.55201,0.236797,0.646442,0.450848,0.029771,0.02602,...,0.041739,0.02078,0.01816,0.017088,0.01173,0.017386,0.020363,0.021435,0.008395,0.007681
Swift,0.688837,0.195153,0.196185,0.210879,0.344934,0.233823,0.436968,0.321475,0.131735,0.025522,...,0.054911,0.030678,0.024749,0.0281,0.02346,0.019077,0.033256,0.020366,0.017788,0.014952
Assembly,0.112323,0.267115,0.480622,0.245068,0.381991,0.291251,0.446971,0.316779,0.051056,0.049896,...,0.057786,0.040149,0.060571,0.025992,0.023671,0.034115,0.060107,0.03249,0.029241,0.021583
C,0.103941,0.225103,0.422977,0.245928,0.360127,0.277011,0.440196,0.304115,0.037735,0.033694,...,0.041278,0.027353,0.039724,0.019458,0.014174,0.015293,0.049733,0.018712,0.013303,0.009822
Python,0.081106,0.199165,0.276979,0.254208,0.356426,0.23347,0.448075,0.29852,0.019106,0.019213,...,0.045436,0.021568,0.02617,0.016858,0.010008,0.007091,0.05242,0.014423,0.007466,0.005138
R,0.09403,0.160646,0.263643,0.212657,0.315398,0.221368,0.3897,0.254163,0.035614,0.043556,...,0.048681,0.032283,0.046631,0.026646,0.020753,0.018191,0.169869,0.03792,0.020241,0.016654
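
As a worked check of the normalization, the sketch below recomputes two entries of the C++ row from the raw counts shown earlier (Total 18656, Swift 1805, C# 4733). The variable names are chosen for this example only:

```python
# Raw counts for C++ copied from the result_raw_df output
cpp_raw = {'Total': 18656, 'Swift': 1805, 'C#': 4733}

# Divide every pair count by the number of C++ respondents, skipping 'Total' itself
cpp_share = {
    lang: count / cpp_raw['Total']
    for lang, count in cpp_raw.items()
    if lang != 'Total'
}

for lang, share in cpp_share.items():
    print(lang, round(share, 6))
```

The rounded values agree with the `Swift` and `C#` entries of the C++ row in the table above.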


# Saving as a JSON

Now we will save the results as a JSON file so that we can read it in JavaScript.

In [8]:
with open(os.path.join('.','tmp','results.json'),'w') as file:
    json.dump(result_normal, file)
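
A quick way to confirm that the file is valid JSON is to read it back and compare it with the original dictionary. The sketch below does this for a made-up miniature result in a temporary directory, so it does not touch the real `tmp/results.json`:

```python
import json
import os
import tempfile

# Hypothetical miniature of the normalized result
toy_normal = {'Python': {'Rust': 0.5, 'Go': 0.25}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'results.json')
    # Write the dictionary, then load it again from disk
    with open(path, 'w') as file:
        json.dump(toy_normal, file)
    with open(path) as file:
        loaded = json.load(file)

print(loaded == toy_normal)
```

Because the structure only contains strings, floats, and nested dictionaries, it maps directly onto a JavaScript object when parsed on the other side.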