# Introduction

The data we will use comes from the Stack Overflow 2021 Developer Survey ([see the survey results page](https://insights.stackoverflow.com/survey/)). It contains anonymous responses to a survey about programming and development in general. Part of this data describes the languages each respondent currently works with and the languages they would like to work with.

This notebook compiles and presents how often each language developers currently work with is paired with each language they want to work with.

# Importing Libraries


In [1]:
from IPython.display import display
import pandas as pd
import numpy as np
import urllib
import requests
import shutil
import os
import zipfile

# Downloading the Survey Data

Because the survey data is too large to keep in the repository, we download it into a temporary `tmp` directory and exclude that directory through `.gitignore`.

In [2]:
# Start from a clean temporary directory
tmp_dir = os.path.join('.', 'tmp')
if os.path.exists(tmp_dir):
    shutil.rmtree(tmp_dir)
os.mkdir(tmp_dir)

# Download the survey archive and stop early on an HTTP error
response = requests.get('https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip')
response.raise_for_status()
with open(os.path.join(tmp_dir, 'survey.zip'), 'wb') as file:
    file.write(response.content)

# Extract Zip File
with zipfile.ZipFile(os.path.join(tmp_dir, 'survey.zip')) as zip_ref:
    zip_ref.extractall(tmp_dir)
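
Only one file from the archive is read later, so a possible variation is to check that the archive contains it and extract just that member. The sketch below is an assumption-laden illustration, not part of the original notebook: `extract_member` is a hypothetical helper, and the archive it runs on is a tiny stand-in built on the spot rather than the real survey download.

```python
import os
import tempfile
import zipfile


def extract_member(zip_path: str, member: str, dest_dir: str) -> str:
    """Extract one named member from a zip archive and return its path."""
    with zipfile.ZipFile(zip_path) as zip_ref:
        if member not in zip_ref.namelist():
            raise FileNotFoundError(f'{member} not found in {zip_path}')
        return zip_ref.extract(member, path=dest_dir)


# Build a small stand-in archive (NOT the real survey data) for demonstration.
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, 'survey.zip')
with zipfile.ZipFile(demo_zip, 'w') as zip_ref:
    zip_ref.writestr('survey_results_public.csv', 'ResponseId\n1\n')
    zip_ref.writestr('README.txt', 'stand-in file')

# Extract only the CSV; README.txt stays inside the archive.
csv_path = extract_member(demo_zip, 'survey_results_public.csv', demo_dir)
print(os.path.basename(csv_path))
```

Failing with a clear error when the expected file is missing makes a changed archive layout easier to diagnose than a later `read_csv` failure.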

# Importing the Survey Data
The results are compiled into a `.csv` file. We will use the [pandas](https://pandas.pydata.org/) library to load and iterate over the data.

In [3]:
df0 = pd.read_csv(os.path.join('.','tmp','survey_results_public.csv'))
display(df0)

Unnamed: 0,ResponseId,MainBranch,Employment,Country,US_State,UK_Country,EdLevel,Age1stCode,LearnCode,YearsCode,...,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Slovakia,,,"Secondary school (e.g. American high school, G...",18 - 24 years,Coding Bootcamp;Other online resources (ex: vi...,,...,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,62268.0
1,2,I am a student who is learning to code,"Student, full-time",Netherlands,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",7,...,18-24 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,
2,3,"I am not primarily a developer, but I write co...","Student, full-time",Russian Federation,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",,...,18-24 years old,Man,No,Prefer not to say,Prefer not to say,None of the above,None of the above,Appropriate in length,Easy,
3,4,I am a developer by profession,Employed full-time,Austria,,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",11 - 17 years,,,...,35-44 years old,Man,No,Straight / Heterosexual,White or of European descent,I am deaf / hard of hearing,,Appropriate in length,Neither easy nor difficult,
4,5,I am a developer by profession,"Independent contractor, freelancer, or self-em...",United Kingdom of Great Britain and Northern I...,,England,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",5 - 10 years,Friend or family member,17,...,25-34 years old,Man,No,,White or of European descent,None of the above,,Appropriate in length,Easy,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83434,83435,I am a developer by profession,Employed full-time,United States of America,Texas,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",6,...,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,I have a concentration and/or memory disorder ...,Appropriate in length,Easy,160500.0
83435,83436,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Benin,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",4,...,18-24 years old,Man,No,Straight / Heterosexual,Black or of African descent,None of the above,None of the above,Appropriate in length,Easy,3960.0
83436,83437,I am a developer by profession,Employed full-time,United States of America,New Jersey,,"Secondary school (e.g. American high school, G...",11 - 17 years,School,10,...,25-34 years old,Man,No,,White or of European descent,None of the above,None of the above,Appropriate in length,Neither easy nor difficult,90000.0
83437,83438,I am a developer by profession,Employed full-time,Canada,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,Online Courses or Certification;Books / Physic...,5,...,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,I have a mood or emotional disorder (e.g. depr...,Appropriate in length,Neither easy nor difficult,816816.0


# Filtering the Data

As you can see, the data above is large: there are over 80,000 entries, each with 48 columns. We are not concerned with most of these columns, so we will pick out only the ones we need.

The columns that we want to work with are `LanguageHaveWorkedWith` and `LanguageWantToWorkWith`, which hold the data we are looking for.

In [4]:
col_work = 'LanguageHaveWorkedWith'
col_want = 'LanguageWantToWorkWith'

df1 = df0[[col_work,col_want]]
display(df1)

Unnamed: 0,LanguageHaveWorkedWith,LanguageWantToWorkWith
0,C++;HTML/CSS;JavaScript;Objective-C;PHP;Swift,Swift
1,JavaScript;Python,
2,Assembly;C;Python;R;Rust,Julia;Python;Rust
3,JavaScript;TypeScript,JavaScript;TypeScript
4,Bash/Shell;HTML/CSS;Python;SQL,Bash/Shell;HTML/CSS;Python;SQL
...,...,...
83434,Clojure;Kotlin;SQL,Clojure
83435,,
83436,Groovy;Java;Python,Java;Python
83437,Bash/Shell;JavaScript;Node.js;Python,Go;Rust
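
If the other columns are never needed, the selection can also happen at load time. The following is a minimal sketch, assuming the `usecols` parameter of `pd.read_csv` is acceptable here. The CSV text is a made-up stand-in for `survey_results_public.csv`, and `df_small` is a hypothetical name.

```python
import io

import pandas as pd

# Tiny stand-in for survey_results_public.csv (illustrative rows only).
csv_text = (
    'ResponseId,Country,LanguageHaveWorkedWith,LanguageWantToWorkWith\n'
    '1,Slovakia,C++;Swift,Swift\n'
    '2,Netherlands,JavaScript;Python,\n'
)

wanted_columns = ['LanguageHaveWorkedWith', 'LanguageWantToWorkWith']

# usecols tells pandas to parse only the listed columns.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=wanted_columns)
print(list(df_small.columns))
print(df_small.shape)
```

Loading fewer columns reduces memory use, at the cost of having to reload the file if another column becomes interesting later.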


# Removing Missing Entries

As you might notice, some of the rows above have a missing value, which pandas represents as "NaN". This indicates that the user did not select anything for that question. Since we need both columns, we filter out any row that has a "NaN" in either one.

In [5]:
df2 = df1.dropna(subset=['LanguageHaveWorkedWith', 'LanguageWantToWorkWith'])

display(df2)

Unnamed: 0,LanguageHaveWorkedWith,LanguageWantToWorkWith
0,C++;HTML/CSS;JavaScript;Objective-C;PHP;Swift,Swift
2,Assembly;C;Python;R;Rust,Julia;Python;Rust
3,JavaScript;TypeScript,JavaScript;TypeScript
4,Bash/Shell;HTML/CSS;Python;SQL,Bash/Shell;HTML/CSS;Python;SQL
5,C;C#;C++;HTML/CSS;Java;JavaScript;Node.js;Powe...,C#;C++;Go;HTML/CSS;Java;JavaScript;Node.js;Obj...
...,...,...
83433,Java;JavaScript;Kotlin;Objective-C;TypeScript,Kotlin
83434,Clojure;Kotlin;SQL,Clojure
83436,Groovy;Java;Python,Java;Python
83437,Bash/Shell;JavaScript;Node.js;Python,Go;Rust
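
This filter can be written either as a boolean mask or with `dropna`. The toy example below (made-up rows, hypothetical variable names) is a small check that the two forms keep the same rows.

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, one missing the "want" column, one missing both.
toy = pd.DataFrame({
    'LanguageHaveWorkedWith': ['C++;Swift', 'JavaScript;Python', np.nan],
    'LanguageWantToWorkWith': ['Swift', np.nan, np.nan],
})

# Boolean-mask form: keep rows where both columns are present.
mask = toy['LanguageHaveWorkedWith'].notna() & toy['LanguageWantToWorkWith'].notna()
by_mask = toy[mask]

# dropna form: drop rows with a missing value in either listed column.
by_dropna = toy.dropna(subset=['LanguageHaveWorkedWith', 'LanguageWantToWorkWith'])

print(by_mask.equals(by_dropna))
print(len(toy) - len(by_dropna), 'rows removed')
```

Both forms keep the original index labels, which is why the filtered output skips some row numbers.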


# Initializing the Result Database

Our results will be collected in a plain Python list, which is later converted into a DataFrame using the *pandas* library. Each entry in the list is a dictionary with three keys:
 - **Know (*str*)** The language the users already work with
 - **Want (*str*)** The language the users want to work with
 - **Count (*int*)** The number of developers who know the **Know** language and want to learn the **Want** language

In [6]:
result = []

# Function *Add an Entry*
This function updates the result list with a new entry. It takes a pair (a known and a wanted language) and searches the result list for that pair. If the pair exists, its count is incremented. Otherwise, a new entry with a count of 1 is appended to the end of the list.

In [7]:
def add_entry(know: str, want: str) -> None:
    for item in result:
        if item['Know'] == know and item['Want'] == want:
            item['Count'] = item['Count'] + 1
            return
    result.append({
        'Count': 1,
        'Know': know,
        'Want': want
    })
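
Because `add_entry` scans the whole list on every call, each update takes longer as the list grows. One possible alternative, shown here only as a sketch, keys a `collections.Counter` by the `(know, want)` tuple so each update is a single dictionary operation. The names `pair_counts`, `add_entry_fast`, and `records` are hypothetical and do not appear in the notebook.

```python
from collections import Counter

# Hypothetical alternative store: one counter keyed by (know, want) pairs.
pair_counts: Counter = Counter()


def add_entry_fast(know: str, want: str) -> None:
    """Increment the count for a (know, want) pair; unseen pairs start at 0."""
    pair_counts[(know, want)] += 1


for know, want in [('C++', 'Swift'), ('PHP', 'Swift'), ('C++', 'Swift')]:
    add_entry_fast(know, want)

# Convert back to the same record layout used by the result list.
records = [
    {'Count': count, 'Know': know, 'Want': want}
    for (know, want), count in pair_counts.items()
]
print(records)
```

The resulting `records` list has the same shape as `result`, so it could be passed to `pd.DataFrame` in the same way.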

# Iterating on the Data

Now we iterate over each row. For each row, we add an entry for every pair of programming languages taken from the know and want columns. Pairs in which the known and the wanted language are the same are skipped.

In [8]:
for index,row in df2.iterrows():
    known_languages = row['LanguageHaveWorkedWith'].split(';')
    wanted_languages = row['LanguageWantToWorkWith'].split(';')
    for lang_1 in known_languages:
        for lang_2 in wanted_languages:
            if lang_1 != lang_2:
                add_entry(lang_1, lang_2)
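
The nested loops can also be expressed with pandas operations. The sketch below is an alternative formulation under stated assumptions, not the notebook's method: it splits both columns, uses `explode` to produce one row per (know, want) pair, drops pairs where both languages match, and counts the rest with `groupby`. The toy rows and the names `toy`, `pairs`, and `counts` are made up for illustration.

```python
import pandas as pd

# Made-up respondents standing in for the filtered survey frame.
toy = pd.DataFrame({
    'LanguageHaveWorkedWith': ['C++;Swift', 'JavaScript;TypeScript'],
    'LanguageWantToWorkWith': ['Swift;Rust', 'JavaScript;TypeScript'],
})

pairs = (
    toy.assign(
        Know=toy['LanguageHaveWorkedWith'].str.split(';'),
        Want=toy['LanguageWantToWorkWith'].str.split(';'),
    )
    .explode('Know')            # one row per known language
    .reset_index(drop=True)     # keep row labels unique between explodes
    .explode('Want')            # one row per (known, wanted) pair
    .reset_index(drop=True)
)

# Skip pairs where the known and wanted language are the same.
pairs = pairs[pairs['Know'] != pairs['Want']]

# Count how many respondents produced each pair.
counts = pairs.groupby(['Know', 'Want']).size().reset_index(name='Count')
print(counts)
```

On the toy data this yields five distinct pairs, each with a count of 1, which matches what the loop-based approach would record for the same rows.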

# Convert Results to a DataFrame
To make displaying the results easier, we convert the result list into a pandas DataFrame.

In [9]:
result_dataframe = pd.DataFrame(result)
display(result_dataframe)

Unnamed: 0,Count,Know,Want
0,1805,C++,Swift
1,3897,HTML/CSS,Swift
2,4312,JavaScript,Swift
3,1139,Objective-C,Swift
4,1651,PHP,Swift
...,...,...,...
1401,45,Erlang,COBOL
1402,38,Groovy,COBOL
1403,43,Haskell,COBOL
1404,42,LISP,COBOL


# Sorting the Data
Since the results are already in a DataFrame, we sort them by count in ascending order, so the most common pairs appear at the bottom.

In [10]:
result_sorted = result_dataframe.sort_values(by=['Count']).reset_index(drop=True)
display(result_sorted)

Unnamed: 0,Count,Know,Want
0,38,Groovy,COBOL
1,38,Julia,COBOL
2,39,Clojure,COBOL
3,39,Crystal,COBOL
4,39,Erlang,VBA
...,...,...,...
1401,21367,JavaScript,Node.js
1402,22006,JavaScript,Python
1403,23712,JavaScript,TypeScript
1404,25391,JavaScript,HTML/CSS
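
To see the most common pairs first instead, the same sort can be reversed. This is a small sketch on three rows copied from the output above. `toy_result` and `top` are hypothetical names.

```python
import pandas as pd

# Three rows taken from the sorted output above.
toy_result = pd.DataFrame([
    {'Count': 38, 'Know': 'Groovy', 'Want': 'COBOL'},
    {'Count': 23712, 'Know': 'JavaScript', 'Want': 'TypeScript'},
    {'Count': 25391, 'Know': 'JavaScript', 'Want': 'HTML/CSS'},
])

# ascending=False puts the largest counts at the top.
top = toy_result.sort_values(by='Count', ascending=False).reset_index(drop=True)
print(top.head(2))
```

Calling `head(n)` on the descending frame then gives the `n` most frequent pairs directly.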
