# Data analysis of "2019 Stack Overflow Developer Survey"

The source video explains how to analyze the survey data, but I felt that following it requires some background knowledge of `setdefault` and `defaultdict`, which I added here. Some of the trickier code in the source also needs a closer look, so I break it down further here.

<a href="https://www.youtube.com/watch?v=_P7X8tMplsw&t=1007s"> code source</a> <br>
<a href="https://insights.stackoverflow.com/survey"> data source </a>

## 1) Basic knowledge about setdefault and defaultdict

### setdefault & get()
- The dictionary method `setdefault()` is similar to `get()`, but it also sets `dict[key] = default` if the key is not in the dict.
- `get()` returns the value stored for the key. If the key is missing, it returns the provided default value without adding the key to the dictionary.

`dict.setdefault(key, default=None)`

- Parameters
  - key − This is the key to be searched.
  - default − This is the value to be returned, and stored under the key, if the key is not found.

In [1]:
d1 = {'Name': 'Zara', 'Age': 7}

print ("Value : {}".format(d1.setdefault('Age', None)))
print ("Value : {}".format(d1.setdefault('Sex', None)))
print(d1)
print()

d2 = {'Name': 'Zara', 'Age': 7}
print ("Value : {}".format(d2.get('Age', None)))
print ("Value : {}".format(d2.get('Sex', None)))
print(d2)

Value : 7
Value : None
{'Name': 'Zara', 'Age': 7, 'Sex': None}

Value : 7
Value : None
{'Name': 'Zara', 'Age': 7}


### defaultdict & setdefault
**Using list as the default_factory, it is easy to group a sequence of key-value pairs into a dictionary of lists:**

In [2]:
from collections import defaultdict

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

**grouping using `defaultdict`**

- define the type of value in advance

In [3]:
d = defaultdict(list)

for k, v in s:
    d[k].append(v)

sorted(d.items())

[('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]

- When each key is encountered for the first time, it is not already in the mapping; so an entry is automatically created using the default_factory function which returns an empty list. 
- The list.append() operation then attaches the value to the new list. When keys are encountered again, the look-up proceeds normally (returning the list for that key) and the list.append() operation adds another value to the list. This technique is simpler and faster than an equivalent technique using dict.setdefault():

**grouping using `setdefault`**

In [4]:
d = {}
for k, v in s:
    d.setdefault(k, []).append(v)

sorted(d.items())

[('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]
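
One detail worth seeing in isolation: the entry is created by indexing (`d[k]`), not by every kind of lookup. A minimal sketch, using a hypothetical key `'green'` that is not in the data above:

```python
from collections import defaultdict

d = defaultdict(list)

# Membership tests and get() do not call default_factory
in_before = 'green' in d
got = d.get('green')
size_before = len(d)

# Plain indexing does call it, and stores the new empty list
d['green']
size_after = len(d)

print(in_before, got, size_before, size_after)  # False None 0 1
print(dict(d))                                  # {'green': []}
```

So if you only want to check whether a key exists, `in` or `get()` is the safer choice, because `d[k]` would quietly add the key.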

**Setting the default_factory to int makes the defaultdict useful for counting (like a bag or multiset in other languages):**

In [5]:
# set default value to int

s='mississippi'
d=defaultdict(int)

for k in s:
    d[k]+=1
    
sorted(d.items())


[('i', 4), ('m', 1), ('p', 2), ('s', 4)]

When a letter is first encountered, it is missing from the mapping, so the default_factory function calls int() to supply a default count of zero. The increment operation then builds up the count for each letter. <br>
The function int() which always returns zero is just a special case of constant functions. A faster and more flexible way to create constant functions is to use a lambda function which can supply any constant value (not just zero):

In [6]:
def constant_factory(value):
    return lambda: value

d=defaultdict(constant_factory('<missing>'))
d.update(name='John', action='ran')
d


print('%(name)s %(action)s to %(object)s' % d)

print(f"{d['name']} {d['action']} to {d['object']}")

John ran to <missing>
John ran to <missing>
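
To see why `int`, `list` and `set` all work as a default_factory, it may help to call them with no arguments. The sketch below also checks a side effect of the formatting example: looking up the missing `'object'` key stores it in the dictionary. The variable names `empties` and `sentence` are mine, not from the source.

```python
from collections import defaultdict

# Built-in types called with no arguments return their "empty" value
empties = (int(), list(), set(), str())
print(empties)  # (0, [], set(), '')

def constant_factory(value):
    # Returns a zero-argument function that always yields `value`
    return lambda: value

d = defaultdict(constant_factory('<missing>'))
d.update(name='John', action='ran')

sentence = '%(name)s %(action)s to %(object)s' % d
print(sentence)    # John ran to <missing>
print(sorted(d))   # ['action', 'name', 'object']
```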


**using set as a default datatype**

In [7]:
# set default value to set

s = [('red', 1), ('blue', 2), ('red', 3), ('blue', 4), ('red', 1), ('blue', 4)]
d = defaultdict(set)

for k, v in s:
    d[k].add(v)

sorted(d.items())

e={}
for k, v in s:
    e.setdefault(k, set()).add(v)
    
sorted(e.items())

[('blue', {2, 4}), ('red', {1, 3})]

In [8]:
from collections import defaultdict

def def_value():
    return "Not present"

d=defaultdict(def_value)

d['a']=1
d['b']=2

d

defaultdict(<function __main__.def_value()>, {'a': 1, 'b': 2})

In [9]:
print(d['c'])

Not present


In [10]:
d=defaultdict(lambda:'Not present')
d['a']=1
d['b']=2

print(d['c'])

Not present


In [11]:
from collections import defaultdict

d=defaultdict(list)

for i in range(5):
    d[i].append(i)
print(d)

defaultdict(<class 'list'>, {0: [0], 1: [1], 2: [2], 3: [3], 4: [4]})


In [12]:
from collections import defaultdict

d=defaultdict(int)

L=[1,2,3,4,2,3,4,2]

for i in L:
    d[i]+=1
    
print(d)

defaultdict(<class 'int'>, {1: 1, 2: 3, 3: 2, 4: 2})


In [13]:
dict1={'A':'Geeks', 'B':'For'}

val=dict1.setdefault('A')
print(val)
print(dict1)
val=dict1.setdefault('C')
print(val)
print(dict1)
val=dict1.setdefault('D', 'Geeks')
print(val)
print(dict1)

Geeks
{'A': 'Geeks', 'B': 'For'}
None
{'A': 'Geeks', 'B': 'For', 'C': None}
Geeks
{'A': 'Geeks', 'B': 'For', 'C': None, 'D': 'Geeks'}


## 2) Data analysis of "2019 Stack Overflow Developer Survey"

In [14]:
import csv
from pprint import pprint

with open('survey_results_public.csv', encoding='utf-8') as f:
    reader=csv.DictReader(f)
  
    
    for line in reader:
        pprint(line)
        break

{'Age': '14',
 'Age1stCode': '10',
 'BetterLife': 'Yes',
 'BlockchainIs': 'NA',
 'BlockchainOrg': 'NA',
 'CareerSat': 'NA',
 'CodeRev': 'NA',
 'CodeRevHrs': 'NA',
 'CompFreq': 'NA',
 'CompTotal': 'NA',
 'Containers': 'I do not use containers',
 'ConvertedComp': 'NA',
 'Country': 'United Kingdom',
 'CurrencyDesc': 'NA',
 'CurrencySymbol': 'NA',
 'DatabaseDesireNextYear': 'MySQL',
 'DatabaseWorkedWith': 'SQLite',
 'Dependents': 'No',
 'DevEnviron': 'IntelliJ;Notepad++;PyCharm',
 'DevType': 'NA',
 'EdLevel': 'Primary/elementary school',
 'EduOther': 'Taught yourself a new language, framework, or tool without '
             'taking a formal course',
 'Employment': 'Not employed, and not looking for work',
 'EntTeams': "No, and I don't know what those are",
 'Ethnicity': 'NA',
 'Extraversion': 'Online',
 'FizzBuzz': 'NA',
 'Gender': 'Man',
 'Hobbyist': 'Yes',
 'ITperson': 'Fortunately, someone else has that title',
 'ImpSyn': 'NA',
 'JobFactors': 'NA',
 'JobSat': 'NA',
 'JobSeek': 'NA',
 'L

In [15]:
import csv
from pprint import pprint

with open('survey_results_public.csv', encoding='utf-8') as f:
    reader=csv.DictReader(f)
    
    for line in reader:
        pprint(line['Hobbyist'])
        break

'Yes'
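
If you do not have the CSV file at hand, the behaviour of `csv.DictReader` can be tried on a small in-memory sample. This is only a sketch: the three columns and two rows below are hypothetical stand-ins for `survey_results_public.csv`.

```python
import csv
import io

# Hypothetical three-column sample standing in for survey_results_public.csv
sample = io.StringIO(
    "Respondent,Hobbyist,LanguageWorkedWith\n"
    "1,Yes,HTML/CSS;Java;JavaScript;Python\n"
    "2,No,C++;HTML/CSS;Python\n"
)

reader = csv.DictReader(sample)   # the first line becomes the keys
rows = list(reader)               # each remaining line becomes one dict

print(reader.fieldnames)     # ['Respondent', 'Hobbyist', 'LanguageWorkedWith']
print(rows[0]['Hobbyist'])   # Yes
print(rows[1]['LanguageWorkedWith'].split(';'))  # ['C++', 'HTML/CSS', 'Python']
```

Every value comes back as a string, including numbers such as `'1'`, which is why the survey code compares against `'Yes'`, `'No'` and `'NA'` as text.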


In [16]:
# calculate the number for both yes and no for 'Hobbyist' column

with open('survey_results_public.csv', encoding='utf-8') as f:
    reader=csv.DictReader(f)
    yes_count=0
    no_count=0
    
    for line in reader:
        if line['Hobbyist']=='Yes':
            yes_count+=1
        elif line['Hobbyist']=='No':
            no_count+=1
total=yes_count + no_count
yes_pct=round((yes_count/total)*100, 2)
no_pct=round((no_count/total)*100, 2)


print('Yes : {:}'.format(yes_count))
print('No  : {:}'.format(no_count))
print()
print('Yes : {:}'.format(yes_pct))
print('No  : {:}'.format(no_pct))

Yes : 71257
No  : 17626

Yes : 80.17
No  : 19.83


In [17]:
# calculate the number for both yes and no for 'Hobbyist' column and store the results in dictionary


with open('survey_results_public.csv', encoding='utf-8') as f:
    reader=csv.DictReader(f)
    counts={'Yes':0,
           'No':0} # initialize the values to 0 so both keys already exist
    
    for line in reader:
        counts[line['Hobbyist']]+=1
        
total= counts['Yes'] + counts['No']

yes_pct=round((counts['Yes']/total)*100, 2)
no_pct=round((counts['No']/total)*100, 2)


print('Yes : {:}'.format(counts['Yes']))
print('No  : {:}'.format(counts['No']))
print()
print('Yes : {:}%'.format(yes_pct))
print('No  : {:}%'.format(no_pct))

Yes : 71257
No  : 17626

Yes : 80.17%
No  : 19.83%
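
The version above works only because every value it meets is already a key in `counts`. A small sketch of what would happen otherwise, using a hypothetical list `answers` that contains an unexpected `'NA'`:

```python
from collections import defaultdict

# Hypothetical answers; 'NA' stands for a value we did not plan for
answers = ['Yes', 'No', 'Yes', 'NA']

fixed = {'Yes': 0, 'No': 0}
failed_on = None
for answer in answers:
    try:
        fixed[answer] += 1
    except KeyError:
        failed_on = answer   # a plain dict cannot increment a missing key

flexible = defaultdict(int)
for answer in answers:
    flexible[answer] += 1    # missing keys start at int() == 0

print(failed_on)        # NA
print(fixed)            # {'Yes': 2, 'No': 1}
print(dict(flexible))   # {'Yes': 2, 'No': 1, 'NA': 1}
```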


In [18]:
# use defaultdict

from collections import defaultdict # initialize values of keys by providing data type

with open('survey_results_public.csv', encoding='utf-8') as f:
    reader=csv.DictReader(f)
    counts=defaultdict(int) # dictionary knows to expect integers as values to our keys and it will start at zero by default
    
    for line in reader:
        counts[line['Hobbyist']]+=1
total= counts['Yes'] + counts['No']
yes_pct=round((counts['Yes']/total)*100, 2)
no_pct=round((counts['No']/total)*100, 2)


print('Yes : {:}'.format(counts['Yes']))
print('No  : {:}'.format(counts['No']))
print()
print('Yes : {:}%'.format(yes_pct))
print('No  : {:}%'.format(no_pct))

Yes : 71257
No  : 17626

Yes : 80.17%
No  : 19.83%


In [19]:
# use Counter

from collections import defaultdict, Counter

with open('survey_results_public.csv', encoding='utf-8') as f:
    reader=csv.DictReader(f)
    counts=Counter()  # when simply counting certain values
    
    for line in reader:
        counts[line['Hobbyist']]+=1
total= counts['Yes'] + counts['No']
yes_pct=round((counts['Yes']/total)*100, 2)
no_pct=round((counts['No']/total)*100, 2)


print('Yes : {:}'.format(counts['Yes']))
print('No  : {:}'.format(counts['No']))
print()
print('Yes : {:}%'.format(yes_pct))
print('No  : {:}%'.format(no_pct))

Yes : 71257
No  : 17626

Yes : 80.17%
No  : 19.83%
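
`Counter` can also do the counting loop by itself when it is given an iterable. A minimal sketch with a hypothetical list `answers` in place of the 'Hobbyist' column:

```python
from collections import Counter

# Hypothetical answers standing in for the 'Hobbyist' column
answers = ['Yes', 'No', 'Yes', 'Yes']

counts = Counter(answers)          # counts every item in one pass
total = sum(counts.values())

for answer, count in counts.most_common():
    pct = round(count / total * 100, 2)
    print(f'{answer} : {count} ({pct}%)')
# Yes : 3 (75.0%)
# No : 1 (25.0%)

missing = counts['Maybe']          # 0, and 'Maybe' is not added to the Counter
print(missing, 'Maybe' in counts)  # 0 False
```

With the real file, the same idea would be to pass a generator of `line['Hobbyist']` values to `Counter` inside the `with` block.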


In [20]:
# find the most popular programming language using Counter

from collections import defaultdict, Counter

with open('survey_results_public.csv', encoding='utf-8') as f:
    reader=csv.DictReader(f)
    language_counter=Counter() 
        
    for line in reader:
        languages=line['LanguageWorkedWith'].split(';')
        
        for language in languages:
            language_counter[language]+=1
            
        print(language_counter)
        break


Counter({'HTML/CSS': 1, 'Java': 1, 'JavaScript': 1, 'Python': 1})


In [21]:
# use Counter.update() and time the whole pass over the file
import time
from collections import defaultdict, Counter

t1=time.perf_counter()
with open('survey_results_public.csv', encoding='utf-8') as f:
    reader=csv.DictReader(f)
    language_counter=Counter() 
        
    for line in reader:
        languages=line['LanguageWorkedWith'].split(';')
        language_counter.update(languages)
        
t2=time.perf_counter()

pprint(language_counter)
print()
print(f'taken {t2 - t1}')
pprint(language_counter.most_common(10))

Counter({'JavaScript': 59219,
         'HTML/CSS': 55466,
         'SQL': 47544,
         'Python': 36443,
         'Java': 35917,
         'Bash/Shell/PowerShell': 31991,
         'C#': 27097,
         'PHP': 23030,
         'C++': 20524,
         'TypeScript': 18523,
         'C': 18017,
         'Other(s):': 7920,
         'Ruby': 7331,
         'Go': 7201,
         'Assembly': 5833,
         'Swift': 5744,
         'Kotlin': 5620,
         'R': 5048,
         'VBA': 4781,
         'Objective-C': 4191,
         'Scala': 3309,
         'Rust': 2794,
         'Dart': 1683,
         'NA': 1314,
         'Elixir': 1260,
         'Clojure': 1254,
         'WebAssembly': 1015,
         'F#': 973,
         'Erlang': 777})

taken 3.8947582999999995
[('JavaScript', 59219),
 ('HTML/CSS', 55466),
 ('SQL', 47544),
 ('Python', 36443),
 ('Java', 35917),
 ('Bash/Shell/PowerShell', 31991),
 ('C#', 27097),
 ('PHP', 23030),
 ('C++', 20524),
 ('TypeScript', 18523)]
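
The reason `update()` gets the result of `split(';')` and not the raw string is that `Counter.update()` counts the items of whatever iterable it receives. A short sketch with made-up values (the names `by_language` and `by_character` are mine):

```python
from collections import Counter

by_language = Counter()
by_language.update('C#;Python'.split(';'))   # a list: counts whole names
by_language.update('Python'.split(';'))      # still a list: ['Python']

by_character = Counter()
by_character.update('Python')                # a bare string: counts letters

print(by_language)            # Counter({'Python': 2, 'C#': 1})
print(sorted(by_character))   # ['P', 'h', 'n', 'o', 't', 'y']
```

Note that `'NA'` shows up as a "language" in the output above, because respondents without an answer are counted too. One option, if that is not wanted, would be to skip rows where the column equals `'NA'`.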


In [22]:
import csv
with open('survey_results_public.csv', encoding='utf-8') as f:
    csv_reader = csv.DictReader(f)

    dev_type_info = {}
    


    for line in csv_reader:
        if line['DevType'] != 'NA':
            count=0
            while count <10:
                print (line['DevType'].split(';'))
                count+=1
            break
    print()        
    for line in csv_reader:
        if line['LanguageWorkedWith'] != 'NA':
            count=0
            while count <10:
                print (line['LanguageWorkedWith'].split(';'))
                count+=1
            break

['Developer, desktop or enterprise applications', 'Developer, front-end']
['Developer, desktop or enterprise applications', 'Developer, front-end']
['Developer, desktop or enterprise applications', 'Developer, front-end']
['Developer, desktop or enterprise applications', 'Developer, front-end']
['Developer, desktop or enterprise applications', 'Developer, front-end']
['Developer, desktop or enterprise applications', 'Developer, front-end']
['Developer, desktop or enterprise applications', 'Developer, front-end']
['Developer, desktop or enterprise applications', 'Developer, front-end']
['Developer, desktop or enterprise applications', 'Developer, front-end']
['Developer, desktop or enterprise applications', 'Developer, front-end']

['HTML/CSS']
['HTML/CSS']
['HTML/CSS']
['HTML/CSS']
['HTML/CSS']
['HTML/CSS']
['HTML/CSS']
['HTML/CSS']
['HTML/CSS']
['HTML/CSS']
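
Two things in the cell above are easy to miss. The `while` loop prints the same row ten times, and the second `for` loop does not start from the top of the file: the reader is an iterator, so it continues after the row where the first loop stopped. That is probably why the languages printed do not belong to the respondent whose DevType was printed. A sketch on a hypothetical in-memory sample:

```python
import csv
import io

# Hypothetical sample; the real file has many more columns
sample = io.StringIO(
    "DevType,LanguageWorkedWith\n"
    "NA,HTML/CSS;Java\n"
    "Designer;Student,C++;Python\n"
    "Student,HTML/CSS\n"
)

reader = csv.DictReader(sample)

first = None
for line in reader:
    if line['DevType'] != 'NA':
        first = line['DevType'].split(';')
        break

# The reader is an iterator, so this loop starts at the row after `first`
second = None
for line in reader:
    if line['LanguageWorkedWith'] != 'NA':
        second = line['LanguageWorkedWith'].split(';')
        break

print(first)    # ['Designer', 'Student']
print(second)   # ['HTML/CSS']
```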


In [23]:
# Analyse further to get the info of developer types

import csv
from collections import defaultdict, Counter

with open('survey_results_public.csv', encoding='utf-8') as f:
    csv_reader = csv.DictReader(f)

    dev_type_info = {}

    for line in csv_reader:
        dev_types = line['DevType'].split(';')
        

        for dev_type in dev_types:
            dev_type_info.setdefault(dev_type, {
                'total': 0,
                'language_counter': Counter()
            })

            languages = line['LanguageWorkedWith'].split(';')
            dev_type_info[dev_type]['language_counter'].update(languages)
            dev_type_info[dev_type]['total'] += 1
            break
        break
    print(dev_type_info)

for dev_type, info in dev_type_info.items():
    print(dev_type)

    for language, value in info['language_counter'].most_common(5):
        language_pct = (value / info['total']) * 100
        language_pct = round(language_pct, 2)

        print(f'\t{language}: {language_pct}%')

{'NA': {'total': 1, 'language_counter': Counter({'HTML/CSS': 1, 'Java': 1, 'JavaScript': 1, 'Python': 1})}}
NA
	HTML/CSS: 100.0%
	Java: 100.0%
	JavaScript: 100.0%
	Python: 100.0%


In [24]:
# Analyse further to get the info of developer types

import csv
from collections import defaultdict, Counter
from pprint import pprint

with open('survey_results_public.csv', encoding='utf-8') as f:
    csv_reader = csv.DictReader(f)

    dev_type_info = {}
    
    for line in csv_reader:
        if line['DevType'] != 'NA':
            print(line['DevType'], end='\n\n')
            
            dev_types = line['DevType'].split(';')
            print(dev_types, end='\n\n')
         
            for _ in range(2):
                for dev_type in dev_types:
                    print(dev_type, end='\n')

                    dev_type_info.setdefault(dev_type, {
                        'total': 0,
                        'language_counter': Counter()
                    })

                    languages = line['LanguageWorkedWith'].split(';')
                    dev_type_info[dev_type]['language_counter'].update(languages)
                    dev_type_info[dev_type]['total'] += 1
                break
            break
    pprint(dev_type_info)
    
print()

for dev_type, info in dev_type_info.items():
    print(dev_type)

    for language, value in info['language_counter'].most_common(5):
        language_pct = (value / info['total']) * 100
        language_pct = round(language_pct, 2)

        print(f'\t{language}: {language_pct}%')
        
    print(end='\n')

Developer, desktop or enterprise applications;Developer, front-end

['Developer, desktop or enterprise applications', 'Developer, front-end']

Developer, desktop or enterprise applications
Developer, front-end
{'Developer, desktop or enterprise applications': {'language_counter': Counter({'C++': 1,
                                                                                'HTML/CSS': 1,
                                                                                'Python': 1}),
                                                   'total': 1},
 'Developer, front-end': {'language_counter': Counter({'C++': 1,
                                                       'HTML/CSS': 1,
                                                       'Python': 1}),
                          'total': 1}}

Developer, desktop or enterprise applications
	C++: 100.0%
	HTML/CSS: 100.0%
	Python: 100.0%

Developer, front-end
	C++: 100.0%
	HTML/CSS: 100.0%
	Python: 100.0%



- Above, there are two dev_types: one is "Developer, desktop or enterprise applications"; the other one is "Developer, front-end"

- Both keys get the same value, because the languages of the same respondent are added to each of them:
        Counter({'C++': 1,
                 'HTML/CSS': 1,
                 'Python': 1})
        
- The total is 1 for each developer type
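
The nested `setdefault` pattern can be tried without the CSV file. The sketch below uses two hypothetical rows and keeps the dictionary returned by `setdefault` in a variable `info` (my own shortcut, not from the source), so the update and the total are applied to every dev_type of a row:

```python
from collections import Counter

# Hypothetical rows: (DevType, LanguageWorkedWith)
rows = [
    ('Designer;Student', 'HTML/CSS;Python'),
    ('Student', 'Python;C++'),
]

dev_type_info = {}
for dev_types, languages in rows:
    for dev_type in dev_types.split(';'):
        # setdefault returns the stored dict, creating it on first sight
        info = dev_type_info.setdefault(
            dev_type, {'total': 0, 'language_counter': Counter()})
        info['language_counter'].update(languages.split(';'))
        info['total'] += 1

for dev_type, info in dev_type_info.items():
    print(dev_type, info['total'], dict(info['language_counter']))
# Designer 1 {'HTML/CSS': 1, 'Python': 1}
# Student 2 {'HTML/CSS': 1, 'Python': 2, 'C++': 1}
```

Because a new dictionary (with a new `Counter`) is built for each `setdefault` call, the two dev_types never share a counter, even though they receive the same languages from the first row.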

In [25]:
# Analyse further to get the info of developer types

import csv
from collections import defaultdict, Counter
from pprint import pprint

with open('survey_results_public.csv', encoding='utf-8') as f:
    csv_reader = csv.DictReader(f)

    dev_type_info = {}
    
    for line in csv_reader:
        dev_types = line['DevType'].split(';')
        for dev_type in dev_types:
            dev_type_info.setdefault(dev_type, {
                'total': 0,
                'language_counter': Counter()
            })

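        # NOTE: the next three lines sit outside the inner loop, so for each
        # respondent only the last dev_type in the list is counted. Indent them
        # one level to count every dev_type (the output below reflects the code as written).
        languages = line['LanguageWorkedWith'].split(';')
        dev_type_info[dev_type]['language_counter'].update(languages)
        dev_type_info[dev_type]['total'] += 1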
    

for dev_type, info in dev_type_info.items():
    print(dev_type)

    for language, value in info['language_counter'].most_common(5):
        language_pct = (value / info['total']) * 100
        language_pct = round(language_pct, 2)

        print(f'\t{language}: {language_pct}%')
        
    print(end='\n')

NA
	HTML/CSS: 54.9%
	Python: 51.09%
	JavaScript: 50.58%
	Java: 42.71%
	C++: 35.02%

Developer, desktop or enterprise applications
	SQL: 49.93%
	C#: 48.74%
	JavaScript: 39.57%
	Java: 37.65%
	HTML/CSS: 34.96%

Developer, front-end
	JavaScript: 86.11%
	HTML/CSS: 82.25%
	SQL: 32.71%
	TypeScript: 30.12%
	PHP: 28.31%

Designer
	HTML/CSS: 59.65%
	JavaScript: 44.44%
	Python: 25.15%
	SQL: 23.98%
	Java: 21.05%

Developer, back-end
	SQL: 53.6%
	JavaScript: 48.27%
	Java: 45.16%
	Python: 38.62%
	HTML/CSS: 36.62%

Developer, full-stack
	JavaScript: 86.82%
	HTML/CSS: 77.6%
	SQL: 63.58%
	C#: 37.44%
	Java: 35.86%

Academic researcher
	Python: 59.04%
	C++: 36.52%
	Bash/Shell/PowerShell: 33.79%
	C: 30.72%
	HTML/CSS: 30.03%

Developer, mobile
	JavaScript: 59.85%
	Java: 55.21%
	HTML/CSS: 51.85%
	SQL: 41.15%
	C#: 29.43%

Data or business analyst
	SQL: 65.75%
	Python: 54.34%
	R: 37.21%
	HTML/CSS: 35.84%
	VBA: 32.65%

Data scientist or machine learning specialist
	Python: 84.6%
	SQL: 57.88%
	R: 55.08%
	Bash/S