# **Not So One-Hot: Where are the Non-Binary People in Data Science?**

***

![Image](https://images-na.ssl-images-amazon.com/images/I/41e6vlU00TL._AC_.jpg)

*Image from www.amazon.co.uk*

***

> "Why did the non-binary data scientist enter the Kaggle competition? Because there is gold in them/their data."

***

This notebook explores the experience of non-binary people in data science, as captured by the Kaggle 2020 survey dataset. We all know that more inclusive workplaces function better, perform better and contain happier people.

There has been a lot of focus on getting more women and people of colour into data science, and rightly so. But what about non-binary people? 🌈

This notebook is intended for beginners, so all code will be shown and walked through. 🏆 💬

***

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

## 1 Initial Exploratory Data Analysis

In [None]:
df = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)

In [None]:
df.head()

In [None]:
# Since the first row of our data is the questions, we'll split that out

questions = df[0:1]
data = df[1:]

In [None]:
# Check how many respondents there were to the survey

len(data)

In [None]:
# Have a look at the gender split

data.Q2.value_counts()

In [None]:
# Analyses often keep only the 'Man' and 'Woman' responses; for our purposes we'll keep 'Nonbinary' too.
# .copy() gives an independent frame, so adding columns later won't raise a SettingWithCopyWarning.

data = data[data.Q2.isin(['Man', 'Woman', 'Nonbinary'])].copy()

Splitting the data into binary and non binary we get:

In [None]:
nb = data[data.Q2 == 'Nonbinary']
nb_employ = nb[(nb.Q5!='Student') & (nb.Q5!='Currently not employed')]
len(nb)

In [None]:
theb = data[(data.Q2 == 'Man') | (data.Q2 == 'Woman')]
theb_employ = theb[(theb.Q5!='Student') & (theb.Q5!='Currently not employed')]
len(theb)

Let's check for any errors in the survey responses by looking at how long respondents took to complete the survey (in seconds):

In [None]:
time_col = 'Time from Start to Finish (seconds)'

# Summarise completion times (in seconds) for each group
for name, group in [('nb', nb), ('b', theb)]:
    seconds = group[time_col].astype(int)
    print(f'\nfor {name} people:')
    print('Max', seconds.max())
    print('Min', seconds.min())
    print('mean', seconds.mean())
    print('median', seconds.median())

In [None]:
# convert to days
1144493/60/60/24

In [None]:
# have a look at the offending row
theb[theb['Time from Start to Finish (seconds)']=='1144493']

Some of the times look a bit strange. These would be contenders to drop from the data if we were to look into them further, but we'll carry on for now.
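One way to follow up on those odd durations is to convert the column to numbers and flag anything outside a plausible window. The sketch below runs on a small made-up frame; the column name matches the survey, but the rows and the 60-second and 24-hour thresholds are assumptions chosen only for illustration.

```python
import pandas as pd

col = 'Time from Start to Finish (seconds)'

# Toy stand-in for the survey data: durations arrive as strings,
# as they do in the notebook's frame (these rows are invented).
toy = pd.DataFrame({col: ['45', '612', '1144493', '980', '20']})

# Convert to numbers; anything unparseable becomes NaN instead of raising
seconds = pd.to_numeric(toy[col], errors='coerce')

# Assumed thresholds: under a minute is too fast, over a day is too slow
too_fast = seconds < 60
too_slow = seconds > 24 * 60 * 60

toy['suspect'] = too_fast | too_slow
print(toy)
print('Suspect rows:', int(toy['suspect'].sum()))
```

Flagging rather than dropping keeps the decision visible, so the thresholds can be revisited later.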

There are clearly a *lot* more responses from binary people, but humans aren't great at visualising large numbers, so let's draw it out.

In [None]:
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'Binary', 'Nonbinary'
sizes = [len(theb), len(nb)]
explode = (0, 0.1)  # only "explode" the 2nd slice (i.e. 'Nonbinary')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

Wow, that puts it into a bit more perspective, but is this comparable to the population? 💭

Recent estimates for the [US](https://news.stlpublicradio.org/politics-issues/2020-03-17/the-2020-census-is-underway-but-nonbinary-and-gender-nonconforming-respondents-feel-counted-out#:~:text=That%20study%20revealed%20at%20least,to%20about%20two%20million%20people.) put the nonbinary share of the population at 0.5%, whereas estimates for the [UK](https://practicalandrogyny.com/2014/12/16/how-many-people-in-the-uk-are-nonbinary/) place it at 0.4%.

The exact value in the pie chart is:

In [None]:
len(nb)/(len(nb)+len(theb))*100

So it seems nonbinary people could be under-represented within Data Science. Let's carry on looking into the data to see if there are any more indicators of their experience in the industry.
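Because the nonbinary group is small, the percentage above carries some uncertainty. As a rough check, a Wilson score interval gives a plausible range for the true share. The function name `wilson_interval` and the counts passed to it are hypothetical; they are not the survey's real totals.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if total <= 0:
        raise ValueError('total must be positive')
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical counts for illustration only
low, high = wilson_interval(successes=50, total=19000)
print(f'{low * 100:.2f}% to {high * 100:.2f}%')
```

If the population estimates quoted above sit outside the interval computed from the real counts, that would lend some support to the under-representation reading, though the survey is not a random sample of the industry.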

***

## 2 Data Preprocessing

In this section we'll edit the data a bit and then create a function that allows us to pull the information we want from the survey questions. This will be useful since some questions were multiple response and in our current dataset they're all split up.
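To make the "split up" layout concrete, here is a minimal sketch of counting a multiple-response question. It assumes each option lives in its own column (named like `Q7_Part_1`) holding either the option text or a missing value; the toy frame and the helper name `count_multi` are made up for illustration.

```python
import pandas as pd

# Invented rows: one column per answer option, None where the option was not ticked
toy = pd.DataFrame({
    'Q7_Part_1': ['Python', 'Python', None, 'Python'],
    'Q7_Part_2': ['R', None, None, 'R'],
    'Q7_Part_3': [None, 'SQL', None, None],
    'Q70_Part_1': ['x', 'x', 'x', 'x'],  # a different question sharing the 'Q7' prefix
})

def count_multi(frame, questionref):
    """Count how many respondents ticked each option of a multi-select question."""
    # Matching on 'Q7_' rather than 'Q7' avoids picking up columns such as 'Q70_Part_1'
    cols = [c for c in frame.columns if c.startswith(questionref + '_')]
    counts = {}
    for c in cols:
        answers = frame[c].dropna()
        if not answers.empty:
            counts[answers.iloc[0]] = len(answers)
    return pd.Series(counts, dtype='int64')

print(count_multi(toy, 'Q7'))
```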

Let's write a function to relabel the values in the gender column so we have only 'binary' and 'nonbinary', as that's what we're interested in comparing:

In [None]:
def gender_label(x):
    if x == 'Man':
        return 'binary'
    elif x == 'Woman':
        return 'binary'
    elif x == 'Nonbinary':
        return 'nonbinary'
    else:
        return ''

In [None]:
data['gender_label'] = data['Q2'].apply(gender_label)
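The same relabelling can also be written as a dictionary lookup, which some readers find easier to scan. This is only an alternative sketch on a made-up Series, with `label_map` as an illustrative name; it is not what the notebook runs.

```python
import pandas as pd

genders = pd.Series(['Man', 'Woman', 'Nonbinary', 'Prefer not to say'])

label_map = {'Man': 'binary', 'Woman': 'binary', 'Nonbinary': 'nonbinary'}

# map() yields NaN for values missing from the dict;
# fillna('') mirrors the empty-string fallback of gender_label
labels = genders.map(label_map).fillna('')
print(labels.tolist())
```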

I've written a scary-looking function (create_data_frame) that will return a dataframe based on some inputs. It looks horrible, I know, but if you read through it section by section it should make sense. Or you could just try it out for yourself 😃 I'll use this function in sections 3 and 4 below to grab the required data for visualisations.

In [None]:
"""
Function to handle split questions in the survey given some inputs

inputs:
questionref - the question in the kaggle survey e.g. 'Q1'
q_type - the type of question, either 'multi' or 'single'
d_set - which data you want the function to act on
      - this is either 'all' data or just the 'employed'
      
returns:
a dataframe comparing the question responses for binary and nonbinary people
"""

def create_data_frame(questionref, q_type, d_set):
    if (q_type == 'multi') & (d_set == 'all'):
    
        # series object for non-binary
        df_nb = nb[[i for i in nb.columns if questionref in i]]
        df_nb_all = pd.Series(dtype='int')

        for i in df_nb.columns:
            try:
                df_nb_all[df_nb[i].value_counts().index[0]] = df_nb[i].count()
            except:
                df_nb_all['None'] = 0

        # series object for binary people
        df_b = theb[[i for i in theb.columns if questionref in i]]

        df_b_all = pd.Series(dtype='int')

        for i in df_b.columns:
            try:
                df_b_all[df_b[i].value_counts().index[0]] = df_b[i].count()
            except:
                df_b_all['None'] = 0

        # create the dataframes
        df_b_all = pd.DataFrame(df_b_all)
        df_nb_all = pd.DataFrame(df_nb_all)
        df_b_all.rename(columns={0:'B'}, inplace=True)
        df_nb_all.rename(columns={0:'NB'}, inplace=True)

        # concat frames into one for analysis
        # separate dfs also exist
        frames=[df_b_all, df_nb_all]
        df_all = pd.concat(frames, axis=1)
        df_all.fillna(0, inplace=True)

        def to_percentage_b(x):
            return (x/df_all.B.sum())*100

        def to_percentage_nb(x):
            return (x/df_all.NB.sum())*100

        # add columns for %s and delta
        df_all['B%']=df_all['B'].apply(to_percentage_b)
        
        if (df_all['NB'].sum() >0):
            df_all['NB%']=df_all['NB'].apply(to_percentage_nb)
            
        else:
            df_all['NB%']=0
            
        df_all['Delta']=df_all['B%']-df_all['NB%']
        
        # sort the df by the lowest delta values - indicating areas with more nb ppl
        df_all.sort_values(by='Delta', ascending=True, inplace=True)

        return df_all
    
    elif (q_type == 'multi') & (d_set == 'employed'):
        
        # series object for non-binary - this one uses nb_employ data
        df_nb = nb_employ[[i for i in nb_employ.columns if questionref in i]]
        df_nb_all = pd.Series(dtype='int')

        for i in df_nb.columns:
            try:
                df_nb_all[df_nb[i].value_counts().index[0]] = df_nb[i].count()
            except:
                df_nb_all['None'] = 0

        # series object for binary people - uses theb_employ to match nb_employ above
        df_b = theb_employ[[i for i in theb_employ.columns if questionref in i]]

        df_b_all = pd.Series(dtype='int')

        for i in df_b.columns:
            try:
                df_b_all[df_b[i].value_counts().index[0]] = df_b[i].count()
            except:
                df_b_all['None'] = 0

        # create the dataframes
        df_b_all = pd.DataFrame(df_b_all)
        df_nb_all = pd.DataFrame(df_nb_all)
        df_b_all.rename(columns={0:'B'}, inplace=True)
        df_nb_all.rename(columns={0:'NB'}, inplace=True)

        # concat frames into one for analysis
        # separate dfs also exist
        frames=[df_b_all, df_nb_all]
        df_all = pd.concat(frames, axis=1)
        df_all.fillna(0, inplace=True)

        def to_percentage_b(x):
            return (x/df_all.B.sum())*100

        def to_percentage_nb(x):
            return (x/df_all.NB.sum())*100

        # add columns for %s and delta
        df_all['B%']=df_all['B'].apply(to_percentage_b)
        
        if (df_all['NB'].sum() >0):
            df_all['NB%']=df_all['NB'].apply(to_percentage_nb)
            
        else:
            df_all['NB%']=0
            
        df_all['Delta']=df_all['B%']-df_all['NB%']
        
        # sort the df by the lowest delta values - indicating areas with more nb ppl
        df_all.sort_values(by='Delta', ascending=True, inplace=True)

        return df_all
        
    
    elif (q_type == 'single') & (d_set == 'all'):
        
        frames=[theb[[i for i in theb.columns if questionref == i]].value_counts(),\
        nb[[i for i in nb.columns if questionref == i]].value_counts()]

        df = pd.concat(frames, axis=1).rename(columns={0:'B', 1:'NB'})
        df.fillna(0, inplace=True)
        
        def to_percentage_b(x):
            return (x/df.B.sum())*100

        def to_percentage_nb(x):
            return (x/df.NB.sum())*100
        
        # add columns for %s and delta
        df['B%']=df['B'].apply(to_percentage_b)
        
        if (df['NB'].sum() >0):
            df['NB%']=df['NB'].apply(to_percentage_nb)
        else:
            df['NB%']=0
            
        df['Delta']=df['B%']-df['NB%']
        
        # sort the df by the lowest delta values - indicating areas with more nb ppl
        df.sort_values(by='Delta', ascending=True, inplace=True)
        
        return df
    
    elif (q_type == 'single') & (d_set == 'employed'):
        
        # creates frames using employ data
        frames=[theb_employ[[i for i in theb_employ.columns if questionref == i]].value_counts(),\
        nb_employ[[i for i in nb_employ.columns if questionref == i]].value_counts()]

        df = pd.concat(frames, axis=1).rename(columns={0:'B', 1:'NB'})
        df.fillna(0, inplace=True)
        
        def to_percentage_b(x):
            return (x/df.B.sum())*100

        def to_percentage_nb(x):
            return (x/df.NB.sum())*100
        
        # add columns for %s and delta
        df['B%']=df['B'].apply(to_percentage_b)
        
        if (df['NB'].sum() >0):
            df['NB%']=df['NB'].apply(to_percentage_nb)
        else:
            df['NB%']=0
            
        df['Delta']=df['B%']-df['NB%']
        
        # sort the df by the lowest delta values - indicating areas with more nb ppl
        df.sort_values(by='Delta', ascending=True, inplace=True)
        
        return df
        
    else:
        print('invalid q_type or d_set')
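For a single-answer question, the percentage and delta columns that `create_data_frame` builds can be reproduced in a few lines, which may help when reading the function. This is a sketch on invented data using `pd.crosstab`, not a drop-in replacement; the `group` column and the toy answers are assumptions.

```python
import pandas as pd

# Invented respondents: four binary (B) and two nonbinary (NB)
toy = pd.DataFrame({
    'group': ['B', 'B', 'B', 'B', 'NB', 'NB'],
    'Q5': ['Student', 'Data Scientist', 'Student', 'Other',
           'Student', 'Data Scientist'],
})

# Share of each answer within each group, as percentages
shares = pd.crosstab(toy['Q5'], toy['group'], normalize='columns') * 100
shares = shares.rename(columns={'B': 'B%', 'NB': 'NB%'})

# Negative delta: the answer is relatively more common among nonbinary respondents
shares['Delta'] = shares['B%'] - shares['NB%']
print(shares.sort_values('Delta'))
```

Normalising within each group is what makes the two columns comparable despite the very different group sizes.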

## 3 Visualisations of Key Demographics

### **Q1 - What is your age (# years)?**

In [None]:
age = create_data_frame('Q1', q_type='single', d_set='all')
age

In [None]:
age.sort_values(by='Q1').drop(['B', 'NB', 'Delta'], axis=1).plot()

We see that in general, the ages align, but nonbinary people are slightly younger.

***

### **Q3 - In which country do you currently reside?**

In [None]:
create_data_frame('Q3', q_type='single', d_set='all')

Interestingly, the largest delta is in the US! To see this data a bit more clearly though, let's use Plotly to create a choropleth map that uses ISO codes (e.g. GBR for the UK).

In [None]:
# getting the iso codes
ISOdf = pd.read_excel('/kaggle/input/isocodes2/ISOalpha3codes.xlsx')
ISOdf.rename(columns={'Country':'country'}, inplace=True)

In [None]:
# reorganising and adding the iso codes onto our df
nb_locs = pd.DataFrame(nb.Q3.value_counts()).reset_index()
nb_locs.rename(columns={'index':'country', 'Q3':'count'}, inplace=True)
nb_locs = nb_locs[nb_locs.country != 'Other']
nb_locs=ISOdf.merge(nb_locs, on='country')
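A plain merge keeps only countries whose names match exactly in both tables, so a spelling difference would silently drop a country from the map. A quick way to check is a left merge with `indicator=True`. The frames below are invented stand-ins for `ISOdf` and the country counts, and the mismatch shown is hypothetical.

```python
import pandas as pd

# Invented lookup table and counts; the real files may use different spellings
iso = pd.DataFrame({
    'country': ['India', 'Brazil', 'United States'],
    'Code': ['IND', 'BRA', 'USA'],
})
counts = pd.DataFrame({
    'country': ['India', 'United States of America', 'Brazil'],
    'count': [3, 10, 2],
})

# Keep every counted country and record whether a code was found for it
checked = counts.merge(iso, on='country', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)
```

Any names that appear in `unmatched` can then be renamed in one of the tables before merging for the map.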

In [None]:
# visualise location of non binary people!
import plotly.express as px

fig = px.choropleth(nb_locs, locations="Code",
                    color="count", # colour each country by its respondent count
                    hover_name="country", # column to add to hover information
                    #color_continuous_scale=px.colors.qualitative.G10)
                    color_continuous_scale=px.colors.sequential.Redor)
fig.show()

This is a cool map that can show us where the nonbinary people are! Let's do this for the binary people too and compare:

In [None]:
theb_locs = pd.DataFrame(theb.Q3.value_counts()).reset_index()
theb_locs.rename(columns={'index':'country', 'Q3':'count'}, inplace=True)
theb_locs = theb_locs[theb_locs.country != 'Other']
theb_locs=ISOdf.merge(theb_locs, on='country')

In [None]:
# visualise location of binary people!

fig = px.choropleth(theb_locs, locations="Code",
                    color="count", # colour each country by its respondent count
                    hover_name="country", # column to add to hover information
                    #color_continuous_scale=px.colors.qualitative.G10)
                    color_continuous_scale=px.colors.sequential.Redor)
fig.show()

The maps are quite different and point to some important issues of diversity and culture.

***

### **Q4 - What is the highest level of formal education obtained?**

In this section, only those who indicated that they were employed are included, since students could sway the results with education not yet obtained!

In [None]:
education = create_data_frame('Q4', q_type='single', d_set='employed')
education

In [None]:
eduList = []
for item in education.index:
    item = str(item)
    item = item.strip('(')
    item = item.strip(')')
    item = item.strip(',')
    item = item.strip('\'')
    eduList.append(item)
    
x1 = education['B%']
x2 = education['NB%']

In [None]:
# plain Python lists of the percentages, in the same order as eduList
x11 = list(x1)
x22 = list(x2)

In [None]:
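# orange is nb people, blue is binary
fig, ax = plt.subplots()

ax.bar(eduList, x11, alpha=0.2, label='Binary')
ax.bar(eduList, x22, alpha=0.5, label='Nonbinary')

# rotate the existing tick labels rather than re-setting them
ax.tick_params(axis='x', labelrotation=90)
ax.legend()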

There is an interesting finding here: nonbinary people tend to hold more master's and doctoral degrees than bachelor's degrees.

***

### **Q5 - Select your current title**

In [None]:
titles = create_data_frame('Q5', q_type='single', d_set='all')
titles

In [None]:
titles.drop(['B', 'NB', 'Delta'], axis=1).plot(kind='bar')

Nonbinary respondents include a noticeably larger share of students and a slightly larger share of data scientists, and are spread less widely over the other roles.

***

### **Q23 - What activities take up an important role in your work?**

In [None]:
# activities that take up important role in your work
create_data_frame('Q23', q_type='multi', d_set='employed')

We see that there isn't a large delta present, and B and NB people tend to carry out similar activities.

***

### **Q6 - For how many years have you been writing code?**

In [None]:
# how long been coding
coding = create_data_frame('Q6', q_type='single', d_set='all')
coding

In [None]:
# quick and dirty way to sort Q6 years
coding['sort']=[5,20,10,0,0.1,1,3]

In [None]:
coding.sort_values(by='sort').drop(['B', 'NB', 'Delta', 'sort'], axis=1).plot(kind='bar')

For binary people we see a slightly skewed normal distribution, whereas for nonbinary the distribution is centred on 5-10 years. Let's see how this compares with salary and skip to Q24 next.
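The hard-coded `sort` list used above only works if the rows come back in exactly that order. A less fragile option is to derive the sort value from the label text. The sketch below is pure Python; the helper name `bucket_sort_key` is made up, and the labels are assumed to resemble the survey wording.

```python
import re

def bucket_sort_key(label):
    """Return a number that orders range-style answer labels."""
    numbers = re.findall(r'\d+', label.replace(',', ''))
    if not numbers:
        return -1.0          # e.g. "I have never written code"
    first = float(numbers[0])
    if label.strip().startswith('<'):
        return first - 0.5   # "< 1 years" sorts before "1-2 years"
    return first

# Assumed labels, deliberately out of order
labels = ['5-10 years', '20+ years', '10-20 years', 'I have never written code',
          '< 1 years', '1-2 years', '3-5 years']
print(sorted(labels, key=bucket_sort_key))
```

For the frames in this notebook, whose index entries are one-element tuples, the key would need to be applied to the first element of each tuple.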

***

### **Q24 - What is your yearly salary?**

In [None]:
salary = create_data_frame('Q24', q_type='single', d_set='employed')

In [None]:
salary['sort']=[125, 200, 60, 300, 3, 80, 2, 150, 70, 30, 100, 1, 500, 250, 10, 90, 4, 25, 7, 20, 15, 5, 50, 40, 0]

In [None]:
salary.sort_values(by='sort').drop(['B', 'NB', 'Delta', 'sort'], axis=1).plot(kind='bar')

In [None]:
salary.sort_values(by='sort').drop(['B', 'NB', 'Delta', 'sort'], axis=1).plot(kind='kde')

We see that there is a larger disparity towards the high end of the salary scale with NB people earning more. Let's see if this has any correlation with the size of the company in Q20.

***

### **Q20 - How many employees are in your company?**

In [None]:
sizecomp = create_data_frame('Q20', q_type='single', d_set='employed')

In [None]:
sizecomp['sort']=[250, 0, 50, 1000, 10000]

In [None]:
sizecomp.sort_values(by='sort').drop(['B','NB','Delta','sort'], axis=1).plot(kind='bar')

Interestingly the majority of both binary and nonbinary people seem to work in smaller companies with only 0-49 employees.

***

## 4 Technologies

In the final section we'll explore what the differences are in technologies such as programming languages, data visualisation libraries, machine learning algorithms etc.

The differences here are important - we need diverse teams to approach problems from all angles!

***

### **Q7 - What programming languages do you use?**

In [None]:
langs = create_data_frame('Q7', q_type='multi', d_set='all')

In [None]:
langs.drop(['B', 'NB', 'Delta'], axis=1).plot(kind='bar')

In [None]:
langs.drop(['B', 'NB', 'B%', 'NB%'], axis=1).plot(kind='bar')

There are lots of similarities in languages; the largest differences are in Java (B) and Bash (NB).

***

### **Q8 - What programming languages would you recommend?**

In [None]:
langrec = create_data_frame('Q8', q_type='single', d_set='all')

In [None]:
langrec.drop(['B', 'NB', 'Delta'], axis=1).plot(kind='bar')

Overwhelmingly, everyone recommends Python! No surprises there though...

***

### **Q14 - What data viz libraries do you use?**

In [None]:
dataviz = create_data_frame('Q14', q_type='multi', d_set='all')
dataviz

The percentages look largely in line, so let's just consider the delta:

In [None]:
dataviz.drop(['B', 'NB', 'B%', 'NB%'], axis=1).plot(kind='bar')

It seems that binary people are more inclined to use libraries like Seaborn, Matplotlib and D3.js, whereas nonbinary people may use the lesser-known libraries Bokeh, Folium and Shiny.

***

### **Q17 - ML Algorithms used on a regular basis**

In [None]:
ml = create_data_frame('Q17', q_type='multi', d_set='all')
ml

In [None]:
ml.drop(['B%', 'NB%', 'B', 'NB'], axis=1).plot(kind='bar')

The approach to solving a problem with ML is an important one. Nonbinary people appear to prefer the less mainstream Transformer networks, whereas binary people lean towards decision trees.

***

### **Q39 - Fave media sources for data science**

In [None]:
media=create_data_frame('Q39', q_type='multi', d_set='all')
media

In [None]:
# let's shorten the media types, they're too long for a graph
def shorten_type(x):
    if x.count('(') == 0:
        return x
    else:
        x=x[:x.find('(')]
        return x

In [None]:
# apply the function
media['type']=media.index
media['type']=media['type'].apply(shorten_type)

In [None]:
# then reassign the index
media.index=media['type']

In [None]:
media.drop(['B', 'NB', 'Delta', 'type'], axis=1).plot(kind='bar')

In [None]:
media.drop(['B%', 'NB%', 'B', 'NB', 'type'], axis=1).plot(kind='bar')

Nonbinary people seem to prefer Reddit and Journals, whereas Binary people favour visual and interactive mediums like Youtube and Kaggle.

***

## 5 Conclusion

This notebook has illustrated the experience that nonbinary people have within the data science industry, as reflected in the survey. As demonstrated, they have a different outlook, working with different technologies, languages and algorithms. A diverse team matters in any industry, especially when the problems we're trying to solve affect us all.
