# Academia and Industry — a story told in bars 🍷

I am a PhD student coming towards the end of my degree. That means that I am once again faced with an important decision: where to from here? 

I really like academia, so I was quite determined to do a postdoc, but an initial search left me very disappointed. Most of the programs out there are very... applied, whereas I am studying the theory of generalisation. I then started wondering what caused my disappointment, and I realised that the professional isolation of the pandemic had pushed me even deeper into a bubble; a bubble of theoretical research where I started losing the notion of application. Sure, all my research was conducted with its contributions to the community in mind. But I was treating the community as a rather abstract and distant notion instead of a vibrant and constantly changing entity. So I wanted to find out: who are the people applying ML? Does a real disconnect between researchers and practitioners exist and, if so, how does it affect the two? Setting out to answer these questions has changed my perspective and made me aware both of issues I knew deep down existed but hadn't taken the time to reflect on, and of biases I was ignorant of. In this notebook I'll walk you through a condensed version of my quest and share with you the most important things I've learned.

So. *How relevant is academia for the Machine Learning industry and vice versa?* This is the question we’ll try to explore today through the eyes of the Kaggle community.

I will focus on the story, using mostly bar plots to illustrate the points made. We start by trying to understand who is using which technologies. Is the latest academic research relevant to the field, or is industry leading? In the quest to answer these questions we discover some less evident truths hidden in this data set. I hope this story will remind the academic community that it is important to be more aware of who is using their research, while encouraging Kaggle and the industry as a whole to value academic work more.

Here are the insights you'll gain from reading this, along with my associated suggestions:
* By far the most used algorithms by the Kaggle community are classic ones. I believe Kaggle could 1. propose challenges that implicitly entail a larger variety of algorithms and 2. encourage integration of the latest research; 
* Biases in the data collection process shape the stories that emerge from the data. A community-aware methodology could lead to more insightful stories;
* Researchers from both academia and industry might be too focused on state-of-the-art results. Reporting results in non-automated settings as well could be more relevant for the wider community.

We start with some general statistics. This is for us to set the context and get familiar with the broad pool of respondents. As I have said and will reemphasise throughout the notebook, the answers are not necessarily representative of the broader population so we have to first understand whom this data talks about.

We’ll skip over the graphs to keep things short, but have a look for yourself if you want to. Overall, we have an overwhelming number of students, yet only 18.79% of the respondents took an ML University Course. Most of the respondents have little ML experience, but we are largely dealing with people who hold or will hold a Master’s or Bachelor's degree. Nothing surprising so far.
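
A quick note on how percentages like these are obtained. Multi-select questions come as one column per answer option, where a cell holds the option text if the respondent ticked it and is empty otherwise. A minimal sketch on made-up data (the column names and values below are illustrative and not taken from the survey file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a multi-select question: one column per option,
# holding the option text when selected and NaN otherwise (made-up data).
toy = pd.DataFrame({
    "Q_Part_1": ["Coursera", np.nan, "Coursera", np.nan],
    "Q_Part_2": [np.nan, np.nan, "University Courses", np.nan],
})

# count() ignores NaN, so count / number of rows is the share of
# respondents who selected each option.
share = toy.count() / len(toy)

# Use the option text as the label instead of the column name.
share.index = [toy[col].dropna().unique()[0] for col in toy.columns]
print(share)
```

Dividing by the number of rows means respondents who skipped the question altogether still count in the denominator, which is the convention the cells below follow.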

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import textwrap


# load data and get rid of the question row
df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', low_memory=False)
df = df.iloc[1: , :]

### plotting occupations as percentage
df_Q5 = df['Q5'].value_counts()
ax = sns.barplot(x = df_Q5.values/df_Q5.sum(), y = df_Q5.index)
plt.xlim(0,1)
plt.ylabel("occupations")
plt.xlabel("percentage")
plt.show()

### plotting learning platform usage as percentage
Q_40_Columns = [x for x in df.columns if 'Q40_Part' in x]
df_Q40 = df[Q_40_Columns]
learner = [[df_Q40[column].dropna().unique()[0], df_Q40[column].count()/df_Q40.shape[0]] for column in Q_40_Columns]
learner = pd.DataFrame(learner, columns =['platforms', 'percentage'])
ax = sns.barplot(x = 'percentage', y = 'platforms', data = learner)
ax.set_yticklabels([textwrap.fill(e, 60) for e in learner['platforms']])
plt.xlim(0,1)
plt.show()

### plotting years of ML experience as percentage
df_15 = df['Q15'].value_counts()
sns.barplot(x = df_15.values/df.shape[0], y=df_15.index)
plt.xlim(0,1)
plt.ylabel("ML experience")
plt.xlabel("percentage")
plt.show()


### plotting level of formal education as percentage
df_4 = df['Q4'].value_counts()
sns.barplot(x = df_4.values/df.shape[0], y=df_4.index)
plt.xlim(0,1)
plt.ylabel("education level")
plt.xlabel("percentage")
plt.show()

Let’s have a look at which algorithms Kagglers use.

In [None]:
### plotting algorithm usage

df_Q17 = df.filter(like='Q17')

# create data frame with (algorithm name, associated number of respondees who use it)
Q_17_Columns = [x for x in df.columns if 'Q17' in x]
algos_usage = [[df_Q17[column].dropna().unique()[0], df_Q17[column].count()] for column in Q_17_Columns]
algos_usage = pd.DataFrame(algos_usage, columns =['algorithm', 'raw usage count'])
ax = sns.barplot(x = 'raw usage count', y = 'algorithm', data = algos_usage)
plt.show()

Linear/Logistic Regression, Decision Trees/Random Forests, and Gradient Boosting Machines are great but... what about something more challenging? Perhaps the success of these classical but simple methods is caused by the respondents being mostly ML beginners?

In [None]:
regressors = df[df[Q_17_Columns[0]] == 'Linear or Logistic Regression']

# normalise each experience category by the total number of respondents with that experience
regressors_experience = regressors['Q15'].value_counts().div(df['Q15'].value_counts())
sns.barplot(x = regressors_experience.values, y=regressors_experience.index)
plt.xlim(0,1)
plt.ylabel("ML experience")
plt.xlabel("percentage")
plt.show()

Well, not really. I plotted for us the usage of Linear/Logistic Regression per ML experience category. We can observe that more experienced Kagglers prefer these methods as much as beginners.
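
The normalisation in the cell above relies on pandas aligning the two `value_counts()` results by index label before dividing. A small sketch with made-up answers (the toy columns `experience` and `uses_regression` are hypothetical):

```python
import pandas as pd

# Made-up respondents: experience bucket and whether they use regression.
toy = pd.DataFrame({
    "experience": ["Under 1 year", "Under 1 year", "Under 1 year",
                   "1-2 years", "1-2 years", "5-10 years"],
    "uses_regression": [True, True, False, True, False, True],
})

users = toy[toy["uses_regression"]]

# Both Series are indexed by experience bucket; div() matches the buckets
# by label, so each count is divided by the size of its own bucket.
rate = users["experience"].value_counts().div(toy["experience"].value_counts())
print(rate)
```

A bucket with no users at all would come out as NaN rather than 0, which is worth keeping in mind for rarely chosen algorithms.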

Perhaps the number of Kaggle competitions for which one would use GANs or Transformers is not so big. But at the same time, a very small population declared they use any other algorithms apart from the few that were listed.
Is research then disconnected from the real-world use of algorithms? Sure, achieving incremental improvements on the same 5 established data sets is fun, but shouldn’t we try to demonstrate our algorithms are better by *showing they are successful on more realistic data sets? Kaggle is a great source of freely available data, which researchers should exploit more*. On the other hand, the extensive usage of classical algorithms amongst Kagglers could point to a possible area of improvement. I believe Kaggle challenges can encourage competitors to use newer technologies through both the nature of the challenges and the way solutions are evaluated. Regarding the latter, I believe *Kaggle could actively support the community to discover new research by rewarding those who incorporate it in their solutions*.

It is important to remember that this data talks about *algorithm* usage and does not reflect the overall engagement of the Kaggle community with novel research. It is in this area that the data tells us we could improve the cooperation between researchers and appliers.

You may ask: why should we try new technologies if old ones work just fine? I think that's how we advance as a field. Undoubtedly, every new data set comes with its own challenges, and mastering the application of classical methods to new problems is very important. But I think the Kaggle community is a hive mind that has great potential to broaden its collective knowledge by exploring newly proposed algorithms, rather than going for the "safe bet". Sure, while some methods will significantly improve results, others will not work so well in real-world scenarios. But it is important that we experiment with them and communicate what works and what doesn't. This would in turn provide "feedback" to researchers, who can get a reality check and improve their methods so that they work better when applied in practice. A win-win situation, I'd say.

So we've seen that in terms of algorithms, the community has clear preferences.
While most Kagglers use at least one of the algorithms listed, automated methods don't seem to be anywhere near that popular.

In [None]:
### plotting automated tools usage as percentage
Q_36_A_Columns = [x for x in df.columns if 'Q36_A' in x]
df_36_A = df[Q_36_A_Columns]
auto_usage = [[df_36_A[column].dropna().unique()[0], df_36_A[column].count()/df_36_A.shape[0]] for column in Q_36_A_Columns]
auto_usage = pd.DataFrame(auto_usage, columns =['automation tool', 'usage (%)'])
ax = sns.barplot(x = 'usage (%)', y = 'automation tool', data = auto_usage)
ax.set_yticklabels([textwrap.fill(e, 60) for e in auto_usage['automation tool']])
plt.xlim(0,1)
plt.show()

Let's get to know this subpopulation of automation users a bit better. Who are they?
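
The next cell collapses the Q36_A columns into a single "uses at least one automation tool" flag. An equivalent and perhaps more readable way to build such a flag is sketched here on made-up data (the column names are placeholders). Note that any approach based on non-empty cells treats every answer as usage, so an option such as "None", if the question has one, would need to be dropped first.

```python
import numpy as np
import pandas as pd

# Made-up multi-select answers; "none" is a hypothetical opt-out option.
toy = pd.DataFrame({
    "tool_a": ["Tool A", np.nan, np.nan],
    "tool_b": [np.nan, np.nan, "Tool B"],
    "none":   [np.nan, "None", np.nan],
})

# Drop the hypothetical opt-out column, then flag rows with any remaining answer.
tool_columns = [c for c in toy.columns if c != "none"]
uses_automation = toy[tool_columns].notna().any(axis=1).astype(int)
print(uses_automation.tolist())
```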

In [None]:
# for each entry, make the last column 1 if at least one automation method is used and 0 otherwise
aggregated_automation = df.filter(like='Q36_A').ffill(axis=1).iloc[:,-1].notnull().astype('int')

# join occupation and automation usage and select only those respondees that use automation
automators_occupation = df.filter(like='Q5').join(aggregated_automation)
automators_occupation = automators_occupation.loc[automators_occupation['Q36_A_OTHER'] == 1]

# plot raw number of automation users per job title
automators_occupation = automators_occupation.value_counts()
sns.barplot(x = automators_occupation.values, y = [x[0] for x in automators_occupation.keys()])
plt.xlabel("automation usage count")
plt.ylabel("occupation")
plt.show()

Remember that our ultimate goal is to understand the relationship between research and industry. Let's look at the above plot from this perspective.
Even if we consider the ML industry to be ML Engineers and Data Scientists alone, they seem to use twice as much automation as Research Scientists. But is it safe to say Kaggle researchers don’t make that much use of automated tools? What if some of the Engineers and Data Scientists actually do research as part of their job? So instead of looking at the declared occupation, let’s have a look at the activities that make up an important part of respondents’ role at work. We are particularly interested in three categories. Those who:
1. Do research that advances the state of the art of machine learning (researchers);
2. Are focused on experimentation and iteration to improve existing ML models (improvers);
3. Build prototypes to explore applying machine learning to new areas (prototypers).

In [None]:
# join automation usage with the three role columns
automators_role = df[['Q24_Part_3', 'Q24_Part_5', 'Q24_Part_6']].join(aggregated_automation)

# plot percentage of automators usage per role
automators_role = automators_role.loc[automators_role['Q36_A_OTHER'] == 1]
ydata = automators_role.filter(like='Q24').count().values/df[['Q24_Part_3', 'Q24_Part_5', 'Q24_Part_6']].count().values
labels = ['Prototypers', 'Improvers', 'Researchers']
sns.barplot(x= labels, y = ydata)
plt.ylim(0,1)
plt.ylabel("automation users (%)")
plt.xlabel("role at work")
plt.show()

We conclude that the percentage of researchers and improvers who use any of the automated methods is mostly the same, and only slightly higher than that of prototypers. Still... is this the real picture? There is an important subpopulation of the research community that is excluded from this plot. The study implicitly assumes that students are not professionals and thus asks what tools they are hoping to get familiar with in the next two years. So the picture of automation tools usage in academia versus industry is incomplete in the absence of data from PhD students, who make up an important part of research.

So who are the PhD students? How can we identify them? I think some of them are people who have or will have a Doctoral degree in the next two years but whose occupation is still “Student”. Surely, some PhD students could opt for “Research Scientist” as an occupation instead but I, a PhD student, usually identify as “Student”. But let’s see if it is just me or if this subpopulation actually exists.

In [None]:
phd = df[(df['Q4']=='Doctoral degree') & (df['Q5']=='Student')]
phd

Voila! We have 294 doctoral students amongst the respondents. So let’s have a look at what they are hoping to use automation for by first looking at what doctoral students do in general. Before exploring the data we can envisage two scenarios:
1. They are doing a PhD in an ML-related area, trying to push the state of the art;
2. They are doing a PhD in a different discipline and are trying to use ML to make sense of their data. 

In [None]:
Q_24_Columns = [x for x in df.columns if 'Q24_Part' in x]

# create data frame with (activity name, associated number of respondees -- normalised)
activity_count = [[df[column].dropna().unique()[0], phd[column].count()/phd.shape[0]] for column in Q_24_Columns]
activity_count = pd.DataFrame(activity_count, columns =['work activities', 'percentage'])

# plot activity count for phd students
ax = sns.barplot(x = 'percentage', y = 'work activities', data = activity_count)
ax.set_yticklabels([textwrap.fill(e, 60) for e in activity_count['work activities']])
plt.xlim(0,1)
plt.show()

Wait... did we just get an empty plot?

It looks like this quest brought to light a different but important problem. To the question “Select any activities that make up an important part of your role at work” none of the PhD students replied, although “Do research that advances the state of the art of machine learning” would definitely suit the first category, while “Build prototypes to explore applying machine learning to new areas” fits the second.

The first thought was that there must be an error in the data collection process. Perhaps students were not asked the work-related questions. I quickly went and checked the ‘kaggle_survey_2021_answer_choices.pdf’. The purpose of this document is for us to see the answer choices and which questions were asked to which respondents. I found to my surprise that it was not stated anywhere that students could not answer these questions.

So I thought: PhD students don’t consider their work as... well... work? There could be an implicit bias in the data caused by the order in which questions were answered. This could have caused the students to believe the question was specifically addressed to people who are employed in industry. However, none of the 294 PhD students felt the question included them. Absolute zero is suspicious, but that’s what the data shows. I suddenly remembered that I overheard my dad saying to his friends that by doing a PhD I was prolonging my childhood. Or my friends who, although they knew I was halfway through my PhD, asked me if I got a job when I said I started working from home. Or my brother who keeps pushing me to drop my PhD and “get a real job”. Sure, the (lack of) answers could’ve been influenced by the tone of the previous questions. But perhaps there is something else at play. I triple-checked the “which questions were asked to whom” file just to be sure.

A few days later the following sentences in the methodology file caught my attention: “Respondents with the most experience were asked the most questions. For example, students and unemployed persons were not asked questions about their employer”. Right. So that probably means they were not asked questions about their work at all, not just the ones about the employer per se. This conflicts with the “answer choices” file, but it's the most likely scenario.
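
One quick way to test this hypothesis is to count, per occupation, how many respondents gave any answer to the work-activity question. An occupation with zero answers was most likely never shown the question. A sketch on made-up data (the columns `occupation`, `activity_1` and `activity_2` are hypothetical stand-ins for Q5 and the Q24 parts):

```python
import numpy as np
import pandas as pd

# Made-up respondents: students have no activity answers at all.
toy = pd.DataFrame({
    "occupation": ["Student", "Student", "Data Scientist", "Research Scientist"],
    "activity_1": [np.nan, np.nan, "Analyze data", np.nan],
    "activity_2": [np.nan, np.nan, np.nan, "Do research"],
})

activity_columns = ["activity_1", "activity_2"]

# True for respondents who ticked at least one activity.
answered = toy[activity_columns].notna().any(axis=1)

# Number of respondents per occupation with at least one answer.
answered_per_occupation = answered.groupby(toy["occupation"]).sum()
print(answered_per_occupation)
```

Run on the real survey columns, a zero for every student and unemployed respondent would point to question routing rather than to self-perception.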

Ultimately, it is irrelevant. Regardless of which is the truth, we draw two important insights:

1. Data does not only speak about its population, but also about its collectors. What are the things we as data collectors are interested in? What are our biases? How do they affect the story our data tells?
2. Being a PhD student means you’re neither here nor there. Yes, you can have a paid 9 to 5 contract, like any other full-time worker, but it’s quite possible that others don’t consider that you are actually “working”. I’m looking forward to a study of how this affects the self-perception of PhD students and whether there is any link with the reported poor mental health. Until then, feast yourself with some memes on the subject.


So where were we?
Automation... yes, right.
We wanted to understand the usage of automation tools of the entire community of ML researchers (including ML PhDs). 
Unfortunately, because we don’t have the activities that make up doctoral students' work, we cannot distinguish between PhD students who simply want to apply ML to their areas of study and ML PhD students (those who research ML subjects). This is important since we expect their answers to differ quite significantly.

(CAVEAT. STRONG ASSUMPTION INCOMING) Let’s see if we can identify them by the number of years of experience in ML. That means we are implicitly assuming ML PhD students are more likely to have some ML background. The caveat is that you can do a PhD in ML if you have a different background e.g. Maths or Physics, but let’s add this assumption to keep things simple.

In [None]:
### Plotting ML experience of phd students
phd_Q15 = phd['Q15'].value_counts()
ax = sns.barplot(x = phd_Q15.values/phd_Q15.sum(), y = phd_Q15.index)
plt.xlim(0,1)
plt.ylabel("ML experience")
plt.xlabel("percentage")
plt.show()

We know that the students will get their doctoral degree in the upcoming 2 years. So for someone who’s doing a 3-year long ML PhD, they must’ve been studying ML for at least over a year.
We could try to infer who the ML students are by looking at the minimum PhD length in the countries of our PhD students but, as we researchers say when we are too lazy, it’s beyond the scope of this study and we leave it as future work.
We’re just gonna say that having just started doing ML close to your penultimate years is a sign that ML is not the subject of your PhD so you must be using it to aid the science you’ve been doing so far. 1-2 years of ML experience is a blurry line, so we will leave out all of the students who belong to that category. We’ll consider everything above 2 years a sign that the PhD is in ML. Let’s segregate our students according to this rule. We’ll call ML PhD students “MLers” quite naturally, while all other PhD students will be called “appliers” for lack of better words.

In [None]:
appliers = phd[phd['Q15']=='Under 1 year']
ml_experience = ['2-3 years', '3-4 years', '4-5 years', '5-10 years']
mlers = phd[phd['Q15'].isin(ml_experience)]

mlers.shape[0], appliers.shape[0]

Sweet. That’s 85 MLers and 81 appliers.

In [None]:
automation_mlers = mlers.filter(like='Q36_B')
automation_appliers = appliers.filter(like='Q36_B')

# get automation method names
automation_methods = [df[x].dropna().unique()[0] for x in df.columns if 'Q36_B' in x]

# plot automation usage
# count() gives the number of non-null answers per column, i.e. the respondents interested in each method
ax = sns.barplot(y=automation_methods, x=automation_mlers.count().values/mlers.shape[0], label="MLers", alpha=0.5, color='blue')
ax = sns.barplot(y=automation_methods, x=automation_appliers.count().values/appliers.shape[0], label="appliers", alpha=0.5, color='red')
plt.legend()
plt.ylabel("automation method")
plt.xlabel("percentage of interested respondents")
plt.xlim(0,1)
plt.show()

It looks like MLers tend to be more interested in learning about automated methods than appliers (notice that the bars are overlapped). Once again, this result stands on strong assumptions and only reflects the reality of people who participated in the survey. But do we get to learn anything from this?
Perhaps industry and academic research are too focused on sota results, whereas real-world ML is more interested in having a simple solution that works rather than finding the solution that has a 0.01% higher accuracy than the rest. If we want our research to be relevant for this community, we should also report results in settings that are more in line with theirs. This is not at all to say we shouldn’t seek to push the sota or that we shouldn’t use automated tools to get the best results. But maybe when proposing a new algorithm it would be worth reporting on how it compares to other methods in unoptimised settings as well.
An important observation here is that we considered ML PhD students to be the ones with more experience. What if our conclusion was drawn hastily and ML experience is actually what makes the difference? Let's apply the same experience-wise separation to all the respondents of the question.

In [None]:
Q_36_Columns = [x for x in df.columns if 'Q36_B' in x]
role_experience = df[['Q15'] + Q_36_Columns]

# segregate according to experience
novices = role_experience[role_experience['Q15'] == 'Under 1 year']
experienced = role_experience[role_experience['Q15'].isin(['2-3 years', '3-4 years', '4-5 years', '5-10 years'])]

# plot interest in automation according to experience
ax = sns.barplot(x=novices.count()[Q_36_Columns]/novices.shape[0], y=automation_methods, color='blue', alpha=0.5, label="novice")
ax = sns.barplot(x=experienced.count()[Q_36_Columns]/experienced.shape[0], y=automation_methods, color='red', alpha=0.5, label="experienced" )
plt.legend()
plt.xlim(0,1)
plt.ylabel("automation method")
plt.xlabel("percentage of interested respondents")
plt.show()

Lo and behold! If anything, they are *negatively* correlated. This means that generally the trend is to be less interested in automation as you gain experience. In terms of our insights — they remain valid: Automation is great, but if there are so many ML enthusiasts who do not use it, shouldn’t we report results that are realistic for them as well?
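
Comparing only two groups hides what happens in between. A possible refinement, sketched on made-up data, is to compute the share of interested respondents for every experience bucket in order (the `interested` flag and the toy values are assumptions; on the survey it would have to be derived from the Q36_B columns):

```python
import pandas as pd

# Made-up respondents: experience bucket and interest in automation.
toy = pd.DataFrame({
    "experience": ["Under 1 year", "Under 1 year", "1-2 years", "1-2 years",
                   "2-3 years", "2-3 years"],
    "interested": [True, True, True, False, False, False],
})

# Order the buckets explicitly; alphabetical order would scramble them.
bucket_order = ["Under 1 year", "1-2 years", "2-3 years"]
toy["experience"] = pd.Categorical(toy["experience"], categories=bucket_order, ordered=True)

# The mean of a boolean column is the share of True values in each bucket.
interest_by_experience = toy.groupby("experience", observed=False)["interested"].mean()
print(interest_by_experience)
```

A steadily falling share across the ordered buckets would support the "less interested with experience" reading more strongly than a two-group comparison does.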

There are many other things we could talk about, such as the wider variety of algorithms researchers tend to use or how taking a university ML course impacts the algorithms you choose. But to keep the story short and effective, I decided to present only what I considered to be the most important insights.

So let’s do a quick recap: the Kaggle community enjoys simplicity in all its beauty. But both appliers and researchers have something to gain by considering each other a bit more. I believe appliers should actively seek to integrate novel techniques in their work. This will push research to be better, which will once again help appliers and so on. That also means researchers should ensure their studies and results are more in line with appliers' needs. I believe Kaggle can help researchers and appliers work together and I suggested two ways of doing this. Lastly, the subject we looked at today is far greater than I could cover. By making the community aware of the problems we identified today and integrating more subpopulation-aware questions in the upcoming surveys, I believe great further insights can be gained in the future.

Before wrapping up, I want to point out one more thing. Omitting it would be hypocritical. Remember when I said “What if some of the Engineers and Data Scientists actually do research as part of their job?”? I only got to this question because the data wasn’t making sense. According to my narrow perspective on the world, ML Engineers... “engineer” which cannot include research. Or at least research that pushes the ML field further. Sure, they could be applying it to research questions in science but they don't do actual ML research.

In [None]:
# get the machine learning engineers that are doing ML research
df[(df['Q5']=='Machine Learning Engineer') & (df['Q24_Part_6']=='Do research that advances the state of the art of machine learning')]

I was surprised to find out what a wrong perception I’d had until looking at this data. It turned out that Research Scientists make up less than 50% of all the researchers in the Kaggle community!
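
Two different percentages are easy to mix up here: the share of all researchers that each occupation represents, and the fraction of each occupation that does research. A sketch on made-up data (the toy columns `occupation` and `does_research` are hypothetical):

```python
import pandas as pd

# Made-up respondents: occupation and whether they do ML research.
toy = pd.DataFrame({
    "occupation": ["Research Scientist", "Research Scientist", "Data Scientist",
                   "Data Scientist", "Data Scientist", "Data Scientist"],
    "does_research": [True, False, True, True, False, False],
})

researchers = toy[toy["does_research"]]

# Share of all researchers held by each occupation (sums to 1).
share_of_researchers = researchers["occupation"].value_counts(normalize=True)

# Fraction of each occupation that does research (need not sum to 1).
rate_within_occupation = researchers["occupation"].value_counts().div(
    toy["occupation"].value_counts()
)
print(share_of_researchers)
print(rate_within_occupation)
```

The claim that Research Scientists make up less than half of all researchers is a statement about the first quantity.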

In [None]:
researchers = df[df['Q24_Part_6']=='Do research that advances the state of the art of machine learning']

# plot what percentage each role represents out of the overall number of researchers
researcher_share = researchers['Q5'].value_counts(normalize=True)
ax = sns.barplot(x = researcher_share.values, y = researcher_share.index)
plt.xlim(0,1)
plt.ylabel("occupations")
plt.xlabel("percentage out of the overall research")
plt.show()

Biases still exist. On both ends of the spectrum. We should not be surprised or angry when people have biased opinions about a group we identify with. What we can do instead is tell our story. And so I did.