# **Is Data Science Crowded by Young Adults?**
### An exploratory data analysis of six years (2017-2022) of data from the Kaggle Machine Learning & Data Science Survey.

# Hypothesis:

**Target group:** Young adults make up the majority of newcomers to data science, and they vary widely in their choice of programming languages, education, and other aspects. Python is generally regarded as the most popular language, and in recent years certifications have been hyped as an alternative to formal education.

**Problem Statement:** Age group is one of the most significant factors in data science, and the choice of platforms and tools within each age group has changed the field over time. We investigate this pattern in the survey data across the years.

# Data Information:
- All data from the previous surveys, i.e., from 2017 to 2022, was downloaded from the official Kaggle survey site.
- Merged data for 2017 to 2021 was downloaded from this Kaggle dataset (https://www.kaggle.com/datasets/harveenchadha/kaggle-survey-20172020-merged-data)
- We added the 2022 data to this file, using the 2022 questions as the reference and keeping only the 2022 questions in the final merged file.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook_connected"
import plotly.graph_objects as go
pd.set_option('display.max_columns', 5000)
import warnings
warnings.filterwarnings("ignore")

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
merged = pd.read_csv("/kaggle/input/merged-kaggle-survey-data/Kaggle_survey_data/merged_data.csv")

In [None]:
# removing the 0th row and save that into a dictionary of questions
questions = merged.iloc[0].to_dict()
# drop the 0th row
merged.drop(0, inplace=True)

# year
merged = merged[merged['Year'] != 'Year']
merged['Year'] = merged['Year'].astype(int)

# Q3 = Gender
# replace the values based on the definitions
merged['Q3'] = merged['Q3'].replace({'Male':'Man',
 'Female':'Woman', 'Non-binary, genderqueer, or gender non-conforming':'Nonbinary','A different identity':'Nonbinary'})

# Q4 = Country
merged['Q4'] = merged['Q4'].replace({'United States of America':'USA',
    'United States': 'USA',
    'United Kingdom': "UK",
    'United Kingdom of Great Britain and Northern Ireland':'UK',
    'Iran, Islamic Republic of...':'Iran',
    'Viet Nam':'Vietnam',
    'Republic of Korea':'South Korea',
    'United Arab Emirates':'UAE',
    'Hong Kong (S.A.R.)':'Hong Kong',
    'Taiwan, Province of China':'Taiwan',
    'Republic of Moldova':'Moldova',
    'Czech Republic':'Czechia',
    'People \'s Republic of China':'China',
    'Republic of China':'China'})


#Q23: Job title
merged['Q23']= merged['Q23'].replace({'Manager (Program, Project, Operations, Executive-level, etc)':'Manager',
                                    'Machine Learning/ MLops Engineer':'ML Engineer',
                            'Data Analyst (Business, Marketing, Financial, Quantitative, etc)': 'Data Analyst'})

  
#Q30: money spent on ML
merged['Q30'] = merged['Q30'].replace({'$0 ($USD)':'$0','$100,000 or more ($USD)': '$100,000 or more'})


In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x='Year', data=merged, order=merged['Year'].value_counts().index)
for p in plt.gca().patches:
    plt.gca().text(p.get_x() + p.get_width()/2., p.get_height(), '{:1.0f}'.format(p.get_height()), 
            fontsize=12, color='black', ha='center', va='bottom')
plt.title('Recorded responses per Year')
plt.show()

- Responses per year increased overall from 2017 to 2022, with fluctuations.
- The most responses were recorded in 2021.
- The response counts for 2022 and 2018 are quite comparable.
- The fewest responses were recorded in 2017.

In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x='Q2', data=merged, order=merged['Q2'].value_counts().index)
for p in plt.gca().patches:
    plt.gca().text(p.get_x() + p.get_width()/2., p.get_height(), '{:1.0f}'.format(p.get_height()), 
            fontsize=12, color='black', ha='center', va='bottom')
plt.title('Recorded responses per Age Group')
plt.xlabel(" Age Groups")
plt.show()

- The three largest age groups span 18 to 29, which shows that data science is most popular among young adults.
- The 18 to 29 age group makes up more than 50% of the survey respondents.
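
The share quoted above can be checked directly from the value counts. The snippet below is a minimal sketch on a small made-up sample: the `toy` frame and its values are hypothetical, and only the column name `Q2` and the age labels follow the notebook.

```python
import pandas as pd

# Hypothetical sample standing in for the merged survey frame; the values are made up.
toy = pd.DataFrame(
    {"Q2": ["18-21", "22-24", "25-29", "25-29", "30-34", "35-39", "22-24", "40-44"]}
)

young_groups = ["18-21", "22-24", "25-29"]

# Share of each age group among respondents who answered Q2 (NaN is excluded by default).
age_share = toy["Q2"].value_counts(normalize=True)

# Combined share of the 18-29 groups; reindex guards against a label that never occurs.
young_share = age_share.reindex(young_groups, fill_value=0).sum()
print(f"18-29 share: {young_share:.1%}")
```

On the real data, the same lines should work with `merged` in place of `toy`.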

In [None]:
fig = px.histogram(merged, x="Q2", color="Year", barmode="group", title=" Data Scientist Age Distribution by Year",
                  category_orders={"Q2": ["18-21", "22-24", "25-29", "30-34", "35-39", "40-44",
                  "45-49", "50-54", "55-59", "60-69", "70+"]}).update_xaxes(title_text="Age Groups").update_yaxes(title_text="Count")
fig.show(renderer="kaggle")

- 1: Across all six years, 18 to 34 is the most common age range among data scientists, covering more than 50% of respondents globally.
- 2: Looking at each year separately:
  - From 2017 to 2019: the 22 to 34 range leads, with 25-29 the largest group.
  - From 2020 to 2022: the 18 to 29 range leads, with 25-29 the largest group until 2021; in 2022, 18-21 took the lead.
- 3: Data science has been most popular among young adults since the earliest surveys.
- 4: People above 35 are gradually showing more interest in data science, but their numbers are growing more slowly than those of young adults.
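
Raw counts per year mix two effects: the survey itself grew, and the age composition changed. One way to separate them is to look at shares within each year. The sketch below assumes a small hypothetical frame (`toy`, with made-up values) that only borrows the column names `Year` and `Q2` from the notebook.

```python
import pandas as pd

# Hypothetical sample; only the column names follow the notebook.
toy = pd.DataFrame({
    "Year": [2017, 2017, 2017, 2017, 2022, 2022, 2022, 2022],
    "Q2":   ["25-29", "30-34", "22-24", "40-44", "18-21", "18-21", "25-29", "35-39"],
})

# Row-normalised cross-tabulation: each row (year) sums to 1.
share_by_year = pd.crosstab(toy["Year"], toy["Q2"], normalize="index")

# Combined share of the 18-34 groups per year (only labels present in the data are used).
young_labels = ["18-21", "22-24", "25-29", "30-34"]
young_cols = [c for c in young_labels if c in share_by_year.columns]
young_by_year = share_by_year[young_cols].sum(axis=1)
print(young_by_year)
```

Comparing `young_by_year` across years shows whether the young-adult share itself is moving, independent of how many people answered the survey.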

In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x='Q3', data=merged, order=merged['Q3'].value_counts().index)
for p in plt.gca().patches:
    plt.gca().text(p.get_x() + p.get_width()/2., p.get_height(), '{:1.0f}'.format(p.get_height()), 
            fontsize=12, color='black', ha='center', va='bottom')
plt.title(' Gender Distribution')
plt.xlabel(" Gender ")
plt.show()

In [None]:
fig = px.histogram(merged, x="Q3", color="Year", barmode="group", title=" Data Scientists Gender Distribution by Year").update_xaxes(title_text="Gender").update_yaxes(title_text="Count")
fig.show(renderer='kaggle')

- 1: The data science field is highly gender-imbalanced, with men making up a significant majority.
- 2: Surprisingly, in 2020 the number of women increased relative to the number of men.
- 3: In the most recent survey (2022), the count of women is by far the highest of the last six years, whereas the count of men is lower than in the previous two years.
- 4: This points to growing awareness and recognition of women's contributions in STEM and other fields.
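
A count can rise simply because more people answered the survey, so it helps to look at both the share of women per year and the year-over-year change in their count. This is a minimal sketch on hypothetical data (`toy` is made up; the column names `Year` and `Q3` and the labels `Man`/`Woman` follow the notebook).

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration.
toy = pd.DataFrame({
    "Year": [2020] * 5 + [2021] * 6 + [2022] * 5,
    "Q3": ["Man", "Man", "Man", "Man", "Woman",
           "Man", "Man", "Man", "Man", "Woman", "Woman",
           "Man", "Man", "Man", "Woman", "Woman"],
})

# Respondents per year and gender, one column per gender.
counts = toy.groupby(["Year", "Q3"]).size().unstack(fill_value=0)

# Share of women among all respondents of each year.
women_share = counts["Woman"] / counts.sum(axis=1)

# Relative change in the raw count of women from one year to the next.
women_growth = counts["Woman"].pct_change()

print(women_share.round(3))
print(women_growth)
```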

In [None]:
fig = px.histogram(merged, x="Q2", color="Q3", barmode="group", title=" Data Scientists Gender Distribution within Age groups").update_xaxes(title_text="Age Groups").update_yaxes(title_text="Count")
fig.show(renderer='kaggle')

- 1: Among male data scientists, the top three age groups are 25-29, 22-24, and 30-34, i.e., most are between 22 and 34 years old.
- 2: Most female data scientists are between 18 and 29 years old.

In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x='Q4', data=merged, order=merged['Q4'].value_counts().iloc[:15].index)
for p in plt.gca().patches:
    plt.gca().text(p.get_x() + p.get_width()/2., p.get_height(), '{:1.0f}'.format(p.get_height()), 
            fontsize=12, color='black', ha='center', va='bottom')
plt.title('Top 15 Countries in the Data Science Community')
plt.show()

- 1: Over the last six years, more than a quarter of the respondents are from India, a far larger share than any other country, followed by the USA and China.
- 2: Pakistan is in 14th position, with only about 1.5% of the respondents.

In [None]:
top_15_countries = merged['Q4'].value_counts().iloc[:15].index.tolist()
fig = px.histogram(merged[merged['Q4'].isin(top_15_countries)], x="Q4", color="Year",
                 barmode="group", title=" Top 15 Countries in the Data Science Community by Year",
                category_orders={"Q4": top_15_countries}).update_xaxes(title_text="Countries").update_yaxes(title_text="Count")
fig.show(renderer='kaggle')

- 1: In the initial years, 2017 and 2018, the USA led the DS domain, followed by India and Russia.
- 2: From 2019 onwards, India rose sharply and is still at the top, followed by the USA and Brazil.
- 3: In the first two years there was no change for Pakistan; however, since 2019 Pakistan has been slowly rising in the field.

In [None]:
# Prominent age group i.e, young adults 18-39
top_age_groups = merged['Q2'].value_counts().iloc[:5].index.tolist()
#top 5 countries
top_5_countries = merged['Q4'].value_counts().iloc[:5].index.tolist()

fig = px.histogram(merged[(merged['Q4'].isin(top_5_countries)) & (merged['Q2'].isin(top_age_groups))], x="Q2", color="Q4",
                    barmode="group", title=" Top 5 Countries in the Data Science Community by Age",
                    category_orders={"Q2": top_age_groups}).update_xaxes(title_text="Age Groups").update_yaxes(title_text="Count")
fig.show(renderer='kaggle')


In [None]:
# replace the values in the column
merged['Q6_3'].replace({'Kaggle Learn Courses':'Kaggle Courses','Kaggle Courses (i.e. Kaggle Learn)': 'Kaggle Courses', 'Kaggle Learn' : 'Kaggle Courses'}, inplace=True)
merged['Q6_5'].replace({'Fast.ai':'Fast.AI'}, inplace=True)
merged['Q6_9'].replace({'Cloud-certification programs (direct from AWS, Azure, GCP, or similar)':'Cloud-certification'}, inplace=True)
merged['Q6_10'].replace({'University Courses (resulting in a university degree)':'University Courses'}, inplace=True)

In [None]:
fig = px.histogram(merged, x=['Q6_1','Q6_2', 'Q6_3', 'Q6_4', 'Q6_5', 'Q6_6', 'Q6_7', 'Q6_8', 'Q6_9', 'Q6_10', 'Q6_11', 'Q6_12'], color="Q2", barmode="group", title="Preference of Data Science Learning Platform within Age Groups")
fig.show(renderer='kaggle')

- 1. Coursera is the leading platform, followed by Udemy and Kaggle Courses.
- 2. Most Coursera certifications were completed by the 25-29 age group.
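
Multi-select questions such as Q6 are stored as one column per option, holding the option label where it was selected and a missing value otherwise. To tabulate platform against age group explicitly, the columns can be reshaped into long form first. The sketch below uses a hypothetical two-option sample (`toy`, made-up values); the column names follow the notebook's `Q6_*` pattern.

```python
import pandas as pd

# Hypothetical sample with two of the Q6 option columns; the values are made up.
toy = pd.DataFrame({
    "Q2":   ["18-21", "25-29", "25-29", "30-34"],
    "Q6_1": ["Coursera", "Coursera", None, "Coursera"],
    "Q6_3": ["Kaggle Courses", None, "Kaggle Courses", None],
})

option_cols = ["Q6_1", "Q6_3"]

# One row per (respondent, selected option); unselected options are dropped.
long_form = (
    toy.melt(id_vars="Q2", value_vars=option_cols,
             var_name="option_col", value_name="platform")
       .dropna(subset=["platform"])
)

# Number of selections per platform and age group.
platform_by_age = long_form.groupby(["platform", "Q2"]).size().unstack(fill_value=0)
print(platform_by_age)
```

The resulting table makes it easy to read off which age group contributes most to each platform.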

In [None]:
fig = px.histogram(merged, x=['Q6_1','Q6_2', 'Q6_3', 'Q6_4', 'Q6_5', 'Q6_6', 'Q6_7', 'Q6_8', 'Q6_9', 'Q6_10', 'Q6_11', 'Q6_12'], color="Year", barmode="group", title="How Preference of Data Science Learning Platforms has changed over the years")
fig.show(renderer='kaggle')

- 1: Data on DS learning platforms begins in 2018, with Coursera at the top; no data is available for 2017.
- 2: LinkedIn Learning appeared in 2019 and has grown steadily since.
- 3: Kaggle Courses took a sudden leap and is now a true competitor to Coursera.
- 4: Surprisingly, university courses were first recorded as a DS learning platform only in the most recent year.

In [None]:
# replace the values in the column
merged['Q7_2'].replace({'Online courses (Coursera, EdX, etc)':'Online courses'}, inplace=True)
merged['Q7_3'].replace({'Social media platforms (Reddit, Twitter, etc)':'Social platforms'}, inplace=True)
merged['Q7_4'].replace({'Video platforms (YouTube, Twitch, etc)':'Video platforms'}, inplace=True)
merged['Q7_5'].replace({'Kaggle (notebooks, competitions, etc)':'Kaggle'}, inplace=True)
merged['Q7_6'].replace({'None / I do not study data science':'None'}, inplace=True)

In [None]:
fig = px.histogram(merged, x=['Q7_1','Q7_2', 'Q7_3', 'Q7_4', 'Q7_5', 'Q7_6', 'Q7_7'], color="Q2", barmode="group", title="Learning Platforms for Data Science")
fig.show(renderer='kaggle')

In [None]:
# replace the values in the column
merged['Q8'].replace({"Bachelor’s degree":"Bachelor's degree", "Master’s degree" : "Master's degree", "Some college/university study without earning a Bachelor’s degree":"Some College",
"Some college/university study without earning a bachelor's degree" : "Some College" , "Some college/university study without earning a bachelor’s degree":"Some College",
"Professional doctorate":"Doctoral degree", "I did not complete any formal education past high school":"No Formal",
"No formal education past high school":"No Formal", "I prefer not to answer": "No Ans"}, inplace=True)

In [None]:
fig = px.histogram(merged, x='Q8', color="Q2", barmode="group", title="Highest level of formal education within Age Groups")
fig.show(renderer='kaggle')

In [None]:
fig = px.histogram(merged, x='Q8', color="Year", barmode="group", title="Highest level of formal education in Years")
fig.show(renderer='kaggle')

- 1: More than 70% of the DS community consists of people with a Master's, Bachelor's, or Doctoral degree.
- 2: In 2020, people with some college education became more involved in this field.
- 3: Surprisingly, people with professional degrees are not much involved in this field, and there are no records for them in the last 2 years.

In [None]:
fig = px.histogram(merged, x='Q9', color="Q2", barmode="group", title="Publication")
fig.show(renderer='kaggle')

- 1: Young adults have published less than older data scientists (40-60).

In [None]:
# replace the values in the column
merged['Q10_1'].replace({'Yes, the research made advances related to some novel machine learning method (theoretical research)': 'theoretical research'}, inplace=True)
merged['Q10_2'].replace({'Yes, the research made use of machine learning as a tool (applied research)': 'applied research'}, inplace=True)

In [None]:
fig = px.histogram(merged, x=['Q10_1', 'Q10_2', 'Q10_3'], color="Q2", barmode="group", title="Did your research make use of machine learning")
fig.show(renderer='kaggle')

In [None]:
merged['Q11'].replace({'< 1 years':"Less than a year","< 1 year":"Less than a year",
"1-2 years":"1-3 years","1 to 2 years":"1-3 years","3 to 5 years ":"3-5 years","3 to 5 years":"3-5 years",
"I have never written code":"Never Code","I have never written code but I want to learn":"Never Code","I don't write code to analyze data":"Never Code",
"I have never written code and I do not want to learn":"Never Code","6 to 10 years":"5-10 years",
"10-20 years":"10+ Years","20+ years":"10+ Years","20-30 years":"10+ Years","30-40 years":"10+ Years","40+ years":"10+ Years","More than 10 years":"10+ Years"}, inplace=True)


In [None]:
fig = px.histogram(merged, x='Q11', color="Year", barmode="group", title="For how many years have you been writing code")
fig.show(renderer='kaggle')

- 1: In 2018, newcomers who had never coded before entered the DS market, and their number is still increasing.
- 2: In 2019, it seems that experienced coders switched their field towards DS.
- 3: Over the last 6 years, most people in DS have had less than 3 years of coding experience.
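
Experience buckets are easier to read when they appear in their natural order rather than by frequency. One option is an ordered categorical, which also allows comparisons such as "less than 3 years". The sketch below is an illustration on made-up data (`toy` is hypothetical); the bucket labels are the ones produced by the `Q11` replacement above.

```python
import pandas as pd

# Bucket labels as produced by the Q11 clean-up, in increasing order of experience.
experience_order = ["Never Code", "Less than a year", "1-3 years",
                    "3-5 years", "5-10 years", "10+ Years"]

# Hypothetical sample; the values are made up.
toy = pd.DataFrame(
    {"Q11": ["1-3 years", "10+ Years", "Less than a year", "1-3 years", "Never Code"]}
)

# An ordered categorical knows that "1-3 years" comes before "3-5 years".
toy["Q11"] = pd.Categorical(toy["Q11"], categories=experience_order, ordered=True)

# Counts listed in bucket order; buckets with no respondents appear with a count of 0.
ordered_counts = toy["Q11"].value_counts().sort_index()

# Share of respondents below the "3-5 years" bucket.
under_three = (toy["Q11"] < "3-5 years").mean()
print(ordered_counts)
print(f"less than 3 years: {under_three:.0%}")
```

The same `experience_order` list could also be passed to `px.histogram` through `category_orders={"Q11": experience_order}` to fix the order of the bars.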

In [None]:
merged['Q12_8'].replace({'Javascript/Typescript':"Javascript"},inplace=True)

In [None]:
fig = px.histogram(merged, x=['Q12_1', 'Q12_2', 'Q12_3', 'Q12_4', 'Q12_5', 'Q12_6', 'Q12_7', 'Q12_8', 'Q12_9', 'Q12_10', 'Q12_11', 'Q12_12', 'Q12_13', 'Q12_14', 'Q12_15'], color="Q2", barmode="group", title="Which programming languages do you use")
fig.show(renderer='kaggle')

- 1: The popularity of Python is not linked to any specific age group. It is the most popular language among all data scientists.
- 2: R and SQL run in parallel in second place.

In [None]:
merged['Q13_1'].replace({'JupyterLab ':"JupyterLab"},inplace=True)
merged['Q13_2'].replace({' RStudio ':"RStudio"},inplace=True)
merged['Q13_3'].replace({' Visual Studio ':"Visual Studio","  ":"Visual Studio"},inplace=True)
merged['Q13_4'].replace({' Visual Studio Code (VSCode) ':"VSCode",' Visual Studio Code (VSCode)  ':"VSCode",'  Visual Studio Code (VSCode)   ':"VSCode"},inplace=True)
merged['Q13_5'].replace({' PyCharm ':"PyCharm"},inplace=True)
merged['Q13_6'].replace({'  Spyder  ':"Spyder"},inplace=True)
merged['Q13_7'].replace({'  Notepad++  ':"Notepad++"},inplace=True)
merged['Q13_8'].replace({'  Sublime Text  ':"Sublime Text"},inplace=True)
merged['Q13_9'].replace({'  Vim / Emacs  ':"Vim"},inplace=True)
merged['Q13_10'].replace({' MATLAB ':"MATLAB"},inplace=True)
merged['Q13_11'].replace({' Jupyter Notebook':"Jupyter Notebook"},inplace=True)

In [None]:
fig = px.histogram(merged, x=['Q13_1', 'Q13_2', 'Q13_3', 'Q13_4', 'Q13_5', 'Q13_6', 'Q13_7', 'Q13_8', 'Q13_9', 'Q13_10', 'Q13_11', 'Q13_12', 'Q13_13', 'Q13_14'], color="Q2", barmode="group", title="Which IDEs do you use")
fig.show(renderer='kaggle')

- 1: The relative choice of IDEs does not differ among age groups.

In [None]:
tool=[]
count=[]

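for i in range(1, 17):
    # each Q14_i column holds a single option label, so its top value is that label
    top = merged[f'Q14_{i}'].value_counts()
    tool.append(top.index[0])
    count.append(top.iloc[0])  # positional access; top[0] would be a label lookup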



tools=pd.DataFrame(list(zip(tool, count)),
               columns =['Tool', 'Count'])

tools=tools.sort_values(by='Count',ascending=False).reset_index(drop=True)

fig=px.bar(tools,x='Tool',y='Count',color='Tool',text='Count',
           template='simple_white',title="<b>Most used hosted notebook products</b>")

# fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
# fig.update_layout( plot_bgcolor='#FFF8DC', paper_bgcolor='#FFF8DC')
fig.show(renderer='kaggle')

In [None]:
library=[]
count=[]

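for i in range(1, 16):
    # each Q15_i column holds a single option label, so its top value is that label
    top = merged[f'Q15_{i}'].value_counts()
    library.append(top.index[0])
    count.append(top.iloc[0])  # positional access; top[0] would be a label lookup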



data=pd.DataFrame(list(zip(library, count)),
               columns =['Library', 'Count'])


data=data.sort_values(by='Count',ascending=False).reset_index (drop=True)
fig=px.bar(data,x='Library',y='Count',template='simple_white',color='Library',text='Count',title='<b>Data visualization libraries used on a regular basis</b>')

fig.show(renderer='kaggle')

In [None]:
job_categories = merged['Q23'].value_counts().iloc[:11].index.tolist()
# plot the count with year in plotly
fig = px.histogram(merged[merged['Q23'].isin(job_categories)], x="Q23", color="Year",
                 barmode="group", title=" Job Categories in the Data Science Community by Year",
                category_orders={"Q23": job_categories}).update_xaxes(title_text="Job Categories").update_yaxes(title_text="Count")
fig.show(renderer='kaggle')

- The trend over the past years shows interest in data science from people in all walks of life.
- However, the largest category is students, followed by data scientists.
- There has been a steep rise in student activity in data science over the past several years.

In [None]:
company_size = merged['Q25'].value_counts().iloc[:6].index.tolist()
# plot the count with year in plotly
fig = px.histogram(merged[merged['Q25'].isin(company_size)], x="Q25", color="Year",
                 barmode="group", title=" Growth of Companies in the Data Science Community",
                category_orders={"Q25": company_size}).update_xaxes(title_text="Company Size").update_yaxes(title_text="Count")
fig.show(renderer='kaggle')

- There has been an upward trend across companies of all sizes over the years in the use of data science to improve their business output.
- Especially in smaller developing countries, we are seeing smaller companies moving more and more into data science.
- Most of the young workforce is employed in small to medium-sized companies.

In [None]:
ml_methods = merged['Year'].value_counts().iloc[:6].index.tolist()
merged=merged.rename(columns={'Q27':'ML Methods'})
# the x axis is the survey year; bars are coloured by the employer's use of ML methods
fig = px.histogram(merged[merged['Year'].isin(ml_methods)], x="Year", color="ML Methods",
                 barmode="group", title=" ML Methods used by current Employer",
                category_orders={"Year": ml_methods}).update_xaxes(title_text="Year").update_yaxes(title_text="Count")
fig.show(renderer='kaggle')

- According to the data, the number of companies using ML methods has risen.
- As the graph shows, the number of respondents whose employers do not use ML methods was around 4000 in 2018 and fell to around 1900 in 2022, a drop of roughly half.
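
The size of the change between the two approximate figures quoted above can be worked out explicitly. The numbers below are the rounded values read off the chart, not exact survey counts, and the variable names are only illustrative.

```python
# Approximate counts read off the chart (rounded, not exact survey figures).
not_using_2018 = 4000
not_using_2022 = 1900

# Relative change from 2018 to 2022 (negative means a decrease).
relative_change = (not_using_2022 - not_using_2018) / not_using_2018

# How many times larger the 2018 figure is than the 2022 figure.
ratio = not_using_2018 / not_using_2022

print(f"relative change: {relative_change:.1%}, 2018/2022 ratio: {ratio:.2f}x")
```

Because the number of respondents differs from year to year, comparing the share of respondents in this category per year would be a more robust check than comparing raw counts.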

In [None]:
repository = merged['Q22'].value_counts().iloc[:6].index.tolist()
# plot the count with year in plotly
fig = px.histogram(merged[merged['Q22'].isin(repository)], x="Q22", color="Year",
                 barmode="group", title="Repositories used",
                category_orders={"Q22": repository}).update_xaxes(title_text="Repositories used").update_yaxes(title_text="Count")
fig.show(renderer='kaggle')

What are Repositories?
A repository is a place that hosts an application’s source code, along with various metadata. In this survey question, the term refers to ML model hubs that host pre-trained models and datasets.
According to the 2022 data available on the Kaggle website, the three most commonly used repositories are:
1. Kaggle Datasets
2. TensorFlow Hub
3. Huggingface Models

Note: Repository data is available on the Kaggle website for 2022 only.

In [None]:
df = pd.read_csv("/kaggle/input/merged-kaggle-survey-data/Kaggle_survey_data/w_data.csv", low_memory=False )

In [None]:
df.dropna(subset = ['Q16'], inplace=True)
Experience=['1-2 years','I do not use machine learning methods','2-3 years','3-4 years','5-10 years','4-5 years',
            '10-20 years','20 or more years']
values=[20560,11297, 9811, 5337, 4688, 4416, 1723,
       739]
df1=pd.DataFrame({'Experience':Experience,'values':values})
df1.sort_values(by='values',ascending=False,inplace=True)
df1 = df1.reset_index(drop=True)  # reset_index returns a new frame, so assign it back
fig=px.bar(df1,x='Experience',y='values',color='values',template='plotly',
           text='values',title='<b>Years of experience with ML methods, 2017-2022</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
# fig.update_layout( plot_bgcolor='#FFF8DC', paper_bgcolor='#FFF8DC')
fig.update_xaxes(title_text='<b>Experience</b>')
fig.update_yaxes(title_text='<b>Respondents</b>')
fig.show(renderer="kaggle")

- The graph shows that most respondents started using machine learning methods only in the last few years.
- The largest group has 1-2 years of experience, which suggests that more people will be using ML methods in the coming years.
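
Instead of typing the counts by hand, the table behind such a bar chart can be derived from the column itself. This is a minimal sketch on hypothetical data (`toy` is made up); the column name `Q16` and the output column names `Experience` and `values` follow the cell above.

```python
import pandas as pd

# Hypothetical sample; the values are made up.
toy = pd.DataFrame({"Q16": ["1-2 years", "1-2 years", "2-3 years", None,
                            "I do not use machine learning methods", "1-2 years"]})

# Count each experience bucket (missing answers dropped), most frequent first,
# and shape the result into the two columns the bar chart expects.
experience_counts = (
    toy["Q16"].dropna()
       .value_counts()
       .rename_axis("Experience")
       .reset_index(name="values")
)
print(experience_counts)
```

A frame built this way stays in sync with the data if the input file changes.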

In [None]:
def sort_dic_perc(df,col_list_single_ques,dic_counts_single_ques): 
    ''' 
    A helper function that converts counts to percentages and returns
    them sorted in ascending order.
    It is an adaptation of a similar function
    from https://www.kaggle.com/sonmou/what-topics-from-where-to-learn-data-science.
    '''
    dictionary = count_perc_mcq(df,
                                col_list_single_ques,
                                dic_counts_single_ques)
    # sort by percentage; sorting the items directly keeps options with equal percentages
    return dict(sorted(dictionary.items(), key=lambda item: item[1]))

def count_percent(df,col_name):
    '''
    A helper function to return value counts as percentages.
    '''
    counts = df[col_name].value_counts(dropna=False)
    # missing answers are counted above, so divide by the total number of rows
    percentages = round(counts*100/len(df[col_name]),1)
    return percentages

def count_perc_mcq(df,col_list_single_ques,dic_counts_single_ques):
    '''
    A helper function to convert counts to percentages.
    '''
    # respondents who selected at least one option of the question
    answered = df[col_list_single_ques].dropna(how='all')
    total_count = len(answered)
    # build a new dictionary so the caller's counts are left unchanged
    return {option: round(float(count*100/total_count),1)
            for option, count in dic_counts_single_ques.items()}

responses = df[1:]

Ques_dic_17_2022 = {
    'Scikit-learn' : (responses['Q17_1'].count()),
    'TensorFlow': (responses['Q17_2'].count()),
    'Keras' : (responses['Q17_3'].count()),
    'PyTorch' : (responses['Q17_4'].count()),
    'Fast.ai' : (responses['Q17_5'].count()),
    'Xgboost' : (responses['Q17_6'].count()),
    'LightGBM' : (responses['Q17_7'].count()),
    'CatBoost' : (responses['Q17_8'].count()),
    'Caret' : (responses['Q17_9'].count()),
    'Tidymodels' : (responses['Q17_10'].count()),
    'JAX' : (responses['Q17_11'].count()),
    'PyTorch Lightning' : (responses['Q17_12'].count()),
    'Huggingface' : (responses['Q17_13'].count()),
    'None' : (responses['Q17_14'].count()),
    'Other' : (responses['Q17_15'].count()),
}


title_for_y_axis = 'Respondents'
title_for_chart = 'Machine learning frameworks from 2018-2022'
fig1 = px.bar(x=list(Ques_dic_17_2022.keys()), 
y=list(Ques_dic_17_2022.values()), title='ML Frameworks used').update_xaxes(title_text="ML frameworks")
fig1.update_layout(barmode='group') 
fig1.update_layout(title=title_for_chart,yaxis=dict(title=title_for_y_axis))
fig1.show(renderer='kaggle')

- Scikit-learn, TensorFlow, and Keras are the most used ML frameworks. Scikit-learn ranks highest, likely because it provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, via a consistent interface in Python.
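
Raw counts for a multi-select question can also be expressed as percentages of the respondents who picked at least one option, which is the base used by the `count_perc_mcq` helper. The sketch below shows the same idea in vectorised form on hypothetical data: `toy` and the `labels` mapping are made up for illustration, and only the `Q17_*` column pattern follows the notebook.

```python
import pandas as pd

# Hypothetical sample with two of the Q17 option columns; the values are made up.
toy = pd.DataFrame({
    "Q17_1": ["Scikit-learn", "Scikit-learn", None, None],
    "Q17_2": ["TensorFlow", None, "TensorFlow", None],
})

option_cols = ["Q17_1", "Q17_2"]
labels = {"Q17_1": "Scikit-learn", "Q17_2": "TensorFlow"}

# Keep respondents who selected at least one option; they form the percentage base.
answered = toy[option_cols].dropna(how="all")

# Non-missing entries per column = selections per option, as a share of the base.
framework_pct = (answered.count() * 100 / len(answered)).round(1).rename(index=labels)
print(framework_pct)
```

Since respondents can select several options, these percentages are not expected to sum to 100.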


In [None]:
# Read each option's label from its own column so that labels and counts stay in step
Ques_dic_18_2022 = {}
for i in range(14, 0, -1):  # reversed so that Q18_1 ends up at the top of the horizontal bar chart
    col = responses[f'Q18_{i}']
    labels = col.value_counts()
    # fall back to the column name if the column has no answers at all
    label = str(labels.index[0]).strip() if len(labels) else f'Q18_{i}'
    Ques_dic_18_2022[label] = col.count()


title_for_y_axis = 'ML Algorithms'
title_for_chart = 'Machine learning Algorithms from 2018-2022'
fig1 =px.bar(y=list(Ques_dic_18_2022.keys()), x=list(Ques_dic_18_2022.values()), title='ML algorithms').update_xaxes(title_text="Respondents")
fig1.update_layout(barmode='group') 
fig1.update_layout(title=title_for_chart,yaxis=dict(title=title_for_y_axis))
fig1.show(renderer='kaggle')

- Linear or Logistic Regression and Decision Trees or Random Forests are the two most common ML algorithms.
- These are also the basic algorithms in machine learning, which makes them an easy starting point.

In [None]:
Ques_dic_19_2022 ={
    'Other' : (responses['Q19_8'].count()),
    'None' : (responses['Q19_7'].count()),
    'Generative Networks (GAN, VAE, etc)' : (responses['Q19_6'].count()),
    'Vision transformer networks (ViT, DeiT, BiT, BEiT, Swin, etc)' : (responses['Q19_5'].count()),
    'Image classification and other general purpose networks' : (responses['Q19_4'].count()),
    'Object detection methods (YOLOv6, RetinaNet, etc)' : (responses['Q19_3'].count()),
    'Image segmentation methods (U-Net, Mask R-CNN, etc)': (responses['Q19_2'].count()),
    'General purpose image/video tools (PIL, cv2, skimage, etc)' : (responses['Q19_1'].count())}


title_for_y_axis = 'Computer Vision Method'
title_for_chart = 'Computer vision methods from 2018-2022'
fig1 =px.bar(y=list(Ques_dic_19_2022.keys()), 
x=list(Ques_dic_19_2022.values()), title='ML algorithms').update_xaxes(title_text="No. of respondents")
fig1.update_layout(barmode='group') 
fig1.update_layout(title=title_for_chart,yaxis=dict(title=title_for_y_axis))
fig1.show(renderer='kaggle')

- Among computer vision methods, image classification and other general-purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc.) are the most used, while the following are used at comparable levels:
    - General purpose image/video tools (PIL, cv2, skimage, etc)
    - Image segmentation methods (U-Net, Mask R-CNN, etc)
    - Object detection methods (YOLOv6, RetinaNet, etc)

In [None]:
Ques_dic_20_2022 = {
    'Other' : (responses['Q20_6'].count()),
    'None' : (responses['Q20_5'].count()),
    'Contextualized embeddings (ELMo, CoVe)' : (responses['Q20_3'].count()),
    'Encoder-decoder models (seq2seq, vanilla transformers)': (responses['Q20_2'].count()),
    'Transformer language models (GPT-3, BERT, XLnet, etc)' : (responses['Q20_4'].count()),
    'Word embeddings/vectors (GLoVe, fastText, word2vec)' : (responses['Q20_1'].count()),
}


title_for_y_axis = 'Natural language processing (NLP) methods'
title_for_chart = 'Natural language processing (NLP) methods from 2018-2022'
fig1 =px.bar(y=list(Ques_dic_20_2022.keys()), x=list(Ques_dic_20_2022.values()), title='ML algorithms').update_xaxes(title_text="Respondents")
fig1.update_layout(barmode='group') 
fig1.update_layout(title=title_for_chart,yaxis=dict(title=title_for_y_axis))
fig1.show(renderer='kaggle')

- This graph shows that word embeddings/vectors (GLoVe, fastText, word2vec), followed by encoder-decoder models (seq2seq, vanilla transformers) and contextualized embeddings (ELMo, CoVe), are the most commonly used natural language processing (NLP) methods.

## Conclusion


In the last 6 years of Kaggle surveys, the following trends appear:
* The number of recorded responses fluctuates from year to year.
* More than 50% of data scientists fall in the 18-34 age group, and the dominant range has narrowed towards younger groups over the years, i.e., data scientists are mostly young adults. Older people are also in this field, but their numbers grow quite slowly compared to the former group.
* The data science field is highly gender-imbalanced, with men dominating in all 6 years. However, in 2020 women showed a remarkable surge and are the only group with an upward trend so far, which reflects the effect of women's empowerment in STEM and other fields in the last couple of years.
* Non-binary participants, or those who identified as a gender other than man or woman, are largely concentrated in the 20-35 age range.
* The USA was at the top in the early surveys (2017 and 2018), but India then overtook the USA with a huge leap and has now held 1st place for 4 consecutive years, with the youngest data scientists worldwide (18-21).
* Coursera is the most popular learning platform overall, especially among young adults, and most certifications have been completed by the 25-29 age group since the question was first asked (2018 onwards). Kaggle Courses is also competing with Coursera. University courses have only just appeared as a DS learning platform.
* Young adults in DS mostly hold a Master’s or Bachelor’s degree. Recently, people with some college education are showing more interest, while those with professional degrees have shown a downward trend in the last couple of years.
* Most research has been in applied rather than theoretical DS, and young data scientists are not much interested in publishing papers. People aged 40-60 are by far the most prominent in terms of publishing.
* Most of the data science community has less than 3 years of coding experience, and from 2019 onwards experienced programmers have been switching to DS.
* The majority of young people aged under 30 are not using ML cloud products such as Amazon S3 and Azure cloud services.
* A large percentage of Kaggle users under 40 use either Tableau or Power BI for data visualization.
* Python's popularity is not linked to any specific age group. MySQL is the most popular database system in all age groups.
* Among cloud products, Amazon S3 is the most popular.
* Most Kaggle users in all age groups are from small companies and have less than 1 year of experience. A similar trend is seen among students, who are mostly young people.
* Scikit-learn and TensorFlow have been gaining popularity among all age groups in the last 3 years.
* Data visualization in Python is mostly done with Matplotlib.
* Python is the most popular language in all age groups.
* As expected, Kaggle users mostly use Kaggle Notebooks, followed by Google Colab notebooks.
* Even ML practitioners aged 70+ use TensorBoard and MLflow for ML model monitoring.

### Nutshell:
In line with our hypothesis, many factors are age-dependent and have changed significantly over the years. However, we cannot say that every aspect of the data science field is connected to the two factors mentioned, so there may be further connections that need investigation.

## Team Aden

* [Dure Aden Ammara](https://www.kaggle.com/adenrajput)
* [Ahsan Shakoor](https://www.kaggle.com/ahsanshakoor)
* [Arsalan Khan](https://www.kaggle.com/marsalankhan)
* [Wajeeha](https://www.kaggle.com/wajeehfai)
* [Shah Nawaz](https://www.kaggle.com/shahnawazjappa)
* [Dr. Muhammad Aammar Tufail](https://www.kaggle.com/muhammadaammartufail)