# Data analysis of all 42 questions asked in the Kaggle Survey 2021

Hi! I'm Prajit, a PhD student and data enthusiast. In this notebook, I wish to help you understand some insights that I derived from the Kaggle Survey 2021 data. I analyzed all 42 questions in the survey, but left out the optional questions. If you're short on time you can check the TLDR section for the important points. It was indeed interesting to find some insights regarding the distributions of age, experience, skills and preference of fellow data enthusiasts on Kaggle! **The comparison between India and USA, conclusions and insights are at the beginning of the notebook for easy reading. Do share and upvote if you found it interesting!!**

# India v USA comparison 

* AGE: India = Most respondents are under 24; USA = Most respondents are in the age range 25-39
* NUMBER OF RESPONDENTS: India = 28.62% of all; USA = 10.20% of all
* EDUCATION: India = Most have a bachelor's degree; USA = Most have a master's degree
* JOB PROFILES: India = Most are students; USA = Significant number of Data Scientists present
* CODING EXPERIENCE: India = Most have <3 years of coding experience; USA = Most have >3 years of coding experience
* PREFERRED PROGRAMMING LANGUAGE: C and C++ seem to be more popular in India than in the USA
* WORKING ENVIRONMENT: A larger proportion of Kagglers from USA work on their PCs as opposed to India
* USE OF TPUs: A larger proportion of Kagglers from India use Google Cloud TPUs as opposed to those from the USA
* VISUALIZATION LIBRARIES: Ggplot is more commonly used in the USA than in India
* ML EXPERIENCE: India = Few people with >3 years exp; USA = Significant number of people with >3 years exp
* ML ADOPTION IN COMPANIES: India has more companies that are in the early stage (<2 years) of ML adoption
* PEAK SALARY RANGE PER ANNUM: India = 10000-14999 USD; USA = 100000-124999 USD
* EXPENDITURE ON ML/CLOUD: Respondents from USA tend to spend more
* PREFERRED CLOUD PRODUCT: India = Amazon EC2, GCE and Azure; USA = Amazon EC2 by a long shot
* PREFERRED DATA STORAGE PRODUCT: India = Amazon S3 and Google Cloud; USA = Amazon S3 by a long shot
* PREFERRED BIG DATA TOOL: MongoDB seems to be more popular in India than in the USA
* TOOLS TO MANAGE ML EXPERIMENTS: Indians seem to be using more tools as compared to their counterparts in the USA
* SHARING DATA SCIENCE PORTFOLIO: India = Kaggle and GitHub, USA = More of GitHub
* COURSEWORK: Picking university courses related to data science and ML seems to be more popular in the USA compared to India
* BASIC DATA ANALYSIS: India = Google Sheets, Microsoft Excel; USA = RStudio, Jupyter
* SOURCE OF INFORMATION ABOUT DATA SCIENCE: India = Kaggle, YouTube, blogs; US = Top 3 + Twitter and other platforms

# Conclusions (TLDR)

* Most people on Kaggle seem to be young and/or in the early stages of their data science journey
* There is a huge gender imbalance in data science 
* The most regularly used algorithms and tools are the basic ones
* Most respondents don't seem to be using costly infrastructure for training, or third-party software for the various ML processes, which ties in with the previous point
* Amazon products like AWS, EC2, S3 seem to be preferred over their Google, Microsoft equivalents
* GitHub, Kaggle and Colab seem to be the most popular platforms for people to share their data science portfolios

# Key insights 
* Most Kagglers are aged younger than 30 years
* Most Kagglers are men (~80%)
* Most Kagglers are Indians (US, China, Japan and Brazil round out the top 5)
* Most Kagglers have at least a bachelor's degree (roughly 39% hold a master's degree)
* Most Kagglers are currently students
* Most Kagglers are in the early stage of their careers (1-3 years of experience)
* Data scientists on average have more programming experience than data analysts
* Most Kagglers use Python regularly (SQL is also used quite often)
* Most Kagglers overwhelmingly recommend Python as the first programming language to learn for beginners 
* Most Kagglers use JupyterLab/notebook as their primary local IDE and Google Colab or Kaggle as online notebook environments
* Most Kagglers use their laptops as the primary computing platform for data science projects 
* Most Kagglers don't use cloud computing platforms, but those with >3 years do more than the others 
* Most Kagglers don't use specialized hardware, but those who do prefer NVIDIA GPUs
* Most Kagglers haven't made use of TPUs for training their models, but those with >3 years do more than the others 
* Most Kagglers use matplotlib for data viz, Seaborn and Plotly are also popular
* Over 50% of Kagglers have been exposed to ML methods for less than two years
* The most commonly used ML frameworks by Kagglers are Scikit-learn, TensorFlow, Keras, PyTorch and XGBoost
* The most commonly used algorithms by Kagglers are simple ones like linear/logistic regression and decision trees/random forests
* The most commonly used computer vision application by Kagglers is image classification
* The most commonly used NLP methods by Kagglers are GloVe/word2vec and GPT-3/BERT
* Most Kagglers belong to the Computers/Technology industry or are in academia as students or researchers
* Most Kagglers are either the only data scientist in their team, or part of a huge team specializing in data science
* Most businesses where the respondents work seem to be in the early stages of ML adoption
* Most Kagglers work with data analysis to influence product and business decisions
* Most Kagglers do not spend money for ML infrastructure/cloud services
* The top 3 cloud platforms are AWS, GCP and Microsoft Azure and developers seem to enjoy using them
* Amazon EC2 and Google Cloud Compute Engine seem to be the most popular cloud products
* Amazon S3 and Google Cloud Filestore are the popular data storage options for the respondents
* Most Kagglers don't seem to be making use of ML products like Amazon SageMaker or Databricks
* The three most popular Big Data products seem to be MySQL, PostgreSQL and Microsoft SQL Server 
* Microsoft Power BI and Tableau seem to be the most regularly used BI tools
* Most Kagglers don't seem to be using automated ML tools like auto-sklearn
* Data augmentation, hyperparameter tuning and model selection seem to be the areas in which ML automation is more popular
* Google Cloud AutoML is the most regularly used ML automation tool by the respondents 
* Most Kagglers don't use any tools to manage their ML experiments; for those who do, TensorBoard is the favorite
* GitHub, Kaggle and Colab are the most preferred portals for the respondents to share their data science portfolios 
* Coursera, Kaggle Learn and Udemy are the most popular learning platforms for ML and data science amongst the respondents 
* To perform data analysis, respondents are mostly making use of local development environments like RStudio or basic statistical tools like spreadsheets 
* Most of the respondents seem to get their data science information from Kaggle, YouTube and blogs

# Importing libraries
First, let us import some of the libraries we may need for data processing and visualization

In [1]:
import numpy as np 
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Checking out the files in the input folder

In [2]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [3]:
survey_df = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
survey_df.head()

Separating out the questions alone so that we can refer to them later

In [4]:
questions = survey_df.iloc[0]
questions = list(questions)

Removing the first row containing the questions to get the survey answers data

In [5]:
survey_df = survey_df.iloc[1:]
survey_df.head()
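
The two steps above (keeping the question row, then dropping it) can also be sketched with a dictionary, so that a question can be looked up by its column name rather than by position. This is only an illustration on made-up data: `raw_df`, `question_text` and `answers_df` are hypothetical names, and the toy values stand in for the real CSV.

```python
import pandas as pd

# Toy stand-in for the survey CSV: row 0 holds the question text,
# the remaining rows hold answers (all values here are made up).
raw_df = pd.DataFrame({
    'Q1': ['What is your age (# years)?', '25-29', '18-21'],
    'Q3': ['In which country do you currently reside?', 'India', 'Japan'],
})

# Map each column code to its question text, then drop the question row.
question_text = raw_df.iloc[0].to_dict()
answers_df = raw_df.iloc[1:].reset_index(drop=True)

print(question_text['Q3'])
print(answers_df.shape)
```

With such a mapping, `question_text['Q3']` plays the role of `questions[3]` without relying on the column order.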

# Extracting data for USA and India in separate data frames

In [6]:
usa_df = survey_df[survey_df['Q3'] == 'United States of America']
india_df = survey_df[survey_df['Q3'] == 'India']

In [7]:
columns = survey_df.columns.values.tolist()
print(columns)

# Age distribution of Kagglers

In [8]:
print(questions[1])
print(survey_df['Q1'].unique())

In [9]:
parameter = 'Q1'
survey_df = survey_df.sort_values(by = parameter)
sns.countplot(survey_df[parameter])
plt.show()

As expected, we can see that most of the respondents are younger than 30. The largest group of Kagglers are thus students and young professionals in the early stages of their careers. However, it is enlightening to see that there are a significant number of people in older groups on Kaggle, demonstrating that there is no specific age limit to acquiring knowledge in data science. 

In [10]:
parameter = 'Q1'
# Sort into a separate variable so the full survey_df is not overwritten
india_sorted = india_df.sort_values(by = parameter)
sns.countplot(x = india_sorted[parameter])
plt.title('India')
plt.show()

In [11]:
parameter = 'Q1'
# Sort into a separate variable so the full survey_df is not overwritten
usa_sorted = usa_df.sort_values(by = parameter)
sns.countplot(x = usa_sorted[parameter])
plt.title('USA')
plt.show()

The respondents from the USA are clearly on average older than respondents from India. This could be due to the presence of a large number of Indian students on Kaggle. 
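
Because India has almost three times as many respondents as the USA, raw count plots are hard to compare directly. A minimal sketch of a within-country percentage comparison follows. It runs on made-up data, and `share_by_country` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Made-up answers, only to illustrate the shape of the computation.
toy_df = pd.DataFrame({
    'Q1': ['18-21', '18-21', '22-24', '25-29',
           '25-29', '30-34', '35-39', '18-21'],
    'Q3': ['India', 'India', 'India', 'India',
           'United States of America', 'United States of America',
           'United States of America', 'United States of America'],
})

def share_by_country(df, question, countries):
    """Return the percentage of each answer within each country."""
    subset = df[df['Q3'].isin(countries)]
    # normalize='columns' makes every country column sum to 1
    shares = pd.crosstab(subset[question], subset['Q3'], normalize='columns')
    return (shares * 100).round(1)

age_shares = share_by_country(toy_df, 'Q1', ['India', 'United States of America'])
print(age_shares)
```

On the real data, the same call with `survey_df` would put both countries on a common percentage scale.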

# Gender distribution of Kagglers

In [12]:
print(questions[2])
print(survey_df['Q2'].unique())

In [13]:
parameter = 'Q2'
sns.countplot(survey_df[parameter], order = ['Man', 'Woman', 'Prefer not to say', 'Prefer to self-describe', 'Nonbinary'])
plt.xticks(rotation = 90)
plt.show()

We can see that an overwhelming majority of respondents are men. This could be attributed to the already large gender gap in STEM fields. Perhaps raising more awareness in schools and colleges, and creating curated data science sessions with prominent women in data science could help in reducing this gap. Kaggle is definitely a good place to start for data science enthusiasts and therefore it would be amazing if the representation of women on this platform could be increased. The number of respondents who identified with other genders is also surprisingly low, so perhaps we need to create a safe space for everyone in the discussion and community sections. 

In [14]:
parameter = 'Q2'
sns.countplot(india_df[parameter], order = ['Man', 'Woman', 'Prefer not to say', 'Prefer to self-describe', 'Nonbinary'])
plt.xticks(rotation = 90)
plt.title('India')
plt.show()

In [15]:
parameter = 'Q2'
sns.countplot(usa_df[parameter], order = ['Man', 'Woman', 'Prefer not to say', 'Prefer to self-describe', 'Nonbinary'])
plt.xticks(rotation = 90)
plt.title('USA')
plt.show()

The gender imbalance seems to be the same in both India and USA.

# Nationality distribution of Kagglers

In [16]:
print(questions[3])
print('Total number of country entries: ', len(survey_df['Q3'].unique()))
print('Total number of replies: ', survey_df['Q3'].value_counts().sum())
print(survey_df['Q3'].unique())

In [17]:
country_counts = survey_df['Q3'].value_counts()
print(country_counts[:10])

In [18]:
country_counts = survey_df['Q3'].value_counts()/survey_df['Q3'].value_counts().sum()
print(country_counts[:20])

28.62% of Kagglers are from India, which has roughly 18% of the world's population. The United States comes in at a distant second with 10.2%. This is also disproportionately high, since the US has close to 4% of the world's population. As expected, nations with large populations such as China, Japan, Russia, Nigeria, Pakistan and Indonesia feature in the list too. Over 70% of Kagglers come from the top 20 countries alone. 
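
The shares quoted above can be obtained in one step with `value_counts(normalize=True)`, and the combined share of the top countries is then a simple sum. The sketch below uses a made-up series; `toy_countries` and `top_n` are hypothetical names.

```python
import pandas as pd

# Made-up country answers: 5 x India, 3 x USA, 2 x Japan.
toy_countries = pd.Series(
    ['India'] * 5 + ['United States of America'] * 3 + ['Japan'] * 2
)

# Fraction of respondents per country, largest first.
shares = toy_countries.value_counts(normalize=True)

# Combined share of the top_n countries.
top_n = 2
top_share = shares.head(top_n).sum()

print(shares.round(3))
print(f'Top {top_n} countries: {top_share:.1%}')
```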

In [19]:
country_list = ['India', 'USA', 'Other', 'Japan', 'China', 'Brazil', 'Russia', 'Nigeria', 'UK & NI', 'Pakistan', 'Egypt',
               'Germany', 'Spain', 'Indonesia', 'Turkey', 'France', 'South Korea', 'Taiwan', 'Canada', 'Bangladesh']
plt.bar(country_list, country_counts[:20])
plt.xticks(rotation = 90)
plt.show()

# Educational qualifications of Kagglers

In [20]:
print(questions[4])
print(survey_df['Q4'].unique())

In [21]:
parameter = 'Q4'
sns.countplot(survey_df[parameter], 
              order = ['No formal education past high school',
 'Some college/university study without earning a bachelor’s degree','Bachelor’s degree',
 'Master’s degree','Professional doctorate','Doctoral degree',
 'I prefer not to answer'  ])
plt.xticks(rotation=90)
plt.show()

It is impressive to see that the largest group of Kagglers holds a master's degree and an overwhelming majority of folks on the platform possess at least a bachelor's degree. This suggests that people use Kaggle as an additional learning tool along with their conventional educational degrees. 

In [22]:
parameter = 'Q4'
sns.countplot(india_df[parameter],order = ['No formal education past high school',
 'Some college/university study without earning a bachelor’s degree' ,'Bachelor’s degree',
 'Master’s degree','Professional doctorate','Doctoral degree',
 'I prefer not to answer'  ])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [23]:
parameter = 'Q4'
sns.countplot(usa_df[parameter],order = ['No formal education past high school',
 'Some college/university study without earning a bachelor’s degree' ,'Bachelor’s degree',
 'Master’s degree','Professional doctorate','Doctoral degree',
 'I prefer not to answer'  ])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Most Kagglers from India hold bachelor's degrees whereas most Kagglers from the USA hold master's degrees. 

In [24]:
pd.crosstab([survey_df.Q2],[survey_df.Q4],margins=True).style.background_gradient(cmap='summer_r')

In [25]:
men_percentage = (20598/25973)*100
print(f'Percentage of male respondents: {men_percentage:.2f}%')
column_names = ['Bachelor’s degree', 'Doctoral degree', 'I prefer not to answer',
                'Master’s degree', 'No formal education past high school', 'Professional doctorate',
                'Some college/university study without earning a bachelor’s degree']
ed_values_men = [7928, 2171, 445, 7995, 360, 270, 1429]
ed_values_all = [9907, 2795, 627, 10132, 417, 360, 1735]
# Share of men within each education level, formatted to two decimals
# (str(round(x, 4)*100) can print floating-point noise)
res = [f'{i / j:.2%}' for i, j in zip(ed_values_men, ed_values_all)]
res = [(i,j) for i, j in zip(column_names, res)]
for (a,b) in res:
    print(a,b)

The share of men is close to the overall figure of roughly 79% at most education levels. The most notable exception is respondents with no formal education past high school, of whom 86.33% are men. 
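
To see at a glance which education levels deviate from the overall male share, the counts from the cell above can be put into two pandas Series and compared. This is a sketch that reuses the hard-coded numbers from the notebook; `male_share` and `deviation` are hypothetical names.

```python
import pandas as pd

levels = ['Bachelor’s degree', 'Doctoral degree', 'I prefer not to answer',
          'Master’s degree', 'No formal education past high school',
          'Professional doctorate',
          'Some college/university study without earning a bachelor’s degree']

# Counts copied from the cell above (men / all respondents per level).
men = pd.Series([7928, 2171, 445, 7995, 360, 270, 1429], index=levels)
everyone = pd.Series([9907, 2795, 627, 10132, 417, 360, 1735], index=levels)

male_share = men / everyone * 100          # percentage of men per level
overall_share = 20598 / 25973 * 100        # percentage of men overall

# Positive values: men are over-represented relative to the overall share.
deviation = (male_share - overall_share).round(2).sort_values(ascending=False)
print(deviation)
```

Sorting by deviation makes outliers in either direction visible without reading each percentage individually.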

In [26]:
pd.crosstab([survey_df.Q3],[survey_df.Q4],margins=True).style.background_gradient(cmap='summer_r')

# Current job roles reported by Kagglers

In [27]:
print(questions[5])
print(survey_df['Q5'].unique())

In [28]:
parameter = 'Q5'
sns.countplot(survey_df[parameter],
             order = ['Student', 'Software Engineer', 'Machine Learning Engineer',
 'Data Scientist', 'Currently not employed', 'Data Engineer', 'Data Analyst',
 'DBA/Database Engineer', 'Other', 'Statistician', 'Business Analyst',
 'Developer Relations/Advocacy', 'Program/Project Manager',
 'Research Scientist', 'Product Manager'])
plt.xticks(rotation=90)
plt.show()

Not surprisingly, most of the respondents are students. This makes sense because Kaggle is the go-to platform for beginners and students of data science. The other popular roles include data scientist, data analyst and software engineer. There are also a significant number of unemployed people who are here to upskill and build a good data science portfolio. 

In [29]:
parameter = 'Q5'
sns.countplot(india_df[parameter],
             order = ['Student', 'Software Engineer', 'Machine Learning Engineer',
 'Data Scientist', 'Currently not employed', 'Data Engineer', 'Data Analyst',
 'DBA/Database Engineer', 'Other', 'Statistician', 'Business Analyst',
 'Developer Relations/Advocacy', 'Program/Project Manager',
 'Research Scientist', 'Product Manager'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [30]:
parameter = 'Q5'
sns.countplot(usa_df[parameter],
             order = ['Student', 'Software Engineer', 'Machine Learning Engineer',
 'Data Scientist', 'Currently not employed', 'Data Engineer', 'Data Analyst',
 'DBA/Database Engineer', 'Other', 'Statistician', 'Business Analyst',
 'Developer Relations/Advocacy', 'Program/Project Manager',
 'Research Scientist', 'Product Manager'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Most Indians on Kaggle seem to be students, compared to the US where a significant number of Data Scientists and other professionals seem to be active on Kaggle. 

# Work experience distribution of Kagglers

In [31]:
print(questions[6])
print(survey_df['Q6'].unique())

In [32]:
parameter = 'Q6'
sns.countplot(survey_df[parameter], 
              order = ['I have never written code','< 1 years', '1-3 years', '3-5 years', 
                       '5-10 years', '10-20 years','20+ years' ])
plt.xticks(rotation=90)
plt.show()

Most of the respondents have some coding experience. The largest group of Kagglers has 1-3 years of experience. As we reach higher experience bands, the number of respondents decreases, indicating that most of the people on the platform are in the early stages of their careers. 
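
The experience bands have a natural order that plain strings do not carry, which is why the plots above pass an explicit `order` list. One alternative, sketched here on made-up answers, is to store the order once in an ordered categorical so that counts come out in band order. `experience_order` and `band_counts` are hypothetical names.

```python
import pandas as pd

experience_order = ['I have never written code', '< 1 years', '1-3 years',
                    '3-5 years', '5-10 years', '10-20 years', '20+ years']

# Made-up answers, in arbitrary order.
toy_answers = pd.Series(['1-3 years', '20+ years', '< 1 years',
                         '1-3 years', '5-10 years'])

# Attach the band order to the data itself.
ordered = pd.Series(
    pd.Categorical(toy_answers, categories=experience_order, ordered=True)
)

# sort=False keeps the category order; empty bands appear with a count of 0.
band_counts = ordered.value_counts(sort=False)
print(band_counts)
```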

In [33]:
parameter = 'Q6'
sns.countplot(india_df[parameter], 
              order = ['I have never written code','< 1 years', '1-3 years', '3-5 years', 
                       '5-10 years', '10-20 years','20+ years' ])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [34]:
parameter = 'Q6'
sns.countplot(usa_df[parameter], 
              order = ['I have never written code','< 1 years', '1-3 years', '3-5 years', 
                       '5-10 years', '10-20 years','20+ years' ])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Most Indians on Kaggle have less than 3 years of coding experience, whereas most Kagglers from the US have more than 3 years. 

In [35]:
pd.crosstab([survey_df.Q5],[survey_df.Q6],margins=True).style.background_gradient(cmap='summer_r')

Some findings include:
* There are very few students who do not have experience in programming
* Most Data Scientists seem to have a lot of coding experience
* Most Data Analysts seem to have less coding experience
* Most of the people who didn't have programming experience did not hold a data science job, unsurprisingly
* Most software engineers and research scientists have a lot of coding experience, as expected

# Regularly used programming language

In [36]:
print(questions[columns.index('Q7_OTHER')])
print(survey_df['Q7_Part_1'].unique())
print(survey_df['Q7_Part_2'].unique())
print(survey_df['Q7_Part_3'].unique())
print(survey_df['Q7_Part_4'].unique())
print(survey_df['Q7_Part_5'].unique())
print(survey_df['Q7_Part_6'].unique())
print(survey_df['Q7_Part_7'].unique())
print(survey_df['Q7_Part_8'].unique())
print(survey_df['Q7_Part_9'].unique())
print(survey_df['Q7_Part_10'].unique())
print(survey_df['Q7_Part_11'].unique())
print(survey_df['Q7_Part_12'].unique())
print(survey_df['Q7_OTHER'].unique())

In [37]:
# Q7_Part_12 ('None') must be included so that the 13 columns line up with the 13 labels
Q7_plot_columns = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5',
                    'Q7_Part_6', 'Q7_Part_7','Q7_Part_8','Q7_Part_9', 'Q7_Part_10',
                    'Q7_Part_11', 'Q7_Part_12', 'Q7_OTHER']
Q7_plot_labels = ['Python', 'R', 'SQL', 'C',
                   'C++', 'Java', 'Javascript', 'Julia', 'Swift',
                   'Bash', 'MATLAB','None', 'Other' ]
Q7_plot_counts = [0]*len(Q7_plot_labels)
for idx, col in enumerate(Q7_plot_columns):
    # Each column holds a single answer value or NaN, so count() is the number of selections
    Q7_plot_counts[idx] += int(survey_df[col].count())
print(Q7_plot_counts)
print(Q7_plot_labels)

In [38]:
plt.bar(Q7_plot_labels, Q7_plot_counts)
plt.xticks(rotation=90)
plt.show()

We can see that the most frequently used programming language by the respondents is Python by a long shot. SQL comes a distant second. This makes sense because Python and SQL are the fundamental skills needed to break into data science. 
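
The counting loop for multi-select questions depends on a hand-written column list and a hand-written label list staying in sync. A possible alternative, sketched below on made-up data, reads the label from each column instead. `count_multiselect` is a hypothetical helper, and it assumes every option column contains exactly one distinct non-null value, which is how the `unique()` printouts above suggest the survey columns are laid out.

```python
import pandas as pd

# Made-up multi-select data: one column per option, NaN when not selected.
toy_df = pd.DataFrame({
    'Q7_Part_1': ['Python', 'Python', None, 'Python'],
    'Q7_Part_2': ['R', None, None, None],
    'Q7_Part_3': [None, 'SQL', 'SQL', None],
    'Q7_OTHER': [None, None, None, None],
    'Q8': ['Python', 'R', 'SQL', 'Python'],
})

def count_multiselect(df, prefix):
    """Count selections per option for one multi-select question.

    The label is taken from the data (the value stored in the column),
    falling back to the column name when nobody selected the option.
    """
    counts = {}
    for col in df.columns:
        if not col.startswith(prefix + '_'):
            continue  # skip columns that belong to other questions
        answers = df[col].dropna()
        label = answers.iloc[0].strip() if len(answers) else col
        counts[label] = len(answers)
    return pd.Series(counts, dtype='int64')

q7_counts = count_multiselect(toy_df, 'Q7')
print(q7_counts)
```

The resulting Series can be passed to `plt.bar(q7_counts.index, q7_counts.values)`, and the same helper would work unchanged for the India and USA subsets.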

In [39]:
Q7_plot_counts = [0]*len(Q7_plot_labels)
for col in Q7_plot_columns:
    idx = Q7_plot_columns.index(col)
    Q7_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q7_plot_labels, Q7_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [40]:
Q7_plot_counts = [0]*len(Q7_plot_labels)
for col in Q7_plot_columns:
    idx = Q7_plot_columns.index(col)
    Q7_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q7_plot_labels, Q7_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

C and C++ seem to be more popular in India than in the USA. 

# Primary programming language recommended to aspiring data scientists 

In [41]:
print(questions[columns.index('Q8')])
print(survey_df['Q8'].unique())

In [42]:
parameter = 'Q8'
sns.countplot(survey_df[parameter], order = ['Python', 'Julia', 'R', 'C', 'Java', 'SQL', 'C++',
                                            'None', 'MATLAB', 'Javascript','Other', 'Swift', 'Bash'])
plt.xticks(rotation=90)
plt.show()

Python is the clear winner in this category. Most Kagglers recommend that data science beginners start with Python, which is, hands down, the most important building block in one's data science career. 

In [43]:
parameter = 'Q8'
sns.countplot(india_df[parameter], order = ['Python', 'Julia', 'R', 'C', 'Java', 'SQL', 'C++',
                                            'None', 'MATLAB', 'Javascript','Other', 'Swift', 'Bash'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [44]:
parameter = 'Q8'
sns.countplot(usa_df[parameter], order = ['Python', 'Julia', 'R', 'C', 'Java', 'SQL', 'C++',
                                            'None', 'MATLAB', 'Javascript','Other', 'Swift', 'Bash'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Python is the overwhelming favorite in both India and the USA.

# Development environments used by Kagglers

In [45]:
print(questions[columns.index('Q9_OTHER')])
print(survey_df['Q9_Part_1'].unique())
print(survey_df['Q9_Part_2'].unique())
print(survey_df['Q9_Part_3'].unique())
print(survey_df['Q9_Part_4'].unique())
print(survey_df['Q9_Part_5'].unique())
print(survey_df['Q9_Part_6'].unique())
print(survey_df['Q9_Part_7'].unique())
print(survey_df['Q9_Part_8'].unique())
print(survey_df['Q9_Part_9'].unique())
print(survey_df['Q9_Part_10'].unique())
print(survey_df['Q9_Part_11'].unique())
print(survey_df['Q9_OTHER'].unique())

In [46]:
Q9_plot_columns = ['Q9_Part_1', 'Q9_Part_2', 'Q9_Part_3', 'Q9_Part_4', 'Q9_Part_5', 'Q9_Part_6', 'Q9_Part_7', 'Q9_Part_8',
               'Q9_Part_9', 'Q9_Part_10', 'Q9_Part_11', 'Q9_OTHER']
Q9_plot_labels = ['JupyterLab/Notebook', 'RStudio', 'Visual Studio', 'VSCode', 'PyCharm','Spyder','Notepad++', 'Sublime text', 
               'Vim/Emacs', 'Other' ]
Q9_plot_counts = [0]*len(Q9_plot_labels)
for idx, col in enumerate(Q9_plot_columns):
    # Q9_Part_11 is merged into the first label (JupyterLab/Notebook)
    if col == 'Q9_Part_11':
        idx = 0
    # Q9_OTHER goes to the last label; note that Q9_Part_10 has no label of its own
    # and therefore also lands on index 9 ('Other')
    if col == 'Q9_OTHER':
        idx = 9
    # Each column holds a single answer value or NaN, so count() is the number of selections
    Q9_plot_counts[idx] += int(survey_df[col].count())

print(Q9_plot_counts)
print(Q9_plot_labels)

In [47]:
plt.bar(Q9_plot_labels, Q9_plot_counts)
plt.xticks(rotation=90)
plt.show()

Most Kagglers seem to prefer JupyterLab/notebook as their primary IDE. This makes sense because data analysis is easier in a notebook-style environment. VSCode and PyCharm are not far behind, possibly used for projects and Kaggle competitions. 

In [48]:
Q9_plot_counts = [0]*len(Q9_plot_labels)
for col in Q9_plot_columns:
    idx = Q9_plot_columns.index(col)
    if col in ['Q9_Part_11']:
        idx = 0
    if col in ['Q9_OTHER']:
        idx = 9
    Q9_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q9_plot_labels, Q9_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [49]:
Q9_plot_counts = [0]*len(Q9_plot_labels)
for col in Q9_plot_columns:
    idx = Q9_plot_columns.index(col)
    if col in ['Q9_Part_11']:
        idx = 0
    if col in ['Q9_OTHER']:
        idx = 9
    Q9_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q9_plot_labels, Q9_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

JupyterLab/Notebook is the most popular IDE in both India and the USA.

# Preferred online notebook environment

In [50]:
print(questions[columns.index('Q10_OTHER')])
print(survey_df['Q10_Part_1'].unique())
print(survey_df['Q10_Part_2'].unique())
print(survey_df['Q10_Part_3'].unique())
print(survey_df['Q10_Part_4'].unique())
print(survey_df['Q10_Part_5'].unique())
print(survey_df['Q10_Part_6'].unique())
print(survey_df['Q10_Part_7'].unique())
print(survey_df['Q10_Part_8'].unique())
print(survey_df['Q10_Part_9'].unique())
print(survey_df['Q10_Part_10'].unique())
print(survey_df['Q10_Part_11'].unique())
print(survey_df['Q10_Part_12'].unique())
print(survey_df['Q10_Part_13'].unique())
print(survey_df['Q10_Part_14'].unique())
print(survey_df['Q10_Part_15'].unique())
print(survey_df['Q10_Part_16'].unique())
print(survey_df['Q10_OTHER'].unique())

In [51]:
Q10_plot_columns = ['Q10_Part_1', 'Q10_Part_2', 'Q10_Part_3', 'Q10_Part_4', 'Q10_Part_5', 'Q10_Part_6', 'Q10_Part_7',
                    'Q10_Part_8','Q10_Part_9', 'Q10_Part_10', 'Q10_Part_11', 'Q10_Part_12', 'Q10_Part_13', 'Q10_Part_14',
                    'Q10_Part_15', 'Q10_Part_16', 'Q10_OTHER']
Q10_plot_labels = ['Kaggle Notebooks', 'Colab Notebooks', 'Azure Notebooks', 'Paperspace/Gradient', 'Binder/JupyterHub', 
                   'Code Ocean','IBM Watson Studio','Amazon Sagemaker Studio Notebooks', 'Amazon EMR Notebooks',
                   'Google Cloud Notebooks','Google Cloud Datalab', 'Databricks Collaborative Notebooks',
                   'Zeppelin / Zepl Notebooks ','Deepnote Notebooks','Observable Notebooks', 'None', 'Other' ]
Q10_plot_counts = [0]*len(Q10_plot_labels)
for col in Q10_plot_columns:
    idx = Q10_plot_columns.index(col)
    Q10_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q10_plot_counts)
print(Q10_plot_labels)

In [52]:
plt.bar(Q10_plot_labels, Q10_plot_counts)
plt.xticks(rotation=90)
plt.show()

Not surprisingly, Kaggle Notebooks are among the favourite online notebook environments of Kagglers! Apart from Kaggle, the most popular option is Google's Colab notebooks, which many ML researchers also use. 

In [53]:
Q10_plot_counts = [0]*len(Q10_plot_labels)
for col in Q10_plot_columns:
    idx = Q10_plot_columns.index(col)
    Q10_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q10_plot_labels, Q10_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [54]:
Q10_plot_counts = [0]*len(Q10_plot_labels)
for col in Q10_plot_columns:
    idx = Q10_plot_columns.index(col)
    Q10_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q10_plot_labels, Q10_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The trend seems to be similar in both India and the USA. 

# Computing platform used for data science projects

In [55]:
print(questions[columns.index('Q11')])
print(survey_df['Q11'].unique())

In [56]:
parameter = 'Q11'
sns.countplot(survey_df[parameter],
             order = ['A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)',
 'A laptop', 'A personal computer / desktop' ,
 'A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)', 'Other', 'None'])
plt.xticks(rotation=90)
plt.show()

Most Kagglers use a laptop for their personal projects. Very few use cloud computing platforms, presumably because most of the respondents are students. 

In [57]:
parameter = 'Q11'
sns.countplot(india_df[parameter],
             order = ['A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)',
 'A laptop', 'A personal computer / desktop' ,
 'A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)', 'Other', 'None'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [58]:
parameter = 'Q11'
sns.countplot(usa_df[parameter],
             order = ['A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)',
 'A laptop', 'A personal computer / desktop' ,
 'A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)', 'Other', 'None'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

A larger proportion of Kagglers from USA work on their PCs as opposed to India. 

In [59]:
pd.crosstab([survey_df.Q11],[survey_df.Q6],margins=True).style.background_gradient(cmap='summer_r')

In [60]:
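column_names = ['<1 year', '1-3 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']
ed_values_cloud = [215, 554, 458, 482, 364, 255]
ed_values_all = [5841, 7774, 4023, 3083, 2154, 1845]
# Overall share computed from the per-band counts, so the total always matches the bands
# (the previously hard-coded total of 24270 was smaller than the sum of the bands, 24720)
cloud_percentage = sum(ed_values_cloud) / sum(ed_values_all) * 100
print(f'Percentage of cloud platform users: {cloud_percentage:.2f}%')
# Share of cloud platform users within each experience band
res = [f'{i / j:.2%}' for i, j in zip(ed_values_cloud, ed_values_all)]
res = [(i,j) for i, j in zip(column_names, res)]
for (a,b) in res:
    print(a,b)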

The overall percentage of respondents who use cloud platforms is low (under 10%). However, the share rises with coding experience, from under 4% of those with less than a year of experience to over 11% in every band above 3 years.

# Specialized hardware used for ML

In [61]:
print(questions[columns.index('Q12_OTHER')])
print(survey_df['Q12_Part_1'].unique())
print(survey_df['Q12_Part_2'].unique())
print(survey_df['Q12_Part_3'].unique())
print(survey_df['Q12_Part_4'].unique())
print(survey_df['Q12_Part_5'].unique())
print(survey_df['Q12_OTHER'].unique())

In [62]:
Q12_plot_columns = ['Q12_Part_1', 'Q12_Part_2', 'Q12_Part_3', 'Q12_Part_4', 'Q12_Part_5', 'Q12_OTHER']
Q12_plot_labels = ['NVIDIA GPUs', 'Google Cloud TPUs', 'AWS Trainium Chips',
                   'AWS Inferentia Chips', 'None', 'Other' ]
Q12_plot_counts = [0]*len(Q12_plot_labels)
for col in Q12_plot_columns:
    idx = Q12_plot_columns.index(col)
    Q12_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q12_plot_counts)
print(Q12_plot_labels)

In [63]:
plt.bar(Q12_plot_labels, Q12_plot_counts)
plt.xticks(rotation=90)
plt.show()

Here we can see that most of the Kagglers do not use specialized hardware, presumably making use of Kaggle or Colab GPUs. However, amongst the ones who do, NVIDIA GPUs seem to be the most popular choice. 

In [64]:
Q12_plot_counts = [0]*len(Q12_plot_labels)
for col in Q12_plot_columns:
    idx = Q12_plot_columns.index(col)
    Q12_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q12_plot_labels, Q12_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [65]:
Q12_plot_counts = [0]*len(Q12_plot_labels)
for col in Q12_plot_columns:
    idx = Q12_plot_columns.index(col)
    Q12_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q12_plot_labels, Q12_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

A larger proportion of Kagglers from India use Google Cloud TPUs as opposed to those from the USA.

# Usage of TPUs

In [66]:
print(questions[columns.index('Q13')])
print(survey_df['Q13'].unique())

In [67]:
parameter = 'Q13'
sns.countplot(survey_df[parameter],
             order = ['Never','Once','2-5 times', '6-25 times' , 'More than 25 times'  ])
plt.xticks(rotation=90)
plt.show()

From the above plot, we can see that most Kagglers haven't made use of TPUs for training their models. Very few have used them six or more times. 

In [68]:
parameter = 'Q13'
sns.countplot(india_df[parameter],
             order = ['Never','Once','2-5 times', '6-25 times' , 'More than 25 times'  ])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [69]:
parameter = 'Q13'
sns.countplot(usa_df[parameter],
             order = ['Never','Once','2-5 times', '6-25 times' , 'More than 25 times'  ])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The trend seems to be similar in both the countries. 

In [70]:
pd.crosstab([survey_df.Q13],[survey_df.Q6],margins=True).style.background_gradient(cmap='summer_r')

In [71]:
most_times_percentage = (612/24403)*100
print(f'Percentage of respondents who used a TPU more than 25 times: {most_times_percentage:.2f}%')
column_names = ['<1 year', '1-3 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']
# 'More than 25 times' counts per experience band (variable name reused from the cloud cell)
ed_values_cloud = [49, 154, 122, 100, 86, 101]
ed_values_all = [5742, 7645, 3977, 3062, 2134, 1843]
res = [f'{i / j:.2%}' for i, j in zip(ed_values_cloud, ed_values_all)]
res = [(i,j) for i, j in zip(column_names, res)]
for (a,b) in res:
    print(a,b)

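We can see that the share of frequent TPU users rises steadily with coding experience, from under 1% of those with less than a year of experience to about 5.5% of those with 20+ years.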
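The counts in the cell above were copied by hand from the crosstab. As a minimal sketch, the same column-wise shares could be computed directly with `pd.crosstab(..., normalize='columns')`, assuming a DataFrame with `Q6` and `Q13` columns like `survey_df`. The tiny `toy_df` below is made-up data so that the example runs on its own.

```python
import pandas as pd

# Made-up stand-in for survey_df, only so the sketch runs on its own.
toy_df = pd.DataFrame({
    'Q6':  ['< 1 years', '< 1 years', '< 1 years', '< 1 years',
            '20+ years', '20+ years'],
    'Q13': ['Never', 'Never', 'Never', 'More than 25 times',
            'Never', 'More than 25 times'],
})

# normalize='columns' divides each cell by its column total, so every
# experience group sums to 1 and groups of different sizes are comparable.
shares = pd.crosstab(toy_df['Q13'], toy_df['Q6'], normalize='columns')

# Row of interest: share of each experience group with the heaviest TPU use.
heavy_use = (shares.loc['More than 25 times'] * 100).round(2)
print(heavy_use)
```

Working from the crosstab avoids transcription slips and keeps the percentages in sync if the data is filtered differently.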

In [72]:
pd.crosstab([survey_df.Q13],[survey_df.Q5],margins=True).style.background_gradient(cmap='summer_r')

In [73]:
most_times_percentage = (612/24403)*100
print('Percentage of most frequent TPU users: ' + str(round(most_times_percentage, 2)) + '%')
column_names = ['Business Analyst', 'Currently not employed', 'DBA/Database Engineer',
                'Data Analyst', 'Data Engineer', 'Data Scientist', 'Developer Relations/Advocacy',
                'Machine Learning Engineer', 'Other', 'Product Manager', 'Program/Project Manager',
                'Research Scientist', 'Software Engineer', 'Statistician', 'Student']
# Per-role counts from the crosstab above; the trailing 'All' total (612) is left out
# so that this list lines up with column_names and ed_values_all
ed_values_most = [15, 32, 11, 31, 12, 136, 6, 85, 33, 7, 28, 59, 47, 7, 103]
ed_values_all = [829, 1926, 158, 2134, 651, 3491, 90, 1429, 2057, 274, 777,
                 1440, 2389, 287, 6471]
# Multiply by 100 before rounding so the printed values have no floating-point tails
res = [str(round(i / j * 100, 2)) + '%' for i, j in zip(ed_values_most, ed_values_all)]
res = [(i,j) for i, j in zip(column_names, res)]
for (a,b) in res:
    print(a,b)

Machine Learning Engineers, Data Scientists, DBA/Database Engineers, Developer Relations/Advocacy professionals and Research Scientists seem to be the Kagglers who make use of TPUs the most.
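These rates rest on very different group sizes: only 158 DBA/Database Engineers and 90 Developer Relations/Advocacy respondents, against 3,491 Data Scientists. The sketch below gives a rough sense of the uncertainty this implies, using the normal-approximation standard error of a proportion, sqrt(p(1-p)/n). This check is an illustrative addition and not part of the original analysis; `rate_and_stderr` is a hypothetical helper.

```python
import math

# (most frequent TPU users, all respondents) per role, copied from the cell above.
groups = {
    'DBA/Database Engineer': (11, 158),
    'Developer Relations/Advocacy': (6, 90),
    'Machine Learning Engineer': (85, 1429),
    'Data Scientist': (136, 3491),
}

def rate_and_stderr(hits, total):
    """Return the percentage rate and its approximate standard error."""
    p = hits / total
    stderr = math.sqrt(p * (1 - p) / total)
    return p * 100, stderr * 100

for role, (hits, total) in groups.items():
    rate, err = rate_and_stderr(hits, total)
    print(f'{role}: {rate:.2f}% +/- {err:.2f} percentage points')
```

The two small groups carry a standard error several times larger than the Data Scientist group, so their place in the ranking should be read with more caution.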

# Data visualization libraries used

In [74]:
print(questions[columns.index('Q14_OTHER')])
print(survey_df['Q14_Part_1'].unique())
print(survey_df['Q14_Part_2'].unique())
print(survey_df['Q14_Part_3'].unique())
print(survey_df['Q14_Part_4'].unique())
print(survey_df['Q14_Part_5'].unique())
print(survey_df['Q14_Part_6'].unique())
print(survey_df['Q14_Part_7'].unique())
print(survey_df['Q14_Part_8'].unique())
print(survey_df['Q14_Part_9'].unique())
print(survey_df['Q14_Part_10'].unique())
print(survey_df['Q14_Part_11'].unique())
print(survey_df['Q14_OTHER'].unique())

In [75]:
Q14_plot_columns = ['Q14_Part_1', 'Q14_Part_2', 'Q14_Part_3', 'Q14_Part_4', 'Q14_Part_5', 'Q14_Part_6',
                    'Q14_Part_7','Q14_Part_8','Q14_Part_9', 'Q14_Part_10', 'Q14_Part_11', 'Q14_OTHER']
Q14_plot_labels = ['Matplotlib', 'Seaborn', 'Plotly', 'Ggplot', 'Shiny', 'D3 js', 'Altair', 'Bokeh',
                   'Geoplotlib', 'Leaflet/Folium','None', 'Other' ]
Q14_plot_counts = [0]*len(Q14_plot_labels)
for idx, col in enumerate(Q14_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q14_plot_counts[idx] += int(survey_df[col].count())
    #print(idx, int(survey_df[col].value_counts()))
print(Q14_plot_counts)
print(Q14_plot_labels)

In [76]:
plt.bar(Q14_plot_labels, Q14_plot_counts)
plt.xticks(rotation=90)
plt.show()

Matplotlib seems to be the favorite for data viz amongst Kagglers. Seaborn and Plotly also take podium positions.

In [77]:
Q14_plot_counts = [0]*len(Q14_plot_labels)
for idx, col in enumerate(Q14_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q14_plot_counts[idx] += int(india_df[col].count())
plt.bar(Q14_plot_labels, Q14_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [78]:
Q14_plot_counts = [0]*len(Q14_plot_labels)
for idx, col in enumerate(Q14_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q14_plot_counts[idx] += int(usa_df[col].count())
plt.bar(Q14_plot_labels, Q14_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Ggplot seems to be more commonly used in the USA than in India. 
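The India and USA bar charts show raw counts, and India has nearly three times as many respondents (28.62% vs 10.20% of all), so bar heights are not directly comparable between the two plots. A minimal sketch of turning counts into the share of each country's respondents follows. `selection_share` is a hypothetical helper, and the two toy frames are made-up stand-ins for `india_df` and `usa_df`.

```python
import pandas as pd

def selection_share(df, columns, labels):
    """Percentage of rows in df with a non-null answer in each column."""
    # Multi-select columns hold the option text when ticked and NaN otherwise,
    # so count() gives the number of respondents who selected the option.
    counts = df[columns].count()
    share = counts / len(df) * 100
    share.index = labels
    return share.round(1)

# Made-up stand-ins for india_df and usa_df (not survey data).
toy_india = pd.DataFrame({
    'Q14_Part_1': ['Matplotlib', 'Matplotlib', 'Matplotlib', None],
    'Q14_Part_4': ['Ggplot / ggplot2', None, None, None],
})
toy_usa = pd.DataFrame({
    'Q14_Part_1': ['Matplotlib', None],
    'Q14_Part_4': ['Ggplot / ggplot2', None],
})

cols = ['Q14_Part_1', 'Q14_Part_4']
labels = ['Matplotlib', 'Ggplot']
comparison = pd.DataFrame({
    'India %': selection_share(toy_india, cols, labels),
    'USA %': selection_share(toy_usa, cols, labels),
})
print(comparison)
```

Note that the denominator here is every row in the frame, including respondents who may not have been shown the question. Dividing by the number of respondents with at least one answer would be an equally reasonable choice.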

# Years of experience with ML methods

In [79]:
print(questions[columns.index('Q15')])
print(survey_df['Q15'].unique())

In [80]:
parameter = 'Q15'
sns.countplot(x=survey_df[parameter],
              order = ['I do not use machine learning methods', 'Under 1 year', '1-2 years',
                      '2-3 years', '3-4 years', '4-5 years', '5-10 years', '10-20 years',
                      '20 or more years'])
plt.xticks(rotation=90)
plt.show()

Most Kagglers seem to be new to machine learning: the most common answer is less than a year of exposure to ML, and a majority of the respondents have less than 2 years of experience with ML methods.

In [81]:
parameter = 'Q15'
sns.countplot(x=india_df[parameter],
              order = ['I do not use machine learning methods', 'Under 1 year', '1-2 years',
                      '2-3 years', '3-4 years', '4-5 years', '5-10 years', '10-20 years',
                      '20 or more years'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [82]:
parameter = 'Q15'
sns.countplot(x=usa_df[parameter],
              order = ['I do not use machine learning methods', 'Under 1 year', '1-2 years',
                      '2-3 years', '3-4 years', '4-5 years', '5-10 years', '10-20 years',
                      '20 or more years'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Kagglers from the USA seem to have more ML experience than Kagglers from India in general. 

In [83]:
pd.crosstab([survey_df.Q15],[survey_df.Q5],margins=True).style.background_gradient(cmap='summer_r')

# ML frameworks used regularly

In [84]:
print(questions[columns.index('Q16_OTHER')])
print(survey_df['Q16_Part_1'].unique())
print(survey_df['Q16_Part_2'].unique())
print(survey_df['Q16_Part_3'].unique())
print(survey_df['Q16_Part_4'].unique())
print(survey_df['Q16_Part_5'].unique())
print(survey_df['Q16_Part_6'].unique())
print(survey_df['Q16_Part_7'].unique())
print(survey_df['Q16_Part_8'].unique())
print(survey_df['Q16_Part_9'].unique())
print(survey_df['Q16_Part_10'].unique())
print(survey_df['Q16_Part_11'].unique())
print(survey_df['Q16_Part_12'].unique())
print(survey_df['Q16_Part_13'].unique())
print(survey_df['Q16_Part_14'].unique())
print(survey_df['Q16_Part_15'].unique())
print(survey_df['Q16_Part_16'].unique())
print(survey_df['Q16_Part_17'].unique())
print(survey_df['Q16_OTHER'].unique())

In [85]:
Q16_plot_columns = ['Q16_Part_1', 'Q16_Part_2', 'Q16_Part_3', 'Q16_Part_4', 'Q16_Part_5', 'Q16_Part_6', 'Q16_Part_7',
                    'Q16_Part_8','Q16_Part_9', 'Q16_Part_10', 'Q16_Part_11', 'Q16_Part_12', 
                    'Q16_Part_13', 'Q16_Part_14','Q16_Part_15', 'Q16_Part_16', 'Q16_Part_17', 
                    'Q16_OTHER']
Q16_plot_labels = ['Scikit-learn', 'Tensorflow', 'Keras', 'Pytorch', 'Fast.ai', 'MXNet', 'Xgboost',
                   'LightGBM', 'CatBoost', 'Prophet', 'H2O 3', 'Caret', 'Tidymodels', 'JAX',
                   'PyTorch Lightning', 'Huggingface','None', 'Other' ]
Q16_plot_counts = [0]*len(Q16_plot_labels)
for idx, col in enumerate(Q16_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q16_plot_counts[idx] += int(survey_df[col].count())
    #print(idx, int(survey_df[col].value_counts()))
print(Q16_plot_counts)
print(Q16_plot_labels)

In [86]:
plt.bar(Q16_plot_labels, Q16_plot_counts)
plt.xticks(rotation=90)
plt.show()

The most commonly used ML frameworks are Scikit-learn, Tensorflow, Keras, Pytorch and Xgboost. 

In [87]:
Q16_plot_counts = [0]*len(Q16_plot_labels)
for idx, col in enumerate(Q16_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q16_plot_counts[idx] += int(india_df[col].count())
plt.bar(Q16_plot_labels, Q16_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [88]:
Q16_plot_counts = [0]*len(Q16_plot_labels)
for idx, col in enumerate(Q16_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q16_plot_counts[idx] += int(usa_df[col].count())
plt.bar(Q16_plot_labels, Q16_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The distributions seem to be similar in both countries. 

# Popular ML algorithms used regularly

In [89]:
print(questions[columns.index('Q17_OTHER')])
print(survey_df['Q17_Part_1'].unique())
print(survey_df['Q17_Part_2'].unique())
print(survey_df['Q17_Part_3'].unique())
print(survey_df['Q17_Part_4'].unique())
print(survey_df['Q17_Part_5'].unique())
print(survey_df['Q17_Part_6'].unique())
print(survey_df['Q17_Part_7'].unique())
print(survey_df['Q17_Part_8'].unique())
print(survey_df['Q17_Part_9'].unique())
print(survey_df['Q17_Part_10'].unique())
print(survey_df['Q17_Part_11'].unique())
print(survey_df['Q17_OTHER'].unique())

In [90]:
Q17_plot_columns = ['Q17_Part_1', 'Q17_Part_2', 'Q17_Part_3', 'Q17_Part_4', 'Q17_Part_5', 'Q17_Part_6', 'Q17_Part_7',
                    'Q17_Part_8','Q17_Part_9', 'Q17_Part_10', 'Q17_Part_11', 'Q17_OTHER']
Q17_plot_labels = ['Linear/Logistic Regression', 'Decision trees/Random forests', 'GBMs', 'Bayesian approaches',
                   'Evolutionary approaches', 'Dense Neural Networks', 'CNNs', 'GANs', 'RNNs', 'Transformers',
                   'None', 'Other' ]
Q17_plot_counts = [0]*len(Q17_plot_labels)
for idx, col in enumerate(Q17_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q17_plot_counts[idx] += int(survey_df[col].count())
    #print(idx, int(survey_df[col].value_counts()))
print(Q17_plot_counts)
print(Q17_plot_labels)

In [91]:
plt.bar(Q17_plot_labels, Q17_plot_counts)
plt.xticks(rotation=90)
plt.show()

The most commonly used algorithms seem to be linear/logistic regression and decision trees/random forests. GBMs and CNNs are in third and fourth position respectively. In line with the earlier analysis, most Kagglers seem to be beginners and hence opt for classical ML as opposed to more complex, newer approaches such as GANs or Transformers.
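The link between experience and algorithm choice is inferred here from two separate plots. A rough sketch of checking it directly is to group respondents by ML experience (`Q15`) and compute the share of each group that ticked a given algorithm. The rows and option texts below are made up for illustration, and only two of the `Q17` columns are used.

```python
import pandas as pd

# Made-up rows; real columns would be Q15 (ML experience) and the Q17 parts.
toy_df = pd.DataFrame({
    'Q15': ['Under 1 year', 'Under 1 year', 'Under 1 year', 'Under 1 year',
            '5-10 years', '5-10 years'],
    'Q17_Part_1': ['Linear or Logistic Regression',
                   'Linear or Logistic Regression',
                   'Linear or Logistic Regression',
                   None,
                   'Linear or Logistic Regression',
                   'Linear or Logistic Regression'],
    'Q17_Part_10': [None, None, None, None, 'Transformer Networks', None],
})

algo_cols = ['Q17_Part_1', 'Q17_Part_10']

# notna() marks a ticked option as True; the group mean of those booleans is
# the fraction of each experience group that selected the algorithm.
usage_by_experience = (
    toy_df[algo_cols].notna()
    .groupby(toy_df['Q15'])
    .mean()
    .mul(100)
    .round(1)
)
print(usage_by_experience)
```

If beginners really drive the preference for classical methods, the Transformer column should climb with experience while the regression column stays high throughout.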

In [92]:
Q17_plot_counts = [0]*len(Q17_plot_labels)
for idx, col in enumerate(Q17_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q17_plot_counts[idx] += int(india_df[col].count())
plt.bar(Q17_plot_labels, Q17_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [93]:
Q17_plot_counts = [0]*len(Q17_plot_labels)
for idx, col in enumerate(Q17_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q17_plot_counts[idx] += int(usa_df[col].count())
plt.bar(Q17_plot_labels, Q17_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The distribution seems to be similar in both countries.

In [94]:
pd.crosstab([survey_df.Q17_Part_1],[survey_df.Q5],margins=True).style.background_gradient(cmap='summer_r')

In [95]:
pd.crosstab([survey_df.Q17_Part_8],[survey_df.Q5],margins=True).style.background_gradient(cmap='summer_r')

# Computer vision methods used regularly

In [96]:
print(questions[columns.index('Q18_OTHER')])
print(survey_df['Q18_Part_1'].unique())
print(survey_df['Q18_Part_2'].unique())
print(survey_df['Q18_Part_3'].unique())
print(survey_df['Q18_Part_4'].unique())
print(survey_df['Q18_Part_5'].unique())
print(survey_df['Q18_Part_6'].unique())
print(survey_df['Q18_OTHER'].unique())

In [97]:
Q18_plot_columns = ['Q18_Part_1', 'Q18_Part_2', 'Q18_Part_3', 'Q18_Part_4', 'Q18_Part_5', 'Q18_Part_6',
                    'Q18_OTHER']
Q18_plot_labels = ['PIL/cv2/skimage', 'U-Net/Mask R-CNN', 'YOLOv3/RetinaNet', 'VGG/Inception/ResNet',
                   'GAN/VAE', 'None', 'Other' ]
Q18_plot_counts = [0]*len(Q18_plot_labels)
for idx, col in enumerate(Q18_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q18_plot_counts[idx] += int(survey_df[col].count())
    #print(idx, int(survey_df[col].value_counts()))
print(Q18_plot_counts)
print(Q18_plot_labels)

In [98]:
plt.bar(Q18_plot_labels, Q18_plot_counts)
plt.xticks(rotation=90)
plt.show()

Image classification using architectures like VGG/Inception/ResNet seems to be the most common CV application among Kagglers. Object detection, image processing and image segmentation are also popular.

In [99]:
Q18_plot_counts = [0]*len(Q18_plot_labels)
for idx, col in enumerate(Q18_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q18_plot_counts[idx] += int(india_df[col].count())
plt.bar(Q18_plot_labels, Q18_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [100]:
Q18_plot_counts = [0]*len(Q18_plot_labels)
for idx, col in enumerate(Q18_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q18_plot_counts[idx] += int(usa_df[col].count())
plt.bar(Q18_plot_labels, Q18_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The distribution is similar in both countries.

# NLP methods used regularly

In [101]:
print(questions[columns.index('Q19_OTHER')])
print(survey_df['Q19_Part_1'].unique())
print(survey_df['Q19_Part_2'].unique())
print(survey_df['Q19_Part_3'].unique())
print(survey_df['Q19_Part_4'].unique())
print(survey_df['Q19_Part_5'].unique())
print(survey_df['Q19_OTHER'].unique())

In [102]:
Q19_plot_columns = ['Q19_Part_1', 'Q19_Part_2', 'Q19_Part_3', 'Q19_Part_4', 'Q19_Part_5',
                    'Q19_OTHER']
Q19_plot_labels = ['GLoVe/fastText/word2vec', 'seq2seq/vanilla transformers', 'ELMo/CoVe', 
                    'GPT-3/BERT/XLnet','None', 'Other' ]
Q19_plot_counts = [0]*len(Q19_plot_labels)
for idx, col in enumerate(Q19_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q19_plot_counts[idx] += int(survey_df[col].count())
    #print(idx, int(survey_df[col].value_counts()))
print(Q19_plot_counts)
print(Q19_plot_labels)

In [103]:
plt.bar(Q19_plot_labels, Q19_plot_counts)
plt.xticks(rotation=90)
plt.show()

Word embeddings like GLoVe/word2vec and transformer language models like GPT-3/BERT seem to be the most commonly used NLP methods among Kagglers.

In [104]:
Q19_plot_counts = [0]*len(Q19_plot_labels)
for idx, col in enumerate(Q19_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q19_plot_counts[idx] += int(india_df[col].count())
plt.bar(Q19_plot_labels, Q19_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [105]:
Q19_plot_counts = [0]*len(Q19_plot_labels)
for idx, col in enumerate(Q19_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q19_plot_counts[idx] += int(usa_df[col].count())
plt.bar(Q19_plot_labels, Q19_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The trends are similar in both India and the USA.

# Distribution of industry of current job

In [106]:
print(questions[columns.index('Q20')])
print(survey_df['Q20'].unique())

In [107]:
parameter = 'Q20'
sns.countplot(x=survey_df[parameter],
             order = ['Computers/Technology', 'Medical/Pharmaceutical', 'Other',
 'Academics/Education', 'Broadcasting/Communications', 'Marketing/CRM',
 'Online Business/Internet-based Sales', 'Hospitality/Entertainment/Sports',
 'Accounting/Finance', 'Insurance/Risk Assessment',
 'Shipping/Transportation', 'Energy/Mining', 'Retail/Sales',
 'Online Service/Internet-based Services', 'Government/Public Service',
 'Manufacturing/Fabrication', 'Military/Security/Defense',
 'Non-profit/Service'])
plt.xticks(rotation=90)
plt.show()

Unsurprisingly, most Kagglers work in the Computers/Technology industry or in academia as students or researchers. Quite a few finance professionals also seem to be interested in data science, which is fascinating. Since data is the currency of the 21st century, we can expect people from more diverse industries to enter data science very soon.

In [108]:
parameter = 'Q20'
sns.countplot(x=india_df[parameter],
             order = ['Computers/Technology', 'Medical/Pharmaceutical', 'Other',
 'Academics/Education', 'Broadcasting/Communications', 'Marketing/CRM',
 'Online Business/Internet-based Sales', 'Hospitality/Entertainment/Sports',
 'Accounting/Finance', 'Insurance/Risk Assessment',
 'Shipping/Transportation', 'Energy/Mining', 'Retail/Sales',
 'Online Service/Internet-based Services', 'Government/Public Service',
 'Manufacturing/Fabrication', 'Military/Security/Defense',
 'Non-profit/Service'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [109]:
parameter = 'Q20'
sns.countplot(x=usa_df[parameter],
             order = ['Computers/Technology', 'Medical/Pharmaceutical', 'Other',
 'Academics/Education', 'Broadcasting/Communications', 'Marketing/CRM',
 'Online Business/Internet-based Sales', 'Hospitality/Entertainment/Sports',
 'Accounting/Finance', 'Insurance/Risk Assessment',
 'Shipping/Transportation', 'Energy/Mining', 'Retail/Sales',
 'Online Service/Internet-based Services', 'Government/Public Service',
 'Manufacturing/Fabrication', 'Military/Security/Defense',
 'Non-profit/Service'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The general trends seem to be the same in both countries.

# Distribution of size of current company

In [110]:
print(questions[columns.index('Q21')])
print(survey_df['Q21'].unique())

In [111]:
parameter = 'Q21'
sns.countplot(x=survey_df[parameter],
             order = [ '0-49 employees', '50-249 employees','250-999 employees','1000-9,999 employees',
 '10,000 or more employees' ])
plt.xticks(rotation=90)
plt.show()

There seem to be Kagglers from small and big companies alike, but what stands out is that many work at companies with fewer than 50 employees. This could be because students selected that option or because many respondents work at startups, with the former being more likely.

In [112]:
parameter = 'Q21'
sns.countplot(x=india_df[parameter],
             order = [ '0-49 employees', '50-249 employees','250-999 employees','1000-9,999 employees',
 '10,000 or more employees' ])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [113]:
parameter = 'Q21'
sns.countplot(x=usa_df[parameter],
             order = [ '0-49 employees', '50-249 employees','250-999 employees','1000-9,999 employees',
 '10,000 or more employees' ])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The trends seem to be similar in both countries.

# Data science team size distribution

In [114]:
print(questions[columns.index('Q22')])
print(survey_df['Q22'].unique())

In [115]:
parameter = 'Q22'
sns.countplot(x=survey_df[parameter], order = ['0','1-2','3-4','5-9','10-14','15-19','20+'])
plt.xticks(rotation=90)
plt.show()

The insight here is that respondents tend to be either the only data scientist on their team or part of a large team specializing in data science. Respondents from mid-sized data science teams are less common.

In [116]:
parameter = 'Q22'
sns.countplot(x=india_df[parameter], order = ['0','1-2','3-4','5-9','10-14','15-19','20+'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [117]:
parameter = 'Q22'
sns.countplot(x=usa_df[parameter], order = ['0','1-2','3-4','5-9','10-14','15-19','20+'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The distribution seems to be similar in both countries.

# ML methods' penetration into businesses

In [118]:
print(questions[columns.index('Q23')])
print(survey_df['Q23'].unique())

In [119]:
parameter = 'Q23'
sns.countplot(x=survey_df[parameter],
             order = [
 'We use ML methods for generating insights (but do not put working models into production)',
 'I do not know',
 'We recently started using ML methods (i.e., models in production for less than 2 years)',
 'We are exploring ML methods (and may one day put a model into production)',
 'No (we do not use ML methods)',
 'We have well established ML methods (i.e., models in production for more than 2 years)'])
plt.xticks(rotation=90)
plt.show()

Although there are a significant number of teams who have established production pipelines for ML, from the data it seems that most businesses are in the early stage of adoption or haven't considered ML yet. This makes sense because many traditional businesses are only now warming up to digitalization and machine learning. 

In [120]:
parameter = 'Q23'
sns.countplot(x=india_df[parameter],
             order = [
 'We use ML methods for generating insights (but do not put working models into production)',
 'I do not know',
 'We recently started using ML methods (i.e., models in production for less than 2 years)',
 'We are exploring ML methods (and may one day put a model into production)',
 'No (we do not use ML methods)',
 'We have well established ML methods (i.e., models in production for more than 2 years)'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [121]:
parameter = 'Q23'
sns.countplot(x=usa_df[parameter],
             order = [
 'We use ML methods for generating insights (but do not put working models into production)',
 'I do not know',
 'We recently started using ML methods (i.e., models in production for less than 2 years)',
 'We are exploring ML methods (and may one day put a model into production)',
 'No (we do not use ML methods)',
 'We have well established ML methods (i.e., models in production for more than 2 years)'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

According to the respondents, fewer companies in the USA than in India have been using ML for less than 2 years.

# Data science roles distribution 

In [122]:
print(questions[columns.index('Q24_OTHER')])
print(survey_df['Q24_Part_1'].unique())
print(survey_df['Q24_Part_2'].unique())
print(survey_df['Q24_Part_3'].unique())
print(survey_df['Q24_Part_4'].unique())
print(survey_df['Q24_Part_5'].unique())
print(survey_df['Q24_Part_6'].unique())
print(survey_df['Q24_Part_7'].unique())
print(survey_df['Q24_OTHER'].unique())

In [123]:
Q24_plot_columns = ['Q24_Part_1', 'Q24_Part_2', 'Q24_Part_3', 'Q24_Part_4', 'Q24_Part_5',
                    'Q24_Part_6', 'Q24_Part_7', 'Q24_OTHER']
Q24_plot_labels = ['Data analysis', 'Data infrastructure', 'Build prototype', 'ML service in product/workflow',
                    'Experimentation to improve ML models', 'ML research','None', 'Other' ]
Q24_plot_counts = [0]*len(Q24_plot_labels)
for idx, col in enumerate(Q24_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q24_plot_counts[idx] += int(survey_df[col].count())
    #print(idx, int(survey_df[col].value_counts()))
print(Q24_plot_counts)
print(Q24_plot_labels)

In [124]:
plt.bar(Q24_plot_labels, Q24_plot_counts)
plt.xticks(rotation=90)
plt.show()

Here we can see that most respondents work on data analysis to influence product or business decisions. Very few Kagglers are involved in core ML research.

In [125]:
Q24_plot_counts = [0]*len(Q24_plot_labels)
for idx, col in enumerate(Q24_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q24_plot_counts[idx] += int(india_df[col].count())
plt.bar(Q24_plot_labels, Q24_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [126]:
Q24_plot_counts = [0]*len(Q24_plot_labels)
for idx, col in enumerate(Q24_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q24_plot_counts[idx] += int(usa_df[col].count())
plt.bar(Q24_plot_labels, Q24_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The distribution seems to be similar in both countries.

# Salary distribution

In [127]:
print(questions[columns.index('Q25')])
print(survey_df['Q25'].unique())

In [128]:
parameter = 'Q25'
sns.countplot(x=survey_df[parameter],
              order=['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499',
                    '7,500-9,999', '10,000-14,999', '15,000-19,999', '20,000-24,999', '25,000-29,999',
                    '30,000-39,999', '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999',
                    '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999',
                    '150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-499,999',
                    '$500,000-999,999', '>$1,000,000'])
plt.xticks(rotation=90)
plt.show()

In [129]:
parameter = 'Q25'
sns.countplot(x=india_df[parameter],
              order=['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499',
                    '7,500-9,999', '10,000-14,999', '15,000-19,999', '20,000-24,999', '25,000-29,999',
                    '30,000-39,999', '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999',
                    '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999',
                    '150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-499,999',
                    '$500,000-999,999', '>$1,000,000'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [130]:
parameter = 'Q25'
sns.countplot(x=usa_df[parameter],
              order=['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499',
                    '7,500-9,999', '10,000-14,999', '15,000-19,999', '20,000-24,999', '25,000-29,999',
                    '30,000-39,999', '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999',
                    '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999',
                    '150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-499,999',
                    '$500,000-999,999', '>$1,000,000'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Most Kagglers are students and hence do not report an income, but among the remaining respondents we can notice two peaks. The peak at 10,000-14,999 USD likely reflects typical salaries in India and other developing nations, while the peak at 100,000-124,999 USD likely reflects typical salaries in the USA and other developed nations. Thus, we can see a clear divide in salary ranges between regions. At the extreme end of the spectrum, a few respondents report earning close to or more than a million USD per annum.
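The two peaks are read off the plots by eye. A minimal sketch of locating the most common bracket per group programmatically follows, optionally skipping the lowest bracket so that it does not mask the peak among earners. The `peak_bracket` helper, the `country` column name and the rows in `toy_df` are all hypothetical and only serve the illustration.

```python
import pandas as pd

def peak_bracket(salaries, skip=('$0-999',)):
    """Most frequent salary bracket, ignoring NaN and the skipped brackets."""
    kept = salaries.dropna()
    kept = kept[~kept.isin(skip)]
    return kept.value_counts().idxmax()

# Made-up rows for illustration; 'country' is a hypothetical column name.
toy_df = pd.DataFrame({
    'country': ['India'] * 4 + ['USA'] * 4,
    'Q25': ['$0-999', '$0-999', '10,000-14,999', '10,000-14,999',
            '100,000-124,999', '100,000-124,999', '$0-999', None],
})

# One peak bracket per country.
peaks = toy_df.groupby('country')['Q25'].apply(peak_bracket)
print(peaks)
```

On the real data the same call could be run on `india_df['Q25']` and `usa_df['Q25']` separately to confirm the brackets quoted in the text.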

# ML and cloud service expenses 

In [131]:
print(questions[columns.index('Q26')])
print(survey_df['Q26'].unique())

In [132]:
parameter = 'Q26'
sns.countplot(x=survey_df[parameter],
              order=['$0 ($USD)', '$1-$99','$100-$999', '$1000-$9,999', '$10,000-$99,999', '$100,000 or more ($USD)'])
plt.xticks(rotation=90)
plt.show()

Here we can see that most people do not spend money on cloud computing systems. This could be because many Kagglers are in the early stages of their data science journey, and also because of the availability of free platforms like Kaggle and Colab for their experiments. 

In [133]:
parameter = 'Q26'
sns.countplot(x=india_df[parameter],
              order=['$0 ($USD)', '$1-$99','$100-$999', '$1000-$9,999', '$10,000-$99,999', '$100,000 or more ($USD)'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [134]:
parameter = 'Q26'
sns.countplot(x=usa_df[parameter],
              order=['$0 ($USD)', '$1-$99','$100-$999', '$1000-$9,999', '$10,000-$99,999', '$100,000 or more ($USD)'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

We can see from the above plots that respondents from the USA tend to spend more money on ML infrastructure than their Indian counterparts. 

# Regularly used cloud computing platforms

In [135]:
print(questions[columns.index('Q27_A_OTHER')])
print(survey_df['Q27_A_Part_1'].unique())
print(survey_df['Q27_A_Part_2'].unique())
print(survey_df['Q27_A_Part_3'].unique())
print(survey_df['Q27_A_Part_4'].unique())
print(survey_df['Q27_A_Part_5'].unique())
print(survey_df['Q27_A_Part_6'].unique())
print(survey_df['Q27_A_Part_7'].unique())
print(survey_df['Q27_A_Part_8'].unique())
print(survey_df['Q27_A_Part_9'].unique())
print(survey_df['Q27_A_Part_10'].unique())
print(survey_df['Q27_A_Part_11'].unique())
print(survey_df['Q27_A_OTHER'].unique())

In [136]:
Q27_plot_columns = ['Q27_A_Part_1', 'Q27_A_Part_2', 'Q27_A_Part_3', 'Q27_A_Part_4', 'Q27_A_Part_5',
                    'Q27_A_Part_6', 'Q27_A_Part_7','Q27_A_Part_8','Q27_A_Part_9', 'Q27_A_Part_10',
                    'Q27_A_Part_11', 'Q27_A_OTHER']
Q27_plot_labels = ['AWS', 'Microsoft Azure', 'Google Cloud Platform', 'IBM Cloud/Red Hat',
                   'Oracle Cloud', 'SAP Cloud', 'Salesforce Cloud', 'VMWare Cloud', 
                   'Alibaba Cloud', 'Tencent Cloud','None', 'Other' ]
Q27_plot_counts = [0]*len(Q27_plot_labels)
for idx, col in enumerate(Q27_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q27_plot_counts[idx] += int(survey_df[col].count())
    #print(idx, int(survey_df[col].value_counts()))
print(Q27_plot_counts)
print(Q27_plot_labels)

In [137]:
plt.bar(Q27_plot_labels, Q27_plot_counts)
plt.xticks(rotation=90)
plt.show()

Across the board, we can see that AWS, Microsoft Azure and GCP are the preferred cloud platforms. 

In [138]:
Q27_plot_counts = [0]*len(Q27_plot_labels)
for idx, col in enumerate(Q27_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q27_plot_counts[idx] += int(india_df[col].count())
plt.bar(Q27_plot_labels, Q27_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [139]:
Q27_plot_counts = [0]*len(Q27_plot_labels)
for idx, col in enumerate(Q27_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q27_plot_counts[idx] += int(usa_df[col].count())
plt.bar(Q27_plot_labels, Q27_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The same trend holds for both India and USA.

# Cloud platform with best developer experience

In [140]:
print(questions[columns.index('Q28')])
print(survey_df['Q28'].unique())

In [141]:
parameter = 'Q28'
sns.countplot(x=survey_df[parameter])
plt.xticks(rotation=90)
plt.show()

Here we can see that users enjoy working with the top three cloud platforms, especially GCP and AWS.

In [142]:
parameter = 'Q28'
sns.countplot(x=india_df[parameter])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [143]:
parameter = 'Q28'
sns.countplot(x=usa_df[parameter])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Across the board, AWS seems to be the most popular option. 

# Regularly used cloud computing products

In [144]:
print(questions[columns.index('Q29_A_OTHER')])
print(survey_df['Q29_A_Part_1'].unique())
print(survey_df['Q29_A_Part_2'].unique())
print(survey_df['Q29_A_Part_3'].unique())
print(survey_df['Q29_A_Part_4'].unique())
print(survey_df['Q29_A_OTHER'].unique())

In [145]:
Q29_plot_columns = ['Q29_A_Part_1', 'Q29_A_Part_2', 'Q29_A_Part_3', 'Q29_A_Part_4', 'Q29_A_OTHER']
Q29_plot_labels = ['Amazon EC2', 'Microsoft Azure Virtual Machines', 
                   'Google Cloud Compute Engine', 'No/None', 'Other' ]
Q29_plot_counts = [0]*len(Q29_plot_labels)
for idx, col in enumerate(Q29_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q29_plot_counts[idx] += int(survey_df[col].count())
    #print(idx, int(survey_df[col].value_counts()))
print(Q29_plot_counts)
print(Q29_plot_labels)

In [146]:
plt.bar(Q29_plot_labels, Q29_plot_counts)
plt.xticks(rotation=90)
plt.show()

Amazon EC2 and Google Cloud Compute Engine seem to be the most popular cloud products, according to the respondents. 

In [147]:
Q29_plot_counts = [0]*len(Q29_plot_labels)
for idx, col in enumerate(Q29_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q29_plot_counts[idx] += int(india_df[col].count())
plt.bar(Q29_plot_labels, Q29_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [148]:
Q29_plot_counts = [0]*len(Q29_plot_labels)
for idx, col in enumerate(Q29_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q29_plot_counts[idx] += int(usa_df[col].count())
plt.bar(Q29_plot_labels, Q29_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The trend shows that Amazon EC2 is by far the favorite cloud product in the USA, compared to India where there is a more equitable distribution of preference. 

# Regularly used data storage products

In [149]:
print(questions[columns.index('Q30_A_OTHER')])
print(survey_df['Q30_A_Part_1'].unique())
print(survey_df['Q30_A_Part_2'].unique())
print(survey_df['Q30_A_Part_3'].unique())
print(survey_df['Q30_A_Part_4'].unique())
print(survey_df['Q30_A_Part_5'].unique())
print(survey_df['Q30_A_Part_6'].unique())
print(survey_df['Q30_A_Part_7'].unique())
print(survey_df['Q30_A_OTHER'].unique())

In [150]:
Q30_plot_columns = ['Q30_A_Part_1', 'Q30_A_Part_2', 'Q30_A_Part_3', 'Q30_A_Part_4', 'Q30_A_Part_5',
                    'Q30_A_Part_6', 'Q30_A_Part_7', 'Q30_A_OTHER']
Q30_plot_labels = ['Microsoft Azure Data Lake Storage', 'Microsoft Azure Disk Storage', 
                   'Amazon S3', 'Amazon EFS',
                    'Google Cloud Storage', 'Google Cloud Filestore','No/None', 'Other' ]
Q30_plot_counts = [0]*len(Q30_plot_labels)
for idx, col in enumerate(Q30_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q30_plot_counts[idx] += int(survey_df[col].count())
    #print(idx, int(survey_df[col].value_counts()))
print(Q30_plot_counts)
print(Q30_plot_labels)

In [151]:
plt.bar(Q30_plot_labels, Q30_plot_counts)
plt.xticks(rotation=90)
plt.show()

Again, we can see that Amazon S3 and Google Cloud Filestore are the popular data storage options for the respondents. 

In [152]:
Q30_plot_counts = [0]*len(Q30_plot_labels)
for idx, col in enumerate(Q30_plot_columns):
    # count() = number of non-null answers, i.e. respondents who picked this option
    Q30_plot_counts[idx] += int(india_df[col].count())
plt.bar(Q30_plot_labels, Q30_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [153]:
Q30_plot_counts = [0]*len(Q30_plot_labels)
for col in Q30_plot_columns:
    idx = Q30_plot_columns.index(col)
    Q30_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q30_plot_labels, Q30_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The trend seems to be the same in both the USA and India; however, the USA prefers Amazon S3 to a greater degree. 
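Because India has far more respondents than the USA, raw counts are hard to compare directly. One option is to convert the counts to percentages of the respondents who answered the question. The sketch below assumes the same NaN-for-unselected layout as the survey columns; `selection_share` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def selection_share(df, columns):
    """Percentage of answering respondents who selected each option."""
    # A respondent "answered" if at least one of the columns is non-null.
    answered = df[columns].notna().any(axis=1).sum()
    if answered == 0:
        return pd.Series(0.0, index=columns)
    return df[columns].count() / answered * 100

# Toy stand-ins for india_df and usa_df (hypothetical values).
toy_india = pd.DataFrame({
    'Q30_A_Part_3': ['Amazon S3', np.nan, np.nan, 'Amazon S3'],
    'Q30_A_Part_5': [np.nan, 'Google Cloud Storage', 'Google Cloud Storage', np.nan],
})
toy_usa = pd.DataFrame({
    'Q30_A_Part_3': ['Amazon S3', 'Amazon S3'],
    'Q30_A_Part_5': [np.nan, 'Google Cloud Storage'],
})
cols = ['Q30_A_Part_3', 'Q30_A_Part_5']
india_share = selection_share(toy_india, cols)
usa_share = selection_share(toy_usa, cols)
print(india_share)
print(usa_share)
```

Plotting these shares instead of the raw counts would put both countries on the same 0 to 100 scale.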

# Regularly used ML products

In [154]:
print(questions[columns.index('Q31_A_OTHER')])
print(survey_df['Q31_A_Part_1'].unique())
print(survey_df['Q31_A_Part_2'].unique())
print(survey_df['Q31_A_Part_3'].unique())
print(survey_df['Q31_A_Part_4'].unique())
print(survey_df['Q31_A_Part_5'].unique())
print(survey_df['Q31_A_Part_6'].unique())
print(survey_df['Q31_A_Part_7'].unique())
print(survey_df['Q31_A_Part_8'].unique())
print(survey_df['Q31_A_Part_9'].unique())
print(survey_df['Q31_A_OTHER'].unique())

In [155]:
Q31_plot_columns = ['Q31_A_Part_1', 'Q31_A_Part_2', 'Q31_A_Part_3', 'Q31_A_Part_4', 'Q31_A_Part_5',
                    'Q31_A_Part_6', 'Q31_A_Part_7', 'Q31_A_Part_8', 'Q31_A_Part_9', 'Q31_A_OTHER']
Q31_plot_labels = ['Amazon SageMaker', 'Azure ML studio', 
                   'Google Cloud Vertex AI', 'DataRobot',
                    'Databricks', 'Dataiku', 'Alteryx', 'Rapidminer','None', 'Other' ]
Q31_plot_counts = [0]*len(Q31_plot_labels)
for col in Q31_plot_columns:
    idx = Q31_plot_columns.index(col)
    Q31_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q31_plot_counts)
print(Q31_plot_labels)

In [156]:
plt.bar(Q31_plot_labels, Q31_plot_counts)
plt.xticks(rotation=90)
plt.show()

ML products seem to be an underutilized market, with none of them used by a large proportion of the respondents. 

In [157]:
Q31_plot_counts = [0]*len(Q31_plot_labels)
for col in Q31_plot_columns:
    idx = Q31_plot_columns.index(col)
    Q31_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q31_plot_labels, Q31_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [158]:
Q31_plot_counts = [0]*len(Q31_plot_labels)
for col in Q31_plot_columns:
    idx = Q31_plot_columns.index(col)
    Q31_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q31_plot_labels, Q31_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The plots are similar for India and USA, indicating similar trends. 

# Regularly used Big Data products

In [159]:
print(questions[columns.index('Q32_A_OTHER')])
print(survey_df['Q32_A_Part_1'].unique())
print(survey_df['Q32_A_Part_2'].unique())
print(survey_df['Q32_A_Part_3'].unique())
print(survey_df['Q32_A_Part_4'].unique())
print(survey_df['Q32_A_Part_5'].unique())
print(survey_df['Q32_A_Part_6'].unique())
print(survey_df['Q32_A_Part_7'].unique())
print(survey_df['Q32_A_Part_8'].unique())
print(survey_df['Q32_A_Part_9'].unique())
print(survey_df['Q32_A_Part_10'].unique())
print(survey_df['Q32_A_Part_11'].unique())
print(survey_df['Q32_A_Part_12'].unique())
print(survey_df['Q32_A_Part_13'].unique())
print(survey_df['Q32_A_Part_14'].unique())
print(survey_df['Q32_A_Part_15'].unique())
print(survey_df['Q32_A_Part_16'].unique())
print(survey_df['Q32_A_Part_17'].unique())
print(survey_df['Q32_A_Part_18'].unique())
print(survey_df['Q32_A_Part_19'].unique())
print(survey_df['Q32_A_Part_20'].unique())
print(survey_df['Q32_A_OTHER'].unique())

In [160]:
Q32_plot_columns = ['Q32_A_Part_1', 'Q32_A_Part_2', 'Q32_A_Part_3', 'Q32_A_Part_4', 'Q32_A_Part_5',
                    'Q32_A_Part_6', 'Q32_A_Part_7', 'Q32_A_Part_8', 'Q32_A_Part_9', 'Q32_A_Part_10',
                    'Q32_A_Part_11', 'Q32_A_Part_12', 'Q32_A_Part_13', 'Q32_A_Part_14', 'Q32_A_Part_15',
                    'Q32_A_Part_16', 'Q32_A_Part_17', 'Q32_A_Part_18', 'Q32_A_Part_19', 'Q32_A_Part_20',
                     'Q32_A_OTHER']
Q32_plot_labels = ['MySQL', 'PostgreSQL', 'SQLite', 'Oracle database', 'MongoDB', 'Snowflake', 'IBM db2',
                   'Microsoft SQL Server', 'Microsoft Azure SQL database', 'Microsoft Azure Cosmos DB',
                   'Amazon Redshift', 'Amazon Aurora', 'Amazon RDS', 'Amazon DynamoDB',
                   'Google Cloud BigQuery', 'Google Cloud SQL', 'Google Cloud Firestore',
                   'Google Cloud BigTable', 'Google Cloud Spanner','None', 'Other' ]
Q32_plot_counts = [0]*len(Q32_plot_labels)
for col in Q32_plot_columns:
    idx = Q32_plot_columns.index(col)
    Q32_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q32_plot_counts)
print(Q32_plot_labels)

In [161]:
plt.bar(Q32_plot_labels, Q32_plot_counts)
plt.xticks(rotation=90)
plt.show()

The three most popular Big Data products seem to be MySQL, PostgreSQL and Microsoft SQL Server. Many respondents do not use any Big Data product; a plausible explanation is that they are predominantly students. 
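The guess about students can be checked against the data. The sketch below assumes that the respondent's role is stored in a `Q5` column with the value `'Student'`, and that `Q32_A_Part_20` is the "None" option (following the label order used above). Both are assumptions about the dataset, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def student_share_among(df, none_col, role_col='Q5', student_label='Student'):
    """Fraction of respondents who selected `none_col` and are students."""
    picked_none = df[df[none_col].notna()]
    if picked_none.empty:
        return float('nan')
    return float((picked_none[role_col] == student_label).mean())

# Toy stand-in for survey_df (hypothetical values).
toy_df = pd.DataFrame({
    'Q5': ['Student', 'Data Scientist', 'Student', 'Student', 'Data Analyst'],
    'Q32_A_Part_20': ['None', np.nan, 'None', np.nan, 'None'],
})
share = student_share_among(toy_df, 'Q32_A_Part_20')
print(f'{share:.2%} of the "None" answers come from students')
```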

In [162]:
Q32_plot_counts = [0]*len(Q32_plot_labels)
for col in Q32_plot_columns:
    idx = Q32_plot_columns.index(col)
    Q32_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q32_plot_labels, Q32_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [163]:
Q32_plot_counts = [0]*len(Q32_plot_labels)
for col in Q32_plot_columns:
    idx = Q32_plot_columns.index(col)
    Q32_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q32_plot_labels, Q32_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

MongoDB seems to be more popular in India than in the USA. 

In [164]:
print(questions[columns.index('Q33')])
print(survey_df['Q33'].unique())

In [165]:
parameter = 'Q33'
sns.countplot(x=survey_df[parameter])
plt.xticks(rotation=90)
plt.show()

In [166]:
parameter = 'Q33'
sns.countplot(x=india_df[parameter])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [167]:
parameter = 'Q33'
sns.countplot(x=usa_df[parameter])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

MySQL, PostgreSQL and Microsoft SQL Server seem to be the most often used Big Data products. MongoDB is used less in the USA than in India. 

# Regularly used BI tools 

In [168]:
print(questions[columns.index('Q34_A_OTHER')])
print(survey_df['Q34_A_Part_1'].unique())
print(survey_df['Q34_A_Part_2'].unique())
print(survey_df['Q34_A_Part_3'].unique())
print(survey_df['Q34_A_Part_4'].unique())
print(survey_df['Q34_A_Part_5'].unique())
print(survey_df['Q34_A_Part_6'].unique())
print(survey_df['Q34_A_Part_7'].unique())
print(survey_df['Q34_A_Part_8'].unique())
print(survey_df['Q34_A_Part_9'].unique())
print(survey_df['Q34_A_Part_10'].unique())
print(survey_df['Q34_A_Part_11'].unique())
print(survey_df['Q34_A_Part_12'].unique())
print(survey_df['Q34_A_Part_13'].unique())
print(survey_df['Q34_A_Part_14'].unique())
print(survey_df['Q34_A_Part_15'].unique())
print(survey_df['Q34_A_Part_16'].unique())
print(survey_df['Q34_A_OTHER'].unique())

In [169]:
Q34_plot_columns = ['Q34_A_Part_1', 'Q34_A_Part_2', 'Q34_A_Part_3', 'Q34_A_Part_4', 'Q34_A_Part_5',
                    'Q34_A_Part_6', 'Q34_A_Part_7', 'Q34_A_Part_8', 'Q34_A_Part_9', 'Q34_A_Part_10',
                    'Q34_A_Part_11', 'Q34_A_Part_12', 'Q34_A_Part_13', 'Q34_A_Part_14', 'Q34_A_Part_15',
                    'Q34_A_Part_16','Q34_A_OTHER']
Q34_plot_labels = ['Amazon QuickSight', 'Microsoft Power BI', 'Google Data Studio', 'Looker', 'Tableau',
                   'Salesforce', 'Tableau CRM', 'Qlik', 'Domo', 'TIBCO Spotfire', 'Alteryx', 'Sisense',
                   'SAP Analytics Cloud', 'Microsoft Azure Synapse', 'Thoughtspot','None', 'Other' ]
Q34_plot_counts = [0]*len(Q34_plot_labels)
for col in Q34_plot_columns:
    idx = Q34_plot_columns.index(col)
    Q34_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q34_plot_counts)
print(Q34_plot_labels)

In [170]:
plt.bar(Q34_plot_labels, Q34_plot_counts)
plt.xticks(rotation=90)
plt.show()

Microsoft Power BI and Tableau seem to be the most regularly used BI tools. 
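Since the column names follow a regular pattern (`Q34_A_Part_1` up to `Q34_A_Part_16`, then `Q34_A_OTHER`), the list can be generated instead of typed by hand, which makes it harder to mix in a column from another question by accident. A minimal sketch; `multi_select_columns` is a hypothetical helper name.

```python
def multi_select_columns(question, n_parts, suffix='_A'):
    """Build the column names of a multi-select question.

    question: question id such as 'Q34'
    n_parts:  number of 'Part' columns
    suffix:   '_A' for questions like Q34, '' for questions like Q39
    """
    prefix = f'{question}{suffix}'
    parts = [f'{prefix}_Part_{i}' for i in range(1, n_parts + 1)]
    return parts + [f'{prefix}_OTHER']

Q34_columns = multi_select_columns('Q34', 16)
Q39_columns = multi_select_columns('Q39', 9, suffix='')
print(Q34_columns)
print(Q39_columns)
```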

In [171]:
Q34_plot_counts = [0]*len(Q34_plot_labels)
for col in Q34_plot_columns:
    idx = Q34_plot_columns.index(col)
    Q34_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q34_plot_labels, Q34_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [172]:
Q34_plot_counts = [0]*len(Q34_plot_labels)
for col in Q34_plot_columns:
    idx = Q34_plot_columns.index(col)
    Q34_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q34_plot_labels, Q34_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

The usage trend of BI tools in the USA and India seems to be very similar. 

# Most often used BI platform 

In [173]:
print(questions[columns.index('Q35')])
print(survey_df['Q35'].unique())

In [174]:
parameter = 'Q35'
sns.countplot(x=survey_df[parameter])
plt.xticks(rotation=90)
plt.show()

In [175]:
parameter = 'Q35'
sns.countplot(x=india_df[parameter])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [176]:
parameter = 'Q35'
sns.countplot(x=usa_df[parameter])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Tableau and Microsoft Power BI seem to be the most often used BI tools, across the board. 

# Automated ML tools

In [177]:
print(questions[columns.index('Q36_A_OTHER')])
print(survey_df['Q36_A_Part_1'].unique())
print(survey_df['Q36_A_Part_2'].unique())
print(survey_df['Q36_A_Part_3'].unique())
print(survey_df['Q36_A_Part_4'].unique())
print(survey_df['Q36_A_Part_5'].unique())
print(survey_df['Q36_A_Part_6'].unique())
print(survey_df['Q36_A_Part_7'].unique())
print(survey_df['Q36_A_OTHER'].unique())

In [178]:
Q36_plot_columns = ['Q36_A_Part_1', 'Q36_A_Part_2', 'Q36_A_Part_3', 'Q36_A_Part_4', 'Q36_A_Part_5',
                    'Q36_A_Part_6', 'Q36_A_Part_7','Q36_A_OTHER']
Q36_plot_labels = ['Automated data augmentation', 'Automated feature engineering/selection',
                   'Automated model selection', 'Automated model architecture searches', 
                   'Automated hyperparameter tuning', 'Automation of full ML pipelines', 'None', 'Other' ]
Q36_plot_counts = [0]*len(Q36_plot_labels)
for col in Q36_plot_columns:
    idx = Q36_plot_columns.index(col)
    Q36_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q36_plot_counts)
print(Q36_plot_labels)

In [179]:
plt.bar(Q36_plot_labels, Q36_plot_counts)
plt.xticks(rotation=90)
plt.show()

In [180]:
Q36_plot_counts = [0]*len(Q36_plot_labels)
for col in Q36_plot_columns:
    idx = Q36_plot_columns.index(col)
    Q36_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q36_plot_labels, Q36_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [181]:
Q36_plot_counts = [0]*len(Q36_plot_labels)
for col in Q36_plot_columns:
    idx = Q36_plot_columns.index(col)
    Q36_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q36_plot_labels, Q36_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

There seem to be very few takers for automated ML tools amongst the respondents. Data augmentation, hyperparameter tuning and model selection seem to be the areas in which automation is more popular. 

# Regularly used ML automation tools 

In [182]:
print(questions[columns.index('Q37_A_OTHER')])
print(survey_df['Q37_A_Part_1'].unique())
print(survey_df['Q37_A_Part_2'].unique())
print(survey_df['Q37_A_Part_3'].unique())
print(survey_df['Q37_A_Part_4'].unique())
print(survey_df['Q37_A_Part_5'].unique())
print(survey_df['Q37_A_Part_6'].unique())
print(survey_df['Q37_A_Part_7'].unique())
print(survey_df['Q37_A_OTHER'].unique())

In [183]:
Q37_plot_columns = ['Q37_A_Part_1', 'Q37_A_Part_2', 'Q37_A_Part_3', 'Q37_A_Part_4', 'Q37_A_Part_5',
                    'Q37_A_Part_6', 'Q37_A_Part_7','Q37_A_OTHER']
Q37_plot_labels = ['Google Cloud AutoML', 'H2O Driverless AI',
                   'Databricks AutoML', 'DataRobot AutoML', 
                   'Amazon Sagemaker Autopilot', 'Azure Automated Machine Learning', 'None', 'Other' ]
Q37_plot_counts = [0]*len(Q37_plot_labels)
for col in Q37_plot_columns:
    idx = Q37_plot_columns.index(col)
    Q37_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q37_plot_counts)
print(Q37_plot_labels)

In [184]:
plt.bar(Q37_plot_labels, Q37_plot_counts)
plt.xticks(rotation=90)
plt.show()

In [185]:
Q37_plot_counts = [0]*len(Q37_plot_labels)
for col in Q37_plot_columns:
    idx = Q37_plot_columns.index(col)
    Q37_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q37_plot_labels, Q37_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [186]:
Q37_plot_counts = [0]*len(Q37_plot_labels)
for col in Q37_plot_columns:
    idx = Q37_plot_columns.index(col)
    Q37_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q37_plot_labels, Q37_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Google Cloud AutoML seems to be the most regularly used ML automation tool amongst the respondents. 

# Tools to manage ML experiments 

In [187]:
print(questions[columns.index('Q38_A_OTHER')])
print(survey_df['Q38_A_Part_1'].unique())
print(survey_df['Q38_A_Part_2'].unique())
print(survey_df['Q38_A_Part_3'].unique())
print(survey_df['Q38_A_Part_4'].unique())
print(survey_df['Q38_A_Part_5'].unique())
print(survey_df['Q38_A_Part_6'].unique())
print(survey_df['Q38_A_Part_7'].unique())
print(survey_df['Q38_A_Part_8'].unique())
print(survey_df['Q38_A_Part_9'].unique())
print(survey_df['Q38_A_Part_10'].unique())
print(survey_df['Q38_A_Part_11'].unique())
print(survey_df['Q38_A_OTHER'].unique())

In [188]:
Q38_plot_columns = ['Q38_A_Part_1', 'Q38_A_Part_2', 'Q38_A_Part_3', 'Q38_A_Part_4', 'Q38_A_Part_5',
                    'Q38_A_Part_6', 'Q38_A_Part_7','Q38_A_Part_8', 'Q38_A_Part_9', 'Q38_A_Part_10', 'Q38_A_Part_11',
                    'Q38_A_OTHER']
Q38_plot_labels = ['Neptune.ai', 'Weights & Biases', 'Comet.ml', 'Sacred + Omniboard', 'TensorBoard',
                   'Guild.ai', 'Polyaxon', 'ClearML', 'Domino Model Monitor', 'MLflow', 'None', 'Other' ]
Q38_plot_counts = [0]*len(Q38_plot_labels)
for col in Q38_plot_columns:
    idx = Q38_plot_columns.index(col)
    Q38_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q38_plot_counts)
print(Q38_plot_labels)

In [189]:
plt.bar(Q38_plot_labels, Q38_plot_counts)
plt.xticks(rotation=90)
plt.show()

TensorBoard is clearly the most popular ML experiment management tool. However, we can see that most respondents don't make use of such tools. 

In [190]:
Q38_plot_counts = [0]*len(Q38_plot_labels)
for col in Q38_plot_columns:
    idx = Q38_plot_columns.index(col)
    Q38_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q38_plot_labels, Q38_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [191]:
Q38_plot_counts = [0]*len(Q38_plot_labels)
for col in Q38_plot_columns:
    idx = Q38_plot_columns.index(col)
    Q38_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q38_plot_labels, Q38_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Respondents from India seem to use more tools to manage ML experiments than their counterparts in the USA, although the raw counts partly reflect India's larger number of respondents. 
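One way to compare tool usage independently of the number of respondents is the average number of tools selected per respondent. The sketch below assumes `Q38_A_Part_11` is the "None" option (following the label order used above); the helper name and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def mean_tools_per_respondent(df, tool_columns, none_column):
    """Average number of tools picked, over respondents who answered."""
    # Respondents who answered picked at least one tool or the "None" option.
    answered = df[tool_columns + [none_column]].notna().any(axis=1)
    return df.loc[answered, tool_columns].notna().sum(axis=1).mean()

tools = ['Q38_A_Part_2', 'Q38_A_Part_5', 'Q38_A_Part_10']
none_col = 'Q38_A_Part_11'

# Toy stand-ins for india_df and usa_df (hypothetical values).
toy_india = pd.DataFrame({
    'Q38_A_Part_2': [np.nan, np.nan, np.nan],
    'Q38_A_Part_5': ['TensorBoard', np.nan, 'TensorBoard'],
    'Q38_A_Part_10': ['MLflow', np.nan, np.nan],
    'Q38_A_Part_11': [np.nan, 'None', np.nan],
})
toy_usa = pd.DataFrame({
    'Q38_A_Part_2': ['Weights & Biases', np.nan],
    'Q38_A_Part_5': ['TensorBoard', 'TensorBoard'],
    'Q38_A_Part_10': ['MLflow', np.nan],
    'Q38_A_Part_11': [np.nan, np.nan],
})
india_mean = mean_tools_per_respondent(toy_india, tools, none_col)
usa_mean = mean_tools_per_respondent(toy_usa, tools, none_col)
print(india_mean, usa_mean)
```

In this toy data both groups have the same raw TensorBoard count, yet the per-respondent averages differ, which is why a normalized measure can be a useful complement to the bar charts.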

# Public platforms to share data science portfolio

In [192]:
print(questions[columns.index('Q39_OTHER')])
print(survey_df['Q39_Part_1'].unique())
print(survey_df['Q39_Part_2'].unique())
print(survey_df['Q39_Part_3'].unique())
print(survey_df['Q39_Part_4'].unique())
print(survey_df['Q39_Part_5'].unique())
print(survey_df['Q39_Part_6'].unique())
print(survey_df['Q39_Part_7'].unique())
print(survey_df['Q39_Part_8'].unique())
print(survey_df['Q39_Part_9'].unique())
print(survey_df['Q39_OTHER'].unique())

In [193]:
Q39_plot_columns = ['Q39_Part_1', 'Q39_Part_2', 'Q39_Part_3', 'Q39_Part_4', 'Q39_Part_5',
                    'Q39_Part_6', 'Q39_Part_7','Q39_Part_8', 'Q39_Part_9',
                    'Q39_OTHER']
Q39_plot_labels = ['Plotly Dash', 'Streamlit', 'NBViewer', 'GitHub', 'Personal blog',
                   'Kaggle', 'Colab', 'Shiny',  'I do not share my work publicly', 'Other' ]
Q39_plot_counts = [0]*len(Q39_plot_labels)
for col in Q39_plot_columns:
    idx = Q39_plot_columns.index(col)
    Q39_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q39_plot_counts)
print(Q39_plot_labels)

In [194]:
plt.bar(Q39_plot_labels, Q39_plot_counts)
plt.xticks(rotation=90)
plt.show()

GitHub, Kaggle and Colab are the most popular platforms used by the respondents to share their data science work. There is also a large number of people who do not share their work publicly. 
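The plot labels above are typed by hand in the same order as the columns. Since each column appears to contain a single distinct non-null value (the option text printed by `unique()`), the labels could also be read from the data. A minimal sketch under that assumption; `labels_from_columns` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def labels_from_columns(df, columns):
    """Use the single non-null value of each column as its plot label."""
    labels = []
    for col in columns:
        values = df[col].dropna().unique()
        # Fall back to the column name when nobody selected the option.
        labels.append(str(values[0]).strip() if len(values) else col)
    return labels

# Toy stand-in for survey_df (hypothetical values).
toy_df = pd.DataFrame({
    'Q39_Part_4': ['GitHub ', np.nan],
    'Q39_Part_6': [' Kaggle ', ' Kaggle '],
    'Q39_Part_8': [np.nan, np.nan],
})
labels = labels_from_columns(toy_df, ['Q39_Part_4', 'Q39_Part_6', 'Q39_Part_8'])
print(labels)
```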

In [195]:
Q39_plot_counts = [0]*len(Q39_plot_labels)
for col in Q39_plot_columns:
    idx = Q39_plot_columns.index(col)
    Q39_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q39_plot_labels, Q39_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [196]:
Q39_plot_counts = [0]*len(Q39_plot_labels)
for col in Q39_plot_columns:
    idx = Q39_plot_columns.index(col)
    Q39_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q39_plot_labels, Q39_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

GitHub is popular in both India and the USA. However, Kaggle has a higher preference amongst Indians. 

# Platforms for studying data science 

In [197]:
print(questions[columns.index('Q40_OTHER')])
print(survey_df['Q40_Part_1'].unique())
print(survey_df['Q40_Part_2'].unique())
print(survey_df['Q40_Part_3'].unique())
print(survey_df['Q40_Part_4'].unique())
print(survey_df['Q40_Part_5'].unique())
print(survey_df['Q40_Part_6'].unique())
print(survey_df['Q40_Part_7'].unique())
print(survey_df['Q40_Part_8'].unique())
print(survey_df['Q40_Part_9'].unique())
print(survey_df['Q40_Part_10'].unique())
print(survey_df['Q40_Part_11'].unique())
print(survey_df['Q40_OTHER'].unique())

In [198]:
Q40_plot_columns = ['Q40_Part_1', 'Q40_Part_2', 'Q40_Part_3', 'Q40_Part_4', 'Q40_Part_5',
                    'Q40_Part_6', 'Q40_Part_7','Q40_Part_8', 'Q40_Part_9', 'Q40_Part_10', 'Q40_Part_11',
                    'Q40_OTHER']
Q40_plot_labels = ['Coursera', 'edX', 'Kaggle Learn Courses', 'DataCamp', 'Fast.ai',
                   'Udacity', 'Udemy', 'LinkedIn Learning', 'Cloud-certification programs', 'University Courses', 'None', 'Other' ]
Q40_plot_counts = [0]*len(Q40_plot_labels)
for col in Q40_plot_columns:
    idx = Q40_plot_columns.index(col)
    Q40_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q40_plot_counts)
print(Q40_plot_labels)

In [199]:
plt.bar(Q40_plot_labels, Q40_plot_counts)
plt.xticks(rotation=90)
plt.show()

Coursera, Kaggle and Udemy seem to be the most popular learning platforms amongst the respondents. 

In [200]:
Q40_plot_counts = [0]*len(Q40_plot_labels)
for col in Q40_plot_columns:
    idx = Q40_plot_columns.index(col)
    Q40_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q40_plot_labels, Q40_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [201]:
Q40_plot_counts = [0]*len(Q40_plot_labels)
for col in Q40_plot_columns:
    idx = Q40_plot_columns.index(col)
    Q40_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q40_plot_labels, Q40_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

Picking university courses related to data science and ML seems to be more popular in the USA compared to India. 

# Primary tool to analyze data

In [202]:
print(questions[columns.index('Q41')])
print(survey_df['Q41'].unique())

In [203]:
parameter = 'Q41'
sns.countplot(x=survey_df[parameter], order = ['Local development environments (RStudio, JupyterLab, etc.)',
 'Basic statistical software (Microsoft Excel, Google Sheets, etc.)',
 'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)', 'Other',
 'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)',
 'Advanced statistical software (SPSS, SAS, etc.)'])
plt.xticks(rotation=90)
plt.show()

Respondents seem to prefer local development environments or basic statistical software to perform their primary data analysis. 

In [204]:
parameter = 'Q41'
sns.countplot(x=india_df[parameter], order = ['Local development environments (RStudio, JupyterLab, etc.)',
 'Basic statistical software (Microsoft Excel, Google Sheets, etc.)',
 'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)', 'Other',
 'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)',
 'Advanced statistical software (SPSS, SAS, etc.)'])
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [205]:
parameter = 'Q41'
sns.countplot(x=usa_df[parameter], order = ['Local development environments (RStudio, JupyterLab, etc.)',
 'Basic statistical software (Microsoft Excel, Google Sheets, etc.)',
 'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)', 'Other',
 'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)',
 'Advanced statistical software (SPSS, SAS, etc.)'])
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

There seems to be a stronger preference for local development environments like RStudio and Jupyter in the USA than in India, where basic statistical tools like spreadsheets are preferred. 
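For a single-choice question such as Q41, the comparison can be made explicit by putting the percentage of each answer side by side for both countries. A minimal sketch using `value_counts(normalize=True)`; the shortened answer labels and the toy series are made up for illustration.

```python
import pandas as pd

def answer_shares(series):
    """Percentage of non-null answers falling into each category."""
    return series.value_counts(normalize=True) * 100

# Toy stand-ins for india_df['Q41'] and usa_df['Q41'] (hypothetical values).
toy_india = pd.Series(['Local', 'Basic', 'Basic', 'Local', 'Basic'])
toy_usa = pd.Series(['Local', 'Local', 'Basic', 'Local'])

# Align both countries on the same answer categories.
comparison = pd.DataFrame({
    'India': answer_shares(toy_india),
    'USA': answer_shares(toy_usa),
}).fillna(0)
print(comparison)
```

Calling `comparison.plot.bar()` would then draw both countries in one grouped bar chart.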

# Preferred media for information on data science 

In [206]:
print(questions[columns.index('Q42_OTHER')])
print(survey_df['Q42_Part_1'].unique())
print(survey_df['Q42_Part_2'].unique())
print(survey_df['Q42_Part_3'].unique())
print(survey_df['Q42_Part_4'].unique())
print(survey_df['Q42_Part_5'].unique())
print(survey_df['Q42_Part_6'].unique())
print(survey_df['Q42_Part_7'].unique())
print(survey_df['Q42_Part_8'].unique())
print(survey_df['Q42_Part_9'].unique())
print(survey_df['Q42_Part_10'].unique())
print(survey_df['Q42_Part_11'].unique())
print(survey_df['Q42_OTHER'].unique())

In [207]:
Q42_plot_columns = ['Q42_Part_1', 'Q42_Part_2', 'Q42_Part_3', 'Q42_Part_4', 'Q42_Part_5', 'Q42_Part_6', 'Q42_Part_7',
                    'Q42_Part_8','Q42_Part_9', 'Q42_Part_10', 'Q42_Part_11', 'Q42_OTHER']
Q42_plot_labels = ['Twitter', 'Email newsletters', 'Reddit', 'Kaggle',
                   'Course forums', 'YouTube', 'Podcasts', 'Blogs', 'Journal publications', 
                   'Slack Communities','None', 'Other' ]
Q42_plot_counts = [0]*len(Q42_plot_labels)
for col in Q42_plot_columns:
    idx = Q42_plot_columns.index(col)
    Q42_plot_counts[idx] += int(survey_df[col].value_counts())
    #print(idx, int(survey_df[col].value_counts()))
print(Q42_plot_counts)
print(Q42_plot_labels)

In [208]:
plt.bar(Q42_plot_labels, Q42_plot_counts)
plt.xticks(rotation=90)
plt.show()

Most of the respondents seem to get their data science information from Kaggle, YouTube and blogs.

In [209]:
Q42_plot_counts = [0]*len(Q42_plot_labels)
for col in Q42_plot_columns:
    idx = Q42_plot_columns.index(col)
    Q42_plot_counts[idx] += int(india_df[col].value_counts())
plt.bar(Q42_plot_labels, Q42_plot_counts)
plt.xticks(rotation=90)
plt.title('India')
plt.show()

In [210]:
Q42_plot_counts = [0]*len(Q42_plot_labels)
for col in Q42_plot_columns:
    idx = Q42_plot_columns.index(col)
    Q42_plot_counts[idx] += int(usa_df[col].value_counts())
plt.bar(Q42_plot_labels, Q42_plot_counts)
plt.xticks(rotation=90)
plt.title('USA')
plt.show()

In India, most of the respondents get their data science information from the top three platforms. In the USA, we can see that the share of people getting their information from other sources is higher. 
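The concentration on the top three platforms can be expressed as a single number: the fraction of all selections that go to the three most selected options. A minimal sketch; `top_k_share` is a hypothetical helper and the counts below are invented, not taken from the survey.

```python
def top_k_share(counts, k=3):
    """Fraction of all selections that fall on the k most selected options."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(sorted(counts, reverse=True)[:k]) / total

# Invented counts standing in for Q42_plot_counts of each country.
toy_india_counts = [5, 5, 10, 40, 5, 35]
toy_usa_counts = [15, 15, 15, 20, 15, 20]

india_top3 = top_k_share(toy_india_counts)
usa_top3 = top_k_share(toy_usa_counts)
print(india_top3, usa_top3)
```

A higher value means the answers are concentrated on a few platforms, while a lower value means they are spread across more sources.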