# Introduction

**Objective**: In this kernel I explore the current state of the data science field in India and compare it with the rest of the world.

My intention is to analyze <a>trends/relations/distributions</a> in the 2019 [Kaggle ML & DS Survey](https://www.kaggle.com/c/kaggle-survey-2019/data) data from India's perspective, and to list both the areas where Indians are doing well and the areas we need to improve to reach new heights in the data science field.

I have tried to keep the explanations as short and simple as possible, along with simple graphs. If you like the work, please upvote, and feel free to drop comments with any suggestions or questions.


Let's import the data and clean it for further analysis.



In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


# Any results you write to the current directory are saved as output.

In [None]:
#Importing the required libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px

In [None]:
#Importing all data set
dataset_mcq=pd.read_csv("../multiple_choice_responses.csv")

In [None]:
#Using the first row (the full question text) as the column header
dataset_mcq.columns=dataset_mcq.iloc[0]
#Dropping that first row, since it is now the header
dataset_mcq=dataset_mcq.drop([0])
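
To make the header step easier to follow, here is a minimal sketch on a toy frame. It assumes, as in the survey file, that the first data row holds the full question text. The helper `promote_first_row` and the toy data are hypothetical and only for illustration; unlike the cell above, the helper also resets the row index.

```python
import pandas as pd

# Toy frame mimicking the survey layout: the first data row holds the question text
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "22-24", "25-29"],
    "Q2": ["What is your gender? - Selected Choice", "Male", "Female"],
})

def promote_first_row(df):
    """Return a copy whose column names come from the first row, with that row dropped."""
    out = df.iloc[1:].copy()
    out.columns = df.iloc[0].tolist()
    return out.reset_index(drop=True)

clean = promote_first_row(raw)
print(clean.columns.tolist())
print(clean.shape)
```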

We will use the following data set for our analysis:

In [None]:
#First few rows
dataset_mcq.head()

The shape of the data set (number of records, number of columns) is:

In [None]:
#Number of records and columns
dataset_mcq.shape

# How Gender is playing a role

The very first thing I am interested in is how gender plays a role in the data science and machine learning fields, and how geography ties in with it.

Let's look at the gender ratio across the world.

In [None]:
#Gender wise distribution
fig = go.Figure(data=[go.Pie(labels=dataset_mcq['What is your gender? - Selected Choice'],hole=.4)])
fig.show()

It is quite evident from the chart above that survey participation was dominated by males.
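
To put a number on this, the share of each gender can also be computed directly instead of being read off the pie chart. This is a small sketch on hypothetical toy answers, not on the survey data; on the real frame the same call would be applied to the 'What is your gender? - Selected Choice' column.

```python
import pandas as pd

# Hypothetical toy answers, not survey data
gender = pd.Series(
    ["Male", "Male", "Male", "Male", "Female", "Prefer not to say"],
    name="Gender",
)

# normalize=True returns fractions instead of raw counts
gender_share = gender.value_counts(normalize=True).mul(100).round(1)
print(gender_share)
```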

Let's see the country-wise distribution of respondents.

In [None]:
# Replacing ambiguous country names with standard short names
# (assigning the result back instead of using inplace=True on a column selection, which newer pandas may not apply to the frame)
dataset_mcq['In which country do you currently reside?']=dataset_mcq['In which country do you currently reside?'].replace(
                                                   {'United States of America':'USA',
                                                    'Viet Nam':'Vietnam',
                                                    "People 's Republic of China":'China',
                                                    "United Kingdom of Great Britain and Northern Ireland":'UK',
                                                    "Hong Kong (S.A.R.)":"HongKong"})
# Replacing the long education-level label with an abbreviation
dataset_mcq['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?']=dataset_mcq['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].replace(
                                                   {"Some college/university study without earning a bachelor’s degree":'Some Education'})


In [None]:
#Country wise distribution of respondents
country_dist=dataset_mcq['In which country do you currently reside?'].value_counts()
fig = px.choropleth(country_dist.values, # Respondent counts per country
                    locations=country_dist.index, # Country names to place on the map
                    locationmode='country names', # Match locations by country name
                    color=country_dist.values, # Color each country by its respondent count
                    color_continuous_scale="haline")
fig.update_layout(title="Country-wise Distribution of Respondents")
fig.show()

* USA and India are the leading countries
* China, Brazil, Japan, Russia, Canada, UK and France have a significant number of respondents

To get a more specific share of respondents, let's plot a chart of the country-wise share.

In [None]:
#Country wise distribution
fig = go.Figure(data=[go.Pie(labels=dataset_mcq['In which country do you currently reside?'],hole=.3)])
fig.show()

>> This is a surprise for me. I always thought that the US would be leading in terms of 'number of people in the data science field', but it looks like the scenario is changing and Indians are leading on that measure.

>> There could be another dimension to this: **as the respondents are Kagglers, and Kaggle is one of the best-known platforms for learning data science, we can say that there are more learners in India than in the US**. We have to analyze further to get a clearer picture.

Let's separate male and female respondents and dig deeper into the top 10 countries by respondent count.

In [None]:
#Taking male and female respondents separately
male_count=dataset_mcq[dataset_mcq['What is your gender? - Selected Choice'] == 'Male']
female_count=dataset_mcq[dataset_mcq['What is your gender? - Selected Choice'] == 'Female']

# Top 10 countries by number of respondents
male_count_top10=male_count['In which country do you currently reside?'].value_counts()[:10]
female_count_top10=female_count['In which country do you currently reside?'].value_counts()[:10]

# Pie charts of male and female respondents by country
# (index/values are used directly because the column names produced by reset_index() differ between pandas versions)
pieMen=go.Figure(data=[go.Pie(labels=male_count_top10.index,values=male_count_top10.values,name="Men",hole=.3)])
pieWomen=go.Figure(data=[go.Pie(labels=female_count_top10.index,values=female_count_top10.values,name="Women",hole=.3)])

**Distribution of male respondents across the top 10 countries**

In [None]:
pieMen.show()

**Distribution of female respondents across the top 10 countries**

In [None]:
pieWomen.show()

**📌 Takeaway points**

* Data science and machine learning are dominated by male techies; in fact, there are five times as many male techies as female techies
* Indian techies are ahead of every other country in spending their time and resources on learning DS and ML; in fact, Indian respondents outnumber those from the USA
* Indian female techies are also ahead of their contemporaries around the world; however, overall female participation is very low
* Apart from India and the USA, other countries have much lower participation

# How Education level is linked with Gender

The next thing I want to explore is the respondents' education level and how it varies between males and females.

We will look at the gender-wise education level of respondents across the world.

In [None]:
# Respondents' gender-wise education level
male_educationlevel=male_count['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].value_counts().reset_index()
female_educationlevel=female_count['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].value_counts().reset_index()

# Pie charts of education level for male and female respondents
pieMenEducation=go.Figure(data=[go.Pie(labels=male_educationlevel['index'],values=male_educationlevel['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'],name="Men",hole=.3)])
pieWomenEducation=go.Figure(data=[go.Pie(labels=female_educationlevel['index'],values=female_educationlevel['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'],name="Women",hole=.3)])


**Male respondents' education level**

In [None]:
pieMenEducation.show()

**Female respondents' education level**

In [None]:
pieWomenEducation.show()

Individually, the numbers look good for women, but our initial analysis says that men dominate here. To clarify this, let's compare the two.

In [None]:
# Add a Gender column to the male and female education level data frames
# (a scalar is broadcast to every row, so this works for any number of education levels)
male_educationlevel=male_educationlevel.assign(Gender='Male')
female_educationlevel=female_educationlevel.assign(Gender='Female')
#Concatenate both data frames to generate the comparison graph
frames1 = [male_educationlevel, female_educationlevel]
result1 = pd.concat(frames1)
result1 = result1.rename(columns = {"index": "Education Level", 
                                  "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?":"Count"}) 
#Bar graph to compare education level for both genders
fig = px.bar(result1, x='Education Level', y='Count',color='Gender')
fig.show()

Well, this confirms our initial analysis that men dominate this field.
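
The pie charts and the bar chart answer two different questions: the pies show shares within each gender, while the bars show raw counts. A rough sketch of how both views could be obtained from one table with `pd.crosstab` is shown here. The toy data is hypothetical and not taken from the survey.

```python
import pandas as pd

# Hypothetical toy respondents, not survey data
toy = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 2,
    "EducationLevel": ["Bachelor's degree"] * 3 + ["Master's degree"] * 5,
})

# Raw counts: men lead every category simply because there are more men
counts = pd.crosstab(toy["EducationLevel"], toy["Gender"])

# Column-normalised shares: each gender's column sums to 1
shares = pd.crosstab(toy["EducationLevel"], toy["Gender"], normalize="columns")

print(counts)
print(shares)
```

In this toy example men have the higher count for the master's degree, yet the share of women holding one is larger, which is the same kind of pattern the charts above show.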

Let's see the education level of Indian respondents when it comes to data science and machine learning.

In [None]:
#Indian male and female techies
male_count_india=male_count[male_count['In which country do you currently reside?'] == 'India']
female_count_india=female_count[female_count['In which country do you currently reside?'] == 'India']

indian_male_educationlevel=male_count_india['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].value_counts().reset_index()
indian_female_educationlevel=female_count_india['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].value_counts().reset_index()

# Pie charts of education level for Indian male and female respondents
pieIndianMenEducation=go.Figure(data=[go.Pie(labels=indian_male_educationlevel['index'],values=indian_male_educationlevel['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'],name="Men",hole=.3)])
pieIndianWomenEducation=go.Figure(data=[go.Pie(labels=indian_female_educationlevel['index'],values=indian_female_educationlevel['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'],name="Women",hole=.3)])

**Indian male respondents' education level**

In [None]:
pieIndianMenEducation.show()

**Indian female respondents' education level**

In [None]:
pieIndianWomenEducation.show()

Now, let's compare the two.

In [None]:
# Add a Gender column to the male and female education level data frames
# (a scalar is broadcast to every row, so this works even if a group has fewer than seven education levels)
indian_male_educationlevel=indian_male_educationlevel.assign(Gender='Male')
indian_female_educationlevel=indian_female_educationlevel.assign(Gender='Female')
#Concatenate both data frames to generate the comparison graph
frames2 = [indian_male_educationlevel, indian_female_educationlevel]
result2 = pd.concat(frames2)
result2 = result2.rename(columns = {"index": "Education Level", 
                                  "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?":"Count"}) 
#Bar graph to compare education level for both genders
fig = px.bar(result2, x='Education Level', y='Count',color='Gender')
fig.show()

A higher percentage of women than men are either enrolled in or have completed a doctoral, master's or professional degree.

For almost half of the respondents a master's degree is the education of choice, **so those looking for a career in the DS/ML field can consider a master's degree.**

A few more points are evident here; let's look at them.

**📌 Takeaway points**

* A master's degree is the favorite among female respondents and a bachelor's degree is the choice of male respondents; this trend is **slightly different** from the rest of the world, where a master's degree is popular among both men and women
* Globally, women's enrollment/completion is slightly higher than men's for the master's degree, and this trend is similar in India
* Globally, men's enrollment/completion is higher for the bachelor's degree, and again this is the case in India
* Women's enrollment in doctoral programs is higher than men's, and the trend is the same in India

 >>  <a>In the data science and machine learning field, women's representation is lower, but their inclination towards higher education is greater than men's</a>

# How Education level is playing a role in Job Title and Salary

Now, I am interested to know how respondents' education level plays a role in their salaries and job titles.

Let's start at the global level. First we will see how salaries are related to job titles, and to begin with, let's find the most common job titles.

In [None]:
#Taking out job role and Education level to another data frame for visualization
dataset_salary_jobrole=dataset_mcq[['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?','Select the title most similar to your current role (or most recent title if retired): - Selected Choice','What is your current yearly compensation (approximate $USD)?','What is your gender? - Selected Choice']]
dataset_salary_jobrole=dataset_salary_jobrole.rename(columns = {'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?':'EducationLevel','Select the title most similar to your current role (or most recent title if retired): - Selected Choice':'JobTitle','What is your current yearly compensation (approximate $USD)?':'Salary','What is your gender? - Selected Choice':'Gender'})

#Different Job roles counts
dataset_salary_jobrole['JobTitle'].value_counts().plot.bar()

As the saying goes, **Data Scientist is the sexiest job of the 21st century**, and the graph above supports this. Another thing that stands out is that the number of **Students** nearly matches the number of Data Scientists, which suggests that Kaggle is one of the primary resources for students learning data science and machine learning.


In my experience on Kaggle too, a good number of kernels, discussions and competitions involve students.

After this, let's explore the relation between job role and education level.

In [None]:
#Visualize the Job Role and Education level
fig = px.scatter(dataset_salary_jobrole,x='JobTitle',y='Salary',hover_data=['EducationLevel'])
fig.show()


This is a pretty straightforward graph which shows the education level (*on hovering*) for a particular job title along with the corresponding salary range.

As per my understanding, out of all the available job titles, <a>Data Scientist/Statistician/Data Analyst</a> are the roles where a person is directly involved in the data science and machine learning field. Looking at their data, it appears that people in the higher salary ranges in these roles have mostly enrolled in or completed a doctoral or master's degree.

>> <a>In the data science and machine learning field, if you want to be in a higher salary bracket then earn either a doctoral or a master's degree</a>
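
One caveat when reading the scatter plot: the salary values are text buckets, so the axis order is not guaranteed to run from low to high. The following is a minimal sketch of one way to sort such buckets by their lower bound. The helper `salary_lower_bound` is hypothetical, and the bucket formats in the toy series are assumptions that should be checked against the real 'Salary' column.

```python
import re
import pandas as pd

def salary_lower_bound(bucket):
    """Return the first number in a salary bucket string as an int (None if absent)."""
    if not isinstance(bucket, str):
        return None
    match = re.search(r"\d[\d,]*", bucket)
    return int(match.group().replace(",", "")) if match else None

# Assumed bucket formats; check them against the real 'Salary' column
buckets = pd.Series(["> $500,000", "1,000-1,999", "$0-999", "100,000-124,999", None])

# Sort the distinct, non-missing buckets by their numeric lower bound
ordered = sorted(buckets.dropna().unique(), key=salary_lower_bound)
print(ordered)
```

The resulting list could then be passed to Plotly Express through its `category_orders` argument, for example `category_orders={'Salary': ordered}`, so that the salary axis is drawn in ascending order.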


After this, let's see how 'Gender', 'Education Level' and 'Job Title' are related to each other.


In [None]:
#Parallel Category graph to compare all three categories
fig=px.parallel_categories(dataset_salary_jobrole)
fig.show()

Don't get overwhelmed by this graph :) . I know it's a little tricky, but all I am trying to do is link the EducationLevel/JobTitle/Gender categories here.

If you hover over the ribbon connecting 'Master's' and 'Data Scientist' you will see the count: there are **1788** **males** with a **master's degree** working as **Data Scientists**. In other words, the *width* of a ribbon shows the strength of the relationship between the categories it connects.

This shows that the majority of data scientists hold either a master's or a doctoral degree. Male data scientists are most numerous with master's and bachelor's degrees, while female data scientists are most numerous with doctoral and master's degrees.
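
The counts shown on hover can also be tabulated, which is easier to scan than hovering over every ribbon. Here is a small sketch using `groupby(...).size()` on hypothetical toy rows; the column names follow the renamed frame above, but the values are made up.

```python
import pandas as pd

# Hypothetical toy rows, not survey data
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "EducationLevel": ["Master's degree", "Master's degree", "Bachelor's degree",
                       "Doctoral degree", "Master's degree"],
    "JobTitle": ["Data Scientist", "Data Scientist", "Data Analyst",
                 "Data Scientist", "Data Scientist"],
})

# One row per observed combination, with the number of respondents in it
combo_counts = (
    toy.groupby(["Gender", "EducationLevel", "JobTitle"])
       .size()
       .reset_index(name="Count")
       .sort_values("Count", ascending=False)
)
print(combo_counts)
```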

Now, let's check this trend in India.

First, let's start with the count for each job title.

In [None]:
#Taking out data entries related to India
dataset_salary_jobrole_india=dataset_mcq[dataset_mcq['In which country do you currently reside?'] == 'India']
#Keeping only relevant columns
dataset_salary_jobrole_india=dataset_salary_jobrole_india[['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?','Select the title most similar to your current role (or most recent title if retired): - Selected Choice','What is your current yearly compensation (approximate $USD)?','What is your gender? - Selected Choice']]
#Replacing column name to short and relevant name
dataset_salary_jobrole_india=dataset_salary_jobrole_india.rename(columns = {'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?':'EducationLevel','Select the title most similar to your current role (or most recent title if retired): - Selected Choice':'JobTitle','What is your current yearly compensation (approximate $USD)?':'Salary','What is your gender? - Selected Choice':'Gender'})
#Different Job roles counts
dataset_salary_jobrole_india['JobTitle'].value_counts().plot.bar()

In India the majority of Kagglers are students, about double the number of Data Scientists; this is very different from the trend in the rest of the world.

Also, 'Statistician' and 'Data Analyst' are far fewer in number compared to the rest of the world.

Let's see how job role and education level are related in India.

In [None]:
#Visualize the Job Role and Education level
fig = px.scatter(dataset_salary_jobrole_india,x='JobTitle',y='Salary',hover_data=['EducationLevel'])
fig.show()

It looks like in India the Data Scientists in the higher salary ranges hold a bachelor's degree rather than a master's or doctoral degree; this scenario is quite different from the rest of the world.

Also, Statisticians are not even in the higher salary ranges here, which is again different from the rest of the world.

Let's see how 'Gender', 'Education Level' and 'Job Title' are related to each other.

In [None]:
#Parallel Category graph to compare all three categories
fig=px.parallel_categories(dataset_salary_jobrole_india)
fig.show()

The majority of Data Scientists in India are males with a bachelor's degree, and very few Data Scientists are female.

Among female data scientists, most hold a master's degree.

**📌 Takeaway points**

* Across the world the majority of respondents are Data Scientists, but in India the majority are Students, which means Kaggle is quite popular among students in India, while in other parts of the world more professionals are on Kaggle
* To be in a higher salary bracket, either a doctoral or a master's degree is required, but in India people with a bachelor's degree are also in the higher salary brackets
* Globally the majority of Data Scientists hold a master's or doctoral degree, but in India the majority hold a bachelor's degree
* Globally Statisticians and Data Analysts are also in the higher salary ranges, but this is not the case in India

>> <a>In India, Data Scientists have mostly not completed a master's or doctoral degree before taking up a job; most likely they started their journey right after their bachelor's degree. Also, higher salary is not linked with higher education, unlike the rest of the world, where higher salary is directly linked with higher education</a>

**Open question:** *The highest salary range is '>500,000 USD', which is very high compared to Indian salary standards, so I am not sure whether records with this value are correct.*

# Which Age group Is Earning More 

Now, let's see how different age groups are performing in the data science field.

First, let's check which age group has the most participants at the global level.

In [None]:
#Taking out job role and age to another data frame for visualization
dataset_age_jobrole=dataset_mcq[['What is your age (# years)?','Select the title most similar to your current role (or most recent title if retired): - Selected Choice','What is your current yearly compensation (approximate $USD)?','What is your gender? - Selected Choice']]
dataset_age_jobrole=dataset_age_jobrole.rename(columns = {'What is your age (# years)?':'AgeGroup','Select the title most similar to your current role (or most recent title if retired): - Selected Choice':'JobTitle','What is your current yearly compensation (approximate $USD)?':'Salary','What is your gender? - Selected Choice':'Gender'})
#Respondent counts per age group
dataset_age_jobrole['AgeGroup'].value_counts().plot.bar()

It's clear that '25-29' year olds form the largest group, followed by '22-24' and '30-34'. This is a young age group, perhaps because data science is still at an early stage and the buzz around it attracts young talent.

Let's check the relation between age group and salary.


In [None]:
#Visualize the Age Group and Salary
fig = px.scatter(dataset_age_jobrole,x='AgeGroup',y='Salary',hover_data=['JobTitle'])
fig.show()

If we hover over '22-24' or '25-29' at '>$500,000' or '300,000-500,000', we can see the 'JobTitle' as *Data Analyst/Statistician/Data Scientist*.

>> <a>Globally, people from young age groups like 22-29 who are working in the data science field are in the higher salary ranges</a>
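
Hovering only shows that such records exist, not how common they are. A rough sketch of how the share of high earners per age group could be measured is given here. The toy rows are hypothetical, and the labels in `top_buckets` are assumed spellings of the two highest salary buckets.

```python
import pandas as pd

# Hypothetical toy rows, not survey data
toy = pd.DataFrame({
    "AgeGroup": ["22-24", "22-24", "25-29", "25-29", "25-29", "40-44"],
    "Salary": ["$0-999", "> $500,000", "1,000-1,999", "$0-999",
               "300,000-500,000", "> $500,000"],
})

# Assumed labels of the two highest salary buckets
top_buckets = ["300,000-500,000", "> $500,000"]

# Fraction of each age group whose salary falls in the top buckets
high_earner_share = (
    toy.assign(HighEarner=toy["Salary"].isin(top_buckets))
       .groupby("AgeGroup")["HighEarner"]
       .mean()
)
print(high_earner_share)
```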


Let's check the relation between 'AgeGroup', 'JobTitle' and 'Gender'.

In [None]:
#Parallel Category graph to compare all three categories
fig=px.parallel_categories(dataset_age_jobrole)
fig.show()

There are more males than females in the '22-29' age group working as *Data Analyst/Statistician/Data Scientist*. This gender gap was quite evident from our previous analysis as well.

Let's check how Indians are doing in this age group. First, let's see which age group has the most participants.

In [None]:
#Taking out data entries related to India
dataset_age_jobrole_india=dataset_mcq[dataset_mcq['In which country do you currently reside?'] == 'India']
#Keeping only relevant columns
dataset_age_jobrole_india=dataset_age_jobrole_india[['What is your age (# years)?','Select the title most similar to your current role (or most recent title if retired): - Selected Choice','What is your current yearly compensation (approximate $USD)?','What is your gender? - Selected Choice']]
#Replacing column name to short and relevant name
dataset_age_jobrole_india=dataset_age_jobrole_india.rename(columns = {'What is your age (# years)?':'AgeGroup','Select the title most similar to your current role (or most recent title if retired): - Selected Choice':'JobTitle','What is your current yearly compensation (approximate $USD)?':'Salary','What is your gender? - Selected Choice':'Gender'})
#Respondent counts per age group
dataset_age_jobrole_india['AgeGroup'].value_counts().plot.bar()

As mentioned, in India the majority of respondents/Kagglers are students, which is why most of them fall in the '18-21' age group.

The next largest age groups after '18-21' fall in '22-29', which is similar to the global trend. Let's check the relation between age group and salary.


In [None]:
#Visualize the Age Group and Salary
fig = px.scatter(dataset_age_jobrole_india,x='AgeGroup',y='Salary',hover_data=['JobTitle'])
fig.show()

On hovering, we can find that in India, out of the young age group ('22-29'), only the '25-29' group includes people working as Data Scientists in the higher salary ranges.



In [None]:
#Parallel Category graph to compare all three categories
fig=px.parallel_categories(dataset_age_jobrole_india)
fig.show()

Another interesting thing to notice here is that, unlike the global trend, only the 'Data Scientist' job title appears in the young high-earner category, and there are not many from other job titles like 'Statistician' and 'Data Analyst'.

If we link this with our previous analysis, then it is possible that: <a>As Indian Data Scientists are not much interested in master's/doctoral degrees, they may be missing traditional statistical/mathematical education, and that might be the reason we are not seeing many 'Statistician' job titles in the young age group</a>

>>Food for thought : <b>Is it valid to say that "Indian Data Scientists are lacking in statistical and mathematical skills, which are the basic skills required in the data science and machine learning field"?</b>

**📌 Takeaway points**

* The young data science workforce in the '22-29' age group is in the higher salary ranges; in India this trend starts a little later, at '25-29'
* Students make up the majority of India's share on Kaggle
* India has fewer Statisticians and Data Analysts compared to Data Scientists
* Looking at the data, it's quite possible that Indian Data Scientists are lacking in statistical and mathematical skills

# Which Learning Media and Online Courses are popular

In the age of social media and the internet, we are highly dependent on online resources to gain knowledge. Let's see how different social platforms are contributing to the data science field globally.


In [None]:
#Collect the count for each media source in a dictionary
mediasource_count_dict = {
    'Twitter' : (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Twitter (data science influencers)'].value_counts().values[0]),
    'Hacker News': (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Hacker News (https://news.ycombinator.com/)'].value_counts().values[0]),
    'Reddit' : (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, r/datascience, etc)'].value_counts().values[0]),
    'Kaggle' : (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (forums, blog, social media, etc)'].value_counts().values[0]),
    'Course Forums' : (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, etc)'].value_counts().values[0]),
    'YouTube' : (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Cloud AI Adventures, Siraj Raval, etc)'].value_counts().values[0]),
    'Podcasts' : (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, Linear Digressions, etc)'].value_counts().values[0]),
    'Blogs' : (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Medium, Analytics Vidhya, KDnuggets etc)'].value_counts().values[0]),
    'Journal Publications' : (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Journal Publications (traditional publications, preprint journals, etc)'].value_counts().values[0]),
    'Slack Communities' : (dataset_mcq['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc)'].value_counts().values[0])
            }
#Convert the dictionary to a series
mediasource_series=pd.Series(mediasource_count_dict)
#Visualizing media source
fig = px.bar(mediasource_series, x=mediasource_series.values, y=mediasource_series.index,orientation='h')
fig.show()

It's evident that **Kaggle** is the winner among the media sources used for learning, followed by **Blogs** (Towards Data Science, Medium, Analytics Vidhya, KDnuggets etc) and then **YouTube channels** (Cloud AI Adventures, Siraj Raval, etc).

This makes sense because Kaggle provides a platform where you can participate and network with other people, learn from other people's work and test/enhance your skills, whereas blogs like 'Towards Data Science, Medium' provide quality reading material on specific data science topics.
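
Each 'select all that apply' option is stored in its own column, filled only for respondents who picked it, so the counts above amount to counting non-null cells per column. A compact sketch of that idea on a toy frame follows. The helper `count_selected` and the shortened `PREFIX` are hypothetical; on the real data the prefix would be the full question text, and the option columns may also include entries such as 'None' or 'Other' that need separate handling.

```python
import pandas as pd

PREFIX = "Favorite media sources - Selected Choice - "  # hypothetical shortened prefix

# Toy frame: one column per option, filled only when the option was selected
toy = pd.DataFrame({
    PREFIX + "Kaggle": ["Kaggle", "Kaggle", None, "Kaggle"],
    PREFIX + "Blogs": ["Blogs", None, None, "Blogs"],
    PREFIX + "YouTube": [None, None, "YouTube", None],
    "Country": ["India", "USA", "India", "UK"],
})

def count_selected(df, prefix):
    """Count non-null answers in every column whose name starts with prefix."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    counts = df[cols].notna().sum()
    counts.index = [c[len(prefix):] for c in cols]
    return counts.sort_values(ascending=False)

media_counts = count_selected(toy, PREFIX)
print(media_counts)
```

The same helper could be applied to an India-only subset of the frame, which avoids repeating the long column names for every comparison.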

Now, let's check how the online course platforms are performing.

In [None]:
#Collect the count for each online course platform in a dictionary
onlinecourses_dict = {
    'Udacity' : (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udacity'].value_counts().values[0]),
    'Coursera': (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera'].value_counts().values[0]),
    'edX' : (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX'].value_counts().values[0]),
    'DataCamp' : (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp'].value_counts().values[0]),
    'DataQuest' : (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataQuest'].value_counts().values[0]),
    'Kaggle Course' : (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Courses (i.e. Kaggle Learn)'].value_counts().values[0]),
    'Fast.ai' : (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai'].value_counts().values[0]),
    'Udemy' : (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udemy'].value_counts().values[0]),
    'LinkedIn Learning' : (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - LinkedIn Learning'].value_counts().values[0]),
    'University Course' : (dataset_mcq['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - University Courses (resulting in a university degree)'].value_counts().values[0])
}
#Convert the dictionary to a series
onlinecourses_series=pd.Series(onlinecourses_dict)
#Visualizing onlinecourses series
fig = px.bar(onlinecourses_series, x=onlinecourses_series.values, y=onlinecourses_series.index,orientation='h')
fig.show()

Well, '**Coursera**' is leading by a huge margin over the other online course platforms, maybe because of Andrew Ng's Machine Learning and Deep Learning fundamentals courses. '**Kaggle**' is in second place, maybe because it offers small, precise and practical courses.

Let's do this analysis for **India**. First we will see how Indians are inclined towards the different media sources.

In [None]:
#Taking out India's participants
dataset_mcq_india=dataset_mcq[dataset_mcq['In which country do you currently reside?'] == 'India']
#Collect the count for each media source in a dictionary
mediasource_count_dict_india = {
    'Twitter' : (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Twitter (data science influencers)'].value_counts().values[0]),
    'Hacker News': (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Hacker News (https://news.ycombinator.com/)'].value_counts().values[0]),
    'Reddit' : (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, r/datascience, etc)'].value_counts().values[0]),
    'Kaggle' : (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (forums, blog, social media, etc)'].value_counts().values[0]),
    'Course Forums' : (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, etc)'].value_counts().values[0]),
    'YouTube' : (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Cloud AI Adventures, Siraj Raval, etc)'].value_counts().values[0]),
    'Podcasts' : (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, Linear Digressions, etc)'].value_counts().values[0]),
    'Blogs' : (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Medium, Analytics Vidhya, KDnuggets etc)'].value_counts().values[0]),
    'Journal Publications' : (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Journal Publications (traditional publications, preprint journals, etc)'].value_counts().values[0]),
    'Slack Communities' : (dataset_mcq_india['Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc)'].value_counts().values[0])
            }
#Convert data dictionary to series
mediasource_series_india=pd.Series(mediasource_count_dict_india)
#Visualizing media source
fig = px.bar(mediasource_series_india, x=mediasource_series_india.values, y=mediasource_series_india.index,orientation='h')
fig.show()

This is similar to the global trend: Kaggle is the winner, followed by Blogs (Towards Data Science, Medium, Analytics Vidhya, KDnuggets, etc.) and then YouTube channels (Cloud AI Adventures, Siraj Raval, etc.).

Interestingly, the gap between 'YouTube' and 'Blogs' is smaller in India than in the global numbers, which suggests Indians give more preference to YouTube than their contemporaries in the rest of the world.
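Because India contributes far fewer respondents than the full survey, raw counts are hard to compare directly. A minimal sketch of one way to put both groups on the same scale, assuming we divide each count by the number of respondents in its group: the helper name `selection_share` and all numbers below are hypothetical and are not survey results.

```python
import pandas as pd

def selection_share(counts, n_respondents):
    """Turn raw selection counts into percentages of respondents."""
    return (counts / n_respondents * 100).round(1)

# Hypothetical counts, for illustration only
world_counts = pd.Series({"Kaggle": 600, "Blogs": 500, "YouTube": 300})
india_counts = pd.Series({"Kaggle": 150, "Blogs": 120, "YouTube": 110})

comparison = pd.DataFrame({
    "World %": selection_share(world_counts, 1000),
    "India %": selection_share(india_counts, 250),
})
# Difference in percentage points: positive means more popular in India
comparison["India - World (pp)"] = comparison["India %"] - comparison["World %"]
print(comparison)
```

With the notebook's data, `len(dataset_mcq)` and `len(dataset_mcq_india)` could serve as the denominators, although the number of respondents who actually answered the question may be a fairer base.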

Now, Indians' favorite online courses:

In [None]:
#Take out online course platforms separately and put them in a data dictionary
onlinecourses_dict_india = {
    'Udacity' : (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udacity'].value_counts().values[0]),
    'Coursera': (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera'].value_counts().values[0]),
    'edX' : (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX'].value_counts().values[0]),
    'DataCamp' : (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp'].value_counts().values[0]),
    'DataQuest' : (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataQuest'].value_counts().values[0]),
    'Kaggle Course' : (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Courses (i.e. Kaggle Learn)'].value_counts().values[0]),
    'Fast.ai' : (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai'].value_counts().values[0]),
    'Udemy' : (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udemy'].value_counts().values[0]),
    'LinkedIn Learning' : (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - LinkedIn Learning'].value_counts().values[0]),
    'University Course' : (dataset_mcq_india['On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - University Courses (resulting in a university degree)'].value_counts().values[0])
}
#Convert data dictionary to series
onlinecourses_series_india=pd.Series(onlinecourses_dict_india)
#Visualizing onlinecourses series
fig = px.bar(onlinecourses_series_india, x=onlinecourses_series_india.values, y=onlinecourses_series_india.index,orientation='h')
fig.show()

In India too, 'Coursera' is the most popular, but 'Udemy' beats 'Kaggle' for second place. I guess the reason behind Udemy's popularity is its cheap and beginner-friendly courses.

**📌 Take Away points**

* Kaggle is one of the most popular learning media across the world, followed by Blogs and YouTube channels. Indians' inclination towards YouTube is stronger compared to their contemporaries.
* Coursera is the favorite online course portal across the world. In India, Coursera's margin of popularity is smaller compared to the rest of the world.
* In India, Udemy is more popular than Kaggle, maybe because of its pocket-friendly courses.

# Choice of Weapon: Programming Language and Visualization Libraries

Any discussion about Data Science is incomplete without discussing programming languages. In this section we will explore the regularly used and the recommended programming languages, and which visualization libraries are most preferred.

Let's start with the programming languages in regular use across the world.

In [None]:
#Take out programming languages separately and put them in a data dictionary
programminglang_dict = {
 'Python' : (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python'].count()),
 'R': (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R'].count()),
 'SQL' : (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL'].count()),
 'C' : (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - C'].count()),
 'C++' : (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - C++'].count()),
 'Java' : (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Java'].count()),
 'Javascript' : (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Javascript'].count()),
 'Typescript' : (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - TypeScript'].count()),
 'Bash' : (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Bash'].count()),
 'MATLAB' : (dataset_mcq['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - MATLAB'].count())
}

#Convert data dictionary to series
programminglang_series=pd.Series(programminglang_dict)
#Visualizing frequently used programming language series
fig = px.scatter(programminglang_series, y=programminglang_series.values, x=programminglang_series.index,size=programminglang_series.values)
fig.show()



For obvious reasons **Python** is leading the game, followed by **SQL** and **R**. Let's see which language is recommended for aspiring data scientists.

In [None]:
#Taking out recommended language in a series
recommendedlang_series = dataset_mcq['What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice'].value_counts()
#Visualizing recommended programming language series
fig = px.scatter(recommendedlang_series, y=recommendedlang_series.values, x=recommendedlang_series.index,size=recommendedlang_series.values)
fig.show()

>> **Note**: The scatter plot above is about **Recommendation**, while the previous plot is about **Regular Usage**.

It's pretty evident from the above plot that, except for **Python**, the next five languages in **Regular Usage** appear in a different order than the languages **Recommended** for learning.
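The difference in ordering can also be checked numerically rather than by eye. A minimal sketch, assuming both series share the same language labels: rank each series and subtract the ranks. The counts below are hypothetical placeholders, not the survey numbers.

```python
import pandas as pd

# Hypothetical counts, for illustration only
regular_usage = pd.Series({"Python": 900, "SQL": 500, "R": 400, "Java": 200})
recommended = pd.Series({"Python": 800, "R": 150, "SQL": 100, "Java": 20})

ranks = pd.DataFrame({
    # rank 1 = highest count; method="min" keeps ranks as whole numbers on ties
    "usage_rank": regular_usage.rank(ascending=False, method="min").astype(int),
    "recommended_rank": recommended.rank(ascending=False, method="min").astype(int),
})
# Positive shift: the language ranks higher as a recommendation than in regular usage
ranks["rank_shift"] = ranks["usage_rank"] - ranks["recommended_rank"]
print(ranks.sort_values("usage_rank"))
```

To apply this to `programminglang_series` and `recommendedlang_series`, the index labels would first have to match exactly (for example, no trailing spaces), and languages missing from one series would need to be dropped before the integer cast.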

Let's check these numbers in India, starting with the regularly used programming languages.

In [None]:
#Take out programming languages for India separately and put them in a data dictionary
programminglang_india_dict = {
 'Python' : (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python'].count()),
 'R': (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R'].count()),
 'SQL' : (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL'].count()),
 'C' : (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - C'].count()),
 'C++' : (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - C++'].count()),
 'Java' : (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Java'].count()),
 'Javascript' : (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Javascript'].count()),
 'Typescript' : (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - TypeScript'].count()),
 'Bash' : (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Bash'].count()),
 'MATLAB' : (dataset_mcq_india['What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - MATLAB'].count())
}
#Convert data dictionary to series
programminglang_india_series=pd.Series(programminglang_india_dict)
#Visualizing frequently used programming languages in India
fig = px.scatter(programminglang_india_series, y=programminglang_india_series.values, x=programminglang_india_series.index,size=programminglang_india_series.values)
fig.show()

Here also **Python** is leading, followed by **SQL** and **R**. Let's see what the recommended language scenario is in India for aspiring data scientists.

In [None]:
#Taking out recommended language from India in a series
recommendedlang_india_series = dataset_mcq_india['What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice'].value_counts()
#Visualizing recommended programming language series
fig = px.scatter(recommendedlang_india_series, y=recommendedlang_india_series.values, x=recommendedlang_india_series.index,size=recommendedlang_india_series.values)
fig.show()

Well, Indians are also following the trend of the rest of the world.

Now, let's check which visualization libraries are most commonly used across the world.

In [None]:
#Take out visualization libraries separately and put them in a data dictionary
visuallib_dict = {
 'Ggplot' : (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Ggplot / ggplot2 '].count()),
 'Matplotlib': (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Matplotlib '].count()),
 'Altair' : (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Altair '].count()),
 'Shiny' : (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Shiny '].count()),
 'D3' : (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  D3.js '].count()),
 'Plotly' : (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Plotly / Plotly Express '].count()),
 'Bokeh' : (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Bokeh '].count()),
 'Seaborn' : (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Seaborn '].count()),
 'Geoplotlib' : (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Geoplotlib '].count()),
 'Leaflet-Folium' : (dataset_mcq['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Leaflet / Folium '].count())
}

#Convert data dictionary to series
visuallib_series=pd.Series(visuallib_dict)
#Visualizing frequently used visualization libraries series
fig = px.scatter(visuallib_series, y=visuallib_series.values, x=visuallib_series.index,size=visuallib_series.values)
fig.show()


**Matplotlib** is at the top, leading by a huge margin, followed by **Seaborn**, **Ggplot** and **Plotly**.

>> **Note**: Matplotlib and Seaborn are plotting libraries for the Python programming language, which is likely why their numbers are so high: Python is the most used language, so the two appear to go together.
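
The link suggested in the note can be tested directly instead of inferred from two separate plots. A minimal sketch, assuming a ticked choice is a non-null cell as in this data set: among respondents who selected one choice, compute the share who also selected another. The helper name `share_among`, the short column names and the toy rows are hypothetical.

```python
import pandas as pd

def share_among(df, condition_col, target_col):
    """Percentage of respondents who selected condition_col that also selected target_col."""
    selected = df[condition_col].notna()
    if selected.sum() == 0:
        return float("nan")  # nobody selected the condition choice
    return round(df.loc[selected, target_col].notna().mean() * 100, 1)

# Toy rows, for illustration only (not survey results)
toy = pd.DataFrame({
    "Python": ["Python", "Python", None, "Python"],
    "R": [None, "R", "R", None],
    "Matplotlib": ["Matplotlib", "Matplotlib", None, None],
})

python_share = share_among(toy, "Python", "Matplotlib")
r_share = share_among(toy, "R", "Matplotlib")
print(python_share, r_share)
```

The same helper could be pointed at the full Python and Matplotlib column names in `dataset_mcq`, or at an algorithm column and a framework column, to check other pairings discussed in this notebook.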

Let's check the same trend in India

In [None]:
#Take out visualization libraries for India separately and put them in a data dictionary
visuallib_india_dict = {
 'Ggplot' : (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Ggplot / ggplot2 '].count()),
 'Matplotlib': (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Matplotlib '].count()),
 'Altair' : (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Altair '].count()),
 'Shiny' : (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Shiny '].count()),
 'D3' : (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  D3.js '].count()),
 'Plotly' : (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Plotly / Plotly Express '].count()),
 'Bokeh' : (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Bokeh '].count()),
 'Seaborn' : (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Seaborn '].count()),
 'Geoplotlib' : (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Geoplotlib '].count()),
 'Leaflet-Folium' : (dataset_mcq_india['What data visualization libraries or tools do you use on a regular basis?  (Select all that apply) - Selected Choice -  Leaflet / Folium '].count())
}

#Convert data dictionary to series
visuallib_india_series=pd.Series(visuallib_india_dict)
#Visualizing frequently used visualization libraries in India
fig = px.scatter(visuallib_india_series, y=visuallib_india_series.values, x=visuallib_india_series.index,size=visuallib_india_series.values)
fig.show()


As expected, the trend is the same in India, for the same reason.

**📌 Take Away points**

* Python is the most used and most recommended language, followed by SQL and R. This is the same in India and across the world.
* Apart from Python, the next five regularly used languages appear in a different order than the languages recommended for aspiring data scientists.
* Matplotlib is the most used visualization library, followed by Seaborn. This is related to the popularity of Python, as both are libraries for the Python programming language.


# Which ML Algorithms, Tools and Frameworks are Famous

In addition to programming languages and visualization libraries, I also want to understand the most used machine learning algorithms, tools and frameworks.

Let's start with machine learning algorithms, and this time let's compare the rest of the world's and India's data at the same time.

In [None]:
#Take out machine learning algorithms separately and put them in a data dictionary
mlaglo_dict = {
 'Linear or Logistic Regression' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Linear or Logistic Regression'].count()),
 'Decision Trees or Random Forests': (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Decision Trees or Random Forests'].count()),
 'Gradient Boosting Machines' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Gradient Boosting Machines (xgboost, lightgbm, etc)'].count()),
 'Bayesian Approaches' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Bayesian Approaches'].count()),
 'Evolutionary Approaches' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Evolutionary Approaches'].count()),
 'Dense Neural Networks (MLPs, etc) ' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Dense Neural Networks (MLPs, etc)'].count()),
 'Convolutional Neural Networks' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Convolutional Neural Networks'].count()),
 'Generative Adversarial Networks ' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Generative Adversarial Networks'].count()),
 'Recurrent Neural Networks' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Recurrent Neural Networks'].count()),
 'Transformer Networks (BERT, gpt-2, etc)' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Transformer Networks (BERT, gpt-2, etc)'].count()),
 'None' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - None'].count()),
 'Other' : (dataset_mcq['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Other'].count()),
}

mlaglo_india_dict = {
 'Linear or Logistic Regression' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Linear or Logistic Regression'].count()),
 'Decision Trees or Random Forests': (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Decision Trees or Random Forests'].count()),
 'Gradient Boosting Machines' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Gradient Boosting Machines (xgboost, lightgbm, etc)'].count()),
 'Bayesian Approaches' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Bayesian Approaches'].count()),
 'Evolutionary Approaches' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Evolutionary Approaches'].count()),
 'Dense Neural Networks (MLPs, etc) ' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Dense Neural Networks (MLPs, etc)'].count()),
 'Convolutional Neural Networks' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Convolutional Neural Networks'].count()),
 'Generative Adversarial Networks ' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Generative Adversarial Networks'].count()),
 'Recurrent Neural Networks' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Recurrent Neural Networks'].count()),
 'Transformer Networks (BERT, gpt-2, etc)' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Transformer Networks (BERT, gpt-2, etc)'].count()),
 'None' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - None'].count()),
 'Other' : (dataset_mcq_india['Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Selected Choice - Other'].count()),
}

#Convert data dictionaries to series
mlaglo_series=pd.Series(mlaglo_dict)
mlaglo_india_series=pd.Series(mlaglo_india_dict)

#Visualizing frequently used machine learning algorithm series
fig = go.Figure(data=[
    go.Bar(name='World', x=mlaglo_series.index, y=mlaglo_series.values),  # dataset_mcq holds all respondents, India included
    go.Bar(name='India',  x=mlaglo_india_series.index, y=mlaglo_india_series.values)
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

Well, **Linear or Logistic Regression** is at the top, followed by **Decision Trees or Random Forests** in second position.

Gradient Boosting Machines is in third position with Convolutional Neural Networks in fourth, but both are very close in numbers. In India the order is reversed, with Convolutional Neural Networks third and Gradient Boosting Machines fourth by a slight margin.

Let's examine the latest machine learning tools.

In [None]:
#Take out machine learning tools separately and put them in a data dictionary
mltool_dict = {
 'Automated data augmentation (e.g. imgaug, albumentations)' : (dataset_mcq['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated data augmentation (e.g. imgaug, albumentations)'].count()),
 'Automated feature engineering/selection (e.g. tpot, boruta_py)': (dataset_mcq['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated feature engineering/selection (e.g. tpot, boruta_py)'].count()),
 'Automated model architecture searches (e.g. darts, enas)' : (dataset_mcq['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated model architecture searches (e.g. darts, enas)'].count()),
 'Automated model selection (e.g. auto-sklearn, xcessiv)' : (dataset_mcq['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated model selection (e.g. auto-sklearn, xcessiv)'].count()),
 'Automated hyperparameter tuning (e.g. hyperopt, ray.tune)' : (dataset_mcq['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated hyperparameter tuning (e.g. hyperopt, ray.tune)'].count()),
 'Automation of full ML pipelines (e.g. Google AutoML, H20 Driverless AI)' : (dataset_mcq['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automation of full ML pipelines (e.g. Google AutoML, H20 Driverless AI)'].count()),
 'None ' : (dataset_mcq['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - None'].count()),
 'Other' : (dataset_mcq['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Other'].count()),
}
 
mltool_india_dict = {
 'Automated data augmentation (e.g. imgaug, albumentations)' : (dataset_mcq_india['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated data augmentation (e.g. imgaug, albumentations)'].count()),
 'Automated feature engineering/selection (e.g. tpot, boruta_py)': (dataset_mcq_india['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated feature engineering/selection (e.g. tpot, boruta_py)'].count()),
 'Automated model architecture searches (e.g. darts, enas)' : (dataset_mcq_india['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated model architecture searches (e.g. darts, enas)'].count()),
 'Automated model selection (e.g. auto-sklearn, xcessiv)' : (dataset_mcq_india['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated model selection (e.g. auto-sklearn, xcessiv)'].count()),
 'Automated hyperparameter tuning (e.g. hyperopt, ray.tune)' : (dataset_mcq_india['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automated hyperparameter tuning (e.g. hyperopt, ray.tune)'].count()),
 'Automation of full ML pipelines (e.g. Google AutoML, H20 Driverless AI)' : (dataset_mcq_india['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Automation of full ML pipelines (e.g. Google AutoML, H20 Driverless AI)'].count()),
 'None ' : (dataset_mcq_india['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - None'].count()),
 'Other' : (dataset_mcq_india['Which categories of ML tools do you use on a regular basis?  (Select all that apply) - Selected Choice - Other'].count()),
 }
 
#Convert data dictionaries to series
mltool_series=pd.Series(mltool_dict)
mltool_india_series=pd.Series(mltool_india_dict)

#Visualizing frequently used machine learning tools series
fig = go.Figure(data=[
    go.Bar(name='World', x=mltool_series.index, y=mltool_series.values),  # dataset_mcq holds all respondents, India included
    go.Bar(name='India',  x=mltool_india_series.index, y=mltool_india_series.values)
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

Looks like none of the mentioned tools is a favored choice and the majority of respondents are not using any of these tools. The trend is the same across the world.

>> <a> Food for thought: Why are people not interested in using automated ML tools? </a>

Next, let's check the most used machine learning frameworks.


In [None]:
#Take out machine learning frameworks separately and put them in a data dictionary

mlfw_dict = {
 'Scikit-learn' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -   Scikit-learn '].count()),
 'TensorFlow': (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -   TensorFlow '].count()),
 'Keras' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Keras '].count()),
 'RandomForest' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  RandomForest'].count()),
 'Xgboost' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Xgboost '].count()),
 'PyTorch' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  PyTorch '].count()),
 'Caret' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Caret '].count()),
 'LightGBM' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  LightGBM '].count()),
 'SparkMLib' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Spark MLib '].count()),
 'Fast.ai' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Fast.ai '].count()),
  'None' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice - None'].count()),
 'Other' : (dataset_mcq['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice - Other'].count())
}

mlfw_india_dict = {
 'Scikit-learn' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -   Scikit-learn '].count()),
 'TensorFlow': (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -   TensorFlow '].count()),
 'Keras' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Keras '].count()),
 'RandomForest' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  RandomForest'].count()),
 'Xgboost' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Xgboost '].count()),
 'PyTorch' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  PyTorch '].count()),
 'Caret' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Caret '].count()),
 'LightGBM' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  LightGBM '].count()),
 'SparkMLib' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Spark MLib '].count()),
 'Fast.ai' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice -  Fast.ai '].count()),
 'None' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice - None'].count()),
 'Other' : (dataset_mcq_india['Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply) - Selected Choice - Other'].count())
 }

 
#Convert data dictionaries to series
mlfw_series=pd.Series(mlfw_dict)
mlfw_india_series=pd.Series(mlfw_india_dict)

#Visualizing frequently used machine learning framework series
fig = go.Figure(data=[
    go.Bar(name='World', x=mlfw_series.index, y=mlfw_series.values),  # dataset_mcq holds all respondents, India included
    go.Bar(name='India',  x=mlfw_india_series.index, y=mlfw_india_series.values)
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

**Scikit Learn** is a clear winner with big margin , followed by **TensorFlow** and **Keras** , trend is same accross the world.

Again , Scikit Learn popularity we can link with Python and Tensor flow is famous becuase it provides excellent functionalities and services when compared to other popular deep learning frameworks

>> Another deduction we can make now: since **Logistic Regression** and **Decision Trees** are the most used algorithms, **Scikit Learn** becomes the most used framework, as it provides implementations of these algorithms

**📌 Takeaway points**
   * The most used algorithms are Logistic Regression, Decision Trees and Random Forests
   * The most used machine learning framework is Scikit Learn
   * Programming language, algorithms and framework are correlated with each other (*Python >> Scikit Learn >> Logistic Regression/Decision Tree-Random Forest*)
   * For machine learning tools, "None" of the listed options is the most common answer
   * All trends are the same in India and the rest of the world

# Relevant Experiences and Activities

Lastly, I want to know how much coding and Data Science experience respondents have and what kind of ML/DS activities they do on a day-to-day basis.

Let's start with Coding experience across the world.

In [None]:
#Taking out relevant columns related to Experience in different data frame
dataset_codemlexp=dataset_mcq[['How long have you been writing code to analyze data (at work or at school)?','For how many years have you used machine learning methods?']]
dataset_codemlexp_india=dataset_mcq_india[['How long have you been writing code to analyze data (at work or at school)?','For how many years have you used machine learning methods?']]

#Replacing column name to short and relevant name
dataset_codemlexp=dataset_codemlexp.rename(columns = {'How long have you been writing code to analyze data (at work or at school)?':'CodeExp','For how many years have you used machine learning methods?':'MLExp'})
dataset_codemlexp_india=dataset_codemlexp_india.rename(columns = {'How long have you been writing code to analyze data (at work or at school)?':'CodeExp','For how many years have you used machine learning methods?':'MLExp'})

#Counting respondents per coding experience bracket for visualization
codeexp_series_india=dataset_codemlexp_india.iloc[:,0].value_counts()
codeexp_series=dataset_codemlexp.iloc[:,0].value_counts()

#Converting ML experience into series for visualization
mlexp_series_india=pd.Series(dataset_codemlexp_india.iloc[:,1])
mlexp_series=pd.Series(dataset_codemlexp.iloc[:,1])

#Visualizing coding experience (horizontal bars: counts on x, brackets on y)
fig = go.Figure(data=[
    go.Bar(name='ROW',x=codeexp_series.values,y=codeexp_series.index,orientation='h'),
    go.Bar(name='India',x=codeexp_series_india.values,y=codeexp_series_india.index,orientation='h')
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

Let's check ML experience now

In [None]:
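#Visualizing ML experience: count respondents per bracket, then plot horizontally
mlexp_counts=mlexp_series.value_counts()
mlexp_counts_india=mlexp_series_india.value_counts()
fig = go.Figure(data=[
    go.Bar(name='ROW', x=mlexp_counts.values,y=mlexp_counts.index,orientation='h'),
    go.Bar(name='India',  x=mlexp_counts_india.values,y=mlexp_counts_india.index,orientation='h')
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()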

Well, the majority of respondents' coding experience falls under either **<1 year** or **1-2 years**; in India, more respondents fall in the less-than-1-year bracket.

In India we have fewer people in the higher ranges of coding experience; I think this is because the majority of respondents from India are students.
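
One way to check the student explanation is to cross-tabulate role against coding experience. A rough sketch on a tiny made-up frame; the column names `Role` and `CodeExp` and every row are assumptions for illustration, not survey data:

```python
import pandas as pd

# Tiny illustrative frame (hypothetical rows, not survey responses)
sample = pd.DataFrame({
    'Role': ['Student', 'Student', 'Student', 'Data Scientist', 'Data Scientist'],
    'CodeExp': ['< 1 years', '< 1 years', '1-2 years', '3-5 years', '1-2 years'],
})

# Share of each experience bracket within every role (each row sums to 1)
exp_by_role = pd.crosstab(sample['Role'], sample['CodeExp'], normalize='index')
print(exp_by_role.round(2))
```

If students are concentrated in the lowest brackets while other roles are not, that would support the explanation; if every role shows the same skew, something else is going on.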

For ML experience we see the same trend: most respondents fall under either **<1 year** or **1-2 years**. In India fewer people are in the **2-3 years** category, unlike the rest of the world, where this category has significant numbers.

>> <a>With this we can deduce that Data Science is dominated by younger people: the majority of them have less than 3 years of experience, whereas in India the majority have less than 2 years</a>
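
A statement like "the majority have less than 3 years" can be checked with a cumulative share over the ordered experience brackets. A minimal sketch, assuming bracket labels similar to the survey's wording; the labels and the responses below are hypothetical placeholders:

```python
import pandas as pd

# Experience brackets in increasing order (labels assumed; adjust to the survey's wording)
brackets = ['< 1 years', '1-2 years', '2-3 years', '3-4 years', '4-5 years']

# Hypothetical responses, not survey data
responses = pd.Series(['< 1 years', '1-2 years', '< 1 years', '2-3 years',
                       '1-2 years', '4-5 years', '< 1 years', '3-4 years'])

# Share per bracket, put in logical order, then accumulated from the lowest bracket up
share = responses.value_counts(normalize=True).reindex(brackets, fill_value=0)
cumulative = share.cumsum()
print(cumulative)

# Fraction of respondents with less than 3 years of experience
under_three = cumulative['2-3 years']
```

Running the same steps on the India subset and on the full data would give two directly comparable numbers.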

Now, let's see which important ML activities people do at work on a day-to-day basis.

Let's start at the global level

In [None]:
#Take out the work activities separately and put them in a data dictionary

mltasks_dict = {
 'Analyze Data' : (dataset_mcq['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Analyze and understand data to influence product or business decisions'].count()),
 'Build-Run DataInfrastructure': (dataset_mcq['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data'].count()),
 'BuildPrototypes' : (dataset_mcq['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build prototypes to explore applying machine learning to new areas'].count()),
 'Build-Run MLService' : (dataset_mcq['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run a machine learning service that operationally improves my product or workflows'].count()),
 'Experiment-Iterate ML Model' : (dataset_mcq['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models'].count()),
 'AdvanceResearch' : (dataset_mcq['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Do research that advances the state of the art of machine learning'].count()),
 'None' : (dataset_mcq['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - None of these activities are an important part of my role at work'].count()),
 'Other' : (dataset_mcq['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Other'].count()),
}

mltasks_india_dict = {
 'Analyze Data' : (dataset_mcq_india['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Analyze and understand data to influence product or business decisions'].count()),
 'Build-Run DataInfrastructure': (dataset_mcq_india['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data'].count()),
 'BuildPrototypes' : (dataset_mcq_india['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build prototypes to explore applying machine learning to new areas'].count()),
 'Build-Run MLService' : (dataset_mcq_india['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run a machine learning service that operationally improves my product or workflows'].count()),
 'Experiment-Iterate ML Model' : (dataset_mcq_india['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models'].count()),
 'AdvanceResearch' : (dataset_mcq_india['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Do research that advances the state of the art of machine learning'].count()),
 'None' : (dataset_mcq_india['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - None of these activities are an important part of my role at work'].count()),
 'Other' : (dataset_mcq_india['Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Other'].count()),
}

#Convert data dictionaries to series
mltasks_series=pd.Series(mltasks_dict)
mltasks_india_series=pd.Series(mltasks_india_dict)

#Visualize the ml tasks globally
fig = px.scatter(mltasks_series, y=mltasks_series.values, x=mltasks_series.index,size=mltasks_series.values,color=mltasks_series.values)
fig.show()

Let's see what Indians are doing

In [None]:
#Visualizing ml tasks for Indians
fig = px.scatter(mltasks_india_series, y=mltasks_india_series.values, x=mltasks_india_series.index,size=mltasks_india_series.values,color=mltasks_india_series.values)
fig.show()

The trend is the same for India and the rest of the world. The majority of respondents spend their time **Analyzing Data**, then **Building Prototypes**, followed by **Experimenting and Iterating on ML Models, Building and Running ML Services and Infrastructure**.
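
The counts plotted above come from one long column name per answer option. A rough sketch of gathering them by their shared question prefix instead of typing each column; `count_choices` is a hypothetical helper, and the toy frame uses shortened, made-up column names rather than the real survey headers:

```python
import pandas as pd

def count_choices(df, question_prefix):
    """Count non-null answers for every column that starts with question_prefix.

    Returns a Series indexed by the option text that follows the prefix.
    """
    cols = [c for c in df.columns if c.startswith(question_prefix)]
    counts = df[cols].count()
    counts.index = [c[len(question_prefix):].strip() for c in cols]
    return counts

# Toy frame with shortened, hypothetical column names (not the real survey headers)
prefix = 'Activities - Selected Choice - '
toy = pd.DataFrame({
    prefix + 'Analyze data': ['Analyze data', None, 'Analyze data'],
    prefix + 'Build prototypes': [None, None, 'Build prototypes'],
    'Country': ['India', 'Japan', 'India'],
})

all_counts = count_choices(toy, prefix)
india_counts = count_choices(toy[toy['Country'] == 'India'], prefix)
print(all_counts)
print(india_counts)
```

The same helper could serve both the global and the India series, so the two dictionaries cannot drift apart through a copy-paste slip.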

>>Looks like organizations/people are not spending much time on "Advance Research", which is required for expanding horizons in Data Science

**📌 Takeaway points**

* The majority of respondents have less than 3 years of coding and ML experience; in India the majority have less than 2 years of experience
* Analyzing Data and Building Prototypes are among the most common activities
* Not much advanced research is happening, or people who are doing advanced research are not spending time on Kaggle :)

# Let's Summarize

With the above EDA I can summarize the following points about trends in India's Data Science field:

* Data Science is dominated by males; this is the same across the world
* The majority of Indian Kagglers are students
* Female data scientists from India are more interested in higher education compared to men; this is the same across the world
* The majority of Indian data scientists hold a Bachelor's degree; this is the opposite of the rest of the world, where the majority of data scientists hold a Master's/Doctoral degree
* Globally, people in the higher salary ranges have completed a Master's/Doctoral degree, but this is not the case with Indians
* The young Data Science workforce in the '22-29' age group is in the higher salary range; in India this trend starts a little later, at the '25-29' age group
* It's quite possible that Indian data scientists are lacking in statistical and mathematical skills
* Coursera and Kaggle are the most common learning mediums in the data science field
* Python is the most used and recommended language across the world, and its popularity has encouraged the usage of the Matplotlib and Seaborn libraries
* Python's popularity is also encouraging the usage of the Scikit Learn framework and the Logistic Regression, Decision Trees/Random Forest algorithms
* The majority of coding and ML experience falls under 3 years; for India it is under 2 years
* People are spending time Analyzing Data and Building Prototypes, and not many people are doing Advance Research

# Last Word !!

I have talked about so many data points here, but how do they make sense and how can I relate them to my objective? I will try to put this in a narrative manner.

*Let's start with <a>What is good for India ?</a>. This is a very exciting time to be in the Data Science field. As established, Indian respondents are large in number and the majority of them are students, so we can say that the young generation is spending a lot of time learning and improving its data science skills, which means there is a high possibility that they will work in the Data Science field in the future.*

*Data scientists in India who are below 30 years of age are in the higher salary bracket while holding only a bachelor's degree, which means that in India a higher salary is not linked with higher education. When it comes to learning, we are on par with our contemporaries in the rest of the world: Kaggle, Coursera, Python and its related frameworks/tools are our choices for learning.*

*Now, let's discuss <a>What needs to be improved ?</a>. Even in 2019 not many women are moving into STEM/Data Science; if we want a flourishing Data Science industry in India, there should be gender diversity.*

*Not a lot of people are going for higher education, which can affect the quality of upcoming data science engineers. I think this could be the reason why the 'Statistician' job title is not common in India. There is no replacement for a university degree, and it is much needed for the success of the upcoming breed of data scientists.*

*As established, Data Science belongs to the young generation, but it needs experienced professionals too. People with a good range of experience in other technical domains should also try their hand at the data science field. This will bring diversity and experience; young blood is full of energy and enthusiasm, but it needs the right direction, which can come from experience and expertise.*

Well folks, I will rest my case now. If you like the work, please upvote and feel free to drop comments for any suggestions/questions.
