# Demographics of respondents: a comparison between Indian and American respondents

## Introduction

Since 2017, Kaggle has conducted an annual industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The 2021 survey was live from 09/01/2021 to 10/04/2021, and 25,973 usable responses from participants in 171 different countries and territories are available to generate insights on industry trends, user demographics, popular tools in industry, etc.<br>

In this notebook I set out to understand the demographics of the respondents and to compare the situation in India and the USA.

## Top Insights

* The men-to-women participation ratio in this survey is 4.21:1.
* 81% of respondents who answered Man or Woman identify as a Man.
* Most respondents (that is, Kaggle users) belong to the 25-29 age group, followed by 18-21.
* The proportion of women is much lower than that of men in every age group.
* DS and ML are more popular with women aged 18-29, whereas there are very few DS and ML enthusiasts above 30 years.
* Most women respondents belong to the 18-21 age group, while most men respondents belong to the 25-29 group.
* Although the difference between the 18-21 and 25-29 age groups is small for both genders, it still raises a question:
#### Are young women showing more interest in DS and ML?<br>
* Indian respondents are younger than American respondents.
* Most American respondents are beyond 30 years of age.
#### Young Indians are showing more interest in DS and ML than young Americans.
#### At least 65 countries are represented in the survey
Countries with fewer than 50 respondents are grouped into "others".
#### Indian respondents outnumber American respondents by a factor of 2.87.
* Kaggle appears to be highly popular among Indians: 30% of total responses are from India.<br>
* About 10% of total responses are from the USA.<br>
* One very interesting observation from this graph: DS and ML are growing in popularity in developing countries such as India, Brazil and Nigeria.
* India and the United States provided the highest volume of survey responses.
* The top 15 countries account for almost 70% of all respondents.
#### Are there more opportunities in DS and ML fields in developing countries?
* Again, most women respondents are from India, followed by the USA.
* But there is an emerging community of women from Egypt, the UK and Nigeria.
* Participation from Japan and China is higher among men than among women.
* Many Kaggle users hold a Master's or Bachelor's degree. This holds for both men and women in the Kaggle community.
* Most Indian respondents hold a Bachelor's degree, while most American respondents hold either a Master's or a Doctoral degree.
* Based on the age groups, it can be assumed that many are contributing to Kaggle, exploring and building models, and participating in Kaggle competitions to gain exposure or to build a profile in DS and ML fields.
* The intention may be to pursue a Master's degree, or to up-skill and move into DS and ML fields in pursuit of new roles and jobs.
#### It is safe to say that Kaggle is helping students and job-seekers gain exposure to DS and ML fields.
* Students are the largest group of respondents, and most of them are from India.
* Among students, the proportion of women is higher than that of men.
* Compared with women respondents, more men already hold positions in data science, data analysis or software fields.
* The proportion of American respondents in Data Scientist and Data Analyst roles is higher than that of Indian respondents.
* Since the USA has established tech hubs, DS and ML account for a larger proportion of various other fields there than in India.
* There are other fields in the USA that are benefiting from Kaggle.
#### Which other fields in the USA (perhaps Medical, Automotive or Startups) are prominent on Kaggle?
The coding-experience plots strengthen our assumption, based on the plots of formal education, that Kaggle is helping students and job seekers to work with data, learn new concepts, gain knowledge, participate in competitions and gain the exposure required to pursue higher education or seek a new role or job.<br>
* Many women respondents have been coding for 1-3 years or less.
* Men have more coding experience than women respondents.
* An almost equal proportion of men and women have 3-5 years of coding experience.
* Twice as many men as women have 20+ years of experience.
#### It looks like India is just getting started with coding and programming.
* Many American respondents have more than 5 years of programming experience.
* The proportion of Americans with 20+ years of programming experience is 5 times that of Indians.
* Less than 5% of American respondents have never written code.
* Respondents with more programming experience earn higher salaries.
* The majority of those earning more than 100,000 USD in yearly compensation have more than 5 years of coding experience.
* Respondents with 20+ years of experience earn more than 200,000 USD.
#### About 20% of respondents drawing more than 1,000,000 USD have 20+ years of experience.
#### Computers/Technology and Academics/Education are the most popular industries.
* The third most popular industry is Accounting/Finance, but it is only about half as popular as Academics/Education.
* Academics/Education and Computers/Technology are popular industries among women.
* The proportion of women in the Computers/Technology industry is about 3% lower than that of men.
* Three times as many women respondents work in Computers/Technology as in Accounting/Finance.
* The proportion of respondents from India in Computers/Technology is twice that in Academics/Education.
* Accounting/Finance, Medical/Pharmaceutical and Government/Public Service are other popular industries among respondents in the USA.
* Hospitality/Entertainment/Sports has the lowest representation.
* Except for Computers/Technology, which is highly popular among Indian respondents, other industries are relatively more popular among Americans.
#### It is likely that most Indian Kaggle users work in the Computers/Technology industry.
#### About 41% of respondents did not disclose their current yearly compensation
* The majority of women respondents draw 0-999 USD as yearly compensation.
* The proportion of men exceeds that of women as the salary band increases.
#### There is a definite pay gap between men and women respondents.
* Indian respondents earn less than their US counterparts.
#### There is a definite pay difference between Indian and American respondents.
* The majority of Americans earn between 100,000 and 200,000 USD.
* Very few Indian respondents draw higher salaries.

## Importing Libraries
This notebook uses the following libraries for analysis.

In [None]:
import numpy as np 
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import re
from wordcloud import WordCloud

#plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
class color:
   BOLD = '\033[1m'
   END = '\033[0m'

## Data Source/ Import Datafile

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
responses_df =pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
responses_df.head()

In [None]:
#responses_df =pd.read_csv('kaggle_survey_2021_responses.csv')
#responses_df.head()

In [None]:
print("Row 0 has detailed questions, for now we will copy all the questions to a list and delete this row from the dataset.")
#copy all questions into a list to be used further
questions = responses_df.loc[0].values.tolist()
responses_df.drop([0], inplace=True)
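
The cell above keeps the question texts in a plain list. As a small sketch of an alternative, the questions can be kept in a dictionary keyed by column code, so the text for `Q2` can be looked up by name later. The toy frame and the names `toy_df`, `question_lookup` and `answers_df` are made up for illustration and are not part of the survey file; unlike the cell above, the sketch also renumbers the remaining rows.

```python
import pandas as pd

# Toy stand-in for the survey file: row 0 holds the question text,
# the remaining rows hold answers (all values here are made up).
toy_df = pd.DataFrame(
    {
        "Q1": ["What is your age (# years)?", "25-29", "18-21"],
        "Q2": ["What is your gender?", "Man", "Woman"],
    }
)

# Map each column code to its question text before dropping row 0.
question_lookup = dict(zip(toy_df.columns, toy_df.iloc[0]))

# Drop the question row and renumber the remaining answer rows from 0.
answers_df = toy_df.drop(index=0).reset_index(drop=True)

print(question_lookup["Q2"])
print(answers_df.shape)
```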

### Shorten some column values for better visualization

In [None]:
# Assign the result back to the column: calling replace(..., inplace=True) on a
# column selection is not guaranteed to modify responses_df in recent pandas versions.
responses_df["Q3"] = responses_df["Q3"].replace({'United States of America':'USA',
                            'United Kingdom of Great Britain and Northern Ireland':'UK'})

responses_df["Q4"] = responses_df["Q4"].replace({'Some college/university study without earning a bachelor’s degree':'Some college/university',
                            'No formal education past high school': 'high school'})

responses_df["Q11"] = responses_df["Q11"].replace({'A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)': 'Cloud Computing Platform',
                            'A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)': 'Deep learning workstation'})

responses_df["Q20"] = responses_df["Q20"].replace({'Online Business/Internet-based Sales': 'Online Business',
                             'Online Service/Internet-based Services': 'Online Service'})

responses_df["Q15"] = responses_df["Q15"].replace({'I do not use machine learning methods': 'I do not use ML'})

# Remove $ symbol from the column

responses_df["Q25"].replace({"$0-999": "0-999",
                             "$500,000-999,999": "500,000-999,999",
                             ">$1,000,000": ">1,000,000" },inplace=True)


In [None]:
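# Keep responses from Men and Women only; .copy() makes survey_2021 an independent
# DataFrame so that columns can be added to it later without a SettingWithCopyWarning
survey_2021 = responses_df[responses_df.Q2.isin(["Man", "Woman"])].copy()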

### Create new columns

In [None]:
# Convert the range and add mid value of range to the survey_2021 df as salary column

condition = [ survey_2021.Q25 == "0-999", survey_2021.Q25 == "1,000-1,999", survey_2021.Q25 =="2,000-2,999",
             survey_2021.Q25 =="3,000-3,999", survey_2021.Q25 =="4,000-4,999", survey_2021.Q25 =="5,000-7,499",
             survey_2021.Q25 =="7,500-9,999", survey_2021.Q25 =="10,000-14,999", survey_2021.Q25 =='15,000-19,999',
             survey_2021.Q25 =="20,000-24,999", survey_2021.Q25 =="25,000-29,999", survey_2021.Q25 =="30,000-39,999",
             survey_2021.Q25 =="40,000-49,999", survey_2021.Q25 =="50,000-59,999", survey_2021.Q25 =="60,000-69,999",
             survey_2021.Q25 =="70,000-79,999", survey_2021.Q25 =="80,000-89,999", survey_2021.Q25 =="90,000-99,999",
             survey_2021.Q25 =="100,000-124,999", survey_2021.Q25 =="125,000-149,999", survey_2021.Q25 =="150,000-199,999",
             survey_2021.Q25 =="200,000-249,999",  survey_2021.Q25 =="250,000-299,999", survey_2021.Q25 =="300,000-499,999",
             survey_2021.Q25 =="500,000-999,999", survey_2021.Q25 ==">1,000,000" ]

value = [ 500, 1500, 2500, 3500, 4500, 6250, 8750, 12500, 17500, 22500, 27500, 35000,
         45000, 55000, 65000, 75000, 85000, 95000, 112500, 137500, 175000, 225000, 275000, 400000, 750000, 1000000]
# Add the Salary column; rows matching no band (e.g. a missing Q25) get np.select's default of 0
survey_2021['Salary'] = np.select(condition, value)


In [None]:
# Build a lookup DataFrame with the salary band in column 0 and the mid value of the band in column 1

condition = ["0-999",  "1,000-1,999", "2,000-2,999", "3,000-3,999", "4,000-4,999", "5,000-7,499",
             "7,500-9,999", "10,000-14,999", '15,000-19,999', "20,000-24,999", "25,000-29,999",
             "30,000-39,999", "40,000-49,999", "50,000-59,999", "60,000-69,999", "70,000-79,999",
             "80,000-89,999", "90,000-99,999", "100,000-124,999", "125,000-149,999", "150,000-199,999",
             "200,000-249,999",  "250,000-299,999", "300,000-499,999", "500,000-999,999", ">1,000,000" ]

value = [ 500, 1500, 2500, 3500, 4500, 6250, 8750, 12500, 17500, 22500, 27500, 35000,
         45000, 55000, 65000, 75000, 85000, 95000, 112500, 137500, 175000, 225000, 275000, 400000, 750000, 1000000]
## add Salary column
#survey_2021['Salary'] = np.select(condition,value)

salary = pd.DataFrame(zip(condition, value))
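
The mid values above are typed in by hand. As a rough sketch, they can also be derived from the band labels themselves. The helper name `band_midpoint` is hypothetical, and using the lower bound for the open-ended top band is an assumption that simply mirrors the 1000000 used in the list above.

```python
def band_midpoint(band: str) -> int:
    """Return a representative value for a compensation band label."""
    cleaned = band.replace("$", "").replace(",", "").strip()
    if cleaned.startswith(">"):
        # Open-ended top band: fall back to its lower bound (assumption).
        return int(cleaned[1:])
    low, high = (int(part) for part in cleaned.split("-"))
    # Bands are inclusive (e.g. 1,000-1,999), so add 1 before halving.
    return (low + high + 1) // 2


bands = ["0-999", "5,000-7,499", "100,000-124,999", ">1,000,000"]
midpoints = {band: band_midpoint(band) for band in bands}
print(midpoints)
```

For the bands checked here the helper reproduces the hand-written values (500, 6250, 112500 and 1000000).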

### Create subsets of Dataframe

In [None]:
#Select only rows with Woman respondents:
Woman=responses_df.loc[responses_df['Q2'] == 'Woman']

#Select only rows with Man respondents:
Man=responses_df.loc[responses_df['Q2'] == 'Man']

#Select only rows with India as country:
India=responses_df.loc[responses_df['Q3'] == 'India']

#Select only rows with USA as country:
USA=responses_df.loc[responses_df['Q3'] == 'USA']

## Insights on Kaggle Survey 2021

In [None]:
# Helper functions that are reused for the visualizations below

def uniquecount_barplot(column):
    fig, ax = plt.subplots(figsize=(10, 6))

    # Percentage of all rows in survey_2021 that fall in each category.
    # The Series is named before to_frame(), so the column name does not depend on
    # what the installed pandas version calls the value_counts() result.
    count_uniques = (survey_2021[column].value_counts() / survey_2021.shape[0] * 100).round(2).rename('Percentage_of_Total').to_frame().sort_values('Percentage_of_Total', ascending=False)
    
    ax = sns.barplot(x=count_uniques.index.values.tolist()  , y='Percentage_of_Total', data=count_uniques, color='peachpuff')#palette= "Set2")
    
    # rotates labels and aligns them horizontally to left 
    plt.setp( ax.xaxis.get_majorticklabels(), rotation=90, ha="left" );
    

def genderwise_comparison_barplot(column):
    plt.rcParams["figure.figsize"] = (10, 6)
    
    proportion = pd.DataFrame(survey_2021['Q2'].groupby(survey_2021[column]).value_counts().unstack()).sort_values('Woman',ascending=False)
    proportion['Man'] = proportion['Man'].apply(lambda x: round((x/Man.shape[0]) * 100, 2))
    proportion['Woman'] = proportion['Woman'].apply(lambda x: round((x/Woman.shape[0]) * 100, 2))

    #count_uniques_gender = pd.DataFrame(survey_2021['Q2'].groupby(survey_2021[column]).value_counts().unstack().sort_values('Woman',ascending=False).reset_index()).set_index(column);
    proportion.plot.bar(color={"Man": "cornflowerblue", "Woman": "lightskyblue"}, width=.8)
    plt.xlabel('')
    plt.ylabel('Proportion')
    plt.legend(title="Gender")


def INDA_USA_comparison_barplot(column):
    plt.rcParams["figure.figsize"] = (10, 6)
    
    a_df_ = pd.DataFrame(survey_2021['Q3'].groupby(survey_2021[column]).value_counts().unstack())
    proportion = a_df_[['India', 'USA']].sort_values('India',ascending=False)
    proportion['India'] = proportion['India'].apply(lambda x: round((x/India.shape[0]) * 100, 2))
    proportion['USA'] = proportion['USA'].apply(lambda x: round((x/USA.shape[0]) * 100, 2))
    
    proportion.plot.bar(color={"India": "lightcoral", "USA": "royalblue"}, width=.8)
    
    plt.xlabel('')
    plt.ylabel('Proportion')
    plt.legend(title="Country")
    
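def cross_plot_salary(column1, column2):
    # Note: this function reads the global variable `threshold`, which must be set before calling it;
    # cells with fewer responses than `threshold` are masked out of the heatmap.
    fig, ax = plt.subplots(figsize=(14, 9))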
    ax.set_title('Title', pad=15)

    pivot_table = survey_2021.groupby([column1,column2]).size().unstack()
    #pivot_table =pd.merge(pivot_table,salary, left_on=column1, right_on=0).sort_values(1).reset_index().drop(columns=['index',0,1])
    pivot_table = pivot_table.fillna(0).astype(int)
    ax = sns.heatmap(pivot_table, mask=pivot_table < threshold, annot=True, fmt="d", cmap="Reds",linewidth = .5, linecolor='whitesmoke', cbar=False)
    ax.set_facecolor("white")
    

## 'What is your gender?'

In [None]:
# create data
size_of_groups=survey_2021['Q2'].value_counts().tolist()
names=['{} % Man'.format(round(size_of_groups[0]/ survey_2021.shape[0] *100, 2)), '{} % Woman'.format(round(size_of_groups[1]/ survey_2021.shape[0] *100, 2))]

# Create a pieplot
plt.rcParams["figure.figsize"] = (7,7)

plt.pie(size_of_groups, colors=["cornflowerblue", "lightskyblue"], labels=names)
plt.title('Gender distribution among respondents', fontsize=15)


# add a circle at the center to transform it in a donut chart
my_circle=plt.Circle((0,0), 0.6, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)

plt.show()

In [None]:
print(color.BOLD +"Men to Women participation ratio in this survey is {}:1 ".format(round((size_of_groups[0]/size_of_groups[1]),2)), "\n"+ color.END)
print("81% of respondents who answered Man or Woman identify as a Man.")

## 'What is your age (# years)?' 

In [None]:
column='Q1'
uniquecount_barplot(column)
plt.title('Age distribution between respondents', fontsize=15)

genderwise_comparison_barplot(column)
plt.title('Age distribution between Men and Women respondents', fontsize=15);

INDA_USA_comparison_barplot(column)
plt.title('Age distribution between Indian and American respondents', fontsize=15);

* Most respondents (that is, Kaggle users) belong to the 25-29 age group, followed by 18-21.
* The proportion of women is much lower than that of men in every age group.
* DS and ML are more popular with women aged 18-29, whereas there are very few DS and ML enthusiasts above 30 years.
* Most women respondents belong to the 18-21 age group, while most men respondents belong to the 25-29 group.
* Although the difference between the 18-21 and 25-29 age groups is small for both genders, it still raises a question:
#### Are young women showing more interest in DS and ML?<br>
* Indian respondents are younger than American respondents.
* Most American respondents are beyond 30 years of age.
#### Young Indians are showing more interest in DS and ML than young Americans.
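
The gender and country comparisons above are read from proportions within each group, not from raw counts, because the groups differ a lot in size. A minimal sketch of that normalization with `pd.crosstab`, on made-up toy rows (the frame `toy` and the result `age_share` are illustrative names, not the notebook's helper functions):

```python
import pandas as pd

# Made-up toy answers: age band (Q1) and gender (Q2).
toy = pd.DataFrame(
    {
        "Q1": ["18-21", "18-21", "25-29", "25-29", "25-29", "30-34"],
        "Q2": ["Woman", "Man", "Man", "Man", "Woman", "Man"],
    }
)

# normalize="columns" divides each count by its column (gender) total,
# so every gender column sums to 100 after scaling.
age_share = (pd.crosstab(toy["Q1"], toy["Q2"], normalize="columns") * 100).round(2)
print(age_share)
```

Read this way, a taller bar for women in 18-21 means that a larger share of the women are in that band, even if fewer women than men answered overall.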

## 'In which country do you currently reside?'

In [None]:
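# Percentage of responses per country; the Series is named before to_frame() so the
# column name does not depend on what pandas calls the value_counts() result
count_uniques = (survey_2021['Q3'].value_counts() / survey_2021.shape[0] * 100).round(2).rename('Percentage_of_Total').to_frame()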

# Create Figure
fig, ax = plt.subplots(figsize=(10,7))

ax = sns.barplot(x=list(count_uniques[0:15].index), y="Percentage_of_Total", data=count_uniques[0:15], color='peachpuff')

# rotates labels and aligns them horizontally to left 
plt.setp( ax.xaxis.get_majorticklabels(), rotation=90, ha="left" )
plt.suptitle('Percentage of responses by country (top 15)', size = 15);

plt.show()

In [None]:
print(color.BOLD +"At least {} countries are represented in the survey".format(count_uniques.shape[0]-1)+ color.END)
print('Countries with fewer than 50 respondents are grouped into "others"')
# iloc[row, 0] picks the percentage value directly instead of indexing a Series with [0]
print(color.BOLD +"Indian respondents outnumber American respondents by a factor of {}.".format(round(count_uniques.iloc[0, 0]/count_uniques.iloc[1, 0],2)), "\n"+ color.END)

* Kaggle appears to be highly popular among Indians: 30% of total responses are from India.<br>
* About 10% of total responses are from the USA.<br>
* One very interesting observation from this graph: DS and ML are growing in popularity in developing countries such as India, Brazil and Nigeria.
* India and the United States provided the highest volume of survey responses.
* The top 15 countries account for almost 70% of all respondents.
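
A statement such as "the top 15 countries account for almost 70%" is a cumulative share. A small sketch of how it can be computed, using a made-up country column (the names `countries`, `share`, `cumulative` and `top_n` are illustrative only):

```python
import pandas as pd

# Made-up country answers, not the survey data.
countries = pd.Series(["India"] * 5 + ["USA"] * 3 + ["Brazil"] + ["Japan"])

# Percentage of responses per country, largest first.
share = countries.value_counts(normalize=True) * 100

# Running total: entry k is the share covered by the k+1 largest countries.
cumulative = share.cumsum()

top_n = 2
print(round(cumulative.iloc[top_n - 1], 2))
```

On the real data, `top_n = 15` applied to `survey_2021['Q3']` would give the figure quoted in the bullet above.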

#### Are there more opportunities for DS and ML fields in Developing countries?


Let's see whether there is any difference in gender distribution between respondents from different countries.<br>
Also, how popular is DS among women in developing countries?

In [None]:
plt.rcParams['font.size'] = 13

# Set the size when creating the figure; changing rcParams["figure.figsize"]
# after the figure exists does not resize it
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

count_uniques_Woman = pd.DataFrame(Woman['Q3'].value_counts())
count_uniques_Man = pd.DataFrame(Man['Q3'].value_counts())

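# Row positions (in descending count order) shown as separate slices;
# position 2 is left out here and therefore ends up in the 'others' slice
index_list=[0,1,3,4,5]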
label_list_f=[]
for i in index_list:
    label_list_f.append(count_uniques_Woman.index[i])
label_list_f.append('others')

percentages_f=[]
for i in index_list:
    percentages_f.append(round((count_uniques_Woman.iloc[i, 0]/Woman.shape[0])*100,2))
percentages_f.append(100-sum(percentages_f))

explode=(0.1,0.1,0.1,0,0,0)

ax1.pie(percentages_f, explode=explode, labels=label_list_f,  
       colors=['lightcoral','royalblue', 'gold', 'pink','lime','grey'], autopct='%1.0f%%', 
       shadow=False, startangle=0,   
       pctdistance=.8,labeldistance=1.1)
ax1.axis('equal')
ax1.set_title("Percentage of Woman respondents country wise")

label_list_m=[]
for i in index_list:
    label_list_m.append(count_uniques_Man.index[i])
label_list_m.append('others')

percentages_m=[]
for i in index_list:
    percentages_m.append(round((count_uniques_Man.iloc[i, 0]/Man.shape[0])*100,2))
percentages_m.append(100-sum(percentages_m))

explode=(0.1,0.1,0.1,0,0,0)

ax2.pie(percentages_m, explode=explode, labels=label_list_m,  
       colors=['lightcoral','royalblue', 'wheat', 'green', 'blueviolet','grey'], autopct='%1.0f%%', 
       shadow=False, startangle=0,   
       pctdistance=.8,labeldistance=1.1)
ax2.axis('equal')
ax2.set_title("Percentage of Man respondents country wise")
plt.tight_layout()

* Again, most women respondents are from India, followed by the USA.<br>
* But there is an emerging community of women from Egypt, the UK and Nigeria.<br>
* Participation from Japan and China is higher among men than among women.
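
The pie charts above show a few named countries plus one "others" slice holding the remainder. A minimal sketch of that grouping in plain Python, assuming we simply keep the `k` most frequent values (the helper `top_k_with_others` is hypothetical and does not skip any position, unlike `index_list` in the cell above):

```python
from collections import Counter


def top_k_with_others(values, k):
    """Return labels and percentages for the k most common values plus 'others'."""
    counts = Counter(values)
    total = sum(counts.values())
    top = counts.most_common(k)
    labels = [name for name, _ in top]
    percentages = [round(count / total * 100, 2) for _, count in top]
    # Everything not in the top k is lumped into a single remainder slice.
    labels.append("others")
    percentages.append(round(100 - sum(percentages), 2))
    return labels, percentages


# Made-up country answers, not the survey data.
answers = ["India"] * 5 + ["USA"] * 3 + ["Egypt"] + ["UK"]
labels, percentages = top_k_with_others(answers, k=2)
print(labels, percentages)
```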

## 'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'


In [None]:
column='Q4'
uniquecount_barplot(column)
plt.title('Formal education of respondents', fontsize=15)

genderwise_comparison_barplot(column)
plt.title('Formal education comparison between Men and Women respondents', fontsize=15);

INDA_USA_comparison_barplot(column)
plt.title('Formal education comparison between Indian and American respondents', fontsize=15);

* Many Kaggle users hold a Master's or Bachelor's degree. This holds for both men and women in the Kaggle community.
* Most Indian respondents hold a Bachelor's degree, while most American respondents hold either a Master's or a Doctoral degree.
* Based on the age groups, it can be assumed that many are contributing to Kaggle, exploring and building models, and participating in Kaggle competitions to gain exposure or to build a profile in DS and ML fields.
* The intention may be to pursue a Master's degree, or to up-skill and move into DS and ML fields in pursuit of new roles and jobs.
#### It is safe to say that Kaggle is helping students and job-seekers gain exposure to DS and ML fields.

## 'Select the title most similar to your current role (or most recent title if retired)?'

In [None]:
column='Q5'
uniquecount_barplot(column)
plt.title('Current role of respondents', fontsize=15)

genderwise_comparison_barplot(column)
plt.title('Current role comparison between Men and Women respondents', fontsize=15);

INDA_USA_comparison_barplot(column)
plt.title('Current role comparison between Indian and American respondents', fontsize=15);

* Students are the largest group of respondents, and most of them are from India.
* Among students, the proportion of women is higher than that of men.
* Compared with women respondents, more men already hold positions in data science, data analysis or software fields.
* The proportion of American respondents in Data Scientist and Data Analyst roles is higher than that of Indian respondents.
* Since the USA has established tech hubs, DS and ML account for a larger proportion of various other fields there than in India.
* There are other fields in the USA that are benefiting from Kaggle.
#### Which other fields in the USA are prominent on Kaggle? (Maybe Medical, Automotive or Startups?)


## 'For how many years have you been writing code and/or programming?'


In [None]:
column='Q6'
uniquecount_barplot(column)
plt.title('Coding/programming experience in years', fontsize=15)

genderwise_comparison_barplot(column)
plt.title('Coding/programming experience in years, by gender', fontsize=15);

INDA_USA_comparison_barplot(column)
plt.title('Coding/programming experience in years, by country', fontsize=15);

These plots strengthen our assumption, based on the plots of formal education, that Kaggle is helping students and job seekers to work with data, learn new concepts, gain knowledge, participate in competitions and gain the exposure required to pursue higher education or seek a new role or job.<br>
* Many women respondents have been coding for 1-3 years or less.
* Men have more coding experience than women respondents.
* An almost equal proportion of men and women have 3-5 years of coding experience.
* Twice as many men as women have 20+ years of experience.
* It looks like India is just getting started with coding and programming.
* The proportion of Americans with 20+ years of programming experience is 5 times that of Indians.
* Less than 5% of American respondents have never written code.

In [None]:
column1='Q25'
column2='Q6'

plt.rcParams["figure.figsize"] = (14, 9)

#pivot_table = survey_2021.groupby([column1,column2]).size().unstack().reset_index()
pivot_table = survey_2021.groupby([column1,column2]).size().unstack()
pivot_table = pd.DataFrame(pd.merge(pivot_table,salary, left_on=column1, right_on=0).sort_values(1).reset_index().drop(columns=['index',0,1]))
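# Sum only the numeric count columns; the text column Q25 must not take part in the row total
pivot_table['row_sum'] = pivot_table.sum(axis = 1, skipna = True, numeric_only = True)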
pivot_table.iloc[:,1:-1] =  pivot_table.iloc[:,1:-1].div(pivot_table.row_sum, axis=0).apply(lambda x: round(x*100,2))
pivot_table.drop(columns = 'row_sum', inplace=True)

pivot_table = pivot_table[['Q25','I have never written code', '< 1 years', '1-3 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']]

# plot data in stack manner of bar type
ax = pivot_table.plot(x=column1, kind='bar', stacked=True, colormap='BuPu');
ax.legend(bbox_to_anchor=(1.1, 1.05))

plt.title('Yearly compensation comparison over years of coding experience', fontsize=15, y=1.05);
plt.xlabel("Yearly compensation band (USD)", labelpad=15, loc= 'center')
plt.ylabel("Proportion of responses", labelpad=15, loc='center');


* Respondents with more programming experience earn higher salaries.
* The majority of those earning more than 100,000 USD in yearly compensation have more than 5 years of coding experience.
* Respondents with 20+ years of experience earn more than 200,000 USD.
* About 20% of respondents drawing more than 1,000,000 USD have 20+ years of experience.
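
The stacked bars above are normalized within each compensation band, so each percentage answers "of the respondents in this band, what share has this much coding experience?". A compact sketch of the same row-wise normalization with `pd.crosstab`, on made-up rows (the names `toy` and `experience_share` are illustrative and this is not the merge-based cell above):

```python
import pandas as pd

# Made-up toy answers: compensation band (Q25) and coding experience (Q6).
toy = pd.DataFrame(
    {
        "Q25": ["0-999", "0-999", "0-999", "0-999",
                "100,000-124,999", "100,000-124,999"],
        "Q6": ["< 1 years", "< 1 years", "1-3 years", "5-10 years",
               "5-10 years", "20+ years"],
    }
)

# normalize="index" divides each count by its row (band) total,
# so every compensation band sums to 100 after scaling.
experience_share = pd.crosstab(toy["Q25"], toy["Q6"], normalize="index") * 100
print(experience_share.round(2))
```

Note that `crosstab` orders the bands alphabetically; sorting them by value (for example with the `salary` lookup table) is still needed before plotting.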


## 'In what industry is your current employer/contract (or your most recent employer if retired)?'

In [None]:
column='Q20'
uniquecount_barplot(column)
plt.title('Industry respondents are employed in', fontsize=15)

genderwise_comparison_barplot(column)
plt.title('Industry respondents are employed in, by gender', fontsize=15);

INDA_USA_comparison_barplot(column)
plt.title('Industry respondents are employed in, by country', fontsize=15);

The heatmap below portrays the number of respondents falling in a particular salary band from each category.

In [None]:
# create heatmap seaborn
column1= 'Q20' # Industry employed in
column2='Q25' # this is salary column
threshold = 50 # minimum number of responses for a cell to be shown in the heatmap

cross_plot_salary(column1, column2)
plt.title("Number of responses for Yearly compensation band industry wise", fontsize = 15, y=1.05)
plt.xlabel("Yearly compensation band", labelpad=15, loc= 'center')
plt.ylabel("Industry employed in ", labelpad=15, loc='center');

#### Computers/Technology and Academics/Education are the most popular industries.
* The third most popular industry is Accounting/Finance, but it is only about half as popular as Academics/Education.
* Academics/Education and Computers/Technology are popular industries among women.
* The proportion of women in the Computers/Technology industry is about 3% lower than that of men.
* Three times as many women respondents work in Computers/Technology as in Accounting/Finance.
* The proportion of respondents from India in Computers/Technology is twice that in Academics/Education.
* Accounting/Finance, Medical/Pharmaceutical and Government/Public Service are other popular industries among respondents in the USA.
* Hospitality/Entertainment/Sports has the lowest representation.
* Except for Computers/Technology, which is highly popular among Indian respondents, other industries are relatively more popular among Americans.
#### It is likely that most Indian Kaggle users work in the Computers/Technology industry.

## 'What is your current yearly compensation (approximate $USD)?'

In [None]:
size_of_groups=survey_2021['Q25'].isnull().value_counts().tolist()
names=['{} % Responded'.format(round(size_of_groups[0]/ survey_2021.shape[0] *100, 2)), '{} % Not responded'.format(round(size_of_groups[1]/ survey_2021.shape[0] *100, 2))]
# Create a pieplot
plt.rcParams["figure.figsize"] = (7,7)

plt.pie(size_of_groups, colors=["darkblue", "lightsteelblue"], labels=names)
plt.title('How many disclosed compensation details', fontsize=15)


# add a circle at the center to transform it in a donut chart
my_circle=plt.Circle((0,0), 0.6, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)

plt.show()

About 41% of respondents did not disclose their current yearly compensation.
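
The share of missing answers can also be read directly from the column without relying on the order of `value_counts()`. A tiny sketch on a made-up series (`q25` and `not_disclosed_pct` are illustrative names):

```python
import pandas as pd

# Made-up compensation answers; None marks a respondent with no recorded value.
q25 = pd.Series(["0-999", None, "1,000-1,999", None, "0-999"])

# isna() gives True/False per row; the mean of booleans is the fraction of True.
not_disclosed_pct = round(q25.isna().mean() * 100, 2)
print(not_disclosed_pct)
```

A blank only says that no value was recorded; on its own it does not tell us why the respondent left it empty.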

In [None]:
column='Q25'
uniquecount_barplot(column)
plt.title('Yearly compensation', fontsize=15)

genderwise_comparison_barplot(column)
plt.title('Gender wise yearly compensation distribution', fontsize=15);

INDA_USA_comparison_barplot(column)
plt.title('Country wise yearly compensation distribution', fontsize=15);

* The majority of women respondents draw 0-999 USD as yearly compensation.
* The proportion of men exceeds that of women as the salary band increases.
#### There is a definite pay gap between men and women respondents.
* Indian respondents earn less than their US counterparts.
#### There is a definite pay difference between Indian and American respondents.
* The majority of Americans earn between 100,000 and 200,000 USD.
* Very few Indian respondents draw higher salaries.
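
One way to put a number on the differences above is to compare the median of the band mid values per group. This is only a sketch on made-up values: the frame `toy` is illustrative, and on the real data the rows where `Salary` is 0 (no band matched, i.e. no disclosed compensation) would have to be filtered out first.

```python
import pandas as pd

# Made-up mid values of compensation bands per respondent.
toy = pd.DataFrame(
    {
        "Q2": ["Man", "Man", "Man", "Woman", "Woman", "Woman"],
        "Salary": [500, 55000, 112500, 500, 1500, 55000],
    }
)

# The median is less sensitive than the mean to the few very high bands.
median_by_gender = toy.groupby("Q2")["Salary"].median()
print(median_by_gender)
```

The same grouping on `Q3` instead of `Q2` would give the India versus USA comparison. Because band mid values are coarse, the result is a rough indicator, not an exact pay figure.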

The heatmap below portrays the number of respondents falling in a particular salary band from each category.

In [None]:
# create heatmap seaborn
column1= 'Q1' # Age in years 
column2='Q25' # this is salary column

threshold = 100 # minimum number of responses for a cell to be shown in the heatmap

cross_plot_salary(column1, column2)
plt.title("Number of responses for each Yearly compensation band for an age group", fontsize = 15)
plt.xlabel("Yearly compensation band", labelpad=15, loc= 'center')
plt.ylabel("Age Group", labelpad=15, loc='center');