<div align='center'><font size="6" color='#088a5a'> Pakistan's Rising Data Science Community</font></div>
<div align='center'><font size="4" color="#158a08"> In-depth comparative analysis of Pakistan's Data Science Community and Data Science/ML/AI landscape in Pakistan vs. Rest of the World</font></div>
<hr>


# Data
Kaggle's fifth annual Machine Learning and Data Science Survey (2021) received 25,973 responses from 171 countries and territories. Pakistan is among the top 10 countries by number of respondents, and its participation grew from 2019 to 2021 (it moved from 19th to 10th place). The data offers insight into what the Data Science community around the world uses and prefers, along with its demographics.

# Objective
Understand and provide a comprehensive view of the state of Data Science and Machine Learning in Pakistan. What are the key areas of interest right now, and what is the progress?

# Why Pakistan
* Pakistan's population of 220M+ people makes it the 5th most populous country, slightly ahead of Brazil. It is an incredibly young population; 115M people are under the age of 25. [1]
* With 25,000 IT/Engg students graduating each year, Pakistan's tech industry is booming. Software Engineers, Data Scientists, ML/AI Engineers, and especially students are very enthusiastic.
* The Pakistani IT industry and startups are on the rise, with 65% growth in 2021. [3] It has been a breakout year: funding this year has outstripped the last six combined. [4]
* Pakistan ranks 6th among the fastest-growing countries for open source projects in the GitHub Report 2020 [2], with 51.5% growth in contributors over the previous year.
* I was born and raised in Pakistan and spend a lot of time sharing knowledge about Data Science/ Data Engineering and ML/AI with various community efforts. I also did this [analysis in 2019](https://www.kaggle.com/muazmaz/pakistan-s-rising-data-science-community) and have wanted to see the growth since then.

# Key Findings
* Most respondents are male, and most are students and/or early in their careers.
* A Bachelor's degree is more common than any other education level.
* Google Cloud Platform is more popular than Amazon Web Services and Microsoft Azure.
* Most of the DS community works in small companies [<50 employees].
* Python is the preferred programming language.
* Kaggle and YouTube are the favorite media sources. Coursera and university courses are the largest learning resources.
* Most of the DS community is new to the field, with <1 year or 1-3 years of coding and ML experience.
* Most users do not run ML methods in production; they are still in the exploration phase.

In [None]:

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import seaborn as sns
sns.color_palette("rocket")
sns.set()
import plotly.graph_objects as go
import plotly.express as px
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

kg21 = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')

country_list_21 = list(kg21.Q3.unique())[1:]
country = 'Pakistan'
column_mapping = {'Q1':'Age',
'Q2':'Gender',
'Q3':'Country',
'Q4':'Education',
'Q5':'Job Title',
'Q21':'Company Size',
'Q22':'Team Size',
'Q23':'ML Status in Company',
'Q25':'Compensation Status',
'Q26':'$ Spent',
'Q6':'Yrs of Coding',
'Q15':'Yrs ML'                 
}

kg21 = kg21.rename(columns= column_mapping)
kg21= kg21.drop([0], errors='ignore')

# Re-usable Functions
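def value_counts(df, column, normalize= True, rename= 'Percentage', return_percent= True):
    # Share (or raw count) of each category in `column`, returned as a two-column frame
    counts = df[column].value_counts(normalize=normalize)
    if normalize and return_percent:
        counts = counts.mul(100)  # fractions -> percentages
    # rename_axis + reset_index(name=...) gives the same column names on every pandas version
    return counts.rename_axis(column).reset_index(name=rename or 'Count')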
def combine_row_country(df_country, df_row):
    # Tag each frame with its geography (on copies, so the inputs are not mutated)
    df_country = df_country.assign(Geography=country)
    df_row = df_row.assign(Geography='ROW')
    column = df_country.columns[0]
    cats_country = list(df_country[column].unique())
    cats_row = list(df_row[column].unique())
    # Pad categories missing on either side with 0% so both groups show the same bars
    padding = [[cat, 0, 'ROW'] for cat in cats_country if cat not in cats_row]
    padding += [[cat, 0, country] for cat in cats_row if cat not in cats_country]
    frames = [df_country, df_row]
    if padding:
        frames.append(pd.DataFrame(padding, columns=[column, 'Percentage', 'Geography']))
    return pd.concat(frames, axis=0).reset_index(drop=True)
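
The percentage-and-padding pattern behind these helpers can be tried on a toy frame. This is a minimal sketch with made-up rows, not survey data; `percent_table` and `padded` are hypothetical names used only for illustration, and the pivot step is one possible way to align categories rather than what the helpers above do.

```python
import pandas as pd

# Toy stand-in for the survey; the values are invented for illustration only.
toy = pd.DataFrame({
    'Country': ['Pakistan', 'Pakistan', 'Pakistan', 'India', 'India', 'Japan', 'Japan', 'Japan'],
    'Age': ['18-21', '22-24', '22-24', '22-24', '25-29', '25-29', '30-34', '30-34'],
})

def percent_table(df, column):
    """Share of each category in `column`, as a percentage (hypothetical helper)."""
    return (df[column].value_counts(normalize=True).mul(100)
            .rename_axis(column).reset_index(name='Percentage'))

pk = percent_table(toy[toy['Country'] == 'Pakistan'], 'Age').assign(Geography='Pakistan')
row = percent_table(toy[toy['Country'] != 'Pakistan'], 'Age').assign(Geography='ROW')

# Align both groups on the union of categories; a category absent from one group becomes 0%.
combined = pd.concat([pk, row])
padded = (combined.pivot(index='Age', columns='Geography', values='Percentage')
          .fillna(0).sort_index())
print(padded)
```

Each column of `padded` sums to 100, and an age bracket that appears in only one group shows up as 0 in the other, which is what keeps grouped bar charts aligned.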

# Overall View of Respondents by Country

In [None]:
# Re-usable Function
def calculate_percent(series):
    return series.value_counts() / len(series)
# rename_axis + reset_index(name=...) gives the same column names on every pandas version
percent_per_country = (calculate_percent(kg21['Country'])*100)\
                                            .rename_axis('country name').reset_index(name='%')

fig = go.Figure(data=go.Choropleth(
    locations=percent_per_country['country name'],
    z = percent_per_country['%'], 
    locationmode='country names', 
    colorscale='bluyl', colorbar_title='%',
    zmax=5
))

fig.update_layout(title={
        'text': "Kaggle Survey Respondents per Country",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
)
fig.show()

# Pakistan vs. Rest of the World
* Where does Pakistan sit in terms of respondents compared to the rest of the world?
* Any specific demographics that are different?

In [None]:
kg21_country = kg21[(kg21['Country'] == country)]
kg21_row = kg21[~(kg21['Country'] == country)]
print("Total Number of Respondents from "+country+": "+str(len(kg21_country)))
print("Total Percentage of "+country+" Respondents: "+ str(round(len(kg21_country) / len(kg21) *100, 2)) + "%") 
total_respondents = len(kg21) - len(kg21_country)

top_respondents = value_counts(kg21, 'Country')
top_5_respondents = top_respondents.head(5)

your_country = top_respondents[top_respondents['Country']==country]
rank =  int( your_country.index[0] + 1 )
filter_top_5 = top_5_respondents[top_5_respondents['Country'] == country]
if len(filter_top_5) == 0:
    top_5_respondents = pd.concat([ your_country, top_5_respondents])

fig, ax = plt.subplots(figsize=(6,6))
plt.axis('equal')
colors_p = ['green','lightblue']
plt.pie(x = [ len(kg21_country), total_respondents], explode= [0.2, 0], labels = [country,'Rest Of World'], startangle=90, shadow=False, colors=colors_p, autopct='%1.1f%%'  );

fig, ax = plt.subplots(figsize=(12,8))
plt.title('Pakistan vs. Top 5 Respondents')
sns.barplot(y = 'Percentage' , x = 'Country', palette='viridis', data = top_5_respondents);

print("Rank of "+country+" by number of respondents is "+str(rank) +" out of " + str(len(top_respondents.index)) + " countries")

# 📌 Points to Note:
* Pakistan ranks 10th out of 66 countries (note: countries/territories with fewer than 50 responses were grouped into “Other”).
* For a developing country this is very promising: Pakistan jumped into the top 10 by number of responses, up from 19th position in 2019.
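
Because the “Other” bucket is itself a row in the country counts, a rank computed from those counts depends on whether that bucket is included. The sketch below uses invented response counts (not survey figures) to show the difference; `rank_with_other` and `rank_without_other` are hypothetical names.

```python
import pandas as pd

# Hypothetical responses, invented for illustration only.
responses = pd.Series(['India'] * 5 + ['Other'] * 4 + ['USA'] * 3 + ['Pakistan'] * 2)
counts = responses.value_counts()

# rank(method='min') gives tied countries the same (best) position.
rank_with_other = int(counts.rank(ascending=False, method='min')['Pakistan'])
rank_without_other = int(counts.drop('Other').rank(ascending=False, method='min')['Pakistan'])

print(rank_with_other, rank_without_other)
```

When quoting a rank, it helps to state which of the two conventions was used.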

# Age Grouping: Pakistan vs. Rest of the World
* What is the typical age group of the population?
* How does the age grouping compare to the Rest of the World?

In [None]:
age = value_counts(kg21_country, 'Age')
age_row = value_counts(kg21_row, 'Age')
age_concat = combine_row_country(age, age_row)


fig, ax = plt.subplots(figsize=(12,8))
plt.title('Age Group for '+country)
sns.barplot(y = 'Percentage' , x = 'Age', palette='viridis', data = age);

In [None]:
agec_sort=age_concat.sort_values(by=['Age'],ascending=True)
# Create the figure first so that the size and title apply to this plot
fig, ax = plt.subplots(figsize=(15,14.27))
ax.set_title('Age Grouping of Pakistan vs. Rest of the World')
sns.stripplot(x="Age", y="Percentage", hue="Geography",palette='plasma', data=agec_sort, ax=ax);

# 📌 Points to Note
* The population is mostly young, i.e., < 30 years of age.
* The share of respondents aged 22-24 is significantly higher than in the rest of the world, followed by ages 18-21. This matches the picture of an incredibly young population (115M people are under 25).

# Gender Distribution: Pakistan vs. Rest of the World
* What is the gender distribution?
* How does the gender distribution compare to the Rest of the World?

In [None]:
# Use the shared helper so the category column is always named 'Gender'
gender = value_counts(kg21_country, 'Gender')
fig, ax = plt.subplots(figsize=(8,6))
sns.barplot(x='Gender', y ='Percentage',  palette='viridis',data= gender);

# Age and Gender Distribution

In [None]:
filter_age_gender = kg21_country.groupby(['Gender','Age']).count().reset_index()[['Gender','Age','Country']]
filter_age_gender =  filter_age_gender.loc[filter_age_gender['Gender'].isin(['Man', 'Woman'])]
fig, ax = plt.subplots(figsize=(16,8))
plt.title('Male vs Female Respondents in terms of Age')
sns.barplot(x='Age', y ='Country', hue='Gender', palette='viridis',data= filter_age_gender,  ax=ax);
ax.set_xlabel('Age');
ax.set_ylabel('Count');

# 📌 Points to Note
* Most respondents are male and aged 24 or younger.

# Education: Pakistan vs. Rest of the World
* What is the Education level?
* How does the education level compare to the Rest of the World?

In [None]:
degrees_propotion = value_counts(kg21_country,'Education') 
degrees_propotion_row = value_counts(kg21_row,'Education')

degree_concat = combine_row_country(degrees_propotion, degrees_propotion_row)

fig, ax = plt.subplots(figsize=(8,6))
ax.set_title('Degree Holders')
sns.barplot(x='Percentage', y ='Education',  palette='viridis',data= degrees_propotion);

fig, ax = plt.subplots(figsize=(8,6))
ax.set_title('Degree Holders vs. ROW')
sns.barplot(x='Percentage', y ='Education',  hue='Geography', palette='viridis',data= degree_concat);

# 📌 Points to Note
* Compared to the rest of the world, a Bachelor's degree is more common than a Master's. This could be due to the age group responding to the survey.
* The share of the "Some college/university without earning a bachelor's degree" group is close to that of the rest of the world, which is very interesting!

# Education and Age Relationship
* Does the age distribution differ by education level?
* What characterizes the "some college/university without a degree" group?

In [None]:
age_deg_gdr = kg21_country.groupby(['Education','Age']).agg('count').reset_index()[['Education', 'Age', 'Gender']]

fig, ax = plt.subplots(figsize=(16,8))

colors_p = ['#21918c','#440154','#3b528b','#4ac16d','#365c8d','#5ec962','#fde725']
# pivot() takes keyword arguments only in recent pandas versions
age_deg_gdr.pivot(index='Age', columns='Education', values='Gender').plot.bar(stacked=True,width=0.8, color=colors_p, ax=ax);
ax.set_xlabel('Age Group');
ax.set_ylabel('Count');
ax.set_title('Education and Age Groups');

# 📌 Points to Note
* Respondents without a formal degree are mostly young (possibly still pursuing their education).
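
As an alternative to grouping and pivoting, `pd.crosstab` builds the same Age-by-Education count table directly and fills empty combinations with 0. The sketch below runs on an invented toy frame (not survey data); `table` and `row_share` are hypothetical names.

```python
import pandas as pd

# Toy rows invented for illustration; the real notebook uses kg21_country.
toy = pd.DataFrame({
    'Age': ['18-21', '18-21', '22-24', '22-24', '25-29'],
    'Education': ["Bachelor's degree", 'Some college', "Bachelor's degree",
                  "Master's degree", "Master's degree"],
})

# Counts per (Age, Education); combinations that never occur become 0 instead of NaN.
table = pd.crosstab(toy['Age'], toy['Education'])

# Percentage of each education level within an age group (each row sums to 100).
row_share = pd.crosstab(toy['Age'], toy['Education'], normalize='index') * 100

print(table)
print(row_share.round(1))
```

A table like `table` can be passed to `.plot.bar(stacked=True)` to draw a stacked chart, while `row_share` makes age groups of different sizes comparable.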


# Money Matters
* What is the salary distribution?
* Does it differ between male and female respondents?

In [None]:
filter_gender_salary = kg21_country.groupby(['Compensation Status', 'Gender']).count().reset_index()[['Compensation Status','Gender','Country']]
filter_gender_salary = filter_gender_salary[filter_gender_salary['Gender'].isin(['Man','Woman'])]

fig, ax = plt.subplots(figsize=(20,10))
plt.xticks(rotation=75)
sns.barplot(x='Compensation Status', y ='Country', hue='Gender', palette='viridis',data= filter_gender_salary,  ax=ax);
ax.set_title('Gender and Salary Distribution');
ax.set_xlabel('Compensation Status');
ax.set_ylabel('Count');  # the y-axis shows the number of respondents per bracket


# 📌 Points to Note
* The most common range is 0-999 USD, which is in line with common salaries given that 1 USD = ~170 PKR.
* A few female respondents earn 3k-4k USD. I want to know more about them.
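
Bracket labels stored as text sort alphabetically, so a label such as `10,000-14,999` lands before `2,000-2,999` on a chart axis. One way around this is to order the labels by their lower bound. This is a minimal sketch that assumes labels shaped like the toy values below; `lower_bound` is a hypothetical helper.

```python
import re
import pandas as pd

# Toy bracket labels; the exact label format is an assumption.
brackets = pd.Series(['10,000-14,999', '$0-999', '2,000-2,999', '1,000-1,999', '$0-999'])

def lower_bound(label):
    """Parse the first number in a bracket label, ignoring '$' and thousands separators."""
    match = re.search(r'\d[\d,]*', label)
    return int(match.group().replace(',', ''))

# Order the distinct labels numerically, then make the column an ordered categorical.
ordered_labels = sorted(brackets.unique(), key=lower_bound)
ordered = pd.Series(pd.Categorical(brackets, categories=ordered_labels, ordered=True))

# sort_index() on a categorical index follows the category order, not the alphabet.
counts = ordered.value_counts().sort_index()
print(counts)
```

Plotting from a frame ordered this way keeps the x-axis in ascending salary order.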

# Job Title/Roles: Pakistan vs. Rest of the World
* What are common titles/roles for these people?
* How do they compare to the rest of the world?

In [None]:
job_title = value_counts(kg21_country, 'Job Title')
job_title_row = value_counts(kg21_row, 'Job Title')

job_title_concat = combine_row_country(job_title, job_title_row)

fig, ax= plt.subplots(figsize=(12,8))
plt.xticks(rotation=75)
sns.barplot(x='Job Title', y ='Percentage', palette='viridis',data= job_title,ax=ax)



fig, ax= plt.subplots(figsize=(12,8))
plt.xticks(rotation=75)
sns.barplot(x='Job Title', y ='Percentage',hue='Geography', palette='viridis',data= job_title_concat,ax=ax);

# 📌 Points to Note
* Clearly most of them are Students, which is in line with the age group and earlier findings.
* Data Scientists and Software Engineers are next in terms of titles.
* Statistician, Data Engineer, and Analyst roles have increased compared to 2019.

# Where do they work?
* What is a typical company size and how does it compare to rest of the world?

In [None]:
company_size = value_counts(kg21_country, 'Company Size')
company_size_row = value_counts(kg21_row, 'Company Size')
company_size_concat = combine_row_country(company_size, company_size_row)

fig, ax= plt.subplots(figsize=(8,4))
sns.barplot(x='Percentage', y ='Company Size', palette='viridis',data= company_size,ax=ax);

fig, ax= plt.subplots(figsize=(8,4))
sns.barplot(x='Percentage', y ='Company Size', hue='Geography',palette='viridis',data= company_size_concat,ax=ax);

# 📌 Points to Note
* The majority work in companies with < 50 employees, which is significantly different from the rest of the world.
* Significantly fewer people work at big companies (> 10k employees), which is consistent with the age group and with the limited presence of international companies in Pakistan.

# Current Use of ML/AI: Pakistan vs. Rest of the World
* At what stage of ML adoption are respondents' companies, and how does that compare to the rest of the world?

In [None]:
current_ML = value_counts(kg21_country, column ='ML Status in Company') 
current_ML_row = value_counts(kg21_row, column ='ML Status in Company') 
current_ML_concat = combine_row_country(current_ML, current_ML_row)

# Shorter display labels for the survey answers (named so as not to shadow the built-in dict)
ml_status_labels = {'We are exploring ML methods (and may one day put a model into production)' : 'Exploring ML methods \n(May put a model into production one day)',
        'We recently started using ML methods (i.e., models in production for less than 2 years)': 'Recently started using ML methods \n(models in production < 2 years)',
       'We have well established ML methods (i.e., models in production for more than 2 years)':'Well established ML methods \n(models in production > 2 years)',
       'No (we do not use ML methods)':'No, we do not use ML methods',
       'I do not know':'No idea',
       'We use ML methods for generating insights (but do not put working models into production)':'ML models for Generating insights \n(but do not put working models into production)'}

current_ML_concat['ML Status in Company'] = current_ML_concat['ML Status in Company'].map(ml_status_labels)


fig, ax= plt.subplots(figsize=(8,8))
# state of ML in Production
sns.barplot(x='Percentage', y ='ML Status in Company', hue='Geography',palette='viridis',data= current_ML_concat,ax=ax);


# 📌 Points to Note
* Most of them are in exploration mode or have only recently started to use ML methods.
* The rest of the world has a higher share of well-established ML methods, so Pakistan has a lot to learn.

# $ Spent on ML/AI efforts: Pakistan vs. Rest of the World
* Amount of money spent on ML?

In [None]:
money = value_counts(kg21_country, '$ Spent')
money_row = value_counts(kg21_row, '$ Spent')
money_concat = combine_row_country(money, money_row)

fig, ax= plt.subplots(figsize=(8,8))
plt.xticks(rotation=45)
sns.barplot(x='Percentage', y ='$ Spent', hue='Geography',palette='viridis',data= money_concat,ax=ax);

# 📌 Points to Note
* Very little to no dollars are being spent right now, which is consistent with most of them being students.

# Years of Coding and ML: Pakistan vs. Rest of the World
* How many years of experience in coding and ML?
* How does it compare to the rest of the world?

In [None]:
yrs_code = value_counts(kg21_country, 'Yrs of Coding')
yrs_code_row = value_counts(kg21_row, 'Yrs of Coding')
yrs_code_concat = combine_row_country(yrs_code, yrs_code_row)

fig, ax= plt.subplots(figsize=(8,4))
sns.barplot(x='Percentage', y ='Yrs of Coding', palette='viridis',data= yrs_code,ax=ax)

fig, ax= plt.subplots(figsize=(8,4))
sns.barplot(x='Percentage', y ='Yrs of Coding', hue='Geography',palette='viridis',data= yrs_code_concat,ax=ax);

# Years of ML

In [None]:
yrs_ml = value_counts(kg21_country, column ='Yrs ML')
yrs_ml_row = value_counts(kg21_row, column ='Yrs ML')

yrs_ml_concat = combine_row_country(yrs_ml, yrs_ml_row)
# Create the figure first so that the size and title apply to this plot
fig, ax = plt.subplots(figsize=(11.7,18.27))
ax.set_title('Yrs of ML used in Pakistan vs. Rest of the World')
plt.xticks(rotation=45)
sns.stripplot(x="Yrs ML", y="Percentage", hue="Geography",palette='viridis', data=yrs_ml_concat, ax=ax);

#  📌 Points to Note
* Most of the population has < 3 years of coding and ML experience.
* Pakistan is lacking in experienced ML developers and users. This should change in the next 2-5 years.

In [None]:
# Re-usable Function
def multiple_answers(limit, df, col, mod_name):
    # Each option of a multi-select question has its own column (col + '1', col + '2', ...);
    # a non-null cell holds the option's label and means "selected".
    counts = {}

    for i in range(1, limit+1):
        answers = df[col + str(i)].dropna()
        if answers.empty:
            continue  # nobody in this subset picked the option, so there is no label to read
        counts[answers.iloc[0]] = len(answers)

    new_df = pd.DataFrame({mod_name: list(counts.keys()), 'Count': list(counts.values())})
    return new_df.sort_values(by='Count', ascending=False)
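
How the multi-select columns are tallied can be seen on a toy frame. This sketch uses invented answers under `Q42_Part_`-style column names and also shows that a share of selections (what the charts below plot) is not the same as a share of respondents; `tally`, `share_of_selections`, and `share_of_respondents` are hypothetical names.

```python
import pandas as pd

# Toy multi-select answers, invented for illustration: one column per option,
# holding the option label when selected and None otherwise.
toy = pd.DataFrame({
    'Q42_Part_1': ['Kaggle', None, 'Kaggle', 'Kaggle'],
    'Q42_Part_2': [None, 'YouTube', 'YouTube', None],
    'Q42_Part_3': [None, None, None, None],  # an option nobody picked
})

counts = {}
for col in toy.columns:
    picked = toy[col].dropna()
    if not picked.empty:  # skip options with no selections
        counts[picked.iloc[0]] = len(picked)

tally = pd.Series(counts, name='Count').sort_values(ascending=False)

# Share of all selections (sums to 100) vs. share of respondents (can exceed 100 in total).
share_of_selections = tally / tally.sum() * 100
share_of_respondents = tally / len(toy) * 100

print(share_of_selections)
print(share_of_respondents)
```

Here Kaggle accounts for 60% of selections but was picked by 75% of respondents, so the y-axis label matters when reading these charts.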

# Media and Learning Sources: Pakistan vs. Rest of the World
* Which media sources are commonly followed and used for learning?
* On which platforms does the community take courses?
* How does it compare to the rest of the world?

In [None]:
limit = 11

media = multiple_answers(limit, kg21_country, 'Q42_Part_', 'Media Sources')
media['Percentage']= (media.Count /  media.Count.sum()) * 100
media_row = multiple_answers(limit, kg21_row, 'Q42_Part_', 'Media Sources')
media_row['Percentage']= (media_row.Count /  media_row.Count.sum()) * 100
media_concat = combine_row_country(media, media_row)


fig, ax= plt.subplots(figsize=(8,8))
sns.barplot(x='Percentage', y ='Media Sources', hue='Geography',palette='viridis',data= media_concat,ax=ax);

In [None]:
limit = 11

courses = multiple_answers(limit, kg21_country, 'Q40_Part_', 'Courses')
courses['Percentage']= (courses.Count / courses.Count.sum()) * 100
courses_row = multiple_answers(limit, kg21_row, 'Q40_Part_', 'Courses')
courses_row['Percentage']= (courses_row.Count /  courses_row.Count.sum()) * 100
courses_concat = combine_row_country(courses, courses_row)


fig, ax= plt.subplots(figsize=(8,8))
sns.barplot(x='Percentage', y ='Courses', hue='Geography',palette='viridis',data= courses_concat,ax=ax);

# 📌 Points to Note
* Kaggle and YouTube are favorite media sources.
* Podcasts and Reddit are not common in Pakistan as compared to the rest of the world.

# Favorite Programming Language

In [None]:
# Count the selections for each programming language and collect them in a dictionary
programminglang_pk_dict = {
 'Python' : (kg21_country['Q7_Part_1'].count()),
 'R': (kg21_country['Q7_Part_2'].count()),
 'SQL' : (kg21_country['Q7_Part_3'].count()),
 'C' : (kg21_country['Q7_Part_4'].count()),
 'C++' : (kg21_country['Q7_Part_5'].count()),
 'Java' : (kg21_country['Q7_Part_6'].count()),
 'Javascript' : (kg21_country['Q7_Part_7'].count()),
 'Julia' : (kg21_country['Q7_Part_8'].count()),
 'Swift' : (kg21_country['Q7_Part_9'].count()),
 'Bash' : (kg21_country['Q7_Part_10'].count()),
 'MATLAB' : (kg21_country['Q7_Part_11'].count()),
 'None' : (kg21_country['Q7_Part_12'].count()),
 'Other' : (kg21_country['Q7_OTHER'].count())
}
# Convert the dictionary to a Series
programminglang_pk_series=pd.Series(programminglang_pk_dict)
fig = px.scatter(programminglang_pk_series, y=programminglang_pk_series.values, x=programminglang_pk_series.index,size=programminglang_pk_series.values,color_discrete_sequence=px.colors.qualitative.Dark2)
fig.show();

# 📌 Points to Note
* Python is the winner, which is consistent with the rest of the world.
* Almost no Swift or Julia, which makes sense as they are relatively new.

# Cloud Computing Platform: Pakistan vs. Rest of the World
* What Cloud Computing Platform are they currently using?
* What Cloud computing platforms do they hope to become more familiar with within the next two years?
* How does it compare to the rest of the world?

In [None]:
cc_row_dict = {
 'AWS': (kg21_row['Q27_A_Part_1'].count()),
 'Microsoft Azure' : (kg21_row['Q27_A_Part_2'].count()),
 'GCP' : (kg21_row['Q27_A_Part_3'].count()),
 'IBM Cloud/ Red Hat' : (kg21_row['Q27_A_Part_4'].count()),
 'Oracle Cloud' : (kg21_row['Q27_A_Part_5'].count()),
 'SAP Cloud' : (kg21_row['Q27_A_Part_6'].count()),
 'Salesforce Cloud' : (kg21_row['Q27_A_Part_7'].count()),
 'VMware Cloud' : (kg21_row['Q27_A_Part_8'].count()),
 'AliBaba Cloud' : (kg21_row['Q27_A_Part_9'].count()),
 'Tencent Cloud' : (kg21_row['Q27_A_Part_10'].count()),
 'None' : (kg21_row['Q27_A_Part_11'].count()),
 'Other' : (kg21_row['Q27_A_OTHER'].count())
}

cc_pak_dict = {
 'AWS': (kg21_country['Q27_A_Part_1'].count()),
 'Microsoft Azure' : (kg21_country['Q27_A_Part_2'].count()),
 'GCP' : (kg21_country['Q27_A_Part_3'].count()),
 'IBM Cloud/ Red Hat' : (kg21_country['Q27_A_Part_4'].count()),
 'Oracle Cloud' : (kg21_country['Q27_A_Part_5'].count()),
 'SAP Cloud' : (kg21_country['Q27_A_Part_6'].count()),
 'Salesforce Cloud' : (kg21_country['Q27_A_Part_7'].count()),
 'VMware Cloud' : (kg21_country['Q27_A_Part_8'].count()),
 'AliBaba Cloud' : (kg21_country['Q27_A_Part_9'].count()),
 'Tencent Cloud' : (kg21_country['Q27_A_Part_10'].count()),
 'None' : (kg21_country['Q27_A_Part_11'].count()),
 'Other' : (kg21_country['Q27_A_OTHER'].count())
}

# Convert the dictionaries to Series
cc_row_series=pd.Series(cc_row_dict)
cc_pak_series=pd.Series(cc_pak_dict)

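# Plot each platform's share of selections so that groups of very different size are comparable
fig = go.Figure(data=[
    go.Bar(name='Pakistan',  x=cc_pak_series.index, y=cc_pak_series / cc_pak_series.sum() * 100),
    go.Bar(name='Rest of World',  x=cc_row_series.index, y=cc_row_series / cc_row_series.sum() * 100)
])


fig.update_layout(barmode='group',title_text='Cloud Computing Platform Currently Used', yaxis_title='% of selections')
fig.show();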


In [None]:
limit = 11

cloud_comp = multiple_answers(limit, kg21_country, 'Q27_B_Part_', 'Cloud Computing')
cloud_comp['Percentage']= (cloud_comp.Count / cloud_comp.Count.sum()) * 100
cloud_comp_row = multiple_answers(limit, kg21_row, 'Q27_B_Part_', 'Cloud Computing')
cloud_comp_row['Percentage']= (cloud_comp_row.Count /  cloud_comp_row.Count.sum()) * 100
cloud_comp_concat = combine_row_country(cloud_comp, cloud_comp_row)


fig, ax= plt.subplots(figsize=(8,8))
ax.set_title('Cloud Computing Interest for next 2 years')
sns.barplot(x='Percentage', y ='Cloud Computing', hue='Geography',palette='viridis',data= cloud_comp_concat,ax=ax);


# 📌 Points to Note
* Google Cloud Platform is used more than Amazon Web Services and Microsoft Azure, which is different from the rest of the world. Google Cloud programs are very active in universities, and Pakistan has a huge GDG community!
* There is an appetite to learn Amazon Web Services (AWS) and Microsoft Azure in the next 2 years.

# Data Scientist Choices: Pakistan vs. Rest of the World
* Which integrated development environments (IDEs) do respondents use on a regular basis?
* How does it compare to the rest of the world?

In [None]:
limit = 12

ide = multiple_answers(limit, kg21_country, 'Q9_Part_', 'IDE')
ide['Percentage']= (ide.Count /  ide.Count.sum()) * 100
ide_row = multiple_answers(limit, kg21_row, 'Q9_Part_', 'IDE')
ide_row['Percentage']= (ide_row.Count /  ide_row.Count.sum()) * 100
ide_concat = combine_row_country(ide, ide_row)


fig, ax= plt.subplots(figsize=(8,8))
sns.barplot(x='Percentage', y ='IDE', hue='Geography',palette='viridis',data= ide_concat,ax=ax);

# Which Visualization Library or Tool is Popular?

In [None]:
limit = 11

bi = multiple_answers(limit, kg21_country, 'Q14_Part_', 'Visualization Lib')
bi['Percentage']= (bi.Count /  bi.Count.sum()) * 100
bi_row = multiple_answers(limit, kg21_row, 'Q14_Part_', 'Visualization Lib')
bi_row['Percentage']= (bi_row.Count /  bi_row.Count.sum()) * 100
bi_concat = combine_row_country(bi, bi_row)


fig, ax= plt.subplots(figsize=(8,8))
sns.barplot(x='Percentage', y ='Visualization Lib', hue='Geography',palette='viridis',data= bi_concat,ax=ax);

# Machine Learning Frameworks Popularity

In [None]:
limit = 17

ml = multiple_answers(limit, kg21_country, 'Q16_Part_', 'ML')
ml['Percentage']= (ml.Count / ml.Count.sum()) * 100
ml_row = multiple_answers(limit, kg21_row, 'Q16_Part_', 'ML')
ml_row['Percentage']= (ml_row.Count /  ml_row.Count.sum()) * 100
ml_concat = combine_row_country(ml, ml_row)


fig, ax= plt.subplots(figsize=(8,8))
sns.barplot(x='Percentage', y ='ML', hue='Geography',palette='viridis',data= ml_concat,ax=ax);

In [None]:
limit = 11

ml_algo = multiple_answers(limit, kg21_country, 'Q17_Part_', 'ML Algorithms')
ml_algo['Percentage']= (ml_algo.Count / ml_algo.Count.sum()) * 100
ml_algo_row = multiple_answers(limit, kg21_row, 'Q17_Part_', 'ML Algorithms')
ml_algo_row['Percentage']= (ml_algo_row.Count /  ml_algo_row.Count.sum()) * 100
ml_algo_concat = combine_row_country(ml_algo, ml_algo_row)


fig, ax= plt.subplots(figsize=(8,8))
sns.barplot(x='Percentage', y ='ML Algorithms', hue='Geography',palette='viridis',data= ml_algo_concat,ax=ax);

# 📌 Points to Note
* PyCharm, VSCode, and Visual Studio are more popular as IDEs than in the rest of the world.
* TensorFlow and Keras are more popular than in the rest of the world.
* A lot of respondents are not using data visualization libraries.
* Convolutional Neural Networks (CNNs) are more popular than in the rest of the world, which is interesting.


# Conclusion and Key Takeaways
* Youth are the future of the Data Science and ML/AI community of Pakistan.
* Universities have a significant role in equipping students with the right tools to better prepare them for industry.
* Platforms like Kaggle are supporting the journey of the DS community in Pakistan.
* As companies grow, more opportunities will improve the overall landscape of Data Science and ML/AI in Pakistan. In particular, the startup ecosystem should boost ML/AI adoption in the coming 2-5 years.

I cannot wait to see next year's results and participation. I am super optimistic about the future!

Feel free to leave comments and/or any questions.

# References
1. https://en.wikipedia.org/wiki/Pakistan
2. https://octoverse.github.com/
3. https://www.forbesmiddleeast.com/leadership/opinion/investors-have-poured-more-than-%24125m-into-pakistani-startups-heres-why
4. https://www.bloomberg.com/news/features/2021-11-17/pakistan-startups-draw-record-money-helped-by-covid-and-china-s-tech-crackdown
5. https://www.kaggle.com/muazmaz/pakistan-s-rising-data-science-community