# Story: What does it take to be a Data Scientist in India?



We see a lot of online trainings, classes, and career advisors giving ideas and helping students from various fields become data scientists in India today. Since data is now available in all forms and fields, data management, which is essentially what a data scientist does, has become a very valuable asset to any organisation which collects or analyzes third-party data.

*According to an article from July '19, analysts predict that the country will have more than 11 million job openings by 2026. In fact, since 2019, hiring in the data science industry has increased by 46%. Yet around 93,000 data science jobs were vacant in India at the end of August 2020. Another article states that there is huge scope for data-related operations in India. The main career opportunities available are as data scientists, data analysts, big data engineers, big data managers, and data architects.*

From some pre-analysis, I observed that the largest group of participants in the survey data below is from India. This is plausible given the size of the population and the increasing demand for data science professionals in the country.
So let's find out what makes one a valuable asset in the data science world, and how to become one!

## Importing libraries and data:


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'

Loading the data and saving it as a pandas DataFrame:

In [None]:
#To save data as dataframe, using pandas
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")

In [None]:
#Checking if the data is correctly read
df.head(2)

Since the focus of the analysis is India, let us reduce the data to respondents whose country is India. This also cuts down the amount of data for now.
Assigning the country as a variable will make it easy to reuse.

# Extracting country of analysis

Let's check the list of countries. The data is subset on the line below, where the country name is entered. The same analysis can be carried out for any other country just by changing the subset value.

In [None]:
#Checking list of countries - This analysis can be done for any country in general
df['Q3'].unique()

In [None]:
# Selecting the country of interest
Data = df[df.Q3 == 'India']
print("Data with selected country has",len(Data),"rows &",len(Data.columns), "columns")

**We see that out of 25973 entries, 7433 are from India.**
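
The country check and the subsetting step can be sketched on a small made-up frame. This is only an illustration: the toy rows and the helper `subset_by_country` are hypothetical, and only the `Q3` column name is taken from the survey.

```python
import pandas as pd

# Toy stand-in for the survey: Q3 holds the country of residence.
# The first row imitates the question-text row of the raw CSV.
toy = pd.DataFrame({
    "Q1": ["<question text>", "18-21", "22-24", "30-34", "25-29", "18-21"],
    "Q3": ["<question text>", "India", "India", "Japan", "India", "Brazil"],
})

def subset_by_country(frame, country):
    """Hypothetical helper: keep only rows whose Q3 value equals `country`."""
    return frame[frame["Q3"] == country]

# Respondents per country, most frequent first (skipping the question-text row)
country_counts = toy.iloc[1:]["Q3"].value_counts()
print(country_counts)

country = "India"  # change this one value to repeat the analysis elsewhere
subset = subset_by_country(toy, country)
print(f"{country}: {len(subset)} rows, {len(subset.columns)} columns")
```

Because the filter compares `Q3` with a country name, the question-text row never ends up in the subset.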

# Solving the main problem of data - Columns

I have created a simple function to coalesce the first selected option from the different parts of a specific question. The function takes the *'column part name'* and the *'title to be displayed'* on the graph as inputs.

Defining a function that:

* Keeps all the original columns with no optional questions
* Extracts this data from the main data
* Filters the data for the required part_number (question number with parts)
* Coalesces it to the first preference
* Column-binds the coalesced column with the data extracted before
* Creates a pie plot
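
The coalescing step in the list above can be sketched on toy data. The column names below only imitate the survey's multi-part naming and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy multi-part question: each part column holds the option text if it was
# selected and NaN otherwise.
parts = pd.DataFrame({
    "Q7_Part_1": ["Python", np.nan, "Python", np.nan],
    "Q7_Part_2": ["R", "R", np.nan, np.nan],
    "Q7_Part_3": [np.nan, "SQL", "SQL", np.nan],
})

# bfill(axis=1) copies values leftwards across each row, so the first column
# ends up holding the first non-null answer of that row.
coalesced = parts.bfill(axis=1).iloc[:, 0]
print(coalesced.tolist())
```

Note that coalescing keeps a single answer per respondent, so any further options selected in the same question are not counted.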



In [None]:
#Pie plots
def fix_columns(part_number,title):
  # Extract every column whose name contains part_number (all parts of the question)
  Data1 = Data.filter(like=part_number,axis=1).copy()
  # Coalesce: back-fill across each row so the first column holds the first non-null answer
  Data1[part_number + 'coalesce'] = Data1.bfill(axis=1).iloc[:, 0]
  # Count the coalesced answers once; using the Series keeps the labels as plain strings
  counts = Data1[part_number + 'coalesce'].value_counts()
  counts.plot.pie(labeldistance=None,startangle=90)
  percentages = counts.values/counts.values.sum()*100
  labels = [f'{l}, {s:0.1f}%' for l, s in zip(counts.index, percentages)]
  plt.legend(labels=labels, title="Categories & Percentages", bbox_to_anchor=(1.4,1.1), loc="best", fontsize=10, bbox_transform=plt.gcf().transFigure)
  plt.ylabel(part_number)
  plt.title(title)
  plt.show()

# Primary analysis of category columns:📊

Excluding the country column, we have three other basic columns that divide the data into categories straight away, helping us understand the structure of the data.

* Age group
* Gender
* Highest Education


In [None]:
#Defining the functions

#Pie Plots
def plot_pie(col,title):
    # Data is already filtered by country, which also excludes the question-text
    # row of the raw CSV, so no extra row needs to be skipped here
    counts = Data[col].value_counts()
    counts.plot.pie(labeldistance=None,startangle=90)
    percentages = counts.values/counts.values.sum()*100
    labels = [f'{l}, {s:0.1f}%' for l, s in zip(counts.index, percentages)]
    plt.legend(labels=labels, title="Categories & Percentages", bbox_to_anchor=(1.4,1.1), loc = "best", fontsize=10, bbox_transform=plt.gcf().transFigure)
    plt.title(title)
    # Draw a white circle in the centre to turn the pie into a donut chart
    centre_circle = plt.Circle((0,0),0.70,fc='white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)
    plt.tight_layout()
    plt.show()
  

#Bar Charts
def plot_bars(col,title):
  # Count each category once and plot the counts as horizontal bars
  counts = Data[col].value_counts()
  sns.barplot(x=counts.values,y=counts.index,palette='rainbow')
  plt.title(title)
  plt.xlabel('Number of respondents')
  plt.ylabel(col)
  plt.show()


## Age group:🧑
It is clearly visible that most data science enthusiasts are in the 18-29 age range, the three major categories being 18-21, 22-24 and 25-29. We can say that the majority of people using Kaggle are students from universities or high schools preparing themselves for a data-centric India. As age increases, the share of participants falls. The data suggests that the best age to start a career is between 18 and 24!
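
To read the age trend as shares rather than raw counts, a minimal sketch on made-up answers is shown here. The bracket labels follow the survey's `Q1` style, but the values are invented for illustration.

```python
import pandas as pd

# Toy age-bracket answers
ages = pd.Series(
    ["18-21", "22-24", "18-21", "25-29", "30-34", "22-24", "18-21", "40-44"],
    name="Q1",
)

# Share of respondents per bracket, in percent, ordered by age bracket.
# Plain string sorting works here because every label starts with two digits.
share = (ages.value_counts(normalize=True) * 100).sort_index()
print(share.round(1))

# Combined share of the three youngest brackets
young = share.loc[["18-21", "22-24", "25-29"]].sum()
print(f"18-29: {young:.1f}%")
```

Sorting by bracket instead of by count makes it easier to see how the share changes with age.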

In [None]:
plot_bars('Q1','Distribution of Age across all participants')

## Gender:🧑
We see that 76% of the participants are men and 22% are women. Since data science is a male-dominated field for now, the Kaggle community gives a true picture of the data science world it lives in.

In [None]:
plot_pie('Q2','Distribution of Gender across all participants')

## Highest Education:📑
Even though the distribution shows a higher number of participants holding a 'Bachelor's degree', the gap between graduation and post-graduation is 20%. This could be because most participants are in the 18-21 age group and are yet to enroll for a master's degree.


**This distribution highlights the most important part of being a data scientist: a graduate or post-graduate degree, mainly in Mathematics, Statistics, Economics or Computer Science, is a very important part of one's education if one is keen to pursue a data science job.**

In [None]:
plot_bars('Q4','Distribution of Highest Education across all participants')

## Coding - Years of experience: 💻

In the above graphs we have seen that the majority of participants come from a Bachelor's degree background and may want to pursue a Master's degree, followed by Master's degree holders. *This matches the assumption that most participants have at least 2-3 years of coding experience from their Bachelor's degree, which can be confirmed below:*

In [None]:
plot_bars('Q6','Coding - Years of Experience')

What programming languages **do you use on a regular basis?**

In [None]:
#Using the function of coalesce column parts and plot a pie-chart
fix_columns("Q7_","What Programming Languages do you use on a Regular basis?")

What programming language **would you recommend** an aspiring data scientist to learn first? 

There are several programming languages for data science today. *It is said that data scientists should learn and master at least one language, as it is an essential tool for carrying out various data science functions.* As shown below, **Python is the most used language among students and professionals today.**




In [None]:
plot_pie('Q8','Which Programming Language would you Recommend?')

**We see that Python, R and SQL are the most used tools when it comes to data analysis. SQL is a great tool for data storage and data management, while Python and R complement it by fetching data from SQL databases and running various analysis models. Let's check whether students and professionals differ in how well-versed they are in these programming languages.**
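
As a rough illustration of Python fetching data from a SQL database, the sketch below uses Python's built-in `sqlite3` module together with `pandas.read_sql_query`. The `responses` table and its rows are made up for this example and are not part of the survey data.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a real SQL store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (country TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO responses VALUES (?, ?)",
    [("India", "Python"), ("India", "SQL"), ("India", "Python"), ("Japan", "R")],
)
conn.commit()

# SQL does the filtering and aggregation; pandas receives the result
query = """
    SELECT language, COUNT(*) AS n
    FROM responses
    WHERE country = ?
    GROUP BY language
    ORDER BY n DESC
"""
langs = pd.read_sql_query(query, conn, params=("India",))
conn.close()
print(langs)
```

The split of work is the point here: the database handles storage and aggregation, and the resulting small table is handed to pandas for further analysis or plotting.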


# Difference between the use of programming languages of Students & Professionals:🧒 🆚 🧑

In [None]:
# We will divide the data age wise into Students and Professionals.
Data.Q1.unique()

# Let 18-29 be students, and rest be professionals
student_ages = ['18-21', '22-24', '25-29']
Data_students = Data[Data.Q1.isin(student_ages)]
Data_professionals = Data[~Data.Q1.isin(student_ages)]

#Plotting pie chart for Students
# The question-text row is already excluded by the country filter, so all rows are counted
counts = Data_students['Q8'].value_counts()
counts.plot.pie(labeldistance=None,startangle=90)
percentages = counts.values/counts.values.sum()*100
labels = [f'{l}, {s:0.1f}%' for l, s in zip(counts.index, percentages)]
plt.legend(labels=labels, title="Categories & Percentages", bbox_to_anchor=(1.4,1.1), loc = "best", fontsize=10, bbox_transform=plt.gcf().transFigure)
plt.title("Programming Languages used by Students")
plt.show()


In [None]:
#Plotting pie chart for Professionals
# The question-text row is already excluded by the country filter, so all rows are counted
counts = Data_professionals['Q8'].value_counts()
counts.plot.pie(labeldistance=None,startangle=90)
percentages = counts.values/counts.values.sum()*100
labels = [f'{l}, {s:0.1f}%' for l, s in zip(counts.index, percentages)]
plt.legend(labels=labels, title="Categories & Percentages", bbox_to_anchor=(1.4,1.1), loc = "best", fontsize=10, bbox_transform=plt.gcf().transFigure)
plt.title("Programming Languages used by Professionals")
plt.show()

Both students and professionals heavily use Python, SQL, R and C as data management and analysis tools, while MATLAB sees more use among professionals than among students.
MATLAB is not very popular in data science, but it is one of the languages many people consider when learning it. **Researchers, scientists and engineers who already use MATLAB find it easy to move to deep learning because of the functionality of the Deep Learning Toolbox.**
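
The two pie charts can also be compared in a single table. The sketch below applies the same age rule (18-29 as students, everyone else as professionals) to made-up answers; the `group` column is a hypothetical helper column and is not part of the survey.

```python
import pandas as pd

# Toy data: age bracket (Q1) and language answer (Q8)
toy = pd.DataFrame({
    "Q1": ["18-21", "22-24", "25-29", "30-34", "35-39", "40-44"],
    "Q8": ["Python", "Python", "R", "Python", "MATLAB", "SQL"],
})

# 18-29 counts as "Student", everyone else as "Professional"
student_ages = ["18-21", "22-24", "25-29"]
toy["group"] = toy["Q1"].isin(student_ages).map(
    {True: "Student", False: "Professional"}
)

# Row-normalised cross-tabulation: each row sums to 100 percent
comparison = pd.crosstab(toy["group"], toy["Q8"], normalize="index") * 100
print(comparison.round(1))
```

Reading across a row gives the language mix within one group, which makes differences such as MATLAB usage easier to spot than in two separate charts.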

## What type of computing platform do you use most often for your data science projects?

***What kind of high-end computers does one need?***

***Is it good to start a data science career on one's personal low-powered laptop?***

***Is it hard to learn and practice data science or machine learning if one doesn't have a good laptop?***

These are some common questions students have before starting a career in data science. It is a common misconception that data work requires heavy, high-end computing systems with large storage capacity and lots of RAM. But these days most data is accessed from cloud storage, and most software runs freely on any ordinary laptop. It is true that large data sets bring a need for more powerful systems and more storage, but to get started, a laptop is all one needs!

In [None]:
plot_bars('Q11','Most used computing platform')

# Technical Stuff:📈📉

## What is the primary tool that you use at work or school to analyze data? 

Microsoft Excel is the most important tool for starting basic analysis in data science. It serves everyone from students producing college projects to professionals creating fully approved, automated dashboards for clients. One can build anything from the simplest charts to pivots and look-ups across different datasets, creating a fully visual tool for the client. It also has additional features such as Power Pivot for accessing files too large to work on in Excel directly. It is the primary tool that has been used across all industries for reporting and analysis at all levels.

Yet we see that **not all colleges or universities in India offer a proper course on Excel functionality. It is assumed that basic Excel skills will do, but it is just as important to attain advanced skills for a smooth workflow in the data science field.**

We also observe that although SPSS and SAS are common jargon in data science, not many people are familiar with them because they are *high-priced* paid software. Students generally don't have hands-on experience with SPSS or SAS, and the certification seems expensive to invest in personally.

In [None]:
plot_bars('Q41','Use of Primary Tools')

## Which of the following business intelligence tools do you use most often?

Tableau and Microsoft Power BI have long been the most used dashboard-integrated visualization and analysis tools in business.

In [None]:
plot_pie('Q35','Use of Business Intelligence Tools')

## Does your current employer incorporate machine learning methods into their business?


In [None]:
plot_bars('Q23','Use of core machine learning models into business')

## For how many years have you used machine learning methods?

Machine learning refers to a **group of techniques** used by data scientists that allow computers to learn from data. Most participants have under a year or 1-2 years of experience in ML. But we also see almost 17% of people who are yet to use this most important and core part of data science.

In [None]:
plot_pie('Q15','Years of Coding - Machine Learning')

## Which of the following Integrated Development Environments (IDE's) do you use on a regular basis?

An integrated development environment (IDE) is a software suite that consolidates basic tools required to write and test software. Developers use numerous tools throughout software code creation, building and testing. Development tools often include text editors, code libraries, compilers and test platforms.

In [None]:
fix_columns("Q9","Which IDEs do you use on a regular basis?")

## What data visualization libraries or tools do you use on a regular basis? 
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. **Seaborn is an open-source Python library** built on top of Matplotlib. It is used for data visualization and exploratory data analysis. Seaborn works easily with dataframes and the Pandas library. The graphs created can also be customized easily.

In [None]:
fix_columns("Q14","Which visualization libraries do you use on a regular basis?")

## Which of the following Machine Learning frameworks do you use on a regular basis? 
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. 

In [None]:
fix_columns("Q16","Which Machine Learning Framework do you use on a regular basis?")

## Which of the following Machine Learning algorithms do you use on a regular basis?
In the simplest words, Linear Regression is a supervised Machine Learning model that finds the best-fit straight line between the independent and dependent variables, i.e. it finds the linear relationship between them.

**As the data suggests, Linear Regression is the most regularly used algorithm for starting to practice data science and coding skills.**
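
To make the idea of a best-fit line concrete, here is a minimal sketch using NumPy's `polyfit` on made-up points that lie roughly on y = 2x + 1. The data is invented purely for illustration.

```python
import numpy as np

# Toy data: independent variable x and dependent variable y
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Ordinary least squares for a straight line: minimise sum((y - (m*x + b))**2)
m, b = np.polyfit(x, y, deg=1)
print(f"slope={m:.2f}, intercept={b:.2f}")

# Predict the dependent variable for a new value of the independent variable
x_new = 5.0
y_new = m * x_new + b
print(f"prediction at x={x_new}: {y_new:.2f}")
```

The fitted slope and intercept come out close to the 2 and 1 used to construct the points, which is what "finding the linear relationship" means in practice.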

In [None]:
fix_columns("Q17_","Which Machine Learning Algorithms do you use on a regular basis?")

# Moving to Clouds⛈☁☁☁


## Of the cloud platforms that you are familiar with, which has the best developer experience (most enjoyable to use)?

Data science and cloud computing essentially go hand in hand. A data scientist typically analyzes different types of data that are stored in the cloud. With the growth of Big Data, organizations are increasingly storing large data sets online, which creates a need for data scientists.

In [None]:
plot_bars('Q28','Cloud Platforms - Convenient use of platforms')

## Which of the following big data products (relational database, data warehouse, data lake, or similar) do you use most often?
In computing, a data warehouse, also known as an enterprise data warehouse, is a system used for reporting and data analysis and is considered a core component of business intelligence. 

In [None]:
plot_bars('Q33','Use of Big Data products')

## Where do you publicly share your data analysis or machine learning applications? - deployment

In [None]:
fix_columns("Q39_","Location for publicly sharing output?")

Now that we have our insights from the data and know the path to follow professionally, we come to the personal side of being a data scientist. This question focuses on the day-to-day life of a data scientist: the hows and whats of the real day job.

## Select any activities that make up an important part of your role at work: (Select all that apply)

We see that 57% of the work consists of understanding and analyzing the data in use, to see how it will influence business decisions. But there are many levels to being a data scientist, and using the survey we can divide the data science job into the levels below:

**1)** **Model building** - The trial-and-error or experimental part, where they build prototypes and new machine learning models, or improve them and do the research that advances the current state of machine learning, up to validation.

**2)** **Automation** - Building a machine learning service that operationally improves the system workflow.

**3)** **Reporting** - Running the pre-built data infrastructure that the business uses for storing, analyzing and operationalizing the data. This can also include sending daily reports to the client and having calls to understand changes, if any.
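
Since this is a "select all that apply" question, one respondent can tick several activities. A minimal sketch of counting every ticked option is shown below, as opposed to coalescing to the first one. The column names and activity labels are made up and only imitate the survey's layout.

```python
import numpy as np
import pandas as pd

# Toy multi-select question: one column per option, holding the option text
# when ticked and NaN otherwise.
parts = pd.DataFrame({
    "Q24_Part_1": ["Analyze data", "Analyze data", np.nan, "Analyze data"],
    "Q24_Part_2": [np.nan, "Build prototypes", "Build prototypes", np.nan],
    "Q24_Part_3": [np.nan, np.nan, np.nan, "Build ML service"],
})

# melt() lines up every cell in one column; dropna() keeps only ticked options
ticks = parts.melt(value_name="option")["option"].dropna()
counts = ticks.value_counts()

# Share of respondents who ticked each option. A respondent can tick several
# options, so these percentages need not add up to 100.
share = counts / len(parts) * 100
print(share.round(1))
```

With this approach each activity is measured against the number of respondents, so an activity listed second or third by a respondent still counts.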


In [None]:
fix_columns("Q24","Your job as a Data Scientist!")

## On which platforms have you begun or completed data science courses?
These platforms and the data science and machine learning courses they offer are suitable for all, from freshers to experienced professionals. 

**Coursera** is a popular online learning platform offering massive open online courses (MOOCs), specializations, and degrees in a range of subjects, including data science, machine learning, and Artificial Intelligence. It is the most used platform, whether for learning a new skill or upgrading an old one.

**Kaggle** has to be the second most important platform, thanks to its competitions and learning badges for people from a data science background. Data scientists of all levels can benefit from the resources and community on Kaggle. Whether you are a beginner looking to learn new skills and contribute to projects, an advanced data scientist looking for competitions, or somewhere in between, Kaggle is a good place to go. **Despite the differences between Kaggle and typical data science, Kaggle can still be a great learning tool for beginners.**

In [None]:
fix_columns("Q40","Data Science platforms?")

# Conclusion

## What are your favorite media sources that report on data science topics?

In [None]:
fix_columns("Q42","Media Source")


# **"Kaggle, a gaggle of data scientists."**

**Once again Kaggle tops the list of media sources, being the most informative and crucial part of a student's life as a data scientist!!**

It allows users to participate in predictive modeling competitions, to explore and publish data sets and also to get access to training accelerators. It's a great ecosystem to engage, connect, and collaborate with other data scientists to build amazing machine learning models.

For students and beginners, Kaggle is a great learning tool. Each competition is self-contained. You don't need to scope your own project and collect data, which frees you up to focus on other skills.

For professionals, any time you win something on a platform with a global audience, you gain global recognition. That holds true for Kaggle too.
It's the largest platform for machine learning in the world, with more than 23,000 public datasets to practice on and various competitions to enhance your skills. Some of these competitions also pay an insane amount of prize money.

**An excellent Kaggle profile will definitely result in a lot of exposure from recruiters which will help you in getting a job!**
