In [1]:
# Importing library
import pandas as pd

In [2]:
# Reading the dataset
data = pd.read_csv('https://raw.githubusercontent.com/analyticsindiamagazine/MocksDatasets/main/Data%20Science%20Salary.csv')
data.head()

Unnamed: 0,S.No.,Company Name,Job Title,Salaries Reported,Location,Salary
0,0,Mu Sigma,Data Scientist,105,Bangalore,648573.0
1,1,IBM,Data Scientist,95,Bangalore,1191950.0
2,2,Tata Consultancy Services,Data Scientist,66,Bangalore,836874.0
3,3,Impact Analytics,Data Scientist,40,Bangalore,669578.0
4,4,Accenture,Data Scientist,32,Bangalore,944110.0


# **Basic Properties of the dataset**

In [3]:
# Shape and size of dataset
data.shape

(1838, 6)

In [4]:
# Length of dataset
len(data)

1838

In [5]:
# Columns in the dataset
data.columns

Index(['S.No.', 'Company Name', 'Job Title', 'Salaries Reported', 'Location',
       'Salary'],
      dtype='object')

In [6]:
# Number of columns
len(data.columns)

6

Next, we will check the basic information about the dataset using the info() method, which gives column-wise details including the number of non-null records and the data type of each column.

In [7]:
# Information about dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1838 entries, 0 to 1837
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   S.No.              1838 non-null   int64  
 1   Company Name       1838 non-null   object 
 2   Job Title          1838 non-null   object 
 3   Salaries Reported  1838 non-null   int64  
 4   Location           1838 non-null   object 
 5   Salary             1838 non-null   float64
dtypes: float64(1), int64(2), object(3)
memory usage: 86.3+ KB


The output shows that every column has 1838 non-null records, which matches the total number of rows, so there are no missing values in any column. It also lists the data type of each column.
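The same check can be made explicit by counting missing values per column. The snippet below is a minimal sketch on a small toy frame (the `sample` name and its deliberately missing salary are assumptions for illustration, not part of the dataset), using `isnull()` as one possible way to confirm what info() reports.

```python
import pandas as pd

# Toy frame mirroring three columns of the dataset; the values are
# illustrative, and one salary is deliberately left missing.
sample = pd.DataFrame({
    "Company Name": ["Mu Sigma", "IBM", "Accenture"],
    "Job Title": ["Data Scientist", "Data Scientist", "Data Scientist"],
    "Salary": [648573.0, None, 944110.0],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# True only when no column contains a missing value
has_no_missing = not sample.isnull().any().any()
print(has_no_missing)
```

On the real dataset, `data.isnull().sum()` would be expected to show zero for every column, given the non-null counts reported above.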

Next, we will get a statistical description of the dataset using the describe() function. Called without any argument, it describes the numerical features only.

In [8]:
# Data description (Numerical features)
data.describe()

Unnamed: 0,S.No.,Salaries Reported,Salary
count,1838.0,1838.0,1838.0
mean,2067.449946,5.195321,885472.0
std,1087.870929,7.241193,607095.3
min,0.0,2.0,29520.0
25%,1438.25,2.0,484255.5
50%,2174.5,3.0,732668.5
75%,2824.75,5.0,1139704.0
max,4314.0,105.0,6518917.0


We now have the description of each numerical feature in the dataset: the count, mean, standard deviation, minimum, quartiles and maximum.

To get the same description of the non-numeric features, we need to pass an argument include='object' with the function.

In [9]:
# Data description (Character features)
data.describe(include='object')

Unnamed: 0,Company Name,Job Title,Location
count,1838,1838,1838
unique,1029,20,5
top,Tata Consultancy Services,Data Analyst,Bangalore
freq,26,733,672


Here we have the descriptive statistics of the non-numeric features. Similarly, to get the description of all the features together, numeric and non-numeric, we need to pass the argument include='all'.

In [10]:
# Data description (All features)
data.describe(include='all')

Unnamed: 0,S.No.,Company Name,Job Title,Salaries Reported,Location,Salary
count,1838.0,1838,1838,1838.0,1838,1838.0
unique,,1029,20,,5,
top,,Tata Consultancy Services,Data Analyst,,Bangalore,
freq,,26,733,,672,
mean,2067.449946,,,5.195321,,885472.0
std,1087.870929,,,7.241193,,607095.3
min,0.0,,,2.0,,29520.0
25%,1438.25,,,2.0,,484255.5
50%,2174.5,,,3.0,,732668.5
75%,2824.75,,,5.0,,1139704.0
max,4314.0,,,105.0,,6518917.0


# **Counting the categorical labels**

The dataset also contains categorical features such as Location, Job Title and Company Name. For each categorical feature, we can check how many times each category or value occurs using the value_counts() function.

In [11]:
# Counting values in feature
data['Location'].value_counts()

Bangalore    672
New Delhi    402
Pune         255
Mumbai       255
Hyderabad    254
Name: Location, dtype: int64

In [12]:
# Counting values in feature
data['Company Name'].value_counts()

Tata Consultancy Services    26
Amazon                       24
Accenture                    21
First Student                20
IBM                          18
                             ..
Karza Technologies            1
Dure Technologies             1
Kotak Mahindra                1
Freshworks                    1
iSchoolConnect                1
Name: Company Name, Length: 1029, dtype: int64

In [13]:
# Counting values in feature
data['Job Title'].value_counts()

Data Analyst                           733
Data Scientist                         674
Data Engineer                          313
Machine Learning Engineer               85
Senior Data Scientist                    9
Data Science                             6
Senior Machine Learning Engineer         3
Lead Data Scientist                      2
Junior Data Scientist                    2
Data Science Manager                     1
Data Scientist - Trainee                 1
Data Science Lead                        1
Data Science Associate                   1
Machine Learning Data Associate          1
Machine Learning Data Associate I        1
Machine Learning Associate               1
Machine Learning Data Associate II       1
Associate Machine Learning Engineer      1
Machine Learning Data Analyst            1
Data Science Consultant                  1
Name: Job Title, dtype: int64
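Raw counts can also be expressed as proportions. The following is a minimal sketch on a toy Series (the `locations` values are made up for illustration) showing the `normalize=True` option of value_counts(), which returns relative frequencies instead of counts.

```python
import pandas as pd

# Toy location column; the values are illustrative only
locations = pd.Series(
    ["Bangalore", "Bangalore", "Pune", "Mumbai"], name="Location"
)

# Relative frequency of each category (the fractions sum to 1)
shares = locations.value_counts(normalize=True)
print(shares)
```

Applied to the dataset, `data['Location'].value_counts(normalize=True)` would give the share of records per city rather than the absolute counts.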

# **Summary Statistics**

Here we will discuss how to find important summary statistics of the separate features. First, we will check how we can count the number of elements in a feature.

In [29]:
# Counting elements in a feature
data['Job Title'].count()

1838

We can see in this output that the count of job titles is 1838, which is the total number of rows in the dataset, because count() counts every non-null entry, repeats included. To count the distinct job titles, we will use the nunique() function.

In [14]:
# Counting unique job titles
data['Job Title'].nunique()

20

Now we have the number of distinct job titles in the dataset. In the same way, we can count the unique salaries and company names.

In [15]:
# Counting unique salaries
data['Salary'].nunique()

1688

In [16]:
# Counting unique company names
data['Company Name'].nunique()

1029
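To make the difference between count() and nunique() concrete, here is a small sketch on a toy Series (the `titles` values, including the missing entry, are assumptions for illustration). It also shows that both methods skip missing values by default.

```python
import pandas as pd

# Toy job-title column with one repeat and one missing entry
titles = pd.Series(["Data Analyst", "Data Scientist", "Data Analyst", None])

total_rows = len(titles)            # all rows, missing included
non_null = titles.count()           # non-missing entries, repeats included
distinct = titles.nunique()         # distinct non-missing values
distinct_with_missing = titles.nunique(dropna=False)  # missing counted as a value

print(total_rows, non_null, distinct, distinct_with_missing)
```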

Next, we will check the mean or average of the salaries as per the dataset.

In [17]:
# Mean of salaries
data['Salary'].mean()

885472.0192709467

However, the mean is not always the right measure for salaries, because it is affected by high skewness and extreme values in the data. The median is preferred in such cases, and this is how we can find the median salary.

In [18]:
# Median salary
data['Salary'].median()

732668.5
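The effect of extreme values on the mean can be seen in a small sketch. The `salaries` values below are made up for illustration: four moderate salaries and one very large one.

```python
import pandas as pd

# Toy salaries: four moderate values and one extreme value
salaries = pd.Series([400000.0, 500000.0, 600000.0, 700000.0, 6500000.0])

mean_salary = salaries.mean()      # pulled upward by the extreme value
median_salary = salaries.median()  # middle value, unaffected by the extreme
skewness = salaries.skew()         # positive for a right-skewed distribution

print(mean_salary, median_salary, skewness)
```

Here the mean is 1740000.0 while the median stays at 600000.0, which is closer to what a typical entry looks like. The same pattern, in a milder form, appears in the dataset, where the mean (885472.0) exceeds the median (732668.5).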

We can also find the mode of the salary, which shows us the most frequent value among the salaries.

In [19]:
# Most frequent salary
data['Salary'].mode()

0    1200000.0
dtype: float64
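Note that mode() returns a Series rather than a single number, because several values can tie for the highest frequency. The sketch below uses a toy Series (values are illustrative) with two tied modes.

```python
import pandas as pd

# Toy salaries in which two values occur equally often
salaries = pd.Series([500000.0, 500000.0, 1200000.0, 1200000.0, 900000.0])

# All tied modes, returned in sorted order
modes = salaries.mode()
print(modes.tolist())

# One way to pick a single value: take the first (smallest) mode
first_mode = modes.iloc[0]
print(first_mode)
```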

In [20]:
# Standard deviation of salary
data['Salary'].std()

607095.2502229401

In [21]:
# Highest salary
data['Salary'].max()

6518917.42

In [22]:
# Lowest salary
data['Salary'].min()

29520.0
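The individual statistics above can also be collected in a single call. This is a minimal sketch using the `agg()` method on a toy Series (the `salaries` values are assumptions for illustration).

```python
import pandas as pd

# Toy salaries; the values are illustrative only
salaries = pd.Series([300000.0, 500000.0, 700000.0])

# Compute several summary statistics at once; the result is a Series
# indexed by the statistic names
summary = salaries.agg(["mean", "median", "std", "min", "max"])
print(summary)
```

On the dataset, `data['Salary'].agg([...])` with the same list would be expected to reproduce the values computed one by one above.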

# **Querying the data**

Having gone through the important descriptive statistics of the data, we can use these findings to query the data as well. Let us say we want to find the job title and the name of the company where the highest salary was reported. For this purpose, we will find the index location of the maximum salary and then look up the job title and the company name at that index location.

In [23]:
import numpy as np

We could do this in a single line of code, but to keep it easy to follow we will break it down into sub-steps. First, we will find the index location of the maximum salary.

In [24]:
# Index position of the highest salary
pos = np.where(data['Salary'] == data['Salary'].max())
pos

(array([1798]),)

The output is a tuple containing a single 1-D array, which holds the index location, i.e., 1798. First, we will extract that value from the tuple so that it is clear what will be used in the next line of code.

In [25]:
pos[0][0]

1798
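The structure returned by np.where can be inspected directly. The sketch below uses a toy array (values are illustrative) in which the maximum occurs twice, to show why the result is an array of positions wrapped in a tuple.

```python
import numpy as np

# Toy values; the maximum (50.0) appears at positions 1 and 3
values = np.array([10.0, 50.0, 30.0, 50.0])

# With a single condition, np.where returns a tuple holding one array
# of positions per dimension of the input
pos = np.where(values == values.max())
print(type(pos), len(pos))

# The first tuple element lists every matching position
positions = pos[0]
print(positions)

# Indexing once more picks the first matching position
first_position = pos[0][0]
print(first_position)
```

So `pos[0][0]` selects only the first match. If several rows shared the maximum salary, the remaining positions in `pos[0]` would be ignored.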

Now, we will use the element held by the array and find the job title at that index value.

In [26]:
# Job title with highest salary
data['Job Title'][pos[0][0]]

'Machine Learning Engineer'

So Machine Learning Engineer is the job title with the highest salary. Let's also check the name of the company offering that salary.

In [27]:
# Company name with highest salary
data['Company Name'][pos[0][0]]

'Amazon'

Finally, we will put it all together in one line of code to get the job title.

In [28]:
data['Job Title'][np.where(data['Salary'] == data['Salary'].max())[0][0]]

'Machine Learning Engineer'
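As an alternative sketch that stays within pandas, `idxmax()` returns the index label of the maximum value, and `loc` can then fetch several columns of that row at once. The three-row `sample` frame below is an assumption for illustration, built from values shown earlier in the outputs.

```python
import pandas as pd

# Small frame built from rows shown in the outputs above
sample = pd.DataFrame({
    "Company Name": ["Mu Sigma", "IBM", "Amazon"],
    "Job Title": ["Data Scientist", "Data Scientist", "Machine Learning Engineer"],
    "Salary": [648573.0, 1191950.0, 6518917.42],
})

# Index label of the row with the highest salary
top_label = sample["Salary"].idxmax()

# Job title and company name at that label, fetched together
top_row = sample.loc[top_label, ["Job Title", "Company Name"]]
print(top_row)
```

Because `idxmax()` returns a label rather than a position, this approach pairs naturally with `loc`. The np.where version above returns a position and works with `data['Job Title'][...]` only because the dataset uses the default 0-based RangeIndex.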