🚀 Working on a Real Project with Python
(Part of Big Data Analysis)
📊 The Salary Dataset
This dataset contains real-world salary information collected from multiple companies across different locations, job roles, and employment types. It is designed to help understand salary trends, pay distribution, and factors influencing compensation in the job market.

The dataset consists of 22,770 records across 8 columns, which makes it suitable for exploratory data analysis (EDA), data cleaning, visualization, and generating business insights with Python.



🧠 Questions
What does the salary distribution look like?

Which job roles have the highest average salary?

Which cities offer the highest average salary?

Which 5 companies offer the highest average salary?

Which are the top 5 companies with salaries reported more than 20 times?

Is there a relationship between company rating and salary?

Does employment status affect salary?

Which job roles are most common?

How do salaries vary across locations?

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("default")
sns.set_context("notebook")

In [3]:
# Load dataset

df = pd.read_csv(r"Salary_Dataset_DSL.csv")
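
A load step like the one above can be followed by a quick check that the expected columns are present. The sketch below is a minimal illustration, not part of the original notebook: `load_salary_data` and `EXPECTED_COLUMNS` are hypothetical names, and an in-memory sample built from a few rows of this notebook's data preview stands in for `Salary_Dataset_DSL.csv` so that the block runs on its own.

```python
import io

import pandas as pd

# Column names as reported by df.info() in this notebook
EXPECTED_COLUMNS = [
    "Rating", "Company Name", "Job Title", "Salary",
    "Salaries Reported", "Location", "Employment Status", "Job Roles",
]


def load_salary_data(source):
    """Read the CSV and fail early if an expected column is missing (hypothetical helper)."""
    frame = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame


# Assumption: a tiny in-memory stand-in for the real file
sample_csv = io.StringIO(
    "Rating,Company Name,Job Title,Salary,Salaries Reported,"
    "Location,Employment Status,Job Roles\n"
    "3.8,Sasken,Android Developer,400000,3,Bangalore,Full Time,Android\n"
    "4.0,Unacademy,Android Developer,1000000,3,Bangalore,Full Time,Android\n"
)

sample_df = load_salary_data(sample_csv)
print(sample_df.shape)
```

With the real file, the same helper could be called as `load_salary_data("Salary_Dataset_DSL.csv")`.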

In [4]:
# Preview the DataFrame
df

,Rating,Company Name,Job Title,Salary,Salaries Reported,Location,Employment Status,Job Roles
0,3.8,Sasken,Android Developer,400000,3,Bangalore,Full Time,Android
1,4.5,Advanced Millennium Technologies,Android Developer,400000,3,Bangalore,Full Time,Android
2,4.0,Unacademy,Android Developer,1000000,3,Bangalore,Full Time,Android
3,3.8,SnapBizz Cloudtech,Android Developer,300000,3,Bangalore,Full Time,Android
4,4.4,Appoids Tech Solutions,Android Developer,600000,3,Bangalore,Full Time,Android
...,...,...,...,...,...,...,...,...
22765,4.7,Expert Solutions,Web Developer,200000,1,Bangalore,Full Time,Web
22766,4.0,Nextgen Innovation Labs,Web Developer,300000,1,Bangalore,Full Time,Web
22767,4.1,Fresher,Full Stack Web Developer,192000,13,Bangalore,Full Time,Web
22768,4.1,Accenture,Full Stack Web Developer,300000,7,Bangalore,Full Time,Web


In [5]:
# Dataset dimensions

df.shape

(22770, 8)

In [6]:
# Dataset structure

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22770 entries, 0 to 22769
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rating             22770 non-null  float64
 1   Company Name       22769 non-null  object 
 2   Job Title          22770 non-null  object 
 3   Salary             22770 non-null  int64  
 4   Salaries Reported  22770 non-null  int64  
 5   Location           22770 non-null  object 
 6   Employment Status  22770 non-null  object 
 7   Job Roles          22770 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 1.4+ MB
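
The `+` in `memory usage: 1.4+ MB` indicates that pandas reports a lower bound when object columns are present, because the strings themselves are not measured by default. The sketch below shows one way to get a deeper measurement and to shrink repetitive text columns with the `category` dtype. The `demo` frame is a hypothetical stand-in that borrows column names from this dataset; the real numbers would differ.

```python
import pandas as pd

# Hypothetical frame: same column names as the dataset, with repeated text values
demo = pd.DataFrame({
    "Salary": [400000] * 1000,
    "Location": ["Bangalore"] * 1000,
    "Employment Status": ["Full Time"] * 1000,
})

# Default measurement does not inspect the contents of object columns
shallow_bytes = demo.memory_usage().sum()

# deep=True also measures the string values themselves
deep_bytes = demo.memory_usage(deep=True).sum()

# Low-cardinality text columns can be stored as integer codes plus a lookup table
compact = demo.astype({"Location": "category", "Employment Status": "category"})
compact_bytes = compact.memory_usage(deep=True).sum()

print(shallow_bytes, deep_bytes, compact_bytes)
```

On the full DataFrame, `df.info(memory_usage="deep")` would print the deeper figure directly.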


In [7]:
# Summary statistics

df.describe()

,Rating,Salary,Salaries Reported
count,22770.0,22770.0,22770.0
mean,3.918213,695387.2,1.855775
std,0.519675,884399.0,6.823668
min,1.0,2112.0,1.0
25%,3.7,300000.0,1.0
50%,3.9,500000.0,1.0
75%,4.2,900000.0,1.0
max,5.0,90000000.0,361.0
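
In the summary above, the maximum salary (90,000,000) is far above the 75th percentile (900,000), and the mean (about 695,387) exceeds the median (500,000), which suggests a right-skewed distribution with extreme values. One common way to flag such values is the interquartile range (IQR) rule. The sketch below applies it to a small hypothetical sample, not to the real data, and the 1.5 multiplier is a convention rather than something the notebook prescribes.

```python
import pandas as pd

# Hypothetical sample: typical salaries plus one extreme entry
salaries = pd.Series(
    [300000, 400000, 500000, 600000, 900000, 90000000], name="Salary"
)

# Quartiles and interquartile range
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Values above this fence are flagged as potential outliers
upper_fence = q3 + 1.5 * iqr
outliers = salaries[salaries > upper_fence]

# A mean well above the median is a typical sign of right skew
print("mean:", salaries.mean(), "median:", salaries.median())
print("upper fence:", upper_fence)
print(outliers)
```

The same steps could be applied to `df["Salary"]`. Whether flagged rows are errors or genuine high salaries is a judgment that needs a look at the rows themselves.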


In [8]:
# Summary statistics for all columns, including categorical ones
df.describe(include='all')

,Rating,Company Name,Job Title,Salary,Salaries Reported,Location,Employment Status,Job Roles
count,22770.0,22769,22770,22770.0,22770.0,22770,22770,22770
unique,,11260,1080,,,10,4,11
top,,Tata Consultancy Services,Software Development Engineer,,,Bangalore,Full Time,SDE
freq,,271,2351,,,8264,20083,8183
mean,3.918213,,,695387.2,1.855775,,,
std,0.519675,,,884399.0,6.823668,,,
min,1.0,,,2112.0,1.0,,,
25%,3.7,,,300000.0,1.0,,,
50%,3.9,,,500000.0,1.0,,,
75%,4.2,,,900000.0,1.0,,,
max,5.0,,,90000000.0,361.0,,,
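
The `unique`, `top`, and `freq` rows above summarize each text column in a single line. To see the full breakdown behind them, `value_counts()` and `nunique()` can be used per column. The sketch below uses a small hypothetical frame with a few `Job Roles` values from this dataset; the counts are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of the "Job Roles" column
demo = pd.DataFrame({
    "Job Roles": ["Android", "Android", "Web", "SDE", "SDE", "SDE"],
})

# Frequency of each category, most common first
counts = demo["Job Roles"].value_counts()

# Equivalents of the unique / top / freq rows of describe(include='all')
n_unique = demo["Job Roles"].nunique()
top = counts.index[0]
freq = counts.iloc[0]

# Share of each category instead of raw counts
shares = demo["Job Roles"].value_counts(normalize=True)

print(counts)
print(n_unique, top, freq)
```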


In [9]:
# Check missing values

df.isnull().sum()

Rating               0
Company Name         1
Job Title            0
Salary               0
Salaries Reported    0
Location             0
Employment Status    0
Job Roles            0
dtype: int64
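
Only `Company Name` has a missing value, and only in one row. A reasonable next step is to look at that row and then either drop it or fill in a placeholder. The sketch below shows both options on a hypothetical three-row frame. The `"Unknown"` label is an assumption chosen for illustration, and the original notebook may handle the row differently.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing company name
demo = pd.DataFrame({
    "Company Name": ["Sasken", np.nan, "Unacademy"],
    "Salary": [400000, 300000, 1000000],
})

# Inspect the affected row(s) before deciding what to do
missing_rows = demo[demo["Company Name"].isnull()]

# Option 1: drop rows where the company name is missing
dropped = demo.dropna(subset=["Company Name"])

# Option 2: keep the row and fill in a placeholder label
filled = demo.fillna({"Company Name": "Unknown"})

print(missing_rows)
print(len(dropped), filled["Company Name"].tolist())
```

With a single missing row out of 22,770, either option would have little effect on aggregate results. Filling keeps the salary record available for analyses that do not group by company.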