Data Description



In [1]:
# Importing necessary libraries
import pandas as pd

In [2]:
# Reading the CSV file into a DataFrame
file = r"/content/ds_salaries.csv"
df = pd.read_csv(file)

In [3]:
# Displaying the first few rows of the DataFrame
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [None]:
# Displaying information about the DataFrame, including data types and non-null counts
df.info()

In [None]:
# Displaying the column names in the DataFrame
df.columns

In [None]:
# Checking the count of missing values in each column
df.isna().sum()

In [None]:
# Unique values in the 'work_year' column
df.work_year.unique()

In [None]:
# Unique values in the 'experience_level' column
df.experience_level.unique()

### Subsetting Data:
1. Selecting Specific Columns:

Method: df[['column1', 'column2']] or df.loc[:, ['column1', 'column2']]
Example:

In [None]:
# 1. Selecting specific columns: 'work_year', 'job_title', 'salary'
selected_columns = df[['work_year', 'job_title', 'salary']]

In [None]:
selected_columns

Difference between a series and a dataframe

In [None]:
# Difference between a series and a dataframe
# (Note: A DataFrame is a 2D table, while a Series is a 1D labeled array.)
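
A quick way to see the difference is to compare single and double brackets. The sketch below uses a small made-up frame named `toy` (the values are illustrative and not taken from ds_salaries.csv) so that it runs on its own:

```python
import pandas as pd

# Toy data for illustration only; the column names mirror the dataset
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Engineer", "Data Analyst"],
    "salary_in_usd": [90000, 110000, 60000],
})

as_series = toy["salary_in_usd"]    # single brackets -> Series (1D)
as_frame = toy[["salary_in_usd"]]   # double brackets -> DataFrame (2D)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has shape `(3,)`, while the one-column DataFrame has shape `(3, 1)`.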

2. Selecting Rows Based on Conditions:

Method: Boolean indexing
Example:

In [None]:
# 2. Selecting rows based on conditions (e.g., high salary jobs)
high_salary_jobs = df[df['salary_in_usd'] > 80000]

In [None]:
high_salary_jobs

### Filtering Data:
1. Filtering Rows Based on Multiple Conditions:

Method: Use logical operators (&, |, ~)
Example:

In [None]:
# (e.g., Senior-level jobs with salary > 80000; senior level is coded 'SE' in this dataset)
filtered_data = df[(df['experience_level'] == 'SE') & (df['salary_in_usd'] > 80000)]

In [None]:
filtered_data
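
The other two operators, `|` (or) and `~` (not), work the same way. A minimal sketch on a made-up frame (the `toy` values are illustrative, not from the dataset), with each condition stored in a named mask for readability:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "SE", "EX"],
    "salary_in_usd": [40000, 75000, 120000, 70000, 200000],
})

is_senior = toy["experience_level"] == "SE"
is_high = toy["salary_in_usd"] > 80000

senior_or_high = toy[is_senior | is_high]    # rows meeting either condition
not_senior = toy[~is_senior]                 # rows that are not senior-level
senior_not_high = toy[is_senior & ~is_high]  # senior-level but at or below 80000

print(len(senior_or_high), len(not_senior), len(senior_not_high))
```

Each condition needs its own parentheses when written inline, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.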

2. Filtering Rows Based on String Matching:

Method: str.contains()
Example:

In [None]:
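# (e.g., job titles containing 'Engineer')
# Note: employment_type holds codes such as 'FT' and 'CT', so matching 'Remote' there returns no rows;
# remote status is stored in remote_ratio.
engineer_jobs = df[df['job_title'].str.contains('Engineer', case=False, na=False)]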
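
The `case` and `na` arguments matter in practice: `case=False` ignores capitalization, and `na=False` treats missing values as non-matches so the mask stays purely boolean. A small sketch on made-up values (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy data for illustration only, including one missing title
toy = pd.DataFrame({
    "job_title": ["Data Engineer", "ML engineer", "Data Analyst", None],
})

# Case-insensitive match; missing titles count as False
mask = toy["job_title"].str.contains("engineer", case=False, na=False)
matches = toy[mask]

print(mask.tolist())
print(matches)
```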

### Grouping Data:
1. Grouping by Categorical Variable:

Method: groupby()
Example:

In [None]:
#  (e.g., experience_level)
average_salary_by_level = df.groupby('experience_level')['salary_in_usd'].mean()

2. Aggregating Data within Groups:

Method: groupby() with aggregation functions
Example:

In [None]:
stats_by_job = df.groupby('job_title').agg({'salary_in_usd': ['mean', 'median'], 'work_year': 'count'})

In [None]:
stats_by_job.sort_values(('salary_in_usd', 'mean'), ascending=False)

- Mean Salary by Job Title:
    The mean salary for each job title gives you an overall idea of the average compensation associated with each position.
    
- Median Salary by Job Title:
    The median salary provides a measure of central tendency that is less affected by outliers. It can be valuable to understand the typical salary distribution within each job title.
    
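- Number of Records by Job Title:
    Counting `work_year` counts the rows for each job title, i.e. how many salary records the dataset holds for that title (not a number of years). This shows how the records are distributed across job titles, and it matters because a mean or median based on only a few records is unreliable.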
    
- Comparison of Mean and Median:
    The comparison between mean and median salaries can indicate the presence of salary outliers within specific job titles. A significant difference between mean and median might suggest skewed salary distributions.
    
- Identifying High and Low-Paying Job Titles:
    Job titles with a higher mean or median salary might be considered higher-paying positions, while those with lower values could be lower-paying roles.
    
- Variability in Salaries:
    The gap between mean and median hints at how skewed salaries are within a job title, but it does not measure spread directly. Adding 'std' (or 'min' and 'max') to the aggregation gives a clearer view of how much salaries vary across job titles.
    
- Identifying Anomalies or Outliers:
    Extreme values in mean or median salaries may indicate anomalies or outliers that could be explored further.
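
The points above can be checked with a slightly richer aggregation. This is a sketch on made-up numbers (the `toy` frame and the column name `mean_minus_median` are illustrative assumptions), using named aggregation to get flat column names:

```python
import pandas as pd

# Toy data for illustration only; one Data Scientist salary is an outlier
toy = pd.DataFrame({
    "job_title": ["Data Scientist"] * 4 + ["Data Analyst"] * 3,
    "salary_in_usd": [80000, 90000, 100000, 400000, 50000, 60000, 70000],
})

summary = toy.groupby("job_title")["salary_in_usd"].agg(
    mean="mean", median="median", std="std", n="count"
)

# A large positive gap suggests a right-skewed distribution or high outliers
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary.sort_values("mean_minus_median", ascending=False))
```

Here the single 400000 salary pulls the Data Scientist mean well above the median, while the Data Analyst mean and median coincide.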

Spell out the categories in the experience_level and employment_type columns, e.g. MI = Mid-level/Intermediate.

In [None]:
# Updating categories in 'experience_level' column
df['experience_level'] = df['experience_level'].replace('EN','Entry-level/Junior')
df['experience_level'] = df['experience_level'].replace('MI','Mid-level/Intermediate')
df['experience_level'] = df['experience_level'].replace('SE','Senior-level/Expert')
df['experience_level'] = df['experience_level'].replace('EX','Executive-level/Director')

In [None]:
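# Alternative: map old codes to new names for the 'experience_level' column.
# Run this instead of the replace() cell above, not after it: map() turns any value missing from the dictionary into NaN.
category_mapping = {'EN': 'Entry-level/Junior', 'MI': 'Mid-level/Intermediate', 'SE': 'Senior-level/Expert', 'EX': 'Executive-level/Director'}
df['experience_level'] = df['experience_level'].map(category_mapping)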
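
When choosing between the two approaches, it helps to see how each treats codes that the dictionary does not cover. A minimal sketch with a deliberately incomplete, hypothetical mapping named `partial`:

```python
import pandas as pd

codes = pd.Series(["EN", "MI", "SE", "EX"])

# Deliberately incomplete mapping, for illustration only
partial = {"EN": "Entry-level/Junior", "MI": "Mid-level/Intermediate"}

mapped = codes.map(partial)        # uncovered codes become NaN
replaced = codes.replace(partial)  # uncovered codes are left unchanged

print(mapped.isna().sum())
print(replaced.tolist())
```

Checking `isna().sum()` after a `map` call is a cheap way to confirm that no category was dropped by accident.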

In [4]:
# Checking unique values and their counts in 'remote_ratio' column
df.remote_ratio.unique()

array([  0,  50, 100])

In [5]:
df.remote_ratio.value_counts()

remote_ratio,count
100,381
0,127
50,99


In [None]:
df.employment_type.unique()

In [None]:
df.employment_type.value_counts()

TASK

- Change the values in the employment_type column:
    - 'FT' - 'Full time',
    - 'CT' - 'Contract',
    - 'PT' - 'Part Time',
    - 'FL' - 'Freelance'
- Compare the salaries of fully remote jobs (remote_ratio of 100) and onsite jobs (remote_ratio of 0). Which is greater?


- Using the data card at this [link](https://www.kaggle.com/datasets/abdallahwagih/telco-customer-churn) as a template, create a data card for this dataset.

Appendix

- [Principles of Data Wrangling](https://www.fintechfutures.com/files/2017/10/Trifacta_Principles-of-Data-Wrangling.pdf)