## Introduction:
Dataset Overview:


1.   File Name: Uncleaned_DS_jobs.csv
2.   Content: The dataset includes job postings for data science-related roles, with details such as job titles, salary estimates, job descriptions, company information, and more.
3.   Number of Rows: The dataset contains 672 job postings (rows), each described by 15 columns.
4.   Columns: The dataset has several columns, including:

*   Index: A unique identifier for each job posting.
*   Job Title: The title of the job (e.g., Data Scientist, Senior Data Scientist, etc.).
*   Salary Estimate: The estimated salary range as text (e.g., $137K-$171K (Glassdoor est.)).
*   Job Description: A detailed description of the job responsibilities, qualifications, and requirements.
*   Rating: The company rating (e.g., 3.1, 4.2, etc.).
*   Company Name: The name of the company offering the job.
*   Location: The location of the job (e.g., New York, NY, San Francisco, CA, etc.).
*   Headquarters: The headquarters location of the company.
*   Size: The size of the company (e.g., 1001 to 5000 employees).
*   Founded: The year the company was founded.
*   Type of Ownership: The type of ownership (e.g., Nonprofit Organization, Company - Public, etc.).
*   Industry: The industry the company operates in (e.g., Insurance, Biotech & Pharmaceuticals, etc.).
*   Sector: The sector of the industry (e.g., Insurance, Manufacturing, etc.).
*   Revenue: The revenue of the company (e.g., $100 to $500 million, $1 to $2 billion, etc.).
*   Competitors: Competitors of the company (if available).


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Libraries:



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re

## Import Dataset

In [28]:
df=pd.read_csv('/content/drive/MyDrive/Data/Dataset_ Unused/Uncleaned_DS_jobs.csv')
df.head(2)

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              672 non-null    int64  
 1   Job Title          672 non-null    object 
 2   Salary Estimate    672 non-null    object 
 3   Job Description    672 non-null    object 
 4   Rating             672 non-null    float64
 5   Company Name       672 non-null    object 
 6   Location           672 non-null    object 
 7   Headquarters       672 non-null    object 
 8   Size               672 non-null    object 
 9   Founded            672 non-null    int64  
 10  Type of ownership  672 non-null    object 
 11  Industry           672 non-null    object 
 12  Sector             672 non-null    object 
 13  Revenue            672 non-null    object 
 14  Competitors        672 non-null    object 
dtypes: float64(1), int64(2), object(12)
memory usage: 78.9+ KB




1.   There are no NULL values in the dataset, although several columns use -1 as a placeholder for missing data.
2.   Most columns have the object dtype.
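Because `df.info()` reports no nulls while missing entries are stored as -1, it can help to count those placeholders per column. The following is a minimal sketch on a small made-up sample; the helper name `count_placeholders` and the sample rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def count_placeholders(frame: pd.DataFrame, placeholder=-1) -> pd.Series:
    """Count cells per column that equal the placeholder, stored as a number or as text."""
    as_text = frame.astype(str)
    # An int -1 renders as "-1", a float -1.0 as "-1.0", and a string "-1" stays "-1"
    mask = as_text.isin([str(placeholder), str(float(placeholder))])
    return mask.sum()

# Hypothetical sample that mimics the layout of the dataset
sample = pd.DataFrame({
    "Rating": [3.1, -1.0, 4.2],
    "Headquarters": ["New York, NY", "-1", "-1"],
    "Founded": [1993, -1, 1968],
})
placeholder_counts = count_placeholders(sample)
print(placeholder_counts)
```

Running the same helper on `df` would give a per-column view of the hidden missing values before any cleaning.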




In [None]:
df.shape

(672, 15)

In [26]:
df.isnull().sum()

Unnamed: 0,0
Job Title,0
Mean of Salary Estimate (M$),0
Job Description,0
Rating,50
Company Name,0
Location,0
Headquarters,31
Size,27
Founded,118
Type of ownership,27


In [None]:
df.columns

Index(['index', 'Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors'],
      dtype='object')

In [None]:
pd.DataFrame(df.nunique(), columns=["Number of Unique values"])

Unnamed: 0,Number of Unique values
index,672
Job Title,172
Salary Estimate,30
Job Description,489
Rating,32
Company Name,432
Location,207
Headquarters,229
Size,9
Founded,103


## Perform an in-depth analysis to identify unique values

In [4]:
def value_count(df, column):
    """Return the five most frequent values of `column` together with their counts."""
    return pd.DataFrame(df[column].value_counts()).head(5)
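The helper above is called with a one-element list such as `['Job Title']`, which makes pandas build a MultiIndex and is the likely source of the `Unnamed: 0_level_0` headers in the outputs below. As a sketch of an alternative, the variant here takes a plain column name and is applied to every column in one pass; `top_values` and the sample data are hypothetical names introduced only for this example.

```python
import pandas as pd

def top_values(frame, column, n=5):
    """Return the n most frequent values of a single column as a one-column DataFrame."""
    return frame[column].value_counts().head(n).to_frame(name="count")

# Hypothetical sample data for illustration
sample = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Scientist", "Data Engineer"],
    "Location": ["New York, NY", "Boston, MA", "New York, NY"],
})

# One summary table per column, keyed by column name
summaries = {col: top_values(sample, col) for col in sample.columns}
print(summaries["Job Title"])
```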

In [None]:
value_count(df,column=['Job Title'])

Unnamed: 0_level_0,count
Job Title,Unnamed: 1_level_1
Data Scientist,337
Data Engineer,26
Senior Data Scientist,19
Machine Learning Engineer,16
Data Analyst,12


In [7]:
value_count(df,column=['Salary Estimate'])

Unnamed: 0_level_0,count
Salary Estimate,Unnamed: 1_level_1
$99K-$132K (Glassdoor est.),32
$79K-$131K (Glassdoor est.),32
$75K-$131K (Glassdoor est.),32
$90K-$109K (Glassdoor est.),30
$137K-$171K (Glassdoor est.),30


In [23]:
value_count(df,column=['Rating'])

Unnamed: 0_level_0,count
Rating,Unnamed: 1_level_1
3.5,58
3.3,41
4.0,41
3.9,40
3.8,39


In [None]:
value_count(df,column=['Company Name'])

Unnamed: 0_level_0,count
Company Name,Unnamed: 1_level_1
Hatch Data Inc,12
Maxar Technologies\n3.5,12
Tempus Labs\n3.3,11
AstraZeneca\n4.0,10
Klaviyo\n4.8,8


In [None]:
value_count(df,column=['Location'])

Unnamed: 0_level_0,count
Location,Unnamed: 1_level_1
"San Francisco, CA",69
"New York, NY",50
"Washington, DC",26
"Boston, MA",24
"Chicago, IL",22


In [None]:
value_count(df,column=['Headquarters'])

Unnamed: 0_level_0,count
Headquarters,Unnamed: 1_level_1
"New York, NY",33
-1,31
"San Francisco, CA",31
"Chicago, IL",23
"Boston, MA",19


In [None]:
value_count(df,column=['Size'])

Unnamed: 0_level_0,count
Size,Unnamed: 1_level_1
51 to 200 employees,135
1001 to 5000 employees,104
1 to 50 employees,86
201 to 500 employees,85
10000+ employees,80


In [None]:
value_count(df,column=['Founded'])

Unnamed: 0_level_0,count
Founded,Unnamed: 1_level_1
-1,118
2012,34
2011,25
2015,22
1996,22


In [None]:
value_count(df,column=['Type of ownership'])

Unnamed: 0_level_0,count
Type of ownership,Unnamed: 1_level_1
Company - Private,397
Company - Public,153
Nonprofit Organization,36
Subsidiary or Business Segment,28
-1,27


In [None]:
value_count(df,column=['Industry'])

Unnamed: 0_level_0,count
Industry,Unnamed: 1_level_1
-1,71
Biotech & Pharmaceuticals,66
IT Services,61
Computer Hardware & Software,57
Aerospace & Defense,46


In [None]:
value_count(df,column=['Sector'])

Unnamed: 0_level_0,count
Sector,Unnamed: 1_level_1
Information Technology,188
Business Services,120
Biotech & Pharmaceuticals,66
Aerospace & Defense,46
Finance,33


In [None]:
value_count(df,column=['Revenue'])

Unnamed: 0_level_0,count
Revenue,Unnamed: 1_level_1
Unknown / Non-Applicable,213
$100 to $500 million (USD),94
$10+ billion (USD),63
$2 to $5 billion (USD),45
$10 to $25 million (USD),41


In [None]:
value_count(df,column=['Competitors'])

Unnamed: 0_level_0,count
Competitors,Unnamed: 1_level_1
-1,501
"Roche, GlaxoSmithKline, Novartis",10
"Leidos, CACI International, Booz Allen Hamilton",6
"Los Alamos National Laboratory, Battelle, SRI International",6
"Nielsen, Zappi, SurveyMonkey",3


In [None]:
df.duplicated().sum()

0

# Data Cleaning

## Drop columns

In [36]:
# Drop the 'Competitors', 'index' and 'Job Description' columns as they are not needed.

df.drop(columns=['index','Competitors', 'Job Description'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Job Title          672 non-null    object  
 1   Salary Estimate    672 non-null    object  
 2   Rating             672 non-null    category
 3   Company Name       672 non-null    object  
 4   Location           672 non-null    object  
 5   Headquarters       672 non-null    object  
 6   Size               672 non-null    object  
 7   Founded            672 non-null    int64   
 8   Type of ownership  672 non-null    object  
 9   Industry           672 non-null    object  
 10  Sector             672 non-null    object  
 11  Revenue            672 non-null    object  
dtypes: category(1), int64(1), object(10)
memory usage: 58.8+ KB


In [22]:
# Replace the -1 placeholder with a missing-value marker
# TODO: write a function that replaces -1 only in specific columns; the 'Rating' column does not need the replacement
df.replace(-1, pd.NA, inplace= True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Job Title                     672 non-null    object
 1   Mean of Salary Estimate (M$)  672 non-null    int64 
 2   Job Description               672 non-null    object
 3   Rating                        622 non-null    object
 4   Company Name                  672 non-null    object
 5   Location                      672 non-null    object
 6   Headquarters                  641 non-null    object
 7   Size                          645 non-null    object
 8   Founded                       554 non-null    object
 9   Type of ownership             645 non-null    object
 10  Industry                      601 non-null    object
 11  Sector                        601 non-null    object
 12  Revenue                       645 non-null    object
dtypes: int64(1), object(
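The note in the cell above asks for a function that replaces -1 only in selected columns. One possible shape for it is sketched below; `replace_placeholder` and the sample frame are assumptions, and the sketch treats both the number -1 and the text "-1" as placeholders, since object columns read from CSV may hold the value as a string.

```python
import pandas as pd

def replace_placeholder(frame, columns, placeholder=-1):
    """Return a copy in which the placeholder becomes NaN, but only in the given columns."""
    cleaned = frame.copy()
    texts = [str(placeholder), str(float(placeholder))]  # "-1" and "-1.0"
    for col in columns:
        is_placeholder = cleaned[col].astype(str).isin(texts)
        cleaned[col] = cleaned[col].mask(is_placeholder)  # masked cells become NaN
    return cleaned

# Hypothetical sample; 'Rating' is deliberately left out of the column list
sample = pd.DataFrame({
    "Rating": [3.1, -1.0, 4.2],
    "Headquarters": ["New York, NY", "-1", "Herndon, VA"],
    "Founded": [1993, -1, 1968],
})
cleaned = replace_placeholder(sample, columns=["Headquarters", "Founded"])
print(cleaned.isna().sum())
```

Returning a copy instead of modifying the frame in place keeps the raw data available for comparison.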

## Salary Estimate

In [8]:
# Regex pattern to extract salary ranges
pattern = r'\$(\d+)K-\$(\d+)K'

# Iterate over rows and calculate the mean salary
for i, row in df.iterrows():
  text=row['Salary Estimate']
  matches = re.findall(pattern, text)  # Find all matches using the regex pattern
  if matches:  # Check if matches were found
    lower_salary = int(matches[0][0])  # Extract the lower salary
    upper_salary = int(matches[0][1])  # Extract the upper salary
    mean_salary = (lower_salary + upper_salary) / 2  # Calculate the mean salary
    df.at[i, 'Salary Estimate'] = mean_salary  # Update the DataFrame with the mean salary
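The row-by-row loop above can also be written as a vectorized operation. The sketch below uses the same regex with `Series.str.extract`; the function name `mean_salary_k` and the sample strings are illustrative assumptions. Rows that do not match the pattern come out as NaN instead of keeping their original text.

```python
import pandas as pd

def mean_salary_k(salary_text: pd.Series) -> pd.Series:
    """Parse strings like '$137K-$171K (Glassdoor est.)' into the range midpoint, in thousands of USD."""
    # Two capture groups give a two-column frame: lower bound and upper bound
    bounds = salary_text.str.extract(r'\$(\d+)K-\$(\d+)K').astype(float)
    return bounds.mean(axis=1)

# Hypothetical sample, including one row that does not match the pattern
sample = pd.Series([
    "$137K-$171K (Glassdoor est.)",
    "$99K-$132K (Glassdoor est.)",
    "no estimate",
])
means = mean_salary_k(sample)
print(means)
```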

In [12]:
# Convert 'Salary Estimate' to integers (astype truncates fractional means such as 115.5 to 115)
df['Salary Estimate'] = df['Salary Estimate'].astype('int')

# Rename the column to reflect its new content (values are in thousands of USD)
df.rename(columns={'Salary Estimate': 'Mean of Salary Estimate (M$)'}, inplace=True)

## Rating

In [32]:

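# Define bin edges and labels; pd.cut uses right-closed intervals, e.g. (3, 4] -> 'High'
# The (-2, 0] bin also captures the -1 placeholder used for missing ratings
bins = [-2, 0, 1, 2, 3, 4, 5]
labels = ['Very poor', 'Poor', 'Lower Medium', 'Upper Medium', 'High', 'Excellent']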

# Map values in column 'Rating' using pd.cut
df['Rating'] = pd.cut(df['Rating'], bins=bins, labels=labels)


In [35]:
value_count(df,column=['Rating'])

Unnamed: 0_level_0,count
Rating,Unnamed: 1_level_1
High,352
Excellent,225
Very poor,50
Upper Medium,44
Lower Medium,1
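The 50 'Very poor' entries match the number of postings whose rating is the -1 placeholder, so that label appears to describe missing ratings, not low ones. If the two should be kept apart, one option is to mask the placeholder before binning, as in this sketch; it reuses the notebook's bins and labels, while `bin_ratings` and the sample values are assumptions.

```python
import pandas as pd

def bin_ratings(rating: pd.Series) -> pd.Series:
    """Bin ratings into labeled ranges, leaving the -1 placeholder as NaN."""
    bins = [-2, 0, 1, 2, 3, 4, 5]
    labels = ['Very poor', 'Poor', 'Lower Medium', 'Upper Medium', 'High', 'Excellent']
    valid = rating.where(rating >= 0)  # negative placeholders become NaN
    return pd.cut(valid, bins=bins, labels=labels)

# Hypothetical sample with one missing rating encoded as -1
sample = pd.Series([3.1, -1.0, 4.2, 5.0, 2.0])
binned = bin_ratings(sample)
print(binned.value_counts(dropna=False))
```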
