# **Data Science Job Posting on Glassdoor**

<p align="center">
  <img src="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2019/09/11134058/What-is-data-science-2.jpg" alt="Data science illustration" width="650" height="400">
</p>

## Setup

First, let's set up the Google Colab environment by installing the Kaggle client, uploading the Kaggle API key file (`kaggle.json`), and downloading the dataset.

```python
# Install the Kaggle command-line client
!pip install kaggle

# Colab helper for uploading files from the local machine
from google.colab import files

# Upload the Kaggle API key file (kaggle.json)
uploaded = files.upload()

# Move the key to where the Kaggle client expects it and restrict its permissions
# (-p avoids an error if the directory already exists)
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset and extract it into ./job-posting
!kaggle datasets download -d rashikrahmanpritom/data-science-job-posting-on-glassdoor
!unzip -o data-science-job-posting-on-glassdoor.zip -d job-posting
```

# **Data Loading and Initial Exploration**

We load the job postings into a pandas DataFrame, drop the redundant `index` column, and check the shape and the first rows to understand the structure and content.

```python
import pandas as pd

jobs = pd.read_csv('/content/job-posting/Uncleaned_DS_jobs.csv')
jobs.drop(columns='index', inplace=True)
print('jobs Shape:', jobs.shape)
jobs.head()
```

```
jobs Shape: (672, 15)
```


# **Handling Missing Data**

`isnull().sum()` counts the missing (NaN) cells in each column.

```python
# Count missing (NaN) values in each column
missing_data1 = jobs.isnull().sum()
print('Missing Data:', missing_data1)
```

```
Missing Data: Job Title            0
Salary Estimate      0
Job Description      0
Rating               0
Company Name         0
Location             0
Headquarters         0
Size                 0
Founded              0
Type of ownership    0
Industry             0
Sector               0
Revenue              0
Competitors          0
Formatted Salary     0
dtype: int64
```
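
`isnull()` reports no missing cells, yet the preview rows later in this notebook show `-1` in `Competitors` and `Unknown / Non-Applicable` in `Revenue`, which suggests that missing values are stored as placeholders rather than NaN. The following is a minimal sketch for counting such placeholders. The `sample` frame, the `placeholders` list, and the `-1` in `Founded` are assumptions for illustration, not values confirmed for the full dataset.

```python
import pandas as pd

# Hypothetical sample that mimics the placeholder values seen in the previews
sample = pd.DataFrame({
    'Competitors': ['EmblemHealth, UnitedHealth Group, Aetna', '-1', '-1'],
    'Revenue': ['Unknown / Non-Applicable', '$1 to $2 billion (USD)',
                '$100 to $500 million (USD)'],
    'Founded': [1993, 1968, -1],  # the -1 here is an assumption
})

# Assumed placeholder values; adjust after inspecting the real data
placeholders = ['-1', 'Unknown / Non-Applicable']

# Compare everything as strings so numeric and text columns are handled alike
placeholder_counts = sample.astype(str).isin(placeholders).sum()
print(placeholder_counts)
```

Running the same two lines on `jobs` instead of `sample` would show which columns rely on placeholders before deciding how to treat them.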


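# **Extracting formatted values from Salary Estimate column**

We take the first two numbers in each `Salary Estimate` string, store them as a `low - high` range in a new `Formatted Salary` column, and then drop the original column.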

```python
import re

def format_salary(salary_string):
    """Return the first two numbers in salary_string as 'low - high'."""
    # Extract all runs of digits from the raw text
    salary_values = re.findall(r'\d+', salary_string)
    if len(salary_values) >= 2:
        return f"{salary_values[0]} - {salary_values[1]}"
    return "Not Available"

# Create a new column with the formatted salary range
jobs['Formatted Salary'] = jobs['Salary Estimate'].apply(format_salary)

# Continue with a frame that no longer carries the raw salary text
jobs_new = jobs.drop(columns='Salary Estimate')
```
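
The formatted string is easy to read but cannot be used for arithmetic. A possible follow-up, sketched here under assumptions, is to keep the two bounds as numbers as well. The raw format `$137K-$171K (Glassdoor est.)` and the column names `Min Salary`, `Max Salary`, and `Avg Salary` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows; the raw text format is an assumption
sample = pd.DataFrame({'Salary Estimate': [
    '$137K-$171K (Glassdoor est.)',
    '$75K-$131K (Glassdoor est.)',
    'Not disclosed',
]})

# Capture the first two integers; rows without two numbers yield NaN
bounds = sample['Salary Estimate'].str.extract(r'(\d+)\D+(\d+)')
bounds = bounds.apply(pd.to_numeric)

sample['Min Salary'] = bounds[0]
sample['Max Salary'] = bounds[1]
sample['Avg Salary'] = (sample['Min Salary'] + sample['Max Salary']) / 2
print(sample[['Min Salary', 'Max Salary', 'Avg Salary']])
```

Numeric columns make it possible to sort, group, or plot salaries, while unparseable rows stay as NaN instead of the string `Not Available`.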


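# **Removing numbers from Company name column**

Each company name is followed by a newline and a number, so we keep only the text before the first newline.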

```python
# Keep only the part of the company name before the first newline
jobs_new['Company Name'] = jobs['Company Name'].str.split('\n').str[0]
jobs_new.head()
```

| | Job Title | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Formatted Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sr Data Scientist | Description\n\nThe Senior Data Scientist is re... | 3.1 | Healthfirst | New York, NY | New York, NY | 1001 to 5000 employees | 1993 | Nonprofit Organization | Insurance Carriers | Insurance | Unknown / Non-Applicable | EmblemHealth, UnitedHealth Group, Aetna | 137 - 171 |
| 1 | Data Scientist | Secure our Nation, Ignite your Future\n\nJoin ... | 4.2 | ManTech | Chantilly, VA | Herndon, VA | 5001 to 10000 employees | 1968 | Company - Public | Research & Development | Business Services | $1 to $2 billion (USD) | -1 | 137 - 171 |
| 2 | Data Scientist | Overview\n\n\nAnalysis Group is one of the lar... | 3.8 | Analysis Group | Boston, MA | Boston, MA | 1001 to 5000 employees | 1981 | Private Practice / Firm | Consulting | Business Services | $100 to $500 million (USD) | -1 | 137 - 171 |
| 3 | Data Scientist | JOB DESCRIPTION:\n\nDo you have a passion for ... | 3.5 | INFICON | Newton, MA | Bad Ragaz, Switzerland | 501 to 1000 employees | 2000 | Company - Public | Electrical & Electronic Manufacturing | Manufacturing | $100 to $500 million (USD) | MKS Instruments, Pfeiffer Vacuum, Agilent Tech... | 137 - 171 |
| 4 | Data Scientist | Data Scientist\nAffinity Solutions / Marketing... | 2.9 | Affinity Solutions | New York, NY | New York, NY | 51 to 200 employees | 1998 | Company - Private | Advertising & Marketing | Business Services | Unknown / Non-Applicable | Commerce Signals, Cardlytics, Yodlee | 137 - 171 |
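
To make the effect of the split concrete, here is a small sketch on made-up values. The raw format `Name\n3.1` is an assumption based on the cell above, and `raw`, `clean`, and `suffix` are hypothetical names.

```python
import pandas as pd

# Hypothetical raw values: a name, optionally followed by a newline and a number
raw = pd.Series(['Healthfirst\n3.1', 'ManTech\n4.2', 'Acme'])

parts = raw.str.split('\n')

# Text before the first newline is the clean company name
clean = parts.str[0]

# Text after the newline, converted to a number; NaN when there is no suffix
suffix = pd.to_numeric(parts.str[1])

print(clean.tolist())
print(suffix.tolist())
```

Names without a newline pass through unchanged, so the split is safe to apply to every row.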


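# **Adding 'Company age' and 'State' columns**

Company age is the current year minus the founding year, and the state is the part of `Location` after the last comma.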

```python
from datetime import datetime

# Age relative to the current year, so the values depend on when the notebook is run
# (the output below was produced in 2023)
jobs_new['Company age'] = datetime.now().year - jobs['Founded']

# Take the text after the last comma and strip the leading space left by the split
jobs_new['State'] = jobs['Location'].str.split(',').str[-1].str.strip()
jobs_new.head()
```

| | Job Title | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Formatted Salary | Company age | State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sr Data Scientist | Description\n\nThe Senior Data Scientist is re... | 3.1 | Healthfirst | New York, NY | New York, NY | 1001 to 5000 employees | 1993 | Nonprofit Organization | Insurance Carriers | Insurance | Unknown / Non-Applicable | EmblemHealth, UnitedHealth Group, Aetna | 137 - 171 | 30 | NY |
| 1 | Data Scientist | Secure our Nation, Ignite your Future\n\nJoin ... | 4.2 | ManTech | Chantilly, VA | Herndon, VA | 5001 to 10000 employees | 1968 | Company - Public | Research & Development | Business Services | $1 to $2 billion (USD) | -1 | 137 - 171 | 55 | VA |
| 2 | Data Scientist | Overview\n\n\nAnalysis Group is one of the lar... | 3.8 | Analysis Group | Boston, MA | Boston, MA | 1001 to 5000 employees | 1981 | Private Practice / Firm | Consulting | Business Services | $100 to $500 million (USD) | -1 | 137 - 171 | 42 | MA |
| 3 | Data Scientist | JOB DESCRIPTION:\n\nDo you have a passion for ... | 3.5 | INFICON | Newton, MA | Bad Ragaz, Switzerland | 501 to 1000 employees | 2000 | Company - Public | Electrical & Electronic Manufacturing | Manufacturing | $100 to $500 million (USD) | MKS Instruments, Pfeiffer Vacuum, Agilent Tech... | 137 - 171 | 23 | MA |
| 4 | Data Scientist | Data Scientist\nAffinity Solutions / Marketing... | 2.9 | Affinity Solutions | New York, NY | New York, NY | 51 to 200 employees | 1998 | Company - Private | Advertising & Marketing | Business Services | Unknown / Non-Applicable | Commerce Signals, Cardlytics, Yodlee | 137 - 171 | 25 | NY |
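
Two edge cases are worth guarding against here. If `Founded` uses the same `-1` placeholder that appears in `Competitors` (an assumption, not something the output above confirms), subtracting it would give an age of more than two thousand years. A `Location` without a comma would also be copied into `State` unchanged. The sketch below handles both on a hypothetical `sample` frame, with a fixed `reference_year` chosen only to keep the result reproducible.

```python
import pandas as pd

# Hypothetical rows, including a placeholder year and a location without a state
sample = pd.DataFrame({
    'Founded': [1993, -1, 2000],
    'Location': ['New York, NY', 'Remote', 'Newton, MA'],
})

reference_year = 2023  # fixed so the result does not change from year to year

# Treat non-positive years as missing before subtracting
founded = sample['Founded'].where(sample['Founded'] > 0)
sample['Company age'] = reference_year - founded

# Keep a state only when the location actually contains a comma
parts = sample['Location'].str.split(',')
sample['State'] = parts.str[-1].str.strip().where(parts.str.len() > 1)

print(sample)
```

Rows that cannot be resolved end up as NaN, which keeps them out of averages and group counts.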


# **Identifying hard skills from job descriptions**

```python
hard_skills = [
    'Python', 'SQL', 'Machine Learning', 'R', 'Statistics', 'Data Visualization',
    'Feature Engineering', 'Hadoop', 'Spark', 'Big Data', 'Tableau', 'A/B Testing',
    'NoSQL', 'Communication', 'Model Deployment']

# Add one 0/1 column per skill: 1 if the skill name appears in the job description.
# Note: this is a case-insensitive substring match, so a short name such as 'R'
# matches almost every description and 'SQL' also matches inside 'NoSQL'.
for skill in hard_skills:
    jobs_new[skill] = jobs['Job Description'].str.contains(skill, case=False).astype(int)

jobs_new.head(5)
```

| | Job Title | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | ... | Data Visualization | Feature Engineering | Hadoop | Spark | Big Data | Tableau | A/B Testing | NoSQL | Communication | Model Deployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sr Data Scientist | Description\n\nThe Senior Data Scientist is re... | 3.1 | Healthfirst | New York, NY | New York, NY | 1001 to 5000 employees | 1993 | Nonprofit Organization | Insurance Carriers | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Data Scientist | Secure our Nation, Ignite your Future\n\nJoin ... | 4.2 | ManTech | Chantilly, VA | Herndon, VA | 5001 to 10000 employees | 1968 | Company - Public | Research & Development | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | Data Scientist | Overview\n\n\nAnalysis Group is one of the lar... | 3.8 | Analysis Group | Boston, MA | Boston, MA | 1001 to 5000 employees | 1981 | Private Practice / Firm | Consulting | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | Data Scientist | JOB DESCRIPTION:\n\nDo you have a passion for ... | 3.5 | INFICON | Newton, MA | Bad Ragaz, Switzerland | 501 to 1000 employees | 2000 | Company - Public | Electrical & Electronic Manufacturing | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | Data Scientist | Data Scientist\nAffinity Solutions / Marketing... | 2.9 | Affinity Solutions | New York, NY | New York, NY | 51 to 200 employees | 1998 | Company - Private | Advertising & Marketing | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
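
Because `str.contains(skill, case=False)` is a case-insensitive substring search, `'R'` matches any description containing the letter "r", and `'SQL'` also fires on `NoSQL`. One possible refinement, sketched here rather than taken from the original notebook, is to require that a match is not attached to other word characters and to match single-letter names case-sensitively. The `descriptions` series and the `skill_pattern` helper are hypothetical.

```python
import re
import pandas as pd

# Hypothetical job descriptions chosen to trigger the substring pitfalls
descriptions = pd.Series([
    'Experience with Python and NoSQL stores required.',
    'Strong R and SQL skills; familiarity with Spark.',
    'Great communicator who enjoys sparking new ideas.',
])

skills = ['Python', 'SQL', 'R', 'Spark']

def skill_pattern(skill):
    # The lookarounds reject matches that touch another word character,
    # so 'SQL' no longer matches inside 'NoSQL' and 'Spark' not inside 'sparking'
    return rf'(?<!\w){re.escape(skill)}(?!\w)'

flags = pd.DataFrame({
    # Single-letter names such as 'R' are matched case-sensitively
    skill: descriptions.str.contains(
        skill_pattern(skill), case=len(skill) > 1, regex=True
    ).astype(int)
    for skill in skills
})
print(flags)
```

Lookarounds are used instead of `\b` so that names containing non-word characters, such as `A/B Testing`, are escaped and bounded in the same way as plain words.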
