## Introduction:
Dataset Overview:


1.   File Name: Uncleaned_DS_jobs.csv
2.   Content: The dataset includes job postings for data science-related roles, with details such as job titles, salary estimates, job descriptions, company information, and more.
3.   Number of Rows: 672 job postings, one per row.
4.   Columns: The dataset has 15 columns, including:

*   Index: A unique identifier for each job posting.
*   Job Title: The title of the job (e.g., Data Scientist, Senior Data Scientist, etc.).
*   Salary Estimate: The estimated salary range (e.g., $137K-$171K (Glassdoor est.)).
*   Job Description: A detailed description of the job responsibilities, qualifications, and requirements.
*   Rating: The company rating (e.g., 3.1, 4.2, etc.).
*   Company Name: The name of the company offering the job.
*   Location: The location of the job (e.g., New York, NY, San Francisco, CA, etc.).
*   Headquarters: The headquarters location of the company.
*   Size: The size of the company (e.g., 1001 to 5000 employees).
*   Founded: The year the company was founded.
*   Type of ownership: The type of ownership (e.g., Nonprofit Organization, Company - Public, etc.).
*   Industry: The industry the company operates in (e.g., Insurance, Biotech & Pharmaceuticals, etc.).
*   Sector: The sector of the industry (e.g., Insurance, Manufacturing, etc.).
*   Revenue: The revenue of the company (e.g., $100 to $500 million (USD), $1 to $2 billion (USD), etc.).
*   Competitors: Competitors of the company (if available).


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Libraries:



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Import Dataset

In [83]:
df = pd.read_csv('/content/drive/MyDrive/Data/Dataset_ Unused/Uncleaned_DS_jobs.csv')
df.head(2)

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              672 non-null    int64  
 1   Job Title          672 non-null    object 
 2   Salary Estimate    672 non-null    object 
 3   Job Description    672 non-null    object 
 4   Rating             672 non-null    float64
 5   Company Name       672 non-null    object 
 6   Location           672 non-null    object 
 7   Headquarters       672 non-null    object 
 8   Size               672 non-null    object 
 9   Founded            672 non-null    int64  
 10  Type of ownership  672 non-null    object 
 11  Industry           672 non-null    object 
 12  Sector             672 non-null    object 
 13  Revenue            672 non-null    object 
 14  Competitors        672 non-null    object 
dtypes: float64(1), int64(2), object(12)
memory usage: 78.9+ KB




1.   There are no null values in the dataset.
2.   Most columns (12 of 15) have the `object` dtype.
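
The null check only catches real NaN values. The value counts later in this notebook show `-1` in columns such as Rating, Headquarters, and Founded, which suggests that missing data is stored as a placeholder instead. Below is a minimal sketch, using a small made-up frame rather than the real file, of how such placeholders could be counted per column. The helper name `count_placeholders` and the choice of `-1` / `"-1"` as placeholders are assumptions for illustration.

```python
import pandas as pd

def count_placeholders(df, numeric_placeholder=-1, text_placeholder="-1"):
    """Count cells per column that equal the assumed placeholder value."""
    counts = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric columns: compare against the numeric placeholder.
            counts[col] = int((df[col] == numeric_placeholder).sum())
        else:
            # Text columns: compare against the string placeholder.
            counts[col] = int((df[col] == text_placeholder).sum())
    return pd.Series(counts)

# Made-up sample that mimics a few of the real columns.
sample = pd.DataFrame({
    "Rating": [3.1, -1.0, 4.2],
    "Founded": [1993, -1, -1],
    "Headquarters": ["New York, NY", "-1", "Herndon, VA"],
})
placeholder_counts = count_placeholders(sample)
print(placeholder_counts)
```

Calling `count_placeholders(df)` on the loaded dataset would give the same kind of per-column summary, under the assumption that `-1` really does mark missing data.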




In [5]:
df.shape

(672, 15)

In [11]:
df.isnull().sum()

Unnamed: 0,0
index,0
Job Title,0
Salary Estimate,0
Job Description,0
Rating,0
Company Name,0
Location,0
Headquarters,0
Size,0
Founded,0


In [7]:
df.columns

Index(['index', 'Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors'],
      dtype='object')

In [15]:
pd.DataFrame(df.nunique(), columns=["Number of Unique values"])

Unnamed: 0,Number of Unique values
index,672
Job Title,172
Salary Estimate,30
Job Description,489
Rating,32
Company Name,432
Location,207
Headquarters,229
Size,9
Founded,103


## Perform an in-depth analysis to identify unique values

In [95]:
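def value_count(df, columns, n=5):
  """Return the top-n value counts for each column as a list of DataFrames."""
  # Collect one frame per column; a return inside the loop would stop after the first column.
  frames = [pd.DataFrame(df[col].value_counts()).head(n) for col in columns]
  # A single column yields a single DataFrame, so the calls below display as before.
  return frames[0] if len(frames) == 1 else frames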

In [97]:
value_count(df,columns=['Job Title'])

Job Title,count
Data Scientist,337
Data Engineer,26
Senior Data Scientist,19
Machine Learning Engineer,16
Data Analyst,12


In [98]:
value_count(df,columns=['Salary Estimate'])

Salary Estimate,count
$79K-$131K (Glassdoor est.),32
$99K-$132K (Glassdoor est.),32
$75K-$131K (Glassdoor est.),32
$137K-$171K (Glassdoor est.),30
$90K-$109K (Glassdoor est.),30
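
The salary strings above bundle a lower bound, an upper bound, and a source label in one field. Below is a minimal sketch, assuming every value follows the `$<low>K-$<high>K (...)` pattern seen in this output, of how the two bounds could be pulled out as integers. The helper name `parse_salary_range` is hypothetical, and values that do not match return `None` rather than a guess.

```python
import re

# Assumed pattern, based only on the values shown above,
# e.g. "$137K-$171K (Glassdoor est.)".
SALARY_PATTERN = re.compile(r"\$(\d+)K\s*-\s*\$(\d+)K")

def parse_salary_range(text):
    """Return (low, high) in thousands of dollars, or None if the text does not match."""
    match = SALARY_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

parsed = parse_salary_range("$137K-$171K (Glassdoor est.)")
print(parsed)  # (137, 171)
```

Applied to the column, this could look like `df['Salary Estimate'].map(parse_salary_range)`.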


In [106]:
value_count(df,columns=['Rating'])

Rating,count
3.5,58
-1.0,50
4.0,41
3.3,41
3.9,40


In [99]:
value_count(df,columns=['Company Name'])

Company Name,count
Hatch Data Inc,12
Maxar Technologies\n3.5,12
Tempus Labs\n3.3,11
AstraZeneca\n4.0,10
Klaviyo\n4.8,8
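
Several company names above carry the rating after a line break (for example `Maxar Technologies\n3.5`). Below is a minimal sketch, assuming the name is always the text before the first newline, of stripping that suffix. The helper name `clean_company_name` is hypothetical.

```python
def clean_company_name(raw):
    """Keep only the text before the first newline and trim surrounding whitespace."""
    return raw.split("\n", 1)[0].strip()

# Sample values copied from the output above.
names = ["Maxar Technologies\n3.5", "Hatch Data Inc", "Klaviyo\n4.8"]
cleaned = [clean_company_name(name) for name in names]
print(cleaned)  # ['Maxar Technologies', 'Hatch Data Inc', 'Klaviyo']
```

On the real column the same idea could be written as `df['Company Name'].str.split('\n').str[0]`.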


In [100]:
value_count(df,columns=['Location'])

Location,count
"San Francisco, CA",69
"New York, NY",50
"Washington, DC",26
"Boston, MA",24
"Chicago, IL",22


In [111]:
value_count(df,columns=['Headquarters'])

Headquarters,count
"New York, NY",33
-1,31
"San Francisco, CA",31
"Chicago, IL",23
"Boston, MA",19


In [112]:
value_count(df,columns=['Size'])

Size,count
51 to 200 employees,135
1001 to 5000 employees,104
1 to 50 employees,86
201 to 500 employees,85
10000+ employees,80


In [113]:
value_count(df,columns=['Founded'])

Founded,count
-1,118
2012,34
2011,25
2015,22
2010,22


In [114]:
value_count(df,columns=['Type of ownership'])

Type of ownership,count
Company - Private,397
Company - Public,153
Nonprofit Organization,36
Subsidiary or Business Segment,28
-1,27


In [115]:
value_count(df,columns=['Industry'])

Industry,count
-1,71
Biotech & Pharmaceuticals,66
IT Services,61
Computer Hardware & Software,57
Aerospace & Defense,46


In [116]:
value_count(df,columns=['Sector'])

Sector,count
Information Technology,188
Business Services,120
-1,71
Biotech & Pharmaceuticals,66
Aerospace & Defense,46


In [117]:
value_count(df,columns=['Revenue'])

Revenue,count
Unknown / Non-Applicable,213
$100 to $500 million (USD),94
$10+ billion (USD),63
$2 to $5 billion (USD),45
$10 to $25 million (USD),41


In [118]:
value_count(df,columns=['Competitors'])

Competitors,count
-1,501
"Roche, GlaxoSmithKline, Novartis",10
"Los Alamos National Laboratory, Battelle, SRI International",6
"Leidos, CACI International, Booz Allen Hamilton",6
"MIT Lincoln Laboratory, Lockheed Martin, Northrop Grumman",3


In [104]:
df.duplicated().sum()

0

# Data Cleaning

## Drop columns

In [120]:
# `columns=` already names the axis, so `axis=1` is redundant here.
df.drop(columns=['index', 'Competitors'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          672 non-null    object 
 1   Salary Estimate    672 non-null    object 
 2   Job Description    672 non-null    object 
 3   Rating             672 non-null    float64
 4   Company Name       672 non-null    object 
 5   Location           672 non-null    object 
 6   Headquarters       672 non-null    object 
 7   Size               672 non-null    object 
 8   Founded            672 non-null    int64  
 9   Type of ownership  672 non-null    object 
 10  Industry           672 non-null    object 
 11  Sector             672 non-null    object 
 12  Revenue            672 non-null    object 
dtypes: float64(1), int64(1), object(11)
memory usage: 68.4+ KB
