Data from GitHub: Glassdoor_jobs.csv by Ken Jee

Data Wrangling by Elizabeth Ramos


In [6]:
# import libraries for data manipulation and exploration
import pandas as pd

## 1. Loading and Inspecting the Data

In [85]:
# read the file
df = pd.read_csv(r'......Glassdor Jobs - Python\glassdoor_jobs.csv') 
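
The `Unnamed: 0` column that shows up below looks like an index that was saved together with the data. As a minimal sketch (using a tiny made-up CSV held in memory, not the real file), `index_col=0` tells `read_csv` to use that first column as the index so no extra column is created. The names `sample_csv` and `sample_df` are only for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the real file; the empty first header mimics a saved index.
sample_csv = io.StringIO(
    ",Job Title,Rating\n"
    "0,Data Scientist,3.8\n"
    "1,Data Engineer,3.4\n"
)

# index_col=0 uses the first column as the index instead of creating 'Unnamed: 0'.
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

With the real file this would make the later `drop` step unnecessary, assuming the unnamed column really is just a saved index.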

In [86]:
# Look at the first rows to see what the data looks like
df.head()

,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Data Scientist,$53K-$91K (Glassdoor est.),"Data Scientist\nLocation: Albuquerque, NM\nEdu...",3.8,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,$50 to $100 million (USD),-1
1,1,Healthcare Data Scientist,$63K-$112K (Glassdoor est.),What You Will Do:\n\nI. General Summary\n\nThe...,3.4,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1
2,2,Data Scientist,$80K-$90K (Glassdoor est.),"KnowBe4, Inc. is a high growth information sec...",4.8,KnowBe4\n4.8,"Clearwater, FL","Clearwater, FL",501 to 1000 employees,2010,Company - Private,Security Services,Business Services,$100 to $500 million (USD),-1
3,3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8,PNNL\n3.8,"Richland, WA","Richland, WA",1001 to 5000 employees,1965,Government,Energy,"Oil, Gas, Energy & Utilities",$500 million to $1 billion (USD),"Oak Ridge National Laboratory, National Renewa..."
4,4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee"


Observations:
- The Unnamed column is not adding value, so we should remove it
- Clean the Salary Estimate: remove the "(Glassdoor est.)" text, the K and the $
- Parse the company name (text only)
- Add a column for the state
- Extract skills from the job description
- Calculate the company's years in business

In [87]:
# Check all columns datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 956 entries, 0 to 955
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         956 non-null    int64  
 1   Job Title          956 non-null    object 
 2   Salary Estimate    956 non-null    object 
 3   Job Description    956 non-null    object 
 4   Rating             956 non-null    float64
 5   Company Name       956 non-null    object 
 6   Location           956 non-null    object 
 7   Headquarters       956 non-null    object 
 8   Size               956 non-null    object 
 9   Founded            956 non-null    int64  
 10  Type of ownership  956 non-null    object 
 11  Industry           956 non-null    object 
 12  Sector             956 non-null    object 
 13  Revenue            956 non-null    object 
 14  Competitors        956 non-null    object 
dtypes: float64(1), int64(2), object(12)
memory usage: 112.2+ KB


Observations:
- 956 observations and 15 variables
- Several columns need a different data type, for example Salary Estimate, which is stored as text (object)

## 2. Data Cleaning

In [88]:
# Check for null values (if any)
df.isnull().values.any()

False

In [89]:
# Check for duplicates
df.duplicated().sum()

0
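
`isnull()` finds nothing, but the preview above shows `-1` in Competitors, which suggests missing values may be stored as a `-1` placeholder instead of NaN. The sketch below counts that placeholder per column on a small made-up frame. The frame `sample` and the assumption that `-1` always means "missing" are mine, not something the dataset documents.

```python
import pandas as pd

# Made-up rows that imitate the placeholder pattern seen in the preview.
sample = pd.DataFrame({
    "Salary Estimate": ["$53K-$91K (Glassdoor est.)", "-1", "-1"],
    "Rating": [3.8, -1.0, 4.1],
    "Founded": [1973, -1, 2010],
})

# Compare as text so string, integer and float columns are handled the same way.
sentinel_counts = sample.astype(str).isin(["-1", "-1.0"]).sum()
print(sentinel_counts)
```

Running the same expression on `df` would show which columns need attention before cleaning.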

In [90]:
# Check for unique values in general to get a better picture of the data values
print(df.apply(lambda col: col.unique()))

Unnamed: 0           [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
Job Title            [Data Scientist, Healthcare Data Scientist, Re...
Salary Estimate      [$53K-$91K (Glassdoor est.), $63K-$112K (Glass...
Job Description      [Data Scientist\nLocation: Albuquerque, NM\nEd...
Rating               [3.8, 3.4, 4.8, 2.9, 4.1, 3.3, 4.6, 3.5, 3.2, ...
Company Name         [Tecolote Research\n3.8, University of Marylan...
Location             [Albuquerque, NM, Linthicum, MD, Clearwater, F...
Headquarters         [Goleta, CA, Baltimore, MD, Clearwater, FL, Ri...
Size                 [501 to 1000 employees, 10000+ employees, 1001...
Founded              [1973, 1984, 2010, 1965, 1998, 2000, 2008, 200...
Type of ownership    [Company - Private, Other Organization, Govern...
Industry             [Aerospace & Defense, Health Care Services & H...
Sector               [Aerospace & Defense, Health Care, Business Se...
Revenue              [$50 to $100 million (USD), $2 to $5 billion (...
Compet

In [92]:
# Let's drop the Unnamed column
jobs = df.drop(['Unnamed: 0'],axis =1)

In [94]:
jobs.columns

Index(['Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors'],
      dtype='object')

In [95]:
# Before we start cleaning, let's inspect what is actually in each column
jobs['Job Title'].value_counts()

Data Scientist                                                        178
Data Engineer                                                          68
Senior Data Scientist                                                  42
Data Analyst                                                           18
Senior Data Engineer                                                   17
                                                                     ... 
Program/Data Analyst                                                    1
Software Engineer (Data Scientist/Software Engineer) - SISW - MG        1
Associate Research Scientist I (Protein Expression and Production)      1
Senior Engineer, Data Management Engineering                            1
Sr. Manager, Data Science - Marketing Mix Media                         1
Name: Job Title, Length: 328, dtype: int64

In [96]:
# Variable 'Salary Estimate'. Let's check what values it contains:
unique_v = jobs['Salary Estimate'].unique()
print(unique_v)

['$53K-$91K (Glassdoor est.)' '$63K-$112K (Glassdoor est.)'
 '$80K-$90K (Glassdoor est.)' '$56K-$97K (Glassdoor est.)'
 '$86K-$143K (Glassdoor est.)' '$71K-$119K (Glassdoor est.)'
 '$54K-$93K (Glassdoor est.)' '$86K-$142K (Glassdoor est.)'
 '$38K-$84K (Glassdoor est.)' '$120K-$160K (Glassdoor est.)'
 '$126K-$201K (Glassdoor est.)' '$64K-$106K (Glassdoor est.)'
 '$106K-$172K (Glassdoor est.)' '$46K-$85K (Glassdoor est.)'
 '$83K-$144K (Glassdoor est.)' '$102K-$190K (Glassdoor est.)'
 '$67K-$137K (Glassdoor est.)' '$118K-$189K (Glassdoor est.)'
 '$110K-$175K (Glassdoor est.)' '$64K-$111K (Glassdoor est.)'
 '$81K-$130K (Glassdoor est.)' '$73K-$119K (Glassdoor est.)'
 '$86K-$139K (Glassdoor est.)' '$63K-$105K (Glassdoor est.)' '-1'
 '$109K-$177K (Glassdoor est.)' '$63K-$110K (Glassdoor est.)'
 '$75K-$124K (Glassdoor est.)' '$34K-$61K (Glassdoor est.)'
 '$72K-$120K (Glassdoor est.)' '$93K-$149K (Glassdoor est.)'
 '$85K-$140K (Glassdoor est.)' '$77K-$135K (Glassdoor est.)'
 '$82K-$132K (Glass

Observations:
- Some records have a Salary Estimate of -1, which carries no salary information. We should delete them
- The '(Glassdoor est.)' text is not adding value, so we should remove it
- 'Employer Provided Salary:' is worth keeping, but as a flag in another column
- 'Per Hour' is worth keeping too, also as a flag in another column
- Add new columns for minimum and maximum salary (split Salary Estimate)
- Add a new column for the average salary
- Add columns for the most wanted skills
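
The plan in the list above can be sketched as one small parsing function. This is only an illustration in plain Python, not the approach used in the cells that follow: `SalaryRange` and `parse_salary` are hypothetical names, and the regular expression assumes every salary looks like `$<number>K-$<number>K` or `$<number>-$<number>`, as in the values printed above.

```python
import re
from typing import NamedTuple, Optional


class SalaryRange(NamedTuple):
    low: int
    high: int
    hourly: bool
    employer_provided: bool


# Two dollar amounts separated by a dash; the K suffix is optional (hourly rates omit it).
_RANGE = re.compile(r"\$(\d+)K?\s*-\s*\$(\d+)K?")


def parse_salary(text: str) -> Optional[SalaryRange]:
    """Return the parsed range, or None when no range is present (e.g. '-1')."""
    match = _RANGE.search(text)
    if match is None:
        return None
    lowered = text.lower()
    return SalaryRange(
        low=int(match.group(1)),
        high=int(match.group(2)),
        hourly="per hour" in lowered,
        employer_provided="employer provided salary" in lowered,
    )


print(parse_salary("$53K-$91K (Glassdoor est.)"))
print(parse_salary("$21-$34 Per Hour(Glassdoor est.)"))
print(parse_salary("-1"))
```

A function like this could be passed to `jobs['Salary Estimate'].apply(...)`, and rows that return `None` are the ones to drop.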

In [97]:
# Let's see how many records we need to delete
jobs['Salary Estimate'].value_counts()

-1                                  214
$86K-$143K (Glassdoor est.)           6
$21-$34 Per Hour(Glassdoor est.)      6
$54K-$115K (Glassdoor est.)           6
$49K-$113K (Glassdoor est.)           6
                                   ... 
$79K-$127K (Glassdoor est.)           1
$115K-$180K (Glassdoor est.)          1
$39K-$82K (Glassdoor est.)            1
$33K-$61K (Glassdoor est.)            1
$68K-$125K (Glassdoor est.)           1
Name: Salary Estimate, Length: 417, dtype: int64

In [98]:
# There are 214 records with '-1' that we need to delete
jobs = jobs[jobs['Salary Estimate'] != '-1']
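
New columns are added to the filtered frame in later cells, and in some pandas versions that can raise a `SettingWithCopyWarning`. A hedged sketch of one common way to avoid it is to take an explicit copy after filtering. The frame `sample` and the names `has_salary` and `filtered` are made up for the example.

```python
import pandas as pd

# Made-up rows; '-1' marks a posting without a salary estimate.
sample = pd.DataFrame({
    "Salary Estimate": [
        "$53K-$91K (Glassdoor est.)",
        "-1",
        "$80K-$90K (Glassdoor est.)",
    ]
})

# Keep rows that have a salary and copy them so later assignments
# clearly modify the new frame, not a view of the original.
has_salary = sample["Salary Estimate"] != "-1"
filtered = sample[has_salary].copy()
filtered["Hourly"] = 0

print(len(filtered), filtered.index.tolist())
```

Note that the original row labels are kept after filtering, which is why the index later runs up to 955 with only 742 rows.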

In [99]:
# Let's take a look again
jobs['Salary Estimate'].value_counts()

$54K-$115K (Glassdoor est.)         6
$49K-$113K (Glassdoor est.)         6
$21-$34 Per Hour(Glassdoor est.)    6
$86K-$143K (Glassdoor est.)         6
$76K-$142K (Glassdoor est.)         5
                                   ..
$39K-$82K (Glassdoor est.)          1
$115K-$180K (Glassdoor est.)        1
$57K-$118K (Glassdoor est.)         1
$33K-$61K (Glassdoor est.)          1
$79K-$134K (Glassdoor est.)         1
Name: Salary Estimate, Length: 416, dtype: int64

In [100]:
# Now we need to remove the (Glassdoor est.)
# Let's use a lambda function to split on '(' and keep the part before it
salary = jobs['Salary Estimate'].apply(lambda x: x.split('(')[0])

In [35]:
salary_unique = salary.unique()
print(salary_unique)

['$53K-$91K ' '$63K-$112K ' '$80K-$90K ' '$56K-$97K ' '$86K-$143K '
 '$71K-$119K ' '$54K-$93K ' '$86K-$142K ' '$38K-$84K ' '$120K-$160K '
 '$126K-$201K ' '$64K-$106K ' '$106K-$172K ' '$46K-$85K ' '$83K-$144K '
 '$102K-$190K ' '$67K-$137K ' '$118K-$189K ' '$110K-$175K ' '$64K-$111K '
 '$81K-$130K ' '$73K-$119K ' '$86K-$139K ' '$63K-$105K ' 'Not Available'
 '$109K-$177K ' '$63K-$110K ' '$75K-$124K ' '$34K-$61K ' '$72K-$120K '
 '$93K-$149K ' '$85K-$140K ' '$77K-$135K ' '$82K-$132K ' '$83K-$137K '
 '$115K-$180K ' '$74K-$138K ' '$64K-$112K ' '$68K-$129K ' '$52K-$113K '
 '$110K-$150K' 'Employer Provided Salary:$150K-$160K' '$158K-$211K '
 '$20K-$39K ' '$56K-$117K ' '$63K-$99K ' '$68K-$114K ' '$41K-$95K '
 '$86K-$144K ' '$80K-$139K ' '$56K-$95K ' '$120K-$189K ' '$111K-$176K '
 '$84K-$146K ' '$107K-$172K ' '$49K-$85K ' '$61K-$109K ' '$88K-$148K '
 '$60K-$99K ' '$41K-$72K ' '$96K-$161K ' '$65K-$130K ' '$52K-$81K '
 '$139K-$220K ' '$50K-$102K ' '$85K-$139K ' '$74K-$122K ' '$99K-$157K '
 '$79K-$2

In [101]:
# Remove the 'K' and "$"
remove_kd = salary.apply(lambda x: x.replace('K','').replace('$',''))

In [102]:
remove_kd.head()

0     53-91 
1    63-112 
2     80-90 
3     56-97 
4    86-143 
Name: Salary Estimate, dtype: object

In [103]:
# Before cleaning the column 'Salary Estimate', let's make the two new columns based on 'per hour'
# and 'Employer Provided Salary'
jobs['Hourly'] = jobs['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)

In [104]:
# Let's do the same for 'Employer Provided Salary' (the search text must be lowercase because x.lower() is)
jobs['Employer_provided'] = jobs['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0)

In [105]:
# Lets take a look
jobs.head()

,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Hourly,Employer_provided
0,Data Scientist,$53K-$91K (Glassdoor est.),"Data Scientist\nLocation: Albuquerque, NM\nEdu...",3.8,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,$50 to $100 million (USD),-1,0,0
1,Healthcare Data Scientist,$63K-$112K (Glassdoor est.),What You Will Do:\n\nI. General Summary\n\nThe...,3.4,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,0,0
2,Data Scientist,$80K-$90K (Glassdoor est.),"KnowBe4, Inc. is a high growth information sec...",4.8,KnowBe4\n4.8,"Clearwater, FL","Clearwater, FL",501 to 1000 employees,2010,Company - Private,Security Services,Business Services,$100 to $500 million (USD),-1,0,0
3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8,PNNL\n3.8,"Richland, WA","Richland, WA",1001 to 5000 employees,1965,Government,Energy,"Oil, Gas, Energy & Utilities",$500 million to $1 billion (USD),"Oak Ridge National Laboratory, National Renewa...",0,0
4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",0,0


In [106]:
# Now we have to remove the 'employer provided salary:' and 'per hour' text
salary_range = remove_kd.apply(lambda x: x.lower().replace('per hour','').replace('employer provided salary:',''))

In [107]:
salary_range

0       53-91 
1      63-112 
2       80-90 
3       56-97 
4      86-143 
        ...   
950    58-111 
951    72-133 
952     56-91 
953    95-160 
955    61-126 
Name: Salary Estimate, Length: 742, dtype: object

In [108]:
# Now we can split the minimum and maximum salary (hourly rows are already flagged in the 'Hourly' column)
jobs['min_salary'] = salary_range.apply(lambda x: int(x.split('-')[0]))
jobs['max_salary'] = salary_range.apply(lambda x: int(x.split('-')[1]))

In [109]:
jobs.head()

,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Hourly,Employer_provided,min_salary,max_salary
0,Data Scientist,$53K-$91K (Glassdoor est.),"Data Scientist\nLocation: Albuquerque, NM\nEdu...",3.8,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,$50 to $100 million (USD),-1,0,0,53,91
1,Healthcare Data Scientist,$63K-$112K (Glassdoor est.),What You Will Do:\n\nI. General Summary\n\nThe...,3.4,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,0,0,63,112
2,Data Scientist,$80K-$90K (Glassdoor est.),"KnowBe4, Inc. is a high growth information sec...",4.8,KnowBe4\n4.8,"Clearwater, FL","Clearwater, FL",501 to 1000 employees,2010,Company - Private,Security Services,Business Services,$100 to $500 million (USD),-1,0,0,80,90
3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8,PNNL\n3.8,"Richland, WA","Richland, WA",1001 to 5000 employees,1965,Government,Energy,"Oil, Gas, Energy & Utilities",$500 million to $1 billion (USD),"Oak Ridge National Laboratory, National Renewa...",0,0,56,97
4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",0,0,86,143


In [110]:
# Let's add an extra column for the average salary
jobs['Avg_salary'] = (jobs.min_salary+jobs.max_salary)/2
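
One thing to keep in mind: hourly rows such as `$21-$34 Per Hour` are in dollars per hour, while the other rows are in thousands of dollars per year, so their averages are not on the same scale. A minimal sketch of one way to put them on the same scale follows. The 2,000 hours per year figure is my assumption (roughly 40 hours a week for 50 weeks), and `annualize_thousands` and `normalized_average` are hypothetical helpers.

```python
HOURS_PER_YEAR = 2000  # assumption: about 40 hours/week for 50 weeks


def annualize_thousands(hourly_rate: float, hours_per_year: int = HOURS_PER_YEAR) -> float:
    """Convert an hourly rate in dollars to annual pay in thousands of dollars."""
    return hourly_rate * hours_per_year / 1000


def normalized_average(low: float, high: float, hourly: bool) -> float:
    """Average of the range, in thousands per year for both hourly and annual rows."""
    if hourly:
        low, high = annualize_thousands(low), annualize_thousands(high)
    return (low + high) / 2


print(normalized_average(53, 91, hourly=False))  # annual row, already in thousands
print(normalized_average(21, 34, hourly=True))   # hourly row, converted first
```

Whether to convert, or simply filter on the `Hourly` flag in Tableau, is a design choice that depends on how the salaries will be compared.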

In [118]:
# Company Name holds the name followed by a newline and the rating in its last 3 characters (e.g. 'KnowBe4\n4.8').
# Rows with Rating < 0 don't have the rating in the name. Slicing off 3 characters leaves the trailing '\n' in place.
jobs['Company_name'] = jobs.apply(lambda x: x['Company Name'] if x['Rating'] <0 else x['Company Name'][:-3], axis=1)
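
An alternative that does not depend on the rating being exactly three characters long is to keep only the text before the first newline. This is a sketch under the assumption that the rating always follows a newline, as in the rows shown so far. `clean_company_name` is a hypothetical helper.

```python
def clean_company_name(raw: str) -> str:
    """Keep only the text before the first newline, with surrounding spaces removed."""
    return raw.split("\n", 1)[0].strip()


print(repr(clean_company_name("Tecolote Research\n3.8")))
print(repr(clean_company_name("Company Without Rating")))
```

It could be applied with `jobs['Company Name'].apply(clean_company_name)`, and names without a rating pass through unchanged.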

In [119]:
jobs.head()

,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Hourly,Employer_provided,min_salary,max_salary,Avg_salary,Company_name
0,Data Scientist,$53K-$91K (Glassdoor est.),"Data Scientist\nLocation: Albuquerque, NM\nEdu...",3.8,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,$50 to $100 million (USD),-1,0,0,53,91,72.0,Tecolote Research\n
1,Healthcare Data Scientist,$63K-$112K (Glassdoor est.),What You Will Do:\n\nI. General Summary\n\nThe...,3.4,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,0,0,63,112,87.5,University of Maryland Medical System\n
2,Data Scientist,$80K-$90K (Glassdoor est.),"KnowBe4, Inc. is a high growth information sec...",4.8,KnowBe4\n4.8,"Clearwater, FL","Clearwater, FL",501 to 1000 employees,2010,Company - Private,Security Services,Business Services,$100 to $500 million (USD),-1,0,0,80,90,85.0,KnowBe4\n
3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8,PNNL\n3.8,"Richland, WA","Richland, WA",1001 to 5000 employees,1965,Government,Energy,"Oil, Gas, Energy & Utilities",$500 million to $1 billion (USD),"Oak Ridge National Laboratory, National Renewa...",0,0,56,97,76.5,PNNL\n
4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",0,0,86,143,114.5,Affinity Solutions\n


In [122]:
# Let's extract the state from the location
jobs['Job_state'] = jobs['Location'].apply(lambda x: x.split(',')[1])

In [137]:
jobs.Job_state.value_counts()

 CA             151
 MA             103
 NY              72
 VA              41
 IL              40
 MD              35
 PA              33
 TX              28
 NC              21
 WA              21
 NJ              17
 FL              16
 OH              14
 TN              13
 DC              11
 CO              11
 UT              10
 IN              10
 WI              10
 MO               9
 AZ               9
 AL               8
 GA               6
 KY               6
 MI               6
 DE               6
 CT               5
 IA               5
 LA               4
 NE               4
 OR               4
 NM               3
 KS               3
 MN               2
 ID               2
 SC               1
 Los Angeles      1
 RI               1
Name: Job_state, dtype: int64

In [139]:
# Job_state needs a fix: strip the leading space and replace 'Los Angeles' with 'CA'
jobs['Job_state']= jobs.Job_state.apply(lambda x: x.strip() if x.strip().lower() != 'los angeles' else 'CA')
jobs.Job_state.value_counts()

CA    152
MA    103
NY     72
VA     41
IL     40
MD     35
PA     33
TX     28
WA     21
NC     21
NJ     17
FL     16
OH     14
TN     13
CO     11
DC     11
IN     10
UT     10
WI     10
AZ      9
MO      9
AL      8
DE      6
GA      6
KY      6
MI      6
IA      5
CT      5
NE      4
LA      4
OR      4
KS      3
NM      3
MN      2
ID      2
RI      1
SC      1
Name: Job_state, dtype: int64
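
The 'Los Angeles' value probably comes from a location with more than one comma. As a sketch, taking the last comma-separated part and trimming it handles that case in one step, with a small lookup as a fallback. The example location strings and the `CITY_TO_STATE` table are assumptions for illustration, not values confirmed from the file.

```python
# Hypothetical fallback for locations that end in a city instead of a state code.
CITY_TO_STATE = {"los angeles": "CA"}


def extract_state(location: str) -> str:
    """Return the trimmed last comma-separated part, mapped through the fallback table."""
    parts = [part.strip() for part in location.split(",")]
    candidate = parts[-1]
    return CITY_TO_STATE.get(candidate.lower(), candidate)


print(extract_state("Albuquerque, NM"))
print(extract_state("Santa Fe Springs, Los Angeles, CA"))
```

Unlike `split(',')[1]`, this does not fail on a location without a comma. It returns the whole string, which then shows up in `value_counts()` for review.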

In [128]:
# Let's calculate how many years the company has been in business as of 2022 (Founded values below 1 are kept as they are)
jobs['Years_market'] = jobs.Founded.apply(lambda x: x if x< 1 else 2022-x)
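
A small variation, sketched below, takes the reference year as a parameter instead of hard-coding it and returns `None` for unknown founding years, so a placeholder like -1 is not mixed in with real ages. `years_in_business` is a hypothetical helper, and treating values below 1 as "unknown" is an assumption based on the check in the cell above.

```python
from datetime import date
from typing import Optional


def years_in_business(founded: int, as_of_year: Optional[int] = None) -> Optional[int]:
    """Years since founding, or None when the founding year is a placeholder (< 1)."""
    if as_of_year is None:
        as_of_year = date.today().year  # default to the current year
    if founded < 1:
        return None
    return as_of_year - founded


print(years_in_business(1973, as_of_year=2022))
print(years_in_business(-1, as_of_year=2022))
```

Passing `as_of_year=2022` reproduces the values computed in the cell above for known founding years.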

In [135]:
# Add new columns that flag the most wanted skills in the job description
# Note: 'sparks' only matches that exact text; 'spark' is probably the term needed to catch Apache Spark
jobs['Python'] = jobs['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
jobs['Sparks'] = jobs['Job Description'].apply(lambda x: 1 if 'sparks' in x.lower() else 0)
jobs['Excel'] = jobs['Job Description'].apply(lambda x: 1 if 'excel' in x.lower() else 0)
jobs['Tableau'] = jobs['Job Description'].apply(lambda x: 1 if 'tableau' in x.lower() else 0)
jobs['Power_BI'] = jobs['Job Description'].apply(lambda x: 1 if 'power bi' in x.lower() else 0)
jobs['SQL'] = jobs['Job Description'].apply(lambda x: 1 if 'sql' in x.lower() else 0)
jobs['ML'] = jobs['Job Description'].apply(lambda x: 1 if 'machine learning' in x.lower() else 0)
jobs['Statistics'] = jobs['Job Description'].apply(lambda x: 1 if 'statistics' in x.lower() else 0)
jobs['AWS'] = jobs['Job Description'].apply(lambda x: 1 if 'aws' in x.lower() else 0)
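
Plain substring checks can over-count: 'excel' also matches "excellent" and 'aws' also matches "laws". The sketch below uses word boundaries to avoid that. It is an illustration only, with a shortened, hypothetical `SKILL_PATTERNS` table and a hypothetical `skill_flags` helper.

```python
import re

# Hypothetical skill table; \b marks a word boundary so only whole words match.
SKILL_PATTERNS = {
    "Python": r"\bpython\b",
    "Spark": r"\bspark\b",
    "Excel": r"\bexcel\b",
    "SQL": r"\bsql\b",
    "AWS": r"\baws\b",
}
COMPILED = {name: re.compile(pattern, re.IGNORECASE) for name, pattern in SKILL_PATTERNS.items()}


def skill_flags(description: str) -> dict:
    """Return a 1/0 flag per skill for one job description."""
    return {name: int(bool(regex.search(description))) for name, regex in COMPILED.items()}


print(skill_flags("Excellent Python skills; knowledge of privacy laws."))
print(skill_flags("Experience with SQL, Spark and MS Excel."))
```

The trade-off is that whole-word matching misses compound names such as "MySQL" or "PySpark", so the patterns may need extra alternatives depending on what should count.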


In [140]:
jobs.head()

,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,...,Years_market,Python,Sparks,Excel,Tableau,Power_BI,SQL,ML,Statistics,AWS
0,Data Scientist,$53K-$91K (Glassdoor est.),"Data Scientist\nLocation: Albuquerque, NM\nEdu...",3.8,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,...,49,1,0,1,1,1,0,1,0,0
1,Healthcare Data Scientist,$63K-$112K (Glassdoor est.),What You Will Do:\n\nI. General Summary\n\nThe...,3.4,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,...,38,1,0,0,0,0,0,1,1,0
2,Data Scientist,$80K-$90K (Glassdoor est.),"KnowBe4, Inc. is a high growth information sec...",4.8,KnowBe4\n4.8,"Clearwater, FL","Clearwater, FL",501 to 1000 employees,2010,Company - Private,...,12,1,0,1,0,0,1,1,1,0
3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8,PNNL\n3.8,"Richland, WA","Richland, WA",1001 to 5000 employees,1965,Government,...,57,1,0,0,0,0,0,1,1,0
4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,...,24,1,0,1,0,0,1,1,1,0


In [142]:
jobs.columns

Index(['Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors',
       'Hourly', 'Employer_provided', 'min_salary', 'max_salary', 'Avg_salary',
       'Company_name', 'Job_state', 'Years_market', 'Python', 'Sparks',
       'Excel', 'Tableau', 'Power_BI', 'SQL', 'ML', 'Statistics', 'AWS'],
      dtype='object')

In [144]:
jobs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 742 entries, 0 to 955
Data columns (total 31 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          742 non-null    object 
 1   Salary Estimate    742 non-null    object 
 2   Job Description    742 non-null    object 
 3   Rating             742 non-null    float64
 4   Company Name       742 non-null    object 
 5   Location           742 non-null    object 
 6   Headquarters       742 non-null    object 
 7   Size               742 non-null    object 
 8   Founded            742 non-null    int64  
 9   Type of ownership  742 non-null    object 
 10  Industry           742 non-null    object 
 11  Sector             742 non-null    object 
 12  Revenue            742 non-null    object 
 13  Competitors        742 non-null    object 
 14  Hourly             742 non-null    int64  
 15  Employer_provided  742 non-null    int64  
 16  min_salary         742 non

In [143]:
# Let's export the cleaned file to explore the data and get some insights in Tableau
jobs.to_csv('glassdoor_cleaned.csv',index = False)