# Analyzing simplyhired.com scraped data
## Introduction

I [collected data from simplyhired.com](https://github.com/Derrick-Mulwa/Web-Scraping-SimplyHired "The repository where I scraped the data") on jobs advertised for Data Analysis/Business Analysis positions in the US. The data is stored in the Collected data.csv file in this repository. I will use it to draw insights about the job postings.

### These are the twelve potential hypotheses I will investigate with this dataset:
* Companies with higher ratings tend to offer higher salaries than those with lower ratings.
* Jobs in larger cities tend to have higher salaries than those in smaller cities.
* Full-time jobs tend to have higher salaries than part-time jobs.
* The most common job qualifications for high-paying jobs are related to specific technical skills.
* The most common job benefits for high-paying jobs include healthcare and retirement plans.
* Salaries in certain states tend to be higher than in others.
* Jobs that pay annually tend to pay more than jobs that pay monthly or hourly.
* Senior analyst roles tend to have higher salaries than junior analyst roles.
* Senior analyst jobs tend to have more job benefits than junior analyst jobs.
* Senior analyst roles require more qualifications than junior analyst jobs.
* Lower-rated companies are more likely to hire junior/entry-level analysts than higher-rated companies.
* There is a positive correlation between the number of job benefits offered and the company rating.

In [1]:
# Import modules to be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Understanding the data

In [2]:
# read in the data
df = pd.read_csv("Collected data.csv", index_col = 0)
df.head(5)

Unnamed: 0,job_title,company_name,rating,company_location(city),company_location_state,salary_type,payment_cycle,Salary Range From,Salary range To,currency,job_type,job_benefits,qualifications,job_description
0,"Senior Business Analyst, Contract Data",Stericycle,3.1,New Jersey,New Jersey,,,,,,Full-time | Contract,"Prescription drug insurance, Dental insurance,...",Bachelor's degree,"About Us:\nAt Stericycle, we deliver solutions..."
1,Digital Commerce Business Analyst,Sealed Air,3.5,Charlotte,NC,Estimated,Annually,65000.0,85000.0,$,,,"Change management, Project management, Analysi...",Sealed Air designs and delivers packaging solu...
2,"Analyst, Sales Force Deploy-SR",Quest Diagnostics,3.6,Collegeville,PA,Estimated,Annually,91000.0,120000.0,$,Full-time,,"Microsoft Access, Laboratory experience, Micro...",Overview: Provide analysis and insight into sa...
3,Senior Business Analyst,eSales Technologies,,West Babylon,NY,Estimated,Annually,89000.0,120000.0,$,,,"Analysis skills, Microsoft Excel, Business ana...",We are hiring a Business Analyst to join our p...
4,Business Analyst,Saama Technologies Inc,3.5,Bridgewater,NJ,Estimated,Annually,110000.0,140000.0,$,,,"Analysis skills, Communication skills, Informa...",Does solving complex business problems and rea...


## The dataframe has the following columns:
* __job_title__: Contains the name of the position being advertised
* __company_name__: Contains the name of the company
* __rating__: Contains the rating of the company (0-5 stars)
* __company_location(city)__: Contains the city in which the company is located
* __company_location_state__: Contains the state where the company/organisation is located
* __salary_type__: Defines whether the salary stated is estimated by simplyhired.com or explicitly defined by the employer
* __payment_cycle__: Defines the payment period (monthly/hourly/annually)
* __Salary Range From__: Lower bound of the salary to be paid
* __Salary range To__: Upper bound of the salary to be paid
* __currency__: Currency of the payments, e.g. dollars, pounds or euros
* __job_type__: Defines the type of employment (e.g. full-time, part-time, contract)
* __job_benefits__: Contains a list of job benefits offered for that position by the company
* __qualifications__: Contains a list of job qualifications/skills required for that position by the company
* __job_description__: Contains a thorough description of the job
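
Because `job_benefits` and `qualifications` each hold a list packed into a single string, several of the hypotheses (for example, the ones about the number of benefits) need that string split into items first. The following is only a minimal sketch on made-up rows: it assumes the items are separated by `", "`, as the preview above suggests, and the `n_benefits` column name is hypothetical.

```python
import pandas as pd

# Toy rows shaped like the preview above; the values are illustrative only.
toy = pd.DataFrame({
    "job_benefits": [
        "Health insurance, Dental insurance, 401(k)",
        None,                      # postings without listed benefits are missing
        "Paid time off",
    ],
})

# Split on ", " (assumed separator). Missing values stay missing after the split.
benefit_lists = toy["job_benefits"].str.split(", ")

# Count the items per posting; treat a missing list as zero benefits.
toy["n_benefits"] = benefit_lists.apply(
    lambda items: len(items) if isinstance(items, list) else 0
)

print(toy)
```

Whether a missing value really means "no benefits" or just "not scraped" is a judgment call, so the zero used here is an assumption rather than a fact about the data.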

In [3]:
df.shape

(5849, 14)

The dataframe has 5849 records and 14 columns.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5849 entries, 0 to 5884
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   job_title               5849 non-null   object 
 1   company_name            5844 non-null   object 
 2   rating                  4488 non-null   float64
 3   company_location(city)  5849 non-null   object 
 4   company_location_state  5849 non-null   object 
 5   salary_type             5012 non-null   object 
 6   payment_cycle           5012 non-null   object 
 7   Salary Range From       5012 non-null   object 
 8   Salary range To         5012 non-null   object 
 9   currency                5012 non-null   object 
 10  job_type                4443 non-null   object 
 11  job_benefits            3008 non-null   object 
 12  qualifications          5801 non-null   object 
 13  job_description         5849 non-null   object 
dtypes: float64(1), object(13)
memory usage: 

## Data Cleaning

The "Salary Range From" and "Salary range To" columns are stored as `object` (string) values, so they should be converted to the float data type before any numeric manipulation.

In [5]:
df["Salary Range From"]

0           NaN
1        65,000
2        91,000
3        89,000
4       110,000
         ...   
5877        NaN
5878     63,100
5879        NaN
5880     70,000
5884         35
Name: Salary Range From, Length: 5849, dtype: object

In [6]:
# Remove the commas (thousands separators) from the salary strings.
# Assigning the result back avoids relying on an in-place change to a column selection.
df["Salary Range From"] = df["Salary Range From"].replace(",", "", regex=True)
df["Salary range To"] = df["Salary range To"].replace(",", "", regex=True)

# Convert both columns to float
df = df.astype({
    "Salary Range From": float,
    "Salary range To": float
})
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5849 entries, 0 to 5884
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   job_title               5849 non-null   object 
 1   company_name            5844 non-null   object 
 2   rating                  4488 non-null   float64
 3   company_location(city)  5849 non-null   object 
 4   company_location_state  5849 non-null   object 
 5   salary_type             5012 non-null   object 
 6   payment_cycle           5012 non-null   object 
 7   Salary Range From       5012 non-null   float64
 8   Salary range To         5012 non-null   float64
 9   currency                5012 non-null   object 
 10  job_type                4443 non-null   object 
 11  job_benefits            3008 non-null   object 
 12  qualifications          5801 non-null   object 
 13  job_description         5849 non-null   object 
dtypes: float64(3), object(11)
memory usage: 
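
As an alternative to `replace` followed by `astype`, the same cleaning can be sketched with `pd.to_numeric`. This is not what the notebook ran; it is a small illustration on made-up values, and `errors="coerce"` (which turns anything unparseable into `NaN` instead of raising) is an assumption about how stray values should be handled.

```python
import pandas as pd

# Made-up salary strings in the same shape as the column above.
raw = pd.Series(["65,000", None, "35", "110,000"],
                dtype=object, name="Salary Range From")

# Strip the thousands separator, then parse; unparseable entries become NaN.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(cleaned)
```

The trade-off is that `astype(float)` fails loudly on an unexpected string, while `errors="coerce"` silently produces a missing value, so it is worth comparing the non-null counts before and after.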

### Check for duplicates

In [7]:
df[df.duplicated(subset=["job_title","company_name", "company_location(city)"], keep=False)].sort_values(by="job_title").count()

job_title                 0
company_name              0
rating                    0
company_location(city)    0
company_location_state    0
salary_type               0
payment_cycle             0
Salary Range From         0
Salary range To           0
currency                  0
job_type                  0
job_benefits              0
qualifications            0
job_description           0
dtype: int64

No two records share the same job title, company name and city, so the dataframe has no duplicated records.
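
For reference, here is a minimal sketch of how the same check behaves when duplicates do exist, and how they could be dropped. The rows and the company name "Acme" are made up for illustration; only the three key columns come from the check above.

```python
import pandas as pd

# Made-up postings: the first two rows share title, company and city.
toy = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Analyst", "Business Analyst"],
    "company_name": ["Acme", "Acme", "Acme"],
    "company_location(city)": ["Austin", "Austin", "Austin"],
})

key = ["job_title", "company_name", "company_location(city)"]

# keep=False flags every member of a duplicated group, not just the repeats.
n_dupes = int(toy.duplicated(subset=key, keep=False).sum())

# Keep the first posting of each group and drop the rest.
deduped = toy.drop_duplicates(subset=key, keep="first")

print(n_dupes, len(deduped))
```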

In [8]:
df.head()

Unnamed: 0,job_title,company_name,rating,company_location(city),company_location_state,salary_type,payment_cycle,Salary Range From,Salary range To,currency,job_type,job_benefits,qualifications,job_description
0,"Senior Business Analyst, Contract Data",Stericycle,3.1,New Jersey,New Jersey,,,,,,Full-time | Contract,"Prescription drug insurance, Dental insurance,...",Bachelor's degree,"About Us:\nAt Stericycle, we deliver solutions..."
1,Digital Commerce Business Analyst,Sealed Air,3.5,Charlotte,NC,Estimated,Annually,65000.0,85000.0,$,,,"Change management, Project management, Analysi...",Sealed Air designs and delivers packaging solu...
2,"Analyst, Sales Force Deploy-SR",Quest Diagnostics,3.6,Collegeville,PA,Estimated,Annually,91000.0,120000.0,$,Full-time,,"Microsoft Access, Laboratory experience, Micro...",Overview: Provide analysis and insight into sa...
3,Senior Business Analyst,eSales Technologies,,West Babylon,NY,Estimated,Annually,89000.0,120000.0,$,,,"Analysis skills, Microsoft Excel, Business ana...",We are hiring a Business Analyst to join our p...
4,Business Analyst,Saama Technologies Inc,3.5,Bridgewater,NJ,Estimated,Annually,110000.0,140000.0,$,,,"Analysis skills, Communication skills, Informa...",Does solving complex business problems and rea...


I need a single salary value for each posting. I will add a column holding the median of "Salary Range From" and "Salary range To", which for two values is simply their midpoint.

In [9]:
# Store the "Salary Range From" and "Salary range To" columns in lists

salary_range_from = list(df["Salary Range From"])
salary_range_to = list(df["Salary range To"])

# Get the median value for each record and store in "salary" list
salary = []
for i in range(len(salary_range_from)):
    salary.append(pd.Series([salary_range_from[i], salary_range_to[i]]).median())
    
df.insert((df.columns.get_loc("Salary range To")+1), "Median_salary", salary)

In [10]:
df.head()

Unnamed: 0,job_title,company_name,rating,company_location(city),company_location_state,salary_type,payment_cycle,Salary Range From,Salary range To,Median_salary,currency,job_type,job_benefits,qualifications,job_description
0,"Senior Business Analyst, Contract Data",Stericycle,3.1,New Jersey,New Jersey,,,,,,,Full-time | Contract,"Prescription drug insurance, Dental insurance,...",Bachelor's degree,"About Us:\nAt Stericycle, we deliver solutions..."
1,Digital Commerce Business Analyst,Sealed Air,3.5,Charlotte,NC,Estimated,Annually,65000.0,85000.0,75000.0,$,,,"Change management, Project management, Analysi...",Sealed Air designs and delivers packaging solu...
2,"Analyst, Sales Force Deploy-SR",Quest Diagnostics,3.6,Collegeville,PA,Estimated,Annually,91000.0,120000.0,105500.0,$,Full-time,,"Microsoft Access, Laboratory experience, Micro...",Overview: Provide analysis and insight into sa...
3,Senior Business Analyst,eSales Technologies,,West Babylon,NY,Estimated,Annually,89000.0,120000.0,104500.0,$,,,"Analysis skills, Microsoft Excel, Business ana...",We are hiring a Business Analyst to join our p...
4,Business Analyst,Saama Technologies Inc,3.5,Bridgewater,NJ,Estimated,Annually,110000.0,140000.0,125000.0,$,,,"Analysis skills, Communication skills, Informa...",Does solving complex business problems and rea...


The column was successfully added next to the "Salary range To" column.
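
The per-row loop in cell 9 can also be written as a single vectorised call. This is a sketch on made-up numbers rather than the code that produced the output above. Like `pd.Series.median`, `DataFrame.median(axis=1)` skips missing values by default, so a row with only one bound keeps that bound and a row with neither stays `NaN`.

```python
import pandas as pd

# Made-up salary bounds, including a row with no salary and a row with one bound.
toy = pd.DataFrame({
    "Salary Range From": [65000.0, 91000.0, None, 35.0],
    "Salary range To": [85000.0, 120000.0, None, None],
})

# Row-wise median of the two bounds (the midpoint when both are present).
toy["Median_salary"] = toy[["Salary Range From", "Salary range To"]].median(axis=1)

print(toy)
```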

### Drop less important columns

In [11]:
df.currency.value_counts()

$    5012
Name: currency, dtype: int64

The currency column only contains the value $, so it carries no information and should be dropped.

The job_description column contains the roles and responsibilities of the applicant, as well as the application process. It is not used in this analysis, so it should be dropped as well.
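
Rather than inspecting columns one at a time, constant columns can be found programmatically. The sketch below uses a made-up two-column frame, and the `constant_cols` name is hypothetical. It flags any column with at most one distinct non-missing value.

```python
import pandas as pd

# Made-up frame: "currency" is constant, "payment_cycle" is not.
toy = pd.DataFrame({
    "currency": ["$", "$", None],
    "payment_cycle": ["Annually", "Hourly", None],
})

# Number of distinct non-missing values per column.
n_unique = toy.nunique(dropna=True)

# Columns with zero or one distinct value carry no information for comparisons.
constant_cols = n_unique[n_unique <= 1].index.tolist()

print(constant_cols)
```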

In [12]:
df.drop(["currency", "job_description"], axis=1, inplace = True)

In [13]:
df

Unnamed: 0,job_title,company_name,rating,company_location(city),company_location_state,salary_type,payment_cycle,Salary Range From,Salary range To,Median_salary,job_type,job_benefits,qualifications
0,"Senior Business Analyst, Contract Data",Stericycle,3.1,New Jersey,New Jersey,,,,,,Full-time | Contract,"Prescription drug insurance, Dental insurance,...",Bachelor's degree
1,Digital Commerce Business Analyst,Sealed Air,3.5,Charlotte,NC,Estimated,Annually,65000.0,85000.0,75000.0,,,"Change management, Project management, Analysi..."
2,"Analyst, Sales Force Deploy-SR",Quest Diagnostics,3.6,Collegeville,PA,Estimated,Annually,91000.0,120000.0,105500.0,Full-time,,"Microsoft Access, Laboratory experience, Micro..."
3,Senior Business Analyst,eSales Technologies,,West Babylon,NY,Estimated,Annually,89000.0,120000.0,104500.0,,,"Analysis skills, Microsoft Excel, Business ana..."
4,Business Analyst,Saama Technologies Inc,3.5,Bridgewater,NJ,Estimated,Annually,110000.0,140000.0,125000.0,,,"Analysis skills, Communication skills, Informa..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5877,Fraud Recovery Analyst,"JPMorgan Chase Bank, N.A.",3.8,Indianapolis,IN,,,,,,Full-time,"Health insurance, Tuition reimbursement, Retir...","Analysis skills, Microsoft Outlook, Fraud, Lea..."
5878,Actuarial Data Analyst,Lincoln Financial,3.5,Omaha,NE,Explicitly Defined,Annually,63100.0,137900.0,100500.0,,"Health insurance, Employee assistance program,...","Analysis skills, SQL, SAS, Pricing, Communicat..."
5879,Business Analyst Associate I,"JPMorgan Chase Bank, N.A.",3.8,Fort Worth,TX,,,,,,Full-time,"Health insurance, Tuition reimbursement, Retir...","Microsoft Access, Customer service, Microsoft ..."
5880,Data Analyst,"Bainbridge, Inc.",4.1,Remote,Remote,Estimated,Annually,70000.0,93000.0,81500.0,Full-time,"Dental insurance, Health insurance, Vision ins...","Microsoft Excel, Financial services, Python, B..."


In [14]:
df.company_location_state.value_counts().sort_index().head(20)

AK               4
AL              40
AR              23
AZ              98
Alabama          1
Arizona          3
Arkansas         5
CA             522
CO              87
CT              60
California      12
Connecticut      2
DC             131
DE              35
FL             298
Florida         10
GA             197
GU               1
Georgia          2
HI              15
Name: company_location_state, dtype: int64

Some states appear under their full names while others are abbreviated, so records that refer to the same state are counted as different categories. We need to abbreviate all state names to keep the column consistent.

In [15]:
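# Read a table with the US states and their abbreviations from a website

url = "https://www.bls.gov/respondents/mwr/electronic-data-interchange/appendix-d-usps-state-abbreviations-and-fips-codes.htm"
states = pd.read_html(url)[0]
# Use the first row of the parsed table as the column names
states.columns = list(states.loc[0].values)
# Stack columns 0-1 and columns 3-4 into a single two-column table
states = pd.concat([states.iloc[:, :2], states.iloc[:, 3:5]])
# Index by state name so abbreviations can be looked up with .loc
states.set_index("State", inplace = True)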

In [16]:
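# Create a function that returns the abbreviation for a full state name

def state_name(x):
    try:
        # Some entries carry a " State" suffix; strip it before the lookup
        x = x.replace(" State", "")
        return states.loc[x].values[0]
    except (KeyError, AttributeError):
        # Not found in the table (already abbreviated, "Remote", ...) or not a string
        return x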

In [17]:
# Apply the function to the column

df["company_location_state"] = df["company_location_state"].apply(state_name)
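
The lookup above depends on a remote page being reachable and keeping its layout. As a hedged alternative, the same normalisation can be sketched offline with a plain dictionary. The mapping below is hand-written and deliberately incomplete, and the names `STATE_ABBREVIATIONS` and `abbreviate_state` are hypothetical, not part of the notebook.

```python
import pandas as pd

# Hand-written, deliberately incomplete mapping (an assumption for illustration).
STATE_ABBREVIATIONS = {
    "Alabama": "AL",
    "Arizona": "AZ",
    "California": "CA",
    "New Jersey": "NJ",
    "New York": "NY",
}

def abbreviate_state(name):
    """Return the abbreviation for a full state name, else the cleaned input."""
    if not isinstance(name, str):
        return name  # leave missing values untouched
    cleaned = name.replace(" State", "").strip()
    # Values that are already abbreviated (or "Remote") are not in the mapping.
    return STATE_ABBREVIATIONS.get(cleaned, cleaned)

toy = pd.Series(["New Jersey", "NC", "New York State", "Remote"])
normalised = toy.apply(abbreviate_state)

print(normalised.tolist())
```

A complete version would need all states and territories in the dictionary; the benefit is that the notebook then reruns without network access.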

In [18]:
df

Unnamed: 0,job_title,company_name,rating,company_location(city),company_location_state,salary_type,payment_cycle,Salary Range From,Salary range To,Median_salary,job_type,job_benefits,qualifications
0,"Senior Business Analyst, Contract Data",Stericycle,3.1,New Jersey,NJ,,,,,,Full-time | Contract,"Prescription drug insurance, Dental insurance,...",Bachelor's degree
1,Digital Commerce Business Analyst,Sealed Air,3.5,Charlotte,NC,Estimated,Annually,65000.0,85000.0,75000.0,,,"Change management, Project management, Analysi..."
2,"Analyst, Sales Force Deploy-SR",Quest Diagnostics,3.6,Collegeville,PA,Estimated,Annually,91000.0,120000.0,105500.0,Full-time,,"Microsoft Access, Laboratory experience, Micro..."
3,Senior Business Analyst,eSales Technologies,,West Babylon,NY,Estimated,Annually,89000.0,120000.0,104500.0,,,"Analysis skills, Microsoft Excel, Business ana..."
4,Business Analyst,Saama Technologies Inc,3.5,Bridgewater,NJ,Estimated,Annually,110000.0,140000.0,125000.0,,,"Analysis skills, Communication skills, Informa..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5877,Fraud Recovery Analyst,"JPMorgan Chase Bank, N.A.",3.8,Indianapolis,IN,,,,,,Full-time,"Health insurance, Tuition reimbursement, Retir...","Analysis skills, Microsoft Outlook, Fraud, Lea..."
5878,Actuarial Data Analyst,Lincoln Financial,3.5,Omaha,NE,Explicitly Defined,Annually,63100.0,137900.0,100500.0,,"Health insurance, Employee assistance program,...","Analysis skills, SQL, SAS, Pricing, Communicat..."
5879,Business Analyst Associate I,"JPMorgan Chase Bank, N.A.",3.8,Fort Worth,TX,,,,,,Full-time,"Health insurance, Tuition reimbursement, Retir...","Microsoft Access, Customer service, Microsoft ..."
5880,Data Analyst,"Bainbridge, Inc.",4.1,Remote,Remote,Estimated,Annually,70000.0,93000.0,81500.0,Full-time,"Dental insurance, Health insurance, Vision ins...","Microsoft Excel, Financial services, Python, B..."


In [19]:
df.payment_cycle.value_counts(dropna=False)

Annually       4319
NaN             837
Hourly          606
Monthly          81
Not Defined       6
Name: payment_cycle, dtype: int64

In [20]:
df["company_location(city)"].value_counts().head(30)

Remote           497
New York         175
Atlanta          135
Washington       131
Chicago          113
Houston           89
United States     81
Austin            79
Dallas            76
Boston            72
Philadelphia      57
Nashville         56
Columbus          55
Los Angeles       51
Tampa             50
Charlotte         49
Miami             46
Seattle           45
Richmond          45
San Diego         43
Phoenix           42
Jersey City       40
San Francisco     39
Minneapolis       39
Denver            39
Arlington         38
Plano             35
Pittsburgh        32
Indianapolis      32
Portland          31
Name: company_location(city), dtype: int64

In [21]:
df.job_title.value_counts().head(30)

Business Analyst                           427
Data Analyst                               341
Senior Business Analyst                     90
Business Intelligence Analyst               77
Senior Data Analyst                         61
IT Business Analyst                         46
Board Certified Behavior Analyst (BCBA)     40
Sr. Business Analyst                        31
Salesforce Business Analyst                 30
Technical Business Analyst                  27
Business Data Analyst                       26
Financial Analyst                           19
Business Analyst I                          19
Data Analyst II                             17
Business Analyst II                         16
Senior Financial Analyst                    15
Senior Business Intelligence Analyst        15
Clinical Data Analyst                       15
Sr Business Analyst                         15
Junior Business Analyst                     14
Sr. Data Analyst                            14
Data Speciali

The dataframe is now ready for analysis. Let's test the hypotheses.

# Hypothesis testing
## Companies with higher ratings tend to offer higher salaries than those with lower ratings.