## Project Description 

### AIM

## Introduction to Data Science Job

Tasks:
1. Basic data cleaning and feature exploration
2. Exploratory data analysis
3. Pipelines
4. Model experimentation
5. Feature engineering

In [None]:
# Import the relevant packages
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import scipy.stats as stats

### Basic data exploration
1. Check the datasets
2. Combine the datasets
3. Look at summary statistics
4. Clean the dataset
5. Check for null values


In [8]:
# Load both datasets
job2 = pd.read_csv('AustraliaDataScienceJob2.csv')
job = pd.read_csv('AustraliaDataScienceJobs.csv')
print(f'job2 dataset shape = {job2.shape}')
print(f'job dataset shape = {job.shape}')

job2 dataset shape = (652, 53)
job dataset shape = (2088, 53)


In [9]:
# Combine both datasets into a single DataFrame
job = pd.concat([job, job2])
print(f'Shape of combined dataset = {job.shape} ')

Shape of combined dataset = (2740, 53) 
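
Note that `pd.concat` keeps each input frame's original row labels by default, which is why the combined frame's index runs from 0 to 651 rather than 0 to 2739 and contains repeated labels. The sketch below illustrates the difference on two small made-up frames (`a` and `b` are hypothetical stand-ins for `job` and `job2`). Passing `ignore_index=True` is one option if a unique 0..n-1 index is preferred; the notebook itself keeps the default.

```python
import pandas as pd

# Hypothetical toy frames standing in for `job` and `job2`
a = pd.DataFrame({"Job Title": ["Analyst", "Data Scientist", "Data Engineer"]})
b = pd.DataFrame({"Job Title": ["Construction", "Metallurgical Engineer"]})

# Default behaviour: each input keeps its own row labels, so 0 and 1 repeat
kept = pd.concat([a, b])

# ignore_index=True discards the old labels and renumbers the rows 0..n-1
renumbered = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
print(kept.index.is_unique, renumbered.index.is_unique)
```

Repeated labels matter mainly for label-based access: with the default index, `job.loc[0]` would return one row from each source file instead of a single row.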


In [10]:
job.head(5)

| | Job Title | Job Location | Company | Url | Estimate Base Salary | Low Estimate | High Estimate | Company Size | Company Type | Company Sector | ... | cassandra_yn | hive_yn | bigml_yn | tableau_yn | powerbi_yn | nlp_yn | pytorch_yn | tensorflow_yn | mathematic_yn | statistic_yn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Analyst | Melbourne | ANZ Banking Group | https://www.glassdoor.com.au/partner/jobListin... | 95917 | 80000 | 115000 | 10000+ Employees | Company - Public | Finance | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | Clinical Research Associate | Mulgrave | Bristol Myers Squibb | https://www.glassdoor.com.au/partner/jobListin... | 96555 | 79000 | 118000 | 10000+ Employees | Company - Public | Pharmaceutical & Biotechnology | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Clinical Research Associate | Mulgrave | Bristol Myers Squibb | https://www.glassdoor.com.au/partner/jobListin... | 96555 | 79000 | 118000 | 10000+ Employees | Company - Public | Pharmaceutical & Biotechnology | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Clinical Research Associate | Mulgrave | Bristol Myers Squibb | https://www.glassdoor.com.au/partner/jobListin... | 96555 | 79000 | 118000 | 10000+ Employees | Company - Public | Pharmaceutical & Biotechnology | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Data Scientist | Melbourne | ANZ Banking Group | https://www.glassdoor.com.au/partner/jobListin... | 115631 | 94000 | 143000 | 10000+ Employees | Company - Public | Finance | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [11]:
job.tail(5)

| | Job Title | Job Location | Company | Url | Estimate Base Salary | Low Estimate | High Estimate | Company Size | Company Type | Company Sector | ... | cassandra_yn | hive_yn | bigml_yn | tableau_yn | powerbi_yn | nlp_yn | pytorch_yn | tensorflow_yn | mathematic_yn | statistic_yn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 647 | Electronics Engineer | Western Australia | Australian Antarctic Division | https://www.glassdoor.com.au/partner/jobListin... | 92076 | 92000 | 92000 | 501 to 1000 Employees | Government | Government & Public Administration | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 648 | Construction | Perth | Chaleen Botha | https://www.glassdoor.com.au/partner/jobListin... | 55000 | 50000 | 60000 | | | | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 649 | Construction | Perth | Chaleen Botha | https://www.glassdoor.com.au/partner/jobListin... | 55000 | 50000 | 60000 | | | | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 650 | Metallurgical Engineer | Perth | BHP | https://www.glassdoor.com.au/partner/jobListin... | 141443 | 121000 | 165000 | 10000+ Employees | Company - Public | Energy, Mining, Utilities | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 651 | Metallurgical Engineer | Perth | BHP | https://www.glassdoor.com.au/partner/jobListin... | 141443 | 121000 | 165000 | 10000+ Employees | Company - Public | Energy, Mining, Utilities | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [12]:
job.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2740 entries, 0 to 651
Data columns (total 53 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Job Title                      2737 non-null   object 
 1   Job Location                   2740 non-null   object 
 2   Company                        2740 non-null   object 
 3   Url                            2740 non-null   object 
 4   Estimate Base Salary           2740 non-null   int64  
 5   Low Estimate                   2740 non-null   int64  
 6   High Estimate                  2740 non-null   int64  
 7   Company Size                   2469 non-null   object 
 8   Company Type                   2469 non-null   object 
 9   Company Sector                 2000 non-null   object 
 10  Company Founded                1569 non-null   float64
 11  Company Industry               2000 non-null   object 
 12  Company Revenue                2469 non-null   ob

#### Observations
1. The dataset has 2740 entries, which is fairly small by machine learning standards.
2. The following columns contain null values:
    1. Job Title
    2. Company Size
    3. Company Type
    4. Company Sector
    5. Company Founded
    6. Company Industry
    7. Company Revenue
    8. Job Descriptions
    9. Company Rating
    10. Company Friend Reccomendation
    11. Company CEO Approval
    12. Companny Number of Rater
    13. Company Career Opportinities
    14. Compensation and Benefits
    15. Company Culture and Values
    16. Company Senior Management
    17. Company Work Life Balance
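
The per-column null counts behind this list can also be tabulated directly rather than read off `job.info()`. Below is a minimal sketch on a made-up frame (`toy` and its values are hypothetical, chosen only for illustration); the same calls should apply unchanged to `job`.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror a few of the real ones
toy = pd.DataFrame({
    "Job Title": ["Analyst", None, "Data Scientist", "Data Engineer"],
    "Company Size": ["10000+ Employees", None, None, "501 to 1000 Employees"],
    "Low Estimate": [80000, 79000, 94000, 50000],
})

# Count missing values per column and express them as a share of all rows
null_count = toy.isna().sum()
null_summary = pd.DataFrame({
    "null_count": null_count,
    "null_pct": (null_count / len(toy) * 100).round(1),
})

# Keep only the columns that actually have missing values, worst first
null_summary = (
    null_summary[null_summary["null_count"] > 0]
    .sort_values("null_count", ascending=False)
)
print(null_summary)
```

Sorting by count makes it easier to see which columns are missing only a handful of values and which are missing a large share of rows, which usually calls for different handling.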


In [15]:
# Summary of categorical features
cat_feature = job.select_dtypes(include=[object])
cat_feature.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| Job Title | 2737 | 352 | Data Scientist | 449 |
| Job Location | 2740 | 117 | Melbourne | 770 |
| Company | 2740 | 485 | Deloitte | 144 |
| Url | 2740 | 2740 | https://www.glassdoor.com.au/partner/jobListin... | 1 |
| Company Size | 2469 | 8 | 10000+ Employees | 831 |
| Company Type | 2469 | 12 | Company - Public | 889 |
| Company Sector | 2000 | 24 | Finance | 401 |
| Company Industry | 2000 | 65 | National Services & Agencies | 223 |
| Company Revenue | 2469 | 14 | Unknown / Non-Applicable | 968 |
| Job Descriptions | 2739 | 807 | CSIRO’s Data61 is hiring:\nWe have a range of ... | 40 |


#### Observations
1. Every value in the Url feature is distinct (2740 unique values in 2740 rows), so it acts as a row identifier and carries no predictive signal.
2. The Country feature has only one unique value, so it cannot contribute to the prediction either.
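
Columns of both kinds can be flagged programmatically by comparing each column's number of distinct values with the number of rows. Below is a minimal sketch on a made-up frame (`toy`, its `Country` values, and the example URLs are hypothetical).

```python
import pandas as pd

# Hypothetical toy frame with one constant and one all-distinct column
toy = pd.DataFrame({
    "Url": ["https://example.com/1", "https://example.com/2", "https://example.com/3"],
    "Country": ["Australia", "Australia", "Australia"],
    "Job Location": ["Melbourne", "Perth", "Melbourne"],
})

# Number of distinct values per column (NaN counted as its own value)
n_unique = toy.nunique(dropna=False)

# A single distinct value means the column is constant
constant_cols = n_unique[n_unique == 1].index.tolist()

# As many distinct values as rows means the column behaves like an identifier
all_distinct_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols, all_distinct_cols)
```

Columns flagged this way could then be removed with `DataFrame.drop(columns=...)`. Note that an all-distinct numeric column is not necessarily an identifier, so the second list is best treated as a set of candidates to review.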

In [18]:
# Summary of numerical features
num_features = job.select_dtypes(include=[np.number])
num_features.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Estimate Base Salary | 2740.0 | 103185.945255 | 32620.51331 | 40500.0 | 80623.0 | 96130.0 | 120181.0 | 295000.0 |
| Low Estimate | 2740.0 | 91380.291971 | 29172.947652 | 30000.0 | 73000.0 | 87000.0 | 106000.0 | 241000.0 |
| High Estimate | 2740.0 | 116798.905109 | 38793.121065 | 41000.0 | 90000.0 | 114000.0 | 140000.0 | 349000.0 |
| Company Founded | 1569.0 | 1942.167623 | 63.627829 | 1631.0 | 1888.0 | 1949.0 | 2001.0 | 2020.0 |
| Company Rating | 2314.0 | 3.884313 | 0.517118 | 1.6 | 3.7 | 3.9 | 4.1 | 5.0 |
| Company Friend Reccomendation | 2225.0 | 76.335281 | 14.861318 | 15.0 | 69.0 | 79.0 | 85.0 | 100.0 |
| Company CEO Approval | 1749.0 | 77.934248 | 16.725428 | 7.0 | 72.0 | 83.0 | 88.0 | 100.0 |
| Companny Number of Rater | 1892.0 | 1757.846723 | 4647.614482 | 0.0 | 0.0 | 42.0 | 1014.0 | 30783.0 |
| Company Career Opportinities | 2270.0 | 3.690529 | 0.542892 | 1.0 | 3.4 | 3.7 | 4.0 | 5.0 |
| Compensation and Benefits | 2270.0 | 3.628502 | 0.53114 | 1.0 | 3.4 | 3.6 | 3.9 | 5.0 |
