<a href="https://colab.research.google.com/github/SriSatyaLokesh/naukri-jobs-dashboard/blob/master/notebooks/naukri_jobs_dashboard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Upload your Kaggle API token (kaggle.json); get it from your account page: kaggle.com/account -> generate API token
from google.colab import files
uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [0]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

Download the dataset from Kaggle using the Kaggle API token

[https://www.kaggle.com/PromptCloudHQ/jobs-on-naukricom](https://www.kaggle.com/PromptCloudHQ/jobs-on-naukricom)

In [3]:
import zipfile

!kaggle datasets download -d PromptCloudHQ/jobs-on-naukricom
# Extract the archive into the working directory; the context manager closes the file
with zipfile.ZipFile("jobs-on-naukricom.zip", "r") as zip_ref:
  zip_ref.extractall()

Downloading jobs-on-naukricom.zip to /content
 81% 11.0M/13.6M [00:00<00:00, 24.3MB/s]
100% 13.6M/13.6M [00:00<00:00, 30.6MB/s]


In [4]:
!ls

jobs-on-naukricom.zip  kaggle.json  naukri_com-job_sample.csv  sample_data


In [0]:
import pandas as pd
import numpy as np

In [6]:
data = pd.read_csv("naukri_com-job_sample.csv")
print(data.columns)
data.shape

Index(['company', 'education', 'experience', 'industry', 'jobdescription',
       'jobid', 'joblocation_address', 'jobtitle', 'numberofpositions',
       'payrate', 'postdate', 'site_name', 'skills', 'uniq_id'],
      dtype='object')


(22000, 14)

In [8]:
# Check for any NaN values
def columnsWithNaN(data):
  '''
  Take a DataFrame and return the names of the columns that contain NaN values.
  '''
  nan_data = data.isna()
  nan_columns = nan_data.any()
  columns_with_nan = data.columns[nan_columns].tolist()
  return columns_with_nan
columnsWithNaN(data)

['company',
 'education',
 'experience',
 'industry',
 'jobdescription',
 'joblocation_address',
 'numberofpositions',
 'payrate',
 'postdate',
 'site_name',
 'skills']
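
Knowing which columns contain NaN is only half the picture; the number of missing entries per column is what decides whether to drop rows or fill values. A minimal sketch on a toy frame (the `toy` frame and its values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Naukri data
toy = pd.DataFrame({
    "company": ["A", "B", None, "D"],
    "payrate": [np.nan, np.nan, "3 LPA", np.nan],
    "jobid": [1, 2, 3, 4],
})

# Count missing entries per column and keep only the columns that have any
nan_counts = toy.isna().sum()
nan_counts = nan_counts[nan_counts > 0].sort_values(ascending=False)
print(nan_counts)
```

Running the same two lines on `data` would show how many rows each `dropRowsWithNaN` call is about to remove.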

In [0]:
def dropRowsWithNaN(data,column):
  """Return the rows of `data` whose value in `column` is not NaN."""
  return data[pd.notnull(data[column])]
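
As a side note, pandas' built-in `DataFrame.dropna(subset=...)` should give the same result as this helper and also accepts several columns at once. A small sketch on a toy frame (the frame and its values are hypothetical, chosen only to illustrate the equivalence):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Naukri data
toy = pd.DataFrame({
    "postdate": ["2016-05-02", np.nan, "2016-05-03"],
    "payrate": [np.nan, "Not Disclosed", "Not Disclosed"],
})

# Boolean-mask approach, as used by the helper above
via_mask = toy[pd.notnull(toy["postdate"])]

# Built-in equivalent
via_dropna = toy.dropna(subset=["postdate"])

# Dropping on several columns in one call
both = toy.dropna(subset=["postdate", "payrate"])

print(via_mask.equals(via_dropna), len(both))
```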

In [11]:
dropRowsWithNaN(data,"postdate").shape

(21977, 14)

As only a few rows would be dropped (23 of 22,000), we can do this

In [12]:
data = dropRowsWithNaN(data,"postdate")
data.shape

(21977, 14)

Similarly, drop the rows with missing payrate and joblocation_address.


In [13]:
data = dropRowsWithNaN(data,"payrate")
data.shape

(21884, 14)

In [15]:
data = dropRowsWithNaN(data,"joblocation_address")
data.shape

(21387, 14)

In [16]:
columnsWithNaN(data)

['education', 'numberofpositions', 'site_name', 'skills']

We still have 4 columns with missing values.

In [17]:
dropRowsWithNaN(data,"numberofpositions").shape  # drops too many rows to be useful

(4435, 14)

Dropping these rows would discard most of the data (only 4,435 of 21,387 rows remain), so we fill the NaN values with the mean or median instead.
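
Which statistic to use depends on how skewed the column is: a handful of very large values pulls the mean up, while the median stays near a typical value. A sketch with made-up numbers (the `positions` series is hypothetical, not the dataset's values):

```python
import numpy as np
import pandas as pd

# Hypothetical values: mostly small openings plus one bulk posting
positions = pd.Series([2.0, 4.0, 4.0, 5.0, 500.0, np.nan, np.nan])

mean_value = positions.mean()      # pulled up by the single large value
median_value = positions.median()  # stays at a typical value

# Fill the gaps with the median so the imputed rows look like ordinary postings
filled = positions.fillna(median_value)
print(mean_value, median_value)
```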

In [0]:
def fillNanWithMedian(data,column):  #for median
  return data[column].fillna(data[column].median())

In [0]:
def fillNanWithMean(data,column):
  return data[column].fillna(data[column].mean())

In [21]:
data["numberofpositions"] = fillNanWithMedian(data,"numberofpositions")
# The mean is 45.5673, far above the typical value, so we fill with the median instead
data["numberofpositions"]

0         4.0
1        60.0
2         4.0
3         4.0
4         4.0
         ... 
21995     2.0
21996     4.0
21997     4.0
21998     4.0
21999     4.0
Name: numberofpositions, Length: 21387, dtype: float64

The site_name column has 2 unique values:


1.   www.naukri.com
2.   NaN

So we fill the NaN values with "other".



In [22]:
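# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
data["site_name"] = data["site_name"].fillna("other")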
data["site_name"]

0                 other
1                 other
2                 other
3                 other
4                 other
              ...      
21995    www.naukri.com
21996    www.naukri.com
21997    www.naukri.com
21998    www.naukri.com
21999    www.naukri.com
Name: site_name, Length: 21387, dtype: object
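
One way to confirm which values a column holds, NaN included, is `value_counts(dropna=False)`. A minimal sketch on a made-up series (the `site` series is hypothetical and only mimics the two values seen above):

```python
import numpy as np
import pandas as pd

# Hypothetical series mimicking a column with one real value and NaN
site = pd.Series(["www.naukri.com", np.nan, "www.naukri.com", np.nan, np.nan])

# Include NaN in the tally so missing entries are visible
counts_before = site.value_counts(dropna=False)

# After filling, NaN shows up under the placeholder label instead
site_filled = site.fillna("other")
counts_after = site_filled.value_counts()
print(counts_after)
```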

In [26]:
data["skills"].unique()

array(['ITES', 'Marketing', 'IT Software - Application Programming',
       'Accounts', 'Production', 'Sales', 'IT Software - Other',
       'Executive Assistant', 'IT Software - Mobile',
       'Engineering Design', 'Financial Services', 'Hotels',
       'IT Software - QA & Testing', 'HR', 'Supply Chain',
       'IT Software - Network Administration', 'Architecture', 'Legal',
       'Site Engineering', 'Journalism', nan, 'IT Software - DBA',
       'Strategy', 'Medical', 'Design', 'Defence Forces',
       'IT Software - Mainframe', 'IT Software - Telecom Software',
       'IT Software - Embedded', 'IT Software - Middleware', 'Teaching',
       'IT Software - System Programming',
       'IT Software - Client/Server Programming', 'Travel',
       'IT Software - eCommerce', 'TV', 'Fashion Designing',
       'IT Software - ERP', 'IT Hardware',
       'Analytics & Business Intelligence', 'Beauty/Fitness/Spa Services',
       'Top Management', 'Export', 'IT Software - Systems', 'Packaging',

In [27]:
print(len(data["skills"].unique()))
dropRowsWithNaN(data,"skills").shape

46


(20880, 14)

Here, rather than dropping the rows, we can fill the missing values with "not specified".

In [0]:
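# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
data["skills"] = data["skills"].fillna("not specified")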

The "joblocation_address" column needs work: a single entry can hold multiple locations, which would cause problems when visualizing.  
So we need to [explode rows](https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows) into one row per location.  
You can inspect the contents of that column using the following code:


```
data["joblocation_address"]
print(len(data["joblocation_address"].unique()))
data["joblocation_address"].unique()
```



In [30]:
print("actual data - ",data.shape)
exploded_data = data.copy()
exploded_data = exploded_data.assign(joblocation_address=exploded_data["joblocation_address"].str.split(',')).explode("joblocation_address")
exploded_data = exploded_data[exploded_data.joblocation_address != ""]
print("exploded data - ",exploded_data.shape)

actual data -  (21387, 14)
exploded data -  (38297, 14)
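
To see what the split-and-explode step does, here is a small sketch on a toy frame. The frame, its city names, and the extra `str.strip()` step are assumptions added for illustration; the cell above does not strip whitespace.

```python
import pandas as pd

# Hypothetical toy frame, not the Naukri data
toy = pd.DataFrame({
    "jobtitle": ["Developer", "Analyst"],
    "joblocation_address": ["Bengaluru, Hyderabad", "Mumbai"],
})

# Turn each comma-separated string into a list, then emit one row per list item
exploded = (
    toy.assign(joblocation_address=toy["joblocation_address"].str.split(","))
       .explode("joblocation_address")
)

# Assumption: strip the whitespace left over from "City, City" separators
exploded["joblocation_address"] = exploded["joblocation_address"].str.strip()

# explode repeats the original index, so renumber the rows
exploded = exploded.reset_index(drop=True)
print(exploded)
```

Each job with several locations becomes several rows that share every other column, which is why the row count grows after exploding.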
