## Summary of Data

The LinkedIn data set contains 7,927 rows and 16 columns describing job postings on the platform. It can be used for data analysis, visualization, and research. The postings cover roles such as Data Analyst and Machine Learning Engineer, in industries such as IT Services and IT Consulting, across a range of locations, work arrangements, and employment types. Each posting includes information about the company, and the job details field describes the role's responsibilities and required skills. This makes the data set useful for understanding job opportunities across industries and locations.

## Understanding the Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("linkdin_Job_data.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7927 entries, 0 to 7926
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   job_ID              7927 non-null   int64  
 1   job                 7894 non-null   object 
 2   location            7894 non-null   object 
 3   company_id          0 non-null      float64
 4   company_name        7892 non-null   object 
 5   work_type           7736 non-null   object 
 6   full_time_remote    7848 non-null   object 
 7   no_of_employ        7603 non-null   object 
 8   no_of_application   7887 non-null   object 
 9   posted_day_ago      7920 non-null   object 
 10  alumni              4858 non-null   object 
 11  Hiring_person       5720 non-null   object 
 12  linkedin_followers  4814 non-null   object 
 13  hiring_person_link  5720 non-null   object 
 14  job_details         7881 non-null   object 
 15  Column1             0 non-null      float64
dtypes: float64(2), int64(1), object(13)

In [4]:
df.dtypes

job_ID                  int64
job                    object
location               object
company_id            float64
company_name           object
work_type              object
full_time_remote       object
no_of_employ           object
no_of_application      object
posted_day_ago         object
alumni                 object
Hiring_person          object
linkedin_followers     object
hiring_person_link     object
job_details            object
Column1               float64
dtype: object

In [5]:
df[df.duplicated(subset=['job_ID'])].count()

job_ID                2084
job                   2078
location              2078
company_id               0
company_name          2077
work_type             2025
full_time_remote      2072
no_of_employ          1989
no_of_application     2077
posted_day_ago        2083
alumni                1272
Hiring_person         1563
linkedin_followers    1284
hiring_person_link    1563
job_details           2076
Column1                  0
dtype: int64

In [6]:
df.shape

(7927, 16)

In [7]:
df.head()

Unnamed: 0,job_ID,job,location,company_id,company_name,work_type,full_time_remote,no_of_employ,no_of_application,posted_day_ago,alumni,Hiring_person,linkedin_followers,hiring_person_link,job_details,Column1
0,3471657636,"Data Analyst, Trilogy (Remote) - $60,000/year USD","Delhi, Delhi, India",,Crossover,Remote,Full-time · Associate,"1,001-5,000 employees · IT Services and IT Con...",200,8 hours,12 company alumni,,"5,395,547 followers",,About the job Crossover is the world's #1 sour...,
1,3471669068,"Data Analyst, Trilogy (Remote) - $60,000/year USD","New Delhi, Delhi, India",,Crossover,Remote,Full-time · Associate,"1,001-5,000 employees · IT Services and IT Con...",184,8 hours,12 company alumni,,"5,395,547 followers",,About the job Crossover is the world's #1 sour...,
2,3474349934,Data Analyst - WFH,Greater Bengaluru Area,,Uplers,Remote,Full-time · Mid-Senior level,"1,001-5,000 employees · IT Services and IT Con...",200,9 hours,3 company alumni,Shahid Ahmad,,https://www.linkedin.com/in/shahid-ahmad-a2613...,About the job Profile: ML EngineersExperience:...,
3,3472816027,Data Analyst,"Gurugram, Haryana, India",,PVAR SERVICES,On-site,Full-time,1-10 employees,200,7 hours,,Vartika Singh,"2,094 followers",https://www.linkedin.com/in/vartika-singh-,About the job Designation: Data AnalystLocatio...,
4,3473311511,Data Analyst,"Mohali district, Punjab, India",,Timeline Freight Brokers,On-site,Full-time,1-10 employees,8,26 minutes,1 company alumni,Manisha (Gisele Smith),,https://www.linkedin.com/in/manisharathore0029,About the job The ideal candidate will use the...,


The dataset contains duplicate rows (2,084 repeated job_ID values) and missing values that need to be dealt with. Two columns, company_id and Column1, contain no data at all and need to be either filled or removed.
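The scale of these problems can be put into a single table before cleaning starts. The following is a minimal sketch on a toy frame, not the notebook's own code; `summarize_missing` and the `toy` data are hypothetical names introduced for illustration.

```python
import pandas as pd

def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy stand-in for df (hypothetical values)
toy = pd.DataFrame({
    "job_ID": [1, 2, 2, 3],
    "company_id": [None, None, None, None],
    "company_name": ["A", "B", "B", None],
})

report = summarize_missing(toy)
# Columns with no data at all are candidates for filling or dropping
empty_columns = report.index[report["missing"] == len(toy)].tolist()
# Rows whose job_ID repeats an earlier row
duplicate_ids = int(toy.duplicated(subset=["job_ID"]).sum())
```

Run against the real `df`, the same calls should flag `company_id` and `Column1` as empty, in line with the `df.info()` output above.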

## Data Cleaning

In [8]:
df_clean = df.copy()

Remove rows with a duplicate job_ID and reset the index.

In [9]:
df_clean.drop_duplicates(subset=['job_ID'], inplace=True)
# drop=True discards the old index instead of adding it back as an 'index' column
df_clean.reset_index(drop=True, inplace=True)

In [10]:
df_clean.shape

(5843, 16)
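Note that `drop_duplicates` keeps the first row for each job_ID by default, even when later rows with the same ID differ in other columns. A small sketch on hypothetical toy data illustrates this:

```python
import pandas as pd

# Hypothetical toy data: job_ID 10 appears twice with different application counts
toy = pd.DataFrame({
    "job_ID": [10, 11, 10, 12],
    "no_of_application": [200, 8, 184, 25],
})

# keep="first" is the default; it is spelled out here to make the choice visible
deduped = toy.drop_duplicates(subset=["job_ID"], keep="first").reset_index(drop=True)
```

Here the row with 184 applications is discarded. If the most recent scrape should win instead, `keep="last"` is the alternative.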

Fill the company_id column by assigning each company a unique integer identifier. The identifiers follow the sorted (alphabetical) order of company_name, not the order in which companies first appear.

In [11]:
df_clean['company_id'] = df_clean.groupby('company_name').ngroup()
df_clean.head()

Unnamed: 0,job_ID,job,location,company_id,company_name,work_type,full_time_remote,no_of_employ,no_of_application,posted_day_ago,alumni,Hiring_person,linkedin_followers,hiring_person_link,job_details,Column1
0,3471657636,"Data Analyst, Trilogy (Remote) - $60,000/year USD","Delhi, Delhi, India",523,Crossover,Remote,Full-time · Associate,"1,001-5,000 employees · IT Services and IT Con...",200,8 hours,12 company alumni,,"5,395,547 followers",,About the job Crossover is the world's #1 sour...,
1,3471669068,"Data Analyst, Trilogy (Remote) - $60,000/year USD","New Delhi, Delhi, India",523,Crossover,Remote,Full-time · Associate,"1,001-5,000 employees · IT Services and IT Con...",184,8 hours,12 company alumni,,"5,395,547 followers",,About the job Crossover is the world's #1 sour...,
2,3474349934,Data Analyst - WFH,Greater Bengaluru Area,2241,Uplers,Remote,Full-time · Mid-Senior level,"1,001-5,000 employees · IT Services and IT Con...",200,9 hours,3 company alumni,Shahid Ahmad,,https://www.linkedin.com/in/shahid-ahmad-a2613...,About the job Profile: ML EngineersExperience:...,
3,3472816027,Data Analyst,"Gurugram, Haryana, India",1552,PVAR SERVICES,On-site,Full-time,1-10 employees,200,7 hours,,Vartika Singh,"2,094 followers",https://www.linkedin.com/in/vartika-singh-,About the job Designation: Data AnalystLocatio...,
4,3473311511,Data Analyst,"Mohali district, Punjab, India",2146,Timeline Freight Brokers,On-site,Full-time,1-10 employees,8,26 minutes,1 company alumni,Manisha (Gisele Smith),,https://www.linkedin.com/in/manisharathore0029,About the job The ideal candidate will use the...,
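To make the numbering rule concrete, here is a minimal sketch on toy data ("Acme" is a hypothetical company name). It assumes no missing company names; how `ngroup` labels rows with a missing key has changed between pandas versions, so those rows are worth checking separately.

```python
import pandas as pd

toy = pd.DataFrame({"company_name": ["Crossover", "Crossover", "Uplers", "Acme"]})

# ngroup numbers the groups in sorted key order: Acme -> 0, Crossover -> 1, Uplers -> 2
toy["company_id"] = toy.groupby("company_name").ngroup()

# One row per company, usable as a name -> id lookup table
lookup = (
    toy.drop_duplicates("company_name")
       .set_index("company_name")["company_id"]
       .sort_values()
)
```

Because the IDs depend on which companies are present, they are stable only for a fixed snapshot of the data.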


Split the full_time_remote and no_of_employ columns, as each contains more than one piece of information.

In [12]:
# n=1 splits on the first '·' only, so the result has at most two columns
df_clean[['full_time_remote', 'job_level']] = df_clean['full_time_remote'].str.split('·', n=1, expand=True)

In [13]:
df_clean['no_of_employ'] = df_clean['no_of_employ'].str.replace(',', '')
df_clean['no_of_employ'] = df_clean['no_of_employ'].str.replace('employees', '')

In [14]:
# n=1 splits on the first '·' only, so the result has at most two columns
df_clean[['no_of_employ', 'industry']] = df_clean['no_of_employ'].str.split('·', n=1, expand=True)

In [15]:
df_clean.head()

Unnamed: 0,job_ID,job,location,company_id,company_name,work_type,full_time_remote,no_of_employ,no_of_application,posted_day_ago,alumni,Hiring_person,linkedin_followers,hiring_person_link,job_details,Column1,job_level,industry
0,3471657636,"Data Analyst, Trilogy (Remote) - $60,000/year USD","Delhi, Delhi, India",523,Crossover,Remote,Full-time,1001-5000,200,8 hours,12 company alumni,,"5,395,547 followers",,About the job Crossover is the world's #1 sour...,,Associate,IT Services and IT Consulting
1,3471669068,"Data Analyst, Trilogy (Remote) - $60,000/year USD","New Delhi, Delhi, India",523,Crossover,Remote,Full-time,1001-5000,184,8 hours,12 company alumni,,"5,395,547 followers",,About the job Crossover is the world's #1 sour...,,Associate,IT Services and IT Consulting
2,3474349934,Data Analyst - WFH,Greater Bengaluru Area,2241,Uplers,Remote,Full-time,1001-5000,200,9 hours,3 company alumni,Shahid Ahmad,,https://www.linkedin.com/in/shahid-ahmad-a2613...,About the job Profile: ML EngineersExperience:...,,Mid-Senior level,IT Services and IT Consulting
3,3472816027,Data Analyst,"Gurugram, Haryana, India",1552,PVAR SERVICES,On-site,Full-time,1-10,200,7 hours,,Vartika Singh,"2,094 followers",https://www.linkedin.com/in/vartika-singh-,About the job Designation: Data AnalystLocatio...,,,
4,3473311511,Data Analyst,"Mohali district, Punjab, India",2146,Timeline Freight Brokers,On-site,Full-time,1-10,8,26 minutes,1 company alumni,Manisha (Gisele Smith),,https://www.linkedin.com/in/manisharathore0029,About the job The ideal candidate will use the...,,,
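Splitting on '·' alone leaves the surrounding spaces attached to each part, which the printed table does not reveal. A minimal sketch that also strips the whitespace, on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "full_time_remote": ["Full-time · Associate", "Full-time", None],
})

# Split on the first '·' only; rows without a separator get a missing second part
parts = toy["full_time_remote"].str.split("·", n=1, expand=True)

# Strip the spaces that surrounded the separator
toy["full_time_remote"] = parts[0].str.strip()
toy["job_level"] = parts[1].str.strip()
```

Stripping matters later for grouping, since "Full-time " and "Full-time" would otherwise count as different categories.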


Remove the time-unit text (such as 'hours' or 'minutes') that appears in some no_of_application values, then convert the column to numeric.

In [16]:
for i in ['days','day','seconds','minutes','minute','hours','hour']:
    df_clean['no_of_application'] = df_clean['no_of_application'].str.replace(i, '')

In [17]:
df_clean['no_of_application'] = pd.to_numeric(df_clean['no_of_application'])
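An alternative to removing each unit word in turn is to extract the digits directly and flag the rows that carried extra text. This is a sketch under assumptions, not the notebook's method: the `raw` values are hypothetical, and whether a value such as "8 hours" is a genuine application count is not something the source establishes.

```python
import pandas as pd

# Hypothetical raw values, including a non-numeric placeholder
raw = pd.Series(["200", "8 hours", "26 minutes", None, "n/a"])

# Pull out the first run of digits; rows without digits become NaN
applications = pd.to_numeric(raw.str.extract(r"(\d+)", expand=False), errors="coerce")

# Flag rows whose original text was more than a bare number, for manual review
had_extra_text = raw.notna() & ~raw.str.fullmatch(r"\s*\d+\s*", na=False)
```

Keeping the flag makes it possible to inspect or exclude the affected rows instead of silently treating them as counts.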

Likewise, remove the text from the alumni column and convert it to numeric.

In [18]:
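# Remove the literal suffix (str.rstrip would strip a set of characters, not a suffix)
df_clean['alumni'] = df_clean['alumni'].str.replace(' company alumni', '', regex=False)
df_clean['alumni'] = df_clean['alumni'].str.replace(',', '')
df_clean['alumni'] = pd.to_numeric(df_clean['alumni'])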
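When removing a fixed suffix such as ' company alumni', keep in mind that `str.rstrip` interprets its argument as a set of characters rather than a suffix. It gives the right answer for this column only because digits are not in that set. The sketch below contrasts the two approaches; the "5 school alumni" value is hypothetical and serves only to show where they diverge.

```python
import pandas as pd

alumni = pd.Series(["12 company alumni", "1,024 company alumni", None])

# rstrip removes any trailing characters found in the given set
via_rstrip = alumni.str.rstrip("company alumni")

# Replacing the literal suffix states the intent directly
via_replace = alumni.str.replace(" company alumni", "", regex=False)
counts = pd.to_numeric(via_replace.str.replace(",", "", regex=False))

# Hypothetical value: rstrip eats into the preceding word
pitfall = "5 school alumni".rstrip("company alumni")  # "5 sch"
```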

In [19]:
df_clean.head()

Unnamed: 0,job_ID,job,location,company_id,company_name,work_type,full_time_remote,no_of_employ,no_of_application,posted_day_ago,alumni,Hiring_person,linkedin_followers,hiring_person_link,job_details,Column1,job_level,industry
0,3471657636,"Data Analyst, Trilogy (Remote) - $60,000/year USD","Delhi, Delhi, India",523,Crossover,Remote,Full-time,1001-5000,200.0,8 hours,12.0,,"5,395,547 followers",,About the job Crossover is the world's #1 sour...,,Associate,IT Services and IT Consulting
1,3471669068,"Data Analyst, Trilogy (Remote) - $60,000/year USD","New Delhi, Delhi, India",523,Crossover,Remote,Full-time,1001-5000,184.0,8 hours,12.0,,"5,395,547 followers",,About the job Crossover is the world's #1 sour...,,Associate,IT Services and IT Consulting
2,3474349934,Data Analyst - WFH,Greater Bengaluru Area,2241,Uplers,Remote,Full-time,1001-5000,200.0,9 hours,3.0,Shahid Ahmad,,https://www.linkedin.com/in/shahid-ahmad-a2613...,About the job Profile: ML EngineersExperience:...,,Mid-Senior level,IT Services and IT Consulting
3,3472816027,Data Analyst,"Gurugram, Haryana, India",1552,PVAR SERVICES,On-site,Full-time,1-10,200.0,7 hours,,Vartika Singh,"2,094 followers",https://www.linkedin.com/in/vartika-singh-,About the job Designation: Data AnalystLocatio...,,,
4,3473311511,Data Analyst,"Mohali district, Punjab, India",2146,Timeline Freight Brokers,On-site,Full-time,1-10,8.0,26 minutes,1.0,Manisha (Gisele Smith),,https://www.linkedin.com/in/manisharathore0029,About the job The ideal candidate will use the...,,,


Remove any extra names in parentheses so the Hiring_person column contains only the person's name. Also replace NaN values with 'Not Available' where there is no hiring person.

In [20]:
# Raw string so the backslashes reach the regex engine unchanged
df_clean['Hiring_person'] = df_clean['Hiring_person'].str.replace(r' \(.*\)', '', regex=True)

In [21]:
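# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
df_clean['Hiring_person'] = df_clean['Hiring_person'].fillna('Not Available')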

Remove all unnecessary text from linkedin_followers column.

In [22]:
df_clean['linkedin_followers'] = df_clean['linkedin_followers'].str.replace(',', '')
# Keep only the first run of digits; this also drops the trailing 'followers' text
df_clean['linkedin_followers'] = df_clean['linkedin_followers'].str.extract(r'(\d+)', expand=False)
df_clean['linkedin_followers'] = pd.to_numeric(df_clean['linkedin_followers'])
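The alumni and follower columns share the same shape: a comma-grouped integer followed by a label. A small helper could handle both; `parse_count` is a hypothetical name introduced here, and the sketch is one possible approach, not the notebook's own.

```python
import pandas as pd

def parse_count(series: pd.Series) -> pd.Series:
    """Extract the first comma-grouped integer from each string; NaN if none."""
    digits = series.str.extract(r"(\d[\d,]*)", expand=False)
    return pd.to_numeric(digits.str.replace(",", "", regex=False), errors="coerce")

followers = parse_count(pd.Series(["5,395,547 followers", "2,094 followers", None]))
alumni = parse_count(pd.Series(["12 company alumni", None]))
```

Both results are float columns because missing values are represented as NaN.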

In [23]:
df_clean.head()

Unnamed: 0,job_ID,job,location,company_id,company_name,work_type,full_time_remote,no_of_employ,no_of_application,posted_day_ago,alumni,Hiring_person,linkedin_followers,hiring_person_link,job_details,Column1,job_level,industry
0,3471657636,"Data Analyst, Trilogy (Remote) - $60,000/year USD","Delhi, Delhi, India",523,Crossover,Remote,Full-time,1001-5000,200.0,8 hours,12.0,Not Available,5395547.0,,About the job Crossover is the world's #1 sour...,,Associate,IT Services and IT Consulting
1,3471669068,"Data Analyst, Trilogy (Remote) - $60,000/year USD","New Delhi, Delhi, India",523,Crossover,Remote,Full-time,1001-5000,184.0,8 hours,12.0,Not Available,5395547.0,,About the job Crossover is the world's #1 sour...,,Associate,IT Services and IT Consulting
2,3474349934,Data Analyst - WFH,Greater Bengaluru Area,2241,Uplers,Remote,Full-time,1001-5000,200.0,9 hours,3.0,Shahid Ahmad,,https://www.linkedin.com/in/shahid-ahmad-a2613...,About the job Profile: ML EngineersExperience:...,,Mid-Senior level,IT Services and IT Consulting
3,3472816027,Data Analyst,"Gurugram, Haryana, India",1552,PVAR SERVICES,On-site,Full-time,1-10,200.0,7 hours,,Vartika Singh,2094.0,https://www.linkedin.com/in/vartika-singh-,About the job Designation: Data AnalystLocatio...,,,
4,3473311511,Data Analyst,"Mohali district, Punjab, India",2146,Timeline Freight Brokers,On-site,Full-time,1-10,8.0,26 minutes,1.0,Manisha,,https://www.linkedin.com/in/manisharathore0029,About the job The ideal candidate will use the...,,,


Remove any extra information from the job column so it only has the name of the job title or position.

In [24]:
# Truncate each title at the first delimiter that usually introduces extra detail
df_clean['job'] = df_clean['job'].str.replace(r'\(.*', '', regex=True)
df_clean['job'] = df_clean['job'].str.replace(r'\[.*', '', regex=True)
df_clean['job'] = df_clean['job'].str.replace(r',.*', '', regex=True)
df_clean['job'] = df_clean['job'].str.replace(r'–.*', '', regex=True)
# Note: this also cuts hyphenated titles at their first hyphen
df_clean['job'] = df_clean['job'].str.replace(r'-.*', '', regex=True)
df_clean['job'] = df_clean['job'].str.replace(r'\|.*', '', regex=True)
df_clean['job'] = df_clean['job'].str.rstrip(' ')
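Cutting at every hyphen also shortens hyphenated titles. One possible variation, sketched below, uses a single pattern and treats a dash as a delimiter only when it has whitespace on both sides. This is a different rule from the cell above, offered as an assumption about what is wanted; `TITLE_TAIL` is a hypothetical name and the third title is a made-up example.

```python
import re
import pandas as pd

# Cut at '(', '[', ',', '|', or at a hyphen/en dash surrounded by whitespace
TITLE_TAIL = re.compile(r"\s*(?:[(\[,|]|\s[-–]\s).*$")

titles = pd.Series([
    "Data Analyst, Trilogy (Remote) - $60,000/year USD",
    "Data Analyst - WFH",
    "Front-End Developer (Contract)",  # hypothetical title
])

cleaned = titles.str.replace(TITLE_TAIL, "", regex=True).str.strip()
```

With this rule the hyphen inside "Front-End" survives, while " - WFH" is still removed.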

Split the location column into three separate columns that list the Country, State, and City. Locations without commas, such as 'Greater Bengaluru Area', end up entirely in the Country column.

In [25]:
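# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
df_clean['location'] = df_clean['location'].fillna('')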

In [26]:
df_clean['location'] = df_clean['location'].str.split(',').apply(reversed).apply(','.join)

In [27]:
# n must be passed by keyword in recent pandas; n=2 yields at most three parts
df_clean[['Country', 'State', 'City']] = df_clean['location'].str.split(',', n=2, expand=True)
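Because the original strings separate parts with a comma and a space, splitting on ',' alone leaves leading spaces on some parts. The sketch below is one way to split and trim in a single pass while keeping the same right-to-left assignment (the last part goes to Country). `split_location` is a hypothetical helper, and unlike the cell above it drops any parts beyond the last three.

```python
import pandas as pd

def split_location(location: pd.Series) -> pd.DataFrame:
    """Split 'City, State, Country' strings, filling the columns from the right."""
    def align(text):
        # Trim each part and ignore empty ones
        items = [part.strip() for part in text.split(",") if part.strip()]
        # Pad on the left so the last part always lands in Country
        padded = [None] * (3 - len(items)) + items[-3:]
        return pd.Series(padded, index=["City", "State", "Country"])
    return location.fillna("").apply(align)

toy = pd.Series(["Delhi, Delhi, India", "Greater Bengaluru Area", None])
out = split_location(toy)
```

As in the notebook, a single-part location such as 'Greater Bengaluru Area' still lands in Country; resolving such region names to a real country would need an external lookup.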

In [28]:
df_clean.head()

Unnamed: 0,job_ID,job,location,company_id,company_name,work_type,full_time_remote,no_of_employ,no_of_application,posted_day_ago,...,Hiring_person,linkedin_followers,hiring_person_link,job_details,Column1,job_level,industry,Country,State,City
0,3471657636,Data Analyst,"India, Delhi,Delhi",523,Crossover,Remote,Full-time,1001-5000,200.0,8 hours,...,Not Available,5395547.0,,About the job Crossover is the world's #1 sour...,,Associate,IT Services and IT Consulting,India,Delhi,Delhi
1,3471669068,Data Analyst,"India, Delhi,New Delhi",523,Crossover,Remote,Full-time,1001-5000,184.0,8 hours,...,Not Available,5395547.0,,About the job Crossover is the world's #1 sour...,,Associate,IT Services and IT Consulting,India,Delhi,New Delhi
2,3474349934,Data Analyst,Greater Bengaluru Area,2241,Uplers,Remote,Full-time,1001-5000,200.0,9 hours,...,Shahid Ahmad,,https://www.linkedin.com/in/shahid-ahmad-a2613...,About the job Profile: ML EngineersExperience:...,,Mid-Senior level,IT Services and IT Consulting,Greater Bengaluru Area,,
3,3472816027,Data Analyst,"India, Haryana,Gurugram",1552,PVAR SERVICES,On-site,Full-time,1-10,200.0,7 hours,...,Vartika Singh,2094.0,https://www.linkedin.com/in/vartika-singh-,About the job Designation: Data AnalystLocatio...,,,,India,Haryana,Gurugram
4,3473311511,Data Analyst,"India, Punjab,Mohali district",2146,Timeline Freight Brokers,On-site,Full-time,1-10,8.0,26 minutes,...,Manisha,,https://www.linkedin.com/in/manisharathore0029,About the job The ideal candidate will use the...,,,,India,Punjab,Mohali district


Remove any extra columns that are no longer being used and rename columns so that they are consistent.

In [29]:
df_clean.drop(columns=['Column1', 'posted_day_ago', 'hiring_person_link', 'location'], inplace=True)

In [30]:
df_clean.rename(columns= {'job_ID':'Job_ID', 'job':'Job', 'company_id':'Company_ID', 'company_name':'Company_Name', 'work_type':'Work_Location', 'full_time_remote':'Work_Type', 'no_of_employ':'No_of_Employees', 'no_of_application':'No_of_Applications', 'alumni':'No_of_Alumni', 'Hiring_person':'Hiring_Person', 'linkedin_followers':'Linkedin_Followers', 'job_details':'Job_Details', 'job_level':'Job_Level', 'industry':'Industry'}, inplace=True)

In [31]:
df_clean

Unnamed: 0,Job_ID,Job,Company_ID,Company_Name,Work_Location,Work_Type,No_of_Employees,No_of_Applications,No_of_Alumni,Hiring_Person,Linkedin_Followers,Job_Details,Job_Level,Industry,Country,State,City
0,3471657636,Data Analyst,523,Crossover,Remote,Full-time,1001-5000,200.0,12.0,Not Available,5395547.0,About the job Crossover is the world's #1 sour...,Associate,IT Services and IT Consulting,India,Delhi,Delhi
1,3471669068,Data Analyst,523,Crossover,Remote,Full-time,1001-5000,184.0,12.0,Not Available,5395547.0,About the job Crossover is the world's #1 sour...,Associate,IT Services and IT Consulting,India,Delhi,New Delhi
2,3474349934,Data Analyst,2241,Uplers,Remote,Full-time,1001-5000,200.0,3.0,Shahid Ahmad,,About the job Profile: ML EngineersExperience:...,Mid-Senior level,IT Services and IT Consulting,Greater Bengaluru Area,,
3,3472816027,Data Analyst,1552,PVAR SERVICES,On-site,Full-time,1-10,200.0,,Vartika Singh,2094.0,About the job Designation: Data AnalystLocatio...,,,India,Haryana,Gurugram
4,3473311511,Data Analyst,2146,Timeline Freight Brokers,On-site,Full-time,1-10,8.0,1.0,Manisha,,About the job The ideal candidate will use the...,,,India,Punjab,Mohali district
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5838,3472039871,Back End Developer,1526,Orion Innovation,Hybrid,Full-time,5001-10000,25.0,24.0,Poornima Viswanathan,,About the job The ideal candidate will show in...,Associate,IT Services and IT Consulting,India,Kerala,Kochi
5839,3473194471,Software Engineer,2241,Uplers,On-site,Full-time,1001-5000,18.0,3.0,Tejveer Singh,,About the job Experience: 4 - 8 yearsProfile: ...,Mid-Senior level,IT Services and IT Consulting,India,Haryana,Gurugram
5840,3461005032,Vue JS,2049,Tata Consultancy Services,On-site,Full-time,10001+,15.0,10080.0,ANNIE ANTONY,11923634.0,About the job Role- Vue js DeveloperExperience...,Mid-Senior level,IT Services and IT Consulting,India,Telangana,Hyderabad
5841,3474305684,iOS Developer,2241,Uplers,Remote,Full-time,1001-5000,17.0,3.0,Arjun Jaggi,,About the job Profile: iOS DeveloperExperience...,Mid-Senior level,IT Services and IT Consulting,India,Karnataka,Bengaluru
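A few assertions can confirm that the cleaned table has the properties the steps above were meant to establish. This is a minimal sketch; `check_cleaned` is a hypothetical helper, and the toy frame reuses two rows from the output above.

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame) -> dict:
    """Return simple pass/fail checks for the cleaned table."""
    numeric_cols = ["No_of_Applications", "No_of_Alumni", "Linkedin_Followers"]
    return {
        # Deduplication should leave one row per posting
        "job_ids_unique": bool(frame["Job_ID"].is_unique),
        # Count columns should have been converted from text
        "counts_numeric": all(
            pd.api.types.is_numeric_dtype(frame[col]) for col in numeric_cols
        ),
        # Missing hiring persons should have been replaced with a placeholder
        "hiring_person_filled": bool(frame["Hiring_Person"].notna().all()),
    }

toy = pd.DataFrame({
    "Job_ID": [3471657636, 3472816027],
    "No_of_Applications": [200.0, 200.0],
    "No_of_Alumni": [12.0, float("nan")],
    "Hiring_Person": ["Not Available", "Vartika Singh"],
    "Linkedin_Followers": [5395547.0, 2094.0],
})
checks = check_cleaned(toy)
```

Calling `check_cleaned(df_clean)` after the rename step would run the same checks on the full table.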


## References

Dataset: https://www.kaggle.com/datasets/shashankshukla123123/linkedin-job-data
