This dataset contains job postings registered on <a href="https://quera.org/magnet/jobs">Quera</a>. To protect the privacy of companies, the IDs of the companies that posted the ads have been removed from the dataset. Each row of the dataset is one job ad for a position that a company wants to fill.

In [97]:
import pandas as pd
from datetime import datetime

In [98]:
# First, load the data
df = pd.read_csv('job_posts.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7669 entries, 0 to 7668
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Level               7246 non-null   object
 1   Offers Remote       7669 non-null   bool  
 2   Publish Time        7669 non-null   object
 3   Salary              3493 non-null   object
 4   Title               7669 non-null   object
 5   Close Time          7087 non-null   object
 6   State               7669 non-null   object
 7   Collaboration Type  7246 non-null   object
dtypes: bool(1), object(7)
memory usage: 427.0+ KB


We drop the rows in which <b>all three</b> of the <code>Level</code>, <code>Salary</code>, and <code>Collaboration Type</code> columns are missing (<code>NaN</code>).

In [99]:
df.dropna(subset=['Salary','Level','Collaboration Type'],how='all',inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7246 entries, 0 to 7668
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Level               7246 non-null   object
 1   Offers Remote       7246 non-null   bool  
 2   Publish Time        7246 non-null   object
 3   Salary              3493 non-null   object
 4   Title               7246 non-null   object
 5   Close Time          6664 non-null   object
 6   State               7246 non-null   object
 7   Collaboration Type  7246 non-null   object
dtypes: bool(1), object(7)
memory usage: 460.0+ KB
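
The difference between <code>how='all'</code> and <code>how='any'</code> is easy to miss, so here is a minimal sketch on a toy frame. The values in <code>toy</code> are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the same three columns used above.
toy = pd.DataFrame({
    "Level": ["J", np.nan, np.nan],
    "Salary": [">6MT", np.nan, ">12MT"],
    "Collaboration Type": ["FT", np.nan, np.nan],
})

cols = ["Salary", "Level", "Collaboration Type"]

# how='all' drops a row only when every listed column is NaN.
kept_all = toy.dropna(subset=cols, how="all")

# how='any' drops a row as soon as one listed column is NaN.
kept_any = toy.dropna(subset=cols, how="any")

print(list(kept_all.index))  # [0, 2]
print(list(kept_any.index))  # [0]
```

Row 2 survives <code>how='all'</code> because its <code>Salary</code> is present, which is the behaviour the step above relies on.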


We replace each value in the <code>Salary</code> column with the number it contains. For example, <code>>6MT</code> becomes <code>6</code> and <code>>12MT</code> becomes <code>12</code>.
<br>

In [100]:
df['Salary'] = df['Salary'].str.extract(r'(\d+)', expand=False).astype(float)
df.head()

Unnamed: 0,Level,Offers Remote,Publish Time,Salary,Title,Close Time,State,Collaboration Type
0,S,False,2020-04-20 16:52:15,8.0,توسعه‌دهنده Node.js,2020-06-20 03:00:06,C,FT
1,J,False,2020-02-24 12:55:02,12.0,توسعه‌دهنده Backend,2020-04-25 03:00:05,C,FT
2,J,True,2020-02-12 13:49:32,8.0,توسعه‌دهنده C#,2020-04-13 03:00:05,C,FT
3,J,False,2020-08-18 15:46:48,3.0,توسعه‌دهنده Front-end,2020-08-25 21:51:39,C,PT
5,J,False,2020-10-11 12:02:37,,Associate Product Manager,2020-11-08 01:27:52,C,FT
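
As a small illustration of the extraction step, the sketch below applies the same regular expression to a few sample strings. The Series <code>salaries</code> is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Hypothetical sample of raw salary strings, including one missing value.
salaries = pd.Series([">6MT", ">12MT", None], name="Salary")

# (\d+) captures the first run of digits; expand=False returns a Series.
numeric = salaries.str.extract(r"(\d+)", expand=False).astype(float)

print(numeric.tolist())  # [6.0, 12.0, nan]
```

Only the first run of digits is captured, so a value that contained two numbers would keep just the first one.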


We fill in the missing values of the <code>Salary</code> column with the average salary by <b>level</b>. That is, if the <code>Salary</code> of a junior ad (<code>J</code>) is missing, we fill it with the average salary of all junior ads; if the <code>Salary</code> of a senior ad (<code>S</code>) is missing, we fill it with the average salary of all senior ads; and so on for the other levels.

In [101]:
df['Salary'] = df.groupby('Level')['Salary'].transform(lambda a: a.fillna(a.mean()))
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7246 entries, 0 to 7668
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Level               7246 non-null   object 
 1   Offers Remote       7246 non-null   bool   
 2   Publish Time        7246 non-null   object 
 3   Salary              7246 non-null   float64
 4   Title               7246 non-null   object 
 5   Close Time          6664 non-null   object 
 6   State               7246 non-null   object 
 7   Collaboration Type  7246 non-null   object 
dtypes: bool(1), float64(1), object(6)
memory usage: 460.0+ KB
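
To make the group-wise fill concrete, here is a minimal sketch on a toy frame with made-up salaries. It uses the same <code>groupby(...).transform(...)</code> pattern as the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: one missing value per level.
toy = pd.DataFrame({
    "Level": ["J", "J", "J", "S", "S"],
    "Salary": [4.0, 6.0, np.nan, 10.0, np.nan],
})

# Within each level, replace NaN with that level's mean salary.
toy["Salary"] = toy.groupby("Level")["Salary"].transform(
    lambda s: s.fillna(s.mean())
)

print(toy["Salary"].tolist())  # [4.0, 6.0, 5.0, 10.0, 10.0]
```

A level whose salaries are all missing has no mean to fall back on, so its rows would stay <code>NaN</code>.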


The preprocessing is now complete, and it is time to analyze the data.
<br>
<ul dir=ltr>
<li>First, we measure the impact of the coronavirus outbreak on remote work in Iran.</li>
<li>Second, we measure how the time of day affects the publication of job postings.</li>
<li>Third, we compute the average salary of people working in the data field.</li>
</ul>

By how much did the rate of job postings that offer remote work increase after the first official coronavirus case in Iran? Here, "the rate of job postings that offer remote work" means the number of job postings that offer remote work divided by the total number of ads. The purpose of this question is to see how strongly the outbreak affected remote work.
<br>
The date of the first official coronavirus case in Iran is stored in the <code>start_of_corona</code> variable in the cell below.

In [102]:
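import warnings

# Date of the first official coronavirus case in Iran
start_of_corona = datetime.strptime("2020-02-19", "%Y-%m-%d")
warnings.filterwarnings('ignore')

# Work on a copy so the original 'Publish Time' column is left untouched
df_copy = df.copy()
df_copy['Publish Time'] = pd.to_datetime(df_copy['Publish Time'])

# Label each posting 'before' or 'after' the first case
# (intervals are right-closed, so the earliest posting itself gets no label)
bins = [df_copy['Publish Time'].min(), start_of_corona, df_copy['Publish Time'].max()]
df_copy['Publish Time'] = pd.cut(df_copy['Publish Time'], bins=bins, labels=['before', 'after'])

# Count the remote postings in each period, larger count first
seperate_remote_growth = df_copy.groupby('Publish Time')['Offers Remote'].apply(lambda a: (a == True).sum()).sort_values(ascending=False)

# Difference between the two counts, divided by the total number of postings
remote_growth = ((seperate_remote_growth.values[0] - seperate_remote_growth.values[1]) / len(df_copy['Offers Remote']))
print(remote_growth)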


0.23820038642009383
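
The cell above divides the difference between the two remote-posting counts by the total number of postings. Another possible reading of the definition is to compute the rate separately within each period and compare the two rates. The sketch below shows that alternative on a toy frame with made-up rows. It is an assumption about how the question could be interpreted, not the result reported above.

```python
import pandas as pd

start_of_corona = pd.Timestamp("2020-02-19")

# Hypothetical postings: four before and four after the first official case.
toy = pd.DataFrame({
    "Publish Time": pd.to_datetime([
        "2020-01-05 10:00:00", "2020-01-20 09:30:00",
        "2020-02-01 14:00:00", "2020-02-10 08:00:00",
        "2020-03-01 11:00:00", "2020-04-15 16:45:00",
        "2020-05-02 13:20:00", "2020-06-30 19:05:00",
    ]),
    "Offers Remote": [True, False, False, False, True, True, False, True],
})

is_after = toy["Publish Time"] >= start_of_corona

# The mean of a boolean column is the share of True values in that period.
rate_before = toy.loc[~is_after, "Offers Remote"].mean()
rate_after = toy.loc[is_after, "Offers Remote"].mean()
increase = rate_after - rate_before

print(rate_before, rate_after, increase)  # 0.25 0.75 0.5
```

Comparing the boolean mask directly also avoids depending on the bin edges, so the earliest and latest postings are always assigned to a period.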


In this step, we discretize the time of day: the 24 hours of a day are split into four periods according to the table below.

<center>
<div dir=rtl style="direction: rtl;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazirmatn" size=3>
    
|name|time frame|
|:-------:|:-----:|
|dawn|00:00 to 05:59:59|
|morning|06:00 to 11:59:59|
|noon|12:00 to 17:59:59|
|night|18:00 to 23:59:59|
    
</font>
</div>
</center>

Now count how many job postings were published in each of these four periods. Store the answer in a variable called <code>discrete_time</code>: a pandas Series whose index labels are <code>dawn</code>, <code>morning</code>, <code>noon</code>, and <code>night</code>, and whose values are the number of job postings published in each period. The output should be sorted by value in descending order.

In [103]:
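from datetime import time

df_copy_2 = df.copy()

# Start of each period, plus the last second of the day as the final edge
dawn = time(0, 0, 0)
morning = time(6, 0, 0)
noon = time(12, 0, 0)
night = time(18, 0, 0)
end_of_day = time(23, 59, 59)

labels = ['dawn', 'morning', 'noon', 'night']
# Keep only the time-of-day part of each timestamp
df_copy_2['Publish Time'] = pd.to_datetime(df_copy_2['Publish Time']).dt.time
# right=False makes each bin [start, end), so 06:00:00 counts as 'morning';
# a posting at exactly 23:59:59 would fall outside the last bin
df_copy_2['Publish Time'] = pd.cut(df_copy_2['Publish Time'], bins=[dawn, morning, noon, night, end_of_day], labels=labels, right=False)
# Count the postings per period, largest count first
discrete_time = df_copy_2['Publish Time'].value_counts().sort_values(ascending=False)
discrete_time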


Publish Time
noon       3950
morning    2443
night       746
dawn        107
Name: count, dtype: int64
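
The cell above bins <code>datetime.time</code> objects directly. A possible alternative is to bin the integer hour of each timestamp, which sidesteps the open edge at 23:59:59. The sketch below shows this on a handful of made-up timestamps chosen to hit the period boundaries.

```python
import pandas as pd

# Hypothetical publish times that sit on or next to the period boundaries.
times = pd.to_datetime(pd.Series([
    "2020-04-20 00:00:00",
    "2020-04-20 05:59:59",
    "2020-04-20 06:00:00",
    "2020-04-20 12:30:00",
    "2020-04-20 17:59:59",
    "2020-04-20 23:59:59",
]))

labels = ["dawn", "morning", "noon", "night"]

# Hours 0-5 -> dawn, 6-11 -> morning, 12-17 -> noon, 18-23 -> night.
# right=False makes each bin [start, end), so hour 6 belongs to 'morning'.
period = pd.cut(times.dt.hour, bins=[0, 6, 12, 18, 24], labels=labels, right=False)
counts = period.value_counts()

print(period.tolist())
# ['dawn', 'dawn', 'morning', 'noon', 'noon', 'night']
```

Because the last edge is 24 rather than 23:59:59, every timestamp in the final second of the day is still labelled <code>night</code>.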

In this question, we examine the salaries that companies pay to people working in the data field. To filter the job postings in this field, a list of keywords is given in the following cell. We assume that every job posting whose title contains one of these keywords is related to the data field. The title does not have to match a keyword exactly; it is enough for a keyword to appear somewhere in the title. For example, if "Data Analyst" is in the list of keywords, then an ad titled "Hire a Data Analyst" counts as data-related.
The question is then: what is the average salary that companies pay to people in the data field, by level? Store the answer in a Series called <code>data_mean</code>. The index of this Series consists of the letters M, J, I, and S, and the value for each letter is the average salary that companies pay to that level.

In [104]:
keywords = ['machine learning', 'machinelearning', 'داده' , 'data scientist' ,  'datascientist' ,\
        'هوش مصنوعی' ,'پردازش ویدئو' , 'data engineer' , 'dataengineer' ,'بینایی ماشین' , 'یادگیری ماشین' ,\
        'deep learning', 'deeplearning', 'یادگیری عمیق', 'دیتاساینتیست' , 'artificial intelligence' \
        ,'artificialintelligence', 'هوش' , 'data analysis' , 'dataanalysis' , 'پردازش تصویر' , 'شبکه‌های عمیق', 'علم‌داده']

df['Title'] = df['Title'].str.lower()

# Combine keywords into a single regex pattern
pattern = '|'.join(keywords)

# Keep the rows whose title matches any keyword (na=False treats a missing title as no match)
data_adds = df[df['Title'].str.contains(pattern, na=False)]

# Calculate the mean salary for each level
data_mean = data_adds.groupby('Level')['Salary'].mean()

data_mean

Level
I     2.256501
J     5.750707
M    12.000000
S     8.891994
Name: Salary, dtype: float64
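
The substring rule described above can be sketched on a few sample titles. The <code>keywords</code> and <code>titles</code> below are hypothetical and much shorter than the real list. The sketch also escapes each keyword with <code>re.escape</code>, an extra precaution that the cell above does not need because its keywords contain no regex metacharacters.

```python
import re

import pandas as pd

# Hypothetical keyword subset and sample titles, for illustration only.
keywords = ["data analyst", "machine learning"]
titles = pd.Series([
    "Hire a Data Analyst",
    "Backend Developer",
    "Machine Learning Engineer (C++)",
    None,
])

# Escape each keyword so it is matched literally, then OR them together.
pattern = "|".join(re.escape(k) for k in keywords)

# case=False ignores letter case; na=False treats a missing title as no match.
mask = titles.str.contains(pattern, case=False, regex=True, na=False)

print(mask.tolist())  # [True, False, True, False]
```

Using <code>case=False</code> gives the same effect as lower-casing the titles first, without modifying the <code>Title</code> column.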