Sanggoroe, a platform to search and provide information related to teaching job vacancies, can help students, fresh graduates, or anyone who wants to gain teaching experience by providing information related to teaching job vacancies.

Tahapan membuat model ML :

1. [Collecting Data](#step-1) ✅
2. [Exploratory Data Analysis](#step-2)
3. [Data Preprocessing](#step-3)
5. [Training Model](#step-4)
5. [Evaluasi Model](#step-5) 
6. [Deployment](#step-6) 


<a name="step-1"></a>
## **1. Data Collecting**

In [2]:
# Import dataset
import pandas as pd
import numpy as np

lowongan =  pd.read_csv('Lowongan.csv', sep=',')

<a name="step-2"></a>
## **2. Exploratory Data Analysis**

Exploratory data analysis atau EDA bertujuan sebagai analisa awal terhadap data dan melihat bagaimana kualitas data untuk meminimalkan potensi kesalahan di kemudian hari. Pada proses ini dilakukan investigasi awal pada data untuk menemukan pola, anomali, menguji hipotesis , memahami distribusi, frekuensi, hubungan antar variabel, dan memeriksa asumsi dengan teknik statistik dan representasi grafik. <br>
Source: Dicoding

<a name="step-3"></a>
## **3. Data Preprocessing**
Data preprocessing adalah tahap di mana data diolah lebih lanjut sehingga menjadi siap dipakai dalam pengembangan model ML. Dengan kata lain, proses ini mengubah dan mentransformasi fitur-fitur data ke dalam bentuk yang mudah diinterpretasikan dan diproses oleh algoritma machine learning. Termasuk di dalam data preprocessing adalah proses data cleaning dan data transformation.
Source: Dicoding


In [3]:
# Menampilkan 5 data teratas
lowongan.head()

Unnamed: 0,JobID,Jenjang Pendidikan,Kualifikasi Pekerjaan,Posisi Pekerjaan,Gender,Lokasi Pekerjaan(Prov),Lokasi Pekerjaan(Kota),Pengalaman Kerja (bulan),Skill yang dibutuhkan,Gaji,Jenis Pekerjaan,Perusahaan,Tanggal Posting,Kategori Perusahaan,Link,Deskripsi
0,1,Kindergarten,Bachelor Degree,Kindergarten Teacher,General,Jawa Barat,Bogor,24,Bahasa Inggris,0,Full time,PT Tunas Tuju Asa,2023-05-16,Sekolah,PRESCHOOL & KINDERGARTEN TEACHERS FOR KINDERFI...,Available for working in the new academic year...
1,2,Kindergarten,Bachelor Degree,Religious Education Teachers,General,Banten,Tangerang,24,Bahasa Inggris,0,Full time,PT Kinder Haven Pusaka,2023-05-18,Sekolah,Guru Agama (Religion Student Counselor) - PT K...,"Create environments, activities, and programmi..."
2,3,Kindergarten,Bachelor Degree,Kindergarten Teacher,General,DI Yogyakarta,Sleman,24,,2200000,Full time,Yayasan Pendidikan Blue Dolphin International,2023-05-18,Sekolah,Preschool and Kindergarten Teacher - Yay. Pend...,We are a well-grown English-speaking Pre-schoo...
3,4,Kindergarten,Bachelor Degree,Kindergarten Teacher,General,Jawa Timur,Surabaya,0,Bahasa Inggris,0,Full time,Yayasan Intan Eduka,2023-05-18,Sekolah,KINDERGARTEN TEACHER - Yayasan Intan Eduka - 4...,Preparing and delivering lessons to a range of...
4,5,Elementary School,Diploma Degree,Natural Sciences Teacher,General,Jawa Barat,Depok,0,Bahasa Inggris,2000000,Full time,Pratiwi School Depok,2023-05-17,Sekolah,Science and Mathematics Teachers - Pratiwi Sch...,"Age max 35 years old, Bachelor degree in Educa..."


In [4]:
list(lowongan)

['JobID',
 'Jenjang Pendidikan',
 'Kualifikasi Pekerjaan',
 'Posisi Pekerjaan',
 'Gender',
 'Lokasi Pekerjaan(Prov)',
 'Lokasi Pekerjaan(Kota)',
 'Pengalaman Kerja (bulan)',
 'Skill yang dibutuhkan',
 'Gaji',
 'Jenis Pekerjaan',
 'Perusahaan',
 'Tanggal Posting',
 'Kategori Perusahaan',
 'Link',
 'Deskripsi']

In [5]:
print(lowongan.shape)
lowongan.isnull().sum()

(140, 16)


JobID                        0
Jenjang Pendidikan           0
Kualifikasi Pekerjaan        0
Posisi Pekerjaan             0
Gender                       0
Lokasi Pekerjaan(Prov)       2
Lokasi Pekerjaan(Kota)      18
Pengalaman Kerja (bulan)     0
Skill yang dibutuhkan       38
Gaji                         0
Jenis Pekerjaan              0
Perusahaan                   0
Tanggal Posting              3
Kategori Perusahaan          0
Link                         0
Deskripsi                    0
dtype: int64

In [6]:
#subsetting only needed columns and not considering the columns that are not necessary
cols = list(['JobID']+['Posisi Pekerjaan']+ ['Perusahaan']+['Lokasi Pekerjaan(Kota)']+['Jenis Pekerjaan']+['Kualifikasi Pekerjaan']+['Deskripsi'])
final_lowongan =lowongan[cols]
final_lowongan.columns = ['Job.ID','Position','Company','City', 'Empl_type','Edu_req','Job_Description']
final_lowongan.head() 

Unnamed: 0,Job.ID,Position,Company,City,Empl_type,Edu_req,Job_Description
0,1,Kindergarten Teacher,PT Tunas Tuju Asa,Bogor,Full time,Bachelor Degree,Available for working in the new academic year...
1,2,Religious Education Teachers,PT Kinder Haven Pusaka,Tangerang,Full time,Bachelor Degree,"Create environments, activities, and programmi..."
2,3,Kindergarten Teacher,Yayasan Pendidikan Blue Dolphin International,Sleman,Full time,Bachelor Degree,We are a well-grown English-speaking Pre-schoo...
3,4,Kindergarten Teacher,Yayasan Intan Eduka,Surabaya,Full time,Bachelor Degree,Preparing and delivering lessons to a range of...
4,5,Natural Sciences Teacher,Pratiwi School Depok,Depok,Full time,Diploma Degree,"Age max 35 years old, Bachelor degree in Educa..."


In [7]:
final_lowongan.isnull().sum()

Job.ID              0
Position            0
Company             0
City               18
Empl_type           0
Edu_req             0
Job_Description     0
dtype: int64

In [8]:
nan_city = final_lowongan[pd.isnull(final_lowongan['City'])]
print(nan_city.shape)
nan_city.head()

(18, 7)


Unnamed: 0,Job.ID,Position,Company,City,Empl_type,Edu_req,Job_Description
6,7,Physical Education and Health Teachers,Yayasan Tunas Muda,,Full time,Bachelor Degree,"Should be an inspirational, passionate and ada..."
24,25,Digital Marketing,Sekolah Citra Kasih - Sekolah Ciputra Kasih - ...,,Full time,Bachelor Degree,"Maximum age of 35 years old, Bachelor's degree..."
31,32,Christian Religion Curriculum Staff,Sekolah Citra Kasih - Sekolah Ciputra Kasih - ...,,Full time,Bachelor Degree,"Maximum age of 35 years old, Bachelor's degree..."
34,35,Homeroom teacher,Sekolah Citra Kasih - Sekolah Ciputra Kasih - ...,,Internship,High School Diploma,"Cover letter, curriculum vitae, Identity Card,..."
38,39,English teacher,Sekolah Citra Kasih - Sekolah Ciputra Kasih - ...,,Internship,High School Diploma,"Cover letter, curriculum vitae, Identity Card,..."


In [9]:
nan_city.groupby(['Company'])['City'].count() 

Company
Sekolah Citra Kasih - Sekolah Ciputra Kasih - Sekolah Citra Berkat    0
Yayasan Tunas Muda                                                    0
Name: City, dtype: int64

In [10]:
final_lowongan.loc[final_lowongan.Company == 'Privat', 'City'] = 'Makassar'
final_lowongan.loc[final_lowongan.Company == 'Sekolah Citra Kasih - Sekolah Ciputra Kasih - Sekolah Citra Berkat'] = 'Jakarta'
final_lowongan.loc[final_lowongan.Company == 'Yayasan Tunas Muda', 'City'] = 'Jakarta'

In [11]:
final_lowongan.isnull().sum()

Job.ID             0
Position           0
Company            0
City               0
Empl_type          0
Edu_req            0
Job_Description    0
dtype: int64

####**Corpus**
######combining the columns of position,company,city,emp_type and position

In [12]:
final_lowongan["pos_com_city_empType_jobDesc"] = final_lowongan["Position"].map(str) + " " + final_lowongan["Company"] +" "+ final_lowongan["City"]+ " "+final_lowongan['Empl_type']+" "+final_lowongan['Job_Description']
final_lowongan.pos_com_city_empType_jobDesc.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_lowongan["pos_com_city_empType_jobDesc"] = final_lowongan["Position"].map(str) + " " + final_lowongan["Company"] +" "+ final_lowongan["City"]+ " "+final_lowongan['Empl_type']+" "+final_lowongan['Job_Description']


0    Kindergarten Teacher PT Tunas Tuju Asa Bogor F...
1    Religious Education Teachers PT Kinder Haven P...
2    Kindergarten Teacher Yayasan Pendidikan Blue D...
3    Kindergarten Teacher Yayasan Intan Eduka Surab...
4    Natural Sciences Teacher Pratiwi School Depok ...
Name: pos_com_city_empType_jobDesc, dtype: object

In [13]:
#removing unnecessary characters between words separated by space between each word of all columns to make the data efficient
final_lowongan['pos_com_city_empType_jobDesc'] = final_lowongan['pos_com_city_empType_jobDesc'].str.replace('[^a-zA-Z \n\.]'," ") #removing unnecessary characters
final_lowongan.pos_com_city_empType_jobDesc.head()

  final_lowongan['pos_com_city_empType_jobDesc'] = final_lowongan['pos_com_city_empType_jobDesc'].str.replace('[^a-zA-Z \n\.]'," ") #removing unnecessary characters
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_lowongan['pos_com_city_empType_jobDesc'] = final_lowongan['pos_com_city_empType_jobDesc'].str.replace('[^a-zA-Z \n\.]'," ") #removing unnecessary characters


0    Kindergarten Teacher PT Tunas Tuju Asa Bogor F...
1    Religious Education Teachers PT Kinder Haven P...
2    Kindergarten Teacher Yayasan Pendidikan Blue D...
3    Kindergarten Teacher Yayasan Intan Eduka Surab...
4    Natural Sciences Teacher Pratiwi School Depok ...
Name: pos_com_city_empType_jobDesc, dtype: object

######**Case Folding**

In [14]:
#converting all the characeters to lower case
final_lowongan['pos_com_city_empType_jobDesc'] = final_lowongan['pos_com_city_empType_jobDesc'].str.lower() 
final_lowongan.pos_com_city_empType_jobDesc.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_lowongan['pos_com_city_empType_jobDesc'] = final_lowongan['pos_com_city_empType_jobDesc'].str.lower()


0    kindergarten teacher pt tunas tuju asa bogor f...
1    religious education teachers pt kinder haven p...
2    kindergarten teacher yayasan pendidikan blue d...
3    kindergarten teacher yayasan intan eduka surab...
4    natural sciences teacher pratiwi school depok ...
Name: pos_com_city_empType_jobDesc, dtype: object

In [15]:
final_all = final_lowongan[['Job.ID', 'pos_com_city_empType_jobDesc']]
# renaming the column name as it seemed a bit complicated
final_all = final_lowongan[['Job.ID', 'pos_com_city_empType_jobDesc']]
final_all = final_all.fillna(" ")

final_all.head()

Unnamed: 0,Job.ID,pos_com_city_empType_jobDesc
0,1,kindergarten teacher pt tunas tuju asa bogor f...
1,2,religious education teachers pt kinder haven p...
2,3,kindergarten teacher yayasan pendidikan blue d...
3,4,kindergarten teacher yayasan intan eduka surab...
4,5,natural sciences teacher pratiwi school depok ...


Here comes the important concept call Stop words. Stop words are natural language words which have very little meaning, such as "and", "the", "a", "an", and similar words. We use NLP where NLTK ( Natural Language Toolkit ) is used for stopwords. Text may contain stop words like ‘the’, ‘is’, ‘are’. Stop words can be filtered from the text to be processed. There is no universal list of stop words in nlp research, however the nltk module contains a list of stop words.

The next package used in here is stemming, The idea of stemming is a sort of normalizing method. Many variations of words carry the same meaning, other than when tense is involved. So in order to clean up the space, we use stemming method and the one of the packages used here is PorterStemmer

In [None]:
print(final_all.head(1))
#We can import stopwords from nltk.corpus as below. With that, We exclude stopwords with Python's list comprehension and pandas.DataFrame.apply.

  Job.ID                       pos_com_city_empType_jobDesc
0      1  kindergarten teacher pt tunas tuju asa bogor f...


###**NLTK stands for Natural Languague Toolkit**
It removes stop words such as the, is, and etc..

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
pos_com_city_empType_jobDesc = final_all['pos_com_city_empType_jobDesc']
#removing stopwords and applying potter stemming
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
stemmer =  PorterStemmer()
stop = stopwords.words('english')
only_text = pos_com_city_empType_jobDesc.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
only_text.head()

0    kindergarten teacher pt tunas tuju asa bogor f...
1    religious education teachers pt kinder pusaka ...
2    kindergarten teacher yayasan pendidikan blue d...
3    kindergarten teacher yayasan intan eduka surab...
4    natural sciences teacher pratiwi school depok ...
Name: pos_com_city_empType_jobDesc, dtype: object

Splitting each word in a row separated by space

In [None]:
only_text = only_text.apply(lambda x : filter(None,x.split(" ")))
print(only_text.head())

0    <filter object at 0x7fa710869b40>
1    <filter object at 0x7fa710869db0>
2    <filter object at 0x7fa710868a90>
3    <filter object at 0x7fa710868ac0>
4    <filter object at 0x7fa71086b8e0>
Name: pos_com_city_empType_jobDesc, dtype: object


Here **stemming** is basically used to remove the suffixes and common words that repeat and separated by commas. for y in x means, for each word(y) in the total list(x)

In [None]:
only_text = only_text.apply(lambda x : [stemmer.stem(y) for y in x])
print(only_text.head())

0    [kindergarten, teacher, pt, tuna, tuju, asa, b...
1    [religi, educ, teacher, pt, kinder, pusaka, ta...
2    [kindergarten, teacher, yayasan, pendidikan, b...
3    [kindergarten, teacher, yayasan, intan, eduka,...
4    [natur, scienc, teacher, pratiwi, school, depo...
Name: pos_com_city_empType_jobDesc, dtype: object


In the above code, we separated each letter in a word separated by comma, now, in this step, we join the words(x)

In [None]:
only_text = only_text.apply(lambda x : " ".join(x))
print(only_text.head())

0    kindergarten teacher pt tuna tuju asa bogor fu...
1    religi educ teacher pt kinder pusaka tangerang...
2    kindergarten teacher yayasan pendidikan blue d...
3    kindergarten teacher yayasan intan eduka surab...
4    natur scienc teacher pratiwi school depok depo...
Name: pos_com_city_empType_jobDesc, dtype: object


In [None]:
#adding the featured column back to pandas
final_all['text']= only_text
# As we have added a new column by performing all the operations using lambda function, we are removing the unnecessary column
#final_all = final_all.drop("pos_com_city_empType_jobDesc", 1)

list(final_all)
final_all.head()

Unnamed: 0,Job.ID,pos_com_city_empType_jobDesc,text
0,1,kindergarten teacher pt tunas tuju asa bogor f...,kindergarten teacher pt tuna tuju asa bogor fu...
1,2,religious education teachers pt kinder haven p...,religi educ teacher pt kinder pusaka tangerang...
2,3,kindergarten teacher yayasan pendidikan blue d...,kindergarten teacher yayasan pendidikan blue d...
3,4,kindergarten teacher yayasan intan eduka surab...,kindergarten teacher yayasan intan eduka surab...
4,5,natural sciences teacher pratiwi school depok ...,natur scienc teacher pratiwi school depok depo...


In [None]:
# in order to save this file for a backup
final_all.to_csv("job_data.csv", index=True)

##TF-IDF ( Term Frequency - Inverse Document Frequency )
This method is also called as Normalization. TF - How many times a particular word appears in a single doc. IDF - This downscales words that appear a lot across documents.

There is a difference between fit (unique words are created / vectorization) and fit_transform. fit means fit transform and transform( adding the row internally adding 1 in the users input query) ex : innovate is not present then then it will not create ( loosing the efficiency)

In [None]:
#initializing tfidf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.feature_extraction.text import CountVectorizer

tfidf_vectorizer = TfidfVectorizer()

tfidf_jobid = tfidf_vectorizer.fit_transform((final_all['text'])) #fitting and transforming the vector
tfidf_jobid

<140x376 sparse matrix of type '<class 'numpy.float64'>'
	with 1856 stored elements in Compressed Sparse Row format>