Document-Classification-of-Job-Titles-using-Job-Descriptions

In this task, I used a dataset of jobs posted on a job portal. The dataset was downloaded from Kaggle and had the following basic properties:

It was provided in .csv format.
It simulated the real-life scenario of jobs posted on a job portal and comprised each job's title, description, and category.

Because the data was labeled, this was a supervised machine learning problem: I had access to data that was already correctly labeled and had to train a model on this historical data. The main goal was to build a model that could accurately classify new, unseen data, i.e. assign the proper label to a job posting given to it as input. Because the data was text, the project also made extensive use of text mining techniques. Text in its basic form is unstructured, so it needs to be thoroughly pre-processed before predictive models can be developed. The model development pipeline I followed was:

Data Profiling
Data Cleansing
Exploratory Analysis
Data Preprocessing
Feature Extraction and Selection
Model Development
Model Evaluation
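As an illustration of the cleansing and preprocessing steps listed above, here is a minimal sketch in plain Python. The README does not spell out the exact cleaning rules used in the notebook, so the `clean_text` helper, the tiny `STOP_WORDS` set, and the sample sentence are all assumptions made for this example.

```python
import re

# Small illustrative stop-word list (an assumption for this sketch; a real
# project would likely use a fuller list, e.g. the one shipped with nltk).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "is", "are", "we"}


def clean_text(text):
    """Lowercase, replace non-letters with spaces, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits and punctuation become spaces
    tokens = text.split()
    # Drop stop words and single-character leftovers.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


# Hypothetical job description used only to show the effect of cleaning.
sample = "We are hiring a Senior Java Developer (5+ years) for the data team!"
tokens = clean_text(sample)
print(tokens)
# ['hiring', 'senior', 'java', 'developer', 'years', 'data', 'team']
```

Stemming or lemmatization could be added as a further step inside `clean_text`; it is left out here to keep the sketch free of external dependencies.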

When text data is pre-processed, the curse of dimensionality usually appears: the data becomes highly multi-dimensional, with features numbering in the thousands. Not all of those features are helpful, and the excess adversely affects the performance of classifiers. Following best practices, I applied well-established feature extraction methods and then feature selection techniques, so as to keep only those features that contribute to this prediction problem. For model development, I used and compared the following set of machine learning algorithms:

Bernoulli Naive Bayes
Multinomial Naive Bayes
Random Forests
Linear SVM

I compared these algorithms on metrics such as accuracy, training time, and testing time. In my analysis, the linear SVM outperformed all of the other models on accuracy. Random Forests also achieved a good accuracy score but took considerable time during the training phase.

For implementation, I used Python, specifically the following libraries: pandas, numpy, sklearn, nltk, and matplotlib.
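The feature selection and model comparison described above could be reproduced along the following lines with sklearn. This is a sketch under stated assumptions, not the notebook's actual code: the README does not name the feature extraction or selection methods, so TF-IDF and chi-squared selection are stand-ins, and the toy documents, labels, and `k=10` are invented for illustration.

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-in for the job postings (hypothetical descriptions and categories).
train_docs = [
    "develop java backend services and rest apis",
    "build python data pipelines and deploy software",
    "maintain software systems and fix backend bugs",
    "manage client accounts and close sales deals",
    "meet sales targets and grow client revenue",
    "pitch products to client and negotiate deals",
    "prepare financial statements and audit ledgers",
    "reconcile ledgers and file tax returns",
    "audit expense reports and manage tax filings",
]
train_labels = ["it"] * 3 + ["sales"] * 3 + ["finance"] * 3
test_docs = [
    "fix bugs in python backend software",
    "negotiate sales deals with client",
    "file tax returns and audit statements",
]
test_labels = ["it", "sales", "finance"]

# Feature extraction: TF-IDF (assumed; the README does not name the method).
vectorizer = TfidfVectorizer()
X_train_full = vectorizer.fit_transform(train_docs)
X_test_full = vectorizer.transform(test_docs)

# Feature selection: keep the 10 terms with the highest chi-squared score
# (assumed technique and k; fitted on the training split only).
selector = SelectKBest(score_func=chi2, k=10)
X_train = selector.fit_transform(X_train_full, train_labels)
X_test = selector.transform(X_test_full)

models = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Linear SVM": LinearSVC(),
}

results = {}
for name, model in models.items():
    # Time the training phase.
    start = time.perf_counter()
    model.fit(X_train, train_labels)
    train_s = time.perf_counter() - start

    # Time the prediction (testing) phase.
    start = time.perf_counter()
    predictions = model.predict(X_test)
    test_s = time.perf_counter() - start

    results[name] = {
        "accuracy": accuracy_score(test_labels, predictions),
        "train_s": train_s,
        "test_s": test_s,
    }

for name, r in results.items():
    print(f"{name:25s} acc={r['accuracy']:.2f} "
          f"train={r['train_s']:.4f}s test={r['test_s']:.4f}s")
```

On a dataset this small the numbers say nothing about the real ranking of the models; the point is only the shape of the comparison, with the same selected features fed to every classifier and training and testing timed separately.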

To run the code, please make sure that the latest versions of Python, Jupyter, and the aforementioned libraries are installed on your system.
