# Job Recommendation System

## What is a recommendation system?


Broadly, recommender systems are algorithms that suggest relevant items to users.

A job recommendation system finds suitable jobs for an applicant based on the preferences entered in chosen categories (city, domain, salary, etc.).

#### Types of Recommendation systems: 

- Collaborative Filtering 
- Content-Based Filtering 
- Hybrid Recommendation Systems

## Content Based Recommenders

Content-based filtering methods are based on a description of the item and a profile of the user's preferences. The recommendations are primarily based on keywords provided by the user or picked up by the system (based on previous selections). The 'best-matched' items are recommended. <br>
<br>
This approach is useful when there isn't much information about the user prior to the search. For this reason, such systems are also said to rely on item-item interactions.

Example: At Pandora, a team of musicians labeled each song with more than 400 attributes. When a user selects a music station, songs that match the station's attributes are added to the playlist.
The drawback for Pandora is that creating these music attributes requires manual effort and cost.

##### Similarity Between Content

Text A: London Paris London

Text B: Paris Paris London

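To measure how similar the two texts are, each text is represented as a vector of word counts and the similarity is taken as the cosine of the angle θ between the two vectors.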
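The cosine of that angle can be worked out by hand for the two texts. This is a minimal sketch, assuming the vector components are ordered (London, Paris); the helper name `cosine` is illustrative and not part of any library.

```python
import math

# Word counts in the order (London, Paris)
text_a = [2, 1]  # "London Paris London"
text_b = [1, 2]  # "Paris Paris London"

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(round(cosine(text_a, text_b), 2))  # 0.8
```

The dot product is 2·1 + 1·2 = 4 and each vector has length √5, so the cosine is 4/5.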

<img src="http://www.codeheroku.com/static/blog/images/pid14_find_cos_theta.png" align="left" style="width:500px; height:300px">





### Code for Cosine Similarity

Sample texts stored in 'text'

In [None]:
text = ["London Paris London", "Paris Paris London"]

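'CountVectorizer' converts a collection of text documents to a matrix of token counts.

We use it to count word frequencies in the corpus. Each column corresponds to one word of the vocabulary (here 'london' and 'paris', in alphabetical order) and each row to one text.

Matrix:

[frequencies in first sentence]

[frequencies in second sentence]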

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
count_matrix = cv.fit_transform(text)
print(count_matrix.toarray())

[[2 1]
 [1 2]]
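To make the count matrix above less of a black box, the same numbers can be reproduced with the standard library. This is a simplified sketch and not scikit-learn's implementation: it assumes plain lowercasing and whitespace splitting, and the names `tokenized`, `vocabulary` and `manual_counts` are illustrative.

```python
from collections import Counter

text = ["London Paris London", "Paris Paris London"]

# Lowercase and split on whitespace (a simplification of CountVectorizer's tokenizer)
tokenized = [doc.lower().split() for doc in text]

# The sorted vocabulary fixes the column order
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per vocabulary word
manual_counts = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]

print(vocabulary)     # ['london', 'paris']
print(manual_counts)  # [[2, 1], [1, 2]]
```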


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_scores = cosine_similarity(count_matrix)
print(similarity_scores)

[[1.  0.8]
 [0.8 1. ]]


Reading the matrix: Text A is 100% similar to itself (position [0,0]) and 80% similar to Text B (position [0,1]).

*Example:* The user wants to find a job with the requirements based on Title, Location and Company Name

## Scraping the Data

We used real-time jobs posted on Naukri.com to build this recommendation system. <br>

*Note*: Dataset scraped on 26th March 2020

### What did we want to scrape?

<img src="https://i.ibb.co/ZztsFbb/scraping2.png" align="left" style="width:800px; height:300px">

### Limiting our scraping scope to: <br>
Data Scientist, SDE, App Developer and QA jobs

<img src="https://i.ibb.co/808KJWr/scraping1.png" align="left" style="width:800px; height:300px">

### Libraries required for building a Web Scraper

In [None]:
from selenium import webdriver   # Used for automated scraping on Chrome
from bs4 import BeautifulSoup    # Python's HTML parser
from csv import DictWriter       # Writes dictionaries as rows of a CSV file
from time import sleep           # Pauses between requests so the scraper is not banned by the website
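As an illustration of how the last two imports fit together, the sketch below writes a few records with `DictWriter` and pauses between them. The records, the field names and the in-memory buffer are hypothetical stand-ins; the actual scraper's columns and delays may differ.

```python
import io
from csv import DictWriter
from time import sleep

# Hypothetical records standing in for scraped job cards
scraped_jobs = [
    {"title": "Data Scientist", "company": "Example Corp", "location": "Bangalore"},
    {"title": "QA Engineer", "company": "Sample Ltd", "location": "Pune"},
]

buffer = io.StringIO()  # in-memory stand-in for an output CSV file
writer = DictWriter(buffer, fieldnames=["title", "company", "location"])
writer.writeheader()

for job in scraped_jobs:
    writer.writerow(job)
    sleep(0.1)  # short pause; a real scraper would typically wait longer between pages

print(buffer.getvalue())
```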

### Example of scraping one category 

The code below looks for the 'Job Title' in the div block. <br>
If found, the value is stored in title (try block).<br>
Otherwise, the value of title is 'None' (except block).

In [None]:
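try:
    # soup.find() returns None when no matching tag exists, so .text raises AttributeError
    title = soup.find("a", class_="title").text.replace('\n', '')
except AttributeError:
    title = 'None'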

Similarly, the other attributes such as location, company, salary, etc. are extracted.<br>
Two separate categories called 'Trending' and 'Sponsored' are scraped as well.

## Pre-Processing the Data

The dataset used in this project was obtained by scraping the naukri.com site for certain job titles. Only some of the scraped features are used.
<br>
The required features are extracted from the DataFrame. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import chain

itertools is a Python module with a collection of functions for handling iterators. They make it easy to iterate through iterables such as lists and strings. One such itertools function is chain().
<br>
<br>
The internal working of chain can be implemented as given below:

In [None]:
def chain(*iterables):
    # Yield every element of each iterable, one iterable after another
    for it in iterables:
        for each in it:
            yield each
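A short usage example of the standard-library `chain` follows. The location lists are made up for illustration; the point is only to show several lists being flattened into one sequence.

```python
from itertools import chain

# Hypothetical location lists for three job postings
locations_per_job = [["Bangalore", "Pune"], ["Mumbai"], ["Delhi", "Chennai"]]

# chain(*lists) walks each inner list in turn, yielding one flat sequence
all_locations = list(chain(*locations_per_job))
print(all_locations)
# ['Bangalore', 'Pune', 'Mumbai', 'Delhi', 'Chennai']

# chain.from_iterable gives the same result without unpacking the outer list
flat_again = list(chain.from_iterable(locations_per_job))
```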

### Dataset before Pre-Processing:

![alt text](https://i.ibb.co/s677ZPk/dataset.jpg)

- 'NaN' or 'none' values are replaced with 0.
- Missing values marked 'not disclosed' are filled with aggregates (e.g. salary). This was possible because each job title's scraped data is stored in a different file.
- Text within brackets is removed along with the brackets.
- Years of experience are given as a range (e.g. 3-5yrs); only the minimum number of years required is kept (here 3).
- Repetitive itemsets are removed.
- If a single job is available in multiple cities/locations, it is stored as separate itemsets, one per location (multivalued attributes).
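Some of the steps above can be sketched with the standard library. This is not the code from the linked Preprocessing.py: the function names are hypothetical, and the sketch assumes that brackets are round parentheses and that multiple locations are separated by commas.

```python
import re

def strip_brackets(text):
    """Remove bracketed text together with the brackets."""
    return re.sub(r"\s*\([^)]*\)", "", text).strip()

def min_experience(text):
    """Return the lower bound of a range such as '3-5yrs', or 0 if no number is found."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else 0

def split_locations(job):
    """Expand a job listed in several cities into one record per city."""
    cities = [c.strip() for c in job["location"].split(",") if c.strip()]
    return [{**job, "location": city} for city in cities]

print(strip_brackets("Data Scientist (Contract)"))  # Data Scientist
print(min_experience("3-5yrs"))                     # 3
print(split_locations({"title": "SDE", "location": "Pune, Mumbai"}))
```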

All of the above techniques have been applied to the dataset and can be observed in <b><em>https://github.com/dhruvshettty/job-recommender/blob/master/Data_Preprocessing/Preprocessing.py</em></b>

## Building the Recommendation Engine

*Jobs.csv* contains all the scraped jobs.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("Jobs.csv")

The preprocessed data is loaded into a DataFrame. NaN values in the selected features are first replaced with empty strings, and combine_features() is then applied to join the required features into a single string column. 
<br>
After that, CountVectorizer is applied to the combined column and cosine similarity is computed between the resulting count vectors.

In [None]:
# Join the selected features into one string so each job is treated as a single text
# The spaces between the values keep them as separate tokens
def combine_features(row):
    # str() guards against numeric values, e.g. in experience_yrs
    return str(row['title']) + " " + str(row['location']) + " " + str(row['experience_yrs'])

In [None]:
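from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

features = ['title', 'location', 'experience_yrs']  # columns used by combine_features
for feature in features:
    df[feature] = df[feature].fillna('')  # fill all NaNs with a blank string
df["combined_features"] = df.apply(combine_features, axis=1)

# Note: the default token_pattern ignores single-character tokens such as '5'
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combined_features"])

cosine_sim = cosine_similarity(count_matrix)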

The user may enter the values for the parameters selected above in order to get the similar results from the engine.

In [None]:
req_title = "Data Scientist"
req_location = "Bangalore"
req_experience = "5"

Two helper functions map between row indices and jobs: one returns the details of the job at a given index, and the other finds the index of the job that matches the user's requirements.

In [None]:
def get_job_from_index(Index):
    # Return (company, title, location, experience) for the row with this Index
    row = df[df["Index"] == Index].iloc[0]
    return row["Company_Name"], row["title"], row["location"], row["experience_yrs"]

def get_index_from_job(Job_Title, Location, Job_Salary):
    # Only the title is used for the lookup; Location and Job_Salary are currently unused
    return df[df["title"] == Job_Title]["Index"].values[0]

We access the row of the similarity matrix that corresponds to this job. This gives the similarity scores of all other jobs with respect to the user requirements above. 
<br>
We then enumerate the similarity scores of that job to build tuples of (job index, similarity score).

In [None]:
req_index = get_index_from_job(req_title, req_location, req_experience)

similar_jobs = list(enumerate(cosine_sim[req_index])) 

sorted_similar_jobs = sorted(similar_jobs,key=lambda x:x[1],reverse=True)
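The enumerate-and-sort step can be seen on a small example. The similarity row below is made up for illustration, with the requested job at index 0.

```python
# Hypothetical similarity row for the requested job (index 0) against four jobs
similarity_row = [1.0, 0.35, 0.8, 0.5]

# Pair each score with its job index: [(0, 1.0), (1, 0.35), ...]
indexed_scores = list(enumerate(similarity_row))

# Highest similarity first
ranked = sorted(indexed_scores, key=lambda pair: pair[1], reverse=True)
print(ranked)
# [(0, 1.0), (2, 0.8), (3, 0.5), (1, 0.35)]

# The top entry is the requested job itself, so it can be skipped when recommending
recommendations = ranked[1:]
```

Because every job has a similarity of 1 with itself, the requested job always heads the sorted list.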

Finally, we loop over sorted_similar_jobs to output the first 10 entries.
The jobs are looked up by index so that the most likely match appears first and the rest follow in descending order of similarity.
<br>


Once the results are sorted, we write them to a CSV file.

In [None]:
import csv

with open('Result.csv', 'w', newline='') as file:
    csv_output = csv.writer(file)
    csv_output.writerow(['company', 'title', 'city', 'exp'])
    # Write only the 10 most similar jobs
    for job in sorted_similar_jobs[:10]:
        data = get_job_from_index(job[0])
        csv_output.writerow(data)

![alt text](https://i.ibb.co/P4LZCwk/Screenshot-12.png)

## Future:
1) Add more user inputs like Salary, Skills, etc. <br>
2) Remove entries where the years-of-experience criterion isn't met <br>