### Job Recommendation System Project Using cosine_similarity

This project focuses on developing a job recommendation system using the TfidfVectorizer and cosine similarity methods. The provided code imports necessary libraries such as pandas, nltk, and sklearn for data preprocessing and analysis. The project's goal is to recommend similar jobs based on job descriptions and titles.

The code begins by reading a dataset from the compressed CSV file 'Combined_Jobs_Final.csv.zip'. The dataset contains the columns Job.ID, Provider, Status, Slug, Title, Position, Company, City, State.Name, State.Code, Address, Latitude, Longitude, Industry, Job.Description, Requirements, Salary, Listing.Start, Listing.End, Employment.Type, Education.Required, Created.At, and Updated.At.

From the original dataset, the code selects only the 'Title' and 'Job.Description' columns, which are crucial for the recommendation system.

To build the recommendation system, the notebook performs the following steps:

1. Text Preprocessing: The job descriptions and titles are stripped of unwanted characters with a regular expression, converted to lowercase, and tokenized into individual words with nltk's word_tokenize.
2. Stopword Removal: Common English stop words are removed from the text to exclude words with low semantic value.
3. Stemming: The words are stemmed using the PorterStemmer from the nltk library, reducing them to their root form.
4. TF-IDF Vectorization: Each cleaned title is concatenated with its cleaned description, and the TfidfVectorizer from the sklearn library converts the combined text into numerical vectors. Each vector weights a word by how often it appears in that posting and how rare it is across the whole dataset.
5. Cosine Similarity: The cosine_similarity function from the sklearn library computes the similarity between every pair of postings from their TF-IDF vectors. The score ranges from 0 (no shared terms) to 1 (identical term weighting) and indicates how closely related two jobs are based on their text.
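
To make step 5 concrete, here is a minimal, dependency-free sketch of cosine similarity on raw word counts. It is a simplification: the project weights terms with TF-IDF via scikit-learn, whereas this toy version uses plain counts. The helper name `cosine` and the sample strings are made up for illustration.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw word counts (not TF-IDF)."""
    va, vb = Counter(a.split()), Counter(b.split())
    # Dot product over the words the two texts share
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)

same = cosine("line cook", "line cook")            # identical texts
partial = cosine("line cook", "line cook hotel")   # two of three words shared
none = cosine("line cook", "staff account")        # no words shared
```

Identical texts score (up to floating-point rounding) 1, texts with no words in common score 0, and partial overlap falls in between.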

For a given job title, the resulting system returns the 20 postings with the highest cosine similarity to it. The recommendations are therefore jobs whose titles and descriptions share vocabulary with the query posting, which helps users find comparable openings.

In [1]:
import re
import warnings

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv('Combined_Jobs_Final.csv.zip')

In [6]:
df.head()

Unnamed: 0,Job.ID,Provider,Status,Slug,Title,Position,Company,City,State.Name,State.Code,...,Industry,Job.Description,Requirements,Salary,Listing.Start,Listing.End,Employment.Type,Education.Required,Created.At,Updated.At
0,111,1,open,palo-alto-ca-tacolicious-server,Server @ Tacolicious,Server,Tacolicious,Palo Alto,California,CA,...,Food and Beverages,Tacolicious' first Palo Alto store just opened...,,8.0,,,Part-Time,,2013-03-12 02:08:28 UTC,2014-08-16 15:35:36 UTC
1,113,1,open,san-francisco-ca-claude-lane-kitchen-staff-chef,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef,Claude Lane,San Francisco,California,CA,...,Food and Beverages,\r\n\r\nNew French Brasserie in S.F. Financia...,,0.0,,,Part-Time,,2013-04-12 08:36:36 UTC,2014-08-16 15:35:36 UTC
2,117,1,open,san-francisco-ca-machka-restaurants-corp-barte...,Bartender @ Machka Restaurants Corp.,Bartender,Machka Restaurants Corp.,San Francisco,California,CA,...,Food and Beverages,We are a popular Mediterranean wine bar and re...,,11.0,,,Part-Time,,2013-07-16 09:34:10 UTC,2014-08-16 15:35:37 UTC
3,121,1,open,brisbane-ca-teriyaki-house-server,Server @ Teriyaki House,Server,Teriyaki House,Brisbane,California,CA,...,Food and Beverages,● Serve food/drinks to customers in a profess...,,10.55,,,Part-Time,,2013-09-04 15:40:30 UTC,2014-08-16 15:35:38 UTC
4,127,1,open,los-angeles-ca-rosa-mexicano-sunset-kitchen-st...,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,Kitchen Staff/Chef,Rosa Mexicano - Sunset,Los Angeles,California,CA,...,Food and Beverages,"Located at the heart of Hollywood, we are one ...",,10.55,,,Part-Time,,2013-07-17 15:26:18 UTC,2014-08-16 15:35:40 UTC


In [3]:
df.columns

Index(['Job.ID', 'Provider', 'Status', 'Slug', 'Title', 'Position', 'Company',
       'City', 'State.Name', 'State.Code', 'Address', 'Latitude', 'Longitude',
       'Industry', 'Job.Description', 'Requirements', 'Salary',
       'Listing.Start', 'Listing.End', 'Employment.Type', 'Education.Required',
       'Created.At', 'Updated.At'],
      dtype='object')

In [29]:
df = df[['Title','Job.Description']]

In [30]:
df.columns

Index(['Title', 'Job.Description'], dtype='object')

In [31]:
df.isnull().sum()

Title              0
Job.Description    0
dtype: int64

In [32]:
df.duplicated().sum()

73

In [33]:
df = df.drop_duplicates()  # Reassign instead of modifying the column slice in place

In [34]:
df.duplicated().sum()

0

In [5]:
df = df.sample(n=1000, random_state=42)  # Randomly sample 1000 rows from the dataframe 'df'

In [6]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re

In [7]:
ps = PorterStemmer()

In [8]:
stop_words = set(stopwords.words('english'))  # Build the stop-word set once instead of on every token

def cleaning(txt):
    # Remove every character that is not a letter, digit, whitespace, or ')'.
    # Note: ')' is part of the character class, so closing parentheses survive cleaning.
    cleaned_txt = re.sub(r'[^a-zA-Z0-9\s)]', '', txt)
    tokens = nltk.word_tokenize(cleaned_txt.lower())  # Lowercase the text, then tokenize it
    stemmed = [ps.stem(word) for word in tokens if word not in stop_words]  # Drop stop words and stem the rest
    return ' '.join(stemmed)  # Return the processed tokens as a single string

In [9]:
cleaning('AdkjAKie am not typing in 3daAddi')

'adkjaki type 3daaddi'
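
The character class in `cleaning` includes `)`, so closing parentheses pass through while opening ones are removed. The short sketch below contrasts the notebook's pattern with a variant that drops both. The variant and the sample string are illustrative assumptions, not part of the original notebook, and switching to the variant would change the outputs shown later.

```python
import re

original_pattern = r'[^a-zA-Z0-9\s)]'  # Pattern used in the notebook: keeps ')'
stricter_pattern = r'[^a-zA-Z0-9\s]'   # Hypothetical variant: removes ')' as well

sample = "Sales Rep (Part-Time) - River Front"

kept = re.sub(original_pattern, '', sample)
dropped = re.sub(stricter_pattern, '', sample)

print(repr(kept))     # 'Sales Rep PartTime)  River Front'
print(repr(dropped))  # 'Sales Rep PartTime  River Front'
```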

In [10]:
# Applying the 'cleaning' function to clean the 'Title' column in the dataframe
df['Title'] = df['Title'].apply(cleaning)

# Converting the 'Job.Description' column to string and applying the 'cleaning' function to clean it
df['Job.Description'] = df['Job.Description'].astype(str).apply(cleaning)

In [11]:
# Creating a new column 'new_col' by concatenating 'Title' and 'Job.Description' columns
df['new_col'] = df['Title'] + ' ' + df['Job.Description']


In [12]:
# Importing the required modules from Scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [13]:
# Creating an instance of the TfidfVectorizer class
tfidf = TfidfVectorizer()

# Creating a matrix of TF-IDF features for the 'new_col' column in the dataframe
matrix = tfidf.fit_transform(df['new_col'])

# Computing the cosine similarity matrix for the TF-IDF features
similarity = cosine_similarity(matrix)


In [14]:
similarity

array([[1.        , 0.04041144, 0.02581255, ..., 0.05811792, 0.02554883,
        0.08444829],
       [0.04041144, 1.        , 0.0265957 , ..., 0.03222878, 0.00517218,
        0.0235714 ],
       [0.02581255, 0.0265957 , 1.        , ..., 0.05838795, 0.03162502,
        0.03912519],
       ...,
       [0.05811792, 0.03222878, 0.05838795, ..., 1.        , 0.06390824,
        0.11548323],
       [0.02554883, 0.00517218, 0.03162502, ..., 0.06390824, 1.        ,
        0.38454495],
       [0.08444829, 0.0235714 , 0.03912519, ..., 0.11548323, 0.38454495,
        1.        ]])
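
For reference, the entries of this matrix follow the standard cosine similarity definition. Assuming `TfidfVectorizer` runs with its default settings (smoothed IDF and L2 normalization), every non-empty document vector has unit length, which is why the diagonal is 1 and the matrix is symmetric. The symbols are introduced here for illustration: $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $x_i$ is the TF-IDF vector of posting $i$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{sim}(i, j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}
```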

In [23]:
# Function to generate job recommendations
def recommendation(title):
    # 'title' must match a cleaned (lowercased, stemmed) value in df['Title']; an unknown title raises IndexError
    label = df[df['Title'] == title].index[0]  # Index label of the first posting with this title
    idx = df.index.get_loc(label)  # Convert the label to a row position in the similarity matrix
    # Rank all postings by similarity to the query, highest first; skip position 0 (normally the query itself) and keep the next 20
    distances = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)[1:21]

    # Look up the titles of the 20 most similar postings
    jobs = [df.iloc[i].Title for i, _ in distances]
    return jobs
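
Slicing with `[1:21]` assumes the query posting itself sorts first, which is not guaranteed when another posting has the same cleaned text and ties at a similarity of 1. A minimal sketch of a ranking helper that excludes the query position explicitly is shown below. `top_k_similar` and the toy matrix are hypothetical names and data for illustration, not part of the notebook.

```python
from typing import List, Sequence

def top_k_similar(similarity: Sequence[Sequence[float]], idx: int, k: int = 20) -> List[int]:
    """Return positions of the k rows most similar to row idx, excluding idx itself."""
    scored = [(j, score) for j, score in enumerate(similarity[idx]) if j != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # Highest similarity first
    return [j for j, _ in scored[:k]]

# Toy 4x4 similarity matrix; rows 0 and 1 stand for two postings with identical text
toy = [
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.2, 0.5],
    [0.2, 0.2, 1.0, 0.1],
    [0.5, 0.5, 0.1, 1.0],
]

neighbours = top_k_similar(toy, idx=1, k=2)
print(neighbours)  # [0, 3]
```

The returned positions could then be passed to `df.iloc` to look up titles, as the notebook's `recommendation` function does.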

In [16]:
df['Title']

64119                       site director knowledg univers
35827                          administr assist officeteam
72100                     account manag chi payment system
46355    outsid wholesal sale rep parttim ) river front...
34166    custom servic rep help peopl hear loss captioncal
                               ...                        
66282    sale repres sale associ entri level ) vector m...
39515                             staff account accountemp
69231                    unarm secur offic us secur associ
69618              line cook crown plaza independ own oper
6144     kitchen manag job detail artesian hotel casino...
Name: Title, Length: 1000, dtype: object

### Top 20 recommended jobs

In [24]:
recommendation('administr assist officeteam')

['administr assist officeteam',
 'administr assist officeteam',
 'administr assist officeteam',
 'administr assist nonprofit officeteam',
 'administr assist officeteam',
 'administr assist officeteam',
 'administr assist officeteam',
 'seek colleg graduat amaz opportuni officeteam',
 'administr assist officeteam',
 'offic manag accountemp',
 'administr assist officeteam',
 'front desk coordin reput financ organ officeteam',
 'administr assist officeteam',
 'front desk coordin officeteam',
 'part time organiz guru officeteam',
 'administr assist long term potenti officeteam',
 'benefit administr long term contract accountemp',
 'administr assist officeteam',
 'temporari administr assist need asap officeteam',
 'season wed sale stylist david bridal']

In [None]:
import pickle


In [None]:
# # Save df in the desired directory
# pickle.dump(df, open('C:\\Users\\mrpai\\OneDrive\\Desktop\\df8.pkl', 'wb'))

# # Save similarity in the desired directory
# pickle.dump(similarity, open('C:\\Users\\mrpai\\OneDrive\\Desktop\\similarity8.pkl', 'wb'))
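
If the objects are saved, opening the file in a `with` block ensures the handle is closed after writing. The sketch below round-trips a small stand-in object through a temporary directory. The file name `similarity_demo.pkl` and the stand-in data are placeholders, not the notebook's actual paths or arrays.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the dataframe and similarity matrix (placeholder data)
payload = {"titles": ["line cook", "staff account"], "scores": [[1.0, 0.0], [0.0, 1.0]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "similarity_demo.pkl"  # Placeholder file name

    with path.open("wb") as fh:  # The file is closed automatically on exit
        pickle.dump(payload, fh)

    with path.open("rb") as fh:
        restored = pickle.load(fh)

print(restored == payload)  # True
```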