# Assignment 2: Milestone I Natural Language Processing
## Task 2&3
#### Student Name: Krithik Vasan BAskar
#### Student ID: s3933152

Date: 01 - October - 2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* nlkt
* sklearn
* collections
* itertools

## Introduction
In Task 2, the objective is to generate various types of feature representations for a collection of job advertisements, with a specific focus on the descriptions within those job advertisements. The task involves creating three distinct feature representations:

1. **Bag-of-Words Model (Count Vector Representation)**:
   - Generate the count vector representation for each job advertisement description.
   - Save these count vectors into a file, following a specific format.
   - The count vector representations will be based on the vocabulary created in Task 1, which was saved in vocab.txt.

2. **Models Based on Word Embeddings**:
   - Choose one embedding language model, such as FastText, GoogleNews300, or another pre-trained Word2Vec model, or Glove.
   - Build two types of document embeddings for each job advertisement description:
     - **TF-IDF Weighted Vector Representation**: Generate a weighted vector representation using the chosen language model and TF-IDF weighting.
     - **Unweighted Vector Representation**: Create an unweighted vector representation for each description using the chosen language model.

The output for Task 2 will include the following:

- **count_vectors.txt**: This file will store the sparse count vector representations of job advertisement descriptions. Each line in this file corresponds to one advertisement and follows a specific format. It includes the webindex of the job advertisement, a comma, and the sparse representation of the description in the form of word_integer_index:word_freq, separated by commas.

Task 2 is crucial for transforming the textual data from job advertisements into numerical feature representations, which can be used for various downstream tasks such as text classification or clustering. These feature representations capture the essence of the job descriptions, enabling further analysis and modeling of the dataset.

## Importing libraries 

In [1]:
# Code to import libraries as you need in this assessment, e.g.,
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
import nltk
import re

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

In [2]:
# Initialize an empty list to store the vocabulary
vocab  = []

with open('vocab.txt', 'r') as file:
    # Read each line in the file
    for line in file:
        # Split the line at the colon (":") character
        parts = line.strip().split(':')
        if len(parts) > 1:
            # Get the word before the colon and add it to the vocabulary list
            word = parts[0]
            vocab.append(word)

# Print the vocabulary list
vocab

['aap',
 'aaron',
 'aat',
 'abb',
 'abenefit',
 'aberdeen',
 'abi',
 'abilities',
 'abreast',
 'abroad',
 'absence',
 'absolute',
 'ac',
 'aca',
 'academic',
 'academy',
 'acca',
 'accept',
 'acceptable',
 'acceptance',
 'accepted',
 'access',
 'accessible',
 'accident',
 'accommodates',
 'accommodation',
 'accomplished',
 'accordance',
 'account',
 'accountabilities',
 'accountability',
 'accountable',
 'accountancy',
 'accountant',
 'accountants',
 'accounting',
 'accounts',
 'accreditation',
 'accredited',
 'accruals',
 'accuracy',
 'accurate',
 'accurately',
 'achievable',
 'achieve',
 'achieved',
 'achievement',
 'achievements',
 'achiever',
 'achieving',
 'acii',
 'acquired',
 'acquisition',
 'acquisitions',
 'act',
 'acting',
 'action',
 'actions',
 'actionscript',
 'active',
 'actively',
 'activites',
 'activities',
 'activity',
 'acts',
 'actual',
 'actuarial',
 'acumen',
 'acute',
 'ad',
 'adam',
 'adapt',
 'adaptability',
 'add',
 'added',
 'addiction',
 'adding',
 'addition

In [3]:
# Initialize an empty list to store the job description
job_desc = []

# Open the text file for reading
with open('job_description.txt', 'r') as file:
    # Read each line in the file
    for line in file:
        # Append each line (job description) to the job_desc list
        job_desc.append(line.strip())

# Print the job_desc list
job_desc

['accountant partqualified south east london manufacturing requirement accountant permanent modern offices south east london credit control purchase ledger daily collection debts phone letter email handling ledger accounts handling accounts negotiating payment terms cash reconciliation accounts adhoc administration duties person ideal previous credit control capacity possess exceptional customer communication part fully qualified accountant considered',
 'hedge funds london recruiting fund accountant paying outstanding west end report head fund accounting number fund accountants senior fund accountants responsible fund accounting number hedge funds dealing equity related products involves aspects fund accounting preparation journal voucher entries nav control part nav review fund accountant reviews cash securities reconciliation trade input pricing financial statements',
 'exciting arisen establish provider elderly care deputy home home day day running home passion care sector proven d

In [4]:
vVectorizer = CountVectorizer(analyzer = "word",vocabulary=vocab)
count_features  = vVectorizer.fit_transform(job_desc)
count_features.shape


(776, 5168)

In [5]:
print(count_features)

  (0, 33)	3
  (0, 36)	3
  (0, 93)	1
  (0, 102)	1
  (0, 666)	1
  (0, 707)	1
  (0, 874)	1
  (0, 910)	1
  (0, 1003)	1
  (0, 1058)	2
  (0, 1144)	2
  (0, 1169)	1
  (0, 1183)	1
  (0, 1220)	1
  (0, 1465)	1
  (0, 1484)	2
  (0, 1542)	1
  (0, 1708)	1
  (0, 1968)	1
  (0, 2116)	2
  (0, 2291)	1
  (0, 2653)	2
  (0, 2672)	1
  (0, 2751)	2
  (0, 2829)	1
  :	:
  (775, 3606)	1
  (775, 3699)	1
  (775, 3711)	1
  (775, 3821)	1
  (775, 3824)	1
  (775, 3942)	1
  (775, 3977)	1
  (775, 4107)	1
  (775, 4227)	1
  (775, 4244)	1
  (775, 4245)	1
  (775, 4249)	1
  (775, 4315)	1
  (775, 4361)	1
  (775, 4380)	2
  (775, 4399)	1
  (775, 4574)	1
  (775, 4606)	4
  (775, 4662)	1
  (775, 4770)	1
  (775, 4785)	1
  (775, 5032)	1
  (775, 5117)	1
  (775, 5136)	1
  (775, 5158)	1


In [6]:
# Initialize an empty list to store the values
webindex_list = []

# Open the text file for reading
with open('webindex.txt', 'r') as file:
    # Read each line in the file
    for line in file:
        # Append each line (value) to the values list, removing leading and trailing whitespace
        webindex_list.append(line.strip())

# Print the values list
webindex_list

['68997528',
 '68063513',
 '68700336',
 '67996688',
 '71803987',
 '70322392',
 '70086531',
 '68684698',
 '70251801',
 '72457901',
 '71851935',
 '70757932',
 '71215909',
 '70205492',
 '70207759',
 '69770990',
 '72232029',
 '71213522',
 '68258357',
 '71841735',
 '71692209',
 '71805092',
 '65101527',
 '68256188',
 '72198878',
 '68573837',
 '67749541',
 '71691899',
 '71139623',
 '72443411',
 '69799351',
 '69078766',
 '68508976',
 '68564061',
 '70762357',
 '71737507',
 '69577820',
 '67304988',
 '72452403',
 '70163439',
 '66544069',
 '71903513',
 '69568022',
 '69191349',
 '71142126',
 '72481557',
 '69539327',
 '72236089',
 '72233918',
 '68546047',
 '68217600',
 '72478300',
 '70598762',
 '71171000',
 '62016897',
 '68177629',
 '71196021',
 '70757636',
 '71793578',
 '70763481',
 '68678164',
 '70599432',
 '66887344',
 '68714905',
 '68257980',
 '62004211',
 '71367580',
 '71556854',
 '72411451',
 '69966126',
 '72438284',
 '72692186',
 '66399629',
 '72444694',
 '72448172',
 '69996401',
 '71443055',

In [7]:
def validator(count_features, vocab, a_ind, indeces):
    print("WEB INDEX:", indeces[a_ind]) # print out the Article ID
    print("--------------------------------------------")
    print("Job Detail:",job_desc[a_ind]) # print out the txt of the article
    #print("Article tokens:",tokenised_articles[a_ind]) # print out the tokens of the article
    print("--------------------------------------------\n")
    print("Vector representation:\n") # printing the vector representation as format 'word:value' (
                                      # the value is 0 or 1 in for binary vector; an integer for count vector; and a float value for tfidf

    for word, value in zip(vocab, count_features.toarray()[a_ind]): 
        if value > 0:
            print(word+":"+str(value), end =' ')

In [8]:
validator(count_features, vocab, 775, webindex_list)

WEB INDEX: 71185283
--------------------------------------------
Job Detail: title field executive office supplies solutions area nottingham basic pay ote car laptop phone field executive light reporting promoting range office supplies office furniture sme market place securing geographical area generating leads cold calling door space pre booked appointments focused wanting kick back relax accounts won progress account world person field executive possess jedi advantageous candidates years field selling office supplies solutions sell levels jedi master level communication presentation drive enthusiasm excel career extremely target driven motivated success established experiencing rapid growth side office supplies background considered room send totaljobs
--------------------------------------------

Vector representation:

account:1 accounts:1 advantageous:1 appointments:1 area:2 back:1 background:1 basic:1 booked:1 calling:1 candidates:1 car:1 career:1 cold:1 communication:1 consider

In [9]:
count_vector = []
for i in range(len(webindex_list)):
    cv = "#" + webindex_list[i]
    for index, value in enumerate(count_features.toarray()[i]):
        if value > 0:
            cv += "," + str(index) + ":" + str(value)
    count_vector.append(cv)

In [10]:
count_vector

['#68997528,33:3,36:3,93:1,102:1,666:1,707:1,874:1,910:1,1003:1,1058:2,1144:2,1169:1,1183:1,1220:1,1465:1,1484:2,1542:1,1708:1,1968:1,2116:2,2291:1,2653:2,2672:1,2751:2,2829:1,2999:1,3088:1,3196:1,3354:1,3367:1,3392:1,3431:1,3435:1,3463:1,3541:1,3619:1,3764:1,3788:1,3876:1,3991:1,4392:2,4714:1',
 '#68063513,33:2,34:2,35:3,322:1,707:1,1058:1,1211:1,1582:1,1632:1,1645:1,1842:1,1974:7,1980:2,2145:1,2163:2,2401:1,2514:1,2564:1,2751:1,3076:2,3145:2,3307:1,3354:1,3391:1,3590:1,3623:1,3676:1,3876:1,3887:1,3935:1,3975:1,4032:1,4061:1,4064:1,4228:1,4251:1,4478:1,4795:1,5016:1,5081:1',
 '#68700336,10:2,295:1,343:2,679:4,1197:2,1252:1,1285:1,1291:1,1316:1,1399:1,1526:1,1616:2,1660:1,1713:1,1876:1,2058:1,2153:1,2220:8,2365:1,2641:1,2675:1,2680:1,2809:1,2899:1,2924:1,3153:1,3156:1,3205:1,3373:1,3385:1,3412:1,3743:1,3745:1,3787:1,3917:5,3992:1,4008:1,4009:2,4031:2,4132:2,4223:1,4295:1,4453:1,4464:1,4598:1,4602:1,4785:1',
 '#67996688,62:1,207:1,274:1,586:3,647:1,659:1,814:1,843:1,866:1,1262:1,1474:1,

In [11]:
!pip install gensim



In [12]:
import gensim
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import word_tokenize

In [13]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(job_desc)

In [14]:
print(tfidf_vectors)

  (0, 996)	0.09531271318466682
  (0, 3773)	0.07821166567368464
  (0, 1959)	0.1061919231001392
  (0, 3341)	0.07189780182251101
  (0, 903)	0.070277709564177
  (0, 1161)	0.06996439492867648
  (0, 1699)	0.12835773375082868
  (0, 3527)	0.10970668693524623
  (0, 663)	0.13358452883303928
  (0, 3604)	0.08245826258151072
  (0, 2281)	0.08802432794046589
  (0, 3421)	0.09046563231320097
  (0, 1456)	0.08294417258358575
  (0, 102)	0.10355998392967496
  (0, 93)	0.13505500519123656
  (0, 3861)	0.14183179957804745
  (0, 704)	0.11922830097836144
  (0, 4694)	0.1308521762366382
  (0, 3379)	0.13824362212075778
  (0, 3077)	0.15985275501239027
  (0, 36)	0.2743125663759606
  (0, 2106)	0.24392130714952515
  (0, 1533)	0.07884175705492424
  (0, 2661)	0.12835773375082868
  (0, 3449)	0.11922830097836144
  :	:
  (775, 28)	0.06085329792056287
  (775, 291)	0.10764260305086507
  (775, 4750)	0.07743547159744146
  (775, 4230)	0.061055173928403574
  (775, 668)	0.06188381949314877
  (775, 1617)	0.08987739313020836
  (775,

## Word2Vec model
Popular word embedding model in natural language processing is Word2Vec. Words are transformed into numerical vectors that represent semantic meaning and word connections. Word2Vec uses the Skip-gram and Continuous Bag of Words (CBOW) as its two main algorithms while training on huge text datasets. These embeddings have intriguing features that make it possible to do vector arithmetic operations that frequently provide useful results. For a variety of NLP applications, from sentiment analysis to machine translation, where understanding context and word meanings is key, pre-trained Word2Vec models are easily accessible for many languages and domains.

In [15]:
# Tokenize the job descriptions
tokenized_descriptions = [word_tokenize(description.lower()) for description in job_desc]

# Train a Word2Vec model
model = Word2Vec(sentences=tokenized_descriptions, vector_size=100, window=5, min_count=1, sg=0)

# Save the trained model to a file (you can replace 'word2vec.model' with your desired filename)
model.save("word2vec.model")

In [16]:
unweighted_embeddings = np.array([np.mean([model.wv[word] for word in tokens if word in model.wv] if tokens else [np.zeros(model.vector_size)], axis=0)
                                  for tokens in tokenized_descriptions])

In [17]:
unweighted_embeddings

array([[-0.1827779 ,  0.41435915,  0.2448432 , ..., -0.33837962,
         0.20886739,  0.02275267],
       [-0.13028733,  0.29980144,  0.17351139, ..., -0.24361616,
         0.15365773,  0.01322478],
       [-0.2511788 ,  0.53991014,  0.29608253, ..., -0.4167968 ,
         0.28159934,  0.08112289],
       ...,
       [-0.1614958 ,  0.3656268 ,  0.21338367, ..., -0.29460832,
         0.1844104 ,  0.02729719],
       [-0.17271304,  0.38018182,  0.22707927, ..., -0.3092356 ,
         0.20047519,  0.03879891],
       [-0.17073716,  0.38987428,  0.2277231 , ..., -0.3181607 ,
         0.19841982,  0.02140598]], dtype=float32)

In [18]:
tfidf_weighted_embeddings = []
for tokens in tokenized_descriptions:
    embeddings = []
    for word in tokens:
        if word in model.wv and word in tfidf_vectorizer.vocabulary_:
            tfidf_weighted_vector = model.wv[word] * tfidf_vectorizer.idf_[tfidf_vectorizer.vocabulary_[word]]
            embeddings.append(tfidf_weighted_vector)
    if embeddings:
        tfidf_weighted_embeddings.append(np.mean(embeddings, axis=0))
    else:
        # If no valid word vectors found, use a zero vector
        tfidf_weighted_embeddings.append(np.zeros(model.vector_size))

# Convert the list of embeddings to a numpy array
tfidf_weighted_embeddings = np.array(tfidf_weighted_embeddings)

In [19]:
tfidf_weighted_embeddings

array([[-0.6308747 ,  1.4333018 ,  0.84795684, ..., -1.1694459 ,
         0.7229672 ,  0.07621756],
       [-0.49447152,  1.1389264 ,  0.66096914, ..., -0.9257678 ,
         0.58321697,  0.05202352],
       [-0.79741234,  1.7167013 ,  0.9443605 , ..., -1.3285596 ,
         0.89609534,  0.25551713],
       ...,
       [-0.55935156,  1.2648895 ,  0.73933357, ..., -1.0174814 ,
         0.63680387,  0.09535185],
       [-0.6044265 ,  1.3339407 ,  0.79850394, ..., -1.0845568 ,
         0.7012239 ,  0.1278356 ],
       [-0.5634965 ,  1.2887082 ,  0.7545939 , ..., -1.050995  ,
         0.65287703,  0.07093474]], dtype=float32)

In [20]:
#saving count vector in a file.
with open('count_vector.txt', 'w') as file:
    file.write('\n'.join(map(str, count_vector)))

## Task 3. Job Advertisement Classification

In Task 3, we delve into the domain of job advertisement classification, aiming to build machine learning models capable of categorizing the content of job advertisements into specific categories. This task involves conducting two sets of experiments to address two critical questions:

One of the central inquiries in this task revolves around assessing the performance of various language models that were generated in Task 2, using the feature representations derived from job advertisement descriptions. To tackle this question, we embark on building machine learning models, employing these feature representations, and scrutinizing their classification performance.

We will not only explore conventional models such as logistic regression from scikit-learn but also have the flexibility to consider other machine learning models, even those not explicitly covered in the course. Our primary objective is to ascertain which language model, coupled with the chosen machine learning algorithm, yields the most promising results. Through rigorous evaluation, we aim to determine the most effective combination for accurately classifying job advertisements into their respective categories.

The second question revolves around the potential benefits of incorporating additional information into the classification process. Specifically, we are interested in assessing whether including the title of the job position, in addition to the description, improves the accuracy of our classification models. To explore this, we will conduct experiments that consider three distinct scenarios:

1. **Using Only the Title**: In this scenario, we exclusively leverage the title of the job advertisement to build classification models.

2. **Using Only the Description**: This scenario involves utilizing only the job advertisement descriptions, a feature representation we have already crafted in Task 2.

3. **Using Both Title and Description**: In this scenario, we have the flexibility to either concatenate the title and description into a single feature representation or generate separate feature representations for both the title and description. We will explore both approaches and assess their impact on classification accuracy.

To ensure robust and reliable comparisons, we will employ a 5-fold cross-validation methodology during the evaluation process. This approach helps us mitigate bias and provides a comprehensive view of how different models and data combinations perform under various conditions.

Ultimately, the outcomes of Task 3 will shed light on the efficacy of language models, the potential advantages of incorporating additional information, and guide us in selecting the most suitable strategies for job advertisement classification, aligning our efforts with the overarching goal of optimizing accuracy and performance.

In [21]:
# Initialize an empty list to store the data
target = []

# Open the file in read mode ('r') and read the data
with open('target.txt', 'r') as file:
    for line in file:
        # Convert the line to an integer and append it to the 'target' list
        target.append(int(line.strip()))


In [22]:
target

[0,
 0,
 2,
 0,
 2,
 1,
 2,
 0,
 3,
 3,
 0,
 0,
 1,
 3,
 1,
 3,
 3,
 1,
 3,
 2,
 2,
 2,
 3,
 3,
 0,
 2,
 2,
 2,
 0,
 2,
 3,
 1,
 2,
 0,
 1,
 3,
 3,
 1,
 1,
 0,
 2,
 2,
 2,
 2,
 0,
 0,
 2,
 1,
 3,
 1,
 1,
 2,
 2,
 3,
 0,
 0,
 1,
 0,
 2,
 2,
 3,
 3,
 3,
 0,
 3,
 0,
 1,
 2,
 3,
 1,
 3,
 2,
 3,
 1,
 3,
 2,
 1,
 3,
 2,
 1,
 3,
 2,
 2,
 1,
 0,
 1,
 1,
 1,
 3,
 0,
 3,
 1,
 3,
 2,
 2,
 0,
 2,
 3,
 2,
 1,
 0,
 1,
 1,
 2,
 0,
 3,
 0,
 1,
 3,
 2,
 1,
 2,
 0,
 3,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 1,
 1,
 1,
 1,
 1,
 3,
 1,
 1,
 3,
 2,
 0,
 0,
 1,
 3,
 2,
 0,
 1,
 0,
 3,
 1,
 2,
 1,
 0,
 0,
 0,
 3,
 0,
 1,
 2,
 3,
 1,
 1,
 1,
 2,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 2,
 0,
 2,
 2,
 0,
 2,
 3,
 2,
 2,
 0,
 2,
 1,
 0,
 1,
 1,
 1,
 3,
 1,
 3,
 1,
 0,
 3,
 1,
 0,
 2,
 0,
 0,
 2,
 1,
 1,
 0,
 1,
 3,
 0,
 1,
 1,
 3,
 0,
 1,
 0,
 2,
 3,
 0,
 2,
 0,
 1,
 0,
 1,
 3,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 2,
 1,
 3,
 1,
 2,
 3,
 1,
 1,
 2,
 0,
 0,
 1,
 2,
 0,
 3,
 2,
 3,
 2,
 2,
 3,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,


In [23]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


In [24]:
num_folds = 5
seed = 15
kf = KFold(n_splits= num_folds, random_state=seed, shuffle = True) # initialise a 5 fold validation
kf

KFold(n_splits=5, random_state=15, shuffle=True)

In [25]:
def evaluate(X_train,X_test,y_train, y_test,seed):
    model = LogisticRegression(random_state=seed,max_iter = 1000)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [26]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

num_models = 2
cv_df = pd.DataFrame(columns = ['unweighted','weighted'],index=range(num_folds)) # creates a dataframe to store the accuracy scores in all the folds

fold = 0
for train_index, test_index in kf.split(list(range(0,len(target)))):
    y_train = [str(target[i]) for i in train_index]
    y_test = [str(target[i]) for i in test_index]

   
    X_train_unweighted, X_test_unweighted = unweighted_embeddings[train_index], unweighted_embeddings[test_index]
    cv_df.loc[fold,'unweighted'] = evaluate(unweighted_embeddings[train_index],unweighted_embeddings[test_index],y_train,y_test,seed)

    X_train_weighted, X_test_weighted = tfidf_weighted_embeddings[train_index], tfidf_weighted_embeddings[test_index]
    cv_df.loc[fold,'weighted'] = evaluate(tfidf_weighted_embeddings[train_index],tfidf_weighted_embeddings[test_index],y_train,y_test,seed)
    
    fold +=1

In [27]:
cv_df

Unnamed: 0,unweighted,weighted
0,0.410256,0.551282
1,0.470968,0.580645
2,0.464516,0.56129
3,0.554839,0.709677
4,0.509677,0.696774


In [28]:
cv_df.mean()

unweighted    0.482051
weighted      0.619934
dtype: float64

### Repeating the same process for the feature generation of "TITLE"

In [29]:
# Initialize an empty list to store the data
title = []

# Open the file in read mode ('r') and read the data
with open('title.txt', 'r') as file:
    for line in file:
        # Append each line (string) to the 'title' list
        title.append(line.strip())

# Print or use the 'title' list
title

['Finance / Accounts Asst Bromley to ****k',
 'Fund Accountant  Hedge Fund',
 'Deputy Home Manager',
 'Brokers Wanted Imediate Start',
 'RGN Nurses (Hospitals)  Penarth',
 'Production Coordinator',
 'Scrub Nurse',
 'Sales & Purchase Ledger Clerk  Maternity Cover',
 'Recruitment Sales Executive',
 'Business Development Executive  Field Sales  Dartford',
 'Investments & Treasury Controller',
 'European Payroll',
 'Engineering Assessor / Instructor  South Yorkshire',
 'International Account Manager',
 'Senior Production Technologist (Malaysia)',
 'Insurance Sales Executive  Horsham',
 'Vehicle Purchaser / Car Sales',
 'Marine Engines Specialist â€“ Product Support',
 'Sales Manager/Medical Sales Executive',
 'Optical Assistant  Oxfordshire',
 'PERM Unit Mgr RGN Kid minster Flexi ****K due',
 "PERM RGN's in Bangor CoDown  F/T Flexi  ****ph ExOpp  Bangor",
 'Ecommerce Country Manager (Netherlands)',
 'Business Development Manager  Leading Financial Lending PLC',
 'Dynamics AX Finance Consul

In [30]:
from nltk.tokenize import sent_tokenize, RegexpTokenizer
from itertools import chain
import numpy as np

def tokenizeReview(title):
    review = title.lower()
    sentences = sent_tokenize(review)
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern)
    token_lists = [tokenizer.tokenize(sen) for sen in sentences]

    # merge them into a list of tokens
    tokenized_title = list(chain.from_iterable(token_lists))
    return tokenized_title

def stats_print(tk_title):
    words = list(chain.from_iterable(tk_title))  # we put all the tokens in the corpus in a single list
    vocab = set(words)  # compute the vocabulary by converting the list of words/tokens to a set, i.e., giving a set of unique words
    lexical_diversity = len(vocab) / len(words)
    print("Vocabulary size: ", len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of titles:", len(tk_title))
    lens = [len(title) for title in tk_title]
    print("Average title length:", np.mean(lens))
    print("Maximum title length:", np.max(lens))
    print("Minimum title length:", np.min(lens))
    print("Standard deviation of title length:", np.std(lens))


In [31]:
tk_job_title = [tokenizeReview(job_title) for job_title in title]  # list comprehension, generate a list of tokenized articles

In [32]:
tk_job_title

[['finance', 'accounts', 'asst', 'bromley', 'to', 'k'],
 ['fund', 'accountant', 'hedge', 'fund'],
 ['deputy', 'home', 'manager'],
 ['brokers', 'wanted', 'imediate', 'start'],
 ['rgn', 'nurses', 'hospitals', 'penarth'],
 ['production', 'coordinator'],
 ['scrub', 'nurse'],
 ['sales', 'purchase', 'ledger', 'clerk', 'maternity', 'cover'],
 ['recruitment', 'sales', 'executive'],
 ['business', 'development', 'executive', 'field', 'sales', 'dartford'],
 ['investments', 'treasury', 'controller'],
 ['european', 'payroll'],
 ['engineering', 'assessor', 'instructor', 'south', 'yorkshire'],
 ['international', 'account', 'manager'],
 ['senior', 'production', 'technologist', 'malaysia'],
 ['insurance', 'sales', 'executive', 'horsham'],
 ['vehicle', 'purchaser', 'car', 'sales'],
 ['marine', 'engines', 'specialist', 'product', 'support'],
 ['sales', 'manager', 'medical', 'sales', 'executive'],
 ['optical', 'assistant', 'oxfordshire'],
 ['perm', 'unit', 'mgr', 'rgn', 'kid', 'minster', 'flexi', 'k', 'du

In [33]:
stats_print(tk_job_title)

Vocabulary size:  1003
Total number of tokens:  3157
Lexical diversity:  0.3177066835603421
Total number of titles: 776
Average title length: 4.068298969072165
Maximum title length: 13
Minimum title length: 1
Standard deviation of title length: 1.8386529115562282


In [34]:
st_list = [[d for d in desc if len(d) <= 1] \
                      for desc in tk_job_title] # create a list of single character token for each review
list(chain.from_iterable(st_list)) # merge them together in one list

['k',
 'k',
 'f',
 't',
 'c',
 'k',
 'b',
 'b',
 'a',
 'c',
 'x',
 'p',
 'h',
 'p',
 'p',
 'h',
 'k',
 'k',
 'a',
 'k',
 'm',
 'e',
 'k',
 'k',
 'c',
 'c',
 'k',
 'b',
 'x',
 'k',
 'p',
 't',
 'x',
 'r',
 'r',
 'v',
 'k',
 'k',
 'c',
 'p',
 'h',
 'b',
 'b',
 'a',
 'd',
 'k',
 'k',
 'k',
 'r',
 'x',
 'k',
 'k',
 'x',
 'x',
 'c',
 'i',
 'n',
 'c',
 'a',
 'k',
 'c',
 'j',
 'm',
 'f',
 'k',
 'k',
 'k',
 'x',
 'c',
 'c',
 'c',
 'k',
 'x',
 'c',
 'c',
 'c',
 'k',
 'k',
 'o',
 'p']

In [35]:
tk_job_title = [[d for d in desc if len(d) >=2] \
                      for desc in tk_job_title]

In [36]:
stats_print(tk_job_title)

Vocabulary size:  985
Total number of tokens:  3077
Lexical diversity:  0.32011699707507313
Total number of titles: 776
Average title length: 3.9652061855670104
Maximum title length: 12
Minimum title length: 1
Standard deviation of title length: 1.7350467799519218


In [37]:
stopwords = []

# Open the file in read mode ('r')
with open('stopwords_en.txt', 'r') as file:
    # Read each line and append it to the 'lines' list
    for line in file:
        stopwords.append(line.strip())  # Use strip() to remove newline characters

# Print or process the list of lines
stopwords

['a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c

In [38]:
tk_job_title = [[w for w in words if w not in stopwords] 
                      for words in tk_job_title]

In [39]:
stats_print(tk_job_title)

Vocabulary size:  954
Total number of tokens:  2963
Lexical diversity:  0.32197097536280794
Total number of titles: 776
Average title length: 3.818298969072165
Maximum title length: 10
Minimum title length: 1
Standard deviation of title length: 1.5653217426587334


In [40]:
final_job_title = [" ".join(t) for t in tk_job_title]
final_job_title

['finance accounts asst bromley',
 'fund accountant hedge fund',
 'deputy home manager',
 'brokers wanted imediate start',
 'rgn nurses hospitals penarth',
 'production coordinator',
 'scrub nurse',
 'sales purchase ledger clerk maternity cover',
 'recruitment sales executive',
 'business development executive field sales dartford',
 'investments treasury controller',
 'european payroll',
 'engineering assessor instructor south yorkshire',
 'international account manager',
 'senior production technologist malaysia',
 'insurance sales executive horsham',
 'vehicle purchaser car sales',
 'marine engines specialist product support',
 'sales manager medical sales executive',
 'optical assistant oxfordshire',
 'perm unit mgr rgn kid minster flexi due',
 "perm rgn's bangor codown flexi ph exopp bangor",
 'ecommerce country manager netherlands',
 'business development manager leading financial lending plc',
 'dynamics ax finance consultant london',
 'nursing home manager crawley',
 'registere

In [41]:
stats_print(final_job_title)

Vocabulary size:  28
Total number of tokens:  23479
Lexical diversity:  0.0011925550491928957
Total number of titles: 776
Average title length: 30.25644329896907
Maximum title length: 71
Minimum title length: 3
Standard deviation of title length: 12.31878791941342


In [42]:
title_vectors = tfidf_vectorizer.fit_transform(final_job_title)

In [43]:
print(title_vectors)

  (0, 110)	0.5533760212687362
  (0, 63)	0.5875960231014571
  (0, 9)	0.43637706222252853
  (0, 315)	0.3975939540362385
  (1, 382)	0.4284342001106977
  (1, 7)	0.28674148789028314
  (1, 343)	0.8568684002213954
  (2, 500)	0.36771036522534967
  (2, 390)	0.642675948249739
  (2, 219)	0.6721284942978828
  (3, 829)	0.4122559023292801
  (3, 412)	0.5362084934103664
  (3, 931)	0.5362084934103664
  (3, 109)	0.5049811622069096
  (4, 633)	0.5705965284895917
  (4, 396)	0.5705965284895917
  (4, 586)	0.48055947564205387
  (4, 746)	0.3433683062280022
  (5, 187)	0.6561894928541688
  (5, 694)	0.7545961499158267
  (6, 585)	0.4861953118968752
  (6, 768)	0.8738501694738636
  (7, 194)	0.4842223654341737
  (7, 513)	0.4560225656037708
  (7, 153)	0.3878066129863246
  :	:
  (770, 130)	0.564011102504353
  (770, 345)	0.564011102504353
  (770, 831)	0.564011102504353
  (770, 500)	0.2137157662772507
  (771, 888)	0.5155197431070008
  (771, 887)	0.5155197431070008
  (771, 44)	0.5155197431070008
  (771, 758)	0.21360038919

In [44]:
# Tokenize the job descriptions
tokenized_title = [word_tokenize(t.lower()) for t in final_job_title]

# Train a Word2Vec model
model_title = Word2Vec(sentences=tokenized_title, vector_size=100, window=5, min_count=1, sg=0)

# Save the trained model to a file (you can replace 'word2vec.model' with your desired filename)
model_title.save("word2vec_title.model")

In [45]:
unweighted_title_embeddings = np.array([np.mean([model_title.wv[word] for word in tokens if word in model_title.wv] if tokens else [np.zeros(model_title.vector_size)], axis=0)
                                  for tokens in tokenized_title])

In [46]:
unweighted_title_embeddings

array([[-4.6403636e-04, -1.6054121e-04,  3.0985379e-03, ...,
        -1.3169802e-03, -2.7170989e-03,  2.1253049e-03],
       [ 8.8956038e-04,  1.9821182e-03,  3.7495107e-03, ...,
        -1.8218876e-04, -4.3815998e-03,  1.4244232e-03],
       [ 3.2118310e-03,  5.5406285e-03,  3.1212803e-03, ...,
        -8.5268373e-04, -1.3648668e-04,  8.2263444e-04],
       ...,
       [-8.7193621e-04,  1.0563894e-03, -1.0633612e-03, ...,
        -1.2110140e-03,  8.0708490e-04,  9.2797585e-05],
       [ 6.8898625e-03, -2.1282209e-03,  3.6582162e-03, ...,
        -2.5441090e-03,  2.9995164e-03, -4.8305895e-03],
       [-3.4806395e-03, -6.3355290e-04, -7.6567294e-04, ...,
        -5.8490313e-03, -7.5696601e-04, -1.3455012e-03]], dtype=float32)

In [47]:
tfidf_weighted_title_embeddings = []
for tokens in tokenized_title:
    embeddings = []
    for word in tokens:
        if word in model_title.wv and word in tfidf_vectorizer.vocabulary_:
            
            
            tfidf_weighted_vector = model_title.wv[word] * tfidf_vectorizer.idf_[tfidf_vectorizer.vocabulary_[word]]
            embeddings.append(tfidf_weighted_vector)
    if embeddings:
        tfidf_weighted_title_embeddings.append(np.mean(embeddings, axis=0))
    else:
        # If no valid word vectors found, use a zero vector
        tfidf_weighted_title_embeddings.append(np.zeros(model.vector_size))

# Convert the list of embeddings to a numpy array
tfidf_weighted_title_embeddings = np.array(tfidf_weighted_title_embeddings)

In [48]:
tfidf_weighted_title_embeddings

array([[-0.00435364, -0.00059501,  0.01803686, ..., -0.00952507,
        -0.0155688 ,  0.01066805],
       [ 0.00314526,  0.01050013,  0.02499346, ..., -0.00509798,
        -0.03373835,  0.00485043],
       [ 0.01644889,  0.02535501,  0.01108918, ...,  0.00238907,
        -0.00191616,  0.00052953],
       ...,
       [ 0.00545513,  0.00129187, -0.01133066, ...,  0.00159821,
         0.01031721,  0.00134598],
       [ 0.04489203, -0.01518824,  0.02472428, ..., -0.01484296,
         0.01975297, -0.02774998],
       [-0.0070001 , -0.0066447 , -0.00631952, ..., -0.02132855,
         0.00090508, -0.0057664 ]], dtype=float32)

In [49]:
cv_df_title = pd.DataFrame(columns = ['unweighted','weighted'],index=range(num_folds)) # creates a dataframe to store the accuracy scores in all the folds

fold = 0
for train_index, test_index in kf.split(list(range(0,len(target)))):
    y_train = [str(target[i]) for i in train_index]
    y_test = [str(target[i]) for i in test_index]
   
    X_train_unweighted, X_test_unweighted = unweighted_title_embeddings[train_index], unweighted_title_embeddings[test_index]
    cv_df_title.loc[fold,'unweighted'] = evaluate(unweighted_title_embeddings[train_index],unweighted_title_embeddings[test_index],y_train,y_test,seed)

    X_train_weighted, X_test_weighted = tfidf_weighted_title_embeddings[train_index], tfidf_weighted_title_embeddings[test_index]
    cv_df_title.loc[fold,'weighted'] = evaluate(tfidf_weighted_title_embeddings[train_index],tfidf_weighted_title_embeddings[test_index],y_train,y_test,seed)
    
    fold +=1

In [50]:
cv_df_title

Unnamed: 0,unweighted,weighted
0,0.217949,0.301282
1,0.296774,0.432258
2,0.264516,0.348387
3,0.380645,0.580645
4,0.329032,0.509677


In [51]:
cv_df_title.mean()

unweighted    0.297783
weighted      0.434450
dtype: float64

In [52]:
combined_unweighted_embeddings = np.hstack((unweighted_embeddings, unweighted_title_embeddings))
combined_weighted_embedings = np.hstack((unweighted_embeddings, tfidf_weighted_title_embeddings))

In [53]:
cv_df_combined = pd.DataFrame(columns = ['unweighted','weighted'],index=range(num_folds)) # creates a dataframe to store the accuracy scores in all the folds

fold = 0
for train_index, test_index in kf.split(list(range(0,len(target)))):
    y_train = [str(target[i]) for i in train_index]
    y_test = [str(target[i]) for i in test_index]
   
    X_train_unweighted, X_test_unweighted = combined_unweighted_embeddings[train_index], combined_unweighted_embeddings[test_index]
    cv_df_combined.loc[fold,'unweighted'] = evaluate(combined_unweighted_embeddings[train_index],combined_unweighted_embeddings[test_index],y_train,y_test,seed)

    X_train_weighted, X_test_weighted = combined_weighted_embedings[train_index], combined_weighted_embedings[test_index]
    cv_df_combined.loc[fold,'weighted'] = evaluate(combined_weighted_embedings[train_index],combined_weighted_embedings[test_index],y_train,y_test,seed)
    
    fold +=1

In [54]:
cv_df_combined

Unnamed: 0,unweighted,weighted
0,0.410256,0.455128
1,0.470968,0.548387
2,0.464516,0.509677
3,0.554839,0.612903
4,0.522581,0.63871


In [55]:
cv_df_combined.mean()

unweighted    0.484632
weighted      0.552961
dtype: float64

### Repeating the same process for the feature generation of "TITLE & DESCRIPTION" joined

In [56]:
# Initialize an empty list to store the job description
title_desc = []

# Open the text file for reading
with open('title_desc.txt', 'r', encoding='utf-8') as file:
    # Read each line in the file
    for line in file:
        # Append each line (job description) to the job_desc list
        title_desc.append(line.strip())

# Print the job_desc list
title_desc

['Finance / Accounts Asst Bromley to ****k Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with IT proficiency. You will need to be a part or fully qualified Accountant to be considered for this role',
 'Fund Accountant  Hedge Fund One of the leading Hedge Funds in London is currently recruiting for a Fund Accountant to join their team. The role will be paying c**** with 

In [57]:
tk_title_desc = [tokenizeReview(td) for td in title_desc]  # list comprehension, generate a list of tokenized articles

In [58]:
print(tk_title_desc)



In [59]:
stats_print(tk_title_desc)

Vocabulary size:  9898
Total number of tokens:  190109
Lexical diversity:  0.0520648680493822
Total number of titles: 776
Average title length: 244.98582474226805
Maximum title length: 824
Minimum title length: 20
Standard deviation of title length: 125.24145156782018


In [60]:
st_list_ = [[d for d in desc if len(d) <= 1] \
                      for desc in tk_title_desc] # create a list of single character token for each review
list(chain.from_iterable(st_list_)) # merge them together in one list

['k',
 'p',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'a',
 'a',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'k',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'a',
 'b',
 'b',
 'a',
 'b',
 'b',
 'a',
 'a',
 'b',
 'b',
 's',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'a',
 'a',
 's',
 'a',
 'd',
 'd',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'k',
 'a',
 's',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'k',
 'a',
 'a',
 'k',
 'a',
 'a',
 'k',
 'a',
 'a',
 'a',
 'm',
 'm',
 'm',
 'm',
 'a',
 'a',
 'a',
 'k',
 'a',
 's',
 's',
 'a',
 's',
 'a',
 's',
 'a',
 's',
 'a',
 's',
 'a',
 'a',
 'a',
 's',
 's',
 's',
 'a',
 'a',
 's',
 'f',
 't',
 'a',
 'a',
 'a',
 'a',
 'a',
 'b',
 'c',
 'd',
 'e',
 'k'

In [61]:
tk_title_desc = [[d for d in desc if len(d) >=2] \
                      for desc in tk_title_desc]

In [62]:
stats_print(tk_title_desc)

Vocabulary size:  9872
Total number of tokens:  183990
Lexical diversity:  0.05365508995054079
Total number of titles: 776
Average title length: 237.10051546391753
Maximum title length: 803
Minimum title length: 20
Standard deviation of title length: 121.84697687234663


In [63]:
tk_title_desc = [[w for w in words if w not in stopwords] 
                      for words in tk_title_desc]

In [64]:
stats_print(tk_title_desc)

Vocabulary size:  9468
Total number of tokens:  110124
Lexical diversity:  0.08597580908793724
Total number of titles: 776
Average title length: 141.91237113402062
Maximum title length: 491
Minimum title length: 18
Standard deviation of title length: 73.33261711921224


In [65]:
from collections import Counter

global_word_counts = Counter()

# Calculate term frequency (TF) for each job description and update the global word counts
for title_tokens in tk_title_desc:
    word_counts = Counter(title_tokens)
    global_word_counts.update(word_counts)

# Find words that appear only once (TF = 1) across all descriptions
words_to_remove = [word for word, count in global_word_counts.items() if count == 1]

words_to_remove

['asst',
 'disputed',
 'allocating',
 'proficiency',
 'equities',
 'fluctuations',
 'chiropodists',
 'deputyhomemanager',
 'imediate',
 'onetwotrade',
 'timehours',
 'banding',
 'referafriend',
 'immunisation',
 'outofhours',
 'postqualification',
 'faced',
 'arriving',
 'facsimile',
 'retrieve',
 'photocopying',
 'collating',
 'xp',
 'solvitt',
 'thoughts',
 'goaloriented',
 'laminar',
 'sterile',
 'presents',
 'ortho',
 'ophthalmic',
 'gynae',
 'exemplary',
 'mixes',
 'remittances',
 'duplicates',
 'chaps',
 'currencies',
 'salespurchaseledgerclerkmaternitycover',
 'embarking',
 'faint',
 'hearted',
 'recruitmentsalesexecutive',
 'da',
 'br',
 'personalities',
 'recycling',
 'washroom',
 'pest',
 'businessdevelopmentexecutivefieldsalesdartford',
 'kris',
 'shortfalls',
 'remediate',
 'custodians',
 'transacted',
 'august',
 'clarke',
 'investmentstreasurycontroller',
 'batley',
 'castleford',
 'morley',
 'pontefract',
 'porduction',
 'susurface',
 'promary',
 'maturation',
 'reservio

In [66]:
filtered_title_desc = [[word for word in desc if word not in words_to_remove] for desc in tk_title_desc]

In [67]:
stats_print(filtered_title_desc)

Vocabulary size:  5291
Total number of tokens:  105947
Lexical diversity:  0.04994006437180855
Total number of titles: 776
Average title length: 136.52963917525773
Maximum title length: 475
Minimum title length: 18
Standard deviation of title length: 70.60421592997334


In [68]:
# Calculate document frequency (DF) for each word across all job descriptions
document_frequencies = Counter()

# Count how many documents each word appears in
for title_tokens in filtered_title_desc:
    unique_tokens = set(title_tokens)  # Use set to count unique occurrences within each document
    document_frequencies.update(unique_tokens)

# Find the top 50 most frequent words based on DF
top_50_words = [word for word, df in document_frequencies.most_common(50)]

# Remove the top 50 most frequent words from each job description
final_title_desc = [[word for word in desc if word not in top_50_words] for desc in filtered_title_desc]
final_title_desc

[['finance',
  'accounts',
  'bromley',
  'accountant',
  'partqualified',
  'south',
  'east',
  'london',
  'manufacturing',
  'requirement',
  'accountant',
  'permanent',
  'modern',
  'offices',
  'south',
  'east',
  'london',
  'credit',
  'control',
  'purchase',
  'ledger',
  'daily',
  'collection',
  'debts',
  'phone',
  'letter',
  'email',
  'handling',
  'ledger',
  'accounts',
  'handling',
  'accounts',
  'negotiating',
  'payment',
  'terms',
  'cash',
  'reconciliation',
  'accounts',
  'adhoc',
  'administration',
  'duties',
  'person',
  'ideal',
  'previous',
  'credit',
  'control',
  'capacity',
  'possess',
  'exceptional',
  'customer',
  'communication',
  'part',
  'fully',
  'qualified',
  'accountant',
  'considered'],
 ['fund',
  'accountant',
  'hedge',
  'fund',
  'hedge',
  'funds',
  'london',
  'recruiting',
  'fund',
  'accountant',
  'paying',
  'outstanding',
  'west',
  'end',
  'report',
  'head',
  'fund',
  'accounting',
  'number',
  'fund',

In [69]:
stats_print(final_title_desc)

Vocabulary size:  5241
Total number of tokens:  83732
Lexical diversity:  0.06259255720632494
Total number of titles: 776
Average title length: 107.9020618556701
Maximum title length: 402
Minimum title length: 10
Standard deviation of title length: 58.62022359480943


In [70]:
final_title_desc = [" ".join(t) for t in final_title_desc]
final_title_desc

['finance accounts bromley accountant partqualified south east london manufacturing requirement accountant permanent modern offices south east london credit control purchase ledger daily collection debts phone letter email handling ledger accounts handling accounts negotiating payment terms cash reconciliation accounts adhoc administration duties person ideal previous credit control capacity possess exceptional customer communication part fully qualified accountant considered',
 'fund accountant hedge fund hedge funds london recruiting fund accountant paying outstanding west end report head fund accounting number fund accountants senior fund accountants responsible fund accounting number hedge funds dealing equity related products involves aspects fund accounting preparation journal voucher entries nav control part nav review fund accountant reviews cash securities reconciliation trade input pricing financial statements',
 'deputy home exciting arisen establish provider elderly care de

In [71]:
title_desc_vectors = tfidf_vectorizer.fit_transform(final_title_desc)

In [72]:
print(title_desc_vectors)

  (0, 1010)	0.09102389714467059
  (0, 3829)	0.0749871854132819
  (0, 1983)	0.1018138836235319
  (0, 3389)	0.06830173511849218
  (0, 917)	0.0673803273733736
  (0, 1176)	0.06707992995231626
  (0, 1719)	0.12306585081770718
  (0, 3580)	0.10518374213654076
  (0, 675)	0.12807715761664717
  (0, 3659)	0.07905870526853245
  (0, 2313)	0.08439529504064942
  (0, 3470)	0.08673595026224172
  (0, 1474)	0.07952458236123869
  (0, 103)	0.09881242200875405
  (0, 94)	0.12948700974507574
  (0, 3919)	0.13598441307761283
  (0, 716)	0.11431280276368228
  (0, 4759)	0.12545745339482892
  (0, 3427)	0.132544167610803
  (0, 3121)	0.15326240754096604
  (0, 2135)	0.23386501397100118
  (0, 1552)	0.07559129963109511
  (0, 2697)	0.12306585081770718
  (0, 3498)	0.11431280276368228
  (0, 1226)	0.1802215690940439
  :	:
  (775, 29)	0.05832692428178468
  (775, 295)	0.10425368149430647
  (775, 4816)	0.07519389576571439
  (775, 4290)	0.059287769411371816
  (775, 680)	0.06009242762467869
  (775, 1637)	0.08727565276364017
  (77

In [73]:
# Tokenize the job descriptions
tokenized_title_desc = [word_tokenize(t.lower()) for t in final_title_desc]

# Train a Word2Vec model
model_title_desc = Word2Vec(sentences=tokenized_title_desc, vector_size=100, window=5, min_count=1, sg=0)

# Save the trained model to a file (you can replace 'word2vec.model' with your desired filename)
model_title_desc.save("word2vec_title_desc.model")

In [74]:
unweighted_title_desc_embeddings = np.array([np.mean([model_title_desc.wv[word] for word in tokens if word in model_title_desc.wv] if tokens else [np.zeros(model_title_desc.vector_size)], axis=0)
                                  for tokens in tokenized_title_desc])

In [75]:
unweighted_title_desc_embeddings

array([[-0.43853405,  0.647388  ,  0.20538871, ..., -0.41124728,
         0.1843485 ,  0.0050393 ],
       [-0.3002621 ,  0.4514125 ,  0.14185315, ..., -0.28822932,
         0.12525466,  0.00583193],
       [-0.4715077 ,  0.75395584,  0.31789094, ..., -0.5035175 ,
         0.25970188, -0.01883413],
       ...,
       [-0.3720375 ,  0.55226016,  0.18337783, ..., -0.3563295 ,
         0.16256723,  0.00560528],
       [-0.37919492,  0.5664409 ,  0.19668306, ..., -0.36692998,
         0.17466518, -0.00402513],
       [-0.40894976,  0.59585035,  0.18534537, ..., -0.38272467,
         0.1692141 ,  0.00719105]], dtype=float32)

In [76]:
tfidf_weighted_title_desc_embeddings = []
for tokens in tokenized_title_desc:
    embeddings = []
    for word in tokens:
        if word in model_title.wv and word in tfidf_vectorizer.vocabulary_:
            
            
            tfidf_weighted_vector = model_title.wv[word] * tfidf_vectorizer.idf_[tfidf_vectorizer.vocabulary_[word]]
            embeddings.append(tfidf_weighted_vector)
    if embeddings:
        tfidf_weighted_title_desc_embeddings.append(np.mean(embeddings, axis=0))
    else:
        # If no valid word vectors found, use a zero vector
        tfidf_weighted_title_desc_embeddings.append(np.zeros(model.vector_size))

# Convert the list of embeddings to a numpy array
tfidf_weighted_title_desc_embeddings = np.array(tfidf_weighted_title_desc_embeddings)

In [77]:
tfidf_weighted_title_desc_embeddings

array([[ 6.75178226e-03, -4.90480475e-03,  9.80269979e-04, ...,
         1.05307619e-04,  8.84066569e-04, -3.00575281e-04],
       [-7.87132792e-03,  8.64231493e-03,  1.12064825e-02, ...,
        -2.14757794e-03, -9.70302895e-03, -4.31544008e-03],
       [ 4.10246197e-03,  8.79474264e-03,  2.17677490e-03, ...,
        -3.39960959e-03,  9.21758357e-04, -6.50284952e-03],
       ...,
       [-6.16589654e-03, -1.31966535e-03, -1.28153525e-02, ...,
        -2.76081474e-03, -2.25156541e-07,  1.03408122e-03],
       [ 1.04116127e-02,  2.58677569e-03,  4.05738782e-03, ...,
        -9.49297007e-03, -7.59274233e-04, -5.44439349e-03],
       [ 5.31136105e-03,  2.43200525e-03, -7.32290482e-06, ...,
        -1.14606973e-02,  6.01542648e-03, -7.06885150e-03]], dtype=float32)

In [78]:
cv_df_title_desc = pd.DataFrame(columns = ['unweighted','weighted'],index=range(num_folds)) # creates a dataframe to store the accuracy scores in all the folds

fold = 0
for train_index, test_index in kf.split(list(range(0,len(target)))):
    y_train = [str(target[i]) for i in train_index]
    y_test = [str(target[i]) for i in test_index]
   
    X_train_unweighted, X_test_unweighted = unweighted_title_desc_embeddings[train_index], unweighted_title_desc_embeddings[test_index]
    cv_df_title_desc.loc[fold,'unweighted'] = evaluate(unweighted_title_desc_embeddings[train_index], unweighted_title_desc_embeddings[test_index], y_train, y_test, seed)

    X_train_weighted, X_test_weighted = tfidf_weighted_title_desc_embeddings[train_index], tfidf_weighted_title_desc_embeddings[test_index]
    cv_df_title_desc.loc[fold,'weighted'] = evaluate(tfidf_weighted_title_desc_embeddings[train_index],tfidf_weighted_title_desc_embeddings[test_index],y_train,y_test,seed)
    
    fold +=1

In [79]:
cv_df_title_desc

Unnamed: 0,unweighted,weighted
0,0.442308,0.217949
1,0.483871,0.296774
2,0.464516,0.264516
3,0.593548,0.509677
4,0.580645,0.329032


In [80]:
cv_df_title_desc.mean()

unweighted    0.512978
weighted      0.323590
dtype: float64

### Model output explaination

In the context of regression modeling for job titles and descriptions, an intriguing observation arises when comparing the effectiveness of weighted (TF-IDF) and unweighted word embeddings. When these two types of text data are analyzed separately, TF-IDF weighting often proves to be more effective in explaining variance in the target variable. This result aligns with the expectation that TF-IDF helps emphasize crucial terms within each subset of data.

However, when we combine titles and descriptions for analysis, an interesting shift occurs. In this combined setting, unweighted word embeddings tend to perform exceptionally well, surpassing the performance of TF-IDF-weighted embeddings. Several factors contribute to this change, including differences in data distribution, potential redundancy between titles and descriptions, and the complex interactions introduced by their combination.

The choice between weighted and unweighted embeddings remains highly dependent on the nature of the data and the specific regression task at hand. Experimentation and careful consideration are key. It's not uncommon to explore various feature representations and model complexities to determine the optimal combination for a given dataset. Additionally, techniques such as feature selection and dimensionality reduction may further enhance model performance, especially when working with combined text data.

In conclusion, the choice of feature representation in regression modeling reflects the nuanced relationship between text data and target variables. It's a reminder that effective modeling in the realm of natural language processing requires adaptability and a thorough understanding of the data's intricacies.

## Summary
In the realm of Natural Language Processing (NLP), the art of transforming textual data into actionable insights plays a pivotal role in various applications. In Task 2 and Task 3, we embarked on a journey to explore the intricate nuances of feature representation and regression analysis for job advertisement descriptions and titles, aiming to unlock hidden patterns and predictive power within this textual domain.

**Task 2: Feature Representation**

In Task 2, we delved into the realm of feature representation. Our goal was to bridge the gap between unstructured text data and machine learning models. This journey involved:

1. **Data Preprocessing:** We began by meticulously preparing our textual data. Tokenization, stemming, and the removal of stopwords were key steps in ensuring that our text was machine-readable.

2. **Language Models:** The choice of a suitable language model was pivotal. We explored the power of Word2Vec, FastText, or GloVe embeddings, each offering unique advantages in capturing semantic meaning.

3. **Feature Engineering:** We crafted three distinct feature representations for job descriptions and titles. Count vectors, capturing word frequency, offered a straightforward representation. TF-IDF weighted vectors highlighted the importance of terms. Word embeddings, both weighted and unweighted, portrayed the semantic essence of the text.

**Task 3: Regression Analysis**

With our feature representations in hand, we embarked on Task 3: regression analysis. Here, our aim was to predict and understand the target variable. The outcome unveiled intriguing insights:

- **Feature Effectiveness:** We discovered that the effectiveness of feature representations varied across different settings. Count vectors often excelled, portraying the strength of capturing word frequency. However, TF-IDF weighted vectors sometimes outperformed unweighted word embeddings, emphasizing the significance of term weighting.

- **Interplay of Features:** When titles and descriptions were analyzed separately, TF-IDF weighting frequently enhanced performance. Yet, when combined, unweighted embeddings often shone. This transition reflected the intricate interplay between these two types of text data, including potential redundancy and complex feature interactions.

- **Model Flexibility:** The choice of regression model complexity also influenced the results. More complex models proved adept at capturing combined information, while simpler models excelled when features were more straightforward.

In essence, Tasks 2 and 3 underscored the dynamic nature of text data analysis. It emphasized the necessity of adaptability, experimentation, and a deep understanding of data intricacies. Whether it's uncovering patterns in job descriptions or unraveling the predictive power of titles, the journey through feature representation and regression analysis is a testament to the art of translating text into actionable insights.