# Assignment 2: Milestone I Natural Language Processing
## Task 2&3
#### Student Name: Rylan Pease
#### Student ID: s3896416

Date: 1/10/23

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used: please include all the libraries you used in your assignment, e.g.,:
* pandas
* numpy
* sklearn
* gensim

## Introduction
This assignment comprises the report for tasks 2 and 3 of the assignment. The report will be structured as follows:

Task 2:
1. Load the data
2. Convert the data into the sparse count vector representation
3. Output the data to the file system in correct format

Task 3:
1. Q1:
    1. Get all the labels in the training data and ensure its correct
    2. Initialize the logistic regression model and evaluate it using 5-fold cross validation
2. Q2:
    1. Load the vocabulary and count vectors.
    2. Split the dataset into features (count vectors) and target (job category).
    3. Train a logistic regression model using 5-fold cross-validation.
    4. Evaluate the performance of the model using accuracy on each of the datasets (description, title, combined).



<span style="color: red"> Note that this is a sample notebook only. You will need to fill in the proper markdown and code blocks. You might also want to make necessary changes to the structure to meet your own needs. Note also that any generic comments written in this notebook are to be removed and replace with your own words.</span>

## Importing libraries 

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import FastText

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

...... Sections and code blocks on buidling different document feature represetations


<span style="color: red"> You might have complex notebook structure in this section, please feel free to create your own notebook structure. </span>

In [2]:
# Load the vocabulary
with open("vocab.txt", "r") as f:
    vocab = f.read().splitlines()

# Create a dictionary mapping each word in the vocabulary to a unique integer
word_to_index = {word: idx for idx, word in enumerate(vocab)}

# Load the job advertisements dataset
job_ads = pd.read_csv("processed_job_ads.csv")

# Convert the 'Tokenized Description' column from string to list
job_ads["Tokenized Description"] = job_ads["Tokenized Description"].apply(lambda x: eval(x))

In [3]:
# Correcting the vocabulary by stripping the indices
vocab_cleaned = [word.split(":")[0] for word in vocab]

# Update the dictionary mapping each word in the cleaned vocabulary to a unique integer
word_to_index_cleaned = {word: idx for idx, word in enumerate(vocab_cleaned)}

# Convert each job advertisement description into a sparse count vector representation based on the cleaned vocabulary
count_vectors = []

for _, row in job_ads.iterrows():
    vector = {}
    for word in row["Tokenized Description"]:
        if word in word_to_index_cleaned:
            vector[word_to_index_cleaned[word]] = vector.get(word_to_index_cleaned[word], 0) + 1
    sorted_vector = sorted(vector.items(), key=lambda x: x[0])
    count_vectors.append((row["Webindex"], sorted_vector))

# Display the first few count vectors for review after correction
count_vectors[:5]

[(68802053,
  [(13, 1),
   (17, 1),
   (64, 1),
   (73, 1),
   (77, 1),
   (154, 1),
   (226, 2),
   (227, 2),
   (263, 1),
   (278, 1),
   (343, 1),
   (378, 1),
   (495, 1),
   (644, 1),
   (659, 1),
   (669, 1),
   (705, 1),
   (732, 2),
   (759, 1),
   (760, 1),
   (803, 1),
   (891, 1),
   (942, 1),
   (985, 1),
   (1136, 1),
   (1178, 1),
   (1316, 2),
   (1354, 1),
   (1367, 1),
   (1449, 1),
   (1457, 1),
   (1565, 2),
   (1585, 1),
   (1616, 1),
   (1737, 1),
   (1801, 1),
   (1823, 1),
   (1840, 1),
   (1842, 2),
   (1857, 1),
   (1922, 2),
   (1928, 1),
   (1932, 1),
   (2042, 1),
   (2078, 1),
   (2084, 1),
   (2137, 2),
   (2145, 1),
   (2401, 1),
   (2458, 1),
   (2459, 1),
   (2474, 1),
   (2501, 1),
   (2589, 1),
   (2641, 1),
   (2699, 1),
   (2751, 1),
   (2835, 1),
   (3102, 1),
   (3431, 1),
   (3487, 2),
   (3491, 4),
   (3522, 1),
   (3609, 1),
   (3646, 1),
   (3665, 2),
   (3673, 1),
   (3676, 1),
   (3722, 1),
   (3788, 1),
   (3824, 1),
   (3912, 1),
   (3913,

### Saving outputs
Save the count vector representation as per spectification.
- count_vectors.txt

In [4]:
# Saving the count vectors to count_vectors.txt
with open("count_vectors.txt", "w") as f:
    for webindex, vector in count_vectors:
        vector_str = ",".join([f"{idx}:{freq}" for idx, freq in vector])
        f.write(f"#{webindex},{vector_str}\n")

In [5]:
job_ads = pd.read_csv("processed_job_ads.csv")

# Convert tokenized descriptions back to list of words format for FastText training
tokenized_descriptions = [eval(desc) for desc in job_ads['Tokenized Description'].tolist()]

# Train the FastText model
fasttext_model = FastText(sentences=tokenized_descriptions, vector_size=100, window=5, min_count=1, workers=4)

In [6]:
# Function to generate unweighted document embeddings
def get_unweighted_embedding(tokens, model):
    # Retrieve embeddings for each token and average them
    embeddings = [model.wv[token] for token in tokens if token in model.wv.index_to_key]
    if embeddings:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(model.vector_size)

# Generate unweighted embeddings for all descriptions
unweighted_embeddings = [get_unweighted_embedding(tokens, fasttext_model) for tokens in tokenized_descriptions]

# Convert embeddings to numpy array for easier manipulation
unweighted_embeddings = np.array(unweighted_embeddings)

unweighted_embeddings.shape

(776, 100)

In [7]:
# Extract the tokenized descriptions and convert them back to string format for vectorization
descriptions = [' '.join(eval(desc)) for desc in job_ads['Tokenized Description'].tolist()]

# Create a dictionary for vocabulary with word as key and its index as value
vocab_dict = {word: index for index, word in enumerate(vocab)}

# Compute TF-IDF scores for the terms in the descriptions
tfidf_vectorizer = TfidfVectorizer(vocabulary=vocab_dict)
tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)

tfidf_matrix.shape

(776, 5168)

In [8]:
def get_tfidf_weighted_embedding(tokens, model, tfidf_vector):
    # Retrieve embeddings for each token and weight them by their TF-IDF scores
    embeddings = [model.wv[token] * tfidf_vector[tfidf_vectorizer.vocabulary_[token]] 
                  for token in tokens if token in model.wv.index_to_key and token in tfidf_vectorizer.vocabulary_]
    if embeddings:
        return np.sum(embeddings, axis=0)
    else:
        return np.zeros(model.vector_size)

# Generate TF-IDF weighted embeddings for all descriptions again using the corrected function
tfidf_weighted_embeddings = [get_tfidf_weighted_embedding(tokens, fasttext_model, tfidf_vector.toarray()[0]) 
                             for tokens, tfidf_vector in zip(tokenized_descriptions, tfidf_matrix)]

# Convert embeddings to numpy array for easier manipulation
tfidf_weighted_embeddings = np.array(tfidf_weighted_embeddings)

tfidf_weighted_embeddings.shape

(776, 100)

## Task 3. Job Advertisement Classification

...... Sections and code blocks on buidling classification models based on different document feature represetations. 
Detailed comparsions and evaluations on different models to answer each question as per specification. 

<span style="color: red"> You might have complex notebook structure in this section, please feel free to create your own notebook structure. </span>

### Q1: Unweighted Data and TFIDF Weighted Data

In [9]:
# Extract the labels (job categories) from the dataframe
labels = job_ads['Category'].tolist()

# Confirm the number of unique labels/categories
unique_labels = set(labels)
len(unique_labels), unique_labels

(4, {'Accounting_Finance', 'Engineering', 'Healthcare_Nursing', 'Sales'})

In [10]:
# Initialize the logistic regression model
logreg = LogisticRegression(max_iter=1000, random_state=42)

# Evaluate the model using 5-fold cross-validation with TF-IDF weighted embeddings
tfidf_weighted_scores = cross_val_score(logreg, tfidf_weighted_embeddings, labels, cv=5)

# Evaluate the model using 5-fold cross-validation with unweighted embeddings
unweighted_scores = cross_val_score(logreg, unweighted_embeddings, labels, cv=5)

# Calculate average accuracy for both embeddings
avg_accuracy_tfidf_weighted = np.mean(tfidf_weighted_scores)
avg_accuracy_unweighted = np.mean(unweighted_scores)

avg_accuracy_tfidf_weighted, avg_accuracy_unweighted

(0.2976840363937138, 0.41882547559966915)

#### Q1 pt1 Results:

TFIDF Weighted Accuracy: 29.77%

Unweighted Accuracy: 42.40%

### Q2 Data Loading and Processing

In [11]:
# Load the dataset
df_job_ads = pd.read_csv('processed_job_ads.csv')
df_job_ads.head()


Unnamed: 0,Title,Webindex,Company,Description,Category,Tokenized Description
0,FP&A Blue Chip,68802053,Hays Senior Finance,A market leading retail business is going thro...,Accounting_Finance,"['market', 'retail', 'rapid', 'growth', 'due',..."
1,Part time Management Accountant,70757636,FS2 UK Ltd,You will be responsible for the efficient runn...,Accounting_Finance,"['responsible', 'efficient', 'running', 'accou..."
2,IFA EMPLOYED,71356489,Clark James Ltd,Role The purpose of the role is to provide adv...,Accounting_Finance,"['purpose', 'advice', 'telephone', 'leads', 's..."
3,Finance Manager,69073629,Accountancy Action Ltd,"Excellent opportunity to join our client, an e...",Accounting_Finance,"['expanding', 'recruit', 'aca', 'qualified', '..."
4,Management Accountant,70656648,Alexander Lloyd,Our client offers a interesting opportunity fo...,Accounting_Finance,"['offers', 'interesting', 'part', 'qualified',..."


In [12]:
# Load vocabulary
with open('vocab.txt', 'r') as f:
    vocab = f.read().splitlines()

# Load a few lines from the count vectors file to inspect its structure
count_vectors = pd.read_csv('count_vectors.txt', sep=" ", header=None)
count_vectors

Unnamed: 0,0
0,"#68802053,13:1,17:1,64:1,73:1,77:1,154:1,226:2..."
1,"#70757636,16:1,34:1,35:1,36:2,40:1,135:1,225:1..."
2,"#71356489,77:1,89:1,130:3,135:1,241:2,274:1,42..."
3,"#69073629,13:2,33:1,35:1,36:1,240:1,353:1,600:..."
4,"#70656648,13:1,16:2,32:1,33:2,34:2,36:1,39:1,4..."
...,...
771,"#68056671,104:2,105:1,137:2,154:6,155:1,156:1,..."
772,"#68256016,44:2,193:1,207:1,274:1,448:1,523:1,6..."
773,"#71737507,28:1,36:1,49:1,240:1,259:2,266:1,424..."
774,"#70205492,28:3,45:1,193:1,424:1,444:1,513:1,60..."


### Description as Data for Learning

In [13]:
# Extract Webindex and count vectors from the sparse representation
web_indices = []
dense_count_vectors = np.zeros((count_vectors.shape[0], len(vocab)))

for idx, row in enumerate(count_vectors[0]):
    items = row.split(',')
    web_indices.append(items[0][1:])  # Extract Webindex (excluding the '#')
    
    for item in items[1:]:
        word_idx, count = item.split(':')
        dense_count_vectors[idx, int(word_idx)] = int(count)

# Convert to DataFrame
df_dense_count_vectors = pd.DataFrame(dense_count_vectors, columns=vocab)
df_dense_count_vectors['Webindex'] = web_indices

df_dense_count_vectors.head()

Unnamed: 0,aap:0,aaron:1,aat:2,abb:3,abenefit:4,aberdeen:5,abi:6,abilities:7,abreast:8,abroad:9,...,yeovil:5159,yn:5160,york:5161,yorkshire:5162,youmust:5163,young:5164,younger:5165,yrs:5166,zest:5167,Webindex
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,68802053
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70757636
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,71356489
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,69073629
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70656648


In [15]:
# Load the entire count vectors dataset
full_count_vectors = pd.read_csv('count_vectors.txt', sep=" ", header=None)

# Extract Webindex and count vectors from the sparse representation
web_indices_full = []
dense_count_vectors_full = np.zeros((full_count_vectors.shape[0], len(vocab)))

for idx, row in enumerate(full_count_vectors[0]):
    items = row.split(',')
    web_indices_full.append(items[0][1:])  # Extract Webindex (excluding the '#')
    
    for item in items[1:]:
        word_idx, count = item.split(':')
        dense_count_vectors_full[idx, int(word_idx)] = int(count)

# Convert to DataFrame
df_dense_count_vectors_full = pd.DataFrame(dense_count_vectors_full, columns=vocab)
df_dense_count_vectors_full['Webindex'] = web_indices_full

df_dense_count_vectors_full.head()

Unnamed: 0,aap:0,aaron:1,aat:2,abb:3,abenefit:4,aberdeen:5,abi:6,abilities:7,abreast:8,abroad:9,...,yeovil:5159,yn:5160,york:5161,yorkshire:5162,youmust:5163,young:5164,younger:5165,yrs:5166,zest:5167,Webindex
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,68802053
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70757636
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,71356489
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,69073629
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70656648


In [16]:
# Convert Webindex to int for both dataframes
df_dense_count_vectors_full['Webindex'] = df_dense_count_vectors_full['Webindex'].astype(int)
df_job_ads['Webindex'] = df_job_ads['Webindex'].astype(int)

# Merge the datasets based on Webindex
merged_data = pd.merge(df_dense_count_vectors_full, df_job_ads[['Webindex', 'Category']], on='Webindex', how='inner')

# Splitting the dataset into features and target
X = merged_data.drop(columns=['Webindex', 'Category'])
y = merged_data['Category']

X.shape, y.shape

((776, 5168), (776,))

In [17]:
# Initialize the Logistic Regression model
lr_model = LogisticRegression(max_iter=1000, random_state=42)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(lr_model, X, y, cv=5, scoring='accuracy')

# Calculate the average accuracy from the 5-fold cross-validation
avg_accuracy = np.mean(cv_scores)
avg_accuracy

0.8852936311000826

### Description Accuracy (Q1 & Q2): 88.53%

Title as Data for Learning

In [18]:
# Initialize CountVectorizer using the provided vocabulary
vectorizer = CountVectorizer(vocabulary=vocab, lowercase=True, token_pattern=r'\b\w+\b')

# Generate count vectors for job titles
title_vectors = vectorizer.transform(df_job_ads['Title']).toarray()

# Convert to DataFrame
df_title_vectors = pd.DataFrame(title_vectors, columns=vocab)

df_title_vectors.head()

Unnamed: 0,aap:0,aaron:1,aat:2,abb:3,abenefit:4,aberdeen:5,abi:6,abilities:7,abreast:8,abroad:9,...,years:5158,yeovil:5159,yn:5160,york:5161,yorkshire:5162,youmust:5163,young:5164,younger:5165,yrs:5166,zest:5167
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# Perform 5-fold cross-validation using title vectors as features
title_cv_scores = cross_val_score(lr_model, df_title_vectors, y, cv=5, scoring='accuracy')

# Calculate the average accuracy from the 5-fold cross-validation for titles
avg_title_accuracy = np.mean(title_cv_scores)
avg_title_accuracy

0.2976840363937138

### Title Data Accuracy: 29.77%

Combined Representations

In [20]:
# Concatenate the count vectors of titles and descriptions
combined_vectors = pd.concat([df_title_vectors, X.reset_index(drop=True)], axis=1)

combined_vectors.head()

Unnamed: 0,aap:0,aaron:1,aat:2,abb:3,abenefit:4,aberdeen:5,abi:6,abilities:7,abreast:8,abroad:9,...,years:5158,yeovil:5159,yn:5160,york:5161,yorkshire:5162,youmust:5163,young:5164,younger:5165,yrs:5166,zest:5167
0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
# Perform 5-fold cross-validation using combined vectors as features
combined_cv_scores = cross_val_score(lr_model, combined_vectors, y, cv=5, scoring='accuracy')

# Calculate the average accuracy from the 5-fold cross-validation for combined vectors
avg_combined_accuracy = np.mean(combined_cv_scores)
avg_combined_accuracy

0.8852936311000826

### Title and Description Data Accuracy: 88.53%

## Summary
Give a short summary and anything you would like to talk about the assessment tasks here.

### Q1: Language model comparisons

Unweighted:                 42.39%

TF-IDF Weighted embeddings: 29.77%

Vectorized:                 88.53%

From this information we can see that the vectorized data performed the best, this is likely due to the nature of the task, in this task of predicting the job category, the vectorized data can effectivly learn from keywords in the description, and therefore can more accurately predict the job category. The TF-IDF weighted embeddings performed the worst, this is likely due to the fact that the TF-IDF weighted embeddings do not preform well when there are a large number of unique words, as is the case in this task as well as an over reliance on rare or unique words. The unweighted embeddings performed better than the TF-IDF weighted embeddings, this is likely due to the fact that the unweighted embeddings do not rely on the frequency of words, and therefore can more accurately predict the job category.

### Q2: Does more information provide higher accuracy?

Title only:         29.77%

Description only:   88.53%

Combined:           88.53%

From this information we can clearly see that having more information does not always increase the accuracy of the model, in this case adding the title information to the description information did not increase the accuracy of the model. This is likely due to the fact that the title information is not as descriptive as the description information, and therefore does not provide any additional information to the model.