# Enhanced NLP Workbook for Recruitment Chatbot with Advanced Feature Engineering, State Machine, Sentiment Analysis, Personalization and Graph Database Integration

In [1]:
#!pip install rasa rasa-sdk scikit-learn vaderSentiment matplotlib seaborn tqdm scipy wordcloud shap textstat transitions neo4j

### Step 2: Importing Required Libraries

We import necessary Python libraries for data manipulation, feature extraction, modeling, evaluation, and visualization.

- **Numpy and Pandas** for data manipulation.
- **Scikit-learn** for model building, feature extraction, and evaluation.
- **VaderSentiment** for sentiment analysis.
- **Matplotlib and Seaborn** for data visualization.
- **Tqdm** for progress bars to monitor loops and training processes.
- **Transitions** for implementing state machines to manage the conversation flow.
- **Neo4j** for storing the conversation flow in a graph database.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from tqdm import tqdm
import shap
from textstat import flesch_reading_ease
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transitions import Machine  # State machine for managing question flow
from neo4j import GraphDatabase  # Graph database for flexible conversation flow
import json  # For storing rule-based logic in a JSON format

### Step 3: Loading, Merging, and Integrating New Datasets

In this step, we merge several data sources (`sample_skills.csv`, `sample_job_summary.csv`, and `gd_rev_preprocessed.csv`) into a unified dataset, so that each job title carries its skills, summary, and interview information. Note that the loading cell in this step reads only `data/gsearch_jobs.csv`, in which the job title column is named `title`.

- **Data Sources**: Skills, job summaries, and interview Q&A.
- **Purpose**: To enrich the dataset with all possible information to produce insightful NLP analysis.
- **Merging Strategy**: Merge on `job_title` to ensure that all related information is brought together.
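A minimal sketch of the merge-on-`job_title` strategy, using tiny in-memory tables in place of the CSV files. Every column name other than `job_title` (`skills`, `summary`) and every value here is hypothetical; the real files may be structured differently.

```python
import pandas as pd

# Hypothetical stand-ins for sample_skills.csv and sample_job_summary.csv
skills = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Engineer"],
    "skills": ["sql, excel", "python, spark"],
})
summaries = pd.DataFrame({
    "job_title": ["Data Analyst", "ML Engineer"],
    "summary": ["Builds reports", "Deploys models"],
})

# An outer merge keeps titles that appear in only one source;
# indicator=True adds a _merge column recording where each row came from.
merged = skills.merge(summaries, on="job_title", how="outer", indicator=True)
print(merged)
```

An outer join is used here so that unmatched titles stay visible; switching to `how="inner"` would keep only titles present in both tables.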


In [3]:
# Load datasets
gsearch_jobs = pd.read_csv('data/gsearch_jobs.csv')

In [4]:
gsearch_jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53743 entries, 0 to 53742
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           53743 non-null  int64  
 1   index                53743 non-null  int64  
 2   title                53743 non-null  object 
 3   company_name         53743 non-null  object 
 4   location             53706 non-null  object 
 5   via                  53734 non-null  object 
 6   description          53743 non-null  object 
 7   extensions           53743 non-null  object 
 8   job_id               53743 non-null  object 
 9   thumbnail            33792 non-null  object 
 10  posted_at            53708 non-null  object 
 11  schedule_type        53515 non-null  object 
 12  work_from_home       25462 non-null  object 
 13  salary               9199 non-null   object 
 14  search_term          53743 non-null  object 
 15  date_time            53743 non-null 

### Step 4: Data Cleaning and Preprocessing

We clean the text to prepare it for NLP operations: runs of non-word characters (punctuation and whitespace) are replaced with a single space, and the text is converted to lowercase. The cleaned text is stored in a new `description_clean` column.


In [5]:
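def clean_text(text):
    """Replace runs of non-word characters with a space and lowercase the result."""
    text = re.sub(r'\W+', ' ', text)  # Collapse punctuation and whitespace into single spaces
    text = text.lower()  # Convert to lowercase
    return text

# Drop rows with missing descriptions before cleaning;
# otherwise str(NaN) would turn them into the literal text 'nan'
gsearch_jobs.dropna(subset=['description'], inplace=True)

# Apply text cleaning to the description column
gsearch_jobs['description_clean'] = gsearch_jobs['description'].apply(lambda x: clean_text(str(x)))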

### Step 5: Advanced Feature Engineering

#### Step 5.1: TF-IDF Vectorization

We use **TF-IDF Vectorizer** to convert the textual data into numerical feature vectors that the model can process.

- **Why TF-IDF**: It captures the importance of words in a document relative to the corpus, making it a powerful feature extraction technique for NLP.
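To make the weighting concrete, here is a small sketch on a toy corpus (the three sentences are made up for illustration). A word that appears in every document should receive a lower weight than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "python" appears in every document,
# every other word appears in exactly one
corpus = [
    "python developer remote",
    "python analyst onsite",
    "python engineer hybrid",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(corpus)

# Look up the TF-IDF weights of two words in the first document
vocab = toy_vectorizer.vocabulary_
first_doc = toy_matrix[0].toarray().ravel()
weight_python = first_doc[vocab["python"]]
weight_developer = first_doc[vocab["developer"]]

print(f"python: {weight_python:.3f}, developer: {weight_developer:.3f}")
```

Both words occur once in the first document, so the difference in weight comes entirely from the inverse document frequency term.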

In [6]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(gsearch_jobs['description_clean'])

#### Step 5.2: Polynomial Features and Min-Max Scaling

- **Polynomial Features**: Generate interaction terms between features, which can improve model performance when relationships between features are non-linear. The cell in this step uses `degree=1` to limit memory use; at that degree no interaction terms are produced, so the output equals the input features.
- **Min-Max Scaling**: Rescales each feature to the [0, 1] range, which keeps all values non-negative (a requirement of models such as `ComplementNB`).
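The effect of the `degree` setting, and of the `MinMaxScaler` used in this step's code, can be checked on a tiny array. This is an illustrative sketch with made-up numbers, not part of the pipeline.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler

toy = np.array([[2.0, 3.0]])

# degree=1: the output is just the original features
deg1 = PolynomialFeatures(degree=1, interaction_only=True, include_bias=False)
out_deg1 = deg1.fit_transform(toy)

# degree=2 with interaction_only=True adds the product x0*x1 but no squares
deg2 = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
out_deg2 = deg2.fit_transform(toy)

# Min-max scaling maps each column onto the [0, 1] range
column = np.array([[1.0], [3.0], [5.0]])
scaled = MinMaxScaler().fit_transform(column)

print(out_deg1, out_deg2, scaled.ravel())
```

With 5000 TF-IDF features, `degree=2` would add roughly 12.5 million pairwise products per row, which is why the degree is kept at 1 for the full matrix.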


In [7]:
from sklearn.naive_bayes import ComplementNB

In [13]:
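# Polynomial features with degree=1 to keep memory use manageable.
# At degree=1 no interaction terms are added; raise to degree=2 to generate them.
poly = PolynomialFeatures(degree=1, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X.toarray())  # Dense copy, since MinMaxScaler does not accept sparse input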

# Min-Max Scaling to keep features non-negative
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_poly)

#### Step 5.3: Dimensionality Reduction

We use **Truncated SVD** to reduce the dimensionality of the TF-IDF matrix. This helps reduce computational cost and overfitting while preserving essential information.
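One way to judge how much information survives the reduction is the explained variance ratio. The sketch below uses a random sparse matrix as a stand-in for a TF-IDF matrix; the sizes and density are arbitrary assumptions.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a TF-IDF matrix: 20 documents, 10 terms, mostly zeros
toy_tfidf = sparse_random(20, 10, density=0.3, format="csr", random_state=0)

toy_svd = TruncatedSVD(n_components=3, random_state=42)
toy_reduced = toy_svd.fit_transform(toy_tfidf)

# Fraction of the original variance retained by the 3 components
retained = toy_svd.explained_variance_ratio_.sum()
print(toy_reduced.shape, f"retained variance: {retained:.2f}")
```

If the retained fraction is low on the real data, increasing `n_components` trades more computation for less information loss.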

In [14]:
# Apply Truncated SVD to reduce dimensionality
svd = TruncatedSVD(n_components=50, random_state=42)  # Reduce dimensions further to avoid overfitting
X_reduced = svd.fit_transform(X)


### Step 6: Introducing State Machine for Conversation Flow Management

We use a **State Machine** to define the conversation flow for guiding managers through the recruitment question generation process.

#### Step 6.1: Defining States and Transitions

- **States**: Represent parts of the conversation (e.g., Role Requirements, Company Environment, Compensation & Benefits).
- **Transitions**: Define how the flow moves from one state to another based on the manager's response.


In [15]:
from transitions import Machine

# Define states for the recruitment conversation flow
states = ['role_requirements', 'company_environment', 'compensation_benefits', 'role_nuances', 'final_summary']

# Define the state machine model
class RecruitmentAssistant:
    def __init__(self, name):
        self.name = name

# Create an instance of RecruitmentAssistant
recruitment_assistant = RecruitmentAssistant("Assistant")

# Create a state machine with defined states and transitions
machine = Machine(model=recruitment_assistant, states=states, initial='role_requirements')

# Define state transitions based on manager inputs
machine.add_transition(trigger='ask_company_environment', source='role_requirements', dest='company_environment')
machine.add_transition(trigger='ask_compensation', source='company_environment', dest='compensation_benefits')
machine.add_transition(trigger='ask_role_nuances', source='compensation_benefits', dest='role_nuances')
machine.add_transition(trigger='summarize', source='role_nuances', dest='final_summary')

# Example of using the state machine
recruitment_assistant.ask_company_environment()
print(recruitment_assistant.state)  # Output: company_environment

company_environment


### Step 7: Dynamic Question Generation with Decision Trees

We use **Decision Trees** for dynamic questioning, where each node represents a question and each branch represents possible answers leading to different follow-up questions.


In [16]:
from sklearn.tree import DecisionTreeClassifier

# Define sample training data for decision tree - features are hypothetical attributes, target is follow-up question ID
X_sample = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]  # Example feature vectors
y_sample = [0, 1, 2, 3]  # Follow-up question IDs

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_sample, y_sample)

# Use decision tree to determine the next question
sample_input = [1, 0, 1]
next_question = decision_tree.predict([sample_input])
print(f"Next question ID: {next_question}")

Next question ID: [0]
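The predicted ID still has to be mapped to an actual question. A possible sketch follows, reusing the same toy data; the question texts and the feature names are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical question texts keyed by follow-up question ID
question_bank = {
    0: "Which tools should the candidate already know?",
    1: "How large is the team?",
    2: "Is the salary range fixed?",
    3: "Is remote work possible?",
}

# Same toy data as in the decision tree cell
X_toy = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_toy = [0, 1, 2, 3]

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# predict returns an array, so take the first element before the lookup
next_id = int(tree.predict([[1, 0, 1]])[0])
print(question_bank[next_id])

# Print the learned rules; the feature names are assumptions
print(export_text(tree, feature_names=["is_remote", "is_senior", "is_technical"]))
```

`export_text` makes the branching logic readable, which helps when checking that each answer pattern leads to a sensible follow-up.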


### Step 8: Introducing Personalization and Rule-Based Logic

We use **NLP models** for evaluating sentiment, determining the conversation tone, and dynamically adjusting questions to improve personalization.

#### Step 8.1: Sentiment Analysis and Adaptive Questioning

We use **Naïve Bayes** or **Logistic Regression** models for text classification to evaluate the sentiment of responses and determine the conversation's engagement level.


In [17]:
from sklearn.naive_bayes import ComplementNB
from sklearn.linear_model import LogisticRegression

# Train a simple sentiment model (example data)
# ComplementNB requires non-negative features, so it is fit on the TF-IDF matrix X.
# Fitting it on the SVD output X_reduced fails with
# "ValueError: Negative values in data passed to ComplementNB (input X)".
text_clf = ComplementNB()
y_labels = np.random.choice([0, 1], X.shape[0])  # Randomly generated labels, for demonstration only
text_clf.fit(X, y_labels)

# Analyze sentiment and adjust follow-up
sample_response = "I would prefer a remote work setting."
sample_vector = vectorizer.transform([clean_text(sample_response)])  # Same cleaning and features as training
sentiment = text_clf.predict(sample_vector)
if sentiment[0] == 1:
    print("Positive sentiment detected, proceeding with follow-up questions about remote tools and flexibility.")
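Naive Bayes variants such as `ComplementNB` reject negative inputs, and SVD components are generally signed, so the two cannot be combined directly. Logistic Regression has no such restriction. The sketch below chains TF-IDF, Truncated SVD, and Logistic Regression in one pipeline; the four labelled responses are invented for illustration and far too few for a real model.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labelled responses: 1 = positive, 0 = negative
responses = [
    "I love the flexible remote setup",
    "The team is great and supportive",
    "I am unhappy with the long hours",
    "The commute is terrible and tiring",
]
labels = [1, 1, 0, 0]

# Chaining the steps guarantees that new text goes through the same
# TF-IDF and SVD transforms that were fit on the training text
sentiment_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=42),
    LogisticRegression(),
)
sentiment_pipeline.fit(responses, labels)

prediction = sentiment_pipeline.predict(["I would prefer a remote work setting."])
print(prediction[0])
```

Using a pipeline also avoids a shape mismatch at prediction time, because the sample is reduced to the same number of components the classifier was trained on.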

#### Step 8.2: Rule-Based Logic Stored in JSON

To make the decision flow configurable and easier to maintain, we define the conversation rules as a JSON-compatible dictionary, which can be saved to and loaded from a JSON file.


In [None]:
# Define rules for conversation flow in a JSON format
rules = {
    "role_requirements": {
        "next": "company_environment",
        "questions": ["What are the must-have skills for this role?", "Are there any certifications required?"]
    },
    "company_environment": {
        "next": "compensation_benefits",
        "questions": ["How many people are in the team?", "Can you describe the company culture?"]
    }
}

# Example usage of JSON-based rules
current_state = "role_requirements"
for question in rules[current_state]["questions"]:
    print(question)
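The `next` field of each rule can drive the whole conversation. The sketch below serializes a shortened copy of the rules with the standard `json` module and then follows the `next` pointers; the `final_summary` entry with `"next": None` is an assumption added so the walk has an end.

```python
import json

# Shortened copy of the rules; the "final_summary" entry is an assumption
demo_rules = {
    "role_requirements": {
        "next": "company_environment",
        "questions": ["What are the must-have skills for this role?"],
    },
    "company_environment": {
        "next": "final_summary",
        "questions": ["Can you describe the company culture?"],
    },
    "final_summary": {"next": None, "questions": []},
}

# Round-trip through a JSON string, as would happen when saving to and loading from a file
loaded_rules = json.loads(json.dumps(demo_rules, indent=2))

# Follow the "next" pointers until a state has no successor
visited = []
state = "role_requirements"
while state is not None:
    visited.append(state)
    for question in loaded_rules[state]["questions"]:
        print(f"[{state}] {question}")
    state = loaded_rules[state]["next"]
```

Python's `None` is written as JSON `null` and read back as `None`, so the termination check works the same before and after serialization.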

### Step 9: Combining Predefined Question Templates and Dynamic Elements

We blend **predefined question templates** with dynamically generated content to ensure the conversation is both personalized and comprehensive.

- Start with a core set of questions (e.g., role-specific skills).
- Adaptively generate follow-up prompts based on previous answers.


In [None]:
# Example of predefined question and dynamically generated follow-up
core_questions = ["What are the must-have skills for this role?"]
response = "This role is temporary."

# Use response to create a personalized follow-up
if "temporary" in response.lower():
    follow_up = "Considering that the role is temporary, would you like to discuss the option for contract renewal and team integration procedures?"
    core_questions.append(follow_up)

for question in core_questions:
    print(question)

### Step 10: Splitting Data for Training and Evaluation

We split our dataset into training and testing sets to evaluate our model's performance accurately.

In [None]:
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_reduced, gsearch_jobs['title'], test_size=0.2, random_state=42)

### Step 11: Model Training

We use an **SGDClassifier**, a linear model with stochastic gradient descent learning, which is efficient for large datasets.

- **Why SGD**: It works well with high-dimensional data and supports various loss functions suitable for classification.


In [None]:
# Initialize and train SGD Classifier
model = SGDClassifier()
model.fit(X_train, y_train)

# Confirm that training has finished
print("Model training complete. Now proceeding to evaluation...")
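The metric functions imported in Step 2 could be applied to a held-out set along the following lines. This sketch uses synthetic three-class data in place of the job features, and macro averaging is an assumed choice rather than one stated in the workbook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data stands in for the reduced job features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_model = SGDClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

# Macro averaging weights every class equally; zero_division=0 avoids
# warnings for classes that never appear in the predictions
metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred, average="macro", zero_division=0),
    "recall": recall_score(y_te, y_pred, average="macro", zero_division=0),
    "f1": f1_score(y_te, y_pred, average="macro", zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When the target has many rare classes, as job titles typically do, macro-averaged scores tend to be much lower than accuracy, so it is worth reporting both.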

### Step 12: Model Evaluation, Sentiment, Interview Response Analysis, and Explainability

#### Step 12.1: Sentiment and Interview Response Analysis

We add **sentiment analysis** to understand the overall sentiment behind the 'pros' and 'cons' sections, and the candidate interview responses. This helps gauge candidates' attitudes and alignment with company culture.


In [None]:
# Initialize sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Compound VADER sentiment score (from -1 to 1) of each posting's description_tokens, stored as pros_sentiment
gsearch_jobs['pros_sentiment'] = gsearch_jobs['description_tokens'].apply(lambda x: analyzer.polarity_scores(str(x))['compound'])

#### Step 12.2: Enhanced Analysis

We use various metrics to assess content quality and coverage, ensuring that the model-generated questions align with the hiring requirements and cover essential job aspects.
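As one possible coverage metric, the fraction of required topics touched by the generated questions can be computed with simple keyword matching. The function name, topics, and keywords below are all hypothetical; a real check would likely need a richer notion of topic match.

```python
def topic_coverage(questions, topic_keywords):
    """Return the fraction of topics mentioned by at least one question, plus the covered topics."""
    text = " ".join(questions).lower()
    covered = [
        topic
        for topic, keywords in topic_keywords.items()
        if any(keyword in text for keyword in keywords)
    ]
    return len(covered) / len(topic_keywords), covered

# Hypothetical topics and the keywords that signal them
topic_keywords = {
    "skills": ["skill", "certification"],
    "culture": ["culture", "team"],
    "compensation": ["salary", "benefit"],
}
generated_questions = [
    "What are the must-have skills for this role?",
    "Can you describe the company culture?",
]

score, covered_topics = topic_coverage(generated_questions, topic_keywords)
print(f"coverage: {score:.2f}, covered: {covered_topics}")
```

Here the compensation topic is not covered, which would signal that a salary or benefits question should be added.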

### Step 13: Explainability with SHAP

We use **SHAP (SHapley Additive exPlanations)** to explain the output of our model, providing transparency in decision-making and helping us understand which features are most influential in predictions.


In [None]:
# Use a small subset of the training data to fit the SHAP explainer
explainer = shap.Explainer(model, X_train[:100])
shap_values = explainer(X_test[:10])

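# Plot summary of the SHAP values
# The model was trained on the SVD components, so features are named by component
# rather than by the TF-IDF vocabulary (which has a different length)
svd_feature_names = [f"svd_component_{i}" for i in range(X_test.shape[1])]
shap.summary_plot(shap_values, X_test[:10], feature_names=svd_feature_names)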

### Step 14: Combining Graph Database for Conversation Flexibility

Storing the conversation flow in a **graph database** such as Neo4j makes it possible to navigate between questions and adapt them based on user interaction.
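Before involving a database, the idea can be illustrated with a plain-Python stand-in: questions are nodes, answer options are edges, and each option may lead to a follow-up question. This is not the Neo4j API, and the questions and options are made up.

```python
# Plain-Python stand-in for the graph (not the Neo4j API):
# each question maps its answer options to a follow-up question
conversation_graph = {
    "Is the role remote?": {
        "Yes": "Which collaboration tools does the team use?",
        "No": "Where is the office located?",
    },
    "Which collaboration tools does the team use?": {},
    "Where is the office located?": {},
}

def next_question(graph, question, answer):
    """Follow the edge for the chosen answer; return None when no follow-up exists."""
    return graph.get(question, {}).get(answer)

follow_up = next_question(conversation_graph, "Is the role remote?", "Yes")
print(follow_up)
```

A graph database stores the same structure persistently and lets new branches be added without changing application code.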


In [None]:
# Establish a connection to Neo4j database
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Define a function to add nodes and relationships
def add_question(tx, question, answer_options):
    tx.run("CREATE (q:Question {text: $question})", question=question)
    for option in answer_options:
        tx.run("MATCH (q:Question {text: $question}) CREATE (q)-[:HAS_OPTION]->(:Option {text: $option})", question=question, option=option)