# AI-Powered CV Analysis for Career Guidance

  ## Project Overview
  This notebook demonstrates the development of an AI system designed to assist university students and unemployed graduates by:
  1.  **Classifying Resumes:** Automatically categorizing CVs into professional fields.
  2.  **Estimating Job Salary:** Providing a data-driven salary estimation for potential job roles.
  3.  **Resume Professionalism Scoring:** Offering a benchmark for resume quality to help users strengthen their CVs (a score of 0.6 or higher is considered good).

  **Authors:** Jean-Paul Gergess and Kassem Chebly
  **Date:** May 31, 2025

In [None]:
# Ensure you have these libraries installed in the Python environment
  # that your Jupyter Notebook is using. If you get a ModuleNotFoundError,
  # you might need to run: !pip install pandas tensorflow scikit-learn matplotlib seaborn

  import tensorflow as tf
  from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import TextVectorization # For TF 2.6+
  import numpy as np
  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import LabelEncoder
  from sklearn.metrics import classification_report, confusion_matrix
  import matplotlib.pyplot as plt
  import seaborn as sns
  import os
  import re
  import string
  import pickle

In [None]:
# Define Hyperparameters
  MAX_FEATURES = 15000  # Max number of unique words in vocabulary for TextVectorization
  SEQUENCE_LENGTH = 512 # Max length of a processed CV text sequence

  # Define File Paths (ADJUST THESE BASED ON YOUR PROJECT FOLDER STRUCTURE)
  RESUME_DATA_PATH = 'resume_dataset.csv'
  SALARY_DATA_PATH = 'glassdoor_salaries.csv'
  SAVE_DIR = 'salary_model_assets' # Directory to save models and assets

In [None]:
print("--- Loading Resume Dataset ---")
  resume_df = pd.read_csv(RESUME_DATA_PATH)
  print("Dataset loaded successfully.")
  print("\nResume Dataset Info:")
  resume_df.info()
  print("\nResume Categories Distribution:")
  print(resume_df['Category'].value_counts())
  print("\nFirst 5 rows of Resume Dataset:")
  print(resume_df.head())

In [None]:
print("\n--- Loading Salary Dataset ---")
  salary_df = pd.read_csv(SALARY_DATA_PATH)
  print("Salary dataset loaded successfully.")
  print("\nSalary Dataset Info:")
  salary_df.info()
  print("\nFirst 5 rows of Salary Dataset:")
  print(salary_df.head())

In [None]:
def preprocess_text(text):
      # Convert to string and lowercase
      text = str(text).lower()
      # Remove punctuation
      text = text.translate(str.maketrans('', '', string.punctuation))
      # Remove numbers
      text = re.sub(r'\d+', '', text)
      # Remove extra whitespace and strip leading/trailing spaces
      text = re.sub(r'\s+', ' ', text).strip()
      return text
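
To make the effect of these cleaning steps concrete, here is a small standalone check. It repeats `preprocess_text` as defined above so that it runs on its own; the sample string is made up for illustration. Note that digits are removed, so numeric details such as years of experience do not survive cleaning.

```python
import re
import string

def preprocess_text(text):
    # Same steps as the notebook cell above, repeated so this sketch runs standalone
    text = str(text).lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

sample = "John Doe. 5 years experience in Python, Django & AWS!"  # made-up sample text
cleaned = preprocess_text(sample)
print(cleaned)
```

Any feature that depends on numbers or punctuation (for example "5 years" or "M.Sc.") therefore has to be extracted from the raw text before this function is applied.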

In [None]:
print("\n--- Applying Text Preprocessing ---")
  resume_df['Cleaned_Resume'] = resume_df['Resume'].apply(preprocess_text)
  print("Text cleaning applied to 'Resume' column.")
  print("\nFirst 5 rows with Cleaned_Resume:")
  print(resume_df[['Resume', 'Cleaned_Resume']].head())

In [None]:
print("\n--- Encoding Categories ---")
  label_encoder = LabelEncoder()
  resume_df['Category_Encoded'] = label_encoder.fit_transform(resume_df['Category'])

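  # Save the LabelEncoder for future use (e.g., in an app.py script).
  # Create SAVE_DIR first; open() fails if the directory does not exist yet.
  os.makedirs(SAVE_DIR, exist_ok=True)
  encoder_path = os.path.join(SAVE_DIR, 'label_encoder.pkl')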
  with open(encoder_path, 'wb') as f:
      pickle.dump(label_encoder, f)
  print(f"LabelEncoder fitted and saved to {encoder_path}")

  print("\nCategory Mappings:")
  for i, category in enumerate(label_encoder.classes_):
      print(f"{i}: {category}")
  print("\nFirst 5 rows with Category_Encoded:")
  print(resume_df[['Category', 'Category_Encoded']].head())

In [None]:
print("\n--- Splitting Data into Training and Validation Sets ---")
  X = resume_df['Cleaned_Resume']
  y = resume_df['Category_Encoded']

  # Use stratify=y to ensure category distribution is similar in both splits
  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
  print(f"Training samples: {len(X_train)}")
  print(f"Validation samples: {len(X_val)}")
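
What `stratify=y` provides can be seen in a small pure-Python sketch. This is a simplified illustration of the idea, not scikit-learn's implementation; the helper name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class, but at least one sample
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy, imbalanced labels (hypothetical): 50 "Data Science" and 10 "HR" resumes
labels = ["Data Science"] * 50 + ["HR"] * 10
train_idx, val_idx = stratified_split(labels)
val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```

Without stratification, a purely random 20% sample of such a small, imbalanced set could contain very few "HR" resumes or none at all, which would make the validation metrics for that class unreliable.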

In [None]:
print("\n--- Adapting TextVectorization Layer ---")
  vectorize_layer = TextVectorization(
      max_tokens=MAX_FEATURES,
      output_mode='int', # Outputs integer indices for words
      output_sequence_length=SEQUENCE_LENGTH # Pads/truncates sequences to this length
  )
  # Adapt the layer ONLY on the training data to prevent data leakage from validation set
  vectorize_layer.adapt(X_train)
  print(f"Vocabulary size created by TextVectorization: {len(vectorize_layer.get_vocabulary())}")

  # Save the adapted TextVectorization layer so it can be loaded later for consistent preprocessing
  vectorizer_model = tf.keras.models.Sequential([vectorize_layer]) # Wrap in Sequential to save it
  vectorizer_model_path = os.path.join(SAVE_DIR, 'text_vectorizer_model.keras')
  vectorizer_model.save(vectorizer_model_path)
  print(f"TextVectorization layer saved to {vectorizer_model_path}")
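
The sketch below imitates, in plain Python, what `output_mode='int'` does conceptually: index 0 is reserved for padding, index 1 for out-of-vocabulary words, and the remaining slots go to the most frequent training words. It is a simplified illustration rather than the Keras implementation; `build_vocab`, `vectorize`, and the toy corpus are hypothetical names and data.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Reserve index 0 (padding) and 1 (OOV), then add the most frequent words."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: idx for idx, word in enumerate(most_common, start=2)}

def vectorize(text, vocab, sequence_length):
    """Map words to indices, then truncate or right-pad with zeros to a fixed length."""
    ids = [vocab.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Toy training corpus: "python" occurs 3 times, "aws" twice, the rest once
train_texts = ["python developer python aws", "python aws audit"]
vocab = build_vocab(train_texts, max_tokens=4)  # 2 reserved slots + 2 real words
encoded = vectorize("python lawyer aws", vocab, sequence_length=5)
print(vocab, encoded)
```

Because the two reserved entries count towards `max_tokens`, the largest index the layer can emit is `max_tokens - 1`. Words seen only in validation data, such as "lawyer" here, map to the OOV index, which is why the layer is adapted on the training split alone.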

In [None]:
print("\n--- Applying Vectorization to Training and Validation Data ---")
  X_train_vectorized = vectorize_layer(X_train)
  X_val_vectorized = vectorize_layer(X_val)
  print(f"Shape of vectorized training data: {X_train_vectorized.shape}")
  print(f"Shape of vectorized validation data: {X_val_vectorized.shape}")

In [None]:
print("\n--- Building the Classification Model Architecture ---")
  model = Sequential([
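      # Embedding layer: Converts integer sequences to dense vectors. TextVectorization's
      # max_tokens already counts the padding (0) and OOV (1) indices, so the +1 is only a safety margin.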
      Embedding(input_dim=MAX_FEATURES + 1, output_dim=256, input_length=SEQUENCE_LENGTH),
      # Bidirectional LSTM layers: Process sequences in both directions to capture more context.
      Bidirectional(LSTM(128, return_sequences=True)), # return_sequences=True to stack another LSTM
      Bidirectional(LSTM(64)), # This is the last LSTM layer, so no return_sequences
      # Dense layers for feature learning
      Dense(128, activation='relu'),
      Dense(64, activation='relu'),
      # Output layer: 'softmax' for multi-class classification, units = number of unique categories
      Dense(len(label_encoder.classes_), activation='softmax')
  ])
  print("Model architecture defined.")
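
As a sanity check on the architecture, the parameter counts can be worked out by hand. The sketch below assumes the standard Keras LSTM parameterization (four gates, one bias vector per gate) and an illustrative `num_classes = 25`; the notebook itself uses `len(label_encoder.classes_)`, so the output layer and the total will differ if the dataset has a different number of categories.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

MAX_FEATURES = 15000
EMBED_DIM = 256
num_classes = 25  # assumption for illustration only

embedding = (MAX_FEATURES + 1) * EMBED_DIM
bilstm_1 = 2 * lstm_params(EMBED_DIM, 128)  # bidirectional doubles the parameters
bilstm_2 = 2 * lstm_params(2 * 128, 64)     # input is the concatenated 2 x 128 output
dense_1 = dense_params(2 * 64, 128)
dense_2 = dense_params(128, 64)
output = dense_params(64, num_classes)

total = embedding + bilstm_1 + bilstm_2 + dense_1 + dense_2 + output
print(f"Embedding: {embedding:,}  BiLSTM: {bilstm_1:,} + {bilstm_2:,}  Total: {total:,}")
```

The embedding table dominates the parameter budget, so lowering `MAX_FEATURES` or `output_dim` is the most direct way to shrink the model if overfitting or memory becomes a concern.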

In [None]:
print("\n--- Compiling the Model ---")
  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy', # Use this for integer-encoded labels (like from LabelEncoder)
                metrics=['accuracy'])
  print("Model compiled successfully.")
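
For intuition about the loss chosen here, sparse categorical crossentropy is the mean negative log-probability that the model assigns to the true class, where labels are integer class ids rather than one-hot vectors. The following is a minimal plain-Python sketch with made-up probabilities; Keras additionally clips probabilities for numerical stability, which is omitted here.

```python
import math

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability assigned to the true class); y_true holds integer class ids."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

y_true = [0, 2]                    # integer labels, as produced by LabelEncoder
y_prob = [[0.7, 0.2, 0.1],         # softmax output for sample 1 (made-up numbers)
          [0.1, 0.1, 0.8]]         # softmax output for sample 2 (made-up numbers)
loss = sparse_categorical_crossentropy(y_true, y_prob)
print(f"loss = {loss:.4f}")
```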

In [None]:
print("\n--- Model Summary ---")
  model.summary()

In [None]:
print("\n--- Starting Model Training ---")
  # Callbacks for better training control
  early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
  model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
      filepath=os.path.join(SAVE_DIR, 'best_classification_model.keras'), # Saves in .keras format
      monitor='val_accuracy', # Monitor validation accuracy
      save_best_only=True,    # Save only the model with the best val_accuracy
      verbose=1               # Log when a new best model is saved
  )

  history = model.fit(X_train_vectorized, y_train,
                      epochs=20, # You can increase or decrease this based on performance and early stopping
                      batch_size=32,
                      validation_data=(X_val_vectorized, y_val),
                      callbacks=[early_stopping, model_checkpoint])
  print("Model training complete.")
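
The patience logic behind `EarlyStopping` can be sketched in a few lines. This is a simplified illustration (no `min_delta`, no weight restoring), and the validation losses are made up; `early_stopping_epoch` is a hypothetical helper, not a Keras function.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based, for patience on a loss to minimize."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best value: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: the best value occurs at epoch index 2
val_losses = [1.20, 0.90, 0.75, 0.80, 0.78, 0.79, 0.81, 0.77, 0.76, 0.74]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=5)
print(stop_epoch, best_epoch)
```

In the cell above, early stopping watches `val_loss` while the checkpoint watches `val_accuracy`, so the weights restored in memory and the model saved to `best_classification_model.keras` may come from different epochs.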

In [None]:
print("\n--- Plotting Training History ---")
  plt.figure(figsize=(12, 5))

  plt.subplot(1, 2, 1)
  plt.plot(history.history['accuracy'], label='Training Accuracy')
  plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
  plt.title('Training and Validation Accuracy')
  plt.xlabel('Epoch')
  plt.ylabel('Accuracy')
  plt.legend()
  plt.grid(True)

  plt.subplot(1, 2, 2)
  plt.plot(history.history['loss'], label='Training Loss')
  plt.plot(history.history['val_loss'], label='Validation Loss')
  plt.title('Training and Validation Loss')
  plt.xlabel('Epoch')
  plt.ylabel('Loss')
  plt.legend()
  plt.grid(True)

  plt.tight_layout()
  plt.show()

In [None]:
print("\n--- Evaluating Model on Validation Set ---")
  loss, accuracy = model.evaluate(X_val_vectorized, y_val)
  print(f"Validation Loss: {loss:.4f}")
  print(f"Validation Accuracy: {accuracy:.4f}")

In [None]:
print("\n--- Generating Classification Report and Confusion Matrix ---")
  y_pred_probs = model.predict(X_val_vectorized)
  y_pred_classes = np.argmax(y_pred_probs, axis=1)

  # Convert encoded labels back to original category names for report
  y_val_labels = label_encoder.inverse_transform(y_val)
  y_pred_labels = label_encoder.inverse_transform(y_pred_classes)

  print("\nClassification Report:")
  print(classification_report(y_val_labels, y_pred_labels, target_names=label_encoder.classes_))

  # Plot Confusion Matrix
  cm = confusion_matrix(y_val_labels, y_pred_labels, labels=label_encoder.classes_) # Ensure labels order for consistent plotting
  plt.figure(figsize=(12, 10))
  sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
              xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
  plt.xlabel('Predicted Label')
  plt.ylabel('True Label')
  plt.title('Confusion Matrix for CV Classification')
  plt.show()
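
To read the report and heatmap above, it helps to see how precision and recall fall out of the confusion matrix. The sketch below recomputes them in plain Python on made-up labels; `confusion_counts` and `precision_recall` are hypothetical helpers written for this illustration, with rows as true labels and columns as predicted labels, matching the plot.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a matrix where rows are true labels and columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def precision_recall(matrix, i):
    """Precision uses the column total (predicted as i); recall uses the row total (truly i)."""
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)
    actual = sum(matrix[i])
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

labels = ["Finance", "HR", "IT"]                      # toy categories
y_true = ["IT", "IT", "IT", "HR", "HR", "Finance"]    # made-up ground truth
y_pred = ["IT", "IT", "HR", "HR", "HR", "IT"]         # made-up predictions
matrix = confusion_counts(y_true, y_pred, labels)
hr_precision, hr_recall = precision_recall(matrix, labels.index("HR"))
print(matrix, hr_precision, hr_recall)
```

A class with no correct predictions, like "Finance" in this toy data, shows up as an empty diagonal cell, which is often easier to spot in the heatmap than in the overall accuracy figure.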

In [None]:
print("\n--- Saving Final Model and Preprocessing Assets ---")
  # Save the final trained Keras model (if you didn't rely solely on ModelCheckpoint)
  final_model_path = os.path.join(SAVE_DIR, 'final_classification_model.keras')
  model.save(final_model_path)
  print(f"Final classification model saved to: {final_model_path}")

  # You've already saved label_encoder.pkl and text_vectorizer_model.keras earlier.
  print(f"LabelEncoder saved at: {os.path.join(SAVE_DIR, 'label_encoder.pkl')}")
  print(f"TextVectorization layer saved at: {os.path.join(SAVE_DIR, 'text_vectorizer_model.keras')}")

In [None]:
new_cv_texts = [
      """John Doe. Experienced Software Engineer with 5 years experience in web development using Python, Django, and React. Strong problem solver with expertise in AWS cloud services. Holds a M.Sc. in Computer Science from a top university. Seeking challenging roles in AI/ML engineering.""",
      """Jane Smith. Certified Public Accountant (CPA) with 7 years experience in corporate finance and auditing. Proficient in GAAP, IFRS, and SAP. Managed financial reporting for multinational companies. Excellent analytical and communication skills.""",
      """Fresh Graduate with a B.A. in Fine Arts, seeking entry-level graphic design position. Skills include Adobe Photoshop, Illustrator, and basic UX principles. Passionate about creative visual communication. Portfolio available upon request.""",
      """Michael Brown. Senior Project Manager with 10+ years experience in IT project delivery. PMP certified. Led cross-functional teams using Agile methodologies. Experience in software development lifecycle (SDLC). Strong leadership skills.""",
      """Ahmed Hassan. Dedicated customer service representative with 2 years experience in call center environments. Proficient in conflict resolution and CRM software. Strong verbal communication skills. Seeking growth opportunities.""",
  ]

In [None]:
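print("\n--- Making Predictions on New CVs and Assessing Professionalism ---")
  # The loaded_* objects used below are not defined earlier in this notebook.
  # Reload the checkpointed model and the pickled LabelEncoder from SAVE_DIR, and reuse
  # the in-memory vectorizer (in a fresh session, load text_vectorizer_model.keras instead).
  loaded_model = tf.keras.models.load_model(os.path.join(SAVE_DIR, 'best_classification_model.keras'))
  with open(os.path.join(SAVE_DIR, 'label_encoder.pkl'), 'rb') as f:
      loaded_label_encoder = pickle.load(f)
  loaded_vectorize_layer = vectorize_layer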
  for i, cv_text in enumerate(new_cv_texts):
      print(f"\n--- CV {i+1} ---")
      print(f"Original (first 150 chars): '{cv_text[:150]}...'")

      # 1. Preprocess the new CV text
      cleaned_text = preprocess_text(cv_text)
      print(f"Cleaned (first 100 chars): '{cleaned_text[:100]}...'")

      # 2. Vectorize the cleaned text using the loaded TextVectorization layer
      vectorized_text = loaded_vectorize_layer([cleaned_text]) # Pass as a list for batch processing
      print(f"Vectorized shape: {vectorized_text.shape}")

      # 3. Make a prediction using the loaded model
      predictions = loaded_model.predict(vectorized_text)
      predicted_class_index = np.argmax(predictions, axis=1)[0]
      confidence = predictions[0, predicted_class_index]

      # 4. Convert predicted index back to human-readable category
      predicted_category = loaded_label_encoder.inverse_transform([predicted_class_index])[0]
      print(f"Predicted Category: **{predicted_category}** (Confidence: {confidence:.2f})")

      # 5. Implement Resume Professionalism Score (Conceptual)
      # This is a conceptual implementation. In a real application, this score might
      # involve deeper analysis (e.g., keyword density, specific skills matching,
      # inferred experience level, or even a separate regression model for quality).
      # For this project, we'll use prediction confidence as a simple proxy,
      # assuming higher confidence in a predicted professional category implies more professionalism.
      professionalism_score = confidence # Using confidence as the score for simplicity

      print(f"Resume Professionalism Score: {professionalism_score:.2f}")

      if professionalism_score >= 0.6: # Based on the problem statement's target
          print("Status: **Pretty Good!** This resume appears professional and well-aligned with its category.")
      else:
          print("Status: **Needs Fortification.** This resume might benefit from further improvements.")
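
The confidence-based score can be isolated in a short sketch. It assumes the score is simply the largest softmax probability compared against the 0.6 threshold, as in the loop above; the logits are made up and `professionalism_status` is a hypothetical helper.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def professionalism_status(probs, threshold=0.6):
    """Use the top class probability as the score and compare it to the threshold."""
    confidence = max(probs)
    status = "Pretty Good!" if confidence >= threshold else "Needs Fortification."
    return confidence, status

focused = softmax([2.0, 0.5, 0.1])   # made-up logits: one category clearly dominates
spread = softmax([0.2, 0.1, 0.0])    # made-up logits: no category stands out
focused_score, focused_status = professionalism_status(focused)
spread_score, spread_status = professionalism_status(spread)
print(f"{focused_score:.2f} {focused_status} | {spread_score:.2f} {spread_status}")
```

This makes the limitation of the proxy visible: the score measures how clearly a CV maps to a single category, so a well-written CV that spans two fields can still fall below the threshold.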

## 9. Salary Estimation: Conceptual Integration

  This notebook focuses on CV classification, but the overall project also calls for a **salary estimation capability**. That component aims to give job seekers data-driven insight into potential earnings for specific roles. It is described here at a conceptual level only; no salary model is trained in this notebook.

  ### 9.1 Approach for Salary Estimation

  Integrating salary estimation into the project would typically involve the following steps and considerations:

  1.  **Feature Engineering from CVs:**
      * **Extraction of Structured Information:** The first step is to parse and extract quantifiable features from the cleaned CV text that are highly correlated with salary. These could include:
          * **Years of Experience:** Identifying and calculating total professional experience from job history sections.
          * **Key Skills:** Detecting mentions of specific, high-demand skills (e.g., Python, SQL, Cloud Computing, specific software platforms like SAP, Salesforce).
          * **Education Level:** Categorizing degrees (Bachelor's, Master's, PhD) and fields of study.
          * **Job Titles/Seniority:** Inferring seniority levels from job titles (e.g., "Junior," "Senior," "Lead," "Manager").
          * **Location:** If the CV contains location data, this is a significant factor.

  2.  **Dedicated Salary Regression Model:**
      * A separate machine learning model, or a regression head integrated into the deep learning model, would be trained specifically for salary prediction.
      * This model would use the `glassdoor_salaries.csv` dataset (or a similar, richer salary dataset) for training. This dataset should contain features like years of experience, skills, education, job title, and the corresponding salary.
      * Common regression model choices include Linear Regression, Ridge Regression, Random Forest Regressors, Gradient Boosting Machines (like LightGBM or XGBoost), or even a specialized neural network architecture for regression.

  3.  **Prediction Flow for New CVs:**
      * When a new CV is processed:
          * It would first undergo the same text preprocessing and category classification as demonstrated in this notebook.
          * Subsequently, the extracted features relevant to salary (as described above) would be fed as input to the pre-trained salary regression model.
          * The output of this model would be an estimated salary (a continuous numerical value) or a predicted salary range.
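
The regression step described above can be sketched with ordinary least squares. Everything here is an assumption made for illustration: the two features, the synthetic rows, and the salary formula used to generate the targets are invented and are not taken from `glassdoor_salaries.csv`.

```python
import numpy as np

# Synthetic feature rows, invented for illustration: [years_experience, is_manager]
X = np.array([[0, 0], [2, 0], [4, 1], [6, 1], [8, 0], [10, 1]], dtype=float)
# Synthetic salaries generated from 40000 + 3000 * years + 10000 * is_manager
y = 40000 + 3000 * X[:, 0] + 10000 * X[:, 1]

# Add an intercept column and solve the least-squares problem
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def predict_salary(years_experience, is_manager):
    """Apply the fitted linear model to one candidate's features."""
    return float(coef @ np.array([1.0, years_experience, is_manager]))

estimate = predict_salary(5, 1)
print(f"Estimated salary: ${estimate:,.2f}")
```

Because the synthetic targets are exactly linear, the fit recovers the generating formula. Real salary data is noisy and non-linear, which is why the tree-based models listed above are often a better starting point than a plain linear fit.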

  ### 9.2 Example of Feature Extraction (Conceptual Code Placeholder)

  ```python
  # This is a conceptual function. Actual implementation would involve more complex NLP techniques
  # (e.g., spaCy for NER, regex patterns for experience parsing, pre-defined skill lists).
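  def extract_salary_features(cv_text, predicted_category):
      """Extract simple features and return a heuristic salary estimate (not the feature dict)."""
      # Lowercase here so the checks also work on raw text. preprocess_text() strips digits
      # and punctuation, so years of experience can only be parsed from the raw text.
      text = str(cv_text).lower()
      features = {
          'years_experience': 0, # Placeholder
          'num_tech_skills': 0,  # Placeholder
          'is_manager': 0,       # Placeholder
          'is_degree_master_phd': 0, # Placeholder
          'predicted_category': predicted_category
      }

      # Simple keyword-based example (for demonstration, not robust parsing)
      match = re.search(r'(\d+)\+?\s*years?(?:\s+of)?\s+experience', text)
      if match:
          features['years_experience'] = int(match.group(1))

      # Count distinct tech keywords as whole words, so "java" does not match "javascript"
      tech_skills = ('python', 'java', 'aws')
      features['num_tech_skills'] = sum(
          bool(re.search(rf'\b{skill}\b', text)) for skill in tech_skills
      )

      # Whole-word match, so "lead" does not match "leadership"
      if re.search(r'\b(manager|lead)\b', text):
          features['is_manager'] = 1

      # Match degree mentions with or without punctuation (e.g. "m.sc." or "msc")
      if re.search(r'\b(m\.?sc|ph\.?d)\b', text):
          features['is_degree_master_phd'] = 1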

      # In a full system, you would have a trained regression model here
      # For demonstration, we'll use a very basic heuristic
      base_salary = 50000
      if features['years_experience'] > 5:
          base_salary += features['years_experience'] * 5000
      if features['is_manager']:
          base_salary += 20000
      if features['num_tech_skills'] > 0:
          base_salary += 10000

      # Adjust based on predicted category (example, not actual model)
      if predicted_category == "Software Developer":
          base_salary += 15000
      elif predicted_category == "HR":
          base_salary -= 10000
      # ... and so on for other categories

      return max(30000, base_salary) # Ensure a minimum salary

  # Demonstrate conceptual salary estimation for a new CV
  print("\n--- Conceptual Salary Estimation for a New CV ---")
  sample_cv_text = new_cv_texts[0] # Take the first example CV
  cleaned_sample_cv = preprocess_text(sample_cv_text)

  # Re-predict category for this sample to ensure it's fresh
  sample_vectorized = loaded_vectorize_layer([cleaned_sample_cv])
  sample_predictions = loaded_model.predict(sample_vectorized)
  sample_predicted_index = np.argmax(sample_predictions, axis=1)[0]
  sample_predicted_category = loaded_label_encoder.inverse_transform([sample_predicted_index])[0]

  # Pass the lowercased raw text: preprocess_text() strips digits, so "5 years" would be lost
  estimated_salary = extract_salary_features(sample_cv_text.lower(), sample_predicted_category)
  print(f"For the sample CV (Category: {sample_predicted_category}), Estimated Salary: ${estimated_salary:,.2f} (Conceptual)")
  ```

## 10. Conclusion

  This project developed an AI-powered system that classifies resumes into professional categories, a step towards streamlining recruitment and supporting job seekers. It combines a text preprocessing pipeline with a Bidirectional LSTM classifier trained on unstructured CV text. The project's scope also includes salary estimation, which this notebook covers only as a conceptual, heuristic sketch rather than a trained model. The iterative development process underscored the importance of careful data preparation and model management in real-world ML applications.

  The system addresses the initial problem statement by offering resume classification, a confidence-based professionalism score, and a conceptual framework for salary estimation, providing useful tools for career guidance and resume optimization.

## 11. Future Work and Enhancements

  To further enhance this project and transition it into a robust production-ready tool, the following future work is recommended:

  * **Full Salary Model Implementation:** Develop and rigorously train a dedicated regression model for salary estimation, incorporating more granular feature engineering (e.g., precise experience parsing, advanced skill weighting, geographical data if available). This would involve a more detailed analysis and modeling of the `glassdoor_salaries.csv` data.
  * **Expand & Diversify Datasets:** Acquire larger and more diverse datasets for both CV classification and salary estimation to further improve model robustness, generalization, and fairness across various demographics and career paths.
  * **Advanced Model Architectures:** Explore the application of state-of-the-art Transformer-based models (e.g., BERT, RoBERTa) for superior text understanding and potentially higher classification accuracy.
  * **Explainable AI (XAI):** Implement techniques to provide transparency into model predictions, explaining *why* a CV was classified in a certain way or received a particular professionalism score.
  * **Interactive User Interface:** Develop a user-friendly web application or API that allows job seekers to upload their CVs and receive instant classification, professionalism scores, and salary estimates.
  * **Multi-task Learning Refinement:** Explore multi-task learning architectures where the classification and salary estimation tasks share common deep features, potentially leading to more efficient and accurate learning for both.
  * **Continuous Integration/Deployment (CI/CD):** Set up automated pipelines for model retraining and deployment to ensure the system stays up-to-date with new data and improved models.

  This project lays a strong foundation for an intelligent career guidance platform, with clear avenues for future development and enhancement.