## Usage Instructions

This notebook performs **end-to-end classification of potential public sector data buyers** using structured and unstructured features from federal job postings. Specifically, it:

- Loads and prepares USAJobs data scraped via the automated acquisition script
- Applies preprocessing pipelines for both text and structured variables
- Predicts buyer likelihood using a trained logistic regression model
- Outputs:
  - `PredictedLabel`: Binary indicator (1 = likely data buyer)
  - `DataBuyerScore`: Model-estimated probability (0 to 1) that the posting comes from a data buyer
- Exports a **ranked lead list** of high-priority job postings for potential outreach or analysis

### Setup Steps Before Running

1. **Set File Paths**  
   Replace every file path in the notebook with a path on your own machine.  
   Use `Ctrl + F` to find each instance of the placeholder path (`C://Users//...//`) and replace it with your own directory.

2. **Data File Consistency**  
   Ensure the input CSV file has the **same name and format** as the one created by the automated data acquisition script.  
   If you used the USAJobs fetcher provided in this repository, no changes are needed.

3. **Pipeline Files**  
   The notebook loads pretrained pipeline files with `joblib` (`nlp_model.joblib`, `vectorizer.joblib`, and `nlp_pipeline_with_smote.joblib`), which are **included in this repository**.  
   Download them and save them **in the same folder as the notebook**, since the loading cell refers to them by file name only.
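
Before running the notebook, it can help to confirm that the pipeline files are where the loading cell expects them. The following is a minimal sketch, not part of the notebook itself: the file names are taken from the notebook's loading cell, while the helper `find_missing_files` and the assumption that the notebook runs from its own folder are illustrative.

```python
from pathlib import Path

# Artifact file names as they appear in the notebook's loading cell
REQUIRED_ARTIFACTS = [
    "nlp_model.joblib",
    "vectorizer.joblib",
    "nlp_pipeline_with_smote.joblib",
]


def find_missing_files(folder, filenames):
    """Return the names in `filenames` that do not exist as files in `folder`."""
    folder = Path(folder)
    return [name for name in filenames if not (folder / name).is_file()]


# Assumption: the current working directory is the notebook's folder
missing = find_missing_files(".", REQUIRED_ARTIFACTS)
if missing:
    print("Missing pipeline files:", ", ".join(missing))
else:
    print("All pipeline files found.")
```

If any file is reported missing, copy it into the notebook's folder or adjust the paths in the loading cell accordingly.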

---

Once the setup is complete, simply run the notebook to generate buyer predictions and export the final lead list.


In [4]:
import joblib
import pandas as pd

# Load the saved model, vectorizer, and full pipeline from the same folder as the notebook
# Note: only `pipeline` is used in the prediction cells below
model = joblib.load('nlp_model.joblib')
vectorizer = joblib.load('vectorizer.joblib')
pipeline = joblib.load('nlp_pipeline_with_smote.joblib')



In [11]:
# Load the new job postings to score (the CSV produced by the acquisition script)
# Replace the placeholder path with your actual file path
df_new_jobs = pd.read_csv('C://Users//...//all_keyword_job_listings.csv')
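
A quick check that the loaded CSV contains the feature columns the model expects can catch format mismatches early. This is a hedged sketch: the column names come from the prediction cell of this notebook, while the helper `missing_columns` and the sample values are hypothetical and only illustrate the check.

```python
import pandas as pd

# Feature columns the notebook selects before prediction
REQUIRED_COLUMNS = ["CombinedText", "AgencySize", "Industry", "IsSeniorRole"]


def missing_columns(df, required):
    """Return the required column names that are absent from `df`, in order."""
    return [col for col in required if col not in df.columns]


# Tiny stand-in for the real CSV (hypothetical values, for illustration only)
sample = pd.DataFrame({
    "CombinedText": ["Data analyst supporting procurement of commercial datasets"],
    "AgencySize": ["Large"],
    "Industry": ["Health"],
})

print(missing_columns(sample, REQUIRED_COLUMNS))  # prints ['IsSeniorRole']
```

In the notebook, the same call would be `missing_columns(df_new_jobs, REQUIRED_COLUMNS)`; an empty list means the file has every column needed.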

In [12]:
# The model was trained on these columns: CombinedText, AgencySize, Industry, IsSeniorRole
feature_columns = ['CombinedText', 'AgencySize', 'Industry', 'IsSeniorRole']
X_new = df_new_jobs[feature_columns]

# Apply the already-fitted preprocessor step of the pipeline (no re-fitting on new data)
X_new_transformed = pipeline.named_steps['preprocessor'].transform(X_new)

# Predict with the trained classifier step
classifier = pipeline.named_steps['classifier']
predictions = classifier.predict(X_new_transformed)

# Store the binary label (1 = likely data buyer) and the positive-class probability
df_new_jobs['PredictedLabel'] = predictions
df_new_jobs['DataBuyerScore'] = classifier.predict_proba(X_new_transformed)[:, 1]

# Keep predicted data buyers, ranked from highest to lowest score
data_buying_jobs = (
    df_new_jobs[df_new_jobs['PredictedLabel'] == 1]
    .sort_values('DataBuyerScore', ascending=False)
)
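
The outputs list at the top of this notebook describes a `DataBuyerScore` probability and a ranked lead list. One possible way to narrow scored postings down to a short outreach list is sketched below. It is an illustration under assumptions, not the notebook's own method: the helper `top_leads`, its `min_score` and `top_n` parameters, the `PositionTitle` column, and the sample scores are all hypothetical.

```python
import pandas as pd


def top_leads(df, score_col="DataBuyerScore", min_score=0.5, top_n=None):
    """Return rows with score >= min_score, sorted from highest to lowest score."""
    ranked = df[df[score_col] >= min_score].sort_values(score_col, ascending=False)
    return ranked if top_n is None else ranked.head(top_n)


# Hypothetical scored postings, for illustration only
scored = pd.DataFrame({
    "PositionTitle": ["Data Scientist", "Budget Analyst", "Program Manager", "IT Specialist"],
    "DataBuyerScore": [0.91, 0.42, 0.77, 0.63],
})

leads = top_leads(scored, min_score=0.6, top_n=2)
print(leads["PositionTitle"].tolist())  # prints ['Data Scientist', 'Program Manager']
```

Raising `min_score` trades fewer leads for higher model confidence; the right cutoff depends on how many postings can realistically be reviewed.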

In [13]:
# Optionally, save the predicted data-buying leads to a CSV file
data_buying_jobs.to_csv('C://Users//...//predicted_data_buying_leads.csv', index=False)

print("Data-buying leads have been saved to 'predicted_data_buying_leads.csv'")

Data-buying leads have been saved to 'predicted_data_buying_leads.csv'
