<a href="https://colab.research.google.com/github/Niranjana-08/AI-Ascent/blob/main/notebooks/data_cleaning/analysing_csvs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Notebook Objective:

This notebook represents the final and most advanced stage of data preparation. The core goals are:


*   Create a Master DataFrame: Merge the cleaned job data with the classified job categories.
*   Engineer an 'AI Relevance' Score: Use the all-MiniLM-L6-v2 Sentence Transformer model to calculate a score for each job that indicates how closely its text relates to the concept of AI.
*   Classify AI Role Tiers: Analyze the relevance scores to define thresholds and categorize each job as a 'Traditional Role', 'AI-Impacted Role', or 'Core AI Role'.
*   Final Cleaning: Perform the last set of data cleaning tasks to create two polished, analysis-ready datasets.

### Note:


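*   Run this notebook with a GPU runtime; encoding the job descriptions in Part 2 is slow on CPU.
*   Run it only after the jobs have been classified into main and sub categories, because it reads classified_jobs.csv.
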
## Part 1: Setup and Master DataFrame Creation

### 1.1. Installing Libraries and Mounting Drive

In [None]:
!pip install sentence-transformers -q

In [None]:
import sys

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import torch
from google.colab import drive
from IPython.display import display
from sentence_transformers import SentenceTransformer, util
from tqdm.auto import tqdm

In [None]:
print("Mounting Google Drive...")
drive.mount('/content/drive', force_remount=True)

### 1.2. Loading Processed Datasets

In [None]:
base_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/'
cleaned_data_path = base_path + 'data_cleaning/cleaned_for_classification.csv'
classified_data_path = base_path + 'classified_jobs/classified_jobs.csv'

In [None]:
cleaned_df = pd.read_csv(cleaned_data_path)
classified_df = pd.read_csv(classified_data_path)
print("Files loaded successfully.")

### 1.3. Merging to Create the Master Analysis DataFrame

In [None]:
print("\nMerging new classification columns into the main DataFrame")
columns_to_add = ['job_id', 'main_category', 'sub_category', 'confidence_score']

analysis_df = pd.merge(
    cleaned_df,
    classified_df[columns_to_add],
    on='job_id',
    how='left'
    # Using left merge to keep all rows from the original cleaned_df
)
print("Merge complete.")
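
A left merge keeps every row of cleaned_df, but it can silently duplicate rows if job_id repeats in classified_df, and it leaves NaN categories for jobs that were never classified. The following is a minimal sketch, on made-up toy data, of how both cases could be checked. The names cleaned_demo, classified_demo, and merged_demo are hypothetical; validate and indicator are standard pandas merge parameters.

```python
import pandas as pd

# Toy stand-ins for cleaned_df and classified_df (all values are invented).
cleaned_demo = pd.DataFrame({
    "job_id": [1, 2, 3],
    "title": ["Data Scientist", "Accountant", "Nurse"],
})
classified_demo = pd.DataFrame({
    "job_id": [1, 2],
    "main_category": ["Technology", "Finance"],
})

# validate="many_to_one" raises pandas.errors.MergeError if job_id repeats
# on the right side, which would otherwise duplicate rows in the result.
merged_demo = pd.merge(
    cleaned_demo,
    classified_demo,
    on="job_id",
    how="left",
    validate="many_to_one",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Jobs present in the cleaned data but absent from the classification output.
unmatched = int((merged_demo["_merge"] == "left_only").sum())
print(f"Rows after merge: {len(merged_demo)}, unclassified jobs: {unmatched}")
```

If the unmatched count is large, the NaN values in main_category will flow into every later per-category step, so it may be worth checking before continuing.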

In [None]:
print("\n--- Master Analysis DataFrame is Ready! ---")
print("Displaying info to confirm all columns are now present:")
analysis_df.info()

print("\nDisplaying the first 50 rows of the final analysis table:")
analysis_df.head(50)

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

pd.set_option('display.width', 1000)
print("\nDisplaying the first 5 rows of the final analysis table:")
display(analysis_df.head(5))

## Part 2: Feature Engineering - AI Relevance Scoring

### 2.1. Setting Up the Sentence Transformer Model

We import our AI keywords and load the all-MiniLM-L6-v2 Sentence Transformer model. This model maps each text to a 384-dimensional vector (embedding) that captures its semantic meaning, so texts with similar meaning end up close together. We run it on the GPU when one is available and fall back to the CPU otherwise.

In [None]:
keywords_folder_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/keywords/'
sys.path.append(keywords_folder_path)
try:
    from keywords_ai import AI_KEYWORDS
    print("keywords_ai.py imported successfully.")
except (ImportError, FileNotFoundError):
    print("Error: Could not import AI_KEYWORDS. Make sure 'keywords_ai.py' is in the correct folder.")
    # Re-raise the original exception with its traceback intact.
    raise

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"\nUsing device: {device}")
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
print("Sentence Transformer model loaded.")

### 2.2. Creating and Encoding the Target 'AI Concept'

We combine all our AI-related keywords into a single paragraph. The model then encodes this paragraph into one target vector that numerically represents the "concept of AI."

In [None]:
ai_concept_paragraph = ' '.join(AI_KEYWORDS)
print("\nEncoding the 'AI Concept' paragraph into a vector...")
ai_concept_embedding = model.encode(ai_concept_paragraph, convert_to_tensor=True, show_progress_bar=True)

### 2.3. Encoding All Job Descriptions (Embedding)

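Now the model converts every job's combined_text into its own vector. This is the most time-consuming step. Texts longer than the model's maximum sequence length are truncated, so only the beginning of a long description influences its vector.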

In [None]:
print("\nEncoding all job descriptions into vectors (this will take a while)...")
job_texts = analysis_df['combined_text'].astype(str).tolist()
job_embeddings = model.encode(job_texts, convert_to_tensor=True, show_progress_bar=True, batch_size=32)
print("Job encoding complete.")

### 2.4. Calculating and Analyzing Scores

We use cosine similarity to compare each job vector to our target "AI Concept" vector. The result is a score from -1 to 1 (practically 0 to 1) where a higher score means the job text is semantically closer to the concept of AI.
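
For two vectors a and b, cosine similarity is their dot product divided by the product of their lengths. A minimal NumPy sketch on hand-made 2-D vectors (not real embeddings) illustrates the range of the score. The cosine_similarity helper is illustrative only and is not the sentence-transformers function used in this notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction, different length: similarity is 1 (length is ignored).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite direction: similarity is -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Because only direction matters, a long and a short job description can receive the same score if their embeddings point the same way.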

In [None]:
print("\nCalculating AI relevance scores for all jobs")
cosine_scores = util.pytorch_cos_sim(job_embeddings, ai_concept_embedding)

analysis_df['ai_relevance_score'] = cosine_scores.cpu().numpy().flatten()
print("Scores calculated and added to the DataFrame.")
top_ai_jobs = analysis_df.sort_values(by='ai_relevance_score', ascending=False)

In [None]:
print("Displaying the top 15 most AI-relevant jobs found:")
display(top_ai_jobs[['title', 'main_category', 'ai_relevance_score']].head(15))

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("Displaying the first 5 rows with all columns:")
display(analysis_df.head(5))

## Part 3: Exploratory Analysis - Determining AI Role Thresholds

Before we can classify jobs into tiers, we need to understand how the ai_relevance_score is distributed across different job categories. This exploratory analysis helps us set informed, data-driven thresholds.

### 3.1. Analyzing Score Distributions by Category

We'll loop through each main job category, plotting a histogram of its AI relevance scores and printing descriptive statistics. This gives us a clear picture of what a "high" or "low" score looks like for each field.

In [None]:
# Skip 'Other' and any jobs left unclassified (NaN) by the left merge.
categories_to_analyze = analysis_df.loc[analysis_df['main_category'] != 'Other', 'main_category'].dropna().unique()

print(f"Starting individual analysis for {len(categories_to_analyze)} main categories...")
def show_examples_for_category(df, score_min, score_max, num_examples=3):
    """Show up to num_examples jobs from df with score_min <= score < score_max."""
    sample = df[(df['ai_relevance_score'] >= score_min) & (df['ai_relevance_score'] < score_max)]
    print(f"\nExamples with scores between {score_min} and {score_max}:")
    if sample.empty:
        print("No jobs found in this score range.")
    else:
        display(sample[['title', 'sub_category', 'ai_relevance_score']].head(num_examples))

for category in categories_to_analyze:
    print("\n" + "="*50)
    print(f"ANALYSIS FOR: {category}")
    print("="*50)

    category_df = analysis_df[analysis_df['main_category'] == category]

    # Histogram of the scores for this category.
    plt.figure(figsize=(10, 5))
    sns.histplot(category_df['ai_relevance_score'], bins=30, kde=True)
    plt.title(f'Distribution of AI Relevance Scores for {category}')
    plt.xlabel('AI Relevance Score')
    plt.ylabel('Number of Jobs')
    plt.grid(True)
    plt.show()

    print(f"\n--- Descriptive Statistics for {category} ---")
    score_stats = category_df['ai_relevance_score'].describe(percentiles=[.25, .5, .75, .90, .95, .99])
    print(score_stats)

    # Sample titles from a low, a middle, and a high score band.
    print(f"\n--- Example Job Titles for {category} ---")
    show_examples_for_category(category_df, 0.2, 0.3)
    show_examples_for_category(category_df, 0.4, 0.5)
    show_examples_for_category(category_df, 0.6, 1.0)

print("\n\n--- Individual analysis for all categories is complete. ---")
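
As a complement to the per-category plots, the same percentiles can be gathered into one table so that categories are easier to compare side by side. This is only a sketch on invented scores; scores_demo and percentile_table are hypothetical names.

```python
import pandas as pd

# Invented scores for two categories.
scores_demo = pd.DataFrame({
    "main_category": ["Finance"] * 4 + ["Design"] * 4,
    "ai_relevance_score": [0.10, 0.20, 0.30, 0.40, 0.20, 0.30, 0.50, 0.60],
})

# One row per category, one column per requested quantile.
percentile_table = (
    scores_demo.groupby("main_category")["ai_relevance_score"]
    .quantile([0.25, 0.50, 0.75, 0.95])
    .unstack()
)
print(percentile_table.round(3))
```

Reading across a row shows how quickly scores rise within a category; reading down a column shows which categories sit higher overall.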


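### 3.2. Manual Inspection of Borderline Cases (per Job Field)

For each category, we display the jobs whose scores fall in a narrow band around a candidate threshold. Reading the actual titles in that band shows whether the cut-off separates traditional roles from AI-related ones sensibly.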

In [None]:
media_df = analysis_df[analysis_df['main_category'] == 'Media & Journalism']
borderline_media_jobs = media_df[
    (media_df['ai_relevance_score'] >= 0.28) &
    (media_df['ai_relevance_score'] < 0.32)
]

print("--- Borderline Jobs (Scores between 0.28 and 0.32) for Media & Journalism ---")
display(borderline_media_jobs[['title', 'sub_category', 'ai_relevance_score']])

In [None]:
education_df = analysis_df[analysis_df['main_category'] == 'Education & EdTech']
borderline_education_jobs = education_df[
    (education_df['ai_relevance_score'] >= 0.35) &
    (education_df['ai_relevance_score'] < 0.40)
]

print("--- Borderline Jobs (Scores between 0.35 and 0.40) for Education & EdTech ---")
display(borderline_education_jobs[['title', 'sub_category', 'ai_relevance_score']])

In [None]:
finance_df = analysis_df[analysis_df['main_category'] == 'Finance']

borderline_finance_jobs = finance_df[
    (finance_df['ai_relevance_score'] >= 0.35) &
    (finance_df['ai_relevance_score'] < 0.40)
]

print("--- Borderline Jobs (Scores between 0.35 and 0.40) for Finance ---")
display(borderline_finance_jobs[['title', 'sub_category', 'ai_relevance_score']])

In [None]:
automotive_df = analysis_df[analysis_df['main_category'] == 'Automotive']
borderline_automotive_jobs = automotive_df[
    (automotive_df['ai_relevance_score'] >= 0.35) &
    (automotive_df['ai_relevance_score'] < 0.40)
]
print("--- Borderline Jobs (Scores between 0.35 and 0.40) for Automotive ---")
display(borderline_automotive_jobs[['title', 'sub_category', 'ai_relevance_score']])

In [None]:
design_df = analysis_df[analysis_df['main_category'] == 'Design']
borderline_design_jobs = design_df[
    (design_df['ai_relevance_score'] >= 0.35) &
    (design_df['ai_relevance_score'] < 0.40)
]

print(f"--- Found {len(borderline_design_jobs)} Borderline Jobs (Scores between 0.35 and 0.40) for Design ---")
display(borderline_design_jobs[['title', 'sub_category', 'ai_relevance_score']])

In [None]:
technology_df = analysis_df[analysis_df['main_category'] == 'Technology'].copy()
score_bins = [-float('inf'), 0.28, 0.55, float('inf')]
tier_labels = [
    'Traditional Role',
    'AI-Impacted Role',
    'Core AI Role'
]
technology_df['ai_role_tier_check'] = pd.cut(
    technology_df['ai_relevance_score'],
    bins=score_bins,
    labels=tier_labels,
    right=False
)
tier_counts = technology_df['ai_role_tier_check'].value_counts()
print("--- Job Distribution for Technology using new thresholds ---")
print(tier_counts)

In [None]:
marketing_df = analysis_df[analysis_df['main_category'] == 'Marketing']
borderline_marketing_jobs = marketing_df[
    (marketing_df['ai_relevance_score'] >= 0.30) &
    (marketing_df['ai_relevance_score'] < 0.35)
]
print("--- Borderline Jobs (Scores between 0.30 and 0.35) for Marketing ---")
display(borderline_marketing_jobs[['title', 'sub_category', 'ai_relevance_score']])

In [None]:
legal_df = analysis_df[analysis_df['main_category'] == 'Legal']

borderline_legal_jobs = legal_df[
    (legal_df['ai_relevance_score'] >= 0.28) &
    (legal_df['ai_relevance_score'] < 0.32)
]
print("--- Borderline Jobs (Scores between 0.28 and 0.32) for Legal ---")
display(borderline_legal_jobs[['title', 'sub_category', 'ai_relevance_score']])

In [None]:
healthcare_df = analysis_df[analysis_df['main_category'] == 'Healthcare (Research & Admin)']
borderline_healthcare_jobs = healthcare_df[
    (healthcare_df['ai_relevance_score'] >= 0.35) &
    (healthcare_df['ai_relevance_score'] < 0.40)
]
print("--- Borderline Jobs (Scores between 0.35 and 0.40) for Healthcare ---")
display(borderline_healthcare_jobs[['title', 'sub_category', 'ai_relevance_score']])

In [None]:
hr_df = analysis_df[analysis_df['main_category'] == 'Human Resources']
borderline_hr_jobs = hr_df[
    (hr_df['ai_relevance_score'] >= 0.30) &
    (hr_df['ai_relevance_score'] < 0.35)
]
print("--- Borderline Jobs (Scores between 0.30 and 0.35) for Human Resources ---")
display(borderline_hr_jobs[['title', 'sub_category', 'ai_relevance_score']])

In [None]:
consulting_df = analysis_df[analysis_df['main_category'] == 'Consulting & Strategy']

borderline_jobs = consulting_df[
    (consulting_df['ai_relevance_score'] >= 0.35) &
    (consulting_df['ai_relevance_score'] < 0.40)
]
print("--- Borderline Jobs (Scores between 0.35 and 0.40) for Consulting & Strategy ---")
display(borderline_jobs[['title', 'sub_category', 'ai_relevance_score']])
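
The cells above repeat one pattern with a different category and score band each time. A small helper could express that pattern once. This is a sketch on invented rows, and borderline_jobs_for is a hypothetical name that does not appear elsewhere in the notebook.

```python
import pandas as pd

def borderline_jobs_for(df, category, low, high):
    """Return jobs in `category` with low <= ai_relevance_score < high."""
    in_category = df["main_category"] == category
    in_band = df["ai_relevance_score"].between(low, high, inclusive="left")
    return df.loc[in_category & in_band, ["title", "sub_category", "ai_relevance_score"]]

# Invented rows standing in for analysis_df.
jobs_band_demo = pd.DataFrame({
    "title": ["Editor", "Reporter", "Data Journalist", "Analyst"],
    "main_category": ["Media & Journalism"] * 3 + ["Finance"],
    "sub_category": ["Editing", "News", "Data", "Banking"],
    "ai_relevance_score": [0.27, 0.30, 0.32, 0.30],
})

media_band = borderline_jobs_for(jobs_band_demo, "Media & Journalism", 0.28, 0.32)
print(media_band)
```

The upper bound is exclusive, matching the `>=` and `<` comparisons used in the cells above.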

## Part 4: Classifying Jobs into AI Tiers

### 4.1. Defining Category-Specific Thresholds

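We build a dictionary that holds the thresholds for each job category. For every category except 'Other', the 'AI-Impacted' cut-off is the 75th percentile of that category's scores and the 'Core AI' cut-off is the 95th percentile, both rounded to two decimals. The Technology thresholds are then overridden manually with 0.28 and 0.55, and a default of 0.40 and 0.60 is used for any category without an entry.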

In [None]:
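# Derive per-category thresholds from each category's own score distribution.
new_category_thresholds = {}
# Skip 'Other' and any jobs left unclassified (NaN) by the left merge.
categories = analysis_df.loc[analysis_df['main_category'] != 'Other', 'main_category'].dropna().unique()

for category in categories:
    category_df = analysis_df[analysis_df['main_category'] == category]
    # 75th percentile: lower bound of 'AI-Impacted Role'.
    impacted_threshold = round(category_df['ai_relevance_score'].quantile(0.75), 2)
    # 95th percentile: lower bound of 'Core AI Role'.
    core_ai_threshold = round(category_df['ai_relevance_score'].quantile(0.95), 2)
    # Keep the two cut-offs distinct when rounding collapses them.
    if core_ai_threshold <= impacted_threshold:
        core_ai_threshold = impacted_threshold + 0.01
    # Key names: 'traditional' is the score where 'Traditional Role' ends,
    # 'impacted' is the score where 'AI-Impacted Role' ends.
    new_category_thresholds[category] = {
        'traditional': impacted_threshold,
        'impacted': core_ai_threshold
    }

# Manual override for Technology, using the thresholds checked in Part 3.
new_category_thresholds['Technology'] = {'traditional': 0.28, 'impacted': 0.55}
# Fallback for categories without an entry (for example 'Other').
default_thresholds = {'traditional': 0.40, 'impacted': 0.60}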
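
To make the percentile rule concrete, here is a small worked sketch on invented scores. The percentile_thresholds helper is a hypothetical wrapper around the same 75th and 95th percentile logic; it also shows the guard that keeps the two cut-offs apart when a category's scores are all identical.

```python
import pandas as pd

def percentile_thresholds(scores, impacted_q=0.75, core_q=0.95):
    """Return the two cut-offs for one category, rounded to two decimals."""
    impacted = round(float(scores.quantile(impacted_q)), 2)
    core = round(float(scores.quantile(core_q)), 2)
    # If rounding (or a flat distribution) collapses the cut-offs, nudge the upper one.
    if core <= impacted:
        core = round(impacted + 0.01, 2)
    return {"traditional": impacted, "impacted": core}

# A spread-out category: the two percentiles differ.
spread = percentile_thresholds(pd.Series([0.10, 0.20, 0.30, 0.40, 0.50]))
# A flat category: both percentiles coincide, so the guard applies.
flat = percentile_thresholds(pd.Series([0.30, 0.30, 0.30]))

print(spread)
print(flat)
```

One consequence of percentile-based cut-offs is that roughly a quarter of each category lands at or above the 'AI-Impacted' threshold and roughly five percent at or above the 'Core AI' threshold, regardless of how AI-related the category is in absolute terms.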

### 4.2. Applying the Classification Logic

This function applies our custom thresholds to each row in the DataFrame to assign an ai_role_type. We also include a special rule to ensure no 'Technology' job is classified as purely 'Traditional'.

In [None]:
def classify_ai_tier_new(row):
    main_cat = row['main_category']
    score = row['ai_relevance_score']
    thresholds = new_category_thresholds.get(main_cat, default_thresholds)
    if score >= thresholds['impacted']:
        return 'Core AI Role'
    elif score >= thresholds['traditional']:
        return 'AI-Impacted Role'
    else:
        return 'Traditional Role'

analysis_df['ai_role_type'] = analysis_df.apply(classify_ai_tier_new, axis=1)

tech_upgrade_mask = (analysis_df['main_category'] == 'Technology') & (analysis_df['ai_role_type'] == 'Traditional Role')
if tech_upgrade_mask.sum() > 0:
    analysis_df.loc[tech_upgrade_mask, 'ai_role_type'] = 'AI-Impacted Role'
    print(f"Applied special rule: Upgraded {tech_upgrade_mask.sum()} 'Technology' jobs from 'Traditional' to 'AI-Impacted'.")

print("\n--- Final Classification is Complete! ---")
print("\nFinal distribution of jobs across the three tiers:")
print(analysis_df['ai_role_type'].value_counts())

print("\nSample of the final DataFrame with the correct 'ai_role_type' column:")
display(analysis_df[['title', 'main_category', 'ai_relevance_score', 'ai_role_type']].head(15))
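
Row-wise apply is easy to read but can be slow on large DataFrames. As an alternative, the same three-way rule could be expressed with column operations and numpy.select. This is a sketch with invented thresholds and rows, not the method used in this notebook; thresholds_demo, default_demo, and jobs_demo are hypothetical names.

```python
import numpy as np
import pandas as pd

# Invented thresholds in the same shape as new_category_thresholds.
thresholds_demo = {"Finance": {"traditional": 0.35, "impacted": 0.50}}
default_demo = {"traditional": 0.40, "impacted": 0.60}

jobs_demo = pd.DataFrame({
    "main_category": ["Finance", "Finance", "Finance", "Retail"],
    "ai_relevance_score": [0.20, 0.40, 0.55, 0.45],
})

# Look up each row's two cut-offs, falling back to the defaults.
lower = jobs_demo["main_category"].map(
    lambda c: thresholds_demo.get(c, default_demo)["traditional"]
)
upper = jobs_demo["main_category"].map(
    lambda c: thresholds_demo.get(c, default_demo)["impacted"]
)

score = jobs_demo["ai_relevance_score"]
# Conditions are checked in order; the first one that holds wins.
jobs_demo["ai_role_type"] = np.select(
    [score >= upper, score >= lower],
    ["Core AI Role", "AI-Impacted Role"],
    default="Traditional Role",
)
print(jobs_demo)
```

The 'Retail' row has no entry in thresholds_demo, so it is classified with the default cut-offs.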

## Part 5: Preparing Final DataFrames for Downstream Use

Now that our main analysis and feature engineering are done, we will select the final columns needed and create two separate, clean DataFrames: one for dashboarding (dashboard_df) and one for future modeling (modeling_df).

### Data Cleaning: Checking Column Readiness

In [None]:
column_list = analysis_df.columns.tolist()
print("--- List of All Columns ---")
print(column_list)

print("\n\n--- Detailed Summary of Each Column ---")
analysis_df.info()

### Selecting the Required Columns

We select the columns needed for the downstream analysis into a new DataFrame and then clean those columns in Part 6.

*Note: If a required column is missing, add it to the column list below and re-run the notebook from the start.*
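
The note above asks for the column list to be edited when a column is missing. A quick check can report which requested columns are absent before the selection fails with a KeyError. The following is a sketch on an invented frame; split_available is a hypothetical helper name.

```python
import pandas as pd

def split_available(df, wanted):
    """Split the requested column names into those present in df and those missing."""
    present = [col for col in wanted if col in df.columns]
    missing = [col for col in wanted if col not in df.columns]
    return present, missing

# Invented frame that lacks one of the requested columns.
frame_demo = pd.DataFrame({"job_id": [1], "title": ["Engineer"]})
present, missing = split_available(frame_demo, ["job_id", "title", "location"])

print("Present:", present)
print("Missing:", missing)
```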

In [None]:
import pandas as pd
from IPython.display import display
if 'original_listed_time' in analysis_df.columns:
    analysis_df['date_posted'] = pd.to_datetime(analysis_df['original_listed_time'] / 1000, unit='s').dt.date
else:
    analysis_df['date_posted'] = pd.NaT

final_columns_for_dashboard = [
    'job_id', 'title', 'company_name', 'location', 'date_posted',
    'main_category', 'sub_category', 'ai_role_type', 'ai_relevance_score',
    'formatted_experience_level', 'min_salary', 'med_salary', 'max_salary',
    'pay_period', 'currency', 'cleaned_skills'
]
dashboard_df = analysis_df[final_columns_for_dashboard].copy()
print("--- Final DataFrame for Dashboard is Ready! ---")
dashboard_df.info()

print("\nSample of the final dashboard data:")
display(dashboard_df.head())

In [None]:
if 'original_listed_time' in analysis_df.columns:
    analysis_df['date_posted'] = pd.to_datetime(analysis_df['original_listed_time'] / 1000, unit='s').dt.date
else:
    analysis_df['date_posted'] = pd.NaT

final_columns = [
    'job_id', 'title', 'company_name', 'location', 'date_posted',
    'main_category', 'sub_category', 'ai_role_type', 'ai_relevance_score',
    'formatted_experience_level', 'min_salary', 'med_salary', 'max_salary',
    'pay_period', 'currency', 'cleaned_skills', 'combined_text'
]
dashboard_df = analysis_df[final_columns].copy()
print("--- Final DataFrame for Dashboard & Modeling is Ready! ---")
dashboard_df.info()

print("\nSample of the final data:")
display(dashboard_df.head())

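The cell above also includes combined_text. The next cell supersedes both cells above by building two DataFrames: dashboard_df without combined_text and modeling_df with it.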

In [None]:
if 'date_posted' not in analysis_df.columns and 'original_listed_time' in analysis_df.columns:
    analysis_df['date_posted'] = pd.to_datetime(analysis_df['original_listed_time'] / 1000, unit='s').dt.date

dashboard_columns = [
    'job_id', 'title', 'company_name', 'location', 'date_posted',
    'main_category', 'sub_category', 'ai_role_type', 'ai_relevance_score',
    'formatted_experience_level', 'min_salary', 'med_salary', 'max_salary',
    'pay_period', 'currency', 'cleaned_skills'
]
dashboard_df = analysis_df[dashboard_columns].copy()
modeling_columns = dashboard_columns + ['combined_text']
modeling_df = analysis_df[modeling_columns].copy()

print("--- DataFrames are Ready ---")
print("\n1. dashboard_df (without combined_text):")
print("This is the lean DataFrame for analysis and visualization.")
display(dashboard_df.head(2))

print("\n2. modeling_df (with combined_text):")
print("This is the larger DataFrame for future model building.")
display(modeling_df.head(2))

### For Reference

In [None]:
print("DataFrame for future modeling (contains 'combined_text'): modeling_df")
print("DataFrame for dashboarding and analysis (does not contain 'combined_text'): dashboard_df")

## Part 6: Final Data Cleaning and Preprocessing

### Cleaning dashboard_df First

We start with dashboard_df and clean it one column at a time.

In [None]:
display(dashboard_df.head(5))

Currency column: check which currencies appear before converting salaries to USD.

In [None]:
currency_counts = dashboard_df['currency'].value_counts()

print("Currency distribution in the dataset:")
print(currency_counts)

In [None]:
conversion_rates = {
    'EUR': 1.08,  # Euro to USD
    'CAD': 0.73,  # Canadian Dollar to USD
    'BBD': 0.50,  # Barbadian Dollar to USD
    'AUD': 0.66,  # Australian Dollar to USD
    'GBP': 1.27,  # British Pound to USD
    'USD': 1.00   # USD to USD is 1
}
print("Conversion rates defined.")

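# Currencies missing from the table (including NaN) are treated as already being in USD.
# Assigning the filled Series back avoids the chained in-place fillna, which newer pandas versions warn about.
dashboard_df['conversion_rate'] = dashboard_df['currency'].map(conversion_rates).fillna(1.0)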
salary_cols = ['min_salary', 'med_salary', 'max_salary']
for col in salary_cols:
    dashboard_df[col] = dashboard_df[col] * dashboard_df['conversion_rate']

print("Salaries converted to USD.")
dashboard_df.drop(columns=['currency', 'conversion_rate'], inplace=True)
print("Cleaned up temporary columns.")

print("\n--- Salary Conversion Complete! ---")
display(dashboard_df[dashboard_df['job_id'].isin([283, 294])])
display(dashboard_df.head())
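
The conversion above treats every currency that is missing from conversion_rates as USD. A more conservative variant, sketched here on invented data, assumes USD only when the currency field itself is empty and leaves salaries in unrecognised currencies as NaN so they are not mistaken for dollar amounts. The rates, pay_demo, and the min_salary_usd column are illustrative assumptions.

```python
import pandas as pd

# Illustrative static rates; real rates change over time.
rates_to_usd = {"USD": 1.00, "EUR": 1.08}

pay_demo = pd.DataFrame({
    "currency": ["USD", "EUR", "JPY", None],
    "min_salary": [50000.0, 40000.0, 6000000.0, 30000.0],
})

# Known codes get their rate; unknown codes and missing values become NaN.
rate = pay_demo["currency"].map(rates_to_usd)
# Only a missing currency field is assumed to mean USD.
rate = rate.where(pay_demo["currency"].notna(), 1.0)

pay_demo["min_salary_usd"] = pay_demo["min_salary"] * rate
print(pay_demo)
```

Here the JPY row ends up with a NaN salary instead of a figure that would be off by two orders of magnitude.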

In [None]:
print("DataFrame Info:")
dashboard_df.info()

print("\nUpdated Salary Statistics (now all in USD):")
print(dashboard_df[['min_salary', 'med_salary', 'max_salary']].describe())

date_posted column: the drop below is left commented out; the column is removed at the end of this part.

In [None]:
# Remove the 'date_posted' column from the DataFrame
# dashboard_df.drop(columns=['date_posted'], inplace=True)

In [None]:
dashboard_df.head()

formatted_experience_level column: fill missing values with 'Not Specified'.

In [None]:
dashboard_df['formatted_experience_level'] = dashboard_df['formatted_experience_level'].fillna('Not Specified')

print("Missing values have been filled. Here are the new counts for each experience level:")
print(dashboard_df['formatted_experience_level'].value_counts())

In [None]:
dashboard_df['formatted_experience_level'].info()

Salary columns: fill unknown salaries with 0. Any salary analysis should then use only rows where the salary is greater than 0, so that the placeholder zeros do not distort the statistics.


In [None]:
print("Cleaning salary-related columns...")
salary_cols = ['min_salary', 'med_salary', 'max_salary']

for col in salary_cols:
    dashboard_df[col] = dashboard_df[col].fillna(0)
dashboard_df['pay_period'] = dashboard_df['pay_period'].fillna('Not Specified')

print("Salary columns cleaned.")
print("\n--- Verifying the cleaned DataFrame ---")
dashboard_df.info()
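
Because 0 now stands for "unknown", statistics computed over the whole column are pulled toward zero. A short sketch on invented salaries shows the difference between a naive mean and a mean over known salaries only; salary_demo is a hypothetical name.

```python
import pandas as pd

# Two unknown salaries (stored as 0) and two known ones.
salary_demo = pd.DataFrame({"med_salary": [0.0, 0.0, 60000.0, 80000.0]})

# Naive mean counts the placeholder zeros as real salaries.
naive_mean = salary_demo["med_salary"].mean()

# Filtered mean uses only rows with a known salary.
known_salaries = salary_demo.loc[salary_demo["med_salary"] > 0, "med_salary"]
filtered_mean = known_salaries.mean()

print(f"Naive mean: {naive_mean:.0f}, mean of known salaries: {filtered_mean:.0f}")
```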

In [None]:
dashboard_df['company_name'] = dashboard_df['company_name'].fillna('Unknown')
print("Missing company names have been filled.")
print("\nVerifying the DataFrame (note the 'company_name' non-null count):")
dashboard_df.info()

cleaned_skills column: fill missing values with an empty string.

In [None]:
dashboard_df['cleaned_skills'] = dashboard_df['cleaned_skills'].fillna('')
print("Missing skills have been filled.")
print("\nVerifying the DataFrame (note the 'cleaned_skills' non-null count):")
dashboard_df.info()

In [None]:
dashboard_df = dashboard_df.drop(columns=['date_posted'])

print("The 'date_posted' column has been removed.")
dashboard_df.info()

### Adding combined_text to modeling_df, Derived from dashboard_df

In [None]:
modeling_df = dashboard_df.copy()
modeling_df['combined_text'] = analysis_df['combined_text']

print("--- Updated 'modeling_df' is Ready! ---")
print("It now contains the clean dashboard columns plus 'combined_text'.")
modeling_df.info()
print("\nSample of the new modeling_df:")
display(modeling_df.head())
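
Assigning a column from another DataFrame aligns values by index, which works here as long as dashboard_df and analysis_df still share the same row index. If rows were ever dropped and the index reset in between, texts would attach to the wrong jobs. The sketch below, on invented rows, contrasts index-based assignment with a key-based merge on job_id; all names ending in _demo, by_index, and by_key are hypothetical.

```python
import pandas as pd

# A cleaned frame from which job 20 was removed and the index then reset.
clean_demo = pd.DataFrame({"job_id": [10, 20, 30], "title": ["A", "B", "C"]})
clean_demo = clean_demo[clean_demo["job_id"] != 20].reset_index(drop=True)

# The source of the text still has all three jobs.
text_demo = pd.DataFrame({
    "job_id": [10, 20, 30],
    "combined_text": ["text a", "text b", "text c"],
})

# Index-based assignment: row 1 (job 30) receives the text stored at index 1 (job 20).
by_index = clean_demo.copy()
by_index["combined_text"] = text_demo["combined_text"]

# Key-based merge: each job receives its own text, whatever the index looks like.
by_key = clean_demo.merge(text_demo, on="job_id", how="left", validate="one_to_one")

print(by_index)
print(by_key)
```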

## Part 8: Saving the Analysis-Ready Datasets

In [None]:
output_folder_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/data_cleaning/'
dashboard_file_path = output_folder_path + 'analysis_ready_without_combinedtext.csv'
modeling_file_path = output_folder_path + 'analysis_ready_with_combinedtext.csv'

print(f"Saving the dashboard DataFrame to: {dashboard_file_path}")
dashboard_df.to_csv(dashboard_file_path, index=False)
print("dashboard_df saved successfully.")

print(f"\nSaving the modeling DataFrame to: {modeling_file_path}")
modeling_df.to_csv(modeling_file_path, index=False)
print("modeling_df saved successfully.")

In [None]:
dashboard_rows, dashboard_cols = dashboard_df.shape
print(f"dashboard_df has:")
print(f"- {dashboard_rows} rows")
print(f"- {dashboard_cols} columns")

modeling_rows, modeling_cols = modeling_df.shape
print(f"\nmodeling_df has:")
print(f"- {modeling_rows} rows")
print(f"- {modeling_cols} columns")

## Part 9: Post-Processing - Adding State Codes

As an enhancement, this section reloads the saved files and adds a state_code column (a two-letter state code), which is useful for geographical analysis. The code is taken from the last comma-separated part of the location string when that part is exactly two uppercase letters; otherwise the string 'None' is stored.

In [None]:
import pandas as pd
import re

base_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/'
data_folder = base_path + 'data_cleaning/'
dashboard_file_path = data_folder + 'analysis_ready_without_combinedtext.csv'
modeling_file_path = data_folder + 'analysis_ready_with_combinedtext.csv'

try:
    dashboard_df = pd.read_csv(dashboard_file_path)
    modeling_df = pd.read_csv(modeling_file_path)
    print("Both DataFrames loaded successfully!")
except FileNotFoundError:
    print("\nError: A file was not found. Please check your file paths.")
    raise

def get_state_code(location):
    """Return the two-letter code after the last comma in location, or the string 'None'."""
    if not isinstance(location, str):
        return 'None'

    parts = location.split(',')

    if len(parts) > 1:
        potential_state = parts[-1].strip()

        if len(potential_state) == 2 and potential_state.isalpha() and potential_state.isupper():
            return potential_state
    return 'None'

print("\nAdding 'state_code' column to the dashboard DataFrame...")
dashboard_df['state_code'] = dashboard_df['location'].apply(get_state_code)

print("Adding 'state_code' column to the modeling DataFrame...")
modeling_df['state_code'] = modeling_df['location'].apply(get_state_code)

try:
    dashboard_df.to_csv(dashboard_file_path, index=False)
    modeling_df.to_csv(modeling_file_path, index=False)
    print("\nSuccess! Both CSV files have been updated with the 'state_code' column and saved.")
    print(f"-> {dashboard_file_path}")
    print(f"-> {modeling_file_path}")
except Exception as e:
    print(f"\nAn error occurred while saving the files: {e}")

state_code_counts = dashboard_df['state_code'].value_counts()
none_count = state_code_counts.get('None', 0)
state_codes_only_counts = state_code_counts.drop('None', errors='ignore')

print("\n" + "="*50)
print("Detailed State Code Verification:")
print("="*50)
print("\n--- Occurrences of Each State Code ---\n")
print(state_codes_only_counts)
print("\n" + "="*50)
print(f"\nTotal number of jobs with no state code found (written as 'None'): {none_count}")
print("\n" + "="*50)
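
The extraction above accepts any two uppercase letters after the last comma, so a location such as "London, UK" would yield "UK" as a state code. If the data is assumed to be US-centric, the candidate could additionally be checked against the known US codes. The following is a sketch under that assumption; extract_state_code, STATE_PATTERN, and US_STATE_CODES are hypothetical names, and this variant returns None rather than the string 'None'.

```python
import re

# The 50 US states plus the District of Columbia.
US_STATE_CODES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
    "DC",
}

# Two uppercase letters after the last comma, at the end of the string.
STATE_PATTERN = re.compile(r",\s*([A-Z]{2})\s*$")

def extract_state_code(location):
    """Return a valid US state code from a 'City, ST' string, else None."""
    if not isinstance(location, str):
        return None
    match = STATE_PATTERN.search(location)
    if match and match.group(1) in US_STATE_CODES:
        return match.group(1)
    return None

for example in ["Austin, TX", "London, UK", "United States", None]:
    print(repr(example), "->", extract_state_code(example))
```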