### **Background, problem statement and Approach:**

This notebook is for mapping the business areas of focus with different levels of intervention categories attached to each area. We will use different different columns from FactInteventions table to map keywords with matching rules associated to each level of area of focus. The scoping document for this project is available on following confluence page

https://ingeusbi.atlassian.net/wiki/spaces/BD/pages/206471169/Scoping+-+Interventions+Categorization+DOD+Draft

The keyword mapping rules in the mapping_rules dictionary, along with how the match_keywords function is designed to handle them, are crucial for ensuring accurate categorization based on textual content. Let's break down both the structure of the rules and the logic of the keyword matching function.

**Understanding of Keyword Mapping Rules**

The mapping_rules dictionary organizes information under various "Areas of Focus," each containing several "Levels" that describe specific actions or considerations. Each level has:

Description: A brief description of the action or focus at that level.
Keywords: A set of keywords designed to match related content in the textual data. These keywords are formatted in a way that indicates how they should be matched, using parentheses () and slashes /.

**Examples:**

Caring Responsibilities: Level 1

Description: "Bring children to appointments"
Keywords: "(Children) (childcare) (child)/ (appointment) (office)"
This means the text should contain either "children," "childcare," or "child" AND "appointment" or "office."

Physical Capability: Level 3

Description: "Identify benefits of increasing Physical activity"
Keywords: "Physical/activity/benefits"
This indicates a need for the presence of "Physical," "activity," and "benefits" in any order.

**Keyword Matching Function Logic**

The match_keywords function processes these keyword patterns to match text data effectively. Here’s how it works with the key aspects being addressed:

Regular Expressions: It uses Python’s re module to create and apply regular expressions based on the keyword patterns.

Plural and Singular Forms: By appending s? to the keywords (where applicable), the regex pattern accounts for both plural and singular forms. This regex adjustment is done in the prepare_regex function which modifies the keywords before they are compiled into regex patterns.

Case Insensitivity: It performs case-insensitive matching (re.IGNORECASE) to ensure that variations in capitalization do not affect the matching process.

Handling of Parentheses and Slashes:

Parentheses (): Items within parentheses are treated as separate options within a group. The text should match any of the options listed within the parentheses.
Slashes /: These are used to denote essential words that must all appear in the text, though not necessarily adjacent to each other or in the order listed.
Stopwords Filtering: Common words like "the," "and," "to," etc., are filtered out to prevent them from triggering matches that are not meaningful. This filtering occurs before the regex matching checks, thus ensuring that these common words do not skew the matching results.

Applying the Matching Function
When match_keywords is invoked, it processes each group of keywords:

It constructs a regex pattern for each keyword or group of keywords (considering stopwords and plural/singular forms).
It then checks the provided text against these patterns.
If a match is found based on the conditions set out by the keyword patterns (accounting for necessary groups and optionals), these keywords are added to the list of found keywords.


### **Imports Packages and Sample Data**

In [3]:
import pandas as pd 
import numpy as np    
import matplotlib.pyplot as plt       
import seaborn as sns
import re

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 5, Finished, Available)

In [4]:
sdf = spark.sql("SELECT * FROM RestartData.fact_interventions_sample_v1")
df_interventions_sample = sdf.toPandas()

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 6, Finished, Available)

### **Helper Functions for Mapping**

In [5]:
# Function for basic pre processing of data

def preprocess_and_concatenate(df, columns_to_concat):
  
    # Preprocess all columns in 'columns_to_concat'
    for col in columns_to_concat:
        df[col] = df[col].str.lower().str.strip()

    # Creating a new column 'ConcatenatedText' that merges the text from the specified columns
    df['ConcatenatedText'] = df[columns_to_concat].apply(lambda x: ' '.join(x.dropna().astype(str)).lower(), axis=1)

    return df



# def load_and_preprocess_data(df, column_name):
#     # Basic text preprocessing
#     df[column_name] = df[column_name].str.lower().str.strip()
#     return df

# # Function to conatenate unstructered text data from different columns

# def concatenate_columns(df, columns):
#     # Creating a new column 'ConcatenatedText' that merges the text from the specified columns
#     df['ConcatenatedText'] = df[columns].apply(lambda x: ' '.join(x.dropna().astype(str)).lower(), axis=1)
#     return df

# Function to load keywords and rules from excel file if needed, however in this notebook,
# we have updated dictionary of mapping rule which will be used 

def load_keyword_rules(excel_path):
    excel_data = pd.ExcelFile(excel_path)
    interventions_df = pd.read_excel(excel_data, sheet_name='Interventions')
    mapping_rules = {}
    for index, row in interventions_df.iterrows():
        area_of_focus = row['Area of Focus']
        mapping_rules[area_of_focus] = {}
        for level in range(1, 6):
            level_key = f"Level {level}"
            keyword_key = f"Level {level} keywords"
            if pd.notna(row[level_key]) and pd.notna(row[keyword_key]):
                mapping_rules[area_of_focus][level_key] = {
                    'description': row[level_key],
                    'keywords': row[keyword_key]
                }
    return mapping_rules

# Function to match keywords 

def match_keywords(text, keyword_pattern, stopwords):
    def prepare_regex(words):
        # Handling both plural and singular forms by making the last 's' optional
        words = re.sub(r'\b(\w+?)\b', r'\1s?', words)
        # Escaping special characters in words
        words = re.escape(words)
        # Replace spaces with regex that allows optional extra words in between
        words = words.replace(r'\ ', r'\s+')
        return words

    # Initialize found keywords list
    found_keywords = []

    # Normalize text for case insensitive matching
    text = text.lower()

    # Handle each group divided by '/'
    for group in keyword_pattern.split('/'):
        # Extract keywords inside parentheses and handle them with regex
        group_keywords = re.findall(r'\((.*?)\)', group)
        if group_keywords:
            # Create regex pattern for each keyword considering stopwords
            for keywords in group_keywords:
                if keywords.lower() not in stopwords:
                    regex = re.compile(prepare_regex(keywords), re.IGNORECASE)
                    if regex.search(text):
                        found_keywords.append(keywords)
        else:
            # If no parentheses, treat the whole group as one keyword
            if group.strip().lower() not in stopwords:
                regex = re.compile(prepare_regex(group.strip()), re.IGNORECASE)
                if regex.search(text):
                    found_keywords.append(group.strip())

    return found_keywords if found_keywords else None

# def match_keywords(text, keyword_pattern):
#     slash_split = keyword_pattern.split('/')
#     found_keywords = []
#     if len(slash_split) > 1:
#         matches = [(part, re.findall(f"\\b{part}\\b", text)) for part in slash_split]
#         match_counts = sum([bool(match[1]) for match in matches])
#         if match_counts >= 2:
#             found_keywords = [match[0] for match in matches if match[1]]
#     else:
#         paren_split = re.findall(r'\((.*?)\)', keyword_pattern)
#         if paren_split:
#             for word in paren_split:
#                 if re.search(r'\b{}\b'.format(re.escape(word)), text):
#                     found_keywords.append(word)
#         else:
#             if re.search(r'\b{}\b'.format(re.escape(keyword_pattern)), text):
#                 found_keywords.append(keyword_pattern)

#     return found_keywords if found_keywords else None

# Function to assign matched keywords as per mapping rules to text columns

def assign_categories(df, mapping_rules, text_column='ConcatenatedText'):
    df['AreaOfFocus'] = None
    df['Level'] = None
    df['Category'] = None
    df['MatchingKeywords'] = None

    for idx, row in df.iterrows():
        text = row[text_column]  # Use the concatenated text for keyword matching
        for area, levels in mapping_rules.items():
            for level, details in levels.items():
                matching_keywords = match_keywords(text, details['keywords'], stopwords)
                if matching_keywords:
                    df.at[idx, 'AreaOfFocus'] = area
                    df.at[idx, 'Level'] = level
                    df.at[idx, 'Category'] = details['description']
                    df.at[idx, 'MatchingKeywords'] = ", ".join(matching_keywords)
                    break
            if df.at[idx, 'Area of Focus'] is not None:
                break

    return df

########################################################



StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 7, Finished, Available)

### **Mapping Rules Dictionary**

In [6]:
mapping_rules = {

'Caring Responsibilities': {'Level 1': {'description': 'Bring children to appointments',
   'keywords': '(Children) (childcare) (child)/ (appointment) (office)'},
  'Level 2': {'description': 'Identify support available locally',
   'keywords': '(Identify support)'},
  'Level 3': {'description': 'Identify nurseries offering funded childcare',
   'keywords': '(Identify) / (nurseries)(funded childcare)'},
  'Level 4': {'description': 'Plan a care timetable',
   'keywords': '(Care timetable) (timetable for care)'}},
 
 'Finance': {'Level 1': {'description': 'Make a budget ',
   'keywords': '(Budget)'},
  'Level 2': {'description': 'Explore debt management options',
   'keywords': '(Debt management)'},
  'Level 3': {'description': 'Complete debt management course on iWorks',
   'keywords': 'Debt management/iworks/course '}},
 
 'Photo ID': {'Level 1': {'description': 'Apply for photo ID (NB - various types, passport, driving licence, citizen card, etc.)',
   'keywords': '(Passport) (Driving Licence) (Citizen Card) (photo ID) (photo)'}},
 
 'Criminal Record': {'Level 2': {'description': 'Identify skill and qualifications gaps',
   'keywords': '(Skill gaps) (qualification gaps)'},
  'Level 3': {'description': 'Practice interview skills and how to talk to an employer about criminal convictions',
   'keywords': '(Criminal) (criminal record) (convictions)'},
  'Level 4': {'description': 'Speak to Probation Officer',
   'keywords': '(Probation officer)'}},
 
 'Housing': {'Level 1': {'description': 'Meet with Housing Officer to discuss support available',
   'keywords': '(Housing officer)'},
  'Level 2': {'description': 'Contact housing association',
   'keywords': '(Housing association)'},
  'Level 3': {'description': "Contact Citizen's Advice Bureau",
   'keywords': "(Citizen's Advice) (CAB) (Citizens Advice)"}},
 
 'Physical Capability': {'Level 1': {'description': 'Speak to GP',
   'keywords': '(Doctor) (GP)'},
  'Level 2': {'description': 'Obtain a fit note',
   'keywords': '(Fit note) (sick note) (fit for work)'},
  'Level 3': {'description': 'Identify benefits of increasing Physical activity',
   'keywords': 'Physical/activity/benefits '},
  'Level 4': {'description': 'Identify suitable pathways to employment',
   'keywords': '(Pathways to employment)'},
  'Level 5': {'description': 'Identify specialist intervention services relating to physical activity',
   'keywords': 'Physical activity/services '}},
 
 'Digital Skills': {'Level 1': {'description': 'Complete a digital audit',
   'keywords': '(Digital audit)'},
  'Level 2': {'description': 'Apply for VOXI', 
   'keywords': '(VOXI)'},
  'Level 3': {'description': 'Complete iworks digital skills sessions',
   'keywords': '(iWorks) /(digital) (digital skills) '},
  'Level 4': {'description': 'Look at attending a college course',
   'keywords': '(College) (course) (digital college)'}},
 
 'Mental Wellbeing': {'Level 1': {'description': 'Complete a mental wellbeing assessment',
   'keywords': '(Mental wellbeing assessment) (wellbeing assessment)'},
  'Level 2': {'description': 'Attend an internal wellbeing workshop',
   'keywords': 'wellbeing/workshop '},
  'Level 3': {'description': 'Attend a 121 with the wellbeing officer',
   'keywords': '(Wellbeing officer)'}},
 
 'Communication Skills': {'Level 1': {'description': 'Identify immediate support available to develop your understanding of English (E.g. family)',
   'keywords': 'Understanding/english '},
  'Level 2': {'description': 'Identify ESOL courses available',
   'keywords': '(ESOL course)'},
  'Level 3': {'description': 'Attend an ESOL course',
   'keywords': 'Attend/ESOL  '},
  'Level 4': {'description': 'Complete role play exercises with people known to you to develop questioning techniques',
   'keywords': 'Role play/questioning techniques '},
  'Level 5': {'description': 'Attend an interview skills workshop',
   'keywords': 'Attend/interview/session/workshop '}},
 
 'Learning Capability': {'Level 1': {'description': 'Identifying areas for improving reading and writing with iWorks',
   'keywords': 'Reading/writing/iworks '},
  'Level 2': {'description': 'Identifying areas for improving reading and writing within local community',
   'keywords': 'Reading and writing/local/community '}},
 
 'Qualifications': {'Level 1': {'description': 'Provide evidence of qualifications',
   'keywords': 'Evidence/qualifications '},
  'Level 2': {'description': 'Identify qualifications needed to fill gaps',
   'keywords': 'Identify/qualifications  '},
  'Level 3': {'description': "Identify qualifications that could be gained via iWork's",
   'keywords': 'Qualifications/iworks '}},
 
 'Literacy and Numeracy': {'Level 1': {'description': 'Attend a numeracy course (External)',
   'keywords': '(Numeracy course) (numeracy session)'},
  'Level 2': {'description': 'Attend the journey to employment workshop',
   'keywords': '(journey to employment)'}},
 
 'Attending Job Interviews': {'Level 1': {'description': 'Research the role being applied for',
   'keywords': 'Research/role '},
  'Level 2': {'description': 'Attend a group session on interview preparation',
   'keywords': 'Attend/group/session/interview '},
  'Level 4': {'description': 'Complete a mock interview with an Advisor',
   'keywords': '(Mock interview) (practice interview)'}},
 
 'Completing CV and Job Applications': {'Level 1': {'description': 'Gather work and education history',
   'keywords': 'Education/work/history '},
  'Level 2': {'description': 'Use iworks to build a CV/curriculum vitae',
   'keywords': '(iWorks) / (CV) (curriculum vitae)/ (create) / (build)'},
  'Level 3': {'description': 'Use iWorks CV checker to assess the quality of your CV',
   'keywords': '(iWorks) / (CV) (curriculum vitae)/ (check)'},
  'Level 4': {'description': 'Create a CV tailored to the role you are applying for',
   'keywords': 'CV/Tailored '},
  'Level 5': {'description': 'Attend a group session on CV building',
   'keywords': '(Session) / (CV) (curriculum vitae)'}},
 
 'Job Search': {'Level 1': {'description': 'Sign up for job search websites',
   'keywords': '(Job search) (job websites) (job sites)'},
  'Level 2': {'description': 'Use the job search function in iWorks to shortlist roles to apply for',
   'keywords': 'job search/iworks '},
  'Level 3': {'description': 'Attend internal job fair at the Restart site',
   'keywords': '(Job fair)'},
  'Level 4': {'description': 'Use computers at Restart office to apply for jobs',
   'keywords': 'Apply/jobs  '},
  'Level 5': {'description': 'Create multiple CVs relevant to the sectors being applied for',
   'keywords': '(Multiple) / (CV) (CVs) (curriculum vitae)'}},
 
 'Transport': {'Level 1': {'description': 'Plan travel routes for attending appointments',
   'keywords': '(Plan) /(travel) (routes) '},
  'Level 2': {'description': 'Bring evidence of travel costs to appointment to claim expenses',
   'keywords': 'Travel/costs(cost)/expenses'},
  'Level 3': {'description': 'Plan travel routes for attending job interviews/work locations',
   'keywords': 'Plan/travel/(interviews)(interview)/work '}},
 
 'Confidence': {'Level 1': {'description': 'Attend a confidence building session',
   'keywords': '(Confidence building session)'},
  'Level 2': {'description': 'Complete confidence building sessions on iWorks',
   'keywords': 'Confidence building/iworks '}},
 
 'Motivation': {'Level 1': {'description': 'Complete iWorks motivational sessions',
   'keywords': 'iworks/motivational session '}}
   
   }

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 8, Finished, Available)

In [7]:
#rules_path = 'abfss://ef7b5398-bd38-48b1-b070-71bfdc4f75b1@onelake.dfs.fabric.microsoft.com/dd18ebf8-5392-43cd-acef-ed29b5e123c2/Files/Interventions_categorization/Interventions - Data Science Project V2.0.xlsx'

stopwords = set(["the", "and", "to", "of", "in", "a", "is", "that", "for", "on", "it", "with", "as", "this", "by", "are", "or", "from"])

# Select the columns we want to perform the mapping on
columns_to_use = ['InterventionPersonalisedAction','InterventionTypeDescription', 'InterventionDetailsAndObjectivesDescription']

# Concatenate the specified columns into a single column for processing
df_concatenate = preprocess_and_concatenate(df_interventions_sample, columns_to_use)

# Load the keyword mapping rules (assuming this has been already defined)
#mapping_rules = load_keyword_rules(rules_path)

# Apply the keyword matching and category assignment
df_mapped = assign_categories(df_concatenate, mapping_rules)

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 9, Finished, Available)

In [17]:
df_concatenate[['InterventionPersonalisedAction','InterventionTypeDescription', 'InterventionDetailsAndObjectivesDescription','ConcatenatedText']].head(10)

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 19, Finished, Available)

Unnamed: 0,InterventionPersonalisedAction,InterventionTypeDescription,InterventionDetailsAndObjectivesDescription,ConcatenatedText
0,job searching -handing out cv in the local com...,job searching -handing out cv in the local com...,as agreed and discussed i will take my cv to l...,job searching -handing out cv in the local com...
1,job sites,bolton college courses,during our meeting on 13.02.24 michelle and i...,job sites bolton college courses during our me...
2,job searching - sign up for job search websites,job searching - sign up for job search websites,i will sign up to a number of online job sites...,job searching - sign up for job search website...
3,iworks tools,iworks tools,david to utilise all aspects of iworks to help...,iworks tools iworks tools david to utilise all...
4,photo id - applying for a provisional driving ...,photo id- applying for a provisional driving l...,"as discussed and agreed, i will have a look at...",photo id - applying for a provisional driving ...
5,caring responsibilities - identify support wit...,caring responsibilities - identify support wit...,"as discussed and agreed with satveer, she will...",caring responsibilities - identify support wit...
6,iworks tools,iworks tools,as agreed and discussed. omar to continue job ...,iworks tools iworks tools as agreed and discus...
7,confidence - training sessions to be held to ...,confidence - training sessions to be held to ...,"as discussed and ageed, qamar will attend the ...",confidence - training sessions to be held to ...
8,attend business admin course,"people plus - sia, cscs, cctv, customer services",as discussed and agreed i will start the busin...,attend business admin course people plus - sia...
9,sia security course,"people plus - sia, cscs, cctv, customer services",amanda to attend sia security course starting ...,"sia security course people plus - sia, cscs, c..."


In [8]:
df_mapped_selected_columns = df_mapped[['InterventionPersonalisedAction','InterventionTypeDescription', 'InterventionDetailsAndObjectivesDescription', 'AreaOfFocus', 'Level', 'Category', 'MatchingKeywords']]

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 10, Finished, Available)

In [10]:
df_mapped_selected_columns.head(20)

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 12, Finished, Available)

Unnamed: 0,InterventionPersonalisedAction,InterventionTypeDescription,InterventionDetailsAndObjectivesDescription,AreaOfFocus,Level,Category,MatchingKeywords
0,job searching -handing out cv in the local com...,job searching -handing out cv in the local com...,as agreed and discussed i will take my cv to l...,,,,
1,job sites,bolton college courses,during our meeting on 13.02.24 michelle and i...,,,,
2,job searching - sign up for job search websites,job searching - sign up for job search websites,i will sign up to a number of online job sites...,,,,
3,iworks tools,iworks tools,david to utilise all aspects of iworks to help...,,,,
4,photo id - applying for a provisional driving ...,photo id- applying for a provisional driving l...,"as discussed and agreed, i will have a look at...",,,,
5,caring responsibilities - identify support wit...,caring responsibilities - identify support wit...,"as discussed and agreed with satveer, she will...",,,,
6,iworks tools,iworks tools,as agreed and discussed. omar to continue job ...,,,,
7,confidence - training sessions to be held to ...,confidence - training sessions to be held to ...,"as discussed and ageed, qamar will attend the ...",,,,
8,attend business admin course,"people plus - sia, cscs, cctv, customer services",as discussed and agreed i will start the busin...,,,,
9,sia security course,"people plus - sia, cscs, cctv, customer services",amanda to attend sia security course starting ...,,,,


In [11]:
percent_missing = df_mapped_selected_columns.isnull().mean() * 100

# Create a new DataFrame to display the results
missing_value_df = pd.DataFrame({
    'column_name': percent_missing.index,  # Use the index as column names
    'percent_missing': percent_missing.values  # Use the values from the Series
})

# Sort the results by percentage (optional)
missing_value_df.sort_values('percent_missing', inplace=True)

print(missing_value_df)

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 13, Finished, Available)

                                   column_name  percent_missing
0               InterventionPersonalisedAction         0.000000
1                  InterventionTypeDescription         0.000000
2  InterventionDetailsAndObjectivesDescription         0.000000
3                                  AreaOfFocus        99.998202
4                                        Level        99.998202
5                                     Category        99.998202
6                             MatchingKeywords        99.998202


In [18]:
not_null_mask = df_mapped_selected_columns.notnull().all(axis=1)
not_null_rows = df_mapped_selected_columns[not_null_mask]
not_null_rows.head(20)

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 20, Finished, Available)

Unnamed: 0,InterventionPersonalisedAction,InterventionTypeDescription,InterventionDetailsAndObjectivesDescription,AreaOfFocus,Level,Category,MatchingKeywords
32343,job search/ confidence,ad-hoc actions,1. https://www.charityjob.co.uk/courses?keywor...,Digital Skills,Level 4,Look at attending a college course,course


In [19]:
not_null_rows.shape

StatementMeta(, 6c784bd6-760a-493f-8c01-170ebb7a4a51, 21, Finished, Available)

(1, 7)