# Extracting actionable information from client case notes using Natural Language Processing

We recently ran into a problem: we had a large number of client case notes available, but no way to summarize or synthesize them. Here is the first pass of the Natural Language Processing (NLP) work I used to address this.

In [8]:
import pandas as pd
import os

In [10]:
casenotes = pd.read_csv("/Users/williamearley/Projects/Case Notes - NLP/Case Notes May 2024.csv")

In [79]:
cols_to_drop = ['Client Notes - Client Level Staff', 'Enrollments Assigned Staff']

In [83]:
# Drop the staff columns listed in cols_to_drop
casenotes.drop(columns=cols_to_drop, inplace=True)

In [84]:
casenotes.head(5)

Unnamed: 0,Clients Unique Identifier,Client Notes - Client Level User Home Agency,Client Notes - Client Level Date Added Date,Client Notes - Client Level Case Note Date,Programs Name,Enrollments Project Start Date,Enrollments Project Exit Date,Client Notes - Client Level Note
0,9085A5F26,People Assisting the Homeless (PATH),2024-06-14,2024-05-22,PATH - Outreach - PATH SD Countywide Outreach,2022-01-26,2022-01-26,Veteran met with Case Manager for weekly case ...
1,9085A5F26,People Assisting the Homeless (PATH),2024-06-14,2024-05-22,PATH - Outreach - PATH SD Countywide Outreach,2024-01-08,2024-01-25,Veteran met with Case Manager for weekly case ...
2,9085A5F26,People Assisting the Homeless (PATH),2024-06-13,2024-05-28,PATH - Outreach - PATH SD Countywide Outreach,2022-01-26,2022-01-26,Veteran met with Case Manager for weekly case ...
3,9085A5F26,People Assisting the Homeless (PATH),2024-06-13,2024-05-28,PATH - Outreach - PATH SD Countywide Outreach,2024-01-08,2024-01-25,Veteran met with Case Manager for weekly case ...
4,2CCDEE6A6,People Assisting the Homeless (PATH),2024-06-10,2024-05-30,PATH - Outreach - PATH SD Countywide Outreach,2022-01-25,2022-01-25,Client was referred to Safe Sleep Lot due to a...


In [85]:
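# Keep one row per distinct note; .copy() makes an independent DataFrame so later
# column assignments do not raise pandas' SettingWithCopyWarning
casenotes_unique = casenotes.drop_duplicates(subset=['Client Notes - Client Level Note']).copy()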

In [86]:
len(casenotes)

728

In [87]:
len(casenotes_unique)

511
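
Exact-match deduplication treats two notes that differ only in spacing or capitalization as distinct. The sketch below shows one possible stricter comparison key using only the standard library. The `normalize_note` and `dedupe_notes` helpers and the sample notes are hypothetical and are not part of the original notebook or dataset.

```python
import re

def normalize_note(note):
    """Collapse runs of whitespace and lowercase the note for comparison."""
    return re.sub(r"\s+", " ", note).strip().lower()

def dedupe_notes(notes):
    """Keep the first occurrence of each note, comparing normalized text."""
    seen = set()
    unique = []
    for note in notes:
        key = normalize_note(note)
        if key not in seen:
            seen.add(key)
            unique.append(note)
    return unique

# Hypothetical sample notes (not taken from the case note export)
sample_notes = [
    "Client met with case manager.",
    "Client met with  case manager. ",
    "client met with case manager.",
    "Client was referred to shelter.",
]
unique_notes = dedupe_notes(sample_notes)
print(len(sample_notes), len(unique_notes))
```

In pandas, the same idea could be applied by storing the normalized text in a helper column and passing that column to `drop_duplicates(subset=...)`.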

In [19]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/williamearley/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [27]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/williamearley/nltk_data...


True

In [31]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/williamearley/nltk_data...


True

In [32]:
# Set up the tokenizer, stopword list, and lemmatizer used for preprocessing
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [33]:
# Setting up a little function to preprocess the text
def preprocess_text(text):
    # Tokenization
    tokens = tokenizer.tokenize(text.lower())
    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens
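
To make the tokenization step concrete, here is a minimal standard-library sketch. With the pattern `r'\w+'`, `RegexpTokenizer` returns every run of word characters, which `re.findall` reproduces. The stopword set below is a small hypothetical subset (the notebook uses NLTK's full English list), the sample sentence is invented, and lemmatization is left out.

```python
import re

# Small hypothetical subset of English stopwords, for illustration only
SAMPLE_STOPWORDS = {"with", "for", "in", "the", "to", "a"}

def tokenize(text):
    # Lowercase, then keep every maximal run of letters, digits, or underscores
    return re.findall(r"\w+", text.lower())

def drop_stopwords(tokens):
    # Remove tokens that appear in the sample stopword set
    return [token for token in tokens if token not in SAMPLE_STOPWORDS]

# Invented example sentence in the style of a case note
sample = "Client c/o pain; f/u with CM in 2 weeks."
tokens = drop_stopwords(tokenize(sample))
print(tokens)
```

Note how slash-separated abbreviations split into single letters and how bare numbers survive as tokens. This may be one reason single-character and numeric tokens show up in frequency counts, although that is an assumption about the notes rather than something verified here.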

In [88]:
# Apply the preprocessing function to the case notes column
casenotes_unique['casenotes_processed'] = casenotes_unique['Client Notes - Client Level Note'].apply(preprocess_text)


In [89]:
casenotes_unique.head()

Unnamed: 0,Clients Unique Identifier,Client Notes - Client Level User Home Agency,Client Notes - Client Level Date Added Date,Client Notes - Client Level Case Note Date,Programs Name,Enrollments Project Start Date,Enrollments Project Exit Date,Client Notes - Client Level Note,casenotes_processed
0,9085A5F26,People Assisting the Homeless (PATH),2024-06-14,2024-05-22,PATH - Outreach - PATH SD Countywide Outreach,2022-01-26,2022-01-26,Veteran met with Case Manager for weekly case ...,"[veteran, met, case, manager, weekly, case, ma..."
2,9085A5F26,People Assisting the Homeless (PATH),2024-06-13,2024-05-28,PATH - Outreach - PATH SD Countywide Outreach,2022-01-26,2022-01-26,Veteran met with Case Manager for weekly case ...,"[veteran, met, case, manager, weekly, case, ma..."
4,2CCDEE6A6,People Assisting the Homeless (PATH),2024-06-10,2024-05-30,PATH - Outreach - PATH SD Countywide Outreach,2022-01-25,2022-01-25,Client was referred to Safe Sleep Lot due to a...,"[client, referred, safe, sleep, lot, due, abat..."
5,05ED4F189,People Assisting the Homeless (PATH),2024-06-04,2024-05-29,PATH - Outreach - PATH SD Countywide Outreach,2024-04-29,,Client came to the El Cajon Library to check i...,"[client, came, el, cajon, library, check, o, c..."
6,C6292C48F,People Assisting the Homeless (PATH),2024-06-04,2024-05-29,PATH - Outreach - PATH SD Countywide Outreach,2023-01-12,,Preferences: Client would like to obtain perma...,"[preference, client, would, like, obtain, perm..."


In [37]:
all_tokens = [token for sublist in casenotes_unique['casenotes_processed'] for token in sublist]

In [39]:
fdist = FreqDist(all_tokens)

In [41]:
fdist.most_common(20)

[('client', 2892),
 ('o', 1832),
 ('clt', 550),
 ('plan', 424),
 ('housing', 356),
 ('past', 330),
 ('present', 323),
 ('get', 213),
 ('met', 202),
 ('contact', 162),
 ('veteran', 156),
 ('5', 155),
 ('ce', 152),
 ('would', 150),
 ('time', 141),
 ('next', 141),
 ('week', 134),
 ('meet', 132),
 ('work', 128),
 ('currently', 127)]
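
The counts above include a single-letter token (`o`), a bare number (`5`), and what looks like an abbreviation (`clt`). A hedged sketch of one way to clean these before counting follows. The mapping of `clt` to `client` is an assumption that should be confirmed with the people who write the notes, and `clean_tokens` and the sample token list are hypothetical stand-ins for `all_tokens`.

```python
from collections import Counter

# Assumed abbreviation mapping; confirm with case managers before relying on it
ABBREVIATIONS = {"clt": "client"}

def clean_tokens(tokens, min_length=2):
    """Expand known abbreviations and drop very short or purely numeric tokens."""
    cleaned = []
    for token in tokens:
        token = ABBREVIATIONS.get(token, token)
        # Skip single-character tokens and pure numbers
        if len(token) < min_length or token.isdigit():
            continue
        cleaned.append(token)
    return cleaned

# Hypothetical token list standing in for all_tokens
sample_tokens = ["client", "o", "clt", "plan", "5", "housing", "clt", "o", "plan"]
counts = Counter(clean_tokens(sample_tokens))
print(counts.most_common(3))
```

`FreqDist` is a subclass of `Counter`, so the cleaned list could be passed to `FreqDist` in the same way.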

## Moving past exploratory analysis: topic modeling with LDA to further synthesize the case notes

In [42]:
from gensim import corpora, models

In [43]:
# Creating a dictionary and corpus for LDA
dictionary = corpora.Dictionary(casenotes_unique['casenotes_processed'])
corpus = [dictionary.doc2bow(text) for text in casenotes_unique['casenotes_processed']]

In [44]:
# Apply our LDA model
lda_model = models.LdaModel(corpus, num_topics=5, id2word=dictionary)

In [45]:
# Print topics and their keywords
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}: {topic}")

Topic 0: 0.034*"client" + 0.016*"o" + 0.009*"plan" + 0.009*"present" + 0.008*"past" + 0.007*"get" + 0.004*"cm" + 0.004*"next" + 0.004*"bcnc" + 0.004*"application"
Topic 1: 0.087*"client" + 0.047*"o" + 0.013*"plan" + 0.011*"past" + 0.010*"present" + 0.009*"housing" + 0.008*"veteran" + 0.006*"contact" + 0.005*"met" + 0.005*"week"
Topic 2: 0.059*"clt" + 0.026*"client" + 0.018*"o" + 0.018*"housing" + 0.009*"plan" + 0.008*"get" + 0.007*"outreach" + 0.007*"program" + 0.006*"homeless" + 0.006*"ca"
Topic 3: 0.060*"client" + 0.033*"o" + 0.013*"veteran" + 0.012*"plan" + 0.008*"present" + 0.008*"housing" + 0.008*"get" + 0.008*"clt" + 0.007*"past" + 0.006*"case"
Topic 4: 0.096*"client" + 0.065*"o" + 0.013*"clt" + 0.012*"plan" + 0.010*"past" + 0.009*"housing" + 0.009*"present" + 0.006*"met" + 0.005*"get" + 0.005*"5"
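
All five topics share the same leading terms (`client`, `o`, `plan`), which makes them hard to tell apart. One common remedy is to drop tokens that appear in almost every document, or in almost none, before building the corpus. The sketch below illustrates that filtering in plain Python; `filter_by_document_frequency`, its thresholds, and the sample documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Drop tokens found in fewer than no_below documents
    or in more than a no_above fraction of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(docs)
    keep = {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[token for token in doc if token in keep] for doc in docs]

# Hypothetical tokenized notes
sample_docs = [
    ["client", "housing", "voucher"],
    ["client", "housing", "application"],
    ["client", "medical", "appointment"],
    ["client", "medical", "voucher"],
]
filtered_docs = filter_by_document_frequency(sample_docs)
print(filtered_docs)
```

gensim's `Dictionary.filter_extremes(no_below=..., no_above=...)` applies the same kind of filter directly to the dictionary. Whether it actually separates the topics for this dataset would need to be checked by rerunning the model.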


## Finding keywords associated with clients or situations using TF-IDF (Term Frequency-Inverse Document Frequency)

In [53]:
# Import library
from sklearn.feature_extraction.text import TfidfVectorizer

In [60]:
# Initialize the TF-IDF vectorizer, keeping only the 100 most frequent terms
tfidf_vec = TfidfVectorizer(max_features=100)

In [61]:
# Fit and transform processed case notes
tfidf_matrix = tfidf_vec.fit_transform(casenotes_unique['casenotes_processed'].apply(lambda x: ' '.join(x)))

In [62]:
# Get the feature names (words)
feature_names = tfidf_vec.get_feature_names_out()

In [63]:
# Identify the top five keywords for each note by TF-IDF score
top_keywords = []
for i in range(tfidf_matrix.shape[0]):
    row = tfidf_matrix[i]
    # row.indices holds the column positions of the non-zero terms, row.data their scores
    tfidf_scores = list(zip(feature_names[row.indices], row.data))
    tfidf_scores_sorted = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)[:5]
    top_keywords.append((i, tfidf_scores_sorted))

In [64]:
top_keywords

[(0,
  [('veteran', 0.8752286266461274),
   ('case', 0.3119164949739776),
   ('manager', 0.24516112118092853),
   ('appointment', 0.13925576183025196),
   ('get', 0.10944439171039147)]),
 (1,
  [('veteran', 0.9225689834885215),
   ('case', 0.2515226574285857),
   ('manager', 0.23430231727123463),
   ('appointment', 0.09981582989086532),
   ('health', 0.08118491844674292)]),
 (2,
  [('due', 0.6472247390187047),
   ('provided', 0.6141982659299736),
   ('client', 0.4515092771259335)]),
 (3,
  [('client', 0.5960277915950601),
   ('referral', 0.44176925426893127),
   ('still', 0.25318893964099093),
   ('reached', 0.24787548859375533),
   ('enrolled', 0.24660070559336797)]),
 (4,
  [('client', 0.5851647542877774),
   ('get', 0.35272877738034236),
   ('housing', 0.302715079211222),
   ('medical', 0.2738217725540783),
   ('psh', 0.2499347729999351)]),
 (5,
  [('service', 0.688046957421629),
   ('ca', 0.3735448983132295),
   ('id', 0.3615742406662539),
   ('voucher', 0.35784446275711695),
   ...
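
To show where scores like these come from, the sketch below computes TF-IDF by hand with the formula scikit-learn uses by default: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The `tfidf_rows` helper and the sample documents are hypothetical, and the sketch ignores the `max_features` cap used above.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, using smoothed IDF
    and L2 row normalization."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # document frequency: one count per document
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    rows = []
    for doc in docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        # Guard against empty documents, which would have a zero norm
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Hypothetical tokenized notes
sample_docs = [
    ["veteran", "case", "manager", "veteran"],
    ["client", "housing", "case"],
    ["client", "voucher"],
]
rows = tfidf_rows(sample_docs)
top_terms = [max(row, key=row.get) for row in rows]
print(top_terms)
```

Terms that are frequent within a note but rare across notes get the highest weight, which is why a word repeated in one note can dominate that note's keyword list.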

## Sentiment analysis of case notes using TextBlob and VADER

In [67]:
from textblob import TextBlob

In [68]:
# Creating a function to calculate sentiment polarity
def calculate_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

In [90]:
# Applying sentiment analysis to casenotes
casenotes_unique['sentiment'] = casenotes_unique['Client Notes - Client Level Note'].apply(calculate_sentiment)


In [91]:
casenotes_unique.head(5)

Unnamed: 0,Clients Unique Identifier,Client Notes - Client Level User Home Agency,Client Notes - Client Level Date Added Date,Client Notes - Client Level Case Note Date,Programs Name,Enrollments Project Start Date,Enrollments Project Exit Date,Client Notes - Client Level Note,casenotes_processed,sentiment
0,9085A5F26,People Assisting the Homeless (PATH),2024-06-14,2024-05-22,PATH - Outreach - PATH SD Countywide Outreach,2022-01-26,2022-01-26,Veteran met with Case Manager for weekly case ...,"[veteran, met, case, manager, weekly, case, ma...",0.025575
2,9085A5F26,People Assisting the Homeless (PATH),2024-06-13,2024-05-28,PATH - Outreach - PATH SD Countywide Outreach,2022-01-26,2022-01-26,Veteran met with Case Manager for weekly case ...,"[veteran, met, case, manager, weekly, case, ma...",0.017572
4,2CCDEE6A6,People Assisting the Homeless (PATH),2024-06-10,2024-05-30,PATH - Outreach - PATH SD Countywide Outreach,2022-01-25,2022-01-25,Client was referred to Safe Sleep Lot due to a...,"[client, referred, safe, sleep, lot, due, abat...",0.31875
5,05ED4F189,People Assisting the Homeless (PATH),2024-06-04,2024-05-29,PATH - Outreach - PATH SD Countywide Outreach,2024-04-29,,Client came to the El Cajon Library to check i...,"[client, came, el, cajon, library, check, o, c...",0.375
6,C6292C48F,People Assisting the Homeless (PATH),2024-06-04,2024-05-29,PATH - Outreach - PATH SD Countywide Outreach,2023-01-12,,Preferences: Client would like to obtain perma...,"[preference, client, would, like, obtain, perm...",0.05


In [72]:
# using VADER for sentiment analysis
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [74]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/williamearley/nltk_data...


True

In [75]:
# Initialize the VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

In [76]:
# Function to calculate sentiment using VADER
def calculate_vader_sentiment(text):
    scores = sid.polarity_scores(text)
    return scores['compound'] # utilizing compound score for overall sentiment here

In [92]:
# Apply to casenotes
casenotes_unique['sentiment_vader'] = casenotes_unique['Client Notes - Client Level Note'].apply(calculate_vader_sentiment)


In [93]:
casenotes_unique.head(5)

Unnamed: 0,Clients Unique Identifier,Client Notes - Client Level User Home Agency,Client Notes - Client Level Date Added Date,Client Notes - Client Level Case Note Date,Programs Name,Enrollments Project Start Date,Enrollments Project Exit Date,Client Notes - Client Level Note,casenotes_processed,sentiment,sentiment_vader
0,9085A5F26,People Assisting the Homeless (PATH),2024-06-14,2024-05-22,PATH - Outreach - PATH SD Countywide Outreach,2022-01-26,2022-01-26,Veteran met with Case Manager for weekly case ...,"[veteran, met, case, manager, weekly, case, ma...",0.025575,0.8815
2,9085A5F26,People Assisting the Homeless (PATH),2024-06-13,2024-05-28,PATH - Outreach - PATH SD Countywide Outreach,2022-01-26,2022-01-26,Veteran met with Case Manager for weekly case ...,"[veteran, met, case, manager, weekly, case, ma...",0.017572,-0.3771
4,2CCDEE6A6,People Assisting the Homeless (PATH),2024-06-10,2024-05-30,PATH - Outreach - PATH SD Countywide Outreach,2022-01-25,2022-01-25,Client was referred to Safe Sleep Lot due to a...,"[client, referred, safe, sleep, lot, due, abat...",0.31875,0.802
5,05ED4F189,People Assisting the Homeless (PATH),2024-06-04,2024-05-29,PATH - Outreach - PATH SD Countywide Outreach,2024-04-29,,Client came to the El Cajon Library to check i...,"[client, came, el, cajon, library, check, o, c...",0.375,0.9169
6,C6292C48F,People Assisting the Homeless (PATH),2024-06-04,2024-05-29,PATH - Outreach - PATH SD Countywide Outreach,2023-01-12,,Preferences: Client would like to obtain perma...,"[preference, client, would, like, obtain, perm...",0.05,0.6486
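
The two scores do not always agree: the second displayed row is slightly positive under TextBlob but negative under VADER. A simple way to surface such cases is to convert both scores to labels and list the rows where the labels differ. The sketch below uses the ±0.05 cutoff that is conventional for VADER's compound score. Applying the same cutoff to TextBlob polarity is an assumption, and `label_sentiment` is a hypothetical helper.

```python
def label_sentiment(score, threshold=0.05):
    """Map a score in [-1, 1] to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# (row index, TextBlob polarity, VADER compound) for the five rows displayed above
scores = [
    (0, 0.025575, 0.8815),
    (2, 0.017572, -0.3771),
    (4, 0.31875, 0.802),
    (5, 0.375, 0.9169),
    (6, 0.05, 0.6486),
]

# Collect the rows where the two tools produce different labels
disagreements = [
    (index, label_sentiment(textblob), label_sentiment(vader))
    for index, textblob, vader in scores
    if label_sentiment(textblob) != label_sentiment(vader)
]
print(disagreements)
```

Rows flagged this way are good candidates for manual review before either score is used to draw conclusions about a client.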
