# FinchHealth Assessment
**By:** ***Abodunde Ojo***

#### Import necessary libraries and connect to the postgres engine

In [7]:
from sqlalchemy import create_engine, inspect
import pandas as pd
import numpy as np
import spacy
import re
from nltk.corpus import wordnet

nlp = spacy.load('en_core_web_md')  # Load spaCy's English language model with NER

import warnings
warnings.filterwarnings("ignore")


In [8]:
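import os

# Read the connection URL from an environment variable instead of hardcoding credentials.
# The variable name VETASSIST_DATABASE_URL is illustrative, not part of the original setup.
engine = create_engine(os.environ["VETASSIST_DATABASE_URL"])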
engine

Engine(postgresql://niphemi.oyewole:***@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist)

##### Let us check all the tables in the engine before importing the data

In [9]:
inspector = inspect(engine)

# Get a list of all table names in the database
table_names = inspector.get_table_names()

# Print the list of table names
print("Tables in the database:")
for table_name in table_names:
    print(table_name)

Tables in the database:
reddit_usernames_comments
reddit_usernames


We only need the reddit_usernames_comments table for this task, so we will load only that table.

In [10]:
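# Load the full table into a DataFrame
df = pd.read_sql('SELECT * FROM reddit_usernames_comments', con=engine)

# Preview the first few rows (not rendered, because it is not the cell's last expression)
df.head()

# Release the pooled connections now that the data is in memory
engine.dispose()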


Let us look at a sample comment

In [11]:
df.comments[0]

'Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a wee

The comment appears to be repeated several times, with each repetition separated by "|". One cleaning step is therefore to split on this delimiter and keep only the first segment. **For this we can write a simple function**, but before then, we can check this assumption.

**Assert**: We can check to confirm that the comments are indeed repeated

In [12]:
data = df.comments[1].split("|")
assert data[0] == data[1] == data[2]

**Result**: No assertion error is raised, so the first three segments of this comment are identical. We will assume that every comment in the table follows the same structure.
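The assertion above inspects a single comment. A minimal sketch of how the same assumption could be checked for every row is given below. The helper name `is_pure_repetition` and the toy strings are illustrative, and the sketch assumes (based on the sample output, which ends mid-sentence) that the last copy of a comment may be cut off.

```python
def is_pure_repetition(comment: str, sep: str = "|") -> bool:
    """Return True if the comment is one text repeated, allowing a cut-off final copy."""
    *complete, last = comment.split(sep)
    if not complete:  # no separator at all: a single, unrepeated comment
        return True
    first = complete[0]
    # Every complete copy must equal the first; the last may be a truncated prefix of it.
    return all(part == first for part in complete) and first.startswith(last)


# Toy stand-ins for df["comments"], not rows from the dataset
samples = [
    "vets are great|vets are great|vets are gr",  # repeated, last copy cut off
    "what makes a good vet?",                      # single comment
    "first thought|a different second thought",    # genuinely different segments
]
flags = [is_pure_repetition(s) for s in samples]
print(flags)  # [True, True, False]
```

On the real data, `(~df["comments"].apply(is_pure_repetition)).sum()` would count the comments for which keeping only the first segment discards distinct text.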

In [13]:
def trim_comments(x):
    data = x.split("|")
    data = data[0] #That is, extract only the first instance of the comment.
    return data

In [14]:
#Let us see how it works
data = trim_comments(df.comments[0])
data

'Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.'

###### We can subsequently apply this trimming to the whole dataset.

In [15]:
df["trimmed_comments"] = df["comments"].apply(trim_comments)

In [16]:
df

Unnamed: 0,username,comments,trimmed_comments
0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on...","Female, Kentucky. 4 years out. Work equine on..."
1,wahznooski,"As a woman of reproductive age, fuck Texas|As ...","As a woman of reproductive age, fuck Texas"
2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...,what makes you want to become a vet?
3,abarthch,"I see of course there are changing variables, ...","I see of course there are changing variables, ..."
4,VoodooKing,I have 412+ and faced issues because wireguard...,I have 412+ and faced issues because wireguard...
...,...,...,...
3271,B1u3Chips_,I’m looking into applying for veterinary nursi...,I’m looking into applying for veterinary nursi...
3272,Daktari2018,Good for you for sticking to standards of care...,Good for you for sticking to standards of care...
3273,Sheepb1,"Yes feel free to ask someone to double check, ...","Yes feel free to ask someone to double check, ..."
3274,Elyrath,"Same! Helps massively. Errors can still occur,...","Same! Helps massively. Errors can still occur,..."


#### Drop the initial comments column only

In [17]:
df1 = df.drop("comments", axis =1)

In [18]:
df1

Unnamed: 0,username,trimmed_comments
0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on..."
1,wahznooski,"As a woman of reproductive age, fuck Texas"
2,Churro_The_fish_Girl,what makes you want to become a vet?
3,abarthch,"I see of course there are changing variables, ..."
4,VoodooKing,I have 412+ and faced issues because wireguard...
...,...,...
3271,B1u3Chips_,I’m looking into applying for veterinary nursi...
3272,Daktari2018,Good for you for sticking to standards of care...
3273,Sheepb1,"Yes feel free to ask someone to double check, ..."
3274,Elyrath,"Same! Helps massively. Errors can still occur,..."


##### Next we preprocess the dataset.
This includes lowercase conversion and the removal of stopwords and non-alphabetic characters (punctuation and digits).

In [19]:
# We define the preprocessing function

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK stopword and tokenizer data to be downloaded beforehand
STOP_WORDS = set(stopwords.words('english'))  # build the stopword set once, not on every call

def preprocess_text(text):
    text = text.lower()  # lowercase conversion
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # keep only letters and whitespace (digits are dropped too)
    tokens = word_tokenize(text)  # tokenization of the text

    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]  # remove stopwords

    return ' '.join(filtered_tokens)  # join the tokens back into a single string

In [20]:
df1["processed_text"] = df1["trimmed_comments"].apply(preprocess_text)

In [21]:
df1

Unnamed: 0,username,trimmed_comments,processed_text
0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on...",female kentucky years work equine private prac...
1,wahznooski,"As a woman of reproductive age, fuck Texas",woman reproductive age fuck texas
2,Churro_The_fish_Girl,what makes you want to become a vet?,makes want become vet
3,abarthch,"I see of course there are changing variables, ...",see course changing variables dimension change...
4,VoodooKing,I have 412+ and faced issues because wireguard...,faced issues wireguard natively supported henc...
...,...,...,...
3271,B1u3Chips_,I’m looking into applying for veterinary nursi...,im looking applying veterinary nursing college...
3272,Daktari2018,Good for you for sticking to standards of care...,good sticking standards care caring enough spe...
3273,Sheepb1,"Yes feel free to ask someone to double check, ...",yes feel free ask someone double check used wo...
3274,Elyrath,"Same! Helps massively. Errors can still occur,...",helps massively errors still occur signficantl...
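To see what these preprocessing steps do to a single sentence, here is a dependency-free sketch. It reuses the same regular expression, but substitutes a two-word stopword set and a whitespace split for NLTK's stopword list and `word_tokenize`, so it only approximates `preprocess_text`.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's full English list
STOP_WORDS = {"which", "was"}

def preprocess_sketch(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # same pattern as preprocess_text
    tokens = text.split()                     # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in STOP_WORDS)

result = preprocess_sketch("Base salary $85k plus bonuses/production which was $20k 2023.")
print(result)  # base salary k plus bonusesproduction k
```

Two side effects are visible: amounts such as `$85k` collapse to `k`, and words joined by a slash are fused into one token.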


In [22]:
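# Similarity between "veterinarian" and each word of the first processed comment
keyword_doc = nlp("veterinarian")  # parse the keyword once
maximum_similarity = []
for word in df1["processed_text"][0].split():
    maximum_similarity.append(keyword_doc.similarity(nlp(word)))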

In [23]:
max(maximum_similarity)

1.0000000713113224

### Let us restate the Task Requirements
**Main Task:** Create a classifier that will accurately classify a list of reddit comments into the proper labels.

**Additional criteria:**

Your classifier should run through this list and determine if they are of these categories:

**Medical Doctor**
These should only include practicing doctors. Medical school students, nurses, and medical professionals who aren’t doctors should go into the “other” label.

**Veterinarian**
These should only include practicing vets. Vet students and vet techs should go into the “other” label.

**Other**


### Considerations

The task asks for a classifier. In machine learning, a ***classifier*** usually means a trained model, such as a logistic regression or a neural network, that automatically assigns labels to text based on what it has learned from labelled examples. Training such a model requires labels. Since we do not have any, we can autogenerate them using different methods. ***We will get back to this shortly***.

This leads us to the second task, which is autogenerating labels for each comment based on the stringent rules that have been set. There are multiple ways to go about this. We will consider four and choose the best two for this task.

***Manual Labelling***: *This involves manually reading each Reddit comment and assigning the appropriate label based on its content or context. We have over 3,000 comments, so this is unrealistic.*

***Semi-Supervised Learning***: *This involves manually labelling a small subset of the dataset. Afterwards, we can use semi-supervised learning techniques to bootstrap the labelling process. The manual intervention might also defeat the purpose of the task.*

***Named Entity Recognition (NER)/Cosine Similarity***: *This can be an effective approach for labelling Reddit comments, especially since we're interested in identifying specific entities mentioned in the comments. NER can automatically identify and classify named entities such as persons, organizations, locations, and dates within text data. This will also involve semantic similarity checks between a keyword and the words of each comment.*

***Regular Expression***: *This involves defining classification criteria as patterns and looping through the comments to check whether each one matches them.*

***We will focus on synthesizing the last two approaches for a robust outcome***
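The semantic similarity check mentioned above reduces to the cosine similarity between two word vectors. The sketch below works through it on made-up three-dimensional vectors. The names `keyword_vec`, `toy_vectors`, and `THRESHOLD` are illustrative, and real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for word embeddings
keyword_vec = [1.0, 0.0, 0.0]
toy_vectors = {
    "close_word": [1.0, 1.0, 0.0],      # 45 degrees from the keyword
    "unrelated_word": [0.0, 1.0, 0.0],  # orthogonal to the keyword
}

THRESHOLD = 0.70  # illustrative cut-off
scores = {w: cosine_similarity(keyword_vec, v) for w, v in toy_vectors.items()}
matches = [w for w, s in scores.items() if s >= THRESHOLD]
print(scores)
print(matches)  # ['close_word']
```

A word counts as a match when its score reaches the cut-off, so the choice of threshold directly controls how many comments receive a label.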

### Labelling The Comments

But first, let us define the functions

In [24]:
# Phrases that disqualify a comment from the "doctor" or "veterinarian" label
medical_doctor_regex = re.compile(r'\b(medical student|nurse)\b', flags=re.IGNORECASE)

veterinarian_regex = re.compile(r'\b(vet student|vet tech)\b', flags=re.IGNORECASE)

def other_words_check(comment, keyword):
    """
    Use a regular expression to check whether the comment contains a phrase
    (e.g. "medical student", "vet tech") that rules out the given keyword's label.
    """
    if keyword == 'doctor':
        return bool(medical_doctor_regex.search(comment))
    if keyword == 'veterinarian':
        return bool(veterinarian_regex.search(comment))
    return False
    
    

def get_related_keywords(word):
    """
    Return the WordNet synonyms and antonyms of a word as two sets.
    Since we are trying to exclude comments written by students or nurses,
    the synonym set comes in handy.
    """

    synonyms = set()
    antonyms = set()

    # Iterate over each synset (a set of synonyms) of the word in WordNet
    for synset in wordnet.synsets(word):
        # Add synonyms of the word
        synonyms.update(synset.lemma_names())

        # Add antonyms of the word (if available)
        for lemma in synset.lemmas():
            if lemma.antonyms():
                # add() stores the whole name; update() would insert its individual characters
                antonyms.add(lemma.antonyms()[0].name())

    return synonyms, antonyms




def similarity_check(keyword, x):
    """
    Return `keyword` if any word in the comment `x` has a similarity of at least 0.70 to it.
    Return 0 otherwise, or if the comment mentions a student, nurse, or tech.
    """
    # Words are compared as-is; synonyms of the keyword are not substituted before scoring
    sentence = x.split()
    sentence1 = ' '.join(sentence)

    # Exclude comments with a disqualifying phrase or any WordNet synonym of "student"
    student_synonyms = get_related_keywords("student")[0]
    if other_words_check(sentence1, keyword) or any(word in student_synonyms for word in sentence):
        return 0

    keyword_doc = nlp(keyword.lower())  # parse the keyword once instead of once per word

    # Highest similarity between the keyword and any single word in the comment
    similarity = max((keyword_doc.similarity(nlp(word)) for word in sentence), default=0)

    return keyword if similarity >= 0.70 else 0
    
    

In [25]:
sdf = "I am a student vet"

In [26]:
similarity_check(keyword = "veterinarian", x=df1["processed_text"][0])

'veterinarian'

In [27]:
similarity_check(keyword = "veterinarian", x=sdf)

0
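The exclusion patterns can also be exercised on their own, without spaCy or WordNet. The sketch below copies `veterinarian_regex` from the cell above and runs it on three made-up sentences. `plural_aware` is a hypothetical variant added here for comparison and is not used anywhere in the notebook.

```python
import re

# Same pattern as veterinarian_regex in the notebook
veterinarian_regex = re.compile(r'\b(vet student|vet tech)\b', flags=re.IGNORECASE)

examples = ["i am a vet tech", "we hired two vet techs", "i am a student vet"]
hits = {text: bool(veterinarian_regex.search(text)) for text in examples}
print(hits)

# Hypothetical variant that also accepts the plural forms
plural_aware = re.compile(r'\b(vet students?|vet techs?)\b', flags=re.IGNORECASE)
plural_hit = bool(plural_aware.search("we hired two vet techs"))
print(plural_hit)  # True
```

Only the first sentence matches the original pattern. The trailing `\b` rejects "vet techs" because no word boundary follows "tech", and "student vet" has the words in the other order, which is why `similarity_check` relies on the WordNet synonyms of "student" to return 0 for it.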

**Add new columns to the dataframe**

In [28]:
df1["veterinary_label"] = df1["processed_text"].apply(lambda row: similarity_check(keyword = "veterinarian", x = row))

In [29]:
df1["doctor_label"] = df1["processed_text"].apply(lambda row: similarity_check(keyword = "doctor", x = row))

In [30]:
df1.veterinary_label.value_counts()

0               3019
veterinarian     257
Name: veterinary_label, dtype: int64

In [31]:
df1

Unnamed: 0,username,trimmed_comments,processed_text,veterinary_label,doctor_label
0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on...",female kentucky years work equine private prac...,veterinarian,0
1,wahznooski,"As a woman of reproductive age, fuck Texas",woman reproductive age fuck texas,0,0
2,Churro_The_fish_Girl,what makes you want to become a vet?,makes want become vet,0,0
3,abarthch,"I see of course there are changing variables, ...",see course changing variables dimension change...,0,0
4,VoodooKing,I have 412+ and faced issues because wireguard...,faced issues wireguard natively supported henc...,0,0
...,...,...,...,...,...
3271,B1u3Chips_,I’m looking into applying for veterinary nursi...,im looking applying veterinary nursing college...,veterinarian,0
3272,Daktari2018,Good for you for sticking to standards of care...,good sticking standards care caring enough spe...,0,0
3273,Sheepb1,"Yes feel free to ask someone to double check, ...",yes feel free ask someone double check used wo...,0,0
3274,Elyrath,"Same! Helps massively. Errors can still occur,...",helps massively errors still occur signficantl...,0,0


In [32]:
df1.doctor_label.value_counts()

0         3095
doctor     181
Name: doctor_label, dtype: int64

### Let us Assign a final Label

With the way we have structured the logic, a comment can pass the check for both veterinarian and doctor. We will prioritize veterinarian in these cases, and we can write a simple function for this.

In [37]:
def label_assignment(row):
    veterinary_label = str(row["veterinary_label"])
    doctor_label = str(row["doctor_label"])
    
    if veterinary_label == "0" and doctor_label == "0":
        return "others"
    elif veterinary_label != "0" and doctor_label != "0":
        return veterinary_label
    else:
        return doctor_label if doctor_label != "0" else veterinary_label

# Here we apply the label_assignment function to create a new column
df1["label"] = df1.apply(label_assignment, axis=1)



In [39]:
# The distribution of the labels in the dataset
df1.label.value_counts()

others          2874
veterinarian     257
doctor           145
Name: label, dtype: int64
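The counts above also reveal how often the tie-break was used. The sketch below restates the priority rule on toy label pairs (not rows from the dataset) and then derives the overlap from the reported `value_counts()` outputs. `label_assignment_sketch` is an illustrative helper with the same outcome as `label_assignment`.

```python
from collections import Counter

def label_assignment_sketch(veterinary_label, doctor_label):
    """Veterinarian wins ties; 0 means the corresponding check did not pass."""
    if veterinary_label != 0:
        return veterinary_label
    if doctor_label != 0:
        return doctor_label
    return "others"

# Toy (veterinary_label, doctor_label) pairs
pairs = [("veterinarian", 0), (0, "doctor"), ("veterinarian", "doctor"), (0, 0)]
toy_counts = Counter(label_assignment_sketch(v, d) for v, d in pairs)
print(toy_counts)

# Counts reported in the value_counts() outputs of the notebook
doctor_flagged, doctor_final, vet_final, others_final = 181, 145, 257, 2874
both_flagged = doctor_flagged - doctor_final         # comments that passed both checks
total_rows = others_final + vet_final + doctor_final
print(both_flagged, total_rows)  # 36 3276
```

So 36 comments passed both checks and were assigned to veterinarian, and the three final labels together account for all 3276 rows.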

In [40]:
# Save the labelled dataset to a CSV file

df1.to_csv("reddit_labelled.csv")

### Extreme Gradient Boosting (XGBoost) Algorithm
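Now that we have our labels, we can train a basic model (no tuning) on a subset of the labelled data and check its performance on the held-out test set. Because the labels were generated by the rules above, the scores measure how well the model reproduces those rules rather than agreement with human-verified labels.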

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder


In [46]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df1["processed_text"], df1["label"], test_size=0.2, random_state=42)

In [47]:
# Vectorize the preprocessed comments
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
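To make the vectorization step less of a black box, here is a hand-rolled TF-IDF sketch on a two-document toy corpus. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to our understanding matches the defaults of `TfidfVectorizer`. The corpus and the helper `tfidf` are illustrative.

```python
import math
from collections import Counter

docs = ["vet clinic", "doctor clinic"]  # toy corpus, not from the dataset
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    counts = Counter(tokens)
    # term count * smoothed idf
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm of the document vector
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[0])
print(weights)
```

"vet" occurs in only one document, so it receives a higher weight than "clinic", which occurs in both. This is the property that lets the downstream model focus on distinguishing terms.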

In [48]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode the class labels
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Initialize XGBoost classifier
classifier = XGBClassifier()

# Train the classifier on the training data
classifier.fit(X_train_vectorized, y_train_encoded)

# Predict on the testing data
y_pred_encoded = classifier.predict(X_test_vectorized)

# Decode the predicted labels
y_pred = label_encoder.inverse_transform(y_pred_encoded)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9801829268292683


In [52]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
pd.DataFrame(conf_matrix)



Confusion Matrix:


Unnamed: 0,0,1,2
0,21,3,0
1,1,572,2
2,1,6,50


In [53]:
# Calculate precision, recall, and F1-score
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)

Classification Report:
              precision    recall  f1-score   support

      doctor       0.91      0.88      0.89        24
      others       0.98      0.99      0.99       575
veterinarian       0.96      0.88      0.92        57

    accuracy                           0.98       656
   macro avg       0.95      0.92      0.93       656
weighted avg       0.98      0.98      0.98       656
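The report can be cross-checked against the confusion matrix by hand. In the sketch below the matrix values are copied from the output above. The row and column order (doctor, others, veterinarian) assumes `LabelEncoder`'s alphabetical ordering, which agrees with the support column of the report (24, 575, 57).

```python
labels = ["doctor", "others", "veterinarian"]  # encoded as 0, 1, 2
conf_matrix = [
    [21, 3, 0],
    [1, 572, 2],
    [1, 6, 50],
]  # rows = true class, columns = predicted class

total = sum(sum(row) for row in conf_matrix)
accuracy = sum(conf_matrix[i][i] for i in range(len(labels))) / total

# Accuracy of always predicting the most frequent true class
majority_baseline = max(sum(row) for row in conf_matrix) / total

for i, name in enumerate(labels):
    recall = conf_matrix[i][i] / sum(conf_matrix[i])                    # correct / all true
    precision = conf_matrix[i][i] / sum(row[i] for row in conf_matrix)  # correct / all predicted
    print(f"{name:>12}  precision={precision:.2f}  recall={recall:.2f}")

print(f"accuracy={accuracy:.4f}  majority baseline={majority_baseline:.4f}")
```

Always predicting "others" would already score about 0.88 on this split, so the per-class recall for doctor and veterinarian is more informative than the overall accuracy.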



In [56]:
import joblib

# Save the trained model to a file
joblib.dump(classifier, 'xgboost_model.pkl')

#save the encoder
joblib.dump(label_encoder, 'label_encoder.pkl')

#save the vectorizer
joblib.dump(vectorizer, 'vectorizer.pkl')


['vectorizer.pkl']