In [46]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


# The package is published on PyPI as imbalanced-learn; it is imported as imblearn
! pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler




In [47]:
path = '/Users/tomcio/Documents/GitHub/MIT_MBAn_NER/data/'
data = pd.read_csv(path + 'training_data_RAW.csv')
X = data["Name"]
y = data["label"]

### 3. Character-level encoding:

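Since you want to analyze characters rather than whole words, the goal is to build a character-level vocabulary and convert each name into a vector of counts. We use CountVectorizer for this. Note that CountVectorizer tokenizes on words by default (analyzer="word"); character features require analyzer="char" (or "char_wb"). The cell below sets ngram_range=(1, 5) and leaves the analyzer at its default, so as written it counts word n-grams of length 1 to 5, not single characters.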

In [48]:
vectorizer = CountVectorizer(ngram_range=(1, 5))
X_encoded = vectorizer.fit_transform(X)
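
To make the difference between the two analyzers concrete, here is a minimal sketch. The toy names and the variable names are made up for illustration and are not part of the notebook's data. It fits one vectorizer with the default word analyzer and one with analyzer="char", then checks how many features a misspelled, unseen name shares with each vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy names, used only for illustration
toy_names = ["ibuprofen", "table"]

# Default analyzer ("word"): each single-word name becomes exactly one feature
word_vectorizer = CountVectorizer(ngram_range=(1, 5))
word_vectorizer.fit(toy_names)
word_features = sorted(word_vectorizer.vocabulary_)

# analyzer="char": features are character n-grams (here 1 to 2 characters long)
char_vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
char_vectorizer.fit(toy_names)
char_features = sorted(char_vectorizer.vocabulary_)

# A misspelled, unseen name shares no word features but many character n-grams
unseen = ["ibuprofin"]
word_hits = word_vectorizer.transform(unseen).nnz
char_hits = char_vectorizer.transform(unseen).nnz

print("word features:", word_features)
print("char features:", char_features)
print("non-zero features for unseen name (word, char):", word_hits, char_hits)
```

With the word analyzer, any name that was not in the training data maps to an all-zero vector, so the classifier has nothing to work with. Character n-grams let unseen names share features with known ones.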


### 4. Train-test split:

Split the data into training and testing sets for model evaluation.

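Key Change = Undersampling: We use RandomUnderSampler from imblearn to shrink the majority class. When sampling_strategy is a float, it is the desired ratio of minority to majority samples after resampling. The sampling_strategy=0.25 used below therefore keeps about four normal words for every drug name; a 1:1 ratio would require sampling_strategy=1.0. Adjust this ratio if needed. Note that the cell below resamples before splitting, so the test set is undersampled too and its metrics describe a 1:4 class mix rather than the original distribution. Splitting first and resampling only the training portion avoids this.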

In [49]:
# 3. Undersampling (Modify ratio as needed)
undersample = RandomUnderSampler(sampling_strategy=0.25)  
X_undersampled, y_undersampled = undersample.fit_resample(X_encoded, y)

# 4. Split data 
X_train, X_test, y_train, y_test = train_test_split(X_undersampled, y_undersampled, test_size=0.2, random_state=42)
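
The following sketch illustrates splitting first and undersampling only the training portion, so that the test set keeps the original class distribution. It is not the notebook's method: the helper `undersample_majority`, the toy arrays, and the assumption that label 1 is the minority class are all hypothetical, and the helper uses plain NumPy instead of RandomUnderSampler to keep the example self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def undersample_majority(X, y, ratio, seed=0):
    """Keep every minority row (label 1) and n_minority / ratio random majority rows (label 0).

    `ratio` plays the role of a float sampling_strategy: minority / majority after resampling.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_keep = min(len(majority_idx), int(len(minority_idx) / ratio))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]


# Hypothetical toy data: 1000 "normal" rows (label 0) and 50 "drug" rows (label 1)
X_toy = np.arange(1050).reshape(-1, 1)
y_toy = np.array([0] * 1000 + [1] * 50)

# 1. Split first, stratified so both parts keep the original class mix
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# 2. Undersample only the training part to a 1:4 minority:majority ratio
X_tr_bal, y_tr_bal = undersample_majority(X_tr, y_tr, ratio=0.25)

print("train before:", np.bincount(y_tr))
print("train after: ", np.bincount(y_tr_bal))
print("test (untouched):", np.bincount(y_te))
```

Evaluating on the untouched test split gives metrics that reflect how the model would behave on data with the real class balance.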



### 5. Build a Logistic Regression model:

Logistic regression is a good choice for binary classification problems like this. Train the model on the training data.

Key Change = Class Weights: When creating the LogisticRegression model, we pass class_weight={0: 1, 1: 10}, so that errors on class 1 (drug names) are penalized ten times as heavily as errors on class 0 (normal words). This pushes the model toward catching more drug names.

In [50]:
# model = LogisticRegression()
# model.fit(X_train_res, y_train_res) # changed from X/Y_train to X/Y_train_res

# 5. Logistic Regression with class weights
model = LogisticRegression(class_weight={0: 1, 1: 10}) 
model.fit(X_train, y_train)
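
The 1:10 weights above are chosen by hand. As a point of comparison, scikit-learn can derive weights from the class frequencies with class_weight="balanced", which uses n_samples / (n_classes * count_per_class). The sketch below computes those weights for a made-up label vector with a 1:4 class mix; the array `y_example` is hypothetical and only mirrors the ratio used in the undersampling step.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 1:4 minority:majority mix (40 drug names, 160 normal words)
y_example = np.array([0] * 160 + [1] * 40)

# "balanced" weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_example
)

# Map each class label to its weight, e.g. for passing to LogisticRegression(class_weight=...)
class_weight = dict(zip([0, 1], weights))
print(class_weight)
```

Here the minority class ends up weighted four times as heavily as the majority class, which exactly offsets the 1:4 imbalance. A hand-picked 1:10 ratio goes further than that.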


### 6. Evaluate the model:
Make predictions on the test data and assess the model's performance using accuracy, precision, recall, and F1-score. The figures discussed here are those printed by the evaluation cell.

Accuracy: About 80.9%. This looks decent but is misleading: after undersampling to a 1:4 ratio, roughly 80% of the samples are normal words, so a model that never predicts "drug name" would score about the same.

Precision: About 0.89, meaning roughly nine out of ten words the model labeled as drug names really were drug names (few false positives).

Recall: The extremely low recall (about 0.043) is the significant problem. The model identifies only around 4% of the actual drug names, so it is far too reluctant to predict the positive class.

F1-score: The low F1-score (about 0.081) reflects the poor balance between precision and recall.

In [51]:
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))


Accuracy: 0.8085927770859278
Precision: 0.8947368421052632
Recall: 0.042579837194740136
F1-score: 0.0812910938433951
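
As a quick sanity check on these numbers, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two printed values. The sketch also computes the accuracy of a trivial "always predict normal word" baseline, under the assumption that the test set has the same 1:4 class mix that sampling_strategy=0.25 produces.

```python
# Values copied from the printed output above
precision = 0.8947368421052632
recall = 0.042579837194740136

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print("F1 recomputed:", round(f1, 4))

# Assumption: the test set keeps the 1:4 minority:majority mix from undersampling,
# so a model that always predicts the majority class is right 4 times out of 5
minority_to_majority = 0.25
baseline_accuracy = 1 / (1 + minority_to_majority)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

The harmonic mean is dominated by the smaller of the two values, which is why a precision near 0.9 cannot rescue an F1-score when recall is near 0.04.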


### 7. Predict for a new word:

Once you're satisfied with the model's performance, you can use it to predict the probability of a new word being a drug name. Encode the new word using the same vectorizer and make a prediction using the trained model.

In [52]:
new_word = "metylphenidate"  # Example new word
new_word_encoded = vectorizer.transform([new_word])
prediction = model.predict_proba(new_word_encoded)[0][1]
print("Probability of", new_word, "being a drug name:", prediction)


Probability of metylphenidate being a drug name: 0.4958514267855859
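
A probability close to 0.5 can occur when none of the input's features exist in the fitted vocabulary: the encoded vector is then all zeros and the prediction depends only on the model's intercept. Whether that is what happened above depends on the training data, which is not shown here. The sketch below reproduces the effect on a made-up toy dataset (the names and labels are hypothetical) using the default word analyzer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 1 = drug name, 0 = normal word
toy_names = ["aspirin", "ibuprofen", "table", "window", "river", "garden"]
toy_labels = [1, 1, 0, 0, 0, 0]

toy_vectorizer = CountVectorizer()  # default analyzer="word"
X_toy = toy_vectorizer.fit_transform(toy_names)
toy_model = LogisticRegression().fit(X_toy, toy_labels)

# An unseen word has no features in the vocabulary, so its vector is all zeros
unseen_encoded = toy_vectorizer.transform(["metylphenidate"])
p_unseen = toy_model.predict_proba(unseen_encoded)[0, 1]

# With an all-zero input, the probability is just the sigmoid of the intercept
p_intercept_only = 1.0 / (1.0 + np.exp(-toy_model.intercept_[0]))

print("non-zero features:", unseen_encoded.nnz)
print("predicted probability:", p_unseen)
print("sigmoid(intercept):   ", p_intercept_only)
```

Checking `vectorizer.transform([new_word]).nnz` before trusting a prediction is a cheap way to spot inputs the model knows nothing about.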


### Note:

This approach is meant to use character-level information rather than whole words, considering your requirement.
CountVectorizer builds a vocabulary of n-grams and represents each name as a vector of n-gram counts. It produces character n-grams only when analyzer="char" (or "char_wb") is set, since its default analyzer works on words.
Logistic regression classifies each word based on the learned relationship between character features and drug name labels.
The final step demonstrates how to use the trained model to predict the probability for a new unseen word.

This is a basic example. You can explore more advanced techniques like character-level recurrent neural networks (RNNs) for potentially better performance.
Consider handling misspellings or variations in drug names during data preparation or model training.
Remember that the model's performance depends on the quality and quantity of your training data.

### Potential problem = Imbalanced Data:

It's highly likely that your dataset has far more normal words than drug names. This imbalance biases the model toward predicting the majority class, which keeps precision high but leaves recall very low. Here's what you can do:

Oversampling: Increase the representation of the drug name class through techniques like SMOTE (Synthetic Minority Oversampling Technique).
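
As a minimal sketch of the oversampling idea, the example below uses plain random oversampling (duplicating minority rows with replacement) via sklearn.utils.resample. This is a simpler stand-in for SMOTE, which instead synthesizes new minority samples by interpolating between neighbors. The toy arrays are hypothetical, and in practice the resampling would be applied to the training split only.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical training data: 16 normal words (label 0), 4 drug names (label 1)
X_train_toy = np.arange(20).reshape(-1, 1)
y_train_toy = np.array([0] * 16 + [1] * 4)

is_minority = y_train_toy == 1
n_majority = int((~is_minority).sum())

# Draw minority rows with replacement until they match the majority count
X_min_up, y_min_up = resample(
    X_train_toy[is_minority],
    y_train_toy[is_minority],
    replace=True,
    n_samples=n_majority,
    random_state=42,
)

# Combine the untouched majority rows with the oversampled minority rows
X_balanced = np.vstack([X_train_toy[~is_minority], X_min_up])
y_balanced = np.concatenate([y_train_toy[~is_minority], y_min_up])

print("class counts before:", np.bincount(y_train_toy))
print("class counts after: ", np.bincount(y_balanced))
```

Unlike undersampling, this keeps every majority sample, at the cost of repeating minority rows, which can encourage overfitting to those few examples.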



