Intellectually curious Health Data Scientist passionate about the data revolution in the healthcare sector. I am particularly interested in creating innovative AI/ML solutions that deliver high value in the pharmaceutical and healthcare sectors. After 6+ years of consultancy experience delivering complex data products and assisting large client organisations, I want to deepen my Data Science skills and knowledge. My goal is to pivot my career towards working as a Data Scientist in the healthcare industry.
To learn more about me, please feel free to contact me or visit my LinkedIn.
As part of my Health Data Science MSc dissertation at UCL, I built a Knowledge Graph Retrieval-Augmented Generation (KG-RAG) system that leverages Large Language Models to efficiently interrogate and analyse a large collection of clinical trial protocols from ClinicalTrials.gov.
Key learnings:
- Deploying open-source Large Language Models (LLMs), such as Llama 3 or Mixtral 8x7B, on High-Performance Computing (HPC) infrastructure using vLLM.
- Processing semi-structured clinical trial protocols using a NoSQL database (MongoDB).
- Creating and hosting a Knowledge Graph using BioCypher and Neo4j AuraDB.
- Implementing a ReAct design using DSPy, with custom tools that an LLM can use to query Knowledge Graphs and SQL databases.
- Using high-level frameworks such as LlamaIndex and LangChain for text-to-SQL and text-to-Cypher.
- Evaluating Large Language Models.
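To give a flavour of the "custom tool" idea mentioned above, here is a minimal, library-agnostic sketch of a SQL tool that an agent could call. It is not the project's actual code and it does not use the DSPy API: the `trials` table, its columns, the demo rows and the function names `make_demo_db` and `sql_tool` are all made up for illustration, and SQLite from the Python standard library stands in for the real database.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical `trials` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trials (nct_id TEXT PRIMARY KEY, condition TEXT, phase TEXT)"
    )
    # Placeholder rows for the demo only; these are not real trials.
    conn.executemany(
        "INSERT INTO trials VALUES (?, ?, ?)",
        [
            ("NCT-DEMO-1", "Asthma", "Phase 2"),
            ("NCT-DEMO-2", "Asthma", "Phase 3"),
            ("NCT-DEMO-3", "Diabetes", "Phase 3"),
        ],
    )
    conn.commit()
    return conn


def sql_tool(conn: sqlite3.Connection, query: str, max_rows: int = 20) -> str:
    """Run a read-only query and return the result as plain text.

    Returning text (including error messages) lets the calling model
    read the outcome and decide on its next step.
    """
    # Simple guard: refuse anything that is not a SELECT statement.
    if not query.lstrip().lower().startswith("select"):
        return "Error: only SELECT statements are allowed."
    try:
        rows = conn.execute(query).fetchmany(max_rows)
    except sqlite3.Error as exc:
        # Surface the database error so the agent can correct its query.
        return f"Error: {exc}"
    return "\n".join(str(row) for row in rows) or "No rows returned."


conn = make_demo_db()
print(sql_tool(conn, "SELECT COUNT(*) FROM trials WHERE condition = 'Asthma'"))
```

In a ReAct-style loop, the model would write the query, receive the returned text as an observation, and then either answer or try another query. A Cypher tool for the Knowledge Graph could follow the same shape, with a graph database client in place of `sqlite3`.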
Do you want to know more about this project?
Below is a summary of a few projects showcasing my Data Science skills.
Skill \ Technology | UCI Heart Disease | Card Fraud | Disaster Tweets | Causal Impact |
---|---|---|---|---|
Business question | Diagnose which patients have heart disease | Detect likely fraudulent transactions | Identify disaster events mentioned in text/tweets | Quantify the effect of the COVID lockdown on stock prices |
Language | Python | Python | Python | Python / R |
ML type | Classifier | Classifier | NLP Classifier | Time Series Regression |
Data Engineering | PySpark | |||
Feature Engineering | Time Series Features | Word Embedding | ||
Over / Under sampling | SMOTE | |||
Traditional ML | scikit-learn | scikit-learn | Causal Impact | |
Gradient Boosting | XGBoost | CatBoost | ||
Deep Learning | LSTM, GRU, DistilBERT | |||
Hyperparameter tuning | Optuna | |||
Explainable ML | SHAP Values | |||
User Interface | Streamlit | |||
ML Ops | MLflow | MLflow |
I contributed with the NHS Pycom to the development of nhspy-plothedots, a package for Statistical Process Control analysis and plotting. My main contribution was writing unit test scripts. This gave me an opportunity to (a) learn more about the package so that I can contribute in other areas in the future and (b) practise software development skills (e.g. unit testing, raising pull requests) that I have used in my professional career but that may not otherwise show up in my Data Science portfolio.
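As a small illustration of what unit testing a Statistical Process Control calculation can look like, the sketch below computes the limits of an XmR (individuals) chart and checks them against hand-calculated values. It is a generic example and not the package's code or API: the function name `xmr_limits` and the test data are my own assumptions. The only fixed ingredient is the standard XmR rule that the limits sit at the mean plus or minus 2.66 times the average moving range.

```python
from statistics import mean


def xmr_limits(values):
    """Return (centre, lower, upper) control limits for an XmR chart."""
    if len(values) < 2:
        raise ValueError("need at least two points to compute a moving range")
    centre = mean(values)
    # Moving range: absolute difference between consecutive points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    spread = 2.66 * mean(moving_ranges)
    return centre, centre - spread, centre + spread


def test_xmr_limits_known_values():
    # Moving ranges are all 2, so the limits are 11 +/- 2.66 * 2.
    centre, lower, upper = xmr_limits([10, 12, 10, 12])
    assert centre == 11
    assert abs(lower - 5.68) < 1e-9
    assert abs(upper - 16.32) < 1e-9


def test_xmr_limits_rejects_single_point():
    try:
        xmr_limits([5])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a single data point")
```

Tests written in this style can be collected and run with pytest, and they double as documentation of the expected behaviour.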