In [12]:
import yake
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
description = '''
1.0 FTE Full time Day - 08 Hour R2437350 Hybrid 84866 IT RESEARCH Technology & Digital Solutions 1830 Embarcadero Road,PALO ALTO,California

If you're ready to be part of our legacy of hope and innovation, we encourage you to take the first step and explore our current job openings. Your best is waiting to be discovered...

Day - 08 Hour (United States of America)

This is a Stanford Health Care job.

A Brief Overview
The Senior Biomedical Informatics Data Scientist will partner with researchers and clinicians to enable effective and efficient use of data and resources available via Stanford's research clinical data repository (STARR, https://starr.stanford.edu/about-starr ) including the Electronic Health Records in the OMOP Common Data Model, radiology and cardiology imaging data and associated metadata, and new data types as they get integrated along with their databases and respective cohort query tools and interfaces e.g., OHDSI ATLAS. This individual will enable researchers to maximize their understanding, interpretation and use of these clinical and research tools for more informed and productive research, clinical trials, patient care and quality outcome projects.
Clean, extract, transform and analyze various kinds of clinical data to create analysis-ready datasets that follow the FAIR (Findable, Accessible, Interoperable and Re-usable) principles. Partner with researchers and clinicians to enable effective and efficient use of Stanford Clinical data and resources for the advancement of research and the educational mission.

Locations
Stanford Health Care

What you will do
• Work closely with the data science and engineering team on data quality analysis. Develop processes to measure and ensure quality, completeness, integrity and compliance of institutional research data assets, including data/metadata documentation and data model specifications.
• Work closely with the hospital clinical teams to understand the provenance of the clinical data as well as the data workflow that will enrich and augment the research ready datasets.
• Identify best practices in the technical community and help to shape and implement policies that enhance data quality, compliance and customer support. Activities will include attending conferences, taking trainings (e.g. Coursera), reading peer review publications, engaging in customer feedback etc.
• Develop ETL (extract, transform, load) specifications to go from raw data to research ready datasets. Work closely with the engineering team on production implementation of the ETL’s and methods. Run QA metrics as needed to generate dashboards
• Employ new and existing tools to interpret, analyze, and visualize relationships in data. Create databases, datasets and reports and perform statistical analyses appropriate to data. Use system reports and analyses to identify potentially problematic data, make corrections, and determine root cause for data problems from input errors or inadequate field edits, and suggest possible solutions.
• Analyze and incorporate external data sets that may augment the power of clinical data such as social determinants of health data, claims data, environmental data, death data etc.
• Represent Stanford through presentations at technical conferences, consortiums, participation in standard committees, working groups and other venues.
• Engage in other departmental activities to ensure an inclusive and transparent work culture such as continuous process improvement, agile software development, documentation, writing of manuscripts and white papers, creating training videos
• Provides guidance and training to less experienced data scientists; mentor students and interns

Education Qualifications
• Bachelor’s degree in a scientific field (Engineering, Math, Physics, Chem). Relevant experience would be considered in lieu of a degree.

Experience Qualifications
• 4+ years of related experience. Masters or PhD may count in lieu of experience.

Required Knowledge, Skills and Abilities
• Strong analytical skills.
• Experience with data manipulation and integration, databases, and statistics.
• Fluency with data science programming paradigms such as Jupyter notebooks, SQL,
• Python or R.
• Ability to understand scientific literature, experimental procedures and their limitations, and applications of this information in the research and clinical setting.
• Familiarity with one or more data types such as Electronic Health Records (EHR), radiology, omics, device data, wearable data, pathology etc.
• Strong written and oral communication skills as demonstrated by technical manuscripts, poster presentations, conference speaking, and other forums
• Effective Communicator with the ability to engage with all levels in the organization.
• Knowledge of the nuanced interaction of Clinical systems used between the Stanford Hospitals.
• Ability to work with little supervision.

Physical Demands and Work Conditions
Blood Borne Pathogens
• Category III - Tasks that involve NO exposure to blood, body fluids or tissues, and Category I tasks that are not a condition of employment

These principles apply to ALL employees:

SHC Commitment to Providing an Exceptional Patient & Family Experience

Stanford Health Care sets a high standard for delivering value and an exceptional experience for our patients and families. Candidates for employment and existing employees must adopt and execute C-I-CARE standards for all of patients, families and towards each other. C-I-CARE is the foundation of Stanford’s patient-experience and represents a framework for patient-centered interactions. Simply put, we do what it takes to enable and empower patients and families to focus on health, healing and recovery.

You will do this by executing against our three experience pillars, from the patient and family’s perspective:
• Know Me: Anticipate my needs and status to deliver effective care
• Show Me the Way: Guide and prompt my actions to arrive at better outcomes and better health
• Coordinate for Me: Own the complexity of my care through coordination

Equal Opportunity Employer Stanford Health Care (SHC) strongly values diversity and is committed to equal opportunity and non-discrimination in all of its policies and practices, including the area of employment. Accordingly, SHC does not discriminate against any person on the basis of race, color, sex, sexual orientation or gender identity and/or expression, religion, age, national or ethnic origin, political beliefs, marital status, medical condition, genetic information, veteran status, or disability, or the perception of any of the above. People of all genders, members of all racial and ethnic groups, people with disabilities, and veterans are encouraged to apply. Qualified applicants with criminal convictions will be considered after an individualized assessment of the conviction and the job requirements.

Base Pay Scale: Generally starting at $55.80 - $73.92 per hour

The salary of the finalist selected for this role will be set based on a variety of factors, including but not limited to, internal equity, experience, education, specialty and training. This pay scale is not a promise of a particular wage
'''

In [11]:
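# Default extractor settings; the output below has 20 phrases of up to three words
kw_extractor = yake.KeywordExtractor()
# Returns (keyword, score) tuples; in YAKE a lower score means a more relevant keyword
keywords = kw_extractor.extract_keywords(description)

keywords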

[('FTE Full time', 0.0011938157431390056),
 ('Stanford Health Care', 0.003725070642026324),
 ('Full time Day', 0.004031904052294564),
 ('Embarcadero Road,PALO ALTO,California', 0.004432293977570661),
 ('FTE Full', 0.004739663666773152),
 ('Data', 0.006667705063714647),
 ('clinical data', 0.011907023742827611),
 ('Stanford Health', 0.013685555838866333),
 ('Stanford Clinical data', 0.014291486907551389),
 ('Health Care', 0.015500289781716574),
 ('Embarcadero Road,PALO', 0.016909770188685166),
 ('Full time', 0.017176642801139046),
 ('Stanford research clinical', 0.019590222748421912),
 ('research clinical data', 0.019617193673679353),
 ('RESEARCH Technology', 0.020337019318759815),
 ('Experience Stanford Health', 0.022026940247440254),
 ('Stanford', 0.024417819750568497),
 ('Electronic Health Records', 0.024432216461451325),
 ('clinical', 0.02553686300854865),
 ('Health', 0.02624123061290959)]
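
The list above contains many overlapping phrases ('FTE Full time' and 'FTE Full', 'Stanford Health Care' and 'Stanford Health'). One possible way to thin it out is sketched below; it is not part of YAKE. The helper name `drop_nested_keywords` is hypothetical, and it assumes that a phrase contained in (or containing) a better-scored phrase adds little. The sample reuses a few of the tuples printed above.

```python
def _padded(phrase):
    # Pad with spaces so that matching happens on whole words only
    return f" {phrase.lower()} "


def drop_nested_keywords(scored_keywords):
    """Keep a phrase only if it does not overlap a better-scored kept phrase.

    Hypothetical post-processing helper. Assumes (keyword, score) tuples
    where a lower score means a more relevant keyword.
    """
    kept = []
    for phrase, score in sorted(scored_keywords, key=lambda kw: kw[1]):
        candidate = _padded(phrase)
        overlaps = any(
            candidate in _padded(other) or _padded(other) in candidate
            for other, _ in kept
        )
        if not overlaps:
            kept.append((phrase, score))
    return kept


sample = [
    ('FTE Full time', 0.0011938157431390056),
    ('Stanford Health Care', 0.003725070642026324),
    ('FTE Full', 0.004739663666773152),
    ('clinical data', 0.011907023742827611),
    ('Stanford Health', 0.013685555838866333),
    ('Health Care', 0.015500289781716574),
]
filtered = drop_nested_keywords(sample)
```

On this sample only 'FTE Full time', 'Stanford Health Care' and 'clinical data' survive. The filter is deliberately aggressive: a single-word keyword such as 'Data' would also be dropped once 'clinical data' is kept, which may or may not be what you want.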

In [14]:
# Create a TF-IDF vectorizer that drops English stop words
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the text. The corpus is a single document, so the IDF
# factor is the same for every word and the scores reduce to normalized term counts.
tfidf_matrix = vectorizer.fit_transform([description])

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Get the scores for each word
scores = tfidf_matrix.toarray().flatten()

# Create a dictionary of words and their scores
word_scores = dict(zip(feature_names, scores))

# Sort the words by score, highest first
sorted_words = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)

# Keep the 30 highest-scoring words
top_keywords = [word for word, score in sorted_words[:30]]
top_keywords

['data',
 'clinical',
 'experience',
 'stanford',
 'care',
 'health',
 'research',
 'work',
 'patient',
 'datasets',
 'effective',
 'enable',
 'including',
 'quality',
 'ready',
 'use',
 'ability',
 'analyze',
 'closely',
 'databases',
 'employment',
 'engineering',
 'families',
 'hour',
 'job',
 'patients',
 'researchers',
 'shc',
 'skills',
 'starr']
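
The ranking above is effectively a word-frequency ranking, which is why tied words appear in alphabetical order. A minimal pure-Python sketch of the arithmetic follows. It assumes scikit-learn's default settings as I understand them (smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalization); the function name `single_doc_tfidf` and the toy token list are made up for illustration.

```python
import math
from collections import Counter


def single_doc_tfidf(tokens):
    """TF-IDF scores for a corpus that consists of one document.

    Assumes smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by
    L2 normalization of the resulting vector.
    """
    n_docs = 1
    counts = Counter(tokens)
    # Every term occurs in the only document, so df = 1 for all terms
    # and the IDF is ln(2 / 2) + 1 = 1 across the board.
    idf = {term: math.log((1 + n_docs) / (1 + 1)) + 1 for term in counts}
    raw = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}, idf


toy_tokens = "clinical data research data quality data clinical".split()
toy_scores, toy_idf = single_doc_tfidf(toy_tokens)
toy_ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

Because every IDF value equals 1, `toy_ranked` starts with 'data' and then 'clinical', exactly the order of the raw counts. For the IDF factor to down-weight generic words such as 'experience' or 'work', the vectorizer would need to be fitted on a collection of several job postings rather than on this one description.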