Searching PubMed for papers using the most common topics found by topic modeling

In [19]:
import pandas as pd
import os

# Loading the CSV file with common strings
strings = pd.read_csv(os.path.join('..','results','common_strings.csv'))
strings

Unnamed: 0,String,Frequency
0,team,47
1,use,38
2,performance,33
3,base,30
4,data,29
...,...,...
68,146a 5p,5
69,team performance indicators,10
70,artificial neural network,5
71,support vector machine,5


Subset the dataset, retaining only the keywords of interest

In [20]:
# Keep only keywords longer than 3 characters (spaces included) that contain no digits
strings = strings[strings['String'].apply(lambda x: len(x) > 3 and not any(char.isdigit() for char in x))]

# Manually remove keywords we deem too generic

# List of strings to remove
strings_to_remove = ["base", "game", "time", "tree", "data", "model", "train", "study",
                     "method", "analysis", "indicator", "accuracy", "outcome",
                     "approach", "decision", "provide", "position", "different"]

# Drop the generic keywords (exact matches only; multi-word phrases such as "decision tree" are kept)
strings = strings[~strings['String'].isin(strings_to_remove)]
strings.head(10)

Unnamed: 0,String,Frequency
0,team,47
2,performance,33
6,network,24
7,match,24
8,football,22
9,neural,21
10,classification,20
14,injury,19
15,basketball,19
16,player,18


Add a column that categorizes each keyword (ML/statistics related, sports related, injury or performance related)

In [21]:
# Keywords for categorization
ml_statistics_keywords = ["network", "neural", "classification", "machine", "classify", "regression", 
                          "learning", "feature", "prediction", "classifier", "logistic", "decision tree", 
                          "vector machine", "neural network", "machine learning", "data mining", 
                          "artificial neural", "neural networks", "support vector", "classification accuracy", 
                          "artificial neural network", "support vector machine", "non linear"]
sports_keywords = ["team", "match", "football", "basketball", "player", "ball", "athlete", "play", "ball possession",
                   "professional", "olympic", "sport", "soccer", "team performance", "match outcome", "individual ball",
                   "olympic games", "australian football", "team performance indicators", "turnover"]
injury_performance_keywords = ["injury", "performance", "training", "ground reaction", "training load", "performance indicators"]

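# Categorize a string by substring match. Categories are checked in order
# (ML/Statistics, then Sports, then Injury/Performance), so the first match wins:
# "team performance" is labeled "Sports", not "Injury/Performance".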
def categorize_string(s):
    if any(keyword in s for keyword in ml_statistics_keywords):
        return "ML/Statistics"
    elif any(keyword in s for keyword in sports_keywords):
        return "Sports"
    elif any(keyword in s for keyword in injury_performance_keywords):
        return "Injury/Performance"
    else:
        return "Other"

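# Apply the function to create a new column; assign() returns a new DataFrame,
# which avoids a SettingWithCopyWarning on the filtered frame
strings = strings.assign(Category=strings['String'].apply(categorize_string))
strings.head(10)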

Unnamed: 0,String,Frequency,Category
0,team,47,Sports
2,performance,33,Injury/Performance
6,network,24,ML/Statistics
7,match,24,Sports
8,football,22,Sports
9,neural,21,ML/Statistics
10,classification,20,ML/Statistics
14,injury,19,Injury/Performance
15,basketball,19,Sports
16,player,18,Sports


Construct the search command
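
One possible way to build the search string is sketched below. It is an assumption rather than the notebook's actual method: keywords within a category are joined with OR, the categories are joined with AND, and each term is restricted to PubMed's `[Title/Abstract]` field. The function name `build_pubmed_query` and the small `terms_by_category` dictionary are hypothetical stand-ins for the categorized `strings` DataFrame.

```python
def build_pubmed_query(terms_by_category, categories, field="Title/Abstract"):
    """Join terms with OR inside each category, then join the categories with AND."""
    blocks = []
    for category in categories:
        terms = terms_by_category.get(category, [])
        if not terms:
            continue  # skip categories that have no keywords
        # Quote each term so multi-word keywords are searched as phrases
        block = " OR ".join(f'"{term}"[{field}]' for term in terms)
        blocks.append(f"({block})")
    return " AND ".join(blocks)


# Hypothetical stand-in for the categorized keywords (illustrative subset only)
terms_by_category = {
    "Sports": ["team", "football"],
    "ML/Statistics": ["neural network", "classification"],
    "Injury/Performance": ["injury"],
}

query = build_pubmed_query(terms_by_category, categories=["Sports", "ML/Statistics"])
print(query)
```

With the DataFrame from the previous cell, the dictionary could be derived with `strings.groupby('Category')['String'].apply(list).to_dict()`. Requiring a match in both the Sports and the ML/Statistics block is a design choice that narrows the results to papers at the intersection of the two topics.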

Search PubMed and show results
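
The search itself could be run against the NCBI E-utilities `esearch` endpoint, as in the minimal sketch below. It uses only the standard library and returns PubMed IDs. The helper names (`build_esearch_url`, `parse_pmids`, `search_pubmed`) and the `retmax` default are assumptions for illustration, and the network call is left commented out so the cell runs offline.

```python
import json
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, retmax=20):
    """Build the esearch request URL for a PubMed query, asking for JSON output."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_pmids(payload):
    """Extract the list of PubMed IDs from an esearch JSON response string."""
    return json.loads(payload)["esearchresult"]["idlist"]


def search_pubmed(query, retmax=20):
    """Send the query to PubMed and return the matching PubMed IDs (needs network access)."""
    with urllib.request.urlopen(build_esearch_url(query, retmax), timeout=30) as response:
        return parse_pmids(response.read().decode("utf-8"))


# Example call (commented out because it needs network access):
# pmids = search_pubmed('"football"[Title/Abstract] AND "neural network"[Title/Abstract]')
```

Titles and abstracts for the returned IDs would need a second request (for example to `esummary` or `efetch`), which is not shown here.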

Compare papers found with papers in the references list
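
A simple way to make this comparison is to match papers on normalized titles, as sketched below. This assumes both lists are available as plain title strings, which the notebook does not confirm. The function names and the placeholder titles are hypothetical.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace for tolerant matching."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def compare_titles(found_titles, reference_titles):
    """Split titles into those in both lists, only in the search results, or only in the references."""
    found = {normalize_title(t) for t in found_titles}
    reference = {normalize_title(t) for t in reference_titles}
    return {
        "in_both": sorted(found & reference),
        "only_found": sorted(found - reference),
        "only_reference": sorted(reference - found),
    }


# Placeholder titles, not real papers
found_titles = ["Paper A: An Example.", "Paper B"]
reference_titles = ["paper a - an example", "Paper C"]

comparison = compare_titles(found_titles, reference_titles)
print(comparison)
```

Matching on PubMed IDs or DOIs would be more reliable than matching on titles wherever the reference list provides them.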