## Frequency word analysis code
This code is used to analyze how frequently words appear in an inquiry in a database like PubMed, Web of Science, Embase, ScienceDirect etc. 

Frequency word analysis can help to identify hot topics, relevant keywords, synonyms and frequent ortographical mistakes. 

If used qualitatively, results can be used to broaden a search scope and start or refine a search equation. 

If used quantitatively, it can be used to analyze a big group of results or, if combined with a time-restricted search, it can help to identify tendecy changes within topics. 

### 1. Identify the working directory
The working directory is the place in the computer where your files are stored.
To find it, you can right-click on the file that you want to open and select "properties". Copy-paste the location of the file starting with "C:\\"
Do not include the name of the file, only the name of the directory.

In [3]:
%cd "C:\Users\Braus\OneDrive\Documentos\Escuela\Internship\Literature\SearchEq" 

C:\Users\Braus\OneDrive\Documentos\Escuela\Internship\Literature\SearchEq


### 2. Introduce information of your search inquiry
Copy the **exact name** of the **text** file (.txt) that includes the titles and abstracts of the search. Avoid using spaces and special characters in the names to prevent python malfunction. Using underscores ( _ ) is allowed. The text file can be easily created by copy-pasting the exported results of a search into a blocknote or into a word file and explicitly saving as .txt The order doesn't matter.

Fill manually the number of records retrieved in the search. 

If you want, you can tune the treshold of the search to exclude more or less words. By default, I arbitrarily chose to remove words that appear on less than 5% of the results.

In [4]:
#File to open:
file=(r"WoS2.txt") #put the name inside the "", keep the r before the "".

#Number of records retrieved 
N=341

#Treshold (between 0.00 and 0.99)
T=0.05

if N > 100:
    ninf=int(N*T)
else:
    ninf=2

### 3. This is the functioning part
The only thing you have to do here is delete or add the ```encoding="utf8"``` function. If your text file is manually made, you don't need it. If the text file is retrieved directly from PubMed, it has to be included after the ```"r",```. 

You will find the frequency list as an output at the end. 

In [5]:
txt=open(file,"r",encoding="utf8").read() #when the txt file is retrieved from pubmed, encoding utf8 has to be explicitly shown here, otherwise is ok
rmv = ",;:.\n!\"')(][" #removes any symbols by replacing them with spaces, don't include hyphens bc of IUPAC nomenclature
for c in rmv:
    txt = txt.replace(c,"") 
txt=txt.lower() #change all text into lowercase
words=txt.split(" ") #split words in each space

In [13]:
Dict = {} #create an empty dictionary to fill it with the words in the abstracts
for w in words: #fills the dictionary with the words in the abstracts and counts the frequency 
    if w in Dict:
        Dict[w] += 1
    else:
        Dict[w] = 1

exclude=[
    'the', 'of', 'and', 'in', 'to', 'a', 'with', 'for', 'or', 'been', 
    'use', 'which', 'c', 'during', 'into', 'samples', 'their', 'results', 
    'have', 'levels', 'concentrations', 'all', 'not', 'mu', 'products', 
    'were', 'was', 'is', 'its', 'as', 'are', 'that', 'it', 'from', 'such', 
    'study', 'on', 'this', 'using', 'may', 'be', 'by', 'used', 'we', 'an', 
    'these', 'at', 'on', 'data', 'analysis', 'based', 'methods', 'method', 
    'one', 'two', 'three', 'four', 'five', 'between', 'after', 'before', 
    'process', 'than', 'values', 'parameter', 'parameters', 'can', 
    'different', 'other', 'more', 'number', 'including', 'shown', 'example', 
    'time', 'each', 'several', 'about', 'according', 'although', 'thus', 
    'therefore', 'those', 'due', 'both', 'first', 'second', 'third', 'if', 
    'then', 'under', 'over', 'through', 'any', 'important', 'observed', 
    'indicate', 'observations', 'respectively', 'obtained', 'shown', 
    'relative', 'presence', 'absence', 'type', 'types', 'mean', 'average', 
    'conditions', 'condition', 'small', 'large', 'high', 'low', 'among', 
    'further', 'within', 'per', 'units', 'during', 'recent', 'new', 'most', 
    'amount', 'common', 'specific', 'similar', 'highly', 'strong', 'weak', 
    'higher', 'lower', 'standard', 'deviation', 'significant', 'significance'
    'studies','but','has','increased','also','showed','=','university','doi','indexed'
]
CleanDict={k:v for k,v in Dict.items() if k not in exclude} #removes the stopwords from the dictionary 

#sorts the dictionary from higher to lower values
CleanDict={key: val for key, val in sorted(CleanDict.items(), key = lambda ele: ele[1],reverse=True)}

for w in CleanDict:
    if CleanDict[w] > ninf: #Compares the frequency with the treshold
        print(f"{w}  {CleanDict[w]}") #This prints the frequency list

pvc  853
dehp  660
phthalate  389
plasticizer  358
plasticizers  270
chloride  250
polyvinyl  216
extraction  211
blood  206
leaching  190
plasticized  168
exposure  157
bags  150
concentration  135
migration  129
materials  126
acid  122
medical  119
stability  117
properties  115
thermal  111
dop  108
devices  106
investigated  103
compared  102
temperature  98
plastic  98
found  96
infusion  94
solution  90
phthalates  89
degradation  89
metabolites  88
mechanical  85
degrees  84
surface  82
chemical  81
tubing  80
range  79
x  77
effects  76
effect  75
additives  75
resistance  71
liquid  70
human  69
patients  69
when  69
studied  69
+/-  68
adsorption  68
solvent  67
storage  66
mehp  66
mg  66
potential  64
rate  64
films  64
solutions  63
however  63
dbp  63
containing  63
no  63
significantly  63
mass  62
membrane  62
polymer  61
material  61
urine  61
could  59
sets  58
dioctyl  58
content  57
determined  56
response  56
elsevier  55
leached  55
prepared  55
electrode  55
wel

### Export it as an Excel file
If you want to export the frequency list into excel, run the code below. The new file will appear in the same location as the original file. Python **will not tell you** if the documents are being overwritten so be careful to export it with a unique name. 

In [27]:
import pandas as pd
d = {key: val for key, val in CleanDict.items() if val > ninf}
df = pd.DataFrame(list(d.items()), columns=['Word', 'Frequency'])

df.to_excel("example1.xlsx", index=False) #write the name here
print("Exported")

exported
