<a href="https://colab.research.google.com/github/MananShukla7/SkimLit/blob/main/skimlit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Project : SkimLit📝🔥

`Short for Skimming Literature`

The purpose of this notebook is to build an NLP model that makes reading medical abstracts
a lot easier.

The dataset we are using is PubMed 200k RCT, and the paper we are replicating is:
https://arxiv.org/abs/1710.06071



##Confirming access to GPU

In [None]:
!nvidia-smi

Fri Jun  2 17:38:48 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
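
The same check can be made from Python, which is handy when a later cell should branch on whether a GPU is present. This is a minimal sketch that assumes the `nvidia-smi` executable is on the `PATH` whenever a driver is installed. `gpu_available` is a hypothetical helper name, not part of the original notebook.

```python
import shutil
import subprocess

def gpu_available():
  """Return True if `nvidia-smi -L` runs successfully and lists at least one GPU."""
  # shutil.which returns None when the executable is not on the PATH
  if shutil.which("nvidia-smi") is None:
    return False
  try:
    # `nvidia-smi -L` prints one line per GPU
    result = subprocess.run(["nvidia-smi", "-L"],
                            capture_output=True, text=True, timeout=30)
  except (OSError, subprocess.TimeoutExpired):
    return False
  return result.returncode == 0 and "GPU" in result.stdout

print("GPU available:", gpu_available())
```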

##Getting Data

Downloading the dataset from the paper's author: https://github.com/Franck-Dernoncourt/pubmed-rct

In [28]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct
!ls pubmed-rct

fatal: destination path 'pubmed-rct' already exists and is not an empty directory.
PubMed_200k_RCT
PubMed_200k_RCT_numbers_replaced_with_at_sign
PubMed_20k_RCT
PubMed_20k_RCT_numbers_replaced_with_at_sign
README.md
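
The `fatal: destination path 'pubmed-rct' already exists` message above simply means the cell was re-run after the repository had already been cloned. Below is a hedged sketch of a re-runnable version. `clone_if_missing` is a hypothetical helper that skips the clone when the target directory is already there, and it assumes `git` is installed.

```python
import os
import subprocess

def clone_if_missing(repo_url, dest_dir):
  """Clone repo_url into dest_dir unless dest_dir already exists.

  Returns True if a clone was performed, False if it was skipped.
  """
  if os.path.isdir(dest_dir):
    print(f"'{dest_dir}' already exists, skipping clone")
    return False
  # check=True raises CalledProcessError if git exits with an error
  subprocess.run(["git", "clone", repo_url, dest_dir], check=True)
  return True

# Usage in the notebook (not run here, since it needs network access):
# clone_if_missing("https://github.com/Franck-Dernoncourt/pubmed-rct", "pubmed-rct")
```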


In [29]:
#Check what files are in the PubMed 20k dataset
!ls /content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/

dev.txt  test.txt  train.txt


In [30]:
#Start by exploring and experimenting on the 20k dataset (the version with numbers replaced by "@")
data_dir="/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign"

In [31]:
#List the full path of every file in the data directory
import os
filenames=[os.path.join(data_dir,filename) for filename in os.listdir(data_dir)]
filenames

['/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt']
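
Note that `os.listdir` returns entries in arbitrary order (above, `train.txt` happens to sit at index 1), so indexing into `filenames` by position is fragile. A small sketch that maps each split name to its path instead. `split_paths` is a hypothetical helper, and the three file names are the ones listed above.

```python
import os

def split_paths(data_dir, splits=("train", "dev", "test")):
  """Map each split name to the path of its .txt file inside data_dir."""
  return {split: os.path.join(data_dir, split + ".txt") for split in splits}

paths = split_paths("/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign")
print(paths["train"])
```

Looking a file up by name (`paths["train"]`) keeps working even if the directory listing comes back in a different order on another machine.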

##Data Preprocessing

Now that we have some text data, we need to explore it thoroughly.

To do that, we first need to visualise it.

In [32]:
#Read the files with Python

def get_lines(filename):
  """
  Reads a text file and returns all of its lines as a list.

  Args:
  filename: a string containing the target filepath

  Returns:
  A list of strings, one string per line of the input text file
  """
  with open(filename,"r") as f:  #the context manager closes the file for us
    return f.readlines()



In [33]:
# Let's read in the training lines
train_dir=os.path.join(data_dir,"train.txt")  #build the path by name; os.listdir order is arbitrary
train_lines=get_lines(train_dir)
train_lines[:20]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 

In [34]:
len(train_lines)

210040
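
Not all of these 210040 lines are sentences. Each abstract starts with a `###<id>` line, is followed by tab-separated `LABEL<TAB>sentence` lines, and the dataset separates abstracts with a blank line. Below is a rough sketch for counting each kind of line, shown on a tiny hand-made sample rather than the real file. `count_line_kinds` is a hypothetical helper.

```python
from collections import Counter

def count_line_kinds(lines):
  """Count abstract-ID lines, blank separator lines, and sentence lines."""
  kinds = Counter()
  for line in lines:
    if line.startswith("###"):
      kinds["abstract_id"] += 1
    elif line.strip() == "":
      kinds["blank"] += 1
    else:
      kinds["sentence"] += 1
  return kinds

# Tiny sample in the same layout as train.txt (sentences copied from the preview above)
sample = [
  "###24293578\n",
  "METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n",
  "METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n",
  "\n",
]
kinds = count_line_kinds(sample)
print(kinds["abstract_id"], kinds["sentence"], kinds["blank"])  # 1 2 1
```

Calling `count_line_kinds(train_lines)` on the real data would show how many abstracts and labelled sentences make up the total.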

###Data Structuring

We will represent this data as a list of dictionaries, since that structure is easy to manipulate.

A sample structure would look like:

```
[{'line_number' : 0,
  'target' : 'BACKGROUND',
  'text' : 'Emotional eating is associated with overeating and the development of obesity .\n',
  'total_lines' : 11,
  ...
  }]
```
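
Each sentence line in the raw file has the form `LABEL<TAB>sentence`, so the `target` and `text` fields can be recovered by splitting once on the tab. A minimal sketch: `parse_sentence_line` is a hypothetical helper, and stripping the trailing newline is an assumption about how the text should be stored (the sample above keeps it).

```python
def parse_sentence_line(line):
  """Split a 'LABEL<TAB>sentence' line into (label, sentence)."""
  # maxsplit=1 keeps any further tabs inside the sentence text
  label, sentence = line.split("\t", 1)
  return label, sentence.strip()

label, sentence = parse_sentence_line(
  "METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n")
print(label)     # METHODS
print(sentence)  # Pain was assessed using the visual analog pain scale ( @-@ mm ) .
```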

In [63]:
def lst_to_dict(txtData):
  """
  Turns the raw lines of a PubMed RCT file into a list of dictionaries,
  one dictionary per sentence.

  line_number is the position of the sentence within its abstract (from 0) and
  total_lines is the number of sentences in that abstract. An abstract that is
  cut off before its closing blank line is dropped.
  """
  data=[]      #one dictionary per sentence, across all abstracts
  abstract=[]  #(target, text) pairs of the abstract currently being read
  for txt in txtData:
    if txt.startswith("###"):
      abstract=[]  #an ID line starts a new abstract
    elif txt.strip()=="":
      #a blank line ends the abstract, so its sentence count is now known
      for linecount,(target,text) in enumerate(abstract):
        data.append({"line_number":linecount,
                     "target":target,
                     "text":text,
                     "total_lines":len(abstract)})
      abstract=[]
    else:
      #a sentence line looks like "LABEL\tsentence text\n"
      target,text=txt.split("\t",1)
      abstract.append((target,text.strip()))
  return data

In [64]:
d=lst_to_dict(train_lines[:15])
d[:2]

[{'line_number': 0,
  'target': 'OBJECTIVE',
  'text': 'To investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .',
  'total_lines': 12},
 {'line_number': 1,
  'target': 'METHODS',
  'text': 'A total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .',
  'total_lines': 12}]
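
With the data in the list-of-dictionaries layout described under Data Structuring, simple questions become short expressions. As an illustration, the sketch below counts how many sentences carry each label. The three records are typed in by hand from the training preview above and carry only the keys used here. `label_counts` is a hypothetical helper.

```python
from collections import Counter

# Hand-typed records in the list-of-dictionaries layout (only the keys used here)
records = [
  {"line_number": 0, "target": "OBJECTIVE",
   "text": "To investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) ."},
  {"line_number": 2, "target": "METHODS",
   "text": "Outcome measures included pain reduction and improvement in function scores and systemic inflammation markers ."},
  {"line_number": 3, "target": "METHODS",
   "text": "Pain was assessed using the visual analog pain scale ( @-@ mm ) ."},
]

def label_counts(records):
  """Count how many records carry each target label."""
  return Counter(record["target"] for record in records)

counts = label_counts(records)
print(counts["OBJECTIVE"], counts["METHODS"])  # 1 2
```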

In [None]:
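#Scratch check: "in" is true if "###" occurs anywhere in the string
str1="###35279E"
x="###" in str1
x

In [None]:
#Scratch check: substring membership and str.find on a toy line
l1=["BAKCGROUND","obj\n","\n"]
str2="BAKCGROUND \n ribdiuabda oddnaoid.\n"
for i in l1:
  print(i)
  if i in str2:
    x=i  #ends up as the last element of l1 that occurs in str2
s=str2.find("KC")  #index of the first match, or -1 if there is none
x,s

In [None]:
#Slicing [0:2] returns the first two characters
str2[0:2]

In [None]:
#list.index returns the position of the first matching element
l1.index("\n")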
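
Two behaviours probed by the scratch cells above are easy to trip over when parsing the raw lines: `in` on strings matches anywhere in the string, and `list.index` always returns the position of the first matching element. A short illustration on made-up strings:

```python
line = "METHODS\tThe BACKGROUND noise was measured .\n"  # made-up sentence line

# Substring test: true even though the label of this line is METHODS
print("BACKGROUND" in line)           # True
# Prefix test: only true when the line really starts with the label
print(line.startswith("BACKGROUND"))  # False

lines = ["###1\n", "A\tfirst .\n", "\n", "###2\n", "B\tsecond .\n", "\n"]  # made-up lines
# list.index finds the first blank line, no matter which abstract we are in
print(lines.index("\n"))              # 2
# Passing a start position searches from there onwards instead
print(lines.index("\n", 3))           # 5
```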