#NLP Extraction of Clinical Trial/Research Document Attributes


---


**Technologies: TF-IDF sentence features with a Multinomial Naive Bayes classifier (Logistic Regression is imported but not yet used), spaCy tokenization and dependency parsing, and word2number to convert extracted sample sizes to numbers.**



This document walks step by step through training a sentence classifier to find the sentences that mention a given clinical research document attribute, then parsing each flagged sentence to extract the best-matching exact value. Please note this repo is part of a larger ETL process. It only extracts simple attributes about the study as a whole; it does not fully process the extended datasets within the study. A separate processing script would use the information extracted here for that further processing.

Example: finding all patients within the document and splitting the data by patient, in order to map conditions, symptoms, risk factors/circumstances, and treatments more accurately when the structured data is not available or does not show all the observations.


Current Attributes Supported:
1.   Sample Size
2.   Sample Methodology
3.   Method

Future Support:

1.   Topics/Drug/Condition Topic
2.   Recruiting Periods
3.   Researchers - in case they are not in the metadata for Author
4.   Countries Included

Items Already in Metadata:
1. Authors 
2. Title 
3. Citations 




#Initial Installations 

In [1]:
#Load everything 
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_md-0.2.4.tar.gz
!pip install word2number
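!pip install spacy
#Provides the en_core_web_sm package imported below (skip if it is already installed)
!python -m spacy download en_core_web_sm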
import os
import shutil
import re
import requests
import zipfile
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import seaborn as sns
import spacy
import en_core_web_sm



#Load the training data into the environment


In [0]:
#Create the data folder if it does not exist yet
os.makedirs("./data/", exist_ok=True)
url = 'https://github.com/Deamoner/annotated-clinical-research-study-attribtes-data/archive/master.zip'
r = requests.get(url, allow_redirects=True)
#Stop here if the download failed instead of writing an error page to disk
r.raise_for_status()
with open('./data/design-data.zip', 'wb') as f:
    f.write(r.content)
#Now unzip the file 
with zipfile.ZipFile('./data/design-data.zip', 'r') as zip_ref:
    zip_ref.extractall('./data/')
#Move the data manually to the data folder - testing if you're paying attention
# attribute.csv is the only file currently being processed. 
# Note: press refresh on the files tab to see the new directories and files to move. 

#Load the data in a dataframe for access

In [3]:
df = pd.read_csv("./data/attribute.csv", encoding = "ISO-8859-1")
df.head()

| Unnamed: 0 | other | method | sample | text |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 'Epidemiology' includes studies on the epidemi... |
| 1 | 0 | 0 | 0 | (3) Approved the final version of the manuscri... |
| 2 | 0 | 0 | 0 | [17] Further researches are needed to investig... |
| 3 | 0 | 0 | 0 | [2] [3, 4] It disproportionately affects the e... |
| 4 | 0 | 0 | 0 | [30] It has been conjectured the loss of funct... |


#Train the classifier in a pipeline 

Split the data into training and test sets (67% train, 33% test). 

In [4]:
train, test = train_test_split(df, random_state=42, test_size=0.33, shuffle=True)
X_train = train.text
X_test = test.text
print(X_train.shape)
print(X_test.shape)

(2682,)
(1322,)


In [5]:
#Define a pipeline combining a text feature extractor (TF-IDF) with a multi-label classifier
#You will notice that I have chosen to only train one category at this point. 
#The for loop refits the same NB_pipeline for each category, so only the last category's model is kept.
#In future this will need to become a collection of models (one per category) or a single multi-label model. 
categories = ['sample']
NB_pipeline = Pipeline([
                #stop_words is passed as a list, which works across scikit-learn versions (a set may be rejected)
                ('tfidf', TfidfVectorizer(stop_words=list(stop_words))),
                ('clf', OneVsRestClassifier(MultinomialNB(
                    fit_prior=True, class_prior=None))),
            ])
for category in categories:
    print('... Processing {}'.format(category))
    # train the model on the training sentences and this category's labels
    NB_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = NB_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

... Processing sample
Test accuracy is 0.9046898638426626
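
The comments in the cell above note that training more than one category will need one model per category. A minimal sketch of that idea follows, assuming a dictionary keyed by category name. The toy sentences and labels and the helper name `train_per_category` are made up for illustration and are not part of this repo; the real data would come from attribute.csv.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for attribute.csv (invented sentences and labels, illustration only)
toy = pd.DataFrame({
    "text": [
        "A total of 130 patients were enrolled.",
        "We recruited 52 participants from two hospitals.",
        "Data were collected from electronic medical records.",
        "Methods: we searched three databases.",
        "The virus spreads through droplets.",
        "Further research is needed.",
    ],
    "sample": [1, 1, 0, 0, 0, 0],
    "method": [0, 0, 1, 1, 0, 0],
})

def train_per_category(frame, categories):
    """Fit one independent TF-IDF + Naive Bayes pipeline per label column."""
    models = {}
    for category in categories:
        pipeline = Pipeline([
            ("tfidf", TfidfVectorizer(stop_words="english")),
            ("clf", MultinomialNB()),
        ])
        pipeline.fit(frame["text"], frame[category])
        # Keep every fitted model instead of overwriting a single pipeline
        models[category] = pipeline
    return models

models = train_per_category(toy, ["sample", "method"])
# One prediction array per category, each aligned with the input sentences
predictions = {name: model.predict(toy["text"]) for name, model in models.items()}
```

Each category keeps its own fitted pipeline, so the models can be saved, loaded, and evaluated independently.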


#Model Saving and Loading
This section is important for production implementation scenarios. 

In [0]:
#Save the model 

import pickle
filename = "abc.sav"
#Use a with block so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(NB_pipeline, f)



In [0]:
#Load the model
with open(filename, 'rb') as f:
    load_model = pickle.load(f)

#Simple Inference Demo

In [0]:
#Inference Demo
#One Sentence Test of Prediction
sentence = "Methods: From January 20 to February 5, 2020, a total of 130 patients diagnosed with COVID-19 from seven hospitals in China were collected."
#Predict on the first 1000 sentences of the data (note this slice includes training rows)
testsentences = pd.Series(df.text[0:1000])
sseries = pd.Series(sentence)
#Run Predictions 
prediction_all = load_model.predict(testsentences)
predict_one = load_model.predict(sseries)

In [9]:
#Print all predictions to review them - good practice, as the reported accuracy alone does not show where the model is wrong
prediction_all


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [0]:
#Print one prediction
predict_one[0]

1

#Review



This is a pretty good starting point. On further review, I think the 90% figure is accuracy across all predictions, yet within the subset of sentences that are sample I think there are a number of false positives and an equal number of false negatives. 
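
To make the gap between overall accuracy and per-class errors concrete, here is a small sketch in plain Python. The labels are invented to mirror the situation described above (few positive sentences, equal false positives and false negatives); they are not results from this notebook, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 0 and guess == 1:
            fp += 1
        elif truth == 1 and guess == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy_precision_recall(y_true, y_pred):
    """Return (accuracy, precision, recall) for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Invented example: 20 sentences, only 2 of them are really "sample" sentences
y_true = [1, 1] + [0] * 18
# The model finds one of them, misses the other, and flags one wrong sentence
y_pred = [1, 0, 1] + [0] * 17

accuracy, precision, recall = accuracy_precision_recall(y_true, y_pred)
print(accuracy, precision, recall)  # 0.9 0.5 0.5
```

Accuracy stays at 90% because most sentences are negatives, while only half of the flagged sentences are correct and half of the real sample sentences are found. On the real test set, `confusion_matrix`, `precision_score`, and `recall_score` from `sklearn.metrics` report the same quantities.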

 
2.   Testing with NLP Tags from Spacy added to the vectors 
3. Further NLP Process to extract exact information - This is from the sample script for later. 




In [0]:
#Let's define some Vocab objects to be used 
#Credit to https://www.kaggle.com/savannareid for providing keywords and analysis.
class Vocab(object):
    """
    Defines vocabulary terms for studies.
    """

    # Study titles
    TITLE = [r"case (?:series|study)", r"cross[\-\s]?sectional", r"mathematical model(?:ing)?", "meta-analysis",
             r"non[\-\s]randomized", "prospective cohort", "randomized", "retrospective cohort", "systematic review",
             r"time[\-\s]?series"]

    # Study design vocabulary
    DESIGN = [r"(?:electronic )?health records", r"(?:electronic )?medical records", "adherence", "adjusted hazard ratio",
              "adjusted odds ratio", "ahr", "aic", "akaike information criterion", "allocation method", "allocation method double-blind",
              "aor", "area under the curve", "associated with", "associated with random sample", "association", "attack rate", "auc",
              "baseline", "blind", "bootstrap", "bootstrap auc", "case report", "case report clinical findings", "case series",
              "case study", r"case[\-\s]control", "censoring", "chart review", "cochrane review", "coefficient", "cohen's d", "cohen's kappa",
              "cohort", r"computer model(?:ing)?", "confounding", "consort", "control arm", r"correlation(?:s)?",
              "covariates", "cox proportional hazards", "cross-sectional survey", r"cross[\-\s]sectional", "d-pooled", "data abstraction forms",
              "data collection instrument", r"database(?:s)? search(?:ed)?", "databases searched", r"deep[\-\s]learning", "demographics", "diagnosis",
              "difference between means", "difference in means", "dosage", "double-blind", "duration", "editor", "ehr", "electronic health records",
              "electronic search", r"eligib(?:le|ility)", "eligibility criteria", r"enroll(?:ed|ment)?", "estimation", "etiology", "exclusion criteria",
              "exposure status", "follow-up", "followed", r"forecast(?:ing)?", "frequency", "gamma", "hazard ratio", "heterogeneity", "hr", "i2",
              "incidence", "inclusion criteria", "inter-rater reliability", "interrater reliability", "interventions", "kaplan-meier", "log odds",
              "logistic regression", "lognormal", "longitudinal", "loss to follow-up", r"machine[\-\s]learning", r"match(?:ed|ing)? case",
              r"match(?:ed|ing)? criteria", "matched", "matching", r"mathematical model(?:ing)?", "mean difference", "median time to event",
              "meta-analysis", "model fit", "model simulation", "monte carlo", "multivariate hazard ratio", "narrative review",
              "non-comparative study", r"non[\-\s]randomised", r"non[\-\s]randomized",
              "non-response bias", "number of controls per case", "odds", "odds ratio", "outcomes", "patients", "per capita", "placebo",
              "pooled adjusted odds ratio", "pooled aor", "pooled odds ratio", "pooled or", "pooled relative risk", "pooled risk ratio",
              "pooled rr", "potential confounders", "power", "prevalence", "prevalence survey", "prisma", "prospective cohort",
              r"prospective(?:ly)?", "protocol", "pseudo-randomised", "pseudo-randomized", "psychometric evaluation of instrument",
              "psychometric evaluaton of instrument", "publication bias", "quasi-randomised", "quasi-randomized", "questionnaire development",
              "r-squared", "randomisation", "randomisation consort", "randomised", "randomization method", "randomized", "randomized clinical trial",
              "randomized controlled trial", "rct", "receiver-operator curve", r"recruit(?:ed|ment)?", "registry", "registry data",
              "relative risk", "response rate", "retrospective", "retrospective chart review", "retrospective cohort", "right-censored",
              "risk factor analysis", "risk factors", "risk factors data collection instrument", "risk of bias", "risk ratio", "roc", "rr",
              "search criteria", "search strategy", "search string", r"simulat(?:e|ed|ion)", "statistical model", "stochastic model", "strength",
              "subjects", "surveillance", "survey instrument", "survival analysis", "symptoms", "syndromic surveillance", "synthetic",
              "synthetic data", r"synthetic data(?:set(?:s)?)?", "systematic review", "time-to-event analysis", r"time[\-\s]series",
              r"time[\-\s]varying", "tolerability", "treatment arm", "treatment effect", "truncated", "weibull"]

    # Sample vocabulary
    SAMPLE = ["articles", "cases", "children", "individuals", "men", "participants", "patients", "publications", "samples", "sequences",
              "studies", "trials", "total", "women"]

    # Sample methods vocabulary
    METHOD = ["analyse", "analyze", "ci", "clinical", "collect", "compare", "data", "database", "demographic", "enroll", "epidemiological",
              "evidence", "findings", "hospital", "include", "materials", r"method(?:s)?:?", "observe", "obtain", "perform", "publication",
              "publish", "recruit", r"result(?:s)?:?", "retrieve", "review", "search", "study", "studie"]

In [0]:
#Let's define some functions we will need
from word2number import w2n

#Load the spaCy model once here instead of on every call to extract()
nlp = en_core_web_sm.load()
#nlp = spacy.load("en_core_sci_md")

def extract(sentence, attribute):
  #Returns the first number attached to a sample keyword, or None
  #Note: attribute is not used yet; only Vocab.SAMPLE is searched
  tokens = nlp(sentence)
  return find(tokens, Vocab.SAMPLE)

def tonumber(token):
  #Converts number text ("1,099" or "five") to a digit string
  #Returns the text unchanged when word2number cannot parse it
  try:
      return "%d" % w2n.word_to_num(token.replace(",", ""))
  except Exception:
      return token

def find(tokens, keywords):
  #Collect the numeric children of every keyword token, dropping empty results
  matches = [match(token, keywords) for token in tokens]
  matches = [m for m in matches if m]

  return matches[0][0] if matches else None
  
def match(token, keywords):
  #For a keyword token, return its direct children that look like numbers
  if token.text.lower() in keywords:
    return [tonumber(c.text) for c in token.children if isnumber(c)]

  return None

def isnumber(token):
  # Returns true if following conditions are met:
  #  - Token POS is a number or it's all digits
  #  - Token DEP is in [amod, nummod]
  #  - None of the children are brackets (ignore citations [1], [2], etc)
  return (token.text.isdigit() or token.pos_ == "NUM") and token.dep_ in ["amod", "nummod"] and not any(c.text == "[" for c in token.children)

extracted = extract(sentence,"sample")

#Validation and Enhancement

In [12]:
#Find all the badly performing sentences for review 
count = 0
#Loop over all the predictions and review where the errors are
for i in range(len(prediction_all)):
  if prediction_all[i]:
    count = count + 1
    print(testsentences[i])
    extracted = extract(testsentences[i],"sample")
    print(extracted)
    print("-----------------------")
print("Total:")
print(count)

315  316  317  318  319  320  321  322  323  324  325  326  327  328  329  330  331  332  333  334  335  336  337  338  339  340  341  342  343  344  345  346  347  348  349  350  351  352  353  354  355  356  357  358  359  360  361  362  363  364  365  366   367  368  369  370  371  372  373  374  375  376  377  378  379  380  381  382  383  384  385  386  387  388  389  390  391  392  393  394  395  396  397  398  399  400  401  402  403  404  405  406  407  408  409  410  411  412  413  414  415  416  417  418
None
-----------------------
E093392, E093393, E093399, E09341, E093411, E093412, E093413, E093419, E09349, E093491, E093492, E093493, E093499, E09351, E093511, E093512,  E093513, E093519, E093521, E093522, E093523, E093529, E093531, E093532, E093533, E093539, E093541, E093542, E093543, E093549, E093551, E093552,  E093553, E093559, E09359, E093591, E093592, E093593, E093599, E0936, E0937X1, E0937X2, E0937X3, E0937X9, E0939, E0940, E0941, E0942, E0943,  E0944, E0949, E0951, E0

#Review all flagged sentences and their extractions to find badly returned Sample Sizes
This section will do a systematic review of the results, and note the corner cases to formulate a plan on how to improve accuracy. 

Notes:



# Errors in results:
Each sentence below is followed by the extraction number for sample size and the flag it was actually given in the training data. 


Methods The clinical records, laboratory findings and radiological assessments included chest X-ray or computed tomography were extracted from electronic medical records of 25 died patients with COVID-19 in Renmin Hospital of Wuhan University from Jan 14 to Feb 13, 2020.
None

Three-hundred and fifty-five COVID-19 patients with were recruited and clinical data were collected from electronic medical records.
100

Methods A total of 8 274 cases in Wuhan were enrolled in this cross-sectional study during January 20 to February 9, 2020, and were tested for 2019-nCoV using fluorescence quantitative PCR.
9

Results 112 COVID-19 patients were enrolled in our study.
None

In total, there were 13 infected evacuees including five asymptomatic individuals as of 16 February 2020.
5 - 

Huang and colleagues 1 only included 59 suspected cases with fever and dry cough, and 41 patients were con firmed to be infected with SARS-CoV-2.
59 - technically right?

Bad Labelling - this one shouldn't flag imo
All countries that had reported at least 15 days of at least 100 total confirmed cases, and that had available data on BCG policy and covariates (median age, gross domestic product per capita, population density, population size, net migration rate, and geographical region) were included (52 countries in total).
100

315  316  317  318  319  320  321  322  323  324  325  326  327  328  329  330  331  332  333  334  335  336  337  338  339  340  341  342  343  344  345  346  347  348  349  350  351  352  353  354  355  356  357  358  359  360  361  362  363  364  365  366   367  368  369  370  371  372  373  374  375  376  377  378  379  380  381  382  383  384  385  386  387  388  389  390  391  392  393  394  395  396  397  398  399  400  401  402  403  404  405  406  407  408  409  410  411  412  413  414  415  416  417  418
None - bad flag or bad tagging 
-----------------------
E093392, E093393, E093399, E09341, E093411, E093412, E093413, E093419, E09349, E093491, E093492, E093493, E093499, E09351, E093511, E093512,  E093513, E093519, E093521, E093522, E093523, E093529, E093531, E093532, E093533, E093539, E093541, E093542, E093543, E093549, E093551, E093552,  E093553, E093559, E09359, E093591, E093592, E093593, E093599, E0936, E0937X1, E0937X2, E0937X3, E0937X9, E0939, E0940, E0941, E0942, E0943,  E0944, E0949, E0951, E0952, E0959, E09610, E09618, E09620, E09621, E09622, E09628, E09630, E09638, E09641, E09649, E0965, E0969, E098, E1010,  E1011, E1021, E1022, E1029, E10311, E10319, E10321, E103211, E103212, E103213, E103219, E10329, E103291, E103292, E103293, E103299, E10331,  E103311, E103312, E103313, E103319, E10339, E103391, E103392, E103393, E103399, E10341, E103411, E103412, E103413, E103419, E10349, E103491,  E103492, E103493, E103499, E10351, E103511, E103512, E103513, E103519, E103521, E103522, E103523, E103529, E103531, E103532, E103533, E103539,  E103541, E103542, E103543, E103549, E103551, E103552, E103553, E103559, E10359, E103591, E103592, E103593, E103599, E1036, E1037X1, E1037X2,  E1037X3, E1037X9, E1039, E1040, E1041, E1042, E1043, E1044, E1049, E1051, E1052, E1059, E10610, E10618, E10620, E10621, E10622, E10628, E10630,  E10638, E10641, E10649, E1065, E1069, E108, E1100, E1101, E1110, E1111, E1121, E1122, E1129, E11311, E11319, E11321, E113211, E113212, E113213,  E113219, E11329, E113291, E113292, E113293, E113299, E11331, E113311, E113312, E113313, E113319, E11339, E113391, E113392, E113393, E113399,  E11341, E113411, E113412, E113413, E113419, E11349, E113491, E113492, E113493, E113499, E11351, E113511, E113512, E113513, E113519, E113521,  E113522, E113523, E113529, E113531, E113532, E113533, E113539, E113541, E113542, E113543, E113549, E113551, E113552, E113553, E113559, E11359,  E113591, E113592, E113593, E113599, E1136, E1137X1, E1137X2, E1137X3, E1137X9, E1139, E1140, E1141, E1142, E1143, E1144, E1149, E1151, E1152, 
 E1159, E11610, E11618, E11620, E11621, E11622, E11628, E11630, E11638, E11641, E11649, E1165, E1169, E118, E1300, E1301, E1310, E1311, E1321,  E1322, E1329, E13311, E13319, E13321, E133211, E133212, E133213, E133219, E13329, E133291, E133292, E133293, E133299, E13331, E133311, E133312,  E133313, E133319, E13339, E133391, E133392, E133393, E133399, E13341, E133411, E133412, E133413, E133419, E13349, E133491, E133492, E133493,  E133499, E13351, E133511, E133512, E133513, E133519, E133521, E133522, E133523, E133529, E133531, E133532, E133533, E133539, E133541, E133542,  E133543, E133549, E133551, E133552, E133553, E133559, E13359, E133591, E133592, E133593, E133599, E1336, E1337X1, E1337X2, E1337X3, E1337X9,  E1339, E1340, E1341, E1342, E1343, E1344, E1349, E1351, E1352, E1359, E13610, E13618, E13620, E13621, E13622, E13628, E13630, E13638, E13641,  E13649 , E1365, E1369, E138.', E093392, E093393, E093399, E09341, E093411, E093412, E093413, E093419, E09349, E093491, E093492, E093493, E093499, E09351, E093511, E093512,  E093513, E093519, E093521, E093522, E093523, E093529, E093531, E093532, E093533, E093539, E093541, E093542, E093543, E093549, E093551, E093552,  E093553, E093559, E09359, E093591, E093592, E093593, E093599, E0936, E0937X1, E0937X2, E0937X3, E0937X9, E0939, E0940, E0941, E0942, E0943,  E0944, E0949, E0951, E0952, E0959, E09610, E09618, E09620, E09621, E09622, E09628, E09630, E09638, E09641, E09649, E0965, E0969, E098, E1010,  E1011, E1021, E1022, E1029, E10311, E10319, E10321, E103211, E103212, E103213, E103219, E10329, E103291, E103292, E103293, E103299, E10331,  E103311, E103312, E103313, E103319, E10339, E103391, E103392, E103393, E103399, E10341, E103411, E103412, E103413, E103419, E10349, E103491,  E103492, E103493, E103499, E10351, E103511, E103512, E103513, E103519, E103521, E103522, E103523, E103529, E103531, E103532, E103533, E103539,  E103541, E103542, E103543, E103549, E103551, E103552, E103553, E103559, E10359, E103591, E103592, E103593, 
E103599, E1036, E1037X1, E1037X2,  E1037X3, E1037X9, E1039, E1040, E1041, E1042, E1043, E1044, E1049, E1051, E1052, E1059, E10610, E10618, E10620, E10621, E10622, E10628, E10630,  E10638, E10641, E10649, E1065, E1069, E108, E1100, E1101, E1110, E1111, E1121, E1122, E1129, E11311, E11319, E11321, E113211, E113212, E113213,  E113219, E11329, E113291, E113292, E113293, E113299, E11331, E113311, E113312, E113313, E113319, E11339, E113391, E113392, E113393, E113399,  E11341, E113411, E113412, E113413, E113419, E11349, E113491, E113492, E113493, E113499, E11351, E113511, E113512, E113513, E113519, E113521,  E113522, E113523, E113529, E113531, E113532, E113533, E113539, E113541, E113542, E113543, E113549, E113551, E113552, E113553, E113559, E11359,  E113591, E113592, E113593, E113599, E1136, E1137X1, E1137X2, E1137X3, E1137X9, E1139, E1140, E1141, E1142, E1143, E1144, E1149, E1151, E1152,  E1159, E11610, E11618, E11620, E11621, E11622, E11628, E11630, E11638, E11641, E11649, E1165, E1169, E118, E1300, E1301, E1310, E1311, E1321,  E1322, E1329, E13311, E13319, E13321, E133211, E133212, E133213, E133219, E13329, E133291, E133292, E133293, E133299, E13331, E133311, E133312,  E133313, E133319, E13339, E133391, E133392, E133393, E133399, E13341, E133411, E133412, E133413, E133419, E13349, E133491, E133492, E133493,  E133499, E13351, E133511, E133512, E133513, E133519, E133521, E133522, E133523, E133529, E133531, E133532, E133533, E133539, E133541, E133542,  E133543, E133549, E133551, E133552, E133553, E133559, E13359, E133591, E133592, E133593, E133599, E1336, E1337X1, E1337X2, E1337X3, E1337X9,  E1339, E1340, E1341, E1342, E1343, E1344, E1349, E1351, E1352, E1359, E13610, E13618, E13620, E13621, E13622, E13628, E13630, E13638, E13641,  E13649 , E1365, E1369, E138.
None - bad flagging or bad data tagging 
-----------------------
12e14 A similar finding was observed in two recent non-peer-reviewed studies: one study with 1099 patients from 552 hospitals in 31 provinces in China, in which the median age was 47.0 years, and 55.1% of the patients were between the ages of 15e49 years; and a second study that included 4021 confirmed cases in 30 provinces of China, in which the mean age was 49 years and 50.7% of patients were between the ages of 20e50 years.
2 - bad tagging if it is flagging. It's talking about another report. 
-----------------------
229 The nasal/throat/anal swab samples and mainly tissue compartments collected from 230 infected monkeys were tested for SARS-CoV-2 RNA by quantitative real-time reverse 231 author/funder.
None - this study was on monkeys, and "monkeys" is not in the vocabulary. 

#New Method for finding the sample size in the sentence 
Issues:  

*   Only works when single token - Will improve further NLP Parsing 
* Bad tagging or bad sentence inference classification - Need to improve quality and accuracy on sentence classification. 

What can be done about these issues in the future? 

New method for increasing accuracy across multi-word sample sizes 

Training accuracy improvements could be gained from:

*   An increase in good training data
*   Adding spaCy tags as a feature vector


In [0]:
#Plan: run the word2number (w2n) conversion on each token first. Then join all the sequential number tokens together. Then send the result to the matching and child check functions. 
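
The plan in the cell above can be sketched without spaCy on a plain list of token strings. This is only an illustration under stated assumptions: the function names are hypothetical, the small word table is a stand-in for word2number so the sketch has no dependencies, and the joining rules (a group of up to three digits followed by three-digit groups; number words linked by "and" or "-") are heuristics chosen for the failing sentences listed earlier, not the repo's method.

```python
import re

# Minimal stand-in for word2number (assumption: only these words are handled)
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: (i + 2) * 10 for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}
SCALES = {"hundred": 100, "thousand": 1000, "million": 1000000}

def is_number_word(token):
    t = token.lower()
    return t in UNITS or t in TENS or t in SCALES

def words_to_int(words):
    """Convert a run of number words, e.g. three hundred fifty five -> 355."""
    total, current = 0, 0
    for word in (w.lower() for w in words):
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        else:  # thousand or million closes the current group
            total += max(current, 1) * SCALES[word]
            current = 0
    return total + current

def merge_number_runs(tokens):
    """Collapse each run of adjacent number tokens into one digit string."""
    merged, i = [], 0
    while i < len(tokens):
        token = tokens[i]
        if token.isdigit():
            # "8", "274" -> "8274": only three-digit groups extend a short lead group
            digits, j = token, i + 1
            while len(token) <= 3 and j < len(tokens) and re.fullmatch(r"\d{3}", tokens[j]):
                digits += tokens[j]
                j += 1
            merged.append(digits)
            i = j
        elif is_number_word(token):
            words, j = [], i
            while j < len(tokens):
                if is_number_word(tokens[j]):
                    words.append(tokens[j])
                    j += 1
                elif (tokens[j].lower() in ("and", "-") and j + 1 < len(tokens)
                      and is_number_word(tokens[j + 1])):
                    j += 1  # skip connectors inside a number
                else:
                    break
            merged.append(str(words_to_int(words)))
            i = j
        else:
            merged.append(token)
            i += 1
    return merged

print(merge_number_runs("A total of 8 274 cases in Wuhan".split()))
print(merge_number_runs(["Three", "-", "hundred", "and", "fifty", "-", "five", "patients"]))
```

The merged token list could then be joined back into a sentence and passed to the existing matching and child check functions, so that a number such as "8 274" reaches the parser as a single token.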
