# NKIS policy report data cleaning

## report ~ 2018

In [54]:
import pandas as pd
import numpy as np

In [55]:
pone = pd.read_csv('../../Data/Experiment/Plosone/pone_final.csv',index_col='Unnamed: 0')

In [56]:
pone

Unnamed: 0,Year,Title,Abstract,Role,Affiliation index,Affiliation list,Discipline,Filename
0,2016,"1,25(OH)2D3 and VDR Signaling Pathways Regulat...",\n\nBackground\nThe objective of this study is...,"[[], [], [], [], [], [], [], []]","[['aff001'], ['aff001', 'cor001'], ['aff001'],...","['1 Department of Ophthalmology, The Affiliate...","['Medicine and health sciences', 'Anatomy', 'B...",/data02/plos/pone/journal.pone.0164717.xml
1,2018,"A non-invasive, quantitative study of broadban...","\nCurrently, non-invasive methods for studying...","[['Data curation', 'Formal analysis', 'Investi...","[['aff001', 'cor001'], ['aff001'], ['aff002'],...",['1 Department of Psychology and Center for Ne...,"['Engineering and technology', 'Signal process...",/data02/plos/pone/journal.pone.0193107.xml
2,2016,The Effects of Financial Education on Impulsiv...,"\nDelay discounting, as a behavioral measure o...","[[], [], [], []]","[['aff001', 'cor001'], ['aff001'], ['aff002'],...","['1 Department of Psychology, Utah State Unive...","['Social sciences', 'Psychology', 'Personality...",/data02/plos/pone/journal.pone.0159561.xml
3,2016,Root-Zone Warming Differently Benefits Mature ...,\nSub-optimal temperature extensively suppress...,"[[], [], [], []]","[['aff001'], ['aff001'], ['aff001'], ['cor001'...",['Beijing Key Laboratory of Growth and Develop...,"['Medicine and health sciences', 'Vascular med...",/data02/plos/pone/journal.pone.0155298.xml
4,2017,A hypomorphic PIGA gene mutation causes severe...,\nMutations in genes involved in glycosylphosp...,"[[], [], [], [], [], [], [], [], [], [], []]","[['aff001'], ['aff002'], ['aff001'], ['aff001'...","['1 Division of Hematology, Department of Medi...","['Biology and life sciences', 'Cell biology', ...",/data02/plos/pone/journal.pone.0174074.xml
...,...,...,...,...,...,...,...,...
70382,2018,Measurement agreement of the self-administered...,\nBefore organizing mixed-mode data collection...,"[['Formal analysis', 'Investigation', 'Methodo...","[['aff001', 'aff002', 'cor001'], ['aff001'], [...","['1 Department Epidemiology and public health,...","['Medicine and health sciences', 'Mental healt...",/data02/plos/pone/journal.pone.0197434.xml
70383,2019,ST analysis of the fetal electrocardiogram – C...,"\nIn their paper, Andriessen at al present a v...","[['Conceptualization', 'Formal analysis', 'Wri...","[['aff001'], ['aff002'], ['aff003', 'currentaf...","['1 Department of Pediatrics, Institute of Cli...","['Medicine and health sciences', 'Cardiology',...",/data02/plos/pone/journal.pone.0221210.xml
70384,2018,Commentary on “The number of undocumented immi...,\n“The number of undocumented immigrants in th...,"[['Writing – original draft'], ['Methodology',...","[['aff001'], ['aff001'], ['aff002'], ['aff001'...","['1 Migration Policy Institute, Washington, D....","['Social sciences', 'Human geography', 'Housing']",/data02/plos/pone/journal.pone.0204199.xml
70385,2019,Why -aVF can be used in STAN as a proxy for sc...,\nThe conclusion of our recent paper that perf...,"[['Conceptualization', 'Data curation', 'Forma...","[['aff001', 'aff002', 'cor001'], ['aff003', 'a...","['1 Department of Biomedical Engineering, Maas...","['Medicine and health sciences', 'Medical devi...",/data02/plos/pone/journal.pone.0221220.xml


1) Each abstract is stemmed to the root word (for example, “computer” to “comput”),
2) stop words (such as “and”, “the”) are removed. The first step in converting text to data is to represent words and documents in their simplest vector forms. 
For all algorithms besides Document Vectors, input into the algorithms involve the
construction of a document-term matrix from all patents; each row is indexed by the
document ID and each column represents a word in the vocabulary. A document row
vector represents the count of the number of times the term appears in the document.
For the terms, I drop all terms that appear in more than 10% of all patents, and those
that appear in fewer than 20.12 Of the resulting terms, I keep the most common 40,000,
in order to maintain a manageable matrix dimensionality. Once all 2,306,041 patents
have been transformed into a document-term matrix of dimension 2306041 ⇥ 40000, I
proceed to transforming patents into a smaller dimensional vector representation using
the methods described below. This procedure is commonly called the bag-of-words
representation of text data

## Data cleaning

### Remove missing values 

In [57]:
pone[pone['Title'].isna()]

Unnamed: 0,Year,Title,Abstract,Role,Affiliation index,Affiliation list,Discipline,Filename


There is no nan title data

In [58]:
pone[pone['Abstract'].isna()].head()

Unnamed: 0,Year,Title,Abstract,Role,Affiliation index,Affiliation list,Discipline,Filename


In [59]:
pone['Title'][3]

'Root-Zone Warming Differently Benefits Mature and Newly Unfolded Leaves of Cucumis sativus L. Seedlings under Sub-Optimal Temperature Stress'

### Remove '\n'

In [60]:
import re

In [61]:
pone['Abstract']= pone['Abstract'].apply(lambda x: x.replace('\n', ' '))
pone['Abstract'] = pone['Abstract'].apply(lambda x: x.rstrip())
pone['Abstract'] = pone['Abstract'].apply(lambda x: x.lstrip())

### Remove '-'

In [11]:
pone['Abstract']= pone['Abstract'].apply(lambda x: x.replace('-', ' '))
pone['Abstract']= pone['Abstract'].apply(lambda x: x.replace('/', ' '))
pone['Abstract']= pone['Abstract'].apply(lambda x: x.replace('—', ' '))

In [12]:
pone['Abstract']= pone['Abstract'].apply(lambda x: " ".join(x.split()))
pone['Title']= pone['Title'].apply(lambda x: " ".join(x.split()))

## Lowering Text (Case Fold)

In [13]:
pone['Abstract'] = pone['Abstract'].apply(lambda x: x.casefold())
pone['Title'] = pone['Title'].apply(lambda x: x.casefold())

## Remove Hyperlink

In [14]:
pone['Abstract'] = pone['Abstract'].apply(lambda x: re.sub(r"https?://\S+", "", x))
pone['Title'] = pone['Title'].apply(lambda x: re.sub(r"https?://\S+", "", x))

## Remove symbol and number

In [15]:
pone['Abstract'] = pone['Abstract'].apply(lambda x: re.sub(r"[^A-Za-z0-9\s]+", " ", x))
pone['Title'] = pone['Title'].apply(lambda x: re.sub(r"[^A-Za-z0-9\s]+", " ", x))
pone['Abstract'] = pone['Abstract'].apply(lambda x: re.sub(r"\b[0-9]+\b\s*",'', x))
pone['Title'] = pone['Title'].apply(lambda x: re.sub(r"\b[0-9]+\b\s*",'', x))

## Tokeninze

In [16]:
import nltk
from nltk.tokenize import word_tokenize

In [17]:
pone['Abstract'] = pone['Abstract'].apply(lambda x:  word_tokenize(x))
pone['Title'] = pone['Title'].apply(lambda x:  word_tokenize(x))

## Remove Stopwords

In [18]:
from nltk.corpus import stopwords

In [19]:
stop_words = set(stopwords.words('english'))

In [20]:
pone['Abstract'] = pone['Abstract'].apply(lambda x:  [i for i in x if i not in stop_words ])
pone['Title'] = pone['Title'].apply(lambda x:   [i for i in x if i not in stop_words])

## Remove Chemical Formula

In [21]:
pone['Abstract'] = pone['Abstract'].apply(lambda x:  [i for i in x if not any(map(str.isdigit, i))])
pone['Title'] = pone['Title'].apply(lambda x:   [i for i in x if not any(map(str.isdigit,i)) ])

In [22]:
pone['Abstract'][0]

['background',
 'objective',
 'study',
 'observe',
 'whether',
 'cyclosporine',
 'csa',
 'inhibits',
 'expression',
 'dectin',
 'human',
 'corneal',
 'epithelial',
 'cells',
 'infected',
 'aspergillus',
 'fumigatus',
 'fumigatus',
 'investigate',
 'molecular',
 'mechanisms',
 'inhibition',
 'methods',
 'immortalized',
 'human',
 'corneal',
 'epithelial',
 'cells',
 'hcecs',
 'pretreated',
 'oh',
 'vdr',
 'inhibitor',
 'h',
 'pretreated',
 'csa',
 'pretreatments',
 'hcecs',
 'stimulated',
 'fumigatus',
 'curdlan',
 'respectively',
 'expression',
 'dectin',
 'proinflammatory',
 'cytokines',
 'il',
 'tnf',
 'detected',
 'rt',
 'pcr',
 'western',
 'blot',
 'elisa',
 'results',
 'dectin',
 'mrna',
 'dectin',
 'protein',
 'expression',
 'increased',
 'hcecs',
 'stimulated',
 'fumigatus',
 'curdlan',
 'csa',
 'inhibited',
 'dectin',
 'expression',
 'mrna',
 'protein',
 'levels',
 'specifically',
 'dectin',
 'proinflammatory',
 'cytokine',
 'expression',
 'levels',
 'higher',
 'hcecs',
 'pretr

## Stemming

In [23]:
from nltk.stem import PorterStemmer

In [24]:
ps = PorterStemmer()

In [25]:
pone['Abstract'] = pone['Abstract'].apply(lambda x:  [ps.stem(i) for i in x])
pone['Title'] = pone['Title'].apply(lambda x:  [ps.stem(i) for i in x])

In [26]:
pone['Abstract'][0]

['background',
 'object',
 'studi',
 'observ',
 'whether',
 'cyclosporin',
 'csa',
 'inhibit',
 'express',
 'dectin',
 'human',
 'corneal',
 'epitheli',
 'cell',
 'infect',
 'aspergillu',
 'fumigatu',
 'fumigatu',
 'investig',
 'molecular',
 'mechan',
 'inhibit',
 'method',
 'immort',
 'human',
 'corneal',
 'epitheli',
 'cell',
 'hcec',
 'pretreat',
 'oh',
 'vdr',
 'inhibitor',
 'h',
 'pretreat',
 'csa',
 'pretreat',
 'hcec',
 'stimul',
 'fumigatu',
 'curdlan',
 'respect',
 'express',
 'dectin',
 'proinflammatori',
 'cytokin',
 'il',
 'tnf',
 'detect',
 'rt',
 'pcr',
 'western',
 'blot',
 'elisa',
 'result',
 'dectin',
 'mrna',
 'dectin',
 'protein',
 'express',
 'increas',
 'hcec',
 'stimul',
 'fumigatu',
 'curdlan',
 'csa',
 'inhibit',
 'dectin',
 'express',
 'mrna',
 'protein',
 'level',
 'specif',
 'dectin',
 'proinflammatori',
 'cytokin',
 'express',
 'level',
 'higher',
 'hcec',
 'pretreat',
 'vdr',
 'inhibitor',
 'csa',
 'compar',
 'pretreat',
 'csa',
 'alon',
 'dectin',
 'proin

# Remove one word

In [27]:
pone['Abstract'] = pone['Abstract'].apply(lambda x:  [i for i in x if len(i)>1])
pone['Title'] = pone['Title'].apply(lambda x:   [i for i in x if len(i)>1])

nothing

In [56]:
pone[['Year','Title','Abstract','Role','Affiliation index','Affiliation list','Discipline','Filename','Total']].to_csv('../../Data/Experiment/Plosone/pone_text_cleaning.csv')

In [4]:
pone = pd.read_csv('../../Data/Experiment/Plosone/pone_text_cleaning.csv',index_col='Unnamed: 0')

In [5]:
pone

Unnamed: 0,Year,Title,Abstract,Role,Affiliation index,Affiliation list,Discipline,Filename,Total
0,2016,"['oh', 'vdr', 'signal', 'pathway', 'regul', 'i...","['background', 'object', 'studi', 'observ', 'w...","[[], [], [], [], [], [], [], []]","[['aff001'], ['aff001', 'cor001'], ['aff001'],...","['1 Department of Ophthalmology, The Affiliate...","['Medicine and health sciences', 'Anatomy', 'B...",/data02/plos/pone/journal.pone.0164717.xml,"['background', 'object', 'studi', 'observ', 'w..."
1,2018,"['non', 'invas', 'quantit', 'studi', 'broadban...","['current', 'non', 'invas', 'method', 'studi',...","[['Data curation', 'Formal analysis', 'Investi...","[['aff001', 'cor001'], ['aff001'], ['aff002'],...",['1 Department of Psychology and Center for Ne...,"['Engineering and technology', 'Signal process...",/data02/plos/pone/journal.pone.0193107.xml,"['current', 'non', 'invas', 'method', 'studi',..."
2,2016,"['effect', 'financi', 'educ', 'impuls', 'decis...","['delay', 'discount', 'behavior', 'measur', 'i...","[[], [], [], []]","[['aff001', 'cor001'], ['aff001'], ['aff002'],...","['1 Department of Psychology, Utah State Unive...","['Social sciences', 'Psychology', 'Personality...",/data02/plos/pone/journal.pone.0159561.xml,"['delay', 'discount', 'behavior', 'measur', 'i..."
3,2016,"['root', 'zone', 'warm', 'differ', 'benefit', ...","['sub', 'optim', 'temperatur', 'extens', 'supp...","[[], [], [], []]","[['aff001'], ['aff001'], ['aff001'], ['cor001'...",['Beijing Key Laboratory of Growth and Develop...,"['Medicine and health sciences', 'Vascular med...",/data02/plos/pone/journal.pone.0155298.xml,"['sub', 'optim', 'temperatur', 'extens', 'supp..."
4,2017,"['hypomorph', 'piga', 'gene', 'mutat', 'caus',...","['mutat', 'gene', 'involv', 'glycosylphosphati...","[[], [], [], [], [], [], [], [], [], [], []]","[['aff001'], ['aff002'], ['aff001'], ['aff001'...","['1 Division of Hematology, Department of Medi...","['Biology and life sciences', 'Cell biology', ...",/data02/plos/pone/journal.pone.0174074.xml,"['mutat', 'gene', 'involv', 'glycosylphosphati..."
...,...,...,...,...,...,...,...,...,...
70382,2018,"['measur', 'agreement', 'self', 'administ', 'q...","['organ', 'mix', 'mode', 'data', 'collect', 's...","[['Formal analysis', 'Investigation', 'Methodo...","[['aff001', 'aff002', 'cor001'], ['aff001'], [...","['1 Department Epidemiology and public health,...","['Medicine and health sciences', 'Mental healt...",/data02/plos/pone/journal.pone.0197434.xml,"['organ', 'mix', 'mode', 'data', 'collect', 's..."
70383,2019,"['st', 'analysi', 'fetal', 'electrocardiogram'...","['paper', 'andriessen', 'al', 'present', 'vali...","[['Conceptualization', 'Formal analysis', 'Wri...","[['aff001'], ['aff002'], ['aff003', 'currentaf...","['1 Department of Pediatrics, Institute of Cli...","['Medicine and health sciences', 'Cardiology',...",/data02/plos/pone/journal.pone.0221210.xml,"['paper', 'andriessen', 'al', 'present', 'vali..."
70384,2018,"['commentari', 'number', 'undocu', 'immigr', '...","['number', 'undocu', 'immigr', 'unit', 'state'...","[['Writing – original draft'], ['Methodology',...","[['aff001'], ['aff001'], ['aff002'], ['aff001'...","['1 Migration Policy Institute, Washington, D....","['Social sciences', 'Human geography', 'Housing']",/data02/plos/pone/journal.pone.0204199.xml,"['number', 'undocu', 'immigr', 'unit', 'state'..."
70385,2019,"['avf', 'use', 'stan', 'proxi', 'scalp', 'elec...","['conclus', 'recent', 'paper', 'perform', 'sta...","[['Conceptualization', 'Data curation', 'Forma...","[['aff001', 'aff002', 'cor001'], ['aff003', 'a...","['1 Department of Biomedical Engineering, Maas...","['Medicine and health sciences', 'Medical devi...",/data02/plos/pone/journal.pone.0221220.xml,"['conclus', 'recent', 'paper', 'perform', 'sta..."


In [6]:
import ast

In [8]:
pone['Total'] = pone['Total'].apply(lambda x : ast.literal_eval(x))

ValueError: malformed node or string: ['background', 'object', 'studi', 'observ', 'whether', 'cyclosporin', 'csa', 'inhibit', 'express', 'dectin', 'human', 'corneal', 'epitheli', 'cell', 'infect', 'aspergillu', 'fumigatu', 'fumigatu', 'investig', 'molecular', 'mechan', 'inhibit', 'method', 'immort', 'human', 'corneal', 'epitheli', 'cell', 'hcec', 'pretreat', 'oh', 'vdr', 'inhibitor', 'pretreat', 'csa', 'pretreat', 'hcec', 'stimul', 'fumigatu', 'curdlan', 'respect', 'express', 'dectin', 'proinflammatori', 'cytokin', 'il', 'tnf', 'detect', 'rt', 'pcr', 'western', 'blot', 'elisa', 'result', 'dectin', 'mrna', 'dectin', 'protein', 'express', 'increas', 'hcec', 'stimul', 'fumigatu', 'curdlan', 'csa', 'inhibit', 'dectin', 'express', 'mrna', 'protein', 'level', 'specif', 'dectin', 'proinflammatori', 'cytokin', 'express', 'level', 'higher', 'hcec', 'pretreat', 'vdr', 'inhibitor', 'csa', 'compar', 'pretreat', 'csa', 'alon', 'dectin', 'proinflammatori', 'cytokin', 'level', 'lower', 'hcec', 'pretreat', 'oh', 'csa', 'compar', 'pretreat', 'csa', 'alon', 'conclus', 'data', 'provid', 'evid', 'csa', 'inhibit', 'express', 'dectin', 'proinflammatori', 'cytokin', 'dectin', 'hcec', 'stimul', 'fumigatu', 'curdlan', 'activ', 'form', 'vitamin', 'oh', 'vdr', 'signal', 'pathway', 'regul', 'inhibit', 'csa', 'inhibit', 'enhanc', 'oh', 'vdr', 'inhibitor', 'suppress', 'inhibit', 'oh', 'vdr', 'signal', 'pathway', 'regul', 'inhibit', 'dectin', 'caus', 'cyclosporin', 'respons', 'aspergillu', 'fumigatu', 'human', 'corneal', 'epitheli', 'cell']

In [9]:
pone['Total']

0        [background, object, studi, observ, whether, c...
1        [current, non, invas, method, studi, human, br...
2        [delay, discount, behavior, measur, impuls, ch...
3        [sub, optim, temperatur, extens, suppress, cro...
4        [mutat, gene, involv, glycosylphosphatidylinos...
                               ...                        
70382    [organ, mix, mode, data, collect, self, admini...
70383    [paper, andriessen, al, present, valid, fetal,...
70384    [number, undocu, immigr, unit, state, estim, b...
70385    [conclus, recent, paper, perform, stan, devic,...
70386    [introduct, intern, extern, qualiti, control, ...
Name: Total, Length: 70387, dtype: object

# Data Parsing

### TF-IDF

In [14]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
import matplotlib.pyplot as plt

In [10]:
pone['Total'] = pone['Abstract']+pone['Title']

### IDF

In [10]:
into_one_list = [" ".join(x) for x in pone['Total']]

In [15]:
into_one_list

["[ ' b a c k g r o u n d ' ,   ' o b j e c t ' ,   ' s t u d i ' ,   ' o b s e r v ' ,   ' w h e t h e r ' ,   ' c y c l o s p o r i n ' ,   ' c s a ' ,   ' i n h i b i t ' ,   ' e x p r e s s ' ,   ' d e c t i n ' ,   ' h u m a n ' ,   ' c o r n e a l ' ,   ' e p i t h e l i ' ,   ' c e l l ' ,   ' i n f e c t ' ,   ' a s p e r g i l l u ' ,   ' f u m i g a t u ' ,   ' f u m i g a t u ' ,   ' i n v e s t i g ' ,   ' m o l e c u l a r ' ,   ' m e c h a n ' ,   ' i n h i b i t ' ,   ' m e t h o d ' ,   ' i m m o r t ' ,   ' h u m a n ' ,   ' c o r n e a l ' ,   ' e p i t h e l i ' ,   ' c e l l ' ,   ' h c e c ' ,   ' p r e t r e a t ' ,   ' o h ' ,   ' v d r ' ,   ' i n h i b i t o r ' ,   ' p r e t r e a t ' ,   ' c s a ' ,   ' p r e t r e a t ' ,   ' h c e c ' ,   ' s t i m u l ' ,   ' f u m i g a t u ' ,   ' c u r d l a n ' ,   ' r e s p e c t ' ,   ' e x p r e s s ' ,   ' d e c t i n ' ,   ' p r o i n f l a m m a t o r i ' ,   ' c y t o k i n ' ,   ' i l ' ,   ' t n f ' ,   ' d e 

In [11]:
vocab_corpus = [item for sublist in pone['Total'] for item in sublist]

In [17]:
#instantiate CountVectorizer() 
cv1=CountVectorizer(tokenizer=lambda txt: txt.split(),lowercase=False) 

# this steps generates word counts for the words in your docs 
word_count_vector=cv1.fit_transform(into_one_list)

In [18]:
print(len(cv1.get_feature_names()))

107085


In [19]:
tfidf_transformer1=TfidfTransformer(smooth_idf=True,use_idf=True,) 
tfidf_transformer1.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [20]:
# print idf values 
df_idf = pd.DataFrame(tfidf_transformer1.idf_, index=cv1.get_feature_names(),columns=["idf_weights"]) 
 
# sort ascending 
df_idf.sort_values(by=['idf_weights'])

Unnamed: 0,idf_weights
studi,1.374626
use,1.425664
result,1.458781
method,1.909996
differ,1.982555
...,...
exocoetu,11.468631
odontocid,11.468631
exocoetida,11.468631
odessa,11.468631


### TF-IDF

In [21]:
vectorizer1 = TfidfVectorizer(tokenizer=lambda txt: txt.split())

In [22]:
X1 = vectorizer1.fit_transform(into_one_list)

In [23]:
tfidf_tokens1 = vectorizer1.get_feature_names()

In [26]:
tfidf_tokens1

['aa',
 'aaa',
 'aaaaa',
 'aaaatgaaaa',
 'aaac',
 'aaacaac',
 'aaad',
 'aaamm',
 'aaap',
 'aaav',
 'aab',
 'aabb',
 'aabbdd',
 'aabmr',
 'aabr',
 'aabt',
 'aac',
 'aaca',
 'aacc',
 'aaccdd',
 'aaccr',
 'aacgtt',
 'aachen',
 'aaci',
 'aact',
 'aactg',
 'aacvd',
 'aad',
 'aada',
 'aadb',
 'aadc',
 'aadd',
 'aadhar',
 'aadk',
 'aadl',
 'aadr',
 'aae',
 'aaep',
 'aaf',
 'aafco',
 'aafd',
 'aafmm',
 'aag',
 'aagaat',
 'aagc',
 'aagn',
 'aagr',
 'aagroel',
 'aagt',
 'aah',
 'aahf',
 'aai',
 'aaic',
 'aaih',
 'aak',
 'aal',
 'aalborg',
 'aalen',
 'aalg',
 'aalphi',
 'aam',
 'aama',
 'aamc',
 'aametw',
 'aan',
 'aanat',
 'aanav',
 'aancc',
 'aanlysi',
 'aannv',
 'aanp',
 'aant',
 'aao',
 'aaor',
 'aap',
 'aapa',
 'aapc',
 'aapex',
 'aapexpost',
 'aaph',
 'aapi',
 'aapm',
 'aapo',
 'aapoaii',
 'aaq',
 'aar',
 'aardvark',
 'aarf',
 'aarhu',
 'aari',
 'aarogyasri',
 'aarp',
 'aarss',
 'aart',
 'aasa',
 'aasdhppt',
 'aaser',
 'aasld',
 'aasm',
 'aass',
 'aasv',
 'aat',
 'aatd',
 'aatf',
 'aath',
 

In [28]:
#df_tfidfvect1 = pd.DataFrame(data = X1.toarray(),columns = tfidf_tokens1,index=pone.index)

In [31]:
# del df_tfidfvect1

In [24]:
array_tfidfvect1 = X1.toarray()

In [25]:
array_tfidfvect1[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [44]:
# min(df_tfidfvect1[df_tfidfvect1['fibrosi']>0]['fibrosi'])

0.017726132254446074

In [104]:
import matplotlib.pyplot as plt

In [46]:
# df_tfidfvect1[df_tfidfvect1['use']>0]

Unnamed: 0,aa,aaa,aaaaa,aaaatgaaaa,aaac,aaacaac,aaad,aaamm,aaap,aaav,...,zymograph,zymographi,zymomona,zymosan,zymoseptoria,zymowm,zymuphen,zyprexa,zyxin,zz
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70376,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
70383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
70384,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
70385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [72]:
pone[pone['Abstract'].apply(lambda x: True if 'aaamm' in x else False)]

Unnamed: 0,Year,Title,Abstract,Role,Affiliation index,Affiliation list,Discipline,Filename,Total
6126,2018,"[asymmetr, hybrid, gene, flow, eisenia, andrei...","[uniformli, pigment, eisenia, andrei, ea, stri...","[['Conceptualization', 'Formal analysis', 'Inv...","[['aff001', 'cor001'], ['aff001'], ['aff002'],...","['1 Department of Evolutionary Immunology, Ins...","['Biology and life sciences', 'Biochemistry', ...",/data02/plos/pone/journal.pone.0204469.xml,"[uniformli, pigment, eisenia, andrei, ea, stri..."


In [99]:
pone['Abstract'][33627]

['renal',
 'denerv',
 'rd',
 'report',
 'reduc',
 'suscept',
 'atrial',
 'fibril',
 'af',
 'underli',
 'mechan',
 'well',
 'understood',
 'studi',
 'perform',
 'investig',
 'effect',
 'rd',
 'induc',
 'af',
 'rabbit',
 'model',
 'atrial',
 'fibrosi',
 'explor',
 'potenti',
 'mechan',
 'thirti',
 'five',
 'rabbit',
 'randomli',
 'assign',
 'sham',
 'oper',
 'group',
 'abdomin',
 'aortic',
 'constrict',
 'aac',
 'group',
 'aac',
 'rd',
 'aac',
 'rd',
 'group',
 'incid',
 'af',
 'induc',
 'burst',
 'pace',
 'atrium',
 'determin',
 'blood',
 'collect',
 'measur',
 'level',
 'rennin',
 'angiotensin',
 'ii',
 'aldosteron',
 'atrial',
 'sampl',
 'preserv',
 'evalu',
 'protein',
 'gene',
 'express',
 'collagen',
 'connect',
 'tissu',
 'growth',
 'factor',
 'ctgf',
 'transform',
 'growth',
 'factor',
 'tgf',
 'data',
 'suggest',
 'cardiac',
 'structur',
 'remodel',
 'atrial',
 'fibrosi',
 'success',
 'induc',
 'aac',
 'compar',
 'aac',
 'group',
 'aac',
 'rd',
 'rabbit',
 'smaller',
 'ascend',


In [47]:
temp_doc = df_tfidfvect1.loc[0]
temp_doc[temp_doc<0.1].index

Index(['aa', 'aaa', 'aaaaa', 'aaaatgaaaa', 'aaac', 'aaacaac', 'aaad', 'aaamm',
       'aaap', 'aaav',
       ...
       'zymograph', 'zymographi', 'zymomona', 'zymosan', 'zymoseptoria',
       'zymowm', 'zymuphen', 'zyprexa', 'zyxin', 'zz'],
      dtype='object', length=107073)

### Checking words consist of only one character

### Eliminate words whose tf-idf lower than threshold = 0.025

In [28]:
from tqdm import tqdm

In [40]:
a_list = [1, 2, 3]
indices_to_access = [0, 2]

a_numpy_array = np.array(a_list)
accessed_array = a_numpy_array[indices_to_access]

In [41]:
tfidf_tokens1

['aa',
 'aaa',
 'aaaaa',
 'aaaatgaaaa',
 'aaac',
 'aaacaac',
 'aaad',
 'aaamm',
 'aaap',
 'aaav',
 'aab',
 'aabb',
 'aabbdd',
 'aabmr',
 'aabr',
 'aabt',
 'aac',
 'aaca',
 'aacc',
 'aaccdd',
 'aaccr',
 'aacgtt',
 'aachen',
 'aaci',
 'aact',
 'aactg',
 'aacvd',
 'aad',
 'aada',
 'aadb',
 'aadc',
 'aadd',
 'aadhar',
 'aadk',
 'aadl',
 'aadr',
 'aae',
 'aaep',
 'aaf',
 'aafco',
 'aafd',
 'aafmm',
 'aag',
 'aagaat',
 'aagc',
 'aagn',
 'aagr',
 'aagroel',
 'aagt',
 'aah',
 'aahf',
 'aai',
 'aaic',
 'aaih',
 'aak',
 'aal',
 'aalborg',
 'aalen',
 'aalg',
 'aalphi',
 'aam',
 'aama',
 'aamc',
 'aametw',
 'aan',
 'aanat',
 'aanav',
 'aancc',
 'aanlysi',
 'aannv',
 'aanp',
 'aant',
 'aao',
 'aaor',
 'aap',
 'aapa',
 'aapc',
 'aapex',
 'aapexpost',
 'aaph',
 'aapi',
 'aapm',
 'aapo',
 'aapoaii',
 'aaq',
 'aar',
 'aardvark',
 'aarf',
 'aarhu',
 'aari',
 'aarogyasri',
 'aarp',
 'aarss',
 'aart',
 'aasa',
 'aasdhppt',
 'aaser',
 'aasld',
 'aasm',
 'aass',
 'aasv',
 'aat',
 'aatd',
 'aatf',
 'aath',
 

In [50]:
# total_list = [np.array(x) for x in pone['Total']]
after_total = []

for k,i in tqdm(enumerate(pone.index)):
    temp_doc = array_tfidfvect1[i]
    del_word_list = set([x for x in np.array(tfidf_tokens1)[list(np.where(temp_doc<0.01))]])
    temp_abstract = [x for x in total_list[k] if x not in del_word_list]
    after_total.append(temp_abstract)
    #after_abstract.append(list(filter(lambda a: a not in del_word_list, report_2018['Abstract'][i])))

  
51952it [1:24:49, 10.17it/s]IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)

68394it [1:52:07,  9.98it/s]IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [45]:
1

1

In [51]:
pone['Total_tokenize'] = after_total

In [52]:
pone['Total_tokenize'][pone['Total_tokenize'].apply(lambda x : True if len(x)<10 else False)]

Series([], Name: Total_tokenize, dtype: object)

In [53]:
pone.to_csv('../../Data/Experiment/Plosone/pone_tfidf0_01.csv')

In [None]:
# total_list = [np.array(x) for x in pone['Total']]
after_total = []

for k,i in tqdm(enumerate(pone.index)):
    temp_doc = array_tfidfvect1[i]
    del_word_list = set([x for x in np.array(tfidf_tokens1)[list(np.where(temp_doc<0.05))]])
    temp_abstract = [x for x in total_list[k] if x not in del_word_list]
    after_total.append(temp_abstract)
    #after_abstract.append(list(filter(lambda a: a not in del_word_list, report_2018['Abstract'][i])))

In [75]:
report_2018 = report_2018[report_2018['Total'].apply(lambda x : True if len(x)>5 else False)]

In [76]:
report_1920 = report_1920[report_1920['Total'].apply(lambda x : True if len(x)>5 else False)]

In [97]:
report_2018

Unnamed: 0,Title,English title,Institute,Year,Type,표준분류(대),표준분류(중),Responsibility,연구책임자 소속기관,공동책임자1,공동책임자2,공동책임자3,Internal Author,External Author,Abstract,Keywords,Total
0,21세기 생물산업 창조를 향한 일본 정부의 기본전략과 프로젝트,,과학기술정책연구원,2000,기본연구보고서,과학기술,기술개발,과학기술정책연구원,과학기술정책연구원,,,,0,0,지금까지 경제구조의 변혁과 창조를 위한 행동계획 1997년 5월 16일 각료회의 결...,"바이오테크놀로지산업,21세기,생물산업,창조,일본 정부,기본전략,프로젝트","[지금, 경제, 구조, 변혁, 창조, 행동, 계획, 각료회, 결정, 라이프사이언스,..."
1,EU의 연구개발 정책동향,R&D Policy Trends in the EU,과학기술정책연구원,2000,기본연구보고서,과학기술,과학기술일반,정성철,과학기술정책연구원,,,,0,0,EU는 미국 일본과 함께 세계 과학기술 발전을 주도하고 있는 국가군으로써 OEC...,"EU,연구개발,정책동향","[미국, 일본, 세계, 과학, 기술, 발전, 주도, 국가군, 개발, 투자, 차지, ..."
2,PBS의 관련 개념과 적용조건,,과학기술정책연구원,2000,기본연구보고서,경제,예산,과학기술정책연구원,과학기술정책연구원,,,,0,0,1990년대 접어들면서 세계시장의 경쟁환경이 점점 심화됨에 따라 국가경쟁력을높이기 ...,"PBS,개념,적용조건","[세계, 시장, 경쟁, 환경, 심화, 국가, 경쟁력, 노력, 가속, 시작, 선진국,..."
3,R D 평가시스템의 이론적 체계 구축 및 적용방안에 관한 연구,A Framework for R&D Evaluation System and Its ...,과학기술정책연구원,2000,기본연구보고서,과학기술,기술개발,이정원,과학기술정책연구원,,,,0,0,본 연구는 연구개발활동뿐 아니라 연구개발 평가 자체의 효율성을 높이고 그 결과가 효...,"R&D 평가시스템,이론적 체계 구축,적용방안","[개발, 활동, 개발, 평가, 자체, 효율, 결과, 효과, 개발, 전략, 수립, 연..."
4,개방형 모듈형 기술패러다임에 대응한 기술혁신전략 리눅스를 중심으로,,과학기술정책연구원,2000,기본연구보고서,방송·통신·정보,정보,송위진,과학기술정책연구원,,,,0,0,인터넷의 발전과 리눅스의 확산에 따라 기술개발 환경에 개방화 openness ...,"개방형·모듈형,기술패러다임,기술혁신,리눅스","[인터넷, 발전, 리눅스, 확산, 기술, 개발, 환경, 개방, 모듈, 구조, 변화,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21579,생태공학적 기법을 활용한 지역단위 생태계 보호지역 확대방안,Prioritizing ecologically important areas for ...,한국환경정책평가연구원,2018,기본연구보고서,환경,환경일반,구경아,한국환경정책평가연구원,,,,"[오일찬, 박선욱, 이현우, 홍현정]",0,본 연구에서는 다양한 생태계의 보전 및 생물다양성 보전과 증진을 통한 생태계의 지속...,"보호지역, 국립공원, 생태공학, 지역 생태계, 토지이용변화, 제주도","[다양, 생태, 보전, 생물, 다양, 보전, 생태, 이용, 우리나라, 생태, 보호,..."
21580,가뭄지역 농촌용수 개발계획의 전략환경영향평가 개선방안 연구 지하수 지표수 복합이...,Improving effectiveness of strategic environme...,한국환경정책평가연구원,2018,기본연구보고서,환경,환경일반,김경호,한국환경정책평가연구원,,,,"[박종윤, 하지연, 안준영, 이후승]",0,본 연구는 가뭄지역의 농촌용수 개발계획을 친환경적이고 지속가능한 이수 방안으로 수립...,"전략환경영향평가, 농촌용수 개발계획, 지하수-지표수 복합이용, 저수지(소규모 댐),...","[가뭄, 지역, 농촌, 용수, 개발, 계획, 지속, 방안, 수립, 개발, 계획, 전..."
21581,원자력시설 해체 부지의 재사용을 위한 환경관리 전략 토양 및 지하수 분야를 중심으로,Environmental strategy for site reuse of decom...,한국환경정책평가연구원,2018,기본연구보고서,환경,환경일반,신경희,한국환경정책평가연구원,,,,"[이희선, 권진경, 조공장, 김경호, 양경]",0,고리 1호기의 영구 정지 및 해체 확정에 이어 월성 1호기는 조기 폐쇄 결정을 내린...,"원자력시설, 해체, 재사용, 환경관리, 토양·지하수","[호기, 영구, 정지, 해체, 국내, 원전, 예정, 원전, 해체, 해체, 개념, 정..."
21582,중소하천 물환경 개선을 위한 용배수로 관리 및 활용 방안,Management policy of irrigation and drainage c...,한국환경정책평가연구원,2018,기본연구보고서,환경,수질오염,김익재,한국환경정책평가연구원,,,,"[박종윤, 곽효은, 김교범]",0,그동안 우리나라 하천의 관리와 재정투자는 대하천 위주로 추진되어 왔다 이러한 대하...,"중소하천, 물환경, 용수로, 배수로, 농업용수, 비점오염수질부하, 물관리 일원화","[하천, 관리, 대하천, 위주, 대하천, 본류, 정책, 조류, 천, 유역, 수질, ..."


In [98]:
report_1920

Unnamed: 0,Title,Institute,Year,Abstract,Responsibility,Co-Responsibility,Internal Author,External Author,연구책임자 소속기관,Total
0,분기별 범죄동향 리포트 제16호 2020년 3분기,한국형사정책연구원,2021,1 본 통계는 전국 각급 수사기관 검찰 경찰 특별사법경찰 에서 범죄사건을 수사...,김민영,[최지선],['최금나'],[''],[한국형사정책연구원],"[통계, 전국, 각급, 수사, 기관, 검찰, 경찰, 사법, 경찰, 범죄, 사건, 수..."
1,인구변동과 지속 가능한 발전 저출산의 경제 사회 문화 정치적 맥락에 관한 종합적...,경제인문사회연구회,2021,인구변동이 전 세계를 근본적으로 변화시킬 수 있는 힘의 원동력으로 인식되지만 느리...,우해봉,0,[''],[''],[한국보건사회연구원],"[인구, 변동, 근본, 변화, 힘, 원동력, 인식, 진행, 특징, 중요, 간과, 최..."
2,감염병 시대 보균과 돌봄 인문사회학적 성찰과 방향,경제인문사회연구회,2021,본 연구는 한국사회 감염병에 대한 대응을 보균과 돌봄에 초점을 두어 성찰하며 보균...,최은경,0,[''],[''],[경북대학교],"[한국, 사회, 감염병, 대응, 보균, 돌, 성찰, 보균, 돌, 봄, 인문, 사회,..."
3,한반도 동북아 평화체제의 정착을 위한 시베리아 인문학의 학적 체계 구성,경제인문사회연구회,2021,본 총서는 2020년 경제인문사회연구회의 정책과제로 선정된 주제인 한반도 동북...,정세진,0,[''],[''],[한양대학교],"[총서, 문사회연구회, 정책, 주제, 한반도, 동북아, 평화, 체제, 정착, 시베리..."
4,팬데믹 시대의 민주주의와 공동체 한국모델의 모색,경제인문사회연구회,2021,본 연구는 팬데믹 시대가 국가를 비롯한 공동체를 커먼즈 commons 로서 다시...,황정아,0,[''],[''],[한림대학교],"[팬데믹, 시대, 국가, 공동체, 커, 로서, 사유, 정치, 우애, 집단, 주체, ..."
...,...,...,...,...,...,...,...,...,...,...
2615,글로벌 프론티어 과학기술혁신정책 조사 및 국내적용 방안 탐색,과학기술정책연구원,2019,목 차 요약 ...,조용래,0,"['장진규', '정일영', '하태정']",[''],[과학기술정책연구원],"[목, 차, 요약, 제, 서론, 절, 절, 주요, 내용, 추진, 방법, 주요, 내용..."
2616,한국 산업발전 비전 2030 제2권 산업편 제조업,산업연구원,2019,우리나라 주력 제조업은 산업의 특징 경쟁력 원천 성장 경로 등을 고려할 때 다음...,조영삼,0,"['김인철', '김주영', '박정수', '이준', '이용호', '조철', '산업 비...",[''],[산업연구원],"[우리나라, 주력, 제조업, 산업, 특징, 경쟁력, 원천, 성장, 경로, 고려, 때..."
2617,미국의 보호무역주의 강화가 우리 산업에 미치는 영향,산업연구원,2019,미국의 보호주의를 통한 통상압박은 철강 가전 태양광산업뿐만 아니라 앞으로 계속해...,김수동,0,"['강지현', '빙현지']","['설 윤', '김종탁']",[산업연구원],"[미국, 보호주의, 통상, 압박, 철강, 가전, 태양광, 산업, 앞, 계속, 업종,..."
2618,중소 지식서비스기업의 수출실태 분석 및 정책적 육성방안,산업연구원,2019,4차 산업혁명 기술을 활용한 산업경쟁력 제고 및 서비스 경제화와 같은 핵심 정책 이...,이영주,0,"['한창용', '김홍석']",[''],[산업연구원],"[산업, 혁명, 기술, 산업, 경쟁력, 제고, 서비스, 핵심, 정책, 이슈, 성공,..."


In [78]:
report_2018.to_csv('../../Data/Experiment/tokenized_report_2018.csv')
report_1920.to_csv('../../Data/Experiment/tokenized_report_1920.csv')

In [100]:
len(report_2018)+len(report_1920)#23111

23111

In [103]:
flat_list = [item for sublist in after_abstract for item in sublist]

In [104]:
all_vocab = list(set(flat_list))

In [105]:
[x for x in all_vocab if len(x)<2].index('쁨')

39

In [236]:
len(all_vocab)

81157