1. Objective.

**Identify the files that are related to the vaccination process.**

The process first searches the metadata file and then locates the JSON files of the articles that pass the filter.

In future updates we will use NLP tools to find which documents report positive results.

Thanks to what was explained in:

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv

2. Load Metadata

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

In [None]:
df = pd.read_csv('/kaggle/input/CORD-19-research-challenge/2020-03-13/all_sources_metadata_2020-03-13.csv')
df.shape

We check the first rows of the dataframe.

In [None]:
df.head()

We check the last rows of the dataframe.

In [None]:
df.tail()

To have a good initial filter, we normalize the title field (letters only, lower case) so that we can later filter on the subject of vaccines.

In [None]:
title = df.copy()
title = title.dropna(subset=['title'])
title['title'] = title['title'].str.replace('[^a-zA-Z]', ' ', regex=True)
title['title'] = title['title'].str.lower()
title.head()

In [None]:
title.tail()

Now we filter on the title field.

In [None]:
title['keyword_vaccine'] = title['title'].str.find('vaccine') 
title.head()

If the result is -1, the title does not contain the keyword.

In [None]:
included_vaccine = title.loc[title['keyword_vaccine'] != -1]
included_vaccine
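
As a small illustration of the -1 check above, the sketch below compares `str.find` with `str.contains`, which returns a boolean mask directly. The `toy` dataframe and its titles are made up for the example and are not part of the dataset.

```python
import pandas as pd

# Toy titles, already lower-cased as in the normalization step (illustrative data only)
toy = pd.DataFrame({
    "title": [
        "a dna vaccine against sars coronavirus",
        "clinical features of viral pneumonia",
        "evaluation of candidate vaccines in mice",
    ]
})

# str.find returns the position of the keyword, or -1 when it is absent
toy["keyword_vaccine"] = toy["title"].str.find("vaccine")

# str.contains gives True/False per row; regex=False means a plain substring match
mask = toy["title"].str.contains("vaccine", regex=False)

print(toy["keyword_vaccine"].tolist())
print(mask.tolist())
```

Both approaches select the same rows; the boolean mask can be used directly as `toy[mask]`.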

Now that we have the articles whose title contains the keyword, we will locate their files and load them for further analysis. For this we create a list of file names, built from the sha value that is part of the name of each JSON file.

In [None]:
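shaid = []
for index, row in included_vaccine.iterrows():
    # Skip articles without a sha, since they have no matching JSON file
    if pd.isna(row['sha']):
        continue
    file_name = str(row['sha']) + ".json"
    shaid.append(file_name)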
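
The step from sha values to file names can be sketched on plain Python values as follows. The helper `sha_to_filenames` is hypothetical, and the handling of several hashes in one cell separated by `"; "` is an assumption about the metadata, not something shown in this notebook.

```python
def sha_to_filenames(sha_values, separator="; "):
    """Turn raw sha cells into a set of '<sha>.json' file names (hypothetical helper)."""
    names = set()
    for value in sha_values:
        # Missing values appear as None or NaN; NaN is the only value not equal to itself
        if value is None or value != value:
            continue
        # Assumption: one cell may hold several hashes separated by "; "
        for sha in str(value).split(separator):
            sha = sha.strip()
            if sha:
                names.add(sha + ".json")
    return names

example = ["abc123", None, float("nan"), "def456; 0a1b2c"]
print(sorted(sha_to_filenames(example)))
```

Returning a set also removes duplicate names and keeps later membership tests cheap.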

Now we walk through the JSON files and compare their names with the list built from the metadata.

In [None]:
import json
import os

# A set makes the membership test fast
wanted = set(shaid)
datafiles = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        # Keep only the JSON files whose name was derived from the metadata
        if filename in wanted and filename.endswith(".json"):
            datafiles.append(os.path.join(dirname, filename))
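
The directory walk above can be tried on a throwaway folder. The sketch below is only an illustration: `find_matching_files`, the folder `subset_a` and the file names are invented for the example.

```python
import os
import tempfile

def find_matching_files(root, wanted_names):
    """Return the sorted paths under root whose file name is in wanted_names."""
    wanted = set(wanted_names)
    matches = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            if filename in wanted:
                matches.append(os.path.join(dirname, filename))
    return sorted(matches)

# Build a temporary directory tree with two fake JSON files
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "subset_a"))
    for name in ("abc123.json", "zzz999.json"):
        with open(os.path.join(root, "subset_a", name), "w") as f:
            f.write("{}")
    found = find_matching_files(root, ["abc123.json", "missing.json"])
    found_names = [os.path.basename(p) for p in found]

print(found_names)
```

Names that are requested but not present on disk are simply not returned, which is why the count of matched files can be lower than the count of filtered titles.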

We check how many files matched.

In [None]:
len(datafiles)

Now we go through the selected files and build a list with the body text of each article.

In [None]:
ArrBodyText = []
for file in datafiles:
    with open(file, 'r') as f:
        doc = json.load(f)
    paper_id = doc['paper_id']
    # Join the paragraphs with a space so that words at paragraph boundaries do not fuse
    bodytext = ' '.join(item['text'] for item in doc['body_text'])
    ArrBodyText.append({paper_id: bodytext})
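
To make the expected JSON layout concrete, here is a minimal sketch with a hand-written document. Only the `paper_id` and `body_text` keys used in the loop above are assumed; the content and the helper `extract_body_text` are invented for the example.

```python
import json

def extract_body_text(doc):
    """Return (paper_id, body) where body joins all paragraph texts with a space."""
    paragraphs = [item["text"] for item in doc.get("body_text", [])]
    return doc["paper_id"], " ".join(paragraphs)

# A made-up document with the two keys used in this notebook
raw = json.dumps({
    "paper_id": "abc123",
    "body_text": [
        {"text": "Vaccines were tested in mice."},
        {"text": "Antibody levels increased."},
    ],
})

paper_id, body = extract_body_text(json.loads(raw))
print(paper_id)
print(body)
```

Using `doc.get("body_text", [])` returns an empty body instead of raising an error when a file has no body section.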

We check that the correct information was loaded.

In [None]:
ArrBodyText[20]

Now that we have the filtered articles, we can start applying other text-analysis techniques.

4. Text Analysis

We will use the NLP library NLTK. First we split one item of the list into separate words.

In [None]:
text_split = str(ArrBodyText[20]).split()
len(text_split)

In [None]:
#Identify common words
freq = pd.Series(text_split).value_counts()[:20]
freq

In [None]:
#Identify uncommon words
freq1 = pd.Series(text_split).value_counts()[-20:]
freq1
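
The same counts can be obtained with the standard library. The following sketch uses `collections.Counter` on a made-up sentence, purely to show the idea behind the common and uncommon word lists.

```python
from collections import Counter

# Illustrative tokens, not taken from the dataset
tokens = "the vaccine induced antibodies and the vaccine was safe".split()

counts = Counter(tokens)

# Most frequent words first
top_two = counts.most_common(2)

# Words that appear only once
rare = [word for word, count in counts.items() if count == 1]

print(top_two)
print(rare)
```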

Text pre-processing can be divided into two broad categories: noise removal and normalization. Data components that are redundant to the core text analytics can be considered noise.
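
A minimal sketch of the two categories, using only the `re` module: stripping tags, digits and punctuation is noise removal, while lower-casing is normalization. The function name `normalize` and the sample sentence are assumptions made for the example.

```python
import re

def normalize(text):
    """Apply simple noise removal and normalization, then split into tokens."""
    # Noise removal: drop markup-like tags such as <b> or </b>
    text = re.sub(r"<[^>]+>", " ", text)
    # Noise removal: replace anything that is not a letter with a space
    text = re.sub(r"[^a-zA-Z]+", " ", text)
    # Normalization: lower-case, then tokenize on whitespace
    return text.lower().split()

sample = "The <b>SARS-CoV</b> vaccine, tested in 2004."
print(normalize(sample))
```

Stemming and lemmatization, shown next with NLTK, are further normalization steps that reduce words to a base form.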

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
stem = PorterStemmer()
word = "inversely"
print("stemming:",stem.stem(word))
print("lemmatization:", lem.lemmatize(word, "v"))

In [None]:
# Libraries for text preprocessing
import re
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
#nltk.download('wordnet') 
from nltk.stem.wordnet import WordNetLemmatizer

We create a list of stop words and add custom stop words to it.

In [None]:
stop_words = set(stopwords.words("english"))

new_words = ["using", "show", "result", "large", "also", "iv", "one", "two", "new", "previously", "shown","et",'al']
stop_words = stop_words.union(new_words)
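
The effect of extending the stop word set can be seen on a few tokens. In the sketch below, `demo_base` is a small stand-in for the NLTK English list, and all names and tokens are made up for illustration.

```python
# Small stand-in for the NLTK English stop word list (assumption for the example)
demo_base = {"the", "and", "of", "in", "by"}

# Custom additions, in the same spirit as new_words above
demo_custom = ["using", "et", "al"]

# union accepts any iterable and returns a new set
demo_stop_words = demo_base.union(demo_custom)

tokens = "results of the study by smith et al using mice".split()
kept = [t for t in tokens if t not in demo_stop_words]

print(kept)
```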

In [None]:
corpus = []
# Create the lemmatizer once instead of on every iteration
lem = WordNetLemmatizer()
for i in range(len(text_split)):
    # Remove punctuation
    text = re.sub('[^a-zA-Z]', ' ', text_split[i])

    # Convert to lowercase
    text = text.lower()

    # Remove tags
    text = re.sub("</?.*?>", " <> ", text)

    # Remove special characters and digits
    text = re.sub("(\\d|\\W)+", " ", text)

    # Convert from string to list
    text = text.split()

    # Lemmatization, skipping stop words
    text = [lem.lemmatize(word) for word in text if word not in stop_words]
    text = " ".join(text)
    corpus.append(text)

In [None]:
#View corpus item
corpus[222]

Data Exploration
We will now visualize the text corpus that we created after pre-processing to get insights on the most frequently used words.

In [None]:
#Word cloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
#matplotlib inline
wordcloud = WordCloud(
    background_color='white',
    stopwords=stop_words,
    max_words=100,
    max_font_size=90,
    random_state=62
).generate(' '.join(corpus))  # join the items instead of passing the list's string form
print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
# Save before showing, so the figure is written while it is still open
fig.savefig("word1.png", dpi=900)
plt.show()