# Read a JSON file, store it as a pandas DataFrame and extract metrics with NLTK

In this notebook we perform basic text analysis with the NLTK package: we find the frequency of a particular keyword, and its rank among all other words, in tweets scraped with tweepy.

We first load the libraries we need.

In [1]:
import json

import nltk
import pandas as pd

nltk.download('punkt')  # tokenizer models used by word_tokenize

[nltk_data] Downloading package punkt to /Users/jorge/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We create empty lists to store the results for each month.

In [2]:
therank = []
thefreq = []
themonth = []
thearticles = []

We open the tweepy output, which contains tweets collected over a particular time range.

In [3]:
articles_data_path = '../../DATA/tweepy_techcrunch_output.json'
with open(articles_data_path) as f:
    data = json.load(f)
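
Before relying on specific keys, it can help to confirm the shape of what `json.load` returned. The sketch below uses a small inline sample instead of the real file; the helper name `summarize_records` is hypothetical, and the sample records are abbreviated copies of the ones shown in this notebook.

```python
import json

# Inline sample standing in for the tweepy output file (assumption: the real
# file has the same layout, a JSON array of objects).
sample = '''
[{"index": 0, "date": "2021-04-16T14:49:08.000Z", "user": "TechCrunch",
  "retweets": 2, "likes": 1},
 {"index": 1, "date": "2021-04-16T14:34:03.000Z", "user": "TechCrunch",
  "retweets": 2, "likes": 8}]
'''


def summarize_records(records):
    """Return (number of records, sorted keys shared by every record)."""
    if not isinstance(records, list):
        raise TypeError("expected a JSON array of tweet objects")
    common = set(records[0]) if records else set()
    for rec in records[1:]:
        common &= set(rec)
    return len(records), sorted(common)


records = json.loads(sample)
count, keys = summarize_records(records)
print(count, keys)
```

A check like this catches a file that is a single object, or records with missing fields, before the column-building step fails with a less obvious `KeyError`.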

The data variable contains a list of dictionaries, one per tweet, with all the fields retrieved by tweepy

In [4]:
data

[{'index': 0,
  'date': '2021-04-16T14:49:08.000Z',
  'content': 'The IPO market is sending us mixed messages https://t.co/NAhG4SQwUT by @alex',
  'user': 'TechCrunch',
  'retweets': 2,
  'likes': 1,
  'datec': '2021-04-16T14:49:08.000Z'},
 {'index': 1,
  'date': '2021-04-16T14:34:03.000Z',
  'content': "GM's second $2.3B battery plant with LG Chem to open in late 2023 https://t.co/4uo9yxQOjv by @kirstenkorosec",
  'user': 'TechCrunch',
  'retweets': 2,
  'likes': 8,
  'datec': '2021-04-16T14:34:03.000Z'},
 {'index': 2,
  'date': '2021-04-16T14:11:04.000Z',
  'content': "What we all missed in UiPath's latest IPO filing https://t.co/AEsspGJFaC by @alex",
  'user': 'TechCrunch',
  'retweets': 2,
  'likes': 15,
  'datec': '2021-04-16T14:11:04.000Z'},
 {'index': 3,
  'date': '2021-04-16T14:10:11.000Z',
  'content': "Level raises $27M from Khosla, Lightspeed 'to rebuild insurance from the ground up' https://t.co/L1RsyUylrx by @bayareawriter",
  'user': 'TechCrunch',
  'retweets': 8,
  'like

We now create a pandas dataframe to store the fields of interest

In [5]:
content = pd.DataFrame()

We assign the date, the tweet text, and the retweet and like counts to dataframe columns, and derive the month from the date string

In [6]:
# One list comprehension per field; `tweet` avoids shadowing the `data` list
content['dates'] = [tweet['date'] for tweet in data]
content['body'] = [tweet['content'] for tweet in data]
content['retweets'] = [tweet['retweets'] for tweet in data]
content['likes'] = [tweet['likes'] for tweet in data]
# Characters 5-6 of a timestamp such as '2021-04-16T...' hold the month
content['month'] = content['dates'].apply(lambda x: int(x[5:7]))
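
Slicing characters 5 to 7 works as long as every timestamp has exactly this layout. A more defensive variant, sketched here with the standard library only, parses the timestamp and also keeps the year, which matters if the data ever spans more than twelve months. The helper name `year_month` is hypothetical, and the format string assumes the timestamps always look like the ones shown above.

```python
from datetime import datetime

# Assumed layout of the tweepy timestamps, e.g. '2021-04-16T14:49:08.000Z'
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def year_month(timestamp):
    """Parse a timestamp string and return (year, month)."""
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return parsed.year, parsed.month


dates = ["2021-04-16T14:49:08.000Z", "2021-01-21T16:56:18.000Z"]
months = [year_month(d) for d in dates]
print(months)
```

A malformed timestamp raises `ValueError` here instead of silently producing a wrong month.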

The data is now structured in a dataframe as follows

In [7]:
content

,dates,body,retweets,likes,month
0,2021-04-16T14:49:08.000Z,The IPO market is sending us mixed messages ht...,2,1,4
1,2021-04-16T14:34:03.000Z,GM's second $2.3B battery plant with LG Chem t...,2,8,4
2,2021-04-16T14:11:04.000Z,What we all missed in UiPath's latest IPO fili...,2,15,4
3,2021-04-16T14:10:11.000Z,"Level raises $27M from Khosla, Lightspeed 'to ...",8,9,4
4,2021-04-16T14:00:49.000Z,Do you need a SPAC therapist? https://t.co/F3n...,4,15,4
...,...,...,...,...,...
3245,2021-01-21T17:04:18.000Z,Why you should add TechCrunch Early Stage 2021...,6,23,1
3246,2021-01-21T17:02:05.000Z,These are the 20 companies presenting at Alche...,8,18,1
3247,2021-01-21T16:56:40.000Z,Walking Duck is a digital news startup trying ...,12,32,1
3248,2021-01-21T16:56:18.000Z,A.I.-powered transcription service https://t.c...,15,51,1


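We now iterate over the four months of data (January to April). For each month we convert the tweet text to lower case, tokenize it into words, and build a dataframe of word frequencies sorted from most to least common. The rank is the zero-based row position of the first word that contains the keyword, and the frequency is that word's count. We store the results for each month in the lists we defined previously.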

In [8]:
keyword = 'market'
for nn in range(1, 5):  # months 1 (January) to 4 (April)
    dslice = content.loc[content['month'] == nn]
    # Join all tweets of the month into one lower-case string and tokenize it
    a = dslice['body'].str.lower().str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(a)
    word_dist = nltk.FreqDist(words)

    # Rows are sorted by decreasing frequency, so the index is a zero-based rank
    rslt = pd.DataFrame(word_dist.most_common(), columns=['Word', 'Frequency'])
    # Substring match: 'markets' or 'marketplace' also count as hits
    thenumber = rslt[rslt['Word'].str.contains(keyword)]

    if len(thenumber) > 0:
        # Keep the most frequent matching word
        therank.append(int(thenumber.index[0]))
        thefreq.append(int(thenumber['Frequency'].iloc[0]))
    else:
        therank.append(0)
        thefreq.append(0)
    themonth.append(nn)
    # DataFrame.size is rows x columns (5 columns here), not the number of tweets
    thearticles.append(dslice.size)
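
The loop above matches any word that contains the keyword and reports a zero-based rank. As a point of comparison, here is a minimal sketch of an exact-match variant with a one-based rank. It deliberately swaps NLTK's tokenizer for a simple regular expression and `FreqDist` for `collections.Counter`, so its counts will not match the notebook's exactly; `keyword_rank` is a hypothetical helper, not part of NLTK.

```python
import re
from collections import Counter


def keyword_rank(texts, keyword):
    """Return (one-based rank, frequency) of `keyword`, or (None, 0) if absent."""
    tokens = []
    for text in texts:
        # Simplified tokenizer (assumption): runs of letters, digits, apostrophes
        tokens.extend(re.findall(r"[a-z0-9']+", text.lower()))
    ranked = Counter(tokens).most_common()  # sorted by decreasing count
    for position, (word, freq) in enumerate(ranked, start=1):
        if word == keyword:  # exact match, so 'markets' is not counted
            return position, freq
    return None, 0


tweets = [
    "The IPO market is sending us mixed messages",
    "What we all missed in the latest IPO filing",
    "The markets and the market for IPO shares",
]
result = keyword_rank(tweets, "market")
missing = keyword_rank(tweets, "spac")
print(result, missing)
```

Returning `None` for a missing keyword keeps "not found" distinct from a genuine top rank, which a placeholder of 0 cannot do when ranks are zero-based.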

We collect the results in a dataframe and display the text analysis metrics computed for each month

In [9]:
mypanda = pd.DataFrame({"month": themonth, "rank": therank, "freq": thefreq, "articles": thearticles})
mypanda

,month,rank,freq,articles
0,1,148,5,1790
1,2,150,13,4975
2,3,135,18,6495
3,4,80,14,2990
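
Because `DataFrame.size` counts cells (rows times columns) and `content` has five columns, the `articles` values above are five times the number of tweets in each month. The sketch below, which copies the table values by hand, recovers the tweet counts and expresses the keyword frequency per 1,000 tweets so that months of different size can be compared. The helper name `per_thousand` is hypothetical, and the column count of 5 is taken from the columns created earlier in this notebook.

```python
# Values copied from the table above: (month, freq, articles)
rows = [(1, 5, 1790), (2, 13, 4975), (3, 18, 6495), (4, 14, 2990)]
N_COLUMNS = 5  # dates, body, retweets, likes, month


def per_thousand(freq, cells, n_columns=N_COLUMNS):
    """Keyword occurrences per 1,000 tweets, given a cell count from .size."""
    tweets = cells // n_columns
    return 1000 * freq / tweets


tweet_counts = [cells // N_COLUMNS for _, _, cells in rows]
rates = {month: round(per_thousand(freq, cells), 1) for month, freq, cells in rows}
print(tweet_counts)
print(rates)
```

Using `len(dslice)` instead of `dslice.size` in the loop would store the tweet count directly and make this conversion unnecessary.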
