# CONDITIONAL FREQUENCY DISTRIBUTION


Let's look at how some specific words are used over time, using NLTK's conditional frequency distribution (`ConditionalFreqDist`). A conditional frequency distribution is a collection of frequency distributions, one for each "condition". Here each target word is a condition, and the years are the samples counted under it.
The code below checks whether each word starts with one of the "targets" (the words we want to track over time).
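
To make the idea concrete, here is a minimal plain-Python sketch of what a conditional frequency distribution stores, using only `collections`. The `pairs` list and the name `cfd_sketch` are invented for illustration; `nltk.ConditionalFreqDist` accepts the same kind of (condition, sample) pairs.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs: the condition is a target word,
# the sample is the year of the abstract in which it appeared.
pairs = [
    ("silk", "2012"),
    ("silk", "2013"),
    ("silk", "2013"),
    ("gold", "2012"),
]

# One frequency distribution (a Counter) per condition.
cfd_sketch = defaultdict(Counter)
for condition, sample in pairs:
    cfd_sketch[condition][sample] += 1

print(dict(cfd_sketch["silk"]))  # {'2012': 1, '2013': 2}
```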

In [1]:
#import libraries and load the data
import nltk
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from collections import Counter
import inflection as inf
from inflection import singularize
import re

#read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv(r'\Users\claud\OneDrive\Desktop\abstract_csv\ahmfinal.csv')

In [4]:
df['date'] = [x[2:6] for x in df['advanced']] #extract the four-digit year into a new 'date' column
df

Unnamed: 0,advanced,date
0,['2012 mar 22708076 fabrication of a hybrid mi...,2012
1,['2012 jul 23061030 dual imaging enabled cance...,2012
2,['2013 jan 23184367 eradicating antibiotic res...,2013
3,['2013 jan 23184402 sirna transfection with ca...,2013
4,['2013 jan 23184404 towards smart tattoos impl...,2013
...,...,...
2311,['2020 dec 13 33314663 an alternating irradiat...,2020
2312,['2020 dec 16 33326185 which is better for nan...,2020
2313,['2020 dec 18 33336546 nanocellulose reinforce...,2020
2314,['2020 sep 33448676 an ultrasound excitable ag...,2020
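
The slice `x[2:6]` works because every entry begins with `['` followed by a four-digit year. If some entries might not follow that layout, a regular expression is a slightly more defensive option. The sketch below is only one way to do it; `extract_year` is a hypothetical helper name, and the sample string is the visible (truncated) start of row 0.

```python
import re

def extract_year(entry):
    """Return the first standalone four-digit year (19xx or 20xx), or None."""
    match = re.search(r"\b(?:19|20)\d{2}\b", entry)
    return match.group(0) if match else None

sample = "['2012 mar 22708076 fabrication of a hybrid mi"
print(extract_year(sample))  # 2012
```

In the notebook this could be applied with `df['advanced'].apply(extract_year)`; rows without a recognisable year would then show up as `None` instead of an arbitrary slice of text.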


In [8]:
#now that we have the dates in our dataframe, we clean the data by deleting all digits and punctuation marks.
#words with fewer than 4 characters will also be removed
def preprocess(text):
    text = re.sub('[^a-zA-Z ]','' , text) #remove punctuation marks and digits
    text = re.sub(r'\b\w{1,3}\b','' , text) #remove all words with fewer than 4 letters
    return text
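
As a quick check of what the two substitutions do, the sketch below runs the same function on the visible (truncated) start of row 0. The removed words leave extra spaces behind, which `split()` absorbs.

```python
import re

def preprocess(text):
    text = re.sub('[^a-zA-Z ]', '', text)    # keep only letters and spaces
    text = re.sub(r'\b\w{1,3}\b', '', text)  # drop words of 1 to 3 characters
    return text

raw = "['2012 mar 22708076 fabrication of a hybrid mi"
cleaned = preprocess(raw)
print(cleaned.split())  # ['fabrication', 'hybrid']
```

Note that the three-letter month abbreviation (`mar`) is removed along with the digits, so only the words of the title and abstract survive.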

In [10]:
df['advanced'] = df['advanced'].apply(preprocess)

In [11]:
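#create the list of stopwords: NLTK's English list plus corpus-specific words
#note: entries with punctuation, a trailing space, or fewer than 4 letters can never match after preprocess()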
stop = stopwords.words('english')
newstopwords = ["report","summarize","review","demonstated","significantly","efficiently","appeared","loosening", "['","using","based",".","statement","significance","result","results","used","application ","release","effect","study","significant","showed","p","also","model","models"]
stop.extend(newstopwords)

In [12]:
df['advanced'] = df['advanced'].apply(lambda x: ' '.join([inf.singularize(word) for word in x.split() if word not in (stop)]))
df #check the clean text

Unnamed: 0,advanced,date
0,fabrication hybrid microfluidic system incorpo...,2012
1,dual imaging enabled cancer targeting nanopart...,2012
2,eradicating antibiotic resistant biofilm silve...,2013
3,sirna transfection calcium phosphate nanoparti...,2013
4,toward smart tattoo implantable biosensor cont...,2013
...,...,...
2311,alternating irradiation strategy driven combin...,2020
2312,better nanomedicine nanocatalyst single atom c...,2020
2313,nanocellulose reinforced hydroxyapatite nanobe...,2020
2314,ultrasound excitable aggregation induced emiss...,2020


In [14]:
#tokenize the data
tokens = pd.Series(' '.join(df.advanced).split())
tokens

0           fabrication
1                hybrid
2          microfluidic
3                system
4         incorporating
              ...      
247220       multimodal
247221          imaging
247222           guided
247223       synergetic
247224          therapy
Length: 247225, dtype: object

In [15]:
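from nltk import ConditionalFreqDist
#calculate the CFD: the condition is the target term, the samples are the years
#note: this pairs every row's date with every matching token in the whole corpus,
#so each count is (abstracts in that year) x (corpus-wide tokens starting with the term)
cfd = ConditionalFreqDist((term, date) for term in ['silk'] for date in df.date for w in tokens if w.startswith(term))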

In [16]:
#to see the results
cfd.tabulate()

      2012  2013  2014  2015  2016  2017  2018  2019  2020 
silk 18396 34602 44019 58254 65919 65043 76650 72270 72051 
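
The counts in this table are far larger than the number of times "silk" can plausibly occur. The generator loops over every row's date and, for each one, over every token in the whole corpus, so each cell is the number of abstracts in that year multiplied by the corpus-wide number of tokens starting with "silk". Consistent with that, every value in the table is a multiple of 219. A possible fix is to pair each abstract's own tokens with its own date. The sketch below shows that pairing on invented toy data; `term_year_pairs` is a hypothetical helper name, and a plain `Counter` stands in for the CFD.

```python
from collections import Counter

def term_year_pairs(dates, texts, terms):
    """Yield (term, date) once per token of a text that starts with a term."""
    for date, text in zip(dates, texts):
        for token in text.split():
            for term in terms:
                if token.startswith(term):
                    yield term, date

# Toy stand-ins for df['date'] and df['advanced'] (invented for illustration).
dates = ["2012", "2012", "2013"]
texts = ["silk fibroin scaffold", "gold nanoparticle", "silkworm silk hydrogel"]

pairs = list(term_year_pairs(dates, texts, ["silk"]))
print(Counter(pairs))
```

With the notebook's data, the same generator could be passed straight to NLTK, for example `ConditionalFreqDist(term_year_pairs(df['date'], df['advanced'], ['silk']))`, since `ConditionalFreqDist` accepts an iterable of (condition, sample) pairs.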


In [None]:
#to plot the results
cfd.plot()

In [None]:
#count the 15 most common words per year
wordcount = df.groupby("date")["advanced"].apply(lambda x: Counter(" ".join(x).split()).most_common(15))
wordcount

In [18]:
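#now we count the occurrences of a specific word in the whole corpus using re, without grouping by year
#note: ''.join glues the last word of each abstract to the first word of the next,
#so matches at abstract boundaries are missed; " ".join(df['advanced']) would avoid that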
lista = df['advanced'].tolist()
lista = [''.join(lista[:])]
string = " ".join(str(x) for x in lista)

In [19]:
count = sum(1 for match in re.finditer(r"\bsilk\b", string)) #count whole-word occurrences of a specific word
count #far lower than the CFD totals, which multiply the matches by the number of abstracts in each year

195
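
The two counting methods also differ in what they match: `\bsilk\b` counts only the whole word, while `startswith('silk')` also counts longer words that begin with it. A minimal sketch on an invented string shows the difference.

```python
import re

text = "silk fibroin silkworm silk hydrogel silky"  # invented example text

# Whole-word matches only: "silk" but not "silkworm" or "silky".
whole_word = len(re.findall(r"\bsilk\b", text))

# Prefix matches: every token that begins with "silk".
prefix = sum(1 for token in text.split() if token.startswith("silk"))

print(whole_word, prefix)  # 2 4
```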