<h1> 11. Other analyses </h1>

These are a set of smaller, supplementary analyses. They provide additional information but are not required for the co-occurrence network or BERTopic analyses.

<br>

Variables

- df = Dataframe from pre-processing 3 ( title | substance | classes | url | text )
- df2 = Dataframe with number of reports per substance
- df3 = Series with number of reports per class
- df4 = SUBTLEX-UK word frequencies (later restricted to the time seed words)

In [1]:
#imports
import pandas as pd
import pickle
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from collections import Counter, OrderedDict, defaultdict
from wordcloud import WordCloud, STOPWORDS

In [2]:
df = pd.read_pickle("processed_data_3.pkl")

#corpus list
corpus_list = []
for text in df.text:
    corpus_list += text

corpus_list = list(filter(lambda a: a not in ["PERSON", "ORG", "GPE", "LOC", "DATE", "PLACEHOLDER"], corpus_list))
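The flattening and filtering step above can be tried on toy data. This is a minimal sketch: the token lists are made up, and `PLACEHOLDER_TAGS` and `build_corpus` are hypothetical names that do not appear in the notebook. Storing the tags in a set keeps each membership test constant-time, which may matter on a large corpus.

```python
from itertools import chain

# Entity tags inserted during pre-processing (same values as in the cell above).
PLACEHOLDER_TAGS = {"PERSON", "ORG", "GPE", "LOC", "DATE", "PLACEHOLDER"}


def build_corpus(token_lists):
    """Flatten per-report token lists into one list and drop placeholder tags."""
    return [tok for tok in chain.from_iterable(token_lists)
            if tok not in PLACEHOLDER_TAGS]


# Toy stand-in for df.text (each report is a list of tokens).
toy_reports = [["time", "slow", "PERSON"], ["DATE", "fast", "trip"]]
toy_corpus = build_corpus(toy_reports)
print(toy_corpus)  # ['time', 'slow', 'fast', 'trip']
```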

<h4> Sample size (substances) </h4>

Find the number of reports for each substance.

In [3]:


#Dict with number of reports per substance
size_dict = dict(df.substance.value_counts())

#Create dataframe with substance, corresponding class and number of reports
#(one row per substance, keeping the class of its first occurrence)
df2 = df.drop_duplicates(subset="substance")[["substance", "classes"]].copy()
df2["sample size"] = df2["substance"].map(size_dict)

#save
df2.to_csv('substances_sample_sizes.csv')


<h4> Sample sizes (classes) </h4>

Find the number of reports for each class. Map as pie chart.

In [None]:
df3 = df2.groupby("classes")["sample size"].sum()

#pie chart
plt.pie(df3.values, labels=df3.index, autopct=lambda pct: f"{round(pct/100.*df3.sum()):,d} ({pct:.1f}%)")
plt.show()
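The `autopct` lambda in the pie chart converts each slice's percentage back into a report count. A minimal sketch of the same labelling logic as a standalone function, which makes it easy to check without drawing a figure; `make_autopct` is a hypothetical helper name and the total of 12,000 reports is invented for illustration.

```python
def make_autopct(total):
    """Return a label function for plt.pie(autopct=...) showing count and share."""
    def autopct(pct):
        count = round(pct / 100.0 * total)  # convert the percentage back to a count
        return f"{count:,d} ({pct:.1f}%)"
    return autopct


label = make_autopct(12000)  # hypothetical total number of reports
print(label(50.0))           # 6,000 (50.0%)
print(label(12.5))           # 1,500 (12.5%)
```

In the notebook this would be passed as `autopct=make_autopct(df3.sum())`.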

<h4> Seed time perception words </h4>

In [5]:
ntp_words = ['time', 'period', 'periods', 'duration', 'clock', 'temporal', 'spacetime', 'timespan', 'timespans', 'timeline', 'timelines', 'elapse', 'elapsed', 'length', 'timewise', 'velocity', 'pace', 'rate', 'tempo', 'pass', 'passing', 'passed']
ftp_words = ['quick','quicker', 'quickly', 'quickest', 'fast', 'faster', 'fastest', 'fastened', 'rapid','rapidly', 'short', 'shorter', 'shortly', 'shortest','speedy', 'speedy','speeded', 'speedier', 'hurry', 'hurried', 'swift', 'swifter', 'swiftly', 'haste', 'hasty', 'brisk', 'turbo', 'accelerate', 'acceleration', 'accelerated', 'accelerating']
stp_words = ['slow', 'slower', 'slowly', 'slows', 'slowed', 'slowest', 'slowing', 'slowdown', 'long', 'looong', 'longer', 'longer', 'longest', 'steady', 'deceleration', 'decelerate', 'decelerating', 'decelerated', 'dilatory', 'dilation', 'infinity', 'eternity', 'lengthy', 'prolonged', 'protracted', 'extended', 'unending', 'endless']
time_words = sorted(ntp_words + ftp_words + stp_words)


<h4> Balancing fast-slow seed words </h4>

- FTP and STP words were 'balanced' by considering their relative frequency in general English. SUBTLEX-UK is a 201.3-million-word corpus generated from subtitles of BBC broadcasts (available [here](http://crr.ugent.be/archives/1423)).

- The code below gets the frequencies of all FTP and STP words in SUBTLEX-UK. The FTP and STP word lists are then adjusted so that they have an equal cumulative frequency in the SUBTLEX-UK corpus.

- df4 compares the relative frequencies of specific time words in both corpora.


<h6> van Heuven, W.J.B., Mandera, P., Keuleers, E. and Brysbaert, M., 2014. Subtlex-UK: A New and Improved Word Frequency Database for British English. Quarterly Journal of Experimental Psychology [Online], 67(6), pp.1176–1190. Available from: https://doi.org/10.1080/17470218.2013.850521. </h6>
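The balancing idea described above amounts to comparing the summed reference frequency of the two word lists. A minimal sketch, assuming a plain dictionary as the reference table; the frequencies below are made-up numbers, not SUBTLEX-UK values, and `cumulative_freq` is a hypothetical helper.

```python
# Hypothetical reference frequencies (made-up numbers, not SUBTLEX-UK values).
reference_freq = {"fast": 900, "quick": 600, "rapid": 100,
                  "slow": 700, "long": 1500, "endless": 50}


def cumulative_freq(words, freq_table):
    """Sum the reference frequency of every word; unknown words count as 0."""
    return sum(freq_table.get(w, 0) for w in words)


toy_fast = ["fast", "quick", "rapid"]
toy_slow = ["slow", "long", "endless"]

fast_total = cumulative_freq(toy_fast, reference_freq)
slow_total = cumulative_freq(toy_slow, reference_freq)
imbalance = fast_total / slow_total  # 1.0 would mean perfectly balanced lists
print(fast_total, slow_total, round(imbalance, 3))
```

A ratio far from 1.0 suggests adding or removing words from one of the lists until the two totals are close.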

In [None]:
#SUBTLEX-UK corpus
df4 = pd.read_excel("SUBTLEX-UK.xlsx")

#renaming
df4.rename(columns = {'FreqCount':'freq', 'Spelling':"word"}, inplace = True)

#remove all columns but word and freq
df4.drop(columns=df4.columns.difference(['word', 'freq']), inplace=True)
#sort values
df4.sort_values('freq', ascending=False, inplace=True)
df4.reset_index(inplace=True, drop=True)

#remove words from SUBTLEX not in Erowid corpus
#df5 for later Zipf's law analysis 
df5 = df4[df4['word'].isin(set(corpus_list))]

#remove words from SUBTLEX not in time words
df4 = df4[df4['word'].isin(time_words)]


#frequency of ftp-stp words in SUBTLEX 
df_fast = df4[~df4['word'].isin(ntp_words + stp_words)]
print("SUBTLEX-UK FTP sum is")
print(df_fast["freq"].sum())

df_slow = df4[~df4['word'].isin(ntp_words + ftp_words)]
print("SUBTLEX-UK STP sum is")
print(df_slow["freq"].sum())


#relative word frequency (=Erowid/SUBTLEX) dataframe

#df4 is a filtered slice whose index is no longer 0..n-1, so copy it and
#fill the new columns by word instead of by enumerate() position
df4 = df4.copy()
erowid_counts = Counter(corpus_list)
df4["Erowid_freq"] = df4["word"].map(lambda word: erowid_counts[word])
df4["Relative_freq"] = df4["Erowid_freq"] / df4["freq"]

df4 = df4.sort_values(by='Relative_freq', ascending=False)
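The relative frequency computed above is the count of a word in the Erowid corpus divided by its count in the reference corpus. A minimal stdlib sketch of that ratio on toy data; the tokens and reference counts are invented, and `relative_frequencies` is a hypothetical helper. Counting the corpus once with `Counter` avoids rescanning the token list for every word.

```python
from collections import Counter


def relative_frequencies(corpus_tokens, reference_freq, words):
    """Return (word, corpus_count, corpus_count / reference_count), largest ratio first."""
    counts = Counter(corpus_tokens)  # one pass over the corpus
    rows = []
    for w in words:
        ref = reference_freq.get(w, 0)
        if ref == 0:
            continue  # skip words missing from the reference table
        rows.append((w, counts[w], counts[w] / ref))
    return sorted(rows, key=lambda row: row[2], reverse=True)


toy_tokens = ["slow", "slow", "time", "fast", "slow", "time"]
toy_reference = {"slow": 10, "time": 100, "fast": 5}  # made-up counts
table = relative_frequencies(toy_tokens, toy_reference, ["time", "fast", "slow"])
for row in table:
    print(row)
```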


<h4> Get context window function </h4>

In [8]:
# Get context window function for time_words
def get_windows(words, C):
    i = C
    while i < len(words) - C:
        center_word = words[i]
        if center_word not in time_words:
            i += 1
            pass
        else:
            context_words = words[(i - C):i] + words[(i+1):(i+C+1)]
            yield context_words, center_word
            i += 1
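To see what the context window function returns, here is a minimal self-contained sketch of the same windowing logic on a toy token list. It takes the seed words as a parameter instead of reading the global `time_words`; `windows_around` and the example tokens are hypothetical.

```python
def windows_around(words, seeds, C):
    """Yield (context_words, center_word) for each seed word with C tokens on both sides."""
    for i in range(C, len(words) - C):
        if words[i] in seeds:
            # C tokens to the left plus C tokens to the right, center word excluded
            yield words[i - C:i] + words[i + 1:i + C + 1], words[i]


tokens = ["the", "room", "felt", "slow", "and", "very", "quiet"]
result = list(windows_around(tokens, {"slow"}, C=2))
print(result)  # [(['room', 'felt', 'and', 'very'], 'slow')]
```

As in the notebook function, seed words within the first or last `C` tokens are never used as a center, because a full window cannot be built around them.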

<h4> Modified get context window (CW) function </h4>

See 7. Get Time corpus 2 (For BERTopic).ipynb

In [7]:
# Modified get context window function for time_words
def modified_get_windows(words, C):
    i = C
    while i < len(words) - C:
        center_word = words[i]
        if center_word not in time_words:
            i += 1
            pass
        else:
            context_words = words[(i - C):i] + words[(i+1):(i+C+1)]
            for word in time_words:
                if word in context_words:
                    i += C
                    pass
            yield context_words, center_word
            i += 1
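The modified function jumps ahead when a window's context already contains another seed word, which reduces heavily overlapping windows from closely spaced seed words. The sketch below is a simplified reading of that idea, not a drop-in replacement: it jumps `C` tokens once per window, whereas the notebook function adds `C` once for every matching entry in `time_words`. The name `windows_skip_overlap` and the toy tokens are hypothetical.

```python
def windows_skip_overlap(words, seeds, C):
    """Yield windows around seed words, jumping ahead when the context holds another seed."""
    i = C
    while i < len(words) - C:
        center = words[i]
        if center in seeds:
            context = words[i - C:i] + words[i + 1:i + C + 1]
            yield context, center
            if any(tok in seeds for tok in context):
                i += C  # jump once past the neighbouring seed word (simplification)
        i += 1


tokens = ["a", "b", "slow", "time", "c", "d", "e", "fast", "f", "g"]
windows = list(windows_skip_overlap(tokens, {"slow", "time", "fast"}, C=2))
centers = [center for _, center in windows]
print(centers)  # ['slow', 'fast']
```

Here "time" is never used as a center, because it already appears in the window emitted for "slow".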


<h3> Zipf's law - data preparation </h3>

In [9]:
#Erowid corpus frequency dict
counter = Counter(corpus_list)
corpus_dict = dict(Counter({k: c for k, c in counter.items()}))


#time corpus (C=4) 
time_corpus_list_C4 = []
for context_words, center_word in get_windows(corpus_list, C=4):
    time_corpus_list_C4 += context_words

#Time corpus frequency dict (C=4)
counter_C4 = Counter(time_corpus_list_C4)
time_corpus_dict_C4 = dict(Counter({k: c for k, c in counter_C4.items()}))


#tf idf frequency dict (C=4)
tfidf_df_C4 = pd.read_pickle("tfidf_df_C=4.pkl")

#remove rows with values 0, as these are words not in the time corpus
tfidf_df_C4 = tfidf_df_C4[tfidf_df_C4["all"] != 0]
#remove rows with time seed words
tfidf_df_C4 = tfidf_df_C4[~tfidf_df_C4.filter(items=['word']).isin(time_words).any(axis=1)]

#sort by tf-idf
tfidf_df_C4.sort_values(by=["all"], ascending=False, inplace=True)
tfidf_df_C4.reset_index(drop=True, inplace=True)
tfidf_list_C4 = tfidf_df_C4["all"].to_list()

#tf_idf_values go from negative values to positives values, normalize them from 0 to 1 for graph
min_value = min(tfidf_list_C4)
max_value = max(tfidf_list_C4)
normalized_tfidf_list_C4 = [(x - min_value) / (max_value - min_value) for x in tfidf_list_C4]



#time corpus (C=15)
time_corpus_list_C15 = []
for context_words, center_word in get_windows(corpus_list, C=15):
    time_corpus_list_C15 += context_words

#Time corpus frequency dict (C=15)
counter_C15 = Counter(time_corpus_list_C15)
time_corpus_dict_C15 = dict(Counter({k: c for k, c in counter_C15.items()}))

#tf idf frequency dict (C=15)
tfidf_df_C15 = pd.read_pickle("tfidf_df_C=15.pkl")

#remove rows with values 0, as these are words not in the time corpus
tfidf_df_C15 = tfidf_df_C15[tfidf_df_C15["all"] != 0]
#remove rows with time seed words
tfidf_df_C15 = tfidf_df_C15[~tfidf_df_C15.filter(items=['word']).isin(time_words).any(axis=1)]

#sort by tf-idf
tfidf_df_C15.sort_values(by=["all"], ascending=False, inplace=True)
tfidf_df_C15.reset_index(drop=True, inplace=True)
tfidf_list_C15 = tfidf_df_C15["all"].to_list()

#tf_idf_values go from negative values to positives values, normalize them from 0 to 1 for graph
min_value = min(tfidf_list_C15)
max_value = max(tfidf_list_C15)
normalized_tfidf_list_C15 = [(x - min_value) / (max_value - min_value) for x in tfidf_list_C15]
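The min-max normalization used for both tf-idf lists maps the smallest value to 0 and the largest to 1. A minimal sketch as a reusable function; `min_max_normalize` is a hypothetical helper, and the guard for a constant list is an addition that the notebook code does not have (it would otherwise divide by zero).

```python
def min_max_normalize(values):
    """Rescale values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero (assumed fallback).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scores = [-2.0, 0.0, 2.0, 6.0]  # toy tf-idf values
normalized = min_max_normalize(scores)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```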


<h3> Zipf's law - word frequency </h3> 

In [None]:
#sorted dictionaries from large to small freq
sorted_dict_corpus = dict(OrderedDict(sorted(corpus_dict.items(), key = lambda x: x[1],reverse = True)))
sorted_dict_C4 = dict(OrderedDict(sorted(time_corpus_dict_C4.items(), key = lambda x: x[1],reverse = True)))
sorted_dict_C15 = dict(OrderedDict(sorted(time_corpus_dict_C15.items(), key = lambda x: x[1],reverse = True)))

#freq and rank of each word in list format
freqs_subtlex = df5.freq.tolist()[:12703]
ranks_subtlex = np.arange(1, len(freqs_subtlex)+1)


freqs_corpus = list(sorted_dict_corpus.values())[:12703] #the lists have slightly different lengths, so cap all at 12703
ranks_corpus = np.arange(1, len(freqs_corpus)+1)

freqs_C4 = list(sorted_dict_C4.values())[:12703]
ranks_C4 = np.arange(1, len(freqs_C4)+1)

freqs_C15 = list(sorted_dict_C15.values())[:12703]
ranks_C15 = np.arange(1, len(freqs_C15)+1)



#plot 4 lines on loglog curve
plt.loglog(ranks_subtlex, freqs_subtlex, label='SUBTLEX corpus')
plt.loglog(ranks_corpus, freqs_corpus, label='Erowid corpus')
plt.loglog(ranks_C4, freqs_C4, label='Time corpus (C=4)')
plt.loglog(ranks_C15, freqs_C15, label='Time corpus (C=15)')



# Set up the tick formatter
def log_tick_formatter(val, pos=None):
    """
    Convert a log tick value to a plain tick label (returned as a string).
    """
    if val < 1:
        return '{:.3g}'.format(val)
    else:
        return str(int(val))


# Format the x-axis and y-axis tick labels
plt.gca().xaxis.set_major_formatter(FuncFormatter(log_tick_formatter))
plt.gca().yaxis.set_major_formatter(FuncFormatter(log_tick_formatter))
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)



# annotate with arrows for words
zero_value = 0

#subtlex dictionary
subtlex_dict = dict(zip(df5["word"], df5["freq"]))

#blue (SUBTLEX corpus) 
plt.annotate('"experience"', xy=(list(subtlex_dict).index("experience"), subtlex_dict["experience"]), xytext=(zero_value+30, zero_value+400000),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(48/255, 129/255, 185/255)))     

plt.annotate('"heart"', xy=(list(subtlex_dict).index("heart"), subtlex_dict["heart"]), xytext=(zero_value+18, zero_value+200),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(48/255, 129/255, 185/255)))

plt.annotate('"continuum"', xy=(list(subtlex_dict).index("continuum"), subtlex_dict["continuum"]), xytext=(zero_value+300, zero_value+10),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(48/255, 129/255, 185/255)))


#orange (Erowid corpus)
plt.annotate('          ', xy=(list(sorted_dict_corpus).index("experience"), sorted_dict_corpus["experience"]), xytext=(zero_value+30, zero_value+400000),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(252/255, 142/255, 42/255)))     

plt.annotate('     ', xy=(list(sorted_dict_corpus).index("heart"), sorted_dict_corpus["heart"]), xytext=(zero_value+25, zero_value+200),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(252/255, 142/255, 42/255)))

plt.annotate('         ', xy=(list(sorted_dict_corpus).index("continuum"), sorted_dict_corpus["continuum"]), xytext=(zero_value+600, zero_value+10),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(252/255, 142/255, 42/255)))



#green (C=4)
plt.annotate('          ', xy=(list(sorted_dict_C4).index("experience"), sorted_dict_C4["experience"]), xytext=(zero_value+30, zero_value+400000),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(61/255, 168/255, 61/255)))

plt.annotate('     ', xy=(list(sorted_dict_C4).index("heart"), sorted_dict_C4["heart"]), xytext=(zero_value+20, zero_value+200),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(61/255, 168/255, 61/255)))

plt.annotate('         ', xy=(list(sorted_dict_C4).index("continuum"), sorted_dict_C4["continuum"]), xytext=(zero_value+300, zero_value+10),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(61/255, 168/255, 61/255)))


#red (C=15)
plt.annotate('          ', xy=(list(sorted_dict_C15).index("experience"), sorted_dict_C15["experience"]), xytext=(zero_value+30, zero_value+400000),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(218/255, 60/255, 61/255)))
      
plt.annotate('     ', xy=(list(sorted_dict_C15).index("heart"), sorted_dict_C15["heart"]), xytext=(zero_value+20, zero_value+200),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(218/255, 60/255, 61/255)))

plt.annotate('         ', xy=(list(sorted_dict_C15).index("continuum"), sorted_dict_C15["continuum"]), xytext=(zero_value+300, zero_value+10),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(218/255, 60/255, 61/255)))




plt.title('Zipf\'s law (Word frequency)'); plt.xlabel('Word rank'); plt.ylabel('Word frequency'); plt.legend(); plt.show()
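Zipf's law predicts that word frequency is roughly proportional to 1/rank, so a rank-frequency curve on log-log axes should be close to a straight line with slope near -1. Besides inspecting the plot, the slope can be estimated numerically. The following is a minimal sketch using an ordinary least-squares fit on the logged values; `zipf_slope` is a hypothetical helper, and the input is an idealised toy distribution rather than any of the corpora above. All frequencies must be positive.

```python
import math


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    ordered = sorted(freqs, reverse=True)  # rank 1 = most frequent word
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Idealised Zipfian frequencies: f(rank) = 1000 / rank
ideal = [1000 / rank for rank in range(1, 51)]
slope = zipf_slope(ideal)
print(round(slope, 6))
```

Applied to `freqs_corpus` or `freqs_C4`, this would give a single number for comparing how closely each corpus follows the expected slope.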

<h3> Zipf's law - cumulative distribution </h3> 

In [None]:

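#convert frequencies to percentages of each corpus (totals computed once per corpus)
total_subtlex = sum(freqs_subtlex)
pct_subtlex = [100*x / total_subtlex for x in freqs_subtlex]

total_corpus = sum(freqs_corpus)
pct_corpus = [100*x / total_corpus for x in freqs_corpus]

total_C4 = sum(freqs_C4)
pct_C4 = [100*x / total_C4 for x in freqs_C4]

total_C15 = sum(freqs_C15)
pct_C15 = [100*x / total_C15 for x in freqs_C15]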


# Plot all dictionaries on the same graph
cum_freqs_subtlex = np.cumsum(pct_subtlex)
cum_freqs_corpus = np.cumsum(pct_corpus)
cum_freqs_C4 = np.cumsum(pct_C4)
cum_freqs_C15 = np.cumsum(pct_C15)

#plot the 4 cumulative curves (linear axes)
plt.plot(ranks_subtlex, cum_freqs_subtlex, label='SUBTLEX corpus')
plt.plot(ranks_corpus, cum_freqs_corpus, label='Erowid corpus')
plt.plot(ranks_C4, cum_freqs_C4, label='Time corpus (C=4)')
plt.plot(ranks_C15, cum_freqs_C15, label='Time corpus (C=15)')


# Format the y-axis as a percentage
def percent_tick_formatter(val, pos=None):
    """
    Convert a numeric tick value to a percentage tick value.
    """
    return '{:.0f}%'.format(val)

plt.gca().yaxis.set_major_formatter(FuncFormatter(percent_tick_formatter))

# Format the x-axis and y-axis tick labels
plt.gca().xaxis.set_major_formatter(FuncFormatter(log_tick_formatter))
plt.gca().yaxis.set_tick_params(which='minor', labelleft=False)  # Remove minor tick labels
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)


plt.title('Zipf\'s law (Cumulative percentage of corpus)'); plt.xlabel('Word rank'); plt.ylabel('Cumulative percentage of corpus')
plt.legend(); plt.show()
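The cumulative curve answers the question "what share of the corpus is covered by the top N words?". A minimal stdlib sketch of that computation on toy frequencies; `cumulative_percent` is a hypothetical helper and the numbers are invented.

```python
from itertools import accumulate


def cumulative_percent(freqs):
    """Cumulative share (in %) of the corpus covered by the top-ranked words."""
    ordered = sorted(freqs, reverse=True)
    total = sum(ordered)
    return [100 * running / total for running in accumulate(ordered)]


toy_freqs = [50, 30, 15, 5]
coverage = cumulative_percent(toy_freqs)
print(coverage)  # [50.0, 80.0, 95.0, 100.0]

# Smallest rank at which at least 90% of the toy corpus is covered.
rank_90 = next(rank for rank, pct in enumerate(coverage, start=1) if pct >= 90)
print(rank_90)  # 3
```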

<h3> Zipf's law - tf-idf </h3> 

In [None]:
#prep
index_exper_C4 = tfidf_df_C4["word"].to_list().index("experience")
index_heart_C4 = tfidf_df_C4["word"].to_list().index("heart")
index_continuum_C4 = tfidf_df_C4["word"].to_list().index("continuum")

index_exper_C15 = tfidf_df_C15["word"].to_list().index("experience")
index_heart_C15 = tfidf_df_C15["word"].to_list().index("heart")
index_continuum_C15 = tfidf_df_C15["word"].to_list().index("continuum")
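The lookups above call `.to_list().index(...)` once per word, which rebuilds and rescans the list each time. A minimal sketch of an alternative that builds a word-to-rank dictionary once; `rank_lookup` is a hypothetical helper and the word list is a toy example. It assumes each word appears only once in the sorted list, and it returns 1-based ranks (matching the rank axis), whereas `list.index` returns 0-based positions.

```python
def rank_lookup(sorted_words):
    """Map each word to its 1-based rank in an already sorted word list."""
    return {word: rank for rank, word in enumerate(sorted_words, start=1)}


toy_sorted = ["trip", "experience", "heart", "continuum"]
ranks = rank_lookup(toy_sorted)
print([ranks[w] for w in ("experience", "heart", "continuum")])  # [2, 3, 4]
```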

In [None]:
#freq and rank of each word in list format
freqs_C4 = normalized_tfidf_list_C4
ranks_C4 = np.arange(1, len(freqs_C4)+1)

freqs_C15 = normalized_tfidf_list_C15
ranks_C15 = np.arange(1, len(freqs_C15)+1)


# Create the figure and first axis
fig, ax = plt.subplots()


#plot Erowid corpus on the left axis
ax.loglog(ranks_corpus, freqs_corpus, label='Erowid corpus', color='orange')


# Set up the tick formatter
def log_tick_formatter(val, pos=None):
    """
    Convert a log tick value to a plain tick label (returned as a string).
    """
    if val < 1:
        return '{:.3g}'.format(val)
    else:
        return str(int(val))


def log_tick_formatter_right(val, pos=None):
    """
    Convert a log tick value to a plain tick value for the second y-axis.
    """
    return '{:.3g}'.format(val)


# Format the x-axis and y-axis tick labels
ax.xaxis.set_major_formatter(FuncFormatter(log_tick_formatter))
ax.yaxis.set_major_formatter(FuncFormatter(log_tick_formatter))


plt.gca().xaxis.set_major_formatter(FuncFormatter(log_tick_formatter))
plt.gca().yaxis.set_major_formatter(FuncFormatter(log_tick_formatter))
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)

# Set the font size of the x-axis and y-axis tick labels
ax.tick_params(axis='both', labelsize=8)

# Create a second y-axis on the right side
ax2 = ax.twinx()

# Plot the lines for the Time corpus (C=4) and Time corpus (C=15) on the second y-axis
ax2.semilogx(ranks_C4, freqs_C4, label='Time corpus (C=4)', color='green')
ax2.semilogx(ranks_C15, freqs_C15, label='Time corpus (C=15)', color='red')

# Set up the tick formatter for the second y-axis
ax2.yaxis.set_major_formatter(FuncFormatter(log_tick_formatter_right))
ax.xaxis.set_major_formatter(FuncFormatter(log_tick_formatter))


# Set the font size of the y-axis tick labels for the second y-axis
ax2.tick_params(axis='y', labelsize=8)

#axes labels
ax.set_xlabel('Word rank')
ax.set_ylabel('Word frequency')
ax2.set_ylabel('Normalized tf-idf value')

# Add a legend to the plot
handles1, labels1 = ax.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()


fig.legend(handles1, labels1, loc='lower left', fontsize=10)
fig.legend(handles2, labels2, loc='lower right', fontsize=10)



# annotate with arrows for words




#orange Erowid corpus
ax.annotate('"experience"', xy=(list(sorted_dict_corpus).index("experience"), sorted_dict_corpus["experience"]), xytext=(zero_value+2, zero_value+100),
             arrowprops=dict(facecolor='red', arrowstyle='->', color=(252/255, 142/255, 42/255)),
             annotation_clip=False)

ax.annotate('"heart"', xy=(list(sorted_dict_corpus).index("heart"), sorted_dict_corpus["heart"]), xytext=(zero_value+150, zero_value+35000),
             arrowprops=dict(facecolor='red', arrowstyle='->', color=(252/255, 142/255, 42/255)),
             annotation_clip=False)

ax.annotate('"continuum"', xy=(list(sorted_dict_corpus).index("continuum"), sorted_dict_corpus["continuum"]), xytext=(zero_value+60, zero_value+500),
             arrowprops=dict(facecolor='red', arrowstyle='->', color=(252/255, 142/255, 42/255)),
             annotation_clip=False)


#indices (C=4)
index_exper_C4 = tfidf_df_C4["word"].to_list().index("experience")
index_heart_C4 = tfidf_df_C4["word"].to_list().index("heart")
index_continuum_C4 = tfidf_df_C4["word"].to_list().index("continuum")

#green (C=4)
ax2.annotate('          ', xy=(index_exper_C4, freqs_C4[index_exper_C4]), xytext=(zero_value+2, zero_value+0.23),
             arrowprops=dict(facecolor='green', arrowstyle='->', color=(61/255, 168/255, 61/255)))

ax2.annotate('     ', xy=(index_heart_C4, freqs_C4[index_heart_C4]), xytext=(zero_value+190, zero_value+0.9),
             arrowprops=dict(facecolor='green', arrowstyle='->', color=(61/255, 168/255, 61/255)))

ax2.annotate('         ', xy=(index_continuum_C4, freqs_C4[index_continuum_C4]), xytext=(zero_value+60, zero_value+0.43),
             arrowprops=dict(facecolor='green', arrowstyle='->', color=(61/255, 168/255, 61/255)))


#indices (C=15)
index_exper_C15 = tfidf_df_C15["word"].to_list().index("experience")
index_heart_C15 = tfidf_df_C15["word"].to_list().index("heart")
index_continuum_C15 = tfidf_df_C15["word"].to_list().index("continuum")


#red (C=15)
ax2.annotate('          ', xy=(index_exper_C15, freqs_C15[index_exper_C15]), xytext=(zero_value+2, zero_value+0.22),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(218/255, 60/255, 61/255)))
      
ax2.annotate('     ', xy=(index_heart_C15, freqs_C15[index_heart_C15]), xytext=(zero_value+200, zero_value+0.9),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(218/255, 60/255, 61/255)))

ax2.annotate('         ', xy=(index_continuum_C15, freqs_C15[index_continuum_C15]), xytext=(zero_value+60, zero_value+0.43),
             arrowprops=dict(facecolor='blue', arrowstyle='->', color=(218/255, 60/255, 61/255)))



plt.title('Word frequency vs tf-idf')
plt.show()


<h3> Wordclouds - word frequency vs tf-idf </h3>

Create 4 word clouds for a 2x2 comparison: two classes (serotonergic psychedelics and stimulants), each as word frequency vs tf-idf.  

Save each table as a CSV file and upload it at https://wordart.com/create, which produces nicer-looking word clouds.

In [None]:
#remove from wordclouds

#temp remove list
remove_list = ["zyprexa", "tryptamine", "vaporiser", "redose", "peter", "psychedelic", "stramonium", "shannon", "pod", "atropine", "deliriants", "nightshade", "sceletium", "suboxone", "subutex", "zyprexa", "citalopram", "zoloft", "fluoxetine", "benzo", "cbd", "lotus", "rayanne", "opiates", "hashish", "alcoholic", "analgesia", "vodka", "apnea", "analgesic", "tylenol", "cigar", "antihistamine", "benedryl", "vaped", "trazodone", "tweaker", "mah", "desoxyn", "crank", "booster", "meph", "canker" , "entactogenic", "xtc", "albert", "antihistamine", "capsule", "butane", "sinicuichi", "ferris", "cory", "bromo", "bodyload", "tryptamine", "phenethylamines", "zeta", "glauca", "stims", "vivarin", "gourd", "yerba", "caffine", "cigs", "modalert", "cigar", "nutmegs", "tachycardia", "adderal", "bzp", "espresso", "hookah", "piperazine", "phentermine", "nut", "ciggarette", "betel"]

#remove non_seed_time_words + ... - later done by pre-processing
master_remove_list = remove_list + time_words + ["second", "seconds", "minute", "minutes", "hour", "hours", "day", "days", "week", "weeks", "weekend", "weekends", "month", "months", "year", "years", "times", "spend", "spent", "spending", "timestamp", "timestamps"]


#generate and save word freq word clouds for two types of classes
df_Serot = df[df['classes'].isin(['Serotonergic psychedelics'])]
df_Stimul = df[df['classes'].isin(['Stimulants'])]


Serot_corpus_list = []
for text in df_Serot.text:
    Serot_corpus_list += text

Stimul_corpus_list = []
for text in df_Stimul.text:
    Stimul_corpus_list += text

Serot_time_corpus_list = []
for context_words, center_word in get_windows(Serot_corpus_list, C=4):
    Serot_time_corpus_list += context_words

Stimul__time_corpus_list = []
for context_words, center_word in get_windows(Stimul_corpus_list, C=4):
    Stimul__time_corpus_list += context_words

Serot_counter = Counter(Serot_time_corpus_list)
wfreq_df_Serot = pd.DataFrame.from_dict(dict(Serot_counter), orient='index', columns=['frequency'])


Stimul_counter = Counter(Stimul__time_corpus_list)
wfreq_df_Stimul = pd.DataFrame.from_dict(dict(Stimul_counter), orient='index', columns=['frequency'])


#the words are stored in the index, so filter on the index
wfreq_df_Serot = wfreq_df_Serot[~wfreq_df_Serot.index.isin(master_remove_list)]
wfreq_df_Serot.to_csv("wordcloud_wfreq_Serot.csv")

wfreq_df_Stimul = wfreq_df_Stimul[~wfreq_df_Stimul.index.isin(master_remove_list)]
wfreq_df_Stimul.to_csv("wordcloud_wfreq_Stimul.csv")



#generate and save tf idf word clouds
tfidf_df_Serot = tfidf_df_C4.loc[:, ['word', 'Serotonergic psychedelics']]
tfidf_df_Serot = tfidf_df_Serot[~tfidf_df_Serot.filter(items=['word']).isin(remove_list).any(axis=1)]
tfidf_df_Serot.to_csv("wordcloud_tfidf_Serot.csv")




tfidf_df_Stimul = tfidf_df_C4.loc[:, ['word', 'Stimulants']]
tfidf_df_Stimul = tfidf_df_Stimul[~tfidf_df_Stimul.filter(items=['word']).isin(remove_list).any(axis=1)]
tfidf_df_Stimul.to_csv("wordcloud_tfidf_Stimul.csv")
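The word-frequency tables above boil down to counting tokens, dropping unwanted words, and writing word/frequency pairs to CSV. A minimal stdlib sketch of that pipeline on toy data; `filtered_frequencies`, the tokens and the removal list are hypothetical, and the CSV is written to an in-memory buffer rather than to a file.

```python
import csv
import io
from collections import Counter


def filtered_frequencies(tokens, remove):
    """Count tokens, skipping any word in the removal list; most frequent first."""
    remove = set(remove)  # set membership is faster than scanning a list
    counts = Counter(tok for tok in tokens if tok not in remove)
    return counts.most_common()


toy_tokens = ["slow", "music", "music", "hour", "light", "music", "light"]
toy_remove = ["slow", "hour"]  # stand-in for master_remove_list
rows = filtered_frequencies(toy_tokens, toy_remove)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["word", "frequency"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```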