In [1]:
import pandas as pd

## Loading Data
Brings the urls generated by the Slack Bot into a pandas dataframe.

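If this is the first run and `text_data.csv` does not exist yet, replace the `pd.read_csv` line with an empty frame such as `df = pd.DataFrame(columns=['url', 'title', 'text'])`. Simply commenting the line out leaves `df` undefined, and the later `isin` filter still needs `df['url']`.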

In [38]:
# If there is an existing data file, load it in.
df = pd.read_csv('text_data.csv', sep='\t')

# Gets the new url list (deduplicated); the file is closed once it has been read
with open("slack_output.txt", "r") as f:
    urls_list = list(set(f.read().splitlines()))
urls = pd.DataFrame({'url' : urls_list, 'title': None, 'text': None})

# Gets only new urls not currently in the df.
new_urls = urls[~urls['url'].isin(df['url'])]
df = pd.concat([df, new_urls], axis=0)

# Removes youtube urls
df = df[~df['url'].str.contains("youtu")].reset_index(drop=True)
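
The loading steps above can also be wrapped so that a first run needs no manual edits. The following is only a minimal sketch under that assumption: `load_existing` and `merge_new_urls` are hypothetical helper names that do not appear in the notebook, and `dict.fromkeys` is swapped in for `set` so that the URL order from the Slack output is preserved.

```python
import os
import pandas as pd

def load_existing(path="text_data.csv"):
    """Return the saved frame, or an empty one with the expected columns."""
    if os.path.exists(path):
        return pd.read_csv(path, sep="\t")
    return pd.DataFrame(columns=["url", "title", "text"])

def merge_new_urls(df, urls_list):
    """Append URLs not already in df, then drop YouTube links."""
    # dict.fromkeys removes duplicates while keeping the original order
    unique_urls = list(dict.fromkeys(urls_list))
    urls = pd.DataFrame({"url": unique_urls, "title": None, "text": None})
    new_urls = urls[~urls["url"].isin(df["url"])]
    merged = pd.concat([df, new_urls], axis=0, ignore_index=True)
    return merged[~merged["url"].str.contains("youtu")].reset_index(drop=True)
```

With these helpers, the cell body could shrink to `df = merge_new_urls(load_existing(), urls_list)`.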

## Filling Data
A small automated "interface" for copy-pasting each article's title and body text into the DataFrame.
If the link is invalid (or you do not want to add it for any reason), leave the title empty by pressing Enter at the prompts. Both values are then stored as a single space, which can be filtered out later.
If you make a mistake, stop the cell and rerun it. It will pick up from the last save, which happens after each fully submitted article.
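
The whitespace placeholders mentioned above could later be filtered out along the following lines. This is a sketch rather than part of the notebook: `drop_skipped` is a hypothetical helper, the example rows are made up for illustration, and it assumes that rows still missing a title or text should be dropped too.

```python
import pandas as pd

def drop_skipped(df):
    """Return only rows whose title and text contain non-whitespace characters."""
    # Treat missing values as empty so unfilled rows are dropped as well
    has_title = df["title"].fillna("").str.strip() != ""
    has_text = df["text"].fillna("").str.strip() != ""
    return df[has_title & has_text].reset_index(drop=True)

# Made-up rows: one filled article, one skipped article, one not yet filled
example = pd.DataFrame({
    "url": ["https://a.example/1", "https://b.example/2", "https://c.example/3"],
    "title": ["A real title", " ", None],
    "text": ["Body text", " ", None],
})
cleaned = drop_skipped(example)
```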

In [39]:
"""
Here is the block for filling in the titles and text of each article.
This could be done with web scraping, but with the fairly small number of articles I will just collect them manually.
"""
import os

unlabeled_df = df[df['title'].isna()]

for i, row in unlabeled_df.iterrows():
    # Quote the URL so that cmd does not treat characters such as & as command separators
    os.system(f'cmd /c start iexplore "{row.url}"')
    print("")
    print(f'Index: {i} / {df.shape[0]}')
    print("Url: " + row.url)
    title = input("Title: ").strip('\t')
    text = input("Text: ").strip('\t')
    
    # Skipped articles get a single-space placeholder so they are not null and are not prompted for again
    if title == "":
        title = " "
        text = " "
    
    # Assign with a single .loc call; chained indexing (df.loc[i]['title'] = ...) may write to a copy
    df.loc[i, 'title'] = title
    df.loc[i, 'text'] = text
    
    df.to_csv('text_data.csv', sep='\t', index=False)
    
print("~~~~~~")
print("Done!")


Index: 135 / 141
Url: https://www.publishersweekly.com/pw/by-topic/digital/content-and-e-books/article/92471-ai-is-about-to-turn-book-publishing-upside-down.html

Index: 136 / 141
Url: https://cointelegraph.com/news/weak-competition-ai-could-hurt-consumers-uk-competition-watchdog

Index: 137 / 141
Url: https://www.forbes.com/sites/bernardmarr/2018/07/27/how-is-ai-used-in-healthcare-5-powerful-real-world-examples-that-show-the-latest-advances/?sh=5c09b8375dfb

Index: 138 / 141
Url: https://arxiv.org/abs/2306.03809

Index: 139 / 141
Url: https://www.osmo.ai/about

Index: 140 / 141
Url: https://www.forbes.com/sites/beatajones/2023/09/05/how-educators-can-leverage-generative-ai-to-promote-student-innovation/?sh=69af3248f83e
~~~~~~
Done!


In [40]:
# Make sure there aren't any strange titles
df['title'].str.len().sort_values()

111      1
26       1
63       1
128      1
107      1
      ... 
90     106
116    108
93     108
95     111
56     114
Name: title, Length: 141, dtype: int64

In [37]:
df.iloc[62]['title']

'31% of investors are OK with using artificial intelligence as their advisor'

In [29]:
len(df.iloc[21]['title'])

1124
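
A title of 1124 characters probably means that article text was pasted into the title prompt. One way to catch and redo such rows is sketched below. It is an illustration only: the 300-character cutoff and the helper names `flag_long_titles` and `reset_rows` are assumptions, not part of the notebook.

```python
import pandas as pd

MAX_TITLE_LEN = 300  # hypothetical cutoff, not taken from the notebook

def flag_long_titles(df, max_len=MAX_TITLE_LEN):
    """Return the index labels of rows whose title exceeds max_len characters."""
    lengths = df["title"].str.len()
    return df.index[lengths > max_len].tolist()

def reset_rows(df, labels):
    """Return a copy with title and text cleared for the given rows."""
    out = df.copy()
    out.loc[labels, ["title", "text"]] = None
    return out
```

Because the fill loop selects rows where `title` is null, saving the result of `reset_rows` and rerunning that cell would prompt for the flagged articles again.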