<a href="https://colab.research.google.com/github/NancWN/ScrumBusters/blob/main/KeywordSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What can it do?
It takes a predefined list of keywords, divided into pump and no_pump words, and uses that list to search through chat data. If a match is found, the group name, the word, and the label are written to a separate list.

Requirements:
1. Upload a list of the keywords you want to search for
2. Provide chat data from Telegram groups in CSV format
3. Adjust the paths in the settings cell
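
For orientation, here is a rough sketch of the keyword file layout the loading cell seems to expect: an index column followed by one pump column and one no_pump column, where the shorter column is simply left empty. The keywords and the in-memory file below are made up for illustration; the real file is read from `references/<keyword_set>_kw.csv`. The message file, judging from the code further down, needs at least an unnamed first column plus `text`, `chat`, and `date` columns.

```python
import io
import pandas as pd

# Hypothetical keyword file contents (illustration only, not the real list).
example_kw_csv = io.StringIO(
    ",pump,no_pump\n"
    "0,pump,staking\n"
    "1,target,giveaway\n"
    "2,buy now,\n"  # the no_pump column is shorter, so this cell stays empty
)

# Same loading steps as in the notebook: first column is the index,
# the two remaining columns are renamed to 'pump' and 'no_pump'.
example_kw = pd.read_csv(example_kw_csv, index_col=0)
example_kw.columns = ["pump", "no_pump"]

# dropna() removes the empty cells of the shorter column.
example_pump = example_kw["pump"].dropna().tolist()
example_no_pump = example_kw["no_pump"].dropna().tolist()

print(example_pump)
print(example_no_pump)
```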

Let's go! :)

In [17]:
# settings
# feel free to change settings here
# only change other cells if you know what you are doing
keyword_set = 'brainstorm' # choose one of 'paper', 'brainstorm', 'mixed'
message_path = 'data/raw/messages.csv' # put in the path of the message file

In [63]:
# import libraries, load message data and keyword list
import csv
import pandas as pd 

#List of Keywords we want to search

lists = pd.read_csv('references/{}_kw.csv'.format(keyword_set),index_col=0)
lists.columns=['pump','no_pump']

pump_words=lists["pump"].dropna()
no_pump_words=lists["no_pump"].dropna()

#telegram chat export (csv-file)
searchlists=pd.read_csv(message_path, sep=',',na_values = ['no info', '.'])
searchlists.text= searchlists.text.astype('str')

In [64]:
# create dataset of all chat names and chat ids
chat_names = searchlists.chat.unique()
chat_ids = range(len(chat_names))
chat_name_history = pd.DataFrame(data=[chat_names,chat_ids]).T
chat_name_history.columns = ['chat_name','chat_id']

# add the chat ids to the message dataframe
searchlists = searchlists.merge(
    chat_name_history,
    how='left',
    left_on = 'chat',
    right_on = 'chat_name'
    )
searchlists = searchlists.set_index('Unnamed: 0')
searchlists.index.name = 'id'
# drop the chat name columns, keeping only the numeric chat_id
searchlists.drop(['chat_name','chat'],axis=1,inplace =True)
# simplify date column
searchlists.date = pd.to_datetime(searchlists.date).dt.tz_localize(None)
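
As a possible shortcut for the cell above, `pd.factorize` can assign the numeric ids in a single step, without the lookup table and the merge. This is only a sketch on a made-up `toy` frame, not the notebook's actual data. Ids are handed out in order of first appearance, which matches `unique()` combined with `range()`. One difference to be aware of: missing chat names would get the id -1.

```python
import pandas as pd

# Toy stand-in for the message dataframe (hypothetical chat names).
toy = pd.DataFrame({"chat": ["A", "B", "A", "C"]})

# factorize returns one integer code per row plus the unique values,
# numbered in order of first appearance.
codes, uniques = pd.factorize(toy["chat"])
toy["chat_id"] = codes

# Optional lookup table, equivalent to chat_name_history in the notebook.
lookup = pd.DataFrame({"chat_name": uniques, "chat_id": range(len(uniques))})

print(toy)
print(lookup)
```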

In [65]:
searchlists

id (index),text,date,id,chat_id
11009_@CryptoDemonz 😈⚔️🕸 World's First Video Game NFT Arcade Series_2022-02-06,[mike](tg://user?id=5061538260) press the butt...,2022-02-06 11:22:38,1.627264e+08,0
11006_@CryptoDemonz 😈⚔️🕸 World's First Video Game NFT Arcade Series_2022-02-06,just staked my top 25 boy. my first ever nft f...,2022-02-06 00:24:42,1.691692e+09,0
11005_@CryptoDemonz 😈⚔️🕸 World's First Video Game NFT Arcade Series_2022-02-06,were all booked up thank you,2022-02-06 00:22:26,1.648536e+09,0
11004_@CryptoDemonz 😈⚔️🕸 World's First Video Game NFT Arcade Series_2022-02-05,as a senior blockchain developer id like to wo...,2022-02-05 23:37:39,5.058643e+09,0
11003_@CryptoDemonz 😈⚔️🕸 World's First Video Game NFT Arcade Series_2022-02-05,hello can i help you with something?,2022-02-05 23:37:18,1.984785e+09,0
...,...,...,...,...
39_☠️ PIRATES PUMPS💀_2018-01-18,,2018-01-18 21:26:57,-1.001339e+12,61
38_☠️ PIRATES PUMPS💀_2018-01-18,,2018-01-18 20:57:25,-1.001339e+12,61
37_☠️ PIRATES PUMPS💀_2018-01-18,,2018-01-18 20:56:36,-1.001339e+12,61
1_☠️ PIRATES PUMPS💀_2018-01-10,,2018-01-10 01:14:36,-1.001339e+12,61


In [20]:
# simple function to state if a substring was found in a string
# this is not necessary but it's easier to understand than alternative solutions
def key_found(key, msg):
    # the `in` operator already returns True or False
    return key in str(msg)
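
One of the alternative solutions mentioned above is the vectorized `Series.str.contains`. The sketch below uses a made-up `toy_text` series. Passing `regex=False` makes it a plain substring test like `key_found`. The match is case-sensitive, so lowercasing the text first is one way to also catch "Pump"; whether that is wanted depends on how the keyword list is written.

```python
import pandas as pd

# Toy messages (hypothetical), standing in for searchlists.text.
toy_text = pd.Series(["Pump starts in 5 min", "just staked my nft"])

# Plain substring search, case-sensitive like key_found().
found = toy_text.str.contains("pump", regex=False)

# Case-insensitive variant: lowercase the text before searching.
found_ci = toy_text.str.lower().str.contains("pump", regex=False)

print(found.tolist())
print(found_ci.tolist())
```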

In [21]:
# iterate over keyword list
# create new col per keyword to state if the keyword was found in the message
for keyword in pump_words:
    searchlists.loc[:,keyword] = [
        key_found(keyword, msg)
        for msg
        in searchlists.text
    ]
    searchlists.text = [
        msg.replace(keyword,'**'+keyword+'**')
        for msg in searchlists.text
    ]

# same process for no pump words
for keyword in no_pump_words:
    searchlists.loc[:,keyword] = [
        key_found(keyword, msg)
        for msg
        in searchlists.text
    ]
    # highlight inside the loop so that every no pump word is marked,
    # not only the last one
    searchlists.text = [
        msg.replace(keyword,'##'+keyword+'##')
        for msg in searchlists.text
    ]

# get the number of keywords found per message
searchlists.loc[:,'n_pump_words'] = (searchlists[pump_words]).sum(axis=1)
searchlists.loc[:,'n_nopump_words'] = (searchlists[no_pump_words]).sum(axis=1)
searchlists.loc[:,'n_words'] = searchlists.n_pump_words + searchlists.n_nopump_words

searchlists.sort_values('n_words',ascending=False,inplace=True)
# the 'chat' column was dropped earlier, so select 'chat_id' instead
msg_sus_df = searchlists.loc[searchlists.n_words>0][['text','date','chat_id','n_words']]
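
One thing to watch in the loops above: the text is highlighted while the search is still running, so the inserted markers can hide a later keyword that overlaps an earlier one. The sketch below, on made-up data with the hypothetical keywords "pump" and "pump it", searches an untouched copy of the text instead. It is a suggestion under these assumptions, not a drop-in replacement for the cell.

```python
import pandas as pd

# Toy data and overlapping keywords (both hypothetical).
toy = pd.DataFrame({"text": ["pump it now", "nothing here"]})
toy_keywords = ["pump", "pump it"]

# What happens when the text is highlighted first:
marked = "pump it now".replace("pump", "**pump**")
hidden = "pump it" in marked  # False, the markers split the phrase

# Search an untouched copy so highlighting cannot interfere.
raw_text = toy["text"].copy()
for kw in toy_keywords:
    toy[kw] = raw_text.str.contains(kw, regex=False)

# Count the keywords found per message, as in the notebook.
toy["n_words"] = toy[toy_keywords].sum(axis=1)

print(marked, hidden)
print(toy[["text", "n_words"]])
```

Highlighting could then be applied in a separate pass once all the flags are computed.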

In [22]:
msg_sus_df.to_csv('data/chats/sus_msgs_{}.csv'.format(keyword_set))