<a href="https://colab.research.google.com/github/NancWN/ScrumBusters/blob/main/KeywordSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What can it do?
It takes a predefined list of keywords, divided into pump and no_pump words, and uses it to search through chat data. If a match is found, the group name, the keyword and its label are written to a separate list.
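
A minimal sketch of that idea in plain Python. The keywords, group names and messages below are invented for illustration; the notebook itself reads them from CSV files.

```python
# Hypothetical keyword lists and messages, for illustration only
keywords = {
    "pump": ["buy now", "target"],
    "no_pump": ["scam"],
}
messages = [
    ("Group A", "buy now, target 3 reached"),
    ("Group B", "beware of this scam"),
    ("Group B", "good morning"),
]

# Record (group name, keyword, label) for every keyword found in a message
matches = []
for group, text in messages:
    for label, words in keywords.items():
        for word in words:
            if word in text:
                matches.append((group, word, label))

for match in matches:
    print(match)
```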

Requirements:
1. Upload a list of keywords you want to search for
2. Provide chat data from Telegram groups in CSV format
3. Adjust the path in the code
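
As a rough guide to the expected input: the loading code in this notebook reads the keyword file with its first column as the index and renames the two remaining columns to `pump` and `no_pump`, and it expects the message file to have an unnamed first column plus `text`, `date` and `chat` columns. The sketch below builds two tiny in-memory files with that layout. All cell values and the keyword file's header names are made up.

```python
import io
import pandas as pd

# Hypothetical keyword file: an index column plus one column per label.
# The columns may differ in length; empty cells are read as NaN.
keyword_csv = io.StringIO(
    ",pump,no_pump\n"
    "0,buy now,scam\n"
    "1,target,\n"
)
lists = pd.read_csv(keyword_csv, index_col=0)
lists.columns = ["pump", "no_pump"]
pump_words = lists["pump"].dropna()
no_pump_words = lists["no_pump"].dropna()

# Hypothetical message file: the unnamed first column is read as 'Unnamed: 0'
message_csv = io.StringIO(
    ",text,date,id,chat\n"
    "1_Group A_2022-02-15,buy now,2022-02-15 10:19:12,1,Group A\n"
    "2_Group B_2022-02-15,good morning,2022-02-15 10:20:00,2,Group B\n"
)
messages = pd.read_csv(message_csv, sep=",", na_values=["no info", "."])

print(list(pump_words), list(no_pump_words))
print(list(messages.columns))
```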

Let's go! :)

In [41]:
# settings
# feel free to change settings here
# only change other cells if you know what you are doing
keyword_set = 'paper' # choose one of 'paper', 'brainstorm', 'mixed'
batch_number = 6
message_path = 'data/raw/messages_batch{}_old.csv'.format(batch_number) # put in the path of the message file

In [42]:
# run this cell to change working directory to its parent directory
# %cd ..

In [43]:
# import libraries, load message data and keyword list
import csv
import pandas as pd 

#List of Keywords we want to search

lists = pd.read_csv('references/{}_kw.csv'.format(keyword_set),index_col=0)
lists.columns=['pump','no_pump']

pump_words=lists["pump"].dropna()
no_pump_words=lists["no_pump"].dropna()

#telegram chat export (csv-file)
searchlists=pd.read_csv(message_path, sep=',',na_values = ['no info', '.'])
searchlists.text= searchlists.text.astype('str')

In [44]:
# create dataset of all chat names and chat ids
chat_names = searchlists.chat.unique()
chat_ids = range(len(chat_names))
chat_name_history = pd.DataFrame(data=[chat_names,chat_ids]).T
chat_name_history.columns = ['chat_name','chat_id']

# add the chat ids to the message dataframe
searchlists = searchlists.merge(
    chat_name_history,
    how='left',
    left_on = 'chat',
    right_on = 'chat_name'
    )
searchlists = searchlists.set_index('Unnamed: 0')
searchlists.index.name = 'row_id'
# drop the chat_name column added by the merge (it duplicates 'chat')
searchlists.drop(['chat_name'],axis=1,inplace =True)
# simplify date column
searchlists.date = pd.to_datetime(searchlists.date).dt.tz_localize(None)
searchlists.reset_index(inplace = True)
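
The chat id assignment in this cell could also be sketched with `pd.factorize`, which numbers values in order of first appearance, the same order that `unique()` returns. The small DataFrame is made up for illustration. Note that `pd.factorize` assigns -1 to missing values by default.

```python
import pandas as pd

# Hypothetical message data with repeated chat names
df = pd.DataFrame({"chat": ["Group A", "Group A", "Group B", "Group A", "Group C"]})

# factorize returns integer codes and the unique values in order of appearance
codes, uniques = pd.factorize(df["chat"])
df["chat_id"] = codes

print(df)
print(list(uniques))
```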

In [45]:
searchlists

Unnamed: 0,row_id,text,date,id,chat,chat_id
0,5315_TRADE MARKETING_2022-02-15,incredibile ritiro congratulazioni credo che t...,2022-02-15 10:19:12,1248776993,TRADE MARKETING,0
1,5314_TRADE MARKETING_2022-02-15,ciao investitori. il mio profitto è arrivato ...,2022-02-15 10:16:48,2139743672,TRADE MARKETING,0
2,5313_TRADE MARKETING_2022-02-15,attenzione alla truffa❗️. attenzione alla truf...,2022-02-15 09:17:24,5005546897,TRADE MARKETING,0
3,5312_TRADE MARKETING_2022-02-15,la banca usa i tuoi soldi per fare soldi prest...,2022-02-15 09:11:01,5005546897,TRADE MARKETING,0
4,5311_TRADE MARKETING_2022-02-15,,2022-02-15 09:06:34,5005546897,TRADE MARKETING,0
...,...,...,...,...,...,...
9169,2429_Crypto Rocket ®_2019-09-16,#celr (binance )..buy around 54-50..sell - 56-...,2019-09-16 19:25:43,-1001196424883,Crypto Rocket ®,63
9170,2421_Crypto Rocket ®_2019-09-07,🚨 huge sale on premium membership last day for...,2019-09-07 17:43:54,-1001196424883,Crypto Rocket ®,63
9171,2419_Crypto Rocket ®_2019-09-04,bitmex.#btc/usd take-profit target 3 ✅.profit:...,2019-09-04 20:01:58,-1001196424883,Crypto Rocket ®,63
9172,2418_Crypto Rocket ®_2019-09-04,#dlt (binance)..buy around 355-340..sell – 370...,2019-09-04 12:27:54,-1001196424883,Crypto Rocket ®,63


In [46]:
# simple function to state if a substring was found in a string
# this is not necessary but it's easier to understand than alternative solutions
def key_found(key,msg):
    # True if key occurs anywhere in the message text
    return key in str(msg)
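
For larger message files, the same substring check can be sketched with pandas' vectorized `Series.str.contains`. Passing `regex=False` keeps the literal substring semantics of `key_found`, so keywords containing characters such as `.` or `+` are not interpreted as patterns. The sample messages are made up.

```python
import pandas as pd

# Hypothetical messages, for illustration only
text = pd.Series(["buy now at 0.5", "sell later", "nothing here"])

# Literal (non-regex) substring test, one boolean per message
found = text.str.contains("0.5", regex=False)

print(found.tolist())
```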

In [47]:
# iterate over keyword list
# create new col per keyword to state if the keyword was found in the message
for keyword in pump_words:
    searchlists.loc[:,keyword] = [
        key_found(keyword, msg)
        for msg
        in searchlists.text
    ]
    searchlists.text = [
        msg.replace(keyword,'**'+keyword+'**')
        for msg in searchlists.text
    ]

# same process for no pump words
for keyword in no_pump_words:
    searchlists.loc[:,keyword] = [
        key_found(keyword, msg)
        for msg
        in searchlists.text
    ]
    # mark the keyword inside the loop so every no pump word gets highlighted
    searchlists.text = [
        msg.replace(keyword,'##'+keyword+'##')
        for msg in searchlists.text
    ]

# get the number of keywords found per message
searchlists.loc[:,'n_pump_words'] = (searchlists[pump_words]).sum(axis=1)
searchlists.loc[:,'n_nopump_words'] = (searchlists[no_pump_words]).sum(axis=1)
searchlists.loc[:,'n_words'] = searchlists.n_pump_words + searchlists.n_nopump_words

searchlists.sort_values('n_words',ascending=False,inplace=True)
msg_sus_df = searchlists.loc[searchlists.n_words>0][['text','date','chat','n_words']]
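
One thing to be aware of: the loops in this cell search the text after earlier keywords have already been wrapped in markers, so a longer keyword that contains a shorter, already marked one (for example 'pump signal' after 'pump') can no longer be found. A possible workaround, sketched here on made-up data, is to match against an untouched copy of the text and only mark the display copy.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"text": ["pump signal in 5 minutes", "no news today"]})
pump_words = ["pump", "pump signal"]

raw_text = df["text"].copy()  # untouched copy used for matching
for keyword in pump_words:
    # match against the raw text so earlier markers cannot hide a keyword
    df[keyword] = raw_text.str.contains(keyword, regex=False)
    # mark only the display copy
    df["text"] = df["text"].str.replace(keyword, "**" + keyword + "**", regex=False)

# count how many keywords were found per message
df["n_pump_words"] = df[pump_words].sum(axis=1)
print(df[["text", "n_pump_words"]])
```

In this sketch the longer keyword is counted but not highlighted, because the display text already contains the markers of the shorter one.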

In [48]:
msg_sus_df.to_csv('data/chats/sus_msgs_{}_batch{}.csv'.format(keyword_set,batch_number))