## Training dataset formats?

Model 1: logistic regression.
Unit of analysis: a pair of discrete messages in a chat log.
INPUT: one feature list describing a pair of messages.
OUTPUT: Boolean, whether the two messages belong to the same conversation.
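
To make this input/output contract concrete, here is a minimal sketch of how a logistic regression would score one message pair. The weights and bias are made-up values for illustration, not fitted parameters, and `predict_same_conversation` is a hypothetical helper that does not appear elsewhere in this notebook.

```python
import math

# Hypothetical weights for illustration only; a real model would learn these from data.
WEIGHTS = {"time_distance": -0.4, "same_user": 1.5}
BIAS = 0.5


def predict_same_conversation(features, threshold=0.5):
    """Return (probability, boolean decision) for one message pair."""
    # Linear score: bias plus the weighted sum of the feature values.
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    # The logistic (sigmoid) function maps the score to a probability in (0, 1).
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, probability >= threshold


close_pair = {"time_distance": 1, "same_user": 1}
distant_pair = {"time_distance": 30, "same_user": 0}
print(predict_same_conversation(close_pair))
print(predict_same_conversation(distant_pair))
```

With these assumed weights, the pair that is close in time and written by the same user scores above the threshold, while the distant pair scores far below it.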
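### Idea 1: n x m sample of data
Get every pair of messages within a given timeframe and build training data in X, y format, with n message pairs as rows and m features as columns:

|                | feature_1 | feature_2 | feature_3 |
|----------------|-----------|-----------|-----------|
| message_pair_1 |           |           |           |
| message_pair_2 |           |           |           |
| message_pair_3 |           |           |           |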
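
As a rough illustration of the X, y layout, the sketch below turns a list of per-pair feature dictionaries into a feature matrix and a label vector. The rows and the `FEATURE_NAMES` list are made-up examples, not data from the chat logs.

```python
# Hypothetical feature rows, one dict per message pair (values are made up).
rows = [
    {"time_distance": 2, "same_user": 0, "label": 0},
    {"time_distance": 0, "same_user": 1, "label": 1},
    {"time_distance": 4, "same_user": 0, "label": 0},
]

FEATURE_NAMES = ["time_distance", "same_user"]

# X has one row per pair and one column per feature; y holds the labels.
X = [[row[name] for name in FEATURE_NAMES] for row in rows]
y = [row["label"] for row in rows]

print(X)  # [[2, 0], [0, 1], [4, 0]]
print(y)  # [0, 1, 0]
```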

### Idea 2: Feed each pair individually to the model
Write a function that gets every pair of messages within a given timeframe and preprocesses each pair into a feature list, which is then fed to the model one pair at a time.
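
One way to sketch this idea is a generator that yields the features for one pair at a time, so the full pair list never has to be held in memory. The function name `iter_pair_features` and the dictionary keys `minutes` and `user` are assumptions made for this example.

```python
from itertools import combinations


def iter_pair_features(messages):
    """Yield one feature dict per pair of messages, earlier message first.

    `messages` is assumed to be a list of dicts with the hypothetical keys
    'minutes' (int) and 'user' (str), sorted by time.
    """
    for earlier, later in combinations(messages, 2):
        yield {
            "time_distance": later["minutes"] - earlier["minutes"],
            "same_user": int(later["user"] == earlier["user"]),
        }


# Made-up messages for illustration.
messages = [
    {"minutes": 738, "user": "alice"},
    {"minutes": 739, "user": "bob"},
    {"minutes": 741, "user": "alice"},
]

for features in iter_pair_features(messages):
    print(features)
```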


## Features to extract
1. Time distance between messages in minutes (integer)
2. Same or different user (boolean)
3. Whether the primary message mentions the user who wrote the candidate message (boolean)

In [2]:
class Message:
    def __init__(self, timestamp, user_id, message, mention, file_ind, uuid, uuid_parent=None, conversation_ind=None):
        self.timestamp = timestamp
        self.user = user_id
        self.message = message
        self.file_ind = file_ind
        self.uuid = uuid
        self.uuid_parent = uuid_parent
        self.conversation_ind = conversation_ind
        self.has_mention = mention > -1
        self.mention = mention

    def __lt__(self, other):
        return self.file_ind < other.file_ind

    def to_dict(self):
        return vars(self)
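
One caveat about `to_dict`: `vars(self)` returns the instance's own attribute dictionary rather than a copy, so modifying the returned dictionary also modifies the object. The sketch below shows the difference using a stand-in class (`Point` and its two methods are hypothetical and not part of this notebook).

```python
class Point:
    def __init__(self, x):
        self.x = x

    def to_dict_shared(self):
        # vars() returns the live __dict__ of the instance.
        return vars(self)

    def to_dict_copy(self):
        # dict() makes a shallow copy that is safe to modify.
        return dict(vars(self))


p = Point(1)
p.to_dict_copy()["x"] = 99
print(p.x)  # 1: the copy is independent of the object
p.to_dict_shared()["x"] = 99
print(p.x)  # 99: the shared dict is the object's own state
```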

In [3]:
class MessageProcessor:
    DEFAULT_PARAMS = {
        'message_cur': '',
        'message_prev': '',
        'same_user': 0,
        'time_distance': -1,
        'label': -1
    }
    # All methods are static, so no __init__ is needed.
    @staticmethod
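    def process_reply_pair(message_primary, message_candidate):
        """
        Build the feature dictionary for a message/reply candidate pair.

        :param message_primary: Message object, primary (later) message
        :param message_candidate: Message object, potential antecedent message
        :return: copy of DEFAULT_PARAMS filled in with features; label is 1 if the
                 primary message is a direct reply to the candidate
        """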

        results = MessageProcessor.DEFAULT_PARAMS.copy()

        results['message_cur'] = message_primary.uuid
        results['message_prev'] = message_candidate.uuid
        results['same_user'] = MessageProcessor._check_user_pair(message_primary, message_candidate)
        results['mentions_user'] = MessageProcessor._mentions_user(message_primary, message_candidate)
        results['time_distance'] = MessageProcessor._calculate_time_distance(message_primary, message_candidate)
        results['label'] = MessageProcessor._check_is_reply(message_primary, message_candidate)
        return results

    @staticmethod
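    def process_same_convo(message_primary, message_candidate):
        """
        Build the feature dictionary for a same-conversation candidate pair.

        :param message_primary: Message object, primary (later) message
        :param message_candidate: Message object, earlier message
        :return: copy of DEFAULT_PARAMS filled in with features; label is 1 if both
                 messages share the same conversation_ind
        """
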
        results = MessageProcessor.DEFAULT_PARAMS.copy()

        results['message_cur'] = message_primary.uuid
        results['message_prev'] = message_candidate.uuid
        results['same_user'] = MessageProcessor._check_user_pair(message_primary, message_candidate)
        results['mentions_user'] = MessageProcessor._mentions_user(message_primary, message_candidate)
        results['time_distance'] = MessageProcessor._calculate_time_distance(message_primary, message_candidate)
        results['label'] = MessageProcessor._check_is_same_convo(message_primary, message_candidate)
        return results

    @staticmethod
    def _check_user_pair(message_primary, message_candidate):
        return int(message_primary.user == message_candidate.user)

    @staticmethod
    def _calculate_time_distance(message_primary, message_candidate):
        # In this notebook timestamps come from the 'minutes' column, so the distance is in minutes.
        if message_primary.timestamp < message_candidate.timestamp:
            raise ValueError(f"Primary message {message_primary.uuid} is before candidate {message_candidate.uuid}.")
        return message_primary.timestamp - message_candidate.timestamp

    @staticmethod
    def _mentions_user(message_primary, message_candidate):
        return int(message_primary.mention == message_candidate.user)

    # Methods that generate labels
    @staticmethod
    def _check_is_reply(message_primary, message_candidate):
        return int(message_primary.uuid_parent == message_candidate.uuid)

    @staticmethod
    def _check_is_same_convo(message_primary, message_candidate):
        return int(message_primary.conversation_ind == message_candidate.conversation_ind)

In [4]:
import pandas as pd

In [5]:
dev_df = pd.read_csv('../data/cleaned/agg_dev.csv', index_col=0)

In [6]:
dev_convo_df = pd.read_csv('../data/cleaned/dev_conversations.csv', index_col=0)

In [7]:
dev_convo_df[dev_convo_df['date'] == '2005-08-08'].sort_values('file_ind')

Unnamed: 0,raw,file_ind,timestamp,minutes,user,user_ind,message,mentions,date,parent,child,file_ind_parent,uuid_parent,conversation_ind
2005_08_08_12_28_aceb747_685,[12:28] <aceb747> is someone was comparing the...,685,12:28,748,aceb747,9541,is someone was comparing the ubuntu repository...,-1,2005-08-08,685.0,685.0,685.0,2005_08_08_12_28_aceb747_685,111
2005_08_08_12_56_LasseL_975,[12:56] <LasseL> I wonder how ubuntu is going ...,975,12:56,776,LasseL,10687,I wonder how ubuntu is going to deal with open...,-1,2005-08-08,975.0,975.0,975.0,2005_08_08_12_56_LasseL_975,112
2005_08_08_12_58_harold__990,[12:58] <harold_> Anyone: Know how to compile ...,990,12:58,778,harold_,14790,Anyone: Know how to compile madwifi for a Powe...,-1,2005-08-08,990.0,990.0,990.0,2005_08_08_12_58_harold__990,113
2005_08_08_12_59_ubotu_999,[12:59] <ubotu> somebody said paste was please...,999,12:59,779,ubotu,12983,somebody said paste was please use http://past...,-1,2005-08-08,999.0,999.0,999.0,2005_08_08_12_59_ubotu_999,114
2005_08_08_01_00_MartenH_1000,[01:00] <MartenH> What does this error mean an...,1000,01:00,780,MartenH,11748,What does this error mean and how can I correc...,-1,2005-08-08,1000.0,1000.0,1000.0,2005_08_08_01_00_MartenH_1000,115
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2005_08_08_01_23_kaffeend_1245,[01:23] <kaffeend> heya,1245,01:23,803,kaffeend,6525,heya,-1,2005-08-08,1245.0,1245.0,1245.0,2005_08_08_01_23_kaffeend_1245,180
2005_08_08_System_System_1246,=== macgyver2 [~macgyver2@macgyver2.user] has...,1246,System,803,System,3792,,-1,2005-08-08,1246.0,1246.0,1246.0,2005_08_08_System_System_1246,181
2005_08_08_System_System_1247,=== boga [~boga@CPE0011095f2041-CM00e06f240dd8...,1247,System,803,System,3792,,-1,2005-08-08,1247.0,1247.0,1247.0,2005_08_08_System_System_1247,182
2005_08_08_System_System_1248,=== HollowFrank [~HollowFra@ACCB2D21.ipt.aol.c...,1248,System,803,System,3792,,-1,2005-08-08,1248.0,1248.0,1248.0,2005_08_08_System_System_1248,183


In [8]:
dev_df.head()

Unnamed: 0,raw,file_ind,timestamp,minutes,user,user_ind,message,mentions,date,uuid,parent,child,file_ind_parent,uuid_parent
0,"[12:18] <|trey|> usual, quite stable though :)",0,12:18,738,|trey|,10946,"usual, quite stable though :)",7183,2004-11-15,2004_11_15_12_18_|trey|_0,,,,
1,[12:18] <tweaked> HrdwrBoB: ok how many partit...,1,12:18,738,tweaked,1375,HrdwrBoB: ok how many partitions should i make?,6814,2004-11-15,2004_11_15_12_18_tweaked_1,,,,
2,"[12:18] <Matt|> |trey|, top in the list --> ub...",2,12:18,738,Matt|,6784,ubuntu servers,-1,2004-11-15,2004_11_15_12_18_Matt|_2,,,,
3,[12:18] <usual> a few libs and media,3,12:18,738,usual,7183,a few libs and media,-1,2004-11-15,2004_11_15_12_18_usual_3,,,,
4,[12:18] <usual> maybe some others,4,12:18,738,usual,7183,maybe some others,-1,2004-11-15,2004_11_15_12_18_usual_4,,,,


In [9]:
labeled_df = dev_df[dev_df['parent'].notnull()]

In [10]:
labeled_df.head()

Unnamed: 0,raw,file_ind,timestamp,minutes,user,user_ind,message,mentions,date,uuid,parent,child,file_ind_parent,uuid_parent
685,[01:35] <djtansey> i have a problem re: k3b an...,685,01:35,815,djtansey,948,i have a problem re: k3b and am looking for so...,-1,2004-11-15,2004_11_15_01_35_djtansey_685,685.0,685.0,685.0,2004_11_15_01_35_djtansey_685
1000,[03:01] <LinuxJones> night all :),1000,03:01,901,LinuxJones,1839,night all :),-1,2004-11-15,2004_11_15_03_01_LinuxJones_1000,1000.0,1000.0,1000.0,2004_11_15_03_01_LinuxJones_1000
1001,=== yohannes [~yohannes@adsl-67-112-218-21.dsl...,1001,03:10,910,System,3792,,-1,2004-11-15,2004_11_15_03_10_System_1001,1001.0,1001.0,1001.0,2004_11_15_03_10_System_1001
1002,[03:10] <yohannes> can anyone recommend any ap...,1002,03:10,910,yohannes,14260,can anyone recommend any app to create/open *....,-1,2004-11-15,2004_11_15_03_10_yohannes_1002,1002.0,1002.0,1002.0,2004_11_15_03_10_yohannes_1002
1003,"[03:10] <Hikaru79> yohannes, why not WinRAR?",1003,03:10,910,Hikaru79,9550,"yohannes, why not WinRAR?",14260,2004-11-15,2004_11_15_03_10_Hikaru79_1003,1002.0,1003.0,1002.0,2004_11_15_03_10_yohannes_1002


In [11]:
dev_convo_df.shape

(2642, 14)

In [12]:
MESSAGE_SIZE = 1000
message_list = []
for ind, row in dev_convo_df.iloc[0:MESSAGE_SIZE].iterrows():
    message = Message(timestamp=row['minutes'],
                      user_id=row['user_ind'],
                      message=row['message'],
                      file_ind=row['file_ind'],
                      mention=row['mentions'],
                      uuid=ind,
                      uuid_parent=None,
                      conversation_ind=row['conversation_ind'])
    message_list.append(message)

In [13]:
message_list[0].uuid

'2004_11_15_01_35_djtansey_685'

In [19]:
MessageProcessor.process_same_convo(message_list[1], message_list[0])

{'message_cur': '2004_11_15_03_41_djtansey_1087',
 'message_prev': '2004_11_15_01_35_djtansey_685',
 'same_user': 1,
 'time_distance': 126,
 'label': 1,
 'mentions_user': 0}

In [20]:
class DatasetBuilder:
    def __init__(self):
        self.training_data = pd.DataFrame()
        self.labels = None

    @property
    def size(self):
        # Computed on demand so it always reflects the current number of rows.
        return len(self.training_data)

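    def load_pair(self, data):
        """
        Append the features of a single message pair as a new row.
        :param data: dict of features, as returned by MessageProcessor
        :return: None
        """
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
        self.training_data = pd.concat([self.training_data, pd.DataFrame([data])], ignore_index=True)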

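    def load_messages(self, message_list, window=0):
        """
        Load messages and build the dataset in same-conversation format.
        :param message_list: list of Message objects, sorted by timestamp when a window is used
        :param window: maximum time distance between paired messages; 0 pairs every message with every other
        :return: None
        """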

        if window:
            pair_list = self.pair_generator_window(message_list, window)
        else:
            pair_list = self.pair_generator(message_list)

        for pair in pair_list:
            prev_message, cur_message = pair
            data = MessageProcessor.process_same_convo(cur_message, prev_message)
            self.load_pair(data)

    @staticmethod
    def pair_generator(message_list):
        """
        Takes list of Message objects and returns a combination of all possible pairs
        :param message_list:
        :return: list of length-2 tuples
        """
        pair_list = []
        for idx, message_1 in enumerate(message_list):
            for message_2 in message_list[idx + 1:]:
                if message_1 < message_2:
                    pair_list.append((message_1, message_2))
                else:
                    pair_list.append((message_2, message_1))

        return pair_list

    @staticmethod
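    def pair_generator_window(message_list, window):
        """
        Takes a list of Message objects sorted by timestamp and returns all pairs
        whose timestamps are at most `window` apart
        :param message_list: list of Message objects, sorted by timestamp
        :param window: maximum timestamp difference allowed within a pair
        :return: list of length-2 tuples
        """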
        pair_list = []
        for idx, message_1 in enumerate(message_list):
            for message_2 in message_list[idx+1:]:
                if message_2.timestamp - message_1.timestamp > window:
                    break
                if message_1 < message_2:
                    pair_list.append((message_1, message_2))
                else:
                    pair_list.append((message_2, message_1))
        return pair_list

    def export_data(self, path):
        self.training_data.to_csv(path)
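
Growing a DataFrame one row at a time copies the whole frame on every call, so the cost rises quickly with the number of pairs. One common alternative, sketched here under the assumption that all rows fit in memory, is to collect the feature dictionaries in a plain list and build the frame once at the end. The rows below are made-up stand-ins for `MessageProcessor` output.

```python
import pandas as pd

# Hypothetical feature dicts standing in for MessageProcessor output.
rows = []
for i in range(3):
    rows.append({
        "message_cur": f"msg_{i + 1}",
        "message_prev": f"msg_{i}",
        "same_user": i % 2,
        "time_distance": i,
        "label": 0,
    })

# Build the frame once from the accumulated rows instead of appending row by row.
training_data = pd.DataFrame(rows)
print(training_data.shape)  # (3, 5)
```

A builder class could keep such a list internally and only convert it to a DataFrame when the data is requested or exported.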

In [24]:
class DatasetBuilderReply(DatasetBuilder):
    def __init__(self):
        super().__init__()

    def load_messages(self, message_list, window=0):
        """
        Load messages and create dataset on message-reply format
        :param message_list: list of Message Objects
        :return: None
        """

        if window:
            pair_list = self.pair_generator_window(message_list, window)
        else:
            pair_list = self.pair_generator(message_list)

        for pair in pair_list:
            prev_message, cur_message = pair
            data = MessageProcessor.process_reply_pair(cur_message, prev_message)
            self.load_pair(data)

In [25]:
MESSAGE_SIZE = 1000
message_list = []
day_list = list(set(dev_convo_df['date'].values))

In [42]:
dataset_builder = DatasetBuilderReply()

for day in day_list:
    message_list = []

    day_df = dev_convo_df[dev_convo_df['date'] == day].sort_values('file_ind')
    for ind, row in day_df.iterrows():
        message = Message(timestamp=row['minutes'],
                          user_id=row['user_ind'],
                          message=row['message'],
                          file_ind=row['file_ind'],
                          mention=row['mentions'],
                          uuid=ind,
                          uuid_parent=row['uuid_parent'],
                          conversation_ind=row['conversation_ind'])
        message_list.append(message)

    dataset_builder.load_messages(message_list, window=5)
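
The `window=5` argument limits candidates to messages at most 5 minutes apart, which cuts the number of pairs considerably compared with pairing every message with every other. The sketch below counts pairs both ways on made-up timestamps. `count_pairs` is a hypothetical helper, and unlike the class above it uses `None` rather than `0` to mean "no window".

```python
def count_pairs(minutes, window=None):
    """Count message pairs, optionally only those at most `window` minutes apart.

    `minutes` is assumed to be sorted in ascending order, which is what
    makes the early `break` safe.
    """
    count = 0
    for i, earlier in enumerate(minutes):
        for later in minutes[i + 1:]:
            if window is not None and later - earlier > window:
                break  # every remaining message is even further away
            count += 1
    return count


# Hypothetical timestamps in minutes, sorted ascending.
minutes = [738, 738, 740, 744, 750, 751]
print(count_pairs(minutes))            # 15, i.e. 6 * 5 / 2
print(count_pairs(minutes, window=5))  # 5
```

The early `break` is only correct when the messages are sorted by timestamp, which is why each day is sorted and processed separately.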

In [43]:
day_df

Unnamed: 0,raw,file_ind,timestamp,minutes,user,user_ind,message,mentions,date,parent,child,file_ind_parent,uuid_parent,conversation_ind
2005_06_27_11_47_s00d_830,[11:47] <s00d> Hi. I'm having a little trouble...,830,11:47,1427,s00d,2173,Hi. I'm having a little trouble installing IPT...,-1,2005-06-27,830.0,830.0,830.0,2005_06_27_11_47_s00d_830,66
2005_06_27_11_53_lukus001_909,[11:53] <lukus001> How can i install w32codec ...,909,11:53,1433,lukus001,7817,How can i install w32codec on chroot,-1,2005-06-27,909.0,909.0,909.0,2005_06_27_11_53_lukus001_909,67
2005_06_27_11_55_bob2_936,[11:55] <bob2> lukus001: wiki.ubuntu.com/Restr...,936,11:55,1435,bob2,388,lukus001: wiki.ubuntu.com/RestrictedFormats,7817,2005-06-27,936.0,936.0,936.0,2005_06_27_11_55_bob2_936,68
2005_06_27_11_59_bob2_999,[11:59] <bob2> microhaxo: ok!,999,11:59,1439,bob2,388,microhaxo: ok!,3776,2005-06-27,999.0,999.0,999.0,2005_06_27_11_59_bob2_999,69
2005_06_27_12_00_bob2_1000,[12:00] <bob2> oga: please keep things in the ...,1000,12:00,1440,bob2,388,oga: please keep things in the channel,10055,2005-06-27,1000.0,1000.0,1000.0,2005_06_27_12_00_bob2_1000,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2005_06_27_12_28_Dalkus_1245,"[12:28] <Dalkus> Vassilis, dont ask to ask, ju...",1245,12:28,1468,Dalkus,15170,"Vassilis, dont ask to ask, just ask :)",3545,2005-06-27,1239.0,1245.0,1239.0,2005_06_27_12_27_Vassilis_1239,110
2005_06_27_12_28_microhaxo_1246,[12:28] <microhaxo> but WHAT Games can i play?!,1246,12:28,1468,microhaxo,3776,but WHAT Games can i play?!,-1,2005-06-27,1244.0,1246.0,1244.0,2005_06_27_12_28_microhaxo_1244,109
2005_06_27_12_28_Vassilis_1247,[12:28] <Vassilis> is there a way to pm just n...,1247,12:28,1468,Vassilis,3545,is there a way to pm just not fill the window?,-1,2005-06-27,1245.0,1247.0,1245.0,2005_06_27_12_28_Dalkus_1245,110
2005_06_27_12_28_microhaxo_1248,[12:28] <microhaxo> who wants to play tux race...,1248,12:28,1468,microhaxo,3776,"who wants to play tux racer, integrated graphi...",-1,2005-06-27,1246.0,1248.0,1246.0,2005_06_27_12_28_microhaxo_1246,109


In [44]:
dev_convo_df[dev_convo_df['date'] == '2011-05-29'].sort_values('file_ind')

Unnamed: 0,raw,file_ind,timestamp,minutes,user,user_ind,message,mentions,date,parent,child,file_ind_parent,uuid_parent,conversation_ind
2011_05_29_18_59_gmachine_24_998,[18:59] <gmachine_24> I want to copy/clone the...,998,18:59,1859,gmachine_24,16665,I want to copy/clone the hard drive where I st...,-1,2011-05-29,998.0,998.0,998.0,2011_05_29_18_59_gmachine_24_998,372
2011_05_29_18_59_ActionParsnip_999,"[18:59] <ActionParsnip> EvoGamer: i see, so yo...",999,18:59,1859,ActionParsnip,4695,"EvoGamer: i see, so you are taking an image wi...",4389,2011-05-29,999.0,999.0,999.0,2011_05_29_18_59_ActionParsnip_999,373
2011_05_29_19_00_Usuario_1000,[19:00] <Usuario> what do I have to type in pa...,1000,19:00,1860,Usuario,9590,what do I have to type in password if I have n...,-1,2011-05-29,1000.0,1000.0,1000.0,2011_05_29_19_00_Usuario_1000,374
2011_05_29_19_00_EvoGamer_1001,"[19:00] <EvoGamer> ActionParsnip, yes",1001,19:00,1860,EvoGamer,4389,"ActionParsnip, yes",4695,2011-05-29,999.0,1001.0,999.0,2011_05_29_18_59_ActionParsnip_999,373
2011_05_29_19_00_ActionParsnip_1002,[19:00] <ActionParsnip> gmachine_24: if it is ...,1002,19:00,1860,ActionParsnip,4695,"gmachine_24: if it is ext based filesystem, yo...",16665,2011-05-29,998.0,1002.0,998.0,2011_05_29_18_59_gmachine_24_998,372
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2011_05_29_19_45_sudokill_1245,[19:45] <sudokill> burn whatever to cd,1245,19:45,1905,sudokill,15974,burn whatever to cd,-1,2011-05-29,1244.0,1245.0,1244.0,2011_05_29_19_45_toad`_1244,411
2011_05_29_19_45_System_1246,=== purpleGecko is now known as GeeksAreForLife,1246,19:45,1905,System,3792,,-1,2011-05-29,1246.0,1246.0,1246.0,2011_05_29_19_45_System_1246,415
2011_05_29_19_45_toad`_1247,[19:45] <toad`> I'm wondering tho,1247,19:45,1905,toad`,16325,I'm wondering tho,-1,2011-05-29,1244.0,1247.0,1244.0,2011_05_29_19_45_toad`_1244,411
2011_05_29_19_45_sudokill_1248,[19:45] <sudokill> use the built in iso burner...,1248,19:45,1905,sudokill,15974,"use the built in iso burner, infrarecorder or ...",-1,2011-05-29,1245.0,1248.0,1245.0,2011_05_29_19_45_sudokill_1245,411


In [45]:
training_data = dataset_builder.training_data
training_data

Unnamed: 0,message_cur,message_prev,same_user,time_distance,label,mentions_user
0,2009_03_03_09_59_ActionParsnip_998,2009_03_03_09_57_ubottu_995,0.0,2.0,0.0,0.0
1,2009_03_03_09_59_ActionParsnip_999,2009_03_03_09_57_ubottu_995,0.0,2.0,0.0,0.0
2,2009_03_03_10_00_ActionParsnip_1000,2009_03_03_09_57_ubottu_995,0.0,3.0,0.0,0.0
3,2009_03_03_10_01_ActionParsnip_1001,2009_03_03_09_57_ubottu_995,0.0,4.0,0.0,0.0
4,2009_03_03_10_01_ActionParsnip_1002,2009_03_03_09_57_ubottu_995,0.0,4.0,0.0,0.0
...,...,...,...,...,...,...
79626,2005_06_27_12_28_microhaxo_1248,2005_06_27_12_28_microhaxo_1246,1.0,0.0,1.0,0.0
79627,2005_06_27_12_28_microhaxo_1249,2005_06_27_12_28_microhaxo_1246,1.0,0.0,0.0,0.0
79628,2005_06_27_12_28_microhaxo_1248,2005_06_27_12_28_Vassilis_1247,0.0,0.0,0.0,0.0
79629,2005_06_27_12_28_microhaxo_1249,2005_06_27_12_28_Vassilis_1247,0.0,0.0,1.0,0.0


In [46]:
training_data[training_data['label'] == 1].shape

(2150, 6)

In [47]:
training_data.shape

(79631, 6)
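
These two shapes show that the classes are heavily imbalanced: only a small fraction of candidate pairs are actual replies. The sketch below works out the positive rate and, as one possible remedy, the per-class weights given by the common "balanced" heuristic `n_samples / (n_classes * class_count)`. Whether to weight the classes at all is a modelling choice that this notebook does not make.

```python
n_total = 79631     # rows in training_data (from the shape above)
n_positive = 2150   # rows with label == 1
n_negative = n_total - n_positive

positive_rate = n_positive / n_total
print(f"positive rate: {positive_rate:.2%}")  # positive rate: 2.70%

# "Balanced" class weights: n_samples / (n_classes * class_count)
weight_positive = n_total / (2 * n_positive)
weight_negative = n_total / (2 * n_negative)
print(round(weight_positive, 2), round(weight_negative, 2))  # 18.52 0.51
```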

In [48]:
dataset_builder.export_data("training_reply_data.csv")