## Training dataset formats?

Model 1: logistic regression.
Unit of analysis: a pair of discrete messages from a chat log.
INPUT: one feature list describing the pair of messages.
OUTPUT: Boolean, whether the two messages belong to the same conversation.
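
To make the input/output contract above concrete, here is a minimal, dependency-free sketch of logistic regression on pair features. The toy rows, the scaling of `time_distance` by 20, and the names `fit_logistic` and `predict_proba` are assumptions made for illustration only; on the real training data a library implementation would normally be used instead.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    """Probability that the pair described by x is in the same conversation."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy rows: [same_user, time_distance / 20]; the labels are made up.
X = [[1, 0.05], [1, 0.10], [0, 0.15], [1, 0.90], [0, 0.80], [0, 0.95]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(X, y)

close_pair = predict_proba(w, b, [1, 0.05])
distant_pair = predict_proba(w, b, [0, 0.95])
print(close_pair > distant_pair)
```

The Boolean output is then obtained by thresholding the probability, for example at 0.5.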
### Idea 1: n x m sample of data
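Get every pair of messages within a given timeframe and build the training data in X, y format, with one row per message pair and one column per feature:

| | feature_1 | feature_2 | feature_3 |
|---|---|---|---|
| message_pair_1 | | | |
| message_pair_2 | | | |
| message_pair_3 | | | |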
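
The X, y layout above can be sketched with plain lists. This is only an illustration: the helper name `to_xy` and the example rows are hypothetical, and the feature names are borrowed from the `same_user`, `time_distance` and `label` keys used later in this notebook.

```python
FEATURE_NAMES = ("same_user", "time_distance")  # assumed feature columns


def to_xy(pair_rows, feature_names=FEATURE_NAMES, label_name="label"):
    """Split per-pair feature dicts into a feature matrix X and a label vector y."""
    X = [[row[name] for name in feature_names] for row in pair_rows]
    y = [row[label_name] for row in pair_rows]
    return X, y


# Made-up rows, one per message pair.
pair_rows = [
    {"same_user": 1, "time_distance": 2, "label": 1},
    {"same_user": 0, "time_distance": 15, "label": 0},
    {"same_user": 0, "time_distance": 1, "label": 1},
]
X, y = to_xy(pair_rows)
print(X)
print(y)
```

Each row of `X` corresponds to one message pair and each column to one feature, which is the n x m shape most model-fitting APIs expect.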

### Idea 2: Feed each pair individually to the model
Create a function that generates every pair of messages within a given timeframe and preprocesses each pair into a feature list, which is then fed to the model one pair at a time.
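
A rough sketch of this per-pair approach, written as a generator so that pairs are featurized lazily instead of being materialized up front. The name `iter_pair_features`, the dictionary-based message representation and the sample messages are assumptions for illustration.

```python
from itertools import combinations


def iter_pair_features(messages, max_minutes):
    """Yield one feature dict per message pair that falls inside the timeframe."""
    for earlier, later in combinations(messages, 2):
        distance = later["minutes"] - earlier["minutes"]
        if distance > max_minutes:
            continue  # pair is outside the timeframe
        yield {
            "same_user": int(earlier["user"] == later["user"]),
            "time_distance": distance,
        }


# Made-up messages, already sorted by time.
messages = [
    {"user": "alice", "minutes": 738},
    {"user": "bob", "minutes": 740},
    {"user": "alice", "minutes": 800},
]
for features in iter_pair_features(messages, max_minutes=20):
    print(features)
```

Compared with Idea 1, nothing is stored between pairs, which keeps memory use flat but means the pairs must be regenerated for every pass over the data.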


## Features to extract
1. Time distance between messages in minutes (integer)
2. Same or different user (boolean)
3.

In [1]:
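class Message:
    """A single chat-log line plus the metadata needed to pair it with other messages."""

    def __init__(self, timestamp, user_id, message, file_ind, uuid, uuid_parent=None, conversation_ind=None):
        self.timestamp = timestamp  # in this notebook: the 'minutes' column
        self.user = user_id
        self.message = message
        self.file_ind = file_ind  # position of the line in the source log file
        self.uuid = uuid
        self.uuid_parent = uuid_parent  # uuid of the message this one replies to, if known
        self.conversation_ind = conversation_ind
        self.has_mention = False
        self.mention = None

    def __lt__(self, other):
        # Order messages by their position in the log file.
        return self.file_ind < other.file_ind

    def to_dict(self):
        # Return a copy so callers cannot mutate the instance through the dict.
        return dict(vars(self))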

In [2]:
class MessageProcessor:
    DEFAULT_PARAMS = {
        'message_cur': '',
        'message_prev': '',
        'same_user': 0,
        'time_distance': -1,
        'label': -1
    }
    @staticmethod
    def process_reply_pair(message_primary, message_candidate):
        """
        Build the feature dictionary for a (primary, candidate) pair, labeled 1
        when the primary message is a direct reply to the candidate.

        :param message_primary: Message object, primary message
        :param message_candidate: Message object, potential antecedent message
        :return: modified copy of DEFAULT_PARAMS
        """

        results = MessageProcessor.DEFAULT_PARAMS.copy()

        results['message_cur'] = message_primary.uuid
        results['message_prev'] = message_candidate.uuid
        results['same_user'] = MessageProcessor._check_user_pair(message_primary, message_candidate)
        results['time_distance'] = MessageProcessor._calculate_time_distance(message_primary, message_candidate)
        results['label'] = MessageProcessor._check_is_reply(message_primary, message_candidate)
        return results

    @staticmethod
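    def process_same_convo(message_primary, message_candidate):
        """
        Same features as process_reply_pair, but labeled 1 when both messages
        share a conversation_ind.
        """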
        results = MessageProcessor.DEFAULT_PARAMS.copy()

        results['message_cur'] = message_primary.uuid
        results['message_prev'] = message_candidate.uuid
        results['same_user'] = MessageProcessor._check_user_pair(message_primary, message_candidate)
        results['time_distance'] = MessageProcessor._calculate_time_distance(message_primary, message_candidate)
        results['label'] = MessageProcessor._check_is_same_convo(message_primary, message_candidate)
        return results

    @staticmethod
    def _check_user_pair(message_primary, message_candidate):
        return int(message_primary.user == message_candidate.user)

    @staticmethod
    def _calculate_time_distance(message_primary, message_candidate):
        if message_primary.timestamp < message_candidate.timestamp:
            raise ValueError(f"Primary message {message_primary.uuid} is before candidate {message_candidate.uuid}.")
        return message_primary.timestamp - message_candidate.timestamp

    @staticmethod
    def _check_is_reply(message_primary, message_candidate):
        return int(message_primary.uuid_parent == message_candidate.uuid)

    @staticmethod
    def _check_is_same_convo(message_primary, message_candidate):
        return int(message_primary.conversation_ind == message_candidate.conversation_ind)
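
One thing to watch with the equality-based label checks above: `None == None` is `True`, so two messages that both lack a `conversation_ind` would be labeled as the same conversation. Below is a small sketch of a stricter check, under the assumption that unlabeled pairs should be skipped rather than labeled. `same_convo_label` is a hypothetical helper, not part of `MessageProcessor`, and `SimpleNamespace` objects stand in for `Message` instances.

```python
from types import SimpleNamespace


def same_convo_label(primary, candidate):
    """Return 1 or 0 when both conversation ids are known, else None (unlabeled)."""
    if primary.conversation_ind is None or candidate.conversation_ind is None:
        return None
    return int(primary.conversation_ind == candidate.conversation_ind)


labeled_a = SimpleNamespace(conversation_ind=111)
labeled_b = SimpleNamespace(conversation_ind=111)
labeled_c = SimpleNamespace(conversation_ind=112)
unlabeled_a = SimpleNamespace(conversation_ind=None)
unlabeled_b = SimpleNamespace(conversation_ind=None)

print(same_convo_label(labeled_a, labeled_b))
print(same_convo_label(labeled_a, labeled_c))
print(same_convo_label(unlabeled_a, unlabeled_b))

# The plain equality check would count the unlabeled pair as a positive example.
naive_label = int(unlabeled_a.conversation_ind == unlabeled_b.conversation_ind)
print(naive_label)
```

The same consideration applies to the reply check whenever `uuid_parent` is left at its default of `None`.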

In [8]:
import pandas as pd

In [60]:
dev_df = pd.read_csv('../data/cleaned/agg_dev.csv', index_col=0)

In [61]:
dev_convo_df = pd.read_csv('../data/cleaned/dev_conversations.csv', index_col=0)

In [62]:
dev_convo_df[dev_convo_df['date'] == '2005-08-08'].sort_values('file_ind')

Unnamed: 0,raw,file_ind,timestamp,minutes,user,user_ind,message,date,parent,child,file_ind_parent,uuid_parent,conversation_ind
2005_08_08_12_28_aceb747_685,[12:28] <aceb747> is someone was comparing the...,685,12:28,748,aceb747,9541,is someone was comparing the ubuntu repositor...,2005-08-08,685.0,685.0,685.0,2005_08_08_12_28_aceb747_685,111
2005_08_08_12_56_LasseL_975,[12:56] <LasseL> I wonder how ubuntu is going ...,975,12:56,776,LasseL,10687,I wonder how ubuntu is going to deal with ope...,2005-08-08,975.0,975.0,975.0,2005_08_08_12_56_LasseL_975,112
2005_08_08_12_58_harold__990,[12:58] <harold_> Anyone: Know how to compile ...,990,12:58,778,harold_,14790,Anyone: Know how to compile madwifi for a Pow...,2005-08-08,990.0,990.0,990.0,2005_08_08_12_58_harold__990,113
2005_08_08_12_59_ubotu_999,[12:59] <ubotu> somebody said paste was please...,999,12:59,779,ubotu,12983,somebody said paste was please use http://pas...,2005-08-08,999.0,999.0,999.0,2005_08_08_12_59_ubotu_999,114
2005_08_08_01_00_MartenH_1000,[01:00] <MartenH> What does this error mean an...,1000,01:00,780,MartenH,11748,What does this error mean and how can I corre...,2005-08-08,1000.0,1000.0,1000.0,2005_08_08_01_00_MartenH_1000,115
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2005_08_08_01_23_kaffeend_1245,[01:23] <kaffeend> heya,1245,01:23,803,kaffeend,6525,heya,2005-08-08,1245.0,1245.0,1245.0,2005_08_08_01_23_kaffeend_1245,180
2005_08_08_System_System_1246,=== macgyver2 [~macgyver2@macgyver2.user] has...,1246,System,803,System,3792,,2005-08-08,1246.0,1246.0,1246.0,2005_08_08_System_System_1246,181
2005_08_08_System_System_1247,=== boga [~boga@CPE0011095f2041-CM00e06f240dd8...,1247,System,803,System,3792,,2005-08-08,1247.0,1247.0,1247.0,2005_08_08_System_System_1247,182
2005_08_08_System_System_1248,=== HollowFrank [~HollowFra@ACCB2D21.ipt.aol.c...,1248,System,803,System,3792,,2005-08-08,1248.0,1248.0,1248.0,2005_08_08_System_System_1248,183


In [63]:
dev_df.head()

Unnamed: 0,raw,file_ind,timestamp,minutes,user,user_ind,message,date,uuid,parent,child,file_ind_parent,uuid_parent
0,"[12:18] <|trey|> usual, quite stable though :)",0,12:18,738,|trey|,10946,"usual, quite stable though :)",2004-11-15,2004_11_15_12_18_|trey|_0,,,,
1,[12:18] <tweaked> HrdwrBoB: ok how many partit...,1,12:18,738,tweaked,1375,HrdwrBoB: ok how many partitions should i make?,2004-11-15,2004_11_15_12_18_tweaked_1,,,,
2,"[12:18] <Matt|> |trey|, top in the list --> ub...",2,12:18,738,Matt|,6784,ubuntu servers,2004-11-15,2004_11_15_12_18_Matt|_2,,,,
3,[12:18] <usual> a few libs and media,3,12:18,738,usual,7183,a few libs and media,2004-11-15,2004_11_15_12_18_usual_3,,,,
4,[12:18] <usual> maybe some others,4,12:18,738,usual,7183,maybe some others,2004-11-15,2004_11_15_12_18_usual_4,,,,


In [64]:
labeled_df = dev_df[dev_df['parent'].notnull()]

In [65]:
labeled_df.head()

Unnamed: 0,raw,file_ind,timestamp,minutes,user,user_ind,message,date,uuid,parent,child,file_ind_parent,uuid_parent
685,[01:35] <djtansey> i have a problem re: k3b an...,685,01:35,815,djtansey,948,i have a problem re: k3b and am looking for s...,2004-11-15,2004_11_15_01_35_djtansey_685,685.0,685.0,685.0,2004_11_15_01_35_djtansey_685
1000,[03:01] <LinuxJones> night all :),1000,03:01,901,LinuxJones,1839,night all :),2004-11-15,2004_11_15_03_01_LinuxJones_1000,1000.0,1000.0,1000.0,2004_11_15_03_01_LinuxJones_1000
1001,=== yohannes [~yohannes@adsl-67-112-218-21.dsl...,1001,03:10,910,System,3792,,2004-11-15,2004_11_15_03_10_System_1001,1001.0,1001.0,1001.0,2004_11_15_03_10_System_1001
1002,[03:10] <yohannes> can anyone recommend any ap...,1002,03:10,910,yohannes,14260,can anyone recommend any app to create/open *...,2004-11-15,2004_11_15_03_10_yohannes_1002,1002.0,1002.0,1002.0,2004_11_15_03_10_yohannes_1002
1003,"[03:10] <Hikaru79> yohannes, why not WinRAR?",1003,03:10,910,Hikaru79,9550,"yohannes, why not WinRAR?",2004-11-15,2004_11_15_03_10_Hikaru79_1003,1002.0,1003.0,1002.0,2004_11_15_03_10_yohannes_1002


In [66]:
dev_convo_df.shape

(2642, 13)

In [67]:
MESSAGE_SIZE = 1000
message_list = []
for ind, row in dev_convo_df.iloc[0:MESSAGE_SIZE].iterrows():
    message = Message(timestamp=row['minutes'],
                      user_id=row['user_ind'],
                      message=row['message'],
                      file_ind=row['file_ind'],
                      uuid=ind,
                      uuid_parent=None,
                      conversation_ind=row['conversation_ind'])
    message_list.append(message)

In [68]:
message_list[0].uuid

'2004_11_15_01_35_djtansey_685'

In [69]:
MessageProcessor.process_same_convo(message_list[1], message_list[0])

{'message_cur': '2004_11_15_03_41_djtansey_1087',
 'message_prev': '2004_11_15_01_35_djtansey_685',
 'same_user': 1,
 'time_distance': 126,
 'label': 1}

In [70]:
class DatasetBuilder:
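    def __init__(self):
        # Rows are buffered in a list: DataFrame.append was removed in pandas 2.0
        # and copied the whole frame on every call.
        self._rows = []
        self.labels = None

    @property
    def training_data(self):
        """All loaded pairs as a DataFrame, one row per pair."""
        return pd.DataFrame(self._rows)

    @property
    def size(self):
        """Number of pairs loaded so far."""
        return len(self._rows)

    def load_pair(self, data):
        """
        Load the feature dictionary of a single message pair.
        :param data: dict as returned by MessageProcessor.process_same_convo
        :return: None
        """
        self._rows.append(data)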

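    def load_messages(self, message_list, window=0):
        """
        Generate message pairs, featurize each pair and store the result.

        :param message_list: list of Message objects, sorted by timestamp
        :param window: maximum time distance between paired messages, in the
            units of Message.timestamp; 0 pairs every message with every other
        :return: None
        """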

        if window:
            pair_list = self.pair_generator_window(message_list, window)
        else:
            pair_list = self.pair_generator(message_list)

        for pair in pair_list:
            prev_message, cur_message = pair
            data = MessageProcessor.process_same_convo(cur_message, prev_message)
            self.load_pair(data)


    @staticmethod
    def pair_generator(message_list):
        """
        Takes a list of Message objects and returns every possible pair
        (all 2-combinations), each pair ordered by file_ind.
        :param message_list: list of Message objects
        :return: list of length-2 tuples
        """
        pair_list = []
        for idx, message_1 in enumerate(message_list):
            for message_2 in message_list[idx + 1:]:
                if message_1 < message_2:
                    pair_list.append((message_1, message_2))
                else:
                    pair_list.append((message_2, message_1))

        return pair_list

    @staticmethod
    def pair_generator_window(message_list, window):
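        """
        Takes a list of Message objects sorted by timestamp and returns every pair
        whose time distance is at most `window`, each pair ordered by file_ind.
        :param message_list: list of Message objects, sorted by timestamp
        :param window: maximum allowed time distance within a pair
        :return: list of length-2 tuples
        """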
        pair_list = []
        for idx, message_1 in enumerate(message_list):
            for message_2 in message_list[idx+1:]:
                if message_2.timestamp - message_1.timestamp > window:
                    break
                if message_1 < message_2:
                    pair_list.append((message_1, message_2))
                else:
                    pair_list.append((message_2, message_1))
        return pair_list

    def export_data(self, path):
        self.training_data.to_csv(path)
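
The reason for the windowed generator is that the number of unrestricted pairs grows as n(n-1)/2. The sketch below compares the two counts on made-up timestamps; `count_window_pairs` is a hypothetical helper that mirrors the loop in `pair_generator_window`. Note that the early `break` is only valid when the input is sorted by timestamp.

```python
from math import comb


def count_window_pairs(minutes, window):
    """Count pairs (i < j) with minutes[j] - minutes[i] <= window; input must be sorted."""
    count = 0
    for i, start in enumerate(minutes):
        for later in minutes[i + 1:]:
            if later - start > window:
                break  # sorted input: every following message is even further away
            count += 1
    return count


minutes = [738, 738, 740, 755, 760, 815]  # made-up, sorted timestamps
all_pairs = comb(len(minutes), 2)
windowed = count_window_pairs(minutes, window=20)
print(all_pairs, windowed)
```

On unsorted input the `break` would silently drop valid pairs, so sorting each day's messages before calling `load_messages` matters for correctness as well as speed.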

In [71]:
MESSAGE_SIZE = 1000
message_list = []
day_list = list(set(dev_convo_df['date'].values))

In [72]:
dataset_builder = DatasetBuilder()

for day in day_list:
    message_list = []

    day_df = dev_convo_df[dev_convo_df['date'] == day].sort_values('file_ind')
    for ind, row in day_df.iterrows():
        message = Message(timestamp=row['minutes'],
                          user_id=row['user_ind'],
                          message=row['message'],
                          file_ind=row['file_ind'],
                          uuid=ind,
                          uuid_parent=None,
                          conversation_ind=row['conversation_ind'])
        message_list.append(message)

    dataset_builder.load_messages(message_list, window=20)

In [73]:
day_df

Unnamed: 0,raw,file_ind,timestamp,minutes,user,user_ind,message,date,parent,child,file_ind_parent,uuid_parent,conversation_ind
2005_08_08_12_28_aceb747_685,[12:28] <aceb747> is someone was comparing the...,685,12:28,748,aceb747,9541,is someone was comparing the ubuntu repositor...,2005-08-08,685.0,685.0,685.0,2005_08_08_12_28_aceb747_685,111
2005_08_08_12_56_LasseL_975,[12:56] <LasseL> I wonder how ubuntu is going ...,975,12:56,776,LasseL,10687,I wonder how ubuntu is going to deal with ope...,2005-08-08,975.0,975.0,975.0,2005_08_08_12_56_LasseL_975,112
2005_08_08_12_58_harold__990,[12:58] <harold_> Anyone: Know how to compile ...,990,12:58,778,harold_,14790,Anyone: Know how to compile madwifi for a Pow...,2005-08-08,990.0,990.0,990.0,2005_08_08_12_58_harold__990,113
2005_08_08_12_59_ubotu_999,[12:59] <ubotu> somebody said paste was please...,999,12:59,779,ubotu,12983,somebody said paste was please use http://pas...,2005-08-08,999.0,999.0,999.0,2005_08_08_12_59_ubotu_999,114
2005_08_08_01_00_MartenH_1000,[01:00] <MartenH> What does this error mean an...,1000,01:00,780,MartenH,11748,What does this error mean and how can I corre...,2005-08-08,1000.0,1000.0,1000.0,2005_08_08_01_00_MartenH_1000,115
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2005_08_08_01_23_kaffeend_1245,[01:23] <kaffeend> heya,1245,01:23,803,kaffeend,6525,heya,2005-08-08,1245.0,1245.0,1245.0,2005_08_08_01_23_kaffeend_1245,180
2005_08_08_System_System_1246,=== macgyver2 [~macgyver2@macgyver2.user] has...,1246,System,803,System,3792,,2005-08-08,1246.0,1246.0,1246.0,2005_08_08_System_System_1246,181
2005_08_08_System_System_1247,=== boga [~boga@CPE0011095f2041-CM00e06f240dd8...,1247,System,803,System,3792,,2005-08-08,1247.0,1247.0,1247.0,2005_08_08_System_System_1247,182
2005_08_08_System_System_1248,=== HollowFrank [~HollowFra@ACCB2D21.ipt.aol.c...,1248,System,803,System,3792,,2005-08-08,1248.0,1248.0,1248.0,2005_08_08_System_System_1248,183


In [74]:
dataset_builder.training_data.shape

(218497, 5)

In [75]:
dataset_builder.export_data("training_data.csv")