## Training dataset formats?

Model 1: logistic regression.

- Unit of analysis: two discrete messages in a chat log.
- INPUT: one feature list describing a pair of messages
- OUTPUT: Boolean, whether the two messages belong to the same conversation

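
A minimal, dependency-free sketch of this kind of model, assuming two features per pair (time distance scaled to hours, and a same-user flag). The toy rows, learning rate, epoch count, and function names are illustrative assumptions, not values taken from the chat logs. In practice a library implementation such as scikit-learn's `LogisticRegression` would likely replace the hand-written training loop.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias with full-batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            # Prediction error for this pair: predicted probability minus true label
            error = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias

def predict_proba(weights, bias, row):
    """Probability that the pair described by `row` is in the same conversation."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)

# Toy, made-up pairs: [time_distance in hours, same_user]
X_toy = [[1/60, 1], [2/60, 0], [3/60, 1], [30/60, 0], [45/60, 0], [60/60, 1]]
y_toy = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(X_toy, y_toy)
p_close = predict_proba(weights, bias, [1/60, 1])   # one minute apart, same user
p_far = predict_proba(weights, bias, [60/60, 0])    # one hour apart, different users
same_conversation = p_close >= 0.5                  # Boolean output for the close pair
```

The Boolean output is obtained by thresholding the predicted probability; 0.5 is a conventional default rather than a tuned value.
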
### Idea 1: n x m sample of data
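Get every pair of messages within a given timeframe and build the training data in X, y format, with one row per message pair (n rows) and one column per feature (m columns):

|                | feature_1 | feature_2 | feature_3 |
|----------------|-----------|-----------|-----------|
| message_pair_1 |           |           |           |
| message_pair_2 |           |           |           |
| message_pair_3 |           |           |           |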
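
One way to lay out this X, y format, sketched with plain Python lists. The dictionary keys mirror `DEFAULT_PARAMS` in the `MessageProcessor` class defined later in this notebook; the three example rows are made up for illustration, and `to_X_y` and `FEATURE_COLUMNS` are hypothetical names.

```python
FEATURE_COLUMNS = ['same_user', 'time_distance']

def to_X_y(pair_rows, feature_columns=FEATURE_COLUMNS, label_column='label'):
    """Split per-pair feature dicts into a feature matrix X and a label vector y."""
    X = [[row[col] for col in feature_columns] for row in pair_rows]
    y = [row[label_column] for row in pair_rows]
    return X, y

# Made-up rows, one per message pair
pair_rows = [
    {'same_user': 0, 'time_distance': 20, 'label': 0},
    {'same_user': 1, 'time_distance': 0, 'label': 1},
    {'same_user': 0, 'time_distance': 9, 'label': 1},
]

X, y = to_X_y(pair_rows)
# X has one row per pair and one column per feature (n x m); y has one label per pair.
```

Passing the same list of dictionaries to `pd.DataFrame(pair_rows)` would give the equivalent table with named columns.
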

### Idea 2: Feed each pair individually to the model
Write a function that yields every pair of messages within a given timeframe and preprocesses each pair into a feature list, which is then fed to the model one pair at a time.
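
A rough sketch of such a function, assuming each message exposes a `timestamp` in minutes and that the list is sorted oldest first. `iter_message_pairs`, `max_minutes`, and the `SimpleMessage` stand-in are hypothetical names used only for this illustration, and the four messages are made up.

```python
from collections import namedtuple

# Lightweight stand-in for a message; only the fields needed here
SimpleMessage = namedtuple('SimpleMessage', ['timestamp', 'user'])

def iter_message_pairs(messages, max_minutes=10):
    """Yield (primary, candidate) pairs where candidate precedes primary by at most max_minutes."""
    for i, primary in enumerate(messages):
        # Walk backwards from the primary message until the time window is exceeded
        for candidate in reversed(messages[:i]):
            if primary.timestamp - candidate.timestamp > max_minutes:
                break
            yield primary, candidate

messages = [SimpleMessage(161, 'a'), SimpleMessage(181, 'b'),
            SimpleMessage(181, 'c'), SimpleMessage(190, 'd')]

pairs = list(iter_message_pairs(messages, max_minutes=10))
pair_users = [(primary.user, candidate.user) for primary, candidate in pairs]
```

Because the function is a generator, pairs are produced lazily, which suits feeding them to a model one at a time instead of materialising the full table.
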


## Features to extract
1. Time distance between messages in minutes (integer)
2. Same or different user (boolean)
3.
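
The first feature relies on timestamps expressed in minutes. A small sketch of that conversion, assuming same-day `HH:MM` strings like those in the `timestamp` column of the data loaded later in this notebook; `to_minutes` and `time_distance` are hypothetical helper names, and messages that span midnight are not handled.

```python
def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(':')
    return int(hours) * 60 + int(minutes)

def time_distance(primary_hhmm, candidate_hhmm):
    """Minutes between two same-day timestamps; the candidate must not come after the primary."""
    distance = to_minutes(primary_hhmm) - to_minutes(candidate_hhmm)
    if distance < 0:
        raise ValueError("Primary message is before candidate.")
    return distance

print(to_minutes('12:18'))              # 738
print(time_distance('03:01', '02:41'))  # 20
```
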

In [16]:
class Message:
    def __init__(self, timestamp, user_id, message, uuid, uuid_parent, **kwargs):
        self.timestamp = timestamp
        self.user = user_id
        self.message = message
        self.uuid = uuid
        self.uuid_parent = uuid_parent
        self.conversation_ind = kwargs.get('conversation_ind', -1)

    def to_dict(self):
        return vars(self)

In [25]:
class MessageProcessor:
    DEFAULT_PARAMS = {
        'same_user': 0,
        'time_distance': -1,
        'label': -1
    }
    #def __init__(self):
    @staticmethod
    def process_reply_pair(message_primary, message_candidate):
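        """
        Build the feature dictionary for a candidate reply pair.

        :param message_primary: Message object, primary message
        :param message_candidate: Message object, potential antecedent message
        :return: copy of DEFAULT_PARAMS with same_user, time_distance and label
                 filled in (label is 1 if primary is a direct reply to candidate)
        """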

        results = MessageProcessor.DEFAULT_PARAMS.copy()

        results['same_user'] = MessageProcessor._check_user_pair(message_primary, message_candidate)
        results['time_distance'] = MessageProcessor._calculate_time_distance(message_primary, message_candidate)
        results['label'] = MessageProcessor._check_is_reply(message_primary, message_candidate)
        return results

    @staticmethod
    def process_reply_convo(message_primary, message_candidate):
        """
        Same features as process_reply_pair, but the label marks whether
        both messages belong to the same conversation.
        """
        results = MessageProcessor.DEFAULT_PARAMS.copy()

        results['same_user'] = MessageProcessor._check_user_pair(message_primary, message_candidate)
        results['time_distance'] = MessageProcessor._calculate_time_distance(message_primary, message_candidate)
        # Label by conversation membership rather than by direct reply
        results['label'] = MessageProcessor._check_is_same_convo(message_primary, message_candidate)
        return results

    @staticmethod
    def _check_user_pair(message_primary, message_candidate):
        return int(message_primary.user == message_candidate.user)

    @staticmethod
    def _calculate_time_distance(message_primary, message_candidate):
        # Timestamps are in minutes; the candidate must not come after the primary
        if message_primary.timestamp < message_candidate.timestamp:
            raise ValueError("Primary message is before candidate.")
        return message_primary.timestamp - message_candidate.timestamp

    @staticmethod
    def _check_is_reply(message_primary, message_candidate):
        return int(message_primary.uuid_parent == message_candidate.uuid)

    @staticmethod
    def _check_is_same_convo(message_primary, message_candidate):
        return int(message_primary.conversation_ind == message_candidate.conversation_ind)

In [3]:
import pandas as pd

In [10]:
dev_df = pd.read_csv('../agg_dev.csv', index_col=0)

In [29]:
dev_convo_df = pd.read_csv('../data/cleaned/dev_conversations.csv', index_col=0)

In [30]:
dev_convo_df.head()

Unnamed: 0,raw,timestamp,user,message,file_ind,date,parent,child,file_ind_parent,uuid_parent,conversation_ind
2004_11_15_03_44_CPayan_1094,[03:44] <CPayan> hmm progression or regression,03:44,CPayan,hmm progression or regression,1094,2004-11-15,1092.0,1094.0,1092.0,2004_11_15_03_43_djtansey_1092,0
2004_11_15_03_44_CPayan_1094,[03:44] <CPayan> hmm progression or regression,03:44,CPayan,hmm progression or regression,1094,2004-11-15,1093.0,1094.0,1093.0,2004_11_15_03_44_Nafallo_1093,0
2004_11_15_03_44_jdub_1096,[03:44] <jdub> it will work with the ubuntu ke...,03:44,jdub,it will work with the ubuntu kernel,1096,2004-11-15,1090.0,1096.0,1090.0,2004_11_15_03_42_Nafallo_1090,0
2004_11_15_03_47_Nafallo_1111,[03:47] <Nafallo> djtansey: and it should be f...,03:47,Nafallo,djtansey: and it should be fixed in 2.6.9 als...,1111,2004-11-15,1110.0,1111.0,1110.0,2004_11_15_03_46_Nafallo_1110,0
2004_11_15_03_50_Nafallo_1125,"[03:50] <Nafallo> jdub: but then, I got both a...",03:50,Nafallo,"jdub: but then, I got both a samsung and a to...",1125,2004-11-15,1123.0,1125.0,1123.0,2004_11_15_03_50_Nafallo_1123,0


In [11]:
dev_df.head()

Unnamed: 0,raw,timestamp,minutes,user,user_ind,message,file_ind,date,uuid,parent,child,file_ind_parent,uuid_parent
0,"[12:18] <|trey|> usual, quite stable though :)",12:18,738,|trey|,10946,"usual, quite stable though :)",0,2004-11-15,2004_11_15_12_18_|trey|_0,,,,
1,[12:18] <tweaked> HrdwrBoB: ok how many partit...,12:18,738,tweaked,1375,HrdwrBoB: ok how many partitions should i make?,1,2004-11-15,2004_11_15_12_18_tweaked_1,,,,
2,"[12:18] <Matt|> |trey|, top in the list --> ub...",12:18,738,Matt|,6784,ubuntu servers,2,2004-11-15,2004_11_15_12_18_Matt|_2,,,,
3,[12:18] <usual> a few libs and media,12:18,738,usual,7183,a few libs and media,3,2004-11-15,2004_11_15_12_18_usual_3,,,,
4,[12:18] <usual> maybe some others,12:18,738,usual,7183,maybe some others,4,2004-11-15,2004_11_15_12_18_usual_4,,,,


In [12]:
labeled_df = dev_df[dev_df['parent'].notnull()]

In [13]:
labeled_df.head()

Unnamed: 0,raw,timestamp,minutes,user,user_ind,message,file_ind,date,uuid,parent,child,file_ind_parent,uuid_parent
982,"[02:41] <jcole> aitrus: great, thanks",02:41,161,jcole,854,"aitrus: great, thanks",982,2004-11-15,2004_11_15_02_41_jcole_982,982.0,982.0,982.0,2004_11_15_02_41_jcole_982
999,=== strestout1__ is now known as strestout1,03:01,181,System,3792,,999,2004-11-15,2004_11_15_03_01_System_999,999.0,999.0,999.0,2004_11_15_03_01_System_999
1000,[03:01] <LinuxJones> night all :),03:01,181,LinuxJones,1839,night all :),1000,2004-11-15,2004_11_15_03_01_LinuxJones_1000,999.0,1000.0,999.0,2004_11_15_03_01_System_999
1001,=== yohannes [~yohannes@adsl-67-112-218-21.dsl...,03:10,190,System,3792,,1001,2004-11-15,2004_11_15_03_10_System_1001,999.0,1001.0,999.0,2004_11_15_03_01_System_999
1002,[03:10] <yohannes> can anyone recommend any ap...,03:10,190,yohannes,14260,can anyone recommend any app to create/open *...,1002,2004-11-15,2004_11_15_03_10_yohannes_1002,1001.0,1002.0,1001.0,2004_11_15_03_10_System_1001


In [17]:
message_list = []
for ind, row in labeled_df[0:10].iterrows():
    message = Message(timestamp=row['minutes'],
                      user_id=row['user_ind'],
                      message=row['message'],
                      uuid=row['uuid'],
                      uuid_parent=row['uuid_parent'])
    message_list.append(message)

In [23]:
message_list[0].timestamp

161

In [26]:
MessageProcessor.process_reply_pair(message_list[1], message_list[0])

{'same_user': 0, 'time_distance': 20, 'label': 0}

In [None]:
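class DatasetBuilder:
    def __init__(self):
        self.training_data = pd.DataFrame()
        self.labels = None

    @property
    def size(self):
        # Computed on access so it stays in sync with training_data
        return len(self.training_data)

    def load_message(self, message):
        """
        Load a single Message object as one row of training_data.
        :param message: Message object
        :return: None
        """
        # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
        row = pd.DataFrame([message.to_dict()])
        self.training_data = pd.concat([self.training_data, row], ignore_index=True)

    def load_messages(self, message_list):
        """
        Load several Message objects, one row each.
        :param message_list: list of Message objects
        :return: None
        """
        for message in message_list:
            self.load_message(message)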