# Problem Formulation and Feature Engineering



The problem presented here has been worked on before, as described in this blog post. How do we extract meaningful insight out of unlabeled data?
https://www.sentiance.com/2018/01/29/unlabeled-visits/


"Since the type of place a user is currently visiting highly depends on his recent and long-term past behavior, this kind of data needs to be gathered in a realistic setting where the temporal behavior of the user actually corresponds to natural human behavior."

To understand a location, we must first put the user's past and recent behaviour at that place into context. I first try to work with the data given and extract new features out of it. Let's start by understanding two columns.

start_time(YYYYMMddHHmmZ), duration(ms)

Given the start_time and duration, we can infer many things. First, we can infer the day of the week of the visit. This is shown in my day_of_the_week method, where I use Python's datetime library to build a datetime object that can report the day of the week. This is important, as a venue visited during a work day could tell a very different story than a venue visited during a weekend.
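As a minimal sketch of this step, the start_time string can also be parsed in one call with `datetime.strptime`, assuming the values always follow the `YYYYMMddHHmm` layout with a `+HHMM` offset as in the sample rows. The helper name `parse_start_time` is hypothetical and not part of the notebook.

```python
from datetime import datetime


def parse_start_time(raw):
    """Parse a start_time string such as '201312250036+0100' (hypothetical helper)."""
    # %z consumes the trailing UTC offset, e.g. +0100
    return datetime.strptime(raw, "%Y%m%d%H%M%z")


start = parse_start_time("201312250036+0100")
weekday = start.weekday()   # Monday is 0, Sunday is 6
is_weekend = weekday >= 5   # Saturday or Sunday

print(weekday, is_weekend)
```

For the first row of person2, this gives weekday 2 (a Wednesday), which matches the day_of_week column shown further down.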

Another feature captures how much of a visit falls within "sleeping" or working hours. The method "work_sleep" takes as input (hour, duration_hours, day_of_the_week). Imagine someone comes home at 11pm on a Monday night and stays there until 3pm the next day. Let's say sleep hours are between 12am and 6am, and work hours are between 9am and 5pm. The user has then "slept" for 6 hours and worked from home for 6 hours (9am-3pm). This tells us that although the place could be a work location, it is most likely a home location.
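The arithmetic of this example can be checked with a small standalone sketch. The helper `count_work_sleep` is a hypothetical name; it is meant to mirror the hour-by-hour counting of work_sleep, not to replace it.

```python
def count_work_sleep(start_hour, duration_hours):
    """Count whole hours of a visit that end in the sleep or work window (sketch)."""
    work = 0
    sleep = 0
    hour = start_hour
    # Only whole hours are counted, one step per elapsed hour
    for _ in range(int(duration_hours)):
        hour = (hour + 1) % 24      # wrap around at midnight
        if hour < 6:                # hour ends between 00:00 and 05:00
            sleep += 1
        elif 9 < hour < 18:         # hour ends between 10:00 and 17:00
            work += 1
    return work, sleep


# Arrive at 11pm and stay for 16 hours, i.e. until 3pm the next day
example = count_work_sleep(23, 16)
print(example)  # (6, 6)
```

The hours between 6am and 9am fall in neither window, which is why 16 hours at the location only yield 12 counted hours.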


Lastly, I would like to talk about algorithms and features that were considered but not used. Since the data is unlabeled, one might reach for unsupervised methods such as k-means. However, since the data has many noisy points, the clusters would be mostly concentrated around non-work and non-home locations. To work around this, I have chosen to label home locations for person2 and person3. The labeling rule looks at how many hours they spend "sleeping" in a location and then labels it as home or not home. This is done in the "label_home" method.
The idea is to use these "sparse labels" to train a logistic regression classifier and then predict on person1.

This process is encapsulated in our dfFactory class.

We can rebuild our dataframe as shown below. This uses a "factory" design pattern.



In [7]:
import matplotlib.pyplot as plt
import datetime
import pandas as pd
import numpy as np

from math import radians, cos, sin, asin, sqrt

def read_file(file="person2.csv"):
    df = pd.read_csv(file,sep = ';')
    return df


def day_of_the_week(time):
    '''start_time string -> day of the week as an int (Monday is 0)'''
    year = int(time[0:4])
    month = int(time[4:6])
    day = int(time[6:8])
    hour = int(time[8:10])
    minute = int(time[10:12])
    full_date = datetime.datetime(year=year, month=month, day=day, hour=hour, minute=minute, second=0)
    day_of_week = full_date.weekday()
    return day_of_week


def convert_to_hours(duration):
    '''converts duration to hours'''
    duration = float(duration)
    return duration/3600000

def work_sleep(hour,duration_hours,day_of_the_week):
    '''Counts how many whole hours of a visit fall in work or sleep hours'''
    work_hours = 0
    sleep_hours = 0
    while(duration_hours >= 1):
        #step through the visit one whole hour at a time
        hour += 1
        duration_hours -= 1
        if(hour > 23):
            #wrap around when the visit crosses midnight, i.e. 11pm-2am
            hour = hour - 24
            #day_of_the_week is advanced but not yet used in the counts
            day_of_the_week += 1

        #hour now marks the end of the hour just spent:
        #ending at 00:00-05:00 counts as sleep, ending at 10:00-17:00 counts as work
        if hour < 6:
            sleep_hours += 1
        elif hour > 9 and hour < 18:
            work_hours += 1
    return (work_hours,sleep_hours)
            
def label_home(sleep_hour,work_hour):
    '''Labels a visit as home (1) when it has more than 2 sleep hours, else 0'''
    #work_hour is currently unused
    if(sleep_hour > 2):
        return 1
    else:
        return 0



class dfFactory:
    
    def __init__(self):
        print("prepared_to_build")
        
        
    def build_df(self, file="person2.csv"):
        df = read_file(file)
        df['hour'] = df["start_time(YYYYMMddHHmmZ)"]
        df['hour'] = df.apply(lambda row: int(str(row.hour)[8:10]), axis = 1)

        df['day_of_week'] = df.apply(lambda row: day_of_the_week(row["start_time(YYYYMMddHHmmZ)"]),axis = 1)
        df['duration(hours)'] = df.apply(lambda row: convert_to_hours(row["duration(ms)"]),axis =1)
        #keep visits longer than 4 hours; copy so the assignments below act on a new frame
        df = df[df['duration(hours)'] > 4].copy()

        df["work_sleep"] = df.apply(lambda row: work_sleep(row["hour"],row["duration(hours)"], row["day_of_week"]),axis =1)
        df["work_hours"] = df.apply(lambda row: row["work_sleep"][0],axis =1)
        df["sleep_hours"] = df.apply(lambda row: row["work_sleep"][1],axis=1)
        df["home_label"] = df.apply(lambda row: label_home(row["sleep_hours"],row["work_hours"]), axis = 1)
        return df

Although we will not be using this class directly in building our model, it is nice to see the intermediate dataframe.
Let's examine the newly created dataframe.

In [8]:
factory = dfFactory()
new_df = factory.build_df()
new_df

prepared_to_build


| index | latitude | longitude | start_time(YYYYMMddHHmmZ) | duration(ms) | hour | day_of_week | duration(hours) | work_sleep | work_hours | sleep_hours | home_label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51.057022 | 3.714476 | 201312250036+0100 | 19593588 | 0 | 2 | 5.442663 | (0, 5) | 0 | 5 | 1 |
| 1 | 51.056984 | 3.714681 | 201312250608+0100 | 30460679 | 6 | 2 | 8.461300 | (5, 0) | 5 | 0 | 0 |
| 7 | 50.997192 | 4.802296 | 201312251907+0100 | 18114776 | 19 | 2 | 5.031882 | (0, 1) | 0 | 1 | 0 |
| 8 | 50.997192 | 4.802300 | 201312260015+0100 | 35447860 | 0 | 3 | 9.846628 | (0, 5) | 0 | 5 | 1 |
| 18 | 51.214290 | 4.411606 | 201312261550+0100 | 53163669 | 15 | 3 | 14.767686 | (2, 6) | 2 | 6 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 734 | 51.214230 | 4.411570 | 201403220039+0100 | 53344915 | 0 | 5 | 14.818032 | (5, 5) | 5 | 5 | 1 |
| 745 | 51.214210 | 4.411585 | 201403222220+0100 | 44265668 | 22 | 5 | 12.296019 | (1, 6) | 1 | 6 | 1 |
| 747 | 51.214188 | 4.411525 | 201403231101+0100 | 14724392 | 11 | 6 | 4.090109 | (4, 0) | 4 | 0 | 0 |
| 755 | 51.214360 | 4.411226 | 201403231737+0100 | 52019895 | 17 | 6 | 14.449971 | (0, 6) | 0 | 6 | 1 |


As one can see, each row now contains the starting hour, the day of the week, the duration in hours, the time spent "working" or "sleeping", and the home_label.

We would also like to fully contextualize each location and reduce noise. To do so, we merge locations that are "checked in" multiple times so that there is one row per location, holding the total number of hours slept and worked there. This is done in our prepareTrainingDf class, which takes a df transformed by our dfFactory and aggregates it. Again, this class will not be used directly, as a function will deal with creating it. Let's look at the intermediate dataframe.

In [13]:
class prepareTrainingDf:
    
    
    def __init__(self,df):
        #has the transformed df
        self.df = df
        #has the data_table
        self.data_table = {}
        
        
    def distance(self, lat1, lat2, lon1, lon2):
        # The math module contains a function named
        # radians which converts from degrees to radians.
        lon1 = radians(lon1)
        lon2 = radians(lon2)
        lat1 = radians(lat1)
        lat2 = radians(lat2)

        # Haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2

        c = 2 * asin(sqrt(a))

        # Radius of earth in kilometers. Use 3956 for miles
        r = 6371

        # calculate the result
        return (c * r)
        
        
    def insert_locations(self,user="person2"):
        #very similar to our datalookuptable but aggregates sleep and work hours.
        df = self.df
        self.data_table[user] = {}
        for index, row in df.iterrows():
            unique = True
            proposed_lat = row['latitude']
            proposed_lon = row['longitude']
            home_label = int(row["home_label"])
            work_hours = int(row["work_hours"])
            sleep_hours = int(row["sleep_hours"])
            for key in self.data_table[user]:
                existing_lat, existing_lon = key[0], key[1]
                dist = self.distance(existing_lat, proposed_lat, existing_lon, proposed_lon)
                if (dist < 0.5):
                    unique = False
                    location = (existing_lat, existing_lon)
                    if home_label == 1:
                        #if there is a discrepancy between a home and work label then we choose home label
                        self.data_table[user][location][2] = 1
                    self.data_table[user][location][0] += sleep_hours
                    self.data_table[user][location][1] += work_hours   
                    break
            if (unique):
                #This is where the value is a list containing sleep,work and the label
                self.data_table[user][(proposed_lat, proposed_lon)] = [sleep_hours, work_hours, home_label]
                
    
    def clean_df(self,user= "person2"):
        '''Creates a dataframe out of our lookup_table.'''
        lst = []
        for key in self.data_table[user]:
            latitude,longitude = key[0], key[1]
            location = (latitude, longitude)
            sleep_hours = float(self.data_table[user][location][0])
            work_hours = float(self.data_table[user][location][1])
            label =  int(self.data_table[user][location][2])
            lst.append([latitude,longitude,work_hours, sleep_hours,label])
        test_df = pd.DataFrame(lst, columns=['latitude', 'longitude', 'work_hours','sleep_hours','home_label']) 
        return test_df  
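The 0.5 km merge rule can be tried in isolation. The following is a standalone sketch under the same assumptions as insert_locations: a visit joins the first stored location within 0.5 km, hours are summed, and a home label wins over a non-home label. The names `haversine_km` and `merge_visits` are hypothetical, and the three visits are rows 0, 1 and 7 of the person2 dataframe above.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius


def merge_visits(visits, radius_km=0.5):
    """Greedily merge visits: each one joins the first stored place within radius_km."""
    places = {}  # (lat, lon) -> [sleep_hours, work_hours, home_label]
    for lat, lon, sleep, work, label in visits:
        for (place_lat, place_lon), totals in places.items():
            if haversine_km(place_lat, place_lon, lat, lon) < radius_km:
                totals[0] += sleep
                totals[1] += work
                totals[2] = max(totals[2], label)  # a home label is never overwritten
                break
        else:
            # no stored place is close enough, so this visit starts a new one
            places[(lat, lon)] = [sleep, work, label]
    return places


visits = [
    # (latitude, longitude, sleep_hours, work_hours, home_label)
    (51.057022, 3.714476, 5, 0, 1),
    (51.056984, 3.714681, 0, 5, 0),
    (50.997192, 4.802296, 1, 0, 0),
]
places = merge_visits(visits)
print(places)
```

The first two visits are only a few meters apart and collapse into one place, while the third stays separate. Because the merge is greedy, the result can depend on the order in which visits are processed.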

In [17]:
prepareTrain = prepareTrainingDf(new_df)
prepareTrain.insert_locations()
test_df = prepareTrain.clean_df()
test_df

| index | latitude | longitude | work_hours | sleep_hours | home_label |
|---|---|---|---|---|---|
| 0 | 51.057022 | 3.714476 | 17.0 | 20.0 | 1 |
| 1 | 50.997192 | 4.802296 | 0.0 | 6.0 | 1 |
| 2 | 51.21429 | 4.411606 | 115.0 | 440.0 | 1 |
| 3 | 51.216366 | 4.394152 | 142.0 | 0.0 | 0 |
| 4 | 51.047264 | 3.713058 | 0.0 | 1.0 | 0 |
| 5 | 51.036304 | 3.734711 | 1.0 | 0.0 | 0 |
| 6 | 51.21846 | 4.404464 | 44.0 | 3.0 | 1 |
| 7 | 51.210907 | 4.389543 | 115.0 | 10.0 | 1 |


As one can see, our original dataframe is now only 8 rows. This sparse representation will be used to train our model. We seem to have labeled 5 home locations. Perhaps this person has a home location at his parents' house outside the city, and other home locations such as a friend's house.

The main use of our classes is encapsulated in these last three functions.

First, the make_ml_model function runs the logistic regression training using sklearn's LogisticRegression. One caveat is that longitude and latitude are dropped before training. With so few data points, the model could not learn anything meaningful from raw coordinates. For this iteration, work and sleep hours are used instead.

# Model Training

These next three functions encapsulate the use of our previous two classes.

In [19]:
from sklearn import metrics 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def prepare_training_data(files = ["person2.csv","person3.csv"]):
    '''concats the per-user training dataframes into one dataframe'''
    factory = dfFactory()
    train_dfs = []
    for file in files:
        #build the per-visit dataframe for this file
        df = factory.build_df(file)
        #strip the ".csv" extension to get the user name
        user = file[:-4]
        #reduce the visits to one row per location
        train_dfs.append(train_df_build(df,user))

    df_row_reindex = pd.concat(train_dfs,ignore_index=True)
    return df_row_reindex
    
    
def train_df_build(df,user):
    '''does feature engineering to include sleep and work hours'''
    prepare_object = prepareTrainingDf(df)
    prepare_object.insert_locations(user)
    train_df = prepare_object.clean_df(user)
    return train_df


def make_ml_model():
    '''trains a logistic regression model on the default training files'''
    train_df = prepare_training_data()
    #drops columns not used for training
    x_train = train_df.drop(['home_label', 'latitude','longitude'],axis =1)
    #extract labels
    y_train = train_df.home_label
    logistic_regression = LogisticRegression()
    #This is where the model is trained
    logistic_regression.fit(x_train,y_train)
    return logistic_regression
    

#factory = dfFactory()
#new_df = factory.build_df()

Let's see how one should call these functions.


In [22]:
model = make_ml_model()
#This makes our ml model and returns a LogisticRegression object

test_model = prepare_training_data(["person1.csv"])
#creates testing data from person1
x_test = test_model.drop(['home_label', 'latitude','longitude'],axis=1)
#separates the features from the labels
y_test = test_model.home_label

prepared_to_build
prepared_to_build


Let's glance over the data we are trying to predict for.

In [26]:
test_model

| index | latitude | longitude | work_hours | sleep_hours | home_label |
|---|---|---|---|---|---|
| 0 | -49.32688 | -72.89085 | 3.0 | 6.0 | 1 |
| 1 | -50.33432 | -72.25323 | 4.0 | 11.0 | 1 |
| 2 | -54.809326 | -68.31953 | 2.0 | 7.0 | 1 |
| 3 | -25.624971 | -54.55071 | 3.0 | 18.0 | 1 |
| 4 | -34.60891 | -58.378475 | 1.0 | 3.0 | 1 |
| 5 | -34.812263 | -58.539413 | 4.0 | 6.0 | 1 |
| 6 | 51.171577 | 4.34692 | 70.0 | 359.0 | 1 |
| 7 | 51.216354 | 4.394118 | 77.0 | 12.0 | 1 |
| 8 | 50.929203 | 3.011693 | 6.0 | 0.0 | 0 |
| 9 | 51.06182 | 3.756978 | 3.0 | 0.0 | 0 |


Now let's predict and assess our accuracy.


In [27]:
y_pred = model.predict(x_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy_percentage = 100 * accuracy
print("The accuracy of our model is:", accuracy_percentage)

The accuracy of our model is: 100.0


Our model can currently predict our hand-made labels accurately.
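To put this figure in context, it can help to compare it against a trivial baseline. The sketch below uses only the home_label column of the person1 table above and computes the accuracy of always predicting the most common label. The variable names are illustrative and not part of the notebook.

```python
from collections import Counter

# home_label values of the ten person1 locations shown above
y_test_labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Find the most frequent label and how often it occurs
counts = Counter(y_test_labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority label
baseline_accuracy = majority_count / len(y_test_labels)

print("Majority label:", majority_label)
print("Baseline accuracy:", 100 * baseline_accuracy)
```

With 8 of the 10 test locations labeled as home, always answering "home" would already score 80%, so the test set is both small and imbalanced.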

# Shortcomings and next steps


The model relies heavily on the engineered features. These operations can be costly and time-consuming when used for prediction.
Also, we have made some heavy assumptions about what makes a home location. For example, how would we treat someone who works at night? (https://www.sentiance.com/2016/11/03/semantics-of-time/)

There are also more features to look at, such as the number of check-ins, and semi-supervised methods that would possibly work better. Also, as shown in the next question, vision information will prove useful. Perhaps we can take a picture at the longitude and latitude and create an embedding similar to Loc2Vec?