# Ayman

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from scipy.optimize import curve_fit

Let us start by importing the data and exploring the column names and number of entries.

In [None]:
df_checkin = pd.read_csv("data/processed_dataset.csv")

In [None]:
df_checkin.info()

### Ground Truth

Let us create a ground truth dataframe that will contain the homes of users.

We start by isolating the check-ins that have the `Home (private)` label.

In [None]:
df_checkins_homes = df_checkin[df_checkin.place == "Home (private)"]

In [None]:
len(df_checkins_homes)

We see that we have 1724007 user check-ins into their homes.

There is one issue with the dataset, whch is that the home location may vary certain users. Here, we sort the home check-ins by `User ID` and we see that for user 15, the home longitude and latitude may vary.

In [None]:
df_checkins_homes.sort_values("User ID").head(30)

Create dataframe for user home locations, where we use the median location with label home for each user

In [None]:
df_homes = df_checkins_homes.groupby("User ID")[["lat", "lon"]].agg("median")

In [None]:
df_homes

Since the check-in location labeled with `Home (private)` may not be unique for a certain user, we decide to keep only users whose home locations have a standard deviation of 100m around the mean location.

The `haversine` function computes the distance between two point on a sphere using thei longitudes and latitudes.

In [None]:
EARTH_RADIUS = 6371.0088
def haversine(lat1, lat2, lon1, lon2):
    sigma = np.arcsin(np.sqrt(np.sin((lat2-lat1)/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin((lon1-lon2)/2)**2))
    return (2 * EARTH_RADIUS * sigma)

The `select_relevant_homes` filters out the users having home locations with a standard deviation larger than 100m.

In [None]:
def select_relevant_homes(df_homes):

    # Grouping df_homes according to the user id and compute std and mean for lat and lon
    df_homes = df_homes.groupby('User ID').agg({'lat':('std','mean'),'lon':('std','mean')})

    # Filling nan values with 0 (std return 0 if there is only one sample)
    df_homes.fillna(0,inplace = True)

    # Compute distance from mean

    # Preparing dataframe
    df_tmp = pd.DataFrame()
    df_tmp['lat1'] = df_homes.lat['mean']-df_homes.lat['std']
    df_tmp['lat2'] = df_homes.lat['mean']+df_homes.lat['std']
    df_tmp['lon1'] = df_homes.lon['mean']-df_homes.lon['std']
    df_tmp['lon2'] = df_homes.lon['mean']+df_homes.lon['std']

    # Compute distance
    df_tmp['home_radius'] = haversine(df_tmp.lat1, df_tmp.lat2, df_tmp.lon1, df_tmp.lon2)

    # Filter home and keep relevant home (estimated distance between homes checkins < 100m )
    df_homes = df_homes[df_tmp['home_radius']<0.1][[('lat','mean'),('lon','mean')]].copy()

    df_homes.columns = df_homes.columns.get_level_values(0)

    return df_homes

In [None]:
df_homes_2 = select_relevant_homes(df_checkins_homes)

In [None]:
df_homes_2

Now we are left with 27784 users and their home locations.

### Home locations using discretization

Now let us isolate the users that have homes.

In [None]:
users_with_homes = df_homes_2.index.values

In [None]:
df_checkin_filtered = df_checkin[df_checkin["User ID"].isin(users_with_homes)].copy()

Here is the code to discretize the world and detect home locations (from P2).

In [None]:
import math
from pandarallel import pandarallel

# Initialization
pandarallel.initialize()

In [None]:
# Function to be called in lambda to get specific 25km*25km square where the coordinates are
def square_coordinate(latitude, longitude):
    lat_rad = np.deg2rad(latitude)
    lon_rad = np.deg2rad(longitude)

    square_side_on_equator = 10

    # signed distance from equator
    latitude_dist = EARTH_RADIUS * lat_rad

    # signed distance from prime meridian (moving parallel to equator parallel to equator)
    longitude_dist = EARTH_RADIUS * np.cos(lat_rad) * lon_rad

    latitude_square_nbr = np.trunc(latitude_dist / square_side_on_equator)

    # rescaling the length of the square side because of the curvature of the earth.
    # on a cylindrical projection, horizontal distances are rescaled as follows
    square_side_on_lat_line = square_side_on_equator * np.cos(lat_rad)

    longitude_square_nbr = np.trunc(longitude_dist / square_side_on_lat_line)

    return list(zip(latitude_square_nbr, longitude_square_nbr))

In [None]:
def get_home_coordinates(df):
    df_ = df.copy()

    # Get the square coordinates for each row
    df_['square_coordinates'] = square_coordinate(df_.lat.values, df_.lon.values)

    # Group by user then square, for each square create a list of the corresponding coordinates
    df_ = df_.groupby(['User ID', 'square_coordinates'] ,as_index=False)\
                     [['User ID', 'square_coordinates', 'lat', 'lon']].agg(list)

    # Compute number of checkins in each square per user
    df_['freq'] = df_['lat'].str.len()

    # Keep only the most frequented square for each user
    df_ = df_.sort_values(['User ID','freq'], ascending=[True, False]).drop_duplicates(['User ID'])

    # For each user, find the average location in the most frequented square
    df_['lat'] = df_.lat.parallel_apply(lambda row: np.mean(row))
    df_['lon'] = df_.lon.parallel_apply(lambda row: np.mean(row))

    #return a dataframe with user, home latitude and home longitude
    return df_[['User ID', 'lat', 'lon']].set_index('User ID')

Let us now copute the home location of the users using the discretization.

In [None]:
df_detected_homes = get_home_coordinates(df_checkin_filtered)

In [None]:
df_detected_homes

### Comparizon between ground truth and discretization results

Let us compute the distance between the home locations resulting from the discretization and the true home locations.

First, we merge the two dataframes, then we compute the distance between the true home and the detected home using the `haversine` function defined above.

In [None]:
df_distance_detection_from_truth = pd.merge(df_homes_2, df_detected_homes, left_index=True, right_index=True)

In [None]:
df_distance_detection_from_truth

In [None]:
df_distance_detection_from_truth['dist'] = haversine(df_distance_detection_from_truth.lat_x,
                                                     df_distance_detection_from_truth.lat_y,
                                                     df_distance_detection_from_truth.lon_x,
                                                     df_distance_detection_from_truth.lon_y)

In [None]:
df_distance_detection_from_truth

Now, we visualize the distribution of distances.

In [None]:
def cumulative_dist_plot(series):
    fig = plt.figure()
    array = plt.hist(series, bins=1000, density=True, cumulative=True, bottom=0, histtype='step')
    plt.close(fig)

    x = array[1][1:]
    y = array[0]

    plt.plot(x, y, alpha=0.8, label='Distance from home CDF')

    percentage = 0.85

    i_percentage = next(i for i,v in enumerate(y) if v >= percentage)

    y_percentage = y[i_percentage]
    x_percentage = x[i_percentage]

    plt.plot([10, x_percentage], [y_percentage, y_percentage], color='g', linestyle='dotted')
    plt.plot([x_percentage, x_percentage], [0, y_percentage], color='g', linestyle='dotted')

    plt.xscale('log')
    plt.xlim(10, 5*1e4)
    plt.ylim(0, 1.1)

    plt.yticks(sorted(list(plt.yticks()[0]) + [y_percentage])[:-1])

    plt.annotate(f'{x_percentage:.0f} km', (x_percentage, 0), xytext=(x_percentage+1000, 0.12),
                 arrowprops={'width':1, 'headlength':10, 'headwidth':5})

    plt.ylabel("Cumulative density", fontsize=12)
    plt.xlabel("Distance from true home (km)", fontsize=12)

    plt.title("Cumulative distribution\nof the distance between the detected\nhome and the actual home\n",
              fontsize=18)

    plt.legend()

In [None]:
cumulative_dist_plot(df_distance_detection_from_truth.dist)

We see that to attain a 85% accuracy, we need to consider detected home locations to be true up to a distance of 3'263 km from the true home location. This distance almost as much as the distance from Beirut to Valencia, i.e. larger than the Mediterranean Sea.

Let us explore why this threshold distance is so large. We plot the distribution of distances on a log-log scale.

In [None]:
def power_func(x, a, b):
    """
    function to compute a*(x^b)
    """
    return a * np.power(x, b)

In [None]:
def density_loglog_dist_plot(series):
    fig = plt.figure()
    array = plt.hist(series, bins=1000, log=True, density=True, bottom=0, histtype='step')
    plt.close(fig)

    plt.loglog(array[1][1:], array[0], alpha=0.5, label='Distance from home probability density')

    start = 4

    distribution = array[0][start:]
    distances = array[1][start+1:]

    [a, b], _ = curve_fit(power_func, distances, distribution)

    x = range(round(distances[0]), round(distances[-1]))
    y = power_func(x, a, b)

    plt.loglog(x, y, color='r', label=r'$%.4f*x^{%.4f}$' % (a,b))

    plt.ylim(1e-6, 1e-2)
    plt.xlim(10, 5*1e4)

    plt.ylabel("Probability density", fontsize=12)
    plt.xlabel("Distance from true home (km)", fontsize=12)

    plt.title("Probability distribution\nof the distance between the detected\nhome and the actual home\n",
              fontsize=18)

    plt.legend()

In [None]:
density_loglog_dist_plot(df_distance_detection_from_truth.dist)

### <span style="color:red">Percentiles and interpretation</span>

# Rami

In [None]:
import pandas as pd
import numpy as np
from haversine import haversine_vector, Unit
from sklearn.cluster import DBSCAN
from datetime import timedelta
import networkx as nx

In [None]:
def correct_latitude(lat):
    """
    This function corrects for out of range latitude.

    Input:
    -- lat: latitude coordinates in °
    Output:
    -- lat: latitude coordinates put between -90 and 90°
    """
    while lat>90 or lat<-90:
        if lat>90:
            lat = -(lat-180)
        elif lat<-90:
            lat = -(lat+180)
    return lat

In [None]:
def correct_longitude(long):
    """
    This function corrects for out of range longitude.

    Input:
    -- long: longitude coordiantes in °
    Output:
    -- long: longitude coordinates put between -180 and 180°
    """
    while long>180 or long<-180:
        if long>180:
            long = long - 360
        elif long<-180:
            long = long +360
    return long

In [None]:
def compute_distance(df,columns):
    '''
    This function computes the distance between two geographic coordinates for a given dataframe.

    Input:
        - df: Dataframe containing 4 columns latitude1, longitude1, latitude2 and longitude2
        - columns: list of columns [latitude1, longitude1, latitude2 and longitude2]

    Output:
        - numpy array containing the distance between geographic coordinates of each row
    '''
    points1 = list(zip(df[columns[0]],df[columns[1]]))
    points2 = list(zip(df[columns[2]],df[columns[3]]))
    # Use harvesine_vector to compute the distance between points
    return np.round(haversine_vector(points1,points2,Unit.KILOMETERS),decimals=3)

In [None]:
def select_relevant_homes(df_homes):
    '''
    This function selects relevant homes. We consider a home location as relevant if it's latitude and
    longitude doesn't "vary much". To measure this variation, we simply compute the mean and the std of
    the latitude and longitude of homes for every user. Then we construct 4 points as follow:
        - by adding and substracting the standard deviation of the latitude and longitude from their
        respective mean
        - Measure the diagonal in KM
        - If the diagonal is less than 100m we can assume with confidence that the mean is indeed the
        home location

    Input:
        - df_homes: A dataframe containing all checkins labled as Home
    Output:
        - df_homes: Home location for each user
    '''

    # Grouping df_homes according to the user id and compute std and mean for lat and lon
    df_homes = df_homes.groupby('User ID').agg({'lat':('std','mean'),'lon':('std','mean')})

    # Filling nan values with 0 (std return 0 if there is only one sample)
    df_homes.fillna(0,inplace = True)

    # Construct the diagonal points
    df_tmp = pd.DataFrame()
    df_tmp['lat1'] = df_homes.lat['mean']-df_homes.lat['std']
    df_tmp['lat2'] = df_homes.lat['mean']+df_homes.lat['std']
    df_tmp['lon1'] = df_homes.lon['mean']-df_homes.lon['std']
    df_tmp['lon2'] = df_homes.lon['mean']+df_homes.lon['std']

    # Compute diagonal length
    df_tmp['home_radius'] = compute_distance(df_tmp,['lat1','lon1','lat2','lon2'])

    # Filter home and keep relevant home (estimated distance between homes checkins < 100m )
    df_homes = df_homes[df_tmp['home_radius']<0.1][[('lat','mean'),('lon','mean')]].copy()

    # Flatten df_homes columns
    df_homes.columns = df_homes.columns.get_level_values(0)

    return df_homes

In [None]:
def construct_df_checkins(path,sample_frac = 1):
    '''
    This function takes the path of the raw data, import it and construct a checkin dataframe where
    all users have at least 5 checkins and 1 home location

    Input:
        - Path: the Path of the file containing the data
        - sample_frac: sample fraction from the raw dataframe
    Output:
        - df_checkins: Checkin dataframe where all users have at least 5 checkins and 1 home location
    '''

    # Read data from the file and drop unnecessary columns
    df_tmp = pd.read_csv(path).sample(frac=sample_frac).drop(columns=['Venue ID','day'])

    # Latitude and Longitude correction
    df_tmp.lat = df_tmp.lat.apply(correct_latitude)
    df_tmp.lon = df_tmp.lon.apply(correct_longitude)

    # Construct df_homes and select only relevant homes
    df_homes = df_tmp.loc[df_tmp.place.str.lower().str.contains('home' and 'private')].copy()
    df_homes = select_relevant_homes(df_homes)

    # Select users with relevant homes from the raw data
    df_tmp = df_tmp.loc[df_tmp['User ID'].isin(df_homes.index)].copy()

    # Count the number of checkins for each user
    df_tmp_grouped = df_tmp.groupby('User ID').agg({'User ID':'count'})

    # Define a set containing users with at least 5 checkins
    users = set(df_tmp_grouped[df_tmp_grouped['User ID']>5].index)

    # Construct df_checkins
    df_checkins = df_tmp.loc[df_tmp['User ID'].isin(users)].copy()

    # Convert 'local time' attribute to a pandas datetime
    df_checkins['local time'] = pd.to_datetime(df_checkins['local time'])

    # Label Homes
    df_checkins['Is_home'] = df_checkins.place.str.lower().str.contains('home' and 'private')

    # Drop unnecessary column
    df_checkins.drop(columns = ['place'],inplace = True)

    return df_checkins.sort_values(by=['User ID','local time']).reset_index(drop=True)

In [None]:
def build_clusters_labels(df_user,clustering_method):
    '''
    This function clusters the checkins for a single user.

    Input:
        - df_user: a dataframe containing the latitude and longitude for each checkin
        - clustering_method: DBSCAN, we define this parameter to avoid unnecessary initialisations
        when calling this funcrion
    Output:
        - clusters_labels: cluster label assigned to each checkin
    '''
    cluster_lables = clustering_method.fit(np.deg2rad(df_user[['lat','lon']])).labels_

    return cluster_lables

In [None]:
def cleaning_user(df_user):
    '''
    To avoid biasing the dataset with multiple checkins in a small period of time or small distance traveled,
    we drop checkins that are consecutively shared within 60 minutes and 100m.

    Input:
        - df_user: datafame containing checkin time and location sorted by time
    Output:
        - df_user: cleaned df_user

    '''

    # Constructing a dataframe containing the actual checkin and the next checkin
    df_tmp = df_user.reset_index().merge(df_user.iloc[1:].reset_index(drop=True),right_index=True,
                                         left_index=True,how='inner')

    # Compute the time between two consecutive checkins
    df_tmp['dt'] = df_tmp['local time_y'] - df_tmp['local time_x']

    # Compute the distance between two consecutive checkins
    columns = ['lat_x','lon_x','lat_y','lon_y']
    df_tmp['distance'] = compute_distance(df_tmp,columns)

    # Construct a mask to keep consecutive checkins if they are distant by 60 minutes or 100m
    # We also ignore checkins with 0 dt
    mask = (df_tmp['dt']!=timedelta(0))&((df_tmp['dt']>timedelta(hours=1))|(df_tmp['distance']>0.1))

    return df_user.reset_index().iloc[df_tmp[mask].index]


In [None]:
def compute_checkin_during_midnight(df_user):
    '''
    This function lables the checkins after midnight.

    Input:
        - df_user: dataframe containing and sorted by checkin time
    Output:
        - Labeles for each checkin. If it is happening after midnight and befor 7am it's set to True
        and False otherwise
    '''
    df_tmp = (df_user['local time'].dt.hour>=0) & (df_user['local time'].dt.hour<7)

    return df_tmp

In [None]:
def compute_last_checkin(df_user):
    '''
    This function labels the last checkin before 3 am.

    Input:
        df_user: Dataframe containing and sorted by checkin time
    Output:
        - Labeles for each checkin. If it is the last checkin of the day, the label is set to True
        and False otherwise
    '''
    # We subsctract 3 hours  so we can detect the last checkin whenever the date changes
    tmp_date = (df_user['local time']-timedelta(hours=3)).dt.date.values
    last_checkin = []

    # Labeling last checkins
    for i in range(len(tmp_date)-1):
        if tmp_date[i]<tmp_date[i+1]:
            last_checkin.append(True)
        else:
            last_checkin.append(False)

    # The last checkin is always True by definition
    last_checkin.append(True)

    return last_checkin

In [None]:
def compute_last_checkin_with_inactive_midnight(df_user):
    '''
    This function labels the last checkin with inactive midnight (no checkins between 0am and 7am).

    Input:
        df_user: Dataframe containing and sorted by checkin time
    Output:
        - Labeles for each checkin. If it is the last checkin of the day and the user didn't checkin
        between 0am and 7am the label is True and False otherwise
    '''

    # Substract 7 hours to detect the change of the day whenever the date changes
    tmp_date = (df_user['local time']-timedelta(hours=7)).dt.date.values
    tmp_hour = (df_user['local time']).dt.hour.values

    last_checkin_with_inactive_midnight = []

    # Compute the last checkin with inactive midnight
    for i in range(len(tmp_date)-1):
        # If the date changes and the hour is <= 23 the last checkin is happening before midnight
        if (tmp_date[i]<tmp_date[i+1]) and (tmp_hour[i]<=23):
            last_checkin_with_inactive_midnight.append(True)
        else:
            last_checkin_with_inactive_midnight.append(False)
    # As the last checkin is by definition the last checkin of the day, we simply need to see if it is
    # happening before midnight
    if tmp_hour[-1]<=23:
        last_checkin_with_inactive_midnight.append(True)
    else:
        last_checkin_with_inactive_midnight.append(False)

    return last_checkin_with_inactive_midnight

In [None]:
def compute_dt_to_next_checkin(df_user):
    '''
    This function compute delta time between two consecutive checkins in Hour.

    Input:
        - df_user: dataframe containing checkin time
    Output:
        - Delta time between two consecutive checkins
    '''

    # get checkin times
    checkin_time = df_user['local time'].values

    # Compute the difference (The result is in nanoseconds)
    delta_time = checkin_time[1:]-checkin_time[:-1]

    # Convert delta_time to hours
    delta_time = delta_time.astype(float)/(1e9*3600)

    # The last checkin doesn't have a next checkin so we append None
    delta_time = np.append(delta_time,None)
    return delta_time

In [None]:
def compute_PR_RPR(df_user):
    '''
    This function computes the PageRank and ReversePageRank for each cluster. The ReversePageRank is computed
    by inverting the the edges. The weight of an edge is obtained by computing the sum over clusters of the
    inverse of the time made between two consecutive checkins

    Input:
        - df_user: dataframe containing clusters labels for each checkin and the time until next checkin
    Output:
        - PageRank and ReversePageRank
    '''
    # Construct a dataframe containing the acutal checkin and next checkin
    df_tmp = df_user.reset_index().iloc[:-1].merge(df_user.iloc[1:].reset_index(),
                                                    right_index=True,left_index=True)

    # Compute the inverse time made between two consecutive checkins
    df_tmp['inverse_time'] = 1/df_tmp['dt_to_next_checkin_x']

    # Construct the graph edges dataframe
    df_graph = df_tmp.groupby(['cluster_label_x','cluster_label_y'],as_index = False).agg({'inverse_time':'sum'})

    # Initialise PageRank Graph
    G = nx.DiGraph()

    # Initialise ReversePageRank Graph
    RG = nx.DiGraph()

    # Building edges of the two graphs
    for i, row in df_graph.iterrows():
        G.add_edge(int(row['cluster_label_x']),int(row['cluster_label_y']),weight=row['inverse_time'])
        RG.add_edge(int(row['cluster_label_y']),int(row['cluster_label_x']),weight=row['inverse_time'])

    return list(nx.pagerank(G, weight='weight').values()), list(nx.pagerank(RG, weight='weight').values())

In [None]:
def compute_ratio(column):
    '''
    This is an aggregation function to compute the ratio.

    Input:
        - column: column containing binary labels
    Output:
        - Ratio of positive labels
    '''
    return np.sum(column)/len(column)

In [None]:
def build_features(path, sample_frac = 1):
    '''
    This is the main function to extract features from checkin data.

    Input:
        - path: the path of the file containing raw checkins.
        - sample_frac: sample fraction from the raw dataframe
    Output:
        - Cleaned training data: dataframe containing features for each cluster of every user
    '''

    # Construct checkin dataframe
    df_checkins  = construct_df_checkins(path,sample_frac=sample_frac)

    # Initialize output dataframe
    df_tmp = pd.DataFrame(columns = ['user','cluster_id','CR','MR','EDR','EIDR','PR','RPR','Is_home'])

    # Exctract users from the checkin dataframe
    users_id = np.unique(df_checkins['User ID'])

    # Grouping the checkin dataframe by the 'User ID'
    grouped_checkins = df_checkins.groupby('User ID')

    # Initialize Clustering method with the right parameters
    KMS_PER_RADIAN = 6371.0088
    PRECISION = 0.1
    clustering_method = DBSCAN(eps=PRECISION/KMS_PER_RADIAN,metric='haversine')

    # Compute features for each cluster of every user
    for user in users_id:

        # Get the user Dataframe
        df_user = grouped_checkins.get_group(user)

        # Clean df_user
        df_user = cleaning_user(df_user).copy()

        # Consider only dataframes containing more than 1 cleaned entries
        if len(df_user)>1:

            # Compute cluster_label
            df_user['cluster_label'] = build_clusters_labels(df_user,clustering_method)

            # Compute Checkin during midnight
            df_user['checkin_during_midnight'] = compute_checkin_during_midnight(df_user)

            # Compute last checkin
            df_user['last_checkin'] = compute_last_checkin(df_user)

            # Compute last checkin with inactive midnight
            df_user['last_checkin_with_inactive_midnight'] = compute_last_checkin_with_inactive_midnight(df_user)

            # Compute distance to next_checkin and classify edges
            df_user['dt_to_next_checkin'] = compute_dt_to_next_checkin(df_user)

            # Construct aggregation dictionnary
            agg_dic = {'cluster_label':'count','checkin_during_midnight':compute_ratio,
                        'last_checkin':compute_ratio,'last_checkin_with_inactive_midnight': compute_ratio,
                        'Is_home': 'sum'}
            # Construct rename dictiaonnary
            rename_dic = {'cluster_label':'CR','checkin_during_midnight':'MR','last_checkin':'EDR',
                          'last_checkin_with_inactive_midnight':'EIDR'}

            # Group by cluster_label
            grouped_clusters = df_user.groupby('cluster_label')

            # Compute the first 4 features
            features = grouped_clusters.agg(agg_dic).rename(columns = rename_dic)

            # Add user ID to the features
            features['user'] = user

            # Compute Checkin Ration (CR)
            features['CR'] = features['CR']/features['CR'].sum()

            # Label the clusters that contain the home location
            features['Is_home'] = features['Is_home'] == features['Is_home'].max()

            # Compute PageRank and ReversePageRank
            features['PR'],features['RPR'] = compute_PR_RPR(df_user)

            # Append results to the output dataframe
            df_tmp = df_tmp.append(features)

    # Assign clusters_id to the dataframe
    df_tmp['cluster_id'] = df_tmp.index

    return df_tmp.reset_index(drop=True)

In [None]:
df_training = build_features('data/processed_dataset-003.csv')

In [None]:
df_training.head()

In [None]:
df_training.info()

In [None]:
df_training.to_csv('training_dataset.csv')

In [None]:
df_training.Is_home.sum()/len(df_training)

# Selim

<a href="https://colab.research.google.com/github/TheAzouz/Project_ADA/blob/main/selim_P4_group.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount ('/content/drive')

In [None]:
cd "drive/My Drive/Colab Notebooks/ADA_P4/data"

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from scipy.optimize import curve_fit

In [None]:
def calculate_distance (lat1,lon1,lat2,lon2):
    """
    ################## Function we used in P2 and P4 to calculte distances ####################

    R : radius of earth : 6378.137 km
    lat1,lon1 : latitude and longitude of one user
    lat2,lon2 : latitude and longitude of other user
    """
    R = 6378.137
    # convert into radians
    lat1_rad=np.deg2rad(lat1)
    lat2_rad=np.deg2rad(lat2)
    lon1_rad=np.deg2rad(lon1)
    lon2_rad=np.deg2rad(lon2)

    #get difference of lattitude and difference of longitude
    delta_lat=lat2_rad-lat1_rad
    delta_lon=lon2_rad-lon1_rad

    #return formula Haversine formula
    a=((np.sin(0.5*delta_lat))**2)+np.cos(lat1_rad)*np.cos(lat2_rad)*((np.sin(0.5*delta_lon))**2)
    return 2*R*np.arcsin(np.sqrt(a))

def cont (x):
    if any(x.str.contains('Home','private')):return 1
    else : return 0

## **2) Friends checkins:**


- We want to find checkin patterns between friends gaherings.
- For that, we create a new dataframe, with whome we are are going to work for this whole part.
- We assume two friends have met together if they have checked in the same place with at most one hour difference. Since our dataframe is labeled and each place has its own id, we don't need to do any approximation on the checkin distance between friends.


In [None]:
def friends_gatherings(PROCEEDED_PATH,FRIENDSHIPS_PATH):
    """
    function to get dataframe where friends gathered
    INPUTS :
    PROCEEDED_PATH : proceeded dataset path
    FRIENDSHIPS_PATH : friendships (edges) path
    OUTPUTS :
    1) dataframe that will be used for 1 futur plot, contains two columns : distance from home and probability of checkin
    This dataframe isn't limited to friends checkins and is used to see if there is any difference
    between considering friends gatherings or not
    2) Friends dataframe : contains the dataframe that will be used all this part for our studies.
    This dataframe contains the columns ['User ID','Venue ID','day','local time','lat','lon','place','country']

    In order to have a study that is the most accurate possible, we only work on users that checked in their homes at least once.
    That helps us make no approximation on the home location.
    Moreover, our dataframe being large, we don't have the problem of not having enough data to study.

    In order to preserve memory, we will import the friendship dataframe in this function, we won't need it after that.
    """
    #Import two datasets
    df=pd.read_csv(PROCEEDED_PATH,parse_dates=['local time'])
    friendships=pd.read_csv(FRIENDSHIPS_PATH,sep='\t',encoding='latin-1',names=['User ID','User2 ID'], header=None)

    #clean edges dataframe : erase columns where the person following and the person followed is the same
    friendships=friendships[friendships['User ID']!=friendships['User2 ID']]

    print('begin getting homes')
    #create a new dataframe : df1 : used to find the home location of each user
    #To do that,  we keep only the line where the user checked in their homes
    #Then, we get the means of checkin latitude and longitude for each user and each place checked in to find the home location
    # ==> The home location is the mean of checkins that were in the home of a user
    # We finally return only the three columns needed to perform the final merge to have the final

    #We just initialize the two columns we need
    df1=df[['User ID','lat','lon','place']].rename(columns={'lat':'home lat','lon':'home lon'})
    df1=df1.groupby(['User ID','place'],as_index=False).\
        agg({'home lat':'mean','home lon':'mean'},axis='columns')
    df1=df1[df1['place'].str.contains('Home (private)',regex=False)][['User ID','home lat','home lon']]
    #df1 now contains for each user his home longitude and home latitude

    #We then perform a merge with the original dataset to integrate the home coordinates to the dataset
    df1=df1.merge(df[['User ID','Venue ID','day','local time','lat','lon','place','country']])
    df1['dist home']=calculate_distance(df1['lat'],df1['lon'],df1['home lat'],df1['home lon']).round()

    #Now we move to finding the final dataframe (the one we will use in the future)
    merged_friends=pd.DataFrame()
    chunksize=10**6
    numb_chunks=int(np.ceil(df1.shape[0]/chunksize))
    print('begin merging')
    #We work with chunks of size 10**6 each
    for i in range(numb_chunks):
        # Since we want the checkin place to be exactly the same for each user (have the same id),
        #we perform the merge on both the user and his checkin place (this procedure helped us save much memory and time)
        tmp_merge=df1[chunksize*i:chunksize*(i+1)].merge(friendships).\
                    merge(df1[['User ID','Venue ID','local time']],\
                          left_on=['User2 ID','Venue ID'],right_on=['User ID','Venue ID']).\
                    rename(columns={'dist home_x':'dist home'})

        #filter the tmp_merge with friends that checked in at most with one hour difference in the same place
        tmp_merge=tmp_merge[(np.abs((tmp_merge['local time_x']-tmp_merge['local time_y']).dt.total_seconds())<3600)]
        tmp_merge=tmp_merge.rename(columns={'local time_x':'local time'})

        #append the chunk to the final dataset
        merged_friends=merged_friends.append(tmp_merge[['day','local time','place','country','dist home']],ignore_index=True)

        if i==int(0.5*numb_chunks):print('halfway through merging')
    print('finished merging')
    return get_vects_plot(df1[['dist home']]),merged_friends

In [None]:
def get_vects_plot(df):
    """
    function to calculate the probability of checkin as a function of the distance
    Returns a dataframe where there are two columns: the distance from home and its probiability

    We begin by counting the number of checkin for each distance
    The we divide by the total number of counts to get a  probability.
    """
    #Create a new column, we will change it after
    #This new column will contain the probability of probability of checking in knowing the distance from home
    df1=df.copy()
    df1['proba dist']=df1['dist home']

    #'proba dist' column now contains the number of checkins for each distance
    df1=df1.groupby('dist home',as_index=False).count()[['dist home','proba dist']]

    #We divide the 'proba dist' column by the total number of checkins
    df1.loc[:,'proba dist']=df1['proba dist']/df1['proba dist'].sum()
    return df1[['dist home','proba dist']]

In [None]:
PROCEEDED_PATH='processed_dataset.csv'
FRIENDSHIPS_PATH='dataset_WWW_friendship_new.txt'
df_tot,df_friends=friends_gatherings(PROCEEDED_PATH,FRIENDSHIPS_PATH)

In [None]:
def apply_median(y,N=6):
    """
    ################  Function used also in P4 ################

    smoth curve using median
    y : to be plotted
    N : Number of items to use to smooth curve
    We choose not to do this process for small values because the curve is already smooth for small distances
    """
    y1=np.copy(y)

    for i in range (N,len(y)):
        if i>N:y1[i]=np.median(y[0:i+N])
        elif i<len(y)-N : y1[i]=np.median(y[i-N:i+N])
        else : y1[i]=np.median(y[i-N:len(y)])

    return y1

In [None]:
df_friends.head(2)

In [None]:
def power_func(x, a, b):
    """
    function to compute a*(x^b)
    """
    return a * np.power( x,b)


In [None]:
df_plot_friends=get_vects_plot(df_friends)


In [None]:
df_plot_friends.head(2)

In [None]:
#The change_point isthe point where the behavior of the curves changes
#We find the chnge point empirically
change_point=16
#We create a logspace x_axis, we will use this vector for our futur plots
x_log=np.logspace(0,4.05,50)

#For both dataframes, we do an interpolation to find the corresponding values in the new logspace axis
#First, we begin by finding the interpolation functions for each vector,
#Then, we apply them of the dataframes
f_friends=interp1d(df_plot_friends['dist home'],df_plot_friends['proba dist'],kind='zero')
f_tot=interp1d(df_tot['dist home'],df_tot['proba dist'],kind='zero')

#y_friends,y_tot will be the vectors used for the plot.
#We don't apply the  median because we want to see the true curve, not a smoothed one.
y_friends=f_friends(x_log)
y_tot=f_tot(x_log)

#Since we want to approximate the parameters of our curves, we use scipy function 'fit_curve'
#'params_friends1' are parameters for smaller distnces
#'params_friends2' are parameters for bigger distances
params_friends1,_=curve_fit(power_func,x_log[:change_point],y_friends[:change_point])
params_friends2,_=curve_fit(power_func,x_log[change_point:],y_friends[change_point:])

plt.figure(figsize=(10,5))
#Create marker style for both vectors and do the plots
#Plot all checkins
marker_style_all = dict(color='r', linestyle=':', marker=(6, 2, 0),markersize=16)
plt.loglog(x_log,y_tot,markevery=0.05, **marker_style_all,label='all checkins')

#Plot checkins proceeded only with friends
marker_style_friends = dict(color='blue', linestyle=':', marker='o',markersize=16,fillstyle='none')
plt.loglog(x_log,y_friends,markevery=0.05, **marker_style_friends,label='only meeting friends')
plt.xlabel('distance from home')
plt.ylabel('probability')
plt.title('fraction of friends met as a function of the distance from home')

#Now we proceed the plot of the approximation curve for smaller and larger distances
plt.loglog(x_log[change_point-1:], power_func(x_log[change_point-1:], params_friends2[0], params_friends2[1]),\
                  color='g',linewidth=3,label='slopes (meeting friends)')
plt.loglog(x_log[:change_point], power_func(x_log[:change_point], params_friends1[0], params_friends1[1]),\
                  color='g',linewidth=3)
plt.axvline(x_log[change_point-1],linestyle='--',linewidth=1, color='grey')
#plt.xticks([change_point+1],[round(x_log[change_point-1],0)])
plt.legend()

In [None]:
print('we have slopes of {} for distances inferior to 30km and slopes of {} for distances superior to 20km'.\
      format(round(params_friends1[1],2),round(params_friends2[1],2)))

In [None]:
print('For distances inferior to 20km:\nThe distribution checkins knowingthe distance from home can ba approximates as :\
 P(x)={}*exp({}*x)'.format(round(params_friends1[0],2),round(params_friends1[1],2)))
print('For distances superior to 20km:\nthe distribution checkins knowingthe distance from home can ba approximates as :\
 P(x)={}*exp({}*x)'.format(round(params_friends2[0],2),round(params_friends2[1],2)))

- We can also approximate the probability of moving from home knowing the distance with two equations:

$$ P(x)=
\begin{cases}
    0.1e^{-0.44x} & \text{if x<20 km}\\
    0.35e^{-1.3x} & \text{otherwise}
\end{cases}
$$

- We notice a change in the slope at a distance of approximately 20km distance from home. This behavior is similar to the one described in the paper. However, some differences are noticeble:
1) The shift happened in a distance of 20km from home (vs 100km using other daasets)
2) The slope are different than the ones described in the paper.
Firstly, the slopes generally are smaller in our study compared to the ones found in the paper.
Secondly, while the slope is smaller for small distances in the paper (-1.9 < -0.9), the behaviour is different in our study (-0.44 > -1.3)

- Finally, we notice in the plots that whether a user visited a friend or not does not make much differences in the overall behavior of checkins. We make the hypothesis that the behavior is the same and we test it. This hypothesis states that friends don't have any influence on a user's movement.
- Before going into that, we test whether our data is normally distributed or not. We perform a Kolmogorov Smirnov test, a test made to check whether a distribution is normal or not. The  Null hypothesis is that the sample comes from a normal distribution. and we reject it depending on the pvalue we get. The p-value is the probability of obtaining results that aren't more extereme than the ones observed, if we assume the null hypothesis is true.


In [None]:
from statsmodels.stats import diagnostic
_, p_value_friends = diagnostic.kstest_normal(df_plot_friends['proba dist'], dist = 'norm')
_, p_value_tot = diagnostic.kstest_normal(df_tot['proba dist'], dist = 'norm')

In [None]:

print('Having the Null hypothesis that the data is derived from a normal distribution, we get :\n \t pvalue {} \
for the total checkins \n \t pvalue {} for the checkins with only friends'.format(p_value_tot,p_value_friends))

In [None]:
print('The dataframe containing only checkins with friends contains {} ligns\n\
The dataframe containing all checkins contains {} ligns'.format(df_plot_friends.shape[0],df_tot.shape[0]))

$\Rightarrow$ Using the result found before, we conclude our data is not derived from a normal distribution

- Now we test whether the behaviour is the same or not for the two sets (considering all checkins vs considering only checkins with friends). This will help us know whether friends do have an influence on people's movement or not.
- We use the assertions below:
1) We took two different samples from the same population.
2) The data we're treating (probabilities) is not derived from a normal disrtibution (verified).
3) Our samples are paired since they are derived from the same dataset
$\Rightarrow$  We do a Wilcoxon test.
To test the null hypothesis that there is no difference in behavior, we can apply the two-sided test.

- Our frames being of different shapes, we create a vector that will be used to make to comparision between the two frames and we use interpolations in order to have the the most precise values possible.


In [None]:
x_log_exp=np.logspace(0,4.05,3000)

#For both dataframes, we do an interpolation to find the corresponding values in the new logspace axis
#First, we begin by finding the interpolation functions for each vector,
#Then, we apply them of the dataframes
f_friends_exp=interp1d(df_plot_friends['dist home'],df_plot_friends['proba dist'],kind='zero')
f_tot_exp=interp1d(df_tot['dist home'],df_tot['proba dist'],kind='zero')

#y_friends_exp,y_tot_exp will be the expanded (working with 3000 points) vectors used for the hypothesis testing
y_friends_exp=f_friends_exp(x_log)
y_tot_exp=f_tot_exp(x_log)

In [None]:
from scipy.stats import wilcoxon

In [None]:
_, p_value = wilcoxon(np.abs(y_friends_exp-y_tot_exp))
print('The pvalue found after performing a wilcoxon test is :',np.round(p_value,2))

- Having found a p-value of 0.0, we strongly reject our Null hypothesis.
$\Rightarrow$ We conclude that friends do have a significant importance on a user's mobility.

## **3) Places checkin patterns**
### I) Preprocessing :
- From now on, we only work with checkins with friends only : Every thing we will be studying are meetings between friends.
- We now go through our dataset and gategorize our places features into different categories:
1) `Eat` : Going to eat with friends (restaurant , fast food ...)
2) `Study` : Studying (being in school, universty ...)
3) `Drink` : Having a drink with friends, going out ...
4) `Culture` : Going to watch a movie, visit monuments ...
5) `Home` : meet at someone's home
6) `Move` : take public transports or travel to far places
7) `Consume` : Visit stores, malls ...
8) `Work` : Being in work's place
9) `Entertain` : Go to a spa, hotel, beach,park ...
10) `Sport` : practise sports together

- We also categorize the days of the week into two types:
1) `Working days` : Monday until Friday
2) `Week end day` : Saturday and Sunday

### II) Processing :
- We study the probability people meet in each category
- We compare normalized probabilities




In [None]:
df_classified=df_friends.copy()

#Change the name of variables to its type (will be used after)
df_classified.loc[df_classified['place'].str.\
                  contains('restaurant|Burger|pizza|Diner|food|Steakhouse|\
                  BBQ|Dessert|Ramen|Ice Cream|Fried|Sandwich|breakfast|snack|\
                  taco|hot|soup|wings',case=False),'place']='Eat'

df_classified.loc[df_classified['place'].str.\
                  contains('college|University|school|student',case=False),'place']='Study'

df_classified.loc[df_classified['place'].str.\
                  contains('coffee|Bar|Nightclub|pub|Lounge|Beer|tea|Nightlife',\
                           case=False),'place']='Drink'

df_classified.loc[df_classified['place'].str.\
                  contains('multiplex|Movie|Theater|concert|Music|historic|arts|\
                  Museum|library|Monument|temple|art',case=False),'place']='Culture'

df_classified.loc[df_classified['place'].str.\
                  contains('Home|Residential|Building',case=False),'place']='Home'

df_classified.loc[df_classified['place'].str.\
                  contains('station|airport|subway|travel|boat|bus',case=False),'place']='Move'

df_classified.loc[df_classified['place'].str.\
                  contains('store|mall|plaza|shop|boutique|market',case=False),'place']='Consume'

df_classified.loc[df_classified['place'].str.\
                  contains('work|office|Startup|professional',case=False),'place']='Work'

df_classified.loc[df_classified['place'].str.\
                  contains('soccer stadium|Entertainment|Outdoor|beach|park|event|\
                  Arcade|resort|hotel|spa|Casino',case=False),'place']='Entertainement'

df_classified.loc[df_classified['place'].str.\
                  contains('soccer field|sport|gym|stadium|surf|pool|golf|\baseball',\
                           case=False),'place']='Sport'

#Change the name of a day to its type (work day or week end)
df_classified.loc[df_classified['day'].str.contains('mon|tue|wed|thu|fri',case=False),'day']='Work day'
df_classified.loc[df_classified['day'].str.contains('sat|sun',case=False),'day']='Week end'

#Keep only the place types we mentioned before
df_classified=df_classified.loc[df_classified['place'].str.contains('sport|Entertainement|Work|Consume|Move|Home|Culture|Drink|Study|Eat',case=False),:]
df_places=df_classified.groupby(['place','day'],as_index=False).agg({'country':'size'}).\
                sort_values(by='country',ascending=False)[['place','day','country']].\
                rename(columns={'country':'numb checkins','place':'place type','day':'day type'})
#df1.groupby('place_x').count().sort_values(by='day_x',ascending=False).head(50)

In [None]:
df_classified.head(2)

In [None]:
df_places.head(2)

- Now we move to visualizing the distributions :

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(x=df_places['numb checkins']/df_places['numb checkins'].sum(),\
            y=df_places['place type'],orient='h',hue=df_places['day type'])
plt.xlabel('probability')
plt.ylabel('type of place')
plt.title('meeting patterns between friends')

- The visualization above shows us that people are the most likely to be studying. This observation can be explained by the fact that students are the most likely to use sociaal media, so the most number of checkins can be found among students.
- However, we can't draw more conlusions because we have to take into account that there are 5 work days and 2 week end days in a week. $\rightarrow$ In order to be able to compare and study people's checkins, we normalize by the number of days each probability.
- We will :
1) Divide each probability that any event occured in a working day by 5
2) Divide each probability that any event occured in a week end day by 2
3) Get the normalized difference (normalized probability that a user checks in in a working day - normalized probability that a user checks in in a week end day)
4) Divide the result by the probability of occurence in a work day to have a ratio.
The final equation we will have is (for each place):
$$
final ratio = \frac{\frac{P_{workday}}{5}-\frac{P_{workend}}{2}}{\frac{P_{weekday}}{5}}
$$

- In the end:
1)If this difference is positive : people are more likely to checkin in the place in a working day
2)If this difference is negative : people are more likely to checkin in the place in a week end day
3)The absolute value gives us the magnitude of the absolute ratio


In [None]:
def ratio (x):
    """
    Function to get difference between items atn then normalize
    Since x contains one negative (week end) and one positive (work day) value,
    max(x) is the value  of probability for a working day
    """
    return sum(x)/max(x)

In [None]:
#df4=df_classified.copy()
#normalize to have proba/day and then take the difference between the items:

#First, we divide by the number of days and set week end probabilities to be negative values
#So that we perform a sum after
df_places.loc[df_places['day type'].str.contains('Week|end',case=False),'numb checkins']=\
                            -df_places.loc[df_places['day type'].str.\
                                               contains('Week|end',case=False),'numb checkins']/2

df_places.loc[df_places['day type'].str.contains('Work|day',case=False),'numb checkins']=\
                            df_places.loc[df_places['day type'].str.\
                                              contains('Work|day',case=False),'numb checkins']/5
#Now we groupby the place type and get the ratio we need
df_places=df_places.groupby('place type',as_index=False).\
                            agg({'numb checkins':ratio},axis='columns').\
                            sort_values(by='numb checkins',ascending=False)


In [None]:
plt.figure(figsize=(15,6))
sns.barplot(x=df_places['numb checkins'],y=df_places['place type'],orient='h', palette='viridis_r')
plt.xlabel('difference between normalized probabilities')
plt.ylabel('type of place')
plt.title('meeting patterns between friends day week vs week end')

## **4) Times checkin patterns**

- In this final part, we will study the time friends meet.
We willclassify a day in two parts : day and night.
- Since work usually finished at 5pm, we consider day hours of the day between 5h and 17h, and night as other hours of the day

In [None]:
#In order to get our times, we substract 4 from total checkin hours
#Then we get the hours that are superior to 14
df_classified.loc[:,'night']=(df_classified.loc[:,'local time'].dt.hour-5)>12
df_classified.loc[:,'day']=(df_classified.loc[:,'local time'].dt.hour-5)<=12

In [None]:
df_classified.head(2)

In [None]:
# We groupby the place type and then get the ratio of evening checkins
df_classified=df_classified.groupby('place',as_index=False).\
                      agg({'night':'sum','country':'size','day':'sum'}).\
                      rename(columns={'country':'numb checkins'})
df_classified['ratio night']=df_classified['night']/sum(df_classified['numb checkins'])
df_classified['ratio day']=df_classified['day']/sum(df_classified['numb checkins'])
#We sort values to have a good looking visualization
df_classified=df_classified.sort_values(by='ratio day')

In [None]:
df_classified.head(2)

In [None]:
df_classified.plot(x='place', y=['ratio night','ratio day'], kind="barh",figsize=(12,6),width=0.8)

plt.xlabel('probiabilities')
plt.ylabel('type of place')
plt.title('checkins in different times of day')

- We first notice that :
1) The biggest probability of checking in with friends during day occurs during studying
2) The biggest probability of checking in with friends during night occurs during night.
- We now do the same work we did previously with the places patterns in order to compare the ratio of checkins:
1) We get the difference between probabilities that a user checks in during the day or during night.
2) We divide the result by the probability of a checkin during the day
The final equation we will have is (for each place):
$$
final ratio = \frac{P_{daycheckin}-P_{nightcheckin}}{P_{daycheckin}}
$$

In [None]:
#get the ratio ( values used in the plot below)
df_classified['ratio final']=(df_classified['ratio day']-df_classified['ratio night'])/df_classified['ratio day']
plt.figure(figsize=(15,6))
sns.barplot(x='ratio final',y='place',data=df_classified.sort_values(by='ratio final',ascending=False),\
            orient='h', palette='viridis_r')

plt.xlabel('checkin patterns in different times of day')
plt.ylabel('type of place')
plt.title('ratio of probabilities')

- We see that friends are more likely to meet during the day to study or work. On the other hand, they are more likely to meet at night to have night, even if the difference is small.
- We can conlude that most checkins happen to be during the day and that people tend less to checkin at night

### III) Conclusions :
- People tend to meet their friends more in work or study places during the week. This can be explained by the fact that people usually have their coworkers and classmates as friends on social media. Studying or working is part of people's obligations and these are task are generally proceeded during the week
- However, when it comes to free time (week end for most of people), people choose to meet their friends in diverting places (every other category that doesn't involve working or studying). Specifically, people are the most likely to go out in weekend to have drinks or to entertainement places.
- Finally, people tend to spend their day working and studying, and then spend their evening and night in diverting places (eating, having drinks)