# Bonus track

In this section we will go through some extra steps to understand our data a bit better.

We will start with a join to bring all the information together.

Second, we will dig into the data to see whether it offers interesting insights or questions that we could answer by analyzing this information.

Finally, we will share some conclusions.



In [23]:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
import datetime

In [6]:
#upload and show df with members information
member_df = pd.read_excel('C:/Users/Usuario/Desktop/LW_docs/Member_nopass.xlsx')
member_df.head()

Unnamed: 0,UserID,UserCompanyLinkID,Age Group,BMI Group,Gender
0,002c5434-2390-440d-886f-f03b09c79651,3907,,Overweight,Male
1,0051aee8-67ad-4186-9934-10d61b967bf2,4143,40 - 49,Normal weight,Female
2,005ab8a5-ab67-421b-9a39-3998dc9232ce,3716,,Normal weight,Male
3,005dfde1-9412-4cc8-baad-9db8c3c568da,6289,40 - 49,Normal weight,Female
4,009220b1-543e-4362-9ccf-479184c01063,4860,40 - 49,Overweight,Male


In [9]:
#upload and show df with app engagement information
AppEngagement_df = pd.read_excel('C:/Users/Usuario/Desktop/LW_docs/AppEngagement_nopass.xlsx')

#convert UserCompanyLinkID from float to integer so it has the same type as in member_df (missing values become 0)
AppEngagement_df['UserCompanyLinkID'] = AppEngagement_df['UserCompanyLinkID'].fillna(0).astype(int)

AppEngagement_df.head()

Unnamed: 0,UserCompanyLinkID,AddDateTimeUTC
0,6435,2021-03-25T13:53:08.7370000
1,4439,2021-03-25T13:46:18.6300000
2,4266,2021-03-25T13:22:33.3130000
3,4266,2021-03-25T13:22:32.9670000
4,3705,2021-03-25T13:19:18.4730000
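
The cast above fills missing IDs with 0, which could later be mistaken for a real company link. A minimal sketch of one possible alternative, assuming a pandas version that supports the nullable `Int64` dtype; `toy_ids` and its values are made up for illustration and are not taken from the real file:

```python
import pandas as pd

# Toy stand-in for the UserCompanyLinkID column (illustrative values only)
toy_ids = pd.Series([6435.0, None, 4266.0], name="UserCompanyLinkID")

# The nullable integer dtype keeps missing IDs as <NA> instead of turning them into 0
nullable_ids = toy_ids.astype("Int64")
print(nullable_ids)
```

Whether this is preferable depends on how the column is used afterwards; keys of different integer dtypes may need aligning before a merge.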


In [10]:
#upload and show df with physical activity information
Physical_activity_df = pd.read_excel('C:/Users/Usuario/Desktop/LW_docs/Physical_activity_nopass.xlsx')
Physical_activity_df.head()

Unnamed: 0,UserID,ActivityDate,Activity_Score,Calories,Distance,Steps
0,002c5434-2390-440d-886f-f03b09c79651,2019/12/25 12:00:00 AM,91,586,6153,6622
1,002c5434-2390-440d-886f-f03b09c79651,2019/12/26 12:00:00 AM,0,0,0,0
2,002c5434-2390-440d-886f-f03b09c79651,2019/12/27 12:00:00 AM,0,0,0,0
3,002c5434-2390-440d-886f-f03b09c79651,2019/12/28 12:00:00 AM,0,0,0,0
4,002c5434-2390-440d-886f-f03b09c79651,2019/12/29 12:00:00 AM,0,0,0,0


In [51]:
#merge dfs into a new df (inner join on UserID, the default for merge)
bonus_df = member_df.merge(Physical_activity_df,on='UserID')

#first groupby analysis
bonus_df.groupby(['BMI Group','Gender']).size()

BMI Group      Gender
Normal weight  Female     17
               Male       68
Overweight     Female     19
               Male      774
dtype: int64
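
Because the merge is an inner join, users that appear in only one of the two tables are dropped silently. The sketch below shows one way to check how many rows that affects, using small toy frames (`toy_members` and `toy_activity` are hypothetical and only mimic the real column names):

```python
import pandas as pd

# Toy frames standing in for member_df and Physical_activity_df (illustrative values)
toy_members = pd.DataFrame({"UserID": ["a", "b", "c"],
                            "Gender": ["Male", "Female", "Male"]})
toy_activity = pd.DataFrame({"UserID": ["a", "a", "d"],
                             "Steps": [6622, 0, 100]})

# An outer join with indicator=True labels each row by the table(s) its key came from
check_df = toy_members.merge(toy_activity, on="UserID", how="outer", indicator=True)
merge_counts = check_df["_merge"].value_counts()
print(merge_counts)
```

Rows labelled `left_only` are members without activity records, and `right_only` rows are activity records without a matching member.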

In [71]:
#count how many records we have by company (one row per user and activity date)
bonus_df['UserCompanyLinkID'].value_counts()

3907    433
4034    338
3962    117
5057     67
4951     19
4143     16
4860      2
4885      1
3716      1
6289      1
Name: UserCompanyLinkID, dtype: int64
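
To turn these counts into shares, the sketch below rebuilds the counts printed above as a Series and computes the fraction of records held by the three largest companies. On the real data, `bonus_df['UserCompanyLinkID'].value_counts(normalize=True)` should give the per-company shares directly; the names `company_counts` and `top3_share` are only illustrative.

```python
import pandas as pd

# Record counts per company, copied from the value_counts() output above
company_counts = pd.Series(
    {3907: 433, 4034: 338, 3962: 117, 5057: 67, 4951: 19,
     4143: 16, 4860: 2, 4885: 1, 3716: 1, 6289: 1}
)

# Share of all records per company, then the combined share of the three largest
company_share = company_counts / company_counts.sum()
top3_share = company_share.nlargest(3).sum()
print(round(top3_share * 100, 1))
```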

In [54]:
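#stats for BMI category (numeric_only=True skips text columns such as UserID, which recent pandas versions require)
bmi_df = bonus_df.groupby(['BMI Group']).mean(numeric_only=True)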
bmi_df.drop('UserCompanyLinkID', axis=1, inplace=True)

bmi_df

BMI Group,Activity_Score,Calories,Distance,Steps
Normal weight,42.682353,275.258824,2896.788235,4021.164706
Overweight,126.813367,826.905422,7737.912989,8265.482976


In [55]:
#stats by gender
gender_df = bonus_df.groupby(['Gender']).mean(numeric_only=True)
gender_df.drop('UserCompanyLinkID', axis=1, inplace=True)

gender_df

Gender,Activity_Score,Calories,Distance,Steps
Female,14.805556,95.833333,1200.694444,1304.583333
Male,123.109264,802.473872,7528.7019,8134.634204


In [56]:
#stats by age group
age_df = bonus_df.groupby(['Age Group']).mean(numeric_only=True)
age_df.drop('UserCompanyLinkID', axis=1, inplace=True)

age_df

Age Group,Activity_Score,Calories,Distance,Steps
30 - 39,37.344828,240.942529,2546.655172,3329.413793
40 - 49,34.315789,221.421053,1994.631579,3198.052632


In [57]:
#stats by company
company_df = bonus_df.groupby(['UserCompanyLinkID']).mean(numeric_only=True)

company_df

UserCompanyLinkID,Activity_Score,Calories,Distance,Steps
3716,48.0,309.0,3831.0,5208.0
3907,117.406467,755.616628,7084.872979,8222.533487
3962,97.495726,628.726496,8184.128205,7348.811966
4034,146.168639,965.89645,9027.674556,8817.573964
4143,23.8125,153.625,1786.5625,2637.5625
4860,110.0,709.5,2927.5,6899.0
4885,0.0,0.0,24.0,33.0
4951,5.315789,34.842105,588.526316,0.0
5057,46.985075,302.985075,3139.597015,4322.776119
6289,51.0,330.0,3458.0,4764.0


# Conclusions

After analysing all the information, we can observe the following:

- The records are not evenly distributed across "UserCompanyLinkID": well over 70% of all records belong to the top 3 companies (888 of the 995 records counted above, roughly 89%).

- About **77.74%** of the information in our dataset belongs to male members; in the BMI and gender breakdown above, only 36 records belong to female members against 842 for male members. This is problematic when the sample is limited and, from my point of view, we can't extrapolate female members' behaviour from this information.

- As with "Gender", a large part of the information **(79.61%)** belongs to a single category, "Overweight" in this case. So we can't say that the data sample represents all the categories present in the dataset.
 
- We find a similar issue with "Age Group", as the majority of users didn't provide it.
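
The imbalances listed above can be quantified with `value_counts(normalize=True, dropna=False)`, which also reports missing answers as their own category. A minimal sketch on a toy column (`toy_age` and its values are illustrative, not the real data):

```python
import pandas as pd

# Toy member-level column with missing answers (illustrative values only)
toy_age = pd.Series(["40 - 49", None, "30 - 39", None, None], name="Age Group")

# dropna=False keeps the missing answers as their own category in the shares
age_share = toy_age.value_counts(normalize=True, dropna=False)

# Fraction of rows where the value is missing
missing_share = toy_age.isna().mean()
print(age_share)
print(missing_share)
```

The same two calls could be applied to "Gender" and "BMI Group" in `member_df` to see how many users each category covers.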

In the rest of the code, we compute the mean of every variable for each groupby category. As we saw before, the insights obtained from this code are not representative of what we would find in day-to-day behaviour. Again, the reason is that the data is not evenly distributed.

Finally, there is a table that computes all statistics by UserCompanyLinkID. In this table we can observe that the fewer records we have for a company, the less representative its figures are. Under these circumstances, descriptive statistics are not 100% useful for us.
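
One way to make this limitation visible is to report the number of records next to each mean. The sketch below does this on a toy frame; `toy_df` and the `min_records` threshold are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Toy merged frame (illustrative values; company 4885 has a single record)
toy_df = pd.DataFrame({
    "UserCompanyLinkID": [3907, 3907, 3907, 4885],
    "Steps": [8000, 9000, 7000, 33],
})

# Show the record count beside each mean so small groups stand out
summary = toy_df.groupby("UserCompanyLinkID")["Steps"].agg(["count", "mean"])

# min_records is an arbitrary cut-off for this sketch, not a recommended value
min_records = 3
reliable = summary[summary["count"] >= min_records]
print(summary)
print(reliable)
```

Filtering on the count does not fix the imbalance, but it avoids reading too much into averages built from one or two rows.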