### Importing Libraries

In [19]:
import pandas as pd
from datetime import datetime

### Donations dataset
The dataset consists of three CSV files:
- donations
- year of membership
- emails read

---> analyzing whether the member status is annual or just the most recent;

---> analyzing the email table to understand the timestamp of the 'week' column;

#### Exploratory Data Analysis

In [9]:
year_joined = pd.read_csv("/Users/dellacorte/py-projects/data-science/time-series-pocket-reference/getting-time-series-datasets/datasets/year_joined.csv")

# analyzing whether the member status is annual or just the most recent
year_joined.groupby("user").count().groupby("userStats").count()

userStats,yearJoined
1,1000


All 1,000 members have exactly one status, so the table does not record a status history. `yearJoined` is most likely the year each member joined, and the accompanying status could be either the current status or the status at the time of joining.

In [18]:
# analyzing the email table to understand the timestamp of the 'week' column
emails = pd.read_csv("/Users/dellacorte/py-projects/data-science/time-series-pocket-reference/getting-time-series-datasets/datasets/emails.csv")
# emails.head()
emails.dtypes

#empty_emails = emails[emails.emailsOpened < 1]
#empty_emails

emailsOpened    float64
user            float64
week             object
dtype: object

Either weeks with no email activity are simply left out of the table, or every member has at least one email event every week. The second option is hard to believe, so we can check by looking at the history of a single user:

In [15]:
user_998 = emails[emails.user == 998]
user_998

Unnamed: 0,emailsOpened,user,week
25464,1.0,998.0,2017-12-04 00:00:00
25465,3.0,998.0,2017-12-11 00:00:00
25466,3.0,998.0,2017-12-18 00:00:00
25467,3.0,998.0,2018-01-01 00:00:00
25468,3.0,998.0,2018-01-08 00:00:00
25469,2.0,998.0,2018-01-15 00:00:00
25470,3.0,998.0,2018-01-22 00:00:00
25471,2.0,998.0,2018-01-29 00:00:00
25472,3.0,998.0,2018-02-05 00:00:00
25473,3.0,998.0,2018-02-12 00:00:00


Some weeks are missing. For example, there is no row for the week of December 25, 2017: the data jumps from December 18, 2017 to January 1, 2018. We can confirm this arithmetically:

In [21]:
# converting object to datetime
emails['week'] = pd.to_datetime(emails['week'])

# span between the user's first and last recorded week, in weeks
user_membership = (max(emails[emails.user == 998].week) -
                   min(emails[emails.user == 998].week)).days / 7

user_membership

25.0

In [23]:
# number of corresponding weeks of data for user = 998
quantity_weeks_data_998 = emails[emails.user == 998].shape
quantity_weeks_data_998

(24, 3)

We have 24 rows, but we should have 26: a span of 25 weeks between the first and last week covers 26 weekly rows when both endpoints are counted. So two weeks of this user's data are missing. The same calculation could also be run for all users at once.
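The per-user check can be sketched for every user at once with a `groupby`. The snippet below runs on a small made-up frame (`toy`) rather than the real `emails` table, and the column names `expected_weeks`, `observed_weeks`, and `missing_weeks` are illustrative, not part of the original notebook. It assumes at most one row per user per week.

```python
import pandas as pd

# Made-up weekly data: user 1 has a gap (2018-01-15 is absent), user 2 does not.
toy = pd.DataFrame({
    "user": [1, 1, 1, 2, 2],
    "week": pd.to_datetime([
        "2018-01-01", "2018-01-08", "2018-01-22",
        "2018-01-01", "2018-01-08",
    ]),
    "emailsOpened": [3, 1, 2, 1, 1],
})

# First week, last week and number of rows for each user.
weeks = toy.groupby("user")["week"].agg(["min", "max", "count"])

# Weeks spanned, counting both endpoints: (last - first) / 7 days + 1.
weeks["expected_weeks"] = (weeks["max"] - weeks["min"]).dt.days // 7 + 1
weeks["observed_weeks"] = weeks["count"]
weeks["missing_weeks"] = weeks["expected_weeks"] - weeks["observed_weeks"]

print(weeks[["expected_weeks", "observed_weeks", "missing_weeks"]])
```

Applied to the real table, the same pattern would replace `toy` with `emails` once the `week` column has been converted to datetime.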

Filling in all missing weeks for all users of the dataset

In [24]:
complete_idx = pd.MultiIndex.from_product((set(emails.week),
                                          set(emails.user)))

We will use this index to re-index the original table and fill in the missing values, in this case with 0, on the assumption that if nothing was recorded there was nothing to record.
We then reset the index so that the week and user information is available as columns again, and name those columns:

In [29]:
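all_email = (emails.set_index(['week', 'user'])
                   .reindex(complete_idx, fill_value=0)
                   .reset_index())

# complete_idx has no level names, so reset_index() yields generic ones; name the columns explicitly
all_email.columns = ['week', 'user', 'emailsOpened']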
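To see what the re-indexing does, here is a minimal sketch on a made-up two-user frame (`toy` and `full` are illustrative names, not from the notebook). As a small variation on the cells above, it uses `unique()` instead of `set()` and passes `names=` to `from_product`, which keeps the row order deterministic and lets `reset_index()` produce named columns directly.

```python
import pandas as pd

# Made-up data: user 2 has no row for the week of 2018-01-01.
toy = pd.DataFrame({
    "week": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-08"]),
    "user": [1, 1, 2],
    "emailsOpened": [3, 2, 1],
})

# Every (week, user) combination that could exist in the data.
complete_idx = pd.MultiIndex.from_product(
    [toy["week"].unique(), toy["user"].unique()],
    names=["week", "user"],
)

# Re-index onto the full grid; combinations with no record get 0.
full = (
    toy.set_index(["week", "user"])
       .reindex(complete_idx, fill_value=0)
       .reset_index()
)
print(full)
```

The result has one row per week and user, and the combination that was absent from `toy` appears with `emailsOpened` equal to 0.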

In [30]:
all_email[all_email.user == 998].sort_values('week')

Unnamed: 0,week,user,emailsOpened
57133,2015-02-09,998.0,0.0
37190,2015-02-16,998.0,0.0
38807,2015-02-23,998.0,0.0
64679,2015-03-02,998.0,0.0
17247,2015-03-09,998.0,0.0
...,...,...,...
79232,2018-04-30,998.0,3.0
42041,2018-05-07,998.0,3.0
36651,2018-05-14,998.0,3.0
11318,2018-05-21,998.0,3.0
