In [1]:
pwd

'C:\\Users\\Mohamed Asfar\\User Activity Analysis'

## **1. Problem Statement**

Zilto is a business-to-business professional services company that provides online services and products to support the work of its customer companies' employees. Its paid subscribers include companies in diverse areas of business, for example manufacturers, producers, suppliers, retailers, and transportation firms. Besides an annual subscription fee, Zilto bills variable charges based on how much users (the employees of customer companies) engage with its products and services.
The responsibility is to monitor user activity and engagement (whether engagement is stable, dropping, or increasing). For any change in user engagement, possible sources or reasons should be identified and noted. The task at hand is to review the available data, perform the necessary analysis, interpret the findings, and present visualizations.


## **2. Aim of the Project**


The project aims to answer the following questions:<br>
1. Has user activity or engagement dropped, increased, or remained stable? What is the extent of the change in user activity/engagement?<br>
2. Are there any changes in the three stages of the email funnel? If so, what are they?<br>
3. Are changes in user activity/engagement associated with specific devices or hardware (mobile phones, tablet computers, desktop computers, etc.) that customers use to read emails and interact with Zilto’s web portal?<br>
4. Are changes in user activity/engagement associated with specific countries or regions of the world where Zilto’s users are located?<br>



In [2]:
# Importing library for data processing
import pandas as pd

## **3. Discover**

In [3]:
# Reading the dataset
users_raw_data = pd.read_csv("users.csv") 
emails_raw_data = pd.read_csv("emails.csv")
events_raw_data = pd.read_csv("events.csv")

## **3.1 Analyzing the user data**

In [4]:
users_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19066 entries, 0 to 19065
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   user_id       19066 non-null  float64
 1   created_at    19066 non-null  object 
 2   company_id    19066 non-null  float64
 3   language      19066 non-null  object 
 4   activated_at  9381 non-null   object 
 5   state         19066 non-null  object 
dtypes: float64(2), object(4)
memory usage: 893.8+ KB



1. user_id - A unique ID per user<br>
2. created_at - The time the user was created (first signed up)<br>
3. company_id - The ID of the user's company<br>
4. activated_at - The time the user was activated, if they are active <br>
5. language - The chosen language of the user <br>
6. state - The state of the user (active or pending)<br>
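The `info()` output above shows the ID columns stored as `float64` and the timestamp columns as `object`. A minimal sketch of converting them to more convenient types follows. It runs on three rows copied from the `head()` output of this notebook rather than on the full file, and the name `users_sample` is introduced here only for illustration.

```python
import io
import pandas as pd

# Three rows copied from the users_raw_data.head() output in this notebook.
sample_csv = io.StringIO(
    "user_id,created_at,company_id,language,activated_at,state\n"
    "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active\n"
    "1.0,2013-01-01 13:07:46,28.0,english,,pending\n"
    "3.0,2013-01-01 18:40:36,2800.0,german,2013-01-01 18:42:02,active\n"
)
users_sample = pd.read_csv(sample_csv)

# Parse the timestamp columns; empty cells become NaT.
for col in ["created_at", "activated_at"]:
    users_sample[col] = pd.to_datetime(users_sample[col])

# The ID columns have no missing values here, so casting to int is safe.
for col in ["user_id", "company_id"]:
    users_sample[col] = users_sample[col].astype("int64")

print(users_sample.dtypes)
```

The integer cast would fail on a column that contains missing values, so it is worth checking `isnull().sum()` before applying it to the full table.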

In [5]:
users_raw_data.shape

(19066, 6)

In [6]:
users_raw_data.head()

Unnamed: 0,user_id,created_at,company_id,language,activated_at,state
0,0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active
1,1.0,2013-01-01 13:07:46,28.0,english,,pending
2,2.0,2013-01-01 10:59:05,51.0,english,,pending
3,3.0,2013-01-01 18:40:36,2800.0,german,2013-01-01 18:42:02,active
4,4.0,2013-01-01 14:37:51,5110.0,indian,2013-01-01 14:39:05,active


In [7]:
# Checking for null values
users_raw_data.isnull().sum()

user_id            0
created_at         0
company_id         0
language           0
activated_at    9685
state              0
dtype: int64

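The "activated_at" column has 9685 missing values. In the sample rows above, the users with no activation time are the ones whose state is pending, so these gaps most likely reflect users who never activated rather than faulty data. The missing values are left as they are, since neither created_at nor activated_at is central to the engagement analysis.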
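One way to check whether the missing activation times line up with the `state` column is sketched below. The rows are illustrative and modeled on the `head()` output above, and `users_sample` is a name introduced only for this example. The same two expressions could be run on `users_raw_data`.

```python
import pandas as pd

# Illustrative rows modeled on the users_raw_data.head() output above.
users_sample = pd.DataFrame({
    "user_id": [0.0, 1.0, 2.0, 3.0, 4.0],
    "activated_at": ["2013-01-01 21:01:07", None, None,
                     "2013-01-01 18:42:02", "2013-01-01 14:39:05"],
    "state": ["active", "pending", "pending", "active", "active"],
})

# Count missing activation times within each user state.
missing_by_state = (
    users_sample["activated_at"].isnull()
    .groupby(users_sample["state"])
    .sum()
)
print(missing_by_state)

# True only if "activation time is missing" and "state is pending"
# agree on every row.
consistent = (
    users_sample["activated_at"].isnull()
    == (users_sample["state"] == "pending")
).all()
print(consistent)
```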

## **3.2 Analyzing the emails data**

In [8]:
emails_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90389 entries, 0 to 90388
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user_id      90389 non-null  float64
 1   occurred_at  90389 non-null  object 
 2   action       90389 non-null  object 
 3   user_type    90389 non-null  float64
dtypes: float64(2), object(2)
memory usage: 2.8+ MB



1. user_id - A unique ID per user<br>
2. occurred_at - The time the event occurred. <br>
3. action - The name of the event that occurred.  <br>
sent_weekly_digest: the user was delivered a digest email showing relevant conversations from the previous day. <br>
email_open: the user opened the email.  <br>
email_clickthrough: the user clicked a link in the email.  <br>
sent_reengagement_email: this value also appears in the data; it presumably marks a re-engagement email sent to the user. <br>
4. user_type - A numeric code for the type of user (values 1, 2, and 3 in this dataset) <br>
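The email actions map naturally onto the three funnel stages from the project aims (sent, opened, clicked). A minimal sketch of the funnel rates follows. The rows are hypothetical, `emails_sample` is a name introduced only for this example, and the sketch assumes every open and click belongs to a weekly digest, which may not hold in the real data.

```python
import pandas as pd

# Hypothetical email log; the action names are those found in the data.
emails_sample = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 2, 3],
    "action": ["sent_weekly_digest", "email_open", "email_clickthrough",
               "sent_weekly_digest", "email_open",
               "sent_weekly_digest", "sent_weekly_digest"],
})

# Number of rows per action; .get() returns 0 if an action is absent.
counts = emails_sample["action"].value_counts()
sent = counts.get("sent_weekly_digest", 0)
opened = counts.get("email_open", 0)
clicked = counts.get("email_clickthrough", 0)

open_rate = opened / sent               # share of sent emails that were opened
click_to_open_rate = clicked / opened   # share of opened emails that led to a click

print(f"open rate: {open_rate:.2f}, click-to-open rate: {click_to_open_rate:.2f}")
```

Computing the same rates per week, after parsing `occurred_at`, would show whether any stage of the funnel changes over time.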

In [10]:
emails_raw_data.shape

(90389, 4)

In [11]:
emails_raw_data.head()

Unnamed: 0,user_id,occurred_at,action,user_type
0,0.0,2014-05-06 09:30:00,sent_weekly_digest,1.0
1,0.0,2014-05-13 09:30:00,sent_weekly_digest,1.0
2,0.0,2014-05-20 09:30:00,sent_weekly_digest,1.0
3,0.0,2014-05-27 09:30:00,sent_weekly_digest,1.0
4,0.0,2014-06-03 09:30:00,sent_weekly_digest,1.0


In [16]:
emails_raw_data["action"].unique()

array(['sent_weekly_digest', 'email_open', 'email_clickthrough',
       'sent_reengagement_email'], dtype=object)

In [18]:
sorted(emails_raw_data["user_type"].unique())

[1.0, 2.0, 3.0]

## **3.3 Analyzing the events data**

In [19]:
events_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      150000 non-null  float64
 1   occurred_at  150000 non-null  object 
 2   event_type   150000 non-null  object 
 3   event_name   150000 non-null  object 
 4   location     150000 non-null  object 
 5   device       150000 non-null  object 
 6   user_type    134423 non-null  float64
dtypes: float64(2), object(5)
memory usage: 8.0+ MB


1. user_id:	The ID of the user logging the event. Can be joined to user_id in either of the other tables.<br>
2. occurred_at:	The time the event occurred. <br>
3. event_type:	The general event type. <br>
There are two values in this dataset: <br>
signup_flow: refers to anything occurring during the process of a user's authentication, <br>
engagement: refers to general product usage after the user has signed up for the first time. <br>
4. event_name:	The specific action the user took - possible values include the following:<br>
create_user: User is added to the company’s database during signup process<br>
enter_email: User begins the signup process by entering her email address<br>
enter_info: User enters her name and personal information during signup process<br>
complete_signup: User completes the entire signup/authentication process<br>
home_page: User loads the home page<br>
like_message: User likes another user's message<br>
login: User logs into her/his account<br>
search_autocomplete: User selects a search result from the autocomplete list<br>
search_run: User runs a search query and is taken to the search results page<br>
search_click_result_X: User clicks search result X on the results page, where X is a number from 1 through 10.<br>
send_message: User posts a message<br>
view_inbox: User views messages in her inbox<br>
5. location:	The country from which the event was logged (collected through IP address).<br>
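6. device:	The type of device used to log the event.<br>
7. user_type:	A numeric code for the type of user. It is missing for 15,577 of the 150,000 events.<br>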
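With these columns, engagement over time can be summarized as the number of distinct users who log at least one `engagement` event per week. A minimal sketch follows. The rows are hypothetical (only the first user ID and device name come from this notebook's data), `events_sample` is a name introduced for illustration, and the weekly bins end on Sunday, which is the pandas default for `freq="W"`.

```python
import pandas as pd

# Hypothetical events; column names follow events_raw_data.
events_sample = pd.DataFrame({
    "user_id": [10522, 10522, 10612, 10522, 10700],
    "occurred_at": ["2014-05-02 11:02:39", "2014-05-02 11:02:53",
                    "2014-05-03 09:15:00", "2014-05-09 08:00:00",
                    "2014-05-09 10:30:00"],
    "event_type": ["engagement", "engagement", "engagement",
                   "engagement", "signup_flow"],
    "device": ["dell inspiron notebook", "dell inspiron notebook",
               "iphone 5", "dell inspiron notebook", "iphone 5"],
})
events_sample["occurred_at"] = pd.to_datetime(events_sample["occurred_at"])

# Keep product usage only; signup events are not engagement.
engagement = events_sample[events_sample["event_type"] == "engagement"]

# Distinct engaged users per week (weeks end on Sunday).
weekly_active_users = (
    engagement
    .groupby(pd.Grouper(key="occurred_at", freq="W"))["user_id"]
    .nunique()
)
print(weekly_active_users)

# The same count split by device, to see which devices drive a change.
weekly_by_device = (
    engagement
    .groupby([pd.Grouper(key="occurred_at", freq="W"), "device"])["user_id"]
    .nunique()
)
print(weekly_by_device)
```

Replacing `"device"` with `"location"` in the second grouping gives the corresponding breakdown by country.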


In [22]:
events_raw_data.shape

(150000, 7)

In [23]:
events_raw_data.head()

Unnamed: 0,user_id,occurred_at,event_type,event_name,location,device,user_type
0,10522.0,2014-05-02 11:02:39,engagement,login,Japan,dell inspiron notebook,3.0
1,10522.0,2014-05-02 11:02:53,engagement,home_page,Japan,dell inspiron notebook,3.0
2,10522.0,2014-05-02 11:03:28,engagement,like_message,Japan,dell inspiron notebook,3.0
3,10522.0,2014-05-02 11:04:09,engagement,view_inbox,Japan,dell inspiron notebook,3.0
4,10522.0,2014-05-02 11:03:16,engagement,search_run,Japan,dell inspiron notebook,3.0
