## Background

The original data comes from a SaaS company that provides data consulting and analytics software to help small and medium-sized companies better understand their customers. The company collects browsing activity data when users visit its website.
Through data analysis, the company wants to understand its website traffic, user engagement, and signup rate, and to increase the signup rate for its core products.


## Dataset

The dataset contains a 9-day log of users' web browsing activity: user actions (page views, button clicks, form submissions), action-related properties (such as url, referer_page, and UTM info), the time of each action, and device information. The original dataset has 75092 rows and 70 columns.


## Analysis

### Data Wrangling
1. Flatten the nested JSON data with io.json_normalize, then check the distributions of categorical and continuous variables (especially the percentage of NA values) to get an initial understanding of the data.
2. Transform the event-level data: drop meaningless features, regroup levels based on their distribution, generate time-related features, and separate mixed information stored in a single variable. (See the 'Event Level Data' description in 'Sensors_DataWrangling' for detailed steps.)
3. Generate user-level data from the event-level data using aggregates such as count, unique count, and mean, grouped by user distinct id. (See the 'User Level Data' description in 'SensorsData_Model' for detailed steps.)
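
Steps 1 and 3 can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: the sample records, the `properties` sub-object, the millisecond timestamps, and the output column names are all assumptions made for the example. It uses `pd.json_normalize`, the top-level name for the same function in recent pandas versions.

```python
import pandas as pd

# Hypothetical raw events: nested JSON with a 'properties' sub-object.
raw_events = [
    {"distinct_id": "u1", "event": "pageview", "time": 1488326400000,
     "properties": {"url": "/index", "browser": "Chrome"}},
    {"distinct_id": "u1", "event": "btnclick", "time": 1488326460000,
     "properties": {"url": "/index", "browser": "Chrome"}},
    {"distinct_id": "u2", "event": "pageview", "time": 1488330000000,
     "properties": {"url": "/demo"}},
]

# Step 1: flatten nested fields into dotted column names.
events = pd.json_normalize(raw_events)
events["time"] = pd.to_datetime(events["time"], unit="ms")

# Share of missing values per column, a quick first look at data quality.
na_share = events.isna().mean()

# Step 3: aggregate event-level rows to one row per user.
users = events.groupby("distinct_id").agg(
    n_events=("event", "size"),
    n_unique_events=("event", "nunique"),
    first_seen=("time", "min"),
)
```

Columns with a high `na_share` are natural candidates for the dropping or regrouping described in step 2.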
    
### EDA

- **Website Traffic**: On normal weekdays (excluding the first Monday), the website averages 9041.5 events, 1836.5 unique users, and 2257.75 sessions per day. Users are less active on weekends.

<img src="img/session.png" width="500" height="400"/>

- **User Metric**: The metrics in the following table indicate that users interact very little with the website. In addition, more than 75% of users have only 1 session (a short session) in the 9 days.

| Metric | Mean | Median | 75th percentile |
|--|--|--|--|
| Number of events | 6.38 | 3 | 6 |
| Number of unique events | 2.4 | 2 | 3 |
| Number of sessions | 1.42 | 1 | 1 |
| Session length (minutes) | 3.85 | 0.64 | 3.78 |
| Interval between sessions (minutes) | 1158.46 | 278.67 | 1334.24 |
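
The session metrics in the table depend on how events are grouped into sessions. The sketch below shows one common way to do it, assuming a 30-minute inactivity timeout. The timeout, the sample timestamps, and the column names are assumptions for illustration; the session rule actually used in the analysis is not stated here.

```python
import pandas as pd

SESSION_GAP = pd.Timedelta(minutes=30)  # assumed timeout, not from the source

# Hypothetical event log with one row per event.
events = pd.DataFrame({
    "distinct_id": ["u1", "u1", "u1", "u2"],
    "time": pd.to_datetime([
        "2017-03-01 09:00:00", "2017-03-01 09:02:00",
        "2017-03-01 11:00:00", "2017-03-01 10:00:00",
    ]),
}).sort_values(["distinct_id", "time"])

# Time since the same user's previous event (NaT for a user's first event).
gap = events.groupby("distinct_id")["time"].diff()

# A new session starts at a user's first event or after a long gap.
new_session = gap.isna() | (gap > SESSION_GAP)
events["session_no"] = (
    new_session.astype(int).groupby(events["distinct_id"]).cumsum()
)

# Session length is the time between the first and last event of a session.
sessions = events.groupby(["distinct_id", "session_no"])["time"].agg(["min", "max"])
sessions["length_min"] = (sessions["max"] - sessions["min"]).dt.total_seconds() / 60

per_user = sessions.groupby("distinct_id").agg(
    n_sessions=("length_min", "size"),
    mean_length_min=("length_min", "mean"),
)
```

With this rule, a session containing a single event has length 0, which is one reason a median session length can be well under a minute.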

- **Funnel Analysis**: Both funnels show bottlenecks at moving users from 'pageview' to 'btnclick' and from 'page_leave' to 'click_send_cellphone'. The company should focus on improving user interaction with the website and find out why users are lost at the 'click_send_cellphone' stage.
<table>       
<tr>
<td> <img src="img/funnel1.png" width="400" height="400"/> </td>
<td> <img src="img/funnel2.png" width="400" height="400"/> </td>
</tr> 
</table>
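
A funnel of this kind can be computed by counting how many users reach each step in order. The sketch below is a simplified illustration: the three-step sequence and the sample rows are hypothetical, and the actual funnels in the figures may use different steps.

```python
from collections import defaultdict

# Illustrative step order; the real funnels may contain other steps.
FUNNEL = ["pageview", "btnclick", "click_send_cellphone"]

# Hypothetical (user, event) rows, already sorted by time within each user.
events = [
    ("u1", "pageview"), ("u1", "btnclick"), ("u1", "click_send_cellphone"),
    ("u2", "pageview"), ("u2", "btnclick"),
    ("u3", "pageview"),
    ("u4", "btnclick"),  # never viewed a page first, so never enters the funnel
]

def funnel_counts(rows, steps):
    """Count users reaching each step, requiring earlier steps to come first."""
    progress = defaultdict(int)  # user -> number of funnel steps completed
    for user, event in rows:
        done = progress[user]
        if done < len(steps) and event == steps[done]:
            progress[user] = done + 1
    return [sum(1 for d in progress.values() if d > i) for i in range(len(steps))]

counts = funnel_counts(events, FUNNEL)
# Conversion rate from each step to the next one.
step_rates = [after / before for before, after in zip(counts, counts[1:])]
```

The step with the lowest value in `step_rates` is the bottleneck.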

- **Retention Analysis**: Looking at daily user cohorts (users who visit the website on the same day), the retention rate drops significantly on the 2nd day and continues to drop slightly on the 7th day. After further breaking users into 4 segments, users with a demo_leave or btnclick action have a relatively high second-day retention rate. Directing users to the demo page or turning them into active users should help improve the retention rate.
<img src="img/cmp_retention_line_chart.pdf" width="600" height="600"/>
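
Cohort retention can be sketched as below, assuming a cohort is defined by the day a user is first seen. The sample visits and column names are hypothetical. Note that `day_n = 0` is the first visit day, so the "2nd day" in the text corresponds to `day_n = 1`.

```python
import pandas as pd

# Hypothetical visits: one row per (user, calendar day) with activity.
visits = pd.DataFrame({
    "distinct_id": ["u1", "u1", "u1", "u2", "u2", "u3", "u4"],
    "date": pd.to_datetime([
        "2017-03-01", "2017-03-02", "2017-03-07",
        "2017-03-01", "2017-03-02",
        "2017-03-01",
        "2017-03-02",
    ]),
}).drop_duplicates()

# Cohort = the day a user was first seen (an assumption for this sketch).
visits["cohort"] = visits.groupby("distinct_id")["date"].transform("min")
visits["day_n"] = (visits["date"] - visits["cohort"]).dt.days

# Users active on day N of each cohort, divided by the cohort's size.
active = visits.pivot_table(
    index="cohort", columns="day_n",
    values="distinct_id", aggfunc="nunique", fill_value=0,
)
retention = active.div(active[0], axis=0)
```

Filtering `visits` to one behavioral segment before building the table gives the per-segment comparison described above.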


### Modeling

The goal is to build models that predict user signup and, based on those models, to understand which features have the most impact on form submission.

#### Preparing Dataset For Modeling

* Define target variable
Users who had any 'formSubmit', 'clickSubmit', 'click_send_cellphone', or 'verify_cellphone_code' event are defined as converters, because these events are strongly correlated and overlap heavily.

* Feature Engineering
Starting from the user-level dataset, the following steps generate new features and transform existing ones:

1. From the time variable, generate 'time difference' and 'session' related features.
2. Regroup columns with too many levels into their top-N levels, then apply one-hot encoding or binary encoding.
3. Time-related variables have skewed distributions and large values, so apply a log transformation to them.
4. In most columns NA indicates that no such event or property occurred, so fill NA with 0.
5. Drop columns that would leak conversion information, and keep only one format when a feature has multiple representations.
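
Steps 2 to 4 can be sketched as follows. The table, its column names, the `keep_top_n` helper, and the choice of `log1p` as the log transformation are all assumptions for illustration. Missing values are filled before the log step so that a missing duration maps to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical user-level table; column names are illustrative.
users = pd.DataFrame({
    "browser": ["Chrome", "Chrome", "Safari", "Opera", "Safari", "Chrome"],
    "total_session_sec": [12.0, 3600.0, np.nan, 45.0, 90000.0, 5.0],
    "n_btnclick": [1.0, np.nan, np.nan, 4.0, np.nan, 2.0],
})

def keep_top_n(col: pd.Series, n: int, other: str = "other") -> pd.Series:
    """Keep the n most frequent levels and lump the rest together."""
    top = col.value_counts().nlargest(n).index
    return col.where(col.isin(top), other)

# Step 2: regroup a high-cardinality column, then one-hot encode it.
users["browser"] = keep_top_n(users["browser"], n=2)
features = pd.get_dummies(users, columns=["browser"])

# Step 4: a missing value means the event never happened, so use 0.
features = features.fillna(0)

# Step 3: log1p compresses large, right-skewed durations and maps 0 to 0.
features["total_session_sec"] = np.log1p(features["total_session_sec"])
```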


#### Model and Performance
**Models**: Logistic Regression, single Decision Tree, Bagged Decision Trees, and Random Forest.

**Data split and sampling**: The data is split 0.7/0.3 into train and test sets. The training set has 4.3% positive labels and the test set has 4.7%. Because the labels are imbalanced, upsampling (SMOTE) and downsampling (random sampling) are both tested to reduce the impact of the imbalance. In addition, the class weight parameter is adjusted to increase the penalty for wrong predictions on the positive label.
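
The split, random downsampling, and class weighting can be sketched as below. This assumes scikit-learn, which is not named above, and uses synthetic data as a stand-in for the real features. SMOTE is left out because it lives in a separate package. Any resampling is applied to the training set only, so the test set keeps its natural class balance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the user-level features: a few percent positives.
n = 4000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.9).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Option A: random downsampling of the majority class, training set only.
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
X_down, y_down = X_train[keep], y_train[keep]

# Option B: keep all rows but weight errors on the rare class more heavily.
model = LogisticRegression(
    penalty="l1", solver="liblinear", class_weight="balanced"
)
model.fit(X_train, y_train)
test_scores = model.predict_proba(X_test)[:, 1]
```

`class_weight="balanced"` is one possible setting; explicit weights such as `{0: 1, 1: 10}` are an alternative worth tuning.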

**Metric**: The precision-recall curve is a better metric to check than the ROC curve here. With an imbalanced dataset, the false positive rate increases slowly because its denominator (the number of negatives) is large, so the ROC curve is likely to sit near the upper-left corner even when many of the predicted positives are wrong.
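
A small worked example makes this concrete. The counts below are hypothetical, chosen only to match the 4.7% positive rate of the test set.

```python
def rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return recall (TPR), false positive rate, and precision."""
    return {
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }

# Hypothetical test set of 10,000 users with 4.7% positives (470 vs 9,530).
# The classifier finds 400 of the 470 converters but also flags 300 others.
r = rates(tp=400, fp=300, fn=70, tn=9230)
```

Here the ROC point is at roughly (0.03, 0.85), close to the upper-left corner, yet precision is only about 0.57: roughly 43% of the flagged users are false alarms. The precision-recall curve exposes this, while the ROC curve hides it.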

**Performance**: The Random Forest model (with half of the original feature size) and Logistic Regression with L1 regularization both perform quite well. The two models' ROC curves and AUC scores are very close, while the LR model has higher precision and the RF model has higher recall.

| Model | AUC | Accuracy | Precision | Recall | F1-score | PR-AUC |
|--|--|--|--|--|--|--|
| LR1 (with L1 regularization) | 98.54% | 97.96% | 81.46% | 73.65% | 77.36% | 0.8123 |
| Decision Tree | 91.58% | 97.11% | 70.19% | 67.66% | 68.90% | 0.7081 |
| Bagged Tree | 98% | 96.48% | 58.85% | 85.63% | 69.23% | 0.78 |
| Random Forest | 98.21% | 97.42% | 69.39% | 81.43% | 74.93% | 0.7790 |
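
The F1-score column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. A minimal check for the Logistic Regression and Random Forest rows, using the table values as inputs:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

lr_f1 = f1_score(0.8146, 0.7365)  # Logistic Regression row
rf_f1 = f1_score(0.6939, 0.8143)  # Random Forest row
```

Both values round to the F1-scores shown in the table (77.36% and 74.93%).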

**Model results**: The models show that active session time and actions related to the demo page (clicking the demo link, URLs containing the 'demo' keyword, visiting and closing the demo page) are the key factors that drive users to register ('submit'). Looking at conversion rate by segment, user segments with longer average session time and more demo-page actions have higher conversion rates. The company should therefore focus on increasing users' session length and encouraging users to click buttons and visit the demo page.




## Conclusion and Suggestions

1. The website has a steady number of visitors on weekdays and a low volume on weekends. Among weekdays, Monday is the least active day. If the company plans to start a marketing campaign, Tuesday through Friday is a better time to start.

2. At the user level, engagement is poor. If we define user engagement as the average number of sessions in the 9-day period, users have only 1.4 sessions on average, with a median session length of 0.64 minutes. In such a short time, it is hard for users to grasp the core value of the company's products. The company needs to find out why users leave the website so quickly and do not come back for further sessions. One possible direction is to review the UI: users may be bored with the website design, the design may be so confusing that users cannot find what they are looking for, or the page loading time may be too long. A/B testing should be performed on the website.

3. Funnel analysis shows that 2/3 of users are lost between the pageview action and the button click action, which is another sign of weak user interaction with the website. Again, the website design (location of buttons and the demo link) should be tested.

4. Funnel analysis also shows a bottleneck at directing users to the demo page. About 70% of users who click a button are lost before landing on and leaving the demo page. Demos are essential for getting users to know the company's products. When users click buttons for more information, the company can prioritize demo-related buttons (links) to get users' attention.

5. Both funnel analyses show that at least 95% of users left the index page or demo page without registering for a demo, which means most users leave the website without giving contact information. The company needs to find out why users are reluctant to register. Is it because users are not satisfied with the products after the demo introduction, or because the registration process requires too much information? If it is the latter, can the registration process be simplified so that users are acquired first (in order to get their contact information)?

6. Retention analysis shows that the number of users remaining in a cohort drops significantly on the second day and continues to drop on the 7th day. This implies that most users are one-and-done: they visit the website once and do not come back. If a user's contact information is available, reminders (emails/messages) should be sent. Comparing retention across user groups with different behaviors, users who checked the demo page outperformed the other groups, and the group with button click behavior ranks second. When contact information is unavailable, increasing user interaction with the website (especially with the demo page) can also improve the retention rate.

7. Looking at user profiles, 97% of users are from China. Beijing (34%), Guangdong (21%), Shanghai (12%), and Zhejiang (9%) are the top 4 regions. Compared with Beijing and Guangdong, the number of users in Shanghai and Zhejiang is still low. There are also many small and medium-sized companies in the Shanghai and Zhejiang area, so the company can grow its business in the eastern region.

8. Based on campaign source and referrer source, more than 85% of users (among those with source information) come from search. Content-based websites rank second, and social media ranks third. The brand team can improve brand awareness through content-sharing websites.


## Next Step

**Priority 1**: Both EDA and model results show that the conversion rate can be improved by increasing users' button click and demo-visiting actions and the length of active sessions. The company should run A/B tests on the website design (prioritizing the location of demo buttons, simplifying the homepage design, reducing page load time, placing core product information on the homepage) and on the registration process (requiring less information to register) to see whether any change increases the conversion rate.

**Priority 2**: If the company wants to expand its market share, it should consider targeting clients in the eastern region, such as Shanghai and Zhejiang, and at the same time spend marketing budget on content-based websites such as zhihu, tech bloggers, and 36kr.


