# How do users engage with a mobile app for automobiles?

<i> "It is important to understand what you can do before you learn how to measure how well you seem to have done it." </i> – J. Tukey


## Goals

As we saw in the previous case, careful data visualization (DV) can guide or even replace formal statistical analysis and model-building. Here, we'll continue with visualizations that can be more complex and computationally intensive.

In addition, by the end of this case you will have learned how missing data can sometimes be helpful in our analyses, how domain questions can guide visualizations, and how carefully constructed visualizations can generate new questions and insights.

## Introduction

**Business Context.** A recent trend among car manufacturers is to provide continued support through mobile applications. Features of these apps include services like remote ignition, GPS location, anti-theft mechanisms, maintenance reminders, and promotion pushes. Manufacturers are keen to maximize engagement with their app because they believe this increases relationship depth and brand loyalty with the customer. However, app usage is often limited, with many customers abandoning the app after only a short time period or never even opening it in the first place.

You are a data scientist for a large luxury automobile company. Your company wants you to uncover behavioral patterns of the users who engage with the app. They believe that if you can find discernible patterns, your company can leverage those insights to give users incentives to use the app more frequently.

**Business Problem.** Your employer would like you to answer the following: **"How do users currently engage with your mobile app and how has that engagement changed over time?"** 

**Analytical Context.** In this case, we will look at data on a subset of 105 customers (out of 1,000 total app users) for the first four weeks after installing the app. This small subset of the data is chosen as a representative sample. Data were collected as part of a beta version of the app.

## First look at the data

As always, let's begin by having a look at the data and computing a few summary statistics. The data set contains 
105 rows and 116 columns.  Most of the columns represent app data collected on day $j$ ($1 \le j \le 28$):

| Variable name|  Description | Values |
|--------------|--------------|------------|
| age          | Ordinal age, coded: 1 (<= 25), 2 (26-34), 3 (35-50), 4 (50+)| Int: 1-4 | 
| sex          | Categorical sex | Char: F, M| 
| device_type  | Android or OS X | String: Andr, X|
| vehicle_class| Luxury or standard vehicle| String: Lx, Std|
| p_views_j, j=1,...,28| Ordinal page views on day j| Int: 1-5 |
| major_p_type_j, j=1,...,28| Majority page type| String: Main, Prom, Serv| 
| engagement_time_j, j=1,...,28| Ordinal engagement time per day | Int: 0-5|
| drive_j, j=1,...,28| Indicator that user drove| Int: 0, 1|

We see that a lot of the data are **ordinal variables**. An ordinal variable is a categorical variable where the categories are numbers and the relative values of those numbers matter; however, the absolute values of those numbers do not. In other words, for a given ordinal variable $x$, a larger numbered category means "more of $x$" than a smaller numbered category; however, the category number does not indicate the actual amount of $x$. For example, here `age` is coded as an ordinal variable; the categorical value of `3` clearly indicates "more age" than the categorical value of `1` (35 - 50 years of age vs. 25 years of age or under), but the specific category value `3` or `1` is meaningless.
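To make the idea concrete, here is a minimal sketch of how an ordinal variable can be represented in pandas as an ordered categorical. The five age codes are made-up illustrative values, not rows from the case data.

```python
import pandas as pd

# Hypothetical age codes for five users (illustrative values, not the case data).
age_codes = [1, 3, 2, 4, 3]

# Declare the codes as an ordered categorical: the order matters, the spacing does not.
age = pd.Series(
    pd.Categorical(age_codes, categories=[1, 2, 3, 4], ordered=True),
    name="age",
)

# Order-based operations are meaningful for ordinal data.
oldest_group = age.max()                              # highest category present
is_over_34 = age > 2                                  # element-wise order comparison
counts_by_group = age.value_counts().sort_index()     # counts in category order
```

Storing the variable this way keeps comparisons such as `age > 2` available while making it explicit that the codes are labels for ranked groups rather than measured quantities.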

Below is some more information about some of the other variables:

1. The only allowable mobile platforms are Android (coded `Andr`) or OS X (coded `X`) and this is collected automatically when the app is installed; thus, we expect this variable to have no missing values.
2. The vehicle identification number was required to sign in and from this `vehicle_class` was automatically populated; thus, we also expect this variable to have no missing values.
3. The variable `major_p_type_j` is the majority page type for the user on day j. In other words, it's the type of page which is viewed most often. It's coded as a categorical variable taking the values `Main` for maintenance, `Prom` for promotions, and `Serv` for services. Here, services means the app's services (e.g. automatic start, GPS location, etc.), rather than, say, scheduling an appointment to get the car serviced (which would be categorized as maintenance).

Furthermore, a lot of the data here is "opt-in" only; that is, it is only recorded if the user was active on the app that day, and missing otherwise. For example, `p_views_j`, `major_p_type_j`, `engagement_time_j`, and `drive_j` are all "opt-in" variables.

### Exercise 1:

What is the significance of the variables mentioned above being opt-in? What insights can we derive from this?

**Answer.**

## Understanding and visualizing patterns in the missing data

As you saw in the Python cases, missing data is a staple of almost any dataset we will encounter. This one is no different. This dataset has substantial missing data, with nearly 60% of subjects missing a value for at least one column.

A useful tool to look at the structure of missing data is a **missingness plot**, which is a grid where the rows correspond to individuals and the columns correspond to the variables (so in our case, this will be a 105 x 116 grid).
The $(i,j)$-th square of the grid is colored white if variable $j$ was missing for subject $i$. A first pass at a missingness plot gives us:

<img src="data/missingnessPlotOne.png" width="1200">
    
**Note:** Missingness plots can be created with the `missingno` library or using a `seaborn.heatmap` of your data after a pass of the `.isnull()` method. 
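As a rough sketch of the idea in the note, the code below builds the missingness mask with `.isnull()` and draws it as a grid. The data frame is a small synthetic stand-in (six hypothetical users, four days of page views), and the helper `plot_missingness` is an illustrative name that uses matplotlib's `imshow` in place of `seaborn.heatmap`.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the case data (hypothetical values, not the real file).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "device_type": ["Andr", "X", "X", "Andr", "X", "Andr"],
    "vehicle_class": ["Lx", "Std", "Lx", "Lx", "Std", "Std"],
})
for j in range(1, 5):
    df[f"p_views_{j}"] = rng.integers(1, 6, size=len(df)).astype(float)

# Blank out some opt-in values to mimic days on which a user was inactive.
df.loc[[1, 4], ["p_views_3", "p_views_4"]] = np.nan
df.loc[2, "p_views_4"] = np.nan

# True where a value is missing; rows are users, columns are variables.
missing = df.isnull()


def plot_missingness(mask):
    """Draw the mask as a grid: white = missing, black = observed."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.imshow(mask.to_numpy().astype(int), cmap="gray", vmin=0, vmax=1,
              aspect="auto", interpolation="nearest")
    ax.set_xticks(range(mask.shape[1]))
    ax.set_xticklabels(mask.columns, rotation=90)
    ax.set_ylabel("User (row index)")
    return fig
```

Calling `plot_missingness(missing)` draws the grid. With seaborn installed, `sns.heatmap(df.isnull(), cbar=False)` gives a similar picture, and `missingno.matrix(df)` is another option.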

### Question:

Do you spot any patterns in the missing values here?

### Exercise 2:

What are some things you can do with the dataset to visualize the missing data better?

**Answer.**

In light of this, let's remake the missingness plot with the similar variables grouped together:

<img src="data/missingnessPlotTwo.png" width="1200">
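Grouping similar variables amounts to reordering the columns before plotting. The sketch below assumes, purely for illustration, that the raw file stores the daily columns interleaved day by day; the actual column order of the case data is not stated here.

```python
import pandas as pd

N_DAYS = 28
static_cols = ["age", "sex", "device_type", "vehicle_class"]
daily_prefixes = ["p_views", "major_p_type", "engagement_time", "drive"]

# Hypothetical starting order: the four daily variables interleaved day by day.
interleaved = static_cols + [
    f"{prefix}_{j}" for j in range(1, N_DAYS + 1) for prefix in daily_prefixes
]
df = pd.DataFrame(columns=interleaved)  # empty frame; only the column order matters here

# Target order: all days of one variable, then all days of the next, and so on.
grouped = static_cols + [
    f"{prefix}_{j}" for prefix in daily_prefixes for j in range(1, N_DAYS + 1)
]
df_grouped = df[grouped]
```

Building the order explicitly is safer than `sorted(df.columns)`, because a plain alphabetical sort would place `p_views_10` before `p_views_2` and scramble the time axis within each group.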

### Exercise 3:

What patterns do you notice here? Do these patterns make sense based on your understanding of the problem?

**Answer.**

__________

We can make the pattern from Exercise 3 even more apparent by not just grouping the "opt-in" data together by type of information conveyed, but by grouping them all together, regardless of type. In this case the missingness plot looks like:

<img src="data/missingnessPlotThree.png" width="1200">

### Exercise 4:

A natural question to ask is 'what percentage of users were still engaged as of a certain day?'. How can we modify the above plot to better visualize this?

**Answer.**

<img src="data/missingnessPlotFour.png" width="1200">

From this plot it is immediately apparent that some subjects are dropping off and not returning; the data shows a **nearly monotone missingness pattern** which is useful for weighting and multiple imputation schemes (such methods are discussed in future cases on data wrangling). Furthermore, a significant proportion of users were engaged with the app throughout the entire 4-week period.
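One possible way to put numbers on these observations is sketched below. It uses a small made-up table (five hypothetical users, six days) and assumes that "still engaged as of day $d$" means the user has at least one recorded activity on day $d$ or later; other definitions are equally defensible.

```python
import numpy as np
import pandas as pd

# Hypothetical page-view codes for 5 users over 6 days (NaN = not active that day).
days = [f"p_views_{j}" for j in range(1, 7)]
views = pd.DataFrame(
    [
        [2, 3, np.nan, np.nan, np.nan, np.nan],       # drops out after day 2
        [1, 2, 2, 3, 2, 1],                           # active throughout
        [3, np.nan, np.nan, np.nan, np.nan, np.nan],  # drops out after day 1
        [2, 2, np.nan, 1, np.nan, np.nan],            # skips day 3, last seen on day 4
        [1, 1, 2, 2, 3, 2],                           # active throughout
    ],
    columns=days,
)

active = views.notna()
day_numbers = np.arange(1, len(days) + 1)

# Last day (1-based) with any recorded activity for each user.
last_active_day = pd.Series(
    (active.to_numpy() * day_numbers).max(axis=1), index=views.index
)

# Reorder rows so that the longest-engaged users come first.
views_sorted = views.loc[last_active_day.sort_values(ascending=False).index]

# Share of users still engaged as of each day (assumed definition, see above).
still_engaged = pd.Series(
    [(last_active_day >= d).mean() for d in day_numbers], index=day_numbers
)

# A user's pattern is monotone if, once a value is missing, all later ones are too.
is_monotone = active.sum(axis=1) == last_active_day
```

In this toy table four of the five users have a monotone pattern; the user who skips day 3 and returns on day 4 is the kind of exception that makes a real pattern "nearly" rather than exactly monotone.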

We now see the power of using contextual knowledge of the problem and dataset itself in the data visualization process. **The preceding four plots all contained the same underlying information, yet the later plots were clearly much easier to draw insights from than the earlier ones.**

## Investigating in-app behavior

Now that we've gleaned basic insights into whether or not users engage with the app at all, it's time to do a more detailed analysis of their behavior within the app. We'll start by looking at page views.

### Question:

What kind of chart do you think would be helpful in analysing *user engagement* using the app's page view data?


Since we are interested in user engagement, we should use a plot that depicts users' interactivity through time. We could use a line plot that shows average page views per day. We could also use multiple lines, one for each user, with the hope of detecting visible trends and possible clusters of users with similar behavior. Such plots are one example of [*parallel coordinates plots*](https://plotly.com/python/parallel-coordinates-plot/).

### Evaluating patterns in page views

To stakeholders, page views are a key measure of engagement. Let's identify patterns in the number of page views per day. Recall that page views is an ordinal variable (ordered categorical variable) coded 1-5. Here, code 1 corresponds to 0-1 actual page views, where a single page view means that the app was opened and then closed without navigating past the splash page. For each person, we have a sequence of up to 28 observations. Let's first create a parallel coordinates plot with one line per subject:

<img src="data/matplotOne.png" width="1200">


The preceding plot is extremely difficult to read. But we don't care so much about patterns for any individual user as much as the aggregate set of users. Thus, let's graph a line representing the average page views across users on each day. The following plot shows this in black:

<img src="data/matplotTwo.png" width="1200">
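A plot of this kind can be sketched as follows. The table is a made-up example with three hypothetical users and four days, and `plot_page_views` is an illustrative helper, not the code used to produce the figures above.

```python
import numpy as np
import pandas as pd

# Hypothetical ordinal page-view codes (1-5); NaN = user not active that day.
views = pd.DataFrame(
    [
        [1, 2, 4, np.nan],
        [2, 3, np.nan, np.nan],
        [3, 4, 2, 5],
    ],
    columns=[f"p_views_{j}" for j in range(1, 5)],
)

# Average over users for each day; pandas skips missing values by default.
daily_mean = views.mean(axis=0)

# Number of users contributing to each day's average.
n_active = views.notna().sum(axis=0)


def plot_page_views(views, daily_mean):
    """One thin gray line per user, plus the daily average in black."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    day = np.arange(1, views.shape[1] + 1)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(day, views.to_numpy().T, color="lightgray", linewidth=0.8)
    ax.plot(day, daily_mean.to_numpy(), color="black", linewidth=2)
    ax.set_xlabel("Day since install")
    ax.set_ylabel("Page views (ordinal code)")
    return fig
```

Two caveats are worth keeping in mind when reading such an average: it is computed only over users who were active that day (tracked here in `n_active`), and averaging ordinal codes implicitly treats the categories as equally spaced.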


### Exercise 5:

There seems to be some kind of periodicity in the above smoothed plot. What might explain this pattern?

**Answer.**

### Clustering by user cohorts

Domain experts who have run qualitative studies of user behavior believe that there are different groups, or **cohorts**, of users, where the users within a single cohort behave similarly. They believe that page view behavior would be more homogeneous within any given cohort. However, these cohorts are not directly observable.

Using clustering methods (which you will learn about in future cases), we have segregated the users into three groups based on their similarities:  

<img src="data/matplotG1.png" width="1200">
<img src="data/matplotG2.png" width="1200">
<img src="data/matplotG3.png" width="1200">

### Exercise 6:

Describe the page view behaviors within each cohort.

**Answer.**

### Exercise 7:

Which cohort of users do you think is most likely to look at promotional pages (major page type category `Prom`)?

**Answer.**

### Analyzing patterns in major page type

Let's have a look at the major page type over time across our three user cohorts. This time we will use a **percent stacked area chart**. This plot stacks the percentage of users that used each major page type during the course of the study. You can create such a plot using the pandas `.plot.area()` method.
    
<img src="data/pagetypeG1.png" width="1200">
<img src="data/pagetypeG2.png" width="1200">
<img src="data/pagetypeG3.png" width="1200">

From this, we can see that the third group is indeed the most engaged with the promotional pages.
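For reference, a percent stacked area chart of this kind could be built roughly as follows. The page types below are invented for four hypothetical users over three days, and the percentages are taken over users who were active on each day, which is an assumption about how the charts above were normalized.

```python
import numpy as np
import pandas as pd

# Hypothetical majority page types for 4 users over 3 days (None = inactive).
page_type = pd.DataFrame(
    {
        "major_p_type_1": ["Main", "Prom", "Serv", "Prom"],
        "major_p_type_2": ["Main", "Prom", "Prom", None],
        "major_p_type_3": ["Serv", None, "Prom", None],
    }
)

# Count each page type per day (missing values are ignored by value_counts).
counts = (
    page_type.apply(pd.Series.value_counts)
    .reindex(["Main", "Prom", "Serv"])
    .fillna(0)
)

# Rows = days, columns = page types, values = % of that day's active users.
by_day = counts.T
shares = by_day.div(by_day.sum(axis=1), axis=0) * 100
shares.index = range(1, len(shares) + 1)  # label rows by day number


def plot_page_type_shares(shares):
    """Percent stacked area chart; requires matplotlib to be installed."""
    ax = shares.plot.area(stacked=True)
    ax.set_xlabel("Day since install")
    ax.set_ylabel("% of active users")
    return ax
```

Running the same steps once per cohort and comparing the height of the `Prom` band across the resulting charts is one way to check a statement like the one above.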

### Exercise 8:

What are some potential next steps if you wanted to do a deep dive into user page view behavior? What additional data might you want to collect on users?

**Answer.**

## Predicting dropout from page view behavior

Because page view behavior is believed to be strongly related to engagement with the app and likelihood of discontinuation, we would like to see if we can predict the point of disengagement by analyzing the page view behavior within each cohort. We start by simply labeling the last observation (i.e. day of usage) for each subject with a large red dot:

<img src="data/matplotMissingG1.png" width="1200">
<img src="data/matplotMissingG2.png" width="1200">
<img src="data/matplotMissingG3.png" width="1200">
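Locating each subject's last observation is a small data-manipulation step. Here is a minimal sketch on invented data (three hypothetical users, five days), with `plot_with_last_observation` as an illustrative helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical page-view codes; columns are day numbers, NaN = not active.
views = pd.DataFrame(
    [
        [2, 3, 1, np.nan, np.nan],
        [1, 2, 2, 3, 2],
        [4, np.nan, 2, np.nan, np.nan],
    ],
    columns=range(1, 6),
)

# Day of the last non-missing entry for each user.
last_day = views.apply(pd.Series.last_valid_index, axis=1)

# Page-view code recorded on that last day.
last_value = pd.Series(
    [views.at[user, day] for user, day in last_day.items()], index=views.index
)

# Users whose last observation falls before the end of the window.
dropped_out = last_day < views.columns.max()


def plot_with_last_observation(views, last_day, last_value):
    """Per-user lines with a large red dot on each user's last observation."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(views.columns, views.to_numpy().T, color="lightgray", linewidth=0.8)
    ax.scatter(last_day, last_value, color="red", s=80, zorder=3)
    ax.set_xlabel("Day since install")
    ax.set_ylabel("Page views (ordinal code)")
    return fig
```

The `dropped_out` flag can be used to restrict the red dots to users who actually left before day 28, if dots on the final day would clutter the plot.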

### Exercise 9:

Do you notice any patterns in page views preceding dropout?

**Answer.**

### Exercise 10:

Work with a partner. Based on the preceding visualizations, propose an *adaptive* intervention strategy; i.e. one for each customer group, that monitors a user's page views and then offers them an incentive to continue using the app right when we believe that the incentive would have the most impact. Assume that you can offer at most one such incentive during the first four weeks of app use.

**Answer.**

## Conclusions
We explored usage and disengagement patterns among users of a mobile app for a car manufacturer. We saw that most users still remained engaged with the app even after 28 days, and that there were three distinct cohorts of users. We used these patterns to generate ideas for intervention strategies that might be used to increase app usage and reduce disengagement. These visualizations are an excellent starting point for building statistical models or designing experiments to test theories about drivers of disengagement.

## Takeaways

In this case, you looked at more types of plots and how to draw conclusions from them. You also learned how these conclusions can drive further questions and plotting. Some key insights include: 

1. Sometimes it is important to reorder the data according to some variable in order to derive insights (as we saw with the missingness plot).
2. It is important to not disregard or transform missing data without a proper understanding of it. Always ask the question: "What is a possible reason for the data to be missing?". As we saw in the case, missing data can in some cases give valuable information for our analyses.
3. Sometimes additional computation or data manipulation is required in order to tease a meaningful pattern from a data visualization (as we saw with the clustering & averaging for the parallel coordinates plots with the three cohorts).
4. Domain knowledge and an understanding of the context of the problem and data at hand are crucial. Without them, we would never have been able to create the visualizations we did and draw the conclusions we did from the missingness plot and the parallel coordinates plots.