# **Case Study** - *Waze Device Type*

The Waze team is working on a project to reduce monthly user churn.

As one of the milestones of the analysis, we run a **hypothesis test** to learn more about user behavior in the app. The results let the analytics team make inferences about the user population and develop new strategies to lower the app's churn rate.

## **1. Setting Up the Environment**

As a first step in this case study, we prepare the environment for our analysis and load the data.


In [None]:
# First, let's import the libraries that are essential for the analysis.

import pandas as pd
from scipy import stats

In [None]:
# Load dataset into dataframe
waze_df = pd.read_csv('waze_dataset.csv')

# NOTE: No directory path is given for the CSV, so the cell can be reproduced as long as the file is stored locally in the working directory.

## **2. Descriptive Analysis & Data Preparation**

Now that everything is set up, we give the dataset a quick descriptive analysis. This lets us understand the data better, check for missing values, examine the data structure, see how the data is distributed and, for this case study, compute the mean number of ***drives*** per device type (iPhone and Android).

We will also prepare the dataset: transforming data types, dropping duplicates or missing values, and creating new columns if necessary.
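The preparation checks described above can be sketched as follows. This is a minimal illustration on a small made-up data frame (`toy_df` and its values are hypothetical, not the Waze data); the same calls would apply to `waze_df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Waze dataset (values are made up).
toy_df = pd.DataFrame({
    "device": ["iPhone", "Android", "iPhone", "Android", "Android"],
    "drives": [10, 25, 10, None, 40],
})

# Data type of each column
print(toy_df.dtypes)

# Number of missing values per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Number of rows that are exact duplicates of an earlier row
duplicate_rows = toy_df.duplicated().sum()
print(duplicate_rows)

# Drop duplicated rows and rows where 'drives' is missing
clean_df = toy_df.drop_duplicates().dropna(subset=["drives"])
print(clean_df)
```

Whether dropping rows is appropriate depends on how many values are missing and why, so this is only one possible way to handle them.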

In [None]:
# Look at the first rows of the dataset to get familiar with the data.
waze_df.head()

In [None]:
# Table of descriptive measures of the data set
waze_df.describe()

In [None]:
# Create one data frame per device type to use in the hypothesis testing process

df_iphone = waze_df[waze_df["device"]=="iPhone"]
df_android = waze_df[waze_df["device"]=="Android"]

In [None]:
# Mean validation: average number of drives per device type
waze_df.groupby("device")["drives"].mean()

The code shows the following results:

```
device
Android    66.231838
iPhone     67.859078
```



## **3. Hypothesis Testing**

Now that our dataset is ready, we can proceed with the hypothesis test.
Something I find really useful at this step is to recall relevant concepts such as:

***Null Hypothesis***: A statement that is assumed to be true unless there is convincing evidence to the contrary.

***Alternative Hypothesis***: A statement that contradicts the null hypothesis and is accepted as true only if there is convincing evidence for it.

---

For this specific case study, the hypothesis would be:

Null Hypothesis = There is no difference between the **mean drives** of *iPhone* users and *Android* users; any difference observed in the sample is due to random sampling.

Alternative Hypothesis = There is a difference between the **mean drives** of *iPhone* users and *Android* users. This is accepted only if there is convincing evidence.

We set the **significance level at 5%** and proceed with a two-sample t-test. Because the code sets `equal_var=False`, this is Welch's version of the test, which does not assume equal variances in the two groups.

In [None]:
# Two-sided, two-sample t-test on drives (iPhone vs. Android), without assuming equal variances
stats.ttest_ind(a=df_iphone['drives'], b=df_android['drives'], alternative='two-sided', equal_var=False)
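To make the test less of a black box, here is a minimal sketch that computes Welch's t statistic and its two-sided p-value by hand and compares them with `stats.ttest_ind`. The two samples are made-up values (hypothetical, not the Waze data), and the variable names are only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (made-up values, not the Waze data)
sample_a = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 18.0])
sample_b = np.array([10.0, 11.0, 13.0, 8.0, 16.0])

# Sample means, unbiased sample variances (ddof=1) and sample sizes
mean_a, mean_b = sample_a.mean(), sample_b.mean()
var_a, var_b = sample_a.var(ddof=1), sample_b.var(ddof=1)
n_a, n_b = len(sample_a), len(sample_b)

# Standard error of the difference in means (variances are not pooled)
se = np.sqrt(var_a / n_a + var_b / n_b)
t_manual = (mean_a - mean_b) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = se**4 / ((var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))

# Two-sided p-value: probability mass in both tails beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

# SciPy's result for comparison (two-sided is the default alternative)
result = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual and SciPy values should agree up to floating-point error, which shows that `equal_var=False` corresponds to the unpooled standard error and the approximate degrees of freedom used here.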

## **4. Results and Conclusion**

The code shows the following results:

```
Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)
```
Recalling what a p-value is:
> The probability of observing results as extreme as, or more extreme than, those observed, assuming the null hypothesis is true.

Based on the result of the two-tailed t-test, the p-value is 0.1433. This means that, if the null hypothesis were true, there would be a 14.33% probability of observing a difference in mean drives at least as extreme as the one in our sample.
The p-value is higher than the significance level of 5% (0.1433 > 0.05), so we fail to reject the null hypothesis.
In other words, there is no statistically significant difference between the mean drives of the two device types, and the observed difference can be attributed to random sampling.
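The decision rule used here can be written out explicitly. This is a small sketch that plugs in the p-value reported above and the 5% significance level; the variable names are only for illustration.

```python
# Significance level chosen for the test (5%)
alpha = 0.05

# p-value reported by the t-test above
p_value = 0.1433519726802059

# Reject the null hypothesis only when the p-value falls below alpha
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"p-value = {p_value:.4f}, alpha = {alpha}: we {decision}.")
```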


## **5. Next Steps**

As part of the analysis, here are some points I would recommend analyzing further to contribute to the project's objectives:

*   Given the results of the hypothesis test, we can explore other variables in the dataset that could explain more about user behavior.
*   As the mean number of drives is essentially the same for both types of users, a change to the interface or a new marketing strategy could be beneficial and help reduce the churn rate.

