# **Two-Sample Hypothesis Test Analysis on Waze User Churn Data**

I'm nearing the midpoint of my user churn project. So far, I've completed a project proposal and used Python to explore and analyze Waze's user data, including creating data visualizations. My next step is to apply statistical methods to analyze and interpret the data.

Sylvester Esperanza, my project manager, has asked me to analyze the relationship between the mean number of rides and device type. He wants a statistical analysis of ride data based on device type, specifically to determine whether there is a statistically significant difference in the mean number of rides between iPhone users and Android users. My specific assignment is to conduct a two-sample hypothesis test (t-test) of that difference.


In this project, I'll explore the data provided and conduct a hypothesis test.
<br/>

**The purpose** of this project is to demonstrate knowledge of how to conduct a two-sample hypothesis test.

**The goal** is to apply descriptive statistics and hypothesis testing in Python.
<br/>

*This project has three tasks:*

**Task 1:** Imports and data loading
* Importing the data packages needed for hypothesis testing.

**Task 2:** Conduct hypothesis testing
* Computing descriptive statistics to help analyze the data.

* Formulating the null hypothesis and alternative hypothesis.

**Task 3:** Communicate insights with stakeholders

* Summarizing the key business insights that emerged from the hypothesis test.

* Proposing business recommendations based on the results.

# **Data exploration and hypothesis testing**

<img src="assets/Pace.png" width="100" height="100" align=left>

# **PACE stages**


Throughout my project notebooks, I reference the problem-solving framework PACE. Each component of the notebooks is labeled according to the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="assets/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**



**Research question of this data project**
- Is there a statistically significant difference in the mean amount of rides between iPhone users and Android users?

### **Task 1. Imports and data loading**




Importing packages and libraries to compute descriptive statistics and conduct a hypothesis test.

In [7]:
# Import any relevant packages or libraries
import pandas as pd
import numpy as np
from scipy import stats 

Importing the dataset.

In [8]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

<img src="assets/Analyze.png" width="100" height="100" align=left>

<img src="assets/Construct.png" width="100" height="100" align=left>

## **PACE: Analyze and Construct**



- Descriptive statistics summarize the data by describing its central tendency, dispersion, and distribution, which helps identify outliers, patterns, and trends within the data.
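
As a minimal sketch of those three ideas, the snippet below computes measures of center and spread and flags high outliers with the common 1.5 × IQR rule. The `drives` values are hypothetical and chosen only for illustration, and the IQR rule is one convention among several, not a method taken from this project.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the Waze dataset.
drives = pd.Series([2, 20, 35, 48, 68, 93, 107, 596], name="drives")

# Central tendency
center = {"mean": drives.mean(), "median": drives.median()}

# Dispersion: sample standard deviation and interquartile range
q1, q3 = drives.quantile([0.25, 0.75])
spread = {"std": drives.std(), "iqr": q3 - q1}

# Flag unusually large values with the 1.5 * IQR rule (an assumed convention)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = drives[drives > upper_fence]

print(center)
print(spread)
print(outliers.tolist())
```

A mean that sits well above the median, as it does here, is a quick hint that a few large values are pulling the distribution to the right.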

### **Task 2. Data exploration**

Using descriptive statistics to conduct exploratory data analysis (EDA).

In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`.

To perform this analysis, I'll turn each label into an integer.

I'll create a new variable so that the original data is not overwritten.



In [14]:
# Inspect the data and compute descriptive statistics
print(df.info())
print(df['device'].value_counts())
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
 13  device_new               14999 non-null  object 
dtypes: float64(3), int64(8

| | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 |
| mean | 7499.0 | 80.633776 | 67.281152 | 189.964447 | 1749.837789 | 121.605974 | 29.672512 | 4039.340921 | 1860.976012 | 15.537102 | 12.179879 |
| std | 4329.982679 | 80.699065 | 65.913872 | 136.405128 | 1008.513876 | 148.121544 | 45.394651 | 2502.149334 | 1446.702288 | 9.004655 | 7.824036 |
| min | 0.0 | 0.0 | 0.0 | 0.220211 | 4.0 | 0.0 | 0.0 | 60.44125 | 18.282082 | 0.0 | 0.0 |
| 25% | 3749.5 | 23.0 | 20.0 | 90.661156 | 878.0 | 9.0 | 0.0 | 2212.600607 | 835.99626 | 8.0 | 5.0 |
| 50% | 7499.0 | 56.0 | 48.0 | 159.568115 | 1741.0 | 71.0 | 9.0 | 3493.858085 | 1478.249859 | 16.0 | 12.0 |
| 75% | 11248.5 | 112.0 | 93.0 | 254.192341 | 2623.5 | 178.0 | 43.0 | 5289.861262 | 2464.362632 | 23.0 | 19.0 |
| max | 14998.0 | 743.0 | 596.0 | 1216.154633 | 3500.0 | 1236.0 | 415.0 | 21183.40189 | 15851.72716 | 31.0 | 30.0 |


In [15]:
# 1. Create `map_dictionary`
map_dictionary = {"iPhone": 1, "Android": 2}
# 2. Create the new column by mapping `device` through the dictionary
df['device_new'] = df['device'].map(map_dictionary)
df.head(10)

| | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device | device_new |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android | 2 |
| 1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone | 1 |
| 2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android | 2 |
| 3 | 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone | 1 |
| 4 | 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android | 2 |
| 5 | 5 | retained | 113 | 103 | 279.544437 | 2637 | 0 | 0 | 901.238699 | 439.101397 | 15 | 11 | iPhone | 1 |
| 6 | 6 | retained | 3 | 2 | 236.725314 | 360 | 185 | 18 | 5249.172828 | 726.577205 | 28 | 23 | iPhone | 1 |
| 7 | 7 | retained | 39 | 35 | 176.072845 | 2999 | 0 | 0 | 7892.052468 | 2466.981741 | 22 | 20 | iPhone | 1 |
| 8 | 8 | retained | 57 | 46 | 183.532018 | 424 | 0 | 26 | 2651.709764 | 1594.342984 | 25 | 20 | Android | 2 |
| 9 | 9 | churned | 84 | 68 | 244.802115 | 2997 | 72 | 0 | 6043.460295 | 2341.838528 | 7 | 3 | iPhone | 1 |
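
One thing worth checking after a dictionary mapping is that every label was actually covered, because `Series.map` returns `NaN` for any value missing from the dictionary. The sketch below shows that check on a small hypothetical frame (`toy` is a stand-in, not the Waze data); in the notebook the same check would run against `df`.

```python
import pandas as pd

# Hypothetical stand-in for the Waze data; in the notebook this would be `df`.
toy = pd.DataFrame({"device": ["iPhone", "Android", "iPhone"]})

map_dictionary = {"iPhone": 1, "Android": 2}
toy["device_new"] = toy["device"].map(map_dictionary)

# Labels absent from the dictionary would appear as NaN after mapping.
unmapped = int(toy["device_new"].isna().sum())
print(f"Unmapped device labels: {unmapped}")
```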


I'm interested in exploring the relationship between device type and the number of drives. One approach is to examine the average number of drives for each device type, so I will calculate these averages.

In [16]:
df.groupby('device_new')['drives'].mean()


device_new
1    67.859078
2    66.231838
Name: drives, dtype: float64

Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, I'll conduct a hypothesis test.
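
Before running the test, it can help to look at group sizes and standard deviations next to the means, since a gap between means is only meaningful relative to the spread within each group. The following is a minimal sketch on hypothetical data (`toy` and its values are made up for illustration); on the real data the same `agg` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the Waze data; in the notebook this would be `df`.
toy = pd.DataFrame({
    "device": ["iPhone", "iPhone", "iPhone", "Android", "Android", "Android"],
    "drives": [10, 20, 30, 12, 18, 27],
})

# Count, mean, and sample standard deviation of drives per device type
summary = toy.groupby("device")["drives"].agg(["count", "mean", "std"])

# Observed difference in group means (iPhone minus Android)
mean_diff = summary.loc["iPhone", "mean"] - summary.loc["Android", "mean"]

print(summary)
print(f"Observed difference in means (iPhone - Android): {mean_diff:.2f}")
```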


### **Task 3. Hypothesis testing**

The goal is to conduct a two-sample t-test. 

Recalling the steps for conducting a hypothesis test:

1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis

**Null hypothesis ($H_0$):** There is no difference in the average number of drives between iPhone users and Android users.

**Alternative hypothesis ($H_A$):** There is a difference in the average number of drives between iPhone users and Android users.

I'll choose 5% as the significance level and proceed with a two-sample t-test.

I can use the `stats.ttest_ind()` function to perform the test.


**Note**: The default for the argument `equal_var` in `stats.ttest_ind()` is `True`, which assumes population variances are equal. This equal variance assumption might not hold in practice (that is, there is no strong reason to assume that the two groups have the same variance). I can relax this assumption by setting `equal_var` to `False`, and `stats.ttest_ind()` will perform the unequal variances $t$-test (known as Welch's $t$-test). Refer to the [scipy t-test documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.
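
For reference, the unequal-variances test is usually written as follows. The symbols are introduced here for illustration and do not appear elsewhere in this notebook: $\bar{x}_i$, $s_i^2$, and $n_i$ are the sample mean, sample variance, and size of group $i$, and $\nu$ is the Welch-Satterthwaite approximation to the degrees of freedom.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

The two-sided p-value is then the probability that a $t$-distributed variable with $\nu$ degrees of freedom exceeds $|t|$ in absolute value.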


1. Isolating the `drives` column for iPhone users.
2. Isolating the `drives` column for Android users.
3. Performing the t-test

In [17]:
# 1. Isolate the `drives` column for iPhone users.
iphone_drives = df[df['device_new'] == 1]['drives']
# 2. Isolate the `drives` column for Android users.
android_drives = df[df['device_new'] == 2]['drives']
# 3. Perform the t-test (Welch's t-test, since equal_var=False)
stats.ttest_ind(a=iphone_drives, b=android_drives, equal_var=False)

Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)

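- The p-value (approximately 0.143) is greater than the significance level of 0.05, so I fail to reject the null hypothesis. The data do not provide evidence of a statistically significant difference in mean drives between the two groups; the observed difference is consistent with chance or sampling variability.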
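
To make the decision step concrete, here is a minimal sketch that recomputes Welch's $t$ statistic and two-sided p-value by hand and compares them with `stats.ttest_ind()`. The two samples are simulated (Poisson draws with made-up means and sizes), so the numbers it prints are not the Waze results; only the procedure carries over.

```python
import numpy as np
from scipy import stats

# Simulated samples for illustration; the means and sizes are assumptions.
rng = np.random.default_rng(0)
group_a = rng.poisson(lam=68, size=500)
group_b = rng.poisson(lam=66, size=400)

# Welch's t-test via SciPy
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Manual computation of the same statistic
var_a = group_a.var(ddof=1) / group_a.size  # s_a^2 / n_a
var_b = group_b.var(ddof=1) / group_b.size  # s_b^2 / n_b
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite degrees of freedom and two-sided p-value
dof = (var_a + var_b) ** 2 / (
    var_a ** 2 / (group_a.size - 1) + var_b ** 2 / (group_b.size - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)

# Decision rule at the 5% significance level
alpha = 0.05
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"

print(f"SciPy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"Decision at alpha = {alpha}: {decision}")
```

The manual and SciPy values should agree to floating-point precision, which is a useful sanity check that the intended test (unequal variances, two-sided) is the one being run.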

<img src="assets/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**


### **Task 4. Communicate insights with stakeholders**

[Executive Summary](https://docs.google.com/presentation/d/1JzgJGTucyfeozozWs4EcyzXp4TGfxXlcVz4ReGcuAc4/edit?usp=sharing)

**Conclusion**


The analysis yielded a p-value of approximately 0.143, which is greater than the significance level of 0.05. Therefore, I fail to reject the null hypothesis: there is no statistically significant difference in the mean number of drives between iPhone and Android users, and the observed difference is likely due to chance or sampling variability.

**Key Insights**

- On average, drivers using iPhone devices have a similar number of drives as those using Android devices.
- Future research may explore additional factors influencing driving behavior, and further hypothesis tests could be conducted to understand user engagement better.