# Waze Project
**Phase 3 - The Power of Statistics; Data exploration and hypothesis testing**
<br>We are at the midpoint of our user churn project. So far, we’ve used Python to explore and analyze Waze’s user data and to create data visualizations. The next step is to use statistical methods to analyze and interpret the data.
<br>More specifically, we need to analyze the relationship between the mean number of rides and device type, that is, whether there is a statistically significant difference in the mean number of rides between iPhone® users and Android™ users. We will do this with a two-sample hypothesis test (t-test).

**The purpose** of this project is to demonstrate knowledge of how to conduct a two-sample hypothesis test.<br>
**The goal** is to apply descriptive statistics and hypothesis testing in Python.
<br/>

*This project has three steps:*

**Step 1:** Prepare data for hypothesis testing<br>
**Step 2:** Conduct hypothesis testing
* How does computing descriptive statistics help you analyze your data?
* How do you formulate your null hypothesis and alternative hypothesis?

**Step 3:** Communicate insights with stakeholders
* What key business insight(s) emerged from your hypothesis test?
* What business recommendations do you propose based on your results?

## Research Question

"Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?"

## 1. Prepare Data for Hypothesis Testing

In [1]:
import pandas as pd
from scipy import stats

In [3]:
df = pd.read_csv('waze_dataset.csv')
df.head(3)

| | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
| 1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone |
| 2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |


In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`. The t-test itself only needs the `drives` values of each group, but to make the groups easy to select, I map each label to an integer.

In [8]:
map_dictionary = {'Android': 2, 'iPhone': 1}
df['device_type'] = df['device'].map(map_dictionary)
df['device_type'].head()

0    2
1    1
2    2
3    1
4    2
Name: device_type, dtype: int64

In [9]:
df.groupby('device_type')['drives'].mean()

device_type
1    67.859078
2    66.231838
Name: drives, dtype: float64

Based on the averages shown, it appears that drivers who use an iPhone to interact with the application have a slightly higher number of drives on average. However, this difference might arise from random sampling rather than from a true difference in the number of drives. To assess whether the difference is statistically significant, we can conduct a hypothesis test.
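Before testing, it can also help to look at each group's size and spread alongside its mean. The following is a minimal sketch on a small made-up DataFrame (the `toy` values are hypothetical and are not taken from the Waze dataset); the same `groupby(...).agg(...)` call could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data for illustration only; NOT the Waze dataset.
toy = pd.DataFrame({
    'device': ['iPhone', 'iPhone', 'iPhone', 'Android', 'Android', 'Android'],
    'drives': [10, 20, 30, 12, 18, 24],
})

# Count, mean, and sample standard deviation of `drives` per device.
summary = toy.groupby('device')['drives'].agg(['count', 'mean', 'std'])
print(summary)
```

The `count` column shows whether the groups are balanced, and `std` gives a first impression of whether their variances look similar.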

In [10]:
# 1. Isolate the `drives` column for iPhone users.
iPhone = df[df['device_type'] == 1]['drives']

# 2. Isolate the `drives` column for Android users.
Android = df[df['device_type'] == 2]['drives']

## 2. Hypothesis testing
Our goal is to conduct a two-sample t-test. Steps for conducting a hypothesis test:
1. State the null hypothesis and the alternative hypothesis
2. Choose a significance level
3. Find the p-value
4. Reject or fail to reject the null hypothesis

**Note:** This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).
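The four steps above can be sketched as a small helper. This is an illustrative outline, not the notebook's own code: the function name `two_sample_test`, the default `alpha=0.05`, and the synthetic samples are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

def two_sample_test(a, b, alpha=0.05):
    """Hypothetical helper: two-sided test of H0 'the group means are equal'."""
    # Step 3: find the p-value. equal_var=False requests the
    # unequal-variances (Welch) version of the t-test.
    result = stats.ttest_ind(a=a, b=b, equal_var=False)
    # Step 4: reject H0 only if the p-value is below the significance level.
    return {
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),
        'reject_null': bool(result.pvalue < alpha),
    }

# Synthetic samples for illustration only (not Waze data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=10, size=200)
group_b = rng.normal(loc=50, scale=10, size=200)
outcome = two_sample_test(group_a, group_b)
print(outcome)
```

Steps 1 and 2 (stating the hypotheses and choosing the significance level) happen before any code runs; here they are encoded in the docstring and in the `alpha` argument.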

### State the null and alternative hypotheses
$H_0$: There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.<br>
$H_A$: There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

Next, I choose 5% as the significance level and proceed with a two-sample t-test.
<br>We use the `stats.ttest_ind()` function to perform the test.

**Technical note**: The default for the argument `equal_var` in `stats.ttest_ind()` is `True`, which assumes population variances are equal. This equal variance assumption might not hold in practice (that is, there is no strong reason to assume that the two groups have the same variance); we can relax this assumption by setting `equal_var` to `False`, and `stats.ttest_ind()` will perform the unequal variances t-test (known as Welch's t-test).
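To make the unequal-variances test less of a black box, the sketch below computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `stats.ttest_ind(..., equal_var=False)`. The helper name `welch_t` and the synthetic samples `x` and `y` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Hypothetical helper: Welch's t statistic, degrees of freedom, p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each sample mean: s^2 / n (ddof=1 gives the sample variance).
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the t distribution.
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value

# Synthetic samples with different spreads and sizes (not Waze data).
rng = np.random.default_rng(42)
x = rng.normal(loc=10, scale=1, size=50)
y = rng.normal(loc=10.5, scale=3, size=80)

t_manual, dof, p_manual = welch_t(x, y)
res = stats.ttest_ind(a=x, b=y, equal_var=False)
print(np.isclose(t_manual, res.statistic), np.isclose(p_manual, res.pvalue))
```

Unlike the pooled-variance test, each group contributes its own variance to the denominator, so no assumption of equal variances is needed.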

In [5]:
# Perform the t-test
stats.ttest_ind(a=iPhone, b=Android, equal_var=False)

Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)

### Interpretation of t-test
Since the p-value (about 0.143) is larger than the chosen significance level (5%), we fail to reject the null hypothesis. We conclude that there is **not** a statistically significant difference in the average number of drives between drivers who use iPhones and drivers who use Androids.

## 3. Communicate insights with stakeholders
Now that we've completed the hypothesis test, the next step is to share our findings with the Waze leadership team. I have to consider the following question as I prepare to write my executive summary:

**What business insight(s) can you draw from the result of your hypothesis test?**<br>
The key business insight is that drivers who use iPhone devices have, on average, a similar number of drives to those who use Android devices; the observed difference is not statistically significant.
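To put "similar" in concrete terms for stakeholders, the gap between the two group means reported earlier can be worked out directly. This is a small arithmetic sketch; the variable names are illustrative, and the two means are copied from the `groupby` output in Section 1.

```python
# Group means copied from the groupby output in Section 1.
mean_iphone = 67.859078
mean_android = 66.231838

# Absolute and relative gap between the two group means.
diff = mean_iphone - mean_android
relative_diff = diff / mean_android

print(f"Difference in mean drives: {diff:.2f}")
print(f"Relative to the Android mean: {relative_diff:.1%}")
```

The gap is about 1.6 drives, or roughly 2.5% of the Android mean, which is consistent with the test finding no statistically significant difference.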

*One potential next step is to explore what other factors influence the variation in the number of drives, and to run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or in the user interface of the Waze app may provide more data to investigate churn.*