# **Waze Project**
**Course 4 - The Power of Statistics**

Your team is nearing the midpoint of their user churn project. So far, you’ve completed a project proposal, and used Python to explore and analyze Waze’s user data. You’ve also used Python to create data visualizations. The next step is to use statistical methods to analyze and interpret your data.

You receive a new email from Sylvester Esperanza, your project manager. Sylvester tells your team about a new request from leadership: to analyze the relationship between mean amount of rides and device type. You also discover follow-up emails from three other team members: May Santner, Chidi Ga, and Harriet Hadzic. These emails discuss the details of the analysis. They would like a statistical analysis of ride data based on device type. In particular, leadership wants to know if there is a statistically significant difference in mean amount of rides between iPhone® users and Android™ users. A final email from Chidi includes your specific assignment: to conduct a two-sample hypothesis test (t-test) to analyze the difference in the mean amount of rides between iPhone users and Android users.

# **Course 4 End-of-course project: Data exploration and hypothesis testing**

In this activity, you will explore the data provided and conduct a hypothesis test.
<br/>

**The purpose** of this project is to demonstrate knowledge of how to conduct a two-sample hypothesis test.

**The goal** is to apply descriptive statistics and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct hypothesis testing
* How did computing descriptive statistics help you analyze your data?

* How did you formulate your null hypothesis and alternative hypothesis?

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerged from your hypothesis test?

* What business recommendations do you propose based on your results?

<br/>


# **Data exploration and hypothesis testing**

<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response:
1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.


**Answer:**  
Leadership has requested an analysis of the relationship between the mean number of rides and device type. Specifically, the research question is whether the difference in mean rides between iPhone users and Android users is statistically significant, or whether it is consistent with random sampling variability.

*Complete the following tasks to perform statistical analysis of your data:*

### **Task 1. Imports and data loading**




Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [2]:
# Import any relevant packages or libraries
import pandas as pd
from scipy import stats

Import the dataset.

In [3]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv',index_col=0)

<img src="images/Analyze.png" width="100" height="100" align=left>

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Analyze and Construct**

Consider the questions in your PACE Strategy Document and those below to craft your response:
1. Data professionals use descriptive statistics for exploratory data analysis (EDA). How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


**Answer:**  
In general, descriptive statistics are useful because they let you quickly explore and understand large amounts of data. In this case, computing descriptive statistics helps you quickly compare the average number of drives by device type.

### **Task 2. Data exploration**

Use descriptive statistics to conduct exploratory data analysis (EDA).

**Note:** In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`.

In order to perform this analysis, you must turn each label into an integer.  The following code assigns a `1` for an `iPhone` user and a `2` for `Android`.  It stores the result in a new column, `device_type`.

**Note:** Creating a new variable is ideal so that you don't overwrite original data.



1. Create a dictionary called `map_dictionary` that contains the class labels (`'Android'` and `'iPhone'`) for keys and the values you want to convert them to (`2` and `1`) as values.

2. Create a new column called `device_type` that is a copy of the `device` column.

3. Use the [`map()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html#pandas-series-map) method on the `device_type` series. Pass `map_dictionary` as its argument. Reassign the result back to the `device_type` series.
<br/><br/>
When you pass a dictionary to the `Series.map()` method, it will replace the data in the series where that data matches the dictionary's keys. The values that get imputed are the values of the dictionary.

Example:

```
df['column']
```

|column |
|  :-:       |
| A     |
| B     |
| A     |
| B     |

```
map_dictionary = {'A': 2, 'B': 1}
df['column'] = df['column'].map(map_dictionary)
df['column']
```

|column |
|  :-: |
| 2    |
| 1    |
| 2    |
| 1    |
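
One caveat worth checking after mapping: `Series.map()` with a dictionary returns `NaN` for any value that has no matching key. The sketch below uses a small made-up series (the label `'Tablet'` is hypothetical and does not come from the Waze dataset) to show how an unmapped label could be detected.

```python
import pandas as pd

# Hypothetical toy data; 'Tablet' is an invented label with no dictionary key
devices = pd.Series(['iPhone', 'Android', 'Tablet', 'iPhone'])
map_dictionary = {'iPhone': 1, 'Android': 2}

mapped = devices.map(map_dictionary)

# Labels missing from the dictionary become NaN, so count them explicitly
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print('unmapped rows:', n_unmapped)
```

If the count is greater than zero, the dictionary is missing a label that appears in the data.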


In [4]:
df.head()

| ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
| 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone |
| 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |
| 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone |
| 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android |


In [5]:
# 1. Create `map_dictionary`
map_dictionary = {'iPhone': 1, 'Android': 2}

# 2. Create new `device_type` column
df['device_type'] = df.device

# 3. Map the new column to the dictionary
df['device_type'] = df['device_type'].map(map_dictionary)

In [6]:
df.head()

| ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device | device_type |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android | 2 |
| 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone | 1 |
| 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android | 2 |
| 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone | 1 |
| 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android | 2 |


You are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Calculate these averages.

In [7]:
# Compute mean drives per device type once, then look up each group by its integer label
mean_drives = df.groupby('device_type')['drives'].mean()
iphone_mean = mean_drives.loc[1]
android_mean = mean_drives.loc[2]
print('iphone_mean =', iphone_mean)
print('android_mean =', android_mean)

iphone_mean = 67.85907775020678
android_mean = 66.23183780739629


Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, you can conduct a hypothesis test.
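
Beyond the mean, the group sizes and standard deviations give useful context for the comparison. A minimal sketch, assuming a small made-up DataFrame in place of the Waze data (the `toy` frame and its values are invented for illustration):

```python
import pandas as pd

# Invented toy data standing in for the `device` and `drives` columns
toy = pd.DataFrame({
    'device': ['iPhone', 'iPhone', 'iPhone', 'Android', 'Android', 'Android'],
    'drives': [10, 20, 30, 5, 15, 40],
})

# One row per device type: sample size, mean, and sample standard deviation
summary = toy.groupby('device')['drives'].agg(['count', 'mean', 'std'])
print(summary)
```

Applied to the real dataset, the same `groupby(...).agg(...)` pattern would show whether the two groups differ noticeably in size or spread, which is relevant when choosing between the equal-variance and unequal-variance versions of the t-test.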


### **Task 3. Hypothesis testing**

Your goal is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis

**Note:** This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).

Recall the difference between the null hypothesis ($H_0$) and the alternative hypothesis ($H_A$).

**Question:** What are your hypotheses for this data project?

**Answer:**   
$H_0$: There is no difference in the mean number of drives between iPhone users and Android users.  
$H_A$: There is a difference in the mean number of drives between iPhone users and Android users.

Next, choose 5% as the significance level and proceed with a two-sample t-test.

You can use the `stats.ttest_ind()` function to perform the test.


**Technical note**: The default for the argument `equal_var` in `stats.ttest_ind()` is `True`, which assumes population variances are equal. This equal variance assumption might not hold in practice (that is, there is no strong reason to assume that the two groups have the same variance); you can relax this assumption by setting `equal_var` to `False`, and `stats.ttest_ind()` will perform the unequal variances $t$-test (known as Welch's $t$-test). Refer to the [scipy t-test documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.
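
To make the unequal-variances test more concrete, the sketch below computes Welch's t-statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `stats.ttest_ind(..., equal_var=False)`. The two samples are invented for illustration and are not Waze data.

```python
import numpy as np
from scipy import stats

# Invented samples with visibly different spreads (not Waze data)
a = np.array([12.0, 15.0, 11.0, 19.0, 14.0, 16.0])
b = np.array([8.0, 25.0, 3.0, 30.0, 10.0])

# Estimated variance of each sample mean: s^2 / n (ddof=1 gives the sample variance)
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t-statistic and Welch-Satterthwaite degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# SciPy's result with equal_var=False should agree with the manual values
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The manual calculation is only meant to show what the function does internally; in practice, calling `stats.ttest_ind()` directly is simpler and less error-prone.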


1. Isolate the `drives` column for iPhone users.
2. Isolate the `drives` column for Android users.
3. Perform the t-test

In [8]:
# 1. Isolate the `drives` column for iPhone users.
iphone_drives = df.loc[df['device'] == 'iPhone', 'drives']

# 2. Isolate the `drives` column for Android users.
android_drives = df.loc[df['device'] == 'Android', 'drives']

# 3. Perform Welch's t-test (equal_var=False does not assume equal variances)
statistic, p_value = stats.ttest_ind(a=iphone_drives, b=android_drives, equal_var=False)
print("t-stat =", statistic)
print("P-value =", p_value)

t-stat = 1.463523206885235
P-value = 0.143351972680206


In [14]:
p_value > 0.05

True

**Question:** Based on the p-value you got above, do you reject or fail to reject the null hypothesis?

The **`t-statistic`** is a numerical value that quantifies the difference between the means of the two samples (iPhone users and Android users) relative to the variability within each group. In particular:

* A positive t-statistic suggests that the mean of the first group (iPhone users) is greater than the mean of the second group (Android users).

* A negative t-statistic suggests that the mean of the first group (iPhone users) is less than the mean of the second group (Android users).

* The magnitude of the t-statistic reflects the size of the difference between the means, relative to the variability within each group.
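
The sign convention can be checked with a short sketch. Assuming two invented samples (not Waze data), swapping the order of the arguments to `stats.ttest_ind()` flips the sign of the t-statistic while the two-sided p-value stays the same:

```python
from scipy import stats

# Invented samples (not Waze data); the first group has the larger mean
first = [20, 22, 25, 27, 30]
second = [10, 14, 15, 18, 21]

t_ab, p_ab = stats.ttest_ind(first, second, equal_var=False)
t_ba, p_ba = stats.ttest_ind(second, first, equal_var=False)

# Swapping the argument order flips the sign of t;
# the two-sided p-value is unchanged
print(t_ab, t_ba)
print(p_ab, p_ba)
```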

The **`p-value`** is the probability of observing results as extreme as, or more extreme than, those actually observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true, nor the probability that the difference is due to chance. A small p-value means that the observed difference would be unlikely if there were no true difference between the groups. The p-value is compared with the significance level to decide whether the results are statistically significant:
* When the p-value is less than the significance level, we reject the null hypothesis, because a difference this large would be unlikely if the null hypothesis were true.
* When the p-value is greater than or equal to the significance level, we fail to reject the null hypothesis, because the data do not provide sufficient evidence of a difference.
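
The decision rule can be written as a small helper. This is a sketch only: the function name `decide` and its default `alpha` are illustrative choices, not part of the course material or of SciPy.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a given p-value and significance level."""
    if p_value < alpha:
        return 'reject H0'
    return 'fail to reject H0'


# p-value reported by the t-test earlier in this notebook
print(decide(0.143351972680206))
```

With the 5% significance level chosen for this project, the helper returns `fail to reject H0` for the p-value obtained above.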

Based on our data and the calculated p-value (about 0.143, which is greater than 0.05), we fail to reject the null hypothesis. There is no statistically significant difference between the mean number of drives of iPhone users and that of Android users; the observed difference is consistent with sampling variability.

<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### **Task 4. Communicate insights with stakeholders**

Now that you've completed your hypothesis test, the next step is to share your findings with the Waze leadership team. Consider the following question as you prepare to write your executive summary:

* What business insight(s) can you draw from the result of your hypothesis test?

**Answer:**  
The hypothesis test compares the calculated p-value with the chosen significance level. Because the p-value (about 0.143) is greater than the 5% significance level, we fail to reject the null hypothesis. The observed difference between the mean number of drives of iPhone users and that of Android users is not statistically significant and is consistent with sampling variability.  

Based on these results, we find no evidence that device type is associated with the average number of drives. Note that this test compared drives, not churn, so on its own it does not show whether device type affects churn.