# **Waze Project** - Statistical Analysis


Your team is nearing the midpoint of its user churn project. So far, you've completed a project proposal and used Python to explore, analyze, and visualize Waze's user data. The next step is to use statistical methods to analyze and interpret the data.

You receive a new email from Sylvester Esperanza, your project manager. Sylvester tells your team about a new request from leadership: to analyze the relationship between the mean number of rides and device type. You also discover follow-up emails from three other team members: May Santner, Chidi Ga, and Harriet Hadzic. These emails discuss the details of the analysis. They would like a statistical analysis of ride data based on device type. In particular, leadership wants to know if there is a statistically significant difference in the mean number of rides between iPhone® users and Android™ users. A final email from Chidi includes your specific assignment: to conduct a two-sample hypothesis test (t-test) to analyze the difference in the mean number of rides between iPhone users and Android users.

**Data exploration and hypothesis testing**

We will explore the data provided and conduct a hypothesis test.
<br/>

*This effort has three parts:*

**Part 1:** Imports and data loading

**Part 2:** Conduct hypothesis testing

**Part 3:** Communicate insights with stakeholders

<br>

# **Data exploration and hypothesis testing**

<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**


<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

1. What is the research question for this data project? Later on, we will need to formulate the null and alternative hypotheses as the first step of the hypothesis test.


Do Android and iPhone users have the same mean number of drives?

### **Task 1. Imports and data loading**




Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [4]:
# Import any relevant packages and libraries
import pandas as pd
from scipy import stats

In [5]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

<img src="images/Analyze.png" width="100" height="100" align=left>

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Analyze and Construct**


### **Task 2. Data exploration**

Use descriptive statistics to conduct exploratory data analysis (EDA).

**Note:** In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`.

To perform this analysis, we must turn each label into an integer. The following code assigns a `1` for an `iPhone` user and a `2` for an `Android` user. It stores the result in a new column, `device_type`.

**Note:** Creating a new variable is ideal so that we don't overwrite original data.

1. Create a dictionary called `map_dictionary` that contains the class labels (`'Android'` and `'iPhone'`) for keys and the values you want to convert them to (`2` and `1`) as values.

2. Create a new column called `device_type` that is a copy of the `device` column.

3. Use the [`map()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html#pandas-series-map) method on the `device_type` series. Pass `map_dictionary` as its argument. Reassign the result back to the `device_type` series.
<br/><br/>
When we pass a dictionary to the `Series.map()` method, it replaces each value in the series that matches one of the dictionary's keys with the corresponding dictionary value. Values with no matching key become `NaN`.

Example:

```
df['column']
```

|column |
|  :-:       |
| A     |
| B     |
| A     |
| B     |

```
map_dictionary = {'A': 2, 'B': 1}
df['column'] = df['column'].map(map_dictionary)
df['column']
```

|column |
|  :-: |
| 2    |
| 1    |
| 2    |
| 1    |
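
As a small extension of the example above, the following sketch shows what `Series.map()` does with a label that has no key in the dictionary. The series `labels` and the extra label `'C'` are hypothetical and are not part of the Waze data.

```python
import pandas as pd

# Hypothetical series for illustration; 'C' has no entry in the dictionary.
labels = pd.Series(['A', 'B', 'A', 'C'])
map_dictionary = {'A': 2, 'B': 1}

mapped = labels.map(map_dictionary)
print(mapped)

# Values with no matching key become NaN, so count them before relying on the result.
n_unmapped = mapped.isna().sum()
print(f"Unmapped values: {n_unmapped}")
```

Checking for `NaN` after mapping is a cheap way to catch misspelled or unexpected category labels.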


In [9]:
# 1. Create a dictionary map
map_dictionary = {'iPhone': 1, 'Android': 2}

# 2. Create new `device_type` column as a copy of `device`
df['device_type'] = df['device']

# 3. Map the new column's labels to integers using the dictionary
df['device_type'] = df['device_type'].map(map_dictionary)
df.head(5)

| | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device | device_type |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android | 2 |
| 1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone | 1 |
| 2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android | 2 |
| 3 | 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone | 1 |
| 4 | 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android | 2 |


We are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Calculate these averages.

In [25]:
# Group data by device type and compute the mean number of drives
grouped_dev = df.groupby('device_type')['drives'].mean()
# Absolute difference in mean drives (iPhone minus Android)
diff = grouped_dev.loc[1] - grouped_dev.loc[2]
print(grouped_dev)
print(f"Difference in mean drives: {diff:.2f}")

device_type
1    67.859078
2    66.231838
Name: drives, dtype: float64
Difference in mean drives: 1.63


Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, we can conduct a hypothesis test.
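
Whether a gap between two means could plausibly come from sampling depends on each group's size and spread, not on the means alone. The sketch below shows one way to view those quantities side by side. It uses a hypothetical toy DataFrame (`toy`) in place of the Waze data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the `device_type` and `drives` columns.
toy = pd.DataFrame({
    'device_type': [1, 1, 1, 2, 2, 2],
    'drives': [10, 20, 30, 5, 15, 25],
})

# Group size, mean, and sample standard deviation per device type
summary = toy.groupby('device_type')['drives'].agg(['count', 'mean', 'std'])
print(summary)
```

Applied to `df`, the same `agg` call would show whether the two device groups differ in size or variability, which is relevant to the choice of t-test variant.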


### **Task 3. Hypothesis testing**

Our goal is to conduct a two-sample t-test. Steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis

**Note:** This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).

Recall the difference between the null hypothesis ($H_0$) and the alternative hypothesis ($H_A$).

**Hypotheses**:

$H_0$: There is no difference in the mean number of drives between iPhone users and Android users.

$H_A$: There is a difference in the mean number of drives between iPhone users and Android users.

Next, choose 5% as the significance level and proceed with a two-sample t-test.


**Technical note**: The default for the argument `equal_var` in `stats.ttest_ind()` is `True`, which assumes population variances are equal. This equal variance assumption might not hold in practice (that is, there is no strong reason to assume that the two groups have the same variance); you can relax this assumption by setting `equal_var` to `False`, and `stats.ttest_ind()` will perform the unequal variances $t$-test (known as Welch's $t$-test). Refer to the [scipy t-test documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.
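
To make the effect of `equal_var` concrete, here is a minimal sketch on hypothetical simulated samples (not the Waze data) with unequal sizes and unequal spread. It runs both variants of `stats.ttest_ind()` and recomputes Welch's statistic by hand as the difference in means divided by the unpooled standard error. The sample sizes, means, and standard deviations are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical samples with unequal spread and unequal size (not the Waze data).
rng = np.random.default_rng(0)
a = rng.normal(loc=50, scale=5, size=40)
b = rng.normal(loc=50, scale=20, size=120)

student = stats.ttest_ind(a, b, equal_var=True)   # pooled-variance t-test
welch = stats.ttest_ind(a, b, equal_var=False)    # Welch's t-test

# Welch's statistic by hand: difference in means over the unpooled standard error.
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.3f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.3f}")
print(f"Manual Welch t: {t_manual:.3f}")
```

The hand-computed value should match `welch.statistic`, while the pooled version can differ noticeably when group sizes and variances are both unequal.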


1. Isolate the `drives` column for iPhone users.
2. Isolate the `drives` column for Android users.
3. Perform the t-test.

In [28]:
# Set significance level
SL = 0.05

# 1. Isolate the `drives` column for iPhone users.
ip_drives = df[df['device_type'] == 1]['drives']

# 2. Isolate the `drives` column for Android users.
and_drives = df[df['device_type'] == 2]['drives']

# 3. Perform the t-test
stats.ttest_ind(a=ip_drives, b=and_drives, equal_var=False)

Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)

**Question:** Based on the p-value obtained above, do we reject or fail to reject the null hypothesis?

> Based on the p-value of 0.143 (14.3%) and the significance level of 5%, we fail to reject the null hypothesis because the p-value is greater than the significance level. The observed difference in mean drives between iPhone and Android users is therefore not statistically significant; the data do not provide sufficient evidence of a true difference.
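
The decision rule can also be written out explicitly. This is a minimal sketch that copies the p-value from the test output above and compares it with the 5% significance level; the variable names `alpha`, `p_value`, and `decision` are illustrative.

```python
# Values copied from the t-test output; `alpha` mirrors the 5% significance level.
alpha = 0.05
p_value = 0.1433519726802059

# Reject H0 only when the p-value falls below the significance level.
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"p-value = {p_value:.4f}, alpha = {alpha}: {decision}")
```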

<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**


### **Task 4. Communicate insights with stakeholders**

Consider the following question as you prepare to write your executive summary:

* What business insight(s) can we draw from the result of your hypothesis test?

> The difference in the mean number of drives between device types was not statistically significant, so it is consistent with chance and sampling variability. On this evidence, iPhone and Android users behave similarly in terms of number of drives (though not necessarily on other parameters). For further analysis, more data could be collected, and the focus could shift to variation in the number of drives among users (for instance, to better understand users with a very high number of drives versus more casual users).