# **Waze Project**
**Course 4 - The Power of Statistics**

# **Course 4 End-of-course project: Data exploration and hypothesis testing**

<br/>

**The purpose** of this project is to demonstrate knowledge of how to conduct a two-sample hypothesis test.

**The goal** is to apply descriptive statistics and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct hypothesis testing
* How did computing descriptive statistics help you analyze your data?

* How did you formulate your null hypothesis and alternative hypothesis?

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerged from your hypothesis test?

* What business recommendations do you propose based on your results?

<br/>



# **Data exploration and hypothesis testing**

<img src="/Users/kamaladadashova/Desktop/Google Analytics/C4-Statistics/Pace.png" width="100" height="100" align=left>

# **PACE stages**


<img src="/Users/kamaladadashova/Desktop/Google Analytics/C4-Statistics/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**


1. What is the research question for this data project?

**Do drivers who use an iPhone to open the application have the same average number of drives as those who use Android devices?**


### **Task 1. Imports and data loading**




In [3]:
# Import any relevant packages or libraries
import pandas as pd
from scipy import stats

In [5]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

<img src="/Users/kamaladadashova/Desktop/Google Analytics/C4-Statistics/Analyze.png" width="100" height="100" align=left>

<img src="/Users/kamaladadashova/Desktop/Google Analytics/C4-Statistics/Construct.png" width="100" height="100" align=left>

## **PACE: Analyze and Construct**

1. Data professionals use descriptive statistics for exploratory data analysis (EDA). How can computing descriptive statistics help us learn more about data in this stage of the analysis?


Overall, descriptive statistics are valuable because they allow you to quickly summarize and understand large datasets. In this case, computing descriptive statistics makes it easy to compare the average number of drives by device type.
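As a minimal sketch of that kind of comparison, the following summarizes drives per device on a small made-up DataFrame. The values are invented for illustration and are not taken from the Waze dataset; only the column names `device` and `drives` mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    'device': ['iPhone', 'Android', 'iPhone', 'Android', 'iPhone', 'Android'],
    'drives': [10, 20, 30, 40, 50, 60],
})

# Count, mean, standard deviation, and median of drives per device type.
summary = toy.groupby('device')['drives'].agg(['count', 'mean', 'std', 'median'])
print(summary)
```

Looking at the spread (`std`) and the median alongside the mean gives a first hint about skew and about whether the groups have similar variability.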

### **Task 2. Data exploration**

**Note:** In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`.

To perform this analysis, we convert each label to an integer. The following code maps `iPhone` to `1` and `Android` to `2` and stores the result in a new column, `device_type`.

**Note:** Creating a new variable is ideal so that you don't overwrite original data.



In [6]:
# Map each device label to an integer code
map_dictionary = {'Android': 2, 'iPhone': 1}
# Store the codes in a new column so the original `device` column is preserved
df['device_type'] = df['device'].map(map_dictionary)
df['device_type'].head()

0    2
1    1
2    2
3    1
4    2
Name: device_type, dtype: int64
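One thing worth checking after a dictionary mapping is whether every label was covered, because `Series.map` returns `NaN` for any value that is not a key in the dictionary. The sketch below uses a hypothetical toy column in which `Other` is an invented label, included only to show what an unmapped value looks like.

```python
import pandas as pd

# Hypothetical toy column; 'Other' is an invented label that is not in the mapping.
devices = pd.Series(['iPhone', 'Android', 'Other'])
map_dictionary = {'Android': 2, 'iPhone': 1}

mapped = devices.map(map_dictionary)

# Labels missing from the dictionary become NaN, so count them before continuing.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

A nonzero count would mean some rows silently drop out of any later comparison that filters on the integer codes.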

Our goal is to determine whether there is a relationship between device type and the number of drives. One approach is to compare the average number of drives for each device type, so we calculate these averages next.

In [7]:
df.groupby('device_type')['drives'].mean()

device_type
1    67.859078
2    66.231838
Name: drives, dtype: float64

Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, we can conduct a hypothesis test.
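To illustrate how a gap between sample means can appear by chance alone, here is a small label-shuffling simulation. This is a different technique from the t-test used in this notebook and is shown only as an intuition aid. All numbers are assumptions: both toy groups are drawn from the same invented distribution, so any gap between them is due to sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: both groups come from the SAME distribution,
# so any gap between their sample means is due to sampling alone.
group_a = rng.normal(loc=67, scale=65, size=500)
group_b = rng.normal(loc=67, scale=65, size=500)
observed_gap = group_a.mean() - group_b.mean()

# Shuffle the pooled values and re-split them many times to see how large
# a gap typically appears when the group labels carry no information.
pooled = np.concatenate([group_a, group_b])
gaps = np.empty(2000)
for i in range(2000):
    shuffled = rng.permutation(pooled)
    gaps[i] = shuffled[:500].mean() - shuffled[500:].mean()

# Share of shuffles whose gap is at least as large as the observed one.
share = float(np.mean(np.abs(gaps) >= abs(observed_gap)))
print(round(float(observed_gap), 2), share)
```

Even with identical populations, the two sample means are rarely exactly equal, which is why a formal test is needed before reading meaning into a small difference.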


### **Task 3. Hypothesis testing**

Your goal is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis
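The four steps above can be sketched in code as follows. The two samples are hypothetical, generated from an invented distribution, and the 0.05 threshold is simply a common choice of significance level; the sketch shows the shape of the procedure rather than any result from the Waze data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical samples, invented for illustration.
sample_1 = rng.normal(loc=50, scale=10, size=200)
sample_2 = rng.normal(loc=50, scale=10, size=200)

# Step 1: H0: the population means are equal; HA: they differ.
# Step 2: choose the significance level.
alpha = 0.05

# Step 3: find the p-value with a two-sample t-test.
result = stats.ttest_ind(a=sample_1, b=sample_2, equal_var=False)

# Step 4: reject or fail to reject the null hypothesis.
decision = 'reject H0' if result.pvalue < alpha else 'fail to reject H0'
print(result.pvalue, decision)
```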

**Note:** This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).

**Hypotheses:**

$H_0$: There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

$H_A$: There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

Next, choose 5% as the significance level and proceed with a two-sample t-test.

**NOTE**: The default for the argument `equal_var` in `stats.ttest_ind()` is `True`, which assumes population variances are equal. This equal variance assumption might not hold in practice; we can relax this assumption by setting `equal_var` to `False`, and `stats.ttest_ind()` will perform the unequal variances $t$-test (known as Welch's $t$-test).
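As a rough illustration of what `equal_var` changes, the sketch below runs both variants on two hypothetical samples with very different spreads and sizes, then recomputes Welch's statistic by hand as the difference in means divided by the unpooled standard error. The sample values are invented and the variable names are placeholders.

```python
import numpy as np
from scipy import stats

# Hypothetical toy samples with clearly different spreads and sizes.
a = np.array([60.0, 62.0, 64.0, 66.0, 68.0])
b = np.array([10.0, 40.0, 70.0, 100.0, 130.0, 160.0, 190.0])

pooled = stats.ttest_ind(a=a, b=b, equal_var=True)   # Student's t-test (pooled variance)
welch = stats.ttest_ind(a=a, b=b, equal_var=False)   # Welch's t-test (unpooled variance)

# Welch's statistic by hand: difference in means over the unpooled standard error.
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
manual_t = (a.mean() - b.mean()) / se

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
print(manual_t)
```

When the group variances or sizes differ, the two variants can give noticeably different statistics and p-values, which is the reason for setting `equal_var=False` when equal variances cannot be assumed.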

In [10]:
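# 1. Isolate the `drives` column for iPhone users
iphone_drives = df[df['device_type'] == 1]['drives']
# 2. Isolate the `drives` column for Android users
android_drives = df[df['device_type'] == 2]['drives']
# 3. Perform Welch's t-test (unequal variances)
stats.ttest_ind(a=iphone_drives, b=android_drives, equal_var=False)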

TtestResult(statistic=np.float64(1.463523206885235), pvalue=np.float64(0.143351972680206), df=np.float64(11345.066049381952))

>*Since the p-value (about 0.143) is larger than the chosen significance level (5%), you fail to reject the null hypothesis. The conclusion is that there is **not** a statistically significant difference in the average number of drives between drivers who use iPhones and drivers who use Androids.*

<img src="/Users/kamaladadashova/Desktop/Google Analytics/C4-Statistics/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

### **Task 4. Communicate insights with stakeholders**



* What business insight(s) can we draw from the result of your hypothesis test?

> *The key business insight is that drivers who use iPhone devices on average have a similar number of drives as those who use Androids.*

> *One potential next step is to explore what other factors influence the variation in the number of drives, and to run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or in the user interface of the Waze app may provide more data to investigate churn.*
