## **PACE: Plan Stage**

**What is the main purpose of this project?**
The purpose of this project is to demonstrate knowledge of how to conduct a two-sample hypothesis test and apply descriptive statistics using Python to analyze real-world data.

**What is your research question for this project?**
"Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?"

**What is the importance of random sampling?**
Random sampling helps ensure that the sample data accurately represents the population, reducing bias and increasing the reliability and generalizability of the results.

**Give an example of sampling bias that might occur if you didn’t use random sampling.**
If we only collected data from users who drive during morning rush hour, we might miss patterns in driving behavior from those who drive at night, creating bias in our conclusions about average drives.
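
The effect of that kind of bias can be sketched with a small simulation. Everything here is hypothetical and is not drawn from the Waze dataset: the population size, the 30% share of rush-hour drivers, and the Poisson rates are assumptions chosen only to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: 10,000 drivers, roughly 30% of whom drive mainly
# at rush hour and (by assumption) log more drives than everyone else.
is_rush_hour = rng.random(10_000) < 0.3
drives = np.where(is_rush_hour, rng.poisson(90, 10_000), rng.poisson(55, 10_000))

population_mean = drives.mean()

# Simple random sample: every driver has the same chance of being picked.
random_sample_mean = rng.choice(drives, size=500, replace=False).mean()

# Biased sample: only rush-hour drivers are observed.
biased_sample_mean = rng.choice(drives[is_rush_hour], size=500, replace=False).mean()

print(f"population mean:    {population_mean:.1f}")
print(f"random sample mean: {random_sample_mean:.1f}")
print(f"biased sample mean: {biased_sample_mean:.1f}")
```

Under these assumptions the random sample should land close to the population mean, while the rush-hour-only sample overstates it.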

### **Task 1. Imports and data loading**

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [3]:
# Import any relevant packages or libraries
import numpy as np
import pandas as pd
from scipy import stats

In [4]:
# Load the dataset
df = pd.read_csv(r'D:\5B. Google_Advanced_data_analysis\training_project_data\waze_dataset.csv')

### **Task 2. Data exploration**
In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`.

To perform this analysis, you must turn each label into an integer. The following code assigns a 1 for an `iPhone` user and a 2 for an `Android` user. It stores the result in a new column, `device_type`.

Creating a new variable is ideal so that you don't overwrite original data.
    
1. Create a dictionary called `map_dictionary` that contains the class labels ('Android' and 'iPhone') as keys and the integers you want to convert them to (2 and 1) as values.

2. Create a new column called `device_type` that is a copy of the `device` column.

3. Use the `map()` method on the `device_type` series. Pass `map_dictionary` as its argument. Reassign the result back to the `device_type` series. When you pass a dictionary to the `Series.map()` method, it replaces each entry in the series that matches one of the dictionary's keys with the corresponding dictionary value.

In [6]:
# 1. Create `map_dictionary`
map_dictionary = {'iPhone': 1,'Android': 2}

# 2. Create new `device_type` column
df['device_type'] = df['device']

# 3. Map the new column to the dictionary
df['device_type'] = df['device_type'].map(map_dictionary)

df.head()

Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,device_type
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,2
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone,1
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,2
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone,1
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android,2
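
One caveat with `Series.map()` is worth checking after a mapping step like this: any label that is not a key in the dictionary becomes `NaN` rather than raising an error. A minimal sketch on toy data, where the `'Windows'` label is a hypothetical value that does not appear in the dataset description:

```python
import pandas as pd

# Toy data; 'Windows' is a hypothetical label that is absent from the dictionary.
toy = pd.DataFrame({'device': ['iPhone', 'Android', 'Windows']})
map_dictionary = {'iPhone': 1, 'Android': 2}

toy['device_type'] = toy['device'].map(map_dictionary)

# Labels with no matching key are mapped to NaN, so count them explicitly.
n_unmapped = toy['device_type'].isna().sum()

print(toy)
print("unmapped rows:", n_unmapped)
```

Running the same `isna().sum()` check on `df['device_type']` would confirm whether every row in the real data was mapped.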


You are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Calculate these averages.

In [8]:
# Select `drives` before aggregating so only that column is averaged
df.groupby('device_type')['drives'].mean()

device_type
1    67.859078
2    66.231838
Name: drives, dtype: float64

Based on the averages shown, drivers who use an iPhone to interact with the application appear to have slightly more drives on average (about 67.9 versus 66.2). However, a gap this small could arise from sampling variability alone rather than reflect a true difference in the population. To assess whether the difference is statistically significant, you can conduct a hypothesis test.
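
A mean on its own says nothing about spread or group size, both of which determine how much a gap between means can be trusted. As a rough sketch on made-up numbers (the values below are illustrative, not from the dataset), the same `groupby` can report count, mean, and standard deviation together:

```python
import pandas as pd

# Illustrative toy data, not taken from the Waze dataset.
toy = pd.DataFrame({
    'device_type': [1, 1, 1, 2, 2, 2],
    'drives': [60, 70, 80, 50, 65, 80],
})

# One row per device type, one column per summary statistic.
summary = toy.groupby('device_type')['drives'].agg(['count', 'mean', 'std'])
print(summary)
```

Applied to `df`, the same call would show how large each group is and how widely `drives` varies within it.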

## **PACE: Analyze & Construct Stages**

**In general, why are descriptive statistics useful?**
Descriptive statistics are useful because they summarize and organize large datasets, allowing us to quickly understand patterns, trends, and key metrics such as means and standard deviations.

**How did computing descriptive statistics help you analyze your data?**
Computing descriptive statistics helped compare the average number of drives between iPhone and Android users, offering initial insight into whether a difference exists and guiding the need for further statistical testing.

**In hypothesis testing, what is the difference between the null hypothesis and the alternative hypothesis?**
The null hypothesis (H0) states that there is no effect or difference in the population, while the alternative hypothesis (HA) states that there is an effect or difference. The test then asks whether the sample provides enough evidence to reject H0 in favor of HA.

**How did you formulate your null hypothesis and alternative hypothesis?**

    H0: There is no difference in the average number of drives between iPhone and Android users.
    
    HA: There is a difference in the average number of drives between iPhone and Android users.

**What conclusion can be drawn from the hypothesis test?**
Since the p-value (0.143) is greater than the significance level of 0.05, the null hypothesis cannot be rejected. This suggests that there is no statistically significant difference in the average number of drives between iPhone and Android users.

### **Task 3. Hypothesis testing**
1. State the null hypothesis and the alternative hypothesis.
2. Choose a significance level.
3. Find the p-value.
4. Reject or fail to reject the null hypothesis.

Recall the difference between the null hypothesis (H0) and the alternative hypothesis (HA).

**Question:** What are your hypotheses for this data project?

H0: There is no difference in the average number of drives between drivers who use iPhone devices and drivers who use Android devices.

HA: There is a difference in the average number of drives between drivers who use iPhone devices and drivers who use Android devices.

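Use 5% as the significance level and proceed with a two-sample t-test. Because the code below sets `equal_var=False`, this is Welch's t-test, which does not assume the two groups have equal variances.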

1. Isolate the `drives` column for iPhone users.
2. Isolate the `drives` column for Android users.
3. Perform the t-test

In [12]:
# 1. Isolate the `drives` column for iPhone users.
iPhone_drives = df[df['device_type'] == 1]['drives']

# 2. Isolate the `drives` column for Android users.
android_drives = df[df['device_type'] == 2]['drives']

# 3. Perform the t-test
stats.ttest_ind(a=iPhone_drives, b=android_drives, equal_var=False)

TtestResult(statistic=1.463523206885235, pvalue=0.143351972680206, df=11345.066049381952)

*Since the p-value (0.143) is larger than the significance level (0.05), the null hypothesis cannot be rejected. This indicates that there is no statistically significant difference between iPhone users and Android users in the average number of drives.*
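
To make the numbers in `TtestResult` less opaque, the sketch below recomputes the Welch statistic, degrees of freedom, and two-sided p-value by hand and compares them with `stats.ttest_ind(..., equal_var=False)`. The two groups are synthetic stand-ins generated for illustration; their sizes, means, and spreads are assumptions and are unrelated to the actual `drives` data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Synthetic stand-ins for the two groups (assumed values, not the real data).
group_a = rng.normal(loc=68, scale=65, size=400)
group_b = rng.normal(loc=66, scale=60, size=250)

# Estimated variance of each sample mean: s^2 / n.
var_a = group_a.var(ddof=1) / group_a.size
var_b = group_b.var(ddof=1) / group_b.size

# Welch t statistic: difference in means over its estimated standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
dof = (var_a + var_b) ** 2 / (
    var_a ** 2 / (group_a.size - 1) + var_b ** 2 / (group_b.size - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual values should agree with SciPy's to floating-point precision, which shows that the reported p-value depends only on the group means, variances, and sizes.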

## **PACE: Execute Stage**

**What key business or organizational insight(s) emerged from your A/B test?**
There is no statistically significant difference in the average number of drives between users based on their device type. This is consistent with iPhone and Android users engaging with the Waze app at similar levels, although failing to reject the null hypothesis does not prove that the two groups are identical.

**What recommendations do you propose based on your results?**
Waze does not need to focus optimization efforts based solely on device type. Instead, future analysis should investigate other variables that may affect driving behavior and churn, such as onboarding time, usage frequency, or driving patterns. Additional A/B tests could be conducted on feature changes or UI updates to improve engagement across all user segments.