Data exploration and hypothesis testing

In this activity, you will explore the data provided and conduct a hypothesis test.
The purpose of this project is to demonstrate knowledge of how to conduct a two-sample hypothesis
test.
The goal is to apply descriptive statistics and hypothesis testing in Python.

1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test.

The primary research question for this data project is:

Is there a statistically significant difference in the mean number of drives between iPhone users and Android users on the Waze platform?

This question will guide the formulation of the null and alternative hypotheses for the hypothesis test we will conduct.

In [10]:
import pandas as pd
import numpy as np
from scipy import stats

In [11]:
df = pd.read_csv("C://Users//hp//Desktop//PYTHON//Stat//Waze project//waze_dataset.csv")
df.head()

Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android


2. Data professionals use descriptive statistics for exploratory data analysis (EDA). How can computing descriptive statistics help you learn more about your data in this stage of your analysis?

Summarize Data: Provides key metrics like mean, median, and standard deviation to understand central tendencies and variability.

Identify Trends: Reveals patterns and comparisons between groups (e.g., iPhone vs. Android users).

Assess Distributions: Indicates if data is normally distributed or skewed, guiding appropriate statistical tests.

Detect Outliers: Highlights anomalies that could impact analysis results.

Inform Next Steps: Guides further analysis based on initial findings.

Overall, it lays a solid foundation for deeper analysis and hypothesis testing.
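
As a small illustration of the points above, the sketch below computes grouped descriptive statistics. The five rows are copied from the df.head() output shown earlier, so the numbers describe only that tiny sample, not the full dataset. The names `sample` and `summary` are assumptions introduced for this example.

```python
import pandas as pd

# Five rows copied from the df.head() output above (illustration only,
# far too few rows to draw any conclusion about the full dataset).
sample = pd.DataFrame({
    "drives": [226, 107, 95, 40, 68],
    "device": ["Android", "iPhone", "Android", "iPhone", "Android"],
})

# Count, mean, median and standard deviation of drives per device.
summary = sample.groupby("device")["drives"].agg(["count", "mean", "median", "std"])
print(summary)
```

On the full dataset, the same call on `df` would show whether the mean and median of drives diverge (a sign of skew) and how the spread compares between the two device groups.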

In [12]:
#Create a mapping dictionary
map_dictionary = {'iPhone':1, 'Android':2}

In [13]:
#Create a new 'device_type' column
df['device_type'] = df['device'].copy()

In [14]:
#Map the new column to the dictionary
df['device_type'] = df['device_type'].map(map_dictionary)

In [16]:
df[['device', 'device_type']].head()

Unnamed: 0,device,device_type
0,Android,2
1,iPhone,1
2,Android,2
3,iPhone,1
4,Android,2
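
One caveat with this mapping step: `Series.map` returns NaN for any value that is not a key in the dictionary, so it can be worth confirming that every row was mapped. A minimal check is sketched below, using a hypothetical small frame `demo` with one deliberately unmapped value ("Other" is invented for the example and is not known to occur in the Waze data).

```python
import pandas as pd

map_dictionary = {"iPhone": 1, "Android": 2}

# Hypothetical data; "Other" is not a key in the mapping.
demo = pd.DataFrame({"device": ["Android", "iPhone", "Other"]})
demo["device_type"] = demo["device"].map(map_dictionary)

# Unmapped values become NaN, so a non-zero count signals a gap in the mapping.
n_unmapped = demo["device_type"].isna().sum()
print(n_unmapped)
```

Running the same `isna().sum()` check on `df['device_type']` would confirm that the real column contains only the two expected device labels.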


You are interested in the relationship between device type and the number of drives. One approach
is to look at the average number of drives for each device type. Calculate these averages.

In [20]:
# Calculate the average number of drives for each device type
average_drives = df.groupby('device_type')['drives'].mean()
average_drives

device_type
1    67.859078
2    66.231838
Name: drives, dtype: float64

It appears that drivers who use an iPhone device to interact with the
application have a higher number of drives on average. However, this difference might arise from
random sampling, rather than being a true difference in the number of drives. To assess whether
the difference is statistically significant, you can conduct a hypothesis test.
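
To make the idea of "arising from random sampling" concrete, the sketch below shuffles group labels many times and counts how often a difference at least as large as the observed one appears by chance. This is a permutation test, shown only as an illustration; it is not the method used in this activity, and the two groups are synthetic stand-ins (names, sizes, and distributions are assumptions), not the Waze data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two device groups (NOT the Waze data).
group_a = rng.poisson(lam=68, size=200)
group_b = rng.poisson(lam=66, size=150)

# Observed difference in sample means.
observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])

# Shuffle the pooled values and re-split them into two groups of the
# original sizes; count how often the shuffled difference is at least
# as extreme as the observed one.
n_perm = 2000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:group_a.size].mean() - shuffled[group_a.size:].mean()
    if abs(diff) >= abs(observed):
        count += 1

# Two-sided permutation p-value (with the usual +1 correction).
perm_p_value = (count + 1) / (n_perm + 1)
print(perm_p_value)
```

A large p-value here would mean that label shuffling alone often produces a gap as big as the observed one, which is the same question the t-test answers under its own assumptions.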

3. What are your hypotheses for this data project?

Null Hypothesis (H0):

There is no difference in the mean number of drives between iPhone users and Android users. μiPhone = μAndroid

Alternative Hypothesis (H1):

There is a difference in the mean number of drives between iPhone users and Android users. μiPhone ≠ μAndroid

These hypotheses will guide the statistical analysis to determine if the observed differences in average drives are significant or could be attributed to random sampling.

In [23]:
#Isolate the drives column for iPhone users
iPhone_drives = df[df['device_type'] == 1]['drives']

In [24]:
#Isolate the drives column for Android users
Android_drives = df[df['device_type'] == 2]['drives']

In [25]:
#Perform t-test (unequal variances)
t_stat, p_value = stats.ttest_ind(iPhone_drives, Android_drives, equal_var=False)

In [27]:
alpha = 0.05  #significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a statistically significant difference in the mean number of drives between iPhone and Android users.")
else:
    print("Fail to reject the null hypothesis: There is no statistically significant difference in the mean number of drives between iPhone and Android users.")

Fail to reject the null hypothesis: There is no statistically significant difference in the mean number of drives between iPhone and Android users.
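
For readers who want to see what `equal_var=False` computes, the sketch below reproduces Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `stats.ttest_ind`. The samples `x` and `y` are synthetic (their sizes, means, and spreads are assumptions), so the printed values say nothing about the Waze result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic samples (NOT the Waze data).
x = rng.normal(loc=68, scale=65, size=300)
y = rng.normal(loc=66, scale=64, size=200)

# Squared standard error of each sample mean (sample variance / n).
se2_x = x.var(ddof=1) / x.size
se2_y = y.var(ddof=1) / y.size

# Welch's t statistic: difference in means over the combined standard error.
t_manual = (x.mean() - y.mean()) / np.sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation for the degrees of freedom.
dof = (se2_x + se2_y) ** 2 / (se2_x**2 / (x.size - 1) + se2_y**2 / (y.size - 1))

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# The same test via SciPy.
t_scipy, p_scipy = stats.ttest_ind(x, y, equal_var=False)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

Printing `t_stat` and `p_value` from the cell above alongside the decision would likewise let a reader judge how far the result is from the 0.05 threshold.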


4. What business insight(s) can you draw from the result of your hypothesis test?

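a. The test found no statistically significant difference in the mean number of drives between iPhone and Android users, so the small gap in the sample averages (about 67.9 vs. 66.2 drives) is consistent with random sampling variation.

b. Because drive counts are similar across devices, device type alone is not a strong basis for tailoring marketing or engagement strategies.

c. There is no evidence here that either platform needs feature work to close an engagement gap in drives; product improvements can be prioritized on other grounds.

d. Further tests on other variables in the dataset (for example sessions, driving_days, or label) may reveal behavior differences that better guide customer support and development efforts.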