## <center> About Waze </center>

##### Waze is a satellite navigation app for smartphones and other devices that support the Global Positioning System (GPS). Its free navigation app helps drivers around the world get where they want to go. Waze's community of map editors, beta testers, translators, partners, and users helps make each drive better and safer.

## <center>   Project Goal  </center>

#### The goal of this project is to develop a machine learning (ML) model that predicts which users are most likely to stop using the Waze app. These predictions will inform decisions that help prevent churn, improve user retention, and thus grow Waze's business.

## <center> Churn problem </center>
#### Churn quantifies the number of users who have uninstalled the Waze app or stopped using it.

## <center> The project consists of the following steps: </center>

1. Gather and import the data into Python, then inspect it from a general perspective
2. Clean and pre-process the data to prepare it for further analysis
3. Perform exploratory data analysis (EDA) to understand the data structure
4. Calculate descriptive statistics for quantitative variables and conduct statistical hypothesis tests for insights
5. Build and evaluate a logistic regression model to predict outcomes
6. Perform feature engineering and build more complex models, e.g. random forest and XGBoost

## <center>  1. Gather and import the data into Python, then inspect it from a general perspective   </center>

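#### This will include:
- importing the libraries and loading the dataset
- inspecting the structure of the data with `head()`, `info()`, and `shape`
- checking for missing values and whether they appear to be random
- examining the distribution of the target variable `label`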

We import the necessary libraries: pandas and NumPy.

In [7]:
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
#import seaborn as sns

We then load the dataset into a DataFrame using the standard **pandas.read_csv()** function.

In [3]:
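# Use forward slashes throughout; the original "\G" is an invalid escape sequence in a Python string.
df_waze_0 = pd.read_csv("C:/Coursera/Google Advanced Data Analytics/GADA_datasets/waze_dataset.csv")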
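
Mixed path separators are easy to get wrong on Windows. As a minimal sketch, assuming the same directory layout as above, the path can be built with `pathlib` so that no backslash escapes are involved. The names `data_dir` and `csv_path` are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Hypothetical location; adjust to wherever the CSV actually lives.
data_dir = Path("C:/Coursera/Google Advanced Data Analytics/GADA_datasets")

# The "/" operator joins path components without any string escaping.
csv_path = data_dir / "waze_dataset.csv"

print(csv_path.name)
```

A `Path` object can be passed directly to `pd.read_csv()`, so the loading call itself would not need to change beyond receiving `csv_path`.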

#### We were also given a brief description of each column in the dataset:

- ID - A sequential numbered index
- label - Target variable (“retained” vs “churned”) indicating whether a user churned at any time during the month
- sessions - The number of times a user opened the app during the month
- drives - The number of occurrences of driving at least 1 km during the month
- device - The type of device a user starts a session with
- total_sessions - A model estimate of the total number of sessions since a user has onboarded
- n_days_after_onboarding - The number of days since a user signed up for the app
- total_navigations_fav1 - Total navigations since onboarding to the user’s favorite place 1
- total_navigations_fav2 - Total navigations since onboarding to the user’s favorite place 2
- driven_km_drives - Total kilometers driven during the month
- duration_minutes_drives - Total duration driven in minutes during the month
- activity_days - Number of days the user opens the app during the month 
- driving_days - Number of days the user drives (at least 1 km) during the month

Next, we view and inspect summary information about the dataset using the following pandas DataFrame methods and attributes:

- head()
- info()
- shape

In [5]:
df_waze_0.head(10)

| | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
| 1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone |
| 2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |
| 3 | 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone |
| 4 | 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android |
| 5 | 5 | retained | 113 | 103 | 279.544437 | 2637 | 0 | 0 | 901.238699 | 439.101397 | 15 | 11 | iPhone |
| 6 | 6 | retained | 3 | 2 | 236.725314 | 360 | 185 | 18 | 5249.172828 | 726.577205 | 28 | 23 | iPhone |
| 7 | 7 | retained | 39 | 35 | 176.072845 | 2999 | 0 | 0 | 7892.052468 | 2466.981741 | 22 | 20 | iPhone |
| 8 | 8 | retained | 57 | 46 | 183.532018 | 424 | 0 | 26 | 2651.709764 | 1594.342984 | 25 | 20 | Android |
| 9 | 9 | churned | 84 | 68 | 244.802115 | 2997 | 72 | 0 | 6043.460295 | 2341.838528 | 7 | 3 | iPhone |


In [8]:
df_waze_0.shape

(14999, 13)

In [9]:
df_waze_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB


As we can see above, the variables `label` and `device` are of type `object` (`string`); `total_sessions`, `driven_km_drives`, and `duration_minutes_drives` are of type `float64`; the rest of the variables are of type `int64`. There are 14,999 rows and 13 columns.

The **info()** output above also reports the number of non-null values in each column. Below we look at the same information from the opposite perspective: how many null values each column contains. To do so, we chain the **isnull()** and **sum()** methods from *pandas*.

In [11]:
df_waze_0.isnull().sum()

ID                           0
label                      700
sessions                     0
drives                       0
total_sessions               0
n_days_after_onboarding      0
total_navigations_fav1       0
total_navigations_fav2       0
driven_km_drives             0
duration_minutes_drives      0
activity_days                0
driving_days                 0
device                       0
dtype: int64

The only null values in the DataFrame are in the `label` column, which has 700 of them.
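
Counts of missing values are easier to judge as a share of all rows. The sketch below is one way to get that share; it uses a small made-up frame named `toy` in place of `df_waze_0`, so the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df_waze_0 (values are illustrative only).
toy = pd.DataFrame({
    "label": ["retained", None, "churned", "retained", None],
    "sessions": [10, 20, 30, 40, 50],
})

# isna() returns a boolean frame; the mean of booleans is the fraction of True values.
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to the real dataset, the same expression would report 700 / 14999 for `label`, which is roughly 4.7% of the rows.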

We will split the dataset into two parts: one where `label` is not null and one where it is null. Then we will check whether the distributions of the other variables differ depending on whether `label` is missing.

In [14]:
df_waze_not_null = df_waze_0[~df_waze_0["label"].isna()]
df_waze_null = df_waze_0[df_waze_0["label"].isna()]

In [15]:
print(df_waze_not_null.shape)
print(df_waze_null.shape)

(14299, 13)
(700, 13)


For quantitative variables, we will use the **describe()** method to calculate summary statistics for each variable:

In [17]:
df_waze_not_null.describe()

| | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 |
| mean | 7503.573117 | 80.62382 | 67.255822 | 189.547409 | 1751.822505 | 121.747395 | 29.638296 | 4044.401535 | 1864.199794 | 15.544653 | 12.18253 |
| std | 4331.207621 | 80.736502 | 65.947295 | 136.189764 | 1008.663834 | 147.713428 | 45.35089 | 2504.97797 | 1448.005047 | 9.016088 | 7.833835 |
| min | 0.0 | 0.0 | 0.0 | 0.220211 | 4.0 | 0.0 | 0.0 | 60.44125 | 18.282082 | 0.0 | 0.0 |
| 25% | 3749.5 | 23.0 | 20.0 | 90.457733 | 878.5 | 10.0 | 0.0 | 2217.319909 | 840.181344 | 8.0 | 5.0 |
| 50% | 7504.0 | 56.0 | 48.0 | 158.718571 | 1749.0 | 71.0 | 9.0 | 3496.545617 | 1479.394387 | 16.0 | 12.0 |
| 75% | 11257.5 | 111.0 | 93.0 | 253.54045 | 2627.5 | 178.0 | 43.0 | 5299.972162 | 2466.928876 | 23.0 | 19.0 |
| max | 14998.0 | 743.0 | 596.0 | 1216.154633 | 3500.0 | 1236.0 | 415.0 | 21183.40189 | 15851.72716 | 31.0 | 30.0 |


In [18]:
df_waze_null.describe()

| | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 |
| mean | 7405.584286 | 80.837143 | 67.798571 | 198.483348 | 1709.295714 | 118.717143 | 30.371429 | 3935.967029 | 1795.123358 | 15.382857 | 12.125714 |
| std | 4306.900234 | 79.98744 | 65.271926 | 140.561715 | 1005.306562 | 156.30814 | 46.306984 | 2443.107121 | 1419.242246 | 8.772714 | 7.626373 |
| min | 77.0 | 0.0 | 0.0 | 5.582648 | 16.0 | 0.0 | 0.0 | 290.119811 | 66.588493 | 0.0 | 0.0 |
| 25% | 3744.5 | 23.0 | 20.0 | 94.05634 | 869.0 | 4.0 | 0.0 | 2119.344818 | 779.009271 | 8.0 | 6.0 |
| 50% | 7443.0 | 56.0 | 47.5 | 177.255925 | 1650.5 | 62.5 | 10.0 | 3421.156721 | 1414.966279 | 15.0 | 12.0 |
| 75% | 11007.0 | 112.25 | 94.0 | 266.058022 | 2508.75 | 169.25 | 43.0 | 5166.097373 | 2443.955404 | 23.0 | 18.0 |
| max | 14993.0 | 556.0 | 445.0 | 1076.879741 | 3498.0 | 1096.0 | 352.0 | 15135.39128 | 9746.253023 | 31.0 | 30.0 |


Comparing observations with a missing `label` to those without did not reveal anything noteworthy: the means and standard deviations are very consistent across both groups.
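
The visual comparison can be backed by a quick calculation. The sketch below copies the means of three columns from the two **describe()** outputs above and computes the relative difference between the groups; the choice of columns and the names `mean_not_null`, `mean_null`, and `rel_diff_pct` are illustrative assumptions.

```python
import pandas as pd

# Means copied from the two describe() outputs above (three columns only).
mean_not_null = pd.Series(
    {"sessions": 80.62382, "drives": 67.255822, "total_sessions": 189.547409}
)
mean_null = pd.Series(
    {"sessions": 80.837143, "drives": 67.798571, "total_sessions": 198.483348}
)

# Relative difference of the missing-label group against the labeled group, in percent.
rel_diff_pct = (mean_null - mean_not_null) / mean_not_null * 100
print(rel_diff_pct.round(2))
```

For these three columns, the gap between the groups stays below 5%. On the full DataFrames, the two series could be obtained with `df_waze_not_null.mean(numeric_only=True)` and `df_waze_null.mean(numeric_only=True)`.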

For the qualitative variable `device`, we will calculate the distribution in both groups using the **value_counts()** method from *pandas*.

In [27]:
print(df_waze_not_null['device'].value_counts())
print(df_waze_not_null['device'].value_counts(normalize = True))

device
iPhone     9225
Android    5074
Name: count, dtype: int64
device
iPhone     0.64515
Android    0.35485
Name: proportion, dtype: float64


In [28]:
print(df_waze_null['device'].value_counts())
print(df_waze_null['device'].value_counts(normalize = True))

device
iPhone     447
Android    253
Name: count, dtype: int64
device
iPhone     0.638571
Android    0.361429
Name: proportion, dtype: float64


Here, too, there is nothing unusual: both distributions are very similar.

There is nothing to suggest a non-random cause of the missing data in our dataset.
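
The device comparison can also be done in a single table without splitting the DataFrame. The following is a minimal sketch using `pd.crosstab()` on a small made-up frame; `toy`, `label_status`, and `device_share` are hypothetical names, and the toy values do not reflect the real dataset.

```python
import pandas as pd

# Toy frame standing in for df_waze_0 (values are illustrative only).
toy = pd.DataFrame({
    "label": ["retained", None, "churned", "retained", None, "retained"],
    "device": ["iPhone", "iPhone", "iPhone", "Android", "Android", "iPhone"],
})

# Flag each row by whether its label is missing.
label_status = (
    toy["label"].isna().map({True: "missing", False: "present"}).rename("label_status")
)

# normalize="index" turns each row into proportions that sum to 1.
device_share = pd.crosstab(label_status, toy["device"], normalize="index")
print(device_share)
```

Each row of the result is the device distribution within one group, so similar rows indicate that missingness does not depend on the device.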

Next, we check the distribution of our target variable, `label`. We use the **value_counts()** method again, once for counts and once for proportions (with the parameter *normalize = True*). By default, **value_counts()** excludes missing values, so the 700 rows without a label are not counted.

In [30]:
print(df_waze_0["label"].value_counts())
print(df_waze_0["label"].value_counts(normalize = True))

label
retained    11763
churned      2536
Name: count, dtype: int64
label
retained    0.822645
churned     0.177355
Name: proportion, dtype: float64


About 82% of the labeled users in the dataset were retained and about 18% churned.
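
With roughly 82% of labeled users retained, a model that always predicts "retained" would already look accurate. The short calculation below, using the counts from the **value_counts()** output above, makes that baseline explicit; treating it as a reference point for later models is a suggestion rather than part of the original analysis.

```python
# Counts taken from the value_counts() output above.
retained = 11763
churned = 2536
total = retained + churned

churn_rate = churned / total

# Accuracy of a naive model that always predicts "retained".
baseline_accuracy = retained / total

print(f"churn rate: {churn_rate:.4f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Because of this imbalance, metrics such as recall and precision for the churned class are likely to be more informative than accuracy alone.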

Next, we look at the median values of the numerical variables for retained and churned users of the Waze app, using the **groupby()** and **median()** methods from *pandas*.

In [33]:
df_waze_0.groupby("label").median(numeric_only = True)

| label | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| churned | 7477.5 | 59.0 | 50.0 | 164.339042 | 1321.0 | 84.5 | 11.0 | 3652.655666 | 1607.183785 | 8.0 | 6.0 |
| retained | 7509.0 | 56.0 | 47.0 | 157.586756 | 1843.0 | 68.0 | 9.0 | 3464.684614 | 1458.046141 | 17.0 | 14.0 |
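
To make the group medians easier to compare, they can be expressed as a percentage difference between churned and retained users. The sketch below copies four of the medians from the output above; the column selection and the names `medians` and `pct_diff` are illustrative assumptions.

```python
import pandas as pd

# Medians copied from the groupby output above (selected columns only).
medians = pd.DataFrame(
    {
        "drives": [50.0, 47.0],
        "n_days_after_onboarding": [1321.0, 1843.0],
        "activity_days": [8.0, 17.0],
        "driving_days": [6.0, 14.0],
    },
    index=["churned", "retained"],
)

# Churned median relative to the retained median, in percent.
pct_diff = (medians.loc["churned"] / medians.loc["retained"] - 1) * 100
print(pct_diff.round(1))
```

On these figures, the median churned user took slightly more drives than the median retained user but was active on fewer than half as many days.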


## <center>  2. Clean and pre-process the data to prepare it for further analysis   </center>
