# <font color='green'> 1. Introduction </font> 

We will work as a Data Scientist for the Autolib electric car-sharing service company to investigate a claim about the blue cars from the provided Autolib dataset.

To do this, we need to identify areas and periods of interest through sampling, justify the choice of sampling method, and then perform hypothesis testing on the claim we have made.

To work on this project, we will perform the following analysis with Python:

  1. Find and deal with outliers, anomalies, and missing data within the dataset.
  2. Plot appropriate univariate and bivariate summaries recording our observations.
  3. Implement the solution by performing hypothesis testing.

# <font color='green'> 2. Problem Statement </font>


### <font color='blue'> Introduce the data you will be describing and the random variable that you are investigating. </font>

The data set I am working with is the Autolib dataset provided on the Moringa School LMS. It contains information on three types of electric cars belonging to the Autolib company: the Bluecar, the Utilib, and the Utilib 14. The random variable I will be investigating is the number of Utilib 14 cars taken per day in a given area; specifically, I want to know whether its mean differs from that of the Utilib cars, as an indicator of relative popularity.

In [1]:
# load libraries to be used
import pandas as pd
import numpy as np
import scipy.stats as stats

In [2]:
autolib = pd.read_csv('autolib_daily_events_postal_code.csv')
autolib.head(5)

Unnamed: 0,Postal code,date,n_daily_data_points,dayOfWeek,day_type,BlueCars_taken_sum,BlueCars_returned_sum,Utilib_taken_sum,Utilib_returned_sum,Utilib_14_taken_sum,Utilib_14_returned_sum,Slots_freed_sum,Slots_taken_sum
0,75001,1/1/2018,1440,0,weekday,110,103,3,2,10,9,22,20
1,75001,1/2/2018,1438,1,weekday,98,94,1,1,8,8,23,22
2,75001,1/3/2018,1439,2,weekday,138,139,0,0,2,2,27,27
3,75001,1/4/2018,1320,3,weekday,104,104,2,2,9,8,25,21
4,75001,1/5/2018,1440,4,weekday,114,117,3,3,6,6,18,20


### <font color='blue'>State very precisely the null and alternate hypothesis that you will be testing.</font>

The null hypothesis I will test is that the mean of the number of Utilib 14 cars taken is equal to that of the Utilib cars, i.e., there is no difference between the two means. The alternate hypothesis is that the mean of the Utilib 14 cars taken is not equal to that of the Utilib cars.

In short:

* H0: μ Utilib 14 = μ Utilib
* H1: μ Utilib 14 ≠ μ Utilib


### <font color='blue'>Provide some explanation for why this hypothesis is important and/or interesting.</font>

This is worth investigating because, from what I've seen, BlueCars are by far the most popular cars, well ahead of both the Utilib and the Utilib 14, so the open question is which of these two comes next. If Autolib wishes to do away with the less popular one or order more cars, this hypothesis test can help them make a beneficial decision.

# <font color='green'>3. Data Description</font>

### <font color='blue'>Provide information about the data necessary to understand the rest of the report including a precise statement of the random variable.</font>

The dataset contains such information as:
* Postal code: postal code of the area (in Paris)
* Date: the date on which the row's data was aggregated
* The number of daily data points: the number of daily data points that were available for aggregation that day
* The day of the week
* The type of the day, i.e., was it a weekend or a weekday?
* The total number of Bluecars taken that day in that area
* The total number of Bluecars returned that day in that area
* The total number of Utilibs taken that day in that area
* The total number of Utilibs returned that day in that area
* The total number of Utilib 14s taken that day in that area
* The total number of Utilib 14s returned that day in that area
* The total number of recharging slots released that day in that area
* The total number of recharging slots taken that day in that area

The variable I will work on is the total number of Utilib 14s taken that day in that area, i.e., the 'Utilib_14_taken_sum' variable.

In [3]:
# provide variable definitions
varDef = pd.read_excel('columns_explanation.xlsx')
varDef

Unnamed: 0,Column name,explanation
0,Postal code,postal code of the area (in Paris)
1,date,date of the row aggregation
2,n_daily_data_points,number of daily data poinst that were availabl...
3,dayOfWeek,identifier of weekday (0: Monday -> 6: Sunday)
4,day_type,weekday or weekend
5,BlueCars_taken_sum,Number of bluecars taken that date in that area
6,BlueCars_returned_sum,Number of bluecars returned that date in that ...
7,Utilib_taken_sum,Number of Utilib taken that date in that area
8,Utilib_returned_sum,Number of Utilib returned that date in that area
9,Utilib_14_taken_sum,Number of Utilib 1.4 taken that date in that area


### <font color='blue'>Provide a description of the source of your data and the data collection procedures, the descriptive statistics, and some assertions about the model that is consistent with the data. </font>

The data set was obtained from the Moringa School LMS platform. Its original source is not documented there, so I cannot provide any information on the data collection procedures.

As seen from the summary statistics below, there are 16,085 records and 13 columns.

In [4]:
print(autolib.shape)
autolib.describe(include='all')

(16085, 13)


Unnamed: 0,Postal code,date,n_daily_data_points,dayOfWeek,day_type,BlueCars_taken_sum,BlueCars_returned_sum,Utilib_taken_sum,Utilib_returned_sum,Utilib_14_taken_sum,Utilib_14_returned_sum,Slots_freed_sum,Slots_taken_sum
count,16085.0,16085,16085.0,16085.0,16085,16085.0,16085.0,16085.0,16085.0,16085.0,16085.0,16085.0,16085.0
unique,,156,,,2,,,,,,,,
top,,2/23/2018,,,weekday,,,,,,,,
freq,,104,,,11544,,,,,,,,
mean,88791.293876,,1431.330619,2.969599,,125.926951,125.912714,3.69829,3.699099,8.60056,8.599192,22.629033,22.629282
std,7647.342,,33.21205,2.008378,,185.426579,185.501535,5.815058,5.824634,12.870098,12.868993,52.120263,52.14603
min,75001.0,,1174.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,91330.0,,1439.0,1.0,,20.0,20.0,0.0,0.0,1.0,1.0,0.0,0.0
50%,92340.0,,1440.0,3.0,,46.0,46.0,1.0,1.0,3.0,3.0,0.0,0.0
75%,93400.0,,1440.0,5.0,,135.0,135.0,4.0,4.0,10.0,10.0,5.0,5.0


# <font color='green'>4. Hypothesis Testing Procedure </font>

### <font color='blue'> Present the details concerning how you will test your hypothesis. </font>

After preparing and cleaning the dataset, I will keep only the records taken on a weekend, since that is where my focus lies. This weekend-only data set will act as the population, from which I will randomly select 10% of the data points as my sample. I expect the sample size to be greater than 30, in which case a z-test is appropriate; should it turn out to be less than 30, I will use a t-test instead.

# <font color='green'>5. Data Cleaning </font>

In [5]:
# change column names to improve consistency and readability

autolib = autolib.rename(columns = {
    'Postal code' : 'postal_code',
    'n_daily_data_points' : 'daily_data_points',
    'dayOfWeek' : 'day_of_week',
    'BlueCars_taken_sum' : 'bluecars_taken',
    'BlueCars_returned_sum' : 'bluecars_returned',
    'Utilib_taken_sum' : 'utilib_taken',
    'Utilib_returned_sum' : 'utilib_returned',
    'Utilib_14_taken_sum' : 'utilib_14_taken',
    'Utilib_14_returned_sum' : 'utilib_14_returned',
    'Slots_freed_sum' : 'slots_freed',
    'Slots_taken_sum' : 'slots_taken'
})

autolib.columns

Index(['postal_code', 'date', 'daily_data_points', 'day_of_week', 'day_type',
       'bluecars_taken', 'bluecars_returned', 'utilib_taken',
       'utilib_returned', 'utilib_14_taken', 'utilib_14_returned',
       'slots_freed', 'slots_taken'],
      dtype='object')

In [6]:
# check for and display duplicates
duplicatedData = autolib[autolib.duplicated()]
duplicatedData

## no duplicated data found so no need to drop any

Unnamed: 0,postal_code,date,daily_data_points,day_of_week,day_type,bluecars_taken,bluecars_returned,utilib_taken,utilib_returned,utilib_14_taken,utilib_14_returned,slots_freed,slots_taken


In [7]:
# total number of missing values
np.count_nonzero(autolib.isna())

## no missing values so there is no need of dropping or imputing any records

0

In [8]:
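# remove outliers, if any are present, using the interquartile range
# (quantiles are only defined for numeric columns, so select those explicitly)
numeric = autolib.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1

# keep rows where no numeric column falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR];
# note that this also filters on identifier-like columns such as postal_code
autolib = autolib[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]
autolib.shape

## we are now down to 9,783 records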

(9783, 13)

In [9]:
# since I am interested in only the data that is on a weekend, I will create a dataset
# containing only weekend entries

weekend = autolib.loc[autolib['day_type'] == 'weekend']

# confirm that it has only weekends
weekend.day_type.unique()

array(['weekend'], dtype=object)

In [10]:
# delete 'day_type' column since it is unnecessary
weekend = weekend.drop(columns=['day_type'])

In [11]:
# shape of data set
weekend.shape

## further down to 2,963 records

(2963, 12)

In [12]:
# summary statistics of weekend dataset
weekend.describe()

Unnamed: 0,postal_code,daily_data_points,day_of_week,bluecars_taken,bluecars_returned,utilib_taken,utilib_returned,utilib_14_taken,utilib_14_returned,slots_freed,slots_taken
count,2963.0,2963.0,2963.0,2963.0,2963.0,2963.0,2963.0,2963.0,2963.0,2963.0,2963.0
mean,93112.126223,1439.825177,5.527843,57.718529,57.354708,1.742153,1.734391,4.143098,4.129261,0.768815,0.748228
std,1015.852872,0.493524,0.499308,48.99155,49.600634,2.063558,2.080906,4.132647,4.197661,2.029545,1.9902
min,91330.0,1438.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,92270.0,1440.0,5.0,21.0,20.0,0.0,0.0,1.0,1.0,0.0,0.0
50%,93110.0,1440.0,6.0,43.0,42.0,1.0,1.0,3.0,3.0,0.0,0.0
75%,94100.0,1440.0,6.0,82.0,82.0,3.0,3.0,6.0,6.0,0.0,0.0
max,95880.0,1440.0,6.0,293.0,301.0,10.0,10.0,22.0,22.0,12.0,12.0


### <font color='blue'>Describe the logic behind your null and alternate hypotheses: where did they come from and why are they interesting.</font>

I decided to look into which of the Utilib and the Utilib 14 is more popular because, according to the summary statistics, their means are very low compared to that of the Bluecar, so I want to see if there is sufficient evidence that the Utilib 14 is more popular than the Utilib. Personally, the '14' in 'Utilib 14' appeals to me because it seems to indicate that it is the 14th new and improved version of the Utilib, and I would like to see whether it is indeed better than its predecessor. Additionally, as a Data Scientist working for Autolib, I figure I might be asked to determine which of the two is the less popular in case they want to do away with it and invest in a more profitable brand.

### <font color='blue'>Describe the test statistic you will use (i.e., z, t, f) and why. Have you satisfied the assumptions necessary for using the specific statistic?</font>

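Since I am working with a very large dataset, where the sample size will be greater than 30, I will use a z-test. The assumptions I rely on are that the sample is drawn by simple random sampling, that the population mean and standard deviation are known (they are computed from the full weekend data set), and that with n > 30 the sampling distribution of the mean is approximately normal by the central limit theorem.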
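For reference, the one-sample z statistic computed in section 6.2 has the form below, where $\bar{x}$ is the sample mean, $\mu$ and $\sigma$ are the population mean and standard deviation (here taken from the weekend data set), $n$ is the sample size, and $\Phi$ is the standard normal CDF. The two-sided p-value follows from the symmetry of the normal distribution.

```latex
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}},
\qquad
p = 2\,\bigl(1 - \Phi(|z|)\bigr)
```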

### <font color='blue'>Determine the alpha level you will use.</font>

The alpha level I will use is 0.05 (a confidence level of 95%) because it is the conventional choice, and I see no particular reason to use a different one.
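As a small illustration of what this alpha level means for a two-sided z-test, the sketch below computes the corresponding critical value with `scipy.stats.norm.ppf`. The variable name `z_crit` is introduced here for illustration only and is not used elsewhere in the notebook.

```python
from scipy import stats

alpha = 0.05  # significance level chosen above

# two-sided test: alpha is split across both tails of the standard normal,
# so H0 is rejected when |z| exceeds the (1 - alpha/2) quantile
z_crit = stats.norm.ppf(1 - alpha / 2)
print(f"Reject H0 if |z| > {z_crit:.3f}")
```

Comparing |z| with this critical value is equivalent to comparing the two-sided p-value with alpha.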

# <font color='green'> 6. Hypothesis Testing </font>

#### <font color='blue'> 6.1 Hypothesis Statement Formulation </font>

The null hypothesis is that there is no difference in the mean of the Utilib 14 taken and that of the Utilib.

The alternative hypothesis states that there is a difference in the mean of the two.

Simply put:


*   H0 : μ of Utilib 14 = μ of Utilib
*   H1 : μ of Utilib 14 ≠ μ of Utilib

We will use a confidence level of 95%, that is, an alpha level of 0.05 to determine whether or not to reject the null hypothesis.


#### <font color='blue'> 6.2 Hypothesis Testing Computation </font>

In [13]:
# we will use a simple random sampling to randomly select 10% of values from the population
# which is the weekend dataset

popSize = weekend.shape[0]
sampleSize = int(0.1 * popSize)
print("Sample size is", sampleSize)
## since the sample size is greater than 30, we will use z test

sampleData = weekend.sample(n = sampleSize, replace = False)  # boolean False: sample without replacement
print(sampleData.head())

Sample size is 296
       postal_code       date  daily_data_points  day_of_week  bluecars_taken  \
15458        94700  6/17/2018               1440            6              56   
13935        94150   2/3/2018               1438            5              23   
15399        94700   4/8/2018               1440            6              42   
15682        95100  3/10/2018               1440            5              22   
9770         92700  4/29/2018               1440            6             237   

       bluecars_returned  utilib_taken  utilib_returned  utilib_14_taken  \
15458                 63             0                0                5   
13935                 31             0                0                1   
15399                 41             0                0                3   
15682                 30             0                0                0   
9770                 244             2                2               11   

       utilib_14_returned  slots_free
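
Because `sample` draws a different subset on each run, the statistic and p-value computed from it will vary between runs. A minimal sketch of reproducible sampling without replacement is shown here, using a small hypothetical frame (`toy`) in place of `weekend` and an arbitrary seed; neither is part of the original analysis.

```python
import pandas as pd

# hypothetical stand-in for the weekend data set
toy = pd.DataFrame({"utilib_14_taken": range(100)})

sample_size = int(0.1 * len(toy))

# replace=False (a boolean, not the string 'False') draws each row at most once;
# random_state fixes the seed so that reruns return the same rows
sample_a = toy.sample(n=sample_size, replace=False, random_state=42)
sample_b = toy.sample(n=sample_size, replace=False, random_state=42)

print(len(sample_a))                # number of sampled rows
print(sample_a.index.is_unique)     # no row drawn twice
print(sample_a.equals(sample_b))    # same seed, same sample
```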

In [14]:
# manually calculate the z-test statistic using the population mean, population standard deviation,
# sample mean, and sample size
from math import sqrt
popMean = weekend.utilib_14_taken.mean()
popStd = weekend.utilib_14_taken.std()
sampleMean = sampleData.utilib_14_taken.mean()
n = len(sampleData)  # actual sample size (296) rather than a hard-coded value
alpha = 0.05

statistic = (sampleMean - popMean) / (popStd / sqrt(n))
print("Test statistic is", statistic) 

Test statistic is -0.6580670507867209


In [15]:
# calculate the p value
p_value = stats.norm.sf(abs(statistic))*2
p_value

0.5104950470153276

#### <font color='blue'> 6.3 Hypothesis Testing Interpretation </font>

In [16]:
if p_value <= alpha:
  print("Null hypothesis rejected.")
else:
  print("Null hypothesis failed to be rejected.")

Null hypothesis failed to be rejected.
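
The statistic in section 6.2 compares a sample mean of `utilib_14_taken` with the population mean of the same column. One way to compare the two car types directly, sketched here under the assumption that the two counts in each row can be treated as a pair, is a z-test on the per-row differences. The data below is synthetic (Poisson draws with arbitrary rates), so the printed numbers say nothing about Autolib; only the procedure is of interest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 300

# synthetic daily counts; the rates 4.0 and 2.0 are arbitrary placeholders
utilib_14_taken = rng.poisson(4.0, size=n)
utilib_taken = rng.poisson(2.0, size=n)

# per-row differences; under H0 their mean is zero
diff = utilib_14_taken - utilib_taken
se = diff.std(ddof=1) / np.sqrt(n)    # standard error of the mean difference
z = diff.mean() / se
p_value = 2 * stats.norm.sf(abs(z))   # two-sided p-value

print(f"z = {z:.3f}, p = {p_value:.3g}")
```

With the real data, `utilib_14_taken` and `utilib_taken` would be the corresponding columns of the sampled rows. Treating rows as pairs is a modelling choice; an unpaired two-sample test would be the alternative if the two counts were considered independent.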


# <font color='green'> 7. Hypothesis Testing Results and Conclusion </font>

After cleaning the data set, removing outliers, and then slicing it so as to work with only the records taken on weekends, I was left with 2,963 records. 10% of this gave 296 records, which I used for my sample.

The z-test using the population mean, population standard deviation, sample mean, and sample size resulted in a z statistic of approximately -0.6581, which corresponds to a p-value of approximately 0.5105. Since this p-value is greater than our pre-set alpha value of 0.05, we do not have sufficient evidence to reject the null hypothesis. Failing to reject the null hypothesis is not the same as showing that the two means are equal. In addition, the statistic as computed compares the sample mean of Utilib 14 cars taken with the weekend population mean of the same variable, so it does not by itself compare the two brands; the weekend summary statistics show a mean of about 4.14 Utilib 14 cars taken against about 1.74 Utilib cars taken. Autolib should therefore not treat this result as evidence that the two are equally popular, and a direct comparison of the two variables is needed before deciding to do away with either one.