# Task

The dataset contains information about customers’ demand rate between 
January 2017 and August 2018. The data were collected on an hourly basis and include
time data such as date, hour, and season, as well as weather data such as the weather 
condition, temperature, humidity, and wind speed.

I'll use appropriate hypothesis tests to determine whether there is a significant relationship 
between each column (except the timestamp column) and the demand rate.

![](weather-images.jpg)

# Data

| Column name | Description |
|-------------|-------------|
| id | The sample number specifying its order among other samples (records) |
| timestamp | The time and date when the sample was collected |
| season | The season when the sample was collected |
| holiday | This column specifies whether the date when the sample was collected was a holiday or not |
| workingday | This column specifies whether the date when the sample was collected was a working day or not |
| weather | This column specifies the weather condition when the sample was collected |
| temp | This column shows the temperature when the sample was collected |
| temp_feel | This column shows the feels-like temperature when the sample was collected |
| humidity | This column shows the humidity when the sample was collected |
| windspeed | This column shows the wind speed when the sample was collected |
| demand | This column shows the demand rate for the hour when the sample was collected. The higher the demand rate, the higher the customers’ willingness to rent a car. |


In [24]:
# Import libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
from scipy import stats

In [25]:
# Imports data and sets the "id" column as the index of the dataset
df = pd.read_csv("CarSharing.csv",index_col = "id") 

df.head()


| id | timestamp | season | holiday | workingday | weather | temp | temp_feel | humidity | windspeed | demand |
|----|-----------|--------|---------|------------|---------|------|-----------|----------|-----------|--------|
| 1 | 2017-01-01 00:00:00 | spring | No | No | Clear or partly cloudy | 9.84 | 14.395 | 81.0 | 0.0 | 2.772589 |
| 2 | 2017-01-01 01:00:00 | spring | No | No | Clear or partly cloudy | 9.02 | 13.635 | 80.0 | 0.0 | 3.688879 |
| 3 | 2017-01-01 02:00:00 | spring | No | No | Clear or partly cloudy | 9.02 | 13.635 | 80.0 | 0.0 | 3.465736 |
| 4 | 2017-01-01 03:00:00 | spring | No | No | Clear or partly cloudy | 9.84 | 14.395 | 75.0 | 0.0 | 2.564949 |
| 5 | 2017-01-01 04:00:00 | spring | No | No | Clear or partly cloudy | 9.84 | 14.395 | 75.0 | 0.0 | 0.0 |


In [26]:

# duplicated() returns True for each row that repeats an earlier row; sum() counts those True values
df.duplicated().sum() 


0

In [27]:

# Checks for null values per column
df.isnull().sum() 


timestamp        0
season           0
holiday          0
workingday       0
weather          0
temp          1202
temp_feel      102
humidity        39
windspeed      200
demand           0
dtype: int64

In [28]:

# A function that plots the distribution of a column and computes the mean, median and mode of a column
def compute_hist_central_tendency(col):
    plt.hist(df[col], bins = 25)
    plt.ylabel("Frequency")
    plt.xlabel(col)
    plt.title("Distribution of "+ col)
    
    print(col, "mean: ", df[col].mean())
    print(col, "median: ", df[col].median())
    print(col, "mode: ", df[col].mode())
    
# Fill missing values with the median of the respective column.
# Assigning the result back avoids chained in-place modification,
# which newer pandas versions warn about and may not apply to df.
df['temp'] = df['temp'].fillna(df['temp'].median())
df['humidity'] = df['humidity'].fillna(df['humidity'].median())
df['windspeed'] = df['windspeed'].fillna(df['windspeed'].median())
df['temp_feel'] = df['temp_feel'].fillna(df['temp_feel'].median())

# Save cleaned data
df.to_csv("car_sharing_cleaned")
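The median imputation step can be checked on a small frame where the expected values are easy to verify by hand. The sketch below is illustrative only: the `toy` frame and its values are made up, and only two of the four numeric columns are shown.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the car-sharing data (values are made up)
toy = pd.DataFrame({
    "temp": [9.84, np.nan, 9.02, 30.0],
    "humidity": [81.0, 80.0, np.nan, 75.0],
})

numeric_cols = ["temp", "humidity"]

# DataFrame.median() skips NaN by default, so each median is computed
# from the observed values only
medians = toy[numeric_cols].median()

# Passing a Series to fillna fills each column with its own median
filled = toy.fillna(medians)

print(medians)
print(filled)
print("remaining nulls:", filled.isnull().sum().sum())
```

The median is less sensitive than the mean to extreme readings such as the 30.0 in `temp`, which is one reason to prefer it when a column is skewed.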



In [29]:
# Read csv data into a dataframe and inspect data
df = pd.read_csv("car_sharing_cleaned",index_col = "id" )
df.head()
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 8708 entries, 1 to 8708
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   timestamp   8708 non-null   object 
 1   season      8708 non-null   object 
 2   holiday     8708 non-null   object 
 3   workingday  8708 non-null   object 
 4   weather     8708 non-null   object 
 5   temp        8708 non-null   float64
 6   temp_feel   8708 non-null   float64
 7   humidity    8708 non-null   float64
 8   windspeed   8708 non-null   float64
 9   demand      8708 non-null   float64
dtypes: float64(5), object(5)
memory usage: 748.3+ KB


## Numerical vs Numerical significance test


In [30]:

# Checks whether a column is normally distributed (Shapiro-Wilk test)
def check_normality(data):
    test_stat_normality, p_value_normality = stats.shapiro(data)
    print("p value:%.4f" % p_value_normality)
    if p_value_normality < 0.05:
        print(f"Reject null hypothesis >> The column `{data.name}` is not normally distributed")
    else:
        # Failing to reject does not prove normality; it only means no evidence against it
        print(f"Fail to reject null hypothesis >> No evidence that the column `{data.name}` is not normally distributed")

# Checks if two columns have the same variance (Levene's test)
def check_variance_homogeneity(group1, group2):
    test_stat_var, p_value_var = stats.levene(group1, group2)
    print("p value:%.4f" % p_value_var)
    if p_value_var < 0.05:
        print(f"Reject null hypothesis >> The variances of the samples, `{group1.name}` and `{group2.name}` are different.")
    else:
        print(f"Fail to reject null hypothesis >> The variances of the samples, `{group1.name}` and `{group2.name}` are the same.")


"""
ASSUMPTION CHECK
H₀: The data is normally distributed.

H₁: The data is not normally distributed.

H₀: The variances of the samples are the same.

H₁: The variances of the samples are different.
"""

check_normality(df["temp"])
check_normality(df["temp_feel"])
check_normality(df["humidity"])
check_normality(df["windspeed"])

check_variance_homogeneity(df["temp"], df["temp_feel"])
check_variance_homogeneity(df["temp"], df["humidity"])
check_variance_homogeneity(df["temp"], df["windspeed"])
check_variance_homogeneity(df["temp_feel"], df["humidity"])
check_variance_homogeneity(df["temp_feel"], df["windspeed"])
check_variance_homogeneity(df["humidity"], df["windspeed"])


p value:0.0000
Reject null hypothesis >> The column `temp` is not normally distributed
p value:0.0000
Reject null hypothesis >> The column `temp_feel` is not normally distributed
p value:0.0000
Reject null hypothesis >> The column `humidity` is not normally distributed
p value:0.0000
Reject null hypothesis >> The column `windspeed` is not normally distributed
p value:0.0000
Reject null hypothesis >> The variances of the samples, `temp` and `temp_feel` are different.
p value:0.0000
Reject null hypothesis >> The variances of the samples, `temp` and `humidity` are different.
p value:0.0000
Reject null hypothesis >> The variances of the samples, `temp` and `windspeed` are different.
p value:0.0000
Reject null hypothesis >> The variances of the samples, `temp_feel` and `humidity` are different.
p value:0.0000
Reject null hypothesis >> The variances of the samples, `temp_feel` and `windspeed` are different.
p value:0.0000
Reject null hypothesis >> The variances of the samples, `humidity` and `windspeed` are different.
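SciPy's documentation notes that for samples larger than 5000 the Shapiro-Wilk p-value may not be accurate, and this dataset has 8708 rows. As a cross-check, one option is D'Agostino and Pearson's omnibus test, available as `stats.normaltest`. The sketch below runs it on synthetic arrays of the same length; the helper name `normaltest_pvalue` and the generated data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins with the same length as the dataset (8708 rows)
normal_like = rng.normal(loc=20.0, scale=8.0, size=8708)
skewed = rng.exponential(scale=10.0, size=8708)

def normaltest_pvalue(values):
    """Return the p-value of D'Agostino and Pearson's omnibus normality test."""
    return stats.normaltest(values).pvalue

p_skewed = normaltest_pvalue(skewed)
p_normal = normaltest_pvalue(normal_like)

# A small p-value is evidence against normality
print(f"skewed sample:      p = {p_skewed:.4f}")
print(f"normal-like sample: p = {p_normal:.4f}")
```

With thousands of rows, any normality test tends to reject for even mild departures, so a histogram or Q-Q plot is a useful companion to the p-value.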



### Mann-Whitney Test

In [31]:

# Since the data are not normally distributed and, moreover,
# have unequal variances, proceed to a non-parametric test.
# The Mann-Whitney U test can be used in this scenario.

def mannwhitney_test(col1, col2):
    # mannwhitneyu returns the U statistic and the p-value
    u_stat, pvalue = stats.mannwhitneyu(df[col1], df[col2], alternative="two-sided")
    print("p-value:%.4f" % pvalue)
    if pvalue <0.05:
        print("Reject null hypothesis >> it can be said that there is a statistically significant difference between",
              col1, "and",col2)
    else:
        print("Fail to reject null hypothesis")

mannwhitney_test("temp", "temp_feel")
mannwhitney_test("temp", "humidity")
mannwhitney_test("temp", "windspeed")
mannwhitney_test("temp_feel", "humidity")
mannwhitney_test("temp_feel", "windspeed")
mannwhitney_test("humidity", "windspeed")


p-value:0.0000
Reject null hypothesis >> it can be said that there is a statistically significant difference between temp and temp_feel
p-value:0.0000
Reject null hypothesis >> it can be said that there is a statistically significant difference between temp and humidity
p-value:0.0000
Reject null hypothesis >> it can be said that there is a statistically significant difference between temp and windspeed
p-value:0.0000
Reject null hypothesis >> it can be said that there is a statistically significant difference between temp_feel and humidity
p-value:0.0000
Reject null hypothesis >> it can be said that there is a statistically significant difference between temp_feel and windspeed
p-value:0.0000
Reject null hypothesis >> it can be said that there is a statistically significant difference between humidity and windspeed
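The stated task is to test the relationship between each column and `demand`. For a numerical column, one non-parametric option that does not assume normality is Spearman's rank correlation, which tests for a monotonic association. The sketch below is a hedged illustration on synthetic data: the `toy` frame, the way `demand` is generated, and the helper `spearman_vs_demand` are all assumptions, not results from the car-sharing dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
n = 500
temp = rng.uniform(0, 35, size=n)

# Synthetic frame: demand rises monotonically with temp (plus noise),
# while windspeed is generated independently of demand
toy = pd.DataFrame({
    "temp": temp,
    "windspeed": rng.uniform(0, 40, size=n),
    "demand": np.log1p(temp) + rng.normal(0, 0.3, size=n),
})

def spearman_vs_demand(frame, col, alpha=0.05):
    """Spearman rank correlation between `col` and `demand`."""
    rho, pvalue = stats.spearmanr(frame[col], frame["demand"])
    return rho, pvalue, pvalue < alpha

for col in ["temp", "windspeed"]:
    rho, pvalue, significant = spearman_vs_demand(toy, col)
    print(f"{col}: rho = {rho:.3f}, p = {pvalue:.4f}, significant = {significant}")

rho_temp, p_temp, _ = spearman_vs_demand(toy, "temp")
```

On the real data, the same helper could be called with `df` and each of `temp`, `temp_feel`, `humidity`, and `windspeed`.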


## Categorical vs Numerical significance test using one-way ANOVA and t-tests


In [32]:
# Define a function to test `weather` against other columns
def hypothesis_weather_and_others(compare_with):
    # One sample of `compare_with` per weather condition
    clear = df[df["weather"] == "Clear or partly cloudy"][compare_with]
    light_snow_rain = df[df["weather"] == "Light snow or rain"][compare_with]
    mist = df[df["weather"] == "Mist"][compare_with]
    heavy = df[df["weather"] == "heavy rain/ice pellets/snow + fog"][compare_with]

    # One-way ANOVA across the four weather groups
    result = stats.f_oneway(heavy.values, mist.values, light_snow_rain.values, clear.values)
    print("p value:%.4f" % result.pvalue)

    if result.pvalue > 0.05:
        print("Fail to reject null hypothesis >> The relationship between weather and", compare_with, "is not significant")
    else:
        print("Reject null hypothesis >> The relationship between weather and", compare_with, "is significant")


hypothesis_weather_and_others("temp")
hypothesis_weather_and_others("temp_feel")
hypothesis_weather_and_others("humidity")
hypothesis_weather_and_others("windspeed")


p value:0.0000
Reject null hypothesis >> The relationship between weather and temp is significant
p value:0.0000
Reject null hypothesis >> The relationship between weather and temp_feel is significant
p value:0.0000
Reject null hypothesis >> The relationship between weather and humidity is significant
p value:0.0000
Reject null hypothesis >> The relationship between weather and windspeed is significant
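One-way ANOVA assumes roughly normal groups with similar variances, which the assumption checks earlier in this notebook rejected. A rank-based alternative for comparing more than two groups is the Kruskal-Wallis H test (`stats.kruskal`). The following is a minimal sketch on synthetic data; the `toy` frame, its three weather groups, and the helper `kruskal_by_group` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic humidity readings for three weather conditions (values are made up)
toy = pd.DataFrame({
    "weather": np.repeat(
        ["Clear or partly cloudy", "Mist", "Light snow or rain"], 100
    ),
    "humidity": np.concatenate([
        rng.normal(55, 10, 100),
        rng.normal(70, 10, 100),
        rng.normal(85, 10, 100),
    ]),
})

def kruskal_by_group(frame, group_col, value_col):
    """Kruskal-Wallis H test of `value_col` across the levels of `group_col`."""
    # groupby yields one array per category actually present in the data,
    # so no category label has to be typed out by hand
    groups = [g[value_col].to_numpy() for _, g in frame.groupby(group_col)]
    return stats.kruskal(*groups)

result = kruskal_by_group(toy, "weather", "humidity")
print(f"H = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

Building the groups with `groupby` also avoids silently creating an empty group when a label is misspelled or a category is absent.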


In [33]:
# Define a function to test `holiday` against other columns
def hypothesis_holiday_others(compare_with):
    holi_yes = df[df["holiday"] == "Yes"][compare_with]
    holi_no = df[df["holiday"] == "No"][compare_with]

    # equal_var=False runs Welch's t-test, which does not assume equal variances
    result = stats.ttest_ind(holi_no, holi_yes, equal_var=False)
    print("p value:%.4f" % result.pvalue)

    if result.pvalue > 0.05:
        print("Fail to reject null hypothesis >> The relationship between holiday and", compare_with, "is not significant")
    else:
        print("Reject null hypothesis >> The relationship between holiday and", compare_with, "is significant")

hypothesis_holiday_others("temp")
hypothesis_holiday_others("temp_feel")
hypothesis_holiday_others("humidity")
hypothesis_holiday_others("windspeed")


p value:0.6461
Fail to reject null hypothesis >> The relationship between holiday and temp is not significant
p value:0.2271
Fail to reject null hypothesis >> The relationship between holiday and temp_feel is not significant
p value:0.0165
Reject null hypothesis >> The relationship between holiday and humidity is significant
p value:0.2803
Fail to reject null hypothesis >> The relationship between holiday and windspeed is not significant
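The same two-group comparison can be pointed at `demand`, which is the column the task asks about. The sketch below runs Welch's t-test alongside the Mann-Whitney U test on synthetic data, so that a parametric and a rank-based result can be read side by side. The `toy` frame, its group sizes, and the helper `compare_two_groups` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic demand values for non-holidays and holidays (values are made up);
# holidays are deliberately the much smaller group
toy = pd.DataFrame({
    "holiday": ["No"] * 300 + ["Yes"] * 40,
    "demand": np.concatenate([
        rng.normal(4.5, 1.0, 300),
        rng.normal(4.0, 1.0, 40),
    ]),
})

def compare_two_groups(frame, group_col, value_col="demand"):
    """Return (Welch t-test p-value, Mann-Whitney U p-value) for Yes vs No."""
    yes = frame.loc[frame[group_col] == "Yes", value_col]
    no = frame.loc[frame[group_col] == "No", value_col]
    welch = stats.ttest_ind(no, yes, equal_var=False)
    mwu = stats.mannwhitneyu(no, yes, alternative="two-sided")
    return welch.pvalue, mwu.pvalue

p_welch, p_mwu = compare_two_groups(toy, "holiday")
print(f"Welch t-test p = {p_welch:.4f}")
print(f"Mann-Whitney p = {p_mwu:.4f}")
```

When the two tests disagree, the disagreement itself is informative: it usually points to skew or outliers in one of the groups.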


In [34]:
# Define a function to test `workingday` against other columns
def hypothesis_workingday_others(compare_with):
    wd_yes = df[df["workingday"] == "Yes"][compare_with]
    wd_no = df[df["workingday"] == "No"][compare_with]

    # equal_var=True runs Student's t-test, which assumes equal group variances
    result = stats.ttest_ind(wd_no, wd_yes, equal_var=True)
    print("p value:%.4f" % result.pvalue)

    if result.pvalue > 0.05:
        print("Fail to reject null hypothesis >> The relationship between working day and", compare_with, "is not significant")
    else:
        print("Reject null hypothesis >> The relationship between working day and", compare_with, "is significant")

hypothesis_workingday_others("temp")
hypothesis_workingday_others("temp_feel")
hypothesis_workingday_others("humidity")
hypothesis_workingday_others("windspeed")


p value:0.0082
Reject null hypothesis >> The relationship between working day and temp is significant
p value:0.0015
Reject null hypothesis >> The relationship between working day and temp_feel is significant
p value:0.3665
Fail to reject null hypothesis >> The relationship between working day and humidity is not significant
p value:0.9930
Fail to reject null hypothesis >> The relationship between working day and windspeed is not significant
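The cells above compare a categorical column with numerical ones. For two categorical columns, such as `season` and `weather`, a chi-square test of independence on their contingency table is a common choice. The sketch below builds the table with `pd.crosstab` and passes it to `stats.chi2_contingency`; the `toy` frame and its counts are made up for illustration and say nothing about the real dataset.

```python
import pandas as pd
from scipy import stats

# Synthetic categorical data with a deliberate association (counts are made up):
# spring is mostly "Mist", summer is mostly "Clear or partly cloudy"
toy = pd.DataFrame({
    "season": ["spring"] * 100 + ["summer"] * 100,
    "weather": ["Mist"] * 70 + ["Clear or partly cloudy"] * 30
             + ["Mist"] * 30 + ["Clear or partly cloudy"] * 70,
})

# Contingency table of observed counts (rows: season, columns: weather)
table = pd.crosstab(toy["season"], toy["weather"])
print(table)

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the table of counts expected under independence
chi2, pvalue, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {pvalue:.4f}")

if pvalue < 0.05:
    print("Reject null hypothesis >> season and weather are not independent")
else:
    print("Fail to reject null hypothesis >> no evidence of a relationship")
```

The test is generally considered reliable only when the expected counts are not too small, so the `expected` array is worth inspecting before trusting the p-value.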
