# Statistics Research Project


Students: Arthur ADAM, Alexandre FLION

## Context

This research project was carried out between August 31st and October 3rd, 2022. We were asked to conduct research on a topic of our choice. Having no particular idea in mind, we chose a simple situation:

```txt
France and the Netherlands both have long periods of rain, and we had just discovered the city centre with its shops and restaurants. What if we studied the impact of weather on the shopping behaviour of people in Eindhoven?
```

We chose that topic and worked on it during this period.

## Research Project

### Research Setup

As a first task, we formulated two hypotheses to frame our study: our null hypothesis, called H0, and our alternative hypothesis, called HA.

H0 and HA were, respectively:

- People shop just as much whether it rains or it is sunny.
- People shop less when it rains than when it is sunny.

At first, we wanted to count the number of people walking in a street with shops and restaurants. The problem was the number of data points, which would have been too large to record. We therefore restricted ourselves to specific spots in that street by observing individual shops.

To do this research, we had to pick a place in Eindhoven with shops and restaurants, so we chose the city centre. Specifically, we chose the Demerstraat, Eindhoven, where many clothing stores and restaurants are located.

We also had to choose well-frequented shops. We chose one clothing store and one restaurant or fast-food outlet, both popular.

We could have chosen Levi's, but its prices are very high, so we decided on "The Athlete's Foot", located at Demer 18B, 5611 AR, Eindhoven.

To collect data in the afternoon, we chose a restaurant where people can buy food to eat on the go. We therefore picked Baker's Bart, located at Demer 21, 5611 AN, Eindhoven.

These two locations have another advantage: both stores can be observed from the same spot, so we could communicate while collecting data.

From this point, we chose the simple random sampling method. We recorded people entering the shops without selecting on age, sex or nationality.

We also had to choose a time when people are shopping, so we picked an afternoon slot between 3:40 PM and 4:30 PM. We also chose 2 days when we were available and when people were shopping: Monday and Thursday.

At this point, we knew what to do: count the people entering "The Athlete's Foot" and "Baker's Bart" on Monday and Thursday between 3:40 PM and 4:30 PM.

### Data sample

Alongside this notebook, you will find our [dataset as a CSV file](./fdata.csv). All timestamps are in UTC.

We extracted our data from our dataset using the following code:

In [1]:
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class ShopType(Enum):
    Clothing = 0
    Fastfood = 1

class DataTaker(Enum):
    Alexandre = 0
    Arthur = 1

class Weather(Enum):
    Sunny = 0
    Rainy = 1

@dataclass
class DataPoint:
    name: DataTaker
    shop: ShopType
    rain: Weather
    time: datetime

In [3]:
import csv

# One DataPoint per recorded shop entry.
data: list[DataPoint] = []

# Columns read below: data taker, shop type, weather, timestamp.
with open("fdata.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        data.append(DataPoint(
            name=DataTaker.Alexandre if row[0] == "Alexandre" else DataTaker.Arthur,
            shop=ShopType.Clothing if row[1] == "clothing" else ShopType.Fastfood,
            rain=Weather.Sunny if row[2] == "sunny" else Weather.Rainy,
            time=datetime.strptime(row[3], '%Y-%m-%d %H:%M:%S')
        ))
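The conditional expressions above map any unexpected label to the second option, so a misspelled weather value would be counted as rainy. A minimal sketch of a stricter parser follows. It assumes the CSV labels match the enum member names up to capitalisation (only `Alexandre`, `clothing` and `sunny` are confirmed by the cell above); the `parse_row` helper, the header names and the sample rows are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class ShopType(Enum):
    Clothing = 0
    Fastfood = 1


class DataTaker(Enum):
    Alexandre = 0
    Arthur = 1


class Weather(Enum):
    Sunny = 0
    Rainy = 1


@dataclass
class DataPoint:
    name: DataTaker
    shop: ShopType
    rain: Weather
    time: datetime


def parse_row(row: list[str]) -> DataPoint:
    """Build a DataPoint, raising KeyError on an unknown label."""
    return DataPoint(
        name=DataTaker[row[0]],
        shop=ShopType[row[1].capitalize()],
        rain=Weather[row[2].capitalize()],
        time=datetime.strptime(row[3], "%Y-%m-%d %H:%M:%S"),
    )


# Hypothetical rows, not taken from fdata.csv.
sample = io.StringIO(
    "name,shop,weather,time\n"
    "Arthur,clothing,sunny,2022-09-12 13:41:07\n"
    "Alexandre,fastfood,rainy,2022-09-15 13:52:30\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
points = [parse_row(row) for row in reader]
print(points[0].shop, points[1].rain)
```

Looking members up by name makes a typo in the file fail loudly instead of silently ending up in the wrong group.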

Now that we have the data, we can count the number of entries in each of the four groups:

- The person entered "The Athlete's Foot" on a sunny day.
- The person entered "The Athlete's Foot" on a rainy day.
- The person entered "Baker's Bart" on a sunny day.
- The person entered "Baker's Bart" on a rainy day.

In [4]:
athlete_sunny = sum(p.shop == ShopType.Clothing and p.rain == Weather.Sunny for p in data)
athlete_raining = sum(p.shop == ShopType.Clothing and p.rain == Weather.Rainy for p in data)
baker_sunny = sum(p.shop == ShopType.Fastfood and p.rain == Weather.Sunny for p in data)
baker_raining = sum(p.shop == ShopType.Fastfood and p.rain == Weather.Rainy for p in data)

print(f"{athlete_sunny=}")
print(f"{athlete_raining=}")
print(f"{baker_sunny=}")
print(f"{baker_raining=}")

athlete_sunny=103
athlete_raining=73
baker_sunny=54
baker_raining=40
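The four counts can also be obtained in a single pass over the data. A minimal sketch using `collections.Counter`, with a handful of hypothetical `(shop, weather)` pairs standing in for the real `data` list:

```python
from collections import Counter
from enum import Enum


class ShopType(Enum):
    Clothing = 0
    Fastfood = 1


class Weather(Enum):
    Sunny = 0
    Rainy = 1


# Hypothetical observations; in the notebook these would be
# (p.shop, p.rain) for each DataPoint p in data.
observations = [
    (ShopType.Clothing, Weather.Sunny),
    (ShopType.Clothing, Weather.Sunny),
    (ShopType.Clothing, Weather.Rainy),
    (ShopType.Fastfood, Weather.Sunny),
    (ShopType.Fastfood, Weather.Rainy),
]

# Counter maps each (shop, weather) pair to its number of occurrences.
counts = Counter(observations)

for (shop, weather), n in counts.items():
    print(f"{shop.name:<8} {weather.name:<5} {n}")
```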


Now that we have the count for each individual group, we can compute the totals.

In [5]:
total_sunny = athlete_sunny + baker_sunny
total_raining = athlete_raining + baker_raining
total_athlete = athlete_raining + athlete_sunny
total_baker = baker_raining + baker_sunny
total = total_baker + total_athlete

print(f"{total_sunny=}")
print(f"{total_raining=}")
print(f"{total_athlete=}")
print(f"{total_baker=}")
print(f"{total=}")

total_sunny=157
total_raining=113
total_athlete=176
total_baker=94
total=270


With the values we gathered, we can create the following table with our sample of 270 people:

|Shop|Sunny|Raining|Total|
|:----------------:|:---:|:-----:|:---:|
|The Athlete's Foot|103|73|176|
|Baker's Bart|54|40|94|
|Total|157|113|270|

From these totals, we can estimate the probability that an observed shop entry happened on a sunny day or on a rainy day:

In [6]:
print(f"Probability that someone goes shopping on a sunny day: {total_sunny/total:.3f}")
print(f"Probability that someone goes shopping on a rainy day: {total_raining/total:.3f}")

Probability that someone goes shopping on a sunny day: 0.581
Probability that someone goes shopping on a rainy day: 0.419


At this stage, our data does not let us compute a mean, a median or a standard deviation, because each data point is a single shop entry. We therefore split our data into 5-minute intervals, which gives us the number of people entering the shops per 5 minutes in sunny and in rainy weather, and from there a mean and a standard deviation.
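The 5-minute grouping can be sketched as follows. The notebook does not show the grouping code, so this is only one possible approach: the `interval_start` helper and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def interval_start(t: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to the start of its interval."""
    return t.replace(minute=t.minute - t.minute % minutes, second=0, microsecond=0)


# Hypothetical entry times, not taken from fdata.csv.
times = [
    datetime(2022, 9, 12, 15, 41, 10),
    datetime(2022, 9, 12, 15, 44, 59),
    datetime(2022, 9, 12, 15, 45, 0),
    datetime(2022, 9, 12, 15, 52, 30),
]

# Number of entries per 5-minute interval.
per_interval = Counter(interval_start(t) for t in times)

for start in sorted(per_interval):
    print(start.strftime("%H:%M"), per_interval[start])
```

Because the rounded timestamp keeps its date, entries from different days fall into different intervals.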

To compare and test our basic hypotheses, we can rewrite them as follows:

- Mean Sunny Shopping - Mean Rainy Shopping = 0
- Mean Sunny Shopping - Mean Rainy Shopping > 0
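Written formally, with $\mu_{\text{sunny}}$ and $\mu_{\text{rainy}}$ denoting the mean number of people entering the shops per 5-minute interval in each weather condition (these symbols are notation introduced for this sketch):

$$
H_0: \mu_{\text{sunny}} - \mu_{\text{rainy}} = 0
\qquad
H_A: \mu_{\text{sunny}} - \mu_{\text{rainy}} > 0
$$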

We have 10 intervals for each weather condition, so 20 groups in total.

|Time interval|Sunny|Rainy|Total|
|:--------:|:---:|:---:|:---:|
|3:40PM to 3:45PM|10|6|16|
|3:45PM to 3:50PM|25|14|39|
|3:50PM to 3:55PM|18|19|37|
|3:55PM to 4:00PM|20|14|34|
|4:00PM to 4:05PM|23|19|42|
|4:05PM to 4:10PM|10|7|17|
|4:10PM to 4:15PM|16|9|25|
|4:15PM to 4:20PM|21|13|34|
|4:20PM to 4:25PM|12|12|24|
|4:25PM to 4:30PM|2|0|2|
|Total|157|113|270|
|Mean number of people per interval|15.7|11.3|27|
|Standard deviation|7.134|5.922||
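The mean and standard deviation rows can be recomputed from the per-interval counts in the table. A minimal sketch using Python's `statistics` module, assuming the sample standard deviation (dividing by n - 1) is the intended one:

```python
from statistics import mean, stdev

# Per-interval counts copied from the table above (3:40PM to 4:30PM).
sunny = [10, 25, 18, 20, 23, 10, 16, 21, 12, 2]
rainy = [6, 14, 19, 14, 19, 7, 9, 13, 12, 0]

# stdev() is the sample standard deviation (n - 1 in the denominator).
for label, counts in (("sunny", sunny), ("rainy", rainy)):
    print(f"{label}: n={len(counts)} mean={mean(counts):.1f} sd={stdev(counts):.3f}")
```

The printed values can be compared with the last two rows of the table.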

To compare our two means, we can run a two-sample t-test on the difference of means.

In [7]:
sunny_shopping_mean = 15.7
rainy_shopping_mean = 11.3

print(f"{sunny_shopping_mean - rainy_shopping_mean = :.2f}")

sunny_shopping_mean - rainy_shopping_mean = 4.40


Using this value, we build a 95% confidence interval for the difference of means.

The interval is two-sided (the mean difference plus or minus a margin), which corresponds to a z-value of 1.96 at the 95% level.

In [8]:
from math import sqrt


mean_difference = sunny_shopping_mean - rainy_shopping_mean

z_value = 1.96
interval_value = z_value * sqrt((pow(7.134, 2) / total_sunny) + (pow(5.922, 2) / total_raining))

print(f"({mean_difference - interval_value:.3f}, {mean_difference + interval_value:.3f})")

(2.839, 5.961)


Our confidence interval is therefore (2.839, 5.961). It does not contain 0, which already suggests a difference between the two means.
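The z-value of 1.96 and the interval itself can be derived rather than hard-coded. A minimal sketch using `statistics.NormalDist` from the standard library: 1.96 is the 97.5% quantile of the standard normal distribution, which is the critical value of a two-sided 95% interval. The `mean_diff_ci` helper is hypothetical, and its inputs are the means, standard deviations and sample sizes used above.

```python
from math import sqrt
from statistics import NormalDist


def mean_diff_ci(mean_a, mean_b, sd_a, sd_b, n_a, n_b, level=0.95):
    """Two-sided normal-approximation interval for mean_a - mean_b."""
    # Critical value: the 97.5% quantile of N(0, 1) when level=0.95.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    diff = mean_a - mean_b
    return diff - margin, diff + margin


low, high = mean_diff_ci(15.7, 11.3, 7.134, 5.922, 157, 113)
print(f"({low:.3f}, {high:.3f})")
```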

Using Welch's t-test as our two-sample t-test, we can find the t-value of our test and the degrees of freedom of our study.

In [11]:
## Welch's t-test
##
## Welch's t-test t-value formula: (mA - mB) / sqrt((standard deviation of A^2 / size of sample A) + (standard deviation of B^2 / size of sample B))

t_value = mean_difference / sqrt((pow(7.134, 2) / total_sunny) + (pow(5.922, 2) / total_raining))
f_degree = pow((pow(7.134, 2) / total_sunny) + (pow(5.922, 2) / total_raining), 2) / ((pow(7.134, 4) / (pow(total_sunny, 2) * (total_sunny - 1))) + (pow(5.922, 4) / (pow(total_raining, 2) * (total_raining - 1))))

print(f"Welch t-test t-value: {t_value:.3f}")
print(f"Welch t-test degrees of freedom: {f_degree:.0f}")

Welch t-test t-value: 5.524
Welch t-test degrees of freedom: 263


Our t-value is 5.524 with about 263 degrees of freedom. With that many degrees of freedom, the one-sided critical values of the t distribution are close to those of the normal distribution: about 1.65 for a p-value of 0.05 and about 3.3 for a p-value of 0.0005. Since 5.524 is far above both, our one-sided p-value is below 0.0005.

An [online calculator](https://www.socscistatistics.com/pvalues/tdistribution.aspx) for the one-sided t distribution can be used to confirm this.

We can then compare the p-value with our significance level, which is 0.05.

The p-value is lower than 0.05. In other words, we can conclude that the mean values of the groups "Shopping on Sunny day" and "Shopping on Rainy day" are significantly different.

Thus, we can reject our null hypothesis in favour of our alternative hypothesis.
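The t-value and degrees of freedom can be wrapped in a small reusable function. A minimal sketch of Welch's formulas using only the standard library: the `welch_t` helper is hypothetical, and the sample sizes passed below are the ones used in the cell above (the number of people counted in each weather condition). Passing different sizes, such as the number of 5-minute intervals per group, would change both results.

```python
from math import sqrt


def welch_t(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    var_a = sd_a**2 / n_a  # squared standard error of group A
    var_b = sd_b**2 / n_b  # squared standard error of group B
    t = (mean_a - mean_b) / sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1))
    return t, df


t_value, f_degree = welch_t(15.7, 11.3, 7.134, 5.922, 157, 113)
print(f"t = {t_value:.3f}, df = {f_degree:.1f}")
```

Keeping the sample sizes as explicit arguments makes it easy to see how strongly the outcome depends on what is treated as one observation.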

## Conclusion

Our research showed that, in Eindhoven on Mondays and Thursdays between 3:40 PM and 4:30 PM, fewer people enter the shops we observed when it rains than when it is sunny.

## Discussion

From the results of this research project, we can say that the statistics and our test confirmed what we expected.

However, we could have taken another approach to this study.

We could have collected data on more than 2 days and taken a broader approach to the problem.

Counting only people entering shops is somewhat restrictive for the problem we are trying to analyse. We could have gathered far more data points by counting people walking through the shopping street. Unfortunately, this method could introduce bias, such as counting people who are just passing by and not going into shops.

This could have given us far more data points, more samples, better standard deviations and a much wider choice of statistical tests.

Also, having only 2 samples forced us to use a test that is hard to apply, even though Welch's t-test is the most appropriate test we could use for the data we had.