# Stage 1 - Generate a test dataset

## Step 1 - Choose 5 cities

According to the SMOOVE website, before 2020 SMOOVE was already running bike-sharing services in 8 cities in France. This project ignores electric bikes and long-term rental bikes and focuses only on classic self-service bikes (vélos en libre-service). For convenience, we call them "bikes" in the rest of this project. The number of bikes and stations in each city is available on the website.

We need to pick 5 of these 8 cities. Three are excluded: Paris has so many bikes that it would break the balance of the dataset; Vannes has only 50 electric bikes; and Strasbourg has far more long-term rental bikes (9000) than classic bikes (200).

The 5 remaining cities are: Montpellier, Clermont-Ferrand, Saint-Etienne, Avignon and Belfort.

## Step 2 - Define the structure of this dataset

5*366=1830 Rows: One row per day and per city (2020 is a leap year, so it has 366 days).

9 Columns: Date, City, Number of available bikes, Number of stations, Number of users who have used bikes on the day, Number of trips, Total travelled distance(km), Total travelled time(hours), Number of new users.

So the shape of this dataset will be (1830, 9)
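As a quick check of this arithmetic, here is a minimal sketch that derives the expected shape from the year and the number of cities. The constant names are only illustrative and do not appear elsewhere in the notebook.

```python
import calendar

YEAR = 2020
N_CITIES = 5
N_COLUMNS = 9

# 2020 is a leap year, hence 366 days rather than 365
n_days = 366 if calendar.isleap(YEAR) else 365
expected_shape = (n_days * N_CITIES, N_COLUMNS)
print(expected_shape)  # (1830, 9)
```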

## Step 3 - Generate the dataset

### Step 3.1 - Date

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

In [2]:
# Create a list containing all dates of 2020

dates = [datetime.strftime(x,'%Y-%m-%d') for x in list(pd.date_range(start = '2020-01-01', end = '2020-12-31'))]

In [3]:
# Create a DataFrame repeating each date 5 times

dataset = pd.DataFrame(np.repeat(dates, 5, axis = 0), columns = ['Date'])

In [4]:
dataset.head(10)

Unnamed: 0,Date
0,2020-01-01
1,2020-01-01
2,2020-01-01
3,2020-01-01
4,2020-01-01
5,2020-01-02
6,2020-01-02
7,2020-01-02
8,2020-01-02
9,2020-01-02


In [5]:
dataset.shape

(1830, 1)

### Step 3.2 - City

In [6]:
# List of 5 cities

cities = ['Montpellier', 'Clermont-Ferrand', 'Saint Etienne', 'Avignon', 'Belfort']

In [7]:
# Add the City column: the 5 cities appear in the same order on each of the 366 days

dataset.insert(dataset.shape[1], 'City', np.tile(cities, 366))

In [8]:
dataset.head(10)

Unnamed: 0,Date,City
0,2020-01-01,Montpellier
1,2020-01-01,Clermont-Ferrand
2,2020-01-01,Saint Etienne
3,2020-01-01,Avignon
4,2020-01-01,Belfort
5,2020-01-02,Montpellier
6,2020-01-02,Clermont-Ferrand
7,2020-01-02,Saint Etienne
8,2020-01-02,Avignon
9,2020-01-02,Belfort


In [9]:
dataset.shape

(1830, 2)
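The Date and City columns together form the Cartesian product of days and cities. As an alternative sketch (not the approach used in this notebook), the same two columns could be built in one step with `pd.MultiIndex.from_product`. The name `skeleton` is hypothetical.

```python
import pandas as pd

cities = ['Montpellier', 'Clermont-Ferrand', 'Saint Etienne', 'Avignon', 'Belfort']
dates = pd.date_range(start='2020-01-01', end='2020-12-31').strftime('%Y-%m-%d')

# Cartesian product: dates vary slowest, cities vary fastest
index = pd.MultiIndex.from_product([dates, cities], names=['Date', 'City'])
skeleton = index.to_frame(index=False)

print(skeleton.shape)
print(skeleton.head(6))
```

This avoids having to keep `np.repeat` and `np.tile` in sync by hand, at the cost of one extra intermediate object.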

### Step 3.3 - Number of available bikes

In this scenario I operate a fleet of 1000 bikes across 5 cities. Starting from the bike counts shown on the SMOOVE website, the fleet is split among the cities by proportional scaling, so that each city keeps roughly its share of the total.

In [10]:
# Amounts from website

amounts_SMOOVE = [548, 632, 364, 300, 250]

In [11]:
# Estimated amounts

s_ = sum(amounts_SMOOVE)
amounts = [round(x/s_*1000) for x in amounts_SMOOVE]

In [12]:
# Check that the amounts sum to 1000

print("The total amount of bikes in 5 cities is", sum(amounts))

The total amount of bikes in 5 cities is 1000
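Rounding each share independently happens to give exactly 1000 here, but that is not guaranteed in general (three equal shares of 100 would round to 33 + 33 + 33 = 99). A minimal sketch of a largest-remainder allocation, which always hits the target, is shown below. The function `scale_to_total` and the name `amounts_check` are hypothetical helpers and are not part of the notebook.

```python
def scale_to_total(counts, total):
    """Scale integer counts proportionally so that the results sum exactly to total."""
    s = sum(counts)
    # Integer floor of each proportional share, and the remainder it leaves
    floors = [c * total // s for c in counts]
    remainders = [c * total % s for c in counts]
    leftover = total - sum(floors)
    # Hand the leftover units to the entries with the largest remainders
    order = sorted(range(len(counts)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return floors


amounts_SMOOVE = [548, 632, 364, 300, 250]
amounts_check = scale_to_total(amounts_SMOOVE, 1000)
print(amounts_check, sum(amounts_check))  # [262, 302, 174, 143, 119] 1000
```

For the SMOOVE figures the result is the same as with plain rounding, so the rest of the notebook is unaffected.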


In [13]:
# Store the bike amount of each city in a DataFrame

bike_city = pd.DataFrame({'City': pd.Series(cities), 'Total_bike_amount': pd.Series(amounts)})

Some bikes are unavailable at any given time: they may be broken, stolen, or left outside the service area, for example. We have experienced teams that handle these problems promptly, so that at most 5% of the bikes are unavailable on any day.

In [14]:
# Add the limit of unavailable bikes for each city

bike_city.insert(bike_city.shape[1], 'Max_unavailable_bike', [round(x*0.05) for x in amounts])

In [15]:
# Take a look at the resulting bike amounts

bike_city

Unnamed: 0,City,Total_bike_amount,Max_unavailable_bike
0,Montpellier,262,13
1,Clermont-Ferrand,302,15
2,Saint Etienne,174,9
3,Avignon,143,7
4,Belfort,119,6


We assume that, for each city and each day, the number of unavailable bikes is a random variable that follows a discrete uniform distribution on the integers from 0 to the city's maximum number of unavailable bikes.

Based on these assumptions, we can now add the column "Number of available bikes" to our dataset.

In [16]:
# Repeat bike_city 366 times, once per day of 2020

bike_city_new = pd.concat([bike_city] * 366)

In [17]:
bike_city_new.index = dataset.index

In [18]:
# Calculate the number of available bikes for each city and each day.
# np.random.randint excludes its upper bound, hence the +1.

Nb_available_bike = bike_city_new.Total_bike_amount - np.random.randint(0, bike_city_new.Max_unavailable_bike + 1)
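The draw above changes on every run because no seed is set. As a hedged sketch, the same uniform draw can be written with NumPy's `Generator` API, which accepts a seed and an inclusive upper bound. The seed value and the variable names below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2020)  # arbitrary seed, chosen only for this sketch

# Per-city totals and limits from the bike_city table, repeated for 366 days
total_bikes = np.tile([262, 302, 174, 143, 119], 366)
max_unavailable = np.tile([13, 15, 9, 7, 6], 366)

# endpoint=True makes the upper bound inclusive: 0, 1, ..., max_unavailable
unavailable = rng.integers(0, max_unavailable, endpoint=True)
available = total_bikes - unavailable

print(available[:10])
```

With a fixed seed the generated dataset becomes reproducible, which helps when the notebook is rerun.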

In [19]:
# Add this column into our dataset

dataset.insert(dataset.shape[1], 'Nb_available_bike', Nb_available_bike)

In [20]:
dataset.head(10)

Unnamed: 0,Date,City,Nb_available_bike
0,2020-01-01,Montpellier,256
1,2020-01-01,Clermont-Ferrand,301
2,2020-01-01,Saint Etienne,172
3,2020-01-01,Avignon,141
4,2020-01-01,Belfort,113
5,2020-01-02,Montpellier,250
6,2020-01-02,Clermont-Ferrand,294
7,2020-01-02,Saint Etienne,165
8,2020-01-02,Avignon,137
9,2020-01-02,Belfort,115


In [21]:
dataset.shape

(1830, 3)

### Step 3.4 - Number of stations

After observing the data from SMOOVE, I found that the number of stations is roughly equal to, or slightly larger than, 10% of the number of bikes.

So I decided to set 110 stations for these 1000 bikes, and again used proportional scaling to distribute the stations among the 5 cities.

In [22]:
# Data from SMOOVE

stations_SMOOVE = [54, 52, 38, 30, 29]

In [23]:
# Data we will use

s_1 = sum(stations_SMOOVE)
stations = [round(x/s_1*110) for x in stations_SMOOVE]

In [24]:
# Check that the stations sum to 110

print("The total amount of stations in 5 cities is", sum(stations))

The total amount of stations in 5 cities is 110


In [25]:
# Store the station amount of each city in a DataFrame

station_city = pd.DataFrame({'City': pd.Series(cities), 'Station_amount': pd.Series(stations)})

In [26]:
# Take a look at the resulting station amounts

station_city

Unnamed: 0,City,Station_amount
0,Montpellier,29
1,Clermont-Ferrand,28
2,Saint Etienne,21
3,Avignon,16
4,Belfort,16


We assume that the number of stations in these 5 cities does not change during 2020.

In [27]:
# Add "Nb_station" column into our dataset

dataset.insert(dataset.shape[1], 'Nb_station', np.tile(stations, 366))

In [28]:
dataset.head(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station
0,2020-01-01,Montpellier,256,29
1,2020-01-01,Clermont-Ferrand,301,28
2,2020-01-01,Saint Etienne,172,21
3,2020-01-01,Avignon,141,16
4,2020-01-01,Belfort,113,16
5,2020-01-02,Montpellier,250,29
6,2020-01-02,Clermont-Ferrand,294,28
7,2020-01-02,Saint Etienne,165,21
8,2020-01-02,Avignon,137,16
9,2020-01-02,Belfort,115,16


In [29]:
dataset.shape

(1830, 4)

### Step 3.5 - Number of users who have used bikes on the day

This is the most important part: the following 3 columns all depend on this one.

I propose a mathematical model that estimates the number of users on a single day from 5 factors: the population, the congestion level, the density of stations, people's enthusiasm for fitness through cycling, and the popularity level of our bikes.

The model is reasonable from a marketing point of view.

A purchase happens only when these 4 steps all occur: someone wants to ride a bike --> they choose our bikes --> they reach one of our stations --> a bike is available there.

Or: someone wants to ride a bike --> they come across one of our stations by coincidence --> they find our bikes acceptable --> a bike is available there.

Swapping the 2nd and 3rd steps does not change the model, because multiplication is commutative.

I assume that the demand to ride a bike has 2 sources: transportation needs, driven by the congestion level, and fitness needs, related to air quality. The popularity level represents the probability that a user chooses our bikes, or accepts them after arriving at a station by coincidence. The density of stations represents how easy it is for users to reach our service, or the probability that a user comes across one of our stations by coincidence. Finally, the average number of available bikes per station is used to evaluate the probability that a bike is available.

Applying the multiplication principle and the addition principle, we conclude that:

###### Number of users = r * Population

###### r = (a_1 * Congestion level + a_2 * Fitness enthusiasm level) * a_3 * Popularity level * a_4 * Density of stations * a_5 * Bike per station + error

a_1, a_2, a_3, a_4 and a_5 are parameters we need to choose. The 5 independent variables can be estimated with the help of data from the Internet.

This model does not reflect every aspect of real cases, for example the influence of climate, seasons and Covid-19, but it is reliable enough to generate the dataset, and it is easy to extend if a new factor is needed.

#### Step 3.5.1 - Population

From the INSEE website we can get the total population of the 5 cities in 2019. We use these figures as the 2020 population and ignore daily changes.

In [30]:
population = [298933, 150596, 175792, 92821, 47242]

#### Step 3.5.2 - Congestion level

From the traffic index on the TomTom website we can get the 2021 congestion level of the 5 cities and its change from 2020. Subtracting the change from the 2021 value gives the 2020 level. We assume that the worse the traffic is, the more likely citizens are to travel by bike.

In [31]:
congestions_2021 = pd.DataFrame({'2021':[0.27, 0.22, 0.19, 0.19, 0.1], 'Change_from_2020':[0.03, 0.01, 0.02, 0.02, 0]})

In [32]:
congestions = congestions_2021['2021'] - congestions_2021.Change_from_2020

In [33]:
congestions

0    0.24
1    0.21
2    0.17
3    0.17
4    0.10
dtype: float64

#### Step 3.5.3 - Fitness enthusiasm level

We assume that the better the air quality is, the more likely citizens are to exercise outdoors. I recorded a list of air pollution levels from "aqicn" and assume that fitness enthusiasm level = 1 / air pollution level.

In [34]:
aqi = [40, 37, 46, 40, 26]

#### Step 3.5.4 - Density of stations

I got the urban area sizes from Wikipedia (area in km²).

In [35]:
urban_size = [56.88, 42.67, 79.97, 64.78, 17.1]

Now we merge the results of this step into a DataFrame.

In [36]:
# Create a DataFrame

factors = station_city.copy()

# Add the results one by one

factors.insert(factors.shape[1], 'Population', population)
factors.insert(factors.shape[1], 'Congestion_level', congestions)
factors.insert(factors.shape[1], 'Air_pollution_level', aqi)
factors.insert(factors.shape[1], 'Urban_size', urban_size)

# Calculate the density of stations (stations per km²)

factors['Station_density'] = factors.Station_amount/factors.Urban_size

In [37]:
# Take a look at it.

factors

Unnamed: 0,City,Station_amount,Population,Congestion_level,Air_pollution_level,Urban_size,Station_density
0,Montpellier,29,298933,0.24,40,56.88,0.509845
1,Clermont-Ferrand,28,150596,0.21,37,42.67,0.656199
2,Saint Etienne,21,175792,0.17,46,79.97,0.262598
3,Avignon,16,92821,0.17,40,64.78,0.24699
4,Belfort,16,47242,0.1,26,17.1,0.935673


#### Step 3.5.5 - Popularity level

I trust the company was getting better day by day, but it would be unrealistic to make popularity strictly increasing. Instead, for every single day I draw a random value from a normal distribution whose mean increases monotonically while the variance stays the same.

In [38]:
popularity = []

for i in range(366):
    a = i*0.01 + 10
    x = np.random.normal(a,0.5)
    popularity.append(x)
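The loop above draws one value per day around a mean that rises by 0.01 each day. A vectorized sketch of the same idea follows. It assumes a seeded `Generator`, which the notebook does not use, and the names `expected_popularity` and `popularity_sketch` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

days = np.arange(366)
# Daily mean rises linearly from 10.00 on day 0 to 13.65 on day 365
expected_popularity = 10 + 0.01 * days
# One draw per day, all with the same standard deviation of 0.5
popularity_sketch = rng.normal(loc=expected_popularity, scale=0.5)

print(popularity_sketch[:5])
```

Passing an array as `loc` gives each day its own mean in a single call, so no explicit loop is needed.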

In [39]:
# Look at the popularity I've created

popularity

[10.665276752146639,
 10.252563301199247,
 10.618533813345124,
 10.364587268262452,
 9.26125976600565,
 9.236244608282359,
 9.721377549124158,
 9.687679787136204,
 9.60522085934673,
 10.052761343466313,
 11.140805203669698,
 10.154000996662393,
 10.006387589263761,
 9.724284669916766,
 9.458631496089874,
 10.874269378214306,
 10.172362549373277,
 9.294051589104699,
 9.625066045566369,
 10.547501561379688,
 9.63055446247735,
 10.063413901014247,
 10.943834063275697,
 10.233632757057737,
 9.74339771618335,
 9.605878205571372,
 9.717657325513024,
 10.320321341667656,
 10.05687410783063,
 10.257425162454489,
 10.173209283971477,
 9.892972780681925,
 11.0826315623375,
 10.468192370498528,
 9.911707517868122,
 10.669553134890473,
 9.67917625408644,
 10.64015822867607,
 10.65235571951016,
 9.977116899233854,
 9.647883895266203,
 10.380619195854429,
 10.790594862573178,
 10.312568343804092,
 10.566904289972555,
 9.93879464371062,
 11.165409228106848,
 10.90654209235491,
 10.524403722829451,
 1

In [40]:
# Repeat the per-city factors 366 times, once per day of 2020
factors_final = pd.concat([factors] * 366)

In [41]:
factors_final.insert(factors_final.shape[1], 'Popularity_level', np.repeat(popularity, 5))
factors_final.index = dataset.index
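Repeating the per-city table and re-indexing works because both tables list the cities in the same order on every day. A sketch of an alternative that does not depend on row order is a left join on the City column. The three-day frame `rows` and the names `per_city` and `merged` are hypothetical and only illustrate the idea.

```python
import pandas as pd

cities = ['Montpellier', 'Clermont-Ferrand', 'Saint Etienne', 'Avignon', 'Belfort']
dates = pd.date_range('2020-01-01', '2020-01-03').strftime('%Y-%m-%d')
rows = pd.MultiIndex.from_product([dates, cities], names=['Date', 'City']).to_frame(index=False)

per_city = pd.DataFrame({'City': cities,
                         'Population': [298933, 150596, 175792, 92821, 47242]})

# A left join keeps the row order of `rows` and looks each city up by name
merged = rows.merge(per_city, on='City', how='left')

print(merged.head(6))
```

With a join, a city that is missing or misspelled shows up as a NaN instead of silently shifting values to the wrong rows.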

#### Step 3.5.6 - Bike per station

In [42]:
bike_per_station = dataset.Nb_available_bike / dataset.Nb_station

In [43]:
factors_final.insert(factors.shape[1], 'Bike_per_station', bike_per_station)

In [44]:
factors_final.head(10)

Unnamed: 0,City,Station_amount,Population,Congestion_level,Air_pollution_level,Urban_size,Station_density,Bike_per_station,Popularity_level
0,Montpellier,29,298933,0.24,40,56.88,0.509845,8.827586,10.665277
1,Clermont-Ferrand,28,150596,0.21,37,42.67,0.656199,10.75,10.665277
2,Saint Etienne,21,175792,0.17,46,79.97,0.262598,8.190476,10.665277
3,Avignon,16,92821,0.17,40,64.78,0.24699,8.8125,10.665277
4,Belfort,16,47242,0.1,26,17.1,0.935673,7.0625,10.665277
5,Montpellier,29,298933,0.24,40,56.88,0.509845,8.62069,10.252563
6,Clermont-Ferrand,28,150596,0.21,37,42.67,0.656199,10.5,10.252563
7,Saint Etienne,21,175792,0.17,46,79.97,0.262598,7.857143,10.252563
8,Avignon,16,92821,0.17,40,64.78,0.24699,8.5625,10.252563
9,Belfort,16,47242,0.1,26,17.1,0.935673,7.1875,10.252563


In [45]:
factors_final.tail(10)

Unnamed: 0,City,Station_amount,Population,Congestion_level,Air_pollution_level,Urban_size,Station_density,Bike_per_station,Popularity_level
1820,Montpellier,29,298933,0.24,40,56.88,0.509845,8.758621,13.063274
1821,Clermont-Ferrand,28,150596,0.21,37,42.67,0.656199,10.75,13.063274
1822,Saint Etienne,21,175792,0.17,46,79.97,0.262598,8.142857,13.063274
1823,Avignon,16,92821,0.17,40,64.78,0.24699,8.625,13.063274
1824,Belfort,16,47242,0.1,26,17.1,0.935673,7.25,13.063274
1825,Montpellier,29,298933,0.24,40,56.88,0.509845,8.689655,13.252084
1826,Clermont-Ferrand,28,150596,0.21,37,42.67,0.656199,10.714286,13.252084
1827,Saint Etienne,21,175792,0.17,46,79.97,0.262598,7.952381,13.252084
1828,Avignon,16,92821,0.17,40,64.78,0.24699,8.8125,13.252084
1829,Belfort,16,47242,0.1,26,17.1,0.935673,7.3125,13.252084


The DataFrame factors_final now contains everything we need to estimate the daily number of users, except the 5 parameters.

In [46]:
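# Define a function implementing the mathematical model that estimates the number of users.
# a_3, a_4 and a_5 only enter through their product, so which factor each one scales does not change the result.
# No explicit error term is added: the randomness comes from the popularity level and the available bikes.

def Nb_users(a_1, a_2, a_3, a_4, a_5):
    demand = a_1*factors_final.Congestion_level + a_2/factors_final.Air_pollution_level
    r = demand * a_3*factors_final.Station_density * a_4*factors_final.Popularity_level * a_5*factors_final.Bike_per_station
    return (factors_final.Population * r).round().astype(int)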

After a series of tests I set these parameters to [0.1, 0.5, 0.5, 0.04, 0.05]
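To make the model concrete, here is a small worked example for a single row: Montpellier on 2020-01-01, using the factor values shown in the tables above. The scalar function `daily_users` is a hypothetical helper written for this check. It mirrors the formula but is not part of the notebook.

```python
def daily_users(population, congestion, air_pollution, station_density,
                popularity, bike_per_station,
                a_1=0.1, a_2=0.5, a_3=0.5, a_4=0.04, a_5=0.05):
    # Demand: transportation needs plus fitness needs (1 / air pollution level)
    demand = a_1 * congestion + a_2 / air_pollution
    # Chain of probabilities: reach a station, accept our bikes, find a bike
    r = demand * a_3 * station_density * a_4 * popularity * a_5 * bike_per_station
    return round(population * r)


users = daily_users(population=298933,
                    congestion=0.24,
                    air_pollution=40,
                    station_density=29 / 56.88,
                    popularity=10.665276752146639,
                    bike_per_station=256 / 29)
print(users)  # 524
```

The ratio r is about 0.00175 here, meaning that roughly 0.18% of the population uses a bike on that day.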

In [47]:
dataset.insert(dataset.shape[1], 'Nb_users', Nb_users(0.1, 0.5, 0.5, 0.04, 0.05))

In [48]:
dataset.head(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users
0,2020-01-01,Montpellier,256,29,524
1,2020-01-01,Clermont-Ferrand,301,28,391
2,2020-01-01,Saint Etienne,172,21,112
3,2020-01-01,Avignon,141,16,64
4,2020-01-01,Belfort,113,16,97
5,2020-01-02,Montpellier,250,29,492
6,2020-01-02,Clermont-Ferrand,294,28,367
7,2020-01-02,Saint Etienne,165,21,104
8,2020-01-02,Avignon,137,16,59
9,2020-01-02,Belfort,115,16,95


In [49]:
dataset.tail(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users
1820,2020-12-30,Montpellier,254,29,636
1821,2020-12-30,Clermont-Ferrand,301,28,479
1822,2020-12-30,Saint Etienne,171,21,137
1823,2020-12-30,Avignon,138,16,76
1824,2020-12-30,Belfort,116,16,122
1825,2020-12-31,Montpellier,252,29,641
1826,2020-12-31,Clermont-Ferrand,300,28,484
1827,2020-12-31,Saint Etienne,167,21,136
1828,2020-12-31,Avignon,141,16,79
1829,2020-12-31,Belfort,117,16,125


In [50]:
dataset.shape

(1830, 5)

### Step 3.6 - Number of trips

Each user makes at least 1 trip but can also ride more than once, so the number of trips per user is an integer >= 1. The larger this number, the lower its probability, which suggests a geometric distribution. I use it to estimate the number of trips from the number of users.

In [51]:
# Define a function to transform the number of users into the number of trips.
# Assume that the number of trips of each user is a random variable following a geometric distribution with p=0.5

def user_to_trip(x):
    l = np.random.geometric(0.5, x)
    return sum(l)
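A geometric distribution with p = 0.5 has mean 1/p = 2, so the number of trips should come out close to twice the number of users. The sketch below checks this on a large hypothetical sample. The seed and the sample size of 10,000 users are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for this sketch

n_users = 10_000
# Support of the geometric distribution is 1, 2, 3, ... so every user rides at least once
trips_per_user = rng.geometric(0.5, size=n_users)
total_trips = int(trips_per_user.sum())

print(total_trips / n_users)  # close to 2
```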

In [52]:
# Apply this function to our dataset then insert it into dataset

nb_trips = dataset.Nb_users.apply(user_to_trip)
dataset.insert(dataset.shape[1], 'Nb_trips', nb_trips)

In [53]:
dataset.head(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users,Nb_trips
0,2020-01-01,Montpellier,256,29,524,1046
1,2020-01-01,Clermont-Ferrand,301,28,391,764
2,2020-01-01,Saint Etienne,172,21,112,235
3,2020-01-01,Avignon,141,16,64,146
4,2020-01-01,Belfort,113,16,97,180
5,2020-01-02,Montpellier,250,29,492,975
6,2020-01-02,Clermont-Ferrand,294,28,367,767
7,2020-01-02,Saint Etienne,165,21,104,202
8,2020-01-02,Avignon,137,16,59,114
9,2020-01-02,Belfort,115,16,95,171


In [54]:
dataset.tail(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users,Nb_trips
1820,2020-12-30,Montpellier,254,29,636,1297
1821,2020-12-30,Clermont-Ferrand,301,28,479,992
1822,2020-12-30,Saint Etienne,171,21,137,298
1823,2020-12-30,Avignon,138,16,76,155
1824,2020-12-30,Belfort,116,16,122,255
1825,2020-12-31,Montpellier,252,29,641,1300
1826,2020-12-31,Clermont-Ferrand,300,28,484,954
1827,2020-12-31,Saint Etienne,167,21,136,277
1828,2020-12-31,Avignon,141,16,79,172
1829,2020-12-31,Belfort,117,16,125,301


In [55]:
dataset.shape

(1830, 6)

### Step 3.7 - Total travelled distance

As the name of our company suggests, we want our users to reach their destinations within 15 minutes. The average speed of a cyclist is generally between 16 km/h and 20 km/h, which means 4 km to 5 km per trip. Users who ride for transportation are likely to travel less than that, while users who ride for exercise will ride farther. So I pick 4 km as the mean value and generate a random distance for each trip from a normal distribution.
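The 4 km to 5 km range is simply speed multiplied by time. A minimal sketch of that arithmetic, with illustrative constant names:

```python
TRIP_HOURS = 15 / 60          # 15-minute target per trip, in hours
SPEED_RANGE_KMH = (16, 20)    # typical cycling speeds quoted above

# distance = speed * time, for the lower and upper speed
distance_range_km = tuple(v * TRIP_HOURS for v in SPEED_RANGE_KMH)
print(distance_range_km)  # (4.0, 5.0)
```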

In [56]:
# Define a function to transform the number of trips into the total travelled distance.
# Assume that the distance of each trip is a random variable following a normal distribution with mean 4 km and standard deviation 1 km.

def trip_to_distance(x):
    l = np.random.normal(4, 1, x)
    return round(sum(l))

In [57]:
# Add Total_distance_km into our dataset

total_distance = dataset.Nb_trips.apply(trip_to_distance)
dataset.insert(dataset.shape[1], 'Total_distance_km', total_distance)

In [58]:
dataset.head(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users,Nb_trips,Total_distance_km
0,2020-01-01,Montpellier,256,29,524,1046,4128
1,2020-01-01,Clermont-Ferrand,301,28,391,764,3080
2,2020-01-01,Saint Etienne,172,21,112,235,925
3,2020-01-01,Avignon,141,16,64,146,583
4,2020-01-01,Belfort,113,16,97,180,720
5,2020-01-02,Montpellier,250,29,492,975,3868
6,2020-01-02,Clermont-Ferrand,294,28,367,767,3083
7,2020-01-02,Saint Etienne,165,21,104,202,826
8,2020-01-02,Avignon,137,16,59,114,457
9,2020-01-02,Belfort,115,16,95,171,691


In [59]:
dataset.tail(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users,Nb_trips,Total_distance_km
1820,2020-12-30,Montpellier,254,29,636,1297,5200
1821,2020-12-30,Clermont-Ferrand,301,28,479,992,3986
1822,2020-12-30,Saint Etienne,171,21,137,298,1196
1823,2020-12-30,Avignon,138,16,76,155,616
1824,2020-12-30,Belfort,116,16,122,255,983
1825,2020-12-31,Montpellier,252,29,641,1300,5240
1826,2020-12-31,Clermont-Ferrand,300,28,484,954,3839
1827,2020-12-31,Saint Etienne,167,21,136,277,1121
1828,2020-12-31,Avignon,141,16,79,172,674
1829,2020-12-31,Belfort,117,16,125,301,1208


In [60]:
dataset.shape

(1830, 7)

### Step 3.8 - Total travelled time

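Similarly, the time of each trip follows a normal distribution with a mean of 0.25 hours (15 minutes) and a standard deviation of 0.25 hours. Individual draws can be negative with such a wide spread, but only the daily total is kept.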

In [61]:
# Define a function to transform the number of trips into the total travelled time.
# Assume that the time of each trip is a random variable following a normal distribution with mean 0.25 hours and standard deviation 0.25 hours.

def trip_to_time(x):
    l = np.random.normal(0.25, 0.25, x)
    return round(sum(l))
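If per-trip durations ever need to be strictly non-negative, one possible alternative is a gamma distribution with the same mean. This is a swapped-in technique, not the notebook's method, and the shape parameter of 4 is an assumption that gives a smaller spread (standard deviation 0.125 hours). The function `trip_to_time_gamma` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed for this sketch


def trip_to_time_gamma(n_trips, mean_hours=0.25, shape=4.0):
    # Gamma(shape, scale) has mean shape * scale, so scale = mean / shape
    return rng.gamma(shape, mean_hours / shape, size=n_trips)


durations = trip_to_time_gamma(1046)       # 1046 trips, as in the first row
total_hours = round(float(durations.sum()))
print(total_hours)
```

The daily total stays near 0.25 hours per trip, while every individual duration is positive.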

In [62]:
# Add Total_time_hour into our dataset

total_time = dataset.Nb_trips.apply(trip_to_time)
dataset.insert(dataset.shape[1], 'Total_time_hour', total_time)

In [63]:
dataset.head(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users,Nb_trips,Total_distance_km,Total_time_hour
0,2020-01-01,Montpellier,256,29,524,1046,4128,258
1,2020-01-01,Clermont-Ferrand,301,28,391,764,3080,183
2,2020-01-01,Saint Etienne,172,21,112,235,925,58
3,2020-01-01,Avignon,141,16,64,146,583,38
4,2020-01-01,Belfort,113,16,97,180,720,43
5,2020-01-02,Montpellier,250,29,492,975,3868,253
6,2020-01-02,Clermont-Ferrand,294,28,367,767,3083,185
7,2020-01-02,Saint Etienne,165,21,104,202,826,46
8,2020-01-02,Avignon,137,16,59,114,457,33
9,2020-01-02,Belfort,115,16,95,171,691,41


In [64]:
dataset.tail(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users,Nb_trips,Total_distance_km,Total_time_hour
1820,2020-12-30,Montpellier,254,29,636,1297,5200,316
1821,2020-12-30,Clermont-Ferrand,301,28,479,992,3986,243
1822,2020-12-30,Saint Etienne,171,21,137,298,1196,86
1823,2020-12-30,Avignon,138,16,76,155,616,38
1824,2020-12-30,Belfort,116,16,122,255,983,69
1825,2020-12-31,Montpellier,252,29,641,1300,5240,328
1826,2020-12-31,Clermont-Ferrand,300,28,484,954,3839,244
1827,2020-12-31,Saint Etienne,167,21,136,277,1121,68
1828,2020-12-31,Avignon,141,16,79,172,674,48
1829,2020-12-31,Belfort,117,16,125,301,1208,70


In [65]:
dataset.shape

(1830, 8)

### Step 3.9 - Number of new users

The number of new users is related to the population and the popularity.

In [66]:
# Define the ratio of the population that turns into new users. It depends on the day-to-day change of the popularity level.

new_user_ratio = (pd.Series(popularity).pct_change()+0.5)/2500

# Calculate the mean value of the number of new users

factors_final.insert(factors_final.shape[1], 'New_user_ratio', np.repeat(new_user_ratio.to_list(), 5))
factors_final['New_user'] = factors_final.Population*factors_final.New_user_ratio

In [67]:
# Again, use np.random.normal to add noise around the mean value (standard deviation 3).

nb_new_user = factors_final.New_user.fillna(0).map(lambda x: round(np.random.normal(x, 3)))
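To see how the ratio behaves, here is a small worked example on three hypothetical popularity values, using Montpellier's population. The first day has no previous day, so `pct_change` returns NaN there, which is why the notebook fills missing values with 0.

```python
import pandas as pd

# Hypothetical popularity levels for three consecutive days
popularity_demo = pd.Series([10.0, 10.5, 10.5])

# Day 2 rises by 5%, day 3 is unchanged
ratio = (popularity_demo.pct_change() + 0.5) / 2500
expected_new_users = 298933 * ratio

print(ratio.tolist())
print(expected_new_users.round(1).tolist())
```

A flat popularity level still yields about 60 new users per day in Montpellier (298933 * 0.5 / 2500), and a 5% rise lifts that to about 66.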

In [68]:
# Add this column into our dataset and set the data of the first day to 0.

dataset.insert(dataset.shape[1], 'Nb_new_users', nb_new_user)
dataset.iloc[:5,8] = 0

Now the generation of the dataset is done. Take a look at it before writing it to a CSV file.

In [69]:
dataset.describe()

Unnamed: 0,Nb_available_bike,Nb_station,Nb_users,Nb_trips,Total_distance_km,Total_time_hour,Nb_new_users
count,1830.0,1830.0,1830.0,1830.0,1830.0,1830.0,1830.0
mean,195.028962,22.0,261.144262,521.829508,2087.617486,130.510383,30.601093
std,68.607518,5.622924,205.450745,411.903439,1647.856137,103.081593,17.759015
min,113.0,16.0,53.0,96.0,381.0,21.0,-1.0
25%,137.0,16.0,101.0,198.0,787.25,48.25,17.0
50%,170.0,21.0,126.0,254.0,1020.0,64.0,29.0
75%,259.0,28.0,456.0,912.5,3638.75,227.0,39.0
max,302.0,29.0,696.0,1447.0,5772.0,364.0,85.0


In [70]:
dataset.shape

(1830, 9)

In [71]:
dataset.head(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users,Nb_trips,Total_distance_km,Total_time_hour,Nb_new_users
0,2020-01-01,Montpellier,256,29,524,1046,4128,258,0
1,2020-01-01,Clermont-Ferrand,301,28,391,764,3080,183,0
2,2020-01-01,Saint Etienne,172,21,112,235,925,58,0
3,2020-01-01,Avignon,141,16,64,146,583,38,0
4,2020-01-01,Belfort,113,16,97,180,720,43,0
5,2020-01-02,Montpellier,250,29,492,975,3868,253,52
6,2020-01-02,Clermont-Ferrand,294,28,367,767,3083,185,28
7,2020-01-02,Saint Etienne,165,21,104,202,826,46,34
8,2020-01-02,Avignon,137,16,59,114,457,33,20
9,2020-01-02,Belfort,115,16,95,171,691,41,11


In [72]:
dataset.tail(10)

Unnamed: 0,Date,City,Nb_available_bike,Nb_station,Nb_users,Nb_trips,Total_distance_km,Total_time_hour,Nb_new_users
1820,2020-12-30,Montpellier,254,29,636,1297,5200,316,47
1821,2020-12-30,Clermont-Ferrand,301,28,479,992,3986,243,30
1822,2020-12-30,Saint Etienne,171,21,137,298,1196,86,29
1823,2020-12-30,Avignon,138,16,76,155,616,38,24
1824,2020-12-30,Belfort,116,16,122,255,983,69,6
1825,2020-12-31,Montpellier,252,29,641,1300,5240,328,62
1826,2020-12-31,Clermont-Ferrand,300,28,484,954,3839,244,33
1827,2020-12-31,Saint Etienne,167,21,136,277,1121,68,34
1828,2020-12-31,Avignon,141,16,79,172,674,48,19
1829,2020-12-31,Belfort,117,16,125,301,1208,70,8


# Take the influence of Covid-19 into account

In [73]:
# Copy the dataset and convert the Date column to datetime

new_data = dataset.copy()
new_data.Date = pd.to_datetime(new_data.Date, format="%Y-%m-%d")

The first lockdown (le premier confinement) lasted from 2020-03-17 00:00 to 2020-05-10 23:59. During this period most people obeyed the rules and travel dropped dramatically. I assume that activity (users, trips, total travelled distance and time, new users) fell by about 95%. Rather than a fixed value, the remaining fraction is drawn from a normal distribution.

In [75]:
# Keep about 5% of the original value (mean 0.05, standard deviation 0.01)
def confinement_1(x):
    n = np.random.normal(0.05, 0.01)
    return round(x*n)

In [109]:
# Select the rows of the first lockdown into a temporary DataFrame

data_con_1 = new_data.loc[(new_data.Date>=datetime(2020,3,17,0,0)) & (new_data.Date<datetime(2020,5,11,0,0))]

In [110]:
data_con_1_1 = data_con_1.copy()

In [111]:
data_con_1_1['Nb_users'] = data_con_1['Nb_users'].apply(confinement_1)
data_con_1_1['Nb_trips'] = data_con_1['Nb_trips'].apply(confinement_1)
data_con_1_1['Total_distance_km'] = data_con_1['Total_distance_km'].apply(confinement_1)
data_con_1_1['Total_time_hour'] = data_con_1['Total_time_hour'].apply(confinement_1)
data_con_1_1['Nb_new_users'] = data_con_1['Nb_new_users'].apply(confinement_1)

In [114]:
new_data.loc[(new_data.Date>=datetime(2020,3,17,0,0)) & (new_data.Date<datetime(2020,5,11,0,0))] = data_con_1_1
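The cells above draw an independent factor for every cell, so users and trips on the same day are scaled by different amounts. As a hedged alternative sketch, one factor can be drawn per row and shared across columns, which keeps the columns consistent with each other. The five-day frame `demo` is hypothetical data made up for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

# Hypothetical five-day extract straddling the start of the first lockdown
demo = pd.DataFrame({
    "Date": pd.date_range("2020-03-15", periods=5),
    "Nb_users": [500, 480, 510, 495, 505],
    "Nb_trips": [1000, 950, 1020, 990, 1010],
})
cols = ["Nb_users", "Nb_trips"]

# Both bounds are inclusive, matching 2020-03-17 to 2020-05-10
mask = demo["Date"].between("2020-03-17", "2020-05-10")

# One remaining-activity factor per row, shared by every column of that row
factor = rng.normal(0.05, 0.01, size=int(mask.sum()))
scaled = demo.loc[mask, cols].to_numpy() * factor[:, None]
demo.loc[mask, cols] = np.rint(scaled).astype("int64")

print(demo)
```

Rows before 2020-03-17 are left untouched, and within the lockdown the number of trips stays at least as large as the number of users.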

The second lockdown lasted from 2020-10-30 00:00 to 2020-12-14 23:59. People did not follow the rules as strictly as during the first one, so I assume a reduction of around 70%.

In [116]:
# Keep about 30% of the original value (mean 0.3, standard deviation 0.06)
def confinement_2(x):
    n = np.random.normal(0.3, 0.06)
    return round(x*n)

In [117]:
# Similarly, select the rows of the second lockdown into a temporary DataFrame

data_con_2 = new_data.loc[(new_data.Date>=datetime(2020,10,30,0,0)) & (new_data.Date<datetime(2020,12,15,0,0))]

In [118]:
data_con_2_2 = data_con_2.copy()

In [119]:
data_con_2_2['Nb_users'] = data_con_2['Nb_users'].apply(confinement_2)
data_con_2_2['Nb_trips'] = data_con_2['Nb_trips'].apply(confinement_2)
data_con_2_2['Total_distance_km'] = data_con_2['Total_distance_km'].apply(confinement_2)
data_con_2_2['Total_time_hour'] = data_con_2['Total_time_hour'].apply(confinement_2)
data_con_2_2['Nb_new_users'] = data_con_2['Nb_new_users'].apply(confinement_2)

In [120]:
new_data.loc[(new_data.Date>=datetime(2020,10,30,0,0)) & (new_data.Date<datetime(2020,12,15,0,0))] = data_con_2_2

In [128]:
new_data.to_csv('test_dataset_Covid.csv', index = False, header = True)
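When the CSV file is read back, the Date column comes in as plain text unless it is parsed explicitly. A minimal round-trip sketch, using an in-memory buffer and a hypothetical two-row frame instead of the real file:

```python
import io

import pandas as pd

# Hypothetical two-row extract with the same kind of columns as the dataset
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-16", "2020-03-17"]),
    "City": ["Belfort", "Belfort"],
    "Nb_users": [120, 6],
})

# Write to an in-memory buffer with the same options as the cell above
buffer = io.StringIO()
demo.to_csv(buffer, index=False, header=True)
buffer.seek(0)

# parse_dates restores the datetime dtype of the Date column
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```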