# `Group Project` - Boston 2017 Bike Sharing Analysis

## Introduction
In this project we work with real-world data from "Blue Bikes", a bike sharing company based in Boston.

Our focus is on data from 2017. We carry out a comprehensive analysis that applies typical data analysis and machine learning approaches to monitor and optimize the operations of "Blue Bikes". The overall topic is smart mobility systems and how data can be used in impactful ways to address pressing societal issues. For bike sharing, these issues are the reduction of greenhouse gas emissions, the reduction of pollution as a health risk for the urban population, the reduction of (fatal) road accidents, and the creation of a more efficient road transport infrastructure. We will cover the following tasks:

- **Task 1) Data Collection & Preparation**: 
    * Cleaning of datasets for use in later analysis stages
- **Task 2) Descriptive Analysis**: 
    * Demonstrate temporal demand patterns and seasonality
    * Demonstrate geographical demand patterns 
    * Define Key Performance Indicators that provide an overview of current fleet operations
- **Task 3) Predictive Analysis**:
    * Forecast total system-level demand in the next hour

**The authors of this analysis are:**
* Robin Kirch      (7364580)
* Niklas Nesseler  (7367375)
* Lukas Tempfli    (7367097)
* Sven Dornbrach   (7364484)
* Moritz Danhausen (7369413)


Hint: Please run the separate cells from top to bottom.

### Required Imports

In [1]:
import math
import random
from datetime import date, time, datetime, timedelta

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium import plugins
from folium.plugins import HeatMap
from pandas.plotting import register_matplotlib_converters
from sklearn.model_selection import train_test_split

# Let matplotlib handle pandas datetime values on plot axes
register_matplotlib_converters()

# Plot styling
sns.set()
sns.set_style("white")
sns.set_palette("GnBu_d")


# `Task 1) Data Collection & Preparation:`

### Details to Task 1
#TODO
Detailed introduction to task 1 here

In [2]:
raw_data_boston = pd.read_csv("boston_2017.csv")
# Problematic when start times are identical; specifying the full file path may be necessary

In [3]:

# Parse the timestamps; the format is inferred by pandas because the raw CSV layout is not shown here
raw_data_boston["start_time"] = pd.to_datetime(raw_data_boston["start_time"])
raw_data_boston["end_time"] = pd.to_datetime(raw_data_boston["end_time"])

# Hour of day of the rental start (0-23)
raw_data_boston["hour"] = raw_data_boston["start_time"].dt.hour

# Start time truncated to the full hour; used later as the join key for the hourly weather data
raw_data_boston["date_time"] = raw_data_boston["start_time"].dt.floor("h")

raw_data_boston

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type,hour,date_time
0,2017-01-01 00:06:58,2017-01-01 00:12:49,67,139,MIT at Mass Ave / Amherst St,Dana Park,644,Subscriber,0,2017-01-01 00:00:00
1,2017-01-01 00:13:16,2017-01-01 00:28:07,36,10,Boston Public Library - 700 Boylston St.,B.U. Central - 725 Comm. Ave.,230,Subscriber,0,2017-01-01 00:00:00
2,2017-01-01 00:16:17,2017-01-01 00:44:10,36,9,Boston Public Library - 700 Boylston St.,Agganis Arena - 925 Comm Ave.,980,Customer,0,2017-01-01 00:00:00
3,2017-01-01 00:21:22,2017-01-01 00:33:50,46,19,Christian Science Plaza,Buswell St. at Park Dr.,1834,Subscriber,0,2017-01-01 00:00:00
4,2017-01-01 00:30:06,2017-01-01 00:40:28,10,8,B.U. Central - 725 Comm. Ave.,Union Square - Brighton Ave. at Cambridge St.,230,Subscriber,0,2017-01-01 00:00:00
...,...,...,...,...,...,...,...,...,...,...
1313769,2017-12-31 23:46:18,2017-12-31 23:50:27,117,141,Binney St / Sixth St,Kendall Street,1846,Subscriber,23,2017-12-31 23:00:00
1313770,2017-12-29 16:11:56,2017-12-29 16:16:18,54,42,Tremont St at West St,Boylston St at Arlington St TEMPORARY WINTER L...,2,Subscriber,16,2017-12-29 16:00:00
1313771,2017-12-30 08:09:44,2017-12-30 08:26:08,54,58,Tremont St at West St,Beacon St at Arlington St,1534,Subscriber,8,2017-12-30 08:00:00
1313772,2017-12-30 12:20:01,2017-12-30 12:49:12,54,46,Tremont St at West St,Christian Science Plaza - Massachusetts Ave at...,1978,Subscriber,12,2017-12-30 12:00:00


In [4]:
raw_data_boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313774 entries, 0 to 1313773
Data columns (total 10 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   start_time          1313774 non-null  datetime64[ns]
 1   end_time            1313774 non-null  datetime64[ns]
 2   start_station_id    1313774 non-null  int64         
 3   end_station_id      1313774 non-null  int64         
 4   start_station_name  1313774 non-null  object        
 5   end_station_name    1313774 non-null  object        
 6   bike_id             1313774 non-null  int64         
 7   user_type           1313774 non-null  object        
 8   hour                1313774 non-null  int64         
 9   date_time           1313774 non-null  datetime64[ns]
dtypes: datetime64[ns](3), int64(4), object(3)
memory usage: 100.2+ MB


In [5]:
raw_data_boston.describe()

Unnamed: 0,start_station_id,end_station_id,bike_id,hour
count,1313774.0,1313774.0,1313774.0,1313774.0
mean,85.71053,85.5436,993.1371,13.7343
std,56.34766,56.48317,570.1397,4.77768
min,1.0,1.0,1.0,0.0
25%,43.0,43.0,516.0,9.0
50%,74.0,74.0,978.0,14.0
75%,117.0,117.0,1512.0,17.0
max,232.0,232.0,1981.0,23.0


In [6]:
raw_data_boston.columns

Index(['start_time', 'end_time', 'start_station_id', 'end_station_id',
       'start_station_name', 'end_station_name', 'bike_id', 'user_type',
       'hour', 'date_time'],
      dtype='object')

In [7]:
raw_data_boston["end_time"].describe()
# start_time is not accessible


count                 1313774
unique                1245112
top       2017-06-27 17:27:42
freq                        6
first     2017-01-01 00:12:49
last      2018-01-07 20:00:16
Name: end_time, dtype: object

In [8]:
raw_data_boston["start_station_id"].describe()

count    1.313774e+06
mean     8.571053e+01
std      5.634766e+01
min      1.000000e+00
25%      4.300000e+01
50%      7.400000e+01
75%      1.170000e+02
max      2.320000e+02
Name: start_station_id, dtype: float64

In [9]:
raw_data_boston["end_station_id"].describe()

count    1.313774e+06
mean     8.554360e+01
std      5.648317e+01
min      1.000000e+00
25%      4.300000e+01
50%      7.400000e+01
75%      1.170000e+02
max      2.320000e+02
Name: end_station_id, dtype: float64

In [10]:
raw_data_boston["start_station_name"].describe()

count                          1313774
unique                             285
top       MIT at Mass Ave / Amherst St
freq                             42320
Name: start_station_name, dtype: object

In [11]:
raw_data_boston["end_station_name"].describe()

count                          1313774
unique                             283
top       MIT at Mass Ave / Amherst St
freq                             42442
Name: end_station_name, dtype: object

In [12]:
raw_data_boston["bike_id"].describe()

count    1.313774e+06
mean     9.931371e+02
std      5.701397e+02
min      1.000000e+00
25%      5.160000e+02
50%      9.780000e+02
75%      1.512000e+03
max      1.981000e+03
Name: bike_id, dtype: float64

In [13]:
raw_data_boston["user_type"].describe()

count        1313774
unique             2
top       Subscriber
freq         1104738
Name: user_type, dtype: object

In [14]:
raw_data_boston.count()

start_time            1313774
end_time              1313774
start_station_id      1313774
end_station_id        1313774
start_station_name    1313774
end_station_name      1313774
bike_id               1313774
user_type             1313774
hour                  1313774
date_time             1313774
dtype: int64

In [15]:
raw_data_weather= pd.read_csv("weather_hourly_boston.csv")
raw_data_weather["date_time"]=pd.to_datetime(raw_data_weather["date_time"])


In [16]:
raw_data_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43848 entries, 0 to 43847
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date_time  43354 non-null  datetime64[ns]
 1   max_temp   43354 non-null  float64       
 2   min_temp   43354 non-null  float64       
 3   precip     43356 non-null  float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 1.3 MB


In [17]:
raw_data_weather.head(20)

Unnamed: 0,date_time,max_temp,min_temp,precip
0,2015-01-02 01:00:00,-1.1,-1.1,0.0
1,2015-01-02 02:00:00,-1.1,-1.1,0.0
2,2015-01-02 03:00:00,-0.6,-0.6,0.0
3,2015-01-02 04:00:00,-0.6,-0.6,0.0
4,2015-01-02 05:00:00,-0.6,-0.6,0.0
5,2015-01-01 06:00:00,-5.6,-5.6,0.0
6,2015-01-01 07:00:00,-5.6,-5.6,0.0
7,2015-01-01 08:00:00,-5.6,-5.6,0.0
8,2015-01-01 09:00:00,-4.4,-4.4,0.0
9,2015-01-01 10:00:00,-5.6,-5.6,0.0


In [48]:
raw_station_location = pd.read_csv("current_bluebikes_stations.csv")
raw_station_location.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379 entries, 0 to 378
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Last Updated    379 non-null    object
 1   June 11th 2021  379 non-null    object
 2   Unnamed: 2      379 non-null    object
 3   Unnamed: 3      379 non-null    object
 4   Unnamed: 4      376 non-null    object
 5   Unnamed: 5      379 non-null    object
 6   Unnamed: 6      379 non-null    object
dtypes: object(7)
memory usage: 20.9+ KB


In [49]:
raw_station_location.describe()

Unnamed: 0,Last Updated,June 11th 2021,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
count,379,379,379.0,379.0,376,379,379
unique,379,379,379.0,379.0,11,2,22
top,S32013,Binney St / Sixth St,42.3650112,-71.07942931,Boston,Yes,15
freq,1,1,1.0,1.0,225,378,136


In [50]:
raw_station_location.head(50)

Unnamed: 0,Last Updated,June 11th 2021,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
0,Number,Name,Latitude,Longitude,District,Public,Total docks
1,W32006,160 Arsenal,42.36466403,-71.17569387,Watertown,Yes,11
2,A32019,175 N Harvard St,42.363796,-71.129164,Boston,Yes,18
3,S32035,191 Beacon St,42.38032335,-71.10878613,Somerville,Yes,19
4,C32094,2 Hummingbird Lane at Olmsted Green,42.28887,-71.095003,Boston,Yes,17
5,S32023,30 Dane St,42.38100143,-71.10402523,Somerville,Yes,15
6,M32026,359 Broadway - Broadway at Fayette Street,42.370803,-71.104412,Cambridge,Yes,23
7,C32091,645 Summer St,42.34178089,-71.03987017,Boston,Yes,19
8,M32054,699 Mt Auburn St,42.37500235,-71.14871614,Cambridge,Yes,25
9,V32001,7 Acre Park,42.41143223,-71.06823265,Everett,Yes,15


In [53]:
# The first CSV row holds the real column names, so drop it and rename the columns.
# drop() and rename() return new objects, so raw_station_location stays unchanged.
station_location = raw_station_location.drop(0).rename(
    columns={'Last Updated':'Number','June 11th 2021':'Name','Unnamed: 2':'Latitude','Unnamed: 3':'Longitude',
             'Unnamed: 4':'District','Unnamed: 5':'Public','Unnamed: 6':'Total Docks'})
station_location.head(50)


Unnamed: 0,Number,Name,Latitude,Longitude,District,Public,Total Docks
1,W32006,160 Arsenal,42.36466403,-71.17569387,Watertown,Yes,11
2,A32019,175 N Harvard St,42.363796,-71.129164,Boston,Yes,18
3,S32035,191 Beacon St,42.38032335,-71.10878613,Somerville,Yes,19
4,C32094,2 Hummingbird Lane at Olmsted Green,42.28887,-71.095003,Boston,Yes,17
5,S32023,30 Dane St,42.38100143,-71.10402523,Somerville,Yes,15
6,M32026,359 Broadway - Broadway at Fayette Street,42.370803,-71.104412,Cambridge,Yes,23
7,C32091,645 Summer St,42.34178089,-71.03987017,Boston,Yes,19
8,M32054,699 Mt Auburn St,42.37500235,-71.14871614,Cambridge,Yes,25
9,V32001,7 Acre Park,42.41143223,-71.06823265,Everett,Yes,15
10,B32060,700 Commonwealth Ave.,42.34960945,-71.10391524,Boston,Yes,16


In [None]:
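# Keep only the weather observations of 2017.
# The end bound is exclusive so that all hours of December 31 are included.
first_date = datetime(year=2017, month=1, day=1)
end_date = datetime(year=2018, month=1, day=1)

neu_weather_2017 = raw_data_weather[(raw_data_weather["date_time"] >= first_date) & (raw_data_weather["date_time"] < end_date)].copy()
neu_weather_2017.info()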


In [None]:
bike_list = raw_data_boston["bike_id"].unique()
fleet_size = len(bike_list)
fleet_size
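
The fleet size can serve as the basis for simple fleet KPIs. The following is a minimal sketch on a tiny made-up trip table; the two KPIs (average trip duration, trips per bike and day) and all names such as `trips` and `fleet_size_demo` are illustrative assumptions, not KPIs defined elsewhere in this notebook.

```python
import pandas as pd

# Tiny illustrative trip table (assumed values, NOT the Blue Bikes data)
trips = pd.DataFrame({
    "start_time": pd.to_datetime(["2017-01-01 08:00:00", "2017-01-01 09:00:00",
                                  "2017-01-02 08:30:00", "2017-01-02 17:00:00"]),
    "end_time":   pd.to_datetime(["2017-01-01 08:10:00", "2017-01-01 09:30:00",
                                  "2017-01-02 08:50:00", "2017-01-02 17:20:00"]),
    "bike_id": [1, 2, 1, 1],
})

# Number of distinct bikes that appear in the trip table
fleet_size_demo = trips["bike_id"].nunique()

# KPI 1: average trip duration in minutes
duration_min = (trips["end_time"] - trips["start_time"]).dt.total_seconds() / 60
avg_duration_min = duration_min.mean()

# KPI 2: average number of trips per bike and day
n_days = trips["start_time"].dt.date.nunique()
trips_per_bike_per_day = len(trips) / (fleet_size_demo * n_days)

print(fleet_size_demo, avg_duration_min, trips_per_bike_per_day)
```

Applied to the real data, the same expressions could be evaluated on `raw_data_boston`, provided `start_time` and `end_time` are still present at that point.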


In [None]:
def get_date (ts):
    return ts.date()

def get_weekday (ts):
    return ts.weekday()

def get_hour (ts):
    return ts.hour
def set1 (ts):
    return 1

In [None]:
# Calendar date of each rental and a constant counter column used for summing rentals
raw_data_boston["Date"] = raw_data_boston["start_time"].dt.date
raw_data_boston["Rented"] = 1
raw_data_boston.head(50)

In [None]:
bikes_rented_total = raw_data_boston.groupby("Date")["Rented"].sum()
fig,ax = plt.subplots(figsize=(16,9)) 

ax.plot(bikes_rented_total)

ax.set_title("# RENTALS PER DAY", fontsize = 20, fontname = "arial", color = "red")
ax.set_ylabel("# of rentals", fontsize = 14, color = "red")
ax.set_xlabel("Date", fontsize = 14, color = "red")

plt.show()

In [None]:
bikes_rented_total

## Demand as a function of temperature

In [None]:
# Connect both data sets on the hourly timestamp with an inner join

inner_merge = pd.merge(left=raw_data_boston, right=neu_weather_2017, on="date_time", how="inner")

inner_merge
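
An inner join like this silently duplicates rentals if the same hour occurs more than once in the weather table. Whether that is the case for the project data is not checked here; the sketch below only illustrates, on made-up frames, how the `validate` argument of `pd.merge` can guard against it. The helper `merge_trips_with_weather` is hypothetical.

```python
import pandas as pd

# Illustrative trips and hourly weather (assumed values, NOT the project data)
trips = pd.DataFrame({
    "date_time": pd.to_datetime(["2017-01-01 00:00", "2017-01-01 00:00", "2017-01-01 01:00"]),
    "Rented": [1, 1, 1],
})
weather = pd.DataFrame({
    "date_time": pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00", "2017-01-01 01:00"]),
    "max_temp": [-1.1, -0.6, -0.6],
})

def merge_trips_with_weather(trips_df, weather_df):
    """Inner join on the hour; raises if an hour occurs twice in the weather data."""
    return pd.merge(trips_df, weather_df, on="date_time", how="inner", validate="many_to_one")

# The duplicated 01:00 row in the weather frame is detected by the validation
try:
    merge_trips_with_weather(trips, weather)
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

# After removing duplicate hours the merge keeps exactly one row per trip
merged = merge_trips_with_weather(trips, weather.drop_duplicates(subset="date_time"))
print(duplicate_detected, len(merged))
```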

In [None]:

weather_rentals=inner_merge.groupby("min_temp")["Rented"].sum()

Fig_2, ax_2= plt.subplots(figsize=(16,9))

ax_2.plot(weather_rentals)

ax_2.set_title("TEMPERATURE/RENTALS", fontsize = 20, fontname = "arial", color = "red")
ax_2.set_ylabel("number of rentals", fontsize = 14, color = "red")
ax_2.set_xlabel("temperature at the rental", fontsize = 14, color = "red")

plt.show()





## Rented bicycles for each day of the week

In [None]:
# Add the weekday (Monday=0 ... Sunday=6) to the data

raw_data_boston["Weekday"] = raw_data_boston["start_time"].dt.weekday

In [None]:
weekday_rentals=raw_data_boston.groupby("Weekday")["Rented"].sum()
Fig_3, ax_3= plt.subplots(figsize=(16,9))

ax_3.plot(weekday_rentals)

ax_3.set_title("WEEKDAY/RENTALS", fontsize = 20, fontname = "arial", color = "red")
ax_3.set_ylabel("number of rentals", fontsize = 14, color = "red")
ax_3.set_xlabel("weekday", fontsize = 14, color = "red")


plt.show()

## Overview of the hourly demand

In [None]:
hourly_rentals=raw_data_boston.groupby("hour")["Rented"].sum()
Fig_3, ax_3= plt.subplots(figsize=(16,9))

ax_3.plot(hourly_rentals)

ax_3.set_title("HOUR/RENTALS", fontsize = 20, fontname = "arial", color = "red")
ax_3.set_ylabel("number of rentals", fontsize = 14, color = "red")
ax_3.set_xlabel("hour", fontsize = 14, color = "red")


plt.show()

In [None]:
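# Bar plot approach, since weekdays are discrete values (there is no day 1.4)
# Seaborn would probably be an alternative here

weekday_rentals=raw_data_boston.groupby("Weekday")["Rented"].sum()

# One bar per weekday; a histogram would bin the rental totals instead of the weekdays
plt.bar(weekday_rentals.index, weekday_rentals.values, edgecolor = 'k')

plt.show()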



## Popularity of the stations 

In [None]:
start_station_rentals=inner_merge.groupby("start_station_id")["Rented"].sum()
end_station_rentals=inner_merge.groupby("end_station_id")["Rented"].sum()


Fig_4, ax_4= plt.subplots(figsize=(16,9))

ax_4.plot(start_station_rentals, label="start_station", color = "blue")
ax_4.plot(end_station_rentals , label="end_station", color = "green")

ax_4.set_title("POPULARITY OF A STATION", fontsize = 20, fontname = "arial", color = "red")
ax_4.set_ylabel("number of rentals", fontsize = 14, color = "red")
ax_4.set_xlabel("station number", fontsize = 14, color = "red")


plt.legend(loc="upper left")




In [None]:
ypos= raw_data_boston["start_station_id"].unique()
#ypos1=len(ypos)
#ypos1
ypos

In [None]:
# Highest station id in the data (avoid shadowing the built-in max)
max_station_id = int(raw_data_boston["start_station_id"].max())
stations = np.arange(1, max_station_id + 1)
stations

In [None]:
station_rentals=raw_data_boston.groupby("start_station_id")["Rented"].sum()

# Histogram of rentals over the start station ids (232 bins, roughly one per station id)
plt.figure(figsize=(35,20))
n,bins,patch = plt.hist(raw_data_boston.start_station_id,bins=232, color='green', alpha=0.8, label='Value', edgecolor='orange', linewidth=2)

plt.show()

## Popularity of start stations

In [None]:
df_station_rentals=pd.DataFrame(station_rentals)
df_station_rentals=df_station_rentals.sort_values(["Rented"], ascending=False)

# One row per start station with its name (first occurrence in the trip data)
station_names = raw_data_boston[["start_station_id", "start_station_name"]].drop_duplicates(subset=["start_station_id"], keep='first')

# Attach the station name to the rental count of each start station, most popular first
inner_merge2 = pd.merge(df_station_rentals.reset_index(), station_names, on="start_station_id", how="left")
inner_merge2.set_index("start_station_id",inplace=True)
inner_merge2

## Demand depending on the weather

In [None]:

inner_merge3 = pd.merge(left=raw_data_boston, right=neu_weather_2017, on="date_time")

# Total rentals per precipitation value
precip_rentals = inner_merge3.groupby("precip")["Rented"].sum()

precip_rentals.plot.bar(figsize=(30,20), fontsize = 20)

# Create names on the x-axis
bars = ['sun', 'rain/snow']
y_pos = np.arange(len(bars))
plt.xticks(y_pos, bars)
plt.show()



# `Task 2) Descriptive Analysis:`

### Details to Task 2
#TODO
Detailed introduction to task 2 here

# Temporal Demand Patterns and Seasonality:

## Fleet usage during a day:

In [None]:
hourly_rentals=raw_data_boston.groupby("hour")["Rented"].sum()
Fig_3, ax_3= plt.subplots(figsize=(16,9))

ax_3.plot(hourly_rentals)

ax_3.set_title("HOUR/RENTALS", fontsize = 20, fontname = "arial", color = "red")
ax_3.set_ylabel("number of rentals", fontsize = 14, color = "red")
ax_3.set_xlabel("hour", fontsize = 14, color = "red")


plt.show()

The graph shows two local maxima: one in the morning between 8 and 10 am and one in the afternoon between 4 and 6 pm. This suggests that the bikes are mostly used by locals to get to and from work.
At night, between 11 pm and 6 am, rentals reach a predictable low, since most people are active during the day and rest at night.
Around midday the number of rented bikes stays fairly constant, between 60 and 80 thousand per hour of the day (summed over the whole year).
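
The peak hours can also be located programmatically instead of reading them off the chart. The following is a minimal sketch on made-up hourly counts; the numbers in `hourly_counts` are illustrative assumptions, not the Blue Bikes data, and `find_local_peaks` is a hypothetical helper that is not part of this notebook.

```python
import pandas as pd

def find_local_peaks(series):
    """Return the index labels whose value is larger than both neighbours."""
    prev_vals = series.shift(1)
    next_vals = series.shift(-1)
    # The first and last entries compare against NaN and are never flagged
    is_peak = (series > prev_vals) & (series > next_vals)
    return series[is_peak].index.tolist()

# Illustrative rental counts per hour of day (NOT real data)
hourly_counts = pd.Series(
    [5, 3, 2, 1, 2, 10, 30, 70, 110, 80, 60, 61,
     62, 63, 64, 70, 100, 130, 95, 60, 40, 25, 15, 8],
    index=range(24),
)

peak_hours = find_local_peaks(hourly_counts)
print(peak_hours)
```

On the real data the helper could be called with `hourly_rentals`; noisy curves may need smoothing first, because every small bump counts as a peak.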

## Fleet usage during the week

In [None]:

Fig_3, ax_3= plt.subplots(figsize=(16,9))

ax_3.plot(weekday_rentals)

ax_3.set_title("WEEKDAY/RENTALS", fontsize = 20, fontname = "arial", color = "red")
ax_3.set_ylabel("number of rentals", fontsize = 14, color = "red")
ax_3.set_xlabel("weekday", fontsize = 14, color = "red")


plt.show()

The graph shows high demand on working days and low demand at the weekend. As in the hourly graph above, demand follows the working week, which indicates that bikes are mostly rented for the commute to work. Since most people do not work at the weekend, demand is lower on those days. This suggests that the majority of riders live in or around Boston and are not tourists.
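
The comparison between working days and weekends can be quantified as an average number of rentals per day. The sketch below uses made-up daily totals for two weeks; the numbers and the column names `rentals` and `day_type` are assumptions for illustration only.

```python
import pandas as pd

# Illustrative daily rental totals for two weeks (assumed numbers, NOT real data)
daily = pd.DataFrame({
    "date": pd.date_range("2017-06-05", periods=14, freq="D"),  # 2017-06-05 is a Monday
    "rentals": [500, 520, 510, 530, 490, 300, 280,
                505, 515, 525, 495, 485, 310, 290],
})

# Monday=0 ... Sunday=6; working days are 0-4
daily["day_type"] = (daily["date"].dt.weekday <= 4).map({True: "workday", False: "weekend"})

# Mean rentals per day for each day type
mean_by_day_type = daily.groupby("day_type")["rentals"].mean()
workday_mean = mean_by_day_type.loc["workday"]
weekend_mean = mean_by_day_type.loc["weekend"]
print(f"working days: {workday_mean:.1f}, weekend: {weekend_mean:.1f}")
```

Averaging per day, rather than summing per weekday, keeps the comparison fair even if the year contains an unequal number of each weekday.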

## Fleet usage during a year

In [None]:
bikes_rented_total = raw_data_boston.groupby("Date")["Rented"].sum()
fig,ax = plt.subplots(figsize=(16,9)) 

ax.plot(bikes_rented_total)

ax.set_title("# RENTALS PER DAY", fontsize = 20, fontname = "arial", color = "red")
ax.set_ylabel("# of rentals", fontsize = 14, color = "red")
ax.set_xlabel("Date", fontsize = 14, color = "red")

plt.show()

As we can see, demand is much higher in summer than in winter. The weather is a likely explanation, which is examined in the following cells using the temperature data.
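
One simple way to back up the weather explanation is a correlation coefficient between temperature and rentals. The sketch below uses made-up daily values; the numbers and the column names `mean_temp_c` and `rentals` are illustrative assumptions, not results from the project data.

```python
import pandas as pd

# Illustrative daily values (assumed numbers, NOT the Blue Bikes or weather data)
daily = pd.DataFrame({
    "mean_temp_c": [-5.0, 0.0, 5.0, 10.0, 15.0, 20.0, 25.0],
    "rentals":     [400, 900, 1600, 2500, 3900, 5200, 6000],
})

# Pearson correlation coefficient between temperature and rentals (between -1 and 1)
temp_demand_corr = daily["mean_temp_c"].corr(daily["rentals"])
print(round(temp_demand_corr, 3))
```

A value close to 1 would support the weather explanation, although correlation alone does not rule out other seasonal effects such as holidays or daylight hours.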

In [None]:
# Work on a copy without min_temp; deleting the column from an alias would also remove it from neu_weather_2017
weather_plot = neu_weather_2017.drop(columns=["min_temp"])
weather_plot


In [None]:

x=weather_plot['date_time']
y=weather_plot['max_temp']

# Temperature over the course of the year
Fig_1, ax_1 = plt.subplots(figsize=(16,9))
ax_1.plot(x, y)

ax_1.set_xlabel('Date')
ax_1.set_ylabel('Maximum temperature (°C)')
ax_1.set_title('Hourly maximum temperature in 2017')

plt.show()

In [None]:
xp=weather_plot['date_time']
yp=weather_plot['max_temp']

In [None]:
# Scatter plot of the temperature over time (xp = date_time, yp = max_temp)
plt.figure(figsize= (16,9))
plt.scatter(xp, yp, marker="x")
plt.xlabel("Date")
plt.ylabel("Maximum temperature (°C)")
plt.show()

In [None]:
inner_merge
wr=inner_merge.groupby(["max_temp"], as_index=False)["Rented"].sum()
wr

In [None]:
yp=wr['Rented']
xp=wr['max_temp']

In [None]:
plt.figure(figsize= (9,9))
plt.scatter(xp, yp, marker="x")
plt.xlabel("Temperature")
plt.ylabel("Demand")
plt.show()

In [None]:
def plot_regularized_polyregression (x, y, lam, d):
    
    min_x, max_x = x.min(), x.max()
    xs = 2*(x - min_x)/(max_x - min_x) - 1
    X = np.array([xs**i for i in range(d,-1,-1)]).T
    theta = np.linalg.solve(X.T @ X + lam*np.eye(X.shape[1]), X.T @ y)  # see lecture notes for derivation!, where np.eye() returns the identity matrix
    xt0 = np.linspace(min_x-1, max_x+1, 400)
    xt = 2*(xt0 - min_x)/(max_x - min_x) - 1
    Xt = np.array([xt**i for i in range(d,-1,-1)]).T
    yt = Xt @ theta
    
    # plotting routine
    plt.figure(figsize = (8,6))
    plt.scatter(x, y, marker="x")
    ylim = plt.ylim()
    plt.plot(xt0, yt, 'C1')
    plt.xlabel("Temperature (°C)")
    plt.ylabel("Demand (number of rentals)")
    plt.xlim([min_x-2, max_x+2])
    plt.ylim(ylim)
    print(theta[:4])

In [None]:
plot_regularized_polyregression(xp,yp,0.1, 15)
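
The `np.linalg.solve` call inside `plot_regularized_polyregression` corresponds to the closed-form solution of ridge-regularized least squares. As a brief reminder, stated under the assumption that the lecture notes use the same objective (with `lam` written as $\lambda$ and `d` as the polynomial degree):

```latex
\hat{\theta}
  = \arg\min_{\theta}\; \lVert X\theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2
\quad\Longrightarrow\quad
\left( X^{\top} X + \lambda I \right) \hat{\theta} = X^{\top} y ,
\qquad
X_{ij} = x_{s,i}^{\,d-j}, \quad j = 0, \dots, d ,
\qquad
x_{s} = \frac{2\,(x - x_{\min})}{x_{\max} - x_{\min}} - 1 .
```

Because the identity matrix spans all columns of $X$, the constant term is regularized as well in this implementation. Rescaling the input to $[-1, 1]$ keeps the high powers numerically manageable.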

# `Task 3) Predictive Analysis:`

### Details to Task 3
#TODO
Detailed introduction to task 3 here

In [None]:
# Keep only the columns needed for the prediction; copying leaves raw_data_boston untouched
data_boston = raw_data_boston[["hour", "date_time", "Date", "Rented", "Weekday"]].copy()

# Left join on the hourly weather data so that hours without any rental are kept at this stage
left_mergeT = pd.merge(neu_weather_2017,data_boston, on="date_time",how="left")
left_mergeT




In [None]:
# Flag working days: Monday (0) to Friday (4) -> 1, Saturday and Sunday -> 0
left_mergeT["IsWeekday"] = (left_mergeT["Weekday"] <= 4).astype(int)
left_mergeT

In [None]:
del left_mergeT["Weekday"]
del left_mergeT["Date"]
#del left_mergeT["min_temp"]
left_mergeT=left_mergeT.dropna()
left_mergeT.info()

In [None]:
summe=left_mergeT.groupby("date_time")["Rented"].sum()
df = pd.DataFrame(summe, columns = ['Rented'])
df

In [None]:
left_mergeT2=left_mergeT.drop_duplicates(subset=["date_time"], keep='first', inplace=False, ignore_index=False)
del left_mergeT2['Rented']
left_mergeT3=pd.merge(left_mergeT2,df, on="date_time",how="left")
left_mergeT3.set_index("date_time",inplace=True)
left_mergeT3.head(10)

In [None]:

X=left_mergeT3[['max_temp','precip','hour','IsWeekday']].values
Y=left_mergeT3['Rented'].values
X

In [None]:
Y

In [None]:
xp=X
yp=Y
# Do a 70-30 split first
x_train, X_test, y_train, y_test = train_test_split(xp, yp, test_size=0.3,random_state=34 )

# Optionally split x_train again to achieve a 50-20-30 split (train / hold-out / test)
#x_train, X_hold, y_train, y_hold = train_test_split(x_train, y_train, test_size=(0.2/0.7),random_state=34 )
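
The two-stage split mentioned in the comment above can be sketched as follows. The arrays `X_demo` and `y_demo` and all variable names ending in `_demo` are made-up stand-ins, not the notebook data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix and target (assumed shapes, NOT the notebook data)
X_demo = np.arange(400).reshape(100, 4)
y_demo = np.arange(100)

# Step 1: hold out 30 % of all rows as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=34
)

# Step 2: split the remaining 70 % so that about 20 % of ALL rows become the hold-out set
X_train_demo, X_hold_demo, y_train_demo, y_hold_demo = train_test_split(
    X_rest, y_rest, test_size=0.2 / 0.7, random_state=34
)

print(len(X_train_demo), len(X_hold_demo), len(X_test_demo))
```

The hold-out set would be used to choose hyperparameters such as the polynomial degree and `alpha`, while the test set stays untouched until the final evaluation. Since the target is a time series, a chronological split might be a more realistic alternative to random shuffling.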

In [None]:
print(len(X),len(x_train))


In [None]:
from sklearn.preprocessing import PolynomialFeatures

# initialize model
Poly = PolynomialFeatures(degree = 20)

# fit and transform the training features
X_poly = Poly.fit_transform(x_train)

In [None]:
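from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# The `normalize` argument of Ridge was removed in scikit-learn 1.2, so the polynomial
# features are standardized explicitly (results are not numerically identical to normalize=True)
scaler = StandardScaler()
X_poly_scaled = scaler.fit_transform(X_poly)

model_L2 = Ridge(alpha = 0.01, solver = 'lsqr') # select regularized least squares as solver

model_L2.fit(X_poly_scaled, y_train)
prediction = model_L2.predict(X_poly_scaled)  # predictions on the training data only

#print("Coefficients ", model_L2.coef_, "\nIntercept ", model_L2.intercept_ )
prediction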
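
The cell above predicts on the same rows the model was fitted on, so it says nothing about how well the forecast generalizes. A minimal sketch of an evaluation on unseen data is shown below. It uses synthetic data and a low polynomial degree; the demand curve, the feature choice, and all names (`X_demo`, `model`, `test_mae`, `test_r2`) are assumptions for illustration, not the project's results.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the hourly feature matrix (temperature, hour); NOT real data
rng = np.random.default_rng(34)
temp = rng.uniform(-10, 30, size=500)
hour = rng.integers(0, 24, size=500)
X_demo = np.column_stack([temp, hour])
# Assumed demand curve: rises with temperature, peaks in the afternoon, plus noise
y_demo = 50 + 4 * temp - 0.8 * (hour - 15) ** 2 + rng.normal(0, 5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=34)

# Polynomial expansion -> scaling -> ridge regression, fitted on the training split only
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), Ridge(alpha=0.01))
model.fit(X_tr, y_tr)

# Score on the unseen test split
y_pred = model.predict(X_te)
test_mae = mean_absolute_error(y_te, y_pred)
test_r2 = r2_score(y_te, y_pred)
print(f"test MAE: {test_mae:.2f}, test R^2: {test_r2:.3f}")
```

Wrapping the steps in a pipeline ensures that the polynomial expansion and the scaling learned on the training split are reused unchanged for the test split. In the notebook, the equivalent would be to transform `X_test` with the already fitted `Poly` object before calling `predict`.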