<a href="https://colab.research.google.com/github/CassDabii/BBC-DS-Task/blob/main/BBC_DSProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***BBC Data Science Project*** 
---
Since this project is open-ended, it is up to me to determine what data is useful for producing actionable insights. To do this, I will make a list of *preliminary* goals that may change; any goals added or omitted will not be hidden but instead justified, to maintain credibility.




Goals
*   Determine the effect weather has on journey length and cycle volume.
*   Determine the usefulness and performance of trying to predict cycle volumes.
*   Decide where to add another station.
*   If the business ever wanted to remove a station, determine which station would make the most sense to remove.
*   Determine what effect bike station capacity has (e.g., whether more spaces lead to more people using a station).
*   Determine which stations would be useful to expand.












## Acquiring Data

I use the initial cell to declare my imports because it keeps the notebook organised and I never have to look for where I imported something.


In [94]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sqlite3 

In [7]:
conn = sqlite3.connect('/content/BBCDS.sqlite3')

cursor = conn.cursor()

In [10]:
bike_journeys = pd.read_sql('SELECT * FROM bike_journeys;',conn)
bike_stations = pd.read_sql('SELECT * FROM bike_stations;',conn)
weather = pd.read_sql('SELECT * FROM weather;',conn)

### Bike Journeys

In [9]:
bike_journeys # Shows the data in a dataframe format 

Unnamed: 0,Journey Duration,Journey ID,End Date,End Month,End Year,End Hour,End Minute,End Station ID,Start Date,Start Month,Start Year,Start Hour,Start Minute,Start Station ID
0,2040,953,19,9,17,18,0,478,19,9,17,17,26,251
1,1800,12581,19,9,17,15,21,122,19,9,17,14,51,550
2,1140,1159,15,9,17,17,1,639,15,9,17,16,42,212
3,420,2375,14,9,17,12,16,755,14,9,17,12,9,163
4,1200,14659,13,9,17,19,33,605,13,9,17,19,13,36
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1542839,270,5296,8,8,17,0,2,32,7,8,17,23,58,42
1542840,911,12348,8,8,17,0,13,625,7,8,17,23,58,222
1542841,447,8303,8,8,17,0,7,453,7,8,17,23,59,130
1542842,424,12038,8,8,17,0,6,405,7,8,17,23,59,755


I output the dataframe simply by evaluating the variable, to see the columns and the size of the data and to gain some preliminary domain knowledge. I also use this method because the notebook workspace I use shows a wand icon next to the dataframe that lets me sort each column in ascending or descending order without writing any code, which is a lot more seamless. Cross-referencing the journey duration against the start and end times, I now know that the journey duration is measured in seconds.
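The cross-check described above can be sketched in code. This is a minimal illustration that uses the first three rows of the output shown earlier; the helper name `to_timestamp` is hypothetical, and it assumes the two-digit year `17` means 2017.

```python
import pandas as pd

# First three rows of the bike_journeys output shown above.
sample = pd.DataFrame({
    "Journey Duration": [2040, 1800, 1140],
    "Start Date": [19, 19, 15], "Start Month": [9, 9, 9], "Start Year": [17, 17, 17],
    "Start Hour": [17, 14, 16], "Start Minute": [26, 51, 42],
    "End Date": [19, 19, 15], "End Month": [9, 9, 9], "End Year": [17, 17, 17],
    "End Hour": [18, 15, 17], "End Minute": [0, 21, 1],
})

def to_timestamp(df, prefix):
    """Assemble a timestamp from the split '<prefix> ...' columns (hypothetical helper)."""
    parts = pd.DataFrame({
        "year": 2000 + df[f"{prefix} Year"],  # assumption: 17 means 2017
        "month": df[f"{prefix} Month"],
        "day": df[f"{prefix} Date"],
        "hour": df[f"{prefix} Hour"],
        "minute": df[f"{prefix} Minute"],
    })
    return pd.to_datetime(parts)

# Elapsed time between start and end, in seconds.
elapsed_seconds = (to_timestamp(sample, "End") - to_timestamp(sample, "Start")).dt.total_seconds()

# If the duration is in seconds, the two columns should agree
# (only to the minute in general, since start and end times carry no seconds).
comparison = pd.DataFrame({
    "Journey Duration": sample["Journey Duration"],
    "elapsed_seconds": elapsed_seconds,
})
print(comparison)
```

For these three rows the elapsed time matches the recorded duration, which is consistent with the duration being stored in seconds.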

In [11]:
bike_journeys.dtypes # Shows the data type for each feature/variable

Journey Duration    int64
Journey ID          int64
End Date            int64
End Month           int64
End Year            int64
End Hour            int64
End Minute          int64
End Station ID      int64
Start Date          int64
Start Month         int64
Start Year          int64
Start Hour          int64
Start Minute        int64
Start Station ID    int64
dtype: object

Since these are all numerical, I think .corr() will be useful for initial interpretations.

### Bike Stations

In [12]:
bike_stations

Unnamed: 0,Station ID,Capacity,Latitude,Longitude,Station Name
0,1,19,51.529163,-0.109970,"River Street , Clerkenwell"
1,2,37,51.499606,-0.197574,"Phillimore Gardens, Kensington"
2,3,32,51.521283,-0.084605,"Christopher Street, Liverpool Street"
3,4,23,51.530059,-0.120973,"St. Chad's Street, King's Cross"
4,5,27,51.493130,-0.156876,"Sedding Street, Sloane Square"
...,...,...,...,...,...
768,190,21,51.489975,-0.132845,"Rampayne Street, Pimlico"
769,194,56,51.504627,-0.091773,"Hop Exchange, The Borough"
770,195,30,51.507244,-0.106237,"Milroy Walk, South Bank"
771,196,17,51.503688,-0.098497,"Union Street, The Borough"


I can already see that the Station ID is a foreign key between bike stations and bike journeys. I expect to use a join to get all the data into one flat table.

In [13]:
bike_stations.dtypes

Station ID        int64
Capacity          int64
Latitude        float64
Longitude       float64
Station Name     object
dtype: object

### Weather

In [98]:
weather.head()

Unnamed: 0,LATITUDE,LONGITUDE,DATE,PRCP (MM),TAVG (CELSIUS)
0,51.478,-0.461,01/09/2017,1.5,16.1
1,51.478,-0.461,02/09/2017,0.0,15.8
2,51.478,-0.461,03/09/2017,0.0,13.7
3,51.478,-0.461,04/09/2017,6.1,17.7
4,51.478,-0.461,05/09/2017,0.3,17.6


The date format in this dataset differs from the bike journeys dataset, so it will have to be converted before the two can be joined. This dataset also includes dates from August, whereas all of the bike journeys take place in September, so those rows will be omitted as they are not needed. Finally, the data describes the weather at only one location and gives a single precipitation and temperature value per day, without detailing whether each is the average or the highest value for that day and without giving specific times.
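One way to make the two date formats comparable is to turn both into proper datetime values. The sketch below is illustrative only: the weather rows are taken from the output above, the journey rows are made-up examples, the column name `date` is an assumption, and the two-digit year is assumed to mean 2017.

```python
import pandas as pd

# Two weather rows copied from the output above.
weather_sample = pd.DataFrame({
    "DATE": ["01/09/2017", "02/09/2017"],
    "PRCP (MM)": [1.5, 0.0],
    "TAVG (CELSIUS)": [16.1, 15.8],
})
# The weather dates are day-first strings, so the format is given explicitly.
weather_sample["date"] = pd.to_datetime(weather_sample["DATE"], format="%d/%m/%Y")

# Hypothetical journey rows using the split date columns of bike_journeys.
journeys_sample = pd.DataFrame({
    "Start Date": [1, 2, 2],
    "Start Month": [9, 9, 9],
    "Start Year": [17, 17, 17],
})
journeys_sample["date"] = pd.to_datetime(pd.DataFrame({
    "year": 2000 + journeys_sample["Start Year"],  # assumption: 17 means 2017
    "month": journeys_sample["Start Month"],
    "day": journeys_sample["Start Date"],
}))

# Each journey picks up the weather recorded for its start date.
joined = journeys_sample.merge(weather_sample, on="date", how="left")
print(joined[["date", "PRCP (MM)", "TAVG (CELSIUS)"]])
```

With a shared `date` column in place, it also becomes straightforward to check which months each table actually covers before filtering anything out.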

In [15]:
weather.dtypes

LATITUDE          float64
LONGITUDE         float64
DATE               object
PRCP (MM)         float64
TAVG (CELSIUS)    float64
dtype: object

In [95]:
weather.columns

Index(['LATITUDE', 'LONGITUDE', 'DATE', 'PRCP (MM)', 'TAVG (CELSIUS)'], dtype='object')

In [96]:
bike_journeys.columns

Index(['Journey Duration', 'Journey ID', 'End Date', 'End Month', 'End Year',
       'End Hour', 'End Minute', 'End Station ID', 'Start Date', 'Start Month',
       'Start Year', 'Start Hour', 'Start Minute', 'Start Station ID'],
      dtype='object')

In [97]:
bike_stations.columns

Index(['Station ID', 'Capacity', 'Latitude', 'Longitude', 'Station Name'], dtype='object')

## Prepare

In [21]:
bike_journeys.isnull().any() # The use of these 2 methods together checks if there are missing values in any of the columns

Journey Duration    False
Journey ID          False
End Date            False
End Month           False
End Year            False
End Hour            False
End Minute          False
End Station ID      False
Start Date          False
Start Month         False
Start Year          False
Start Hour          False
Start Minute        False
Start Station ID    False
dtype: bool

Initially I want to check whether any of the columns have missing data, since that would most probably lower the data quality. However, alternatives to simply dropping the affected rows could be looked into, depending on where the missing data lies: if station IDs are missing, that is a major factor for the outcomes, but if the start hour is missing while the journey duration is still present, the gap could be filled. How much data is missing also tells me whether it is worth making these changes.
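Since none of the tables here turned out to have missing values, the following is only a hypothetical sketch of the filling idea described above. It uses a made-up two-row frame in which one start time has been blanked out, and recovers it from the end time and the journey duration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has its start time deliberately removed.
df = pd.DataFrame({
    "Journey Duration": [2040, 1800],  # seconds
    "End Hour": [18, 15], "End Minute": [0, 21],
    "Start Hour": [17, np.nan], "Start Minute": [26, np.nan],
})

# Share of missing values per column, to judge whether a fix is worth the effort.
missing_share = df.isnull().mean()

# Start time in minutes since midnight = end time minus duration (to the minute).
# The modulo handles journeys that begin before midnight and end after it.
end_minutes = df["End Hour"] * 60 + df["End Minute"]
start_minutes = (end_minutes - df["Journey Duration"] // 60) % (24 * 60)

# Fill only the gaps; existing values are left untouched.
df["Start Hour"] = df["Start Hour"].fillna(start_minutes // 60)
df["Start Minute"] = df["Start Minute"].fillna(start_minutes % 60)
print(df)
```

Because the times are recorded only to the minute, a start time recovered this way is approximate whenever the duration is not a whole number of minutes.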

In [18]:
bike_stations.isnull().any()

Station ID      False
Capacity        False
Latitude        False
Longitude       False
Station Name    False
dtype: bool

In [19]:
weather.isnull().any()

LATITUDE          False
LONGITUDE         False
DATE              False
PRCP (MM)         False
TAVG (CELSIUS)    False
dtype: bool

### Bike Journey and Stations Preparations

In [72]:
bike_journeys_stations = pd.read_sql('SELECT * FROM bike_journeys JOIN bike_stations ON bike_journeys.[Start Station ID]=bike_stations.[Station ID]',conn)

I join the bike journeys and bike stations tables to have all the data in one flat table, so that I can use methods like .corr() and visualise correlations, along with other statistical methods.
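The displayed index of the joined table ends at 1530238, whereas that of bike_journeys ends at 1542842, which suggests that the inner join drops journeys whose start station has no match in bike_stations. A minimal sketch of how this could be inspected in pandas follows; the frames are tiny stand-ins, and the station ID `9999` is a made-up example of an unmatched station.

```python
import pandas as pd

# Tiny stand-ins for the two tables; station 9999 is hypothetical and has no match.
journeys = pd.DataFrame({
    "Journey ID": [953, 12581, 1159],
    "Start Station ID": [251, 550, 9999],
})
stations = pd.DataFrame({
    "Station ID": [251, 550],
    "Capacity": [34, 23],
})

# A left merge keeps every journey; indicator=True adds a '_merge' column
# that records whether a matching station was found.
merged = journeys.merge(
    stations,
    how="left",
    left_on="Start Station ID",
    right_on="Station ID",
    validate="many_to_one",  # raises if a Station ID appears more than once
    indicator=True,
)

# Journeys that an inner join would silently drop.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["Journey ID", "Start Station ID"]])
```

Listing the unmatched station IDs this way makes it possible to decide whether those journeys can safely be excluded or need to be handled separately.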

In [77]:
bike_journeys_stations

Unnamed: 0,Journey Duration,Journey ID,End Date,End Month,End Year,End Hour,End Minute,End Station ID,Start Date,Start Month,Start Year,Start Hour,Start Minute,Start Station ID,Station ID,Capacity,Latitude,Longitude,Station Name
0,2040,953,19,9,17,18,0,478,19,9,17,17,26,251,251,34,51.518908,-0.079249,"Brushfield Street, Liverpool Street"
1,1800,12581,19,9,17,15,21,122,19,9,17,14,51,550,550,23,51.521564,-0.039264,"Harford Street, Mile End"
2,1140,1159,15,9,17,17,1,639,15,9,17,16,42,212,212,17,51.506584,-0.199004,"Campden Hill Road, Notting Hill"
3,420,2375,14,9,17,12,16,755,14,9,17,12,9,163,163,27,51.493184,-0.167894,"Sloane Avenue, Knightsbridge"
4,1200,14659,13,9,17,19,33,605,13,9,17,19,13,36,36,28,51.501737,-0.184980,"De Vere Gardens, Kensington"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1530235,270,5296,8,8,17,0,2,32,7,8,17,23,58,42,42,28,51.530991,-0.093903,"Wenlock Road , Hoxton"
1530236,911,12348,8,8,17,0,13,625,7,8,17,23,58,222,222,43,51.502757,-0.155349,"Knightsbridge, Hyde Park"
1530237,447,8303,8,8,17,0,7,453,7,8,17,23,59,130,130,24,51.509506,-0.075459,"Tower Gardens , Tower"
1530238,424,12038,8,8,17,0,6,405,7,8,17,23,59,755,755,24,51.485121,-0.174971,"The Vale, Chelsea"




### Weather Preparation

In [None]:
pd.read_sql('''SELECT * 
                FROM bike_stations
                WHERE [Latitude] = 51.478 AND [Longitude] = -0.461
                ;''', conn) # Check for a station with exactly the same latitude and longitude as the weather data

There is no row with this exact latitude and longitude, so I have to check how close the nearest station is to this point and whether the distance is large enough to disregard the weather data.

In [89]:
pd.read_sql('''SELECT *
                FROM bike_stations
                ORDER BY ABS([Latitude] - 51.478) + ABS([Longitude] + 0.461)
                ;''', conn)
# This query orders the stations from closest to furthest from the given latitude and longitude (using the sum of absolute degree differences as a rough proxy for distance)

Unnamed: 0,Station ID,Capacity,Latitude,Longitude,Station Name
0,668,26,51.494223,-0.236769,"Ravenscourt Park Station, Hammersmith"
1,753,28,51.492636,-0.234094,"Hammersmith Town Hall, Hammersmith"
2,644,36,51.483732,-0.223852,"Rainville Road, Hammersmith"
3,682,46,51.488108,-0.226606,"Crisp Road, Hammersmith"
4,599,28,51.485743,-0.223616,"Manbre Road, Hammersmith"
...,...,...,...,...,...
768,785,64,51.540940,-0.010510,"Aquatic Centre, Queen Elizabeth Olympic Park"
769,787,35,51.546805,-0.014691,"Timber Lodge, Queen Elizabeth Olympic Park"
770,786,44,51.549369,-0.015717,"Lee Valley VeloPark, Queen Elizabeth Olympic Park"
771,784,34,51.546326,-0.009935,"East Village, Queen Elizabeth Olympic Park"


Using Google Maps, I measured the distance between the furthest station and the latitude and longitude given in the weather dataset (20.11 miles). This still falls within the London area, so the weather can be taken into consideration for every journey. The lack of a time of day in the weather dates is still an issue.
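The distance can also be estimated in code with the haversine (great-circle) formula. This is a sketch, not the method used above: the function name `haversine_miles` is hypothetical, and it takes the last station in the ordered output (East Village) as the furthest one, which is an assumption because that ordering is based on degree differences rather than true distance.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points."""
    earth_radius_miles = 3958.8  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

weather_point = (51.478, -0.461)        # location given in the weather table
east_village = (51.546326, -0.009935)   # last station in the ordered output

distance = haversine_miles(*weather_point, *east_village)
print(f"{distance:.2f} miles")
```

A straight-line figure like this may differ slightly from a measurement taken on a map, but it gives a reproducible check that can be applied to every station at once.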

In [91]:
weather = pd.read_sql("""SELECT * FROM weather WHERE [DATE] LIKE '%/09/2017';""", conn) # Keep only the September 2017 rows
weather

Unnamed: 0,LATITUDE,LONGITUDE,DATE,PRCP (MM),TAVG (CELSIUS)
0,51.478,-0.461,01/09/2017,1.5,16.1
1,51.478,-0.461,02/09/2017,0.0,15.8
2,51.478,-0.461,03/09/2017,0.0,13.7
3,51.478,-0.461,04/09/2017,6.1,17.7
4,51.478,-0.461,05/09/2017,0.3,17.6
5,51.478,-0.461,06/09/2017,1.5,15.8
6,51.478,-0.461,07/09/2017,0.0,15.8
7,51.478,-0.461,08/09/2017,0.8,15.8
8,51.478,-0.461,09/09/2017,6.1,13.2
9,51.478,-0.461,10/09/2017,7.1,13.3


In [93]:
weather.corr(numeric_only=True)['TAVG (CELSIUS)'] # numeric_only skips the text DATE column (requires pandas >= 1.5)


## Analyse