<a href="https://colab.research.google.com/github/CassDabii/BBC-DS-Task/blob/main/BBC_DSProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***BBC Data Science Project*** 
---
Since this project is open-ended, it is up to me to determine which data is useful for producing actionable insights. To do this I will set out a list of *preliminary* goals. These may change as the analysis progresses, but any goal that is added or dropped will be stated and justified rather than hidden, to maintain credibility.




Goals
*   Determine the effect weather has on journey length and cycle volume.
*   Assess how useful it is to predict cycle volumes, and how well such a prediction performs.
*   Decide where to add another station.
*   Identify which station would be the best candidate for removal, should the business ever want to remove one.
*   Determine the effect of bike station capacity (e.g., whether stations with more spaces attract more use).









## Acquiring Data

I inspect each CSV file separately to understand the context, shape and features of each dataset, and to do a quick mental check of what is feasible and which logical connections can be made between them.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt  # the plotting functions live in the pyplot submodule

In [5]:

bike_journeys = pd.read_csv('/content/bike_journeys.csv')
bike_stations = pd.read_csv('/content/bike_stations.csv')
weather = pd.read_csv('/content/weather.csv')
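
The first look at each dataset (row count, columns, data types, missing values) could also be condensed into one helper. This is only a minimal sketch: `summarise` is a hypothetical function that is not part of the notebook, and the toy frame uses made-up values in place of the real CSV files.

```python
import pandas as pd


def summarise(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, null count and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isnull().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })


# Toy frame standing in for one of the CSVs (values are made up for illustration)
toy = pd.DataFrame({
    "Journey Duration": [600.0, 1200.0, 300.0],
    "Start Year": [17.0, 17.0, None],
})

summary = summarise(toy)
print(f"{len(toy)} rows x {toy.shape[1]} columns")
print(summary)
```

Calling `summarise(bike_journeys)`, `summarise(bike_stations)` and `summarise(weather)` would give the same overview as the separate `len`, `.columns` and `.dtypes` cells in the sections that follow.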

### Bike Journeys

In [None]:
bike_journeys.head()

Gives a preview of the first five rows of the dataset.

In [7]:
len(bike_journeys) # Shows how many rows are in the dataset

430529

In [8]:
bike_journeys.columns # Shows all the features/variables within the dataset

Index(['Journey Duration', 'Journey ID', 'End Date', 'End Month', 'End Year',
       'End Hour', 'End Minute', 'End Station ID', 'Start Date', 'Start Month',
       'Start Year', 'Start Hour', 'Start Minute', 'Start Station ID'],
      dtype='object')

In [9]:
bike_journeys.dtypes # Shows the data type of each feature/variable

Journey Duration    float64
Journey ID          float64
End Date            float64
End Month           float64
End Year            float64
End Hour            float64
End Minute          float64
End Station ID      float64
Start Date          float64
Start Month         float64
Start Year          float64
Start Hour          float64
Start Minute        float64
Start Station ID    float64
dtype: object

Since these are all numerical, I think `.corr()` will be useful for initial interpretations.

### Bike Stations

In [26]:
bike_stations

Unnamed: 0,Station ID,Capacity,Latitude,Longitude,Station Name
0,1,19,51.529163,-0.109970,"River Street , Clerkenwell"
1,2,37,51.499606,-0.197574,"Phillimore Gardens, Kensington"
2,3,32,51.521283,-0.084605,"Christopher Street, Liverpool Street"
3,4,23,51.530059,-0.120973,"St. Chad's Street, King's Cross"
4,5,27,51.493130,-0.156876,"Sedding Street, Sloane Square"
...,...,...,...,...,...
768,190,21,51.489975,-0.132845,"Rampayne Street, Pimlico"
769,194,56,51.504627,-0.091773,"Hop Exchange, The Borough"
770,195,30,51.507244,-0.106237,"Milroy Walk, South Bank"
771,196,17,51.503688,-0.098497,"Union Street, The Borough"


In [11]:
len(bike_stations)

773

In [12]:
bike_stations.columns

Index(['Station ID', 'Capacity', 'Latitude', 'Longitude', 'Station Name'], dtype='object')

In [13]:
bike_stations.dtypes

Station ID        int64
Capacity          int64
Latitude        float64
Longitude       float64
Station Name     object
dtype: object

### Weather

In [27]:
weather

Unnamed: 0,LATITUDE,LONGITUDE,DATE,PRCP (MM),TAVG (CELSIUS)
0,51.478,-0.461,01/08/2017,0.0,17.1
1,51.478,-0.461,02/08/2017,0.8,16.8
2,51.478,-0.461,03/08/2017,7.1,18.4
3,51.478,-0.461,04/08/2017,0.0,18.3
4,51.478,-0.461,05/08/2017,0.0,16.8
5,51.478,-0.461,06/08/2017,0.5,16.2
6,51.478,-0.461,07/08/2017,0.0,16.7
7,51.478,-0.461,08/08/2017,0.3,15.2
8,51.478,-0.461,09/08/2017,1.0,14.1
9,51.478,-0.461,10/08/2017,22.1,16.1


In [15]:
len(weather)

44

In [16]:
weather.columns

Index(['LATITUDE', 'LONGITUDE', 'DATE', 'PRCP (MM)', 'TAVG (CELSIUS)'], dtype='object')

In [17]:
weather.dtypes

LATITUDE          float64
LONGITUDE         float64
DATE               object
PRCP (MM)         float64
TAVG (CELSIUS)    float64
dtype: object

## Prepare

In [18]:
bike_journeys.isnull().any() # The use of these 2 methods together checks if there are missing values in any of the columns

Journey Duration    False
Journey ID           True
End Date             True
End Month            True
End Year             True
End Hour             True
End Minute           True
End Station ID       True
Start Date           True
Start Month          True
Start Year           True
Start Hour           True
Start Minute         True
Start Station ID     True
dtype: bool

Some columns contain null values, so I will remove the rows in which they occur.

In [19]:
bike_journeys = bike_journeys.dropna() # Removes every row that contains at least one missing value

This is beneficial because some statistical methods require complete data, and if the missing values are scattered sporadically across the dataset it would be very difficult to handle them and still produce credible output.
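
Before dropping rows it can help to check how many rows are affected, since `.isnull().any()` only says whether a column has missing values, not how many. The sketch below shows one way to do that on a made-up toy frame; the values are illustrative and not taken from `bike_journeys`.

```python
import pandas as pd

# Toy frame with made-up values; None marks a missing entry
toy = pd.DataFrame({
    "Journey Duration": [600.0, 1200.0, 300.0, 450.0],
    "Start Station ID": [1.0, None, 3.0, 4.0],
    "End Station ID": [2.0, None, None, 5.0],
})

# Count rows that contain at least one missing value (axis=1 checks across columns)
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# dropna() with default arguments removes exactly those rows
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(f"{rows_with_missing} of {len(toy)} rows contain missing values")
print(f"{rows_dropped} rows dropped, {len(cleaned)} rows kept")
```

If the share of affected rows is small, dropping them should have little effect on the analysis; if it is large, it would be worth looking at why the values are missing first.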

In [20]:
bike_journeys.isnull().any()

Journey Duration    False
Journey ID          False
End Date            False
End Month           False
End Year            False
End Hour            False
End Minute          False
End Station ID      False
Start Date          False
Start Month         False
Start Year          False
Start Hour          False
Start Minute        False
Start Station ID    False
dtype: bool

In [21]:
bike_journeys.corr()['Journey Duration']

Journey Duration    1.000000
Journey ID         -0.002061
End Date           -0.003439
End Month          -0.001989
End Year                 NaN
End Hour            0.019115
End Minute         -0.001257
End Station ID      0.015545
Start Date         -0.008946
Start Month        -0.004519
Start Year               NaN
Start Hour         -0.000427
Start Minute       -0.000392
Start Station ID    0.012663
Name: Journey Duration, dtype: float64

The correlations for the end year and start year are NaN. At first I suspected this was caused by null values within those columns. However, the correlation was computed after `.dropna()` had removed every row with a missing value, and the other columns that `.isnull().any()` had flagged as containing missing values still produce a correlation, so missing data cannot be the cause. The more likely explanation is that every value in these two columns is the same, which makes their variance, and therefore the denominator in the correlation formula, zero.
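
The zero-denominator explanation can be checked on a small example. Pearson correlation divides the covariance of two columns by the product of their standard deviations, so a column with a single repeated value gives 0/0. The sketch below uses a made-up toy frame; the year value of 17 is an assumption for illustration, not a value read from the dataset.

```python
import pandas as pd

# Toy frame with made-up values; "Start Year" is deliberately constant
toy = pd.DataFrame({
    "Journey Duration": [600.0, 1200.0, 300.0, 450.0],
    "Start Hour": [8.0, 17.0, 9.0, 12.0],
    "Start Year": [17.0, 17.0, 17.0, 17.0],
})

# Correlation of every column with journey duration
corr = toy.corr()["Journey Duration"]
print(corr)

# Columns with only one distinct value have zero variance
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
print("Constant columns:", constant_cols)
```

Running the same `nunique()` check on `bike_journeys` would confirm whether `End Year` and `Start Year` really hold a single value, in which case they carry no information and could be dropped.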


In [22]:
bike_stations.isnull().any()

Station ID      False
Capacity        False
Latitude        False
Longitude       False
Station Name    False
dtype: bool

In [23]:
weather.isnull().any()

LATITUDE          False
LONGITUDE         False
DATE              False
PRCP (MM)         False
TAVG (CELSIUS)    False
dtype: bool

## Analyse

In [24]:
bike_journeys.groupby('Start Station ID')['Journey Duration'].count()

Start Station ID
1.0       370
2.0       411
3.0      1095
4.0       397
5.0      1009
         ... 
818.0     768
819.0     421
820.0     381
821.0     426
826.0     588
Name: Journey Duration, Length: 778, dtype: int64