## Download data from:
- https://data.smartdublin.ie/dataset/dublinbikes-api/resource/aab12e7d-547f-463a-86b1-e22002884587
- https://data.smartdublin.ie/dataset/dublinbikes-api/resource/8ddaeac6-4caf-4289-9835-cf588d0b69e5
- https://data.smartdublin.ie/dataset/dublinbikes-api/resource/99a35442-6878-4c2d-8dff-ec43e91d21d7
- https://data.smartdublin.ie/dataset/dublinbikes-api/resource/5328239f-bcc6-483d-9c17-87166efc3a1a

### Keep them in the same folder as the .ipynb file


## Abstract

For this Machine Learning project I used four algorithms: Decision Tree, Linear Regression, XGBoost and Logistic Regression. The dataset is Dublin Bikes. The best results came from Decision Tree and Linear Regression, which both scored about 0.99 (accuracy for Decision Tree, R² for Linear Regression). The Linear Regression score was slightly higher than the Decision Tree score.


## Outline
In this assignment I took the following steps to perform the analysis:
- Research Question
- Introduction to ML algorithms
- Loading datasets
- Displaying the column names before and after encoding, and creating plots
- Splitting the data into training and testing sets
- Implementing algorithms

## Research Question
- How many bikes were available in different months in 2020?

## Background

Machine Learning is the study of computer algorithms that improve automatically through experience and the use of data.
ML algorithms are used for purposes such as data mining, image processing and predictive analytics. The main advantage of machine learning is that, once an algorithm has learned what to do with data, it can do its work automatically.

### Types of Machine Learning Algorithms

There are several ways to define the types of Machine Learning algorithms, but they are commonly divided by purpose into the following main categories:

- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning


## Introduction

For the Data Analytics and Algorithms (CW_KCDAT_M) Y5 module I was required to complete a research/software project based on data analytics and machine learning. This involved finding a large dataset and performing a research analysis on it in Python from within Jupyter Notebook. I selected two datasets and sent them for approval. The first is a collection of used cars for sale, collected from Craigslist, one of the largest sales websites: https://www.kaggle.com/austinreese/craigslist-carstrucks-data . The second is Dublin Bikes, a bike hire scheme operating from docks and bike stations in Dublin city: https://data.smartdublin.ie/dataset/dublinbikes-api . After a preliminary review of both, I decided to use Dublin Bikes for the analysis because I found it interesting and comprehensive. I selected the last four Dublin Bikes datasets, which cover a period of one year (01.01.2020 to 01.01.2021).


In [None]:
#  I've installed all necessary packages

In [None]:
# pip install xgboost
# pip install graphviz

In [None]:
# Import all necessary libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import xgboost as xgb
%matplotlib inline

In [None]:
# Read data from csv file

In [None]:
df1 = pd.read_csv('dublinbikes_20200101_20200401.csv')
df2 = pd.read_csv('dublinbikes_20200401_20200701.csv')
df3 = pd.read_csv('dublinbikes_20200701_20201001.csv')
df4 = pd.read_csv('dublinbikes_20201001_20210101.csv')

In [None]:
# Concatenate the four datasets into one and display it

In [None]:
bike_list = [df1, df2, df3, df4]
bike = pd.concat(bike_list, ignore_index=True)
bike
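
The two loading steps above can also be combined into a small helper. This is a minimal sketch, assuming the CSV files sit in one folder and share the `dublinbikes_*.csv` naming pattern seen in the filenames above. The function name `load_bike_csvs` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_bike_csvs(folder=".", pattern="dublinbikes_*.csv"):
    """Read every CSV matching `pattern` in `folder` and stack them into one DataFrame."""
    # Sorting the paths keeps the quarterly files in date order
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers the rows instead of repeating each file's own index
    return pd.concat(frames, ignore_index=True)
```

Calling `load_bike_csvs()` from the notebook's folder should give the same kind of combined frame as `bike`, provided only the four downloaded files match the pattern.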

In [None]:
# Print a concise summary of a DataFrame

In [None]:
bike.info()

In [None]:
# Display number of rows and number of columns in the DataFrame

In [None]:
bike.shape

In [None]:
# Count and display the number of "NaN" values in each column

In [None]:
bike.isnull().sum()

In [None]:
# Display the first five rows

In [None]:
bike.head()

In [None]:
# Load sklearn.model_selection

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split data into two subsets: training data and testing data

In [None]:
xtrain,xtest,ytrain,ytest=train_test_split(bike.loc[:,'STATION ID':'LATITUDE'],bike.loc[:,'LONGITUDE'],test_size=0.30,random_state=1)

In [None]:
print('Xtrain shape',xtrain.shape)
print('Xtest shape',xtest.shape)
print('Ytrain shape',ytrain.shape)
print('Ytest shape',ytest.shape)
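
To make the effect of `test_size=0.30` concrete, here is a minimal sketch on made-up data. The arrays `features` and `target` are toy stand-ins and are not taken from the bike dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 feature columns, and one target value per row
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# test_size=0.30 holds out 30% of the rows; random_state makes the shuffle repeatable
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.30, random_state=1
)

# 7 of the 10 rows end up in training and 3 in testing
print(f_train.shape, f_test.shape)
```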

In [None]:
# Scatter plot of total bike stands against available bike stands

In [None]:
sns.relplot(data=bike, x='BIKE STANDS', y='AVAILABLE BIKE STANDS')

In [None]:
# Plot the same relationship as a line

In [None]:
sns.relplot(data=bike, x='BIKE STANDS', y='AVAILABLE BIKE STANDS', kind='line')
plt.show()

In [None]:
# height and aspect together give an 11.7 x 8.27 inch figure
sns.relplot(data=bike, x='AVAILABLE BIKES', y='TIME', kind='line', height=8.27, aspect=11.7 / 8.27)
plt.show()

In [None]:
# Create and display histogram

In [None]:
bike['AVAILABLE BIKE STANDS'].plot.hist()

In [None]:
# bike.boxplot('NAME','LAST UPDATED',rot = 30,figsize=(5,6))

In [None]:
# Display summary statistics for each numeric column in the dataset

In [None]:
bike.describe()

In [None]:
# Display the underlying data as a NumPy array

In [None]:
bike.values

In [None]:
# Display the row index of the DataFrame

In [None]:
bike.index

In [None]:
# Box plot of "AVAILABLE BIKES" grouped by "STATUS"

In [None]:
bike.boxplot('AVAILABLE BIKES','STATUS',rot = 30,figsize=(10,12))

In [None]:
# Box plot of "LONGITUDE" grouped by "STATUS"

In [None]:
bike.boxplot('LONGITUDE','STATUS',rot = 30,figsize=(7,9))

# Decision Tree

In [None]:
# Make a copy of the dataset, one-hot encode the categorical "STATUS" column into dummy variables, and display the result

In [None]:
cat_bike_onehot = bike.copy() 
cat_bike_onehot = pd.get_dummies(cat_bike_onehot, columns=['STATUS'], prefix = ['STATUS']) 
print(cat_bike_onehot.head())
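
The encoding step above can be illustrated on a small frame. This is a minimal sketch with made-up rows. The `Open` and `Close` values are assumed from the `STATUS_Open` and `STATUS_Close` column names used later in the notebook.

```python
import pandas as pd

# Toy frame with the same kind of categorical column as the bike data
toy = pd.DataFrame({"STATION ID": [1, 2, 3], "STATUS": ["Open", "Close", "Open"]})

# Each distinct STATUS value becomes its own indicator column named <prefix>_<value>
encoded = pd.get_dummies(toy, columns=["STATUS"], prefix=["STATUS"])

print(encoded.columns.tolist())
```

The original `STATUS` column is replaced by one indicator column per category, so a model that only accepts numeric input can use it.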

In [None]:
# Convert the "TIME" column to datetime

In [None]:
cat_bike_onehot.TIME=pd.to_datetime(cat_bike_onehot.TIME)

In [None]:
# Display summary of a DataFrame

In [None]:
cat_bike_onehot.info()

In [None]:
# Split the "TIME" column into year, month, week, day, hour, minute and day of week

In [None]:
cat_bike_onehot['TIME_year'] = cat_bike_onehot['TIME'].dt.year
cat_bike_onehot['TIME_month'] = cat_bike_onehot['TIME'].dt.month
cat_bike_onehot['TIME_week'] = cat_bike_onehot['TIME'].dt.isocalendar().week
cat_bike_onehot['TIME_day'] = cat_bike_onehot['TIME'].dt.day
cat_bike_onehot['TIME_hour'] = cat_bike_onehot['TIME'].dt.hour
cat_bike_onehot['TIME_minute'] = cat_bike_onehot['TIME'].dt.minute
cat_bike_onehot['TIME_dayofweek'] = cat_bike_onehot['TIME'].dt.dayofweek
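
A minimal sketch of what the `.dt` accessors return, using one made-up timestamp. It assumes the real "TIME" column parses in the same way.

```python
import pandas as pd

# One made-up timestamp standing in for a row of the TIME column
stamps = pd.to_datetime(pd.Series(["2020-03-15 14:35:00"]))

parts = pd.DataFrame({
    "month": stamps.dt.month,
    "week": stamps.dt.isocalendar().week,  # ISO week number
    "hour": stamps.dt.hour,
    "dayofweek": stamps.dt.dayofweek,      # Monday=0 ... Sunday=6
})
print(parts)
```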

In [None]:
# Create a new DataFrame containing only the columns used for modelling

In [None]:
new_bike= pd.DataFrame(cat_bike_onehot[['STATION ID','TIME_year','TIME_month','TIME_week','TIME_day','TIME_hour','TIME_minute','TIME_dayofweek','BIKE STANDS','AVAILABLE BIKE STANDS','AVAILABLE BIKES','STATUS_Close','STATUS_Open']])

In [None]:
# Drop the "AVAILABLE BIKES" column to create the input set. The result is a new dataset without that column, which we call "X"

In [None]:
X = new_bike.drop(columns=['AVAILABLE BIKES'])
X

In [None]:
# Keep only the rows recorded in 2020
bike_year = new_bike[new_bike['TIME_year'] == 2020].copy()

In [None]:
bike_year = bike_year.groupby(bike_year.TIME_month).count().reset_index()

In [None]:
bike_year

In [None]:
# How many bikes were available in different months in 2020?

In [None]:
# Index by month so the x-axis shows month numbers instead of row positions
bike_year.set_index('TIME_month')['AVAILABLE BIKES'].plot(kind="bar", figsize=(12,14), color=['black', 'green', 'black', 'green', 'black', 'green'])

# Rotate the x-labels by 30 degrees, and keep the text aligned horizontally
plt.xticks(rotation=30, horizontalalignment="center")
plt.title("How many bikes are available in different months in 2020")
plt.xlabel("Months")
plt.ylabel("Number of Available Bikes")
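
Note that `count()` returns the number of rows recorded per month, not a number of bikes. The sketch below uses made-up values to show how `count()` and `mean()` differ for the same grouping. Which aggregate best answers the research question is a judgment call, and the toy numbers are assumptions, not results from the dataset.

```python
import pandas as pd

# Made-up readings: three in month 1, two in month 2
toy = pd.DataFrame({
    "TIME_month": [1, 1, 1, 2, 2],
    "AVAILABLE BIKES": [10, 20, 30, 4, 6],
})

# count = how many readings per month, mean = average bikes available per reading
per_month = toy.groupby("TIME_month")["AVAILABLE BIKES"].agg(["count", "mean"])
print(per_month)
```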

In [None]:
y = new_bike['AVAILABLE BIKES']
y

### Split the dependent and independent variables into training and testing sets: 70% training data and 30% testing data.

In [None]:
# Load train_test_split (random train and test subsets)

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)

In [None]:
#  Display info about Dataframe

In [None]:
new_bike.info()

### Implement decision tree algorithm

In [None]:
# Load  DecisionTreeClassifier 

In [None]:
from sklearn.tree import DecisionTreeClassifier

### Create new instance of the class and train that model to learn patterns of the data. It takes two datasets: input set and output set.

In [None]:
# Create a new object called model and fit it to the training data

In [None]:
model = DecisionTreeClassifier()
model.fit(x_train,y_train)

### Return the labels of the data passed as argument based upon the learned or trained data obtained from the model

In [None]:
# Make a prediction

In [None]:
prediction=model.predict(x_test)

### Function which computes subset accuracy

In [None]:
# Load accuracy_score from sklearn.metrics

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
score = accuracy_score(y_test, prediction)
score
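
As a reminder of what `accuracy_score` measures, here is a minimal sketch on made-up labels. Only exact matches count, so for a count-like target a prediction that is off by one bike is scored the same as one that is off by twenty.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted values; one of the four predictions is wrong
actual = np.array([3, 5, 5, 7])
predicted = np.array([3, 5, 6, 7])

# Accuracy is the share of predictions that match the true label exactly
manual = (actual == predicted).mean()
print(manual, accuracy_score(actual, predicted))
```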

In [None]:
# Compute correlation

In [None]:
new_bike.corr()

In [None]:
# Make a box plot from DataFrame columns grouped by "Status_Close"

In [None]:
new_bike.boxplot('TIME_dayofweek','STATUS_Close',rot = 30,figsize=(7,9))

# Linear Regression

### Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task.

In [None]:
# Load LinearRegression from sklearn.linear_model

In [None]:
from sklearn.linear_model import LinearRegression

###  Drop column

In [None]:
# Create a new input set without the "AVAILABLE BIKE STANDS" column (axis=1 means a column is dropped) and use that column as the output

In [None]:
x=new_bike.drop(['AVAILABLE BIKE STANDS'],axis=1).values
y=new_bike['AVAILABLE BIKE STANDS'].values

In [None]:
# Display input data

In [None]:
print(x)

In [None]:
# Display output data

In [None]:
print(y)

In [None]:
# Load train_test_split and give 70% to training data and 30% to testing data

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)

In [None]:
# Load LinearRegression library and create new object called ml

In [None]:
from sklearn.linear_model import LinearRegression
ml=LinearRegression()
ml.fit(x_train,y_train)

In [None]:
# Make a prediction

In [None]:
y_pred=ml.predict(x_test)
print(y_pred)

### Coefficient of determination regression

In [None]:
# Load r2_score

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
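
The coefficient of determination is defined as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the actual values from their mean. A minimal sketch on made-up numbers, checking the formula against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values
actual = np.array([2.0, 4.0, 6.0, 8.0])
predicted = np.array([2.5, 3.5, 6.5, 7.5])

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(actual, predicted))
```

A value close to 1 means the predictions explain almost all of the variation in the actual values.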

In [None]:
# Draw a plot 

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(y_test,y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted')

### Predicted values

In [None]:
pred_y_df=pd.DataFrame({'Actual Value':y_test,'Predicted value':y_pred, 'Difference': y_test-y_pred})
pred_y_df[0:20]

# XGBoost

### XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework.

In [None]:
# Create the XGBoost-specific DMatrix data format from the NumPy arrays

In [None]:
train = xgb.DMatrix(x_train, label=y_train)
test = xgb.DMatrix(x_test, label=y_test)

In [None]:
# Set the XGBoost parameters and the number of boosting rounds

In [None]:
param = {
    'max_depth': 4,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 60}
epochs = 4
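
With the `multi:softmax` objective, XGBoost expects integer class labels in the range 0 to `num_class - 1`. The sketch below derives `num_class` from the labels instead of hard-coding 60. The `labels` array is made up for illustration, and in the notebook it would be the training target.

```python
import numpy as np

# Made-up target values standing in for the training labels
labels = np.array([0, 3, 12, 40, 7])

# multi:softmax needs every label to be an integer in [0, num_class)
num_class = int(labels.max()) + 1

param = {
    "max_depth": 4,                # maximum depth of each tree
    "eta": 0.3,                    # learning rate
    "objective": "multi:softmax",  # predict a single class label per row
    "num_class": num_class,
}
print(param["num_class"])
```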

In [None]:
# Train the model

In [None]:
model = xgb.train(param, train, epochs)

In [None]:
# Dump model as a text file 

In [None]:
model.dump_model('dump.raw.text')

In [None]:
# Make a prediction

In [None]:
predictions = model.predict(test)

In [None]:
# Display predictions

In [None]:
print(predictions)

In [None]:
# Load accuracy_score and compute it

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

In [None]:
# xgb.plot_tree(model, num_trees=2)

# Logistic Regression

### Logistic Regression is a ML classification algorithm that is used to predict the probability of a categorical dependent variable.

In [None]:
# Load LogisticRegression from the sklearn library and import confusion_matrix which is used for evaluating the performance of a classification model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [None]:
# Create a Logistic Regression object and fit it to the training data.
# max_iter is raised so the lbfgs solver has more iterations to converge.

In [None]:
log_reg = LogisticRegression(solver='lbfgs', max_iter=5000)
log_reg.fit(x_train, y_train)

In [None]:
# perform prediction using the test dataset
y_pred = log_reg.predict(x_test)

In [None]:
# Display the Confusion Matrix

In [None]:
confusion_matrix(y_test, y_pred)
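
To help read the matrix, here is a minimal sketch on made-up binary labels. In scikit-learn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: one sample of class 1 is wrongly predicted as class 0
actual = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]

# Rows are the true classes, columns are the predicted classes
matrix = confusion_matrix(actual, predicted)
print(matrix)
```

The diagonal holds the correct predictions, and every off-diagonal entry is a misclassification.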

In [None]:
print(x)


In [None]:
print(y)

In [None]:
# plt.scatter(x_train, y_train, c=y, cmap='rainbow')
# plt.title('Scatter Plot of Logistic Regression')
# x_train[:,0]
# plt.show()

## Conclusion

Dublin Bikes also has a real-time API, which makes it possible to download continuously updated data and so keep the analysis up to date.
I believe that if bike hire schemes were extended to other large cities in Ireland, a similar analysis could be made for them and cycling usage could be compared with Dublin.


## References:

- https://www.geeksforgeeks.org/how-to-count-the-number-of-nan-values-in-pandas/
- https://www.youtube.com/watch?v=57vFbsiZYHg
- https://nbviewer.jupyter.org/github/Tanu-N-Prabhu/Python/blob/master/Data_Cleaning/Data_Cleaning_using_Python_with_Pandas_Library.ipynb
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
- https://data.smartdublin.ie/dataset/dublinbikes-api
- https://www.kaggle.com/austinreese/craigslist-carstrucks-data
- https://stackoverflow.com/questions/62658215/convergencewarning-lbfgs-failed-to-converge-status-1-stop-total-no-of-iter
- https://www.askpython.com/python/examples/python-predict-function
- https://www.w3cschool.cn/doc_scikit_learn/scikit_learn-modules-generated-sklearn-metrics-accuracy_score.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html
- https://www.geeksforgeeks.org/ml-linear-regression/
- https://www.kdnuggets.com/2017/03/simple-xgboost-tutorial-iris-dataset.html
- https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
- https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/
- https://en.wikipedia.org/wiki/Machine_learning
- http://scholar.google.com/scholar_url?url=https://www.researchgate.net/profile/Randeep_Kaur12/post/latest_research_which_should_be_chosen_in_machine_learning/attachment/5bb64a68cfe4a76455f83a27/AS%253A678058935717889%25401538673256296/download/machine%2Blearning%2Balgorithms.pdf&hl=pl&sa=X&ei=4kZ7YPfxNcXTsQK64KuAAg&scisig=AAGBfm0t2zr4xu3z_qp9AUasL00oeQTpNw&nossl=1&oi=scholarr
