# Project Title

The data we are using can be found here: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

In [None]:
# Import libraries using common aliases.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import metrics

In [None]:
# Read in the CSV file containing the data.
bike_sharing = pd.read_csv('bike_sharing.csv')

## Data Cleaning

The first step of the cleaning process is to familiarize ourselves with the dataset.

**Question.** What do you want to know about the data?  You can double-click on the cell below to type some of your ideas.

**Ideas.**

In [None]:
bike_sharing.shape

In [None]:
bike_sharing.dtypes

In [None]:
bike_sharing.head()

**Your turn!**  You can use the following cells to learn more about the data.  One option is to determine the unique values of each feature, using the first cell below as a template.

In [None]:
bike_sharing['holiday'].unique()

You'll have some time to explore this dataset further, and we will discuss cleaning in more detail in the next session.

## Exploratory Data Analysis

In [None]:
bike_sharing.describe()

### Examining Variables
Let's create a histogram for one of the quantitative variables, and then a bar graph for one of the categorical variables.

In [None]:
sns.displot(bike_sharing['temp'])

In [None]:
sns.countplot(x=bike_sharing['season'])

### Examining Relationships Between Two Variables
Below, we compute descriptive statistics for each variable, grouped by season. Then we create a scatterplot and compute correlation coefficients to examine the relationship between two quantitative variables.

In [None]:
bike_sharing.groupby('season').max()

In [None]:
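# numeric_only=True skips text columns such as dteday; pandas 2.0 and later raise a TypeError without it.
bike_sharing.groupby('season').mean(numeric_only=True)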

In [None]:
sns.scatterplot(x=bike_sharing['temp'], y=bike_sharing['windspeed'])

In [None]:
np.corrcoef(bike_sharing[['temp', 'windspeed']],rowvar=False)
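
To make the correlation coefficient less of a black box, here is a minimal sketch that computes Pearson's r by hand and compares it with `np.corrcoef`. The arrays `x` and `y` and the helper `pearson_r` are made up for illustration; they are not taken from the bike sharing data.

```python
import numpy as np

# Made-up illustrative values; not taken from the bike sharing data.
x = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
y = np.array([0.30, 0.25, 0.20, 0.15, 0.05])

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r_manual, r_numpy)
```

Because `y` falls as `x` rises in this toy example, both values are negative, and they should agree up to floating-point error.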

In [None]:
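# pairplot creates its own figure, so a preceding plt.figure(figsize=...) only adds an empty figure.
# Use pairplot's height and aspect arguments to control the size of each panel.
sns.pairplot(bike_sharing)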

In [None]:
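# numeric_only=True restricts the correlation matrix to numeric columns; pandas 2.0 and later raise an error on text columns.
bike_sharing.corr(numeric_only=True)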

## For next time:
- Explore the dataset further to find aspects that need to be cleaned (e.g. missing data, inconsistent formatting, data types that need to be changed, new features to add).
- Create visualizations and compute descriptive statistics for features you are interested in from the dataset.
- Bring questions you are interested in investigating based on this data.

## Data Cleaning (continued)

### Data Types & Formatting

Each feature should have consistent formatting and be stored as an appropriate data type.

**Question.** What do you notice about the data type of each feature?

In [None]:
# Convert dteday to the datetime data type.
bike_sharing['dteday'] = pd.to_datetime(bike_sharing['dteday'])

### Enriching the Data

We may want to add a feature by either:
- Appending a column from another dataframe
- Creating a new feature using features that already exist in our dataframe
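
Both options can be sketched on a small toy example. The column names below mirror the bike sharing data, but the values, the `events` table, and the `casual_share` feature are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Made-up toy data; the values and the `events` table are hypothetical.
rides = pd.DataFrame({
    'dteday': ['2011-01-01', '2011-01-02', '2011-01-03'],
    'casual': [10, 20, 30],
    'registered': [90, 80, 170],
})
events = pd.DataFrame({
    'dteday': ['2011-01-01', '2011-01-03'],
    'event': ['parade', 'marathon'],
})

# Option 1: append a column from another dataframe by matching on a shared key.
# how='left' keeps every row of `rides`; days without an event get NaN.
enriched = rides.merge(events, on='dteday', how='left')

# Option 2: create a new feature from columns that already exist.
enriched['casual_share'] = enriched['casual'] / (enriched['casual'] + enriched['registered'])

print(enriched)
```

A left merge is used here so that no rows of the original dataframe are dropped when the second table has no matching entry.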

Let's create a feature called ```bike_usage``` which takes values “Below Average” or “Above Average” based on the given feature “cnt”.

In [None]:
bike_sharing['cnt'].mean()

In [None]:
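# Use the computed mean (about 4504) as the threshold instead of hard-coding it.
mean_count = bike_sharing['cnt'].mean()

def number_category(count):
    if count <= mean_count:
        return 'Below Average'
    else:
        return 'Above Average'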

In [None]:
bike_sharing['bike_usage'] = bike_sharing['cnt'].apply(number_category)

In [None]:
def number_category_binary(category):
    if category == 'Below Average':
        return 0
    else:
        return 1

In [None]:
bike_sharing['bike_usage_binary'] = bike_sharing['bike_usage'].apply(number_category_binary)

### Validating the Data

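In the validation process, we verify that the data is accurate and internally consistent.  Let’s ensure that registered + casual = cnt. If this holds in every row, the only unique value of the difference computed below is 0.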

In [None]:
total = bike_sharing['registered'] + bike_sharing['casual']

In [None]:
(bike_sharing['cnt'] - total).unique()

### Feature Scaling

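Feature scaling converts numeric features to a common range, so that features measured on large scales do not dominate distance-based models such as k-Nearest Neighbors. Checking the minimum of each feature is one step toward seeing the range each feature covers.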

In [None]:
bike_sharing.min()
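
One common way to bring features to a uniform range is min-max scaling, which maps each column onto [0, 1]. The sketch below uses made-up columns (`temperature_c`, `wind_kmh`) rather than the bike sharing data, and is meant only as an illustration of the idea.

```python
import pandas as pd

# Made-up values for illustration; not taken from the bike sharing data.
toy = pd.DataFrame({
    'temperature_c': [5.0, 10.0, 20.0, 25.0],
    'wind_kmh': [10.0, 30.0, 20.0, 50.0],
})

# Min-max scaling: subtract each column's minimum, then divide by its range.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
print(scaled.min())  # every column now starts at 0
print(scaled.max())  # and ends at 1
```

scikit-learn's `MinMaxScaler` applies the same transformation; when modeling, it is usually fit on the training set only so that the test set does not influence the scaling.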

## Data Modeling

Consider predicting ```bike_usage``` from ```temp``` and ```windspeed```.  First, we split the data into a training set and a test set, holding out 20% of the rows for testing.

In [None]:
X = bike_sharing[['temp', 'windspeed']]
y = bike_sharing['bike_usage']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=124)

Let's use k-Nearest Neighbors for Classification.  The documentation is linked [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

**Algorithm**

1. Select the number of neighbors to consider, k.
2. Calculate the distance between all labeled instances and the instance to predict.
3. The k instances with the shortest distances are the nearest neighbors.
4. Use the k nearest neighbors' most common output value as the predicted value.
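
The four steps above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper `knn_predict` and the toy points are made up, and Euclidean distance is assumed (it is the default metric of `KNeighborsClassifier`).

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Step 2: distance from the query to every labeled instance.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k instances with the shortest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: the most common label among the neighbors is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up (temp, windspeed) points, for illustration only.
train_points = [(0.20, 0.30), (0.25, 0.25), (0.30, 0.20),
                (0.70, 0.15), (0.75, 0.10), (0.80, 0.20)]
train_labels = ['Below Average', 'Below Average', 'Below Average',
                'Above Average', 'Above Average', 'Above Average']

# Step 1: choose k, then predict for a new point.
prediction = knn_predict(train_points, train_labels, query=(0.72, 0.12), k=3)
print(prediction)
```

Choosing an odd k for a two-class problem avoids tied votes; this sketch does not handle ties in any special way.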

In [None]:
bikeClassifier = KNeighborsClassifier(n_neighbors=3)
bikeClassifier.fit(X_train, y_train)

In [None]:
predicted_usage = bikeClassifier.predict(X_test)

In [None]:
predicted_bike_sharing = pd.DataFrame({
    'temp': X_test['temp'],
    'windspeed': X_test['windspeed'],
    'bike_usage': y_test,
    'predicted_usage': predicted_usage,
})

In [None]:
predicted_bike_sharing

In [None]:
sns.scatterplot(x=predicted_bike_sharing['temp'], y=predicted_bike_sharing['windspeed'], hue=predicted_bike_sharing['bike_usage'])

In [None]:
sns.scatterplot(x=predicted_bike_sharing['temp'], y=predicted_bike_sharing['windspeed'], hue=predicted_bike_sharing['predicted_usage'])

In [None]:
total_in_test = len(predicted_bike_sharing)

In [None]:
number_incorrect = (predicted_bike_sharing['bike_usage'] != predicted_bike_sharing['predicted_usage']).sum()

In [None]:
(total_in_test - number_incorrect) / total_in_test

In [None]:
metrics.accuracy_score(y_test, predicted_usage)

Use Logistic Regression to predict ```bike_usage_binary``` from ```temp``` and ```windspeed```.
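
For intuition, logistic regression passes a weighted sum of the features through the sigmoid function to get a probability, and predicts class 1 when that probability exceeds 0.5. The sketch below uses hypothetical coefficients chosen by hand; they are not the values a fitted model would produce on this data.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and coefficients, chosen for illustration only.
intercept = -2.0
coef_temp = 5.0
coef_windspeed = -1.0

def predict_binary(temp, windspeed):
    """Return (predicted class, probability of class 1) for one observation."""
    score = intercept + coef_temp * temp + coef_windspeed * windspeed
    probability = sigmoid(score)
    return (1 if probability > 0.5 else 0), probability

label_warm, p_warm = predict_binary(temp=0.7, windspeed=0.1)
label_cold, p_cold = predict_binary(temp=0.2, windspeed=0.3)
print(label_warm, round(p_warm, 3))
print(label_cold, round(p_cold, 3))
```

With these made-up weights, a warm and calm day lands above the 0.5 threshold and a cold and windy day lands below it.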

In [None]:
y_binary = bike_sharing['bike_usage_binary']

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y_binary, test_size=0.2, random_state=124)

In [None]:
bikeLogRegression = LogisticRegression()
bikeLogRegression.fit(X_train2, y_train2)

In [None]:
y_log_pred = bikeLogRegression.predict(X_test2)

In [None]:
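# Both splits use the same X, test_size, and random_state, so the test rows
# are in the same order as the rows of predicted_bike_sharing.
predicted_bike_sharing['log_predicted_usage'] = y_log_pred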

In [None]:
sns.scatterplot(x=predicted_bike_sharing['temp'], y=predicted_bike_sharing['windspeed'], hue=predicted_bike_sharing['log_predicted_usage'])

In [None]:
metrics.accuracy_score(y_test2, y_log_pred)