<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/999_trial-lecture.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Data Science Fundamentals: Trial Lecture
___

## 1. Preparation
___

In Python, we often work with packages. A package is a collection of ready-made functions and tools that we can use in our code to make our lives easier. We load a package into our code with the `import` statement.

In [None]:
# Import the pandas package and call it pd
import pandas as pd 

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

Let us begin by importing the data. We can use the `read_csv` function from the pandas package to read it in. The data is stored in a file called `bike_rental.csv` in the `data` folder. Before loading it in Python, let's have a look at the file in the CSV explorer in JupyterLab.

In [None]:
# Load the data using the pandas read_csv function
bike_rentals = pd.read_csv(f"{DATA_PATH}/bike_rental.csv")

The data set `bike_rentals` contains hourly bike rental data from Washington, D.C., for 2011 and 2012. It is typical of the data sets you will encounter in your work as a data scientist. It contains information about the weather, the time of day, and the number of bikes rented. The data set is based on the [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).


Let us start by thinking about this data set. We can ask ourselves a few questions:

1. Using the CSV explorer, can we find out what the columns mean?
2. What could we use this data for?
3. How can we create value from this data?
4. Can we use the data in its raw form, or do we need to preprocess it?
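
The fourth question can be approached with a few pandas calls. The sketch below uses a small hand-made frame with made-up values (named `toy`, a hypothetical stand-in for `bike_rentals`) so that it runs without downloading anything. The same calls should work on the real data.

```python
import pandas as pd

# Hypothetical stand-in for bike_rentals: all values are made up
toy = pd.DataFrame({
    "season": [1, 1, 2, 3, 4],
    "temp": [0.24, 0.22, None, 0.70, 0.40],
    "cnt": [16, 40, 110, 250, 90],
})

# Data type of each column (integers, floats, text, ...)
print(toy.dtypes)

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of distinct values per column, useful to spot categorical codes
distinct_per_column = toy.nunique()
print(distinct_per_column)
```

Columns with missing values or with only a handful of distinct integer codes are typical candidates for preprocessing.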

In [None]:
# Of course, the data can be inspected directly in the notebook as well
bike_rentals.head() # Show the first 5 rows

In [None]:
# Number of observations (data points) per season
bike_rentals["season"].value_counts()

In [None]:
# Summary statistics for the temperature
bike_rentals["temp"].describe()

## 2. Exploring data with graphical tools
___

A good starting point when working with data is to explore it. For many of us, this is easier when we can see the data, so we turn to graphical tools. In this section, we will use the `matplotlib` package to create visualizations of our data.

In [None]:
# Import the matplotlib package and call it plt
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots() # Create a figure containing a single axes.
ax.hist(bike_rentals["cnt"]); # Plot a histogram of the number of bike rentals

Is the above plot useful? Can we make it a bit nicer?

In [None]:
fig, ax = plt.subplots(figsize=(9, 6)) # Create a figure containing a single axes.
# Plot a histogram of the number of bike rentals, but this time with 30 bins, in green
ax.hist(bike_rentals["cnt"], bins=30, color="darkgreen")
# Add axis labels and a title
ax.set_xlabel("Number of bike rentals per hour")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of bike rentals per hour")

Experiment with different values for `bins` or `color` to see what happens. Be careful: not every color name will work. Here is a list of names that do:

![](https://matplotlib.org/stable/_images/sphx_glr_colors_004.png)
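
To see what `bins` changes, it can help to look at the counts behind the bars. `ax.hist` bins the data with `numpy.histogram`, so calling that function directly shows the numbers that would be drawn. The values in `toy_cnt` are made up and only stand in for `bike_rentals["cnt"]`.

```python
import numpy as np

# Made-up hourly rental counts, standing in for bike_rentals["cnt"]
toy_cnt = np.array([5, 12, 18, 40, 45, 52, 110, 150, 160, 300])

# np.histogram returns the count per bin and the bin edges
counts_3, edges_3 = np.histogram(toy_cnt, bins=3)
counts_6, edges_6 = np.histogram(toy_cnt, bins=6)

# Few bins: a coarse picture with many values lumped together
print(counts_3, edges_3)
# More bins: finer detail, but some bins may end up empty
print(counts_6, edges_6)
```

With `n` bins there are always `n + 1` edges, and the counts always add up to the number of observations.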

What we want to predict is the value `cnt`. Can you see why this is an interesting value to predict?

How could we go about predicting this value? What information do we have available?

For instance, do you think there is a difference in bike rentals in the different seasons? Let's have a look at the data.

In [None]:
# For each season, plot a histogram of the number of bike rentals
fig, ax = plt.subplots(figsize=(9, 6)) # Create a figure containing a single axes.

# Plot the histogram for the spring season
ax.hist(bike_rentals.loc[bike_rentals["season"] == 2, "cnt"],
        label="Spring", color="lime", alpha=0.5, bins=50, edgecolor="black")
# Plot the histogram for the fall season
ax.hist(bike_rentals.loc[bike_rentals["season"] == 4, "cnt"],
        label="Fall", color="sienna", alpha=0.5, bins=50, edgecolor="black")
ax.set_xlabel("Number of bike rentals per hour")
ax.set_ylabel("Frequency")
ax.set_title("Seasonal distribution of bike rentals per hour")
ax.legend(title="Season"); # Add a legend

Do you think the bike rental business is better in the winter or in the summer?

Go ahead and visualize winter and summer by building on the code above, using the remaining values in the `season` column. Play around with the bins and colors to make the plot more informative.
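
Histograms are one view of the seasons; a numeric summary per season is another. Here is a minimal sketch with the pandas `groupby` method, again on made-up values (`toy` is a hypothetical stand-in for `bike_rentals`):

```python
import pandas as pd

# Hypothetical stand-in for bike_rentals: all values are made up
toy = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 3, 4, 4],
    "cnt": [20, 40, 100, 140, 200, 260, 90, 150],
})

# Mean, median, and number of observations of cnt for each season code
season_summary = toy.groupby("season")["cnt"].agg(["mean", "median", "count"])
print(season_summary)
```

Replacing `toy` with `bike_rentals` should give the same kind of table for the real data.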

## 3. Predicting bike rentals: the idea
___

As expected, the number of bike rentals fluctuates across the seasons. In particular, it is sensible to assume that the outside temperature has an effect on the demand for bike rentals. In this section, we will explore how we can use the data to predict the number of bike rentals.

Let us begin by exploring this data visually as well.

In [None]:
fig, ax = plt.subplots(figsize=(9, 6)) # Create a figure containing a single axes.
# Plot a scatter plot of the number of bike rentals against the temperature
ax.scatter(bike_rentals["temp"], bike_rentals["cnt"], alpha=0.5)
ax.set_xlabel("Normalized temperature (in degrees Celsius)")
ax.set_ylabel("Number of bike rentals per hour")
ax.set_title("Number of bike rentals per hour vs. temperature")

This plot looks somewhat strange. Can you see why? What is the problem with this plot?

We will not fix it for now; this is something you will see in the *real* course. However, we will add a trendline to the plot, that is, a curve that shows the trend in the data. We could use this curve to help us predict the number of bike rentals.

In [None]:
# Compute a LOWESS trendline for the scatter plot (LOWESS stands for locally
# weighted scatterplot smoothing). There is no need to understand this code
# for now; machine learning techniques are covered in the full course.
from statsmodels.nonparametric.smoothers_lowess import lowess
lowess_data = lowess(bike_rentals["cnt"], bike_rentals["temp"], frac=0.3)

In [None]:
fig, ax = plt.subplots(figsize=(9, 6)) # Create a figure containing a single axes.
# Plot a scatter plot of the number of bike rentals against the temperature
ax.scatter(bike_rentals["temp"], bike_rentals["cnt"], alpha=0.5)
# Add the lowess trendline
ax.plot(lowess_data[:, 0], lowess_data[:, 1], color="darkred", linewidth=3, 
    label="LOWESS")
ax.set_xlabel("Normalized temperature (in degrees Celsius)")
ax.set_ylabel("Number of bike rentals per hour")
ax.set_title("Number of bike rentals per hour vs. temperature")
ax.legend();
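
One way the trend curve could be used for prediction is to read off its value at a new temperature. The sketch below does this by linear interpolation with `numpy.interp`. The array `toy_trend` is made up; it only mimics the two-column layout of `lowess_data` (sorted x values in the first column, smoothed y values in the second).

```python
import numpy as np

# Made-up trend curve: column 0 is the temperature, column 1 the smoothed count
toy_trend = np.array([
    [0.1, 50.0],
    [0.3, 120.0],
    [0.5, 200.0],
    [0.7, 260.0],
    [0.9, 240.0],
])

# Temperatures for which we would like a prediction
new_temps = np.array([0.2, 0.6])

# Linearly interpolate between the neighboring points of the trend curve
predicted = np.interp(new_temps, toy_trend[:, 0], toy_trend[:, 1])
print(predicted)
```

`numpy.interp` expects the x values to be increasing, so on real data repeated temperatures may need to be dropped first.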

## 4. Linear Regression
___

For a predictive analysis, it can be valuable to use more than one variable. Our predictions will probably be more accurate if we do not restrict the model to the temperature but also account for the season, the day of the week, the hour of the day, and so on. Let's try this.

In [None]:
# Create a subset of the data containing only the columns we are interested in
bike_rentals_subset = bike_rentals[["season", "mnth", "hr", "holiday", "weekday",
    "workingday", "weathersit", "temp", "hum", "windspeed", "cnt"]]

In [None]:
# Import the regression model from the scikit-learn machine learning package
from sklearn.linear_model import LinearRegression

In [None]:
# Set up a regression model with the number of bike rentals as the outcome and
# all other variables as predictors
model = LinearRegression().fit(bike_rentals_subset.drop("cnt", axis=1),
    bike_rentals_subset["cnt"])
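
`LinearRegression` fits ordinary least squares: it picks an intercept and one coefficient per predictor so that the sum of squared differences between predictions and actual values is as small as possible. Here is a minimal sketch of the same idea with NumPy, on made-up numbers (the names `X`, `y`, and `X_design` are illustrative and not part of the notebook):

```python
import numpy as np

# Made-up predictors (two columns) and an outcome built from a known rule:
# y = 10 + 2 * x1 + 5 * x2
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 1.0]])
y = 10 + 2 * X[:, 0] + 5 * X[:, 1]

# Prepend a column of ones so that the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: coefficients minimizing the sum of squared residuals
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)
```

Because the toy outcome follows the rule exactly, the fit recovers the intercept and both coefficients. Real data is noisy, so the fitted line only approximates it.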

In [None]:
# Plot a histogram of the residuals / errors of the model
fig, ax = plt.subplots(figsize=(9, 6)) # Create a figure containing a single axes.
ax.hist(bike_rentals_subset["cnt"] - 
        model.predict(bike_rentals_subset.drop("cnt", axis=1)), bins=50, 
        color="darkgreen", edgecolor="black")
ax.set_xlabel("Residuals")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of residuals of the linear regression model")

In [None]:
# Visualize the predicted number of bike rentals against the actual number of
# bike rentals
fig, ax = plt.subplots(figsize=(9, 6)) # Create a figure containing a single axes.
ax.scatter(model.predict(bike_rentals_subset.drop("cnt", axis=1)),
    bike_rentals_subset["cnt"], alpha=0.5)
# Add the diagonal on which predicted and actual values are equal
ax.plot([0, 500], [0, 500], color="darkred", linewidth=3, label="Ground truth",
        linestyle="-.")
ax.set_xlabel("Predicted number of bike rentals per hour")
ax.set_ylabel("Actual number of bike rentals per hour")
ax.set_title("Predicted vs. actual number of bike rentals per hour")
ax.legend();

What do you think of this model? Is it a good model? Why? Why not?

In the certificate program, you will learn how to evaluate models, including measures that quantify a model's predictive success.
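
As a preview, three measures commonly used for regression models are the mean absolute error (MAE), the root mean squared error (RMSE), and the coefficient of determination (R²). The sketch below computes them with NumPy on made-up numbers. It is only an illustration and not necessarily the set of measures covered in the program.

```python
import numpy as np

# Made-up actual and predicted rental counts, for illustration only
actual = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 190.0, 270.0])

errors = actual - predicted

# MAE: average size of the errors, in the same unit as the outcome
mae = np.mean(np.abs(errors))

# RMSE: like MAE, but large errors are penalized more heavily
rmse = np.sqrt(np.mean(errors ** 2))

# R²: share of the variation in the outcome that the predictions explain
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)
```

Lower MAE and RMSE mean better predictions, while R² is better the closer it is to 1.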

## 5. Discussion
___

- What is **machine learning**?
- What is **artificial intelligence**?
- What is **data science**?