# Exploratory Data Analysis for Beginners (Bike Riding Data Set)

## 1. Preface

Exploratory data analysis is a key step when working with data: it gives you an impression of how the data is structured. This matters especially when you later want to choose a suitable algorithm for making predictions. This notebook shows beginners how to get started with their data analysis. Besides providing the code, it also shows how to structure an analysis, how to extract information from data, and how to build hypotheses and theories from the newly gained insights.

### Data Set Information 
Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. In contrast to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

### Attribute Information
Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv.

- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from [Web Link])
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
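
If you want temperatures back in degrees Celsius, the min-max formula from the attribute list can be inverted. The following is a minimal sketch that takes the stated constants t_min=-8 and t_max=+39 at face value; the helper name `denormalize_temp` and the sample values are made up for illustration.

```python
import pandas as pd

# Constants as stated in the attribute description for "temp" (assumed to be correct)
T_MIN, T_MAX = -8.0, 39.0

def denormalize_temp(temp_norm, t_min=T_MIN, t_max=T_MAX):
    """Invert (t - t_min) / (t_max - t_min) to recover degrees Celsius."""
    return temp_norm * (t_max - t_min) + t_min

# Hypothetical normalized values, not taken from the data set
sample = pd.Series([0.0, 0.5, 1.0])
temp_celsius = denormalize_temp(sample)
print(temp_celsius.tolist())
```

For "atemp" the same helper could be called with t_min=-16 and t_max=50.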

This information was taken from:
> https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

## 2. Data Import and Data Manipulation

First we always have to fix some annoying things (missing data, badly formatted frames, wrong data types, etc.). That is what is meant by data manipulation.

### Import Libraries

When working with Python, it's always a good idea to load numpy and pandas. You will need them for loading, cleaning and manipulating data. For visualization I recommend using seaborn in addition to matplotlib.pyplot.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Read in Data

In [None]:
df_daily = pd.read_csv("../input/bike_sharing_daily.csv")
df_hourly = pd.read_csv("../input/bike_sharing_hourly.csv")

### Head and Tail
The first thing I look at is the head and the tail of a data set. If you work in an environment like Spyder, you might ask: "Why not just click on the dataframe in the variable explorer instead of typing all this into my console?" But note that data sets can be big, and then your environment will need a while to show you the whole frame.

In [None]:
df_daily.head()

In [None]:
df_daily.tail()

We use the columns attribute of "df_hourly" and see it has the same columns as df_daily plus the hour of the day ('hr').

In [None]:
df_hourly.columns

### Missing Values
Fortunately we have no missing data in our data set. But normally that's the exception and not the rule.

In [None]:
df_daily.isnull().sum()

In [None]:
df_hourly.isnull().sum()

Unfortunately, we have "missing missing data": whole observations (rows) are simply absent. We can see that when we count the rows for every hour. The counts should be the same for every hour, because every day contains each hour exactly once. Normally we should impute missing data (make educated guesses). Because this is a topic of its own, especially when the missing data itself is missing, we will discuss it in an upcoming kernel.

In [None]:
df_hourly['hr'].value_counts()
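
To find out which hours are absent, one option is to build the complete hourly range and compare it with the timestamps that actually occur. This is only a sketch on a small made-up frame (`toy`) that uses the same column names as the hourly data; it is not run against the real file here.

```python
import pandas as pd

# Made-up hourly records: hour 2 of the first day is absent, as are hours 4-23
toy = pd.DataFrame({
    "dteday": ["2011-01-01"] * 3 + ["2011-01-02"] * 2,
    "hr": [0, 1, 3, 0, 1],
    "cnt": [16, 40, 13, 17, 17],
})

# Combine date and hour into one timestamp per row
timestamps = pd.DatetimeIndex(
    pd.to_datetime(toy["dteday"]) + pd.to_timedelta(toy["hr"], unit="h")
)

# Every hour between the first and the last observation
expected = pd.date_range(timestamps.min(), timestamps.max(),
                         freq=pd.Timedelta(hours=1))

# Hours that should exist but have no row
missing_hours = expected.difference(timestamps)
print(len(missing_hours), "hours without a record")
```

The resulting index could later serve as the target of a `reindex` call if you decide to impute the gaps.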

### Repairing Data Types

Besides missing data and missing missing data, there is another thing that can make you lose your cool while working with data: wrong data types. When we look at the "dtypes" attribute of one of the frames, we see that some variables have an impractical data type. For example, "workingday" should be a binary variable that indicates whether the day is a working day (1) or not (0). But when the data type is set to integer, this can cause trouble: regression techniques, for example, would treat the variable as an integer instead of a binary variable.

In [None]:
df_daily.dtypes

Because we have to repair the data types for both frames, and because we also want to learn something about programming as a side effect, we write a function for this task. It takes the dataframe, converts the variable "dteday" into datetime format and all the categorical data into 'category' or 'bool' format, gives meaningful names to all categories and returns the repaired frame.

In [None]:
def fixing_datatypes(df):
    # Fixing the datatypes
    # pd.to_datetime works across pandas versions; astype('datetime64') without a unit does not
    df['dteday'] = pd.to_datetime(df['dteday'])
    df[['holiday','workingday']] = df[['holiday','workingday']].astype('bool')
    # Convert column by column: assigning through .loc does not reliably change the dtype
    for col in ['season', 'yr', 'mnth', 'weekday', 'weathersit']:
        df[col] = df[col].astype('category')

    # Get meaningful names for the categorical variables
    mapping_season = {1:"1_Winter", 2:"2_Spring", 3:"3_Summer", 4:"4_Fall" }
    mapping_weekdays = {0:"Sunday", 1:"Monday", 2:"Tuesday", 3:"Wednesday", 
                        4:"Thursday", 5:"Friday", 6:"Saturday"}
    mapping_weather = {1:"good", 2:"medium", 3:"poor", 4:"very_poor" }

    # Mapping a categorical column renames its categories and keeps their order
    df["season"] = df["season"].map(mapping_season)
    df["weekday"] = df["weekday"].map(mapping_weekdays)
    df["weathersit"] = df["weathersit"].map(mapping_weather)

    return df

In the next step we apply our function to both dataframes. We also repair the variable 'hr' (hour), which exists only in the hourly dataframe.

In [None]:
df_daily = fixing_datatypes(df_daily)
df_hourly = fixing_datatypes(df_hourly)

df_hourly['hr'] = df_hourly['hr'].astype('category')

We use the dates as index. This is useful when we want to plot something as a time-series or resample the data later.

In [None]:
df_daily.set_index('dteday', inplace=True)

After getting a description of the numerical weather data, we could ask whether we want to scale the data differently. But I think that is beyond the scope of this notebook.

In [None]:
df_daily[['temp','hum','windspeed']].describe()

## 3. Descriptive Data Analysis

In my opinion, it is good practice to split the analysis into a purely descriptive part and an exploratory part. In the descriptive part you can concentrate on what you see in the data before thinking and making assumptions about how and why. This prevents us from getting lost in the data or being prejudiced about potential results.

### Counting Cases

On nearly two thirds of the days the weather situation in Washington, D.C. was good, on one third it was medium (cloudy), and on a few days it was poor. The category "very poor" (heavy rain, thunderstorm, etc.) does not occur at all in the daily data during this period.

In [None]:
df_daily['weathersit'].value_counts(normalize=True)

To get an overview of which weather situations occurred in which season, we use the crosstab function. We see that good weather occurred more often during the summer and poor weather more often in the winter (remember from the variable description that weathersit is about rainfall and clouds, not about temperature, so this is not as trivial as it sounds). Since we do not have much data here, this could also be due to chance. Here we could search for more weather data for D.C. or ask a meteorologist.

In [None]:
pd.crosstab(df_daily.weathersit, df_daily.season)
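
Absolute counts are hard to compare when the seasons do not contain the same number of days. A possible refinement is to let `crosstab` normalize each column, so that every season sums to 1. The frame `toy` below is invented for illustration and only mimics the column names and labels used above.

```python
import pandas as pd

# Invented observations, one row per day
toy = pd.DataFrame({
    "season": ["1_Winter", "1_Winter", "1_Winter",
               "3_Summer", "3_Summer", "3_Summer"],
    "weathersit": ["good", "medium", "poor", "good", "good", "medium"],
})

# normalize="columns" turns counts into shares within each season
shares = pd.crosstab(toy["weathersit"], toy["season"], normalize="columns")
print(shares.round(2))
```

Read column by column, the table then answers "what share of the days in this season had good weather?".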

### Plotting Processes

The variables "temp" and "atemp" seem to be highly correlated, with the exception of a single data point in late summer 2012 (the data set only covers 2011 and 2012).
Therefore we should drop one of the two variables to avoid multicollinearity if we want to create a statistical model. The temperature has a seasonal trend, whereas humidity and wind speed look rather noisy.

In [None]:
df_daily.loc[:,'temp':'windspeed'].plot(subplots=True)
plt.xlabel("")
plt.tight_layout()
plt.show()
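
The visual impression can be backed up with two numbers: the correlation between the two columns and the day on which they disagree most. This is a rough sketch on a made-up frame (`toy`) with one deliberately corrupted "atemp" value; the real data may behave differently.

```python
import pandas as pd

# Made-up daily values; the third "atemp" entry is deliberately off
idx = pd.date_range("2012-01-01", periods=5, freq="D")
toy = pd.DataFrame({"temp":  [0.30, 0.35, 0.40, 0.45, 0.50],
                    "atemp": [0.29, 0.34, 0.10, 0.44, 0.49]}, index=idx)

# Day with the largest absolute gap between the two temperature columns
gap = (toy["temp"] - toy["atemp"]).abs()
suspect_day = gap.idxmax()

# Pearson correlation with and without the suspicious day
corr_all = toy["temp"].corr(toy["atemp"])
clean = toy.drop(index=suspect_day)
corr_clean = clean["temp"].corr(clean["atemp"])

print(suspect_day.date(), round(corr_all, 2), round(corr_clean, 2))
```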

Now we pay attention to our dependent variable (the one we want to predict). Here we plot a timeseries of the count of bikes that were rented per day.

In [None]:
df_daily.cnt.plot(title = "Count of total rental bikes per day")
plt.xlabel("")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

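We can aggregate the counts by month to get a smoother curve. Note that asfreq('M') would only pick the value of the last day of each month, so we sum the daily counts with resample('M') instead. Rentals are clearly lowest in the winter months.

In [None]:
# Sum the daily counts per calendar month (recent pandas versions spell the alias 'ME')
df_daily['cnt'].resample('M').sum().plot(title = "Count of total rental bikes per month")
plt.xlabel("")
plt.ylabel("Count")
plt.tight_layout()
plt.show()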
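
The difference between picking one value per month and aggregating a whole month is easy to overlook, so here is a small sketch on an invented series with exactly one rental per day. It uses the offset object `pd.offsets.MonthEnd()` instead of a string alias, which is an assumption made to stay independent of the pandas version.

```python
import pandas as pd

# Invented data: one rental on every day of January and February 2011
idx = pd.date_range("2011-01-01", "2011-02-28", freq="D")
toy = pd.Series(1, index=idx, name="cnt")

month_end = pd.offsets.MonthEnd()

# asfreq only keeps the rows that fall on a month end
picked = toy.asfreq(month_end)

# resample groups all days of a month and aggregates them
summed = toy.resample(month_end).sum()

print(picked.tolist(), summed.tolist())
```

Only the resampled version reflects the number of days, and therefore the rentals, in each month.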

Now we visualise how the proportion between registered and casual (unregistered) users develops over time. In January there were very few casual users, and their proportion compared to registered users is also very low. In spring the proportion of casual users rises. The gap between the total count and the rentals generated by registered users is very interesting: it is larger in the summer months and very small in the winter.

In [None]:
# resample(...).sum() aggregates each month; asfreq('M') would only keep the last day
df_daily[['casual','registered','cnt']].resample('M').sum().plot()
plt.xlabel("")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
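
The plot shows absolute numbers, so the proportion has to be read off by eye. A possible complement is to compute the monthly share of casual users directly. The sketch below uses an invented frame (`toy`) with the same column names; the numbers are not taken from the data set.

```python
import pandas as pd

# Invented daily counts for two days in January and two in February
idx = pd.to_datetime(["2011-01-15", "2011-01-16", "2011-02-15", "2011-02-16"])
toy = pd.DataFrame({"casual": [10, 20, 100, 100],
                    "registered": [90, 80, 100, 100]}, index=idx)
toy["cnt"] = toy["casual"] + toy["registered"]

# Aggregate to months first, then divide, so that busy days get more weight
monthly = toy.resample(pd.offsets.MonthEnd()).sum()
casual_share = monthly["casual"] / monthly["cnt"]

print(casual_share.round(2).tolist())
```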

### Overview and Outlier Detection

The description of the dependent variables shows that, on average, the number of bikes rented by registered users is about four times as high as for casual users. Unlike the monthly variation, the daily variation is higher for registered users (you can see that from the standard deviation).

In [None]:
df_daily[['casual','registered','cnt']].describe()

Now we have a look at potential outliers. Let's start with the day with the fewest rentals. The 29th of October 2012 was officially a normal working day, but a state of emergency prevailed in D.C. and Maryland because of Hurricane Sandy. In a nutshell, it is not an outlier in the sense of a mistake in the dataset (e.g. when somebody mistypes a value). But a state of emergency (or state of exception) is an exception by definition, so you could also argue that it is an outlier.

In [None]:
df_daily[df_daily.cnt == df_daily.cnt.min()]

The 15th of September 2012 was a warm Saturday, so 8714 rentals are not enough to surprise us.

In [None]:
df_daily[df_daily.cnt == df_daily.cnt.max()]
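
Looking only at the minimum and the maximum finds at most two candidates. A common rule of thumb, which is not what this notebook uses, flags everything outside 1.5 times the interquartile range (Tukey's fences). The following sketch applies it to an invented series; the threshold 1.5 is the conventional default, not something derived from the bike data.

```python
import pandas as pd

# Invented daily counts with one very low day
cnt = pd.Series([4000, 4500, 5000, 5500, 6000, 22], name="cnt")

# Quartiles and interquartile range
q1, q3 = cnt.quantile(0.25), cnt.quantile(0.75)
iqr = q3 - q1

# Tukey's fences: anything outside is flagged as a potential outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = cnt[(cnt < lower) | (cnt > upper)]

print(outliers.tolist())
```

Whether a flagged day is an error or a real event (such as a hurricane) still has to be decided by looking at the context.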

## 4. Exploratory Data Analysis

After our descriptive analysis, we can make some suppositions about the relationships between our variables, build some theories and elaborate on them. This is perhaps the most interesting part of working with data, because you need some creativity and craftiness here. Some things that stand out are:
- Registered and casual users seem to behave differently.
- There is a seasonal effect and an increasing trend (see rented bikes per month), but this does not explain all the variance (see rented bikes per day). So other variables, or interactions between variables, must be responsible for the remaining unexplained variance.

### Conditional Means

To start the exploratory part of the analysis, conditional means are often a good choice. They allow us to quickly notice possible effects of an independent variable, or a combination of independent variables, on the outcome variable. But please be aware that we do not run hypothesis tests in this kernel (perhaps a good topic for the next one), so we must make our assertions with caution.

The mean of rentals per day for all observations serves as our benchmark.

In [None]:
df_daily['cnt'].mean()

Now we calculate the mean of rentals grouped by weather situation. It seems that people are more likely to use a bike when the weather is better. Makes sense to me.

In [None]:
df_daily.groupby('weathersit')['cnt'].mean()

Because we set the date (in datetime format) as our index, we can use resample() to get conditional means aggregated by year and month. Don't get confused by the output: each row is labelled with the last day of the period, but the value is the mean for the whole year and not just for New Year's Eve. The rentals were obviously higher in 2012 than in 2011, but two years are not enough for a time series analysis. If we were living in January 2013 and wanted to make predictions for that year, it would be difficult to use this trend. It could be that 2011 was the first year of the business, customers had to get used to this whole bike thing, and in 2012 the rentals perhaps reached their "natural" level. But if you want to make in-sample predictions for the training competition, you can of course use this trend.

In [None]:
df_daily.resample('A').cnt.mean()

This also works for months.

In [None]:
df_daily.resample('M').cnt.mean()

Here we pool both years, so winter means both winters (2011 and 2012); therefore we must use "groupby" instead of "resample". Another thing that makes sense: on an average summer day more bikes were rented than on an average winter day.

In [None]:
df_daily.groupby('season')['cnt'].mean()
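
The distinction between the two methods can be sketched on a tiny invented series: `resample` keeps the timeline and yields one value per calendar period, while `groupby` on a part of the date pools the same period across years. The offset object `pd.offsets.YearEnd()` is used here as a version-independent stand-in for the string alias.

```python
import pandas as pd

# Invented counts: one winter and one summer day in each of two years
idx = pd.to_datetime(["2011-01-10", "2011-07-10", "2012-01-10", "2012-07-10"])
toy = pd.Series([1000, 4000, 2000, 6000], index=idx, name="cnt")

# One mean per calendar year (the timeline is preserved)
per_year = toy.resample(pd.offsets.YearEnd()).mean()

# One mean per month number, pooled over both years
per_month_pooled = toy.groupby(toy.index.month).mean()

print(per_year.tolist(), per_month_pooled.to_dict())
```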

There seems to be almost no difference between working days and non-working days (weekends and holidays), but on holidays fewer bikes were rented.

In [None]:
df_daily.groupby('workingday')['cnt'].mean()

In [None]:
df_daily.groupby('holiday')['cnt'].mean()

Now that you are familiar with the concept, we can go really crazy and compute the mean of rented bikes for combinations of features to get more interesting insights. For example, we see no rentals at all for the combination of holiday and poor weather. Either nobody used a bike on such days or, more plausibly, there simply was no holiday with poor weather in the data (I told you: because we have not run any hypothesis tests and do not have much data, we have to be careful with conclusions). On days with medium weather, however, a difference between holidays and other days can also be seen.

In [None]:
df_daily.groupby(['holiday','weathersit'])['cnt'].mean()
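
One way to guard against over-interpreting such means is to show the number of observations next to each mean. The sketch below does this with `agg` on an invented frame (`toy`); a combination that never occurs simply does not appear in the result.

```python
import pandas as pd

# Invented days: there is no holiday with poor weather
toy = pd.DataFrame({
    "holiday":    [False, False, False, True, True],
    "weathersit": ["good", "good", "poor", "good", "medium"],
    "cnt":        [5000, 4000, 2000, 3000, 2500],
})

# Mean and group size side by side
summary = toy.groupby(["holiday", "weathersit"])["cnt"].agg(["mean", "count"])
print(summary)
```

A mean that is based on one or two days should be read as an anecdote rather than as an effect.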

On weekends it was different: across the weather situations there was not much difference between working days and the weekend.

In [None]:
df_daily.groupby(['workingday','weathersit'])['cnt'].mean()

If we compare the holiday-boolean with the workingday-boolean effect, we can see similar patterns.

In [None]:
df_daily.groupby(['holiday','season'])['cnt'].mean()

In [None]:
df_daily.groupby(['workingday','season'])['cnt'].mean()

### Boxplots

Boxplots are another simple and fast method to get insights from a dataset. The first boxplot shows how the bike rentals per hour are distributed over the hours of the day. For example, if 200 bikes were rented between 10 and 11 p.m. on the 5th of July 2011, this is one observation with the coordinates x = 22 and y = 200. This is done for every day and every hour. Because so many points would confuse us, the boxes show where the quartiles are and the line in the middle shows the median; only outliers are drawn as individual points.
We see that during the rush hours (around hours 8 and 17) we have higher medians and higher variance. Additionally, there are many outliers in the afternoon and even a few at night.

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.boxplot(x=df_hourly['hr'], y=df_hourly['cnt'])
plt.xlabel("Hour")
plt.ylabel("Count")
plt.title("Rentals on hour")
plt.show()
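
The numbers behind the boxes can also be computed directly, which helps when you want to compare hours exactly. This is a minimal sketch on an invented frame (`toy`) with two hours only; the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Invented hourly counts for a quiet hour (3) and a rush hour (8)
toy = pd.DataFrame({
    "hr":  [3] * 5 + [8] * 5,
    "cnt": [1, 2, 3, 4, 5, 100, 200, 300, 400, 500],
})

# Lower quartile, median and upper quartile per hour: the edges and middle line of each box
box = toy.groupby("hr")["cnt"].quantile([0.25, 0.5, 0.75]).unstack()
print(box)
```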

If we do the same for the months, we can see a high variance in the daily rentals for March and October.

In [None]:
sns.boxplot(x=df_daily.index.month, y=df_daily['cnt'])
plt.xlabel("Month")
plt.ylabel("Count")
plt.title("Rentals on month")
plt.show()

By contrast, for the days of the week the variance seems to be more even. It is high on every weekday, so the weekday alone, without further information, cannot explain much.

In [None]:
sns.boxplot(x=df_daily.index.weekday, y=df_daily['cnt'])
plt.xlabel("Weekday")
plt.ylabel("Count")
plt.title("Rentals on weekday")
plt.show()

### Correlation Matrix

As mentioned above, either "temp" or "atemp" has to be dropped, otherwise we would get multicollinearity. We also see this in the next matrix: these two variables have a Pearson correlation coefficient (the default setting for pandas.corr()) of over 0.99. The correlations between the dependent variables are interesting as well.
Not surprisingly, "cnt" and "registered" are highly correlated, but "registered" and "casual" have a correlation of less than 0.40. We also want to look at the correlation between the numerical and the categorical weather data. We can see a negative correlation between humidity and good weather. But we cannot fully describe one variable with another (this is a good example of when to use dimensionality reduction techniques like Principal Component Analysis, but that is a topic for another kernel).
Another piece of information we get from this matrix is that "instant" correlates more strongly with "registered" than with "casual". With that in mind, we can take a second look at the plot of "registered" and "casual" over time and see that the increasing trend in rentals is mostly driven by the registered users.

In [None]:
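df_tmp = pd.get_dummies(df_daily.weathersit)
# numeric_only=True (pandas 1.5 or newer) skips the categorical columns, which corr() cannot handle
pd.concat([df_tmp, df_daily], axis=1).corr(numeric_only=True)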
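
How a categorical variable ends up in a correlation matrix may not be obvious, so here is a reduced sketch of the dummy-coding step on an invented frame (`toy`). Each category becomes its own 0/1 column, and the Pearson correlation is then computed between those columns and the numerical ones.

```python
import pandas as pd

# Invented days: humid days coincide with poor weather
toy = pd.DataFrame({
    "weathersit": ["good", "good", "poor", "poor"],
    "hum":        [0.4, 0.5, 0.8, 0.9],
})

# One 0/1 column per category (dtype=float keeps the columns plainly numeric)
dummies = pd.get_dummies(toy["weathersit"], dtype=float)

# Pearson correlation between the dummy columns and humidity
corr_matrix = pd.concat([dummies, toy[["hum"]]], axis=1).corr()
print(corr_matrix.round(2))
```

With only two categories the two dummy columns are perfect mirror images of each other, which is one reason to drop one dummy column before fitting a regression model.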

## 5. Storytelling

Before telling our story, we have to be aware that we have not tested our hypotheses. We only have theories, not facts, and we were looking at just two years.

We found out that registered and casual users behaved differently, so it is likely that they are two completely different groups. It is probable that the registered users use the bike mainly for transportation (e.g. to go to work) and the casual users use it for leisure (tours or sightseeing). This theory would explain a lot of what we have observed. You remember that the increasing trend is mostly caused by the registered users? Our new theory could explain that. The number of registered users needs time to grow, since they must change their habits. They first have to register, then perhaps sell their car or their own bike. It could also be that they have monthly or yearly tickets for public transportation and just want to wait until those expire.

To make a long story short, the casual users can be more spontaneous. Tourists planning to go to the Washington Monument or lovers wanting to cycle through Rock Creek Park just take a bike and start their trip. We can also assume that spontaneous trips are made on holidays or during vacation, so the weather has a greater impact on their behaviour. During a hurricane, your partner really must be Mrs./Mr. Right if you go cycling in the park (please have a look at the casual column for 2012-10-29 and congratulate them💕). Also think of day tourists here: they perhaps come for a weekend and decide what they want to do depending on the weather.

On the other hand, the poor registered users have to go to work no matter whether it is raining or not. We also wondered why the impact of the weather is smaller on weekends than on holidays. An explanation could be that the registered users also have fixed appointments on the weekend (sports, brunch, church, etc.).