# Introduction
I am going to analyze and visualize some datasets. In this notebook, I analyze a time series dataset, [Parking Birmingham](https://archive.ics.uci.edu/ml/machine-learning-databases/00482/), downloaded from the UCI Machine Learning Repository.

# Time Series
Time series data is a series of data points or observations recorded over time, at regular or irregular intervals. In most analyses, a time series is treated as a sequence of data points taken at equally spaced time intervals. The frequency of recorded data points may be hourly, daily, weekly, monthly, quarterly or annually.

Time series analysis encompasses statistical methods for analyzing time series data. These methods enable us to extract meaningful statistics, patterns and other characteristics of the data. Time series are usually visualized with line charts. Time series analysis therefore involves understanding the inherent aspects of the data so that we can create meaningful and accurate forecasts.
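
As a small illustration of equally spaced observations and recording frequency, the sketch below builds a synthetic hourly series with pandas and aggregates it to a daily frequency. The values and the start date are made up for illustration and are not taken from the parking dataset.

```python
import numpy as np
import pandas as pd

# Synthetic hourly observations over two days (illustrative values only).
index = pd.date_range("2016-10-04 00:00", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(48, dtype=float), index=index)

# Change the frequency from hourly to daily by averaging each day's values.
daily = hourly.resample("D").mean()
print(daily)
```

Resampling to a coarser frequency smooths short-term variation, which can make longer-term patterns easier to see in a line chart.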



# Types of Data
Time series data is recorded at different time periods or intervals. More generally, data may be of three types:

1. **Time series data** - Observations of the values of a variable recorded at different points in time.
2. **Cross sectional data** - Data on one or more variables recorded at the same point in time.
3. **Pooled data** - A combination of time series data and cross sectional data.
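
The three types can be seen in one small table. The sketch below uses a hypothetical pooled data set (two made-up car parks, `A` and `B`, at three timestamps) and extracts a time series and a cross section from it.

```python
import pandas as pd

# Hypothetical pooled data: two car parks observed at three points in time.
pooled = pd.DataFrame({
    "park": ["A", "A", "A", "B", "B", "B"],
    "time": pd.to_datetime(
        ["2016-10-04 08:00", "2016-10-04 08:30", "2016-10-04 09:00"] * 2
    ),
    "occupancy": [10, 25, 40, 5, 15, 30],
})

# Time series data: one variable for one car park across time.
time_series = pooled[pooled["park"] == "A"].set_index("time")["occupancy"]

# Cross sectional data: all car parks at the same point in time.
cross_section = pooled[pooled["time"] == pd.Timestamp("2016-10-04 08:30")]

print(time_series)
print(cross_section)
```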

# Terminology
There are various terms and concepts in time series analysis that we should know. These are as follows:

1. **Dependence** - The association between an observation and observations of the same variable at prior time periods.
2. **Stationarity** - A series is stationary when its statistical properties, such as the mean, remain constant over time. If past effects accumulate and the values increase towards infinity, then stationarity is not met.
3. **Differencing** - Differencing is used to make a series stationary and to control autocorrelation. Some series do not require differencing, and an over-differenced series can produce wrong estimates.
4. **Specification** - Testing for linear or non-linear relationships among the dependent variables by using time series models such as ARIMA.
5. **Exponential Smoothing** - Exponential smoothing predicts the value for the next period from past and current values. It averages the data, with weights that decrease for older observations, so that the non-systematic components of individual observations cancel each other out. It is used for short-term prediction.
6. **Curve fitting** - Curve fitting regression is used when the data follow a non-linear relationship.
7. **ARIMA** - ARIMA stands for AutoRegressive Integrated Moving Average.
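
Three of these terms (dependence, differencing and exponential smoothing) can be sketched with pandas alone. The series below is synthetic, and the smoothing factor `alpha=0.5` is an arbitrary assumption chosen for illustration, not a recommended value.

```python
import numpy as np
import pandas as pd

# Synthetic series with a linear trend, so its mean is not constant over time.
series = pd.Series(2.0 * np.arange(10) + 5.0)

# Dependence: lag-1 autocorrelation between each value and the previous one.
lag1_autocorr = series.autocorr(lag=1)

# Differencing: first differences remove the linear trend.
differenced = series.diff().dropna()

# Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
smoothed = series.ewm(alpha=0.5, adjust=False).mean()

# The last smoothed value serves as the forecast for the next period.
next_period_forecast = smoothed.iloc[-1]

print(lag1_autocorr, differenced.unique(), next_period_forecast)
```

The differenced series is constant here, which is what makes a trending series like this one stationary after a single round of differencing.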

# About The Dataset
**Source:** Daniel H. Stolfi, dhstolfi '@' lcc.uma.es, University of Malaga - Spain.

**Data Set Information:** Occupancy rates (8:00 to 16:30) from 2016/10/04 to 2016/12/19

**Attribute Information:**

    SystemCodeNumber: Car park ID
    Capacity: Car park capacity
    Occupancy: Car park occupancy rate
    LastUpdated: Date and Time of the measure
    
**Data Set Characteristics:** Multivariate, Univariate, Sequential, Time-Series

**Number of Instances:** 35717

**Number of Attributes:** 4

# Data Visualization

Import necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from scipy.stats import pearsonr
%matplotlib inline

Import data set.

In [None]:
parking_data = pd.read_csv('/kaggle/input/dataset.csv')
parking_data.describe()

Remove inconsistent data, such as duplicate rows, negative occupancy, and occupancy values greater than capacity.

In [None]:
print('Before removing inconsistent data:', parking_data.shape)
parking_data.dropna(inplace=True)
parking_data.drop_duplicates(keep='first', inplace=True)
# Keep only rows with non-negative values and occupancy within capacity.
parking_data = parking_data[parking_data['Occupancy'] >= 0]
parking_data = parking_data[parking_data['Capacity'] >= 0]
parking_data = parking_data[parking_data['Occupancy'] <= parking_data['Capacity']]
print('After removing inconsistent data:', parking_data.shape)
parking_data.describe()

Split the LastUpdated column into Date and Time columns. Add new columns for the occupancy rate as a percentage and the day of the week.

In [None]:
parking_data['OccupancyRate'] = (100.0*parking_data['Occupancy'])/parking_data['Capacity']
dateTime = parking_data['LastUpdated'].str.split(" ", n = 1, expand = True) 
date = dateTime[0]
time = dateTime[1]
parking_data['Date'] = date
parking_data['Time'] = time
day_name = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
parking_data['DayOfWeek'] = pd.to_datetime(parking_data['Date']).dt.dayofweek.apply(lambda x: day_name[x])
parking_data.info()
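
As an alternative sketch, the same three columns can be derived by parsing the timestamp once with `pd.to_datetime` and using the `.dt` accessor. The two sample timestamps below are hypothetical, and the `YYYY-MM-DD HH:MM:SS` layout is an assumption based on how the Date and Time columns are used in this notebook.

```python
import pandas as pd

# Hypothetical sample in the assumed 'YYYY-MM-DD HH:MM:SS' layout.
sample = pd.DataFrame({
    "LastUpdated": ["2016-10-04 07:59:42", "2016-10-09 16:30:35"],
})

# Parse once, then derive every column from the parsed timestamps.
stamp = pd.to_datetime(sample["LastUpdated"], format="%Y-%m-%d %H:%M:%S")
sample["Date"] = stamp.dt.strftime("%Y-%m-%d")
sample["Time"] = stamp.dt.strftime("%H:%M:%S")
# day_name() returns English names by default; keep the first three letters.
sample["DayOfWeek"] = stamp.dt.day_name().str[:3]

print(sample)
```

Parsing once avoids repeated string handling and fails loudly if a timestamp does not match the expected layout.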

Plot occupancy rate for all car parks with respect to time.

In [None]:
park_name = parking_data['SystemCodeNumber'].unique()
# One scatter series per car park, plotted against the time of day.
for s in park_name:
    park = parking_data[parking_data['SystemCodeNumber'] == s]
    time = pd.to_datetime(park['Time'], format='%H:%M:%S')
    plt.scatter(time, park['OccupancyRate'], label=s)
plt.gcf().autofmt_xdate()
myFmt = mdates.DateFormatter('%H:%M')
plt.gca().xaxis.set_major_formatter(myFmt)
plt.xlabel('Time')
plt.ylabel('Occupancy Rate')
plt.show()
plt.close()

Plot the mean occupancy rate of each car park.

In [None]:
# Mean occupancy rate per car park, as a two-column frame for plotting.
xData = parking_data.groupby('SystemCodeNumber')['OccupancyRate'].mean()
df = xData.rename_axis('Park ID').reset_index(name='Mean Occupancy Rate')
ax = sns.barplot(y='Park ID', x='Mean Occupancy Rate', data=df, orient="h")
ax.set(ylabel="Car Park ID", xlabel="Mean Occupancy Rate")
ax.tick_params(axis='y', labelsize=7)


Scatter plot of occupancy rate of **BHMBCCMKT01** car park.

In [None]:
# Select the BHMBCCMKT01 car park named above.
park = parking_data[parking_data['SystemCodeNumber'] == 'BHMBCCMKT01']
time = pd.to_datetime(park['Time'], format='%H:%M:%S')
plt.scatter(time, park['OccupancyRate'], label='BHMBCCMKT01')
plt.gcf().autofmt_xdate()
myFmt = mdates.DateFormatter('%H:%M')
plt.gca().xaxis.set_major_formatter(myFmt)    
plt.xlabel('Time')
plt.ylabel('Occupancy Rate')
plt.show()
plt.close()

In [None]:
# Occupancy rate of the BHMEURBRD01 car park over the recorded dates.
park = parking_data[parking_data['SystemCodeNumber'] == 'BHMEURBRD01']
time = pd.to_datetime(park['Date'])
plt.plot(time, park['OccupancyRate'])
plt.gcf().autofmt_xdate()
plt.xlabel('Date')
plt.ylabel('Occupancy Rate')
plt.show()
plt.close()

Occupancy rates differ between weekends and workdays. Below, I plot the occupancy rate of the Shopping car park on a workday (2016-10-06, a Thursday) and on a weekend day (2016-10-09, a Sunday).

In [None]:
shopping = parking_data[parking_data['SystemCodeNumber'] == 'Shopping']
# 2016-10-06 is a Thursday (workday); 2016-10-09 is a Sunday (weekend).
for day in ['2016-10-06', '2016-10-09']:
    day_data = shopping[shopping['Date'] == day]
    time = pd.to_datetime(day_data['Time'], format='%H:%M:%S')
    plt.plot(time, day_data['OccupancyRate'], label=day)



plt.gcf().autofmt_xdate()
myFmt = mdates.DateFormatter('%H:%M')
plt.gca().xaxis.set_major_formatter(myFmt)
plt.xlabel('Time')
plt.ylabel('Occupancy Rate')
plt.legend()
plt.show()
plt.close()

We can plot the number of records for each day of the week. There are fewer records on weekends.

In [None]:
# catplot returns a FacetGrid; the days go on the x axis, so no orient is needed.
g = sns.catplot(x='DayOfWeek', kind='count', data=parking_data)
g.fig.autofmt_xdate()
g.set(xlabel="Week Days", ylabel="Count")

A box plot of the occupancy rate for each day of the week makes the difference between workdays and weekends clearer.

In [None]:
ax = sns.catplot(x = "DayOfWeek",y="OccupancyRate",kind='box',data=parking_data)
ax.set(xlabel="Week Days", ylabel = "Occupancy Rate")

We can plot a heatmap of the occupancy rate by date for each car park.

In [None]:
heatmap_data = pd.pivot_table(parking_data, values='OccupancyRate', 
                     index=['SystemCodeNumber'], 
                     columns='Date')
ax = sns.heatmap(heatmap_data , cmap="BuGn")
ax.set(ylabel="Car Park ID", xlabel = "Date")
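
To show what the pivot step produces before it reaches the heatmap, here is a minimal sketch on hypothetical records (the car park IDs `P1` and `P2` and all values are made up). When several readings share the same car park and date, `pd.pivot_table` averages them, because its default `aggfunc` is the mean.

```python
import pandas as pd

# Hypothetical long-format records: several readings per car park and date.
records = pd.DataFrame({
    "SystemCodeNumber": ["P1", "P1", "P1", "P2", "P2"],
    "Date": ["2016-10-04", "2016-10-04", "2016-10-05", "2016-10-04", "2016-10-05"],
    "OccupancyRate": [20.0, 40.0, 50.0, 10.0, 70.0],
})

# One row per car park, one column per date, mean occupancy rate in each cell.
wide = pd.pivot_table(records, values="OccupancyRate",
                      index="SystemCodeNumber", columns="Date")
print(wide)
```

Each heatmap cell is therefore a daily mean, not an individual reading.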

# Data Set Preparation
The data cover 11 weeks (2016/10/04 to 2016/12/19). I use the last week as the test data and the remaining weeks as the training data.

In [None]:
# Rows dated on or after 2016-12-13 form the test set; earlier rows form the training set.
is_test = pd.to_datetime(parking_data['Date']) >= pd.to_datetime('2016-12-13')
test_data = parking_data[is_test]
train_data = parking_data[~is_test]
print('Train data size:',train_data.shape)
print('Test data size:',test_data.shape)
train_data.to_csv('train.csv',index=False)
test_data.to_csv('test.csv',index=False)
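
A time-based split should leave no training row dated on or after the earliest test row. The sketch below wraps the split in a hypothetical helper, `split_by_date` (not part of the original notebook), and checks that property on made-up records.

```python
import pandas as pd

def split_by_date(frame, cutoff, date_column="Date"):
    """Return (train, test): rows before the cutoff date and rows on or after it."""
    is_test = pd.to_datetime(frame[date_column]) >= pd.to_datetime(cutoff)
    return frame[~is_test], frame[is_test]

# Hypothetical records around the cutoff date used above.
toy = pd.DataFrame({
    "Date": ["2016-12-11", "2016-12-12", "2016-12-13", "2016-12-19"],
    "OccupancyRate": [35.0, 42.0, 38.0, 51.0],
})
train, test = split_by_date(toy, "2016-12-13")

# Every training date must precede every test date.
no_overlap = pd.to_datetime(train["Date"]).max() < pd.to_datetime(test["Date"]).min()
print(len(train), len(test), no_overlap)
```

Splitting by date instead of at random keeps the test week strictly in the future relative to the training data, which matches how a forecast would be used.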
