<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Capstone Project 

## AutoTel Shared Cars Availability

### Location history of shared cars


### Exploratory Data Analysis (EDA)

---


To reduce the number of owned cars, the city of Tel Aviv launched a shared-car project called AutoTel. Users of the service can reserve a car through a mobile app and pay for it by the minute. The project, launched in October 2017, attracted over 7,500 users, more than 50% of whom use the service at least once a week.


From the AutoTel website we extracted the location of the parked cars, every two minutes for several months. The raw data was saved to Google Storage in CSV format, and later loaded to a BigQuery Table. This short clip shows a visualization of the recorded data using Uber’s kepler.gl tool.

To select from the BigQuery table, run:

select * from `gad-playground-212407.doit_intl_autotel_public.car_locations` LIMIT 1

#### Inspiration

In order for the service to be reliable, AutoTel has to make sure that supply and demand are geospatially balanced, meaning cars are where and when they are needed. This task is extremely difficult since cars are driven and parked by customers who are not aligned at all with this optimization task. For the most part, the distribution of cars is uncorrelated with the demand: one reason is that if a car is parked in a suburban neighborhood, it may take a long time before another user may drive it to the city center, where high demand for the cars exists; thus clusters of unused cars are very often present on the outskirts of the city.

Using machine learning, AutoTel can predict the geospatial availability of cars at given times and use the predictions to modify its business model. It could, for example, adjust prices so that it is cheaper to park cars in high-demand areas, or plan the maintenance program so that cars are collected from high-supply, low-demand areas and returned to areas of high demand.

The data source is: https://www.kaggle.com

---

Parts 2 and 3 of the capstone project focus on exploratory data analysis, or "EDA". EDA is an essential part of the data science analysis pipeline. Failure to perform EDA before modeling is almost guaranteed to lead to bad models and faulty conclusions.

#### Package imports

In [1]:
# My Kaggle notebook (the LightGBM model does not run in my local Jupyter notebook):
# https://www.kaggle.com/acacianinjayt/abdul-notebook/edit/run/35488683

In [None]:
import numpy as np
import scipy.stats as stats
import csv
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import warnings
warnings.filterwarnings('ignore')  # the function must be called to take effect

# this line tells jupyter notebook to put the plots in the notebook rather than saving them to file.
%matplotlib inline

# this line makes plots prettier on mac retina screens. If you don't have one it shouldn't do anything.
%config InlineBackend.figure_format = 'retina'



## 1. Load the `2020_02_25.csv` dataset and describe it

---

The `2020_02_25.csv` dataset comes from Kaggle: https://www.kaggle.com

In [None]:
df_cars = pd.read_csv('../input/autotel-shared-car-locations/2020_02_25.csv')
df_cars.sample(5)

In [None]:
df_cars.head() 

In [None]:
df_cars.shape

In [None]:
df_cars.info()

In [None]:
#df_sample = df_cars.sample(n=100000)
#df_sample.head()

In [None]:
#import numpy as np
#def num(x):
    
    #try:
        #return x[0]
    #except:
        #return np.nan
    

In [None]:
#df_sample.cars_list.apply(num)

In [None]:
#df_sample.sample(0.01)

Some of the rows indicate empty parking spots. These spots are reserved for AutoTel cars, but no car was parked there at the time. Since we are only interested in the cars' locations, we filter these rows out to save computation and memory.

In [None]:
df_cars.describe()

In [None]:
# we can filter the unwanted rows 

df = df_cars[df_cars['total_cars'] > 0]

In [None]:
# The shape of the data frame shows that about half of the rows were filtered out,
# which saves computation and memory.
df.shape
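
As a minimal sketch of this filter on toy data (the rows below are invented for illustration; only the `total_cars` column name comes from the dataset), a boolean mask keeps the rows with at least one parked car:

```python
import pandas as pd

# Toy stand-in for df_cars; the values are made up for illustration.
toy = pd.DataFrame({
    "latitude": [32.07, 32.08, 32.09, 32.10],
    "longitude": [34.78, 34.79, 34.77, 34.80],
    "total_cars": [0, 2, 0, 1],
})

# Keep only the rows where at least one car is parked.
occupied = toy[toy["total_cars"] > 0]

# Fraction of rows dropped by the filter.
dropped_fraction = 1 - len(occupied) / len(toy)
print(occupied)
print(f"dropped {dropped_fraction:.0%} of rows")
```

Comparing `len()` before and after the filter, as done here, is a quick way to confirm how many rows the mask removed.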

In [None]:
# Group by and aggregate on one or more columns in pandas.
# Example: group on one or more columns and summarise the data with an aggregation function.
# Source: https://jamesrledoux.com/code/group-by-aggregate-pandas

Pandas comes with a whole host of sql-like aggregation functions we can apply when grouping on one or more columns. This is Python’s closest equivalent to dplyr’s group_by + summarise logic. Here’s a quick example of how to group on one or multiple columns and summarise data with aggregation functions using Pandas.

In [None]:
# reset index to get grouped column

df_cars_by_time = df.groupby('timestamp').agg({'total_cars': 'sum'}
                                          ).reset_index()
df_cars_by_time.head()
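
The group-and-aggregate step can be sketched on a tiny, invented table (the timestamps and counts are hypothetical; the column names mirror the dataset):

```python
import pandas as pd

# Two snapshots, each listing several parking locations (toy values).
toy = pd.DataFrame({
    "timestamp": ["10:00", "10:00", "10:02", "10:02", "10:02"],
    "total_cars": [1, 2, 1, 1, 3],
})

# Sum the parked cars across all locations for each snapshot.
cars_by_time = (
    toy.groupby("timestamp")
       .agg({"total_cars": "sum"})
       .reset_index()  # turn the group key back into a regular column
)
print(cars_by_time)
```

Without `reset_index()`, `timestamp` would stay in the index rather than being available as a column for later steps.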

In [None]:
#Create a datetime index


In [None]:
# bins: number of histogram bins.
# kde=False: draw the histogram only, without a Gaussian kernel density estimate.
# The rug plot draws one tick per observation along the x-axis.
# Note: sns.distplot is deprecated in recent seaborn releases; histplot + rugplot draw the same figure.
scores_dist1 = sns.histplot(df_cars_by_time.total_cars, bins=15, kde=False)
sns.rugplot(df_cars_by_time.total_cars, ax=scores_dist1)


## 2. Analyzing Time Series by Timestamp

In [None]:
# Time series analysis
# (I need more practice with time series.)



In [None]:
df_cars_by_time['total_cars'].plot(lw=1.5, figsize=(12,5))

In [None]:
df_cars_by_time['timestamp'] = df_cars_by_time['timestamp'].apply(pd.Timestamp)
rolling_mean = df_cars_by_time.set_index('timestamp').sort_index().rolling(window=2, center=True).mean()
exp_mean = df_cars_by_time.set_index('timestamp').sort_index().ewm(span=10).mean()

In [None]:
ax = rolling_mean.plot(lw=1.5, figsize=(14,7))
exp_mean.plot(ax=ax, lw=1.5)


In [None]:
# The rolling mean is the mean of a moving window across the time periods.
# pandas offers many rolling statistics; this only scratches the surface.
# rolling() returns a window object, and the statistic (here mean()) is chained onto it.
# Passing an offset such as '60min' makes the window time-based,
# which requires a sorted datetime index.


df_cars_by_time['timestamp'] = df_cars_by_time['timestamp'].apply(pd.Timestamp)
df_cars_by_time.set_index('timestamp').sort_index().rolling('60min').mean().plot(figsize=(20,6), c='salmon', lw=1.6)
plt.grid()
plt.show()
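
To illustrate what a time-based window computes, here is a small sketch on an invented series (the values and the 30-minute spacing are assumptions for the example, not the real data):

```python
import pandas as pd

# Six observations, 30 minutes apart (toy values).
idx = pd.date_range("2019-01-01 00:00", periods=6, freq="30min")
s = pd.Series([10, 20, 30, 40, 50, 60], index=idx, dtype=float)

# Time-based window: each value averages the observations in the
# preceding 60 minutes. By default the window is open on the left and
# closed on the right, so the point exactly 60 minutes back is excluded.
hourly = s.rolling("60min").mean()

# Exponentially weighted mean for comparison: recent points weigh more.
ewm = s.ewm(span=3).mean()

print(pd.DataFrame({"raw": s, "rolling_60min": hourly, "ewm_span3": ewm}))
```

With this spacing each 60-minute window holds two observations, so the rolling mean lags the raw series by half a step; wider windows smooth more but lag more.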


In [None]:
# We can see that the maximum number of available cars is 260, so we assume that this is the total number of cars in the AutoTel fleet.

# Under this assumption we can calculate the usage rate.



In [None]:
df_cars_by_time['usage_rate'] = (260 - df_cars_by_time['total_cars'])/260
usage_rate = df_cars_by_time.set_index('timestamp').sort_index()['usage_rate'].rolling('60min').mean()
usage_rate.plot(figsize=(20,6), c='mediumslateblue', lw=1.5)
plt.grid()
plt.show()
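
The usage-rate formula is `(fleet_size - available) / fleet_size`. A minimal sketch with invented counts, deriving the fleet size from the data instead of hard-coding 260 (the assumption, as in the text, is that the maximum observed number of parked cars equals the fleet size):

```python
import pandas as pd

# Toy counts of available (parked) cars at five points in time.
available = pd.Series([260, 247, 221, 208, 254])

# Assumption: the fleet size equals the largest number of cars ever seen parked.
fleet_size = available.max()

# Share of the fleet that is not parked, i.e. presumably in use.
usage_rate = (fleet_size - available) / fleet_size
print(usage_rate.round(3).tolist())
```

If some cars are always in use or in maintenance, the true fleet is larger than the observed maximum and this estimate understates usage.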

In [None]:
# Is usage growing or decreasing over time?

In [None]:
df_cars_by_time['usage_rate'] = (260 - df_cars_by_time['total_cars'])/260
df_cars_by_time.set_index('timestamp').sort_index()['usage_rate'].rolling('3D').mean().plot(figsize=(20,6), c='navy', lw=1.6)
plt.grid()
plt.show()

## 3. Analyzing Usage Patterns by Time

In [None]:
# To analyze the data by time of day we need to consider the time zone:
# Tel Aviv is not in the UTC time zone.
# We convert the timestamps to the local time zone (Asia/Jerusalem).

In [None]:
# convert timezone

timestamps = pd.DatetimeIndex(df_cars_by_time['timestamp'])
timestamps = timestamps.tz_convert('Asia/Jerusalem')

df_cars_by_time['Local_time'] = timestamps

# extract time features
df_cars_by_time['weekday'] = df_cars_by_time['Local_time'].dt.weekday  # Monday=0 ... Sunday=6 (.dt.day is the day of the month)
df_cars_by_time['hour'] = df_cars_by_time['Local_time'].dt.hour

In [None]:
df_cars_by_time.head()
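
A short sketch of why the conversion matters, using two made-up UTC timestamps: late in the evening UTC, the local date (and therefore the weekday) in Tel Aviv has already moved on. Note that `weekday` counts Monday as 0, and is distinct from `day`, which is the day of the month.

```python
import pandas as pd

# Two invented timestamps, recorded in UTC.
utc_times = pd.DatetimeIndex(
    ["2019-01-03 22:30:00", "2019-01-04 10:00:00"], tz="UTC"
)

# Convert to Tel Aviv local time (UTC+2 in January).
local = utc_times.tz_convert("Asia/Jerusalem")

print(list(utc_times.weekday))  # weekday in UTC (Monday=0)
print(list(local.weekday))      # weekday in local time
print(list(local.hour))         # local hour of day
print(list(local.day))          # day of the month, not the weekday
```

The first timestamp is a Thursday in UTC but already Friday in local time, which is exactly the kind of shift that would blur a weekday analysis done in UTC.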

In [None]:
# Analyze and visualize available cars by hour

plt.figure(figsize=(12,6))
#plt.subplot(121)
sns.barplot(x='hour', y='total_cars', data=df_cars_by_time)
plt.title('Available cars by hour of day')

plt.show()

In [None]:
# Analyze and visualize available cars by day
plt.figure(figsize=(12,6))
#plt.subplot(122)
sns.boxplot(x='weekday', y='total_cars', data=df_cars_by_time) #showfliers=False
plt.title('Available cars by day of week')

plt.show()

In [None]:
#We can see that the maximum number of available cars is 260,
#so we assume that this is the total number of cars in the AutoTel fleet.
#Under this assumption we calculate the usage rate.

df_cars_by_time['usage_rate'] = (260 - df_cars_by_time['total_cars'])/260

In [None]:
plt.figure(figsize=(12,6))
#plt.subplot(121)
sns.barplot(x='hour', y='usage_rate', data=df_cars_by_time)
plt.title('Cars usage rate by hour of day')

plt.show()

In [None]:
# Analyze and visualize the usage rate by day
plt.figure(figsize=(12,6))
#plt.subplot(122)
sns.boxplot(x='weekday', y='usage_rate', data=df_cars_by_time) #showfliers=False
plt.title('Cars usage rate by day of week')

plt.show()

- Looks much better, doesn't it? During the night the usage rate drops to as low as 2.5%, while at peak times almost 20% of the cars are in use.

- Weekdays in Israel run from Sunday to Thursday; the weekend is Friday and Saturday. This may explain why Thursday has the highest usage rate on average.

### Plotting the data on a map

This geographical data can be presented on a map, showing where people park the cars.

In [None]:
# Create interactive maps with the folium package
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster

In [None]:
#Add some data to the map: sample 1,500 (location, timestamp) rows to plot.
#The mean location of the sample is used later to center the map.

df_locations = df.groupby(['latitude', 'longitude', 'timestamp']).sum().reset_index().sample(1500)
df_locations.head()

In [None]:
# create a map with folium.Map(), centered on the mean car location
m = folium.Map([df_locations.latitude.mean(), df_locations.longitude.mean()], zoom_start=11)
for index, row in df_locations.iterrows(): # Add points to the map
    # CircleMarker (unlike Marker) supports radius and fill_color
    folium.CircleMarker([row['latitude'], row['longitude']], radius=row['total_cars']*6,
                        fill=True, fill_color="#3db7e4").add_to(m)
    
points = df_locations[['latitude', 'longitude']]
# plot heatmap (add_child is the current name; add_children is deprecated)
m.add_child(HeatMap(points.values.tolist(), radius=15))

#Display the map
m

### Predicting the number of available cars per neighborhood 

The AutoTel data can be supplemented with neighborhood polygons of the city of Tel Aviv. This data enables us to group the observations by neighborhood and later predict the number of available cars per neighborhood!

- Disclaimer: Kaggle Kernels do not support GeoPandas, so I implemented my own geo joins, which are not efficient. Hopefully they will add it soon!

In [None]:
#Import the geometric objects needed from the shapely module.
#WKT (Well-Known Text) is a text markup language for representing vector geometry objects,
#spatial reference systems, and transformations between spatial reference systems.
from shapely.geometry import Point, Polygon
from shapely import wkt

In [None]:
# load neighborhood data

In [None]:
df_neighborhood = pd.read_csv('/kaggle/input/tel-aviv-neighborhood-polygons/tel_aviv_neighborhood.csv')
df_neighborhood.head()

In [None]:
def load_and_close_polygon(wkt_text):
    poly = wkt.loads(wkt_text)
    point_list = poly.exterior.coords[:]
    point_list.append(point_list[0])
    
    return Polygon(point_list)

In [None]:
# Let's transform the WKT strings to Polygon objects and store them in a dict keyed by neighborhood name
df_neighborhood['polygon'] = df_neighborhood['area_polygon'].apply(load_and_close_polygon)
neighborhood_map = df_neighborhood.set_index('neighborhood_name')['polygon'].to_dict()

In [None]:
#Use a sample of 10,000 rows from the filtered data.
#With many points and one polygon, iterate over the points and check
#whether each one is within() the polygon.
#With many polygons and one point, iterate over the polygons until one
#contains() the point (assuming the polygons do not overlap).
sample_df = df.sample(10000)
sample_df['points'] = sample_df.apply(lambda row : Point([row['longitude'], row['latitude']]), axis=1)
sample_df.head()

In [None]:
#add the neighborhood column

poly_idxs = sample_df['points'].apply(lambda point : np.argmax([point.within(polygon) 
                                                                  for polygon in list(neighborhood_map.values())]))

#Map each polygon position back to its neighborhood name
#(list(neighborhood_map.keys()) has the same order as list(neighborhood_map.values())).
poly_idxs = poly_idxs.apply(lambda x: list(neighborhood_map.keys())[x])
sample_df['neighborhood'] = poly_idxs.values
sample_df.head()
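
One caveat with the `np.argmax` lookup: when a point lies in none of the polygons, every test is `False` and `np.argmax` returns 0, so the point is silently labeled with the first neighborhood. The sketch below shows a lookup that returns `None` instead. It uses a plain-Python ray-casting test as a stand-in for shapely's `within()`, and the function names and the two square "neighborhoods" are hypothetical.

```python
def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_neighborhood(x, y, polygons):
    """Return the name of the first polygon containing the point, or None."""
    for name, vertices in polygons.items():
        if point_in_polygon(x, y, vertices):
            return name
    return None


# Two invented unit squares side by side.
toy_polygons = {
    "west": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "east": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

print(find_neighborhood(0.5, 0.5, toy_polygons))
print(find_neighborhood(1.5, 0.5, toy_polygons))
print(find_neighborhood(5.0, 5.0, toy_polygons))  # outside every polygon
```

Rows labeled `None` can then be dropped or inspected separately, rather than inflating the count of whichever neighborhood happens to come first.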

In [None]:
plt.figure(figsize=(20,7))
sns.barplot(x = 'neighborhood', y = 'total_cars', data=sample_df.groupby('neighborhood').count().reset_index())
plt.xticks(rotation=45)
plt.show()


In [None]:
#sat['geome']

### Predicting Car Availability using Models

In [None]:
#Make a copy of the filtered data so that later steps do not modify df.
#DataFrame.copy() makes a deep copy by default:
#the new object has its own copy of the data and index.
df_sample = df.copy()

In [None]:
df_timestaps = pd.DataFrame()

#drop_duplicates() removes repeated timestamps, so each one is converted only once.
df_timestaps['timestamp'] = df_sample.timestamp.drop_duplicates()

#Parse each unique timestamp and convert it to Tel Aviv local time.
#Building the index from df_timestaps itself, rather than reusing the `timestamps`
#variable from the EDA section, keeps the converted values aligned with these rows.
timestamps = pd.DatetimeIndex(df_timestaps['timestamp'].apply(pd.Timestamp))
df_timestaps['Local_time'] = timestamps.tz_convert('Asia/Jerusalem')


In [None]:
#The join is done on columns, the DataFrame indexes will be ignored.
#left: use only keys from left frame
df_sample = df_sample.merge(df_timestaps, on='timestamp', how='left')

In [None]:
# Again, there is no reason to compute on duplicate points: it is very expensive!
df_points = df_sample[['longitude', 'latitude']].drop_duplicates()

#Build a shapely Point (x=longitude, y=latitude) for each unique location
df_points['points'] = df_points.apply(lambda row : Point([row['longitude'], row['latitude']]), axis=1)

#For each point, test every neighborhood polygon with within() and take the position
#of the first polygon that contains it (assuming the polygons do not overlap)
poly_idxs = df_points['points'].apply(lambda point : np.argmax([point.within(polygon) 
                                                                for polygon in list(neighborhood_map.values())]))

#Map each polygon position back to its neighborhood name
poly_idxs = poly_idxs.apply(lambda x: list(neighborhood_map.keys())[x])
df_points['neighborhood'] = poly_idxs.values


In [None]:
#The join is done on columns, the DataFrame indexes will be ignored.
#left: use only keys from left frame
df_sample = df_sample.merge(df_points[['longitude', 'latitude', 'neighborhood']], 
                            on=['longitude', 'latitude'], how='left')

In [None]:
#astype(np.int64) gives Unix time in nanoseconds, so divide by 10**9 to get seconds
#(.values drops the time zone, so the phase of the cycles below is relative to UTC)
df_sample['time_in_seconds'] = pd.to_datetime(df_sample['Local_time']).values.astype(np.int64) // 10**9

seconds_in_day = 24 * 60 * 60
seconds_in_week = 7 * seconds_in_day

#Map the time onto a circle, like the face of a 24-hour clock: the sine and cosine
#of the time of day form two features, and the distance between two points
#reflects their time difference on the 24-hour cycle.
df_sample['sin_time_day'] = np.sin(2*np.pi*df_sample['time_in_seconds']/seconds_in_day)
df_sample['cos_time_day'] = np.cos(2*np.pi*df_sample['time_in_seconds']/seconds_in_day)
#This is a cyclical embedding: e.g. 23:30 and 00:30 get similar representations.
#The same transform with a one-week period captures the weekly cycle.
df_sample['sin_time_week'] = np.sin(2*np.pi*df_sample['time_in_seconds']/seconds_in_week)
df_sample['cos_time_week'] = np.cos(2*np.pi*df_sample['time_in_seconds']/seconds_in_week)
#These sin/cos features can be fed into a machine learning model,
#and the cyclical nature of time carries over.

df_sample['weekday'] = df_sample['Local_time'].dt.weekday
df_sample['hour'] = df_sample['Local_time'].dt.hour

df_sample.sample(5)
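
To see why the sine/cosine pair is useful, the following sketch (plain Python; the helper names are hypothetical) encodes three times of day and compares their distances on the circle. As raw numbers 23:30 and 00:30 are 23 hours apart, but on the circle they are neighbors.

```python
import math

SECONDS_IN_DAY = 24 * 60 * 60


def encode_time_of_day(seconds_past_midnight):
    """Map a time of day to a point (sin, cos) on the unit circle."""
    angle = 2 * math.pi * seconds_past_midnight / SECONDS_IN_DAY
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


late = encode_time_of_day(23.5 * 3600)   # 23:30
early = encode_time_of_day(0.5 * 3600)   # 00:30
noon = encode_time_of_day(12 * 3600)     # 12:00

print(distance(late, early))  # small: one hour apart across midnight
print(distance(late, noon))   # large: almost opposite sides of the clock
```

Both features are needed together: the sine alone cannot distinguish, for example, 03:00 from 09:00, since both have the same sine value.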

### Does our data contain multiple parking spots per Neighborhood?

That is, are there multiple rounded lat/long pairs per neighborhood?

We may want to round the lat/longs, in case the reporting comes from the car level rather than from the parking spot(s).

In [None]:
df_sample[['longitude', 'latitude', 'neighborhood']].groupby('neighborhood').nunique()

In [None]:
#Round a DataFrame to a variable number of decimal places.
#By providing an integer each column is rounded to the same number of decimal places
df_sample["LL2"] = df_sample['longitude'].round(2).astype(str) + df_sample['latitude'].round(2).astype(str)

df_sample[['LL2', 'neighborhood']].groupby('neighborhood').nunique()

We see that working at the "parking lot" level would at least roughly triple the number of rows/samples to predict on. This might be a bit too much, although it would be more relevant for taking action, i.e. answering "where are cars missing while there is demand for them?"

### Back to merging + aggregation by Neighborhood:

Changed: we aggregate over time bins rather than working at the minute level. The code below uses 30-minute bins; hourly bins would also work.

Alternative target: per "parking lot", i.e. by rounded lat/long (LL2).

In [None]:
# aggregation by Neighbourhood
aggs = {}
aggs['total_cars'] = 'sum'
aggs['sin_time_day'] = 'mean'
aggs['cos_time_day'] = 'mean'
aggs['sin_time_week'] = 'mean'
aggs['cos_time_week'] = 'mean'
aggs['weekday'] = 'first'
aggs['hour'] = 'first'
# 'first' is used for the coordinates because a mode aggregation is problematic with agg
aggs['latitude'] = 'first'
aggs['longitude'] = 'first'

# 30 minute resample
df_sample = df_sample.set_index('Local_time').groupby([pd.Grouper(freq='1800s'), 'neighborhood']).agg(aggs).reset_index()

# df_sample.set_index('local_time').groupby([pd.Grouper(freq='60s'), 'neighborhood']).agg(aggs).reset_index().tail()

print(df_sample.shape)

df_sample.sample(6)
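
The binning step can be sketched on a few invented rows (the times, neighborhood names and counts are hypothetical). `pd.Grouper(freq=...)` groups on the datetime index, and the second key splits each time bin by neighborhood:

```python
import pandas as pd

# Five toy snapshots across two neighborhoods.
toy = pd.DataFrame({
    "Local_time": pd.to_datetime([
        "2019-01-01 08:02", "2019-01-01 08:16", "2019-01-01 08:40",
        "2019-01-01 08:05", "2019-01-01 08:50",
    ]),
    "neighborhood": ["A", "A", "A", "B", "B"],
    "total_cars": [2, 3, 1, 4, 5],
})

# One row per (30-minute bin, neighborhood).
binned = (
    toy.set_index("Local_time")
       .groupby([pd.Grouper(freq="30min"), "neighborhood"])
       .agg({"total_cars": "sum"})
       .reset_index()
)
print(binned)
```

Note that `'sum'` adds up every snapshot that falls in a bin, so a car that stays parked for the whole half hour is counted once per snapshot. If the target should read as "cars available at a time", a `'mean'` or `'max'` aggregation may be closer to that meaning; this is a design choice, not something the source settles.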

# Train a Model

## Classification

In [None]:
import pandas as pd 
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, accuracy_score, classification_report

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
#Convert to categorical type
df_sample['neighborhood'] = df_sample['neighborhood'].astype('category')
df_sample['weekday'] = df_sample['weekday'].astype('category')
df_sample['hour'] = df_sample['hour'].astype('category')

# There is no label column, so we create one as below:

median_total_cars = df_sample['total_cars'].median()

# y is the category that the classifier predicts:
# 1 if the number of available cars is above the median, otherwise 0

df_sample['total_cars_label'] = df_sample['total_cars'].apply(lambda x: 1 if x > median_total_cars else 0)

df_sample['total_cars_label'].value_counts()

In [None]:
# define X and y

X = df_sample[['neighborhood', 'weekday', 'hour']]
y = df_sample['total_cars_label']

In [None]:
X_dummy = pd.get_dummies(X, drop_first=True)
print (X_dummy.shape)
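
As a small illustration of what `get_dummies` with `drop_first=True` produces, here is a sketch on a single invented categorical column: a column with k categories becomes k-1 indicator columns, and the dropped first category is represented by all zeros.

```python
import pandas as pd

# Toy categorical feature with three categories.
X_toy = pd.DataFrame({
    "weekday": pd.Categorical([0, 1, 2, 0], categories=[0, 1, 2]),
})

dummies = pd.get_dummies(X_toy, drop_first=True)
print(dummies)
```

Dropping one level avoids perfectly collinear columns, which matters for linear models; tree-based models such as the ones used below work either way.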

### Ensemble Methods

#### Baseline Accuracy

In [None]:
max(df_sample['total_cars_label'].value_counts(normalize=True))*100
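
The baseline is the accuracy of always predicting the most frequent class. A minimal sketch on invented labels:

```python
import pandas as pd

# Toy labels: five zeros and three ones.
labels = pd.Series([1, 0, 0, 1, 0, 0, 0, 1])

# Share of the majority class = accuracy of a constant "always 0" predictor.
baseline_accuracy = labels.value_counts(normalize=True).max()
print(f"{baseline_accuracy:.1%}")
```

Because the label here is defined by a median split, the real baseline should sit near 50%, and any model worth keeping needs to beat it clearly.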

#### Data Preparation

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_dummy, y, test_size=0.33, 
                                                    random_state=42, stratify = y)


In [None]:
y_test.mean()

In [None]:
y_train.mean()

### Decision Tree Model

In [None]:
dt = DecisionTreeClassifier()

In [None]:
cross_val_score(dt, X_train, y_train, cv=10)

In [None]:
dt.fit(X_train, y_train)

In [None]:
dt.score(X_train, y_train)

In [None]:
dt.score(X_test, y_test)

In [None]:
#from io import StringIO  
#from IPython.display import Image  
#from sklearn.tree import export_graphviz
#import pydotplus as pydot

#dot_data = StringIO()  

#export_graphviz(dt, out_file=dot_data,  
#                filled=True, rounded=True,
#                special_characters=True)
             

#graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
#Image(graph.create_png()) 

In [None]:
from sklearn.metrics import classification_report

dt_predict = dt.predict(X_test)

print (classification_report(y_test, dt_predict))

### Bootstrap with Pandas

In [None]:
X_sample = X_train.sample(replace=True, n=X_train.shape[0], random_state=42)
y_sample = y_train[X_sample.index]

In [None]:
bt_tree = DecisionTreeClassifier()
bt_tree.fit(X_sample, y_sample)
bt_tree.score(X_test, y_test)

In [None]:
bt_predict = bt_tree.predict(X_test)
print(classification_report(y_test, bt_predict))
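
The bootstrap step above can be sketched on a toy frame (values invented): rows are drawn with replacement until the sample is as large as the original, so some rows repeat and others are left out, and the labels are looked up by the sampled index to stay aligned.

```python
import pandas as pd

# Toy features and labels sharing the same index; y is 0 for even rows, 1 for odd.
X = pd.DataFrame({"feature": range(10)})
y = pd.Series([0, 1] * 5, index=X.index)

# Draw len(X) rows with replacement.
X_boot = X.sample(n=len(X), replace=True, random_state=42)

# Look up the matching labels by index label, repeats included.
y_boot = y.loc[X_boot.index]

n_unique = X_boot.index.nunique()
print(f"{n_unique} of {len(X)} original rows appear in the bootstrap sample")
```

On large datasets roughly 63% of the original rows end up in any one bootstrap sample; bagging trains one model per such sample and averages their votes.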

### Bagging Classifier

In [None]:
bag = BaggingClassifier(n_estimators=10)

bag.fit(X_train, y_train)
bag_predict = bag.predict(X_test)
bag.score(X_test, y_test)

In [None]:
print(classification_report(y_test, bag_predict))

### Random Forest Classification

In [None]:
# Run RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score, accuracy_score

rfc = RandomForestClassifier(random_state=1000)
rfc.fit(X_train, y_train)
rfc_predict = rfc.predict(X_test)
accuracy_score(y_test, rfc_predict)

In [None]:
print(classification_report(y_test, rfc_predict))

### Predicting Car Availability using LightGBM

Now let's imagine that we'd like to predict how cars will be distributed between neighborhoods in the city. In this sample code we use LightGBM to predict the number of available cars in a neighborhood.

### LightGBM Model

In [None]:
#conda install -c conda-forge lightgbm
#conda install -c conda-forge/label/cf201901 lightgbm
#conda install -c conda-forge/label/cf202003 lightgbm
#pip install setuptools numpy scipy scikit-learn -U

In [None]:
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, average_precision_score
from sklearn.model_selection import KFold, train_test_split
from lightgbm import LGBMClassifier
import matplotlib.pyplot as plt
import gc
import shap

In [None]:
df_sample = df.copy()

In [None]:
df_timestamps = pd.DataFrame()
df_timestamps['timestamp'] = df_sample.timestamp.drop_duplicates()
# Parse each unique timestamp and convert it to Tel Aviv local time
timestamps = pd.DatetimeIndex(df_timestamps['timestamp'].apply(pd.Timestamp))
df_timestamps['Local_time'] = timestamps.tz_convert('Asia/Jerusalem')

In [None]:
df_sample = df_sample.merge(df_timestamps, on='timestamp', how='left')

In [None]:
# Again, there is no reason to compute on duplicate points: it is very expensive!
df_points = df_sample[['longitude','latitude']].drop_duplicates()
df_points['points'] = df_points.apply(lambda row : Point([row['longitude'], row['latitude']]), axis=1)
poly_idxs = df_points['points'].apply(lambda point : np.argmax([point.within(polygon) for polygon in list(neighborhood_map.values())]))
poly_idxs = poly_idxs.apply(lambda x: list(neighborhood_map.keys())[x])
df_points['neighborhood'] = poly_idxs.values

In [None]:
df_sample = df_sample.merge(df_points[['longitude', 'latitude', 'neighborhood']], on=['longitude', 'latitude'], how='left')

In [None]:
# astype(np.int64) gives nanoseconds since the epoch; divide by 10**9 to get seconds
df_sample['time_in_seconds'] = pd.to_datetime(df_sample['Local_time']).values.astype(np.int64) // 10**9

seconds_in_day = 24 * 60 * 60
seconds_in_week = 7 * seconds_in_day

#df_sample['sin_time_day'] = np.sin(2*np.pi*df_sample['time_in_seconds']/seconds_in_day)
#df_sample['cos_time_day'] = np.cos(2*np.pi*df_sample['time_in_seconds']/seconds_in_day)

#df_sample['sin_time_week'] = np.sin(2*np.pi*df_sample['time_in_seconds']/seconds_in_week)
#df_sample['cos_time_week'] = np.cos(2*np.pi*df_sample['time_in_seconds']/seconds_in_week)

df_sample['weekday'] = df_sample['Local_time'].dt.weekday
df_sample['hour'] = df_sample['Local_time'].dt.hour

df_sample.head()

In [None]:
# aggregation by Neighbourhood
aggs = {}
aggs['total_cars'] = 'sum'
#aggs['sin_time_day'] = 'mean'
#aggs['cos_time_day'] = 'mean'
#aggs['sin_time_week'] = 'mean'
#aggs['cos_time_week'] = 'mean'
aggs['weekday'] = 'first'
aggs['hour'] = 'first'
# Mode is problematic with agg
aggs['latitude'] =   'first' # pd.Series.mode()#lambda x: x.mode #pd.Series.mode()#
aggs['longitude'] =  'first' # pd.Series.mode() #lambda x: x.mode #pd.Series.mode() # 'first'

# 30 minute resample
df_sample = df_sample.set_index('Local_time').groupby([pd.Grouper(freq='1800s'), 'neighborhood']).agg(aggs).reset_index()

# df_sample.set_index('local_time').groupby([pd.Grouper(freq='60s'), 'neighborhood']).agg(aggs).reset_index().tail()

print(df_sample.shape)

df_sample.head()

In [None]:
#Convert to categorical type
df_sample['neighborhood'] = df_sample['neighborhood'].astype('category')
df_sample['weekday'] = df_sample['weekday'].astype('category')
df_sample['hour'] = df_sample['hour'].astype('category')

In [None]:
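# .copy() makes independent frames, so adding columns later does not warn about writing to a slice
df_train = df_sample[df_sample['Local_time'] < '2019-01-04'].copy()
df_test = df_sample[df_sample['Local_time'] >= '2019-01-04'].copy()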

print('train_shape: ', df_train.shape)
print('test_shape: ', df_test.shape)

#df_train.to_csv("autoTel_train_30m_Neighborhoods.csv.gz",index=False,compression="gzip")
#df_test.to_csv("autoTel_test_30m_Neighborhoods.csv.gz",index=False,compression="gzip")

In [None]:
y = 'total_cars'
X = ['neighborhood', 'weekday', 'hour']

In [None]:
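gbm = lgb.LGBMRegressor(num_leaves=31,
                        learning_rate=0.05, 
                        n_estimators=250)

# Early stopping is passed as a callback; recent LightGBM releases no longer
# accept early_stopping_rounds as a fit() argument.
gbm.fit(df_train[X], df_train[y],
        eval_set=[(df_test[X], df_test[y])],
        eval_metric='mse',
        callbacks=[lgb.early_stopping(stopping_rounds=5)],
      )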

In [None]:
y_pred = df_test['prediction'] = gbm.predict(df_test[X])

y_pred

In [None]:
df_test.plot(kind='scatter', x='total_cars', y='prediction', lw=0, s=0.4, figsize=(20,6))
plt.show()

In [None]:
# evaluate performance
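
One way the evaluation could look, sketched with invented numbers rather than the real predictions: the mean absolute error (MAE) and root mean squared error (RMSE) of the predictions, compared against a naive predictor that always outputs the mean. The arrays below are hypothetical stand-ins for `df_test['total_cars']` and `df_test['prediction']`.

```python
import numpy as np

# Toy observed values and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_hat - y_true

# Mean absolute error: the average size of the miss, in cars.
mae = np.mean(np.abs(errors))

# Root mean squared error: penalizes large misses more heavily.
rmse = np.sqrt(np.mean(errors ** 2))

# Naive baseline: always predict the mean of the observed values.
baseline_rmse = np.sqrt(np.mean((y_true.mean() - y_true) ** 2))

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  baseline RMSE={baseline_rmse:.3f}")
```

A model is only useful if its RMSE is clearly below the baseline's; since both errors are expressed in cars, they are easy to interpret per neighborhood.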



In [None]:
#Convert to categorical type
df_sample['neighborhood'] = df_sample['neighborhood'].astype('category')
df_sample['weekday'] = df_sample['weekday'].astype('category')
df_sample['hour'] = df_sample['hour'].astype('category')

# There is no label column, so we create one as below:

median_total_cars = df_sample['total_cars'].median()

# y is the category that the classifier predicts:
# 1 if the number of available cars is above the median, otherwise 0

df_sample['total_cars_label'] = df_sample['total_cars'].apply(lambda x: 1 if x > median_total_cars else 0)

df_sample['total_cars_label'].value_counts()

In [None]:
X = df_sample[['neighborhood', 'weekday', 'hour']]
y = df_sample['total_cars_label']

In [None]:
X_dummy = pd.get_dummies(X, drop_first=True)
print (X_dummy.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_dummy, y, test_size=0.33, 
                                                    random_state=42, stratify = y)


In [None]:
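clf = LGBMClassifier(
    n_estimators=400,
    learning_rate=0.03,
    num_leaves=30,
    colsample_bytree=.8,
    subsample=.9,
    max_depth=7,
    reg_alpha=.1,
    reg_lambda=.1,
    min_split_gain=.01,
    min_child_weight=2,
    verbose=-1,
)

# Early stopping and evaluation logging are passed as callbacks; recent LightGBM
# releases no longer accept verbose / early_stopping_rounds as fit() arguments.
clf.fit(X_train, y_train, 
    eval_set= [(X_train, y_train), (X_test, y_test)], 
    eval_metric='auc',
    callbacks=[lgb.early_stopping(stopping_rounds=30), lgb.log_evaluation(period=100)],
)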

In [None]:
# explain 10000 examples from the validation set
# each row is an explanation for a sample, and the last column in the base rate of the model
# the sum of each row is the margin (log odds) output of the model for that sample
shap_values = shap.TreeExplainer(clf.booster_).shap_values(X_test.iloc[:10000,:])
shap_values

In [None]:
# compute the global importance of each feature as the mean absolute value
# of the feature's importance over all the samples
global_importances = np.abs(shap_values).mean(0)[:-1]
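
The "mean absolute value" idea can be illustrated on a small invented attribution matrix. This is not real SHAP output: the sketch assumes one row per sample and one column per feature, with no extra base-value column.

```python
import numpy as np

# Toy attribution matrix: 3 samples x 3 features (values invented).
attributions = np.array([
    [0.2, -0.1, 0.0],
    [-0.4, 0.3, 0.1],
    [0.6, -0.2, -0.1],
])

# Global importance: average magnitude of each feature's contribution.
global_importance = np.abs(attributions).mean(axis=0)

# Feature positions ordered from most to least important.
ranking = np.argsort(global_importance)[::-1]

print(global_importance)
print(ranking)
```

Taking the absolute value first matters: the first feature's signed contributions would partly cancel in a plain mean, even though it moves the predictions the most.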

In [None]:
shap.summary_plot(shap_values, X_test.iloc[:10000,:])