###  Datathon2020 – Predicting weather disruption of public transport – provided by Ernst and Young

This project was inspired by the business case of the Data Science Society Global 2020 Hackathon, held May 15-17, 2020.

Click <a href='https://www.datasciencesociety.net/predicting-weather-disruption-of-public-transport/'>here</a> for details about the business case and the data dictionary.
 
#####  Data Sources : 
The datasets used in this project were provided by the organizers; the external data was obtained <a href='https://www.dubaipulse.gov.ae/organisation/rta/service/rta-public-transports?organisation=rta&service=rta-public-transports&dataset=rta_public_transport_trips_by_type_of_transport_month-open'>here</a>.

### The analysis for this project follows the CRISP-DM pipeline, whose stages are:
<a id='the_destination'></a>
+ Business Understanding 
+ Data Understanding
+ Data Preparation
+ Data Modelling
+ Results
+ Deployment - Storytelling


###  Business Understanding

In summary, the project aims to predict public transport service disruption in Dubai from weather data.

+ Goal : Can you analyze the weather data to predict public transport service disruption in Dubai? How can we plan for less disruption in the wake of severe weather conditions and leverage the emergency management plan as well as providing uninterrupted services and products to citizens?

### Data Understanding and Data Preprocessing 

This stage involves loading the data and performing the necessary cleaning, preprocessing and feature engineering to prepare it for analysis and modelling.

+ Importing Necessary Libraries

In [56]:
import os
import datetime as dt
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

%matplotlib inline

plt.style.use('ggplot')

+ Loading the datasets into a dataframe

In [253]:
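# Hourly weather records for Dubai (JSON stored in a .txt file)
data = pd.read_json('Dubai+Weather_20180101_20200316.txt')

# Empty frame that collects the monthly transport files
transport = pd.DataFrame(data=None,columns=['year','month','transport_type','trips'])

# Append every file found in the 'Transport' folder
for i in os.listdir('Transport'):
    month_data = pd.read_csv(os.path.join('Transport', i))
    transport = pd.concat([transport,month_data],axis=0)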
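
As a possible alternative to growing `transport` inside the loop, the monthly files could be read into a list and concatenated once. The sketch below is self-contained: it writes two small CSV files into a temporary folder (the folder, the file names and the `transport_demo` name are hypothetical; the rows reuse the figures shown in `transport.head()`) and assumes every file has the `year`, `month`, `transport_type` and `trips` columns.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the 'Transport' folder: two tiny monthly files.
folder = tempfile.mkdtemp()
samples = {
    "2018_Feb.csv": "year,month,transport_type,trips\n"
                    "2018,Feb,Marine,141840\n"
                    "2018,Feb,Tram,528515\n",
    "2018_Mar.csv": "year,month,transport_type,trips\n"
                    "2018,Mar,Marine,166561\n",
}
for name, text in samples.items():
    with open(os.path.join(folder, name), "w") as fh:
        fh.write(text)

# Read every CSV in sorted order and concatenate once.
# ignore_index=True yields a clean 0..n-1 index, so no reset_index is needed.
frames = [
    pd.read_csv(os.path.join(folder, name))
    for name in sorted(os.listdir(folder))
    if name.endswith(".csv")
]
transport_demo = pd.concat(frames, ignore_index=True)

print(transport_demo.dtypes)
```

Since no empty, object-typed frame takes part in the concatenation, numeric columns such as `trips` should keep an integer dtype rather than appearing as `object`.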

In [254]:
data.tail(3)

Unnamed: 0,city_name,lat,lon,main,wind,clouds,weather,dt,dt_iso,timezone,rain
19341,Dubai,25.07501,55.188761,"{'temp': 21.52, 'temp_min': 20, 'temp_max': 23...","{'speed': 3.1, 'deg': 60}",{'all': 0},"[{'id': 800, 'main': 'Clear', 'description': '...",1584392400,2020-03-16 21:00:00 +0000 UTC,14400,
19342,Dubai,25.07501,55.188761,"{'temp': 21.04, 'temp_min': 19, 'temp_max': 23...","{'speed': 3.1, 'deg': 70}",{'all': 0},"[{'id': 800, 'main': 'Clear', 'description': '...",1584396000,2020-03-16 22:00:00 +0000 UTC,14400,
19343,Dubai,25.07501,55.188761,"{'temp': 20.31, 'temp_min': 18, 'temp_max': 23...","{'speed': 3.6, 'deg': 60}",{'all': 0},"[{'id': 800, 'main': 'Clear', 'description': '...",1584399600,2020-03-16 23:00:00 +0000 UTC,14400,


+ Data Preprocessing and Data Cleaning

In [255]:
transport.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 0 to 1
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   year            95 non-null     object
 1   month           95 non-null     object
 2   transport_type  95 non-null     object
 3   trips           95 non-null     object
dtypes: object(4)
memory usage: 3.7+ KB


In [256]:
# Replace the repeated per-file index with a clean 0..n-1 index
transport.reset_index(drop=True, inplace=True)

In [257]:
transport.head()

Unnamed: 0,year,month,transport_type,trips
0,2018,Feb,Marine,141840
1,2018,Feb,Tram,528515
2,2018,Feb,Bus,11111573
3,2018,Feb,Metro,16915232
4,2018,Mar,Marine,166561


In [258]:
def uniqueid(col):
    """
    Generate a sequential id for each month, counting from January 2018.

    Args :
        col : row holding the year and the month abbreviation (e.g. 'Feb'), in that order

    Output :
        Integer id : 1-12 for 2018, 13-24 for 2019, 25-36 for 2020
        (None for any other year)
    """

    # Unpack by position; works for a list as well as a DataFrame row
    year, month = (str(value).strip() for value in col)
    Month = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,
            'Jun':6,'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}

    if year == '2018':
        return Month[month]

    elif year == '2019':
        return Month[month] + 12

    elif year == '2020':
        return Month[month] + 24
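
The three year branches follow one pattern: each year after 2018 adds 12 to the month number. A minimal sketch of that arithmetic in closed form is shown below; `month_index`, `MONTHS` and the `base_year` parameter are hypothetical names introduced only for illustration, not part of the notebook.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def month_index(year, month, base_year=2018):
    """Return 1 for January of base_year, 13 for the following January, and so on."""
    month_number = MONTHS.index(str(month).strip()) + 1
    return (int(year) - base_year) * 12 + month_number


print(month_index(2018, "Feb"))
print(month_index("2020", " Mar "))
```

Unlike the explicit branches, this version also produces an id for years outside 2018-2020, which may or may not be desirable.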

In [259]:
transport['id'] = transport[['year','month']].apply(uniqueid,axis=1)

In [260]:
data.tail(1)

Unnamed: 0,city_name,lat,lon,main,wind,clouds,weather,dt,dt_iso,timezone,rain
19343,Dubai,25.07501,55.188761,"{'temp': 20.31, 'temp_min': 18, 'temp_max': 23...","{'speed': 3.6, 'deg': 60}",{'all': 0},"[{'id': 800, 'main': 'Clear', 'description': '...",1584399600,2020-03-16 23:00:00 +0000 UTC,14400,


+ Transforming the date to pandas date format 

+ Dropping columns with constant values, such as `city_name` and `timezone`, together with `dt_iso`, which duplicates the information in `dt`

In [261]:
data.drop(['city_name','timezone','dt_iso'],axis=1,inplace=True)

In [262]:
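def convert_time(timestamp):
    # The datetime module is imported as dt.
    # Note: fromtimestamp() converts to the local timezone of the machine running the notebook
    return dt.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')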

In [263]:
data['dt'] = data['dt'].apply(convert_time)
data['dt'] = pd.to_datetime(data['dt'])
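
`fromtimestamp` interprets each epoch value in the local timezone of the machine running the notebook, so the resulting clock times depend on where the code is run. One possible alternative, sketched below, converts the whole column at once and shifts it by the `timezone` offset from the raw data (14400 seconds). The `sample` frame is hypothetical and reuses the last two epoch values shown in `data.tail(3)`; the sketch also assumes the `timezone` column has not been dropped yet.

```python
import pandas as pd

sample = pd.DataFrame({
    "dt": [1584396000, 1584399600],   # 2020-03-16 22:00 and 23:00 UTC
    "timezone": [14400, 14400],       # offset from UTC in seconds (+4 h)
})

# Vectorised conversion of epoch seconds to (naive) UTC timestamps
utc_time = pd.to_datetime(sample["dt"], unit="s")

# Shift by the per-row offset to obtain local clock time
local_time = utc_time + pd.to_timedelta(sample["timezone"], unit="s")

print(local_time)
```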

In [264]:
data.head(2)

Unnamed: 0,lat,lon,main,wind,clouds,weather,dt,rain
0,25.07501,55.188761,"{'temp': 14.99, 'temp_min': 13, 'temp_max': 18...","{'speed': 3.1, 'deg': 150}",{'all': 1},"[{'id': 800, 'main': 'Clear', 'description': '...",2018-01-01 01:00:00,
1,25.07501,55.188761,"{'temp': 14.63, 'temp_min': 13, 'temp_max': 17...","{'speed': 2.6, 'deg': 150}",{'all': 1},"[{'id': 800, 'main': 'Clear', 'description': '...",2018-01-01 02:00:00,


In [265]:
transport.head(2)

Unnamed: 0,year,month,transport_type,trips,id
0,2018,Feb,Marine,141840,2
1,2018,Feb,Tram,528515,2


####   Feature Engineering 

+ Using the `dt` column to engineer new date-time features: `month`, `year` and `weekdays`

In [266]:
data['month'] = data['dt'].dt.month
data['year'] = data['dt'].dt.year
data['weekdays'] = data['dt'].dt.weekday

In [268]:
data.head()

Unnamed: 0,lat,lon,main,wind,clouds,weather,dt,rain,month,year,weekdays
0,25.07501,55.188761,"{'temp': 14.99, 'temp_min': 13, 'temp_max': 18...","{'speed': 3.1, 'deg': 150}",{'all': 1},"[{'id': 800, 'main': 'Clear', 'description': '...",2018-01-01 01:00:00,,1,2018,0
1,25.07501,55.188761,"{'temp': 14.63, 'temp_min': 13, 'temp_max': 17...","{'speed': 2.6, 'deg': 150}",{'all': 1},"[{'id': 800, 'main': 'Clear', 'description': '...",2018-01-01 02:00:00,,1,2018,0
2,25.07501,55.188761,"{'temp': 14.03, 'temp_min': 12, 'temp_max': 17...","{'speed': 1.5, 'deg': 150}",{'all': 1},"[{'id': 800, 'main': 'Clear', 'description': '...",2018-01-01 03:00:00,,1,2018,0
3,25.07501,55.188761,"{'temp': 13.78, 'temp_min': 12, 'temp_max': 17...","{'speed': 2.1, 'deg': 180}",{'all': 1},"[{'id': 701, 'main': 'Mist', 'description': 'm...",2018-01-01 04:00:00,,1,2018,0
4,25.07501,55.188761,"{'temp': 14.28, 'temp_min': 12, 'temp_max': 18...","{'speed': 2.6, 'deg': 160}",{'all': 1},"[{'id': 701, 'main': 'Mist', 'description': 'm...",2018-01-01 05:00:00,,1,2018,0


In [270]:
days_map = {0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'}
Month_map = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May',
             6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct',
             11: 'Nov',12: 'Dec'}
data['weekdays'] = data['weekdays'].map(days_map)
data['month'] = data['month'].map(Month_map)
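
The same labels can probably be obtained without hand-written dictionaries, since pandas exposes day and month names on the `.dt` accessor. The sketch below assumes the default English locale; `stamps` is a hypothetical series built from two timestamps that appear in the outputs of this notebook.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2018-01-01 01:00:00", "2020-03-17 00:00:00"]))

weekdays = stamps.dt.day_name()            # full day names, e.g. 'Monday'
months = stamps.dt.month_name().str[:3]    # first three letters, e.g. 'Jan'

print(weekdays.tolist(), months.tolist())
```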

In [272]:
data.head()

Unnamed: 0,lat,lon,main,wind,clouds,weather,dt,rain,month,year,weekdays
0,25.07501,55.188761,"{'temp': 14.99, 'temp_min': 13, 'temp_max': 18...","{'speed': 3.1, 'deg': 150}",{'all': 1},"[{'id': 800, 'main': 'Clear', 'description': '...",2018-01-01 01:00:00,,Jan,2018,Monday
1,25.07501,55.188761,"{'temp': 14.63, 'temp_min': 13, 'temp_max': 17...","{'speed': 2.6, 'deg': 150}",{'all': 1},"[{'id': 800, 'main': 'Clear', 'description': '...",2018-01-01 02:00:00,,Jan,2018,Monday
2,25.07501,55.188761,"{'temp': 14.03, 'temp_min': 12, 'temp_max': 17...","{'speed': 1.5, 'deg': 150}",{'all': 1},"[{'id': 800, 'main': 'Clear', 'description': '...",2018-01-01 03:00:00,,Jan,2018,Monday
3,25.07501,55.188761,"{'temp': 13.78, 'temp_min': 12, 'temp_max': 17...","{'speed': 2.1, 'deg': 180}",{'all': 1},"[{'id': 701, 'main': 'Mist', 'description': 'm...",2018-01-01 04:00:00,,Jan,2018,Monday
4,25.07501,55.188761,"{'temp': 14.28, 'temp_min': 12, 'temp_max': 18...","{'speed': 2.6, 'deg': 160}",{'all': 1},"[{'id': 701, 'main': 'Mist', 'description': 'm...",2018-01-01 05:00:00,,Jan,2018,Monday


+ Creating the `id` column in `data` with the `uniqueid` function, so that the weather data can be mapped to the `transport` data

In [273]:
data['id'] = data[['year','month']].apply(uniqueid,axis=1)

In [275]:
data.tail()

Unnamed: 0,lat,lon,main,wind,clouds,weather,dt,rain,month,year,weekdays,id
19339,25.07501,55.188761,"{'temp': 22.85, 'temp_min': 21, 'temp_max': 25...","{'speed': 3.6, 'deg': 50}",{'all': 0},"[{'id': 800, 'main': 'Clear', 'description': '...",2020-03-16 20:00:00,,Mar,2020,Monday,27
19340,25.07501,55.188761,"{'temp': 22.35, 'temp_min': 21, 'temp_max': 24...","{'speed': 4.6, 'deg': 60}",{'all': 0},"[{'id': 800, 'main': 'Clear', 'description': '...",2020-03-16 21:00:00,,Mar,2020,Monday,27
19341,25.07501,55.188761,"{'temp': 21.52, 'temp_min': 20, 'temp_max': 23...","{'speed': 3.1, 'deg': 60}",{'all': 0},"[{'id': 800, 'main': 'Clear', 'description': '...",2020-03-16 22:00:00,,Mar,2020,Monday,27
19342,25.07501,55.188761,"{'temp': 21.04, 'temp_min': 19, 'temp_max': 23...","{'speed': 3.1, 'deg': 70}",{'all': 0},"[{'id': 800, 'main': 'Clear', 'description': '...",2020-03-16 23:00:00,,Mar,2020,Monday,27
19343,25.07501,55.188761,"{'temp': 20.31, 'temp_min': 18, 'temp_max': 23...","{'speed': 3.6, 'deg': 60}",{'all': 0},"[{'id': 800, 'main': 'Clear', 'description': '...",2020-03-17 00:00:00,,Mar,2020,Tuesday,27
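
With the same `id` on both sides, the hourly weather data could be summarised per month and joined to the monthly trip counts. The sketch below is only one way this mapping might look, not necessarily the approach taken later in the analysis: the frames `weather_demo` and `transport_demo`, the `temp` column and its values are hypothetical, while the trip counts are taken from `transport.head()`.

```python
import pandas as pd

# Hypothetical hourly weather rows that already carry the month id.
weather_demo = pd.DataFrame({
    "id": [2, 2, 3],
    "temp": [20.0, 22.0, 25.0],   # illustrative values only
})

# Trip counts from transport.head() (id 2 = Feb 2018, id 3 = Mar 2018).
transport_demo = pd.DataFrame({
    "id": [2, 2, 3],
    "transport_type": ["Marine", "Tram", "Marine"],
    "trips": [141840, 528515, 166561],
})

# One weather row per month, then a left join onto the transport rows.
monthly_weather = weather_demo.groupby("id", as_index=False).agg(mean_temp=("temp", "mean"))
merged = transport_demo.merge(monthly_weather, on="id", how="left")

print(merged)
```

A left join keeps every transport row even if a month has no weather records, in which case `mean_temp` is left missing.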


+ Transforming the `main`, `wind`, `clouds`, `weather` and `rain` columns to extract their nested details into a format suitable for analysis
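
A minimal sketch of one way to flatten such columns uses `pd.json_normalize`. The `nested` frame is hypothetical and contains only keys that are fully visible in the outputs above (the truncated ones are left out). It assumes a default 0..n-1 index and takes the first entry of each `weather` list. `rain` is omitted because its layout is not visible in the rows shown.

```python
import pandas as pd

nested = pd.DataFrame({
    "main": [{"temp": 14.99, "temp_min": 13}, {"temp": 14.63, "temp_min": 13}],
    "wind": [{"speed": 3.1, "deg": 150}, {"speed": 2.6, "deg": 150}],
    "clouds": [{"all": 1}, {"all": 1}],
    "weather": [[{"id": 800, "main": "Clear"}], [{"id": 800, "main": "Clear"}]],
})

# Expand each dict column into its own set of prefixed columns.
parts = [
    pd.json_normalize(nested[col].tolist()).add_prefix(col + "_")
    for col in ["main", "wind", "clouds"]
]

# 'weather' holds a list of dicts per row; keep only the first entry here.
first_weather = pd.json_normalize([w[0] for w in nested["weather"]]).add_prefix("weather_")

flat = pd.concat(parts + [first_weather], axis=1)
print(flat.columns.tolist())
```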

In [None]:
data