# **Analyzing and Predicting Business Activity for Cafes and Restaurants in Melbourne**

**Authored by:** Sachitha Sadeepa Kasthuriarachchi

**Duration:** TBD

**Level:** Beginner/Intermediate

**Pre-requisite Skills:** Python programming, Jupyter Notebooks, Power BI, data analysis, machine learning, and geospatial analysis.


**Background:**

I am planning to open a new cafe in Melbourne and want to ensure that my business meets customer demands and adheres to area-specific trends. To help with this, I am using a comprehensive dataset of existing cafes and restaurants, which includes information about their seating capacity, location, industry type, and other relevant factors.

**Objectives:**

1. Distribution Analysis: Understand the distribution of cafes and restaurants across different areas of Melbourne.

2. Seating Capacity Analysis: Compare the seating capacity for indoor and outdoor seating.
   
3. Industry Analysis: Explore the prevalence of different industries within the cafes and restaurants sector.
     
4. Geospatial Analysis: Visualize the geographic distribution of cafes and restaurants across Melbourne.
    
5. Trend Analysis: Identify trends in business activity over different census years.

6. Prediction Task: Develop a predictive model to forecast the number of seats based on various features.

**Scenario:**

1. **Informed Decision-Making:** I use the insights provided by the data analysis to make data-driven decisions about my new cafe, reducing the risk of underestimating or overestimating seating capacity.

2. **Competitive Advantage:** By understanding local trends and customer preferences, I can tailor my business strategy to stand out in a competitive market.

3. **Resource Optimization:** The predictive model helps me allocate resources effectively, ensuring I invest appropriately in indoor and outdoor seating based on demand forecasts.

## 1. Data Loading and Examination

Below are links to the four datasets used in this use case.

[Data Set 1](https://data.melbourne.vic.gov.au/explore/dataset/cafes-and-restaurants-with-seating-capacity/information/) **Café, restaurant, bistro seats.** This dataset contains **60055** records.

[Data Set 2](https://data.melbourne.vic.gov.au/explore/dataset/blocks-for-census-of-land-use-and-employment-clue/information/?location=13,-37.80246,144.94417&basemap=mbs-7a7333) **Blocks for Census of Land Use and Employment (CLUE).** This dataset contains **606** records.


[Data Set 3](https://data.melbourne.vic.gov.au/explore/dataset/employment-by-block-by-clue-industry/information/) **Jobs per CLUE industry for blocks.** This dataset contains **12394** records.


[Data Set 4](https://data.melbourne.vic.gov.au/explore/dataset/floor-space-by-use-by-block/information/) **Floor space per space use for blocks.** This dataset contains **12394** records.


### 1.1 Import Datasets through the API

In [1]:
import requests
import pandas as pd
from io import StringIO

# Function to collect a dataset from the City of Melbourne open data API
def collect_data(dataset_id):
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    #apikey = api_key  # uncomment if the dataset requires API key permissions
    export_format = 'csv'  # renamed so the built-in format() is not shadowed

    url = f'{base_url}{dataset_id}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC',
        #'api_key': apikey  # uncomment if the dataset requires API key permissions
    }

    # GET request
    response = requests.get(url, params=params)

    if response.status_code == 200:
        # StringIO to read the CSV data
        url_content = response.content.decode('utf-8')
        dataset = pd.read_csv(StringIO(url_content), delimiter=';')
        return dataset
    else:
        print(f'Request failed with status code {response.status_code}')
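
The parsing step inside `collect_data` can be checked without a network call. The sketch below is an illustration only: it assumes the export is semicolon-delimited, as the `delimiter=';'` argument implies, and uses a small hand-written sample (`sample_csv` is a hypothetical name) built from values shown in the Data Set 1 preview.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row sample in the same semicolon-delimited layout as the export.
sample_csv = (
    "census_year;block_id;trading_name;number_of_seats\n"
    "2017;6;Transport Hotel;230\n"
    "2017;11;Altius Coffee Brewers;4\n"
)

# Same parsing call as in collect_data, applied to in-memory text.
sample_df = pd.read_csv(StringIO(sample_csv), delimiter=";")
print(sample_df.shape)
```

Because `collect_data` prints a message and implicitly returns `None` when a request fails, it may be worth checking each result for `None` before using it.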



### 1.2 Load the Dataset

In [2]:
# Dataset identifiers used in the API calls
dataset_1_id = 'cafes-and-restaurants-with-seating-capacity'
dataset_2_id = 'blocks-for-census-of-land-use-and-employment-clue'
dataset_3_id = 'employment-by-block-by-clue-industry'
dataset_4_id = 'floor-space-by-use-by-block'

In [3]:
# Save the datasets to the df1, df2, df3 and df4 variables
df1 = collect_data(dataset_1_id)
df2 = collect_data(dataset_2_id)
df3 = collect_data(dataset_3_id)
df4 = collect_data(dataset_4_id)

### 1.3 Examine the Datasets

#### 1.3.1 Data Set 1

In [4]:
# Check number of records in df1
print(f'The dataset contains {len(df1)} records.')
# View df1
df1.head(3)

The dataset contains 60055 records.


Unnamed: 0,census_year,block_id,property_id,base_property_id,building_address,clue_small_area,trading_name,business_address,industry_anzsic4_code,industry_anzsic4_description,seating_type,number_of_seats,longitude,latitude,location
0,2017,6,578324,573333,2 Swanston Street MELBOURNE 3000,Melbourne (CBD),Transport Hotel,"Tenancy 29, Ground , 2 Swanston Street MELBOUR...",4520,"Pubs, Taverns and Bars",Seats - Indoor,230,144.969942,-37.817778,"-37.817777826050005, 144.96994164279243"
1,2017,6,578324,573333,2 Swanston Street MELBOURNE 3000,Melbourne (CBD),Transport Hotel,"Tenancy 29, Ground , 2 Swanston Street MELBOUR...",4520,"Pubs, Taverns and Bars",Seats - Outdoor,120,144.969942,-37.817778,"-37.817777826050005, 144.96994164279243"
2,2017,11,103957,103957,517-537 Flinders Lane MELBOURNE 3000,Melbourne (CBD),Altius Coffee Brewers,"Shop , Ground , 517 Flinders Lane MELBOURNE 3000",4512,Takeaway Food Services,Seats - Outdoor,4,144.956486,-37.819875,"-37.819875445799994, 144.95648638781466"
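
The preview shows that indoor and outdoor seats are stored as separate rows per venue. For the seating capacity comparison in the objectives, one possible approach (a sketch, not necessarily the method used later in this use case) is to pivot `seating_type` into columns. The frame below copies the three preview rows; `seats` and `wide` are hypothetical names.

```python
import pandas as pd

# The three rows from the df1 preview, reduced to the relevant columns.
seats = pd.DataFrame({
    "census_year": [2017, 2017, 2017],
    "trading_name": ["Transport Hotel", "Transport Hotel", "Altius Coffee Brewers"],
    "seating_type": ["Seats - Indoor", "Seats - Outdoor", "Seats - Outdoor"],
    "number_of_seats": [230, 120, 4],
})

# One row per venue and year, one column per seating type; absent types become 0.
wide = seats.pivot_table(
    index=["census_year", "trading_name"],
    columns="seating_type",
    values="number_of_seats",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
```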


#### 1.3.2 Data Set 2

In [5]:
# Check number of records in df2
print(f'The dataset contains {len(df2)} records.')
# View df2
df2.head(3)

The dataset contains 606 records.


Unnamed: 0,geo_point_2d,geo_shape,block_id,clue_area
0,"-37.82296169692379, 144.95049282288122","{""coordinates"": [[[144.9479230372, -37.8233694...",1112,Docklands
1,"-37.78537422996195, 144.94085920366408","{""coordinates"": [[[144.9426153438, -37.7866287...",927,Parkville
2,"-37.777687358375964, 144.94600024715058","{""coordinates"": [[[144.9425926939, -37.7787229...",929,Parkville
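
In this preview, `geo_point_2d` holds the block centroid as a single `"lat, lon"` string. A minimal sketch of splitting it into numeric columns follows, assuming every value has that form. The column names `block_lat` and `block_lon` are hypothetical.

```python
import pandas as pd

# Two geo_point_2d values copied from the df2 preview.
blocks = pd.DataFrame({
    "block_id": [1112, 927],
    "geo_point_2d": [
        "-37.82296169692379, 144.95049282288122",
        "-37.78537422996195, 144.94085920366408",
    ],
})

# Split "lat, lon" on the comma, trim whitespace, and convert to numbers.
coords = blocks["geo_point_2d"].str.split(",", expand=True)
blocks["block_lat"] = pd.to_numeric(coords[0].str.strip())
blocks["block_lon"] = pd.to_numeric(coords[1].str.strip())
print(blocks[["block_id", "block_lat", "block_lon"]])
```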


#### 1.3.3 Data Set 3

In [6]:
# Check number of records in df3
print(f'The dataset contains {len(df3)} records.')
# View df3
df3.head(3)

The dataset contains 12394 records.


Unnamed: 0,census_year,block_id,clue_small_area,accommodation,admin_and_support_services,agriculture_and_mining,arts_and_recreation_services,business_services,construction,education_and_training,...,information_media_and_telecommunications,manufacturing,other_services,public_administration_and_safety,real_estate_services,rental_and_hiring_services,retail_trade,transport_postal_and_storage,wholesale_trade,total_jobs_in_block
0,2022,4,Melbourne (CBD),0.0,0.0,0.0,362.0,0.0,0.0,,...,0.0,0.0,,0.0,0.0,0.0,38.0,368.0,0.0,1008.0
1,2022,5,Melbourne (CBD),0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2022,6,Melbourne (CBD),0.0,0.0,0.0,203.0,0.0,0.0,0.0,...,,0.0,,0.0,,0.0,47.0,0.0,0.0,647.0


#### 1.3.4 Data Set 4

In [7]:
# Check number of records in df4
print(f'The dataset contains {len(df4)} records.')
# View df4
df4.head(3)

The dataset contains 12394 records.


Unnamed: 0,census_year,block_id,clue_small_area,commercial_accommodation,common_area,community_use,educational_research,entertainment_recreation_indoor,equipment_installation,hospital_clinic,...,transport,transport_storage_uncovered,unoccupied_under_construction,unoccupied_under_demolition_condemned,unoccupied_under_renovation,unoccupied_undeveloped_site,unoccupied_unused,wholesale,workshop_studio,total_floor_space_in_block
0,2013,2387,North Melbourne,0.0,,0.0,,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,44471.0
1,2013,2390,North Melbourne,0.0,1040.0,0.0,,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5357.0,,5071.0,48332.0
2,2013,2501,Kensington,0.0,0.0,,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21456.0


## 2.0 Data Cleaning and Preprocessing

#### 2.1 Identifying Missing Values and Proportions in Dataset 1

In [8]:
missing_values = df1.isnull().sum()
print("--------------------------Current Missing Values --------------------------")
print(missing_values)

# Let's see the proportion
missing_proportion = (missing_values / len(df1))*100
print("---------------------Current Missing Values Propotion ---------------------")
print(missing_proportion.round(3).astype(str) + '%')

--------------------------Current Missing Values --------------------------
census_year                       0
block_id                          0
property_id                       0
base_property_id                  0
building_address                  0
clue_small_area                   0
trading_name                      0
business_address                  0
industry_anzsic4_code             0
industry_anzsic4_description      0
seating_type                      0
number_of_seats                   0
longitude                       527
latitude                        527
location                        527
dtype: int64
---------------------Current Missing Values Propotion ---------------------
census_year                       0.0%
block_id                          0.0%
property_id                       0.0%
base_property_id                  0.0%
building_address                  0.0%
clue_small_area                   0.0%
trading_name                      0.0%
business_address      

#### 2.1 Identifying Missing Values and Proportions in Dataset 2

In [9]:
missing_values = df2.isnull().sum()
print("--------------------------Current Missing Values --------------------------")
print(missing_values)

# Let's see the proportion
missing_proportion = (missing_values / len(df2))*100
print("---------------------Current Missing Values Propotion ---------------------")
print(missing_proportion.round(3).astype(str) + '%')

--------------------------Current Missing Values --------------------------
geo_point_2d    0
geo_shape       0
block_id        0
clue_area       0
dtype: int64
---------------------Current Missing Values Propotion ---------------------
geo_point_2d    0.0%
geo_shape       0.0%
block_id        0.0%
clue_area       0.0%
dtype: object
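
The same count-and-proportion check is repeated for each dataset. As an optional refactoring sketch, the two steps could be wrapped in one helper that returns a single table. `missing_summary` is a hypothetical name and the demo frame is made-up data, not CLUE records.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column (hypothetical helper)."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(3)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Small made-up frame to show the shape of the result.
demo = pd.DataFrame({
    "block_id": [6, 6, 11, 11],
    "longitude": [144.97, None, None, 144.95],
})
summary = missing_summary(demo)
print(summary)
```

With the real data this would be called as `missing_summary(df1)`, `missing_summary(df2)`, and so on.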


#### 2.1 Identifying Missing Values and Proportions in Dataset 3

In [10]:
missing_values = df3.isnull().sum()
print("--------------------------Current Missing Values --------------------------")
print(missing_values)

# Let's see the proportion
missing_proportion = (missing_values / len(df3))*100
print("---------------------Current Missing Values Propotion ---------------------")
print(missing_proportion.round(3).astype(str) + '%')

--------------------------Current Missing Values --------------------------
census_year                                    0
block_id                                       0
clue_small_area                                0
accommodation                               2408
admin_and_support_services                  2019
agriculture_and_mining                       599
arts_and_recreation_services                3927
business_services                           1875
construction                                2075
education_and_training                      2553
electricity_gas_water_and_waste_services    2468
finance_and_insurance                       1578
food_and_beverage_services                  2628
health_care_and_social_assistance           2560
information_media_and_telecommunications    2298
manufacturing                               2309
other_services                              3280
public_administration_and_safety            1852
real_estate_services                      

#### 2.1 Identifying Missing Values and Proportions in Dataset 4

In [11]:
missing_values = df4.isnull().sum()
print("--------------------------Current Missing Values --------------------------")
print(missing_values)

# Let's see the proportion
missing_proportion = (missing_values / len(df4))*100
print("---------------------Current Missing Values Propotion ---------------------")
print(missing_proportion.round(3).astype(str) + '%')

--------------------------Current Missing Values --------------------------
census_year                                 0
block_id                                    0
clue_small_area                             0
commercial_accommodation                 2236
common_area                              3088
community_use                             879
educational_research                     2585
entertainment_recreation_indoor          2530
equipment_installation                   3132
hospital_clinic                          1776
house_townhouse                          1034
institutional_accommodation               501
manufacturing                             963
office                                   2052
park_reserve                             2292
parking_commercial_covered               1847
parking_commercial_uncovered              521
parking_private_covered                  2954
parking_private_uncovered                1313
performances_conferences_ceremonies      2108
priv

In [12]:
# Identify rows with any missing values
missing_value_rows = df1[df1.isnull().any(axis=1)]

# Print the rows with missing values
print("--------------------------Rows with Missing Values --------------------------")
print(missing_value_rows)


--------------------------Rows with Missing Values --------------------------
       census_year  block_id  property_id  base_property_id  \
3082          2003        78       110714            105763   
4063          2005         5       101345            101345   
4471          2005        91       598270            598270   
4589          2005       255       594107            594107   
4673          2005       444       598605            598605   
...            ...       ...          ...               ...   
58989         2016      2524       616022            616022   
58990         2016      2524       616043            616043   
58991         2016      2524       616045            616045   
58992         2016      2530       616160            616160   
58993         2016      2539       614669            614669   

                                 building_address  \
3082         32-36 Lonsdale Street MELBOURNE 3000   
4063               Flinders Street MELBOURNE 3000   
4471  

## Installation of Required Package

To use the `geopy` library, install it with pip:

```sh
pip install geopy
```

In [13]:
pip install geopy

Note: you may need to restart the kernel to use updated packages.


## Explanation of the Code

1. **Imports**: 
    - `ArcGIS` from the `geopy.geocoders` module is imported for geocoding.
    - `pandas` is imported for data manipulation.
    - `time` is imported to introduce delays.

2. **Geocoder Initialization**:
    - The `ArcGIS` geocoder is initialized without specifying a user agent.

3. **Geocoding Function**:
    - A function `get_lat_long` is defined to get the latitude and longitude of a given address.
    - The function attempts to geocode the address up to three times (default) before giving up.
    - If geocoding is successful, it returns the latitude and longitude.
    - If a `GeocoderTimedOut` or `GeocoderServiceError` occurs, it retries after waiting for one second.
    - If any other exception occurs, it logs the error and returns `None, None`.

4. **Identify Missing Lat/Long**:
    - Rows with missing latitude and longitude values are identified in the DataFrame `df1`.

5. **Failed Addresses Log**:
    - An empty list `failed_addresses` is created to store addresses that couldn't be geocoded.

6. **Apply Geocoding Function**:
    - The geocoding function is applied to the addresses of rows with missing lat/long values.
    - For each address, if geocoding fails, the address is added to the `failed_addresses` list.
    - The DataFrame is updated with the obtained latitude and longitude.
    - A one-second delay is added between requests to avoid hitting rate limits.

7. **Print Filled Rows**:
    - The first three rows that have both latitude and longitude values are printed as a spot check.

8. **Print Failed Addresses**:
    - If there are any addresses that couldn't be geocoded, they are printed.


In [14]:
from geopy.geocoders import ArcGIS
from geopy.exc import GeocoderTimedOut, GeocoderServiceError  # exceptions caught in get_lat_long
import pandas as pd
import time

# Initialize the ArcGIS geocoder
geolocator = ArcGIS()

# Function to get latitude and longitude based on address with retry logic
def get_lat_long(address, retries=3):
    for attempt in range(retries):
        try:
            location = geolocator.geocode(address, timeout=10)
            if location:
                return location.latitude, location.longitude
            else:
                return None, None
        except (GeocoderTimedOut, GeocoderServiceError) as e:
            print(f"Error geocoding {address}: {e}")
            time.sleep(1)  # Wait for 1 second before retrying
        except Exception as e:
            print(f"Unexpected error geocoding {address}: {e}")
            return None, None
    return None, None

# Identify rows with missing latitude and longitude
missing_lat_long = df1[df1['latitude'].isnull() & df1['longitude'].isnull()]

# Log addresses that couldn't be geocoded
failed_addresses = []

# Apply the geocoding function to the building addresses of rows with missing values
for idx, row in missing_lat_long.iterrows():
    lat, long = get_lat_long(row['building_address'])
    if lat is None or long is None:
        failed_addresses.append(row['building_address'])
    df1.at[idx, 'latitude'] = lat
    df1.at[idx, 'longitude'] = long
    time.sleep(1)  # Add a delay to avoid hitting rate limits



# Print the first three rows that now have both coordinates
print(df1[df1['latitude'].notnull() & df1['longitude'].notnull()].head(3))

# Print failed addresses
if failed_addresses:
    print("\nAddresses that couldn't be geocoded:")
    for address in failed_addresses:
        print(address)


   census_year  block_id  property_id  base_property_id  \
0         2017         6       578324            573333   
1         2017         6       578324            573333   
2         2017        11       103957            103957   

                       building_address  clue_small_area  \
0      2 Swanston Street MELBOURNE 3000  Melbourne (CBD)   
1      2 Swanston Street MELBOURNE 3000  Melbourne (CBD)   
2  517-537 Flinders Lane MELBOURNE 3000  Melbourne (CBD)   

            trading_name                                   business_address  \
0        Transport Hotel  Tenancy 29, Ground , 2 Swanston Street MELBOUR...   
1        Transport Hotel  Tenancy 29, Ground , 2 Swanston Street MELBOUR...   
2  Altius Coffee Brewers   Shop , Ground , 517 Flinders Lane MELBOURNE 3000   

   industry_anzsic4_code industry_anzsic4_description     seating_type  \
0                   4520       Pubs, Taverns and Bars   Seats - Indoor   
1                   4520       Pubs, Taverns and Bars  Se
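
The loop above sends one request per row. If the same `building_address` appears in several rows (for example, one row per seating type), geocoding each distinct address once and reusing the result could reduce the number of requests. The sketch below illustrates that idea offline: `fake_geocode` is a hypothetical stand-in for `get_lat_long`, and its coordinates are made up.

```python
import pandas as pd


def fake_geocode(address):
    """Offline stand-in for get_lat_long; the coordinates are illustrative only."""
    lookup = {"Flinders Street MELBOURNE 3000": (-37.818, 144.967)}
    return lookup.get(address, (None, None))


rows = pd.DataFrame({
    "building_address": [
        "Flinders Street MELBOURNE 3000",
        "Flinders Street MELBOURNE 3000",
        "Unknown Place",
    ],
})

# Geocode each distinct address once, then map the results back to every row.
cache = {addr: fake_geocode(addr) for addr in rows["building_address"].unique()}
rows["latitude"] = rows["building_address"].map(lambda a: cache[a][0])
rows["longitude"] = rows["building_address"].map(lambda a: cache[a][1])
print(rows)
```

Here three rows need only two lookups. With the real geocoder, the one-second delay would go inside the loop that fills the cache.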

In [15]:
missing_values = df1.isnull().sum()
print("--------------------------Current Missing Values --------------------------")
print(missing_values)

# Let's see the proportion
missing_proportion = (missing_values / len(df1))*100
print("---------------------Current Missing Values Propotion ---------------------")
print(missing_proportion.round(3).astype(str) + '%')

--------------------------Current Missing Values --------------------------
census_year                       0
block_id                          0
property_id                       0
base_property_id                  0
building_address                  0
clue_small_area                   0
trading_name                      0
business_address                  0
industry_anzsic4_code             0
industry_anzsic4_description      0
seating_type                      0
number_of_seats                   0
longitude                         0
latitude                          0
location                        527
dtype: int64
---------------------Current Missing Values Propotion ---------------------
census_year                       0.0%
block_id                          0.0%
property_id                       0.0%
base_property_id                  0.0%
building_address                  0.0%
clue_small_area                   0.0%
trading_name                      0.0%
business_address      
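
The output above still reports 527 missing values in `location`, because the geocoding step fills only `latitude` and `longitude`. One way to complete the column, assuming it should follow the `"lat, lon"` string format seen in the Data Set 1 preview, is sketched below. The second row of the demo frame uses made-up coordinates.

```python
import pandas as pd

# First row mirrors the preview format; second row is illustrative only.
demo = pd.DataFrame({
    "latitude": [-37.817778, -37.8183],
    "longitude": [144.969942, 144.9671],
    "location": ["-37.817778, 144.969942", None],
})

# Build "lat, lon" strings only where location is missing and both coordinates exist.
needs_fill = (
    demo["location"].isnull()
    & demo["latitude"].notnull()
    & demo["longitude"].notnull()
)
demo.loc[needs_fill, "location"] = (
    demo.loc[needs_fill, "latitude"].astype(str)
    + ", "
    + demo.loc[needs_fill, "longitude"].astype(str)
)
print(demo["location"].tolist())
```

Applied to `df1`, the same mask would leave existing `location` values untouched.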

In [None]:
# Save the updated dataframe to a new CSV file
df1.to_csv('C:/Users/dpmdj/Downloads/cafes-and-restaurants-with-seating-capacity-updated.csv', index=False)
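
The path above is specific to one Windows machine. A more portable option, shown here as a sketch, is to build the path with `pathlib` relative to the notebook. The `data` folder name is an assumption and the one-row frame is a placeholder for `df1`.

```python
from pathlib import Path

import pandas as pd

# Hypothetical output folder relative to the working directory; adjust as needed.
output_dir = Path("data")
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / "cafes-and-restaurants-with-seating-capacity-updated.csv"

# Placeholder frame standing in for df1.
demo = pd.DataFrame({"trading_name": ["Transport Hotel"], "number_of_seats": [230]})
demo.to_csv(output_path, index=False)
print(output_path)
```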
