# Starbucks Worldwide Location Project

# CRISP-DM

# A. Business Understanding

**Business Objective**: The main objective of the project is to develop a predictive model that accurately estimates the number of Starbucks stores in a country from a set of variables such as GDP per capita, population, and ease of doing business.

**Business Success Criteria**: The project is considered successful if the model's predictions achieve an R-squared value of at least 0.7.
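To make the criterion concrete, the following minimal sketch computes R-squared by hand as 1 - SS_res / SS_tot and compares it with the 0.7 threshold. The store counts, the predictions, and the names `r_squared` and `R2_THRESHOLD` are illustrative assumptions, not values or code from this project. `sklearn.metrics.r2_score` is expected to give the same result for the same inputs.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up store counts for five countries and made-up model predictions
actual = [10, 50, 120, 300, 20]
predicted = [15, 45, 110, 280, 40]

R2_THRESHOLD = 0.7  # success criterion stated above
score = r_squared(actual, predicted)
meets_criterion = score >= R2_THRESHOLD
print(f"R^2 = {score:.3f}, meets criterion: {meets_criterion}")
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean store count.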

---

**Dataset**: Two datasets available on Kaggle were used for the number of Starbucks stores per country. The first covers 2016 and the second covers 2021. In addition, World Bank data were used for country-level parameters such as GDP and population. The World Bank data were obtained with the wbgapi library.

[Starbucks Locations Worldwide 2016](https://www.kaggle.com/datasets/starbucks/store-locations)

[Starbucks Locations Worldwide 2021](https://www.kaggle.com/datasets/kukuroo3/starbucks-locations-worldwide-2021-version)

__Important note:__ _The 2021 Starbucks dataset was used only in the data analysis part, under certain assumptions. The machine learning model was built from the 2016 dataset and the World Bank data._

---

**Technologies and Tools**:

* Programming Language:
    * Python
* Tools:
    * Jupyter Notebook
* Libraries:
    * Pandas
    * NumPy
    * Scikit-learn
    * LightGBM
    * CatBoost
    * XGBoost

Since the problem involves predicting a continuous outcome, I will use regression models, both classical ones and relatively new machine learning methods. This makes it possible to explore different modeling approaches and compare their performance.

Models used in this project, chosen as a result of my research:
* Machine Learning Models
    * Linear Regression
    * Decision Tree Regression
    * Random Forest Regression
    * Boosting Methods
        * XGBoost
        * LightGBM
        * CatBoost
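As a rough illustration of how such a comparison could be set up, the sketch below scores three of the listed scikit-learn regressors with 5-fold cross-validated R-squared. It runs on synthetic data from `make_regression`, not on the Starbucks data, and the hyperparameters are arbitrary assumptions. The boosting models could be added to the `models` dictionary in the same way, since their regressors follow the scikit-learn estimator interface.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the country-level feature table (assumption)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Same folds for every model so the scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}

# Print models from best to worst mean R^2
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```

Because the synthetic target is linear in the features, linear regression is expected to do well here. On real data the ranking may differ.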


# B. Data Understanding 

## B.1 Import Libraries

In [1]:
import numpy as np
import pandas as pd

import pycountry as pc
import pycountry_convert as pcc
import wbgapi as wb

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split, RepeatedKFold, KFold
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

## B.2 Collect Initial Data

### B.2.1 Starbucks Worldwide Locations

In [2]:
starbucks_location_2016 = pd.read_csv('data/starbucks_locations_2016.csv', sep= ',')
starbucks_location_2021 = pd.read_csv('data/starbucks_locations_2022.csv', sep= ',')

### B.2.2 World Bank Dataset

In [3]:
wb.source.info()

id,name,code,concepts,lastupdated
1.0,Doing Business,DBS,3.0,2021-08-18
2.0,World Development Indicators,WDI,3.0,2022-12-22
3.0,Worldwide Governance Indicators,WGI,3.0,2022-09-23
5.0,Subnational Malnutrition Database,SNM,3.0,2016-03-21
6.0,International Debt Statistics,IDS,4.0,2022-12-06
11.0,Africa Development Indicators,ADI,3.0,2013-02-22
12.0,Education Statistics,EDS,3.0,2020-12-20
13.0,Enterprise Surveys,ESY,3.0,2022-03-25
14.0,Gender Statistics,GDS,3.0,2022-06-23
15.0,Global Economic Monitor,GEM,3.0,2020-07-27


In [4]:
wb.series.info(q='population')

id,value
EN.ATM.PM25.MC.T1.ZS,"PM2.5 pollution, population exposed to levels exceeding WHO Interim Target-1 value (% of total)"
EN.ATM.PM25.MC.T2.ZS,"PM2.5 pollution, population exposed to levels exceeding WHO Interim Target-2 value (% of total)"
EN.ATM.PM25.MC.T3.ZS,"PM2.5 pollution, population exposed to levels exceeding WHO Interim Target-3 value (% of total)"
EN.ATM.PM25.MC.ZS,"PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total)"
EN.POP.DNST,Population density (people per sq. km of land area)
EN.POP.EL5M.RU.ZS,Rural population living in areas where elevation is below 5 meters (% of total population)
EN.POP.EL5M.UR.ZS,Urban population living in areas where elevation is below 5 meters (% of total population)
EN.POP.EL5M.ZS,Population living in areas where elevation is below 5 meters (% of total population)
EN.POP.SLUM.UR.ZS,Population living in slums (% of urban population)
EN.URB.LCTY,Population in largest city


In [5]:
wb.data.DataFrame(series='SP.POP.TOTL', economy='all', time='2016')

economy,SP.POP.TOTL
ABW,104874.0
AFE,616377331.0
AFG,34636207.0
AFW,419778384.0
AGO,29154746.0
...,...
XKX,1777557.0
YEM,29274002.0
ZAF,56422274.0
ZMB,16767761.0


## B.3 Describe Data

In [6]:
starbucks_location_2016.shape, starbucks_location_2021.shape

((25600, 13), (28289, 17))

In [7]:
starbucks_location_2016.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25600 entries, 0 to 25599
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Brand           25600 non-null  object 
 1   Store Number    25600 non-null  object 
 2   Store Name      25600 non-null  object 
 3   Ownership Type  25600 non-null  object 
 4   Street Address  25598 non-null  object 
 5   City            25585 non-null  object 
 6   State/Province  25600 non-null  object 
 7   Country         25600 non-null  object 
 8   Postcode        24078 non-null  object 
 9   Phone Number    18739 non-null  object 
 10  Timezone        25600 non-null  object 
 11  Longitude       25599 non-null  float64
 12  Latitude        25599 non-null  float64
dtypes: float64(2), object(11)
memory usage: 2.5+ MB


In [8]:
starbucks_location_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28289 entries, 0 to 28288
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              28289 non-null  int64  
 1   storeNumber             28289 non-null  object 
 2   countryCode             28289 non-null  object 
 3   ownershipTypeCode       28289 non-null  object 
 4   schedule                23204 non-null  object 
 5   slug                    28289 non-null  object 
 6   latitude                28289 non-null  float64
 7   longitude               28289 non-null  float64
 8   streetAddressLine1      28288 non-null  object 
 9   streetAddressLine2      7966 non-null   object 
 10  streetAddressLine3      5136 non-null   object 
 11  city                    28288 non-null  object 
 12  countrySubdivisionCode  28289 non-null  object 
 13  postalCode              27488 non-null  object 
 14  currentTimeOffset       28289 non-null

In [12]:
# The 2021 dataset has no Brand column, so all of its rows are assumed to be Starbucks-branded stores
starbucks_location_2016['Brand'].value_counts()

Starbucks                25249
Teavana                    348
Evolution Fresh              2
Coffee House Holdings        1
Name: Brand, dtype: int64

### B.3.1 Check Duplicate Value

In [13]:
starbucks_location_2016.duplicated().value_counts()

False    25600
dtype: int64

In [14]:
starbucks_location_2021.duplicated().value_counts()

False    28289
dtype: int64

In [15]:
starbucks_location_2016.duplicated(subset=['Store Number']).value_counts()

False    25599
True         1
dtype: int64
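The check above reports one repeated `Store Number`. A minimal sketch of how such rows could be inspected and, if appropriate, dropped is shown below. It uses a small made-up frame rather than the real dataset, and keeping the first occurrence is an assumption, not necessarily the choice made in this project.

```python
import pandas as pd

# Made-up rows standing in for the 2016 dataset
stores = pd.DataFrame({
    "Store Number": ["A-1", "B-2", "B-2", "C-3"],
    "Store Name": ["North", "Harbor", "Harbor Mall", "Airport"],
    "Country": ["US", "GB", "GB", "TR"],
})

# keep=False marks every row that shares a Store Number, not only the repeats
dup_mask = stores.duplicated(subset=["Store Number"], keep=False)
duplicate_rows = stores[dup_mask]
print(duplicate_rows)

# Keep the first occurrence of each Store Number (assumed policy)
deduplicated = stores.drop_duplicates(subset=["Store Number"], keep="first")
print(len(stores), "->", len(deduplicated))
```

Looking at the duplicated rows side by side first helps decide whether they are true duplicates or two distinct stores that share an identifier.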

In [16]:
starbucks_location_2021.duplicated(subset=['storeNumber']).value_counts()

False    28289
dtype: int64

### B.3.2 Check Empty Value

In [18]:
# Calculate the percentage of null values in each column, and then sort the columns by the percentage of null values
starbucks_location_2016.isnull().mean().mul(100).sort_values(ascending=False).round(2)

Phone Number      26.80
Postcode           5.95
City               0.06
Street Address     0.01
Longitude          0.00
Latitude           0.00
Brand              0.00
Store Number       0.00
Store Name         0.00
Ownership Type     0.00
State/Province     0.00
Country            0.00
Timezone           0.00
dtype: float64

In [19]:
starbucks_location_2021.isnull().mean().mul(100).sort_values(ascending=False).round(2)

streetAddressLine3        81.84
streetAddressLine2        71.84
schedule                  17.98
postalCode                 2.83
streetAddressLine1         0.00
city                       0.00
windowsTimeZoneId          0.00
currentTimeOffset          0.00
countrySubdivisionCode     0.00
Unnamed: 0                 0.00
storeNumber                0.00
longitude                  0.00
latitude                   0.00
slug                       0.00
ownershipTypeCode          0.00
countryCode                0.00
olsonTimeZoneId            0.00
dtype: float64
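One possible follow-up to these percentages is to drop columns that are mostly empty. The sketch below does this on a small made-up frame. The 50% cutoff and the name `MAX_NULL_SHARE` are assumptions chosen for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the 2021 dataset
stores = pd.DataFrame({
    "storeNumber": ["1", "2", "3", "4"],
    "streetAddressLine3": [np.nan, np.nan, np.nan, "Floor 2"],
    "postalCode": ["34000", np.nan, "10001", "SW1A"],
})

# Share of missing values per column, between 0 and 1
null_share = stores.isnull().mean()

MAX_NULL_SHARE = 0.5  # assumed cutoff
sparse_columns = null_share[null_share > MAX_NULL_SHARE].index.tolist()

# Drop only the columns that exceed the cutoff
trimmed = stores.drop(columns=sparse_columns)
print(sparse_columns)
print(trimmed.columns.tolist())
```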

# C. Data Preparation

## C.1 Data Selection

Since neither Starbucks location dataset is very large, both are used in full. <br>
In addition, factors that may affect the number of Starbucks stores in a country were taken from the World Bank database. <br>
These are:
* GDP per capita
* Population
* Urban population
* Ease of doing business score

In [None]:
# Total population per economy for 2016 (World Bank series SP.POP.TOTL)
population = wb.data.DataFrame(series='SP.POP.TOTL', economy='all', time='2016')
population.reset_index(inplace=True)
population.rename(columns={'economy': 'Country', 'SP.POP.TOTL': 'POP'}, inplace=True)
population

In [None]:
# Ease of doing business score per economy for 2016 (World Bank series IC.BUS.DFRN.XQ)
ease_of_business = wb.data.DataFrame(series='IC.BUS.DFRN.XQ', economy='all', time=2016).round(2)
ease_of_business.reset_index(inplace=True)
ease_of_business.rename(columns={'economy': 'Country', 'IC.BUS.DFRN.XQ': 'EODB'}, inplace=True)
ease_of_business

In [None]:
# GDP per capita in current US$ per economy for 2016 (World Bank series NY.GDP.PCAP.CD)
gdp_pcap = wb.data.DataFrame(series='NY.GDP.PCAP.CD', economy='all', time='2016').round(2)
gdp_pcap.reset_index(inplace=True)
gdp_pcap.rename(columns={'economy': 'Country', 'NY.GDP.PCAP.CD': 'GDP PCAP'}, inplace=True)
gdp_pcap

In [None]:
# Urban population per economy for 2016 (World Bank series SP.URB.TOTL)
urban_population = wb.data.DataFrame(series='SP.URB.TOTL', economy='all', time=2016)
urban_population.reset_index(inplace=True)
urban_population.rename(columns={'economy':'Country', 'SP.URB.TOTL': 'URB POP'}, inplace=True)
urban_population

In [None]:
# Count the stores in each country and sort from most to fewest
number_of_starbucks = starbucks_location_2016.groupby('Country')['Store Number'].count().sort_values(ascending=False).to_frame()
number_of_starbucks.reset_index(inplace=True)
number_of_starbucks
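The frames built above share a `Country` column, so one way they could later be combined is a left join from the store counts onto each indicator. The sketch below assumes that the Starbucks data uses two-letter country codes while the World Bank frames use three-letter codes, so it first translates the codes. The mapping `ISO2_TO_ISO3`, the frames, and all numbers are made up for illustration. A library such as pycountry, which the notebook imports, could supply the full mapping.

```python
import pandas as pd

# Made-up per-country store counts keyed by two-letter codes (assumption)
store_counts = pd.DataFrame({
    "Country": ["US", "CN", "TR"],
    "Store Number": [3, 2, 1],
})

# Made-up indicator frame keyed by three-letter codes, including one aggregate
population = pd.DataFrame({
    "Country": ["USA", "CHN", "TUR", "AFE"],
    "POP": [300.0, 1400.0, 80.0, 600.0],
})

# Hypothetical hand-written mapping covering only the codes used here
ISO2_TO_ISO3 = {"US": "USA", "CN": "CHN", "TR": "TUR"}
store_counts["Country"] = store_counts["Country"].map(ISO2_TO_ISO3)

# The left join keeps every country that has stores;
# validate raises an error if a country code appears more than once on either side
merged = store_counts.merge(
    population, on="Country", how="left", validate="one_to_one"
)
print(merged)
```

A left join drops World Bank aggregates such as `AFE` automatically, because they never appear in the store counts.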

## C.2 Data Reduction

## C.3 Data Transformation

## C.4 Data Cleaning

## C.5 Data Analysis

## C.6 Data Integration

# D. Modelling

# E. Evaluation