# How to Tell a Story Using Data

## Project Description:

You’ve decided to open a small robot-run cafe in Los Angeles. The project is promising but expensive, so you and your partners decide to try to attract investors. They’re interested in the current market conditions—will you be able to maintain your success when the novelty of robot waiters wears off?

You’re an analytics guru, so your partners have asked you to prepare some market research. You have open-source data on restaurants in LA.

## Preparing The Data:

In [1]:
!pip install usaddress

Collecting usaddress
  Downloading usaddress-0.5.10-py2.py3-none-any.whl (63 kB)
     -------------------------------------- 63.9/63.9 kB 866.2 kB/s eta 0:00:00
Collecting probableparsing
  Downloading probableparsing-0.0.1-py2.py3-none-any.whl (3.1 kB)
Collecting python-crfsuite>=0.7
  Downloading python_crfsuite-0.9.9-cp39-cp39-win_amd64.whl (139 kB)
     -------------------------------------- 139.2/139.2 kB 2.0 MB/s eta 0:00:00
Installing collected packages: python-crfsuite, probableparsing, usaddress
Successfully installed probableparsing-0.0.1 python-crfsuite-0.9.9 usaddress-0.5.10


In [2]:
import pandas as pd
import numpy as np
import statistics
import datetime as dt
import math
import seaborn as sns
import sys
from functools import reduce
import matplotlib.pyplot as plt
from scipy import stats as st
import usaddress
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
from pandas.plotting import scatter_matrix
import plotly.express as px
import warnings
from operator import attrgetter
import matplotlib.colors as mcolors
from IPython.display import Image
from IPython.core.display import HTML
from plotly import graph_objects as go

In [3]:
rest_data = pd.read_csv('/datasets/rest_data_us.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/datasets/rest_data_us.csv'

In [None]:
rest_data.head(10)

In [None]:
rest_data.describe()

In [None]:
rest_data.info(memory_usage= 'deep')

We can see that 3 values are missing in the chain column.
I want to check them separately to see if there is any connection between them.

Also, the chain column contains only True or False, which means we should change its type to "bool".

As for the object_type column, we should change it to the category type.

The last thing we should do is rename some of the columns so they are easier to understand.

In [None]:
miss_val = rest_data.query('chain != True & chain != False')

In [None]:
miss_val_list = miss_val['object_name'].unique().tolist()

In [None]:
rest_data.loc[rest_data['object_name'].isin(miss_val_list)]

As we can see, there isn't any connection between them, and because it's only 3 values out of almost 10,000, we can just drop them without affecting the data much.

In [None]:
rest_data = rest_data.dropna()

In [None]:
rest_data['chain'] = rest_data['chain'].astype('bool')

In [None]:
rest_data['object_type'] = rest_data['object_type'].astype('category')

In [None]:
rest_data.info()

I have lowered the memory usage of the data drastically, from 2.4 MB to less than 400 KB.
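
To illustrate where a saving like this comes from, here is a minimal sketch on a made-up table (the `toy` frame and its values are assumptions for illustration, not the real dataset). It measures memory with `memory_usage(deep=True)` before and after the same kind of type changes.

```python
import pandas as pd

# Made-up stand-in for the restaurant table (hypothetical values).
toy = pd.DataFrame({
    'chain': pd.Series([True, False, None, True] * 250, dtype='object'),
    'object_type': ['Restaurant', 'Cafe', 'Fast Food', 'Bakery'] * 250,
})

# deep=True also counts the Python objects stored in object columns.
before = toy.memory_usage(deep=True).sum()

# Drop rows with missing values, then switch to compact dtypes.
toy = toy.dropna().reset_index(drop=True)
toy['chain'] = toy['chain'].astype('bool')
toy['object_type'] = toy['object_type'].astype('category')

after = toy.memory_usage(deep=True).sum()
print(f'before: {before} bytes, after: {after} bytes')
```

The exact byte counts depend on the pandas version, so only the direction of the change should be relied on.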

In [None]:
rest_data.columns = ['id', 'rest_name', 'address', 'chain', 'rest_type', 'seats_number']

In [None]:
rest_data.isna().sum()

After fixing the problems with the columns and the types, we should address the contents of the data, starting with the addresses, which are messy and in all caps.

In [None]:
def cleaning_check(raw):
    raw_address = usaddress.parse(raw)
    dict_address={}
    for i in raw_address:
        dict_address.update({i[1]:i[0]})
    if 'StreetName' in dict_address.keys() and 'AddressNumber' in dict_address.keys():
        clean_address = dict_address['AddressNumber']+','+str(dict_address['StreetName'])
        return clean_address
    else:
        return "no street or number"

In [None]:
rest_data['clean_street_check'] = rest_data.address.apply(cleaning_check)

In [None]:
rest_data[rest_data['clean_street_check']=='no street or number']

In [None]:
rest_data.head(10)

In [None]:
def cleaning_final(raw):
        if raw.startswith('OLVERA'):
            clean_address = 'OLVERA, Los Angeles, USA'
        elif raw.startswith('1033 1/2 LOS ANGELES'):
            clean_address = '1033 1/2 LOS ANGELES ST, Los Angeles,USA'
        else:
            raw_address = usaddress.parse(raw)
            dict_address = {}
            for i in raw_address:
                dict_address.update({i[1]:i[0]})
            clean_address = dict_address['AddressNumber']+" "+str(dict_address['StreetName'])+str(', Los Angeles,USA')
        return clean_address

In [None]:
rest_data['clean_street_final'] = rest_data.address.apply(cleaning_final)

In [None]:
def cleaning_tag(raw):
    try:
        if raw.startswith('OLVERA'):
            clean_address = 'OLVERA, Los Angeles, USA'
        elif raw.startswith('1033 1/2 LOS ANGELES'):
            clean_address = '1033 1/2 LOS ANGELES ST, Los Angeles, USA'
        elif raw.startswith('3425 E 1ST ST SO. 3RDFL'):
            clean_address = '3425 E1ST ST SO. 3RDFL'
        elif raw.startswith('3708 N EAGLE ROCK BLVD'):
            clean_address = 'N EAGLE ROCK BLVD'
        elif raw.startswith('100 WORLD WAY # 120'):
            clean_address = 'WORLD WAY'
        elif raw.startswith('6801 HOLLYWOOD BLVD # 253'):
            clean_address = 'HOLLYWOOD BLVD'
        elif raw.startswith('1814 W SUNSET BLVD'):
            clean_address = 'SUNSET BLVD'
        elif raw.startswith('2100 ECHO PARK AVE'):
            clean_address = 'ECHO PARK AVE'
        else:
            clean_address = usaddress.tag(raw)[0]['StreetName']
    except Exception:
        clean_address='no street'
    return clean_address

In [None]:
rest_data['clean_street_tag'] = rest_data.address.apply(cleaning_tag)

In [None]:
rest_data.head(10)

In [None]:
def regex_str_col(rest_data, cols):
    for col in cols:
        rest_data[col] = rest_data[col].str.lower()
        rest_data[col] = rest_data[col].replace('[^a-zA-Z0-9 ]', '', regex=True)
    return rest_data

In [None]:
cols = ['rest_name', 'clean_street_check', 'rest_type']

In [None]:
rest = regex_str_col(rest_data, cols)

In [None]:
rest_data.head(30)

In [None]:
duplic = rest_data[rest_data.duplicated(subset = ['rest_name', 'address'])]

In [None]:
duplic.shape[0]

After lowercasing the text in our data, we found 19 duplicates that we should discard.
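
As a small illustration of why normalizing the text matters before looking for duplicates, here is a sketch on made-up rows (the names and addresses are hypothetical). Rows that differ only in case or punctuation are counted as duplicates only after cleaning.

```python
import pandas as pd

# Made-up rows: the first two differ only in case and punctuation.
toy = pd.DataFrame({
    'rest_name': ["Joe's Cafe", "JOE'S CAFE", 'Robot Bar', 'robot bar'],
    'address': ['12 MAIN ST', '12 Main St', '5 OAK AVE', '7 OAK AVE'],
    'seats_number': [20, 20, 35, 40],
})

# Lowercase and strip everything except letters, digits and spaces.
for col in ['rest_name', 'address']:
    toy[col] = toy[col].str.lower().str.replace(r'[^a-z0-9 ]', '', regex=True)

# A row is a duplicate if an earlier row has the same name and address.
dup_mask = toy.duplicated(subset=['rest_name', 'address'])
n_dupes = int(dup_mask.sum())

deduped = toy.drop_duplicates(subset=['rest_name', 'address'])
print(n_dupes, len(deduped))
```

Here the two "robot bar" rows stay, because they sit at different addresses.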

In [None]:
rest_data = rest_data.applymap(lambda s:s.lower() if type(s)==str else s)

In [None]:
rest_data = rest_data.sort_values('seats_number', ascending=False)

In [None]:
rest_data = rest_data.sort_values(by=['rest_name', 'clean_street_check']).drop_duplicates(subset=['rest_name', 'clean_street_check'])

In [None]:
rest_data['word_only'] = rest_data['rest_name'].str.replace(r'\d+', '', regex=True)

In [None]:
rest_data['word'] = rest_data['word_only'].str.split(' ').str[0]

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

In [None]:
rest_data['lemmatized'] = rest_data['clean_street_tag'].apply(lemmatize_text)

In [None]:
rest_data.head(10)

First we dropped the few rows with missing values.

Then I organized the addresses and extracted the street names using the usaddress library that we imported earlier.

Finally, after lowercasing the text and stripping punctuation, we removed the duplicates.

## Data Analysis

### Investigate the proportions of the various types of establishments. Plot a graph.

In [None]:
rest_types = rest_data.groupby('rest_type', as_index = False).agg({'id': 'count'})

In [None]:
rest_types.columns = ['rest_type', 'number_of_rests']

In [None]:
rest_types

In [None]:
px.pie(rest_types, values='number_of_rests', title='Number of restaurants for each type', names='rest_type')

Most of the venues are restaurants (75%), followed by fast food venues with 11%.
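
The percentages behind a pie chart like this can also be read off directly. A minimal sketch, using a made-up list of venue types rather than the real data:

```python
import pandas as pd

# Made-up venue types (hypothetical counts, for illustration only).
toy = pd.Series(
    ['restaurant'] * 6 + ['fast food'] * 2 + ['cafe', 'bakery'],
    name='rest_type',
)

# normalize=True returns fractions; scale to percent and round.
shares = toy.value_counts(normalize=True).mul(100).round(1)
print(shares)
```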

### Investigate the proportions of chain and nonchain establishments. Plot a graph.

In [None]:
rest_chain = rest_data.groupby('chain', as_index=False).agg({'id': 'count'})

In [None]:
rest_chain.columns = ['chain', 'number_of_values']

In [None]:
rest_chain

In [None]:
px.pie(rest_chain, values = 'number_of_values', title ='Chains vs Single venues', names='chain')

Roughly two thirds of our venues (about 63%) are single venues, while 37% are chains.

### Which type of establishment is typically a chain?

In [None]:
chains = rest_data.query('chain == True').copy()

In [None]:
chain = chains.groupby('rest_name')['seats_number'].agg(['count', 'sum'])

In [None]:
chain.columns = ['num_rests', 'total_seats']

In [None]:
chain.sort_values(by='num_rests').head(20)

In [None]:
chains['rest_name'] = chains.rest_name.replace('#', '', regex=True).replace(' [0-9*$]', '', regex=True)

In [None]:
chain = chains.groupby('rest_name')['seats_number'].agg(['count','sum'])

In [None]:
chain.columns = ['num_rests', 'total_seats']

In [None]:
chain.sort_values(by='num_rests').head(20)

In [None]:
chains_types = chains.groupby('rest_type', as_index=False).agg({'id': 'count'})

In [None]:
chains_types.columns = ['rest_type', 'number_of_rests']

In [None]:
chains_types['chances_to_be_chain'] = chains_types['number_of_rests'] / rest_types['number_of_rests']

In [None]:
chains_types

We can notice that 100% of our bakeries are chains.
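
Dividing two separately aggregated tables pairs rows by their position, so it only gives the right ratios when both tables list the same types in the same order. As a hedged alternative, this sketch (on made-up data, with hypothetical column names) aligns the counts on the venue type label instead.

```python
import pandas as pd

# Made-up venues: "bar" has no chains at all.
toy = pd.DataFrame({
    'rest_type': ['bakery', 'bakery', 'cafe', 'cafe', 'cafe', 'bar', 'bar'],
    'chain': [True, True, True, False, True, False, False],
})

all_counts = toy.groupby('rest_type').size().rename('number_of_rests')
chain_counts = toy[toy['chain']].groupby('rest_type').size().rename('number_of_chains')

# concat aligns on the rest_type label; types without chains become 0.
summary = pd.concat([all_counts, chain_counts], axis=1).fillna(0)
summary['chances_to_be_chain'] = summary['number_of_chains'] / summary['number_of_rests']
print(summary)
```

In this toy table every bakery is a chain, two out of three cafes are, and no bar is.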

In [None]:
#Added by reviewer

rest_data.groupby('rest_type').agg({'chain':['count', 'sum', 'mean']})

In [None]:
len(chain.query('num_rests == 1')) / len(rest_data)

We can see that chains with only one branch make up 23% of all venues in the data, even after we merge the big chains that were split across different names.

We will ignore these restaurants when referring to chains because they are probably not real chains.
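
The share of single-branch "chains" depends on what it is divided by. This sketch uses made-up rows (not the real data) to show three possible denominators side by side.

```python
import pandas as pd

# Made-up venues: chain "a" has 3 branches, "b" and "c" have 1 each.
toy = pd.DataFrame({
    'rest_name': ['a', 'a', 'a', 'b', 'c', 'd', 'e', 'f'],
    'chain': [True, True, True, True, True, False, False, False],
})

chains = toy[toy['chain']]
branches = chains.groupby('rest_name').size()
single = branches[branches == 1]

share_of_all_venues = len(single) / len(toy)        # out of every venue
share_of_chain_venues = len(single) / len(chains)   # out of chain venues
share_of_chain_names = len(single) / len(branches)  # out of chain names
print(share_of_all_venues, share_of_chain_venues, share_of_chain_names)
```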

In [None]:
chains_new = chain.query('num_rests != 1')

In [None]:
chains_new.head()

### What characterizes chains: many establishments with a small number of seats or a few establishments with a lot of seats?

In [None]:
fig = px.histogram(rest_data, x='seats_number', color = 'chain')
fig.update_layout(title='Chain vs Non-Chain venues with number of seats', xaxis_title='Number of seats', yaxis_title='Venues')


From the histogram we can see that, among venues with a small number of seats, there are more chain venues than non-chain venues.

Among large venues there are also slightly more chains than non-chains, but the difference is smaller than it is for small venues.
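
The histogram compares individual venues. To look at the question per chain, one option is to aggregate by chain name and compare the number of branches with the average seats per branch. A minimal sketch on made-up chains (values are hypothetical):

```python
import pandas as pd

# Made-up chain venues: "a" has many small branches, "b" a few large ones.
toy = pd.DataFrame({
    'rest_name': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'seats_number': [10, 15, 20, 25, 120, 140, 30, 35, 40],
})

# One row per chain: number of branches and mean seats per branch.
per_chain = toy.groupby('rest_name')['seats_number'].agg(
    num_rests='count', mean_seats='mean'
)

# A negative correlation means more branches go with fewer seats each.
corr = per_chain['num_rests'].corr(per_chain['mean_seats'])
print(per_chain)
print(corr)
```

A scatter plot of `num_rests` against `mean_seats` would show the same relationship visually.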

### Determine the average number of seats for each type of restaurant. On average, which type of restaurant has the greatest number of seats? Plot graphs.

In [None]:
sns.set(rc={'figure.figsize':(11,8)}, font_scale=1.5, style='whitegrid')
ax = sns.boxplot(x='seats_number', y='rest_type', data=rest_data).set_title('Number of seats depending on the venue')
plt.xlabel('Number of seats')
plt.ylabel('Type of Venue')


In [None]:
rest_seats = rest_data.groupby('rest_type', as_index=False).agg({'seats_number':'mean'})

In [None]:
rest_seats.columns = ['rest_type', 'mean_num_of_seats']

In [None]:
rest_seats

In [None]:
rest_seats.style.format({'mean_num_of_seats': '{:.0f}'})

I have created a table showing the average number of seats for every type of venue. The "restaurant" type leads with a mean of 48 seats.
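
The leading type can also be picked out programmatically instead of by reading the table. A minimal sketch on made-up seat counts (these numbers are not from the dataset):

```python
import pandas as pd

# Made-up venues with hypothetical seat counts.
toy = pd.DataFrame({
    'rest_type': ['restaurant', 'restaurant', 'cafe', 'cafe', 'bar'],
    'seats_number': [50, 70, 20, 30, 40],
})

# Mean seats per type, largest first.
mean_seats = (
    toy.groupby('rest_type')['seats_number']
    .mean()
    .sort_values(ascending=False)
)

# idxmax returns the label of the largest mean.
top_type = mean_seats.idxmax()
print(mean_seats)
print(top_type)
```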

In [None]:
#Added by reviewer

rest_seats.style.format({'mean_num_of_seats': '{:.0f}'})

### Put the data on street names from the address column in a separate column.

In [None]:
street_columns = rest_data[['clean_street_check', 'clean_street_tag']]

In [None]:
street_columns.head()

### Plot a graph of the top ten streets by number of restaurants.

In [None]:
rest_street = rest_data.groupby('clean_street_tag',as_index=False).agg({'id': 'count', 'seats_number': 'mean'})

In [None]:
rest_street.columns = ['street', 'number_of_rests', 'mean_num_of_seats']

In [None]:
top10 = rest_street.sort_values(by='number_of_rests', ascending=False).head(10)

In [None]:
fig = px.bar(top10, x='street', y='number_of_rests', title='Top 10 streets by number of restaurants')
fig.show()

Here we can see the top 10 streets.

### Find the number of streets that only have one restaurant.

In [None]:
one_rest_street = rest_street.query('number_of_rests == 1')

In [None]:
print('Number of streets with only one restaurant:', len(one_rest_street))
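
The same count can be cross-checked without the intermediate table. A short sketch with `value_counts` on a made-up street column:

```python
import pandas as pd

# Made-up street names, one entry per venue.
streets_toy = pd.Series(
    ['sunset', 'sunset', 'wilshire', 'pico', 'pico', 'olvera'],
    name='street',
)

# Venues per street, then count the streets that appear exactly once.
counts = streets_toy.value_counts()
n_single = int((counts == 1).sum())
print('Number of streets with only one restaurant:', n_single)
```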

### For streets with a lot of restaurants, look at the distribution of the number of seats. What trends can you see?

In [None]:
streets = top10['street'].tolist()

In [None]:
print('List of best streets: ', streets)

In [None]:
rests_top10 = rest_data.query('clean_street_tag == @streets')

In [None]:
print('restaurants from the top 10 streets:')
rests_top10.head(10)

In [None]:
print('Boxplot for distribution of seats number by street from the top 10:')

In [None]:
ax = sns.boxplot(x='seats_number', y='clean_street_tag', data=rests_top10)

In [None]:
rest_seat = rests_top10.groupby('clean_street_tag', as_index= False).agg({'seats_number': 'mean'})

In [None]:
rest_seat.columns = ['street', 'mean_number_of_seats']

In [None]:
print('Mean number of seats by street:')

In [None]:
rest_seat

In [None]:
top10

In [None]:
fig = px.bar(top10, x='street', y='mean_num_of_seats', title='Mean number of seats on the top 10 streets')
fig.show()
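
A mean can be pulled up by a few very large venues, so it may help to put the median next to it when reading seat distributions per street. A minimal sketch on made-up streets (hypothetical values):

```python
import pandas as pd

# Made-up venues: "sunset" has one very large venue, "pico" does not.
toy = pd.DataFrame({
    'street': ['sunset'] * 4 + ['pico'] * 4,
    'seats_number': [10, 20, 30, 200, 15, 25, 35, 45],
})

# Median, mean and maximum seats per street.
summary = toy.groupby('street')['seats_number'].agg(['median', 'mean', 'max'])
print(summary)
```

In this toy table "sunset" has the higher mean but the lower median, which is the kind of skew a boxplot makes visible.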

## Overall Conclusion

Our research was about a cafe that is run by robots.

While preparing the data, we fixed the data types, removed the 3 rows with missing values in the chain column, applied functions to make the street names more readable and comfortable to work with, and removed the duplicates.

From our research we found that only 4.5% of the venues in LA are cafes, while most venues are restaurants. This is good for us: cafes are a small share of the data, so we take on less risk and will face less competition.

We also found that most venues in our data are not chains (more than 60%), but among cafes more than 60% are chains.

We also found that cafes are generally less profitable than restaurants. Their average revenue isn't that high, so building a chain of cafes would be the smartest option if we go ahead with the robot cafe.

By checking the number of seats of the venues, we found that restaurants lead with 48 seats on average, almost double the roughly 25 seats of an average cafe.
From this seat analysis we can conclude that, for now, we shouldn't focus on large spaces with a lot of seats, but rather on grab-and-go, pick-up, and fast food delivery service.

Then we checked the top 10 streets with the most venues and their number of seats. Of course, the number of seats changes from place to place. For example, in Hollywood, LA and in Grove, LA we saw the highest average number of seats (around 70), so if we want to attract more customers and have a bigger cafe, we should go for those places.

In conclusion, we should seek investment for a cafe chain with many branches of around 20-30 seats each, located on streets with a high number of venues that attract a lot of visitors and tourists.

Presentation PDF file: < https://drive.google.com/file/d/14k4Ci-32aS0kIGFBRZAf6Pn8mYv8oRfX/view >