# **Project 'Market Research for Catering in Moscow'**

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-Data.-Preliminary-Stage" data-toc-modified-id="Introduction-to-Data.-Preliminary-Stage-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to Data. Preliminary Stage</a></span><ul class="toc-item"><li><span><a href="#Creating-new-columns" data-toc-modified-id="Creating-new-columns-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Creating new columns</a></span></li></ul></li><li><span><a href="#Analysis-by-establishment-category" data-toc-modified-id="Analysis-by-establishment-category-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Analysis by establishment category</a></span></li><li><span><a href="#Seating-capacity-by-category" data-toc-modified-id="Seating-capacity-by-category-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Seating capacity by category</a></span></li><li><span><a href="#Chain-vs.-Independent-establishments" data-toc-modified-id="Chain-vs.-Independent-establishments-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Chain vs. 
Independent establishments</a></span></li><li><span><a href="#Chain-and-non-chain-establishments-by-category" data-toc-modified-id="Chain-and-non-chain-establishments-by-category-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Chain and non-chain establishments by category</a></span></li><li><span><a href="#Top-15-by-categories" data-toc-modified-id="Top-15-by-categories-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Top 15 by categories</a></span></li><li><span><a href="#Distribution-by-Moscow-Administrative-Districts" data-toc-modified-id="Distribution-by-Moscow-Administrative-Districts-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Distribution by Moscow Administrative Districts</a></span></li><li><span><a href="#Ratings-by-Category" data-toc-modified-id="Ratings-by-Category-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Ratings by Category</a></span></li><li><span><a href="#Ratings-by-Districts" data-toc-modified-id="Ratings-by-Districts-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Ratings by Districts</a></span></li><li><span><a href="#TOP-15-streets-by-the-number-of-establishments" data-toc-modified-id="TOP-15-streets-by-the-number-of-establishments-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>TOP 15 streets by the number of establishments</a></span></li><li><span><a href="#Streets-with-the-only-establishment" data-toc-modified-id="Streets-with-the-only-establishment-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Streets with the only establishment</a></span></li><li><span><a href="#Median-of-the-average-bill-by-districts" data-toc-modified-id="Median-of-the-average-bill-by-districts-12"><span class="toc-item-num">12&nbsp;&nbsp;</span>Median of the average bill by districts</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-13"><span class="toc-item-num">13&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Detailed-Study:-Opening-a-Coffee-Shop" 
data-toc-modified-id="Detailed-Study:-Opening-a-Coffee-Shop-14"><span class="toc-item-num">14&nbsp;&nbsp;</span>Detailed Study: Opening a Coffee Shop</a></span><ul class="toc-item"><li><span><a href="#Competitors" data-toc-modified-id="Competitors-14.1"><span class="toc-item-num">14.1&nbsp;&nbsp;</span>Competitors</a></span></li><li><span><a href="#Operating-Hours" data-toc-modified-id="Operating-Hours-14.2"><span class="toc-item-num">14.2&nbsp;&nbsp;</span>Operating Hours</a></span></li><li><span><a href="#Ratings" data-toc-modified-id="Ratings-14.3"><span class="toc-item-num">14.3&nbsp;&nbsp;</span>Ratings</a></span></li><li><span><a href="#Стоимость-чашки-капучино" data-toc-modified-id="Стоимость-чашки-капучино-14.4"><span class="toc-item-num">14.4&nbsp;&nbsp;</span>Cost of a Cup of Cappuccino</a></span></li><li><span><a href="#Price-category" data-toc-modified-id="Price-category-14.5"><span class="toc-item-num">14.5&nbsp;&nbsp;</span>Price category</a></span></li><li><span><a href="#Conclusions-and-Recommendations" data-toc-modified-id="Conclusions-and-Recommendations-14.6"><span class="toc-item-num">14.6&nbsp;&nbsp;</span>Conclusions and Recommendations</a></span></li></ul></li></ul></div>

During this project, we will assist the investors of the fund in analyzing the locations of public dining establishments in Moscow. We will endeavor to discover interesting features and identify patterns that will aid the client in selecting a location and launching a successful startup.

## Introduction to Data. Preliminary Stage

In [None]:
pip install folium

In [None]:
#loading the libraries required for the project
import pandas as pd
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import json
from folium import Map, Choropleth, Marker
from folium.plugins import MarkerCluster
import folium
from math import radians, sin, cos, sqrt, atan2
from collections import Counter

We have access to a dataset of public catering establishments in Moscow, compiled based on data from Yandex Maps and Yandex Business as of the summer of 2022. The file 'moscow_places.csv' contains:
- name — the name of the establishment;
- address —  the address of the establishment;
- category — the category of the establishment, such as 'cafe,' 'pizzeria,' or 'coffee shop';
- hours — information about the days and hours of operation;
- lat — the latitude of the geographic point where the establishment is located;
- lng — the longitude of the geographic point where the establishment is located;
- rating — the establishment's rating based on user ratings in Yandex Maps (highest rating is 5.0);
- price — price category in the establishment, such as 'average,' 'below average,' 'above average,' and so on;
- avg_bill — a string that stores the average order cost as a range
- middle_avg_bill — a numeric value representing the average bill
- middle_coffee_cup — a numeric value representing the price of one cup of cappuccino, which is only indicated for values in the 'avg_bill' column starting with the substring 'Price of one cappuccino cup.' 
  1. If the row gives a price range of two values, the median of these two values is entered into the column.
  2. If the row gives a single number without a range, that number is entered into the column.
  3. If there is no value, or it does not start with the substring 'Price of one cappuccino cup', nothing is entered into the column.
- chain — a binary value expressed as 0 or 1, indicating whether the establishment is part of a chain (there may be errors for small chains). If 0, the establishment is not part of a chain; if 1, the establishment is part of a chain.
- district — the administrative district in which the establishment is located, for example, the Central Administrative District;
- seats — the number of seating places.
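The three rules for 'middle_coffee_cup' can be sketched in code. This is only an illustration of the described logic, not the procedure actually used to build the dataset: the function name `parse_coffee_cup` and the constant `CUP_PREFIX` are hypothetical, and the real 'avg_bill' strings may use a different (for example Russian) prefix and separator.

```python
import re
from statistics import median

# Hypothetical prefix: the real dataset may use a different label
CUP_PREFIX = 'Price of one cappuccino cup'

def parse_coffee_cup(avg_bill, prefix=CUP_PREFIX):
    """Return the cappuccino price implied by an avg_bill string, or None."""
    # Rule 3: a missing value or a string without the prefix gives nothing
    if not isinstance(avg_bill, str) or not avg_bill.startswith(prefix):
        return None
    # Collect the numbers that follow the prefix
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', avg_bill[len(prefix):])]
    if not numbers:
        return None
    # Rule 1 (two values -> their median) and rule 2 (one value -> itself)
    return median(numbers[:2])

print(parse_coffee_cup('Price of one cappuccino cup: 150-200'))
print(parse_coffee_cup('Price of one cappuccino cup: 180'))
print(parse_coffee_cup('Average bill: 1000-1500'))
```

For two values the median coincides with their arithmetic mean, which is why a range is represented by its midpoint.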

In [None]:
#load the dataset
data = pd.read_csv('/Users/tatyanamayorova/Desktop/Yandex/projects/GIT/moscow_places.csv')
data.head()

In [None]:
data.info()

Overall, the dataset contains 8,406 Moscow establishments. However, not all columns have complete data. More than half of the data is missing in the 'price', 'avg_bill', and 'middle_avg_bill' columns, and 'middle_coffee_cup' has very few data points. The seating capacity is also missing for approximately half of the establishments. For now, we won't address these gaps. There are no issues with data types, although it's worth remembering that the 'object' data type can hold values of different kinds.
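The share of missing values per column can be read directly instead of comparing the counts in `data.info()` by eye. Below is a minimal sketch on a made-up frame named `toy`, which is an assumption standing in for `data`; the same expression should work on the real DataFrame.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'price': ['average', np.nan, np.nan, np.nan],
    'seats': [40, np.nan, 10, np.nan],
})

# isna() gives a boolean frame; the column mean is the share of missing values
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```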

In [None]:
#Let's display duplicates if they exist
duplicates = data[data.duplicated()]
display(duplicates)
#and slightly modify the filter, there may be duplicates by name and address
dupl_name_ad = data[data.duplicated(['name','address'], keep='first')]
dupl_name_ad

In [None]:
#take a look at the missing values in the column with operating hours
data[data['hours'].isna()].head()

It seems that filling in the missing values here is not logical. For instance, we can identify all chain establishments and assume that they have the same operating hours, but that is often not the case. Operating hours are more likely to depend on the location. However, filling in the operating hours based on the district or street would be incorrect. If the establishment is located in a shopping mall, its operating hours will match those of the mall. If it's a standalone establishment, its operating hours may be different. For now, we won't address these missing values.

### Creating new columns

Let's add a column indicating whether the establishment operates 24/7.

In [None]:
#create a function to find the relevant values
def is_24_7(hours):
    hours_str = str(hours).lower()  # convert to a string and lowercase it
    # 'пн-вс' = Mon-Sun, 'ежедневно' = daily, 'круглосуточно' = around the clock
    return ('пн-вс' in hours_str or 'ежедневно' in hours_str) and \
           ('круглосуточно' in hours_str or '00:00-00:00' in hours_str)
#create a new column
data['is_24_7'] = data['hours'].apply(is_24_7)
#count how many establishments operate around the clock
print('Total establishments operating 24/7:', data['is_24_7'].sum())
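The same rule can also be written with vectorized string methods, which makes it easy to test on a few sample strings. This is a sketch, not a replacement for the function above: the sample values in `hours` are made up, and the real format of the 'hours' column may differ.

```python
import numpy as np
import pandas as pd

# Made-up examples of the 'hours' format; the real strings may differ
hours = pd.Series([
    'ежедневно, круглосуточно',
    'пн-вс 10:00-22:00',
    'пн-пт 00:00-00:00',
    np.nan,
])

lowered = hours.str.lower()
# Open every day of the week (missing values count as False)
every_day = lowered.str.contains('пн-вс|ежедневно', na=False)
# Open around the clock
all_day = lowered.str.contains('круглосуточно|00:00-00:00', na=False)

is_24_7 = every_day & all_day
print(is_24_7.tolist())
```

Both conditions must hold at once: an establishment that is open around the clock only on weekdays is not counted.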

Extract a column with the name of the street where the establishment is located from the address.

In [None]:
#create a function that will extract the street name from the address
def extract_street(address):
    parts = address.split(',')
    if len(parts) >= 2:
        street_part = parts[1].strip()
        return street_part
    else:
        return None

# create a column street
data['street'] = data['address'].apply(extract_street)

# dataset with a new column
data.head()
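A vectorized variant of the street extraction is sketched below. It assumes the address layout "city, street, building", and the sample addresses are made up. Unlike `extract_street`, which would raise an error on a missing address because a float NaN has no `split` method, this version returns NaN in that case.

```python
import pandas as pd

# Made-up addresses in the assumed "city, street, building" layout
addresses = pd.Series([
    'Москва, улица Дыбенко, 7/1',
    'Москва, Ленинградский проспект, 78, стр. 1',
    'Москва',
    None,
])

# Second comma-separated field, stripped of spaces; NaN where it is absent
street = addresses.str.split(',').str[1].str.strip()
print(street.tolist())
```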

In [None]:
#overall measures
data.describe()

Here, it's worth paying attention to the minimum and maximum values of the average bill and the cost of a coffee cup. Also, the seating capacity values of 1288 and 0 look suspicious, so let's take a closer look at them. Additionally, we'd like to examine the most popular names that appear in the dataset. Let's start with that.

In [None]:
#the most frequent names
name_counts = data['name'].value_counts()
plt.figure(figsize=(12, 6))
sns.barplot(x=name_counts.head(15).index, y=name_counts.head(15).values)
plt.xticks(rotation=45, ha='right')
plt.xlabel('Names')
plt.ylabel('Quantity')
plt.title('The most frequent names')
plt.show()

There are approximately 180 establishments with the name "Cafe" and about 30 with the name "Restaurant." The "NoName" establishments do not belong to any brand.

Let's analyze the columns "middle_avg_bill," "middle_coffee_cup," and "seats" separately.

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(data['middle_avg_bill'], bins=100, color='grey', edgecolor='black')
plt.xlabel('Average bill')
plt.ylabel('Frequency')
plt.title('Distribution of Average bill')
plt.show()

In [None]:
#remove abnormally high and low values of middle_avg_bill, keeping rows where it is missing
data = data[((data['middle_avg_bill'] > 0) & (data['middle_avg_bill'] <= 11000)) | data['middle_avg_bill'].isnull()]
data.describe()

In [None]:
#distribution of the average cost of a coffee cup
plt.figure(figsize=(10, 6))
plt.hist(data['middle_coffee_cup'], bins=100, color='grey', edgecolor='black')
plt.xlabel('Average cappuccino price')
plt.ylabel('Frequency')
plt.title('Distribution of the cappuccino price')
plt.show()

In [None]:
#remove the abnormally high value of middle_coffee_cup
data = data[(data['middle_coffee_cup'] <= 600) | data['middle_coffee_cup'].isnull()]
data.describe()

In [None]:
#plot the distribution of the number of seating places
plt.figure(figsize=(10, 6))
plt.hist(data['seats'], bins=100, color='grey', edgecolor='black')
plt.xlabel('Number of seats')
plt.ylabel('Frequency')
plt.title('Distribution of the number of seats')
plt.show()

Most establishments clearly have 40-60 seats, but where do the values of 500-1000 seats come from? This may be an error. Let's see on which streets such large establishments are concentrated.

In [None]:
df = data.query('seats>500')
seats_count = df.groupby('street')['seats'].value_counts().reset_index(name='counts')
# sort by the number of establishments
seats_count = seats_count.sort_values(by='counts', ascending=False)
# one label per (street, seats) pair, so bars for the same street do not overlap
labels = seats_count['street'] + ' (' + seats_count['seats'].astype(int).astype(str) + ')'
counts = seats_count['counts']

# bar chart
plt.figure(figsize=(12, 6))
plt.bar(labels, counts, color='skyblue')
plt.xlabel('Street and number of seats')
plt.ylabel('Number of establishments')
plt.title('Distribution of establishments by seating capacity on streets')
plt.xticks(rotation=90)
plt.tight_layout()

plt.show()

On Leningradsky Prospekt alone there are more than 20 establishments with a seating capacity of 625 each. Let's take a look at what kind of establishments these are.

In [None]:
data.query('street=="Ленинградский проспект" and seats>500').head()

It's hard to imagine a pizzeria with 600 seats. When searching on the internet, we see that "Maxima Pizza Moscow, Leningradsky Prospekt, 78, Building 1" has 160 seating places, and "Stradivari" has a banquet hall for up to 150 persons, and so on. Obviously, there is an error in the data. It's unclear whether we can trust the remaining data on the number of seats. Let's see the range in which 95% of the data on seating capacity falls.

In [None]:
# Removing rows with missing values.
cleaned_data = data.dropna(subset=['seats'])

# Calculating the 2.5th and 97.5th percentiles, i.e. the bounds of the central 95% of values.
lower_percentile = np.percentile(cleaned_data['seats'], 2.5)
upper_percentile = np.percentile(cleaned_data['seats'], 97.5)

print(f"95% of the 'seats' column values fall within the range from {lower_percentile} to {upper_percentile}")

In [None]:
#how many values lie above the upper bound of this range
data.query('seats>421').seats.count()

In [None]:
#remove the questionable data, keeping rows where the number of seats is missing
data = data[((data['seats'] < 421) & (data['seats'] > 0)) | data['seats'].isnull()]
data.describe()
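The percentile bounds and the filter can be combined without hard-coding the threshold. The sketch below uses made-up seat counts in `seats` as a stand-in for `data['seats']`. It is a variation rather than the exact filter used above: it cuts at both percentiles and includes the bounds, whereas the cell above cuts at zero and strictly below 421.

```python
import numpy as np
import pandas as pd

# Made-up seat counts standing in for data['seats']
seats = pd.Series([0, 20, 40, 60, 80, 1200, np.nan])

# Series.quantile skips NaN, so no dropna() is needed
lower, upper = seats.quantile([0.025, 0.975])

# Keep rows inside the range OR with a missing value
mask = seats.between(lower, upper) | seats.isna()
print(seats[mask].tolist())
```

Rows with a missing seat count are kept on purpose, so that establishments without this field are not lost for the rest of the analysis.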

We have completed the data preprocessing. For analysis, we have 8,147 establishments of different categories located in Moscow, which represents nearly 97% of the initial number of establishments in the raw data.

## Analysis by establishment category

In [None]:
#Explore the categories
data['category'].unique()

In [None]:
category = data.groupby('category')['name'].count().sort_values()
plt.figure(figsize=(12, 6))
bars = plt.barh(category.index, category.values, color='skyblue')
plt.xlabel('Number of Establishments')
plt.ylabel('Category')
plt.title('Number of Establishments by Category')
for bar in bars:
    plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, str(int(bar.get_width())),
             va='center', ha='right', fontsize=11)
plt.show()

The dataset contains the highest number of establishments in the "Cafe" category, while bakeries and canteens are relatively few, ranking lower in prevalence. Does this mean that cafes are popular, and it's better to invest in this category, or could it suggest that the bakery format is undervalued with less competition, making it a potential opportunity to open a bakery? Let's continue to explore further.

## Seating capacity by category

In [None]:
# Create a dataset without missing values in the "seats" column
new_set = data.dropna(subset=['seats'])

# Calculate the average number of seats by category
category_seats = new_set.groupby('category')['seats'].mean().sort_values(ascending=False)

# Create a horizontal bar chart using Seaborn
plt.figure(figsize=(13, 6))
barplot = sns.barplot(x=category_seats.values, y=category_seats.index, palette='Blues_d')
plt.xlabel('Average number of seats')
plt.ylabel('Category')
plt.title('Average number of seats by establishment category')
for index, value in enumerate(category_seats.values):
    barplot.text(value + 1, index, str(round(value)), va='center')
plt.show()

The most capacious establishments are restaurants, bars, and cafes, with average seating capacities exceeding one hundred. Let's explore these data from a different angle to understand the range of seating capacities more clearly.

In [None]:
# Grouping data by category and calculating mean, minimum, and maximum seating capacities
category_seats_stats = new_set.groupby('category')['seats'].agg(['mean', 'min', 'max']).sort_values(by='mean')

# Setting the figure size
plt.figure(figsize=(12, 6))

# Plotting horizontal range lines
for cat, (_, mn, mx) in category_seats_stats.iterrows():
    plt.hlines(cat, xmin=mn, xmax=mx, color='salmon')

# Adding maximum, minimum, and mean points with labels
for cat, (mean, mn, mx) in category_seats_stats.iterrows():
    plt.scatter(mx, cat, color='salmon', s=200, edgecolors='black', 
                label='Max' if cat == category_seats_stats.index[-1] else None)
    plt.scatter(mn, cat, color='salmon', s=100, edgecolors='black', 
                label='Min' if cat == category_seats_stats.index[-1] else None)
    plt.scatter(mean, cat, color='lightblue', s=100, edgecolors='black', 
                label='Mean' if cat == category_seats_stats.index[-1] else None)
    plt.text(mx + 5, cat, str(mx), va='center', ha='left', fontsize=10, color='black')
    plt.text(mn - 5, cat, str(mn), va='center', ha='right', fontsize=10, color='black')
    plt.text(mean + 10, cat, f'{int(mean)}', va='center', ha='left', fontsize=10, color='black')

plt.xlabel('Number of seats')
plt.ylabel('Category')
plt.title('Seating Capacity Statistics by Establishment Category')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

I still can't imagine a bakery with 320 seats. Let's take a final look at the distributions in the "Restaurant" and "Cafe" categories for comparison.

In [None]:
restorans=new_set[new_set['category'] == 'ресторан']
# Creating a histogram for the distribution of seats
plt.figure(figsize=(12, 6))
plt.hist(restorans['seats'], bins=30, color='lightblue', edgecolor='black')
plt.xlabel('Number of Seats')
plt.ylabel('Number of Establishments')
plt.title('Distribution of the Number of Seats in Restaurants')
plt.show()

In [None]:
cafes = new_set[new_set['category'] == 'кафе']
# Creating a histogram for the distribution of seats
plt.figure(figsize=(12, 6))
plt.hist(cafes['seats'], bins=30, color='lightblue', edgecolor='black')
plt.xlabel('Number of Seats')
plt.ylabel('Number of Establishments')
plt.title('Distribution of the Number of Seats in Cafes')
plt.show()

It is evident that the main values for the number of seats in restaurants are around 50 and 80 seats. As for cafes, they typically have around 40-50 seats. Let's also create distributions for other categories.

In [None]:
categories = ['кафе', 'ресторан', 'кофейня', 'пиццерия', 'бар,паб', 'быстрое питание', 'булочная', 'столовая']

# Creating histogram plots for each category.
plt.figure(figsize=(12, 6 * len(categories)))  # Overall plot size
for i, category in enumerate(categories, 1):
    subset = data[data['category'] == category]
    plt.subplot(len(categories), 1, i)  # Creating subplots
    plt.hist(subset['seats'], bins=40, color='lightblue', edgecolor='black')
    plt.xlabel('Number of Seats')
    plt.ylabel('Number of Establishments')
    plt.title(f'Distribution of the Number of Seats in {category}')
plt.tight_layout() # Improves the layout of subplots.
plt.show()

Judging by the histograms:

| Category    | Most frequent number of seats | Second most frequent number of seats |
|:------------|:-----------------------------:|:------------------------------------:|
| Cafe        | 40                            | 50                                   |
| Restaurant  | 100                           | 50                                   |
| Coffee shop | 50                            | 100                                  |
| Pizzeria    | 50                            | 100                                  |
| Bar, Pub    | 40                            | 90                                   |
| Fast food   | 20                            | 50                                   |
| Bakery      | 50                            | 40                                   |
| Canteen     | 40                            | 80                                   |
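The values in the table were read off the histograms by eye. A similar summary can be computed by binning the seat counts and taking the two most frequent bins per category. This is a sketch on made-up rows: the frame `toy` and the bin width of 10 seats are assumptions, so the result on the real data need not match the table exactly.

```python
import pandas as pd

# Made-up rows standing in for `data`
toy = pd.DataFrame({
    'category': ['кафе'] * 5 + ['ресторан'] * 4,
    'seats': [40, 45, 42, 50, 55, 100, 105, 50, 102],
})

# Round each seat count down to the start of its 10-seat bin (assumed bin width)
toy['seat_bin'] = (toy['seats'] // 10 * 10).astype(int)

# Count establishments per (category, bin); counts come sorted within each
# category, so head(2) keeps the two most frequent bins
top_bins = (
    toy.groupby('category')['seat_bin']
       .value_counts()
       .groupby(level=0)
       .head(2)
)
print(top_bins)
```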


## Chain vs. Independent establishments

Let's examine the ratio of chain and independent establishments in the dataset. But first, let's check if all establishments with the same name are marked as chains.

In [None]:
#check this with the example of the Кафе name
data[(data['name']=='Кафе') & (data['chain']==1)]

In [None]:
#and restaurants
data[(data['name']=='Ресторан') & (data['chain']==1)]

Great, there are no obvious errors in identifying chains. Let's move on to examining the ratio.
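The two checks above cover only the generic names. A broader check is sketched below: it lists names that occur several times with both values of the chain flag. The frame `toy` is made up for illustration, and a flagged name is only a candidate for manual review, not necessarily an error.

```python
import pandas as pd

# Made-up rows standing in for `data`
toy = pd.DataFrame({
    'name': ['Шоколадница', 'Шоколадница', 'Кафе', 'Кафе', 'Хинкальная', 'Хинкальная'],
    'chain': [1, 1, 0, 0, 1, 0],
})

# For every name: number of locations and number of distinct chain flags
flags = toy.groupby('name')['chain'].agg(['count', 'nunique'])

# Names that appear more than once with BOTH flags
inconsistent = flags[(flags['count'] > 1) & (flags['nunique'] > 1)]
print(inconsistent)
```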

In [None]:
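# fix the order (0 = independent, 1 = chain) so that it always matches the labels below
chain_counts = data['chain'].value_counts().reindex([0, 1])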

# creating a pie chart
plt.figure(figsize=(6, 6))
plt.pie(chain_counts, labels=['Independent', 'Chain'], autopct='%1.1f%%', startangle=140, 
        colors=['lightgreen', 'lightblue'], wedgeprops={'linewidth': 0.1, 'edgecolor': 'green'})
plt.title('Distribution of Independent and Chain Establishments')
plt.axis('equal')  # To make the pie chart circular
plt.show()


More than half of the establishments are independent. It's interesting to explore in which categories chains are more prevalent.

## Chain and non-chain establishments by category

In [None]:
# Grouping data
category_counts = data['category'].value_counts()
chain_counts = data[data['chain'] == 1]['category'].value_counts()
nonchain_counts = category_counts - chain_counts

# Calculation of shares
chain_ratios = chain_counts / category_counts
nonchain_ratios = nonchain_counts / category_counts

# Creating a chart
plt.figure(figsize=(12, 6))

# sorting
sorted_categories = chain_ratios.sort_values(ascending=False).index

# Creating stacked columns with labels for shares
plt.bar(sorted_categories, chain_ratios[sorted_categories], color='lightblue', edgecolor='black', label='Chain')
plt.bar(sorted_categories, nonchain_ratios[sorted_categories], bottom=chain_ratios[sorted_categories], 
        color='lightgreen', edgecolor='black', label='Non-chain')

# Adding share labels
for cat in chain_ratios.sort_values(ascending=False).index:
    height = chain_ratios[cat]
    plt.text(cat, height / 2, f'{height:.2f}', ha='center', va='center', fontsize=10, color='black')
    height = chain_ratios[cat] + nonchain_ratios[cat] / 2
    plt.text(cat, height, f'{nonchain_ratios[cat]:.2f}', ha='center', va='center', fontsize=10, color='black')

plt.xlabel('Category')
plt.ylabel('Share of Establishments')
plt.title('Ratio of Chain and Non-Chain Establishments by Category')
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()

Bakeries, pizzerias, and coffeehouses are more frequently represented by chains, whereas among pubs and canteens, over 70% are non-chain establishments.
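The shares behind this chart can also be obtained in one step with `pd.crosstab`. This is a sketch on made-up rows in `toy`; the column names `chain_share` and `non_chain_share` are chosen here for illustration. One possible advantage is that a category without any chain establishments gets a share of 0 rather than NaN, which is what subtracting two `value_counts()` results would produce.

```python
import pandas as pd

# Made-up rows standing in for `data`
toy = pd.DataFrame({
    'category': ['булочная', 'булочная', 'булочная',
                 'столовая', 'столовая', 'столовая', 'столовая'],
    'chain': [1, 1, 0, 0, 0, 0, 1],
})

# Row-normalized contingency table: each row sums to 1
shares = pd.crosstab(toy['category'], toy['chain'], normalize='index')
shares = shares.rename(columns={0: 'non_chain_share', 1: 'chain_share'})
shares = shares.sort_values('chain_share', ascending=False)
print(shares.round(2))
```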

## Top 15 by categories

In [None]:
# Take the 19 most frequent names: generic non-chain names (such as 'Кафе') are dropped by the chain filter below
top_restaurants = data['name'].value_counts().head(19).index

# filter
filtered_data = data[(data['name'].isin(top_restaurants)) & (data['chain']==1)]

In [None]:
name_by_category = filtered_data.groupby(['name', 'category']).size().unstack(fill_value=0)
name_by_category

In [None]:
# Creating a column with the total number for sorting.
name_by_category['Total'] = name_by_category.sum(axis=1)

# Sorting by the total quantity and removing the 'Total' column
name_by_category_sorted = name_by_category.sort_values('Total', ascending=False)
name_by_category_sorted.drop('Total', axis=1, inplace=True)
# Resetting the index
name_by_category_sorted.reset_index(inplace=True)

# Creating a horizontal stacked bar chart.
fig, ax = plt.subplots(figsize=(14, 8))
colors = plt.cm.tab20.colors

# Building columns for each establishment.
name_by_category_sorted.plot(x='name', kind='barh', stacked=True, color=colors, ax=ax)

# Adding labels for each column
for i, row in name_by_category_sorted.iterrows():
    prev_value = 0  # Variable to store the sum of previous values
    for col in name_by_category_sorted.columns[1:]:
        value = row[col]
        if value > 0:
            # Calculating the position for the label, taking into account the sum of previous values
            pos = prev_value + value / 2
            ax.annotate(str(value), xy=(pos, i), ha='center', va='baseline', fontsize=10, color='black')
            prev_value += value  # Updating the sum of previous values.

ax.set_xlabel('Quantity')
ax.set_ylabel('Establishment')
ax.set_title('Number of Establishments for Each Name by Category (Top 15)')
ax.legend(title='Category', fontsize='large')
plt.tight_layout()

plt.show()

This means that the largest number of chain establishments belong to coffee shops and pizzerias. Moreover, among the 15 most frequently encountered chains, the majority of locations under one chain have the same category. For example, out of 116 establishments in the Shokoladnitsa chain, 115 are coffee shops. If a chain offers a variety of categories, it would be something like Khinkalnaya or Mu-Mu, where you can find bars, pubs, and restaurants under the same brand.

## Distribution by Moscow Administrative Districts

In [None]:
#determine the administrative districts of Moscow that are present in the dataset
data['district'].unique()

In [None]:
#number of establishments by category in the table (0 where a district has no establishments of a category)
pivot_table_by_category = data.pivot_table(columns='district', index='category', values='name',
                                           aggfunc='count', fill_value=0)
pivot_table_by_category

In [None]:
# sorting
pivot_table_by_category = pivot_table_by_category[pivot_table_by_category.sum(axis=0).
                                                  sort_values(ascending=False).index]
# visualization
categories = pivot_table_by_category.index
districts = pivot_table_by_category.columns

bar_width = 0.5
bar_positions = np.arange(len(districts))

plt.figure(figsize=(10, 10))  # size

bottoms = np.zeros(len(districts))

for i, category in enumerate(categories):
    plt.bar(bar_positions, pivot_table_by_category.loc[category, :], width=bar_width, label=category, bottom=bottoms)
    bottoms += pivot_table_by_category.loc[category, :]
for x, total in zip(bar_positions, bottoms):
    plt.text(x, total, str(int(total)), ha='center', va='bottom', fontweight='bold')

plt.xlabel('Districts')
plt.ylabel('Number of Establishments')
plt.title('Comparison of Establishments by Categories and Districts')
plt.xticks(bar_positions, districts, rotation='vertical')
plt.legend()
plt.tight_layout()
plt.show()

Central Administrative District (ЦАО) is the leader in terms of the number of establishments, with a significant lead. Here, there are more than twice as many establishments as in any other district in Moscow. However, competition is also high in this area, especially among bars, cafes, coffee shops, and restaurants. North-Western Administrative District (СЗАО) appears to be the most residential district, less frequented by foodservice entrepreneurs. But even here, cafes and restaurants dominate over other categories.

## Ratings by Category

In [None]:
#group data
rating_by_cat = data.groupby('category')['rating'].mean().sort_values()
# visualization
plt.figure(figsize=(10, 6))  # size

plt.barh(rating_by_cat.index, rating_by_cat.values, color='skyblue')
plt.xlabel('Average Rating')
plt.ylabel('Category')
plt.title('Average Ratings by Category')
plt.xlim(4, 4.5) # Limiting the X-axis for better visibility
plt.tight_layout()
plt.show()

The highest average ratings are found among bars/pubs and pizzerias. Fast-food establishments have the lowest average rating, although the spread is small: all category averages lie between 4 and 4.5. It is useful to show the number of establishments next to the average rating, because the more establishments an average is based on, the more reliable it is.
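Before plotting, the average rating and the number of establishments can be put side by side in one table. The sketch below uses made-up ratings in `toy` as a stand-in for `data`. Computing both statistics in a single `groupby` keeps each mean on the same row as its count.

```python
import pandas as pd

# Made-up rows standing in for `data`
toy = pd.DataFrame({
    'category': ['бар,паб', 'бар,паб', 'кафе', 'кафе', 'кафе', 'кафе'],
    'rating': [4.5, 4.3, 4.0, 4.2, 4.4, 3.8],
})

# Mean rating and number of rated establishments per category
summary = (
    toy.groupby('category')['rating']
       .agg(['mean', 'count'])
       .sort_values('count')
)
print(summary.round(2))
```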

In [None]:
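# Grouping data by categories and calculating the average rating and the number of establishments
count_by_cat = data.groupby('category')['name'].count().sort_values()
# align the ratings with the category order of count_by_cat, otherwise the bars would be mismatched
rating_by_cat = data.groupby('category')['rating'].mean().reindex(count_by_cat.index)

# visualization
bar_positions = np.arange(len(count_by_cat))
bar_width = 0.5  # defined here so the cell does not depend on an earlier one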

fig, ax1 = plt.subplots(figsize=(12, 7))


ax1.bar(bar_positions, count_by_cat.values, bar_width, label='Number of Establishments', color='skyblue')
ax1.set_xlabel('Category')
ax1.set_ylabel('Number of Establishments', color='black')
ax1.tick_params(axis='y', labelcolor='black')
ax1.yaxis.set_ticks(np.arange(200, 2600, 200))

ax2 = ax1.twinx()
ax2.bar(bar_positions + bar_width, rating_by_cat.values, bar_width, label='Average Rating', color='orange')
ax2.set_ylabel('Average Rating', color='black')
ax2.tick_params(axis='y', labelcolor='black')
ax2.set_ylim(4, 4.5)  # Limiting the Y-axis

# adding labels
for i, (count, rating) in enumerate(zip(count_by_cat.values, rating_by_cat.values)):
    ax1.annotate(f"{count}", (bar_positions[i], count +20), ha='center', fontsize=10)
    ax2.annotate(f"{rating:.2f}", (bar_positions[i] + bar_width, rating + 0.02), ha='center', fontsize=10)
# create a common legend
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=12)

fig.tight_layout()
plt.title('Comparison of Average Rating and Number of Establishments by Category')
plt.xticks(bar_positions + bar_width/2, count_by_cat.index)
plt.show()

Cafes and restaurants are both numerous, yet restaurants have the higher average rating. Bakeries and canteens are far fewer, and their average ratings are comparatively high.

## Ratings by Districts

Let's calculate the average ratings by districts in Moscow and visualize them on a map using color-coding. We will also add establishments to this map using clusters.

In [None]:
#Calculate the average ratings by districts, without considering categories
rating_df = data.groupby('district', as_index=False)['rating'].agg('mean').round(3)

In [None]:
# load the JSON file with the boundaries of Moscow districts.
state_geo = '/Users/tatyanamayorova/Desktop/Yandex/projects/GIT/admin_level_geomap.geojson'

moscow_lat, moscow_lng = 55.751244, 37.618423

# creating the map
m = Map(location=[moscow_lat, moscow_lng], zoom_start=10)

# add Choropleth to the map
Choropleth(
    geo_data=state_geo,
    data=rating_df,
    columns=['district', 'rating'],
    key_on='feature.name',
    fill_color='YlGnBu',
    fill_opacity=0.8,
    legend_name='Average ratings by districts',
).add_to(m)
# create an empty cluster and add it to the map
marker_cluster = MarkerCluster().add_to(m)

# function that takes a row from the DataFrame,
# creates a marker at the current point, and adds it to the marker_cluster
def create_clusters(row):
    Marker(
        [row['lat'], row['lng']],
        popup=f"{row['name']} {row['rating']}",
    ).add_to(marker_cluster)

# apply the create_clusters() to each row 
data.apply(create_clusters, axis=1)
# displaying the map
m

The map clearly shows the highest average ratings in blue in the center and the lowest ratings in yellow. You can also zoom in and out to see the individual establishments.

## TOP 15 streets by the number of establishments

In [None]:
# Grouping data by streets and counting the number of names
street_counts = data.groupby('street', as_index=False)['name'].agg('count')

# Sorting by the number of names in descending order.
sorted_street_counts = street_counts.sort_values(by='name', ascending=False)
top = sorted_street_counts.head(15)
top

In [None]:
df = data[(data['street'].isin(top['street']))]
pivot_table_by_category = df.pivot_table(index='street', columns='category', values='name', 
                                         aggfunc='count', fill_value=0)

# Sum up the number of establishments by streets and sort the streets
sorted_streets = pivot_table_by_category.sum(axis=1).sort_values(ascending=False).index

# Sort the pivot table by streets
sorted_pivot_table = pivot_table_by_category.loc[sorted_streets]

categories = sorted_pivot_table.columns

plt.figure(figsize=(10, 8))

bottoms = np.zeros(len(sorted_streets))

for i, category in enumerate(categories):
    sorted_category_data = sorted_pivot_table[category]
    plt.barh(sorted_streets, sorted_category_data, height=0.5, label=category, left=bottoms)
    bottoms += sorted_category_data

plt.xlabel('Number of establishments')
plt.ylabel('Street')
plt.title('Number of establishments by streets and categories')
plt.legend()
plt.tight_layout()
plt.show()

The bar chart shows the 15 streets with the most establishments, with categories highlighted by color. Prospekt Mira has the most establishments, even though Warsaw Avenue is the longest street on the list. Prospekt Mira has establishments from all 8 categories, and the most common ones are restaurants, cafes, and coffee shops. Pyatnitskaya Street and Vavilova Street have no canteens at all, and the Moscow Ring Road has the fewest bars, which is unsurprising for a highway.

## Streets with the only establishment

In [None]:
alone = data.groupby('street')['name'].count()
alone = alone[alone==1]
print('Total amount',alone.count())

In [None]:
alone_streets = data[data['street'].isin(alone.index)]
alone_streets.head()

It would be interesting to find out what these establishments have in common. One hypothesis is that if there is only one establishment on a street, it operates 24/7. Let's check.

In [None]:
is_24_7_counts = alone_streets['is_24_7'].value_counts()
is_24_7_counts

The assumption is incorrect: only 31 of the 459 establishments operate 24/7. Let's look at the categories.
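
To put that count in perspective, the share can be computed directly. This is a minimal sketch that uses only the two figures reported above; the variable names are illustrative and not part of the notebook.

```python
# Figures taken from the text above
total_single = 459   # establishments that are alone on their street
open_24_7 = 31       # of those, open around the clock

# Share of round-the-clock establishments among single-establishment streets
share_24_7 = open_24_7 / total_single
print(f"Share of 24/7 establishments: {share_24_7:.1%}")
```

In the notebook itself, the same share could presumably be read from `alone_streets['is_24_7'].value_counts(normalize=True)`.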

In [None]:
vis = alone_streets.groupby('category')['name'].count().sort_values()
plt.figure(figsize=(12, 6))
bars = plt.barh(vis.index, vis.values, color='skyblue')
plt.xlabel('Number of Establishments')
plt.ylabel('Category')
plt.title('Number of Single Establishments by Category')
for bar in bars:
    plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, str(int(bar.get_width())),
             va='center', ha='right', fontsize=11)
plt.show()

The same categories remain popular. Let's check if the average rating stands out and how many of them are chain establishments.

In [None]:
print('Average rating of single establishments on the street', alone_streets['rating'].mean().round(2))
alone_streets_chain_counts = alone_streets['chain'].value_counts()
print('Chains:', alone_streets_chain_counts[1], 'Non-chains:', alone_streets_chain_counts[0])

The average rating of single establishments falls within the range typical of our other average-rating calculations. There are fewer chain establishments than non-chain ones, which also matches the general trend in the dataset.

In [None]:
# Creating a histogram for the distribution by the number of seats
plt.figure(figsize=(12, 6))
plt.hist(alone_streets['seats'], bins=30, color='lightblue', edgecolor='black')
plt.xlabel('Number of Seats')
plt.ylabel('Number of Establishments')
plt.title('Distribution of the Number of Seats in Single-Establishment Streets')
plt.show()

In [None]:
# Count single-establishment streets per district and draw a bar chart
# ('district' is categorical, so a bar chart is used rather than a histogram)
plt.figure(figsize=(8, 6))
alone_streets['district'].value_counts().plot(kind='bar', edgecolor='black', alpha=0.7)

# Setting the appearance of the plot
plt.title('Distribution of the Number of Establishments by Districts')
plt.xlabel('District')
plt.ylabel('Number of Establishments')
plt.xticks(rotation='vertical')

# Displaying the chart
plt.tight_layout()
plt.show()

Summary of the establishments that are the only ones on their streets:
1. We found 459 streets where only 1 establishment operates.
2. Only 31 out of 459 establishments operate around the clock.
3. Cafes, restaurants, and coffee shops are the most popular formats here.
4. There are only 7 bakeries.
5. The average rating of single establishments on the street is 4.24.
6. About 30% are chain establishments, and approximately 70% are standalone.
7. These are primarily dining establishments with seating for 30-40 people, but there are also larger establishments with more than 200 seats.
8. Such streets exist in every district, but most of them are in the Central Administrative District (ЦАО).

## Median of the average bill by districts

To determine the average bill in an establishment, the table has a column called 'middle_avg_bill'. It contains a number estimating the average bill, derived from values in the 'avg_bill' column that start with the substring 'Средний счёт' (Average bill):

- If a price range of two values is given, the column holds the median of these two values.
- If a single number is given (a price without a range), the column holds this number.
- If there is no value or it does not start with the substring 'Средний счёт', the column is left empty.
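
The rules above can be sketched as a small parsing function. This is only an illustration: the function name `parse_middle_avg_bill` is hypothetical, and the exact layout of the 'avg_bill' strings (for example 'Средний счёт:1500–1600 ₽') is an assumption, not something taken from the dataset description.

```python
import re

PREFIX = 'Средний счёт'  # substring the value must start with

def parse_middle_avg_bill(avg_bill):
    """Return the middle of a price range, a single price, or None."""
    # Missing values and strings with another prefix produce no value
    if not isinstance(avg_bill, str) or not avg_bill.startswith(PREFIX):
        return None
    # Collect every integer in the string (assumes no thousands separators)
    numbers = [int(n) for n in re.findall(r'\d+', avg_bill)]
    if not numbers:
        return None
    if len(numbers) >= 2:
        # For two values the median equals their arithmetic mean
        return (numbers[0] + numbers[1]) / 2
    return float(numbers[0])

print(parse_middle_avg_bill('Средний счёт:1500–1600 ₽'))
print(parse_middle_avg_bill('Средний счёт:от 1000 ₽'))
```

Under these assumptions the first call gives the midpoint of the range and the second gives the single stated price; a string with a different prefix, or a missing value, yields `None`.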

Since 'middle_avg_bill' has quite a few missing values, we will first create a new dataset without missing values in this column.

In [None]:
#new dataset without missing values in middle_avg_bill
avg = data.dropna(subset=['middle_avg_bill'])
avg.middle_avg_bill.describe().to_frame()

In [None]:
#group data and count median
avg_by_dis = avg.groupby('district',as_index=False)['middle_avg_bill'].median()
avg_by_dis.sort_values(by='middle_avg_bill', ascending=False)

In [None]:
m = Map(location=[moscow_lat, moscow_lng], zoom_start=10)

# create a choropleth map using the Choropleth constructor and add it to the map
Choropleth(
    geo_data=state_geo,
    data=avg_by_dis,
    columns=['district', 'middle_avg_bill'],
    key_on='feature.name',
    fill_color='PuBuGn',
    fill_opacity=0.8,
    line_opacity=0.2,
    legend_name='Median of the average bill by districts',
).add_to(m)
# empty cluster
marker_cluster = MarkerCluster().add_to(m)

# write a function that takes a row from the DataFrame,
# creates a marker at the current point, and adds it to the marker cluster
def create_clusters(row):
    Marker(
        [row['lat'], row['lng']],
        popup=f"{row['name']} {row['middle_avg_bill']}",
    ).add_to(marker_cluster)

# apply create_clusters()
data.apply(create_clusters, axis=1)
# show the map
m

The darkest areas of the map (those with the highest median average bill) are located in the central and western districts. At first glance, the distance from the center does not seem to determine the size of the bill. To check this, we take the coordinates of the center of Moscow and of every establishment in the dataset without missing bills and compute the distance to each establishment. The haversine function used for this returns the result in kilometers, because the radius of the Earth (R) is specified in kilometers.

In [None]:
from math import radians, sin, cos, sqrt, atan2

# Function to calculate the distance between two points on a sphere (haversine).
# Coordinates are expected in degrees; the conversion to radians happens inside.
def haversine(lat1, lon1, lat2, lon2):
    # Convert coordinates from degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])

    # Earth's radius in kilometers
    R = 6371.0

    # Difference in latitude and longitude
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Calculate distance using the haversine formula
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    # Distance between two points
    distance = R * c
    return distance

# Latitude and longitude of the center of Moscow, in degrees
# (kept in degrees so the values can still be used as a map location)
moscow_lat, moscow_lng = 55.751244, 37.618423

avg = data.dropna(subset=['middle_avg_bill']).copy()

# Add a new column 'distance_to_center' with the distance from each establishment to the center of Moscow in km.
# Coordinates are passed in degrees because haversine() converts them to radians itself.
avg['distance_to_center'] = avg.apply(
    lambda row: haversine(moscow_lat, moscow_lng, row['lat'], row['lng']), axis=1)
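
A quick sanity check can confirm that a haversine implementation of this kind behaves as expected. The sketch below repeats the function in compact form so that it runs on its own; the test points are arbitrary and chosen only because one degree of latitude along a meridian must equal R * pi / 180.

```python
from math import radians, sin, cos, sqrt, atan2, pi

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; all inputs are in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    R = 6371.0  # Earth's radius in kilometers
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))

# One degree of latitude on the same meridian should equal R * pi / 180
expected_km = 6371.0 * pi / 180
one_degree_km = haversine(55.0, 37.0, 56.0, 37.0)
print(round(one_degree_km, 2), round(expected_km, 2))

# The distance from a point to itself should be zero
same_point_km = haversine(55.751244, 37.618423, 55.751244, 37.618423)
print(same_point_km)
```

Both printed distances in the first line should agree at roughly 111.19 km. The check also makes the unit convention explicit: the function takes degrees, so coordinates should not be converted to radians before the call.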

In [None]:
#result
avg.head()

In [None]:
#Visualize the relationship using a scatter plot
plt.figure(figsize=(14, 6))
plt.scatter(avg['distance_to_center'], avg['middle_avg_bill'], alpha=0.5)
plt.xlabel('Distance to Center of Moscow (km)')
plt.ylabel('Middle Average Bill')
plt.title('Relationship between Distance to Center and Middle Average Bill')
plt.grid(True)
plt.show()

In [None]:
# Calculate the correlation between distance from the center and middle average bill
correlation = avg['distance_to_center'].corr(avg['middle_avg_bill'])

print(f'Correlation between distance from the center and middle average bill: {correlation}')

The correlation between the distance from the center and the middle average bill is approximately -0.183. The negative sign indicates a weak inverse relationship between the two variables: the middle average bill tends to decrease slightly as you move away from the center of Moscow, but the relationship is not strong.
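
One way to see how weak this relationship is: squaring the coefficient gives the share of variance in the bill that a straight-line fit on distance would account for. A minimal sketch using only the value reported above:

```python
r = -0.183            # correlation coefficient reported above
r_squared = r ** 2    # share of variance explained by a linear fit

print(f"r^2 = {r_squared:.3f}")
```

This comes to roughly 3%, so distance alone leaves almost all of the variation in the bill unexplained.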

## Conclusion

The analysis of the public catering market in Moscow provides valuable information for investors from the "Shut Up and Take My Money" fund who are planning to open a new establishment. The key findings of the study are as follows:

1. Variety of Formats: Moscow's market offers a variety of establishment formats, with the most popular being cafes, restaurants, and coffee shops. Bakeries and canteens, on the other hand, are less common.

2. Chain and Non-Chain Establishments: More than half of the establishments are non-chain, but chain establishments such as coffee shops and pizzerias are also popular. Over time, it may be worth considering investing in the chain format given its popularity.

3. Distribution by Districts: The Central Administrative District (CAO) of Moscow has the highest number of establishments and therefore the highest level of competition. The Northwestern Administrative District (NWAO) appears to be less saturated with public catering businesses.

4. Average Bill: The median average bill varies across districts. It is highest in the Central Administrative District (CAO) and the Western Administrative District (WAO), at around 1,000 rubles, and about half that amount in the southeastern part of Moscow. However, distance from the center of Moscow has only a weak relationship with the bill, so districts outside the center can also be considered for locating an establishment.

5. Rating: An analysis of the average rating reveals that the average rating of establishments by district ranges from 4 to 4.5 and depends on the number of establishments of a particular type.

6. Streets with Single Establishments: Streets with a single establishment mostly host cafes, restaurants, and coffee shops. These establishments have an average rating of around 4.24 and are mostly non-chain. At the other extreme, Prospekt Mira, Profsoyuznaya, and Leninsky Prospekt have the highest number of establishments per street.

With this data in mind, investors can make more informed decisions regarding the format of the establishment, its location, menu, and pricing. It is recommended to conduct additional research and business planning to determine specific strategies and tactics for successfully entering the public catering market in Moscow.

## Detailed Study: Opening a Coffee Shop

During discussions with the client, we received new insights: they are planning to open an affordable coffee shop. Additionally, we will analyze:

- How many coffee shops are there in the dataset? In which districts are they most prevalent, and what are the characteristics of their locations?
- Are there any 24-hour coffee shops?
- What are the ratings of these coffee shops, and how are they distributed across districts?
- What price range should we target for a cup of cappuccino when opening, and why?

We'll also attempt to identify the target audience for our future coffee shop.

The format of the "Central Perk" coffee shop, as presented in the series, may not be entirely suitable for business meetings or corporate events. It is more oriented towards informal gatherings and spending time with friends. If the investors decide to create a similar coffee shop, they could target the segment of young adults, fans of the show, and provide a space for performances and events that align with the "Friends" atmosphere.

I believe our target audience is the youth and students. The characters of "Friends" were young adults, and "Central Perk" was often frequented by young people looking to spend time in good company. Let's consider an age range of 20-35, which represents financially capable adults. Understanding the target audience will help us determine the ideal location for our establishment.

The perfect location should be close to places where active young people spend their time. We can rule out residential areas, especially since the investors are "not afraid of competition." Next, it would be beneficial to have business centers or shopping malls nearby, as well as museums, theaters, and so on. Let's see what information is available in our dataset.

In [None]:
# Create a dataset specifically for coffee shops
data_c = data.query('category=="кофейня"')
data_c.info()

### Competitors

In total, there are 1368 coffee shops in the dataset. Let's try to identify our future competitors, taking into account the target audience we defined. The best districts for our coffee shop would be the Central Administrative Okrug (CAO) and the Western Administrative Okrug (WAO). The Central Okrug provides access to a rich infrastructure, entertainment, and business events in the city center. Additionally, this district is attractive due to its proximity to business centers and offices, which can attract a corporate audience.

The CAO is extensive and includes various districts such as Taganka, Basmanny, Yakimanka, and others, which can provide diverse location options depending on the goals and concept of the coffee shop.

The WAO is home to several business and office complexes. Some areas in the WAO, such as Kutuzovsky Prospekt and Kievskaya, have developed infrastructure for entertainment and leisure, including restaurants, cinemas, and nightclubs.

In [None]:
#popular coffee shop chains
name_counts = data_c['name'].value_counts().sort_values(ascending=False).head(10)
plt.figure(figsize=(12, 6))
name_counts.plot(kind='bar', color='skyblue')
plt.title('Distribution of the number of establishments by name')
plt.xlabel('Name of the establishment')
plt.ylabel('Number of establishments')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

The bar chart shows the 10 most frequent coffee shop names. On its own, this doesn't tell us much, so let's determine where these coffee shops are located.

In [None]:
top_10_coffee = data_c[data_c['name'].isin(name_counts.index)]
# group data and count coffee shops
coffee_count_by_district = top_10_coffee.groupby('district')['name'].count().sort_values(ascending=False)

# create a plot
plt.figure(figsize=(12, 6))
coffee_count_by_district.plot(kind='bar', color='skyblue')
plt.title('Number of Coffee Shops in Moscow Districts')
plt.xlabel('District')
plt.ylabel('Number of Coffee Shops')
plt.xticks(rotation=90)  

plt.show()

Since a significant number of these coffee shops are located in the districts we identified as our target areas, many of them appear to cater to our target audience. However, competition won't deter us. Let's take a closer look at these chains.

In [None]:
# select coffee shops in ЦАО
coffee_in_CAO = top_10_coffee[top_10_coffee['district'].isin(['Центральный административный округ'])]

# obtain the unique names 
coffee_in_CAO['name'].unique()

In [None]:
# select coffee shops in ЗАО
coffee_in_ZAO = top_10_coffee[top_10_coffee['district'].isin(['Западный административный округ'])]

# obtain the unique names
coffee_in_ZAO['name'].unique()

All 10 of the most common coffee shop chains have locations in the Central Administrative District (ЦАО), and 9 of them are also present in the Western Administrative District (ЗАО). Only 'Правда кофе' is not represented in ЗАО.
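
Rather than comparing the two lists of names by eye, the missing chain can be found with a set difference. The sketch below uses hypothetical chain names purely for illustration; in the notebook the two sets would presumably be built as `set(coffee_in_CAO['name'])` and `set(coffee_in_ZAO['name'])`.

```python
# Hypothetical chain names per district (illustrative, not taken from the dataset)
chains_cao = {'Chain A', 'Chain B', 'Chain C'}
chains_wao = {'Chain A', 'Chain C'}

# Chains present in the first district but absent from the second
missing_in_wao = sorted(chains_cao - chains_wao)
print(missing_in_wao)
```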

### Operating Hours

Let's find the 10 most frequently encountered operating schedules.

In [None]:
from collections import Counter

# Work on a copy so that the original coffee shop data stays untouched
data_c_copy = data_c.copy()

# Replace missing values in the 'hours' column with an empty string
# (assignment is used instead of inplace=True on a column selection)
data_c_copy['hours'] = data_c_copy['hours'].fillna('')

# Split the string in the 'hours' column by the delimiter "; " and collect all combinations of days and hours
all_hours_combinations = []
for hours_string in data_c_copy['hours']:
    if hours_string:  # skip establishments without a schedule
        all_hours_combinations.extend(hours_string.split('; '))

# Use Counter to count the frequency of each combination
hours_count = Counter(all_hours_combinations)

# The 10 most common combinations
hours_count.most_common(10)

In [None]:
print('Total number of 24/7 coffee shops:', data_c.query('is_24_7==True').shape[0])

In [None]:
data_c.query('is_24_7==True').groupby('district')['name'].count()

Coffee shops in Moscow most often operate daily from 10:00 AM to 10:00 PM. The Central Administrative District (CAO) and the Western Administrative District (WAO) have 25 and 9 round-the-clock coffee shops, respectively. This seems reasonable: WAO is a more business-oriented district, where nighttime demand for coffee shops is unlikely to be high, while CAO has a vibrant nightlife.

### Ratings

In [None]:
# count avg ratings
rating_c = data_c.groupby('district', as_index=False)['rating'].agg('mean').round(3)

In [None]:
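# Start from a fresh map so that the layers of the previous map are not reused
m = Map(location=[55.751244, 37.618423], zoom_start=10)

# Choropleth of the average coffee shop rating by district
Choropleth(
    geo_data=state_geo,
    data=rating_c,
    columns=['district', 'rating'],
    key_on='feature.name',
    fill_color='YlGnBu',
    fill_opacity=0.8,
    legend_name='AVG rating by districts',
).add_to(m)

# Empty cluster for the coffee shop markers
marker_cluster = MarkerCluster().add_to(m)

# Create a marker for one coffee shop with its name and rating in the popup
def create_clusters(row):
    Marker(
        [row['lat'], row['lng']],
        popup=f"{row['name']} {row['rating']}",
    ).add_to(marker_cluster)

# Only coffee shops are plotted, since this section is about them
data_c.apply(create_clusters, axis=1)

m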

The Central Administrative District (CAO, ЦАО) and the Northwestern Administrative District (NWAO, СЗАО) have coffee shops with high ratings, which makes it harder to compete there. In the Western Administrative District (WAO, ЗАО), coffee shop ratings are among the lowest, so a new coffee shop could stand out. However, the lower ratings may also reflect more demanding customers: if we consider WAO primarily a business district, people there value both time and quality.

### Cappuccino Price

In [None]:
# create a hist
plt.figure(figsize=(12, 6))
plt.hist(data_c['middle_coffee_cup'], bins=50, color='lightblue', edgecolor='black')
plt.xlabel('Cappuccino Price')
plt.ylabel('Number of Establishments')
plt.title('Distribution of Average Coffee Cup Prices in Moscow Coffee Shops')
plt.show()

In [None]:
coffee_CAO = data_c.query('district=="Центральный административный округ"') 

plt.figure(figsize=(12, 6))
plt.hist(coffee_CAO['middle_coffee_cup'], bins=50, color='lightblue', edgecolor='black')
plt.xlabel('Cappuccino Price')
plt.ylabel('Number of Establishments')
plt.title('Distribution of Average Coffee Cup Prices in ЦАО')
plt.show()

In [None]:
coffee_ZAO = data_c.query('district=="Западный административный округ"') 

plt.figure(figsize=(12, 6))
plt.hist(coffee_ZAO['middle_coffee_cup'], bins=50, color='lightblue', edgecolor='black')
plt.xlabel('Cappuccino Price')
plt.ylabel('Number of Establishments')
plt.title('Distribution of Average Coffee Cup Prices in ЗАО')
plt.show()

Most often, a cappuccino in Moscow coffee shops costs 260-270 rubles; the second most common price is 90 rubles per cup. In the Central Administrative District (CAO) and the Western Administrative District (WAO), the usual price is 260 rubles. CAO also has many inexpensive coffee shops, whereas WAO shows less variation in price and has significantly fewer establishments.
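
The most common price range does not have to be read off the histogram by eye. The sketch below bins prices into 10-ruble intervals and picks the fullest one; the price list is hypothetical and only illustrates the technique, which in the notebook would presumably be applied to `data_c['middle_coffee_cup'].dropna()`.

```python
from collections import Counter

# Hypothetical cappuccino prices in rubles (illustrative, not from the dataset)
prices = [90, 95, 150, 255, 260, 260, 265, 268, 270, 300]
bin_width = 10

# Map each price to the lower edge of its 10-ruble bin and count the bins
bin_counts = Counter((p // bin_width) * bin_width for p in prices)

# The bin with the most coffee shops
modal_bin, modal_count = bin_counts.most_common(1)[0]
print(f"Most common range: {modal_bin}-{modal_bin + bin_width} rubles ({modal_count} coffee shops)")
```

The bin width is a design choice: a narrower bin separates nearby price points, while a wider one gives a more stable answer on small samples.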

### Price category

In [None]:
# Creating four subplots, one per dataset, in a single figure
fig, axes = plt.subplots(1, 4, figsize=(12, 6))
# Creating a palette of four pastel colors
pastel_colors = sns.color_palette("pastel", 4)

# Creating a dictionary that maps colors to categories
fixed_colors = {
    'выше среднего': pastel_colors[0],
    'средние': pastel_colors[1],
    'высокие': pastel_colors[2],
    'низкие': pastel_colors[3]
}

# first (data)
sns.countplot(data=data, x='price', ax=axes[0], palette=fixed_colors)
axes[0].set_title('Price Distribution in Moscow')
axes[0].set_xlabel('Price Category')
axes[0].set_ylabel('Number of Establishments')
axes[0].tick_params(axis='x', labelrotation=45)

# second (data_c)
sns.countplot(data=data_c, x='price', ax=axes[1], palette=fixed_colors)
axes[1].set_title('Price Distribution in Coffee Shops')
axes[1].set_xlabel('Price Category')
axes[1].set_ylabel('Number of Establishments')
axes[1].tick_params(axis='x', labelrotation=45)

# CAO
sns.countplot(data=coffee_CAO, x='price', ax=axes[2], palette=fixed_colors)
axes[2].set_title('Prices in CAO Coffee Shops')
axes[2].set_xlabel('Price Category')
axes[2].set_ylabel('Number of Establishments')
axes[2].tick_params(axis='x', labelrotation=45)

# WAO
sns.countplot(data=coffee_ZAO, x='price', ax=axes[3], palette=fixed_colors)
axes[3].set_title('Prices in WAO Coffee Shops')
axes[3].set_xlabel('Price Category')
axes[3].set_ylabel('Number of Establishments')
axes[3].tick_params(axis='x', labelrotation=45)


plt.tight_layout()
plt.show()

In Moscow's coffee shops, as in establishments of other categories, medium prices are the most prevalent. The Central Administrative District (CAO) has a diverse mix of customers, which creates demand not only for medium-priced coffee shops but also for high- and low-priced ones. In the Western Administrative District (WAO), however, the share of coffee shops with above-average prices is higher than in CAO.
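
Because the districts differ a lot in the number of coffee shops, comparing shares is more telling than comparing raw counts. The sketch below computes per-district shares of each price category; the rows and the helper `category_shares` are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical (district, price category) pairs; illustrative only
rows = [
    ('CAO', 'medium'), ('CAO', 'medium'), ('CAO', 'above average'), ('CAO', 'low'),
    ('WAO', 'medium'), ('WAO', 'above average'),
]

def category_shares(rows, district):
    """Share of each price category among the rows of one district."""
    counts = Counter(price for d, price in rows if d == district)
    total = sum(counts.values())
    return {price: n / total for price, n in counts.items()}

cao_shares = category_shares(rows, 'CAO')
wao_shares = category_shares(rows, 'WAO')
print(cao_shares)
print(wao_shares)
```

With pandas, a comparable table can be obtained with `pd.crosstab(df['district'], df['price'], normalize='index')`, where `df` stands for the coffee shop DataFrame.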

### Conclusions and Recommendations

Key takeaways from the conducted research:

- Target Audience: The primary target audience should be young adults aged 20 to 35.

- Location: It is recommended to choose a location in the Central Administrative District (CAO) or the Western Administrative District (WAO), where there is an active young audience and access to business and entertainment centers. There are already 117 coffee shops in CAO, while WAO has only 50.

- Operating Hours: Provide flexible working hours, including 24/7 service, especially in CAO, where nightlife is active. Operating from 10 AM to 10 PM on weekdays and 24/7 on weekends is also an option. In WAO, there are few 24/7 establishments, so competition is lower, but to justify such working hours, consider locating the coffee shop near other public places that operate at night. If you focus on daytime hours, consider placing the coffee shop near business centers.

- Main Competitors: Pay attention to chain establishments that are well represented in the selected areas of Moscow, such as 'Kofemania', 'CofeFest', 'Shokoladnitsa', and 'Cinnabon'. By studying the menus of competitors, you can find features that they lack, making it easier to stand out.

- Price: The price of a cup of coffee should likely be around 260 rubles, as this is the most common price in other coffee shops and seems to be in demand among customers. However, at the launch stage, you can reduce the price by 10-20 rubles to differentiate yourself from competitors (the effectiveness of this strategy cannot be judged without data). A significantly lower price may raise doubts about quality, while a significantly higher price must be justified.

- Ratings: Based on the conducted research, ratings of WAO establishments appear to be low. This can be a good sign and provide an advantage at the start, provided that a high level of service is ensured.

It is advisable to further analyze customer reviews for other coffee shops in the chosen area, taking into account feedback and criticism to avoid mistakes.