# Market Overlap

### SCENARIO

A fictitious company named 'Alpha' is interested in acquiring another fictitious company named 'Beta'. Both belong to the hospitality industry; more specifically, they are marketplaces where hotels announce their accommodation details and handle the whole booking process with travellers.

### UNDERSTANDING THE COMPANIES

__Alpha__ offers hotels in destinations worldwide. It entered the Brazilian market recently and has been investing in Facebook and Google Ads to get more hotels to publish on its marketplace. However, the CAC (customer acquisition cost) of bringing more hotels on board is too high, and after several weeks of trying to improve performance, Alpha's board has decided to look for other ways of increasing the number of hotels on its marketplace.

__Beta__, on the other hand, operates only in Brazil. Although it is smaller than Alpha overall, it already has a considerable number of hotels on board and operating.

### WHY THE MARKET OVERLAP ANALYSIS

Seeking a way to reduce the CAC, Alpha has made an offer to buy Beta with all its hotels and marketplace service. Even though Beta has shown interest in the deal, Alpha still wants to look at real data to determine whether the CAC would really be lower than with its current marketing investments.

As of now, neither company knows whether its hotels are unique to it or shared with the other, as it's common practice for hotels to publish their accommodations on several marketplaces.
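To make the question concrete, the comparison Alpha cares about can be sketched as a cost per net-new hotel. Every figure below is a hypothetical placeholder, not data from either company, and the function name `cost_per_new_hotel` is an assumption made for illustration.

```python
def cost_per_new_hotel(total_cost: float, hotels_gained: int, overlap: int = 0) -> float:
    """Cost per hotel that Alpha does not already list.

    `overlap` is the number of acquired hotels already present in
    Alpha's marketplace, which therefore add no new inventory.
    """
    net_new = hotels_gained - overlap
    if net_new <= 0:
        raise ValueError("no net-new hotels would be gained")
    return total_cost / net_new


# Hypothetical numbers, for illustration only.
naive = cost_per_new_hotel(1_000_000, hotels_gained=1_000)
adjusted = cost_per_new_hotel(1_000_000, hotels_gained=1_000, overlap=200)
print(naive, adjusted)
```

The larger the overlap, the fewer net-new hotels the same price buys, which is why the overlap has to be measured before the deal can be compared against ad spend.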

### STEPS FOR THE ANALYSIS

1) Setting up

2) Data cleaning

3) Overlap Analysis

4) Conclusions


---

## 1) Setting Up

- Reading files;
- Standardizing columns.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
dtype = {
    'registration_id': str
}

parse_dates = [
    'registration_date',
    'latest_booking'
]

alpha = pd.read_csv(
    'alpha_hotels.csv',
    dtype=dtype,
    parse_dates=parse_dates
)

In [3]:
alpha.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1843 entries, 0 to 1842
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 1843 non-null   int64         
 1   hotel_name         1843 non-null   object        
 2   address            1843 non-null   object        
 3   city               1843 non-null   object        
 4   country            1843 non-null   object        
 5   registration_date  1843 non-null   datetime64[ns]
 6   latest_booking     1843 non-null   datetime64[ns]
 7   total_bookings     1843 non-null   int64         
 8   registration_id    1843 non-null   object        
dtypes: datetime64[ns](2), int64(2), object(5)
memory usage: 129.7+ KB


In [None]:
alpha.head()

In [None]:
dtype = {
    'Registration Id': str
}

parse_dates = [
    'Registration Date',
    'Last Booking'
]

beta = pd.read_csv(
    'Beta Hotels.csv',
    dtype=dtype,
    parse_dates=parse_dates
)

In [None]:
beta.info()

In [None]:
beta.head()

__Key findings__
- The files don't seem to contain any null values;
- Alpha's column names are formatted as snake_case (best kept this way);
- Beta's column names contain spaces and unnecessary formatting.

> Next let's rename Beta columns to align with Alpha's.

In [None]:
new_cols = {
    'Id': 'id',
    'Hotel Name': 'hotel_name',
    'Address': 'address',
    'City State': 'city_state',
    'Registration Date': 'registration_date',
    'Last Booking': 'latest_booking',
    'Total Bookings': 'total_bookings',
    'Registration Id': 'registration_id'
}

beta.rename(new_cols, axis=1, inplace=True)
beta.head()
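As a complement to the explicit mapping, a generic normalizer can handle the purely mechanical part of the renaming. This is a minimal sketch: `to_snake_case` is a hypothetical helper, and the sample frame only reuses a few of Beta's column labels.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column label and replace runs of non-alphanumerics with '_'."""
    return re.sub(r'[^0-9a-zA-Z]+', '_', name.strip()).strip('_').lower()


# Empty frame that only carries column labels, for illustration.
sample = pd.DataFrame(columns=['Id', 'Hotel Name', 'City State', 'Registration Id'])
sample = sample.rename(columns=to_snake_case)
print(list(sample.columns))
```

A helper like this would turn 'Last Booking' into 'last_booking', not 'latest_booking', so columns whose names differ semantically between the two files still need an explicit mapping like the one used here.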

---

## 2) Data Cleaning

- Checking for inconsistencies;
- Ensuring both datasets follow the same pattern for the same property.

> While renaming the columns, I noticed that the location fields differ between the datasets. Alpha has only the 'city' name, whereas Beta also includes the 'state'.

> To keep both datasets on the same pattern, I'll drop the 'state' information, since the city alone seems to be enough.

In [None]:
beta['city'] = beta['city_state'].str.split(' - ').str[0]
beta.head()
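The effect of the split can be checked on a few sample strings. The values below are hypothetical and only assume the 'City - State' layout implied by the separator used above; the extra `str.strip()` is a defensive addition, not something the original cell does.

```python
import pandas as pd

# Hypothetical values; the real file may format the field differently.
city_state = pd.Series(['Rio de Janeiro - RJ', 'Salvador - BA', ' Recife - PE'])

# Split once on the first ' - ' and keep the left part, trimming stray spaces.
city = city_state.str.split(' - ', n=1).str[0].str.strip()
print(city.tolist())
```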

In [None]:
alpha['country'].value_counts()

> Alpha also has a 'country' column, which doesn't seem necessary since every row contains the same value, 'BR'.

> 'country' will be removed from Alpha because it adds no information and has no counterpart in Beta's columns.

In [None]:
alpha.drop('country', axis=1, inplace=True)
alpha.info()

In [None]:
beta.drop('city_state', axis=1, inplace=True)
beta.info()

In [None]:
alpha['city'].isin(beta['city']).value_counts()

In [None]:
beta['city'].isin(alpha['city']).value_counts()

__Key findings__
- Both datasets now have the same columns, data types and formatting;
- Every city listed in Alpha also appears in Beta and vice versa, so neither dataset contains an exclusive city.
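The `isin` checks only report how many rows match. If a mismatch did show up, set differences would name the offending cities directly. A small sketch on made-up city lists:

```python
import pandas as pd

# Made-up city columns, for illustration only.
alpha_cities = pd.Series(['Recife', 'Salvador', 'Recife'])
beta_cities = pd.Series(['Salvador', 'Recife', 'Natal'])

# Cities that appear in one source but not in the other.
only_alpha = set(alpha_cities) - set(beta_cities)
only_beta = set(beta_cities) - set(alpha_cities)
print(only_alpha, only_beta)
```

Both sets being empty corresponds to the finding above.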

> Next let's make sure 'registration_id' is free of inconsistencies.

In [None]:
alpha['registration_id'].str.len().describe()

In [None]:
beta['registration_id'].str.len().describe()

__Key findings__
- This check was necessary because this column will be our primary key for the overlap comparison;
- It is an externally issued numeric document, and 12 digits is the norm for this type of ID.
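Length alone does not guarantee that an ID is made of digits only. A stricter check, sketched here on made-up IDs, matches each value against a 12-digit pattern. It also shows why the column is read as `str`: a leading zero would be lost if the ID were parsed as an integer.

```python
import pandas as pd

# Made-up IDs: valid with leading zero, too short, contains a letter, valid.
ids = pd.Series(['012345678901', '12345678901', '12345678901A', '123456789012'])

# True only for strings made of exactly 12 digits.
is_valid = ids.str.fullmatch(r'\d{12}')
print(ids[~is_valid].tolist())
```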

> Next let's make sure dates are free of inconsistencies.

In [None]:
# On pandas >= 2.0, drop `datetime_is_numeric` (the argument was removed).
alpha[['registration_date', 'latest_booking']].describe(datetime_is_numeric=True)

In [None]:
beta[['registration_date', 'latest_booking']].describe(datetime_is_numeric=True)

__Key findings__
- No extreme outliers were found among the dates, but analyzing the maximums of both columns raised some questions;
- Logically, the 'registration_date' should come before the 'latest_booking'...

> But does this hold in the data? Next we will validate it.

In [None]:
alpha[alpha['registration_date'] > alpha['latest_booking']]

In [None]:
beta[beta['registration_date'] > beta['latest_booking']]

> It turns out Beta has some hotels whose registration date is later than their latest booking, which is inconsistent. Next we will handle those few cases.

In [None]:
(beta['latest_booking'] - beta['registration_date']).mean()

In [None]:
(beta['latest_booking'] - beta['registration_date']).describe()

__Key findings__
- The chosen fix is to move the 'registration_date' of the affected hotels back by 249 days, a value based on the mean of the series above.
- This was preferred over dropping the rows or leaving them unchanged because:

1) I don't want to drop these hotels: the main insight of this analysis is the overlap, and the dates are secondary information that is helpful but not essential.

2) I will still use the dates later in the analysis, so instead of letting potentially wrong data distort the whole series, I adjust it using the series mean. The result may not be exact, but it should affect the overall statistics less.

In [None]:
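# Rows where the registration date is later than the latest booking.
invalid_dates = beta['registration_date'] > beta['latest_booking']

# Shift those registration dates back by 249 days.
beta.loc[invalid_dates, 'registration_date'] = (
    beta.loc[invalid_dates, 'registration_date'] - pd.to_timedelta(249, 'day')
)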

In [None]:
beta[beta['registration_date'] > beta['latest_booking']]

In [None]:
beta[beta['id'] == 460]
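A possible variation, shown only as a sketch on made-up dates, is to anchor the corrected date to each hotel's own latest booking. This assumes the goal is "registration happened a typical gap before the last booking", and it guarantees the corrected date never lands after the booking. The mean gap here is computed from the valid rows only, which is also an assumption.

```python
import pandas as pd

# Made-up dates; the middle row has registration after the latest booking.
df = pd.DataFrame({
    'registration_date': pd.to_datetime(['2020-01-01', '2021-06-01', '2020-03-01']),
    'latest_booking': pd.to_datetime(['2020-12-31', '2021-01-01', '2020-09-02']),
})

invalid = df['registration_date'] > df['latest_booking']

# Mean gap between registration and latest booking, valid rows only.
mean_gap = (df.loc[~invalid, 'latest_booking']
            - df.loc[~invalid, 'registration_date']).mean()

# Place the corrected registration date one mean gap before the latest booking.
df.loc[invalid, 'registration_date'] = df.loc[invalid, 'latest_booking'] - mean_gap

print(mean_gap, (df['registration_date'] > df['latest_booking']).sum())
```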

---

## 3) Overlap Analysis

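- Combining both datasets;
- Deriving activity metrics ('active_days' and 'monthly_bookings');
- Counting and isolating the overlapping hotels.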

In [None]:
alpha['source'] = 'alpha'
beta['source'] = 'beta'

In [None]:
hotel_df = pd.concat([alpha, beta], ignore_index=True)

In [None]:
hotel_df.info()

In [None]:
hotel_df.head()

In [None]:
hotel_df.tail()

In [None]:
hotel_df['active_days'] = hotel_df['latest_booking'] - hotel_df['registration_date']
hotel_df['active_days'] = hotel_df['active_days'].dt.days

In [None]:
hotel_df.loc[hotel_df['source'] == 'alpha', 'active_days'].describe()

In [None]:
hotel_df.loc[hotel_df['source'] == 'beta', 'active_days'].describe()

In [None]:
hotel_df['monthly_bookings'] = (hotel_df['total_bookings'] / hotel_df['active_days']) * 30
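If a hotel's registration and latest booking fall on the same day, 'active_days' is 0 and the division produces `inf` (or `NaN` for 0 / 0), which would distort `describe()`. Whether such rows exist in these files is not shown here, so the following is a defensive sketch on made-up numbers.

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal hotel and two with a zero-day activity window.
df = pd.DataFrame({'total_bookings': [60, 10, 0], 'active_days': [30, 0, 0]})

# Treat a zero-day activity window as undefined instead of dividing by zero.
safe_days = df['active_days'].replace(0, np.nan)
df['monthly_bookings'] = df['total_bookings'] / safe_days * 30
print(df['monthly_bookings'].tolist())
```

Rows left as `NaN` are skipped by `describe()`, so they no longer inflate the mean or the maximum.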

In [None]:
hotel_df.loc[hotel_df['source'] == 'alpha', 'monthly_bookings'].describe()

In [None]:
hotel_df.loc[hotel_df['source'] == 'beta', 'monthly_bookings'].describe()

### Generating statistics

In [None]:
summary_statistics = pd.DataFrame(index=['hotels'])
summary_statistics['alpha_hotels'] = hotel_df[hotel_df['source'] == 'alpha'].shape[0]
summary_statistics['beta_hotels'] = hotel_df[hotel_df['source'] == 'beta'].shape[0]
summary_statistics['overlap'] = hotel_df.duplicated('registration_id').sum()
summary_statistics['alpha_exclusive'] = summary_statistics['alpha_hotels'] - summary_statistics['overlap']
summary_statistics['beta_exclusive'] = summary_statistics['beta_hotels'] - summary_statistics['overlap']
summary_statistics
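One caveat with counting overlap through `duplicated` on the stacked frame is that an ID repeated within a single source is counted as well. A set-based count, sketched here on made-up IDs, only considers IDs shared across sources.

```python
import pandas as pd

# Made-up IDs; 'A2' is repeated within Alpha.
alpha_ids = pd.Series(['A1', 'A2', 'A2', 'C1', 'C2'])
beta_ids = pd.Series(['B1', 'C1', 'C2'])

alpha_set, beta_set = set(alpha_ids), set(beta_ids)

counts = {
    'overlap': len(alpha_set & beta_set),
    'alpha_exclusive': len(alpha_set - beta_set),
    'beta_exclusive': len(beta_set - alpha_set),
}
print(counts)

# For comparison: duplicated() on the stacked IDs also counts the repeated 'A2'.
stacked = pd.concat([alpha_ids, beta_ids], ignore_index=True)
print(stacked.duplicated().sum())
```

If neither source repeats an ID, the two approaches agree. The set-based counts describe unique IDs, so they can differ from row counts when a source has repeats.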

In [None]:
summary_statistics = summary_statistics.transpose()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

data = summary_statistics.loc[['alpha_exclusive', 'overlap'], 'hotels']

ax1.pie(
    data,
    #labels=['Alpha exclusive', 'Overlapping with Beta'],
    autopct=lambda p: '{:.0f}% ({:.0f})'.format(p,(p/100)*data.sum()),
    colors=['#77db0a', '#ff4076'],
    explode=[0.07, 0],
    startangle=0,
    shadow=True
)
ax1.set_title("Alpha's hotels analysis" , weight='bold')
ax1.legend(['Alpha exclusive', 'Overlapping with Beta'], loc='lower center', bbox_to_anchor=(0.5, -0.1))

data = summary_statistics.loc[['beta_exclusive', 'overlap'], 'hotels']

ax2.pie(
    data,
    #labels=['Beta exclusive', 'Overlapping with Alpha'],
    autopct=lambda p: '{:.0f}% ({:.0f})'.format(p,(p/100)*data.sum()),
    colors=['#77db0a', '#ff4076'],
    explode=[0.07, 0],
    startangle=0,
    shadow=True
)
ax2.set_title("Beta's hotels analysis", weight='bold')
ax2.legend(['Beta exclusive', 'Overlapping with Alpha'], loc='lower center', bbox_to_anchor=(0.5, -0.1))

plt.show()

### Isolating the overlap

In [None]:
overlap = alpha.loc[alpha['registration_id'].isin(beta['registration_id']), 'registration_id']
overlap.info()

In [None]:
hotel_df.duplicated('registration_id').value_counts()

In [None]:
overlap = beta.loc[beta['registration_id'].isin(alpha['registration_id']), 'registration_id']
overlap.info()

In [None]:
hotel_df[hotel_df.duplicated('registration_id', keep=False)].sort_values('registration_id')

In [None]:
beta[beta['registration_id'].isin(alpha['registration_id'])].sort_values('registration_id')

In [None]:
alpha[alpha['registration_id'].isin(beta['registration_id'])].sort_values('registration_id')

In [None]:
alpha.duplicated('registration_id').value_counts()

In [None]:
beta.duplicated('registration_id').value_counts()

In [None]:
alpha[alpha.duplicated('registration_id', keep=False)].sort_values('registration_id')

### Analyzing monthly bookings

In [None]:
hotel_df['monthly_bookings'].describe()

In [None]:
alpha_overlap = hotel_df.loc[
    (hotel_df['source'] == 'alpha') & 
    (hotel_df['registration_id'].isin(overlap)),
    ['registration_id', 'monthly_bookings']
]
alpha_overlap.rename({'monthly_bookings': 'monthly_bookings_alpha'}, axis=1, inplace=True)
alpha_overlap.set_index('registration_id', inplace=True)
alpha_overlap.head()

In [None]:
beta_overlap = hotel_df.loc[
    (hotel_df['source'] == 'beta') & 
    (hotel_df['registration_id'].isin(overlap)),
    ['registration_id', 'monthly_bookings']
]
beta_overlap.rename({'monthly_bookings': 'monthly_bookings_beta'}, axis=1, inplace=True)
beta_overlap.set_index('registration_id', inplace=True)
beta_overlap.head()

In [None]:
overlap_monthly_bookings = pd.concat([alpha_overlap, beta_overlap], ignore_index=False)
overlap_monthly_bookings
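Stacking the two frames along the default row axis leaves each row with only one of the two booking columns filled and the other as `NaN`. To compare the same hotel across marketplaces, a merge on 'registration_id' puts both values on one row. The sketch below uses made-up numbers, and the `beta_minus_alpha` column is a hypothetical addition.

```python
import pandas as pd

# Made-up monthly bookings for two hotels listed on both marketplaces.
alpha_mb = pd.DataFrame({'registration_id': ['C1', 'C2'], 'monthly_bookings': [12.0, 3.0]})
beta_mb = pd.DataFrame({'registration_id': ['C2', 'C1'], 'monthly_bookings': [4.5, 20.0]})

# Inner merge keeps only IDs present in both frames, one row per pairing.
side_by_side = alpha_mb.merge(
    beta_mb,
    on='registration_id',
    how='inner',
    suffixes=('_alpha', '_beta'),
)
side_by_side['beta_minus_alpha'] = (
    side_by_side['monthly_bookings_beta'] - side_by_side['monthly_bookings_alpha']
)
print(side_by_side)
```

If an ID is repeated within one source, the merge yields one row per pairing, so such repeats are best resolved before comparing.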

In [None]:
overlap