## G2M insight for Cab Investment firm
Data Glacier - Henri Edwards

The client, XYZ, is a private firm in the US. Given the remarkable growth of the cab industry over the last few years and the presence of multiple key players in the market, XYZ is planning to invest in the industry. As part of its Go-to-Market (G2M) strategy, it wants to understand the market before making a final decision.

The datasets contain information on two cab companies. Each file (dataset) represents a different aspect of the customer profile. XYZ wants actionable insights that help it identify the right company to invest in.

<a id="cont"></a>

### Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Data Cleaning & Preprocessing</a>

<a href=#four>4. Model</a>

<a href=#five>5. Conclusion</a>

### 1. Importing Packages
<a class="anchor" id="one"></a>
<a href=#cont>Back to Table of Contents</a>

In [1]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

import datetime

import numpy as np
import pandas as pd
from scipy import stats

import matplotlib.pyplot as plt
from matplotlib import rc
import seaborn as sns

import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import plotly.offline as pyo
from plotly.subplots import make_subplots

sns.set_theme(style="whitegrid")

### 2. Loading the Data
<a class="anchor" id="two"></a>
<a href=#cont>Back to Table of Contents</a>

- **Cab_Data.csv** – this file includes transaction details for the two cab companies
- **Customer_ID.csv** – this is a mapping table that contains a unique identifier linking to each customer's demographic details
- **Transaction_ID.csv** – this is a mapping table that contains the transaction-to-customer mapping and the payment mode
- **City.csv** – this file contains a list of US cities with their population and number of cab users

In [3]:
cab_data = pd.read_csv('Cab_Data.csv')
city_data = pd.read_csv('City.csv')
customer_id = pd.read_csv('Customer_ID.csv')
transaction_id = pd.read_csv('Transaction_ID.csv')

In [4]:
cab_data.head(3)

| | Transaction ID | Date of Travel | Company | City | KM Travelled | Price Charged | Cost of Trip |
|---|---|---|---|---|---|---|---|
| 0 | 10000011 | 08-01-2016 | Pink Cab | ATLANTA GA | 30.45 | 370.95 | 313.635 |
| 1 | 10000012 | 06-01-2016 | Pink Cab | ATLANTA GA | 28.62 | 358.52 | 334.854 |
| 2 | 10000013 | 02-01-2016 | Pink Cab | ATLANTA GA | 9.04 | 125.2 | 97.632 |


In [5]:
city_data.head(3)

| | City | Population | Users |
|---|---|---|---|
| 0 | NEW YORK NY | 8405837 | 302149 |
| 1 | CHICAGO IL | 1955130 | 164468 |
| 2 | LOS ANGELES CA | 1595037 | 144132 |


In [6]:
customer_id.head(3)

| | Customer ID | Gender | Age | Income (USD/Month) |
|---|---|---|---|---|
| 0 | 29290 | Male | 28 | 10813 |
| 1 | 27703 | Male | 27 | 9237 |
| 2 | 28712 | Male | 53 | 11242 |


In [7]:
transaction_id.head(3)

| | Transaction ID | Customer ID | Payment_Mode |
|---|---|---|---|
| 0 | 10000011 | 29290 | Card |
| 1 | 10000012 | 27703 | Card |
| 2 | 10000013 | 28712 | Cash |


### 3. Data Cleaning & Preprocessing
<a class="anchor" id="three"></a>
<a href=#cont>Back to Table of Contents</a>

- Merge/Create necessary dataframes

- Inspect Null values

- Inspect Data Types

- Feature Engineering

In [8]:
# combine dataframes
combined_df = cab_data.merge(transaction_id,on='Transaction ID').merge(customer_id,on='Customer ID').merge(city_data,on='City')

In [9]:
combined_df.head(5)

| | Transaction ID | Date of Travel | Company | City | KM Travelled | Price Charged | Cost of Trip | Customer ID | Payment_Mode | Gender | Age | Income (USD/Month) | Population | Users |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000011 | 08-01-2016 | Pink Cab | ATLANTA GA | 30.45 | 370.95 | 313.635 | 29290 | Card | Male | 28 | 10813 | 814885 | 24701 |
| 1 | 10351127 | 21-07-2018 | Yellow Cab | ATLANTA GA | 26.19 | 598.7 | 317.4228 | 29290 | Cash | Male | 28 | 10813 | 814885 | 24701 |
| 2 | 10412921 | 23-11-2018 | Yellow Cab | ATLANTA GA | 42.55 | 792.05 | 597.402 | 29290 | Card | Male | 28 | 10813 | 814885 | 24701 |
| 3 | 10000012 | 06-01-2016 | Pink Cab | ATLANTA GA | 28.62 | 358.52 | 334.854 | 27703 | Card | Male | 27 | 9237 | 814885 | 24701 |
| 4 | 10320494 | 21-04-2018 | Yellow Cab | ATLANTA GA | 36.38 | 721.1 | 467.1192 | 27703 | Card | Male | 27 | 9237 | 814885 | 24701 |
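
The chained merges in cell 8 rely on the default inner join in pandas, so a transaction with no matching customer or city row would be dropped without warning. The following is a minimal sketch, on made-up toy frames rather than the real files, of how the `validate` and `indicator` arguments of `DataFrame.merge` could be used to check key uniqueness and to surface unmatched rows. The names `trips`, `payments`, `checked` and `unmatched` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for cab_data and transaction_id (values are illustrative only)
trips = pd.DataFrame({
    "Transaction ID": [1, 2, 3],
    "City": ["ATLANTA GA", "ATLANTA GA", "BOSTON MA"],
})
payments = pd.DataFrame({
    "Transaction ID": [1, 2],
    "Customer ID": [101, 102],
})

# validate="one_to_one" raises MergeError if either key column has duplicates;
# indicator=True adds a "_merge" column recording where each row was found.
checked = trips.merge(
    payments,
    on="Transaction ID",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows that exist in trips but have no counterpart in payments
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "trip(s) without a payment record")
```

If a check like this reports zero unmatched rows at every step, the inner join used in the notebook loses nothing.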


In [10]:
# view total null values per feature
combined_df.isnull().sum()

Transaction ID        0
Date of Travel        0
Company               0
City                  0
KM Travelled          0
Price Charged         0
Cost of Trip          0
Customer ID           0
Payment_Mode          0
Gender                0
Age                   0
Income (USD/Month)    0
Population            0
Users                 0
dtype: int64

In [11]:
combined_df.dtypes

Transaction ID          int64
Date of Travel         object
Company                object
City                   object
KM Travelled          float64
Price Charged         float64
Cost of Trip          float64
Customer ID             int64
Payment_Mode           object
Gender                 object
Age                     int64
Income (USD/Month)      int64
Population             object
Users                  object
dtype: object

In [12]:
# pandas reads Date of Travel as an object (string) column; convert it to datetime.
combined_df['Date of Travel'] = pd.to_datetime(combined_df['Date of Travel'], format='%d-%m-%Y')
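
The explicit `format='%d-%m-%Y'` matters because a string such as `08-01-2016` is ambiguous: without guidance, pandas tends to read it month-first. A small sketch on two sample date strings taken from the output above (the variable names are illustrative):

```python
import pandas as pd

raw = pd.Series(["08-01-2016", "21-07-2018"])

# With an explicit day-month-year format, 08-01-2016 is read as 8 January 2016
parsed = pd.to_datetime(raw, format="%d-%m-%Y")
print(parsed.dt.day.tolist(), parsed.dt.month.tolist())

# Without a format, a single ambiguous string is interpreted month-first
default_guess = pd.to_datetime("08-01-2016")
print(default_guess.month)
```

Passing the format also makes parsing fail loudly if a row does not match the expected layout, instead of guessing.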

In [13]:
# Convert relevant features from object to numerical data types
combined_df['Population'] = combined_df['Population'].replace(',','', regex=True)
combined_df['Population'] = combined_df['Population'].astype('int64')
combined_df['Users'] = combined_df['Users'].replace(',','', regex=True)
combined_df['Users'] = combined_df['Users'].astype('int64')
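
The conversion above implies that `Population` and `Users` arrive as strings with thousands separators. As a hedged sketch, assuming raw values of the form `"8,405,837"` possibly padded with spaces (the toy values below are formatted that way for illustration), the same clean-up can be written with the string accessor and `pd.to_numeric`:

```python
import pandas as pd

# Hypothetical raw values with thousands separators and stray spaces
city = pd.DataFrame({
    "Population": [" 8,405,837 ", "1,955,130"],
    "Users": ["302,149", " 164,468"],
})

for col in ["Population", "Users"]:
    # Drop the separators literally (no regex) and trim surrounding whitespace
    cleaned = city[col].str.replace(",", "", regex=False).str.strip()
    # pd.to_numeric raises by default if anything non-numeric remains
    city[col] = pd.to_numeric(cleaned).astype("int64")

print(city.dtypes)
```

Depending on how the file is formatted, `pd.read_csv(..., thousands=',')` may achieve the same result at load time.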

In [14]:
# Replace spaces with '_' in column names
combined_df.columns = combined_df.columns.str.replace(' ', '_')

In [15]:
# Create Profit feature: Price Charged - Cost of Trip
combined_df['Profit'] = (combined_df['Price_Charged'] - combined_df['Cost_of_Trip']).round(2)

In [16]:
# Rename Income_(USD/Month) to a more convenient name
combined_df.rename(columns = {'Income_(USD/Month)':'income_usd_pm'}, inplace = True)

In [17]:
# Confirm the data covers only two companies (Pink Cab and Yellow Cab).
combined_df['Company'].nunique()

2

In [18]:
combined_df.head(2)

| | Transaction_ID | Date_of_Travel | Company | City | KM_Travelled | Price_Charged | Cost_of_Trip | Customer_ID | Payment_Mode | Gender | Age | income_usd_pm | Population | Users | Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000011 | 2016-01-08 | Pink Cab | ATLANTA GA | 30.45 | 370.95 | 313.635 | 29290 | Card | Male | 28 | 10813 | 814885 | 24701 | 57.32 |
| 1 | 10351127 | 2018-07-21 | Yellow Cab | ATLANTA GA | 26.19 | 598.7 | 317.4228 | 29290 | Cash | Male | 28 | 10813 | 814885 | 24701 | 281.28 |


In [19]:
# Extract Month feature from the travel date
combined_df['Month'] = combined_df['Date_of_Travel'].dt.month

In [20]:
# Create dataframe for each company.
pink_cab = combined_df[combined_df['Company'] == 'Pink Cab']
yellow_cab = combined_df[combined_df['Company'] == 'Yellow Cab']

In [21]:
pink_cab.head(3)

| | Transaction_ID | Date_of_Travel | Company | City | KM_Travelled | Price_Charged | Cost_of_Trip | Customer_ID | Payment_Mode | Gender | Age | income_usd_pm | Population | Users | Profit | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000011 | 2016-01-08 | Pink Cab | ATLANTA GA | 30.45 | 370.95 | 313.635 | 29290 | Card | Male | 28 | 10813 | 814885 | 24701 | 57.32 | 1 |
| 3 | 10000012 | 2016-01-06 | Pink Cab | ATLANTA GA | 28.62 | 358.52 | 334.854 | 27703 | Card | Male | 27 | 9237 | 814885 | 24701 | 23.67 | 1 |
| 6 | 10395626 | 2018-10-27 | Pink Cab | ATLANTA GA | 13.39 | 167.03 | 141.934 | 27703 | Card | Male | 27 | 9237 | 814885 | 24701 | 25.1 | 10 |


In [22]:
yellow_cab.head(3)

| | Transaction_ID | Date_of_Travel | Company | City | KM_Travelled | Price_Charged | Cost_of_Trip | Customer_ID | Payment_Mode | Gender | Age | income_usd_pm | Population | Users | Profit | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10351127 | 2018-07-21 | Yellow Cab | ATLANTA GA | 26.19 | 598.7 | 317.4228 | 29290 | Cash | Male | 28 | 10813 | 814885 | 24701 | 281.28 | 7 |
| 2 | 10412921 | 2018-11-23 | Yellow Cab | ATLANTA GA | 42.55 | 792.05 | 597.402 | 29290 | Card | Male | 28 | 10813 | 814885 | 24701 | 194.65 | 11 |
| 4 | 10320494 | 2018-04-21 | Yellow Cab | ATLANTA GA | 36.38 | 721.1 | 467.1192 | 27703 | Card | Male | 27 | 9237 | 814885 | 24701 | 253.98 | 4 |
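
Cell 20 splits the data with one boolean mask per company. As an alternative sketch, not the approach used in the notebook, `groupby` can produce one sub-frame per company in a single pass, which scales without extra code if more companies were added. The toy frame and the name `by_company` are illustrative assumptions.

```python
import pandas as pd

# Toy frame reusing three Profit values from the outputs above
toy = pd.DataFrame({
    "Company": ["Pink Cab", "Yellow Cab", "Yellow Cab"],
    "Profit": [57.32, 281.28, 194.65],
})

# One sub-frame per company, keyed by company name
by_company = {name: frame for name, frame in toy.groupby("Company")}

print(sorted(by_company))
print(by_company["Yellow Cab"]["Profit"].tolist())
```

With either approach, calling `.copy()` on each sub-frame is a common way to avoid pandas' `SettingWithCopyWarning` if new columns are added to it later.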


#### Data Cleaning Conclusions
- No null values are present in the data
- Combined the dataframes
- Converted the necessary data types
- Engineered features (Profit, Month)
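
The conclusions above can also be written down as executable checks, so that a later change to the cleaning steps cannot silently break them. This is a minimal sketch under the column names used in this section. The helper `check_cleaned` is hypothetical and is demonstrated on a one-row toy frame, not on `combined_df`.

```python
import pandas as pd

def check_cleaned(df: pd.DataFrame) -> None:
    """Raise AssertionError if a basic cleaning invariant is violated."""
    assert df.isnull().sum().sum() == 0, "unexpected null values"
    assert pd.api.types.is_datetime64_any_dtype(df["Date_of_Travel"]), "dates not parsed"
    assert pd.api.types.is_integer_dtype(df["Population"]), "Population is not integer"
    assert pd.api.types.is_integer_dtype(df["Users"]), "Users is not integer"
    assert not any(" " in col for col in df.columns), "column names contain spaces"

# One-row toy frame shaped like the cleaned data (values taken from the outputs above)
sample = pd.DataFrame({
    "Date_of_Travel": pd.to_datetime(["2016-01-08"]),
    "Population": [814885],
    "Users": [24701],
})

check_cleaned(sample)
print("all checks passed")
```

In the notebook itself, the equivalent call would be `check_cleaned(combined_df)` after the feature engineering cells.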

### 3. Importing Packages
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>