### Business Questions
    G2M Strategy 
    
- Market Segmentation
- Product
- Buyer
- Routes to Market
***
**Pitfalls to Avoid** 
- Poor product market fit
- Oversaturation
***
**Methodologies**
- Funnel - Awareness, consideration, and decision stages of the customer’s journey
- Flywheel - Attracting, engaging, and delighting prospects, leads, and customers
***
**Components**
- **Product-Market Fit:** What problem(s) does your product solve?
- **Target Audience:** Who is experiencing the problem that your product solves? How much are they willing to pay for a solution? What are the pain points and frustrations that you can alleviate?
- **Competition and Demand:** Who already offers what you’re launching? Is there a demand for the product, or is the market oversaturated?
- **Distribution:** Through what mediums will you sell the product or service? A website, an app, or a third-party distributor?


# G2M Strategy - Model Development

**Prediction**

Given the results from our cleaned dataset, we have a good understanding of the data and the relationships that exist, and the insights gather will assist in determining the type of model in predicting future `profit`.

Analysis will consist of traditional hypothesis testing to more advanced modeling and techniques.


Time period of data is from **31/01/2016 to 31/12/2018**:


**Resources:**<br>
[G2M Strategy](https://blog.hubspot.com/sales/gtm-strategy)

### Building a G2M Strategy

1. Identify the buying center and personas.
2. Craft a value matrix to help identify messaging.
3. Test your messaging.
4. Optimize your ads based on the results of your tests before implementing them on a wide scale.
5. Understand your buyer’s journey.
6. Choose one (or more) of the four most common sales strategies.
7. Build brand awareness and demand generation with inbound and/or outbound methods.
8. Create content to get inbound leads.
9. Find ways to optimize your pipeline and increase conversion rates.
10. Analyze and shorten the sales cycle.
11. Reduce customer acquisition cost.
12. Strategize ways to tap into your existing customer base.
13. Adjust and iterate as you go.
14. Retain and delight your customers.

As a company owner - Using the G2M strategy the focus will be considereably to that of its Customers/Users. Th

1) Can we tell how much would be earned by quarter, period, year of a specific time period? (Predictive analysis, Time Series)

2) What areas (city) generate greatest profit? (Using population and number of users as a reference)

3) Customer preference over pink and yellow?

4) Average age of user? (if applicable create age bins)

5) Average income of user? (hist, highest income user - states)

6) If company purchases a specific fleet of vehicles what is the time range for ROI? (Apply hypothesis testing and A/B Testing)

7) What are users/consumers using more (cash or card)? Are prices equivalent - Upcharge if cash, card transaction is pre-pickup - lower rate?

In [9]:
pip install -U scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.1.2-cp38-cp38-macosx_10_9_x86_64.whl (8.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.6/8.6 MB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting joblib>=1.0.0
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m307.0/307.0 kB[0m [31m9.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: joblib, scikit-learn
  Attempting uninstall: joblib
    Found existing installation: joblib 0.17.0
    Uninstalling joblib-0.17.0:
      Successfully uninstalled joblib-0.17.0
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency con

In [1]:
# Import libraries 

# Visualization
import shap
%matplotlib inline
import seaborn as sns
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import eli5
from eli5.sklearn import PermutationImportance

# Data Wrangling
import numpy as np
import pandas as pd
pd.set_option('display.max_columns',50)

# Model Creation
from xgboost import XGBRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge, LinearRegression
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve, GridSearchCV
from sklearn.metrics import roc_curve, plot_roc_curve, mean_absolute_error, mean_squared_error, accuracy_score

# Warnings Ignore
import warnings
warnings.filterwarnings("ignore")

# Function to display plotly in jupyter notebook
def enable_plotly_in_cell():
    import IPython
    from plotly.offline import init_notebook_mode
    display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
    init_notebook_mode(connected=False)
    
print('Matplotlib:',matplotlib.__version__)
print('Pandas:',pd.__version__)
print('Numpy:',np.__version__)

Matplotlib: 3.1.1
Pandas: 1.1.3
Numpy: 1.23.0


In [14]:
cab_data = pd.read_csv('/Users/jasonrobinson/Desktop/VC/notebooks/cab_data_cleaned')
cab_yellow = pd.read_csv('/Users/jasonrobinson/Desktop/VC/notebooks/yellow_cab.csv')
cab_pink = pd.read_csv('/Users/jasonrobinson/Desktop/VC/notebooks/pink_cab.csv')

### Build on creating statistical analysis - answer business questions.

Perform analysis and provide visualizations to gather further insights and recommendations for the next steps in the process of determing optimal investment following the G2M strategy.

To achieve this we will apply machine learning techniques to predict future profit.

We wil also use the data to determine the optimal investment strategy, related to time series analysis. 
        

In [22]:
# Wrangle function to clean data
def wrangle(filepath):
    
    '''
                 ,,,,,,,
      (\-"""-/) /       \
       |     | /         \
       \ 0 0 //           \
        \_o_//       /\   /
       /`   `\      |  \,/  
      /       \     |  
      \ (   ) /     |
     / \_)-(_/ \    |
    |  /_____\  |  /
    \  \ J.R / /  /
     \ '.___.' / /
    .'  \-=-/  '.
   /   /`   `\   \
  (//./       \.\\)
   `"`         `"`
   
    '''
    # Set conditional satements for filtering times of month to season value
    condition_winter = (cab_data.index>=1)&(cab_data.index<=3)
    condtion_spring = (cab_data.index>=4)&(cab_data.index<=6)
    condition_summer = (cab_data.index>=7)&(cab_data.index<=9)
    condition_autumn = (cab_data.index>=10)@(cab_data.index<=12)
    
    # Create column in dataframe that inputs the season based on the conditions created above
    cab_data['season'] = np.where(condition_winter,'winter',
                            np.where(condtion_spring,'spring',
                                     np.where(condition_summer,'summer',
                                              np.where(condition_autumn,'automn',np.nan))))

    return cab_data

# Applying the wrangle function to the dataset
cab_data=wrangle('/Users/jasonrobinson/Desktop/VC/notebooks/cab_data_cleaned')

In [30]:
# Create a price increase for price_charged by .5%
cab_data['price_inc_05'] = cab_data['price_charged'] * 1.05

# Create a price increase for price_charged by .10%
cab_data['price_inc_10'] = cab_data['price_charged'] * 1.10

# Difference of price charged and trip cost to get profit
cab_data['profit'] = cab_data['price_charged'] - cab_data['trip_cost']
cab_data.head()

Unnamed: 0,travel_date,transact_id,company,city,km_travelled,price_charged,trip_cost,customer_id,payment_mode,gender,age,monthly_income,price_inc_05,price_diff,new_price_charged,travel_date.1,season,price_inc_10,profit
0,2016-02-06,10000011,Pink Cab,ATLANTA GA,30.45,370.95,313.635,29290,Card,Male,28,10813,389.4975,18.5475,389.4975,2016-02-06,automn,408.045,57.315
1,2018-08-19,10351127,Yellow Cab,ATLANTA GA,26.19,598.7,317.4228,29290,Cash,Male,28,10813,628.635,29.935,628.635,2018-08-19,winter,658.57,281.2772
2,2018-12-22,10412921,Yellow Cab,ATLANTA GA,42.55,792.05,597.402,29290,Card,Male,28,10813,831.6525,39.6025,831.6525,2018-12-22,winter,871.255,194.648
3,2016-02-04,10000012,Pink Cab,ATLANTA GA,28.62,358.52,334.854,27703,Card,Male,27,9237,376.446,17.926,376.446,2016-02-04,winter,394.372,23.666
4,2018-05-20,10320494,Yellow Cab,ATLANTA GA,36.38,721.1,467.1192,27703,Card,Male,27,9237,757.155,36.055,757.155,2018-05-20,spring,793.21,253.9808


In [12]:
# Get dummie data for categoriacal values
cab_data[['female', 'male']] = pd.get_dummies(cab_data["gender"])
cab_data[['yellow_cab', 'pink cab']] = pd.get_dummies(cab_data["company"])
cab_data[['card', 'cash']] = pd.get_dummies(cab_data["payment_mode"])
cab_data = cab_data.drop(['gender', 'company', 'payment_mode', 'customer_id'], axis=1)

In [24]:
cab_data = wrangle('/Users/jasonrobinson/Desktop/VC/notebooks/cab_data_cleaned')

In [31]:
cab_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344953 entries, 0 to 344952
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   travel_date        344953 non-null  object 
 1   transact_id        344953 non-null  int64  
 2   company            344953 non-null  object 
 3   city               344953 non-null  object 
 4   km_travelled       344953 non-null  float64
 5   price_charged      344953 non-null  float64
 6   trip_cost          344953 non-null  float64
 7   customer_id        344953 non-null  int64  
 8   payment_mode       344953 non-null  object 
 9   gender             344953 non-null  object 
 10  age                344953 non-null  int64  
 11  monthly_income     344953 non-null  int64  
 12  price_inc_05       344953 non-null  float64
 13  price_diff         344953 non-null  float64
 14  new_price_charged  344953 non-null  float64
 15  travel_date.1      344953 non-null  object 
 16  se

In [None]:
# Daily total price charged

In [None]:
# Figure showing Price per total load
fig = px.scatter(df,x='total_load_actual',
                 y='price_actual',
                 facet_col='season',
                 opacity=0.1,
                 title='Price Per KW Hour Compaired To Total Energy Genereated Per Season',
                 animation_frame=df.index.year)

# Figure customizations
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='darkslateblue')),
                  selector=dict(mode='markers'))

In [28]:
cab_data['monthly_income'].sort_values(ascending=False)[:3]

307038    35000
63653     34996
63659     34996
Name: monthly_income, dtype: int64

In [29]:
cab_data.sort_index(ascending=False)

Unnamed: 0,travel_date,transact_id,company,city,km_travelled,price_charged,trip_cost,customer_id,payment_mode,gender,age,monthly_income,price_inc_05,price_diff,new_price_charged,travel_date.1,season
344952,2018-02-02,10439846,Yellow Cab,TUCSON AZ,13.30,244.65,180.3480,39761,Card,Female,32,10128,256.8825,12.2325,256.8825,2018-02-02,automn
344951,2018-02-04,10439840,Yellow Cab,TUCSON AZ,5.60,92.42,70.5600,41677,Cash,Male,23,19454,97.0410,4.6210,97.0410,2018-02-04,automn
344950,2018-02-02,10439838,Yellow Cab,TUCSON AZ,19.00,303.77,232.5600,41414,Card,Male,38,3960,318.9585,15.1885,318.9585,2018-02-02,automn
344949,2018-02-01,10439799,Yellow Cab,SILICON VALLEY,13.72,277.97,172.8720,12490,Cash,Male,33,18713,291.8685,13.8985,291.8685,2018-02-01,automn
344948,2018-02-05,10439790,Yellow Cab,SEATTLE WA,16.66,261.18,213.9144,38520,Card,Female,42,19417,274.2390,13.0590,274.2390,2018-02-05,automn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,2018-05-20,10320494,Yellow Cab,ATLANTA GA,36.38,721.10,467.1192,27703,Card,Male,27,9237,757.1550,36.0550,757.1550,2018-05-20,spring
3,2016-02-04,10000012,Pink Cab,ATLANTA GA,28.62,358.52,334.8540,27703,Card,Male,27,9237,376.4460,17.9260,376.4460,2016-02-04,winter
2,2018-12-22,10412921,Yellow Cab,ATLANTA GA,42.55,792.05,597.4020,29290,Card,Male,28,10813,831.6525,39.6025,831.6525,2018-12-22,winter
1,2018-08-19,10351127,Yellow Cab,ATLANTA GA,26.19,598.70,317.4228,29290,Cash,Male,28,10813,628.6350,29.9350,628.6350,2018-08-19,winter


In [25]:
cab_data.columns

Index(['travel_date', 'transact_id', 'company', 'city', 'km_travelled',
       'price_charged', 'trip_cost', 'customer_id', 'payment_mode', 'gender',
       'age', 'monthly_income', 'price_inc_05', 'price_diff',
       'new_price_charged', 'travel_date.1', 'season'],
      dtype='object')

In [None]:
#cab_data['travel_date']= pd.to_datetime(cab_data['travel_date'], infer_datetime_format=True)

In [None]:
# 5-day, 4-week work month
cab_data['daily_income'] = (cab_data['monthly_income'] / 20 ).round(2)

In [None]:
sns.displot(cab_data['monthly_income']);

In [None]:
sns.displot(cab_data['trip_cost']);

In [None]:
# Remove outliers above count of 5000
cab_data = cab_data[cab_data['trip_cost'] > 100]and
cab_data = cab_data[cab_data['trip_cost'] < 600]


In [None]:
# Percentage of monthly income on cost of trip
percentage_monthly_income = cab_data['trip_cost'] / cab_data['monthly_income']

## Model builing - Regression

In [None]:
# Split data into train, val, and test
train = cab_data.iloc[:int(len(cab_data) * 0.8)]
val = cab_data.iloc[int(len(cab_data) * 0.8):int(len(cab_data) * 0.9)]
test = cab_data.iloc[int(len(cab_data) * 0.9):]

In [None]:
target = 'trip_cost'
features = cab_data.columns.drop('trip_cost')

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

X_test = test[features]
y_test = test[target]

In [None]:
import category_encoders as ce
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lr = make_pipeline(
    ce.TargetEncoder(),  
    LinearRegression()
)

lr.fit(X_train, y_train)
print('Linear Regression R^2', lr.score(X_val, y_val))

In [None]:
coefficients = lr.named_steps['linearregression'].coef_
pd.Series(coefficients, features)

In [None]:
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

gb = make_pipeline(
    ce.OrdinalEncoder(), 
    XGBRegressor(n_estimators=100, objective='reg:squarederror', n_jobs=1)
)

gb.fit(X_train, y_train)
y_pred = gb.predict(X_val)
print('Gradient Boosting R^2', r2_score(y_val, y_pred))

In [None]:
# Get a scatterplot of the predictions vs actual values
fig = go.Figure()
fig.add_trace(go.Scatter(x=y_val, y=y_pred, mode='markers', name='predictions'))
fig.add_trace(go.Scatter(x=y_val, y=y_val, mode='markers', name='actual'))
fig.update_layout(title='Gradient Boosting Predictions vs Actual', xaxis_title='Actual', yaxis_title='Predicted')
iplot(fig)



In [None]:
import plotly.express as px

px.scatter(
    X_train,
    y_train,
    x='monthly_income',
    y='trip_cost',
    color='price_charged'
    title='Gradient Boosting Predictions vs Actual',    
    xaxis_title='Actual',
    yaxis_title='Predicted'

)

In [None]:
# Convert string to datetime
X_train['travel_date'] = pd.to_datetime(X_train['travel_date'], infer_datetime_format=True)

In [None]:
#pip install pdpbox

In [None]:
#pip install --upgrade pip

In [None]:
#from pdpbox import pdp
#pdp = pdp_interact.pivot_table(
#    values='preds', 
#    columns=features[0], 
#    index=features[1]
#)[::-1]

In [None]:
# Predicted values
y_pred

Take the price charged out,because it is calculataing future data?

### Time Series Analysis

We can use statsmodels to perform time series analysis and get a general idea of the data over time.

In [None]:
# Create an univariate time series dataset indexed by the travel date
ts_data = pd.DataFrame(cab_data['trip_cost'], index=cab_data['travel_date'])

# Create a function to determine the autocorrelation of the time series
def autocorrelation(ts_data):
    """
    Method:
        Create a function to determine the autocorrelation of the time series
    """
    # Calculate the autocorrelation of the time series
    acf_data = ts.acf(ts_data)
    
    # Plot the autocorrelation of the time series
    plt.figure(figsize=(10,6))
    plt.plot(acf_data)
    plt.xlabel('Lag')
    plt.ylabel('Autocorrelation')
    plt.title('Autocorrelation of Trip Cost')
    plt.show()
    return


autocorrelation(ts_data)

In [None]:
# Create a function to determine the partial autocorrelation of the time series
def partial_autocorrelation(ts_data):
    """
    Method:
        Create a function to determine the partial autocorrelation of the time series
    """
    # Calculate the partial autocorrelation of the time series
    pacf_data = ts.pacf(ts_data)
    # Plot the partial autocorrelation of the time series
    plt.figure(figsize=(10,6))
    plt.plot(pacf_data)
    plt.xlabel('Lag')
    plt.ylabel('Partial Autocorrelation')
    plt.title('Partial Autocorrelation of Trip Cost')
    plt.show()
    return

In [None]:
# Create a function to determine the ADF test of the time series
def adf_test(ts_data):
    """
    Method:
        Create a function to determine the ADF test of the time series
    """
    # Calculate the ADF test of the time series
    adf_data = adfuller(ts_data)
    # Plot the ADF test of the time series
    plt.figure(figsize=(10,6))
    plt.plot(adf_data)
    plt.xlabel('Lag')
    plt.ylabel('ADF Test')
    plt.title('ADF Test of Trip Cost')
    plt.show()
    return
    

In [None]:
# Create a function to determine the ARIMA model of the time series
def arima_model(ts_data):
    """
    Method:
        Create a function to determine the ARIMA model of the time series
    """
    # Calculate the ARIMA model of the time series
    arima_data = ARIMA(ts_data, order=(1,1,1)).fit()
    # Plot the ARIMA model of the time series
    plt.figure(figsize=(10,6))
    plt.plot(arima_data.fittedvalues)
    plt.xlabel('Date')
    plt.ylabel('Trip Cost')
    plt.title('ARIMA Model of Trip Cost')
    print(arima_data.summary())
    plt.show()
    return