In [None]:
# execute to import notebook styling for tables and width etc.
from IPython.core.display import HTML
import urllib.request
response = urllib.request.urlopen('https://raw.githubusercontent.com/DataScienceUWL/DS775v2/master/ds755.css')
HTML(response.read().decode("utf-8"));

<font size=18>Project 1: Airfares</font>

This notebook will give you an idea about the background and scope of this project in addition to instructions.

# The Scenario

The following problem takes place in the United States in the late 1990s, when many major US cities were facing issues with airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest (SW) began competing on existing routes and starting non-stop service on routes that previously lacked it.  Building new airports is not generally feasible, but sometimes decommissioned military bases or smaller municipal airports can be reconfigured as regional or larger commercial airports.  There are numerous players and interests involved in the issue (airlines, city, state, and federal authorities, civic groups, the military, airport operators), and an aviation consulting firm is seeking advisory contracts with these players.  

A consulting firm wishes to determine the maximum average fare (FARE) as a function of three variables: COUPON, HI, and DISTANCE.  Moreover, they need to impose constraints on 
- the number of passengers on that route (PAX) $\leq 20000$
- the starting city’s average personal income (S_INCOME) $\leq 30000$
- the ending city’s average personal income (E_INCOME) $\geq 30000$

However, PAX, S_INCOME, and E_INCOME are not decision variables, so the firm first models each of them with linear regression (predictive analytics), using COUPON, HI, and DISTANCE as predictors.  They'll also use linear regression to model FARE as a linear function of COUPON, HI, and DISTANCE.  Armed with these predictive models, the firm will build a linear program (prescriptive analytics) to maximize the average fare.

Suppose you work at the aviation consulting firm and you want to maximize airfares for the particular set of circumstances described below. The file *Airfares.xlsx* contains real data that were collected between Q3-1996 and Q2-1997. The first sheet contains variable descriptions while the second sheet contains the data.  A csv file of the data is also provided (called *Airfares.csv*).

*NOTE: This problem scenario is developed from pp. 170-171 of Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, by Shmueli, Bruce, Yahav, Patel, and Lichtendahl (Wiley, 2017).*

# Predictive Analytics

Use multiple linear regression **through the origin** to fit airfare (FARE) as a linear function of the average number of coupons (COUPON) for that route, the Herfindahl Index (HI), and the distance between the two endpoint airports in miles (DISTANCE).  

Build three more linear regression models with COUPON, HI, and DISTANCE as predictors to fit separate regression equations through the origin for response variables:

- the number of passengers on that route (PAX)
- the starting city’s average personal income (S_INCOME)
- the ending city’s average personal income (E_INCOME)

## Linear Regression Example in Python

In DS705 you saw how to do linear regression in R.  Here is a Python example to get you started.  We'll show you how to do it both with `statsmodels` and with `sklearn`.  The `statsmodels` approach is probably best since you also get the statistical information.  `sklearn` is included since it's a popular machine learning package that is worth learning.

The file age_height.csv contains ages (years) and heights (inches) for 7 children.  Here we show how to get the linear regression model for predicting height from age.  We'll start with the "through the origin" model which means initially we are fitting a model of the form height = c * age with no intercept term (or intercept = 0).
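
For reference, the one-predictor "through the origin" fit has a simple closed form. The notation here is introduced only for this note: $a_i$ is the age and $h_i$ the height of child $i$. Minimizing the sum of squared errors $\sum_i (h_i - c\,a_i)^2$ over $c$ and setting the derivative to zero gives

$$
-2\sum_i a_i\,(h_i - c\,a_i) = 0
\quad\Longrightarrow\quad
\hat{c} = \frac{\sum_i a_i h_i}{\sum_i a_i^2}.
$$

Both packages below should report this same slope when no intercept is included.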

### With `statsmodels` package 

In [7]:
import statsmodels.api as sm
import pandas as pd

age_height = pd.read_csv("data/age_height.csv")

# define predictor variables
X = age_height['age'] 

# for multiple predictors example: X = age_height[['age','gender']]

# define response variables
Y = age_height['height']

# add a constant to the model (uncomment if you want to add an intercept term)
# X = sm.add_constant(X) 

# fit the model and pull out coefficients (in the project, the FARE fit supplies the objective function)
model_obj = sm.OLS(Y, X).fit()
coefs_obj = model_obj.params

print(model_obj.summary())
print(coefs_obj)

                                 OLS Regression Results                                
Dep. Variable:                 height   R-squared (uncentered):                   0.948
Model:                            OLS   Adj. R-squared (uncentered):              0.939
Method:                 Least Squares   F-statistic:                              109.4
Date:                Sun, 08 Sep 2019   Prob (F-statistic):                    4.49e-05
Time:                        13:22:13   Log-Likelihood:                         -25.770
No. Observations:                   7   AIC:                                      53.54
Df Residuals:                       6   BIC:                                      53.49
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------



### With `sklearn` machine learning package

In [35]:
from sklearn.linear_model import LinearRegression    
import pandas as pd
import numpy as np

age_height = pd.read_csv("data/age_height.csv")

# define predictor variables
X = np.array(age_height['age'])

# for multiple predictors example: X = age_height[['age','gender']]

# define response variables
Y = np.array(age_height['height'])

model = LinearRegression(fit_intercept = False)
model.fit(X.reshape(-1,1), Y)
print('No intercept (through the origin)')
print('Slope is: {:2.4f}'.format(model.coef_[0]) )
print('Model is: Y = {:2.4f} X'.format(model.coef_[0]) )

# if you want to include the usual intercept term
model2 = LinearRegression(fit_intercept = True)
model2.fit(X.reshape(-1,1), Y)
print('\nIntercept included: ')
print('Slope is: {:2.4f}'.format(model2.coef_[0]) )
print('Intercept is: {:2.4f}'.format(model2.intercept_) )
print('Model is: Y = {:2.4f} X + {:2.4f}'.format(model2.coef_[0],model2.intercept_))

No intercept (through the origin)
Slope is: 7.0636
Model is: Y = 7.0636 X

Intercept included: 
Slope is: 2.9375
Intercept is: 25.6250
Model is: Y = 2.9375 X + 25.6250
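
The examples above use a single predictor, while the project models need three. The following is a minimal sketch of a three-predictor fit through the origin. It assumes synthetic, noise-free data and made-up coefficients (nothing here comes from *Airfares.csv*), and it uses `numpy.linalg.lstsq` rather than `statsmodels` or `sklearn` only to keep the example self-contained. The names `X`, `y`, and `true_coefs` are hypothetical.

```python
import numpy as np

# Synthetic data: 50 hypothetical observations of three predictors.
# The ranges are arbitrary and chosen only for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(low=[1.0, 1000.0, 100.0],
                high=[2.0, 9000.0, 2500.0],
                size=(50, 3))

# Made-up "true" coefficients; the response has no intercept and no noise.
true_coefs = np.array([20.0, 0.01, 0.08])
y = X @ true_coefs

# Least squares with no column of ones is regression through the origin.
coefs, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

for name, value in zip(["x1", "x2", "x3"], coefs):
    print("{}: {:2.4f}".format(name, value))
```

With `statsmodels`, the analogous step is to pass a three-column predictor table as `X` to `sm.OLS(Y, X)` without calling `sm.add_constant`.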


# Prescriptive Analytics

## Linear Programming

Use the fitted regression equation for airfare (FARE), a linear function of the average number of coupons (COUPON) for that route, the Herfindahl Index (HI), and the distance between the two endpoint airports in miles (DISTANCE), as the objective function.

The three fitted regression equations for PAX, S_INCOME, and E_INCOME, each a linear function of COUPON, HI, and DISTANCE, supply the left-hand sides of three of the constraints:

- the number of passengers on that route (PAX) $\leq 20000$
- the starting city’s average personal income (S_INCOME) $\leq 30000$
- the ending city’s average personal income (E_INCOME) $\geq 30000$

For additional constraints, restrict COUPON to no more than 1.5, limit HI to between 4000 and 8000, inclusive, and consider only routes with DISTANCE between 500 and 1000 miles, inclusive.
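
Put together, the description above suggests a linear program of roughly the following shape. This is a sketch in symbolic form only: $x_1, x_2, x_3$ stand for COUPON, HI, and DISTANCE, and the coefficients $a_j$, $p_j$, $s_j$, $e_j$ are placeholder names (not from the original text) for the fitted FARE, PAX, S_INCOME, and E_INCOME regression coefficients.

$$
\begin{aligned}
\text{maximize}\quad & \text{FARE} = a_1 x_1 + a_2 x_2 + a_3 x_3 \\
\text{subject to}\quad & p_1 x_1 + p_2 x_2 + p_3 x_3 \leq 20000 \\
& s_1 x_1 + s_2 x_2 + s_3 x_3 \leq 30000 \\
& e_1 x_1 + e_2 x_2 + e_3 x_3 \geq 30000 \\
& x_1 \leq 1.5 \\
& 4000 \leq x_2 \leq 8000 \\
& 500 \leq x_3 \leq 1000
\end{aligned}
$$

No lower bound on COUPON is stated above; adding the usual nonnegativity condition $x_1 \geq 0$ would be an assumption.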

## Sensitivity Analysis

Produce the sensitivity analysis report.
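
As a rough illustration of what a sensitivity report contains, the sketch below solves a small generic LP (deliberately not the airfare model) and reads off the slack and shadow price of each constraint. It assumes `scipy.optimize.linprog` with the HiGHS solver, which may differ from the solver used in your course materials. The names `c`, `A_ub`, and `b_ub` and all numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

# Toy problem: maximize 3x + 5y
# subject to   x       <= 4
#                  2y  <= 12
#              3x + 2y <= 18,   x, y >= 0
c = np.array([3.0, 5.0])
A_ub = np.array([[1.0, 0.0],
                 [0.0, 2.0],
                 [3.0, 2.0]])
b_ub = np.array([4.0, 12.0, 18.0])

# linprog minimizes, so negate the objective to maximize.
res = linprog(-c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None), (0, None)], method="highs")

optimal_value = -res.fun          # undo the sign flip
slack = res.slack                 # b_ub - A_ub @ x for each constraint
# Marginals are reported for the minimization; negate them to get
# shadow prices for the original maximization.
shadow_prices = -res.ineqlin.marginals

print("Optimal x:", res.x, "objective:", optimal_value)
for i in range(len(b_ub)):
    status = "binding" if np.isclose(slack[i], 0.0) else "not binding"
    print("Constraint {}: {}, slack = {:.4f}, shadow price = {:.4f}".format(
        i + 1, status, slack[i], shadow_prices[i]))
```

In this toy problem the first constraint has slack and a zero shadow price, while the other two are binding. A $\geq$ constraint would have to be multiplied by $-1$ to fit the `A_ub`, `b_ub` form, and the sign of its reported marginal interpreted accordingly.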

# Write the Report

Write a report in the Jupyter notebook provided that summarizes the details of this project organized in sections as defined here.

**Section 1 - Introduction**: Summarize the problem statement, establishing the context and methods used in this project.

**Section 2**: Provide a brief summary of the linear regression models used to estimate the coefficients that will be used in the linear programming problem.  Explain why the multiple regression equations had to be fitted through the origin (consider the assumptions of linear programming).

**Section 3**: Report the optimal value of the airfare and the values of COUPON, HI, and DISTANCE at which it occurs. 

**Section 4**: From the sensitivity report, explain which constraints are binding for the number of passengers on that route (PAX), the starting city’s average personal income (S_INCOME), and the ending city’s average personal income (E_INCOME). If the constraint is binding, interpret the shadow price in the context of the problem.  If the constraint is not binding, interpret the slack in the context of the problem.

**Section 5**: Interpret the activity ranges for COUPON, HI, and DISTANCE in the context of the problem.

**Section 6 - Conclusion**: Briefly summarize the main conclusion of this project, state any limitations you see in the methods used here, and suggest other possible methods for maximizing airfare in this problem scenario.

**Section 7**: Include an appendix showing the mathematical formulation for the linear programming problem used in this project.
