# Regression Analysis

#### Evan Yathon

This notebook is intended to be run with papermill from the project root.

The purpose of this notebook is to run a logistic regression analysis on the cleaned reviews data. The central question is which factors contribute to a customer recommending an airline, given that they have submitted a review.

Usage:

`papermill src/ipynbs/regression_analysis.ipynb src/regression_analysis_ran.ipynb -p load_path data/topic_modeling_gw_reviews.csv`

In [1]:
load_path = "../../data/topic_modeling_gw_reviews.csv"

In [2]:
# load packages

# utils
import pandas as pd
import sys

# regression
from sklearn.linear_model import LogisticRegression

# regression utils
# custom model preparation script
sys.path.append("../../src")
sys.path.append("./src")
from PrepareForModel import PrepareForModel

In [3]:
reviews = pd.read_csv(load_path, parse_dates = ["date_of_review", "date_flown"])

### Regression Prep

Before running the regression analysis, a few things need to be done.

- Remove reviews dated after the transfer to Eurowings
- Drop columns that have a large amount of `NA` values or that don't make sense to include in the analysis
- Deal with `NA` values
- Encode categorical columns

Keep only reviews dated before `8th February 2016`, the cutoff for the Germanwings transfer to Eurowings (as found in the EDA notebook).

In [4]:
reviews_regr = reviews[reviews["date_of_review"] < "2016-02-08"]
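As a minimal sketch of how this filter treats the cutoff date itself, the toy frame below (made-up rows, reusing only the `date_of_review` and `recommendation` column names) shows that the strict `<` comparison excludes a review dated exactly on the cutoff.

```python
import pandas as pd

# Toy data: column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "date_of_review": pd.to_datetime(["2016-02-07", "2016-02-08", "2016-03-01"]),
    "recommendation": ["yes", "no", "no"],
})

cutoff = pd.Timestamp("2016-02-08")

# Strict inequality: a review dated exactly on the cutoff is dropped too.
before_transfer = toy[toy["date_of_review"] < cutoff]
print(before_transfer)
```

If reviews from the cutoff day should be kept, `<=` would be the comparison to use instead.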

Check the column types and the percentage of `NA` values in each column.

In [5]:
# column types, na values
pd.DataFrame({
    "data_type" : reviews_regr.dtypes.values,
    "precent_na" : (reviews_regr.isna().sum()/reviews_regr.shape[0])*100
})

| | data_type | precent_na |
|---|---|---|
| title | object | 0.0 |
| review_value | float64 | 9.42029 |
| n_user_reviews | object | 92.753623 |
| reviewer_name | object | 0.0 |
| reviewer_country | object | 9.42029 |
| date_of_review | datetime64[ns] | 0.0 |
| review_text | object | 0.0 |
| aircraft | object | 91.304348 |
| traveller_type | object | 82.608696 |
| seat_type | object | 10.144928 |
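The same summary can be computed a little more compactly, since the mean of a boolean `isna()` mask is the fraction of missing values. The sketch below uses a made-up two-column frame; the `percent_na` label is my own choice, not the notebook's.

```python
import numpy as np
import pandas as pd

# Toy data: made-up values, column names borrowed from the reviews data.
toy = pd.DataFrame({
    "review_value": [8.0, np.nan, 5.0, 2.0],
    "aircraft": [None, None, None, "A319"],
})

# isna() gives a boolean mask; its column-wise mean is the share of missing
# values, so multiplying by 100 gives a percentage.
summary = pd.DataFrame({
    "data_type": toy.dtypes,
    "percent_na": toy.isna().mean() * 100,
})
print(summary)
```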


`object` columns hold categorical or string data.  Below is a table documenting which columns will be dropped and the reasoning behind each.

| Column Name                   	| Reason to Drop                                                  	|
|-------------------------------	|-----------------------------------------------------------------	|
| title                         	| Information extracted via Topic Modeling                        	|
| n_user_reviews                	| Large amount of missing values                                  	|
| reviewer_name                 	| Not relevant for regression analysis                            	|
| reviewer_country              	| Too many categories for small amount of data; EDA backs this up 	|
| date_of_review                	| Imbalanced dataset over time                                    	|
| review_text                   	| Information extracted via Topic Modeling                        	|
| aircraft                      	| Large amount of missing values                                  	|
| traveller_type                	| Large amount of missing values                                  	|
| route                         	| Large amount of missing values                                  	|
| date_flown                    	| Large amount of missing values                                  	|
| ground_service_rating         	| Large amount of missing values                                  	|
| inflight_entertainment_rating 	| Large amount of missing values                                  	|
| clean_review_text             	| Information extracted via Topic Modeling                        	|
| clean_title                   	| Information extracted via Topic Modeling                        	|

In [6]:
reviews_regr.columns

Index(['title', 'review_value', 'n_user_reviews', 'reviewer_name',
       'reviewer_country', 'date_of_review', 'review_text', 'aircraft',
       'traveller_type', 'seat_type', 'route', 'date_flown',
       'seat_comfort_rating', 'cabin_staff_service_rating',
       'food_and_beverages_rating', 'inflight_entertainment_rating',
       'ground_service_rating', 'value_for_money_rating', 'recommendation',
       'clean_review_text', 'clean_title', 'review_luggage_seats',
       'review_time_delays', 'review_food_bev_crew', 'title_money_value',
       'title_staff_delays'],
      dtype='object')

In [7]:
drop_cols = ['title', 'n_user_reviews', 'reviewer_name',
             'reviewer_country', 'date_of_review', 'review_text', 
             'aircraft', 'traveller_type', 'route', 'date_flown',
             'ground_service_rating', 'inflight_entertainment_rating',
             'clean_review_text', 'clean_title']

In [8]:
# drop unwanted columns
reviews_regr = reviews_regr.drop(drop_cols, axis = 1)

After dropping, review the `NA` values and column types again.

In [9]:
# column types, na values
pd.DataFrame({
    "data_type" : reviews_regr.dtypes.values,
    "precent_na" : (reviews_regr.isna().sum()/reviews_regr.shape[0])*100
})

| | data_type | precent_na |
|---|---|---|
| review_value | float64 | 9.42029 |
| seat_type | object | 10.144928 |
| seat_comfort_rating | float64 | 34.057971 |
| cabin_staff_service_rating | float64 | 34.057971 |
| food_and_beverages_rating | float64 | 36.231884 |
| value_for_money_rating | float64 | 9.42029 |
| recommendation | object | 0.0 |
| review_luggage_seats | float64 | 0.0 |
| review_time_delays | float64 | 0.0 |
| review_food_bev_crew | float64 | 0.0 |


In [10]:
# levels of the seat_type variable and counts
reviews_regr.seat_type.value_counts()

Economy Class     122
Business Class      1
First Class         1
Name: seat_type, dtype: int64

Seat comfort, cabin staff service, and food and beverages still have a fair amount of missing values (roughly 34% to 36%). I might need to remove them from the analysis later, or perform two separate analyses.

`seat_type` and `recommendation` need to be encoded.  To do that, I'll use a script that I developed previously.

`PrepareForModel` dummy encodes categorical variables in a given dataframe.  It also has an option to effect code the categorical variables, but for this analysis I will keep dummy coding.

The reason for dummy coding is that I plan to use `Economy Class` as the reference level and estimate treatment effects relative to it, to see whether moving to business or first class would increase the chance of a recommendation.  An issue with this is that there is only one business class and one first class review in the dataset, so those effects cannot be estimated reliably.
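To make the distinction between the two coding schemes concrete, here is a minimal sketch on a toy `seat_type` column. It uses `pd.get_dummies` as a stand-in and is not the `PrepareForModel` implementation, whose internals are not shown in this notebook.

```python
import pandas as pd

# Toy column: made-up rows using the three seat types from the data.
seat = pd.Series(
    ["Economy Class", "Business Class", "First Class", "Economy Class"],
    name="seat_type",
)

# Dummy (treatment) coding: one 0/1 column per non-reference level.
# Reference rows (Economy Class) are all zeros.
dummy = pd.get_dummies(seat, prefix="seat_type", dtype=int)
dummy = dummy.drop(columns="seat_type_Economy Class")

# Effect (sum-to-zero) coding: same columns, but reference rows are -1.
effect = dummy.copy()
effect.loc[seat == "Economy Class", :] = -1

print(dummy)
print(effect)
```

With dummy coding, each coefficient compares a level against the reference level; with effect coding, each coefficient compares a level against the unweighted mean across levels.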

### Analysis 1
Dropping all rows containing `NA` values, keeping seat comfort, cabin staff service and food and beverages.

This dataset will have fewer samples than the original, since every row with a missing rating is removed.

In [14]:
review_regr1 = reviews_regr.dropna()
prep1 = PrepareForModel(review_regr1)
review_regr_enc1 = prep1.make_dummy_df({"seat_type" : "Economy Class", "recommendation" : "no"})

### Analysis 2
Dropping all rows containing `NA` values, dropping seat comfort, cabin staff service and food and beverages.


In [15]:
review_regr2 = reviews_regr.drop(
    ["seat_comfort_rating", "cabin_staff_service_rating", "food_and_beverages_rating"], axis = 1).dropna()
prep2 = PrepareForModel(review_regr2)
review_regr_enc2 = prep2.make_dummy_df({"seat_type" : "Economy Class", "recommendation" : "no"})
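The trade-off between the two analyses is how many rows survive `dropna()`. The sketch below illustrates it on a made-up frame in which one rating column is sparse; the counts are for the toy data only, not for the reviews dataset.

```python
import numpy as np
import pandas as pd

# Toy data: made-up values, with seat_comfort_rating deliberately sparse.
toy = pd.DataFrame({
    "review_value": [8.0, 2.0, 5.0, np.nan],
    "seat_comfort_rating": [4.0, np.nan, np.nan, 3.0],
    "recommendation": ["yes", "no", "yes", "no"],
})

# Strategy 1: keep every column, drop any row with a missing value.
complete_cases = toy.dropna()

# Strategy 2: drop the sparse column first, then drop incomplete rows.
reduced = toy.drop(columns=["seat_comfort_rating"]).dropna()

print(len(complete_cases), len(reduced))
```

Dropping the sparse column first keeps more rows, at the cost of losing that column as a predictor.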

In [17]:
review_regr_enc2

| | review_value | value_for_money_rating | review_luggage_seats | review_time_delays | review_food_bev_crew | title_money_value | title_staff_delays | seat_type_Business Class | seat_type_First Class | recommendation_yes | intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 8.0 | 5.0 | 0.013477 | 0.013587 | 0.972936 | 0.793557 | 0.206443 | 0 | 0 | 1 | 1.0 |
| 9 | 2.0 | 1.0 | 0.550957 | 0.053909 | 0.395134 | 0.247245 | 0.752755 | 0 | 0 | 0 | 1.0 |
| 10 | 8.0 | 3.0 | 0.026370 | 0.026510 | 0.947120 | 0.239511 | 0.760489 | 0 | 0 | 1 | 1.0 |
| 11 | 5.0 | 3.0 | 0.031280 | 0.029629 | 0.939091 | 0.188228 | 0.811772 | 0 | 0 | 1 | 1.0 |
| 12 | 10.0 | 5.0 | 0.953748 | 0.021430 | 0.024822 | 0.677940 | 0.322060 | 0 | 0 | 1 | 1.0 |
| 13 | 2.0 | 1.0 | 0.012011 | 0.976895 | 0.011094 | 0.238312 | 0.761688 | 0 | 0 | 0 | 1.0 |
| 14 | 7.0 | 4.0 | 0.015252 | 0.023420 | 0.961329 | 0.189641 | 0.810359 | 0 | 0 | 1 | 1.0 |
| 15 | 9.0 | 4.0 | 0.621663 | 0.018052 | 0.360284 | 0.189116 | 0.810884 | 1 | 0 | 1 | 1.0 |
| 16 | 8.0 | 4.0 | 0.016975 | 0.697451 | 0.285574 | 0.189698 | 0.810302 | 0 | 0 | 1 | 1.0 |
| 17 | 2.0 | 2.0 | 0.010511 | 0.979187 | 0.010302 | 0.677643 | 0.322357 | 0 | 0 | 0 | 1.0 |
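A frame of this shape could be passed to scikit-learn's `LogisticRegression` roughly as follows. This is a sketch on made-up data with a single predictor, not the notebook's actual model fit; the variable names `toy`, `X`, `y`, `model` and `odds_ratio` are my own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: made-up rows shaped like two columns of the encoded frame.
toy = pd.DataFrame({
    "value_for_money_rating": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "recommendation_yes":     [0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})

X = toy[["value_for_money_rating"]]
y = toy["recommendation_yes"]

# scikit-learn fits its own intercept by default, so a constant `intercept`
# column would be dropped from X first (or fit_intercept=False passed).
# The default also applies L2 regularization, which shrinks coefficients.
model = LogisticRegression()
model.fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of a
# recommendation per one-point increase in the rating.
odds_ratio = float(np.exp(model.coef_[0, 0]))
print(odds_ratio)
```

Because of the default regularization, the coefficients will differ from an unpenalized maximum-likelihood fit, which matters when they are interpreted as effect sizes.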
