# Regression Analysis

#### Evan Yathon

This notebook is intended to be run with papermill from the project root.

The purpose of this notebook is to run a logistic regression analysis on the cleaned reviews data.  The central aim is to identify which factors contribute to a customer recommending an airline, given that they have submitted a review.

Usage:

`papermill src/ipynbs/regression_analysis.ipynb src/regression_analysis_ran.ipynb -p load_path data/topic_modeling_gw_reviews.csv`

In [1]:
load_path = "../../data/topic_modeling_gw_reviews.csv"

In [2]:
# load packages

# utils
import pandas as pd
# regression
from sklearn.linear_model import LogisticRegression

In [5]:
reviews = pd.read_csv(load_path, parse_dates = ["date_of_review", "date_flown"])

### Regression Prep

Before running the regression analysis, a few things need to be done.

- Deal with `NA` values
- Drop columns that have a large proportion of `NA` values or don't make sense to include in the analysis
- Encode categorical columns

First, check the column types and the percentage of `NA` values in each column.

In [19]:
# column types, na values
pd.DataFrame({
    "data_type" : reviews.dtypes.values,
    "percent_na" : (reviews.isna().sum()/reviews.shape[0])*100
})

Unnamed: 0,data_type,percent_na
title,object,0.0
review_value,float64,9.589041
n_user_reviews,object,89.726027
reviewer_name,object,0.0
reviewer_country,object,8.90411
date_of_review,datetime64[ns],0.0
review_text,object,0.0
aircraft,object,89.726027
traveller_type,object,78.767123
seat_type,object,10.273973


`object` columns hold categorical or string data.  The table below documents which columns will be dropped and the reasoning behind each.

| Column Name                   	| Reason to Drop                                                  	|
|-------------------------------	|-----------------------------------------------------------------	|
| title                         	| Information extracted via Topic Modeling                        	|
| n_user_reviews                	| Large amount of missing values                                  	|
| reviewer_name                 	| Not relevant for regression analysis                            	|
| reviewer_country              	| Too many categories for small amount of data; EDA backs this up 	|
| date_of_review                	| Imbalanced dataset over time                                    	|
| review_text                   	| Information extracted via Topic Modeling                        	|
| aircraft                      	| Large amount of missing values                                  	|
| traveller_type                	| Large amount of missing values                                  	|
| route                         	| Large amount of missing values                                  	|
| date_flown                    	| Large amount of missing values                                  	|
| ground_service_rating         	| Large amount of missing values                                  	|
| inflight_entertainment_rating 	| Large amount of missing values                                  	|
| clean_review_text             	| Information extracted via Topic Modeling                        	|
| clean_title                   	| Information extracted via Topic Modeling                        	|
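
The columns above were chosen by hand, but the "large amount of missing values" criterion can also be checked programmatically. The following is a minimal sketch on a made-up frame: the column names mirror the notebook, while the values and the 50% cut-off (`NA_THRESHOLD`) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame: column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "aircraft": [np.nan, np.nan, np.nan, "A320"],
    "seat_type": ["Economy", "Economy", np.nan, "Business"],
    "review_value": [2.0, 7.0, 9.0, 4.0],
})

NA_THRESHOLD = 50.0  # hypothetical cut-off, in percent

# isna().mean() gives the fraction of missing values per column
percent_na = toy.isna().mean() * 100

# columns whose share of missing values exceeds the cut-off
high_na_cols = percent_na[percent_na > NA_THRESHOLD].index.tolist()
print(high_na_cols)
```

A list like `high_na_cols` could then be compared against the hand-written table to confirm that no sparse column was overlooked.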

In [21]:
reviews.columns

Index(['title', 'review_value', 'n_user_reviews', 'reviewer_name',
       'reviewer_country', 'date_of_review', 'review_text', 'aircraft',
       'traveller_type', 'seat_type', 'route', 'date_flown',
       'seat_comfort_rating', 'cabin_staff_service_rating',
       'food_and_beverages_rating', 'inflight_entertainment_rating',
       'ground_service_rating', 'value_for_money_rating', 'recommendation',
       'clean_review_text', 'clean_title', 'review_luggage_seats',
       'review_time_delays', 'review_food_bev_crew', 'title_money_value',
       'title_staff_delays'],
      dtype='object')

In [27]:
drop_cols = ['title', 'n_user_reviews', 'reviewer_name',
             'reviewer_country', 'date_of_review', 'review_text', 
             'aircraft', 'traveller_type', 'route', 'date_flown',
             'ground_service_rating', 'inflight_entertainment_rating',
             'clean_review_text', 'clean_title']

In [33]:
# drop unwanted columns
reviews_regr = reviews.drop(drop_cols, axis = 1)

After dropping, review the `NA` values and column types again.

In [34]:
# column types, na values
pd.DataFrame({
    "data_type" : reviews_regr.dtypes.values,
    "percent_na" : (reviews_regr.isna().sum()/reviews_regr.shape[0])*100
})

Unnamed: 0,data_type,percent_na
review_value,float64,9.589041
seat_type,object,10.273973
seat_comfort_rating,float64,32.876712
cabin_staff_service_rating,float64,32.876712
food_and_beverages_rating,float64,37.671233
value_for_money_rating,float64,9.589041
recommendation,object,0.0
review_luggage_seats,float64,0.0
review_time_delays,float64,0.0
review_food_bev_crew,float64,0.0


Seat comfort, cabin staff service, and food and beverages ratings still have roughly 33-38% of their values missing, so I might need to remove them from the analysis later.
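
One way to weigh that decision is to compare how many complete rows survive with and without the sparse rating columns. This is a rough sketch on made-up values; the column names are taken from the notebook, and `sparse_cols` is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd

# Toy ratings: column names follow the notebook, values are made up
toy = pd.DataFrame({
    "value_for_money_rating": [1.0, 4.0, 5.0, 2.0, np.nan],
    "seat_comfort_rating": [2.0, np.nan, 4.0, np.nan, 3.0],
    "food_and_beverages_rating": [np.nan, np.nan, 5.0, 1.0, 2.0],
})

# hypothetical list of columns under consideration for removal
sparse_cols = ["seat_comfort_rating", "food_and_beverages_rating"]

# complete-case row counts: keeping every column vs. dropping the sparse ones
rows_all = len(toy.dropna())
rows_without_sparse = len(toy.drop(columns=sparse_cols).dropna())

print(rows_all, rows_without_sparse)
```

A large gap between the two counts would suggest that the sparse columns cost more observations than they are worth.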

`seat_type` and `recommendation` need to be encoded.  To do that, I'll use some code that I developed previously.
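
As a rough illustration of what the encoding step produces, the sketch below applies `pd.get_dummies` directly to a toy frame and drops one reference level per variable. The category labels (`Economy`, `yes`, `no`, and so on) and the choice of reference levels are assumptions, not values confirmed from the reviews data.

```python
import pandas as pd

# Toy frame: category labels are assumed for illustration
toy = pd.DataFrame({
    "seat_type": ["Economy", "Business Class", "Economy", "Premium Economy"],
    "recommendation": ["yes", "no", "no", "yes"],
})

# one indicator column per level; dtype=float keeps the output numeric
encoded = pd.get_dummies(toy, columns=["seat_type", "recommendation"], dtype=float)

# drop one reference level per variable to avoid perfectly collinear columns
reference_levels = ["seat_type_Economy", "recommendation_no"]
encoded = encoded.drop(columns=reference_levels)

print(encoded.columns.tolist())
```

The remaining coefficients are then interpreted relative to the dropped levels, which is the role `ref_levels` plays in the class that follows.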

In [None]:
class PrepareForModel:

    """
    PrepareForModel prepares a dataframe with some categorical variable content
    for use in scikit-learn, statsmodels, or other modelling packages.

    Arguments:
        df (pd.DataFrame): dataframe that contains various features, some categorical

    Attributes:
        ref_levels (list): list of reference variables that are dropped from the dataframe.
    """

    def __init__(self, df):

        # ensure that df is in the form of a pandas dataframe
        if not isinstance(df, pd.DataFrame):
            raise TypeError("the df argument should be a pandas DataFrame or a subclass of it")

        self.df = df.copy()
        self.df_orig = df.copy()
        self.ref_levels = []

    def make_dummy_df(self, dummy_vars, cat_to_drop = 0, drop_cat = True, add_intercept = True):
        """
        Builds and returns a pandas dataframe ready for statistical modelling with packages like statsmodels.
        Ensures that categorical variables are one hot encoded.

        Arguments:
            dummy_vars (list or dict): can be a list of column names with columns to be converted to
                                       dummy variables.  Alternatively can be a dict with keys as
                                       column names to be converted to dummies, and values as the
                                       category to be dropped.
            cat_to_drop (int): if dummy_vars is type dict then this value will be ignored.  Otherwise
                               it is the index of the category to drop in each variable specified in
                               dummy_vars
            drop_cat (bool): whether to drop one reference category from each variable that is being
                             converted to dummies, default True
            add_intercept (bool): whether to add an intercept column of 1s, default True

        Return:
            Pandas DataFrame with the selected variables one hot encoded and, if add_intercept is
            True, an intercept column

        Examples:
            >>> example_data = {"id": [1, 2, 3, 4],
            ...                 "price": [22., 21., 17., 35.],
            ...                 "colour": ['red', 'white', 'ruby', 'white'],
            ...                 "type": ["cab sauv", "pinot grigio", "pinot noir", "chardonnay"]}
            >>> example_data = pd.DataFrame(example_data)
            >>> example = PrepareForModel(example_data)
            >>> dummy_df = example.make_dummy_df(['colour', 'type'], 0)
        """

        # reset reference levels, if remaking then previous information would be there
        self.ref_levels = []

        # reset to original data.  Ensures that make_dummy_df can be rerun multiple times
        self.df = self.df_orig.copy()

        # check if dummy_vars is a dict or a list to determine how to drop categories
        if isinstance(dummy_vars, list):
            drop_by_index = True

            # make sure that cat_to_drop is within the index range of the selected variables
            for var in dummy_vars:
                if self.df[var].nunique() < (cat_to_drop + 1):
                    raise ValueError(
                        "For column {} the number of categories was lower than "
                        "the specified index ({}) to drop.".format(var, cat_to_drop)
                    )
        else:
            drop_by_index = False

        # loop through categorical columns in dummy_vars list/dict
        for var in dummy_vars:

            # convert given variable to dummy variables; dtype = float keeps the
            # indicators numeric (pandas 2.0 and later default to bool)
            current_dummy = pd.DataFrame(self.df[var])
            current_dummy = pd.get_dummies(current_dummy, prefix = var, dtype = float)


            # drop one level as specified by dummy_vars, if drop_cat is true
            if drop_cat:

                if drop_by_index:
                    # dummy_vars is a list: drop the level at index cat_to_drop
                    dummy_to_drop = current_dummy.columns[cat_to_drop]
                else:
                    # dummy_vars is a dict: drop the level named for this variable
                    # (format() also handles category labels that are not strings)
                    dummy_to_drop = "{}_{}".format(var, dummy_vars[var])

                current_dummy = current_dummy.drop([dummy_to_drop], axis = 1)

                # keep note of the dropped level in ref_levels
                self.ref_levels.append(dummy_to_drop)

            # concat with original dataframe
            self.df = pd.concat([self.df, current_dummy], axis=1)

        # drop pre-dummy categorical columns from original dataframe
        # (list() yields the column names for both a list and a dict)
        self.df = self.df.drop(list(dummy_vars), axis=1)

        # add an intercept column of 1s if requested
        if add_intercept:
            self.df['intercept'] = 1.0

        return self.df
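
Once `make_dummy_df` has produced a numeric frame, it could be passed to the `LogisticRegression` class imported at the top of the notebook. The following is only a sketch on made-up data: the column names follow the notebook's naming pattern, but the values, the choice of predictors, and the default scikit-learn settings (which include L2 regularization) are all assumptions rather than the analysis actually run here.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy, already-encoded data: names follow the notebook's pattern, values are made up
toy = pd.DataFrame({
    "value_for_money_rating": [1.0, 2.0, 1.0, 2.0, 4.0, 5.0, 4.0, 5.0],
    "seat_type_Business Class": [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0],
    "recommendation_yes": [0, 0, 0, 0, 1, 1, 1, 1],
})

# split predictors from the binary response
X = toy.drop(columns=["recommendation_yes"])
y = toy["recommendation_yes"]

# scikit-learn fits its own intercept by default, so no intercept column is added here
model = LogisticRegression()
model.fit(X, y)

# label each fitted coefficient with its column name
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs)
```

Because scikit-learn fits an intercept itself (`fit_intercept=True` by default), `add_intercept=False` may be the more natural setting when the output of `make_dummy_df` is destined for scikit-learn rather than statsmodels.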
