# Apprentice Chef, round two

I want to see whether I can improve on what I did when I first started with ML. We were given two assignments for this dataset:

1. Estimate the revenue brought in by a customer.

2. Classify whether a customer would be susceptible to cross-selling of a different service.

While in school I reached an R-squared of 0.80 for task 1 and an AUC score of 0.797 for task 2. I will admit that reaching 0.8 on these was the difference between a B and a C, while 0.9 was required for an A. Let's see if we can get that!

I'll be using this notebook for exploratory analysis and a final write-up, while the rest of the code will be in scripts.

## Exploratory analysis

I'll begin by importing the data and looking through it to see whether any interesting ideas pop out.

In [2]:
import pandas as pd                 # Data science essentials
import matplotlib.pyplot as plt     # Plotting essentials
import seaborn as sns               # More advanced plotting

from IPython.display import clear_output        # Clear output to keep things clean


In [3]:
dataset = pd.read_excel('Datasets/Apprentice_Chef_Dataset.xlsx')


First we need to make sure that the data has the correct types. Which columns are numerical and which are categorical?
For this I'll make a type dictionary.

In [4]:
type_dict = {'REVENUE':                         'float64',  # Revenue by client, numerical
             'CROSS_SELL_SUCCESS':              'object',   # Whether a client has purchased the other service, binary categorical
             'NAME':                            'object',   # Full name of client, string
             'EMAIL':                           'object',   # Email provided by client, string
             'FIRST_NAME':                      'object',   # First name of client, string
             'FAMILY_NAME':                     'object',   # Any other names of client, string
             'TOTAL_MEALS_ORDERED':             'int64',    # Number of meals ordered, integer
             'UNIQUE_MEALS_PURCH':              'int64',    # Number of unique meals ordered, integer
             'CONTACTS_W_CUSTOMER_SERVICE':     'int64',    # How many times the client contacted customer service, integer 
             'PRODUCT_CATEGORIES_VIEWED':       'int64',    # How many different meals the client has viewed, integer
             'AVG_TIME_PER_SITE_VISIT':         'float64',  # How long a client spends on the site per visit, float
             'MOBILE_NUMBER':                   'object',   # Whether a client has registered a mobile phone number, binary categorical
             'CANCELLATIONS_BEFORE_NOON':       'int64',    # How many times a client has cancelled their order before noon, integer
             'CANCELLATIONS_AFTER_NOON':        'int64',    # How many times a client has cancelled their order after noon, integer
             'TASTES_AND_PREFERENCES':          'object',   # Whether a client has specified their preferences, binary categorical
             'MOBILE_LOGINS':                   'int64',    # How many times the client has logged in from a mobile device, integer
             'PC_LOGINS':                       'int64',    # How many times the client has logged in from other devices, integer
             'WEEKLY_PLAN':                     'int64',    # Number of times the client has ordered the weekly plan, integer
             'EARLY_DELIVERIES':                'int64',    # Number of times the client has received an early delivery, integer
             'LATE_DELIVERIES':                 'int64',    # Number of times the client has received a late delivery, integer
             'PACKAGE_LOCKER':                  'object',   # If the client has a package room, binary categorical
             'REFRIGERATED_LOCKER':             'object',   # If the package room is refrigerated, binary categorical
             'FOLLOWED_RECOMMENDATIONS_PCT':    'int64',    # How often the client followed the meal recommendations, integer
             'AVG_PREP_VID_TIME':               'float64',  # How long the client watches prep videos on average, float
             'LARGEST_ORDER_SIZE':              'int64',    # Largest number of meals in one order, integer
             'MASTER_CLASSES_ATTENDED':         'int64',    # How many classes the client has attended, integer
             'MEDIAN_MEAL_RATING':              'int64',    # The median rating given by client, integer
             'AVG_CLICKS_PER_VISIT':            'float64',  # How many clicks per visit on average, float
             'TOTAL_PHOTOS_VIEWED':             'int64'     # How many photos the client has viewed, integer
             }

dataset = dataset.astype(type_dict)
print(dataset.info())                   # Checking that they are now the correct datatypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1946 entries, 0 to 1945
Data columns (total 29 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   REVENUE                       1946 non-null   float64
 1   CROSS_SELL_SUCCESS            1946 non-null   object 
 2   NAME                          1946 non-null   object 
 3   EMAIL                         1946 non-null   object 
 4   FIRST_NAME                    1946 non-null   object 
 5   FAMILY_NAME                   1899 non-null   object 
 6   TOTAL_MEALS_ORDERED           1946 non-null   int64  
 7   UNIQUE_MEALS_PURCH            1946 non-null   int64  
 8   CONTACTS_W_CUSTOMER_SERVICE   1946 non-null   int64  
 9   PRODUCT_CATEGORIES_VIEWED     1946 non-null   int64  
 10  AVG_TIME_PER_SITE_VISIT       1946 non-null   float64
 11  MOBILE_NUMBER                 1946 non-null   object 
 12  CANCELLATIONS_BEFORE_NOON     1946 non-null   int64  
 13  CAN

Now I want to go through the variables to see if there is anything we can do with them. These are the ones I want to do something with, and what I want to do:

* Name
    * I want to count how many parts the name has.
    * I want to remove all parts of a name inside parentheses, but I will first flag that the name has parentheses.

* Email
    * Because the email addresses are in the format 'Name' + domain, I will only look at the domain, since the name is its own variable.

    * For the domains I will split them into the groups given in the case description: professional, personal, and junk.

* Weekly Plan
    * I'll see if I can engineer a feature for the discounts provided by ordering a weekly set, though I think the effect might already show up on its own.

* Revenue, Total Meals, Unique Meals
    * For the classification task I will try to use the provided meal and beverages tables to see if I can get any insight there. Perhaps a client has trends that make them more likely to go for the cross-sell.
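
The name and email ideas above could be sketched roughly as follows. This is only a sketch under assumptions: the helper names (`strip_parentheses`, `count_name_parts`, `email_domain`, `domain_group`) are hypothetical, and the two domain sets are placeholders that should be swapped for the lists in the case description. Treating every unlisted domain as professional is an assumption as well.

```python
import re

# Placeholder domain groups: NOT taken from the case description, swap in the real lists.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com"}
JUNK_DOMAINS = {"aol.com", "hotmail.com"}


def strip_parentheses(name):
    """Remove every parenthesised part of a name, plus the whitespace before it."""
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def count_name_parts(name):
    """Count the whitespace-separated parts of a name."""
    return len(name.split())


def email_domain(email):
    """Return the lower-cased text after the last '@'."""
    return email.rsplit("@", 1)[-1].lower()


def domain_group(domain):
    """Map a domain to 'personal', 'junk', or (by assumption) 'professional'."""
    if domain in PERSONAL_DOMAINS:
        return "personal"
    if domain in JUNK_DOMAINS:
        return "junk"
    return "professional"


name = "Aegon Targaryen (son of Rhaegar)"
clean_name = strip_parentheses(name)
print(clean_name, count_name_parts(clean_name))
print(domain_group(email_domain("aegon.targaryen@gmail.com")))
```

On the dataset these would be applied column-wise, for example `dataset['NAME'].map(strip_parentheses)`. The parentheses flag has to be computed before the parenthesised parts are stripped, otherwise the information is lost.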


In [5]:
import re
def has_character(string, character):
    """ A check to see if the 'character' is in the 'string'.
    
    :param string: The string to check.
    :param character: The character to match.

    :return: 1 if the character is found, 0 otherwise.

    >>> has_character("Aegon Targaryen (son of Rhaegar)", "(")
    1

    """

    return int(character in string)


In [7]:
## Flagging names with parentheses

dataset['parentheses'] = [has_character(name, "(") for name in dataset['NAME']]

In [9]:
dataset.to_csv("Datasets/test.csv", index=False)