# <center> Step 1.1 Attempt to Import and Transform Data with PyCaret </center> #

In this notebook, I attempted to import, transform, and clean data using the setup() function in PyCaret. My conclusion from this experiment is that PyCaret may not be a "one-stop shop" for data cleaning, imputation, and creating calculated fields. Because each of my target variables is a calculated field, I was unable to add them to the raw dataset using PyCaret's methods.

Other notebooks in this repository will illustrate how PyCaret is a very capable low-code modeling tool, once you have a processed, "clean" dataset.

More information about the PyCaret library is available here: www.pycaret.org

In [1]:
#Import packages
import pandas as pd
import re
import glob
import datetime
import numpy as np
from pycaret.classification import *

### <center> Data Import </center> ###

In [2]:
#Import groups of customer data
appended_data = []
for file in glob.glob('Cust*'):
    data = pd.read_csv(file)
    appended_data.append(data)
cust_df = pd.concat(appended_data)
cust_df.head()

Unnamed: 0,owner_no,First Order Date,First Contribution Date,postal_code,state_desc,geo_area_desc,OP Prelim Capacity,LTV Tkt Value,Lifetime Giving
0,2280536,2015-01-01 00:00:00,,8807,New Jersey,2-Greater Philadelphia(70 mi.),,258.0,
1,2280550,2015-01-01 00:00:00,,18974,Pennsylvania,1-Philadelphia City (20 mi.),,153.0,
2,2280469,2015-01-01 00:00:00,,190068506,Pennsylvania,1-Philadelphia City (20 mi.),3.0,210.0,
3,2269456,2015-01-01 00:00:00,,19403,Pennsylvania,1-Philadelphia City (20 mi.),4.0,281.0,
4,2280674,2015-01-02 00:00:00,,28210,North Carolina,7-USA Balance,3.0,258.0,


In [3]:
#Import groups of order data
appended_data = []
for file in glob.glob('Ord*'):
    data = pd.read_csv(file)
    appended_data.append(data)
order_df = pd.concat(appended_data)
order_df.head()

Unnamed: 0,owner_no,order_dt,Count of order_no,channel_desc,MOS_desc,delivery_desc,tot_ticket_paid_amt,tot_contribution_paid_amt,facility_desc,prod_season_desc,num_seats_pur
0,18251,2014-04-24 00:00:00,1,Phone,Ticketing,OP - US Mail,$1044,$6000,Academy of Music,Don Carlo,2
1,18251,2014-04-24 00:00:00,1,Phone,Ticketing,OP - US Mail,$1044,$6000,Academy of Music,Oscar,2
2,18251,2014-04-24 00:00:00,1,Phone,Ticketing,OP - US Mail,$1044,$6000,Academy of Music,The Barber of Seville,2
3,18251,2014-04-24 00:00:00,1,Phone,Ticketing,OP - US Mail,$1044,$6000,General Admission,40th Anniversary Voucher,2
4,18251,2014-04-24 00:00:00,1,Phone,Ticketing,OP - US Mail,$1044,$6000,Perelman,Ariadne auf Naxos,2


In [4]:
df = pd.merge(order_df, cust_df, how='inner', on = 'owner_no')
df.shape

(82407, 19)

In [5]:
df.nunique()

owner_no                     38522
order_dt                      2160
Count of order_no               23
channel_desc                    20
MOS_desc                        21
delivery_desc                   17
tot_ticket_paid_amt           1286
tot_contribution_paid_amt      131
facility_desc                   36
prod_season_desc                86
num_seats_pur                   46
First Order Date              2931
First Contribution Date       1397
postal_code                   8483
state_desc                      59
geo_area_desc                    7
OP Prelim Capacity              12
LTV Tkt Value                 2560
Lifetime Giving                584
dtype: int64

In [6]:
df = df.drop([
    'prod_season_desc',
    'postal_code',
    'state_desc',
    'Lifetime Giving'
], 
    axis=1
)

In [7]:
df.dtypes

owner_no                       int64
order_dt                      object
Count of order_no              int64
channel_desc                  object
MOS_desc                      object
delivery_desc                 object
tot_ticket_paid_amt           object
tot_contribution_paid_amt     object
facility_desc                 object
num_seats_pur                  int64
First Order Date              object
First Contribution Date       object
geo_area_desc                 object
OP Prelim Capacity            object
LTV Tkt Value                float64
dtype: object
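
The dtypes above show that the dollar amounts and the dates are stored as object (string) columns. Below is a minimal sketch of converting them with pandas before any modeling. It runs on a small made-up frame that borrows three column names from this dataset. The sample values and the name `sample` are assumptions for illustration, not the cleaning steps actually used in this project.

```python
import pandas as pd

# Toy rows that mimic the raw columns; the values are made up for illustration.
sample = pd.DataFrame({
    'order_dt': ['2014-04-24 00:00:00', '2015-01-02 00:00:00'],
    'tot_ticket_paid_amt': ['$1044', '$258'],
    'tot_contribution_paid_amt': ['$6000', None],
})

# Strip the currency symbol and any thousands separators, then convert to numbers.
for col in ['tot_ticket_paid_amt', 'tot_contribution_paid_amt']:
    sample[col] = pd.to_numeric(
        sample[col].str.replace(r'[$,]', '', regex=True),
        errors='coerce',  # anything that still is not a number becomes NaN
    )

# Parse the timestamp strings; unparseable values become NaT.
sample['order_dt'] = pd.to_datetime(sample['order_dt'], errors='coerce')

print(sample.dtypes)
```

Using `errors='coerce'` keeps the conversion from failing on stray values, at the cost of silently turning them into missing data, so it is worth checking the null counts afterwards.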

In [8]:
df.columns

Index(['owner_no', 'order_dt', 'Count of order_no', 'channel_desc', 'MOS_desc',
       'delivery_desc', 'tot_ticket_paid_amt', 'tot_contribution_paid_amt',
       'facility_desc', 'num_seats_pur', 'First Order Date',
       'First Contribution Date', 'geo_area_desc', 'OP Prelim Capacity',
       'LTV Tkt Value'],
      dtype='object')

In [9]:
df.geo_area_desc.value_counts()

1-Philadelphia City (20 mi.)      58979
2-Greater Philadelphia(70 mi.)    11998
7-USA Balance                      4480
3-New York City (20 mi.)           3123
5-NEC (140 mi. from Philly)        1768
4-DC (20 mi.)                      1501
6-NEC (210 mi. from Philly)         558
Name: geo_area_desc, dtype: int64

In [10]:
df['OP Prelim Capacity'].value_counts()

4     14398
3     13246
U     12557
2      7838
5      6932
1      3625
6      2313
7       288
8       224
X        42
9        19
10        7
Name: OP Prelim Capacity, dtype: int64

In [11]:
#Attempt to set up data for PyCaret
df_1 = setup(
    data=df, 
    target = 'first_cont_order',
    categorical_features = [
        'MOS_desc',
        'channel_desc',
        'delivery_desc',
    ],
    ordinal_features = {
        'geo_area_desc': [
            '1-Philadelphia City (20 mi.)',
            '2-Greater Philadelphia(70 mi.)',
            '3-New York City (20 mi.)',
            '4-DC (20 mi.)',
            '5-NEC (140 mi. from Philly)',
            '6-NEC (210 mi. from Philly)',
            '7-USA Balance'
        ],
        'OP Prelim Capacity': ['U','X',1,2,3,4,5,6,7,8,9,10]
    },
    high_cardinality_features = [
        'order_dt',
        'tot_ticket_paid_amt',
        'tot_contribution_paid_amt',
        'First Order Date',
        'First Contribution Date',
        'LTV Tkt Value',
    ],
    date_features = [
        'First Order Date',
        'order_dt',
        'First Contribution Date'
    ],
    numeric_features = [
        'tot_ticket_paid_amt',
        'tot_contribution_paid_amt',
        'num_seats_pur',
        'first_cont_order',
        'first_cont_after',
        'LTV Tkt Value'
        
    ],
    combine_rare_levels = True,
    remove_multicollinearity = True,
    profile = True
)

SystemExit: (Value Error): Target parameter doesnt exist in the data provided.

The setup() call above throws an error because the raw dataset has no target column: every target for this project is a calculated feature, so it has to be derived before setup() can run.
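
As a rough illustration of what deriving a target beforehand could look like, here is a minimal pandas sketch on a made-up frame. The rule used for `first_cont_order` (flagging the order placed on the customer's first contribution date) is a hypothetical definition chosen only for this example; it is an assumption, not the actual target definition used in this project. The name `toy` and its values are also invented.

```python
import pandas as pd

# Toy rows; the values are made up, the column names follow the merged dataset.
toy = pd.DataFrame({
    'owner_no': [1, 1, 2],
    'order_dt': ['2015-01-01', '2015-03-01', '2015-02-10'],
    'First Contribution Date': ['2015-03-01', '2015-03-01', None],
})

# Dates arrive as strings, so parse them first; missing values become NaT.
for col in ['order_dt', 'First Contribution Date']:
    toy[col] = pd.to_datetime(toy[col], errors='coerce')

# HYPOTHETICAL target definition (an assumption, not the project's real rule):
# 1 when the order was placed on the customer's first contribution date.
# Comparisons against NaT evaluate to False, so non-donors get 0.
toy['first_cont_order'] = (
    toy['order_dt'] == toy['First Contribution Date']
).astype(int)

print(toy[['owner_no', 'order_dt', 'first_cont_order']])
```

Once a column like this exists in the DataFrame, passing its name as `target` to setup() should no longer trigger the missing-target error, although the other parameters may still need adjusting.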