## Data preparation

Before the data can be sent through the KNN model it needs to be tidied. 
Here I will: 

* Remove nulls;
* Create dummies as required;
* Standardise values so that large-scale features do not dominate the KNN distance calculation;
* Further reduce the data set to only relevant classifiers. 



In [2]:
# Import libraries

import pandas as pd

from sklearn.preprocessing import StandardScaler
from sqlalchemy import create_engine

In [3]:
# Create connections to data and read data to pandas dataframe
engine = create_engine('sqlite:///../data/customers_with_behaviours.db')
df_initial = pd.read_sql_table('customers_with_behaviours', engine)
df_initial.head()

Unnamed: 0,UNITID,INSTNM,IALIAS,CITY,STABBR,FIPS,OBEREG,GENTELE,EIN,DUNS,...,AMOUNT_OF_INTERACTIONS_W_SALES_RNG,AMOUNT_OF_CALLS,AMOUNT_OF_CALLS_RNG,AMOUNT_OF_MESSAGES,AMOUNT_OF_MESSAGES_RNG,ENGAGED_WITH_MESSAGING,REACHED_NOT_ENGAGED_WITH_MESSAGING,ATTENDED_WEBINARS,WEBINAR_ATTENDANCE_SIZE,WEBINAR_ATTENDANCE_SIZE_RNG
0,100654,Alabama A & M University,AAMU,Normal,AL,1,5,2563725000,636001109,197216455,...,[51 - 100],18,[1 - 50],36,[1 - 50],1,1,1,15,[11 - 15]
1,100663,University of Alabama at Birmingham,,Birmingham,AL,1,5,2059344011,636005396,63690705,...,[301 - 400],292,[201 - 300],37,[1 - 50],0,1,0,0,[0]
2,100690,Amridge University,Southern Christian University Regions University,Montgomery,AL,1,5,33438738777550,237034324,126307792,...,[401 - 500],66,[51 - 100],371,[301 - 400],1,1,1,17,[16 - 20]
3,100706,University of Alabama in Huntsville,UAH University of Alabama Huntsville,Huntsville,AL,1,5,2568246120,630520830,949687123,...,[101 - 200],117,[101 - 200],74,[51 - 100],1,1,1,20,[16 - 20]
4,100724,Alabama State University,,Montgomery,AL,1,5,3342294100,636001101,40672685,...,[0],0,[0],0,[0],0,0,0,0,[0]


### Introducing additional data for features and/or classifiers to the data set.

In [4]:
df_ic2019 = pd.read_csv('../data/ic2019.csv')
df_adm2019 = pd.read_csv('../data/adm2019.csv')

### Reduce the files to only the required columns. 

In [5]:
df_ic2019_cls_col = ['UNITID', 'PEO1ISTR', 'PEO2ISTR', 'PEO3ISTR', 'PEO4ISTR', 'PEO5ISTR', 'PEO6ISTR', 'CNTLAFFI', 'PUBPRIME',
                     'PUBSECON', 'RELAFFIL', 'LEVEL1', 'LEVEL2', 'LEVEL3', 'LEVEL4', 'LEVEL5', 'LEVEL6', 'LEVEL7', 
                     'LEVEL8', 'LEVEL12', 'LEVEL17', 'LEVEL18', 'LEVEL19', 'CALSYS', 'FT_UG', 'FT_FTUG', 'FTGDNIDP', 
                     'PT_UG', 'PT_FTUG', 'PTGDNIDP', 'DOCPP', 'DOCPPSP', 'OPENADMP', 'CREDITS1', 'CREDITS2', 'CREDITS3', 
                     'CREDITS4', 'STUSRV2', 'STUSRV3', 'STUSRV4', 'STUSRV8', 'LIBRES1', 'LIBRES2', 'LIBRES3', 'LIBRES4', 
                     'LIBRES5', 'TUITPL', 'TUITPL1', 'TUITPL2', 'TUITPL3', 'TUITPL4', 'DSTNUGC', 'DSTNUGP', 'DSTNUGN', 
                     'DSTNGC', 'DSTNGP', 'DSTNGN', 'DISTCRS', 'DISTPGS', 'DSTNCED1', 'DSTNCED2', 'DSTNCED3', 'DISTNCED', 
                     'DISAB', 'ROOM', 'ROOMCAP', 'BOARD']

df_cls_ic2019 = df_ic2019[df_ic2019_cls_col]

df_adm2019_cls_cols = ['UNITID', 'APPLCN', 'APPLCNM', 'APPLCNW', 'ADMSSN', 'ADMSSNM', 'ADMSSNW', 'ENRLT', 'ENRLM',
                       'ENRLW', 'ENRLFT', 'ENRLFTM', 'ENRLFTW', 'ENRLPT', 'ENRLPTM', 'ENRLPTW', 'SATNUM', 'SATPCT', 
                       'ACTNUM', 'ACTPCT', 'SATVR25', 'SATVR75', 'SATMT25', 'SATMT75']

df_cls_adm2019 = df_adm2019[df_adm2019_cls_cols]
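
The same reduction could also be done while reading the file. A minimal sketch, assuming the CSV headers match the names in the lists above; the inline CSV text and the names `csv_text`, `wanted_columns` and `df_small` are made up for illustration and are not the real `adm2019.csv`:

```python
import io
import pandas as pd

# Made-up stand-in for a wide CSV file; the real files have many more columns.
csv_text = "UNITID,APPLCN,ADMSSN,EXTRA\n1,100,80,x\n2,200,150,y\n"

wanted_columns = ['UNITID', 'APPLCN', 'ADMSSN']

# usecols tells pandas to parse only the listed columns.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=wanted_columns)
print(df_small.columns.tolist())
```

Reading only the needed columns keeps memory use down, at the cost of not having the other columns available for later exploration.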

### Join the data sets together

In [6]:
df_combined_features = df_cls_adm2019.merge(df_cls_ic2019, on='UNITID') #, how='left')
df_all_data = df_initial.merge(df_combined_features, on='UNITID') #, how='left')
print(f"\nThe dataframe df_all_data is shaped with {df_all_data.shape[1]} columns and {df_all_data.shape[0]} rows\n\n")
print(f"Here is the head of the dataframe ... \n {df_all_data.head()}")


The dataframe df_all_data is shaped with 158 columns and 2010 rows


Here is the head of the dataframe ... 
    UNITID                               INSTNM  \
0  100654             Alabama A & M University   
1  100663  University of Alabama at Birmingham   
2  100706  University of Alabama in Huntsville   
3  100724             Alabama State University   
4  100751            The University of Alabama   

                                  IALIAS        CITY STABBR  FIPS  OBEREG  \
0                                   AAMU      Normal     AL     1       5   
1                                         Birmingham     AL     1       5   
2  UAH  University of Alabama Huntsville  Huntsville     AL     1       5   
3                                         Montgomery     AL     1       5   
4                                         Tuscaloosa     AL     1       5   

      GENTELE        EIN       DUNS  ...  DISTCRS  DISTPGS  DSTNCED1  \
0  2563725000  636001109  197216455  ...        1     
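
Because `merge` defaults to an inner join, any `UNITID` missing from either side is dropped; Amridge University (`100690`) was in the head of `df_initial` but no longer appears in the head above. One way to see which rows would be lost is a left join with `indicator=True`. A minimal sketch on made-up frames (the names `left`, `right`, `checked` and `unmatched` are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'UNITID': [1, 2, 3], 'INSTNM': ['A', 'B', 'C']})
right = pd.DataFrame({'UNITID': [1, 3], 'APPLCN': [100, 300]})

# how='left' keeps every row of `left`; indicator=True adds a `_merge` column
# that marks each row as 'both' or 'left_only'.
checked = left.merge(right, on='UNITID', how='left', indicator=True)

# Rows flagged 'left_only' had no partner in `right`.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'UNITID'].tolist()
print(f"Rows without a match: {unmatched}")
```

Whether dropping unmatched institutions is acceptable depends on the analysis; the check only makes the loss visible.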

#### Here we will replace the `-1` and `-2` placeholders in the data with `0`.

In [7]:
df_all_data.replace([-2, '-2', -1, '-1'], 0, inplace=True)
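
Replacing the placeholders with `0` treats them as a real zero. A hedged alternative, assuming that distinction matters for a given column, is to map them to `NaN` so they pass through the same imputation as other missing values. A sketch on a made-up frame (`toy` and `toy_nan` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'ROOM': [1, -2, 0], 'BOARD': ['-1', '2', '3']})

# Same list of placeholders as in the notebook, but with NaN as the replacement.
toy_nan = toy.replace([-2, '-2', -1, '-1'], np.nan)

# Each column now reports one missing value instead of a silent zero.
print(toy_nan.isnull().sum())
```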

#### Remove unnecessary uniqueness columns from the data set, and check for any `NaN` values. 

In [8]:
unnecessary_uniqueness_columns = ['INSTNM', 'IALIAS', 'FIPS', 'OBEREG', 'GENTELE', 'EIN', 'DUNS', 'OPEID', 'CNGDSTCD']
df_all_data.drop(unnecessary_uniqueness_columns, axis=1, inplace=True)

In [9]:
# Total number of null cells across the whole dataframe
null_count = df_all_data.isnull().sum().sum()
col_count = df_all_data.shape[1]
row_count = df_all_data.shape[0]

print(f"The size of the data is {col_count*row_count} with {null_count} null items")

The size of the data is 299490 with 8465 null items


In [10]:
df_data_null = df_all_data.isnull().sum().to_frame('nulls')
df_data_null[df_data_null['nulls'] > 0]

Unnamed: 0,nulls
ADMSSN,17
ADMSSNM,93
ADMSSNW,103
ENRLT,22
ENRLM,114
ENRLW,110
ENRLFT,38
ENRLFTM,147
ENRLFTW,126
ENRLPT,522
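
Raw counts are easier to judge as a share of rows: `ENRLPT`, for instance, has 522 nulls in 2010 rows, roughly 26%. A small sketch on made-up data (`toy` and `null_share` are hypothetical names) that turns the counts into percentages:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ENRLT': [10, np.nan, 30, 40],
    'ENRLPT': [np.nan, np.nan, 5, 6],
})

# isnull().mean() is the fraction of missing rows per column.
null_share = (toy.isnull().mean() * 100).to_frame('null_pct')
print(null_share[null_share['null_pct'] > 0])
```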


In [11]:
def impute_nulls_with_column_median(list_of_columns, dataframe):

    """
    The function takes a list of applicable columns and a corresponding dataframe. 
    The median is calculated per column (truncated to an integer) and used to replace the NaNs in that column.

    INPUT: 

    list_of_columns: a list of columns with null values
    dataframe: the dataframe relating to the specified list.

    OUTPUT:

    None
        The dataframe values are replaced in place, so there is no need to return the frame
    """

    for col in list_of_columns:
        # int() truncates the median so imputed values stay whole numbers
        impute_median = int(dataframe[col].median())
        # Assign back instead of calling fillna(inplace=True) on a column selection,
        # which may not update the original frame in recent pandas versions
        dataframe[col] = dataframe[col].fillna(impute_median)
    
    df_dataframe_nulls = dataframe.isnull().sum().to_frame('Nulls')
    list_null_cols_vals = df_dataframe_nulls[df_dataframe_nulls['Nulls'] > 0]
    
    print(f'\nImputing of values complete, you can find the df header below: \n\n {dataframe.head()}')
    print("\nLet's look for the amount of nulls remaining in the dataframe: \n\n")

    # Iterate over (column name, null count) pairs; DataFrame.items() would
    # instead yield the whole 'Nulls' column as a single item
    for col, null_count in list_null_cols_vals['Nulls'].items():
        print(f'{col}: {null_count} \n')

In [12]:
columns_with_null_data = list(df_data_null[df_data_null['nulls'] > 0].index)
impute_nulls_with_column_median(columns_with_null_data, df_all_data)


Imputing of values complete, you can find the df header below: 

    UNITID        CITY STABBR  OPEFLAG  SECTOR  ICLEVEL  CONTROL  HLOFFER  \
0  100654      Normal     AL        1       1        1        1        9   
1  100663  Birmingham     AL        1       1        1        1        9   
2  100706  Huntsville     AL        1       1        1        1        9   
3  100724  Montgomery     AL        1       1        1        1        9   
4  100751  Tuscaloosa     AL        1       1        1        1        9   

   UGOFFER  GROFFER  ...  DISTCRS  DISTPGS  DSTNCED1  DSTNCED2  DSTNCED3  \
0        1        1  ...        1        1         1         1         0   
1        1        1  ...        1        1         1         1         0   
2        1        1  ...        1        1         1         1         0   
3        1        1  ...        1        1         1         1         0   
4        1        1  ...        1        1         1         1         0   

   DISTNCED  DISAB 
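
To make the effect of median imputation concrete, here is a toy example on a made-up single-column frame (`toy` and `median_value` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'ADMSSN': [100.0, np.nan, 300.0, 500.0]})

# The median skips NaN by default, so it is taken over 100, 300 and 500.
median_value = toy['ADMSSN'].median()

# Fill the gap with that median and write the result back to the column.
toy['ADMSSN'] = toy['ADMSSN'].fillna(median_value)
print(toy['ADMSSN'].tolist())
```

The median is less sensitive to a few very large institutions than the mean would be, which is a common reason to prefer it for skewed counts.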

#### Now that we've improved the data quality by imputing the `NaN` values, we need to standardise the values for the KNN model. 

In [18]:
categorical_columns = df_all_data.select_dtypes(include=['object'])
list_of_categorical_columns = list(categorical_columns.columns)
list_of_categorical_columns

['CITY',
 'STABBR',
 'ACT',
 'COUNTYNM',
 'AMOUNT_OF_LICENSES_RNG',
 'TERM_OF_LICENSE_RNG',
 'AMOUNT_OF_INTERACTIONS_W_SALES_RNG',
 'AMOUNT_OF_CALLS_RNG',
 'AMOUNT_OF_MESSAGES_RNG',
 'WEBINAR_ATTENDANCE_SIZE_RNG',
 'ROOMCAP']
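
Object columns like these have to be converted to numbers before a scaler or a KNN model can use them. One common option, and presumably what "create dummies" in the plan refers to, is one-hot encoding with `pd.get_dummies`. A sketch on made-up data that borrows two column names from the list (`toy` and `toy_dummies` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    'STABBR': ['AL', 'AK', 'AL'],
    'AMOUNT_OF_CALLS_RNG': ['[1 - 50]', '[0]', '[1 - 50]'],
    'AMOUNT_OF_CALLS': [18, 0, 36],
})

# Each distinct value in the listed columns becomes its own indicator column;
# columns not listed (here AMOUNT_OF_CALLS) pass through unchanged.
toy_dummies = pd.get_dummies(toy, columns=['STABBR', 'AMOUNT_OF_CALLS_RNG'])
print(toy_dummies.columns.tolist())
```

Columns with many distinct values, such as `CITY`, would produce one indicator column per value, so they may be better dropped or grouped before encoding.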

In [14]:
df_all_scaled_data = df_all_data.copy()
scaler = StandardScaler()
scaler.fit(df_all_scaled_data)

ValueError: could not convert string to float: 'Normal'
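
The error arises because `StandardScaler` expects numeric input, while `df_all_scaled_data` still contains string columns such as `CITY`. A minimal sketch, on a made-up frame, of fitting the scaler on the numeric columns only; whether the notebook goes on to take this route or encodes the strings first is not shown here, and the names `toy`, `numeric_columns` and `toy_scaled` are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    'CITY': ['Normal', 'Birmingham', 'Huntsville'],
    'AMOUNT_OF_CALLS': [18.0, 292.0, 117.0],
    'AMOUNT_OF_MESSAGES': [36.0, 37.0, 74.0],
})

# Pick out the numeric columns so the scaler never sees strings.
numeric_columns = toy.select_dtypes(include='number').columns

# Scale a copy so the original values stay available.
toy_scaled = toy.copy()
toy_scaled[numeric_columns] = StandardScaler().fit_transform(toy[numeric_columns])

# After scaling, each numeric column has a mean of approximately zero.
print(toy_scaled[numeric_columns].mean().round(6))
```

The string columns are left untouched in `toy_scaled`, so they would still need encoding or dropping before the frame is passed to a KNN model.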