# Assignment 8 Model Test

## Discussion and preprocessing plan
A high sparsity exists, with ~0.5% of all transactions being fraud transactions.

#### Time
Going to need to convert date to seperate year-month-weekday columns. Will drop "unix_time" column since it contains highly duplicate data as the transaction date and time column. Granularity of time of day could be explored, but for now a simple day-month-year will be the first plan of attack.

#### Customer Information

Customer first and last name can be dropped in favor of the cc_num column. Customer's date of birth will be converted to simply how many years old they are. Their employment will be kept due to this providing valuable information of potential fraud transactions. The sex field is potentially ommitable, so if increased model performance is required removing sex will be among the first strategies. CC_num is a database index and does not provide true value to the analysis, therefore it will be dropped as well.

**EDIT**
After some experimentation, I found customer CC number to not be ordinal and creating categorically encoding it is unfeasible. I believe the name to be valuable and which customer it is coming from, but worried it will introduce too much sparsity when encoded. Future models could incorporate these data points by introducing some form of dimensionality reduction.

#### Merchant information
Merchant and category of purchase will need to be one-hot encoded. The amount of the transaction is valuable and will be scaled via normalization. Transaction number can be dropped as well since it is not providing any additional value.


#### Geographic Information

Lat-Lon combinations of both merchant and customer will not be used since not much valuable insight was obtained during data exploration. Customer street will also be dropped in favor of the city and city information as geographic information for the customer since I suspect it is too high detail and will be represented to some degree by the city column. Zip will also be dropped in favor of city and state.


City population will be kept also. This could potentially have duplicate information stored in it since city itself is also a field. These two fields might be worth looking into further for processing.


## Encoding Scheme

Year and month will naturally be ordinally encoded. Week day however should be one-hot encoded since it is categorical over ordinal. Employment, merchant, and category will be one-hot encoded. The remainder of the numerical fields will be scaled using standardization.

**EDIT**
 After experimenting, I found that concatenating my encoded dataframes with large numbers of columns produced by encoding some fields crashes the kernel. Merchant, city, job. I decided to only one-hot encode weekday and sex. City population will be correlated to the city and state, so this may indirectly capture geographical information. Weekday and sex are small enough to include in the final models. Some form of dimensionality reduction are required to include more information from the remaining columsn

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import date, datetime, timedelta
from scipy.fft import fft, ifft, fftfreq
import matplotlib.pyplot as plt

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [2]:
fraud_dataset = pd.read_csv("transactions.csv")

### Drop Unwanted Columns

In [3]:
fraud_dataset_copy = fraud_dataset.copy()

### Seperate year-month-weekday and delete "trans_date_trans_time"

In [4]:
date_format = '%Y-%m-%d %H:%M:%S'
year_column = [datetime.strptime(row_date, date_format).year for row_date in fraud_dataset_copy.trans_date_trans_time]
month_column = [datetime.strptime(row_date, date_format).month for row_date in fraud_dataset_copy.trans_date_trans_time]
weekday_column = [(datetime.strptime(row_date, date_format).toordinal()%7 + 1) for row_date in fraud_dataset_copy.trans_date_trans_time]

input_data = {"year": year_column,
              "month": month_column,
              "weekday": weekday_column}

time_encoded_df = pd.DataFrame(data=input_data)

### Encoding Processing

In [5]:
# Encode Weekday Data
weekday_encoder = OneHotEncoder()
weekday_encoder.fit(time_encoded_df["weekday"].to_numpy().reshape(-1, 1))
weekday_encoded_values = weekday_encoder.fit_transform(time_encoded_df[['weekday']]).toarray()

weekday_encoded_df = pd.DataFrame(weekday_encoded_values, columns=weekday_encoder.get_feature_names())
weekday_encoded_df.head()

Unnamed: 0,x0_1,x0_2,x0_3,x0_4,x0_5,x0_6,x0_7
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [6]:
time_encoded_df = pd.concat([time_encoded_df, weekday_encoded_df], axis=1)
time_encoded_df = time_encoded_df.drop(["weekday"], axis=1)
time_encoded_df.head()

Unnamed: 0,year,month,x0_1,x0_2,x0_3,x0_4,x0_5,x0_6,x0_7
0,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [7]:
fraud_dataset_copy = fraud_dataset_copy.drop(["trans_date_trans_time"], axis=1)
fraud_dataset_copy.head()

Unnamed: 0.1,Unnamed: 0,cc_num,merchant,category,amt,first,last,sex,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,1,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,3,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,4,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


In [8]:
# Encode Merchant
# merch_encoder = OneHotEncoder()
# merch_encoder.fit(fraud_dataset_copy["merchant"].to_numpy().reshape(-1, 1))
# encoded_merch_values = merch_encoder.fit_transform(fraud_dataset_copy[['merchant']]).toarray()

# merch_encoded_df = pd.DataFrame(encoded_merch_values, columns=merch_encoder.get_feature_names())
# merch_encoded_df.head()
# print(len(merch_encoded_df.columns))

In [9]:
# Encode Category
# category_encoder = OneHotEncoder()
# category_encoder.fit(fraud_dataset_copy["merchant"].to_numpy().reshape(-1, 1))
# encoded_category_values = category_encoder.fit_transform(fraud_dataset_copy[['merchant']]).toarray()

# category_encoded_df = pd.DataFrame(encoded_category_values, columns=category_encoder.get_feature_names())
# category_encoded_df.head()
# print(len(category_encoded_df.columns))

In [10]:
# Encode sex
sex_encoder = OneHotEncoder()
sex_encoder.fit(fraud_dataset_copy["sex"].to_numpy().reshape(-1, 1))
encoded_sex_values = sex_encoder.fit_transform(fraud_dataset_copy[['sex']]).toarray()

sex_encoded_df = pd.DataFrame(encoded_sex_values, columns=sex_encoder.get_feature_names())
sex_encoded_df.head()
print(len(sex_encoded_df.columns))

2


In [11]:
# Encode state
# state_encoder = OneHotEncoder()
# state_encoder.fit(fraud_dataset_copy["state"].to_numpy().reshape(-1, 1))
# encoded_state_values = state_encoder.fit_transform(fraud_dataset_copy[['state']]).toarray()

# state_encoded_df = pd.DataFrame(encoded_state_values, columns=state_encoder.get_feature_names())
# state_encoded_df.head()

In [12]:
# Encode city
# city_encoder = OneHotEncoder()
# city_encoder.fit(fraud_dataset_copy["city"].to_numpy().reshape(-1, 1))
# encoded_city_values = city_encoder.fit_transform(fraud_dataset_copy[['city']]).toarray()

# city_encoded_df = pd.DataFrame(encoded_city_values, columns=city_encoder.get_feature_names())
# city_encoded_df.head()

In [13]:
fraud_dataset_copy.head()

Unnamed: 0.1,Unnamed: 0,cc_num,merchant,category,amt,first,last,sex,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,1,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,3,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,4,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


In [14]:
# Encode Employment
# employment_encoder = OneHotEncoder()
# employment_encoder.fit(fraud_dataset_copy["job"].to_numpy().reshape(-1, 1))
# encoded_employment_values = employment_encoder.fit_transform(fraud_dataset_copy[['job']]).toarray()
# print(employment_encoder.get_feature_names())

In [15]:
# print(len(employment_encoder.get_feature_names()))

In [16]:
# encoded_employment_df = pd.DataFrame(encoded_employment_values, columns=employment_encoder.get_feature_names())
# encoded_employment_df.head()

### Recreate 

In [17]:
fraud_dataset_copy = fraud_dataset.copy()

drop_columns = ["Unnamed: 0", "cc_num", "trans_date_trans_time"]
fraud_dataset_copy = fraud_dataset_copy.drop(drop_columns, axis=1)
fraud_dataset_copy.head()

Unnamed: 0,merchant,category,amt,first,last,sex,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,NC,28654,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,WA,99160,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,ID,83252,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,MT,59632,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,VA,24433,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


In [18]:
drop_columns = ["first", "last", "zip", "lat", "long", "merch_lat", "merch_long", "trans_num", "unix_time", "street"]
fraud_dataset_copy = fraud_dataset_copy.drop(drop_columns, axis=1)

In [19]:
# Drop modified columns from dataset
fraud_dataset_copy = fraud_dataset_copy.drop(["category", "sex", "merchant", "state", "city", "job", "dob"], axis=1)

In [20]:
total_df = pd.concat([time_encoded_df, fraud_dataset_copy], axis=1)
total_df.head()

Unnamed: 0,year,month,x0_1,x0_2,x0_3,x0_4,x0_5,x0_6,x0_7,amt,city_pop,is_fraud
0,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.97,3495,0
1,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,107.23,149,0
2,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,220.11,4154,0
3,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,45.0,1939,0
4,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,41.96,99,0


In [21]:
total_df2 = pd.concat([total_df, sex_encoded_df], axis=1)
total_df2.head()

Unnamed: 0,year,month,x0_1,x0_2,x0_3,x0_4,x0_5,x0_6,x0_7,amt,city_pop,is_fraud,x0_F,x0_M
0,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.97,3495,0,1.0,0.0
1,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,107.23,149,0,1.0,0.0
2,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,220.11,4154,0,0.0,1.0
3,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,45.0,1939,0,0.0,1.0
4,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,41.96,99,0,0.0,1.0


### Numerical Processing

In [22]:
# Splitting the dataset into the Training set and Test set - use trian_test_split
y = total_df2["is_fraud"]
x = total_df2.drop(["is_fraud"], axis=1)

test__perc = .20      # 20% for test split
feature_train, feature_test, label_train, label_test = train_test_split(x, y, test_size= test__perc)

In [23]:
x.head()

Unnamed: 0,year,month,x0_1,x0_2,x0_3,x0_4,x0_5,x0_6,x0_7,amt,city_pop,x0_F,x0_M
0,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.97,3495,1.0,0.0
1,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,107.23,149,1.0,0.0
2,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,220.11,4154,0.0,1.0
3,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,45.0,1939,0.0,1.0
4,2019,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,41.96,99,0.0,1.0


In [24]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: is_fraud, dtype: int64

In [25]:
# Feature Scaling - required due to different orders of magnitude across the features
# make sure to save the scaler for future use in inference

feature_info = {}
feature_year_info = {}
feature_month_info = {}
feature_amt_info = {}
feature_pop_info = {}

In [26]:
feature_year_info["mean"] = feature_train["year"].mean()
feature_year_info["std"] = feature_train["year"].std()
feature_info["year"] = feature_year_info

In [27]:
feature_month_info["mean"] = feature_train["month"].mean()
feature_month_info["std"] = feature_train["month"].std()
feature_info["month"] = feature_month_info

In [28]:
feature_amt_info["mean"] = feature_train["amt"].mean()
feature_amt_info["std"] = feature_train["amt"].std()
feature_info["amt"] = feature_amt_info

In [29]:
feature_pop_info["mean"] = feature_train["city_pop"].mean()
feature_pop_info["std"] = feature_train["city_pop"].std()
feature_info["city_pop"] = feature_pop_info

In [30]:
# scale feature train inputs
feature_train.loc[:, "year_scaled"] = (feature_train["year"]-feature_year_info["mean"])/feature_year_info["std"] 
feature_train.loc[:, "month_scaled"] = (feature_train["month"]-feature_month_info["mean"])/feature_month_info["std"]
feature_train.loc[:, "amt_scaled"] = (feature_train["amt"]-feature_amt_info["mean"])/feature_amt_info["std"]
feature_train.loc[:, "city_pop_scaled"] = (feature_train["city_pop"]-feature_pop_info["mean"])/feature_pop_info["std"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value


In [31]:
# scale feature test inputs with train mean/std from training set
feature_test.loc[:, "year_scaled"] = (feature_test["year"]-feature_year_info["mean"])/feature_year_info["std"] 
feature_test.loc[:, "month_scaled"] = (feature_test["month"]-feature_month_info["mean"])/feature_month_info["std"]
feature_test.loc[:, "amt_scaled"] = (feature_test["amt"]-feature_amt_info["mean"])/feature_amt_info["std"]
feature_test.loc[:, "city_pop_scaled"] = (feature_test["city_pop"]-feature_pop_info["mean"])/feature_pop_info["std"]

In [32]:
feature_train = feature_train.drop(["year", "month", "amt", "city_pop"], axis=1)

In [33]:
feature_train.head()

Unnamed: 0,x0_1,x0_2,x0_3,x0_4,x0_5,x0_6,x0_7,x0_F,x0_M,year_scaled,month_scaled,amt_scaled,city_pop_scaled
1337229,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.998134,-0.043977,-0.42216,-0.275836
205888,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.001869,-0.919862,-0.082446,-0.287545
1765699,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.998134,1.415833,0.090879,0.392535
1631812,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.998134,0.831909,-0.393861,-0.293864
1146927,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.998134,-0.919862,1.100917,-0.125784


In [34]:
feature_test = feature_test.drop(["year", "month", "amt", "city_pop"], axis=1)

In [35]:
feature_test.head()

Unnamed: 0,x0_1,x0_2,x0_3,x0_4,x0_5,x0_6,x0_7,x0_F,x0_M,year_scaled,month_scaled,amt_scaled,city_pop_scaled
899870,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-1.001869,1.415833,2.249418,-0.290206
733752,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,-1.001869,1.123871,-0.372992,-0.26947
440039,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.001869,-0.043977,0.698351,-0.293606
502062,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.001869,0.247985,-0.276329,-0.284231
205824,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.001869,-0.919862,0.141656,-0.260115


### Test Logistic Regression

In [36]:
log_regress_model = LogisticRegression(random_state=0).fit(feature_train, label_train)

#### Test Accuracy

In [37]:
log_regress_model.score(feature_test, label_test)

0.9943397601483485

The accuracy apears good from an accuracy standpoint, but let's check it's abillity to get the spare points correct:

In [44]:
y_pred = log_regress_model.predict(feature_test)

In [45]:
# Confusion Matrix
confusion_matrix(label_test, y_pred)

array([[368382,    167],
       [  1930,      0]])

The logistic regression model is unable to predict any instances of fraud correct.

### Test Random Forest

In [48]:
random_forest_model = RandomForestClassifier(max_depth=2, random_state=0)
random_forest_model.fit(feature_train, label_train)

RandomForestClassifier(max_depth=2, random_state=0)

#### Test Accuracy

In [49]:
random_forest_model.score(feature_test, label_test)

0.9947905279381557

In [50]:
y_pred_rf = random_forest_model.predict(feature_test)

In [51]:
confusion_matrix(label_test, y_pred_rf)

array([[368549,      0],
       [  1930,      0]])

Once again the model achieves high accuracy due to the fact that most instances are not fraud. The model is still unable to identify fraud instances from the trained features.

### Test K-Nearest Neighbors

In [53]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(feature_test, label_test)

KNeighborsClassifier(n_neighbors=3)

In [57]:
knn_model.score(feature_test, label_test)

0.9964829315561746

In [55]:
y_pred_knn = knn_model.predict(feature_test)

In [58]:
confusion_matrix(y_pred_knn, label_test)

array([[368274,   1028],
       [   275,    902]])

KNN achieves a higher accuracy and is able to accurately predict instances of regression compared with the previous 2 models