In [102]:
import tensorflow as tf
import numpy as np                                   # Math and calculations
import pandas as pd                                  # Data manipulation
from keras.models import Sequential                  # Sequential model library from keras
from keras.layers import Dense, Activation           # Dense keras layers of nodes, activation functions
from keras.optimizers import Adam                    # Adam optimizer
from keras.metrics import CategoricalCrossentropy    # Calculate loss
from keras.losses import mean_squared_error          # used for testing
from sklearn.model_selection import train_test_split # split data

First, all major packages are imported. Although the keras modules are also available through the tensorflow library (as tf.keras), I imported the individual classes and functions directly in order to improve code readability (e.g. not having to write tf.keras.metrics.BinaryCrossentropy in code).

In [103]:
widgets_train = pd.read_csv("../input/pl234142-widgets/pl234142.csv", thousands=',')
print(widgets_train.head())
thing = widgets_train.copy()

   User ID             name  Gender  Age  EstimatedSalary  Purchased
0    10000      Bruce Evans  Female   23            67928          0
1    10001    Maria Stevens    Male   47           147285          1
2    10002      Brett Banks    Male   31           123436          1
3    10003  Danielle Snyder  Female   28           130425          0
4    10004    Jessica Smith  Female   48            90212          1


The data is imported into widgets_train, and its first five rows are printed for basic analysis. A copy is stored in thing for further processing.

In [104]:
Sex_Binary = []
for x in thing["Gender"]:
    if x.lower() == "male":
        Sex_Binary.append(1)
    elif x.lower() == "female":
        Sex_Binary.append(0)
    else:
        Sex_Binary.append("MISSING VALUE")
        print("missing value")
thing["GenderBin"] = Sex_Binary
print(thing.head())

   User ID             name  Gender  Age  EstimatedSalary  Purchased  \
0    10000      Bruce Evans  Female   23            67928          0   
1    10001    Maria Stevens    Male   47           147285          1   
2    10002      Brett Banks    Male   31           123436          1   
3    10003  Danielle Snyder  Female   28           130425          0   
4    10004    Jessica Smith  Female   48            90212          1   

   GenderBin  
0          0  
1          1  
2          1  
3          0  
4          0  


Using code I had previously written, I loop through the "Gender" column in my data and append a 1 to represent male and a 0 to represent female. Although I do have an else branch to handle missing or unexpected values, no data was missing. Finally, this new binary gender column is added to the pandas dataframe, which is then printed.
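The same encoding can also be written without an explicit loop. The sketch below is only an illustration: demo is a hypothetical stand-in for thing, built from the five "Gender" values printed above, and gender_codes is an assumed name.

```python
import pandas as pd

# Hypothetical mini-frame standing in for `thing`; values copied from the head() output above
demo = pd.DataFrame({"Gender": ["Female", "Male", "Male", "Female", "Female"]})

# Lookup table: 1 represents male, 0 represents female (same convention as the loop)
gender_codes = {"male": 1, "female": 0}

# Lower-case every entry, then translate it through the lookup table
demo["GenderBin"] = demo["Gender"].str.lower().map(gender_codes)

# Entries that are not in the lookup table become NaN, so they can be counted afterwards
n_missing = int(demo["GenderBin"].isna().sum())

print(demo)
print("unmapped values:", n_missing)
```

Compared with appending the string "MISSING VALUE", leaving unmapped entries as NaN keeps the column numeric, which matters once the column is passed to corr() or converted to a numpy array.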

In [105]:
corr_matrix = thing.corr()
print(corr_matrix["Purchased"].sort_values())

EstimatedSalary   -0.040757
GenderBin          0.015427
Age                0.041791
User ID            0.044952
Purchased          1.000000
Name: Purchased, dtype: float64


The correlation of every numeric column with the "Purchased" column is printed. As you can see, all four correlations are very weak (absolute values below 0.05), and one of the four, "User ID", is only a row identifier rather than a real attribute. This could've been a coincidence that occurred in the process of generating the data, or perhaps this bad data was created with malicious intent. *raises eyebrow*

In [106]:
#print(type(thing["EstimatedSalary"]).values)
thing2=thing.copy()
thing2.pop('Purchased')
thing2.pop('Gender')
thing2.pop('name')
X2 = np.array(thing2)
X = X2.reshape(1000,4)
print(X)
#print(thing2)
y = thing['Purchased']
#print(y)

#split training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[[ 10000     23  67928      0]
 [ 10001     47 147285      1]
 [ 10002     31 123436      1]
 ...
 [ 10997     30 112727      0]
 [ 10998     24  41431      1]
 [ 10999     34 140120      0]]


Another copy of the data is created in order to remove columns while not losing anything permanently. The "Purchased", "Gender", and "name" columns of the dataframe are removed using the pop() method. This new dataframe is converted into a numpy array of 1000 rows with 4 values each (the reshape() call from numpy confirms this shape). This is the X data for the ML algorithm. The "Purchased" column from the original dataframe is then saved as the target value y. Finally, train_test_split() splits X and y into training (80%) and test (20%) sets.
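Two things stand out in the printed array: the first column is the "User ID", and the salary values are several thousand times larger than the ages. A possible preprocessing step, sketched here under assumptions, is to drop the identifier and standardize the remaining columns. X_demo holds only the first three rows printed above, and the variable names are illustrative.

```python
import numpy as np

# First three rows copied from the printed array: User ID, Age, EstimatedSalary, GenderBin
X_demo = np.array([
    [10000, 23, 67928, 0],
    [10001, 47, 147285, 1],
    [10002, 31, 123436, 1],
], dtype=float)

# Drop the first column (User ID): it identifies the row, it does not describe the customer
features = X_demo[:, 1:]

# Standardize each remaining column to zero mean and unit standard deviation
mean = features.mean(axis=0)
std = features.std(axis=0)
features_scaled = (features - mean) / std

print(features_scaled)
```

On the real data, the mean and standard deviation would normally be computed from the training split only and then reused for the test split; sklearn.preprocessing.StandardScaler packages that pattern.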

In [120]:
tf.random.set_seed(7)

#creating the keras model
model = Sequential()

#first layer uses ReLU activation. input_shape describes one sample (4 features), not the whole dataset
model.add(Dense(4, input_shape=(4,), activation='relu'))

#hidden layer using ReLU activation.
model.add(Dense(25, activation='relu'))

#final layer outputs one node using tanh activation
model.add(Dense(1, activation='tanh'))

print("Layers added.")

Layers added.


A sequential keras model is created with an input layer, 1 hidden layer (I couldn't find any benefits for more than one in this use case), and an output layer. The first two layers use the Rectified Linear Unit (ReLU) activation function due to its simplicity, accuracy, efficiency, and widespread support (i.e. number of related Stack Overflow threads). For the output layer I chose the tanh activation function because it is nonlinear and its output ranges from -1 to 1, which I expected to cover prediction values of -1, 0, or 1. Note, however, that the "Purchased" values printed earlier are only 0 and 1.
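To make the output ranges concrete, here is a small NumPy sketch comparing tanh with the sigmoid function, which is a common alternative when the target is 0 or 1. The sigmoid helper and the sample inputs are illustrative assumptions and are not part of the model above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A few sample pre-activation values, from strongly negative to strongly positive
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

tanh_out = np.tanh(z)      # values lie between -1 and 1
sigmoid_out = sigmoid(z)   # values lie between 0 and 1

print("tanh:   ", np.round(tanh_out, 3))
print("sigmoid:", np.round(sigmoid_out, 3))
```

Because tanh can return negative values, its output cannot be read directly as the probability of a purchase, whereas a sigmoid output can.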

In [121]:
#compile the model; since output is non-binary, categorical crossentropy is used to calculate loss.
#Adam optimizer is used for decent speed and accuracy (although I'm not entirely sure how it works.)
#Accuracy argument used to determine metrics.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])

# Fit the model
model.fit(X_train, y_train, epochs=10)
print("Compiled and Fitted.")

model.evaluate(X_test, y_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Compiled and Fitted.


[4.5299529460862686e-08, 0.5350000262260437]

The compile() and fit() methods are called on the model, using categorical crossentropy to calculate loss and the Adam optimizer to update the weights. The accuracy metric is used to determine how accurate the model is. Interestingly, the loss and accuracy remain static across the epochs at 5.0068e-08 and 0.5040, respectively, and evaluating on the test set gives a similar loss of about 4.53e-08 and an accuracy of 0.5350. What could I have done to make this model more accurate?
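One possible explanation for the static numbers, offered as a hedged sketch rather than a diagnosis: as far as I understand Keras's categorical crossentropy, predictions are rescaled so that each row sums to 1 before the logarithm is taken. With a single output unit, every row would rescale to exactly 1, so the loss would no longer depend on the prediction and the optimizer would have nothing to learn from. The NumPy functions below are simplified, hypothetical re-creations for illustration (the function names, the EPS constant, and the sample predictions are assumptions, not Keras code).

```python
import numpy as np

EPS = 1e-7  # assumed clipping constant, used to avoid log(0)

def categorical_crossentropy_sketch(y_true, y_pred):
    """Simplified per-sample categorical crossentropy (illustrative only)."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1, 1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1, 1)
    # Rescale each row to sum to 1; with a single column every entry becomes 1.0
    probs = y_pred / y_pred.sum(axis=-1, keepdims=True)
    probs = np.clip(probs, EPS, 1.0 - EPS)
    return -(y_true * np.log(probs)).sum(axis=-1)

def binary_crossentropy_sketch(y_true, y_pred):
    """Simplified per-sample binary crossentropy for predictions in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), EPS, 1.0 - EPS)
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

labels = [1, 0, 1, 0]
good_preds = [0.9, 0.1, 0.8, 0.2]  # mostly agree with the labels
bad_preds = [0.1, 0.9, 0.2, 0.8]   # mostly disagree with the labels

cat_good = categorical_crossentropy_sketch(labels, good_preds).mean()
cat_bad = categorical_crossentropy_sketch(labels, bad_preds).mean()
bin_good = binary_crossentropy_sketch(labels, good_preds).mean()
bin_bad = binary_crossentropy_sketch(labels, bad_preds).mean()

print("categorical:", cat_good, cat_bad)  # same tiny value for good and bad predictions
print("binary:     ", bin_good, bin_bad)  # bad predictions are penalized much more
```

If this reading is right, the near-zero loss is an artifact of pairing a one-unit output with categorical crossentropy rather than a sign of a good fit. The usual pairing for 0/1 labels is a single output unit with activation='sigmoid' and loss='binary_crossentropy'. Given the very weak correlations found earlier, though, accuracy may stay close to 50% even with that change.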