# Importing data

In this section, we read a CSV file from the URI below. Each row records which pages of a website a user accessed and whether that user bought something from the website.
We want to train a model that predicts, based on the pages a user accessed, whether that user will make a purchase.

In [1]:
import pandas as pd # Importing pandas library

# URI to get the data
uri = 'https://gist.githubusercontent.com/guilhermesilveira/2d2efa37d66b6c84a722ea627a897ced/raw/10968b997d885cbded1c92938c7a9912ba41c615/tracking.csv'

# Reading the data from the URI
data = pd.read_csv(uri) 
data.head()

   home  how_it_works  contact  bought
0     1             1        0       0
1     1             1        0       0
2     1             1        0       0
3     1             1        0       0
4     1             1        0       0


In [2]:
# Selecting the first 3 columns (the pages accessed) as the input features
x = data[['home', 'how_it_works', 'contact']]
x.head()

   home  how_it_works  contact
0     1             1        0
1     1             1        0
2     1             1        0
3     1             1        0
4     1             1        0


In [3]:
# Selecting the 'bought' column as the target we want to predict
y = data['bought']
y.head()

0    0
1    0
2    0
3    0
4    0
Name: bought, dtype: int64

In [4]:
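# Splitting the data into training and test data by position

# Features: first 75 rows for training, remaining rows for testing
train_x = x[:75]
test_x = x[75:]

# Target: same split, so each row of x stays paired with its y
train_y = y[:75]
test_y = y[75:]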

print(f'Training with {len(train_x)} elements and testing with {len(test_x)} elements')

Training with 75 elements and testing with 24 elements
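
A positional split like the one above is only fair if the rows are not ordered by outcome. As a minimal sketch of how this could be checked, the example below uses a small made-up frame (the `toy` data and its values are hypothetical, not taken from the tracking file) in which all buyers happen to sit at the end:

```python
import pandas as pd

# Hypothetical toy data, ordered so that every buyer comes last
toy = pd.DataFrame({
    "home": [1, 1, 0, 1, 0, 1, 1, 0],
    "bought": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Positional split: first 6 rows for training, last 2 for testing
toy_train_y = toy["bought"][:6]
toy_test_y = toy["bought"][6:]

# Share of buyers in each part (mean of a 0/1 column)
train_rate = toy_train_y.mean()
test_rate = toy_test_y.mean()
print(f"Buyers in training: {train_rate:.0%}, in test: {test_rate:.0%}")
```

In this contrived case the training part contains no buyers at all, which is the kind of imbalance a random, stratified split is meant to avoid.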


In [5]:
# Importing sklearn utils
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Creating and training the model
model = LinearSVC(dual='auto')
model.fit(train_x, train_y)

# Making the prediction
predict = model.predict(test_x)

# Calculating the accuracy
accuracy = accuracy_score(test_y, predict) * 100
print(f'Accuracy: {accuracy:.2f}%')

Accuracy: 95.83%
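
For reference, `accuracy_score` returns the fraction of predictions that match the true labels. With 24 test rows, 95.83% corresponds to 23 correct predictions out of 24. The sketch below reproduces that arithmetic with made-up labels (`true_labels` and `predicted` are hypothetical, not the model's actual output):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 24 samples, 8 of them buyers
true_labels = np.zeros(24, dtype=int)
true_labels[:8] = 1

# Hypothetical predictions: identical except for one wrong answer
predicted = true_labels.copy()
predicted[0] = 0

# Accuracy by hand: share of positions where prediction equals truth
manual = (true_labels == predicted).mean() * 100
# Same quantity computed by scikit-learn
library = accuracy_score(true_labels, predicted) * 100

print(f"Manual: {manual:.2f}%, accuracy_score: {library:.2f}%")
```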


### Stratify

In the following section, we use scikit-learn's built-in `train_test_split` function to split our input data into training and test sets automatically, and then train and test our model again.

Note that this function splits the data randomly, so our model can produce different results each time we run it.

To avoid this and make our results reproducible, we can fix the seed used to generate the random split.

We also pass `stratify=y` so that the training and test sets keep the same proportion of buyers and non-buyers as the full dataset.
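
The effect of a fixed seed can be sketched in isolation. The example below uses a made-up array (`values` is a hypothetical stand-in for the rows of our data) and splits it twice with the same `random_state`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the rows of a dataset
values = np.arange(20)

# Splitting twice with the same seed
_, test_a = train_test_split(values, test_size=0.25, random_state=20)
_, test_b = train_test_split(values, test_size=0.25, random_state=20)

# The two test sets contain the same elements in the same order
same_seed_matches = np.array_equal(test_a, test_b)
print(f"Same split with the same seed: {same_seed_matches}")
```

Leaving out `random_state` would let each call draw a fresh shuffle, so the two test sets would generally differ.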

In [6]:
# Importing sklearn utils
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

SEED = 20

train_x, test_x, train_y, test_y = train_test_split(
    x, y, # input data
    test_size=0.25, # fraction of the data held out for testing
    random_state=SEED, # seed for the random split, so it can be reproduced
    stratify=y # keep the same proportion of each class of y in the training and test sets
)

print(f'Training with {len(train_x)} elements and testing with {len(test_x)} elements')

# Creating and training the model
model = LinearSVC()
model.fit(train_x, train_y)

# Making the predictions
predict = model.predict(test_x)

# Calculating the accuracy
accuracy = accuracy_score(test_y, predict) * 100
print(f'Accuracy: {accuracy:.2f}%')

Training with 74 elements and testing with 25 elements
Accuracy: 96.00%
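
To see what `stratify` does on its own, here is a minimal sketch with made-up labels (`labels` and `features` are hypothetical, chosen so the numbers divide evenly). It compares the share of buyers in each part of a stratified split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30 buyers (1) and 70 non-buyers (0)
labels = np.array([1] * 30 + [0] * 70)
# Hypothetical single-column feature matrix, one row per label
features = np.arange(100).reshape(-1, 1)

_, _, strat_train_y, strat_test_y = train_test_split(
    features, labels,
    test_size=0.2,
    random_state=20,
    stratify=labels,  # preserve the buyer / non-buyer ratio in both parts
)

# Share of buyers in each part
print(f"Training: {strat_train_y.mean():.0%}, test: {strat_test_y.mean():.0%}")
```

Both parts keep the 30% share of buyers found in the full label array.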


