## Let's set up our work environment

In [2]:
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Get the data

In [3]:
dataset = pd.read_csv('Data.csv')

## Discover the data

In [40]:
dataset.head()

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |


### Look at the Big Picture
We have a dataset with 4 columns (Country, Age, Salary and Purchased). It is important to distinguish the independent variables from the dependent variable. The independent variables are the first 3 columns: Country, Age and Salary. The dependent variable is the last column, Purchased. The independent variables are the predictors, and the dependent variable is the one to be predicted. Each row contains information about one customer.

### Frame the problem
Business scenario: imagine that this dataset belongs to a company that keeps a customer file and records, for each customer, whether they bought one of its new products. The company wants to relate country, age and salary to the customer's decision to buy. We will therefore use the independent variables to try to predict whether a customer buys the product.

## Visualize the data to gain insights

In [5]:
dataset.head()

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |


Each row contains information about one customer. There are 4 attributes: Country, Age, Salary and Purchased. We see that there are null values (NaN). The info() method is useful for getting more information, in particular the total number of rows, each attribute's type, and its number of non-null values.

In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
Country      10 non-null object
Age          9 non-null float64
Salary       9 non-null float64
Purchased    10 non-null object
dtypes: float64(2), object(2)
memory usage: 400.0+ bytes
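The non-null counts reported by info() can also be turned into explicit missing-value counts. The sketch below rebuilds the 10 rows shown in this notebook as a standalone DataFrame (the name `customers` is an assumption made for illustration, since `Data.csv` is not available here) and counts the NaN entries per column with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the 10 rows shown in this notebook (not read from Data.csv)
customers = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})

# isna() marks each missing cell as True; sum() counts them column by column
missing_per_column = customers.isna().sum()
print(missing_per_column)
```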


There are 10 instances in the dataset, which is very small by Machine Learning standards, but it's perfect to get started. Notice that the Age and Salary attributes have only 9 non-null values each, meaning that 1 value is missing in each feature. We will need to take care of this later. Two attributes (Age and Salary) are numerical and two are not (Country and Purchased). The values in the Country and Purchased columns are repetitive, which means they are categorical attributes. We can find out which categories exist and how many customers belong to each one by using the value_counts() method.

In [7]:
dataset["Country"].value_counts()

France     4
Germany    3
Spain      3
Name: Country, dtype: int64

In [8]:
dataset["Purchased"].value_counts()

Yes    5
No     5
Name: Purchased, dtype: int64

It's pretty clear. Let's look at the other fields. The describe() method shows a summary of the numerical attributes.

In [9]:
dataset.describe()

| | Age | Salary |
|---|---|---|
| count | 9.0 | 9.0 |
| mean | 38.777778 | 63777.777778 |
| std | 7.693793 | 12265.579662 |
| min | 27.0 | 48000.0 |
| 25% | 35.0 | 54000.0 |
| 50% | 38.0 | 61000.0 |
| 75% | 44.0 | 72000.0 |
| max | 50.0 | 83000.0 |


The count, mean, min, and max rows are self-explanatory. Note that the null values are ignored (so, for example, the count of Age and of Salary is 9, not 10). The std row shows the standard deviation, which measures how dispersed the values are. The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile indicates the value below which a given percentage of the observations falls. For example, about 25% of the customers are aged 35 or younger and earn 54,000 or less, about 50% are aged 38 or younger and earn 61,000 or less, and about 75% are aged 44 or younger and earn 72,000 or less.
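To make the percentile rows concrete, here is a small sketch that recomputes them for the Age column. The Series `ages` is an illustrative copy of the values shown in this notebook, not a variable defined in it; pandas skips the NaN entry when computing quantiles, just as describe() does.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the Age column, including its one missing value
ages = pd.Series([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0])

# quantile() ignores NaN and interpolates linearly between sorted values
quartiles = ages.quantile([0.25, 0.50, 0.75])
print(quartiles)

# count() also ignores NaN, which is why describe() reports 9 and not 10
print(ages.count())
```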

## Prepare the data for Machine Learning algorithms
We will start by creating 2 arrays: a matrix of independent variables containing the first 3 columns (Country, Age and Salary), and a vector for the dependent variable (Purchased).

### Create variables: x and y
To create our variables, we use a very handy pandas indexer, iloc, which selects rows and columns by integer position.

In [10]:
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
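As a quick illustration of what these two slices select, the sketch below applies the same `iloc` expressions to a two-row DataFrame. The names `demo`, `features` and `target` are made up for this example.

```python
import pandas as pd

# Tiny illustrative DataFrame with the same column layout as the dataset
demo = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

# [:, :-1] keeps every row and every column except the last one
features = demo.iloc[:, :-1].values
# [:, -1] keeps every row of the last column only
target = demo.iloc[:, -1].values

print(features.shape)
print(target.shape)
```

Because the selected columns mix text and numbers, `.values` returns a NumPy array with `dtype=object`, which is also what the `x` array in this notebook shows.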

### Visualize x and y

In [11]:
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

We see the values of the Country, Age and Salary columns. Pretty good!

In [12]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

Here we see the values of the Purchased column.

## Let's take care of our missing values
There are 2 missing values (NaN): one in the Age column and one in the Salary column. We are going to replace them with something relevant. Our dataset is fairly evenly distributed: Age ranges from 27 to 50 and Salary from 48,000 to 83,000, with no big outliers. So we can replace the NaN values with the mean of each column.

In [13]:
# Take care of NaN values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer.fit(x[:, 1:3])



Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

In [14]:
x[:, 1:3] = imputer.transform(x[:, 1:3])

### Visualize x again 

In [15]:
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

### Our NaN values have been replaced.
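The `Imputer` class used above was removed from `sklearn.preprocessing` in scikit-learn 0.22. On newer versions, a roughly equivalent sketch uses `SimpleImputer` from `sklearn.impute`. The array `numeric` below is an illustrative copy of the Age and Salary columns, not a variable from this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative copy of the Age and Salary columns, with their two NaN entries
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# strategy="mean" replaces each NaN with the mean of its own column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

print(filled[4, 1])  # imputed Salary
print(filled[6, 0])  # imputed Age
```

`SimpleImputer` always works column by column, so there is no `axis` argument to set.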

### Let's take care of our categorical variables
First, for the independent variables, we use dummy variable encoding (OneHotEncoder) because there is no order relation between countries. Second, for the dependent variable, we encode the values of the Purchased column directly (0 = No and 1 = Yes).

### Import classes to do the job

In [16]:
# Take care of categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x = LabelEncoder()
x[:, 0] = labelencoder_x.fit_transform(x[:, 0])

In [17]:
# Visualize x again
x

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

### The text is gone: 0 is France, 1 is Germany and 2 is Spain. But these integers imply an order relation that does not exist. Let's fix it with OneHotEncoder.

In [18]:
onehotencoder = OneHotEncoder(categorical_features = [0])
x = onehotencoder.fit_transform(x).toarray()

Note: if you used a LabelEncoder before the OneHotEncoder only to convert the categories to integers, recent scikit-learn versions let you apply the OneHotEncoder to the string categories directly.


In [19]:
# Visualize our x now
x

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])

A bit strange and not beautiful, I know. But the job is done.
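The `categorical_features` argument used above is no longer available in scikit-learn 0.22 and later. A possible equivalent on newer versions is to wrap `OneHotEncoder` in a `ColumnTransformer`, as sketched below. The three-row DataFrame `sample` is made up for illustration, and `sparse_threshold=0` is set only to get a dense array back.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small illustrative sample with the same feature columns as the dataset
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
})

# One-hot encode Country and pass Age and Salary through unchanged
transformer = ColumnTransformer(
    [("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = transformer.fit_transform(sample)
print(encoded)
```

The one-hot columns come first, in alphabetical order of the categories (France, Germany, Spain), followed by Age and Salary. No LabelEncoder step is needed beforehand.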

In [20]:
# To avoid multicollinearity (the dummy variable trap), we delete one dummy variable
x = x[:, 1:]
x

array([[0.00000000e+00, 0.00000000e+00, 4.40000000e+01, 7.20000000e+04],
       [0.00000000e+00, 1.00000000e+00, 2.70000000e+01, 4.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 3.00000000e+01, 5.40000000e+04],
       [0.00000000e+00, 1.00000000e+00, 3.80000000e+01, 6.10000000e+04],
       [1.00000000e+00, 0.00000000e+00, 4.00000000e+01, 6.37777778e+04],
       [0.00000000e+00, 0.00000000e+00, 3.50000000e+01, 5.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 3.87777778e+01, 5.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 4.80000000e+01, 7.90000000e+04],
       [1.00000000e+00, 0.00000000e+00, 5.00000000e+01, 8.30000000e+04],
       [0.00000000e+00, 0.00000000e+00, 3.70000000e+01, 6.70000000e+04]])
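Dropping the first dummy column by hand works, but `OneHotEncoder` can also do it through its `drop` parameter (available from scikit-learn 0.21). The sketch below is an alternative under that assumption; the `countries` array is an illustrative copy of the first three Country values.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative single-column input: the first three Country values
countries = np.array([["France"], ["Spain"], ["Germany"]])

# drop="first" removes the dummy column of the first category (France)
encoder = OneHotEncoder(drop="first")
dummies = encoder.fit_transform(countries).toarray()
print(dummies)
```

France becomes the reference category encoded as all zeros, and the two remaining columns stand for Germany and Spain, which matches the first two columns of `x` above.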

In [21]:
# It's time to take care of y variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [22]:
# Take a look at our y variable
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

Job done. Good. Clear. 0 stands for No and 1 for Yes.

## Feature Scaling
This step puts all our variables on the same scale.
When we look at the Age and Salary variables, they are not on the same scale. Salary takes much larger values, so it can dominate Age and drown out its effect. This is why feature scaling is important.

In [41]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)

In [42]:
# Visualize x 
x

array([[-6.54653671e-01, -6.54653671e-01,  7.58874362e-01,
         7.49473254e-01],
       [-6.54653671e-01,  1.52752523e+00, -1.71150388e+00,
        -1.43817841e+00],
       [ 1.52752523e+00, -6.54653671e-01, -1.27555478e+00,
        -8.91265492e-01],
       [-6.54653671e-01,  1.52752523e+00, -1.13023841e-01,
        -2.53200424e-01],
       [ 1.52752523e+00, -6.54653671e-01,  1.77608893e-01,
         6.63219199e-16],
       [-6.54653671e-01, -6.54653671e-01, -5.48972942e-01,
        -5.26656882e-01],
       [-6.54653671e-01,  1.52752523e+00,  0.00000000e+00,
        -1.07356980e+00],
       [-6.54653671e-01, -6.54653671e-01,  1.34013983e+00,
         1.38753832e+00],
       [ 1.52752523e+00, -6.54653671e-01,  1.63077256e+00,
         1.75214693e+00],
       [-6.54653671e-01, -6.54653671e-01, -2.58340208e-01,
         2.93712492e-01]])

### Perfect. All variables are now on the same scale, in particular Age and Salary.
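What StandardScaler does to each column is subtract the column mean and divide by the column standard deviation. The sketch below checks this by hand on the Age column. The array `ages` is an illustrative copy of the imputed Age values and is not a variable from this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative copy of the Age column after imputation (349/9 is the imputed mean)
ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 349.0 / 9, 48.0, 50.0, 37.0])

# Standardization by hand: np.std uses the population formula (ddof=0) by default
z_manual = (ages - ages.mean()) / ages.std()

# Same computation through StandardScaler, which expects a 2D input
z_sklearn = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

print(np.allclose(z_manual, z_sklearn))
print(z_manual[0])
```

The first value is close to the 0.7589 shown for Age in the first row of the scaled `x`, and the imputed entry lands at approximately 0 because it equals the column mean.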

### Select and train a model
We do not split our dataset into a training set and a test set because we have too little data.

In [43]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(x, y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

That's it: our logistic regression model has been created with all the default settings, as we can see.

## Make a prediction

In [44]:
# Let's make predictions on the same data the model was trained on
y_pred = classifier.predict(x)
y_pred

array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])

In [45]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

When we quickly compare 'y_pred' with 'y', we see that our model has made a few errors (3 out of 10). Let's analyze this properly using the confusion matrix.

### Confusion Matrix

In [46]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y, y_pred)
cm

array([[3, 2],
       [1, 4]], dtype=int64)

Of the 10 customers, our model correctly predicted 3 times that the customer does not buy and 4 times that the customer buys. It incorrectly predicted 1 time that the customer does not buy, and 2 times that the customer buys.
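The four cells of the matrix can be unpacked by name, which makes it easy to derive other metrics. In the sketch below, `y_true` and `y_hat` are illustrative copies of the `y` and `y_pred` arrays printed above; precision and recall are shown as examples of what can be computed from the cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative copies of y and y_pred as printed in this notebook
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
y_hat = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)

# Precision: share of predicted buyers who really bought
precision = tp / (tp + fp)
# Recall: share of real buyers that the model found
recall = tp / (tp + fn)
print(precision, recall)
```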

### Accuracy

Accuracy is the number of correct predictions divided by the total number of observations.
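With the counts from the confusion matrix (TN = 3, FP = 2, FN = 1, TP = 4), this works out as follows:

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{4 + 3}{4 + 3 + 2 + 1} = \frac{7}{10} = 0.7
$$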

In [47]:
# Accuracy
classifier.score(x, y)

0.7

### 70%. Not bad at all, keeping in mind that this score is measured on the same data the model was trained on.