## Using Orange for regression

In [1]:
import Orange
import random

Regression in Orange is very similar to classification. Both require labeled data, and, just like classification, regression is implemented with learners and models (regressors). Regression learners are objects that accept data and return regressors. Regressors are then given data instances and predict the value of a continuous class:

In [2]:
data = Orange.data.Table("housing")
learner = Orange.regression.LinearRegressionLearner()
model = learner(data)

print("predicted, observed:")
for d in data[:3]:
    print("%.1f, %.1f" % (model(d)[0], d.get_class()))

predicted, observed:
30.0, 24.0
25.0, 21.6
30.6, 34.7
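
The learner/regressor split described above can be illustrated without Orange. The following is only a toy sketch: the classes `SimpleLinearLearner` and `SimpleLinearModel` are hypothetical names, the data is made up, and the fit is an ordinary one-feature least-squares line rather than Orange's implementation.

```python
# Toy illustration of the learner -> model pattern (not Orange's implementation).
class SimpleLinearModel:
    """Regressor: predicts y = intercept + slope * x for a single feature."""

    def __init__(self, intercept, slope):
        self.intercept = intercept
        self.slope = slope

    def __call__(self, x):
        return self.intercept + self.slope * x


class SimpleLinearLearner:
    """Learner: fits a one-feature least-squares line and returns a model."""

    def __call__(self, xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        return SimpleLinearModel(mean_y - slope * mean_x, slope)


# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

learner = SimpleLinearLearner()
model = learner(xs, ys)    # the learner accepts data and returns a regressor
prediction = model(4.0)    # the regressor predicts a continuous value
print(prediction)
```

As in Orange, the learner holds the fitting procedure, while the returned model holds only the fitted parameters and is called on new data.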


Next, consider regression trees. The script below builds a tree of depth at most 2 from the housing data and prints it in textual form:

In [3]:
tree_learner = Orange.regression.SimpleTreeLearner(max_depth=2)
tree = tree_learner(data)
print(tree.to_string())


RM (22.5: 506.0)
: <=6.941
   LSTAT (19.9: 430.0)
   : <=14.4 --> (23.3: 255.0)
   : >14.4 --> (15.0: 175.0)
: >6.941
   RM (37.2: 76.0)
   : <=7.437 --> (32.1: 46.0)
   : >7.437 --> (45.1: 30.0)
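
In the printout, each node appears to list the mean class value and the number of instances that reach it (for example, 22.5 and 506.0 at the root). A common way to choose a split in a regression tree is to pick the threshold that minimizes the summed squared error around the two resulting means. The sketch below shows that criterion on made-up data; `sse` and `best_split` are hypothetical helpers, and this is not necessarily the exact criterion `SimpleTreeLearner` uses.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) of the best binary split x <= threshold."""
    best = None
    # Skip the largest value: it would leave the right branch empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Made-up data: the target jumps when the feature exceeds 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 10.5, 20.0, 21.0, 19.5]

threshold, error = best_split(xs, ys)
left_values = [y for x, y in zip(xs, ys) if x <= threshold]
left_mean = sum(left_values) / len(left_values)   # prediction in the left leaf
print(threshold, left_mean)
```

A leaf then predicts the mean of the training values that fall into it, which is how the values next to `-->` in the printed tree can be read.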


The following code initializes a few other regressors and shows their predictions for five randomly sampled instances from the housing dataset; the remaining instances are used for training:

In [4]:
random.seed(42)
test = Orange.data.Table(data.domain, random.sample(data, 5))
train = Orange.data.Table(data.domain, [d for d in data if d not in test])

lin = Orange.regression.linear.LinearRegressionLearner()
rf = Orange.regression.random_forest.RandomForestRegressionLearner()
rf.name = "rf"
ridge = Orange.regression.RidgeRegressionLearner()

learners = [lin, rf, ridge]
regressors = [learner(train) for learner in learners]

print("y   ", " ".join("%5s" % l.name for l in regressors))

for d in test:
    print(("{:<5}" + " {:5.1f}"*len(regressors)).format(
        d.get_class(),
        *(r(d)[0] for r in regressors)))

y    linear regression    rf ridge regression
22.2   19.3  20.4  19.5
31.6   33.2  30.2  33.2
21.7   20.9  20.3  21.0
10.2   16.9  12.6  16.8
14.0   13.6  15.0  13.5


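It looks like housing prices are not that hard to predict: most predictions above are within about three units of the observed value, although the two linear models miss the instance with value 10.2 by more than six.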

##### Question 5-4-1
Show how the predicted value changes with the actual value using a scatter plot. Comment on this picture.

##### Question 5-4-2
Show how the prediction error changes with the actual value. Comment on this picture.

### Cross-Validation

Evaluation and scoring methods are available in Orange.evaluation:

In [5]:
lin = Orange.regression.linear.LinearRegressionLearner()
rf = Orange.regression.random_forest.RandomForestRegressionLearner()
rf.name = "rf"
ridge = Orange.regression.RidgeRegressionLearner()
mean = Orange.regression.MeanLearner()

learners = [lin, rf, ridge, mean]

res = Orange.evaluation.CrossValidation(data, learners, k=5)
rmse = Orange.evaluation.RMSE(res)
r2 = Orange.evaluation.R2(res)

print("Learner  RMSE  R2")
for i in range(len(learners)):
    print("{:8s} {:.2f} {:5.2f}".format(learners[i].name, rmse[i], r2[i]))

Learner  RMSE  R2
linear regression 4.88  0.72
rf       3.95  0.82
ridge regression 4.91  0.71
mean     9.20 -0.00


The two linear models perform almost identically, while the random forest reaches a lower RMSE and a higher R2. Each regression method has a set of parameters. We have been running them with default parameters, and tuning these parameters might improve the results. We have also included MeanLearner in the list of regressors; it simply predicts the mean value from the training set and serves as a baseline.
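
To make the two scores and the role of the baseline concrete, here is a small sketch in plain Python. It assumes the usual definitions (RMSE as the root of the mean squared error, R2 as one minus the ratio of residual to total sum of squares) and a simple interleaved k-fold split; the helper names and the data are made up, and Orange's own fold assignment may differ.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def cross_validate_mean(ys, k):
    """Predict each fold with the mean of the remaining folds (mean baseline)."""
    predictions = [None] * len(ys)
    for fold in range(k):
        train = [y for i, y in enumerate(ys) if i % k != fold]
        fold_mean = sum(train) / len(train)
        for i in range(fold, len(ys), k):   # every k-th instance is in this fold
            predictions[i] = fold_mean
    return predictions


# Made-up target values.
ys = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

in_sample_r2 = r2(ys, [sum(ys) / len(ys)] * len(ys))   # predicting the overall mean
cv_predictions = cross_validate_mean(ys, k=3)
cv_r2 = r2(ys, cv_predictions)
print(in_sample_r2, cv_r2, rmse(ys, cv_predictions))
```

Predicting the overall mean gives an R2 of exactly zero, and the cross-validated mean does slightly worse because each fold is predicted from the mean of the other folds. On a dataset this small the effect is exaggerated; on the housing data it is barely visible, hence the `-0.00` in the table above.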

## Association rules

In [6]:
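# Provided by the Orange3-Associate add-on, which is installed separately from Orange.
from orangecontrib.associate.fpgrowth import frequent_itemsets, association_rules, OneHot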
from scipy.sparse import issparse

ModuleNotFoundError: No module named 'orangecontrib'

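Association rule mining in Orange 3 is provided by the Orange3-Associate add-on, which has to be installed separately; without it, the import above fails with a `ModuleNotFoundError` and the cells below fail as well. Its `fpgrowth` module finds frequent itemsets and induces association rules from them, and it handles both sparse (basket) data and attribute-value data sets.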

Let's start with market basket data:

In [7]:
data = Orange.data.Table("podatki/foodmart.basket")

Let's explore the data.

In [8]:
print(len(data))
print(type(data.X))
data[:5]

62560
<class 'scipy.sparse.csr.csr_matrix'>


[[Pasta=3.000, Soup=2.000, STORE_ID_2=1.000],
 [Soup=1.000, STORE_ID_2=1.000, Fresh Vegetables=3.000, Milk=3.000, Plastic Utensils=2.000, ...],
 [STORE_ID_2=1.000, Cheese=2.000, Deodorizers=1.000, Hard Candy=2.000, Jam=2.000, ...],
 [STORE_ID_2=1.000, Fresh Vegetables=2.000],
 [STORE_ID_2=1.000, Cleaners=1.000, Cookies=2.000, Eggs=2.000, Preserves=1.000, ...]
]

We can’t use table data directly; we first have to one-hot transform it.

We get a database we can use to find frequent itemsets, and a mapping we will use later to revert the transformation.

In [9]:
X, mapping = OneHot.encode(data)
X, mapping

NameError: name 'OneHot' is not defined
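
The idea behind the transformation can be sketched without the add-on. The hypothetical helper `encode_baskets` below is not `OneHot.encode`: it simply assigns an integer index to each distinct item and represents every transaction as a set of such indices. Dropping the purchased quantities is an assumption of this sketch.

```python
def encode_baskets(baskets):
    """Map item names to integer indices; return (encoded baskets, index -> name)."""
    index_of = {}
    encoded = []
    for basket in baskets:
        row = set()
        for item in basket:
            if item not in index_of:
                index_of[item] = len(index_of)   # next free index
            row.add(index_of[item])
        encoded.append(frozenset(row))
    names = {index: item for item, index in index_of.items()}
    return encoded, names


# Item names taken from the first and fourth rows printed above; quantities are dropped.
baskets = [
    ["Pasta", "Soup", "STORE_ID_2"],
    ["STORE_ID_2", "Fresh Vegetables"],
]
encoded, names = encode_baskets(baskets)
print(encoded)
print(names)
```

The second return value plays the role of the mapping: it is what lets us turn item indices back into readable names later.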

We can use `OneHot.decode` to link each index with an item's name.

In [10]:
# Map each one-hot item index to the name of its variable; the decoded value is not used.
names = {item: var.name
         for item, var, _ in OneHot.decode(mapping, data, mapping)}

NameError: name 'OneHot' is not defined

We use a low minimum support of 0.01% (roughly 6 of the 62,560 transactions), since with so many transactions few itemsets appear in a large share of them.

In [11]:
itemsets = {}
for itemset, support in frequent_itemsets(X, 0.01/100):
    itemsets[itemset] = support
len(itemsets)

NameError: name 'frequent_itemsets' is not defined
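
The support of an itemset is the number (or fraction) of transactions that contain all of its items. The brute-force sketch below makes this concrete on made-up transactions. The function `frequent_itemsets_bruteforce` is hypothetical and enumerates every candidate, which is only viable for toy data; the add-on's `frequent_itemsets` is far more efficient.

```python
from itertools import combinations


def frequent_itemsets_bruteforce(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support of transactions.

    min_support is a fraction in (0, 1]. Exhaustive enumeration, toy data only.
    """
    min_count = min_support * len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            # Count the transactions that contain every item of the candidate.
            count = sum(candidate <= t for t in transactions)
            if count >= min_count:
                result[candidate] = count
    return result


# Made-up transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "cheese"}),
    frozenset({"bread", "milk", "cheese"}),
    frozenset({"milk"}),
]
itemsets = frequent_itemsets_bruteforce(transactions, 0.5)
for itemset, count in itemsets.items():
    print(sorted(itemset), count)
```

With a minimum support of 0.5, an itemset has to appear in at least two of the four transactions, so the pair cheese and milk (one transaction) is dropped.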

Now we can generate all association rules that have at least 70% confidence:

In [12]:
for rule in association_rules(itemsets, 0.7):
    left, right, support, confidence = rule
    left_str = ', '.join(names[i] for i in sorted(left))
    right_str = ', '.join(names[i] for i in sorted(right))
    print(left_str + " -> " + right_str)

NameError: name 'association_rules' is not defined
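
The confidence of a rule left -> right is the support of the whole itemset divided by the support of its left-hand side. The sketch below derives rules from a hand-written dictionary of support counts. The function `rules_from_itemsets` and the counts are made up for illustration; only the tuple order (left, right, support, confidence) follows the loop above.

```python
from itertools import combinations


def rules_from_itemsets(itemsets, min_confidence):
    """Yield (left, right, support, confidence) for rules left -> right.

    itemsets maps frozenset -> support count and must also contain every
    non-empty subset of each itemset it lists.
    """
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for left in combinations(sorted(itemset), size):
                left = frozenset(left)
                confidence = support / itemsets[left]
                if confidence >= min_confidence:
                    yield left, itemset - left, support, confidence


# Made-up support counts.
itemsets = {
    frozenset({"bread"}): 3,
    frozenset({"milk"}): 3,
    frozenset({"cheese"}): 2,
    frozenset({"bread", "milk"}): 2,
    frozenset({"bread", "cheese"}): 2,
}
rules = list(rules_from_itemsets(itemsets, 0.7))
for left, right, support, confidence in rules:
    print(", ".join(sorted(left)), "->", ", ".join(sorted(right)), round(confidence, 2))
```

Here only cheese -> bread passes the 70% threshold: both transactions with cheese also contain bread, whereas bread -> milk holds in just two of the three transactions with bread.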

##### Question 5-4-3
Filter rules. Find all the rules that predict the purchase of cheese.