## Work with data

- Natural Language Processing (Text)
- Computer Vision (Images)
- Speech Processing (Voice)
- Music Processing (Audio)
- Time Series
- Mixed Data

## Mixed data

- Feature definition
- Digitize features
- Normalization


Boston House Prices Dataset: https://www.kaggle.com/vikrishnan/boston-house-prices

Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository):

- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     median value of owner-occupied homes in $1000's

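The input attributes have a mixture of units and scales: proportions, rates, ratios, index values, and a 0/1 dummy variable. MEDV is the target to predict; the other 13 attributes are the inputs.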
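
To make the difference in scale concrete, the sketch below takes a single record (the second row of the dataset, as printed in the sections that follow) and reports its smallest non-zero and largest values. The names `feature_names`, `record`, and `by_magnitude` are introduced here for illustration only.

```python
# One record from the dataset (13 input attributes, MEDV excluded).
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
record = [0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9,
          4.9671, 2.0, 242.0, 17.8, 396.9, 9.14]

# Pair each value with its attribute name and sort by magnitude.
by_magnitude = sorted(zip(feature_names, record), key=lambda pair: abs(pair[1]))

# Skip exact zeros (ZN and CHAS in this record) to find the smallest real value.
smallest_nonzero = next(pair for pair in by_magnitude if pair[1] != 0)
largest = by_magnitude[-1]

print("smallest non-zero:", smallest_nonzero)
print("largest:", largest)
```

Within this one record the values span roughly four orders of magnitude, which is the motivation for the normalization step later on.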

### Read data

In [51]:
import csv
import numpy as np

samples = []
targets = []
with open('data/housing/housing.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar=None)
    for row in reader:
        samples += [row[0:-1]]
        targets += [row[-1]]

In [52]:
samples[1]

['0.02731',
 '0.00',
 '7.070',
 '0',
 '0.4690',
 '6.4210',
 '78.90',
 '4.9671',
 '2',
 '242.0',
 '17.80',
 '396.90',
 '9.14']
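
The reader above assumes exactly one space between fields. If a copy of the file pads its columns with runs of spaces, `csv.reader` with `delimiter=' '` would return empty strings between values. A minimal sketch of a more tolerant reader, assuming whitespace-separated columns with the target in the last position, is shown below. `read_housing` is a hypothetical helper, and the demo input is made up for illustration rather than taken from the dataset.

```python
import io

def read_housing(lines):
    """Split whitespace-separated rows into (samples, targets) lists of strings."""
    samples, targets = [], []
    for line in lines:
        fields = line.split()      # collapses runs of spaces and tabs
        if not fields:             # skip blank lines
            continue
        samples.append(fields[:-1])
        targets.append(fields[-1])
    return samples, targets

# Illustrative input: two features plus a target per row, not real dataset rows.
demo = io.StringIO(" 0.1   2.0  10.0\n\n 0.3   4.0  20.0\n")
demo_samples, demo_targets = read_housing(demo)
print(demo_samples)
print(demo_targets)
```

Any iterable of lines works as input, so an open file object can be passed in place of the `io.StringIO` demo.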

### Digitize data

In [53]:
digitized_samples = [list(map(float, sample)) for sample in samples]

In [54]:
digitized_samples[1]

[0.02731,
 0.0,
 7.07,
 0.0,
 0.469,
 6.421,
 78.9,
 4.9671,
 2.0,
 242.0,
 17.8,
 396.9,
 9.14]
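
Since the data is converted to NumPy arrays later anyway, the same string-to-float conversion can also be done in one step with NumPy. This is an alternative sketch, not the approach used in this notebook, and the `raw` records are made-up values for illustration.

```python
import numpy as np

# Illustrative string records, shaped like the rows returned by csv.reader.
raw = [["0.5", "10", "3.25"],
       ["1.5", "20", "4.75"]]

# astype(float) parses every string element of the array in one step.
as_floats = np.array(raw).astype(float)

print(as_floats.dtype)
print(as_floats.shape)
```

A field that cannot be parsed as a number raises `ValueError`, which is a useful early check that every row was split correctly.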

In [55]:
targets = [float(target) for target in targets]

In [56]:
targets

[24.0,
 21.6,
 34.7,
 33.4,
 36.2,
 28.7,
 22.9,
 27.1,
 16.5,
 18.9,
 15.0,
 18.9,
 21.7,
 20.4,
 18.2,
 19.9,
 23.1,
 17.5,
 20.2,
 18.2,
 13.6,
 19.6,
 15.2,
 14.5,
 15.6,
 13.9,
 16.6,
 14.8,
 18.4,
 21.0,
 12.7,
 14.5,
 13.2,
 13.1,
 13.5,
 18.9,
 20.0,
 21.0,
 24.7,
 30.8,
 34.9,
 26.6,
 25.3,
 24.7,
 21.2,
 19.3,
 20.0,
 16.6,
 14.4,
 19.4,
 19.7,
 20.5,
 25.0,
 23.4,
 18.9,
 35.4,
 24.7,
 31.6,
 23.3,
 19.6,
 18.7,
 16.0,
 22.2,
 25.0,
 33.0,
 23.5,
 19.4,
 22.0,
 17.4,
 20.9,
 24.2,
 21.7,
 22.8,
 23.4,
 24.1,
 21.4,
 20.0,
 20.8,
 21.2,
 20.3,
 28.0,
 23.9,
 24.8,
 22.9,
 23.9,
 26.6,
 22.5,
 22.2,
 23.6,
 28.7,
 22.6,
 22.0,
 22.9,
 25.0,
 20.6,
 28.4,
 21.4,
 38.7,
 43.8,
 33.2,
 27.5,
 26.5,
 18.6,
 19.3,
 20.1,
 19.5,
 19.5,
 20.4,
 19.8,
 19.4,
 21.7,
 22.8,
 18.8,
 18.7,
 18.5,
 18.3,
 21.2,
 19.2,
 20.4,
 19.3,
 22.0,
 20.3,
 20.5,
 17.3,
 18.8,
 21.4,
 15.7,
 16.2,
 18.0,
 14.3,
 19.2,
 19.6,
 23.0,
 18.4,
 15.6,
 18.1,
 17.4,
 17.1,
 13.3,
 17.8,
 14.0,
 14.4,
 13.4,

### Normalization

In [57]:
# Hold out the last 100 records for testing; train on the rest.
train_data = digitized_samples[:-100]
train_targets = targets[:-100]

test_data = digitized_samples[-100:]
test_targets = targets[-100:]

In [45]:
len(train_data)

406

In [46]:
len(train_targets)

406

In [47]:
len(test_data)

100

In [48]:
len(test_targets)

100
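
Because samples and targets are sliced separately, it is easy to slice the wrong list or to let the two get out of step. A minimal sketch of a helper that keeps them aligned and checks the lengths is shown below. `split_tail` is a hypothetical name, and the toy data is for illustration only.

```python
def split_tail(samples, targets, n_test):
    """Hold out the last n_test records for testing; the rest is for training."""
    if len(samples) != len(targets):
        raise ValueError("samples and targets must have the same length")
    if not 0 < n_test < len(samples):
        raise ValueError("n_test must be between 1 and len(samples) - 1")
    return (samples[:-n_test], targets[:-n_test],
            samples[-n_test:], targets[-n_test:])

# Illustrative toy data: 5 records with one feature each.
toy_samples = [[1.0], [2.0], [3.0], [4.0], [5.0]]
toy_targets = [10.0, 20.0, 30.0, 40.0, 50.0]

tr_x, tr_y, te_x, te_y = split_tail(toy_samples, toy_targets, n_test=2)
print(len(tr_x), len(tr_y), len(te_x), len(te_y))
```

A tail split like this does not shuffle the records, so if the file is ordered in some way the held-out set may not be representative of the whole.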

In [58]:
train_data = np.array(train_data)
train_targets = np.array(train_targets)
test_data = np.array(test_data)
test_targets = np.array(test_targets)

In [59]:
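# Standardize each feature (column) to zero mean and unit variance.
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
# Scale the test data with the training statistics, never its own.
test_data -= mean
test_data /= std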

In [64]:
train_data

array([[-0.29826747,  0.152053  , -1.10625078, ..., -1.23440314,
         0.42131747, -0.9380677 ],
       [-0.29538202, -0.56117121, -0.37840188, ..., -0.10686607,
         0.42131747, -0.33239816],
       [-0.29538477, -0.56117121, -0.37840188, ..., -0.10686607,
         0.32068793, -1.07638165],
       ...,
       [ 3.11030426, -0.56117121,  1.30818915, ...,  0.97556951,
         0.42131747,  1.2152622 ],
       [ 5.40980052, -0.56117121,  1.30818915, ...,  0.97556951,
        -1.24611651,  2.32322979],
       [ 9.03780144, -0.56117121,  1.30818915, ...,  0.97556951,
         0.12635176,  1.68261778]])
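
The normalization step can also be wrapped in a function that returns new arrays instead of modifying its inputs in place. The sketch below is one way to do that. `standardize` is a hypothetical helper, the guard for constant columns is an assumption added here (the original cell has none), and the toy arrays are illustrative values.

```python
import numpy as np

def standardize(train, test):
    """Scale both arrays with the training set's per-column mean and std."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # assumption: leave constant columns unscaled
    return (train - mean) / std, (test - mean) / std

# Illustrative toy data with very different column scales.
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

scaled_train, scaled_test = standardize(toy_train, toy_test)
print(scaled_train.mean(axis=0))   # close to 0 in every column
print(scaled_train.std(axis=0))    # close to 1 in every column
print(scaled_test)
```

After scaling, each training column has mean 0 and standard deviation 1. The test data is scaled with the training statistics, so its columns will generally not have exactly those values.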