### 1.17 Summary - important!

- Less than one per cent of all data is ever analysed and used. Underutilisation, disparate sources of data generation, the need to harvest new data and the unrealised value in available data are **the four main reasons why companies are interested in big data**.


In this topic, we have seen the following four important underlying characteristics of big data: 
1. the volume of the data, which is so large that it is difficult to store;
2. the velocity at which the data arrives, which is very high;
3. the variety of forms in which the data arrives (structured, semi-structured and unstructured); and
4. the veracity, which refers to the accuracy of the data.


We have introduced TF-IDF, a numerical statistic that measures how important a word is to a document in a collection or corpus. TF-IDF has many applications in big data analysis, such as identifying the similarity between documents. We have also seen that the power law relation is useful for estimating scale-invariant relations in big data sets.
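To make the TF-IDF idea concrete, here is a minimal sketch in plain Python. It assumes one common variant of the weighting, term frequency as a share of the document's words multiplied by `log(N / df)`; other variants exist, and the notes do not say which one is expected. The toy corpus and the helper names `tf_idf` and `cosine_similarity` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "big data needs big storage",
    "data arrives at high velocity",
    "storage is cheap",
]

def tf_idf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenised = [doc.split() for doc in corpus]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

weights = tf_idf(docs)
print(weights[0])
print(cosine_similarity(weights[0], weights[1]))
```

In this sketch a word that appears often in one document but in few others (such as "big" in the first document) gets a higher weight than a word shared across documents (such as "data"), and documents with no terms in common have similarity 0.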

From the assessment point of view, after finishing this topic, you must be clear about:
1. the four Vs; 
2. why companies are interested in big data; 
3. TF-IDF; and
4. the power law relation.
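
For the power law relation in the list above, the following sketch shows one simple way to estimate the exponent: a power law `y = c * x**(-alpha)` becomes a straight line in log-log space, so a linear fit recovers `alpha`. The data are synthetic and the values of `true_c` and `true_alpha` are arbitrary assumptions chosen for illustration; real data would be noisy and may need a more careful estimator.

```python
import numpy as np

# Synthetic data following y = c * x**(-alpha); values chosen for illustration
true_c, true_alpha = 5.0, 1.5
x = np.arange(1, 101, dtype=float)
y = true_c * x ** (-true_alpha)

# In log-log space the relation is linear: log y = log c - alpha * log x
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
alpha_hat, c_hat = -slope, np.exp(intercept)
print(alpha_hat, c_hat)

# Scale invariance: rescaling x by k multiplies y by the constant k**(-alpha),
# whatever the value of x
k = 3.0
ratio = (true_c * (k * x) ** (-true_alpha)) / y
print(np.allclose(ratio, k ** (-true_alpha)))
```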

In [1]:
import numpy as np

# Create a 10x10 matrix of random integers from 0 to 9 (the upper bound 10 is exclusive)
matrix = np.random.randint(0, 10, (10, 10))
matrix

array([[7, 1, 8, 1, 1, 5, 8, 6, 7, 6],
       [7, 4, 0, 1, 6, 4, 8, 0, 3, 3],
       [7, 1, 2, 7, 1, 8, 5, 3, 3, 4],
       [4, 1, 3, 3, 4, 3, 6, 2, 1, 1],
       [3, 5, 8, 8, 4, 0, 3, 6, 3, 6],
       [8, 5, 7, 2, 2, 9, 3, 8, 2, 2],
       [5, 8, 2, 9, 7, 8, 4, 3, 4, 9],
       [9, 2, 5, 9, 8, 5, 7, 3, 5, 7],
       [2, 1, 4, 5, 6, 1, 3, 6, 7, 5],
       [1, 0, 2, 2, 0, 3, 4, 2, 4, 2]])

In [68]:
# Summary statistics over all entries: mean, median and standard deviation
matrix.mean(), np.median(matrix), matrix.std()

(4.28, 4.0, 2.607987730032486)

In [65]:
from scipy.stats import mode

# Column-wise mode (axis=0 by default): the most frequent value in each column
mode(matrix)

ModeResult(mode=array([7, 1, 2, 1, 1, 3, 3, 3, 3, 2], dtype=int64), count=array([3, 4, 3, 2, 2, 2, 3, 3, 3, 2], dtype=int64))
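
Note that the mean, median and standard deviation earlier were taken over the whole matrix, while the mode result here is per column. The sketch below shows both a column-wise and an overall mode using NumPy only, on a small fixed matrix (illustrative values) so the result can be checked by hand. On ties it returns the smallest value.

```python
import numpy as np

# Small fixed matrix so the result can be checked by hand (illustrative values)
m = np.array([[1, 2, 2],
              [3, 2, 1],
              [1, 2, 3]])

# Column-wise modes: most frequent value in each column
# (np.bincount requires non-negative integers)
col_modes = [int(np.bincount(col).argmax()) for col in m.T]

# Overall mode: most frequent value across every entry of the matrix
values, counts = np.unique(m, return_counts=True)
overall_mode = int(values[counts.argmax()])

print(col_modes, overall_mode)
```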

In [67]:
# Linear algebra: determinant, rank and trace
np.linalg.det(matrix), np.linalg.matrix_rank(matrix), np.trace(matrix)

(38674328.99999984, 10, 45)

### Two notebooks with code examples 

The two notebooks themselves were of poor quality, but the concepts they cover (determinant, rank and trace) are worth studying.
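
As a study aid, here is a minimal sketch of the three concepts on a small matrix whose values can be verified by hand. The matrices `A` and `B` are made-up examples. The second half shows that making one row a combination of the others drives the determinant to zero and lowers the rank.

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 2.0],
              [0.0, 1.0, 4.0]])

det_A = np.linalg.det(A)            # signed volume scaling factor of A
rank_A = np.linalg.matrix_rank(A)   # number of linearly independent rows
trace_A = np.trace(A)               # sum of the diagonal entries
print(det_A, rank_A, trace_A)

# Make the last row a sum of the first two: the rows are now linearly dependent
B = A.copy()
B[2] = B[0] + B[1]

det_B = np.linalg.det(B)
rank_B = np.linalg.matrix_rank(B)
print(det_B, rank_B)
```

Because the determinant is computed in floating point, a singular matrix may give a tiny non-zero value rather than exactly 0, so comparisons are safer with a tolerance such as `np.isclose`.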

In [None]:
# Dimensions
MatrixSize = 10
NumberOfRows = 5
NumberOfColumns = 5

MyArray = np.random.randint(0, 10, size=(MatrixSize, MatrixSize))
MyArray

In [61]:
# Extract the determinant, rank and trace of the matrix
print('Determinant:', np.linalg.det(MyArray))
print('Rank:', np.linalg.matrix_rank(MyArray))
print('Trace:', np.trace(MyArray))

Determinant: 0.0
Rank: 9
Trace: 53


In [63]:
# Alternative approach to calculating the trace: sum the diagonal entries
Trace1 = 0
for i in range(MatrixSize):
    Trace1 += int(MyArray[i, i])
print('Trace1:', Trace1)

# Zero the whole last row to force a zero determinant and lower the rank by 1
MyArray[-1, :] = 0

MyArray

Trace1: 53


array([[7, 8, 5, 5, 5, 9, 4, 3, 6, 9],
       [5, 2, 2, 8, 3, 1, 5, 4, 5, 6],
       [0, 3, 9, 8, 9, 5, 1, 1, 5, 9],
       [2, 3, 8, 8, 8, 3, 3, 1, 7, 3],
       [9, 4, 6, 7, 8, 3, 2, 5, 5, 7],
       [4, 9, 8, 1, 4, 8, 8, 3, 6, 9],
       [0, 5, 1, 5, 6, 0, 8, 5, 0, 9],
       [5, 6, 3, 0, 3, 2, 2, 2, 8, 9],
       [4, 7, 1, 7, 4, 1, 4, 4, 1, 8],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [62]:
# Calculate new determinant, rank and trace
print('Determinant:', np.linalg.det(MyArray))
print('Rank:', np.linalg.matrix_rank(MyArray))
print('Trace:', np.trace(MyArray))

Determinant: 0.0
Rank: 9
Trace: 53
