# Topics
1. ML and MLlib libraries 
  - `spark.ml` is DataFrame-based; `spark.mllib` is RDD-based and has been in maintenance mode since Spark 2.0
2. Matrix support 
  - Sparse and Dense vectors and matrices 
3. Working with libSVM kind of data. 
4. Feature Transformers  
5. Feature Extractors   
6. Feature Selectors 
7. Model selection and Tuning 
8. Pipelines
    

#### 2. Linear Algebra Module 
- Dense vectors, matrices   
  - Dense and sparse representations are interconvertible; the dense representation is essentially a NumPy array 
- Sparse vectors, matrices 
  - Can take in `scipy.sparse` matrices (to do: check how to create them)
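
The dense/sparse interconversion noted above can be sketched in plain Python. This is only an illustration of the `(size, indices, values)` layout that a sparse vector stores, not Spark's implementation; the helper names `sparse_to_dense` and `dense_to_sparse` are hypothetical and chosen for this example.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse layout into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def dense_to_sparse(dense):
    """Keep only the non-zero entries of a dense list."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values


print(sparse_to_dense(4, [1, 3], [3.0, 4.0]))
print(dense_to_sparse([0.0, 3.0, 0.0, 4.0]))
```

The sparse form pays off only when most entries are zero, since it stores two numbers (index and value) per non-zero entry.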

In [None]:
import numpy as np
import scipy as sp
import sklearn as sk

from pyspark.ml.linalg import SparseMatrix, DenseMatrix, Vectors

x = Vectors.dense(np.arange(1,20,1))
x
print(type(x))

x.dot(x)

In [None]:
from pyspark.ml.linalg import SparseVector
a = SparseVector(4, [1, 3], [3.0, 4.0])
a

# conversion of a dense matrix to a sparse matrix or a NumPy array
x = DenseMatrix(5,6, range(30))
x

x.toSparse()

x.toArray()

# two scipy sparse matrix formats (CSR and CSC) that support efficient arithmetic, including inverse calculation
from scipy.sparse import csc_matrix, csr_matrix 

A = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
A

print(A)

B = csc_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
B

print(B)

from pyspark.ml.linalg import Vectors
denseVec = Vectors.dense(1.0, 2.0, 3.0)
size = 3
idx = [1, 2] # locations of non-zero elements in vector
values = [2.0, 3.0]
sparseVec = Vectors.sparse(size, idx, values)
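
As a rough guide to what the CSC (compressed sparse column) format stores, the sketch below builds the three CSC arrays by hand for the 3x3 matrix used with `csc_matrix` above. It is plain Python so it runs without Spark or SciPy; the function name `to_csc` is hypothetical. The three arrays correspond to `indptr`, `indices` and `data` on a SciPy `csc_matrix`, and to the `colPtrs`, `rowIndices` and `values` arguments of `pyspark.ml.linalg.SparseMatrix`.

```python
def to_csc(rows):
    """Return (n_rows, n_cols, col_ptrs, row_indices, values) for a dense matrix
    given as a list of rows. Entries are scanned column by column."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    col_ptrs = [0]      # col_ptrs[j]..col_ptrs[j+1] delimits column j's entries
    row_indices = []    # row index of each stored entry
    values = []         # the non-zero entries themselves
    for j in range(n_cols):
        for i in range(n_rows):
            v = rows[i][j]
            if v != 0:
                row_indices.append(i)
                values.append(float(v))
        col_ptrs.append(len(values))
    return n_rows, n_cols, col_ptrs, row_indices, values


print(to_csc([[1, 2, 0], [0, 0, 3], [4, 0, 5]]))
```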


#### 3. Working with libSVM kind of data.  
- LIBSVM is a popular data format for large-scale ML, used in particular by the SVM libraries LIBSVM and LIBLINEAR. 
- The format is: 
  - label index1:value1 index2:value2 ...  
  - Indices are 1-based in the file; Spark converts them to 0-based when reading
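
To make the format concrete, here is a minimal plain-Python sketch that parses one such line and shifts the indices to 0-based, mirroring what happens on reading. The function name `parse_libsvm_line` and the sample line are made up for illustration; this is not Spark's reader.

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 index2:value2 ...' into (label, {index: value}).
    File indices are 1-based; the returned indices are 0-based."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx) - 1] = float(val)
    return label, features


label, features = parse_libsvm_line("1 3:0.5 10:2.0")
print(label, features)
```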

In [None]:
libsvm = spark.read.format('libsvm').load('/user/sumad/Data/sample_libsvm_data.txt')
libsvm.show(5)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
+-----+--------------------+

#### 4. Feature transformers 
- Ref : https://spark.apache.org/docs/2.3.0/ml-features.html

#### 4.1 Categorical feature encoding 

4.1.1 String Indexer  
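- Creates a numeric index for a categorical column, assigning each category an index based on its frequency 
  (with `frequencyDesc`, the most frequent category gets index 0)  
- When a fitted indexer transforms a new data set, `handleInvalid` offers three ways to handle unseen categories: raise an error (`error`), drop the records (`skip`), or assign 
  one extra index to all unseen categories (`keep`).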
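
The indexing logic can be sketched in plain Python as below. This is an illustration only, not Spark's code: the function names are hypothetical, and breaking frequency ties alphabetically is an assumption made here to keep the result deterministic.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each label to an index, most frequent first (ties: alphabetical, an assumption)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


def transform_labels(labels, mapping, handle_invalid="error"):
    """Apply a fitted mapping; unseen labels follow the error / skip / keep options."""
    out = []
    for lab in labels:
        if lab in mapping:
            out.append(mapping[lab])
        elif handle_invalid == "error":
            raise ValueError(f"Unseen label: {lab}")
        elif handle_invalid == "skip":
            continue  # drop the record
        elif handle_invalid == "keep":
            out.append(float(len(mapping)))  # one shared extra index
        else:
            raise ValueError(f"Unknown option: {handle_invalid}")
    return out


mapping = fit_string_indexer(["a", "b", "c", "a", "a", "c"])
print(mapping)
print(transform_labels(["a", "d"], mapping, handle_invalid="keep"))
```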

In [None]:
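from pyspark.ml.feature import StringIndexer
df = spark.createDataFrame([(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (4, 'a'), (5, 'c')],
                           ['id', 'category'])  # rows as in the output below
s_ind = StringIndexer(inputCol='category', outputCol='catIndex',
                      handleInvalid='error', stringOrderType='frequencyDesc')
df_indexed = s_ind.fit(df).transform(df)
df_indexed.show()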

+---+--------+--------+
| id|category|catIndex|
+---+--------+--------+
|  0|       a|     0.0|
|  1|       b|     2.0|
|  2|       c|     1.0|
|  3|       a|     0.0|
|  4|       a|     0.0|
|  5|       c|     1.0|
+---+--------+--------+

4.1.2 One hot encoding 
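- Unlike sklearn's one-hot encoder, by default it represents the n categories of a column with a feature vector of length n-1. `dropLast=True` (the default) gives this n-1 length; the last category is encoded as the all-zeros vector
- Unseen categories are handled through `handleInvalid`, which here accepts only `error` and `keep` (there is no `skip` option as in StringIndexer). 
- The output representation is a sparse vector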
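
A small plain-Python sketch of the `dropLast` behaviour described above, returning the same `(size, indices, values)` layout that Spark prints for sparse vectors. The function `one_hot` is hypothetical and written only for illustration.

```python
def one_hot(index, num_categories, drop_last=True):
    """One-hot encode a category index as a sparse (size, indices, values) triple.
    With drop_last=True the vector has num_categories - 1 slots and the last
    category becomes the all-zeros vector."""
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError(f"Index {i} is outside 0..{num_categories - 1}")
    size = num_categories - 1 if drop_last else num_categories
    if i < size:
        return size, [i], [1.0]
    return size, [], []


print(one_hot(0.0, 4))                   # first of four categories
print(one_hot(3.0, 4))                   # last category, dropped
print(one_hot(3.0, 4, drop_last=False))  # last category, kept
```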

In [None]:
from pyspark.ml.feature import OneHotEncoderEstimator  # Spark 2.3/2.4; renamed to OneHotEncoder in Spark 3.0
df = spark.createDataFrame([
    (0.0, 1.0),
    (1.0, 9.0),
    (2.0, 1.0),
    (3.0, 2.0),
    (0.0, 1.0),
    (2.0, 4.0)
], ["categoryIndex1", "categoryIndex2"])

oh_enc = OneHotEncoderEstimator(inputCols= ["categoryIndex1", "categoryIndex2"],
                      outputCols= ["oh_categoryIndex1", "oh_categoryIndex2"],
                      handleInvalid='error',dropLast=True)
df_enc = oh_enc.fit(df).transform(df)
df_enc.show()

+--------------+--------------+-----------------+-----------------+
|categoryIndex1|categoryIndex2|oh_categoryIndex1|oh_categoryIndex2|
+--------------+--------------+-----------------+-----------------+
|           0.0|           1.0|    (3,[0],[1.0])|    (9,[1],[1.0])|
|           1.0|           9.0|    (3,[1],[1.0])|        (9,[],[])|
|           2.0|           1.0|    (3,[2],[1.0])|    (9,[1],[1.0])|
|           3.0|           2.0|        (3,[],[])|    (9,[2],[1.0])|
|           0.0|           1.0|    (3,[0],[1.0])|    (9,[1],[1.0])|
|           2.0|           4.0|    (3,[2],[1.0])|    (9,[4],[1.0])|
+--------------+--------------+-----------------+-----------------+

Also to cover:
- IndexToString
- Response encoding
- VectorIndexer (improves performance, automates identification of categorical features)
- Interaction (gives a Cartesian product; how is it used?)

#### 4.2 Continuous feature transformers 
- Scalers 
- Bucketizer