## Meta -

This file explains tips and tricks that are necessary before building neural networks. Separate files can be built for pandas, Keras, and sklearn tricks.

## Author - Rahul Suresh

## NLTK tricks

### Tokenizing and Cleaning data using regex

In [9]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
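
The pattern \w+ keeps runs of letters, digits, and underscores and discards everything in between, which is why the hyphen and the punctuation disappear in the output above. As a rough cross-check, assuming only Python's standard re module, the same tokens can be obtained without NLTK:

```python
import re

sentence = 'Eighty-seven miles to go, yet.  Onward!'
# findall returns every non-overlapping match of \w+ in order of appearance
tokens = re.findall(r'\w+', sentence)
print(tokens)
```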

## Keras tricks

### Tokenizing and Cleaning data 
By default, text_to_word_sequence:

* Splits words by space (split=" ")
* Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
* Converts text to lowercase (lower=True)

Each of these defaults can be changed through the function's parameters.

 

In [10]:
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# tokenize the document
result = text_to_word_sequence(text)
print(result)

Using TensorFlow backend.


['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
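
To make the three defaults listed above concrete, here is a minimal pure-Python sketch. The helper simple_word_sequence is hypothetical and is not part of Keras; it only approximates the behaviour described in the list and shows how changing a parameter such as lower alters the result.

```python
# Hypothetical helper, not part of Keras: mimics the three defaults listed above
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=' '):
    if lower:
        text = text.lower()
    # replace every filtered character with the split character
    text = text.translate({ord(c): split for c in filters})
    # drop the empty strings produced by consecutive separators
    return [w for w in text.split(split) if w]

default_result = simple_word_sequence('The quick brown fox jumped over the lazy dog.')
cased_result = simple_word_sequence('The quick brown fox.', lower=False)
print(default_result)
print(cased_result)
```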


### Tokenizing and integer encoding function

#### Using hash function

In [11]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

8
[6, 4, 1, 2, 7, 5, 6, 2, 6]
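
In the output above, 6 appears three times ('the' twice, plus 'dog') and 2 appears twice ('fox' and 'lazy'): hashing maps distinct words into a fixed range, so collisions are possible. The sketch below illustrates the idea with Python's hashlib. The helper md5_word_codes is hypothetical, and the folding scheme (modulo n - 1, plus 1) is an assumption, so it is not guaranteed to reproduce the exact numbers printed by the library.

```python
import hashlib

def md5_word_codes(words, n):
    # Hypothetical helper: hash each word with md5, then fold the digest
    # into the range 1..n-1 (so 0 never appears)
    codes = []
    for w in words:
        digest = int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16)
        codes.append(digest % (n - 1) + 1)
    return codes

words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
codes = md5_word_codes(words, 10)
print(codes)
```

The same word always receives the same code, but two different words may share one. Choosing a hash space larger than the vocabulary (here 1.3 times its size) reduces collisions without eliminating them.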


#### Using one_hot

In [20]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)

8
[2, 9, 5, 7, 2, 1, 2, 1, 3]
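
Note that the result above is a list of integer indices, one per word, rather than a matrix of one-hot vectors. If actual one-hot rows are needed, one possible sketch (assuming NumPy and reusing the indices printed above) is to index into an identity matrix:

```python
import numpy as np

codes = [2, 9, 5, 7, 2, 1, 2, 1, 3]   # integer indices from the output above
n = 10                                # round(8 * 1.3), the size passed to one_hot
# row i of the identity matrix is the one-hot vector for index i
one_hot_rows = np.eye(n, dtype=int)[codes]
print(one_hot_rows.shape)
print(one_hot_rows[0])
```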


#### Using tokenizer class

In [31]:
from keras.preprocessing.text import Tokenizer

# define 5 documents
# An input consists of multiple documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)


Once fit, the Tokenizer provides four attributes:

* word_counts: A dictionary of words and their counts.
* word_docs: A dictionary of words and how many documents each appeared in.
* word_index: A dictionary of words and their uniquely assigned integers.
* document_count: An integer count of the total number of documents that were used to fit the Tokenizer.


In [25]:
# summarize what was learned
print(t.word_counts)
print()
print(t.word_docs)
print()
print(t.word_index)
print()
print(t.document_count)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])

defaultdict(<class 'int'>, {'effort': 1, 'work': 2, 'good': 1, 'excellent': 1, 'done': 1, 'well': 1, 'great': 1, 'nice': 1})

{'effort': 6, 'work': 1, 'good': 4, 'excellent': 8, 'done': 3, 'well': 2, 'great': 5, 'nice': 7}

5
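
The integers in word_index are not arbitrary: in the output above, the most frequent word ('work') gets 1 and the remaining words follow in order of first appearance. A minimal plain-Python sketch that reproduces this mapping, assuming a \w+ regex as a stand-in for the Tokenizer's own filtering:

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

counts = Counter()
for doc in docs:
    counts.update(re.findall(r'\w+', doc.lower()))

# most frequent word gets index 1; ties keep first-seen order (sorted is stable)
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
print(word_index)
```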


The document encoding modes available include:

* 'binary': Whether or not each word is present in the document. This is the default.
* 'count': The count of each word in the document.
* 'tfidf': The Term Frequency-Inverse Document Frequency (TF-IDF) score for each word in the document.
* 'freq': The frequency of each word as a ratio of words within each document.


In [30]:
# encode documents as a TF-IDF matrix (one row per document, one column per word index)
encoded_docs = t.texts_to_matrix(docs, mode='tfidf')
print(encoded_docs)

[[0.         0.         1.25276297 1.25276297 0.         0.
  0.         0.         0.        ]
 [0.         0.98082925 0.         0.         1.25276297 0.
  0.         0.         0.        ]
 [0.         0.         0.         0.         0.         1.25276297
  1.25276297 0.         0.        ]
 [0.         0.98082925 0.         0.         0.         0.
  0.         1.25276297 0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         1.25276297]]
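
The matrix above has 9 columns for 8 words because column 0 is never assigned to a word. Only two distinct non-zero values occur: 0.98082925 for 'work' (present in 2 of 5 documents) and 1.25276297 for every other word (present in 1 of 5). The sketch below assumes the weighting (1 + ln(count)) * ln(1 + N / (1 + document frequency)); that formula is an assumption here, chosen because it reproduces both printed values.

```python
import math

def tfidf_weight(count, doc_freq, n_docs):
    # assumed weighting: (1 + ln(count)) * ln(1 + n_docs / (1 + doc_freq))
    tf = 1 + math.log(count)
    idf = math.log(1 + n_docs / (1 + doc_freq))
    return tf * idf

# 'work' appears once in a document and in 2 of the 5 documents
print(round(tfidf_weight(1, 2, 5), 8))
# every other word appears once and in 1 of the 5 documents
print(round(tfidf_weight(1, 1, 5), 8))
```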


## SKlearn tricks

### Integer encoding labels

In [3]:
from sklearn.preprocessing import LabelEncoder

data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(data)
print("Integer encoded values",integer_encoded)
inverted = label_encoder.inverse_transform(integer_encoded)
print("Integer decoded values",inverted)

Integer encoded values [0 0 2 0 1 1 2 0 2 1]
Integer decoded values ['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']
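
The codes above follow the sorted order of the labels ('cold' = 0, 'hot' = 1, 'warm' = 2). As a sketch of the same idea without scikit-learn, assuming NumPy, np.unique returns both the sorted classes and the integer code of each element:

```python
import numpy as np

data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
# classes are the sorted unique labels; codes index into classes
classes, codes = np.unique(data, return_inverse=True)
print(classes)
print(codes)
# indexing classes with the codes recovers the original labels
print(classes[codes])
```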


### Binarizing labels / One hot encoding

In [5]:
from sklearn.preprocessing import LabelBinarizer
# assign the labels to y as a list of class values
y=[1,2,4]
# also works with text data, e.g. y = ['cat','dog','cat']
# (with only two classes the output is a single 0/1 column)
Y = LabelBinarizer().fit_transform(y)
print(Y)

[[1 0 0]
 [0 1 0]
 [0 0 1]]


## Dataset 

In [36]:
import numpy as np
import random
x=[]
y=[]
for i in range(0,100):
    x.append([random.randint(200,300),random.randint(200,300),random.randint(600,1000)])
    y.append(1)
    
for i in range(0,100):
    x.append([random.randint(1,100),random.randint(1,100),random.randint(600,1000)])
    y.append(0)
x=np.array(x)
y=np.array(y)

## Splitting data into training and test data

In [32]:
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)
# a fixed random_state makes the split reproducible across runs

In [34]:
train_features.shape

(150, 3)

In [35]:
test_features.shape

(50, 3)
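
With 200 rows and test_size = 0.25, 50 rows go to the test set and 150 to the training set, as the shapes above show. The following NumPy sketch illustrates why a fixed seed gives the same split every time. The helper simple_split is hypothetical and is not scikit-learn's implementation, so its row selection will differ from train_test_split.

```python
import numpy as np

def simple_split(x, y, test_size=0.25, seed=42):
    # Hypothetical helper, not scikit-learn's implementation
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))              # shuffled row indices
    n_test = int(np.ceil(test_size * len(x)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x_demo = np.arange(600).reshape(200, 3)
y_demo = np.array([1] * 100 + [0] * 100)
first = simple_split(x_demo, y_demo)
second = simple_split(x_demo, y_demo)
print(first[0].shape, first[1].shape)
# the same seed yields identical splits
print(all(np.array_equal(p, q) for p, q in zip(first, second)))
```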

## Scaling data into a range

### Based on min and max values in dataset

In [28]:
from sklearn.preprocessing import MinMaxScaler



#input shape has to be 2D
data = [[1,2,3],[3,4,5]]
scaler = MinMaxScaler(feature_range=(0, 1), copy=True)
scaler.fit(data)

print("Max Value in feature set",scaler.data_max_)

print("\nTransformed input data\n",scaler.transform(data))

#new data shape has to be 2D
print("\nNew data [2,2,3] scaled to",scaler.transform([[2,2,3]]))

Max Value in feature set [3. 4. 5.]

Transformed input data
 [[0. 0. 0.]
 [1. 1. 1.]]

New data [2,2,3] scaled to [[0.5 0.  0. ]]
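
For the default range (0, 1), the scaling above amounts to (value - column minimum) / (column maximum - column minimum), using the minimum and maximum learned during fit. A short NumPy sketch of that arithmetic, which reproduces the printed values:

```python
import numpy as np

data = np.array([[1, 2, 3], [3, 4, 5]], dtype=float)
col_min = data.min(axis=0)
col_max = data.max(axis=0)

def min_max(rows):
    # (value - column minimum) / (column maximum - column minimum)
    return (np.asarray(rows, dtype=float) - col_min) / (col_max - col_min)

print(min_max(data))
print(min_max([[2, 2, 3]]))
```

Because the minimum and maximum come from the fitted data, new values outside that range are mapped outside [0, 1].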


### Based on mean and standard deviation values of dataset

In [27]:
from sklearn.preprocessing import StandardScaler

data = [[1,2,3],[3,4,5]]
scaler = StandardScaler()
scaler.fit(data)

print("Mean of data set",scaler.mean_)
print("Standard deviation of data set",pow(scaler.var_,1/2))

print("\nTransformed input data\n",scaler.transform(data))

#new data shape has to be 2D
print("\nNew data [2,2,3] scaled to",scaler.transform([[2,2,3]]))

Mean of data set [2. 3. 4.]
Standard deviation of data set [1. 1. 1.]

Transformed input data
 [[-1. -1. -1.]
 [ 1.  1.  1.]]

New data [2,2,3] scaled to [[ 0. -1. -1.]]
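
The standardized values above correspond to (value - column mean) / column standard deviation, where the standard deviation is the population one (dividing by n, not n - 1). A NumPy sketch of the same arithmetic:

```python
import numpy as np

data = np.array([[1, 2, 3], [3, 4, 5]], dtype=float)
mean = data.mean(axis=0)
std = data.std(axis=0)   # population standard deviation (ddof=0)

scaled = (data - mean) / std
new_scaled = (np.array([[2, 2, 3]]) - mean) / std
print(mean, std)
print(scaled)
print(new_scaled)
```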


## Numpy Tricks

### Returning unique elements as difference between two arrays

In [1]:
import numpy as np

a=[1,2,3,4]
b=[3,4,5]

# return unique elements of a not present in b
np.setdiff1d(a, b)

array([1, 2])

### Creating a random matrix of desired size

In [2]:
import numpy as np

a=np.random.rand(3,2)
print(a.shape)

(3, 2)


### Creating a zero matrix of desired size

In [4]:
import numpy as np

np.zeros([3,2]).shape

(3, 2)

### Creating a ones matrix of desired size

In [6]:
import numpy as np

np.ones((2, 1)).shape

(2, 1)

### Creating an identity matrix of desired size

In [5]:
np.identity(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

### Transpose of an array

In [15]:
y=np.array([[1,2,3,4],[10,0,0,0]])
print("Array:\n",y)
print()
print("Transpose:\n",y.T)

Array:
 [[ 1  2  3  4]
 [10  0  0  0]]

Transpose:
 [[ 1 10]
 [ 2  0]
 [ 3  0]
 [ 4  0]]


### Sum of Elements

In [20]:
#axis=0 translates to columns
print("Column wise sum:",y.sum(axis=0))
#axis=1 translates to rows
print("Row wise sum:",y.sum(axis=1))

#Note: axis is optional; without it, sum() adds up every element of the array


Column wise sum: [11  2  3  4]
Row wise sum: [10 10]


### Mean of Elements

In [22]:
#axis=0 translates to columns
print("Column wise mean:",y.mean(axis=0))
#axis=1 translates to rows
print("Row wise mean:",y.mean(axis=1))

#Note: axis is optional; without it, mean() averages over every element of the array


Column wise mean: [5.5 1.  1.5 2. ]
Row wise mean: [2.5 2.5]
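
Both sum and mean can also be called with no axis at all, in which case the whole array is reduced to a single number. A short example on the same array:

```python
import numpy as np

y = np.array([[1, 2, 3, 4], [10, 0, 0, 0]])
total = y.sum()       # no axis: all eight elements are added together
overall = y.mean()    # no axis: mean over all eight elements
print(total, overall)
```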


### Variance of Elements

In [6]:
y=np.array([[1,2,3,4],[10,0,0,0]])

print("Column wise variance:",y.var(axis=0))

print("Row wise variance:",y.var(axis=1))

Column wise variance: [20.25  1.    2.25  4.  ]
Row wise variance: [ 1.25 18.75]


### Dot Product

In [26]:
print(np.dot(y[0],y[1]))
#or
from sklearn.utils import extmath
print(extmath.safe_sparse_dot(y[0],y[1]))

10
10


In [23]:
# np.dot acts as matrix multiplication when both inputs are 2D; the column count of the first must equal the row count of the second

In [29]:
print(y.shape)
print(y.T.shape)
np.dot(y,y.T)

(2, 4)
(4, 2)


array([[ 30,  10],
       [ 10, 100]])
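
For 2D arrays the @ operator gives the same result as np.dot. The sketch below repeats the product above with @ and shows what happens when the inner dimensions do not match:

```python
import numpy as np

y = np.array([[1, 2, 3, 4], [10, 0, 0, 0]])
product = y @ y.T            # (2, 4) @ (4, 2) -> (2, 2)
print(product)

mismatch_raised = False
try:
    y @ y                    # (2, 4) @ (2, 4): inner dimensions 4 and 2 differ
except ValueError as err:
    mismatch_raised = True
    print("shape mismatch:", err)
```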

### Writing into file

In [None]:
# the with block closes the file automatically, even if an error occurs
with open("filename.txt","w") as file:
    file.write("Please write me")
    file.write("\n")

### Reading file

In [161]:
with open("/home/rahulsuresh/Downloads/mik_dataset/actual_random_forest_features_importances.txt", "r") as f:
    for line in f:
        # each line keeps its trailing newline; line.split() would break it into fields
        print(line)

### Preparing a random classification dataset

In [1]:
from sklearn.datasets import make_classification
from collections import Counter

x, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))

Original dataset shape Counter({1: 900, 0: 100})


### [Getting K nearest vectors of a point based on metric of choice](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors)

In [2]:
import numpy as np
from sklearn.neighbors import NearestNeighbors

# recent scikit-learn versions require these arguments to be passed by keyword
neigh = NearestNeighbors(n_neighbors=5, radius=0.4, metric='cosine')
neigh.fit(x)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=0.4)

In [3]:
distances, points=neigh.kneighbors([x[0]],5,return_distance=True)

In [4]:
points
# Note: x[0] is part of the fitted data, so it usually comes back as its own nearest neighbour (index 0 here)

array([[  0, 259, 429, 748, 360]])
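
To show what the cosine metric measures, here is a brute-force NumPy sketch on a tiny made-up dataset. The helper cosine_neighbours is hypothetical and is not how scikit-learn searches internally; it only illustrates that cosine distance is 1 minus cosine similarity and that the query point, when it is part of the data, comes first with distance 0.

```python
import numpy as np

def cosine_neighbours(points, query, k):
    # Hypothetical helper: brute-force k nearest rows by cosine distance
    points = np.asarray(points, dtype=float)
    query = np.asarray(query, dtype=float)
    sims = points @ query / (np.linalg.norm(points, axis=1) * np.linalg.norm(query))
    dists = 1.0 - sims                       # cosine distance = 1 - cosine similarity
    order = np.argsort(dists)[:k]            # indices of the k smallest distances
    return dists[order], order

pts = [[1, 0], [0, 1], [1, 1], [2, 0.1]]
d, idx = cosine_neighbours(pts, pts[0], 2)
print(idx, d)
```

Cosine distance ignores vector length, so [2, 0.1] is closer to [1, 0] than [1, 1] is, even though it lies further away in Euclidean terms.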