# _Trial 1 Notebook: August 15, 2019_

__Notebook adapted from Rachel Thomas' [notebook](https://github.com/fastai/course-nlp/blob/master/3-logreg-nb-imdb.ipynb) for fast.ai's course on NLP.__

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai import *
from fastai.text import *

In [3]:
import sklearn.feature_extraction.text as sklearn_text

### _Tokenizing and term document matrix creation_

- to start, it is a good idea to work on a sample of the data before using the full dataset
    - computations are quicker while you debug and get the code working
- we'll be looking at the IMDB dataset, which is hosted via [AWS Open Datasets](https://course.fast.ai/datasets.html)

In [4]:
path = untar_data(URLs.IMDB_SAMPLE)
path

PosixPath('/Users/jai/.fastai/data/imdb_sample')

In [5]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


### _fast.ai's_ [`TextList`](https://docs.fast.ai/text.data.html#TextList)

- is based on [`ItemList`](https://docs.fast.ai/data_block.html#ItemList), which is the basic class that holds your inputs
    - can be an `ImageList`, a `TabularList`, or in our case a `TextList`
- it regroups inputs for our model in `items` and saves a `path` attribute where it will look for any files
    - `label_cls` will be called to create the labels from result of label function
    - `inner_df` is underlying dataframe
    - `processor` to be applied to inputs after splitting and labeling

In [6]:
movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
                         .split_from_df('is_valid')
                         .label_from_df('label'))

### _Explore the Data!_

In [7]:
movie_reviews.valid.x[0], movie_reviews.valid.y[0]

(Text xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . 
  
   xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . 
  
   xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk ,

In [8]:
len(movie_reviews.train.x), len(movie_reviews.valid.x)

(800, 200)

**NOTE**: `itos` (int-to-string) and `stoi` (string-to-int) have different lengths. Why is this?

In [9]:
len(movie_reviews.vocab.itos), len(movie_reviews.vocab.stoi)

(6016, 19160)

In [10]:
movie_reviews.vocab.itos[2:21]

['xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the',
 '.',
 ',',
 'and',
 'a',
 'of',
 'to',
 'is',
 'it',
 'in',
 'i',
 'that']

In [11]:
movie_reviews.vocab.itos[6009]

'sollett'

In [12]:
movie_reviews.vocab.stoi['language']

917

In [13]:
movie_reviews.vocab.itos[917]

'language'

**Remember**: numbers are for algorithms, words are for people (i.e. interpretability)

- `language` maps to `917`; it can appear many times in a review, and every occurrence is encoded as `917`
- capitalization and the many words that map to unknown (`xxunk`) are two examples of why `itos` and `stoi` differ in length
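One plausible mechanism for the length mismatch can be reproduced with a toy vocabulary. This is a minimal sketch, not fastai's actual code: the four-token `itos` list is made up, and it assumes only what the notebook shows, namely that `stoi` is a `collections.defaultdict` and that unknown words resolve to index `0` (`xxunk`).

```python
from collections import defaultdict

# Hypothetical toy vocabulary; index 0 plays the role of 'xxunk'.
itos = ['xxunk', 'xxpad', 'the', 'language']

# A defaultdict(int) returns 0 for any string it has not seen before.
stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})
print(len(itos), len(stoi))   # 4 4

# Looking up an unseen word returns 0, and also stores the new key.
unk_id = stoi['rrachell']
print(unk_id, itos[unk_id])   # 0 xxunk
print(len(itos), len(stoi))   # 4 5
```

Each lookup of an out-of-vocabulary string adds a key to `stoi` without adding anything to `itos`, so under this assumption `stoi` grows as more text is numericalized.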

In [14]:
type(movie_reviews.vocab.stoi)

collections.defaultdict

In [15]:
type(movie_reviews.vocab.itos)

list

A list takes less space to store than a dict, because the index is implicit in each element's position.
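A rough way to check the space claim is `sys.getsizeof`, which reports the size of the container itself (not of the strings it refers to). The token names below are made up for illustration, and the exact byte counts depend on the Python version, so none are quoted here.

```python
import sys

# Hypothetical tokens standing in for a vocabulary.
tokens = [f"tok{i}" for i in range(1000)]

itos = list(tokens)                               # position in the list is the index
stoi = {tok: i for i, tok in enumerate(tokens)}   # index stored explicitly per key

list_bytes = sys.getsizeof(itos)
dict_bytes = sys.getsizeof(stoi)
print(list_bytes, dict_bytes)
```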

In [16]:
movie_reviews.vocab.itos[movie_reviews.vocab.stoi['rrachell']]

'xxunk'

In [17]:
movie_reviews.vocab.itos[movie_reviews.vocab.stoi['language']]

'language'

In [18]:
t = movie_reviews.train[0][0]

In [19]:
t.data[:30]

array([   2,    5, 4622,   25,    0,   25,  867,   52,    5, 3776,    5, 1800,   95,   37,   85,  191,   63,  936,
          0, 2740,  517,   18,   21,   11,   84, 2418,  192,   88, 3777,   63])

### _Creating a term-document matrix_

- term-document matrix represents a document as a _"bag of words"_
    - doesn't keep track of the order the words are in, just which occur and how often
- the previous lesson used sklearn's [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
    - we are going to create a similar version
        - to help understand what is going on under the hood
        - to create something that will work with a fastai `TextList`
- for this we need to learn about _counters_ and _sparse matrices_

In [20]:
c = Counter([4, 2, 8, 8, 4, 8])
c

Counter({4: 2, 2: 1, 8: 3})

In [21]:
c.values()

dict_values([2, 1, 3])

In [22]:
c.keys()

dict_keys([4, 2, 8])
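To connect `Counter` back to the bag-of-words idea, here is a small sketch with two made-up documents of token ids (the ids and `vocab_len` are arbitrary). It shows that word order is discarded and that one document becomes one row of counts.

```python
from collections import Counter

# Two hypothetical documents: same tokens, different order.
doc_a = [2, 9, 14, 9, 30]
doc_b = [9, 30, 9, 14, 2]

bag_a, bag_b = Counter(doc_a), Counter(doc_b)
print(bag_a == bag_b)   # True: a bag of words ignores order

# One row of a (dense) term-document matrix: count of every vocab id.
vocab_len = 32
row = [bag_a[j] for j in range(vocab_len)]   # missing ids count as 0
print(row[9], row[14], sum(row))             # 2 1 5
```

Most entries of `row` are zero, which is the motivation for the sparse storage discussed next.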

### _Sparse Matrices (Scipy)_

- most tokens don't appear in most reviews, want to take advantage of this by storing data as **sparse matrix**
    - can save a lot of memory by storing only non-zero values
- _Coordinate-wise storage method_
    - sparse matrix is stored using 3 arrays
        - `Val[N]` --> contains value of non-zero elements
        - `Row[N]` --> contains row-index of non-zero elements
        - `Col[N]` --> contains column-index of the non-zero elements
    - accessing values is straightforward: `Val[k]` is the entry at row `Row[k]`, column `Col[k]`
- _Compressed Sparse Row (CSR) Data Structure_
    - stored in 3 arrays
        - `Val[N]`: contains the value of the non-zero elements
        - `RowPtr[R+1]` (for a matrix with `R` rows): `RowPtr[i]` to `RowPtr[i+1]` is the range of positions in `Val` and `Col` that belong to row `i`
        - `Col[N]`: contains the column-index of the non-zero elements
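The two layouts can be inspected on a small matrix with SciPy. This is only an illustration: the 3x4 matrix is made up, and the mapping of names is an assumption of this sketch (`data`, `row`, `col` for `Val`, `Row`, `Col` in the coordinate format; `data`, `indices`, `indptr` for `Val`, `Col`, `RowPtr` in CSR).

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x4 matrix with three non-zero entries and one empty row.
dense = np.array([[0, 2, 0, 0],
                  [0, 0, 0, 0],
                  [1, 0, 0, 3]])

# Coordinate-wise storage: one (value, row, column) triple per non-zero.
coo = sparse.coo_matrix(dense)
print(coo.data, coo.row, coo.col)          # [2 1 3] [0 2 2] [1 0 3]

# CSR storage: row indices are replaced by one pointer per row boundary.
csr = sparse.csr_matrix(dense)
print(csr.data, csr.indices, csr.indptr)   # [2 1 3] [1 0 3] [0 1 1 3]
```

Note that `indptr` has one more entry than the matrix has rows, and the empty middle row shows up as two equal consecutive pointers.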
        
__Advantage of CSR method over Coordinate-wise method:__
- the number of arithmetic operations needed for a matrix-vector multiplication is the same in both storage methods...
- however, the CSR method needs exactly 2 fewer memory accesses each time the update statement runs
- the statement in the coordinate-wise method:
    - `result[Row[k]] = result[Row[k]] + Val[k]*d[Col[k]];`
- the statement in the CSR method:
    - `result[i] = result[i] + Val[k]*d[Col[k]];`
- the difference is the two reads of `Row[k]`: in CSR the row index `i` is a loop variable, so it does not have to be fetched from an array
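The two update statements above can be written out as complete matrix-vector products. This is a plain-Python sketch for readability, not how SciPy implements it; the function names and the small example arrays are hypothetical.

```python
# Hypothetical 3x4 matrix with non-zeros (0,1)=2, (2,0)=1, (2,3)=3.
val = [2, 1, 3]
row = [0, 2, 2]            # coordinate-wise row indices
col = [1, 0, 3]            # column indices (shared by both formats)
row_ptr = [0, 1, 1, 3]     # CSR row pointers
d = [1.0, 10.0, 100.0, 1000.0]

def matvec_coo(val, row, col, d, n_rows):
    """Coordinate-wise: the row index is read from Row[k] for every non-zero."""
    result = [0.0] * n_rows
    for k in range(len(val)):
        result[row[k]] = result[row[k]] + val[k] * d[col[k]]
    return result

def matvec_csr(val, row_ptr, col, d):
    """CSR: the row index i comes from the outer loop, not from an array."""
    result = [0.0] * (len(row_ptr) - 1)
    for i in range(len(result)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            result[i] = result[i] + val[k] * d[col[k]]
    return result

print(matvec_coo(val, row, col, d, n_rows=3))   # [20.0, 0.0, 3001.0]
print(matvec_csr(val, row_ptr, col, d))         # [20.0, 0.0, 3001.0]
```

Both loops perform one multiply and one add per non-zero element; only the way the row index is obtained differs.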

### _Our version of CountVectorizer_

In [23]:
doc = movie_reviews.train[0]

In [24]:
doc

(Text xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !,
 Category negative)

In [25]:
Counter((movie_reviews.valid.x)[0].data)

Counter({2: 1,
         5: 32,
         21: 3,
         71: 1,
         189: 1,
         748: 1,
         289: 1,
         285: 1,
         62: 2,
         221: 1,
         666: 2,
         59: 1,
         13: 4,
         2707: 1,
         14: 6,
         2877: 1,
         11: 10,
         18: 2,
         358: 1,
         0: 32,
         77: 1,
         15: 6,
         478: 1,
         1833: 1,
         50: 3,
         9: 10,
         319: 1,
         6: 1,
         2745: 1,
         12: 1,
         115: 1,
         4129: 1,
         197: 2,
         1331: 1,
         25: 2,
         324: 1,
         10: 7,
         3963: 1,
         16: 4,
         74: 1,
         24: 3,
         2819: 1,
         5823: 1,
         2597: 1,
         710: 1,
         3430: 1,
         84: 1,
         149: 1,
         20: 1,
         26: 1,
         605: 1,
         378: 1,
         1057: 1,
         251: 1,
         258: 1,
         1346: 1,
         194: 1,
         239: 1,
         49: 1,
         27

In [26]:
movie_reviews.vocab.itos[6]

'xxup'

In [27]:
(movie_reviews.valid.x)[1]

Text xxbos i saw this movie once as a kid on the late - late show and fell in love with it . 
 
  xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . 
 
  xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk

In [28]:
(movie_reviews.valid.x)[0]

Text xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . 
 
  xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . 
 
  xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk , xxma

In [29]:
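import scipy.sparse  # explicit import; the star imports above may already provide it

def get_term_doc_matrix(label_list, vocab_len):
    # Build the three CSR arrays directly, one row per document
    j_indices = []   # column index (token id) of each stored count
    indptr = [0]     # indptr[i]:indptr[i+1] delimits row i in j_indices/values
    values = []      # how often each token occurs in its document

    for doc in label_list:
        feature_counter = Counter(doc.data)
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))

    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)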
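To see what the function builds, the same logic can be run on plain lists of token ids. This is a stand-alone sketch: `toy_term_doc_matrix` is a hypothetical copy that takes lists directly instead of fastai `Text` objects (so it reads `doc` rather than `doc.data`), and the two documents are made up.

```python
from collections import Counter
import scipy.sparse

def toy_term_doc_matrix(docs, vocab_len):
    """Same construction as above, but for plain lists of token ids."""
    j_indices, values, indptr = [], [], [0]
    for doc in docs:
        counts = Counter(doc)
        j_indices.extend(counts.keys())     # which tokens occur
        values.extend(counts.values())      # how often they occur
        indptr.append(len(j_indices))       # where the next row starts
    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)

# Two hypothetical documents over a vocabulary of 6 token ids.
docs = [[2, 5, 5, 0], [2, 3]]
m = toy_term_doc_matrix(docs, vocab_len=6)
print(m.shape)       # (2, 6)
print(m.toarray())
# [[1 0 1 0 0 2]
#  [0 0 1 1 0 0]]
```

Each row is one document and each column one vocabulary entry, so row sums give document lengths in tokens.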

In [30]:
%%time
val_term_doc = get_term_doc_matrix(movie_reviews.valid.x, len(movie_reviews.vocab.itos))

CPU times: user 41.3 ms, sys: 3.51 ms, total: 44.8 ms
Wall time: 59 ms


In [31]:
%%time
train_term_doc = get_term_doc_matrix(movie_reviews.train.x, len(movie_reviews.vocab.itos))

CPU times: user 146 ms, sys: 9.02 ms, total: 155 ms
Wall time: 151 ms


In [36]:
train_term_doc.shape

(800, 6016)

In [37]:
train_term_doc[:, -10:]

<800x10 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [38]:
val_term_doc.shape

(200, 6016)

### _More data exploration_

In [39]:
movie_reviews.vocab.itos[:4]

['xxunk', 'xxpad', 'xxbos', 'xxeos']

In [40]:
val_term_doc.todense()[:10,:10]

matrix([[32,  0,  1,  0, ...,  1,  0,  0, 10],
        [ 9,  0,  1,  0, ...,  1,  0,  0,  7],
        [ 6,  0,  1,  0, ...,  0,  0,  0, 12],
        [78,  0,  1,  0, ...,  0,  0,  0, 44],
        ...,
        [ 8,  0,  1,  0, ...,  0,  0,  0,  8],
        [43,  0,  1,  0, ...,  8,  1,  0, 25],
        [ 7,  0,  1,  0, ...,  1,  0,  0,  9],
        [19,  0,  1,  0, ...,  2,  0,  0,  5]])

In [41]:
movie_reviews.vocab.itos[-1:]

['xxfake']

In [42]:
review = movie_reviews.valid.x[1]
review

Text xxbos i saw this movie once as a kid on the late - late show and fell in love with it . 
 
  xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . 
 
  xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk