This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There are a mix of written questions and coding exercises, both are equally important to prepare you for the sprint challenge as well as to be able to speak on these topics comfortably in interviews and on the job.

If you get stuck or are unsure of something remember the 20 minute rule. If that doesn't help, then research a solution with google and stackoverflow. Only once you have exausted these methods should you turn to your Team Lead - they won't be there on your SC or during an interview. That being said, don't hesitate to ask for help if you truly are stuck.

Have fun studying!

In [63]:
import pandas as pd
import re
import spacy.cli
spacy.cli.download("en_core_web_sm")
nlp = spacy.load('en_core_web_sm')
from nltk.corpus import stopwords

from collections import Counter

# Plotting
import squarify
import matplotlib.pyplot as plt
import seaborn as sns

# NLP Libraries
import spacy
from spacy.tokenizer import Tokenizer
from nltk.stem import PorterStemmer

[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [7]:
import numpy as np

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/bundickm/Study-Guides/master/data/cannabis.csv')
print('Shape:', df.shape, '\n')
df.head()

Shape: (2351, 6) 



Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description
0,100-Og,hybrid,4.0,"Creative,Energetic,Tingly,Euphoric,Relaxed","Earthy,Sweet,Citrus",$100 OG is a 50/50 hybrid strain that packs a ...
1,98-White-Widow,hybrid,4.7,"Relaxed,Aroused,Creative,Happy,Energetic","Flowery,Violet,Diesel",The ‘98 Aloha White Widow is an especially pot...
2,1024,sativa,4.4,"Uplifted,Happy,Relaxed,Energetic,Creative","Spicy/Herbal,Sage,Woody",1024 is a sativa-dominant hybrid bred in Spain...
3,13-Dawgs,hybrid,4.2,"Tingly,Creative,Hungry,Relaxed,Uplifted","Apricot,Citrus,Grapefruit",13 Dawgs is a hybrid of G13 and Chemdawg genet...
4,24K-Gold,hybrid,4.6,"Happy,Relaxed,Euphoric,Uplifted,Talkative","Citrus,Earthy,Orange","Also known as Kosher Tangie, 24k Gold is a 60%..."


# Tokens

## Definitions

Define the following terms in your own words, do not simply copy and paste a definition found elsewhere but reword it to be understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Natural Language Processing**: `It is the combined study of linguistics, computer science, and artificial intelligence. It's how computers can process large amounts of documents with human language`

**Token**: `The smallest element of language. So for example, a word could be a token in a sentence. Tokens don't necissarily have to be words however. They can also be numbers.`

**Corpus**: `A corpus is a collection of machine-readable text that represents the documents it was taken from. For example, a corpus from books published by NASA would include a lot of words relating to space. A corpus taken from a collection of cookbooks would contain a lot of words about food.`

**Stopwords**: `Filler words in text that don't add much context to the sentence. Think "and, about, this, is"`

**Statistical Trimming**: `The process of removing the most and least frequent occurences or the outliers`

**Stemming**: `Cutting off the end of text to just leave the base. This doesn't change tenses of words`

**Lemmatization**: `Breaking down text to its base. For example, turning aplural word into the singular word, a past tense into present and so forth.`

**Vectorization**: `Representing text with a numerical vector so it can be processed by computers`

## Questions of Understanding

1. What are at least 4 common cleaning tasks you need to do when creating tokens?
 1. `lowercase all = text.lower()`
 2. `remove punctuation`
 3. `remove extra spaces & reduce outliers`
 4. `lemmitize`

2. Why is it important to apply custom stopwords to our dataset in addition to the ones that come in a library like spaCy?
```
So we can get better results and uncover the words that are really important
```

3. Explain the tradeoffs between statistical trimming, stemming, and lemmatizing.
```
In statistical trimming, you are losing the most common and least common data but sometimes in a rather imprecise way. 
```

4. Why do we need to vectorize our documents?
```
Your Answer Here
```

## Practice Problems

Write a function to tokenize the `Description` column. Make sure to include the following:
- Return the tokens in an iterable structure
- Normalize the case
- Remove non-alphanumeric characters such as punctuation, whitespace, unicode, etc.
- Apply stopwords and make sure to add stopwords specific to this dataset
- Lemmatize the tokens before returning them

In [59]:
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2"}

In [60]:
nlp.Defaults.stop_words

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'example',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'f

In [45]:
df['Test'] = df['Description'].astype('str')

In [51]:
def clean_token(text, allowed = ['NOUN', 'ADJ', 'VERB', 'ADV']):

    lemmas = []
    
#     # remove anything outside of letters and numbers
    text = re.sub('[^a-zA-Z]',' ', text)
    
    doc = nlp(text)
    
    for token in doc: 
        if ((token.is_stop == False) and (token.is_punct == False) and
            (token.pos_ != allowed)):
            lemmas.append(token.lemma_.lower())
    
    return lemmas

In [52]:
df['Tokens'] = df['Test'].apply(lambda text: clean_token(text))

In [53]:
df.head()

Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description,Test,Tokens
0,100-Og,hybrid,4.0,"Creative,Energetic,Tingly,Euphoric,Relaxed","Earthy,Sweet,Citrus",$100 OG is a 50/50 hybrid strain that packs a ...,$100 OG is a 50/50 hybrid strain that packs a ...,"[ , og, , hybrid, strain, pack, stro..."
1,98-White-Widow,hybrid,4.7,"Relaxed,Aroused,Creative,Happy,Energetic","Flowery,Violet,Diesel",The ‘98 Aloha White Widow is an especially pot...,The ‘98 Aloha White Widow is an especially pot...,"[ , aloha, white, widow, especially, potent..."
2,1024,sativa,4.4,"Uplifted,Happy,Relaxed,Energetic,Creative","Spicy/Herbal,Sage,Woody",1024 is a sativa-dominant hybrid bred in Spain...,1024 is a sativa-dominant hybrid bred in Spain...,"[ , sativa, dominant, hybrid, breed, spain..."
3,13-Dawgs,hybrid,4.2,"Tingly,Creative,Hungry,Relaxed,Uplifted","Apricot,Citrus,Grapefruit",13 Dawgs is a hybrid of G13 and Chemdawg genet...,13 Dawgs is a hybrid of G13 and Chemdawg genet...,"[ , dawgs, hybrid, g, , chemdawg, genetic,..."
4,24K-Gold,hybrid,4.6,"Happy,Relaxed,Euphoric,Uplifted,Talkative","Citrus,Earthy,Orange","Also known as Kosher Tangie, 24k Gold is a 60%...","Also known as Kosher Tangie, 24k Gold is a 60%...","[know, kosher, tangie, , k, gold, , ind..."


Apply your function to `Description` and save the resulting tokens in a new column, `Tokens`

In [None]:
# Done

Use the function below to create a `word_count` dataframe based off the `df['Tokens']` column you created.

In [64]:
def count(docs):
        word_counts = Counter()
        appears_in = Counter()
        total_docs = len(docs)

        for doc in docs:
            word_counts.update(doc)
            appears_in.update(set(doc))

        temp = zip(word_counts.keys(), word_counts.values())
        wc = pd.DataFrame(temp, columns = ['word', 'count'])

        wc['rank'] = wc['count'].rank(method='first', ascending=False)
        total = wc['count'].sum()

        wc['pct_total'] = wc['count'].apply(lambda x: x / total)
        
        wc = wc.sort_values(by='rank')
        wc['cul_pct_total'] = wc['pct_total'].cumsum()

        t2 = zip(appears_in.keys(), appears_in.values())
        ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
        wc = ac.merge(wc, on='word')

        wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)
        
        return wc.sort_values(by='rank')

In [67]:
r = df['Tokens'].apply(count)

In [69]:
word_count = pd.DataFrame(data=r)

In [70]:
word_count

Unnamed: 0,Tokens
0,word appears_in count rank pct_total c...
1,word appears_in count rank pct_total c...
2,word appears_in count rank pct_total c...
3,word appears_in count rank pct_total c...
4,word appears_in count rank pct_total c...
...,...
2346,word appears_in count rank pct_total c...
2347,word appears_in count rank pct_total c...
2348,word appears_in count rank pct_total c...
2349,word appears_in count rank pct_total c...


Run the line of code below, and then explain how to interpret the graph.

```
Your Answer Here
```

In [72]:
sns.lineplot(x='rank', y='cul_pct_total', data=word_count);

ValueError: Could not interpret value `rank` for parameter `x`

# Vectorization

## Definitions

Define the following terms in your own words, do not simply copy and paste a definition found elsewhere but reword it to be understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Vectorization**: `Your Answer Here`

**Document Term Matrix (DTM)**: `Your Answer Here`

**Latent Semantic Analysis**: `Your Answer Here`

**Term Frequency - Inverse Document Frequency (TF-IDF)**: `Your Answer Here`

**Word Embedding**: `Your Answer Here`

**N-Gram**: `Your Answer Here`

**Skip-Gram**: `Your Answer Here`

## Questions of Understanding

1. Why do we need to vectorize our documents?
```
Your Answer Here
```

2. How is TF-IDF different from simple word frequency? Why do we use TF-IDF over word frequency?
```
Your Answer Here
```

3. Why might we choose a word embedding approach over a bag-of-words approach when it comes to vectorization?
```
Your Answer Here
```

## Practice Problems

Use the dataframe `df` above to complete the following.

Vectorize the `Tokens` column.

Build a Nearest Neighbors model from your dataframe and then find the 5 nearest neighbors to the strain "100-OG"

You will be putting together a classification model below, but before you do you'll need a baseline. Run the line of code below and then find the normalized value counts for the `Rating` column in `df`.

In [None]:
df['Rating'] = df['Rating'].round().astype(int)

What is the baseline accuracy?
```
Your Answer Here
```

Visualize the rating counts from above

Use your vectorized tokens in the `df` dataframe to train a classification model

Predict the score of the fake strain description below.

```
'Afgooey, also known as Afgoo, is a potent indica strain that is believed to descend from an Afghani indica and Maui Haze. 
Its sativa parent may lend Afgoo some uplifting, creative qualities, but this strain undoubtedly takes after its indica 
parent as it primarily delivers relaxing, sleepy effects alongside its earthy pine flavor. Growers hoping to cultivate Afgoo 
may have a better chance of success indoors, but this indica can also thrive in Mediterranean climates outdoors.'
```

# Topic Modeling

## Questions of Understanding

1. What is Latent Dirichlet Allocation? What is another name for LDA in NLP?
```
Your Answer Here
```

2. How do interpret the results of a topic modeling output?
```
Your Answer Here
```

## Practice Problems

Find the top 5 topics of the `Description` column using LDA

In a short paragraph, explain how to interpret the first topic your model came up with. If your topic words are difficult to interpret, explain how you could clean up the descriptions to improve your topics

```
Your Answer Here
```

Use `pyLDAvis` to create a visualization to help you interpret your topic modeling results

Explain how to interpret the results of `pyLDAvis`

```
Your Answer Here
```

Create at least 1 more visualization to help you interpret the results of your topic modeling