# This notebook gives a quick introduction to a simple NLP model, then moves on to some language exercises.

# Language Kitties

These kitties are all trying to study Natural Language Processing, but they're having some trouble. First off, let's install some packages to help us!

In [170]:
%%capture 

!pip install scikit-learn
!pip install nltk
!pip install pandas

import nltk
import sklearn
import re
import pandas as pd

def hint(code):
    hints = [
        "kitty_dataset = pd.DataFrame(kitty_list)",
        "use kitty_dataset.columns"
    ]
    return hints[code]

print("Done importing")

# Bag-of-words Kitty




This kitty wants to try out a very simple bag-of-words model: using Naive Bayes, you simply determine whether a text is positive or negative (or in our case, whether it is a "dog" or a "cat" sentence) based on the individual words in the sentence. It's simple, yet still effective. To start off, let's make our own dataset with sentences and their sentiment (here, the "dog" or "cat" label).

**For a later part to work, fill in sentences 14 through 19, and make number 20 your test sentence, which you will use to test your model later.**

In [166]:
kitty_list = [
    ["I like dogfood", "dog"], #1
    ["Woof!", "dog"], #2
    ["I bark at people", "dog"], #3
    ["I chase my own tail", "dog"], #4
    ["I'm a good boy", "dog"], # 5
    ["I Love dogs", "dog"], # 6
    ["I like wetfood", "dog"], # 7
    ["I miauw at people", "cat"], #8
    ["I purr when stroked", "cat"], # 9
    ["I'm one of the kitties", "cat"], # 10
    ["I hate dogs", "cat"], # 11
    ["I love cats", "cat"], # 12
    ["I'm way smarter", "cat"], # 13
    ["..", "dog"], # 14
    ["..", "cat"], # 15
    ["..", "dog"], # 16
    ["..", "cat"], # 17
    ["..", "dog"], # 18
    ["..", "cat"], # 19
    ["I'm a cute cat", "cat"], # Test Sentence
]

Next, we'll need to convert this simple nested list to a pandas DataFrame. You can google how to do it (yes, it's actually *that* easy).

In [None]:
# Convert kitty_list into a pandas kitty_dataset
kitty_dataset = ..

In [None]:
# hint
hint(0)

Now, let's add some columns to the dataset. Think about what the column names could be. What is our data, and what are our labels?

In [None]:
# Give the dataset some columns
kitty_dataset.columns = ..

In [None]:
# hint
hint(1)

'use kitty_dataset.columns'

Let's take a look at our dataset now

In [None]:
kitty_dataset

# Cleaning the kitty Dataset



Awesome! Now, we'll want to clean our dataset so that the eventual model won't learn unnecessary things! There are several ways to do this. A good strategy is to remove stop words (like "the" or "at"), as these don't really add anything to classification. Another is stemming, which reduces each word to its root so that different forms are treated the same (like "running/runs" -> "run"). The `normalize` function in this section starts with the basics: it removes punctuation and makes everything lowercase.

First, let's download the stopwords!

In [None]:
nltk.download("stopwords")

Now, we'll apply all the transformations to all the sentences we have, and add them to our corpus!

In [None]:
corpus = []

# Create our normalize function
def normalize(text):
    text = text.strip("!,.")
    text = text.replace("'", "")
    text = text.lower()
    return text

# Run it over all sentences (the first column), and add the results to our corpus
for text in kitty_dataset.iloc[:, 0]:
    corpus.append(normalize(text))

In [None]:
# Let's see what our corpus looks like
corpus
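
The `normalize` function above only handles punctuation and casing. As a rough sketch of the two other cleaning steps mentioned earlier, the snippet below removes stop words and applies a toy stemmer. Both `TOY_STOP_WORDS` and `toy_stem` are hypothetical stand-ins written for this illustration; they are not NLTK's downloaded stop word list or a real stemming algorithm.

```python
# Hypothetical, hand-picked stop words; NLTK's English list is much longer.
TOY_STOP_WORDS = {"i", "a", "at", "my", "own", "the", "of", "when"}

def toy_stem(word):
    # Very rough suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run"
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word

def clean(text):
    # Lowercase, drop stop words, then stem whatever is left
    words = text.lower().split()
    kept = [toy_stem(word) for word in words if word not in TOY_STOP_WORDS]
    return " ".join(kept)

cleaned = clean("i chase my own tail")
```

In a real project you would probably swap these for a proper stop word list and stemmer from a library such as NLTK.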

# Corpus training

As you can see, our corpus now contains the raw words in lowercase, without any special characters! Next, we'll use the CountVectorizer to turn each sentence into a row of numbers, so that the machine can understand us.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)

X = cv.fit_transform(corpus).toarray()
y = kitty_dataset.iloc[:, 1].values

In [None]:
# Run this
X[0]

As you can see here, our first entry is just a list of numbers (an array, to be precise). Every position represents a word that exists ANYWHERE in our dataset. The values (mostly 0 or 1) count how many times that word is present in this sentence. It's simply a bag of words.
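
To make the idea concrete, here is a minimal pure-Python sketch of the same counting. The names `toy_corpus`, `vocabulary` and `to_counts` are made up for this illustration and are not part of scikit-learn.

```python
toy_corpus = ["i like dogfood", "i love cats", "i love dogs and cats"]

# The vocabulary is every distinct word in the corpus, in sorted order
vocabulary = sorted({word for sentence in toy_corpus for word in sentence.split()})

def to_counts(sentence):
    words = sentence.split()
    # One number per vocabulary word: how often it occurs in this sentence
    return [words.count(word) for word in vocabulary]

vectors = [to_counts(sentence) for sentence in toy_corpus]
```

Note that this sketch keeps every whitespace-separated word, whereas CountVectorizer's default tokenizer ignores one-letter words such as "i", so the real vectors will look slightly different.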

Next, we'll split off our dataset into train and test, where test in this case is our 20th sentence.

In [None]:

# fitting naive bayes to the training set
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, shuffle=False)

classifier = GaussianNB()
classifier.fit(X_train, y_train)
print("Done with training!")
 

Awesome, now let's see if we managed to successfully predict our last sentence! You can change the last sentence if you want and re-run all the cells to try again!

In [None]:
classifier.predict(X_test)
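
If you're curious what a Naive Bayes classifier does with word counts, the sketch below may help. It is a simplified word-count (multinomial-style) variant with add-one smoothing, not the `GaussianNB` used above, and the helper names `train_nb` and `predict_nb` as well as the toy sentences are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(sentences, labels):
    # Count how often each word appears under each label
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for sentence, label in zip(sentences, labels):
        word_counts[label].update(sentence.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def predict_nb(sentence, word_counts, label_counts, vocab):
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior of the label
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in sentence.lower().split():
            # Add-one smoothing so unseen words don't zero out the score
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    # The label with the highest score wins
    return max(scores, key=scores.get)

toy_sentences = ["i bark at people", "i chase my own tail",
                 "i purr when stroked", "i hate dogs"]
toy_labels = ["dog", "dog", "cat", "cat"]
model = train_nb(toy_sentences, toy_labels)
prediction = predict_nb("i purr", *model)
```

Each word nudges the score towards the label it was seen with most often, which is why a single telling word like "purr" can decide the outcome.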

# Chat challenge



This exercise has two parts. First, you'll make a little function that takes in three values: (chat, nora, true_value). The function must return "nora" (NO CAPITAL LETTER) if nora's value is closer to true_value; otherwise it returns "chat". So for example:

```python
closer(20, 15, 16) -> "nora"
closer(5, 99, 10) -> "chat"
```

* Make a function called closer(chat, nora, true_value) that checks which number, chat's or nora's, is closer to the true value
* Return either "chat" or "nora"
* Hint: chat will probably suggest a helpful method; this is fine to use

Simple, right?

In [None]:
# work here

In [186]:
# Example










def closer(chat, nora, true_value):
    return "nora" if abs(true_value - nora) < abs(true_value - chat) else "chat"

closer(1, 5, -19)

'chat'

Great, now let's fill in this game function. The function takes *num_wins* as its argument, which is the number of wins either chat or nora needs to win the game. You must fill in the following:

* At part **A**: Ask the chat for a number and store it as **chat**. Then think of a number yourself that is at least 10 higher or lower than chat's number, and store it as **nora**
* At part **B**: use your closer() function to find out whether chat or nora won, store it as **winner**
* At part **C**: Add a "*" to the winner's list!

In [None]:
import random
from IPython.display import clear_output

def game(num_wins):
    chat_wins = []
    nora_wins = []
    count = 0

    while True:
        # Making sure to generate a number that's difficult to guess
        first_num_start = random.randint(1, 15)
        first_num_end = random.randint(first_num_start, first_num_start * 5)
        first_num = random.randint(first_num_start, first_num_end)
        true_value = random.randint(first_num, first_num * 5)

        # A !!! get the input values for chat and nora


        # B !!! determine the winner


        # C !!!  add a "*" to the winner list


        # The rest is already added
        count += 1
        print("Round", count)
        print("Chat Wins:", "".join(chat_wins))
        print("Nora Wins:", "".join(nora_wins), "\n")

        if (len(nora_wins) == num_wins) or (len(chat_wins)  == num_wins):
            print("Nora Wins!" if len(nora_wins) > len(chat_wins) else "Chat Wins!")
            return




game(2)


In [None]:
# Example











import random
from IPython.display import clear_output

def game(num_wins):
    chat_wins = []
    nora_wins = []
    count = 0

    while True:
        # Making sure to generate a number that's difficult to guess
        first_num_start = random.randint(1, 15)
        first_num_end = random.randint(first_num_start, first_num_start * 5)
        first_num = random.randint(first_num_start, first_num_end)
        true_value = random.randint(first_num, first_num * 5)

        # A !!! get the input values for chat and nora
        chat = 20
        nora = 60


        # B !!! determine the winner
        winner = closer(chat, nora, true_value)

        # C !!!  add a "*" to the winner list
        if winner == "nora":
            nora_wins.append("*")
        else:
            chat_wins.append("*")

        # The rest is already added
        count += 1
        print("Round", count)
        print("Chat Wins:", "".join(chat_wins))
        print("Nora Wins:", "".join(nora_wins), "\n")

        if (len(nora_wins) == num_wins) or (len(chat_wins)  == num_wins):
            print("Nora Wins!" if len(nora_wins) > len(chat_wins) else "Chat Wins!")
            return



game(2)

# Kitty De-compression




This question involves decompressing a compressed string. Your input is a compressed string of the format **number[string]**, and the decompressed output should be the string repeated number times. For example:

```python
"3[abc]"          -> abcabcabc            (3 * abc)
"4[a]c"           -> aaaac                (4 * a) + (c)
"3[abc]4[ab]c"    -> abcabcabcababababc   (both combined)
```

Since this is a junior exercise, the tests won't involve nested versions like **2[3[a]b]**.

* Name the function decompress()
* Start with a simple example, then move on to the more difficult example if you can
* return the output as a string

Hint: It might be easier to work with a list version of the compressed string: list(compressed_string)

In [159]:
# Work here
easy = "3[abc]"
difficult = "3[abc]4[ab]c"

..


In [None]:
# Test here

assert(decompress("4[a]c") == "aaaac")
assert(decompress("3[c]4[a]") == "cccaaaa")
assert(decompress("2[ac]3[b]1[a]1[d]") == "acacbbbad")

In [None]:
# Example solution

















# There are several ways of solving this problem
# This method would also solve the actual Google interview assignment,
# with nested versions like "3[ab2[ac]]c":
def decompress(compressed_string):
    # Convert the compressed string to a list
    compressed_list = list(compressed_string)

    # Loop until no more "[" is found in the compressed list
    while "[" in compressed_list:
        # The last "[" is always an innermost one, so solve that first
        open_pos = len(compressed_list) - 1 - compressed_list[::-1].index("[")
        # Its matching "]" is the first one that follows it
        close_pos = compressed_list.index("]", open_pos)

        # The repeat count is the run of digits just before the "["
        # (this also handles counts of 10 or more)
        num_start = open_pos
        while num_start > 0 and compressed_list[num_start - 1].isdigit():
            num_start -= 1
        repeat = int("".join(compressed_list[num_start:open_pos]))

        # Replace "number[string]" with the expanded string, and do the next loop
        expanded = repeat * "".join(compressed_list[open_pos + 1 : close_pos])
        compressed_list[num_start : close_pos + 1] = list(expanded)

    return "".join(compressed_list)

decompress("3[ab3[c]]")
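
For the junior version without nesting, a regular expression is another possible route. The sketch below is one way to do it, assuming the input never contains nested brackets; `decompress_flat` is a name invented for this example.

```python
import re

def decompress_flat(compressed_string):
    # Replace every "number[string]" group with the string repeated number times.
    # Only handles non-nested input: the bracket content may not contain "[" or "]".
    return re.sub(
        r"(\d+)\[([^\[\]]*)\]",
        lambda match: int(match.group(1)) * match.group(2),
        compressed_string,
    )

result = decompress_flat("3[abc]4[ab]c")
```

Anything outside a **number[string]** group, like the trailing "c", is left untouched by `re.sub`.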