# Tasks: Vectorizing Raw Data

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import nltk

In [10]:

sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']


## Count Vectorizer
### Task 1
Use `CountVectorizer` on the same data:
1. Once with `stop_words="english"` and once without. What is the difference?
2. Once with `lowercase=True` and once with `lowercase=False`. What is the difference?








In [17]:
count_vectorizer = CountVectorizer()
x = count_vectorizer.fit_transform(sample_data)

In [18]:
count_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'paper', 'second', 'the',
       'third', 'this'], dtype=object)

In [19]:
count_vectorizer.vocabulary_

In [20]:
x.toarray()

array([[0, 0, 1, 1, 0, 1, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 0, 1, 1, 1],
       [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]])

### Quiz 1:
What is the shape of `x.toarray()`?

(4, 10)

<details>
    <summary>Click to reveal answer</summary>
    <p>
    x.toarray().shape
    </p>
</details>

### Quiz 2:
What is the type of `x`?

<details>
    <summary>Click to reveal answer</summary>
    <p>
    type(x)
    </p>
</details>

### Quiz 3:
What is the type of `count_vectorizer`?

<details>
    <summary>Click to reveal answer</summary>
    <p>
    type(count_vectorizer)
    </p>
</details>

### Quiz 4:
Print the vector representation of the first document in `sample_data`

<details>
    <summary>Click to reveal answer</summary>
    <p>
    x.toarray()[0]
    </p>
</details>

### Quiz 5:
Print the vector that represents the word "second" in `sample_data`

<details>
    <summary>Click to reveal answer</summary>
    <p>
    x.toarray()[:, 6]
    </p>
</details>
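Hard-coding a column index is easy to get wrong, because the index depends on the alphabetical position of the word in the fitted vocabulary. A small sketch of looking the index up through the fitted `vocabulary_` attribute instead (the variable names `column_index` and `second_column` are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']

count_vectorizer = CountVectorizer()
x = count_vectorizer.fit_transform(sample_data)

# vocabulary_ maps each token to its column index in the document-term matrix
column_index = count_vectorizer.vocabulary_["second"]

# Select that column: one count per document
second_column = x.toarray()[:, column_index]

print(column_index)
print(second_column)
```

Only the second document contains "second", so only the second entry of the column is non-zero.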

### Quiz 6:
Create an instance of `CountVectorizer` with `stop_words="english"` and repeat quizzes 1-5. What is the difference?

1. print the feature names of the vectorized data
2. print the shape of the vectorized data `x.toarray()`


<details>
    <summary>Click to reveal answer</summary>
    <p>
    There is no "is" in the feature names, nor any other stop word,
    so the array has fewer columns and its shape is different.
    </p>
</details>

## TF-IDF Vectorizer

### Task 2

Use `TfidfVectorizer` on the same data:
1. Once with `tokenizer=my_custom_tokenizer` and once without. What is the difference?


In [12]:
def my_custom_tokenizer(text):
    return text.split(" ")

In [16]:
tfidf_vectorizer = TfidfVectorizer(tokenizer=my_custom_tokenizer)

In [14]:
x2 = tfidf_vectorizer.fit_transform(sample_data)



In [15]:
tfidf_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one.', 'paper.', 'paper?',
       'second', 'the', 'third', 'this'], dtype=object)

In [57]:
x2.toarray()

array([[0.        , 0.        , 0.54929352, 0.36357175, 0.        ,
        0.54929352, 0.        , 0.        , 0.36357175, 0.        ,
        0.36357175],
       [0.        , 0.53927767, 0.        , 0.28141746, 0.        ,
        0.42517271, 0.        , 0.53927767, 0.28141746, 0.        ,
        0.28141746],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.        , 0.        , 0.26710379, 0.51184851,
        0.26710379],
       [0.        , 0.        , 0.50487895, 0.3341742 , 0.        ,
        0.        , 0.64037493, 0.        , 0.3341742 , 0.        ,
        0.3341742 ]])
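The values above are not raw counts. By default `TfidfVectorizer` uses `norm="l2"`, so each document row is scaled to unit Euclidean length. The sketch below checks this on the same data; passing `token_pattern=None` is an addition here, intended only to silence the warning scikit-learn emits when a custom tokenizer makes the default pattern unused.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sample_data = ['This is the first paper.',
               'This document is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']


def my_custom_tokenizer(text):
    # Split on single spaces only, so punctuation stays attached to words
    return text.split(" ")


tfidf_vectorizer = TfidfVectorizer(tokenizer=my_custom_tokenizer,
                                   token_pattern=None)
x2 = tfidf_vectorizer.fit_transform(sample_data)

# Euclidean length of each document vector (one value per row)
row_norms = np.linalg.norm(x2.toarray(), axis=1)

print(x2.shape)
print(row_norms)
```

Because every row has length 1, the dot product of two rows is already their cosine similarity.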

### Quiz 1:
What is the type of `tfidf_vectorizer`?

sklearn.feature_extraction.text.TfidfVectorizer

<details>
    <summary>Click to reveal answer</summary>
    <p>
    type(tfidf_vectorizer)
    </p>
</details>

### Quiz 2:
What is the type of `x2`?

scipy.sparse._csr.csr_matrix

<details>
    <summary>Click to reveal answer</summary>
    <p>
    type(x2)
    </p>
</details>


### Quiz 3:
What is the shape of `x2.toarray()`?

(4, 11)

<details>
    <summary>Click to reveal answer</summary>
    <p>
    x2.toarray().shape
    </p>
</details>


### Quiz 4:
Print the vector representation of the second document in `sample_data`

array([0.        , 0.53927767, 0.        , 0.28141746, 0.        ,
       0.42517271, 0.        , 0.53927767, 0.28141746, 0.        ,
       0.28141746])

<details>
    <summary>Click to reveal answer</summary>
    <p>
    x2.toarray()[1]
    </p>
</details>

### Quiz 5:
Print the vector that represents the word "second" in `sample_data`

array([0.        , 0.53927767, 0.        , 0.        ])

<details>
    <summary>Click to reveal answer</summary>
    <p>
    x2.toarray()[:, 7]
    </p>
</details>

### Quiz 6:
Create an instance of `TfidfVectorizer` with `stop_words="english"` and repeat quizzes 1-5. What is the difference?

1. print the feature names of the vectorized data
2. print the shape of the vectorized data `x2.toarray()`

<details>
    <summary>Click to reveal answer</summary>
    <p>
    The feature names are different: the stop words are gone.
    The array has fewer columns, so its shape is different.
    </p>
</details>

### Quiz 7:
Update the `my_custom_tokenizer` function to remove punctuation and repeat quizzes 1-5. What is the difference?

<details>
    <summary>Click to reveal answer</summary>
    <p>
    The feature names are different:
    they no longer contain punctuation, so "paper." and "paper?" merge into "paper".
    </p>
</details>

## N-grams vectorization

### Task 3

In [62]:
n_gram_vectorizer = CountVectorizer(ngram_range=(2, 2))

In [63]:
x3 = n_gram_vectorizer.fit_transform(sample_data)

In [68]:
x3.toarray()

array([[0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
       [0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]])

In [64]:
n_gram_vectorizer.get_feature_names_out()

array(['and this', 'document is', 'first paper', 'is the', 'is this',
       'second paper', 'the first', 'the second', 'the third',
       'third one', 'this document', 'this is', 'this the'], dtype=object)

### Quiz 1:
What is the type of `n_gram_vectorizer`?

sklearn.feature_extraction.text.CountVectorizer

<details>
    <summary>Click to reveal answer</summary>
    <p>
    type(n_gram_vectorizer)
    </p>
</details>

### Quiz 2:
What is the type of `x3`?

scipy.sparse._csr.csr_matrix

<details>
    <summary>Click to reveal answer</summary>
    <p>
    type(x3)
    </p>
</details>

### Quiz 3:
What is the shape of `x3.toarray()`?

In [72]:
x3.toarray().shape

(4, 13)

<details>
    <summary>Click to reveal answer</summary>
    <p>
    x3.toarray().shape
    </p>
</details>    

### Quiz 4:
Print the vector representation of the second document in `sample_data`

array([0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0])

<details>
    <summary>Click to reveal answer</summary>
    <p>
    x3.toarray()[1]
    </p>
</details>

### Quiz 5:
Print the vector that represents the bigram "second paper" in `sample_data`

array([0, 1, 0, 0])

<details>
    <summary>Click to reveal answer</summary>
    <p>
    x3.toarray()[:, 5]
    </p>
</details>

## Task 4

1. Choose a dataset from the datasets folder
2. Clean the dataset
3. Use one of the vectorizers we discussed above on the dataset
4. Print the shape of the vectorized data
5. Print the first 5 rows of the vectorized data
6. Print the feature names of the vectorized data
7. Print the vocabulary of the vectorized data

## Task 5: Text similarity


In [1]:
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# create two vectors
vector_1 = [1, 2, 3]
vector_2 = [1, 2, 3]


In [3]:
# calculate the cosine similarity between the two vectors
cosine_similarity([vector_1], [vector_2])

array([[1.]])
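Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude. A minimal sketch that computes it by hand with NumPy; the function `cosine` and the vectors `scaled` and `orthogonal` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vector_1 = [1, 2, 3]
vector_2 = [1, 2, 3]
scaled = [2, 4, 6]        # same direction as vector_1, twice the length
orthogonal = [3, 0, -1]   # 1*3 + 2*0 + 3*(-1) = 0, so perpendicular to vector_1

print(cosine(vector_1, vector_2))
print(cosine(vector_1, scaled))
print(cosine(vector_1, orthogonal))

# The hand-written version should agree with scikit-learn
print(cosine_similarity([vector_1], [scaled]))
```

Identical and scaled vectors score 1, while perpendicular vectors score 0.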

In [4]:
doc_1 = "i love cats, but not dogs"
doc_2 = "i love dogs, but not cats"
documents = [doc_1, doc_2]

### Quiz 5.1

1. Use `CountVectorizer` to vectorize the documents
2. Fit and transform the documents
3. Get the matrix of the vectorized documents
4. Create a dataframe from the matrix
5. Calculate the cosine similarity between the two documents

In [28]:
 # 1 create the instance of the CountVectorizer


<details>
    <summary>Click to reveal answer</summary>
    <p>
    count_vectorizer = CountVectorizer()
    </p>
</details>

In [9]:
# 2 fit and transform the documents


<details>
    <summary>Click to reveal answer</summary>
    <p>
    sparse_matrix = count_vectorizer.fit_transform(documents)
    </p>
</details>

In [10]:
# 3 get matrix of the vectorized documents


<details>
    <summary>Click to reveal answer</summary>
    <p>
    doc_term_matrix = sparse_matrix.todense()
    </p>
</details>

In [11]:
# 4 create a dataframe from the matrix


<details>
    <summary>Click to reveal answer</summary>
    <p>
    df = pd.DataFrame(doc_term_matrix, columns=count_vectorizer.get_feature_names_out(), index=['doc_1', 'doc_2'])
    </p>
</details>

In [13]:
# 5 calculate the cosine similarity between the two documents


array([[1., 1.],
       [1., 1.]])

<details>
    <summary>Click to reveal answer</summary>
    <p>
    cosine_similarity(df, df)
    </p>
</details>

### Quiz 5.2
1. Use `ngram_range=(2, 3)` in the `CountVectorizer` and calculate the cosine similarity between the two documents

[[1.         0.14285714]
 [0.14285714 1.        ]]


<details>
    <summary>Click to reveal answer</summary>
    <p>
    count_vectorizer = CountVectorizer(ngram_range=(2, 3))
    sparse_matrix = count_vectorizer.fit_transform(documents)
    doc_term_matrix = sparse_matrix.todense()
    df = pd.DataFrame(doc_term_matrix, columns=count_vectorizer.get_feature_names_out(), index=['doc_1', 'doc_2'])
    print(cosine_similarity(df, df))
    </p>
    <p>
    Each document has 7 n-grams and the only one they share is "but not", so the similarity is 1/7.
    </p>
</details>

### Quiz 5.3
Use both approaches (`5.1` and `5.2`) with `stop_words='english'` and see if there is a difference.

### Quiz 5.4

Use TF-IDF to find the cosine similarity between the two documents:
1. without `ngram_range`
2. with `ngram_range=(2,3)`

Notice the difference.

<details>
    <summary>Click to reveal answer</summary>
    <p>
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(2,3))
    sparse_matrix = tfidf_vectorizer.fit_transform(documents)
    doc_term_matrix = sparse_matrix.todense()
    df = pd.DataFrame(doc_term_matrix, columns=tfidf_vectorizer.get_feature_names_out(), index=['doc_1', 'doc_2'])
    </p>
</details>

In [27]:
cosine_similarity(df, df)

array([[1.        , 0.07780894],
       [0.07780894, 1.        ]])
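To compare the settings from quizzes 5.1 to 5.4 side by side, a small helper can run each vectorizer on the same two documents. This is only a sketch: `doc_similarity`, `settings`, and `results` are names introduced here, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["i love cats, but not dogs", "i love dogs, but not cats"]


def doc_similarity(vectorizer, docs):
    """Hypothetical helper: vectorize two documents, return their cosine similarity."""
    matrix = vectorizer.fit_transform(docs)
    # cosine_similarity accepts sparse input; entry [0, 1] compares doc 0 with doc 1
    return float(cosine_similarity(matrix)[0, 1])


settings = {
    "count": CountVectorizer(),
    "count + ngrams": CountVectorizer(ngram_range=(2, 3)),
    "count + ngrams + stop words": CountVectorizer(ngram_range=(2, 3),
                                                   stop_words="english"),
    "tfidf": TfidfVectorizer(),
    "tfidf + ngrams": TfidfVectorizer(ngram_range=(2, 3)),
}

results = {name: doc_similarity(v, documents) for name, v in settings.items()}

for name, score in results.items():
    print(f"{name}: {score:.4f}")
```

Single-word vectors ignore word order, so the two sentences look identical; n-grams capture some of the order, and the similarity drops accordingly.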