## Sklearn TF-IDF Tutorial

Before calculating TF-IDF with sklearn, let's start with `CountVectorizer`, which is simpler.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

Imagine having the following simple corpus with 3 documents:

In [None]:
train = ['This train is leaving the train station to make room for another train',
         'You can use this room as your office.',
         'We have no room for error in this experiment.']

We can create the [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) very easily as follows:

In [None]:
c_vectorizer = CountVectorizer(analyzer='word', stop_words='english')
counts = c_vectorizer.fit_transform(train)

In [None]:
c_array = counts.toarray()
c_array

In [None]:
# get_feature_names_out() requires scikit-learn >= 1.0;
# older versions use get_feature_names(), which was removed in 1.2
c_tokens = c_vectorizer.get_feature_names_out()
c_tokens

In [None]:
c_df = pd.DataFrame(
    data = c_array,
    index = [f'Doc {i}' for i in range(c_array.shape[0])],
    columns = c_tokens)
c_df
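To make the idea of a document-term matrix concrete, here is a minimal pure-Python sketch of the same counting step. It is not how sklearn implements `CountVectorizer`: the toy corpus and the `tokenize` helper are assumptions made for illustration, and no stop words are removed. The regular expression mirrors the default token pattern, which keeps tokens of two or more word characters.

```python
import re
from collections import Counter

# Hypothetical toy corpus, used only for this illustration
docs = ['the cat sat', 'the cat saw the dog']

def tokenize(text):
    # Lowercase, then keep tokens made of 2+ word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# One Counter per document: token -> number of occurrences
token_counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all tokens seen in the corpus
vocabulary = sorted(set().union(*token_counts))

# One row per document, one column per vocabulary token
matrix = [[counts[token] for token in vocabulary] for counts in token_counts]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
for row in matrix:
    print(row)     # [1, 0, 1, 0, 1] then [1, 1, 0, 1, 2]
```

Each row corresponds to a document and each column to a token, which is the same layout as `c_df`.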

Now we can use `c_vectorizer` on new document(s) using the `transform` method instead of `fit_transform`:

In [None]:
test = ['There is no room in this train']

In [None]:
counts_test = c_vectorizer.transform(test)
c_array_test = counts_test.toarray()
c_array_test

In [None]:
c_tokens = c_vectorizer.get_feature_names_out()
c_tokens

In [None]:
c_df_test = pd.DataFrame(
    data = c_array_test,
    index = [f'Doc {i}' for i in range(c_array_test.shape[0])],
    columns = c_tokens)

c_df_test
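Note that `transform` reuses the vocabulary learned during fitting, so tokens that were never seen in the training corpus are simply dropped. The sketch below illustrates that behaviour in plain Python. The `vocabulary` dictionary and the `transform_counts` helper are hypothetical stand-ins, not sklearn internals.

```python
import re
from collections import Counter

# Hypothetical vocabulary (token -> column index), standing in for
# what a fitted vectorizer would have learned from the training data
vocabulary = {'room': 0, 'station': 1, 'train': 2}

def transform_counts(text, vocabulary):
    # Count tokens, but only keep those already in the vocabulary
    row = [0] * len(vocabulary)
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    for token, count in Counter(tokens).items():
        if token in vocabulary:
            row[vocabulary[token]] = count
    return row

new_row = transform_counts('There is no room in this train', vocabulary)
print(new_row)  # [1, 0, 1]
```

The number of columns stays fixed by the vocabulary, which is why the test dataframe has the same columns as the training one.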

Now let's apply `TfidfVectorizer` in the same way:

In [None]:
train = ['This train is leaving the train station to make room for another train',
         'You can use this room as your office.',
         'We have no room for error in this experiment.']

In [None]:
tfidf_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(train)
tfidf_array = tfidf.toarray()
tfidf_array

In [None]:
tfidf_tokens = tfidf_vectorizer.get_feature_names_out()
tfidf_tokens

In [None]:
tfidf_df = pd.DataFrame(
    data = tfidf_array,
    index = [f'Doc {i}' for i in range(tfidf_array.shape[0])],
    columns = tfidf_tokens)
tfidf_df

In [None]:
test = ['There is no room in this train']

In [None]:
tfidf_test = tfidf_vectorizer.transform(test)
tfidf_array_test = tfidf_test.toarray()
tfidf_array_test

In [None]:
tfidf_tokens_test = tfidf_vectorizer.get_feature_names_out()
tfidf_tokens_test

In [None]:
tfidf_df_test = pd.DataFrame(
    data = tfidf_array_test,
    index = [f'Doc {i}' for i in range(tfidf_array_test.shape[0])],
    columns = tfidf_tokens_test)
tfidf_df_test

Think about why the term 'train' has a higher value than 'room', even though counts were 1 for both.
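One way to check your reasoning is to redo the calculation by hand. The sketch below assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), where idf(t) = ln((1 + n) / (1 + df(t))) + 1. It also assumes that 'room' and 'train' are the only test tokens that survive stop-word removal, so they are the only non-zero entries in the test row.

```python
import math

n_docs = 3  # number of training documents

# Document frequencies in the training corpus:
# 'room' appears in all 3 documents, 'train' only in the first one
doc_freq = {'room': 3, 'train': 1}

# Raw counts in the test document (assumed to be the only non-zero ones)
test_counts = {'room': 1, 'train': 1}

# Smoothed idf: a rarer token gets a larger weight
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# tf * idf, followed by L2 normalisation of the row
raw = {t: test_counts[t] * idf[t] for t in test_counts}
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf_by_hand = {t: v / norm for t, v in raw.items()}

for token, value in tfidf_by_hand.items():
    print(token, round(idf[token], 4), round(value, 4))
```

Because 'room' occurs in every training document, its idf is exactly 1, while 'train' occurs in only one document and gets a larger idf. With equal counts, the rarer token ends up with the higher tf-idf value.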

Finally, it is **very important** to read the `TfidfVectorizer` documentation to learn about its parameters. For example, I removed the stop words using the keyword argument `stop_words='english'`.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

## Quiz

In [None]:
import nltk
from nltk.corpus import twitter_samples
nltk.download('twitter_samples')
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [None]:
training_positive = all_positive_tweets[:4000]
test_positive = all_positive_tweets[4000:]

training_negative = all_negative_tweets[:4000]
test_negative = all_negative_tweets[4000:]

training_set = training_positive + training_negative
test_set = test_positive + test_negative

print(len(training_set), len(test_set))

Let's create one big document (a string) that contains all the positive tweets:

In [None]:
positive_tweets = ''
for tweet in training_positive:
    positive_tweets += ' ' + tweet

Similarly, let's create one big document (a string) that contains all the negative tweets:

In [None]:
negative_tweets = ''
for tweet in training_negative:
    negative_tweets += ' ' + tweet

Then create the corpus we will analyse:

In [None]:
corpus = [positive_tweets, negative_tweets]

In [None]:
len(corpus)

In [None]:
len(corpus[0]), len(corpus[1])

### Task 1

Use `TfidfVectorizer` on `corpus` to calculate tf-idf values for each token in the positive and negative tweets. The resulting dataframe should look like the one below, where each column is a token, the first row represents all the positive tweets, and the second row represents all the negative tweets:

||00|000|001|00128835|009|00962778381838|...|ｓｅｅ|
|--|--|--|--|--|--|--|--|--|
|Positive|0.001657|0.010481|0.000000|0.000000|0.001165|0.001165|...|0.000000|
|Negative|0.002383|0.000000|0.001675|0.001675|0.000000|0.000000|...|0.058611|

2 rows × 17223 columns

In [None]:
# YOUR CODE HERE #

### Task 2

Calculate tf-idf for the tokens `happy` and `sad`:

||happy|sad|
|--|--|--|
|Positive|0.124290|0.003314|
|Negative|0.021447|0.119148|

In [None]:
# YOUR CODE HERE #

In this final part of the quiz, your task is to do part-of-speech tagging for an example tweet.

In [None]:
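import nltk
# Tokenizer, tagger and tag documentation resources.
# Recent NLTK releases may instead require 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' for the first two downloads.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')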

In [None]:
example_tweet = training_positive[0]
print(example_tweet)

### Task 3

Find the parts of speech for the following tweet.

Example tweet:

```
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
```

Expected output:

```
[('#', '#'),
 ('FollowFriday', 'NNP'),
 ('@', 'NNP'),
 ('France_Inte', 'NNP'),
 ('@', 'NNP'),
 ('PKuchly57', 'NNP'),
 ('@', 'NNP'),
 ('Milipol_Paris', 'NNP'),
 ('for', 'IN'),
 ('being', 'VBG'),
 ('top', 'JJ'),
 ('engaged', 'VBN'),
 ('members', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('community', 'NN'),
 ('this', 'DT'),
 ('week', 'NN'),
 (':', ':'),
 (')', ')')]
```

In [None]:
# YOUR CODE HERE #

### Task 4

Use `nltk.help.upenn_tagset()` to get the explanation for the tags. Print the explanation for 'DT' and 'NN'.

```
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
    
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
```



In [None]:
# YOUR CODE HERE #