# Lab Work 2: Text Processing: Preparation of texts

Use this notebook for the subsequent parts of the exercise.

## 6.2.1 Load the data and CountVectorize them
You will find the files in Ilias: [sherlock.zip](https://www.ili.fh-aachen.de/goto_elearning_file_815003_download.html).
Download the zip file and adapt the file paths in the next cell accordingly.

In [94]:
import numpy as np
import pandas as pd


filenames = [r"./Sherlock.txt", 
             r"./Sherlock_blanched.txt",
             r"./Sherlock_black.txt",
             r"./Sherlock_blue.txt",
             r"./Sherlock_card.txt"]

Now we create a `CountVectorizer`. The parameter `input="filename"` tells the `CountVectorizer` that its methods operate on a list of file names rather than on raw text.

In [95]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(input="filename")
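To make the idea of a bag of words concrete, here is a minimal sketch in plain Python that counts tokens in two short in-memory strings. The sample sentences and the helper `bag_of_words` are made up for illustration. The regular expression mirrors what we understand to be the default tokenization of `CountVectorizer` (lowercasing, tokens of at least two word characters), which is an assumption worth checking against the scikit-learn documentation.

```python
import re
from collections import Counter

# Hypothetical sample documents (not the Sherlock files).
docs = [
    "Holmes looked at the letter. The letter was short.",
    "Watson read the letter to Holmes.",
]


def bag_of_words(text):
    """Lowercase the text and count tokens of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


bags = [bag_of_words(d) for d in docs]

# The vocabulary is the sorted set of all tokens;
# each document becomes one row of counts over that vocabulary.
vocabulary = sorted(set().union(*bags))
matrix = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, which is the same layout as the count matrix that `fit_transform` returns.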

Now generate the Bag of Words with the CountVectorizer and check:
* the total number of different words
* the total number of words per document
* the total number of occurrences of each word

In [96]:
X = vectorizer.fit_transform(filenames).toarray()
dataframe = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
dataframe

Unnamed: 0,117,12,12s,13,131,13th,14th,15,16,1840,...,yourselves,youth,youthful,youths,yoxley,zealous,zenith,zest,zigzag,zoo
0,0,1,0,1,2,1,1,1,2,1,...,1,5,1,0,7,1,1,1,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,2,0,1,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [97]:
df = dataframe.copy()
totalNumber = df.shape[1]
numWordsPerDoc = df.sum(axis=1)
numOccurEachWord = df.sum(axis=0)
print('Total number of different words:', totalNumber)
print('Total number of words per document:\n', numWordsPerDoc)
print('Total number of occurrences of each word:\n', numOccurEachWord)

Total number of different words: 8879
Total number of words per document:
 0    107416
1      7258
2      7775
3      7497
4      8242
dtype: int64
Total number of occurrences of each word:
 117        2
12         1
12s        1
13         1
131        2
          ..
zealous    1
zenith     1
zest       1
zigzag     1
zoo        1
Length: 8879, dtype: int64


## 6.2.2 Which word occurs the most?

This must be done in three steps. The reason is that `vectorizer.vocabulary_` is organized as a dictionary that maps each word to its column position in the count array, so a word cannot be looked up directly by its position:
1. Find out the highest count of a word
2. Find out the position of this count
3. Find out the word at this position
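
The three steps can be sketched on a tiny, made-up count matrix. The array `counts` and the dictionary `vocabulary` below are hypothetical stand-ins for the count array and `vectorizer.vocabulary_`. The sketch sums the counts per word before applying `argmax`, and it inverts the dictionary to get from a position back to a word.

```python
import numpy as np

# Hypothetical stand-ins: rows are documents, columns are words.
vocabulary = {"and": 0, "holmes": 1, "the": 2}
counts = np.array([
    [2, 1, 5],
    [9, 0, 6],
    [1, 0, 7],
])

# Sum over the documents so that each word has a single total count.
totals = counts.sum(axis=0)

max_count = int(totals.max())        # step 1: highest total count
max_index = int(totals.argmax())     # step 2: column position of that count

# Step 3: invert the word -> index mapping to look up the word by position.
index_to_word = {index: word for word, index in vocabulary.items()}
max_word = index_to_word[max_index]

# Pitfall: argmax on the 2-D array indexes the flattened array instead.
flat_index = int(counts.argmax())

print(max_count, max_index, max_word, flat_index)
```

In this toy data the largest single cell sits in the second row, so `flat_index` is not a valid column position. This is why the totals are computed first.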

Let's see what the vocabulary of the vectorizer looks like. For that purpose we use the `vocabulary_` attribute, which maps each word to its column index in the table.

In [98]:
vectorizer.vocabulary_

{'the': 7921,
 'return': 6563,
 'of': 5385,
 'sherlock': 7015,
 'holmes': 3843,
 'collection': 1528,
 'adventures': 190,
 'by': 1142,
 'sir': 7146,
 'arthur': 487,
 'conan': 1614,
 'doyle': 2454,
 'adventure': 189,
 'empty': 2687,
 'house': 3902,
 'it': 4319,
 'was': 8572,
 'in': 4038,
 'spring': 7411,
 'year': 8847,
 '1894': 18,
 'that': 7918,
 'all': 274,
 'london': 4711,
 'interested': 4239,
 'and': 342,
 'fashionable': 2997,
 'world': 8800,
 'dismayed': 2343,
 'murder': 5136,
 'honourable': 3858,
 'ronald': 6649,
 'adair': 150,
 'under': 8269,
 'most': 5101,
 'unusual': 8376,
 'inexplicable': 4115,
 'circumstances': 1411,
 'public': 6126,
 'has': 3701,
 'already': 293,
 'learned': 4565,
 'those': 7967,
 'particulars': 5580,
 'crime': 1898,
 'which': 8659,
 'came': 1171,
 'out': 5460,
 'police': 5852,
 'investigation': 4286,
 'but': 1132,
 'good': 3499,
 'deal': 2055,
 'suppressed': 7725,
 'upon': 8382,
 'occasion': 5365,
 'since': 7138,
 'case': 1241,
 'for': 3207,
 'prosecution': 

First, let's find the highest count of a word, summed over all documents, using the function `max()`.

In [99]:
max_count = df.sum().max()
max_count

7975

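Now let's find the position of this maximal word count. We sum the counts over all documents and apply `argmax()` to the resulting totals. Calling `argmax()` directly on the two-dimensional `df.values` would return an index into the flattened array, which equals the column index only when the largest single entry lies in the first row.

In [100]:
max_index = df.sum(axis=0).values.argmax()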
max_index

7921

Finally, we find out which word belongs to this index. To do so, we go through the whole vocabulary and check, one by one, whether the index of each word equals the index found in the previous step. Not surprisingly, it is the word `the`, which occurs very frequently in English text.

In [101]:
max_index_word = [w for w, i in vectorizer.vocabulary_.items() if i == max_index][0]
max_index_word

'the'

An alternative is to sort the transposed DataFrame with its built-in `sort_values` method. Note that `by=0` sorts by the counts in the first file only; here that yields the same top word. In addition, we can see the count of each word in each file.

In [102]:
df.T.sort_values(by=0, ascending=False)

Unnamed: 0,0,1,2,3,4
the,6237,360,517,466,395
and,2919,200,244,200,256
of,2796,170,232,233,209
to,2665,201,153,193,219
that,2163,148,162,122,159
...,...,...,...,...,...
covent,0,0,0,3,0
remonstrance,0,0,0,1,0
inwardly,0,1,0,0,0
involuntary,0,1,0,0,0
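
If the ranking should reflect all files rather than only the first one, the per-word totals can be sorted instead. The sketch below uses a small made-up DataFrame `toy` in place of `df`; the column name `total` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical counts: rows are documents, columns are words (same layout as df).
toy = pd.DataFrame({"and": [2, 9, 1], "holmes": [1, 0, 0], "the": [5, 6, 7]})

# Transpose so that words become rows, add the total over all documents,
# and sort by that total instead of by a single document.
ranked = toy.T.assign(total=toy.sum(axis=0)).sort_values(by="total", ascending=False)
print(ranked)

top_word = ranked.index[0]
print(top_word)
```

The per-file counts stay visible next to the total, so the table still shows how each word is distributed over the documents.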


# 6.3 Improving using stop words, n-grams and tf-idf
The feature space is vast, with nearly 9000 dimensions. Hence we should try to reduce the number of dimensions by:

1. using only words that occur in a minimum share of the documents (minimum document frequency, `min_df`),
2. removing stop words (like 'a', 'and', 'the'), as they don't give valuable information for classification, and/or
3. removing words that occur in many documents (maximum document frequency, `max_df`).

Experiment with the values of `min_df` and `max_df` and see how the size of the vocabulary changes.

Implement all three options and check their separate outcomes and their combinations.

---
First, we initialize a separate count vectorizer for each case: one with a minimum document frequency, one with stop words removed, one with a maximum document frequency, and finally a combined version that uses all of these parameters.

We chose `min_df=0.3`, which means "ignore terms that appear in less than 30% of the documents". In our experiments, smaller values did not remove any words from the initial table, so they had no effect. Larger values, on the other hand, remove more terms and risk losing valuable information. With only five documents, 30% corresponds to 1.5 documents, so this setting effectively drops terms that occur in a single file.

We chose `max_df=0.6`, which means "ignore terms that appear in more than 60% of the documents". In our experiments, larger values did not remove as many words as we wanted and left frequently repeated words that occur in almost all of the texts. We did not go below 0.6 in order not to lose important information and context.

For the stop word variant we use the built-in English stop word list, and the last vectorizer combines all of the parameters above. Note that every vectorizer is created with `input="filename"`, which tells it that `fit_transform` receives a list of file names and should read the text from those files rather than treat the list entries themselves as text.

In [103]:
cv_min_df = CountVectorizer(min_df=0.3, input="filename")
cv_stop_words = CountVectorizer(stop_words='english', input="filename")
cv_max_df = CountVectorizer(max_df=0.6, input="filename")
cv_combined = CountVectorizer(min_df=0.3, stop_words='english', max_df=0.6, input="filename")
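
How the fractional thresholds translate into document counts can be illustrated without scikit-learn. The sketch below assumes, based on our reading of the scikit-learn documentation, that a float `min_df` or `max_df` is multiplied by the number of documents and that a term is kept when its document frequency lies between the two resulting bounds (inclusive). The helper `keep_terms` and the document frequencies are made up for illustration.

```python
def keep_terms(doc_freq, n_docs, min_df=0.0, max_df=1.0):
    """Return the terms whose document frequency lies within the given proportions."""
    low = min_df * n_docs    # assumed lower bound in number of documents
    high = max_df * n_docs   # assumed upper bound in number of documents
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)


# Hypothetical document frequencies for a corpus of five documents.
doc_freq = {"the": 5, "holmes": 4, "letter": 3, "yard": 2, "zoo": 1}
n_docs = 5

kept_min = keep_terms(doc_freq, n_docs, min_df=0.3)               # bound 1.5 documents
kept_max = keep_terms(doc_freq, n_docs, max_df=0.6)               # bound 3 documents
kept_both = keep_terms(doc_freq, n_docs, min_df=0.3, max_df=0.6)

print(kept_min)
print(kept_max)
print(kept_both)
```

Under this assumption, `min_df=0.3` with five documents removes only the terms that occur in one document, and `max_df=0.6` removes the terms that occur in four or five documents.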

Now we use `fit_transform` to obtain the term counts and convert each result to a dense array for later use. As you can see, we pass our `filenames` list as the argument.

In [104]:
text_min_df = cv_min_df.fit_transform(filenames).toarray()
text_stop_words = cv_stop_words.fit_transform(filenames).toarray()
text_max_df = cv_max_df.fit_transform(filenames).toarray()
text_combined = cv_combined.fit_transform(filenames).toarray()

Here we create a DataFrame from each vectorization result. It shows how many occurrences of each word there are in each file.

In [105]:
df_min_df = pd.DataFrame(text_min_df, columns=cv_min_df.get_feature_names_out(), index=filenames)
df_stop_words = pd.DataFrame(text_stop_words, columns=cv_stop_words.get_feature_names_out(), index=filenames)
df_max_df = pd.DataFrame(text_max_df, columns=cv_max_df.get_feature_names_out(), index=filenames)
df_combined = pd.DataFrame(text_combined, columns=cv_combined.get_feature_names_out(), index=filenames)

In [106]:
df_min_df

Unnamed: 0,1883,1884,1901,30,45,46,83,95,aback,abandon,...,yet,yonder,you,young,younger,your,yours,yourself,yourselves,youth
./Sherlock.txt,5,2,1,3,1,1,1,5,2,6,...,110,7,1659,115,5,412,13,32,1,5
./Sherlock_blanched.txt,0,0,1,1,0,0,0,0,0,0,...,2,0,112,5,0,24,1,1,0,1
./Sherlock_black.txt,5,1,0,0,1,1,1,2,1,1,...,7,0,109,10,0,19,2,0,0,1
./Sherlock_blue.txt,0,0,0,1,0,0,0,0,0,0,...,2,1,146,0,0,27,2,1,1,1
./Sherlock_card.txt,0,0,0,0,0,0,0,0,0,0,...,1,0,111,2,2,40,2,3,0,0


In [107]:
df_stop_words

Unnamed: 0,117,12,12s,13,131,13th,14th,15,16,1840,...,youngster,youth,youthful,youths,yoxley,zealous,zenith,zest,zigzag,zoo
./Sherlock.txt,0,1,0,1,2,1,1,1,2,1,...,2,5,1,0,7,1,1,1,0,1
./Sherlock_blanched.txt,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
./Sherlock_black.txt,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
./Sherlock_blue.txt,2,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
./Sherlock_card.txt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [108]:
df_max_df

Unnamed: 0,117,12,12s,13,131,13th,14th,15,16,1840,...,youngster,yourselves,youthful,youths,yoxley,zealous,zenith,zest,zigzag,zoo
./Sherlock.txt,0,1,0,1,2,1,1,1,2,1,...,2,1,1,0,7,1,1,1,0,1
./Sherlock_blanched.txt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
./Sherlock_black.txt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
./Sherlock_blue.txt,2,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
./Sherlock_card.txt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [109]:
df_combined

Unnamed: 0,1883,1884,1901,30,45,46,83,95,aback,abandon,...,wrote,yacht,yard,yarn,yarned,yell,yellow,yesterday,yonder,younger
./Sherlock.txt,5,2,1,3,1,1,1,5,2,6,...,24,3,22,1,1,4,7,31,7,5
./Sherlock_blanched.txt,0,0,1,1,0,0,0,0,0,0,...,4,0,0,0,0,0,0,0,0,0
./Sherlock_black.txt,5,1,0,0,1,1,1,2,1,1,...,0,3,0,1,1,3,0,2,0,0
./Sherlock_blue.txt,0,0,0,1,0,0,0,0,0,0,...,0,0,3,0,0,0,1,0,1,0
./Sherlock_card.txt,0,0,0,0,0,0,0,0,0,0,...,0,0,2,0,0,0,4,3,0,2
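
To see how the vocabulary size changes with each option, the four configurations can be compared on a small in-memory corpus. The sentences below are made up and are passed directly as strings (the default `input='content'`), so the numbers are not those of the Sherlock files. The names `configs` and `vocab` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of five short documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang in the tree",
    "the tree fell on the log",
]

configs = {
    "default": {},
    "min_df": {"min_df": 0.3},
    "stop_words": {"stop_words": "english"},
    "max_df": {"max_df": 0.6},
    "combined": {"min_df": 0.3, "stop_words": "english", "max_df": 0.6},
}

vocab = {}
for name, params in configs.items():
    cv = CountVectorizer(**params)
    cv.fit(docs)
    # vocabulary_ maps each kept term to its column index; its length is the feature count.
    vocab[name] = sorted(cv.vocabulary_)
    print(f"{name:>10}: {len(vocab[name])} terms")
```

The same loop could be run with `input="filename"` and the `filenames` list to compare the vocabulary sizes on the actual texts.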


# 6.4 Rescaling the data using term frequency inverse document frequency
Here, term frequency is the number of occurrences of a term (word) $t$ in a document $d$.

$\operatorname{tf}(t, d) = f_{t, d}$ 

Sometimes tf is normalized by the length of $d$.

The inverse document frequency idf is a measure of the amount of information a term $t$ carries: a term that occurs in few documents carries a lot of information, while a term that occurs in many documents carries little. The idf is computed as

$\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1$

where $n$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain the term $t$. Hence, the tf-idf is the product of the two terms:

$\text{tf-idf}(t, d) = \operatorname{tf}(t, d) \cdot \text{idf}(t)$

scikit-learn supports this in the `TfidfTransformer` when using the following parameters: `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`. With `norm='l2'`, each document vector is additionally scaled to unit Euclidean length. Refer to the scikit-learn documentation for the parameters and how they change the formula.

Combining Bag of Words and tf-idf can be done using the `TfidfVectorizer`.
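
As a sanity check, the formula above can be evaluated by hand and compared with `TfidfTransformer`. The count matrix below is made up for illustration, and the variable names are hypothetical. The manual part applies the smoothed idf and then the L2 row normalization that `norm='l2'` requests.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term counts: 3 documents (rows) x 3 terms (columns).
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 1, 0],
], dtype=float)

n = counts.shape[0]                        # number of documents
df = (counts > 0).sum(axis=0)              # number of documents containing each term
idf = np.log((1 + n) / (1 + df)) + 1       # smoothed idf from the formula above

tfidf = counts * idf                       # tf(t, d) * idf(t)
manual = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

transformer = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
reference = transformer.fit_transform(counts).toarray()

print(np.round(manual, 4))
print(np.allclose(manual, reference))
```

The first term occurs in every document, so its idf is exactly 1, while the two rarer terms receive a higher weight.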

## 6.4.1 Find the maximum value of each feature over the dataset

We are going to find the maximum value of each feature over the dataset in two ways: first with `TfidfTransformer` and then with `TfidfVectorizer`. <br>
With `TfidfTransformer`, you follow a step-by-step process: first you compute the word counts using `CountVectorizer`, then the inverse document frequency (IDF) values, and finally the TF-IDF scores. <br>
With `TfidfVectorizer`, on the other hand, all three steps are performed at once. It internally handles the computation of word counts, IDF values, and TF-IDF scores on the same dataset.

For `TfidfTransformer`, we import the class from scikit-learn and initialize it with the parameters given above. We then call `fit_transform`, not on the raw text but on our `text_combined` array, which already holds the term counts produced with the combined parameters that removed unnecessary terms. From the result we create a new DataFrame that holds the TF-IDF value of each term in each document. Finally, we use `max()` to find the maximum value of each feature. As mentioned above, this takes several separate steps.

In [110]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
tf_idf_vector = tfidf_transformer.fit_transform(text_combined).toarray()

tf_idf_transformer = pd.DataFrame(tf_idf_vector, columns=cv_combined.get_feature_names_out(), index=filenames) 
tf_idf_transformer_max = tf_idf_transformer.max()
print('TfidfTransformer: Maximum TF-IDF values for each feature: ')
print(tf_idf_transformer_max)

TfidfTransformer: Maximum TF-IDF values for each feature: 
1883         0.063727
1884         0.012745
1901         0.018570
30           0.017889
45           0.012745
               ...   
yell         0.038236
yellow       0.064668
yesterday    0.066740
yonder       0.021551
younger      0.038952
Length: 2509, dtype: float64


In [111]:
tf_idf_transformer

Unnamed: 0,1883,1884,1901,30,45,46,83,95,aback,abandon,...,wrote,yacht,yard,yarn,yarned,yell,yellow,yesterday,yonder,younger
./Sherlock.txt,0.012968,0.005187,0.002594,0.006459,0.002594,0.002594,0.002594,0.012968,0.005187,0.015561,...,0.062246,0.007781,0.047364,0.002594,0.002594,0.010374,0.01507,0.06674,0.018155,0.012968
./Sherlock_blanched.txt,0.0,0.0,0.01857,0.015415,0.0,0.0,0.0,0.0,0.0,0.0,...,0.07428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
./Sherlock_black.txt,0.063727,0.012745,0.0,0.0,0.012745,0.012745,0.012745,0.025491,0.012745,0.012745,...,0.0,0.038236,0.0,0.012745,0.012745,0.038236,0.0,0.02116,0.0,0.0
./Sherlock_blue.txt,0.0,0.0,0.0,0.017889,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.053667,0.0,0.0,0.0,0.017889,0.0,0.021551,0.0
./Sherlock_card.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.032334,0.0,0.0,0.0,0.064668,0.048501,0.0,0.038952


For `TfidfVectorizer`, we import the class from scikit-learn and again initialize it with the parameters given above. This time we fit on the raw text, so we pass `filenames` to `fit_transform`. We then repeat the same process as before: we place the TF-IDF values in a DataFrame and find the maximum value of each feature. However, because we use the raw dataset and pass only the stop words, without `max_df` and `min_df`, the resulting table has more columns than the table above (8601 instead of 2509 features). This shows how much the results differ with and without `max_df` and `min_df`. As you can see, `TfidfVectorizer` performs in one call the steps that we carried out separately in the first example.

In [112]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, input='filename', stop_words='english')
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(filenames)
tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=filenames)

tfidf_df_max = tfidf_df.max()
print('TfidfVectorizer: Maximum TF-IDF values for each feature: ')
print(tfidf_df_max)

TfidfVectorizer: Maximum TF-IDF values for each feature: 
117        0.023922
12         0.001176
12s        0.011961
13         0.001176
131        0.002353
             ...   
zealous    0.001176
zenith     0.001176
zest       0.001176
zigzag     0.011961
zoo        0.001176
Length: 8601, dtype: float64


In [113]:
tfidf_df

Unnamed: 0,117,12,12s,13,131,13th,14th,15,16,1840,...,youngster,youth,youthful,youths,yoxley,zealous,zenith,zest,zigzag,zoo
./Sherlock.txt,0.0,0.001176,0.0,0.001176,0.002353,0.001176,0.001176,0.001176,0.002353,0.001176,...,0.002353,0.003314,0.001176,0.0,0.008235,0.001176,0.001176,0.001176,0.0,0.001176
./Sherlock_blanched.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
./Sherlock_black.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.006573,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
./Sherlock_blue.txt,0.023922,0.0,0.011961,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.006739,0.0,0.0,0.0,0.0,0.0,0.0,0.011961,0.0
./Sherlock_card.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.011645,0.0,0.0,0.0,0.0,0.0,0.0


After adding `min_df` and `max_df` as parameters, we get the same result as with `TfidfTransformer`:

In [114]:
tfidf_vectorizer = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, input='filename', stop_words='english', min_df=0.3, max_df=0.6)
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(filenames)
tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=filenames)

tfidf_df_max = tfidf_df.max()
print('TfidfVectorizer with Min-DF and Max-DF: Maximum TF-IDF values for each feature: ')
print(tfidf_df_max)

TfidfVectorizer with Min-DF and Max-DF: Maximum TF-IDF values for each feature: 
1883         0.063727
1884         0.012745
1901         0.018570
30           0.017889
45           0.012745
               ...   
yell         0.038236
yellow       0.064668
yesterday    0.066740
yonder       0.021551
younger      0.038952
Length: 2509, dtype: float64


In [115]:
tfidf_df

Unnamed: 0,1883,1884,1901,30,45,46,83,95,aback,abandon,...,wrote,yacht,yard,yarn,yarned,yell,yellow,yesterday,yonder,younger
./Sherlock.txt,0.012968,0.005187,0.002594,0.006459,0.002594,0.002594,0.002594,0.012968,0.005187,0.015561,...,0.062246,0.007781,0.047364,0.002594,0.002594,0.010374,0.01507,0.06674,0.018155,0.012968
./Sherlock_blanched.txt,0.0,0.0,0.01857,0.015415,0.0,0.0,0.0,0.0,0.0,0.0,...,0.07428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
./Sherlock_black.txt,0.063727,0.012745,0.0,0.0,0.012745,0.012745,0.012745,0.025491,0.012745,0.012745,...,0.0,0.038236,0.0,0.012745,0.012745,0.038236,0.0,0.02116,0.0,0.0
./Sherlock_blue.txt,0.0,0.0,0.0,0.017889,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.053667,0.0,0.0,0.0,0.017889,0.0,0.021551,0.0
./Sherlock_card.txt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.032334,0.0,0.0,0.0,0.064668,0.048501,0.0,0.038952


### Conclusion

In conclusion, we covered several approaches to text processing and preparation using `CountVectorizer`, `TfidfTransformer` and `TfidfVectorizer`. <br>
We used `CountVectorizer` to create a bag of words and to check some of its properties, such as the number of different words, the number of words per document and the number of occurrences of each word. We also found two ways to determine which word occurs the most. Finally, we saw that parameters like `max_df`, `stop_words` and `min_df` can be used to clean the dataset of words that carry little information. <br>
Moreover, we demonstrated the use of `TfidfTransformer` in combination with `CountVectorizer`. This involved first computing the term frequency (TF) using `CountVectorizer` and subsequently transforming these counts into TF-IDF scores using `TfidfTransformer`.<br>
Furthermore, we examined the use of `TfidfVectorizer` to combine the bag of words model with TF-IDF weighting. This approach creates a matrix in which each entry represents the TF-IDF score of a term in a document, emphasizing the importance of terms while taking into account their rarity across the entire corpus.