**TF-IDF: term frequency - inverse document frequency**
*   Allows you to determine the most important words in each document in the corpus.
*   The idea behind tf-idf is that the documents in a corpus may share common words beyond the standard stopwords.
*   Down-weights these corpus-wide common words so they don't show up as keywords.
*   Keeps the document-specific frequent words weighted high and the common words across the entire corpus weighted low.
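
The weighting described above can be sketched by hand on a toy corpus. This is a minimal illustration, not the gensim implementation: the helper name `tfidf_by_hand` and the toy documents are made up for the example, and the scheme used here (raw term count times a base-2 log of inverse document frequency, followed by unit-length normalization) is an assumption chosen to resemble what gensim's `TfidfModel` is understood to do by default.

```python
import math
from collections import Counter

def tfidf_by_hand(tokenized_docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weighted_docs = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Raw weight: term count * log2(total docs / docs containing the term)
        raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        # Scale each document's weights to unit (L2) length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted_docs.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted_docs

# Toy corpus: "computer" appears in every document, so its weight drops to zero
toy_docs = [
    ["computer", "program", "computer", "network"],
    ["computer", "history", "century"],
    ["computer", "semiconductor", "technology", "technology"],
]
toy_tfidf = tfidf_by_hand(toy_docs)
for doc_weights in toy_tfidf:
    print(sorted(doc_weights.items(), key=lambda pair: pair[1], reverse=True))
```

Even though "computer" is the most frequent token in the first toy document, it occurs in all three documents, so its inverse document frequency is log2(3/3) = 0 and it never ranks as a keyword.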











In [16]:
import re
import os
from google.colab import files

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Import Dictionary
from gensim.corpora.dictionary import Dictionary
from collections import defaultdict

from gensim.models.tfidfmodel import TfidfModel

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [17]:
def read_documents(uploaded_files):
    documents = []
    for filename in uploaded_files.keys():
        with open(filename, 'r', encoding='utf-8') as file:
            documents.append(file.read())
    return documents

In [18]:
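def preprocess_text(text):
    # Tokenize, lowercase, and keep alphabetic tokens only
    tokens = word_tokenize(text)
    lower_tokens = [t.lower() for t in tokens]
    alpha_only = [t for t in lower_tokens if t.isalpha()]
    # Build the stopword set once instead of reloading the list for every token
    english_stops = set(stopwords.words('english'))
    no_stops = [t for t in alpha_only if t not in english_stops]
    # Lemmatize the remaining tokens (WordNetLemmatizer treats them as nouns by default)
    wordnet_lemmatizer = WordNetLemmatizer()
    words_lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
    return words_lemmatized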

In [19]:
uploaded = files.upload()

Saving wiki_article_0.txt to wiki_article_0 (1).txt
Saving wiki_article_1.txt to wiki_article_1 (1).txt
Saving wiki_article_2.txt to wiki_article_2 (1).txt
Saving wiki_article_3.txt to wiki_article_3 (1).txt
Saving wiki_article_4.txt to wiki_article_4 (1).txt
Saving wiki_article_5.txt to wiki_article_5 (1).txt
Saving wiki_article_6.txt to wiki_article_6 (1).txt
Saving wiki_article_7.txt to wiki_article_7 (1).txt
Saving wiki_article_8.txt to wiki_article_8 (1).txt
Saving wiki_article_9.txt to wiki_article_9 (1).txt


In [20]:
documents = read_documents(uploaded)
articles = [preprocess_text(doc) for doc in documents]
print(articles)

[['computer', 'digital', 'electronic', 'machine', 'programmed', 'carry', 'sequence', 'arithmetic', 'logical', 'operation', 'computation', 'automatically', 'modern', 'computer', 'perform', 'generic', 'set', 'operation', 'known', 'program', 'program', 'enable', 'computer', 'perform', 'wide', 'range', 'task', 'computer', 'system', 'complete', 'computer', 'includes', 'hardware', 'operating', 'system', 'main', 'software', 'peripheral', 'equipment', 'needed', 'used', 'full', 'operation', 'term', 'may', 'also', 'refer', 'group', 'computer', 'linked', 'function', 'together', 'computer', 'network', 'computer', 'cluster', 'broad', 'range', 'industrial', 'consumer', 'product', 'use', 'computer', 'control', 'system', 'simple', 'device', 'like', 'microwave', 'oven', 'remote', 'control', 'included', 'factory', 'device', 'like', 'industrial', 'robot', 'design', 'well', 'device', 'like', 'personal', 'computer', 'mobile', 'device', 'like', 'smartphones', 'computer', 'power', 'internet', 'link', 'billio

In [21]:
# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

In [22]:
# Create a Corpus: corpus
corpus = [dictionary.doc2bow(a) for a in articles]
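
To make the shape of the dictionary and corpus concrete, here is a plain-Python sketch of the same idea: map every token to an integer id, then represent each document as a list of `(term_id, count)` pairs. The helpers `build_vocab` and `to_bow` and the toy articles are hypothetical and only imitate the data structure; gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            # Only unseen tokens get a new id
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(doc, token_to_id):
    """Convert a token list into sorted (term_id, count) pairs."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

toy_articles = [["computer", "program", "computer"], ["program", "network"]]
toy_vocab = build_vocab(toy_articles)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_articles]
print(toy_vocab)   # {'computer': 0, 'program': 1, 'network': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each inner list is one document; word order is discarded and only the counts per term id remain, which is the input a TF-IDF model works from.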



*   Initialize a new **TfidfModel** called **corpus_tfidf** using **corpus**.
*   Calculate the weights for one document by passing its bag-of-words vector (here **corpus[4]**, the fifth document) to **corpus_tfidf**.



In [23]:
# Create a new TfidfModel using the corpus: corpus_tfidf
corpus_tfidf = TfidfModel(corpus)

# Choose the document to print TF-IDF values for (here the fifth document)
doc = corpus[4]  # bag-of-words vector of the document at index 4

# Get the TF-IDF weights for the document
tfidf_weights = corpus_tfidf[doc]

# Map the word IDs to words and pair them with their TF-IDF weights
words_with_tfidf = [(dictionary[term_id], weight) for term_id, weight in tfidf_weights]

# Print the first five results (not yet sorted by weight)
for word, tfidf_weight in words_with_tfidf[:5]:
    print(f"Word: {word}, TF-IDF: {tfidf_weight}")

Word: along, TF-IDF: 0.040038879506213966
Word: analog, TF-IDF: 0.031456084554369525
Word: billion, TF-IDF: 0.06291216910873905
Word: calculation, TF-IDF: 0.031456084554369525
Word: century, TF-IDF: 0.1258243382174781




*   Sort the term ids and weights in a new list from highest to lowest weight.

*   Using your pre-existing **dictionary**, print the five highest-weighted words (looked up by **term_id**) from **sorted_tfidf_weights**, along with their weights (**weight**).
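
If only the top few terms are needed, a full sort is not strictly required. As an alternative sketch, the standard library's `heapq.nlargest` picks the highest-weighted pairs directly. The id-to-word mapping and the weights below are made-up toy values standing in for the real dictionary and TF-IDF output, and `top_terms` is a hypothetical helper.

```python
import heapq

# Toy stand-ins (assumed values, not taken from the real corpus)
toy_id_to_word = {0: "along", 1: "century", 2: "technology", 3: "mechanical"}
toy_pairs = [(0, 0.04), (1, 0.13), (2, 0.21), (3, 0.12)]

def top_terms(weights, id_to_word, n=5):
    """Return the n highest-weighted (word, weight) pairs, best first."""
    best = heapq.nlargest(n, weights, key=lambda pair: pair[1])
    return [(id_to_word[term_id], weight) for term_id, weight in best]

toy_top = top_terms(toy_pairs, toy_id_to_word, n=2)
print(toy_top)  # [('technology', 0.21), ('century', 0.13)]
```

For a single document with a few hundred terms, sorting the whole list is just as reasonable; the heap-based version mainly helps when the list is long and only a handful of entries is wanted.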




In [24]:
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

technology 0.2143326946895572
bc 0.18047827438943564
becoming 0.18047827438943564
mechanical 0.126148900232541
semiconductor 0.126148900232541
