# Topics.ipynb

**Purpose:**

The script analyzes Markdown files (.md) in a specific folder on your computer. It extracts the text, cleans it up, and then identifies the most common words, two-word phrases (bigrams), and three-word phrases (trigrams), along with the least common trigrams. This helps you understand the key topics and language patterns in your documents.

**Steps:**

1. **Setup:**
   - Import necessary libraries: `glob`, `markdown`, `re`, `nltk`, `collections.Counter`, `nltk.tokenize`, `nltk.corpus`, `nltk.util`, and `bs4`.
   - Download NLTK data: Ensures the script has the required language resources (tokenizers, stop words).
   - Define the folder path: Specifies the directory where your Markdown files are located.

2. **Load and Prepare Text:**
   - Find Markdown files: Uses `glob.glob` to get a list of all `.md` files in the specified folder.
   - Read and convert text: Opens each file and renders its Markdown to HTML using `markdown.markdown`.
   - Remove HTML tags: Uses `BeautifulSoup` to strip the HTML tags, then appends the resulting plain text from each file to one combined string.
   - Clean text:
      - Removes punctuation (including apostrophes, so "that's" becomes "thats") and converts everything to lowercase.
      - Tokenizes the text into individual words using `word_tokenize`.

3. **Filter Words:**
   - Define filter lists: Creates sets of words to exclude from the analysis:
      - `weasel_words`: Filler or off-topic words specific to your documents that you want to ignore (e.g., "subscribe," "wow").
      - `days_of_week`: Names of days and "timeline."
      - `months`: Names of months (full and abbreviated).
      - `week_variations`: Variations of the word "week."
      - `stopwords`: Common words with little meaning (e.g., "the," "and").
   - Combine filters: Merges all these lists into one set for efficiency.
   - Filter out words: Keeps only words that are not in the filter list and are not numbers.

4. **Analyze Text:**
   - Calculate word frequencies:  Uses `Counter` to count how often each word appears.
   - Find top words: Gets the 25 most frequent words and their counts.
   - Create bigrams and trigrams: Forms two-word and three-word phrases using `ngrams`.
   - Count n-gram frequencies: Similar to word counts, but for phrases.
   - Find top n-grams: Gets the 25 most frequent bigrams and trigrams.
   - Find least common trigrams: Gets the 5 least frequent trigrams.

5. **Display Results:**
   - Print the 5 least common, 25 most common trigrams, 25 most common words, and 25 most common bigrams, along with their counts.
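
Steps 2 through 4 can be sketched in miniature using only the standard library. This is an illustrative stand-in rather than the notebook's own code: `str.split` takes the place of `word_tokenize`, `zip` takes the place of `nltk.util.ngrams`, and `sample_text` and `stop_set` are made-up values.

```python
import re
from collections import Counter

# Hypothetical sample text and filter set, for illustration only.
sample_text = "The quick fox jumps. The quick fox sleeps on Monday!"
stop_set = {"the", "on", "monday"}

# Step 2: strip punctuation, lowercase, and split into tokens.
cleaned = re.sub(r"[^\w\s]", "", sample_text).lower()
tokens = cleaned.split()  # stand-in for nltk's word_tokenize

# Step 3: drop filtered words and purely numeric tokens.
kept = [t for t in tokens if t not in stop_set and not t.isdigit()]

# Step 4: count single words and adjacent pairs (bigrams).
word_counts = Counter(kept)
bigram_counts = Counter(zip(kept, kept[1:]))  # stand-in for ngrams(kept, 2)

print(word_counts.most_common(2))
print(bigram_counts.most_common(1))
```

Because filtering happens before the phrases are built, words that were separated by a removed word or a sentence break become neighbors. In this sample, `("jumps", "quick")` is counted as a bigram even though a period and "the" sat between the two words.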

**Customization:**

This script can be modified by adjusting:
- `folder_path` to reflect the actual directory containing your Markdown files
- words in the `weasel_words` set to add or remove words to ignore in the analysis
- the number of common words or phrases to print, to adjust the scope of your results
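
As a rough illustration of the last knob, the sketch below shows how the argument to `most_common` and the length of the slice control how many results come back. The `counts` data and the names `TOP_N` and `BOTTOM_N` are invented for this example. The `[:-k-1:-1]` slice is the idiom the `collections.Counter` documentation gives for the k least common elements; unlike `[-k:]`, it lists the rarest item first.

```python
from collections import Counter

# Made-up counts standing in for trigram_counts.
counts = Counter({"alpha": 5, "beta": 3, "gamma": 2, "delta": 1})

TOP_N = 2     # hypothetical name: how many most-common items to show
BOTTOM_N = 2  # hypothetical name: how many least-common items to show

top = counts.most_common(TOP_N)
# [-k:] keeps the descending order produced by most_common().
bottom_desc = counts.most_common()[-BOTTOM_N:]
# [:-k-1:-1] walks backwards, so the rarest item comes first.
bottom_asc = counts.most_common()[:-BOTTOM_N - 1:-1]

print(top)
print(bottom_desc)
print(bottom_asc)
```

In a real set of documents, many trigrams are likely to occur exactly once. Which of them land in the "least common" list then depends only on tie order, so a larger bottom count may be more informative than the default of 5.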

# Set up your environment

- Use pip to upgrade pip
- Install pandas, requests, markdown, textblob, nltk, gensim, spacy, and beautifulsoup4 (the analysis script itself needs only markdown, nltk, and beautifulsoup4)
   

In [None]:
pip install --upgrade pip

In [None]:
pip install pandas

In [None]:
pip install requests

In [None]:
pip install markdown

In [None]:
pip install textblob

In [None]:
pip install nltk

In [None]:
pip install gensim

In [None]:
pip install spacy

In [None]:
pip install beautifulsoup4

# Customize for your needs 

- Update `folder_path`
- Run the script, then add any noise words you notice in the results to `weasel_words`
- Change the numeric values that control how many results are returned
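
One way to approach the second bullet is to inspect a first pass, note the words that carry no topical meaning, and add them to the filter set before counting again. The sketch below uses invented tokens, a cut-down `weasel_words` set, and a hypothetical helper `count_words`; only the set name comes from the script.

```python
from collections import Counter

# Invented tokens standing in for the tokenized posts.
tokens = ["python", "wow", "python", "wow", "wow", "testing", "nice"]
weasel_words = {"nice"}  # cut-down starting set


def count_words(tokens, ignore):
    """Count the tokens that are not in the ignore set."""
    return Counter(t for t in tokens if t not in ignore)


first_pass = count_words(tokens, weasel_words)
print(first_pass.most_common(2))  # "wow" ranks first but says nothing about topic

weasel_words.add("wow")  # add the noise word, then count again
second_pass = count_words(tokens, weasel_words)
print(second_pass.most_common(2))
```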

In [None]:
import glob
import markdown
import re
import nltk
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from bs4 import BeautifulSoup

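# Download NLTK data if needed
nltk.download('punkt')  # For tokenization
nltk.download('punkt_tab')  # Tokenizer data that newer NLTK releases look for instead of 'punkt'
nltk.download('stopwords')  # For stop word removal
nltk.download('averaged_perceptron_tagger')  # POS tagger data; not used by the code below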

# 1. Define Folder Path and Load Markdown Files
folder_path = '/path/to/your/blog/posts/*.md'  # Your folder path
markdown_files = glob.glob(folder_path)
all_text = ""

for file_path in markdown_files:
    with open(file_path, 'r', encoding='utf-8') as file:  # Assumes the files are UTF-8
        md_text = file.read()
        plain_text = markdown.markdown(md_text)

        # Remove HTML tags
        soup = BeautifulSoup(plain_text, 'html.parser')
        plain_text = soup.get_text()

        all_text += plain_text + " " 

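# 2. Clean and Preprocess Text
# Strip punctuation and lowercase. Apostrophes are removed too, so "that's"
# becomes "thats" and contractions such as "dont" will not match NLTK's stop words.
all_text = re.sub(r'[^\w\s]', '', all_text).lower()
words = word_tokenize(all_text)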

# Extended list of words to filter out

weasel_words = {"subscribe", "pulls", "self", "oooh", "wow", "nice", "thats", "grocery", "store", 
                "headroom", "shot"} # Add your words

days_of_week = {"timeline", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}
months = {"january", "february", "march", "april", "may", "june", "july", "august", "september",
          "october", "november", "december", "jan", "feb", "mar", "apr", "jun", "jul",
          "aug", "sep", "oct", "nov", "dec"}  # Full names and abbreviations
week_variations = {"week", "weeks", "weekly"}

words_to_filter = set(stopwords.words('english'))
words_to_filter.update(weasel_words, days_of_week, months, week_variations) 

# Drop filtered words and purely numeric tokens
filtered_words = [word for word in words if word not in words_to_filter and not word.isdigit()]

# 3. Analyze for Most Common Words, Bigrams, and Trigrams
word_counts = Counter(filtered_words)
top_words = word_counts.most_common(25)

bigrams = ngrams(filtered_words, 2)
bigram_counts = Counter(bigrams)
top_bigrams = bigram_counts.most_common(25)

trigrams = ngrams(filtered_words, 3)
trigram_counts = Counter(trigrams)
top_trigrams = trigram_counts.most_common(25)

# Get the 5 least common trigrams
least_common_trigrams = trigram_counts.most_common()[-5:]
# Note: most_common() sorts by descending count, so the `[-5:]` slice takes the
# last five entries, which are the least frequent. It does not reverse the order.

print("\n5 Least Common Three-Word Phrases:")
for trigram, count in least_common_trigrams:
    print(f"{trigram}: {count}")  

print("\nTop 25 Most Common Three-Word Phrases:")
for trigram, count in top_trigrams:
    print(f"{trigram}: {count}")

print("\nTop 25 Most Common Words:")
for word, count in top_words:
    print(f"{word}: {count}")

print("\nTop 25 Most Common Two-Word Phrases:")
for bigram, count in top_bigrams:
    print(f"{bigram}: {count}")
