
## TTR Analysis Report | LAB 1 | TASK 1 | SRIVATHSA K | 2022BCD0020 |

This report compares the Type-Token Ratio (TTR) of text extracted from the Natural Language Processing Wikipedia page with the TTRs of various genres from the Brown Corpus.

The Type-Token Ratio (TTR) is a measure of lexical diversity, calculated as the number of unique words (types) divided by the total number of words (tokens) in a text. A higher TTR generally indicates a more diverse vocabulary.
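As a small worked example of the definition above, the sketch below computes TTR for one short sentence. It is only an illustration: `simple_ttr` and its regular-expression tokenizer are assumptions made for this example and are not the NLTK-based function used later in the notebook.

```python
import re


def simple_ttr(text: str) -> float:
    """Type-token ratio with a simple lowercase regex tokenizer (illustrative only)."""
    # Treat runs of letters, digits, and apostrophes as word tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    # Types are the distinct tokens.
    return len(set(tokens)) / len(tokens)


sample = "the cat sat on the mat and the dog sat too"
# 11 tokens, 8 distinct types ("the" appears three times, "sat" twice).
print(f"TTR = {simple_ttr(sample):.4f}")
```

Repeating a word adds a token but not a type, which is why the ratio drops below 1 as soon as any word recurs.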

### TTR Results

The TTR for the Natural Language Processing Wikipedia page is 0.4022.

The TTRs for the Brown Corpus genres are as follows:

{{report_df.to_markdown(index=False)}}

### Comparison and Analysis

The TTR of the NLP Wikipedia page (0.4022) is substantially higher than the TTRs of all genres in the Brown Corpus. Taken at face value, this suggests that the vocabulary of the NLP Wikipedia page is more diverse than that of the Brown Corpus genres. Some of this is expected: a Wikipedia page on a technical topic like Natural Language Processing is likely to contain a wide range of specialized terms and concepts, which raises the number of unique words relative to the total word count, especially compared with more general or less technical genres. Text length also matters, however. TTR tends to fall as a text grows, because frequent words keep repeating while new types appear more slowly, and a single Wikipedia page is far shorter than a Brown Corpus genre. Part of the gap therefore reflects length rather than vocabulary alone.

Genres like 'humor' (0.2126) and 'science_fiction' have relatively higher TTRs within the Brown Corpus, possibly due to creative language use or specialized terminology, but they are still considerably lower than the TTR of the NLP Wikipedia page. Conversely, genres like 'learned' (0.0828) and 'government' (0.1024) have lower TTRs, which might indicate more repetitive or standardized language.
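Because raw TTR depends on how long a text is, comparisons between texts of very different sizes are easier to interpret with a length-controlled variant. The following is a minimal sketch of one such variant, the mean segmental TTR, which averages TTR over consecutive fixed-size segments. The function name `mean_segmental_ttr` and the default `segment_size` of 1000 are assumptions for illustration and are not part of the notebook.

```python
from typing import Sequence


def mean_segmental_ttr(tokens: Sequence[str], segment_size: int = 1000) -> float:
    """Average TTR over consecutive, non-overlapping segments of equal length.

    A trailing partial segment is discarded so every segment has the same size.
    Returns 0.0 when the text is shorter than one segment.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    n_segments = len(tokens) // segment_size
    if n_segments == 0:
        return 0.0
    ratios = []
    for i in range(n_segments):
        segment = tokens[i * segment_size:(i + 1) * segment_size]
        ratios.append(len(set(segment)) / segment_size)
    return sum(ratios) / len(ratios)


# Tiny demonstration with a hypothetical token list and a segment size of 4.
demo_tokens = ["a", "b", "c", "d", "a", "b", "c", "d"]
plain_ttr = len(set(demo_tokens)) / len(demo_tokens)
segmental_ttr = mean_segmental_ttr(demo_tokens, segment_size=4)
print(plain_ttr, segmental_ttr)
```

In the demonstration the plain TTR is halved simply because the same four words occur twice, while the segmental value is unaffected by the repetition across segments. Applied to the Wikipedia text and to each genre with the same `segment_size`, this kind of measure would put all texts on a comparable footing.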

In conclusion, the high TTR of the NLP Wikipedia page reflects the specialized and diverse vocabulary characteristic of technical documentation and academic topics.

In [14]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
import pandas as pd

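# Download all NLTK resources. nltk.download() signals failure through its
# return value rather than by raising LookupError, so check that instead.
if not nltk.download('all'):
    print("Could not download all NLTK resources. Please check your internet connection.")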


def get_text_from_url(url):
    """Extracts text content from a given URL."""
    try:
        response = requests.get(url, timeout=10)  # Fail instead of hanging if the server is slow
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract text from paragraphs
        text = ' '.join([p.get_text() for p in soup.find_all('p')])
        return text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

def calculate_ttr(text):
    """Calculates the Type-Token Ratio (TTR) of a given text."""
    if not text:
        return 0
    tokens = word_tokenize(text.lower())  # Lowercase, then tokenize; punctuation marks count as tokens
    types = set(tokens)
    if not tokens:
        return 0
    return len(types) / len(tokens)

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

In [15]:
url = "https://en.wikipedia.org/wiki/Natural_language_processing"
nlp_text = get_text_from_url(url)

if nlp_text:
    nlp_ttr = calculate_ttr(nlp_text)
    print(f"TTR for Natural Language Processing Wikipedia page: {nlp_ttr:.4f}")
else:
    nlp_ttr = None
    print("Could not retrieve text from the URL.")

TTR for Natural Language Processing Wikipedia page: 0.4022


In [19]:
brown_genres = brown.categories()
brown_ttr_data = []

for genre in brown_genres:
    words = brown.words(categories=genre)
    genre_text = ' '.join(words)
    genre_ttr = calculate_ttr(genre_text)
    brown_ttr_data.append({'Genre': genre, 'TTR': genre_ttr})

brown_ttr_df = pd.DataFrame(brown_ttr_data)
print("\nTTRs for Brown Corpus Genres:")
display(brown_ttr_df)


TTRs for Brown Corpus Genres:


| | Genre | TTR |
|---|---|---|
| 0 | adventure | 0.113972 |
| 1 | belles_lettres | 0.095434 |
| 2 | editorial | 0.143135 |
| 3 | fiction | 0.122161 |
| 4 | government | 0.102401 |
| 5 | hobbies | 0.127673 |
| 6 | humor | 0.212629 |
| 7 | learned | 0.082826 |
| 8 | lore | 0.118007 |
| 9 | mystery | 0.107466 |
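The loop in the cell above joins the Brown tokens into one string and then tokenizes that string again. Since `brown.words` already yields tokens, the ratio can also be computed directly from the token sequence. The sketch below shows this under stated assumptions: `ttr_from_tokens` and its `keep_punctuation` flag are hypothetical names introduced here, and the example token list is made up for illustration.

```python
from typing import Iterable


def ttr_from_tokens(tokens: Iterable[str], keep_punctuation: bool = True) -> float:
    """Type-token ratio of an already tokenized text, ignoring case."""
    lowered = [token.lower() for token in tokens]
    if not keep_punctuation:
        # Keep only tokens that contain at least one letter or digit.
        lowered = [token for token in lowered if any(ch.isalnum() for ch in token)]
    if not lowered:
        return 0.0
    return len(set(lowered)) / len(lowered)


# Hypothetical token list standing in for a corpus reader's output.
example_tokens = ["The", "jury", "said", "the", "election", "was", "fair", "."]
print(ttr_from_tokens(example_tokens))
print(ttr_from_tokens(example_tokens, keep_punctuation=False))
```

In the notebook this could be called as `ttr_from_tokens(brown.words(categories=genre))`. The values may differ slightly from the table above, because re-tokenizing the joined string with `word_tokenize` can split some tokens differently from the corpus's own tokenization.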


In [18]:
report_data = brown_ttr_data.copy()
if nlp_ttr is not None:
    report_data.append({'Genre': 'NLP Wikipedia Page', 'TTR': nlp_ttr})

report_df = pd.DataFrame(report_data)
print("\nCombined TTR Data for Reporting:")
display(report_df)


Combined TTR Data for Reporting:


| | Genre | TTR |
|---|---|---|
| 0 | adventure | 0.113972 |
| 1 | belles_lettres | 0.095434 |
| 2 | editorial | 0.143135 |
| 3 | fiction | 0.122161 |
| 4 | government | 0.102401 |
| 5 | hobbies | 0.127673 |
| 6 | humor | 0.212629 |
| 7 | learned | 0.082826 |
| 8 | lore | 0.118007 |
| 9 | mystery | 0.107466 |
