Parisaroozgarian/NLTK-Corpora-Analysis

NLTK Corpora Analysis 📚

📋 Project Overview

Subtask 1: Install and Import NLTK

  • Install the NLTK library if you haven't already.
  • Import the necessary modules from NLTK.
  • Download the required NLTK data files to complete the next subtasks.
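A minimal setup sketch for this subtask. The README does not name the exact data packages, so the resource list below is an assumption based on what the later subtasks appear to need; `download_resources` is a hypothetical helper, not part of NLTK. Some resource names differ between NLTK releases, so both the older and newer names are listed.

```python
# Install first from a shell: pip install nltk matplotlib
import nltk

# Assumed resource list; the README does not specify which packages to fetch.
REQUIRED_RESOURCES = [
    "gutenberg",                       # corpus opened in Subtask 2
    "punkt",                           # sentence/word tokenizer models
    "punkt_tab",                       # tokenizer tables used by newer NLTK releases
    "averaged_perceptron_tagger",      # POS tagger model
    "averaged_perceptron_tagger_eng",  # tagger name used by newer NLTK releases
    "wordnet",                         # dictionary used by the lemmatizer
]


def download_resources(names=REQUIRED_RESOURCES, downloader=nltk.download):
    """Download each resource and return a {name: succeeded} mapping."""
    return {name: bool(downloader(name, quiet=True)) for name in names}
```

Calling `download_resources()` once per environment should be enough; any entry mapped to `False` in the result points at a package that could not be fetched.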

Subtask 2: Sentence and Word Tokenization

  • Open the Gutenberg corpus.
  • Choose a specific file (e.g., 'austen-emma.txt') and tokenize it into sentences.
  • Print the total number of sentences and the first sentence.
  • Tokenize the text into words and print the tokens.
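The steps above can be sketched as follows, assuming the Gutenberg corpus and the tokenizer models are already downloaded. `tokenize_text` and `analyze_file` are hypothetical helper names. The sketch prints only the first 20 word tokens rather than all of them, because the full novel yields a very long list.

```python
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize, word_tokenize


def tokenize_text(raw, sent_tokenizer=sent_tokenize, word_tokenizer=word_tokenize):
    """Split raw text into a list of sentences and a list of word tokens."""
    sentences = sent_tokenizer(raw)
    words = word_tokenizer(raw)
    return sentences, words


def analyze_file(fileid="austen-emma.txt"):
    """Load one Gutenberg file, tokenize it, and print a short summary."""
    raw = gutenberg.raw(fileid)  # whole file as a single string
    sentences, words = tokenize_text(raw)
    print("Total sentences:", len(sentences))
    print("First sentence:", sentences[0])
    print("First 20 tokens:", words[:20])  # truncated on purpose
    return sentences, words
```

`gutenberg.fileids()` lists the other files that could be passed to `analyze_file` in place of 'austen-emma.txt'.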

Subtask 3: Bigrams, Trigrams, and POS Tagging

  • Generate bigrams and trigrams from the word tokens and print the first 10 of each.
  • Perform POS tagging on the word tokens and print the first 10 tokens with their POS tags.
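A short sketch of this subtask, assuming `tokens` is the word-token list from Subtask 2 and that the POS tagger model is downloaded. `first_ngrams` and `first_tags` are hypothetical helper names; `bigrams`, `trigrams`, and `pos_tag` are NLTK functions.

```python
from itertools import islice

from nltk import pos_tag
from nltk.util import bigrams, trigrams


def first_ngrams(tokens, limit=10):
    """Return the first `limit` bigrams and trigrams of a token list."""
    # bigrams() and trigrams() return generators, so islice avoids
    # materializing every n-gram of a long text.
    first_bigrams = list(islice(bigrams(tokens), limit))
    first_trigrams = list(islice(trigrams(tokens), limit))
    return first_bigrams, first_trigrams


def first_tags(tokens, limit=10, tagger=pos_tag):
    """Tag all tokens and return the first `limit` (token, tag) pairs."""
    return tagger(tokens)[:limit]
```

Printing the results of `first_ngrams(tokens)` and `first_tags(tokens)` covers both bullets. Tagging the full token list of a novel can take a while; tagging only a slice is faster but gives the tagger less context near the end of the slice.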

Subtask 4: Stemming, Lemmatization, and Frequency Distribution

  • Stem each word token and print the original token, its POS tag, and its stem.
  • Lemmatize each word token and print the original token and its lemma.
  • Create a frequency distribution of the word tokens and plot the top 20 words.
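One possible sketch of this subtask. The README does not say which stemmer or lemmatizer to use, so `PorterStemmer` and `WordNetLemmatizer` are assumptions here, and the helper names are hypothetical. The lemmatizer needs the 'wordnet' resource.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer, WordNetLemmatizer


def stem_tagged(tagged_tokens):
    """Return (token, tag, stem) triples for already POS-tagged tokens."""
    stemmer = PorterStemmer()
    return [(tok, tag, stemmer.stem(tok)) for tok, tag in tagged_tokens]


def lemmatize_tokens(tokens):
    """Return (token, lemma) pairs; tokens are treated as nouns by default."""
    lemmatizer = WordNetLemmatizer()
    return [(tok, lemmatizer.lemmatize(tok)) for tok in tokens]


def word_frequencies(tokens):
    """Build a frequency distribution over the word tokens."""
    return FreqDist(tokens)
```

With `fdist = word_frequencies(tokens)`, `fdist.most_common(20)` returns the 20 most frequent tokens and `fdist.plot(20)` draws them, assuming matplotlib is installed. Without filtering, punctuation and function words will likely dominate the top of the plot.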

🔑 Key Skills

  • Python Programming
  • Data Analysis
  • Documentation

🛠️ Tools

📖 Libraries

About

Use the NLTK library to open a corpus and perform different types of analysis
