## Lab8 Assignment Task PROG8245 - NLP Introduction

### Name: Paras Rupani
### ID: 8961758


Apply the topics discussed in class: tokenizers, stop-word removal, stemming/lemmatization, and POS tagging. <br><br>
Create **ONE** function that takes a string as input and returns the result of stemming/lemmatizing that string.<br><br>
**Kindly note that you are required to consider the POS tag during your stemming or lemmatization step (use whichever is more suitable for this task).** <br><br>
After creating the function, run it on 10 **random** files from the Reuters corpus; an example of how to download and load a file from the Reuters corpus is below. <br><br>
Your 10 **random** files should be retrieved by generating a random array of length 10 that picks numbers RANDOMLY from 0 to len(reuters.fileids()); the elements retrieved will be your corpus.<br> <br>*You need to set your seed equal to the last 3 digits of your student ID.*<br>For example, if your ID is 8000888, then seed = 888. <br>
**You may need to tailor your task based on the dataset to remove some special characters.**
 

#### Last Step
After finishing your code, run it and save the result in a Python dictionary of the following format:<br>
{DocumentID: [List of Words], <br>
...} <br>
Save your Python dictionary as a JSON file or a Pickle file. <br>
Sample code for saving a Python dictionary is available at the end of this notebook.<br>


### Importing Required Libraries

In [2]:
import pickle
import string

import nltk
import numpy as np
from nltk import pos_tag
from nltk.corpus import reuters, stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

### Downloading various resources and corpora

In [3]:
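nltk.download('reuters')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
# stopwords.words('english') is used later and needs the 'stopwords' corpus;
# if it is not installed yet, also run: nltk.download('stopwords')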

[nltk_data] Downloading package reuters to /home/paras/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package wordnet to /home/paras/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/paras/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/paras/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### File Inspections

In [4]:
# Read one specific file from the Reuters corpus
document_id = 'training/9865'
text = reuters.raw(document_id)

# Check how many files the corpus contains
print("Number of files:", len(reuters.fileids()))

text

Number of files: 10788


"FRENCH FREE MARKET CEREAL EXPORT BIDS DETAILED\n  French operators have requested licences\n  to export 675,500 tonnes of maize, 245,000 tonnes of barley,\n  22,000 tonnes of soft bread wheat and 20,000 tonnes of feed\n  wheat at today's European Community tender, traders said.\n      Rebates requested ranged from 127.75 to 132.50 European\n  Currency Units a tonne for maize, 136.00 to 141.00 Ecus a tonne\n  for barley and 134.25 to 141.81 Ecus for bread wheat, while\n  rebates requested for feed wheat were 137.65 Ecus, they said.\n  \n\n"

### Setting random seed

In [5]:
# Setting Random seed as last 3 digits of 8961758
np.random.seed(758)
random_files = np.random.choice(np.arange(len(reuters.fileids())), size=10, replace=False)

random_files

array([7751, 8164, 9745, 8821, 5016,  759, 3060, 2724, 2385, 4594])
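
The selection above depends on two things: the seed makes the draw repeatable, and `replace=False` keeps the ten positions distinct. The sketch below illustrates both on a small, made-up list of IDs. `file_ids` and `pick_random_ids` are hypothetical names used only for this illustration, and a local `RandomState` is used instead of the global seed so the example does not disturb the notebook's random state.

```python
import numpy as np

# Hypothetical stand-in for reuters.fileids(); the real list has 10788 entries.
file_ids = [f"training/{i}" for i in range(100)]

def pick_random_ids(ids, seed, k=10):
    """Return k distinct entries of ids, chosen reproducibly from the given seed."""
    rng = np.random.RandomState(seed)  # local generator, global state is untouched
    positions = rng.choice(len(ids), size=k, replace=False)
    return [ids[p] for p in positions]

first = pick_random_ids(file_ids, seed=758)
second = pick_random_ids(file_ids, seed=758)

print(first == second)   # True: same seed, same selection
print(len(set(first)))   # 10: replace=False yields distinct documents
```

Note that the drawn values are positions in the list of file IDs, not the numbers that appear inside the IDs themselves, which is why they are mapped back through the list before reading a document.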

### Function to perform PreProcessing

In [6]:
def nlp_preprocess(input_text):
    """Tokenize input_text, drop stop words and punctuation, and return POS-aware lemmas."""
    special_chars = set(string.punctuation)
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    # Map a Penn Treebank tag (from pos_tag) to the WordNet POS the lemmatizer expects
    def get_wordnet_pos(tag):
        if tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('J'):
            return wordnet.ADJ
        else:
            # Any other tag falls back to noun, the lemmatizer's default
            return wordnet.NOUN

    # Tokenizing the input
    tokens = word_tokenize(input_text)

    # Removing stopwords and single-character punctuation tokens
    clean_tokens = [token for token in tokens if token.lower() not in stop_words and token not in special_chars]

    # Part-of-speech tagging of the remaining tokens
    pos_tags = pos_tag(clean_tokens)

    # Lemmatization with POS tags
    lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in pos_tags]

    return lemmatized_tokens
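
The tag mapping inside `nlp_preprocess` looks only at the first letter of the Penn Treebank tag. The same idea can be sketched as a dictionary lookup that runs without NLTK installed. The single-letter codes are the values of `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ`, and `wordnet.ADV`; the names `TAG_PREFIX_TO_WORDNET` and `to_wordnet_pos` are hypothetical and not part of the function above.

```python
# WordNet POS codes: the values of wordnet.NOUN, wordnet.VERB,
# wordnet.ADJ and wordnet.ADV in NLTK.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

# First letter of a Penn Treebank tag -> WordNet POS code.
TAG_PREFIX_TO_WORDNET = {"N": NOUN, "V": VERB, "J": ADJ, "R": ADV}

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1], NOUN)

# Plural noun, gerund verb, adjective, adverb, cardinal number
for tag in ["NNS", "VBG", "JJ", "RB", "CD"]:
    print(tag, "->", to_wordnet_pos(tag))
```

Tags with no WordNet counterpart, such as `CD` for numbers, fall back to noun, which matches the `else` branch of `get_wordnet_pos`.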

### PreProcessing Example

In [7]:
# Example:
input_text = "I am using an advanced laptop to perform machine learning evaluations"
print(nlp_preprocess(input_text))

['use', 'advanced', 'laptop', 'perform', 'machine', 'learn', 'evaluation']
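
The task description asks for dataset-specific handling of special characters. The `lt` tokens in the Reuters results further down suggest that the raw text encodes `<` as the HTML entity `&lt;`, which the tokenizer then splits into pieces. A minimal sketch of one possible cleanup step, applied before tokenizing, is shown here. `clean_raw_text` is a hypothetical helper that `nlp_preprocess` does not call, and the sample string is made up for illustration rather than copied from the corpus.

```python
import html
import re

def clean_raw_text(raw):
    """Decode HTML entities, drop angle brackets, and collapse whitespace."""
    text = html.unescape(raw)            # "&lt;" becomes "<"
    text = re.sub(r"[<>]", " ", text)    # remove the angle brackets themselves
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample in the style of a Reuters earnings headline
sample = "PAYLESS CASHWAYS INC &lt;PCI> 1ST QTR NET\n  Shr 10 cts vs seven cts"
print(clean_raw_text(sample))
# PAYLESS CASHWAYS INC PCI 1ST QTR NET Shr 10 cts vs seven cts
```

Passing the cleaned text to the preprocessing function instead of the raw text would be one way to keep entity fragments out of the final word lists; whether ticker symbols such as `PCI` should be kept is a separate choice.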


### Building the Python dictionary


In [8]:
# Dictionary to store preprocessed text
my_dict = {}

# Fetch the list of file IDs once instead of on every iteration
file_ids = reuters.fileids()

for idx in random_files:
    # Retrieve the file ID at this random position
    file_id = file_ids[idx]

    # Extract the raw text content
    text = reuters.raw(file_id)

    # Preprocess the text
    preprocessed_text = nlp_preprocess(text)

    # Storing the preprocessed text
    my_dict[file_id] = preprocessed_text

### Writing the Corpus to a Pickle File and Reading It Back

In [9]:
# Save the dictionary to a file using Pickle
with open('output.pkl', 'wb') as f:
    pickle.dump(my_dict, f)

    
# Reading the pickle file:
with open('output.pkl', 'rb') as f:
    loaded_dict = pickle.load(f)
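
The task also allows JSON as the output format. Since the dictionary maps string IDs to lists of strings, it should serialize to JSON without conversion. The sketch below shows the round trip on a small hypothetical dictionary; `sample_dict` and the temporary file path are assumptions made for the example and are not part of the notebook's results.

```python
import json
import os
import tempfile

# Hypothetical miniature of my_dict: document ID -> list of lemmas.
sample_dict = {"training/1": ["use", "laptop"], "test/2": ["learn", "evaluation"]}

# Write to a temporary directory so the example leaves the working folder alone
path = os.path.join(tempfile.mkdtemp(), "output.json")

with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_dict, f, indent=2)

with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == sample_dict)  # True: the round trip preserves the dictionary
```

Unlike Pickle, the JSON file is plain text, so it can be inspected in an editor and loaded from other languages.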

### File Content of 10 Files

In [10]:
# Printing the FileID and content
for key, value in loaded_dict.items():
    print(f"{key}: {value}")
    print()

training/529: ['lt', 'FRANKLIN', 'UTILITIES', 'FUND', 'SETS', 'PAYOUT', 'Qtly', 'div', '14', 'ct', 'vs', '14', 'ct', 'prior', 'Pay', 'March', '13', 'Record', 'March', 'Two']

training/5901: ['PAYLESS', 'CASHWAYS', 'INC', 'lt', 'PCI', '1ST', 'QTR', 'FEB', '28', 'NET', 'Shr', '10', 'ct', 'vs', 'seven', 'ct', 'Net', '3,501,000', 'v', '2,420,000', 'Sales', '332.7', 'mln', 'vs', '274.9', 'mln', 'Qtly', 'div', 'four', 'ct', 'vs', 'four', 'ct', 'prior', 'Pay', 'April', 'Six', 'Record', 'March', 'Six']

training/8388: ['COMPUTER', 'DEVICES', 'INC', '4TH', 'QTR', 'Shr', 'loss', 'one', 'cnt', 'v', 'profit', 'one', 'cnt', 'Net', 'loss', '35,000', 'vs', 'profit', '42,000', 'Revs', '881,000', 'v', '1.3', 'mln', 'Year', 'Shr', 'profit', 'seven', 'ct', 'vs', 'profit', 'nine', 'ct', 'Net', 'profit', '291,000', 'vs', 'profit', '366,000', 'Revs', '4.4', 'mln', 'v', '5.9', 'mln', 'NOTE:1985', '4th', 'qtr', 'year', 'include', 'gain', '7,000', 'dlrs', '147,000', 'dlrs', 'respectivley', '1986', 'year', 'inc