## Given a string of text, write a function to extract all the email addresses from the text.

In [1]:
import re

In [2]:
def extract_email_address(text):
    pattern = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'  # also allow dots, plus signs, and hyphens in the address
    
    matches = re.findall(pattern,text)
    
    return matches

In [3]:
sample_text = "Hello my id is email@abc.com and phone is 8002003001"
email = extract_email_address(sample_text)
print(email)

['email@abc.com']


## Write a Python program to extract all the URLs from a webpage.

In [4]:
import requests
from bs4 import BeautifulSoup

In [5]:
def extract_urls(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    links = soup.find_all('a')  # find all <a> (anchor) tags in the HTML
    
    urls = []
    for link in links:
        if 'href' in link.attrs:
            urls.append(link.attrs['href'])

    return urls
    

In [6]:
url = 'https://en.wikipedia.org/wiki/Data_science'
urls = extract_urls(url)
print(urls)

['#bodyContent', '/wiki/Main_Page', '/wiki/Wikipedia:Contents', '/wiki/Portal:Current_events', '/wiki/Special:Random', '/wiki/Wikipedia:About', '//en.wikipedia.org/wiki/Wikipedia:Contact_us', 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en', '/wiki/Help:Contents', '/wiki/Help:Introduction', '/wiki/Wikipedia:Community_portal', '/wiki/Special:RecentChanges', '/wiki/Wikipedia:File_upload_wizard', '/wiki/Main_Page', '/wiki/Special:Search', '/w/index.php?title=Special:CreateAccount&returnto=Data+science', '/w/index.php?title=Special:UserLogin&returnto=Data+science', '/w/index.php?title=Special:CreateAccount&returnto=Data+science', '/w/index.php?title=Special:UserLogin&returnto=Data+science', '/wiki/Help:Introduction', '/wiki/Special:MyContributions', '/wiki/Special:MyTalk', '#', '#Foundations', '#Relationship_to_statistics', '#Etymology', '#Early_usage', '#Modern_usage', '#See_also', '#Referenc
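
The list above mixes fragment links (`#bodyContent`), site-relative paths (`/wiki/Main_Page`), and protocol-relative URLs (`//en.wikipedia.org/...`). A minimal sketch of one way to normalize them, assuming absolute, de-duplicated URLs are what is wanted; `to_absolute_urls` is a hypothetical helper name, not part of the original notebook:

```python
from urllib.parse import urljoin, urldefrag

def to_absolute_urls(base_url, hrefs):
    """Resolve hrefs against base_url, dropping fragments and duplicates."""
    seen = set()
    absolute = []
    for href in hrefs:
        # urljoin handles relative, protocol-relative, and absolute hrefs
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in seen:
            seen.add(url)
            absolute.append(url)
    return absolute

base = 'https://en.wikipedia.org/wiki/Data_science'
sample = ['#bodyContent', '/wiki/Main_Page',
          '//en.wikipedia.org/wiki/Wikipedia:Contact_us', '/wiki/Main_Page']
print(to_absolute_urls(base, sample))
```

Passing the output of `extract_urls(url)` as `hrefs` and the page URL as `base_url` would apply the same normalization to a live page.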

## Write a function to extract all the phone numbers from a string.

In [7]:
def extract_phone(text):
    
    pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    
    matches = re.findall(pattern,text)
    
    return matches

In [8]:
text = "Hey my phone number is 123-234-5678"
print(extract_phone(text))

['123-234-5678']


## Given a CSV file with a large amount of data, write a function to extract a specific subset of the data.


In [9]:
import csv

In [21]:
def extract_csv_subset(ip_filename, op_filename, condition):
    with open(ip_filename,'r') as input_file, open(op_filename, 'w', newline='') as output_file:
        reader = csv.reader(input_file)
        writer = csv.writer(output_file)
        
        writer.writerow(next(reader))
        
        for row in reader:
            if condition(row):
                writer.writerow(row)
                
def greater_than_10(row):
    return int(row[0]) > 10 and int(row[1]) < 100  # keep rows where the first column is > 10 and the second column is < 100

In [22]:
extract_csv_subset('./attachments/telecom_churn.csv', './attachments/output.csv', greater_than_10)
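
The condition above addresses columns by position, which breaks if the column order changes. A minimal sketch that filters by header name with `csv.DictReader` instead; the column names `tenure` and `charges` and the helper `filter_rows` are made up for illustration and are not taken from `telecom_churn.csv`:

```python
import csv
import io

def filter_rows(in_stream, out_stream, condition):
    """Copy rows that satisfy condition from in_stream to out_stream."""
    reader = csv.DictReader(in_stream)
    writer = csv.DictWriter(out_stream, fieldnames=reader.fieldnames,
                            lineterminator='\n')
    writer.writeheader()
    kept = 0
    for row in reader:  # only one row is held in memory at a time
        if condition(row):
            writer.writerow(row)
            kept += 1
    return kept

# Hypothetical data; in-memory streams stand in for real files
source = io.StringIO("tenure,charges\n5,20\n12,80\n30,150\n")
target = io.StringIO()
kept = filter_rows(source, target,
                   lambda r: int(r['tenure']) > 10 and int(r['charges']) < 100)
print(kept)
print(target.getvalue())
```

Because rows are streamed one at a time, the same approach should scale to files that do not fit in memory.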

## Write a program to extract all the text from a PDF file.

In [42]:
import fitz

In [44]:
def extract_pdf_text(file_path):
    with fitz.open(file_path) as file:
        # concatenate the text of every page
        return "".join(page.get_text() for page in file)

In [45]:
extract_pdf_text('./ManasaShivarudra_CV.pdf')

"N/A\nN/A\nManasa Shivarudra\nmanasa.shivarudra@gmail.com\n8067301117\nhttps://www.linkedin.com/in/manasa-shivarudra/\nSummary\nSelf-taught Data Science and Machine Learning enthusiast seeking full time opportunities in Data Science and Machine Learning field. Dependable and \nself-motivated software professional with 5 years experience working directly with the customer to build their outbound IVR applications in different \nchannels. Worked as the Technical Lead for a 3 year implementation project which involved migration of 250 applications.\nExperience\nMachine Learning Engineer\nOmdena AI, Philadelphia\nJune 2021 - July 2022\n|\n|\n• Collaborated on the end-to-end Machine Learning project to build a model to predict the Energy consumption in a building which involved data \nprocessing, EDA, data visualization, feature engineering, model building and deployment on Azure ML Studio. \n• Deployed the model in Azure machine learning studio by creating a real-time endpoints and inferenc

In [51]:
## Alternative using camelot (PyPI package `camelot-py`); note that it extracts tables, not free-form text

import camelot


ModuleNotFoundError: No module named 'camelot'

In [52]:
def extract_text_from_pdf(file_path):
    tables = camelot.read_pdf(file_path, pages='all', flavor='stream')
    text=""
    for table in tables:
        text+= table.df.to_string()
    return text

## Given a string of HTML code, write an optimized Python function to extract all the links from the HTML code.

Parsing has a time complexity of O(n), where n is the total number of characters in the HTML code, because the parser reads each character once.

The returned list has a space complexity of O(k), where k is the number of links in the HTML code. Note that BeautifulSoup also builds a parse tree of the whole document, so total memory use grows with n as well.
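
To keep extra memory close to the O(k) described above, the HTML can be scanned with the standard library's event-based `html.parser.HTMLParser`, which does not build a document tree. This is a sketch that assumes BeautifulSoup's more forgiving handling of malformed markup is not needed; `LinkCollector` and `extract_links_streaming` are hypothetical names:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags without building a parse tree."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

def extract_links_streaming(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

html = ('<p><a href="/wiki/Main_Page">Main</a> <a name="top">no link</a> '
        '<a href="#See_also">See also</a></p>')
print(extract_links_streaming(html))
```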

In [85]:
def extract_links(filename):
    # parse the saved HTML file; the with-block closes it afterwards
    with open(filename, 'r', encoding="utf8") as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')

    links = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is not None:
            links.append(href)
    return links

In [86]:
extract_links('./Data science - Wikipedia.html')

['https://en.wikipedia.org/wiki/Data_science#bodyContent',
 'https://en.wikipedia.org/wiki/Main_Page',
 'https://en.wikipedia.org/wiki/Wikipedia:Contents',
 'https://en.wikipedia.org/wiki/Portal:Current_events',
 'https://en.wikipedia.org/wiki/Special:Random',
 'https://en.wikipedia.org/wiki/Wikipedia:About',
 'https://en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 'https://en.wikipedia.org/wiki/Help:Contents',
 'https://en.wikipedia.org/wiki/Help:Introduction',
 'https://en.wikipedia.org/wiki/Wikipedia:Community_portal',
 'https://en.wikipedia.org/wiki/Special:RecentChanges',
 'https://en.wikipedia.org/wiki/Wikipedia:File_upload_wizard',
 'https://en.wikipedia.org/wiki/Main_Page',
 'https://en.wikipedia.org/wiki/Special:Search',
 'https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Data+science',
 'https://en.wikipedi

## Write a function to extract all the hashtags from a string.

In [89]:
def extract_hashtags(text):
    pattern = r'#\w+'
    return re.findall(pattern, text)

In [90]:
text = "I am a queen. #doglover #queen"
print(extract_hashtags(text))

['#doglover', '#queen']


## Given a string of JSON data, write an optimized program to extract specific fields from the data.

In [4]:
import json

In [53]:
def extract_fields(json_string, field_names):
    data = json.loads(json_string)
    result = []
    for item in data:
        extracted_fields = {}
        for field_name in field_names:
            if field_name in item:
                extracted_fields[field_name] = item[field_name]
        result.append(extracted_fields)
    return result

In [55]:
field_names = ['name', 'age']
with open('./sample_json.json', 'r') as f:
    json_string = f.read()
result = extract_fields(json_string, field_names)
print(result)

[{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]


In [35]:
import json

def extract_fields_from_json_file(file_path, field_names):
    with open(file_path, 'r') as file:
        data = json.load(file)
    result = []
    for item in data:
        extracted_fields = {}
        for field_name in field_names:
            if field_name in item:
                extracted_fields[field_name] = item[field_name]
        result.append(extracted_fields)
    return result


In [34]:
file_path = './sample_json.json'
field_names = ['name', 'age']
result = extract_fields_from_json_file(file_path, field_names)
print(result)


[{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]


## Write a function to extract all the mentions from a string (i.e., all the Twitter usernames starting with '@').

In [58]:
import re
def extract_mentions(text):
    return re.findall(r'@(\w+)', text)

In [59]:
text = "Just saw @johndoe and @janedoe at the coffee shop. @johndoe makes the best coffee!"
mentions = extract_mentions(text)
print(mentions)  # Output: ['johndoe', 'janedoe', 'johndoe']

['johndoe', 'janedoe', 'johndoe']
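
Note that the pattern `@(\w+)` also matches the domain part of an email address (for example `abc` in `email@abc.com`) and returns repeated mentions more than once. A minimal sketch of a stricter variant, assuming that a mention is never directly preceded by a word character or a dot; `extract_unique_mentions` is a hypothetical name:

```python
import re

# A mention must not be preceded by a word character or a dot,
# which rules out the "@" inside email addresses.
MENTION = re.compile(r'(?<![\w.])@(\w+)')

def extract_unique_mentions(text):
    """Return mentions in order of first appearance, without duplicates."""
    return list(dict.fromkeys(MENTION.findall(text)))

sample = "Mail email@abc.com, then ping @johndoe and @janedoe. Thanks @johndoe!"
print(extract_unique_mentions(sample))
```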


## Write a Python program that reads a log file and extracts the number of requests per hour.

In [60]:
from collections import defaultdict

In [67]:
def extract_requests_per_hour(log_file):
    # matches timestamps of the form [dd/Mon/yyyy:hh:mm:ss and captures the hour
    timestamp = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2}')
    requests_per_hr = defaultdict(int)

    with open(log_file, 'r') as f:
        for line in f:
            match = timestamp.search(line)
            if match:
                requests_per_hr[match.group(1)] += 1

    for hour, count in sorted(requests_per_hr.items()):
        print(f'hour: {hour} : {count}')

In [68]:
log_file = './log.txt'
extract_requests_per_hour(log_file)

hour: 12 : 2
hour: 13 : 2
hour: 14 : 1
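
The function above keys its counts on the hour alone, so requests made at the same hour on different days are merged. A minimal sketch that keys on the date as well and returns the counts instead of printing them; the sample log lines are invented for illustration (the contents of `log.txt` are not shown), and `count_requests_per_hour` is a hypothetical name:

```python
import re
from collections import Counter

# Same timestamp layout as the regex above: [dd/Mon/yyyy:hh:mm:ss
TIMESTAMP = re.compile(r'\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):\d{2}:\d{2}')

def count_requests_per_hour(lines):
    """Return a Counter keyed by (date, hour)."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:  # lines without a timestamp are skipped
            day, month, year, hour = match.groups()
            counts[(f'{day}/{month}/{year}', hour)] += 1
    return counts

sample_lines = [
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512',
    '127.0.0.1 - - [10/Oct/2023:13:58:01 +0000] "GET /about HTTP/1.1" 200 128',
    '127.0.0.1 - - [11/Oct/2023:13:02:17 +0000] "GET / HTTP/1.1" 200 512',
    'malformed line',
]
counts = count_requests_per_hour(sample_lines)
for (date, hour), count in sorted(counts.items()):
    print(f'{date} hour {hour}: {count}')
```

Since an open file is itself an iterable of lines, `count_requests_per_hour(open(log_file))` would work the same way on a real log.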


## Write a program to extract all the text from a Microsoft Word document.

In [70]:
import docx2txt

def extract_text_from_docx(file_path):
    # load the document
    text = docx2txt.process(file_path)

    # return the extracted text
    return text

In [71]:
extract_text_from_docx('./Manasa_shivarudra_CV.docx')

'N/AManasa Shivarudra\n\n\n\n\t\tmanasa.shivarudra@gmail.com\t8067301117\thttps://www.linkedin.com/in/manasa-shivarudra/      https://github.com/Manasa-Shivarudra\n\n\n\nSummary\n\n\tSelf-taught Data Science and Machine Learning enthusiast seeking full time opportunities in Data Science and Machine Learning field. Independent and self-motivated software professional with 5 years’ experience working directly with the customer to build their outbound IVR applications in different channels. Worked as the Technical Lead for a 3-year implementation project which involved migration of 250 applications.\n\nSkills\n\n\tPython, Scikit- Learn, Pandas, Numpy, Seaborn, TensorFlow, pytorch, pyspark, BERT, Scala, Azure, Linear Regression, Logistic Regression, NLP, SVM, K- means, Decision Trees, Random Forest, ANN, EDA, Data Visualization, Statistics, SQLite, SQL Server, Mongo DB/Atlas, Kubernetes, Apache Spark, Git\n\n\t\n\nCertifications\n\nMicrosoft: Azure Fundamentals, Azure Data Fundamentals, Az

## Given a string of text, write a program to extract the most commonly occurring words in the text.

In [72]:
from collections import Counter

def extract_common_words(corpus, count_words):
    words = re.findall(r'\b\w+\b', corpus.lower())
    word_count = Counter(words)
    
    return word_count.most_common(count_words)

In [74]:
corpus = 'N/AManasa Shivarudra\n\n\n\n\t\tmanasa.shivarudra@gmail.com\t8067301117\thttps://www.linkedin.com/in/manasa-shivarudra/      https://github.com/Manasa-Shivarudra\n\n\n\nSummary\n\n\tSelf-taught Data Science and Machine Learning enthusiast seeking full time opportunities in Data Science and Machine Learning field. Independent and self-motivated software professional with 5 years’ experience working directly with the customer to build their outbound IVR applications in different channels. Worked as the Technical Lead for a 3-year implementation project which involved migration of 250 applications.\n\nSkills\n\n\tPython, Scikit- Learn, Pandas, Numpy, Seaborn, TensorFlow, pytorch, pyspark, BERT, Scala, Azure, Linear Regression, Logistic Regression, NLP, SVM, K- means, Decision Trees, Random Forest, ANN, EDA, Data Visualization, Statistics, SQLite, SQL Server, Mongo DB/Atlas, Kubernetes, Apache Spark, Git\n\n\t\n\nCertifications\n\nMicrosoft: Azure Fundamentals, Azure Data Fundamentals, Azure AI Fundamentals, Azure AI Engineer Associate\n\nExperience\n\nMachine Learning Engineer | Omdena AI | June 2021 - July 2022\n\n\tCollaborated on the end-to-end Machine Learning project to build a model to predict the Energy consumption in a building which involved data.\n\n\tprocessing, EDA, data visualization, feature engineering, model building and deployment on Azure ML Studio.\n\n\tDeployed the model in Azure machine learning studio by creating a real-time endpoints and inference pipelines.\n\n\tDesigned and implemented various machine learning algorithms like Boosted Decision Trees, Random Forest, Support Vector Machine and Artificial Neural Networks.\n\nSoftware Engineer/Tech Lead | Nuance Communications Inc | NJ, USA and Mississauga, Canada | October 2017 - October 2021\n\n\tInvolved in customizing the IVR product based on complex client requirements. 
\n\n\tBuilt high volume Conversational AI apps using Nuance proprietary SDK and Nuance Voice Platform\n\n\tDeveloped robust Conversational AI applications on the voice platform for various top brands with 10% improvement in the application performance on the new platform. \n\n\tMentoring junior and offshore developers on a large-scale migration project in Scala framework which involved reviewing the Scala code and providing feedback on best practices, improving performance etc. \n\n\tCreated high-level Technical Design Documents for more than 100 applications. \n\n\tWorked as the Technical Lead for a 3-year implementation project which involved redesigning of 250 applications on the new platform from legacy framework.\n\n\tNuance Spot Award 2019 in recognition of going above and beyond for my hard work and contribution.\n\n\tNuance Team Player Award June 2019 in keeping the migration project on track. \n\n\tNuance Special Program Bonus October 2019 in recognition of hard work.\n\n\n\nTechnical Analyst | Trizetto Healthcare Solutions | Arizona, USA | February 2017 - August 2017\n\n\tWorked on a health care product - Encounter Data Manager which is involved in creation of 837 files for claim submits.\n\n\tInvolved in the customization and maintenance of the product based on the business needs of the user.\n\n\t\n\nNew York Life Insurance Company- Object Oriented Intern | Dallas, TX | June 2016 - August 2016\n\n\tDeveloped a Java application for treasury systems using MVC framework and DB2 for trade data transactions.\n\n\tUsed TFS (Team Foundation Server) for source code management.\n\n\tAdhered to the agile (SCRUM) methodologies of software development life cycle which involved all the phases of SDLC. 
\n\n\n\nTechnical Analyst | Oracle Financial Services Software Ltd | Bangalore, India | October 2012 - December 2014\n\n\tDelivered bug fixes for the product FLEXCUBE to Bank’s across geographies, many of which were base lined.\t         \n\n\tEngaged in the product customization for consumer lending which included both the UI and Database changes (PL/SQL).\n\n\tCollaborated with Banks like Chase, Diamond trust Bank and coordinated to satisfy their requirements.\n\n\n\nProjects\n\nTweet Emotion Recognition with TensorFlow (Hugging Face Data)\n\n\tBuilt a Recurrent Neural Network for multi class classification of 6 emotions to train tweet emotion dataset to learn to recognize emotions in tweets. The model performed with a training accuracy of 98.56% and validation accuracy of 88.35% with 7 epochs using TensorFlow.\n\nTelecom Customers Churn Prediction\n\n\tImplemented various classification models like Logistic Regression, SVM, KNN, Random Forest Classifier to predict the churn rate of telecommunication customers with an f1-score of 83% and an accuracy of 82%. Models were validated using the AUROC and ROC curves.\n\nBank Personal Loan Acceptance Prediction\n\n\tBuilt a simple multilayer neural network model for predicting the probability of customers accepting the Personal Loan. Trained the model with ANN with a training accuracy of 99.19% and a test accuracy of 98%.\n\nEducation\n\nMasters in Computer Science | Texas Tech University | 3.6 | Texas, USA | 2016 \n\nBachelors of Engineering in Computer Science | Visvesvaraya Technological University | 3.8 | Karnataka, India | 2012'
sorted(extract_common_words(corpus, 10))

[('a', 13),
 ('and', 21),
 ('data', 9),
 ('for', 13),
 ('in', 17),
 ('of', 20),
 ('on', 11),
 ('the', 28),
 ('to', 11),
 ('with', 10)]

## Given a large dataset of text documents, write a program to extract the most commonly occurring words across all the documents.

In [105]:
import os
def extract_commonwords_doc(data_dir, num_words):
    word_count = Counter()
    for file_name in os.listdir(data_dir):
        file_path = os.path.join(data_dir, file_name)

        with open(file_path, 'r', encoding = 'utf-8') as f:
            text = f.read()
            words = text.split() 
            word_count.update(words)
    return word_count.most_common(num_words)

In [104]:
data_dir = './EmailDirectory/'
num_words = 10
common_words = extract_commonwords_doc(data_dir, num_words)
print(common_words)

[('-', 9), ('for', 5), ('me', 5), ('a', 4), ('Subject:', 3), ('To:', 3), ('From:', 3), ('Hi', 3), ('of', 3), ('if', 3)]
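
The output above contains tokens such as `'-'`, `'Subject:'`, and `'To:'` because `text.split()` keeps punctuation and case. A minimal sketch of a normalized count, assuming that words should be lowercased and stripped of surrounding punctuation; the sample documents are invented and `count_words` is a hypothetical name:

```python
import re
from collections import Counter

# Lowercase letters, digits, and apostrophes form a word token
WORD = re.compile(r"[a-z0-9']+")

def count_words(documents, num_words):
    """Count lowercased word tokens across an iterable of text documents."""
    word_count = Counter()
    for text in documents:
        word_count.update(WORD.findall(text.lower()))
    return word_count.most_common(num_words)

docs = ["Subject: Hi - call me", "Subject: call for papers - hi"]
print(count_words(docs, 3))
```

Because `documents` can be any iterable, a generator that reads one file at a time keeps only a single document in memory.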


## Write an optimized program to extract all the named entities from a text document, such as people, organizations, and locations, using the NLTK library, and state its time and space complexity.

The function makes one pass over the text: each sentence is tokenized, POS-tagged, and chunked once, so the running time grows roughly linearly with the number of tokens rather than quadratically with the number of sentences. Space usage grows with the size of the text, because all tokenized and tagged sentences are held in memory at once, plus the list of named entities that is returned.

In [107]:
import nltk

In [118]:
def extract_named_entities(text):
    sentences = nltk.sent_tokenize(text) #tokenize the text
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    
    named_entities=[]
    for sentence in tagged_sentences:
        tree= nltk.ne_chunk(sentence)
        for subtree in tree.subtrees():
            if subtree.label() in ['PERSON', 'ORGANIZATION', 'GPE']:
                named_entity = " ".join([token for token, pos in subtree.leaves()])
                named_entities.append((named_entity, subtree.label()))
    
    return named_entities
    

In [119]:
text = "The United States of America (USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major self-governing territories, 326 Indian reservations, and some minor possessions. At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area. With a population of over 328 million, it is the third most populous country in the world."
named_entities= extract_named_entities(text)
print(named_entities)

[('United States', 'GPE'), ('America', 'GPE'), ('USA', 'ORGANIZATION'), ('United States', 'GPE'), ('U.S.', 'GPE'), ('US', 'GPE'), ('America', 'GPE'), ('North America', 'GPE'), ('Indian', 'GPE')]


In [123]:
## Same program using spaCy

import spacy

In [131]:
def extract_named_entity_spacy(text):
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    
    named_entities=[]
    for entity in doc.ents:
        if entity.label_ in ['PERSON', 'ORG', 'GPE']:
            named_entities.append((entity.text, entity.label_))
    
    return named_entities

In [132]:
text = "The United States of America (USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major self-governing territories, 326 Indian reservations, and some minor possessions. At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area. With a population of over 328 million, it is the third most populous country in the world."
named_entities= extract_named_entity_spacy(text)
print(named_entities)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
[('The United States of America', 'GPE'), ('USA', 'GPE'), ('the United States', 'GPE'), ('U.S.', 'GPE'), ('US', 'GPE'), ('America', 'GPE')]


## Write a program to extract all the synonyms for a given word from an online thesaurus.

In [205]:
import requests

def get_synonyms(word):
    url = f'https://api.dictionaryapi.dev/api/v2/entries/en/{word}'
    response = requests.get(url)

    if response.status_code == 200:
        data = response.json()
        meanings = data[0]['meanings']
        
        synonyms = []
        for meaning in meanings:
            if 'synonyms' in meaning:
                for synonym in meaning['synonyms']:
                    synonyms.append(synonym)
        
        return synonyms
    
    return None

In [206]:
synonyms = get_synonyms('happy')
print(synonyms)

['happify', 'cheerful', 'content', 'delighted', 'elated', 'exultant', 'glad', 'joyful', 'jubilant', 'merry', 'orgasmic', 'fortunate', 'lucky', 'propitious']


## Given a text document, write a program to extract the topics or themes of the document using sklearn

In [215]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = stopwords.words('english')

def preprocess_text(text):
    """
    Preprocesses the input text by tokenizing it into words, removing stop words, 
    and returning a preprocessed string.
    """
    # Tokenize text into words
    words = text.lower().split()
    
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    
    # Join words back into a string
    preprocessed_text = " ".join(words)
    
    return preprocessed_text

def extract_topics_sklearn(text, num_topics=5):
    """
    Extracts topics from the input text using the LDA algorithm.
    Returns a list of tuples, where each tuple represents a topic and its associated words.
    """
    # Preprocess text
    preprocessed_text = preprocess_text(text)
    
    # Create a count vectorizer
    count_vectorizer = CountVectorizer(stop_words='english')
    
    # Build the document-term matrix from the preprocessed text
    doc_term_matrix = count_vectorizer.fit_transform([preprocessed_text])
    
    # Create an LDA model
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    
    # Fit the LDA model to the document-term matrix
    lda.fit(doc_term_matrix)
    
    # Get the top words for each topic
    feature_names = count_vectorizer.get_feature_names_out()
    topics = []
    for topic_idx, topic in enumerate(lda.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
        topics.append((topic_idx, top_words))
    
    return topics


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\manas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [216]:
text = """Natural Language Processing (NLP) is a field of study that focuses on the interaction between computers and human language. The goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is useful for various applications such as language translation, sentiment analysis, and chatbots.

NLP involves a variety of techniques such as text preprocessing, machine learning, and deep learning. Preprocessing involves tasks like tokenization, stopword removal, stemming, and lemmatization to clean and prepare the text data for further analysis. Machine learning techniques like Naive Bayes, Support Vector Machines (SVM), and Random Forests are commonly used for tasks like sentiment analysis, text classification, and named entity recognition. Deep learning techniques like Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformers are used for more complex tasks like language modeling, machine translation, and text generation.

NLP has become increasingly important in recent years with the explosion of digital data and the need to extract insights and information from unstructured text data. The applications of NLP are wide-ranging and include sentiment analysis for social media monitoring, chatbots for customer service, language translation for international business, and speech recognition for personal assistants like Siri and Alexa. As the field of NLP continues to evolve, it has the potential to transform the way we interact with computers and each other through natural language."""
topics = extract_topics_sklearn(text)
print(topics)

[(0, ['language', 'like', 'nlp', 'text', 'analysis', 'learning', 'translation', 'tasks', 'techniques', 'machine']), (1, ['years', 'need', 'generation', 'goal', 'important', 'include', 'increasingly', 'information', 'insights', 'interact']), (2, ['years', 'need', 'generation', 'goal', 'important', 'include', 'increasingly', 'information', 'insights', 'interact']), (3, ['years', 'need', 'generation', 'goal', 'important', 'include', 'increasingly', 'information', 'insights', 'interact']), (4, ['years', 'need', 'generation', 'goal', 'important', 'include', 'increasingly', 'information', 'insights', 'interact'])]


## Given a text document, write a program to extract the topics or themes of the document using gensim

In [210]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = stopwords.words('english')

def preprocess_text(text):
    """
    Preprocesses the input text by tokenizing it into words and removing stop words.
    Returns a bag-of-words corpus (one single-word document per token) and its Gensim dictionary.
    """
    # Tokenize text into words
    words = text.lower().split()
    
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    
    # Create a dictionary from the words
    dictionary = corpora.Dictionary([words])
    
    # Create a corpus from the dictionary
    corpus = [dictionary.doc2bow([word]) for word in words]
    
    return corpus, dictionary

def extract_topics(text, num_topics=5):
    """
    Extracts topics from the input text using the LDA algorithm.
    Returns a list of tuples, where each tuple represents a topic and its associated words.
    """
    # Preprocess text
    corpus, dictionary = preprocess_text(text)
    
    # Create an LDA model
    lda = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
    
    # Get the top words for each topic
    topics = []
    for topic_idx, topic in lda.show_topics(num_topics=num_topics, num_words=10, formatted=False):
        top_words = [word[0] for word in topic]
        topics.append((topic_idx, top_words))
    
    return topics


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\manas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [212]:
text = """Natural Language Processing (NLP) is a field of study that focuses on the interaction between computers and human language. The goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is useful for various applications such as language translation, sentiment analysis, and chatbots.

NLP involves a variety of techniques such as text preprocessing, machine learning, and deep learning. Preprocessing involves tasks like tokenization, stopword removal, stemming, and lemmatization to clean and prepare the text data for further analysis. Machine learning techniques like Naive Bayes, Support Vector Machines (SVM), and Random Forests are commonly used for tasks like sentiment analysis, text classification, and named entity recognition. Deep learning techniques like Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformers are used for more complex tasks like language modeling, machine translation, and text generation.

NLP has become increasingly important in recent years with the explosion of digital data and the need to extract insights and information from unstructured text data. The applications of NLP are wide-ranging and include sentiment analysis for social media monitoring, chatbots for customer service, language translation for international business, and speech recognition for personal assistants like Siri and Alexa. As the field of NLP continues to evolve, it has the potential to transform the way we interact with computers and each other through natural language."""
topics = extract_topics(text)
print(topics)

[(0, ['nlp', 'machine', 'text', 'human', 'international', 'entity', 'continues', 'potential', 'alexa.', 'speech']), (1, ['language', 'nlp', 'analysis,', 'data', 'way', 'like', 'applications', 'translation,', 'study', 'increasingly']), (2, ['text', 'language.', 'natural', 'techniques', 'learning', 'wide-ranging', 'neural', 'networks', 'applications', 'bayes,']), (3, ['sentiment', 'computers', 'involves', 'field', 'support', '(svm),', 'chatbots', 'complex', 'variety', 'translation,']), (4, ['like', 'tasks', 'techniques', 'deep', 'used', 'learning', 'networks', 'human', 'neural', 'assistants'])]


## Given a text document, write an optimized Python program to extract the topics or themes of the document using NLTK

In [222]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import FreqDist
from nltk import pos_tag
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import wordnet as wn
from gensim import corpora, models

# Download necessary NLTK packages
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def preprocess_text(text):
    """
    Preprocesses the input text by tokenizing it into words, removing stop words, 
    lemmatizing the words, and returning a preprocessed string.
    """
    # Tokenize text into words
    words = word_tokenize(text)
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.lower() not in stop_words]
    
    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]
    
    # Join the words back into a string
    preprocessed_text = ' '.join(words)
    
    return preprocessed_text

def get_wordnet_pos(word):
    """
    Map POS tag to first character used by WordNetLemmatizer.
    """
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wn.ADJ,
                "N": wn.NOUN,
                "V": wn.VERB,
                "R": wn.ADV}
    return tag_dict.get(tag, wn.NOUN)

def extract_topics_nltk(text, num_topics=5):
    """
    Extracts topics from the input text using the LDA algorithm from Gensim library.
    """
    # Preprocess the text
    preprocessed_text = preprocess_text(text)
    
    # Tokenize the preprocessed text
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(preprocessed_text)
    
    # Create a bag of words from the tokens
    bag_of_words = FreqDist(tokens)
    
    # Create a list of the most common words
    most_common_words = [word for word, count in bag_of_words.most_common(50)]
    
    # Remove the most common words from the bag of words
    filtered_bag_of_words = {word:count for word, count in bag_of_words.items() if word not in most_common_words}
    
    # Create a corpus from the filtered bag of words
    corpus = []
    for word, count in filtered_bag_of_words.items():
        corpus.append([word] * count)
    
    # Create a dictionary from the corpus
    dictionary = corpora.Dictionary(corpus)
    
    # Create a document-term matrix from the corpus and dictionary
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus]
    
    # Train the LDA model
    lda_model = models.ldamodel.LdaModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary)
    
    # Extract the topics from the LDA model
    topics = lda_model.show_topics(num_topics=num_topics, num_words=10, formatted=False)
    
    return topics


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\manas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\manas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\manas\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\manas\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [225]:
text = """Natural Language Processing (NLP) is a field of study that focuses on the interaction between computers and human language. The goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is useful for various applications such as language translation, sentiment analysis, and chatbots.

NLP involves a variety of techniques such as text preprocessing, machine learning, and deep learning. Preprocessing involves tasks like tokenization, stopword removal, stemming, and lemmatization to clean and prepare the text data for further analysis. Machine learning techniques like Naive Bayes, Support Vector Machines (SVM), and Random Forests are commonly used for tasks like sentiment analysis, text classification, and named entity recognition. Deep learning techniques like Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformers are used for more complex tasks like language modeling, machine translation, and text generation.

NLP has become increasingly important in recent years with the explosion of digital data and the need to extract insights and information from unstructured text data. The applications of NLP are wide-ranging and include sentiment analysis for social media monitoring, chatbots for customer service, language translation for international business, and speech recognition for personal assistants like Siri and Alexa. As the field of NLP continues to evolve, it has the potential to transform the way we interact with computers and each other through natural language."""
topics = extract_topics_nltk(text, 3)
print(topics)

[(0, [('extract', 0.037028715), ('Recurrent', 0.03645307), ('evolve', 0.036421407), ('Random', 0.03632566), ('model', 0.03554333), ('digital', 0.034580395), ('natural', 0.03456814), ('social', 0.03431792), ('name', 0.034242976), ('complex', 0.033465277)]), (1, [('potential', 0.034049515), ('explosion', 0.03279551), ('international', 0.032790456), ('medium', 0.03258431), ('personal', 0.0324477), ('commonly', 0.032179188), ('classification', 0.032178752), ('entity', 0.031956233), ('include', 0.03186875), ('important', 0.031829584)]), (2, [('wide', 0.035317622), ('speech', 0.03491443), ('recent', 0.034565855), ('Transformers', 0.03419726), ('continue', 0.033830557), ('Deep', 0.033496328), ('Vector', 0.03313662), ('year', 0.033101015), ('Support', 0.033037666), ('customer', 0.032932263)])]
