In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/Colab Notebooks

/content/drive/My Drive/Colab Notebooks


In [3]:
import os

spam_emails_path = os.path.join("spamassassin-public-corpus", "spam")  # spamassassin-public-corpus/spam
ham_emails_path = os.path.join("spamassassin-public-corpus", "ham")  # spamassassin-public-corpus/ham
labeled_file_directories = [(spam_emails_path, 0), (ham_emails_path, 1)]

1. `import os`: This statement imports the `os` module, which provides a portable way of using operating-system-dependent functionality such as reading or writing environment variables, building file paths, and creating or deleting directories.

2. `spam_emails_path = os.path.join("spamassassin-public-corpus", "spam")`: Here, `os.path.join` joins one or more path components, inserting the platform's directory separator (`os.sep`, which is `/` on Linux and macOS and `\` on Windows) between non-empty components. If the last component is empty, the result ends with a separator. On Colab, which runs Linux, this yields the relative path `spamassassin-public-corpus/spam`, the directory that holds the spam email files.

3. `ham_emails_path = os.path.join("spamassassin-public-corpus", "ham")`: This is similar to the previous point but it's setting the path for ham (non-spam) email files located in "spamassassin-public-corpus/ham" directory.

4. `labeled_file_directories = [(spam_emails_path, 0), (ham_emails_path, 1)]`: This is a list of tuples that pairs each directory path defined earlier (`spam_emails_path` or `ham_emails_path`) with an integer class label: 0 for spam and 1 for ham.
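
The path handling described above can be checked with a short sketch. It uses only the standard library and the directory names from the cell; the variable names `with_trailing` and `labeled` are illustrative, not part of the notebook.

```python
import os

# Join two components with the platform's separator (os.sep).
spam_dir = os.path.join("spamassassin-public-corpus", "spam")
ham_dir = os.path.join("spamassassin-public-corpus", "ham")
print(spam_dir)

# An empty last component leaves a trailing separator on the result.
with_trailing = os.path.join("spamassassin-public-corpus", "")
print(with_trailing.endswith(os.sep))

# Pair each directory with its integer label (0 = spam, 1 = ham).
labeled = [(spam_dir, 0), (ham_dir, 1)]
for directory, label in labeled:
    print(label, directory)
```

Building paths with `os.path.join` rather than string concatenation keeps the code independent of the separator the operating system uses.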

In [4]:
email_corpus = []
labels = []

for class_files, label in labeled_file_directories:
    files = os.listdir(class_files)
    for file in files:
        file_path = os.path.join(class_files, file)
        try:
            with open(file_path, "r") as currentFile:
                email_content = currentFile.read().replace("\n", "")
                email_content = str(email_content)
                email_corpus.append(email_content)
                labels.append(label)
        except (OSError, UnicodeDecodeError):
            # Skip files that cannot be opened or decoded as text.
            pass

1. `email_corpus = []` and `labels = []` -- These lines are initializing two empty lists, `email_corpus` and `labels`.

2. `for class_files, label in labeled_file_directories:` -- It starts a for loop that iterates over `labeled_file_directories`, the list of tuples defined in the previous cell. Each tuple unpacks into the directory path (`class_files`) and the corresponding label (`label`).

3. `files = os.listdir(class_files)` -- It reads the names of all entries in the current directory (`class_files`).

4. `for file in files:` -- It starts another for loop that iterates over each file in the directory.

5. `file_path = os.path.join(class_files, file)` -- It builds each file's full path by joining the directory name and the file name.

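6. `try:` -- It starts a try-except block to handle potential errors when opening or reading files, such as a file that cannot be decoded as text. Files that fail are silently skipped due to the subsequent `pass`, so they never reach `email_corpus` or `labels`.

7. `with open(file_path, "r") as currentFile:` -- It opens the current file in text mode for reading. The `with` statement closes the file automatically when the block ends.

8. `email_content = currentFile.read().replace("\n", "")` -- It reads the entire file content as a string and removes all newline characters (`\n`). Because nothing is inserted in their place, the last word of each line is fused with the first word of the next line.

9. `email_content = str(email_content)` -- It converts the content to a string. Since `read()` in text mode already returns a `str`, this line has no effect.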

10. `email_corpus.append(email_content)` and `labels.append(label)` -- These lines append the file content and corresponding label to their respective lists, `email_corpus` and `labels`.


In summary, the code reads the text files in each labeled directory (`class_files`), stores their contents in `email_corpus`, and records the corresponding labels in `labels`.
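
The loading loop can also be sketched as a reusable function. This is a minimal variant under stated assumptions, not the notebook's code: the name `load_labeled_corpus`, the `latin-1` encoding (chosen here because it can decode any byte sequence), the space substituted for each newline, and the `skipped` list are all illustrative choices. The demo writes two tiny files to a temporary directory instead of reading the real corpus.

```python
import os
import tempfile


def load_labeled_corpus(labeled_dirs, encoding="latin-1"):
    """Read every file under each (directory, label) pair.

    Returns (texts, labels, skipped), where skipped lists unreadable paths.
    """
    texts, labels, skipped = [], [], []
    for directory, label in labeled_dirs:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            try:
                with open(path, "r", encoding=encoding) as handle:
                    content = handle.read()
            except OSError:
                # Record the failure instead of ignoring it silently.
                skipped.append(path)
                continue
            # Use a space so words on adjacent lines stay separate.
            texts.append(content.replace("\n", " "))
            labels.append(label)
    return texts, labels, skipped


# Demo on a throwaway directory tree with one spam and one ham file.
with tempfile.TemporaryDirectory() as root:
    spam_dir = os.path.join(root, "spam")
    ham_dir = os.path.join(root, "ham")
    os.makedirs(spam_dir)
    os.makedirs(ham_dir)
    with open(os.path.join(spam_dir, "0001.txt"), "w", encoding="latin-1") as f:
        f.write("WIN A PRIZE\nclick here")
    with open(os.path.join(ham_dir, "0001.txt"), "w", encoding="latin-1") as f:
        f.write("Meeting moved\nto 3pm")
    texts, labels, skipped = load_labeled_corpus([(spam_dir, 0), (ham_dir, 1)])

print(texts)
print(labels)
print(skipped)
```

Returning the skipped paths makes it possible to see how many files were dropped, which the silent `pass` in the original loop hides.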

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    email_corpus, labels, test_size=0.2, random_state=11
)
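
The cell above holds out 20% of the emails for testing (`test_size=0.2`) and fixes the shuffle with `random_state=11` so the split is reproducible. The idea can be sketched with the standard library alone. This is an illustration of a seeded shuffle-and-split, not scikit-learn's implementation, and it will not reproduce the same split; `shuffled_split` and the toy data are hypothetical.

```python
import math
import random


def shuffled_split(samples, targets, test_size=0.2, seed=11):
    """Shuffle indices with a fixed seed, then hold out a test_size fraction."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(samples) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    # Same return order as train_test_split: X_train, X_test, y_train, y_test.
    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))


# Toy data: ten "emails" with alternating labels.
samples = [f"email {i}" for i in range(10)]
targets = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = shuffled_split(samples, targets)
print(len(X_tr), len(X_te))
print(X_te, y_te)
```

Because samples and targets are selected with the same indices, each email keeps its label after the shuffle.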


NLP stands for Natural Language Processing. It is a branch of artificial intelligence (AI) and linguistics concerned with the interaction between computers and human (natural) languages. NLP enables computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

Key tasks and applications of NLP include:

- Text Classification and Sentiment Analysis: NLP algorithms can classify text documents into predefined categories or determine the sentiment (positive, negative, or neutral) expressed in text data. This is useful for tasks such as spam detection, sentiment analysis of social media posts, and categorization of news articles.
- Named Entity Recognition (NER): NLP techniques can identify and extract named entities, such as names of people, organizations, locations, dates, and other entities, from unstructured text data. NER is used in information extraction, document summarization, and entity linking.
- Part-of-Speech (POS) Tagging: NLP algorithms can assign grammatical tags (e.g., noun, verb, adjective) to each word in a sentence, indicating its syntactic role and grammatical category. POS tagging is important for tasks such as syntactic parsing, machine translation, and speech recognition.
- Language Modeling and Generation: NLP models can be trained to learn the statistical properties of natural languages and generate coherent and contextually relevant text. Language modeling is used in autocomplete suggestions, machine translation, and dialogue generation applications.
- Text Summarization: NLP techniques can generate concise summaries of long text documents by extracting key information and main ideas. Text summarization is useful for news aggregation, document summarization, and content generation.
- Machine Translation: NLP algorithms can translate text from one language to another, enabling cross-lingual communication and information retrieval. Machine translation systems use statistical models, neural networks, or hybrid approaches to achieve accurate translations.
- Question Answering: NLP systems can analyze natural language questions and provide relevant answers by extracting information from large text corpora or knowledge bases. Question answering systems are used in virtual assistants, search engines, and customer support chatbots.
- Speech Recognition and Speech Synthesis: NLP techniques are used to transcribe spoken language into text (speech recognition) and generate human-like speech from text input (speech synthesis). Speech recognition and synthesis are used in virtual assistants, dictation systems, and voice-controlled devices.

NLP has numerous real-world applications across various industries, including healthcare, finance, e-commerce, education, customer service, and entertainment. It plays a crucial role in enabling human-computer interaction, information retrieval, knowledge discovery, and automation of language-related tasks. As NLP technology continues to advance, it is expected to have a transformative impact on how humans interact with computers and access information in the digital age.

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn import tree

nlp_followed_by_dt = Pipeline(
    [
        ("vect", HashingVectorizer(input="content", ngram_range=(1, 3))),
        ("tfidf", TfidfTransformer(use_idf=True)),
        ("dt", tree.DecisionTreeClassifier(class_weight="balanced")),
    ]
)
nlp_followed_by_dt.fit(X_train, y_train)

1. First, it imports the necessary classes and modules:
   - `Pipeline` from `sklearn.pipeline`. Pipelines streamline the machine learning workflow by chaining consecutive steps of extraction, transformation, and modeling.
   - `HashingVectorizer` and `TfidfTransformer` from `sklearn.feature_extraction.text`. These classes are used for text feature extraction.
   - `tree` from `sklearn`. This module contains tree-based models.

2. Then, it creates the pipeline by instantiating `Pipeline` with a list of (name, estimator) tuples that specify the sequence of steps to be performed:
   - `HashingVectorizer`: A transformer that converts a collection of text documents to a matrix of token occurrences. Instead of storing a vocabulary, it uses a hash function to map each token to a column index. With `ngram_range=(1, 3)`, the tokens are word unigrams, bigrams, and trigrams.
   - `TfidfTransformer`: A transformer that converts a count matrix to a normalized tf (term frequency) or tf-idf (term frequency times inverse document frequency) representation. With `use_idf=True`, it applies the tf-idf weighting to the token counts, which downweights tokens that appear in many documents.
   - `tree.DecisionTreeClassifier`: A classifier that uses a decision tree model. Setting `class_weight="balanced"` adjusts class weights inversely proportional to class frequencies in the training data.
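
The two feature-extraction steps can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not scikit-learn's code: the helper names, the 16-bucket table, the whitespace tokenizer, and the use of `zlib.crc32` as the hash are all assumptions made for readability. The idf weighting uses the smoothed form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row.

```python
import math
import zlib
from collections import Counter

N_FEATURES = 16  # tiny on purpose; real hashing vectorizers use far more buckets


def ngrams(text, max_n=3):
    """Return word n-grams of length 1..max_n, in the spirit of ngram_range=(1, 3)."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def hashed_counts(text, n_features=N_FEATURES):
    """Hash each n-gram to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for gram in ngrams(text):
        bucket = zlib.crc32(gram.encode("utf-8")) % n_features
        counts[bucket] += 1
    return counts


def tfidf(count_rows):
    """Apply smoothed idf weights to the counts, then L2-normalize each row."""
    n_docs = len(count_rows)
    doc_freq = Counter()
    for row in count_rows:
        doc_freq.update(row.keys())
    weighted = []
    for row in count_rows:
        vec = {b: c * (math.log((1 + n_docs) / (1 + doc_freq[b])) + 1)
               for b, c in row.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        weighted.append({b: v / norm for b, v in vec.items()} if norm else vec)
    return weighted


docs = ["free prize click now", "meeting notes for monday", "free meeting"]
rows = [hashed_counts(d) for d in docs]
vectors = tfidf(rows)
for vector in vectors:
    print(sorted(vector.items()))
```

With so few buckets, different n-grams can collide in the same column. That is the trade-off of hashing: no vocabulary has to be stored, at the cost of occasional collisions and no way to map a column back to a token.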

Finally, `nlp_followed_by_dt.fit(X_train, y_train)` fits the pipeline to the training data, `X_train` and `y_train`.

So, the pipeline first transforms the text into numerical feature vectors using the NLP steps and then fits a decision tree to the transformed data. This pattern is a common approach for text classification problems: convert the text into numerical features, then train a machine learning classifier on them to predict the class.
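
To make the effect of `class_weight="balanced"` concrete, the sketch below computes weights with the formula `n_samples / (n_classes * count)` for each class. The function name and the label counts are hypothetical; the notebook does not report the class distribution of its training set.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical distribution: one spam email (0) for every two ham emails (1).
example_labels = [0] * 100 + [1] * 200
weights = balanced_class_weights(example_labels)
print(weights)

# Each class contributes the same total weight after reweighting.
spam_mass = 100 * weights[0]
ham_mass = 200 * weights[1]
print(spam_mass, ham_mass)
```

The rarer class receives the larger weight, so errors on it count for more when the tree chooses its splits.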

In [7]:
from sklearn.metrics import accuracy_score, confusion_matrix

y_test_pred = nlp_followed_by_dt.predict(X_test)
print(accuracy_score(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))

0.9948849104859335
[[249   3]
 [  1 529]]
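
The printed confusion matrix follows scikit-learn's convention: rows are true labels and columns are predicted labels, in label order 0 (spam), 1 (ham). Of the 252 spam emails in the test set, 249 were classified as spam and 3 as ham; of the 530 ham emails, 529 were classified as ham and 1 as spam. As a quick check, the sketch below recomputes the accuracy from the matrix and derives precision and recall for the spam class; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true label, columns = prediction.
matrix = [[249, 3],
          [1, 529]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Treat spam (label 0) as the class of interest.
spam_recall = matrix[0][0] / sum(matrix[0])                    # caught spam / all true spam
spam_precision = matrix[0][0] / (matrix[0][0] + matrix[1][0])  # caught spam / all predicted spam

print(f"test samples: {total}")
print(f"accuracy: {accuracy:.4f}")
print(f"spam recall: {spam_recall:.4f}, spam precision: {spam_precision:.4f}")
```

Accuracy alone can hide which kind of error occurs. Here the 4 mistakes split into 3 spam emails that slipped through as ham and 1 ham email flagged as spam.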
