<a href="https://colab.research.google.com/github/Mabinogit/AI-Image-Classification/blob/main/Project_2_Inner_products_and_angles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset

# Inner Products and Angles

Build a document similarity system using TF - IDF vectorization and cosine similarity to compare text documents.

# Task
Load the data from the file "/content/reuters21578.tar.gz", then display the first 5 rows, the columns and their types.

## Extract the data

### Subtask:
Extract the files from the `.tar.gz` archive.


**Reasoning**:
Extract the files from the tar.gz archive using the tarfile library.



In [None]:
import tarfile

with tarfile.open("/content/reuters21578.tar.gz", "r:gz") as tar:
    tar.extractall()

## Identify data files

### Subtask:
Examine the extracted files to understand their structure and identify the relevant data files.


**Reasoning**:
List the extracted files and directories to understand the file structure and identify potential data files.



In [None]:
import os

extracted_files = os.listdir('.')
print(extracted_files)

['.config', 'all-people-strings.lc.txt', 'reut2-010.sgm', 'all-places-strings.lc.txt', 'all-orgs-strings.lc.txt', 'reut2-006.sgm', 'reut2-016.sgm', 'reut2-015.sgm', 'reut2-004.sgm', 'all-exchanges-strings.lc.txt', 'reut2-007.sgm', 'reut2-008.sgm', 'reut2-001.sgm', 'reut2-013.sgm', 'reut2-017.sgm', 'all-topics-strings.lc.txt', 'reut2-019.sgm', 'reut2-009.sgm', 'cat-descriptions_120396.txt', 'reut2-020.sgm', 'reut2-002.sgm', 'reut2-000.sgm', 'reut2-003.sgm', 'reut2-005.sgm', 'feldman-cia-worldfactbook-data.txt', 'README.txt', 'reut2-012.sgm', 'lewis.dtd', 'reut2-014.sgm', 'reut2-021.sgm', 'reuters21578.tar.gz', 'reut2-011.sgm', 'reut2-018.sgm', 'sample_data']


**Reasoning**:
The file listing shows several files with the '.sgm' extension (e.g., 'reut2-000.sgm', 'reut2-001.sgm', etc.). These are likely the main data files containing the text documents. I will use the `head` command to peek into one of these files to confirm its content structure.



In [None]:
!head reut2-000.sgm

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 


## Load data

### Subtask:
Load the data from the relevant files into a suitable data structure (e.g., pandas DataFrame).


**Reasoning**:
Import necessary libraries and define a function to parse the SGM files and extract relevant data.



In [None]:
import re

def parse_sgm(file_path):
    with open(file_path, 'r', encoding='latin-1') as f:
        content = f.read()

    # Split the file content by the <REUTERS> tag to get individual documents
    docs = content.split('</REUTERS>')

    parsed_data = []
    for doc in docs:
        if '<REUTERS' not in doc:
            continue

        # Extract relevant information using regex
        title_match = re.search(r'<TITLE>(.*?)</TITLE>', doc, re.DOTALL)
        body_match = re.search(r'<BODY>(.*?)</BODY>', doc, re.DOTALL)
        topics_match = re.search(r'<TOPICS>(.*?)</TOPICS>', doc, re.DOTALL)

        title = title_match.group(1).strip() if title_match else None
        body = body_match.group(1).strip() if body_match else None
        topics = []
        if topics_match:
            topic_list = re.findall(r'<D>(.*?)</D>', topics_match.group(1))
            topics = topic_list

        if body: # Only include documents with a body
            parsed_data.append({
                'title': title,
                'body': body,
                'topics': topics
            })
    return parsed_data


**Reasoning**:
Iterate through all SGM files, parse them, collect the data, and convert it into a pandas DataFrame.



In [None]:
import os
import pandas as pd

sgm_files = [f for f in os.listdir('.') if f.endswith('.sgm')]

all_docs_data = []
for sgm_file in sgm_files:
    file_path = os.path.join('.', sgm_file)
    all_docs_data.extend(parse_sgm(file_path))

df = pd.DataFrame(all_docs_data)

## Inspect data

### Subtask:
Display the first few rows and the columns and their types to get an initial understanding of the data.


**Reasoning**:
Display the first few rows and the columns and their types to get an initial understanding of the data.



In [None]:
display(df.head())
display(df.info())

Unnamed: 0,title,body,topics
0,AMOCO &lt;AN> UNIT EXPANDS CARPET YARN PLANT,Amoco Corp said its Amoco Fabrics Co\nwill exp...,[]
1,ISLAND TELEPHONE SHARE SPLIT APPROVED,&lt;Island\nTelephone Co Ltd> said the previou...,[earn]
2,BIOGEN &lt;BGNF> GETS PATENT FROM EUROPEAN OFFICE,Biogen Inc said the European\nPatent Office gr...,[]
3,KEY U.S. HOUSE MEMBER OPPOSES CFTC USER PLAN,A key U.S. House member said he\nopposed a Rea...,[]
4,U.K. GROWING IMPATIENT WITH JAPAN - THATCHER,Prime Minister Margaret Thatcher said\nthe U.K...,"[trade, acq]"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19043 entries, 0 to 19042
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   19043 non-null  object
 1   body    19043 non-null  object
 2   topics  19043 non-null  object
dtypes: object(3)
memory usage: 446.4+ KB


None

# Pre-processing

In [62]:
text1 = "Hello my name is christian km"
text2 = "Hello my name is christian mmbk "


global dictionary

In [None]:
def dic(docs):
  stopwords = ["the"]

# Initialize the words list outside the loop
  all_words = []

  for doc in docs:
  # Get words from the current document, convert to lowercase and split
    doc_words = doc.lower().split()
  # Extend the all_words list with words from the current document
    all_words.extend(doc_words)

# Remove stopwords from the collected words
  filtered_words = [word for word in all_words if word not in stopwords]
  vec_dictionary = tuple(sorted(set(filtered_words)))

  return vec_dictionary

In [63]:
docs = [text1, text2]
vec_dictionary = dic(docs)
print(vec_dictionary)

('christian', 'hello', 'is', 'km', 'mmbk', 'my', 'name')


# IDF

In [None]:
# N - Total number of documents
# df - Number of documents containing the term
import math

def idf_calc(df, docs):
    """Calculates the Inverse Document Frequency (IDF) for a term."""
    N = len(docs)
    idf = {}
    for word in df:
      df_val = df[word]

      idf_val = math.log((N + 1) / (df_val + 1)) + 1


      idf[word] = idf_val
    return idf
def calculate_document_frequency(vec_dictionary, docs):
    """Calculates the document frequency for each word in a list of documents."""
    df = {}
    for word in vec_dictionary:
      for doc in docs:
        if word in doc:
          if word in df:
            df[word] += 1
          else:
            df[word] = 1
    return df

In [80]:
freq = calculate_document_frequency(vec_dictionary, docs)
print(freq)

{'christian': 2, 'is': 2, 'km': 1, 'mmbk': 1, 'my': 2, 'name': 2}


In [82]:
idf = idf_calc(freq, docs)
print(idf)

{'christian': 1.0, 'is': 1.0, 'km': 1.4054651081081644, 'mmbk': 1.4054651081081644, 'my': 1.0, 'name': 1.0}


In [94]:
def tf(text):

  stopwords = ["the"]

# Initialize the words list outside the loop
  all_words = []

  for word in text:
  # Get words from the current document, convert to lowercase and split
    doc_words = text.lower().split()
  # Extend the all_words list with words from the current document
    all_words.extend(doc_words)

# Remove stopwords from the collected words
  filtered_words = [word for word in all_words if word not in stopwords]


  # Initialize word_val and vectors (can be used for TF calculation later)
  word_val = {}
  vectors = []
  sum = 0


# Iterate through the words and count their occurrences
  for word in filtered_words:
    if word in word_val:
      word_val[word] += 1
    else:
      word_val[word] = 1


  for word in word_val:
    sum += word_val[word]


  for word in word_val:
    word_val[word] = word_val[word] / sum

  return word_val




In [96]:
tf_text1 = tf(text1)
print(tf_text1)

{'hello': 0.1111111111111111, 'my': 0.1111111111111111, 'name': 0.1111111111111111, 'is': 0.1111111111111111, 'christian': 0.1111111111111111, 'n': 0.1111111111111111, 'joejen': 0.1111111111111111, 'kjnkjno': 0.1111111111111111, 'kjnknvs': 0.1111111111111111}


# IF-IDF

In [None]:
def tf_idf(tf, idf, dictionary):
  """Calculates the TF-IDF value for terms based on their TF and IDF values."""
  tfidf_values = {}

  for word, tf_value in tf.items():
    if word in idf:

      tfidf_values[word] = tf_value * idf[word]
    else:
      # Handle cases where a word in TF is not in IDF (shouldn't happen if both are from the same corpus)
      tfidf_values[word] = tf_value * 0 # Or handle as appropriate, e.g., raise an error or log a warning

  vector = []
  for word in dictionary:
      if word in tfidf_values:
          vector.append(tfidf_values[word])
      else:
          vector.append(0) # Append 0 if the word is not in tfidf_values

  return tfidf_values, vector

In [None]:
tf_idf_text1 , vector1 = tf_idf(tf_text1, idf, vec_dictionary )
print(tf_idf_text1)
print(vector1)

{'hello': 0.0, 'my': 0.18181818181818182, 'name': 0.18181818181818182, 'is': 0.18181818181818182, 'christian': 0.18181818181818182, 'x': 0.1277695552825604}
[0.18181818181818182, 0.0, 0.18181818181818182, 0.18181818181818182, 0.18181818181818182, 0.1277695552825604]


In [None]:
tf_idf_text2 , vector2 = tf_idf(tf_text2, idf, vec_dictionary )
print(tf_idf_text2)
print(vector2)

{'hello': 0.0, 'my': 0.18181818181818182, 'name': 0.18181818181818182, 'is': 0.18181818181818182, 'christian': 0.18181818181818182, 'x': 0.1277695552825604}
[0.18181818181818182, 0.0, 0.18181818181818182, 0.18181818181818182, 0.18181818181818182, 0.1277695552825604]


## Cosine Similarity

In [120]:
import numpy as np

def cosine(vector1, vector2):
  """Calculates the cosine similarity between two vectors."""
  vector1 = np.array(vector1)

  vector2 = np.array(vector2)

  cosine = (vector1 @ vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
  return cosine

# Cosine similarity - TF_IDF

In [None]:
similarity = cosine(vector1, vector2)
print(similarity)

if similarity > 0.5:
  print("The documents are similar")
else:
  print("The documents are not similar")

[0.18181818 0.         0.18181818 0.18181818 0.18181818 0.12776956]
[0.18181818 0.         0.18181818 0.18181818 0.18181818 0.12776956]
0.9999999999999998
The documents are similar


# System

In [117]:
def similarity_system(doc, text_1, texts_2 ):
  vec_dictionary = dic(doc)



  freq = calculate_document_frequency(vec_dictionary, doc)


  idf = idf_calc(freq, doc)


  tf_text1 = tf(text_1)

  tf_text2 = tf(texts_2)


  tf_idf_text1 , vector1 = tf_idf(tf_text1, idf, vec_dictionary )


  tf_idf_text2 , vector2 = tf_idf(tf_text2, idf, vec_dictionary )


  similarity = cosine(vector1, vector2)
  print(similarity)
  if similarity > 0.5:
    print("The documents are similar")
  else:
    print("The documents are not similar")



# Test on Corpus

In [122]:
text1 = "and you me"
text2 = "how about we rule"

docx = [text1, text2]
similarity_system(docx, text1, text2)



0.0
The documents are not similar


In [127]:
import os
import pandas as pd
import re

def parse_sgm(file_path):
    with open(file_path, 'r', encoding='latin-1') as f:
        content = f.read()

    # Split the file content by the <REUTERS> tag to get individual documents
    docs = content.split('</REUTERS>')

    parsed_data = []
    for doc in docs:
        if '<REUTERS' not in doc:
            continue

        # Extract relevant information using regex
        title_match = re.search(r'<TITLE>(.*?)</TITLE>', doc, re.DOTALL)
        body_match = re.search(r'<BODY>(.*?)</BODY>', doc, re.DOTALL)
        topics_match = re.search(r'<TOPICS>(.*?)</TOPICS>', doc, re.DOTALL)

        title = title_match.group(1).strip() if title_match else None
        body = body_match.group(1).strip() if body_match else None
        topics = []
        if topics_match:
            topic_list = re.findall(r'<D>(.*?)</D>', topics_match.group(1))
            topics = topic_list

        if body: # Only include documents with a body
            parsed_data.append({
                'title': title,
                'body': body,
                'topics': topics
            })
    return parsed_data

sgm_files = [f for f in os.listdir('.') if f.endswith('.sgm')]

all_docs_data = []
for sgm_file in sgm_files:
    file_path = os.path.join('.', sgm_file)
    all_docs_data.extend(parse_sgm(file_path))

# Extract the 'body' of the documents for similarity testing
doc_texts = [doc['body'] for doc in all_docs_data if doc['body']]

In [129]:
# Example: Test similarity between the first two documents
if len(doc_texts) >= 2:
    text1 = doc_texts[0]
    text2 = doc_texts[1]
    print(f"Calculating similarity between document 1 and document 2:")
    similarity_system(doc_texts, text1, text2)
elif len(doc_texts) == 1:
    print("Only one document found. Cannot calculate similarity between two documents.")
else:
    print("No documents with body found in the SGM files.")

Calculating similarity between document 1 and document 2:
0.053568768973858275
The documents are not similar
