## Q1. Apply the steps of KDD to obtain the applicable IPC/BNS section from the following complaint. Give useful algorithms, R/Python code snippets, or processes for identifying applicable BNS (Bharatiya Nyaya Sanhita) sections.
### "Theft of my mobile phone by an unknown person at MG Road on 2023-02-20. The phone was stolen when I was shopping at a store. The thief was wearing a black jacket and had a scar on his face."




This project processes a text-based crime report and identifies the most applicable sections of the IPC (Indian Penal Code) or the BNS (Bharatiya Nyaya Sanhita). It involves text preprocessing, entity recognition, relationship extraction, and classification.

## Steps for Execution

### 1. Text Preprocessing
#### Purpose:
Prepare the input text for further analysis.

#### Sub-steps:
1. **Tokenization**:
   Break the text into tokens (words or punctuation).
2. **Stop Word Removal and Lemmatization**:
   Remove commonly used words that do not carry significant meaning (e.g., "is", "the") and reduce words to their base form.

---

### 2. Entity Recognition
#### Purpose:
Extract key named entities from the text, such as:
- **Crimes**: Theft, assault, fraud, etc.
- **Items**: Mobile phone, wallet, etc.
- **Locations**: MG Road, Indore, etc.
- **Dates**: 2023-02-20, etc.

---

### 3. Pattern or Relationship Extraction
#### Purpose:
Use dependency parsing to identify relationships between entities (e.g., who did what, where, and when).

#### Example:
- Identify that "unknown person" is linked to "theft" at "MG Road" on a specific date.

---

### 4. Classification
#### Purpose:
Map the extracted information to relevant sections of IPC or BNS.

#### Example:
- Theft maps to **IPC Section 378** (definition) and **IPC Section 379** (punishment); the corresponding BNS provision is **Section 303**.

---

### 5. Post-Processing and Ranking
#### Purpose:
1. Filter and validate the identified sections.
2. Rank sections based on relevance.

---



**Step 1: Preprocessing**

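In [None]:
import nltk  # must be imported before nltk.download can be called
nltk.download('all', force=True)  # optional and large; the targeted downloads in the next cell are sufficient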


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Ensure necessary data is downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Define the input text
text = (
    "Theft of my mobile phone by an unknown person at MG Road on 2023-02-20. "
    "The phone was stolen when I was shopping at a store. The thief was wearing a black jacket and had a scar on his face."
)

try:
    # Tokenization
    tokens = word_tokenize(text)
    print("Tokens:", tokens)

    # Removing stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    print("Filtered Tokens:", filtered_tokens)

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    print("Lemmatized Tokens:", lemmatized_tokens)

except LookupError as e:
    print(f"LookupError: {e}")
    print("Please ensure the NLTK data is properly installed.")


Tokens: ['Theft', 'of', 'my', 'mobile', 'phone', 'by', 'an', 'unknown', 'person', 'at', 'MG', 'Road', 'on', '2023-02-20', '.', 'The', 'phone', 'was', 'stolen', 'when', 'I', 'was', 'shopping', 'at', 'a', 'store', '.', 'The', 'thief', 'was', 'wearing', 'a', 'black', 'jacket', 'and', 'had', 'a', 'scar', 'on', 'his', 'face', '.']
Filtered Tokens: ['Theft', 'mobile', 'phone', 'unknown', 'person', 'MG', 'Road', '2023-02-20', '.', 'phone', 'stolen', 'shopping', 'store', '.', 'thief', 'wearing', 'black', 'jacket', 'scar', 'face', '.']
Lemmatized Tokens: ['Theft', 'mobile', 'phone', 'unknown', 'person', 'MG', 'Road', '2023-02-20', '.', 'phone', 'stolen', 'shopping', 'store', '.', 'thief', 'wearing', 'black', 'jacket', 'scar', 'face', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Step 2: Named Entity Recognition (NER)**

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Extract entities
entities = [(ent.text, ent.label_) for ent in doc.ents]
print("Entities:", entities)


Entities: [('MG Road', 'FAC'), ('2023-02-20', 'DATE')]
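The pretrained model tags only the location and the date; it has no labels for crime types or stolen items. A minimal sketch of a rule-based supplement follows, assuming small hand-made keyword lists. `CRIME_TERMS`, `ITEM_TERMS`, and `find_domain_entities` are hypothetical names introduced here, and the lists would need to be built from real complaint data.

```python
import re

text = (
    "Theft of my mobile phone by an unknown person at MG Road on 2023-02-20. "
    "The phone was stolen when I was shopping at a store. "
    "The thief was wearing a black jacket and had a scar on his face."
)

# Hypothetical keyword lists, for illustration only
CRIME_TERMS = {
    "theft": ["theft", "stolen", "stole", "thief"],
    "assault": ["assault", "attacked", "beaten"],
    "fraud": ["fraud", "cheated", "scam"],
}
ITEM_TERMS = ["mobile phone", "phone", "wallet", "laptop", "jewellery"]

def has_term(term, lowered):
    """True if term occurs in the lowercased text as a whole word or phrase."""
    return re.search(rf"\b{re.escape(term)}\b", lowered) is not None

def find_domain_entities(text):
    """Return (matched keyword, label) pairs for crime and item keywords."""
    lowered = text.lower()
    found = [(kw, f"CRIME:{crime}")
             for crime, keywords in CRIME_TERMS.items()
             for kw in keywords if has_term(kw, lowered)]
    items = [it for it in ITEM_TERMS if has_term(it, lowered)]
    # Drop a term that is contained in a longer matched term ("phone" in "mobile phone")
    items = [it for it in items if not any(it != o and it in o for o in items)]
    return found + [(it, "ITEM") for it in items]

domain_entities = find_domain_entities(text)
print(domain_entities)
```

These pairs can be merged with the spaCy entities so that the classification step sees the crime type and the item alongside the location and date.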


**Step 3: Dependency Parsing**

In [None]:
for token in doc:
    print(f"{token.text} -> {token.dep_} -> {token.head.text}")


Theft -> ROOT -> Theft
of -> prep -> Theft
my -> poss -> phone
mobile -> amod -> phone
phone -> pobj -> of
by -> prep -> Theft
an -> det -> person
unknown -> amod -> person
person -> pobj -> by
at -> prep -> person
MG -> compound -> Road
Road -> pobj -> at
on -> prep -> person
2023 -> nummod -> 20
- -> punct -> 2023
02 -> appos -> 2023
- -> punct -> 20
20 -> pobj -> on
. -> punct -> Theft
The -> det -> phone
phone -> nsubjpass -> stolen
was -> auxpass -> stolen
stolen -> ROOT -> stolen
when -> advmod -> shopping
I -> nsubj -> shopping
was -> aux -> shopping
shopping -> advcl -> stolen
at -> prep -> shopping
a -> det -> store
store -> pobj -> at
. -> punct -> stolen
The -> det -> thief
thief -> nsubj -> wearing
was -> aux -> wearing
wearing -> ROOT -> wearing
a -> det -> jacket
black -> amod -> jacket
jacket -> dobj -> wearing
and -> cc -> wearing
had -> conj -> wearing
a -> det -> scar
scar -> dobj -> had
on -> prep -> had
his -> poss -> face
face -> pobj -> on
. -> punct -> wearing
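The raw parse can be condensed into (governor, preposition, object) triples that state who did what, where, and when. The sketch below is one way to do this, not a definitive method. It works on the first sentence of the parse, transcribed by hand into a hypothetical `Tok` tuple whose `head` field is the index of the governing token (the printed output shows only the head text, so the indices are an assumption derived from it).

```python
from collections import namedtuple

Tok = namedtuple("Tok", "text dep head")  # head = index of the governing token

# First sentence of the printed parse; head indices filled in by hand
SENTENCE = [
    Tok("Theft", "ROOT", 0), Tok("of", "prep", 0), Tok("my", "poss", 4),
    Tok("mobile", "amod", 4), Tok("phone", "pobj", 1), Tok("by", "prep", 0),
    Tok("an", "det", 8), Tok("unknown", "amod", 8), Tok("person", "pobj", 5),
    Tok("at", "prep", 8), Tok("MG", "compound", 11), Tok("Road", "pobj", 9),
    Tok("on", "prep", 8), Tok("2023", "nummod", 17), Tok("-", "punct", 13),
    Tok("02", "appos", 13), Tok("-", "punct", 17), Tok("20", "pobj", 12),
    Tok(".", "punct", 0),
]

def phrase(tokens, i):
    """Join token i with its adjectival and compound modifiers, in sentence order."""
    idx = [j for j, t in enumerate(tokens)
           if t.head == i and t.dep in ("amod", "compound")]
    return " ".join(tokens[j].text for j in sorted(idx + [i]))

def prep_relations(tokens):
    """Yield (governor, preposition, object phrase) for each prepositional object."""
    for i, tok in enumerate(tokens):
        if tok.dep == "pobj":
            prep = tokens[tok.head]
            governor = tokens[prep.head]
            yield (governor.text, prep.text, phrase(tokens, i))

relations = list(prep_relations(SENTENCE))
for rel in relations:
    print(rel)
```

Two limitations are visible in the result: the parser attaches "at MG Road" and the date to "person" rather than to "Theft", and the tokenizer splits the date, so the date is better taken from the NER output than from the parse.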


**Step 4: Classification**

In [None]:
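# Sample mapping dictionary for IPC/BNS sections
section_mapping = {
    "theft": "IPC 379",  # punishment for theft (defined in IPC 378; BNS equivalent: Section 303)
    "property crime": "BNS 4.1",  # placeholder label, not an actual BNS section number
}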

# Mapping extracted entities to sections
crime = "theft"  # Example entity identified
section = section_mapping.get(crime, "Unknown Section")
print(f"Mapped Section for {crime}: {section}")


Mapped Section for theft: IPC 379
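The cell above hard-codes `crime = "theft"`. A minimal sketch of deriving the crime from the preprocessed tokens is shown here, assuming a simple keyword-overlap score. `OFFENCE_KEYWORDS` and `score_offences` are hypothetical, and a production system would more likely use a trained text classifier.

```python
# Lemmatized tokens copied from the Step 1 output
lemmatized_tokens = ['Theft', 'mobile', 'phone', 'unknown', 'person', 'MG', 'Road',
                     '2023-02-20', '.', 'phone', 'stolen', 'shopping', 'store', '.',
                     'thief', 'wearing', 'black', 'jacket', 'scar', 'face', '.']

# Hypothetical keyword profiles; a real system would derive these from statute text
OFFENCE_KEYWORDS = {
    "theft": {"theft", "stolen", "thief", "steal", "snatched"},
    "cheating": {"fraud", "cheated", "deceived", "scam"},
    "assault": {"assault", "attacked", "beaten", "hit"},
}

# Same mapping as in the classification cell (theft entry only)
section_mapping = {"theft": "IPC 379"}

def score_offences(tokens):
    """Score each offence by the fraction of its keywords present in the complaint."""
    vocab = {t.lower() for t in tokens}
    return {off: len(kws & vocab) / len(kws) for off, kws in OFFENCE_KEYWORDS.items()}

scores = score_offences(lemmatized_tokens)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
best_offence = ranked[0][0]
print(ranked)
print(best_offence, "->", section_mapping.get(best_offence, "Unknown Section"))
```

The scores double as a relevance value for the ranking step, replacing hand-assigned confidences.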


**Step 5: Post-Processing**


In [None]:
# Example of validation and ranking; the confidence scores are hard-coded for illustration
identified_sections = [("IPC 379", 0.95), ("BNS 4.1", 0.90)]
validated_sections = sorted(identified_sections, key=lambda x: x[1], reverse=True)
print("Validated Sections:", validated_sections)


Validated Sections: [('IPC 379', 0.95), ('BNS 4.1', 0.9)]



## Q2. Explain the Decision Tree process of data science and its various approaches to construct a decision tree from a database. How can these structures and algorithms be applied to effectively classify the complaint into appropriate IPC or BNS sections?

### Classifying Complaints into IPC/BNS Sections Using Decision Trees

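A decision tree classifies a record by testing one attribute at each internal node and following the matching branch until it reaches a leaf that holds the class label. The common construction algorithms differ mainly in their splitting criterion: ID3 uses information gain, C4.5 uses gain ratio, and CART uses the Gini index. The steps below apply an information-gain tree to classify complaints into appropriate IPC or BNS sections: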

### 1. Data Collection and Preparation
- Collect and preprocess complaint data, including:
  - Tokenization
  - Stopword removal
  - Entity extraction (e.g., identifying crimes, locations, dates)

### 2. Feature Selection
- Use attributes from the complaint data (e.g., crime type, location, date, involved parties) as features for classification.

### 3. Entropy and Information Gain
- Calculate the entropy of the dataset.
- Evaluate the information gain for each attribute to determine the best splitting criteria.
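Entropy measures how mixed the class labels are, and information gain is the drop in entropy after splitting on an attribute. The sketch below computes both on a tiny hand-made dataset. The rows, the attribute names (`property_taken`, `force_used`, `location`), and the offence labels are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="offence"):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training rows, for illustration only
rows = [
    {"property_taken": "yes", "force_used": "no",  "location": "street", "offence": "theft"},
    {"property_taken": "yes", "force_used": "no",  "location": "shop",   "offence": "theft"},
    {"property_taken": "yes", "force_used": "yes", "location": "street", "offence": "robbery"},
    {"property_taken": "yes", "force_used": "yes", "location": "shop",   "offence": "robbery"},
    {"property_taken": "no",  "force_used": "yes", "location": "street", "offence": "assault"},
    {"property_taken": "no",  "force_used": "yes", "location": "shop",   "offence": "assault"},
]

gains = {a: information_gain(rows, a) for a in ("property_taken", "force_used", "location")}
for attribute, gain in gains.items():
    print(f"{attribute}: {gain:.3f}")
```

In this toy data `location` has zero gain because every location contains all three offences equally, so it would never be chosen as a split.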

### 4. Tree Construction
- Construct a decision tree by iteratively splitting the data based on the attribute with the highest information gain.
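The recursive splitting described here is the ID3 procedure. A compact sketch follows, assuming categorical attributes and a nested-dict tree representation. `build_tree` and `classify` are hypothetical helper names, and the training rows are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def build_tree(rows, attributes, target="offence"):
    """Return a nested dict {attribute: {value: subtree}}, or a class label at a leaf."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # pure node or no attributes left
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value], remaining, target)
                   for value in sorted({r[best] for r in rows})}}

def classify(tree, row, default="unknown"):
    """Follow the row's attribute values down the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(row.get(attr), default)
    return tree

# Hypothetical training rows, for illustration only
rows = [
    {"property_taken": "yes", "force_used": "no",  "location": "street", "offence": "theft"},
    {"property_taken": "yes", "force_used": "no",  "location": "shop",   "offence": "theft"},
    {"property_taken": "yes", "force_used": "yes", "location": "street", "offence": "robbery"},
    {"property_taken": "yes", "force_used": "yes", "location": "shop",   "offence": "robbery"},
    {"property_taken": "no",  "force_used": "yes", "location": "street", "offence": "assault"},
    {"property_taken": "no",  "force_used": "yes", "location": "shop",   "offence": "assault"},
]

tree = build_tree(rows, ["property_taken", "force_used", "location"])
complaint = {"property_taken": "yes", "force_used": "no", "location": "shop"}
print(tree)
print(classify(tree, complaint))
```

The `complaint` row encodes the mobile-phone report from Q1 (property taken, no force mentioned, inside a store). The predicted offence can then be looked up in a section mapping to obtain the IPC or BNS section.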

### 5. Classification
- Traverse the constructed tree to classify each complaint into specific IPC or BNS sections based on the attributes and splitting rules.

### 6. Validation and Optimization
- Validate the model on test data.
- Optimize the model by pruning or adjusting thresholds to improve accuracy.
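Validation on test data is usually summarised with accuracy plus per-class precision, recall, and F1, since a rare section can be misclassified often while overall accuracy stays high. The sketch below computes these from paired label lists. `per_class_metrics` is a hypothetical helper, and the labels and predictions are invented for illustration, not results from a real model.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from paired label lists."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Hypothetical held-out labels and model predictions, for illustration only
y_true = ["theft", "theft", "robbery", "assault", "theft", "robbery"]
y_pred = ["theft", "robbery", "robbery", "assault", "theft", "theft"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)

print(f"accuracy: {accuracy:.3f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Comparing these numbers before and after pruning shows whether a simpler tree generalises better to unseen complaints.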

## Conclusion
This approach classifies complaints through explicit, inspectable rules, so each predicted IPC or BNS section can be traced back to the attribute tests that produced it. That gives a structured and auditable method for handling and categorizing complaint data.


## Q3. What metrics are appropriate for assessing the performance of algorithms used in text mining? Give a detailed explanation of how they evaluate the effectiveness and accuracy of text mining techniques.