<a href="https://colab.research.google.com/github/Prashant-1008/Cyber_Believers/blob/main/NLP_Based_model(2nd_phase).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Complaint Classification Using Machine Learning

This notebook classifies user complaints into categories and sub-categories. It trains two random forest classifiers on TF-IDF features and then uses them to label new complaints. It includes steps to:

- Train a category model and a sub-category model on `train.csv`
- Accept a complaint as input
- Transform the input using the fitted TF-IDF vectorizer
- Predict the category and sub-category using the trained models
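
The steps above can be sketched end to end on a tiny, made-up corpus. The texts, labels, and the `classify` helper are illustrative assumptions, not the notebook's `train.csv` data or its trained models; the sketch only shows how a fitted `TfidfVectorizer` and a `RandomForestClassifier` fit together.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy complaints and labels (not taken from train.csv)
texts = [
    "money deducted from my bank account after a fake upi request",
    "someone withdrew cash using my debit card details",
    "my social media profile was hacked and fake posts were made",
    "a fake profile is harassing me on social media",
]
labels = [
    "Online Financial Fraud",
    "Online Financial Fraud",
    "Online and Social Media Related Crime",
    "Online and Social Media Related Crime",
]

# Learn the vocabulary and IDF weights, then train on the resulting features
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(features, labels)


def classify(complaint):
    """Vectorize one complaint with the fitted vocabulary and predict its label."""
    return model.predict(vectorizer.transform([complaint]))[0]


print(classify("fake upi request took money from my account"))
```

New text must go through `vectorizer.transform`, not `fit_transform`, so that it is mapped onto the vocabulary learned during training.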


## Dependencies and Imports
Install pandas, NLTK, and scikit-learn before running this notebook.
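
One way to check the environment before running the cells is to look up each package by its import name. This is a minimal sketch; the `required` list is inferred from the imports in the first code cell, and any missing package would still have to be installed with your package manager.

```python
import importlib.util

# Import names used by this notebook (scikit-learn is imported as "sklearn")
required = ["pandas", "nltk", "sklearn"]

# find_spec returns None when a top-level package cannot be located
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```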

In [1]:
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Download NLTK stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Load datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Text preprocessing function
def preprocess_text(text):
    if isinstance(text, str):  # Check if text is a string
        # Lowercase
        text = text.lower()
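        # Remove punctuation and digits; re.escape stops the backslash in
        # string.punctuation from acting as an escape inside the character class
        text = re.sub(f'[{re.escape(string.punctuation)}0-9]', '', text)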
        # Remove extra spaces
        text = re.sub(r'\s+', ' ', text).strip()
        # Remove stopwords
        text = ' '.join([word for word in text.split() if word not in stop_words])
    else:
        text = ''  # If text is not a string (e.g., NaN), return an empty string
    return text

# Apply preprocessing to the train dataset
train['processed_text'] = train['crimeaditionalinfo'].apply(preprocess_text)

# Handle NaN values in target columns
train['category'] = train['category'].fillna('Unknown')  # Fill NaNs with 'Unknown' for category
train['sub_category'] = train['sub_category'].fillna('Unknown')  # Fill NaNs with 'Unknown' for sub_category

# Vectorize text data
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train['processed_text']).toarray()

# Target variables
y_category = train['category']
y_sub_category = train['sub_category']

# Split once so the features and both targets share the same training and validation rows
(X_train, X_val,
 y_train_category, y_val_category,
 y_train_sub_category, y_val_sub_category) = train_test_split(
    X, y_category, y_sub_category, test_size=0.2, random_state=42
)

# Initialize and train the category classifier
category_model = RandomForestClassifier(random_state=42)
category_model.fit(X_train, y_train_category)

# Initialize and train the sub-category classifier
sub_category_model = RandomForestClassifier(random_state=42)
sub_category_model.fit(X_train, y_train_sub_category)

# Evaluate on the validation set
y_pred_category = category_model.predict(X_val)
y_pred_sub_category = sub_category_model.predict(X_val)

print("Validation Accuracy for Category:", accuracy_score(y_val_category, y_pred_category))
print("Category Classification Report:")
# zero_division=0 reports 0.00 for classes that are never predicted instead of emitting a warning
print(classification_report(y_val_category, y_pred_category, zero_division=0))

print("Validation Accuracy for Sub-Category:", accuracy_score(y_val_sub_category, y_pred_sub_category))
print("Sub-Category Classification Report:")
print(classification_report(y_val_sub_category, y_pred_sub_category, zero_division=0))

# Preprocess the test data and predict
test['processed_text'] = test['crimeaditionalinfo'].apply(preprocess_text)
X_test = vectorizer.transform(test['processed_text']).toarray()

test['predicted_category'] = category_model.predict(X_test)
test['predicted_sub_category'] = sub_category_model.predict(X_test)

# Save test predictions
test[['predicted_category', 'predicted_sub_category']].to_csv('test_predictions.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Validation Accuracy for Category: 0.7609136514035649
Category Classification Report:




                                                      precision    recall  f1-score   support

                               Any Other Cyber Crime       0.64      0.10      0.17      2091
Child Pornography CPChild Sexual Abuse Material CSAM       0.94      0.25      0.39        69
                                Cryptocurrency Crime       0.75      0.03      0.06        96
                      Cyber Attack/ Dependent Crimes       1.00      1.00      1.00       765
                                     Cyber Terrorism       1.00      0.03      0.06        31
      Hacking  Damage to computercomputer system etc       0.64      0.05      0.10       341
                            Online Cyber Trafficking       0.00      0.00      0.00        34
                              Online Financial Fraud       0.77      0.98      0.87     11471
                            Online Gambling  Betting       0.00      0.00      0.00        97
               Online and Social Media Related Crime       



                                                                      precision    recall  f1-score   support

               Against Interest of sovereignty or integrity of India       0.00      0.00      0.00         1
                             Business Email CompromiseEmail Takeover       0.00      0.00      0.00        60
                                           Cheating by Impersonation       0.33      0.01      0.01       386
                                                Cryptocurrency Fraud       0.83      0.05      0.10        96
                                   Cyber Bullying  Stalking  Sexting       0.48      0.56      0.51       836
                                                     Cyber Terrorism       0.00      0.00      0.00        31
                             Damage to computer computer systems etc       0.00      0.00      0.00        30
                                                   Data Breach/Theft       0.10      0.10      0.10       103
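
Rows with 0.00 precision and recall, such as `Cyber Terrorism` in the sub-category report, mark classes for which the model produces no correct predictions. A minimal sketch of how such classes could be listed programmatically follows; the toy labels are made up, and `output_dict` and `zero_division` are standard `classification_report` parameters in recent scikit-learn releases.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: the model predicts the majority class every time
y_true = ["fraud", "fraud", "fraud", "hacking", "gambling"]
y_pred = ["fraud", "fraud", "fraud", "fraud", "fraud"]

# output_dict=True returns nested dicts instead of a formatted string;
# zero_division=0 reports 0 for undefined metrics without a warning
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

# The dict also holds summary entries that are not class labels
summary_keys = {"accuracy", "macro avg", "weighted avg"}
missed = sorted(
    label
    for label, row in report.items()
    if label not in summary_keys and row["recall"] == 0.0
)

print("Classes with zero recall:", missed)
```

A list like this makes it easier to see which minority classes the headline accuracy is hiding.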
         

In [2]:

# Add interactive user input for predictions
while True:
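    user_input = input("Enter crime additional information (or type 'exit' to quit): ")
    if user_input.strip().lower() == 'exit':
        break

    # Preprocess the input; skip entries with no usable words left
    processed_input = preprocess_text(user_input)
    if not processed_input:
        print("No usable words in the input; please add more detail.")
        continue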

    # Vectorize the input
    input_vector = vectorizer.transform([processed_input]).toarray()

    # Predict category and sub-category
    predicted_category = category_model.predict(input_vector)[0]
    predicted_sub_category = sub_category_model.predict(input_vector)[0]

    # Display predictions
    print(f"Predicted Category: {predicted_category}")
    print(f"Predicted Sub-Category: {predicted_sub_category}")

Enter crime additional information (or type 'exit' to quit): got a lottery call
Predicted Category: Online Financial Fraud
Predicted Sub-Category: UPI Related Frauds
Enter crime additional information (or type 'exit' to quit): i am from delhi saket, i was receiving an email of lottery and was asking for my bank details now he is not responding after taking 2 lakh as security
Predicted Category: Online Financial Fraud
Predicted Sub-Category: Other
Enter crime additional information (or type 'exit' to quit): exit
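
The loop above relies on the vectorizer and models still being in memory. To reuse them in a later session without retraining, one option is to persist them with `joblib`. This is a minimal sketch on toy data; the file name `complaint_models.joblib` and the dictionary layout of the saved bundle are assumptions, not something the notebook defines.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the notebook's training set
texts = [
    "lottery call asked for my bank details",
    "money deducted through a fake upi request",
    "my account was hacked and the password changed",
    "someone logged in to my email without permission",
]
labels = ["Online Financial Fraud", "Online Financial Fraud", "Hacking", "Hacking"]

vectorizer = TfidfVectorizer()
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(vectorizer.fit_transform(texts), labels)

# Save the vectorizer together with the model: the model only understands
# features produced by this exact fitted vocabulary
path = os.path.join(tempfile.mkdtemp(), "complaint_models.joblib")
joblib.dump({"vectorizer": vectorizer, "category_model": model}, path)

# Later (for example in a new session): load and predict without retraining
bundle = joblib.load(path)
sample = ["lottery call asking for bank details"]
restored = bundle["category_model"].predict(bundle["vectorizer"].transform(sample))
original = model.predict(vectorizer.transform(sample))
print(restored[0])
```

Files written by `joblib.dump` are pickle-based, so only load files from a source you trust, and use a compatible scikit-learn version when loading.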
