<a href="https://colab.research.google.com/github/Cefloresm/MessageFraudDetector/blob/master/Final_Proyect_Prototyping_with_Data_and_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Steps to make our fraud detector


1 Transform our data - Libraries: **Pycaret** or **NLTK**

**MUST DOs in Text Mining techniques!**

- **Tokenization**: Split the text into sentences > split sentences into words > convert everything to lowercase > remove punctuation
- **Remove all stopwords** (there are dictionaries to help us do this)
- **Lemmatize your words** (e.g. changing third person to first person, and past or future tense verbs to present tense)
- **Stem your words** (e.g. “walking” and “walked” are reduced to “walk”)
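
To make the first two bullets concrete, here is a minimal sketch that uses only the standard library. The `STOPWORDS` set and the `tokenize` helper are hypothetical names made up for illustration; a real pipeline would rely on a full stopword dictionary (for example the one shipped with NLTK) and would add lemmatization or stemming on top.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch);
# real projects would use a complete dictionary such as NLTK's.
STOPWORDS = {"a", "an", "the", "is", "to", "you", "your", "has", "of"}

def tokenize(message):
    """Lowercase, strip punctuation, split into words, drop stopwords."""
    lowered = message.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    words = no_punct.split()
    return [w for w in words if w not in STOPWORDS]

tokens = tokenize("Your account has been SUSPENDED! Call now to claim the prize.")
print(tokens)
```

Splitting on whitespace is a simplification; a proper tokenizer also handles contractions, numbers and similar edge cases.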


2 Count vectorizer - a simple count of how many times each word appears
Example:
text_list = []
vectorizer = CountVectorizer()
vectorizer.fit(text_list)
x = vectorizer.transform(list2)
x -> a matrix with every word as a column and a count of how often each one appears
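
To show what that matrix contains, the sketch below reproduces the idea with the standard library only. `count_vectorize` is a hypothetical helper, not scikit-learn's implementation: the real `CountVectorizer` returns a sparse matrix and applies its own tokenization rules.

```python
from collections import Counter

def count_vectorize(messages):
    """Return (vocabulary, rows); each row counts the vocabulary words in one message."""
    tokenized = [m.lower().split() for m in messages]
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = count_vectorize(["win a prize now", "call now now"])
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one message and each column one word, so the second message gets a 2 in the `now` column.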


3 Balance the data (Fraud vs Non fraud)
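
One simple way to balance the classes is random undersampling: keep all rows of the smallest class and draw the same number from every other class. The sketch below assumes each row is a dictionary with a `Fraud/normal` key; the `undersample` helper and the toy rows are hypothetical, and oversampling or class weights are equally valid choices.

```python
import random

def undersample(rows, label_key="Fraud/normal", seed=42):
    """Randomly drop rows from larger classes until every class matches the smallest one."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy data: 8 normal messages and 2 fraud messages
rows = [{"Fraud/normal": "normal", "Message": f"msg {i}"} for i in range(8)]
rows += [{"Fraud/normal": "fraud", "Message": f"scam {i}"} for i in range(2)]
balanced = undersample(rows)
print(len(balanced))
```

Undersampling throws data away, so it is best suited to cases where the majority class is large enough to spare some rows.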


4 Plug the clean dataset into the ML model.

First try supervised learning with **Random Forests** or **Logistic Regression**.

ALTERNATIVELY

Unsupervised learning with the **LDA model** algorithm, which, much like K-means, divides the data into different "clusters" or segments.

5 Using the model's probability and output, connect it to Gmail and send the user a notice of **LOW** or **HIGH** probability of fraud.
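
The LOW/HIGH decision in step 5 comes down to a threshold on the model's fraud probability. The sketch below assumes a probability between 0 and 1 and a cut-off of 0.5; both the `fraud_risk_label` name and the threshold are assumptions, and the cut-off would normally be tuned on validation data.

```python
def fraud_risk_label(probability, threshold=0.5):
    """Map a fraud probability in [0, 1] to 'HIGH' or 'LOW'."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "HIGH" if probability >= threshold else "LOW"

print(fraud_risk_label(0.91))
print(fraud_risk_label(0.12))
```

Lowering the threshold flags more messages as HIGH, trading extra false alarms for fewer missed frauds.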


# BACK END

# Step 0 Data cleaning


In [None]:
# Load pandas package to read tables (dataframes)
import pandas as pd


In [None]:
# Load the CSV file with proper delimiter
data = pd.read_csv('fraud_call.csv')

# Display the transformed data
data.head()

In [None]:
# Split the first column into multiple parts based on the observed delimiter
split_data = data.iloc[:, 0].str.split(r'\s+', expand=True, n=1)

# Select only the first two columns and rename them
cleaned_data = split_data.iloc[:, :2]
cleaned_data.columns = ['Fraud/normal', 'Message']

# Handle any missing values by filling them with an empty string
cleaned_data.loc[:, 'Message'] = cleaned_data['Message'].fillna('')

# Display the cleaned data
print(cleaned_data.head(100))

# From here on, work with the combined dataset that I merged on 20/06/2024
cleaned_data = df = pd.read_csv('cleaned_merged_data.csv')
print(cleaned_data.head())

In [None]:
print(cleaned_data.dtypes)


In [None]:
cleaned_data.describe()

In [None]:
# Find non-unique messages
non_unique_messages = cleaned_data[cleaned_data.duplicated(subset=['Message'], keep=False)]
print (non_unique_messages)

# Find unique messages
unique_messages = cleaned_data[~cleaned_data.duplicated(subset=['Message'], keep=False)]

# Use the de-duplicated messages as the main dataframe 'df'
# (copy so that later column edits do not trigger SettingWithCopyWarning)
df = unique_messages.copy()
print(df)
print(df.describe())


In [None]:
#Count of normal and fraud
pncount = cleaned_data['Fraud/normal'].value_counts()
print(pncount)

# % of positive (fraud) and negative (nonfraud) class in data
pnpercentage= pncount/len(cleaned_data)
print(pnpercentage*100)

Preprocessing text data with Pycaret

In [None]:
!pip install pycaret[nlp]

In [None]:
!pip install nltk gensim pyLDAvis

# Step 0.5: Tokenization

In [None]:
#Import required libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
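# Download the NLTK stopword list and tokenizer models
nltk.download('stopwords')
nltk.download('punkt')
# Newer NLTK releases load the tokenizer data from 'punkt_tab' instead
nltk.download('punkt_tab')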

In [None]:
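# Build the stopword set once instead of reloading the list for every word
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # Remove punctuation and convert to lowercase (str() guards against non-string values)
    text = re.sub(r'[^\w\s]', '', str(text).lower())

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    return ' '.join(tokens)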

# Apply the clean_text function to each message and replace the original column
df['Message'] = df['Message'].apply(clean_text)

# Display the DataFrame with cleaned messages
print(df)

# Step 1: Count Vectorization of Tokenized Text


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(stop_words='english')
dtm = cv.fit_transform(df['Message'])

# Convert the document-term matrix to a DataFrame for better visualization
dtm_df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())

# Select the row you are interested in (e.g., first row)
row_index = 0
row_data = dtm_df.iloc[row_index]

# Filter the row to show only columns with a value greater than 0
non_zero_columns = row_data[row_data > 0]

# Display the non-zero columns
print(f"Non-zero columns for row {row_index}:")
print(non_zero_columns)

print(dtm_df.columns)

# Step 2: LDA Model, Fit + Transform Document/Term Matrix (dtm)

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
# Build LDA Model with GridSearch parameters
lda_model = LatentDirichletAllocation(n_components=8,
                                      learning_decay=0.5,
                                      max_iter=50,
                                      learning_method='online',
                                      random_state=42,
                                      batch_size=5000,
                                      evaluate_every = -1,
                                      n_jobs = -1)

lda_output = lda_model.fit_transform(dtm)

# Step 3: Manual Review of Top Topic Features for Each Topic

In [None]:
for index,topic in enumerate(lda_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print('\n')

# Step 4: pyLDAvis Interactive Visualization of LDA Model Output

In [None]:
!pip install pyLDAvis

In [None]:
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()
panel = pyLDAvis.lda_model.prepare(lda_model, dtm, cv, mds='tsne')
panel

# Step 5: Attribute Labels to Topics

In [None]:
import numpy as np

test_df = df[df['Fraud/normal'] == 'fraud']
fraud_messages = test_df['Message'].tolist()

# Transform the new data using the same CountVectorizer
new_data_dtm = cv.transform(fraud_messages)

# Use the trained LDA model to predict the topic distribution
topic_distribution = lda_model.transform(new_data_dtm)

# Initialize a counter for topics
topic_counts = {i: 0 for i in range(lda_model.n_components)}

# Display the topic distribution and the topic with the highest weight for each new document
for i, dist in enumerate(topic_distribution):
    max_topic = np.argmax(dist)
    topic_counts[max_topic] += 1

# Display the counts of each topic
print("\nCounts of each topic being the highest:")
for topic, count in topic_counts.items():
    print(f"Topic {topic}: {count} documents")

This shows that topics #6 and #0 (shown as 7 and 1 in the graph) are the two topics with the most fraud messages.

# Step 6: Supervised Learning approach

Let's try using the tokenized text together with the fraud/normal classification column to see how a supervised learning model performs.

In [None]:
#Lets use the dtm_df dataframe that is already tokenized
dtm_df

# Combine with the original 'Fraud/normal' column
df = pd.concat([df['Fraud/normal'].reset_index(drop=True), dtm_df.reset_index(drop=True)], axis=1)

# Print the combined DataFrame to verify
print(df)


In [None]:
from pycaret.classification import *

In [None]:
from pycaret.classification import setup

# Set up PyCaret: 80/20 train/test split with a fixed seed for reproducibility
setup(data=df, target='Fraud/normal', session_id=1, train_size=0.8)

In [None]:
lr= create_model('lr')

In [None]:
evaluate_model(lr)

# Step 7: Testing the working Supervised Learning model (Logistic Regression) with a real example:

In [None]:
# Sample message
sample_message = {'Message': ['Your delivery has been suspended due to a lack of a street no. Please update.']}

# Convert the dictionary to a DataFrame
sample_df = pd.DataFrame(sample_message)

# Clean and tokenize the message (assume you have a clean_text function)
sample_df['Message'] = sample_df['Message'].apply(clean_text)

# Transform the message using the same CountVectorizer
transformed_message = cv.transform(sample_df['Message'])

# Convert the transformed message to a DataFrame
transformed_df = pd.DataFrame(transformed_message.toarray(), columns=cv.get_feature_names_out())

In [None]:
predict_model(lr, data= transformed_df)

# Put it all inside a function called "fraud_detector"

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def fraud_detector(Recieved_msg=None):
  # Fall back to a sample fraud message when no message is passed in
  sample_message = 'You are a £1000 winner or Guaranteed Caller Prize, this is our Final attempt to contact you! To Claim Call 09071517866 Now! 150ppmPOBox10183BhamB64XE'
  email_message = Recieved_msg if Recieved_msg is not None else sample_message

  # Download the NLTK stopword list and tokenizer models
  nltk.download('stopwords')
  nltk.download('punkt')

  # Clean the email message
  cleaned_message = clean_text(email_message)
  print(cleaned_message)

  # Create a DataFrame from the cleaned message
  data = {'Message': [cleaned_message]}
  df = pd.DataFrame(data)

  return df

# Call the fraud_detector function and print the DataFrame
#df = fraud_detector()
#print(df)


  #cv = CountVectorizer(stop_words='english')
  #cvtest = cv.fit_transform(df['Message'])

  # Convert the document-term matrix to a DataFrame for better visualization
  #cvtest_df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())




# FRONT END

Receive email at x@gmail.com

Transform it and run the model

Generate message: <<CAUTION!!! xyz@gmail.com. There is a x% chance that this message is fraud. Please do not

⚠️ ALERT: This email has a high probability ({probability:.2f}%) of being fraudulent.

Suggested Action:
1. Do not click on any links or download any attachments in the email.
2. Do not provide any personal information or financial details.
3. Report this email to your IT/security department immediately.
4. Delete the email from your inbox.
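
The alert text can be generated by a small helper so that the sender and the probability are filled in automatically. This is a sketch under assumptions: `build_alert` and `SUGGESTED_ACTIONS` are hypothetical names, and `probability` is taken to be a percentage between 0 and 100.

```python
SUGGESTED_ACTIONS = [
    "Do not click on any links or download any attachments in the email.",
    "Do not provide any personal information or financial details.",
    "Report this email to your IT/security department immediately.",
    "Delete the email from your inbox.",
]

def build_alert(sender, probability):
    """Compose the warning text; probability is a percentage (0-100)."""
    lines = [
        f"ALERT: The email from {sender} has a {probability:.2f}% probability of being fraudulent.",
        "Suggested Action:",
    ]
    # Number the suggested actions starting from 1
    lines += [f"{i}. {action}" for i, action in enumerate(SUGGESTED_ACTIONS, start=1)]
    return "\n".join(lines)

alert = build_alert("xyz@gmail.com", 87.456)
print(alert)
```

Keeping the actions in a list makes it easy to reuse the same text for the email body and for any other front end.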


# Let's try with Streamlit


In [None]:
!pip install streamlit

# Setting up Gmail API to read/send emails

In [None]:
!pip install simplegmail

# Step 1- Receive e-mail

In [None]:
from simplegmail import Gmail

gmail= Gmail()




# Step 2- Process it with created function

In [None]:
fraud_detector(Recieved_msg=None)



# Step 3- Replying to sender

In [None]:
from email.message import EmailMessage
import ssl
import smtplib
from google.colab import userdata  # Colab secrets store that holds the Gmail credentials

# Log in and send the alert email (sender and recipient)
def send_email():
    print('Preparing to send e-mail...')
    email_sender = userdata.get('gmail_CE')
    email_password = userdata.get('gmailpass_CE')
    email_receiver = "galapito100@gmail.com"

    subject = "ALERT: This email has a high probability of being fraudulent."

    body= f"""
    Suggested Action:
        1. Do not click on any links or download any attachments in the email.
        2. Do not provide any personal information or financial details.
        3. Report this email to your IT/security department immediately.
        4. Delete the email from your inbox.
    """

    em = EmailMessage()
    em['From'] = email_sender
    em['To'] = email_receiver
    em['Subject'] = subject
    em.set_content(body)

    context = ssl.create_default_context()

    with smtplib.SMTP_SSL('smtp.gmail.com', 465, context=context) as smtp:
        smtp.login(email_sender, email_password)
        smtp.sendmail(email_sender, email_receiver, em.as_string())

    print('Email sent')

send_email()
