<a href="https://colab.research.google.com/github/Ankythaaa/mini-pythonprojects/blob/main/Spam_Mail_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

IMPORTING THE DEPENDENCIES

In [45]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

DATA COLLECTION AND PREPROCESSING

In [44]:
# LOADING THE DATA FROM THE CSV FILE INTO A PANDAS DATAFRAME
# the file is read with encoding='latin-1' instead of the default UTF-8
raw_mail_data = pd.read_csv('/content/spam.csv', encoding='latin-1')
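
Why the encoding argument matters can be sketched without the CSV file. The snippet below assumes the file contains a byte such as 0xCC, which would explain the "Ì_" visible in row 5568 of the printed data; the byte string itself is a hypothetical stand-in for one line of the file.

```python
# Hypothetical raw bytes for one message; 0xCC is 'Ì' in Latin-1.
raw_bytes = b"Will \xcc_ b going to esplanade fr home?"

# In UTF-8, 0xCC must be followed by a continuation byte, so decoding fails.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a character, so decoding always succeeds.
latin1_text = raw_bytes.decode("latin-1")

print(utf8_ok)
print(latin1_text)
```

Latin-1 never raises a decoding error, but it can also silently produce wrong characters if the file was written in another encoding, so the result is worth a quick visual check.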


In [46]:
# DISPLAYING THE DATASET
print(raw_mail_data)
# THE 'Unnamed' COLUMNS IN THE DATA SET BELOW CONTAIN MOSTLY MISSING (NaN) VALUES

        v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN   
1      ham                      Ok lar... Joking wif u oni...        NaN   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3      ham  U dun say so early hor... U c already then say...        NaN   
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some bitching but I acted like i'd...        NaN   
5571   ham                         Rofl. Its true to its name        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN        NaN  
1           NaN        NaN  


In [47]:
list(raw_mail_data.columns.values)

['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']

In [48]:
raw_mail_data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

In [49]:
list(raw_mail_data.columns.values)


['v1', 'v2']

In [50]:
raw_mail_data.rename(columns={"v1": "Category", "v2": "Message"}, inplace=True)

In [51]:
list(raw_mail_data.columns.values)

['Category', 'Message']
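
The drop and rename steps above can also be written without inplace=True, as one expression that returns a new DataFrame. This is only a sketch on a toy frame; the names raw and tidy and the two sample rows are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the raw layout: two real columns plus empty extras.
raw = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Ok lar...", "Free entry..."],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
    "Unnamed: 4": [None, None],
})

# Select the two needed columns, then rename them; raw itself is unchanged.
tidy = raw[["v1", "v2"]].rename(columns={"v1": "Category", "v2": "Message"})

print(list(tidy.columns))
```

Leaving the original frame untouched makes it easier to rerun a cell without hitting a "column not found" error on the second run.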

In [52]:
# REPLACE THE NULL VALUES WITH AN EMPTY STRING
mail_data = raw_mail_data.where((pd.notnull(raw_mail_data)),'')
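
The where(...) call keeps every non-null value and puts '' in place of the rest. A shorter way to express the same idea is fillna(''). The sketch below compares the two on a toy frame (the frame and its values are hypothetical).

```python
import pandas as pd

# Toy frame with one fully missing row.
toy = pd.DataFrame({"Category": ["ham", None], "Message": ["hi", None]})

# Keep values where they are not null, otherwise use an empty string.
via_where = toy.where(pd.notnull(toy), "")

# Same intent, stated directly.
via_fillna = toy.fillna("")

print(via_where.equals(via_fillna))
print(via_fillna.isnull().sum().sum())
```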

In [53]:

# PRINTING THE FIRST 5 ROWS OF THE DATAFRAME
mail_data.head(5)

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [54]:
# CHECKING THE NUMBER OF ROWS AND COLUMNS(SIZE) IN THE DATAFRAME
mail_data.shape

(5572, 2)

LABEL ENCODING: SPAM AS 0 AND HAM AS 1


In [55]:
# LABEL SPAM MAIL AS 0; HAM MAIL AS 1
# select the rows of mail_data whose Category is 'spam' and set their Category to 0
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0

# similarly, set the Category of the 'ham' rows to 1
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

spam = 0 and ham = 1

In [56]:
# here we separate the data into the message text (X) and the labels (Y)
# X is the input (text) and Y is the target (label)
X = mail_data['Message']

Y = mail_data['Category']


In [57]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [58]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object
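
Note that the labels print with dtype: object even though they look like integers, and scikit-learn classifiers may reject object-typed labels. One possible way to get integer labels is to map the strings and cast the result. This is a sketch on a toy Series, not the notebook's own method; the names labels and encoded are made up.

```python
import pandas as pd

# Toy labels in the same form as the original Category column.
labels = pd.Series(["ham", "ham", "spam", "ham"], name="Category")

# Map each string to its code (spam = 0, ham = 1), then cast to integers.
# Any label missing from the mapping would become NaN and make the cast fail,
# which surfaces unexpected categories early.
encoded = labels.map({"spam": 0, "ham": 1}).astype(int)

print(encoded.tolist())
print(encoded.dtype)
```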


SPLITTING INTO TRAINING DATA SET AND TESTING DATA SET


In [59]:
# test_size is the fraction of the data held out for testing, i.e. 20% of the 5572 rows go to the test set and the remaining 80% are used for training
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

In [60]:
# print the number of rows in the full, training, and test sets

print(X.shape)
# out of 5572 80% is used to train
print(X_train.shape)
# out of 5572 20% is used to test
print(X_test.shape)

(5572,)
(4457,)
(1115,)
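
The sizes above follow from test_size=0.2: the test count is rounded up, so 5572 rows give 1115 test rows and 4457 training rows. The sketch below reproduces that arithmetic and then runs the split on toy data. The stratify argument is an optional addition that is not used in this notebook; it keeps the spam/ham ratio similar in both parts. The toy messages and labels are made up.

```python
import math
from sklearn.model_selection import train_test_split

# Arithmetic behind the printed shapes.
n_rows = 5572
n_test = math.ceil(0.2 * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)

# Toy data: 20 messages, every fifth one is spam (0), the rest ham (1).
messages = [f"message {i}" for i in range(20)]
labels = [0 if i % 5 == 0 else 1 for i in range(20)]

# stratify=labels is an assumption added here, not part of the original call.
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, random_state=3, stratify=labels
)
print(len(X_tr), len(X_te))
```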


FEATURE EXTRACTION

USING TfidfVectorizer FOR CONVERTING TEXT DATA INTO NUMERICAL DATA
since the model does not understand text data,
we convert the text data into meaningful numerical values
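
To give a feel for where the numbers come from, here is a small worked sketch of the inverse document frequency part of TF-IDF, using the smoothed formula that TfidfVectorizer applies by default: idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The three toy documents and the helper smooth_idf are made up for illustration.

```python
import math

# Three toy documents (hypothetical, not taken from the data set).
docs = [
    "free entry win prize",
    "free call now",
    "ok see you later",
]
n_docs = len(docs)

def smooth_idf(term):
    """Smoothed inverse document frequency of a term over the toy documents."""
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

# "free" occurs in 2 of 3 documents, "prize" in only 1 of 3.
print(smooth_idf("free"))
print(smooth_idf("prize"))
```

The rarer word gets the larger weight. The final TF-IDF value multiplies this weight by how often the word occurs in a given message and, by default, rescales each message vector to unit length.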


In [66]:
# Transform the text data into feature vectors that can be used as input to the logistic regression model
# TfidfVectorizer looks at the words in the messages; spam mail, for example, tends to contain words like "free"
# each word gets a score per message: the score rises the more often the word occurs in that message (term frequency)
# and falls the more messages the word occurs in across the whole data set (inverse document frequency)
# the model uses these scores to decide whether a mail is spam or ham
# min_df is the minimum number of messages a word must appear in to be kept; min_df=1 keeps every word
# stop_words are very common words such as "this", "is", and "and" that carry little meaning
# stop_words='english' removes the words on scikit-learn's built-in English stop-word list
# lowercase=True converts all letters to lower case so that "Free" and "free" count as the same word


feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
# we don't need to convert Y_train and Y_test since they are already numbers, so we only convert X_train and X_test
# here we fit the vectorizer on the X_train messages and convert them to numbers as X_train_feature
X_train_feature = feature_extraction.fit_transform(X_train)
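
fit_transform expects the raw documents as its argument. A minimal sketch of the usual pattern on toy messages: fit the vectorizer on the training text only, then reuse it with transform on the test text so that both end up with the same columns. The messages and variable names below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy messages (hypothetical, not taken from the data set).
train_msgs = [
    "Free entry to win a prize",
    "Ok see you later",
    "Call now for a free prize",
]
test_msgs = ["Win a free call"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# Learn the vocabulary and idf weights from the training text, then encode it.
train_features = vectorizer.fit_transform(train_msgs)

# Encode the test text with the already learned vocabulary (no refitting).
test_features = vectorizer.transform(test_msgs)

print(train_features.shape, test_features.shape)
```

Calling fit_transform on the test text instead would build a different vocabulary, and the model trained on the training features could not use the result.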