### Import numpy and pandas libraries

In [5]:

import numpy as np
import pandas as pd

### Loading the dataset

In [9]:
# Specify the encoding as 'latin-1' so the file decodes without errors
df = pd.read_csv('spam.csv', encoding='latin-1')

# Now check the first few rows to confirm it loaded correctly
print(df.head())

     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  
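
The `encoding='latin-1'` argument presumably matters because the file contains bytes that are not valid UTF-8, which is the default encoding for `pd.read_csv`. A minimal sketch of that failure mode, using a made-up byte string rather than anything taken from `spam.csv`:

```python
# Hypothetical sample: 0xA3 is the pound sign in Latin-1,
# but on its own it is not a valid UTF-8 sequence.
raw = b"Win \xa3100 cash"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every possible byte to a character, so decoding never fails.
latin1_text = raw.decode("latin-1")

print(utf8_ok)
print(latin1_text)
```

Because Latin-1 accepts any byte, a successful load does not by itself prove that Latin-1 is the file's true encoding; it only guarantees that decoding will not raise.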


In [10]:
df.shape

(5572, 5)

Project outline:

* Data cleaning
* EDA
* Text preprocessing
* Model building
* Evaluation

#### 1. Data Cleaning

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


#### Dropping unnamed columns

In [12]:
# drop last 3 cols
df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'],inplace=True)
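
As a sketch of the same step without `inplace=True`, the example below drops every column whose name starts with `Unnamed` from a small made-up frame. The frame `toy` and the prefix-based selection are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
toy = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["see you soon", "free entry now"],
    "Unnamed: 2": [np.nan, "stray text"],
    "Unnamed: 3": [np.nan, np.nan],
})

# Collect every column whose name starts with "Unnamed".
unnamed_cols = [c for c in toy.columns if c.startswith("Unnamed")]

# drop() returns a new frame; the original is left unchanged.
cleaned = toy.drop(columns=unnamed_cols)

print(list(cleaned.columns))
```

Assigning the result to a new name keeps the raw frame available, which can help when a cell is re-run out of order.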

In [13]:
df.sample(5)

| | v1 | v2 |
|---|---|---|
| 958 | ham | My sort code is and acc no is . The bank is n... |
| 3035 | ham | ;-) ok. I feel like john lennon. |
| 3408 | ham | Whats that coming over the hill..... Is it a m... |
| 4579 | ham | Hi ....My engagement has been fixd on &lt;#&g... |
| 4285 | ham | Congrats. That's great. I wanted to tell you n... |


#### Renaming the columns

In [14]:
# renaming the cols
df.rename(columns={'v1':'target','v2':'text'},inplace=True)
df.sample(5)

| | target | text |
|---|---|---|
| 3267 | ham | Which is why i never wanted to tell you any of... |
| 4194 | spam | Double mins and txts 4 6months FREE Bluetooth ... |
| 492 | ham | Sorry,in meeting I'll call later |
| 5083 | ham | Aiya we discuss later lar... Pick Ì_ up at 4 i... |
| 2484 | ham | Only if you promise your getting out as SOON a... |


#### LabelEncoder
* `LabelEncoder` converts categorical text labels into integers. Most machine learning models operate on numbers rather than raw text labels, so the `ham`/`spam` strings in `target` are encoded before modeling. In scikit-learn it is intended for the target column, not for input features.
* `.fit(data)`: the encoder learns the unique categories in the data (e.g., "Red", "Green", and "Blue") and stores them in sorted order.
* `.transform(data)`: the encoder applies the learned mapping, replacing each label with the integer index of its category.
* `.fit_transform(data)`: performs both steps in one call. Because categories are sorted, `ham` becomes 0 and `spam` becomes 1 in this dataset.
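
The two steps can be sketched on the color example from the list above. The list `colors` is made up for illustration; the calls are the standard `LabelEncoder` methods.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels matching the "Red", "Green", "Blue" example.
colors = ["Red", "Green", "Blue", "Green"]

encoder = LabelEncoder()
encoder.fit(colors)                # learns the unique classes in sorted order
codes = encoder.transform(colors)  # maps each label to its class index

print(encoder.classes_.tolist())   # ['Blue', 'Green', 'Red']
print(codes.tolist())              # [2, 1, 0, 1]

# inverse_transform recovers the original labels from the integer codes.
print(encoder.inverse_transform(codes).tolist())
```

The integers reflect alphabetical position only, so they carry no meaningful order between categories.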

In [15]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [16]:
df['target'] = encoder.fit_transform(df['target'])

In [17]:
df.head()

| | target | text |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [19]:
# missing values
df.isnull().sum()

target    0
text      0
dtype: int64

In [20]:
# check for duplicate values
df.duplicated().sum()

np.int64(403)

In [21]:
# remove duplicates
df = df.drop_duplicates(keep='first')

In [22]:
df.duplicated().sum()

np.int64(0)

In [23]:
df.shape

(5169, 2)
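
The row count is consistent with the duplicate count: 5572 - 403 = 5169. A minimal sketch of the same check on a made-up frame follows; the frame `toy` and the optional `reset_index` call are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical frame in which rows 2 and 3 repeat rows 0 and 1.
toy = pd.DataFrame({
    "target": [0, 1, 0, 1],
    "text": ["ok lar", "free entry", "ok lar", "free entry"],
})

# duplicated() flags every repeat of an earlier row.
n_dupes = int(toy.duplicated().sum())

# keep="first" retains the first occurrence of each repeated row.
# reset_index(drop=True) is optional; it closes the gaps left in the index.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes)        # 2
print(deduped.shape)  # (2, 2)
```

Without `reset_index`, the surviving rows keep their original index labels, which is harmless for most operations but can surprise code that assumes a contiguous `RangeIndex`.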