In [47]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [48]:
df = pd.read_csv('spam.csv')
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [50]:
df.shape

(5572, 2)

In [51]:
df.isnull().sum()

Category    0
Message     0
dtype: int64

In [52]:
df.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


In [54]:
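# Encode the label numerically. Note the direction: spam -> 0, ham -> 1,
# so despite its name the 'spam' column is 1 for ham messages.
df['spam'] = df['Category'].apply(lambda x: 0 if x == 'spam' else 1)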
df

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",1
1,ham,Ok lar... Joking wif u oni...,1
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,0
3,ham,U dun say so early hor... U c already then say...,1
4,ham,"Nah I don't think he goes to usf, he lives aro...",1
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,0
5568,ham,Will ü b going to esplanade fr home?,1
5569,ham,"Pity, * was in mood for that. So...any other s...",1
5570,ham,The guy did some bitching but I acted like i'd...,1


**CountVectorizer** is a feature extractor provided by scikit-learn (sklearn) that converts a **collection of text documents** into a **matrix of token counts**. Each row is a document, each column is a token from the learned vocabulary, and each entry is the **number of times that token (word) appears** in the document.

In [55]:
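from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and returns a sparse count matrix
text_data = vectorizer.fit_transform(df['Message'])
# toarray() makes it dense for display only; most entries are zero
text_data.toarray()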

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### **Multinomial Naive Bayes:**

- **Suitability:** This variant is commonly used for text classification tasks where features represent word counts or term frequencies.

- **Application:** If you represent your messages as bag-of-words counts, Multinomial Naive Bayes is a natural choice. The model is defined for integer counts, but in practice it is also often used with TF-IDF (Term Frequency-Inverse Document Frequency) vectors.

### **Gaussian Naive Bayes:**

- **Suitability:** Gaussian Naive Bayes assumes that, within each class, features follow a Gaussian (normal) distribution. It is suitable for continuous data.

- **Application:** If your features are continuous and roughly normally distributed (perhaps statistical measures derived from the message content, such as length), Gaussian Naive Bayes might be appropriate. It is rarely a good fit for sparse word counts.

### **Bernoulli Naive Bayes:**

- **Suitability:** This variant assumes binary features, so it models whether a word is present or absent in the message (binary bag-of-words) rather than how often it occurs.

- **Application:** If you use a binary presence/absence representation, for example `CountVectorizer(binary=True)`, Bernoulli Naive Bayes may be a good choice. Unlike the multinomial variant, it also counts the absence of a word as evidence.

Source:
ChatGPT
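
As a rough illustration of how the three variants differ in the input they expect, the sketch below fits each one on a hypothetical six-message corpus. The messages, the string labels and the `query` text are assumptions made up for this example; they are not drawn from `spam.csv`, and the string labels differ from the 0/1 encoding used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy data, for illustration only
toy_messages = [
    "win free prize now",
    "free cash win",
    "claim your free prize",
    "see you at lunch",
    "call me after lunch",
    "are you home now",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
query = ["free prize"]

# MultinomialNB: raw token counts
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_messages)
multinomial_pred = MultinomialNB().fit(counts, toy_labels).predict(
    count_vec.transform(query)
)

# BernoulliNB: binary presence/absence features
binary_vec = CountVectorizer(binary=True)
presence = binary_vec.fit_transform(toy_messages)
bernoulli_pred = BernoulliNB().fit(presence, toy_labels).predict(
    binary_vec.transform(query)
)

# GaussianNB: needs a dense array and treats each count as a continuous value
gaussian_pred = GaussianNB().fit(counts.toarray(), toy_labels).predict(
    count_vec.transform(query).toarray()
)

print(multinomial_pred, bernoulli_pred, gaussian_pred)
```

On real data the variants can disagree, so it is worth comparing them with cross-validation rather than relying on a toy example like this one.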

### **sklearn pipeline:**

**A pipeline is an excellent choice when you want to streamline and simplify your machine learning workflow.**

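Instead of manually applying **vectorizing, scaling, or any other feature preprocessing steps before fitting your model**, you can include all these steps in a single pipeline. This makes your code **shorter and more readable** and ensures that the **same preprocessing steps, fitted only on the training data, are applied to both your training and testing data**. Inside cross-validation, the vectorizer is refit on each training fold, so no vocabulary leaks in from the held-out fold. Note that a pipeline transforms the features only; label encoding of the target is done separately.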

**Benefits of using a pipeline:**
1. **Simplicity and Readability:** Pipelines make your code more concise and easier to read.
   
2. **Consistency:** The same preprocessing steps are consistently applied to both training and testing data, reducing the risk of mistakes.

3. **Reproducibility:** Pipelines contribute to reproducibility by clearly defining the sequence of steps in your workflow.

4. **Efficiency:** Pipelines facilitate a more efficient workflow, allowing easy experimentation with different models or hyperparameters.

5. **Integration with Grid Search:** Pipelines seamlessly integrate with tools like `GridSearchCV` for hyperparameter tuning and model selection.

In summary, a pipeline keeps preprocessing and modeling in a single object. This reduces errors such as data leakage and makes the workflow easier to maintain and reproduce.

Source:
ChatGPT
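
The `GridSearchCV` integration mentioned in the list can be sketched as follows. Parameters of a pipeline step are addressed as `<step name>__<parameter>`. The step names, the toy messages and the grid values are all assumptions chosen for illustration; on the real dataset you would pass `df['Message']` and the label column instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
toy_messages = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "urgent call to claim your prize",
    "see you at lunch today",
    "call me after the meeting",
    "are you home tonight",
    "can you send me the report",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Keys use the '<step>__<parameter>' convention
param_grid = {
    "vectorizer__binary": [False, True],
    "classifier__alpha": [0.1, 1.0],
}

# Every candidate is cross-validated with the whole pipeline,
# so the vectorizer is refit on each training fold
search = GridSearchCV(toy_pipeline, param_grid, cv=2)
search.fit(toy_messages, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

With only eight messages the scores are not meaningful; the point is the mechanics of tuning a vectorizer and a classifier together.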

In [56]:
from sklearn.model_selection import KFold                     # splits the data into folds
from sklearn.model_selection import cross_val_score           # scores a model on each fold
from sklearn.feature_extraction.text import CountVectorizer   # text -> token counts
from sklearn.naive_bayes import MultinomialNB                 # classifier for count features
from sklearn.pipeline import Pipeline                         # chains preprocessing and model

In [69]:
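pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),   # converts raw text to token counts
    ('classifier', MultinomialNB())
])

x = df['Message']
y = df['spam']   # 0 = spam, 1 = ham (see the encoding cell above)
kf = KFold(n_splits=3, shuffle=True, random_state=42)

# Accuracy on each of the 3 folds; the vectorizer is refit on every training fold
cvs = cross_val_score(pipeline, x, y, cv=kf)
print(f"Mean CV accuracy: {cvs.mean() * 100:.2f}%")

# Refit on the full dataset before predicting on new messages
pipeline.fit(x, y)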

In [73]:
new_text = ['Hi, Have you reported to the HR about Analytics of Sales this month?']
prediction = pipeline.predict(new_text)
prediction   # with the encoding used in this notebook, 1 means ham and 0 means spam

array([1], dtype=int64)
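
Because the numeric labels are easy to misread, it can help to map a prediction back to a name and to look at the class probabilities. The sketch below does this on a hypothetical toy pipeline that follows the same 0 = spam, 1 = ham encoding. The messages, `label_names` and the `toy_*` names are assumptions for illustration; the same calls apply to the fitted `pipeline` above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data using the notebook's encoding: 0 = spam, 1 = ham
toy_messages = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "please send the sales report to hr",
    "meeting about analytics this month",
    "have you reported the numbers",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(toy_messages, toy_labels)

label_names = {0: "spam", 1: "ham"}  # mirrors the encoding used in this notebook
toy_text = ["have you sent the sales report"]

predicted = toy_pipeline.predict(toy_text)[0]
# Columns of predict_proba follow the order of classes_
probabilities = toy_pipeline.predict_proba(toy_text)[0]

print(label_names[predicted])
print(dict(zip(toy_pipeline.classes_, probabilities)))
```

Reading `classes_` rather than assuming the column order avoids mixing up the spam and ham probabilities.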